Extraction of Syntactically Similar Sentences from Huge Corpus for Language Research

Year of Publication : 2018
Authors : Sanjay Kumar, Sandhya Umrao

Corpus-based and statistical approaches exploit several heuristics to determine the summary-worthiness of sentences. They use the statistical occurrence of words, word pairs and noun phrases to calculate sentence weights and then extract the highest-scoring sentences. The purpose of this research is to build a tool for extracting syntactically similar sentences from a huge corpus for language research, and to discuss its design, use and implementation. The proposed tool is based on a logical approach to computational corpus linguistics, in which sentences of logic are used to express statements about texts and logical inference is used to manipulate these sentences in order to analyze the texts. The research argues that the functionalities needed in a corpus system can be implemented when they are based upon adequate means of representing, querying and reasoning. The proposed system implements hand coding, searching and parsing. Apart from being interesting from a practical point of view, the development of such a system raises intriguing philosophical and methodological questions: What is a corpus text? What is a corpus theory? What is the link between the truth of such a tool and its usefulness for natural language processing purposes? These and related questions are discussed in the research. The system exists as a prototype implementation, and the research contains numerous examples of this implementation in action.
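
The statistical scoring idea mentioned at the start of the abstract can be illustrated with a minimal sketch. It is not the authors' tool: it assumes that a sentence's weight is simply the average corpus frequency of its words, and the names split_sentences, tokenize, score_sentences and top_sentences are hypothetical helpers introduced only for this illustration. Word pairs and noun phrases, which the abstract also mentions, are left out.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase word tokens; punctuation and digits are ignored.
    return re.findall(r"[a-z']+", sentence.lower())


def score_sentences(text):
    # Weight of a sentence = mean frequency of its words over the whole text.
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for s in sentences:
        words = tokenize(s)
        weight = sum(freq[w] for w in words) / len(words) if words else 0.0
        scored.append((weight, s))
    return scored


def top_sentences(text, n=1):
    # Return the n highest-weighted sentences, best first.
    ranked = sorted(score_sentences(text), key=lambda pair: pair[0], reverse=True)
    return [s for _, s in ranked[:n]]


if __name__ == "__main__":
    sample = "Corpus tools parse corpus text. Birds sing."
    print(top_sentences(sample, n=1))
```

Averaging rather than summing the word frequencies keeps long sentences from winning on length alone; a real system would likely also filter stop words before counting.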


Keywords : Corpus Linguistics, Corpus tools, Grammar, Grammar development, Logic programming.


