Computational linguistics: A brief overview
Computational Linguistics • Computational linguistics might be considered a synonym for the automatic processing of natural language, since its main task is the construction of computer programs that process words and texts in natural language.
Purpose • Automatic hyphenation • Hyphenation is intended for the proper splitting of words in natural language texts. When a word occurring at the end of a line is too long to fit on that line within the accepted margins, a part of it is moved to the next line. The word is thus wrapped, i.e., split and partially transferred to the next line.
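The wrapping behavior described above can be sketched in Python. This is a minimal illustration, not a real hyphenation algorithm: the `SYLLABLES` table, the `split_to_fit` and `wrap` functions, and the sample words are all assumptions made for this example, and a practical system would derive break points from language-specific rules or patterns rather than a hand-written table.

```python
# Hypothetical table of allowed break points (word -> syllables).
# A real hyphenator would compute these from language-specific rules.
SYLLABLES = {
    "computational": ["com", "pu", "ta", "tion", "al"],
    "linguistics": ["lin", "guis", "tics"],
}


def split_to_fit(word, room):
    """Return (head, tail) so that head, including its hyphen, fits in
    `room` characters. Return (None, word) when no allowed split fits."""
    syllables = SYLLABLES.get(word)
    if not syllables:
        return None, word
    best = None
    for i in range(1, len(syllables)):
        head = "".join(syllables[:i])
        if len(head) + 1 <= room:  # +1 for the hyphen
            best = (head + "-", "".join(syllables[i:]))
    return best if best else (None, word)


def wrap(text, width):
    """Greedy line filling; a word that does not fit is split at an
    allowed break point when possible, otherwise moved to the next line."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        # Space left on the current line after a separating blank.
        room = width - len(current) - (1 if current else 0)
        head, tail = split_to_fit(word, room)
        if head is not None:
            lines.append(head if not current else current + " " + head)
            current = tail
        else:
            if current:
                lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


for line in wrap("a brief overview of computational linguistics", 20):
    print(line)
```

The sketch does not re-split a remainder that is itself longer than the line width; that case is left out to keep the example short.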
Spell checking • The objective of spell checking is the detection and correction of typographic and orthographic errors in a text at the level of individual word occurrences, each considered out of its context. • Nobody can write without any errors. Even people well acquainted with the rules of a language can, just by accident, press a wrong key on the keyboard (perhaps one adjacent to the correct key) or leave out a letter. Additionally, when typing, one sometimes does not properly synchronize the movements of the hands and fingers. All such errors are called typos, or typographic errors. On the other hand, some people do not know the correct spelling of some words, especially in a foreign language. Such errors are called spelling errors.
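A context-free spell checker of the kind described above can be sketched as a lexicon lookup plus a search for known words one edit away. The tiny `LEXICON` and the function names `edits1` and `suggest` are assumptions for illustration; the candidate generation follows the widely used single-edit approach (deletion, transposition, replacement, insertion) and is not taken from the source.

```python
import string

# Toy word list standing in for a real dictionary (an assumption).
LEXICON = {"letter", "keyboard", "language", "spelling"}


def edits1(word):
    """All strings that are one simple edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word):
    """Return [] for a known word, else the known words one edit away."""
    if word in LEXICON:
        return []
    return sorted(edits1(word) & LEXICON)


print(suggest("leter"))     # missing letter
print(suggest("keybaord"))  # two adjacent letters swapped
print(suggest("languagr"))  # wrong key pressed
```

Because each word is examined out of context, this approach cannot notice an error that turns one existing word into another.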
Grammar checking • Detecting and correcting grammatical errors by taking into account adjacent words in the sentence, or even the whole sentence, is a much more difficult task for computational linguists and software developers than just checking orthography. • Grammar errors are those violating, for example, the syntactic laws or the laws related to the structure of a sentence. In Spanish, one of these laws is the agreement between a noun and an adjective in gender and grammatical number. For example, in the combination *mujer viejos each word by itself does exist in Spanish, but together they form a syntactically ill-formed combination. Another example of syntactic agreement is the agreement between the noun in the role of subject and the main verb, in number and person (*tú tiene).
Sometimes, rather simple operations can give helpful results by detecting some very frequent errors. The following two classes of errors specific to the Spanish language can be mentioned here:
Grammar • Absence of agreement between an article and the succeeding noun, in number and gender, as in *la gatos. Such errors are easily detectable within a very narrow context, i.e., two adjacent words. For this task, it is necessary to resort to the grammatical categories of Spanish words. • Omission of the written accent in such nouns as *articulo, *genero, *termino. Such errors cannot be detected by a usual spell checker that takes words out of context, since they convert one existing word into another existing one, namely, a personal form of a verb. It is rather easy to define some properties of immediate contexts for nouns that never occur with the corresponding verbs, e.g., the presence of agreeing articles, adjectives, or pronouns.
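Both checks above operate on a window of two adjacent words, so they can be sketched together. The feature tables `ARTICLES` and `NOUNS`, the `ACCENT_FIX` mapping, and the function `check_pairs` are hypothetical and cover only the examples from the text; a real checker would draw grammatical categories from a full morphological dictionary.

```python
# Toy feature tables: word -> (gender, number). Assumptions for this sketch.
ARTICLES = {
    "el": ("m", "sg"), "la": ("f", "sg"),
    "los": ("m", "pl"), "las": ("f", "pl"),
}
NOUNS = {
    "gato": ("m", "sg"), "gatos": ("m", "pl"),
    "gata": ("f", "sg"), "gatas": ("f", "pl"),
}
# Unaccented verb forms that are probably nouns when an article precedes them.
ACCENT_FIX = {"articulo": "artículo", "genero": "género", "termino": "término"}


def check_pairs(tokens):
    """Scan adjacent word pairs and report (position, kind, detail)."""
    issues = []
    for i, (left, right) in enumerate(zip(tokens, tokens[1:])):
        if left not in ARTICLES:
            continue
        if right in NOUNS and ARTICLES[left] != NOUNS[right]:
            # The article and the noun disagree in gender or number.
            issues.append((i, "agreement", f"{left} {right}"))
        elif right in ACCENT_FIX:
            # A verb form directly after an article: likely a noun
            # written without its accent.
            issues.append((i, "accent", f"{left} {ACCENT_FIX[right]}"))
    return issues


print(check_pairs("la gatos".split()))
print(check_pairs("el articulo".split()))
print(check_pairs("los gatos".split()))
```

A strict feature comparison like this would wrongly flag regular exceptions such as el agua, where a feminine noun takes the article el, so a practical table needs entries for such cases.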
Style checking • Stylistic errors are those violating the laws governing the correct use of words and word combinations in a language, either in general or in a given literary genre. • In its tasks, this application is the nearest to normative grammars and manuals on stylistics in their printed, human-oriented form. Thus, style checkers play a didactic and prescriptive role for authors of texts.
The style checker should use a dictionary of words supplied with their usage marks, synonyms, information on proper use of prepositions, compatibility with other words, etc. It should also use automatic parsing, which can detect improper syntactic constructions.
References to words and word combinations • Synonyms, antonyms
Information retrieval • Information retrieval systems (IRS) are designed to search for relevant information in large documentary databases. This information can be of various kinds, and a query can be as simple as “Find all the documents containing the word writing”. Accordingly, various systems use different methods of search.
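The simplest query quoted above, finding every document that contains a given word, can be answered with an inverted index. The following is a minimal sketch; the sample documents and the names `build_index` and `search` are assumptions, and real systems add normalization, ranking, and far richer query types.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of all documents containing `word`."""
    return sorted(index.get(word.lower(), set()))


# Made-up documents for illustration only.
docs = {
    1: "Writing systems of the world",
    2: "Spell checking and grammar checking",
    3: "The art of writing clearly",
}
index = build_index(docs)
print(search(index, "writing"))
```

Matching exact word forms means that a query for writing misses documents containing only written or writes, which is one reason such systems benefit from morphological analysis.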
Topical summarization • In many cases, it is necessary to automatically determine what a given document is about. This information is used to classify documents by their main topics, to deliver documents on a specific subject to users over the Internet, to automatically index the documents in an IRS, to quickly orient people in a large set of documents, and for other purposes.
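One very simple way to guess what a document is about is to count how many of its words belong to a keyword list for each topic. The sketch below assumes hand-made keyword sets in `TOPIC_KEYWORDS` and a hypothetical function `main_topic`; it illustrates the idea only and is not the method used by any particular system.

```python
import re
from collections import Counter

# Hand-made keyword sets per topic (assumptions for this sketch).
TOPIC_KEYWORDS = {
    "linguistics": {"grammar", "syntax", "noun", "verb"},
    "computing": {"program", "computer", "software", "algorithm"},
}


def main_topic(text):
    """Return the topic whose keywords occur most often, or None."""
    scores = Counter()
    for token in re.findall(r"\w+", text.lower()):
        for topic, keywords in TOPIC_KEYWORDS.items():
            if token in keywords:
                scores[topic] += 1
    if not scores:
        return None
    return scores.most_common(1)[0][0]


print(main_topic("The verb agrees with the noun according to the grammar."))
print(main_topic("The software runs an algorithm on the computer."))
```

Counting exact keywords ignores synonyms and inflected forms, so accurate topic detection needs the deeper dictionaries and analyzers discussed in the summary.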
Summary • It has been shown that only very simple tasks like hyphenation or simple spell checking can be solved on a modest linguistic basis. All the other systems should employ relatively deep linguistic knowledge: dictionaries, morphologic and syntactic analyzers, and in some cases deep semantic knowledge and reasoning. What is more, nearly all of the discussed tasks, even spell checking, have to employ very deep analysis to be solved with an accuracy approaching 100%. It was also shown that most of the language processing tasks could be considered as special cases of the general task of language understanding, one of the ultimate goals of computational linguistics and artificial intelligence.
Compiled from http://web.archive.org/web/20071225181521/www.gelbukh.com/clbook/Computational-Linguistics.htm