Ch-1: Introduction (1.3 & 1.4 & 1.5) Prepared by Qaiser Abbas (07-0906)


1.3: Ambiguity of Language
• An NLP system must determine the structure of a text, e.g. answer the question “who did what to whom?”
• Conventional parsing systems answer this question only in a limited, purely syntactic way. For example, “Our company is training workers” has three different parses, represented in (1.11):
a. “is training” forms a verb group.
b. “is” is the main verb and “training” is an adjectival participle modifying “workers”.
c. “is” is the main verb and “training workers” is a gerund (present participle followed by a noun).

The last two parses, (b) and (c), are anomalous. As sentences get longer and grammars become more comprehensive, such ambiguities lead to a terrible multiplication of parses. Martin (1987) reports 455 parses for sentence (1.12): “List the sales of the products produced in 1973 with the products produced in 1972”.
• A practical NLP system must be good at making disambiguation decisions about word sense, word category, syntactic structure, and semantic scope.
• The goal is to maximize coverage while minimizing ambiguity, but the two aims conflict: maximizing coverage increases the number of undesired parses, and vice versa.
• AI approaches to parsing and disambiguation have shown that hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are brittle (Lakoff 1987).

Selectional restrictions are an example: the verb “swallow” requires an animate being as its subject and a physical object as its object. Such a restriction disallows common and simple extensions of the usage of “swallow”, as in (1.13):
a. I swallowed his story, hook, line, and sinker. (“swallowed” = believed)
b. The supernova swallowed the planet. (“swallowed” = engulfed)
• Statistical NLP approaches address these problems by automatically learning lexical and structural preferences from corpora, using the information that lies in the relationships between words.
• Statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data. Moreover, the parameters of Statistical NLP models can often be estimated automatically from corpora.
• Automatic learning reduces human effort and raises interesting scientific issues.

1.4: Dirty Hands
1.4.1: Lexical Resources
• Lexical resources are machine-readable text, dictionaries, thesauri, and the tools for processing them.
• Brown Corpus (1960s-70s): widely known, a million-word corpus of American English; one must pay to use it. It includes press reportage, fiction, scientific text, legal text, and many other genres.
• Lancaster-Oslo-Bergen (LOB) corpus: a British English replication of the Brown Corpus.
• Susanne Corpus: 130,000 words, freely available, a subset of the Brown Corpus, annotated with information on the syntactic structure of sentences.
• Penn Treebank: text from the Wall Street Journal, widely used, not free.
• Canadian Hansards: proceedings of the Canadian parliament, a bilingual corpus, not freely available. Such parallel text corpora are important for statistical machine translation and other cross-lingual NLP work.
• WordNet electronic dictionary: hierarchical; includes synsets (sets of words with identical or near-identical meaning) and meronymy (part-whole) relations between words; free and downloadable from the internet.
• Further details in Ch-4.

1.4.2: Word Counts
• Question: what are the most common words in the text? Table 1.1 lists the most common words in Mark Twain’s Tom Sawyer; they are mostly function words, e.g. determiners, prepositions, and complementizers.
• Some common words occur over 700 times and individually account for over 1% of the words in the text, e.g. 3332 × 100 / 71370 = 4.67% and 772 × 100 / 71370 = 1.08%.
• The high frequency of “Tom” shows that a corpus reflects the material from which it was constructed.
• Question: how many words are there in the text? The corpus contains 71370 word tokens (a very small corpus), less than half a MB of online text.
• It contains 8018 word types (different words), while a sample of newswire of the same size contains 11,000 word types.
• The ratio of tokens to types is 71370 / 8018 = 8.9, which is the average frequency with which each type is used. In other words, words in the corpus occur “on average” about 9 times each.
• Table 1.2 shows how many word types occur with a certain frequency. The vast majority of word types occur extremely infrequently: over 90% of word types occur 10 times or less (91 + 82 + 131 + … + 3993 = 7277 out of 8018 word types).
• Rare words nonetheless make up a considerable proportion of the text: 12% of the text consists of words that occur 3 times or less (3993 + 2584 + 1992 = 8569 tokens out of 71370, where 2584 and 1992 are the token counts of the types that occur twice and three times).
• At the other extreme, almost half (49.8%) of the word types occur only once in the corpus. These are known as hapax legomena (Greek for “read only once”).
• Overall, the most common 100 words account for over half (50.9%) of the word tokens in the text.
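
The counts discussed in 1.4.2 can be reproduced on any text with a few lines of Python. The sketch below is only illustrative: the function name `word_stats`, the simple regular-expression tokenizer, and the toy sentence are assumptions made for this example, not the procedure used to build Tables 1.1 and 1.2.

```python
import re
from collections import Counter


def word_stats(text):
    """Return token/type statistics for a text.

    Assumption: a token is a lowercased run of letters, optionally with
    one internal apostrophe (e.g. "don't"). Real tokenizers differ.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: word types that occur exactly once.
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "tokens": n_tokens,
        "types": n_types,
        "tokens_per_type": n_tokens / n_types if n_types else 0.0,
        "hapax_share_of_types": len(hapaxes) / n_types if n_types else 0.0,
        "most_common": counts.most_common(3),
    }


# Toy input, invented for illustration only.
sample = "the cat saw the dog and the dog saw the cat"
stats = word_stats(sample)
print(stats)

# The arithmetic quoted in the notes for Tom Sawyer.
print(round(71370 / 8018, 1))       # average tokens per type
print(round(3993 / 8018 * 100, 1))  # percent of types that occur once
```

Run on the full text of a novel, the same function would give the token count, type count and hapax share discussed above, although the exact figures depend on the tokenizer.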

1.4.3: Zipf’s Law
• The Principle of Least Effort: people will act so as to minimize their probable average rate of work. Zipf argued for this theory by uncovering certain empirical laws.
• Count how often each word type occurs in a large corpus, then list the words in order of their frequency. We can then explore the relationship between the frequency of a word, f, and its position in the list, known as its rank, r. The law states $f \propto 1/r$ (1.14), or in other words $f \cdot r = k$, where k is a constant.
• This equation says, e.g., that the 50th most common word should occur with three times the frequency of the 150th most common word. The idea was first introduced by Estoup (1916) but was widely publicized by Zipf.
• Zipf’s law holds approximately for Table 1.3, except for the three highest-frequency words, and the product $f \cdot r$ bulges for words of rank around 100.
• Such a curve describes the frequency distribution of words in human languages: there are a few very common words, a middling number of medium-frequency words, and many low-frequency words.
• The validity of Zipf’s law and the possibilities for deriving it were studied by Mandelbrot (1954), who found that Zipf’s law sometimes matches larger corpora more closely and gives the general shape of the curve, but is poor at reflecting the details.
• Figure 1.1 is a rank-frequency plot. Zipf’s law predicts that this graph (on log-log axes) should be a straight line with slope -1, but Mandelbrot showed that this is a bad fit, especially for low and high ranks: the slope -1 line is too low for most low ranks and too high for ranks greater than 10,000.
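
A rank-frequency table like Table 1.3 can be built as sketched below. This is a minimal illustration: the helper names `rank_frequency` and `zipf_predicted` and the toy token list are hypothetical, and the toy counts were chosen so that the product f · r comes out exactly constant, which real text only approximates.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return rows (rank, word, frequency, frequency * rank).

    Words are ranked by descending frequency; ties are broken
    alphabetically (an arbitrary choice made for this sketch).
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, freq * rank)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


def zipf_predicted(k, rank):
    """Frequency predicted by Zipf's law f = k / r for a constant k."""
    return k / rank


# Toy data, invented for illustration: frequencies 6, 3, 2.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)

# Under Zipf's law the 50th word is three times as frequent as the 150th.
# The constant k = 600 is an arbitrary example value.
print(zipf_predicted(600, 50) / zipf_predicted(600, 150))
```

On a real corpus, printing the last column for a sample of ranks shows how far f · r drifts from a constant, which is the deviation the notes describe for the highest-frequency words and for ranks around 100.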

• Mandelbrot derives the following relationship to achieve a closer fit: $f = P(r + \rho)^{-B}$, or $\log f = \log P - B \log(r + \rho)$, where P, B and ρ (rho) are parameters of the text that collectively measure the richness of the text’s use of words.
• The distribution is still hyperbolic, as with Zipf’s law, and for large values of r it closely approximates a straight line descending with slope -B when plotted on log-log axes, just as Zipf’s law does with slope -1.
• By setting the parameters appropriately, one can model a curve where the frequency of the most common words is lower.
• The graph in Figure 1.2 shows Mandelbrot’s formula, which is a better fit than Zipf’s law for the given corpus: the slight bulge in the upper left corner and the larger slope model the lowest and highest ranks better than Zipf’s law.
Other Laws
• Zipf proposed a number of other empirical laws relating to language. Two that are of particular concern to Statistical NLP are as follows.
• “The number of meanings of a word is correlated with its frequency”: $m \propto \sqrt{f}$, where m is the number of meanings and f is the frequency, or equivalently $m \propto 1/\sqrt{r}$.
• Zipf gives empirical support: in his study, words of frequency rank about 10,000 average about 2.1 meanings, words of rank about 5,000 average about 3 meanings, and words of rank about 2,000 average about 4.6 meanings.
• One can measure the number of lines or pages between each occurrence of a word in a text and then calculate the frequency F of different interval sizes I. For words of frequency at most 24 in a 260,000-word corpus, Zipf found $F \propto I^{-p}$, where p varied between 1 and 1.3 in his studies. In short, most of the time content words occur near another occurrence of the same word (details in Ch-7 and 15.3).
• Zipf’s other laws include an inverse relationship between the frequency of words and their length.
• The significance of power laws (read yourself).
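
To see how Mandelbrot’s formula behaves, it can help to evaluate it directly. The following sketch assumes base-10 logarithms and uses made-up parameter values (P = 10^5, B = 1.15, ρ = 100) chosen only for illustration; they are not claimed to be the values fitted in Figure 1.2. The function names are hypothetical.

```python
import math


def mandelbrot_frequency(rank, P, B, rho):
    """Frequency predicted by f = P * (r + rho) ** (-B)."""
    return P * (rank + rho) ** (-B)


def log_mandelbrot_frequency(rank, P, B, rho):
    """The same relationship in log form: log f = log P - B * log(r + rho)."""
    return math.log10(P) - B * math.log10(rank + rho)


def local_slope(r1, r2, P, B, rho):
    """Slope of log f against log r between two ranks r1 < r2."""
    dy = (log_mandelbrot_frequency(r2, P, B, rho)
          - log_mandelbrot_frequency(r1, P, B, rho))
    dx = math.log10(r2) - math.log10(r1)
    return dy / dx


# Hypothetical parameter values, for illustration only.
P, B, rho = 1e5, 1.15, 100

# With B = 1 and rho = 0 the formula reduces to Zipf's law f = P / r.
print(mandelbrot_frequency(50, 600, 1, 0))

# For small ranks the curve is nearly flat; for large ranks the slope
# on log-log axes approaches -B.
print(local_slope(1, 2, P, B, rho))
print(local_slope(1e6, 2e6, P, B, rho))
```

The nearly flat start of the curve is what lets the formula assign a lower frequency to the most common words than a straight Zipf line with the same P would.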

1.4.4: Collocations
• Collocations include compound words (disk drive), phrasal verbs (make up), and other stock phrases (bacon and eggs). They often have a specialized meaning or are idiomatic (characteristic of a natural style of speech and writing), but they need not be, e.g. international best practice.
• Any expression that people use frequently and in a fixed form is a candidate for a collocation. Collocations are important in areas of Statistical NLP such as machine translation (Ch-13) and information retrieval (Ch-15). Lexicographers are also interested in collocations, because dictionaries should show the frequent ways in which a word is used and because multiword units have an independent existence.
• The prevalence of collocations in actual usage de-emphasizes the Chomskyan focus on the creativity of language use and fits the Hallidayan idea that language is inseparable from its pragmatic context (words take on special meaning with respect to their use) and its social context.
• Collocations may be several words long or discontinuous (make [something] up). Common bigrams from the New York Times are given in Table 1.4.
• Problem: raw bigram counts are not normalized for the frequency of the individual words. The most common word sequences are pairs like “of the” and “in the”, which show only that a determiner commonly follows a preposition; they are not collocations. Solution: take the frequency of each word into account.
• Another approach is to filter the candidate collocations, removing those whose parts of speech (or syntactic categories) are rarely associated with collocations. The two most frequent patterns for two-word collocations are adjective-noun and noun-noun, as shown in Table 1.5.
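
The normalization problem can be seen on a toy example. The sketch below counts bigrams and then rescales each count by the frequencies of its two words, using a pointwise-mutual-information-style score. That particular score is an assumption made for this illustration (the notes only say to take word frequency into account), and the function names and the toy sentence are invented.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


def normalized_scores(tokens, min_count=2):
    """Score bigrams by log2(count(w1 w2) * N / (count(w1) * count(w2))).

    Assumption: this PMI-style score is one possible normalization,
    chosen for illustration. Bigrams seen fewer than min_count times
    are skipped because their scores are unreliable.
    """
    unigrams = Counter(tokens)
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigram_counts(tokens).items():
        if c >= min_count:
            scores[(w1, w2)] = math.log2(c * n / (unigrams[w1] * unigrams[w2]))
    return scores


# Toy text, invented for illustration only.
text = "of the disk drive of the house in the disk drive of the car in the end"
tokens = text.split()

print(bigram_counts(tokens).most_common(3))
scores = normalized_scores(tokens)
print(max(scores, key=scores.get))
```

In this toy text the most frequent raw bigram is a preposition followed by a determiner, while the normalized score ranks the compound noun highest. A part-of-speech filter as in Table 1.5 would need a tagger and is not shown here.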

1.4.5: Concordances
• A Key Word In Context (KWIC) concordancing program produces displays of data like those in Figure 1.3.
• There are 5 uses of “showed off”, all in double quotes, either because it was a neologism (new word) or slang at the time. All of these uses are intransitive (having a subject but no object), although some take prepositional phrase modifiers, e.g. with “in” and “with”.
• (6, 8, 12, 15) use the verb transitively (an object is required).
• (16) uses the verb ditransitively (with direct and indirect objects).
• In (13, 15) the object is an NP and a that-clause; (7) has a non-finite and (10) a finite question-form complement clause.
• (9, 14) have an NP object followed by a PP, but are quite idiomatic. In both cases the object noun is modified to make a more complex NP.
• We could systematize the patterns as in Figure 1.4. Collecting information like this about the patterns of occurrence of verbs is useful for dictionaries for foreign language learners and for guiding statistical parsers.
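
A KWIC display is simple to produce. The sketch below is a minimal version under stated assumptions: whitespace tokenization, a fixed context window measured in words, and hypothetical function names `kwic` and `format_kwic`. The example sentence is invented and is not taken from Figure 1.3.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Matching is case-insensitive; width is the number of context
    words kept on each side.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def format_kwic(rows):
    """Right-align the left contexts so the keywords line up in a column."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return "\n".join(f"{left:>{pad}}  {kw}  {right}" for left, kw, right in rows)


# Invented example sentence, for illustration only.
tokens = "he showed off his skill and she showed him the way".split()
print(format_kwic(kwic(tokens, "showed", width=2)))
```

Aligning the keyword in a single column is what makes it easy to scan the right-hand contexts for patterns such as objects, prepositional phrases and complement clauses.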

1.5: Further Readings
• References from the textbook.
Questions, discussion and comments are welcome.
