Tutorial - I

2nd September 2005


  1. Tutorial - I 2nd September 2005

  2. Problem 1: N-grams
     • Let C be a natural language corpus consisting of N tokens and V types $w_1, w_2, \ldots, w_V$. Let $p_i$ be the unigram probability of $w_i$ estimated from C. Also, it is given that $\forall i, j:\ i < j \Rightarrow p_i \geq p_j$.
     • (a) Give an estimate for $p_i$ in terms of N, V, and i.
     • (b) An artificial corpus $C_1$ was generated stochastically on the basis of the unigram probabilities $p_i$. Estimate the bigram probabilities $p_{ij} = P(w_i w_j)$ for $C_1$ in terms of N, V, i and j. [Hint: Use the expression for $p_i$ derived above.]

  3. Problem 1: N-grams (contd.)
     • (c) Show that the bigram distribution of $C_1$ does not follow Zipf's law perfectly. For this, use the estimated expression for $p_{ij}$ derived in (b).
     • (d) It is known that natural languages exhibit a Zipfian distribution over n-grams for all n. Can you use this fact to show that the bigram characteristics of $C_1$ are different from those of C?
     • (e) Prove the generalization of (d), i.e. "for any finite n, a stochastically generated corpus $C_n$ based on the n-gram estimates of C has different (n+1)-gram characteristics from C". What can you infer from this about n-gram models for natural languages?

  4. Problem 2: Problematic AND! • Given below is a toy grammar G for English.

  5. Problem 2: Problematic AND! (contd.)
     • (a) Show that the sentence "John liked Mary and Mary liked John" is ambiguous for G. Point out the parse(s) that you think is/are semantically correct.
     • (b) The sentence "John said John and Mary liked John" has the same structure as that of (a). Is the semantically valid parse for (a) also meaningful for (b)? Why or why not?

  6. Problem 2: Problematic AND! (contd.)
     • (c) The ambiguity arises because "and" can connect noun phrases and verb phrases as well as clauses. Can you suggest a method to resolve this (at least partially) by
       • verb sub-categorization?
       • introducing new POS categories (not for verbs) and augmenting G accordingly? [Assume that POS tagging is a step before parsing and that the process is perfect.]

  7. Problem 3: Geo-Morph • Consider the following pairs of names of geographical locations and the corresponding terms for their dwellers. Let us call this system of morphology Geo-Morph.

  8. Problem 3: Geo-Morph (contd.)
     • (a) Classify Geo-Morph as a derivational/inflectional and linear/non-linear system of morphology.
     • (b) Identify the set of affixes. Classify the examples as regular and irregular cases. Classify the regular cases further by the affixes.
     • (c) Identify the different morphological paradigms. Can you classify the Geo-roots into these paradigms based on their graphemic/phonemic structure?
     • (d) Design rewrite rules to capture orthographic changes for these paradigms.

  9. Problem 3: Geo-Morph (contd.)
     • (e) Predict the dweller terms for the following Geo-roots based on the morphological system developed with the help of the paradigms and the rewrite rules (c-d). Which of them do you think are used in standard English?
       • Sweden
       • Oman
       • Libya
       • Vienna
       • Europe

  10. SOLUTIONS

  11. Solution 1(a): N-grams
      a) $\forall i, j:\ i < j \Rightarrow p_i \geq p_j$ implies that the $w_i$ are sorted in descending order of unigram probability, i.e. of frequency. In other words, the rank (according to frequency) of $w_i$ is i. According to Zipf's law, frequency $\times$ rank = constant, which gives the estimate $p_i \approx 1/(i \ln V)$.
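A sketch of how Zipf's law leads to an estimate for $p_i$ that agrees with the form $p_i p_j \approx 1/(ij \ln^2 V)$ used in Solution 1(b). The symbols $f_i$ (frequency of $w_i$), $k$ (the Zipf constant) and $H_V$ (the V-th harmonic number) are introduced here for illustration, and the step $H_V \approx \ln V$ is an approximation that ignores the Euler constant.

```latex
\[ f_i \cdot i = k \;\Rightarrow\; f_i = \frac{k}{i} \]
\[ \sum_{i=1}^{V} f_i = N \;\Rightarrow\; k \sum_{i=1}^{V} \frac{1}{i} = k\,H_V = N
   \;\Rightarrow\; k = \frac{N}{H_V} \approx \frac{N}{\ln V} \]
\[ p_i = \frac{f_i}{N} = \frac{1}{i\,H_V} \approx \frac{1}{i \ln V} \]
```

Under these assumptions N cancels, so the estimate depends only on V and i.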

  12. Solution 1(b): N-grams
      b) Since $C_1$ was generated stochastically based on the unigram probabilities only, any two consecutive tokens $t_s$ and $t_{s+1}$ in $C_1$ were generated independently of each other. In other words, the events $t_s = w_i$ and $t_{s+1} = w_j$ are independent. Therefore, $p_{ij} = P(t_s = w_i,\ t_{s+1} = w_j) = P(t_s = w_i)\,P(t_{s+1} = w_j) = p_i p_j \approx 1/(ij \ln^2 V)$.
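The independence argument can be checked numerically. The following is a minimal Python sketch, not part of the original tutorial: the function names and the vocabulary size `V = 10_000` are hypothetical, and the unigram probabilities are normalized exactly with the harmonic number rather than with the approximation ln V.

```python
import math

def zipf_unigram(V):
    """Exactly normalized Zipfian unigram probabilities p_1..p_V (p_i proportional to 1/i)."""
    harmonic = sum(1.0 / i for i in range(1, V + 1))
    return [1.0 / (i * harmonic) for i in range(1, V + 1)]

def bigram_prob(p, i, j):
    """P(w_i w_j) when tokens are drawn independently from p (i, j are 1-based ranks)."""
    return p[i - 1] * p[j - 1]

def bigram_prob_approx(i, j, V):
    """Closed-form estimate 1 / (i * j * ln^2 V) from Solution 1(b)."""
    return 1.0 / (i * j * math.log(V) ** 2)

V = 10_000  # hypothetical vocabulary size
p = zipf_unigram(V)
for i, j in [(1, 1), (1, 2), (2, 3)]:
    print(i, j, bigram_prob(p, i, j), bigram_prob_approx(i, j, V))
```

The exact and closed-form values differ by a constant factor $(\ln V / H_V)^2$, which reflects the approximation $H_V \approx \ln V$ and does not affect the dependence on i and j.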

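  13. Solution 1(c): N-grams
      c) If the bigram distribution of $C_1$ has to follow Zipf's law, then bigram-probability $\times$ bigram-rank = constant (say $k'$). We know that $p_{ij} \approx 1/(ij \ln^2 V)$. Therefore, the first few bigram probabilities in order of rank are $p_{1,1}, p_{1,2}, p_{2,1}, p_{3,1}, p_{1,3}, p_{4,1}, \ldots$, which gives $k' = p_{1,1} \times 1 = 1/\ln^2 V$. But then Zipf's law would require the bigram of rank r to have probability $k'/r$, whereas
      $p_{2,1} = 1/(2\ln^2 V) \neq 1/(3\ln^2 V)$ (rank 3),
      $p_{3,1} = 1/(3\ln^2 V) \neq 1/(4\ln^2 V)$ (rank 4),
      $p_{1,3} = 1/(3\ln^2 V) \neq 1/(5\ln^2 V)$ (rank 5).
      Thus, it does not follow Zipf's law (or even Mandelbrot's law).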
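The rank argument can be reproduced with a few lines of Python. This is an illustrative sketch with hypothetical function names. It drops the common factor $1/\ln^2 V$, which does not affect whether rank $\times$ probability is constant, and it uses a small hypothetical vocabulary of 50 types.

```python
def ranked_bigram_weights(V):
    """Unnormalized bigram weights 1/(i*j) for all V*V bigrams, in descending order."""
    weights = [1.0 / (i * j) for i in range(1, V + 1) for j in range(1, V + 1)]
    weights.sort(reverse=True)
    return weights

def zipf_products(weights, max_rank):
    """rank * weight for ranks 1..max_rank; all equal only if Zipf's law holds."""
    return [rank * w for rank, w in enumerate(weights[:max_rank], start=1)]

products = zipf_products(ranked_bigram_weights(50), 6)
print(products)
```

For the first six ranks the products are 1, 1, 3/2, 4/3, 5/3 and 3/2 rather than a single constant, which matches the mismatches listed in the solution.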

  14. Solution 1(d): N-grams
      d) It follows from (c) that the bigram distribution of $C_1$ does not follow Zipf's law, whereas that of C does. Therefore, the bigram characteristics of the two distributions must be different. We know that for $C_1$, $p_{ij} \approx 1/(ij \ln^2 V)$. However, just as in (a), we can estimate the bigram distribution of C from the Zipfian assumption. There are $V^2$ possible bigrams. Therefore, letting $b_r$ be the probability of the r-th most frequent bigram, we can assume that $b_r = 1/(2r \ln V)$. But this estimate may be quite erroneous. Why?
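The factor 2 in the estimate for the r-th bigram probability can be traced to the number of bigram types. The following is a sketch under the same approximation as in (a), harmonic number $\approx$ logarithm. The constant $k$ is a name introduced here for illustration, and every one of the $V^2$ bigrams is assumed to occur.

```latex
\[ b_r = \frac{k}{r}, \qquad \sum_{r=1}^{V^2} b_r = 1
   \;\Rightarrow\; k = \frac{1}{H_{V^2}} \approx \frac{1}{\ln V^2} = \frac{1}{2 \ln V} \]
\[ b_r \approx \frac{1}{2 r \ln V}, \qquad \text{e.g. } b_1 \approx \frac{1}{2 \ln V}
   \quad \text{versus} \quad p_{1,1} \approx \frac{1}{\ln^2 V} \text{ for } C_1 \]
```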

  15. Solution 1(e): N-grams
      e) Hint: Assume Zipf's law for n-grams. Estimate the (n+1)-gram probabilities from the n-grams (product of two n-gram probabilities). Now show that the (n+1)-grams do not follow Zipf's law.
      • Try to prove the following (more general) results:
        • Mandelbrot's law, a generalization of Zipf's law, says frequency $\times (\text{rank} + \rho)^{\alpha}$ = constant. Prove (c), (d) and (e) when the distribution follows Mandelbrot's law rather than Zipf's law.
        • For any finite-length corpus (i.e. when N is finite), we cannot have n-gram distributions that follow Mandelbrot's law perfectly.

  16. Solution 2(a): Problematic AND! PARSE 1

  17. Solution 2(a): Problematic AND! PARSE 2

  18. Solution 2(b): Problematic AND! PARSE 1

  19. Solution 2(b): Problematic AND! PARSE 2

  20. Solution 2(c): Problematic AND!
      • Verb sub-categorization: The verbs "liked" and "said" belong to subcategories 1 and 2 respectively, where
        • VP → V NP [for V in subcategory 1]
        • VP → V S [for V in subcategory 2]
      • POS category augmentation: Break CNJ into two categories, CNJP and CNJC, for phrasal and clausal conjunctions respectively. The grammar G is augmented as:

  21. Solution 2(c): Problematic AND! • The new G for English.

  22. Solution 2(c): Problematic AND! Parsing using the new grammar

  23. Solution 2(c): Problematic AND! Parsing using the new grammar

  24. Solution 2(b): Problematic AND! Cannot parse otherwise

  25. Solution (3ab): Geo-Morph • Derivational and Linear • Irregulars are shown in red, affixes: n, ese

  26. Solution (3cd): Geo-Morph
      • Based on the endings of the roots, we might try to classify them into 4 paradigms [C: consonants minus y, V: vowels plus y]:
        • CVa and [V/a]CC* take n,
        • Ca and aC take ese.
      • The rewrite rules:
        • n → ian / C^_$ (Egypt^n → Egyptian)
        • a → Φ / C_^ese (China^ese → Chinese, etc.)
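The two rewrite rules can be tried out with regular expressions. This is a minimal Python sketch, not the tutorial's own implementation: the function name `apply_geo_morph` and the regex encoding are assumptions, `^` is taken to mark the morpheme boundary and `$` the end of the word, and C excludes y as noted on the slide. Egypt and China come from the slide. India and Japan are added here only as illustrative roots to which neither rule applies.

```python
import re

# Consonants without y, following the slide's note on C (assumed encoding).
CONSONANT = "[b-df-hj-np-tv-xz]"

def apply_geo_morph(root, affix):
    """Attach affix ('n' or 'ese') to root and apply the two rewrite rules."""
    form = f"{root}^{affix}"  # '^' marks the morpheme boundary
    # n -> ian / C ^ _ $ : after a consonant-final root, 'n' surfaces as 'ian'
    form = re.sub(rf"({CONSONANT})\^n$", r"\1^ian", form, flags=re.IGNORECASE)
    # a -> (empty) / C _ ^ese : root-final 'a' after a consonant drops before 'ese'
    form = re.sub(rf"({CONSONANT})a\^ese$", r"\1^ese", form, flags=re.IGNORECASE)
    return form.replace("^", "")  # erase the boundary marker

for root, affix in [("Egypt", "n"), ("China", "ese"), ("India", "n"), ("Japan", "ese")]:
    print(root, "+", affix, "->", apply_geo_morph(root, affix))
```

Choosing which affix a root takes is the job of the paradigm classification, so the sketch expects the affix as an input instead of guessing it.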

  27. Solution (3e): Geo-Morph

  28. A Problem to Ponder • Try to design a complete set of morphological rules for English Geo-Morph • How many affixes, paradigms and exceptions do you expect? • Is it possible to classify the Geo-roots based solely on the graphemic/phonemic forms?
