
CPSC 503 Computational Linguistics



Presentation Transcript


  1. CPSC 503 Computational Linguistics Lecture 5 Giuseppe Carenini CPSC503 Winter 2008

  2. Today 22/9 • Finish spelling • n-grams • Model evaluation CPSC503 Winter 2008

  3. Spelling: the problem(s) — detection vs. correction • Non-word, isolated: funn -> funny, funnel... Correction: find the most likely correct word • Non-word, in context: trust funn / a lot of funn. Correction: find the most likely correct word ...in this context • Real-word, isolated: Detection: ?! • Real-word, in context: I want too go their. Detection: is it an impossible (or very unlikely) word in this context? Correction: find the most likely substitute word in this context CPSC503 Winter 2008

  4. Key Transition • Up to this point we’ve mostly been discussing words in isolation • Now we’re switching to sequences of words • And we’re going to worry about assigning probabilities to sequences of words CPSC503 Winter 2008

  5. Knowledge-Formalisms Map (including probabilistic formalisms) State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models) Morphology Logical formalisms (First-Order Logics) Syntax Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars) Semantics Pragmatics Discourse and Dialogue AI planners CPSC503 Winter 2008

  6. Only Spelling? • Assign a probability to a sentence • Part-of-speech tagging • Word-sense disambiguation • Probabilistic Parsing • Predict the next word • Speech recognition • Hand-writing recognition • Augmentative communication for the disabled Impossible to estimate! CPSC503 Winter 2008

  7. Impossible to estimate! Assuming 10^4 word types and an average sentence of 10 words -> sample space? • Google language model update (22 Sept. 2006): based on a corpus of 95,119,665,584 sentences. Most sentences will not appear, or appear only once. Key point in Stat. NLP: your corpus should be much bigger (>>) than your sample space! CPSC503 Winter 2008
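A quick back-of-envelope check of the slide's point, as a Python sketch; the vocabulary size, sentence length, and sentence count are the figures quoted on the slide, everything else is illustrative:

```python
# Why direct estimation of sentence probabilities is hopeless:
# compare the space of possible sentences with even a huge corpus.

word_types = 10 ** 4                 # vocabulary size assumed on the slide
sentence_length = 10                 # average sentence length assumed on the slide
google_sentences = 95_119_665_584    # Google LM update, 22 Sept. 2006

sample_space = word_types ** sentence_length          # possible 10-word sentences
print(f"sample space: 10^{len(str(sample_space)) - 1} sentences")   # 10^40
print(f"corpus / sample space: {google_sentences / sample_space:.1e}")  # vanishingly small
```

Even a Web-scale corpus covers only a vanishing fraction of the space, so most sentences get a count of zero or one.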

  8. Chain Rule: Decompose by applying the chain rule: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C). Applied to a word sequence from position 1 to n: P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 ... w_{n-1}). CPSC503 Winter 2008

  9. Example • Sequence “The big red dog barks” • P(The big red dog barks)= P(The) * P(big|the) * P(red|the big)* P(dog|the big red)* P(barks|the big red dog) Note - P(The) is better expressed as: P(The|<Beginning of sentence>) written as P(The|<S>) CPSC503 Winter 2008
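As a minimal illustration of the decomposition on this slide, a short Python sketch that prints the chain-rule factors for the example sentence (purely illustrative; no probabilities are estimated):

```python
# Print the chain-rule factors for "The big red dog barks":
# each word is conditioned on its entire preceding history.

sentence = ["<S>", "The", "big", "red", "dog", "barks"]

factors = []
for i in range(1, len(sentence)):
    word = sentence[i]
    history = " ".join(sentence[:i])
    factors.append(f"P({word} | {history})")

print("P(The big red dog barks) = " + " * ".join(factors))
# P(The | <S>) * P(big | <S> The) * P(red | <S> The big) * ...
```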

  10. Not a satisfying solution. Even for small n (e.g., 6) we would need a far too large corpus to estimate P(w_n | w_1 ... w_{n-1}). Markov Assumption: the entire prefix history isn't necessary. unigram: P(w_n) bigram: P(w_n | w_{n-1}) trigram: P(w_n | w_{n-2} w_{n-1}) CPSC503 Winter 2008

  11. Prob of a sentence: N-Grams. Chain-rule simplifications: unigram: P(w_1 ... w_n) ≈ ∏ P(w_k) bigram: P(w_1 ... w_n) ≈ ∏ P(w_k | w_{k-1}) trigram: P(w_1 ... w_n) ≈ ∏ P(w_k | w_{k-2} w_{k-1}) CPSC503 Winter 2008
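A hedged sketch of the three simplifications, assuming probability tables p_uni, p_bi, p_tri have already been estimated somewhere; the table layouts and function names below are placeholders, not part of the lecture material:

```python
import math

# Sentence log-probability under each Markov simplification of the chain rule.
# Assumed dict layouts: p_uni[w], p_bi[(prev, w)], p_tri[(prev2, prev1, w)].

def unigram_logprob(words, p_uni):
    return sum(math.log(p_uni[w]) for w in words)

def bigram_logprob(words, p_bi, start="<S>"):
    prev, total = start, 0.0
    for w in words:
        total += math.log(p_bi[(prev, w)])
        prev = w
    return total

def trigram_logprob(words, p_tri, start="<S>"):
    prev2, prev1, total = start, start, 0.0
    for w in words:
        total += math.log(p_tri[(prev2, prev1, w)])
        prev2, prev1 = prev1, w
    return total
```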

  12. Bigram: <S> The big red dog barks P(The big red dog barks) = P(The|<S>) * P(big|The) * P(red|big) * P(dog|red) * P(barks|dog) Trigram? CPSC503 Winter 2008

  13. Estimates for N-Grams • bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) • ...in general: P(w_n | w_{n-N+1} ... w_{n-1}) = C(w_{n-N+1} ... w_{n-1} w_n) / C(w_{n-N+1} ... w_{n-1}) CPSC503 Winter 2008
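The bigram estimate above reads directly off raw counts; a minimal sketch (the helper name and the toy token list are made up for illustration):

```python
from collections import Counter

def bigram_mle(tokens):
    """Maximum-likelihood bigram estimates: C(prev, w) / C(prev)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    def p(word, prev):
        return bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
    return p

p = bigram_mle("<S> i want to eat i like chinese food".split())
print(p("want", "i"))     # C(i want) / C(i) = 1/2 = 0.5
```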

  14. Silly Corpus: "<S> The big red dog barks against the big pink dog" Estimates for Bigrams Word types vs. Word tokens CPSC503 Winter 2008
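The same counts can be checked by machine on the silly corpus; lowercasing so that "The" and "the" count as one type is my choice here, not the slide's:

```python
from collections import Counter

corpus = "<S> the big red dog barks against the big pink dog".split()

print("word tokens:", len(corpus))          # 11 (counting <S>)
print("word types: ", len(set(corpus)))     # 8

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
print("P(big | the) =", bigrams[("the", "big")] / unigrams["the"])   # 2/2 = 1.0
print("P(red | big) =", bigrams[("big", "red")] / unigrams["big"])   # 1/2 = 0.5
```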

  15. Berkeley ____________ Project (1994) Table: Counts Corpus: ~10,000 sentences, 1616 word types What domain? Dialog? Reviews? CPSC503 Winter 2008

  16. BERP Table: CPSC503 Winter 2008

  17. BERP Table Comparison Counts Prob. 1? CPSC503 Winter 2008

  18. Some Observations • What’s being captured here? • P(want|I) = .32 • P(to|want) = .65 • P(eat|to) = .26 • P(food|Chinese) = .56 • P(lunch|eat) = .055 CPSC503 Winter 2008

  19. Some Observations • Speech-based restaurant consultant! • Bigram probabilities such as P(I | I), P(want | I), P(I | food) show up in fragments like "I I I want", "I want I want to", "The food I want is" CPSC503 Winter 2008

  20. Generation • Choose N-Grams according to their probabilities and string them together • I want -> want to -> to eat -> eat lunch • I want -> want to -> to eat -> eat Chinese -> Chinese food CPSC503 Winter 2008
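A minimal sketch of this generation loop, with a tiny made-up bigram table (the probabilities are invented for illustration, not taken from BERP):

```python
import random

# Toy bigram table: P[current][next]; "</S>" ends the sentence.
P = {
    "<S>":     {"i": 1.0},
    "i":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"lunch": 0.5, "chinese": 0.5},
    "chinese": {"food": 1.0},
    "lunch":   {"</S>": 1.0},
    "food":    {"</S>": 1.0},
}

def generate(start="<S>", max_len=20):
    word, output = start, []
    for _ in range(max_len):
        nxt = random.choices(list(P[word]), weights=list(P[word].values()))[0]
        if nxt == "</S>":
            break
        output.append(nxt)
        word = nxt
    return " ".join(output)

print(generate())   # e.g. "i want to eat lunch" or "i want to eat chinese food"
```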

  21. Two problems with applying the n-gram estimates to the probability of a sentence: CPSC503 Winter 2008

  22. Problem (1) • We may need to multiply many very small numbers (underflows!) • Easy Solution: • Convert probabilities to logs and then do additions (the log of a product is the sum of the logs) • To get the real probability (if you need it) go back to the antilog. CPSC503 Winter 2008
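A short sketch of why the log-space trick matters; the 400-factor product is an arbitrary example:

```python
import math

probs = [0.001] * 400                 # 400 small factors

product = 1.0
for p in probs:
    product *= p                      # underflows to 0.0 in floating point

log_total = sum(math.log(p) for p in probs)   # additions instead of multiplications

print(product)        # 0.0  (underflow)
print(log_total)      # about -2763.1, still perfectly usable
# Only go back to the antilog at the very end, if the raw probability is needed.
```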

  23. Problem (2) • The probability matrix for n-grams is sparse • How can we assign a probability to a sequence where one of the component n-grams has a value of zero? • Solutions: • Add-one smoothing • Good-Turing • Back off and Deleted Interpolation CPSC503 Winter 2008

  24. Add-One • Add one to every count (so the zero counts become 1). • Rationale: if you had seen these "rare" events, chances are you would have seen them only once. • Unigram case: P_add1(w_i) = (C(w_i) + 1) / (N + V); the ratio between the smoothed and the original count is the discount. CPSC503 Winter 2008
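A minimal sketch of the add-one estimate for bigrams, assuming the usual form (C(prev, w) + 1) / (C(prev) + V) with V the vocabulary size; the toy tokens are made up:

```python
from collections import Counter

tokens = "<S> i want to eat i want chinese food".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                      # vocabulary size

def p_add_one(word, prev):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("want", "i"))   # seen bigram:   (2 + 1) / (2 + 7)
print(p_add_one("food", "i"))   # unseen bigram: (0 + 1) / (2 + 7), no longer zero
```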

  25. Add-One: Bigram Counts …… CPSC503 Winter 2008

  26. BERP: Original vs. Add-one smoothed Counts (table) CPSC503 Winter 2008

  27. Add-One Smoothed Problems • An excessive amount of probability mass is moved to the zero counts • When compared empirically with MLE or other smoothing techniques, it performs poorly • -> Not used in practice • Detailed discussion [Gale and Church 1994] CPSC503 Winter 2008

  28. Better smoothing techniques • Good-Turing Discounting (clear example in the textbook) More advanced: Backoff and Interpolation • To estimate an n-gram, use "lower order" n-grams • Backoff: rely on the lower-order n-grams when we have zero counts for a higher-order one; e.g., use unigrams to estimate zero-count bigrams • Interpolation: mix the probability estimates from all the n-gram estimators; e.g., a weighted interpolation of trigram, bigram and unigram counts CPSC503 Winter 2008
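A hedged sketch of the interpolation idea with fixed, made-up lambdas (real systems tune them on held-out data) and placeholder probability tables:

```python
def interpolated(word, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    """Weighted mix of trigram, bigram and unigram estimates; lambdas sum to 1."""
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((prev2, prev1, word), 0.0)
            + l2 * p_bi.get((prev1, word), 0.0)
            + l1 * p_uni.get(word, 0.0))

# If the trigram was never seen, the bigram and unigram still contribute:
p_uni = {"food": 0.05}
p_bi = {("chinese", "food"): 0.56}
p_tri = {}
print(interpolated("food", "eat", "chinese", p_uni, p_bi, p_tri))   # about 0.178
```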

  29. N-Grams Summary: final • Impossible to estimate! Sample space much bigger than any realistic corpus • Chain rule does not help • Markov assumption: • Unigram … sample space? • Bigram … sample space? • Trigram … sample space? • Sparse matrix: smoothing techniques • Look at practical issues: sec. 4.8 CPSC503 Winter 2008

  30. You still need a big corpus… • The biggest one is the Web! • Impractical to download every page from the Web to count the n-grams => • Rely on page counts (which are only approximations) • A page can contain an n-gram multiple times • Search engines round off their counts • Such "noise" is tolerable in practice CPSC503 Winter 2008

  31. Today 22/9 • Finish spelling • n-grams • Model evaluation CPSC503 Winter 2008

  32. Entropy • Def1. Measure of uncertainty • Def2. Measure of the information that we need to resolve an uncertain situation • Let p(x) = P(X = x), where x ∈ X. • H(p) = H(X) = - Σ_{x∈X} p(x) log2 p(x) • It is normally measured in bits. CPSC503 Winter 2008
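The definition translates directly into code; a minimal sketch with two made-up coins:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits; 0 * log 0 treated as 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

fair_coin   = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
print(entropy(fair_coin))     # 1.0 bit: maximal uncertainty
print(entropy(biased_coin))   # about 0.47 bits: less uncertainty to resolve
```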

  33. Model Evaluation: how different is our approximation q from the actual distribution p? • Relative Entropy (KL divergence): D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) ) CPSC503 Winter 2008
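A minimal sketch of the KL divergence as defined above, using log base 2 to match the entropy definition; both distributions are made up:

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "actual" distribution
q = {"a": 0.4, "b": 0.4,  "c": 0.2}    # our approximation of it

print(kl_divergence(p, q))   # about 0.07 bits: > 0 whenever q differs from p
print(kl_divergence(p, p))   # 0.0: no divergence from itself
```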

  34. Entropy rate: the per-word entropy of a sequence, (1/n) H(w_1 ... w_n). Entropy of a language (entropy of NL?): H(L) = lim_{n -> inf} (1/n) H(w_1 ... w_n). Assumptions: ergodic and stationary. Shannon-McMillan-Breiman: the entropy can then be computed by taking the average log probability of a looooong sample, H(L) = lim_{n -> inf} -(1/n) log p(w_1 ... w_n). CPSC503 Winter 2008

  35. Cross-Entropy, applied to language. Between the probability distribution P and another distribution Q (a model for P): H(P,Q) = lim_{n -> inf} -(1/n) log q(w_1 ... w_n), with H(P) <= H(P,Q). Between two models Q1 and Q2, the more accurate is the one that assigns higher probability to the data => lower cross-entropy => lower perplexity. CPSC503 Winter 2008

  36. Model Evaluation: In practice • A: split the corpus into a Training set and a Testing set • B: train the model Q on the Training set (counting, frequencies, smoothing) • C: apply the model Q to the Testing set and compute its cross-entropy / perplexity CPSC503 Winter 2008
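A minimal end-to-end sketch of this pipeline, using a toy corpus and an add-one smoothed unigram model as a stand-in for whatever model Q is being evaluated; all names and data below are made up:

```python
import math
from collections import Counter

corpus = ("i want to eat chinese food . i want lunch . "
          "the food i want is chinese .").split()

# A: split into training and testing sets
split = int(0.8 * len(corpus))
train, test = corpus[:split], corpus[split:]

# B: train the model (counting + add-one smoothing)
counts = Counter(train)
N, V = len(train), len(counts) + 1          # +1 reserves mass for unseen words

def prob(w):
    return (counts[w] + 1) / (N + V)

# C: apply the model to the test set: cross-entropy (bits/word) and perplexity
cross_entropy = -sum(math.log2(prob(w)) for w in test) / len(test)
perplexity = 2 ** cross_entropy
print(f"cross-entropy: {cross_entropy:.2f} bits/word, perplexity: {perplexity:.1f}")
```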

  37. Knowledge-Formalisms Map (including probabilistic formalisms) State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models) Morphology Logical formalisms (First-Order Logics) Syntax Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars) Semantics Pragmatics Discourse and Dialogue AI planners CPSC503 Winter 2008

  38. Next Time • Hidden Markov Models (HMM) • Part-of-Speech (POS) tagging CPSC503 Winter 2008
