Language Models & Smoothing

Presentation Transcript

  1. Language Models & Smoothing • Shallow Processing Techniques for NLP • Ling570 • October 19, 2011

  2. Announcements • Career exploration talk: Bill McNeill • Thursday (10/20): 2:30-3:30pm • Thomson 135 & Online (Treehouse URL) • Treehouse meeting: Friday 10/21: 11-12 • Thesis topic brainstorming • GP Meeting: Friday 10/21: 3:30-5pm • PCAR 291 & Online (…/clmagrad)

  3. Roadmap • Ngram language models • Constructing language models • Generative language models • Evaluation: • Training and Testing • Perplexity • Smoothing: • Laplace smoothing • Good-Turing smoothing • Interpolation & backoff

  4. Ngram Language Models • Independence assumptions moderate data needs • Approximate probability given all prior words • Assume finite history • Unigram: Probability of word in isolation • Bigram: Probability of word given 1 previous • Trigram: Probability of word given 2 previous • N-gram approximation: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$ • Bigram sequence: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

  5. Berkeley Restaurant Project Sentences • can you tell me about any good cantonese restaurants close by • mid priced thai food is what i’m looking for • tell me about chez panisse • can you give me a listing of the kinds of food that are available • i’m looking for a good place to eat breakfast • when is caffe venezia open during the day

  6. Bigram Counts • Out of 9222 sentences • E.g., “I want” occurred 827 times

  7. Bigram Probabilities • Divide bigram counts by prefix unigram counts to get probabilities.
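
The division described on this slide is the maximum-likelihood estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). A minimal Python sketch follows, assuming whitespace-tokenized sentences padded with <s> and </s>. The function name `bigram_mle` and the three-sentence toy corpus are illustrative and are not the Berkeley Restaurant Project data.

```python
from collections import Counter

def bigram_mle(sentences):
    """Return {(prev, word): P(word | prev)} estimated by relative frequency."""
    prefix_counts = Counter()   # C(prev): how often prev starts a bigram
    bigram_counts = Counter()   # C(prev, word)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            prefix_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    # Divide each bigram count by the count of its prefix word.
    return {
        (prev, word): count / prefix_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

# Toy corpus (hypothetical, for illustration only).
corpus = ["i want chinese food", "i want to eat", "tell me about chez panisse"]
probs = bigram_mle(corpus)
```

In this toy corpus "want" is followed once by "chinese" and once by "to", so each of those bigrams gets probability 0.5, and any bigram never seen in training is simply absent (probability 0).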

  8. Bigram Estimates of Sentence Probabilities • P(<s> i want english food </s>) = P(i|<s>) * P(want|i) * P(english|want) * P(food|english) * P(</s>|food) = .000031
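
The product above can be checked numerically. In the sketch below, only P(i|<s>) = .25 and P(english|want) = .0011 appear in this transcript; the other three values (.33, .5, .68) are assumptions taken from the textbook version of the Berkeley Restaurant Project example, since the probability table on the slide did not survive. The helper name `sentence_prob` is hypothetical.

```python
import math

bigram_p = {
    ("<s>", "i"): 0.25,            # stated in the slides
    ("i", "want"): 0.33,           # assumed (textbook table, not in transcript)
    ("want", "english"): 0.0011,   # stated in the slides
    ("english", "food"): 0.5,      # assumed (textbook table, not in transcript)
    ("food", "</s>"): 0.68,        # assumed (textbook table, not in transcript)
}

def sentence_prob(tokens, probs):
    """Multiply bigram probabilities along the sentence, summing logs."""
    log_p = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = probs.get((prev, word), 0.0)
        if p == 0.0:
            return 0.0             # an unseen bigram zeroes the whole product
        log_p += math.log(p)
    return math.exp(log_p)

p = sentence_prob("<s> i want english food </s>".split(), bigram_p)
```

With these values the product is about 3.09e-05, which rounds to the .000031 shown on the slide. Summing logs rather than multiplying raw probabilities is a common way to avoid underflow on longer sentences.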

  12. Kinds of Knowledge • What types of knowledge are captured by ngram models? • World knowledge: P(english|want) = .0011, P(chinese|want) = .0065 • Syntax: P(to|want) = .66, P(eat|to) = .28, P(food|to) = 0, P(want|spend) = 0 • Discourse: P(i|<s>) = .25

  13. Probabilistic Language Generation • Coin-flipping models • A sentence is generated by a randomized algorithm • The generator can be in one of several “states” • Flip coins to choose the next state • Flip other coins to decide which letter or word to output
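
The coin-flipping picture can be sketched as weighted random sampling. In a bigram model the state is just the previous word, so a single weighted draw picks both the next state and the word to output. The transition table `toy_model` and the function `generate` below are hypothetical, with made-up probabilities chosen only to make the example runnable.

```python
import random

def generate(transitions, rng, max_len=20):
    """Sample words from a bigram model until </s> is drawn or max_len is hit."""
    state, output = "<s>", []
    while len(output) < max_len:
        words, weights = zip(*transitions[state].items())
        # One weighted "coin flip": choose the next word given the current state.
        state = rng.choices(words, weights=weights, k=1)[0]
        if state == "</s>":
            break
        output.append(state)
    return output

# Hypothetical toy model: previous word -> {next word: probability}.
toy_model = {
    "<s>": {"i": 0.6, "tell": 0.4},
    "i": {"want": 1.0},
    "want": {"food": 0.5, "to": 0.5},
    "to": {"eat": 1.0},
    "eat": {"food": 0.7, "</s>": 0.3},
    "food": {"</s>": 1.0},
    "tell": {"me": 1.0},
    "me": {"</s>": 1.0},
}
sentence = generate(toy_model, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for a given seed.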

  16. Generated Language: Effects of N • 1. Zero-order approximation: • XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD • 2. First-order approximation: • OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL • 3. Second-order approximation: • ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

  18. Word Models: Effects of N • 1. First-order approximation: • REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE • 2. Second-order approximation: • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

  19. Shakespeare

  20. The Wall Street Journal is Not Shakespeare

  21. Evaluation

  23. Evaluation - General • Evaluation crucial for NLP systems • Required for most publishable results • Should be integrated early • Many factors: • Data • Metrics • Prior results • …

  27. Evaluation Guidelines • Evaluate your system • Use standard metrics • Use (standard) training/dev/test sets • Describing experiments: (Intrinsic vs Extrinsic) • Clearly lay out experimental setting • Compare to baseline and previous results • Perform error analysis • Show utility in real application (ideally)

  32. Data Organization • Training: • Training data: used to learn model parameters • Held-out data: used to tune additional parameters • Development (Dev) set: • Used to evaluate system during development • Avoid overfitting • Test data: Used for final, blind evaluation • Typical division of data: 80/10/10 • Tradeoffs • Cross-validation
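
An 80/10/10 division can be sketched as a shuffle followed by two cuts. The function name `split_corpus`, its parameters, and the placeholder data are assumptions made for illustration; the slides do not prescribe how the split is drawn.

```python
import random

def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a copy of the data and cut it into train/dev/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]        # whatever remains is the test set
    return train, dev, test

# Placeholder data (hypothetical).
data = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(data)
```

Held-out data for tuning additional parameters would be carved out of the training portion in the same way, and the test list should stay untouched until the final, blind evaluation.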

  35. Evaluating LMs • Extrinsic evaluation (aka in vivo) • Embed alternate models in system • See which improves overall application • MT, IR, … • Intrinsic evaluation: • Metric applied directly to model • Independent of larger application • Perplexity • Why not just extrinsic?

  36. Perplexity

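  43. Perplexity • Intuition: • A better model will have tighter fit to test data • Will yield higher probability on test data • Formally, $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$ • For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$ • Inversely related to probability of sequence • Higher probability → Lower perplexity • Can be viewed as average branching factor of model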
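
Bigram perplexity is usually computed in log space to avoid underflow. The sketch below assumes one common convention, in which N counts every predicted token including </s> but not the initial <s>. The function name `bigram_perplexity` and the toy probabilities are hypothetical.

```python
import math

def bigram_perplexity(tokens, probs):
    """PP(W) = P(w_1 .. w_N)^(-1/N), computed from summed log probabilities."""
    n = len(tokens) - 1            # number of predicted tokens (excludes <s>)
    log_p = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = probs.get((prev, word), 0.0)
        if p == 0.0:
            return math.inf        # a zero-probability bigram gives infinite perplexity
        log_p += math.log(p)
    return math.exp(-log_p / n)

# Toy model (hypothetical): every bigram in the test sentence has probability 0.5.
toy = {("<s>", "a"): 0.5, ("a", "b"): 0.5, ("b", "</s>"): 0.5}
pp = bigram_perplexity(["<s>", "a", "b", "</s>"], toy)
```

Every step here is a fair two-way choice, so the perplexity comes out at 2, matching the average-branching-factor reading. A single unseen bigram sends perplexity to infinity, which is the problem the smoothing methods on the roadmap address.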

  48. Perplexity Example • Alphabet: 0,1,…,9 • Equiprobable: P(X)=1/10 • $PP(W) = \left((1/10)^N\right)^{-1/N} = 10$ • If probability of 0 is higher, PP(W) will be lower
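
The digit example can be run directly. The sketch below compares a uniform unigram model with a skewed one, assuming the test string itself consists mostly of zeros. The skewed probabilities (0.91 for "0", 0.01 for each other digit) and the test string are made up for illustration.

```python
import math

def unigram_perplexity(sequence, probs):
    """PP(W) = P(w_1 .. w_N)^(-1/N) under an independent (unigram) model."""
    n = len(sequence)
    log_p = sum(math.log(probs[symbol]) for symbol in sequence)
    return math.exp(-log_p / n)

digits = "0123456789"
uniform = {d: 0.1 for d in digits}     # equiprobable alphabet, as on the slide
skewed = {d: 0.01 for d in digits}     # hypothetical skewed model
skewed["0"] = 0.91                     # probabilities still sum to 1

test_string = "0000300000"             # hypothetical test data, mostly zeros
pp_uniform = unigram_perplexity(test_string, uniform)
pp_skewed = unigram_perplexity(test_string, skewed)
```

The uniform model gives a perplexity of 10 on any digit string, while the skewed model scores well below 10 on this string because it assigns high probability to the symbol that dominates the test data.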

  50. Thinking about Perplexity • Given some vocabulary V with a uniform distribution • I.e. P(w) = 1/|V| • Under a unigram LM, the perplexity is • $PP(W) = \left((1/|V|)^N\right)^{-1/N} = |V|$