
**Language Models & Smoothing**
Shallow Processing Techniques for NLP
Ling570, October 19, 2011

**Announcements**
• Career exploration talk: Bill McNeill
  • Thursday (10/20): 2:30–3:30pm
  • Thomson 135 & Online (Treehouse URL)
• Treehouse meeting: Friday 10/21: 11–12
  • Thesis topic brainstorming
• GP meeting: Friday 10/21: 3:30–5pm
  • PCAR 291 & Online (…/clmagrad)

**Roadmap**
• Ngram language models
• Constructing language models
• Generative language models
• Evaluation:
  • Training and testing
  • Perplexity
• Smoothing:
  • Laplace smoothing
  • Good-Turing smoothing
  • Interpolation & backoff

**Ngram Language Models**
• Independence assumptions moderate data needs
• Approximate the probability of a word given all prior words by assuming a finite history
• Unigram: probability of a word in isolation
• Bigram: probability of a word given 1 previous word
• Trigram: probability of a word given 2 previous words
• N-gram approximation: P(w_n | w_1..w_{n-1}) ≈ P(w_n | w_{n-N+1}..w_{n-1})
• Bigram sequence: P(w_1..w_n) ≈ ∏ P(w_i | w_{i-1})

**Berkeley Restaurant Project Sentences**
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i'm looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i'm looking for a good place to eat breakfast
• when is caffe venezia open during the day

**Bigram Counts**
• Out of 9222 sentences
• E.g., "I want" occurred 827 times

**Bigram Probabilities**
• Divide bigram counts by prefix unigram counts to get probabilities: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})

**Bigram Estimates of Sentence Probabilities**
• P(<s> i want english food </s>) = P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food) = .000031

**Kinds of Knowledge**
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat|to) = .28
• P(food|to) = 0
• P(want|spend) = 0
• P(i|<s>) = .25
• What types of knowledge are captured by ngram models?
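The bigram estimates above can be sketched in a few lines of Python. This is a minimal illustration over a hypothetical three-sentence toy corpus, not the actual 9222-sentence Berkeley Restaurant data:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])            # prefix (history) counts
        bigrams.update(zip(words, words[1:]))  # bigram counts
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(probs, sent):
    """Multiply bigram probabilities across the padded sentence."""
    words = ["<s>"] + sent.split() + ["</s>"]
    p = 1.0
    for bg in zip(words, words[1:]):
        p *= probs.get(bg, 0.0)  # unseen bigram -> probability 0
    return p

# hypothetical toy corpus
corpus = ["i want english food", "i want chinese food", "i want to eat"]
probs = train_bigram_mle(corpus)
print(probs[("want", "english")])               # 1/3 in this toy corpus
print(sentence_prob(probs, "i want english food"))
```

Note that any bigram unseen in training zeroes out the whole sentence probability, which is exactly what the smoothing methods in the roadmap address.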
**Kinds of Knowledge** (cont.)
• World knowledge
• Syntax
• Discourse

**Probabilistic Language Generation**
• Coin-flipping models:
  • A sentence is generated by a randomized algorithm
  • The generator can be in one of several "states"
  • Flip coins to choose the next state
  • Flip other coins to decide which letter or word to output

**Generated Language: Effects of N**
• 1. Zero-order approximation:
  • XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
• 2. First-order approximation:
  • OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
• 3. Second-order approximation:
  • ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

**Word Models: Effects of N**
• 1. First-order approximation:
  • REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
• 2. Second-order approximation:
  • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

**Evaluation - General**
• Evaluation is crucial for NLP systems
• Required for most publishable results
• Should be integrated early
• Many factors: data, metrics, prior results, …

**Evaluation Guidelines**
• Evaluate your system
• Use standard metrics
• Use (standard) training/dev/test sets
• Describing experiments (intrinsic vs. extrinsic):
  • Clearly lay out the experimental setting
  • Compare to baseline and previous results
  • Perform error analysis
  • Show utility in a real application (ideally)

**Data Organization**
• Training:
  • Training data: used to learn model
parameters
  • Held-out data: used to tune additional parameters
• Development (dev) set:
  • Used to evaluate the system during development
  • Helps avoid overfitting
• Test data: used for final, blind evaluation
• Typical division of data: 80/10/10
• Tradeoffs; cross-validation

**Evaluating LMs**
• Extrinsic evaluation (aka in vivo):
  • Embed alternate models in a system; see which improves the overall application (MT, IR, …)
• Intrinsic evaluation:
  • Metric applied directly to the model, independent of any larger application
  • Perplexity
• Why not just extrinsic?

**Perplexity**
• Intuition: a better model fits the test data more tightly, and so yields higher probability on the test data
• Formally: PP(W) = P(w_1 w_2 .. w_N)^(-1/N)
• For bigrams: PP(W) = [ ∏_{i=1..N} P(w_i | w_{i-1}) ]^(-1/N)
• Inversely related to the probability of the sequence: higher probability → lower perplexity
• Can be viewed as the average branching factor of the model

**Perplexity Example**
• Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10
• PP(W) = ((1/10)^N)^(-1/N) = 10
• If the probability of 0 is higher, PP(W) will be lower

**Thinking about Perplexity**
• Given some vocabulary V with a uniform distribution, i.e. P(w) = 1/|V|
• Under a unigram LM, the perplexity is PP(W) = (∏_{i=1..N} 1/|V|)^(-1/N) = |V|
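The perplexity definition above is easy to check numerically. A minimal sketch (computing in log space to avoid underflow, an implementation detail not in the slides) reproduces the uniform-alphabet example:

```python
import math

def perplexity(log_probs):
    """PP(W) = P(w_1..w_N)^(-1/N), computed in log space for stability."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Uniform model over the 10-symbol alphabet 0..9: every symbol has P = 1/10,
# so any test sequence over it should come out with perplexity |V| = 10.
test_seq = [str(d) for d in range(10)] * 3
log_probs = [math.log(1 / 10) for _ in test_seq]
print(perplexity(log_probs))  # ≈ 10, the vocabulary size
```

Skewing probability mass toward one symbol (and testing on data drawn accordingly) drives the perplexity below 10, matching the "if the probability of 0 is higher" observation in the example slide.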
