I256 Applied Natural Language Processing Fall 2009


## Lecture 7
- Practical examples of graphical models
- Language models
- Sparse data and smoothing

Barbara Rosario

## Today
- Exercises
  - Design a graphical model
  - Learn parameters for naive Bayes
- Language models (n-grams)
- Sparse data and smoothing methods

## Exercise: design a GM
- Problem: topic and subtopic classification.
- Each document has one broad semantic topic (e.g. politics, sports).
- Each document contains several subtopics.
- Example: a sports document can contain a part describing a match, a part describing the location of the match, and a part describing the persons involved.

## Exercise: assumptions
- The goal is to classify the overall topic (T) of each document and all of its subtopics (STi).
- Assumptions:
  - The subtopics STi depend on the topic T of the document.
  - The subtopics STi are conditionally independent of each other (given T).
  - The words wi of the document depend on the subtopic STi and are conditionally independent of each other (given STi).
  - For simplicity, assume as many subtopic nodes as there are words.
- How would a GM encoding these assumptions look? What are the variables? The edges? The joint probability distribution?

## Exercise: variations
- What if the words of the document also depend directly on the topic T? (The subtopic *persons* may look quite different when the overall topic is sports rather than politics.)
- What if the subtopics are ordered, i.e. each STi depends on T and also on STi-1?
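Under the original assumptions (each STi depends only on T, each wi only on STi), the joint factorizes as P(T, ST1 … STn, w1 … wn) = P(T) ∏i P(STi | T) P(wi | STi). A minimal sketch of that factorization in code; the probability tables below are invented toy numbers for illustration, not values from the lecture:

```python
# Joint probability under the GM's conditional-independence assumptions:
# each ST_i depends only on T, and each w_i depends only on ST_i.
# All tables are invented toy numbers.
p_topic = {"sports": 0.5, "politics": 0.5}
p_subtopic = {  # P(ST | T)
    ("match", "sports"): 0.6, ("location", "sports"): 0.3, ("persons", "sports"): 0.1,
    ("location", "politics"): 0.4, ("persons", "politics"): 0.6,
}
p_word = {  # P(w | ST)
    ("goal", "match"): 0.5, ("stadium", "location"): 0.5, ("coach", "persons"): 0.5,
}

def joint(topic, subtopics, words):
    """P(T, ST_1..ST_n, w_1..w_n) = P(T) * prod_i P(ST_i | T) * P(w_i | ST_i)."""
    p = p_topic[topic]
    for st, w in zip(subtopics, words):
        p *= p_subtopic.get((st, topic), 0.0)  # ST_i depends only on T
        p *= p_word.get((w, st), 0.0)          # w_i depends only on ST_i
    return p

print(joint("sports", ["match", "location"], ["goal", "stadium"]))
```

Under the first variation (words also depending directly on T), `p_word` would instead be keyed by (w, ST, T); under the ordered-subtopics variation, `p_subtopic` would be keyed by (STi, STi-1, T).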
## Recall the general joint probability distribution
- P(X1, …, XN) = ∏i P(Xi | Par(Xi))

## Naive Bayes for topic classification
- GM: a topic node T with word children w1 … wn.
- P(T, w1, …, wn) = P(T) P(w1 | T) P(w2 | T) … P(wn | T) = P(T) ∏i P(wi | T)
- Estimation (training): given data, estimate P(T) and P(wi | T).
- Inference (testing): compute the conditional probability P(T | w1, w2, …, wn).

## Exercise: estimate P(T) and P(wi | T) for each wi, Tj
Topic = sport (15 words):
- D1: 2009 open season
- D2: against Maryland Sept
- D3: play six games
- D3: schedule games weekends
- D4: games games games

Topic = politics (19 words):
- D1: Obama hoping rally support
- D2: billion stimulus package
- D3: House Republicans tax
- D4: cuts spending GOP games
- D4: Republicans obama open
- D5: political season

MLE estimates (34 words in total):
- P(obama | T = politics) = P(w = obama, T = politics) / P(T = politics) = (2/34)/(19/34) = 2/19
- P(obama | T = sport) = (0/34)/(15/34) = 0
- P(season | T = politics) = (1/34)/(19/34) = 1/19
- P(season | T = sport) = (1/34)/(15/34) = 1/15
- P(republicans | T = politics) = c(republicans, politics)/19 = 2/19
- P(republicans | T = sport) = 0/15 = 0

## Exercise: inference
What is the topic of each of these new documents?
- republicans obama season
- games season open
- democrats kennedy house

## Recall: Bayes decision rule
- Decide Tj if P(Tj | c) > P(Tk | c) for all Tk ≠ Tj, where the context c is the words of the document.
- We assign the topic T' = argmax_Tj P(Tj | c).

## Exercise: Bayes classification
- We compute P(Tj | c) with Bayes rule: P(Tj | c) = P(c | Tj) P(Tj) / P(c).
- Because of the dependencies encoded in the GM, P(c | Tj) = ∏i P(wi | Tj).
- Since P(c) is the same for every topic, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher.
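The estimation and classification steps above can be reproduced mechanically. A sketch assuming the slide's toy corpus (lower-cased here for simplicity); `classify` is my own helper name:

```python
from collections import Counter
from fractions import Fraction

# Toy corpus from the slide as (topic, text) pairs: 15 sport + 19 politics = 34 words.
docs = [
    ("sport", "2009 open season"), ("sport", "against maryland sept"),
    ("sport", "play six games"), ("sport", "schedule games weekends"),
    ("sport", "games games games"),
    ("politics", "obama hoping rally support"), ("politics", "billion stimulus package"),
    ("politics", "house republicans tax"), ("politics", "cuts spending gop games"),
    ("politics", "republicans obama open"), ("politics", "political season"),
]

word_count = Counter()   # c(w, T)
topic_count = Counter()  # c(T), measured in words, as on the slide
for topic, text in docs:
    for w in text.split():
        word_count[(w, topic)] += 1
        topic_count[topic] += 1
total = sum(topic_count.values())  # 34

def p_word_given_topic(w, topic):
    """MLE: P(w | T) = c(w, T) / c(T)."""
    return Fraction(word_count[(w, topic)], topic_count[topic])

print(Fraction(topic_count["politics"], total))      # P(T = politics) = 19/34
print(p_word_given_topic("obama", "politics"))       # 2/19
print(p_word_given_topic("obama", "sport"))          # 0
print(p_word_given_topic("season", "sport"))         # 1/15

def classify(words):
    """Bayes decision rule: argmax_T P(T) * prod_i P(w_i | T)."""
    scores = {}
    for t in topic_count:
        p = Fraction(topic_count[t], total)
        for w in words:
            p *= p_word_given_topic(w, t)
        scores[t] = p
    return max(scores.items(), key=lambda kv: kv[1])

print(classify("republicans obama season".split())[0])  # politics
```

Exact `Fraction` arithmetic keeps the estimates in the same form as the hand calculations on the slide.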
## Exercise: Bayes classification, worked examples
New sentence: *republicans obama season*
- T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0
- T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0
- Choose T = politics.

New sentence: *democrats kennedy house*
- T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0
- "democrats" and "kennedy" are unseen words: this is data sparsity. How can we address it?

## Today
- Exercises: design of a GM; learn parameters
- Language models (n-grams)
- Sparse data and smoothing methods

## Language models
- A language model assigns scores to sentences.
- Probabilities should broadly indicate the likelihood of sentences: P(I saw a van) >> P(eyes awe of an).
- Not grammaticality: P(artichokes intimidate zippers) ≈ 0.
- In principle, "likely" depends on the domain, context, speaker, …

(Adapted from Dan Klein's CS 288 slides.)

## Language models: applications
- Related task: predicting the next word.
- Useful for:
  - Spelling correction ("I need to *notified* the bank")
  - Machine translation
  - Speech recognition
  - OCR (optical character recognition)
  - Handwriting recognition
  - Augmentative communication: computer systems that help disabled users communicate, for example by letting them choose words with hand movements

## Language models: chain rule
- Sentence: w1, w2, …, wn.
- Break the sentence probability down with the chain rule (no loss of generality): P(w1, …, wn) = ∏i P(wi | w1, …, wi-1)
- Too many histories!

## Markov assumption: the n-gram solution
- Markov assumption: only the prior local context, the last "few" words, affects the next word.
- N-gram models assume each word depends only on a short linear history: use the previous n-1 words to predict the next one.
  - Unigrams (n = 1): P(wi)
  - Bigrams (n = 2): P(wi | wi-1)
  - Trigrams (n = 3): P(wi | wi-2, wi-1)

(From Dan Klein's CS 288 slides.)

## Choice of n
- In principle we would like n to be large:
  - green
  - large green
  - the large green
  - swallowed the large green
- "swallowed" should influence the choice of the next word ("mountain" is unlikely, "pea" more likely):
  - The crocodile swallowed the large green …
  - Mary swallowed the large green …
- And so on.

## Discrimination vs. reliability
- Longer histories (larger n) should allow better predictions (better discrimination).
- But it is much harder to get reliable statistics, because the number of parameters to estimate becomes too large.
- The larger n is, the more parameters there are to estimate, and the more data is needed for statistically reliable estimation.

## Number of parameters
Let N be the vocabulary size.
- Unigrams: for each wi, estimate P(wi): N parameters.
- Bigrams: for each wi, wj, estimate P(wi | wj) = c(wj, wi)/c(wj): N × N parameters.
- Trigrams: for each wi, wj, wk, estimate P(wi | wj, wk): N × N × N parameters.

## N-grams and parameters
- Assume a vocabulary of 20,000 words. The number of parameters grows as: 20,000 (unigram), 20,000² = 4 × 10^8 (bigram), 20,000³ = 8 × 10^12 (trigram).

## Sparsity
- Zipf's law: most words are rare.
- This makes frequency-based approaches to language hard.
- New words appear all the time; new bigrams appear even more often; trigrams and beyond are worse still!
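A sketch of these relative-frequency bigram estimates on a toy corpus (the corpus and the `<s>` sentence-start marker are my own illustrative choices):

```python
from collections import Counter

# Bigram estimation by relative frequency: P(w | prev) = c(prev, w) / c(prev).
corpus = ["I saw a van", "I saw a man", "a man saw me"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split()  # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    """MLE bigram estimate; 0 for unseen bigrams (the sparsity problem)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("saw", "i"))   # c(i, saw)/c(i) = 2/2 = 1.0
print(p_bigram("van", "a"))   # c(a, van)/c(a) = 1/3
print(p_bigram("me", "i"))    # unseen bigram: 0.0

# Parameter growth with vocabulary size N = 20,000:
N = 20_000
print(N, N**2, N**3)  # unigram, bigram, trigram parameter counts
```

Even on this tiny corpus a plausible bigram ("i me", as in "saw me" after "I") gets probability zero, which is exactly the sparsity problem the slides describe.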
- These relative frequency estimates, e.g. P(wi | wi-1) = c(wi-1, wi)/c(wi-1), are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus.

## Sparsity and zeros
- The larger the number of parameters, the more likely we are to get zero probabilities.
- Note also the product: a single zero for an unseen event propagates and gives probability zero to the whole sentence.

## Tackling data sparsity
- Discounting or smoothing methods change the probabilities to avoid zeros.
- Remember that probability distributions have to sum to 1.
- So we decrease the nonzero probabilities (seen events) and give the remaining probability mass to the zero-probability (unseen) events.

## Smoothing
- Put probability mass on unseen events:
  - Add-one / add-delta (uniform prior)
  - Add-one / add-delta (unigram prior)
  - Linear interpolation
  - …

(From Dan Klein's CS 288 slides.)

## Smoothing: combining estimators
- Make a linear combination of multiple probability estimates, weighting each contribution so that the result is again a probability function: linear interpolation, or mixture models.
- Back-off models are a special case of linear interpolation; the trigram version backs off from trigram to bigram to unigram estimates.

## Beyond n-gram LMs
- Discriminative models (n-grams are generative models).
- Grammar-based models:
  - Syntactic models: use tree models to capture long-distance syntactic effects.
  - Structural zeros: some n-grams are syntactically forbidden; keep their estimates at zero.
- Lexical models: word forms, unknown words.
- Semantics-based models: do statistics at the level of semantic classes (e.g. WordNet).
- More data (the Web).

## Summary
- Given a problem (topic and subtopic classification, language models), design a GM.
- Learn the parameters from data.
- But: data sparsity, so we need to smooth the parameters.
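Two of the smoothing schemes above in miniature: add-one (a uniform prior over the vocabulary) and linear interpolation of bigram and unigram MLEs. A hedged sketch: the corpus, the extra vocabulary word, and the weight lambda = 0.7 are invented for illustration (in practice interpolation weights are tuned on held-out data):

```python
from collections import Counter

corpus = ["I saw a van", "I saw a man"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab = set(unigrams) | {"pea"}  # pretend the vocabulary also has an unseen word
V = len(vocab)
total = sum(unigrams.values())

def p_add_one(w, prev):
    """Add-one: (c(prev, w) + 1) / (c(prev) + V); no event gets probability zero."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_interp(w, prev, lam=0.7):
    """Linear interpolation: lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w)."""
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[w] / total
    return lam * p_bi + (1 - lam) * p_uni

print(p_add_one("pea", "a"))  # unseen bigram, but nonzero now
print(p_interp("van", "a"))   # mixes bigram and unigram evidence
```

Note that add-one still defines a proper distribution: summing `p_add_one(w, "a")` over the whole vocabulary gives 1, exactly the "decrease the seen, redistribute to the unseen" idea above.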