I256 Applied Natural Language Processing Fall 2009


## Lecture 7
- Practical examples of graphical models
- Language models
- Sparse data and smoothing

Barbara Rosario

## Today
- Exercises
  - Design a graphical model
  - Learn parameters for naive Bayes
- Language models (n-grams)
- Sparse data and smoothing methods

## Exercise: design a GM
- Problem: topic and subtopic classification.
- Each document has one broad semantic topic (e.g. politics, sports).
- Each document contains several subtopics.
- Example: a sports document can contain a part describing a match, a part describing the location of the match, and a part describing the persons involved.

## Exercise: assumptions
- The goal is to classify the overall topic (T) of each document and all of its subtopics (STi).
- Assumptions:
  - The subtopics STi depend on the topic T of the document.
  - The subtopics STi are conditionally independent of each other (given T).
  - The words wi of the document depend on the subtopic STi and are conditionally independent of each other (given STi).
  - For simplicity, assume as many subtopic nodes as there are words.
- How would a GM encoding these assumptions look? What are the variables? The edges? The joint probability distribution?

## Exercise: variations
- What if the words of the document also depend directly on the topic T? (The subtopic *persons* may look quite different when the overall topic is sports rather than politics.)
- What if the subtopics are ordered, i.e. each STi depends on T and also on STi-1?
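Under the original assumptions (each STi depends only on T, each wi only on STi), the joint factorizes as P(T, ST1 … STn, w1 … wn) = P(T) ∏i P(STi | T) P(wi | STi). A minimal sketch of that factorization in code; the probability tables below are invented toy numbers for illustration, not values from the lecture:

```python
# Joint probability under the GM's conditional-independence assumptions:
# each ST_i depends only on T, and each w_i depends only on ST_i.
# All tables are invented toy numbers.
p_topic = {"sports": 0.5, "politics": 0.5}
p_subtopic = {  # P(ST | T)
    ("match", "sports"): 0.6, ("location", "sports"): 0.3, ("persons", "sports"): 0.1,
    ("location", "politics"): 0.4, ("persons", "politics"): 0.6,
}
p_word = {  # P(w | ST)
    ("goal", "match"): 0.5, ("stadium", "location"): 0.5, ("coach", "persons"): 0.5,
}

def joint(topic, subtopics, words):
    """P(T, ST_1..ST_n, w_1..w_n) = P(T) * prod_i P(ST_i | T) * P(w_i | ST_i)."""
    p = p_topic[topic]
    for st, w in zip(subtopics, words):
        p *= p_subtopic.get((st, topic), 0.0)  # ST_i depends only on T
        p *= p_word.get((w, st), 0.0)          # w_i depends only on ST_i
    return p

print(joint("sports", ["match", "location"], ["goal", "stadium"]))
```

Under the first variation (words also depending directly on T), `p_word` would instead be keyed by (w, ST, T); under the ordered-subtopics variation, `p_subtopic` would be keyed by (STi, STi-1, T).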
## Recall the general joint probability distribution
- P(X1, …, XN) = ∏i P(Xi | Par(Xi))

## Naive Bayes for topic classification
- GM: a topic node T with word children w1 … wn.
- P(T, w1, …, wn) = P(T) P(w1 | T) P(w2 | T) … P(wn | T) = P(T) ∏i P(wi | T)
- Estimation (training): given data, estimate P(T) and P(wi | T).
- Inference (testing): compute the conditional probability P(T | w1, w2, …, wn).

## Exercise: estimate P(T) and P(wi | T) for each wi, Tj
Topic = sport (15 words):
- D1: 2009 open season
- D2: against Maryland Sept
- D3: play six games
- D3: schedule games weekends
- D4: games games games

Topic = politics (19 words):
- D1: Obama hoping rally support
- D2: billion stimulus package
- D3: House Republicans tax
- D4: cuts spending GOP games
- D4: Republicans obama open
- D5: political season

MLE estimates (34 words in total):
- P(obama | T = politics) = P(w = obama, T = politics) / P(T = politics) = (2/34)/(19/34) = 2/19
- P(obama | T = sport) = (0/34)/(15/34) = 0
- P(season | T = politics) = (1/34)/(19/34) = 1/19
- P(season | T = sport) = (1/34)/(15/34) = 1/15
- P(republicans | T = politics) = c(republicans, politics)/19 = 2/19
- P(republicans | T = sport) = 0/15 = 0

## Exercise: inference
What is the topic of each of these new documents?
- republicans obama season
- games season open
- democrats kennedy house

## Recall: Bayes decision rule
- Decide Tj if P(Tj | c) > P(Tk | c) for all Tk ≠ Tj, where the context c is the words of the document.
- We assign the topic T' = argmax_Tj P(Tj | c).

## Exercise: Bayes classification
- We compute P(Tj | c) with Bayes rule: P(Tj | c) = P(c | Tj) P(Tj) / P(c).
- Because of the dependencies encoded in the GM, P(c | Tj) = ∏i P(wi | Tj).
- Since P(c) is the same for every topic, for each Tj we calculate P(Tj) ∏i P(wi | Tj) and see which one is higher.
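The estimation and classification steps above can be reproduced mechanically. A sketch assuming the slide's toy corpus (lower-cased here for simplicity); `classify` is my own helper name:

```python
from collections import Counter
from fractions import Fraction

# Toy corpus from the slide as (topic, text) pairs: 15 sport + 19 politics = 34 words.
docs = [
    ("sport", "2009 open season"), ("sport", "against maryland sept"),
    ("sport", "play six games"), ("sport", "schedule games weekends"),
    ("sport", "games games games"),
    ("politics", "obama hoping rally support"), ("politics", "billion stimulus package"),
    ("politics", "house republicans tax"), ("politics", "cuts spending gop games"),
    ("politics", "republicans obama open"), ("politics", "political season"),
]

word_count = Counter()   # c(w, T)
topic_count = Counter()  # c(T), measured in words, as on the slide
for topic, text in docs:
    for w in text.split():
        word_count[(w, topic)] += 1
        topic_count[topic] += 1
total = sum(topic_count.values())  # 34

def p_word_given_topic(w, topic):
    """MLE: P(w | T) = c(w, T) / c(T)."""
    return Fraction(word_count[(w, topic)], topic_count[topic])

print(Fraction(topic_count["politics"], total))      # P(T = politics) = 19/34
print(p_word_given_topic("obama", "politics"))       # 2/19
print(p_word_given_topic("obama", "sport"))          # 0
print(p_word_given_topic("season", "sport"))         # 1/15

def classify(words):
    """Bayes decision rule: argmax_T P(T) * prod_i P(w_i | T)."""
    scores = {}
    for t in topic_count:
        p = Fraction(topic_count[t], total)
        for w in words:
            p *= p_word_given_topic(w, t)
        scores[t] = p
    return max(scores.items(), key=lambda kv: kv[1])

print(classify("republicans obama season".split())[0])  # politics
```

Exact `Fraction` arithmetic keeps the estimates in the same form as the hand calculations on the slide.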
## Exercise: Bayes classification, worked examples
New sentence: *republicans obama season*
- T = politics? P(politics | c) ∝ P(politics) P(republicans | politics) P(obama | politics) P(season | politics) = 19/34 · 2/19 · 2/19 · 1/19 > 0
- T = sport? P(sport | c) ∝ P(sport) P(republicans | sport) P(obama | sport) P(season | sport) = 15/34 · 0 · 0 · 1/15 = 0
- Choose T = politics.

New sentence: *democrats kennedy house*
- T = politics? P(politics | c) ∝ P(politics) P(democrats | politics) P(kennedy | politics) P(house | politics) = 19/34 · 0 · 0 · 1/19 = 0
- "democrats" and "kennedy" are unseen words: this is data sparsity. How can we address it?

## Today
- Exercises: design of a GM; learn parameters
- Language models (n-grams)
- Sparse data and smoothing methods

## Language models
- A language model assigns scores to sentences.
- Probabilities should broadly indicate the likelihood of sentences: P(I saw a van) >> P(eyes awe of an).
- Not grammaticality: P(artichokes intimidate zippers) ≈ 0.
- In principle, "likely" depends on the domain, context, speaker, …

(Adapted from Dan Klein's CS 288 slides.)

## Language models: applications
- Related task: predicting the next word.
- Useful for:
  - Spelling correction ("I need to *notified* the bank")
  - Machine translation
  - Speech recognition
  - OCR (optical character recognition)
  - Handwriting recognition
  - Augmentative communication: computer systems that help disabled users communicate, for example by letting them choose words with hand movements

## Language models: chain rule
- Sentence: w1, w2, …, wn.
- Break the sentence probability down with the chain rule (no loss of generality): P(w1, …, wn) = ∏i P(wi | w1, …, wi-1)
- Too many histories!

## Markov assumption: the n-gram solution
- Markov assumption: only the prior local context, the last "few" words, affects the next word.
- N-gram models assume each word depends only on a short linear history: use the previous n-1 words to predict the next one.
  - Unigrams (n = 1): P(wi)
  - Bigrams (n = 2): P(wi | wi-1)
  - Trigrams (n = 3): P(wi | wi-2, wi-1)

(From Dan Klein's CS 288 slides.)

## Choice of n
- In principle we would like n to be large:
  - green
  - large green
  - the large green
  - swallowed the large green
- "swallowed" should influence the choice of the next word ("mountain" is unlikely, "pea" more likely):
  - The crocodile swallowed the large green …
  - Mary swallowed the large green …
- And so on.

## Discrimination vs. reliability
- Longer histories (larger n) should allow better predictions (better discrimination).
- But it is much harder to get reliable statistics, because the number of parameters to estimate becomes too large.
- The larger n is, the more parameters there are to estimate, and the more data is needed for statistically reliable estimation.

## Number of parameters
Let N be the vocabulary size.
- Unigrams: for each wi, estimate P(wi): N parameters.
- Bigrams: for each wi, wj, estimate P(wi | wj) = c(wj, wi)/c(wj): N × N parameters.
- Trigrams: for each wi, wj, wk, estimate P(wi | wj, wk): N × N × N parameters.

## N-grams and parameters
- Assume a vocabulary of 20,000 words. The number of parameters grows as: 20,000 (unigram), 20,000² = 4 × 10^8 (bigram), 20,000³ = 8 × 10^12 (trigram).

## Sparsity
- Zipf's law: most words are rare.
- This makes frequency-based approaches to language hard.
- New words appear all the time; new bigrams appear even more often; trigrams and beyond are worse still!
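A sketch of these relative-frequency bigram estimates on a toy corpus (the corpus and the `<s>` sentence-start marker are my own illustrative choices):

```python
from collections import Counter

# Bigram estimation by relative frequency: P(w | prev) = c(prev, w) / c(prev).
corpus = ["I saw a van", "I saw a man", "a man saw me"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split()  # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    """MLE bigram estimate; 0 for unseen bigrams (the sparsity problem)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("saw", "i"))   # c(i, saw)/c(i) = 2/2 = 1.0
print(p_bigram("van", "a"))   # c(a, van)/c(a) = 1/3
print(p_bigram("me", "i"))    # unseen bigram: 0.0

# Parameter growth with vocabulary size N = 20,000:
N = 20_000
print(N, N**2, N**3)  # unigram, bigram, trigram parameter counts
```

Even on this tiny corpus a plausible bigram ("i me", as in "saw me" after "I") gets probability zero, which is exactly the sparsity problem the slides describe.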
- These relative frequency estimates, e.g. P(wi | wi-1) = c(wi-1, wi)/c(wi-1), are the MLE (maximum likelihood estimates): the choice of parameters that gives the highest probability to the training corpus.

## Sparsity and zeros
- The larger the number of parameters, the more likely we are to get zero probabilities.
- Note also the product: a single zero for an unseen event propagates and gives probability zero to the whole sentence.

## Tackling data sparsity
- Discounting or smoothing methods change the probabilities to avoid zeros.
- Remember that probability distributions have to sum to 1.
- So we decrease the nonzero probabilities (seen events) and give the remaining probability mass to the zero-probability (unseen) events.

## Smoothing
- Put probability mass on unseen events:
  - Add-one / add-delta (uniform prior)
  - Add-one / add-delta (unigram prior)
  - Linear interpolation
  - …

(From Dan Klein's CS 288 slides.)

## Smoothing: combining estimators
- Make a linear combination of multiple probability estimates, weighting each contribution so that the result is again a probability function: linear interpolation, or mixture models.
- Back-off models are a special case of linear interpolation; the trigram version backs off from trigram to bigram to unigram estimates.

## Beyond n-gram LMs
- Discriminative models (n-grams are generative models).
- Grammar-based models:
  - Syntactic models: use tree models to capture long-distance syntactic effects.
  - Structural zeros: some n-grams are syntactically forbidden; keep their estimates at zero.
- Lexical models: word forms, unknown words.
- Semantics-based models: do statistics at the level of semantic classes (e.g. WordNet).
- More data (the Web).

## Summary
- Given a problem (topic and subtopic classification, language models), design a GM.
- Learn the parameters from data.
- But: data sparsity, so we need to smooth the parameters.
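Two of the smoothing schemes above in miniature: add-one (a uniform prior over the vocabulary) and linear interpolation of bigram and unigram MLEs. A hedged sketch: the corpus, the extra vocabulary word, and the weight lambda = 0.7 are invented for illustration (in practice interpolation weights are tuned on held-out data):

```python
from collections import Counter

corpus = ["I saw a van", "I saw a man"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab = set(unigrams) | {"pea"}  # pretend the vocabulary also has an unseen word
V = len(vocab)
total = sum(unigrams.values())

def p_add_one(w, prev):
    """Add-one: (c(prev, w) + 1) / (c(prev) + V); no event gets probability zero."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_interp(w, prev, lam=0.7):
    """Linear interpolation: lam * P_MLE(w | prev) + (1 - lam) * P_MLE(w)."""
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[w] / total
    return lam * p_bi + (1 - lam) * p_uni

print(p_add_one("pea", "a"))  # unseen bigram, but nonzero now
print(p_interp("van", "a"))   # mixes bigram and unigram evidence
```

Note that add-one still defines a proper distribution: summing `p_add_one(w, "a")` over the whole vocabulary gives 1, exactly the "decrease the seen, redistribute to the unseen" idea above.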