
Presentation Transcript


  1. Language Modeling

  2. Roadmap (for next two classes) • Review LM evaluation metrics • Entropy • Perplexity • Smoothing • Good-Turing • Backoff and Interpolation • Absolute Discounting • Kneser-Ney

  3. Language Model Evaluation Metrics

  4. Applications

  5. Entropy and perplexity • Entropy H(p) = - Σx p(x) log2 p(x) measures information content, in bits • It is the expected message length under an ideal code • Use log base 2 if you want to measure in bits! • Cross entropy measures the ability of a trained model to compactly represent test data: the average negative log2-probability the model assigns to the test data • Perplexity = 2^(cross entropy) measures the average branching factor
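
A quick sketch in Python of the three quantities above (this example is mine, not from the slides; the coin distributions are made up for illustration):

    import math

    def entropy(p):
        # H(p) = -sum_x p(x) log2 p(x), in bits
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    def cross_entropy(test_counts, model):
        # average negative log2-probability the model assigns to the test data
        n = sum(test_counts.values())
        return -sum(c * math.log2(model[x]) for x, c in test_counts.items()) / n

    # a fair coin carries 1 bit of information per toss
    print(entropy({"H": 0.5, "T": 0.5}))                      # 1.0

    # a model that believes the coin is biased pays a penalty on fair data
    ce = cross_entropy({"H": 5, "T": 5}, {"H": 0.9, "T": 0.1})
    print(ce, 2 ** ce)                                        # cross entropy (bits), perplexity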

  9. Language model perplexity • Recipe: • Train a language model on training data • Get negative logprobs of the test data and compute the average • Exponentiate! • Perplexity correlates rather well with: • Speech recognition error rates • MT quality metrics • Perplexities for word-based models normally fall roughly between 50 and 1000 • Need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact
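
As a minimal sketch of the recipe (mine, not the slides'): given the probability a trained model assigns to each test token, perplexity is just the exponentiated average negative log-probability.

    import math

    def perplexity(token_probs):
        # token_probs: probability the trained LM assigned to each test token
        avg_neg_logprob = -sum(math.log2(p) for p in token_probs) / len(token_probs)
        return 2 ** avg_neg_logprob

    # hypothetical per-token probabilities from some trained model
    print(perplexity([0.1, 0.05, 0.2, 0.01, 0.1]))   # the average branching factor on this data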

  10. Parameter estimation • What is it?

  11. Parameter estimation • Model form is fixed (coin unigrams, word bigrams, …) • We have observations • H H H T T H T H H • Want to find the parameters • Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data • c(H) = 6; c(T) = 3 • P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 • MLE picks parameters best for training data… • …but these don’t generalize well to test data – zeros!
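
A minimal sketch of the MLE computation for the coin example above (relative frequencies of the observed outcomes), illustrating the zero problem on an unseen event:

    from collections import Counter

    observations = list("HHHTTHTHH")        # the nine tosses from the slide
    counts = Counter(observations)
    total = sum(counts.values())
    mle = {outcome: c / total for outcome, c in counts.items()}
    print(mle)                              # {'H': 0.666..., 'T': 0.333...}

    # the zero problem: anything never seen in training gets probability 0
    print(mle.get("X", 0.0))                # "X" stands in for any unseen event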

  14. Smoothing • Take mass from seen events, give it to unseen events • Robin Hood for probability models • MLE sits at one end of the spectrum; the uniform distribution at the other • Need to pick a happy medium, while still maintaining a valid probability distribution

  15. Smoothing techniques • Laplace • Good-Turing • Backoff • Mixtures • Interpolation • Kneser-Ney

  16. Laplace • From MLE: P(z|y) = c(yz) / c(y) • To Laplace: P(z|y) = (c(yz) + 1) / (c(y) + V), where V is the vocabulary size
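
A sketch of the add-one estimate using the formulas above; the toy corpus is made up and V is the vocabulary size observed in it:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()   # made-up toy corpus
    V = len(set(corpus))
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_mle(z, y):
        return bigram_counts[(y, z)] / unigram_counts[y]

    def p_laplace(z, y):
        # add 1 to every bigram count; add V to the history count to renormalize
        return (bigram_counts[(y, z)] + 1) / (unigram_counts[y] + V)

    print(p_mle("sat", "cat"), p_laplace("sat", "cat"))      # seen bigram
    print(p_mle("mat", "cat"), p_laplace("mat", "cat"))      # unseen bigram: 0 vs. small non-zero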

  17. Good-Turing Smoothing • New idea: Use counts of things you have seen to estimate those you haven’t

  20. Good-Turing: Josh Goodman Intuition • Imagine you are fishing • There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass • You have caught • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish • How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? • 3/18 • Assuming so, how likely is it that the next species is trout? • Must be less than 1/18 • (Slide adapted from Josh Goodman, Dan Jurafsky)

  21. Some more hypotheticals • How likely is it to find a new species of fish in each of these places?

  24. Good-Turing Smoothing • New idea: Use counts of things you have seen to estimate those you haven’t • Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams • Notation: Nc is the frequency of frequency c • Number of ngrams which appear c times • N0: # ngrams of count 0; N1: # of ngrams of count 1

  25. Good-Turing Smoothing • Estimate the probability of things which occur c times with the probability of things which occur c+1 times • Discounted counts: steal mass from seen cases to provide for the unseen • MLE: P(w) = c(w) / N • GT: c* = (c+1) Nc+1 / Nc, so PGT(w) = c*(w) / N, with N1 / N reserved for unseen events

  26. GT Fish Example
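
The original slide presents the worked numbers as a table; here is a sketch that recomputes the key quantities from the counts in the fishing example above, using c* = (c+1) Nc+1 / Nc and N1 / N for the unseen mass:

    from collections import Counter

    # the catch from the fishing example: 18 fish in total
    species_counts = {"carp": 10, "perch": 3, "whitefish": 2,
                      "trout": 1, "salmon": 1, "eel": 1}
    N = sum(species_counts.values())                 # 18
    Nc = Counter(species_counts.values())            # frequency of frequencies

    # probability mass reserved for unseen species: N1 / N
    print(Nc[1] / N)                                 # 3/18

    # discounted count for a species seen once (e.g. trout):
    # c* = (c + 1) * Nc+1 / Nc
    c_star_1 = (1 + 1) * Nc[2] / Nc[1]
    print(c_star_1 / N)                              # P_GT(trout), less than 1/18 as promised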

  29. Enough about the fish…how does this relate to language? • Name some linguistic situations where the number of new words would differ • Different languages: • Chinese has almost no morphology • Turkish has a lot of morphology • Lots of new words in Turkish! • Different domains: • Airplane maintenance manuals: controlled vocabulary • Random web posts: uncontrolled vocab

  30. Bigram Frequencies of Frequencies and GT Re-estimates

  31. Good-Turing Smoothing • From n-gram counts to conditional probability • Use c* from the GT estimate in place of the raw count: P(z|y) = c*(yz) / c(y)

  32. Additional Issues in Good-Turing • General approach: • Estimate of c* for Nc depends on Nc+1 • What if Nc+1 = 0? • More zero count problems • Not uncommon: e.g. the fish example has no 4s

  33. Modifications • Simple Good-Turing • Compute Nc bins, then smooth Nc to replace zeroes • Fit linear regression in log space • log(Nc) = a + b log(c) • What about large c’s? • Should be reliable • Assume c* = c if c is large, e.g. c > k (Katz: k = 5) • Typically combined with other approaches
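
A sketch of the Simple Good-Turing fit just described, assuming numpy is available; the Nc values are taken from the fish example purely for illustration:

    import numpy as np

    # observed frequency-of-frequency counts (fish example); gaps such as N4 = 0
    # are simply absent and get filled in by the fitted line
    Nc = {1: 3, 2: 1, 3: 1, 10: 1}

    c = np.array(sorted(Nc))
    n = np.array([Nc[x] for x in sorted(Nc)])

    # fit log(Nc) = a + b log(c) by least squares
    b, a = np.polyfit(np.log(c), np.log(n), 1)

    def smoothed_N(count):
        # smoothed estimate of Nc, defined even where the raw count of counts is zero
        return float(np.exp(a + b * np.log(count)))

    print(smoothed_N(4))   # stands in for the missing N4 when computing c*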

  36. Backoff and Interpolation • Another really useful source of knowledge • If we are estimating: • trigram p(z|x,y) • but count(xyz) is zero • Use info from: • Bigram p(z|y) • Or even: • Unigram p(z) • How to combine this trigram, bigram, unigram info in a valid fashion?

  38. Backoff vs. Interpolation • Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram • Interpolation: always mix all three

  39. Backoff • Bigram distribution: P(z|y) = c(yz) / c(y) • But c(yz) could be zero… • What if we fell back (or “backed off”) to the unigram distribution P(z) = c(z) / N? • That could also be zero (unseen words)…

  40. Backoff • What’s wrong with this distribution? • Doesn’t sum to one! • Need to steal mass…

  41. Backoff
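
The equation on the original slide is not in this transcript; as one illustration of how backoff can be made to sum to one, here is a sketch that steals a fixed absolute discount d from every seen bigram and hands the freed-up mass, scaled by a normalizer alpha, to the unigram distribution over unseen continuations (the corpus and the value of d are made up):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate the fish".split()   # made-up corpus
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    N = len(corpus)
    d = 0.5   # discount stolen from every seen bigram (an assumed value)

    def p_unigram(z):
        return unigrams[z] / N

    def p_backoff(z, y):
        seen = {w for (u, w) in bigrams if u == y}
        if z in seen:
            return (bigrams[(y, z)] - d) / unigrams[y]
        # alpha(y) is exactly the mass the discounts freed up, spread over the
        # unseen continuations in proportion to their unigram probability
        alpha = d * len(seen) / unigrams[y]
        unseen_mass = sum(p_unigram(w) for w in unigrams if w not in seen)
        return alpha * p_unigram(z) / unseen_mass

    # sanity check: the distribution over the whole vocabulary sums to one
    print(sum(p_backoff(z, "the") for z in unigrams))   # ~1.0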

  42. Mixtures • Given distributions p1 and p2 • Pick any number λ between 0 and 1 • λ p1 + (1 - λ) p2 is a distribution • (Laplace is a mixture!)

  43. Interpolation • Simple interpolation: Pinterp(z|x,y) = λ3 P(z|x,y) + λ2 P(z|y) + λ1 P(z), with the λs summing to 1 • Or, pick the interpolation weights based on the context: let each λ depend on the history (x,y) • Intuition: higher weight on more frequent n-grams
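
A sketch of simple interpolation under the reconstructed formula above; the lambda values and the component estimates are placeholders:

    def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # the lambdas must sum to one so the mixture is still a distribution
        l3, l2, l1 = lambdas
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # hypothetical component estimates for one word in one context:
    # the trigram count is zero, yet the interpolated estimate is not
    print(interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.05))   # 0.065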

  44. How to Set the Lambdas? • Use a held-out, or development, corpus • Choose lambdas which maximize the probability of the held-out data • I.e. fix the N-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set • Can use EM to do this search
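
A sketch of the held-out search (a simple grid search rather than EM, for brevity); p_tri, p_bi and p_uni stand for already-estimated component models and are assumptions of this sketch:

    import itertools, math

    def heldout_logprob(heldout, lambdas, p_tri, p_bi, p_uni):
        # heldout: list of (z, y, x) word-plus-context tuples from the dev corpus
        l3, l2, l1 = lambdas
        return sum(math.log(l3 * p_tri(z, y, x) + l2 * p_bi(z, y) + l1 * p_uni(z))
                   for (z, y, x) in heldout)

    def pick_lambdas(heldout, p_tri, p_bi, p_uni, step=0.1):
        best, best_lp = None, float("-inf")
        grid = [i * step for i in range(int(1 / step) + 1)]
        for l3, l2 in itertools.product(grid, grid):
            l1 = 1.0 - l3 - l2
            if l1 < 1e-9:
                continue   # keep some unigram weight (assumes p_uni(z) > 0 everywhere)
            lp = heldout_logprob(heldout, (l3, l2, l1), p_tri, p_bi, p_uni)
            if lp > best_lp:
                best, best_lp = (l3, l2, l1), lp
        return best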

  49. Kneser-Ney Smoothing • Most commonly used modern smoothing technique • Intuition: improving backoff • I can’t see without my reading…… • Compare P(Francisco|reading) vs P(glasses|reading) • P(Francisco|reading) backs off to P(Francisco) • P(glasses|reading) > 0 • Francisco has high unigram frequency, so the backed-off estimate can exceed P(glasses|reading) • However, Francisco appears in few contexts, glasses in many • Interpolate based on # of contexts • Words seen in more contexts are more likely to appear in new ones

  50. Kneser-Ney Smoothing: bigrams • Modeling diversity of contexts: Pcontinuation(z) = |{y : c(yz) > 0}| / |{(y, z') : c(yz') > 0}|, the number of distinct bigram types ending in z divided by the total number of distinct bigram types • So PKN(z|y) = max(c(yz) - d, 0) / c(y) + λ(y) Pcontinuation(z), where d is a fixed discount and λ(y) normalizes the mass freed up by discounting
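
A sketch of the interpolated Kneser-Ney bigram built from the formulas above (d = 0.75 is a conventional discount, not taken from the slides; the toy corpus is made up):

    from collections import Counter, defaultdict

    corpus = "i can not see without my reading glasses i like reading".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    histories = Counter(corpus[:-1])          # counts of each word used as a bigram history
    d = 0.75                                  # conventional discount, not from the slides

    # continuation counts: how many distinct contexts does each word follow?
    contexts_of = defaultdict(set)
    for (y, z) in bigrams:
        contexts_of[z].add(y)
    total_bigram_types = len(bigrams)

    def p_continuation(z):
        return len(contexts_of[z]) / total_bigram_types

    def p_kn(z, y):
        # discounted bigram estimate plus continuation-weighted backoff
        discounted = max(bigrams[(y, z)] - d, 0) / histories[y]
        lam = d * len({w for (u, w) in bigrams if u == y}) / histories[y]
        return discounted + lam * p_continuation(z)

    print(p_kn("glasses", "reading"), p_kn("i", "reading"))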
