
Part 5 Language Model


Presentation Transcript


  1. Part 5: Language Model. CSE717, SPRING 2008, CUBS, Univ at Buffalo

  2. Examples of Good & Bad Language Models. Excerpt from Herman, comic strips by Jim Unger (four panels, shown as images on the original slide)

  3. What’s a Language Model • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0

  4. What’s a language model for? • Speech recognition • Handwriting recognition • Spelling correction • Optical character recognition • Machine translation • (and anyone doing statistical modeling)

  5. The Equation (sketched below). The observation can be image features (handwriting recognition), acoustics (speech recognition), a word sequence in another language (MT), etc.
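
The equation itself is not reproduced in the transcript; the standard noisy-channel formulation implied by the surrounding text (observation O, word sequence W) is:

W^* = \arg\max_W P(W \mid O) = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)} = \arg\max_W P(O \mid W)\, P(W)

Here P(W) is the language model and P(O | W) is the observation model (acoustic, image, or translation model).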

  6. How Language Models work • Hard to compute P(“And nothing but the truth”) • Decompose probability P(“and nothing but the truth”) = P(“and”) × P(“nothing|and”) × P(“but|and nothing”) × P(“the|and nothing but”) × P(“truth|and nothing but the”)
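
In general form, this decomposition is the chain rule of probability (a standard restatement, not reproduced from the slide):

P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})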

  7. The Trigram Approximation Assume each word depends only on the previous two words P(“the|and nothing but”) ≈ P(“the|nothing but”) P(“truth|and nothing but the”) ≈ P(“truth|but the”)
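
Written generally, the trigram (second-order Markov) assumption takes the standard form assumed here:

P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-2}\, w_{i-1})
\quad\Longrightarrow\quad
P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1})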

  8. How to find probabilities? Count from real text Pr(“the” | “nothing but”) ≈ c(“nothing but the”) / c(“nothing but”)
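
A minimal Python sketch of this counting rule; the corpus, function name, and example tokens below are illustrative, not from the original deck:

from collections import Counter

def mle_trigram_prob(tokens, x, y, z):
    """Maximum-likelihood estimate Pr(z | x y) = c(x y z) / c(x y)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if bigram_counts[(x, y)] == 0:
        return 0.0  # history never seen: the MLE ratio is undefined
    return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

# Illustrative toy corpus
tokens = "and nothing but the truth and nothing but the truth".split()
print(mle_trigram_prob(tokens, "nothing", "but", "the"))  # 1.0 in this toy corpus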

  9. Evaluation • How can you tell a good language model from a bad one? • Run a speech recognizer (or your application of choice), calculate word error rate • Slow • Specific to your recognizer

  10. Perplexity An example Data: “the whole truth and nothing but the truth” Lexicon: L = {the, whole, truth, and, nothing, but} Model 1: unigram, Pr(L1) = … = Pr(L6) = 1/6 Model 2: unigram, Pr(“the”) = Pr(“truth”) = 1/4, Pr(“whole”) = Pr(“and”) = Pr(“nothing”) = Pr(“but”) = 1/8 (the two perplexities are computed in the sketch below)
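
A short Python sketch of this comparison, assuming the standard per-word perplexity definition PP = 2^(-(1/N) Σ log2 P(w_i)) (the slide's own formula is not reproduced in the transcript):

import math

data = "the whole truth and nothing but the truth".split()

# Unigram model 1: all six lexicon words equally likely
model1 = {w: 1 / 6 for w in ["the", "whole", "truth", "and", "nothing", "but"]}
# Unigram model 2: probabilities matched to the data's relative frequencies
model2 = {"the": 1 / 4, "truth": 1 / 4,
          "whole": 1 / 8, "and": 1 / 8, "nothing": 1 / 8, "but": 1 / 8}

def perplexity(model, words):
    # PP = 2 ** (-(1/N) * sum(log2 P(w)))
    log_prob = sum(math.log2(model[w]) for w in words)
    return 2 ** (-log_prob / len(words))

print(perplexity(model1, data))  # ~6.0
print(perplexity(model2, data))  # ~5.66: model 2 is closer to the data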

  11. Perplexity: Is lower better? • Remarkable fact: the “true” model for the data has the lowest possible perplexity • The lower the perplexity, the closer we are to the true model • Perplexity correlates well with the error rate of the recognition task • Correlates better when both models are trained on the same data • Doesn’t correlate well when the training data changes

  12. Smoothing • Terrible on test data: if C(xyz) = 0, the estimated probability is 0 • P(sing|nuts) = 0 leads to infinite perplexity!

  13. Smoothing: Add One • Add-one smoothing • Add-delta smoothing (standard forms are sketched below) • Simple add-one smoothing does not perform well: the probability of rarely seen events is over-estimated
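
The formulas on this slide are not reproduced in the transcript; a sketch using the standard forms, where |V| is the vocabulary size and delta = 1 gives add-one (Laplace) smoothing. The function and variable names are illustrative:

from collections import Counter

def add_delta_prob(trigram_counts, bigram_counts, vocab_size, x, y, z, delta=1.0):
    # P(z | x y) = (c(x y z) + delta) / (c(x y) + delta * |V|); delta = 1 is add-one
    return ((trigram_counts[(x, y, z)] + delta) /
            (bigram_counts[(x, y)] + delta * vocab_size))

# Illustrative toy corpus
tokens = "and nothing but the truth".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))
print(add_delta_prob(tri, bi, V, "nuts", "sing", "on"))  # 0.2: unseen trigram still gets mass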

  14. Smoothing: Simple Interpolation Interpolate trigram, bigram, and unigram estimates for the best combination (see the sketch below). Almost good enough.
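
The slide's formula is not reproduced in the transcript; simple linear interpolation is standardly written as follows, with the λ weights typically tuned on held-out data:

P_{\text{interp}}(z \mid x\,y) = \lambda_3\, P(z \mid x\,y) + \lambda_2\, P(z \mid y) + \lambda_1\, P(z),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\quad \lambda_i \ge 0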

  15. Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87] • Discounting: reserve some probability mass by discounting the observed counts • Redistribution: give the discounted mass to unseen events according to the (n-1)-gram distribution (see the sketch below)
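
The discounting and redistribution formulas on this slide are not reproduced in the transcript; the standard Katz back-off scheme [Katz87] they describe can be sketched as:

P_{\text{Katz}}(z \mid x\,y) =
\begin{cases}
P^{*}(z \mid x\,y), & C(x\,y\,z) > 0 \quad \text{(discounted trigram estimate)} \\
\alpha(x\,y)\, P_{\text{Katz}}(z \mid y), & \text{otherwise (back off to the bigram)}
\end{cases}

where P* is the discounted probability and the back-off weight α(xy) redistributes the reserved mass so the distribution sums to one.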

  16. Linear Discount The discount factor can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95] (see the sketch below)
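
A sketch of linear discounting in its usual formulation (the slide's formula is not reproduced in the transcript); the leaving-one-out estimate of the discount factor is roughly the singleton fraction [Ney95]:

P^{*}(z \mid x\,y) = (1 - \lambda)\,\frac{C(x\,y\,z)}{C(x\,y)},
\qquad \lambda \approx \frac{n_1}{N}

where n_1 is the number of events observed exactly once, N is the total number of observed events, and the reserved mass λ is redistributed over unseen events.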

  17. More General Formulation • Drawback of linear discount: the counts of frequently observed events are modified the most, which goes against the “law of large numbers” • Generalization: make the discount a function of y, determined by cross-validation; this requires more data and the computation is expensive (see the sketch below)
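
One reading of this generalization, assuming “function of y” means a history-dependent discount λ(y) (the slide's formula is not reproduced in the transcript):

P^{*}(z \mid x\,y) = \bigl(1 - \lambda(y)\bigr)\,\frac{C(x\,y\,z)}{C(x\,y)}

with a separate λ(y) per history, estimated by cross-validation, which is what requires more data and more computation.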

  18. Absolute Discounting The discount is an absolute value. Works pretty well, and is easier than linear discounting (see the sketch below).
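
The standard form of absolute discounting (assumed here, since the slide's formula is not reproduced in the transcript), with a fixed discount D subtracted from every observed count:

P_{\text{abs}}(z \mid x\,y) = \frac{\max\bigl(C(x\,y\,z) - D,\ 0\bigr)}{C(x\,y)} + \alpha(x\,y)\, P(z \mid y),
\qquad 0 < D < 1

where α(xy) is chosen so the distribution sums to one.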

  19. References
  [1] Katz S, “Estimation of probabilities from sparse data for the language model component of a speech recognizer”, IEEE Trans. on Acoustics, Speech, and Signal Processing 35(3):400-401, 1987
  [2] Ney H, Essen U, Kneser R, “On the estimation of ‘small’ probabilities by leaving-one-out”, IEEE Trans. on PAMI 17(12):1202-1212, 1995
  [3] Joshua Goodman, “A tutorial of language modeling: The State of the Art in Language Modeling”, research.microsoft.com/~joshuago/lm-tutorial-public.ppt
