Advanced Smoothing, Evaluation of Language Models

Presentation Transcript


  1. Advanced Smoothing, Evaluation of Language Models

  2. Witten-Bell Discounting • A zero ngram is just an ngram you haven’t seen yet…but every ngram in the corpus was unseen once…so... • How many times did we see an ngram for the first time? Once for each ngram type (T) • Estimate the total probability of unseen bigrams as T / (N + T) • View the training corpus as a series of events, one for each token (N) and one for each new type (T) • We can divide this probability mass equally among unseen bigrams…or we can condition the probability of an unseen bigram on the first word of the bigram • Discount values for Witten-Bell are much more reasonable than Add-One
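
A minimal sketch of the idea above, assuming the simple unconditioned variant that reserves mass T / (N + T) and splits it evenly over the Z unseen bigrams; the toy counts and vocabulary size are made up for illustration:

```python
from collections import Counter

def witten_bell(ngram_counts, num_possible_ngrams):
    """Witten-Bell: reserve probability mass T / (N + T) for unseen n-grams.

    N = total tokens (observed n-gram occurrences), T = observed n-gram types,
    Z = number of possible but unseen n-gram types.
    """
    N = sum(ngram_counts.values())
    T = len(ngram_counts)
    Z = num_possible_ngrams - T

    def prob(ngram):
        c = ngram_counts.get(ngram, 0)
        if c > 0:
            return c / (N + T)        # discounted probability of a seen n-gram
        return T / (Z * (N + T))      # unseen mass split evenly over the Z unseen types

    return prob

# toy bigram counts and a hypothetical vocabulary of 4 words
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("a", "cat"): 1})
p = witten_bell(bigrams, num_possible_ngrams=4 * 4)
print(p(("the", "cat")), p(("zebra", "runs")))   # seen vs. unseen bigram
```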

  3. Good-Turing Discounting • Re-estimate the amount of probability mass for zero (or low count) ngrams by looking at ngrams with higher counts • Estimate the adjusted count as c* = (c + 1) Nc+1 / Nc, where Nc is the number of ngram types that occur c times • E.g. N0’s adjusted count is a function of the count of ngrams that occur once, N1 • Assumes: • word bigrams follow a binomial distribution • We know the number of unseen bigrams (V×V − seen bigrams)
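
A rough illustration of the re-estimation step, assuming the basic c* = (c + 1) Nc+1 / Nc formula without the smoothing of the Nc curve that practical implementations add; the bigram counts are invented:

```python
from collections import Counter

def good_turing(ngram_counts):
    """Good-Turing re-estimation: c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of n-gram types occurring exactly c times.
    Returns the adjusted counts per original count c and the probability
    mass N_1 / N reserved for unseen n-grams.
    """
    N = sum(ngram_counts.values())
    count_of_counts = Counter(ngram_counts.values())   # the N_c table

    adjusted = {}
    for c, n_c in count_of_counts.items():
        n_c_plus_1 = count_of_counts.get(c + 1, 0)
        if n_c_plus_1 > 0:             # naive version: skip counts with an empty N_{c+1}
            adjusted[c] = (c + 1) * n_c_plus_1 / n_c
    unseen_mass = count_of_counts.get(1, 0) / N        # mass moved to zero-count n-grams
    return adjusted, unseen_mass

bigrams = Counter({("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 2, ("c", "d"): 3})
print(good_turing(bigrams))    # adjusted counts and unseen mass 2/7
```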

  4. Interpolation and Backoff • Typically used in addition to smoothing techniques/discounting • Example: trigrams • Smoothing gives some probability mass to all the trigram types not observed in the training data • We could make a more informed decision! How? • If backoff finds an unobserved trigram in the test data, it will “back off” to bigrams (and ultimately to unigrams) • Backoff doesn’t treat all unseen trigrams alike • When we have observed a trigram, we will rely solely on the trigram counts

  5. Backoff methods (e.g. Katz ‘87) • For a trigram model, compute unigram, bigram and trigram probabilities • In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability • E.g. “An omnivorous unicorn”: the trigram is unseen, so the model backs off
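
A simplified sketch of the back-off decision only; full Katz ‘87 additionally discounts the higher-order counts (e.g. with Good-Turing) and renormalises with backoff weights α, which is omitted here. The toy corpus and counts are invented:

```python
from collections import Counter

def backoff_prob(w1, w2, w3, tri, bi, uni, total_words):
    """Back off: use the trigram estimate if the trigram was seen,
    otherwise the bigram, otherwise the unigram. (No discounting or
    alpha weights in this sketch, so the result is not a true
    probability distribution.)"""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[w2]
    return uni[w3] / total_words       # would itself need smoothing for unseen words

corpus = "an omnivorous animal ate an apple".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

# "an omnivorous unicorn": the trigram and the bigram ("omnivorous", "unicorn")
# are unseen, so the model backs off all the way to the unigram for "unicorn".
print(backoff_prob("an", "omnivorous", "animal", tri, bi, uni, len(corpus)))
print(backoff_prob("an", "omnivorous", "unicorn", tri, bi, uni, len(corpus)))
```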

  6. Smoothing: Simple Interpolation • The trigram is very context specific, very noisy • The unigram is context-independent, smooth • Interpolate trigram, bigram and unigram for the best combination: P(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn) • Find 0 < λi < 1, with the λi summing to 1, by optimizing on “held-out” data • Almost good enough
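
A small sketch of linear interpolation, assuming maximum-likelihood component estimates; the λ values are placeholders that would normally be tuned on held-out data (next slide), and the corpus is a toy example:

```python
from collections import Counter

def interpolated_prob(w1, w2, w3, tri, bi, uni, N, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram ML estimates.
    The lambdas must sum to 1; the values here are illustrative only."""
    l3, l2, l1 = lambdas
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / N
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

corpus = "the cat sat on the mat the cat ate".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

# even though "on the cat" was never seen as a trigram, the bigram and
# unigram terms keep its interpolated probability above zero
print(interpolated_prob("on", "the", "cat", tri, bi, uni, len(corpus)))
```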

  7. Smoothing: Held-out estimation • Finding parameter values • Split the data into training, “heldout” and test sets • Try lots of different values for the λi on the heldout data, pick the best • Test on the test data • Sometimes we can use tricks like “EM” (expectation maximization) to find the values • How much data for training, heldout, test? • Answer: enough test data to be statistically significant (1000s of words perhaps)
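
One way the λ search could look as a plain grid search over held-out trigrams (EM would find the weights more cleverly); the probability function and demo values below are hypothetical stand-ins:

```python
import itertools
import math

def pick_lambdas(heldout_trigrams, prob_fn, step=0.1):
    """Grid search for interpolation weights on held-out data.
    prob_fn(w1, w2, w3, lambdas) is assumed to return an interpolated
    probability; the lambda triple with the highest held-out
    log-likelihood wins."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(1, int(1 / step))]
    for l3, l2 in itertools.product(grid, repeat=2):
        l1 = 1.0 - l3 - l2
        if l1 <= 0:
            continue
        ll = sum(math.log(max(prob_fn(w1, w2, w3, (l3, l2, l1)), 1e-12))
                 for w1, w2, w3 in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = (l3, l2, l1), ll
    return best

def dummy(w1, w2, w3, lams):
    # stand-in probability function, only to show the call shape
    return 0.5 * lams[0] + 0.3 * lams[1] + 0.2 * lams[2]

print(pick_lambdas([("a", "b", "c")], dummy))
```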

  8. Summary • N-gram probabilities can be used to estimate the likelihood • Of a word occurring in a context (the preceding N-1 words) • Of a sentence occurring at all • Smoothing techniques deal with the problem of unseen words in a corpus

  9. Practical Issues • Represent and compute language model probabilities in log format: p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
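
A quick check of the identity above; the probabilities are arbitrary:

```python
import math

probs = [0.1, 0.02, 0.003, 0.05]

# summing log probabilities avoids the underflow that multiplying
# many tiny probabilities eventually causes
log_sum = sum(math.log(p) for p in probs)
print(math.exp(log_sum))           # same value as the direct product
print(0.1 * 0.02 * 0.003 * 0.05)   # 3e-07
```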

  10. Class-based n-grams • P(wi | wi-1) = P(ci | ci-1) × P(wi | ci) • Factored Language Models
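
A toy sketch of the class-based bigram factorisation, assuming a hand-made word-to-class mapping and maximum-likelihood counts from an invented corpus:

```python
from collections import Counter

# hypothetical word-to-class assignment and a toy corpus
word2class = {"flights": "NOUN", "on": "PREP", "monday": "DAY", "tuesday": "DAY"}
corpus = ["flights", "on", "monday", "flights", "on", "tuesday"]

classes = [word2class[w] for w in corpus]
class_bigrams = Counter(zip(classes, classes[1:]))
class_counts = Counter(classes)
word_counts = Counter(corpus)

def class_bigram_prob(w_prev, w):
    """P(wi | wi-1) = P(ci | ci-1) * P(wi | ci), with ML estimates from the toy data."""
    c_prev, c = word2class[w_prev], word2class[w]
    p_class_transition = class_bigrams[(c_prev, c)] / class_counts[c_prev]
    p_word_given_class = word_counts[w] / class_counts[c]
    return p_class_transition * p_word_given_class

# "on tuesday" borrows strength from the other PREP -> DAY transitions
print(class_bigram_prob("on", "tuesday"))
```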

  11. Evaluating language models • We need evaluation metrics to determine how well our language models predict the next word • Intuition: one should average over the probability of new words

  12. Some basic information theory • Evaluation metrics for language models • Information theory: measures of information • Entropy • Perplexity

  13. Entropy • The average length of the most efficient coding for a random variable • Binary encoding

  14. Entropy • Example: betting on horses • 8 horses, each horse is equally likely to win • (Binary) Message required: 001, 010, 011, 100, 101, 110, 111, 000 • 3-bit message required

  15. Entropy • 8 horses, some horses are more likely to win • Horse 1: probability ½, code 0 • Horse 2: ¼, code 10 • Horse 3: 1/8, code 110 • Horse 4: 1/16, code 1110 • Horses 5-8: 1/64 each, codes 111100, 111101, 111110, 111111

  16. Perplexity • Entropy: H • Perplexity: 2^H • Intuitively: the weighted average number of choices a random variable has to make • Equally likely horses: entropy 3, perplexity 2^3 = 8 • Biased horses: entropy 2, perplexity 2^2 = 4
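
The two horse examples can be checked with a few lines of Python (a sketch, using the probabilities from the horse slides above):

```python
import math

def entropy(probs):
    """H = -sum(p * log2 p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity = 2**H, the weighted average number of choices."""
    return 2 ** entropy(probs)

equal_horses = [1 / 8] * 8
biased_horses = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4

print(entropy(equal_horses), perplexity(equal_horses))      # 3.0  8.0
print(entropy(biased_horses), perplexity(biased_horses))    # 2.0  4.0
```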

  17. Entropy • Uncertainty measure (Shannon): given a random variable x, H(x) = -Σi pi logr pi with r = 2, where pi is the probability that the event is i and lg = log2 (log base 2) • Biased coin: -0.8 × lg 0.8 - 0.2 × lg 0.2 = 0.258 + 0.464 = 0.722 • Unbiased coin: -2 × 0.5 × lg 0.5 = 1 • H(x) is the Shannon uncertainty (entropy) • Perplexity: the (average) branching factor, the weighted average number of choices a random variable has to make • Formula: 2^H, directly related to the entropy value H • Examples: biased coin 2^0.722 ≈ 1.65, unbiased coin 2^1 = 2

  18. Entropy and Word Sequences • Given a word sequence W = w1…wn • Entropy for word sequences of length n in language L: H(w1…wn) = -Σ p(w1…wn) log p(w1…wn), summed over all sequences of length n in L • Entropy rate for word sequences of length n: (1/n) H(w1…wn) = -(1/n) Σ p(w1…wn) log p(w1…wn) • Entropy rate of the language: H(L) = lim n→∞ -(1/n) Σ p(w1…wn) log p(w1…wn), where n is the number of words in the sequence • Shannon-McMillan-Breiman theorem: H(L) = lim n→∞ -(1/n) log p(w1…wn) • If we select a sufficiently large n, it is possible to take a single sequence instead of summing over all possible w1…wn, because a long sequence will contain many shorter sequences
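
A small sketch of the Shannon-McMillan-Breiman approximation: estimate per-word entropy (and perplexity) from the log probability of a single long sequence. The sequence probability below is a made-up toy value, not output of a real model:

```python
import math

def per_word_entropy(sequence_log2_prob, n):
    """Shannon-McMillan-Breiman style estimate: H ~ -(1/n) * log2 p(w1...wn),
    computed from one sufficiently long sequence of n words."""
    return -sequence_log2_prob / n

# toy value: suppose a model assigns each of 20 words probability 1/4
log2_p = 20 * math.log2(1 / 4)
H = per_word_entropy(log2_p, 20)
print(H, 2 ** H)     # 2.0 bits per word, perplexity 4.0
```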

  19. Entropy of a sequence • Finite sequence: strings from a language L • Entropy rate (per-word entropy)

  20. Entropy of a language • Entropy rate of language L • Shannon-McMillan-Breiman Theorem: • If a language is stationary and ergodic • A single sequence, if it is long enough, is representative of the language
