
Smoothing


Presentation Transcript


  1. Smoothing Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004

  2. The Sparse Data Problem • Maximum likelihood estimation works fine for data that occur frequently in the training corpus • Problem 1: Low-frequency n-grams • If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y? • Problem 2: Zero counts • If n-gram y does not occur in the training data, does that mean it should have probability zero?

  3. The Sparse Data Problem • Data sparseness is a serious and frequently occurring problem • Probability of a sequence is zero if it contains unseen n-grams
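As a concrete illustration of the zero-probability problem, here is a minimal Python sketch with an invented toy corpus (all names and data are illustrative, not from the lecture): a single unseen bigram drives the maximum likelihood probability of an entire sentence to zero.

```python
from collections import Counter

# Toy training corpus (invented for illustration).
train = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)

def p_mle(w_prev, w):
    # Maximum likelihood estimate: P(w | w_prev) = C(w_prev w) / C(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def p_sentence(words):
    # Product of bigram probabilities; a single unseen bigram makes it zero.
    p = 1.0
    for w_prev, w in zip(words, words[1:]):
        p *= p_mle(w_prev, w)
    return p

print(p_sentence("the cat sat on the rug .".split()))  # > 0: every bigram was seen
print(p_sentence("the mat sat on the dog .".split()))  # 0.0: "mat sat" and "dog ." are unseen
```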

  4. Smoothing=Redistributing Probability Mass

  5. Add-One Smoothing • The simplest smoothing technique • For all n-grams, including unseen n-grams, add one to their counts • Un-smoothed and add-one probabilities: see the formulas on the next slide

  6. Add-One Smoothing • Un-smoothed probability: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$ • Add-one probability: $P_{+1}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$, where $V$ is the vocabulary size
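A minimal sketch of the add-one formula above, using the same kind of toy bigram and unigram counts as in the earlier sketch (all names and data are illustrative):

```python
from collections import Counter

train = "the cat sat on the mat . the dog sat on the rug .".split()
bigram_counts = Counter(zip(train, train[1:]))
unigram_counts = Counter(train)
V = len(unigram_counts)  # vocabulary size (number of word types)

def p_add_one(w_prev, w):
    # Add-one (Laplace) estimate: [C(w_prev w) + 1] / [C(w_prev) + V]
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_add_one("the", "cat"))  # seen bigram:   (1 + 1) / (4 + 8)
print(p_add_one("mat", "sat"))  # unseen bigram: (0 + 1) / (1 + 8) -- no longer zero
```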

  7. Add-One Smoothing [Table comparing raw bigram counts $c_i = c(w_{i-1}w_i)$ with add-one counts $c_i' = c_i + 1$]
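The same comparison can be made with renormalized "effective counts". This is the standard Laplace adjusted-count formulation, not quoted from the slide: the incremented count is scaled by $C(w_{i-1})/(C(w_{i-1}) + V)$ so that the adjusted counts for a context still sum to $C(w_{i-1})$.

```python
def add_one_adjusted_count(c_bigram, c_context, V):
    # Effective count under add-one smoothing:
    # c* = (c(w_prev w) + 1) * C(w_prev) / (C(w_prev) + V)
    return (c_bigram + 1) * c_context / (c_context + V)

# Made-up numbers: a bigram seen 4 times after a context seen 1000 times,
# with a 10,000-word vocabulary. Frequent bigrams lose a lot of mass;
# unseen bigrams gain a little.
print(add_one_adjusted_count(4, 1000, 10000))  # ~0.45, well below the raw count of 4
print(add_one_adjusted_count(0, 1000, 10000))  # ~0.09, small but non-zero
```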

  8. Add-One Smoothing • Pro: Very simple technique • Cons: • Too much probability mass is shifted towards unseen n-grams • Probability of frequent n-grams is underestimated • Probability of rare (or unseen) n-grams is overestimated • All unseen n-grams are smoothed in the same way • Using a smaller added count does not solve this problem in principle

  9. Witten-Bell Discounting • Probability mass is shifted around, depending on the context of words • If P(wi | wi-1,…,wi-m) = 0, then the smoothed probability PWB(wi | wi-1,…,wi-m) is higher if the sequence wi-1,…,wi-m occurs with many different words wi

  10. Witten-Bell Smoothing • Let’s consider bi-grams • T(wi-1) is the number of different words (types) that occur to the right of wi-1 • N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1 • Z(wi-1) is the number of bigrams in the current data set starting with wi-1 that do not occur in the training data

  11. Witten-Bell Smoothing • If $c(w_{i-1}w_i) = 0$: $P_{WB}(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)}$ • If $c(w_{i-1}w_i) > 0$: $P_{WB}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}w_i)}{N(w_{i-1}) + T(w_{i-1})}$
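A Python sketch of these two cases, with T, N, and Z computed as defined on the previous slide. The toy counts and vocabulary are invented, and Z is taken here to be the number of vocabulary words never seen after w_{i-1}:

```python
from collections import Counter

# Toy bigram counts and vocabulary (invented for illustration).
bigram_counts = Counter({("the", "cat"): 2, ("the", "dog"): 1, ("the", "mat"): 1})
vocab = {"the", "cat", "dog", "mat", "sat", "on"}

def p_witten_bell(w_prev, w):
    followers = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == w_prev}
    T = len(followers)            # word types seen after w_prev
    N = sum(followers.values())   # word tokens seen after w_prev
    Z = len(vocab) - T            # word types never seen after w_prev
    c = bigram_counts[(w_prev, w)]
    if c == 0:
        return T / (Z * (N + T))  # reserved mass T/(N+T), split evenly over the Z unseen types
    return c / (N + T)            # seen bigrams keep c/(N+T), i.e. are mildly discounted

print(p_witten_bell("the", "cat"))  # seen:   2 / (4 + 3)
print(p_witten_bell("the", "sat"))  # unseen: 3 / (3 * (4 + 3))
```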

  12. Witten-Bell Smoothing [Table comparing raw counts $c_i$, add-one counts $c_i' = c_i + 1$, and Witten-Bell adjusted counts: $c_i' = \frac{T}{Z}\cdot\frac{N}{N+T}$ if $c_i = 0$, and $c_i' = c_i\cdot\frac{N}{N+T}$ otherwise]

  13. Witten-Bell Smoothing • Witten-Bell smoothing is more conservative when subtracting probability mass • Gives rather good estimates • Problem: If wi-1 and wi did not occur in the training data, the smoothed probability is still zero

  14. Backoff Smoothing • Deleted interpolation • If the n-gram wi-n,…,wi is not in the training data, use wi-(n-1),…,wi • More generally, combine evidence from different n-grams, where lambda is the 'confidence' weight for the longer n-gram • Compute lambda parameters from held-out data • Lambdas can be n-gram specific
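A recursive sketch of combining n-gram estimates of decreasing order; this recursive formulation and the numbers below are illustrative rather than the lecture's exact equation, and in practice the lambdas would be estimated on held-out data as the slide says:

```python
def interpolate(probs, lambdas):
    # probs:   estimates from longest to shortest context, e.g. [P_trigram, P_bigram, P_unigram]
    # lambdas: 'confidence' weight for the longer context at each step
    if len(probs) == 1:
        return probs[0]
    lam = lambdas[0]
    return lam * probs[0] + (1 - lam) * interpolate(probs[1:], lambdas[1:])

# An unseen trigram (estimate 0.0) still receives a non-zero smoothed probability
# from the bigram and unigram estimates (all numbers made up).
print(interpolate([0.0, 0.20, 0.05], [0.6, 0.7]))  # 0.4 * (0.7*0.20 + 0.3*0.05) = 0.062
```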

  15. Other Smoothing Approaches • Good-Turing Discounting: Re-estimate amount of probability mass for zero (or low count) n-grams by looking at n-grams with higher counts • Kneser-Ney Smoothing: Similar to Witten-Bell smoothing but considers number of word types preceding a word • Katz Backoff Smoothing: Reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold
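For the Good-Turing idea in the first bullet, a minimal count re-estimation sketch using the standard formula c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of n-gram types seen exactly c times (the counts-of-counts below are made up):

```python
# N_c: how many n-gram types were seen exactly c times (made-up numbers).
counts_of_counts = {0: 100000, 1: 5000, 2: 1200, 3: 500, 4: 250}

def good_turing_count(c):
    # Re-estimated count: c* = (c + 1) * N_{c+1} / N_c
    return (c + 1) * counts_of_counts[c + 1] / counts_of_counts[c]

print(good_turing_count(0))  # 0.05: an unseen n-gram behaves as if seen 0.05 times
print(good_turing_count(1))  # 0.48: singletons are discounted from 1 to 0.48
```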
