Smoothing. LING 570 Fei Xia Week 5: 10/24/07. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A A A A A A A. Smoothing. What? Why? To deal with events observed zero times. “event”: a particular ngram How?
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Week 5: 10/24/07
TexPoint fonts used in EMF.
Read the TexPoint manual before you delete this box.: AAAAAAAAAA
Add-one smoothing works horribly in practice.
Need to choose
It works better than add-one,
but still works horribly.
For unseen n-grams, we assume that
each of them occurs c0* times.
N0 is the number of unseen ngrams
Therefore, the prob of EACH unseen ngram is:
The total prob mass for unseen ngrams is:
c* comes from GT estimate.
Ex: log(Nc) = a + b log(c)
Ex: c* = c for c > gt_max
calculate N_0, N_1, …, Nmax3+1, and N
calculate a function f(c) , for c=0, 1, …, max3.
= f(c) otherwise
Do the same for unigram and bigram models
Backoff and interpolation
Pkatz (wi|wi-2, wi-1) =
P3(wi |wi-2, wi-1) if c(wi-2, wi-1, wi) > 0
®(wi-2, wi-1) Pkatz (wi|wi-1) o.w.
Pkatz (wi|wi-1) =
P2(wi | wi-1) if c(wi-1, wi) > 0
®(wi-1) P1(wi) o.w
What is the value for D?
How to set ®(x)?
more likely to appear in some new context as well.
¼ P(ci-1 | wi-1) * P(ci | ci-1) * P(wi | ci)