
N-gram model limitations


- Q: What do we do about N-grams which were not in our training corpus?
- A: We distribute some probability mass from seen N-grams to previously unseen N-grams.
- This leads to another question: how do we do this?

- Recall that we use unigram and bigram counts to compute bigram probabilities:
- $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$
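The estimate above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `mle_bigram_prob` and the toy token list are assumptions made up for the example.

```python
from collections import Counter

def mle_bigram_prob(unigrams, bigrams, prev, word):
    """Unsmoothed estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        raise ValueError(f"{prev!r} never occurs in the training data")
    return bigrams[(prev, word)] / unigrams[prev]

# Toy training text, invented for illustration only.
tokens = "<s> i want to eat to eat lunch".split()
unigrams = Counter(tokens)                   # C(w)
bigrams = Counter(zip(tokens, tokens[1:]))   # C(w_prev w)

p_eat_given_to = mle_bigram_prob(unigrams, bigrams, "to", "eat")      # 2 / 2
p_to_given_eat = mle_bigram_prob(unigrams, bigrams, "eat", "to")      # 1 / 2
p_want_given_eat = mle_bigram_prob(unigrams, bigrams, "eat", "want")  # 0 / 2: unseen bigram
```

The last call returns zero because "eat want" never occurs in the toy text, which is exactly the situation the opening question is about.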

- Suppose a text has N words. How many bigram tokens does it contain?
- At most N: if we assume <s> appears before the first word, so that the word in initial position also gets a bigram probability, there are exactly N bigram tokens (without <s> there would be N - 1).
- Example (5 words):
- words: <s> w1 w2 w3 w4 w5
- bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
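A short Python sketch of the padding step, assuming a hypothetical helper named `bigram_tokens` (not from the slides), reproduces the 5-word example:

```python
def bigram_tokens(words, start="<s>"):
    """Prepend the start symbol, then pair each token with its successor."""
    padded = [start] + list(words)
    return list(zip(padded, padded[1:]))

pairs = bigram_tokens(["w1", "w2", "w3", "w4", "w5"])
# 5 words give 5 bigram tokens once <s> is prepended.
```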

- With a vocabulary of N word types, there are $N^2$ possible bigram types.

- Berkeley Restaurant Project corpus
- approximately 10,000 sentences
- 1616 word types
- tables will show counts or probabilities for 7 word types, carefully chosen so that the 7 by 7 matrix is not too sparse
- notice that many counts in first table are zero (25 zeros of 49 entries)
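To get a feel for how sparse the full table must be, the figures quoted above can be plugged in directly. This is only arithmetic on the numbers from the slides; the variable names are made up for the sketch.

```python
V = 1616                          # word types in the corpus (from the slide)
possible_bigram_types = V ** 2    # every ordered pair of word types

zero_cells = 25                   # zero counts in the 7-by-7 excerpt
total_cells = 7 * 7
zero_fraction = zero_cells / total_cells

print(possible_bigram_types)      # 2611456
print(round(zero_fraction, 2))    # 0.51
```

Even in an excerpt chosen to be dense, about half of the cells are empty.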

Bigram counts (figure 6.4 from text)

- Recall formula (we normalize by unigram counts):
- $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$

- Unigram counts are:

p( eat | to ) = c( to eat ) / c( to ) = 860 / 3256 = .26

p( to | eat ) = c( eat to ) / c(eat) = 2 / 938 = .0021

Bigram probabilities (figure 6.5 from text): p( wn | wn-1 )

- Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus.
- So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?

- How can we ensure that none of the possible bigrams have zero counts/probabilities?
- The process of spreading the probability mass around to all possible bigrams is called smoothing.
- We start with a very simple model, called add-one smoothing.

- Basic idea: add one to actual counts, across the board.
- This ensures that there are no unigrams or bigrams with zero counts.
- Typically this adds too much probability mass to non-occurring bigrams.

- Unadjusted probabilities:
- $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$

- Adjusted probabilities:
- $P^*(w_n \mid w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V]$

- Notes
- V is the total number of word types in the vocabulary.
- In the numerator we add one to the count of each bigram, just as we do with unigram counts.
- In the denominator we add V, since we are adding one more bigram token of the form $w_{n-1} w$ for each word w in our vocabulary.
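The adjusted formula is a one-liner in Python. The sketch below uses a hypothetical function name, `add_one_prob`, and plugs in the "to eat" and "eat to" counts from the unsmoothed worked examples (860 of 3256, and 2 of 938) together with V = 1616.

```python
def add_one_prob(bigram_count, prev_count, vocab_size):
    """Add-one estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_count + 1) / (prev_count + vocab_size)

V = 1616
p_eat_given_to = add_one_prob(860, 3256, V)   # 861 / 4872
p_to_given_eat = add_one_prob(2, 938, V)      # 3 / 2554
p_unseen_after_to = add_one_prob(0, 3256, V)  # 1 / 4872: no longer zero

print(round(p_eat_given_to, 2))   # 0.18
print(round(p_to_given_eat, 4))   # 0.0012
```

Because every one of the V possible continuations of a word gains exactly one count, the smoothed probabilities for a fixed first word still sum to one.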

Add-one smoothed bigram counts (figure 6.6 from text)

- Recall the formula for the adjusted probabilities:
- $P^*(w_n \mid w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V]$

- Unigram counts (adjusted by adding V=1616):

p*( eat | to ) = [ c( to eat ) + 1 ] / [ c( to ) + V ] = 861 / 4872 = .18 (was .26)

p*( to | eat ) = [ c( eat to ) + 1 ] / [ c( eat ) + V ] = 3 / 2554 = .0012 (was .0021)

p*( eat | lunch ) = [ c( lunch eat ) + 1 ] / [ c( lunch ) + V ] = 1 / 2075 = .00048 (was 0)

p*( eat | want ) = [ c( want eat ) + 1 ] / [ c( want ) + V ] = 1 / 2831 = .00035 (was 0)

Add-one smoothed bigram probabilities (figure 6.7 from text)

- We define the discount to be the ratio of the new count to the old count (in our case, the smoothed count divided by the unsmoothed count).
- Discounts for add-one smoothing for this example:

- The discount tells us where the probability mass is coming from.
- "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]
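The discount for a single bigram can be sketched as below. One assumption is not spelled out in the slides: the smoothed count is taken to be the smoothed probability rescaled by the original prefix count, c* = (c + 1) · C(prev) / (C(prev) + V). The function names are hypothetical; the counts for "to eat" come from the earlier worked example.

```python
def add_one_smoothed_count(bigram_count, prev_count, vocab_size):
    """Assumed reconstruction: c* = (c + 1) * C(prev) / (C(prev) + V)."""
    return (bigram_count + 1) * prev_count / (prev_count + vocab_size)

def discount(bigram_count, prev_count, vocab_size):
    """Ratio of smoothed to unsmoothed count; only defined for seen bigrams."""
    if bigram_count == 0:
        raise ValueError("discount is undefined for a zero count")
    return add_one_smoothed_count(bigram_count, prev_count, vocab_size) / bigram_count

d_to_eat = discount(860, 3256, 1616)
print(round(d_to_eat, 2))  # 0.67
```

Under this reconstruction, about a third of the count for "to eat" is given away. Since the factor C(prev) / (C(prev) + V) shrinks as the prefix word gets rarer, infrequent first words are discounted far more heavily.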

- Witten-Bell discounting: another approach to smoothing
- Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211]
- “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of “first-time” N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]

- Total probability mass assigned to all (as yet) unseen n-grams is T / [ T + N ], where
- T is the total number of observed types (not vocabulary size)
- N is the number of tokens

- “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211]
- Formula above estimates “the probability of a new type event occurring.” [p. 211]

- This probability mass is distributed evenly amongst the unseen n-grams.
- Z = number of zero-count n-grams.
- $p_i^* = \frac{T / (N + T)}{Z} = \frac{T}{Z (N + T)}$

- This probability mass has to come from somewhere.
- Recall that the unsmoothed probability is $p_i = c_i / N$ (N is the number of tokens).
- The smoothed (discounted) probability for non-zero-count unigrams is $p_i^* = c_i / (N + T)$.
- We can also give smoothed counts (Z is the number of unigram types with zero counts):
- For zero-count unigrams: $c_i^* = \frac{T}{Z} \cdot \frac{N}{N + T}$
- For non-zero-count unigrams: $c_i^* = c_i \cdot \frac{N}{N + T}$
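The unigram formulas above can be put together in a small Python sketch. The function name `witten_bell_unigram` and the toy counts are assumptions made up for illustration, and the sketch requires at least one zero-count type so that the reserved mass has somewhere to go.

```python
def witten_bell_unigram(counts, vocab):
    """Return smoothed unigram probabilities for every type in vocab.

    Seen types get c / (N + T); the reserved mass T / (N + T) is split
    evenly over the Z zero-count types.
    """
    N = sum(counts.get(w, 0) for w in vocab)            # tokens
    T = sum(1 for w in vocab if counts.get(w, 0) > 0)   # observed types
    Z = len(vocab) - T                                  # zero-count types
    if Z == 0:
        raise ValueError("no unseen types: the reserved mass cannot be assigned")
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        probs[w] = c / (N + T) if c > 0 else T / (Z * (N + T))
    return probs

# Toy counts, invented for illustration: N = 4, T = 2, Z = 2.
probs = witten_bell_unigram({"a": 3, "b": 1}, ["a", "b", "c", "d"])
total = sum(probs.values())
```

Here "a" gets 3/6, "b" gets 1/6, and the two unseen types share the reserved mass 2/6 equally, so the distribution still sums to one.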

- Total probability mass being redistributed (conditioned on the first word of the bigram):
- $\sum_{i : c(w_x w_i) = 0} p^*(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$

- Distribute this probability mass to unseen bigrams.
- $Z(w_x)$ is the number of bigram types beginning with the given first word that have zero counts.
- The smoothed (discounted) probability is:
- $p^*(w_i \mid w_x) = \frac{T(w_x)}{Z(w_x) \, (N(w_x) + T(w_x))}$ if $c(w_x w_i) = 0$
- $p^*(w_i \mid w_x) = \frac{c(w_x w_i)}{c(w_x) + T(w_x)}$ if $c(w_x w_i) > 0$
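A minimal Python sketch of the conditional version follows. It assumes that N(w_x) and c(w_x) both mean the number of bigram tokens that begin with w_x; the function name `witten_bell_bigram_prob` and the toy counts are hypothetical.

```python
def witten_bell_bigram_prob(bigram_counts, vocab, prev, word):
    """Witten-Bell estimate of P(word | prev) from a dict of bigram counts."""
    if word not in vocab:
        raise ValueError(f"{word!r} is not in the vocabulary")
    # Counts of every word seen after prev.
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev and c > 0}
    N = sum(followers.values())   # bigram tokens starting with prev
    T = len(followers)            # bigram types starting with prev
    Z = len(vocab) - T            # zero-count bigram types starting with prev
    if T == 0:
        raise ValueError(f"{prev!r} was never seen as a first word")
    c = followers.get(word, 0)
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))      # Z >= 1 here, since word itself is unseen after prev

# Toy counts, invented for illustration: N(to) = 4, T(to) = 2, Z(to) = 2.
vocab = {"to", "eat", "go", "lunch"}
counts = {("to", "eat"): 3, ("to", "go"): 1}
p_seen = witten_bell_bigram_prob(counts, vocab, "to", "eat")       # 3 / 6
p_unseen = witten_bell_bigram_prob(counts, vocab, "to", "lunch")   # 2 / (2 * 6)
row_total = sum(witten_bell_bigram_prob(counts, vocab, "to", w) for w in vocab)
```

Unlike add-one smoothing, the amount reserved for unseen continuations depends on how many distinct words have followed the first word, not on the size of the whole vocabulary.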

Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text)

Notice that counts which were 0 unsmoothed are <1 smoothed; contrast with Add-One Smoothing.

- Table shows discounts for add-one and Witten-Bell smoothing for this example:

- The corpus is divided into a training set and a test set.
- Test items must not appear in the training set; otherwise they will receive artificially high probability.
- Can use this to evaluate different systems:
- train two different systems on the same training set
- compare performance of systems on the same test set
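The comparison procedure can be sketched as follows. The slides do not name an evaluation metric, so average log probability per bigram is used here purely as an assumption; the toy training and test texts and all function names are likewise made up for illustration. The two "systems" are the unsmoothed and add-one estimators discussed earlier.

```python
import math
from collections import Counter

def avg_log_prob(test_tokens, prob):
    """Average log P(word | prev) over the test bigrams; -inf if any is zero."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    total = 0.0
    for prev, word in pairs:
        p = prob(prev, word)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total / len(pairs)

# Both systems are trained on the same training set.
train = "<s> i want to eat lunch <s> i want to eat food".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)

def mle(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def add_one(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# Both systems are scored on the same held-out test set.
test = "<s> i want food".split()
mle_score = avg_log_prob(test, mle)
add_one_score = avg_log_prob(test, add_one)
```

The test sentence contains the bigram "want food", which never occurs in the training text, so the unsmoothed model assigns it zero probability and scores negative infinity, while the smoothed model gets a finite score.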