N-gram model limitations

1 / 25

# N-gram model limitations - PowerPoint PPT Presentation

N-gram model limitations. Q: What do we do about N-grams which were not in our training corpus? A: We distribute some probability mass from seen N-grams to previously unseen N-grams. This leads to another question: how do we do this?. Unsmoothed bigrams.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' N-gram model limitations' - mindy

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
N-gram model limitations
• Q: What do we do about N-grams which were not in our training corpus?
• A: We distribute some probability mass from seen N-grams to previously unseen N-grams.
• This leads to another question: how do we do this?
Unsmoothed bigrams
• Recall that we use unigram and bigram counts to compute bigram probabilities:
• P(wn|wn-1) = C(wn-1wn) / C(wn-1)
How many bigrams are in a text?
• Suppose text had N words, how many bigrams (tokens) does it contain?
• At most N: we assume <s> appearing before first word to get a bigram probability for the word in the initial position.
• Example (5 words):
• words: <s> w1 w2 w3 w4 w5
• bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
How many possible bigrams are there?
• With a vocabulary of N words, there are N2 possible bigrams.
Example description
• Berkeley Restaurant Project corpus
• approximately 10,000 sentences
• 1616 word types
• tables will show counts or probabilities for 7 word types, carefully chosen so that the 7 by 7 matrix is not too sparse
• notice that many counts in first table are zero (25 zeros of 49 entries)
Unsmoothed N-grams

Bigram counts (figure 6.4 from text)

Computing probabilities
• Recall formula (we normalize by unigram counts):
• P(wn|wn-1) = C(wn-1wn) / C(wn-1)
• Unigram counts are:

p( eat | to ) = c( to eat ) / c( to ) = 860 / 3256 = .26

p( to | eat ) = c( eat to ) / c(eat) = 2 / 938 = .0021

Unsmoothed N-grams

Bigram probabilities (figure 6.5 from text): p( wn | wn-1 )

What do zeros mean?
• Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus.
• So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?
Let’s rephrase the question
• How can we ensure that none of the possible bigrams have zero counts/probabilities?
• Process of spreading the probability mass around to all possible bigrams is called smoothing.
• Basic idea: add one to actual counts, across the board.
• This ensures that there are no unigrams or bigrams with zero counts.
• Typically this adds too much probability mass to non-occurring bigrams.
• P(wn|wn-1) = C(wn-1wn) / C(wn-1)
• P*(wn|wn-1) = [ C(wn-1wn) + 1 ] / [ C(wn-1) + V ]
• Notes
• V is total number of word types in vocabulary
• In numerator we add one to the count of each bigram, just as we do with the unigram counts.
• In denominator we add V, since we are adding one more bigram token of the form wn-1w, for each w in our vocabulary
A simple approach to smoothing:Add-one smoothing

Add-one smoothed bigram counts (figure 6.6 from text)

Back to the probabilities
• Recall the formula for the adjusted probabilities:
• P*(wn|wn-1) = [ C(wn-1wn) + 1 ] / [ C(wn-1) + V ]

p( eat | to ) = c( to eat ) / c( to ) = 861 / 4872 = .18 (was .26)

p( to | eat ) = c( eat to ) / c( eat ) = 3 / 2554 = .0012 (was .0021)

p( eat | lunch ) = c( lunch eat ) / c( lunch ) = 1 / 2075 = .00048 (was 0)

p( eat | want ) = c( want eat ) / c( want ) = 1 / 2931 = .00034 (was 0)

A simple approach to smoothing:Add-one smoothing

Add-one smoothed bigram probabilities (figure 6.7 from text)

Discounting
• We define the discount to be the ratio of new and old counts (in our case smoothed and unsmoothed counts).
• Discounts for add-one smoothing for this example:
What does the discount tell us?
• The discount tells us from where the probability mass is coming.
• "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]
Witten-Bell discounting
• Another approach to smoothing
• Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211]
• “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of “first-time” N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]
How much probability mass can we reassign?
• Total probability mass assigned to all (as yet) unseen n-grams is T / [ T + N ], where
• T is the total number of observed types (not vocabulary size)
• N is the number of tokens
• “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211]
• Formula above estimates “the probability of a new type event occurring.” [p. 211]
Distribution of probability mass
• This probability mass is distributed evenly amongst the unseen n-grams.
• Z = number of zero-count n-grams.
• pi* = [ T / (N+T) ] / Z = T / Z(N+T)
Discounting (unigram case)
• This probability mass has to come from somewhere.
• Recall the unsmoothed probability is pi = ci / N (N is number of tokens)
• The smoothed (discounted) probability for non-zero count unigrams is pi* = ci / (N + T)
• We can also give smoothed counts (Z is # unigram types with zero counts):
• For zero-count unigrams: ci* = (T/Z) N/(N+T)
• For non-zero count unigrams: ci* = ci * N/(N+T)
Discounting (bigram case)
• Total probability mass being redistributed (conditioned on first word of bigram):

i:c(wxwi)=0p*(wi|wx) = T(wx) / (N(wx) + T(wx))

• Distrubute this probability mass to unseen bigrams.
• Z is # bigram types with given first word with zero counts.
• The smoothed (discounted) probability is p*(wi|wx) = T(wx) / Z(wx)(N(wx) + T(wx)) c(wi-1wi) = 0 p*(wi|wx) = c(wxwi) / (c(wx) + T(wx)) c(wi-1wi) > 0
Witten-Bell: discounted counts

Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text)

Notice that counts which were 0 unsmoothed are <1 smoothed; contrast with Add-One Smoothing.

Discount comparison
• Table shows discounts for add-one and Witten-Bell smoothing for this example:
Training sets and test sets
• Corpus divided into training set and test set
• Need test items to not be in training set, else they will have artificially high probability
• Can use this to evaluate different systems:
• train two different systems on the same training set
• compare performance of systems on the same test set