N-gram model limitations
  • Q: What do we do about N-grams which were not in our training corpus?
  • A: We distribute some probability mass from seen N-grams to previously unseen N-grams.
  • This leads to another question: how do we do this?
Unsmoothed bigrams
  • Recall that we use unigram and bigram counts to compute bigram probabilities:
    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
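
A minimal Python sketch of this estimate on a toy corpus (the corpus, the counts, and the helper name bigram_prob are illustrative, not from the slides):

from collections import Counter

def bigram_prob(bigram_counts, unigram_counts, prev_word, word):
    # Maximum-likelihood bigram probability; 0 if the history word was never seen.
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

tokens = "<s> i want to eat chinese food".split()          # toy corpus
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_prob(bigram_counts, unigram_counts, "want", "to"))   # 1.0 in this toy corpus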
How many bigrams are in a text?
  • Suppose a text has N words; how many bigram tokens does it contain?
  • At most N: we assume <s> appears before the first word so that we can assign a bigram probability to the word in the initial position.
  • Example (5 words):
    • words: <s> w1 w2 w3 w4 w5
    • bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
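
A short sketch of this padding step (the token names are the placeholders from the example above):

words = ["w1", "w2", "w3", "w4", "w5"]           # N = 5 words
padded = ["<s>"] + words
bigrams = list(zip(padded, padded[1:]))
print(bigrams)       # [('<s>', 'w1'), ('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4'), ('w4', 'w5')]
print(len(bigrams))  # 5, the same as the number of words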
How many possible bigrams are there?
  • With a vocabulary of N words, there are N² possible bigrams (for the 1616 word types in the example below, that is 1616² ≈ 2.6 million possible bigrams).
Example description
  • Berkeley Restaurant Project corpus
    • approximately 10,000 sentences
    • 1616 word types
    • tables will show counts or probabilities for 7 word types, carefully chosen so that the 7 by 7 matrix is not too sparse
    • notice that many counts in the first table are zero (25 of the 49 entries)
Unsmoothed N-grams

Bigram counts (figure 6.4 from text)

Computing probabilities
  • Recall formula (we normalize by unigram counts):
    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
  • Unigram counts are:

p( eat | to ) = c( to eat ) / c( to ) = 860 / 3256 = .26

p( to | eat ) = c( eat to ) / c(eat) = 2 / 938 = .0021
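
A sketch that reproduces these two worked values using only the counts quoted on this slide (c(to eat) = 860, c(to) = 3256, c(eat to) = 2, c(eat) = 938):

unigram = {"to": 3256, "eat": 938}
bigram = {("to", "eat"): 860, ("eat", "to"): 2}

p_eat_given_to = bigram[("to", "eat")] / unigram["to"]     # 860 / 3256
p_to_given_eat = bigram[("eat", "to")] / unigram["eat"]    # 2 / 938
print(round(p_eat_given_to, 2), round(p_to_given_eat, 4))  # 0.26 0.0021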

Unsmoothed N-grams

Bigram probabilities (figure 6.5 from text): p( wn | wn-1 )

What do zeros mean?
  • Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus.
  • So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?
Let’s rephrase the question
  • How can we ensure that none of the possible bigrams have zero counts/probabilities?
  • The process of spreading probability mass around to all possible bigrams is called smoothing.
  • We start with a very simple method, called add-one smoothing.
Add-one smoothing
  • Basic idea: add one to actual counts, across the board.
  • This ensures that there are no unigrams or bigrams with zero counts.
  • Typically this adds too much probability mass to non-occurring bigrams.
Add-one smoothing: computing the probabilities
  • Unadjusted probabilities:
    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
  • Adjusted probabilities:
    • P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
  • Notes
    • V is the total number of word types in the vocabulary.
    • In the numerator we add one to the count of each bigram, just as we do with the unigram counts.
    • In the denominator we add V, since we are adding one more bigram token of the form w_{n-1} w for each word w in our vocabulary.
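
A minimal sketch of the adjusted estimate (the function name and the toy counts are illustrative, not from the text); note that a bigram never seen in training now gets a small but non-zero probability:

def add_one_bigram_prob(bigram_counts, unigram_counts, V, prev_word, word):
    # Unseen bigrams contribute a count of 0 before the +1 is added.
    return (bigram_counts.get((prev_word, word), 0) + 1) / (unigram_counts.get(prev_word, 0) + V)

print(add_one_bigram_prob({}, {"lunch": 10}, V=20, prev_word="lunch", word="eat"))  # 1/30 ≈ 0.033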
A simple approach to smoothing: Add-one smoothing

Add-one smoothed bigram counts (figure 6.6 from text)

Back to the probabilities
  • Recall the formula for the adjusted probabilities:
    • P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
  • Unigram counts (adjusted by adding V=1616):

p( eat | to ) = c( to eat ) / c( to ) = 861 / 4872 = .18 (was .26)

p( to | eat ) = c( eat to ) / c( eat ) = 3 / 2554 = .0012 (was .0021)

p( eat | lunch ) = c( lunch eat ) / c( lunch ) = 1 / 2075 = .00048 (was 0)

p( eat | want ) = c( want eat ) / c( want ) = 1 / 2931 = .00034 (was 0)
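
A quick check of the four values above with V = 1616; the raw counts for lunch (459) and want (1315) are recovered from the adjusted denominators on this slide (2075 − 1616 and 2931 − 1616):

V = 1616
unigram = {"to": 3256, "eat": 938, "lunch": 459, "want": 1315}
bigram = {("to", "eat"): 860, ("eat", "to"): 2}    # unseen bigrams default to 0

for prev, w in [("to", "eat"), ("eat", "to"), ("lunch", "eat"), ("want", "eat")]:
    p = (bigram.get((prev, w), 0) + 1) / (unigram[prev] + V)
    print(prev, w, round(p, 5))    # 0.17672, 0.00117, 0.00048, 0.00034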

A simple approach to smoothing: Add-one smoothing

Add-one smoothed bigram probabilities (figure 6.7 from text)

Discounting
  • We define the discount to be the ratio of the new count to the old count (in our case, the smoothed count to the unsmoothed count).
  • Discounts for add-one smoothing for this example:
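
A sketch of computing this discount; it assumes the usual reconstructed add-one count c* = (c + 1) · N / (N + V), where N is the count of the history word and V the vocabulary size (an assumption here, since the slides only show the resulting table):

def add_one_discount(c, N, V):
    # Ratio of smoothed (reconstructed) count to original count; undefined when c == 0.
    if c == 0:
        return None
    c_star = (c + 1) * N / (N + V)
    return c_star / c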
What does the discount tell us?
  • The discount tells us from where the probability mass is coming.
  • "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]
Witten-Bell discounting
  • Another approach to smoothing
  • Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211]
  • “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of “first-time” N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]
How much probability mass can we reassign?
  • Total probability mass assigned to all (as yet) unseen n-grams is T / [ T + N ], where
    • T is the total number of observed types (not vocabulary size)
    • N is the number of tokens
  • “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211]
  • Formula above estimates “the probability of a new type event occurring.” [p. 211]
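
A small sketch of this quantity on a toy corpus (the corpus is made up; T and N are as defined above):

from collections import Counter

tokens = "to eat to eat chinese food to".split()   # toy corpus
counts = Counter(tokens)
N = sum(counts.values())      # 7 tokens
T = len(counts)               # 4 observed types
print(T / (N + T))            # 4/11 ≈ 0.36, reserved for unseen unigrams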
Distribution of probability mass
  • This probability mass is distributed evenly amongst the unseen n-grams.
  • Z = number of zero-count n-grams.
  • p_i* = [ T / (N + T) ] / Z = T / ( Z (N + T) )
Discounting (unigram case)
  • This probability mass has to come from somewhere.
  • Recall the unsmoothed probability is pi = ci / N (N is number of tokens)
  • The smoothed (discounted) probability for non-zero count unigrams is pi* = ci / (N + T)
  • We can also give smoothed counts (Z is # unigram types with zero counts):
    • For zero-count unigrams: ci* = (T/Z) N/(N+T)
    • For non-zero count unigrams: ci* = ci * N/(N+T)
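
A sketch putting the unigram formulas together; the toy corpus and the explicit vocabulary (needed to obtain Z) are assumptions for illustration:

from collections import Counter

def witten_bell_unigram(counts, vocabulary):
    N = sum(counts.values())                           # tokens
    T = len(counts)                                    # observed types
    Z = sum(1 for w in vocabulary if counts[w] == 0)   # zero-count types
    probs = {}
    for w in vocabulary:
        if counts[w] > 0:
            probs[w] = counts[w] / (N + T)             # discounted seen probability
        else:
            probs[w] = T / (Z * (N + T))               # evenly shared unseen mass
    return probs

counts = Counter("to eat to eat chinese food to".split())
vocab = ["to", "eat", "chinese", "food", "lunch", "spend"]
probs = witten_bell_unigram(counts, vocab)
print(round(sum(probs.values()), 10))                  # 1.0: the mass is fully redistributed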
Discounting (bigram case)
  • Total probability mass being redistributed (conditioned on first word of bigram):

Σ_{i : c(w_x w_i) = 0} p*(w_i | w_x) = T(w_x) / ( N(w_x) + T(w_x) )

  • Distribute this probability mass to the unseen bigrams.
  • Z(w_x) is the number of bigram types with first word w_x that have zero counts.
  • The smoothed (discounted) probabilities are:
    • p*(w_i | w_x) = T(w_x) / ( Z(w_x) ( N(w_x) + T(w_x) ) )   if c(w_x w_i) = 0
    • p*(w_i | w_x) = c(w_x w_i) / ( c(w_x) + T(w_x) )   if c(w_x w_i) > 0
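
A sketch of the conditional case, computing T(w_x), N(w_x), and Z(w_x) per history word (the vocabulary argument and the function name are illustrative):

def witten_bell_bigram_prob(bigram_counts, vocabulary, w_x):
    seen = {w: c for (h, w), c in bigram_counts.items() if h == w_x and c > 0}
    N_wx = sum(seen.values())     # bigram tokens starting with w_x; equals c(w_x) when every w_x is followed by a word
    T_wx = len(seen)              # distinct continuations observed after w_x
    Z_wx = len(vocabulary) - T_wx # continuations never observed after w_x
    def prob(w_i):
        if w_i in seen:
            return seen[w_i] / (N_wx + T_wx)         # discounted seen bigram
        return T_wx / (Z_wx * (N_wx + T_wx))         # share of the reserved mass
    return prob

For each history w_x the seen probabilities sum to N(w_x) / (N(w_x) + T(w_x)) and the unseen ones to T(w_x) / (N(w_x) + T(w_x)), so the conditional distribution still sums to 1.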
Witten-Bell: discounted counts

Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text)

Notice that counts which were 0 unsmoothed become values less than 1 when smoothed; contrast this with add-one smoothing.

Discount comparison
  • Table shows discounts for add-one and Witten-Bell smoothing for this example:
Training sets and test sets
  • Corpus divided into training set and test set
  • Test items must not appear in the training set; otherwise they will receive artificially high probabilities.
  • Can use this to evaluate different systems:
    • train two different systems on the same training set
    • compare performance of systems on the same test set
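
A minimal sketch of such a split-and-compare setup (the 90/10 ratio and the toy sentences are placeholders, not from the slides):

import random

sentences = ["i want to eat chinese food",
             "tell me about chez panisse",
             "i want to spend lots of money"]   # stand-in for the ~10,000-sentence corpus
random.shuffle(sentences)
split = int(0.9 * len(sentences))
train_set, test_set = sentences[:split], sentences[split:]
# Train both systems on train_set only; score both on the same held-out test_set.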