
N-gram model limitations

  • Q: What do we do about N-grams which were not in our training corpus?

  • A: We distribute some probability mass from seen N-grams to previously unseen N-grams.

  • This leads to another question: how do we do this?



Unsmoothed bigrams

  • Recall that we use unigram and bigram counts to compute bigram probabilities:

    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
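
A minimal Python sketch of this estimate may help make it concrete; the toy corpus and the helper name bigram_prob are illustrative, not from the slides:

```python
from collections import Counter

# Hypothetical toy corpus; <s> marks the sentence start (see the next slide).
tokens = "<s> i want to eat chinese food".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Unsmoothed MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("to", "eat"))  # 1.0 in this tiny corpus
```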



How many bigrams are in a text?

  • Suppose a text has N words; how many bigram tokens does it contain?

  • At most N: we assume <s> appears before the first word, so that we can assign a bigram probability to the word in the initial position.

  • Example (5 words):

    • words: <s> w_1 w_2 w_3 w_4 w_5

    • bigrams: <s> w_1, w_1 w_2, w_2 w_3, w_3 w_4, w_4 w_5
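
The same counting argument in a few lines of Python (variable names illustrative): prepending <s> to the 5-word text above yields exactly 5 bigram tokens.

```python
# The five-word example from this slide, with <s> prepended.
words = ["<s>", "w1", "w2", "w3", "w4", "w5"]

bigrams = list(zip(words, words[1:]))
print(bigrams)       # [('<s>', 'w1'), ('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4'), ('w4', 'w5')]
print(len(bigrams))  # 5 bigram tokens for a 5-word text
```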



How many possible bigrams are there?

  • With a vocabulary of V word types, there are V² possible bigrams.
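
  • As a worked example (using the corpus described on the next slide): a vocabulary of V = 1616 word types allows 1616² = 2,611,456 possible bigrams, while a corpus of roughly 10,000 sentences supplies far fewer bigram tokens than that, so most possible bigrams are never observed.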



Example description

  • Berkeley Restaurant Project corpus

    • approximately 10,000 sentences

    • 1616 word types

    • tables will show counts or probabilities for 7 word types, carefully chosen so that the 7-by-7 matrix is not too sparse

    • notice that many counts in the first table are zero (25 of the 49 entries)



Unsmoothed N-grams

Bigram counts (figure 6.4 from text)



Computing probabilities

  • Recall formula (we normalize by unigram counts):

    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

  • Unigram counts come from the same corpus; the two we need below are c(to) = 3256 and c(eat) = 938.

p( eat | to ) = c( to eat ) / c( to ) = 860 / 3256 = .26

p( to | eat ) = c( eat to ) / c(eat) = 2 / 938 = .0021



Unsmoothed N-grams

Bigram probabilities (figure 6.5 from text): p(w_n | w_{n-1})



What do zeros mean?

  • Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus.

  • So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?



Let’s rephrase the question

  • How can we ensure that none of the possible bigrams have zero counts/probabilities?

  • The process of spreading probability mass around to all possible bigrams is called smoothing.

  • We start with a very simple method, called add-one smoothing.



Add-one smoothing

  • Basic idea: add one to actual counts, across the board.

  • This ensures that there are no unigrams or bigrams with zero counts.

  • Typically this adds too much probability mass to non-occurring bigrams.



Add-one smoothing: computing the probabilities

  • Unadjusted probabilities:

    • P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

  • Adjusted probabilities:

    • P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]

  • Notes

    • V is the total number of word types in the vocabulary

    • In the numerator we add one to the count of each bigram, just as we add one to each unigram count.

    • In the denominator we add V, since we are adding one more bigram token of the form w_{n-1}w for each word w in our vocabulary.
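
A short Python sketch of the adjusted estimate (the function name is illustrative; the counts plugged in below are the ones that appear on the "Back to the probabilities" slide):

```python
def add_one_prob(bigram_count, prev_unigram_count, vocab_size):
    """Add-one (Laplace) smoothed estimate:
    P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_count + 1) / (prev_unigram_count + vocab_size)

V = 1616                            # word types in the Berkeley Restaurant corpus
print(add_one_prob(860, 3256, V))   # p*(eat | to) = 861 / 4872 ~ 0.18
print(add_one_prob(2, 938, V))      # p*(to | eat) = 3 / 2554  ~ 0.0012
```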



A simple approach to smoothing: Add-one smoothing

Add-one smoothed bigram counts (figure 6.6 from text)



Back to the probabilities

  • Recall the formula for the adjusted probabilities:

    • P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]

  • Unigram counts (adjusted by adding V=1616):

p( eat | to ) = c( to eat ) / c( to ) = 861 / 4872 = .18 (was .26)

p( to | eat ) = c( eat to ) / c( eat ) = 3 / 2554 = .0012 (was .0021)

p( eat | lunch ) = c( lunch eat ) / c( lunch ) = 1 / 2075 = .00048 (was 0)

p( eat | want ) = c( want eat ) / c( want ) = 1 / 2931 = .00034 (was 0)



A simple approach to smoothing: Add-one smoothing

Add-one smoothed bigram probabilities (figure 6.7 from text)



Discounting

  • We define the discount to be the ratio of the new count to the old count (in our case, the smoothed count to the unsmoothed count).

  • Discounts for add-one smoothing for this example:



What does the discount tell us?

  • The discount tells us where the probability mass is coming from.

  • "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]



Witten-Bell discounting

  • Another approach to smoothing

  • Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211]

  • “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of “first-time” N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]



How much probability mass can we reassign?

  • Total probability mass assigned to all (as yet) unseen n-grams is T / [ T + N ], where

    • T is the total number of observed types (not vocabulary size)

    • N is the number of tokens

  • “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211]

  • Formula above estimates “the probability of a new type event occurring.” [p. 211]



Distribution of probability mass

  • This probability mass is distributed evenly amongst the unseen n-grams.

  • Z = number of zero-count n-grams.

  • p_i* = [ T / (N+T) ] / Z = T / (Z(N+T))
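
A purely illustrative numeric check of these formulas: a corpus with N = 1000 tokens and T = 300 observed types reserves T/(N+T) = 300/1300 ≈ 0.23 of the probability mass for unseen events; if Z = 700 types remain unseen, each receives p_i* = 300 / (700 · 1300) ≈ 0.00033.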



Discounting (unigram case)

  • This probability mass has to come from somewhere.

  • Recall that the unsmoothed probability is p_i = c_i / N (N is the number of tokens)

  • The smoothed (discounted) probability for non-zero-count unigrams is p_i* = c_i / (N + T)

  • We can also give smoothed counts (Z is the number of unigram types with zero counts):

    • For zero-count unigrams: c_i* = (T/Z) · N/(N+T)

    • For non-zero-count unigrams: c_i* = c_i · N/(N+T)
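
A compact Python sketch of the unigram case (the corpus and vocabulary below are hypothetical, and the code assumes at least one zero-count type, i.e. Z > 0):

```python
from collections import Counter

def witten_bell_unigram(tokens, vocabulary):
    """Witten-Bell smoothed unigram probabilities (sketch).

    Seen types:   p_i* = c_i / (N + T)
    Unseen types: p_i* = T / (Z * (N + T)), shared evenly (requires Z > 0).
    """
    counts = Counter(tokens)
    N = len(tokens)             # number of tokens
    T = len(counts)             # number of observed types
    Z = len(vocabulary) - T     # number of zero-count types
    return {w: counts[w] / (N + T) if counts[w] > 0 else T / (Z * (N + T))
            for w in vocabulary}

vocab = {"i", "want", "to", "eat", "chinese", "food", "lunch"}
p = witten_bell_unigram("i want to eat chinese food".split(), vocab)
print(sum(p.values()))  # 1.0 -- seen mass N/(N+T) plus unseen mass T/(N+T)
```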



Discounting (bigram case)

  • Total probability mass being redistributed (conditioned on first word of bigram):

    Σ_{i : c(w_x w_i) = 0} p*(w_i | w_x) = T(w_x) / (N(w_x) + T(w_x))

  • Distribute this probability mass to the unseen bigrams.

  • Z(w_x) is the number of bigram types with the given first word w_x that have zero counts.

  • The smoothed (discounted) probability is:

    • p*(w_i | w_x) = T(w_x) / ( Z(w_x) (N(w_x) + T(w_x)) )   if c(w_x w_i) = 0

    • p*(w_i | w_x) = c(w_x w_i) / ( c(w_x) + T(w_x) )        if c(w_x w_i) > 0
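
The same two-case formula as a Python sketch (names illustrative; it assumes w_x was seen and has at least one unseen continuation, so N(w_x) > 0 and Z(w_x) > 0; N(w_x), the number of bigram tokens beginning with w_x, is taken to equal the unigram count c(w_x)):

```python
from collections import Counter

def witten_bell_bigram(tokens, vocabulary):
    """Return a function prob(w_i, w_x) giving p*(w_i | w_x) under Witten-Bell smoothing."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # T(w_x): number of distinct bigram types beginning with w_x
    types_after = Counter(prev for (prev, _nxt) in bigrams)

    def prob(w_i, w_x):
        n_x = unigrams[w_x]              # N(w_x), approximated by c(w_x)
        t_x = types_after[w_x]
        z_x = len(vocabulary) - t_x      # Z(w_x): unseen continuation types
        c = bigrams[(w_x, w_i)]
        if c > 0:
            return c / (n_x + t_x)
        return t_x / (z_x * (n_x + t_x))

    return prob

p = witten_bell_bigram("<s> i want to eat chinese food".split(),
                       {"<s>", "i", "want", "to", "eat", "chinese", "food", "lunch"})
print(p("eat", "to"))    # seen bigram:   1 / (1 + 1) = 0.5
print(p("lunch", "to"))  # unseen bigram: 1 / (7 * (1 + 1)) ~ 0.071
```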



Witten-Bell: discounted counts

Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text)

Notice that counts which were 0 unsmoothed are <1 smoothed; contrast with Add-One Smoothing.



Discount comparison

  • The table shows the discounts for add-one and Witten-Bell smoothing for this example:



Training sets and test sets

  • The corpus is divided into a training set and a test set

  • Test items must not appear in the training set; otherwise they will receive artificially high probabilities

  • Can use this to evaluate different systems:

    • train two different systems on the same training set

    • compare performance of systems on the same test set
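
A minimal sketch of such a setup (the 90/10 split and the comparison by total log probability are illustrative choices, not prescribed by the slides):

```python
import random

def train_test_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle the corpus and hold out a fraction of sentences as the test set."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Train both systems on `train`, then compare them by the probability
# (e.g. total log probability or perplexity) each assigns to the same `test` set.
```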

