Language Modeling with N-Grams
Language Modeling
  • A Language Model is a probabilistic model that allows us to compute the probability of a sentence.
    • Let w1:n denote the word sequence w1w2…wn.
    • What is the probability P(w1:n)?
Why Language Modeling?
  • Determining which sequence of words is more likely
  • Predicting the next word given the previous words
  • Shannon game:
    • guessing the next letter given previous letters.
  • Applications in
    • Speech Recognition
    • Machine Translation
    • Context sensitive spelling check
Language Modeling in Speech Recognition
  • Some sequences of words sound alike, but not all of them are good English sentences.
    • I went to a party
    • Eye went two a bar tea

Rudolph the red nose reindeer.

Rudolph the Red knows rain, dear.

Rudolph the Red Nose reigned here.

Language Modeling in Machine Translation
  • Given a French sentence
    • On voit Jon à la télévision
  • And several possible English translations:
    • Jon appeared in TV.
    • In Jon appeared TV.
    • Jon appeared on TV.
  • Which one is more likely to be correct?
Context Sensitive Spelling
  • Which is most probable?
    • … I think they’re okay …
    • … I think there okay …
    • … I think their okay …
  • Which is most probable?
    • … by the way, are they’re likely to …
    • … by the way, are there likely to …
    • … by the way, are their likely to …
Axioms of Probability Theory
  • Suppose P(.) is a probability function, then

1. for any event E, 0≤P(E) ≤1.

2. P(S) = 1, where S is the sample space.

3. for any two mutually exclusive events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2)

  • Any function that satisfies the above three axioms is a probability function.
Properties of Probability

1. P(¬E) = 1– P(E)

2. If E1 and E2 are logically equivalent, then

P(E1)=P(E2).

  • E1: Not all philosophers are more than six feet tall.
  • E2: Some philosopher is not more than six feet tall.

Then P(E1)=P(E2).

3. P(E1, E2)≤P(E1).

Conditional Probability
  • The probability of an event may change after knowing another event.

The probability of A given B is denoted by P(A|B) and is defined as P(A|B) = P(A, B) / P(B), provided P(B) > 0.

  • Example
    • P(W=space): the probability that a randomly selected word from an English text is ‘space’
    • P(W=space | W’=outer): the probability of ‘space’ given that the previous word is ‘outer’
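A minimal sketch of how these two probabilities could be estimated from counts; the toy corpus below is made up for illustration and is not from the slides:

    # Estimate P(W=space) and P(W=space | W'=outer) from a toy corpus.
    corpus = "the outer space probe left outer space and returned to the space station".split()

    p_space = sum(1 for w in corpus if w == "space") / len(corpus)

    bigrams = list(zip(corpus, corpus[1:]))
    outer_total = sum(1 for prev, _ in bigrams if prev == "outer")
    outer_space = sum(1 for prev, w in bigrams if prev == "outer" and w == "space")
    p_space_given_outer = outer_space / outer_total

    print(p_space)              # 3/13 ≈ 0.23
    print(p_space_given_outer)  # 2/2 = 1.0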
Chain Rule and Bayes Theorem
  • Chain Rule:

P(A, B)=P(A)P(B|A)

  • Bayes Theorem

If P(E2)>0, then P(E1|E2)=P(E2|E1)P(E1)/P(E2)

This can be derived from the definition of conditional probability.
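Writing out the derivation: by the chain rule, P(E1, E2) = P(E1) P(E2|E1), and equally P(E1, E2) = P(E2) P(E1|E2). Setting the two right-hand sides equal and dividing by P(E2) > 0 gives P(E1|E2) = P(E2|E1) P(E1) / P(E2).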

The n-gram Language Model

Using the Chain Rule: P(A, B) = P(A) P(B|A)

P(w1:n)

= P(w1:n-1) P(wn | w1:n-1)

= P(w1:n-2) P(wn-1 | w1:n-2) P(wn | w1:n-1)

= P(w1:n-3) P(wn-2 | w1:n-3) P(wn-1 | w1:n-2) P(wn | w1:n-1)

= P(w1) P(w2 | w1) P(w3 | w1:2) P(w4 | w1:3) …… P(wn-1 | w1:n-2) P(wn | w1:n-1)

Can we compute P(w1:n) in the reverse order?

Markov Assumption
  • w1:n-1 is called the history of wn
    • Sue swallowed the large green ______.
  • The statistics for the complete history are very sparse.
  • Markov Assumption: only the most recent N-1 words are relevant:

P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)

    • Bigram: only the previous one word matters
    • Trigram: only the previous two words matter
  • Therefore

P(w1:n) ≈ ∏k=1..n P(wk | wk-N+1:k-1)

Examples:
  • Without Markov Assumption:
    • P(I went to a party) = ?
  • With Markov Assumption (n=3)
    • P(I went to a party) = ?
  • With Markov Assumption (n=2)
    • P(I went to a party) = ?
  • What does n=1 mean?
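A small sketch showing what the factorizations above look like for different n; the helper function is illustrative, not code from the slides:

    # Print the conditional factors of P(w1:n) under an n-gram (Markov) assumption.
    def ngram_factors(words, n):
        """Return the factors P(w_k | closest n-1 previous words) as strings."""
        factors = []
        for k, w in enumerate(words):
            history = words[max(0, k - (n - 1)):k]      # only the closest n-1 words
            factors.append(f"P({w} | {' '.join(history)})" if history else f"P({w})")
        return factors

    sentence = "I went to a party".split()
    print(ngram_factors(sentence, n=len(sentence) + 1))  # full history (no Markov assumption)
    print(ngram_factors(sentence, n=3))  # trigram: P(I) P(went|I) P(to|I went) P(a|went to) P(party|to a)
    print(ngram_factors(sentence, n=2))  # bigram:  P(I) P(went|I) P(to|went) P(a|to) P(party|a)
    print(ngram_factors(sentence, n=1))  # unigram: every word is independent of its history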
Parameters in N-gram Models
  • Suppose there are 20,000 words
    • very conservative assumption
  • Parameters
    • Bigram: 20,000 × 19,999 ≈ 400 million
    • Trigram: 20,000² × 19,999 ≈ 8 trillion
    • 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷
  • Reliability vs. Relevance
    • as n increases, n-gram becomes more relevant, but less reliable.
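The parameter counts above can be checked directly: for each of the V^(N-1) histories there are V - 1 free probabilities (the last one is fixed because the distribution must sum to 1):

    # Number of free parameters in an N-gram model over a 20,000-word vocabulary.
    V = 20_000
    for n in (2, 3, 4):                      # bigram, trigram, 4-gram
        params = V ** (n - 1) * (V - 1)      # histories x free probabilities per history
        print(f"{n}-gram: {params:.2e}")     # 4.00e+08, 8.00e+12, 1.60e+17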
Estimation of Probability
  • P(wn | w1:n-1) = P(w1:n) / P(w1:n-1)
  • Probabilities (subjective or objective) exist independently of data.
  • However, probabilities have to be estimated from data.
  • Maximum Likelihood Estimation
    • PMLE(wn | w1:n-1) = C(w1:n) / C(w1:n-1)
Maximum Likelihood Estimation
  • MLE assigns the highest probability to the training data.
  • Example:
    • training corpus: <s> a b a b </s>
    • MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½, P(corpus) = ¼
  • MLE is not suitable for NLP
    • MLE assigns 0 probability to unseen events.
    • One experiment found that 23% of trigrams in new text had not been seen in the first 1.5M words.
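A minimal sketch that reproduces the MLE bigram estimates for the toy corpus above:

    # MLE bigram probabilities for the training corpus "<s> a b a b </s>".
    from collections import Counter

    tokens = "<s> a b a b </s>".split()
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    history_counts = Counter(tokens[:-1])       # how often each word occurs as a history

    def p_mle(word, history):
        return bigram_counts[(history, word)] / history_counts[history]

    print(p_mle("a", "b"), p_mle("b", "a"), p_mle("a", "<s>"), p_mle("</s>", "b"))   # 0.5 1.0 1.0 0.5

    p_corpus = 1.0
    for h, w in bigrams:
        p_corpus *= p_mle(w, h)
    print(p_corpus)                             # 1 * 1 * 0.5 * 1 * 0.5 = 0.25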
How to Estimate
  • p(z | xy) = ?
  • Suppose our training data includes

… xya ..

… xyd …

… xyd …

but never xyz

  • Should we conclude
    • p(a | xy) = 1/3?
    • p(d | xy) = 2/3?
    • p(z | xy) = 0/3?
  • NO! Absence of xyz might just be bad luck.
Smoothing the Estimates
  • Should we conclude
    • p(a | xy) = 1/3? reduce this
    • p(d | xy) = 2/3? reduce this
    • p(z | xy) = 0/3? increase this
  • Discount the positive counts somewhat
  • Reallocate that probability to the zeroes
  • Especially if the denominator is small …
    • 1/3 probably too high, 100/300 probably about right
  • Especially if the numerator is small …
    • 1/300 probably too high, 100/30000 probably about right
Dealing with 0 Probability
  • Back-off
    • If the frequency count of the N-gram is 0, use the (N-1)-gram instead (see the sketch below).
  • Smoothing
    • Mix the MLE with another probability distribution that is guaranteed not to assign 0 probability.
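A minimal sketch of the back-off idea (naive back-off without discounting; a full scheme such as Katz back-off also reserves and redistributes probability mass):

    # If the N-gram count is 0, fall back to the (N-1)-gram estimate.
    def backoff_prob(word, history, ngram_counts, context_counts, total_tokens):
        """history is a tuple of previous words; counts are dicts keyed by word tuples."""
        while history:
            hw = history + (word,)
            if ngram_counts.get(hw, 0) > 0:
                return ngram_counts[hw] / context_counts[history]
            history = history[1:]                            # back off to a shorter history
        return ngram_counts.get((word,), 0) / total_tokens   # unigram estimate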
Example n-gram counts from a corpus:

Unigram counts (total tokens: 438,699):
DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, …, NT 3313, …, pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, …, phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1, …

Bigram counts for words following “do” (C(do) = 384):
anything 2, approach 1, …, for 5, have 5, I 1, If 4, …, Link 1, list 1, …, no 1, not 97, Novell 1, offer 1, …, workitem 1, you 7, your 1

Trigram counts for words following “they do” (C(they, do) = 22):
. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

Courtesy of Patrick Pantel

MLE estimates from the counts above (C(they, do) = 22, C(do) = 384):

C(they, do, not) = 7

C(do, not) = 97

PMLE(not | they, do) = 7/22 = 0.318

PMLE(not | do) = 97/384 = 0.253

PMLE(offer | they, do) = 0/22 = 0

PMLE(have | they, do) = 2/22 = 0.091

Courtesy of Patrick Pantel

Add-One Smoothing
  • V is the number of types we might see
    • the vocabulary size (number of unique words)
  • Add-One Smoothing (+1):
    P+1(wn | wn-N+1:n-1) = (C(wn-N+1:n) + 1) / (C(wn-N+1:n-1) + V)
  • Too much mass is reserved for 0-frequency N-grams
    • the value “1” added to every N-gram count is arbitrary

Courtesy of Patrick Pantel
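A minimal sketch of add-one smoothing, using the “they do” trigram counts from the example tables above and V = 10,543:

    # Add-one (Laplace) smoothing: P+1(w | h) = (C(h, w) + 1) / (C(h) + V).
    V = 10_543
    they_do_counts = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
                      "on": 3, "open": 1, "so": 1, "under": 5}     # C(they, do) = 22

    def p_add_one(word, counts, vocab_size):
        return (counts.get(word, 0) + 1) / (sum(counts.values()) + vocab_size)

    print(p_add_one("not", they_do_counts, V))    # (7+1)/(22+10543)  ≈ 0.00076
    print(p_add_one("offer", they_do_counts, V))  # (0+1)/(22+10543)  ≈ 0.000095
    print(p_add_one("have", they_do_counts, V))   # (2+1)/(22+10543)  ≈ 0.00028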

Add-One smoothing example, using the “they do” trigram counts above:

Vocabulary size V = 10,543; C(they, do) = 22
(. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5)

P+1(not | they, do) = (7 + 1) / (22 + 10,543) ≈ 0.00076

P+1(offer | they, do) = (0 + 1) / (22 + 10,543) ≈ 0.000095

P+1(have | they, do) = (2 + 1) / (22 + 10,543) ≈ 0.00028

Courtesy of Patrick Pantel

Witten-Bell Discounting (unigrams)
  • T is the number of types in the training corpus (T < V)
  • N is the number of tokens in the training corpus
  • Idea: use the count of first-time-seen events to estimate unseen events
    • each of the T types was seen for the first time exactly once, so a “new word” event occurred T times

Courtesy of Patrick Pantel

Witten-Bell Discounting (unigrams)
  • Total mass reserved for all 0-frequency words: T / (N + T)
  • Where does this mass come from?
    • from discounting the seen words: PWB(wi) = C(wi) / (N + T)
  • Z = number of 0-frequency words = V – T
    • each 0-frequency word gets PWB = T / (Z (N + T))

Courtesy of Patrick Pantel

Witten-Bell Discounting (N-grams)
  • Condition T, N and Z on the N-gram context
    • the unseen N-gram estimate is specific to a word history (context) h
  • T(h) is the number of N-gram types seen with the given context
  • N(h) is the number of N-gram tokens with the given context
  • Z(h) is the number of 0-frequency N-grams with the given context

Courtesy of Patrick Pantel

Witten-Bell statistics from the same counts (N(h) = tokens with context h, T(h) = types seen with context h):

N(they, do) = 22, T(they, do) = 9

N(do) = 384, T(do) = 81

N() = 438,699, T() = 10,543

Courtesy of Patrick Pantel

Witten-Bell Discounting (N-grams)
  • For N-grams with non-zero frequency:
    PWB(wi | h) = C(h, wi) / (N(h) + T(h))
  • Mass reserved for all 0-frequency N-grams with context h: T(h) / (N(h) + T(h))
  • For 0-frequency N-grams:
    PWB(wi | h) = T(h) / (Z(h) (N(h) + T(h)))

Courtesy of Patrick Pantel

Witten-Bell estimates for the “they do” context, using T(they, do) = 9 and the counts above
(. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5):

PWB(not | they, do) = ?

PWB(offer | they, do) = ?

PWB(have | they, do) = ?

Total N-gram types in the corpus: unigrams 10,543; bigrams 114,707; trigrams 256,844

Courtesy of Patrick Pantel
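A sketch of the Witten-Bell estimates for the “they do” context, assuming the standard formulation above (seen N-grams get C / (N + T); the reserved mass T / (N + T) is split evenly over the Z unseen words):

    # Witten-Bell discounted estimates for the "they do" context.
    they_do_counts = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
                      "on": 3, "open": 1, "so": 1, "under": 5}
    V = 10_543                            # unigram types (possible next words)
    N = sum(they_do_counts.values())      # 22 tokens with this context
    T = len(they_do_counts)               # 9 types seen with this context
    Z = V - T                             # 0-frequency words for this context

    def p_wb(word):
        if word in they_do_counts:
            return they_do_counts[word] / (N + T)    # discounted seen estimate
        return T / (Z * (N + T))                     # equal share of the reserved mass

    print(p_wb("not"))     # 7/31 ≈ 0.226
    print(p_wb("have"))    # 2/31 ≈ 0.065
    print(p_wb("offer"))   # 9/(10534 * 31) ≈ 2.8e-05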

Good-Turing Estimation
  • Adjusted count: fr = (r + 1) Nr+1 / Nr, and PGT(w1, …, wn) = fr / N
  • where
    • r = C(w1, …, wn)
    • Nr = the number of n-grams that occurred r times
  • This should only be used when r is small.
Example
  • Corpus: a b a b
  • Observed bigrams:
    • b a: 1
    • a b: 2
    • N0 = 2, N1 = 1, N2 = 1, N = 3
  • Probability estimations:
    • f0 = N1 / N0 = 0.5
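A sketch of the Good-Turing adjusted counts for the bigram example above:

    # Good-Turing adjusted counts f_r = (r+1) * N_{r+1} / N_r for the corpus "a b a b".
    from collections import Counter

    tokens = "a b a b".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))    # {('a','b'): 2, ('b','a'): 1}
    N = sum(bigram_counts.values())                     # 3 bigram tokens

    Nr = Counter(bigram_counts.values())                # N1 = 1, N2 = 1
    Nr[0] = len(set(tokens)) ** 2 - len(bigram_counts)  # N0 = 2 unseen bigram types

    def adjusted_count(r):
        return (r + 1) * Nr.get(r + 1, 0) / Nr[r]

    print(adjusted_count(0))        # f0 = 1 * N1 / N0 = 0.5
    print(adjusted_count(0) / N)    # probability of each unseen bigram ≈ 0.17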
Backing off
  • Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0.
  • Simple linear interpolation:
    P(wn | wn-2:n-1) = λ1 PMLE(wn | wn-2:n-1) + λ2 PMLE(wn | wn-1) + λ3 PMLE(wn), where λ1 + λ2 + λ3 = 1
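A minimal sketch of simple linear interpolation with fixed weights; the λ values below are illustrative placeholders (in practice they are tuned on held-out data), and the unigram estimate for “not” is a made-up placeholder because its count is not shown on the slides:

    # Interpolate trigram, bigram and unigram MLE estimates with fixed weights.
    def p_interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        """The lambdas must sum to 1."""
        l1, l2, l3 = lambdas
        return l1 * p_tri + l2 * p_bi + l3 * p_uni

    # With the Pantel counts: PMLE(not|they,do) = 7/22, PMLE(not|do) = 97/384.
    print(p_interpolated(7/22, 97/384, 0.001))   # ≈ 0.267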
Evaluation of Language Models
  • Best method:
    • Use the language model in an application, e.g., spelling check, machine translation, speech recognition, …
  • Perplexity: PP(w1:n) = P(w1:n)^(-1/n), the inverse probability of the test data normalized by its length; the model that assigns the higher probability (and thus the lower perplexity) to the test data is better.
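A sketch of the perplexity computation, assuming a model exposed as a function P(word | history):

    # Perplexity of test data: PP(w1:n) = P(w1:n) ** (-1/n), computed in log space.
    import math

    def perplexity(test_tokens, cond_prob):
        """cond_prob(word, history) -> P(word | history), history = list of previous words."""
        log_prob = sum(math.log(cond_prob(w, test_tokens[:k])) for k, w in enumerate(test_tokens))
        return math.exp(-log_prob / len(test_tokens))

    # Sanity check: a uniform model over a 20,000-word vocabulary has perplexity 20,000.
    print(perplexity("I went to a party".split(), lambda w, h: 1 / 20_000))   # 20000.0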