
Language Modeling with N-Grams

- A Language Model is a probabilistic model that allows us to compute the probability of a sentence.
- Let w1:n denote the word sequence w1w2…wn.
- What is the probability P(w1:n)?

- Determining which sequence of words is more likely
- Predicting the next word given the previous words
- Shannon game: guessing the next letter given the previous letters

- Applications in
- Speech Recognition
- Machine Translation
- Context-sensitive spelling check

- Some sequences of words sound alike, but not all of them are good English sentences.
- I went to a party
- Eye went two a bar tea

Rudolph the red nose reindeer.

Rudolph the Red knows rain, dear.

Rudolph the Red Nose reigned here.

- Given a French sentence
- On voit Jon à la télévision

- And several possible English translations:
- Jon appeared in TV.
- In Jon appeared TV.
- Jon appeared on TV.

- Which one is more likely to be correct?

- Which is most probable?
- … I think they’re okay …
- … I think there okay …
- … I think their okay …

- Which is most probable?
- … by the way, are they’re likely to …
- … by the way, are there likely to …
- … by the way, are their likely to …

- Suppose P(.) is a probability function, then
1. For any event E, 0 ≤ P(E) ≤ 1.

2. P(S) = 1, where S is the sample space.

3. For any two mutually exclusive events E1 and E2, P(E1 ∪ E2) = P(E1) + P(E2).

- Any function that satisfies the above three axioms is a probability function.

1. P(¬E) = 1 − P(E)

2. If E1 and E2 are logically equivalent, then P(E1) = P(E2).

- E1: Not all philosophers are more than six feet tall.
- E2: Some philosopher is not more than six feet tall.
Then P(E1) = P(E2).

3. P(E1, E2) ≤ P(E1).

- The probability of an event may change after knowing another event.
The probability of A given B is denoted by P(A|B).

- Example
- P( W=space ): the probability that a randomly selected word from an English text is ‘space’
- P( W=space | W’=outer): the probability of ‘space’ given that the previous word is ‘outer’

- Chain Rule:
P(A, B)=P(A)P(B|A)

- Bayes Theorem
If P(E2)>0, then P(E1|E2)=P(E2|E1)P(E1)/P(E2)

This can be derived from the definition of conditional probability.
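
The derivation mentioned above can be written out in three lines. This is a sketch that uses only the definition of conditional probability and the chain rule already stated; the symbols E1 and E2 are the ones used in the theorem.

```latex
\begin{align*}
P(E_1 \mid E_2) &= \frac{P(E_1, E_2)}{P(E_2)}
  && \text{(definition of conditional probability, } P(E_2) > 0\text{)} \\
P(E_1, E_2) &= P(E_1)\,P(E_2 \mid E_1)
  && \text{(chain rule)} \\
\Rightarrow\quad P(E_1 \mid E_2) &= \frac{P(E_2 \mid E_1)\,P(E_1)}{P(E_2)}
\end{align*}
```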

Using the Chain Rule: P(A,B)=P(A)P(B|A)

P(w1:n)

=P(w1:n-1)P(wn|w1:n-1)

= P(w1:n-2)P(wn-1|w1:n-2)P(wn|w1:n-1)

= P(w1:n-3)P(wn-2|w1:n-3)P(wn-1|w1:n-2)P(wn|w1:n-1)

= P(w1)P(w2|w1) P(w3|w1:2) P(w4|w1:3) …… P(wn-1|w1:n-2)P(wn|w1:n-1)

Can we compute P(w1:n) in the reverse order?

- w1:n-1 is called the history of wn
- Sue swallowed the large green ______.

- The statistics for the complete history are very sparse.
- Markov Assumption: only the closest N−1 words are relevant:
P(wn|w1:n-1)≈P(wn|wn-N+1:n-1)

- Bigram: only the previous word matters
- Trigram: only the previous two words matter

- Therefore
P(w1:n) ≈ ∏_{k=1}^{n} P(wk|wk-N+1:k-1)

- Without Markov Assumption:
- P(I went to a party) = ?

- With Markov Assumption (n=3)
- P(I went to a party) = ?

- With Markov Assumption (n=2)
- P(I went to a party) = ?

- What does n=1 mean?
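
One way to answer the questions above is to write out each factorization explicitly. The sketch below assumes that no sentence-boundary markers are used, so the first word is conditioned on nothing; with a start marker, the first factors would condition on it instead.

```latex
\begin{align*}
\text{No Markov assumption:}\quad
  & P(\text{I})\,P(\text{went}\mid\text{I})\,P(\text{to}\mid\text{I went})\,
    P(\text{a}\mid\text{I went to})\,P(\text{party}\mid\text{I went to a}) \\
\text{Trigram } (n=3):\quad
  & P(\text{I})\,P(\text{went}\mid\text{I})\,P(\text{to}\mid\text{I went})\,
    P(\text{a}\mid\text{went to})\,P(\text{party}\mid\text{to a}) \\
\text{Bigram } (n=2):\quad
  & P(\text{I})\,P(\text{went}\mid\text{I})\,P(\text{to}\mid\text{went})\,
    P(\text{a}\mid\text{to})\,P(\text{party}\mid\text{a}) \\
\text{Unigram } (n=1):\quad
  & P(\text{I})\,P(\text{went})\,P(\text{to})\,P(\text{a})\,P(\text{party})
\end{align*}
```

With n=1 no history is used at all: every word is treated as independent of the words before it.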

- Suppose there are 20,000 words
- very conservative assumption

- Parameters
- Bigram: 20,000 × 19,999 ≈ 400M
- Trigram: 20,000² × 19,999 ≈ 8 trillion
- 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷
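
The parameter counts above follow one pattern: each of the V^(N−1) possible histories needs V−1 free probabilities, because the last one is fixed by the requirement that they sum to 1. A minimal sketch of that arithmetic follows; the function name `ngram_free_parameters` is made up for this illustration.

```python
def ngram_free_parameters(vocab_size: int, order: int) -> int:
    """Number of free parameters of an N-gram model of the given order.

    There are vocab_size ** (order - 1) distinct histories, and each history
    has vocab_size - 1 free probabilities (the last is determined because the
    distribution over the next word must sum to 1).
    """
    if vocab_size < 1 or order < 1:
        raise ValueError("vocab_size and order must be positive")
    histories = vocab_size ** (order - 1)
    return histories * (vocab_size - 1)


if __name__ == "__main__":
    # 20,000 words is the vocabulary size assumed on the slide.
    for order in (2, 3, 4):
        count = ngram_free_parameters(20_000, order)
        print(f"{order}-gram: {count:.3e} parameters")
```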

- Reliability vs. Relevance
- As n increases, the n-gram context becomes more relevant, but its estimated statistics become less reliable.

- P(wn | w1:n-1) = P(w1:n)/P(w1:n-1)
- Probabilities (subjective/objective) exist independent of data.
- However, probabilities have to be estimated from data.
- Maximum Likelihood Estimation
- PMLE(wn | w1:n-1) = C(w1:n)/C(w1:n-1)

- MLE assigns the highest probability to the training data.
- Example:
- training corpus: <s> a b a b </s>
- MLE P(a|b)= ½, P(b|a)=1, P(a|<s>)=1, P(</s>|b) = ½, P(corpus) = 1 · 1 · ½ · 1 · ½ = ¼.
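
The bigram estimates in this example can be reproduced by counting. The sketch below is only an illustration of the MLE formula on the toy corpus; the names `p_mle` and `sequence_probability` are invented here, and the code assumes every history it is asked about occurs in the corpus.

```python
from collections import Counter

# Toy training corpus from the slide, with sentence-boundary markers.
corpus = "<s> a b a b </s>".split()

# Count each word as a history (the final </s> is never followed by anything).
history_counts = Counter(corpus[:-1])
# Count adjacent word pairs (previous word, current word).
bigram_counts = Counter(zip(corpus, corpus[1:]))


def p_mle(word: str, prev: str) -> float:
    """MLE bigram estimate: C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]


def sequence_probability(tokens: list[str]) -> float:
    """Product of the bigram estimates along the token sequence."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p_mle(word, prev)
    return prob


if __name__ == "__main__":
    for word, prev in [("a", "b"), ("b", "a"), ("a", "<s>"), ("</s>", "b")]:
        print(f"P({word}|{prev}) = {p_mle(word, prev)}")
    print("P(corpus) =", sequence_probability(corpus))
```

`sequence_probability` multiplies all five bigram factors of the corpus, including the end-of-sentence factor P(</s>|b).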

- MLE is not suitable for NLP
- MLE assigns 0 probability to unseen events.
- One experiment shows that 23% of trigrams were previously unseen after 1.5M words.

p(z | xy) = ?

Suppose our training data includes

… xya …

… xyd …

… xyd …

but never xyz

Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3?

NO! Absence of xyz might just be bad luck.

- Should we conclude
- p(a | xy) = 1/3? reduce this
- p(d | xy) = 2/3? reduce this
- p(z | xy) = 0/3? increase this

- Discount the positive counts somewhat
- Reallocate that probability to the zeroes

- Especially if the denominator is small …
- 1/3 probably too high, 100/300 probably about right

- Especially if numerator is small …
- 1/300 probably too high, 100/30000 probably about right

- Back-off
- If the frequency count of an N-gram is 0, use the (N−1)-gram

- Smoothing
- Mix MLE with another probability distribution that is guaranteed never to assign 0 probability.

Example N-gram counts (excerpt):

- Unigrams (UNIGRAM total: 438699): …, DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, …, NT 3313, …, pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, …, phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1, …
- Words following “do” (do: 384): …, anything 2, approach 1, …, for 5, have 5, I 1, If 4, …, Link 1, list 1, …, no 1, not 97, Novell 1, offer 1, …, workitem 1, you 7, your 1
- Words following “they do” (they,do: 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

Courtesy of Patrick Pantel

C(they,do,not) = 7

C(do,not) = 97

PMLE(not|they,do) = 7/22 = 0.318

PMLE(not|do) = 97/384 = 0.253

PMLE(offer|they,do) = 0/22 = 0

PMLE(have|they,do) = 2/22 = 0.091
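
The zero estimate for “offer” after “they do” is the situation that back-off is meant to handle: when the trigram count is 0, fall back to the bigram estimate. The sketch below is a deliberately naive version using counts from the tables in this deck. It does no discounting, so the resulting values do not form a proper probability distribution (this is not Katz back-off), and the function name `p_backoff` is invented for the illustration.

```python
# Counts copied from the example tables; only the entries needed here.
trigram_counts = {("they", "do", "not"): 7, ("they", "do", "have"): 2}
bigram_counts = {
    ("they", "do"): 22,
    ("do", "not"): 97,
    ("do", "have"): 5,
    ("do", "offer"): 1,
}
unigram_counts = {"do": 384}


def p_backoff(word: str, w1: str, w2: str) -> float:
    """Naive back-off estimate of P(word | w1, w2).

    Use the trigram MLE if the trigram was seen; otherwise use the bigram
    MLE P(word | w2); otherwise return 0.0 (a full model would continue to
    the unigram level and would also discount the higher-order estimates).
    """
    tri = trigram_counts.get((w1, w2, word), 0)
    if tri > 0:
        return tri / bigram_counts[(w1, w2)]
    bi = bigram_counts.get((w2, word), 0)
    if bi > 0:
        return bi / unigram_counts[w2]
    return 0.0


if __name__ == "__main__":
    print("P(not | they, do)   =", p_backoff("not", "they", "do"))
    print("P(offer | they, do) =", p_backoff("offer", "they", "do"))
```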

- V is the number of types we might see
- the vocabulary size (unique words)

- Add-One Smoothing (+1): P+1(wn | w1:n-1) = (C(w1:n) + 1) / (C(w1:n-1) + V)
- Too much mass is reserved for 0-frequency N-grams
- The value “1” added to each N-gram count is picked arbitrarily

Vocabulary Size (V) = 10,543

P+1(not|they,do) = (7 + 1) / (22 + 10,543) ≈ 0.00076

P+1(offer|they,do) = (0 + 1) / (22 + 10,543) ≈ 0.000095

P+1(have|they,do) = (2 + 1) / (22 + 10,543) ≈ 0.00028
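
The add-one estimates for the “they do” context can be checked with a few lines of code. This is a sketch that assumes the usual add-one formula, (count + 1) / (context count + V), together with the counts and vocabulary size shown in this deck; the name `p_add_one` is made up for the example.

```python
# Vocabulary size and "they do" continuation counts from the example tables.
V = 10_543
continuation_counts = {
    ".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
    "on": 3, "open": 1, "so": 1, "under": 5,
}
context_count = sum(continuation_counts.values())  # C(they, do)


def p_add_one(word: str) -> float:
    """Add-one estimate of P(word | they, do)."""
    return (continuation_counts.get(word, 0) + 1) / (context_count + V)


if __name__ == "__main__":
    for word in ("not", "offer", "have"):
        print(f"P+1({word}|they,do) = {p_add_one(word):.6f}")
    # Share of the probability mass that goes to words never seen after "they do".
    unseen_types = V - len(continuation_counts)
    print("mass on unseen words:", unseen_types * p_add_one("offer"))
```

The last line makes the criticism concrete: with these counts, about 99.7% of the probability mass ends up on continuations that were never observed.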

- T is the number of types in the training corpus (T < V)
- N is the number of tokens in the training corpus
- Idea: Use the count of things seen once to estimate unseen events
- each of the T types was seen for the first time exactly once, so “seeing a new type” happened T times

- Total mass reserved for all 0-frequency N-grams is: T / (N + T)
- Where does this mass come from?
- Z = number of 0-frequency words = V – T

- Condition T, N and Z on N-gram context
- unseen N-gram estimate is specific to a word history (context)

- T(w1:n-1) is the number of N-gram types with the given context
- N(w1:n-1) is the number of N-gram tokens with the given context
- Z(w1:n-1) is the number of 0-frequency N-grams with the given context

N(they,do) = 22

N(do) = 384

N() = 438699

T(they,do) = 9

T(do) = 81

T() = 10543

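- For N-grams with non-zero frequency: PWB(wn | w1:n-1) = C(w1:n) / (N(w1:n-1) + T(w1:n-1))
- Mass reserved for 0-frequency N-grams: T(w1:n-1) / (N(w1:n-1) + T(w1:n-1))
- For 0-frequency N-grams: PWB(wn | w1:n-1) = T(w1:n-1) / (Z(w1:n-1) · (N(w1:n-1) + T(w1:n-1)))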

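N(they,do) = 22, T(they,do) = 9, Z(they,do) = V − T(they,do) = 10543 − 9 = 10534

PWB(not|they,do) = C(they,do,not) / (N(they,do) + T(they,do)) = 7 / (22 + 9) ≈ 0.226

PWB(offer|they,do) = T(they,do) / (Z(they,do) · (N(they,do) + T(they,do))) = 9 / (10534 × 31) ≈ 0.0000276

PWB(have|they,do) = C(they,do,have) / (N(they,do) + T(they,do)) = 2 / (22 + 9) ≈ 0.065

Total N-gram types: 1-grams 10543, 2-grams 114707, 3-grams 256844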
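
The quantities N, T and Z for the “they do” context can be turned into estimates as follows. The sketch assumes a common formulation of Witten-Bell discounting: seen continuations get C / (N + T), and the reserved mass T / (N + T) is shared equally among the Z unseen continuations. Whether the original slide used exactly this form is an assumption, and the name `p_witten_bell` is invented here.

```python
# Vocabulary size and "they do" continuation counts from the example tables.
V = 10_543
continuation_counts = {
    ".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
    "on": 3, "open": 1, "so": 1, "under": 5,
}

N = sum(continuation_counts.values())  # tokens seen after "they do"
T = len(continuation_counts)           # distinct types seen after "they do"
Z = V - T                              # types never seen after "they do"


def p_witten_bell(word: str) -> float:
    """Witten-Bell style estimate of P(word | they, do) (assumed formulation)."""
    count = continuation_counts.get(word, 0)
    if count > 0:
        return count / (N + T)
    # Unseen continuation: equal share of the reserved mass T / (N + T).
    return T / (Z * (N + T))


if __name__ == "__main__":
    for word in ("not", "offer", "have"):
        print(f"PWB({word}|they,do) = {p_witten_bell(word):.7f}")
```

Compared with add-one smoothing, only T / (N + T) = 9/31 of the mass is moved to unseen continuations in this context, so seen events keep most of their probability.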

- Good-Turing adjusted count for an n-gram that occurred r times: f_r = (r + 1) · N_{r+1} / N_r
- where
- r = C(w1, …, wn)
- Nr= the number of n-grams that occurred r times

- This should only be used when r is small.

- Corpus: a b a b
- Observed bigrams:
- b a: 1
- a b: 2
- N0=2, N1=1, N2=1, N=3

- Probability estimations:
- f0= N1 /N0 =0.5
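
The counts of counts in this example can be computed mechanically. The sketch below assumes the standard Good-Turing adjusted count, (r + 1) · N_{r+1} / N_r, which agrees with f0 = N1/N0 above; the names `count_of_counts` and `adjusted_count` are made up for the illustration.

```python
from collections import Counter
from itertools import product

tokens = "a b a b".split()
vocab = sorted(set(tokens))

# Observed bigram counts: ('a','b') occurs twice, ('b','a') once.
bigram_counts = Counter(zip(tokens, tokens[1:]))
total_bigrams = sum(bigram_counts.values())  # N on the slide

# N_r: how many of the possible bigrams over the vocabulary occurred r times.
# Looking up a missing key in a Counter returns 0, which gives us N_0.
count_of_counts = Counter(bigram_counts[bg] for bg in product(vocab, repeat=2))


def adjusted_count(r: int) -> float:
    """Good-Turing adjusted count (r + 1) * N_{r+1} / N_r (assumed formula)."""
    if count_of_counts[r] == 0:
        raise ValueError(f"no n-grams occurred {r} times")
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]


if __name__ == "__main__":
    print("N0, N1, N2 =", count_of_counts[0], count_of_counts[1], count_of_counts[2])
    print("N =", total_bigrams)
    print("f0 =", adjusted_count(0))
```

For the largest observed r there is no N_{r+1}, so the adjusted count collapses to 0, which is one reason the estimate is only used for small r.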

- Estimate the probability with a linear combination of lower order estimations which are less likely to be 0.
- Simple linear interpolation
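
Simple linear interpolation can be sketched as a weighted sum of the trigram, bigram and unigram estimates, with non-negative weights that sum to 1. In the example below the function name `interpolate` and the weights are hypothetical, the trigram and bigram estimates use counts from this deck, and the unigram probability of “offer” is a made-up placeholder because the deck does not give that count.

```python
def interpolate(estimates: list[float], weights: list[float]) -> float:
    """Linear interpolation: sum of weight_i * estimate_i.

    `estimates` are ordered from highest to lowest order (e.g. trigram,
    bigram, unigram); `weights` must be non-negative and sum to 1 so the
    result is still a probability.
    """
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))


if __name__ == "__main__":
    p_trigram = 0 / 22    # PMLE(offer | they, do), from the deck's counts
    p_bigram = 1 / 384    # PMLE(offer | do), from the deck's counts
    p_unigram = 1e-4      # placeholder value, NOT taken from the deck
    lambdas = [0.6, 0.3, 0.1]  # hypothetical weights; normally tuned on held-out data
    print(interpolate([p_trigram, p_bigram, p_unigram], lambdas))
```

Even though the trigram estimate is 0, the interpolated value is positive because the lower-order estimates contribute.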

- Best method:
- Use the language model in an application, e.g., spelling check, machine translation, speech recognition, …

- Perplexity: the language model that assigns the higher probability to the test data (and therefore has the lower perplexity) is better.
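
Perplexity is commonly defined as the inverse probability of the test data, normalized by the number of words: P(w1:n)^(−1/n). The sketch below computes it from per-word conditional probabilities in log space; the function name `perplexity` and the two example models are hypothetical.

```python
import math


def perplexity(word_probabilities: list[float]) -> float:
    """Perplexity of a test sequence given each word's conditional probability.

    Computed as exp(-(1/n) * sum(log p_i)), which equals P(w1:n) ** (-1/n).
    A zero probability anywhere makes the perplexity infinite.
    """
    if not word_probabilities:
        raise ValueError("need at least one probability")
    if any(p == 0 for p in word_probabilities):
        return math.inf
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


if __name__ == "__main__":
    # Hypothetical per-word probabilities two models assign to the same test text.
    model_a = [0.25, 0.5, 0.25, 0.5]
    model_b = [0.125, 0.25, 0.125, 0.25]
    print("model A:", perplexity(model_a))
    print("model B:", perplexity(model_b))
```

The model that assigns higher probabilities to the test words (model A here) gets the lower perplexity, and an unsmoothed model that gives any test word probability 0 gets infinite perplexity.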