## Language Modeling with N-Grams

Language Modeling

- A Language Model is a probabilistic model that allows us to compute the probability of a sentence.
- Let w1:n denote the word sequence w1w2…wn.
- What is the probability P(w1:n)?

Why Language Modeling?

- Determine which sequence of words is more likely
- Predict the next word given the previous words
- Shannon game:
  - guessing the next letter given the previous letters
- Applications in
  - Speech Recognition
  - Machine Translation
  - Context-sensitive spelling check

Language Modeling in Speech Recognition

- Some sequences of words sound alike, but not all of them are good English sentences.
- I went to a party
- Eye went two a bar tea

Rudolph the red nose reindeer.

Rudolph the Red knows rain, dear.

Rudolph the Red Nose reigned here.

Language Modeling in Machine Translation

- Given a French sentence
- On voit Jon à la télévision
- And several possible English translations:
- Jon appeared in TV.
- In Jon appeared TV.
- Jon appeared on TV.
- Which one is more likely to be correct?

Context Sensitive Spelling

- Which is most probable?
- … I think they’re okay …
- … I think there okay …
- … I think their okay …
- Which is most probable?
- … by the way, are they’re likely to …
- … by the way, are there likely to …
- … by the way, are their likely to …

Axioms of Probability Theory

- Suppose P(.) is a probability function, then

1. for any event E, 0≤P(E) ≤1.

2. P(S) = 1, where S is the sample space.

3. for any two mutually exclusive events E1 and E2,

P(E1 ∪ E2) = P(E1) + P(E2)

- Any function that satisfies the above three axioms is a probability function.

Properties of Probability

1. P(¬E) = 1– P(E)

2. If E1 and E2 are logically equivalent, then

P(E1)=P(E2).

- E1: Not all philosophers are more than six feet tall.
- E2: Some philosopher is not more than six feet tall.

Then P(E1)=P(E2).

3. P(E1, E2)≤P(E1).

Conditional Probability

- The probability of an event may change after knowing another event.

The probability of A given B is denoted by P(A|B).

- Example
- P(W = space): the probability that a word selected at random from an English text is ‘space’
- P(W = space | W’ = outer): the probability that the word is ‘space’ given that the previous word is ‘outer’

Chain Rule and Bayes Theorem

- Chain Rule:

P(A, B)=P(A)P(B|A)

- Bayes Theorem

If P(E2)>0, then P(E1|E2)=P(E2|E1)P(E1)/P(E2)

This can be derived from the definition of conditional probability.

The n-gram Language Model

Using the Chain Rule: P(A,B)=P(A)P(B|A)

P(w1:n)

=P(w1:n-1)P(wn|w1:n-1)

= P(w1:n-2)P(wn-1|w1:n-2)P(wn|w1:n-1)

= P(w1:n-3)P(wn-2|w1:n-3)P(wn-1|w1:n-2)P(wn|w1:n-1)

= P(w1)P(w2|w1) P(w3|w1:2) P(w4|w1:3) …… P(wn-1|w1:n-2)P(wn|w1:n-1)

Can we compute P(w1:n) in the reverse order?

Markov Assumption

- w1:n-1 is called the history of wn
- Sue swallowed the large green ______.
- The statistics for the complete history are very sparse.
- Markov Assumption: only the closest N-1 words are relevant:

P(wn|w1:n-1)≈P(wn|wn-N+1:n-1)

- Bigram: only the previous one word matters
- Trigram: only the previous two words matter
- Therefore

$P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1:k-1})$

Examples:

- Without Markov Assumption:
  - P(I went to a party) = ?
- With Markov Assumption (N=3)
  - P(I went to a party) = ?
- With Markov Assumption (N=2)
  - P(I went to a party) = ?
- What does N=1 mean?
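The factorizations asked for above can be written out mechanically. The following is a minimal sketch, not part of the original slides: the helper name `ngram_factors` is made up for illustration, and it ignores sentence-boundary markers such as `<s>`, so the first words simply get shorter histories.

```python
def ngram_factors(words, n=None):
    """List the factors of P(words) under an N-gram model.

    n=None means no Markov assumption: each word is conditioned on its
    full history. Otherwise only the previous n-1 words are kept.
    """
    factors = []
    for k, word in enumerate(words):
        start = 0 if n is None else max(0, k - (n - 1))
        history = words[start:k]
        if history:
            factors.append(f"P({word} | {' '.join(history)})")
        else:
            factors.append(f"P({word})")
    return factors


sentence = "I went to a party".split()
for n in (None, 3, 2, 1):
    label = "full history" if n is None else f"N={n}"
    print(f"{label}: " + " ".join(ngram_factors(sentence, n)))
```

With N=1 every history is empty, so the model reduces to a product of unigram probabilities and word order no longer matters.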

Parameters in N-gram Models

- Suppose there are 20,000 words
- very conservative assumption
- Parameters
- Bigram: 20,000 × 19,999 ≈ 400M
- Trigram: 20,000² × 19,999 ≈ 8 trillion
- 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷
- Reliability vs. Relevance
- as n increases, n-gram becomes more relevant, but less reliable.
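The parameter counts above follow one pattern, sketched here as a quick check. The function name `num_parameters` is illustrative, and the reading of the factor 19,999 (each history has V-1 free probabilities because the V probabilities must sum to 1) is an interpretation of the slide rather than something it states.

```python
def num_parameters(vocab_size, n):
    """Free parameters of an N-gram model: V^(N-1) histories times V-1 free values each."""
    return vocab_size ** (n - 1) * (vocab_size - 1)


V = 20_000
for n in (2, 3, 4):
    print(f"{n}-gram: {num_parameters(V, n):.2e}")
```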

Estimation of Probability

- P(wn | w1:n-1) = P(w1:n)/P(w1:n-1)
- Probabilities (subjective/objective) exist independent of data.
- However, probabilities have to be estimated from data.
- Maximum Likelihood Estimation
- PMLE(wn | w1:n-1) = C(w1:n)/C(w1:n-1)

Maximum Likelihood Estimation

- MLE assigns the highest probability to the training data.
- Example:
- training corpus: <s> a b a b </s>
- MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½, so under the bigram model P(corpus) = 1 · 1 · ½ · 1 · ½ = ¼.
- MLE is not suitable for NLP
- MLE assigns 0 probability to unseen events.
- One experiment shows that 23% of trigrams were previously unseen after 1.5M words.
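The toy corpus can be checked with a few lines of code. This is a minimal sketch assuming a bigram model; the function name `p_mle` is chosen here for illustration and does not come from the slides.

```python
from collections import Counter

tokens = "<s> a b a b </s>".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
# Every token except the last one starts exactly one bigram.
history_counts = Counter(tokens[:-1])


def p_mle(word, prev):
    """MLE bigram estimate C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]


# Probability of the whole corpus: product of its five bigram factors.
corpus_prob = 1.0
for prev, word in zip(tokens, tokens[1:]):
    corpus_prob *= p_mle(word, prev)

print(p_mle("a", "b"), p_mle("b", "a"), p_mle("</s>", "b"), corpus_prob)
print(p_mle("b", "b"))  # a bigram never seen in training gets probability 0
```

The last line shows the problem named above: any sentence containing the unseen bigram `b b` would receive probability 0.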

p(z | xy) = ?

Suppose our training data includes

… xya …

… xyd …

… xyd …

but never xyz

Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3?

NO! Absence of xyz might just be bad luck.

Smoothing the Estimates

- Should we conclude
- p(a | xy) = 1/3? reduce this
- p(d | xy) = 2/3? reduce this
- p(z | xy) = 0/3? increase this
- Discount the positive counts somewhat
- Reallocate that probability to the zeroes

- Especially if the denominator is small …
  - 1/3 is probably too high, 100/300 is probably about right
- Especially if the numerator is small …
  - 1/300 is probably too high, 100/30000 is probably about right

Dealing with 0 Probability

- Back-off
  - If the frequency count of an N-gram is 0, use the (N-1)-gram instead
- Smoothing
  - Mix the MLE with another probability distribution that is guaranteed never to give 0 probability.

Example counts (excerpts; "..." marks omitted entries):

- Unigram counts (UNIGRAM 438699): ..., DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, ..., NT 3313, ..., pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, ..., phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1, ...
- Counts of words following "do" (do 384): ..., anything 2, approach 1, ..., for 5, have 5, I 1, If 4, ..., Link 1, list 1, ..., no 1, not 97, Novell 1, offer 1, ..., workitem 1, you 7, your 1
- Counts of words following "they, do" (they,do 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

From these counts:

- C(they,do,not) = 7
- C(do,not) = 97
- PMLE(not|they,do) = 7/22 = 0.318
- PMLE(not|do) = 97/384 = 0.253
- PMLE(offer|they,do) = 0/22 = 0
- PMLE(have|they,do) = 2/22 = 0.091

Courtesy of Patrick Pantel

Add-One Smoothing

- V is the number of types we might see
  - the vocabulary size (unique words)
- Add-One Smoothing (+1): $P_{+1}(w_n \mid w_{1:n-1}) = \dfrac{C(w_{1:n}) + 1}{C(w_{1:n-1}) + V}$
  - Too much mass is reserved for 0-frequency N-grams
  - The value “1” added to each N-gram count is picked arbitrarily

Courtesy of Patrick Pantel

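Vocabulary Size (V) = 10,543

Counts of words following "they, do" (they,do 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

- P+1(not|they,do) = (7 + 1)/(22 + 10543) = 8/10565 ≈ 0.00076
- P+1(offer|they,do) = (0 + 1)/(22 + 10543) = 1/10565 ≈ 0.000095
- P+1(have|they,do) = (2 + 1)/(22 + 10543) = 3/10565 ≈ 0.00028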

Courtesy of Patrick Pantel
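The complaint that add-one smoothing reserves too much mass for unseen N-grams can be made concrete with the "they, do" counts. The sketch below assumes the usual add-one formula (count + 1) / (context count + V) and that every one of the V vocabulary words may follow the context; the names `following`, `p_add_one` and `unseen_mass` are illustrative.

```python
V = 10_543          # vocabulary size from the slide
context_total = 22  # C(they, do)

# Counts of the words observed after "they do"
following = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
             "on": 3, "open": 1, "so": 1, "under": 5}


def p_add_one(word):
    """Add-one estimate (C(they, do, word) + 1) / (C(they, do) + V)."""
    return (following.get(word, 0) + 1) / (context_total + V)


for w in ("not", "offer", "have"):
    print(w, p_add_one(w))

# Total probability given to the V - 9 words never seen after "they do"
unseen_mass = (V - len(following)) * p_add_one("offer")
print("mass on unseen continuations:", unseen_mass)
```

Under these assumptions more than 99% of the probability mass moves to continuations that were never observed, while the estimate for "not" falls far below its MLE of 7/22.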

Witten-Bell Discounting (unigrams)

- T is the number of types in the training corpus (T < V)
- N is the number of tokens in the training corpus
- Idea: Use the count of things seen once to estimate unseen events
- we saw T words once

Courtesy of Patrick Pantel

Witten-Bell Discounting (unigrams)

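- Total mass reserved for all 0-frequency N-grams is: $\dfrac{T}{N + T}$
- Where does this mass come from? Each seen word is discounted from $\dfrac{C(w)}{N}$ to $\dfrac{C(w)}{N + T}$
- Z = number of 0-frequency words = V – T; the reserved mass is shared equally among them, so each gets $\dfrac{T}{Z(N + T)}$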

Courtesy of Patrick Pantel

Witten-Bell Discounting (N-grams)

- Condition T, N and Z on N-gram context
  - the unseen N-gram estimate is specific to a word history (context)
- T(w1:n-1) is the number of N-gram types with the given context
- N(w1:n-1) is the number of N-gram tokens with the given context
- Z(w1:n-1) is the number of 0-frequency N-grams with the given context

Courtesy of Patrick Pantel

Token and type counts read off the example count tables:

| Context | Tokens | Types |
| --- | --- | --- |
| (none) | N() = 438699 | T() = 10543 |
| do | N(do) = 384 | T(do) = 81 |
| they,do | N(they,do) = 22 | T(they,do) = 9 |

Courtesy of Patrick Pantel

Witten-Bell Discounting (N-grams)

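- For N-grams with non-zero frequency: $P_{WB}(w_n \mid w_{1:n-1}) = \dfrac{C(w_{1:n})}{N(w_{1:n-1}) + T(w_{1:n-1})}$
- Mass reserved for 0-frequency N-grams: $\dfrac{T(w_{1:n-1})}{N(w_{1:n-1}) + T(w_{1:n-1})}$
- For 0-frequency N-grams: $P_{WB}(w_n \mid w_{1:n-1}) = \dfrac{T(w_{1:n-1})}{Z(w_{1:n-1})\,\bigl(N(w_{1:n-1}) + T(w_{1:n-1})\bigr)}$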

Courtesy of Patrick Pantel

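T(they,do) = 9

Counts of words following "they, do" (they,do 22): . 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

Total N-gram Types:

| N | Types |
| --- | --- |
| 1 | 10543 |
| 2 | 114707 |
| 3 | 256844 |

To compute: PWB(offer|they,do) and PWB(have|they,do)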

Courtesy of Patrick Pantel
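The two Witten-Bell estimates for the "they, do" context can be sketched as follows. This assumes the standard Witten-Bell formulas: a seen continuation gets C / (N + T), and the reserved mass T / (N + T) is split equally among the Z unseen continuations. It also assumes Z = V - T for this context, with V = 10543 unigram types; the slides do not show how Z was obtained, so the value for "offer" in particular is only as good as that assumption.

```python
V = 10_543   # unigram types, T() on the slide
N_ctx = 22   # N(they, do): trigram tokens that start with "they do"

# Counts of the words observed after "they do"
following = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
             "on": 3, "open": 1, "so": 1, "under": 5}
T_ctx = len(following)  # T(they, do) = 9
Z_ctx = V - T_ctx       # ASSUMPTION: all other vocabulary words count as unseen


def p_witten_bell(word):
    """Witten-Bell estimate of P(word | they, do) under the assumptions above."""
    count = following.get(word, 0)
    if count > 0:
        return count / (N_ctx + T_ctx)
    return T_ctx / (Z_ctx * (N_ctx + T_ctx))


print("have :", p_witten_bell("have"))   # seen continuation
print("offer:", p_witten_bell("offer"))  # unseen continuation
```

Compared with add-one smoothing, the seen continuations keep most of their probability here, because only T / (N + T) = 9/31 of the mass is set aside for unseen words.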

Good-Turing Estimation

- Adjusted count: $r^* = (r + 1)\dfrac{N_{r+1}}{N_r}$, where
  - r = C(w1, …, wn)
  - Nr = the number of n-grams that occurred r times
- This should only be used when r is small.

Example

- Corpus: a b a b
- Observed bigrams:
- b a: 1
- a b: 2
- N0=2, N1=1, N2=1, N=3
- Probability estimations:
- f0= N1 /N0 =0.5
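The counts-of-counts in this example can be reproduced programmatically. The sketch below assumes the usual Good-Turing adjusted count (r + 1) · N(r+1) / N(r) and treats every pair over the two-word vocabulary as a possible bigram; the names `count_of_counts` and `adjusted_count` are illustrative.

```python
from collections import Counter
from itertools import product

tokens = "a b a b".split()
vocab = sorted(set(tokens))

bigram_counts = Counter(zip(tokens, tokens[1:]))
# N_r: how many of the possible bigrams occurred exactly r times
count_of_counts = Counter(bigram_counts[bg] for bg in product(vocab, repeat=2))


def adjusted_count(r):
    """Good-Turing adjusted count (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]


for r in sorted(count_of_counts):
    print(f"r={r}  N_r={count_of_counts[r]}  adjusted={adjusted_count(r)}")
```

For r = 2 the adjusted count comes out as 0 because no bigram occurred three times, which illustrates how unreliable the raw formula becomes once N(r+1) is sparse.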

Backing off

- Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0.
- Simple linear interpolation
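Simple linear interpolation for a trigram model is commonly written as λ1·P(wn) + λ2·P(wn|wn-1) + λ3·P(wn|wn-2,wn-1) with weights that sum to 1. The sketch below assumes that form. The weights and the unigram probability are made-up values for illustration; the bigram and trigram estimates for "offer" are the MLE values implied by the example counts (1/384 and 0/22).

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram estimates.

    The default weights are hypothetical; in practice they would be tuned
    on held-out data.
    """
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


p = interpolate(
    p_unigram=1e-4,    # hypothetical P(offer)
    p_bigram=1 / 384,  # PMLE(offer | do)
    p_trigram=0 / 22,  # PMLE(offer | they, do)
)
print(p)
```

Even though the trigram estimate is 0, the interpolated probability stays positive because the lower-order estimates contribute.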

Evaluation of Language Model

- Best method:
  - Use the language model in an application, e.g., spelling check, machine translation, speech recognition, …
- Perplexity: the language model that assigns the higher probability to the test data is better.
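Perplexity is usually defined as the inverse probability of the test data, normalized by the number of words, so a higher test probability means a lower perplexity. A minimal sketch under that definition, taking the per-word probabilities a model assigned to a test sequence as input (the function name and the example values are illustrative):

```python
import math


def perplexity(probabilities):
    """Perplexity P(w_1..w_n)^(-1/n), computed in log space for stability."""
    if any(p <= 0 for p in probabilities):
        # A zero-probability word makes the whole test sequence impossible.
        return math.inf
    log_prob = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_prob / len(probabilities))


print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform choice among 4 words
print(perplexity([0.5, 0.0, 0.5]))           # unsmoothed model hits an unseen event
```

The second call shows why smoothing matters for evaluation: a single zero estimate drives the perplexity to infinity.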
