

Language Modeling with N-Grams



Language Modeling

  • A Language Model is a probabilistic model that allows us to compute the probability of a sentence.

    • Let w1:n denote the word sequence w1w2…wn.

    • What is the probability P(w1:n)?



Why Language Modeling?

  • Determining which sequence of words is more likely

  • Predicting the next word given the previous words

  • Shannon game:

    • guessing the next letter given previous letters.

  • Applications in

    • Speech Recognition

    • Machine Translation

    • Context sensitive spelling check



Language Modeling in Speech Recognition

  • Some sequences of words sound alike, but not all of them are good English sentences.

    • I went to a party

    • Eye went two a bar tea

Rudolph the red nose reindeer.

Rudolph the Red knows rain, dear.

Rudolph the Red Nose reigned here.



Language Modeling in Machine Translation

  • Given a French sentence

    • On voit Jon à la télévision

  • And several possible English translations:

    • Jon appeared in TV.

    • In Jon appeared TV.

    • Jon appeared on TV.

  • Which one is more likely to be correct?



Context Sensitive Spelling

  • Which is most probable?

    • … I think they’re okay …

    • … I think there okay …

    • … I think their okay …

  • Which is most probable?

    • … by the way, are they’re likely to …

    • … by the way, are there likely to …

    • … by the way, are their likely to …



Axioms of Probability Theory

  • Suppose P(.) is a probability function, then

    1. For any event E, 0 ≤ P(E) ≤ 1.

    2. P(S) = 1, where S is the sample space.

    3. For any two mutually exclusive events E1 and E2,

    P(E1 ∪ E2) = P(E1) + P(E2).

  • Any function that satisfies the above three axioms is a probability function.



Properties of Probability

1. P(¬E) = 1 − P(E)

2. If E1 and E2 are logically equivalent, then

P(E1) = P(E2).

  • E1: Not all philosophers are more than six feet tall.

  • E2: Some philosopher is not more than six feet tall.

    Then P(E1) = P(E2).

3. P(E1, E2) ≤ P(E1).
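For example, the first property follows in one line from axioms 2 and 3 (a quick derivation, not on the original slide):

```latex
E \cup \lnot E = S,\quad E \cap \lnot E = \varnothing
\;\Longrightarrow\; 1 = P(S) = P(E \cup \lnot E) = P(E) + P(\lnot E)
\;\Longrightarrow\; P(\lnot E) = 1 - P(E).
```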



Conditional Probability

  • The probability of an event may change after knowing another event.

    The probability of A given B is denoted by P(A|B).

  • Example

    • P(W=space): the probability that a randomly selected word from an English text is ‘space’

    • P(W=space | W′=outer): the probability of ‘space’ given that the previous word is ‘outer’



Chain Rule and Bayes Theorem

  • Chain Rule:

    P(A, B)=P(A)P(B|A)

  • Bayes Theorem

    If P(E2)>0, then P(E1|E2)=P(E2|E1)P(E1)/P(E2)

    This can be derived from the definition of conditional probability.
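For completeness, the one-line derivation the slide alludes to:

```latex
P(E_1 \mid E_2) \;=\; \frac{P(E_1, E_2)}{P(E_2)} \;=\; \frac{P(E_2 \mid E_1)\,P(E_1)}{P(E_2)}, \qquad P(E_2) > 0 .
```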



The n-gram Language Model

Using the Chain Rule: P(A,B)=P(A)P(B|A)

P(w1:n)

= P(w1:n-1) P(wn | w1:n-1)

= P(w1:n-2) P(wn-1 | w1:n-2) P(wn | w1:n-1)

= P(w1:n-3) P(wn-2 | w1:n-3) P(wn-1 | w1:n-2) P(wn | w1:n-1)

= P(w1) P(w2 | w1) P(w3 | w1:2) P(w4 | w1:3) … P(wn-1 | w1:n-2) P(wn | w1:n-1)

Can we compute P(w1:n) in the reverse order?



Markov Assumption

  • w1:n-1 is called the history of wn

    • Sue swallowed the large green ______.

  • The statistics for the complete history are very sparse.

  • Markov Assumption: only the closest N − 1 previous words are relevant:

    P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)

    • Bigram: only the previous word matters

    • Trigram: only the previous two words matter

  • Therefore

    P(w1:n) ≈ ∏k=1..n P(wk | wk-N+1:k-1)



Examples:

  • Without Markov Assumption:

    • P(I went to a party) = ?

  • With Markov Assumption (n=3)

    • P(I went to a party) = ?

  • With Markov Assumption (n=2)

    • P(I went to a party) = ?

  • What does n=1 mean?
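One way to write out the answers: the factorizations below are exact except where the Markov assumption truncates the history, and the conditional probabilities would still have to be estimated from data. With n = 1 each word is generated independently of its history.

```latex
\begin{aligned}
\text{no assumption: } & P(\text{I})\,P(\text{went}\mid \text{I})\,P(\text{to}\mid \text{I went})\,P(\text{a}\mid \text{I went to})\,P(\text{party}\mid \text{I went to a})\\
n=3:\; & P(\text{I})\,P(\text{went}\mid \text{I})\,P(\text{to}\mid \text{I went})\,P(\text{a}\mid \text{went to})\,P(\text{party}\mid \text{to a})\\
n=2:\; & P(\text{I})\,P(\text{went}\mid \text{I})\,P(\text{to}\mid \text{went})\,P(\text{a}\mid \text{to})\,P(\text{party}\mid \text{a})\\
n=1:\; & P(\text{I})\,P(\text{went})\,P(\text{to})\,P(\text{a})\,P(\text{party})
\end{aligned}
```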



Parameters in N-gram Models

  • Suppose there are 20,000 words

    • very conservative assumption

  • Parameters

    • Bigram: 20,000 × 19,999 ≈ 400 million

    • Trigram: 20,000² × 19,999 ≈ 8 trillion

    • 4-gram: 20,000³ × 19,999 ≈ 1.6 × 10¹⁷

  • Reliability vs. Relevance

    • As N increases, the n-gram becomes more relevant (it uses more context) but less reliable (its counts are sparser).
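A quick check of the arithmetic (a throwaway Python snippet; the 20,000-word vocabulary is the slide's assumption):

```python
V = 20_000  # assumed vocabulary size from the slide
for n in (2, 3, 4):
    # V**(n-1) possible histories, each with V-1 free probabilities
    params = V ** (n - 1) * (V - 1)
    print(f"{n}-gram: {params:.2e} free parameters")
# 2-gram: 4.00e+08, 3-gram: 8.00e+12, 4-gram: 1.60e+17
```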



Estimation of Probability

  • P(wn | w1:n-1) = P(w1:n)/P(w1:n-1)

  • Probabilities (subjective or objective) exist independently of data.

  • However, probabilities have to be estimated from data.

  • Maximum Likelihood Estimation

    • PMLE(wn | w1:n-1) = C(w1:n) / C(w1:n-1)



Maximum Likelihood Estimation

  • MLE assigns the highest probability to the training data.

  • Example:

    • training corpus: <s> a b a b </s>

    • MLE: P(a|b) = ½, P(b|a) = 1, P(a|<s>) = 1, P(</s>|b) = ½, so P(corpus) = 1 · 1 · ½ · 1 · ½ = ¼.

  • MLE is not suitable for NLP

    • MLE assigns 0 probability to unseen events.

    • In one experiment, 23% of trigrams in new text had not been seen in the first 1.5M words of training data.
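The toy corpus above can be checked in a few lines of Python (a sketch; mle_bigram is a hypothetical helper, and the corpus is the slide's <s> a b a b </s>):

```python
import math
from collections import Counter

def mle_bigram(tokens):
    """MLE bigram estimates: P(w | h) = C(h, w) / C(h)."""
    history_counts = Counter(tokens[:-1])             # the last token is never a history
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return lambda w, h: bigram_counts[(h, w)] / history_counts[h]

corpus = "<s> a b a b </s>".split()
p = mle_bigram(corpus)
print(p("a", "b"), p("b", "a"), p("a", "<s>"), p("</s>", "b"))   # 0.5 1.0 1.0 0.5
print(math.prod(p(w, h) for h, w in zip(corpus, corpus[1:])))    # 0.25
```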


How to Estimate

p(z | xy) = ?

Suppose our training data includes

… xya …

… xyd …

… xyd …

but never xyz

Should we conclude p(a | xy) = 1/3? p(d | xy) = 2/3? p(z | xy) = 0/3?

NO! Absence of xyz might just be bad luck.



Smoothing the Estimates

  • Should we conclude

    • p(a | xy) = 1/3? reduce this

    • p(d | xy) = 2/3? reduce this

    • p(z | xy) = 0/3? increase this

  • Discount the positive counts somewhat

  • Reallocate that probability to the zeroes



  • Especially if the denominator is small …

    • 1/3 probably too high, 100/300 probably about right

  • Especially if numerator is small …

    • 1/300 probably too high, 100/30000 probably about right



Dealing with 0 Probability

  • Back-off

    • If the frequency count of an N-gram is 0, use the (N−1)-gram instead (see the sketch below).

  • Smoothing

    • Mix the MLE with another probability distribution that is guaranteed not to assign 0 probability.
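A minimal sketch of the back-off idea in Python (illustrative only; backoff_prob and the counts structure are hypothetical, and a real back-off model such as Katz back-off also discounts and renormalizes the reallocated mass):

```python
def backoff_prob(w, history, counts):
    """Naive back-off: if the full history gives a zero count, retry with a shorter history.

    counts maps a history tuple h to a dict {word: count}; the empty tuple ()
    holds unigram counts. This is an unnormalized illustration, not a full model.
    """
    while history:
        seen = counts.get(history, {})
        total = sum(seen.values())
        if total > 0 and seen.get(w, 0) > 0:
            return seen[w] / total
        history = history[1:]            # drop the oldest word and back off
    unigrams = counts.get((), {})
    return unigrams.get(w, 0) / max(sum(unigrams.values()), 1)
```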


Count Tables from a Sample Corpus (excerpts)

Unigram counts (total tokens: 438,699):

… DNS 298, DNS/WINS 2, dns1.isp.net 1, dnsadmin.exe 2, DNSName 1, DNSServer 1, do 384, … NT 3313, … pertinent 2, pervasiveness 1, Ph33r 3, phase 24, phased 1, … phone 60, Phonebook 23, phrase 9, phrases 2, physical 123, PhysicalDisk 1, …

Bigram counts for the history “do” (C(do) = 384):

… anything 2, approach 1, … for 5, have 5, I 1, If 4, … Link 1, list 1, … no 1, not 97, Novell 1, offer 1, … workitem 1, you 7, your 1

Trigram counts for the history “they do” (C(they, do) = 22):

. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

Courtesy of Patrick Pantel


Maximum Likelihood Estimates from the Count Tables

Using the counts above: C(they, do, not) = 7, C(they, do) = 22, C(do, not) = 97, C(do) = 384.

PMLE(not | they, do) = 7/22 = 0.318

PMLE(not | do) = 97/384 = 0.253

PMLE(offer | they, do) = 0/22 = 0

PMLE(have | they, do) = 2/22 = 0.091

Courtesy of Patrick Pantel



Add-One Smoothing

  • V is the number of types we might see

    • the vocabulary size (unique words)

  • Add-One Smoothing (+1):

    P+1(wn | wn-N+1:n-1) = (C(wn-N+1:n) + 1) / (C(wn-N+1:n-1) + V)

  • Too much mass is reserved for 0-frequency N-grams

    • the value “1” added to every count is an arbitrary choice

Courtesy of Patrick Pantel


Add-One Smoothing Example

Vocabulary size V = 10,543. Counts for the history “they do” (C(they, do) = 22):

. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

P+1(not | they, do) = (7 + 1) / (22 + 10,543) ≈ 0.00076

P+1(offer | they, do) = (0 + 1) / (22 + 10,543) ≈ 0.000095

P+1(have | they, do) = (2 + 1) / (22 + 10,543) ≈ 0.00028

Courtesy of Patrick Pantel
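These values can be reproduced directly from the counts (a small Python check; p_add1 is a hypothetical helper applying the add-one formula above):

```python
V = 10_543                      # vocabulary size from the slide
c_they_do = 22                  # C(they, do)
continuations = {".": 1, "approach": 1, "have": 2, "Link": 1, "not": 7,
                 "on": 3, "open": 1, "so": 1, "under": 5}

def p_add1(w):
    """Add-one estimate of P(w | they, do)."""
    return (continuations.get(w, 0) + 1) / (c_they_do + V)

print(f"{p_add1('not'):.6f}")    # 0.000757
print(f"{p_add1('offer'):.6f}")  # 0.000095
print(f"{p_add1('have'):.6f}")   # 0.000284
```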



Witten-Bell Discounting (unigrams)

  • T is the number of types in the training corpus (T < V)

  • N is the number of tokens in the training corpus

  • Idea: use the count of first-time events to estimate unseen events

    • each of the T types was seen for the first time exactly once, so T of the observed events were “new words”

Courtesy of Patrick Pantel



Witten-Bell Discounting (unigrams)

  • Total mass reserved for all 0-frequency words: T / (N + T)

  • Where does this mass come from?

    • from the seen words, each of which now gets C(w) / (N + T) instead of the MLE C(w) / N

  • Z = number of 0-frequency words = V − T

    • each 0-frequency word gets probability T / (Z · (N + T))

Courtesy of Patrick Pantel



Witten-Bell Discounting (N-grams)

  • Condition T, N and Z on N-gram context

    • unseen N-gram estimate is specific to a word history (context)

  • T(w1:n-1) is the number of N-gram types seen with the given context

  • N(w1:n-1) is the number of N-gram tokens seen with the given context

  • Z(w1:n-1) is the number of 0-frequency N-grams with the given context

Courtesy of Patrick Pantel


Counting N and T for Different Contexts

From the same count tables:

N(they, do) = 22, T(they, do) = 9

N(do) = 384, T(do) = 81

N() = 438,699, T() = 10,543

Courtesy of Patrick Pantel



Witten-Bell Discounting (N-grams)

  • For N-grams with non-zero frequency:

    PWB(wn | w1:n-1) = C(w1:n) / (N(w1:n-1) + T(w1:n-1))

  • Mass reserved for 0-frequency N-grams with the context w1:n-1:

    T(w1:n-1) / (N(w1:n-1) + T(w1:n-1))

  • For 0-frequency N-grams:

    PWB(wn | w1:n-1) = T(w1:n-1) / (Z(w1:n-1) · (N(w1:n-1) + T(w1:n-1)))

Courtesy of Patrick Pantel


Witten-Bell Example

Counts for the history “they do” (N(they, do) = 22, T(they, do) = 9):

. 1, approach 1, have 2, Link 1, not 7, on 3, open 1, so 1, under 5

Total N-gram types in the corpus: unigrams 10,543; bigrams 114,707; trigrams 256,844

PWB(not | they, do) = 7 / (22 + 9) ≈ 0.226

PWB(have | they, do) = 2 / (22 + 9) ≈ 0.065

PWB(offer | they, do) = 9 / (Z(they, do) · (22 + 9))

Courtesy of Patrick Pantel
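A small Python check of these values (a sketch; it assumes the conditioned Witten-Bell formulas above and, for the zero-frequency case, takes Z(they, do) = V − T(they, do) with V = 10,543, which is one plausible reading of the slide rather than something the deck states explicitly):

```python
N = 22          # N(they, do): tokens observed with this history
T = 9           # T(they, do): distinct words seen after "they do"
V = 10_543      # unigram types in the corpus (from the slides)
Z = V - T       # assumed: words never seen after "they do"

def p_wb(count):
    """Witten-Bell estimate for a continuation of 'they do'."""
    return count / (N + T) if count > 0 else T / (Z * (N + T))

print(round(p_wb(7), 3))   # 0.226  ~ P_WB(not   | they, do)
print(round(p_wb(2), 3))   # 0.065  ~ P_WB(have  | they, do)
print(f"{p_wb(0):.1e}")    # 2.8e-05 ~ P_WB(offer | they, do)
```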



Good-Turing Estimation

  • Adjusted count: fr = (r + 1) · Nr+1 / Nr, where

    • r = C(w1, …, wn)

    • Nr = the number of n-grams that occurred r times

  • This should only be used when r is small (for large r, Nr+1 is often 0).



Example

  • Corpus: a b a b

  • Observed bigrams:

    • b a: 1

    • a b: 2

    • N0=2, N1=1, N2=1, N=3

  • Probability estimations:

    • f0 = N1/N0 = 0.5
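These numbers can be checked in a few lines of Python (a sketch; N_r and adjusted_count are hypothetical names, and the count of possible bigrams assumes the vocabulary is just {a, b}):

```python
from collections import Counter

corpus = "a b a b".split()
bigrams = Counter(zip(corpus, corpus[1:]))     # {('a','b'): 2, ('b','a'): 1}

vocab = set(corpus)
possible = len(vocab) ** 2                     # 4 possible bigram types over {a, b}
N_r = Counter(bigrams.values())                # N1 = 1, N2 = 1
N_r[0] = possible - len(bigrams)               # N0 = 2 (unseen: aa, bb)

def adjusted_count(r):
    """Good-Turing adjusted count f_r = (r + 1) * N_{r+1} / N_r."""
    return (r + 1) * N_r[r + 1] / N_r[r]

print(adjusted_count(0))   # 0.5  (the f0 = N1/N0 value from the slide)
print(adjusted_count(1))   # 2.0
# adjusted_count(2) = 0 because N3 = 0; zeroing an observed count is why
# Good-Turing is normally applied only for small r (or with smoothed N_r).
```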



Backing off

  • Estimate the probability with a linear combination of lower-order estimates, which are less likely to be 0.

  • Simple linear interpolation (trigram case):

    P(wn | wn-2, wn-1) ≈ λ3 PMLE(wn | wn-2, wn-1) + λ2 PMLE(wn | wn-1) + λ1 PMLE(wn),  where λ1 + λ2 + λ3 = 1
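A sketch of simple linear interpolation in Python (interpolated_prob and its estimator arguments are hypothetical names; the weights shown are placeholders that would normally be tuned on held-out data):

```python
def interpolated_prob(w, h2, h1, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates.

    p_tri(w, h2, h1), p_bi(w, h1), p_uni(w) are any estimators (e.g. MLE);
    the lambdas must sum to 1.
    """
    l3, l2, l1 = lambdas
    return l3 * p_tri(w, h2, h1) + l2 * p_bi(w, h1) + l1 * p_uni(w)
```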



Evaluation of Language Model

  • Best method:

    • Use the language model in an application, e.g., spelling check, machine translation, speech recognition, …

  • Perplexity: the language model that assigns the higher probability to the test data is better (it has lower perplexity, defined as PP(w1:n) = P(w1:n)^(−1/n), the inverse probability of the test data normalized by its length).
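For concreteness, one common way to compute perplexity (a sketch; perplexity and sentence_logprob are hypothetical names, and the base-2 formulation is just one convention):

```python
def perplexity(test_sentences, sentence_logprob):
    """Perplexity = 2^(-average log2 probability per token).

    sentence_logprob(tokens) should return log2 P(w1:n) under the model;
    lower perplexity means the model assigns higher probability to the test data.
    """
    total_logprob = 0.0
    total_tokens = 0
    for tokens in test_sentences:
        total_logprob += sentence_logprob(tokens)
        total_tokens += len(tokens)
    return 2 ** (-total_logprob / total_tokens)
```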

