Collocations - PowerPoint PPT Presentation

Presentation Transcript
Definition Of Collocation (wrt Corpus Literature)
  • A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]
Word Collocations
  • Collocation
    • Firth: “word is characterized by the company it keeps”; collocations of a given word are statements of the habitual or customary places of that word.
    • non-compositionality of meaning
      • cannot be derived directly from its parts (heavy rain)
    • non-substitutability in context
      • for parts (make a decision)
    • non-modifiability (& non-transformability)
      • kick the yellow bucket; take exceptions to
Collocations
  • Collocations are not necessarily adjacent
  • Collocations cannot be directly translated into other languages.
Example Classes
  • Names
  • Technical Terms
  • “Light” Verb Constructions
  • Phrasal verbs
  • Noun Phrases
Linguistic Subclasses of Collocations
  • Light verbs: verbs with little semantic content like make, take, do
  • Terminological Expressions: concepts and objects in technical domains (e.g., hard drive)
  • Idioms: fixed phrases
      • kick the bucket, birds-of-a-feather, run for office
  • Proper names: difficult to recognize even with lists
      • Tuesday (person’s name), May, Winston Churchill, IBM, Inc.
  • Numerical expressions
    • containing “ordinary” words
      • Monday Oct 04 1999, two thousand seven hundred fifty
  • Verb particle constructions or Phrasal Verbs
    • Separable parts:
      • look up, take off, tell off
Collocation Detection Techniques
  • Selection of Collocations by Frequency
  • Selection of Collocations based on the Mean and Variance of the distance between the focal word and the collocating word
  • Hypothesis Testing
  • Pointwise Mutual Information
Frequency
  • Technique:
    • Count the number of times a bigram co-occurs
    • Extract top counts and report them as candidates
  • Results:
    • Corpus: New York Times
      • August – November, 1990
    • Extremely uninteresting: the top bigrams are almost all pairs of function words (e.g., “of the”, “in the”)
Frequency with Tag Filters Technique
  • Technique:
    • Count the number of times a bigram co-occurs
    • Tag candidates for POS
    • Pass all candidates through a part-of-speech filter, keeping only those that match a likely collocation pattern (e.g., adjective noun, noun noun)
    • Extract the top counts and report them as candidates (see the sketch below)
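A minimal sketch of this frequency-plus-tag-filter technique in Python, assuming NLTK's tokenizer and POS tagger are available; the corpus text and the two tag patterns are illustrative choices, not something prescribed by the slides:

```python
from collections import Counter
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are installed

def frequent_collocations(text, top_n=10):
    # Tokenize and POS-tag the corpus, then count adjacent word pairs
    # whose tag pattern looks like a phrase (adjective+noun or noun+noun).
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    patterns = {("JJ", "NN"), ("NN", "NN")}  # assumed filter patterns
    counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1[:2], t2[:2]) in patterns
    )
    return counts.most_common(top_n)

# Example usage (hypothetical corpus file):
# print(frequent_collocations(open("nyt_1990.txt").read()))
```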
Mean and Variance (Smadja et al., 1993)
  • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example,
    • Knock and door may not occur at a fixed distance from each other
  • One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.
Example: Knock and Door
  • She knocked on his door.
  • They knocked at the door.
  • 100 women knocked on the big red door.
  • A man knocked on the metal front door.
  • Average offset between knock and door:

(3 + 3 + 5 + 5)/ 4 = 4

  • Sample variance:

((3-4)² + (3-4)² + (5-4)² + (5-4)²)/(4-1) = 4/3 ≈ 1.33, so the sample standard deviation is √(4/3) ≈ 1.15

Mean and Variance
  • Technique (bigram at distance)
    • Produce all possible pairs in a window
    • Consider all pairs in window as candidates
    • Keep data about the distance of one word from the other
    • Count the number of times each candidate occurs
  • Measures:
    • Mean: average offset (possibly negative)
      • Indicates whether the two words are related, and at what typical distance
    • Standard deviation σ of the offset
      • Variability in the relative position of the two words: a low σ means they almost always occur the same distance apart (a sketch follows below)
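A minimal sketch of the technique on the four knock/door sentences from the previous slide; whitespace tokenization and prefix matching are simplifying assumptions:

```python
import statistics

sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on the big red door",
    "a man knocked on the metal front door",
]

def offsets(focal, collocate, sents, window=8):
    # Signed distance from the focal word to the collocating word,
    # keeping only pairs that fall inside the window.
    found = []
    for s in sents:
        words = s.split()
        for i, w in enumerate(words):
            if not w.startswith(focal):
                continue
            for j, v in enumerate(words):
                if v.startswith(collocate) and 0 < abs(j - i) <= window:
                    found.append(j - i)
    return found

d = offsets("knock", "door", sentences)
print(statistics.mean(d))   # 4.0
print(statistics.stdev(d))  # 1.154..., the sample standard deviation
```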
Mean and Variance Illustration
  • Candidate Generation example:
    • Window: 3
  • Used to find collocations with long-distance relationships
Hypothesis Testing: Overview
  • Two (or more) words co-occur a lot
  • Is a candidate a true collocation, or a (not-at-all-interesting) phantom?
The t test Intuition
  • Intuition:
    • Compute the frequency expected by chance and check whether the observed frequency is significantly higher
    • Conceptually, consider all possible permutations of the words in the corpus
    • How much more frequent is the bigram in the observed corpus than in these random permutations?
  • Assumptions:
    • H0 is the null hypothesis (words occur independently)
      • P(w1, w2) = P(w1) P(w2)
    • Distribution is “normal”
The t test Formula
  • t = (x - μ) / √(s²/N)
  • Measures:
    • x = sample mean = bigram count / N
    • μ = mean under H0 = P(w1) P(w2)
    • s² ≈ x (since p(1 - p) ≈ p for small p)
    • N = total number of bigrams
  • Result:
    • A number to look up in a t table
    • Degree of confidence that the collocation is not created by chance
      • α = the significance level at which one can reject H0 (a numeric sketch follows below)
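A small numeric sketch of this t score; the counts in the last line are made up for illustration (roughly newswire scale) and are not taken from the slides:

```python
import math

def t_score(c_w1, c_w2, c_bigram, n_bigrams):
    x_bar = c_bigram / n_bigrams                  # sample mean (MLE of bigram prob.)
    mu = (c_w1 / n_bigrams) * (c_w2 / n_bigrams)  # expected prob. under H0 (independence)
    s2 = x_bar                                    # s^2 = p(1 - p) is approximately p for small p
    return (x_bar - mu) / math.sqrt(s2 / n_bigrams)

# Made-up counts: w1 occurs 15,820 times, w2 4,675 times, the bigram 8 times,
# in a corpus of about 14.3 million bigrams.
print(t_score(15_820, 4_675, 8, 14_307_668))
```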
The t test Criticism
  • Words are not normally distributed
    • Can reject valid collocation
  • Not good on sparse data
χ² Intuition
  • Pearson’s chi-square test
  • Intuition
    • Compare observed frequencies to expected frequencies for independence
  • Assumptions
    • Does not assume normally distributed probabilities; appropriate as long as the sample (i.e., the expected counts) is not too small
χ² General Formula
  • χ² = Σi,j (Oij - Eij)² / Eij
  • Measures:
    • Eij = expected count of the bigram (under independence)
    • Oij = observed count of the bigram
  • Result
    • A number to look up in a table (like the t test)
    • The degree of confidence (significance level α) with which H0 can be rejected
χ² Bigram Method and Formula
  • Technique for Bigrams:
    • Arrange the counts in a 2×2 contingency table: one dimension for w1 vs. ¬w1, the other for w2 vs. ¬w2
    • Formula for the 2×2 case:
      • χ² = N (O11 O22 - O12 O21)² / ((O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22))
      • Oij = observed count in cell (i, j) (a sketch follows below)
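A minimal sketch of the 2×2 computation; the cell counts in the usage line are made up for illustration:

```python
def chi_square_2x2(o11, o12, o21, o22):
    # o11 = count(w1 w2), o12 = count(w1 followed by something other than w2),
    # o21 = count(other word followed by w2), o22 = count of all remaining bigrams
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Made-up contingency table for a candidate bigram:
print(chi_square_2x2(8, 4_667, 15_812, 14_287_181))
```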
χ² Sample Findings
  • Comparing corpora
    • Machine Translation
      • Comparing the co-occurrence counts of (English) “cow” and (French) “vache” in an aligned corpus gives χ² = 456400
    • Similarity of two corpora
χ² Criticism
  • Not good for small datasets
Likelihood Ratios Within a Single Corpus (Dunning, 1993)
  • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic.
  • In applying the likelihood ratio test to collocation discovery, compare the following two alternative explanations for the occurrence frequency of a bigram w1 w2:
    • H1 (independence): the occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = p = P(w2 | ¬w1)
    • H2 (dependence): the occurrence of w2 depends on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2
Likelihood Ratios Within a Single Corpus
  • Use the MLE for the probabilities p, p1, and p2 and assume a binomial distribution:
    • Under H1: P(w2 | w1) = c2/N, P(w2 | ¬w1) = c2/N (both equal to p)
    • Under H2: P(w2 | w1) = c12/c1 = p1, P(w2 | ¬w1) = (c2 - c12)/(N - c1) = p2
    • Under H1: b(c12; c1, p) is the likelihood of w2 following w1 in c12 of the c1 cases, and b(c2 - c12; N - c1, p) the likelihood of w2 following some other word in c2 - c12 of the remaining N - c1 cases
    • Under H2: the same two terms, but with p1 and p2 in place of p: b(c12; c1, p1) and b(c2 - c12; N - c1, p2)
Likelihood Ratios Within a Single Corpus
  • The likelihood of H1
    • L(H1) = b(c12; c1, p) b(c2 - c12; N - c1, p) (likelihood of independence)
  • The likelihood of H2
    • L(H2) = b(c12; c1, p1) b(c2 - c12; N - c1, p2) (likelihood of dependence)
  • The log of the likelihood ratio
    • log λ = log [L(H1)/L(H2)] = log b(c12; c1, p) + log b(c2 - c12; N - c1, p) - log b(c12; c1, p1) - log b(c2 - c12; N - c1, p2)
  • The quantity -2 log λ is asymptotically χ² distributed, so we can test for significance (a sketch follows below).
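A minimal sketch of this log-likelihood ratio; the binomial coefficients are omitted because they cancel in the ratio, and the counts in the last line are made up:

```python
import math

def log_b(k, n, p):
    # Log binomial likelihood b(k; n, p) without the binomial coefficient
    # (the coefficient cancels between L(H1) and L(H2)).
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(c1, c2, c12, n):
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (log_b(c12, c1, p) + log_b(c2 - c12, n - c1, p)
                  - log_b(c12, c1, p1) - log_b(c2 - c12, n - c1, p2))
    return -2 * log_lambda  # asymptotically chi-square distributed

# Made-up counts: c(w1) = 42, c(w2) = 20, c(w1 w2) = 10, N = 100,000 bigrams.
print(llr(42, 20, 10, 100_000))
```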
[Pointwise] Mutual Information (I)
  • Intuition:
    • Given a collocation (w1, w2) and an observation of w1
    • I(w1; w2) indicates how much more likely it is that we will see w2
    • The same measure also works in reverse (observe w2)
  • Assumptions:
    • Data is not sparse
Mutual Information Formula
  • I(w1, w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ] = log2 [ P(w2 | w1) / P(w2) ]
  • Measures:
    • P(w1) = unigram prob.
    • P(w1 w2) = bigram prob.
    • P(w2 | w1) = probability of w2 given that we see w1
  • Result:
    • A number indicating how much more confident we are that we will see w2 after w1 (a sketch follows below)
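A minimal sketch of pointwise mutual information from raw counts; the counts in the usage line are made up for illustration:

```python
import math

def pmi(c_w1, c_w2, c_bigram, n):
    # log2 of the observed bigram probability over the probability
    # expected if w1 and w2 were independent.
    p_w1 = c_w1 / n
    p_w2 = c_w2 / n
    p_bigram = c_bigram / n
    return math.log2(p_bigram / (p_w1 * p_w2))

# Made-up counts: w1 occurs 1,000 times, w2 500 times, the bigram 100 times,
# in a corpus of 1,000,000 bigrams.
print(pmi(1_000, 500, 100, 1_000_000))  # about 7.6 bits
```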
Mutual Information Criticism
  • A better measure of the independence of two words rather than the dependence of one word on another
  • Horrible on [read: misidentifies] sparse data
Applications
  • Collocations are useful in:
    • Comparison of Corpora
    • Parsing
    • New Topic Detection
    • Computational Lexicography
    • Natural Language Generation
    • Machine Translation
Comparison of Corpora
  • Compare corpora to support:
    • Document clustering (for information retrieval)
    • Plagiarism detection
  • Comparison techniques:
    • Competing hypotheses:
      • The documents are dependent
      • The documents are independent
    • Compare the hypotheses using the likelihood ratio λ, χ², etc.
Parsing
  • When parsing, we may get more accurate data by treating a collocation as a unit (rather than individual words)
    • Example: [ hand to hand ] is a unit in:

(S (NP They)
   (VP engaged
       (PP in hand)
       (PP to
           (NP hand combat))))

New Topic Detection
  • When new topics are reported, the count of collocations associated with those topics increases
    • When topics become old, the count drops
Computational Lexicography
  • As new multi-word expressions become part of the language, they can be detected
    • Existing collocations can be acquired
  • Can also be used for cultural identification
    • Examples:
      • My friend got an A in his class
      • My friend took an A in his class
      • My friend made an A in his class
      • My friend earned an A in his class
Natural Language Generation
  • Problem:
    • Given two (or more) possible productions, which is more feasible?
    • Productions usually involve synonyms or near-synonyms
    • Languages generally favour one production
Machine Translation
  • Collocation-complete problem?
    • Must find all used collocations
    • Must parse collocation as a unit
    • Must translate collocation as a unit
    • In target language production, must select among many plausible alternatives
Thanks!
  • Questions?
Statistical Inference
  • Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.
Language Models
  • Predict the next word, given the previous words

(this sort of task is often referred to as a Shannon game)

  • A language model can take the context into account.
  • Determine probability of different sequences by examining training corpus
  • Applications:
      • OCR / Speech recognition – resolve ambiguity
      • Spelling correction
      • Machine translation, etc.
Statistical Estimators
  • Example:

Corpus: five Jane Austen novels

N = 617,091 words, V = 14,585 unique words

Task: predict the next word of the trigram “inferior to ___”

from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”

  • Given the observed training data …
  • How do you develop a model (probability distribution) to predict future events?
The Perfect Language Model
  • Sequence of word forms
  • Notation: W = (w1,w2,w3,...,wn)
  • The big (modeling) question is what is p(W)?
  • Well, we know (Bayes/chain rule):

p(W) = p(w1,w2,w3,...,wn) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)

  • Not practical (even for short W → too many parameters)
Markov Chain
  • Unlimited memory (cf. previous foil):
    • for wi, we know its predecessors w1,w2,w3,...,wi-1
  • Limited memory:
    • we disregard predecessors that are “too old”
    • remember only k previous words: wi-k,wi-k+1,...,wi-1
    • called “kth order Markov approximation”
  • Stationary character (no change over time):

p(W) ≅ ∏i=1..n p(wi|wi-k,wi-k+1,...,wi-1), n = |W|

N-gram Language Models
  • (n-1)th order Markov approximation → n-gram LM:

p(W) = ∏i=1..n p(wi|wi-n+1,wi-n+2,...,wi-1)

  • In particular (assume vocabulary |V| = 20k):

0-gram LM: uniform model    p(w) = 1/|V|             1 parameter
1-gram LM: unigram model    p(w)                     2×10⁴ parameters
2-gram LM: bigram model     p(wi|wi-1)               4×10⁸ parameters
3-gram LM: trigram model    p(wi|wi-2,wi-1)          8×10¹² parameters
4-gram LM: tetragram model  p(wi|wi-3,wi-2,wi-1)     1.6×10¹⁷ parameters

Reliability vs. Discrimination

“large green ___________”

tree? mountain? frog? car?

“swallowed the large green ________”

pill? tidbit?

  • larger n: more information about the context of the specific instance (greater discrimination)
  • smaller n: more instances in training data, better statistical estimates (more reliability)
LM Observations
  • How large n?
    • nothing is enough (theoretically, no finite n suffices)
    • but anyway: as much as possible (as close to the “perfect” model as possible)
    • empirically: 3
      • parameter estimation? (reliability, data availability, storage space, ...)
      • 4 is too much: |V| = 60k → 1.296×10¹⁹ parameters
      • but: 6-7 would be (almost) ideal (given enough data)
  • For now, word forms only (no “linguistic” processing)
Parameter Estimation
  • Parameter: numerical value needed to compute p(w|h)
  • From data (how else?)
  • Data preparation:
      • get rid of formatting etc. (“text cleaning”)
      • define words (separate but include punctuation, call it “word”, unless speech)
      • define sentence boundaries (insert “words” <s> and </s>)
      • letter case: keep, discard, or be smart:
        • name recognition
        • number type identification
Maximum Likelihood Estimate
  • MLE: Relative Frequency...
    • ...best predicts the data at hand (the “training data”)
    • See (Ney et al. 1997) for a proof that the relative frequency really is the maximum likelihood estimate.
  • Trigrams from Training Data T:
    • count sequences of three words in T: C3(wi-2,wi-1,wi)
    • count sequences of two words in T: C2(wi-2,wi-1):

PMLE(wi-2,wi-1,wi) = C3(wi-2,wi-1,wi) / N

PMLE(wi|wi-2,wi-1) = C3(wi-2,wi-1,wi) / C2(wi-2,wi-1)

Character Language Model
  • Use individual characters instead of words:

Same formulas and methods

  • Might consider 4-grams, 5-grams or even more
  • Good for cross-language comparisons
  • Transform cross-entropy between letter- and word-based models:

HS(pc) = HS(pw) / avg. # of characters/word in S

p(W) =df ∏i=1..n p(ci|ci-n+1,ci-n+2,...,ci-1)

LM: an Example
  • Training data:

<s0> <s> He can buy you the can of soda </s>

    • Unigram: (8 words in vocabulary)

p1(He) = p1(buy) = p1(you) = p1(the) = p1(of) = p1(soda) = .125 p1(can) = .25

    • Bigram:

p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,

p2(you |buy) = 1,...

    • Trigram:

p3(He|<s0>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(</s>|of,soda) = 1.

    • Entropy: H(p1) = 2.75, H(p2) = 1, H(p3) = 0
LM: an Example (The Problem)
  • Cross-entropy:
  • S = <s0> <s> It was the greatest buy of all </s>
  • Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
    • all unigrams but p1(the), p1(buy), and p1(of) are 0.
    • all bigram probabilities are 0.
    • all trigram probabilities are 0.
  • Need to make all “theoretically possible” probabilities non-zero.
LM: Another Example
  • Training data S: |V| =11 (not counting <s> and </s>)
    • <s> John read Moby Dick </s>
    • <s> Mary read a different book </s>
    • <s> She read a book by Cher </s>
  • Bigram estimates:
    • P(She | <s>) = C(<s> She) / Σw C(<s> w) = 1/3
    • P(read | She) = C(She read) / Σw C(She w) = 1
    • P(Moby | read) = C(read Moby) / Σw C(read w) = 1/3
    • P(Dick | Moby) = C(Moby Dick) / Σw C(Moby w) = 1
    • P(</s> | Dick) = C(Dick </s>) / Σw C(Dick w) = 1
  • p(She read Moby Dick) =

p(She | <s>) × p(read | She) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 1/3 × 1 × 1/3 × 1 × 1 = 1/9
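A minimal sketch that reproduces the bigram MLE estimates above from the three training sentences:

```python
from collections import Counter

sentences = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]

tokens = [s.split() for s in sentences]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(b for sent in tokens for b in zip(sent, sent[1:]))

def p_mle(w, prev):
    # MLE bigram estimate: C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

sent = "<s> She read Moby Dick </s>".split()
p = 1.0
for prev, w in zip(sent, sent[1:]):
    p *= p_mle(w, prev)
print(p)  # 1/9, about 0.111
```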

The Zero Problem
  • “Raw” n-gram language model estimate:
    • necessarily, there will be some zeros
      • often a trigram model → 2.16×10¹⁴ parameters, but only ~10⁹ words of data
    • which are true zeros?
      • optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams (hapax legomena = items seen only once)
      • the optimal situation cannot happen, unfortunately (question: how much data would we need?)
    • we don’t know which zeros are true; hence, we eliminate them all.
  • Different kinds of zeros: p(w|h) = 0, p(w) = 0
Why do we need non-zero probabilities?
  • Avoid infinite Cross Entropy:
    • happens when an event is found in the test data which has not been seen in training data
  • Make the system more robust
    • low count estimates:
      • they typically happen for “detailed” but relatively rare appearances
    • high count estimates: reliable but less “detailed”
Eliminating the Zero Probabilities: Smoothing
  • Get new p’(w) (same W): almost p(w), except for eliminating zeros
  • Discount (some) w with p(w) > 0: new p’(w) < p(w)

Σw∈discounted (p(w) - p’(w)) = D

  • Distribute D over all w with p(w) = 0: new p’(w) > p(w)
    • possibly also to other w with low p(w)
  • For some w (possibly): p’(w) = p(w)
  • Make sure Σw∈W p’(w) = 1
  • There are many ways of smoothing
Laplace’s Law: Smoothing by Adding 1
  • Laplace’s Law:
    • PLAP(w1,..,wn)=(C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of n-gram w1,..,wn, N is the number of training instances, and B is the number of bins training instances are divided into (vocabulary size)
    • Problem if B > C(W) (can be the case; even >> C(W))
    • PLAP(w | h) = (C(h,w) + 1) / (C(h) + B)
  • The idea is to give a little bit of the probability space to unseen events.
Add 1 Smoothing Example
  • pMLE(Cher read Moby Dick) =

p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

    • p(Cher | <s>) = (1 + C(<s> Cher))/(11 + C(<s>)) = (1 + 0) / (11 + 3) = 1/14 = .0714
    • p(read | Cher) = (1 + C(Cher read))/(11 + C(Cher)) = (1 + 0) / (11 + 1) = 1/12 = .0833
    • p(Moby | read) = (1 + C(read Moby))/(11 + C(read)) = (1 + 1) / (11 + 3) = 2/14 = .1429
    • P(Dick | Moby) = (1 + C(Moby Dick))/(11 + C(Moby)) = (1 + 1) / (11 + 1) = 2/12 = .1667
    • P(</s> | Dick) = (1 + C(Dick </s>))/(11 + C(Dick)) = (1 + 1) / (11 + 1) = 2/12 = .1667
  • p’(Cher read Moby Dick) =

p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 1/14 × 1/12 × 2/14 × 2/12 × 2/12 ≈ 2.36e-5

Objections to Laplace’s Law
  • For NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.
  • Worse at predicting the actual probabilities of bigrams with zero counts than other methods.
  • The variance of the count estimates is actually greater than for the MLE.
Lidstone’s Law
  • P = probability of a specific n-gram
  • C = count of that n-gram in the training data
  • N = total number of n-grams in the training data
  • B = number of “bins” (possible n-grams)
  • λ = a small positive number
    • MLE: λ = 0; Laplace’s Law: λ = 1; Jeffreys-Perks Law: λ = ½
  • PLid(w | h) = (C(h,w) + λ) / (C(h) + Bλ) (a sketch follows below)
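A minimal sketch of Lidstone's Law; λ = 0 gives the MLE, λ = 1 Laplace's Law, and λ = ½ the Jeffreys-Perks Law. The usage lines reproduce p(Moby | read) from the add-1 example above (and preview the λ = .5 example that follows):

```python
def lidstone(count_hw, count_h, bins, lam=0.5):
    # P_Lid(w | h) = (C(h,w) + lambda) / (C(h) + B * lambda)
    return (count_hw + lam) / (count_h + bins * lam)

# p(Moby | read): C(read Moby) = 1, C(read) = 3, B = 11
print(lidstone(1, 3, 11, lam=1.0))  # 2/14    = .1429 (Laplace, add 1)
print(lidstone(1, 3, 11, lam=0.5))  # 1.5/8.5 = .1765 (Jeffreys-Perks)
print(lidstone(1, 3, 11, lam=0.0))  # 1/3     = .3333 (MLE)
```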
Objections to Lidstone’s Law
  • Need an a priori way to determine λ.
  • Predicts all unseen events to be equally likely.
  • Gives probability estimates linear in the M.L.E. frequency.
Lidstone’s Law with λ = .5
  • pMLE(Cher read Moby Dick) =

p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

    • p(Cher | <s>) = (.5 + C(<s> Cher))/(.5×11 + C(<s>)) = (.5 + 0) / (5.5 + 3) = .5/8.5 = .0588
    • p(read | Cher) = (.5 + C(Cher read))/(.5×11 + C(Cher)) = (.5 + 0) / (5.5 + 1) = .5/6.5 = .0769
    • p(Moby | read) = (.5 + C(read Moby))/(.5×11 + C(read)) = (.5 + 1) / (5.5 + 3) = 1.5/8.5 = .1765
    • P(Dick | Moby) = (.5 + C(Moby Dick))/(.5×11 + C(Moby)) = (.5 + 1) / (5.5 + 1) = 1.5/6.5 = .2308
    • P(</s> | Dick) = (.5 + C(Dick </s>))/(.5×11 + C(Dick)) = (.5 + 1) / (5.5 + 1) = 1.5/6.5 = .2308
  • p’(Cher read Moby Dick) =

p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = .5/8.5 × .5/6.5 × 1.5/8.5 × 1.5/6.5 × 1.5/6.5 ≈ 4.25e-5

Held-Out Estimator
  • How much of the probability distribution should be reserved to allow for previously unseen events?
  • Can validate choice by holding out part of the training data.
  • How often do events seen (or not seen) in training data occur in validation data?
  • Held out estimator by Jelinek and Mercer (1985)
Held-Out Estimator
  • For each n-gram w1,..,wn, compute C1(w1,..,wn) and C2(w1,..,wn), its frequencies in the training data and in the held-out data, respectively.
    • Let Nr be the number of n-grams with frequency r in the training text.
    • Let Tr be the total number of times that all n-grams appearing r times in the training text appear in the held-out data.
  • Then the average held-out frequency of an n-gram that appears r times in training is Tr/Nr.
  • An estimate for the probability of one of these n-grams is: Pho(w1,..,wn) = (Tr/Nr)/N
    • where C(w1,..,wn) = r (a sketch follows below)
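A minimal sketch of the held-out estimator over lists of bigrams; the toy data and the choice to normalize by the size of the held-out sample are illustrative assumptions:

```python
from collections import Counter

def held_out_probs(train_bigrams, heldout_bigrams):
    c1 = Counter(train_bigrams)    # training frequencies C1
    c2 = Counter(heldout_bigrams)  # held-out frequencies C2
    n = len(heldout_bigrams)       # assumed normalizer N
    n_r = Counter(c1.values())     # Nr: how many bigrams have training count r
    t_r = Counter()                # Tr: total held-out count of those bigrams
    for bigram, r in c1.items():
        t_r[r] += c2[bigram]
    # P_ho(bigram) = (Tr / Nr) / N, where r is the bigram's training count.
    return {bigram: (t_r[r] / n_r[r]) / n for bigram, r in c1.items()}

# Toy usage with made-up bigram lists:
train = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "d")]
heldout = [("a", "b"), ("b", "c"), ("b", "c"), ("d", "e")]
print(held_out_probs(train, heldout))
```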
Testing Models
  • Divide data into training and testing sets.
  • Training data: divide into normal training plus validation (smoothing) sets: around 10% for validation (fewer parameters typically)
  • Testing data: distinguish between the “real” test set and a development set.
Cross-Validation
  • Held out estimation is useful if there is a lot of data available. If not, it may be better to use each part of the data both as training data and held out data.
    • Deleted Estimation [Jelinek & Mercer, 1985]
    • Leave-One-Out [Ney et al., 1997]
Deleted Estimation
  • Use data for both training and validation
  • Divide the training data into two parts, A and B
    • Train on A, validate on B → Model 1
    • Train on B, validate on A → Model 2
    • Combine the two models (Model 1 + Model 2 → final model)
Cross-Validation
  • Two estimates:
    • Nra = number of n-grams occurring r times in the a-th part of the training set
    • Trab = total number of occurrences, in the b-th part, of those n-grams
  • Combined estimate (arithmetic mean of the two deleted estimates):

Pdel(w1,..,wn) = (Trab + Trba) / (N (Nra + Nrb)), where r = C(w1,..,wn)

Leave One Out
  • The primary training corpus is of size N-1 tokens.
  • 1 token is used as held-out data for a sort of simulated testing.
  • The process is repeated N times so that each piece of data is left out in turn.
  • Advantage: it explores how the model would change if any particular piece of data had not been observed.
Good-Turing Estimation
  • Intuition: re-estimate the probability mass assigned to n-grams with low (or zero) counts using the number of n-grams with higher counts. For any n-gram that occurs r times, we should pretend that it occurs r* times, where Nr is the number of n-grams occurring precisely r times in the training data:

r* = (r + 1) Nr+1 / Nr

  • To convert the adjusted count to a probability, we normalize the n-gram with r counts as: PGT(w1,..,wn) = r*/N
Good-Turing Estimation
  • Note that N is equal to the original number of counts in the distribution.
  • Makes the assumption of a binomial distribution, which works well for large amounts of data and a large vocabulary despite the fact that words and n-grams do not have that distribution.
Good-Turing Estimation
  • Note that the estimate cannot be used if Nr=0; hence, it is necessary to smooth the Nr values.
  • The estimate can be written as:
    • If C(w1,..,wn) = r > 0, PGT(w1,..,wn) = r*/N, where r* = ((r+1)S(r+1))/S(r) and S(r) is a smoothed estimate of the expectation of Nr.
    • If C(w1,..,wn) = 0, PGT(w1,..,wn) ≈ (N1/N0)/N
  • In practice, counts with a frequency greater than five are assumed reliable, as suggested by Katz.
  • In practice, this method is not used by itself because it does not use lower order information to estimate probabilities of higher order n-grams.
Good-Turing Estimation
  • N-grams with low counts are often treated as if they had a count of 0.
  • In practice r* is used only for small counts; counts greater than k = 5 are assumed to be reliable: r* = r if r > k; otherwise (Katz’s correction, for 1 ≤ r ≤ k):

r* = [ (r+1) Nr+1/Nr - r (k+1) Nk+1/N1 ] / [ 1 - (k+1) Nk+1/N1 ]

(a sketch of the basic r* computation follows below)
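A minimal sketch of the basic Good-Turing adjusted counts r* = (r+1)Nr+1/Nr; falling back to the raw count when Nr+1 = 0 stands in for the smoothed S(r) of the previous slide, and the toy counts are made up:

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    n_r = Counter(ngram_counts.values())  # Nr: number of n-grams seen exactly r times
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if n_r.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]  # r* = (r+1) N_{r+1} / N_r
        else:
            # No n-grams with count r+1: keep the raw count
            # (in practice a smoothed estimate S(r) of Nr is used instead).
            adjusted[ngram] = float(r)
    return adjusted

# Toy usage with made-up counts:
print(good_turing_counts({"a b": 3, "b c": 1, "c d": 1, "d e": 1, "e f": 2}))
```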
Discounting Methods
  • Absolute discounting: decrease the probability of each observed n-gram by subtracting a small constant δ when C(w1, w2, …, wn) = r:

Pabs(w1,..,wn) = (r - δ)/N, if r > 0 (the freed mass is redistributed over unseen n-grams)

  • Linear discounting: decrease the probability of each observed n-gram by multiplying it by the same proportion (1 - α) when C(w1, w2, …, wn) = r:

Plin(w1,..,wn) = (1 - α) r/N, if r > 0 (again, the freed mass goes to unseen n-grams)
Combining Estimators: Overview
  • If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model.
  • Some combination methods:
    • Katz’s Back Off
    • Simple Linear Interpolation
    • General Linear Interpolation
Backoff
  • Back off to a lower order n-gram if we have no evidence for the higher order form. Trigram backoff:

Pbo(wi | wi-2, wi-1) = P(wi | wi-2, wi-1), if C(wi-2, wi-1, wi) > 0
                     = α1 P(wi | wi-1), else if C(wi-1, wi) > 0
                     = α2 P(wi), otherwise
Katz’s Back Off Model
  • If the n-gram of concern has appeared more than k times, then an n-gram estimate is used but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams).
  • If the n-gram occurred k times or less, then we will use an estimate from a shorter n-gram (back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate.
  • The process continues recursively.
Katz’s Back Off Model
  • Katz used Good-Turing estimates when an n-gram appeared k or fewer times.
Problems with Backing-Off
  • If bigram w1 w2 is common, but trigram w1 w2 w3 is unseen, it may be a meaningful gap, rather than a gap due to chance and scarce data.
    • i.e., a “grammatical null”
  • In that case, it may be inappropriate to back-off to lower-order probability.
Linear Interpolation
  • One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness.
  • This can be done by linear interpolation (also called finite mixture models).
  • The weights can be set using the Expectation-Maximization (EM) algorithm.
Simple Interpolated Smoothing
  • Add information from less detailed distributions using λ = (λ0, λ1, λ2, λ3):

p’λ(wi| wi-2, wi-1) = λ3 p3(wi| wi-2, wi-1) + λ2 p2(wi| wi-1) + λ1 p1(wi) + λ0/|V|

  • Normalize:

λi > 0, Σi=0..n λi = 1 is sufficient (λ0 = 1 - Σi=1..n λi) (n = 3)

  • Estimation using MLE:
    • fix the p3, p2, p1 and |V| parameters as estimated from the training data
    • then find the {λi} that minimize the cross entropy (maximize the probability of the data): -(1/|D|) Σi=1..|D| log2(p’λ(wi|hi))
Held Out Data
  • What data to use?
    • try the training data T: but we will always get λ3 = 1
      • why? (let piT be an i-gram distribution estimated using T)
      • minimizing HT(p’λ) over a vector λ, p’λ = λ3p3T + λ2p2T + λ1p1T + λ0/|V|
        • remember: HT(p’λ) = H(p3T) + D(p3T||p’λ); (p3T fixed → H(p3T) fixed, best)
    • thus: do not use the training data for estimation of λ!
      • must hold out part of the training data (heldout data, H):
      • ...call the remaining data the (true/raw) training data, T
      • the test data S (e.g., for evaluation purposes): still different data!
The Formulas
  • Repeat: minimizing -(1/|H|) Σi=1..|H| log2(p’λ(wi|hi)) over λ

p’λ(wi| hi) = p’λ(wi| wi-2, wi-1) = λ3 p3(wi| wi-2, wi-1) +

λ2 p2(wi| wi-1) + λ1 p1(wi) + λ0/|V|

  • “Expected Counts (of lambdas)”: j = 0..3

c(λj) = Σi=1..|H| (λj pj(wi|hi) / p’λ(wi|hi))

  • “Next λ”: j = 0..3

λj,next = c(λj) / Σk=0..3 c(λk)

The (Smoothing) EM Algorithm

1. Start with some λ, such that λj > 0 for all j ∈ 0..3.

2. Compute the “Expected Counts” for each λj.

3. Compute the new set of λj, using the “Next λ” formula.

4. Start over at step 2, unless a termination condition is met.

  • Termination condition: convergence of λ.
    • Simply set an ε, and finish if |λj - λj,next| < ε for each j (step 3).
  • Guaranteed to converge:

follows from Jensen’s inequality, plus a technical proof.

Example
  • Raw distribution (unigram; smooth with uniform):

p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z

  • Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
  • Start with λ1 = .5; p’λ(b) = .5 × .5 + .5/26 = .27

p’λ(a) = .5 × .25 + .5/26 = .14

p’λ(y) = .5 × 0 + .5/26 = .02

c(λ1) = .5 × .5/.27 + .5 × .25/.14 + .5 × .5/.27 + .5 × 0/.02 = 2.72

c(λ0) = .5 × .04/.27 + .5 × .04/.14 + .5 × .04/.27 + .5 × .04/.02 = 1.28

Normalize: λ1,next = .68, λ0,next = .32.

Repeat from step 2 (recompute p’λ first for efficient computation, then c(λi), ...)

Finish when new lambdas differ little from the previous ones (say, < 0.01 difference).
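A minimal sketch of this smoothing EM loop for the two-distribution case above (unigram + uniform, held-out data “baby”); the iteration cap is an arbitrary safeguard:

```python
# Unigram distribution from the example: p(a) = .25, p(b) = .5, 1/64 for c..r, else 0.
unigram = {"a": 0.25, "b": 0.5}
unigram.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})
uniform = 1 / 26         # uniform distribution over the 26 letters
heldout = "baby"
lam1, lam0 = 0.5, 0.5    # initial weights: unigram, uniform

for _ in range(20):      # cap the loop; normally run until convergence
    # Expected counts of each lambda on the held-out data.
    c1 = c0 = 0.0
    for ch in heldout:
        p_interp = lam1 * unigram.get(ch, 0.0) + lam0 * uniform
        c1 += lam1 * unigram.get(ch, 0.0) / p_interp
        c0 += lam0 * uniform / p_interp
    # "Next lambda": normalize the expected counts.
    new1, new0 = c1 / (c1 + c0), c0 / (c1 + c0)
    converged = abs(new1 - lam1) < 0.01 and abs(new0 - lam0) < 0.01
    lam1, lam0 = new1, new0
    if converged:
        break

print(round(lam1, 2), round(lam0, 2))  # first iteration gives about .68 / .32
```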


Witten-Bell Smoothing
  • The nth order smoothed model is defined recursively as a linear interpolation of the nth order MLE and the (n-1)th order smoothed model:

pWB(wi| wi-n+1..wi-1) = λ(wi-n+1..wi-1) pMLE(wi| wi-n+1..wi-1) + (1 - λ(wi-n+1..wi-1)) pWB(wi| wi-n+2..wi-1)

  • To compute λ(wi-n+1..wi-1), we need the number of unique words that follow that history.
Witten-Bell Smoothing
  • The number of words that follow the history and have one or more counts is:

N1+(wi-n+1..wi-1 •) = |{ wi : C(wi-n+1..wi-1 wi) > 0 }|

  • We can assign the parameters such that:

1 - λ(wi-n+1..wi-1) = N1+(wi-n+1..wi-1 •) / (N1+(wi-n+1..wi-1 •) + Σwi C(wi-n+1..wi-1 wi))

Witten-Bell Smoothing
  • Substituting into the first equation, we get:

pWB(wi| wi-n+1..wi-1) = (C(wi-n+1..wi) + N1+(wi-n+1..wi-1 •) pWB(wi| wi-n+2..wi-1)) / (Σwi C(wi-n+1..wi-1 wi) + N1+(wi-n+1..wi-1 •))
General Linear Interpolation
  • In simple linear interpolation, the weights are just a single number, but one can define a more general and powerful model where the weights are a function of the history.
  • Need some way to group or bucket lambda histories.
Reference
  • www.cs.tau.ac.il/~nachumd/NLP
  • http://www.cs.sfu.ca
  • http://min.ecn.purdue.edu
  • Manning and Schütze, Foundations of Statistical Natural Language Processing
  • Jurafsky and Martin, Speech and Language Processing