- By **lilah**

Collocations. Definition Of Collocation (wrt Corpus Literature).

## PowerPoint Slideshow about 'Collocations' - lilah

Presentation Transcript

Definition Of Collocation (wrt Corpus Literature)

- A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]

Word Collocations

- Collocation
- Firth: “word is characterized by the company it keeps”; collocations of a given word are statements of the habitual or customary places of that word.
- non-compositionality of meaning
- cannot be derived directly from its parts (heavy rain)
- non-substitutability in context
- for parts (make a decision)
- non-modifiability (& non-transformability)
- modified forms are not acceptable: kick the yellow bucket; take exceptions to (compare: kick the bucket; take exception to)

Collocations

- Collocations are not necessarily adjacent
- Collocations cannot be directly translated into other languages.

Example Classes

- Names
- Technical Terms
- “Light” Verb Constructions
- Phrasal verbs
- Noun Phrases

Linguistic Subclasses of Collocations

- Light verbs: verbs with little semantic content like make, take, do
- Terminological Expressions: concepts and objects in technical domains (e.g., hard drive)
- Idioms: fixed phrases
- kick the bucket, birds-of-a-feather, run for office
- Proper names: difficult to recognize even with lists
- Tuesday (person’s name), May, Winston Churchill, IBM, Inc.
- Numerical expressions
- containing “ordinary” words
- Monday Oct 04 1999, two thousand seven hundred fifty
- Verb particle constructions or Phrasal Verbs
- Separable parts:
- look up, take off, tell off

Collocation Detection Techniques

- Selection of Collocations by Frequency
- Selection of Collocation based on Mean and Variance of the distance between focal word and collocating word.
- Hypothesis Testing
- Pointwise Mutual Information

Frequency

- Technique:
- Count the number of times a bigram co-occurs
- Extract top counts and report them as candidates
- Results:
- Corpus: New York Times
- August – November, 1990
- Extremely uninteresting: the most frequent bigrams are almost all pairs of function words (of the, in the, to the)

Frequency with Tag Filters Technique

- Technique:
- Count the number of times a bigram co-occurs
- Tag candidates for POS
- Pass all candidates through POS filter, considering only ones matching filter
- Extract top counts and report them as candidates
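The two frequency techniques above can be sketched in a few lines. This is a minimal illustration, not the method used to produce the New York Times results: the input is assumed to be already POS-tagged, the tag labels (`A`, `N`, `P`, `D`, `C`) and the two allowed patterns (adjective-noun, noun-noun) are hypothetical choices for the example, and the toy token list is made up.

```python
from collections import Counter

def top_bigrams(tagged_tokens, allowed_patterns=None, k=5):
    """Count adjacent word pairs; optionally keep only pairs whose
    tag pattern is in allowed_patterns (the POS filter)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if allowed_patterns is None or (t1, t2) in allowed_patterns:
            counts[(w1, w2)] += 1
    return counts.most_common(k)

# Toy, hand-tagged input (assumed tags, for illustration only).
tagged = [
    ("of", "P"), ("the", "D"), ("city", "N"), ("of", "P"),
    ("New", "A"), ("York", "N"), ("and", "C"), ("of", "P"),
    ("the", "D"), ("New", "A"), ("York", "N"),
]

raw = top_bigrams(tagged)                                   # plain frequency
filtered = top_bigrams(tagged, {("A", "N"), ("N", "N")})    # with tag filter
print(raw)
print(filtered)
```

Without the filter, function-word pairs such as "of the" compete with "New York"; with the filter only the adjective-noun pair survives.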

Mean and Variance (Smadja et al., 1993)

- Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example,
- Knock and door may not occur at a fixed distance from each other
- One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.

Example: Knock and Door

- She knocked on his door.
- They knocked at the door.
- 100 women knocked on the big red door.
- A man knocked on the metal front door.
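- Average offset between knock and door:

$$\bar{d} = (3 + 3 + 5 + 5) / 4 = 4$$

- Sample variance and standard deviation:

$$s^2 = \frac{(3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2}{4-1} = \frac{4}{3} \approx 1.33, \qquad s = \sqrt{s^2} \approx 1.15$$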
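The offsets for the four example sentences can be recomputed with a short script. It is a sketch under simple assumptions: tokens are split on whitespace, and the two words are matched by prefix (`knock`, `door`) so that "knocked" counts as an occurrence of "knock". The helper name `offsets` is hypothetical.

```python
from statistics import mean, stdev, variance

sentences = [
    "She knocked on his door.",
    "They knocked at the door.",
    "100 women knocked on the big red door.",
    "A man knocked on the metal front door.",
]

def offsets(sents, w1, w2):
    """Signed distance (in tokens) from the first w1-like token to the
    first w2-like token in each sentence."""
    result = []
    for s in sents:
        tokens = s.lower().rstrip(".").split()
        i = next(k for k, t in enumerate(tokens) if t.startswith(w1))
        j = next(k for k, t in enumerate(tokens) if t.startswith(w2))
        result.append(j - i)
    return result

offs = offsets(sentences, "knock", "door")
avg = mean(offs)          # average offset
var = variance(offs)      # sample variance (divides by n - 1)
sd = stdev(offs)          # sample standard deviation, sqrt of the variance
print(offs, avg, round(var, 2), round(sd, 2))
```

A small standard deviation relative to the window size suggests the two words tend to appear at a consistent distance from each other.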

Mean and Variance

- Technique (bigram at distance)
- Produce all possible pairs in a window
- Consider all pairs in window as candidates
- Keep data about distance of one word from another
- Count the number of times each candidate occurs
- Measures:
- Mean: average offset (possibly negative)
- Whether two words are related to each other
- Variance: $s^2$(offset)
- Variability in position of two words

Mean and Variance Illustration

- Candidate Generation example:
- Window: 3
- Used to find collocations with long-distance relationships

Hypothesis Testing: Overview

- Two (or more) words co-occur a lot
- Is a candidate a true collocation, or a (not-at-all-interesting) phantom?

The t test Intuition

- Intuition:
- Compute chance occurrence and ensure observed is significantly higher
- Take several permutations of the words in the corpus
- How much more frequent is the observed co-occurrence than its frequency across all those permutations?
- Assumptions:
- H0 is the null hypothesis (words occur independently)
- P(w1, w2) = P(w1) P(w2)
- Distribution is “normal”

The t test Formula

- Measures:
- $\bar{x}$ = sample mean = bigram count / N
- $\mu$ = mean under H0 = P(w1) P(w2)
- $s^2$ = sample variance ≈ $\bar{x}$ (since p(1 – p) ≈ p for small p)
- N = total number of bigrams
- Formula: $t = (\bar{x} - \mu) / \sqrt{s^2 / N}$
- Result:
- Number to look up in a table
- Degree of confidence that collocation is not created by chance
- $\alpha$ = the confidence (%) with which one can reject H0
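As a rough sketch of the t test for a single bigram, the function below computes the statistic from four counts. The counts in the example are illustrative values assumed for this sketch, not results from the slides, and the critical value 2.576 is the standard two-sided threshold for a 0.005 significance level with many degrees of freedom.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for the bigram w1 w2.
    c1, c2: counts of w1 and w2; c12: count of the bigram; n: total bigrams."""
    x_bar = c12 / n                 # observed bigram probability (sample mean)
    mu = (c1 / n) * (c2 / n)        # expected probability under independence (H0)
    s2 = x_bar * (1 - x_bar)        # Bernoulli variance, close to x_bar for small p
    return (x_bar - mu) / math.sqrt(s2 / n)

# Illustrative counts (assumed for the example).
t = t_score(c1=15828, c2=4675, c12=8, n=14307668)
print(round(t, 4))
print("reject H0" if t > 2.576 else "cannot reject H0")
```

With these counts the statistic stays below the critical value, so the bigram would not be accepted as a collocation even though it occurs several times.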

The t test Criticism

- Words are not normally distributed
- Can reject valid collocation
- Not good on sparse data

χ² Intuition

- Pearson’s chi-square test
- Intuition
- Compare observed frequencies to expected frequencies for independence
- Assumptions
- Does not assume normally distributed probabilities; the χ² approximation holds when the sample is not small

χ² General Formula

- Measures:
- $E_{ij}$ = Expected count of the bigram
- $O_{ij}$ = Observed count of the bigram
- Formula: $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$
- Result
- A number to look up in a table (like the t test)
- Degree of confidence ($\alpha$) with which H0 can be rejected

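χ² Bigram Method and Formula

- Technique for Bigrams:
- Arrange the bigrams in a 2x2 table with counts for each
- Formula
- $\chi^2 = \dfrac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}$
- $O_{ij}$: i = column; j = row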
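The 2x2 computation can be sketched in two ways: with the general sum over observed and expected counts, and with the closed form for 2x2 tables. Both functions below are illustrative; the table entries are assumed example counts (bigram, w2 without w1, w1 without w2, neither), and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square_general(o11, o12, o21, o22):
    """Sum of (O - E)^2 / E over the four cells, with E computed
    from the row and column totals under independence."""
    observed = ((o11, o12), (o21, o22))
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def chi_square_2x2(o11, o12, o21, o22):
    """Closed form for a 2x2 table; algebraically equal to the general sum."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Illustrative counts (assumed for the example).
table = (8, 4667, 15820, 14287173)
general = chi_square_general(*table)
shortcut = chi_square_2x2(*table)
print(round(general, 3), round(shortcut, 3))
print("reject H0" if general > 3.841 else "cannot reject H0")
```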

χ² Sample Findings

- Comparing corpora
- Machine Translation
- Comparison of (English) “cow” and (French) “vache” gives χ² = 456400
- Similarity of two corpora

χ² Criticism

- Not good for small datasets

Likelihood Ratios Within a Single Corpus (Dunning, 1993)

- Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic.
- In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram w1 w2:
- H1: The occurrence of w2 is independent of the previous occurrence of w1: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$
- H2: The occurrence of w2 is dependent on the previous occurrence of w1: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$

Likelihood Ratios Within a Single Corpus

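- Use the MLE for the probabilities p, p1, and p2 and assume the binomial distribution $b(k; n, x) = \binom{n}{k} x^k (1-x)^{n-k}$. Here c1, c2 and c12 are the counts of w1, w2 and the bigram w1 w2, and N is the total number of bigrams:
- Under H1: $P(w_2 \mid w_1) = c_2/N = p$, $P(w_2 \mid \neg w_1) = c_2/N = p$
- Under H2: $P(w_2 \mid w_1) = c_{12}/c_1 = p_1$, $P(w_2 \mid \neg w_1) = (c_2 - c_{12})/(N - c_1) = p_2$
- Under H1: $b(c_{12}; c_1, p)$ is the probability that c12 out of c1 bigrams are w1 w2, and $b(c_2 - c_{12}; N - c_1, p)$ is the probability that c2 - c12 out of N - c1 bigrams are ¬w1 w2
- Under H2: $b(c_{12}; c_1, p_1)$ is the probability that c12 out of c1 bigrams are w1 w2, and $b(c_2 - c_{12}; N - c_1, p_2)$ is the probability that c2 - c12 out of N - c1 bigrams are ¬w1 w2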

Likelihood Ratios Within a Single Corpus

- The likelihood of H1
- $L(H_1) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)$ (likelihood of independence)
- The likelihood of H2
- $L(H_2) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)$ (likelihood of dependence)
- The log of likelihood ratio
- $\log \lambda = \log [L(H_1) / L(H_2)] = \log b(c_{12}; c_1, p) + \log b(c_2 - c_{12}; N - c_1, p) - \log b(c_{12}; c_1, p_1) - \log b(c_2 - c_{12}; N - c_1, p_2)$
- The quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed, so we can test for significance.
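A minimal sketch of the likelihood ratio computation follows. It works with log-likelihoods and drops the binomial coefficients, which are identical under both hypotheses and cancel in the ratio. The counts are illustrative values assumed for the example, and the function raises an error for degenerate inputs where one of the probabilities is exactly 0 or 1.

```python
import math

def log_binom_kernel(k, n, p):
    """log of p^k (1-p)^(n-k); the binomial coefficient is omitted
    because it cancels in the likelihood ratio."""
    return k * math.log(p) + (n - k) * math.log1p(-p)

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(L(H1) / L(H2)) for the bigram w1 w2."""
    p = c2 / n                       # H1: same probability after w1 and after not-w1
    p1 = c12 / c1                    # H2: probability of w2 after w1
    p2 = (c2 - c12) / (n - c1)       # H2: probability of w2 after any other word
    log_lambda = (
        log_binom_kernel(c12, c1, p)
        + log_binom_kernel(c2 - c12, n - c1, p)
        - log_binom_kernel(c12, c1, p1)
        - log_binom_kernel(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda

# Illustrative counts (assumed for the example).
stat = neg2_log_lambda(c1=932, c2=934, c12=10, n=14307668)
print(round(stat, 2))
```

A value far above the χ² critical value for one degree of freedom (7.88 at the 0.005 level) indicates that the dependence hypothesis explains the counts much better than independence.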

[Pointwise] Mutual Information (I)

- Intuition:
- Given a collocation (w1, w2) and an observation of w1
- I(w1; w2) indicates how much more likely it is to see w2
- The same measure also works in reverse (observe w2)
- Assumptions:
- Data is not sparse

Mutual Information Formula

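- Measures:
- P(w1) = unigram prob.
- P(w1w2) = bigram prob.
- P(w2|w1) = probability of w2 given we see w1
- Formula: $I(w_1; w_2) = \log_2 \dfrac{P(w_1 w_2)}{P(w_1) P(w_2)} = \log_2 \dfrac{P(w_2 \mid w_1)}{P(w_2)}$
- Result:
- Number indicating increased confidence that we will see w2 after w1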
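Pointwise mutual information reduces to one line once the counts are known. The sketch below uses maximum likelihood estimates for all probabilities; the counts are illustrative values assumed for the example. The second call shows why sparse data is a problem: a pair of words that each occur once, and occur together, gets a higher score than a pair seen twenty times.

```python
import math

def pmi(c1, c2, c12, n):
    """log2 of P(w1 w2) / (P(w1) P(w2)), using relative frequencies."""
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

N = 14307668  # assumed corpus size in bigrams

frequent_pair = pmi(c1=42, c2=20, c12=20, n=N)   # pair seen 20 times
hapax_pair = pmi(c1=1, c2=1, c12=1, n=N)         # pair seen exactly once
print(round(frequent_pair, 2), round(hapax_pair, 2))
```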

Mutual Information Criticism

- A better measure of the independence of two words rather than the dependence of one word on another
- Horrible on [read: misidentifies] sparse data

Applications

- Collocations are useful in:
- Comparison of Corpora
- Parsing
- New Topic Detection
- Computational Lexicography
- Natural Language Generation
- Machine Translation

Comparison of Corpora

- Compare corpora to determine:
- Document clustering (for information retrieval)
- Plagiarism
- Comparison techniques:
- Competing hypotheses:
- Documents are dependent
- Documents are independent
- Compare hypotheses using λ (the likelihood ratio), etc.

Parsing

- When parsing, we may get more accurate data by treating a collocation as a unit (rather than individual words)
- Example: [ hand to hand ] is a unit in:

(S (NP They)
   (VP engaged
       (PP in hand)
       (PP to
           (NP hand combat))))

New Topic Detection

- When new topics are reported, the count of collocations associated with those topics increases
- When topics become old, the count drops

Computational Lexicography

- As new multi-word expressions become part of the language, they can be detected
- Existing collocations can be acquired
- Can also be used for cultural identification
- Examples:
- My friend got an A in his class
- My friend took an A in his class
- My friend made an A in his class
- My friend earned an A in his class

Natural Language Generation

- Problem:
- Given two (or more) possible productions, which is more feasible?
- Productions usually involve synonyms or near-synonyms
- Languages generally favour one production

Machine Translation

- Collocation-complete problem?
- Must find all used collocations
- Must parse collocation as a unit
- Must translate collocation as a unit
- In target language production, must select among many plausible alternatives

Thanks!

- Questions?

Statistical inference

- Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.

Language Models

- Predict the next word, given the previous words

(this sort of task is often referred to as a Shannon game)

- A language model can take the context into account.
- Determine probability of different sequences by examining training corpus
- Applications:
- OCR / Speech recognition – resolve ambiguity
- Spelling correction
- Machine translation etc

Statistical Estimators

- Example:

Corpus: five Jane Austen novels

N = 617,091 words, V = 14,585 unique words

Task: predict the next word of the trigram “inferior to ___”

from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”

- Given the observed training data …
- How do you develop a model (probability distribution) to predict future events?

The Perfect Language Model

- Sequence of word forms
- Notation: W = (w1,w2,w3,...,wn)
- The big (modeling) question is what is p(W)?
- Well, we know (Bayes/chain rule):

$$p(W) = p(w_1, w_2, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$$

- Not practical (even for short W → too many parameters)

Markov Chain

- Unlimited memory (cf. previous foil):
- for wi, we know its predecessors w1,w2,w3,...,wi-1
- Limited memory:
- we disregard predecessors that are “too old”
- remember only k previous words: wi-k,wi-k+1,...,wi-1
- called “kth order Markov approximation”
- Stationary character (no change over time):

$$p(W) \approx \prod_{i=1}^{n} p(w_i \mid w_{i-k}, w_{i-k+1}, \ldots, w_{i-1}), \qquad n = |W|$$

N-gram Language Models

- (n-1)th order Markov approximation → n-gram LM:

$$p(W) = \prod_{i=1}^{|W|} p(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$$

- In particular (assume vocabulary |V| = 20k):

0-gram LM: uniform model p(w) = 1/|V|, 1 parameter

1-gram LM: unigram model p(w), $2 \times 10^{4}$ parameters

2-gram LM: bigram model p(wi|wi-1), $4 \times 10^{8}$ parameters

3-gram LM: trigram model p(wi|wi-2,wi-1), $8 \times 10^{12}$ parameters

4-gram LM: tetragram model p(wi|wi-3,wi-2,wi-1), $1.6 \times 10^{17}$ parameters
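The parameter counts in the list are simply |V| raised to the n-gram order. A small sketch (the function name `ngram_parameters` is hypothetical) makes the growth explicit and can be reused for other vocabulary sizes:

```python
def ngram_parameters(vocab_size, n):
    """Number of probabilities an n-gram model over vocab_size words
    has to store; the 0-gram (uniform) model needs a single value."""
    return vocab_size ** n if n > 0 else 1

for order in range(5):
    print(order, ngram_parameters(20_000, order))
```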

Reliability vs. Discrimination

“large green ___________”

tree? mountain? frog? car?

“swallowed the large green ________”

pill? tidbit?

- larger n: more information about the context of the specific instance (greater discrimination)
- smaller n: more instances in training data, better statistical estimates (more reliability)

LM Observations

- How large n?
- no finite n is enough (theoretically)
- but anyway: as much as possible (as close to “perfect” model as possible)
- empirically: 3
- parameter estimation? (reliability, data availability, storage space, ...)
- 4 is too much: |V|=60k → $1.296 \times 10^{19}$ parameters
- but: 6-7 would be (almost) ideal (having enough data)
- For now, word forms only (no “linguistic” processing)

Parameter Estimation

- Parameter: numerical value needed to compute p(w|h)
- From data (how else?)
- Data preparation:
- get rid of formatting etc. (“text cleaning”)
- define words (separate but include punctuation, call it “word”, unless speech)
- define sentence boundaries (insert “words” <s> and </s>)
- letter case: keep, discard, or be smart:
- name recognition
- number type identification

Maximum Likelihood Estimate

- MLE: Relative Frequency...
- ...best predicts the data at hand (the “training data”)
- See (Ney et al. 1997) for a proof that the relative frequency really is the maximum likelihood estimate.
- Trigrams from Training Data T:
- count sequences of three words in T: $C_3(w_{i-2}, w_{i-1}, w_i)$
- count sequences of two words in T: $C_2(w_{i-2}, w_{i-1})$

$$P_{MLE}(w_{i-2}, w_{i-1}, w_i) = C_3(w_{i-2}, w_{i-1}, w_i) / N$$

$$P_{MLE}(w_i \mid w_{i-2}, w_{i-1}) = C_3(w_{i-2}, w_{i-1}, w_i) / C_2(w_{i-2}, w_{i-1})$$

Character Language Model

- Use individual characters instead of words:

Same formulas and methods

- Might consider 4-grams, 5-grams or even more
- Good for cross-language comparisons
- Transform cross-entropy between letter- and word-based models:

$$H_S(p_c) = H_S(p_w) / (\text{average number of characters per word in } S)$$

$$p(W) =_{df} \prod_{i=1}^{n} p(c_i \mid c_{i-n+1}, c_{i-n+2}, \ldots, c_{i-1})$$

LM: an Example

- Training data:

<s0> <s> He can buy you the can of soda </s>

- Unigram: (8 word tokens, 7 distinct words)

p1(He) = p1(buy) = p1(you) = p1(the) = p1(of) = p1(soda) = .125, p1(can) = .25

- Bigram:

p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,

p2(you |buy) = 1,...

- Trigram:

p3(He|<s0>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(</s>|of,soda) = 1.

- Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0

LM: an Example (The Problem)

- Cross-entropy:
- S = <s0> <s> It was the greatest buy of all </s>
- Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
- all unigrams but p1(the), p1(buy), and p1(of) are 0.
- all bigram probabilities are 0.
- all trigram probabilities are 0.
- Need to make all “theoretically possible” probabilities non-zero.

LM: Another Example

- Training data S: |V| =11 (not counting <s> and </s>)
- <s> John read Moby Dick </s>
- <s> Mary read a different book </s>
- <s> She read a book by Cher </s>
- Bigram estimates:
- P(She | <s>) = C(<s> She) / Σw C(<s> w) = 1/3
- P(read | She) = C(She read) / Σw C(She w) = 1
- P(Moby | read) = C(read Moby) / Σw C(read w) = 1/3
- P(Dick | Moby) = C(Moby Dick) / Σw C(Moby w) = 1
- P(</s> | Dick) = C(Dick </s>) / Σw C(Dick w) = 1
- p(She read Moby Dick) =

p(She | <s>) p(read | She) p(Moby | read) p(Dick | Moby) p(</s> | Dick) = 1/3 × 1 × 1/3 × 1 × 1 = 1/9
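The bigram estimates for the three training sentences can be reproduced with a short sketch. It uses exact fractions so the result can be compared with the hand calculation; the helper names `p_mle` and `sentence_prob` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]

bigram_counts = Counter()
history_counts = Counter()   # how often each word appears as a bigram history
for sentence in corpus:
    tokens = sentence.split()
    for h, w in zip(tokens, tokens[1:]):
        bigram_counts[(h, w)] += 1
        history_counts[h] += 1

def p_mle(w, h):
    """Relative-frequency estimate of p(w | h)."""
    return Fraction(bigram_counts[(h, w)], history_counts[h])

def sentence_prob(sentence):
    """Product of bigram probabilities, including the boundary symbols.
    Returns 0 if any history was never seen in training."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = Fraction(1)
    for h, w in zip(tokens, tokens[1:]):
        if history_counts[h] == 0:
            return Fraction(0)
        prob *= p_mle(w, h)
    return prob

print(sentence_prob("She read Moby Dick"))
print(sentence_prob("Cher read Moby Dick"))
```

The second call returns zero because the bigram "<s> Cher" never occurs in the training data, which is the problem that smoothing addresses.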

The Zero Problem

- “Raw” n-gram language model estimate:
- necessarily, there will be some zeros
- Often trigram model → $2.16 \times 10^{14}$ parameters, data ~ $10^{9}$ words
- which are true zeros?
- optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams (hapax legomena = only-once term => uniqueness)
- optimal situation cannot happen, unfortunately (question: how much data would we need?)
- we don’t know; hence, we eliminate them.
- Different kinds of zeros: p(w|h) = 0, p(w) = 0

Why do we need non-zero probabilities?

- Avoid infinite Cross Entropy:
- happens when an event is found in the test data which has not been seen in training data
- Make the system more robust
- low count estimates:
- they typically happen for “detailed” but relatively rare appearances
- high count estimates: reliable but less “detailed”

Eliminating the Zero Probabilities: Smoothing

- Get new p’(w) (same Ω): almost p(w) except for eliminating zeros
- Discount w for (some) p(w) > 0: new p’(w) < p(w)

$$\sum_{w \in \text{discounted}} (p(w) - p'(w)) = D$$

- Distribute D to all w with p(w) = 0: new p’(w) > p(w)
- possibly also to other w with low p(w)
- For some w (possibly): p’(w) = p(w)
- Make sure $\sum_{w \in \Omega} p'(w) = 1$
- There are many ways of smoothing
- There are many ways of smoothing

Laplace’s Law: Smoothing by Adding 1

- Laplace’s Law:
- PLAP(w1,..,wn)=(C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of n-gram w1,..,wn, N is the number of training instances, and B is the number of bins training instances are divided into (vocabulary size)
- Problem if B > C(W) (can be the case; even >> C(W))
- PLAP(w | h) = (C(h,w) + 1) / (C(h) + B)
- The idea is to give a little bit of the probability space to unseen events.

Add 1 Smoothing Example

- pMLE(Cher read Moby Dick) =

p(Cher | <s>) p(read | Cher) p(Moby | read) p(Dick | Moby) p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

- p(Cher | <s>) = (1 + C(<s> Cher))/(11 + C(<s>)) = (1 + 0) / (11 + 3) = 1/14 = .0714
- p(read | Cher) = (1 + C(Cher read))/(11 + C(Cher)) = (1 + 0) / (11 + 1) = 1/12 = .0833
- p(Moby | read) = (1 + C(read Moby))/(11 + C(read)) = (1 + 1) / (11 + 3) = 2/14 = .1429
- P(Dick | Moby) = (1 + C(Moby Dick))/(11 + C(Moby)) = (1 + 1) / (11 + 1) = 2/12 = .1667
- P(</s> | Dick) = (1 + C(Dick </s>))/(11 + C(Dick)) = (1 + 1) / (11 + 1) = 2/12 = .1667
- p’(Cher read Moby Dick) =

p(Cher | <s>) p(read | Cher) p(Moby | read) p(Dick | Moby) p(</s> | Dick) = 1/14 × 1/12 × 2/14 × 2/12 × 2/12 = 2.36e-5

Objections to Laplace’s Law

- For NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.
- Worse at predicting the actual probabilities of bigrams with zero counts than other methods.
- Count variances are actually greater than the MLE.

Lidstone’s Law

- P = probability of specific n-gram
- C = count of that n-gram in training data
- N = total n-grams in training data
- B = number of “bins” (possible n-grams)
- λ = small positive number
- $P_{Lid}(w_1, \ldots, w_n) = (C(w_1, \ldots, w_n) + \lambda) / (N + B\lambda)$
- M.L.E.: λ = 0
- Laplace’s Law: λ = 1
- Jeffreys-Perks Law: λ = ½
- $P_{Lid}(w \mid h) = (C(h,w) + \lambda) / (C(h) + B\lambda)$
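Add-one and add-λ smoothing differ only in the constant added to each count, so one function covers both. The sketch below reuses the three training sentences from the earlier bigram example and assumes B = 11 bins (the vocabulary size without the boundary symbols, as in the worked examples). The function name `p_lidstone` is hypothetical.

```python
from collections import Counter

corpus = [
    "<s> John read Moby Dick </s>",
    "<s> Mary read a different book </s>",
    "<s> She read a book by Cher </s>",
]

bigram_counts = Counter()
history_counts = Counter()   # C(h): how often h occurs as a bigram history
for sentence in corpus:
    tokens = sentence.split()
    for h, w in zip(tokens, tokens[1:]):
        bigram_counts[(h, w)] += 1
        history_counts[h] += 1

B = 11  # assumed number of bins, taken from the worked example

def p_lidstone(w, h, lam):
    """(C(h, w) + lam) / (C(h) + B * lam).
    lam = 1 is Laplace's law, lam = 0.5 is Jeffreys-Perks, lam = 0 is the MLE
    (which divides by zero for an unseen history)."""
    return (bigram_counts[(h, w)] + lam) / (history_counts[h] + B * lam)

print(p_lidstone("Cher", "<s>", 1))     # unseen bigram under add-one
print(p_lidstone("Cher", "<s>", 0.5))   # same bigram with lam = 0.5
print(p_lidstone("Moby", "read", 1))    # seen bigram, discounted from 1/3
```

A smaller λ moves less probability mass from seen to unseen bigrams, which is why the estimate for the seen bigram "read Moby" stays closer to its MLE value of 1/3 when λ = 0.5.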

Objections to Lidstone’s Law

- Need an a priori way to determine λ.
- Predicts all unseen events to be equally likely.
- Gives probability estimates linear in the M.L.E. frequency.

Lidstone’s Law with λ = .5

- pMLE(Cher read Moby Dick) =

p(Cher | <s>) p(read | Cher) p(Moby | read) p(Dick | Moby) p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

- p(Cher | <s>) = (.5 + C(<s> Cher))/(.5 × 11 + C(<s>)) = (.5 + 0) / (.5 × 11 + 3) = .5/8.5 = .0588
- p(read | Cher) = (.5 + C(Cher read))/(.5 × 11 + C(Cher)) = (.5 + 0) / (.5 × 11 + 1) = .5/6.5 = .0769
- p(Moby | read) = (.5 + C(read Moby))/(.5 × 11 + C(read)) = (.5 + 1) / (.5 × 11 + 3) = 1.5/8.5 = .1765
- P(Dick | Moby) = (.5 + C(Moby Dick))/(.5 × 11 + C(Moby)) = (.5 + 1) / (.5 × 11 + 1) = 1.5/6.5 = .2308
- P(</s> | Dick) = (.5 + C(Dick </s>))/(.5 × 11 + C(Dick)) = (.5 + 1) / (.5 × 11 + 1) = 1.5/6.5 = .2308
- p’(Cher read Moby Dick) =

p(Cher | <s>) p(read | Cher) p(Moby | read) p(Dick | Moby) p(</s> | Dick) = .5/8.5 × .5/6.5 × 1.5/8.5 × 1.5/6.5 × 1.5/6.5 = 4.25e-5

Held-Out Estimator

- How much of the probability distribution should be reserved to allow for previously unseen events?
- Can validate choice by holding out part of the training data.
- How often do events seen (or not seen) in training data occur in validation data?
- Held out estimator by Jelinek and Mercer (1985)

Held Out Estimator

- For each n-gram w1,..,wn, compute C1(w1,..,wn) and C2(w1,..,wn), the frequencies of w1,..,wn in training and held-out data, respectively.
- Let Nr be the number of n-grams with frequency r in the training text.
- Let Tr be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data.
- Then the average held-out frequency of the frequency-r n-grams is Tr/Nr
- An estimate for the probability of one of these n-grams is: Pho(w1,..,wn) = (Tr/Nr)/N
- where C(w1,..,wn) = r
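The held-out estimator can be sketched on a toy example. For brevity the sketch uses unigrams (n = 1) over a tiny closed vocabulary; the data are made up, the function name `held_out_probs` is hypothetical, and N is taken to be the number of held-out tokens, which is the choice that makes the estimates sum to one in this setup.

```python
from collections import Counter

def held_out_probs(train_tokens, held_tokens, vocab):
    """P_ho(w) = T_r / (N_r * N), where r is the training count of w."""
    c_train = Counter(train_tokens)
    c_held = Counter(held_tokens)
    n_r = Counter()   # N_r: number of types with training count r
    t_r = Counter()   # T_r: held-out occurrences of those types
    for w in vocab:
        r = c_train[w]
        n_r[r] += 1
        t_r[r] += c_held[w]
    n = sum(t_r.values())   # held-out tokens (assumed normalizer)
    return {w: t_r[c_train[w]] / (n_r[c_train[w]] * n) for w in vocab}

vocab = ["a", "b", "c", "d"]
train = "a a b c".split()       # d is unseen in training
held = "a b d d".split()
probs = held_out_probs(train, held, vocab)
print(probs)
```

Types with the same training count share one estimate, and the unseen type "d" receives a non-zero probability because unseen types did occur in the held-out data.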

Testing Models

- Divide data into training and testing sets.
- Training data: divide into normal training plus validation (smoothing) sets: around 10% for validation (fewer parameters typically)
- Testing data: distinguish between the “real” test set and a development set.

Cross-Validation

- Held out estimation is useful if there is a lot of data available. If not, it may be better to use each part of the data both as training data and held out data.
- Deleted Estimation [Jelinek & Mercer, 1985]
- Leave-One-Out [Ney et al., 1997]

Divide training data into 2 parts

- Train on A, validate on B
- Train on B, validate on A
- Combine two models

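[Figure: the training data is split into two parts, A and B. Model 1 is trained on A and validated on B; Model 2 is validated on A and trained on B; Model 1 + Model 2 = Final Model.]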

Deleted Estimation- Use data for both training and validation

Cross-Validation

Two estimates:

$N_r^{a}$ = number of n-grams occurring r times in the a-th part of the training set

$T_r^{ab}$ = total number of occurrences of those n-grams in the b-th part

Combined estimate:

(arithmetic mean)

Leave One Out

- Primary training Corpus is of size N-1 tokens.
- 1 token is used as held out data for a sort of simulated testing.
- The process is repeated N times so that each piece of data is left out in turn.
- Advantage: it explores how the model would change if any particular piece of data had not been observed

Good-Turing Estimation

- Intuition: re-estimate the amount of mass assigned to n-grams with low (or zero) counts using the number of n-grams with higher counts. For any n-gram that occurs r times, we should assume that it occurs r* times, with $r^* = (r+1) N_{r+1} / N_r$, where Nr is the number of n-grams occurring precisely r times in the training data.
- To convert the count to a probability, we normalize the n-gram with r counts as: $P_{GT} = r^* / N$

Good-Turing Estimation

- Note that N is equal to the original number of counts in the distribution.
- Makes the assumption of a binomial distribution, which works well for large amounts of data and a large vocabulary despite the fact that words and n-grams do not have that distribution.

Good-Turing Estimation

- Note that the estimate cannot be used if Nr=0; hence, it is necessary to smooth the Nr values.
- The estimate can be written as:
- If C(w1,..,wn) = r > 0, $P_{GT}(w_1, \ldots, w_n) = r^*/N$ where $r^* = (r+1) S(r+1) / S(r)$ and S(r) is a smoothed estimate of the expectation of Nr.
- If C(w1,..,wn) = 0, $P_{GT}(w_1, \ldots, w_n) \approx (N_1 / N_0) / N$
- In practice, counts with a frequency greater than five are assumed reliable, as suggested by Katz.
- In practice, this method is not used by itself because it does not use lower order information to estimate probabilities of higher order n-grams.
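The adjusted counts can be sketched with the raw counts-of-counts, without the smoothing of Nr that a practical implementation needs. In the sketch below the toy counts are made up, the function name `good_turing` is hypothetical, and falling back to the unadjusted count when N(r+1) is zero is an assumption made only to keep the example runnable.

```python
from collections import Counter

def good_turing(ngram_counts):
    """Return ({r: r*}, probability mass reserved for unseen n-grams).
    r* = (r + 1) * N_{r+1} / N_r, using unsmoothed N_r values."""
    n_r = Counter(ngram_counts.values())      # counts of counts
    total = sum(ngram_counts.values())        # N
    adjusted = {}
    for r in sorted(n_r):
        if n_r[r + 1] > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)            # assumed fallback: no N_{r+1} available
    unseen_mass = n_r[1] / total              # N_1 / N
    return adjusted, unseen_mass

# Toy counts (made up): three n-grams seen once, two seen twice, one seen three times.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
r_star, p_unseen = good_turing(toy_counts)
print(r_star, p_unseen)
```

The fallback for the highest count shows why the Nr values have to be smoothed in practice: the raw estimate is undefined as soon as N(r+1) is zero.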

Good-Turing Estimation

- N-grams with low counts are often treated as if they had a count of 0.
- In practice r* is used only for small counts; counts greater than k=5 are assumed to be reliable: r* = r if r > k; otherwise, for 1 ≤ r ≤ k: $r^* = \dfrac{(r+1) N_{r+1}/N_r - r (k+1) N_{k+1}/N_1}{1 - (k+1) N_{k+1}/N_1}$

Discounting Methods

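- Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant δ when C(w1, w2, …, wn) = r: $P_{abs} = (r - \delta)/N$ if r > 0, and $(B - N_0)\delta / (N_0 N)$ otherwise
- Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion when C(w1, w2, …, wn) = r: $P = (1 - \alpha)\, r/N$ if r > 0, and $\alpha / N_0$ otherwise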

Combining Estimators: Overview

- If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model.
- Some combination methods:
- Katz’s Back Off
- Simple Linear Interpolation
- General Linear Interpolation

Backoff

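- Back off to lower order n-gram if we have no evidence for the higher order form. Trigram backoff: use $P(w_i \mid w_{i-2} w_{i-1})$ if $C(w_{i-2} w_{i-1} w_i) > 0$; otherwise $\alpha_1 P(w_i \mid w_{i-1})$ if $C(w_{i-1} w_i) > 0$; otherwise $\alpha_2 P(w_i)$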

Katz’s Back Off Model

- If the n-gram of concern has appeared more than k times, then an n-gram estimate is used but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams).
- If the n-gram occurred k times or less, then we will use an estimate from a shorter n-gram (back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate.
- The process continues recursively.

Katz’s Back Off Model

- Katz used Good-Turing estimates when an n-gram appeared k or fewer times.

Problems with Backing-Off

- If bigram w1 w2 is common, but trigram w1 w2 w3 is unseen, it may be a meaningful gap, rather than a gap due to chance and scarce data.
- i.e., a “grammatical null”
- In that case, it may be inappropriate to back-off to lower-order probability.

Linear Interpolation

- One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness.
- This can be done by linear interpolation (also called finite mixture models).
- The weights can be set using the Expectation-Maximization (EM) algorithm.

Simple Interpolated Smoothing

- Add information from less detailed distributions using λ = (λ0, λ1, λ2, λ3):

$$p'_\lambda(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, p_3(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, p_2(w_i \mid w_{i-1}) + \lambda_1\, p_1(w_i) + \lambda_0 / |V|$$

- Normalize:

$\lambda_i > 0$, $\sum_{i=0}^{n} \lambda_i = 1$ is sufficient ($\lambda_0 = 1 - \sum_{i=1}^{n} \lambda_i$) (n=3)

- Estimation using MLE:
- fix the p3, p2, p1 and |V| parameters as estimated from the training data
- then find the {λi} that minimize the cross entropy (maximize the probability of the data): $-(1/|D|) \sum_{i=1}^{|D|} \log_2 p'_\lambda(w_i \mid h_i)$

Held Out Data

- What data to use?
- try the training data T: but we will always get λ3 = 1
- why? (let $p_{iT}$ be an i-gram distribution estimated using T)
- minimizing $H_T(p'_\lambda)$ over a vector λ, $p'_\lambda = \lambda_3 p_{3T} + \lambda_2 p_{2T} + \lambda_1 p_{1T} + \lambda_0 / |V|$
- remember: $H_T(p'_\lambda) = H(p_{3T}) + D(p_{3T} \| p'_\lambda)$; ($p_{3T}$ fixed → $H(p_{3T})$ fixed, best)
- thus: do not use the training data for estimation of λ!
- must hold out part of the training data (heldout data, H):
- ...call the remaining data the (true/raw) training data, T
- the test data S (e.g., for evaluation purposes): still different data!

The Formulas

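- Repeat: minimizing $-(1/|H|) \sum_{i=1}^{|H|} \log_2 p'_\lambda(w_i \mid h_i)$ over λ

$$p'_\lambda(w_i \mid h_i) = p'_\lambda(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, p_3(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, p_2(w_i \mid w_{i-1}) + \lambda_1\, p_1(w_i) + \lambda_0 / |V|$$

- “Expected Counts (of lambdas)”: j = 0..3

$$c(\lambda_j) = \sum_{i=1}^{|H|} \frac{\lambda_j\, p_j(w_i \mid h_i)}{p'_\lambda(w_i \mid h_i)}$$

- “Next λ”: j = 0..3

$$\lambda_{j,next} = \frac{c(\lambda_j)}{\sum_{k=0}^{3} c(\lambda_k)}$$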

The (Smoothing) EM Algorithm

1. Start with some λ, such that λj > 0 for all j ∈ 0..3.

2. Compute “Expected Counts” for each λj.

3. Compute new set of λj, using the “Next λ” formula.

4. Start over at step 2, unless a termination condition is met.

- Termination condition: convergence of λ.
- Simply set an ε, and finish if |λj - λj,next| < ε for each j (step 3).
- Guaranteed to converge:

follows from Jensen’s inequality, plus a technical proof.

Example

- Raw distribution (unigram; smooth with uniform):

p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z

- Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
- Start with λ1 = .5; p’λ(b) = .5 × .5 + .5 / 26 = .27

p’λ(a) = .5 × .25 + .5 / 26 = .14

p’λ(y) = .5 × 0 + .5 / 26 = .02

c(λ1) = .5 × .5/.27 + .5 × .25/.14 + .5 × .5/.27 + .5 × 0/.02 = 2.72

c(λ0) = .5 × .04/.27 + .5 × .04/.14 + .5 × .04/.27 + .5 × .04/.02 = 1.28

Normalize: λ1,next = .68, λ0,next = .32.

Repeat from step 2 (recompute p’λ first for efficient computation, then c(λi), ...)

Finish when new lambdas differ little from the previous ones (say, < 0.01 difference).
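The iteration in this example can be sketched as follows. The code mixes a fixed unigram distribution with the uniform distribution over 26 letters and re-estimates the two weights on the held-out string "baby". The function names `em_step` and `fit`, and the cap on the number of iterations, are choices made for this sketch.

```python
import string

ALPHABET = string.ascii_lowercase

# Raw unigram distribution from the example.
p_unigram = {ch: 0.0 for ch in ALPHABET}
p_unigram["a"] = 0.25
p_unigram["b"] = 0.5
for ch in "cdefghijklmnopqr":
    p_unigram[ch] = 1 / 64
p_uniform = {ch: 1 / 26 for ch in ALPHABET}

def em_step(lambdas, models, heldout):
    """One EM update: expected counts per weight, then normalize."""
    counts = [0.0] * len(lambdas)
    for symbol in heldout:
        mixture = sum(l * m[symbol] for l, m in zip(lambdas, models))
        for j, (l, m) in enumerate(zip(lambdas, models)):
            counts[j] += l * m[symbol] / mixture
    total = sum(counts)
    return [c / total for c in counts]

def fit(lambdas, models, heldout, eps=0.01, max_iter=100):
    """Repeat EM steps until every weight changes by less than eps."""
    for _ in range(max_iter):
        new = em_step(lambdas, models, heldout)
        if all(abs(a - b) < eps for a, b in zip(lambdas, new)):
            return new
        lambdas = new
    return lambdas

models = [p_unigram, p_uniform]               # order: lambda_1, lambda_0
first_step = em_step([0.5, 0.5], models, "baby")
fitted = fit([0.5, 0.5], models, "baby")
print([round(x, 2) for x in first_step])
print([round(x, 2) for x in fitted])
```

The first update reproduces the weights from the hand calculation; later updates change the weights by smaller and smaller amounts until the threshold is met.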

p2(wi | wi-1) = λ1 p1(wi) + λ2 / |V|

Witten-Bell Smoothing

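- The nth order smoothed model is defined recursively as:

$$p_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{WB}(w_i \mid w_{i-n+2}^{i-1})$$

- To compute $\lambda_{w_{i-n+1}^{i-1}}$, we need the number of unique words that have that history.

Witten-Bell Smoothing

- The number of words that follow the history and have one or more counts is:

$$N_{1+}(w_{i-n+1}^{i-1}\, \bullet) = |\{ w_i : c(w_{i-n+1}^{i-1} w_i) > 0 \}|$$

- We can assign the parameters such that:

$$1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{N_{1+}(w_{i-n+1}^{i-1}\, \bullet)}{N_{1+}(w_{i-n+1}^{i-1}\, \bullet) + \sum_{w_i} c(w_{i-n+1}^{i})}$$

Witten-Bell Smoothing

- Substituting into the first equation, we get:

$$p_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}\, \bullet)\, p_{WB}(w_i \mid w_{i-n+2}^{i-1})}{\sum_{w_i} c(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}\, \bullet)}$$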

General Linear Interpolation

- In simple linear interpolation, the weights are just a single number, but one can define a more general and powerful model where the weights are a function of the history.
- Need some way to group or bucket lambda histories.

Reference

- www.cs.tau.ac.il/~nachumd/NLP
- http://www.cs.sfu.ca
- http://min.ecn.purdue.edu
- Manning and Schutze
- Jurafsky and Martin
