Collocations

Definition of collocation wrt corpus literature
Definition Of Collocation (wrt Corpus Literature)

  • A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]


Word collocations
Word Collocations

  • Collocation

    • Firth: “You shall know a word by the company it keeps”; the collocations of a given word are statements of the habitual or customary places of that word.

    • non-compositionality of meaning

      • cannot be derived directly from its parts (heavy rain)

    • non-substitutability in context

      • for parts (make a decision)

    • non-modifiability (& non-transformability)

      • kick the yellow bucket; take exceptions to


Collocations1
Collocations

  • Collocations are not necessarily adjacent

  • Collocations cannot be directly translated into other languages.


Example classes
Example Classes

  • Names

  • Technical Terms

  • “Light” Verb Constructions

  • Phrasal verbs

  • Noun Phrases


Linguistic subclasses of collocations
Linguistic Subclasses of Collocations

  • Light verbs: verbs with little semantic content like make, take, do

  • Terminological Expressions: concepts and objects in technical domains (e.g., hard drive)

  • Idioms: fixed phrases

    • kick the bucket, birds-of-a-feather, run for office

  • Proper names: difficult to recognize even with lists

    • Tuesday (person’s name), May, Winston Churchill, IBM, Inc.

  • Numerical expressions

    • containing “ordinary” words

      • Monday Oct 04 1999, two thousand seven hundred fifty

  • Verb particle constructions or Phrasal Verbs

    • Separable parts:

      • look up, take off, tell off


    Collocation Detection Techniques

    • Selection of Collocations by Frequency

    • Selection of Collocations based on the mean and variance of the distance between the focal word and the collocating word

    • Hypothesis Testing

    • Pointwise Mutual Information


    Frequency

    • Technique:

      • Count the number of times a bigram co-occurs

      • Extract top counts and report them as candidates

    • Results:

      • Corpus: New York Times

        • August – November, 1990

      • Extremely uninteresting: the top-ranked bigrams are dominated by function-word pairs such as “of the” and “in the” (a counting sketch follows below)
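
    • A minimal counting sketch of this technique (assumes a pre-tokenized corpus; the function and variable names are illustrative, not from the original slides):

      from collections import Counter

      def frequency_candidates(tokens, top_n=10):
          """Count adjacent bigrams and return the most frequent ones as collocation candidates."""
          bigrams = Counter(zip(tokens, tokens[1:]))
          return bigrams.most_common(top_n)

      # Toy usage; on real text (e.g., a newspaper corpus) the top pairs are
      # dominated by function words such as ("of", "the").
      tokens = "the price of the house and the size of the house".split()
      print(frequency_candidates(tokens, top_n=3))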


    Frequency with tag filters technique
    Frequency with Tag Filters Technique

    • Technique:

      • Count the number of times a bigram co-occurs

      • Tag candidates for POS

      • Pass all candidates through POS filter, considering only ones matching filter

      • Extract top counts and report them as candidates
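
    • A sketch of the filtering step (the two-word patterns follow the well-known Justeson & Katz filter; the simplified tags A/N/D/V and the helper name are assumptions of this sketch):

      from collections import Counter

      # Justeson & Katz-style two-word patterns: adjective-noun and noun-noun
      # (three-word patterns such as A A N, A N N, N P N are handled analogously).
      GOOD_PATTERNS = {("A", "N"), ("N", "N")}

      def filtered_candidates(tagged_tokens, top_n=10):
          """tagged_tokens: list of (word, simplified_tag) pairs, e.g. ("hard", "A")."""
          counts = Counter()
          for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
              if (t1, t2) in GOOD_PATTERNS:
                  counts[(w1, w2)] += 1
          return counts.most_common(top_n)

      tagged = [("the", "D"), ("hard", "A"), ("drive", "N"), ("failed", "V")]
      print(filtered_candidates(tagged))   # [(("hard", "drive"), 1)]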



    Mean and variance smadja et al 1993
    Mean and Variance (Smadja et al., 1993)

    • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example,

      • Knock and door may not occur at a fixed distance from each other

    • One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.



    Example knock and door
    Example: Knock and Door

    • She knocked on his door.

    • They knocked at the door.

    • 100 women knocked on the big red door.

    • A man knocked on the metal front door.

    • Average offset between knock and door:

      (3 + 3 + 5 + 5)/ 4 = 4

    • Sample variance:

      ((3 − 4)² + (3 − 4)² + (5 − 4)² + (5 − 4)²)/(4 − 1) = 4/3 ≈ 1.33, so the sample standard deviation is s = √(4/3) ≈ 1.15


    Mean and variance
    Mean and Variance

    • Technique (bigram at distance)

      • Produce all possible pairs in a window

      • Consider all pairs in window as candidates

      • Keep data about distance of one word from another

      • Count the number of times each candidate occurs

    • Measures:

      • Mean: average offset (possibly negative)

        • Whether two words are related to each other

      • Standard deviation s (or variance s²) of the offset

        • Measures the variability in the relative position of the two words
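
    • A sketch that reproduces the knock/door numbers from the earlier example (tokenization is assumed to be already done, and “knocked” is not lemmatized to “knock” here):

      from statistics import mean, stdev

      def offsets(sentences, w1, w2, window=5):
          """Collect signed distances (position of w2 minus position of w1) within a window."""
          out = []
          for sent in sentences:
              for i, word in enumerate(sent):
                  if word != w1:
                      continue
                  for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                      if j != i and sent[j] == w2:
                          out.append(j - i)
          return out

      sents = [
          "she knocked on his door".split(),
          "they knocked at the door".split(),
          "100 women knocked on the big red door".split(),
          "a man knocked on the metal front door".split(),
      ]
      d = offsets(sents, "knocked", "door")
      print(d, mean(d), stdev(d))   # [3, 3, 5, 5]  mean 4  sample std dev ~1.15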


    Mean and variance illustration
    Mean and Variance Illustration

    • Candidate Generation example:

      • Window: 3

    • Used to find collocations with long-distance relationships



    Hypothesis testing overview
    Hypothesis Testing: Overview

    • Two (or more) words co-occur a lot

    • Is a candidate a true collocation, or a (not-at-all-interesting) phantom?


    The t test intuition
    The t test Intuition

    • Intuition:

      • Compute chance occurrence and ensure observed is significantly higher

      • Imagine taking many random permutations of the words in the corpus

      • Is the bigram observed significantly more often than it would be across such permutations (i.e., under independence)?

    • Assumptions:

      • H0 is the null hypothesis (words occur independently)

        • P(w1, w2) = P(w1) P(w2)

      • Distribution is “normal”


    The t test formula
    The t test Formula

    • Measures:

      • x̄ = sample mean = (bigram count)/N

      • μ = expected mean under H0 = P(w1) P(w2)

      • s² ≈ x̄ (since p(1 − p) ≈ p for small p)

      • N = total number of bigrams

      • t = (x̄ − μ) / √(s²/N)

    • Result:

      • Number to look up in a table

      • Degree of confidence that collocation is not created by chance

        • α = the significance level at which one can reject H0
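
    • A sketch of the computation, using the measures above (the example counts are the “new companies” figures from Manning & Schütze; the function name is illustrative):

      import math

      def t_score(c1, c2, c12, N):
          """t statistic for bigram w1 w2: c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), N = number of bigrams."""
          x_bar = c12 / N                # sample mean (MLE of the bigram probability)
          mu = (c1 / N) * (c2 / N)       # expected mean under H0: P(w1) P(w2)
          s2 = x_bar                     # variance approximation, since p(1 - p) ~ p for small p
          return (x_bar - mu) / math.sqrt(s2 / N)

      # "new companies": t ~ 1.0, well below the 2.576 cutoff, so H0 cannot be rejected.
      print(t_score(c1=15828, c2=4675, c12=8, N=14307668))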



    The t test criticism
    The t test Criticism

    • Words are not normally distributed

      • Can reject valid collocation

    • Not good on sparse data


    χ² Intuition

    • Pearson’s chi-square test

    • Intuition

      • Compare observed frequencies to expected frequencies for independence

    • Assumptions

      • Does not assume normally distributed probabilities; valid as long as the sample is not too small


    χ² General Formula

    • Measures:

      • Eij = expected count in cell (i, j) of the bigram’s contingency table under independence

      • Oij = observed count in cell (i, j)

      • X² = Σij (Oij − Eij)² / Eij

    • Result

      • A number to look up in a table (like the t test)

      • The significance level (α) at which H0 can be rejected


    χ² Bigram Method and Formula

    • Technique for Bigrams:

      • Arrange the bigrams in a 2x2 table with counts for each

      • Formula

        • Oij: i = column; j = row
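
    • A sketch for the 2×2 case, where O is the bigram’s contingency table (rows: w1 vs. ¬w1, columns: w2 vs. ¬w2); the example numbers are the “new companies” table from Manning & Schütze:

      def chi_square_2x2(o11, o12, o21, o22):
          """Pearson's X^2 = sum_ij (O_ij - E_ij)^2 / E_ij for a 2x2 contingency table."""
          N = o11 + o12 + o21 + o22
          rows = [o11 + o12, o21 + o22]
          cols = [o11 + o21, o12 + o22]
          obs = [[o11, o12], [o21, o22]]
          total = 0.0
          for i in range(2):
              for j in range(2):
                  e = rows[i] * cols[j] / N     # expected count under independence
                  total += (obs[i][j] - e) ** 2 / e
          return total

      # O11 = C(new companies), O12 = C(new, not companies), O21 = C(not new, companies), O22 = rest
      print(chi_square_2x2(8, 15820, 4667, 14287181))   # ~1.55, below the 3.84 cutoff (alpha = .05)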


    χ² Sample Findings

    • Comparing corpora

      • Machine Translation

        • Comparing the translation pair (English) “cow” and (French) “vache” gives χ² = 456,400

      • Similarity of two corpora


    χ² Criticism

    • Not good for small datasets


    Likelihood ratios within a single corpus dunning 1993
    Likelihood Ratios Within a Single Corpus (Dunning, 1993)

    • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic.

    • In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram w1 w2:

      • H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p

      • H2: The occurrence of w2 is dependent on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2


    Likelihood ratios within a single corpus
    Likelihood Ratios Within a Single Corpus

    • Use the MLE for the probabilities p, p1, and p2 and assume the binomial distribution (with c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), and N the number of bigrams):

      • Under H1: P(w2 | w1) = c2/N = p, P(w2 | ¬w1) = c2/N = p

      • Under H2: P(w2 | w1) = c12/c1 = p1, P(w2 | ¬w1) = (c2 − c12)/(N − c1) = p2

      • Under H1: b(c12; c1, p) is the probability that c12 of the c1 occurrences of w1 are followed by w2, and b(c2 − c12; N − c1, p) is the probability that c2 − c12 of the other N − c1 positions are followed by w2

      • Under H2: the same two terms become b(c12; c1, p1) and b(c2 − c12; N − c1, p2), using the dependent probabilities p1 and p2


    Likelihood ratios within a single corpus1
    Likelihood Ratios Within a Single Corpus

    • The likelihood of H1

      • L(H1) = b(c12; c1, p)b(c2-c12; N-c1, p) (likelihood of independence)

    • The likelihood of H2

      • L(H2) = b(c12; c1, p1)b(c2- c12; N-c1, p2) (likelihood of dependence)

    • The log of likelihood ratio

      • log λ = log [L(H1) / L(H2)] = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − log b(c12; c1, p1) − log b(c2 − c12; N − c1, p2)

    • The quantity −2 log λ is asymptotically χ²-distributed, so we can test for significance.
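
    • A sketch of the computation (log-binomial probabilities are computed via lgamma to avoid underflow; the example counts are made up, and the sketch assumes 0 < p, p1, p2 < 1):

      from math import lgamma, log

      def log_binom(k, n, p):
          """log of b(k; n, p) = C(n, k) p^k (1 - p)^(n - k); assumes 0 < p < 1."""
          return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                  + k * log(p) + (n - k) * log(1 - p))

      def log_likelihood_ratio(c1, c2, c12, N):
          """-2 log lambda for bigram w1 w2 (asymptotically chi-square distributed)."""
          p = c2 / N
          p1 = c12 / c1
          p2 = (c2 - c12) / (N - c1)
          log_l_h1 = log_binom(c12, c1, p) + log_binom(c2 - c12, N - c1, p)
          log_l_h2 = log_binom(c12, c1, p1) + log_binom(c2 - c12, N - c1, p2)
          return -2 * (log_l_h1 - log_l_h2)

      print(log_likelihood_ratio(c1=42, c2=50, c12=20, N=1_000_000))   # large value -> strong association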


    Pointwise mutual information i
    [Pointwise] Mutual Information (I)

    • Intuition:

      • Given a collocation (w1, w2) and an observation of w1

      • I(w1; w2) indicates how much more likely it is to see w2 than if the two words were independent

      • The same measure also works in reverse (observe w2)

    • Assumptions:

      • Data is not sparse


    Mutual information formula
    Mutual Information Formula

    • Measures:

      • P(w1) = unigram prob.

      • P(w1w2) = bigram prob.

      • P (w2|w1) = probability of w2 given we see w1

    • Result:

      • Number indicating increased confidence that we will see w2 after w1
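
    • A sketch from counts, using base-2 logs so the result is in bits (the counts in the example are made up):

      from math import log2

      def pmi(c1, c2, c12, N):
          """Pointwise mutual information I(w1, w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]."""
          p1 = c1 / N          # unigram probability of w1
          p2 = c2 / N          # unigram probability of w2
          p12 = c12 / N        # bigram probability of w1 w2
          return log2(p12 / (p1 * p2))

      # 0 means independence; positive values mean that seeing w1 makes w2 more likely (and vice versa).
      print(pmi(c1=42, c2=20, c12=20, N=14_307_668))   # strongly positive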


    Mutual information criticism
    Mutual Information Criticism

    • A better measure of the independence of two words rather than the dependence of one word on another

    • Performs badly on sparse data (tends to misidentify rare co-occurrences as collocations)


    Applications
    Applications

    • Collocations are useful in:

      • Comparison of Corpora

      • Parsing

      • New Topic Detection

      • Computational Lexicography

      • Natural Language Generation

      • Machine Translation


    Comparison of corpora
    Comparison of Corpora

    • Compare corpora to determine:

      • Document clustering (for information retrieval)

      • Plagiarism

    • Comparison techniques:

      • Competing hypotheses:

        • Documents are dependent

        • Documents are independent

      • Compare hypotheses using the likelihood ratio λ, etc.


    Parsing
    Parsing

    • When parsing, we may get more accurate data by treating a collocation as a unit (rather than individual words)

      • Example: [ hand to hand ] is a unit in:

        (S (NP They)

        (VP engaged

        (PP in hand)

        (PP to

        (NP hand combat))))


    New topic detection
    New Topic Detection

    • When new topics are reported, the count of collocations associated with those topics increases

      • When topics become old, the count drops


    Computational lexicography
    Computational Lexicography

    • As new multi-word expressions become part of the language, they can be detected

      • Existing collocations can be acquired

    • Can also be used for cultural identification

      • Examples:

        • My friend got an A in his class

        • My friend took an A in his class

        • My friend made an A in his class

        • My friend earned an A in his class


    Natural language generation
    Natural Language Generation

    • Problem:

      • Given two (or more) possible productions, which is more feasible?

      • Productions usually involve synonyms or near-synonyms

      • Languages generally favour one production


    Machine translation
    Machine Translation

    • Collocation-complete problem?

      • Must find all used collocations

      • Must parse collocation as a unit

      • Must translate collocation as a unit

      • In target language production, must select among many plausible alternatives


    Thanks
    Thanks!

    • Questions?



    Statistical inference
    Statistical inference

    • Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.


    Language models
    Language Models

    • Predict the next word, given the previous words

      (this sort of task is often referred to as a Shannon game)

    • A language model can take the context into account.

    • Determine probability of different sequences by examining training corpus

    • Applications:

      • OCR / Speech recognition – resolve ambiguity

      • Spelling correction

      • Machine translation etc


    Statistical estimators
    Statistical Estimators

    • Example:

      Corpus: five Jane Austen novels

      N = 617,091 words, V = 14,585 unique words

      Task: predict the next word of the trigram “inferior to ___”

      from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”

    • Given the observed training data …

    • How do you develop a model (probability distribution) to predict future events?


    The perfect language model
    The Perfect Language Model

    • Sequence of word forms

    • Notation: W = (w1,w2,w3,...,wn)

    • The big (modeling) question is what is p(W)?

    • Well, we know (Bayes/chain rule):

      p(W) = p(w1,w2,w3,...,wn) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)

    • Not practical (even for short W → too many parameters)


    Markov chain
    Markov Chain

    • Unlimited memory (cf. previous foil):

      • for wi, we know its predecessors w1,w2,w3,...,wi-1

    • Limited memory:

      • we disregard predecessors that are “too old”

      • remember only k previous words: wi-k,wi-k+1,...,wi-1

      • called “kth order Markov approximation”

    • Stationary character (no change over time):

      p(W) ≅ ∏i=1..n p(wi | wi-k, wi-k+1, ..., wi-1), n = |W|


    N gram language models
    N-gram Language Models

    • (n−1)th order Markov approximation → n-gram LM:

      p(W) = ∏i=1..n p(wi | wi-n+1, wi-n+2, ..., wi-1)

    • In particular (assume vocabulary |V| = 20k):

      0-gram LM: uniform model p(w) = 1/|V| 1 parameter

      1-gram LM: unigram model p(w) 2×10⁴ parameters

      2-gram LM: bigram model p(wi|wi-1) 4×10⁸ parameters

      3-gram LM: trigram model p(wi|wi-2,wi-1) 8×10¹² parameters

      4-gram LM: tetragram model p(wi|wi-3,wi-2,wi-1) 1.6×10¹⁷ parameters


    Reliability vs discrimination
    Reliability vs. Discrimination

    “large green ___________”

    tree? mountain? frog? car?

    “swallowed the large green ________”

    pill? tidbit?

    • larger n: more information about the context of the specific instance (greater discrimination)

    • smaller n: more instances in training data, better statistical estimates (more reliability)


    Lm observations
    LM Observations

    • How large n?

      • no n is enough (theoretically)

      • but anyway: as much as possible (as close to “perfect” model as possible)

      • empirically: 3

        • parameter estimation? (reliability, data availability, storage space, ...)

        • 4 is too much: |V| = 60k → 1.296×10¹⁹ parameters

        • but: 6-7 would be (almost) ideal (having enough data)

    • For now, word forms only (no “linguistic” processing)


    Parameter estimation
    Parameter Estimation

    • Parameter: numerical value needed to compute p(w|h)

    • From data (how else?)

    • Data preparation:

      • get rid of formatting etc. (“text cleaning”)

      • define words (separate but include punctuation, call it “word”, unless speech)

      • define sentence boundaries (insert “words” <s> and </s>)

      • letter case: keep, discard, or be smart:

        • name recognition

        • number type identification


    Maximum likelihood estimate
    Maximum Likelihood Estimate

    • MLE: Relative Frequency...

      • ...best predicts the data at hand (the “training data”)

      • See (Ney et al. 1997) for a proof that the relative frequency really is the maximum likelihood estimate.

    • Trigrams from Training Data T:

      • count sequences of three words in T: C3(wi-2,wi-1,wi)

      • count sequences of two words in T: C2(wi-2,wi-1):

        PMLE(wi-2,wi-1,wi) = C3(wi-2,wi-1,wi) / N

        PMLE(wi|wi-2,wi-1) = C3(wi-2,wi-1,wi) / C2(wi-2,wi-1)
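
    • A counting sketch of the two estimates above (tokens are assumed to already include sentence-boundary markers; names are illustrative):

      from collections import Counter

      def mle_trigram_model(tokens):
          c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # C3(w_{i-2}, w_{i-1}, w_i)
          c2 = Counter(zip(tokens, tokens[1:]))               # C2(w_{i-2}, w_{i-1})
          n = len(tokens) - 2                                 # number of trigram events in T
          p_joint = {tri: c / n for tri, c in c3.items()}     # P_MLE(w_{i-2}, w_{i-1}, w_i)
          p_cond = {tri: c / c2[tri[:2]] for tri, c in c3.items()}   # P_MLE(w_i | w_{i-2}, w_{i-1})
          return p_joint, p_cond

      tokens = "<s> he can buy you the can of soda </s>".split()
      joint, cond = mle_trigram_model(tokens)
      print(cond[("he", "can", "buy")])   # 1.0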


    Character language model
    Character Language Model

    • Use individual characters instead of words:

      Same formulas and methods

    • Might consider 4-grams, 5-grams or even more

    • Good for cross-language comparisons

    • Transform cross-entropy between letter- and word-based models:

      HS(pc) = HS(pw) / avg. # of characters/word in S

    p(W) =df ∏i=1..n p(ci | ci-n+1, ci-n+2, ..., ci-1)


    Lm an example
    LM: an Example

    • Training data:

      <s0> <s> He can buy you the can of soda </s>

      • Unigram: (8 words in vocabulary)

        p1(He) = p1(buy) = p1(you) = p1(the) = p1(of) = p1(soda) = .125, p1(can) = .25

      • Bigram:

        p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,

        p2(you |buy) = 1,...

      • Trigram:

        p3(He|<s0>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(</s>|of,soda) = 1.

      • Entropy: H(p1) = 2.75, H(p2) = 1, H(p3) = 0


    Lm an example the problem
    LM: an Example (The Problem)

    • Cross-entropy:

    • S = <s0> <s> It was the greatest buy of all </s>

    • Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

      • all unigrams but p1(the), p1(buy), and p1(of) are 0.

      • all bigram probabilities are 0.

      • all trigram probabilities are 0.

    • Need to make all “theoretically possible” probabilities non-zero.


    Lm another example
    LM: Another Example

    • Training data S: |V| =11 (not counting <s> and </s>)

      • <s> John read Moby Dick </s>

      • <s> Mary read a different book </s>

      • <s> She read a book by Cher </s>

    • Bigram estimates:

      • P(She | <s>) = C(<s> She) / Σw C(<s> w) = 1/3

      • P(read | She) = C(She read) / Σw C(She w) = 1

      • P(Moby | read) = C(read Moby) / Σw C(read w) = 1/3

      • P(Dick | Moby) = C(Moby Dick) / Σw C(Moby w) = 1

      • P(</s> | Dick) = C(Dick </s>) / Σw C(Dick w) = 1

    • p(She read Moby Dick) =

      p(She | <s>) · p(read | She) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 1/3 · 1 · 1/3 · 1 · 1 = 1/9
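
    • A sketch that reproduces the estimate above from the three training sentences:

      from collections import Counter
      from functools import reduce

      sentences = [
          "<s> John read Moby Dick </s>",
          "<s> Mary read a different book </s>",
          "<s> She read a book by Cher </s>",
      ]
      tokens = [s.split() for s in sentences]
      bigrams = Counter(b for sent in tokens for b in zip(sent, sent[1:]))
      unigrams = Counter(w for sent in tokens for w in sent)

      def p(w, prev):
          """MLE bigram estimate P(w | prev) = C(prev w) / C(prev)."""
          return bigrams[(prev, w)] / unigrams[prev]

      sent = "<s> She read Moby Dick </s>".split()
      prob = reduce(lambda acc, pair: acc * p(pair[1], pair[0]), zip(sent, sent[1:]), 1.0)
      print(prob)   # 0.111... = 1/9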


    The zero problem
    The Zero Problem

    • “Raw” n-gram language model estimate:

      • necessarily, there will be some zeros

        • Often a trigram model → 2.16×10¹⁴ parameters, with only ~10⁹ words of data

      • which are true zeros?

        • optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams (hapax legomena = terms seen only once)

        • the optimal situation cannot happen, unfortunately (question: how much data would we need?)

      • we don’t know which zeros are true zeros; hence, we eliminate them all (make them non-zero).

    • Different kinds of zeros: p(w|h) = 0, p(w) = 0


    Why do we need non zero probabilities
    Why do we need non-zero probabilities?

    • Avoid infinite Cross Entropy:

      • happens when an event is found in the test data which has not been seen in training data

    • Make the system more robust

      • low count estimates:

        • they typically happen for “detailed” but relatively rare appearances

      • high count estimates: reliable but less “detailed”


    Eliminating the zero probabilities smoothing
    Eliminating the Zero Probabilities: Smoothing

    • Get new p’(w) (same W): almost p(w) except for eliminating zeros

    • Discount w for (some) p(w) > 0: new p’(w) < p(w)

      Σw∈discounted (p(w) − p’(w)) = D

    • Distribute D to all w with p(w) = 0: new p’(w) > p(w)

      • possibly also to other w with low p(w)

    • For some w (possibly): p’(w) = p(w)

    • Make sure Σw∈W p’(w) = 1

    • There are many ways of smoothing



    Laplace s law smoothing by adding 1
    Laplace’s Law: Smoothing by Adding 1

    • Laplace’s Law:

      • PLAP(w1,..,wn)=(C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of n-gram w1,..,wn, N is the number of training instances, and B is the number of bins training instances are divided into (vocabulary size)

      • Problem if B > C(W) (can be the case; even >> C(W))

      • PLAP(w | h) = (C(h,w) + 1) / (C(h) + B)

    • The idea is to give a little bit of the probability space to unseen events.


    Add 1 smoothing example
    Add 1 Smoothing Example

    • pMLE(Cher read Moby Dick) =

      p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 0 · 0 · 1/3 · 1 · 1 = 0

      • p(Cher | <s>) = (1 + C(<s> Cher))/(11 + C(<s>)) = (1 + 0) / (11 + 3) = 1/14 = .0714

      • p(read | Cher) = (1 + C(Cher read))/(11 + C(Cher)) = (1 + 0) / (11 + 1) = 1/12 = .0833

      • p(Moby | read) = (1 + C(read Moby))/(11 + C(read)) = (1 + 1) / (11 + 3) = 2/14 = .1429

      • P(Dick | Moby) = (1 + C(Moby Dick))/(11 + C(Moby)) = (1 + 1) / (11 + 1) = 2/12 = .1667

      • P(</s> | Dick) = (1 + C(Dick </s>))/(11 + C(Dick)) = (1 + 1) / (11 + 1) = 2/12 = .1667

    • p’(Cher read Moby Dick) =

      p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 1/14 · 1/12 · 2/14 · 2/12 · 2/12 ≈ 2.36e-5


    Objections to laplace s law
    Objections to Laplace’s Law

    • For NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.

    • Worse at predicting the actual probabilities of bigrams with zero counts than other methods.

    • Count variances are actually greater than the MLE.


    Lidstone s law
    Lidstone’s Law

    • P = probability of specific n-gram

    • C = count of that n-gram in training data

    • N = total n-grams in training data

    • B = number of “bins” (possible n-grams)

    • λ = a small positive number

      • MLE: λ = 0; Laplace’s Law: λ = 1; Jeffreys-Perks Law: λ = ½

    • PLid(w | h) = (C(h,w) + λ) / (C(h) + Bλ)
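
    • A sketch of the estimator (λ = 1 gives Laplace, λ = 0.5 gives Jeffreys-Perks, λ = 0 gives the MLE); the counts reuse the “read Moby” case from the Add-1 example above:

      def p_lid(c_hw, c_h, B, lam):
          """Lidstone estimate P_Lid(w | h) = (C(h,w) + lambda) / (C(h) + B * lambda)."""
          return (c_hw + lam) / (c_h + B * lam)

      # P(Moby | read) with C(read Moby) = 1, C(read) = 3, B = |V| = 11:
      print(p_lid(1, 3, 11, lam=1.0))   # 2/14  ~ 0.143 (Laplace, add-1)
      print(p_lid(1, 3, 11, lam=0.5))   # 1.5/8.5 ~ 0.176 (Jeffreys-Perks)
      print(p_lid(1, 3, 11, lam=0.0))   # 1/3 (MLE)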


    Objections to lidstone s law
    Objections to Lidstone’s Law

    • Need an a priori way to determine λ.

    • Predicts all unseen events to be equally likely.

    • Gives probability estimates linear in the M.L.E. frequency.


    Lidstone’s Law with λ = 0.5

    • pMLE(Cher read Moby Dick) =

      p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 0 · 0 · 1/3 · 1 · 1 = 0

      • p(Cher | <s>) = (.5 + C(<s> Cher))/(.5*11 + C(<s>)) = (.5 + 0) / (.5*11 + 3) = .5/8.5 = .0588

      • p(read | Cher) = (.5 + C(Cher read))/(.5*11 + C(Cher)) = (.5 + 0) / (.5*11 + 1) = .5/6.5 = .0769

      • p(Moby | read) = (.5 + C(read Moby))/(.5*11 + C(read)) = (.5 + 1) / (.5*11 + 3) = 1.5/8.5 = .1765

      • P(Dick | Moby) = (.5 + C(Moby Dick))/(.5*11 + C(Moby)) = (.5 + 1) / (.5*11 + 1) = 1.5/6.5 = .2308

      • P(</s> | Dick) = (.5 + C(Dick </s>))/(.5*11 + C(Dick)) = (.5 + 1) / (.5*11 + 1) = 1.5/6.5 = .2308

    • p’(Cher read Moby Dick) =

      p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = .5/8.5 · .5/6.5 · 1.5/8.5 · 1.5/6.5 · 1.5/6.5 ≈ 4.25e-5


    Held out estimator
    Held-Out Estimator

    • How much of the probability distribution should be reserved to allow for previously unseen events?

    • Can validate choice by holding out part of the training data.

    • How often do events seen (or not seen) in training data occur in validation data?

    • Held out estimator by Jelinek and Mercer (1985)


    Held out estimator1
    Held Out Estimator

    • For each n-gram, w1,..,wn , compute C1(w1,..,wn) and C2(w1,..,wn), the frequencies of w1,..,wn in training and held out data, respectively.

      • Let Nr be the number of n-grams with frequency r in the training text.

      • Let Tr be the total number of times that all n-grams appearing r times in the training text appear in the held-out data.

    • Then the average held-out frequency of an n-gram that occurs r times in training is Tr/Nr

    • The held-out estimate for one of these n-grams is: Pho(w1,..,wn) = (Tr/Nr)/N

      • where C(w1,..,wn) = r
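
    • A sketch over bigrams (here N is taken to be the number of bigram tokens in the held-out data so that the estimates sum to one; names are illustrative):

      from collections import Counter

      def held_out_estimates(train_tokens, heldout_tokens):
          c_train = Counter(zip(train_tokens, train_tokens[1:]))
          c_held = Counter(zip(heldout_tokens, heldout_tokens[1:]))
          N = sum(c_held.values())                 # bigram tokens in the held-out data
          Nr = Counter(c_train.values())           # Nr[r] = number of bigram types with training count r
          Tr = Counter()                           # Tr[r] = held-out occurrences of those types
          for bigram, r in c_train.items():
              Tr[r] += c_held[bigram]
          # Unseen bigrams (r = 0) would get (T0/N0)/N, with N0 = number of bins minus seen types.
          return {bg: (Tr[r] / Nr[r]) / N for bg, r in c_train.items()}

      train = "the cat sat on the mat the cat ran".split()
      held = "the cat sat on the hat".split()
      print(held_out_estimates(train, held)[("the", "cat")])   # 0.2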


    Testing models
    Testing Models

    • Divide data into training and testing sets.

    • Training data: divide into normal training plus validation (smoothing) sets: around 10% for validation (fewer parameters typically)

    • Testing data: distinguish between the “real” test set and a development set.


    Cross validation
    Cross-Validation

    • Held out estimation is useful if there is a lot of data available. If not, it may be better to use each part of the data both as training data and held out data.

      • Deleted Estimation [Jelinek & Mercer, 1985]

      • Leave-One-Out [Ney et al., 1997]


    Deleted Estimation

    • Use data for both training and validation

    • [Diagram: split the data into two halves A and B; Model 1 trains on A and validates on B, Model 2 trains on B and validates on A; Model 1 + Model 2 are then combined into the final model]


    Cross validation1
    Cross-Validation

    Two estimates:

    Nr^a = number of n-grams occurring r times in the a-th part of the training set

    Tr^ab = total number of occurrences in the b-th part of those n-grams

    Combined estimate: the arithmetic mean of the two deleted estimates (sketched below)
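
    • Roughly as presented in Manning & Schütze, the combined (deleted) estimate for an n-gram with training count r can be written as:

      Pdel(w1,..,wn) = (Tr^ab + Tr^ba) / (N (Nr^a + Nr^b)), where r = C(w1,..,wn)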


    Leave one out
    Leave One Out

    • The primary training corpus is of size N−1 tokens.

    • 1 token is used as held-out data for a sort of simulated testing.

    • The process is repeated N times so that each piece of data is left out in turn.

    • It explores how the model changes if any particular piece of data had not been observed (an advantage).


    Good turing estimation
    Good-Turing Estimation

    • Intuition: re-estimate the probability mass assigned to n-grams with low (or zero) counts using the number of n-grams with higher counts. For any n-gram that occurs r times, we should pretend that it occurs r* = ((r + 1) Nr+1) / Nr times, where Nr is the number of n-grams occurring precisely r times in the training data.

    • To convert the count to a probability, we normalize the n-gram with r counts as PGT(w1,..,wn) = r*/N:


    Good turing estimation1
    Good-Turing Estimation

    • Note that N is equal to the original number of counts in the distribution.

    • Makes the assumption of a binomial distribution, which works well for large amounts of data and a large vocabulary despite the fact that words and n-grams do not have that distribution.


    Good turing estimation2
    Good-Turing Estimation

    • Note that the estimate cannot be used if Nr=0; hence, it is necessary to smooth the Nr values.

    • The estimate can be written as:

      • If C(w1,..,wn) = r > 0, PGT(w1,..,wn) = r*/N where r*=((r+1)S(r+1))/S(r) and S(r) is a smoothed estimate of the expectation of Nr.

      • If C(w1,..,wn) = 0, PGT(w1,..,wn) ≈ (N1/N0) / N

    • In practice, counts with a frequency greater than five are assumed reliable, as suggested by Katz.

    • In practice, this method is not used by itself because it does not use lower order information to estimate probabilities of higher order n-grams.
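
    • A sketch of the basic estimate, using the raw Nr counts in place of a smoothed S(r):

      from collections import Counter

      def good_turing_probs(counts):
          """counts: Counter mapping n-gram -> r (r > 0). Uses raw Nr instead of a smoothed S(r)."""
          N = sum(counts.values())
          Nr = Counter(counts.values())            # Nr[r] = number of n-gram types seen r times
          probs = {}
          for gram, r in counts.items():
              if Nr[r + 1] == 0:                   # raw Nr has gaps; a smoothed S(r) would avoid this
                  probs[gram] = r / N              # fall back to the MLE for this sketch
              else:
                  r_star = (r + 1) * Nr[r + 1] / Nr[r]
                  probs[gram] = r_star / N
          unseen_total = Nr[1] / N                 # total mass reserved for all unseen n-grams
          return probs, unseen_total

      counts = Counter({"a b": 3, "a c": 2, "b c": 2, "c d": 1, "d e": 1, "e f": 1})
      probs, p0 = good_turing_probs(counts)
      print(probs["c d"], p0)    # r=1 -> r* = 2 * N2/N1 = 4/3, prob 0.133; unseen mass = N1/N = 0.3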


    Good turing estimation3
    Good-Turing Estimation

    • N-grams with low counts are often treated as if they had a count of 0.

    • In practice, r* is used only for small counts; counts greater than k = 5 are assumed to be reliable: r* = r if r > k; otherwise r* is given by a renormalized Good-Turing discount (Katz, 1987).


    Discounting methods
    Discounting Methods

    • Absolute discounting: decrease the probability of each observed n-gram by subtracting a small constant δ when C(w1, w2, …, wn) = r (see the sketch below)

    • Linear discounting: decrease the probability of each observed n-gram by multiplying it by the same proportion when C(w1, w2, …, wn) = r (see the sketch below)
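
    • A sketch of the usual formulations (roughly following Manning & Schütze; δ and α are small constants, B is the number of bins, and N0 is the number of unseen n-grams):

      Absolute:  Pabs(w1,..,wn) = (r − δ)/N              if r > 0
                 Pabs(w1,..,wn) = (B − N0) δ / (N0 N)     if r = 0

      Linear:    Plin(w1,..,wn) = (1 − α) r/N             if r > 0
                 Plin(w1,..,wn) = α / N0                   if r = 0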


    Combining estimators overview
    Combining Estimators: Overview

    • If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model.

    • Some combination methods:

      • Katz’s Back Off

      • Simple Linear Interpolation

      • General Linear Interpolation


    Backoff
    Backoff

    • Back off to a lower-order n-gram if we have no evidence for the higher-order form. Trigram backoff (sketched below):
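
    • A standard way to write the back-off scheme (a sketch; P* is a discounted estimate and α(·) the back-off weight that redistributes the reserved probability mass):

      Pbo(wi | wi-2, wi-1) = P*(wi | wi-2, wi-1)              if C(wi-2, wi-1, wi) > 0
                           = α(wi-2, wi-1) · Pbo(wi | wi-1)    otherwise

      Pbo(wi | wi-1)       = P*(wi | wi-1)                     if C(wi-1, wi) > 0
                           = α(wi-1) · P*(wi)                   otherwise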


    Katz s back off model
    Katz’s Back Off Model

    • If the n-gram of concern has appeared more than k times, then an n-gram estimate is used but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams).

    • If the n-gram occurred k times or less, then we will use an estimate from a shorter n-gram (back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate.

    • The process continues recursively.


    Katz s back off model1
    Katz’s Back Off Model

    • Katz used Good-Turing estimates when an n-gram appeared k or fewer times.


    Problems with backing off
    Problems with Backing-Off

    • If bigram w1 w2 is common, but trigram w1 w2 w3 is unseen, it may be a meaningful gap, rather than a gap due to chance and scarce data.

      • i.e., a “grammatical null”

    • In that case, it may be inappropriate to back-off to lower-order probability.


    Linear interpolation
    Linear Interpolation

    • One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness.

    • This can be done by linear interpolation (also called finite mixture models).

    • The weights can be set using the Expectation-Maximization (EM) algorithm.


    Simple interpolated smoothing
    Simple Interpolated Smoothing

    • Add information from less detailed distributions using λ = (λ0, λ1, λ2, λ3):

      p’λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) + λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|

    • Normalize:

      λi > 0, Σi=0..n λi = 1 is sufficient (λ0 = 1 − Σi=1..n λi) (n = 3)

    • Estimation using MLE:

      • fix the p3, p2, p1 and |V| parameters as estimated from the training data

      • then find {λi} that minimizes the cross entropy (maximizes the probability of the data): −(1/|D|) Σi=1..|D| log2(p’λ(wi|hi))


    Held out data
    Held Out Data

    • What data to use?

      • try the training data T: but we will always get λ3 = 1

        • why? (let piT be the i-gram distribution estimated using T)

        • minimizing HT(p’λ) over a vector λ, p’λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|

          • remember: HT(p’λ) = H(p3T) + D(p3T || p’λ); (p3T fixed → H(p3T) fixed, best)

      • thus: do not use the training data for estimation of λ!

        • must hold out part of the training data (heldout data, H):

        • ...call the remaining data the (true/raw) training data, T

        • the test data S (e.g., for evaluation purposes): still different data!


    The formulas
    The Formulas

    • Repeat: minimizing −(1/|H|) Σi=1..|H| log2(p’λ(wi|hi)) over λ

      p’λ(wi | hi) = p’λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) +

      λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|

    • “Expected Counts (of lambdas)”: j = 0..3

      c(λj) = Σi=1..|H| (λj pj(wi|hi) / p’λ(wi|hi))

    • “Next λ”: j = 0..3

      λj,next = c(λj) / Σk=0..3 c(λk)


    The smoothing em algorithm
    The (Smoothing) EM Algorithm

    1. Start with some λ, such that λj > 0 for all j ∈ 0..3.

    2. Compute “Expected Counts” for each λj.

    3. Compute the new set of λj, using the “Next λ” formula.

    4. Start over at step 2, unless a termination condition is met.

    • Termination condition: convergence of λ.

      • Simply set an ε, and finish if |λj − λj,next| < ε for each j (step 3).

    • Guaranteed to converge:

      follows from Jensen’s inequality, plus a technical proof.


    Example
    Example

    • Raw distribution (unigram; smooth with uniform):

      p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z

    • Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)

    • Start with λ1 = .5; p’λ(b) = .5 · .5 + .5/26 = .27

      p’λ(a) = .5 · .25 + .5/26 = .14

      p’λ(y) = .5 · 0 + .5/26 = .02

      c(λ1) = .5 · .5/.27 + .5 · .25/.14 + .5 · .5/.27 + .5 · 0/.02 = 2.72

      c(λ0) = .5 · .04/.27 + .5 · .04/.14 + .5 · .04/.27 + .5 · .04/.02 = 1.28

      Normalize: λ1,next = .68, λ0,next = .32.

      Repeat from step 2 (recompute p’λ first for efficient computation, then c(λi), ...)

      Finish when the new lambdas differ little from the previous ones (say, < 0.01 difference).
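
    • A sketch of the EM loop for this unigram + uniform example (the first step reproduces λ1 ≈ .68 from above; names and the stopping threshold are illustrative):

      def em_lambdas(p1, heldout, vocab_size=26, lam1=0.5, eps=0.01):
          """EM for two weights: p'(w) = lam1 * p1(w) + lam0 / |V|."""
          lam0 = 1.0 - lam1
          while True:
              c1 = c0 = 0.0
              for w in heldout:                       # "expected counts" of the lambdas
                  p_smooth = lam1 * p1.get(w, 0.0) + lam0 / vocab_size
                  c1 += lam1 * p1.get(w, 0.0) / p_smooth
                  c0 += (lam0 / vocab_size) / p_smooth
              new1, new0 = c1 / (c1 + c0), c0 / (c1 + c0)   # "next lambda"
              if abs(new1 - lam1) < eps and abs(new0 - lam0) < eps:
                  return new1, new0
              lam1, lam0 = new1, new0

      p1 = {"a": 0.25, "b": 0.5}
      p1.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})
      # First EM step gives lam1 ~ .68 (as on the slide); the loop then runs until the change is < eps.
      print(em_lambdas(p1, heldout="baby"))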



    Witten bell smoothing
    Witten-Bell Smoothing

    • The nth order smoothed model is defined recursively in terms of the (n−1)th order model.

    • To compute the interpolation weight λ, we need the number of unique words that follow that history.


    Witten bell smoothing1
    Witten-Bell Smoothing

    • The number of word types that follow the history and have one or more counts is N1+(h•) = |{w : C(h, w) ≥ 1}|.

    • We can assign the λ parameters in terms of this quantity.


    Witten bell smoothing2
    Witten-Bell Smoothing

    • Substituting into the first equation gives the closed form sketched below:
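
    • A sketch of the standard formulation (following Chen & Goodman’s description of Witten-Bell smoothing; N1+(h•) is the number of distinct word types seen after history h, and h’ is h with its oldest word dropped):

      pWB(wi | h) = λh pMLE(wi | h) + (1 − λh) pWB(wi | h’)

      1 − λh = N1+(h•) / (N1+(h•) + Σw C(h, w))

      Substituting:  pWB(wi | h) = (C(h, wi) + N1+(h•) pWB(wi | h’)) / (C(h) + N1+(h•))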


    General linear interpolation
    General Linear Interpolation

    • In simple linear interpolation, the weights are just a single number, but one can define a more general and powerful model where the weights are a function of the history.

    • Need some way to group or bucket lambda histories.



    Reference
    Reference

    • www.cs.tau.ac.il/~nachumd/NLP

    • http://www.cs.sfu.ca

    • http://min.ecn.purdue.edu

    • Manning and Schütze, Foundations of Statistical Natural Language Processing

    • Jurafsky and Martin, Speech and Language Processing