Collocations


Definition Of Collocation (wrt Corpus Literature)

  • A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]


Word Collocations

  • Collocation

    • Firth: “a word is characterized by the company it keeps”; collocations of a given word are statements of the habitual or customary places of that word.

    • non-compositionality of meaning

      • cannot be derived directly from its parts (heavy rain)

    • non-substitutability in context

      • for parts (make a decision)

    • non-modifiability (& non-transformability)

      • kick the yellow bucket; take exceptions to


Collocations

  • Collocations are not necessarily adjacent

  • Collocations cannot be directly translated into other languages.


Example Classes

  • Names

  • Technical Terms

  • “Light” Verb Constructions

  • Phrasal verbs

  • Noun Phrases


Linguistic Subclasses of Collocations

  • Light verbs: verbs with little semantic content like make, take, do

  • Terminological Expressions: concepts and objects in technical domains (e.g., hard drive)

  • Idioms: fixed phrases

    • kick the bucket, birds-of-a-feather, run for office

  • Proper names: difficult to recognize even with lists

    • Tuesday (person’s name), May, Winston Churchill, IBM, Inc.

  • Numerical expressions

    • containing “ordinary” words

      • Monday Oct 04 1999, two thousand seven hundred fifty

  • Verb particle constructions or Phrasal Verbs

    • Separable parts:

      • look up, take off, tell off


    Collocation Detection Techniques

    • Selection of Collocations by Frequency

    • Selection of collocations based on the mean and variance of the distance between the focal word and the collocating word

    • Hypothesis Testing

    • Pointwise Mutual Information


    Frequency

    • Technique:

      • Count the number of times a bigram co-occurs

      • Extract top counts and report them as candidates

    • Results:

      • Corpus: New York Times

        • August – November, 1990

      • Extremely uninteresting without filtering (see the counting sketch below)
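
A minimal sketch in Python of the counting step described above, assuming the corpus is already available as a flat list of tokens (the toy token list below stands in for the New York Times data mentioned on the slide):

```python
from collections import Counter

def frequency_collocations(tokens, top_n=20):
    """Count adjacent word pairs and return the most frequent ones as candidates."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))   # adjacent pairs only
    return bigram_counts.most_common(top_n)

# Toy stand-in for a real corpus:
tokens = "he knocked on the door and she knocked on the door again".split()
print(frequency_collocations(tokens, top_n=3))
```

As the slide notes, on real text the top of this unfiltered list is typically dominated by pairs of function words such as “of the”.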


    Frequency with Tag Filters Technique

    • Technique:

      • Count the number of times a bigram co-occurs

      • Tag candidates for POS

      • Pass all candidates through POS filter, considering only ones matching filter

      • Extract top counts and report them as candidates
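
A sketch of the tag-filtered variant, assuming the corpus has already been tagged into (word, POS) pairs; the two patterns kept here (adjective–noun and noun–noun) are only illustrative choices for the filter:

```python
from collections import Counter

KEEP_PATTERNS = {("JJ", "NN"), ("NN", "NN")}   # illustrative noun-phrase-like tag pairs

def filtered_collocations(tagged_tokens, top_n=20):
    """Count adjacent bigrams, keeping only those whose POS pair passes the filter."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in KEEP_PATTERNS:
            counts[(w1, w2)] += 1
    return counts.most_common(top_n)

tagged = [("the", "DT"), ("hard", "JJ"), ("drive", "NN"), ("failed", "VBD")]
print(filtered_collocations(tagged))   # [(('hard', 'drive'), 1)]
```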


    Frequency with Tag Filters Results


    Mean and Variance (Smadja et al., 1993)

    • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example,

      • Knock and door may not occur at a fixed distance from each other

    • One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.


    Mean, Sample Variance, and Standard Deviation

    • For offsets d1, …, dn: mean d̄ = (1/n) Σ di; sample variance s² = Σ (di − d̄)² / (n − 1); standard deviation s = √s²


    Example: Knock and Door

    • She knocked on his door.

    • They knocked at the door.

    • 100 women knocked on the big red door.

    • A man knocked on the metal front door.

    • Average offset between knock and door:

      (3 + 3 + 5 + 5)/ 4 = 4

    • Sample variance:

      ((3 − 4)² + (3 − 4)² + (5 − 4)² + (5 − 4)²) / (4 − 1) = 4/3 ≈ 1.33; sample standard deviation ≈ √1.33 ≈ 1.15


    Mean and Variance

    • Technique (bigram at distance)

      • Produce all possible pairs in a window

      • Consider all pairs in window as candidates

      • Keep data about distance of one word from another

      • Count the number of times each candidate occurs

    • Measures:

      • Mean: average offset (possibly negative)

        • indicates whether and how the two words are related to each other

      • Variance / standard deviation s of the offsets

        • indicates how variable the position of the two words is relative to each other (a low s suggests a nearly fixed phrase; see the sketch below)
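
A sketch of the offset collection for one candidate pair, assuming a flat token list and a symmetric window; the pair and the window size are illustrative:

```python
from statistics import mean

def offsets(tokens, focal, collocate, window=3):
    """Signed distances (collocate position minus focal position) within the window."""
    found = []
    for i, w in enumerate(tokens):
        if w != focal:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == collocate:
                found.append(j - i)
    return found

d = offsets("she knocked on his door".split(), "knocked", "door")
print(d, mean(d))   # [3] 3 -- a small spread over many sentences suggests a rigid pattern
```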


    Mean and Variance Illustration

    • Candidate Generation example:

      • Window: 3

    • Used to find collocations with long-distance relationships


    Mean and Variance Collocations


    Hypothesis Testing: Overview

    • Two (or more) words co-occur a lot

    • Is a candidate a true collocation, or a (not-at-all-interesting) phantom?


    The t test Intuition

    • Intuition:

      • Compute chance occurrence and ensure observed is significantly higher

      • Take several permutations of the words in the corpus

      • How much more frequent is the observed co-occurrence than what such random arrangements (chance) would produce?

    • Assumptions:

      • H0 is the null hypothesis (words occur independently)

        • P(w1, w2) = P(w1) P(w2)

      • Distribution is “normal”


    The t test Formula

    • Measures:

      • x̄ = observed bigram probability (bigram count divided by N)

      • μ = mean under H0 = P(w1) P(w2)

      • s² ≈ x̄ (since p(1 − p) ≈ p for small p)

      • N = total number of bigrams

      • t = (x̄ − μ) / √(s²/N)

    • Result:

      • Number to look up in a table

      • Degree of confidence that collocation is not created by chance

        • α = the significance level (%) at which one can reject H0
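
A sketch of the t statistic using the measures listed above (observed bigram probability as the sample mean, P(w1)P(w2) as the null-hypothesis mean, s² ≈ the sample mean); the counts passed in are illustrative, not figures from the slide:

```python
from math import sqrt

def t_statistic(c1, c2, c12, N):
    """t = (x_bar - mu) / sqrt(s^2 / N) for the bigram w1 w2."""
    x_bar = c12 / N              # observed bigram probability
    mu = (c1 / N) * (c2 / N)     # expected probability under H0 (independence)
    s2 = x_bar                   # p(1 - p) is approximately p for small p
    return (x_bar - mu) / sqrt(s2 / N)

# Illustrative counts: a pair that is frequent but not much above chance
print(t_statistic(c1=15828, c2=4675, c12=8, N=14307668))   # about 1.0, so H0 is not rejected
```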


    The t test Sample Findings


    The t test Criticism

    • Words are not normally distributed

      • Can reject valid collocation

    • Not good on sparse data


    χ² Intuition

    • Pearson’s chi-square test

    • Intuition

      • Compare observed frequencies to expected frequencies for independence

    • Assumptions

      • Does not assume a normal distribution; appropriate as long as the sample is not too small


    χ² General Formula

    • Measures:

      • Eij = expected count in cell (i, j) of the contingency table (under independence)

      • Oij = observed count in cell (i, j)

      • χ² = Σij (Oij − Eij)² / Eij

    • Result

      • A number to look up in a table (like the t test)

      • Significance level (α) at which H0 can be rejected


    χ² Bigram Method and Formula

    • Technique for Bigrams:

      • Arrange the bigram counts in a 2×2 contingency table (w1 vs. ¬w1 against w2 vs. ¬w2)

      • Apply the 2×2 form of the χ² formula (see the sketch below)

        • Oij: i = column; j = row
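
A sketch of the 2×2 computation, using the shortcut form of Pearson’s formula for a 2×2 table (equivalent to summing (Oij − Eij)²/Eij over the four cells); the counts are illustrative:

```python
def chi_square_2x2(c1, c2, c12, N):
    """Pearson chi-square for bigram w1 w2 from the 2x2 table of w1/not-w1 vs. w2/not-w2."""
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by something other than w2
    o21 = c2 - c12             # w2 preceded by something other than w1
    o22 = N - c1 - c2 + c12    # neither
    numerator = N * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return numerator / denominator

print(chi_square_2x2(c1=15828, c2=4675, c12=8, N=14307668))   # small value: independence not rejected
```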


    χ² Sample Findings

    • Comparing corpora

      • Machine Translation

        • Comparison of (English) “cow” and (French) “vache” gives χ² = 456,400

      • Similarity of two corpora


    χ² Criticism

    • Not good for small datasets


    Likelihood Ratios Within a Single Corpus (Dunning, 1993)

    • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic.

    • In applying the likelihood ratio test to collocation discovery, we consider two alternative explanations for the occurrence frequency of a bigram w1 w2:

      • H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p

      • H2: The occurrence of w2 is dependent on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2


    Likelihood Ratios Within a Single Corpus

    • Use the MLE for probabilities for p, p1, and p2 and assume the binomial distribution:

      • Under H1: P(w2 | w1) = c2/N, P(w2 | ¬w1) = c2/N = p

      • Under H2: P(w2 | w1) = c12/c1 = p1, P(w2 | ¬w1) = (c2 − c12)/(N − c1) = p2

      • Under H1: b(c12; c1, p) gives the probability that c12 out of c1 bigrams starting with w1 are w1 w2, and b(c2 − c12; N − c1, p) gives the probability that c2 − c12 out of the N − c1 bigrams not starting with w1 are ¬w1 w2

      • Under H2: b(c12; c1, p1) gives the probability that c12 out of c1 bigrams starting with w1 are w1 w2, and b(c2 − c12; N − c1, p2) gives the probability that c2 − c12 out of the N − c1 bigrams not starting with w1 are ¬w1 w2


    Likelihood Ratios Within a Single Corpus

    • The likelihood of H1

      • L(H1) = b(c12; c1, p)b(c2-c12; N-c1, p) (likelihood of independence)

    • The likelihood of H2

      • L(H2) = b(c12; c1, p1)b(c2- c12; N-c1, p2) (likelihood of dependence)

    • The log of the likelihood ratio

      • log λ = log [L(H1)/L(H2)] = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − log b(c12; c1, p1) − log b(c2 − c12; N − c1, p2)

    • The quantity −2 log λ is asymptotically χ² distributed, so we can test for significance.
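
A sketch of the log-likelihood-ratio computation spelled out above; log b(k; n, x) is written without the binomial coefficient, which cancels between H1 and H2. The counts in the example call are illustrative:

```python
from math import log

def log_b(k, n, x):
    """log of x^k * (1-x)^(n-k); the C(n, k) factor cancels in the ratio."""
    term1 = k * log(x) if k > 0 else 0.0
    term2 = (n - k) * log(1 - x) if n - k > 0 else 0.0
    return term1 + term2

def minus_2_log_lambda(c1, c2, c12, N):
    """-2 log [L(H1) / L(H2)] for the bigram w1 w2; larger values mean stronger dependence."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_lambda = (log_b(c12, c1, p) + log_b(c2 - c12, N - c1, p)
                  - log_b(c12, c1, p1) - log_b(c2 - c12, N - c1, p2))
    return -2 * log_lambda

print(minus_2_log_lambda(c1=42, c2=20, c12=20, N=14307668))   # illustrative counts
```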


    [Pointwise] Mutual Information (I)

    • Intuition:

      • Given a candidate pair (w1, w2) and an observation of w1

      • I(w1; w2) indicates how much more likely it is to then see w2

      • The same measure also works in reverse (observe w2)

    • Assumptions:

      • Data is not sparse


    Mutual Information Formula

    • Measures:

      • P(w1) = unigram probability

      • P(w1 w2) = bigram probability

      • P(w2 | w1) = probability of w2 given that we see w1

      • I(w1; w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ] = log2 [ P(w2 | w1) / P(w2) ]

    • Result:

      • Number indicating increased confidence that we will see w2 after w1
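
A sketch of the pointwise-mutual-information computation; the definition matches the measures above, and the counts in the example call are illustrative:

```python
from math import log2

def pmi(c1, c2, c12, N):
    """I(w1; w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]; assumes c12 > 0."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

print(pmi(c1=42, c2=20, c12=20, N=14307668))   # a rare pair gets an inflated score: the sparse-data problem
```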


    Mutual Information Criticism

    • A better measure of the independence of two words rather than the dependence of one word on another

    • Performs badly on sparse data (tends to misidentify rare events as collocations)


    Applications

    • Collocations are useful in:

      • Comparison of Corpora

      • Parsing

      • New Topic Detection

      • Computational Lexicography

      • Natural Language Generation

      • Machine Translation


    Comparison of Corpora

    • Corpus comparison is useful for:

      • Document clustering (for information retrieval)

      • Plagiarism detection

    • Comparison techniques:

      • Competing hypotheses:

        • Documents are dependent

        • Documents are independent

      • Compare the hypotheses using the likelihood ratio λ, etc.


    Parsing

    • When parsing, we may get more accurate data by treating a collocation as a unit (rather than individual words)

      • Example: [ hand to hand ] should be treated as a unit in “They engaged in hand to hand combat”; otherwise the parser can produce a mis-bracketed tree such as:

        (S (NP They)
           (VP engaged
               (PP in hand)
               (PP to
                   (NP hand combat))))


    New Topic Detection

    • When new topics are reported, the count of collocations associated with those topics increases

      • When topics become old, the count drops


    Computational Lexicography

    • As new multi-word expressions become part of the language, they can be detected

      • Existing collocations can be acquired

    • Can also be used for cultural identification

      • Examples:

        • My friend got an A in his class

        • My friend took an A in his class

        • My friend made an A in his class

        • My friend earned an A in his class


    Natural Language Generation

    • Problem:

      • Given two (or more) possible productions, which is more feasible?

      • Productions usually involve synonyms or near-synonyms

      • Languages generally favour one production


    Machine Translation

    • Collocation-complete problem?

      • Must find all used collocations

      • Must parse collocation as a unit

      • Must translate collocation as a unit

      • In target language production, must select among many plausible alternatives


    Thanks!

    • Questions?


    Statistical Inference: n-gram Model over Sparse Data


    Statistical Inference

    • Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.


    Language Models

    • Predict the next word, given the previous words

      (this sort of task is often referred to as a Shannon game)

    • A language model can take the context into account.

    • Determine probability of different sequences by examining training corpus

    • Applications:

      • OCR / Speech recognition – resolve ambiguity

      • Spelling correction

      • Machine translation, etc.


    Statistical Estimators

    • Example:

      Corpus: five Jane Austen novels

      N = 617,091 words, V = 14,585 unique words

      Task: predict the next word of the trigram “inferior to ___”

      from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”

    • Given the observed training data …

    • How do you develop a model (probability distribution) to predict future events?


    The Perfect Language Model

    • Sequence of word forms

    • Notation: W = (w1,w2,w3,...,wn)

    • The big (modeling) question is what is p(W)?

    • Well, we know (Bayes/chain rule):

      p(W) = p(w1,w2,w3,...,wn) = p(w1)p(w2|w1)p(w3|w1,w2)...p(wn|w1,w2,...,wn-1)

    • Not practical (even for short W → too many parameters)


    Markov Chain

    • Unlimited memory (cf. previous foil):

      • for wi, we know its predecessors w1,w2,w3,...,wi-1

    • Limited memory:

      • we disregard predecessors that are “too old”

      • remember only k previous words: wi-k,wi-k+1,...,wi-1

      • called “kth order Markov approximation”

    • Stationary character (no change over time):

      p(W) ≅ Πi=1..n p(wi | wi-k, wi-k+1, ..., wi-1), where n = |W|


    N-gram Language Models

    • (n−1)th order Markov approximation → n-gram LM:

      p(W) = Πi=1..n p(wi | wi-n+1, wi-n+2, ..., wi-1)

    • In particular (assume vocabulary |V| = 20k):

      0-gram LM: uniform model     p(w) = 1/|V|               1 parameter

      1-gram LM: unigram model     p(wi)                      2×10⁴ parameters

      2-gram LM: bigram model      p(wi|wi-1)                 4×10⁸ parameters

      3-gram LM: trigram model     p(wi|wi-2,wi-1)            8×10¹² parameters

      4-gram LM: tetragram model   p(wi|wi-3,wi-2,wi-1)       1.6×10¹⁷ parameters


    Reliability vs. Discrimination

    “large green ___________”

    tree? mountain? frog? car?

    “swallowed the large green ________”

    pill? tidbit?

    • larger n: more information about the context of the specific instance (greater discrimination)

    • smaller n: more instances in training data, better statistical estimates (more reliability)


    LM Observations

    • How large n?

      • no n is truly enough (theoretically)

      • but anyway: as much as possible (as close to “perfect” model as possible)

      • empirically: 3

        • parameter estimation? (reliability, data availability, storage space, ...)

        • 4 is too much: |V| = 60k → 1.296×10¹⁹ parameters

        • but: 6-7 would be (almost) ideal (having enough data)

    • For now, word forms only (no “linguistic” processing)


    Parameter Estimation

    • Parameter: numerical value needed to compute p(w|h)

    • From data (how else?)

    • Data preparation:

      • get rid of formatting etc. (“text cleaning”)

      • define words (separate but include punctuation, call it “word”, unless speech)

      • define sentence boundaries (insert “words” <s> and </s>)

      • letter case: keep, discard, or be smart:

        • name recognition

        • number type identification


    Maximum Likelihood Estimate

    • MLE: Relative Frequency...

      • ...best predicts the data at hand (the “training data”)

      • See (Ney et al. 1997) for a proof that the relative frequency really is the maximum likelihood estimate.

    • Trigrams from Training Data T:

      • count sequences of three words in T: C3(wi-2,wi-1,wi)

      • count sequences of two words in T: C2(wi-2,wi-1):

        PMLE(wi-2,wi-1,wi) = C3(wi-2,wi-1,wi) / N

        PMLE(wi|wi-2,wi-1) = C3(wi-2,wi-1,wi) / C2(wi-2,wi-1)
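
A sketch of the MLE trigram estimate defined above, assuming the training text is available as a flat token list with sentence-boundary markers already inserted:

```python
from collections import Counter

def mle_trigram(tokens):
    """P_MLE(w3 | w1, w2) = C3(w1, w2, w3) / C2(w1, w2), estimated by relative frequency."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter(zip(tokens, tokens[1:]))
    def p(w1, w2, w3):
        return c3[(w1, w2, w3)] / c2[(w1, w2)] if c2[(w1, w2)] else 0.0
    return p

p = mle_trigram("<s> <s> he can buy you the can of soda </s>".split())
print(p("he", "can", "buy"))   # 1.0 on this toy text
```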


    Character Language Model

    • Use individual characters instead of words:

      Same formulas and methods

    • Might consider 4-grams, 5-grams or even more

    • Good for cross-language comparisons

    • Transform cross-entropy between letter- and word-based models:

      HS(pc) = HS(pw) / avg. # of characters/word in S

    p(W) =df Πi=1..n p(ci | ci-n+1, ci-n+2, ..., ci-1)


    LM: an Example

    • Training data:

      <s0> <s> He can buy you the can of soda </s>

      • Unigram: (8 words in vocabulary)

        p1(He) = p1(buy) = p1(you) = p1(the) = p1(of) = p1(soda) = .125, p1(can) = .25

      • Bigram:

        p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,

        p2(you |buy) = 1,...

      • Trigram:

        p3(He|<s0>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(</s>|of,soda) = 1.

      • Entropy: H(p1) = 2.75, H(p2) = 1, H(p3) = 0


    LM: an Example (The Problem)

    • Cross-entropy:

    • S = <s0> <s> It was the greatest buy of all </s>

    • Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

      • all unigrams but p1(the), p1(buy), and p1(of) are 0.

      • all bigram probabilities are 0.

      • all trigram probabilities are 0.

    • Need to make all “theoretically possible” probabilities non-zero.


    LM: Another Example

    • Training data S: |V| =11 (not counting <s> and </s>)

      • <s> John read Moby Dick </s>

      • <s> Mary read a different book </s>

      • <s> She read a book by Cher </s>

    • Bigram estimates:

      • P(She | <s>) = C(<s> She) / Σw C(<s> w) = 1/3

      • P(read | She) = C(She read) / Σw C(She w) = 1

      • P(Moby | read) = C(read Moby) / Σw C(read w) = 1/3

      • P(Dick | Moby) = C(Moby Dick) / Σw C(Moby w) = 1

      • P(</s> | Dick) = C(Dick </s>) / Σw C(Dick w) = 1

    • p(She read Moby Dick) =

      p(She | <s>) × p(read | She) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 1/3 × 1 × 1/3 × 1 × 1 = 1/9


    The Zero Problem

    • “Raw” n-gram language model estimate:

      • necessarily, there will be some zeros

        • Often: trigram model → 2.16×10¹⁴ parameters, data ~ 10⁹ words

      • which are true zeros?

        • optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams (a hapax legomenon is a term that occurs only once)

        • the optimal situation cannot happen, unfortunately (question: how much data would we need?)

      • we don't know which zeros are true zeros; hence, we eliminate them all (by smoothing).

    • Different kinds of zeros: p(w|h) = 0, p(w) = 0


    Why do we need non-zero probabilities?

    • Avoid infinite Cross Entropy:

      • happens when an event is found in the test data which has not been seen in training data

    • Make the system more robust

      • low count estimates:

        • they typically happen for “detailed” but relatively rare appearances

      • high count estimates: reliable but less “detailed”


    Eliminating the Zero Probabilities: Smoothing

    • Get a new p’(w) (same W): almost p(w), except for eliminating zeros

    • Discount w for (some) p(w) > 0: new p’(w) < p(w)

      Σw∈discounted (p(w) − p’(w)) = D

    • Distribute D to all w with p(w) = 0: new p’(w) > p(w)

      • possibly also to other w with low p(w)

    • For some w (possibly): p’(w) = p(w)

    • Make sure Σw∈W p’(w) = 1

    • There are many ways of smoothing


    Smoothing: an Example


    Laplace's Law: Smoothing by Adding 1

    • Laplace’s Law:

      • PLAP(w1,..,wn)=(C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of n-gram w1,..,wn, N is the number of training instances, and B is the number of bins training instances are divided into (vocabulary size)

      • Problem if B > C(W) (can be the case; even >> C(W))

      • PLAP(w | h) = (C(h,w) + 1) / (C(h) + B)

    • The idea is to give a little bit of the probability space to unseen events.


    Add-1 Smoothing Example

    • pMLE(Cher read Moby Dick) =

      p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

      • p(Cher | <s>) = (1 + C(<s> Cher))/(11 + C(<s>)) = (1 + 0) / (11 + 3) = 1/14 = .0714

      • p(read | Cher) = (1 + C(Cher read))/(11 + C(Cher)) = (1 + 0) / (11 + 1) = 1/12 = .0833

      • p(Moby | read) = (1 + C(read Moby))/(11 + C(read)) = (1 + 1) / (11 + 3) = 2/14 = .1429

      • p(Dick | Moby) = (1 + C(Moby Dick))/(11 + C(Moby)) = (1 + 1) / (11 + 1) = 2/12 = .1667

      • p(</s> | Dick) = (1 + C(Dick </s>))/(11 + C(Dick)) = (1 + 1) / (11 + 1) = 2/12 = .1667

    • p’(Cher read Moby Dick) =

      p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 1/14 × 1/12 × 2/14 × 2/12 × 2/12 ≈ 2.4e-5


    Objections to Laplace's Law

    • For NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.

    • Worse at predicting the actual probabilities of bigrams with zero counts than other methods.

    • Count variances are actually greater than the MLE.


    Lidstone's Law

    • P = probability of a specific n-gram

    • C = count of that n-gram in the training data

    • N = total n-grams in the training data

    • B = number of “bins” (possible n-grams)

    • λ = small positive number

      • MLE: λ = 0; Laplace's Law: λ = 1; Jeffreys-Perks Law: λ = ½

    • PLid(w | h) = (C(h,w) + λ) / (C(h) + Bλ)
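
A sketch of the Lidstone estimate above; setting lam = 1 gives Laplace's law and lam = 0.5 the Jeffreys-Perks law. The call below reproduces one of the add-1 figures from the earlier example (B = 11, C(Cher) = 1, C(Cher read) = 0):

```python
def p_lidstone(c_hw, c_h, B, lam=0.5):
    """P_Lid(w | h) = (C(h, w) + lam) / (C(h) + B * lam)."""
    return (c_hw + lam) / (c_h + B * lam)

print(p_lidstone(c_hw=0, c_h=1, B=11, lam=1.0))   # 1/12, i.e. p(read | Cher) under add-1
```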


    Objections to Lidstone's Law

    • Need an a priori way to determine λ.

    • Predicts all unseen events to be equally likely.

    • Gives probability estimates linear in the M.L.E. frequency.


    Lidstone's Law with λ = .5

    • pMLE(Cher read Moby Dick) =

      p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = 0 × 0 × 1/3 × 1 × 1 = 0

      • p(Cher | <s>) = (.5 + C(<s> Cher))/(.5·11 + C(<s>)) = (.5 + 0) / (.5·11 + 3) = .5/8.5 = .0588

      • p(read | Cher) = (.5 + C(Cher read))/(.5·11 + C(Cher)) = (.5 + 0) / (.5·11 + 1) = .5/6.5 = .0769

      • p(Moby | read) = (.5 + C(read Moby))/(.5·11 + C(read)) = (.5 + 1) / (.5·11 + 3) = 1.5/8.5 = .1765

      • p(Dick | Moby) = (.5 + C(Moby Dick))/(.5·11 + C(Moby)) = (.5 + 1) / (.5·11 + 1) = 1.5/6.5 = .2308

      • p(</s> | Dick) = (.5 + C(Dick </s>))/(.5·11 + C(Dick)) = (.5 + 1) / (.5·11 + 1) = 1.5/6.5 = .2308

    • p’(Cher read Moby Dick) =

      p(Cher | <s>) × p(read | Cher) × p(Moby | read) × p(Dick | Moby) × p(</s> | Dick) = .5/8.5 × .5/6.5 × 1.5/8.5 × 1.5/6.5 × 1.5/6.5 ≈ 4.3e-5


    Held-Out Estimator

    • How much of the probability distribution should be reserved to allow for previously unseen events?

    • Can validate choice by holding out part of the training data.

    • How often do events seen (or not seen) in training data occur in validation data?

    • Held out estimator by Jelinek and Mercer (1985)


    Held-Out Estimator

    • For each n-gram w1,..,wn, compute C1(w1,..,wn) and C2(w1,..,wn), its frequencies in the training and held-out data, respectively.

      • Let Nr be the number of n-grams with frequency r in the training text.

      • Let Tr be the total number of times that all n-grams with training frequency r appear in the held-out data.

    • Then the average held-out frequency of an n-gram that occurred r times in training is Tr/Nr.

    • An estimate of the probability of one of these n-grams is: Pho(w1,..,wn) = (Tr/Nr)/N,

      • where C1(w1,..,wn) = r
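
A sketch of the held-out computation, assuming both data sets are given as lists of n-gram tuples; it only covers n-grams seen in training (r ≥ 1), and taking N to be the held-out size is an assumption about the normalization, made so the estimates of the seen types sum to at most one:

```python
from collections import Counter

def held_out_estimates(train_ngrams, heldout_ngrams):
    """P_ho(g) = (T_r / N_r) / N, where r = training count of n-gram g."""
    c_train = Counter(train_ngrams)
    c_held = Counter(heldout_ngrams)
    N = len(heldout_ngrams)              # total n-gram tokens in the held-out data
    N_r = Counter(c_train.values())      # number of n-gram types with training count r
    T_r = Counter()                      # held-out occurrences of those same types
    for g, r in c_train.items():
        T_r[r] += c_held[g]
    return {g: (T_r[r] / N_r[r]) / N for g, r in c_train.items()}
```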


    Testing Models

    • Divide data into training and testing sets.

    • Training data: divide into normal training plus validation (smoothing) sets: around 10% for validation (fewer parameters typically)

    • Testing data: distinguish between the “real” test set and a development set.


    Cross-Validation

    • Held out estimation is useful if there is a lot of data available. If not, it may be better to use each part of the data both as training data and held out data.

      • Deleted Estimation [Jelinek & Mercer, 1985]

      • Leave-One-Out [Ney et al., 1997]


    Deleted Estimation

    • Use data for both training and validation

    • Divide the training data into 2 parts

    • Train on A, validate on B → Model 1

    • Train on B, validate on A → Model 2

    • Combine the two models into the final model

    [Diagram: parts A and B swap the roles of training and validation data; Model 1 + Model 2 → Final Model]


    Cross-Validation

    Two estimates:

      Nr^a = number of n-grams occurring r times in the a-th part of the training set

      Tr^{ab} = total number of occurrences, in the b-th part, of those same n-grams

    Combined estimate (an arithmetic mean of the two held-out estimates):
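
The combined estimate itself was an image on the original slide; a reconstruction of the arithmetic mean described above, using A and B for the two halves (the notation is an assumption, not the slide's own symbols):

```latex
P_{del}(w_1 \ldots w_n) \;=\; \frac{1}{2}\left[\frac{T_r^{AB}}{N\,N_r^{A}} + \frac{T_r^{BA}}{N\,N_r^{B}}\right]
\qquad \text{where } r = C(w_1 \ldots w_n)
```

A frequently used pooled variant combines the counts before dividing: (T_r^{AB} + T_r^{BA}) / (N (N_r^A + N_r^B)).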


    Leave One Out

    • The primary training corpus has N−1 tokens.

    • 1 token is used as held-out data for a sort of simulated testing.

    • The process is repeated N times so that each piece of data is left out in turn.

    • It explores how the model changes if any particular piece of data had not been observed (an advantage).


    Good-Turing Estimation

    • Intuition: re-estimate the probability mass assigned to n-grams with low (or zero) counts using the number of n-grams with higher counts. For any n-gram that occurs r times, we assume it occurs r* = (r + 1) Nr+1 / Nr times, where Nr is the number of n-grams occurring precisely r times in the training data.

    • To convert the adjusted count to a probability, we normalize the n-gram with r counts as PGT(w1,..,wn) = r*/N.


    Good-Turing Estimation

    • Note that N is equal to the original number of counts in the distribution.

    • Makes the assumption of a binomial distribution, which works well for large amounts of data and a large vocabulary despite the fact that words and n-grams do not have that distribution.


    Good-Turing Estimation

    • Note that the estimate cannot be used if Nr=0; hence, it is necessary to smooth the Nr values.

    • The estimate can be written as:

      • If C(w1,..,wn) = r > 0, PGT(w1,..,wn) = r*/N where r*=((r+1)S(r+1))/S(r) and S(r) is a smoothed estimate of the expectation of Nr.

      • If C(w1,..,wn) = 0, PGT(w1,..,wn) ≈ (N1/N0) / N

    • In practice, counts with a frequency greater than five are assumed reliable, as suggested by Katz.

    • In practice, this method is not used by itself because it does not use lower order information to estimate probabilities of higher order n-grams.
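
A sketch of the count re-estimation, using the raw Nr values directly rather than a smoothed S(r) (so it simply leaves a count unchanged when Nr+1 happens to be zero), with the k = 5 reliability threshold mentioned above:

```python
from collections import Counter

def good_turing_counts(ngram_counts, k=5):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r for small r; larger counts kept as-is."""
    N_r = Counter(ngram_counts.values())          # "frequency of frequencies"
    adjusted = {}
    for g, r in ngram_counts.items():
        if r <= k and N_r[r + 1] > 0:
            adjusted[g] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[g] = float(r)                # treated as reliable (or N_{r+1} missing)
    return adjusted
```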


    Good-Turing Estimation

    • N-grams with low counts are often treated as if they had a count of 0.

    • In practice r* is used only for small counts; counts greater than k=5 are assumed to be reliable: r*=r if r> k; otherwise:


    Discounting Methods

    • Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant when C(w1, w2, …, wn) = r:

    • Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion when C(w1, w2, …, wn) = r:
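
Both discounting formulas appeared as images on the original slide; a reconstruction of the usual r > 0 cases (the discount δ, the proportion α, and exactly how the freed mass is spread over unseen n-grams are assumptions based on the standard textbook presentation):

```latex
P_{abs}(w_1 \ldots w_n) = \frac{r - \delta}{N} \quad (r > 0),
\qquad
P_{lin}(w_1 \ldots w_n) = (1 - \alpha)\,\frac{r}{N} \quad (r > 0)
```

with the probability mass removed from the seen n-grams redistributed over the unseen ones in both cases.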


    Combining Estimators: Overview

    • If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model.

    • Some combination methods:

      • Katz’s Back Off

      • Simple Linear Interpolation

      • General Linear Interpolation


    Backoff

    • Back off to lower order n-gram if we have no evidence for the higher order form. Trigram backoff:
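
The back-off equation itself was an image on the original slide; a standard reconstruction for the trigram case, where P* denotes a discounted estimate and α(·) a back-off weight chosen so the distribution sums to one (both symbols are assumptions about notation):

```latex
P_{bo}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-2}, w_{i-1}) & \text{if } C(w_{i-2}\, w_{i-1}\, w_i) > 0 \\
\alpha(w_{i-2}, w_{i-1}) \, P_{bo}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
```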


    Katz's Back Off Model

    • If the n-gram of concern has appeared more than k times, then an n-gram estimate is used but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams).

    • If the n-gram occurred k times or less, then we will use an estimate from a shorter n-gram (back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate.

    • The process continues recursively.


    Katz's Back Off Model

    • Katz used Good-Turing estimates when an n-gram appeared k or fewer times.


    Problems with Backing-Off

    • If bigram w1 w2 is common, but trigram w1 w2 w3 is unseen, it may be a meaningful gap, rather than a gap due to chance and scarce data.

      • i.e., a “grammatical null”

    • In that case, it may be inappropriate to back-off to lower-order probability.


    Linear Interpolation

    • One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness.

    • This can be done by linear interpolation (also called finite mixture models).

    • The weights can be set using the Expectation-Maximization (EM) algorithm.


    Simple Interpolated Smoothing

    • Add information from less detailed distributions using weights λ = (λ0, λ1, λ2, λ3):

      p'λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) + λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|

    • Normalize:

      λi > 0, Σi=0..n λi = 1 is sufficient (λ0 = 1 − Σi=1..n λi) (n = 3)

    • Estimation using MLE:

      • fix the p3, p2, p1 and |V| parameters as estimated from the training data

      • then find the {λi} that minimize the cross entropy (maximize the probability of the data): −(1/|D|) Σi=1..|D| log2(p'λ(wi | hi))


    Held Out Data

    • What data to use?

      • try the training data T: but we will always get λ3 = 1

        • why? (let piT be an i-gram distribution estimated using T)

        • minimizing HT(p'λ) over a vector λ, p'λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|

          • remember: HT(p'λ) = H(p3T) + D(p3T || p'λ); p3T is fixed → H(p3T) is fixed (and best)

      • thus: do not use the training data for estimation of λ!

        • must hold out part of the training data (held-out data, H)

        • ...call the remaining data the (true/raw) training data, T

        • the test data S (e.g., for evaluation purposes): still different data!


    The Formulas

    • Repeat: minimizing −(1/|H|) Σi=1..|H| log2(p'λ(wi | hi)) over λ

      p'λ(wi | hi) = p'λ(wi | wi-2, wi-1) = λ3 p3(wi | wi-2, wi-1) + λ2 p2(wi | wi-1) + λ1 p1(wi) + λ0/|V|

    • "Expected counts (of the lambdas)": for j = 0..3

      c(λj) = Σi=1..|H| (λj pj(wi | hi) / p'λ(wi | hi))

    • "Next λ": for j = 0..3

      λj,next = c(λj) / Σk=0..3 c(λk)


    The (Smoothing) EM Algorithm

    1. Start with some λ such that λj > 0 for all j ∈ 0..3.

    2. Compute the "expected counts" for each λj.

    3. Compute the new set of λj, using the "next λ" formula.

    4. Start over at step 2, unless a termination condition is met.

    • Termination condition: convergence of λ.

      • Simply set an ε, and finish if |λj − λj,next| < ε for each j (step 3).

    • Guaranteed to converge:

      follows from Jensen's inequality, plus a technical proof.
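
A sketch of the algorithm above in Python, written for an arbitrary number of component models; each component is a function p_j(w, h), and the held-out data is a list of (word, history) pairs. The uniform component keeps the denominator non-zero:

```python
def em_lambdas(components, heldout, max_iter=100, eps=0.01):
    """Re-estimate interpolation weights with the expected-count / renormalize loop above."""
    m = len(components)
    lambdas = [1.0 / m] * m                       # step 1: all weights positive
    for _ in range(max_iter):
        counts = [0.0] * m                        # step 2: expected counts c(lambda_j)
        for w, h in heldout:
            contribs = [lambdas[j] * components[j](w, h) for j in range(m)]
            total = sum(contribs)                 # this is p'_lambda(w | h)
            for j in range(m):
                counts[j] += contribs[j] / total
        new = [c / sum(counts) for c in counts]   # step 3: next lambdas
        done = max(abs(a - b) for a, b in zip(new, lambdas)) < eps
        lambdas = new
        if done:                                  # step 4: stop on convergence
            break
    return lambdas

# The unigram/uniform setup from the example on the next slide (held-out data "baby"):
p_uni = lambda w, h: {"a": .25, "b": .5}.get(w, 1 / 64 if w in "cdefghijklmnopqr" else 0.0)
p_uniform = lambda w, h: 1 / 26
print(em_lambdas([p_uni, p_uniform], [(ch, None) for ch in "baby"]))
```

After the first iteration the weights are roughly .68 and .32, matching the worked example that follows.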


    Example

    • Raw distribution (unigram; smooth with the uniform distribution):

      p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z

    • Held-out data: baby; use one set of λ (λ1: unigram, λ0: uniform)

    • Start with λ1 = .5; p'λ(b) = .5 × .5 + .5/26 = .27

      p'λ(a) = .5 × .25 + .5/26 = .14

      p'λ(y) = .5 × 0 + .5/26 = .02

      c(λ1) = .5 × .5/.27 + .5 × .25/.14 + .5 × .5/.27 + .5 × 0/.02 = 2.72

      c(λ0) = .5 × .04/.27 + .5 × .04/.14 + .5 × .04/.27 + .5 × .04/.02 = 1.28

      Normalize: λ1,next = .68, λ0,next = .32.

      Repeat from step 2 (recompute p'λ first for efficient computation, then the c(λi), ...)

      Finish when the new lambdas differ little from the previous ones (say, by less than 0.01).



    Witten-Bell Smoothing

    • The nth order smoothed model is defined recursively as:

    • To compute λ, we need the number of unique words that follow that history.


    Witten-Bell Smoothing

    • The number of words that follow the history and have one or more counts is:

    • We can assign the parameters such that:


    Witten-Bell Smoothing

    • Substituting into the first equation, we get:
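
The equations on the three Witten-Bell slides were images in the original deck; a standard reconstruction of the recursive definition and the choice of λ (the notation N1+(h •) for the number of distinct continuations of a history h is an assumption, since the slide's own symbols are lost):

```latex
P_{WB}(w_i \mid h) \;=\; \lambda_h \, P_{MLE}(w_i \mid h) \;+\; (1 - \lambda_h)\, P_{WB}(w_i \mid h'),
\qquad
1 - \lambda_h \;=\; \frac{N_{1+}(h\,\bullet)}{N_{1+}(h\,\bullet) + \sum_{w} C(h\,w)}
```

where h is the full history, h′ the history shortened by one word, and N1+(h •) = |{w : C(h w) > 0}| is the number of distinct word types observed after h.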


    General Linear Interpolation

    • In simple linear interpolation, the weights are just a single number, but one can define a more general and powerful model where the weights are a function of the history.

    • Need some way to group or bucket lambda histories.


    Q & A


    References

    • www.cs.tau.ac.il/~nachumd/NLP

    • http://www.cs.sfu.ca

    • http://min.ecn.purdue.edu

    • Manning, C. and Schütze, H., Foundations of Statistical Natural Language Processing, MIT Press, 1999

    • Jurafsky, D. and Martin, J. H., Speech and Language Processing, Prentice Hall

