Corpora and Statistical Methods

Language modelling using N-Grams

Example task
  • The word-prediction task (Shannon game)
  • Given:
    • a sequence of words (the history)
    • a choice of next word
  • Predict:
    • the most likely next word
  • Generalises easily to other problems, such as predicting the POS of unknowns based on history.
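The Shannon-game setup above can be sketched with a toy bigram predictor — a minimal illustration (not the full model developed later), using only the last word of the history:

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    # For every word, count which words follow it and how often.
    successors = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        successors[w1][w2] += 1
    return successors

def predict_next(successors, history_word):
    # Predict the most likely next word given the last word of the history.
    counts = successors[history_word]
    return counts.most_common(1)[0][0] if counts else None

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigrams(corpus)
```

Here `predict_next(model, "the")` returns "cat", since "cat" follows "the" twice in the toy corpus while "mat" follows it only once.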
Applications of the Shannon game
  • Automatic speech recognition (cf. tutorial 1):
    • given a sequence of possible words, estimate its probability
  • Context-sensitive spelling correction:
    • Many spelling errors are real words
      • He walked for miles in the dessert. (intended: desert)
    • Identifying such errors requires a global estimate of the probability of a sentence.
Applications of N-gram models generally
  • POS Tagging:
    • predict the POS of an unknown word by looking at its history
  • Statistical parsing:
    • e.g. predict the group of words that together form a phrase
  • Statistical NL Generation:
    • given a semantic form to be realised as text, and several possible realisations, select the most probable one.
A real-world example: Google’s did you mean
  • Google uses an n-gram model (based on sequences of characters, not words).
  • In this case, the sequence apple desserts is much more probable than apple deserts
How it works
  • Documents provided by the search engine are added to:
    • An index (for fast retrieval)
    • A language model (based on probability of a sequence of characters)
  • A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”).
  • Outcome is a context-sensitive spelling correction:
    • “apple deserts” → “apple desserts”
    • “frodbaggins” → “frodobaggins”
    • “frod” → “ford”
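The candidate-generation step described above (insertions, deletions, substitutions and transpositions) can be sketched as follows; this is a generic one-edit-distance generator, not Google’s actual implementation:

```python
def edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # All strings one edit away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)
```

A language model would then score the candidates: e.g. "desserts" is among `edit_candidates("deserts")` (one insertion), and "ford" among `edit_candidates("frod")` (one transposition), and the more probable candidate wins.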
The noisy channel model
  • After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall p. 198
The Markov Assumption
  • Markov models:
    • probabilistic models which predict the likelihood of a future unit based on limited history
    • in language modelling, this pans out as the local history assumption:
      • the probability of wn depends on a limited number of prior words
    • utility of the assumption:
      • we can rely on a small n for our n-gram models (bigram, trigram)
      • long n-grams become exceedingly sparse
      • Probabilities become very small with long sequences
The structure of an n-gram model
  • The task can be re-stated in conditional probabilistic terms: estimate P(wn | w1…wn-1)
  • Limiting n under the Markov Assumption means:
    • greater chance of finding more than one occurrence of the sequence w1…wn-1
    • more robust statistical estimations
  • N-grams are essentially equivalence classes or bins
    • every unique n-gram is a type or bin
Structure of n-gram models (II)
  • If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model
  • Note terminology:
    • n-gram model = (n-1)th order Markov Model
Methodological considerations
  • We are often concerned with:
    • building an n-gram model
    • evaluating it
  • We therefore make a distinction between training and test data
    • You never test on your training data
    • If you do, you’re bound to get good results.
    • N-gram models tend to be overtrained, i.e.: if you train on a corpus C, your model will be biased towards expecting the kinds of events in C.
      • Another term for this: overfitting
Dividing the data
  • Given: a corpus of n units (words, sentences, … depending on the task)
    • A large proportion of the corpus is reserved for training.
    • A smaller proportion for testing/evaluation (normally 5-10%)
Held-out (validation) data
  • Held-out estimation:
    • during training, we sometimes estimate parameters for our model empirically
    • commonly used in smoothing (how much probability space do we want to set aside for unseen data?)
    • therefore, the training set is often split further into training data and validation data
      • normally, held-out data is 10% of the size of the training data
Development data
  • A common approach:
    1. train an algorithm on the training data (estimating further parameters on held-out data if required)
    2. evaluate it
    3. re-tune it
    4. go back to Step 1 until no further fine-tuning is necessary
    5. carry out the final evaluation
  • For this purpose, it’s useful to have:
    • training data for step 1
    • a development set for steps 2-4
    • a final test set for step 5
Significance testing
  • Often, we compare the performance of our algorithm against some baseline.
  • A single, raw performance score won’t tell us much. We need to test for significance (e.g. using t-test).
  • Typical method:
    • Split test set into several small test sets, e.g. 20 samples
    • evaluation carried out separately on each
    • mean and variance estimated based on 20 different samples
    • test for significant difference between algorithm and a predefined baseline
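The method above can be sketched with a one-sample t statistic over the per-sample scores; the scores and baseline below are made-up illustration values, and the t-test itself is standard (compare the statistic against the critical value for n−1 degrees of freedom):

```python
import math
import statistics

def t_statistic(scores, baseline):
    # One-sample t statistic: does the mean score differ from the baseline?
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    return (mean - baseline) / (sd / math.sqrt(n))

# Hypothetical accuracies from 20 small test sets, vs. a baseline of 0.65.
scores = [0.71, 0.69, 0.72, 0.70, 0.73, 0.68, 0.71, 0.72,
          0.70, 0.69, 0.74, 0.71, 0.70, 0.72, 0.69, 0.71,
          0.73, 0.70, 0.72, 0.71]
t = t_statistic(scores, baseline=0.65)
```

With 19 degrees of freedom, a |t| above roughly 2.09 indicates a significant difference at the 5% level.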
Size of n-gram models
  • In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
  • For a bigram model: N² possible n-grams in principle
  • For a trigram model: N³ possible n-grams.
Size (continued)
  • Each n-gram in our model is a parameter used to estimate probability of the next possible word.
    • too many parameters make the model unwieldy
    • too many parameters lead to data sparseness: most of them will have f = 0 or 1
  • Most models stick to unigrams, bigrams or trigrams.
    • estimation can also combine different order models
Further considerations
  • When building a model, we tend to take into account the start-of-sentence symbol:
    • the girl swallowed a large green caterpillar
      • <s> the
      • the girl
  • Also typical to map all tokens w such that count(w) < k to <UNK>:
    • usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”
    • this reduces the parameter space
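The preprocessing described above — prepending `<s>` and mapping rare tokens to `<UNK>` — can be sketched as follows (a minimal version; the cutoff k and the token names follow the slide):

```python
from collections import Counter

def preprocess(sentences, k=1):
    # Add a start-of-sentence marker and map tokens with count <= k to <UNK>.
    counts = Counter(w for sent in sentences for w in sent)
    return [["<s>"] + [w if counts[w] > k else "<UNK>" for w in sent]
            for sent in sentences]

sents = [["the", "girl", "swallowed", "a", "caterpillar"],
         ["the", "girl", "saw", "a", "dog"]]
processed = preprocess(sents, k=1)
```

Here "swallowed", "caterpillar", "saw" and "dog" each occur once, so they all collapse into a single `<UNK>` parameter.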
Maximum Likelihood Estimation Approach
  • Basic equation: P(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1)
  • In a unigram model, this reduces to simple probability.
  • MLE models estimate probability using relative frequency.
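The relative-frequency estimate can be sketched directly from the basic equation, counting the n-gram and its history in a toy corpus:

```python
from collections import Counter

def mle_prob(tokens, history, word):
    # MLE: P(word | history) = C(history + word) / C(history)
    n = len(history) + 1
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    histories = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams[tuple(history) + (word,)] / histories[tuple(history)]

corpus = "the cat sat on the mat".split()
p = mle_prob(corpus, ("the",), "cat")
```

"the" occurs twice and "the cat" once, so P(cat | the) = 1/2.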
Limitations of MLE
  • MLE builds the model that maximises the probability of the training data.
  • Unseen events in the training data are assigned zero probability.
    • Since n-gram models tend to be sparse, this is a real problem.
  • Consequences:
    • seen events are given more probability mass than they have
    • unseen events are given zero mass
Seen vs. unseen

[Figure: the probability space divided between A, the events seen in the training data, and A’, the probability mass of events not in the training data.]

The problem with MLE is that it distributes A’ among members of A.
The solution
  • Solution is to correct MLE estimation using a smoothing technique.
    • More on this in the next part
    • But cf. Tutorial 1, which introduced the simplest method of smoothing known.
Adequacy of different order models
  • Manning & Schütze (1999) report results for n-gram models of a corpus of the novels of Austen.
  • Task: use n-gram model to predict the probability of a sentence in the test data.
  • Models:
    • unigram: essentially a zero-context Markov model, uses only the probability of individual words
    • bigram
    • trigram
    • 4-gram
Example test case
  • Training Corpus: five Jane Austen novels
  • Corpus size = 617,091 words
  • Vocabulary size = 14,585 unique types
  • Task: predict the next word of the trigram
  • “inferior to ________”
    • from test data, Persuasion:
    • “[In person, she was] inferior to both [sisters.]”
Adequacy of unigrams
  • Problems with unigram models:
    • not entirely hopeless because most sentences contain a majority of highly common words
    • ignores syntax completely:
      • P(In person she was inferior) = P(inferior was she person in)
Adequacy of bigrams
  • Bigrams:
    • improve situation dramatically
    • some unexpected results:
      • p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person
Adequacy of trigrams
  • Trigram models will do brilliantly when they’re useful.
    • They capture a surprising amount of contextual variation in text.
    • Biggest limitation:
      • most new trigrams in test data will not have been seen in training data.
  • Problem carries over to 4-grams, and is much worse!
Reliability vs. Discrimination
  • larger n: more information about the context of the specific instance (greater discrimination)
  • smaller n: more instances in training data, better statistical estimates (more reliability)
Backing off
  • Possible way of striking a balance between reliability and discrimination:
    • backoff model:
      • where possible, use a trigram
      • if trigram is unseen, try and “back off” to a bigram model
      • if bigrams are unseen, try and “back off” to a unigram
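The backoff chain above can be sketched with toy probability tables. This simplified version just falls through the orders; it omits the discounting and alpha weights that a proper backoff model (e.g. Katz, discussed later) adds:

```python
def backoff_prob(trigrams, bigrams, unigrams, w1, w2, w3):
    # Use the trigram estimate if seen; else back off to bigram, then unigram.
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        return bigrams[(w2, w3)]
    return unigrams.get(w3, 0.0)

# Toy (made-up) probability tables for illustration.
tri = {("inferior", "to", "both"): 0.4}
bi = {("to", "both"): 0.2, ("to", "them"): 0.1}
uni = {"both": 0.01, "them": 0.02}
```

For example, an unseen trigram ("superior", "to", "them") backs off to the bigram estimate for ("to", "them").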
  • Recall: Entropy is a measure of uncertainty:
    • high entropy = high uncertainty
  • perplexity:
    • if I’ve trained on a sample, how surprised am I when exposed to a new sample?
    • a measure of uncertainty of a model on new data
Entropy as “expected value”
  • One way to think of the summation part is as a weighted average of the information content.
  • We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.
Comparing distributions
  • We have a language model built from a sample. The sample is a probability distribution q over n-grams.
    • q(x) = the probability of some n-gram x in our model.
  • The sample is generated from a true population (“the language”) with probability distribution p.
    • p(x) = the probability of x in the true distribution
Evaluating a language model
  • We’d like an estimate of how good our model is as a model of the language
    • i.e. we’d like to compare q to p
  • We don’t have access to p. (Hence, can’t use KL-Divergence)
  • Instead, we use our test data as an estimate of p.
Cross-entropy: basic intuition
  • Measure the number of bits needed to identify an event coming from p, if we code it according to q:
    • We draw sequences according to p;
    • but we sum the log of their probability according to q.
  • This estimate is called cross-entropy H(p,q)
Cross-entropy: p vs. q
  • Cross-entropy is an upper bound on the entropy of the true distribution p:
    • H(p) ≤ H(p,q)
    • if our model distribution (q) is good, H(p,q) ≈ H(p)
  • We estimate cross-entropy based on our test data.
    • Gives an estimate of the distance of our language model from the distribution in the test sample.
Estimating cross-entropy

  • H(p,q) = −Σx p(x) log2 q(x)
    • the probability p(x) comes from the test set
    • the log2 q(x) term comes from the language model

  • The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p, is 2^H(p,q).
  • A perplexity value of k (obtained on a test set) tells us:
    • our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.
  • The lower the perplexity, the better the language model (the lower the surprise on our test data).
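The perplexity computation can be sketched as follows, estimating cross-entropy as the average negative log2 probability the model assigns to the test items:

```python
import math

def perplexity(model_probs):
    # Cross-entropy: average negative log2 probability over the test items.
    cross_entropy = -sum(math.log2(q) for q in model_probs) / len(model_probs)
    # Perplexity: 2 to the power of the cross-entropy.
    return 2 ** cross_entropy

# A model that assigns probability 1/4 to each test item is as surprised,
# on average, as if it had to make 4 guesses per item.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

This matches the "k guesses" intuition: uniform probabilities of 1/4 give perplexity 4, and uniform probabilities of 1/2 give the lower (better) perplexity 2.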
Perplexity example (Jurafsky & Martin, 2000, p. 228)
  • Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal)
    • applied smoothing
    • 38 million words
    • Vocab of 19,979 (low-frequency words mapped to UNK).
  • Computed perplexity on a test set of 1.5 million words.
J&M’s results
  • Trigrams do best of all.
  • Value suggests the extent to which the model can fit the data in the test set.
  • Note: with unigrams, the model has to make lots of guesses!
  • Main point about Markov-based language models:
    • data sparseness is always a problem
    • smoothing techniques are required to estimate probability of unseen events
  • Next part discusses more refined smoothing techniques than those seen so far.

Part 2

Smoothing (aka discounting) techniques

  • Smoothing methods:
    • Simple smoothing
    • Witten-Bell & Good-Turing estimation
    • Held-out estimation and cross-validation
  • Combining several n-gram models:
    • back-off models
Rationale behind smoothing

  • Sample frequencies:
    • seen events, with probability P
    • unseen events (including “grammatical zeroes”), with probability 0
  • Smoothing adjusts the sample frequencies to approximate the real population frequencies, in which the events unseen in our sample also occur.
  • This results in lower probabilities for seen events (discounting), with the left-over probability mass distributed over unseens (smoothing).
Maximum Likelihood Estimate

[Figure: under MLE, unknowns are assigned 0% probability mass.]

Actual Probability Distribution

[Figure: the unknowns have non-zero probabilities in the real distribution.]
Laplace’s Law

  • Add-one smoothing: P(x) = (C(x) + 1) / (N + V)
  • NB: this method ends up assigning most of the probability mass to unseens.
Generalisation: Lidstone’s Law

  • P(x) = (C(x) + λ) / (N + λV)
  • P = probability of a specific n-gram
  • C(x) = count of n-gram x in training data
  • N = total n-grams in training data
  • V = number of “bins” (possible n-grams)
  • λ = a small positive number

M.L.E.: λ = 0; Laplace’s Law: λ = 1 (add-one smoothing); Jeffreys-Perks Law: λ = ½
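Lidstone’s Law can be sketched directly from the equation; with λ = 1 this is Laplace add-one smoothing:

```python
from collections import Counter

def lidstone_prob(counts, x, num_bins, lam):
    # P(x) = (C(x) + lambda) / (N + lambda * V)
    N = sum(counts.values())
    return (counts.get(x, 0) + lam) / (N + lam * num_bins)

# Toy bigram counts; V = 4 possible bins (2 seen, 2 unseen).
counts = Counter({"the cat": 2, "the dog": 1})
p_seen = lidstone_prob(counts, "the cat", num_bins=4, lam=1)    # (2+1)/(3+4)
p_unseen = lidstone_prob(counts, "the fish", num_bins=4, lam=1)  # (0+1)/(3+4)
```

Summed over all 4 bins, the probabilities are 3/7 + 2/7 + 1/7 + 1/7 = 1, so the unseens have received real probability mass at the expense of the seen events.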

Objections to Lidstone’s Law
  • Need an a priori way to determine λ.
  • Predicts all unseen events to be equally likely
  • Gives probability estimates linear in the M.L.E. frequency
Main intuition
  • A zero-frequency event can be thought of as an event which hasn’t happened (yet).
    • The probability of it happening can be estimated from the probability of something happening for the first time (i.e. the hapaxes in our corpus).
  • The count of things which are seen only once can be used to estimate the count of things that are never seen.
Witten-Bell method

  • T = no. of times we saw an event for the first time
    = no. of different n-gram types (bins)
  • NB: T is the no. of types actually attested (unlike V, the no. of possible n-grams in our previous estimations)
  • Estimate the total probability mass of unseen n-grams:
    • T / (N + T): the no. of actual types (T) over the no. of actual n-grams (N) plus the no. of actual types (T)
    • Basically, the MLE of the probability of a new type event occurring (“being seen for the first time”)
    • This is the total probability mass to be distributed among all zero events (unseens)
Witten-Bell method
  • Divide the total probability mass among all the zero n-grams. Can distribute it equally.
  • Remove this probability mass from the non-zero n-grams (discounting): each seen n-gram x is discounted to P(x) = C(x) / (N + T).
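The two steps — reserve T/(N+T) for unseens and discount seen counts to C/(N+T) — can be sketched for unigrams as follows (a minimal version assuming equal distribution over the Z unseen bins):

```python
from collections import Counter

def witten_bell(counts, num_bins):
    # Witten-Bell: unseen types share T/(N+T); seen types get C/(N+T).
    N = sum(counts.values())   # tokens actually seen
    T = len(counts)            # types actually seen
    Z = num_bins - T           # unseen bins
    probs = {x: c / (N + T) for x, c in counts.items()}
    p_unseen = T / (Z * (N + T)) if Z else 0.0  # per unseen bin
    return probs, p_unseen

probs, p_unseen = witten_bell(Counter({"a": 3, "b": 1}), num_bins=6)
```

With N = 4, T = 2 and Z = 4, the seen types get 3/6 and 1/6, and the unseens share 2/6, so the whole distribution sums to 1.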
Witten-Bell vs. Add-one
  • If we work with unigrams, Witten-Bell and Add-one smoothing give very similar results.
  • The difference is with n-grams for n>1.
  • Main idea: estimate probability of an unseen bigram <w1,w2> from the probability of seeing a bigram starting with w1 for the first time.
Witten-Bell with bigrams

  • Generalised total probability mass estimate of unseen bigrams beginning with w1:
    • T(w1) / (N(w1) + T(w1))
    • where T(w1) = no. of bigram types beginning with w1, and N(w1) = no. of bigram tokens beginning with w1
Witten-Bell with bigrams
  • Non-zero bigrams get discounted as before, but again conditioning on history:
  • Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams.
  • The amount assigned will depend on the first word in the bigram (first n-1 words in the n-gram).
Good-Turing method
  • Introduced by Good (1953), but partly attributed to Alan Turing
    • work carried out at Bletchley Park during WWII
  • “Simple Good-Turing” method (Gale and Sampson 1995)
  • Main idea:
    • re-estimate amount of probability mass assigned to low-frequency or zero n-grams based on the number of n-grams (types) with higher frequencies
  • Given:
    • sample frequency of a type (n-gram, aka bin, aka equivalence class)
  • GT provides:
    • an estimate of the true population frequency of a type
    • an estimate of the total probability of unseen types in the population.
  • the sample frequency C(x) of an n-gram x in a corpus of size N with vocabulary size V
  • the no. of n-gram types with frequency C, written TC
  • C*(x): the estimated true population frequency of an n-gram x with sample frequency C(x)
    • N.B. in a perfect sample, C(x) = C*(x)
    • in real life, C*(x) < C(x) (i.e. the sample overestimates the true frequency)
Some background
  • Suppose:
    • we had access to the true population probability of our n-grams
    • we treat each occurrence of an n-gram as a Bernoulli trial: either the n-gram is x or not
    • i.e. a binomial assumption
  • Then, we could calculate the expected no. of types with frequency C, E(TC)
    • = the expected “frequency of frequency”
    • where TC = no. of n-gram types with frequency C, and N = total no. of n-grams
Background continued
  • Given an estimate of E(TC), we could then calculate C*.
  • Fundamental underlying theorem: C* = (C + 1) · E(TC+1) / E(TC)
  • Note: this makes the estimated (“true”) frequency C* a function of the expected number of types with frequency C+1. Like Witten-Bell, it makes the adjusted count of zero-frequency events dependent on events of frequency 1.
Background continued
  • We can use the above to calculate adjusted frequencies directly.
    • Often, though, we want to calculate the total “missing probability mass” for zero-count n-grams (the unseens): T1 / N
    • where T1 is the number of types with frequency 1, and N is the total number of items seen in training
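The adjusted counts and the missing mass can be sketched together; this minimal version applies the raw theorem with observed TC in place of E(TC), and simply keeps C where TC+1 = 0 (the gap problem addressed in the next slides):

```python
from collections import Counter

def good_turing(counts):
    # Adjusted counts C* = (C + 1) * T_{C+1} / T_C; unseen mass = T_1 / N.
    N = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # T_C: types with frequency C
    adjusted = {}
    for x, c in counts.items():
        t_c = freq_of_freq[c]
        t_c1 = freq_of_freq.get(c + 1, 0)
        # Fall back to the raw count when T_{C+1} = 0 (a "gap").
        adjusted[x] = (c + 1) * t_c1 / t_c if t_c1 else c
    missing_mass = freq_of_freq.get(1, 0) / N
    return adjusted, missing_mass

adjusted, missing_mass = good_turing(Counter({"a": 1, "b": 1, "c": 2}))
```

Here the two hapaxes get the adjusted count (1+1)·T2/T1 = 2·1/2 = 1.0, and T1/N = 2/4 of the probability mass is reserved for unseens.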
Example of readjusted counts
  • From: Jurafsky & Martin 2009
  • Examples are bigram counts from two corpora.
A little problem
  • The GT theorem assumes that we know the expected population count of types!
    • We’ve assumed that we get this from a corpus, but this, of course, is not the case.
  • Secondly, TC+1 will often be zero! For example, it’s quite possible to find several n-grams with frequency 100, and no n-grams with frequency 101!
    • Note that this is more typical for high frequencies, than low ones.
Low frequencies and gaps
  • Low C: linear trend.
  • Higher C: angular discontinuity.
  • Frequencies in the corpus display “jumps”, and so do frequencies of frequencies.
  • This implies the presence of gaps at higher frequencies.

[Figure: log10 frequency of frequency (TC) plotted against log10 frequency (C); after Gale and Sampson 1995.]
Possible solution
  • Use Good-Turing for n-grams with corpus frequency less than some constant k (typically, k = 5).
    • Low-frequency types are numerous, so GT is reliable.
    • High-frequency types are assumed to be near the “truth”.
  • To avoid gaps (where TC+1 = 0), empirically estimate a function S(C) that acts as a proxy for E(TC)
Proxy function for gaps

  • For any sample frequency C, let:
    • S(C) = TC / (0.5 · (C″ − C′))
  • where:
    • C″ is the next highest non-zero frequency
    • C′ is the previous non-zero frequency

[Figure: log10 S(C) plotted against log10 frequency; after Gale and Sampson 1995.]
Gale and Sampson’s combined proposal
  • For low frequencies (C < k), use the standard equation, assuming E(TC) = TC
    • If we have gaps (i.e. TC = 0), we use our proxy function S(C) for TC, obtained through linear regression to fit the log-log curve
  • For high frequencies, we can assume that C* = C
  • Finally, estimate the probability of the n-gram: P(x) = C*(x) / N
GT Estimation: Final step
  • GT gives approximations to probabilities.
    • Re-estimated probabilities of n-grams won’t sum to 1
    • it is necessary to re-normalise, using the method of Gale and Sampson (1995)
A final word on GT smoothing
  • In practice, GT is very seldom used on its own.
  • Most frequently, we use GT with backoff, about which, more later...
Held-out estimation: General idea
  • “hold back” some training data
  • create our language model
  • compare, for each n-gram (w1…wn):
    • Ct: estimated frequency of the n-gram based on training data
    • Ch: frequency of the n-gram in the held-out data
Held-out estimation
  • Define TotC as:
    • the total no. of times that n-grams with frequency C in the training corpus actually occurred in the held-out data
  • Re-estimate the probability of an n-gram x with training frequency C: P(x) = TotC / (TC · N), where TC is the no. of types with training frequency C and N is the size of the held-out data
Cross-validation
  • Problem with held-out estimation:
    • our training set is smaller
  • Way around this:
    • divide training data into training + validation data (roughly equal sizes)
    • use each half first as training then as validation (i.e. train twice)
    • take a mean
Cross-Validation (a.k.a. deleted estimation)

  • Use training and validation data:
    • Split the training data into two halves, A and B
    • Model 1: train on A, validate on B
    • Model 2: train on B, validate on A
    • Combine Model 1 and Model 2 into the final model
  • Combined estimate: the arithmetic mean of the two models’ estimates
The rationale
  • We would like to balance between reliability and discrimination:
    • use trigram where useful
    • otherwise back off to bigram, unigram
  • How can you develop a model to utilize different length n-grams as appropriate?
Interpolation vs. Backoff
  • Interpolation: compute probability of an n-gram as a function of:
    • The n-gram itself
    • All lower-order n-grams
    • Probabilities are linearly interpolated.
    • Lower-order n-grams are always used.
  • Backoff:
    • If n-gram exists in model, use that
    • Else fall back to lower order n-grams
Simple interpolation: trigram example
  • Combine all estimates, weighted by a factor:
    • P(w3|w1w2) = λ1·P(w3) + λ2·P(w3|w2) + λ3·P(w3|w1w2)
  • All parameters should sum to 1: λ1 + λ2 + λ3 = 1
  • NB: we have different interpolation parameters for the various n-gram sizes.
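Simple linear interpolation can be sketched directly; the lambda values below are made-up illustration weights (in practice they are estimated on held-out data, as the next slides discuss):

```python
def interp_prob(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    # Linear interpolation of unigram, bigram and trigram estimates.
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# e.g. P(barked) = 0.1, P(barked|dog) = 0.2, P(barked|the dog) = 0.4
p = interp_prob(0.1, 0.2, 0.4)
```

Because the weights sum to 1 and lower-order estimates are always included, even a zero-count trigram receives a non-zero interpolated probability.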
More sophisticated version
  • Suppose we have the trigrams:
    • (the dog barked)
    • (the puppy barked)
  • Suppose (the dog) occurs several times in our corpus, but not (the puppy)
  • In our interpolation, we might want to weight trigrams of the form (the dog _) more than (the puppy _) (because the former is composed of a more reliable bigram)
  • Rather than using the same parameter for all trigrams, we could condition on the initial bigram.
Sophisticated interpolation: trigram example
  • Combine all estimates, weighted by factors that depend on the context.
Where do parameters come from?
  • Typically:
    • We estimate counts from training data.
    • We estimate parameters from held-out data.
    • The lambdas are chosen so that they maximise the likelihood on the held-out data.
  • Often, the expectation maximisation (EM) algorithm is used to discover the right values to plug into the equations.
  • (more on this later)
  • Recall that backoff models only use lower order n-grams when the higher order one is unavailable.
  • Best known model by Katz (1987).
    • Uses backoff with smoothed probabilities
    • Smoothed probabilities obtained using Good-Turing estimation.
Backoff: trigram example
  • Backoff estimate:
    • Pkatz(w3|w1w2) = P*(w3|w1w2) if C(w1w2w3) > 0; otherwise α(w1w2) · Pkatz(w3|w2)
  • That is:
    • If the trigram has count > 0, we use the smoothed (P*) estimate
    • If not, we recursively back off to lower orders, weighting with a parameter (alpha)
Backoff vs. Simple smoothing
  • With Good-Turing smoothing, we typically end up with the “leftover” probability mass that is distributed equally among the unseens.
    • So GT tells us how much leftover probability there is.
  • Backoff gives us a better way of distributing this mass among unseen trigrams, by relying on the counts of their component bigrams and unigrams.
    • So backoff tells us how to divide that leftover probability.
Why we need those alphas
  • If we rely on true probabilities, then for a given n-gram window, the probabilities of all possible next words sum to 1: Σw3 P(w3|w1w2) = 1
  • But if we also back off to a lower-order model when the trigram probability is 0, we’re adding extra probability mass, and the sum will now exceed 1.
  • We therefore need:
    • P* to discount the original MLE estimate (P)
    • Alpha parametersto ensure that the probability from the lower-order n-grams sums up to exactly the amount we discounted in P*.
Computing the alphas -- I
  • Recall: we are dealing with the case where C(w1w2w3) = 0
  • Let β(w1w2) represent the amount of probability left over when we discount the seen trigrams beginning with (w1, w2):
    • β(w1w2) = 1 − Σ{w3: C(w1w2w3)>0} P*(w3|w1w2)

The sum of the MLE probabilities P for the seen trigrams beginning with (w1, w2) is 1. The smoothed probabilities P* sum to less than 1. We’re taking the remainder.

Computing the alphas -- II
  • We now compute alpha:
    • α(w1w2) = β(w1w2) / Σ{w3: C(w1w2w3)=0} P(w3|w2)

The denominator sums over all unseen trigrams beginning with our bigram. We distribute the remaining mass β(w1w2) over all those trigrams.
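The beta and alpha computations can be sketched with toy probability tables; `p_star` (smoothed trigram estimates) and `p_bi` (bigram estimates) below are made-up illustration values:

```python
def beta(smoothed_tri, w1, w2):
    # Probability mass left over after discounting seen trigrams (w1 w2 _).
    return 1.0 - sum(p for (a, b, c), p in smoothed_tri.items()
                     if (a, b) == (w1, w2))

def alpha(smoothed_tri, bigram_probs, w1, w2):
    # Normaliser: the mass handed to unseen continuations must equal beta.
    seen = {c for (a, b, c) in smoothed_tri if (a, b) == (w1, w2)}
    denom = sum(p for (b, c), p in bigram_probs.items()
                if b == w2 and c not in seen)
    return beta(smoothed_tri, w1, w2) / denom

# Toy smoothed trigram (P*) and bigram tables.
p_star = {("the", "dog", "barked"): 0.4, ("the", "dog", "ran"): 0.3}
p_bi = {("dog", "barked"): 0.5, ("dog", "ran"): 0.3, ("dog", "slept"): 0.2}
```

Here β(the, dog) = 1 − 0.7 = 0.3, and the only unseen continuation is "slept" with bigram probability 0.2, so α(the, dog) = 0.3 / 0.2 = 1.5.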

What about unseen bigrams?
  • So what happens if even the bigram (w1w2) in (w1w2w3) has count zero?
    • We fall back to an even lower order, backing off from the bigram to the unigram model in the same way (with its own β and α).
Problems with Backing-Off
  • Suppose (w2 w3) is common but trigram (w1 w2 w3) is unseen
  • This may be a meaningful gap, rather than a gap due to chance and scarce data
    • i.e., a “grammatical null”
  • May not want to back-off to lower-order probability
    • in this case, p = 0 is accurate!
Reference

  • Gale, W. A., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2: 217-237.