Corpora and Statistical Methods


## PowerPoint Slideshow about 'Corpora and Statistical Methods' - neal


### Corpora and Statistical Methods

Language modelling using N-Grams

• The word-prediction task (Shannon game)
• Given:
• a sequence of words (the history)
• a choice of next word
• Predict:
• the most likely next word
• Generalises easily to other problems, such as predicting the POS of unknowns based on history.
Applications of the Shannon game
• Automatic speech recognition (cf. tutorial 1):
• given a sequence of possible words, estimate its probability
• Context-sensitive spelling correction:
• Many spelling errors are real words
• He walked for miles in the dessert. (intended: desert)
• Identifying such errors requires a global estimate of the probability of a sentence.
Applications of N-gram models generally
• POS Tagging:
• predict the POS of an unknown word by looking at its history
• Statistical parsing:
• e.g. predict the group of words that together form a phrase
• Statistical NL Generation:
• given a semantic form to be realised as text, and several possible realisations, select the most probable one.
A real-world example: Google’s did you mean
• Google uses an n-gram model (based on sequences of characters, not words).
• In this case, the sequence apple desserts is much more probable than apple deserts
How it works
• Documents provided by the search engine are added to:
• An index (for fast retrieval)
• A language model (based on probability of a sequence of characters)
• A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”).
• Outcome is a context-sensitive spelling correction:
• “apple deserts” → “apple desserts”
• “frodbaggins” → “frodobaggins”
• “frod” → “ford”
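The candidate-generation step described above can be sketched as follows. This is only an illustration and not Google's implementation: the function names `edits1` and `correct` and the `toy_counts` table are assumptions made for the example, and a real system would score candidates with a character n-gram language model rather than a lookup table.

```python
import string

def edits1(word):
    """Return all strings that are one character edit away from `word`.

    Edits are deletions, transpositions, substitutions and insertions.
    """
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

# Hypothetical frequencies standing in for a language model score.
toy_counts = {"apple desserts": 50, "apple deserts": 2, "ford": 900, "frod": 1}

def correct(query):
    """Pick the best-scoring string among the query and its one-edit variants."""
    candidates = {query} | {c for c in edits1(query) if c in toy_counts}
    return max(candidates, key=lambda c: toy_counts.get(c, 0))

print(correct("apple deserts"))
print(correct("frod"))
```

Only single edits are considered here; allowing two or more edits enlarges the candidate set considerably.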
The noisy channel model
• After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall p. 198
The Markov Assumption
• Markov models:
• probabilistic models which predict the likelihood of a future unit based on limited history
• in language modelling, this pans out as the local history assumption:
• the probability of $w_n$ depends on a limited number of prior words
• utility of the assumption:
• we can rely on a small n for our n-gram models (bigram, trigram)
• long n-grams become exceedingly sparse
• Probabilities become very small with long sequences
The structure of an n-gram model
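• The task can be re-stated in conditional probabilistic terms: estimate $P(w_n \mid w_1 \ldots w_{n-1})$, the probability of the next word given its history.
• Limiting n under the Markov Assumption means:
• greater chance of finding more than one occurrence of the sequence $w_1 \ldots w_{n-1}$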
• more robust statistical estimations
• N-grams are essentially equivalence classes or bins
• every unique n-gram is a type or bin
Structure of n-gram models (II)
• If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model
• Note terminology:
• n-gram model = (n-1)th order Markov Model
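The idea of treating histories as bins can be sketched in a few lines. The helper names `ngrams` and `bin_by_history` are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bin_by_history(tokens, n):
    """Group the final word of each n-gram under its (n-1)-word history."""
    bins = defaultdict(list)
    for gram in ngrams(tokens, n):
        bins[gram[:-1]].append(gram[-1])
    return dict(bins)

tokens = "the girl saw the dog and the girl smiled".split()
trigram_bins = bin_by_history(tokens, 3)
# All continuations observed after the two-word history ("the", "girl").
print(trigram_bins[("the", "girl")])
```

With n = 3, every history is a pair of words, which corresponds to a second-order Markov model.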
Methodological considerations
• We are often concerned with:
• building an n-gram model
• evaluating it
• We therefore make a distinction between training and test data
• You never test on your training data
• If you do, you’re bound to get good results.
• N-gram models tend to be overtrained, i.e. if you train on a corpus C, your model will be biased towards expecting the kinds of events in C.
• Another term for this: overfitting
Dividing the data
• Given: a corpus of n units (words, sentences, … depending on the task)
• A large proportion of the corpus is reserved for training.
• A smaller proportion for testing/evaluation (normally 5-10%)
Held-out (validation) data
• Held-out estimation:
• during training, we sometimes estimate parameters for our model empirically
• commonly used in smoothing (how much probability space do we want to set aside for unseen data?)
• therefore, the training set is often split further into training data and validation data
• normally, held-out data is 10% of the size of the training data
Development data
• A common approach:
1. train an algorithm on training data
   a. (estimate further parameters on held-out data if required)
2. evaluate it
3. re-tune it
4. go back to Step 1 until no further fine-tuning is necessary
5. carry out final evaluation
• For this purpose, it’s useful to have:
• training data for step 1
• development set for steps 2-4
• final test set for step 5
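A three-way split of this kind might be produced as below. The function name `split_corpus`, the 10% fractions and the fixed seed are assumptions chosen for illustration; the appropriate proportions depend on the task and corpus size.

```python
import random

def split_corpus(units, test_frac=0.1, dev_frac=0.1, seed=0):
    """Shuffle the corpus units and split them into (train, dev, test) lists."""
    items = list(units)
    random.Random(seed).shuffle(items)   # fixed seed so the split is repeatable
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test_set = items[:n_test]
    dev_set = items[n_test:n_test + n_dev]
    train_set = items[n_test + n_dev:]
    return train_set, dev_set, test_set

corpus = [f"sentence {i}" for i in range(100)]
train_set, dev_set, test_set = split_corpus(corpus)
print(len(train_set), len(dev_set), len(test_set))
```

The final test set is set aside first and is only touched once, for the final evaluation.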
Significance testing
• Often, we compare the performance of our algorithm against some baseline.
• A single, raw performance score won’t tell us much. We need to test for significance (e.g. using t-test).
• Typical method:
• Split test set into several small test sets, e.g. 20 samples
• evaluation carried out separately on each
• mean and variance estimated based on 20 different samples
• test for significant difference between algorithm and a predefined baseline
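The procedure above can be sketched with a paired t statistic computed over per-sample scores. The scores below are invented for illustration, only 5 samples are used instead of 20 to keep the example short, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics

def paired_t_statistic(baseline_scores, model_scores):
    """Return the paired t statistic for per-sample score differences."""
    diffs = [b - m for b, m in zip(baseline_scores, model_scores)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err

# Hypothetical perplexities on 5 small test sets (lower is better).
baseline = [120, 118, 125, 121, 119]
model = [119, 116, 122, 119, 117]
t = paired_t_statistic(baseline, model)
print(round(t, 3))
```

The resulting value is then compared against the critical value of the t distribution with (number of samples - 1) degrees of freedom.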
Size of n-gram models
• In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
• For a bigram model: $N^2$ possible n-grams in principle
• For a trigram model: $N^3$ possible n-grams.
Size (continued)
• Each n-gram in our model is a parameter used to estimate probability of the next possible word.
• too many parameters make the model unwieldy
• too many parameters lead to data sparseness: most of them will have f = 0 or 1
• Most models stick to unigrams, bigrams or trigrams.
• estimation can also combine different order models
Further considerations
• When building a model, we tend to take into account the start-of-sentence symbol:
• the girl swallowed a large green caterpillar
• <s> the
• the girl
• Also typical to map all tokens w such that count(w) < k to <UNK>:
• usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”
• this reduces the parameter space
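Both preprocessing steps can be sketched together. The function name `prepare` and the threshold k = 2 are assumptions for the example; the second sentence is invented so that some tokens occur more than once.

```python
from collections import Counter

def prepare(sentences, k=2, n=2):
    """Pad sentences with start symbols and map rare tokens to <UNK>.

    Tokens with corpus count < k are replaced by <UNK>;
    n - 1 start-of-sentence symbols are prepended for an n-gram model.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    prepared = []
    for sent in sentences:
        mapped = [tok if counts[tok] >= k else "<UNK>" for tok in sent]
        prepared.append(["<s>"] * (n - 1) + mapped)
    return prepared

sentences = [
    "the girl swallowed a large green caterpillar".split(),
    "the girl saw a caterpillar".split(),
]
prepared = prepare(sentences)
for sent in prepared:
    print(sent)
```

After this step, the first bigram of every sentence has the form (`<s>`, first word), so sentence-initial words are conditioned on a history like any other word.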
Maximum Likelihood Estimation Approach
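• Basic equation: $P_{MLE}(w_n \mid w_1 \ldots w_{n-1}) = \frac{C(w_1 \ldots w_n)}{C(w_1 \ldots w_{n-1})}$
• In a unigram model, this reduces to simple probability: $P_{MLE}(w) = \frac{C(w)}{N}$, with N the total number of tokens.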
• MLE models estimate probability using relative frequency.
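Relative-frequency estimation for a bigram model, i.e. the count of the bigram divided by the count of its history, can be sketched as follows. The function name `mle_bigram_model` and the toy sentence are assumptions for the example.

```python
from collections import Counter

def mle_bigram_model(tokens):
    """Return a function prob(word, history) using relative frequencies."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it acts as a history (not the final token).
    history_counts = Counter(tokens[:-1])

    def prob(word, history):
        if history_counts[history] == 0:
            return 0.0
        return bigram_counts[(history, word)] / history_counts[history]

    return prob

tokens = "<s> the girl saw the dog </s>".split()
p = mle_bigram_model(tokens)
print(p("girl", "the"))   # "the" occurs twice as a history, once before "girl"
print(p("cat", "the"))    # unseen bigram
```

The second call returns zero because the bigram never occurs in the training tokens.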
Limitations of MLE
• MLE builds the model that maximises the probability of the training data.
• Unseen events in the training data are assigned zero probability.
• Since n-gram models tend to be sparse, this is a real problem.
• Consequences:
• seen events are given more probability mass than they have
• unseen events are given zero mass
Seen/unseen

[Figure: the total probability mass is divided into A (probability mass of events in training data) and A’ (probability mass of events not in training data).]

The problem with MLE is that it distributes A’ among members of A.

The solution
• Solution is to correct MLE estimation using a smoothing technique.
• More on this in the next part
• But cf. Tutorial 1, which introduced the simplest method of smoothing known.
• Manning and Schütze (1999) report results for n-gram models trained on a corpus of Jane Austen’s novels.
• Task: use n-gram model to predict the probability of a sentence in the test data.
• Models:
• unigram: essentially a zero-context Markov model, which uses only the probability of individual words
• bigram
• trigram
• 4-gram
Example test case
• Training Corpus: five Jane Austen novels
• Corpus size = 617,091 words
• Vocabulary size = 14,585 unique types
• Task: predict the next word of the trigram
• “inferior to ________”
• from test data, Persuasion:
• “[In person, she was] inferior to both [sisters.]”
• Problems with unigram models:
• not entirely hopeless because most sentences contain a majority of highly common words
• ignores syntax completely:
• P(In person she was inferior) = P(inferior was she person in)
• Bigrams:
• improve situation dramatically
• some unexpected results:
• p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person
• Trigram models will do brilliantly when they’re useful.
• They capture a surprising amount of contextual variation in text.
• Biggest limitation:
• most new trigrams in test data will not have been seen in training data.
• Problem carries over to 4-grams, and is much worse!
Reliability vs. Discrimination
• smaller n: more instances in training data, better statistical estimates (more reliability)
• larger n: more information about the context of the specific instance (more discrimination)
Backing off
• Possible way of striking a balance between reliability and discrimination:
• backoff model:
• where possible, use a trigram
• if trigram is unseen, try and “back off” to a bigram model
• if bigrams are unseen, try and “back off” to a unigram
Perplexity
• Recall: Entropy is a measure of uncertainty:
• high entropy = high uncertainty
• perplexity:
• if I’ve trained on a sample, how surprised am I when exposed to a new sample?
• a measure of uncertainty of a model on new data
Entropy as “expected value”
• One way to think of the summation part of the entropy formula, $H(X) = -\sum_x p(x) \log_2 p(x)$, is as a weighted average of the information content $-\log_2 p(x)$.
• We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.
Comparing distributions
• We have a language model built from a sample. The model defines a probability distribution q over n-grams.
• q(x) = the probability of some n-gram x in our model.
• The sample is generated from a true population (“the language”) with probability distribution p.
• p(x) = the probability of x in the true distribution
Evaluating a language model
• We’d like an estimate of how good our model is as a model of the language
• i.e. we’d like to compare q to p
• The true distribution p is unknown, so instead we use our test data as an estimate of p.
Cross-entropy: basic intuition
• Measure the number of bits needed to identify an event coming from p, if we code it according to q:
• We draw sequences according to p;
• but we sum the log of their probability according to q.
• This estimate is called cross-entropy: $H(p,q) = -\sum_x p(x) \log_2 q(x)$
Cross-entropy: p vs. q
• Cross-entropy is an upper bound on the entropy of the true distribution p:
• H(p) ≤ H(p,q)
• if our model distribution (q) is good, H(p,q) ≈ H(p)
• We estimate cross-entropy based on our test data.
• Gives an estimate of the distance of our language model from the distribution in the test sample.
Estimating cross-entropy

$H(p,q) = -\sum_x p(x) \log_2 q(x)$

• $p(x)$: probability according to p (test set)
• $-\log_2 q(x)$: entropy (information content) according to q (language model)

Perplexity
• The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p is: $PP(p,q) = 2^{H(p,q)}$
• A perplexity value of k (obtained on a test set) tells us:
• our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.
• The lower the perplexity, the better the language model (the lower the surprise on our test data).
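Perplexity can be computed as 2 raised to the cross-entropy, where the cross-entropy is estimated as the average negative log2 probability that the model assigns to the test events. The sketch below assumes a deliberately trivial model (uniform over 8 words) so that the "k guesses" reading is easy to verify; the names `cross_entropy`, `perplexity` and `uniform` are illustrative.

```python
import math

def cross_entropy(model_prob, test_events):
    """Average negative log2 probability assigned by the model to test events."""
    total = sum(-math.log2(model_prob(e)) for e in test_events)
    return total / len(test_events)

def perplexity(model_prob, test_events):
    """Perplexity = 2 ** cross-entropy (in bits)."""
    return 2 ** cross_entropy(model_prob, test_events)

def uniform(word):
    # Hypothetical model: every word in an 8-word vocabulary is equally likely.
    return 1 / 8

test_words = "a b c a d".split()
print(cross_entropy(uniform, test_words))   # 3 bits per word
print(perplexity(uniform, test_words))      # as uncertain as 8 equally likely guesses
```

A model that assigns zero probability to any test event has infinite cross-entropy, which is one practical reason smoothing is needed before evaluation.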
Perplexity example (Jurafsky & Martin, 2000, p. 228)
• Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal)
• applied smoothing
• 38 million words
• Vocab of 19,979 (low-frequency words mapped to UNK).
• Computed perplexity on a test set of 1.5 million words.
J&M’s results
• Trigrams do best of all.
• Value suggests the extent to which the model can fit the data in the test set.
• Note: with unigrams, the model has to make lots of guesses!
Summary
• Main point about Markov-based language models:
• data sparseness is always a problem
• smoothing techniques are required to estimate probability of unseen events
• Next part discusses more refined smoothing techniques than those seen so far.

### Part 2

Smoothing (aka discounting) techniques

Overview…
• Smoothing methods:
• Simple smoothing
• Witten-Bell & Good-Turing estimation
• Held-out estimation and cross-validation
• Combining several n-gram models:
• back-off models
Rationale behind smoothing
• Sample frequencies
• seen events with probability P
• unseen events (including “grammatical zeroes”) with probability 0
• Real population frequencies
• seen events
• (including the unseen events in our sample)
• Sample frequencies + smoothing are used to approximate the real population frequencies.
• This results in lower probabilities for seen events (discounting). The left-over probability mass is distributed over unseens (smoothing).

[Figure: Maximum Likelihood Estimate, plotting F(w). Unknowns are assigned 0% probability mass.]

[Figure: Actual Probability Distribution, plotting F(w). The unknowns have non-zero probabilities in the real distribution.]

Laplace’s Law

[Figure: F(w) under Laplace’s Law.] NB. This method ends up assigning most of the probability mass to unseens.

Generalisation: Lidstone’s Law
$P(x) = \frac{C(x) + \lambda}{N + V\lambda}$

• P = probability of specific n-gram
• C(x) = count of n-gram x in training data
• N = total n-grams in training data
• V = number of “bins” (possible n-grams)
• λ = small positive number

• M.L.E.: λ = 0
• Laplace’s Law: λ = 1 (add-one smoothing)
• Jeffreys-Perks Law: λ = ½

Objections to Lidstone’s Law
• Need an a priori way to determine λ.
• Predicts all unseen events to be equally likely
• Gives probability estimates linear in the M.L.E. frequency
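Lidstone's Law adds λ to every count and renormalises, giving (C(x) + λ) / (N + Vλ). A minimal sketch follows, with the function name `lidstone` and the four-item toy vocabulary as assumptions.

```python
from collections import Counter

def lidstone(counts, vocab_size, lam=0.5):
    """Return prob(x) = (C(x) + lam) / (N + V * lam)."""
    n = sum(counts.values())

    def prob(x):
        return (counts[x] + lam) / (n + vocab_size * lam)

    return prob

counts = Counter("a a a b".split())          # N = 4 observations
vocab = ["a", "b", "c", "d"]                 # V = 4 possible events
p = lidstone(counts, vocab_size=len(vocab), lam=1.0)   # lam = 1: add-one smoothing
print([p(x) for x in vocab])
print(sum(p(x) for x in vocab))
```

With λ = 1, the two unseen events "c" and "d" together receive a quarter of the probability mass, and both receive exactly the same amount.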
Main intuition
• A zero-frequency event can be thought of as an event which hasn’t happened (yet).
• The probability of it happening can be estimated from the probability of something happening for the first time (i.e. the hapaxes in our corpus).
• The count of things which are seen only once can be used to estimate the count of things that are never seen.
Witten-Bell method
• T = no. of times we saw an event for the first time = no. of different n-gram types (bins)
• NB: T is the no. of types actually attested (unlike V, the no. of possible types in our previous estimations)
• Estimate total probability mass of unseen n-grams: $\sum_{x: C(x)=0} P^*(x) = \frac{T}{N + T}$, where the denominator is the no. of actual n-grams (N) + the no. of actual types (T)
• Basically, MLE of the probability of a new type event occurring (“being seen for the first time”)
• This is the total probability mass to be distributed among all zero events (unseens)

Witten-Bell method
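• Divide the total probability mass among all the zero n-grams. Can distribute it equally: with Z the no. of n-grams with zero count, each gets $P^*(x) = \frac{T}{Z(N+T)}$
• Remove this probability mass from the non-zero n-grams (discounting): $P^*(x) = \frac{C(x)}{N+T}$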
• If we work with unigrams, Witten-Bell and Add-one smoothing give very similar results.
• The difference is with n-grams for n>1.
• Main idea: estimate probability of an unseen bigram <w1,w2> from the probability of seeing a bigram starting with w1 for the first time.
Witten-Bell with bigrams
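• Generalised total probability mass estimate: $\sum_{w_2: C(w_1 w_2)=0} P^*(w_2 \mid w_1) = \frac{T(w_1)}{N(w_1) + T(w_1)}$
• $T(w_1)$ = no. of bigram types beginning with $w_1$
• $N(w_1)$ = no. of bigram tokens beginning with $w_1$
• The left-hand side is the estimated total probability of unseen bigrams starting with $w_1$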

Witten-Bell with bigrams
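• Non-zero bigrams get discounted as before, but again conditioning on history: $P^*(w_2 \mid w_1) = \frac{C(w_1 w_2)}{N(w_1) + T(w_1)}$, where $N(w_1)$ and $T(w_1)$ are the no. of bigram tokens and bigram types beginning with $w_1$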
• Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams.
• The amount assigned will depend on the first word in the bigram (first n-1 words in the n-gram).
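A sketch of Witten-Bell smoothing for bigrams is given below. For a history w1 with N(w1) bigram tokens, T(w1) bigram types and Z(w1) unseen continuations, seen bigrams get C(w1 w2) / (N(w1) + T(w1)) and each unseen bigram gets an equal share T(w1) / (Z(w1) (N(w1) + T(w1))). The function name, the closed toy vocabulary and the uniform fallback for an unseen history are assumptions for the example.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens, vocab):
    """Return prob(w2, w1) with Witten-Bell smoothing over a closed vocabulary."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)   # w1 -> set of distinct words seen after w1
    n_tokens = Counter()           # N(w1): bigram tokens beginning with w1
    for (w1, w2), c in bigrams.items():
        followers[w1].add(w2)
        n_tokens[w1] += c

    def prob(w2, w1):
        t = len(followers[w1])     # T(w1): bigram types beginning with w1
        n = n_tokens[w1]
        if n == 0:
            return 1 / len(vocab)  # assumption: unseen history, fall back to uniform
        if bigrams[(w1, w2)] > 0:
            return bigrams[(w1, w2)] / (n + t)      # discounted seen bigram
        z = len(vocab) - t         # Z(w1): unseen continuations of w1
        return t / (z * (n + t))   # equal share of the reserved mass

    return prob

vocab = ["a", "b", "c", "d"]
tokens = "a b a c a b".split()
p = witten_bell_bigram(tokens, vocab)
print([round(p(w, "a"), 3) for w in vocab])
```

Because T(w1) differs from history to history, unseen bigrams after a history with many different continuations receive more mass than unseen bigrams after a history that is almost always followed by the same word.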
Good-Turing method
• Introduced by Good (1953), but partly attributed to Alan Turing
• work carried out at Bletchley Park during WWII
• “Simple Good-Turing” method (Gale and Sampson 1995)
• Main idea:
• re-estimate amount of probability mass assigned to low-frequency or zero n-grams based on the number of n-grams (types) with higher frequencies
Rationale
• Given:
• sample frequency of a type (n-gram, aka bin, aka equivalence class)
• GT provides:
• an estimate of the true population frequency of a type
• an estimate of the total probability of unseen types in the population.
Ingredients
• the sample frequency $C(x)$ of an n-gram x in a corpus of size N with vocabulary size V
• the no. of n-gram types with frequency C, $T_C$
• $C^*(x)$: the estimated true population frequency of an n-gram x with sample frequency $C(x)$
• N.B. in a perfect sample, $C(x) = C^*(x)$
• in real life, $C^*(x) < C(x)$ (i.e. the sample overestimates the true frequency)
Some background
• Suppose:
• we treat each occurrence of an n-gram as a Bernoulli trial: either the n-gram is x or not
• i.e. a binomial assumption
• Then, we could calculate the expected no. of types with frequency C, $E(T_C)$
• = the expected frequency of frequency

$E(T_C) = \sum_i \binom{N}{C} p_i^C (1 - p_i)^{N-C}$

where:

• $T_C$ = no. of n-gram types with frequency C
• N = total no. of n-grams
• $p_i$ = the (unknown) population probability of the i-th n-gram type

Background continued
• Given an estimate of $E(T_C)$, we could then calculate $C^*$
• Fundamental underlying theorem: $C^* = (C+1)\frac{E(T_{C+1})}{E(T_C)}$
• Note: this makes the estimated (“true”) frequency C* a function of the expected number of types with frequency C+1. Like Witten-Bell, it makes the adjusted count of zero-frequency events dependent on events of frequency 1.
Background continued
• We can use the above to calculate adjusted frequencies directly.
• Often, though, we want to calculate the total “missing probability mass” for zero-count n-grams (the unseens): $P^*_{GT}(\text{unseen}) = \frac{T_1}{N}$

Where:

• $T_1$ is the number of types with frequency 1
• N is the total number of items seen in training
• From: Jurafsky & Martin 2009
• Examples are bigram counts from two corpora.
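The adjusted counts C* = (C+1) T_{C+1} / T_C and the missing mass T_1 / N can be computed directly from raw frequencies of frequencies, as in the sketch below. The counts are invented for illustration, and keeping the raw count where T_{C+1} = 0 is a simplifying assumption of this example, not part of the Good-Turing method itself.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted_counts, missing_mass) from raw n-gram counts.

    Uses the observed frequencies of frequencies T_C in place of E(T_C).
    """
    n = sum(counts.values())
    freq_of_freq = Counter(counts.values())      # T_C for each frequency C
    adjusted = {}
    for x, c in counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[x] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # Assumption: where T_{C+1} = 0, leave the count unchanged.
            adjusted[x] = float(c)
    missing_mass = freq_of_freq[1] / n           # T_1 / N
    return adjusted, missing_mass

# Hypothetical n-gram counts: three hapaxes, two items seen twice, one seen 3 times.
counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3})
adjusted, missing_mass = good_turing(counts)
print(adjusted["a"], adjusted["d"], adjusted["f"])
print(missing_mass)
```

In this toy sample the hapaxes are adjusted upwards and the items seen twice downwards, which shows how noisy the raw frequencies of frequencies are on small samples.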
A little problem
• The GT theorem assumes that we know the expected population count of types!
• We’ve assumed that we get this from a corpus, but this, of course, is not the case.
• Secondly, $T_{C+1}$ will often be zero! For example, it’s quite possible to find several n-grams with frequency 100, and no n-grams with frequency 101!
• Note that this is more typical for high frequencies, than low ones.
Low frequencies and gaps
• Low C: linear trend.
• Higher C: angular discontinuity.
• Frequencies in corpus display “jumps” and so do frequencies of frequencies.
• This implies the presence of gaps at higher frequencies.

[Figure: log10 frequency of frequency ($T_C$) plotted against log10 frequency (C); after Gale and Sampson 1995.]

Possible solution
• Use Good-Turing for n-grams with corpus frequency less than some constant k (typically, k = 5).
• Low-frequency types are numerous, so GT is reliable.
• High-frequency types are assumed to be near the “truth”.
• To avoid gaps (where $T_{C+1} = 0$), empirically estimate a function S(C) that acts as a proxy for $E(T_C)$
Proxy function for gaps
• For any sample frequency C, let: $S(C) = \frac{2\,T_C}{C'' - C'}$
• where:
• C’’ is the next highest non-zero frequency
• C’ is the previous non-zero frequency

[Figure: log10 S(C) plotted against log10 frequency; after Gale and Sampson 1995.]
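The proxy values can be computed by dividing each frequency of frequency by half the distance between its neighbouring non-zero frequencies, i.e. S(C) = 2 T_C / (C'' - C'). In the sketch below, the treatment of the two end points (C' = 0 for the lowest frequency, C'' = 2C - C' for the highest) is an assumption of this example, as are the function name `proxy_counts` and the toy data.

```python
def proxy_counts(freq_of_freq):
    """Return S(C) = 2 * T_C / (C'' - C') for every non-zero frequency C."""
    cs = sorted(c for c, t in freq_of_freq.items() if t > 0)
    proxy = {}
    for i, c in enumerate(cs):
        prev_c = cs[i - 1] if i > 0 else 0                       # C'
        next_c = cs[i + 1] if i + 1 < len(cs) else 2 * c - prev_c  # C''
        proxy[c] = 2 * freq_of_freq[c] / (next_c - prev_c)
    return proxy

# Hypothetical frequencies of frequencies with gaps at C = 4 and C = 6..9.
freq_of_freq = {1: 120, 2: 40, 3: 24, 5: 10, 10: 2}
proxy = proxy_counts(freq_of_freq)
for c in sorted(proxy):
    print(c, round(proxy[c], 3))
```

Where frequencies are adjacent (C = 1, 2) the proxy equals the raw value; where there are gaps, the value is spread over the gap. A line fitted to log S(C) against log C can then supply values for frequencies that were never observed.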

Gale and Sampson’s combined proposal
• For low frequencies (< k), use the standard equation, assuming $E(T_C) = T_C$
• If we have gaps (i.e. $T_C = 0$), we use our proxy function for $T_C$, obtained through linear regression to fit the log-log curve
• And for high frequencies, we can assume that C* = C
• Finally, estimate the probability of an n-gram: $P_{GT}(x) = \frac{C^*(x)}{N}$
GT Estimation: Final step
• GT gives approximations to probabilities.
• Re-estimated probabilities of n-grams won’t sum to 1
• necessary to re-normalise
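• Gale/Sampson 1995: for seen n-grams x, $P(x) = \left(1 - \frac{T_1}{N}\right) \frac{P_{GT}(x)}{\sum_{x': C(x') > 0} P_{GT}(x')}$, where $P_{GT}$ is the unnormalised GT estimate, so that seen and unseen probabilities together sum to 1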
A final word on GT smoothing
• In practice, GT is very seldom used on its own.
• Most frequently, we use GT with backoff, about which, more later...
Held-out estimation: General idea
• “hold back” some training data
• create our language model
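• compare, for each n-gram ($w_1 \ldots w_n$):
• $C_t$: estimated frequency of the n-gram based on training data
• $C_h$: frequency of the n-gram in the held-out data
Held-out estimation
• Define $Tot_C$ as:
• total no. of times that n-grams with frequency C in the training corpus actually occurred in the held-out data
• Re-estimate the probability: $P_{ho}(w_1 \ldots w_n) = \frac{Tot_C}{T_C \cdot N_h}$, where $C = C_t(w_1 \ldots w_n)$, $T_C$ is the no. of n-gram types with training frequency C, and $N_h$ is the no. of n-grams in the held-out data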
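Held-out estimation can be sketched for bigrams as follows: for each training frequency C, the held-out occurrences of all n-grams with that training frequency are pooled and divided by the number of such types times the held-out size. The function name `held_out_probs` and the toy data are assumptions; the case C = 0 is left out because it needs the number of possible but unseen n-grams.

```python
from collections import Counter

def held_out_probs(train_tokens, heldout_tokens, n=2):
    """Map each training frequency C >= 1 to a held-out probability estimate."""
    def grams(tokens):
        return list(zip(*(tokens[i:] for i in range(n))))

    train_counts = Counter(grams(train_tokens))
    held_counts = Counter(grams(heldout_tokens))
    n_held = sum(held_counts.values())            # n-grams in held-out data
    types_with = Counter(train_counts.values())   # number of types per frequency C
    tot = Counter()                               # pooled held-out occurrences per C
    for gram, c in train_counts.items():
        tot[c] += held_counts[gram]
    return {c: tot[c] / (types_with[c] * n_held) for c in types_with}

train_tokens = "a b a b a c".split()
heldout_tokens = "a b a c a c".split()
probs = held_out_probs(train_tokens, heldout_tokens)
print(probs)
```

In this toy example the bigram seen once in training turns out to be more frequent in the held-out data than the bigrams seen twice, which is exactly the kind of correction held-out estimation is meant to capture.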
Cross-validation
• Problem with held-out estimation:
• our training set is smaller
• Way around this:
• divide training data into training + validation data (roughly equal sizes)
• use each half first as training then as validation (i.e. train twice)
• take a mean
Cross-Validation (a.k.a. deleted estimation)
• Use training and validation data
• Split training data into two parts, A and B
• Model 1: train on A, validate on B
• Model 2: train on B, validate on A
• Final Model: combine Model 1 and Model 2

Cross-Validation

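Combined estimate (arithmetic mean): $P(x) = \frac{1}{2}\left[P_1(x) + P_2(x)\right]$, where $P_1$ and $P_2$ are the estimates from Model 1 and Model 2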

The rationale
• We would like to balance between reliability and discrimination:
• use trigram where useful
• otherwise back off to bigram, unigram
• How can you develop a model to utilize different length n-grams as appropriate?
Interpolation vs. Backoff
• Interpolation: compute probability of an n-gram as a function of:
• The n-gram itself
• All lower-order n-grams
• Probabilities are linearly interpolated.
• Lower-order n-grams are always used.
• Backoff:
• If n-gram exists in model, use that
• Else fall back to lower order n-grams
Simple interpolation: trigram example
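• Combine all estimates, weighted by a factor: $\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
• All parameters should sum to 1: $\sum_i \lambda_i = 1$
• NB: we have different interpolation parameters for the various n-gram sizes.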
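Simple linear interpolation of trigram, bigram and unigram MLE estimates can be sketched as below. The weights (0.6, 0.3, 0.1) are arbitrary placeholders; in practice they would be tuned on held-out data. The function name and toy sentence are assumptions for the example.

```python
from collections import Counter

def interpolated_trigram(tokens, lambdas=(0.6, 0.3, 0.1)):
    """Return prob(w3, w1, w2) mixing trigram, bigram and unigram MLE estimates."""
    assert abs(sum(lambdas) - 1.0) < 1e-9, "interpolation weights must sum to 1"
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)

    def ratio(num, den):
        return num / den if den else 0.0   # an unseen history contributes 0

    def prob(w3, w1, w2):
        lam_tri, lam_bi, lam_uni = lambdas
        p_tri = ratio(tri[(w1, w2, w3)], bi[(w1, w2)])
        p_bi = ratio(bi[(w2, w3)], uni[w2])
        p_uni = ratio(uni[w3], n)
        return lam_tri * p_tri + lam_bi * p_bi + lam_uni * p_uni

    return prob

tokens = "the dog barked and the dog slept".split()
p = interpolated_trigram(tokens)
print(p("barked", "the", "dog"))
print(p("slept", "the", "puppy"))   # unseen history: only the unigram term is non-zero
```

Unlike backoff, the lower-order estimates contribute to every probability, including those of trigrams that were seen in training.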
More sophisticated version
• Suppose we have the trigrams:
• (the dog barked)
• (the puppy barked)
• Suppose (the dog) occurs several times in our corpus, but not (the puppy)
• In our interpolation, we might want to weight trigrams of the form (the dog _) more than (the puppy _) (because the former is composed of a more reliable bigram)
• Rather than using the same parameter for all trigrams, we could condition on the initial bigram.
Sophisticated interpolation: trigram example
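• Combine all estimates, weighted by factors that depend on the context: $\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2} w_{n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2} w_{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2} w_{n-1}) P(w_n)$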
Where do parameters come from?
• Typically:
• We estimate counts from training data.
• We estimate parameters from held-out data.
• The lambdas are chosen so that they maximise the likelihood on the held-out data.
• Often, the expectation maximisation (EM) algorithm is used to discover the right values to plug into the equations.
• (more on this later)
Backoff
• Recall that backoff models only use lower order n-grams when the higher order one is unavailable.
• Best known model by Katz (1987).
• Uses backoff with smoothed probabilities
• Smoothed probabilities obtained using Good-Turing estimation.
Backoff: trigram example
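• Backoff estimate: $P_{katz}(w_3 \mid w_1 w_2) = P^*(w_3 \mid w_1 w_2)$ if $C(w_1 w_2 w_3) > 0$; otherwise $P_{katz}(w_3 \mid w_1 w_2) = \alpha(w_1 w_2)\, P_{katz}(w_3 \mid w_2)$
• That is:
• If the trigram has count > 0, we use the smoothed (P*) estimate
• If not, we recursively back off to lower orders, weighting by a parameter (alpha)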
Backoff vs. Simple smoothing
• With Good-Turing smoothing, we typically end up with the “leftover” probability mass that is distributed equally among the unseens.
• So GT tells us how much leftover probability there is.
• Backoff gives us a better way of distributing this mass among unseen trigrams, by relying on the counts of their component bigrams and unigrams.
• So backoff tells us how to divide that leftover probability.
Why we need those alphas
• If we rely on true probabilities, then for a given history (n-gram window), the probabilities of all possible next words sum to 1: $\sum_{w_3} P(w_3 \mid w_1 w_2) = 1$
• But if we back off to a lower-order model when the trigram probability is 0, we’re adding extra probability mass, and the sum will now exceed 1.
• We therefore need:
• P* to discount the original MLE estimate (P)
• Alpha parameters to ensure that the probability from the lower-order n-grams sums up to exactly the amount we discounted in P*.
Computing the alphas -- I
• Recall: we have C(w1w2w3) = 0
• Let $\beta(w_1 w_2)$ represent the amount of probability left over when we discount the (seen) trigrams beginning with $w_1 w_2$: $\beta(w_1 w_2) = 1 - \sum_{w_3: C(w_1 w_2 w_3) > 0} P^*(w_3 \mid w_1 w_2)$

The sum of the MLE probabilities P over the seen trigrams $w_1 w_2 w_3$ (for a fixed history $w_1 w_2$) is 1. The smoothed probabilities P* sum to less than 1. We’re taking the remainder.

Computing the alphas -- II
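• We now compute alpha: $\alpha(w_1 w_2) = \frac{\beta(w_1 w_2)}{\sum_{w_3: C(w_1 w_2 w_3) = 0} P_{katz}(w_3 \mid w_2)}$

The denominator sums over all unseen trigrams involving our bigram. We distribute the remaining mass $\beta(w_1 w_2)$ over all those trigrams.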
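The role of the left-over mass and the alpha weights can be checked on a small example. To keep it short, the sketch below backs off from bigrams to unigrams rather than from trigrams, and it uses absolute discounting (subtracting a fixed 0.5 from each seen count) as a stand-in for Good-Turing discounting. Both simplifications, the function name and the toy corpus are assumptions of this example and not Katz's exact method.

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Build a Katz-style backoff model from bigrams to unigrams."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])      # counts of words used as histories
    vocab = sorted(uni)
    n = len(tokens)

    def p_uni(w):
        return uni[w] / n

    def p_star(w, h):
        # Discounted estimate for a seen bigram (absolute discounting).
        return (bi[(h, w)] - discount) / hist[h]

    def prob(w, h):
        if hist[h] == 0:
            return p_uni(w)          # unseen history: use the unigram directly
        if bi[(h, w)] > 0:
            return p_star(w, h)
        # Left-over mass after discounting the seen bigrams of this history.
        beta = 1.0 - sum(p_star(v, h) for v in vocab if bi[(h, v)] > 0)
        # Unigram mass of the words that were never seen after h.
        unseen_mass = sum(p_uni(v) for v in vocab if bi[(h, v)] == 0)
        if unseen_mass == 0:
            return 0.0
        alpha = beta / unseen_mass
        return alpha * p_uni(w)

    return prob, vocab

tokens = "the dog barked the cat slept the dog slept".split()
prob, vocab = backoff_bigram(tokens)
print(prob("dog", "the"), prob("slept", "the"))
print(sum(prob(w, "the") for w in vocab))
```

The final sum shows that the alpha weight redistributes exactly the discounted mass, so the conditional distribution for a given history still sums to 1.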