- By
**neal** - Follow User

- 104 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Corpora and Statistical Methods' - neal

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Corpora and Statistical Methods

### Part 2

Language modelling using N-Grams

Example task

- The word-prediction task (Shannon game)
- Given:
- a sequence of words (the history)
- a choice of next word
- Predict:
- the most likely next word
- Generalises easily to other problems, such as predicting the POS of unknowns based on history.

Applications of the Shannon game

- Automatic speech recognition (cf. tutorial 1):
- given a sequence of possible words, estimate its probability
- Context-sensitive spelling correction:
- Many spelling errors are real words
- He walked for miles in the dessert. (resp. desert)
- Identifying such errors requires a global estimate of the probability of a sentence.

Applications of N-gram models generally

- POS Tagging:
- predict the POS of an unknown word by looking at its history
- Statistical parsing:
- e.g. predict the group of words that together form a phrase
- Statistical NL Generation:
- given a semantic form to be realised as text, and several possible realisations, select the most probable one.

A real-world example: Google’s did you mean

- Google uses an n-gram model (based on sequences of characters, not words).
- In this case, the sequence apple desserts is much more probable than apple deserts

How it works

- Documents provided by the search engine are added to:
- An index (for fast retrieval)
- A language model (based on probability of a sequence of characters)
- A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”).
- Outcome is a context-sensitive spelling correction:
- “apple deserts” “apple desserts”
- “frodbaggins” “frodobaggins”
- “frod” “ford”

The noisy channel model

- After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall p. 198

The Markov Assumption

- Markov models:
- probabilistic models which predict the likelihood of a future unit based on limited history
- in language modelling, this pans out as the local history assumption:
- the probability of wn depends on a limited number of prior words
- utility of the assumption:
- we can rely on a small n for our n-gram models (bigram, trigram)
- long n-grams become exceedingly sparse
- Probabilities become very small with long sequences

The structure of an n-gram model

- The task can be re-stated in conditional probabilistic terms:
- Limiting n under the Markov Assumption means:
- greater chance of finding more than one occurrence of the sequence w1…wn-1
- more robust statistical estimations
- N-grams are essentially equivalence classes or bins
- every unique n-gram is a type or bin

Structure of n-gram models (II)

- If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model
- Note terminology:
- n-gram model = (n-1)th order Markov Model

Methodological considerations

- We are often concerned with:
- building an n-gram model
- evaluating it
- We therefore make a distinction between training and test data
- You never test on your training data
- If you do, you’re bound to get good results.
- N-gram models tend to beovertrained, i.e.: if you train on a corpus C, your model will be biased towards expecting the kinds of events in C.
- Another term for this: overfitting

Dividing the data

- Given: a corpus of n units (words, sentences, … depending on the task)
- A large proportion of the corpus is reserved for training.
- A smaller proportion for testing/evaluation (normally 5-10%)

Held-out (validation) data

- Held-out estimation:
- during training, we sometimes estimate parameters for our model empirically
- commonly used in smoothing (how much probability space do we want to set aside for unseen data)?
- therefore, the training set is often split further into training data and validation data
- normally, held-out data is 10% of the size of the training data

Development data

- A common approach:
- train an algorithm on training data

a. (estimate further parameters on held-out data if required)

- evaluate it
- re-tune it
- go back to Step 1 until no further finetuning necessary
- Carry out final evaluation
- For this purpose, it’s useful to have:
- training data for step 1
- development set for steps 2-4
- final test set for step 5

Significance testing

- Often, we compare the performance of our algorithm against some baseline.
- A single, raw performance score won’t tell us much. We need to test for significance (e.g. using t-test).
- Typical method:
- Split test set into several small test sets, e.g. 20 samples
- evaluation carried out separately on each
- mean and variance estimated based on 20 different samples
- test for significant difference between algorithm and a predefined baseline

Size of n-gram models

- In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
- For a bigram model: N2 possible n-grams in principle
- For a trigram model: N3 possible n-grams.
- …

Size (continued)

- Each n-gram in our model is a parameter used to estimate probability of the next possible word.
- too many parameters make the model unwieldy
- too many parameters lead to data sparseness: most of them will have f = 0 or 1
- Most models stick to unigrams, bigrams or trigrams.
- estimation can also combine different order models

Further considerations

- When building a model, we tend to take into account the start-of-sentence symbol:
- the girl swallowed a large green caterpillar
- <s> the
- the girl
- …
- Also typical to map all tokens w such that count(w) < k to <UNK>:
- usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”
- this reduces the parameter space

Maximum Likelihood Estimation Approach

- Basic equation:
- In a unigram model, this reduces to simple probability.
- MLE models estimate probability using relative frequency.

Limitations of MLE

- MLE builds the model that maximises the probability of the training data.
- Unseen events in the training data are assigned zero probability.
- Since n-gram models tend to be sparse, this is a real problem.
- Consequences:
- seen events are given more probability mass than they have
- unseen events are given zero mass

Seen/unseen

A

A’

Probability mass

of events not in

training data

The problem with MLE is that it distributes A’ among members of A.

Probability mass of events in

training data

The solution

- Solution is to correct MLE estimation using a smoothing technique.
- More on this in the next part
- But cf. Tutorial 1, which introduced the simplest method of smoothing known.

Adequacy of different order models

- Manning/Schutze `99 report results for n-gram models of a corpus of the novels of Austen.
- Task: use n-gram model to predict the probability of a sentence in the test data.
- Models:
- unigram: essentially zero-context markov model, uses only the probability of individual words
- bigram
- trigram
- 4-gram

Example test case

- Training Corpus: five Jane Austen novels
- Corpus size = 617,091 words
- Vocabulary size = 14,585 unique types
- Task: predict the next word of the trigram
- “inferior to ________”
- from test data, Persuasion:
- “[In person, she was] inferior to both [sisters.]”

Adequacy of unigrams

- Problems with unigram models:
- not entirely hopeless because most sentences contain a majority of highly common words
- ignores syntax completely:
- P(In person she was inferior) = P(inferior was she person in)

Adequacy of bigrams

- Bigrams:
- improve situation dramatically
- some unexpected results:
- p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person

Adequacy of trigrams

- Trigram models will do brilliantly when they’re useful.
- They capture a surprising amount of contextual variation in text.
- Biggest limitation:
- most new trigrams in test data will not have been seen in training data.
- Problem carries over to 4-grams, and is much worse!

Reliability vs. Discrimination

- larger n: more information about the context of the specific instance (greater discrimination)
- smaller n: more instances in training data, better statistical estimates (more reliability)

Backing off

- Possible way of striking a balance between reliability and discrimination:
- backoff model:
- where possible, use a trigram
- if trigram is unseen, try and “back off” to a bigram model
- if bigrams are unseen, try and “back off” to a unigram

Perplexity

- Recall: Entropy is a measure of uncertainty:
- high entropy = high uncertainty
- perplexity:
- if I’ve trained on a sample, how surprised am I when exposed to a new sample?
- a measure of uncertainty of a model on new data

Entropy as “expected value”

- One way to think of the summation part is as a weighted average of the information content.
- We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.

Comparing distributions

- We have a language model built from a sample. The sample is a probability distribution q over n-grams.
- q(x) = the probability of some n-gram x in our model.
- The sample is generated from a true population (“the language”) with probability distribution p.
- p(x) = the probability of x in the true distribution

Evaluating a language model

- We’d like an estimate of how good our model is as a model of the language
- i.e. we’d like to compare q to p
- We don’t have access to p. (Hence, can’t use KL-Divergence)
- Instead, we use our test data as an estimate of p.

Cross-entropy: basic intuition

- Measure the number of bits needed to identify an event coming from p, if we code it according to q:
- We draw sequences according to p;
- but we sum the log of their probability according to q.
- This estimate is called cross-entropy H(p,q)

Cross-entropy: p vs. q

- Cross-entropy is an upper bound on the entropy of the true distribution p:
- H(p) ≤ H(p,q)
- if our model distribution (q) is good, H(p,q) ≈ H(p)
- We estimate cross-entropy based on our test data.
- Gives an estimate of the distance of our language model from the distribution in the test sample.

Perplexity

- The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p is:
- A perplexity value of k (obtained on a test set) tells us:
- our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.
- The lower the perplexity, the better the language model (the lower the surprise on our test data).

Perplexity example (Jurafsky & Martin, 2000, p. 228)

- Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal)
- applied smoothing
- 38 million words
- Vocab of 19,979 (low-frequency words mapped to UNK).
- Computed perplexity on a test set of 1.5 million words.

J&M’s results

- Trigrams do best of all.
- Value suggests the extent to which the model can fit the data in the test set.
- Note: with unigrams, the model has to make lots of guesses!

Summary

- Main point about Markov-based language models:
- data sparseness is always a problem
- smoothing techniques are required to estimate probability of unseen events
- Next part discusses more refined smoothing techniques than those seen so far.

Smoothing (aka discounting) techniques

Overview…

- Smoothing methods:
- Simple smoothing
- Witten-Bell & Good-Turing estimation
- Held-out estimation and cross-validation
- Combining several n-gram models:
- back-off models

Rationale behind smoothing

- Sample frequencies
- seen events with probability P
- unseen events (including “grammatical” zeroes”) with probability 0

- Real population frequencies
- seen events
- (including the unseen events in our sample)

+ smoothing

to approximate

results in

Lower probabilities for seen events (discounting). Left over probability mass distributed over unseens (smoothing).

Generalisation: Lidstone’s Law

- P = probability of specific n-gram
- C(x) = count of n-gram x in training data
- N = total n-grams in training data
- V = number of “bins” (possible n-grams)
- = small positive number

M.L.E: = 0LaPlace’s Law: = 1 (add-one smoothing)Jeffreys-Perks Law: = ½

Objections to Lidstone’s Law

- Need an a priori way to determine .
- Predicts all unseen events to be equally likely
- Gives probability estimates linear in the M.L.E. frequency

Main intuition

- A zero-frequency event can be thought of as an event which hasn’t happened (yet).
- The probability of it happening can be estimated from the probability of sth happening for the first time (i.e. the hapaxes in our corpus).
- The count of things which are seen only once can be used to estimate the count of things that are never seen.

Witten-Bell method

- T = no. of times we saw an event for the first time.

= no of different n-gram types (bins)

NB: T is no. of types actually attested (unlike V, the no of possible in our previous estimations)

- Estimate total probability mass of unseen n-grams:
- Basically, MLE of the probability of a new type event occurring (“being seen for the first time”)
- This is the total probability mass to be distributed among all zero events (unseens)

no of actual n-grams (N) + no of actual types (T)

Witten-Bell method

- Divide the total probability mass among all the zero n-grams. Can distribute it equally.
- Remove this probability mass from the non-zero n-grams (discounting):

Witten-Bell vs. Add-one

- If we work with unigrams, Witten-Bell and Add-one smoothing give very similar results.
- The difference is with n-grams for n>1.
- Main idea: estimate probability of an unseen bigram <w1,w2> from the probability of seeing a bigram starting with w1 for the first time.

Witten-Bell with bigrams

- Generalised total probability mass estimate:

No. bigram types beginning with w1

No. bigram tokens beginning with w1

Estimated total probability of bigrams starting with w1

Witten-Bell with bigrams

- Non-zero bigrams get discounted as before, but again conditioning on history:
- Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams.
- The amount assigned will depend on the first word in the bigram (first n-1 words in the n-gram).

Good-Turing method

- Introduced by Good (1953), but partly attributed to Alan Turing
- work carried out at Bletchley Park during WWII
- “Simple Good-Turing” method (Gale and Sampson 1995)
- Main idea:
- re-estimate amount of probability mass assigned to low-frequency or zero n-grams based on the number of n-grams (types) with higher frequencies

Rationale

- Given:
- sample frequency of a type (n-gram, aka bin, aka equivalence class)
- GT provides:
- an estimate of the true population frequency of a type
- an estimate of the total probability of unseen types in the population.

Ingredients

- the sample frequency C(x)of an n-gram x in a corpus of size N with vocabulary size V
- the no. of n-gram types with frequency C, Tc
- C*(x): the estimated true population frequency of an n-gram x with sample frequency C(x)
- N.B. in a perfect sample, C(x) = C*(x)
- in real life, C*(x) < C(x)(i.e. sample overestimates the true frequency)

Some background

- Suppose:
- we had access to the true population probability of our n-grams
- we treat each occurrence of an n-gramas a Bernoulli trial: either the n-gram is xor not
- i.e. a binomial assumption
- Then, we could calculate the expected no of types with frequency C, Tc
- = the expected frequency of frequency

where:

TC= no. of n-gram types with frequency C

N = total no. of n-grams

Background continued

- Given an estimate of E(TC), we could then calculate C*
- Fundamental underlying theorem:
- Note: this makes the estimated (“true”) frequency C* a function of the expected number of types with frequency C+1. Like Witten-bell, it makes the adjusted count of zero-frequency events dependent on events of frequency 1.

Background continued

- We can use the above to calculate adjusted frequencies directly.
- Often, though, we want to calculate the total “missing probability mass” for zero-count n-grams (the unseens):

Where:

- T1 is the number of types with frequency 1
- N is the total number of items seen in training

Example of readjusted counts

- From: Jurafsky & Martin 2009
- Examples are bigram counts from two corpora.

A little problem

- The GT theorem assumes that we know the expected population count of types!
- We’ve assumed that we get this from a corpus, but this, of course, is not the case.
- Secondly, TC+1 will often be zero! For example, it’s quite possible to find several n-grams with frequency 100, and no n-grams with frequency 101!
- Note that this is more typical for high frequencies, than low ones.

Low frequencies and gaps

- Low C:linear trend.
- Higher C:angular discontinuity.
- Frequencies in corpus display “jumps” and so do frequencies of frequencies.
- This implies the presence of gaps at higher frequencies.

TC: log10 frequency of frequency

C: log10 frequency

(after Gale and Sampson 1995)

Possible solution

- Use Good-Turing for n-grams with corpus frequency less than some constant k(typically, k = 5).
- Low-frequency types are numerous, so GT is reliable.
- High-frequency types are assumed to be near the “truth”.
- To avoid gaps (where Tc+1 = 0), empirically estimate a function S(C) that acts as a proxy for E(TC)

Proxy function for gaps

- For any sample C, let:
- where:
- C’’is the next highest non-zero frequency
- C’ is the previous non-zero frequency

log10 SC

log10 frequency

(after Gale and Sampson 1995)

Gale and Sampson’s combined proposal

- For low frequencies (< k), use standard equation, assuming E(TC) = TC
- If we have gaps (i.e. TC =0), we use our proxy function for TC. Obtained through linear regression to fit the log-log curve
- And for high frequencies, we can assume that C* = C
- Finally, estimate probability of n-gram:

GT Estimation: Final step

- GT gives approximations to probabilities.
- Re-estimated probabilities of n-grams won’t sum to 1
- necessary to re-normalise
- Gale/Sampson 1995:

A final word on GT smoothing

- In practice, GT is very seldom used on its own.
- Most frequently, we use GT with backoff, about which, more later...

Held-out estimation: General idea

- “hold back” some training data
- create our language model
- compare, for each n-gram (w1…wn):
- Ct: estimated frequency of the n-gram based on training data
- Ch: frequency of the n-gram in the held-out data

Held-out estimation

- Define TotCas:
- total no. of times that n-grams with frequency C in the training corpus actually occurred in the held-out data
- Re-estimate the probability:

Cross-validation

- Problem with held-out estimation:
- our training set is smaller
- Way around this:
- divide training data into training + validation data (roughly equal sizes)
- use each half first as training then as validation (i.e. train twice)
- take a mean

Cross-Validation(a.k.a. deleted estimation)

- Use training and validation data

Split training data:

A

B

train on A, validate on B

train

validate

Model 1

train on B, validate on A

validate

train

Model 2

+

Model 2

combine model 1 & 2

Model 1

Final Model

Cross-Validation

Combined estimate (arithmetic mean):

The rationale

- We would like to balance between reliability and discrimination:
- use trigram where useful
- otherwise back off to bigram, unigram
- How can you develop a model to utilize different length n-grams as appropriate?

Interpolation vs. Backoff

- Interpolation: compute probability of an n-gram as a function of:
- The n-gram itself
- All lower-order n-grams
- Probabilities are linearly interpolated.
- Lower-order n-grams are always used.
- Backoff:
- If n-gram exists in model, use that
- Else fall back to lower order n-grams

Simple interpolation: trigram example

- Combine all estimates, weighted by a factor.
- All parametersshould sum to 1:
- NB: we have different interpolation parameters for the various n-gram sizes.

More sophisticated version

- Suppose we have the trigrams:
- (the dog barked)
- (the puppy barked)
- Suppose (the dog) occurs several times in our corpus, but not (the puppy)
- In our interpolation, we might want to weight trigrams of the form (the dog _) more than (the puppy _) (because the former is composed of a more reliable bigram)
- Rather than using the same parameter for all trigrams, we could condition on the initial bigram.

Sophisticated interpolation: trigram example

- Combine all estimates, weighted by factors that depend on the context.

Where do parameters come from?

- Typically:
- We estimate counts from training data.
- We estimate parameters from held-out data.
- The lambdas are chosen so that they maximise the likelihood on the held-out data.
- Often, the expectation maximisation (EM) algorithm is used to discover the right values to plug into the equations.

Backoff

- Recall that backoff models only use lower order n-grams when the higher order one is unavailable.
- Best known model by Katz (1987).
- Uses backoff with smoothed probabilities
- Smoothed probabilities obtained using Good-Turing estimation.

Backoff: trigram example

- Backoff estimate:
- That is:
- If the trigram has count > 0, we use the smoothed (P*) estimate
- If not, we recursively back off to lower orders, interpolating with a parameter (alpha)

Backoff vs. Simple smoothing

- With Good-Turing smoothing, we typically end up with the “leftover” probability mass that is distributed equally among the unseens.
- So GF tells us how much leftover probability there is.
- Backoff gives us a better way of distributing this mass among unseen trigrams, by relying on the counts of their component bigrams and unigrams.
- So backoff tells us how to divide that leftover probability.

Why we need those alphas

- If we rely on true probabilities, then for a given word and a given n-gram window, the probability of the word sums to 1:
- But if we back off to lower-order model when the trigram probability is 0, we’re adding extra probability mass, and the sum will now exceed 1.
- We therefore need:
- P* to discount the original MLE estimate (P)
- Alpha parametersto ensure that the probability from the lower-order n-grams sums up to exactly the amount we discounted in P*.

Computing the alphas -- I

- Recall: we have C(w1w2w3) = 0
- Let ß(w1w2) represent the amount of probability left over when we discount (seen) trigrams containing w3

The sum of probabilities P for seen trigrams involving w3 (preceded by any two tokens) is 1. The smoothed probabilities P* sum to less than 1. We’re taking the remainder.

Computing the alphas -- II

- We now compute alpha:

The denominator sums over all unseen trigrams involving our bigram. We distribute the remaining mass ß(w1w2) overall all those trigrams.

What about unseen bigrams?

- So what happens if even (w1w2) in (w1w2w3) has count zero?
- I.e. we fall to an even lower order. Moreover:
- And:

Problems with Backing-Off

- Suppose (w2 w3) is common but trigram (w1 w2 w3) is unseen
- This may be a meaningful gap, rather than a gap due to chance and scarce data
- i.e., a “grammatical null”
- May not want to back-off to lower-order probability
- in this case, p = 0 is accurate!

References

- Gale, W.A., and Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2: 217-237

Download Presentation

Connecting to Server..