Corpora and Statistical Methods

Corpora and Statistical Methods Language modelling using N-Grams

Example task • The word-prediction task (Shannon game) • Given: • a sequence of words (the history) • a choice of next word • Predict: • the most likely next word • Generalises easily to other problems, such as predicting the POS of unknowns based on history.

Applications of the Shannon game • Automatic speech recognition (cf. tutorial 1): • given a sequence of possible words, estimate its probability • Context-sensitive spelling correction: • Many spelling errors are real words • He walked for miles in the dessert. (resp. desert) • Identifying such errors requires a global estimate of the probability of a sentence.

Applications of N-gram models generally • POS Tagging: • predict the POS of an unknown word by looking at its history • Statistical parsing: • e.g. predict the group of words that together form a phrase • Statistical NL Generation: • given a semantic form to be realised as text, and several possible realisations, select the most probable one.

A real-world example: Google’s did you mean • Google uses an n-gram model (based on sequences of characters, not words). • In this case, the sequence apple desserts is much more probable than apple deserts

How it works • Documents provided by the search engine are added to: • An index (for fast retrieval) • A language model (based on probability of a sequence of characters) • A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”). • Outcome is a context-sensitive spelling correction: • “apple deserts”  “apple desserts” • “frodbaggins”  “frodobaggins” • “frod”  “ford”

The noisy channel model • After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall p. 198

The assumptions behind n-gram models

The Markov Assumption • Markov models: • probabilistic models which predict the likelihood of a future unit based on limited history • in language modelling, this pans out as the local history assumption: • the probability of wn depends on a limited number of prior words • utility of the assumption: • we can rely on a small n for our n-gram models (bigram, trigram) • long n-grams become exceedingly sparse • Probabilities become very small with long sequences

The structure of an n-gram model • The task can be re-stated in conditional probabilistic terms: • Limiting n under the Markov Assumption means: • greater chance of finding more than one occurrence of the sequence w1…wn-1 • more robust statistical estimations • N-grams are essentially equivalence classes or bins • every unique n-gram is a type or bin

Structure of n-gram models (II) • If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model • Note terminology: • n-gram model = (n-1)th order Markov Model

Methodological considerations • We are often concerned with: • building an n-gram model • evaluating it • We therefore make a distinction between training and test data • You never test on your training data • If you do, you’re bound to get good results. • N-gram models tend to beovertrained, i.e.: if you train on a corpus C, your model will be biased towards expecting the kinds of events in C. • Another term for this: overfitting

Dividing the data • Given: a corpus of n units (words, sentences, … depending on the task) • A large proportion of the corpus is reserved for training. • A smaller proportion for testing/evaluation (normally 5-10%)

Held-out (validation) data • Held-out estimation: • during training, we sometimes estimate parameters for our model empirically • commonly used in smoothing (how much probability space do we want to set aside for unseen data)? • therefore, the training set is often split further into training data and validation data • normally, held-out data is 10% of the size of the training data

Development data • A common approach: • train an algorithm on training data a. (estimate further parameters on held-out data if required) • evaluate it • re-tune it • go back to Step 1 until no further finetuning necessary • Carry out final evaluation • For this purpose, it’s useful to have: • training data for step 1 • development set for steps 2-4 • final test set for step 5

Significance testing • Often, we compare the performance of our algorithm against some baseline. • A single, raw performance score won’t tell us much. We need to test for significance (e.g. using t-test). • Typical method: • Split test set into several small test sets, e.g. 20 samples • evaluation carried out separately on each • mean and variance estimated based on 20 different samples • test for significant difference between algorithm and a predefined baseline

Size of n-gram models • In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram. • For a bigram model: N2 possible n-grams in principle • For a trigram model: N3 possible n-grams. • …

Size (continued) • Each n-gram in our model is a parameter used to estimate probability of the next possible word. • too many parameters make the model unwieldy • too many parameters lead to data sparseness: most of them will have f = 0 or 1 • Most models stick to unigrams, bigrams or trigrams. • estimation can also combine different order models

Further considerations • When building a model, we tend to take into account the start-of-sentence symbol: • the girl swallowed a large green caterpillar • <s> the • the girl • … • Also typical to map all tokens w such that count(w) < k to <UNK>: • usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen” • this reduces the parameter space

Building models using Maximum Likelihood Estimation

Maximum Likelihood Estimation Approach • Basic equation: • In a unigram model, this reduces to simple probability. • MLE models estimate probability using relative frequency.

Limitations of MLE • MLE builds the model that maximises the probability of the training data. • Unseen events in the training data are assigned zero probability. • Since n-gram models tend to be sparse, this is a real problem. • Consequences: • seen events are given more probability mass than they have • unseen events are given zero mass

Seen/unseen A A’ Probability mass of events not in training data The problem with MLE is that it distributes A’ among members of A. Probability mass of events in training data

The solution • Solution is to correct MLE estimation using a smoothing technique. • More on this in the next part • But cf. Tutorial 1, which introduced the simplest method of smoothing known.

Adequacy of different order models • Manning/Schutze `99 report results for n-gram models of a corpus of the novels of Austen. • Task: use n-gram model to predict the probability of a sentence in the test data. • Models: • unigram: essentially zero-context markov model, uses only the probability of individual words • bigram • trigram • 4-gram

Example test case • Training Corpus: five Jane Austen novels • Corpus size = 617,091 words • Vocabulary size = 14,585 unique types • Task: predict the next word of the trigram • “inferior to ________” • from test data, Persuasion: • “[In person, she was] inferior to both [sisters.]”

Selecting an nVocabulary (V) = 20,000 words

Adequacy of unigrams • Problems with unigram models: • not entirely hopeless because most sentences contain a majority of highly common words • ignores syntax completely: • P(In person she was inferior) = P(inferior was she person in)

Adequacy of bigrams • Bigrams: • improve situation dramatically • some unexpected results: • p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person

Adequacy of trigrams • Trigram models will do brilliantly when they’re useful. • They capture a surprising amount of contextual variation in text. • Biggest limitation: • most new trigrams in test data will not have been seen in training data. • Problem carries over to 4-grams, and is much worse!

Reliability vs. Discrimination • larger n: more information about the context of the specific instance (greater discrimination) • smaller n: more instances in training data, better statistical estimates (more reliability)

Backing off • Possible way of striking a balance between reliability and discrimination: • backoff model: • where possible, use a trigram • if trigram is unseen, try and “back off” to a bigram model • if bigrams are unseen, try and “back off” to a unigram

Evaluating language models

Perplexity • Recall: Entropy is a measure of uncertainty: • high entropy = high uncertainty • perplexity: • if I’ve trained on a sample, how surprised am I when exposed to a new sample? • a measure of uncertainty of a model on new data

Entropy as “expected value” • One way to think of the summation part is as a weighted average of the information content. • We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.

Comparing distributions • We have a language model built from a sample. The sample is a probability distribution q over n-grams. • q(x) = the probability of some n-gram x in our model. • The sample is generated from a true population (“the language”) with probability distribution p. • p(x) = the probability of x in the true distribution

Evaluating a language model • We’d like an estimate of how good our model is as a model of the language • i.e. we’d like to compare q to p • We don’t have access to p. (Hence, can’t use KL-Divergence) • Instead, we use our test data as an estimate of p.

Cross-entropy: basic intuition • Measure the number of bits needed to identify an event coming from p, if we code it according to q: • We draw sequences according to p; • but we sum the log of their probability according to q. • This estimate is called cross-entropy H(p,q)

Cross-entropy: p vs. q • Cross-entropy is an upper bound on the entropy of the true distribution p: • H(p) ≤ H(p,q) • if our model distribution (q) is good, H(p,q) ≈ H(p) • We estimate cross-entropy based on our test data. • Gives an estimate of the distance of our language model from the distribution in the test sample.

Estimating cross-entropy Probability according to p (test set) Entropy according to q (language model)

Perplexity • The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p is: • A perplexity value of k (obtained on a test set) tells us: • our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data. • The lower the perplexity, the better the language model (the lower the surprise on our test data).

Perplexity example (Jurafsky & Martin, 2000, p. 228) • Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal) • applied smoothing • 38 million words • Vocab of 19,979 (low-frequency words mapped to UNK). • Computed perplexity on a test set of 1.5 million words.

J&M’s results • Trigrams do best of all. • Value suggests the extent to which the model can fit the data in the test set. • Note: with unigrams, the model has to make lots of guesses!

Summary • Main point about Markov-based language models: • data sparseness is always a problem • smoothing techniques are required to estimate probability of unseen events • Next part discusses more refined smoothing techniques than those seen so far.

Part 2 Smoothing (aka discounting) techniques

Overview… • Smoothing methods: • Simple smoothing • Witten-Bell & Good-Turing estimation • Held-out estimation and cross-validation • Combining several n-gram models: • back-off models

Rationale behind smoothing • Sample frequencies • seen events with probability P • unseen events (including “grammatical” zeroes”) with probability 0 • Real population frequencies • seen events • (including the unseen events in our sample) + smoothing to approximate results in Lower probabilities for seen events (discounting). Left over probability mass distributed over unseens (smoothing).

Laplace’s law, Lidstone’s law and the Jeffreys-Perks law

Instances in the Training Corpus:“inferior to ________” F(w)

Maximum Likelihood Estimate F(w) Unknowns are assigned 0% probability mass

Corpora and Statistical Methods