1 / 101

# Corpora and Statistical Methods

Corpora and Statistical Methods. Language modelling using N-Grams. Example task. The word-prediction task (Shannon game) Given : a sequence of words (the history ) a choice of next word Predict : the most likely next word

## Corpora and Statistical Methods

E N D

### Presentation Transcript

1. Corpora and Statistical Methods Language modelling using N-Grams

2. Example task • The word-prediction task (Shannon game) • Given: • a sequence of words (the history) • a choice of next word • Predict: • the most likely next word • Generalises easily to other problems, such as predicting the POS of unknowns based on history.

3. Applications of the Shannon game • Automatic speech recognition (cf. tutorial 1): • given a sequence of possible words, estimate its probability • Context-sensitive spelling correction: • Many spelling errors are real words • He walked for miles in the dessert. (resp. desert) • Identifying such errors requires a global estimate of the probability of a sentence.

4. Applications of N-gram models generally • POS Tagging: • predict the POS of an unknown word by looking at its history • Statistical parsing: • e.g. predict the group of words that together form a phrase • Statistical NL Generation: • given a semantic form to be realised as text, and several possible realisations, select the most probable one.

5. A real-world example: Google’s did you mean • Google uses an n-gram model (based on sequences of characters, not words). • In this case, the sequence apple desserts is much more probable than apple deserts

6. How it works • Documents provided by the search engine are added to: • An index (for fast retrieval) • A language model (based on probability of a sequence of characters) • A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”). • Outcome is a context-sensitive spelling correction: • “apple deserts”  “apple desserts” • “frodbaggins”  “frodobaggins” • “frod”  “ford”

7. The noisy channel model • After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall p. 198

8. The assumptions behind n-gram models

9. The Markov Assumption • Markov models: • probabilistic models which predict the likelihood of a future unit based on limited history • in language modelling, this pans out as the local history assumption: • the probability of wn depends on a limited number of prior words • utility of the assumption: • we can rely on a small n for our n-gram models (bigram, trigram) • long n-grams become exceedingly sparse • Probabilities become very small with long sequences

10. The structure of an n-gram model • The task can be re-stated in conditional probabilistic terms: • Limiting n under the Markov Assumption means: • greater chance of finding more than one occurrence of the sequence w1…wn-1 • more robust statistical estimations • N-grams are essentially equivalence classes or bins • every unique n-gram is a type or bin

11. Structure of n-gram models (II) • If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model • Note terminology: • n-gram model = (n-1)th order Markov Model

12. Methodological considerations • We are often concerned with: • building an n-gram model • evaluating it • We therefore make a distinction between training and test data • You never test on your training data • If you do, you’re bound to get good results. • N-gram models tend to beovertrained, i.e.: if you train on a corpus C, your model will be biased towards expecting the kinds of events in C. • Another term for this: overfitting

13. Dividing the data • Given: a corpus of n units (words, sentences, … depending on the task) • A large proportion of the corpus is reserved for training. • A smaller proportion for testing/evaluation (normally 5-10%)

14. Held-out (validation) data • Held-out estimation: • during training, we sometimes estimate parameters for our model empirically • commonly used in smoothing (how much probability space do we want to set aside for unseen data)? • therefore, the training set is often split further into training data and validation data • normally, held-out data is 10% of the size of the training data

15. Development data • A common approach: • train an algorithm on training data a. (estimate further parameters on held-out data if required) • evaluate it • re-tune it • go back to Step 1 until no further finetuning necessary • Carry out final evaluation • For this purpose, it’s useful to have: • training data for step 1 • development set for steps 2-4 • final test set for step 5

16. Significance testing • Often, we compare the performance of our algorithm against some baseline. • A single, raw performance score won’t tell us much. We need to test for significance (e.g. using t-test). • Typical method: • Split test set into several small test sets, e.g. 20 samples • evaluation carried out separately on each • mean and variance estimated based on 20 different samples • test for significant difference between algorithm and a predefined baseline

17. Size of n-gram models • In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram. • For a bigram model: N2 possible n-grams in principle • For a trigram model: N3 possible n-grams. • …

18. Size (continued) • Each n-gram in our model is a parameter used to estimate probability of the next possible word. • too many parameters make the model unwieldy • too many parameters lead to data sparseness: most of them will have f = 0 or 1 • Most models stick to unigrams, bigrams or trigrams. • estimation can also combine different order models

19. Further considerations • When building a model, we tend to take into account the start-of-sentence symbol: • the girl swallowed a large green caterpillar • <s> the • the girl • … • Also typical to map all tokens w such that count(w) < k to <UNK>: • usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen” • this reduces the parameter space

20. Building models using Maximum Likelihood Estimation

21. Maximum Likelihood Estimation Approach • Basic equation: • In a unigram model, this reduces to simple probability. • MLE models estimate probability using relative frequency.

22. Limitations of MLE • MLE builds the model that maximises the probability of the training data. • Unseen events in the training data are assigned zero probability. • Since n-gram models tend to be sparse, this is a real problem. • Consequences: • seen events are given more probability mass than they have • unseen events are given zero mass

23. Seen/unseen A A’ Probability mass of events not in training data The problem with MLE is that it distributes A’ among members of A. Probability mass of events in training data

24. The solution • Solution is to correct MLE estimation using a smoothing technique. • More on this in the next part • But cf. Tutorial 1, which introduced the simplest method of smoothing known.

25. Adequacy of different order models • Manning/Schutze `99 report results for n-gram models of a corpus of the novels of Austen. • Task: use n-gram model to predict the probability of a sentence in the test data. • Models: • unigram: essentially zero-context markov model, uses only the probability of individual words • bigram • trigram • 4-gram

26. Example test case • Training Corpus: five Jane Austen novels • Corpus size = 617,091 words • Vocabulary size = 14,585 unique types • Task: predict the next word of the trigram • “inferior to ________” • from test data, Persuasion: • “[In person, she was] inferior to both [sisters.]”

27. Selecting an nVocabulary (V) = 20,000 words

28. Adequacy of unigrams • Problems with unigram models: • not entirely hopeless because most sentences contain a majority of highly common words • ignores syntax completely: • P(In person she was inferior) = P(inferior was she person in)

29. Adequacy of bigrams • Bigrams: • improve situation dramatically • some unexpected results: • p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person

30. Adequacy of trigrams • Trigram models will do brilliantly when they’re useful. • They capture a surprising amount of contextual variation in text. • Biggest limitation: • most new trigrams in test data will not have been seen in training data. • Problem carries over to 4-grams, and is much worse!

31. Reliability vs. Discrimination • larger n: more information about the context of the specific instance (greater discrimination) • smaller n: more instances in training data, better statistical estimates (more reliability)

32. Backing off • Possible way of striking a balance between reliability and discrimination: • backoff model: • where possible, use a trigram • if trigram is unseen, try and “back off” to a bigram model • if bigrams are unseen, try and “back off” to a unigram

33. Evaluating language models

34. Perplexity • Recall: Entropy is a measure of uncertainty: • high entropy = high uncertainty • perplexity: • if I’ve trained on a sample, how surprised am I when exposed to a new sample? • a measure of uncertainty of a model on new data

35. Entropy as “expected value” • One way to think of the summation part is as a weighted average of the information content. • We can view this average value as an “expectation”: the expected surprise/uncertainty of our model.

36. Comparing distributions • We have a language model built from a sample. The sample is a probability distribution q over n-grams. • q(x) = the probability of some n-gram x in our model. • The sample is generated from a true population (“the language”) with probability distribution p. • p(x) = the probability of x in the true distribution

37. Evaluating a language model • We’d like an estimate of how good our model is as a model of the language • i.e. we’d like to compare q to p • We don’t have access to p. (Hence, can’t use KL-Divergence) • Instead, we use our test data as an estimate of p.

38. Cross-entropy: basic intuition • Measure the number of bits needed to identify an event coming from p, if we code it according to q: • We draw sequences according to p; • but we sum the log of their probability according to q. • This estimate is called cross-entropy H(p,q)

39. Cross-entropy: p vs. q • Cross-entropy is an upper bound on the entropy of the true distribution p: • H(p) ≤ H(p,q) • if our model distribution (q) is good, H(p,q) ≈ H(p) • We estimate cross-entropy based on our test data. • Gives an estimate of the distance of our language model from the distribution in the test sample.

40. Estimating cross-entropy Probability according to p (test set) Entropy according to q (language model)

41. Perplexity • The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p is: • A perplexity value of k (obtained on a test set) tells us: • our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data. • The lower the perplexity, the better the language model (the lower the surprise on our test data).

42. Perplexity example (Jurafsky & Martin, 2000, p. 228) • Trained unigram, bigram and trigram models from a corpus of news text (Wall Street Journal) • applied smoothing • 38 million words • Vocab of 19,979 (low-frequency words mapped to UNK). • Computed perplexity on a test set of 1.5 million words.

43. J&M’s results • Trigrams do best of all. • Value suggests the extent to which the model can fit the data in the test set. • Note: with unigrams, the model has to make lots of guesses!

44. Summary • Main point about Markov-based language models: • data sparseness is always a problem • smoothing techniques are required to estimate probability of unseen events • Next part discusses more refined smoothing techniques than those seen so far.

45. Part 2 Smoothing (aka discounting) techniques

46. Overview… • Smoothing methods: • Simple smoothing • Witten-Bell & Good-Turing estimation • Held-out estimation and cross-validation • Combining several n-gram models: • back-off models

47. Rationale behind smoothing • Sample frequencies • seen events with probability P • unseen events (including “grammatical” zeroes”) with probability 0 • Real population frequencies • seen events • (including the unseen events in our sample) + smoothing to approximate results in Lower probabilities for seen events (discounting). Left over probability mass distributed over unseens (smoothing).

48. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law

49. Maximum Likelihood Estimate F(w) Unknowns are assigned 0% probability mass

More Related