- 42 Views
- Uploaded on
- Presentation posted in: General

LIN6932: Topics in Computational Linguistics

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

LIN6932: Topics in Computational Linguistics

Hana Filip

Lecture 5: N-grams

- Last bit on tagging:
- Tagging Foreign Languages
- Error Analysis

- N-grams

- Idea:
- First do morphological parsing
- Get all possible parses
- Treat each parse for a word as a “POS tag”
- Use a tagger to disambiguate

- Look at a confusion matrix (contingency table)
- E.g. 4.4% of the total errors caused by mistagging VBD as VBN
- See what errors are causing problems
- Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
- Adverb (RB) vs Particle (RP) vs Prep (IN)
- Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

- ERROR ANALYSIS IS ESSENTIAL!!!

- I do uh main- mainly business data processing
- Fragments
- Filled pauses

- Are cat and cats the same word?
- Some terminology
- Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
- Cat and cats = same lemma

- Wordform: the full inflected surface form.
- Cat and cats = different wordforms

- Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense

- they picnicked by the pool then lay back on the grass and looked at the stars
- 16 tokens
- 14 types

- SWBD (Switchboard Corpus, Brown)):
- ~20,000 wordform types,
- 2.4 million wordform tokens

- Brown et al (1992) large corpus
- 583 million wordform tokens
- 293,181 wordform types

- The noisy channel model expects P(W); the probability of the sentence
- The model that computes P(W) is called the language model.
- A better term for this would be “The Grammar”
- But “Language model” or LM is standard
“noise in the data”: the data does not give enough information, are incorrect, or the domain it comes from is nondeterministic

- How to compute this joint probability:
- P(“the”,”other”,”day”,”I”,”was”,”walking”,”along”,”and”,”saw”,”a”,”lizard”)

- Intuition: let’s rely on the Chain Rule of Probability

- Recall the definition of conditional probabilities
- Rewriting:
- More generally
- P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
- In general
- P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)

conditional/posterior probability =

(LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT

We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

- P(“the big red dog was”)=
- P(the)*P(big|the)*P(red|the big)*P(dog|the big red)*P(was|the big red dog)

- Chomsky dictum: “Language is creative”
- We’ll never be able to get enough data to compute the statistics for those long prefixes
- P(lizard|the,other,day,I,was,walking,along,and,saw,a)

- Make the simplifying assumption
- P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)

- Or maybe
- P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)

- So for each component in the product replace with the approximation (assuming a prefix of N)
- Bigram version

- The Maximum Likelihood Estimate

- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs and ham </s>
- This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)

- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i’m looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i’m looking for a good place to eat breakfast
- when is caffe venezia open during the day

- Out of 9222 sentences

- Normalize by unigrams:
- Result:

- P(<s> I want english food </s>) =
p(i|<s>) x p(want|I) x p(english|want) x p(food|english) x p(</s>|food)

=.000031

- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat | to) = .28
- P(food | to) = 0
- P(want | spend) = 0
- P (i | <s>) = .25

- Generate random sentences:
- Choose a random bigram <s>, w according to its probability
- Now choose a random bigram (w, x) according to its probability
- And so on until we choose </s>
- Then string the words together
- <s> I
Iwant

want to

to eat

eat Chinese

Chinese food

food </s>

- N=884,647 tokens, V=29,066
- Shakespeare produced 300,000 bigram types out of V2= 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare

- N-grams only work well for word prediction if the test corpus looks like the training corpus
- In real life, it often doesn’t
- We need to train robust models, adapt to test set, etc

- Zipf’s Law:
- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low frequency events

- Result:
- Our estimates are sparse! no counts at all for the vast bulk of things we want to estimate!
- Some of the zeroes in the table are really zeros But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
- How to address?

- Answer:
- Estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia Hirschberg

Slide from Rion Snow

- Also called Laplace smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate:
- Laplace estimate:
- Reconstructed counts:

- C(count to) went from 608 to 238!
- P(to|want) from .66 to .26!
- Discount d= c*/c
- d for “chinese food” =.10!!! A 10x reduction
- So in general, add-one is a blunt instrument
- Could use more fine-grained method (add-k)

- But add-one smoothing not used for N-grams, as we have much better methods
- Despite its flaws it is however still used to smooth other probabilistic models in NLP, especially
- For pilot studies
- in domains where the number of zeros isn’t so huge.

- Last bit on tagging:
- Tagging Foreign Languages
- Error Analysis

- N-grams