LING / C SC 439/539 Statistical Natural Language Processing • Lecture 2 • 1/14/2013
For you to do • Sign attendance sheet • Turn in survey/quiz • Fill out Doodle poll for office hours, if you think you might want to go to office hours • Link was sent to you in an e-mail
Recommended reading • Manning & Schutze • 1.4: Zipf’s Law • Chapter 4: Corpus-based work • Jurafsky & Martin Chapter 4 • 4.1, 4.2, 4.3: counting n-grams • 4.5: smoothing
Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I
Get statistics to do statistical NLP • Determine how often things occur in language • But language is an abstract entity • Look at a corpus instead • Corpus: large electronic file of language usage • Written texts: news articles, fiction, poetry, web pages • Spoken language: audio files, text transcriptions • Gather statistics from a corpus • Count occurrences of different types of linguistic units: words, n-grams (sequences), part of speech (POS) tags, etc.
Where to get corpora • Do a web search for “<language> corpus” • NLTK (Natural Language Toolkit): http://nltk.org/nltk_data/ • Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ • Many LDC corpora are at the UA library • Main languages are English, Arabic, Chinese; spoken and written • http://corplinguistics.wordpress.com/2011/10/30/top-five-ldc-corpora/ • Many other corpora, some accessible online: http://en.wikipedia.org/wiki/Text_corpus
Brown Corpus • http://en.wikipedia.org/wiki/Brown_Corpus • Compilation of texts of American English in the 1960s • 500 texts from 15 different genres: news, religious texts, popular books, government documents, fiction, etc. • 1,154,633 words • You will have access to this corpus
How do we define “word”? • Simplest answer: any contiguous sequence of non-whitespace characters is a word • Examples on next slide • It is useful to first perform tokenization: split a raw text into linguistically meaningful lexical units • In this class we will use a pre-tokenized version of the Brown Corpus
Some examples of “words” in the untokenized Brown Corpus already The the (the Caldwell Caldwell's Caldwell, ribbon'' Lt. Ave., 1964 1940s $50 $6,100,000,000 fire-fighting natural-law Baden-Baden 1961-62 France-Germany Kohnstamm-positive Colquitt--After
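Following the untokenized examples above, here is a rough sketch contrasting naive whitespace splitting with a real tokenizer. It assumes NLTK and its punkt tokenizer data are installed; the example sentence is invented for illustration.

```python
import nltk  # assumes nltk is installed and its 'punkt' tokenizer data has been downloaded

text = "Caldwell's car, bought in 1964, cost $6,100."

# Simplest definition of "word": any whitespace-separated chunk
print(text.split())
# ["Caldwell's", 'car,', 'bought', 'in', '1964,', 'cost', '$6,100.']

# Tokenization splits punctuation and clitics into separate lexical units,
# e.g. "Caldwell's" -> "Caldwell" + "'s", "car," -> "car" + ","
print(nltk.word_tokenize(text))
```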
Frequencies of words in a corpus: types and tokens • Brown corpus: • 1,154,633 word tokens • 49,680 word types • Type vs. token (for words): • Type: a distinct word • e.g. “with” • Token: an individual occurrence of a word • e.g. “with” occurs 7,270 times • Question: what percentage of the word types in a corpus appear exactly once? (make a guess)
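A minimal sketch of these type/token counts, assuming NLTK's copy of the Brown corpus has been downloaded. Its tokenization and case handling differ slightly from the figures above, so the exact numbers may not match the slide.

```python
from collections import Counter

from nltk.corpus import brown  # assumes the Brown corpus has been downloaded via nltk.download()

tokens = list(brown.words())            # word tokens, in corpus order
freqs = Counter(tokens)                 # maps each word type to its token count

num_tokens = sum(freqs.values())        # total word tokens
num_types = len(freqs)                  # distinct word types
hapaxes = [w for w, c in freqs.items() if c == 1]   # types occurring exactly once

print(num_tokens, num_types, len(hapaxes) / num_types)
```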
Frequency and rank • Sort words by decreasing frequency • Rank = order in sorted list • Rank 1: most-frequent word • Rank 2: second most-frequent word • Rank 3: third most-frequent word • etc. • Plot word frequencies by rank
Plot of word frequencies, log-log scale • (axis scale: log10(10) = 1, log10(100) = 2, log10(1000) = 3)
Plot of word frequencies, log-log scale • 10 most-frequent words: freq. > 10,000 • Next 90 words: 1,000 < freq. < 10,000 • Next 900 words: 100 < freq. < 1,000 • Next 9,000 words: 10 < freq. < 100 • ~40,000 words: 1 ≤ freq. < 10
Word frequency distributions in language • Exemplifies a power law distribution • For any corpus and any language: • There are a few very common words • A substantial number of medium freq. words • A huge number of low frequency words • Brown corpus • 1,154,633 word tokens, 49,680 word types • 21,919 types appear only once! = 44.1%
Word frequencies follow Zipf’s law • http://en.wikipedia.org/wiki/Zipf%27s_law • Zipf’s Law (1935, 1949): the frequency F of a word w is inversely proportional to the rank R of w: F ∝ 1 / R, i.e., F × R = k, for some constant k • Example: the 50th most common word occurs approximately 3 times as frequently as the 150th most common word • freq. at rank 50 ∝ 1 / 50; freq. at rank 150 ∝ 1 / 150; (1 / 50) / (1 / 150) = 3
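A quick sketch of how to check this empirically, reusing the freqs counter from the earlier Brown corpus sketch; rank × frequency should stay roughly (not exactly) constant across ranks.

```python
# Sort type frequencies in decreasing order; rank 1 = most frequent type
ranked = sorted(freqs.values(), reverse=True)

for rank in (1, 10, 50, 100, 150, 1000, 10000):
    freq = ranked[rank - 1]
    print(rank, freq, rank * freq)   # rank * freq is roughly constant if Zipf's law holds
```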
Zipf’s Law explains linear-like relationship between frequency and rank in log-log scale Red line = constant k in Zipf’s Law
Some words with a frequency of 20 in Brown Corpus • pursue • purchased • punishment • promises • procurement • probabilities • precious • pitcher • pitch replacement repair relating rehabilitation refund receives ranks queen quarrel puts
Some words with a frequency of 1 in Brown Corpus • government-controlled • gouverne • goutte • gourmets • gourmet's • gothic • gossiping • gossiped • gossamer • gosh • gorgeously • gorge • gooshey • goooolick • goofed • gooey • goody • goodness' • goodies • good-will
Frequencies of different types of words • Extremely frequent words • Belong to part-of-speech categories of Determiner, Preposition, Conjunction, Pronoun • Also frequent adverbs, and punctuation • Linguistically, these are function words: convey the syntactic information in a sentence • Moderately frequent words • Common Nouns, Verbs, Adjectives, Adverbs • Linguistically, these are content words: convey semantic information in a sentence • Infrequent words • Also content words: Nouns, Verbs, Adjectives, rare Adverbs (such as “gorgeously”) • New words, names, foreign words, many numbers
Consequence of Zipf’s Law: unknown words • Because of highly skewed distribution, many possible words won’t appear in a corpus, and are therefore “unknown” • Some common words not found in Brown Corpus: combustible parabola preprocess headquartering deodorizer deodorizers usurps usurping • Names of people and places, especially foreign names • Vocabulary of specialized domains • Medical, legal, scientific, etc. • Neologisms (newly-formed words) won’t be in a corpus
Some neologisms • http://www.wordspy.com/diversions/neologisms.asp • self-coin, verb • To coin an already existing word that you didn't know about. • awkword, noun • A word that is difficult to pronounce. • jig-sawdust, noun • The sawdust-like bits that fall off jig-saw puzzle pieces. • multidude, noun • The collective noun for a group of surfers.
Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I
N-grams • An n-gram is a contiguous sequence of N units, for some particular linguistic unit. • Example sentence, POS-tagged: The/DT brown/JJ fox/NN jumped/VBD ./. • 5 different word 1-grams (unigrams): The, brown, fox, jumped, . • 4 different word 2-grams (bigrams): The brown, brown fox, fox jumped, jumped . • 3 different word 3-grams (trigrams): The brown fox, brown fox jumped, fox jumped . • 4 different POS bigrams: DT JJ, JJ NN, NN VBD, VBD .
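A minimal n-gram extractor over the example sentence; the ngrams helper below is just an illustrative function, not a standard API.

```python
def ngrams(units, n):
    """Return the contiguous n-grams over a sequence of units (words, POS tags, ...)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

words = ["The", "brown", "fox", "jumped", "."]
print(ngrams(words, 2))  # [('The', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', '.')]
print(ngrams(words, 3))  # [('The', 'brown', 'fox'), ('brown', 'fox', 'jumped'), ('fox', 'jumped', '.')]
```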
N-gram frequencies in a corpus • Corpus linguists research language usage • See what words an author uses • See how language changes • Look at n-gram frequencies in corpora • Example: Google n-grams http://books.google.com/ngrams
Sparse data problem • For many linguistic units, there will often be very few or even zero occurrences of logically possible n-grams in a corpus • Example: quick brown beaver • Not found in the Brown corpus (1.2 million words) • Not found by Google either!
Sparse data results from combinatorics • As the units to be counted in a corpus become larger (i.e., higher N for n-grams), fewer occurrences are expected to be seen • Due to exponential increase in # of logically possible N-grams • Example: Brown corpus: ~50,000 word types • Number of logically possible word N-grams: N = 1: 50,000 different unigrams N = 2: 50,000² = 2.5 billion different bigrams N = 3: 50,000³ = 125 trillion different trigrams
Expected frequency of an N-gram (Brown: 1.2 million tokens, 50k word types) • Suppose N-grams are uniformly distributed • “uniform” = each occurs equally often • Calculate expected frequencies of N-grams: • 50,000 word unigrams • 1.2 million / 50,000 = 24 occurrences • 2.5 billion word bigrams • 1.2 million / 2.5 billion = 0.00048 occurrences • 125 trillion word trigrams: • 1.2 million / 125 trillion = 0.0000000096 occurrences
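The same combinatorics and expected-frequency arithmetic from the last two slides, as a short sketch using the slides' round numbers.

```python
num_tokens = 1_200_000   # word tokens in the Brown corpus (approx.)
num_types = 50_000       # word types (approx.)

for n in (1, 2, 3):
    possible = num_types ** n          # logically possible word n-grams
    expected = num_tokens / possible   # expected count of each, if uniformly distributed
    print(n, possible, expected)
# n=1: 50,000 possible, ~24 expected occurrences each
# n=2: 2.5 billion possible, ~0.00048 expected
# n=3: 125 trillion possible, ~0.0000000096 expected
```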
Consequences of Zipf’s Law (II) • Zipf’s law: highly skewed distribution for words • Also holds for n > 1 (bigrams, trigrams, etc.), and for any linguistic unit • Makes the sparse data problem even worse! • Because of combinatorics, a higher value of N leads to sparse data for N-grams • But because of Zipf’s Law, even if N is small, the frequencies of N-grams consisting of low-frequency words will be quite sparse
Dealing with sparse data • Since in NLP we develop systems to analyze novel examples of language, we must deal with sparse data somehow (cannot ignore) • Would need an enormously large corpus to count frequencies of rare n-grams • Infeasible; even Google isn’t big enough! • Smooth the frequencies • Impose structure upon data • Linguistic analysis: POS categories, syntax, morphemes, phonemes, semantic categories, etc.
Google ngrams again • Access at http://books.google.com/ngrams • Also available in library on 6 DVDs: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 • Number of tokens: 1,024,908,267,229 • Number of sentences: 95,119,665,584 • Quantity of ngrams (types) Number of unigrams: 13,588,391 Number of bigrams: 314,843,401 Number of trigrams: 977,069,902 Number of fourgrams: 1,313,818,354 Number of fivegrams: 1,176,470,663
Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I
1. Discrete random variables • A discrete random variable takes on a range of values, or events • The set of possible events is the sample space, Ω • Example: rolling a die Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots} • The occurrence of a random variable taking on a particular value from the sample space is a trial
2. Probability distribution • A set of data can be described as a probability distribution over a set of events • Definition of a probability distribution: • We have a set of events x drawn from a finite sample space Ω • Probability of each event is between 0 and 1 • Sum of probabilities of all events is 1
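A tiny sketch of these two defining conditions, checked for a distribution stored as a Python dict; the function name is illustrative, not a standard API.

```python
def is_probability_distribution(p, tol=1e-9):
    """Check that every probability is in [0, 1] and that the probabilities sum to 1."""
    in_range = all(0.0 <= prob <= 1.0 for prob in p.values())
    sums_to_one = abs(sum(p.values()) - 1.0) < tol
    return in_range and sums_to_one

fair_die = {outcome: 1 / 6 for outcome in range(1, 7)}   # equally weighted die
print(is_probability_distribution(fair_die))             # True
```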
Example: Probability distribution • Suppose you have a die that is equally weighted on all sides. • Let X be the random variable for the outcome of a single roll. p(X=1 dot) = 1 / 6 p(X=2 dots) = 1 / 6 p(X=3 dots) = 1 / 6 p(X=4 dots) = 1 / 6 p(X=5 dots) = 1 / 6 p(X=6 dots) = 1 / 6
Probability estimation • Suppose you have a die and you don’t know how it is weighted. • Let X be the random variable for the outcome of a roll. • Want to produce values for p̂(X), which is an estimate of the probability distribution of X. • Read as “p-hat” • Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.
Example: roll a die; random variable X • Data: roll a die 60 times, record the frequency of each event • 1 dot 9 rolls • 2 dots 10 rolls • 3 dots 9 rolls • 4 dots 12 rolls • 5 dots 9 rolls • 6 dots 11 rolls
Example: roll a die; random variable X • Maximum Likelihood Estimate: p̂(X=x) = count(x) / total_count_of_all_events • p̂( X = 1 dot) = 9 / 60 = 0.150 p̂( X = 2 dots) = 10 / 60 = 0.167 p̂( X = 3 dots) = 9 / 60 = 0.150 p̂( X = 4 dots) = 12 / 60 = 0.200 p̂( X = 5 dots) = 9 / 60 = 0.150 p̂( X = 6 dots) = 11 / 60 = 0.183 Sum = 60 / 60 = 1.0
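The same maximum likelihood estimate computed directly from the observed counts, as a minimal sketch.

```python
counts = {"1 dot": 9, "2 dots": 10, "3 dots": 9, "4 dots": 12, "5 dots": 9, "6 dots": 11}

total = sum(counts.values())                        # 60 trials
p_hat = {x: c / total for x, c in counts.items()}   # MLE: relative frequency of each event

print(p_hat["4 dots"])       # 0.2
print(sum(p_hat.values()))   # 1.0 (up to floating-point rounding)
```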
Convergence of p̂(X) • Suppose we know that the die is equally weighted. • We observe that our values for p̂(X) are close to p(X), but not all exactly equal. • We would expect that as the number of trials increases, p̂(X) will get closer to p(X). • For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.
Simplify notation • People are often not precise, and write “p(X)” when they mean “p̂(X)” • We will do this also • Can also leave out the name of the random variable when it is understood • Example: p(X=4 dots) p(4 dots)
Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I
Don’t use MLE for n-gram probabilities • Maximum Likelihood Estimation applied to n-grams: • count(X) = observed frequency of X in a corpus • p(X) = observed probability of X in a corpus • However, a corpus is only a sample of the language • Does not contain all possible words, phrases, constructions, etc. • Linguistically possible but nonexistent items in a corpus are assigned zero probability with MLE • Zero probability means impossible • But they are not impossible… • Need to revise probability estimate
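A sketch of the problem, reusing the tokens list and the ngrams helper from the earlier sketches: under MLE, any bigram that happens not to occur in the corpus gets probability exactly zero, even if it is perfectly good English.

```python
from collections import Counter

bigram_counts = Counter(ngrams(tokens, 2))   # bigram counts over the Brown corpus (see earlier sketches)
total_bigrams = sum(bigram_counts.values())

def mle_prob(bigram):
    # Counter returns 0 for missing keys, so unseen bigrams get probability 0.0
    return bigram_counts[bigram] / total_bigrams

print(mle_prob(("of", "the")))
print(mle_prob(("brown", "beaver")))   # likely 0.0 if this bigram never occurs in the corpus
```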
Smoothing • Take probability mass away from observed items, and assign to zero-count items • Solid line: observed counts • Dotted line: smoothed counts
Smoothing methods 1. Add-one smoothing 2. Deleted estimation 3. Good-Turing smoothing (will see later) 4. Witten-Bell smoothing 5. Backoff smoothing 6. Interpolated backoff
Add-one smoothing (Laplace estimator) • Simplest method • Example: smoothing N-grams • First, determine how many N-grams there are (for some value of N) • The frequencies of N-grams that are observed in the training data are not modified • For all other N-grams, assign a small constant value, such as 0.05
Probability distribution changes after add-one smoothing • Since zero-count N-grams now have a nonzero frequency, the probability mass assigned to observed N-grams decreases • Example: 5 possible items Original counts: [ 5, 3, 2, 0, 0 ] Probability distribution: [ .5, .3, .2, 0.0, 0.0 ] Smoothed counts, add 0.05: [5, 3, 2, .05, .05 ] Smoothed probabilities: [.495, .297, .198, .005, .005 ] Probabilities now nonzero Probabilities are discounted
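The slide's five-item example as a short sketch: give each zero-count item a constant count of 0.05 and renormalize, which reproduces the discounted probabilities above.

```python
counts = [5, 3, 2, 0, 0]

# MLE probabilities: zero-count items get probability 0
total = sum(counts)
mle = [c / total for c in counts]                      # [0.5, 0.3, 0.2, 0.0, 0.0]

# Smoothed counts: zero-count items receive a small constant count (0.05 here)
smoothed_counts = [c if c > 0 else 0.05 for c in counts]
smoothed_total = sum(smoothed_counts)                  # 10.1
smoothed = [c / smoothed_total for c in smoothed_counts]
# [0.495, 0.297, 0.198, 0.005, 0.005] (rounded):
# observed items are discounted, unseen items now have nonzero probability
print(smoothed)
```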
Add-one smoothing (Laplace estimator) • Advantage: very simple technique • Disadvantages: • Too much probability mass can be assigned to zero-count n-grams. This is because there can be a huge number of zero-count n-grams. • Example: • Suppose you smooth word bigrams in the Brown corpus (1.2 million words), and use a smoothed frequency of .05. • The 495,000 different observed bigrams have a total smoothed count of 1,200,000. • The 2.5 billion unobserved bigrams have a total smoothed count of .05 x 2,500,000,000 = 125,000,000. • The probability mass assigned to unobserved bigrams is 125,000,000 / (125,000,000 + 1,200,000) = 99% !!!
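The arithmetic behind that 99% figure, as a sketch with the slide's round numbers.

```python
observed_bigram_tokens = 1_200_000    # total count of observed bigrams (approx. corpus size)
observed_bigram_types = 495_000       # distinct bigrams actually seen (from the slide)
possible_bigrams = 50_000 ** 2        # 2.5 billion logically possible bigrams
unseen_bigrams = possible_bigrams - observed_bigram_types

unseen_mass = unseen_bigrams * 0.05   # total smoothed count given to unseen bigrams (~125 million)
fraction_unseen = unseen_mass / (unseen_mass + observed_bigram_tokens)
print(fraction_unseen)                # ~0.99: nearly all probability mass goes to unseen bigrams
```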