LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing, Lecture 2, 1/14/2013

  2. For you to do • Sign attendance sheet • Turn in survey/quiz • Fill out Doodle poll for office hours, if you think you might want to go to office hours • Link was sent to you in an e-mail

  3. Recommended reading • Manning & Schütze • 1.4: Zipf’s Law • Chapter 4: Corpus-based work • Jurafsky & Martin Chapter 4 • 4.1, 4.2, 4.3: counting n-grams • 4.5: smoothing

  4. Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I

  5. Get statistics to do statistical NLP • Determine how often things occur in language • But language is an abstract entity • Look at a corpus instead • Corpus: large electronic file of language usage • Written texts: news articles, fiction, poetry, web pages • Spoken language: audio files, text transcriptions • Gather statistics from a corpus • Count occurrences of different types of linguistic units: words, n-grams (sequences), part of speech (POS) tags, etc.
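As a concrete illustration of gathering such counts, here is a minimal sketch in Python (not part of the lecture) that tallies whitespace-separated word tokens in a plain-text file; the filename corpus.txt is just a placeholder, and collections.Counter is a standard-library choice rather than anything prescribed by the course.

    from collections import Counter

    # Count whitespace-separated word tokens in a plain-text corpus file.
    # "corpus.txt" is a placeholder path; substitute any raw text file.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().split()

    counts = Counter(tokens)
    print("word tokens:", len(tokens))      # total occurrences
    print("word types:", len(counts))       # distinct words
    print(counts.most_common(10))           # the ten most frequent words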

  6. Where to get corpora • Do a web search for “<language> corpus” • NLTK (Natural Language Toolkit): http://nltk.org/nltk_data/ • Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ • Many LDC corpora are at the UA library • Main languages are English, Arabic, Chinese; spoken and written • http://corplinguistics.wordpress.com/2011/10/30/top-five-ldc-corpora/ • Many other corpora, some accessible online: http://en.wikipedia.org/wiki/Text_corpus

  7. Brown Corpus • http://en.wikipedia.org/wiki/Brown_Corpus • Compilation of texts of American English in the 1960s • 500 texts from 15 different genres: news, religious texts, popular books, government documents, fiction, etc. • 1,154,633 words • You will be given access to this corpus
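Since slide 6 points to NLTK as a source of corpora, a hedged sketch of loading the Brown Corpus through NLTK's corpus reader is shown below; it assumes the nltk package is installed, and the counts it reports can differ slightly from the slide's figures depending on how punctuation tokens are treated.

    import nltk
    nltk.download("brown")                  # fetch the corpus data the first time
    from nltk.corpus import brown

    words = brown.words()                   # pre-tokenized list of word tokens
    print("tokens:", len(words))            # roughly 1.16 million, incl. punctuation
    print("types:", len(set(words)))        # distinct word forms (case-sensitive)
    print(brown.categories())               # the genres: news, fiction, religion, ...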

  8. How do we define “word”? • Simplest answer: any contiguous sequence of non-whitespace characters is a word • Examples on next slide • It is useful to first perform tokenization: split a raw text into linguistically meaningful lexical units • In this class we will use a pre-tokenized version of the Brown Corpus

  9. Some examples of “words” in the untokenized Brown Corpus already The the (the Caldwell Caldwell's Caldwell, ribbon'' Lt. Ave., 1964 1940s $50 $6,100,000,000 fire-fighting natural-law Baden-Baden 1961-62 France-Germany Kohnstamm-positive Colquitt--After
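To make the contrast concrete, the sketch below compares a bare whitespace split with a crude regex-based tokenizer on a short sentence; the regular expression is only illustrative and is not the tokenizer used to produce the class's pre-tokenized Brown Corpus.

    import re

    text = "The jury said it did so, in Fulton County's court (Atlanta)."

    # Simplest definition of "word": whitespace-separated character sequences.
    whitespace_words = text.split()

    # A crude tokenizer: keep internal apostrophes/hyphens, split off punctuation.
    tokens = re.findall(r"[A-Za-z0-9$]+(?:['\-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

    print(whitespace_words)   # ..., "so,", ..., "County's", "court", "(Atlanta)."
    print(tokens)             # ..., "so", ",", ..., "court", "(", "Atlanta", ")", "."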

  10. Frequencies of words in a corpus: types and tokens • Brown corpus: • 1,154,633 word tokens • 49,680 word types • Type vs. token (for words): • Type: a distinct word • e.g. “with” • Token: an individual occurrence of a word • e.g. “with” occurs 7,270 times • Question: what percentage of the word types in a corpus appear exactly once? (make a guess)
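The question on this slide can be answered by direct counting; the sketch below assumes the NLTK Brown Corpus from the earlier example, and its exact numbers may differ a little from the slide's because of tokenization differences.

    from collections import Counter
    from nltk.corpus import brown       # assumes the NLTK Brown corpus (see above)

    counts = Counter(brown.words())
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapaxes = sum(1 for c in counts.values() if c == 1)   # types occurring once

    print(f"tokens: {n_tokens}, types: {n_types}")
    print(f"types appearing exactly once: {hapaxes} ({100 * hapaxes / n_types:.1f}%)")
    print("occurrences of 'with':", counts["with"])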

  11. Frequency and rank • Sort words by decreasing frequency • Rank = order in sorted list • Rank 1: most-frequent word • Rank 2: second most-frequent word • Rank 3: third most-frequent word • etc. • Plot word frequencies by rank
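A sketch of how such a rank-frequency plot can be drawn is given below; it assumes the NLTK Brown Corpus and the matplotlib plotting library, neither of which is prescribed by the lecture.

    from collections import Counter
    import matplotlib.pyplot as plt
    from nltk.corpus import brown

    counts = Counter(brown.words())
    freqs = sorted(counts.values(), reverse=True)   # frequency at rank 1, 2, 3, ...
    ranks = range(1, len(freqs) + 1)

    plt.loglog(ranks, freqs)    # on log-log axes a Zipfian curve looks roughly linear
    plt.xlabel("rank (log scale)")
    plt.ylabel("frequency (log scale)")
    plt.title("Brown Corpus: word frequency vs. rank")
    plt.show()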

  12. Plot of word frequencies, linear scale

  13. Zoom in on lower-left corner

  14. Plot of word frequencies, log-log scale • log₁₀(10) = 1 • log₁₀(100) = 2 • log₁₀(1000) = 3

  15. Plot of word frequencies, log-log scale • 10 most-freq words: freq. > 10,000 • Next 90 words: 1,000 < freq. < 10,000 • Next 900 words: 100 < freq. < 1,000 • Next 9,000 words: 10 < freq. < 100 • ~40,000 words: 1 <= freq. < 10

  16. Word frequency distributions in language • Exemplifies a power law distribution • For any corpus and any language: • There are a few very common words • A substantial number of medium freq. words • A huge number of low frequency words • Brown corpus • 1,154,633 word tokens, 49,680 word types • 21,919 types appear only once! = 44.1%

  17. Word frequencies follow Zipf’s law • http://en.wikipedia.org/wiki/Zipf%27s_law • Zipf’s Law (1935, 1949): the frequency F of a word w is inversely proportional to the rank R of w: F ∝ 1 / R, i.e., F × R = k, for some constant k • Example: the 50th most common word occurs approximately 3 times as frequently as the 150th most common word • freq. at rank 50: ∝ 1 / 50 • freq. at rank 150: ∝ 1 / 150 • (1 / 50) / (1 / 150) = 3
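A rough numerical check of the law (again assuming the NLTK Brown Corpus) is sketched below: the product frequency × rank should stay in the same ballpark across mid-range ranks, though in real corpora it drifts at the extremes.

    from collections import Counter
    from nltk.corpus import brown

    freqs = sorted(Counter(brown.words()).values(), reverse=True)

    # Zipf's Law predicts frequency * rank ≈ k for some constant k.
    for rank in (10, 50, 100, 150, 500, 1000):
        f = freqs[rank - 1]
        print(f"rank {rank:5d}   freq {f:6d}   freq * rank = {f * rank}")

    # The slide's example: rank 50 should be about 3x as frequent as rank 150.
    print("freq(rank 50) / freq(rank 150) =", freqs[49] / freqs[149])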

  18. Zipf’s Law explains linear-like relationship between frequency and rank in log-log scale Red line = constant k in Zipf’s Law

  19. Most-frequent words in Brown Corpus

  20. Some words with a frequency of 20 in Brown Corpus • pursue • purchased • punishment • promises • procurement • probabilities • precious • pitcher • pitch • replacement • repair • relating • rehabilitation • refund • receives • ranks • queen • quarrel • puts

  21. Some words with a frequency of 1 in Brown Corpus • government-controlled • gouverne • goutte • gourmets • gourmet's • gothic • gossiping • gossiped • gossamer • gosh • gorgeously • gorge • gooshey • goooolick • goofed • gooey • goody • goodness' • goodies • good-will

  22. Frequencies of different types of words • Extremely frequent words • Belong to part-of-speech categories of Determiner, Preposition, Conjunction, Pronoun • Also frequent adverbs, and punctuation • Linguistically, these are function words: convey the syntactic information in a sentence • Moderately frequent words • Common Nouns, Verbs, Adjectives, Adverbs • Linguistically, these are content words: convey semantic information in a sentence • Infrequent words • Also content words: Nouns, Verbs, Adjectives, rare Adverbs (such as “gorgeously”) • New words, names, foreign words, many numbers

  23. Consequence of Zipf’s Law: unknown words • Because of the highly skewed distribution, many possible words won’t appear in a corpus, and are therefore "unknown" • Some common words not found in Brown Corpus: combustible parabola preprocess headquartering deodorizer deodorizers usurps usurping • Names of people and places, especially foreign names • Vocabulary of specialized domains • Medical, legal, scientific, etc. • Neologisms (newly-formed words) won’t be in a corpus

  24. Some neologisms • http://www.wordspy.com/diversions/neologisms.asp • self-coin, verb • To coin an already existing word that you didn't know about. • awkword, noun • A word that is difficult to pronounce. • jig-sawdust, noun • The sawdust-like bits that fall off jig-saw puzzle pieces. • multidude, noun • The collective noun for a group of surfers.

  25. Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I

  26. N-grams • An n-gram is a contiguous sequence of N units, for some particular linguistic unit. • Example sentence, POS-tagged: The/DT brown/JJ fox/NN jumped/VBD ./. • 5 different word 1-grams (unigrams): The, brown, fox, jumped, . • 4 different word 2-grams (bigrams): The brown, brown fox, fox jumped, jumped . • 3 different word 3-grams (trigrams): The brown fox, brown fox jumped, fox jumped . • 4 different POS bigrams: DT JJ, JJ NN, NN VBD, VBD .
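A small helper that slides a window of length N over a token sequence reproduces the counts on this slide; this is an illustrative sketch, not code distributed with the course.

    def ngrams(tokens, n):
        """Return all contiguous n-grams (as tuples) in a sequence of tokens."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = ["The", "brown", "fox", "jumped", "."]
    tags = ["DT", "JJ", "NN", "VBD", "."]

    print(ngrams(words, 1))   # 5 unigrams
    print(ngrams(words, 2))   # 4 bigrams: (The, brown), (brown, fox), ...
    print(ngrams(words, 3))   # 3 trigrams
    print(ngrams(tags, 2))    # 4 POS bigrams: (DT, JJ), (JJ, NN), (NN, VBD), (VBD, .)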

  27. N-gram frequencies in a corpus • Corpus linguists research language usage • See what words an author uses • See how language changes • Look at n-gram frequencies in corpora • Example: Google n-grams http://books.google.com/ngrams

  28. Sparse data problem • For many linguistic units, there will often be very few or even zero occurrences of logically possible n-grams in a corpus • Example: quick brown beaver • Not found in the Brown corpus (1.2 million words) • Not found by Google either!

  29. Sparse data results from combinatorics • As the units to be counted in a corpus become larger (i.e., higher N for n-grams), fewer occurrences are expected to be seen • Due to exponential increase in # of logically possible N-grams • Example: Brown corpus: ~50,000 word types • Number of logically possible word N-grams: • N = 1: 50,000 different unigrams • N = 2: 50,000² = 2.5 billion different bigrams • N = 3: 50,000³ = 125 trillion different trigrams

  30. Expected frequency of an N-gram(Brown: 1.2 million tokens, 50k word types) • Suppose N-grams are uniformly distributed • “uniform” = each occurs equally often • Calculate expected frequencies of N-grams: • 50,000 word unigrams • 1.2 million / 50,000 = 24 occurrences • 2.5 billion word bigrams • 1.2 million / 2.5 billion = .00048 occurrences • 125 trillion word trigrams: • 1.2 million / 125 trillion = .0000000096 occurrences
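The slide's arithmetic can be reproduced in a few lines; the figures below use the rounded corpus size and vocabulary size from the slide, so they are approximations.

    corpus_tokens = 1_200_000     # word tokens in the Brown Corpus (rounded)
    vocab = 50_000                # word types (rounded)

    for n in (1, 2, 3):
        possible = vocab ** n                 # logically possible n-grams
        expected = corpus_tokens / possible   # expected count under a uniform distribution
        print(f"N={n}: {possible:.3g} possible n-grams, "
              f"expected frequency {expected:.3g}")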

  31. Consequences of Zipf’s Law (II) • Zipf’s law: highly skewed distribution for words • Also holds for n > 1 (bigrams, trigrams, etc.), and for any linguistic unit • Makes the sparse data problem even worse! • Because of combinatorics, a higher value of N leads to sparse data for N-grams • But because of Zipf’s Law, even if N is small, the frequencies of N-grams consisting of low-frequency words will be quite sparse

  32. Dealing with sparse data • Since in NLP we develop systems to analyze novel examples of language, we must deal with sparse data somehow (cannot ignore) • Would need a massively massive corpus to count frequencies of rare n-grams • Infeasible; even Google isn’t big enough! • Smooth the frequencies • Impose structure upon data • Linguistic analysis: POS categories, syntax, morphemes, phonemes, semantic categories, etc.

  33. Google ngrams again • Access at http://books.google.com/ngrams • Also available in library on 6 DVDs: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 • Number of tokens: 1,024,908,267,229 • Number of sentences: 95,119,665,584 • Quantity of ngrams (types): • Number of unigrams: 13,588,391 • Number of bigrams: 314,843,401 • Number of trigrams: 977,069,902 • Number of fourgrams: 1,313,818,354 • Number of fivegrams: 1,176,470,663

  34. Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I

  35. 1. Discrete random variables • A discrete random variable takes on a range of values, or events • The set of possible events is the sample space, Ω • Example: rolling a die: Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots} • The occurrence of a random variable taking on a particular value from the sample space is a trial

  36. 2. Probability distribution • A set of data can be described as a probability distribution over a set of events • Definition of a probability distribution: • We have a set of events x drawn from a finite sample space Ω • Probability of each event is between 0 and 1 • Sum of probabilities of all events is 1
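The two conditions in this definition are easy to check mechanically; the sketch below is a minimal illustration using a dictionary from events to probabilities, and the helper name is made up for this example.

    def is_probability_distribution(p, tol=1e-9):
        """True if every value lies in [0, 1] and the values sum to 1 (within tol)."""
        values = list(p.values())
        return all(0.0 <= v <= 1.0 for v in values) and abs(sum(values) - 1.0) < tol

    fair_die = {face: 1 / 6 for face in range(1, 7)}          # events 1..6, equal weight
    print(is_probability_distribution(fair_die))              # True
    print(is_probability_distribution({"a": 0.7, "b": 0.2}))  # False: sums to 0.9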

  37. Example: Probability distribution • Suppose you have a die that is equally weighted on all sides. • Let X be the random variable for the outcome of a single roll. p(X=1 dot) = 1 / 6 p(X=2 dots) = 1 / 6 p(X=3 dots) = 1 / 6 p(X=4 dots) = 1 / 6 p(X=5 dots) = 1 / 6 p(X=6 dots) = 1 / 6

  38. Probability estimation • Suppose you have a die and you don’t know how it is weighted. • Let X be the random variable for the outcome of a roll. • Want to produce values for p̂(X), which is an estimate of the probability distribution of X. • Read as “p-hat” • Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.

  39. Example: roll a die; random variable X • Data: roll a die 60 times, record the frequency of each event • 1 dot 9 rolls • 2 dots 10 rolls • 3 dots 9 rolls • 4 dots 12 rolls • 5 dots 9 rolls • 6 dots 11 rolls

  40. Example: roll a die; random variable X • Maximum Likelihood Estimate: p̂(X=x) = count(x) / total_count_of_all_events • p̂( X = 1 dot) = 9 / 60 = 0.150 p̂( X = 2 dots) = 10 / 60 = 0.166 p̂( X = 3 dots) = 9 / 60 = 0.150 p̂( X = 4 dots) = 12 / 60 = 0.200 p̂( X = 5 dots) = 9 / 60 = 0.150 p̂( X = 6 dots) = 11 / 60 = 0.183 Sum = 60 / 60 = 1.0
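The same estimate in code, as a minimal sketch: MLE is just relative frequency, so the whole computation is one dictionary comprehension over the counts from slide 39.

    counts = {"1 dot": 9, "2 dots": 10, "3 dots": 9,
              "4 dots": 12, "5 dots": 9, "6 dots": 11}

    total = sum(counts.values())                        # 60 trials
    p_hat = {x: c / total for x, c in counts.items()}   # MLE: count(x) / total count

    for x, p in p_hat.items():
        print(f"p-hat(X = {x}) = {counts[x]}/{total} = {p:.3f}")
    print("sum of estimates:", sum(p_hat.values()))     # 1.0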

  41. Convergence of p̂(X) • Suppose we know that the die is equally weighted. • We observe that our values for p̂(X) are close to p(X), but not all exactly equal. • We would expect that as the number of trials increases, p̂(X) will get closer to p(X). • For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.
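This convergence is easy to see by simulation; the sketch below rolls a simulated fair die with Python's random module (an illustration, not part of the lecture) and watches the estimate for one outcome approach 1/6 ≈ 0.167.

    import random
    from collections import Counter

    random.seed(0)                          # fixed seed so the run is repeatable
    true_p = 1 / 6

    for n_trials in (60, 6_000, 600_000):
        rolls = Counter(random.randint(1, 6) for _ in range(n_trials))
        p_hat = rolls[4] / n_trials         # estimated probability of rolling a 4
        print(f"{n_trials:7d} rolls: p-hat(4 dots) = {p_hat:.4f}   (true p = {true_p:.4f})")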

  42. Simplify notation • People are often not precise, and write "p(X)" when they mean "p̂(X)" • We will do this also • Can also leave out the name of the random variable when it is understood • Example: p(X = 4 dots) can be written as p(4 dots)

  43. Outline • Word frequencies and Zipf’s law • N-grams and sparse data • Probability theory I • Smoothing I

  44. Don’t use MLE for n-gram probabilities • Maximum Likelihood Estimation applied to n-grams: • count(X) = observed frequency of X in a corpus • p(X) = observed probability of X in a corpus • However, a corpus is only a sample of the language • Does not contain all possible words, phrases, constructions, etc. • Linguistically possible but nonexistent items in a corpus are assigned zero probability with MLE • Zero probability means impossible • But they are not impossible… • Need to revise probability estimate

  45. Smoothing • Take probability mass away from observed items, and assign to zero-count items • Solid line: observed counts • Dotted line: smoothed counts

  46. Smoothing methods 1. Add-one smoothing 2. Deleted estimation 3. Good-Turing smoothing (will see later) 4. Witten-Bell smoothing 5. Backoff smoothing 6. Interpolated backoff

  47. Add-one smoothing (Laplace estimator) • Simplest method • Example: smoothing N-grams • First, determine how many N-grams there are (for some value of N) • The frequencies of N-grams that are observed in the training data are not modified • For all other N-grams, assign a small constant value, such as 0.05

  48. Probability distribution changes after add-one smoothing • Since zero-count N-grams now have a nonzero frequency, the probability mass assigned to observed N-grams decreases • Example: 5 possible items • Original counts: [5, 3, 2, 0, 0] • Probability distribution: [.5, .3, .2, 0.0, 0.0] • Smoothed counts, add 0.05: [5, 3, 2, .05, .05] • Smoothed probabilities: [.495, .297, .198, .005, .005] • Probabilities now nonzero • Probabilities are discounted
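The slide's five-item example can be reproduced directly; the sketch below follows the variant described on slides 47-48 (observed counts left unchanged, zero counts replaced by a small constant such as 0.05), and the helper name is invented for illustration.

    def smooth_zero_counts(counts, epsilon=0.05):
        """Give zero-count items a small constant count, then renormalize."""
        smoothed = [c if c > 0 else epsilon for c in counts]
        total = sum(smoothed)
        return smoothed, [c / total for c in smoothed]

    original = [5, 3, 2, 0, 0]
    smoothed_counts, probs = smooth_zero_counts(original)
    print(smoothed_counts)                # [5, 3, 2, 0.05, 0.05]
    print([round(p, 3) for p in probs])   # [0.495, 0.297, 0.198, 0.005, 0.005]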

  49. Add-one smoothing (Laplace estimator) • Advantage: very simple technique • Disadvantages: • Too much probability mass can be assigned to zero-count n-grams. This is because there can be a huge number of zero-count n-grams. • Example: • Suppose you smooth word bigrams in the Brown corpus (1.2 million words), and use a smoothed frequency of .05. • The 495,000 different observed bigrams have a total smoothed count of 1,200,000. • The 2.5 billion unobserved bigrams have a total smoothed count of .05 × 2,500,000,000 = 125,000,000. • The probability mass assigned to unobserved bigrams is 125,000,000 / (125,000,000 + 1,200,000) = 99% !!!
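The back-of-the-envelope numbers on this slide check out; here is the same calculation spelled out (the slide rounds 2.5 billion − 495,000 up to 2.5 billion, which does not change the conclusion).

    observed_bigram_tokens = 1_200_000      # total count of observed bigrams (~corpus size)
    observed_bigram_types = 495_000
    possible_bigrams = 50_000 ** 2          # 2.5 billion logically possible bigrams
    epsilon = 0.05

    unseen_count = (possible_bigrams - observed_bigram_types) * epsilon
    unseen_share = unseen_count / (unseen_count + observed_bigram_tokens)
    print(f"smoothed count given to unseen bigrams: {unseen_count:,.0f}")      # ~125 million
    print(f"share of probability mass on unseen bigrams: {unseen_share:.1%}")  # ~99%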
