
Lecture 3 Ngrams

This lecture covers the topic of N-gram models for prediction in Natural Language Processing using Python and NLTK. It includes a discussion on smoothing techniques and provides relevant readings for further understanding.

Presentation Transcript


  1. Lecture 3: N-grams CSCE 771 Natural Language Processing • Topics • Python • NLTK • N-grams • Smoothing • Readings: • Chapter 4 – Jurafsky and Martin • January 23, 2013

  2. Last Time • Slides from Lecture 1 (slide 30 onward) • Regular expressions in Python (grep, vi, emacs, word)? • Eliza • Morphology • Today • N-gram models for prediction

  3. Eliza.py • https://github.com/nltk/nltk/blob/master/nltk/chat/eliza.py • A list of (regular expression, response patterns) pairs • If the regular expression matches the input • Then respond with one of the paired patterns • pairs = ( •   (r'I need (.*)', •   ( "Why do you need %1?", •     "Would it really help you to get %1?", •     "Are you sure you need %1?")), •   (r'Why don\'t you (.*)', •   ( "Do you really think I don't %1?", •     "Perhaps eventually I will %1.", •     "Do you really want me to %1?")),
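To see how these pairs are consumed, here is a minimal sketch using NLTK's nltk.chat.util.Chat class (the same machinery eliza.py builds on); the two pairs are copied from the slide, and the example input is invented for illustration.

```python
# Minimal sketch: feeding regex/response pairs to NLTK's Chat class.
# Chat compiles each regex case-insensitively; %1 in a response is replaced
# by the text captured by (.*), after pronoun "reflections" such as
# "my" -> "your" are applied.
from nltk.chat.util import Chat, reflections

pairs = (
    (r'I need (.*)',
     ("Why do you need %1?",
      "Would it really help you to get %1?",
      "Are you sure you need %1?")),
    (r'Why don\'t you (.*)',
     ("Do you really think I don't %1?",
      "Perhaps eventually I will %1.",
      "Do you really want me to %1?")),
    # eliza.py ends with a catch-all (r'(.*)', ...) pair so every input
    # gets some response; omitted here for brevity.
)

chatbot = Chat(pairs, reflections)
print(chatbot.respond("I need a better grade"))
# picks one of the three responses at random, e.g.
# "Are you sure you need a better grade?"
```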

  4. http://nltk.org/book/ • Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit. Steven Bird, Ewan Klein, and Edward Loper. • Preface (extras) 1. Language Processing and Python (extras) 2. Accessing Text Corpora and Lexical Resources (extras) 3. Processing Raw Text 4. Writing Structured Programs (extras) 5. Categorizing and Tagging Words 6. Learning to Classify Text (extras) 7. Extracting Information from Text 8. Analyzing Sentence Structure (extras) 9. Building Feature Based Grammars 10. Analyzing the Meaning of Sentences (extras) 11. Managing Linguistic Data 12. Afterword: Facing the Language Challenge nltk.org/book

  5. Language Processing and Python • >>> from nltk.book import * • *** Introductory Examples for the NLTK Book *** • Loading text1, ..., text9 and sent1, ..., sent9 • Type the name of the text or sentence to view it. • Type: 'texts()' or 'sents()' to list the materials. • text1: Moby Dick by Herman Melville 1851 • text2: Sense and Sensibility by Jane Austen 1811 • text3: The Book of Genesis • text4: Inaugural Address Corpus • … nltk.org/book

  6. Simple text processing with NLTK • >>> text1.concordance("monstrous") • >>> text1.similar("monstrous") • >>> text2.common_contexts(["monstrous", "very"]) • >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) • >>> text3.generate() • >>> text5[16715:16735] nltk.org/book

  7. Counting Vocabulary • >>> len(text3) • >>> sorted(set(text3)) • >>> from __future__ import division • >>> len(text3) / len(set(text3)) • >>> text3.count("smote") nltk.org/book

  8. lexical_diversity • >>> def lexical_diversity(text): • ... return len(text) / len(set(text)) • ... • >>> def percentage(count, total): • ... return 100 * count / total • ... nltk.org/book

  9. 1.3   Computing with Language: Simple Statistics • Frequency Distributions • >>> fdist1 = FreqDist(text1) • >>> fdist1 <FreqDist with 260819 outcomes> • >>> vocabulary1 = fdist1.keys() • >>> vocabulary1[:50] • >>> fdist1['whale'] • >>> V = set(text1) • >>> long_words = [w for w in V if len(w) > 15] • >>> sorted(long_words) nltk.org/book

  10. List constructors in Python • >>> V = set(text1) • >>> long_words = [w for w in V if len(w) > 15] • >>> sorted(long_words) • >>> fdist5 = FreqDist(text5) • >>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]) nltk.org/book

  11. Collocations and Bigrams • >>> bigrams(['more', 'is', 'said', 'than', 'done']) • [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')] • >>> text4.collocations() • Building collocations list • United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money nltk.org/book

  12. Table 1.2 nltk.org/book

  13. Quotes from Chapter 4 • But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. • Noam Chomsky, 1969 • http://www.chomsky.info/ • Anytime a linguist leaves the group the recognition rate goes up. • Fred Jelinek (then of the IBM speech group) • SLP – Jurafsky and Martin for the rest of the day

  14. Predicting Words • Please turn your homework … • What is the next word? • Language models: N-gram models

  15. Word/Character Prediction Uses • Spelling correction (at the character level) • Spelling correction (at the word level), when the corrector corrects to a real but wrong word • Augmentative communication – a person with a disability chooses words from a menu predicted by the system

  16. Real-Word Spelling Errors • Mental confusions • Their/they’re/there • To/too/two • Weather/whether • Peace/piece • You’re/your • Typos that result in real words

  17. Spelling Errors that are Words • Typos • Context • Left context • Right context

  18. Real Word Spelling Errors • Collect a set of common pairs of confusions • Whenever a member of this set is encountered compute the probability of the sentence in which it appears • Substitute the other possibilities and compute the probability of the resulting sentence • Choose the higher one
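A hedged sketch of this procedure in Python, under the assumption that a bigram (or other N-gram) language model is already available through a sentence_logprob(words) function, a hypothetical helper not defined on these slides: for each word that belongs to a confusion set, every alternative is substituted and the highest-probability sentence is kept.

```python
# Sketch of real-word spelling correction with confusion sets.
# Assumes sentence_logprob(words) returns the log probability of a word
# sequence under some N-gram model (hypothetical helper, not shown here).
CONFUSION_SETS = [
    {"their", "there", "they're"},
    {"to", "too", "two"},
    {"weather", "whether"},
    {"peace", "piece"},
    {"your", "you're"},
]

def correct_real_word_errors(words, sentence_logprob):
    words = list(words)
    for i, w in enumerate(words):
        for conf in CONFUSION_SETS:
            if w.lower() in conf:
                # score the sentence with each member of the confusion set
                scored = []
                for alt in conf:
                    trial = words[:i] + [alt] + words[i + 1:]
                    scored.append((sentence_logprob(trial), trial))
                words = max(scored)[1]   # keep the most probable variant
    return words
```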

  19. Word Counting • Probability based on counting • He stepped out into the hall, was delighted to encounter a water brother. (from the Brown corpus) • Words? • Bi-grams • Frequencies of words, but what words? • Corpora? • Web – everything on it • Shakespeare • Bible/Koran • Spoken transcripts (Switchboard) • Problems with spoken speech: "uh", "um" fillers

  20. 6.2 Bigrams from Berkeley Restaurant Proj. • Berkeley Restaurant Project – a speech based restaurant consultant • Handling requests: • I’m looking for Cantonese food. • I’m looking for a good place to eat breakfast.

  21. Chain Rule • Recall the definition of conditional probability: P(A | B) = P(A, B) / P(B) • Rewriting: P(A, B) = P(A | B) P(B) • Or, for two words: P(w1 w2) = P(w1) P(w2 | w1) • Or, extended to a whole sequence: P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1)

  22. Example • The big red dog • P(The)*P(big|the)*P(red|the big)*P(dog|the big red) • Better P(The| <Beginning of sentence>) written as • P(The | <S>)

  23. General Case • The word sequence from position 1 to n is written w1^n = w1 w2 … wn • So the probability of a sequence is P(w1^n) = P(w1) P(w2 | w1) P(w3 | w1^2) … P(wn | w1^(n-1)) = ∏ k=1..n P(wk | w1^(k-1))

  24. Unfortunately • That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.

  25. Markov Assumption • Assume that the entire prefix history isn't necessary. • In other words, an event doesn't depend on all of its history, just on a fixed-length recent history.

  26. Markov Assumption • So for each component in the product, replace it with its approximation (assuming an N-gram model, i.e. a prefix of N-1 words): P(wn | w1^(n-1)) ≈ P(wn | w(n-N+1)^(n-1))

  27. Maximum Likelihood Estimation • Maximum Likelihood Estimation (MLE) – a method to estimate the probabilities of the N-gram models • Normalize counts from a corpus, e.g. for bigrams: P(wn | wn-1) = C(wn-1 wn) / C(wn-1) • A small NLTK sketch of this estimate follows below.
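A minimal sketch of the MLE bigram estimate with NLTK, using the Genesis corpus purely as an illustration (the BERP corpus itself is not distributed with NLTK); FreqDist.freq() divides a bigram count by the total count of its condition, which is exactly the normalization above.

```python
# MLE bigram estimate: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
import nltk
from nltk.corpus import genesis   # may require nltk.download('genesis')

words = [w.lower() for w in genesis.words('english-kjv.txt')]

# ConditionalFreqDist over (previous word, next word) pairs
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

# freq() normalizes the bigram count by the count of the conditioning word
print(cfd['living'].freq('creature'))   # P(creature | living)
print(cfd['the'].most_common(5))        # five most likely words after "the"
```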

  28. N-Grams: The big red dog • Unigrams: P(dog) • Bigrams: P(dog|red) • Trigrams: P(dog|big red) • Four-grams: P(dog|the big red) • In general, we’ll be dealing with • P(Word| Some fixed prefix)

  29. Caveat • The formulation P(Word| Some fixed prefix) is not really appropriate in many applications. • It is if we’re dealing with real time speech where we only have access to prefixes. • But if we’re dealing with text we already have the right and left contexts. There’s no a priori reason to stick to left contexts only.

  30. BERP Table: Bigram Counts (Fig. 4.1) • Then we can normalize by dividing each row by the unigram counts.

  31. BERP Table: Bigram Probabilities

  32. Example • For this example: • P(I | <s>) = .25 • P(food | english) = .5 • P(english | want) = 0.0011 • P(</s> | food) = .68 • Now consider “<s> I want English food </s>” • P(<s> I want English food </s>) = P(I | <s>) P(want | I) P(english | want) P(food | english) P(</s> | food)

  33. An Aside on Logs • You don't really do all those multiplies. The numbers are too small and lead to underflows. • Convert the probabilities to logs and then do additions. • To get the real probability (if you need it), take the antilog.
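A tiny sketch of the log-space trick, reusing the BERP-style bigram probabilities quoted on the surrounding slides:

```python
import math

# bigram probabilities from the nearby slides (illustrative values)
probs = [0.25, 0.32, 0.0011, 0.5, 0.68]

logprob = sum(math.log(p) for p in probs)   # additions instead of multiplications
print(logprob)            # log probability of the sentence
print(math.exp(logprob))  # antilog recovers the (very small) probability
```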

  34. Some Observations • The following numbers are very informative. Think about what they capture. • P(want|I) = .32 • P(to|want) = .65 • P(eat|to) = .26 • P(food|Chinese) = .56 • P(lunch|eat) = .055

  35. Some More Observations • P(I | I) • P(want | I) • P(I | food) • Generated word sequences reflecting these bigrams: • I I I want • I want I want to • The food I want is

  36. Generation • Choose N-Grams according to their probabilities and string them together

  37. BERP • Sampled bigrams chained together: • I want • want to • to eat • eat Chinese • Chinese food • food .

  38. Some Useful Observations • A small number of events occur with high frequency • You can collect reliable statistics on these events with relatively small samples • A large number of events occur with small frequency • You might have to wait a long time to gather statistics on the low frequency events

  39. Some Useful Observations • Some zeroes are really zeroes • Meaning that they represent events that can’t or shouldn’t occur • On the other hand, some zeroes aren’t really zeroes • They represent low frequency events that simply didn’t occur in the corpus

  40. Shannon’s Method • Sentences randomly generated based on the probability models (N-gram models) • Sample a random bigram (<s>, w) according to its probability • Now sample a random bigram (w, x) according to its probability • where its prefix w matches the suffix of the previous bigram. • And so on, until we randomly choose a (y, </s>) • Then string the words together • <s> I • I want • want to • to eat • eat Chinese • Chinese food • food </s> Slide from: Speech and Language Processing, Jurafsky and Martin
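A hedged sketch of Shannon's method for a bigram model, assuming a ConditionalFreqDist cfd built over bigrams of a corpus whose sentences were padded with '<s>' and '</s>' markers (the padding step itself is not shown here):

```python
import random

def shannon_generate(cfd, maxlen=20):
    """Chain random bigrams (w, x), each sampled in proportion to its count."""
    word, out = '<s>', []
    for _ in range(maxlen):
        dist = cfd[word]           # FreqDist of words observed after `word`
        if not dist:               # no observed continuation
            break
        word = random.choices(list(dist.keys()),
                              weights=list(dist.values()))[0]
        if word == '</s>':         # end-of-sentence marker reached
            break
        out.append(word)
    return ' '.join(out)
```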

  41. Shannon’s method applied to Shakespeare

  42. Shannon applied to Wall Street Journal

  43. Evaluating N-grams: Perplexity • Training set • Test set: W = w1 w2 … wN • Perplexity (PP) is a measure of how good a model is: • PP(W) = P(w1 w2 … wN)^(-1/N) • Higher probability → lower perplexity • Wall Street Journal perplexities of models
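A small sketch of the perplexity computation, done in log space as recommended above; the per-word probabilities passed in would come from whatever N-gram model is being evaluated:

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1 ... wN)^(-1/N), computed via logs to avoid underflow."""
    N = len(word_probs)
    logprob = sum(math.log(p) for p in word_probs)
    return math.exp(-logprob / N)

# toy example with the bigram probabilities quoted earlier
print(perplexity([0.25, 0.32, 0.0011, 0.5, 0.68]))
```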

  44. Unknown words: Open versus Closed Vocabularies • <UNK> unrecognized word token

  45. Google words visualization http://googlesystem.blogspot.com/2008/05/using-googles-n-gram-corpus.html

  46. Problem • Let’s assume we’re using N-grams • How can we assign a probability to a sequence where one of the component n-grams has a value of zero • Assume all the words are known and have been seen • Go to a lower order n-gram • Back off from bigrams to unigrams • Replace the zero with something else
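One way to picture the "back off from bigrams to unigrams" option is the sketch below; it is only an illustration of the idea (without discounting the result is not a properly normalized distribution, which is what the smoothing slides address next). It assumes a bigram ConditionalFreqDist cfd and a unigram FreqDist fd.

```python
def backoff_prob(cfd, fd, prev, word):
    """Use the bigram MLE when its count is non-zero, else fall back to the unigram."""
    if cfd[prev][word] > 0:
        return cfd[prev][word] / cfd[prev].N()   # bigram estimate
    return fd[word] / fd.N()   # unigram fallback (still 0 for truly unknown words)
```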

  47. Smoothing • Smoothing – re-evaluating some of the zero and low-probability N-grams and assigning them non-zero values • Add-One (Laplace) • Add one to every count, so the zero counts become 1. • Rationale: they're just events you haven't seen yet. If you had seen them, chances are you would only have seen them once… so make the count equal to 1.

  48. Add-One Smoothing • Terminology • N – total number of words • V – vocabulary size == number of distinct words • Maximum Likelihood estimate: P(wi) = ci / N

  49. Adjusted Counts “C*” • Terminology • N – total number of words • V – vocabulary size == number of distinct words • Adjusted count: ci* = (ci + 1) N / (N + V) • Adjusted probabilities: Pi* = (ci + 1) / (N + V)
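A minimal sketch of these add-one quantities in Python, for the bigram case, assuming counts are held in a ConditionalFreqDist cfd and V is the vocabulary size:

```python
def laplace_bigram_prob(cfd, V, prev, word):
    """P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)"""
    return (cfd[prev][word] + 1) / (cfd[prev].N() + V)

def adjusted_count(c, N, V):
    """c* = (c + 1) * N / (N + V): the smoothed count mapped back to the count scale."""
    return (c + 1) * N / (N + V)
```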

  50. Discounting • Discounting – lowering some of the larger non-zero counts to get the “probability” to assign to the zero entries • dc – the discounted counts • The discounted probabilities can then be directly calculated
