
SIMS 290-2: Applied Natural Language Processing


Presentation Transcript


  1. SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 15, 2004

  2. Class Pace and Schedule • Need a foundation before you can do anything interesting. • Tokenizing, Tagging, Regex’s • Text Classification Principles and Techniques • Training vs. Testing, processing corpora • Through (approximately) the 6th week, keep doing exercises from the NLTK tutorials to build that foundation. • 2 more homeworks • I’m trying to make them bite-sized pieces • Weeks 7–10: Group miniproject on the Enron corpus • Will involve classification or Information Extraction • Different groups will do different things • May have a homework within this timeframe • Weeks 11–15: Another miniproject • Either on the Enron project or a topic of your choice • I will suggest ideas; you can propose them too • May also have 1–2 other homeworks in this timeframe

  3. Language Modeling • A fundamental concept in NLP • Main idea: • For a given language, some words are more likely than others to follow each other, or • You can predict (with some degree of accuracy) the probability that a given word will follow another word. • Illustration: • Distributions of words in the class-participation exercise.

  4. Next Word Prediction • From a NY Times story... • Stocks ... • Stocks plunged this …. • Stocks plunged this morning, despite a cut in interest rates • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ... • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began Adapted from slide by Bonnie Dorr

  5. Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last … • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks. Adapted from slide by Bonnie Dorr

  6. Human Word Prediction • Clearly, at least some of us have the ability to predict future words in an utterance. • How? • Domain knowledge • Syntactic knowledge • Lexical knowledge Adapted from slide by Bonnie Dorr

  7. Claim • A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques • In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence) Adapted from slide by Bonnie Dorr

  8. Applications • Why do we want to predict a word, given some preceding words? • Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR Theatre owners say popcorn/unicorn sales have doubled... • Assess the likelihood/goodness of a sentence • for text generation or machine translation. The doctor recommended a cat scan. El doctor recommendó una exploración del gato. Adapted from slide by Bonnie Dorr

  9. N-Gram Models of Language • Use the previous N-1 words in a sequence to predict the next word • Language Model (LM) • unigrams, bigrams, trigrams,… • How do we train these models? • Very large corpora Adapted from slide by Bonnie Dorr

  10. Simple N-Grams • Assume a language has V word types in its lexicon; how likely is word x to follow word y? • Simplest model of word probability: 1/V • Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability): popcorn is more likely to occur than unicorn • Alternative 2: condition the likelihood of x occurring on the previous words (bigrams, trigrams, …): mythical unicorn is more likely than mythical popcorn Adapted from slide by Bonnie Dorr
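The following sketch (mine, not from the slides) contrasts the three estimates on slide 10 using a tiny made-up corpus; the corpus and word choices are purely illustrative.

from collections import Counter

corpus = ("the mythical unicorn ate popcorn . "
          "popcorn is tasty . the mythical unicorn slept .").split()

vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Simplest model: every word type is equally likely.
p_uniform = 1 / len(vocab)

# Alternative 1: unigram probability from general frequency of occurrence.
p_unigram = unigrams["popcorn"] / len(corpus)

# Alternative 2: bigram probability, conditioned on the previous word.
p_bigram = bigrams[("mythical", "unicorn")] / unigrams["mythical"]

print(p_uniform, p_unigram, p_bigram)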

  11. A Word on Notation • P(unicorn) • Read this as “the probability of seeing the token unicorn.” • The unigram tagger uses this. • P(unicorn|mythical) • Called the Conditional Probability. • Read this as “the probability of seeing the token unicorn given that you’ve seen the token mythical.” • The bigram tagger uses this. • Related to the conditional frequency distributions that we’ve been working with.
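As a rough illustration of the connection to conditional frequency distributions, here is a sketch using the modern NLTK API (which differs from the 2004 NLTK used in this course); the Brown corpus and the word pair are just examples, and the estimate may well come out zero.

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

words = [w.lower() for w in brown.words()]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

# Relative-frequency estimate of P(unicorn | mythical)
print(cfd["mythical"].freq("unicorn"))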

  12. Computing the Probability of a Word Sequence • Compute the product of component conditional probabilities? • P(the mythical unicorn) = P(the) P(mythical|the) P(unicorn|the mythical) • The longer the sequence, the less likely we are to find it in a training corpus P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal) • Solution: approximate using n-grams Adapted from slide by Bonnie Dorr

  13. Bigram Model • Approximate P(unicorn|the mythical) by P(unicorn|mythical) • Markov assumption: • The probability of a word depends only on a limited history • Generalization: • The probability of a word depends only on the n previous words • trigrams, 4-grams, … • the higher n is, the more data needed to train • backoff models Adapted from slide by Bonnie Dorr
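A minimal sketch (my own, not from the slides) of what the Markov assumption buys: an n-gram model keeps only the last n-1 words of the history as its conditioning context.

def ngram_context(history, n):
    # Return the context an n-gram model would condition on.
    return tuple(history[-(n - 1):]) if n > 1 else ()

history = ["the", "mythical"]
print(ngram_context(history, 2))   # ('mythical',)        -> bigram model
print(ngram_context(history, 3))   # ('the', 'mythical')  -> trigram model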

  14. Using N-Grams • For N-gram models: P(w_{n-1}, w_n) = P(w_n | w_{n-1}) P(w_{n-1}) • By the Chain Rule we can decompose a joint probability, e.g. P(w_1, w_2, w_3): P(w_1, w_2, ..., w_n) = P(w_1 | w_2, ..., w_n) P(w_2 | w_3, ..., w_n) … P(w_{n-1} | w_n) P(w_n) • For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams: P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>) Adapted from slide by Bonnie Dorr
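A hedged sketch of the bigram decomposition on slide 14: the probability of a sequence is the product of the conditional probabilities of its bigrams. The probability table here is hypothetical, not from BeRP.

from functools import reduce

p_bigram = {
    ("<start>", "the"): 0.25,
    ("the", "mythical"): 0.01,
    ("mythical", "unicorn"): 0.30,
}

def sequence_prob(words, p_bigram):
    pairs = zip(["<start>"] + words, words)   # (<start>,the), (the,mythical), ...
    return reduce(lambda acc, pair: acc * p_bigram[pair], pairs, 1.0)

print(sequence_prob(["the", "mythical", "unicorn"], p_bigram))   # 0.00075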

  15. Training and Testing • N-Gram probabilities come from a training corpus • overly narrow corpus: probabilities don't generalize • overly general corpus: probabilities don't reflect task or domain • A separate test corpus is used to evaluate the model, typically using standard metrics • held out test set; development test set • cross validation • results tested for statistical significance Adapted from slide by Bonnie Dorr
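A minimal sketch of a held-out split (the data and the 90/10 proportion are illustrative assumptions): train on one portion of the corpus and reserve the rest for evaluation.

import random

sentences = list(range(1000))   # stand-in for a corpus of (tagged) sentences
random.seed(0)
random.shuffle(sentences)

cut = int(0.9 * len(sentences))
train_set, test_set = sentences[:cut], sentences[cut:]
# Estimate n-gram probabilities from train_set only; report metrics on test_set.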

  16. A Simple Example • From BeRP: The Berkeley Restaurant Project • A testbed for a Speech Recognition project • System prompts the user for information in order to fill in slots in a restaurant database. • Type of food, hours open, how expensive • After getting lots of input, can compute how likely it is that someone will say X given that they already said Y. P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) Adapted from slide by Bonnie Dorr

  17. A Bigram Grammar Fragment from BeRP
Eat on .16          Eat Thai .03
Eat some .06        Eat breakfast .03
Eat lunch .06       Eat in .02
Eat dinner .05      Eat Chinese .02
Eat at .04          Eat Mexican .02
Eat a .04           Eat tomorrow .01
Eat Indian .04      Eat dessert .007
Eat today .03       Eat British .001
Adapted from slide by Bonnie Dorr

  18. A Bigram Grammar Fragment from BeRP (continued)
<start> I .25       Want some .04
<start> I’d .06     Want Thai .01
<start> Tell .04    To eat .26
<start> I’m .02     To have .14
I want .32          To spend .09
I would .29         To be .02
I don’t .08         British food .60
I have .04          British restaurant .15
Want to .65         British cuisine .01
Want a .05          British lunch .01
Adapted from slide by Bonnie Dorr

  19. P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 ≈ .0000081 • vs. P(I want to eat Chinese food) ≈ .00015 • Probabilities seem to capture “syntactic” facts and “world knowledge” • eat is often followed by an NP • British food is not too popular • N-gram models can be trained by counting and normalization Adapted from slide by Bonnie Dorr
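Multiplying out the bigram probabilities quoted above (the Chinese-food figures come from slides 17 and 20) confirms the two sentence probabilities:

probs_british = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]
probs_chinese = [0.25, 0.32, 0.65, 0.26, 0.02, 0.56]

p_british = 1.0
for p in probs_british:
    p_british *= p

p_chinese = 1.0
for p in probs_chinese:
    p_chinese *= p

print(p_british)   # ~8.1e-06
print(p_chinese)   # ~1.5e-04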

  20. What do we learn about the language? • What's being captured with ... • P(want | I) = .32 • P(to | want) = .65 • P(eat | to) = .26 • P(food | Chinese) = .56 • P(lunch | eat) = .055 • What about... • P(I | I) = .0023 • P(I | want) = .0025 • P(I | food) = .013 Adapted from slide by Bonnie Dorr

  21. Tagging with lexical frequencies • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race given its lexical frequency • Solution: choose the tag with the greater lexical likelihood • P(race|VB): probability of the word “race” given the tag VB • P(race|NN): probability of the word “race” given the tag NN • Actual estimates from the Switchboard corpus: • P(race|NN) = .00041 • P(race|VB) = .00003 Modified from Massimo Poesio's lecture
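A small sketch of that comparison (the dictionary layout and function are mine; the numbers are the Switchboard estimates quoted on the slide). Note that lexical frequency alone favors NN even in “to race”, which is exactly the weakness the Brill tagger addresses later.

lexical = {
    ("race", "NN"): 0.00041,   # P(race | NN)
    ("race", "VB"): 0.00003,   # P(race | VB)
}

def pick_tag(word, candidate_tags, lexical):
    return max(candidate_tags, key=lambda tag: lexical.get((word, tag), 0.0))

print(pick_tag("race", ["NN", "VB"], lexical))   # NN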

  22. Combining Taggers • Use more accurate algorithms when we can; back off to wider-coverage ones when needed. • Try tagging the token with the 1st-order tagger. • If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order tagger. • If the 0th-order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag. Modified from Diane Litman's version of Steve Bird's notes

  23. BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
# Construct the taggers
>>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')  # 1st order
>>> tagger2 = UnigramTagger()                       # 0th order
>>> tagger3 = NN_CD_Tagger()                        # fallback: CD for numbers, NN otherwise
# Train the taggers (the fallback tagger needs no training)
>>> for tok in train_toks:
...     tagger1.train(tok)
...     tagger2.train(tok)
Modified from Diane Litman's version of Steve Bird's notes

  24. Backoff (continued)
# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
# Use the combined tagger
>>> accuracy = tagger_accuracy(tagger, unseen_tokens)
Modified from Diane Litman's version of Steve Bird's notes
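For reference, a roughly equivalent backoff chain in the modern NLTK API (the 2004 classes above no longer exist); the corpus, split sizes, and fallback tag are illustrative assumptions.

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)
tagged_sents = list(brown.tagged_sents(categories="news"))
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:3500]

t0 = nltk.DefaultTagger("NN")                     # crude fallback, in the spirit of NN_CD_Tagger
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # 0th order
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # also conditions on the previous tag

print(t2.accuracy(test_sents))   # .evaluate(test_sents) on older NLTK versions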

  25. Rule-Based Tagger • The Linguistic Complaint • Where is the linguistic knowledge of a tagger? • Just a massive table of numbers • Aren’t there any linguistic insights that could emerge from the data? • Could instead use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun (see the sketch below). Modified from Diane Litman's version of Steve Bird's notes
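A tiny hypothetical example of such a handcrafted rule (my own illustration): after a determiner, force the tag to NN.

def determiner_noun_rule(tagged):
    # If the previous (possibly corrected) tag is DT, tag the current word NN.
    out = []
    for word, tag in tagged:
        if out and out[-1][1] == "DT":
            tag = "NN"
        out.append((word, tag))
    return out

print(determiner_noun_rule([("the", "DT"), ("race", "VB")]))
# [('the', 'DT'), ('race', 'NN')]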

  26. The Brill tagger • An example of TRANSFORMATION-BASED LEARNING • Very popular (freely available, works fairly well) • A SUPERVISED method: requires a tagged corpus • Basic idea: do a quick job first (using frequency), then revise it using contextual rules Slide modified from Massimo Poesio's

  27. Brill Tagging: In more detail • Start with simple (less accurate) rules…learn better ones from a tagged corpus • Tag each word initially with its most likely POS • Examine a set of transformations to see which most improves tagging decisions compared to the tagged corpus • Re-tag the corpus using the best transformation • Repeat until, e.g., performance no longer improves • Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text

  28. An example • Examples: • They are expected to race tomorrow. • The race for outer space. • Tagging algorithm: • Tag all uses of “race” as NN (the most likely tag in the Brown corpus) • They are expected to race/NN tomorrow • the race/NN for outer space • Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO: • They are expected to race/VB tomorrow • the race/NN for outer space Slide modified from Massimo Poesio's
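Here is a short sketch (my own illustration, not Brill's implementation) of applying that single transformation: change NN to VB when the previous tag is TO.

def apply_transformation(tagged, from_tag="NN", to_tag="VB", prev_tag="TO"):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

initial = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_transformation(initial))
# [('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]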

  29. First 20 Transformation Rules From: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging Eric Brill.  Computational Linguistics.  December, 1995.

  30. Transformation Rules for Tagging Unknown Words From: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging Eric Brill.  Computational Linguistics.  December, 1995.

  31. Additional issues • Most of the difference in performance between POS tagging algorithms depends on their treatment of UNKNOWN WORDS • Class-based N-grams Adapted from Massimo Poesio's

  32. Evaluating a Tagger • Tagged tokens – the original data • Untag (exclude) the data • Tag the data with your own tagger • Compare the original and new tags • Iterate over the two lists checking for identity and counting • Accuracy = fraction correct Modified from Diane Litman's version of Steve Bird's notes
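A minimal sketch of that evaluation loop (function and data names are mine): compare the original tags with the tagger's output token by token.

def compute_accuracy(gold, predicted):
    # Fraction of tokens whose predicted tag matches the original tag.
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold      = [("the", "DT"), ("race", "NN"), ("ended", "VBD")]
predicted = [("the", "DT"), ("race", "VB"), ("ended", "VBD")]
print(compute_accuracy(gold, predicted))   # 0.666...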

  33. Assessing the Errors • Why the tuple method? Dictionaries cannot be keyed by lists (lists are mutable and unhashable), so convert the lists to tuples. • exclude returns a new token containing only the properties that are not named in the given list.
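A sketch of why tuples matter here (my own example): tally the errors in a dictionary keyed by (original tag, predicted tag) pairs; lists cannot serve as keys, but tuples can.

from collections import Counter

gold      = ["DT", "NN", "VBD", "NN"]
predicted = ["DT", "VB", "VBD", "NN"]

errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
print(errors)   # Counter({('NN', 'VB'): 1})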

  34. Assessing the Errors

  35. Upcoming • First assignment due 8pm tonight • Turn in on the course Assignments page • For next week: • Read the Chunking tutorial. • (The PDF version includes the images missing from the HTML version) • http://nltk.sourceforge.net/tutorial/chunking.pdf • We’ll have an assignment to get practice with this.
