LING / C SC 439/539 Statistical Natural Language Processing

LING / C SC 439/539 Statistical Natural Language Processing. Lecture 20, 4/1/2013. Recommended reading: Word Sense Disambiguation, Jurafsky & Martin 20.0-20.4.


Presentation Transcript

  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 20 • 4/1/2013

  2. Recommended reading • Word Sense Disambiguation • Jurafsky & Martin 20.0-20.4 • David Yarowsky. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proc. of ACL. • David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proc. of ACL. • Discuss again next week • Coreference resolution • Jurafsky & Martin 21.3 • Aria Haghighi & Dan Klein. 2009. Simple Coreference Resolution with Rich Syntactic and Semantic Features. Proc. of EMNLP. • Vincent Ng. 2010. Supervised Noun Phrase Coreference Research: The First Fifteen Years. Proc. of ACL. • Information extraction • Jurafsky & Martin Chapter 22

  3. Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction

  4. Generative probabilistic models • Assign a probability distribution over all possible outcomes of all variables • Make independence and conditional independence assumptions about the data • Otherwise, parameter estimation runs into the sparse data problem • Such assumptions are made according to one’s theory about the structure of the data

  5. Generative probabilistic models • Example: language model • Observed variables only • Sequence of words W • Generative model: Nth-order Markov model p(W)
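
As a rough illustration of the language-model example, the sketch below estimates a first-order Markov (bigram) model by relative frequency and scores a sentence. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions made for this example; no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate p(w_i | w_{i-1}) by relative frequency (no smoothing)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}

def sentence_prob(model, words):
    """p(W) as the product of bigram probabilities; unseen bigrams give 0."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= model.get((prev, cur), 0.0)
    return prob

# Hypothetical two-sentence corpus, for illustration only.
toy_corpus = [["the", "bank", "closed"], ["the", "river", "bank"]]
model = train_bigram(toy_corpus)
p_seen = sentence_prob(model, ["the", "bank", "closed"])
p_unseen = sentence_prob(model, ["bank", "the", "river"])
```

The zero probability for the unseen word order is the sparse data problem mentioned earlier; a practical model would smooth the estimates.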

  6. Generative models with hidden variables • Example: Naïve Bayes • Observed: vector of features X • Hidden: class variable C • Generative model: p(C, X)
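
A minimal sketch of the Naïve Bayes joint p(C, X), assuming the usual factorization p(C) times the product of p(x | C) over the observed features. The two classes, the feature names, and all probability values are hypothetical, hand-set numbers rather than estimates from a corpus.

```python
def naive_bayes_joint(c, features, prior, likelihood):
    """p(C=c, X=features) = p(c) * product of p(f | c) over the features."""
    p = prior[c]
    for f in features:
        p *= likelihood[c][f]
    return p

# Hypothetical parameters for two senses of "bank".
prior = {"money": 0.6, "river": 0.4}
likelihood = {
    "money": {"deposit": 0.30, "shore": 0.05},
    "river": {"deposit": 0.02, "shore": 0.40},
}

observed = ["shore"]
joint = {c: naive_bayes_joint(c, observed, prior, likelihood) for c in prior}
best_class = max(joint, key=joint.get)  # class with the largest joint probability
```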

  7. Generative models with hidden variables • Example: HMM (for POS tagging) • Observed: sequence of words W • Hidden: sequence of POS tags T • Generative model: p(W, T)
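
The joint p(W, T) of a bigram HMM can be sketched as a product of transition and emission probabilities. The tag set, the `<s>` start state, and the parameter values below are assumptions for illustration; the sketch also omits an end-of-sentence transition.

```python
def hmm_joint(words, tags, transition, emission):
    """p(W, T) = product over positions of p(t_i | t_{i-1}) * p(w_i | t_i)."""
    prob = 1.0
    prev = "<s>"  # assumed start-of-sentence state
    for w, t in zip(words, tags):
        prob *= transition[(prev, t)] * emission[(t, w)]
        prev = t
    return prob

# Hypothetical parameters; only the entries needed for the example are listed.
transition = {("<s>", "DT"): 0.8, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1}
emission = {("DT", "the"): 0.5, ("NN", "bass"): 0.01, ("VB", "bass"): 0.001}

words = ["the", "bass"]
p_noun = hmm_joint(words, ["DT", "NN"], transition, emission)
p_verb = hmm_joint(words, ["DT", "VB"], transition, emission)
```

Comparing `p_noun` and `p_verb` is classification by exhaustive search over two tag sequences; a real tagger would use the Viterbi algorithm instead.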

  8. Generative models with hidden variables • Example: PCFG • Observed: sentence S • Hidden: parse tree T • Generative model: p(S, T) • p(S, T) = product of rules to derive sentence S with phrase structure tree T from the start symbol
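
To make the product-of-rules definition concrete, here is a small sketch that multiplies the probabilities of the rules used in one derivation. The grammar fragment and its probabilities are hypothetical, and only the rules used by this derivation are listed.

```python
from math import prod  # available in Python 3.8+

# Hypothetical rule probabilities: (left-hand side, right-hand side) -> p(rule).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("John",)): 0.4,
    ("NP", ("computers",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("V", ("lost",)): 0.5,
}

# Rules that derive "John lost computers" from the start symbol S.
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("John",)),
    ("VP", ("V", "NP")),
    ("V", ("lost",)),
    ("NP", ("computers",)),
]

# p(S, T): product of the probabilities of all rules in the derivation.
p_tree = prod(rule_prob[rule] for rule in derivation)
```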

  9. Common problems for generative models • Parameter estimation • Estimate probabilities from a corpus • Calculate probability of an observation • E.g., probability of a sentence • Marginalize over hidden variables (an ambiguous observation may have multiple hidden structures) • Classification (also called decoding) • Find most likely hidden structure for observations

  10. Classification in generative models • Classification • Want to find most likely values for hidden variables given observations • Compute argmaxH p(H|O) • Use Bayes rule • Generative model defines p(O, H) • Use Bayes to obtain p(H|O) • argmaxH p(H|O) = argmaxH p(O|H) * p(H)

  11. Bayes rule and classification • Bayes rule: p(B | A) = p(A, B) / p(A) • Product rule: p(A, B) = p(A) * p(B | A) = p(B) * p(A | B) • argmaxB p(B | A) = argmaxB p(A|B) * p(B) • Compute posterior p(B|A) given the likelihood p(A|B) and the prior p(B) • Ignore the denominator p(A), since it is constant with respect to B
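
A small numeric check that dropping p(A) does not change the argmax. The prior and likelihood values are made-up numbers for a single fixed observation A; the category names are borrowed from the accent restoration example purely for illustration.

```python
# Hypothetical prior p(B) and likelihood p(A | B) for one fixed observation A.
prior = {"pêcheurs": 0.7, "pécheurs": 0.3}
likelihood = {"pêcheurs": 0.2, "pécheurs": 0.1}

# Product rule: p(A, B) = p(A | B) * p(B)
unnormalized = {b: likelihood[b] * prior[b] for b in prior}

# p(A) is the sum of p(A, B) over B; it is the same for every B.
evidence = sum(unnormalized.values())

# Bayes rule: p(B | A) = p(A, B) / p(A)
posterior = {b: v / evidence for b, v in unnormalized.items()}

best_without_normalizing = max(unnormalized, key=unnormalized.get)
best_with_normalizing = max(posterior, key=posterior.get)
```

Both selections agree, because dividing every score by the same positive constant preserves their order.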

  12. Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction

  13. Classification in generative models • Data described by a set of random variables • C ∈ { c1, …, cn }: the class to be predicted • F = f1, f2, … fn: feature variables • Classification: choose the most likely class given features • Generative model defines a joint distribution: p( C, F ) • Use Bayes rule to recover conditional prob.: p( C | F )

  14. Discriminative models of classification • Find a class C that maximizes p( C | F ) • C ∈ { c1, …, cn }: the class to be predicted • F = f1, f2, … fn: feature variables • Discriminative: directly model p( C | F ) • “Discriminative”: find out what distinguishes the classes • Compare to generative • First model p(C, F), then use Bayes rule to calculate p(F|C)*p(C), which is proportional to p( C | F ) and therefore has the same argmax • “Generative”: probability distribution of all of the data

  15. Popular discriminative classifiers • Decision List • Binary classification, single feature • Logistic Regression • Binary classification, vector of features • Maximum Entropy • Multiclass classification, vector of features • Conditional Random Field • Multiclass classification, sequential classifier, vectors of features • (SVM, Perceptron) • Discriminative, though not probabilistic

  16. Generative vs. discriminative classifiers • According to Bayes Rule, generative and discriminative classifiers should be equivalent: argmaxC p( C | F ) = argmaxC p( F | C ) * p( C ) = argmaxC p( F, C ) • Discriminative: argmaxC p( C | F ) • Generative: argmaxC p( F | C ) * p( C )

  17. Generative vs. discriminative: independence assumptions • Generative: model joint prob of classes and features: p(C, F) • Often have to make severe independence assumptions • e.g. Naïve Bayes: p(C, F) = p(C) * p(f1|C) * … * p(fn|C) • In discriminative classifiers where we model p(C|F), we don’t need to make such independence assumptions • We can use non-independent features without specifying their probabilistic dependencies • Example of dependent features: • “Current word begins with capital letter” • “Current word is all-caps”

  18. Generative vs. discriminative classifiers • Vapnik 1998: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step” • Model p( C | F ) directly • Don’t first define p(C, F), then use it to obtain p( C | F ) • Problem: additional difficulties with parameter estimation in the joint model

  19. Summary of probabilistic classifiers (joint = generative, conditional = discriminative)

  20. Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction

  21. Disambiguation problems • For these problems, the instance to be classified is ambiguous: • Accent restoration • Word sense disambiguation • Capitalization restoration • Binary classification problems

  22. Problem 1: accent restoration • Some languages, such as French and Spanish, are written with accents on characters, where these accents determine word identity • Text is sometimes typed without accents • Need to perform accent restoration to recover intended words • Example: … une famille des pecheurs • pêcheurs (fishermen) • pécheurs (sinners)

  23. Problem 2: capitalization restoration • Text is sometimes written in all capitals or all lower case, and needs disambiguation • “AIDS …” • disease or helpful tools? • Words at the beginning of a sentence are capitalized • “Bush …” • president or shrub?

  24. Problem 3: word sense disambiguation (WSD) • The bank on State Street • Possible meanings of “bank” • Sense 1: river bank • Sense 2: place for $$$ • Need word sense disambiguation • Given an ambiguous word, decide on its sense

  25. WSD is important in translation • Translation into Korean: • Iraq lost the battle. Ilakuka centwey ciessta. [Iraq] [battle] [lost] • John lost his computer. John-i computer-lul ilepelyessta. [John] [computer] [misplaced] • Semantic constraints: • lose1 (Agent, Patient: competition) <=> ciessta • lose2 (Agent, Patient: physobj) <=> ilepelyessta

  26. WSD is needed in speech synthesis (convert text to sound) • … slightly elevated lead levels • Sense 1: lead role (rhymes with seed) • Sense 2: lead mines (rhymes with bed) • The speaker produces too little bass • Sense 1: string bass (rhymes with vase) • Sense 2: sea bass (rhymes with pass)

  27. Word sense disambiguation • For a particular word, can its senses be distinguished? • First, need a set of senses to be predicted • WordNet: • Hierarchically organized database of senses for open-class words in English • http://www.cogsci.princeton.edu/~wn/

  28. Word senses in WordNet • Meanings of nouns, verbs, and adjectives are specified using a catalog of possible senses • Example: Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons, as well as major reductions and realignments of troops in central Europe -- also are being registered at the Pentagon. • Senses of aim: (1) Point or direct object, weapon, at something ... (2) Wish, purpose or intend to achieve something • Senses of register: (1) Enter into an official record (2) Be aware of, enter into someone’s consciousness (3) Indicate a measurement (4) Show in one’s face • Senses used in the example: aimed = wish, purpose or intend to achieve something; registered = enter into an official record

  29. Words can have many senses in WordNet; for WSD, let’s assume each word has 2 senses • The noun bass has 8 senses. 1. bass -- (the lowest part of the musical range) 2. bass, bass part -- (the lowest part in polyphonic music) 3. bass, basso -- (an adult male singer with the lowest voice) 4. sea bass, bass -- (the lean flesh of a saltwater fish of the family Serranidae) 5. freshwater bass, bass -- (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)) 6. bass, bass voice, basso -- (the lowest adult male singing voice) 7. bass -- (the member with the lowest range of a family of musical instruments) 8. bass -- (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes) • The adj bass has 1 sense. 1. bass, deep -- (having or denoting a low vocal or instrumental range; "a deep voice"; "a bass voice is lower than a baritone voice"; "a bass clarinet")

  30. Senseval-1 (1998): English, French, Italian WSD • http://www.senseval.org/ • 35 different words were tested • Table shows # of test instances per word

  31. How can we do WSD? • Disambiguate a word by looking at its context • Warren Weaver, 1949: “If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words […] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word.”

  32. Example: WSD through context • What does this word mean in each case? • The human hand consists of a broad palm with 5 digits, attached to the forearm by a joint called the wrist (carpus). • Neither the anatomy of the palm tree stems nor the conformation of their flowers, however, entitles them to any such high position in the vegetable hierarchy

  33. Can’t build a system by hand • Fernand Marty (1986, 1992) • French text-to-speech synthesis • Hand-formulated rules and heuristics • Rule: presence of deposit in the vicinity of bank indicates $$$ • Problem: lots and lots and lots and lots of rules would be needed

  34. Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction

  35. Decision List • David Yarowsky (1994, 1995) • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified

  36. Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features, to determine the probabilistically most likely sense for a word. • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) ← let’s say this has the highest probability

  37. Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem

  38. [Diagram: rule templates + annotated corpus → possible rules → statistics of usage → ranked rules]

  39. Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration

  40. Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)

  41. Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
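
The word-based templates above can be sketched as a function that instantiates rules around a target position. The template labels (`-1W`, `+1W`, `WIN`, `-2W,-1W`), the default window size, and the choice of the two preceding words as the fixed-offset pair are assumptions made for this example.

```python
def extract_rules(tokens, target_index, k=3):
    """Instantiate simple word-based rule templates around tokens[target_index]."""
    rules = []
    n = len(tokens)
    # Word immediately to the left and right of the target.
    if target_index > 0:
        rules.append(("-1W", tokens[target_index - 1]))
    if target_index + 1 < n:
        rules.append(("+1W", tokens[target_index + 1]))
    # Any word within a window of k words on either side.
    lo = max(0, target_index - k)
    hi = min(n, target_index + k + 1)
    for i in range(lo, hi):
        if i != target_index:
            rules.append(("WIN", tokens[i]))
    # One pair of words at fixed offsets: the two words before the target.
    if target_index >= 2:
        rules.append(("-2W,-1W", (tokens[target_index - 2], tokens[target_index - 1])))
    return rules

tokens = "une famille des pecheurs".split()
rules = extract_rules(tokens, target_index=3)
```

Lemma, part-of-speech, and word-class templates would follow the same pattern but need an external tagger or lexicon, so they are left out here.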

  42. Step 4a: Count frequency of rules for each category

  43. Step 4b: Turn rule frequencies into probabilities
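
Steps 4a and 4b can be sketched as counting how often each rule co-occurs with each category and then normalizing. The add-alpha smoothing used here is an assumption chosen to avoid zero probabilities; it is not necessarily the smoothing used in the original method, and the counts are toy data.

```python
from collections import Counter, defaultdict

def rule_probabilities(labeled_rules, categories, alpha=0.1):
    """Estimate p(category | rule) from (rule, category) pairs with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for rule, category in labeled_rules:
        counts[rule][category] += 1  # Step 4a: frequency of each rule per category
    probs = {}
    for rule, by_cat in counts.items():
        total = sum(by_cat.values()) + alpha * len(categories)
        # Step 4b: smoothed relative frequency
        probs[rule] = {c: (by_cat[c] + alpha) / total for c in categories}
    return probs

categories = ["pêcheurs", "pécheurs"]
# Toy observations: the rule ("-1W", "des") seen 3 times with one category, once with the other.
observations = [(("-1W", "des"), "pêcheurs")] * 3 + [(("-1W", "des"), "pécheurs")]
probs = rule_probabilities(observations, categories)
p_rule = probs[("-1W", "des")]
```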

  44. Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, the rule doesn’t distinguish: log( p(c1|rule) / p(c2|rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely: log( p(c1|rule) / p(c2|rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely: log( p(c1|rule) / p(c2|rule) ) < 0

  45. Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest

  46. Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list
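
Scoring and ranking can be sketched as follows: compute the log-likelihood ratio for each rule, record which category it favors, and sort by absolute value. The rule names and probabilities are hypothetical, and the sketch assumes the probabilities are already smoothed so that neither is zero.

```python
from math import log

def build_decision_list(rule_probs, c1, c2):
    """Return (abs log-likelihood ratio, rule, predicted category), best rules first."""
    entries = []
    for rule, p in rule_probs.items():
        llr = log(p[c1] / p[c2])
        # Positive ratio favors c1, negative favors c2 (ties go to c2 arbitrarily).
        predicted = c1 if llr > 0 else c2
        entries.append((abs(llr), rule, predicted))
    entries.sort(key=lambda e: e[0], reverse=True)
    return entries

# Hypothetical p(sense | rule) values for three rules.
rule_probs = {
    "shore in window": {"river": 0.90, "money": 0.10},
    "the in window": {"river": 0.50, "money": 0.50},
    "deposit in window": {"river": 0.05, "money": 0.95},
}
decision_list = build_decision_list(rule_probs, "river", "money")
ranked_rules = [rule for _, rule, _ in decision_list]
```

The uninformative rule ends up last with a score of 0, which is why ranking by absolute value serves as rule selection.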

  47. Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
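
Classification with the finished list is a first-match lookup. In this sketch the ranked list, the rule format, and the `default` fallback for contexts that match no rule are all assumptions for illustration.

```python
def classify(decision_list, context_rules, default=None):
    """Return the sense of the highest-ranked rule that matches the context."""
    present = set(context_rules)
    for rule, sense in decision_list:  # already sorted, best rule first
        if rule in present:
            return sense  # stop at the first match; lower-ranked rules are ignored
    return default

# Hypothetical ranked decision list for "bank".
decision_list = [
    (("WIN", "deposit"), "money"),
    (("WIN", "shore"), "river"),
    (("WIN", "check"), "money"),
]

# Rules found in a context containing "check" and "shore".
context = [("WIN", "check"), ("WIN", "marina"), ("WIN", "river"), ("WIN", "shore")]
sense = classify(decision_list, context)
fallback = classify(decision_list, [("WIN", "street")], default="money")
```

Although the "check" rule also matches and points to the money sense, it is ranked below the "shore" rule and is never consulted.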

  48. Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.

  49. How well does it work? • Simple statistical model • Easy to implement • Test performance on several different disambiguation problems

  50. Performance: accent restoration • On ambiguous cases, 98% correct • Examples: • côte / côté: 98% • décidé / décide: 97% • hacia / hacía: 97%
