
Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing

Preslav Nakov and Marti Hearst, Computer Science Division and SIMS, University of California, Berkeley. Supported by NSF DBI-0317510 and a gift from Genentech.


Presentation Transcript


  1. Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing • Preslav Nakov and Marti Hearst • Computer Science Division and SIMS, University of California, Berkeley • Supported by NSF DBI-0317510 and a gift from Genentech

  2. Overview • Unsupervised algorithm • Applied here to noun compound bracketing, but promising for structural ambiguity generally • Features • n-grams, χ², MI • Beyond the n-gram • surface features • paraphrases • State-of-the-art accuracy

  3. Noun Compound Bracketing • (a) [ [ liver cell ] antibody ] (left bracketing) • (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.

  4. Related Work • Marcus (1980), Pustejovsky & al. (1993), Resnik (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) (Pr(w1|w2): probability that w1 precedes w2) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004) • use the Web • unigrams and bigrams • Girju & al. (2005) • supervised model • bracketing in context • requires WordNet senses to be given • This work: • χ² • Web • n-grams • paraphrases • surface features

  5. Adjacency & Dependency (1) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • home health care • w1 and w2 independently modify w3 • adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only 1 modificational choice possible • law enforcement officer

  6. Adjacency & Dependency (2) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • w1 and w2 independently modify w3 • adjacency model • Is w2 w3 a compound? • (vs. w1 w2 being a compound) • dependency model • Does w1 modify w3? • (vs. w1 modifying w2)

  7. Frequencies • Adjacency model • Compare #(w1,w2) to #(w2,w3) • Dependency model • Compare #(w1,w2) to #(w1,w3) • #(w1,w2) is the frequency of “w1 w2” • [Diagram: w1 w2 w3 with arcs labelled left and right]
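
The two comparisons on this slide can be sketched in a few lines. This is only an illustration: the function name, the toy counts, and the tie handling (returning `None`) are assumptions, and the reading that a larger #(w1,w2) favours left bracketing follows the left/right labels of the slide's diagram.

```python
from typing import Optional


def bracket_by_frequency(c12: int, c23: int, c13: int, model: str) -> Optional[str]:
    """Return 'left', 'right', or None on a tie.

    c12, c23 and c13 stand for #(w1,w2), #(w2,w3) and #(w1,w3).
    """
    if model == "adjacency":
        rival = c23  # compare #(w1,w2) to #(w2,w3)
    elif model == "dependency":
        rival = c13  # compare #(w1,w2) to #(w1,w3)
    else:
        raise ValueError(f"unknown model: {model}")
    if c12 > rival:
        return "left"
    if c12 < rival:
        return "right"
    return None  # ASSUMPTION: the slides do not say how ties are handled


# Toy counts, invented for illustration only (not real Web counts).
adjacency_choice = bracket_by_frequency(900, 400, 1200, "adjacency")
dependency_choice = bracket_by_frequency(900, 400, 1200, "dependency")
```

With these invented counts the two models disagree, which is exactly the situation in which the choice of model matters.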

  8. Probabilities • Adjacency model • Compare Pr(w1w2|w2) to Pr(w2w3|w3) • Dependency model • Compare Pr(w1w2|w2) to Pr(w1w3|w3) • Pr(w1w2|w2) is the probability that w1 modifies w2 • [Diagram: w1 w2 w3 with arcs labelled left and right]

  9. Probabilities: Dependency • Dependency model • Pr(left) = Pr(w1w2|w2) Pr(w2w3|w3) • Pr(right) = Pr(w1w3|w3) Pr(w2w3|w3) • So we compare Pr(w1w2|w2) to Pr(w1w3|w3) • BUT: there is no such cancellation in Lauer’s model • [Diagram: w1 w2 w3 with arcs labelled left and right]
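
The step from the two products to the single comparison can be written out explicitly. This is only a restatement of the slide's formulas, under the assumption that Pr(w2w3|w3) is non-zero:

```latex
\frac{\Pr(\text{left})}{\Pr(\text{right})}
  = \frac{\Pr(w_1 w_2 \mid w_2)\,\Pr(w_2 w_3 \mid w_3)}
         {\Pr(w_1 w_3 \mid w_3)\,\Pr(w_2 w_3 \mid w_3)}
  = \frac{\Pr(w_1 w_2 \mid w_2)}{\Pr(w_1 w_3 \mid w_3)}
```

Left bracketing is therefore preferred exactly when Pr(w1w2|w2) exceeds Pr(w1w3|w3).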

  10. Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1w2|w2) = #(w1,w2) / #(w2) • #(w2): word frequency; query for “w2” • #(w1,w2): bigram frequency; query for “w1 w2” • smoothed by 0.5
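
A minimal sketch of this estimate is given below. The function name and the hit counts are hypothetical, and "smoothed by 0.5" is read here as adding 0.5 to each count, which the slide does not spell out.

```python
def conditional_prob(pair_hits: int, word_hits: int, smoothing: float = 0.5) -> float:
    """Estimate Pr(w1w2|w2) = #(w1,w2) / #(w2) from page-hit counts.

    ASSUMPTION: the 0.5 smoothing constant is added to both counts; the
    slide only says "smoothed by 0.5".
    """
    return (pair_hits + smoothing) / (word_hits + smoothing)


# Hypothetical page-hit counts, invented for illustration only.
p = conditional_prob(pair_hits=1_200, word_hits=50_000)
p_unseen = conditional_prob(pair_hits=0, word_hits=50_000)
```

Under this reading, a bigram that returns no hits still gets a small non-zero probability, so two unseen candidates can be compared without dividing by zero.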

  11. Probabilities: Why? (1) • Why should we use: • (a) Pr(w1w2|w2), rather than • (b) Pr(w2w1|w1)? • Keller&Lapata (2004) calculate: • AltaVista queries: • (a): 70.49% • (b): 68.85% • British National Corpus: • (a): 63.11% • (b): 65.57%

  12. Probabilities: Why? (2) • Why should we use: • (a) Pr(w1w2|w2), rather than • (b) Pr(w2w1|w1)? • Maybe to introduce a bracketing prior. • Just like Lauer (1995) did. • But otherwise, no reason to prefer either one. • Do we need probabilities? (association is OK) • Do we need a directed model? (symmetry is OK)

  13. Association Models: χ² (Chi Squared) • A = #(wi,wj) • B = #(wi) – #(wi,wj) • C = #(wj) – #(wi,wj) • D = N – (A+B+C) • N = 8 trillion (= A+B+C+D), i.e. 8 billion Web pages × 1,000 words
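
The transcript lists the four cells of the contingency table but not the statistic itself. The sketch below assumes the standard χ² formula for a 2×2 table, N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)); the function name and the small test table are illustrative only.

```python
def chi_squared(count_ij: int, count_i: int, count_j: int,
                n: int = 8_000_000_000_000) -> float:
    """Chi-squared association score for the word pair (wi, wj).

    count_ij = #(wi,wj), count_i = #(wi), count_j = #(wj); n defaults to
    the slide's estimate of 8 trillion words.
    ASSUMPTION: the standard 2x2 chi-squared formula is used.
    """
    a = count_ij
    b = count_i - count_ij
    c = count_j - count_ij
    d = n - (a + b + c)
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


# Small illustrative table (A=10, B=20, C=30, D=40), not Web data.
score = chi_squared(count_ij=10, count_i=30, count_j=40, n=100)
```

The score is symmetric in wi and wj, which fits the earlier remark that a directed model may not be needed.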

  14. Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • amino-acid sequence → left • brain stem’s cell → left • brain’s stem cell → right • The enormous size of the Web makes them frequent enough to be useful.

  15. Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis → left • Right dash • donor T-cell → right • fiber optics-system → should be left • Double dash • T-cell-depletion → unusable

  16. Web-derived Surface Features: Possessive Marker • Attached to the first word • brain’s stem cell → right • Attached to the second word • brain stem’s cell → left • Combined features • brain’s stem-cell → right

  17. Web-derived Surface Features: Capitalization • don’t-care – lowercase – uppercase • Plasmodium vivax Malaria → left • plasmodium vivax Malaria → left • lowercase – uppercase – don’t-care • brain Stem cell → right • brain Stem Cell → right • Disabled on: • Roman numerals • Single-letter words, e.g. vitamin D deficiency

  18. Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell → right

  19. Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) → left • (brain) stem cell → right • Two-word • (growth factor) beta → left • brain (stem cell) → right

  20. Web-derived Surface Features: Colon, dot, semicolon • Following the first word • home. health care → right • adult, male rat → right • Following the second word • health care, provider → left • lung cancer: patients → left

  21. Web-derived Surface Features: Dash to External Word • External word to the left • mouse-brain stem cell → right • External word to the right • tumor necrosis factor-alpha → left

  22. Web-derived Surface Features: Problems & Solutions • Problem: search engines ignore punctuation • “brain-stem cell” does not work • Solution: • query for “brain stem cell” • obtain 1,000 document summaries • look for the features in these summaries
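
The last step, looking for the features in the returned summaries, could be sketched as below. This is not the authors' code: the function name and the regular expressions are assumptions, only the dash and possessive features are covered, and the summaries are made-up strings standing in for search-engine snippets.

```python
import re
from collections import Counter
from typing import Iterable


def surface_votes(w1: str, w2: str, w3: str, summaries: Iterable[str]) -> Counter:
    """Count left/right evidence from dash and possessive markers."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",         # left dash: cell-cycle analysis
            rf"\b{a}\s+{b}['’]s\s+{c}\b",  # possessive on w2: brain stem's cell
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",         # right dash: donor T-cell
            rf"\b{a}['’]s\s+{b}\s+{c}\b",  # possessive on w1: brain's stem cell
        ],
    }
    votes: Counter = Counter()
    for text in summaries:
        for label, regexes in patterns.items():
            for regex in regexes:
                votes[label] += len(re.findall(regex, text, flags=re.IGNORECASE))
    return votes


# Made-up summaries, for illustration only.
snippets = [
    "The brain-stem cell count rose.",
    "A brain's stem cell is multipotent.",
    "brain stem cell",
    "Several brain stem-cell lines were derived.",
]
votes = surface_votes("brain", "stem", "cell", snippets)
```

Because the matching runs over the downloaded summaries rather than inside the query, punctuation that the search engine ignores is still visible to the patterns.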

  23. Other Web-derived Features: Abbreviation • After the third word • tumor necrosis factor (NF) → right • After the second word • tumor necrosis (TN) factor → left • We query for e.g. “tumor necrosis tn factor” • Problems: • Roman numerals: IV, VI • States: CA • Short words: me

  24. Other Web-derived Features: Concatenation • Consider health care reform • healthcare: 79,500,000 • carereform: 269 • healthreform: 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • “healthcare reform” vs. “health carereform”
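
The concatenation comparisons can be sketched with the counts quoted on the slide. The function name is hypothetical, ties fall to "right" purely for brevity, and the triple queries are read here as “healthcare reform” vs. “health carereform”, which is an assumption about the slide's intent.

```python
from typing import Dict, Tuple


def concatenation_queries(w1: str, w2: str, w3: str) -> Dict[str, Tuple[str, str]]:
    """Map each model to its (left-evidence, right-evidence) query strings."""
    return {
        "adjacency": (w1 + w2, w2 + w3),
        "dependency": (w1 + w2, w1 + w3),
        # ASSUMPTION: triples keep one space, joining only one word pair.
        "triples": (f"{w1}{w2} {w3}", f"{w1} {w2}{w3}"),
    }


# Counts quoted on the slide for "health care reform".
slide_hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}

queries = concatenation_queries("health", "care", "reform")
decisions = {}
for model in ("adjacency", "dependency"):
    left_query, right_query = queries[model]
    left_hits = slide_hits[left_query]
    right_hits = slide_hits[right_query]
    # Ties fall to "right" here only to keep the sketch short.
    decisions[model] = "left" if left_hits > right_hits else "right"
```

With the slide's counts, both models pick left bracketing, [ [ health care ] reform ], because healthcare is far more frequent than either rival concatenation.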

  25. Other Web-derived Features: Using Google’s * • Each * is a one-word wildcard • Single star • “health care * reform” → left • “health * care reform” → right • More stars and/or reverse order • “care reform * * health” → right • Adjacency model

  26. Other Web-derived Features: Reorder • Reorders for “health care reform” • “care reform health” → right • “reform health care” → left

  27. Other Web-derived Features: Internal Inflection Variability • First word • ??? • Second word • tyrosine kinase activation • tyrosine kinases activation

  28. Other Web-derived Features: Switch The First Two Words • Predict right, if we can reorder • adult male rat as • male adult rat

  29. Paraphrases (1) • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain → right • cells from the brain stem → left • Verbal • virus causing human immunodeficiency → left • pain associated with arthritis migraine → left • Copula • office building that is a skyscraper → right

  30. Paraphrases (2) • Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition: • of, for, in, at, on, from, with, about, (like) • This could be problematic when more than one preposition is possible • In contrast: • we try to predict syntax, not semantics • we do not disambiguate, just add up all counts • cells in (the) bone marrow → left • cells from (the) bone marrow → left

  31. Paraphrases (3) • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: is/was and that/which/who • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
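
A rough sketch of how prepositional paraphrase queries might be generated and their counts added up is shown below. The function names, the two-preposition default, and the hit counts are assumptions for illustration; inflection (e.g. cell vs. cells) and the verbal and copula patterns are left out.

```python
from typing import Dict, List, Mapping, Optional, Sequence


def prepositional_paraphrases(
    w1: str,
    w2: str,
    w3: str,
    prepositions: Sequence[str] = ("in", "from"),
    articles: Sequence[str] = ("", "the"),
) -> Dict[str, List[str]]:
    """Build exact-phrase queries that support left or right bracketing."""
    queries: Dict[str, List[str]] = {"left": [], "right": []}
    for prep in prepositions:
        for article in articles:
            gap = f" {article} " if article else " "
            # "w3 PREP (the) w1 w2" keeps w1 w2 together: left bracketing
            queries["left"].append(f"{w3} {prep}{gap}{w1} {w2}")
            # "w2 w3 PREP (the) w1" keeps w2 w3 together: right bracketing
            queries["right"].append(f"{w2} {w3} {prep}{gap}{w1}")
    return queries


def paraphrase_vote(queries: Mapping[str, List[str]],
                    hits: Mapping[str, int]) -> Optional[str]:
    """Add up all counts per side, without choosing a single preposition."""
    left = sum(hits.get(q, 0) for q in queries["left"])
    right = sum(hits.get(q, 0) for q in queries["right"])
    if left == right:
        return None  # ASSUMPTION: ties are left undecided
    return "left" if left > right else "right"


queries = prepositional_paraphrases("bone", "marrow", "cells")
# Hypothetical hit counts, invented for illustration only.
toy_hits = {"cells in the bone marrow": 10, "marrow cells in bone": 1}
decision = paraphrase_vote(queries, toy_hits)
```

Summing over every pattern mirrors the slide's point that the method does not disambiguate between prepositions and only needs the syntactic preference.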

  32. Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier’s encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ = .606)

  33. Evaluation: Experiments • Exact phrase queries • Limited to English • Inflections: • Lauer Set: Carroll’s morphological tools • Biomedical Set: UMLS Specialist Lexicon

  34. Results: Lauer (1) • [Chart; legend: wrong, N/A, correct]

  35. Results: Lauer (2) • [Chart; legend: wrong, N/A, correct]

  36. Results: Lauer (3)

  37. Results: Bio (1) • [Chart; legend: wrong, N/A, correct]

  38. Results: Bio (2) • [Chart; legend: wrong, N/A, correct]

  39. Individual Surface Features Performance: Bio

  40. Paraphrase and Surface Features Performance • Lauer Set • Biomedical Set

  41. Discussion • [Panels: Lauer, Bio] • Adjacency vs. Dependency • χ² vs. frequencies vs. probabilities

  42. Conclusion • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks) • surface features • paraphrases • Obtained new state-of-the-art results on NC bracketing • more robust than Lauer (1995) • more accurate than Keller&Lapata (2004)

  43. Future Work • Recognize ambiguous cases • Bracket more than 3 nouns • Not just bracketing but dependences: • e.g. growth factor alpha • Bracket NPs in general (other POS) • augment Penn Treebank with NP-internal dependences • Application to other structural ambiguity problems: • Prepositional phrase attachment • Noun phrase coordination

  44. The End Thank you!

  45. Web Counts: Problems • Page hits are inaccurate • This may be ok (Keller&Lapata,2003) • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • health: noun • care: both verb and noun • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition
