
Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005

Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005. Preslav Nakov EECS, Computer Science Division University of California, Berkeley Marti Hearst SIMS University of California, Berkeley. Outline. Introduction Related Work Models and Features.


Presentation Transcript


  1. Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing (CoNLL-2005) • Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley • Marti Hearst, SIMS, University of California, Berkeley

  2. Outline • Introduction • Related Work • Models and Features

  3. Introduction • Noun compound bracketing -> noun compound interpretation • liver cell antibody • [[liver cell] antibody] • liver cell line • [liver [cell line]] • Same POS sequence, different syntactic trees

  4. This Paper • A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs) • Previous approaches: bigram estimates used to compute adjacency and dependency scores • Improvements • χ2 measure • a new set of surface features for querying Web search engines • Evaluated on 2 domains: encyclopedia and bioscience

  5. Related Work • NC syntax and semantics • Still an active area -> Journal of Computer Speech and Language, Special Issue on Multiword Expressions • Adjacency model • Probabilistic dependency model, Lauer (1995) • Data sparseness (uses categories instead of words) • 244 NCs from an encyclopedia • Inter-annotator agreement 81.5% • Baseline 66.8% -> 77.5% • Adding POS -> state-of-the-art result of 80.7%

  6. 2003-2005 • Keller and Lapata (2003) • Use Web search engines to obtain frequencies for unseen bigrams • (2004) applied to six NLP tasks, including disambiguation of NCs • Simpler version (uses frequency only): 78.68% • Girju et al. (2005): supervised (decision tree, 5 WordNet semantic features) • 83.1%

  7. Models and Features • Adjacency and dependency models • w1 w2 w3 -> [w1 [w2 w3]]: two reasons for right bracketing • 1. w2 w3 is a compound (modified by w1) • home health care • 2. w1 and w2 independently modify w3 • adult male rat • Adjacency model checks 1. • Dependency model (better) checks 2. • Left bracketing -> only 1 possibility • [law enforcement] agent
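The two comparisons above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `assoc` stands for any association score (raw frequency, probability, χ2), the toy counts are invented, and resolving ties to "right" is an assumption made here.

```python
from typing import Callable

# Any association score between two words (frequency, probability, chi-square, ...).
Assoc = Callable[[str, str], float]


def adjacency_bracket(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Adjacency model: is w2 more strongly tied to w1 or to w3?"""
    # Ties fall to "right" (an assumption of this sketch).
    return "left" if assoc(w1, w2) > assoc(w2, w3) else "right"


def dependency_bracket(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Dependency model: does w1 modify w2 or w3?"""
    return "left" if assoc(w1, w2) > assoc(w1, w3) else "right"


# Hypothetical toy counts, for illustration only.
TOY_COUNTS = {
    ("law", "enforcement"): 50, ("enforcement", "agent"): 5, ("law", "agent"): 2,
    ("adult", "male"): 3, ("male", "rat"): 20, ("adult", "rat"): 15,
}


def toy_assoc(a: str, b: str) -> float:
    return float(TOY_COUNTS.get((a, b), 0))


law = dependency_bracket("law", "enforcement", "agent", toy_assoc)
rat = dependency_bracket("adult", "male", "rat", toy_assoc)
```

With these toy counts, `law` comes out as "left" and `rat` as "right". The two models differ only in the second term of the comparison.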

  8. Computing Probabilities • Alternative • Calculations

  9. χ2 measure • A = #(wi, wj) • B = #(wi) - A • C = #(wj) - A • D = N - A - B - C • N = 8 trillion = 8 billion Google pages × 1,000 words/page • χ2 works better than MI (Yang and Pedersen, 1997)
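The slide text does not show the formula itself. The definitions of B, C and D match the cells of a 2×2 contingency table, so the standard χ2 statistic for such a table is given here as a likely reading, with A assumed to be the co-occurrence count of the two words:

```latex
\chi^2 = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)},
\qquad
\begin{aligned}
A &= \#(w_i, w_j), & B &= \#(w_i) - A,\\
C &= \#(w_j) - A,  & D &= N - (A + B + C).
\end{aligned}
```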

  10. Example: 蛋包飯 (omelet rice) • 蛋 (egg): 2067593 • 蛋包: 2217 • 包 (wrap): 10207448 • 包飯: 3398 • 飯 (rice): 1672224 • χ2: 包飯 750.34 > 蛋包 67.32 -> favors right bracketing, [蛋 [包飯]]
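The comparison in this example can be reproduced in outline with the standard 2×2 χ2 statistic. This is a sketch under assumptions: the function name and the default N of 8 trillion are choices made here. The absolute scores depend heavily on N, so they should not be expected to match the values quoted on the slide. Only the comparison between the two scores is used.

```python
def chi_square(pair_count: int, wi_count: int, wj_count: int,
               n: int = 8 * 10**12) -> float:
    """Chi-square association for a word pair from page-hit counts.

    a: pages with both words adjacent, b: wi without wj,
    c: wj without wi, d: everything else (n is an assumed corpus size).
    """
    a = pair_count
    b = wi_count - a
    c = wj_count - a
    d = n - a - b - c
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


# Counts taken from the slide: 蛋, 蛋包, 包, 包飯, 飯.
left_score = chi_square(2217, 2067593, 10207448)    # (蛋, 包)
right_score = chi_square(3398, 10207448, 1672224)   # (包, 飯)

# Adjacency-style decision: the more strongly associated pair is bracketed first.
bracketing = "right" if right_score > left_score else "left"
```

Python integers are arbitrary precision, so the large intermediate products do not overflow before the final division.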

  11. Web-Derived Surface Features (1/2) • Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning. • Dash (hyphen) • left bracketing • cell cycle analysis -> cell-cycle analysis • right bracketing (less reliable) • donor T-cell • fiber optics-system • t-cell-depletion • Possessive marker • brain’s stem cells (right), brain stem’s cells (left), brain’s stem-cells (right) • Internal capitalization • Plasmodium vivax Malaria (left), brain Stem cells (right) • this feature is disabled for Roman digits and single-letter words • vitamin D deficiency

  12. Web-Derived Surface Features (2/2) • Embedded slashes • leukemia/lymphoma cell • Parentheses • growth factor (beta) or (growth factor) beta • (brain) stem cells • A comma, a dot or a colon • “health care, provider” or “lung cancer: patients” (weak indicator) • mouse-brain stem cells (weak indicator) • Unfortunately, Web search engines ignore punctuation characters: hyphens, brackets, apostrophes, etc. • so these features are collected indirectly, by post-processing the returned summaries (up to 1,000 results) • Although some of the above features are clearly more reliable than others, we do not try to weight them • Where the counts come from • n-gram frequencies: page hits returned by the search engine, used as a proxy • surface features: counted in the 1,000 summaries
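Since the search engine drops punctuation, the surface markers have to be counted in the returned summaries. The sketch below shows one way this post-processing could look for two of the features (dash and possessive). The function name and patterns are assumptions made for illustration, and each match counts as one unweighted vote.

```python
import re
from typing import Iterable, Tuple


def surface_votes(w1: str, w2: str, w3: str,
                  snippets: Iterable[str]) -> Tuple[int, int]:
    """Count (left, right) bracketing evidence for 'w1 w2 w3' in snippets."""
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    left_patterns = [
        rf"\b{e1}-{e2}\s+{e3}\b",           # dash: cell-cycle analysis
        rf"\b{e1}\s+{e2}['’]s\s+{e3}\b",    # possessive: brain stem's cells
    ]
    right_patterns = [
        rf"\b{e1}\s+{e2}-{e3}\b",           # dash: donor T-cell
        rf"\b{e1}['’]s\s+{e2}\s+{e3}\b",    # possessive: brain's stem cells
    ]
    left = right = 0
    for text in snippets:
        left += sum(len(re.findall(p, text, re.IGNORECASE)) for p in left_patterns)
        right += sum(len(re.findall(p, text, re.IGNORECASE)) for p in right_patterns)
    return left, right


votes = surface_votes("brain", "stem", "cells", [
    "the brain's stem cells",
    "brain stem’s cells",
    "brain's stem cells divide",
])
```

Here `votes` is `(1, 2)`, i.e. one vote for left and two for right bracketing. A form hyphenated on both sides, such as t-cell-depletion, matches neither dash pattern because both require whitespace on the other side.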

  13. Other Web-Derived Features • Abbreviations • tumor necrosis factor (NF) • tumor necrosis (TN) factor • Concatenation • health care reform -> healthcare, carereform • Wildcard (*) • “health care * reform” <-> “health * care reform” • Reorder • reform health care <-> care reform health • myosin heavy chain, heavy chain myosin • Internal inflection variability • tyrosine kinase activation, tyrosine kinases activation • Switching • for “adult male rat”, we would also expect “male adult rat”
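Three of these features (concatenation, wildcard, reorder) amount to issuing a pair of contrasting queries and comparing the hit counts. The sketch below only builds the query strings. The function name is hypothetical, and the assignment of each variant to left or right bracketing follows the assumption that a variant keeping two words together supports bracketing those two words.

```python
from typing import Dict, List


def web_feature_queries(w1: str, w2: str, w3: str) -> Dict[str, List[str]]:
    """Build exact-phrase queries whose hit counts can be compared pairwise."""
    left = [
        f'"{w1}{w2}"',              # concatenation: healthcare
        f'"{w1} {w2} * {w3}"',      # wildcard keeps w1 w2 adjacent
        f'"{w3} {w1} {w2}"',        # reorder: reform health care
    ]
    right = [
        f'"{w2}{w3}"',              # concatenation: carereform
        f'"{w1} * {w2} {w3}"',      # wildcard keeps w2 w3 adjacent
        f'"{w2} {w3} {w1}"',        # reorder: care reform health
    ]
    return {"left": left, "right": right}


queries = web_feature_queries("health", "care", "reform")
```

Each left query is paired with the right query at the same index. Whichever returns more hits contributes one vote for its bracketing.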

  14. New Findings

  15. Paraphrases • Warren (1978) proposes prepositional paraphrases • stem cells in the brain (right) • cells from the brain stem (left) • Copula paraphrase • office building that/which is a skyscraper • Verbal paraphrase • pain associated with arthritis migraine • Search engines lack linguistic annotations • so a small set of hand-chosen paraphrases is used • associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
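The hand-chosen list above can be turned into paraphrase queries mechanically. The following is a simplified sketch, not the authors' procedure: the function name is hypothetical, the slash alternatives are expanded into separate entries, and determiners ("the") and inflection are ignored.

```python
from typing import Dict, List

# Linking expressions from the slide, with "at/in" and "by/in/for" expanded.
PARAPHRASE_LINKS = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]


def paraphrase_queries(w1: str, w2: str, w3: str) -> Dict[str, List[str]]:
    """Build paraphrase queries supporting each bracketing of 'w1 w2 w3'."""
    # Left: w1 w2 stays together, e.g. "cells from the brain stem".
    left = [f'"{w3} {link} {w1} {w2}"' for link in PARAPHRASE_LINKS]
    # Right: w2 w3 stays together, e.g. "stem cells in the brain".
    right = [f'"{w2} {w3} {link} {w1}"' for link in PARAPHRASE_LINKS]
    return {"left": left, "right": right}


brain = paraphrase_queries("brain", "stem", "cells")
```

Summing the hit counts of the left queries and of the right queries gives one more piece of evidence that can be compared like the other features.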

  16. Evaluation • Lauer’s dataset (1995) • 244 unambiguous 3-noun NCs • Biomedical dataset (Nakov et al., 2005, SIG BioLink) • Open NLP tools • sentence-split, tokenized, POS-tagged and shallow-parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003) • 500 NCs: 361 left, 69 right, 70 ambiguous

  17. Experiments • Used MSN Search statistics for the n-grams and the paraphrases (unless the pattern contained a “*”) • MSN always returned exact numbers • Used Google for the surface features • Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)

  18. Tools Mentioned • UMLS Specialist lexicon • used to obtain spelling variants of biomedical terms • http://www.nlm.nih.gov/pubs/factsheets/umlslex.html • Carroll’s morphological tools • http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html

  19. UMLS Lexicon • {base=AAA entry=E0000049 cat=noun variants=metareg variants=uncount acronym_of=abdominal aortic aneurysmectomy|E0429482 acronym_of=acne-associated arthritis|E0429483 acronym_of=acquired aplastic anemia|E0429484 acronym_of=acute anxiety attack|E0429485 acronym_of=androgenic anabolic agent|E0429486 acronym_of=aneurysm of ascending aorta acronym_of=aromatic amino acid|E0356310 acronym_of=acute apical abscess|E0356309 abbreviation_of=abdominal aortic aneurysm|E0006446} • {base=AAMD spelling_variant=A.A.M.D. entry=E0000050 cat=noun variants=groupuncount acronym_of=American Association on Mental Deficiency|E0000277}

  20. Conclusions and Future Work • Improved upon the state-of-the-art approaches to NC bracketing • Future work includes • testing on NCs of more than 3 words • recognizing the ambiguous cases • including determiners and modifiers • applying the approach to other NLP problems • refining parser output • parsers typically assume right bracketing
