Category-Based Pseudowords

Category-Based Pseudowords. Preslav Nakov & Marti Hearst, University of California at Berkeley, EECS & SIMS. Supported by Genentech and ARDA Aquaint. Word sense disambiguation (WSD) task: determine the sense of a particular instance of a multi-sense word given its context.


Presentation Transcript


  1. Category-Based Pseudowords. Preslav Nakov & Marti Hearst, University of California at Berkeley, EECS & SIMS. Supported by Genentech and ARDA Aquaint. HLT/NAACL'03

  2. Word sense disambiguation
     • WSD task: determine the sense of a particular instance of a multi-sense word given its context
     • classic ambiguous example: bank
       • homography
         • river bank
         • financial institution
       • polysemy
         • financial institution
         • building

  3. Evaluation
     • Ideally: using a sense-tagged corpus
       • general purpose, e.g. the SENSEVAL corpus
       • specific domain, e.g. biomedical
         • the National Library of Medicine test collection contains instances of 50 highly frequent ambiguous concepts from the UMLS Metathesaurus
     • Moving to a new domain
       • a sense-tagged corpus may be unavailable
       • even when available, it may be unsuitable
         • What if we use a different sense distinction, e.g. MeSH instead of the UMLS Metathesaurus?
         • What if we are also interested in less frequent words, e.g. we need to evaluate an all-words system?

  4. Pseudowords
     • building a sense-tagged corpus is very expensive, so create an artificial one
     • pseudoword: composite comprised of two or more words, chosen at random (Gale et al.’92), (Schuetze’92):
       • e.g. banana and door → banana_door
     • accepted as an upper bound of the true system’s accuracy
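
The pseudoword construction on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `make_pseudoword_corpus` and the toy sentence are assumptions, and the original word at each replaced position is kept as the gold "sense" label.

```python
def make_pseudoword_corpus(tokens, w1, w2):
    """Replace every occurrence of w1 or w2 with the pseudoword w1_w2.

    Returns the rewritten token list and a dict mapping each replaced
    position to the original word, which acts as the gold sense label.
    (Hypothetical helper, for illustration only.)
    """
    pseudoword = f"{w1}_{w2}"
    rewritten, labels = [], {}
    for i, tok in enumerate(tokens):
        if tok in (w1, w2):
            labels[i] = tok          # remember the true "sense"
            rewritten.append(pseudoword)
        else:
            rewritten.append(tok)
    return rewritten, labels


# Toy sentence (made up for this example).
tokens = "she ate a banana near the door".split()
rewritten, labels = make_pseudoword_corpus(tokens, "banana", "door")
print(" ".join(rewritten))
print(labels)
```

A disambiguation system is then asked to recover `labels` from the context of each `banana_door` occurrence, so no manual sense tagging is needed.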

  5. Problems
     Chosen entirely at random, and thus:
     • difficult to characterize in terms of the type of ambiguity being modeled
     • optimistic in their estimations (Gaustad’01)
     • highly likely to combine semantically distinct words
     • real ambiguous words have senses similar in meaning and difficult to distinguish

  6. The solution
     Use lexical category membership

  7. MeSH and Medline
     • we use MeSH (Medical Subject Headings)
       • example: Eye has the following codes:
         • A01.456.505.420 (child of Face)
         • A09.371 (child of Sense Organs)
       • average number of senses: 2.12
     • we cut after the first period to allow generalization (e.g. A01 and A09)
       • 71.18% - single class, 22.14% - two classes
       • the ambiguity drops to 1.39
     • Medline abstracts - 180,226
       • training: 120,150
       • testing: 60,076
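
Cutting a MeSH code after the first period is a simple string operation. The sketch below assumes codes are given as plain strings; the function name `top_level_categories` is hypothetical.

```python
def top_level_categories(mesh_codes):
    """Truncate each MeSH tree number at the first period and deduplicate."""
    return sorted({code.split(".", 1)[0] for code in mesh_codes})


# The two codes listed for "Eye" on this slide.
eye = ["A01.456.505.420", "A09.371"]
print(top_level_categories(eye))  # ['A01', 'A09']
```

Two codes that share a top-level prefix collapse into one class after truncation, which is why the average ambiguity drops.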

  8. Pseudowords generation (1)
     Build a list C of the category couples and their frequencies in the training corpus
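
One way to build such a list, sketched under an assumption the slide does not spell out: each occurrence of a term that maps to exactly two truncated MeSH categories contributes one count to that category pair. The names `count_category_pairs`, `term_categories` and the toy data are hypothetical.

```python
from collections import Counter


def count_category_pairs(term_occurrences, term_categories):
    """Count how often each unordered category pair occurs.

    term_occurrences: iterable of terms as they appear in the training corpus
    term_categories:  dict mapping a term to its set of truncated categories
    Assumption: only terms with exactly two categories contribute a pair.
    """
    pairs = Counter()
    for term in term_occurrences:
        cats = term_categories.get(term, set())
        if len(cats) == 2:
            pairs[tuple(sorted(cats))] += 1
    return pairs


# Toy data (made up): "eye" is ambiguous between A01 and A09.
term_categories = {"eye": {"A01", "A09"}, "heart": {"A07"}}
C = count_category_pairs(["eye", "heart", "eye"], term_categories)
print(C)
```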

  9. Pseudowords generation (2)
     Generate pseudowords with the following characteristics:
     • represent a real ambiguity class pair (met in the training corpus)
     • the number of pseudowords drawn from a particular class pair is proportional to the pair’s frequency
     • only unambiguous words are used as pseudoword constituents
     • multi-word concepts are allowed as elements, e.g. general systems theory + glutathione S-transferase

  10. Pseudowords generation (3)
      Pseudowords for the lower bound
      • in real texts, the more frequent sense for a two-sense distinction occurs around 92% of the time (Sanderson & van Rijsbergen’99)
      • evenly distributed senses are harder
      • so we build a balanced list W of pairs:
        • we calculate the mean corpus word frequency E and then find the words with frequency in [E/2, 3E/2]
        • in the particular experiment: E=45.21, which gave a list of 64,596 pairs
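
The frequency band can be computed directly from a word-frequency table. A minimal sketch with made-up counts; the function name `balanced_words` is an assumption. With the reported E=45.21, the band would be roughly [22.6, 67.8].

```python
def balanced_words(word_freq):
    """Return the mean frequency E and the words with frequency in [E/2, 3E/2]."""
    mean = sum(word_freq.values()) / len(word_freq)
    lo, hi = mean / 2, 3 * mean / 2
    selected = sorted(w for w, f in word_freq.items() if lo <= f <= hi)
    return mean, selected


# Toy frequencies (made up): the mean is 50, so the band is [25, 75].
word_freq = {"a": 10, "b": 40, "c": 50, "d": 100}
E, selected = balanced_words(word_freq)
print(E, selected)
```

Pairing words only from this band keeps the two constituents of a pseudoword at comparable frequencies, which is the harder, evenly distributed case.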

  11. Pseudowords generation (4)
      • importance sampling
        • 1) Select a category pair c1,c2 from C by sampling from a multinomial distribution with parameters proportional to the frequencies of the elements of C.
        • 2) Sample uniformly to draw two random distinct words w1 and w2 whose classes correspond to the classes selected in step 1).
        • 3) If the word pair w1,w2 has been sampled already, go to step 1) and try again.
      • we sampled 1,000 pseudowords (88,758 instances) out of the 64,596 possible pairs
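
The three sampling steps can be sketched as a rejection loop. This is an illustration under stated assumptions, not the authors' implementation: the function name, the data layout (`pair_freq`, `words_by_category`), the toy words and the `max_tries` guard are all hypothetical.

```python
import random


def sample_pseudowords(pair_freq, words_by_category, n, seed=0, max_tries=10_000):
    """Draw up to n distinct word pairs following steps 1) to 3).

    pair_freq:         dict mapping (c1, c2) to the pair's corpus frequency
    words_by_category: dict mapping a category to its unambiguous words
    max_tries is a safety guard added for this sketch.
    """
    rng = random.Random(seed)
    pairs = list(pair_freq)
    weights = [pair_freq[p] for p in pairs]
    seen, result = set(), []
    tries = 0
    while len(result) < n and tries < max_tries:
        tries += 1
        # Step 1: category pair, with probability proportional to frequency.
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2: one word drawn uniformly from each category.
        w1 = rng.choice(words_by_category[c1])
        w2 = rng.choice(words_by_category[c2])
        # Step 3: reject identical words and already sampled pairs.
        if w1 == w2 or (w1, w2) in seen:
            continue
        seen.add((w1, w2))
        result.append((w1, w2))
    return result


# Toy inputs (made up).
pair_freq = {("A01", "A09"): 3, ("C04", "D12"): 1}
words_by_category = {
    "A01": ["face", "scalp"],
    "A09": ["retina", "cochlea"],
    "C04": ["adenoma"],
    "D12": ["insulin", "keratin"],
}
pseudowords = sample_pseudowords(pair_freq, words_by_category, n=4)
print(pseudowords)
```

Because step 1 is weighted by frequency, frequent category pairs contribute more pseudowords, matching the proportionality requirement stated earlier in the deck.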

  12. Sample pseudowords
      • the more unusual pairs come from less frequent categories

  13. Classifier
      • Naïve Bayes classifier
        • simple, commonly used for WSD, and among the best performing
      • we used a symmetric context window:
        • 10, 20, 40 and 300 words on each side
      • category name as a proxy for the sense
        • ambiguous MeSH categories as target
        • UNambiguous MeSH categories as features (we use a class-based model, and not a word-based one)
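
A class-based Naïve Bayes classifier over a symmetric context window can be sketched as follows. The smoothing choice (add-one), the class and function names, and the toy training data are assumptions made for this illustration; the slides do not specify them.

```python
import math
from collections import Counter, defaultdict


def window_features(categories, position, k):
    """Category labels within k positions on each side of `position`.

    `categories` holds, per token, its unambiguous category or None.
    """
    left = categories[max(0, position - k):position]
    right = categories[position + 1:position + 1 + k]
    return [c for c in left + right if c is not None]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed here)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for f in features:
                if f in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data (made up): context categories -> sense category.
train = [
    (["A02", "A02", "C10"], "A01"),
    (["A02"], "A01"),
    (["A08", "C11"], "A09"),
    (["A08", "A08"], "A09"),
]
clf = NaiveBayes().fit(train)
print(clf.predict(["A02"]), clf.predict(["A08", "C11"]))
```

Using category labels instead of word forms as features keeps the model small and lets evidence be shared across words of the same category.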

  14. Abbreviations
      • we have no real disambiguated corpus
      • use abbreviations, as suggested in (Liu et al.’02)
        • represent real ambiguous words
        • but may be due to accident
        • intermediate position between entirely random pseudowords and real ambiguous words
      • we generated 98,841 abbreviations (332,020 instances in total) such that:
        • their expansions are fully and unambiguously mapped to MeSH
        • they represent exactly two distinct categories
      • we used an abbreviation extraction tool described in (Schwartz & Hearst’03)

  15. Sample abbreviations

  16. Evaluation
      • Category based
        • baseline – choose the more frequent class (shown for abbreviations)
        • pessimistic – evenly distributed constituents
        • realistic – random constituents (frequency at least 5)
        • abbreviations
      • Non-category based
        • optimistic – completely random (the standard way to generate)
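
The baseline on this slide, always choosing the more frequent class, can be sketched as below. The function name and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(label == majority for label in test_labels)
    return correct / len(test_labels)


# Toy labels (made up): A01 dominates the training data.
train_labels = ["A01"] * 9 + ["A09"]
test_labels = ["A01", "A01", "A09", "A01"]
accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(accuracy)  # 0.75
```

A classifier is only informative to the extent that it beats this number, which is high whenever one sense dominates.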

  17. Conclusions
      • We introduced category-based pseudowords based on distributions from lexical category co-occurrence:
        • give a more accurate lower bound
        • allow detailed study (many samples) of a particular sense ambiguity
        • represent a better motivated word grouping in pseudowords

  18. Thank you! Your questions?
