
Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Presentation Transcript


  1. Part 5. Minimally Supervised Methods for Word Sense Disambiguation

  2. Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind

  3. Task Definition • Supervised WSD = learning sense classifiers starting with annotated data • Minimally supervised WSD = learning sense classifiers from a small amount of annotated data, with minimal human supervision • Examples • Automatically bootstrap a corpus starting with a few human-annotated examples • Use monosemous relatives / dictionary definitions to automatically construct sense-tagged data • Rely on Web users + active learning for corpus annotation

  4. Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind

  5. Bootstrapping WSD Classifiers • Build sense classifiers with little training data • Expand applicability of supervised WSD • Bootstrapping approaches • Co-training • Self-training • Yarowsky algorithm

  6. Bootstrapping Recipe • Ingredients • (Some) labeled data • (Large amounts of) unlabeled data • (One or more) basic classifiers • Output • Classifier that improves over the basic classifiers

  7. Co-training / Self-training • A set L of labeled training examples • A set U of unlabeled examples • Classifiers Ci • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P
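A minimal sketch of this loop in Python, assuming scikit-learn-style classifiers with fit/predict_proba; the names L, U, P, G, I mirror the slide, everything else (the feature representation, the refill policy details) is an assumption, and the "maintain distribution in L" step is only noted in a comment.

```python
# Sketch of the co-training / self-training loop; not the exact implementation
# of (Mihalcea 2004). Pass one classifier for self-training, two for co-training
# (in a real co-training setup each classifier is trained on its own view).
import random

def bootstrap(classifiers, L, U, P=1000, G=10, I=20):
    """L: list of (features, sense) pairs; U: list of unlabeled feature vectors."""
    U = list(U)
    random.shuffle(U)
    pool, U = U[:P], U[P:]                      # 1. pool U' of P random examples
    for _ in range(I):                          # 2. loop for I iterations
        X = [x for x, _ in L]
        y = [s for _, s in L]
        for clf in classifiers:                 # train each Ci on L
            clf.fit(X, y)
        confident = []
        for clf in classifiers:                 # label U', keep G most confident
            probs = clf.predict_proba(pool)
            ranked = sorted(zip(pool, probs), key=lambda xp: max(xp[1]), reverse=True)
            for x, p in ranked[:G]:
                confident.append((x, clf.classes_[p.argmax()]))
        L.extend(confident)                     # add to L (class-distribution
                                                # balancing omitted here)
        used = [x for x, _ in confident]
        pool = [x for x in pool if not any(x is u for u in used)]
        while len(pool) < P and U:              # refill U' back to size P
            pool.append(U.pop())
    return classifiers
```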

  8. Co-training • (Blum and Mitchell 1998) • Two classifiers • independent views • [independence condition can be relaxed] • Co-training in Natural Language Learning • Statistical parsing (Sarkar 2001) • Co-reference resolution (Ng and Cardie 2003) • Part of speech tagging (Clark, Curran and Osborne 2003) • ...

  9. Self-training • (Nigam and Ghani 2000) • One single classifier • Retrain on its own output • Self-training for Natural Language Learning • Part of speech tagging (Clark, Curran and Osborne 2003) • Co-reference resolution (Ng and Cardie 2003) • several classifiers through bagging

  10. Parameter Setting for Co-training/Self-training • 1. Create a pool of examples U' • choose P random examples from U • 2. Loop for I iterations • Train Ci on L and label U' • Select G most confident examples and add to L • maintain distribution in L • Refill U' with examples from U • keep U' at constant size P • A major drawback of bootstrapping • “No principled method for selecting optimal values for these parameters” (Ng and Cardie 2003)

  11. Experiments with Co-training / Self-training for WSD • (Mihalcea 2004) • Training / Test data • Senseval-2 nouns (29 ambiguous nouns) • Average corpus size: 95 training examples, 48 test examples • Raw data • British National Corpus • Average corpus size: 7,085 examples • Co-training • Two classifiers: local and topical classifiers • Self-training • One classifier: global classifier

  12. Optimal Parameter Settings • Optimized on the test set • Upper bound in co-training/self-training performance • Parameter ranges • P = {1, 100, 500, 1000, 1500, 2000, 5000} • G = {1, 10, 20, 30, 40, 50, 100, 150, 200} • I = {1, ..., 40} • 29 nouns → 120,000 runs • Accuracy: • Basic classifier: 53.84% • Optimal self-training: 65.61% • Optimal co-training: 65.75% • ~25% error reduction • Example: lady • basic = 61.53% • self-training = 84.61% [20/100/39] • co-training = 82.05% [1/1000/3]
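The exhaustive search over these ranges is the source of the "optimized on the test set" upper bound; a rough sketch, assuming a hypothetical evaluate(word, P, G, I) function that runs one bootstrapping experiment with a given setting and returns test-set accuracy:

```python
# Sketch of the exhaustive (upper-bound) parameter search; evaluate() is a
# hypothetical stand-in for one co-training run with a given (P, G, I) setting.
from itertools import product

P_RANGE = [1, 100, 500, 1000, 1500, 2000, 5000]
G_RANGE = [1, 10, 20, 30, 40, 50, 100, 150, 200]
I_RANGE = range(1, 41)

def optimal_setting(word, evaluate):
    # try every (P, G, I) combination and keep the best-scoring one
    best = max(product(P_RANGE, G_RANGE, I_RANGE),
               key=lambda pgi: evaluate(word, *pgi))
    return best, evaluate(word, *best)
```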

  13. Empirical Parameter Settings • How to detect parameter settings in practice? • 20% training data → validation set • Same range of parameter values • Method 1: Per-word parameter setting • Identify best parameter setting for each word • No improvement over basic classifier • Basic = 53.84% • Co-training = 51.73% • Self-training = 52.88%

  14. Empirical Parameter Settings • Method 2: Overall parameter setting • For each parameter setting P, G, I • Determine the total relative growth in performance • Select the “best” setting • Co-training: • G = 1, P = 1500, I = 2 • Basic = 53.84%, Co-training = 55.67% • Self-training • G = 1, P = 1, I = 1 • Basic = 53.84%, Self-training = 54.16%

  15. Empirical Parameter Setting • Method 3: Smoothed co-training • Combine iterations of co-training with voting • Effect • similar shape • “smoothed” learning curve • larger range with better-than-baseline performance • Results (avg.) • Basic = 53.84% • Co-training, global setting • basic = 55.67% • smoothed = 58.35% • Co-training, per-word setting • basic = 51.73% • smoothed = 56.68%
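The "smoothing" amounts to a majority vote over the labels produced at successive co-training iterations; a minimal sketch, assuming the per-iteration labels for the same test examples have been stored:

```python
# Sketch of smoothed co-training: the final label of each test example is the
# majority vote over the labels assigned at the different iterations.
from collections import Counter

def smoothed_labels(labels_per_iteration):
    """labels_per_iteration: one list of labels per iteration,
    all covering the same test examples in the same order."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*labels_per_iteration)]
```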

  16. Yarowsky Algorithm • (Yarowsky 1995) • Similar to co-training • Differs in the basic assumption • “view independence” (co-training) vs. “precision independence” (Yarowsky algorithm) • (Abney 2002) • Relies on two classifiers and a decision list • One sense per collocation : • Nearby words provide strong and consistent clues as to the sense of a target word • One sense per discourse : • The sense of a target word is highly consistent within a single document

  17. Learning Algorithm • A decision list is used to classify instances of the target word: • “the loss of animal and plant species through extinction …” • Classification is based on the highest-ranking rule that matches the target context
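A minimal sketch of decision-list classification, assuming rules are (feature, sense, score) triples ranked by a score such as the log-likelihood ratio of Yarowsky (1995); the example rules and scores below are illustrative, not taken from the paper.

```python
# Sketch: the first (highest-scoring) rule whose feature appears in the context
# decides the sense of the target word.
def classify(context_words, decision_list):
    for feature, sense, score in sorted(decision_list, key=lambda r: -r[2]):
        if feature in context_words:
            return sense
    return None   # no rule matched; in practice fall back to the most frequent sense

rules = [("species", "plant/living", 9.3),        # hypothetical scores
         ("manufacturing", "plant/factory", 8.1),
         ("life", "plant/living", 7.0)]
context = "the loss of animal and plant species through extinction".split()
print(classify(context, rules))                   # -> plant/living
```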

  18. Bootstrapping Algorithm • Figure: occurrences of the target word plant, with seed labels Sense-A (life) and Sense-B (factory) • All occurrences of the target word are identified • A small training set of seed data is tagged with word sense

  19. Bootstrapping Algorithm • Iterative procedure: • Train decision list algorithm on seed set • Classify residual data with decision list • Create new seed set by identifying samples that are tagged with a probability above a certain threshold • Retrain classifier on new seed set • Selecting training seeds • Initial training set should accurately distinguish among possible senses • Strategies: • Select a single, defining seed collocation for each possible sense. Ex: “life” and “manufacturing” for target plant • Use words from dictionary definitions • Hand-label most frequent collocates
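The iterative procedure can be sketched as below; train_decision_list and the classify method returning a (sense, confidence) pair are hypothetical stand-ins for the decision-list learner of the previous slides.

```python
# Sketch of the Yarowsky bootstrapping loop: move residual examples labeled
# above the confidence threshold into the seed set, until the residual stabilizes.
def yarowsky_bootstrap(seed, residual, train_decision_list, threshold=0.95, max_iter=100):
    labeled = list(seed)                          # hand-tagged seed examples
    for _ in range(max_iter):
        dl = train_decision_list(labeled)         # 1. train on current seed set
        moved, remaining = [], []
        for example in residual:                  # 2. classify residual data
            sense, confidence = dl.classify(example)
            if confidence >= threshold:           # 3. confident -> new seed data
                moved.append((example, sense))
            else:
                remaining.append(example)
        if not moved:                             # convergence: residual set stable
            break
        labeled.extend(moved)                     # 4. retrain on the grown seed set
        residual = remaining
    return train_decision_list(labeled)
```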

  20. Bootstrapping Algorithm Seed set grows and residual set shrinks ….

  21. Bootstrapping Algorithm Convergence: Stop when residual set stabilizes

  22. One Sense per Discourse Algorithm can be improved by applying “One Sense per Discourse” constraint • After algorithm has converged: Identify tokens tagged with low confidence, label with dominant tag of that document • After each iteration: Extend tag to all examples in a single document after enough examples are tagged with a single sense
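A minimal sketch of the post-convergence variant, assuming each tagged token is a (doc_id, token_id, sense, confidence) tuple; the 0.9 threshold is an arbitrary illustrative value.

```python
# Sketch: relabel low-confidence tokens with the dominant sense of their document.
from collections import Counter, defaultdict

def one_sense_per_discourse(tagged_tokens, min_confidence=0.9):
    by_doc = defaultdict(list)
    for doc_id, token_id, sense, conf in tagged_tokens:
        by_doc[doc_id].append((token_id, sense, conf))
    relabeled = []
    for doc_id, tokens in by_doc.items():
        dominant = Counter(sense for _, sense, _ in tokens).most_common(1)[0][0]
        for token_id, sense, conf in tokens:
            final = sense if conf >= min_confidence else dominant
            relabeled.append((doc_id, token_id, final))
    return relabeled
```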

  23. Evaluation • Test corpus: extracted from 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.) • Performance of multiple models compared with: • supervised decision lists • unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

  24. Outline • Task definition • What does “minimally” supervised mean? • Bootstrapping algorithms • Co-training • Self-training • Yarowsky algorithm • Using the Web for Word Sense Disambiguation • Web as a corpus • Web as collective mind

  25. The Web as a Corpus • Use the Web as a large textual corpus • Build annotated corpora using monosemous relatives • Bootstrap annotated corpora starting with few seeds • Use the (semi)automatically tagged data to train WSD classifiers

  26. Monosemous Relatives • IDEA: determine a phrase (SP) which uniquely identifies the sense of a word (W#i) • 1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics • 2. Search the Internet using the Search Phrases from step 1 • 3. Replace the Search Phrases in the examples gathered at step 2 with W#i • Output: sense-annotated corpus for the word sense W#i
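A rough sketch of steps 2-3, assuming a hypothetical web_search(phrase) helper that returns text snippets containing the phrase; the heuristics that produce the search phrases are on the next two slides.

```python
# Sketch: gather Web examples for each search phrase and substitute the tagged
# target word W#i for the phrase, yielding a sense-tagged corpus.
import re

def build_sense_corpus(word, sense_id, search_phrases, web_search):
    corpus = []
    for sp in search_phrases:
        for snippet in web_search(sp):                       # step 2: query the Web
            tagged = re.sub(re.escape(sp), f"{word}#{sense_id}",
                            snippet, flags=re.IGNORECASE)    # step 3: substitute W#i
            corpus.append(tagged)
    return corpus    # examples annotated with the word sense W#i
```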

  27. Heuristics to Identify Monosemous Relatives • Heuristic 1 • Determine a monosemous synonym • remember#1 has recollect as monosemous synonym => SP=recollect • Heuristic 2 • Parse the gloss and determine the set of single phrase definitions • produce#5 has the definition “bring onto the market or release” => 2 definitions: “bring onto the market” and “release”; eliminate “release” as being ambiguous => SP=bring onto the market • Heuristic 3 • Parse the gloss and determine the set of single phrase definitions • Replace the stop words with the NEAR operator • Strengthen the query: concatenate the words from the current synset using the AND operator • produce#6 has the synset {grow, raise, farm, produce} and the definition “cultivate by growing” => SP=cultivate NEAR growing AND (grow OR raise OR farm OR produce)
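A minimal sketch of the query construction in Heuristic 3; the stop-word list is a tiny illustrative subset, and the NEAR/AND/OR syntax reflects the boolean operators of the search engines of that period.

```python
# Sketch: replace stop words in the gloss phrase with NEAR and append the
# synset words joined by OR, AND-ed to the gloss part.
STOP_WORDS = {"a", "an", "the", "by", "of", "in", "on", "to", "or", "some"}

def heuristic3_query(gloss_phrase, synset):
    tokens = [w if w.lower() not in STOP_WORDS else "NEAR"
              for w in gloss_phrase.split()]
    gloss_part = " ".join(tokens)
    while "NEAR NEAR" in gloss_part:              # collapse adjacent stop words
        gloss_part = gloss_part.replace("NEAR NEAR", "NEAR")
    return f"{gloss_part} AND ({' OR '.join(synset)})"

print(heuristic3_query("cultivate by growing", ["grow", "raise", "farm", "produce"]))
# -> cultivate NEAR growing AND (grow OR raise OR farm OR produce)
```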

  28. Heuristics to Identify Monosemous Relatives • Heuristic 4 • Parse the gloss and determine the set of single phrase definitions • Keep only the head phrase • Strengthen the query: concatenate the words from the current synset using the AND operator • company#5 has the synset {party, company} and the definition “band of people associated in some activity” => SP=band of people AND (company OR party)

  29. Example • Building annotated corpora for the noun interest.

  30. Example • Gather 5,404 examples • Check the first 70 examples => 67 correct; 95.7% accuracy. 1. I appreciate the genuine interest#1 which motivated you to write your message. 2. The webmaster of this site warrants neither accuracy, nor interest#2. 3. He forgives us not only for our interest#3, but for his own. 4. Interest#4 coverage, including rents, was 3.6x 5. As an interest#5, she enjoyed gardening and taking part into church activities. 6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand. 7. The Adam Smith Society is a new interest#7 organized within the APA.

  31. Experimental Evaluation • Tests on 20 words • 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings) • manually check the first 10 examples of each sense of a word => 91% accuracy • (Mihalcea 1999)

  32. Web-based Bootstrapping • Similar to the Yarowsky algorithm • Relies on data gathered from the Web 1. Create a set of seeds (phrases) consisting of: • Sense tagged examples in SemCor • Sense tagged examples from WordNet • Additional sense tagged examples, if available (created with the substitution method or the Open Mind method) • Phrase? • At least two open class words • Words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.) 2. Search the Web using queries formed with the seed expressions found at Step 1 • Add to the generated corpus a maximum of N text passages • (Mihalcea 2002)
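A minimal sketch of the "at least two open-class words" check on candidate seed phrases, assuming a hypothetical pos_tag(phrase) helper returning Penn Treebank-style (word, tag) pairs; detecting the syntactic relation (noun phrase, verb-object, etc.) is omitted.

```python
# Sketch: accept a candidate seed phrase only if it contains at least two
# open-class words (nouns, verbs, adjectives, adverbs).
OPEN_CLASS_PREFIXES = ("NN", "VB", "JJ", "RB")

def is_valid_seed(phrase, pos_tag):
    tags = [tag for _, tag in pos_tag(phrase)]
    open_class = [t for t in tags if t.startswith(OPEN_CLASS_PREFIXES)]
    return len(open_class) >= 2      # the semantic-relation check would be added here
```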

  33. The Web as Collective Mind • Two different views of the Web: • collection of Web pages • very large group of Web users • Millions of Web users can contribute their knowledge to a data repository • Open Mind Word Expert (Chklovski and Mihalcea, 2002) • Fast growth: • Started in April 2002 • Currently more than 100,000 examples of noun senses in several languages

  34. OMWE online: http://teach-computers.org

  35. Open Mind Word Expert: Quantity and Quality • Data • A mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus • Word senses • Based on WordNet definitions • Active learning to select the most informative examples for learning • Use two classifiers trained on existing annotated data • Select items where the two classifiers disagree for human annotation • Quality: • Two tags per item • One tag per item per contributor • Evaluations: • Agreement rates of about 65%, comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers • Replicability: tests on 1,600 examples of “interest” led to 90%+ replicability
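The disagreement-based selection can be sketched as follows, assuming two already-trained classifiers with a scikit-learn-style predict method:

```python
# Sketch: send to human contributors only the examples on which the two
# classifiers disagree (presumably the most informative ones to annotate).
def select_for_annotation(clf_a, clf_b, unlabeled):
    preds_a = clf_a.predict(unlabeled)
    preds_b = clf_b.predict(unlabeled)
    return [x for x, a, b in zip(unlabeled, preds_a, preds_b) if a != b]
```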

  36. References • (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002. • (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998. • (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD. • (Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003. • (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999. • (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002. • (Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004. • (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003. • (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000. • (Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001. • (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.
