
Word Sense Disambiguation




Presentation Transcript


1. Word Sense Disambiguation. German Rigau i Claramunt, http://www.lsi.upc.es/~rigau, TALP Research Center, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

2. WSD: Outline
• Setting
• Unsupervised WSD systems
• Supervised WSD systems
• Using the Web and EWN for WSD

3. Word Sense Disambiguation is the problem of assigning the appropriate meaning (sense) to a given word in a text. “WSD is perhaps the great open problem at the lexical level of NLP” (Resnik & Yarowsky 97). Resolving WSD would allow:
• acquisition of subcategorisation structure (parsing)
• improving existing Information Retrieval
• Machine Translation
• Natural Language Understanding
Using the Web and EWN for WSD: Setting

4. Example. Senses (WordNet 1.5):
age 1: the length of time something (or someone) has existed; "his age was 71"; "it was replaced because of its age"
age 2: a historic period; "the Victorian age"; "we live in a litigious age"
DSO corpus examples (Ng 96):
He was mad about stars at the >> age 1 << of nine.
About 20,000 years ago the last ice >> age 2 << ended.
Using the Web and EWN for WSD: Setting

5. Knowledge-Driven WSD (Unsupervised)
• knowledge-based WSD
• 100% coverage, 55% accuracy (SensEval-1)
• no training process
• large-scale lexical knowledge resources: WordNet, MRDs, thesauri
Using the Web and EWN for WSD: Setting

6. Data-Driven WSD (Supervised)
• corpus-based, statistical, machine-learning WSD
• no full coverage, 75% accuracy (SensEval-1)
• training process: learning from large amounts of sense-annotated corpora
• (Ng 97): an estimated effort of 16 person-years per language
Using the Web and EWN for WSD: Setting

7. Unsupervised Word Sense Disambiguation Systems. German Rigau i Claramunt, http://www.lsi.upc.es/~rigau, TALP Research Center, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

8. Unsupervised WSD Systems: Outline
• Setting
• Knowledge-driven WSD methods: MRDs; Thesauri & Corpus; LKBs; LKBs & Conceptual Distance; LKBs & Conceptual Density; LKBs & Corpus
• Experiments: Genus Sense Disambiguation
• Future Work

9. Knowledge-Driven (Unsupervised)
+ no need for large annotated corpora
+ tested on unrestricted domains (words and senses)
- worse results
Unsupervised WSD Systems: Setting

10. Lesk Method
• (Lesk 86): counting the word overlap between the context and the dictionary senses of the word
• (Cowie et al. 92): simulated annealing to overcome the combinatorial explosion, using LDOCE
• (Wilks & Stevenson 97): simulated annealing, 57% accuracy at the sense level
Unsupervised WSD Systems: MRDs
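A minimal sketch of the Lesk overlap idea in Python, assuming NLTK's WordNet as the sense inventory (the systems above used LDOCE; WordNet, the name simplified_lesk, and whitespace tokenisation are illustrative choices, not the original implementation):

    # Pick the sense whose dictionary gloss shares the most words with the context.
    from nltk.corpus import wordnet as wn

    def simplified_lesk(word, context_words):
        context = set(w.lower() for w in context_words)
        best_sense, best_overlap = None, -1
        for sense in wn.synsets(word):
            gloss = set(sense.definition().lower().split())
            overlap = len(gloss & context)   # Lesk's overlap count
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    print(simplified_lesk("bank", "the river bank was muddy after the flood".split()))

The simulated annealing of (Cowie et al. 92) replaces this per-word greedy choice with a global search over the sense assignments of all context words at once.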

11. Co-occurrence Word Vectors
• (Wilks et al. 93): word-context vectors from LDOCE, testing a large set of relatedness functions; 13 senses of the word bank, 45% accuracy
• (Rigau et al. 97): (noun) genus sense disambiguation, 60% accuracy
Unsupervised WSD Systems: MRDs
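A hedged sketch of the word-vector idea: build co-occurrence vectors from a corpus and score relatedness with cosine. The window size and the choice of cosine are assumptions for illustration; (Wilks et al. 93) tested a large set of relatedness functions:

    import math
    from collections import Counter, defaultdict

    def build_vectors(sentences, window=2):
        # vectors[w][c] = how often word c appears within `window` words of w
        vectors = defaultdict(Counter)
        for sent in sentences:
            for i, w in enumerate(sent):
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vectors[w][sent[j]] += 1
        return vectors

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

A sense is then chosen by comparing the vector built from a sense's definition words against the vector of the target word's context.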

12. Unsupervised WSD Systems: MRDs
371,616 connections extracted; sample of word pairs involving queso (the column labels are not preserved in the transcript; the rows appear to give two association scores, the pair frequency, the word pair, and each word's individual frequency):

Score 1   Score 2   Pair   Word 1        Word 2     Freq 1   Freq 2
11.8004   9.8       16     elaborado     queso      35       113
10.8938   8.0       23     pasta         queso      178      113
10.4846   7.5       25     leche         queso      274      113
10.2483   9.2       13     oveja         queso      45       113
9.1513    7.6       16     queso         sabor      113      160
7.4956    8.3       8      queso         tortilla   113      51
6.7732    7.5       8      queso         vaca       113      84
6.5830    6.1       12     maíz          queso      347      113
6.2208    8.9       5      queso         suero      113      21
6.1509    8.8       5      mantequilla   queso      22       113
6.1474    7.9       6      compacta      queso      50       113
5.9918    7.7       6      picante       queso      55       113
5.9002    9.8       4      manchego      queso      9        113
5.6805    7.3       6      cabra         queso      75       113
5.6300    5.9       9      pan           queso      287      113

13. Thesauri & Corpus
• (Yarowsky 92): uses Roget's Thesaurus to partition Grolier's Encyclopedia; 1,042 categories; 92% accuracy for 12 polysemous words
• (Yarowsky 95): seed words
• (Liddy & Paik 92): subject-code correlation matrix; 122 LDOCE semantic codes; 166 sentences from the Wall Street Journal; 89% correct subject code
Unsupervised WSD Systems: Thesauri & Corpus

14. LKBs & Conceptual Distance
• (Rada et al. 92): length of the shortest path
• (Sussna 93), (Agirre et al. 94)
• (Rigau 94; Rigau et al. 95, 97; Atserias et al. 97): length of the shortest path; specificity of the concepts
Unsupervised WSD Systems: LKBs & Conceptual Distance
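In NLTK's WordNet interface the shortest-path idea is directly available; a minimal illustration (the concept-specificity weighting of Rigau et al. is not reflected here):

    from nltk.corpus import wordnet as wn

    # path_similarity = 1 / (1 + length of the shortest hypernym path),
    # so a shorter path between concepts means a higher score.
    coin = wn.synset('coin.n.01')
    money = wn.synset('money.n.01')
    print(coin.path_similarity(money))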

15. (Agirre & Rigau 95, 96). Unsupervised WSD Systems: LKBs & Conceptual Density

16. (Agirre & Rigau 95, 96) Conceptual Density combines:
• the length of the shortest path
• the depth in the hierarchy
• concepts in a dense part of the hierarchy are relatively closer than those in a sparser region
• the measure should be independent of the number of concepts involved
Unsupervised WSD Systems: LKBs & Conceptual Density
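For reference, the published Conceptual Density formula (reconstructed from Agirre & Rigau 1996, not from the slide itself): nhyp is the mean number of hyponyms per node, descendants_c the number of concepts below c, and m the number of sense marks falling under c; the paper further smooths the exponent with a tuned parameter (nhyp^{i^{0.20}}):

    \mathrm{CD}(c, m) = \frac{\sum_{i=0}^{m-1} nhyp^{\,i}}{\mathrm{descendants}_c}

The word sense lying under the concept with the highest density is selected.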

17. LKBs & Corpus
• (Resnik 95): Information Content
• (Richardson et al. 94), (Jiang & Conrath 97)
Unsupervised WSD Systems: LKBs & Corpus
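Resnik's measure defines IC(c) = -log P(c), with P(c) estimated from a corpus, and scores two concepts by the IC of their most informative common subsumer. NLTK ships this directly; a minimal example (the 'ic-brown.dat' counts file comes with nltk_data):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')      # Brown-corpus frequency counts
    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
    print(dog.res_similarity(cat, brown_ic))      # IC of the lowest common subsumer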

18. Unsupervised WSD
• unrestricted WSD (100% coverage)
• eight heuristics (McRoy 92)
• combining several lexical resources
• combining several methods
Unsupervised WSD Systems: Experiments: Genus Sense Disambiguation

19. The eight heuristics (combined by voting, as sketched below):
0) Monosemous Genus Term
1) Entry Sense Ordering
2) Explicit Semantic Domain
3) Word Matching (Lesk 86)
4) Simple Concordance
5) Co-occurrence Word Vectors
6) Semantic Vectors
7) Conceptual Distance
Unsupervised WSD Systems: Experiments: Genus Sense Disambiguation
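A hedged sketch of one natural way to combine the heuristics by voting (the unweighted default and the function shape are assumptions for illustration, not the exact combination used in the experiments):

    def combine(heuristics, word, context, weights=None):
        # Each heuristic returns the set of candidate senses it supports;
        # the sense accumulating the largest (optionally weighted) vote wins.
        votes = {}
        for h in heuristics:
            w = (weights or {}).get(h.__name__, 1.0)
            for sense in h(word, context):
                votes[sense] = votes.get(sense, 0.0) + w
        return max(votes, key=votes.get) if votes else None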

20. Results. Unsupervised WSD Systems: Experiments: Genus Sense Disambiguation

21. Knowledge provided by each heuristic. Unsupervised WSD Systems: Experiments: Genus Sense Disambiguation

22. Supervised Word Sense Disambiguation Systems. German Rigau i Claramunt, http://www.lsi.upc.es/~rigau, TALP Research Center, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

23. WSD using ML algorithms: Outline
• Setting
• Methodology
• Machine Learning algorithms: Naive Bayes (Mooney 98); Snow (Dagan et al. 97); Exemplar-based (Ng 97); LazyBoosting (Escudero et al. 00)
• Experimental Results: Naive Bayes vs. Exemplar-based; Portability and Tuning of Supervised WSD
• Future Work

24. Data-Driven (Supervised)
+ better results
- need for large corpora: the knowledge acquisition bottleneck (Gale et al. 93; Ng 97)
- tested on limited domains (words and senses)
WSD using ML algorithms: Setting

25. WSD using ML algorithms: Setting
Current research lines: "opening the bottleneck"
• design of efficient example sampling methods (Engelson & Dagan 96; Fujii et al. 98)
• use of WordNet and the Web to automatically obtain examples (Leacock et al. 98; Mihalcea & Moldovan 99)
• use of unsupervised methods for estimating parameters (Pedersen & Bruce 98)

26. WSD using ML algorithms: Setting
Contradictory previous work:
• (Mooney 98)
+ Student's t-test of significance
+ n-fold cross-validation
- only the word "line", with 4,149 examples and 6 senses (Leacock et al. 93)
- neither parameter setting nor algorithm tuning
• (Ng 97)
+ large corpus (192,800 occurrences of 191 words)
- direct test (no n-fold cross-validation)
- small set of features

27. WSD using ML algorithms: Outline
• Setting
• Methodology
• Machine Learning algorithms: Naive Bayes (Mooney 98); Snow (Dagan et al. 97); Exemplar-based (Ng 97); LazyBoosting (Escudero et al. 00)
• Experimental Results: Naive Bayes vs. Exemplar-based; Portability and Tuning of Supervised WSD
• Future Work

28. WSD using ML algorithms: Methodology
Main goals:
• study supervised methods for WSD
• use them with "Automatically Extracted Examples from the Web using WordNet"
• rigorous direct comparisons
Supervised WSD methods:
• Naive Bayes: state-of-the-art accuracy (Mooney 98)
• Snow: from Text Categorization (Dagan et al. 97)
• Exemplar-based: state-of-the-art accuracy (Ng 97)
• Boosting: from Text Categorization (Schapire & Singer, to appear; Escudero, Màrquez & Rigau 2000)

29. WSD using ML algorithms: Methodology
Evaluation (Dietterich 98):
• 10-fold cross-validation
• Student's t-test of significance
Data:
• LDC corpus (Ng 96): 192,800 occurrences of 191 words (121 nouns + 70 verbs); average number of senses: 7.2 (N), 12.6 (V), 9.2 (all)
• WSJ Corpus (Corpus A)
• Brown Corpus (Corpus B)
Sets of attributes:
• Set A (Ng 97): small set of features, no broad-context attributes
• Set B (Ng 96): large set of features, broad-context attributes

30. WSD using ML algorithms: Outline
• Setting
• Methodology
• Machine Learning algorithms: Naive Bayes (Mooney 98); Snow (Dagan et al. 97); Exemplar-based (Ng 97); LazyBoosting (Escudero et al. 00)
• Experimental Results: Naive Bayes vs. Exemplar-based; Portability and Tuning of Supervised WSD
• Future Work

31. WSD using ML algorithms: Naive Bayes
• based on Bayes' theorem (Duda & Hart 73)
• frequencies used as probabilities
• assumed independence of example features
• smoothing technique (Ng 97)
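A compact sketch of Naive Bayes for WSD: P(sense | f1..fn) ∝ P(sense) Π P(fi | sense), with add-one smoothing standing in for the (Ng 97) smoothing mentioned above (an assumption made for simplicity):

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        # examples: list of (feature_list, sense) pairs
        sense_freq, feat_freq = Counter(), defaultdict(Counter)
        for feats, sense in examples:
            sense_freq[sense] += 1
            for f in feats:
                feat_freq[sense][f] += 1
        return sense_freq, feat_freq

    def classify_nb(feats, sense_freq, feat_freq, vocab_size):
        total = sum(sense_freq.values())
        def log_score(sense):
            s = math.log(sense_freq[sense] / total)           # prior
            n = sum(feat_freq[sense].values())
            for f in feats:                                   # independent features
                s += math.log((feat_freq[sense][f] + 1) / (n + vocab_size))
            return s
        return max(sense_freq, key=log_score)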

32. WSD using ML algorithms: Exemplar-based WSD
• k-NN approach (Ng 96; Ng 97), k = 3
• distances: Hamming; Modified Value Difference Metric, MVDM (Cost & Salzberg 93)
• variants: example weighting; attribute weighting (RLM 91)
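A minimal sketch of the exemplar-based classifier with the Hamming distance and k = 3 (the MVDM metric and the example/attribute-weighting variants are omitted):

    from collections import Counter

    def hamming(a, b):
        # number of attribute positions where the two examples disagree
        return sum(x != y for x, y in zip(a, b))

    def knn_classify(example, training, k=3):
        # training: list of (attribute_vector, sense) pairs;
        # majority vote among the k nearest stored exemplars
        nearest = sorted(training, key=lambda te: hamming(example, te[0]))[:k]
        return Counter(sense for _, sense in nearest).most_common(1)[0][0]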

33. WSD using ML algorithms: Snow
• Snow (Golding & Roth 99): Sparse Network of Winnows, an on-line learning system
• Winnow (Littlestone 88): a linear threshold unit with mistake-driven updates (weights change only when the predicted class is wrong)

34. WSD using ML algorithms: Snow
Architecture figure: one Winnow node per sense (Winnow for sense 1, Winnow for sense 2, ...), fed by active features such as w-1 = "average", w+1 = "of", w+2 = "42", w+2 = "nuclear"; a MAX node selects the sense with the highest activation. Example contexts: "... an average <age_1> of 42 ..." and "... in this <age_2> of nuclear ...".
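A hedged sketch of a single Winnow node, as used per sense in SNoW: a linear threshold unit over the active features with multiplicative, mistake-driven updates (the promotion/demotion factors and the threshold are illustrative values, not SNoW's):

    class Winnow:
        def __init__(self, alpha=2.0, beta=0.5, threshold=1.0):
            self.w = {}        # one weight per feature, initialised to 1 on first use
            self.alpha, self.beta, self.threshold = alpha, beta, threshold

        def activation(self, features):
            return sum(self.w.setdefault(f, 1.0) for f in features)

        def update(self, features, is_positive):
            predicted = self.activation(features) > self.threshold
            if predicted != is_positive:     # mistake-driven: change only on errors
                factor = self.alpha if is_positive else self.beta
                for f in features:           # promote or demote active features only
                    self.w[f] = self.w.setdefault(f, 1.0) * factor

At disambiguation time one such node is trained per sense, and the MAX over the nodes' activations picks the winner, as in the figure above.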

35. WSD using ML algorithms: Boosting
• AdaBoost.MH (Freund & Schapire 00)
• combines many simple weak classifiers (hypotheses)
• weak classifiers are trained sequentially
• each iteration concentrates on the most difficult cases
+ results: better than NB and EB
- problem: computational complexity; time and space grow linearly with the number of examples
+ solution: LazyBoosting!
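For reference, the standard AdaBoost.MH re-weighting step (in the Schapire & Singer formulation, written here for completeness): after round t picks weak hypothesis h_t with weight \alpha_t, the distribution over (example i, label \ell) pairs becomes

    D_{t+1}(i,\ell) = \frac{D_t(i,\ell)\,\exp\bigl(-\alpha_t\, Y_i[\ell]\, h_t(x_i,\ell)\bigr)}{Z_t}

where Y_i[\ell] is +1 or -1 depending on whether \ell is a correct label for example i and Z_t normalises. Pairs that h_t gets wrong gain weight, which is how each iteration concentrates on the most difficult cases.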

36. WSD using ML algorithms: Outline
• Setting
• Methodology
• Machine Learning algorithms: Naive Bayes (Mooney 98); Snow (Dagan et al. 97); Exemplar-based (Ng 97); LazyBoosting (Escudero et al. 00)
• Experimental Results: Naive Bayes vs. Exemplar-based; Portability and Tuning of Supervised WSD
• Future Work

37. WSD using ML algorithms: Experimental Results (LazyBoosting)
Features from Set A (Ng 97): w-2, w-1, w+1, w+2, (w-2, w-1), (w-1, w+1), (w+1, w+2)
15 reference words (10 N, 5 V)

Averages:
              senses   examples   attributes
nouns (121)   8.6      1040       3978
verbs (70)    17.9     1266       4432
total (191)   12.1     1115       4150

Accuracy (%):
              MFS    NB     EB1    EB15   AB750  ABSC
nouns (121)   57.4   71.7   65.8   71.1   73.5   73.4
verbs (70)    46.6   57.6   51.1   58.1   59.3   59.1
total (191)   53.3   66.4   60.2   66.2   68.1   68.0

38. WSD using ML algorithms: Experimental Results (LazyBoosting)
Accelerating the weak learner. Reducing the feature space:
• frequency filtering (Freq): discard features occurring fewer than N times
• local frequency filtering (LFreq): select the N most frequent features of each sense
• RLM ranking (López de Mántaras 91): select the N most relevant features
Reducing the number of attributes examined:
• LazyBoosting: a small proportion of attributes is randomly selected at each iteration (sketched below)
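The LazyBoosting modification in a few lines (a sketch assuming weak rules that test a single attribute; score stands for whatever weighted-quality criterion the weak learner uses):

    import random

    def select_weak_rule(attributes, score, fraction=0.10):
        # Examine only a random fraction of the attributes at this boosting round;
        # score(attr) returns the quality of the best weak rule over that attribute.
        sample = random.sample(attributes, max(1, int(fraction * len(attributes))))
        return max(sample, key=score)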

39. WSD using ML algorithms: Experimental Results (LazyBoosting)
Accelerating the weak learner:
• all methods perform quite well: many irrelevant attributes in the domain
• LFreq is slightly better than Freq
• RLM performs better than LFreq and Freq
• LazyBoosting is better than all the other methods: acceptable performance when exploring only 1% of the attributes for each weak rule; exploring 10% achieves the same performance as 100%, 7 times faster!

40. WSD using ML algorithms: Experimental Results (LazyBoosting)
7 features from Set A (Ng 97): w-2, w-1, w+1, w+2, (w-2, w-1), (w-1, w+1), (w+1, w+2)
15 reference words (10 N, 5 V)

Averages:
              senses   examples   attributes
nouns (121)   8.6      1040       3978
verbs (70)    17.9     1266       4432
total (191)   12.1     1115       4150

Accuracy (%):
              MFS    NB     EB15   LB10SC
nouns (121)   56.4   68.7   68.0   70.8
verbs (70)    46.7   64.8   64.9   67.5
total (191)   52.3   67.1   66.7   69.5

41. WSD using ML algorithms: Experimental Results (NB vs. EB)
Experiments on Set A with 15 words. Results and conclusions:
• NB and EB are better than MFS
• k-NN performs better with k > 1
• the variants of EB improve on plain EB
• the MVDM(cs) metric is better than the Hamming distance
• EB performs better than NB

42. WSD using ML algorithms: Experimental Results (NB vs. EB)
Experiments on Set B with 15 words. What happened?
• a problem with the binary representation of the broad-context attributes
• examples are represented as sparse vectors (5,000 positions)
• any two examples coincide in the majority of their values
• this biases the similarity measure in favour of the shortest sentences
Related work "clarified":
• (Mooney 98): poor results of the k-NN algorithm
• (Ng 96; Ng 97): lower results for a system with a large number of attributes

43. WSD using ML algorithms: Experimental Results (NB vs. EB)
Improving both methods, NB and EB (Escudero et al. 00b):
• use only positive information
• treat the broad-context attributes as multivalued attributes; given two such values, the similarity S between them has to be redefined (a hedged sketch follows below)
• this representation allows a very computationally efficient implementation: Positive Naive Bayes (PNB) and Positive Exemplar-based (PEB)
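The exact redefinition of S did not survive in this transcript; under the set representation just described, one natural (assumed) reading is that similarity counts only shared positive values, e.g.:

    def positive_similarity(v1, v2):
        # v1, v2: the sets of words occurring in the broad context of each example;
        # matches on *absent* words contribute nothing (positive information only)
        return len(set(v1) & set(v2))

This avoids the 5,000-position sparse binary vectors and the bias toward short sentences noted on the previous slide.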

44. WSD using ML algorithms: Experimental Results (NB vs. EB)
Experiments on Set B with 15 words. Results and conclusions:
• PEB improves the accuracy of EB by 12.2 points
• PEB accuracy is higher than on Set A, except for PEBh,10,e,a
• PNB is at least as accurate as NB
• the positive approach greatly increases the efficiency of the algorithms (80 times for NB, 15 times for EB)
• PEB accuracy is higher than PNB's

45. WSD using ML algorithms: Experimental Results (NB vs. EB)
Global results (191 words). Conclusions:
• on Set A, the best option is Exemplar-based using the MVDM metric
• on Set B, the best option is Exemplar-based using the Hamming distance and example weighting; the MVDM metric yields higher accuracy but is currently computationally prohibitive
• Positive Exemplar-based allows the addition of unordered contextual attributes with an accuracy improvement
• positive information greatly improves efficiency

46. WSD using ML algorithms: Experimental Results (Portability)
• 15 features from Set A (Ng 96): p-3, p-2, p-1, p+1, p+2, p+3, w-1, w+1, (w-2, w-1), (w-1, w+1), (w+1, w+2), (w-3, w-2, w-1), (w-2, w-1, w+1), (w-1, w+1, w+2), (w+1, w+2, w+3)
• 21 reference words (13 N, 8 V)
• DSO Corpus: Wall Street Journal (Corpus A), Brown Corpus (Corpus B)
• 7 combinations of training-test sets: A+B-A+B, A+B-A, A+B-B, A-A, B-B, A-B, B-A
• the number of examples from corpora A and B is forced to be the same (reducing the larger to the size of the smaller)

47. WSD using ML algorithms: Experimental Results (Portability)
First experiment (% accuracy):

Method   A+B-A+B   A+B-A   A+B-B
MFC      46.6      53.0    39.2
NB       61.6      67.3    55.9
EB       63.0      69.0    57.0
Snow     60.1      65.6    56.3
LB       66.3      71.8    60.9

Method   A-A    B-B    A-B    B-A
MFC      56.0   45.5   36.4   38.7
NB       65.9   56.8   41.4   47.7
EB       69.0   57.4   45.3   51.1
Snow     67.1   56.1   44.1   49.8
LB       71.3   59.0   47.1   52.0

48. WSD using ML algorithms: Experimental Results (Portability)
Conclusions of the first experiment:
• LazyBoosting outperforms all the other methods in all cases
• the knowledge acquired from a single corpus almost covers the knowledge obtained by combining both corpora
• very disappointing results!
• looking at Kappa values: NB is the most similar to MFC; LB is the most similar to DSO and the most dissimilar to MFC

49. WSD using ML algorithms: Experimental Results (Portability)
Second experiment: adding tuning material
• B+%A-A, A+%B-B, %A-A, %B-B
• the added percentage ranges from 10% to 50% (the remaining 50% is held out for testing)
• for NB, EB and Snow it is not worth keeping the original corpus
• LB shows a moderate (but consistent) improvement when retaining the original training set

50. WSD using ML algorithms: Experimental Results (Portability)
Third experiment. Two main reasons for the poor portability:
• corpora A and B have very different sense distributions
• examples from corpora A and B contain different information
New sense-balanced corpus:
• the number of examples of each sense in corpora A and B is forced to be the same (reducing the larger to the size of the smaller)
