
Word Sense Disambiguation



Presentation Transcript


  1. Word Sense Disambiguation CS 4705

  2. Overview • Selectional restriction based approaches • Robust techniques • Machine Learning • Supervised • Unsupervised • Dictionary-based techniques

  3. Disambiguation via Selectional Restrictions • A step toward semantic parsing • Different verbs select for different thematic roles wash the dishes (takes washable-thing as patient) serve delicious dishes (takes food-type as patient) • Method: rule-to-rule syntactico-semantic analysis • Semantic attachment rules are applied as sentences are syntactically parsed VP --> V NP V serve <theme> {theme:food-type} • Selectional restriction violation: no parse
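The rule-to-rule check above can be sketched in a few lines, with a toy IS-A table standing in for WordNet and a hypothetical one-entry lexicon (`RESTRICTIONS`, `parses`, and the word list are all illustrative, not part of any real system):

```python
# Toy IS-A hierarchy standing in for WordNet; entries here are illustrative.
HYPERNYMS = {"ragout": "food", "food": "entity",
             "crockery": "dish", "dish": "entity"}

RESTRICTIONS = {"serve": "food"}   # hypothetical lexicon: serve takes a food-type theme

def is_a(word, cls):
    """True if cls is word itself or one of its hypernyms."""
    while word is not None:
        if word == cls:
            return True
        word = HYPERNYMS.get(word)
    return False

def parses(verb, theme):
    """A parse survives only if the theme satisfies the verb's restriction."""
    required = RESTRICTIONS.get(verb)
    return required is None or is_a(theme, required)

print(parses("serve", "ragout"))    # ragout IS-A food: parse accepted
print(parses("serve", "crockery"))  # crockery IS-A dish, not food: no parse
```

Rejecting the parse outright is exactly the brittleness the next slides address: a metaphorical object ("eat my hat") kills the analysis entirely.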

  4. Requires: • Write selectional restrictions for each sense of each predicate – or use FrameNet • Serve alone has 15 verb senses • Hierarchical type information about each argument (a la WordNet) • How many hypernyms does dish have? • How many lexemes are hyponyms of dish? • But also: • Sometimes selectional restrictions don’t restrict enough (Which dishes do you like?) • Sometimes they restrict too much (Eat dirt, worm! I’ll eat my hat!)

  5. Can we take a more statistical approach? How likely is dish/crockery to be the object of serve? dish/food? • A simple approach (baseline): predict the most likely sense • Why might this work? • When will it fail? • A better approach: learn from a tagged corpus • What needs to be tagged? • An even better approach: Resnik’s selectional association (1997, 1998) • Estimate conditional probabilities of word senses from a corpus tagged only with verbs and their arguments (e.g. ragout is an object of serve -- Jane served/V ragout/Obj)
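The most-frequent-sense baseline in the first bullet can be sketched directly; the sense-tagged pairs below are toy data, not from a real corpus:

```python
from collections import Counter

def most_frequent_sense_baseline(tagged_corpus):
    """Baseline: for each word, always predict the sense it carries
    most often in a sense-tagged training corpus."""
    counts = {}
    for word, sense in tagged_corpus:
        counts.setdefault(word, Counter())[sense] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [("bass", "fish"), ("bass", "fish"), ("bass", "music"), ("dish", "food")]
baseline = most_frequent_sense_baseline(corpus)
print(baseline["bass"])   # fish
```

This is why it can work (sense distributions are skewed) and why it fails: it answers "fish" for every occurrence of bass, including the musical ones.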

  6. How do we get the word sense probabilities? • For each verb object (e.g. ragout) • Look up hypernym classes in WordNet • Distribute “credit” for this object sense occurring with this verb among all the classes to which the object belongs Brian served/V the dish/Obj Jane served/V food/Obj • If ragout has N hypernym classes in WordNet, add 1/N to each class count (including food) as object of serve • If tureen has M hypernym classes in WordNet, add 1/M to each class count (including dish) as object of serve • Pr(Class|v) = count(Class,v)/count(v) • How can this work? • Ambiguous words have many superordinate classes John served food/the dish/tuna/curry • There is a common sense among these which gets “credit” in each instance, eventually dominating the likelihood score
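The credit-splitting scheme can be sketched as follows; the hypernym sets are a toy stand-in for WordNet lookups:

```python
from collections import defaultdict

# Toy hypernym sets per noun; a real system would read these from WordNet.
CLASSES = {
    "ragout": ["food", "entity"],
    "tureen": ["dish", "entity"],
    "food":   ["food", "entity"],
}

def class_probs(objects_of_verb):
    """Distribute 'credit' for each observed object over all its hypernym
    classes (1/N each), then normalize: Pr(Class|v) = count(Class,v)/count(v)."""
    counts = defaultdict(float)
    for obj in objects_of_verb:
        classes = CLASSES[obj]
        for c in classes:
            counts[c] += 1.0 / len(classes)
    total = len(objects_of_verb)
    return {c: n / total for c, n in counts.items()}

# Objects observed with 'serve' in a (toy) parsed corpus.
probs = class_probs(["ragout", "tureen", "food"])
```

Even in this tiny example the shared superordinate classes (entity, food) accumulate credit from several objects and come to dominate, which is the mechanism the slide describes.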

  7. To determine most likely sense of ‘bass’ in Bill served bass • Having previously assigned ‘credit’ for the occurrence of all hypernyms of things like fish and things like musical instruments to all their hypernym classes (e.g. ‘fish’ and ‘musical instruments’) • Find the hypernym classes of bass (including fish and musical instruments) • Choose the class C with the highest probability, given that the verb is serve • Results: • Baselines: • random choice of word sense is 26.8% • choose most frequent sense (NB: requires sense-labeled training corpus) is 58.2% • Resnik’s: 44% correct with only pred/arg relations labeled

  8. Machine Learning Approaches • Learn a classifier to assign one of possible word senses for each word • Acquire knowledge from labeled or unlabeled corpus • Human intervention only in labeling corpus and selecting set of features to use in training • Input: feature vectors • Target (dependent variable) • Context (set of independent variables) • Output: classification rules for unseen text

  9. Supervised Learning • Training and test sets with words labeled as to correct sense (It was the biggest [fish: bass] I’ve seen.) • Obtain values of independent variables automatically (POS, co-occurrence information, …) • Run classifier on training data • Test on test data • Result: Classifier for use on unlabeled data

  10. Input Features for WSD • POS tags of target and neighbors • Surrounding context words (stemmed or not) • Punctuation, capitalization and formatting • Partial parsing to identify thematic/grammatical roles and relations • Collocational information: • How likely are target and left/right neighbor to co-occur • Co-occurrence of neighboring words • Intuition: how often does sea co-occur with bass?

  11. How do we proceed? • Look at a window around the word to be disambiguated, in training data • Which features accurately predict the correct tag? • Can you think of other features that might be useful in general for WSD? • Input to learner, e.g. Is the bass fresh today? [w-2, w-2/pos, w-1, w-1/pos, w+1, w+1/pos, w+2, w+2/pos, …] = [is, V, the, DET, fresh, ADJ, today, N, …]
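Extracting that windowed feature vector is mechanical; a minimal sketch (the `<pad>` token for off-the-edge positions is an implementation choice, not from the slide):

```python
def window_features(tokens, pos_tags, i, size=2):
    """Feature vector [w-2, w-2/pos, ..., w+2, w+2/pos] around target index i."""
    feats = []
    for off in range(-size, size + 1):
        if off == 0:
            continue                      # skip the target word itself
        j = i + off
        if 0 <= j < len(tokens):
            feats += [tokens[j], pos_tags[j]]
        else:
            feats += ["<pad>", "<pad>"]   # window falls off the sentence edge
    return feats

tokens = ["is", "the", "bass", "fresh", "today"]
tags   = ["V", "DET", "N", "ADJ", "N"]
print(window_features(tokens, tags, 2))
# ['is', 'V', 'the', 'DET', 'fresh', 'ADJ', 'today', 'N']
```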

  12. Types of Classifiers • Naïve Bayes • ŝ = argmax_s p(s|V), where s is one of the possible senses and V the input vector of features • By Bayes’ rule, p(s|V) = p(V|s)p(s)/p(V) • Assume the features are independent, so the probability of V given s is the product of the probabilities of each feature given s: p(V|s) = ∏_j p(v_j|s) • p(V) is the same for every candidate ŝ • Then ŝ = argmax_s p(s) ∏_j p(v_j|s)
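The Naïve Bayes decision rule above can be sketched end to end; the training pairs are toy data, and the add-one smoothing is an assumption added so unseen features don't zero out a sense:

```python
from collections import Counter, defaultdict
from math import log

def train_nb(examples):
    """examples: list of (feature_list, sense). Returns sense priors
    and per-sense feature counts."""
    prior = Counter()
    cond = defaultdict(Counter)
    for feats, sense in examples:
        prior[sense] += 1
        for f in feats:
            cond[sense][f] += 1
    return prior, cond

def classify(feats, prior, cond):
    """s_hat = argmax_s p(s) * prod_j p(v_j|s), computed in log space
    with add-one smoothing."""
    total = sum(prior.values())
    vocab = {f for c in cond.values() for f in c}
    best, best_lp = None, float("-inf")
    for s in prior:
        lp = log(prior[s] / total)
        denom = sum(cond[s].values()) + len(vocab)
        for f in feats:
            lp += log((cond[s][f] + 1) / denom)
        if lp > best_lp:
            best, best_lp = s, lp
    return best

train = [(["sea", "fishing"], "bass-fish"),
         (["fresh", "sea"], "bass-fish"),
         (["guitar", "player"], "bass-music")]
prior, cond = train_nb(train)
print(classify(["sea"], prior, cond))      # bass-fish
print(classify(["guitar"], prior, cond))   # bass-music
```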

  13. Rule Induction Learners (e.g. Ripper) • Given a feature vector of values for independent variables associated with observations of values for the training set (e.g. [fishing,NP,3,…] + bass2) • Produce a set of rules that perform best on the training data, e.g. • bass2 if w-1==‘fishing’ & pos==NP • …

  14. Decision Lists • Like case statements, applying tests to the input in turn fish within window --> bass1 striped bass --> bass1 guitar within window --> bass2 bass player --> bass2 … • Yarowsky (’96) orders the tests by their individual accuracy on the entire training set, based on the log-likelihood ratio
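A minimal sketch of such a decision list, assuming a binary sense distinction and add-0.1 smoothing so zero counts stay finite (both are assumptions; Yarowsky's actual smoothing differs):

```python
from collections import Counter, defaultdict
from math import log

def build_decision_list(examples):
    """Order one-feature tests by a smoothed log-likelihood ratio,
    Yarowsky-style; assumes exactly two senses."""
    s1, s2 = sorted({sense for _, sense in examples})
    counts = defaultdict(Counter)
    for feats, sense in examples:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        a, b = c[s1] + 0.1, c[s2] + 0.1          # add-0.1 smoothing
        rules.append((abs(log(a / b)), f, s1 if a > b else s2))
    return sorted(rules, reverse=True)            # strongest test first

def classify(feats, rules, default):
    """Like a case statement: the first matching test decides the sense."""
    for _, f, sense in rules:
        if f in feats:
            return sense
    return default

train = [(["fish", "sea"], "bass1"), (["striped"], "bass1"),
         (["guitar"], "bass2"), (["guitar", "player"], "bass2")]
rules = build_decision_list(train)
print(classify(["guitar", "solo"], rules, "bass1"))   # bass2
```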

  15. Bootstrapping I • Start with a few labeled instances of the target item as seeds to train an initial classifier, C • Use high-confidence classifications of C on unlabeled data as new training data • Iterate • Bootstrapping II • Start with sentences containing words strongly associated with each sense (e.g. sea and music for bass), chosen intuitively, from a corpus, or from dictionary entries • One Sense per Discourse hypothesis: a word tends to keep the same sense throughout a single discourse
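The seed-and-grow loop can be sketched with a deliberately crude "classifier" (a set of sense-indicating collocates); everything here, including the adoption rule for new collocates, is a simplified stand-in for the real algorithm:

```python
from collections import defaultdict

def bootstrap(contexts, seeds, rounds=3):
    """Yarowsky-style sketch: label contexts that contain exactly one known
    sense-indicating collocate, adopt words that only ever appear with one
    sense as new collocates, and repeat."""
    colloc = dict(seeds)                  # word -> sense it indicates
    labels = {}                           # context index -> assigned sense
    for _ in range(rounds):
        for i, ctx in enumerate(contexts):
            hits = {colloc[w] for w in ctx if w in colloc}
            if len(hits) == 1:            # unambiguous evidence only
                labels[i] = hits.pop()
        evidence = defaultdict(set)
        for i, sense in labels.items():
            for w in contexts[i]:
                evidence[w].add(sense)
        for w, senses in evidence.items():
            if len(senses) == 1:
                colloc.setdefault(w, next(iter(senses)))
    return labels

# Context-word lists for occurrences of 'bass'; seeds as in the slide.
contexts = [["sea", "caught"], ["caught", "striped"], ["music", "play"], ["play", "loud"]]
labels = bootstrap(contexts, {"sea": "fish", "music": "music"})
```

Note how the second and fourth contexts contain no seed at all, yet get labeled in round two via the newly adopted collocates caught and play.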

  16. Unsupervised Learning • Cluster feature vectors to ‘discover’ word senses using some similarity metric (e.g. cosine distance) • Represent each cluster as average of feature vectors it contains • Label clusters by hand with known senses • Classify unseen instances by proximity to these known and labeled clusters • Evaluation problem • What are the ‘right’ senses?
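The cluster-then-label step can be sketched as nearest-centroid classification under cosine similarity; the three feature dimensions and the hand labels are toy assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Represent a cluster as the average of the feature vectors it contains."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hand-labeled clusters of context vectors; dims: counts of [sea, fishing, guitar].
clusters = {
    "bass-fish":  centroid([[2, 1, 0], [1, 2, 0]]),
    "bass-music": centroid([[0, 0, 3], [0, 1, 2]]),
}

def nearest_sense(vec):
    """Classify an unseen instance by cosine proximity to the labeled clusters."""
    return max(clusters, key=lambda s: cosine(vec, clusters[s]))

print(nearest_sense([1, 0, 0]))   # bass-fish
```

The evaluation problem on the slide shows up immediately: nothing in the clustering itself tells you that two clusters is the right number, or that they align with dictionary senses.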

  17. Cluster impurity • How do you know how many clusters to create? • Some clusters may not map to ‘known’ senses

  18. Dictionary Approaches • Problem of scale for all ML approaches • Build a classifier for each sense ambiguity • Machine readable dictionaries (Lesk ‘86) • Retrieve all definitions of content words occurring in context of target (e.g. the happy seafarer ate the bass) • Compare for overlap with sense definitions of target entry (bass2: a type of fish that lives in the sea) • Choose sense with most overlap • Limits: Entries are short --> expand entries to ‘related’ words
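The Lesk overlap count can be sketched with a three-entry toy dictionary (the entries and stop-word list are illustrative, not from a real MRD):

```python
DICT = {   # toy machine-readable dictionary; entries are hypothetical
    "seafarer": "a person who travels by sea",
    "ate": "past tense of eat consume food",
    "happy": "feeling pleasure",
}
SENSES = {
    "bass1": "a type of fish that lives in the sea",
    "bass2": "the lowest part in polyphonic music",
}
STOP = {"a", "the", "of", "that", "in", "by", "who", "to"}

def lesk(context_words, senses, dictionary):
    """Pool the definitions of the context's content words, then pick the
    target sense whose own definition overlaps that pool the most."""
    pool = set()
    for w in context_words:
        pool |= set(dictionary.get(w, "").split())
    pool -= STOP
    def overlap(gloss):
        return len(pool & (set(gloss.split()) - STOP))
    return max(senses, key=lambda s: overlap(senses[s]))

print(lesk(["happy", "seafarer", "ate"], SENSES, DICT))   # bass1
```

Here the overlap comes entirely from sea appearing in both the seafarer entry and the bass1 gloss, which also illustrates the limit on the slide: with such short entries, a single shared word often has to carry the whole decision.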

  19. Summary • Many useful approaches developed to do WSD • Supervised and unsupervised ML techniques • Novel uses of existing resources (WN, dictionaries) • Future • More tagged training corpora becoming available • New learning techniques being tested, e.g. co-training • Next class: • Ch 17:3-5
