Word sense disambiguation (2)

Instructor: Paul Tarau, based on Rada Mihalcea’s original slides

Note: Some of the material in this slide set was adapted from a tutorial given by Rada Mihalcea & Ted Pedersen at ACL 2005.

What is Supervised Learning?

  • Collect a set of examples that illustrate the various possible classifications or outcomes of an event.

  • Identify patterns in the examples associated with each particular class of the event.

  • Generalize those patterns into rules.

  • Apply the rules to classify a new event.

Learn from these examples: “When do I go to the store?”

Task Definition: Supervised WSD

  • Supervised WSD: a class of methods that induce a classifier from manually sense-tagged text using machine learning techniques.

  • Resources

    • Sense Tagged Text

    • Dictionary (implicit source of sense inventory)

    • Syntactic Analysis (POS tagger, Chunker, Parser, …)

  • Scope

    • Typically one target word per context

    • Part of speech of target word resolved

    • Lends itself to “targeted word” formulation

  • Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs

Simple Supervised Approach

  • Given a sentence S containing “bank”:

  • Sense_1 = 0; Sense_2 = 0;

  • For each word Wi in S

    • If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1;

    • If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1;

  • If Sense_1 > Sense_2 then print “Financial”

  • else if Sense_2 > Sense_1 then print “River”

  • else print “Can’t Decide”;
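
The counting scheme above can be sketched in Python as follows. This is only an illustration: the contents of the two bags are made up here (the slides do not list them), and the tokenization is deliberately crude.

```python
# Hypothetical hand-built bags of indicator words for each sense of "bank".
# The slides name the bags but do not list their contents.
FINANCIAL_BANK_BAG = {"money", "loan", "check", "interest", "account", "credit"}
RIVER_BANK_BAG = {"river", "water", "fish", "shore", "muddy", "stream"}


def disambiguate_bank(sentence: str) -> str:
    """Count bag hits per sense and return the sense with more hits."""
    sense_1 = 0  # financial sense
    sense_2 = 0  # river sense
    for word in sentence.lower().split():
        word = word.strip(".,;:!?\"'")  # drop surrounding punctuation
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"


print(disambiguate_bank("The bank issued a check for the amount of interest."))
print(disambiguate_bank("My grandfather used to fish along the banks of the Mississippi River."))
```

A sentence with no bag words (or equally many from each bag) falls through to “Can't Decide”, which is the main weakness of this approach.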

Supervised Methodology

  • Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.

    • One tagged word per instance/lexical sample disambiguation

  • Select a set of features with which to represent context.

    • co-occurrences, collocations, POS tags, verb-obj relations, etc...

  • Convert sense-tagged training instances to feature vectors.

  • Apply a machine learning algorithm to induce a classifier.

    • Form – structure or relation among features

    • Parameters – strength of feature interactions

  • Convert a held out sample of test data into feature vectors.

    • “correct” sense tags are known but not used

  • Apply classifier to test instances to assign a sense tag.

From Text to Feature Vectors

  • My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)

  • The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)
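
One possible encoding of S1 and S2 as feature vectors is sketched below. The feature set is an assumption chosen for illustration: the part-of-speech tags at positions -2, -1, +1, +2 around the target word, plus presence indicators for four keywords (the `KEYWORDS` list is hypothetical).

```python
# Tagged sentences S1 and S2; the target word carries its sense tag.
S1 = ("My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv "
      "the/det banks/SHORE of/prep the/det Mississippi/noun River/noun")
S2 = ("The/det bank/FINANCE issued/verb a/det check/noun for/prep "
      "the/det amount/noun of/prep interest/noun")

SENSES = {"SHORE", "FINANCE"}
KEYWORDS = ["fish", "check", "river", "interest"]  # assumed keyword features


def to_feature_vector(tagged_sentence):
    """Return (features, sense) for one sense-tagged sentence."""
    tokens = [tok.rsplit("/", 1) for tok in tagged_sentence.split()]
    words = [w.lower() for w, _ in tokens]
    tags = [t for _, t in tokens]
    target = next(i for i, t in enumerate(tags) if t in SENSES)

    features = {}
    # Local features: POS tags in a +/-2 window around the target word.
    for offset in (-2, -1, 1, 2):
        pos = target + offset
        features[f"P{offset:+d}"] = tags[pos] if 0 <= pos < len(tags) else None
    # Global features: does the keyword occur anywhere in the sentence?
    for kw in KEYWORDS:
        features[kw] = kw in words
    return features, tags[target]


for sentence in (S1, S2):
    print(to_feature_vector(sentence))
```

Positions that fall outside the sentence (for example, P-2 in S2) are encoded as `None`.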

Supervised Learning Algorithms

  • Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:

    • Support Vector Machines

    • Nearest Neighbor Classifiers

    • Decision Trees

    • Decision Lists

    • Naïve Bayesian Classifiers

    • Perceptrons

    • Neural Networks

    • Graphical Models

    • Log Linear Models

Naïve Bayesian Classifier

  • The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)

  • …Word Sense Disambiguation is no exception

  • Assumes conditional independence among features, given the sense of a word.

    • The form of the model is assumed, but parameters are estimated from training instances

  • When applied to WSD, features are often “a bag of words” that come from the training data

    • Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)

Bayesian Inference

  • Given the observed features, what is the most likely sense?

  • Estimate the probability of the observed features given the sense

  • Estimate the unconditional probability of the sense

  • The unconditional probability of the features is a normalizing term and does not affect sense classification
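
Written out, the points above correspond to Bayes’ rule followed by the naïve conditional-independence assumption. The notation (features F_1 … F_n, sense S) follows the slides; the arg max formulation is the standard one and is given here as a sketch:

```latex
\hat{S} = \arg\max_{S} P(S \mid F_1, \dots, F_n)
        = \arg\max_{S} \frac{P(F_1, \dots, F_n \mid S)\, P(S)}{P(F_1, \dots, F_n)}
        = \arg\max_{S} P(S) \prod_{i=1}^{n} P(F_i \mid S)
```

The denominator is the same for every sense, which is why it can be dropped when choosing the most likely one.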

The Naïve Bayesian Classifier

  • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    • P(S=1) = 1,500/2,000 = .75

    • P(S=2) = 500/2,000 = .25

  • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.

    • P(F1=“credit”) = 204/2,000 = .102

    • P(F1=“credit”|S=1) = 200/1,500 = .133

    • P(F1=“credit”|S=2) = 4/500 = .008

  • Given a test instance that has one feature “credit”

    • P(S=1|F1=“credit”) = .133*.75/.102 = .978

    • P(S=2|F1=“credit”) = .008*.25/.102 = .020
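
The same calculation, as a short Python sketch using the counts above. Because the code does not round the intermediate value .133, it gives roughly .980 for bank/1 rather than .978.

```python
# Counts from the example: 2,000 instances of "bank".
n_total = 2000
n_sense = {1: 1500, 2: 500}      # bank/1 (financial), bank/2 (river)
n_credit_with = {1: 200, 2: 4}   # occurrences of "credit" with each sense

# P(S=s): unconditional probability of each sense.
prior = {s: n / n_total for s, n in n_sense.items()}
# P(F1="credit" | S=s): probability of the feature given the sense.
likelihood = {s: n_credit_with[s] / n_sense[s] for s in n_sense}
# P(F1="credit"): normalizing term, identical for both senses.
evidence = sum(n_credit_with.values()) / n_total

# Bayes' rule: P(S=s | F1="credit").
posterior = {s: likelihood[s] * prior[s] / evidence for s in n_sense}
for s in sorted(posterior):
    print(f"P(S={s}|F1=credit) = {posterior[s]:.3f}")

best_sense = max(posterior, key=posterior.get)
print("Most likely sense: bank/%d" % best_sense)
```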

Comparative Results

  • (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line…

  • (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line…

  • (Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words…

  • …All found that Naïve Bayesian Classifier performed as well as any of the other methods!

Decision Lists and Trees

  • Very widely used in Machine Learning.

  • Decision trees used very early for WSD research (e.g., Kelly and Stone, 1975; Black, 1988).

  • Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.

    • List decides between two senses after one positive answer

    • Tree allows for decision among multiple senses after a series of answers

  • Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes.

    • More descriptive and easier to interpret.

Decision List for WSD (Yarowsky, 1994)

  • Identify collocational features from sense tagged data.

  • Word immediately to the left or right of target:

    • I have my bank/1 statement.

    • The river bank/2 is muddy.

  • Pair of words to the immediate left or right of target:

    • The world’s richest bank/1 is here in New York.

    • The river bank/2 is muddy.

  • Words found within k positions to the left or right of target, where k is often 10-50:

    • My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.

Building the Decision List

  • Order the collocation tests by the log of the ratio of conditional probabilities: abs(log(P(S=1|Fi) / P(S=2|Fi))).

  • Words most indicative of one sense (and not the other) will be ranked highly.

Computing DL score

  • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

    • P(S=1) = 1,500/2,000 = .75

    • P(S=2) = 500/2,000 = .25

  • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.

    • P(F1=“credit”) = 204/2,000 = .102

    • P(F1=“credit”|S=1) = 200/1,500 = .133

    • P(F1=“credit”|S=2) = 4/500 = .008

  • From Bayes Rule…

    • P(S=1|F1=“credit”) = .133*.75/.102 = .978

    • P(S=2|F1=“credit”) = .008*.25/.102 = .020

  • DL Score = abs (log (.978/.020)) = 3.89
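
The score can be checked with a few lines of Python. The slides do not state the base of the logarithm; the value 3.89 matches the natural log, so that is what this sketch assumes.

```python
import math


def dl_score(p_sense1_given_f: float, p_sense2_given_f: float) -> float:
    """Absolute log-ratio of the two conditional sense probabilities."""
    return abs(math.log(p_sense1_given_f / p_sense2_given_f))


# Using the rounded posteriors from the example above.
print(round(dl_score(0.978, 0.020), 2))  # 3.89

# Using unrounded posteriors (200/204 and 4/204) the ratio is exactly 50,
# so the score is ln(50), slightly higher than the rounded figure.
print(round(dl_score(200 / 204, 4 / 204), 2))
```

The absolute value makes the score symmetric: a feature that strongly indicates bank/2 ranks as highly as one that strongly indicates bank/1.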

Using the Decision List

  • Sort the tests by DL score, then go through the test instance looking for a matching feature. The first match reveals the sense…
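
A minimal sketch of applying a decision list is shown below. Only the “credit” rule and its score come from the worked example; the other rules and scores are invented for illustration, and falling back to the most frequent sense when nothing matches is an assumption.

```python
# (feature word, sense, DL score). Only the "credit" entry is taken from the
# worked example; the remaining rules and scores are hypothetical.
RULES = [
    ("statement", "bank/1", 2.40),
    ("credit", "bank/1", 3.89),
    ("muddy", "bank/2", 1.75),
    ("river", "bank/2", 3.10),
]


def classify(context_words, rules, default="bank/1"):
    """Return the sense of the highest-scoring rule whose feature is present."""
    for feature, sense, _score in sorted(rules, key=lambda r: r[2], reverse=True):
        if feature in context_words:
            return sense  # first match decides; later rules are ignored
    return default  # assumed fallback: the most frequent sense


print(classify({"the", "river", "bank", "is", "muddy"}, RULES))
print(classify({"i", "have", "my", "bank", "statement"}, RULES))
```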

Learning a Decision Tree

  • Identify the feature that most “cleanly” divides the training data into the known senses.

    • “Cleanly” measured by information gain or gain ratio.

    • Create subsets of training data according to feature values.

  • Find another feature that most cleanly divides a subset of the training data.

  • Continue until each subset of training data is “pure” or as clean as possible.

  • Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)

  • In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)
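
To make “cleanly” concrete, here is a sketch of information gain for a binary word-presence feature. The four training instances are toy data made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(instances, feature):
    """instances: list of (set_of_context_words, sense)."""
    labels = [sense for _, sense in instances]
    with_f = [sense for words, sense in instances if feature in words]
    without_f = [sense for words, sense in instances if feature not in words]
    # Weighted entropy of the two subsets created by the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_f, without_f) if part)
    return entropy(labels) - remainder


# Toy sense-tagged data (hypothetical).
data = [
    ({"the", "credit", "account"}, "bank/1"),
    ({"the", "credit", "loan"}, "bank/1"),
    ({"the", "river", "muddy"}, "bank/2"),
    ({"the", "river", "water"}, "bank/2"),
]

for word in ("credit", "muddy", "the"):
    print(word, round(information_gain(data, word), 3))
```

A tree learner would pick the feature with the highest gain, split the data on it, and repeat on each subset.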

Supervised WSD with Individual Classifiers

  • Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation; most work reasonably well.

    • (Witten and Frank, 2000) is a great introduction to supervised learning.

  • Features tend to differentiate among methods more than the learning algorithms.

  • Good sets of features tend to include:

    • Co-occurrences or keywords (global)

    • Collocations (local)

    • Bigrams (local and global)

    • Part of speech (local)

    • Predicate-argument relations

      • Verb-object, subject-verb

    • Heads of Noun and Verb Phrases

Convergence of Results

  • The accuracy of different systems applied to the same data tends to converge on a particular value, with no one system shockingly better than another.

    • Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.

    • Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.

    • Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task…

  • What to do next?

Ensembles of Classifiers

  • Classifier error has two components (Bias and Variance)

    • Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias/High Variance

    • Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data: High Bias/Low Variance

  • Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy

  • “Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)

    • Sample with replacement from the training data to learn multiple decision trees.

    • Outliers in training data will tend to be obscured/eliminated.
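
Bagging can be sketched as follows. The learner here is a stand-in: `train_stump` is a hypothetical one-question “tree” that tests a single word, used only so the example is self-contained; in practice a full decision tree learner would take its place. The training data is also made up.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging(data, train, n_models, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_models)]


def vote(models, instance):
    """Majority vote over the models' predictions."""
    return Counter(m(instance) for m in models).most_common(1)[0][0]


def train_stump(sample, feature="credit"):
    """Stand-in learner (hypothetical): predicts by presence of one word."""
    with_f = Counter(s for words, s in sample if feature in words)
    without_f = Counter(s for words, s in sample if feature not in words)
    overall = Counter(s for _, s in sample).most_common(1)[0][0]
    yes = with_f.most_common(1)[0][0] if with_f else overall
    no = without_f.most_common(1)[0][0] if without_f else overall
    return lambda words: yes if feature in words else no


# Toy training data (hypothetical).
data = [
    ({"credit", "account"}, "bank/1"), ({"credit", "loan"}, "bank/1"),
    ({"credit", "card"}, "bank/1"), ({"river", "muddy"}, "bank/2"),
    ({"river", "water"}, "bank/2"), ({"shore", "fish"}, "bank/2"),
]

models = bagging(data, train_stump, n_models=11)
print(vote(models, {"my", "credit", "bank"}))
```

Because each model sees a slightly different resampling of the data, an unusual training instance influences only some of the models, and the vote smooths its effect.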

Ensemble Considerations

  • Must choose different learning algorithms with significantly different bias/variance characteristics.

    • Naïve Bayesian Classifier versus Decision Tree

  • Must choose feature representations that yield significantly different (independent?) views of the training data.

    • Lexical versus syntactic features

  • Must choose how to combine classifiers.

    • Simple Majority Voting

    • Averaging of probabilities across multiple classifier outputs

    • Maximum Entropy combination (e.g., Klein et al., 2002)
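
The first two combination schemes can be sketched in a few lines. The three probability distributions below are invented classifier outputs, chosen so that the two schemes disagree.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the sense predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def average_probabilities(distributions):
    """Average P(sense) across classifiers; return (best sense, averages)."""
    senses = set().union(*distributions)
    avg = {s: sum(d.get(s, 0.0) for d in distributions) / len(distributions)
           for s in senses}
    return max(avg, key=avg.get), avg


# Hypothetical outputs of three classifiers for one test instance.
outputs = [
    {"bank/1": 0.60, "bank/2": 0.40},
    {"bank/1": 0.55, "bank/2": 0.45},
    {"bank/1": 0.10, "bank/2": 0.90},
]

hard_predictions = [max(d, key=d.get) for d in outputs]
print(majority_vote(hard_predictions))
print(average_probabilities(outputs)[0])
```

Voting ignores how confident each classifier is, while averaging lets one very confident classifier outweigh two lukewarm ones, so the two schemes can return different senses for the same instance.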

Ensemble Results

  • (Pedersen, 2000) achieved state of the art for interest and line data using an ensemble of Naïve Bayesian Classifiers.

    • Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.

    • Classifiers combined by a weighted vote

  • (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.

    • Rich set of collocational and syntactic features.

    • Combined via linear combination of top three classifiers.

  • Many Senseval-2 and Senseval-3 systems employed ensemble methods.

Task Definition: Minimally supervised WSD

  • Supervised WSD = learning sense classifiers starting with annotated data

  • Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision

  • Examples

    • Automatically bootstrap a corpus starting with a few human annotated examples

    • Use monosemous relatives / dictionary definitions to automatically construct sense tagged data

    • Rely on Web-users + active learning for corpus annotation

Bootstrapping WSD Classifiers

  • Build sense classifiers with little training data

    • Expand applicability of supervised WSD

  • Bootstrapping approaches

    • Co-training

    • Self-training

    • Yarowsky algorithm

Bootstrapping Recipe

  • Ingredients

    • (Some) labeled data

    • (Large amounts of) unlabeled data

    • (One or more) basic classifiers

  • Output

    • Classifier that improves over the basic classifiers

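[Figure: bootstrapping sense classifiers for “plant”]

  • Labeled seed examples: “plant#1 growth is retarded …”, “… a nuclear power plant#2 …”

  • Classifier 1 and Classifier 2 are applied to unlabeled contexts: “… building the only atomic plant …”, “… plant growth is retarded …”, “… a herb or flowering plant …”, “… a nuclear power plant …”, “… building a new vehicle plant …”, “… the animal and plant life …”, “… the passion-fruit plant …”

  • Newly labeled examples: “… plants#1 and animals …”, “… industry plant#2 …”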

Co-training / Self-training

  • Given:

    • A set L of labeled training examples

    • A set U of unlabeled examples

    • Classifiers Ci

  • 1. Create a pool of examples U'

    • choose P random examples from U

  • 2. Loop for I iterations

    • Train Ci on L and label U'

    • Select G most confident examples and add to L

      • maintain distribution in L

    • Refill U' with examples from U

      • keep U' at constant size P
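
The loop above, for the single-classifier (self-training) case, might look like the sketch below. Several things are assumptions: `train_overlap` is a toy word-overlap learner invented for the example, the confidence measure is a simple score margin, and the “maintain distribution in L” step is omitted for brevity. Co-training would run the same loop with two classifiers over independent views.

```python
import random
from collections import Counter, defaultdict


def self_train(L, U, train, P, I, G, seed=0):
    """L: list of (instance, label); U: list of unlabeled instances.
    train(L) must return a function: instance -> (label, confidence)."""
    rng = random.Random(seed)
    L, U = list(L), list(U)
    rng.shuffle(U)
    pool = [U.pop() for _ in range(min(P, len(U)))]   # 1. pool U' of size P
    for _ in range(I):                                 # 2. loop for I iterations
        if not pool:
            break
        classifier = train(L)                          # train on L, label U'
        scored = [(x, *classifier(x)) for x in pool]   # (instance, label, conf)
        scored.sort(key=lambda t: t[2], reverse=True)
        for x, label, _conf in scored[:G]:             # add G most confident to L
            L.append((x, label))
        pool = [x for x, _, _ in scored[G:]]
        while len(pool) < P and U:                     # refill U' from U
            pool.append(U.pop())
    return train(L), L


def train_overlap(labeled):
    """Toy learner (hypothetical): score a sense by word overlap with its examples."""
    counts = defaultdict(Counter)
    for words, label in labeled:
        counts[label].update(words)

    def classify(words):
        scores = {lab: sum(c[w] for w in words) for lab, c in counts.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        runner_up = scores[ranked[1]] if len(ranked) > 1 else 0
        return ranked[0], scores[ranked[0]] - runner_up  # margin as confidence
    return classify


seeds = [({"plant", "growth", "is", "retarded"}, "plant#1"),
         ({"a", "nuclear", "power", "plant"}, "plant#2")]
unlabeled = [set(s.split()) for s in (
    "building the only atomic plant", "a herb or flowering plant",
    "building a new vehicle plant", "the animal and plant life",
    "the passion-fruit plant")]

classifier, labeled = self_train(seeds, unlabeled, train_overlap, P=3, I=2, G=1)
print(len(labeled), "labeled examples after bootstrapping")
```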

Co-training

  • (Blum and Mitchell 1998)

  • Two classifiers

    • independent views

    • [independence condition can be relaxed]

  • Co-training in Natural Language Learning

    • Statistical parsing (Sarkar 2001)

    • Co-reference resolution (Ng and Cardie 2003)

    • Part of speech tagging (Clark, Curran and Osborne 2003)

    • ...

Self-training

  • (Nigam and Ghani 2000)

  • One single classifier

  • Retrain on its own output

  • Self-training for Natural Language Learning

    • Part of speech tagging (Clark, Curran and Osborne 2003)

    • Co-reference resolution (Ng and Cardie 2003)

      • several classifiers through bagging