
Presentation Transcript


  1. Taking the Kitchen Sink Seriously: An Ensemble Approach to Word Sense Disambiguation, from Christopher Manning et al.

  2. Overview
     • 23 student WSD projects combined in a 2-layer voting scheme (an ensemble of ensemble classifiers).
     • Performed well on SENSEVAL-2: 4th place out of 21 supervised systems on the English Lexical Sample task.
     • Offers some valuable lessons for both WSD and ensemble methods in general.

  3. System Overview
     • 23 different "1st order" classifiers:
       • Independently developed WSD systems.
       • Use a variety of algorithms (naïve Bayes, n-gram, etc.).
     • These 1st order classifiers are combined into a variety of 2nd order classifiers/voting mechanisms (the two-layer structure is sketched after this list).
     • 2nd order classifiers vary with respect to:
       • The algorithm used to combine the 1st order classifiers.
       • The number of voters: each takes the top k 1st order classifiers, where k is one of {1, 3, 5, 7, 9, 11, 13, 15}.
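
The two-layer structure on this slide can be pictured with a short Python sketch. This is only an illustration, not the authors' code: the class names, the predict_sense interface, and the combine callback are all assumptions.

```python
from typing import Callable, List


class FirstOrderClassifier:
    """Stand-in for one of the 23 independently developed WSD systems."""

    def predict_sense(self, instance) -> str:
        raise NotImplementedError  # each student system uses its own algorithm


class SecondOrderClassifier:
    """Combines the votes of the top-k 1st order classifiers for one word."""

    def __init__(self, ranked_voters: List[FirstOrderClassifier], k: int,
                 combine: Callable[[List[str]], str]):
        self.voters = ranked_voters[:k]  # k is one of {1, 3, 5, 7, 9, 11, 13, 15}
        self.combine = combine           # e.g. a majority or weighted vote

    def predict_sense(self, instance) -> str:
        votes = [voter.predict_sense(instance) for voter in self.voters]
        return self.combine(votes)
```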

  4. Voting Algorithms
     • Majority vote (each vote has weight 1).
     • Weighted voting, with weights determined by EM:
       • Tries to choose the weights that maximize the likelihood of the 2nd order training instances, where the probability of a sense (given the votes) is defined as the sum of the weighted votes for that sense.
     • Maximum entropy, using features derived from the votes of the 1st order classifiers.
     (The first two schemes are sketched below.)
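
The first two voting schemes are easy to write down. Below is a minimal sketch assuming each vote is simply the predicted sense label; the EM fitting of the weights is not shown, only how already-fitted weights would be applied.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def majority_vote(votes: List[str]) -> str:
    """Majority vote: every 1st order classifier's vote has weight 1."""
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes: List[str], weights: List[float]) -> str:
    """Weighted vote: a sense's score is the sum of the weights of the
    classifiers that voted for it (the slide's EM step would supply the
    weights; here they are taken as given)."""
    scores: Dict[str, float] = defaultdict(float)
    for sense, weight in zip(votes, weights):
        scores[sense] += weight
    return max(scores, key=scores.get)
```

For example, majority_vote(['river', 'finance', 'finance']) returns 'finance', while weighted_vote with weights [0.9, 0.2, 0.2] returns 'river', since 0.9 outweighs 0.2 + 0.2.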

  5. Classifier Construction Process
     • For each word:
       • Train each 1st order classifier on ¾ of the training data.
       • Use the remaining ¼ of the data to rank the performance of the 1st order classifiers.
       • For each 2nd order classifier:
         • Take the top k 1st order classifiers for this word.
         • Train the 2nd order classifier on ¾ of the training data using this ensemble.
       • Rank the performance of the 2nd order classifiers on ¼ of the training data.
       • Take the top 2nd order classifier as the classifier for this word and retrain it on all of the training data.
     (This per-word loop is sketched below.)
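
A rough rendering of this per-word loop follows. It is a sketch under several assumptions: 1st and 2nd order classifiers are assumed to expose train and predict_sense methods, make_second_order is a hypothetical factory (e.g. wrapping the SecondOrderClassifier sketch above), the same held-out ¼ is reused for both rankings, and the ¾/¼ split is taken as a single fixed split.

```python
from typing import Sequence, Tuple

K_VALUES = (1, 3, 5, 7, 9, 11, 13, 15)


def accuracy(classifier, labelled: Sequence[Tuple[object, str]]) -> float:
    """Fraction of held-out instances labelled with the correct sense."""
    correct = sum(classifier.predict_sense(x) == gold for x, gold in labelled)
    return correct / len(labelled)


def build_classifier_for_word(first_orders, combiners, make_second_order,
                              train_data):
    """Per-word construction: rank the 1st orders, build candidate 2nd
    orders, keep the best one, then retrain it on all the training data."""
    split = (3 * len(train_data)) // 4
    fit_part, held_out = train_data[:split], train_data[split:]

    # Train each 1st order classifier on 3/4 of the data, rank on the rest.
    for system in first_orders:
        system.train(fit_part)
    ranked = sorted(first_orders, key=lambda s: accuracy(s, held_out),
                    reverse=True)

    # One candidate 2nd order classifier per (combination algorithm, k) pair.
    candidates = [make_second_order(ranked, k, combine)
                  for combine in combiners for k in K_VALUES]
    for candidate in candidates:
        candidate.train(fit_part)

    # Keep the best-ranked 2nd order classifier; retrain it on everything.
    best = max(candidates, key=lambda c: accuracy(c, held_out))
    best.train(train_data)
    return best
```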

  6. Results
     • 61.7% accuracy in the SENSEVAL-2 competition (4th place).
     • After the competition, performance was improved:
       • Global performance (i.e., over all words) was used as a tie-breaker for the rankings of both 1st and 2nd order classifiers.
       • This improved accuracy to 63.9% (which would have placed 2nd).

  7. Results for 2nd Order Classifiers
     • Results are averaged over all words.
     • Note MaxEnt's ability to resist dilution (its accuracy holds up as weaker voters are added to the ensemble).

  8. Evaluating Effects of Combination
     • We want different classifiers to make different mistakes.
     • We can measure this differentiation (error independence) by averaging, over all pairs of 1st order classifiers, the fraction of errors that are shared; the lower the shared fraction, the more independent the errors. (One possible formalization is sketched below.)
     • As error independence and word difficulty grow, the advantage of combination grows.
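
One possible formalization of the shared-error measure, assuming the "fraction of errors that are shared" is taken relative to the errors made by either classifier in a pair (the paper's exact denominator may differ):

```python
from itertools import combinations
from typing import List


def shared_error_fraction(preds_a: List[str], preds_b: List[str],
                          gold: List[str]) -> float:
    """Of the instances at least one of the two classifiers gets wrong,
    the fraction that both get wrong."""
    errors_a = {i for i, p in enumerate(preds_a) if p != gold[i]}
    errors_b = {i for i, p in enumerate(preds_b) if p != gold[i]}
    either = errors_a | errors_b
    return len(errors_a & errors_b) / len(either) if either else 0.0


def mean_pairwise_shared_errors(all_preds: List[List[str]],
                                gold: List[str]) -> float:
    """Average the shared-error fraction over all pairs of 1st order
    classifiers; lower values mean the classifiers err more independently."""
    pairs = list(combinations(all_preds, 2))
    return sum(shared_error_fraction(a, b, gold) for a, b in pairs) / len(pairs)
```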

  9. Lessons for WSD
     • Every word is a separate problem.
       • All 1st and 2nd order classifiers had some words on which they did the best.
     • Implementation details:
       • Large or small window sizes work better than medium window sizes.
         • This suggests that senses are determined both at a very local, collocational level and at a very general, topical level.
       • Smoothing is very important.

  10. Lessons for Ensemble Methods
     • Variety within the ensemble is desirable.
       • Qualitatively different approaches are better than minor perturbations of similar approaches.
       • We can measure the extent to which this ideal is achieved.
     • Variety in combination algorithms helps as well.
       • In particular, it can help with overfitting, because different algorithms will start to overtrain at different points.
