
SENSEVAL: Evaluating WSD Systems

Presentation Transcript


  1. SENSEVAL: Evaluating WSD Systems • Jason Blind & Lisa Norman • College of Computer and Information Science, Northeastern University, Boston, MA 02115 • January 25, 2006

  2. What is SENSEVAL? • Mission • To organize and run evaluations that test the strengths and weaknesses of WSD systems. • Underlying Goal • To further human understanding of lexical semantics and polysemy. • History • Began as a workshop in April 1997 • Organized by ACL-SIGLEX • When • 1998, 2001, 2004, 2007?

  3. What is WSD? • Machine Translation • English drug translates into French as either drogue or médicament. • Information Retrieval • If a user queries for documents about drugs, do they want documents about illegal narcotics or medicine? • How do people disambiguate word senses? • Grammatical Context • “AIDS drug” (modified by a proper name) • Lexical Context • If drug is followed by {addict, trafficker, etc.} the proper translation is most likely drogue (a small sketch of this heuristic follows the slide). • Domain-based Context • If the text/document/conversation is about {disease, medicare, etc.} then médicament is most likely the correct translation.
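
The lexical-context heuristic on this slide can be sketched as a lookup of cue words in a small window around the target word. The cue lists, window size, and fallback choice below are illustrative assumptions, not part of any SENSEVAL task definition.

```python
# Minimal sketch of the lexical-context heuristic for translating "drug":
# cue words near the target vote for one sense. The cue lists and the
# window size are illustrative assumptions, not part of any SENSEVAL task.

CUES = {
    "drogue":     {"addict", "trafficker", "dealer", "abuse"},
    "médicament": {"disease", "prescription", "doctor", "medicare"},
}

def translate_drug(tokens, target_index, window=5):
    """Pick a French translation of 'drug' from nearby cue words."""
    lo, hi = max(0, target_index - window), target_index + window + 1
    context = {t.lower() for t in tokens[lo:hi]}
    scores = {sense: len(context & cues) for sense, cues in CUES.items()}
    # Fall back to the medical sense when no cue word is found.
    return max(scores, key=scores.get) if any(scores.values()) else "médicament"

print(translate_drug("the drug trafficker was arrested".split(), 1))  # -> drogue
```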

  4. Evaluating WSD Systems • Definition of the task • Selecting the data to be used for the evaluation • Production of correct answers for the evaluation data • Distribution of the data to the participants • Participants use their program to tag the data • Administrators score the participants’ tagging • Participants and administrators meet to compare notes and learn lessons
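
A hedged sketch of the scoring step described above: SENSEVAL scoring reports precision over the instances a system attempted and recall over all gold-standard instances. Exact matching of a single system answer against a single gold sense per instance is a simplifying assumption here; the real scorer also handles multiple acceptable senses and different scoring granularities.

```python
# Sketch of the administrators' scoring step: precision over attempted
# instances, recall over all gold instances. Exact match against a single
# gold sense is a simplification of the real SENSEVAL scorer.

def score(system_answers, gold_answers):
    """system_answers / gold_answers: dicts mapping instance id -> sense id."""
    attempted = [i for i in gold_answers if i in system_answers]
    correct = sum(1 for i in attempted if system_answers[i] == gold_answers[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold_answers) if gold_answers else 0.0
    coverage = len(attempted) / len(gold_answers) if gold_answers else 0.0
    return precision, recall, coverage

# Hypothetical instance and sense ids, for illustration only.
gold = {"art.001": "art%1", "art.002": "art%2", "art.003": "art%1"}
sys_ = {"art.001": "art%1", "art.002": "art%1"}  # art.003 left untagged
print(score(sys_, gold))  # (0.5, 0.333..., 0.666...)
```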

  5. Evaluation Tasks • All-Words • Lexical Sample • Multilingual Lexical Sample • Translation • Automatic Sub-categorization Acquisition (?) • WSD of WordNet Glosses • Semantic Roles (FrameNet) • Logic Forms (FOPC)

  6. SENSEVAL-1 • Comprised WSD tasks for English, French, and Italian. • Timetable • A plan for selecting evaluation materials was agreed upon. • Human annotators generated the ‘gold standard’ set of correct answers. • The gold-standard materials, without answers, were released to participants, who then had a short time to run their programs over them and return their sets of answers to the organizers. • The organizers scored the returned answer sets, and the scores were announced and discussed at the workshop. • 17 systems were evaluated.

  7. SENSEVAL-1 : Tasks • Lexical Sample • First, carefully select a sample of words from the lexicon (based upon BNC frequency and WordNet polysemy levels); systems must then tag several corpus instances of the sample words in short extracts of text (a small selection sketch follows this slide). • Advantages over the all-words task • More efficient human tagging • The all-words task requires access to a full dictionary • Many systems needed either sense-tagged training data or some manual input for each dictionary entry, so all-words would be infeasible • Task breakdown • 15 noun tasks • 13 verb tasks • 8 adjective tasks • 5 indeterminate tasks
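
The selection criterion mentioned on this slide (BNC frequency plus WordNet polysemy) can be illustrated with a small sketch. The frequency bands, the minimum sense count, and the toy lexicon below are invented for illustration; the actual SENSEVAL-1 word list and criteria were fixed by the organizers.

```python
# Illustration of choosing lexical-sample words by corpus frequency and
# polysemy, as the slide describes. The frequency bands and the minimum
# sense count are invented; they are not the SENSEVAL-1 thresholds.
import random

def pick_sample(lexicon, per_band=5, min_senses=3, seed=0):
    """lexicon: dict word -> (bnc_frequency, wordnet_sense_count)."""
    rng = random.Random(seed)
    bands = {"rare": [], "medium": [], "frequent": []}
    for word, (freq, senses) in lexicon.items():
        if senses < min_senses:          # skip words that are barely polysemous
            continue
        band = "rare" if freq < 1000 else "medium" if freq < 10000 else "frequent"
        bands[band].append(word)
    return {b: rng.sample(ws, min(per_band, len(ws))) for b, ws in bands.items()}

toy_lexicon = {"bank": (25000, 10), "shake": (3000, 8), "onion": (900, 2), "float": (4000, 12)}
print(pick_sample(toy_lexicon, per_band=1))  # "onion" is dropped (too few senses)
```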

  8. SENSEVAL-1 : Dictionary & Corpus • Hector • A joint Oxford University Press/Digital project in which a database with a linked dictionary and a 17M-word corpus was developed. • Chosen because it was already sense-tagged, at a time when SENSEVAL was unsure whether it would have extra funding to pay humans to sense-tag text. • One disadvantage is that the OUP-delivered corpus instances came with very little context (usually 1-2 sentences).

  9. SENSEVAL-1 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. • 20,000+ instances of 38 words. • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^) • There were 8,448 corpus instances in total. • Most tasks had between 80 and 400 instances.

  10. SENSEVAL-1 : Baselines • Lesk’s algorithm • Dictionary-based • Unsupervised systems • Corpus-based • Supervised systems
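
A minimal sketch of the simplified Lesk baseline named above: choose the sense whose dictionary gloss shares the most words with the target's context. The toy gloss inventory for "bank" and the stopword list are assumptions; a real baseline would draw its glosses from Hector (SENSEVAL-1) or WordNet.

```python
# Minimal simplified-Lesk baseline: pick the sense whose gloss overlaps
# most with the context words. The tiny gloss dictionary is a stand-in;
# a real baseline would use Hector or WordNet glosses.

GLOSSES = {  # hypothetical sense inventory for "bank"
    "bank%finance": "a financial institution that accepts deposits and lends money",
    "bank%river":   "sloping land beside a body of water such as a river",
}

STOPWORDS = {"a", "an", "the", "of", "and", "that", "such", "as"}

def simplified_lesk(context_tokens, glosses=GLOSSES):
    """Return the sense whose gloss shares the most non-stopwords with the context."""
    context = {t.lower() for t in context_tokens} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & (set(gloss.split()) - STOPWORDS))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("she sat on the bank of the river fishing".split()))
# -> bank%river (the gloss shares "river" with the context)
```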

  11. SENSEVAL-1 : Results (English) • State of the art, where training data is available, is 75%-80%. • When training data is available, systems that use it perform substantially better than those that do not. • A well-implemented simple Lesk algorithm is hard to beat.

  12. SENSEVAL-2 : Tasks • All-Words • Systems must tag almost all of the content words in a sample of running text. • Lexical Sample • First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. • 73 words = 29 nouns + 15 adjectives + 29 verbs • Translation (Japanese only in SENSEVAL-2) • A task in which word sense is defined according to translation distinctions. (By contrast, SENSEVAL-1 evaluated systems only on lexical sample tasks in English, French, and Italian.)

  13. SENSEVAL-2 : Dictionary & Corpus • Sense-Dictionary • WordNet (1.7) • Corpus • Penn Treebank II Wall Street Journal articles • British National Corpus (BNC)

  14. SENSEVAL-2 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • 12,000+ instances of 73 words • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^) • There were ? corpus instances in total.

  15. SENSEVAL-2 : Results • 34 teams : 93 systems

  16. SENSEVAL-3 : Tasks • All-Words • English, Italian • Lexical Sample • Basque, Catalan, Chinese, English, Italian, Romanian, Spanish, Swedish • Multilingual Lexical Sample • Automatic Sub-categorization Acquisition (?) • WSD of WordNet Glosses • Semantic Roles (FrameNet) • Logic Forms (FOPC)

  17. SENSEVAL-3 : Dictionary & Corpus • Sense-Dictionary • WordNet (2.0), eXtended WordNet, EuroWordNet, ItalWordNet (1.7) • MiniDir-Cat • FrameNet • Corpus • British National Corpus (BNC) • Penn Treebank, Los Angeles Times, Open Mind Common Sense • SI-TAL (Integrated System for the Automatic Treatment of Language), MiniCors-Cat, etc.

  18. SENSEVAL-3 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • 12,000+ instances of 57 words • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^)

  19. Approaches to WSD • Lesk-based methods • Most Common Sense Heuristic • Domain Relevance Estimation • Latent Semantic Analysis (LSA) • Kernel methods • EM-based Clustering • Ensemble Classification • Maximum Entropy • Naïve Bayes • SVM • Boosting • KPCA
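
As a concrete instance of the supervised approaches in this list, here is a hedged sketch of a Naïve Bayes lexical-sample classifier over bag-of-words context features, using scikit-learn. The tiny training set for the word drug is invented for illustration; a real system would train on the SENSEVAL training distribution.

```python
# Sketch of a supervised Naïve Bayes sense classifier for one lexical-sample
# word, using bag-of-words context features. The toy training instances are
# invented; a real system would train on the SENSEVAL training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_contexts = [
    "the addict bought the drug from a street trafficker",
    "police seized the drug shipment from the dealer",
    "the doctor prescribed a new drug for the disease",
    "the drug reduced blood pressure in clinical trials",
]
train_senses = ["drogue", "drogue", "médicament", "médicament"]

# Bag-of-words features feeding a multinomial Naïve Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_contexts, train_senses)

print(model.predict(["the trial tested whether the drug cures the disease"]))
# -> ['médicament'] (medical cue words dominate this context)
```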

  20. References • A. Kilgarriff. “An Exercise in Evaluating Word Sense Disambiguation Programs”, 1998. • A. Kilgarriff. “Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs”, 1998. • R. Mihalcea, T. Chklovski, and A. Kilgarriff. “The SENSEVAL-3 English Lexical Sample Task”, 2004. • J. Rosenzweig and A. Kilgarriff. “English SENSEVAL: Report and Results”, 1998. • P. Edmonds. “The Evaluation of Word Sense Disambiguation Systems”, ELRA Newsletter, Vol. 7, No. 3, 2002. • M. Carpuat, W. Su, and D. Wu. “Augmenting Ensemble Classification for Word Sense Disambiguation with Kernel PCA Model”, SENSEVAL-3, 2004.
