
SENSEVAL: Evaluating WSD Systems

Presentation Transcript


  1. SENSEVAL: Evaluating WSD Systems • Jason Blind & Lisa Norman • College of Computer and Information Science, Northeastern University, Boston, MA 02115 • January 25, 2006

  2. What is SENSEVAL? • Mission • To organize and run evaluations that test the strengths and weaknesses of WSD systems. • Underlying Goal • To further human understanding of lexical semantics and polysemy. • History • Began as a workshop in April 1997 • Organized by ACL-SIGLEX • When • 1998, 2001, 2004, 2007?

  3. What is WSD? • Machine Translation • English drug translates into French as either drogue or médicament. • Information Retrieval • If a user queries for documents about drugs, do they want documents about illegal narcotics or medicine? • How do people disambiguate word senses? • Grammatical Context • “AIDS drug” (modified by a proper name) • Lexical Context • If drug is followed by {addict, trafficker, etc.} the proper translation is most likely drogue (a small sketch of this heuristic follows the slide). • Domain-based Context • If the text/document/conversation is about {disease, medicare, etc.} then médicament is most likely the correct translation.
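
The lexical-context heuristic on this slide can be sketched as a lookup of cue words in a small window around the target word. The cue lists, window size, and fallback choice below are illustrative assumptions, not part of any SENSEVAL task definition.

```python
# Minimal sketch of the lexical-context heuristic for translating "drug":
# cue words near the target vote for one sense. The cue lists and the
# window size are illustrative assumptions, not part of any SENSEVAL task.

CUES = {
    "drogue":     {"addict", "trafficker", "dealer", "abuse"},
    "médicament": {"disease", "prescription", "doctor", "medicare"},
}

def translate_drug(tokens, target_index, window=5):
    """Pick a French translation of 'drug' from nearby cue words."""
    lo, hi = max(0, target_index - window), target_index + window + 1
    context = {t.lower() for t in tokens[lo:hi]}
    scores = {sense: len(context & cues) for sense, cues in CUES.items()}
    # Fall back to the medical sense when no cue word is found.
    return max(scores, key=scores.get) if any(scores.values()) else "médicament"

print(translate_drug("the drug trafficker was arrested".split(), 1))  # -> drogue
```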

  4. Evaluating WSD Systems • Definition of the task • Selecting the data to be used for the evaluation • Production of correct answers for the evaluation data • Distribution of the data to the participants • Participants use their program to tag the data • Administrators score the participants’ tagging • Participants and administrators meet to compare notes and learn lessons
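
A hedged sketch of the scoring step described above: SENSEVAL scoring reports precision over the instances a system attempted and recall over all gold-standard instances. Exact matching of a single system answer against a single gold sense per instance is a simplifying assumption here; the real scorer also handles multiple acceptable senses and different scoring granularities.

```python
# Sketch of the administrators' scoring step: precision over attempted
# instances, recall over all gold instances. Exact match against a single
# gold sense is a simplification of the real SENSEVAL scorer.

def score(system_answers, gold_answers):
    """system_answers / gold_answers: dicts mapping instance id -> sense id."""
    attempted = [i for i in gold_answers if i in system_answers]
    correct = sum(1 for i in attempted if system_answers[i] == gold_answers[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold_answers) if gold_answers else 0.0
    coverage = len(attempted) / len(gold_answers) if gold_answers else 0.0
    return precision, recall, coverage

# Hypothetical instance and sense ids, for illustration only.
gold = {"art.001": "art%1", "art.002": "art%2", "art.003": "art%1"}
sys_ = {"art.001": "art%1", "art.002": "art%1"}  # art.003 left untagged
print(score(sys_, gold))  # (0.5, 0.333..., 0.666...)
```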

  5. Evaluation Tasks • All-Words • Lexical Sample • Multilingual Lexical Sample • Translation • Automatic Sub-categorization Acquisition (?) • WSD of WordNet Glosses • Semantic Roles (FrameNet) • Logic Forms (FOPC)

  6. SENSEVAL-1 • Comprised WSD tasks for English, French, and Italian. • Timetable • A plan for selecting evaluation materials was agreed upon. • Human annotators generated the ‘gold standard’ set of correct answers. • The gold-standard materials, without answers, were released to participants, who then had a short time to run their programs over them and return their sets of answers to the organizers. • The organizers scored the returned answer sets, and the scores were announced and discussed at the workshop. • 17 systems were evaluated.

  7. SENSEVAL-1 : Tasks • Lexical Sample • First, carefully select a sample of words from the lexicon (based upon BNC frequency and WordNet polysemy levels); systems must then tag several corpus instances of the sample words in short extracts of text (a small selection sketch follows this slide). • Advantages over the all-words task • More efficient human tagging • The all-words task requires access to a full dictionary • Many systems needed either sense-tagged training data or some manual input for each dictionary entry, so all-words would be infeasible • Task breakdown • 15 noun tasks • 13 verb tasks • 8 adjective tasks • 5 indeterminate tasks
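
The selection criterion mentioned on this slide (BNC frequency plus WordNet polysemy) can be illustrated with a small sketch. The frequency bands, the minimum sense count, and the toy lexicon below are invented for illustration; the actual SENSEVAL-1 word list and criteria were fixed by the organizers.

```python
# Illustration of choosing lexical-sample words by corpus frequency and
# polysemy, as the slide describes. The frequency bands and the minimum
# sense count are invented; they are not the SENSEVAL-1 thresholds.
import random

def pick_sample(lexicon, per_band=5, min_senses=3, seed=0):
    """lexicon: dict word -> (bnc_frequency, wordnet_sense_count)."""
    rng = random.Random(seed)
    bands = {"rare": [], "medium": [], "frequent": []}
    for word, (freq, senses) in lexicon.items():
        if senses < min_senses:          # skip words that are barely polysemous
            continue
        band = "rare" if freq < 1000 else "medium" if freq < 10000 else "frequent"
        bands[band].append(word)
    return {b: rng.sample(ws, min(per_band, len(ws))) for b, ws in bands.items()}

toy_lexicon = {"bank": (25000, 10), "shake": (3000, 8), "onion": (900, 2), "float": (4000, 12)}
print(pick_sample(toy_lexicon, per_band=1))  # "onion" is dropped (too few senses)
```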

  8. SENSEVAL-1 : Dictionary & Corpus • Hector • A joint Oxford University Press/Digital project in which a database with a linked dictionary and a 17M-word corpus was developed. • Chosen because it was already sense-tagged, at a time when SENSEVAL was unsure whether it would have extra funding to pay humans to sense-tag text. • One disadvantage is that the OUP-delivered corpus instances came with very little context (usually 1-2 sentences).

  9. SENSEVAL-1 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. • 20,000+ instances of 38 words. • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^) • There were 8,448 corpus instances in total. • Most tasks had between 80 and 400 instances.

  10. SENSEVAL-1 : Baselines • Lesk’s algorithm • Dictionary-based • Unsupervised systems • Corpus-based • Supervised systems
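
A minimal sketch of the simplified Lesk baseline named above: choose the sense whose dictionary gloss shares the most words with the target's context. The toy gloss inventory for "bank" and the stopword list are assumptions; a real baseline would draw its glosses from Hector (SENSEVAL-1) or WordNet.

```python
# Minimal simplified-Lesk baseline: pick the sense whose gloss overlaps
# most with the context words. The tiny gloss dictionary is a stand-in;
# a real baseline would use Hector or WordNet glosses.

GLOSSES = {  # hypothetical sense inventory for "bank"
    "bank%finance": "a financial institution that accepts deposits and lends money",
    "bank%river":   "sloping land beside a body of water such as a river",
}

STOPWORDS = {"a", "an", "the", "of", "and", "that", "such", "as"}

def simplified_lesk(context_tokens, glosses=GLOSSES):
    """Return the sense whose gloss shares the most non-stopwords with the context."""
    context = {t.lower() for t in context_tokens} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & (set(gloss.split()) - STOPWORDS))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("she sat on the bank of the river fishing".split()))
# -> bank%river (the gloss shares "river" with the context)
```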

  11. SENSEVAL-1 : Results (English) • State of the art, where training data is available, is 75%-80%. • When training data is available, systems that use it perform substantially better than those that do not. • A well-implemented simple Lesk algorithm is hard to beat.

  12. SENSEVAL-2 : Tasks • All-Words • Systems must tag almost all of the content words in a sample of running text. • Lexical Sample • First, carefully select a sample of words from the lexicon; systems must then tag several instances of the sample words in short extracts of text. • 73 words = 29 nouns + 15 adjectives + 29 verbs • Translation (Japanese only in SENSEVAL-2) • A task in which word sense is defined according to translation distinctions. (By contrast, SENSEVAL-1 evaluated systems only on lexical sample tasks in English, French, and Italian.)

  13. SENSEVAL-2 : Dictionary & Corpus • Sense-Dictionary • WordNet (1.7) • Corpus • Penn Treebank II Wall Street Journal articles • British National Corpus (BNC)

  14. SENSEVAL-2 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • 12,000+ instances of 73 words • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^) • There were ? corpus instances in total.

  15. SENSEVAL-2 : Results • 34 teams : 93 systems

  16. SENSEVAL-3 : Tasks • All-Words • English, Italian • Lexical Sample • Basque, Catalan, Chinese, English, Italian, Romanian, Spanish, Swedish • Multilingual Lexical Sample • Automatic Sub-categorization Acquisition (?) • WSD of WordNet Glosses • Semantic Roles (FrameNet) • Logic Forms (FOPC)

  17. SENSEVAL-3 : Dictionary & Corpus • Sense-Dictionary • WordNet (2.0), eXtended WordNet, EuroWordNet, ItalWordNet (1.7) • MiniDir-Cat • FrameNet • Corpus • British National Corpus (BNC) • Penn Treebank, Los Angeles Times, Open Mind Common Sense • SI-TAL (Integrated System for the Automatic Treatment of Language), MiniCors-Cat, etc.

  18. SENSEVAL-3 : Data • Dry-run Distribution • Systems must tag almost all of the content words in a sample of running text. • Training-data Distribution • 12,000+ instances of 57 words • Evaluation Distribution • A set of corpus instances for each task. • Each instance had been tagged by at least 3 humans (the human tags were obviously not part of the distribution :^)

  19. Approaches to WSD • Lesk-based methods • Most Common Sense Heuristic • Domain Relevance Estimation • Latent Semantic Analysis (LSA) • Kernel methods • EM-based Clustering • Ensemble Classification • Maximum Entropy • Naïve Bayes • SVM • Boosting • KPCA
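
As a concrete instance of the supervised approaches in this list, here is a hedged sketch of a Naïve Bayes lexical-sample classifier over bag-of-words context features, using scikit-learn. The tiny training set for the word drug is invented for illustration; a real system would train on the SENSEVAL training distribution.

```python
# Sketch of a supervised Naïve Bayes sense classifier for one lexical-sample
# word, using bag-of-words context features. The toy training instances are
# invented; a real system would train on the SENSEVAL training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_contexts = [
    "the addict bought the drug from a street trafficker",
    "police seized the drug shipment from the dealer",
    "the doctor prescribed a new drug for the disease",
    "the drug reduced blood pressure in clinical trials",
]
train_senses = ["drogue", "drogue", "médicament", "médicament"]

# Bag-of-words features feeding a multinomial Naïve Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_contexts, train_senses)

print(model.predict(["the trial tested whether the drug cures the disease"]))
# -> ['médicament'] (medical cue words dominate this context)
```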

  20. References • A. Kilgarriff. “An Exercise in Evaluating Word Sense Disambiguation Programs”, 1998. • A. Kilgarriff. “Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs”, 1998. • R. Mihalcea, T. Chklovski, and A. Kilgarriff. “The SENSEVAL-3 English Lexical Sample Task”, 2004. • J. Rosenzweig and A. Kilgarriff. “English SENSEVAL: Report and Results”, 1998. • P. Edmonds. “The Evaluation of Word Sense Disambiguation Systems”, ELRA Newsletter, Vol. 7, No. 3, 2002. • M. Carpuat, W. Su, and D. Wu. “Augmenting Ensemble Classification for Word Sense Disambiguation with Kernel PCA Model”, SENSEVAL-3, 2004.
