
  1. Annotating the WordNet Glosses Ben Haskell <ben@clarity.princeton.edu>

  2. Annotating the Glosses • Annotating open-class words with their WordNet sense tag (a.k.a. sense-tagging) • A disambiguation task: Process of linking an instance of a word to the WordNet synset representing its context-appropriate meaning, e.g. run a company vs. run an errand
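
To make the task concrete, here is a minimal sketch (using NLTK and its WordNet data, not the Princeton annotation tooling) that lists the candidate verb synsets an annotator must choose among for "run":

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet') has been run

# "run a company" and "run an errand" share the same candidate set;
# sense-tagging means picking the context-appropriate synset from it.
for syn in wn.synsets("run", pos=wn.VERB)[:6]:
    print(syn.name(), "-", syn.definition())
```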

  3. Glosses as node points in the network of relations • Once a word’s gloss is annotated, the synsets for all conceptually-related words used in the gloss can be accessed via their sense tags • Situates the word in an expanded network of links to other semantically-related words/concepts in WordNet
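
A rough illustration of what those links buy you, again with NLTK rather than the project's own code: once a gloss word such as "propel" in the gloss of kick (v.) carries a sense tag, that tag points straight at a synset, and the rest of the relation network is one hop away. The sense number used below is an illustrative assumption, not the official annotation.

```python
from nltk.corpus import wordnet as wn

# Assume "propel" in "drive or propel with the foot" was tagged with its first
# verb sense (illustrative choice only).
propel = wn.synset("propel.v.01")
print(propel.definition())
print([h.name() for h in propel.hypernyms()])  # related concepts now reachable from the gloss word
```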

  4. Annotating the Glosses • Automatically tag monosemous words/collocations • For gold standard quality, sense-tagging of polysemous words must be done manually • More accurate sense-tagged data means better results for WSD systems, which means better performance from applications that depend on WSD

  5. System overview • Preprocessor • Gloss “parser” and tokenizer/lemmatizer • Semantic class recognizer • Noun phrase chunker • Collocation recognizer (globber) • Automatic sense tagger for monosemous terms • Manual tagging interface

  6. Logical structure of a Gloss • Smallest unit is a word, contracted form, or non-lexical punctuation • Collocations are decomposed into their constituent parts • Allows coding of discontinuous collocations • A collocation can be treated either as a single unit or a sequence of forms

  7. Example glosses • n. pass, toss, flip: (sports) the act of throwing the ball to another member of your team; "the pass was fumbled" • n. brace, suspender: elastic straps that hold trousers up (usually used in the plural) • v. kick: drive or propel with the foot

  8. Gloss “parser” • Regularization & clean-up of the gloss • Recognize & XML tag <def>, <aux>, <ex>, <qf>, verb arguments, domain <classif> • <aux> and <classif> contents do not get tagged • Replace XML-unfriendly characters (&, <, >) with XML entities
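
For the last bullet, a minimal sketch of the entity replacement using Python's standard library; the <def> element name comes from the slide, but the wrapping code is illustrative rather than the project's parser:

```python
from xml.sax.saxutils import escape

# escape() replaces the XML-unfriendly characters &, <, > with entities.
raw_def = "a binary operator such as & or <"
print("<def>{}</def>".format(escape(raw_def)))
# -> <def>a binary operator such as &amp; or &lt;</def>
```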

  9. Tokenizer • Isolate word forms • Differentiate non-lexical from lexical punctuation • E.g., sentence-ending periods vs. periods in abbreviations • Recognize apostrophe vs. quotation marks • E.g., states’ rights vs. `college-bound students’
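
A crude sketch of the lexical vs. non-lexical period distinction, built on a toy abbreviation list and a regex rather than the project's tokenizer:

```python
import re

ABBREVS = {"e.g.", "i.e.", "etc.", "cf.", "vs."}  # toy list, for illustration only

def classify_periods(text):
    """Yield (token, kind) for period-bearing tokens: 'lexical' periods belong to
    the word form (abbreviations); 'non-lexical' periods end the sentence."""
    for tok in text.split():
        if tok in ABBREVS or re.fullmatch(r"(?:[A-Za-z]\.)+", tok):
            yield tok, "lexical"
        elif tok.endswith("."):
            yield tok, "non-lexical"

print(list(classify_periods("drive or propel with the foot, e.g. a mule.")))
```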

  10. Lemmatizer • A lemma is the WordNet entry form plus WordNet part of speech • Inflected forms are uninflected using a stemmer developed in-house specifically for this task • A <wf> may be assigned multiple potential lemmas • saw: lemma=“saw%1|saw%2|see%2” • feeling: lemma=“feeling%1|feel%2”
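
The in-house stemmer is not public; as a rough stand-in, NLTK's WordNet morphology code can produce the same kind of multi-candidate lemma strings. This sketch uses the reader's internal _morphy, which returns all candidates for a part of speech rather than only the first:

```python
from nltk.corpus import wordnet as wn

POS_NUM = {wn.NOUN: 1, wn.VERB: 2, wn.ADJ: 3, wn.ADV: 4}  # numeric POS codes as on the slide

def candidate_lemmas(word):
    """Collect every (entry form, POS) pair that could underlie the word form."""
    cands = set()
    for pos, num in POS_NUM.items():
        for form in wn._morphy(word, pos):  # all morphological candidates for this POS
            cands.add("{}%{}".format(form, num))
    return "|".join(sorted(cands))

print(candidate_lemmas("saw"))      # expected to include saw%1, saw%2, see%2
print(candidate_lemmas("feeling"))  # expected to include feeling%1, feel%2
```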

  11. Lemmatizer, cont. • Exceptions: stopwords/phrases • Closed-class words (prepositions, pronouns, conjunctions, etc.) • multi-word terms such as “by means of”, “according to”, “granted that” • Hyphenated terms not in WordNet get split and separately lemmatized • E.g., over-fed becomes over + fed
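
A minimal sketch of the hyphen rule, with an NLTK lookup standing in for the project's own check against the WordNet entry list:

```python
from nltk.corpus import wordnet as wn

def split_if_unknown(token):
    """Split a hyphenated term into separately lemmatized parts when the whole
    term has no WordNet entry; hyphenated terms that are in WordNet stay whole."""
    if "-" in token and not wn.synsets(token):
        return token.split("-")
    return [token]

print(split_if_unknown("over-fed"))    # no WordNet entry -> ['over', 'fed']
print(split_if_unknown("well-known"))  # has a WordNet entry -> ['well-known']
```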

  12. Semantic class recognizer • Recognizes and marks up parenthesized and free text belonging to a finite set of semantic classes • chem(ical symbol), curr(ency), date, d(ate)range, math, meas(ure phrase), n(umeric)range, num(ber), punc(tuation), symb(olic text), time, year • Words and phrases in these classes will not be sense-tagged
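
An illustrative fragment of such a recognizer; the patterns below are hypothetical stand-ins for a few of the classes, not the project's actual grammar:

```python
import re

SEMANTIC_CLASSES = {
    "drange": re.compile(r"\b\d{4}\s*-\s*\d{4}\b"),   # e.g. a lifespan such as 1863-1945
    "year":   re.compile(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b"),
    "curr":   re.compile(r"[$£€]\s?\d[\d,]*(?:\.\d+)?"),
    "num":    re.compile(r"\b\d+(?:\.\d+)?\b"),
}

def classify_spans(text):
    """Yield (class, span) pairs; a real recognizer would also resolve overlaps
    (here both 'drange' and 'year' match parts of the same span)."""
    for cls, pat in SEMANTIC_CLASSES.items():
        for m in pat.finditer(text):
            yield cls, m.group(0)

print(list(classify_spans("British statesman (1863-1945)")))
```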

  13. Noun Phrase chunker • Isolates noun phrases (“chunks”) in order to narrow the scope for finding noun collocations in the next stage • Glosses are not otherwise syntactically parsed • POS tagging: trained and applied Thorsten Brants’s TnT statistical tagger
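
For flavor, a stand-in chunker using NLTK's off-the-shelf POS tagger and a regexp grammar; the project itself used TnT and, as the next slide describes, Abney's Cass, neither of which is shown here:

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

gloss = "the act of throwing the ball to another member of your team"
tagged = nltk.pos_tag(nltk.word_tokenize(gloss))

# A simple NP grammar: optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")
for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```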

  14. Noun Phrase chunker, cont. • Trained and chunked noun phrases using Steven Abney’s partial parser Cass • Enabled automatic recognition of otherwise ambiguous noun compounds and fixed expressions • E.g., opening move (JJ NN vs. VBG NN vs. VBG VB vs. NN VB), bill of fare (NN IN NN vs. VB IN NN) • Increased noun collocation coverage by 25% (types) and 29% (tokens)

  15. Collocation recognizer • Bag of Words approach • To find ‘North_America’, find glosses that have both ‘North’ and ‘America’ • Four passes • Ghost: ‘bring_home_the_bacon’ • mark ‘bacon’ so it won’t be tagged as monosemous • Contiguous: ‘North_America’ • Disjoint: North (and) [(South) America] • Examples: tag the synset’s collocations in its gloss
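
A minimal sketch of the bag-of-words lookup behind the contiguous and disjoint passes, run over a toy gloss index; the index layout and the example glosses are made up for illustration:

```python
def candidate_glosses(collocation, gloss_index):
    """Return ids of glosses containing every constituent word of the collocation;
    later passes then decide whether the occurrence is contiguous or disjoint."""
    parts = set(collocation.lower().split("_"))
    return [gid for gid, words in gloss_index.items() if parts <= words]

# Toy index: gloss id -> lower-cased word forms in that gloss (illustrative data).
gloss_index = {
    "g1": {"a", "large", "land", "mass", "in", "the", "western", "hemisphere"},
    "g2": {"a", "bear", "found", "in", "north", "and", "south", "america"},
    "g3": {"the", "continent", "of", "north", "america"},
}
print(candidate_glosses("North_America", gloss_index))  # -> ['g2', 'g3']
```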

  16. Automatic sense-tagger • Tag monosemous words. • Words that have… • …only one lemmatized form • …only one WordNet sense • …not been marked as possibly ambiguous • i.e. non wait-list words, non ‘bacon’ words
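
A sketch of the single-sense test using NLTK; the other criteria (multiple lemmatizations, wait-list and ghost words) are assumed to be checked elsewhere in the pipeline:

```python
from nltk.corpus import wordnet as wn

def is_monosemous(lemma, pos):
    """True when the lemma has exactly one WordNet sense for this part of speech,
    so it can be tagged automatically with that sense."""
    return len(wn.synsets(lemma, pos=pos)) == 1

print(is_monosemous("photosynthesis", wn.NOUN))  # a single noun sense -> True
print(is_monosemous("run", wn.VERB))             # many verb senses -> False
```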

  17. The mantag interface • Simplicity • Taggers will repeat the same actions hundreds of times per day • Automation • Instead of typing the 148,000 search terms, use a centralized list • Also allows easy tracking of the double-checking process

  18.-21. Statistics (four slides of tables and charts; not preserved in this transcript)

  22. Aim of ISI Effort • Jerry Hobbs, Ulf Hermjakob, Nishit Rathod, Fahad al-Qahtani • Gold standard translation of glosses into first-order logic with reified events

  23. ISI Effort examples • In: gloss for dance, v, 2, “move in a graceful and rhythmic way”, sense-tagged word by word as move#v#2, ignore, ignore, graceful#a#1, ignore, rhythmic#a#1, way#n#8 • Out: dance-V-2'(e0,x) -> move-V-2'(e1,x) & in'(e2,e1,y) & graceful-A-1'(e3,y) & rhythmic-A-1'(e4,y) & way-N-8'(e5,y)

  24. ISI Effort examples, cont. • In: gloss for allegro, n, 2, “a musical composition or passage performed quickly”, sense-tagged word by word as ignore, musical_composition#n#1, ignore, musical_passage#n#1, perform#v#2, quickly#r#4 • Out: allegro-N-2'(e0,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) & perform-V-2'(e2,y,x) & quick-D-4'(e3,e2) • Plus: musical_composition-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x) and musical_passage-N-1'(e1,x) -> musical_composition-N-1/musical_passage-N-1'(e1,x)

  25. ISI Method • Identify the most common gloss patterns and convert them first • Parsing: Charniak’s parser gave uneven, sometimes bizarre results (“aspen” tagged VBN); Hermjakob’s CONTEX parser offers greater local control

  26. ISI Progress • Completed glosses of nouns with patterns: • NG (P NG)*: 45% of nouns • + NG ((VBN | VING) NG): 15% of nouns • 45 + 15 = 60% complete! • But gloss patterns are in a Zipf distribution:

  27. Distribution of noun glosses (chart not preserved in this transcript)
