
Sense Tagging the Penn TreeBank Martha Palmer CIS and IRCS University of Pennsylvania






Presentation Transcript


  1. Sense Tagging the Penn TreeBank. Martha Palmer, CIS and IRCS, University of Pennsylvania. Johns Hopkins University, April 18, 2000

  2. Penn approach • Relies on lexically based linguistic analysis • Humans annotate naturally occurring text • (hand correct output of automatic parsers, e.g. Fidditch, XTAG) • Train statistical POS taggers, parsers, etc. • Common thread is predicate-argument structure. Hypothesis: more linguistically sophisticated analyzers yield more accurate output

  3. Past results • XTAG project http://www.cis.upenn.edu/~xtag/ • Penn TreeBank http://www.cis.upenn.edu/~treebank/ • Enabled the development of tools: POS taggers, parsers, co-reference, etc. http://www.ircs.upenn.edu/knowledge/licensing.html

  4. Under TIDES: • Annotations enriched with semantics and pragmatics • Provide companion lexicons for annotated corpora • Extend our coverage to other languages (Chinese, Korean). Hypothesis: parallel annotated corpora/lexicons will enable rapid ramp-up of MT

  5. Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks

  6. Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks

  7. Machine Translation Lexical Choice- Word Sense Disambiguation • Iraq lost the battle. • Ilakuka centwey ciessta. • [Iraq ] [battle] [lost]. • John lost his computer. • John-i computer-lul ilepelyessta. • [John] [computer] [misplaced].

  8. Cross-linguistic Information Retrieval-sense ambiguities • English • speed of light > *plea bargaining, • (speedier trials, lighter sentences) • Multilingual: French => English • saisi stupefiant > drug seizure, *grip narcotic, *understand stupefying,...

  9. Predicate-argument structures for lose • lose1 (Agent: animate, Patient: physical-object) • lose2 (Agent: animate, Patient: competition) • Agent <=> SUBJ • Patient <=> OBJ • ACL-81, ACL-85, ACL-86, MT-90, CUP-90, AIJ-93

  10. Word sense disambiguation with Source Language Semantic Class Constraints (co-occurrence patterns + backoff) • lose1 (Agent, Patient: physical-obj) <=> ilepelyessta • lose2 (Agent, Patient: competition) <=> ciessta
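One way to picture the co-occurrence-plus-backoff selection on this slide is a table of selectional constraints keyed by the semantic class of the Patient. The sketch below is a minimal, hypothetical illustration: the rule table, toy ontology, and function names are assumptions, not the actual system.

```python
# A minimal sketch of choosing a Korean translation of "lose" from the
# semantic class of its Patient; SENSE_RULES and ISA are toy stand-ins.
SENSE_RULES = {
    "lose": [
        ("competition",     "ciessta"),       # Iraq lost the battle.
        ("physical-object", "ilepelyessta"),  # John lost his computer.
    ],
}

ISA = {  # toy ontology lookup: head noun -> semantic class
    "battle":   "competition",
    "computer": "physical-object",
}

def translate_verb(verb, patient_head, default=None):
    """Pick the target verb whose Patient constraint matches; back off otherwise."""
    patient_class = ISA.get(patient_head)
    for constraint, target in SENSE_RULES.get(verb, []):
        if patient_class == constraint:
            return target
    return default  # backoff when no constraint fires

print(translate_verb("lose", "battle"))    # ciessta
print(translate_verb("lose", "computer"))  # ilepelyessta
```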

  11. Word sense disambiguation with Target Language Semantic Constraints • receive <=> {patassta, swusinhayssta} • patassta (Recipient, Patient: physical-obj) • swusinhayssta (Recipient, Patient: communication) • TAG+94, CSLI-99, AMTA-94

  12. Lexical Gaps: English to Chinese • break, smash, shatter, snap <=> ? • da po - irregular pieces • da sui - small pieces • pie duan - line segments

  13. Word sense disambiguation with Source Language Neighbors and Target Language Semantic Class Constraints • break {smash, shatter, snap, etc.} <=> {da sui, da po, pie duan, di po, ...} • da sui (Agent, Patient: small and brittle) • da po (Agent, Patient: concrete, inflexible object) • pie duan (Agent, Patient: line segment shape) • ACL-94, MTJ-95
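The class-level version of the same idea: every member of the break class shares one set of target-language constraints, so neighbors like smash or snap inherit the mapping. The following is a hedged sketch with invented feature names; it is not the published algorithm.

```python
# A hedged sketch of class-level selection for break verbs; the feature
# names ("small", "brittle", ...) are illustrative assumptions.
BREAK_CLASS = {"break", "smash", "shatter", "snap"}

CHINESE_TARGETS = [
    ({"small", "brittle"},       "da sui"),
    ({"concrete", "inflexible"}, "da po"),
    ({"line-segment"},           "pie duan"),
]

def translate_break_verb(verb, patient_features):
    """Any break-class verb maps to Chinese via properties of the Patient."""
    if verb not in BREAK_CLASS:
        return None
    for required, target in CHINESE_TARGETS:
        if required <= patient_features:  # all required features present
            return target
    return "da po"  # back off to the most general target

print(translate_break_verb("shatter", {"small", "brittle"}))    # da sui
print(translate_break_verb("snap", {"line-segment", "rigid"}))  # pie duan
```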

  14. Levin classes (3100 verbs) • 47 top level classes, 150 second and third level • Based on pairs of syntactic frames. • John broke the jar. / Jars break easily. / The jar broke. • John cut the bread. / Bread cuts easily. / *The bread cut. • John hit the wall. / *Walls hit easily. / *The wall hit. • Reflect underlying semantic components • contact, directed motion, • exertion of force, change of state • Synonyms, syntactic patterns, relations
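A Levin class can be thought of as a set of members plus the alternations they allow or disallow. The snippet below is one assumed way to encode the break/cut contrast from this slide (the class numbers follow Levin 1993; the encoding itself is illustrative).

```python
# A minimal sketch of Levin classes as alternation signatures (assumed encoding).
from dataclasses import dataclass, field

@dataclass
class LevinClass:
    name: str
    members: set
    alternations: dict = field(default_factory=dict)  # alternation -> allowed?

BREAK = LevinClass("break (45.1)", {"break", "shatter", "snap"},
                   {"causative/inchoative": True,   # The jar broke.
                    "middle": True})                 # Jars break easily.

CUT = LevinClass("cut (21.1)", {"cut", "slice"},
                 {"causative/inchoative": False,     # *The bread cut.
                  "middle": True})                    # Bread cuts easily.

# Classes are distinguished by which alternations their members permit.
print(BREAK.alternations == CUT.alternations)  # False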

  15. Confusions in Levin classes? • Not semantically homogeneous • {braid, clip, file, powder, pluck, etc...} • Multiple class listings • homonymy or polysemy? • Alternation contradictions? • Carry verbs disallow the Conative, but include {push, pull, shove, kick, draw, yank, tug} • also in the Push/Pull class, which does take the Conative

  16. Intersective Levin classes

  17. Regular Sense Extensions • John pushed the chair. +force, +contact • arg0 arg1 • John pushed the chairs apart. +ch-state • John pushed the chairs across the room. +ch-loc • John pushed at the chair. -ch-loc • The train whistled into the station. +ch-loc • The truck roared past the weigh station. +ch-loc AMTA98,ACL98,TAG98

  18. Intersective Levin Classes • More syntactically and semantically coherent • sets of syntactic patterns • explicit semantic components • relations between senses • VERBNET
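As a rough picture of what a VerbNet class entry pairs together (syntactic frames with explicit semantic predicates and thematic roles), here is an assumed, simplified structure for a push-like class; it is not the actual VerbNet schema.

```python
# A hedged, simplified stand-in for a VerbNet class entry (not the real schema).
PUSH_CLASS = {
    "members": ["push", "pull", "shove", "kick", "yank", "tug"],
    "thematic_roles": ["Agent", "Patient"],
    "frames": [
        {"description": "Basic transitive",
         "syntax": "Agent V Patient",          # John pushed the chair.
         "semantics": ["cause(Agent, E)", "contact(Agent, Patient)",
                       "exert_force(Agent, Patient)", "motion(Patient)"]},
        {"description": "Conative",
         "syntax": "Agent V at Patient",       # John pushed at the chair.
         "semantics": ["contact(Agent, Patient)",
                       "exert_force(Agent, Patient)"]},  # no change of location
    ],
}
```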

  19. VerbNet: Push

  20. Manner of Motion Verbs: • Roll verbs • The ball rolled (down the hill.) • Down the hill rolled the ball. • Bill rolled the ball down the hill. • The ball rolled free. • The ball rolled 3 feet.

  21. Manner of Motion Verbs: • Run verbs • The horse jumped (over the stream). • The horse jumped the stream. • Over the stream jumped the horse. • The rider jumped the horse over the stream. • The horse jumped himself into a lather. • The horse jumped five feet. • He made/went for a jump.

  22. Manner of Motion Verbs: • Roll verbs • The ball rolled down the hill. arg1 argP • Bill rolled the ball down the hill. arg0 arg1 argP • Run verbs • The horse jumped. arg0 • The rider jumped the horse over the stream. argA arg0 argP

  23. Levin classes involving Motion verbs

  24. Portuguese – similar patterns, except for… • Same • Bounce/rebater • Float/flutuar • Roll/rolar • Slide/deslizar • Different (no causative) • Drift/derivar • Glide/planar

  25. Machine Translation - Head switching • The log floated into the cave. • A madeira entrou na caverna flutuando. • [log] [entered] [cave] [floating]

  26. Treatment for Head Switching uses: • Cross-linguistic generalizations • based on • Intersective Levin classes • implemented in • Feature-based Lexicalized Tree-Adjoining Grammars

  27. Partial derivations for Head Switching - STAG Transfer Lexicon AMTA96, KLUWER98

  28. Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks

  29. Preparing Training Data • WordNet - online lexical resource • ISA relations, part-whole, synonym sets • Poor inter-annotator agreement, 70-80% (e.g. for lose) • No predicate-argument structures or constraints • SENSEVAL/SIGLEX98 (Brighton, Sept. '98) • Workshop on Word Sense Disambiguation • 34 words, corpus-based sense inventory • Inter-annotator agreement over 90%

  30. Shake: Hector vs. WordNet • Hector: 8 major distinctions; 3 shake up, 3 shake down, 2 shake out, 2 shake off • WordNet: 8 major distinctions; 2 shake up, 2 shake off

  31. Mismatches between lexicons: Hector - WordNet

  32. VERBNET

  33. Approach • Revising WordNet with VerbNet • corpus-based sense distinctions • explicit predicate-argument structure and constraints • Provides more coarse-grained sense distinctions for easy mapping to other lexical resources • Sense tagging Penn TreeBank

  34. Semantic Annotation – Hoa Dang, Joseph Rosenzweig, John Duda • Current syntactic annotation • POS, phrase structure bracketing • Logical subject, locative, temporal adjuncts • New semantic augmentations • Sense tag verbs and noun arguments/adjuncts • Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)

  35. First Experiment (Siglex99) • WSJ 5K word corpus • running text • WordNet 1.6 • 2100 words sense tagged twice (10 days) • 89% inter-annotator agreement • 700 verb tokens – 81% agreement (disagreement in 90/350 verb tokens) • Automatic predicate-argument labeling • 81% precision on 162 structures • Hand corrected 2100 words in one day
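The agreement figures above are raw percent agreement over double-tagged tokens; a minimal computation looks like the sketch below (the sense-tag format is illustrative).

```python
# A minimal sketch of raw percent agreement over double-tagged tokens.
def percent_agreement(tags_a, tags_b):
    """Fraction of tokens given identical sense tags by two annotators."""
    assert len(tags_a) == len(tags_b)
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

# Toy example: 8 of 10 tokens tagged identically -> 0.8
a = ["shake%2", "rock%1", "lose%1", "lose%2", "run%3",
     "push%1", "break%4", "break%4", "see%1", "see%2"]
b = ["shake%2", "rock%1", "lose%1", "lose%1", "run%3",
     "push%1", "break%4", "break%2", "see%1", "see%2"]
print(percent_agreement(a, b))  # 0.8
```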

  36. Example • I was shaking the whole time. <arg0> <WN2> <temporal> • The walls shook; the building rocked. <arg1> <WN3>; <arg1> <WN1>

  37. Predicate argument labels • Rosenzweig’s converter • Uses TreeBank “cues” • Consults lexical semantic KB • Verb subcategorization frames and alternations • Ontology of noun-phrase referents • Multi-word lexical items

  38. Predicate-Argument Labeling: one raid tree – Rosenzweig's converter

  39. Predicate-Argument Labeling: one raid tree – Rosenzweig's converter

  40. Second Experiment: Methodology (with Christiane Fellbaum) • Sense tagging of 150K words • Two human annotators (replace one with automatic WSD if possible) • WordNet senses, but allow for revision of entries • Automatic predicate-argument labeling • hand correction, lexicon for reference • Standoff XML annotation
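Standoff annotation keeps the semantic tags in a separate file that points back into the TreeBank by token positions. The record below is an assumed format for the slide-36 example ("I was shaking the whole time."); the file name, span indexing, and attribute names are illustrative, not the project's actual DTD.

```python
# A hypothetical standoff record; names and indexing are assumptions.
STANDOFF_RECORD = """\
<annotation file="wsj_0000.mrg" sentence="1">
  <instance token="3" lemma="shake" pos="VBG" sense="WN2">
    <arg label="arg0" span="1-1"/>      <!-- I -->
    <arg label="temporal" span="4-6"/>  <!-- the whole time -->
  </instance>
</annotation>
"""
print(STANDOFF_RECORD)
```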

  41. Outline • Lexical Choice in Machine Translation • Constraints on choices - word level • Introduce verb classes – VerbNet • Constraints on choices - class level • Sense tagging data • adding semantics to the Penn TreeBank • VerbNet as a companion lexicon • Korean and Chinese TreeBanks

  42. Example translation

  43. Korean/English MT Components • Korean: morphological analyzer (POS tags), parser/generator, TreeBank, companion pred-arg lexicon • English: POS tagger, parser/generator, TreeBank, companion pred-arg lexicon • Transfer Lexicon linking the two

  44. Transfer lexicon entries: Mapping predicate argument structures across languages
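One assumed shape for such a transfer lexicon entry is a source predicate-argument structure, a target one, and an explicit role alignment; the alignment is the identity here, but head-switching cases (float -> entrar ... flutuando) need richer structure. The field names are illustrative.

```python
# A hedged sketch of a transfer-lexicon entry; field names are illustrative.
TRANSFER_ENTRY = {
    "english": {"pred": "lose", "sense": "lose2",
                "args": {"arg0": "Agent", "arg1": "Patient: competition"}},
    "korean":  {"pred": "ciessta",
                "args": {"arg0": "Agent", "arg1": "Patient"}},
    "alignment": {"arg0": "arg0", "arg1": "arg1"},  # identity mapping here
}
```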

  45. Korean/English MT – Chunghye Han, Juntae Yoon, Meesook Kim, Eonsuk Ko (CoGenTex/Penn/Systran: ARL) • Parallel TreeBanks for Korean/English enable • Training of domain-specific Korean parsers • Collins parser and SuperTagger (also English) • Alignment of Korean/English structures • Attempt automatic and semi-automatic testing and generation of transfer lexicon (with CoGenTex) • Apply statistical MT techniques • Lexical semantics (Systran, mapped to EuroWordNet-IL) should improve • Accuracy of parsers • Recovery of dropped arguments • http://www.cis.upenn.edu/~xtag/koreantag/index.html

  46. Chinese TreeBank – DOD – Fei Xia, Nianwen Xue, Fu-dong Chiou, http://www.ldc.upenn.edu/ctb/index.html • Workshop of interested members of Chinese community, June '98 • Guidelines and sample files posted on web • Segmentation, March '99 • POS tagging, March '99 • Bracketing, first pass, October '99 • Bracketing, second pass, May '00 • 95%+ inter-annotator consistency • Release of 100K annotated data, July '00 • Follow-up workshop, Hong Kong, ACL '00

  47. Goal for Chinese • Parallel, annotated corpora – translate CTB • Parse English with WSJ-trained parsers, correct • Extend English TreeBank lexicon as needed • Parse Chinese with CTB-trained parsers, correct • Start with lexicon extracted from CTB, extend • Experiment with semi-automated techniques wherever possible to speed up the process

  48. Conclusion • Corpus annotations can be efficiently and reliably enriched with semantics • Companion lexicons can be derived from them • Challenge: parallel corpora annotated with predicate-argument structures will improve statistical MT. Prove me wrong?
