Exploring PropBanks for English and Hindi - PowerPoint PPT Presentation

Presentation Transcript

  1. Exploring PropBanks for English and Hindi Ashwini Vaidya Dept of Linguistics University of Colorado, Boulder

  2. Why is semantic information important? • Imagine an automatic question answering system • Who created the first effective polio vaccine? • Two possible choices: • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

  3. Question Answering • Who created the first effective polio vaccine? • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh

  4. Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh

  5. Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson]_agent created the [first disposable syringe]_theme for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine]_theme was created in 1952 by [Jonas Salk]_agent at the University of Pittsburgh

  6. Question Answering • We need semantic information to prefer the right answer • The theme of create should be ‘the first effective polio vaccine’ • The theme in the first sentence was ‘the first disposable syringe’ • We can filter out the wrong answer
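The filtering step described above can be sketched in a few lines. The data structures and the `answer_who_question` helper are hypothetical stand-ins for the output of a real semantic role labeller, not part of any actual QA system:

```python
# Hypothetical predicate-argument structures for the two candidate answers.
# A real SRL system would produce these automatically; here they are hand-coded.
candidates = [
    {"predicate": "create",
     "roles": {"agent": "Becton Dickinson",
               "theme": "the first disposable syringe"}},
    {"predicate": "create",
     "roles": {"agent": "Jonas Salk",
               "theme": "the first effective polio vaccine"}},
]

def answer_who_question(predicate, theme, candidates):
    """Return the agent of the candidate whose theme matches the question focus."""
    for cand in candidates:
        if cand["predicate"] == predicate and cand["roles"].get("theme") == theme:
            return cand["roles"].get("agent")
    return None

print(answer_who_question("create", "the first effective polio vaccine", candidates))
```

Because only the second candidate's theme matches the question, the first answer is filtered out.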

  7. We need semantic information • To find out about events and their participants • To capture semantic information across syntactic variation

  8. Semantic information • Semantic information about verbs and participants expressed through semantic roles • Agent, Experiencer, Theme, Result etc. • However, difficult to have a standard set of thematic roles

  9. Proposition Bank • Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling • A PropBank is a large annotated corpus of predicate-argument information • A set of semantic roles is defined for each verb • A syntactically parsed corpus is then tagged with verb-specific semantic role information

  10. Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels

  11. Proposition Bank • The first (English) PropBank was created on a 1-million-word syntactically parsed Wall Street Journal corpus • PropBank annotation has also been done on different genres, e.g. web text and biomedical text • Arabic, Chinese & Hindi PropBanks have also been created

  12. English PropBank • English PropBank envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003) • Added a layer of predicate-argument information to the Penn Treebank • Broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus • Amenable to collecting representative statistics

  13. English PropBank Annotation • Two steps are involved in annotation • Choose a sense ID for the predicate • Annotate the arguments of that predicate with semantic roles • This requires two components: frame files and the PropBank tagset

  14. PropBank Frame files • PropBank defines semantic roles on a verb-by-verb basis • This is defined in a verb lexicon consisting of frame files • Each predicate will have a set of roles associated with a distinct usage • A polysemous predicate can have several rolesets within its frame file

  15. An example • John rings the bell

  16. An example • John rings the bell • Tall aspen trees ring the lake

  17. An example • [John] rings [the bell] (ring.01) • [Tall aspen trees] ring [the lake] (ring.02)

  18. An example • [John]_Arg0 rings [the bell]_Arg1 (ring.01) • [Tall aspen trees]_Arg1 ring [the lake]_Arg2 (ring.02)
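A frame file with its rolesets can be represented very simply in code. The sketch below encodes the two rolesets of ring from the example; the dictionary layout and the role descriptions are illustrative paraphrases, not the actual frame-file XML format:

```python
# A minimal sketch of a PropBank frame file as a Python structure.
# Each roleset pairs a sense ID with its verb-specific role inventory.
frame_file = {
    "lemma": "ring",
    "rolesets": {
        "ring.01": {          # sense: cause to make a ringing sound
            "ARG0": "ringer",
            "ARG1": "thing rung",
        },
        "ring.02": {          # sense: encircle
            "ARG1": "thing forming the ring",
            "ARG2": "thing encircled",
        },
    },
}

def roles_for(frame, roleset_id):
    """Look up the role inventory for one sense of the predicate."""
    return frame["rolesets"][roleset_id]

print(roles_for(frame_file, "ring.02"))
```

An annotator first picks the sense ID (step 1), which fixes the role inventory used in step 2.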

  19. Frame files • The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea, Kingsbury, 2005) • Most frequently occurring verb: say • Small number of verbs had several framesets e.g. go, come, take, make • Most others had only one frameset per file

  20. PropBank annotation pane in Jubilee

  21. English PropBank Tagset • Numbered arguments Arg0, Arg1, and so on up to Arg4 • Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose) • Modifiers give additional information about when, where or how the event occurred

  22. PropBank tagset • Numbered arguments correspond to the valency requirements of the verb • Or to those arguments that occur with high frequency with that verb

  23. PropBank tagset • 15 modifier labels for English PropBank • [He]_Arg0 studied [economic growth]_Arg1 [in India]_ArgM-LOC
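A labelled instance like the one above can be stored as token spans. This is a toy encoding for illustration; the word-index spans and the `span_text` helper are assumptions, not PropBank's actual file format:

```python
# A toy PropBank-style annotation: each label maps to a (start, end) token span.
tokens = ["He", "studied", "economic", "growth", "in", "India"]
annotation = {
    "rel": (1, 2),        # the predicate: studied
    "ARG0": (0, 1),       # He
    "ARG1": (2, 4),       # economic growth
    "ARGM-LOC": (4, 6),   # in India
}

def span_text(tokens, span):
    """Recover the surface string for a labelled span."""
    start, end = span
    return " ".join(tokens[start:end])

print({label: span_text(tokens, span) for label, span in annotation.items()})
```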

  24. PropBank tagset • Labels are verb-specific and also more generalized • Arg0 and Arg1 correspond to Dowty's proto-roles • They leverage the commonalities among semantic roles • Agents, causers, experiencers – Arg0 • Undergoers, patients, themes – Arg1

  25. PropBank tagset • While annotating Arg0 and Arg1: • Unaccusative verbs take Arg1 as their subject argument • [The window]_Arg1 broke • Unergatives take Arg0 • [John]_Arg0 sang • A distinction is also made between internally caused events (blush: Arg0) & externally caused events (redden: Arg1)
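The subject-labelling convention above is deterministic once a verb's class is known, and can be sketched as a lookup. The `VERB_CLASS` table is a hypothetical stand-in for the VerbNet classes or diagnostic tests the slides mention:

```python
# Sketch of the Arg0/Arg1 subject convention for intransitives.
# The class table is illustrative; in practice it comes from VerbNet or tests.
VERB_CLASS = {
    "break": "unaccusative",        # [The window]_Arg1 broke
    "sing": "unergative",           # [John]_Arg0 sang
    "blush": "internally_caused",   # internally caused -> Arg0
    "redden": "externally_caused",  # externally caused -> Arg1
}

def subject_label(verb):
    """Return the PropBank label for the verb's subject argument."""
    cls = VERB_CLASS.get(verb)
    if cls in ("unergative", "internally_caused"):
        return "ARG0"
    if cls in ("unaccusative", "externally_caused"):
        return "ARG1"
    return None  # unknown verb: needs a frame-file decision
```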

  26. PropBank tagset • How might these map to the more familiar thematic roles? • Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles

  27. More frequent Arg0 and Arg1 (85% of arguments) are learnt more easily by automatic systems • Arg2 is less frequent and maps to more than one thematic role • Arg3–Arg5 are even more infrequent

  28. Using PropBank • As a computational resource • Train semantic role labellers (Pradhan et al., 2005) • Question answering systems (with FrameNet) • Project semantic roles onto a parallel corpus in another language (Padó & Lapata, 2005) • For linguists, to study various phenomena related to predicate-argument structure

  29. Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels

  30. Developing PropBank for Hindi-Urdu • Hindi-Urdu PropBank is part of a project to develop a Multi-layered and multi-representational treebank for Hindi-Urdu • Hindi Dependency Treebank • Hindi PropBank • Hindi Phrase Structure Treebank • Ongoing project at CU-Boulder

  31. Hindi-Urdu PropBank • Corpus of 400,000 words for Hindi • Smaller corpus of 150,000 words for Urdu • Hindi corpus consists of newswire text from 'Amar Ujala' • So far: • 220 verb frames • ~100K words annotated

  32. Developing Hindi PropBank • Making a PropBank resource for a new language • Linguistic differences • Capturing relevant language-specific phenomena • Annotation practices • Maintain similar annotation practices • Consistency across PropBanks

  33. Developing Hindi PropBank • PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees • Hindi PropBank is annotated on dependency trees

  34. Dependency tree • Represents relations that hold between constituents (chunks) • Karaka labels show the relations between the head verb and its dependents • Example tree: दिये 'gave' is the head; रामने 'Raam-erg' is its k1, पैसे 'money' its k2, and औरतको 'woman-dat' its k4
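The karaka tree in this example can be encoded as labelled head-dependent edges. The triple representation and transliterations below are illustrative assumptions, not the treebank's storage format:

```python
# The example dependency tree as (dependent, karaka label, head) triples.
# k1 ~ agent/doer, k2 ~ theme/patient, k4 ~ recipient (Paninian labels).
edges = [
    ("raam-ne", "k1", "diye"),   # Ram-erg   -> gave
    ("paise", "k2", "diye"),     # money     -> gave
    ("aurat-ko", "k4", "diye"),  # woman-dat -> gave
]

def dependents_of(head, edges):
    """Collect the labelled dependents of one head node."""
    return {label: dep for dep, label, head_ in edges if head_ == head}

print(dependents_of("diye", edges))
```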

  35. Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling

  36. Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling • Both frame creation and labelling require new strategies for Hindi

  37. Hindi PropBank • Hindi frame files were adapted to include • Morphological causatives • Unaccusative verbs • Experiencers • Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs

  38. Causatives • Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file • E.g. Cornerstone's multi-lemma mode is used for the verbs खा (KA, 'eat'), खिला (KilA, 'feed') and खिलवा (KilvA, 'cause to feed')

  39. Causatives • [raam ne]_Arg0 [khaanaa]_Arg1 khaayaa / Ram erg food eat-perf / 'Ram ate the food' • [mohan ne]_Arg0 [raam ko]_Arg2 [khaanaa]_Arg1 khilaayaa / Mohan erg Ram dat food eat-caus-perf / 'Mohan made Ram eat the food' • [sita ne]_ArgC [mohan se]_ArgA [raam ko]_Arg2 [khaanaa]_Arg1 khilvaayaa / Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf / 'Sita, through Mohan, made Ram eat the food'
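The role sets of the three morphologically related predicates can be laid out side by side. The dictionary below is a sketch of the multi-lemma frame file implied by the examples; the role descriptions are my paraphrases:

```python
# Sketch: role sets for the eat/feed/cause-to-feed causative family,
# following the labels used in the annotated examples (ArgC = causer,
# ArgA = intermediate/secondary agent in the indirect causative).
causative_frames = {
    "KA (eat)":     {"ARG0": "eater", "ARG1": "food"},
    "KilA (feed)":  {"ARG0": "feeder", "ARG2": "eater", "ARG1": "food"},
    "KilvA (cause to feed)":
        {"ARGC": "causer", "ARGA": "intermediate agent",
         "ARG2": "eater", "ARG1": "food"},
}

print(sorted(causative_frames["KA (eat)"]))  # ['ARG0', 'ARG1']
```

Note how the indirect causative replaces Arg0 with ArgC and introduces ArgA for the intermediate agent.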

  40. Unaccusatives • PropBank needs to distinguish proto-agents and proto-patients • Sudha danced – unergative; animate agentive arguments are Arg0 • The door opened – unaccusative; non-animate patient arguments are Arg1 • For English, these distinctions were available in VerbNet • For Hindi, various diagnostic tests are applied to make this distinction

  41. Unaccusativity diagnostics (5) • First stage: applicable to verbs undergoing the transitivity alternation; eliminate those that take (mostly) animate subjects using the cognate-object and ergative-case tests • Yes: unaccusative, e.g. khulnaa, barasnaa • No: second stage – ergative, e.g. naac, dauRa, bEtha? Eliminate bEtha • Third stage: for non-alternating verbs and others that remain, the verb will be unaccusative if: • Impersonal passives are not possible • Use of 'huaa' (past participial relative) is possible • The inanimate subject appears without an overt genitive • Otherwise, take the majority vote on the tests
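The staged procedure above amounts to a small decision function. The sketch below is a simplified reading of the flowchart; the test names are paraphrases of the linguistic diagnostics, and the two-of-three vote threshold is an assumption standing in for "majority vote on the tests":

```python
# Sketch of the staged unaccusativity decision procedure.
# Inputs are boolean outcomes of the linguistic diagnostics for one verb.
def classify(verb_tests):
    # Stage 1: alternating verbs with (mostly) animate subjects are
    # eliminated as unergative by the cognate-object / ergative-case tests.
    if verb_tests.get("alternates") and verb_tests.get("animate_subject"):
        return "unergative"
    # Later stages: majority vote over three diagnostics for remaining verbs.
    votes = [
        not verb_tests.get("impersonal_passive", False),  # no impersonal passive
        verb_tests.get("huaa_participle", False),         # 'huaa' relative possible
        verb_tests.get("bare_inanimate_subject", False),  # no overt genitive
    ]
    return "unaccusative" if sum(votes) >= 2 else "unergative"
```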

  42. Unaccusativity • This information is then captured in the frame

  43. Experiencers • Arg0 includes agents, causers, experiencers • In Hindi, experiencer subjects occur with dikhnaa ('to glimpse'), milnaa ('to find'), suujhnaa ('to be struck with') etc. • Typically marked with dative case • Mujh-ko chaand dikhaa / I-dat moon glimpse.pf / 'I glimpsed the moon' • PropBank labels these as Arg0, with an additional descriptor field 'experiencer' • These are internally caused events

  44. Complex predicates • A large number of complex predicates in Hindi • Noun-verb complex predicates • vichaar karnaa; think do ('to think') • dhyaan denaa; attention give ('to pay attention') • Adjectives can also occur • acchaa lagnaa; good seem ('to like') • The predicating element is not the verb alone, but forms a composite with the noun/adjective

  45. Complex predicates • There are also a number of verb-verb complex predicates • These combine with the bare form of the main verb to convey different meanings • ro paRaa; cry fall ('burst out crying') • They add some aspectual meaning to the verb • As these occur with only a certain set of verbs, we find them automatically, based on part-of-speech tags
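Finding verb-verb complex predicates from POS tags can be sketched as a sequence scan. The tag names, the closed set of "vector" verbs, and the example sentence are all illustrative assumptions, not the treebank's actual tagset:

```python
# Sketch: detect verb-verb complex predicates from a POS-tagged sentence.
# The second verb must come from a closed set of light ("vector") verbs.
VECTOR_VERBS = {"paR", "jaa", "le", "de"}  # fall, go, take, give (illustrative)

def find_vv_predicates(tagged_tokens):
    """Return (main verb, vector verb) pairs for adjacent verb-verb sequences."""
    hits = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 == "VB" and t2 == "VB" and w2 in VECTOR_VERBS:
            hits.append((w1, w2))
    return hits

# Toy tagged sentence: vah ro paR-ii 'she burst out crying'
sent = [("vah", "PRP"), ("ro", "VB"), ("paR", "VB"), ("ii", "AUX")]
print(find_vv_predicates(sent))  # [('ro', 'paR')]
```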

  46. Complex predicates • However, the noun-verb complex predicates are more pervasive • They occur with a large number of nouns • Frame files for the nominal predicate need to be created • E.g. in a sample of 90K words, the light verb kar; ‘do’ occurred 1738 times with 298 different nominals
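A count like the kar figure above comes from tallying nominal + light-verb pairs over the corpus. The sketch below shows the tallying step on toy data; the pair list is a hypothetical stand-in for pairs extracted from the treebank:

```python
# Sketch: count which nominals occur with the light verb kar ('do').
# The pair list is toy data standing in for corpus-extracted N+V pairs.
from collections import Counter

pairs = [("vichaar", "kar"), ("corii", "kar"), ("vichaar", "kar"),
         ("dhyaan", "de"), ("corii", "kar")]

kar_nominals = Counter(n for n, v in pairs if v == "kar")
print(kar_nominals.most_common())
```

Sorting such counts helps decide which nominal frame files to create first, and whether rare nominals can be clustered.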

  47. Complex predicates • Annotation strategy • Additional resources (frame files) • Cluster the large number of nominals?

  48. Light verb annotation • A convention for the annotation of light verbs has been adopted across the Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010) • Annotation is carried out in three passes • Manual identification of the light verb • Annotation of arguments based on the frame file • Deterministic merging of the light verb and 'true' predicate • In Hindi, this process may be simplified because the dependency label 'pof' identifies a light verb

  49. Light verb annotation example • Identify the N+V sequences that are complex predicates; annotate the predicating expression with ARG-PRX → REL: kar, ARG-PRX: corii • Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame • Automatically merge the nominal with the light verb → REL: corii_kiye, Arg0: raam, Arg1: paese
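The deterministic merging pass can be sketched as a dictionary transformation. The annotation layout and the `merge_light_verb` helper are illustrative assumptions; here the REL field already carries the inflected light verb (kiye) so the merge yields the composite predicate shown on the slide:

```python
# Sketch of the deterministic merge pass: the nominal tagged ARG-PRX is
# fused with the (inflected) light verb to form one composite predicate.
def merge_light_verb(annotation):
    merged = dict(annotation)
    nominal = merged.pop("ARG-PRX")        # the 'true' predicating nominal
    merged["REL"] = nominal + "_" + merged["REL"]
    return merged

before = {"REL": "kiye", "ARG-PRX": "corii",
          "ARG0": "raam", "ARG1": "paese"}
print(merge_light_verb(before))
```

After the merge, the arguments (Arg0: raam, Arg1: paese) attach to the composite predicate corii_kiye.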

  50. Hindi PropBank tagset • 24 labels