
Exploring PropBanks for English and Hindi • Ashwini Vaidya • Dept. of Linguistics, University of Colorado, Boulder
Why is semantic information important? • Imagine an automatic question answering system • Who created the first effective polio vaccine? • Two possible choices: • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • Becton Dickinson created the first disposable syringe for use with the mass administration of the first effective polio vaccine • The first effective polio vaccine was created in 1952 by Jonas Salk at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson] created the [first disposable syringe] for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine] was created in 1952 by [Jonas Salk] at the University of Pittsburgh
Question Answering • Who created the first effective polio vaccine? • [Becton Dickinson]agent created the [first disposable syringe]theme for use with the mass administration of the first effective polio vaccine • [The first effective polio vaccine]theme was created in 1952 by [Jonas Salk]agent at the University of Pittsburgh
Question Answering • We need semantic information to prefer the right answer • The theme of create should be ‘the first effective polio vaccine’ • The theme in the first sentence was ‘the first disposable syringe’ • We can filter out the wrong answer
We need semantic information • To find out about events and their participants • To capture semantic information across syntactic variation
Semantic information • Semantic information about verbs and participants expressed through semantic roles • Agent, Experiencer, Theme, Result etc. • However, difficult to have a standard set of thematic roles
Proposition Bank • Proposition Bank (PropBank) provides a way to carry out general-purpose semantic role labelling • A PropBank is a large annotated corpus of predicate-argument information • A set of semantic roles is defined for each verb • A syntactically parsed corpus is then tagged with verb-specific semantic role information
Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels
Proposition Bank • The first (English) PropBank was created on a 1-million-word syntactically parsed Wall Street Journal corpus • PropBank annotation has also been done on different genres, e.g. web text, biomedical text • Arabic, Chinese & Hindi PropBanks have been created
English PropBank • English PropBank was envisioned as the next level of the Penn Treebank (Kingsbury & Palmer, 2003) • Added a layer of predicate-argument information to the Penn Treebank • Broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus • Amenable to collecting representative statistics
English PropBank Annotation • Two steps are involved in annotation • Choose a sense ID for the predicate • Annotate the arguments of that predicate with semantic roles • This requires two components: frame files and the PropBank tagset
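Concretely, the two steps yield one record per predicate instance. Below is a minimal, purely illustrative sketch using the polio-vaccine sentence from the opening example; the field names (rel, args) are hypothetical and are not the official PropBank release format.

```python
# Hypothetical record for one annotated predicate instance; field names are
# illustrative, not the official PropBank file format.
instance = {
    "rel": "create.01",   # step 1: sense ID chosen from the verb's frame file
    "args": {             # step 2: semantic roles assigned to the arguments
        "Arg0": "Jonas Salk",
        "Arg1": "The first effective polio vaccine",
        "ArgM-TMP": "in 1952",
    },
}
```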
PropBank Frame files • PropBank defines semantic roles on a verb-by-verb basis • This is defined in a verb lexicon consisting of frame files • Each predicate will have a set of roles associated with a distinct usage • A polysemous predicate can have several rolesets within its frame file
An example • John rings the bell
An example • John rings the bell • Tall aspen trees ring the lake
An example • Ring.01: [John] rings [the bell] • Ring.02: [Tall aspen trees] ring [the lake]
An example • Ring.01: [John]Arg0 rings [the bell]Arg1 • Ring.02: [Tall aspen trees]Arg1 ring [the lake]Arg2
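A hedged sketch of how the two 'ring' rolesets just shown might be represented. Real PropBank frame files are XML documents, so this Python rendering and the role descriptions are only illustrative, paraphrased from the examples above.

```python
# Illustrative rendering of a frame file with two rolesets for "ring"
# (actual frame files are XML; descriptions paraphrased from the examples above).
ring_frame = {
    "ring.01": {              # John rings the bell
        "Arg0": "ringer",
        "Arg1": "thing rung",
    },
    "ring.02": {              # Tall aspen trees ring the lake
        "Arg1": "thing(s) forming the ring",
        "Arg2": "thing encircled",
    },
}
```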
Frame files • The Penn Treebank had about 3185 unique lemmas (Palmer, Gildea & Kingsbury, 2005) • Most frequently occurring verb: say • A small number of verbs had several framesets, e.g. go, come, take, make • Most others had only one frameset per file
English PropBank Tagset • Numbered arguments Arg0, Arg1, and so on up to Arg4 • Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose) • Modifiers give additional information about when, where or how the event occurred
PropBank tagset • Numbered arguments correspond to the valency requirements of the verb • Or to arguments that occur with high frequency with that verb
PropBank tagset • 15 modifier labels for English PropBank • [He]Arg0 studied [economic growth]Arg1 [in India]ArgM-LOC
PropBank tagset • Labels are both verb-specific and more generalized • Arg0 and Arg1 correspond to Dowty's proto-roles • Leverage the commonalities among semantic roles • Agents, causers, experiencers: Arg0 • Undergoers, patients, themes: Arg1
PropBank tagset • While annotating Arg0 and Arg1: • Unaccusative verbs take Arg1 as their subject argument • [The window]Arg1 broke • Unergatives take Arg0 • [John]Arg0 sang • A distinction is also made between internally caused events (blush: Arg0) and externally caused events (redden: Arg1)
PropBank tagset • How might these map to the more familiar thematic roles? • Yi, Loper & Palmer (2007) describe such a mapping to VerbNet roles
The more frequent Arg0 and Arg1 (85% of arguments) are learned more easily by automatic systems • Arg2 is less frequent and maps to more than one thematic role • Arg3-Arg5 are even more infrequent
Using PropBank • As a computational resource • Train semantic role labellers (Pradhan et al., 2005) • Question answering systems (with FrameNet) • Project semantic roles onto a parallel corpus in another language (Padó & Lapata, 2005) • For linguists, to study various phenomena related to predicate-argument structure
Outline • English PropBank • Background • Annotation • Frame files & Tagset • Hindi PropBank development • Adapting Frame files • Light verbs • Mapping from dependency labels
Developing PropBank for Hindi-Urdu • Hindi-Urdu PropBank is part of a project to develop a Multi-layered and multi-representational treebank for Hindi-Urdu • Hindi Dependency Treebank • Hindi PropBank • Hindi Phrase Structure Treebank • Ongoing project at CU-Boulder
Hindi-Urdu PropBank • Corpus of 400,000 words for Hindi • Smaller corpus of 150,000 words for Urdu • Hindi corpus consists of newswire text from 'Amar Ujala' • So far: • 220 verb frames • ~100K words annotated
Developing Hindi PropBank • Making a PropBank resource for a new language • Linguistic differences • Capturing relevant language-specific phenomena • Annotation practices • Maintain similar annotation practices • Consistency across PropBanks
Developing Hindi PropBank • PropBank annotation for English, Chinese & Arabic was done on top of phrase structure trees • Hindi PropBank is annotated on dependency trees
Dependency tree • Represents relations that hold between constituents (chunks) • Karaka labels show the relations between the head verb and its dependents • Example: राम ने (raam ne 'Ram erg', k1), औरत को (aurat ko 'woman dat', k4) and पैसे (paise 'money', k2) depend on the head verb दिये (diye 'gave')
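A minimal sketch of how this chunk-level tree could be encoded; the layout below is hypothetical and is not the treebank's actual file format.

```python
# Hypothetical encoding of the dependency tree for
# "raam ne aurat ko paise diye" ('Ram gave money to the woman').
tree = {
    "head": "diye",          # दिये 'gave'
    "dependents": [
        ("raam ne", "k1"),   # राम ने, ergative-marked chunk
        ("aurat ko", "k4"),  # औरत को, dative-marked chunk
        ("paise", "k2"),     # पैसे 'money'
    ],
}
```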
Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling
Hindi PropBank • There are three components to the annotation • Hindi Frame file creation • Insertion of empty categories • Semantic role labelling • Both frame creation and labelling require new strategies for Hindi
Hindi PropBank • Hindi frame files were adapted to include • Morphological causatives • Unaccusative verbs • Experiencers • Additionally, changes had to be made to analyze the large number (nearly 40%) of light verbs
Causatives • Verbs that are related via morphological derivation are grouped together as individual predicates in the same frame file • E.g. Cornerstone's multi-lemma mode is used for the verbs खा (KA 'eat'), खिला (KilA 'feed') and खिलवा (KilvA 'cause to feed')
Causatives • [raam ne]Arg0 [khaanaa]Arg1 khaayaa / Ram erg food eat-perf / 'Ram ate the food' • [mohan ne]Arg0 [raam ko]Arg2 [khaanaa]Arg1 khilaayaa / Mohan erg Ram dat food eat-caus-perf / 'Mohan made Ram eat the food' • [sita ne]ArgC [mohan se]ArgA [raam ko]Arg2 [khaanaa]Arg1 khilvaayaa / Sita erg Mohan instr Ram acc food eat-ind.caus-caus-perf / 'Sita, through Mohan, made Ram eat the food'
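A hedged sketch of how the morphologically related forms above could sit together in one multi-lemma frame file; Cornerstone stores frame files as XML, so this Python rendering is illustrative, and the role descriptions simply follow the glossed examples above.

```python
# Illustrative multi-lemma frame file grouping खा 'eat', खिला 'feed' and
# खिलवा 'cause to feed'; role labels follow the glossed examples above.
khaa_frame = {
    "KA.01": {                      # खा 'eat'
        "Arg0": "eater",
        "Arg1": "thing eaten",
    },
    "KilA.01": {                    # खिला 'feed'
        "Arg0": "causer (feeder)",
        "Arg2": "one caused to eat",
        "Arg1": "thing eaten",
    },
    "KilvA.01": {                   # खिलवा 'cause to feed'
        "ArgC": "causer",
        "ArgA": "intermediary causer",
        "Arg2": "one caused to eat",
        "Arg1": "thing eaten",
    },
}
```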
Unaccusatives • PropBank needs to distinguish proto-agents and proto-patients • Sudha danced: unergative; animate agentive arguments are Arg0 • The door opened: unaccusative; non-animate patient arguments are Arg1 • For English, these distinctions were available in VerbNet • For Hindi, various diagnostic tests are applied to make this distinction
Unaccusativity diagnostics (5 tests) • First stage: applicable to verbs undergoing the transitivity alternation; eliminate those that take (mostly) animate subjects; verbs that pass are unaccusative, e.g. khulnaa, barasnaa • Second stage: cognate object and ergative case tests, applied to verbs such as naac, dauRa, bEtha (bEtha is eliminated at this stage) • Third stage: for non-alternating verbs and others that remain, the verb will be unaccusative if impersonal passives are not possible, the 'huaa' (past participial relative) form is possible, and the inanimate subject appears without an overt genitive • Otherwise, take the majority vote on the tests
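The fallback step, taking the majority vote over the diagnostics, is simple enough to sketch; the test names in the usage comment are placeholders, not the diagnostics' official names.

```python
# Minimal sketch of the majority-vote fallback over the unaccusativity
# diagnostics; the test names used below are placeholders.
def is_unaccusative(test_results: dict) -> bool:
    """Return True if more than half of the diagnostics point to unaccusativity."""
    votes = sum(bool(outcome) for outcome in test_results.values())
    return votes > len(test_results) / 2

# e.g. is_unaccusative({"no_impersonal_passive": True,
#                       "huaa_participle_possible": True,
#                       "inanimate_subject_without_genitive": False})  # -> True
```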
Unaccusativity • This information is then captured in the frame file
Experiencers • Arg0 includes agents, causers, experiencers • In Hindi, experiencer subjects occur with dikhnaa ('to glimpse'), milnaa ('to find'), suujhnaa ('to be struck with') etc. • Typically marked with dative case • Mujh-ko chaand dikhaa / I-dat moon glimpse.pf / 'I glimpsed the moon' • PropBank labels these as Arg0, with an additional descriptor field 'experiencer' • Internally caused events
Complex predicates • A large number of complex predicates occur in Hindi • Noun-verb complex predicates • vichaar karna: think do ('to think') • dhyaan denaa: attention give ('to pay attention') • Adjectives can also occur • accha lagnaa: good seem ('to like') • The predicating element is not the verb alone, but forms a composite with the noun/adjective
Complex predicates • There are also a number of verb-verb complex predicates • These combine with the bare form of the main verb to convey different meanings • ro paRaa: cry lie ('burst out crying') • They add aspectual meaning to the verb • As these occur with only a certain set of verbs, we identify them automatically, based on part-of-speech tags
Complex predicates • However, the noun-verb complex predicates are more pervasive • They occur with a large number of nouns • Frame files for the nominal predicates need to be created • E.g. in a sample of 90K words, the light verb kar 'do' occurred 1738 times with 298 different nominals
Complex predicates • Annotation strategy • Additional resources (frame files) • Cluster the large number of nominals?
Light verb annotation • A convention for the annotation of light verbs has been adopted across the Hindi, Arabic, Chinese and English PropBanks (Hwang et al., 2010) • Annotation is carried out in three passes • Manual identification of the light verb • Annotation of arguments based on the frame file • Deterministic merging of the light verb and the 'true' predicate • In Hindi, this process may be simplified because of the dependency label 'pof', which identifies a light verb
Light verb annotation: an example • Identify the N+V sequences that are complex predicates; annotate the predicating expression with ARG-PRX (REL: kar, ARG-PRX: corii) • Annotate the arguments and modifiers of the complex predicate with a nominal predicate frame • Automatically merge the nominal with the light verb (REL: corii_kiye, Arg0: raam, Arg1: paese)
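A hedged sketch of the final, deterministic merging pass, using the corii + kar example above; the function and field names are hypothetical, and kiye is simply the inflected surface form of kar 'do' in this sentence.

```python
# Illustrative sketch of the deterministic merge of a nominal predicate
# with its light verb; function and field names are hypothetical.
def merge_light_verb(annotation: dict) -> dict:
    merged = {k: v for k, v in annotation.items() if k != "ARG-PRX"}
    merged["REL"] = annotation["ARG-PRX"] + "_" + annotation["REL"]
    return merged

before = {"REL": "kiye", "ARG-PRX": "corii",   # kiye: inflected form of kar 'do'
          "Arg0": "raam", "Arg1": "paese"}
after = merge_light_verb(before)
# -> {"REL": "corii_kiye", "Arg0": "raam", "Arg1": "paese"}
```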
Hindi PropBank tagset • 24 labels