A Bilingual Corpus of Inter-linked Events

Tommaso Caselli♠, Nancy Ide♣, Roberto Bartolini♠
♠Istituto di Linguistica Computazionale – ILC-CNR Pisa
♣Department of Computer Science, Vassar College, USA
{tommaso.caselli@ilc.cnr.it; ide@cs.vassar.edu; roberto.bartolini@ilc.cnr.it}
LREC 08 – Marrakech, 30 May 2008

Outline
• Motivations
• A (gentle) introduction to TimeML, TimeBank and the Italian TimeBank corpora
• The Bilingual Corpus: linking events in TimeBank & Italian TimeBank by means of the Inter-Lingual Index
• Evaluation and Experiments:
  • similar events in the two corpora;
  • the ILI: a bootstrapping device for creating comparable corpora;
• Conclusion

Motivations
• Retrieving the temporal relations between events in texts is required to improve the performance of I.R. and Open Domain Q.A. systems;
• one of the most challenging tasks is event identification:
  • can we facilitate event recognition by linking two corpora that are comparable in size, content and annotation, in two different languages, by means of the Inter-Lingual Index (ILI), which links ItalWordNet (IWN) & WordNet (WN)?
  • are events encoded in the same way in Italian and English?
  • can we import layers of annotation from one corpus into another in a different language by exploiting the ILI?

TimeML, TimeBank & Italian TimeBank
TimeML (Pustejovsky et al., 2003) is a specification language for annotating the core elements of a temporal framework:
• temporal expressions (<TIMEX3>), e.g. December 1st, three years;
• a wide range of linguistic expressions (verbs, nouns, nominalizations, stative adjectives...) that realize eventualities, i.e. events and states (<EVENT>); these are classified into 7 classes (ASPECTUAL, REPORTING, I_ACTION, I_STATE, PERCEPTION, STATE, OCCURRENCE) according to semantic and syntactic criteria;
• connectives and temporal prepositions (<SIGNAL>), which make explicit the relation holding between two entities;
• dependencies between events (<SLINK>, <ALINK>, <TLINK>) and between events and times (<TLINK>).
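To make the tag inventory concrete, here is a minimal sketch that reads a TimeML-style fragment and lists its markables. The sentence, the ids and the TIMEX3 value are invented for illustration and do not come from TimeBank; the helper name collect_markables is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical TimeML-style fragment: sentence, ids and the date value are invented.
SAMPLE = (
    '<s>The company <EVENT eid="e1" class="REPORTING">said</EVENT> it '
    '<EVENT eid="e2" class="OCCURRENCE">closed</EVENT> the plant '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE" value="2007-12-01">December 1st</TIMEX3>.</s>'
)

MARKABLE_TAGS = {"EVENT", "TIMEX3", "SIGNAL"}


def collect_markables(xml_text):
    """Return (tag, text, attributes) for each EVENT, TIMEX3 and SIGNAL, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, elem.text, dict(elem.attrib))
        for elem in root.iter()
        if elem.tag in MARKABLE_TAGS
    ]


markables = collect_markables(SAMPLE)
for tag, text, attrs in markables:
    print(tag, repr(text), attrs)
```

Link tags (<TLINK>, <SLINK>, <ALINK>) are left out of the sketch; they would refer back to the eid and tid identifiers collected here.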

TimeML, TimeBank & Italian TimeBank (2)
TimeBank 1.2:
• first available corpus annotated with TimeML;
• 183 news articles from different sources, including the Penn TreeBank2 Wall Street Journal, for a total of 61K words;
• 7,935 events; K=0.81 on partial match for event identification & K=0.67 for event classification.
Italian TimeBank:
• Italian corpus comparable in size (62K words), content and annotation to TB 1.2 (171 articles from the Italian TreeBank and the PAROLE corpus);
• under development: >13K words annotated, 1,755 events;
• customization of TimeML to Italian (ISO-TimeML): imperfect value for TENSE; two new attributes, V_FORM & MOOD, for the <EVENT> tag; modification of the <EVENT> tag text span;
• mapping of the 7 TimeML event classes to the SIMPLE Ontology to improve event classification (K=0.84).

The Bilingual Corpus: Linking Events
Linkage between the TimeBank (TB) & Italian TimeBank is accomplished through the Inter-Lingual Index (ILI), developed in the EuroWordNet Project (1999).
The ILI is effectively an unstructured version of WN, used as a "hub" through which WN synsets are associated with synsets in the WNs of other languages.
• In IWN the ILI is augmented with several semantic relations, such as eq_synonym, eq_hyperonym, eq_cause..., which give specific information on the relations between English and Italian synsets.
• TimeBank: 1,835 events (1,777 verbs & 658 nominalizations); manual annotation of WN 2.0 senses by 2 native speakers; 91% annotators' agreement (1,686 events).
• Italian TimeBank: 1,253 events (778 verbs & 462 nominalizations and nouns); semi-automatic annotation of IWN senses.

The Bilingual Corpus: Linking Events (2)
[Figure: the "augmented" TimeBank (senses from WN 2.0) and the Italian TimeBank (senses from IWN 1.5) are connected by an ILI link, obtained through the auto-generated mapping from WN 2.0 to IWN 1.5.]
The ILI link is automatically determined and restricted to the eq_synonym and eq_near_synonym relations only, i.e. to events with exactly or approximately the same meaning.
1,103 events in TB with 115 event synsets & 1,250 events in Italian TB with 653 event synsets.
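The linking step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' system: the data layout, the function name link_events and the aumento#1 record are hypothetical, while the seek / cercare#2 pair and its ILI number are the ones quoted in the verb evaluation.

```python
from collections import defaultdict

# Only these ILI relations are used for linking, as stated on the slide.
ALLOWED_RELATIONS = {"eq_synonym", "eq_near_synonym"}


def link_events(tb_events, it_events, iwn_ili):
    """Group English and Italian event tokens under a shared ILI record.

    tb_events: list of (token, ili_id) for TimeBank events with a WN sense.
    it_events: list of (token, iwn_sense) for Italian TimeBank events.
    iwn_ili:   dict iwn_sense -> list of (relation, ili_id) (assumed layout).
    """
    by_ili = defaultdict(lambda: {"en": [], "it": []})
    for token, ili_id in tb_events:
        by_ili[ili_id]["en"].append(token)
    for token, sense in it_events:
        for relation, ili_id in iwn_ili.get(sense, []):
            if relation in ALLOWED_RELATIONS:
                by_ili[ili_id]["it"].append(token)
    # Keep only ILI records that have events in both languages.
    return {ili: sides for ili, sides in by_ili.items() if sides["en"] and sides["it"]}


tb_events = [("seek", "1432563")]
it_events = [("cercare", "cercare#2"), ("aumento", "aumento#1")]
iwn_ili = {
    "cercare#2": [("eq_synonym", "1432563")],
    "aumento#1": [("eq_hyperonym", "9999999")],  # invented record, filtered out
}
links = link_events(tb_events, it_events, iwn_ili)
print(links)
```

Restricting the relation set keeps looser links such as eq_hyperonym from pairing events whose meanings only partly overlap.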

Evaluation: Similar Events
• To what extent is the introduction of WN senses useful for event identification?
• Verify the Semantic Homogeneity Hypothesis: events with (almost) the same meaning are assigned the same TimeML class, i.e. they are semantically homogeneous.
Automatic extraction of all events (nouns and verbs) with the same ILI from both corpora:
• 56 common event synsets: DATA SPARSENESS
• 35 common event synsets for verbs vs. 11 common event synsets for nouns
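Extracting the events that share an ILI amounts to a set intersection. A minimal sketch, assuming each corpus has been reduced to a set of (ILI id, part of speech) pairs; the function name and every id except 1432563 are placeholders.

```python
def common_synsets_by_pos(tb_synsets, it_synsets):
    """Return the shared (ili_id, pos) pairs and a count per part of speech."""
    shared = tb_synsets & it_synsets
    counts = {}
    for _, pos in shared:
        counts[pos] = counts.get(pos, 0) + 1
    return shared, counts


# Toy data: only 1432563 is taken from the slides, the other ids are placeholders.
tb_synsets = {("1432563", "v"), ("ili-a", "v"), ("ili-b", "n")}
it_synsets = {("1432563", "v"), ("ili-b", "n"), ("ili-c", "n")}

shared, counts = common_synsets_by_pos(tb_synsets, it_synsets)
print(sorted(shared), counts)
```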

Evaluation: Similar Events - Verbs
Analysis of common event synsets with a significant number of occurrences in both languages: 25 event synsets, each with at least 5 occurrences.
• for each event token we analyzed its semantic pattern and its TimeML class; the semantic pattern consists of:
  • basic argument structure, e.g. [ARG0] E [ARG1] [ARG2];
  • semantic class and thematic role of each argument, e.g. [ARG0:Person:Agent];
  • subvalency features, e.g. [Person:Agent:Def_Np] E [Event:Theme:Clause];
• 30 different patterns have been identified for the 25 common synsets;
• 93.22% of cases support the Semantic Homogeneity Hypothesis: same meaning, same semantic pattern, same TimeML class;
• instances of event subcategorization (5 cases), i.e. more than one pattern.

Evaluation: Similar Events – Verbs (2)
< 10% of cases seem to question the validity of Semantic Homogeneity.
NOT A COUNTEREXAMPLE:
• ILI = 1432563; WN seek#3 – IWN cercare#2;
• same semantic pattern: [person/organization] E [event];
• TimeBank class: I_ACTION – Italian TB class: I_STATE.
The inconsistency in the data is due to the use of the SIMPLE–TimeML Mapping and Heuristics (Caselli et al. 2007):
- SIMPLE–TimeML Mapping: SIMPLE Semantic_type Modal Event : I_STATE
- cercare#2 = Modal Event : I_STATE
  Purpose Act : I_ACTION
All the other possible counterexamples we found can be explained by factors other than a real difference between event realizations in the 2 languages.
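A check of the Semantic Homogeneity Hypothesis could be sketched as below. The slides do not say how the 93.22% figure was computed, so the per-synset measure used here (the two corpora must assign the same set of TimeML classes) is an assumption, and the function name and the ili-say record are hypothetical. The seek / cercare#2 mismatch is the one reported above.

```python
from collections import defaultdict


def homogeneity(tb_events, it_events):
    """Compare TimeML classes per shared ILI record.

    Each argument is a list of (ili_id, timeml_class).
    Returns (ratio of shared synsets with identical class sets, mismatches).
    """
    tb_classes, it_classes = defaultdict(set), defaultdict(set)
    for ili, cls in tb_events:
        tb_classes[ili].add(cls)
    for ili, cls in it_events:
        it_classes[ili].add(cls)
    shared = sorted(set(tb_classes) & set(it_classes))
    mismatches = {
        ili: (tb_classes[ili], it_classes[ili])
        for ili in shared
        if tb_classes[ili] != it_classes[ili]
    }
    ratio = (len(shared) - len(mismatches)) / len(shared) if shared else 0.0
    return ratio, mismatches


# "ili-say" is a placeholder id; 1432563 is the seek/cercare case from the slide.
tb_events = [("1432563", "I_ACTION"), ("ili-say", "REPORTING"), ("ili-say", "REPORTING")]
it_events = [("1432563", "I_STATE"), ("ili-say", "REPORTING")]

ratio, mismatches = homogeneity(tb_events, it_events)
print(ratio, mismatches)
```

Each mismatch returned this way is only a candidate counterexample: as with cercare#2, it still has to be inspected by hand to see whether it reflects the languages or the annotation procedure.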

Evaluation: Similar Events – Nouns
All 11 common types have been analyzed. They are all nominalizations of a corresponding event verb.
WN senses are useful for identifying incorrect or inconsistent annotations in the source and target corpora, and for more easily identifying those instances which satisfy the criteria for an event in TimeML.
• Incorrect annotations in Italian TB: missing semantic types in SIMPLE; e.g. aumento_n has 3 senses in IWN but 1 semantic type in SIMPLE;
• Incorrect annotations in TB: over-extension of the notion "nominalization = event"; e.g. 8/10 occurrences of payment_n are marked as EVENT when their meaning is "a sum of money".
BUT WN senses are not always sufficient to determine whether a nominal realizes an event or not, because the lexicon contains cases where both the eventive and the non-eventive reading are, somehow, always possible.

Experiments: the ILI as Bootstrapping Device
• Can the ILI and wordnet senses be used as a bootstrapping strategy for the creation of comparable corpora?
• Key idea: if the Semantic Homogeneity Hypothesis holds, one layer of annotation can be imported from a source corpus into a target one.
To test this hypothesis we developed a system which takes as input the TB events augmented with WN senses and gives as output an additional layer of annotation, i.e. it creates the EVENT tag in Italian.
[Figure: inputs are the Italian corpus + IWN sense and TB + WN sense; via the ILI & P.O.S. of the TB EVENT, the output is the Italian corpus + IWN sense + (partial) EVENT annotation.]
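The projection step can be sketched as follows, again as an illustration rather than the authors' implementation. The rule for "probable events" used here (the ILI matches a TB event but the part of speech does not) is an assumption made for the example; the function name, the data layout and all ids except 1432563 are hypothetical.

```python
def project_events(tb_event_keys, italian_tokens, iwn_to_ili):
    """Label Italian tokens using TB events reached through the ILI.

    tb_event_keys:  set of (ili_id, pos) for TimeBank events.
    italian_tokens: list of (token, pos, iwn_sense).
    iwn_to_ili:     dict iwn_sense -> ili_id (eq_synonym / eq_near_synonym only).
    """
    tb_ilis = {ili for ili, _ in tb_event_keys}
    labelled = []
    for token, pos, sense in italian_tokens:
        ili = iwn_to_ili.get(sense)
        if ili is None:
            label = None  # no ILI link: nothing to import
        elif (ili, pos) in tb_event_keys:
            label = "EVENT"  # same ILI and same part of speech
        elif ili in tb_ilis:
            label = "PROBABLE_EVENT"  # assumed rule: ILI matches, POS differs
        else:
            label = None
        labelled.append((token, label))
    return labelled


tb_event_keys = {("1432563", "v"), ("ili-increase", "n")}
italian_tokens = [
    ("cerca", "v", "cercare#2"),
    ("aumentare", "v", "aumentare#1"),
    ("casa", "n", "casa#1"),
]
iwn_to_ili = {"cercare#2": "1432563", "aumentare#1": "ili-increase"}

labelled = project_events(tb_event_keys, italian_tokens, iwn_to_ili)
print(labelled)
```

Keeping a separate label for uncertain matches confines manual validation to that subset.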

Experiments: the ILI as Bootstrapping Device (2)
• To evaluate the reliability of this approach we used the entire corpus of the Italian TreeBank, where a total of 62,522 words (9,832 verbs and 44,957 nouns) are manually assigned a sense from IWN.
• Our system identified 3,700 events (6.7%), 1,183 of which are considered "probable events" that need human post-processing. 58 new event synsets have been retrieved.
- identification of annotation inconsistencies, i.e. over-extension of the notion of event for nominalizations (e.g. movement#4 = social movement);
- sense assignment is not sufficient to disambiguate the eventive/non-eventive reading of nominals, e.g. indication#1 – segnale#1;
- partial matches occur due to the way sense annotation is performed with WN;
- significant reduction of manual effort: only the set of probable events requires validation, and it is restricted to those words whose event reading is not present in WN senses.

Conclusion
• Identification of a new methodology to link comparable corpora in different languages by means of WN senses and the ILI;
• Data from the resulting resource can be used for contrastive analysis of events as well as for multilingual temporal analysis of texts;
• There is semantic homogeneity between similar events in different languages, including semantic preferences for thematic roles and TimeML classes;
• Sense assignment to events improves annotation accuracy, in particular for event identification, and is useful for revealing inconsistencies and errors;
• A modification to TimeML is suggested: the introduction of a tag for ambiguous cases where a double reading (eventive/non-eventive) is always possible;
• The ILI can be used as a semi-automatic bootstrapping device to create resources by importing layers of annotation for words with similar senses.

Thank You!!

Experiments and Evaluation: Similar Events – Nouns (2)
Identification of the senses is not enough to determine whether a nominal realizes an event or not.
• the pair "agreement#1 eq_synonym intesa#3, accordo#3" does not have a clear-cut eventive sense in either wordnet, BUT:
  • in TB 31/32 occurrences are tagged as events: over-extension of the event reading;
  • in Italian TB only 7/16 occurrences are tagged as events: in Italian, intesa#3 and accordo#3 cannot be systematically interpreted as events;
  • no difference between the eventive and non-eventive readings is signalled in the WN & IWN senses!
• This calls for a refinement of annotation schemes for events, to provide explicit means of marking ambiguous cases where the double reading is, somehow, always possible.
