Learning noun phrase coreference resolution
Veronique Hoste, CNTS Language Technology Group, University of Antwerp
Definition (Hirst, 81)
Anaphora is the device of making in discourse an abbreviated reference [ANAPHOR] to some entity [ANTECEDENT or REFERENT] in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity [RESOLUTION].
In other words ... • Reference = the act of using a referring expression to pick out some extra-linguistic entity • Anaphor = refers to something in the text • If both the anaphor and its antecedent refer to the same extra-linguistic entity, they are coreferential • Anaphoric and coreferential relations do not always coincide (e.g. bound anaphora: "Most linguists prefer their own parsers.")
Example Kim Clijsters has won the Proximus Diamond Games in Antwerp. Belgium’s world number two secured her first title on home soil by making short work of defeating Italy’s Silvia Farina Elia. Clijsters broke Farina Elia’s second service game but her opponent broke back immediately and it wasn’t until the eighth game that the Belgian broke again to lead 5-3, after which she served out to take the set. It was Clijsters’s sixth victory over the Italian.
Why? Coreference resolution addresses a weakness in existing information extraction (IE) systems, which must fill template slots such as Who, What, Where, When, and How.
Coreference resolution, a complex problem Anaphora resolution draws on many knowledge sources: morphological and lexical knowledge, syntactic knowledge, semantic knowledge, discourse knowledge, and real-world knowledge.
Which anaphora? • Identity relation, as opposed to a type-token relation (“I prefer the red car, but my husband wanted the grey one.”) or a part-whole relation (“If the gas tank is empty, you should refuel the car.”) • NPs • Personal and possessive pronouns • Definite and indefinite NPs
Two data sets • ENGLISH: MUC-6 and MUC-7 • The only datasets which are publicly available • Extensively used for evaluation • Articles from WSJ and NYT • DUTCH: KNACK-2002 • First Dutch coreferentially annotated corpus • Articles from KNACK 2002 on different topics: national and international politics, science, culture, …
MUC-6 and MUC-7 • Message Understanding Conference • Identity relation between NPs • MUC-6: 2141 coreferential NPs in train set and 2091 in test set • MUC-7: 2569 coreferential NPs in train set and 1728 in test set • E.g. Ng (02): 35,895 train inst. (4.4% pos.) and 22,699 test inst. (3.9% pos.) for MUC-7
KNACK-2002 • Annotation: adapted version of the MUC guidelines • Identity, bound, ISA, and modality relations between NPs • Ca. 13,266 coreferential NPs • E.g.
“Ongeveer een maand geleden stuurde <COREF ID="1">American Airlines</COREF> <COREF ID="2" MIN="toplui">enkele toplui</COREF> naar Brussel. <COREF ID="3" TYPE="IDENT" REF="1">De grote vliegtuigmaatschappij</COREF> had interesse voor DAT en wou daarover <COREF ID="4">de eerste minister</COREF> spreken. Maar <COREF ID="5" TYPE="IDENT" REF="4">Guy Verhofstadt</COREF> weigerde <COREF ID="6" TYPE="BOUND" REF="2">de delegatie</COREF> te ontvangen.”
(Translation: “About a month ago, American Airlines sent some top executives to Brussels. The large airline was interested in DAT and wanted to speak with the prime minister about it. But Guy Verhofstadt refused to receive the delegation.”)
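A minimal sketch, assuming Python and straight-quoted attributes, of reading such MUC-style markup into coreference records (the helper names are mine, and nested tags are not handled):

```python
import re

# Matches an opening COREF tag, its attributes, and the annotated span.
# Assumes tags are not nested; nested annotations (which MUC-style markup
# allows) would need a real SGML parser instead of a regex.
COREF_RE = re.compile(r'<COREF\s+([^>]*)>(.*?)</COREF>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def extract_corefs(markup):
    """Yield one record per annotated NP: ID, relation TYPE, REF, text."""
    for attrs_str, text in COREF_RE.findall(markup):
        attrs = dict(ATTR_RE.findall(attrs_str))
        yield {
            'id': attrs.get('ID'),
            'type': attrs.get('TYPE'),   # e.g. IDENT or BOUND
            'ref': attrs.get('REF'),     # ID of the antecedent NP, if any
            'text': text,
        }
```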
Getting started Preprocessing pipeline: free text → tokenization → POS tagging → NP chunking → NER → relation finding
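The same pipeline as a runnable sketch; each stage is a trivial stand-in for whatever real tokenizer, tagger, chunker, and NE recognizer the system uses:

```python
from typing import List, Tuple

def tokenize(text: str) -> List[str]:
    return text.split()                      # stand-in for a real tokenizer

def pos_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    return [(t, 'NN') for t in tokens]       # stand-in for a trained tagger

def np_chunk(tagged):
    return [[tw] for tw in tagged]           # stand-in for an NP chunker

def ner(chunks):
    return [(c, 'O') for c in chunks]        # stand-in for an NE recognizer

def preprocess(free_text: str):
    """Chain the preprocessing steps that precede relation finding."""
    return ner(np_chunk(pos_tag(tokenize(free_text))))
```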
Identification of the anaphors • Identification of pleonastic pronouns, e.g. “Hoe komt het dan dat hij zoveel invloed heeft in het Witte Huis” (“How come, then, that he has so much influence in the White House”) • Identification of pronouns referring to clauses, etc. • Identification of non-coreferential NPs, e.g. “Dat onvoorspelbare staten als schurkenstaten moeten worden behandeld: het zit al jaren in het gedachtegoed van Paul Wolfowitz ingebakken.” (“That unpredictable states should be treated as rogue states: that has been ingrained in Paul Wolfowitz’s thinking for years.”)
Identification of the candidate antecedents • Determine the search scope • Anaphora/cataphora • N preceding (or following) sentences, depending on the type of the anaphor • 2 or 3 sentences for pronouns • a larger scope for other NPs (with proper nouns, common nouns)
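A sketch of this scope restriction, with illustrative window sizes (3 sentences for pronouns, a larger assumed window for other NPs):

```python
def candidate_antecedents(nps, anaphor_index):
    """Return preceding NPs within a type-dependent sentence window.

    `nps` is a list of dicts with at least 'sentence' (sentence index) and
    'is_pronoun'. The window sizes follow the slide's idea but the exact
    numbers here are illustrative, not the thesis' settings.
    """
    anaphor = nps[anaphor_index]
    window = 3 if anaphor['is_pronoun'] else 20   # assumed scope sizes
    return [np for np in nps[:anaphor_index]
            if anaphor['sentence'] - np['sentence'] <= window]
```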
Approaches • The past: mostly knowledge-based techniques (constraints and preferences) e.g. Lappin & Leass (1994), Baldwin (CogNIAC, 1996) • Recently: machine learning (C4.5) Redefine coreference resolution as a CLASSIFICATION task.
A classification based approach • Given two entities in a text, NP1 and NP2, classify the pair as coreferent or not coreferent. • E.g. • [Clijsters] broke [[Farina Elia]’s second service game] but [[her] opponent] broke back immediately. [her opponent] - [Farina Elia’s second service game] not coreferential [her opponent] - [Farina Elia] coreferential [her opponent] - [Clijsters] not coreferential
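The pair-generation step behind this reformulation, as a small runnable sketch reproducing the bracketed example:

```python
def make_pairs(nps):
    """Pair every NP with each preceding NP, nearest first; the classifier
    then labels each pair coreferent / not coreferent."""
    for j in range(1, len(nps)):              # j = candidate anaphor
        for i in range(j - 1, -1, -1):        # i = candidate antecedent
            yield nps[i], nps[j]

# Toy run on the NPs of the example sentence, in order of appearance:
nps = ["Clijsters", "Farina Elia", "Farina Elia's second service game",
       "her opponent"]
for antecedent, anaphor in make_pairs(nps):
    if anaphor == "her opponent":
        print(anaphor, "-", antecedent)
# her opponent - Farina Elia's second service game
# her opponent - Farina Elia
# her opponent - Clijsters
```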
Selected features (41) • Positional features (e.g. dist_sent, dist_NP) • Local context features • Morphological and lexical features (e.g. i/j/ij-pron, j_demon, j_def, i/j/ij-proper, num_agree) • Syntactic features (e.g. i/j/ij_SBJ, appos) • String-matching features (comp_match, part_match, alias, same_head) • Semantic features (syn, hyper, same_NE, 4 features indicating semantic class)
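Rough approximations of three of the string-matching features; the thesis' exact definitions (determiner stripping, alias heuristics, true head finding) will differ in detail:

```python
def string_match_features(np_i: str, np_j: str) -> dict:
    """Crude versions of comp_match, part_match and same_head."""
    toks_i, toks_j = np_i.lower().split(), np_j.lower().split()
    return {
        'comp_match': toks_i == toks_j,                 # complete string match
        'part_match': bool(set(toks_i) & set(toks_j)),  # share any token
        'same_head': toks_i[-1] == toks_j[-1],          # same head noun
                                                        # (head ~ last token)
    }

print(string_match_features("Kim Clijsters", "Clijsters"))
# {'comp_match': False, 'part_match': True, 'same_head': True}
```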
Positive, negative and test instances • Positive: combination of the anaphor with each preceding element in its coreference chain • Negative: combination of the anaphor with each preceding NP which is not part of the coreference chain • Test: every NP from the second NP in the document onward is considered a possible anaphor and is linked to all preceding NPs as possible antecedents
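One reading of this instance-construction scheme as a sketch (the chain representation and function name are mine):

```python
def training_instances(nps, chains):
    """Build labelled pairs from gold coreference chains.

    `chains` maps an NP index to its chain id; NPs absent from `chains`
    are not anaphoric in the gold data and generate no training pairs.
    Positive: anaphor paired with every preceding chain member.
    Negative: anaphor paired with every preceding NP outside its chain.
    """
    for j in range(1, len(nps)):
        if j not in chains:
            continue
        for i in range(j):
            label = 'POS' if chains.get(i) == chains[j] else 'NEG'
            yield i, j, label
```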
Two step procedure • First step: validation • Application of TiMBL and Ripper on the train set; 10-fold cross-validation • Evaluation: accuracy, precision, recall, F-beta • Second step: testing • Training of TiMBL and Ripper on the train set; testing on the test set • Selection of one positive instance in case of multiple positives (e.g. through application of ordered Ripper rules, or clustering)
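One simple way to pick a single antecedent when several candidates are classified positive is closest-first selection, an alternative to the ordered-rules or clustering strategies the slide mentions:

```python
def select_antecedent(candidates, classify):
    """Pick one antecedent among positively classified candidates.

    `candidates` lists preceding NPs in document order; `classify` returns
    'POS' or 'NEG' for a candidate pair. Closest-first: take the nearest
    preceding candidate the classifier accepts.
    """
    for cand in reversed(candidates):       # nearest preceding NP first
        if classify(cand) == 'POS':
            return cand
    return None                             # anaphor left unresolved
```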
Algorithms compared • Ripper (Cohen, 95) • Rule induction • Algorithm parameters: different class ordering principles; negative conditions or not; loss ratio values; cover parameter values • TiMBL • Memory-based learning • Algorithm parameters: ib1, igtree; overlap, mvdm; 5 feature weighting methods; 4 distance weighting methods; 10 values of k
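Even the parameter space alone is sizeable; a quick back-of-the-envelope count for TiMBL, with the factors taken from the slide:

```python
# 2 algorithms (ib1, igtree) x 2 metrics (overlap, mvdm)
# x 5 feature weighting methods x 4 distance weighting methods
# x 10 values of k
timbl_settings = 2 * 2 * 5 * 4 * 10
print(timbl_settings)   # 800 settings, before feature selection is added
```

Combined with the 41 features to select over, exhaustive search quickly becomes infeasible, which motivates the optimization techniques on the following slides.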
Conclusions from baseline experiments • The concatenation of the NP-type classifiers is beneficial for Ripper, not for TiMBL. • Low precision scores for TiMBL (large number of false positives). The scores are up to 30% lower than those for Ripper. Reason: feature weighting? • Higher recall for TiMBL: it distinguishes better between true and false negatives.
Optimization Confirmed hypothesis from previous research: the observed difference in accuracy between two algorithms can easily be overwhelmed by accuracy differences resulting from interactions between algorithm parameter settings and feature selection.
Optimization • Feature selection • backward elimination: start with all features and remove the features which do not contribute to prediction • bidirectional hillclimbing: start with the features with the highest gain ratio and perform both backward and forward selection • genetic algorithm: start with a random feature set • Parameter optimization • Joint optimization by a genetic algorithm
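A sketch of the first of these, backward elimination, assuming a `score(subset)` function that returns a cross-validated figure of merit (e.g. F-beta) for training with a given feature subset:

```python
def backward_elimination(features, score):
    """Greedy backward feature elimination.

    Drop a feature whenever removing it does not hurt the score; repeat
    until no single removal helps. `score` is recomputed for clarity; a
    real run would cache results, since each call means a full CV run.
    """
    selected = list(features)
    improved = True
    while improved:
        improved = False
        for f in list(selected):
            trial = [g for g in selected if g != f]
            if trial and score(trial) >= score(selected):
                selected = trial
                improved = True
    return selected
```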
Parameter optimization results [chart comparing optimized TiMBL and Ripper results]
Genetic algorithms Cycle: initial population of candidate solutions → evaluation based on fitness → selection → generation of a new population using crossover and mutation → repeat until a stopping criterion is met, yielding the best individual.
GA individuals An individual encodes both parameters and features: feature weighting (values 0, 1, 2, 3, 4), neighbour weighting (values 0, 1, 2, 3), k, plus one gene per feature (values 0, 1, 2). Example gene string from the slide: 0 1 0 1 2 0 2 1 0 2 0 0 2 1 0 2 2 0 3 2 2.0288721872
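A sketch of what such an individual could look like in code; the gene ranges follow the slide, but the meaning of the feature values 0/1/2 and the range of k are assumptions here:

```python
import random

N_FEATURES = 41   # number of features from the earlier slide

def random_individual():
    """One GA individual: parameter genes plus one gene per feature.

    What the feature values 0/1/2 mean exactly (e.g. ignore / select /
    weight differently) and the range of k are assumptions.
    """
    return {
        'feature_weighting': random.randint(0, 4),
        'neighbour_weighting': random.randint(0, 3),
        'k': random.randint(1, 10),
        'features': [random.randint(0, 2) for _ in range(N_FEATURES)],
    }

def mutate(ind, rate=0.05):
    """Point mutation on the feature genes; parameter genes could mutate
    in the same way."""
    return dict(ind, features=[(random.randint(0, 2)
                                if random.random() < rate else g)
                               for g in ind['features']])
```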
Optimization: summary • Is it worth the effort? Yes: • optimization can lead to much larger classifier-internal variation than classifier-comparing variation • it can lead to significant performance increases • it leads to more reliable results • GAs are a feasible approach to searching the space
Anaphora resolution and the problem of skewed class distributions
Problem • In an unbalanced data set, the majority class is represented by a large portion of all the instances, whereas the other class, the minority class, accounts for only a small part of the instances • Many real-world data sets are highly unbalanced
ML and skewed data sets • Imbalanced data sets may result in poor performance of standard classification algorithms (e.g. decision tree learners, kNN, SVMs) • => the algorithms often generate classifiers that maximize the overall classification accuracy while completely ignoring the minority class • => or this may lead to a classifier with many small disjuncts, which tends to overfit the data
Strategies for dealing with skewed data sets • Sampling • undersampling • oversampling • Adjusting misclassification costs (high cost to misclassification of the minority class) • Weighting of examples (focus on the minority class)
Sampling • Undersampling: examples from the majority class are removed; problem: possibly useful information is thrown away • Oversampling: examples from the minority class are duplicated; problem: no increase of information, overfitting • General observation in the ML literature: • undersampling leads to better performance • oversampling does not help
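A sketch of random undersampling at the instance level; the `ratio` knob corresponds to the downsampling levels explored later, and the POS/NEG labels follow the earlier instance-creation slides:

```python
import random

def undersample(instances, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class (negative) instances until roughly
    `ratio` negatives remain per positive. Illustrative only; the
    experiments explore several downsampling levels, not one fixed ratio."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 'POS']
    neg = [i for i, y in enumerate(labels) if y == 'NEG']
    keep = set(pos) | set(rng.sample(neg,
                                     min(len(neg), int(ratio * len(pos)))))
    idx = sorted(keep)
    return [instances[i] for i in idx], [labels[i] for i in idx]
```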
Changing the loss ratio in Ripper • Loss ratio parameter: allows one to specify the relative cost of false positives and false negatives • Focus on recall: loss ratio < 1 • Focus on precision: loss ratio > 1
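Ripper applies the loss ratio inside rule induction, but the effect can be illustrated with a cost-sensitive decision threshold. Here loss_ratio is taken as cost(false positive) / cost(false negative), the orientation that matches the slide (values below 1 favour recall); this framing is mine, not Ripper's internals:

```python
def decide(p_pos: float, loss_ratio: float) -> str:
    """Cost-sensitive decision rule illustrating the loss ratio's effect.

    Predict POS when the expected cost of a false negative outweighs that
    of a false positive, i.e. when p_pos >= loss_ratio / (1 + loss_ratio).
    loss_ratio < 1 lowers the threshold (recall focus); loss_ratio > 1
    raises it (precision focus).
    """
    threshold = loss_ratio / (1.0 + loss_ratio)
    return 'POS' if p_pos >= threshold else 'NEG'

print(decide(0.4, 0.5))  # POS: threshold 0.33, recall-oriented
print(decide(0.4, 2.0))  # NEG: threshold 0.67, precision-oriented
```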
Skewedness: summary • Comparison of the sensitivity of TiMBL and Ripper to the skewed data set (ML past: C4.5) • Both learners: a large number of false negatives • Ripper has a much poorer performance on the minority class (forgetting exceptions?) • Ripper is also more sensitive to rebalancing • No particular downsampling level or loss ratio value leads to overall best performance => yet another optimization step ...