
Memory-based learning for noun phrase coreference resolution



Presentation Transcript


  1. Memory-based learning for noun phrase coreference resolution Veronique Hoste

  2. Outline • Noun phrase coreference resolution • Definition • Why? • Problems • A memory-based learning approach

  3. Definition (Hirst, 81) Anaphora is the device of making in discourse an abbreviated reference to some entity in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity.

  4. Definition (Hirst, 81) ANAPHOR Anaphora is the device of making in discourse an abbreviated reference to some entity in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity.

  5. Definition (Hirst, 81) ANTECEDENT or REFERENT Anaphora is the device of making in discourse an abbreviated reference to some entity in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity. ANAPHOR

  6. Definition (Hirst, 81) ANTECEDENT or REFERENT ANAPHOR Anaphora is the device of making in discourse an abbreviated reference to some entity in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity. RESOLUTION

  7. Example Kim Clijsters has won the Proximus Diamond Games in Antwerp. Belgium’s world number two secured her first title on home soil by making short work of defeating Italy’s Silvia Farina Elia. Clijsters broke Farina Elia’s second service game but her opponent broke back immediately and it wasn’t until the eighth game that the Belgian broke again to lead 5-3, from which she served out to take the set. It was Clijsters’s sixth victory over the Italian.


  12. Why? • Weakness in existing information extraction (IE) systems • IE template: Who: ….. What: ….. Where: ….. When: ….. How: …..

  13. Coreference resolution, a complex problem • Anaphora resolution draws on many knowledge sources: • morphological and lexical knowledge • syntactic knowledge • semantic knowledge • discourse knowledge • real-world knowledge

  14. Approaches • The past: mostly knowledge-based techniques (constraints and preferences) e.g. Lappin & Leass (1994), Baldwin (CogNIAC, 1996) • Recently: machine learning (C4.5) Redefine coreference resolution as a CLASSIFICATION task.

  15. A classification based approach • Given two entities in a text, NP1 and NP2, classify the pair as coreferent or not coreferent. • E.g. • [Clijsters] broke [[Farina Elia]’s second service game] but [[her] opponent] broke back immediately. [her opponent] - [Farina Elia’s second service game] coref? - [Farina Elia] coref? - [Clijsters] coref?

  16. Getting started • Pipeline: free text → tokenization → POS tagging → NP chunking → NER → nested NP extraction

  17. Learner ingredients • Starting point: corpora annotated with coreferential chains • “About one month ago <COREF ID=“1”>American Airlines</COREF> sent <COREF ID=“2”> a delegation</COREF> to Brussels. <COREF ID=“3” TYPE=“IDENT” REF=“1”> The large air plane company </COREF> was interested in DAT and wished to discuss this interest with <COREF ID=“4”>the prime minister</COREF>. But <COREF ID=“5” TYPE=“IDENT” REF=“4”>Guy Verhofstadt</COREF> refused to see <COREF ID=“6” REF=“2”>the delegation</COREF>.”

  18. Two data sets • ENGLISH: MUC-6 (2141/2091 corefs) and MUC-7 (2569/1728 corefs) • The only datasets which are publicly available • Extensively used for evaluation • Articles from WSJ and NYT • DUTCH: KNACK-2002 • First Dutch coreferentially annotated corpus • Articles from KNACK 2002 on different topics: politics, science, culture, …

  19. Learner ingredients (ctd) • Training data to train and validate the machine learner • Procedure: n-fold cross-validation • partition the training data into n parts • repeat n times: take each part in turn as test set and train on the remaining parts • Hold-out test data to test the resulting learner
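The n-fold cross-validation procedure above can be sketched as follows. This is a minimal illustration; `train_and_eval` is a hypothetical callback that trains a learner on one partition and returns a score on the other.

```python
import random

def n_fold_cross_validation(instances, n, train_and_eval, seed=0):
    """Partition the data into n parts; each part serves once as the test set
    while the learner is trained on the remaining parts."""
    instances = list(instances)
    random.Random(seed).shuffle(instances)
    folds = [instances[i::n] for i in range(n)]   # n roughly equal parts
    scores = []
    for i in range(n):
        test = folds[i]
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        scores.append(train_and_eval(train, test))
    return sum(scores) / n                        # average validation score
```

With 10 instances and n = 5, each fold holds 2 instances for testing and 8 for training.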

  20. Learner ingredients (ctd) • Creating instances • One instance for each pair of NPs • At the end of the instance: the class value (the NPs are coreferential or not coreferential). E.g. [Clijsters] broke [[Farina Elia]’s second service game] but [[her] opponent] broke back immediately. [her opponent] - [Farina Elia’s second service game] not coreferential [her opponent] - [Farina Elia] coreferential [her opponent] - [Clijsters] not coreferential
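This instance creation can be sketched as follows, assuming the NPs are given in textual order and `chain` (a hypothetical name) maps each coreferring NP to the ID of its coreference chain:

```python
def make_instances(nps, chain):
    """Pair every NP (potential anaphor) with each of its preceding NPs
    (candidate antecedents), working from right to left, and label the pair
    coreferential iff both NPs belong to the same coreference chain."""
    instances = []
    for j in range(len(nps) - 1, 0, -1):        # rightmost NP first
        for i in range(j - 1, -1, -1):          # all preceding NPs
            same = chain.get(nps[i]) is not None and chain.get(nps[i]) == chain.get(nps[j])
            instances.append((nps[i], nps[j], "coreferential" if same else "not coreferential"))
    return instances
```

On the example above, `make_instances(["Clijsters", "Farina Elia's second service game", "Farina Elia", "her opponent"], {"Farina Elia": 1, "her opponent": 1})` reproduces the three labelled pairs for [her opponent], plus the pairs among the earlier NPs.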

  21. Learner ingredients (ctd) • Instance: describes the characteristics of two NPs and their context • Features per instance: • local context: words + POS • string matching features (complete match, partial match) E.g. president Bush, George W. Bush • grammatical: - pronoun, demonstrative, definite, proper noun

  22. Features (ctd) • grammatical (ctd): • number, gender • appositive • subject/object • semantic: • synonym, hypernym • alias • same named entity? • Distance in number of sentences and NPs

  23. Task Build a small instance base for the following sentences. • work from right to left • link every NP (the potential anaphor) to all its preceding NPs (the candidate antecedents) • build for each pair a vector with the following features • feature 1+2: gender • feature 3+4: number • feature 5: exact match (binary) • feature 6: partial match (binary) • feature 7+8: pronoun/demonstrative/definite/proper • feature 9: synonyms/hypernyms (binary)

  24. “About one month ago <COREF ID=“1”>American Airlines</COREF> sent <COREF ID=“2”> a delegation</COREF> to Brussels. <COREF ID=“3” TYPE=“IDENT” REF=“1”> The large air plane company </COREF> was interested in DAT and wished to discuss this interest with <COREF ID=“4”> prime minister Verhofstadt </COREF>. But <COREF ID=“5” TYPE=“IDENT” REF=“4”>Guy Verhofstadt</COREF> refused to see <COREF ID=“6” REF=“2”>the delegation</COREF>.”

  25. Resulting instance base • NP pairs: • the delegation - prime minister Verhofstadt • the delegation - this interest • the delegation - DAT • the delegation - the large airplane company • the delegation - Brussels • the delegation - a delegation • Guy Verhofstadt - prime minister Verhofstadt • (…) • Feature vectors: • Neutral, person, singular, singular, no, no, definite, proper, no → no coref • Neutral, person, singular, singular, yes, yes, definite, indefinite, yes → coref • Male, person, singular, singular, no, yes, proper, proper, yes → coref • (…) • Feed these instances to the learning algorithm
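As an illustration of the matching features in these vectors, here is a sketch of features 5 and 6 (exact and partial string match). The determiner stripping is an assumption about how "a delegation" matches "the delegation"; the gender, number and synonymy features would need lexical resources such as WordNet.

```python
def match_features(np1, np2):
    """Exact and partial string match between two NPs, ignoring determiners,
    so that 'a delegation' and 'the delegation' count as an exact match."""
    def content_words(np):
        determiners = {"the", "a", "an", "this", "that"}
        return [w for w in np.lower().split() if w not in determiners]
    w1, w2 = content_words(np1), content_words(np2)
    return {
        "exact_match": w1 == w2,                  # feature 5
        "partial_match": bool(set(w1) & set(w2))  # feature 6: any shared word
    }
```

For instance, "prime minister Verhofstadt" and "Guy Verhofstadt" share a word, so they match partially but not exactly.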

  26. Learning • TRAINING: • Input : set of training instances • Output: a coreference classifier • TESTING: • Input : new unseen instances • Output: classification

  27. Memory-based learning • Background: performance in real-world tasks is based on remembering past events rather than creating rules or generalizations • Lazy (vs. eager) : MBL keeps all training data in memory and only abstracts at classification time by extrapolating a class from the most similar items in memory to the new test item

  28. MBL components • memory-based learning component: During learning, the learning component adds new training instances to the memory without any abstraction or restructuring • similarity-based performance component: The classification of the most similar instance in memory is taken as classification for the new test instance

  29. In other words ... • Given (x1, y1), (x2, y2), (x3, y3), …. (xn, yn) • Task at classification time is to find the closest xi for a new data point xq

  30. Crucial components • A distance metric • The number of nearest neighbours to look at • A strategy of how to extrapolate from the nearest neighbours


  32. Distance metrics When presenting a new instance for classification to the MBL learner, the learner looks in its memory in order to find all instances whose attributes are similar to the newly presented test instance.

  33. Distance metrics • How far are xi and xq? • Most basic metric: Overlap Metric • Δ(xq, xi) = Σi=1..n δ(xqi, xii) • where δ(xqi, xii) = 0 if xqi = xii and δ(xqi, xii) = 1 if xqi ≠ xii
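The overlap metric translates directly into code: the distance between two instances is simply the number of features on which they disagree.

```python
def overlap_distance(xq, xi):
    """Overlap metric: count the features on which the two instances differ.
    Matching features contribute 0 to the distance, mismatches contribute 1."""
    assert len(xq) == len(xi), "instances must have the same number of features"
    return sum(1 for a, b in zip(xq, xi) if a != b)
```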

  34. Feature weighting • Problem: some features will be more informative for the prediction of the class label than others • Solution: feature selection or feature weighting • information gain weighting • gain ratio weighting • chi-squared weighting

  35. Information gain weighting • Expresses the average entropy reduction from a feature when its value is known • H(C) = -Σc∈C P(c) log2 P(c) • wi = H(C) - Σv∈Vi P(v) × H(C|v) • Problem: features with many possible values are favoured above features with fewer possible values

  36. Gain ratio weighting • Normalized version of information gain • = information gain divided by the entropy of the feature values • wi = (H(C) - Σv∈Vi P(v) × H(C|v)) / si(i) • si(i) = -Σv∈Vi P(v) log2 P(v)

  37. Chi-squared weighting • Given: contingency table consisting of all classes and feature values • Chi square: measures the difference between the expected values and the observed values in each of the cells of the table • χ² = Σij (Eij - Oij)² / Eij • Eij = (ni. × n.j) / n..
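The three weighting schemes can be sketched together. Here `values` holds one feature's value per training instance and `labels` the corresponding class labels (hypothetical names); each function returns that feature's weight.

```python
import math
from collections import Counter

def entropy(outcomes):
    """H = -sum p log2 p over the distribution of the outcomes."""
    n = len(outcomes)
    return -sum((c / n) * math.log2(c / n) for c in Counter(outcomes).values())

def information_gain(values, labels):
    """H(C) minus the expected entropy of the class after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalised by si(i), the entropy of the feature's values."""
    si = entropy(values)
    return information_gain(values, labels) / si if si > 0 else 0.0

def chi_squared(values, labels):
    """Sum of (Eij - Oij)^2 / Eij over the feature-value x class contingency table."""
    n = len(labels)
    val_counts, lab_counts = Counter(values), Counter(labels)
    observed = Counter(zip(values, labels))
    total = 0.0
    for v in val_counts:
        for c in lab_counts:
            expected = val_counts[v] * lab_counts[c] / n  # Eij = ni. x n.j / n..
            total += (expected - observed[(v, c)]) ** 2 / expected
    return total
```

A feature whose value perfectly predicts the class gets the maximal weight under all three schemes; a feature independent of the class gets weight 0.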

  38. Crucial components • A distance metric • The number of nearest neighbours to look at • A strategy of how to extrapolate from the nearest neighbours

  39. k • Nearest neighbours: the instances in memory which are near to the test item to be classified • The classification of these nearest neighbours is used as the classification for the new test instance • Expressed by k • k = 1 : the instances with the nearest distance to the test instance are used for classification
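A minimal sketch of nearest-neighbour classification over the instance memory, using the overlap metric and simple majority voting among the k nearest instances (here ties between equally distant instances are broken by sort order, which is a simplification of the slide's "all instances at the nearest distance"):

```python
from collections import Counter

def classify(memory, query, k=1):
    """memory: list of (feature_vector, label) training instances.
    Returns the majority label among the k instances nearest to query."""
    def overlap(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)
    nearest = sorted(memory, key=lambda inst: overlap(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that no model is built at training time; "training" is just storing the instances, and all the work happens at classification time.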

  40. Crucial components • A distance metric • The number of nearest neighbours to look at • A strategy of how to extrapolate from the nearest neighbours

  41. Extrapolation from the nearest neighbours • Goal: decide which will be the class of a new test item • Approaches: • Majority voting: all nearest neighbours receive equal weight • Distance weighted voting: link the choice of classification to the distance between the nearest neighbours and the test item
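The two extrapolation strategies can be contrasted in a few lines. Inverse-distance weighting is one common choice for distance-weighted voting (an assumption here; other decay functions exist):

```python
from collections import defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label); every neighbour weighs 1."""
    counts = defaultdict(int)
    for _, label in neighbors:
        counts[label] += 1
    return max(counts, key=counts.get)

def distance_weighted_vote(neighbors):
    """Closer neighbours get a larger vote: weight 1 / (distance + 1)."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1.0 / (dist + 1.0)
    return max(weights, key=weights.get)
```

The two strategies can disagree: one neighbour at distance 0 can outvote two neighbours at distance 3 under distance weighting, while majority voting sides with the two.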

  42. Potential problems

  43. Noise How will MBL handle many uninformative features?

  44. Skewedness E.g. 10% coreferential instances and 90% noncoreferential instances Does MBL suffer from skewed class distributions?
