Literature Survey: Graph-based Clustering and its Application in Coreference Resolution

Zheng Chen

Computer Science Department

The Graduate Center , The City University of New York

November 24, 2009

Motivations and Goals
  • Motivations
    • Graph-based clustering has attracted researchers from various fields
    • Theoreticians are busy studying quality measures and algorithms
    • Practitioners are busy adapting the algorithms to their own applications
    • Some algorithms are well known and popular in one field but new to another
  • Goals
    • Provide an overview of graph-based clustering methodology
    • Apply the methodology to coreference resolution

Literature Survey

Outline (Part I: Theory)

Graph-based Clustering Methodology (a five-part story)

Outline (Part II: Application)

Coreference Resolution: A Case Study of Applying Graph-based Clustering

  • Entity Coreference Resolution
    • A two-step procedure: classification and clustering
    • Graph-based (Nicolae and Nicolae, 2006)
  • Event Coreference Resolution
    • Agglomerative clustering algorithm (Chen et al., 2009a)
    • Graph-based (Chen and Ji, 2009b)

Conclusions

Graph Notation

Hypothesis

The hypothesis can be stated in different ways:

  • A graph can be partitioned into densely connected subgraphs that are sparsely connected to each other
  • A random walk that visits a dense subgraph will likely stay in it until many of its nodes have been visited
  • Considering all shortest paths between all pairs of nodes, edges between dense subgraphs are likely to lie on many shortest paths
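
The random-walk formulation can be checked with a small calculation. The sketch below is only an illustration: the six-node graph, its wiring and the function name escape_probability are assumptions, not taken from the slides. It computes the probability that a walk which is currently inside a cluster (under the stationary distribution) leaves it in one step; for an undirected graph this is the weight of the edges leaving the cluster divided by the sum of the degrees inside it.

```python
# Hypothetical weighted, undirected graph: two triangles joined by two weak edges.
EDGES = {
    (1, 2): 0.8, (1, 3): 0.8, (2, 3): 0.6,   # dense cluster {1, 2, 3}
    (4, 5): 0.8, (4, 6): 0.7, (5, 6): 0.8,   # dense cluster {4, 5, 6}
    (2, 4): 0.1, (3, 5): 0.2,                # sparse links between the clusters
}


def escape_probability(edges, cluster):
    """One-step probability that a random walk leaves `cluster`,
    given that it is inside `cluster` under the stationary distribution."""
    leaving = 0.0
    degree_sum = 0.0
    for (u, v), w in edges.items():
        inside = (u in cluster) + (v in cluster)
        degree_sum += inside * w   # every endpoint inside the cluster adds w to its degree
        if inside == 1:            # exactly one endpoint inside: the edge leaves the cluster
            leaving += w
    return leaving / degree_sum


p_dense = escape_probability(EDGES, {1, 2, 3})   # a densely connected subgraph
p_mixed = escape_probability(EDGES, {1, 2, 4})   # an arbitrary group of nodes
print(p_dense, p_mixed)
```

Under these assumed weights the walk is far less likely to leave the dense triangle than the arbitrary group, which is the behaviour the hypothesis describes.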

[Figure: illustration labeled Manhattan and Queens]

Modeling
  • Determine what the vertices and edges represent
  • Compute the edge weights
  • Construct the graph
  • Which graph should be chosen, and how should its parameters be chosen? (There is no theoretical justification.)
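
As a minimal sketch of the graph-construction step, the code below builds two commonly used graphs from a pairwise similarity matrix: an epsilon-neighborhood graph and a k-nearest-neighbor graph. The slide does not say which constructions it had in mind, so the choice of these two, the function names and the toy similarity matrix are all assumptions.

```python
def epsilon_graph(sim, eps):
    """Keep an undirected edge (i, j) whenever the similarity is at least eps."""
    n = len(sim)
    return {(i, j): sim[i][j]
            for i in range(n) for j in range(i + 1, n)
            if sim[i][j] >= eps}


def knn_graph(sim, k):
    """Connect every node to its k most similar nodes (edges are undirected)."""
    n = len(sim)
    edges = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        for j in nearest:
            edges[(min(i, j), max(i, j))] = sim[i][j]
    return edges


# Hypothetical symmetric similarity matrix for four objects.
SIM = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

eps_edges = epsilon_graph(SIM, eps=0.5)
knn_edges = knn_graph(SIM, k=1)
print(eps_edges, knn_edges)
```

The parameters eps and k are exactly the kind of choice the last bullet refers to: different values give different graphs, and hence possibly different clusterings.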

Measure


Measure: Cheat Sheet

Formulas:

Objective functions:


Measure: Computation Examples

[Figure: weighted graph with six nodes (1 to 6) split into two clusters, C1 and C2]

T = total weight of the edges in the graph

intra_density(C1) = (0.8 + 0.8 + 0.6) / T

inter_density(C1, C2) = (0.1 + 0.2) / T

cut(C1, C2) = 0.1 + 0.2

ratiocut(C1) = cut(C1, C2) / 3

vol(C1) = 0.8 + 0.8 + 0.6 + 0.1 + 0.2

ncut(C1) = cut(C1, C2) / vol(C1)

expansion(C1) = min{1.6/1, 1.4/1, 1.4/1} = 1.4

conductance(C1) = min{1.6/1.6, 1.4/1.5, 1.4/1.6} = 1.4/1.6

modularity(C1) = 3/8 - (4/8)^2, where 3/8 is the fraction of edges inside cluster C1 and (4/8)^2 is the expected fraction of edges in C1 if edges were located at random in the graph
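
The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the wiring of the graph (which node is joined to which) is hypothetical and was chosen only so that the weights match the numbers in the example, and all function names are illustrative. The volume follows the example's own arithmetic (weight inside C1 plus weight of the cut); other texts define vol as the sum of node degrees.

```python
# Hypothetical wiring that reproduces the weights used in the example.
EDGES = {
    (1, 2): 0.8, (1, 3): 0.8, (2, 3): 0.6,   # inside C1
    (4, 5): 0.8, (4, 6): 0.7, (5, 6): 0.8,   # inside C2
    (2, 4): 0.1, (3, 5): 0.2,                # between C1 and C2
}
C1, C2 = {1, 2, 3}, {4, 5, 6}


def cut(edges, a, b):
    """Total weight of the edges with one endpoint in a and the other in b."""
    return sum(w for (u, v), w in edges.items()
               if (u in a and v in b) or (u in b and v in a))


def intra_weight(edges, c):
    """Total weight of the edges with both endpoints in c."""
    return sum(w for (u, v), w in edges.items() if u in c and v in c)


def modularity_term(edges, c):
    """Unweighted modularity contribution of cluster c."""
    m = len(edges)
    inside = sum(1 for (u, v) in edges if u in c and v in c)
    degree = sum((u in c) + (v in c) for (u, v) in edges)
    return inside / m - (degree / (2 * m)) ** 2


T = sum(EDGES.values())
cut_value = cut(EDGES, C1, C2)
intra_density = intra_weight(EDGES, C1) / T
inter_density = cut_value / T
ratiocut = cut_value / len(C1)
vol = intra_weight(EDGES, C1) + cut_value   # the example's convention for vol(C1)
ncut = cut_value / vol
modularity_c1 = modularity_term(EDGES, C1)
print(cut_value, ratiocut, vol, ncut, modularity_c1)
```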

Measure

Summary:

Optimizing each of the measures is an NP-hard problem:

  • intra-cluster density and inter-cluster sparsity (Ausiello et al., 2002; Wagner and Wagner, 1993)
  • ncut (Shi and Malik, 2000)
  • expansion and conductance (Ausiello et al., 2002; Šíma and Schaeffer, 2006)
  • bicriteria (Kannan et al., 2000)
  • modularity (Brandes, 2006)

Any efficient algorithm that is claimed to solve the optimization problem in polynomial time is a heuristic and yields a sub-optimal clustering.

Algorithm

m: number of edges, n: number of nodes, k: number of clusters


Spectral clustering: Comments

  • Unnormalized spectral clustering: ratiocut measure
  • Normalized spectral clustering: ncut measure
  • Which spectral clustering algorithm do we choose?
    • Regular graph: all variants work equally well
    • If the degrees in the graph are broadly distributed, prefer normalized over unnormalized
    • Normalized case: prefer L_rw rather than L_sym
  • Why successful?
    • Does not make assumptions about the form of the clusters
  • Efficient:
    • Lanczos algorithm to solve the eigenvalue problem (m: the number of edges, n: the number of vertices)
  • No worry about “local” optimum traps
  • Unstable under different choices of the parameters when constructing the graph
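
A minimal sketch of unnormalized spectral clustering for two clusters follows. It makes several simplifying assumptions that are not from the slides: the six-node graph is hypothetical, the second-smallest eigenvector of the Laplacian L = D - W is found by plain power iteration rather than by the Lanczos algorithm, and the two clusters are read off from the signs of that eigenvector instead of running k-means.

```python
def fiedler_vector(n, edges, iters=2000):
    """Approximate the eigenvector of L = D - W with the second-smallest eigenvalue."""
    W = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        W[i][j] = W[j][i] = w
    deg = [sum(row) for row in W]
    c = 2.0 * max(deg)                  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]    # deterministic, non-constant start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]       # project out the constant eigenvector
        # Multiply by (c*I - L): its dominant remaining eigenvector is the one we want.
        y = [(c - deg[i]) * x[i] + sum(W[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    mean = sum(x) / n
    return [v - mean for v in x]


def spectral_bipartition(n, edges, iters=2000):
    """Split the nodes by the sign of their entry in the Fiedler vector."""
    vec = fiedler_vector(n, edges, iters)
    left = {i for i in range(n) if vec[i] < 0}
    return left, set(range(n)) - left


# Hypothetical graph: two triangles (0, 1, 2) and (3, 4, 5) joined by weak edges.
EDGES = {
    (0, 1): 0.8, (0, 2): 0.8, (1, 2): 0.6,
    (3, 4): 0.8, (3, 5): 0.7, (4, 5): 0.8,
    (1, 3): 0.1, (2, 4): 0.2,
}
parts = spectral_bipartition(6, EDGES)
print(parts)
```

For more than two clusters the usual approach is to take the first k eigenvectors and cluster their rows, which this sketch does not attempt.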

Girvan and Newman Algorithm (Girvan and Newman, 2002)
  • Edge Betweenness
    • When a graph is made of tightly bound clusters that are loosely interconnected, all shortest paths between clusters have to go through the few inter-cluster connections.
  • Algorithm
    • 1. Calculate the betweenness score for each edge
    • 2. Remove the edge with the highest score
    • 3. Recalculate the betweenness scores
    • 4. Repeat from step 2
  • Comments
    • Optimizes the modularity measure
    • Good results on real data
    • Complexity remains an issue: O(m²n) in general, O(n³) for a sparse graph
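
The four steps can be sketched as follows for an unweighted, undirected graph. This is an illustrative sketch rather than the authors' implementation: the function names and the example graph are assumptions, edge betweenness is computed with a breadth-first search from every node, and the loop stops as soon as the graph falls apart into more than one component instead of building the full hierarchy.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (shared equally between ties)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}        # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):            # accumulate from the farthest nodes back
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                score[frozenset((u, v))] += share
                delta[u] += share
    return {e: b / 2.0 for e, b in score.items()}   # each pair was counted twice


def components(adj):
    """Connected components of the graph, as a list of sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def first_split(adj):
    """Remove highest-betweenness edges until the graph splits."""
    adj = {u: set(vs) for u, vs in adj.items()}
    while len(components(adj)) == 1:
        scores = edge_betweenness(adj)       # recalculated after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical graph: two triangles joined by the single edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(ADJ)[frozenset((2, 3))]
clusters = first_split(ADJ)
print(bridge_score, clusters)
```

Recomputing every betweenness score after each removal is what makes the method expensive on large graphs.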

Newman fast algorithm (Newman, 2004)
  • Algorithm
    • 1. Start with each of the n nodes in its own cluster.
    • 2. Calculate the increase in Q for all possible cluster pairs.
    • 3. Merge the pair that leads to the greatest increase in Q.
    • 4. Repeat steps 2 and 3 until the modularity Q reaches its maximal value.
  • Comments
    • Greedy optimization technique
    • Advantage in complexity: O((m + n)n), or O(n²) on a sparse graph; 50 000 nodes in minutes rather than years
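
The greedy merging loop can be sketched in a few lines. This is a simplified illustration, not Newman's implementation: it recomputes Q from scratch for every candidate merge (so it does not have the complexity advantage noted above), it considers every cluster pair rather than only connected ones, and it stops at the first step where no merge increases Q. The function names and the example graph are assumptions.

```python
def modularity(edges, clusters):
    """Q = sum over clusters of (fraction of edges inside) - (fraction of edge ends)^2."""
    m = len(edges)
    label = {v: k for k, c in enumerate(clusters) for v in c}
    q = 0.0
    for k in range(len(clusters)):
        inside = sum(1 for u, v in edges if label[u] == k and label[v] == k)
        degree = sum((label[u] == k) + (label[v] == k) for u, v in edges)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def greedy_modularity(nodes, edges):
    """Merge the pair of clusters that increases Q most until no merge helps."""
    clusters = [frozenset([v]) for v in nodes]      # step 1: one cluster per node
    best_q = modularity(edges, clusters)
    while len(clusters) > 1:
        candidates = []
        for i in range(len(clusters)):              # step 2: try every pair
            for j in range(i + 1, len(clusters)):
                rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                merged = rest + [clusters[i] | clusters[j]]
                candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:                             # step 4: Q can no longer increase
            break
        best_q, clusters = q, merged                # step 3: keep the best merge
    return clusters, best_q


# Hypothetical graph: two triangles joined by the single edge 2-3.
EDGE_LIST = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
found, q_value = greedy_modularity(range(6), EDGE_LIST)
print(found, q_value)
```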

Algorithm: Summary
  • No algorithm is a panacea
  • A clustering algorithm is usually proposed to optimize some quality measure, so it is unfair to compare two algorithms that favor different measures
  • No measure can capture the full characteristics of cluster structures, thus there is no perfect algorithm
  • There is no definition of the so-called “best clustering”. The “best” depends on the application, data characteristics, granularity and so on.

Evaluation
  • Internal (intrinsic) measures
  • External (extrinsic) measures
    • Are there any formal constraints (properties, criteria) that an ideal extrinsic measure should satisfy?
    • Do the extrinsic measures proposed so far satisfy the constraints?

Evaluation: Formal Constraints (Amigo et al., 2008)
  • homogeneity
  • completeness
  • rag bag
  • cluster size vs. quantity

Rosenberg and Hirschberg (2007)

Evaluation Measures


Measures for Coreference Resolution

  • MUC:
    • gives no credit for separating out singleton clusters
    • considers all errors to be equal
  • B-Cubed:
    • overcomes the two drawbacks of the MUC measure
    • gives multiple credits to a single item
  • ECM:
    • seeks an optimal alignment between the system clustering and the reference clustering
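
The first two measures can be sketched as below, using their standard link-based (MUC) and mention-based (B-Cubed) definitions. The function names and the toy clusterings are assumptions, and the sketch assumes that the system and the reference contain exactly the same mentions. ECM is omitted because it needs an optimal alignment between the two clusterings.

```python
def b_cubed(gold, system):
    """B-Cubed precision and recall, averaged over mentions."""
    gold_of = {m: c for c in gold for m in c}
    sys_of = {m: c for c in system for m in c}
    mentions = list(gold_of)
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall


def muc_recall(key, response):
    """Link-based MUC recall of `response` measured against `key`."""
    resp_of = {m: i for i, c in enumerate(response) for m in c}
    found = needed = 0
    for cluster in key:
        # Number of pieces the response breaks this key cluster into.
        pieces = {resp_of.get(m, ("missing", m)) for m in cluster}
        found += len(cluster) - len(pieces)
        needed += len(cluster) - 1      # a singleton contributes no links at all
    return found / needed if needed else 0.0


def muc(gold, system):
    """MUC precision and recall (precision swaps the roles of the clusterings)."""
    return muc_recall(system, gold), muc_recall(gold, system)


# Hypothetical reference and system clusterings over mentions a, b, c, d.
GOLD = [{"a", "b", "c"}, {"d"}]
SYSTEM = [{"a", "b"}, {"c", "d"}]

b3_precision, b3_recall = b_cubed(GOLD, SYSTEM)
muc_precision, muc_rec = muc(GOLD, SYSTEM)
print(b3_precision, b3_recall, muc_precision, muc_rec)
```

In muc_recall a singleton cluster adds nothing to either the numerator or the denominator, which is the first MUC drawback listed above.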

Satisfaction of Formal Constraints for Various Measures
  • Extend the work of Amigo et al. (2008) to more measures: adjusted Rand index, V-measure, MUC measure and ECM measure
  • Re-compute all the scores
  • Only the B-Cubed F-measure satisfies all four constraints
  • The ECM F-measure fails three constraints: homogeneity, completeness and rag bag

Future Directions
  • Scalability
    • graphs in real applications are growing rapidly
    • graphs are changing dynamically
  • Stability
    • perturbations in the graph
  • Statistical significance
    • how significant is a clustering compared with one produced by a null model of the graph?

Coreference Resolution
  • Entity coreference resolution: identifying which noun phrases (NPs, or mentions) refer to the same real-world entity in text.
    • An entity is an object in the real world, such as a person, organization or facility.
    • A mention is a textual reference to an entity.
  • Event coreference resolution: identifying which event mentions refer to the same event in text.
    • An event is a specific occurrence involving participants.
    • An event mention includes a distinguished trigger (the word that most clearly expresses that an event occurs) and the arguments involved (entities or temporal expressions that play certain roles in the event).


Entity Coreference Resolution: an Example

John Perry, of Weston Golf Club, announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in office, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.


Event Coreference Resolution: an Example

[EM1] An explosion in a cafe at one of the capital's busiest intersections killed one woman and injured another Tuesday.

[EM2] Police were investigating the cause of the explosion in the restroom of the multistory Crocodile Cafe in the commercial district of Kizilay during the morning rush hour.

[EM3] The blast shattered walls and windows in the building.

[EM4] Ankara police chief Ercument Yilmaz visited the site of the morning blast.

[EM5] The explosion comes a month after [EM6] a bomb exploded at a McDonald's restaurant in Istanbul, causing damage but no injuries.

[EM7] Radical leftist, Kurdish and Islamic groups are active in the country and have carried out the bombing in the past.

A Parallel Comparison between Entity Coreference Resolution and Event Coreference Resolution

The two problems are similar because:

  • the problem descriptions are similar
  • the mathematical interpretations are similar
  • they can be solved by applying a two-step procedure
  • they can be solved by applying graph-based clustering methodology

They are different because:

  • entity and event have different attributes and values

Solution: a Two-step Procedure
  • classification step: compute the likelihood that one mention corefers with another
  • clustering step: group the mentions into clusters such that all mentions in a cluster refer to the same entity

Solution: a Two-step Procedure

Classification step

  • Learning algorithm
    • decision tree: McCarthy and Lehnert (1995), Soon et al. (2001), Strube et al. (2002), Strube and Muller (2003) and Yang et al. (2003)
    • maximum entropy: Luo et al. (2004)
    • SVM: Finley and Joachims (2005)
    • kernel: Yang et al. (2006)
  • Feature sets
    • Soon et al. (2001) define 12 surface-level features in four categories: lexical, grammatical, semantic and positional
    • Ng and Cardie (2002) extend 12 to 53 with new features based on common-sense knowledge and linguistic intuitions
    • Ng (2007) proposes another six semantic features
    • Yang and Su (2007) extract semantic relatedness features from Wikipedia

Solution: a Two-step Procedure

Clustering step

  • Closest-first clustering (Soon et al., 2001)
  • Best-first clustering (Ng and Cardie, 2002)
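
The difference between the two strategies can be sketched as follows. Closest-first links a mention to the nearest preceding mention whose coreference score exceeds the threshold, while best-first links it to the preceding mention with the highest score above the threshold. The function name, the mention labels and the pairwise scores are hypothetical; in a real system the scores come from the classification step.

```python
def link_mentions(mentions, score, threshold=0.5, strategy="closest"):
    """Cluster mentions (given in document order) by choosing one antecedent each."""
    antecedent = {}
    for j, mention in enumerate(mentions):
        # Preceding mentions whose score with this mention exceeds the threshold.
        candidates = [(a, score.get((a, mention), 0.0)) for a in mentions[:j]]
        candidates = [(a, s) for a, s in candidates if s > threshold]
        if not candidates:
            continue
        if strategy == "closest":
            antecedent[mention] = candidates[-1][0]     # nearest qualifying antecedent
        else:
            antecedent[mention] = max(candidates, key=lambda t: t[1])[0]  # best score
    clusters, cluster_of = [], {}
    for mention in mentions:
        if mention in antecedent:
            cluster = cluster_of[antecedent[mention]]   # join the antecedent's cluster
        else:
            cluster = []                                # start a new entity
            clusters.append(cluster)
        cluster.append(mention)
        cluster_of[mention] = cluster
    return clusters


# Hypothetical pairwise scores (antecedent, mention) -> coreference likelihood.
MENTIONS = ["m1", "m2", "m3", "m4"]
SCORES = {
    ("m1", "m2"): 0.3, ("m1", "m3"): 0.2, ("m2", "m3"): 0.4,
    ("m1", "m4"): 0.8, ("m2", "m4"): 0.7, ("m3", "m4"): 0.6,
}

closest = link_mentions(MENTIONS, SCORES, strategy="closest")
best = link_mentions(MENTIONS, SCORES, strategy="best")
print(closest, best)
```

With these assumed scores the last mention joins m3 under closest-first but m1 under best-first, so the two strategies produce different entities from the same classifier output.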

[Figure: closest-first and best-first clustering of mentions EM1 to EM4 into entities E1 and E2, with threshold = 0.5 and pairwise scores 0.2, 0.3, 0.4, 0.6, 0.7 and 0.8]


Solution: From Local clustering to Global clustering
  • Problem in the two-step procedure: it works in a greedy style without searching the space of all possible clusterings
  • Luo et al. (2004)

John Perry1, of Weston Golf Club2, announced his3 resignation yesterday.

Link Model:

Start Model:

Bell tree for the three mentions (an asterisk marks the mention currently being processed):

  • [1] 2* 3
    • [1,2] 3*
      • [1,2,3]
      • [1,2] [3]
    • [1] [2] 3*
      • [1,3] [2]
      • [1] [2,3]
      • [1] [2] [3]

  • A heuristic search algorithm looks for the most probable clustering: at each step of the search process, only the most promising nodes in the tree are expanded.
  • It still works in a greedy style and may miss the optimal clustering
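
The search space of the Bell tree can be enumerated with a short sketch. Each mention is either linked to one of the partial entities built so far or starts a new entity, so the leaves are all possible clusterings. The function name is hypothetical, and the sketch expands every node; the heuristic search described above would instead score the nodes with the link and start models and keep only the most promising ones.

```python
def bell_tree_leaves(mentions):
    """Enumerate every clustering reachable in the Bell tree, in left-to-right order."""
    partial = [[]]                                   # one empty clustering to start from
    for mention in mentions:
        expanded = []
        for clustering in partial:
            for k in range(len(clustering)):         # link: add to an existing entity
                linked = (clustering[:k]
                          + [clustering[k] + [mention]]
                          + clustering[k + 1:])
                expanded.append(linked)
            expanded.append(clustering + [[mention]])   # start: open a new entity
        partial = expanded
    return partial


leaves = bell_tree_leaves([1, 2, 3])
for leaf in leaves:
    print(leaf)
```

The number of leaves is the Bell number of the number of mentions (5 for three mentions, 15 for four), which is why an exhaustive search quickly becomes infeasible.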


Solution: From Local clustering to Global clustering

  • Ng (2005)
    • 54 coreference resolution systems (3 classification algorithms, 3 clustering algorithms, 3 instance creation methods and 2 feature sets)
    • global ranking model
    • rank the 54 candidate clusterings to get the best clustering
    • performance depends on the best clustering from one of 54 systems

[Figure: system1 ... system54 produce clustering1 ... clustering54; the ranking model ranks them and outputs the best clustering]

Solution: From Supervised to Unsupervised
  • classification step is supervised
  • semi-supervised:
    • co-training (Muller et al., 2002)
    • self-training
    • EM
  • unsupervised:
    • Non-parametric Bayesian models based on Dirichlet processes (Haghighi and Klein, 2007)
    • Integer linear programming (Denis and Baldridge, 2007)
    • Markov logic (Poon and Domingos, 2008)

Solution: graph-based clustering methodology
  • Nicolae and Nicolae (2006)


Solution: graph-based clustering methodology

  • Minimum cut

Minimum cut is measured as the number of mentions that are correctly placed in their cluster.

two correct cases: average and maximum weight

[Figure: weighted graph over mentions 1 to 5 with edge weights 0.1, 0.1, 0.2, 0.5, 0.5, 0.6 and 0.7; the marked cut has score(cut) = 3]


Solution: graph-based clustering methodology

  • BESTCUT Algorithm

Mary1 has a brother2, John3. The boy4 is older than the girl5.

Clustering: {Mary1, the girl5} and {a brother2, John3, The boy4}

Recursive procedure:

  • Find the best cut using the algorithm of Stoer and Wagner (1997)
  • Stop the cut?
    • No: continue the procedure on the two subgraphs
    • Yes: form entities

[Figure: weighted graph over mentions 1 to 5 with edge weights 0.1, 0.1, 0.2, 0.5, 0.5, 0.6 and 0.7]

Finding the Best Cut (Stoer and Wagner, 1997)

[Figure: four copies of the five-mention graph, each with a different candidate cut and one marked as the Best Cut; the scores shown are score(cut1) = 3, score(cut2) = 4, score(cut3) = 5 and score(cut2) = 3.5]
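
A minimal sketch of the Stoer and Wagner (1997) minimum-cut algorithm is given below. It minimizes the plain cut weight; it does not implement the mention-based cut score used by BESTCUT. The wiring of the five-mention graph is hypothetical (only the multiset of weights follows the figure), and the function name is an assumption.

```python
def stoer_wagner(graph):
    """graph: {u: {v: weight}}, symmetric. Returns (cut weight, nodes on one side)."""
    g = {u: dict(nbrs) for u, nbrs in graph.items()}
    members = {u: [u] for u in g}          # original nodes inside each merged node
    best_weight, best_side = float("inf"), None
    while len(g) > 1:
        # Minimum cut phase: repeatedly add the node most tightly connected to the set.
        start = next(iter(g))
        added = [start]
        conn = {u: g[start].get(u, 0.0) for u in g if u != start}
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            added.append(nxt)
            for v, w in g[nxt].items():
                if v in conn:
                    conn[v] += w
        s, t = added[-2], added[-1]
        # The cut of the phase separates the last added node from everything else.
        if cut_of_phase < best_weight:
            best_weight, best_side = cut_of_phase, sorted(members[t])
        # Merge t into s before the next phase.
        members[s].extend(members.pop(t))
        for v, w in g[t].items():
            if v != s:
                g[s][v] = g[s].get(v, 0.0) + w
                g[v][s] = g[s][v]
                del g[v][t]
        g[s].pop(t, None)
        del g[t]
    return best_weight, best_side


# Hypothetical wiring for the mentions Mary1, a brother2, John3, The boy4, the girl5.
GRAPH = {
    1: {5: 0.6, 2: 0.2},
    2: {1: 0.2, 3: 0.7, 4: 0.5},
    3: {2: 0.7, 4: 0.5, 5: 0.1},
    4: {2: 0.5, 3: 0.5, 5: 0.1},
    5: {1: 0.6, 3: 0.1, 4: 0.1},
}
weight, side = stoer_wagner(GRAPH)
print(weight, side)
```

Under these assumed weights the minimum cut separates mentions {1, 5} from {2, 3, 4}, matching the clustering given for the example sentence.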

Event Coreference Resolution
  • Pioneering work in MUC (Message Understanding Conference) Evaluations in the 1990s
    • Humphreys et al., 1997 (ontology)
    • Bagga and Baldwin, 1998 (Vector Space Model)
  • Events are based on scenarios, e.g., management succession, resignation, election, espionage.
  • ACE Evaluations define 8 fine-grained event types
  • Recent work:
    • Chen et al., 2009a (agglomerative clustering)
    • Chen and Ji, 2009b (spectral graph-based clustering)

Event Coreference Resolution: agglomerative clustering (Chen et al., 2009a)
  • Similar to the Bell tree search algorithm of Luo et al. (2004), but using different notation
  • A pairwise event coreference model using event-specific features (triggers/arguments/event attributes)
  • Event attributes play an important role in distinguishing coreference from non-coreference
  • The performance bottleneck comes from system-generated event mentions

Event Coreference Resolution: graph-based clustering methodology
  • Chen and Ji (2009b)


Event Coreference Resolution: graph-based clustering methodology

cut(A, B) = 0.1 + 0.2 + 0.2 + 0.3 = 0.8

[Figure: weighted graph split into subgraphs A and B; the edges crossing the cut have weights 0.1, 0.2, 0.2 and 0.3]

Event Coreference Resolution: Computing Coreference Likelihood

Method 1: Computing global weights

  • Compute 16 types of weights (8 trigger related and 8 argument related) based on a training corpus
  • Combine the 16 weights with an exponential function

Method 2: Applying a Maximum Entropy Model

  • Learn a Maximum Entropy Model using trigger/distance/argument related features

Coreference Resolution: Summary
  • Major techniques that have been successfully applied in entity coreference resolution can also be adapted to event coreference resolution
  • An event is syntactically and structurally more complex than an entity and carries much more semantic content, which suggests that parsing features and semantic features may help
  • Event coreference resolution and the RTE (Recognizing Textual Entailment) task may complement each other

RTE: given two text fragments, decide whether the meaning of one text is entailed by (can be inferred from) the other (Dagan et al., 2006).

Conclusions
  • Graph-based clustering methodology can be applied in various areas, thus we hope this survey can help “bridge” the interactions among research communities.
  • The survey can serve as a reference manual for researchers who want to apply the graph-based clustering methodology to their own problems.
