I2R-NUS-MSRA at TAC 2011:

Entity Linking

Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2,

Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1

1 National University of Singapore

2 Institute for Infocomm Research

3 Microsoft Research Asia

Text Analysis Conference, November 14-15, 2011


Outline
  • I2R-NUS team at TAC
    • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
      • Acronym Expansion
      • Semantic Features
      • Instance Selection
    • Investigate three algorithms for NIL query clustering
      • Spectral Graph Partitioning (SGP)
      • Hierarchical Agglomerative Clustering (HAC)
      • Latent Dirichlet allocation (LDA)
      • Combination system
  • Offline combination with the system of the MSRA team at the KB linking step



Acronym Expansion - Motivation
  • Expand an acronym using its document context to reduce the ambiguity of a name
    • E.g. TSE in Wikipedia refers to 33 entries, whereas Tokyo Stock Exchange is unambiguous.


Step 1 – Find Expansion Candidates
  • Identifying Candidate Expansions (e.g. for ACM)
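
The slide only names this step, so the sketch below shows one plausible way to collect candidate expansions for an acronym such as ACM: scan the article for word sequences whose initials cover the acronym's letters. The window sizes and the one-mismatch tolerance are illustrative assumptions, not the system's actual rules.

import re

def candidate_expansions(acronym, sentences, window=2):
    # Collect phrases whose word initials could form the acronym (assumed heuristic).
    letters = set(acronym.lower())
    n = len(acronym)
    candidates = []
    for sent_idx, sent in enumerate(sentences):
        words = re.findall(r"[A-Za-z][\w'-]*", sent)
        for size in range(max(2, n - window), n + window + 1):
            for start in range(len(words) - size + 1):
                phrase = words[start:start + size]
                initials = {w[0].lower() for w in phrase}
                if len(letters & initials) >= n - 1:   # tolerate one unmatched letter
                    candidates.append((" ".join(phrase), sent_idx))
    return candidates

# e.g. candidate_expansions("ACM", ["He joined the Association for Computing Machinery in 2001."])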


Step 2 – Candidate Expansions Ranking
  • Use an SVM classifier to rank the candidates
  • Our SVM-based acronym expansion
      • can link an acronym to its full string even when they appear in different sentences of an article
        • Feature: number of acronym characters matched by the leading characters of the expansion's words.
      • can handle acronyms with swapped letters.
        • E.g. Communist Party of China vs. CCP
        • Feature: sentence distance between acronym and expansion
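
As a rough illustration of this step (not the team's actual feature set or training data), the sketch below ranks candidate expansions with scikit-learn's SVC using only the two features named above, the character overlap between acronym and expansion initials and the sentence distance between them; the training pairs are hypothetical.

from sklearn.svm import SVC

def expansion_features(acronym, expansion, acr_sent, exp_sent):
    # Character overlap between the acronym and the expansion words' leading characters.
    initials = "".join(w[0].lower() for w in expansion.split())
    char_overlap = len(set(acronym.lower()) & set(initials))
    # Sentence distance between acronym and expansion in the article.
    sent_distance = abs(acr_sent - exp_sent)
    return [char_overlap, sent_distance]

# Hypothetical training pairs: (feature vector, 1 = correct expansion, 0 = not).
X = [expansion_features("ACM", "Association for Computing Machinery", 0, 0),
     expansion_features("ACM", "Association for Computing Machinery", 0, 3),
     expansion_features("ACM", "annual committee meeting", 0, 8),
     expansion_features("ACM", "the magazine", 0, 1)]
y = [1, 1, 0, 0]
ranker = SVC(kernel="linear").fit(X, y)

def rank(acronym, candidates):
    # candidates: list of (expansion, acronym_sentence_idx, expansion_sentence_idx)
    return sorted(candidates,
                  key=lambda c: ranker.decision_function([expansion_features(acronym, c[0], c[1], c[2])])[0],
                  reverse=True)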


Outline
  • I2R-NUS team at TAC
    • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
      • Acronym Expansion
      • Semantic Features
      • Instance Selection
    • Investigate three algorithms for NIL query clustering
      • Spectral Graph Partitioning (SGP)
      • Hierarchical Agglomerative Clustering (HAC)
      • Latent Dirichlet allocation (LDA)
      • Combination system
  • Combine with the system of MSRA team at KB linking step


A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection

Related Work on Context Similarity
  • Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010
    • Term Matching
    • However, term matching fails for contexts such as the following:

1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

2) Michael Jordan is currently a full professor at the University of California, Berkeley.

3) Michael Jordan (born February, 1963) is a former American professional basketball player.

4) Michael Jordan won the NBA MVP for the 1991-92 season.

No term match between these contexts, even between those that refer to the same Michael Jordan.


Our System - A Wikipedia-LDA model
  • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
  • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.
  • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.
  • 4) Michael Jordan won the NBA MVP for the 1991-92 season.

Topic: Science (contexts 1 and 2)

Topic: Basketball (contexts 3 and 4)


Wikipedia – LDA Model

The model estimates two distributions from Wikipedia documents:

P(word_i | category_j): the word distribution of each Wikipedia category (topic)

P(category_i | document_j): the category (topic) distribution of each document


Wikipedia – LDA Model
  • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
  • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.
  • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.
  • 4) Michael Jordan won the NBA MVP for the 1991-92 season.
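
The slides do not show how the topic model is turned into a feature, so the sketch below uses scikit-learn's standard LDA as a stand-in for the Wikipedia-supervised model of the paper: topic distributions are inferred for the query context and for a candidate's Wikipedia article, and their cosine similarity serves as a semantic feature. The placeholder corpus and the number of topics are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus standing in for the Wikipedia articles used to fit the model.
wiki_articles = ["Michael Jordan machine learning professor Berkeley ...",
                 "Michael Jordan basketball NBA Chicago Bulls ..."]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(wiki_articles)
lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(counts)

def topic_similarity(query_context, candidate_article):
    # Semantic feature: cosine similarity of the two documents' topic distributions.
    bows = vectorizer.transform([query_context, candidate_article])
    topics = lda.transform(bows)            # each row approximates P(topic | document)
    return cosine_similarity(topics[:1], topics[1:])[0, 0]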


Outline
  • I2R-NUS team at TAC
    • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
      • Acronym Expansion
      • Semantic Features
      • Instance Selection
    • Investigate three algorithms for NIL query clustering
      • Spectral Graph Partitioning (SGP)
      • Hierarchical Agglomerative Clustering (HAC)
      • Latent Dirichlet allocation (LDA)
      • Combination system
  • Combine with the system of MSRA team at KB linking step


Related Work
  • Vector Space Model
    • Difficult to combine bag of words (BOW) with other features.
    • Performance needs to be improved
  • Supervised Approaches
    • Using manually annotated training instances
      • Dredze et al., 2010; Zheng et al., 2010
    • Using automatically generated training instances
      • Zhang et al., 2010


Related Work
  • (News Article) Obama Campaign Drops The George W. Bush Talking Point …
  • Auto-generate training instances (Zhang et al., 2010)


Related Work
  • From “George W. Bush” articles
    • No positive instances are generated for “George H. W. Bush”, “George P. Bush” and “George Washington Bush”
    • No negative instances are generated for “George W. Bush”
  • Such positive/negative training-instance distributions may not match those of the original ambiguous cases in the raw text collection
  • The distribution of unambiguous mentions may also differ from that in the test data

The Approach in Our System

  • An instance selection approach
    • Select an informative, representative, and diverse subset from the auto-generated data set.
    • Reduce the effect of the distribution differences


Instance Selection

  • Iterative loop (flowchart on the slide):
    • Train an SVM classifier on a small initial data set
    • Test the classifier on the auto-generated data set
    • Select informative, representative and diverse instances (illustrated on a 2-D data set around the SVM hyperplane)
    • Add these selected instances to the initial data set and repeat
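
The flowchart gives the loop but not the selection criteria in code form. Below is a minimal sketch under the assumption that "informative" is approximated by closeness to the current SVM hyperplane and "diverse" by a simple spacing rule on those margins, which is much cruder than the batch-size-changing selection in the paper.

import numpy as np
from sklearn.svm import SVC

def grow_training_set(initial_X, initial_y, pool, rounds=5, batch_size=20, gap=0.05):
    # pool: list of (feature_vector, label) pairs auto-generated from Wikipedia.
    X, y = list(initial_X), list(initial_y)
    for _ in range(rounds):
        clf = SVC(kernel="linear").fit(X, y)                 # train on current set
        margins = np.abs(clf.decision_function([p[0] for p in pool]))
        order = np.argsort(margins)                          # most uncertain (informative) first
        picked, last = [], None
        for idx in order:
            if last is None or margins[idx] - last >= gap:   # crude diversity spacing
                picked.append(idx)
                last = margins[idx]
            if len(picked) == batch_size:
                break
        for idx in sorted(picked, reverse=True):             # move picks into the training set
            xi, yi = pool.pop(idx)
            X.append(xi)
            y.append(yi)
        if not pool:
            break
    return SVC(kernel="linear").fit(X, y)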


Outline
  • I2R-NUS team at TAC
    • incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
      • Acronym Expansion
      • Semantic Features
      • Instance Selection
    • Investigate three algorithms for NIL query clustering
      • Spectral Graph Partitioning (SGP)
      • Hierarchical Agglomerative Clustering (HAC)
      • Latent Dirichlet allocation (LDA)
      • Combination system
  • Combine with the system of MSRA team at KB linking step

Spectral Clustering
  • Advantages over other clustering techniques
    • Globally optimized results
    • Efficient in time and space
    • Generally produces better results
  • Success in many areas
    • Image segmentation
    • Gene expression clustering
Spectral Clustering
  • Eigen Decomposition on Graph Laplacian
  • Dimensionality Reduction
  • (Luxburg, 2006)

Eigendecomposition A = QΛQ⁻¹ (slide illustration: documents split into two clusters, one for George W. Bush and one for George H.W. Bush)
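
The deck only names the ingredients (graph Laplacian, eigendecomposition, dimensionality reduction), so the following is a sketch of standard normalized spectral clustering over a precomputed document-similarity matrix, not the team's exact implementation; the number of clusters k is assumed to be given, whereas NIL clustering would have to estimate it (e.g. from an eigengap).

import numpy as np
from sklearn.cluster import KMeans

def spectral_partition(similarity, k):
    # similarity: symmetric matrix of pairwise doc similarities for the NIL queries.
    W = np.asarray(similarity, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W                                 # unnormalized graph Laplacian
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(degrees, 1e-12, None)))
    L_sym = d_inv_sqrt @ L @ d_inv_sqrt                      # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)                 # eigendecomposition A = Q Lambda Q^-1
    embedding = eigvecs[:, :k]                               # k smallest eigenvectors = reduced dimensions
    embedding = embedding / np.clip(np.linalg.norm(embedding, axis=1, keepdims=True), 1e-12, None)
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)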


Hierarchical Agglomerative Clustering
  • Convert a doc into a feature vector: Wikipedia concepts, bag-of-words and named entities.
  • Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010):
      • this model shows good performance in Web People Search
      • In our work, original query name, its Wikipedia redirected names and its coreference chain mentions are all considered as appearances of the query name in the text.
  • Similarity scores: cosine similarity and overlap similarity.


Hierarchical Agglomerative Clustering
  • Docs referring to the same entity are clustered according to pair-wise doc similarity scores.
    • Start with singletons: each doc is a cluster
    • If there are two docs D and D' in clusters Ci and Cj respectively:
      • the two clusters Ci and Cj are merged into a new cluster Cij if Sim(D, D') > γ (γ = 0.25)
      • the similarity between the new cluster Cij and all remaining clusters is then recalculated
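
A minimal sketch of the merging loop described above, assuming the pairwise doc similarities (cosine or overlap, weighted as on the previous slide) are already computed into a matrix; merging whenever some cross-cluster doc pair exceeds γ makes this effectively single-link clustering.

import numpy as np

def hac_clusters(sim, gamma=0.25):
    # sim: symmetric matrix of pairwise doc similarity scores.
    sim = np.asarray(sim, dtype=float)
    clusters = [[i] for i in range(len(sim))]                # start with singletons
    def link(a, b):
        # best similarity over doc pairs (D, D') with D in cluster a and D' in cluster b
        return max(sim[i][j] for i in a for j in b)
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best <= gamma:                                    # no doc pair above the threshold: stop
            break
        i, j = pair
        clusters[i].extend(clusters[j])                      # merge Cj into Ci
        del clusters[j]
    return clusters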


Latent Dirichlet Allocation (LDA)
  • LDA has been applied to many NLP tasks such as summarization and text classification
  • In our approach, the learned topics can represent the underlying entities of the ambiguous names
  • Generative story (shown as a diagram on the slide)
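
The slide stops at the generative story, so how topics become clusters is not spelled out. A plausible minimal sketch is to fit standard LDA over the NIL query documents and assign each document to its most probable topic, treating every topic as one out-of-KB entity; the scikit-learn implementation and the number of topics are assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_nil_clusters(nil_docs, n_entities=50):
    # nil_docs: the texts of the queries that could not be linked to the KB.
    counts = CountVectorizer(stop_words="english").fit_transform(nil_docs)
    lda = LatentDirichletAllocation(n_components=n_entities, random_state=0)
    doc_topic = lda.fit_transform(counts)    # rows approximate P(topic | document)
    return doc_topic.argmax(axis=1)          # cluster id = the document's dominant topic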


Three Clustering Systems Combination

  • A three-class SVM classifier decides which system to trust
  • Features: scores given by the three systems

Combine with the system of the MSRA team at the KB linking step

  • A binary SVM classifier decides which system to trust
  • Features: scores given by the two systems
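
A toy sketch of the combination step, assuming each clustering (or linking) system emits a confidence score per query; the score values and labels below are hypothetical, and only the idea that an SVM over the systems' scores picks which output to trust comes from the slide.

from sklearn.svm import SVC

# Hypothetical training rows: scores from the SGP, HAC and LDA systems for a query,
# labelled with the system whose output matched the gold clustering best.
train_scores = [[0.81, 0.40, 0.35],
                [0.30, 0.77, 0.42],
                [0.25, 0.33, 0.90],
                [0.60, 0.58, 0.20]]
train_labels = ["SGP", "HAC", "LDA", "SGP"]

selector = SVC().fit(train_scores, train_labels)             # three-class SVM

# At test time, keep the output of whichever system the classifier trusts.
chosen = selector.predict([[0.55, 0.61, 0.20]])[0]
# The same pattern with two score features and two labels gives the binary
# combination with the MSRA system at the KB linking step.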


Submissions


Conclusion
  • Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
    • Acronym Expansion
    • Semantic Features
    • Instance Selection
  • Investigate three algorithms for NIL query clustering
    • Spectral Graph Partitioning (SGP)
    • Hierarchical Agglomerative Clustering (HAC)
    • Latent Dirichlet allocation (LDA)
