

I2R-NUS-MSRA at TAC 2011:

Entity Linking

Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2,

Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1

1 National University of Singapore

2 Institute for Infocomm Research

3 Microsoft Research Asia

Text Analysis Conference, November 14-15, 2011



I2R-NUS-MSRA at TAC 2011: Entity Linking

Outline

  • I2R-NUS team at TAC

    • Incorporate the new techniques proposed in our recent papers (IJCAI 2011, IJCNLP 2011)

      • Acronym Expansion

      • Semantic Features

      • Instance Selection

    • Investigate three algorithms for NIL query clustering, plus their combination

      • Spectral Graph Partitioning (SGP)

      • Hierarchical Agglomerative Clustering (HAC)

      • Latent Dirichlet Allocation (LDA)

      • Combination system

  • Offline combination with the MSRA team's system at the KB linking step





Acronym Expansion - Motivation

  • Expand an acronym using its context to reduce the ambiguity of a name

    • E.g., TSE matches 33 entries in Wikipedia, whereas Tokyo Stock Exchange is unambiguous.


Step 1 - Find Expansion Candidates

  • Identify candidate expansions (e.g., for ACM)


Step 2 - Candidate Expansion Ranking

  • An SVM classifier ranks the candidates

  • Our SVM-based acronym expansion

    • can link acronyms to full strings appearing in different sentences of an article

      • Feature: number of acronym characters matched by the leading characters of the expansion's words

    • can handle acronyms with swapped letters

      • E.g., Communist Party of China vs. CCP

      • Feature: sentence distance between acronym and expansion
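The two feature types above can be sketched as follows (a hypothetical illustration, not the authors' code: `common_leading_chars` and `features` are invented names, and the SVM ranking itself is omitted):

```python
# Sketch of the two acronym-expansion features described on this slide.
# The character-overlap count ignores letter order, so swapped-letter
# acronyms like "CCP" for "Communist Party of China" still score fully.

def common_leading_chars(acronym, expansion):
    """Count acronym letters matched (in any order) by the leading
    characters of the expansion's words."""
    remaining = [w[0].upper() for w in expansion.split() if w]
    count = 0
    for ch in acronym.upper():
        if ch in remaining:
            remaining.remove(ch)
            count += 1
    return count

def features(acronym, expansion, acro_sent_idx, exp_sent_idx):
    return [
        common_leading_chars(acronym, expansion),  # character overlap
        abs(acro_sent_idx - exp_sent_idx),         # sentence distance
    ]

# All three letters of "CCP" are matched despite the swapped order:
print(features("CCP", "Communist Party of China", 5, 2))  # [3, 3]
```

In the full system these feature vectors would be fed to the SVM to rank the candidate expansions.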






A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection

Related Work on Context Similarity

  • Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010

    • Term matching

    • However, term matching fails when two contexts share no terms:

1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

2) Michael Jordan is currently a full professor at the University of California, Berkeley.

3) Michael Jordan (born February, 1963) is a former American professional basketball player.

4) Michael Jordan wins NBA MVP of 91-92 season.

No term match between the researcher contexts (1, 2) and the basketball contexts (3, 4).

The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand


Our System - A Wikipedia-LDA Model

  • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

  • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.

  • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.

  • 4) Michael Jordan wins NBA MVP of 91-92 season.

Topics assigned: sentences 1-2: Science; sentences 3-4: Basketball


Wikipedia - LDA Model

[Model diagram: LDA over Wikipedia documents, with word distributions P(word_i | category_j) and per-document category distributions P(category_i | document_j).]


Wikipedia - LDA Model

  • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.

  • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.

  • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.

  • 4) Michael Jordan wins NBA MVP of 91-92 season.
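The topic-space idea above can be illustrated with a toy sketch (a generic scikit-learn LDA on the four example sentences, not the paper's Wikipedia-trained model): each context is mapped to a P(topic | document) vector and contexts are compared in that space instead of by raw term overlap.

```python
# Toy illustration: compare mention contexts by LDA topic distributions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

contexts = [
    "Michael Jordan is a leading researcher in machine learning and artificial intelligence.",
    "Michael Jordan is currently a full professor at the University of California, Berkeley.",
    "Michael Jordan (born February, 1963) is a former American professional basketball player.",
    "Michael Jordan wins NBA MVP of 91-92 season.",
]

counts = CountVectorizer(stop_words="english").fit_transform(contexts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)      # each row is P(topic | document)

def cos(a, b):
    """Cosine similarity between two topic distributions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_topic.round(2))
print(cos(doc_topic[0], doc_topic[1]), cos(doc_topic[0], doc_topic[2]))
```

Even with zero term overlap, contexts about the same underlying entity can land close together in topic space.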




Related Work

  • Vector Space Model

    • Difficult to combine bag-of-words (BOW) with other features

    • Performance leaves room for improvement

  • Supervised Approaches

    • Using manually annotated training instances

      • Dredze et al., 2010; Zheng et al., 2010

    • Using automatically generated training instances

      • Zhang et al., 2010


Related Work

  • (News article) Obama Campaign Drops The George W. Bush Talking Point …

  • Auto-generated training instances (Zhang et al., 2010)


Related Work

  • From "George W. Bush" articles:

    • No positive instances are generated for "George H. W. Bush", "George P. Bush", or "George Washington Bush"

    • No negative instances are generated for "George W. Bush"

  • The resulting positive/negative training-instance distribution may not match that of the genuinely ambiguous cases in the raw text collection

  • The distribution of unambiguous mentions may likewise differ from that in the test data


The Approach in Our System

  • An instance selection approach

    • Select an informative, representative, and diverse subset from the auto-generated data set

    • Reduce the effect of the distribution differences


Instance Selection

[Figure: 2-D illustration of the selection loop. An SVM classifier is trained on a small initial data set and applied to the auto-generated data set; informative, representative, and diverse instances near the SVM hyperplane are selected and added to the initial data set, and the process repeats.]
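The selection loop can be sketched schematically (a hedged sketch, not the submitted system: here "informative" is approximated as closeness to the SVM hyperplane and "diverse" as distance from instances already picked in the current batch; `select_instances` is an invented name):

```python
# Schematic instance-selection loop: train, score the pool, pick
# small-margin (informative) yet mutually distant (diverse) instances.
import numpy as np
from sklearn.svm import LinearSVC

def select_instances(X_init, y_init, X_pool, y_pool, rounds=3, batch=5):
    # y_init must contain both classes so the SVM can be trained.
    X_train, y_train = X_init.copy(), y_init.copy()
    pool = list(range(len(X_pool)))
    for _ in range(rounds):
        if not pool:
            break
        clf = LinearSVC().fit(X_train, y_train)
        # informative: smallest |decision value| = nearest the hyperplane
        margins = np.abs(clf.decision_function(X_pool[pool]))
        chosen, batch_vecs = [], []
        for idx in np.argsort(margins):
            x = X_pool[pool[idx]]
            # diverse: skip near-duplicates of this batch's picks
            if all(np.linalg.norm(x - v) > 0.1 for v in batch_vecs):
                chosen.append(pool[idx])
                batch_vecs.append(x)
            if len(chosen) == batch:
                break
        X_train = np.vstack([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_pool[chosen]])
        pool = [i for i in pool if i not in chosen]
    return X_train, y_train
```

Each round retrains the classifier on the enlarged set, so later selections reflect the updated hyperplane.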




Spectral Clustering

  • Advantages over other clustering techniques

    • Globally optimized results

    • Efficient in time and space

    • Generally produces better results

  • Successful in many areas

    • Image segmentation

    • Gene expression clustering


Spectral Clustering

  • Eigendecomposition of the graph Laplacian: A = QΛQ⁻¹

  • Dimensionality reduction

  • (Luxburg, 2006)

[Figure: documents mentioning "George W. Bush" and "George H. W. Bush" separated into two clusters]
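A compact numpy sketch of a 2-way spectral partition (a generic illustration, not the submitted system, assuming the unnormalized Laplacian L = D - A and a sign split on the Fiedler vector):

```python
# Spectral bipartition: eigendecompose the graph Laplacian and split
# the nodes by the sign of the second-smallest eigenvector.
import numpy as np

def spectral_bipartition(A):
    """A: symmetric affinity matrix. Returns a 0/1 cluster label per node."""
    D = np.diag(A.sum(axis=1))
    L = D - A                            # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L) # L = Q Lambda Q^T (symmetric)
    fiedler = eigvecs[:, 1]              # 2nd-smallest eigenvalue's vector
    return (fiedler > 0).astype(int)

# Two tightly connected document groups, weakly linked across groups:
A = np.array([[0, 5, 5, 0, 0, 1],
              [5, 0, 5, 0, 1, 0],
              [5, 5, 0, 1, 0, 0],
              [0, 0, 1, 0, 5, 5],
              [0, 1, 0, 5, 0, 5],
              [1, 0, 0, 5, 5, 0]])
labels = spectral_bipartition(A)
print(labels)   # nodes 0-2 and nodes 3-5 end up in different clusters
```

The sign split is the simplest case; for k clusters one keeps the k smallest eigenvectors as a low-dimensional embedding and runs k-means on it.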


Hierarchical Agglomerative Clustering

  • Convert each doc into a feature vector of Wikipedia concepts, bag-of-words, and named entities.

  • Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010):

    • this model shows good performance in Web People Search

    • in our work, the original query name, its Wikipedia redirect names, and its coreference-chain mentions are all treated as appearances of the query name in the text.

  • Similarity scores: cosine similarity and overlap similarity.


Hierarchical Agglomerative Clustering

  • Docs referring to the same entity are clustered according to pairwise doc similarity scores.

    • Start with singletons: each doc is its own cluster

    • If two docs D and D' in clusters Ci and Cj satisfy Sim(D, D') > γ (γ = 0.25), merge Ci and Cj into a new cluster Cij

    • Recalculate the similarity between the new cluster Cij and all remaining clusters, and repeat
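The merge rule above amounts to single-link agglomeration with a fixed threshold; a minimal sketch (feature extraction is assumed, docs here are plain term-count vectors, and `hac` is an invented name):

```python
# Single-link HAC: merge clusters whenever any cross-cluster doc pair
# has cosine similarity above gamma = 0.25.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hac(vectors, gamma=0.25):
    clusters = [[i] for i in range(len(vectors))]   # start with singletons
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: any pair above the threshold merges Ci, Cj
                if any(cosine(vectors[i], vectors[j]) > gamma
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a] += clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

docs = np.array([[3, 1, 0, 0],    # two similar contexts
                 [2, 2, 0, 0],
                 [0, 0, 4, 1]])   # an unrelated context
print(hac(docs))                  # → [[0, 1], [2]]
```

With γ = 0.25 the first two vectors (cosine ≈ 0.89) merge, while the orthogonal third stays a singleton.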


Latent Dirichlet Allocation (LDA)

  • LDA has been applied to many NLP tasks, such as summarization and text classification

  • In our approach, the learned topics represent the underlying entities behind the ambiguous names

  • Generative story: [diagram not recovered]
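A rough sketch of the idea of topics-as-entities (a generic scikit-learn illustration, not the authors' model; `lda_nil_clusters` is an invented name): learn topics over the NIL query contexts and treat each document's most probable topic as its entity cluster.

```python
# Cluster NIL query contexts by their most probable LDA topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_nil_clusters(docs, n_entities):
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_entities, random_state=0)
    doc_topic = lda.fit_transform(counts)   # P(topic | document)
    return doc_topic.argmax(axis=1)         # topic id used as cluster id
```

Choosing `n_entities` (the number of underlying entities) is itself part of the problem; a fixed value is assumed here purely for illustration.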


Three Clustering Systems Combination

  • A three-class SVM classifier decides which clustering system to trust

  • Features: scores given by the three systems

Combination with the MSRA Team's System at the KB Linking Step

  • A binary SVM classifier decides which system to trust

  • Features: scores given by the two systems
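The three-way combination step can be sketched as follows (the training scores and labels are invented for illustration; the real system's features and training data differ):

```python
# System combination: an SVM over the three systems' confidence scores
# predicts which system's clustering output to trust for a query.
from sklearn.svm import SVC

# feature vector = [SGP score, HAC score, LDA score];
# label = index of the system that was correct on that training query
X_train = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.2],   # SGP best -> label 0
           [0.1, 0.9, 0.3], [0.2, 0.8, 0.1],   # HAC best -> label 1
           [0.2, 0.1, 0.9], [0.3, 0.2, 0.8]]   # LDA best -> label 2
y_train = [0, 0, 1, 1, 2, 2]

chooser = SVC(kernel="linear").fit(X_train, y_train)
print(chooser.predict([[0.1, 0.85, 0.2]]))     # [1]: trust HAC here
```

The binary combination with the MSRA system at the KB-linking step follows the same pattern with two score features and two classes.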


Experiment for Three Clustering Algorithms

[Results table not recovered]


Submissions

[Submissions table not recovered]


Conclusion

  • Incorporated the new techniques proposed in our recent papers (IJCAI 2011, IJCNLP 2011)

    • Acronym Expansion

    • Semantic Features

    • Instance Selection

  • Investigated three algorithms for NIL query clustering

    • Spectral Graph Partitioning (SGP)

    • Hierarchical Agglomerative Clustering (HAC)

    • Latent Dirichlet Allocation (LDA)

