
Person Name Disambiguation by Bootstrapping




Presentation Transcript


  1. Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang

  2. Outline • Introduction • Motivation • Two-stage Clustering Algorithm • Experiments

  3. Person Name Disambiguation • Given a target name (query q), a search engine returns a set of web pages P = {d1, d2, …, dn} • Task: cluster the pages in P such that each cluster refers to a single person.

  4. Example: Person Name Disambiguation

  5. Person Name Disambiguation • A typical solution: • Extract a set of features from each document returned by the search engine • Cluster the documents based on a similarity metric over the feature sets • Two types of features • Strong features, such as named entities (NEs), compound key words (CKWs), and URLs • NE: Paul Allen, Microsoft (indicate the person Bill Gates) • CKW: chief software architect (a concept strongly related to Bill Gates) • Strong features have a very strong ability to distinguish between clusters • Weak features: single words

  6. Person Name Disambiguation • Evaluation Metric: F measure • Treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query • For class i and cluster j, with nij the number of documents of class i in cluster j, ni the size of class i, and nj the size of cluster j: • Recall(i, j) = nij / ni, Precision(i, j) = nij / nj • F(i, j) = (2 * Recall(i, j) * Precision(i, j)) / (Precision(i, j) + Recall(i, j))
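The per-(class, cluster) F computation above can be sketched directly; nij, ni, and nj follow the slide's definitions.

```python
def f_measure(n_ij, n_i, n_j):
    """Per-(class, cluster) F score: n_ij = documents of class i found
    in cluster j, n_i = size of class i, n_j = size of cluster j."""
    recall = n_ij / n_i
    precision = n_ij / n_j
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (precision + recall)

# Example: a cluster of 5 pages contains 4 of the 8 pages about person i.
print(f_measure(4, 8, 5))  # recall 0.5, precision 0.8 -> F ~ 0.615
```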

  7. Motivation • Problem of current systems: Using only strong features achieves high precision but low recall. • Proposed solution: two-stage clustering algorithm by bootstrapping to improve the recall value. • 1st stage: strong features • 2nd stage: weak features

  8. Two-stage Clustering Algorithm • Input: one query string • Output: a set of clusters 1. Preprocessing documents returned by the search engine 2. First-stage clustering 3. Second-stage clustering

  9. Preprocessing a Document • Convert HTML files to text files • Remove HTML tags • Keep sentences • Extract the text around the query string, using a fixed window size • Extract strong features (NEs, CKWs, URLs)

  10. Extract Strong Features • Use Stanford NER to identify NEs: • sets of names of persons, organizations, and places • Compound Key Word (CKW) features: a set of CKWs • Extract compound words (CW): w1w2…wl • Score each CW: • Determine CKWs using a threshold on the scores • Extract URLs from the original HTML files • excluding URLs that occur with high frequency
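The CKW selection step might be sketched as follows; the transcript omits the actual scoring formula, so the frequency-based score and the threshold value below are purely illustrative assumptions.

```python
from collections import Counter

def extract_ckws(compound_words, threshold):
    """Hypothetical CKW selection: score each compound word w1..wl by
    its frequency across documents (the real scoring formula is not
    shown in the transcript) and keep those at or above a threshold."""
    scores = Counter(compound_words)
    return {cw for cw, score in scores.items() if score >= threshold}

docs_cws = ["chief software architect", "chief software architect",
            "open source", "chief software architect"]
print(extract_ckws(docs_cws, 2))  # {'chief software architect'}
```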

  11. Two-stage Clustering Algorithm • Input: one or more query strings • Output: a set of clusters 1. Preprocessing documents returned by the search engine 2. First-stage clustering 3. Second-stage clustering

  12. First stage clustering • Calculate the similarities between documents based on these features • Use standard hierarchical agglomerative clustering (HAC) algorithm for clustering

  13. Document Similarities • Similarity for NE features and CKW features • the formulation avoids overly small denominator values in the equation

  14. Document Similarities • Similarity for URLs

  15. Document Similarities • Similarity for NE: • Similarities for NE, CKW, and URL
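The similarity equations themselves are not reproduced in this transcript. The sketch below is only a guess at the general shape: a set-overlap measure whose denominator is smoothed so that tiny feature sets cannot dominate, per the note on slide 13. The smoothing constant and the min-size denominator are assumptions.

```python
def set_similarity(a, b, smooth=5):
    """Hypothetical overlap similarity between two feature sets, with a
    smoothed denominator so very small sets do not yield inflated
    scores (the actual equation is not shown in the transcript)."""
    if not a and not b:
        return 0.0
    return len(a & b) / max(min(len(a), len(b)), smooth)

ne_1 = {"Paul Allen", "Microsoft"}
ne_2 = {"Microsoft", "Seattle"}
print(set_similarity(ne_1, ne_2))  # 1 shared NE / max(2, 5) = 0.2
```

The per-feature similarities (NE, CKW, URL) would then presumably be combined, e.g. by a weighted sum, but the combination formula is likewise not shown here.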

  16. First stage clustering • Calculate the similarities between documents based on these features • Use standard hierarchical agglomerative clustering (HAC) algorithm for clustering

  17. HAC Algorithm • Starts from one-in-one clustering, i.e., each document is its own cluster • Iteratively merges the most similar cluster pair whose similarity is above a threshold • Cluster similarity:
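A minimal sketch of the threshold-stopped HAC loop described above. Single-link cluster similarity is an assumption here, since the slide's cluster-similarity formula is not reproduced in the transcript.

```python
def hac(docs, sim, threshold):
    """Threshold-stopped hierarchical agglomerative clustering sketch.
    Starts one-in-one and repeatedly merges the most similar pair of
    clusters whose similarity (single-link, assumed) is above
    threshold; stops when no such pair remains."""
    clusters = [[d] for d in docs]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i].extend(clusters.pop(j))
```

For example, with a toy similarity that scores 1.0 for documents about the same person and 0.0 otherwise, `hac([0, 1, 2, 3], lambda a, b: 1.0 if a // 2 == b // 2 else 0.0, 0.5)` merges {0, 1} and {2, 3} and then stops.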

  18. Two-stage Clustering Algorithm • Input: one or more query strings • Output: a set of clusters 1. Preprocessing documents returned by the search engine 2. First-stage clustering 3. Second-stage clustering

  19. Second-Stage Clustering • Goal: cluster the documents that are still in one-in-one clusters after the first-stage clustering • Idea of the bootstrapping algorithm: • Given some seed instances, find patterns useful for extracting such seed instances • Use these patterns to harvest new instances; from the harvested instances, new patterns are induced • Instances correspond to documents • Patterns correspond to weak features: 1-grams and 2-grams in the experiments
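The bootstrapping loop described above can be sketched with singleton documents as the instances and word n-grams as the patterns; the overlap criterion and the `min_overlap` parameter are illustrative assumptions, not the paper's actual induction rule.

```python
def bootstrap(seed_clusters, singletons, doc_ngrams, min_overlap=2):
    """Hypothetical bootstrapping sketch: n-gram 'patterns' are
    harvested from each seed cluster, singleton documents sharing
    enough patterns are absorbed, and the cluster's pattern set grows
    with each absorbed document until nothing changes."""
    clusters = [list(c) for c in seed_clusters]
    remaining = list(singletons)
    changed = True
    while changed:
        changed = False
        for cluster in clusters:
            patterns = set().union(*(doc_ngrams[d] for d in cluster))
            for d in list(remaining):
                if len(patterns & doc_ngrams[d]) >= min_overlap:
                    cluster.append(d)    # harvest a new instance...
                    remaining.remove(d)  # ...which contributes new patterns
                    changed = True
    return clusters + [[d] for d in remaining]
```

Documents left over at the end stay as singleton clusters, mirroring the goal stated on the slide.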

  20. Second Stage Clustering

  21. Experiments Setup • Dataset: WePS-2 • 30 names, each with 150 pages • The same page can refer to two or more entities • Evaluation Metrics [5] • Multiplicity precision and recall between documents e and e', where C(e) is the predicted cluster(s) of e and L(e) is the cluster(s) assigned to e by the gold standard

  22. Example of Evaluation Metrics • Case 1: L(1)={A,B}, L(2)={A,B}; C(1)={ct1, ct2}, C(2)={ct1, ct2} • Case 2: L(1)={A,B}, L(2)={A,B}; C(1)={ct1}, C(2)={ct1, ct2} • Case 3: L(1)={A,B}, L(2)={A,B}; C(1)={ct1, ct2, ct3}, C(2)={ct1, ct2, ct3}
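Under the extended B-cubed definitions of [5], multiplicity precision for a document pair compares how many predicted clusters versus gold classes the pair shares (multiplicity recall swaps the denominator). A sketch of that definition, applied to the second case above:

```python
def mult_precision(C, L, e1, e2):
    """Multiplicity precision between two documents that share at least
    one predicted cluster, following the definition in Amigo et al. [5]:
    min(shared predicted clusters, shared gold classes) over shared
    predicted clusters."""
    shared_pred = len(C[e1] & C[e2])
    shared_gold = len(L[e1] & L[e2])
    return min(shared_pred, shared_gold) / shared_pred

# Second case from the example slide: both pages have gold classes {A, B};
# page 1 is predicted into {ct1} only, page 2 into {ct1, ct2}.
C = {1: {"ct1"}, 2: {"ct1", "ct2"}}
L = {1: {"A", "B"}, 2: {"A", "B"}}
print(mult_precision(C, L, 1, 2))  # min(1, 2) / 1 = 1.0
```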

  23. Experiments Setup • Evaluation Metrics • Extended B-Cubed precision (BEP) and recall (BER)

  24. Experiments Setup • Baselines: • First-stage clustering: all-in-one, one-in-one, combined baseline (each document belongs to one cluster from all-in-one and one from one-in-one) • Second-stage clustering: TOPIC algorithm, CKW algorithm

  25. Experiments Results

  26. References
  [1] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of COLING-ACL 1998, pages 79–85, 1998.
  [2] C. Niu, W. Li, and R. K. Srihari. Weakly supervised learning for cross-document person name disambiguation supported by information extraction. In Proceedings of ACL 2004, pages 598–605, 2004.
  [3] X. Liu, Y. Gong, W. Xu, and S. Zhu. Document clustering with cluster refinement and model selection capabilities. In Proceedings of SIGIR 2002, pages 191–198, 2002.
  [4] X. Wan, M. L. J. Gao, and B. Ding. Person resolution in person search results: WebHawk. In Proceedings of CIKM 2005, pages 163–170, 2005.
  [5] E. Amigo, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4), 2009.
  [6] M. Yoshida, M. Ikeda, S. Ono, I. Sato, and H. Nakagawa. Person name disambiguation by bootstrapping. In Proceedings of SIGIR 2010.
