

  1. Efficient Topic-based Unsupervised Name Disambiguation. Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles. JCDL 2007. Advisor: Hsin-Hsi Chen. Reporter: Y.H Chang, 2008-03-21

  2. Outline • Introduction • Related Work • Method • Topic-based PLSA (Probabilistic Latent Semantic Analysis) • Topic-based LDA (Latent Dirichlet Allocation) • Clustering • Experiment • Conclusion

  3. Introduction • Name ambiguity • Shared names, misspellings, name abbreviations • Searching Google for “Yang Song”: • the 1st page shows five different people’s home pages • In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents.

  4. Introduction • Method: • Learn a topic-name matrix with PLSA and LDA (feature set) • Disambiguate topics with an agglomerative clustering method • Within similar topics: generate a name-name matrix • Disambiguate people with another agglomerative clustering method

  5. Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion

  6. Related Work • [19] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. 2003 (transitivity problem) • [9] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. 2005 (complexity O(N²)) • [12] J. Huang, S. Ertekin, and C. L. Giles. Efficient name disambiguation for large-scale databases. 2006 • [2] I. Bhattacharya and L. Getoor. A latent dirichlet model for unsupervised entity resolution. 2006 • The aforementioned work mainly tackled the name disambiguation problem using the metadata records of the authors. This paper solves the name disambiguation problem in a novel way, by accounting for the topic distributions of the authors and adopting unsupervised methods.

  7. Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion

  8. PLSA • From a statistical point of view, Hofmann (1999) presented an alternative to LSA, Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), which discovers sets of latent variables. • The model is described as an aspect model, assuming the existence of hidden factors underlying the co-occurrences between two sets of objects.

  9. PLSA • [Aspect-model diagram: people’s names a, documents d, and words w linked through a latent topic z; number of topics = K] • The goal of model fitting for PLSA is to estimate the parameters P(z), P(a|z), P(z|d), and P(w|z), given a set of observations (d, a, w). The standard way to estimate the probability values is the Expectation-Maximization (EM) algorithm.
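The EM fitting described on this slide can be sketched as follows. This is a minimal words-only aspect model fit on a document-word count matrix; the paper's full model also emits a name a per topic, which would add a second emission matrix P(a|z) updated the same way as P(w|z) (the variable names and the simplification are assumptions made here for illustration).

```python
import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    """Minimal PLSA EM on a (D x W) document-word count matrix.

    Estimates P(z|d) and P(w|z); the name component P(a|z) of the
    paper's model is omitted here for brevity.
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z), shape (D, W, K)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * post          # n(d,w) * P(z|d,w)
        # M-step: re-estimate both conditionals from the expected counts
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

On clearly separable data (two disjoint word blocks), the fitted P(z|d) assigns documents from the two blocks to different dominant topics.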

  10. PLSA

  11. PLSA: Predicting New Name Appearances • Additionally, there is no natural way to assign probabilities to new documents. • Therefore, to predict the topics of new documents (with potentially new names) after training, the estimated P(w|z) parameters are used to estimate P(a|z) for new names a in a test document d_new through a “folding-in” process. • Specifically, the E-step is the same as equation (4); however, the M-step keeps the original P(w|z) fixed and only updates P(a|z) as well as P(z|d).
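The folding-in step above can be sketched as partial EM: the trained P(w|z) is frozen and only the document-topic mixture for the new document is re-estimated (a minimal sketch; the analogous P(a|z) update for new names is omitted, and the function name is an assumption).

```python
import numpy as np

def fold_in(counts_new, p_w_z, n_iter=30, seed=0):
    """PLSA folding-in: keep trained P(w|z) fixed, fit only P(z|d_new).

    counts_new: (D_new x W) word counts for unseen documents.
    p_w_z:      (K x W) topic-word matrix from training, held constant.
    """
    rng = np.random.default_rng(seed)
    D, _ = counts_new.shape
    K = p_w_z.shape[0]
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step is unchanged; M-step updates only P(z|d_new)
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(2, keepdims=True) + 1e-12
        p_z_d = (counts_new[:, :, None] * post).sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d
```

A new document whose words all belong to one trained topic ends up with nearly all of its mixture weight on that topic.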

  12. LDA • Blei et al. (2003) introduced a Bayesian hierarchical model, Latent Dirichlet Allocation (LDA), in which each document has its own topic distribution, drawn from a conjugate Dirichlet prior that remains the same for all documents in a collection.

  13. LDA • In our model, names (authors) and words are not directly related, i.e., each topic can generate a set of names and a set of words simultaneously with different probabilities, allowing more freedom to the model in parameter estimation. • Generative process: draw a multinomial distribution θd for each document d; for each token i, draw a topic zdi from the multinomial distribution θd, a word wdi from the multinomial distribution φzdi, and a name adi from the multinomial distribution λzdi (one multinomial φz per topic z).
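The generative process on this slide can be sketched as a toy sampler. Names and words are tied only through the shared topic draw; the sizes, default hyperparameters, and function name here are illustrative assumptions (the paper fixes α = 50/K, β = 0.01 and the name prior at 0.1).

```python
import numpy as np

def generate_corpus(D, N, K, V, A, alpha=0.5, beta=0.01, gamma=0.1, seed=0):
    """Sample a toy corpus from the slide's generative process.

    Per topic z: phi_z ~ Dirichlet(beta) over V words,
                 lambda_z ~ Dirichlet(gamma) over A names.
    Per document d: theta_d ~ Dirichlet(alpha) over K topics.
    Per token i: z ~ theta_d, then w ~ phi_z and a ~ lambda_z.
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)    # topic-word distributions
    lam = rng.dirichlet([gamma] * A, size=K)   # topic-name distributions
    docs = []
    for _ in range(D):
        theta = rng.dirichlet([alpha] * K)
        z = rng.choice(K, size=N, p=theta)
        w = np.array([rng.choice(V, p=phi[t]) for t in z])
        a = np.array([rng.choice(A, p=lam[t]) for t in z])
        docs.append((z, w, a))
    return docs, phi, lam
```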

  14. LDA • In the following section, we apply the Gibbs sampling framework to get around the intractability of parameter estimation.

  15. Gibbs Sampling for the LDA Model • Note that in our case, we do not estimate the parameters α, β and λ. For simplicity and performance, they are fixed at 50/K, 0.01 and 0.1, respectively.
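A collapsed Gibbs sampler for this word-plus-name LDA variant can be sketched as below, with each token's topic resampled from counts smoothed by the fixed hyperparameters. The conditional used here is the standard collapsed-LDA form with an extra name factor, which is an assumption about the paper's exact sampler:

```python
import numpy as np

def gibbs_lda(docs, K, V, A, alpha, beta=0.01, gamma=0.1, n_iter=100, seed=0):
    """Collapsed Gibbs sampling sketch for word+name LDA.

    docs: list of (words, names) integer arrays, one pair per document.
    Resamples each token's topic from
      p(z=k) ~ (n_dk + alpha) * (n_kw + beta)/(n_k + V*beta)
                               * (n_ka + gamma)/(n_k + A*gamma).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K)); nkw = np.zeros((K, V))
    nka = np.zeros((K, A)); nk = np.zeros(K)
    z_assign = []
    for d, (ws, ns) in enumerate(docs):      # random initial assignments
        z = rng.integers(K, size=len(ws))
        z_assign.append(z)
        for i, k in enumerate(z):
            ndk[d, k] += 1; nkw[k, ws[i]] += 1; nka[k, ns[i]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, (ws, ns) in enumerate(docs):
            z = z_assign[d]
            for i in range(len(ws)):
                k = z[i]                     # remove token from counts
                ndk[d, k] -= 1; nkw[k, ws[i]] -= 1; nka[k, ns[i]] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, ws[i]] + beta) / (nk + V * beta) \
                    * (nka[:, ns[i]] + gamma) / (nk + A * gamma)
                k = rng.choice(K, p=p / p.sum())
                z[i] = k                     # add token back under new topic
                ndk[d, k] += 1; nkw[k, ws[i]] += 1; nka[k, ns[i]] += 1; nk[k] += 1
    phi = nkw + beta; phi /= phi.sum(1, keepdims=True)   # topic-word estimate
    lam = nka + gamma; lam /= lam.sum(1, keepdims=True)  # topic-name estimate
    return z_assign, phi, lam
```

The smoothed count ratios recover point estimates of φ and λ after sampling, which is what feeds the topic-name matrix used downstream.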

  16. Clustering • Learn the topic-name matrix with PLSA or LDA (feature set). • Disambiguate topics with an agglomerative clustering method. • Within similar topics: generate a name-name matrix; the Levenshtein distance, defined as Le(x, y), is used as the measurement of the similarity between two names x and y. • Disambiguate people with another agglomerative clustering method.
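The name measurement above can be sketched as the standard Levenshtein dynamic program. Mapping the distance into a [0, 1] similarity by length normalization is an assumption here; the slide only names Le(x, y) as the measurement.

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance Le(x, y) via the classic two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cx != cy)))    # substitution
        prev = cur
    return prev[-1]

def name_similarity(x: str, y: str) -> float:
    """One plausible normalization of Le(x, y) into [0, 1] (an assumption)."""
    return 1.0 - levenshtein(x, y) / max(len(x), len(y), 1)
```

For example, abbreviated forms of the same author name ("Y. Song" vs "Yang Song") land at a small distance and hence a high similarity.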

  17. Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion

  18. Experiment • Web Appearances of Person Names • 12 person names => 187 different people • The names, including those of SRI employees and professors, are submitted as queries to the Google search engine, and the first 100 pages are retrieved for each query. Furthermore, to eliminate the bias towards longer documents, only the first 200 words of each example are used. • Author Appearances in Scientific Docs • We obtained the 9 most ambiguous author names from the entire data set, each of which has at least 20 name variations. In the worst case (C. Chen), 103 authors share the same name.

  19. Experiment • Evaluation: • pair-level pairwise F1 score F1P and cluster-level pairwise F1 score F1C. • F1P is the harmonic mean of pairwise precision pp and pairwise recall pr. • Likewise, F1C is the harmonic mean of cluster precision cp and cluster recall cr.
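The pair-level score F1P can be sketched directly from its definition: precision and recall over pairs of items that a clustering places together (the cluster-level F1C, which matches whole clusters, is omitted here; the function name is an assumption).

```python
from itertools import combinations

def pairwise_f1(pred, gold):
    """Pair-level pairwise F1 (F1P).

    pred and gold map each item id to a cluster label. pp and pr are
    precision/recall over the sets of same-cluster item pairs, and F1P
    is their harmonic mean.
    """
    def same_pairs(labels):
        return {(a, b) for a, b in combinations(sorted(labels), 2)
                if labels[a] == labels[b]}
    p, g = same_pairs(pred), same_pairs(gold)
    if not p or not g:
        return 0.0
    pp = len(p & g) / len(p)   # pairwise precision
    pr = len(p & g) / len(g)   # pairwise recall
    return 2 * pp * pr / (pp + pr) if pp + pr else 0.0
```

For instance, merging two gold clusters into one predicted cluster inflates the pair set, so precision drops while recall stays perfect.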

  20. Author-topic relationships in the CiteSeer data set extracted by the topic-based PLSA model.

  21. Experiment

  22. Experiment

  23. Experiment • As a result, we empirically tested our models on the entire CiteSeer data set of more than 750,000 documents. • PLSA yields 418,500 unique authors in 2,570 minutes, while LDA finishes in 4,390 minutes with 418,775 authors (roughly one to three days).

  24. Outline • Introduction • Related Work • Method • Topic-based PLSA • Topic-based LDA • Clustering • Experiment • Conclusion

  25. Conclusion • We have proposed a novel framework for unsupervised name disambiguation by leveraging graphical Bayesian models and a hierarchical clustering method. • Although our primary focus in this paper is on person name disambiguation, our general approach should be equally applicable to other entity disambiguation domains. • Potential applications include noun phrase disambiguation, e.g., “tiger” as an animal, “tiger” as a golf player, “tiger” the baseball team, “tiger” the operating system, or “tiger” for the new Java version.
