
Disambiguation Problems in Digital Libraries

Tan Yee Fan. 2006 August 11. WING Group Meeting.



Presentation Transcript


  1. Disambiguation Problems in Digital Libraries • Tan Yee Fan • 2006 August 11 • WING Group Meeting

  2. Introduction • Bibliographic digital libraries • DBLP, Citeseer, ACM Portal, … • Metadata records • Authors, title, venue, year, … • Inconsistencies and errors • Typographical errors • Abbreviation • Different entities sharing same name • …

  3. Problem formulation • General disambiguation problem • Given a list of data items X • Find a function δ : X × X → {0, 1} such that • δ(x1, x2) = 1 if x1 and x2 match • δ(x1, x2) = 0 otherwise • Matching relation is not necessarily transitive • δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0 • If transitive, it is clustering/classification
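The non-transitive example on this slide can be reproduced with a toy matcher. The sketch below is only an illustration: the rule "two strings match if they share at least one character" and the helper `is_transitive` are assumptions chosen to give the slide's values, not a matcher proposed in the talk.

```python
def delta(x1: str, x2: str) -> int:
    """Toy matcher (assumption): 1 if the strings share a character, else 0."""
    return 1 if set(x1) & set(x2) else 0


def is_transitive(items, match) -> bool:
    """Check whether `match` is transitive over every triple in `items`."""
    for a in items:
        for b in items:
            for c in items:
                if match(a, b) and match(b, c) and not match(a, c):
                    return False
    return True


items = ["ab", "bc", "cd"]
print(delta("ab", "bc"), delta("bc", "cd"), delta("ab", "cd"))
print(is_transitive(items, delta))
```

Because this δ is not transitive, its matches cannot be read directly as a partition of X into clusters.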

  4. Related fields • String similarity • Edit distance, Jaro-Winkler, … • Abbreviation matching • Mostly deals with biomedical texts and in predefined formats • Data cleaning • High level architectures by database people • Social network analysis • Collaboration graphs of authors
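As a concrete instance of the string similarity measures named on this slide, here is a minimal sketch of edit distance in its common Levenshtein form (unit cost for insertion, deletion and substitution). The slide does not say which variant or cost model is meant, so that choice is an assumption.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between s and t, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("Citeseer", "CiteSeer"))
```

A threshold on this distance gives one possible δ for typographical errors, although it does not handle abbreviations.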

  5. Citation matching, author name disambiguation • Can be cast as classification/clustering • Usual information source • Coauthor information, titles and venues • i.e. within the records themselves (internal) • Models • Naïve Bayes, K-means, SVM, vector space model, graphical models, … • Some apply methods to reduce number of comparisons required

  6. Resources • Internal resources • May contain insufficient information • Information may be difficult to extract • External resources • Web resources, ontologies • Contains additional freely available information • Objective • Combine internal and external resources

  7. Mixed citation problem • Given an ambiguous name X (belonging to k different authors) • Given a list of citations C containing X • Which citations in C belong to which author? • Example citations: Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004. • Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.

  8. Search engine results • For each citation c in C • Query search engine with title of c to obtain relevant URLs • Represent c by a feature vector of relevant URLs • Each URL weighted by its inverse host frequency • Cosine similarity between feature vectors • Perform clustering on C to derive k clusters
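The feature construction on this slide might look roughly like the sketch below. Several things are assumptions: the slide does not define inverse host frequency, so the IDF-style formula `1 + log(N / hf)` is a guess; the search results are hard-coded hypothetical URLs rather than real search engine output; and the final clustering step is left out.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host_of(url: str) -> str:
    return urlparse(url).netloc.lower()


def inverse_host_frequency(results: dict) -> dict:
    """Assumed IDF-style weight: 1 + log(N / number of citations hitting the host)."""
    n = len(results)
    hf = Counter()
    for urls in results.values():
        hf.update({host_of(u) for u in urls})  # count each host once per citation
    return {host: 1.0 + math.log(n / count) for host, count in hf.items()}


def feature_vector(urls, ihf) -> dict:
    """Sparse vector: one dimension per URL, weighted by its host's weight."""
    return {u: ihf[host_of(u)] for u in urls}


def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2[u] for u, w in v1.items() if u in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Hypothetical results of querying with each citation's title.
results = {
    "c1": ["http://lab-a.example.org/pubs", "http://portal.example.com/p/1"],
    "c2": ["http://lab-a.example.org/pubs", "http://portal.example.com/p/2"],
    "c3": ["http://lab-b.example.net/people", "http://portal.example.com/p/3"],
}
ihf = inverse_host_frequency(results)
vectors = {c: feature_vector(urls, ihf) for c, urls in results.items()}
print(cosine(vectors["c1"], vectors["c2"]), cosine(vectors["c1"], vectors["c3"]))
```

In this toy data, c1 and c2 share a URL on a relatively rare host and so come out more similar than c1 and c3, which share no URL. The pairwise similarities could then be passed to any clustering algorithm that produces k clusters.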

  9. External coauthor network • Coauthor network from DBLP metadata • Delete the node representing X and its edges • Similarity between two author names computed as an inverse of their distance • Similarity between two citations is pairwise sum of their author similarities • (Figure labels: each node represents a name; two nodes are connected if they are coauthors in some DBLP citation)
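A small sketch of the coauthor-network similarity follows. The slide only says "an inverse of their distance", so the exact form `1 / (1 + d)` (which avoids division by zero when both names are the same) and the value 0 for unreachable names are assumptions, and the author names are placeholders rather than DBLP data.

```python
from collections import deque


def build_network(citations, ambiguous):
    """Coauthor graph over names, with the ambiguous name's node left out."""
    graph = {}
    for authors in citations:
        names = [a for a in authors if a != ambiguous]
        for a in names:
            graph.setdefault(a, set())
        for a in names:
            for b in names:
                if a != b:
                    graph[a].add(b)
    return graph


def distance(graph, src, dst):
    """Shortest path length by breadth-first search, or None if unreachable."""
    if src not in graph or dst not in graph:
        return None
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph[node]:
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None


def name_similarity(graph, a, b):
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)  # assumed inverse form


def citation_similarity(graph, c1, c2, ambiguous):
    """Pairwise sum of author similarities, skipping the ambiguous name."""
    return sum(name_similarity(graph, a, b)
               for a in c1 if a != ambiguous
               for b in c2 if b != ambiguous)


# Placeholder metadata: "X" is the ambiguous name.
metadata = [["X", "A", "B"], ["B", "C"], ["X", "D"], ["D", "E"]]
graph = build_network(metadata, "X")
print(citation_similarity(graph, ["X", "A"], ["X", "C"], "X"))
print(citation_similarity(graph, ["X", "A"], ["X", "E"], "X"))
```

Removing X before computing distances matters: otherwise every pair of X's coauthors would be at most two hops apart through X itself, whichever real author each citation belongs to.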

  10. Results

  11. Venue name disambiguation • To determine e.g. “TREC” = “Text Retrieval Conference” • Not using other parts of the citation records • Problems • Abbreviations are extremely common • Venues change name over time • Experiments using Google in progress • Using URL features • Using Google snippets
