Name Disambiguation in Digital Libraries Tan Yee Fan 2005 October 19 WING Group Meeting
Digital libraries • DBLP, Citeseer, etc. • Information is stored as metadata records to facilitate searching • Author names • Titles • Publication titles • Inconsistency in metadata records hinders searching • Abbreviation of names and publication titles • Typographical errors
Are they the same author? • Danny Poo • Danny C. C. Poo, Teck-Kang Toh, Christopher S. G. Khoo, Glenn Hong. Development of an Intelligent Web Interface to Online Library Catalog Databases. APSEC 1999: 64-7 • Danny Chiang Choon Poo, Isaac K. C. Tan. Design of an Automatic Annotation Framework for Corporate Web Content. APSEC 2004: 384-391 • Hui Yang • Maan A. Kousa, Ahmed K. Elhakeem, Hui Yang. Performance of ATM networks under hybrid ARQ/FEC error control scheme. IEEE/ACM Trans. Netw. 7(6): 917-925 (1999) • Hui Yang, Tat-Seng Chua. QUALIFIER: Question Answering by Lexical Fabric and External Resources. EACL 2003: 363-370
Who am I, I am who? • Author name disambiguation • Given a large number of citations, how do we determine which name refers to which author? • Closely related problem: citation matching • Given a large number of citations, how do we determine which citations refer to the same paper? • Solutions must be scalable • DBLP has more than 660,000 citations • Citeseer has more than 730,000 documents
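To make the scalability requirement concrete, the short calculation below counts how many comparisons a naive all-pairs approach would need. It is only a back-of-the-envelope sketch: it assumes every citation is compared with every other one, and `all_pairs` is a name introduced here for illustration.

```python
import math


def all_pairs(n):
    """Number of unordered pairs among n items, i.e. n * (n - 1) / 2."""
    return math.comb(n, 2)


# Lower bound on the DBLP size quoted on the slide.
dblp_citations = 660_000
print(all_pairs(dblp_citations))
```

At this size the count is on the order of 2 × 10^11 pairs, which is why a quadratic method is not practical here.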
Ideas • Idea 1: determine the research field • Unfortunately, paper titles contain few words and some conferences cover broad topics • Idea 2: use coauthor information • An author is likely to collaborate with a select group of people • This group is likely to publish a number of papers together • So compare the similarity of coauthor lists
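One simple way to score the similarity of two coauthor lists is set overlap. The sketch below uses the Jaccard coefficient; this is an assumed choice for illustration, not necessarily the measure used in this work, and the function name and sample names are hypothetical.

```python
def coauthor_similarity(coauthors_a, coauthors_b):
    """Jaccard similarity of two coauthor lists (an illustrative measure).

    Names are compared as lowercased strings, so abbreviated variants such
    as "M. Kan" and "Min-Yen Kan" count as different names here.
    """
    a = {name.strip().lower() for name in coauthors_a}
    b = {name.strip().lower() for name in coauthors_b}
    if not a and not b:
        return 0.0  # two solo-author papers give no coauthor evidence
    return len(a & b) / len(a | b)


# Hypothetical coauthor lists from two citations carrying the same name.
paper_1 = ["A. Tan", "B. Lim", "C. Wong"]
paper_2 = ["B. Lim", "C. Wong", "D. Lee"]
print(coauthor_similarity(paper_1, paper_2))
```

A high score suggests the two citations share a collaboration group; a score of zero is inconclusive, since the same author may work with different people on different papers.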
Forward direction: M. Kan = M.-Y. Kan = Min-Yen Kan • Problem • Pairwise comparison of all the coauthor lists is very expensive (it does not finish even after a few days) • Solution • Soft clustering of the coauthor lists using some cheap distance measure • Then perform pairwise comparison within the clusters • What is a good soft clustering algorithm?
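The two-stage idea above can be sketched as follows. The slide leaves the choice of soft clustering algorithm open, so this example swaps in one simple candidate: overlapping blocking on a cheap name key (surname plus first initial). The key, the function names, and the sample citations are all assumptions made for illustration, and names are assumed to be non-empty with the surname last.

```python
from collections import defaultdict
from itertools import combinations


def cheap_key(name):
    """Hypothetical cheap key: (surname, first initial), lowercased."""
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def soft_clusters(citations):
    """Group (citation index, author name) pairs by cheap key.

    A citation with several authors lands in several clusters, so the
    clusters overlap at the citation level.
    """
    clusters = defaultdict(list)
    for idx, authors in enumerate(citations):
        for name in authors:
            clusters[cheap_key(name)].append((idx, name))
    return clusters


def candidate_pairs(citations):
    """Yield only within-cluster pairs; these go to the expensive comparison."""
    for members in soft_clusters(citations).values():
        for (i, name_i), (j, name_j) in combinations(members, 2):
            if i != j:
                yield (i, name_i), (j, name_j)


# Hypothetical citations, each reduced to its author list.
citations = [
    ["M. Kan", "A. Author"],
    ["Min-Yen Kan", "B. Writer"],
    ["M.-Y. Kan"],
    ["C. Other"],
]
for pair in candidate_pairs(citations):
    print(pair)
```

Only the three Kan variants are paired with each other; the expensive coauthor-list comparison would then run on those pairs instead of on every pair of citations.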
Backward direction: This Hang Cui is not that Hang Cui • Difficult to determine from the metadata alone, without external resources • Many authors have several distinct research areas • Each research area has different collaborators • Currently investigating what kind of external resource to use • Goooooooooogle for URLs?
The end • But the research has just begun…