Name Disambiguation in Digital Libraries - PowerPoint PPT Presentation

Presentation Transcript

  1. Name Disambiguation in Digital Libraries
     Tan Yee Fan
     2005 October 19
     WING Group Meeting

  2. Digital libraries
     • DBLP, Citeseer, etc.
     • Information is stored as metadata records to facilitate searching
       • Author names
       • Titles
       • Publication titles
     • Inconsistency in metadata records hinders searching
       • Abbreviation of names and publication titles
       • Typographical errors

  3. Are they the same author?
     • Danny Poo
       • Danny C. C. Poo, Teck-Kang Toh, Christopher S. G. Khoo, Glenn Hong. Development of an Intelligent Web Interface to Online Library Catalog Databases. APSEC 1999: 64-7
       • Danny Chiang Choon Poo, Isaac K. C. Tan. Design of an Automatic Annotation Framework for Corporate Web Content. APSEC 2004: 384-391
     • Hui Yang
       • Maan A. Kousa, Ahmed K. Elhakeem, Hui Yang. Performance of ATM networks under hybrid ARQ/FEC error control scheme. IEEE/ACM Trans. Netw. 7(6): 917-925 (1999)
       • Hui Yang, Tat-Seng Chua. QUALIFIER: Question Answering by Lexical Fabric and External Resources. EACL 2003: 363-370
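
A first, purely string-level check for the name variants on this slide can be sketched as follows. This is a minimal illustration and not the speaker's method: the helper names `name_tokens` and `compatible` are hypothetical, and the rule assumes the surname is written last and that an initial matches any given name starting with that letter. Compatible names are only candidates for being the same author; the check cannot tell two different people with the same name apart.

```python
import re


def name_tokens(name: str) -> list[str]:
    """Split a name into lowercase tokens on whitespace, periods and hyphens."""
    return [tok for tok in re.split(r"[\s.\-]+", name.lower()) if tok]


def compatible(name_a: str, name_b: str) -> bool:
    """Return True if the two names could be variants of one another.

    Assumption (not from the slides): the surname is the last token, and
    given names are compared position by position, where a single-letter
    initial matches any given name that starts with it.
    """
    a, b = name_tokens(name_a), name_tokens(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False  # surnames must match exactly
    shorter, longer = sorted((a[:-1], b[:-1]), key=len)
    for s, l in zip(shorter, longer):
        same = s == l
        s_is_initial = len(s) == 1 and l.startswith(s)
        l_is_initial = len(l) == 1 and s.startswith(l)
        if not (same or s_is_initial or l_is_initial):
            return False
    return True


if __name__ == "__main__":
    print(compatible("Danny Poo", "Danny C. C. Poo"))
    print(compatible("Danny C. C. Poo", "Danny Chiang Choon Poo"))
    print(compatible("Danny Poo", "Hui Yang"))
```

A check like this only narrows the candidates: both "Hui Yang" records above pass it trivially, yet the metadata alone does not say whether they are one person.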

  4. Who am I, I am who?
     • Author name disambiguation
       • Given a large number of citations, how do we determine which name refers to which author?
     • Closely related problem: citation matching
       • Given a large number of citations, how do we determine which citations refer to the same paper?
     • Solutions must be scalable
       • DBLP has more than 660,000 citations
       • Citeseer has more than 730,000 documents
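
To see why scalability matters, it helps to count how many comparisons a naive all-pairs approach would need. The sketch below treats the two collection sizes quoted on the slide as the number of records to compare (an assumption; the slide gives them only as lower bounds), and the function name `pairwise_comparisons` is illustrative.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Collection sizes quoted on the slide, used here as lower bounds.
    for label, size in [("DBLP", 660_000), ("Citeseer", 730_000)]:
        print(f"{label}: {pairwise_comparisons(size):,} pairs")
```

Both counts come out above 2 × 10^11 pairs, so any comparison that is not extremely cheap has to be restricted to a small subset of candidate pairs.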

  5. Ideas
     • Idea 1: determine the research field
       • Unfortunately, paper titles contain few words, and some conferences are broad in scope
     • Idea 2: use coauthor information
       • An author is likely to collaborate with a select group of people
       • This group is likely to publish a number of papers together
       • So compare the similarity of coauthor lists
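
The slide does not say which similarity measure is used for coauthor lists. One common choice, assumed here purely for illustration, is the Jaccard coefficient: the number of shared coauthors divided by the number of distinct coauthors across both lists. The function name and the sample names below are hypothetical.

```python
from typing import Iterable


def coauthor_similarity(coauthors_a: Iterable[str], coauthors_b: Iterable[str]) -> float:
    """Jaccard similarity of two coauthor lists (an assumed measure).

    Names are compared as exact strings after trimming and lowercasing,
    so name variants of the same coauthor are NOT merged here.
    """
    a = {name.strip().lower() for name in coauthors_a}
    b = {name.strip().lower() for name in coauthors_b}
    if not a and not b:
        return 0.0  # two single-author papers share no coauthor evidence
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical coauthor lists for three papers by the same ambiguous name.
    paper_1 = ["A. Lee", "B. Tan", "C. Lim"]
    paper_2 = ["B. Tan", "C. Lim", "D. Ng"]
    paper_3 = ["E. Wong"]
    print(coauthor_similarity(paper_1, paper_2))  # two shared out of four distinct
    print(coauthor_similarity(paper_1, paper_3))  # no shared coauthors
```

A high score suggests the two papers come from the same collaboration group; a score of zero is weak evidence on its own, since one author may work with several unrelated groups.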

  6. Forward direction: M. Kan = M.-Y. Kan = Min-Yen Kan
     • Problem
       • Pairwise comparison of all the coauthor lists is very expensive (it does not finish even after a few days)
     • Solution
       • Soft clustering of the coauthor lists using a cheap distance measure
       • Then perform pairwise comparison within the clusters
     • What is a good soft clustering algorithm?
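
The slide leaves the choice of soft clustering algorithm open. One well-known candidate is canopy clustering, sketched below as an illustration only and not as the approach the speaker settled on. The cheap distance is assumed to be Jaccard distance on coauthor sets, and the record identifiers, thresholds and function names are all hypothetical. Records may fall into several canopies, which is what makes the clustering soft, and the expensive comparison is then run only on pairs that share a canopy.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Cheap distance between two coauthor sets (assumed measure)."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def canopies(records: dict[str, frozenset], loose: float, tight: float) -> list[set[str]]:
    """Canopy clustering: overlapping clusters built with a cheap distance.

    Every record within `loose` of a center joins that center's canopy.
    Records within `tight` of the center can no longer become centers.
    """
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    remaining = list(records)  # records still eligible to be centers
    result = []
    while remaining:
        center = remaining[0]
        dist = {rid: jaccard_distance(records[center], feats)
                for rid, feats in records.items()}
        result.append({rid for rid, d in dist.items() if d <= loose})
        remaining = [rid for rid in remaining
                     if rid != center and dist[rid] > tight]
    return result


def candidate_pairs(canopy_list: list[set[str]]) -> set[tuple[str, str]]:
    """Pairs of records that share at least one canopy."""
    pairs = set()
    for canopy in canopy_list:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical records: id -> set of coauthor names.
    records = {
        "p1": frozenset({"a", "b", "c"}),
        "p2": frozenset({"b", "c", "d"}),
        "p3": frozenset({"c", "d", "e"}),
        "p4": frozenset({"x", "y"}),
    }
    found = canopies(records, loose=0.6, tight=0.3)
    print(found)
    print(sorted(candidate_pairs(found)))
```

In this toy input only the pairs among p1, p2 and p3 survive, while p4 is never compared with anything. The sketch still measures the distance from each center to every record; a practical version would typically use an inverted index from coauthor name to records so that only records sharing a coauthor are touched.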

  7. Backward direction: This Hang Cui is not that Hang Cui
     • Difficult to determine from the metadata alone, without external resources
       • Many authors have several distinct research areas
       • Each research area has different collaborators
     • Currently investigating what kind of external resource to use
       • Goooooooooogle for URLs?

  8. The end
     • But the research has just begun…