
ArnetMiner – Extraction and Mining of Academic Social Networks

ArnetMiner – Extraction and Mining of Academic Social Networks. 1 Jie Tang, 1 Jing Zhang, 1 Limin Yao, 1 Juanzi Li, 2 Li Zhang, and 2 Zhong Su 1 Knowledge Engineering Group, Dept. of Computer Science and Technology Tsinghua University 2 IBM, China Research Lab August 25 th 2008.



Presentation Transcript


  1. ArnetMiner – Extraction and Mining of Academic Social Networks. Jie Tang (1), Jing Zhang (1), Limin Yao (1), Juanzi Li (1), Li Zhang (2), and Zhong Su (2). (1) Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua University; (2) IBM, China Research Lab. August 25th, 2008

  2. Motivation • “The information need is not only about publication…” • “Academic search is treated as document search, but ignores semantics” • However, we are not satisfied with …

  3. Examples – Expertise search • When starting work on a new research topic, or brainstorming for novel ideas (Researcher A): • Who are the experts in this field? • What are the top conferences in the field? • What are the best papers? • What are the top research labs?

  4. Examples – Citation network analysis • How to gain an in-depth understanding of the research field? (Researcher B)

  5. Examples – Conference Suggestion • To which conference should we submit the paper? (Researcher C) [Figure labels: authors, content]

  6. Examples – Reviewer Suggestion • Who are the best-matching reviewers for each paper? [Figure labels: KDD Committee, conference, Paper content]

  7. Topic Browser

  9. Academic Network Extraction in ArnetMiner = Researcher Profiling + Name Disambiguation

  10. Motivating Example [Figure: researcher profile diagram. Labels: Contact Information; Educational history; Academic services; Publications; affiliation; UIUC; position; Professor; coauthor; Ruud Bolle; Publication #3; Publication #5]

  11. Motivating Example • Two key issues: • How to accurately extract the researcher profile information from the Web? • How to integrate the information from different sources? [Figure: researcher profile diagram. Labels: Contact Information; Educational history; Academic services; Publications; affiliation; UIUC; position; Professor; coauthor; Ruud Bolle; Publication #3; Publication #5]

  12. Researcher Network Extraction • 70.60% of the researchers have at least one homepage or introduction page • 85.6% are from universities, 14.4% from companies • 71.9% are homepages, 28.1% are introduction pages • 40% are in lists and tables, 60% are natural language text • A large number of person names are ambiguous • The 300 most common male names are used by 1 billion+ people (78.74%) in USA • Even 3 people named “Yi Li” graduated from the author’s lab • 70% moved at least one time

  13. Our Approach Picture – based on Markov Random Fields • Markov Property: [equation not preserved in the transcript] • Special cases: - Conditional Random Fields (Researcher Profiling) - Hidden Markov Random Fields (Name Disambiguation)
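For orientation, the local Markov property of a Markov Random Field is usually stated as below. This is the textbook form, not necessarily the exact notation on the slide; the symbols $y_i$ (the variable at node $i$) and $N(i)$ (the neighbours of node $i$ in the graph) are assumptions introduced here.

```latex
P\bigl(y_i \mid \{y_j\}_{j \neq i}\bigr) \;=\; P\bigl(y_i \mid \{y_j\}_{j \in N(i)}\bigr)
```

In words: given its neighbours, a variable is conditionally independent of every other variable in the graph.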

  14. CRFs - Green nodes are hidden vars - Purple nodes are observations [Figure: linear-chain lattice over the tokens “He is a Professor at UIUC”, with the candidate labels ADR, AFF, POS, and OTH at each position]
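To make the lattice concrete, here is a minimal sketch of Viterbi decoding over a linear chain with the four labels from the slide. The scoring functions are hand-set toy values, not learned CRF weights, and the names `viterbi`, `emit_score`, `trans_score`, and `LEXICON` are hypothetical.

```python
LABELS = ["POS", "AFF", "ADR", "OTH"]

# Toy lexicon (assumption): tokens that strongly suggest one label.
LEXICON = {"Professor": "POS", "UIUC": "AFF"}


def emit_score(token, label):
    """Hand-set observation score; a trained CRF would use weighted features."""
    if token in LEXICON:
        return 2.0 if LEXICON[token] == label else 0.0
    return 1.0 if label == "OTH" else 0.0


def trans_score(prev_label, label):
    """Hand-set transition score: a small bonus for staying in the same label."""
    return 0.5 if prev_label == label else 0.0


def viterbi(tokens, labels=LABELS):
    """Return the highest-scoring label sequence for a linear chain."""
    # best[t][y]: best score of any label path ending in y at position t
    best = [{y: emit_score(tokens[0], y) for y in labels}]
    back = []
    for t in range(1, len(tokens)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_score(p, y))
            scores[y] = best[-1][prev] + trans_score(prev, y) + emit_score(tokens[t], y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


path = viterbi("He is a Professor at UIUC".split())
# path == ["OTH", "OTH", "OTH", "POS", "OTH", "AFF"]
```

Setting `trans_score` to always return 0 corresponds roughly to dropping the transition features, which is how the Unified_NT baseline is described later in the deck.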

  15. Token Definitions

  16. Feature Definition • Content features • Standard word token - Word features: whether the current token is a word - Morphological features: whether the word is capitalized • Image token - Image size: the size of the image - Image height/width ratio: the value of height/width; for a person photo it is often larger than 1 - Image format: JPG or BMP - Image color: the number of “unique colors” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1 - Face recognition: whether the current image contains a person's face - Image filename: whether the filename contains (partially) the researcher name - Image “ALT”: whether the “alt” of the image contains (partially) the researcher name - Image positive keywords: “myself”, “biology” - Image negative keywords: “ads”, “banner”, “logo”
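A few of the image-token features above can be sketched as a plain feature dictionary. This is only an illustration under assumptions: the function name `image_features` and its parameters are hypothetical, "size" is taken to mean pixel area, matching is naive substring matching, and the color and face-detection features are left out because they need an image library.

```python
POSITIVE_KEYWORDS = ("myself", "biology")
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")


def image_features(width, height, filename, alt_text, researcher_name):
    """Build a feature dict for one image token (hypothetical helper)."""
    name_parts = [part.lower() for part in researcher_name.split()]
    fname = filename.lower()
    alt = alt_text.lower()
    return {
        # Assumption: "image size" means pixel area.
        "size": width * height,
        # A ratio above 1 means a portrait-shaped image, typical of person photos.
        "height_width_ratio": height / width if width else 0.0,
        "format": fname.rsplit(".", 1)[-1] if "." in fname else "",
        "filename_has_name": any(part in fname for part in name_parts),
        "alt_has_name": any(part in alt for part in name_parts),
        "positive_keyword": any(k in fname or k in alt for k in POSITIVE_KEYWORDS),
        "negative_keyword": any(k in fname or k in alt for k in NEGATIVE_KEYWORDS),
    }


features = image_features(120, 160, "jie_tang.jpg", "Photo of Jie Tang", "Jie Tang")
```

Naive substring matching will also fire on unrelated words (for example "ads" inside "downloads"), so a real extractor would likely tokenize the filename and alt text first.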

  17. Profiling Experiments • Dataset • 1,000 researchers from ArnetMiner.org • Baselines • Amilcare • Support Vector Machines • Unified_NT (CRFs without transition features) • Evaluation measures • Precision, Recall, F1
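The three evaluation measures can be computed as follows. This is a generic sketch over sets of extracted (field, value) pairs; the function name and the toy data are illustrative and do not come from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of 3 extracted fields are correct, 2 of 4 gold fields are found.
predicted = {("position", "Professor"), ("affiliation", "UIUC"), ("email", "x")}
gold = {("position", "Professor"), ("affiliation", "UIUC"),
        ("phone", "123"), ("address", "abc")}
p, r, f1 = precision_recall_f1(predicted, gold)
```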

  18. Profiling Results—5-fold cross validation 83.37

  19. Name Disambiguation Proposal of a semi-supervised framework

  20. Our Method for Name Disambiguation • A hidden Markov Random Field model • Observable variables X represent publications • Hidden variables Y represent the labels of publications • Paper relationships define the dependencies over hidden variables

  21. Objective Function [Equations not preserved in the transcript: one expression labelled “maximize” and one labelled “minimize”]

  22. Relationship Definition

  23. Parameterized Distance Function • We define the distance function as follows (Basu, 04): [equation not preserved in the transcript] • We can see that the matrix A actually maps each vector x_i into a new space, i.e. A^(1/2) x_i • To simplify the problem, we define A as a diagonal matrix
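Since the equation itself did not survive, the sketch below assumes the parameterized cosine form commonly attributed to Basu et al., D(x, y) = 1 - (x^T A y) / (||x||_A ||y||_A) with ||x||_A = sqrt(x^T A x). That form is an assumption, not a quotation of the slide. With a diagonal A, stored here as the hypothetical list `a` of diagonal entries, it is the ordinary cosine distance after scaling coordinate i by sqrt(a[i]), which matches the A^(1/2) x_i remark.

```python
import math


def parameterized_cosine_distance(x, y, a):
    """Cosine distance under a diagonal weight matrix A = diag(a) (assumed form)."""
    dot = sum(ai * xi * yi for ai, xi, yi in zip(a, x, y))
    norm_x = math.sqrt(sum(ai * xi * xi for ai, xi in zip(a, x)))
    norm_y = math.sqrt(sum(ai * yi * yi for ai, yi in zip(a, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_x * norm_y)


# With a zero weight on the second dimension, only the first one matters.
d_plain = parameterized_cosine_distance([1, 1], [1, 0], [1, 1])
d_weighted = parameterized_cosine_distance([1, 1], [1, 0], [1, 0])
```

Learning the entries of `a` amounts to learning how much each feature dimension should count when comparing two publications.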

  24. EM Framework • Initialization • use constraints to generate initial k clusters • E-Step • M-Step • Update cluster centroid • Update parameter matrix A
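The loop structure can be sketched as a hard-assignment EM (k-means style). This is a deliberately simplified illustration, not the method from the talk: it uses a plain cosine distance, ignores the relationship constraints between papers, and omits the update of the parameter matrix A. All names are hypothetical.

```python
import math


def cosine_distance(x, y):
    """Plain cosine distance, standing in for the parameterized distance."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 if norm_x == 0 or norm_y == 0 else 1.0 - dot / (norm_x * norm_y)


def em_cluster(points, initial_centroids, max_iter=20):
    """Alternate assignment (E-step) and centroid update (M-step) until stable."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # E-step (simplified): assign each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: cosine_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # M-step (simplified): recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = em_cluster(points, [[1.0, 0.0], [0.0, 1.0]])
```

In this sketch the initial centroids are passed in directly, whereas the slide generates the initial k clusters from constraints.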

  25. Disambiguation Experiments • Data set:

  26. Our Approach vs. Baseline

  27. Contribution of Relationships

  28. Distribution Analysis (1) All methods can achieve good performance (2) Our method can achieve good performance (3) Our method obtains reasonable results, but still needs further improvement

  29. Modeling the Academic Network and Applications

  30. The Academic Network • Academic Network • Heterogeneous objects: • Paper, Person, Conf./Journal • Relationships: • Conf./Journal publishes paper • Paper cites paper • Person writes paper • Person is PC member of Conf./Journal • Person is coauthor of person • Challenges: • How to model the heterogeneous objects in a unified approach? • How to apply the modeling approach to different applications?

  31. Modeling the Academic Network [Figure: three model variants, ACT1, ACT2, and ACT3, relating words, authors, topics, and conferences]

  32. Generative Story of ACT1 Model • Generative process • Example paper: “Latent Dirichlet Co-clustering”, Shafiei and Milios. “We present a generative model for clustering documents and terms. Our model is a four hierarchical Bayesian model. We present efficient inference techniques based on Markov Chain Monte Carlo. We report results in document modeling, document and terms clustering …” • [Figure: each author (Shafiei, Milios) is linked to topics (NLP, IR, ML, DM); each topic has a word distribution P(w|z), e.g. mining 0.23, clustering 0.19, classification 0.17, … or model 0.23, learning 0.19, boost 0.17, …, and a conference distribution P(c|z), e.g. ICDM 0.23, KDD 0.19, … or ICML 0.23, NIPS 0.19, …; sampled items shown include the words “clustering” and “inference” and the conferences NIPS and ICDM]

  33. ACT Model 1 Generative process: words authors Topic conference ACT1
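The generative process is not preserved in the transcript, so the following is only a sketch under one assumption about ACT1: for every word position, pick one of the paper's authors, draw a topic from that author's topic distribution, then draw both the word and a conference stamp from that topic. The probabilities for words and conferences reuse the example values from the previous slide; which topic they belong to, the author-topic weights in `theta`, and all function names are hypothetical.

```python
import random


def sample(dist, rng):
    """Draw one key from a {item: weight} dict (weights need not sum to 1)."""
    items, weights = zip(*dist.items())
    return rng.choices(items, weights=weights, k=1)[0]


def generate_paper(authors, theta, phi, psi, n_words, seed=0):
    """Sample words and conference stamps for one paper (assumed ACT1 process)."""
    rng = random.Random(seed)
    words, conferences = [], []
    for _ in range(n_words):
        author = rng.choice(authors)              # pick a coauthor uniformly
        topic = sample(theta[author], rng)        # topic ~ P(z | author)
        words.append(sample(phi[topic], rng))     # word ~ P(w | z)
        conferences.append(sample(psi[topic], rng))  # conference ~ P(c | z)
    return words, conferences


theta = {  # hypothetical author-topic weights
    "Shafiei": {"DM": 0.7, "ML": 0.3},
    "Milios": {"DM": 0.4, "ML": 0.6},
}
phi = {  # truncated word distributions from the example slide
    "DM": {"mining": 0.23, "clustering": 0.19, "classification": 0.17},
    "ML": {"model": 0.23, "learning": 0.19, "boost": 0.17},
}
psi = {  # truncated conference distributions from the example slide
    "DM": {"ICDM": 0.23, "KDD": 0.19},
    "ML": {"ICML": 0.23, "NIPS": 0.19},
}
words, conferences = generate_paper(["Shafiei", "Milios"], theta, phi, psi, n_words=5)
```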

  34. ACT Model 2 Generative process: authors words conference ACT2

  35. ACT Model 3 Generative process: authors words conference ACT3

  36. Applications • Association search • Expertise search • Researcher interests • Hot topics at a conference • Topic browser

  37. Expertise Search • Calculate the relevance of query q and different objects (i.e., papers, authors, and conferences) • E.g.,
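The relevance formula after "E.g.," is missing from the transcript. As a stand-in, the sketch below scores a query against a document with a query-likelihood language model and Dirichlet smoothing, which is the kind of model the LM baseline refers to. It is not the topic-model-based relevance used in the talk, and the function name, the parameter `mu`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, mu=10.0):
    """log P(q | d) with Dirichlet-smoothed unigram probabilities."""
    doc_counts = Counter(doc)
    collection_counts = Counter(collection)
    score = 0.0
    for word in query:
        # Background probability of the word in the whole collection.
        p_background = collection_counts[word] / len(collection)
        p_word = (doc_counts[word] + mu * p_background) / (len(doc) + mu)
        if p_word == 0:
            return float("-inf")  # word never seen anywhere in the collection
        score += math.log(p_word)
    return score


doc_a = "data mining clustering".split()
doc_b = "protein folding simulation".split()
collection = doc_a + doc_b
query = ["data", "mining"]
score_a = query_log_likelihood(query, doc_a, collection)
score_b = query_log_likelihood(query, doc_b, collection)
```

The same scoring can be applied to authors or conferences by treating the concatenated text of their papers as the "document", which is one simple way to rank different object types for the same query.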

  38. Expertise Search Results • ArnetMiner data: 14,134 authors, 10,716 papers, 1,434 confs/journals • Evaluation measures: pooled relevance + human judgement • Baselines: - Language Model (LM) - LDA - Author Topic (AT)

  39. ArnetMiner Today

  40. ArnetMiner Today • ArnetMiner data: > 0.5M researcher profiles, > 2M papers, > 8M citation relationships, > 4K conferences • Visits come from more than 165 countries • Visits increase continuously by 20% per month • Currently, more than 1,500 unique-IP visits per day • Top 10 countries: 1. USA 2. China 3. Germany 4. India 5. UK 6. Canada 7. Japan 8. France 9. Taiwan 10. Italy

  41. Person Search Basic Info. Research Interests Social Network Publications

  42. Expertise Search • Finding experts, expertise conferences, and expertise papers for “data mining”

  43. Association Search Finding associations between persons - high efficiency - Top-K associations Usage: - to find a partner - to find a person with same interests

  44. Survey Paper Finding Survey papers

  45. Topic Browser 200 topics have been discovered automatically from the academic network

  46. Acknowledgements • National Science Foundation of China (NSFC) • National 985 Funding • Chinese Young Faculty Research Funding • Minnesota-China Collaboration Project • IBM CRL • Tsinghua-Google Joint Research Project • National Foundation Science Research (973)

  47. Thanks! Q&A & Demo HP: http://keg.cs.tsinghua.edu.cn/persons/tj/ Online URL: http://arnetminer.org If you want to know more technical details, please come to our poster session tomorrow night.
