
Social Network Extraction of Academic Researchers

Jie Tang, Duo Zhang, and Limin Yao. Tsinghua University, Oct. 29th, 2007.


Presentation Transcript


  1. Social Network Extraction of Academic Researchers Jie Tang, Duo Zhang, and Limin Yao Tsinghua University Oct. 29th 2007

  2. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  3. Motivation • More and more online social networks are becoming available • e.g., YouTube.com, Facebook.com, etc. • However, these social networks are usually separate from one another • A question arises: can we automatically build an integrated social network from the separate ones? • As a case study: how can we automatically build a social network for the academic community? • ArnetMiner.org

  4. Motivating Example [Figure: example researcher profile. Recoverable labels: Contact Information, Educational history, Academic services, Publications, affiliation, position, Professor, UIUC, Ruud Bolle, coauthor, Publication #3, Publication #5.]

  5. Motivating Example • Two key issues: • How to accurately extract the researcher profile information from the Web? • How to integrate the information from different sources? [Figure: the example researcher profile from slide 4, with the same labels.]

  6. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  7. Related Work – Person Profiling • Profile Information Extraction • E.g., Yu et al. (2005), resume IE • Alani et al. (2003), Artequakt system • Contact Information Extraction • E.g., Kristjansson et al. (2004), interactive extraction • Balog and de Rijke (2006), heuristic rules • Information Extraction Methods • E.g., HMM (Ghahramani, 1997), • MEMM (McCallum, 2000), • CRFs (Lafferty, 2001)

  8. Related Work – Name Disambiguation • Unsupervised Methods • Hierarchical clustering, K-way spectral clustering, etc. • E.g. Han (2005), Mann (2003), Tan (2006) • Supervised Methods • Support Vector Machines, Naïve Bayes, etc. • E.g. Han (2004) • Graph-based Approach • Random Walk, etc. • E.g. Bekkerman (2005), Malin (2005), Minkov (2006)

  9. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  10. Researcher Social Network Extraction • 70.60% of the researchers have at least one homepage or an introductory page • 85.6% from universities • 14.4% from companies • 71.9% are homepages • 28.1% are introductory pages • 40% are in lists and tables • 60% are natural language text • A large number of person names are ambiguous • Even three people named “Yi Li” graduated from the authors’ lab • 70% moved at least once

  11. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  12. Markov Random Field Special Cases: - Conditional Random Fields - Hidden Markov Random Fields Markov Property:
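
For reference, the local Markov property of a Markov random field is commonly written as below. The notation is an assumption and is not taken from the slide: $y_i$ is the variable at node $i$ and $N(i)$ is the set of its neighbors in the graph.

```latex
P\big(y_i \mid \{y_j\}_{j \neq i}\big) \;=\; P\big(y_i \mid \{y_j\}_{j \in N(i)}\big)
```

In words, each variable is conditionally independent of all non-neighbors given its neighbors.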

  13. CRFs - Green nodes are hidden variables - Purple nodes are observations [Figure: linear-chain CRF over the tokens “He is a Professor at UIUC”; each hidden node takes one of the candidate tags ADR, …, AFF, POS, OTH.]
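
A minimal sketch of how a linear-chain model picks one tag per token, using Viterbi decoding. The scores below are hand-set, hypothetical values rather than learned CRF weights, and only three of the slide's tags are used; the function and variable names are illustrative.

```python
# Viterbi decoding over a chain of tokens (illustrative scores, not a trained CRF).
TAGS = ["POS", "AFF", "OTH"]  # position, affiliation, other

# Hypothetical emission scores; tokens not listed fall back to DEFAULT_EMIT.
EMIT = {"Professor": {"POS": 2.0}, "UIUC": {"AFF": 2.0}}
DEFAULT_EMIT = {"OTH": 1.0}
# Hypothetical transition scores; unlisted tag pairs score 0.
TRANS = {("POS", "POS"): 0.5, ("AFF", "AFF"): 0.5}


def emit_score(token, tag):
    return EMIT.get(token, DEFAULT_EMIT).get(tag, 0.0)


def viterbi(tokens):
    """Return the highest-scoring tag sequence for the tokens."""
    best = [{t: emit_score(tokens[0], t) for t in TAGS}]
    back = []  # back[i][t] = best previous tag when token i+1 has tag t
    for tok in tokens[1:]:
        scores, ptr = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + TRANS.get((p, t), 0.0))
            scores[t] = best[-1][prev] + TRANS.get((prev, t), 0.0) + emit_score(tok, t)
            ptr[t] = prev
        best.append(scores)
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


print(viterbi("He is a Professor at UIUC".split()))
```

In a trained CRF the emission and transition scores would come from weighted feature functions; only the decoding step is shown here.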

  14. Processing Flow for Profiling [Figure: three-stage pipeline. (1) Preprocessing: determine tokens in the input documents (standard word, special word, image token, term, punctuation mark). (2) Tagging: assign tags; in training this yields labeled data, in testing the unified tagging model produces the tagging results. (3) Model learning: feature definitions and learning a CRF model from the labeled data.]

  15. Token Definitions • Standard word: words in natural language • Special word: several general ‘special words’, e.g. email address, IP address, URL, date, number, money, percentage, unnecessary tokens (e.g. ‘===’ and ‘###’), etc. • Image token: <IMAGE src="defaul3.jpg" alt=""/> • Term: base NP, like “Computer Science” • Punctuation marks: period, question mark, and exclamation mark
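
The token types above can be sketched as a small regex-based tokenizer. This is an illustration under stated assumptions, not the preprocessing used in the talk: the patterns and label names are hypothetical, only a few special-word kinds are covered, and base-NP terms are left out because they would need a chunker.

```python
import re

# Ordered (label, pattern) pairs; the first pattern that matches wins.
# All patterns are illustrative assumptions.
TOKEN_PATTERNS = [
    ("image", re.compile(r"<IMAGE\b[^>]*>", re.IGNORECASE)),
    ("special", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),  # email address
    ("special", re.compile(r"https?://\S+")),                  # URL
    ("special", re.compile(r"\d+(?:\.\d+)?%?")),               # number or percentage
    ("special", re.compile(r"[=#]{3,}")),                      # unnecessary tokens such as === or ###
    ("word", re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")),         # standard word
    ("punct", re.compile(r"[.?!]")),                           # period, question mark, exclamation mark
]


def tokenize(text):
    """Split text into (label, token) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for label, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((label, match.group()))
                pos = match.end()
                break
        else:
            pos += 1  # skip characters that no pattern covers
    return tokens


print(tokenize('Email: jie@example.edu <IMAGE src="defaul3.jpg" alt=""/> === Hello!'))
```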

  16. Possible Tag Assignment

  17. Feature Definition • Content features • Standard word • Word features: whether the current token is a word • Morphological features: whether the word is capitalized • Image token • Image size: the size of the image • Image height/width ratio: the value of height/width; for a person photo it is often larger than 1 • Image format: JPG or BMP • Image color: the number of “unique colors” used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1 • Face recognition: whether the current image contains a person's face • Image filename: whether the filename contains (partially) the researcher name • Image “ALT”: whether the “alt” of the image contains (partially) the researcher name • Image positive keywords: “myself”, “biology” • Image negative keywords: “ads”, “banner”, “logo”
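
A minimal sketch of how some of the image-token features above might be computed from an image tag's attributes. The function name, argument names, and example URL are hypothetical. Features that need the image bytes (image size in storage, color count, face detection) are omitted.

```python
POSITIVE_KEYWORDS = ("myself", "biology")      # keywords listed on the slide
NEGATIVE_KEYWORDS = ("ads", "banner", "logo")  # keywords listed on the slide


def image_features(src, alt, width, height, researcher_name):
    """Compute a few image-token features from tag attributes."""
    name_parts = [part.lower() for part in researcher_name.split()]
    filename = src.rsplit("/", 1)[-1].lower()
    text = filename + " " + alt.lower()
    return {
        # A person photo is often taller than wide, so the ratio tends to exceed 1.
        "height_width_ratio": height / width if width else None,
        "format": filename.rsplit(".", 1)[-1] if "." in filename else "",
        "filename_has_name": any(part in filename for part in name_parts),
        "alt_has_name": any(part in alt.lower() for part in name_parts),
        # Plain substring matching; it can over-match (e.g. "ads" inside "heads").
        "positive_keyword": any(k in text for k in POSITIVE_KEYWORDS),
        "negative_keyword": any(k in text for k in NEGATIVE_KEYWORDS),
    }


photo = image_features("http://example.org/img/tang_photo.jpg", "Jie Tang", 120, 160, "Jie Tang")
banner = image_features("banner_logo.bmp", "", 468, 60, "Jie Tang")
print(photo)
print(banner)
```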

  18. Feature Definition • Pattern features • Term features

  19. Our Method for Name Disambiguation • A hidden Markov Random Field model • Observable variables X represent publications • Hidden variables Y represent the labels of publications • Constraints define the dependencies over hidden variables

  20. Objective Function [Equations not recovered: one expression to maximize and one objective to minimize.]

  21. Constraint Definition

  22. Parameterized Distance Function • We define our distance function in terms of a parameter matrix A • A in effect maps each vector $x_i$ into a new space, i.e. $A^{1/2}x_i$ • To simplify the problem, we define A as a diagonal matrix
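
To illustrate the mapping remark, here is a sketch that assumes a parameterized cosine distance, D(x_i, x_j) = 1 - (x_i^T A x_j) / (||x_i||_A ||x_j||_A) with ||x||_A = sqrt(x^T A x). This form is a common choice in constrained clustering and is an assumption here, not a formula confirmed by the slide. With a diagonal A, the distance equals the ordinary cosine distance after each vector is mapped to A^{1/2}x.

```python
import math


def param_cosine_distance(x, y, a_diag):
    """Cosine distance under a diagonal parameter matrix A (given as its diagonal)."""
    dot = sum(a * xi * yi for a, xi, yi in zip(a_diag, x, y))
    norm_x = math.sqrt(sum(a * xi * xi for a, xi in zip(a_diag, x)))
    norm_y = math.sqrt(sum(a * yi * yi for a, yi in zip(a_diag, y)))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention for zero vectors
    return 1.0 - dot / (norm_x * norm_y)


def map_to_new_space(x, a_diag):
    """Apply A^(1/2) to x; for a diagonal A this scales each coordinate by sqrt(a_k)."""
    return [math.sqrt(a) * xi for a, xi in zip(a_diag, x)]


x, y = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
a_diag = [4.0, 1.0, 0.25]  # hypothetical feature weights
identity = [1.0, 1.0, 1.0]

d_param = param_cosine_distance(x, y, a_diag)
d_mapped = param_cosine_distance(map_to_new_space(x, a_diag), map_to_new_space(y, a_diag), identity)
print(d_param, d_mapped)  # the two values agree up to rounding
```

A diagonal A therefore acts as a set of per-feature weights, which keeps the number of parameters equal to the number of features.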

  23. EM Framework • Initialization • use constraints to generate initial k clusters • E-Step • M-Step • Update cluster centroid • Update parameter matrix A
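
The loop above can be sketched as a simplified hard-EM (k-means-style) procedure. Everything beyond the slide's outline is an assumption: the distance is squared Euclidean with A fixed to the identity (the update of A is omitted), constraints are must-link pairs only, and a violated constraint adds a fixed penalty. The function names are hypothetical.

```python
from collections import defaultdict


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def initial_groups(n, must_link):
    """Initialization: connected components of the must-link graph, largest first."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must_link:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)


def constrained_em(points, k, must_link, penalty=1.0, iterations=10):
    n = len(points)
    centroids = [mean([points[i] for i in g]) for g in initial_groups(n, must_link)[:k]]
    clusters = range(len(centroids))
    partners = defaultdict(list)
    for i, j in must_link:
        partners[i].append(j)
        partners[j].append(i)
    labels = [min(clusters, key=lambda c: sq_dist(points[i], centroids[c])) for i in range(n)]
    for _ in range(iterations):
        # E-step: greedily reassign each point, paying a penalty per violated must-link.
        for i in range(n):
            def cost(c):
                violated = sum(1 for j in partners[i] if labels[j] != c)
                return sq_dist(points[i], centroids[c]) + penalty * violated
            labels[i] = min(clusters, key=cost)
        # M-step: update each cluster centroid (the parameter matrix A is not updated here).
        for c in clusters:
            members = [points[i] for i in range(n) if labels[i] == c]
            if members:
                centroids[c] = mean(members)
    return labels


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.5, 0.5]]
print(constrained_em(points, k=2, must_link=[(0, 1), (2, 3)]))
```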

  24. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  25. Profiling Experiments • Dataset • 1K researchers from ArnetMiner.org • Baselines • Amilcare • Support Vector Machines • Unified_NT (CRFs without transition features) • Evaluation measures • Precision, Recall, F1
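
The evaluation measures can be sketched as follows. The slide does not say at which granularity they are computed, so this example assumes extracted items are compared as (field, value) pairs; the example values are made up.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical extraction output for one researcher.
predicted = {("position", "Professor"), ("affiliation", "UIUC"), ("email", "x@example.org")}
gold = {("position", "Professor"), ("affiliation", "UIUC"), ("phone", "555-0100"), ("address", "Urbana")}
print(precision_recall_f1(predicted, gold))
```

Here two of three predictions are correct (precision 2/3) and two of four gold items are found (recall 1/2), giving F1 = 4/7.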

  26. Profiling Results—5-fold cross validation 83.37

  27. Contribution of Features

  28. Disambiguation Experiments • Data sets: • Abbreviated Name dataset • Real Name dataset

  29. Experiment Setup • Baseline method: unsupervised hierarchical clustering • Measurement

  30. Disambiguation Results

  31. Contribution of Different Constraints

  32. How Profiling and Disambiguation Help Expert Finding • Expert finding by using a PageRank-based method
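
The slide does not describe the PageRank-based method in detail, so the following is only a generic power-iteration PageRank on a small directed graph. The graph, the node names, and the damping value are assumptions for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank. graph maps every node to a list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for u in targets:
                    new_rank[u] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical graph: researchers b, c, and d all point to researcher a.
graph = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Cleaner profiles and disambiguated names would affect such a ranking mainly through the graph itself, since merged or split author nodes change who links to whom.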

  33. Outline • Motivation • Related Work • Problem Description • Our Approach • Experimental Results • Summary

  34. Summary • Investigated the problem of researcher social network extraction • Proposed a unified approach to profiling and a constraint-based probabilistic model for name disambiguation • Experimental results show that our approaches outperform the baseline methods • Applying them to expert finding yields a significant improvement in performance

  35. Thanks! Q&A HP: http://keg.cs.tsinghua.edu.cn/persons/tj/ Online Demo: http://arnetminer.org
