Expert Search. Data Sciences Summer Institute University of Illinois at Urbana-Champaign July 1, 2011. Expert Search Group. Expert Search is a search engine that returns a list of people who are experts in a particular area of study given a paper abstract or list of topics.
ExpertSearch Data Sciences Summer Institute University of Illinois at Urbana-Champaign July 1, 2011
Expert Search Goal • Expert Search is a search engine that returns a list of people who are experts in a particular area of study, given a paper abstract or a list of topics. [Diagram: Paper Abstract or Topic → Expert Search → Expert #1, Expert #2, Expert #3, Expert #4]
Expert Search Current Use • Generate a list of professors who should be invited to a talk on campus. [Diagram: Summary of Talk → List of people to e-mail]
Data Crawling ____________________________________________________________________
Data Crawling Team: Oluwadare Ibiyemi, Fitzroy Nembhard, Froswell Wallace, Thapanapong Rukkanchanunt
Data Crawling • 13609 Documents • 4445 Experts • 987 MB of Text • 247 Programs of Study
Data Crawling [Diagram labels: crawl, obtain, retrieve, save, store]
Data Crawling • Get the homepage URL from the UIUC phonebook • Use a search engine to obtain the URL if it is not in the UIUC phonebook • Crawl the homepage • Terminate the crawl at 300 HTML pages • Collect and process PDF files • Store the number of documents, the homepage link, and the text crawled from the homepage
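The crawl steps above can be sketched roughly as follows. This is a minimal illustration, not the team's crawler: the breadth-first order, the same-host restriction, and the way documents are counted are assumptions, and `fetch` and `extract_links` are hypothetical callables supplied by the caller (for example, an HTTP client and an HTML link parser). Only the 300-page limit comes from the slide.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

MAX_PAGES = 300  # page limit stated on the slide


def crawl_homepage(start_url, fetch, extract_links, max_pages=MAX_PAGES):
    """Crawl outward from an expert's homepage, stopping at max_pages.

    fetch(url) returns the page text or None on failure, and
    extract_links(html) returns the hrefs found in a page. Both are
    hypothetical helpers, not part of the original system.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}       # url -> page text
    pdf_links = []   # PDFs are only collected here, not downloaded

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in extract_links(html):
            link = urljoin(url, href)
            if link in seen:
                continue
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdf_links.append(link)
            elif urlparse(link).netloc == host:  # assumption: stay on one host
                queue.append(link)

    return {
        "homepage": start_url,
        # assumption: a "document" is any crawled page or collected PDF
        "num_documents": len(pages) + len(pdf_links),
        "pages": pages,
        "pdf_links": pdf_links,
    }
```

Passing the fetcher and link extractor as arguments keeps the sketch free of network code, so the stopping rule can be checked against a small in-memory site.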
Data Classification _____________________________________________________________
Classification & Extraction Team: Kendra Clay, Eunki Kim, Victoria Ko, Bekah Van Maanen
Classification & Extraction • Classification • Extraction
Classification Goal • The task for classification is to determine whether the URL listed for the expert is his or her homepage. [Diagram: HTML Text Files → Classified Homepages]
Classification: Labeling • Implemented supervised learning • Collected text from crawled data • Manually labeled 1300 web pages
Classification: Learning • Create directories for testing, training, and unlabeled data.
Classification: Classification • Learning Algorithm: Sparse Network Learner • Feature: Bag of Words (bigram) • 1063 files classified as homepages by classifier • Accuracy: 87.333%
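The bigram bag-of-words feature named above can be illustrated with a short sketch. The tokenization rule (lowercased alphanumeric runs) is an assumption, and the Sparse Network Learner itself is not reproduced here; the code only shows how a page's text might be turned into bigram counts before being passed to a classifier.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigram_features(text):
    """Return a bag of adjacent word pairs with their counts."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))
```

For example, `bigram_features("Research interests: research interests")` counts the pair `("research", "interests")` twice and `("interests", "research")` once.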
Information (Keyword) Extraction • Two source types: 1. Homepages / 2. Papers • The output is used by Information Retrieval to match a search query to an expert [Diagram: Expert Text Files → Information Extraction → Expert Interests]
Extraction Task • Use methods or apply rules on: 1. HTML code 2. Parsed text files [Diagram: Expert Text Files → Extract Interests → Expert Interest Text File]
Extraction Task: Challenges (homepages) • Homepages come in various formats • Rules must be set to deal with each case
Extraction Task: Rules • Step 1: Extraction rules - Get a big chunk of information - Example tokens: Research Areas, Interests, Specialization, Areas of Expertise, Field of Study • Step 2: Iteration rules - Identify the format and refine the information found - Example formats: List, Comma, Table, Paragraph, Link • Repeat Step 1 and Step 2 until the right part of the information is found
Extraction Task: Example • Webpage: http://abe.illinois.edu/faculty/M_Hirschi • Step 1. Find “Research Areas” [Screenshot labels: Profile, Research, Courses, Education, Publications, Areas] • Step 2. Identify the format, here a “List” marked up with <ul>, and apply the iteration rule to each <li> • Extracted interests: Soil erosion and sediment control; Water quality and management
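The two steps in this example can be sketched for the "List" format only. This is an illustration under stated assumptions, not the team's extractor: the token list is taken from the rules slide, while the class name, the substring matching, and the choice to read only the first `<ul>` after the matched token are hypothetical. Nested lists and the other formats (Comma, Table, Paragraph, Link) are not handled.

```python
from html.parser import HTMLParser

# Section tokens listed on the rules slide, lowercased for matching.
SECTION_TOKENS = (
    "research areas", "interests", "specialization",
    "areas of expertise", "field of study",
)


class InterestListParser(HTMLParser):
    """Step 1: find a section token. Step 2: read the <li> items of the next <ul>."""

    def __init__(self):
        super().__init__()
        self.found_token = False
        self.in_list = False
        self.list_done = False
        self.in_item = False
        self.interests = []
        self._buf = []

    def _flush(self):
        # Collapse whitespace and store the finished list item.
        text = " ".join("".join(self._buf).split())
        if text:
            self.interests.append(text)
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if not self.found_token or self.list_done:
            return
        if tag == "ul":
            self.in_list = True
        elif tag == "li" and self.in_list:
            if self.in_item:  # previous <li> was never closed
                self._flush()
            self.in_item = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self.in_item:
            self._flush()
        elif tag == "ul" and self.in_list:
            if self.in_item:
                self._flush()
            self.in_list = False
            self.list_done = True

    def handle_data(self, data):
        if self.in_item:
            self._buf.append(data)
        elif not self.found_token:
            lowered = data.lower()
            if any(token in lowered for token in SECTION_TOKENS):
                self.found_token = True


def extract_interests(html):
    """Return the list items that follow the first matching section token."""
    parser = InterestListParser()
    parser.feed(html)
    parser.close()
    return parser.interests
```

Given markup such as `<h3>Research Areas</h3><ul><li>Soil erosion and sediment control</li><li>Water quality and management</li></ul>` (hypothetical markup modeled on the slide), `extract_interests` returns the two list items as strings.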
Extraction Task: Challenges (papers/PDFs) • Papers include some non-word text (e.g., mathematical notation) that may be incorrectly identified as keywords • Solution: take only the abstracts from papers • How long should a keyword phrase be to be useful in associating it with an expert? • A maximum length must be defined • How can we identify keywords? • Part-of-speech tags, noun phrases
Extraction Task: Tools (Illinois Chunker) [Figure: chunker output on part of the abstract of "Pictorial Structures for Object Recognition" by P. Felzenszwalb and D. Huttenlocher] • Noun phrases (NPs) are candidate keywords • Calculate a weight for each candidate keyword • Take the 10 highest-weighted noun phrases
Extraction Task: Tools (Rapid Automatic Keyword Extraction, RAKE) • Frequency freq(w): total number of occurrences of the word • Degree deg(w): (total number of individual occurrences of the word in the document) + (length of each noun phrase the word appears in) • Word score: s(w) = deg(w) / freq(w) • NP score: np_s(w) = s(w1) + s(w2) + ... + s(wk), where w = (w1 w2 ... wk) is the noun phrase and s(wi) is the word score of word wi
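The scoring above can be sketched as follows. One assumption is worth flagging: the degree is computed in the usual RAKE way, as the sum of the lengths of the noun phrases a word occurs in (which already counts the word itself), and this is only one reading of the slide's definition. The input is assumed to be a list of noun phrases already produced by a chunker, each given as a list of lowercase words; the function names are illustrative.

```python
from collections import defaultdict


def word_scores(phrases):
    """Compute s(w) = deg(w) / freq(w) over a list of tokenized noun phrases."""
    freq = defaultdict(int)
    deg = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Assumed degree: each occurrence adds the length of its phrase.
            deg[word] += len(phrase)
    return {word: deg[word] / freq[word] for word in freq}


def rank_phrases(phrases, top_n=10):
    """Score each noun phrase as the sum of its word scores; return the top_n."""
    scores = word_scores(phrases)
    ranked = {}
    for phrase in phrases:
        ranked[" ".join(phrase)] = sum(scores[word] for word in phrase)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(ranked.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

The default `top_n=10` mirrors the "top 10 highest weight noun phrases" rule from the chunker slide. Note that summing word scores favors longer phrases, which is why a maximum phrase length matters.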
Topic Modeling _____________________________________________________________
Topic Modeling Team: Pradip Karki, Sam Somuah
Topic Modeling • Goal: To discover latent topics in the “bag of words” associated with each expert • Process: Latent Dirichlet Allocation and Gibbs Sampling [Diagram: Expert Text Files → Topic Modeling → Distribution of words over topics, topics over experts]
Topic Modeling: Motivation • Challenges: - Large number of documents - Experts have multiple areas of expertise • Topic Modeling: - A generative model - Reduces dimensionality by mapping to a limited number of topics - "Hidden" topics can be discovered without the need for labeling
Topic Modeling • The Information Retrieval group uses the probabilities of the topics, and of the words associated with each topic, to retrieve relevant results
Topic Modeling [Diagram label: Expert Text Files]
Topic Modeling: Output

Expert-Topic
ExpertID  TopicID  Prob
3301      87       0.551046848
3301      173      0.127817630
3301      199      0.065532870
3301      176      0.049024193

Topic-Word
TopicID  Word       Prob
87       data       0.177620899
87       mine       0.014229041
87       algorithm  0.012556624
87       pattern    0.124705841
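One way these two tables could be combined at query time is sketched below. The slides do not say how the retrieval component uses the probabilities, so the scoring rule is an assumption: the probability of a word for an expert is taken as the sum over topics of P(topic | expert) × P(word | topic), and a query is scored by summing log probabilities with a small floor for unseen words. The function names and the floor value are hypothetical; the rows are the sample values from the output slide.

```python
import math
from collections import defaultdict

# Sample rows copied from the output slide.
EXPERT_TOPIC = [
    (3301, 87, 0.551046848), (3301, 173, 0.127817630),
    (3301, 199, 0.065532870), (3301, 176, 0.049024193),
]
TOPIC_WORD = [
    (87, "data", 0.177620899), (87, "mine", 0.014229041),
    (87, "algorithm", 0.012556624), (87, "pattern", 0.124705841),
]


def build_index(rows):
    """Turn (key, subkey, prob) rows into a nested dict: key -> {subkey: prob}."""
    index = defaultdict(dict)
    for key, subkey, prob in rows:
        index[key][subkey] = prob
    return index


def word_given_expert(word, expert_topics, topic_words):
    """Assumed mixture: sum over topics of P(topic|expert) * P(word|topic)."""
    return sum(
        p_topic * topic_words.get(topic, {}).get(word, 0.0)
        for topic, p_topic in expert_topics.items()
    )


def score_expert(query_words, expert_topics, topic_words, floor=1e-9):
    """Log-likelihood of the query; the floor (an assumption) avoids log(0)."""
    return sum(
        math.log(max(word_given_expert(w, expert_topics, topic_words), floor))
        for w in query_words
    )


def rank_experts(query_words, expert_topic_index, topic_word_index):
    """Return (expert_id, score) pairs, best match first."""
    scored = [
        (expert, score_expert(query_words, topics, topic_word_index))
        for expert, topics in expert_topic_index.items()
    ]
    return sorted(scored, key=lambda pair: -pair[1])
```

With only the sample rows, expert 3301's probability for "data" comes entirely from topic 87, because the word distributions of topics 173, 199, and 176 are not shown on the slide.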
Information Retrieval _____________________________________________________________
Information Retrieval Team: Sean Massung, Fei Wu
Information Retrieval • Given a user's query, the IR component acts as a search engine, ranking experts by relevance. [Diagram output: List of experts ordered by relevance]
System Flow [Diagram: components are the UI, Information Retrieval, Crawl data, and the Database; arrow labels are HTTP POST Request, HTTP GET with Key, Abstract or query, Key, Key/List, and List]
Results • As expected, longer queries produce more accurate results • The LM method is more accurate for short queries, whereas the TM method performs well on longer queries • Overall, we have good expert recall
User Interface __________________________________________________________
User Interface Team: Fitzroy Nembhard, Jerone Dunbar