
Expert Search

Presentation Transcript


  1. Expert Search Data Sciences Summer Institute, University of Illinois at Urbana-Champaign, July 1, 2011

  2. Expert Search Group

  3. Expert Search Goal: Expert Search is a search engine that returns a list of people who are experts in a particular area of study, given a paper abstract or a list of topics. [Slide diagram: a paper abstract or topic goes into Expert Search, which returns Expert #1 through Expert #4.]

  4. Expert Search Current Use: generate a list of professors that should be invited to a talk on campus. [Slide diagram: a summary of the talk goes in, and a list of people to e-mail comes out.]

  5. Presentation Overview

  6. Data Crawling

  7. Data Crawling Team: Oluwadare Ibiyemi, Fitzroy Nembhard, Froswell Wallace, Thapanapong Rukkanchanunt

  8. Data Crawling: 13,609 documents • 4,445 experts • 987 MB of text • 247 programs of study

  9. Data Crawling [slide graphic with the keywords: crawl, obtain, retrieve, save, store]

  10. Data Crawling
  • Get the homepage URL from the UIUC phonebook
  • Use a search engine to obtain the URL if it is not in the UIUC phonebook
  • Crawl the homepage
  • Terminate the search at 300 HTML pages
  • Collect and process PDF files
  • Store the number of documents, the homepage link, and the text crawled from the homepage (a crawl-loop sketch follows below)
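
  The transcript does not show which crawling tools the team used (the tools slide is image-only), so the following is only a minimal sketch of the crawl loop described above, assuming the requests and BeautifulSoup libraries; the phonebook and search-engine lookups are omitted.

    import collections
    import urllib.parse

    import requests                      # assumed library; the team's actual tools are not shown
    from bs4 import BeautifulSoup        # assumed library

    MAX_PAGES = 300                      # terminate the search at 300 HTML pages

    def crawl_homepage(start_url):
        """Breadth-first crawl of one expert's homepage, staying on the same host."""
        host = urllib.parse.urlparse(start_url).netloc
        queue = collections.deque([start_url])
        seen, texts, pdf_links = set(), [], []
        while queue and len(seen) < MAX_PAGES:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            if url.lower().endswith(".pdf"):
                pdf_links.append(url)    # PDF files are collected and processed separately
                continue
            try:
                page = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            soup = BeautifulSoup(page.text, "html.parser")
            texts.append(soup.get_text(" ", strip=True))
            for a in soup.find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                if urllib.parse.urlparse(link).netloc == host:
                    queue.append(link)
        # store the number of documents, homepage link, and crawled text
        return {"homepage": start_url, "num_docs": len(seen),
                "text": "\n".join(texts), "pdfs": pdf_links}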

  11. Data Crawling

  12. Data Crawling: Tools

  13. Data Classification

  14. Classification & Extraction: Kendra Clay, Eunki Kim, Victoria Ko, Bekah Van Maanen

  15. Classification & Extraction [slide diagram showing the two components: Classification and Extraction]

  16. Classification Goal: the task for classification is to determine whether the URL listed for an expert is his or her homepage. [Slide diagram: HTML text files go in, classified homepages come out.]

  17. Classification: Labeling • Implemented supervised learning • Collected text from crawled data • Manually labeled 1300 web pages

  18. Classification: Learning • Create directories for testing, training, and unlabeled data.

  19. Classification: Classification • Learning algorithm: Sparse Network Learner • Features: bag of words (bigrams) • 1063 files classified as homepages by the classifier • Accuracy: 87.333%
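
  A minimal sketch of the learning and classification steps, under stated assumptions: scikit-learn's CountVectorizer supplies the bigram bag-of-words features, and a sparse linear SGDClassifier stands in for the sparse network learner named above, which is not a scikit-learn model.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    def train_homepage_classifier(train_texts, train_labels):
        """train_texts: text of the manually labeled pages; train_labels: 1 = homepage, 0 = not."""
        model = make_pipeline(
            CountVectorizer(ngram_range=(2, 2)),   # bag-of-words bigram features
            SGDClassifier(loss="log_loss"),        # sparse linear stand-in for the team's learner
        )
        model.fit(train_texts, train_labels)
        return model

    # Classify the unlabeled crawled pages:
    # predictions = train_homepage_classifier(train_texts, train_labels).predict(unlabeled_texts)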

  20. Classification: Example

  21. Classification

  22. Classification: Example

  23. Classification & Extraction [slide diagram showing the two components: Classification and Extraction]

  24. Information (Keywords) Extraction • Two types: 1. homepages, 2. papers • Output is used by Information Retrieval to match a search query to an expert [Slide diagram: expert text files go in, expert interests come out]

  25. Extraction Task: use methods or apply rules to 1. HTML code or 2. parsed text files [Slide diagram: expert text files are processed to extract interests, producing an expert-interest text file]

  26. Extraction Task: Challenges • Homepages come in various formats • Rules need to be set to deal with the various cases

  27. Extraction Task: Rules
  • Step 1: Extraction rules
     - Get a big chunk of information
     - Example tokens: Research Areas, Interests, Specialization, Areas of Expertise, Field of Study
  • Step 2: Iteration rules
     - Find what format it is and refine the found information
     - Example formats: List, Comma, Table, Paragraph, Link
  • Repeat Step 1 and Step 2 until the right piece of information is found (a rule sketch follows below)
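
  A rough sketch of the two-pass rules, assuming BeautifulSoup and a small set of trigger tokens; it handles only the "List" and "Comma" formats, whereas the team's actual rule set covers more formats and edge cases.

    import re
    from bs4 import BeautifulSoup          # assumed library

    # Example trigger tokens from the slide above
    TOKENS = ["Research Areas", "Research Interests", "Interests",
              "Specialization", "Areas of Expertise", "Field of Study"]

    def extract_interests(html):
        soup = BeautifulSoup(html, "html.parser")
        for token in TOKENS:
            # Step 1: extraction rule -- locate the chunk of the page holding the token
            heading = soup.find(string=re.compile(token, re.IGNORECASE))
            if heading is None:
                continue
            # Step 2: iteration rules -- refine according to the format found
            ul = heading.find_next("ul")
            if ul is not None:                       # "List" format
                return [li.get_text(strip=True) for li in ul.find_all("li")]
            for text in heading.find_all_next(string=True):
                if text.strip():                     # first non-empty text after the token
                    if "," in text:                  # "Comma" format
                        return [t.strip() for t in text.split(",")]
                    break
        return []                                    # no rule matched; other formats not handled here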

  28. Extraction Task: Example (http://abe.illinois.edu/faculty/M_Hirschi) • Step 1: find the "Research Areas" section on the webpage • Step 2: determine the format, here a "List" with <ul>, and apply the iteration rule to each <li> • Extracted interests: soil erosion and sediment control; water quality and management

  29. Extraction Task: Challenges (papers/PDFs)
  • Papers include some non-word text (e.g. mathematical notation) that may be incorrectly identified as keywords; solution: take the abstract from each paper
  • How long should a keyword phrase be to be useful in associating it with an expert? A maximum length must be defined
  • How can we identify keywords? Part-of-speech tags and noun phrases

  30. Extraction Task: Tools (Illinois Chunker)
  [Slide shows a chunked portion of the abstract of "Pictorial Structures for Object Recognition" by P. Felzenszwalb and D. Huttenlocher]
  • Noun phrases (NPs) are the candidate keywords
  • Calculate a weight for each potential keyword
  • Take the 10 highest-weighted noun phrases

  31. Extraction Task: Tools (Rapid Automatic Keyword Extraction, RAKE)
  • Frequency: freq(w) = total number of occurrences of word w
  • Degree: deg(w) = (total number of individual occurrences of w in the document) + (length of each noun phrase that w appears in)
  • Word score: s(w) = deg(w) / freq(w)
  • Noun-phrase score: np_s(p) = s(w1) + s(w2) + ... + s(wk), where p = (w1 w2 ... wk) is a noun phrase and s(wi) is the word score of word wi
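
  The formulas above translate almost line-for-line into code. The sketch below assumes the candidate noun phrases have already been produced by the chunker and that word frequencies are counted over those candidates; it is an approximation of the pipeline, not a drop-in piece of it.

    from collections import defaultdict

    def score_noun_phrases(noun_phrases, top_k=10):
        """noun_phrases: candidate phrases from the chunker, each a list of words."""
        freq = defaultdict(int)   # freq(w): total number of occurrences of w
        deg = defaultdict(int)    # deg(w): occurrences of w + lengths of the phrases containing w
        for phrase in noun_phrases:
            for word in phrase:
                freq[word] += 1
                deg[word] += 1 + len(phrase)
        word_score = {w: deg[w] / freq[w] for w in freq}          # s(w) = deg(w) / freq(w)
        phrase_score = {
            " ".join(p): sum(word_score[w] for w in p)            # np_s(p) = sum of s(wi)
            for p in map(tuple, noun_phrases)
        }
        # keep the 10 highest-weighted noun phrases as the expert's keywords
        return sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)[:top_k]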

  32. Classification & Extraction

  33. Topic Modeling

  34. Topic Modeling Team: Pradip Karki, Sam Somuah

  35. Topic Modeling
  • Goal: discover latent topics in the “bag of words” associated with each expert
  • Process: Latent Dirichlet Allocation (LDA) with Gibbs sampling (an LDA sketch follows below)
  [Slide diagram: expert text files go in; distributions of words over topics and of topics over experts come out]
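
  The transcript does not name the topic-modeling tool (the tools slide is image-only), so the sketch below uses gensim as a stand-in; note that gensim's LdaModel fits LDA with online variational Bayes rather than the Gibbs sampler mentioned above, and the topic and pass counts are assumptions.

    from gensim import corpora, models     # assumed library

    def fit_topic_model(expert_docs, num_topics=200):
        """expert_docs: one token list (bag of words) per expert."""
        dictionary = corpora.Dictionary(expert_docs)
        corpus = [dictionary.doc2bow(doc) for doc in expert_docs]
        lda = models.LdaModel(corpus, id2word=dictionary,
                              num_topics=num_topics, passes=10)
        # distribution of topics over experts, and of words over topics
        expert_topics = [lda.get_document_topics(bow) for bow in corpus]
        topic_words = [lda.show_topic(t, topn=10) for t in range(num_topics)]
        return lda, expert_topics, topic_words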

  36. Topic Modeling: Motivation
  • Challenges: a large number of documents; experts have multiple areas of expertise
  • Topic modeling is a generative model: it reduces dimensionality by mapping to a limited number of topics, and "hidden" topics can be discovered without the need for labeling

  37. Topic Modeling: the probabilities of the topics, and of the words associated with each topic, are used by the Information Retrieval group to retrieve relevant results.

  38. Topic Modeling [slide diagram of the pipeline starting from the expert text files]

  39. Topic Modeling: Output

  Expert-Topic table
  ExpertID  TopicID  Prob
  3301      87       0.551046848
  3301      173      0.127817630
  3301      199      0.065532870
  3301      176      0.049024193

  Topic-Word table
  TopicID  Word       Prob
  87       data       0.177620899
  87       mine       0.014229041
  87       algorithm  0.012556624
  87       pattern    0.124705841
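
  One plausible way for the Information Retrieval component to combine the two tables above (an assumed scoring rule, not the team's exact formula): sum, over query words and topics, P(word | topic) * P(topic | expert).

    from collections import defaultdict

    def rank_experts(query_words, expert_topic, topic_word, top_k=10):
        """expert_topic: {expert_id: {topic_id: prob}}; topic_word: {topic_id: {word: prob}}."""
        scores = defaultdict(float)
        for expert_id, topics in expert_topic.items():
            for topic_id, p_topic in topics.items():
                word_probs = topic_word.get(topic_id, {})
                for word in query_words:
                    scores[expert_id] += p_topic * word_probs.get(word, 0.0)
        # highest-scoring experts first
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]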

  40. Topic Modeling: Tools

  41. Information Retrieval

  42. Information Retrieval Team: Sean Massung, Fei Wu

  43. Information Retrieval: given a user's query, the IR component acts as a search engine, ranking experts based on relevancy. [Slide diagram: the output is a list of experts ordered by relevancy.]

  44. System Flow [Slide diagram: the UI submits an abstract or query to the Information Retrieval component with an HTTP POST request and receives a key; an HTTP GET with that key returns the ranked list. Information Retrieval draws on the crawl data stored in the database.] (a client sketch follows below)
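
  A hypothetical client for the flow in the diagram: the UI POSTs the abstract or query, receives a key, then GETs the ranked list with that key. The base URL, endpoint paths, and field names below are illustrative placeholders, not the project's real API.

    import requests                              # assumed library

    BASE_URL = "http://localhost:8080"           # placeholder for the IR service

    def query_experts(abstract):
        # HTTP POST request with the abstract or query; the service returns a key
        resp = requests.post(BASE_URL + "/query", data={"abstract": abstract})
        key = resp.json()["key"]
        # HTTP GET with the key; the service returns the ranked list of experts
        result = requests.get(BASE_URL + "/results", params={"key": key})
        return result.json()["experts"]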

  45. Information Retrieval

  46. Results • As expected, longer queries produce more accurate results • The LM method is more accurate for short queries, whereas the TM method performs well on longer queries • Overall, we have good expert recall

  47. Information Retrieval

  48. User Interface

  49. User Interface Team: Fitzroy Nembhard, Jerone Dunbar
