Data Crawling and Classification Institute

ExpertSearch
Data Sciences Summer Institute
University of Illinois at Urbana-Champaign
July 1, 2011

Expert Search Group

Expert Search: Goal
Expert Search is a search engine that returns a list of people who are experts in a particular area of study, given a paper abstract or a list of topics.
• Input: Paper Abstract or Topic
• Output: Expert #1, Expert #2, Expert #3, Expert #4

Expert Search: Current Use
Generate a list of professors that should be invited to a talk on campus.
• Input: Summary of Talk
• Output: List of people to e-mail

Presentation Overview

Data Crawling ____________________________________________________________________

Data Crawling Team: Oluwadare Ibiyemi, Fitzroy Nembhard, Froswell Wallace, Thapanapong Rukkanchanunt

Data Crawling
• 13,609 documents
• 4,445 experts
• 987 MB of text
• 247 programs of study

Data Crawling: crawl, obtain, retrieve, save, store

Data Crawling
• Get homepage URL from UIUC phonebook
• Use a search engine to obtain the URL if it is not in the UIUC phonebook
• Crawl homepage
• Terminate the search at 300 HTML pages
• Collect and process PDF files
• Store the number of documents, homepage link, and text crawled from "homepage"
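The crawl steps above can be sketched as follows. This is a minimal illustration, not the team's crawler: the breadth-first order, the same-host restriction, the record layout, and the `fetch` callable (a stand-in for real HTTP retrieval and link parsing) are all assumptions. Only the 300-page limit and the stored fields come from the slide.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

MAX_HTML_PAGES = 300  # termination limit stated on the slide


def crawl_homepage(start_url, fetch, max_pages=MAX_HTML_PAGES):
    """Crawl outward from an expert's homepage.

    `fetch(url)` is a hypothetical callable that returns (text, links)
    for an HTML page, or None if the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}       # url -> text of each crawled HTML page
    pdf_links = []   # PDF files to collect and process separately

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        pages[url] = text
        for link in links:
            absolute = urljoin(url, link)
            if absolute in seen:
                continue
            seen.add(absolute)
            if absolute.lower().endswith(".pdf"):
                pdf_links.append(absolute)
            elif urlparse(absolute).netloc == host:
                # Assumption: stay on the expert's own site.
                queue.append(absolute)

    # Assumption: "number of documents" counts HTML pages plus PDFs.
    return {
        "homepage": start_url,
        "num_documents": len(pages) + len(pdf_links),
        "text": "\n".join(pages.values()),
        "pdf_links": pdf_links,
    }


# Usage with an in-memory stand-in for the web (made-up URLs).
site = {
    "http://example.edu/~prof/": ("Home page text", ["research.html", "papers/a.pdf"]),
    "http://example.edu/~prof/research.html": ("Research text", ["/~prof/"]),
}
record = crawl_homepage("http://example.edu/~prof/", site.get)
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; a real crawler would plug in an HTTP client and an HTML link extractor there.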

Data Crawling

Data Crawling: Tools

Data Classification _____________________________________________________________

Classification & Extraction Team: Kendra Clay, Eunki Kim, Victoria Ko, Bekah Van Maanen

Classification & Extraction
• Classification
• Extraction

Classification: Goal
The task for classification is to determine whether the URL listed for the expert is his or her homepage.
• Input: HTML Text Files
• Output: Classified Homepages

Classification: Labeling
• Implemented supervised learning
• Collected text from crawled data
• Manually labeled 1300 web pages

Classification: Learning
• Create directories for testing, training, and unlabeled data.

Classification: Classification
• Learning Algorithm: Sparse Network Learner
• Feature: Bag of Words (bigram)
• 1063 files classified as homepages by classifier
• Accuracy: 87.333%
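To make the bigram bag-of-words feature concrete, here is a small sketch of a supervised homepage classifier. It is not the Sparse Network Learner named on the slide: a plain perceptron is swapped in as a simple stand-in, and the training texts, function names, and epoch count are invented for illustration.

```python
from collections import Counter


def bigram_features(text):
    """Bag of words over adjacent word pairs (bigrams)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def score(model, text):
    weights, bias = model
    feats = bigram_features(text)
    return bias + sum(weights[f] * v for f, v in feats.items())


def classify(model, text):
    """True means 'this page is a homepage'."""
    return score(model, text) > 0


def train_perceptron(examples, epochs=10):
    """examples: list of (text, label); label is True for a homepage."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            if classify((weights, bias), text) != label:
                step = 1.0 if label else -1.0
                for f, v in bigram_features(text).items():
                    weights[f] += step * v
                bias += step
    return weights, bias


def accuracy(model, labeled):
    """Fraction of labeled pages the model classifies correctly."""
    correct = sum(classify(model, t) == y for t, y in labeled)
    return correct / len(labeled)


# Toy labeled pages (hypothetical stand-ins for manually labeled data).
train = [
    ("welcome to my homepage research interests and publications", True),
    ("my homepage and research interests", True),
    ("course schedule for fall semester", False),
    ("department news and events calendar", False),
]
model = train_perceptron(train)
```

The `accuracy` helper mirrors how a figure such as the one on the slide would be computed: the share of held-out labeled pages on which the prediction matches the manual label.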

Classification: Example

Classification

Classification: Example

Classification & Extraction
• Classification
• Extraction

Information (Keywords) Extraction
• Two types: 1. Homepages / 2. Papers
• Output used by Information Retrieval to match a search query to an expert
• Input: Expert Text Files
• Output: Expert Interests

Extraction Task
• Use methods or apply rules to: 1. HTML code 2. Parsed Text Files
• Input: Expert Text Files
• Step: Extract Interests
• Output: Expert Interest Text File

Extraction Task: Challenges
• Homepages come in various formats
• Rules need to be set to deal with the various cases

Extraction Task: Rules
• Step 1: Extraction rules
  - Get a big chunk of information
  - Example of tokens: Research Areas, Interests, Specialization, Areas of Expertise, Field of Study
• Step 2: Iteration rules
  - Find what format it is and refine the found information
  - Example of formats: List, Comma, Table, Paragraph, Link
• Repeat Step 1 and Step 2 until the right part of the information is found

Extraction Task: Example
• Webpage: http://abe.illinois.edu/faculty/M_Hirschi (section headings shown: Profile, Research Areas, Courses, Education, Publications)
• Step 1. Find "Research Areas"
• Step 2. Define what format it is; here: "List" with <ul>, so apply the iteration rule with <li>
• Extracted interests: Soil erosion and sediment control; Water quality and management
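The two-step rule for the "List" format can be sketched with Python's standard `html.parser`. The class name, the token tuple, and the sample markup below are illustrative assumptions; the sample is not the actual source of the page cited on the slide, and only the list format is handled.

```python
from html.parser import HTMLParser

# Step 1 tokens taken from the rules slide (lower-cased for matching).
TOKENS = ("research areas", "interests", "specialization",
          "areas of expertise", "field of study")


class InterestListParser(HTMLParser):
    """Collects <li> items from the first <ul> that follows a token."""

    def __init__(self):
        super().__init__()
        self.token_seen = False  # Step 1: a token such as "Research Areas" was found
        self.in_list = False     # Step 2: inside the <ul> after the token
        self.in_item = False
        self.done = False
        self.interests = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "ul" and self.token_seen:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.interests.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False
        elif tag == "ul" and self.in_list:
            self.in_list = False
            self.done = True  # stop after the first matching list

    def handle_data(self, data):
        if self.in_item:
            self.interests[-1] += data
        elif not self.token_seen and any(t in data.lower() for t in TOKENS):
            self.token_seen = True


def extract_interests(html):
    parser = InterestListParser()
    parser.feed(html)
    parser.close()
    return [" ".join(item.split()) for item in parser.interests if item.strip()]


# Illustrative markup only (an assumption about how such a page might look).
sample = """
<h3>Research Areas</h3>
<ul>
  <li>Soil erosion and sediment control</li>
  <li>Water quality and management</li>
</ul>
<h3>Courses</h3>
<ul><li>Not an interest</li></ul>
"""
interests = extract_interests(sample)
```

Other formats named in the rules (Comma, Table, Paragraph, Link) would each need their own iteration rule; this sketch covers the `<ul>`/`<li>` case only.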

Extraction Task: Challenges (papers/PDFs)
• Papers include some non-word text (e.g. mathematical notation) that may be incorrectly identified as keywords
• Solution: take abstracts from the papers
• How long should a keyword phrase be to be useful in associating it with an expert?
• Must define a maximum length
• How can we identify keywords?
• Part-of-speech, noun phrases

Extraction Task: Tools (Illinois Chunker)
[Figure: part of the abstract from "Pictorial Structures for Object Recognition" by P. Felzenszwalb and D. Huttenlocher]
• Noun phrases (NP) are candidate keywords
• Calculate weight for potential keywords
• Take the top 10 highest-weight noun phrases

Extraction Task: Tools (Rapid Automatic Keyword Extraction, RAKE)
• Frequency: freq(w) = total number of occurrences of word w
• Degree: deg(w) = (total number of individual occurrences of the word in the document) + (length of each noun phrase the word appears in)
• Word score: s(w) = deg(w) / freq(w)
• NP score: np_s(w) = s(w1) + s(w2) + ... + s(wk), where w = (w1 w2 ... wk) is a noun phrase and s(wi) is the individual word score for word wi
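A short sketch of the scoring above, assuming the noun phrases have already been produced by a chunker and are given as lists of words. The degree here follows the usual RAKE reading, in which each occurrence of a word adds the length of the phrase it occurs in (the word itself included); if the slide intends a different count, only the `deg` line changes. Function names and the sample phrases are made up for illustration.

```python
from collections import defaultdict


def rake_scores(phrases):
    """phrases: list of noun phrases, each a list of lower-cased words."""
    freq = defaultdict(int)  # freq(w): occurrences of w across all phrases
    deg = defaultdict(int)   # deg(w): summed length of phrases containing w
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            deg[word] += len(phrase)  # assumption: standard RAKE degree

    word_score = {w: deg[w] / freq[w] for w in freq}
    phrase_score = {
        " ".join(phrase): sum(word_score[w] for w in phrase)
        for phrase in phrases
    }
    return word_score, phrase_score


def top_keywords(phrases, n=10):
    """Return the n highest-scoring noun phrases (top 10 by default)."""
    _, phrase_score = rake_scores(phrases)
    return sorted(phrase_score, key=phrase_score.get, reverse=True)[:n]


# Hypothetical chunker output for a short abstract.
candidates = [
    ["object", "recognition"],
    ["pictorial", "structures"],
    ["recognition"],
]
word_score, phrase_score = rake_scores(candidates)
keywords = top_keywords(candidates)
```

Because s(w) divides degree by frequency, words that mostly appear inside longer phrases score higher than words that often stand alone, which is why "recognition" on its own ranks below the two-word phrases here.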

Classification & Extraction

Topic Modeling _____________________________________________________________

Topic Modeling Team: Pradip Karki, Sam Somuah

Topic Modeling
• Goal: To discover latent topics in the "bag of words" associated with each expert
• Process: Latent Dirichlet Allocation and Gibbs Sampling
• Input: Expert Text Files
• Output: Distribution of words over topics, topics over experts

Topic Modeling: Motivation
• Challenges:
  - Large number of documents
  - Experts have multiple areas of expertise
• Topic Modeling:
  - A generative model
  - Reduces dimensionality by mapping to a limited number of topics
  - "Hidden" topics can be discovered without the need for labeling

Topic Modeling
The Information Retrieval group uses the probabilities of the topics, and of the words associated with each topic, to retrieve relevant results.
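For readers unfamiliar with the process named above, the following is a compact sketch of collapsed Gibbs sampling for Latent Dirichlet Allocation. It is a textbook-style illustration, not the team's tool: the hyperparameters `alpha` and `beta`, the iteration count, the integer word ids, and the toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=100, seed=0):
    """docs: list of documents, each a list of word ids in [0, vocab_size).

    Returns (theta, phi): topic distribution per document (expert) and
    word distribution per topic.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                z = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in topics
    ]
    return theta, phi


# Toy corpus: three "experts", four distinct word ids.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

Here `theta` corresponds to "topics over experts" and `phi` to "words over topics"; both are smoothed estimates read off the final sample's counts.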

Topic Modeling Expert Text Files

Topic Modeling: Output

Expert-Topic
ExpertID  TopicID  Prob
3301      87       0.551046848
3301      173      0.127817630
3301      199      0.065532870
3301      176      0.049024193

Topic-Word
TopicID  Word       Prob
87       data       0.177620899
87       mine       0.014229041
87       algorithm  0.012556624
87       pattern    0.124705841
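One plausible way to combine the two output tables is to marginalize over topics: P(word | expert) = sum over topics of P(topic | expert) × P(word | topic). The deck does not state the exact formula the retrieval group uses, so treat this as an assumption; the dictionaries simply hold the rows shown on the slide, and the function names are hypothetical.

```python
# Rows copied from the Expert-Topic and Topic-Word tables on the slide.
expert_topic = {
    3301: {87: 0.551046848, 173: 0.127817630,
           199: 0.065532870, 176: 0.049024193},
}
topic_word = {
    87: {"data": 0.177620899, "mine": 0.014229041,
         "algorithm": 0.012556624, "pattern": 0.124705841},
}


def word_given_expert(word, expert_id):
    """P(word | expert) = sum_t P(t | expert) * P(word | t)."""
    return sum(
        p_topic * topic_word.get(topic_id, {}).get(word, 0.0)
        for topic_id, p_topic in expert_topic[expert_id].items()
    )


def query_score(query_words, expert_id):
    """Product of per-word probabilities for a query.

    Words missing from every listed topic give a score of zero here;
    a real system would need smoothing (not shown).
    """
    score = 1.0
    for word in query_words:
        score *= word_given_expert(word, expert_id)
    return score


p_data = word_given_expert("data", 3301)
```

With only topic 87's word rows available, the other three topics contribute nothing to `p_data`; with the full tables each topic would add its own term.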

Topic Modeling: Tools

Information Retrieval _____________________________________________________________

Information Retrieval Team: Sean Massung, Fei Wu

Information Retrieval
Given a user's query, the IR component acts as a search engine, ranking experts based on relevancy.
• Output: List of experts ordered by relevancy

Information Retrieval: System Flow
[Diagram labels: UI, Information Retrieval, Database, Crawl data; messages: Abstract or query, HTTP POST Request, Key, HTTP GET with Key, Key/List, List]
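As an illustration of ranking experts for a query, the sketch below scores each expert's text by query likelihood under a unigram language model with Jelinek-Mercer smoothing. The deck's results slide mentions an "LM method" without defining it, so reading it as a language-model ranker is an assumption, as are the smoothing weight `lam`, the function name, and the toy expert texts.

```python
import math
from collections import Counter


def rank_experts(query, expert_texts, lam=0.5):
    """Return expert ids ordered from most to least relevant.

    expert_texts: dict mapping expert id -> text associated with the expert.
    lam: weight of the collection model in the smoothed probability.
    """
    docs = {e: Counter(t.lower().split()) for e, t in expert_texts.items()}
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    total = sum(collection.values())

    scores = {}
    for expert, counts in docs.items():
        length = sum(counts.values())
        log_likelihood = 0.0
        for word in query.lower().split():
            p_doc = counts[word] / length if length else 0.0
            p_coll = collection[word] / total if total else 0.0
            p = (1 - lam) * p_doc + lam * p_coll
            if p == 0.0:
                continue  # word unseen anywhere: ignored in this sketch
            log_likelihood += math.log(p)
        scores[expert] = log_likelihood
    return sorted(scores, key=scores.get, reverse=True)


# Toy expert texts (hypothetical).
experts = {
    "A": "data mining algorithms pattern mining",
    "B": "soil erosion water quality",
}
ranking = rank_experts("data mining", experts)
```

Smoothing with the collection model keeps an expert from being eliminated just because one query word is missing from their text, which matters more as queries get longer.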

Information Retrieval

Information Retrieval: Results
• As expected, longer queries produce more accurate results
• LM method is more accurate for short queries, whereas the TM method performs well on longer queries
• Overall, we have good expert recall

Information Retrieval

User Interface __________________________________________________________

User Interface Team: Fitzroy Nembhard, Jerone Dunbar
