Explore the challenges, terminology, and solutions in leveraging contextual information to improve web search results. Learn about the IntelliZap system and its context-based approach to meta search. Covers algorithms, system architecture, and evaluation methods for information retrieval in context.
Information Retrieval in Context
Presenter: Xuehua Shen, xshen@uiuc.edu
Xuehua Shen @CS, UIUC
Presentation Layout
• Problem Description
• Terminology
• Challenges
• IntelliZap System [WWW2001]
• Concerns
Problem
• Search engines have become a key source of information
    1998 [GVU WWW Study]: 85% of people use search engines to locate information
    Now [Craig’s Talk]: 500 million searches on the Internet per day, 150 million searches at Google per day
• Efforts focus on coverage and relevance
Web Search Facts
• 3-5 billion web pages on the Web: huge and diverse information provided by the Web
• On average 1.7 words per query [Eric Brewer, CACM 09/2002]: little information provided by users
• Can search engines retrieve web pages well under these conditions?
Context
• Context may provide extra information that helps improve search result relevance
• An example: searching for "flowers" [DirectHit 1999]
    Men: typically want sites that let them send flowers
    Women: often want sites that let them order flower seeds or plants for gardening
• What context information is useful?
Terminology
• Ephemeral context: exists within a single search session
    Category [Inquirus2], document being viewed [Watson], feedback
• Persistent context: accumulates over time and is used in subsequent sessions
    User profile [My Yahoo!], query history and clickthrough data [Google]
Terminology cont.
• Personalization: the search engine uses context information to provide different search results for different users
• Customization: users manually configure their preferences
Challenges
• How to capture and store useful information?
• SearchPad [WWW2001]:
    Server-proxy-client architecture
    Users explicitly mark relevant pages
• Any shortcomings? Better ways?
Challenges cont.
• There are many retrieval models and many user models, but how to merge them?
• Croft uses a language model to represent context
Challenges cont.
• How to build such a system? Architecture (server side or client side)? User interface?
• Server side: scalability, privacy
• Client side: communication of context information with the server
Challenges cont.
• How to evaluate such work? Which metrics?
• HARD (High Accuracy Retrieval from Documents) track added this year: leverages additional information about the searcher and/or the search context
IntelliZap: General Description
• Assumption: a large fraction of searches originate while users are reading documents on their computers
• Standpoint: context is the body of words surrounding a user-selected phrase
• IntelliZap system: a meta search engine with context-based query augmentation, search engine selection, and reranking
Walkthrough of IntelliZap [five screenshot slides; images not preserved]
How to Use Context
• Augment the query before sending it to the search engines
• Rerank the results returned by the search engines
How to Collect the Right Amount of Context
• Don’t include the whole document, as the Watson system does
• Heuristic 1: establish the optimal context length as a function of the length of the text phrase and individual frequencies
• Heuristic 2: weight the text and the context relative to each other in the augmented query
    Emphasize the marked text phrase
    Weight of a context word: a monotonic function of its proximity to the text
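Heuristic 2 can be illustrated with a minimal sketch. The slide only says that the weight is a monotonic function of proximity; the exponential decay, the `selected_weight` and `decay` parameters, and the function name `context_weights` are assumptions made for illustration, not the IntelliZap implementation. Heuristic 1 (choosing the context length) is not covered here.

```python
def context_weights(tokens, sel_start, sel_end, selected_weight=2.0, decay=0.9):
    """Return (token, weight) pairs for a tokenized passage.

    tokens[sel_start:sel_end] is the user-selected phrase. It gets a fixed,
    emphasized weight. Every other token gets a weight that shrinks as its
    distance from the selection grows (exponential decay is an assumption;
    any monotonic function of proximity would fit the slide).
    """
    weighted = []
    for i, tok in enumerate(tokens):
        if sel_start <= i < sel_end:
            weighted.append((tok, selected_weight))
        else:
            # Distance in tokens to the nearest edge of the selection.
            dist = sel_start - i if i < sel_start else i - sel_end + 1
            weighted.append((tok, decay ** dist))
    return weighted


# Hypothetical example: the user marked the single word "jaguar".
example_tokens = "cheap flights to jaguar dealer in town".split()
example_weights = context_weights(example_tokens, 3, 4)
```

Words adjacent to the selection ("to", "dealer") receive the same weight, and words farther away receive progressively less, so the marked phrase always dominates the augmented query.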
Algorithm Overview
Step 0: Semantic Network
• Build the semantic network offline: a statistics-based semantic network
• Linear combination of a vector-based correlation metric and a WordNet-based metric
Semantic Network cont.
• Vector-based correlation metric:
    27 knowledge domains (computer, business, etc.)
    10,000 document samples from the Internet
    Each word: a 27-dimensional vector
    Correlation is used to measure distance
• WordNet: captures semantic relations between words (hypernymy, hyponymy, meronymy, and holonymy)
    WordNet: http://www.cogsci.princeton.edu/~wn/
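The two metrics above can be sketched as follows. This is a minimal illustration under stated assumptions: Pearson correlation is one plausible reading of "correlation", the mixing weight `alpha` is hypothetical, and the WordNet-based score is passed in as a plain number because the slides do not say how it is computed. The toy vectors use 4 dimensions instead of 27 to stay readable.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length domain vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if std_u == 0 or std_v == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (std_u * std_v)


def combined_similarity(corr, wordnet_score, alpha=0.5):
    """Linear combination of the two metrics (alpha is an assumed weight)."""
    return alpha * corr + (1 - alpha) * wordnet_score


# Hypothetical per-domain frequencies for two words over 4 toy domains.
vec_engine = [9.0, 1.0, 0.5, 0.2]
vec_motor = [8.0, 2.0, 0.4, 0.1]
score = combined_similarity(pearson(vec_engine, vec_motor), wordnet_score=0.7)
```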
Step 1: Query Augmentation
• Extract keywords from the context surrounding the user-selected text, using the semantic network
    The context is typically about 50 words
• Use a clustering algorithm to construct several queries on different topics
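A rough sketch of how clustered keywords could become several augmented queries. The greedy threshold grouping below is a simple stand-in, not the specialized clustering algorithm IntelliZap uses; the cosine similarity, the `threshold` value, and the 2-dimensional toy vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_queries(selected_text, keyword_vectors, threshold=0.8):
    """Group context keywords by topic and emit one query per group.

    Each keyword joins the first cluster whose seed (first member) is
    similar enough; otherwise it starts a new cluster. Every query is the
    selected text followed by one cluster's keywords.
    """
    clusters = []  # each cluster is a list of (word, vector) pairs
    for word, vec in keyword_vectors.items():
        for cluster in clusters:
            if cosine(vec, cluster[0][1]) >= threshold:
                cluster.append((word, vec))
                break
        else:
            clusters.append([(word, vec)])
    return [" ".join([selected_text] + [w for w, _ in c]) for c in clusters]


# Hypothetical keywords extracted from the context around "jaguar".
keywords = {
    "engine": [1.0, 0.1],
    "dealer": [0.9, 0.2],
    "habitat": [0.1, 1.0],
    "rainforest": [0.2, 0.9],
}
queries = build_queries("jaguar", keywords)
```

With these toy vectors the car-related and animal-related keywords fall into separate groups, so two queries on different topics are produced from the same selected word.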
Step 2: Search Engine Selection
• IntelliZap is a meta search engine
• It uses several general search engines (such as Google and AltaVista)
• For several domains, specific search engines (such as WebMD and FindLaw) are assigned a priori
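The a priori assignment can be pictured as a static lookup table. In this sketch the engine names come from the slide, while the domain labels ("medicine", "law"), the table layout, and the function `select_engines` are assumptions for illustration.

```python
# General-purpose engines are always queried.
GENERAL_ENGINES = ["Google", "AltaVista"]

# Assumed domain labels mapped to domain-specific engines.
DOMAIN_ENGINES = {
    "medicine": ["WebMD"],
    "law": ["FindLaw"],
}


def select_engines(query_domains):
    """Return the general engines plus any engines assigned to the domains."""
    engines = list(GENERAL_ENGINES)
    for domain in query_domains:
        for engine in DOMAIN_ENGINES.get(domain, []):
            if engine not in engines:  # avoid querying an engine twice
                engines.append(engine)
    return engines
```

For example, a query whose context was classified as medical would go to the two general engines and to WebMD, while a query with no recognized domain would go to the general engines only.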
Step 3: Results Reranking
• Several search engines return several lists of results
• Use the semantic network to calculate the distance between the result titles/summaries and the text/context
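A minimal sketch of the merge-and-rerank step. The semantic network itself is not available here, so a plain word-overlap score stands in for the semantic distance; that substitution, the result dictionary keys (`url`, `title`, `summary`), and the function names are assumptions for illustration.

```python
def word_overlap(a, b):
    """Jaccard overlap of word sets; a stand-in for semantic-network distance."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rerank(result_lists, text, context, similarity=word_overlap):
    """Merge result lists from several engines and sort by closeness to text+context."""
    reference = f"{text} {context}"
    merged = {}
    for results in result_lists:
        for result in results:
            # Keep only the first copy of a URL returned by several engines.
            merged.setdefault(result["url"], result)
    return sorted(
        merged.values(),
        key=lambda r: similarity(r["title"] + " " + r["summary"], reference),
        reverse=True,
    )


# Hypothetical results from two engines for the selected word "jaguar".
engine_a = [
    {"url": "u1", "title": "Jaguar cars", "summary": "luxury car dealer"},
    {"url": "u2", "title": "Jaguar habitat", "summary": "rainforest cat"},
]
engine_b = [
    {"url": "u2", "title": "Jaguar habitat", "summary": "rainforest cat"},
    {"url": "u3", "title": "Cooking pasta", "summary": "recipes"},
]
ranked = rerank([engine_a, engine_b], "jaguar", "rainforest habitat of the big cat")
```

Passing the similarity function as a parameter keeps the merging logic independent of how the distance is computed, so a semantic-network-based metric could be swapped in without changing `rerank`.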
Evaluation Method
• State of the art: no benchmark is available
• Subjects are recruited by an external agency
• Subjects do not know the objective of the experiments; they are just asked to run searches and evaluate the results
Experiment Results [two result slides; figures not preserved]
Concerns
• Privacy and security
    My Yahoo! holds a database with information on millions of users
    Users can be monitored through the queries they send!
• Relevance consistency
    Communication problem
End
• Thank you!
Backup Slides
Web Statistics
• "Accessibility of Information on the Web", Steve Lawrence, Nature 1999
Semantic Relations
• Hypernymy: the semantic relation of being superordinate or belonging to a higher rank or class
    Synonym: superordination
• Hyponymy: the semantic relation of being subordinate or belonging to a lower rank or class
    Synonym: subordination
• Meronymy: the semantic relation that holds between a part and the whole
    Synonym: part-to-whole relation
• Holonymy: the semantic relation that holds between a whole and its parts
    Synonym: whole-to-part relation
• More at http://dictionary.metor.com/wnet/
Clustering Algorithm
• Traditional clustering algorithms do not work here because of the large amount of noise and the small amount of information available
    50 context words represented in a 27-dimensional space
• Special clustering algorithm: high-dimensional clustering
    Performs recurrent clustering analysis (averaging over iterations)
    Refines the results statistically
Limitations of Web Search
• Freshness
• Coverage (only the publicly indexable web)
• Bias (sites are not indexed equally)
Several Systems (1)
• Inquirus2: meta search engine
• Watson Project (Jay Budzik, NWU): uses the contents of full documents being edited in MS Word or viewed in Explorer
• Remembrance Agent (Bradley Rhodes, MIT): software agent for just-in-time information retrieval
Several Systems (2)
• Outride: spun off from Xerox PARC in 2000 as GroupFire, renamed Outride in 2001
Reference
• [1] Graphics, Visualization and Usability Center, GVU’s 10th WWW User Survey, 1998