Dragon Star Program Course: Information Retrieval. Next-Generation Search Engines. ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois, Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~czhai, email@example.com
Outline • Overview of web search • Next generation search engines
Characteristics of Web Information • “Infinite” size (surface vs. deep Web) • Surface = static HTML pages • Deep = dynamically generated HTML pages (DB) • Semi-structured • Structured = HTML tags, hyperlinks, etc. • Unstructured = text • Different formats (PDF, Word, PS, …) • Multi-media (textual, audio, images, …) • High variance in quality (much junk) • “Universal” coverage (can be about any content)
General Challenges in Web Information Management • Handling the size of the Web • How to ensure completeness of coverage? • Efficiency issues • Dealing with or tolerating errors and low quality information • Addressing the dynamics of the Web • Some pages may disappear permanently • New pages are constantly created
“Free text” vs. “Structured text” • So far, we’ve assumed “free text” • Document = word sequence • Query = word sequence • Collection = a set of documents • Minimal structure … • But, we may have structures on text (e.g., title, hyperlinks) • Can we exploit the structures in retrieval?
Examples of Document Structures • Intra-doc structures (=relations of components) • Natural components: title, author, abstract, sections, references, … • Annotations: named entities, subtopics, markups, … • Inter-doc structures (=relations between documents) • Topic hierarchy • Hyperlinks/citations (hypertext)
Structured Text Collection • General question: how do we search such a collection? [Slide figure: a collection of documents organized under a general topic, divided into subtopic 1 through subtopic k]
Exploiting Intra-document Structures [Ogilvie & Callan 2003] • Think about the query-likelihood model: a document D consists of parts D1, D2, …, Dk (title, abstract, body parts, …); anchor text can be treated as a “part” of D • To generate a query word, first select a part Dj, then generate the word using Dj: p(w|D) = Σj p(Dj|D) p(w|Dj) • The “part selection” probability p(Dj|D) serves as the weight for Dj and can be trained using EM • Intuitively, we want to combine all the parts, but give more weight to some parts
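Below is a minimal sketch of this part-based query-likelihood scoring. The part names, weights, and Dirichlet smoothing settings are illustrative assumptions rather than the paper's exact configuration, and the part weights are simply given instead of trained with EM.

```python
# A minimal sketch of part-based query likelihood (after Ogilvie & Callan 2003).
# Part names, weights, and the uniform background model are assumptions.
import math
from collections import Counter

def part_lm(text, vocab_size=10000, mu=100):
    """Dirichlet-smoothed unigram language model for one document part."""
    counts = Counter(text.split())
    total = sum(counts.values())
    def p(word):
        # Uniform background probability 1/vocab_size is an assumption.
        return (counts[word] + mu * (1.0 / vocab_size)) / (total + mu)
    return p

def score(query, parts, part_weights):
    """log p(Q|D) with p(w|D) = sum_j p(D_j|D) p(w|D_j).

    part_weights plays the role of the part-selection probabilities
    p(D_j|D); in the paper they can be trained with EM, here they are given."""
    models = {name: part_lm(text) for name, text in parts.items()}
    log_p = 0.0
    for w in query.split():
        p_w = sum(part_weights[name] * models[name](w) for name in parts)
        log_p += math.log(p_w)
    return log_p

# Example: title and anchor text get more weight than the body.
doc = {"title": "haas business school",
       "anchor": "haas school berkeley",
       "body": "the haas school of business at uc berkeley"}
weights = {"title": 0.4, "anchor": 0.3, "body": 0.3}
print(score("haas business school", doc, weights))
```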
Exploiting Inter-document Structures • Document collection has links (e.g., Web, citations of literature) • Query: text query • Results: ranked list of documents • Challenge: how to exploit links to improve ranking?
Exploiting Inter-Document Links • What does a link tell us? • Description: the “anchor text” serves as “extra text”/a summary for the target doc • Utility: links indicate the utility of a doc (pages pointed to by many others are authorities; pages pointing to many others are hubs)
PageRank: Capturing Page “Popularity”[Page & Brin 98] • Intuitions • Links are like citations in literature • A page that is cited often can be expected to be more useful in general • PageRank is essentially “citation counting”, but improves over simple counting • Consider “indirect citations” (being cited by a highly cited paper counts a lot…) • Smoothing of citations (every page is assumed to have a non-zero citation count) • PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm [Page et al. 98] • Random surfing model: at any page, with probability α, randomly jump to a page; with probability (1-α), randomly pick a link to follow • Transition matrix: M = αI + (1-α)A, where Iij = 1/N (N = # pages) and A is the link-following transition matrix of the graph (d1, d2, d3, d4 in the slide’s example) • PageRank scores form the stationary (“stable”) distribution of this process, so we can ignore time • Computation: initial value p(d) = 1/N; iterate until convergence
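A compact power-iteration sketch of the random surfing model above; the example link graph is made up, and dangling pages are handled by a uniform jump (one common fix for the zero-outlink problem discussed on the next slide).

```python
# A minimal sketch of PageRank by power iteration: with probability alpha
# jump uniformly, with probability 1 - alpha follow a random outlink.
def pagerank(links, alpha=0.15, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}          # initial value p(d) = 1/N
    for _ in range(iters):
        new = {d: alpha / n for d in pages}  # random-jump mass
        for d, outs in links.items():
            if outs:                         # spread (1 - alpha) over outlinks
                share = (1 - alpha) * p[d] / len(outs)
                for t in outs:
                    new[t] += share
            else:                            # dangling page: jump uniformly
                for t in pages:
                    new[t] += (1 - alpha) * p[d] / n
        p = new
    return p

links = {"d1": ["d2", "d3"], "d2": ["d4"], "d3": ["d1", "d4"], "d4": []}
print(pagerank(links))
```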
PageRank in Practice • Interpretation of the damping factor (0.15): • Probability of a random jump • Smoothing the transition matrix (avoids zeros) • Normalization doesn’t affect ranking, leading to some variants • The zero-outlink problem: the p(di)’s don’t sum to 1 • One possible solution: a page-specific damping factor (= 1.0 for a page with no outlinks)
HITS: Capturing Authorities & Hubs[Kleinberg 98] • Intuitions • Pages that are widely cited are good authorities • Pages that cite many other pages are good hubs • The key idea of HITS • Good authorities are cited by good hubs • Good hubs point to good authorities • Iterative reinforcement…
The HITS Algorithm [Kleinberg 98] • Let A be the adjacency matrix of the link graph (d1, d2, d3, d4 in the slide’s example): Aij = 1 if di links to dj • Initial values: a(di) = h(di) = 1 • Iterate: a = A^T h (a page’s authority gathers its in-linkers’ hub scores); h = A a (a page’s hub score gathers its targets’ authority scores) • Normalize after each iteration: Σi a(di)² = Σi h(di)² = 1
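A minimal sketch of the HITS iteration on a toy link graph (the graph itself is made up for illustration):

```python
# HITS: iterate a = A^T h, h = A a, then normalize, on an adjacency list.
import math

def hits(links, iters=50):
    pages = list(links)
    a = {d: 1.0 for d in pages}   # authority scores, a(d_i) = 1
    h = {d: 1.0 for d in pages}   # hub scores, h(d_i) = 1
    for _ in range(iters):
        # a = A^T h: authority = sum of in-linkers' hub scores
        a = {d: sum(h[s] for s in pages if d in links[s]) for d in pages}
        # h = A a: hub score = sum of link targets' authority scores
        h = {d: sum(a[t] for t in links[d]) for d in pages}
        # Normalize so the squared scores sum to 1
        na = math.sqrt(sum(x * x for x in a.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {d: x / na for d, x in a.items()}
        h = {d: x / nh for d, x in h.items()}
    return a, h

links = {"d1": ["d3"], "d2": ["d1", "d3"], "d3": [], "d4": ["d1", "d3"]}
authorities, hubs = hits(links)
print(authorities, hubs)
```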
Basic Search Engine Technologies [Slide diagram: the user’s browser sends a query to the retriever, which returns results by looking up the (inverted) index and cached pages; the crawler fetches pages from the Web and feeds them to the indexer. Key concerns: coverage and freshness for the crawler; efficiency and error/spam handling for the indexer; precision and efficiency for the retriever]
Component I: Crawler/Spider/Robot • Building a “toy crawler” is easy • Start with a set of “seed pages” in a priority queue • Fetch pages from the web • Parse fetched pages for hyperlinks; add them to the queue • Follow the hyperlinks in the queue • A real crawler is much more complicated… • Robustness (server failure, traps, etc.) • Crawling courtesy (server load balancing, robot exclusion, etc.) • Handling file types (images, PDF files, etc.) • URL extensions (CGI scripts, internal references, etc.) • Recognizing redundant pages (identical copies and near-duplicates) • Discovering “hidden” URLs (e.g., truncated) • Crawling strategy is a main research topic (i.e., which page to visit next?)
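The “toy crawler” steps above can be sketched in a few lines of standard-library Python; everything listed under “a real crawler” (robot exclusion, courtesy, traps, file types, duplicate detection) is deliberately omitted here.

```python
# A minimal "toy crawler" sketch: seed queue, fetch, parse links, enqueue.
# Error handling is reduced to skipping failed fetches.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # a real crawler logs/retries
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative URLs
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)     # FIFO queue = breadth-first
    return pages

# pages = crawl(["http://example.com/"])
```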
Major Crawling Strategies • Breadth-first (most common(?); balances server load) • Parallel crawling • Focused crawling • Targets a subset of pages (e.g., all pages about “automobiles”) • Typically given a query • Incremental/repeated crawling • Can learn from past experience • Probabilistic models are possible • The major challenge remains maintaining “freshness” and good coverage with minimal resource overhead
Component II: Indexer • Standard IR techniques are the basis • Basic indexing decisions (stop words, stemming, numbers, special symbols) • Indexing efficiency (space and time) • Updating • Additional challenges • Recognizing spam/junk • Exploiting multiple features (PageRank, font information, structures, etc.) • How to support “fast summary generation”? • Google’s contributions: • Google File System: distributed file system • BigTable: column-based database • MapReduce: software framework for parallel computation • Hadoop: open-source implementation of MapReduce (mainly by Yahoo!)
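A minimal sketch of the core indexing step, with the basic indexing decisions on this slide (stop words, stemming) reduced to lowercasing and a tiny illustrative stop list; postings map each term to document IDs with term frequencies.

```python
# A minimal inverted-index sketch: term -> {doc_id: term frequency}.
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}   # illustrative list

def build_index(docs):
    """docs: dict doc_id -> text. Returns the inverted index."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token][doc_id] += 1            # posting: (doc, tf)
    return index

docs = {1: "the pagerank of a page", 2: "anchor text and page titles"}
index = build_index(docs)
print(dict(index["page"]))   # {1: 1, 2: 1}
```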
Google’s Basic Solutions [Slide diagram: a URL queue/list feeds the crawler; fetched source pages are cached (compressed) and indexed into an inverted index that uses many features, e.g., font, layout, and hypertext structure]
Component III: Retriever • Standard IR models are applicable but insufficient • Different information need (home page finding vs. topic-driven) • Documents have additional information (hyperlinks, markups, URL) • Information is often redundant and the quality varies a lot • Server-side feedback is often not feasible • Major extensions • Exploiting links (anchor text, link-based scoring) • Exploiting layout/markups (font, title field, etc.) • Spelling correction • Spam filtering • Redundancy elimination • In general, rely on machine learning to combine all kinds of features
Effective Web Retrieval Heuristics • High accuracy in home page finding can be achieved by • Matching the query with the title • Matching the query with the anchor text • Plus URL-based or link-based scoring (e.g., PageRank) • Imposing a conjunctive (“and”) interpretation of the query is often appropriate • Queries are generally very short (all words are necessary) • The size of the Web makes it likely that at least one page will match all the query words • Combine multiple features using machine learning
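One hedged way to render these heuristics as code: a conjunctive filter followed by a score that rewards title and anchor-text matches and adds a link-based prior. The field weights below are illustrative assumptions, not tuned values from any real system.

```python
# A sketch of home-page-finding heuristics: conjunctive filtering, then
# title/anchor matching plus a link-based prior. Weights are made up.
def home_page_score(query, doc):
    """doc: dict with 'title', 'anchor', 'body' strings and 'pagerank'."""
    q_terms = set(query.lower().split())
    body = set(doc["body"].lower().split())
    if not q_terms <= body:           # conjunctive ("and") interpretation
        return 0.0
    title = set(doc["title"].lower().split())
    anchor = set(doc["anchor"].lower().split())
    score = 2.0 * len(q_terms & title) + 1.5 * len(q_terms & anchor)
    return score + doc["pagerank"]    # link-based prior, cf. p(Q|D) p(D)
```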
Home/Entry Page Finding Evaluation Results (TREC 2001) • Query example: Haas Business School • Runs exploiting anchor text, structure, or links: MRR 0.774, %top10 88.3, %fail 4.8 (unigram query likelihood + link/URL prior, i.e., p(Q|D) p(D) [Kraaij et al. SIGIR 2002]) and MRR 0.772, %top10 87.6, %fail 4.8 • Lower-scoring runs: MRR 0.382, %top10 62.1, %fail 11.7 and MRR 0.340, %top10 60.7, %fail 15.9
Named Page Finding Evaluation Results (TREC 2002) • Query example: America’s century farms • Top runs: Okapi/BM25 + anchor text, and Dirichlet prior + title and anchor text (Lemur) [Ogilvie & Callan SIGIR 2003] • Compared against the best content-only run
Learning Retrieval Functions • Basic idea: • Given a query-doc pair (Q,D), define various kinds of features Fi(Q,D) • Examples of features: the number of overlapping terms, p(Q|D), PageRank of D, p(Q|Di), where Di may be anchor text or big-font text • Hypothesize p(R=1|Q,D) = s(F1(Q,D), …, Fn(Q,D); θ), where θ denotes the parameters • Learn θ by fitting the function s with training data (i.e., (d,q) pairs where d is known to be relevant or non-relevant to q) • Methods: • Early work: logistic regression [Cooper 92, Gey 94] • Recent work: Ranking SVM [Joachims 02], RankNet [Burges et al. 05]
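As a concrete instance of the “early work” setup, here is a minimal logistic-regression sketch: p(R=1|Q,D) = sigmoid(θ·F(Q,D)), fit by gradient ascent on relevance-judged pairs. The features and training data are made up for illustration.

```python
# Logistic regression on query-document features, fit by gradient ascent.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, lr=0.1, epochs=200):
    """examples: list of (features, label) with label 1 = relevant."""
    theta = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for f, y in examples:
            p = sigmoid(sum(t * x for t, x in zip(theta, f)))
            for i, x in enumerate(f):
                theta[i] += lr * (y - p) * x   # gradient of log-likelihood
    return theta

# Features F_i(Q,D), e.g., [# overlapping terms, log p(Q|D), PageRank]
data = [([3, -2.1, 0.8], 1), ([0, -7.5, 0.1], 0),
        ([2, -3.0, 0.5], 1), ([1, -6.0, 0.9], 0)]
theta = train(data)
rank_score = lambda f: sigmoid(sum(t * x for t, x in zip(theta, f)))
print(rank_score([3, -2.5, 0.7]))
```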
Learning to Rank • Advantages • Can combine multiple features (helps improve accuracy and combat web spam) • Can reuse all past relevance judgments (self-improving) • Problems • Not much guidance on feature generation (relies on traditional retrieval models) • Note: all current Web search engines use some kind of learning algorithm to combine many features
Limitations of the Current Search Engines • Limited query language • Syntactic querying (sense disambiguation?) • Can’t express multiple search criteria (readability?) • Limited understanding of document contents • Bag of words & keyword matching (sense disambiguation?) • Heuristic query-document matching: mostly TF-IDF weighting • No guarantee of optimality • Machine learning can combine many features, but content matching remains the most important component in scoring
Limitations of the Current Search Engines (cont.) • Lack of user/context modeling • With the same query, different users get the same results (poor user modeling) • The same user may use the same query to find different information at different times • Inadequate support for multiple modes of information access • Passive search support: a user must take the initiative (no recommendation) • Static navigation support: no dynamically generated links • Calls for more integration of search, recommendation, and navigation • Lack of interaction • Lack of task support
Towards Next-Generation Search Engines • Better support for query formulation • Allow querying from any task context • Query by examples • Automatic query generation (recommendation) • Better search accuracy • More accurate information need understanding (more personalization and context modeling) • More accurate document content understanding (more powerful content analysis, sense disambiguation, sentiment analysis, …) • More complex retrieval criteria • Consider multiple utility aspects of information items (e.g., readability, quality, communication cost) • Consider collective value of information items (context-sensitive ranking)
Towards Next-Generation Search Engines (cont.) • Better result presentation • Better organization of search results to facilitate navigation • Better summarization • More effective and robust retrieval models • Automatic parameter tuning • More interactive search • More task support
Looking Ahead… • More user modeling • Personalized search • Community search engines (collaborative information access) • More content analysis and domain modeling • Vertical search engines • More in-depth (domain-specific) natural language understanding • Text mining • More accurate retrieval models (life-time learning) • Going beyond search • Towards full-fledged information access: integration of search, recommendation, and navigation • Towards task support: putting information access in the context of a task
Summary • The Web provides many challenges and opportunities for text information management • Search engine technology = crawling + retrieval models + machine learning + software engineering • The current generation of search engines is limited in user modeling, content understanding, retrieval models, … • The next generation of search engines will likely move toward personalization, domain-specific vertical search engines, collaborative search, task support, …
What You Should Know • Special characteristics of Web information (as compared with ordinary text collection) • Two kinds of structures of a text collection (intra-doc and inter-doc) • Basic ideas of PageRank and HITS • How a web search engine works • Limitations of current search engines