
Presentation Transcript


    1. Web Search

    Lecturer homepage: http://net.pku.edu.cn/~yhf, yhf AT net.pku.edu.cn. CCF-ADL at Zhengzhou University, June 25-27, 2010. Thanks to Chengxiang Zhai for sharing many of these slides.

    2. Outline

    - Overview of web search
    - Next-generation search engines

    3. Characteristics of Web Information

    - “Infinite” size (surface vs. deep Web)
      - Surface = static HTML pages
      - Deep = dynamically generated HTML pages (database-backed)
    - Semi-structured
      - Structured = HTML tags, hyperlinks, etc.
      - Unstructured = text
    - Different formats (PDF, Word, PS, ...)
    - Multimedia (text, audio, images, ...)
    - High variance in quality (much junk)
    - “Universal” coverage (can be about any content)

    4. General Challenges in Web Information Management

    - Handling the size of the Web
      - How to ensure completeness of coverage?
      - Efficiency issues
    - Dealing with or tolerating errors and low-quality information
    - Addressing the dynamics of the Web
      - Some pages may disappear permanently
      - New pages are constantly created

    5. “Free text” vs. “Structured text”

    - So far, we’ve assumed “free text”:
      - Document = word sequence
      - Query = word sequence
      - Collection = a set of documents
      - Minimal structure
    - But we may have structure on text (e.g., title, hyperlinks)
    - Can we exploit the structures in retrieval?

    6. Examples of Document Structures

    - Intra-document structures (= relations of components)
      - Natural components: title, author, abstract, sections, references, ...
      - Annotations: named entities, subtopics, markups, ...
    - Inter-document structures (= relations between documents)
      - Topic hierarchy
      - Hyperlinks/citations (hypertext)

    7. Structured Text Collection

    [Figure: a general topic divided into subtopics 1 ... k]
    - General question: how do we search such a collection?

    8. Exploiting Intra-document Structures [Ogilvie & Callan 2003]

    [Figure: document D decomposed into parts D1 ... Dk, e.g., title, abstract, body parts]
    - Intuitively, we want to combine all the parts, but give more weight to some parts
    - Think about the query-likelihood model...
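The part-weighted query-likelihood idea can be sketched as a mixture of per-field language models. This is a minimal sketch, not the Ogilvie & Callan implementation: the field names, weights, and the flat `epsilon` smoothing are illustrative assumptions.

```python
import math
from collections import Counter

def field_lm(text):
    """Maximum-likelihood unigram language model for one document part."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(query, fields, weights, epsilon=1e-6):
    """log p(Q|D), with p(w|D) = sum over parts f of weight_f * p(w | part f)."""
    models = {f: field_lm(t) for f, t in fields.items()}
    s = 0.0
    for w in query.lower().split():
        p = sum(weights[f] * models[f].get(w, 0.0) for f in fields)
        s += math.log(p + epsilon)   # epsilon stands in for proper smoothing
    return s

doc = {"title": "web search engines",
       "body": "crawling indexing and retrieval for web search"}
w = {"title": 0.7, "body": 0.3}   # give the title more weight, as the slide suggests
print(score("web search", doc, w))
```

Terms that appear in the highly weighted title field dominate the score, which is the intuition the slide points at.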

    9. Exploiting Inter-document Structures

    - Document collection has links (e.g., the Web, citations in the literature)
    - Query: text query
    - Results: ranked list of documents
    - Challenge: how to exploit links to improve ranking?

    10. Exploiting Inter-Document Links

    - What does a link tell us?
      - Description (“anchor text”)
      - Links indicate the utility of a doc

    11. PageRank: Capturing Page “Popularity” [Page & Brin 98]

    - Intuitions
      - Links are like citations in the literature
      - A page that is cited often can be expected to be more useful in general
    - PageRank is essentially “citation counting”, but improves over simple counting
      - Considers “indirect citations” (being cited by a highly cited paper counts a lot...)
      - Smoothing of citations (every page is assumed to have a non-zero citation count)
    - PageRank can also be interpreted as random surfing (thus capturing popularity)

    12. The PageRank Algorithm (Page et al. 98)

    [Figure: example graph over pages d1 ... d4 and its “transition matrix”]
    - Random surfing model: at any page,
      - with probability α, jump to a random page (jump component I_ij = 1/N, where N = # pages)
      - with probability 1 - α, pick a random out-link to follow
    - Initial value p(d) = 1/N; iterate until convergence
    - The result is the stationary (“stable”) distribution, so we can ignore time
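The iteration above can be sketched as power iteration over a small graph. The example graph and `alpha = 0.15` are illustrative assumptions; dangling pages are handled by spreading their mass uniformly, one of several possible variants.

```python
def pagerank(links, alpha=0.15, iters=100):
    """links: dict page -> list of out-linked pages. Returns stationary p(d)."""
    pages = list(links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}            # initial value p(d) = 1/N
    for _ in range(iters):
        nxt = {d: alpha / n for d in pages}    # random jump: alpha * 1/N
        for d in pages:
            out = links[d]
            if not out:                        # dangling page: spread its mass uniformly
                for t in pages:
                    nxt[t] += (1 - alpha) * p[d] / n
            else:                              # follow a random out-link
                for t in out:
                    nxt[t] += (1 - alpha) * p[d] / len(out)
        p = nxt
    return p

g = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": ["d1"], "d4": ["d3"]}
pr = pagerank(g)
print(pr)
```

The scores form a probability distribution (they sum to 1), and pages with more incoming link mass, like d1 here, rank above pages with none, like d4.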

    13. PageRank in Practice

    - Interpretation of the damping factor α (≈ 0.15): probability of a random jump
      - Smooths the transition matrix (avoids zeros)
    - Normalization doesn’t affect ranking, leading to some variants
    - The zero-outlink problem: the p(di)’s don’t sum to 1
      - One possible solution: a page-specific damping factor (α = 1.0 for a page with no out-links)

    14. HITS: Capturing Authorities & Hubs [Kleinberg 98]

    - Intuitions
      - Pages that are widely cited are good authorities
      - Pages that cite many other pages are good hubs
    - The key idea of HITS
      - Good authorities are cited by good hubs
      - Good hubs point to good authorities
      - Iterative reinforcement...

    15. The HITS Algorithm [Kleinberg 98]

    [Figure: example graph over pages d1 ... d4 and its “adjacency matrix” A]
    - Initial values: a(di) = h(di) = 1
    - Iterate: a(d) = sum of h over pages linking to d; h(d) = sum of a over pages d links to
    - Normalize after each iteration
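A minimal sketch of the HITS iteration with L2 normalization; the example graph in the usage line is made up for illustration.

```python
import math

def hits(links, iters=50):
    """links: dict page -> list of pages it points to. Returns (authority, hub) scores."""
    pages = list(links)
    a = {d: 1.0 for d in pages}   # initial values a(di) = 1
    h = {d: 1.0 for d in pages}   # initial values h(di) = 1
    for _ in range(iters):
        # authority: sum of hub scores of pages linking to d
        a = {d: sum(h[s] for s in pages if d in links[s]) for d in pages}
        # hub: sum of (updated) authority scores of pages d links to
        h = {d: sum(a[t] for t in links[d]) for d in pages}
        # normalize (L2) so the scores don't blow up
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {d: v / na for d, v in a.items()}
        h = {d: v / nh for d, v in h.items()}
    return a, h

a, h = hits({"d1": ["d3"], "d2": ["d3", "d4"], "d3": [], "d4": []})
print(a, h)
```

In this toy graph d3, cited by both other linking pages, emerges as the top authority, and d2, which points at both cited pages, as the top hub.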

    16. Basic Search Engine Technologies

    [Figure: overall search engine architecture connecting the Web and the user]

    17. Component I: Crawler/Spider/Robot

    - Building a “toy crawler” is easy
      - Start with a set of “seed pages” in a priority queue
      - Fetch pages from the web
      - Parse fetched pages for hyperlinks; add the hyperlinks to the queue
      - Follow the hyperlinks in the queue
    - A real crawler is much more complicated...
      - Robustness (server failures, traps, etc.)
      - Crawling courtesy (server load balancing, robot exclusion, etc.)
      - Handling file types (images, PDF files, etc.)
      - URL extensions (CGI scripts, internal references, etc.)
      - Recognizing redundant pages (identical pages and duplicates)
      - Discovering “hidden” URLs (e.g., truncated ones)
    - Crawling strategy is a main research topic (i.e., which page to visit next?)
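The “toy crawler” steps above can be sketched as follows. `FAKE_WEB` and `fetch_page` are stand-ins for real HTTP fetching and HTML link extraction, and the priority queue is simplified to a FIFO (i.e., breadth-first order).

```python
from collections import deque

FAKE_WEB = {  # url -> hyperlinks found on that page (a made-up stand-in for the real Web)
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": [],
    "d.html": ["a.html"],
}

def fetch_page(url):
    """Stand-in for HTTP fetch + HTML parsing: returns the page's hyperlinks."""
    return FAKE_WEB.get(url, [])

def crawl(seeds, max_pages=100):
    queue = deque(seeds)          # seed pages in the (simplified) queue
    seen = set(seeds)             # recognize already-queued URLs (redundancy check)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)                  # "fetch" the page
        for link in fetch_page(url):         # parse for hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)           # add them to the queue
    return visited

print(crawl(["a.html"]))  # -> ['a.html', 'b.html', 'c.html', 'd.html']
```

None of the real-crawler complications listed on the slide (courtesy delays, robots.txt, traps, file types) appear here; that is where most of the engineering effort goes.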

    18. Major Crawling Strategies

    - Breadth-first (most common(?); balances server load)
    - Parallel crawling
    - Focused crawling
      - Targets a subset of pages (e.g., all pages about “automobiles”)
      - Typically given a query
    - Incremental/repeated crawling
      - Can learn from past experience
      - Probabilistic models are possible
    - The major challenge remains maintaining “freshness” and good coverage with minimum resource overhead

    19. Component II: Indexer

    - Standard IR techniques are the basis
      - Basic indexing decisions (stop words, stemming, numbers, special symbols)
      - Indexing efficiency (space and time)
      - Updating
    - Additional challenges
      - Recognizing spam/junk
      - Exploiting multiple features (PageRank, font information, structures, etc.)
      - How to support “fast summary generation”?
    - Google’s contributions:
      - Google File System: distributed file system
      - Bigtable: column-based database
      - MapReduce: software framework for parallel computation
      - Hadoop: open-source implementation of MapReduce (mainly by Yahoo!)
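A minimal inverted-index build illustrating one of the basic indexing decisions mentioned above (stop-word removal); the stop list and documents are toy examples.

```python
from collections import defaultdict

STOP = {"the", "a", "of", "and", "is"}   # tiny illustrative stop-word list

def build_index(docs):
    """docs: dict doc_id -> text. Returns term -> sorted postings list of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP:        # basic indexing decision: drop stop words
                index[token].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {1: "the web search engine", 2: "search and ranking", 3: "web crawling"}
idx = build_index(docs)
print(idx["search"])  # -> [1, 2]
```

A production indexer adds the pieces the slide lists on top of this core: stemming, positional information for phrase queries and summaries, compression, and incremental updates.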

    20. Google’s Basic Solutions

    - URL queue/list
    - Cached source pages (compressed)
    - Inverted index
    - Hypertext structure
    - Use many features, e.g., font, layout, ...

    21. Component III: Retriever

    - Standard IR models are applicable but insufficient
      - Different information needs (home page finding vs. topic-driven)
      - Documents have additional information (hyperlinks, markups, URL)
      - Information is often redundant and quality varies a lot
      - Server-side feedback is often not feasible
    - Major extensions
      - Exploiting links (anchor text, link-based scoring)
      - Exploiting layout/markups (font, title field, etc.)
      - Spelling correction
      - Spam filtering
      - Redundancy elimination
    - In general, rely on machine learning to combine all kinds of features

    Notes: markup is text added to the data of a document to convey information about it. Machine learning is a discipline concerned with algorithms that let computers improve their behavior based on empirical data, e.g., from sensors or databases; a major focus is automatically recognizing complex patterns and making decisions from data. It is closely related to artificial intelligence, probability theory and statistics, data mining, pattern recognition, adaptive control, and theoretical computer science.

    22. Effective Web Retrieval Heuristics

    - High accuracy in home page finding can be achieved by
      - Matching the query with the title
      - Matching the query with the anchor text
      - Plus URL-based or link-based scoring (e.g., PageRank)
    - Imposing a conjunctive (“and”) interpretation of the query is often appropriate
      - Queries are generally very short (all words are necessary)
      - The size of the Web makes it likely that at least one page matches all the query words
    - Combine multiple features using machine learning
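The conjunctive (“and”) interpretation amounts to intersecting posting lists: a document qualifies only if it contains every query word. The tiny index below is a made-up example.

```python
def conjunctive_match(query_terms, index):
    """AND semantics: return only the documents containing every query term."""
    postings = [set(index.get(t, ())) for t in query_terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical postings for the slide-23 example query "Haas Business School"
index = {"haas": [1, 3], "business": [1, 2, 3], "school": [1, 2]}
print(conjunctive_match(["haas", "business", "school"], index))  # -> [1]
```

Because Web queries are short and the collection is huge, demanding all terms rarely empties the result set, while it sharply cuts false matches.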

    23. Home/Entry Page Finding Evaluation Results (TREC 2001)

    - Query example: Haas Business School
    - Evaluation measure: Mean Reciprocal Rank (MRR)
      - Reciprocal Rank for one query: RR = 1/r, where r is the rank of the first relevant (target) result; RR = 0 if no relevant result is returned
      - MRR = (1/n) * sum_{q=1..n} (1 / rank_q), averaged over the n queries
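Mean Reciprocal Rank, the measure used in this evaluation, is straightforward to compute:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Each entry is the rank of the first relevant result for one query,
    or None if no relevant result was returned (RR = 0 for that query)."""
    rr = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# Four hypothetical queries: targets found at ranks 1, 2, (not found), and 4
print(mean_reciprocal_rank([1, 2, None, 4]))  # -> (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

MRR fits home page finding well because there is typically exactly one correct answer per query, so only the rank of that first hit matters.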

    24. Named Page Finding Evaluation Results (TREC 2002)

    - Query example: America’s century farms
    - Dirichlet prior + title and anchor text (Lemur) [Ogilvie & Callan SIGIR 2003]
    - Okapi/BM25 + anchor text
    - Best content-only run

    25. Learning Retrieval Functions

    - Basic idea:
      - Given a query-document pair (Q, D), define various features Fi(Q, D)
      - Example features: number of overlapping terms, p(Q|D), PageRank of D, p(Q|Di) where Di may be anchor text or big-font text
      - Hypothesize p(R=1|Q,D) = s(F1(Q,D), ..., Fn(Q,D), θ), where θ are parameters
      - Learn θ by fitting the function s to training data, i.e., (d, q) pairs where d is known to be relevant or non-relevant to q
    - Methods:
      - Early work: logistic regression [Cooper 92, Gey 94]
      - Recent work: Ranking SVM [Joachims 02], RankNet [Burges et al. 05]
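The setup above can be sketched with the early logistic-regression approach: s is a sigmoid over a weighted feature sum, and θ is fit to relevance labels. The feature values and labels below are fabricated purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit p(R=1|Q,D) = sigmoid(theta . F(Q,D)) on labeled (query, doc) feature vectors
    by stochastic gradient ascent on the log-likelihood."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for f, label in zip(X, y):
            p = sigmoid(sum(t * v for t, v in zip(theta, f)))
            for j in range(len(theta)):
                theta[j] += lr * (label - p) * f[j]   # gradient step
    return theta

# Hypothetical features per (Q,D) pair: [bias, term overlap, PageRank of D]
X = [[1, 0.9, 0.8], [1, 0.8, 0.2], [1, 0.1, 0.9], [1, 0.0, 0.1]]
y = [1, 1, 0, 0]   # relevance judgments
theta = train_logistic(X, y)
scores = [sigmoid(sum(t * v for t, v in zip(theta, f))) for f in X]
print(scores)
```

At retrieval time, documents are simply ranked by the learned score; the same recipe extends to the later pairwise methods (Ranking SVM, RankNet), which optimize the *order* of pairs rather than pointwise labels.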

    26. Learning to Rank

    - Advantages
      - Can combine multiple features (helps improve accuracy and combat web spam)
      - Can reuse all past relevance judgments (self-improving)
    - Problems
      - Not much guidance on feature generation (relies on traditional retrieval models)
    - All current Web search engines use some kind of learning algorithm to combine many features

    27. Next-Generation Search Engines

    28. Limitations of the Current Search Engines

    - Limited query language
      - Syntactic querying (no word sense disambiguation, WSD)
      - Can’t express multiple search criteria (e.g., readability)
    - Limited understanding of document contents
      - Bag of words & keyword matching (sense disambiguation?)
    - Heuristic query-document matching: mostly TF-IDF weighting
      - No guarantee of optimality
    - Machine learning can combine many features, but content matching remains the most important component in scoring

    29. Limitations of the Current Search Engines (cont.)

    - Lack of user/context modeling
      - With the same query, different users get the same results (poor user modeling)
      - The same user may use the same query to find different information at different times
    - Inadequate support for multiple modes of information access
      - Passive search support: a user must take the initiative (no recommendation)
      - Static navigation support: no dynamically generated links
      - Consider more integration of search, recommendation, and navigation
    - Lack of interaction
    - Lack of task support

    30. Towards Next-Generation Search Engines

    - Better support for query formulation
      - Allow querying from any task context
      - Query by example
      - Automatic query generation (recommendation)
    - Better search accuracy
      - More accurate understanding of the information need (more personalization and context modeling)
      - More accurate understanding of document content (more powerful content analysis, sense disambiguation, sentiment analysis, ...)
    - More complex retrieval criteria
      - Consider multiple utility aspects of information items (e.g., readability, quality, communication cost)
      - Consider the collective value of information items (context-sensitive ranking)

    31. Towards Next-Generation Search Engines (cont.)

    - Better result presentation
      - Better organization of search results to facilitate navigation
      - Better summarization
    - More effective and robust retrieval models
      - Automatic parameter tuning
    - More interactive search
    - More task support

    32. Looking Ahead….

    - More user modeling
      - Personalized search
      - Community search engines (collaborative information access)
    - More content analysis and domain modeling
      - Vertical search engines
      - More in-depth (domain-specific) natural language understanding
      - Text mining
    - More accurate retrieval models (lifetime learning)
    - Going beyond search
      - Towards full-fledged information access: integration of search, recommendation, and navigation
      - Towards task support: putting information access in the context of a task

    33. Summary

    - The Web provides many challenges and opportunities for text information management
    - Search engine technology ≈ crawling + retrieval models + machine learning + software engineering
    - The current generation of search engines is limited in user modeling, content understanding, retrieval models, ...
    - The next generation of search engines will likely move toward personalization, domain-specific vertical search engines, collaborative search, task support, ...
