260 likes | 436 Views
Web Search – Summer Term 2006 VI. Web Search - Ranking. (c) Wolfgang Hürst, Albert-Ludwigs-University. General Web Search Engine Architecture. CLIENT. WWW. PAGE REPOSITORY. QUERIES. RESULTS. QUERY ENGINE. RANKING. CRAWLER(S). COLLECTION ANALYSIS MOD. INDEXER MODULE. CRAWL CONTROL.
E N D
Web Search – Summer Term 2006VI. Web Search -Ranking (c) Wolfgang Hürst, Albert-Ludwigs-University
General Web Search Engine Architecture CLIENT WWW PAGE REPOSITORY QUERIES RESULTS QUERY ENGINE RANKING CRAWLER(S) COLLECTION ANALYSIS MOD. INDEXER MODULE CRAWL CONTROL INDEXES UTILITY STRUCTURE TEXT USAGE FEEDBACK (CF. [1] FIG. 1)
1st Generation Web Search Engines Use only "on page" text data - word frequency, language 1995-1997 (AltaVista, Excite, Lycos, etc.) - Extended Boolean model - Matches: Exact, prefix, phrase, ... - Operators: AND, OR, AND NOT, NEAR, ... - Fields: TITLE, URL, HOST, ... - AND is somewhat easier to implement, maybe preferable as default for short queries- Ranking - TF like factors: TF, explicit keywords, words in the title, explicit emphasis (headers), etc. - IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
INFORMATION INFORMATION NEED DATA / DOCUMENTS QUERY Classic IR vs. Web Search Initial problem is similar to traditional IR ... The no. of users ishuge. Very huge. The web is huge.Very huge. Big variety in users Big variety in data Users don't cooperate (short queries, ...) Doc. authors don't cooperate (spam,...) .. but basic conditions & characteristics differ significantly
Classic IR vs. Web Search Classic IR: Generally a 2-step process 1. Select subset of relevant documents (based on term appearance, etc.) 2. Rank / sort by estimated relevance (based on term frequency, etc.) Problems with web search: - Size of the WWW - Heterogeneity (length, quality, format, ...) - Not always self descriptive (e.g. search engine) - Can be manipulated (spam) Therefore: Pure content based ranking critical
The evolution of search engines 1st generation: Use only "on page", text data- Word frequency, language 1995-1997 (AltaVista, Excite, Lycos, etc.) 2nd gen.: Use off-page, web-specific data- Link (or connectivity) analysis- Click-through data (what results people click on)- Anchor-text (how people refer to a page) From 1998 (made popular by Google but everyone now) TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
Some 2nd generation web IR ranking criteria - Extended Boolean model - Content-based techniques (variants of term vector or probabilistic model) - Ad-hoc factors (anti-porn heuristics, publication/location data, url depth, ...) - Human annotations (in Yahoo, ODP, etc.) - Popularity: Click-through feedback ("People vote with their clicks", DirectHit) - Anchor text - Connectivity-based techniques TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002
The Evolution of Search Engines Rough summary: First web search engines: Like tradition IR, i.e. 1. Search: Textprocessing to find pages 2. Ranking: Analysis of term distributions (and some other things) New / current web search engines: 1. Search: Textprocessing to find pages (incl. anchor text and some other things) 2. Ranking: Link analysis (and some other things)
The Web Graph What do we know about the link structure of the web? SOURCE: [3]
Considering the link structure Idea: Use link structure to get a content-independent feature to estimate relevanceof a web page Why? Because a link from page A to page B- Often represents a relationship of their contents- Can be interpreted as a reference or recommendation Compare citations in scientific literature- Articles with lots of citations: Considered to have a high scientific value(Bibliometric laws)
Considering the link structure First approach: Use the number of backlinks as a measure for a page's relevance Problem: web vs. scientific publications- Publications: rigorous review process, high quality, well-defined length and style, same goal, similar number of citations, etc.- Web pages: no quality control, no publication costs, can be manipulated (self-citations), large variance regarding quality, number of links, intended usage, length and style, etc.(Note: Compare differences to classic IR) Idea: Consider reputation and credibility of a link, i.e. the quality and importance of a web page
Simple PageRank - Definition Goal: Measure for the importance and quality of a web page Definition: Simple PageRank- Assume the web is a strongly connected graph- Define the Simple PageRank R of a page p as R(q) R(p) = Nq qBp
Simple PageRank - Example R(2) = 0,286 2 R(1) = 0,286 R(3) = 0,143 1 3 R(5) = 0,143 R(4) = 0,143 5 4 SOURCE: [3]
R(q) R(p) = Nq qBp Simple PageRank - Calculation Mathematical representation:If R = mx1 vector [R(1), R(2), ..., R(m)]and A = matrix with aij = 1 / N(j) if link from i to j exists = 0 otherwise Then R = A R Calculation:- Set of linear equations, but direct calculation of a solution too expensive- But: R is eigenvector of A (with eigenvalue 1)- Hence: Efficient approaches for iterative calculation exist (e.g. power iteration)
R(q) R(p) = Nq qBp Simple PageRank - Calculation Mathematical representation:If R = mx1 vector [R(1), R(2), ..., R(m)]and A = matrix with aij = 1 / N(j) if link from i to j exists = 0 otherwise Then R = A R Convergence:- Yes! (see literature for a proof)- Number of iterations can be reduced by - using "good" starting values - interruption as soon as the ranking order will (most likely) not change anymore But: Only true for strongly connected graphs
Problems if graph is not strongly connected Page Sink = group of pages that link to each other but have no outside links Page Leak = single page with no outside links (special case of a page sink)
R(q) R(p) = Nq qBp What happens with Leaks? 2 1 3 5 4 Possible Solutions:- Remove all Leaks before the PageRank is calculated- Or: Insert back links for each forward link to a Leak
R(q) R(p) = Nq qBp What happens with Sinks? 2 1 3 5 4 Possible Solutions:- Introduce a damping factor d
R(2) = 0,142 2 R(1) = 0,154 R(3) = 0,101 1 3 R(5) = 0,290 R(4) = 0,313 5 4 PageRank - Definition and Example Definition: PageRank R of page p Example: SOURCE: [3]
Intuitive Interpretation Link: Recommendation and quality statement - Important pages link to important pages - Many links from important pages: Target is important as well - Importance influences other pages and depends on them, hence: recursive calculation (relatively robust against spam)
R(q) R(p) = Nq qBp Another Intuitive Interpretation The Random Surfer Model: A random surfer performs a random walk on the web graph - Surfer on page q -> probability to click on any of the links = 1/Nq - Probability to get to page p: Sum of the probabilities of all links q in Bq times the probability of being on page q - The damping factor d represents the proba- bility of going to a random new page
The Random Surfer Model Comments to the Random Surfer Model: Easy, illustrative model to explain the involved mathematics, but many limitations: - Does not model the back button - Jumps and selection of links usually not random or uniformly distributed - Random walk can hardly be seen as a good search strategy
PageRank - First conclusion PageRank = Quality or importance measure for a web page Harder to manipulate than pure backlinks Content and query independent Ranking:-> The higher the PageRank,-> the higher a page's quality or importance,-> the higher the page's relevance Note: Other applications exist (e.g. crawling)
PageRank in the 1st Google engine Good for "common cases", not necessarily when searching for a particular information In practice: Combination with other features Example: First Google search engine [2] - Hits classified in different types (anchor, title, ...), represented by a vector of type-weights- Term distribution / no. of hits, represented by a vector of count-weights- Scalar product of both vectors = IR score- Final rank = combination of IR score and PageRank
PageRank today PageRank and web search: - Most likely used by most search engines (in some way) Other interesting (research) problems: - More efficient calculation- Storage and speed- Update- Fighting spam- Make better usage of the structure of the web- Consider the continuous change of the web's structure- Better study of its characteristics (practical + theor.): - Stability and sensitivity - Influence of the linkage - Influence of different parameters- Personalization and specialization
References - Indexing [1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, AUG. 2001Chapter 5 (Ranking and Link Analysis) [2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998Chapter 2 and 4.5.1 [3] BORDER, KUMAR, MAGHOUL, RAGHAVAN, RAJAGOPALAN, STATA, TOMKINS, WIENER: "GRAPH STRUCTURE IN THE WEB", WWW 2000 [4] PAGE, BRIN, MOTWANI, WINOGRAD: "THE PAGERANK CITATION RANKING: BRINGING ORDER TO THE WEB", STANFORD TECHNICAL REPORT [5] HAVELIWALA: "TOPIC-SENSITIVE PAGERANK", WWW 2002