1 / 16

Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.)

Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.). (c) Wolfgang Hürst, Albert-Ludwigs-University. The Evolution of Search Engines. 1st generation : Use only "on page", text data - Word frequency, language 1995-1997 (AltaVista, Excite, Lycos, etc.)

Download Presentation

Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web Search – Summer Term 2006VI. Web Search -Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. The Evolution of Search Engines 1st generation: Use only "on page", text data- Word frequency, language 1995-1997 (AltaVista, Excite, Lycos, etc.) 2nd gen.: Use off-page, web-specific data- Link (or connectivity) analysis- Click-through data (what results people click on)- Anchor-text (how people refer to a page) From 1998 (made popular by Google but everyone now) PageRank [2], introduced by Brin and Page, used by Google HITS [3], introduced by Kleinberg (used by Teoma?) TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

  3. Link-based ranking: HITS Motivation (compare PageRank): Broad-topic queries: deliver (too) large set of relevant results Therefore: Ranking based on the authority of a web page (cf. PageRank: quality / importance) Link: Interpreted as a conferral of authority Goal: Find pages with high authority (balance between relevance and popularity)

  4. Link-based ranking: HITS (cont.) Basic idea: Consider sub-graph of the web graph that contains as much relevant pages as possible Analyze the graph's link structure to find: Authorities = the most authoritative or definitive subset of relevant pages (for ranking) Hubs = Pages pointing to many related authorities (for their identification)

  5. www.google.com www.teoma.com HUBS AUTHORITIES searchenginewatch.com www.altavista.com dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Search_Engines_and_Directories/ www.alltheweb.com Authorities and Hubs - Example Example: Query “search engine”

  6. Authorities and Hubs - Basic idea Approach:- Generate a query-dependent sub-graph- Recursively calculate hubs and authorities Assume S is the set of pages in this sub-graph, then S should be- rather small- contain lots of relevant pages- contain the most important authorities Basic idea to generate such a sub-graph:- Get initial root set based on any IR criteria- Include the local neighborhood of this set

  7. Authorities and Hubs - Base set GIVEN:- QUERY Q- TEXT-BASED SEARCH ENGINE SE- CONSTANTS T AND D (NAT. NUMBERS)- SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q ALGORITHM TO CALCULATE SUBGRAPH S(Q) S(Q) := R(Q)FOR EACH PAGE P IN R(Q) T+(P) := SET OF PAGES LINKED BY P T-(P) := SET OF PAGES LINKING TO P ADD ALL PAGES FROM T+(P) TO S(Q)IF |T-(P)| < DTHEN ADD ALL PAGES FROM T-(P) TO S(Q)ELSE ADD RANDOM SUBSET OF T-(P) TO S(Q)

  8. Query-dependent base set - Comments Why only use a sub-graph?- Advantage of query dependence- Reduces processing time (online calculation!) Why not just take the root set?- Appearance of query terms does not necessarily represent relevance (or authority)- Larger network is needed for link analysis In original work: Heuristics for special cases- Remove intrinsic links, i.e. links from the same domain (navigational links, etc.)- Consider only a certain number of links from one domain to a page p (to avoid spamming)

  9. Authorities: I-Operation Hubs: O-Operation Calculating Hubs and Authorities Obviously, there exists a mutual reinforcing relationship between Hubs and Authorities:- A good Hub links to many good Authorities- A good Authority is linked by many Hubs Hence, use an iterative algorithm to estimate a Hub and Authority value, respectively

  10. Authorities: I-Operation Hubs: O-Operation Calculating Hubs and Authorities q1 q1 q2 PAGE p PAGE p q2 q3 q3

  11. Calculating Hubs and Authorities GIVEN:- SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q))- CONSTANT NUMBER K ALGORITHM TO CALCULATE HUBS AND AUTHOR. X0 := (1, 1, ..., 1) Y0 := (1, 1, ..., 1) FOR i = 1, ..., K CALCULATE NEW WEIGHTS Xi BY APPLYING THE I-OPERATION TO Xi-1, Yi-1 CALCULATE NEW WEIGHTS Yi BY APPLYING THE O-OPERATION TO Xi, Yi-1 NORMALIZE Xi AND Yi

  12. Calculating Hubs and Authorities Convergence: see lit. Basic idea:

  13. PageRank vs. HITS PageRank HITS + + - Easy to compute, real- time execution is hard- Query specific- Works on small graphs - Hard to spam- Computes quality signal for all pages - - Local graph structure can be manufactured- Provides a signal only when there is direct connectivity (e.g. home pages) - - Non-trivial to compute- Not query specific- Does not work on small graphs Proven to be effective for general purpose ranking Well suited for supervised directory construction TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

  14. Commercial search engines using HITS (Maybe?) Teoma, now search.ask.com "Teomas underlying technology is an extension of the HITS algorithm …", C. Sherman, April 2002, http://dc.internet.com/news/article.php/1002061(Not online anymore)

  15. References - HITS [1] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998 [2] JON KLEINBERG: "AUTHORITATIVE SOURCES IN A HYPERLINKED ENVIRONMENT", JOURNAL OF THE ACM, VOL. 46, NO. 5, SEPTEMBER 1999

  16. General Web Search Engine Architecture CLIENT WWW PAGE REPOSITORY QUERIES RESULTS QUERY ENGINE RANKING CRAWLER(S) COLLECTION ANALYSIS MOD. INDEXER MODULE CRAWL CONTROL INDEXES UTILITY STRUCTURE TEXT USAGE FEEDBACK (CF. [1] FIG. 1)

More Related