1 / 21

Using Hyperlink structure information for web search

Using Hyperlink structure information for web search. Hyperlink structure information. Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas . How are hyperlinks useful?.

marv
Download Presentation

Using Hyperlink structure information for web search

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Hyperlink structure information for web search

  2. Hyperlink structure information • Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. • Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas

  3. How are hyperlinks useful? • Assumptions a)Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A. b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic. c) Pages pointed by many pages are of higher quality than pages pointed to by fewer pages.

  4. main uses of hyperlink analysis • crawling (collecting the pages) • ranking (rank the returned results) • Compute the geographic scope of a web page • Find mirrored host • Compute the statistics of web pages and search engine • Major search engine use hyperlink analysis but do not want to disclose the algorithms

  5. Crawling • Collect web pages • Start with a set of pages, recursively visit the hyperlinks

  6. Traditional IR • Vector model or Boolean model • Does not work well in the web because: Web page authors manipulate the ranking. • The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.

  7. Connectivity-Based Ranking(rank using hyperlink analysis) • query-independent schemes, which assign a score to a page independent of a given query; • query-dependent schemes, which assign a score to a page in the context of a given query.

  8. Model • Web pages as graph, page as node, hyperlink as edge. • Directed graph: link graph. Used for finding related pages • Undirected graph: co-citation graph. Used for categorizing related pages.

  9. Query-independent Ranking • Major drawbacks: it does not distinguish between the quality of a page pointed by a number of low-quality pages and the quality of a page pointed to by the same number of high-quality page. • PageRank algorithm. Weight each hyperlink to the page proportionally to the quality of the page containing the hyperlink. PageRank of a page A depends on the pagerank of a page B pointing to A. Used by Google.

  10. Query-dependent Ranking • Build query-specific graph: neighborhood graph. • Start set of documents matching the query • Augmented by the sets of the documents that either hyperlinks to or is hyper linked to by the documents in the start set. • Perform the hyperlink analysis.

  11. Query-dependent Ranking(continued) • Indegree-based approach. (the number of documents hyper linking to a document in the start set) • Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic) • HITS algorithm to determine good hubs and good authorities. Each node has auth score and hub score.

  12. Problems of HITS • Small additions to neighborhood graph may considerably change the scores of hub and auth. • Topic drift when the majority of pages on neighborhood graph is on a topic different from the query topic.

  13. Structural web search using a graph-based discovery system • WebSUBDUE: SUBDUE is the engine for knowledge discovery(data mining). Support structural search, text search, synonym search, and combinations of these searches. • Data preparation: Crawler written in Perl to build the labeled graph for the web site. • Labeled graph is feed into SUDUE system. • Query can be modeled as labeled graph as well. • Search the sub graph in the graph • Make comparison with existing search engine: AltaVista

  14. Find all pages that link to a page containing the term subdue

  15. Jobs in computer science

  16. Find hubs and authorities pages on “algorithm”

  17. Conclusion • Hyperlink structure information is valuable information. • Use of hyperlink information to enhance normal web search in crawling, ranking etc. • Use of hyperlink information to support structural search, which is still missing in existing search engine.

More Related