Using Hyperlink structure information for web search


Hyperlink structure information

  • Hyperlink analysis for the web by Monika R. Henzinger, Google Inc.

  • Structural web search using a graph-based discovery system by Nitish Manocha et al., University of Texas


How are hyperlinks useful?

  • Assumptions

    a) Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A.

    b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic.

    c) Pages pointed to by many pages are of higher quality than pages pointed to by fewer pages.


Main uses of hyperlink analysis

  • Crawling (collecting the pages)

  • Ranking (ordering the returned results)

  • Computing the geographic scope of a web page

  • Finding mirrored hosts

  • Computing statistics of web pages and search engines

  • Major search engines use hyperlink analysis but do not disclose their algorithms


Crawling

  • Collect web pages

  • Start with a set of pages, then recursively follow the hyperlinks found on each visited page
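
The crawling step can be sketched as a breadth-first traversal. This is only a minimal illustration, not any search engine's crawler: the `crawl` function, the `get_links` callback, and the `toy_web` dictionary are all hypothetical names, and an in-memory dictionary stands in for fetching and parsing real pages.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl: visit the seed pages, then follow hyperlinks.

    get_links(page) is an assumed callback returning the pages that
    `page` links to; a real crawler would fetch and parse HTML here.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in get_links(page):
            if target not in seen:  # never enqueue a page twice
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical toy web: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda page: toy_web.get(page, []))
```

The `seen` set is what keeps the recursion from looping on cycles such as a → c → a.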


Traditional IR

  • Vector model or Boolean model

  • Does not work well on the web because web page authors can manipulate the ranking.

  • The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.


Connectivity-Based Ranking (ranking using hyperlink analysis)

  • query-independent schemes, which assign a score to a page independent of a given query;

  • query-dependent schemes, which assign a score to a page in the context of a given query.


Model

  • Model the web as a graph: each page is a node and each hyperlink is an edge.

  • Directed graph: the link graph. Used for finding related pages.

  • Undirected graph: the co-citation graph. Used for categorizing related pages.
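
As an illustration of the two graph models, the sketch below derives a co-citation graph from a link graph. It assumes the usual definition of co-citation (two pages are connected if some third page links to both), which the slides do not spell out. The function name and the sample data are hypothetical.

```python
from itertools import combinations


def cocitation_graph(links):
    """Build an undirected co-citation graph from a directed link graph.

    links: dict mapping a page to the pages it links to.
    Returns a dict mapping frozenset({A, B}) to the number of pages
    that link to both A and B.
    """
    weights = {}
    for source, targets in links.items():
        # Ignore duplicate links and self-links of the citing page.
        distinct = sorted(set(targets) - {source})
        for a, b in combinations(distinct, 2):
            key = frozenset((a, b))
            weights[key] = weights.get(key, 0) + 1
    return weights


# Hypothetical link graph: c1 and c2 both link to a and b.
links = {"c1": ["a", "b"], "c2": ["a", "b", "d"], "a": ["d"]}
cocited = cocitation_graph(links)
```

Here `a` and `b` are co-cited twice, so under Assumption 2 they are the pair most likely to be on the same topic.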


Query-independent Ranking

  • Simple approach: score a page by the number of hyperlinks pointing to it. Major drawback: it does not distinguish between a page pointed to by a number of low-quality pages and a page pointed to by the same number of high-quality pages.

  • PageRank algorithm: weight each hyperlink to a page in proportion to the quality of the page containing the hyperlink. The PageRank of a page A depends on the PageRank of each page B pointing to A. Used by Google.
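
The recursive definition of PageRank can be approximated by repeated iteration. The sketch below is a textbook-style version, not Google's actual implementation. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all assumptions. The sample graph is hypothetical and gives `x` and `y` the same number of in-links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# x is linked from two pages nobody links to; y is linked from two
# pages that themselves have in-links. Both have indegree 2.
links = {
    "l1": ["x"], "l2": ["x"],
    "h1": ["y"], "h2": ["y"],
    "f1": ["h1"], "f2": ["h1"], "f3": ["h2"], "f4": ["h2"],
}
rank = pagerank(links)
```

Counting in-links alone would tie `x` and `y`; with PageRank, `y` scores higher because the pages pointing to it are themselves better ranked.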


Query-dependent Ranking

  • Build query-specific graph: neighborhood graph.

  • Start with the set of documents matching the query (the start set).

  • Augment it with the documents that either hyperlink to, or are hyperlinked to by, the documents in the start set.

  • Perform the hyperlink analysis.
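
The construction of the neighborhood graph can be sketched as follows. This is a minimal illustration over an in-memory link graph; the function name and the sample data are hypothetical, and a real system would typically cap how many neighbors are added per page.

```python
def neighborhood_graph(start_set, links):
    """Build the query-specific graph around a start set.

    links: dict mapping a page to the pages it links to.
    Returns (nodes, edges), where edges is a set of (source, target) pairs.
    """
    start = set(start_set)
    nodes = set(start)
    for source, targets in links.items():
        for target in targets:
            if source in start:
                nodes.add(target)  # page hyperlinked to by a start-set page
            if target in start:
                nodes.add(source)  # page hyperlinking to a start-set page

    # Keep only the hyperlinks whose two endpoints are both in the graph.
    edges = {
        (source, target)
        for source, targets in links.items()
        for target in targets
        if source in nodes and target in nodes and source != target
    }
    return nodes, edges


# Hypothetical web in which only "s1" matches the query.
links = {"p": ["s1"], "s1": ["q"], "q": ["r"], "z": ["r"]}
nodes, edges = neighborhood_graph(["s1"], links)
```

Pages `r` and `z` stay out of the graph because they are two or more hops away from the start set.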


Query-dependent Ranking (continued)

  • Indegree-based approach: score each document in the start set by the number of documents hyperlinking to it.

  • Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic)

  • The HITS algorithm determines good hubs and good authorities. Each node has an authority score and a hub score.
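
The mutual reinforcement between hubs and authorities can be sketched as an iterative update over the edges of a neighborhood graph. This is a minimal version of the commonly described HITS iteration, with an assumed fixed iteration count and Euclidean normalization. The edge list is a hypothetical example.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority scores from (source, target) edges."""
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]

        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = new_hub

        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth


# Hypothetical neighborhood graph: h1..h3 are directory-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
```

In this example `a2` ends up as the top authority, and `h1` and `h2` are better hubs than `h3` because they point to both authorities.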


Problems of HITS

  • Small additions to the neighborhood graph may considerably change the hub and authority scores.

  • Topic drift occurs when the majority of pages in the neighborhood graph are on a topic different from the query topic.


Structural web search using a graph-based discovery system

  • WebSUBDUE: built on SUBDUE, an engine for knowledge discovery (data mining). Supports structural search, text search, synonym search, and combinations of these searches.

  • Data preparation: a crawler written in Perl builds the labeled graph for the web site.

  • The labeled graph is fed into the SUBDUE system.

  • A query can be modeled as a labeled graph as well.

  • Search for the query subgraph in the web site's graph.

  • Comparison with an existing search engine: AltaVista


Find all pages that link to a page containing the term subdue
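
A structural query of this kind combines a text condition on one page with a hyperlink condition on another. The sketch below answers it by direct traversal of a small labeled graph. It does not reproduce SUBDUE's graph format or its subgraph-matching algorithm; the data layout, function name, and sample site are hypothetical.

```python
def pages_linking_to_term(page_words, links, term):
    """Return the pages that hyperlink to at least one page containing `term`.

    page_words: dict mapping a page to the words found on it (node labels).
    links: dict mapping a page to the pages it links to (hyperlink edges).
    """
    term = term.lower()
    # Text condition: pages whose word labels include the term.
    matching = {
        page
        for page, words in page_words.items()
        if term in {word.lower() for word in words}
    }
    # Structural condition: pages with a hyperlink into the matching set.
    return sorted(
        source
        for source, targets in links.items()
        if any(target in matching for target in targets)
    )


# Hypothetical web site: only "proj" contains the term.
page_words = {
    "home": ["research", "projects"],
    "proj": ["Subdue", "graph"],
    "news": ["events"],
    "about": ["people"],
}
links = {"home": ["proj", "news"], "news": ["about"], "about": ["home"], "proj": []}
result = pages_linking_to_term(page_words, links, "subdue")
```

A keyword-only search would return `proj` itself; the structural query instead returns `home`, the page that links to it.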


Jobs in computer science


Find hub and authority pages on “algorithm”


Conclusion

  • Hyperlink structure is valuable information.

  • Hyperlink information can enhance conventional web search in crawling, ranking, etc.

  • Hyperlink information can support structural search, which is still missing from existing search engines.