Using hyperlink structure information for web search l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 21

Using Hyperlink structure information for web search PowerPoint PPT Presentation


  • 118 Views
  • Uploaded on
  • Presentation posted in: General

Using Hyperlink structure information for web search. Hyperlink structure information. Hyperlink analysis for the web by Monika R. Henzinger, Google Inc. Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas . How are hyperlinks useful?.

Download Presentation

Using Hyperlink structure information for web search

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Using hyperlink structure information for web search l.jpg

Using Hyperlink structure information for web search


Hyperlink structure information l.jpg

Hyperlink structure information

  • Hyperlink analysis for the web by Monika R. Henzinger, Google Inc.

  • Structural web search using a graph-based discovery system by Nitish Monocha etc., University of Texas


How are hyperlinks useful l.jpg

How are hyperlinks useful?

  • Assumptions

    a)Assumption 1. A hyperlink from page

    A to page B is a recommendation of

    page B by the author of page A.

    b) Assumption 2. If page A and page B are connected

    by a hyperlink, then they might be on

    the same topic.

    c) Pages pointed by many pages are of higher quality than pages pointed to by fewer pages.


Main uses of hyperlink analysis l.jpg

main uses of hyperlink analysis

  • crawling (collecting the pages)

  • ranking (rank the returned results)

  • Compute the geographic scope of a web page

  • Find mirrored host

  • Compute the statistics of web pages and search engine

  • Major search engine use hyperlink analysis but do not want to disclose the algorithms


Crawling l.jpg

Crawling

  • Collect web pages

  • Start with a set of pages, recursively visit the hyperlinks


Traditional ir l.jpg

Traditional IR

  • Vector model or Boolean model

  • Does not work well in the web because: Web page authors manipulate the ranking.

  • The power of hyperlink analysis comes

    from the fact that it uses the content of other pages to rank the current page.


Connectivity based ranking rank using hyperlink analysis l.jpg

Connectivity-Based Ranking(rank using hyperlink analysis)

  • query-independent schemes, which assign a score to a page independent of a given query;

  • query-dependent schemes, which assign a score to a page in the context of a given query.


Model l.jpg

Model

  • Web pages as graph, page as node, hyperlink as edge.

  • Directed graph: link graph. Used for finding related pages

  • Undirected graph: co-citation graph. Used for categorizing related pages.


Query independent ranking l.jpg

Query-independent Ranking

  • Major drawbacks: it does not distinguish between the quality of a page pointed by a number of low-quality pages and the quality of a page pointed to by the same number of high-quality page.

  • PageRank algorithm. Weight each hyperlink to the page proportionally to the quality of the page containing the hyperlink. PageRank of a page A depends on the pagerank of a page B pointing to A. Used by Google.


Query dependent ranking l.jpg

Query-dependent Ranking

  • Build query-specific graph: neighborhood graph.

  • Start set of documents matching the query

  • Augmented by the sets of the documents that either hyperlinks to or is hyper linked to by the documents in the start set.

  • Perform the hyperlink analysis.


Query dependent ranking continued l.jpg

Query-dependent Ranking(continued)

  • Indegree-based approach. (the number of documents hyper linking to a document in the start set)

  • Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic)

  • HITS algorithm to determine good hubs and good authorities. Each node has auth score and hub score.


Problems of hits l.jpg

Problems of HITS

  • Small additions to neighborhood graph may considerably change the scores of hub and auth.

  • Topic drift when the majority of pages on neighborhood graph is on a topic different from the query topic.


Structural web search using a graph based discovery system l.jpg

Structural web search using a graph-based discovery system

  • WebSUBDUE: SUBDUE is the engine for knowledge discovery(data mining). Support structural search, text search, synonym search, and combinations of these searches.

  • Data preparation: Crawler written in Perl to build the labeled graph for the web site.

  • Labeled graph is feed into SUDUE system.

  • Query can be modeled as labeled graph as well.

  • Search the sub graph in the graph

  • Make comparison with existing search engine: AltaVista


Find all pages that link to a page containing the term subdue l.jpg

Find all pages that link to a page containing the term subdue


Jobs in computer science l.jpg

Jobs in computer science


Find hubs and authorities pages on algorithm l.jpg

Find hubs and authorities pages on “algorithm”


Conclusion l.jpg

Conclusion

  • Hyperlink structure information is valuable information.

  • Use of hyperlink information to enhance normal web search in crawling, ranking etc.

  • Use of hyperlink information to support structural search, which is still missing in existing search engine.


  • Login