
Authoritative Sources in a Hyperlinked Environment

Authoritative Sources in a Hyperlinked Environment. Jon M. Kleinberg Presentation by Julian Zinn. Searching the Web. Goal: find pages relevant to a query. The basic text-based search algorithms retrieve pages that contain the query keywords.


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn

  2. Searching the Web • Goal: find pages relevant to a query. • The basic text-based search algorithms retrieve pages that contain the query keywords. • Improved searching algorithms can examine the link structure of the web to learn about the contents of web pages. • This paper introduces an algorithm for identifying authoritative pages and hub pages.

  3. Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Wrap-up

  4. Types of Queries • Specific queries: information about the topic is scarce. • Broad-topic queries: information about the topic is overabundant. We want to return the most ‘authoritative’ pages. • Similar-page queries: find pages that are ‘like’ a given page. This paper examines broad-topic queries.

  5. Complications with Text-based Search • An authoritative page for a query may not contain the query terms. • Example: www.uh.edu contains neither ‘University’ nor ‘Houston’, and has ‘UH’ only six times. • Text may be in the form of images or flash animations. • A page might not be self-descriptive. • Example: Honda does not describe itself as an automobile manufacturer and Google does not describe itself as a search engine.

  6. Examining Link Structure • The creator of a page p, by including a link to a page q, confers some authority on page q. • How can we exploit this latent human judgment? • Pitfall: Many links, such as navigational links and advertisement links, do not confer authority.

  7. Exploiting Link Structure 1 • An authoritative page must be popular. • So, of all pages that contain the query terms, return those with the highest in-degree. • Pitfall: Still misses authoritative pages that do not contain the query terms. • Pitfall: Universally popular pages (like www.yahoo.com) will be considered highly authoritative for any query terms they contain.
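
The in-degree heuristic on this slide can be sketched in a few lines. This is only an illustration: the function name `rank_by_in_degree`, the `top_k` parameter, and the toy link list are assumptions and do not come from the paper or the slides.

```python
from collections import Counter

def rank_by_in_degree(links, matching_pages, top_k=3):
    """Rank pages that contain the query terms by their number of in-links."""
    matching = set(matching_pages)
    in_degree = Counter()
    for source, target in set(links):          # ignore repeated links
        if target in matching and source != target:
            in_degree[target] += 1
    # Matching pages with no in-links still appear, with a count of zero.
    ranked = sorted(matching, key=lambda page: (-in_degree[page], page))
    return [(page, in_degree[page]) for page in ranked[:top_k]]

# Hypothetical toy data: (source, target) pairs and the text-search matches.
links = [
    ("a", "portal"), ("b", "portal"), ("c", "portal"),
    ("a", "expert"), ("b", "expert"),
    ("c", "nomatch"),
]
matching = ["portal", "expert", "obscure"]
print(rank_by_in_degree(links, matching))
# [('portal', 3), ('expert', 2), ('obscure', 0)]
```

Note that "nomatch" can never be returned, however many pages link to it, and that a universally popular "portal" wins for any query it matches: these are the two pitfalls listed on the slide.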

  8. Exploiting Link Structure 2 • Authoritative sources often do not link to other authoritative sources. • Examples: Toyota does not link to Honda, and Google does not link to Teoma. • Other pages, which we call hub pages, link to multiple authoritative sources. • Example: Auto enthusiast websites linking to multiple manufacturers’ websites. • The authoritative pages for a query share many hub pages.

  9. Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Wrap-up

  10. Algorithm Overview • For a query, start with a text-based search to generate an initial root set R. • Enlarge the root set to a base set S. • Identify authoritative pages and hub pages in S. • Return the most authoritative pages in S.

  11. Desiderata for S S should: • Be relatively small. • Be rich in relevant pages. • Contain most (or many) of the strongest authorities. R will satisfy the first two properties, but not the third. Even the set of all pages that contain the query terms may not satisfy the third.

  12. Enlarging R to S • Pages in R may not be authoritative, but most authoritative pages are probably pointed to by at least one member of R. • Pages in R may not point to each other. • Let S = R + all pages pointed to by pages of R + some pages that point to pages of R. • Use a heuristic to avoid navigation links. Kleinberg’s experiments had R  200 and S  1000 to 5000.

  13. Identifying Hubs and Authorities • Our set S still has the problem of non-authoritative pages of high in-degree. • The authoritative pages are the popular pages that have a large overlap in the sets of pages that point to them. • The hub pages are the pages that point to many of the authoritative pages.
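
The idea of "overlap in the sets of pages that point to them" can be made concrete with a small sketch. The helper names and the toy graph are hypothetical; the point is only that two authorities on the same topic share in-linking pages, while an unrelated popular page does not.

```python
def in_link_sets(edges):
    """Map each page to the set of pages that point to it."""
    sets = {}
    for source, target in edges:
        sets.setdefault(target, set()).add(source)
    return sets

def shared_in_links(edges, p, q):
    """Pages that point to both p and q."""
    sets = in_link_sets(edges)
    return sets.get(p, set()) & sets.get(q, set())

# Hypothetical toy graph: two hubs cite both authorities,
# while "popular" is linked from unrelated pages only.
edges = [
    ("hub1", "auth1"), ("hub1", "auth2"),
    ("hub2", "auth1"), ("hub2", "auth2"),
    ("u1", "popular"), ("u2", "popular"),
]
print(sorted(shared_in_links(edges, "auth1", "auth2")))  # ['hub1', 'hub2']
print(shared_in_links(edges, "auth1", "popular"))        # set()
```

All three target pages have in-degree 2 here, so in-degree alone cannot separate them; the overlap can.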

  14. Hubs and Authorities Picture [Figure: hubs pointing to authorities, alongside an unrelated page of large in-degree]

  15. Mutually Reinforcing Relationship • Good hubs point to many good authorities. • Good authorities are pointed to by many good hubs. • Because each definition depends on the other, the weights are computed with an iterative algorithm.

  16. Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Wrap-up

  17. Iterative Algorithm 1 • With each page p, we associate a non-negative authority weight x(p) and a non-negative hub weight y(p). • Values are normalized. • Larger values indicate better pages.

  18. Iterative Algorithm 2 • If p points to many pages with large x-values, then p receives a large y-value: $y(p) = \sum_{q : p \to q} x(q)$. • If p is pointed to by many pages with large y-values, then p receives a large x-value: $x(p) = \sum_{q : q \to p} y(q)$.
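
The two update rules on this slide can be sketched as sums over an edge list. The function names and the toy graph are assumptions for illustration; normalization is left out here so that the raw sums are easy to check by hand.

```python
def update_authority(edges, hub):
    """x(p) = sum of hub weights y(q) over all pages q that point to p."""
    auth = {page: 0.0 for page in hub}
    for source, target in edges:
        auth[target] += hub[source]
    return auth

def update_hub(edges, auth):
    """y(p) = sum of authority weights x(q) over all pages q that p points to."""
    hub = {page: 0.0 for page in auth}
    for source, target in edges:
        hub[source] += auth[target]
    return hub

# Hypothetical toy graph: h1 points to a1 and a2, h2 points to a1 only.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub = {"h1": 1.0, "h2": 1.0, "a1": 1.0, "a2": 1.0}

auth = update_authority(edges, hub)
print(auth)  # {'h1': 0.0, 'h2': 0.0, 'a1': 2.0, 'a2': 1.0}

hub = update_hub(edges, auth)
print(hub)   # {'h1': 3.0, 'h2': 2.0, 'a1': 0.0, 'a2': 0.0}
```

After one round, a1 is ahead of a2 because two hubs point to it, and h1 is ahead of h2 because it points to both authorities.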

  19. Iterative Algorithm 3 • We iterate and renormalize until the values converge. • Therefore, we need to prove convergence. • The algorithm is a discrete-time evolution and can be written as multiplications of matrices and vectors. • A result of linear algebra guarantees convergence of X and Y to the principal eigenvectors of $M^T M$ and $M M^T$, respectively.
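
The matrix form can be checked numerically with a short pure-Python sketch. The assumptions are: M[i][j] = 1 when page i points to page j, the authority vector is updated before the hub vector in each round, and vectors are scaled to unit Euclidean length. The 4-page graph and all function names are hypothetical.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def hits(m, iterations=50):
    """Return (authority vector x, hub vector y) after repeated updates."""
    mt = transpose(m)
    y = normalize([1.0] * len(m))
    x = y
    for _ in range(iterations):
        x = normalize(mat_vec(mt, y))  # authorities from hubs
        y = normalize(mat_vec(m, x))   # hubs from the new authorities
    return x, y

def eigen_residual(a, v):
    """Largest entry of |a v - lambda v|, with lambda estimated from v."""
    av = mat_vec(a, v)
    lam = sum(p * q for p, q in zip(av, v))
    return max(abs(p - lam * q) for p, q in zip(av, v))

# Hypothetical graph: page 0 points to pages 2 and 3, page 1 points to page 2.
M = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
x, y = hits(M)
print(eigen_residual(mat_mul(transpose(M), M), x) < 1e-9)  # True
print(eigen_residual(mat_mul(M, transpose(M)), y) < 1e-9)  # True
```

The residual check confirms that, for this graph, the iterated x behaves as an eigenvector of $M^T M$ and the iterated y as an eigenvector of $M M^T$, which is what the slide's convergence claim predicts.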

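  20. Example: Mini Web • Pages X, Y, Z, with the rows and columns of M in that order: $M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$ • $A_i = M^T H_{i-1}$ • $H_i = M A_i$ • $A_i = M^T M A_{i-1}$ • $H_i = M M^T H_{i-1}$ [Figure: link graph of the three pages X, Y, Z]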

  21. Example • X is the best hub. • Z is most authoritative. [Table: weights of X, Y, Z at iterations 0, 1, 2, 3, …, ∞]

  22. Overview • Issues in Searching • Algorithm Overview • Iterative Algorithm • Example • Wrap-up

  23. Notes to Consider • In general, we don’t need to iterate to convergence. • Paper contains a list of good results for various queries. • After initial text-based search, the text was ignored in favor of the link structure.
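
The first note can be illustrated with a sketch that stops after a fixed number of rounds `k` and returns only the top-ranked authorities. The function name, the default values of `k` and `top`, and the toy graph are assumptions; the sketch shows that on this small graph the ranking is already stable after a few rounds, not that any particular `k` is sufficient in general.

```python
import math

def top_authorities(edges, k=20, top=2):
    """Run k rounds of hub/authority updates and return the top authorities."""
    pages = sorted({page for edge in edges for page in edge})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(k):
        auth = {page: 0.0 for page in pages}
        for source, target in edges:
            auth[target] += hub[source]
        hub = {page: 0.0 for page in pages}
        for source, target in edges:
            hub[source] += auth[target]
        for weights in (auth, hub):           # rescale to unit length
            norm = math.sqrt(sum(w * w for w in weights.values()))
            for page in weights:
                weights[page] /= norm
    return sorted(pages, key=lambda page: (-auth[page], page))[:top]

# Hypothetical toy graph with three hubs and three candidate authorities.
edges = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a1"), ("h3", "a3"),
]
print(top_authorities(edges, k=3))   # ['a1', 'a2']
print(top_authorities(edges, k=50))  # ['a1', 'a2']
```

Only the order of the weights matters for the returned list, so the iteration can stop as soon as that order stops changing, well before the weights themselves have converged.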

  24. Related Areas • Similar-page queries. • Connections with: • Social networks • Bibliometrics (citations) • Stand-alone hypertext environments • Clustering of link structures • Multiple sets of hubs and authorities • Diffusion and Generalization

  25. Conclusion • Influential paper – many citations. • Published at the same time as the Google PageRank algorithm. • HITS – Hyperlink-Induced Topic Search • Clever (IBM) • Basis of the Teoma search engine algorithm.

  26. References Kleinberg, Jon. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, September 1999, pp. 604-632. The mini-web example comes from http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt

  27. The End
