Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg, JACM 1999

Presented by Raman Adaikkalavan, Feb 23, 2005, CSE 6392. Instructor: Dr. Gautam Das.


Presentation Transcript


  1. Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg, JACM 1999. Presented by Raman Adaikkalavan, Feb 23, 2005, CSE 6392. Instructor: Dr. Gautam Das

  2. Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries • Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison – ? • Conclusion

  3. Problem – in general • Searching on the www for discovering pages that are relevant to a given query • Improving Quality of search

  4. Query Types • Specific queries: e.g., “Does Netscape support the JDK 1.1 code-signing API?” • Broad-topic queries: e.g., “Find information about the Java programming language” • Similar-page queries: e.g., “Find pages ‘similar’ to java.sun.com”

  5. Problems with Answering Queries • Specific queries • Scarcity problem: Very few pages that contain required information • Difficult to determine the identity of the pages • Broad-topic queries • Abundance problem: Number of pages that could reasonably be returned as relevant is far too large for a human user to digest • Select a small set of the most “authoritative” or “definitive” ones – pages that are most relevant

  6. Authoritative Pages – Central focus • Given a query, how to get the small set of authoritative pages corresponding to that query • How to accurately model authority in the context of a particular query topic • Text-based searching/ranking – sufficient? Many prominent pages are not sufficiently self-descriptive. • “harvard” – www.harvard.edu • “search engines” – Yahoo, AltaVista, …? • “automobile manufacturers” – Honda, Toyota, …?

  7. Analysis of the Link Structure • Hyperlinks encode a considerable amount of latent human judgment – can it be used to infer authority? • e.g., the creator of page p, by including a link to page q, has in some measure conferred authority on q • But a large number of links are created primarily for navigational purposes (e.g., “back” links) • Other links point to paid advertisements • Relevance has to be balanced against popularity • Ranking pages by #inlinks alone would treat highly popular pages, such as Yahoo.com, as authoritative

  8. Conferral of Authority • Model that consistently identifies relevant, authoritative www pages for broad search topics • Based on the relationship between ‘authorities’ and ‘hubs’ • Authorities: Pages that have relevant information about a given topic • Hubs: Pages that link to many related authorities

  9. Till Now • Authoritative pages within the WWW for broad-topic queries (e.g., information about the Java programming language) • Not only based on text • Using link analysis

  10. Can We Operate Over the Entire WWW? • The analysis is specific to a query, i.e., not predefined • Computational costs should be reduced • Analysis of the link structure: which subgraph of the www should be operated on? • All pages containing the query string? • This may be over a million pages – too much computation • Some or most of the best authorities may not belong to this set

  11. Finding Authoritative Pages • Steps • 1: Construct a focused subgraph (S) of the www; such that • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities • 2: Compute Hubs and Authorities from the focused subgraph

  12. Construction of Focused Subgraph • [Figure: a topic query is sent to a search engine; its t highest-ranked pages form the root set R; R is expanded with forward-link pages and with at most d backward-link pages per root page, giving the expanded set of pages S]
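
The construction on this slide can be sketched as follows. This is a minimal illustration, assuming the link structure is already available as two dictionaries and that the root set comes from some text search engine. The function name, the dictionaries `out_links` and `in_links`, and the toy data are hypothetical. Taking the first d backward links is one arbitrary way of choosing "at most d pages".

```python
def build_focused_subgraph(root_set, out_links, in_links, d):
    """Expand a root set R into the page set S (sketch, hypothetical helper).

    root_set  -- the t highest-ranked pages from a text search engine
    out_links -- dict: page -> list of pages it points to
    in_links  -- dict: page -> list of pages pointing to it
    d         -- cap on backward-link pages added per root page
    """
    base = set(root_set)
    for page in root_set:
        # Forward links: add every page the root page points to.
        base.update(out_links.get(page, []))
        # Backward links: add at most d pages that point to the root page.
        base.update(in_links.get(page, [])[:d])
    return base


# Toy data, invented for illustration only.
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
S = build_focused_subgraph(["r1", "r2"], out_links, in_links, d=2)
print(sorted(S))  # ['a', 'b', 'h1', 'h2', 'r1', 'r2']
```

The cap d keeps a single very popular root page from pulling in thousands of referring pages, which is what keeps S relatively small.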

  13. Offsetting Navigational Links • G[S]: the subgraph induced on the pages in S • Types of links • Transverse: between pages with different domain names • Intrinsic: between pages within the same domain name • Delete intrinsic links from G[S], resulting in a graph G • Collusion: a large # of pages from a single domain all point to a single page p. “This site is designated to…” • Limit with a parameter m (approx. 4 – 8): allow at most m pages from a single domain to point to any given page p
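
A rough sketch of the two pruning rules on this slide, assuming edges are given as (source URL, target URL) pairs. Using the URL host as the "domain name" is a simplifying assumption, and `prune_links` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def domain(url):
    """Return the host part of a URL, used here as the 'domain name'."""
    return urlparse(url).netloc


def prune_links(edges, m):
    """Drop intrinsic links, then keep at most m links per (domain, target)."""
    kept = []
    per_domain = {}  # (source domain, target page) -> number of links kept
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue  # intrinsic link: same domain, likely navigational
        key = (domain(src), dst)
        if per_domain.get(key, 0) >= m:
            continue  # m pages of this domain already point to dst
        per_domain[key] = per_domain.get(key, 0) + 1
        kept.append((src, dst))
    return kept


# Toy edges, invented for illustration only.
edges = [
    ("http://a.example/1", "http://a.example/2"),  # intrinsic
    ("http://b.example/1", "http://a.example/2"),
    ("http://b.example/2", "http://a.example/2"),
    ("http://b.example/3", "http://a.example/2"),  # exceeds m = 2
]
kept = prune_links(edges, m=2)
print(len(kept))  # 2
```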

  14. Finding Authoritative Pages • Steps • 1: Construct a focused subgraph (S) of the www • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities • 2: Compute Hubs and Authorities from the focused subgraph

  15. Computing Hubs & Authorities • Goal: Given a query find: • Good sources of content (authorities) • Good sources of links (hubs) FROM: Monika Henzinger, Hyperlink Analysis on the Web

  16. Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities. FROM: Monika Henzinger, Hyperlink Analysis on the Web

  17. Hubs and Authorities • An iterative algorithm • With each page p, we associate • a non-negative authority weight $x^{\langle p \rangle}$ • a non-negative hub weight $y^{\langle p \rangle}$ • Weights of each type are normalized so that their squares sum to 1: $\sum_{p \in S} (x^{\langle p \rangle})^2 = 1$ and $\sum_{p \in S} (y^{\langle p \rangle})^2 = 1$ • Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively

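  18. Hubs and Authorities • If p points to many pages with large x-values, then it should receive a large y-value • If p is pointed to by many pages with large y-values, then it should receive a large x-value • Inlinks, operation I: $x^{\langle p \rangle} \leftarrow \sum_{q : (q, p) \in E} y^{\langle q \rangle}$ • Outlinks, operation O: $y^{\langle p \rangle} \leftarrow \sum_{q : (p, q) \in E} x^{\langle q \rangle}$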
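
The two update rules described on this slide can be written as a pair of small functions. This is a sketch under the assumption that the graph is a list of (source, target) edges and that weights are kept in dictionaries. The function names and the toy graph are hypothetical.

```python
def i_operation(edges, y):
    """I: authority weight of p = sum of hub weights of pages pointing to p."""
    x = {p: 0.0 for p in y}
    for q, p in edges:
        x[p] += y[q]
    return x


def o_operation(edges, x):
    """O: hub weight of p = sum of authority weights of pages p points to."""
    y = {p: 0.0 for p in x}
    for p, q in edges:
        y[p] += x[q]
    return y


# Toy graph, invented for illustration: h1 -> a1, h1 -> a2, h2 -> a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
pages = ["h1", "h2", "a1", "a2"]
y0 = {p: 1.0 for p in pages}
x1 = i_operation(edges, y0)
y1 = o_operation(edges, x1)
print(x1)  # {'h1': 0.0, 'h2': 0.0, 'a1': 2.0, 'a2': 1.0}
print(y1)  # {'h1': 3.0, 'h2': 2.0, 'a1': 0.0, 'a2': 0.0}
```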

  19. Hubs and Authorities • As one applies Iterate with arbitrarily large k, the sequences $\{x_k\}$ and $\{y_k\}$ converge to fixed points $x^*$ and $y^*$ • Let G = (V, E), with $V = \{p_1, p_2, \ldots, p_n\}$, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if $(p_i, p_j)$ is an edge of G, and is 0 otherwise • $x^*$ is the principal eigenvector of $A^T A$, and $y^*$ is the principal eigenvector of $A A^T$ • The convergence of Iterate is quite rapid (k = 20 is sufficient)
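
A minimal sketch of the Iterate procedure, assuming the graph is a list of (source, target) edges, all weights start at 1, and normalization is applied after every round. The names `normalize` and `iterate` and the toy graph are hypothetical, and the code is an illustration rather than the paper's exact pseudocode.

```python
import math


def normalize(w):
    """Scale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()} if norm > 0 else dict(w)


def iterate(pages, edges, k=20):
    """Run k rounds of the I and O operations, normalizing after each round."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:       # I operation: in-links feed authority
            new_x[p] += y[q]
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:       # O operation: out-links feed hub weight
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph, invented for illustration: h1 -> a1, h1 -> a2, h2 -> a1.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges)
print(max(x, key=x.get), max(y, key=y.get))  # a1 h1
```

On this toy graph a1 ends up as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.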

  20. Mini Web (Modified) • Pages X, Y, Z; rows and columns of M are ordered X, Y, Z • $M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$, $M^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$ • HUBS (forward links): $H_i = M A_{i-1}$ • AUTHORITIES (backward links): $A_i = M^T H_{i-1}$ • Combined: $H_i = M M^T H_{i-1}$ and $A_i = M^T M A_{i-1}$ • SOURCE: Vagelis H, Random Walks Presentation

  21. Mini Web (Modified) • $A_i = M^T M A_{i-1}$ with $M^T M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}$ • $H_i = M M^T H_{i-1}$ with $M M^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}$ • [Table: weights of X, Y, Z at iterations 1, 2, 3, …, ∞] • X is the best hub • Z is the most authoritative • SOURCE: Vagelis H, Random Walks Presentation
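
The two matrix products on this slide can be checked with a few lines of plain Python. The sketch assumes the mini-web matrix from the previous slide reads M = [[1, 1, 1], [0, 0, 1], [1, 1, 0]] with rows and columns ordered X, Y, Z. That reading is reconstructed from the transcript and is an assumption, and the helper names are hypothetical.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Assumed reading of the mini-web adjacency matrix (rows/columns: X, Y, Z).
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = transpose(M)

MtM = matmul(MT, M)  # drives the authority update
MMt = matmul(M, MT)  # drives the hub update
print(MtM)  # [[2, 2, 1], [2, 2, 1], [1, 1, 2]]
print(MMt)  # [[3, 1, 2], [1, 1, 0], [2, 0, 2]]
```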

  22. Basic Results – Broad Topic Search

  23. Observations • Just “pure” analysis of link structure • i.e., text-based search only supplies the initial set • Finds pages that can legitimately be considered authoritative in the context of the www, without access to a large-scale index of the www • i.e., global analysis of the full www link structure can be replaced by a local method over a small focused subgraph

  24. Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries • Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison • Conclusion

  25. Similar-Page Queries • E.g., find pages ‘similar’ to honda.com • Using link analysis to infer a notion of “similarity” among pages • We have found a page p that is of interest, and it is an authoritative page on a topic • What do users of the WWW consider to be related to p when they create pages and links? • If p is highly referenced, we again face the abundance problem

  26. Similar-Page Queries • In the local region of the link structure near p, what are the strongest authorities? • These can serve as a broad-topic summary of the pages related to p • Normal search, given a query string σ: “Find t pages containing σ” as R, and then get subgraph S • Similar-page search, given a page p: “Find t pages pointing to p” as R, and then get subgraph S
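
The only change for a similar-page query is how the root set R is chosen. A tiny sketch, assuming in-links are available as a dictionary; `similar_page_root_set` and the example host names are hypothetical.

```python
def similar_page_root_set(p, in_links, t):
    """Root set for a similar-page query: up to t pages that point to p."""
    return list(in_links.get(p, []))[:t]


# Toy in-link data, invented for illustration only.
in_links = {
    "honda.example": ["dealers.example", "reviews.example", "carclub.example"],
}
R = similar_page_root_set("honda.example", in_links, t=2)
print(R)  # ['dealers.example', 'reviews.example']
```

From here, R would be expanded into S and the hub and authority computation run exactly as for a broad-topic query.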

  27. Results – Similar Page Queries

  28. Multiple Sets of Hubs and Authorities • Broad-topic queries: the algorithm finds the most densely linked collection of hubs and authorities • Can we find several densely linked collections of hubs and authorities among the same set S of pages? • Each collection could potentially be relevant to the query topic, but they could be well separated from one another in the graph G: • The query string σ may have several very different meanings. E.g. “jaguar”, “java”. • The string may arise as a term in the context of multiple technical communities. E.g. “randomized algorithms”. • The string may refer to a highly polarized issue, involving groups that are not likely to link to one another. E.g. “abortion”.

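  29. Multiple Sets of Hubs and Authorities • Relevant documents can be grouped into several clusters • For broad-topic queries: $x^*$ is the principal eigenvector of $A^T A$, and $y^*$ is the principal eigenvector of $A A^T$ • Can we use the non-principal eigenvectors to extract additional densely linked collections of hubs and authorities? • Positive and negative: a non-principal eigenvector has both positive and negative entries, so the pages with the largest positive coordinates and those with the most negative coordinates form two separate collections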
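
One simple way to see a second collection emerge is power iteration with deflation. This technique is swapped in here for illustration and is not taken from the paper. The sketch assumes a hypothetical graph with two disjoint communities: three hubs pointing to authorities 0 and 1, and two hubs pointing to authorities 2 and 3. With disjoint communities the second eigenvector simply selects the second group; in a connected graph it would carry both positive and negative entries.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def unit(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def power_iteration(m, start, steps=200):
    """Return (eigenvalue, unit eigenvector) of largest magnitude."""
    v = unit(start)
    for _ in range(steps):
        v = unit(matvec(m, v))
    return sum(a * b for a, b in zip(v, matvec(m, v))), v


# Hypothetical adjacency matrix: rows are hubs, columns are authorities.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
n = 4
AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]

# Principal eigenvector: the densest community (authorities 0 and 1).
lam1, v1 = power_iteration(AtA, [1.0] * n)

# Deflation: remove the principal component, then iterate again.
deflated = [[AtA[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
lam2, v2 = power_iteration(deflated, [1.0] * n)

print(round(lam1, 6), round(lam2, 6))  # 6.0 4.0
```

The largest coordinates of `v1` sit on authorities 0 and 1, and the largest coordinates of `v2` sit on authorities 2 and 3, which is the second densely linked collection.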

  30. Results – Multiple Sets of H & A

  31. Results – Multiple Sets of H & A

  32. Results – Multiple Sets of H & A

  33. Diffusion and Generalization • Diffusion happens • if σ specifies a topic that is not sufficiently broad, there will not be enough relevant pages in G • the most relevant collection in G is then not the “densest” one • as a result, the I and O operations converge to a diffused collection of authorities corresponding to a “broader” topic • This limits the algorithm on narrow topics • The broader topic that supplants the original, too-specific query σ very often represents a natural generalization of σ • It provides a simple way of abstracting a specific query topic to a broader related one

  34. Results – Diffusion & Generalization

  35. Results – Diffusion & Generalization • The use of non-principal eigenvectors, combined with basic term-matching, can be a simple way to extract collections of authoritative pages that are more relevant to a specific query topic

  36. Evaluation • 26 broad search topics, 37 users • For each topic, took the top 10 pages from AltaVista, the top five hubs and five authorities from Clever, and a random set of 10 pages from Yahoo • The results • For 31% of the topics, Yahoo and Clever received evaluations equivalent to each other • For 50%, Clever received a higher evaluation • For 19%, Yahoo received the higher evaluation

  37. Summary • Answering broad-topic queries • Finding authoritative pages using good hubs and good authorities • Answering similar-page queries by starting with a different root set • Finding multiple sets of hubs and authorities using non-principal eigenvectors • Overcoming diffusion and generalization by using non-principal eigenvectors and basic term matching

  38. PageRank vs. HITS • PageRank: computation once for all documents and queries (offline); query-independent – requires combination with query-dependent criteria; hard to spam • HITS: requires computation for each query; query-dependent; relatively easy to spam; quality depends on quality of start set; gives hubs as well as authorities • FROM: Monika Henzinger, Hyperlink Analysis on the Web

  39. PageRank vs. HITS • PageRank: [Lempel] not rank-stable: O(1) changes in the graph can change $O(N^2)$ order relations; [Ng, Zheng, Jordan 01] “value”-stable: a change in k nodes (with PR values $p_1, \ldots, p_k$) results in p* s.t. […] • HITS: not rank-stable; “value”-stability depends on the gap g between the largest and second-largest eigenvalue: a change of O(g) nodes results in p* s.t. […] • FROM: Monika Henzinger, Hyperlink Analysis on the Web

  40. References/Slide Sources • Jon M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, JACM 1999 • Monika Henzinger, “Hyperlink Analysis on the Web” • Original mini-web example: http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt • Vivek B. Tawde, presentation on “Authoritative Sources in a Hyperlinked Environment”

  41. Conclusion • Influential paper • Citeseer – 457 citations • ACM – 115 citations • Same time period as the Google PageRank algorithm

  42. Thank You
