
HITS Hypertext-Induced Topic Selection


Presentation Transcript


  1. HITS: Hypertext-Induced Topic Selection. Selime Işık, Büşra İpek

  2. OUTLINE
     • Introduction
     • PageRank Algorithm
     • HITS Algorithm
     • HITS Example
     • HITS vs PageRank
     • Conclusion

  3. Search Engines
     1. Crawler: retrieves the contents of web pages
     2. Indexer: stores and indexes information on the retrieved pages
     3. Ranker: determines the importance of the web pages returned
     4. Retrieval Engine: performs lookups on index tables

  4. Ranking
     • Today's search engines may return millions of pages for a single query
     • A user cannot preview all of the returned results
     • Ranking therefore decides which pages the user sees first

  5. Rankers
     Rankers are classified into two groups:
     1. Content-based rankers, which score a page by
        • the number of matched terms
        • the frequency of terms
        • the location of terms
     2. Connectivity-based rankers, which score a page by
        • the links that point to it

  6. Link Analysis
     There are two well-known link analysis methods:
     1. PageRank Algorithm
     2. HITS Algorithm

  7. PageRank
     • originally formulated by Sergey Brin and Larry Page
     • does not rank web sites as a whole; a rank is determined for each page individually according to its authoritativeness
     • if an authoritative web page A links to page B, then B gains authority as well

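  8. PageRank (2)
     • recursive formula
     • PageRank is initially 1 for all nodes
     • the iteration stops when the difference between two successive calculations is very small
     PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
     where T1, ..., Tn are the pages that link to A, C(T) is the number of outgoing links of page T, and d is the damping factor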
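
The recursive formula on this slide can be iterated directly. The following is a minimal Python sketch, not the presenters' code: the three-page graph, the damping factor `d=0.85`, the tolerance, and the function name `pagerank` are all assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=200):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    d, tol and max_iter are illustrative assumptions.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # PageRank is initially 1 for all nodes

    # Invert the link map once so each page knows who links to it.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    for _ in range(max_iter):
        new = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
        # Stop when two successive calculations barely differ.
        delta = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if delta < tol:
            break
    return pr


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this small graph every page has at least one outgoing link, so the ranks keep summing to the number of pages, and C, which receives links from both A and B, ends up ranked highest.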

  9. HITS
     • Kleinberg's hypertext-induced topic selection (HITS) algorithm, also expanded as Hyperlink-Induced Topic Search, likewise ranks documents based on the link information among a set of documents.

  10. Authorities and hubs
     • The algorithm distinguishes two types of pages:
       - Authority: a page that provides important, trustworthy information on a given topic
       - Hub: a page that contains links to authorities
     • Authorities and hubs exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs

  11. Authorities and hubs (2)
     [Figure: two link diagrams over nodes 1 to 7]
     a(1) = h(2) + h(3) + h(4)
     h(1) = a(5) + a(6) + a(7)
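
The two equations on this slide are instances of the general update rules: a page's authority is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below applies both rules once. The edge list is an assumption reconstructed from the equations (pages 2, 3, 4 link to page 1, and page 1 links to pages 5, 6, 7), and all scores start at 1 purely for illustration.

```python
# Hypothetical edge list (source, destination) inferred from the equations.
edges = [(2, 1), (3, 1), (4, 1), (1, 5), (1, 6), (1, 7)]
nodes = sorted({n for edge in edges for n in edge})

# Illustrative starting scores: every page has hub = authority = 1.
hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}


def authority_update(edges, hub, nodes):
    """a(p) = sum of h(q) over all pages q that link to p."""
    new = {n: 0.0 for n in nodes}
    for src, dst in edges:
        new[dst] += hub[src]
    return new


def hub_update(edges, auth, nodes):
    """h(p) = sum of a(q) over all pages q that p links to."""
    new = {n: 0.0 for n in nodes}
    for src, dst in edges:
        new[src] += auth[dst]
    return new


# Each rule is applied to the starting scores independently here,
# to mirror the two equations on the slide one at a time.
new_auth = authority_update(edges, hub, nodes)
new_hub = hub_update(edges, auth, nodes)
```

With unit starting scores, `new_auth[1]` is h(2) + h(3) + h(4) = 3 and `new_hub[1]` is a(5) + a(6) + a(7) = 3.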

  12. Definitions
     • Authority: a page that provides important, trustworthy information on a given topic
     • Hub: a page that contains links to authorities
     • Indegree: the number of incoming links to a given node, used to measure authoritativeness
     • Outdegree: the number of outgoing links from a given node, used here to measure hubness

  13. HITS Algorithm
     • Hubs point to lots of authorities.
     • Authorities are pointed to by lots of hubs.
     • Together they form a bipartite graph, with hubs on one side and authorities on the other.

  14. Step By Step HITS - 1
     • HITS first determines a base set S
     • Let the set of documents returned by a standard search engine be called the root set R
     • Initialize S to R

  15. Step By Step HITS - 2
     • Add to S all pages pointed to by any page in R
     • Add to S all pages that point to any page in R
     • Maintain for each page p in S:
       - Authority score: a_p (vector a)
       - Hub score: h_p (vector h)
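
The construction of the base set S from the root set R can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_base_set` and the two dictionaries `out_links` and `in_links` (standing in for whatever link index a real system would query) are hypothetical.

```python
def build_base_set(root, out_links, in_links):
    """Expand the root set R into the base set S.

    out_links[p] lists the pages that p points to;
    in_links[p] lists the pages that point to p.
    Both mappings are hypothetical stand-ins for a real link index.
    """
    base = set(root)  # initialize S to R
    for page in root:
        base.update(out_links.get(page, ()))  # pages pointed to by R
        base.update(in_links.get(page, ()))   # pages that point to R
    return base


# Hypothetical example: R = {1, 2}.
root_set = {1, 2}
out_links = {1: [3], 2: [3, 4]}
in_links = {1: [5], 2: []}
base_set = build_base_set(root_set, out_links, in_links)
```

Here the base set becomes {1, 2, 3, 4, 5}: the two root pages, the pages they link to (3 and 4), and the page that links into the root set (5).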

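  16. Step By Step HITS - 3
     • For each node, initialize a_p and h_p to 1/n, where n is the number of pages in S
     • In each iteration, calculate the authority weight for each node p in S as the sum of the hub weights of the pages that point to p: $a_p = \sum_{q \to p} h_q$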

  17. Step By Step HITS - 4
     • In each iteration, calculate the hub weight for each node p in S as the sum of the authority weights of the pages that p points to: $h_p = \sum_{p \to q} a_q$
     • Note: the hub weights are computed from the current authority weights, which were computed from the previous hub weights.

  18. Step By Step HITS - 5
     • After new weights are computed for all nodes, the weights are normalized.

  19. Convergence of HITS Algorithm
     • Let A be the adjacency matrix of S
     • $A_{ij} = 1$ for $i \in S$, $j \in S$ if and only if $i \to j$
     • Authority and hub: $a_k = \varphi_k A^{T} h_{k-1}$; $h_k = \psi_k A a_k$
     • Combining both formulas gives:
       $a_k = \varphi_k \psi_{k-1} A^{T} A \, a_{k-1}$ for $k > 1$
       $h_k = \psi_k \varphi_k A A^{T} h_{k-1}$ for $k > 0$

  20. Convergence of HITS Algorithm - 2
     • The algorithm converges to a fixed point if iterated indefinitely, and the resulting authority and hub vectors satisfy
       $a^{*} = (1/\mu^{*}) A^{T} A \, a^{*}$; $h^{*} = (1/\mu^{*}) A A^{T} h^{*}$
     • The authority vector $a^{*}$ is an eigenvector of $A^{T} A$; the iteration converges to the principal eigenvector
     • The hub vector $h^{*}$ is an eigenvector of $A A^{T}$; the iteration converges to the principal eigenvector
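
The fixed-point condition can be checked numerically in matrix form. The sketch below runs plain power iteration on A^T A for a small hypothetical adjacency matrix and then measures how far the result is from satisfying A^T A a* = µ* a*. Everything here is an assumption made for illustration: the 4-page graph, the helper names, the iteration count, and the use of the Euclidean norm for normalization.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalize(v):
    # Euclidean normalization (an assumption; the slides do not name the norm).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(m, iters=200):
    """Repeatedly apply m and renormalize; return the vector and its eigenvalue estimate."""
    v = normalize([1.0] * len(m))
    for _ in range(iters):
        v = normalize(matvec(m, v))
    mu = sum(x * y for x, y in zip(v, matvec(m, v)))  # Rayleigh quotient
    return v, mu


# Hypothetical graph: A[i][j] = 1 if and only if page i links to page j.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
AtA = matmul(transpose(A), A)
a_star, mu = power_iteration(AtA)

# Largest componentwise deviation from the eigenvector equation AtA a* = mu a*.
residual = max(abs(x - mu * y) for x, y in zip(matvec(AtA, a_star), a_star))
```

For this graph the residual is negligible and page 2 (zero-based), which has three incoming links, receives the largest authority weight. Running the same routine on `matmul(A, transpose(A))` would give the hub vector.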

  21. The Pseudocode of HITS
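
A minimal Python sketch of the procedure described in the step-by-step slides follows. It is not the presenters' original pseudocode. The initialization to 1/n and the order of updates come from the slides; the Euclidean normalization, the stopping tolerance, the function names, and the example edge list are assumptions.

```python
import math


def l2_normalize(scores):
    """Scale scores so their squares sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def hits(edges, tol=1e-10, max_iter=1000):
    """Run HITS on a base set given as (source, destination) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    auth = {p: 1.0 / n for p in nodes}  # initialize a_p to 1/n
    hub = {p: 1.0 / n for p in nodes}   # initialize h_p to 1/n

    for _ in range(max_iter):
        # Authority weights from the previous hub weights.
        new_auth = {p: 0.0 for p in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # Hub weights from the current authority weights.
        new_hub = {p: 0.0 for p in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        # Normalize after all new weights are computed.
        new_auth = l2_normalize(new_auth)
        new_hub = l2_normalize(new_hub)

        delta = max(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in nodes
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub


# Hypothetical base set: pages 1, 2, 3 link to page 4, and page 3 also links to page 5.
example_edges = [(1, 4), (2, 4), (3, 4), (3, 5)]
auth, hub = hits(example_edges)
```

In this example page 4 comes out as the strongest authority and page 3 as the strongest hub, since page 3 is the only page that points to both authorities.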

  22. HITS Example
     • Root set R = {1, 2, 3, 4}
     • Extend it to form the base set S

  23. HITS Example Results
     [Figure: authority and hubness weights for nodes 1 to 15]

  24. HITS vs PageRank
     • HITS emphasizes mutual reinforcement between authority and hub pages, while PageRank does not attempt to capture the distinction between hubs and authorities; it ranks pages by authority alone.
     • HITS is applied to the local neighborhood of pages surrounding the results of a query, whereas PageRank is applied to the entire web.
     • HITS is query-dependent, whereas PageRank is query-independent.

  25. HITS vs PageRank (2)
     • Both HITS and PageRank correspond to matrix computations.
     • Both can be unstable: changing a few links can lead to quite different rankings.
     • PageRank does not handle pages with no outgoing links very well: such pages pass their rank on to no other page, so the overall PageRank decreases.
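
The effect of pages without outgoing links can be seen by summing the ranks under the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The sketch below compares two hypothetical three-page graphs, one a closed cycle and one ending in a page with no outgoing links. The graphs, the damping factor 0.85, and the function names are assumptions for illustration only.

```python
def pagerank_step(links, pr, d=0.85):
    """One application of PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    new = {p: 1 - d for p in links}
    for src, targets in links.items():
        # A page with no outgoing links contributes nothing to any other page.
        for dst in targets:
            new[dst] += d * pr[src] / len(targets)
    return new


def total_rank_after(links, iterations=50):
    """Total PageRank after a fixed number of iterations, starting from 1 per page."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = pagerank_step(links, pr)
    return sum(pr.values())


# Hypothetical graphs: a closed cycle, and a chain ending in a page without out-links.
closed = {"A": ["B"], "B": ["C"], "C": ["A"]}
dangling = {"A": ["B"], "B": ["C"], "C": []}

total_closed = total_rank_after(closed)
total_dangling = total_rank_after(dangling)
```

In the closed cycle the total stays at 3, the number of pages. In the chain, the rank that reaches page C is never passed on, and the total drops well below 3.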

  26. Conclusion
     • HITS is a general algorithm for calculating authority and hub scores in order to rank the retrieved pages.
     • The basic aim of the algorithm is to induce a subgraph of the Web by finding a set of pages through a search on a given topic (query).
     • The example results demonstrate that it is effective at identifying authority nodes and hub nodes.

