Ranking the Web Frontier

Ranking the Web Frontier. Nadav Eiron, Kevin S. McCurley, John A. Tomlin. IBM Almaden Research Center. WWW’04. CSE 450 Web Mining. Presented by Zaihan Yang.


Presentation Transcript


  1. Ranking the Web Frontier • Nadav Eiron, Kevin S. McCurley, John A. Tomlin • IBM Almaden Research Center • WWW’04 • CSE 450 Web Mining • Presented by Zaihan Yang

  2. Introduction & Contribution • Proposes algorithmic innovations for the basic PageRank paradigm. • The problem of the web frontier (dangling nodes) • Distinguishes different types of dangling nodes • Proposes four techniques for handling penalty pages • The problem of computing PageRank and of rank manipulation • Explores the hierarchical structure of the web • HostRank & DirRank algorithms

  3. PageRank • Backlinks, the random surfer, and recursive computation • Ideal Model or • The web graph should be strongly connected. • A should be stochastic, with the resulting chain irreducible and aperiodic.
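The equations for the ideal model did not survive in this transcript. As a hedged reminder, the textbook PageRank recursion is usually written in the two equivalent forms below. The symbol names x_i (rank of page i) and d_j (number of outlinks of page j) are assumptions chosen for illustration and may differ from the slide's notation.

```latex
% x_i: rank of page i;  d_j: number of outlinks of page j (assumed symbols)
x_i = \sum_{j \,:\, j \to i} \frac{x_j}{d_j}
\qquad\text{or, in matrix form,}\qquad
x = A^{T} x, \quad
a_{ji} =
\begin{cases}
  1/d_j & \text{if page } j \text{ links to page } i \\
  0     & \text{otherwise}
\end{cases}
```

With this definition every row of A that belongs to a page with outlinks sums to 1, which is why A is required to be stochastic and why pages without outlinks need special handling.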

  4. PageRank • Improved model: add a link from each page to every page and give each link a small transition probability controlled by a parameter α. • Random jump (teleportation) • Virtual node n+1 • Variations • Issues: • The parameter α • Random jump: uniform distribution • Dangling nodes

  5. Dangling Nodes • Dangling nodes: nodes that either have no outlinks or for which no outlinks are known. • How pages become dangling nodes: • Crawlers might not have crawled them yet. Dynamic pages. • They are protected by robots.txt. • They genuinely have no outlinks: PS, PDF. • A meta tag tells crawlers not to follow their links.

  6. Handling Dangling Nodes • Remove them, then add them back. • Random jump • Reduced eigen-system • Power iteration • A single step
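The teleportation model and the "random jump" treatment of dangling nodes can be sketched as a plain power iteration. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the parameter `follow_prob` (the probability of following a link rather than jumping, one possible reading of α), and the toy graph are all assumptions. A uniform jump to every node plays the role of the virtual node n+1.

```python
def pagerank(out_links, follow_prob=0.85, tol=1e-12, max_iter=500):
    """Power iteration with uniform teleportation.

    out_links maps every node to the list of its known outlinks.
    A node with an empty list is dangling; its rank is redistributed
    uniformly over all nodes (the "random jump" option).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank currently sitting on dangling nodes.
        dangling_mass = sum(rank[v] for v in nodes if not out_links[v])
        # Every node receives the teleport share plus the dangling share.
        base = (1.0 - follow_prob) / n + follow_prob * dangling_mass / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = follow_prob * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph (hypothetical): "d" is a dangling node.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
ranks = pagerank(graph)
```

Because the dangling mass is folded back into the uniform term, the ranks keep summing to 1 at every iteration, which is the property the other handling options on this slide also have to preserve.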

  7. Penalty Pages and Link Rot • Penalty pages: dangling pages that return a 403 or 404 HTTP status code. • Link rot: links that used to work but are now broken. (Penalty link, dangling link)

  8. Effects of Dangling Nodes on Ranking • Whether to teleport to dangling nodes: • Yes: 3 has the highest rank score. • No: [0.31746, 0.31746, 0.365079], • 0.269841, less than 1 and 2. • The number of dangling links: • 1 link: [0.198684, 0.283124, 0.283124, 0.235068] • 4 links: [0.195954, 0.229266, 0.279234, 0.29554]

  9. Push-back algorithm • If a page has a link to a penalty page, its rank is reduced by a fraction, and the excess rank is returned to the pages that pushed rank to it in the previous iteration. • Retain (1-i), distribute iij to its backlinks.

  10. Self-Loop algorithm • Augment each page with a self-loop link to itself. With a i probability follow this link. bi is the number of outlinks from i to penalty pages. gi is the number of outlinks from i to non-penalty pages. • 1- becomes • Some variations.

  11. Jump-weighting algorithm • Instead of redistributing rank evenly, bias the redistribution so that penalized pages receive less rank. • A straightforward method: • Weight the link from the virtual node • to an unpenalized node in C (the strongly connected node set) by  • to a penalized node by gi/(gi+bi)
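A biased jump distribution of this kind could be computed as below. This is a sketch under stated assumptions: the weight for unpenalized nodes is missing from the transcript, so the hypothetical parameter `unpenalized_weight` defaults to 1.0 purely for illustration. The function name `jump_weights` and the sample counts are likewise made up.

```python
def jump_weights(good_out, bad_out, unpenalized_weight=1.0):
    """Return a normalized teleport distribution biased against penalized pages.

    good_out[i] is g_i, the number of outlinks from i to non-penalty pages.
    bad_out[i]  is b_i, the number of outlinks from i to penalty pages.
    unpenalized_weight is an ASSUMED value; the slide's symbol was lost.
    """
    weights = {}
    for page, g in good_out.items():
        b = bad_out.get(page, 0)
        if b == 0:
            weights[page] = unpenalized_weight   # no links to penalty pages
        else:
            weights[page] = g / (g + b)          # penalized: g_i / (g_i + b_i)
    total = sum(weights.values())
    return {page: w / total for page, w in weights.items()}


# Hypothetical counts: "p" is clean, "q" has one bad link, "r" has only bad links.
dist = jump_weights({"p": 2, "q": 1, "r": 0}, {"q": 1, "r": 2})
```

The resulting distribution would replace the uniform jump vector in the power iteration, so a page whose links all lead to penalty pages receives no teleported rank at all.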

  12. BHITS algorithm • Random walk in both forward and backward directions. • Forward step: the same as ordinary PageRank. • Backward step: • Non-dangling nodes: self-loop. • Dangling nodes: • Non-penalty nodes: forward the score to the virtual node. • Penalty nodes: divide the score by the number of inlinks and propagate it equally along the backward links. • A penalty page traverses to a random seed node. • Matrix representation

  13. HostRank algorithm • Hierarchical structure of the web • 62.4% of links are internal to a site. • 82% of outlinks point to the top level of sites. • Jump not uniformly, but to portals or top-level pages. • Consider all pages on a site as a single body. • Assign them all a rank based on the collective value of the information on that site. • Each site is represented by one node in the graph. • The graph becomes smaller and the computation cheaper.
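Collapsing a page-level link list into a host-level graph might look like the following sketch. The function name `host_graph`, the example URLs, and the choice to drop links internal to a host are assumptions made for illustration; the slides do not say how intra-site links are treated.

```python
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse (source_url, destination_url) pairs into host-to-host edges.

    Each host becomes one node. Links inside a single host are dropped
    here, which is an assumption of this sketch.
    """
    edges = set()
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            edges.add((src_host, dst_host))
    return edges


# Hypothetical page-level links.
links = [
    ("http://a.example.com/x/1.html", "http://a.example.com/x/2.html"),
    ("http://a.example.com/x/1.html", "http://b.example.org/"),
    ("http://a.example.com/y/3.html", "http://b.example.org/docs/"),
    ("http://b.example.org/docs/", "http://a.example.com/"),
]
edges = host_graph(links)
```

Four page-level links shrink to two host-level edges in this toy input, which illustrates why the host graph is so much cheaper to rank than the page graph.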

  14. DirRank algorithm • HostRank is too coarse a level of granularity & heavy-tailed distribution. • DirRank graph • Nodes: groups of URLs that share a prefix up to the last “/” or “?” (a virtual directory). • Edges: an edge exists if there is a link from a URL in the source virtual directory to a URL in the destination virtual directory.
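The virtual-directory grouping can be sketched with simple string handling. The function names `virtual_directory` and `dir_graph` are hypothetical, and two details are assumptions of this sketch: the "//" after the scheme is not counted as a delimiter, and a bare host is mapped to its root directory.

```python
def virtual_directory(url):
    """Return the prefix of url up to and including the last "/" or "?"."""
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end >= 0 else 0
    cut = max(url.rfind("/", start), url.rfind("?", start))
    if cut < 0:
        return url + "/"  # bare host: treat as its root directory (assumption)
    return url[: cut + 1]


def dir_graph(page_links):
    """Build edges between virtual directories from (src_url, dst_url) pairs."""
    return {(virtual_directory(s), virtual_directory(d)) for s, d in page_links}


# Hypothetical URLs.
d1 = virtual_directory("http://example.com/a/b.html")
d2 = virtual_directory("http://example.com/cgi/show?id=3")
d3 = virtual_directory("http://example.com")
```

Treating "?" as a delimiter groups all parameter variants of one dynamic script into a single node, while pages in different path directories of the same host stay separate, giving a finer granularity than the host graph.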

  15. Experimental Results • Setup: • Crawl at IBM Almaden • More than 1 billion pages; 37 billion links; 4.75 billion URLs. • Results: • Reduced computation. • DirRank: 114 million nodes / 15 billion edges • HostRank: 19.7 million hosts (nodes) / 1.1 billion edges • Enhanced resistance to link manipulation. • 11/20 in 100 million pages vs. 14/100 hostnames • Virtual node probability: 0.82 vs. 0.17

  16. Conclusions • PageRank with uniform teleportation is easily subject to link manipulation. • The HostRank and DirRank algorithms are both cheaper to compute and less subject to link manipulation. • The four proposed techniques for penalty pages can reduce bias and improve ranking performance. • In the future, the hope is to place the problem of web page ranking on a firmer scientific foundation, beyond trade or economic domains.
