
The PageRank Citation Ranking: Bringing Order to the Web

Presented by Aishwarya Rengamannan 1000669605 Instructor: Dr. Gautam Das.


Presentation Transcript


  1. The PageRank Citation Ranking: Bringing Order to the Web. Presented by Aishwarya Rengamannan (1000669605). Instructor: Dr. Gautam Das

  2. Technology Overview

  3. Motivation • The WWW is huge and heterogeneous • Web pages proliferate free of quality control • There is commercial interest in manipulating rankings • The ‘quality’ of a web page is subjective and depends on the user. Problem: We need to approximate the overall relative ‘importance’ of web pages. Solution: Take advantage of the link structure of the web.

  4. Link structure of the Web • Forward links (out-edges): the outgoing links from a web page. C is a forward link of both A and B. • Back links (in-edges): the incoming links to a web page. A and B are back links of C.
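
A minimal Python sketch of the two link directions, assuming the toy graph from the slide (A and B each link to C). The names `forward_links` and `back_links` are illustrative only; the back-link map is obtained by inverting the forward-link map.

```python
from collections import defaultdict

# Toy graph from the slide: A and B each link to C (C has no out-links here).
forward_links = {"A": ["C"], "B": ["C"], "C": []}

def back_links(forward):
    """Invert a forward-link map: page -> list of pages that link to it."""
    back = defaultdict(list)
    for source, targets in forward.items():
        back[source]  # touch the key so every page appears, even with no backlinks
        for target in targets:
            back[target].append(source)
    return dict(back)

backs = back_links(forward_links)
print(backs["C"])  # the back links of C
```

A crawler usually only sees forward links directly, so the back-link view generally has to be derived this way.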

  5. Related Work • Academic paper citations • Link based analysis • Clustering methods that take link structure into account • Modeling web as Hubs and Authorities

  6. Ranking Intuition • A web page with many backlinks is likely to be important. • Backlinks from high-quality pages increase the ranking further. “A page has high rank if the sum of the ranks of its backlinks is high.” What about a single backlink from www.yahoo.com?

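  7. Naïve PageRank Calculation $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$ • u, v --> web pages • $B_u$ --> the set of pages that link to u (the backlinks of u) • $N_v$ --> the number of forward links from v • R --> ranks of the web pages • c < 1 --> factor used for normalization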

  8. Matrix Representation ‘A’ is a square adjacency matrix with • rows and columns corresponding to web pages (u and v) • $A_{u,v} = 1/N_u$ if there is an edge from u to v • $A_{u,v} = 0$ if there is no edge.
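
The definition above can be sketched in plain Python. The three-page graph and the helper name `build_matrix` are assumptions made for illustration; pages are identified by integers 0..n-1.

```python
# Hypothetical 3-page graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}

def build_matrix(links, n):
    """Return A as a list of rows with A[u][v] = 1/N_u if u links to v, else 0."""
    A = [[0.0] * n for _ in range(n)]
    for u, targets in links.items():
        for v in targets:
            A[u][v] = 1.0 / len(targets)  # N_u = number of forward links from u
    return A

A = build_matrix(links, 3)
for row in A:
    print(row)
```

With this convention every row that belongs to a page with at least one forward link sums to 1.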

  9. Matrices Revisited Eigenvalues and eigenvectors: • Let A be an n×n matrix. • λ is an eigenvalue of A if there exists a non-zero vector v such that $Av = \lambda v$. • The vector v is called an eigenvector of A corresponding to λ. • We can rewrite $Av = \lambda v$ as $(A - \lambda I)v = 0$, where I is the n×n identity matrix.

  10. Matrices Revisited (contd…) How do we solve for the eigenvalues and eigenvectors?
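
One standard approach for small matrices is to solve the characteristic equation $\det(A - \lambda I) = 0$ for λ and then solve $(A - \lambda I)v = 0$ for each root. The 2×2 matrix below is an illustrative example, not one taken from the slides.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2-\lambda)^2 - 1 = 0
\;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 1 .
\]
\[
(A - 3I)v = 0 \;\Rightarrow\; v_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
(A - I)v = 0 \;\Rightarrow\; v_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.
\]
```

Here $\lambda_1 = 3$ is the dominant eigenvalue and $v_1$ its eigenvector, which is the kind of pair PageRank is interested in.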

  11. Sample Calculation [Figure: example graph with four nodes labeled 1, 2, 3, 4]

  12. Matrix Representation (contd…) • A --> square matrix over the web pages • R --> rank vector over the web pages • Matrix notation gives $R = cAR$, so R is an eigenvector of A with eigenvalue $1/c$. • To find: the eigenvector corresponding to the dominant (largest) eigenvalue. • It can be computed by iterating repeatedly until R converges to the dominant eigenvector. [The slide's numeric values for R and normalized R are not recoverable.]
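
The repeated iteration can be sketched as a power iteration in plain Python. This is a sketch under assumptions: the matrix `M` is the transpose of the slide's A (so that multiplying by R sums rank over backlinks), the three-page graph is hypothetical, and `power_iteration`, `tol`, and `max_iter` are illustrative names.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Approximate the dominant eigenvector of M, normalized so its entries sum to 1."""
    n = len(M)
    R = [1.0 / n] * n  # start from a uniform vector
    for _ in range(max_iter):
        # One multiplication: new_R = M * R
        new_R = [sum(M[u][v] * R[v] for v in range(n)) for u in range(n)]
        total = sum(abs(x) for x in new_R)
        new_R = [x / total for x in new_R]  # normalize (L1 norm)
        delta = sum(abs(a - b) for a, b in zip(new_R, R))
        R = new_R
        if delta < tol:  # stop once successive vectors agree
            break
    return R

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# M[u][v] = 1/N_v if v links to u, else 0.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
R = power_iteration(M)
print(R)
```

For this graph the iteration settles near (0.4, 0.2, 0.4): pages 0 and 2 share the top rank and page 1, with a single half-weight backlink, ranks lowest.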

  13. Problem with Naïve PageRank Rank sink: • Consider two web pages that point to each other but to no other page, and a third page that points to one of them. • During iteration, the loop accumulates rank but never distributes it (since it has no out-edges to the rest of the graph).
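
A small Python sketch of the rank sink, assuming the three-page configuration described above (A and B point to each other, C points to A) and a naive update with c = 1. The function name `naive_step` is illustrative.

```python
# A and B form the loop; C points into it.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}

def naive_step(rank, links):
    """One naive update: each page splits its rank evenly over its forward links."""
    new_rank = {page: 0.0 for page in links}
    for v, targets in links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank

rank = {page: 1.0 / 3 for page in links}
for _ in range(5):
    rank = naive_step(rank, links)

sink_total = rank["A"] + rank["B"]
print(rank, sink_total)
```

After the first step C has no rank left, because nothing links to it, and all of the rank circulates between A and B without ever leaving the loop.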

  14. Solution – Extended version of PageRank Introduce a rank source: E(u) is a vector over the web pages that corresponds to a source of rank.
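
The slide's equation is not preserved in this transcript. The form below follows the definition in the original PageRank paper, with $B_u$ the set of backlinks of u, $N_v$ the number of forward links from v, and c a normalization factor; treat it as a reconstruction rather than the slide's exact notation.

```latex
\[
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u),
\qquad \text{with } c \text{ maximized and } \lVert R' \rVert_1 = 1 .
\]
\[
\text{In matrix notation: } R' = c\,(A R' + E).
\]
```

Because every page with $E(u) > 0$ receives some rank regardless of its backlinks, a closed loop can no longer absorb all of the rank.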

  15. Random Surfer Model • Random surfer: clicks on successive links at random. • The factor ‘E’ can be viewed as modeling this behavior. • The surfer periodically gets bored and jumps to a random page chosen according to E.

  16. PageRank Computation • Initialize: a rank vector over the web pages. Loop: • New ranks: sum of the normalized backlink ranks • Compute the normalizing factor • Add the escape term • Compute the control parameter. While: repeat until the ranks have converged.
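
The loop can be sketched in Python as follows. Several things here are assumptions rather than content from the slide: the uniform starting vector, the names `pagerank`, `eps`, and `max_iter`, and in particular the `damping` parameter, which comes from the common damping-factor formulation and is added so that the escape term has a visible effect on a graph with no dangling pages. With `damping=1.0` the code reduces to the literal steps listed above.

```python
def pagerank(links, E, damping=0.85, eps=1e-10, max_iter=1000):
    """Iterate ranks until the L1 change between steps drops to eps or below."""
    pages = list(links)
    R = {p: 1.0 / len(pages) for p in pages}  # initialize (uniform start, an assumption)
    for _ in range(max_iter):
        # New ranks: damped sum of normalized backlink ranks.
        new_R = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_R[u] += damping * R[v] / len(targets)
        # Normalizing factor: rank mass lost in this step.
        d = sum(R.values()) - sum(new_R.values())
        # Escape term: redistribute the lost mass according to E.
        for p in pages:
            new_R[p] += d * E[p]
        # Control parameter: L1 distance between successive rank vectors.
        delta = sum(abs(new_R[p] - R[p]) for p in pages)
        R = new_R
        if delta <= eps:
            break
    return R

# Rank-sink graph: A and B point to each other, C points to A.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
E = {p: 1.0 / 3 for p in links}  # uniform rank source (an assumption)
ranks = pagerank(links, E)
print(ranks)
```

On this graph C keeps a small share of rank that comes only from E, while A and B still rank highest, so the loop no longer absorbs everything.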

  17. Another Problem? Dangling links: • Links that point to a page with no outgoing links • It is not clear where their weight should be distributed. Solution: Remove them from the system until all other PageRanks have been calculated!
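
A minimal Python sketch of the removal step. It assumes every link target also appears as a key in the link map, and it repeats the removal because dropping one dangling page can leave another page without out-links. The function name `remove_dangling` and the four-page graph are hypothetical.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no out-links, together with the links pointing to them."""
    links = {page: list(targets) for page, targets in links.items()}  # work on a copy
    while True:
        dangling = {page for page, targets in links.items() if not targets}
        if not dangling:
            return links
        links = {
            page: [t for t in targets if t not in dangling]
            for page, targets in links.items()
            if page not in dangling
        }

# D has no out-links; once D is removed, C has none either.
web = {"A": ["B", "D"], "B": ["A", "C"], "C": ["D"], "D": []}
core = remove_dangling(web)
print(core)
```

The removed pages can be added back after the ranks of the remaining pages have been computed, as the slide suggests.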

  18. Implementation • Web crawler keeps a database of URLs so that it can discover all URLs on the web • To implement PageRank, the web crawler builds an index of the URLs as it crawls Problems??? • Infinitely large sites • Incorrect/Broken HTML • Sites are down • Web is always changing

  19. PageRank Implementation • Convert each URL into a unique integer ID • Sort the link structure by ID • Remove dangling links • Make an initial assignment of ranks and iterate until convergence • Add the dangling links back • Iterate again to assign weights to all dangling links • The link database, A, is normally kept in RAM
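
The first two steps can be sketched in Python. The function name `index_links` and the example URLs are hypothetical, and IDs are simply assigned in order of first appearance, which is one possible scheme among many.

```python
def index_links(url_links):
    """Assign each URL an integer ID and return (ids, link pairs sorted by ID)."""
    ids = {}

    def get_id(url):
        # Hand out the next free integer the first time a URL is seen.
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    edges = sorted((get_id(src), get_id(dst)) for src, dst in url_links)
    return ids, edges

# Hypothetical (source URL, target URL) pairs as a crawler might record them.
url_links = [
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
]
ids, edges = index_links(url_links)
print(ids)
print(edges)
```

Sorting by source ID keeps all forward links of a page adjacent, so each iteration can read the link structure in a single sequential pass.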

  20. Convergence Properties • Interpret the web as an expander-like graph. • A graph is an expander if every (not too large) subset of nodes S has a neighborhood that is larger than some factor α times |S|. • Verification: a graph has a good expansion factor if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.

  21. Applications of PageRank • Search, browsing, and traffic estimation. • Help users decide whether a site is trustworthy. • Estimate web traffic. • Spam detection and prevention. • Predict citation counts

  22. http://www.techpavan.com/2008/11/20/backend-google-search/ • http://www.math.hmc.edu/calculus/tutorials/eigenstuff/ • http://williamcotton.com/pagerank-explained-with-javascript
