
Presented by Zheng Zhao Originally designed by Soumya Sanyal


Presentation Transcript


  1. Presented by Zheng Zhao. Originally designed by Soumya Sanyal: http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/The%20PageRank%20Citation%20Ranking%20-%20Redone.ppt • The PageRank Citation Ranking: Bringing Order to the Web • Page L., Brin S., Motwani R., Winograd T. • Stanford Digital Library Technologies Project • http://dbpubs.stanford.edu/pub/1999-66

  2. Outline • Paper Citations and the Web : Motivation • PageRank : Why it should be considered? • More PageRank: Nuts and bolts • PageRank Unleashed: Looking under the hood • Convergence and Random Walks : Why does it work? • Implementation: Getting your hands dirty • Personalized PageRank: The invisible source • Applications: What wasn’t apparent already • Conclusions

  3. Paper Citations and the Web : Motivation • Academic citations link to other well-known papers • But papers are peer reviewed and have quality control • The web of academic documents is homogeneous in quality, usage, citation and length • Most web pages link to other web pages as well • The quality of a web page, though, is subjective to the user • The importance of a page is a quantity that is hard to capture intuitively

  4. Contd. • A user wants to see what is most applicable to her needs first. • The job of the retrieval system is to present the more relevant documents up front. • The notion of quality or relative importance of a web page therefore becomes more significant • The average quality experienced by a user is higher than the quality of the average web page. • Notation used: • Backlinks (inedges): links that point to a certain page • Forward links (outedges): links that emanate from that page

  5. PageRank : Why it should be considered? • Think of a color palette • Colors are formed by mixing one or more colors • The amount and intensity of each color you mix, not the number of colors, ultimately governs the color of the final mixture! • Now think of a web page • A number of back links (inedges) point to this web page • Say one back link came from Yahoo! and another came from an obscure home page. • Think of the importance of the Yahoo! page as opposed to the importance of the ‘home page’. • Now say the importance of the Yahoo! page was mapped to the amount (intensity) of one color and that of the ‘home page’ to another color • What matters is the importance of back links rather than their number.

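  6. More PageRank: Nuts and bolts • For any web page $u$, let $F_u$ be the set of pages it links to (forward links) and $B_u$ the set of pages that link to it (back links), with $N_u = |F_u|$ • $R(u)$ = rank of page $u$; $c$ = normalization constant • Simplified ranking, as defined in the cited paper: $R(u) = c \sum_{v \in B_u} R(v) / N_v$ • Note: $c < 1$ to cover for pages with no outgoing links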
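
The simplified ranking in the cited paper, $R(u) = c \sum_{v \in B_u} R(v)/N_v$, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the three-page graph, the function name `simple_rank_step`, and the choice `c = 1` are assumptions made only for the example.

```python
def simple_rank_step(ranks, out_links, c=1.0):
    """Apply R(u) = c * sum over backlinks v of R(v) / N_v once."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in out_links.items():
        if not targets:
            continue  # a page with no forward links passes nothing on
        share = ranks[v] / len(targets)  # R(v) / N_v
        for u in targets:
            new_ranks[u] += c * share
    return new_ranks


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(out_links) for page in out_links}

# Repeat the update until the ranks settle.
for _ in range(100):
    ranks = simple_rank_step(ranks, out_links)

print(ranks)
```

On this graph the ranks settle near 0.4 for `a` and `c` and 0.2 for `b`: page `c` has two backlinks, but `a` collects all of `c`'s rank through a single link.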

  7. Contd.. • So what does the overall picture look like? • $A$ is a square matrix whose rows and columns correspond to web pages $u$ and $v$; in the cited paper, $A_{u,v} = 1/N_u$ if $u$ links to $v$ and 0 otherwise

  8. Contd.. (Matrices Revisited) • Eigenvectors and eigenvalues • Given a matrix $A$ and a vector $R$ over all the web pages, the dominant eigenvector is the one associated with the eigenvalue of largest magnitude. • It can be found by iterating the previous equation until the iteration converges. • The eigenvectors that share an eigenvalue, together with the zero vector, form that eigenvalue’s eigenspace.

  9. Contd.. (A Walk Through Example) • Let’s take an example: $A^T =$ [matrix not preserved in the transcript]

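  10. Contd.. • Matrix notation: $R = cAR = MR$, with $M = cA$ • $R$: eigenvector of $A$; because $AR = (1/c)R$, the associated eigenvalue is $1/c$ • General form: $Ax = \lambda x$, so $(A - \lambda I)x = 0$ and $|A - \lambda I| = 0$ • $A =$, $R =$, Normalized $=$ [example values not preserved in the transcript]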

  11. Contd.. (Markov Chains) • Random surfer model • Description of a random walk through the Web graph • The link structure is interpreted as a transition matrix; a page’s rank is the asymptotic probability that the surfer is currently browsing that page • This notion is fundamental to any Markovian system. For the discrete case, the following is assumed. • $R_t = M R_{t-1}$, where $M$ is the (stochastic) transition matrix of a first-order Markov chain • The question is: does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
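
The random surfer model can be illustrated with a small simulation: follow a uniformly random forward link at every step and record how often each page is visited. This is only a sketch under stated assumptions. The graph and the function name `surf` are hypothetical, and every page is assumed to have at least one forward link, so none of the sink or dangling-link problems arise.

```python
import random


def surf(out_links, steps, seed=0):
    """Random walk over the link graph; returns visit frequencies per page."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    page = next(iter(out_links))       # start at an arbitrary page
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])  # click a random forward link
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
frequencies = surf(out_links, steps=200_000)
print(frequencies)
```

For this graph the stationary distribution of the chain is 0.4, 0.2, 0.4 for `a`, `b`, `c`, and the visit frequencies of a long walk should land close to those values.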

  12. Contd..(Issues..) • The above equation would converge were it not for a little problem • This problem is called the ‘Rank Sink’ Problem. • The sink accumulates rank, but never distributes it!

  13. Contd.. • In general, many web pages have no backlinks or no forward links. • This results in dangling edges in the graph • no parent → rank 0 • $M^t$ converges to a matrix whose last column is all zero • no children → no solution • $M^t$ converges to the zero matrix
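
Both failure modes can be reproduced on toy graphs. The sketch below is illustrative only: the two graphs and the helper names `propagate` and `iterate` are assumptions, and no escape vector is used. In the first graph a page without forward links leaks rank until nothing is left; in the second a two-page loop traps all of the rank.

```python
def propagate(ranks, out_links):
    """One step of rank propagation with no escape vector."""
    new_ranks = {p: 0.0 for p in ranks}
    for v, targets in out_links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks


def iterate(out_links, steps):
    """Start from uniform ranks and propagate a fixed number of times."""
    ranks = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(steps):
        ranks = propagate(ranks, out_links)
    return ranks


# Page c has no children: the rank it receives is never passed on.
dangling = {"a": ["b"], "b": ["c"], "c": []}
leaked = iterate(dangling, 5)

# Pages b and c link only to each other: a rank sink fed by a.
sink = {"a": ["b"], "b": ["c"], "c": ["b"]}
trapped = iterate(sink, 5)

print(leaked, trapped)
```

After a few steps every rank in the first graph is zero, while in the second graph page `a` (no parent) has rank 0 and the loop between `b` and `c` holds the full total.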

  14. Contd.. (More Random Surfer) • How do we escape from this? • A: We actually ‘escape’ from it. • Say a surfer is randomly clicking and hopping from one page to the other. • If this surfer keeps going back to the ‘same’ set of pages, she will get bored (in reality too) and try to ‘escape’ from this set of pages. • Hence, we associate an ‘escape’ factor E to account for this ‘boredom’. • How do we model this escape probability? • We define E as a vector over all the web pages that accounts for each page’s escape probability.

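  15. Contd.. • Given this escape vector, how do we associate it with the original model? • In matrix notation (following the cited paper): $R' = c(AR' + E)$, where $\|R'\|_1 = 1$ • It can be rewritten as $R' = c(A + E \times \mathbf{1})R'$, with $\mathbf{1}$ the vector of all ones • Hence $R'$ is an eigenvector of $(A + E \times \mathbf{1})$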

  16. PageRank Unleashed: Looking under the hood • The main algorithm: [algorithm listing not preserved in the transcript] • What can we say about d and ε? • d1 is called the eigengap and it controls the rate of convergence • ε is the convergence threshold
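
The iteration in the cited paper starts from an initial vector, propagates rank along links, adds back the rank that was lost as `d * E`, and stops once the change between rounds drops below a threshold. The sketch below follows that outline under several assumptions that are not taken from the slide: the damping factor `c = 0.85` is applied explicitly, `E` is normalized to sum to 1, the start vector is uniform, and the names `pagerank`, `escape`, `eps` and `max_iter` are hypothetical.

```python
def pagerank(out_links, escape, c=0.85, eps=1e-10, max_iter=1000):
    """Iterate rank propagation plus escape until the L1 change is below eps."""
    pages = list(out_links)
    ranks = {p: 1.0 / len(pages) for p in pages}      # uniform start (assumption)
    for _ in range(max_iter):
        new_ranks = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_ranks[u] += c * ranks[v] / len(targets)
        # d: rank lost to damping and to pages without forward links
        d = sum(ranks.values()) - sum(new_ranks.values())
        for p in pages:
            new_ranks[p] += d * escape[p]             # redistribute it through E
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta <= eps:                              # convergence threshold
            break
    return ranks


# Hypothetical graph containing a rank sink (b <-> c) fed by a.
out_links = {"a": ["b"], "b": ["c"], "c": ["b"]}
escape = {p: 1.0 / len(out_links) for p in out_links}  # uniform E
result = pagerank(out_links, escape)
print(result)
```

With the escape term the total rank stays at 1 and page `a`, which has no backlinks, still receives its share of `E` (0.05 here) instead of dropping to zero.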

  17. Convergence and Random Walks : Why does it work? • Irreducible aperiodic Markov chains with a primitive transition probability matrix • What is the issue all about? • We need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector.

  18. Contd.. • Adding the escape vector E lets us make the original matrix A both primitive and stochastic • This guarantees convergence • What about the addition of new links? • Are link analysis algorithms based on eigenvectors stable, in the sense that the results don’t change significantly? • Suppose the connectivity of a portion of the graph is changed arbitrarily • How will it affect the results of the algorithms? • Ng et al. (2001) IJCAI and Bianchini et al. (2002) WWW’02 • It is possible to perturb a symmetric matrix by a quantity that grows as d1 and that produces a constant perturbation of the dominant eigenvector

  19. Contd.. • Convergence experiment(s) • Expander graphs and d1 (every subset S has a neighborhood larger than some factor α times |S|) • Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph

  20. Implementation: Getting your hands dirty • In 1998 • 24 million web pages • Crawler builds an index of links • To do this in 5 days, about 50 web pages/second need to be crawled • With an average outdegree of 11, that is 550 links/second • 75 million unique URLs to be compared against • URLs are hashed to unique integer IDs • No dangling links are kept initially • Vector E also helps with convergence issues • Weights were kept for 75 million URLs @ 4 bytes/weight (300 MB) • Access to the link database is linear since it is sorted • ’99: 800 million pages; ’00: 2 billion; ’01: 4 billion
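
The figures on this slide can be checked with a few lines of arithmetic. The variable names are chosen for the example only; the inputs (24 million pages, 5 days, outdegree 11, 75 million URLs, 4 bytes per weight) are the ones quoted on the slide.

```python
pages = 24_000_000
crawl_seconds = 5 * 24 * 60 * 60          # five days in seconds

# Required crawl rate; the slide rounds this to 50 pages/second.
pages_per_second = pages / crawl_seconds

# Link rate at the slide's rounded crawl rate and average outdegree.
links_per_second = 50 * 11

# Memory for one 4-byte weight per known URL.
weight_bytes = 75_000_000 * 4

print(round(pages_per_second, 1), links_per_second, weight_bytes / 1e6)
```

The exact crawl rate comes out slightly above the rounded 50 pages/second, and the weight vector occupies 300 MB when a megabyte is counted as 10^6 bytes.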

  21. Personalized PageRank: The invisible source • $\|E\|_1 = 0.15$ • Web Pages are valued because they exist! • Web Pages with many related links receive an overly high ranking • The other extreme – E for just one web page • Netscape Home Page and John McCarthy’s home page
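
The effect of the two extremes can be sketched by running the same iteration with a uniform E and with an E concentrated on a single page, in both cases with entries summing to 0.15. This is an illustration under assumptions: the four-page ring graph, the function name `personalized_rank`, and the fixed iteration count are hypothetical, and every page is assumed to have at least one forward link.

```python
def personalized_rank(out_links, escape, iterations=200):
    """Iterate R <- c * (propagated rank) + E, with c = 1 - ||E||_1."""
    c = 1.0 - sum(escape.values())        # 0.85 when ||E||_1 = 0.15
    ranks = {p: 1.0 / len(out_links) for p in out_links}
    for _ in range(iterations):
        new_ranks = dict(escape)          # every page starts from its E entry
        for v, targets in out_links.items():
            for u in targets:
                new_ranks[u] += c * ranks[v] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical ring: a -> b -> c -> d -> a
ring = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}

uniform_e = {p: 0.15 / len(ring) for p in ring}
single_e = {"a": 0.15, "b": 0.0, "c": 0.0, "d": 0.0}

uniform = personalized_rank(ring, uniform_e)
personal = personalized_rank(ring, single_e)
print(uniform, personal)
```

With a uniform E the symmetric ring gives every page the same rank; concentrating E on `a` lifts `a` and, to a decreasing degree, the pages that follow it along the ring.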

  22. Applications: What wasn’t apparent already • Estimating Web Traffic • How PageRank corresponds to actual usage • Internet proxy cache from NLANR compared to PageRank • 2.6 million pages intersect with PageRank’s indexed 75 million • Web based email access is one plausible reason for this disparity • People look at certain pages but never link to them • Backlink Predictor • PageRank is a better predictor for future citation counts than citation counts themselves. • Experiment starts out with one URL and no other information • Goal is to crawl Web pages in order of their importance • Importance being an evaluation function on the citation count (number of backlinks) • PageRank escapes local minima; citation counts get stuck in these.

  23. Conclusions • In essence, the importance of one page being dependent on the importance of its predecessors is like a ‘peer’ review. • NASDAQ – 17th February, 2005 - $197.41 : Need I say More?
