
The math behind PageRank

The math behind PageRank: a detailed analysis of the mathematical aspects of PageRank. Computational Mathematics class presentation by Ravi S Sinha, LIT lab, UNT. Partial citations of references: The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin and Lawrence Page)


Presentation Transcript


  1. The math behind PageRank • A detailed analysis of the mathematical aspects of PageRank • Computational Mathematics class presentation • Ravi S Sinha, LIT lab, UNT

  2. Partial citations of references • The Anatomy of a Large-Scale Hypertextual Web Search Engine (Sergey Brin and Lawrence Page) • Inside PageRank (Monica Bianchini, Marco Gori, and Franco Scarselli) • Deeper Inside PageRank (Amy Langville and Carl Meyer) • Efficient Computation of PageRank (Taher Haveliwala) • Topic Sensitive PageRank (Taher Haveliwala)

  3. Overview of the talk • Why PageRank • What is PageRank • How PageRank is used • Math • More math • Remaining math

  4. Why PageRank • Need to build a better automatic search engine • Why? • Human maintained lists subjective and expensive to build (non-automatic) • Automatic engines based on keyword matching do a horrible job (just page content is not enough; cleverly placed words in a page can mislead search engines) • Advertisers sometimes mislead search engines • Solution: Google [modern day: much more than PageRank; getting smarter] • Exact technology: not public domain • Core technology: PageRank (utilizes link structure) • Other uses • Any problem that can be visualized as a graph problem where the centrality of the vertices needs to be computed (NLP, etc.)

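  5. What is PageRank • A way to find the most ‘important’ vertices in a graph • PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ], where T1 … Tn are the pages linking to A, C(T) is the number of outgoing links of T, and d is the damping factor • Forms a probability distribution over the vertices [sum = 1 if the constant term is scaled to (1-d)/N for N vertices and there are no dangling pages] • How does this relate to Web search? • Vertices = pages • Incoming edges = hyperlinks from other pages • Outgoing edges = hyperlinks to other pages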
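
The formula on slide 5 can be sketched as a small iteration. This is a minimal illustration, not the talk's own code: the three-page graph, the dict `links`, and the function name `pagerank_per_page` are hypothetical. It applies the update exactly as written, with constant term (1-d), so on this graph (which has no dangling pages) the scores sum to the number of pages rather than to 1.

```python
def pagerank_per_page(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T linking to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page with score 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links to `page` passes on PR(T) / C(T),
            # where C(T) is the number of outgoing links of T.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr  # all pages are updated from the previous round's scores
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_per_page(links)
```

In this toy graph C collects links from both A and B, and A receives all of C's score, so C and A end up ahead of B.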

  6. Simple visualization: the simplest variant of PageRank in use [user behavior] • Random surfer • Only one incoming link, yet high PageRank • Damping factor

  7. Lexical Substitution: A crash course • There are different types of managed care systems

  8. PageRank in use: Lexical Substitution • Weights: word similarity • Directed/undirected: whole other realm

  9. And now, the cool stuff

  10. The math behind PageRank • Intuitive correctness • Mathematical foundation • Stability • Complexity of computational scheme • Critical role of the parameters involved • The distribution of the page score • Role of dangling pages • How to promote certain vertices (Web pages)

  11. Intuitive correctness • Concept of ‘voting’ • Related to citation in scientific literature • More citations indicate great/ important piece of work • Random surfer / random walk • A page with many links to it must be important • A very important page must point to something equally important

  12. Mathematical foundation • Most researchers: Markov chains • Caveat: Only applicable in absence of dangling nodes • Basic idea: authority of a Web page unrelated to its contents [comes from the link structure] • Simple representation • Vector representation • IN = [1, 1, 1 … 1]’ • Transition matrix: ∑(each column) = 1 or 0 [0 for a dangling page, which has no outgoing links]

  13. Mathematical foundation (2) • Google’s iterative version: converges to a stationary solution • Jacobi algorithm • Alternative computation • ||x(t)||_1 = 1; normalized
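
The equations on slides 12 and 13 did not survive in the transcript. As a sketch only, assume the vector form used in "Inside PageRank": x(t+1) = d W x(t) + (1-d) IN, with W the column-normalized transition matrix and IN the all-ones vector. Under that assumption the stationary solution satisfies (I - dW) x = (1-d) IN, and when W has a zero diagonal the iteration coincides with the Jacobi method for that linear system. The matrix `W` below is a hypothetical three-page example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def iterate(W, d=0.85, steps=200):
    """Run x(t+1) = d * W x(t) + (1-d) * IN, starting from IN = [1, ..., 1]."""
    x = [1.0] * len(W)
    for _ in range(steps):
        x = [d * wx + (1 - d) for wx in matvec(W, x)]
    return x


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Column j holds the outgoing-link probabilities of page j, so every column sums to 1.
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
d = 0.85
x = iterate(W, d)

# The stationary vector should satisfy (I - dW) x = (1-d) IN up to rounding.
residual = max(abs(xi - d * wxi - (1 - d)) for xi, wxi in zip(x, matvec(W, x)))

# Normalized variant: rescale so that the 1-norm equals 1.
norm1 = sum(abs(xi) for xi in x)
x_normalized = [xi / norm1 for xi in x]
```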

  14. Web communities: Energy balance [measure of authority]

  15. More on energy

  16. Even more on energy [community promotion] • Split same content into smaller vertices • Avoid dangling pages • Avoid many outgoing links

  17. Page promotion • Treat certain pages as communities • Bias certain pages by using a non-uniform distribution in the vector IN • Tinker with the connectivity [PageRank is proved to be affected by the regularity of the connection pattern]
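
Biasing pages through a non-uniform IN vector can be sketched as follows. This is an illustration under assumptions: the graph, the weights in `bias`, and the function name `biased_pagerank` are hypothetical, and the vector is normalized to sum to 1 so the two runs are directly comparable.

```python
def biased_pagerank(links, bias, d=0.85, iterations=200):
    """PageRank where the constant term follows a (possibly non-uniform) bias vector."""
    pages = list(links)
    total = sum(bias[p] for p in pages)
    v = {p: bias[p] / total for p in pages}  # normalized stand-in for the IN vector
    pr = dict(v)
    for _ in range(iterations):
        # The whole dict is built from the previous scores before being assigned.
        pr = {
            p: (1 - d) * v[p]
            + d * sum(pr[t] / len(links[t]) for t in pages if p in links[t])
            for p in pages
        }
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
uniform = biased_pagerank(links, {"A": 1, "B": 1, "C": 1})
promoted = biased_pagerank(links, {"A": 1, "B": 4, "C": 1})  # favor page B
```

With the extra weight on B, its score rises relative to the uniform run even though the link structure is unchanged.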

  18. Computation of PageRank • PageRank can be computed on a graph changing over time • Practical interest [Web is alive] • An optimal algorithm exists for computing PageRank • Practical applications: Search engines, PageRank on billions of pages – efficiency! • O(|H| log(1/ε)) • NOT dependent on the connectivity or other dimensions • Ideal computation: stops when the ranking of vertices between two computations does not change [converge]
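
The log(1/ε) factor in the bound can be made concrete with a rough estimate. The sketch below assumes that each iteration shrinks the error by at least a factor d starting from an error of at most 1, and reads |H| as the number of hyperlinks touched per iteration; the slide defines neither, so both are assumptions. The function name and the value of `num_links` are hypothetical.

```python
import math


def iterations_needed(d, eps):
    """Smallest k with d**k <= eps, assuming the error contracts by d per step."""
    return math.ceil(math.log(eps) / math.log(d))


# Hypothetical setting: damping factor 0.85, target accuracy 1e-6.
k = iterations_needed(0.85, 1e-6)

# Rough work estimate: one pass over every hyperlink per iteration.
num_links = 1_000_000  # assumed size of |H|, for illustration only
estimated_operations = num_links * k
```

The estimate depends only on d, ε, and the number of links, which matches the slide's point that the cost does not depend on the connectivity pattern.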

  19. The Markov model from the Web • The PageRank vector can only exist if the Markov chain is irreducible • By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain • The Web hyperlinked matrix is forced to be • Stochastic [non-negatives, all columns sum up to 1] • Remove dangling nodes/ replace relevant rows/ columns with a small value, usually (1/n)·e^T • Introduce personalization vector • Primitive • Non-negative • One positive element on the main diagonal • Irreducible

  20. More on the Markov structure • A convex combination of the original stochastic matrix and a stochastic perturbation matrix • Produces a stochastic, irreducible matrix • The PageRank vector is guaranteed to exist for this matrix • Every node directly connected to every other node, all probabilities nonzero • Irreducible Markov chain, will converge
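
The construction on slides 19 and 20 can be sketched in a few lines. This is a minimal illustration under assumptions: columns hold outgoing-link probabilities (following the slide's "all columns sum up to 1"), the perturbation matrix is uniform with every entry 1/n, and the matrix `H`, the weight `alpha`, and the function name are hypothetical.

```python
def google_matrix(H, alpha=0.85):
    """Build G = alpha * S + (1 - alpha) * E from a raw hyperlink matrix H.

    S is H with every all-zero (dangling) column replaced by the uniform value 1/n;
    E is the uniform perturbation matrix with every entry equal to 1/n.
    """
    n = len(H)
    S = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # page j has no outgoing links
            for i in range(n):
                S[i][j] = 1.0 / n
    return [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]


# Hypothetical 3-page web: page 0 links to 1, page 1 links to 0 and 2, page 2 is dangling.
H = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
G = google_matrix(H)
column_sums = [sum(G[i][j] for i in range(len(G))) for j in range(len(G))]
```

Every entry of `G` is strictly positive and every column sums to 1, so each page can reach every other page in one step, which is what makes the chain irreducible.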

  21. There’s more to PageRank • Computation • Power method • Notoriously slow • Method of choice • Requires no computation of intermediate matrices • Converges quickly • Linear systems method • The damping factor [usually 0.85] • Greater value: more iterations required • ‘Truer’ PageRanks • Dangling pages • Storage issues
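
The link between the damping factor and the number of iterations can be observed on a toy example. The sketch below is a plain power iteration that applies the damped update directly to the sparse link matrix, so no dense intermediate matrix is formed. The matrix `W`, the tolerance, and the function name `power_method` are hypothetical, and a uniform teleport term (1-d)/n is assumed.

```python
def power_method(W, d, tol=1e-10, max_iter=10_000):
    """Iterate x <- d * W x + (1-d)/n until the 1-norm change drops below tol."""
    n = len(W)
    x = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new_x = [
            d * sum(W[i][j] * x[j] for j in range(n)) + (1 - d) / n
            for i in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            return x, k  # converged after k iterations
    return x, max_iter


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
counts = {d: power_method(W, d)[1] for d in (0.5, 0.85, 0.99)}
```

On this graph the iteration count grows as d increases, consistent with the slide's note that a greater damping factor requires more iterations.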

  22. The end [for today] Thanks for listening!
