
The math behind PageRank

A detailed analysis of the mathematical aspects of PageRank

Computational Mathematics class presentation

Ravi S Sinha

LIT lab, UNT

- The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Sergey Brin and Lawrence Page

- Inside PageRank
- Monica Bianchini, Marco Gori, and Franco Scarselli

- Deeper Inside PageRank
- Amy Langville and Carl Meyer

- Efficient Computation of PageRank
- Taher Haveliwala

- Topic Sensitive PageRank
- Taher Haveliwala

- Why PageRank
- What is PageRank
- How PageRank is used
- Math
- More math
- Remaining math

- Need to build a better automatic search engine
- Why?
- Human-maintained lists are subjective and expensive to build (non-automatic)
- Automatic engines based on keyword matching do a horrible job (just page content is not enough; cleverly placed words in a page can mislead search engines)
- Advertisers sometimes mislead search engines

- Why?
- Solution: Google [modern day: much more than PageRank; getting smarter]
- Exact technology: not public domain
- Core technology: PageRank (utilizes link structure)

- Other uses
- Any problem that can be visualized as a graph problem where the centrality of the vertices needs to be computed (NLP, etc.)

- A way to find the most ‘important’ vertices in a graph
- PR(A) = (1-d) + d [ PR(T1) / C(T1) + … + PR(Tn) / C(Tn) ]
- Forms a probability distribution over the vertices [sum = 1] when the constant term is (1-d)/N for N pages; with (1-d) as written, the scores sum to N
- How does this relate to Web search?
- Vertices = pages
- Incoming edges = hyperlinks from other pages
- Outgoing edges = hyperlinks to other pages
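
The formula above can be iterated directly on a small graph. The following is a minimal sketch, not the production algorithm: the function name `pagerank`, the four-page `toy_web`, and the fixed iteration count are all assumptions made for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting scores
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # PR(T)/C(T) for every page T that links to this page
            incoming = (pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

# Hypothetical four-page web: D has no incoming links, C has three
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
```

With the constant term (1-d), the scores in this sketch sum to the number of pages; dividing each score by that count gives values that sum to 1.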

Random surfer

Only one incoming link, yet high PageRank

Damping factor

There are different types of managed care systems

Weights: word similarity

Directed/ undirected: whole other realm

- Intuitive correctness
- Mathematical foundation
- Stability
- Complexity of computational scheme
- Critical role of the parameters involved
- The distribution of the page score
- Role of dangling pages
- How to promote certain vertices (Web pages)

- Concept of ‘voting’
- Related to citation in scientific literature
- More citations indicate a great/important piece of work

- Random surfer / random walk
- A page with many links to it must be important
- A very important page must point to something equally important

- Most researchers: Markov chains
- Caveat: Only applicable in absence of dangling nodes

- Basic idea: authority of a Web page unrelated to its contents [comes from the link structure]
- Simple representation
- Vector representation

IN = [1, 1, 1 … 1]’

Transition matrix: ∑(each column) = 1 or 0

Google’s iterative version: converges to a stationary solution

Jacobi algorithm

Alternative computation

||x(t)||_1 = 1 (normalized)
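
The fragments above can be collected into one vector formulation. This is a sketch under stated assumptions: W denotes the transition matrix with entry w_ij = 1/C(j) when page j links to page i and 0 otherwise, x(t) is the score vector at iteration t, the vector of ones (IN on the slides) is written \mathbf{1}_N, and I is the identity matrix. These symbols are chosen here for illustration.

```latex
\begin{align*}
  x &= d\,W x + (1-d)\,\mathbf{1}_N
      && \text{(PageRank equation, one row per page)}\\
  x(t+1) &= d\,W x(t) + (1-d)\,\mathbf{1}_N
      && \text{(iterative version)}\\
  (I - d\,W)\,x &= (1-d)\,\mathbf{1}_N
      && \text{(linear system the iteration solves)}
\end{align*}
```

Read this way, the iterative version is a Jacobi-style fixed-point iteration on the linear system in the last line.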

- Split same content into smaller vertices
- Avoid dangling pages
- Avoid many outgoing links

- Treat certain pages as communities
- Bias certain pages by using a non-uniform distribution in the vector IN
- Tinker with the connectivity [PageRank is proved to be affected by the regularity of the connection pattern]
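
Biasing pages through a non-uniform vector can be sketched as follows. The function `personalized_pagerank`, the three-page `cycle` graph, and the bias values are hypothetical; the bias vector plays the role of IN but is assumed here to be normalized to sum to 1, and no dangling pages are assumed.

```python
def personalized_pagerank(links, bias, d=0.85, iterations=100):
    """Iterate x = d*W*x + (1-d)*bias, where bias plays the role of IN.

    links maps each page to the pages it links to (no dangling pages assumed);
    bias maps each page to its share of the random jumps and sums to 1.
    """
    pages = list(links)
    x = dict(bias)
    for _ in range(iterations):
        # Every page first receives its share of the random jumps
        nxt = {p: (1 - d) * bias[p] for p in pages}
        # Then each page splits its damped score among its outgoing links
        for source in pages:
            share = d * x[source] / len(links[source])
            for target in links[source]:
                nxt[target] += share
        x = nxt
    return x

# Hypothetical symmetric cycle: with a uniform bias all pages score equally
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
uniform = personalized_pagerank(cycle, {"A": 1/3, "B": 1/3, "C": 1/3})
promoted = personalized_pagerank(cycle, {"A": 0.8, "B": 0.1, "C": 0.1})
```

Because the graph is symmetric, any difference between `uniform` and `promoted` comes from the bias vector alone; page A scores higher in `promoted`.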

- PageRank can be computed on a graph changing over time
- Practical interest [Web is alive]

- An optimal algorithm exists for computing PageRank
- Practical applications: Search engines, PageRank on billions of pages – efficiency!
- O(|H| log 1/ε)
- NOT dependent on the connectivity or other dimensions
- Ideal computation: stops when the ranking of the vertices does not change between two successive computations [convergence]

- A unique PageRank vector is guaranteed to exist only if the Markov chain is irreducible
- By nature, the Web is non-bipartite, sparse, and produces a reducible Markov chain
- The Web hyperlinked matrix is forced to be
- Stochastic [non-negatives, all columns sum up to 1]
- Remove dangling nodes, or replace the relevant rows/columns with a small uniform value, usually (1/n)e^T
- Introduce personalization vector

- Primitive
- Non-negative
- One positive element on the main diagonal
- Irreducible

- Stochastic [non-negatives, all columns sum up to 1]

- A convex combination of the original stochastic matrix and a stochastic perturbation matrix
- Produces a stochastic, irreducible matrix
- The PageRank vector is guaranteed to exist for this matrix
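
The convex combination can be sketched on a tiny graph. This sketch uses the column-stochastic convention stated on the slides, so a dangling page's column is replaced by the uniform value 1/n. The function name `google_matrix`, the integer page labels, and the three-page example are assumptions for illustration.

```python
def google_matrix(links, n, d=0.85, v=None):
    """Return G = d*S + (1-d)*E as a list of rows, column-stochastic.

    links[j] lists the pages that page j links to (pages are 0..n-1).
    S is the hyperlink matrix with dangling columns replaced by 1/n.
    E has every column equal to the personalization vector v.
    """
    if v is None:
        v = [1.0 / n] * n  # uniform personalization vector
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = links.get(j, [])
        for i in range(n):
            if targets:
                s_ij = targets.count(i) / len(targets)
            else:
                s_ij = 1.0 / n  # dangling page: jump anywhere uniformly
            G[i][j] = d * s_ij + (1 - d) * v[i]
    return G

# Hypothetical 3-page web; page 2 has no outgoing links (dangling)
G = google_matrix({0: [1, 2], 1: [2]}, n=3)
column_sums = [sum(G[i][j] for i in range(3)) for j in range(3)]
all_positive = all(entry > 0 for row in G for entry in row)
```

Every column of `G` sums to 1 (stochastic) and every entry is strictly positive, so every page can reach every other page in one step (irreducible).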

- Every node is directly connected to every other node; all transition probabilities are non-zero
- Irreducible Markov chain, will converge

- Computation
- Power method
- Notoriously slow in general
- Still the method of choice for PageRank
- Requires no computation of intermediate matrices
- Converges quickly in practice [rate governed by the damping factor]

- Linear systems method

- Power method
- The damping factor [usually 0.85]
- Greater value: more iterations required
- ‘Truer’ PageRanks
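
The trade-off between the damping factor and the iteration count can be estimated roughly. This sketch assumes the power-method error shrinks by a factor of about d per iteration; the function name `estimated_iterations` and the tolerance `eps` are hypothetical.

```python
import math

def estimated_iterations(d, eps=1e-6):
    """Rough count of power-method iterations for error below eps,
    assuming the error shrinks by a factor of about d per iteration."""
    return math.ceil(math.log(eps) / math.log(d))

# Compare a small, the usual, and a large damping factor
estimates = {d: estimated_iterations(d) for d in (0.5, 0.85, 0.99)}
```

The estimate grows sharply as d approaches 1, which matches the note above that greater values require more iterations.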

- Dangling pages
- Storage issues

Thanks for listening!