
PageRank



Presentation Transcript


  1. PageRank

  2. [Figure: four-state Markov chain with states $s_1, s_2, s_3, s_4$ and transition probabilities $p_{12}, p_{21}, p_{13}, p_{31}, p_{34}, p_{41}, p_{42}$]

  $x_1 = (p_{21}p_{34}p_{41} + p_{34}p_{42}p_{21} + p_{21}p_{31}p_{41} + p_{31}p_{42}p_{21}) / \Sigma$
  $x_2 = (p_{31}p_{41}p_{12} + p_{31}p_{42}p_{12} + p_{34}p_{41}p_{12} + p_{34}p_{42}p_{12} + p_{13}p_{34}p_{42}) / \Sigma$
  $x_3 = (p_{41}p_{21}p_{13} + p_{42}p_{21}p_{13}) / \Sigma$
  $x_4 = (p_{21}p_{13}p_{34}) / \Sigma$
  $\Sigma = p_{21}p_{34}p_{41} + p_{34}p_{42}p_{21} + p_{21}p_{31}p_{41} + p_{31}p_{42}p_{21} + p_{31}p_{41}p_{12} + p_{31}p_{42}p_{12} + p_{34}p_{41}p_{12} + p_{34}p_{42}p_{12} + p_{13}p_{34}p_{42} + p_{41}p_{21}p_{13} + p_{42}p_{21}p_{13} + p_{21}p_{13}p_{34}$
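A quick numeric sanity check of these spanning-tree formulas (not from the original deck): the sketch below assigns illustrative values to the transition probabilities, builds the chain's transition matrix, and confirms that the formulas above match the stationary distribution computed by eigendecomposition. All concrete probability values are assumptions.

```python
import numpy as np

# Assumed illustrative values; each state's outgoing probabilities sum to 1.
p12, p13 = 0.9, 0.1          # out of s1
p21 = 1.0                    # out of s2 (its only outlink is s1)
p31, p34 = 0.3, 0.7          # out of s3
p41, p42 = 0.5, 0.5          # out of s4

# Row-stochastic transition matrix: P[i, j] = Pr(s_{i+1} -> s_{j+1}).
P = np.array([[0.0, p12, p13, 0.0],
              [p21, 0.0, 0.0, 0.0],
              [p31, 0.0, 0.0, p34],
              [p41, p42, 0.0, 0.0]])

# Spanning-tree formulas from the slide.
x = np.array([
    p21*p34*p41 + p34*p42*p21 + p21*p31*p41 + p31*p42*p21,
    p31*p41*p12 + p31*p42*p12 + p34*p41*p12 + p34*p42*p12 + p13*p34*p42,
    p41*p21*p13 + p42*p21*p13,
    p21*p13*p34,
])
x /= x.sum()                 # dividing by the common denominator Sigma

# Stationary distribution: eigenvector of P^T for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

assert np.allclose(x, pi)    # the tree formula matches the stationary vector
```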

  3. Ergodic Theorem Revisited. If there exists a reverse spanning tree in the graph of the Markov chain associated with a stochastic system, then: (a) the stochastic system admits a probability vector as a solution (of the spanning-tree form shown on slide 2); (b) the solution is unique; (c) the conditions {xi ≥ 0}, i = 1, …, n, are redundant, and the solution can be computed by Gaussian elimination.
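A minimal sketch of point (c), assuming a row-stochastic transition matrix P: one equation of the stationarity system is redundant, so we overwrite it with the normalization constraint and solve by ordinary elimination. The function name and setup are mine, not from the slides.

```python
import numpy as np

def stationary_by_elimination(P):
    """Solve x = x P with sum(x) = 1 as a plain linear system.

    P is a row-stochastic transition matrix.  One equation of
    (P^T - I) x = 0 is redundant, so we overwrite it with the
    normalization constraint; np.linalg.solve then finds the unique
    solution by LU factorization, i.e. Gaussian elimination.
    """
    n = P.shape[0]
    A = P.T - np.eye(n)
    A[-1, :] = 1.0           # replace the last equation with sum(x) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```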

  4. Google PageRank Patent • “The rank of a page can be interpreted as the probability that a surfer will be at the page after following a large number of forward links.” The Ergodic Theorem

  5. Google PageRank Patent • “The iteration circulates the probability through the linked nodes like energy flows through a circuit and accumulates in important places.” Kirchhoff (1847)

  6. Rank Sinks. [Figure: graph with nodes 1–7 containing a rank sink; no reverse spanning tree exists.]

  7. Ranking web pages • Web pages are not equally “important” • www.joe-schmoe.com vs. www.stanford.edu • Inlinks as votes • www.stanford.edu has 23,400 inlinks • www.joe-schmoe.com has 1 inlink • Are all inlinks equal?

  8. Simple recursive formulation • Each link’s vote is proportional to the importance of its source page • If page P with importance x has n outlinks, each link gets x/n votes

  9. Matrix formulation • Matrix M has one row and one column for each web page • Suppose page j has n outlinks • If j → i, then Mij = 1/n • Else Mij = 0 • M is a column-stochastic matrix • Columns sum to 1 • Suppose r is a vector with one entry per web page • ri is the importance score of page i • Call it the rank vector
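To make the definition concrete, here is a small sketch (my own helper, not from the deck) that builds the column-stochastic matrix M from an outlink map:

```python
import numpy as np

def column_stochastic_matrix(outlinks, N):
    """Build M with M[i, j] = 1/n when page j (with n outlinks) links to i.

    `outlinks` maps each page j to the list of pages it links to;
    pages are numbered 0..N-1.  (Hypothetical helper, not from the deck.)
    """
    M = np.zeros((N, N))
    for j, targets in outlinks.items():
        for i in targets:
            M[i, j] = 1.0 / len(targets)
    return M

# Tiny illustrative graph: 0 -> {1, 2}, 1 -> {0}, 2 -> {0, 1}.
M = column_stochastic_matrix({0: [1, 2], 1: [0], 2: [0, 1]}, 3)
assert np.allclose(M.sum(axis=0), 1.0)    # columns sum to 1
```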

  10. Example. Suppose page j links to 3 pages, one of which is i; then Mij = 1/3. [Figure: the product M · r, with the 1/3 entry in column j, row i highlighted.]

  11. Eigenvector formulation • The flow equations can be written r = Mr • So the rank vector r is an eigenvector of the stochastic web matrix M • In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1
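This can be checked directly with a small eigendecomposition; a sketch on the 3-page graph used above (the matrix values are my illustration):

```python
import numpy as np

# M for the 3-page graph above: 0 -> {1, 2}, 1 -> {0}, 2 -> {0, 1}.
M = np.array([[0.0, 1.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0]])

vals, vecs = np.linalg.eig(M)
r = np.real(vecs[:, np.argmax(np.real(vals))])
r /= r.sum()                    # normalize to a probability vector

assert np.isclose(np.real(vals).max(), 1.0)   # principal eigenvalue is 1
assert np.allclose(M @ r, r)                  # r = Mr
```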

  12. Example

  13. Power Iteration method • Simple iterative scheme (aka relaxation) • Suppose there are N web pages • Initialize: r0 = [1/N, …, 1/N]T • Iterate: rk+1 = Mrk • Stop when |rk+1 − rk|1 < ε • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm • Can use any other vector norm, e.g., Euclidean
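A minimal sketch of the scheme just described (the tolerance and iteration-cap defaults are my assumptions, not from the deck):

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Power iteration as on the slide: r_{k+1} = M r_k,
    stopping when the L1 change |r_{k+1} - r_k|_1 drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)            # r_0 = [1/N, ..., 1/N]^T
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:    # L1 norm of the change
            return r_next
        r = r_next
    return r                           # best estimate if not converged
```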

  14. Random Walk Interpretation • Imagine a random web surfer • At any time t, surfer is on some page P • At time t+1, the surfer follows an outlink from P uniformly at random • Ends up on some page Q linked from P • Process repeats indefinitely • Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t • p(t) is a probability distribution on pages
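The random-walk reading suggests an empirical check (my own sketch, assuming every page has at least one outlink, so the walk never gets stuck): simulate one long walk and compare visit frequencies with the rank vector.

```python
import random

def simulate_surfer(outlinks, steps=100_000, start=0):
    """Estimate p(t) for large t with one long simulated walk.

    Assumes every page has at least one outlink (no dead ends or
    spider traps), so the surfer can always move on.
    """
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        page = random.choice(outlinks[page])   # follow an outlink uniformly
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# On the 3-page graph above, the estimates approach (4/9, 3/9, 2/9),
# i.e. the rank vector r from the eigenvector formulation.
print(simulate_surfer({0: [1, 2], 1: [0], 2: [0, 1]}))
```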

  15. Spider traps • A group of pages is a spider trap if there are no links from within the group to outside the group • Random surfer gets trapped • Spider traps violate the conditions needed for the random-walk (ergodic) theorem

  16. Random teleports • The Google solution for spider traps • At each time step, the random surfer has two options: • With probability β, follow a link at random • With probability 1 − β, jump to some page uniformly at random • Common values for β are in the range 0.8 to 0.9 • Surfer will teleport out of a spider trap within a few time steps

  17. Matrix formulation • Suppose there are N pages • Consider a page j, with set of outlinks O(j) • We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise • The random teleport is equivalent to: • adding a teleport link from j to every other page with probability (1 − β)/N • reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)| • Equivalent: tax each page a fraction (1 − β) of its score and redistribute it evenly

  18. The Google matrix: $G_{ji} = q/n + (1 - q)\,A_{ij}/n_i$, where A is the adjacency matrix, n is the number of nodes, $n_i$ is the out-degree of node i, and q is the teleport probability (0.15).

  19. PageRank • Construct the N × N matrix A as follows • Aij = βMij + (1 − β)/N • Verify that A is a stochastic matrix • The PageRank vector r is the principal eigenvector of this matrix • satisfying r = Ar • Equivalently, r is the stationary distribution of the random walk with teleports
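Putting slides 16–19 together, a dense-matrix sketch (β = 0.8 as in the example on the next slide; the function name is mine):

```python
import numpy as np

def pagerank_dense(M, beta=0.8, eps=1e-8):
    """PageRank via the teleport-adjusted matrix A = beta*M + (1-beta)/N.

    M must be column-stochastic (no dead ends yet; see slides 21-23).
    beta is the link-following probability from slide 16.
    """
    N = M.shape[0]
    A = beta * M + (1.0 - beta) / N          # teleport smoothing
    assert np.allclose(A.sum(axis=0), 1.0)   # A is still stochastic
    r = np.full(N, 1.0 / N)
    while True:
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
```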

  20. Example (pages y = Yahoo, a = Amazon, m = M'soft):

      A = 0.8 * [ 1/2  1/2   0  ]  +  0.2 * [ 1/3  1/3  1/3 ]
                [ 1/2   0    0  ]           [ 1/3  1/3  1/3 ]
                [  0   1/2   1  ]           [ 1/3  1/3  1/3 ]

        = [ 7/15  7/15   1/15 ]     (rows and columns ordered y, a, m)
          [ 7/15  1/15   1/15 ]
          [ 1/15  7/15  13/15 ]

  21. Dead ends • Pages with no outlinks are “dead ends” for the random surfer • Nowhere to go on next step

  22. Microsoft becomes a dead end, and the matrix is no longer stochastic!

      A = 0.8 * [ 1/2  1/2   0  ]  +  0.2 * [ 1/3  1/3  1/3 ]
                [ 1/2   0    0  ]           [ 1/3  1/3  1/3 ]
                [  0   1/2   0  ]           [ 1/3  1/3  1/3 ]

        = [ 7/15  7/15  1/15 ]      (the m column now sums to 3/15, not 1)
          [ 7/15  1/15  1/15 ]
          [ 1/15  7/15  1/15 ]

  23. Dealing with dead ends • Teleport: follow random teleport links with probability 1.0 from dead ends, and adjust the matrix accordingly (see the sketch below) • Prune and propagate: preprocess the graph to eliminate dead ends (might require multiple passes), compute PageRank on the reduced graph, then approximate values for dead ends by propagating values from the reduced graph
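A sketch of the teleport fix (the helper name is mine; it implements the first option by making dead-end columns uniform):

```python
import numpy as np

def patch_dead_ends(M):
    """Teleport fix for dead ends: make the surfer jump to a uniformly
    random page with probability 1.0 from any page with no outlinks,
    i.e. replace each all-zero column of M with a uniform column."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0     # columns of dead-end pages
    M[:, dead] = 1.0 / N
    return M
```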

  24. Computing PageRank • Key step is a matrix-vector multiply • rnew = Arold • Easy if we have enough main memory to hold A, rold, rnew • Say N = 1 billion pages • We need 4 bytes for each entry (say) • 2 billion entries for the two vectors, approx 8 GB • Matrix A has N^2 entries • 10^18 is a large number!

  25. Computing PageRank • Ranks the entire web: one global ranking • Only computed about once a month • Converges in relatively few iterations!

  26. Sparse matrix formulation • Although A is a dense matrix, it is obtained from a sparse matrix M • ~10 links per node, approx 10N entries • We can restate the PageRank equation: r = βMr + [(1 − β)/N]N • [(1 − β)/N]N is an N-vector with all entries (1 − β)/N • So in each iteration, we need to: • Compute rnew = βMrold • Add a constant value (1 − β)/N to each entry in rnew
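In code, the two steps of each iteration look like this sketch (assumes no dead ends, so no probability mass leaks; the link encoding mirrors slide 27):

```python
import numpy as np

def pagerank_sparse(outlinks, N, beta=0.8, eps=1e-8):
    """One power-iteration pass touches only the ~10N real links:
    r_new = beta * (M r_old) + (1 - beta)/N in every entry.

    Assumes no dead ends, so no probability mass is lost."""
    r = np.full(N, 1.0 / N)
    while True:
        r_next = np.full(N, (1.0 - beta) / N)   # constant teleport term
        for j, targets in outlinks.items():
            share = beta * r[j] / len(targets)  # beta * M[i, j] * r_old[j]
            for i in targets:
                r_next[i] += share
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
```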

  27. Sparse matrix encoding • Encode the sparse matrix using only its nonzero entries • Space proportional roughly to the number of links • say 10N, or 4 × 10 × 1 billion = 40 GB • still won't fit in memory, but will fit on disk • [Table: one row per source node, with columns: source node | degree | destination nodes]
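In memory, the per-row layout might look like this (the destination lists are invented for illustration; the real structure is a packed binary file on disk, streamed one row at a time):

```python
# One record per source node: (out-degree, destination list).
links = {
    0: (3, [1, 5, 7]),
    1: (2, [17, 64]),
    2: (1, [13]),
}
```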
