

  1. Lecture 7: Link Analysis. Mining Massive Datasets. Wu-Jun Li, Department of Computer Science and Engineering, Shanghai Jiao Tong University.

  2. Link Analysis Algorithms: PageRank; Hubs and Authorities; Topic-Sensitive PageRank; Spam Detection Algorithms. Other interesting topics we won't cover: detecting duplicates and mirrors; mining for communities (community detection). (Refer to Chapter 10 of the textbook.)

  3. Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection

  4. PageRank: Ranking web pages. Web pages are not equally "important": compare www.joe-schmoe.com vs. www.stanford.edu. Idea: treat inlinks as votes. www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink. But are all inlinks equal? This is a recursive question!

  5. PageRank: Simple recursive formulation. Each link's vote is proportional to the importance of its source page. If page P with importance x has n outlinks, each link gets x/n votes. Page P's own importance is the sum of the votes on its inlinks.

  6. PageRank: Simple "flow" model. The web in 1839: three pages, Yahoo (y), Amazon (a), and M'soft (m), linking to one another. Flow equations: y = y/2 + a/2; a = y/2 + m; m = a/2.

  7. PageRank: Solving the flow equations. 3 equations, 3 unknowns, no constants: there is no unique solution; all solutions are equivalent modulo a scale factor. An additional constraint forces uniqueness: y + a + m = 1, giving y = 2/5, a = 2/5, m = 1/5. Gaussian elimination works for small examples, but we need a better method for large graphs.
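
As a quick sanity check, here is a minimal sketch (assuming NumPy is available) that solves the flow equations together with the normalization constraint y + a + m = 1:

```python
import numpy as np

# Flow equations for the 3-page example, unknowns ordered (y, a, m):
#   y = y/2 + a/2,  a = y/2 + m,  m = a/2
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# (M - I) r = 0 alone has no unique solution, so stack the constraint
# y + a + m = 1 on top and solve the combined system by least squares.
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # -> [0.4 0.4 0.2], i.e., y = 2/5, a = 2/5, m = 1/5
```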

  8. PageRank: Matrix formulation. Matrix M has one row and one column for each web page. Suppose page j has n outlinks: if j → i, then Mij = 1/n; else Mij = 0. M is a column-stochastic matrix: columns sum to 1. Suppose r is a vector with one entry per web page, where ri is the importance score of page i. Call it the rank vector; |r| = 1.

  9. PageRank: Example. Suppose page j links to 3 pages, including page i. Then Mij = 1/3: in the product Mr, column j of M passes 1/3 of page j's rank to each of the 3 pages it links to.

  10. PageRank: Eigenvector formulation. The flow equations can be written r = Mr. So the rank vector r is an eigenvector of the stochastic web matrix M; in fact, it is its first or principal eigenvector, with corresponding eigenvalue 1.

  11. PageRank: Example. r = Mr for the Yahoo/Amazon/M'soft graph (rows/columns ordered y, a, m):

  [y]   [1/2 1/2  0] [y]
  [a] = [1/2  0   1] [a]
  [m]   [ 0  1/2  0] [m]

  which is exactly the flow equations: y = y/2 + a/2, a = y/2 + m, m = a/2.

  12. PageRank: Power Iteration method. Simple iterative scheme (aka relaxation). Suppose there are N web pages. Initialize: r0 = [1/N, …, 1/N]^T. Iterate: r(k+1) = M r(k). Stop when |r(k+1) - r(k)|1 < ε, where |x|1 = Σ_{1≤i≤N} |xi| is the L1 norm. Any other vector norm (e.g., Euclidean) can also be used.
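
A minimal sketch of power iteration (NumPy assumed; the matrix and stopping rule are the ones defined on these slides):

```python
import numpy as np

def pagerank_power(M, eps=1e-8):
    # Power iteration: r0 = uniform, r(k+1) = M r(k),
    # stopping when the L1 change drops below eps.
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm |r(k+1) - r(k)|
            return r_next
        r = r_next

# Column-stochastic matrix of the Yahoo/Amazon/M'soft example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(pagerank_power(M))  # converges to [0.4, 0.4, 0.2]
```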

  13. PageRank: Power Iteration Example. For the Yahoo/Amazon/M'soft matrix, starting from the uniform vector:
  (y, a, m) = (1/3, 1/3, 1/3) → (1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → … → (2/5, 2/5, 1/5)

  14. PageRank: Random Walk Interpretation. Imagine a random web surfer. At any time t, the surfer is on some page P. At time t+1, the surfer follows an outlink from P uniformly at random and ends up on some page Q linked from P. The process repeats indefinitely. Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.

  15. PageRank: The stationary distribution. Where is the surfer at time t+1? He follows a link uniformly at random, so p(t+1) = M p(t). Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t). Then p(t) is called a stationary distribution for the random walk. Our rank vector r satisfies r = Mr, so it is a stationary distribution for the random surfer.

  16. PageRank: Existence and Uniqueness. A central result from the theory of random walks (aka Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.

  17. PageRank: Spider traps. A group of pages is a spider trap if there are no links from within the group to outside the group: the random surfer gets trapped. Spider traps violate the conditions needed for the random walk theorem.

  18. PageRank: Microsoft becomes a spider trap. Change the graph so that M'soft links only to itself:

        y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    0 ]
  m [  0   1/2   1 ]

  Power iteration from (1, 1, 1): (1, 1, 1) → (1, 1/2, 3/2) → (3/4, 1/2, 7/4) → (5/8, 3/8, 2) → … → (0, 0, 3). All the importance is absorbed by the spider trap.

  19. PageRank: Random teleports. The Google solution for spider traps: at each time step, the random surfer has two options. With probability β, follow a link at random; with probability 1-β, jump to some page uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

  20. PageRank: Random teleports (β = 0.8). With probability 0.8 the surfer follows one of the original links; with probability 0.2 he teleports to one of the 3 pages uniformly at random. The resulting matrix (rows/columns ordered y, a, m):

        [1/2 1/2  0]         [1/3 1/3 1/3]   [7/15 7/15  1/15]
  0.8 · [1/2  0   0] + 0.2 · [1/3 1/3 1/3] = [7/15 1/15  1/15]
        [ 0  1/2  1]         [1/3 1/3 1/3]   [1/15 7/15 13/15]

  21. PageRank: Random teleports (β = 0.8). Power iteration with the new matrix A = [7/15 7/15 1/15; 7/15 1/15 1/15; 1/15 7/15 13/15], starting from (1, 1, 1):
  (y, a, m) = (1, 1, 1) → (1.00, 0.60, 1.40) → (0.84, 0.60, 1.56) → (0.776, 0.536, 1.688) → … → (7/11, 5/11, 21/11)
  M'soft still gets a high score, but it no longer absorbs all of the importance.

  22. PageRank: Matrix formulation. Suppose there are N pages. Consider a page j with set of outlinks O(j). We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise. The random teleport is equivalent to: adding a teleport link from j to every other page with probability (1-β)/N, and reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|. Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly.

  23. PageRank. Construct the N×N matrix A as follows: Aij = βMij + (1-β)/N. Verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar. Equivalently, r is the stationary distribution of the random walk with teleports.
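
A minimal sketch of this construction (NumPy assumed; the graph is the spider-trap example from slide 18):

```python
import numpy as np

def google_matrix(M, beta=0.8):
    # A = beta*M + (1-beta)/N: follow a link w.p. beta, teleport w.p. 1-beta.
    N = M.shape[0]
    return beta * M + (1.0 - beta) / N * np.ones((N, N))

# Spider-trap example: M'soft links only to itself.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
A = google_matrix(M, beta=0.8)
assert np.allclose(A.sum(axis=0), 1.0)  # verify A is column-stochastic

r = np.full(3, 1.0 / 3)
for _ in range(100):  # power iteration on A
    r = A @ r
print(r)  # -> approx [7/33, 5/33, 21/33], matching slide 21 up to scale
```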

  24. PageRank: Dead ends. Pages with no outlinks are "dead ends" for the random surfer: there is nowhere to go on the next step.

  25. PageRank: Microsoft becomes a dead end. Remove M'soft's outlinks; its column of M is now all zeros, so the matrix is non-stochastic:

        [1/2 1/2  0]         [1/3 1/3 1/3]   [7/15 7/15 1/15]
  0.8 · [1/2  0   0] + 0.2 · [1/3 1/3 1/3] = [7/15 1/15 1/15]
        [ 0  1/2  0]         [1/3 1/3 1/3]   [1/15 7/15 1/15]

  (the m column of the combined matrix sums to only 1/5, not 1). Power iteration from (1, 1, 1): (1, 1, 1) → (1, 0.6, 0.6) → (0.787, 0.547, 0.387) → (0.648, 0.430, 0.333) → … → (0, 0, 0). The total score leaks away through the dead end.

  26. PageRank: Dealing with dead ends. Option 1, teleport: follow random teleport links with probability 1.0 from dead ends, and adjust the matrix accordingly. Option 2, prune and propagate: preprocess the graph to eliminate dead ends (might require multiple passes), compute PageRank on the reduced graph, then approximate values for dead ends by propagating values from the reduced graph.
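
A small sketch of the first option (NumPy assumed): dead-end columns are replaced by uniform teleport columns before PageRank is computed.

```python
import numpy as np

def fix_dead_ends(M):
    # A dead end is a page with no outlinks, i.e., an all-zero column.
    # Replace it with a uniform column: teleport with probability 1.0.
    M = M.copy()
    N = M.shape[0]
    dead = (M.sum(axis=0) == 0)
    M[:, dead] = 1.0 / N
    return M

# M'soft as a dead end (slide 25): its column is all zeros.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
M = fix_dead_ends(M)
assert np.allclose(M.sum(axis=0), 1.0)  # column-stochastic again
```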

  27. PageRank: Computing PageRank. The key step is the matrix-vector multiplication r_new = A r_old. Easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages, with (say) 4 bytes per entry: the two vectors alone have 2 billion entries, approx 8 GB, and matrix A has N^2 = 10^18 entries. 10^18 is a large number!

  28. PageRank: Rearranging the equation. r = Ar, where Aij = βMij + (1-β)/N. Then for each i:
  ri = Σ_{1≤j≤N} Aij rj
     = Σ_{1≤j≤N} [βMij + (1-β)/N] rj
     = β Σ_{1≤j≤N} Mij rj + (1-β)/N · Σ_{1≤j≤N} rj
     = β Σ_{1≤j≤N} Mij rj + (1-β)/N, since |r| = 1
  In vector form: r = βMr + [(1-β)/N]N, where [x]N denotes an N-vector with all entries equal to x.

  29. PageRank: Sparse matrix formulation. We can rearrange the PageRank equation: r = βMr + [(1-β)/N]N, where [(1-β)/N]N is an N-vector with all entries (1-β)/N. M is a sparse matrix! With roughly 10 links per node: approx 10N entries. So in each iteration, we need to: compute r_new = βM r_old, then add the constant value (1-β)/N to each entry in r_new.
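
A minimal sketch of this sparse iteration (SciPy assumed; it presupposes no dead ends, so that |r| stays 1):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_sparse(M, beta=0.8, eps=1e-8):
    # Each iteration: r_new = beta * M @ r_old, then add (1-beta)/N everywhere.
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    while True:
        r_next = beta * (M @ r) + (1.0 - beta) / N
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Yahoo/Amazon/M'soft matrix stored sparsely (CSR keeps only nonzeros).
M = csr_matrix(np.array([[0.5, 0.5, 0.0],
                         [0.5, 0.0, 1.0],
                         [0.0, 0.5, 0.0]]))
print(pagerank_sparse(M))
```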

  30. PageRank: Sparse matrix encoding. Encode the sparse matrix using only its nonzero entries. Space is roughly proportional to the number of links: say 10N, or 4 × 10 × 1 billion = 40 GB. Still won't fit in memory, but will fit on disk. Each row of the encoding stores: source node, out-degree, destination nodes.
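
For concreteness, one way this encoding might look in Python; the node IDs below are made up for illustration:

```python
# One row per source page: (source node, out-degree, destination nodes).
# Column `source` of M implicitly holds 1/out-degree in each destination row.
links = [
    (0, 3, [1, 5, 7]),       # hypothetical pages
    (1, 2, [5, 64]),
    (2, 4, [13, 23, 64, 75]),
]
```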

  31. PageRank: Basic Algorithm. Assume we have enough RAM to fit r_new, plus some working memory; store r_old and matrix M on disk. Initialize: r_old = [1/N]N. Iterate: perform a sequential scan of M and r_old to update r_new, then write r_new out to disk as r_old for the next iteration. Every few iterations, compute |r_new - r_old| and stop if it is below a threshold (this step needs to read both vectors into memory).

  32. PageRank: Update step. Initialize all entries of r_new to (1-β)/N. For each page p (out-degree n): read into memory p, n, dest_1, …, dest_n, and r_old(p); then for j = 1..n: r_new(dest_j) += β · r_old(p)/n.
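
A sketch of one such pass in plain Python, scanning the (source, degree, destinations) encoding from slide 30:

```python
def update_step(links, r_old, beta=0.8):
    # Initialize every entry of r_new to the teleport share (1-beta)/N,
    # then distribute beta * r_old(p)/n along each of p's n outlinks.
    N = len(r_old)
    r_new = [(1.0 - beta) / N] * N
    for p, n, dests in links:
        share = beta * r_old[p] / n
        for d in dests:
            r_new[d] += share
    return r_new
```

On disk, `links` would be streamed sequentially rather than held in memory; only r_new needs to stay resident.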

  33. PageRank: Analysis. In each iteration, we have to read r_old and M and write r_new back to disk: IO cost = 2|r| + |M|. What if we had enough memory to fit both r_new and r_old? What if we could not even fit r_new in memory (say, 10 billion pages)?

  34. PageRank: Strip-based update. Problem: thrashing.

  35. PageRank: Block Update algorithm.

  36. PageRank: Block Update algorithm (figure: the source/degree/destination encoding is partitioned by blocks of destination nodes, so each block of r_new can be updated in one scan).

  37. PageRank: Block Update algorithm. Some additional overhead, but usually worth it. Cost per iteration: |M|(1+ε) + (k+1)|r|, where k is the number of blocks.

  38. Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection

  39. Topic-Sensitive PageRank: Some problems with PageRank. It measures the generic popularity of a page, so it is biased against topic-specific authorities, e.g., for ambiguous queries such as jaguar. It uses a single measure of importance; other models exist, e.g., hubs-and-authorities. It is susceptible to link spam: artificial link topologies created in order to boost page rank.

  40. Topic-Sensitive PageRank. Instead of generic popularity, can we measure popularity within a topic (e.g., computer science, health)? Idea: bias the random walk. When the random walker teleports, he picks a page from a set S of web pages; S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org). For each teleport set S, we get a different rank vector rS.

  41. Topic-Sensitive PageRank: Matrix formulation. Aij = βMij + (1-β)/|S| if i is in S; Aij = βMij otherwise. Show that A is stochastic. We have weighted all pages in the teleport set S equally; we could also assign different weights to them.
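
A minimal sketch of the biased teleport matrix (NumPy assumed; S = {0} and β = 0.8 are illustrative):

```python
import numpy as np

def topic_sensitive_matrix(M, S, beta=0.8):
    # Aij = beta*Mij + (1-beta)/|S| if i is in S, else beta*Mij.
    N = M.shape[0]
    teleport = np.zeros(N)
    teleport[list(S)] = 1.0 / len(S)  # teleports land only inside S
    return beta * M + (1.0 - beta) * np.outer(teleport, np.ones(N))

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = topic_sensitive_matrix(M, S={0}, beta=0.8)
assert np.allclose(A.sum(axis=0), 1.0)  # A is still column-stochastic
```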

  42. Topic-Sensitive PageRank: Example. Suppose S = {1} and β = 0.8 on a 4-node graph:

  Node   Iter 0   Iter 1   Iter 2   …   Stable
  1      1.0      0.2      0.52         0.294
  2      0        0.4      0.08         0.118
  3      0        0.4      0.08         0.327
  4      0        0        0.32         0.261

  Note how we initialize the PageRank vector differently from the unbiased PageRank case: all the initial mass sits on the teleport set.

  43. Topic-Sensitive PageRank: How well does TSPR work? Experimental results [Haveliwala 2000]: picked 16 topics, with teleport sets determined using DMOZ (e.g., arts, business, sports, …). "Blind study" using volunteers, with 35 test queries; results were ranked using both PageRank and the TSPR of the most closely related topic (e.g., bicycling using the Sports ranking). In most cases volunteers preferred the TSPR ranking.

  44. Topic-Sensitive PageRank: Which topic ranking to use? The user can pick from a menu. Or use Bayesian classification schemes to classify the query into a topic. Can also use the context of the query: e.g., the query is launched from a web page talking about a known topic; the history of queries, e.g., "basketball" followed by "jordan"; user context, e.g., the user's My Yahoo settings, bookmarks, …

  45. Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection

  46. Hubs and Authorities. Suppose we are given a collection of documents on some broad topic (e.g., stanford, evolution, iraq), perhaps obtained through a text search. Can we organize these documents in some manner? PageRank offers one solution; HITS (Hypertext-Induced Topic Selection) is another, proposed at approximately the same time (1998).

  47. Hubs and Authorities: HITS Model. Interesting documents fall into two classes. Authorities are pages containing useful information: course home pages, home pages of auto manufacturers. Hubs are pages that link to authorities: a course bulletin, a list of US auto manufacturers.

  48. Hubs and Authorities: Idealized view (figure: hubs on one side, each linking to authorities on the other).

  49. Hubs and Authorities: Mutually recursive definition. A good hub links to many good authorities; a good authority is linked from many good hubs. Model this using two scores for each node: a hub score and an authority score, represented as vectors h and a.

  50. Hubs and Authorities: Transition Matrix A. HITS uses a matrix A, where A[i, j] = 1 if page i links to page j, and 0 if not. A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1's where M has fractions.
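
This slide only defines the matrix, so as a hedged sketch: the standard HITS iteration implied by the mutually recursive definition above is h = Aa and a = A^T h, with normalization after each step (NumPy assumed; the tiny graph is made up for illustration):

```python
import numpy as np

def hits(A, iters=50):
    # A page's authority score sums the hub scores of pages linking to it;
    # its hub score sums the authority scores of pages it links to.
    # Normalize after each update to keep the scores bounded.
    N = A.shape[0]
    h = np.ones(N)
    for _ in range(iters):
        a = A.T @ h
        a /= a.max()
        h = A @ a
        h /= h.max()
    return h, a

# Illustrative graph: page 0 links to 1 and 2; page 1 links to 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
h, a = hits(A)
print(h, a)  # page 0 scores highest as a hub; page 2 as an authority
```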
