Dive into PageRank, Hubs, Authorities, and more. Understand how web pages are ranked and importance is calculated using link structures in this comprehensive lecture.
Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets
Link Analysis Algorithms PageRank Hubs and Authorities Topic-Sensitive PageRank Spam Detection Algorithms Other interesting topics we won’t cover Detecting duplicates and mirrors Mining for communities (community detection) (Refer to Chapter 10 of the textbook)
Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection
PageRank Ranking web pages: Web pages are not equally “important” — compare www.joe-schmoe.com vs. www.stanford.edu. Inlinks as votes: www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink. But are all inlinks equal? Recursive question!
PageRank Each link’s vote is proportional to the importance of its source page If page P with importance x has n outlinks, each link gets x/n votes Page P’s own importance is the sum of the votes on its inlinks Simple recursive formulation
PageRank Simple “flow” model: the web in 1839 has three pages, Yahoo (y), Amazon (a), and M’soft (m). Flow equations:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2
PageRank 3 equations, 3 unknowns, no constants No unique solution All solutions equivalent modulo scale factor Additional constraint forces uniqueness y+a+m = 1 y = 2/5, a = 2/5, m = 1/5 Gaussian elimination method works for small examples, but we need a better method for large graphs Solving the flow equations
PageRank Matrix formulation: Matrix M has one row and one column for each web page. Suppose page j has n outlinks: if j → i, then Mij = 1/n, else Mij = 0. M is a column-stochastic matrix (columns sum to 1). Suppose r is a vector with one entry per web page; ri is the importance score of page i. Call it the rank vector; |r| = 1.
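The construction of M can be sketched in a few lines of Python (a minimal sketch, not from the lecture; `build_M` and the 0/1/2 page ids are illustrative):

```python
def build_M(outlinks, n_pages):
    """outlinks: dict mapping page j -> list of pages j links to."""
    M = [[0.0] * n_pages for _ in range(n_pages)]
    for j, dests in outlinks.items():
        for i in dests:
            M[i][j] = 1.0 / len(dests)   # each of j's n outlinks gets 1/n
    return M

# y=0, a=1, m=2: y links to {y, a}, a links to {y, m}, m links to {a}
M = build_M({0: [0, 1], 1: [0, 2], 2: [1]}, 3)
# Each column of M sums to 1 (column-stochastic)
```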
PageRank Example (figure): suppose page j links to 3 pages, one of which is i. Then Mij = 1/3, and column j of M passes one third of rj to entry i of Mr.
PageRank Eigenvector formulation: The flow equations can be written r = Mr. So the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1.
PageRank Example: r = Mr for the Yahoo (y) / Amazon (a) / M’soft (m) graph:

  [y]   [1/2 1/2  0 ] [y]
  [a] = [1/2  0   1 ] [a]
  [m]   [ 0  1/2  0 ] [m]

Equivalently: y = y/2 + a/2, a = y/2 + m, m = a/2
PageRank Power Iteration method: Simple iterative scheme (aka relaxation). Suppose there are N web pages. Initialize: r0 = [1/N, …, 1/N]^T. Iterate: r(k+1) = M r(k). Stop when |r(k+1) − r(k)|1 < ε, where |x|1 = Σ(1≤i≤N) |xi| is the L1 norm. Any other vector norm, e.g. Euclidean, also works.
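The scheme above can be sketched in Python (a minimal sketch, not from the lecture; `power_iteration` is an illustrative name):

```python
def power_iteration(M, eps=1e-10):
    """Iterate r_{k+1} = M r_k until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n                                            # r_0 = [1/N, ..., 1/N]
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(r_next[i] - r[i]) for i in range(n)) < eps:   # L1 norm
            return r_next
        r = r_next

# Yahoo/Amazon/M'soft matrix from the flow example (order y, a, m)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)   # converges to (2/5, 2/5, 1/5)
```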
PageRank Power Iteration Example (Yahoo/Amazon/M’soft graph, same M as before):

  [y]   [1/3]   [1/3]   [5/12]   [ 3/8 ]         [2/5]
  [a] = [1/3] → [1/2] → [1/3 ] → [11/24] → … → [2/5]
  [m]   [1/3]   [1/6]   [1/4 ]   [ 1/6 ]         [1/5]
PageRank Imagine a random web surfer At any time t, surfer is on some page P At time t+1, the surfer follows an outlink from P uniformly at random Ends up on some page Q linked from P Process repeats indefinitely Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t p(t) is a probability distribution on pages Random Walk Interpretation
PageRank Where is the surfer at time t+1? Follows a link uniformly at random p(t+1) = Mp(t) Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t) Then p(t) is called a stationary distribution for the random walk Our rank vector r satisfies r = Mr So it is a stationary distribution for the random surfer The stationary distribution
PageRank A central result from the theory of random walks (aka Markov processes): For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0. Existence and Uniqueness
PageRank A group of pages is a spider trap if there are no links from within the group to outside the group Random surfer gets trapped Spider traps violate the conditions needed for the random walk theorem Spider traps
PageRank Spider trap example: M’soft becomes a spider trap (m now links only to itself):

       y    a    m
  y [ 1/2  1/2   0 ]
  a [ 1/2   0    0 ]
  m [  0   1/2   1 ]

  [y]   [1]   [ 1 ]   [3/4]   [5/8]         [0]
  [a] = [1] → [1/2] → [1/2] → [3/8] → … → [0]
  [m]   [1]   [3/2]   [7/4]   [ 2 ]         [3]
PageRank Random teleports: The Google solution for spider traps. At each time step, the random surfer has two options: with probability β, follow a link at random; with probability 1−β, jump to some page uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.
PageRank Random teleports (β = 0.8): the transition matrix mixes link-following with uniform teleports:

        [1/2 1/2  0]         [1/3 1/3 1/3]   [7/15 7/15  1/15]
  0.8 × [1/2  0   0] + 0.2 × [1/3 1/3 1/3] = [7/15 1/15  1/15]
        [ 0  1/2  1]         [1/3 1/3 1/3]   [1/15 7/15 13/15]
PageRank Random teleports (β = 0.8), iterating with the mixed matrix above:

  [y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
  [a] = [1] → [0.60] → [0.60] → [0.536] → … → [ 5/11]
  [m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]

The teleports prevent the trap from absorbing all the importance.
PageRank Matrix formulation: Suppose there are N pages. Consider a page j with set of outlinks O(j). We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise. The random teleport is equivalent to: adding a teleport link from j to every other page with probability (1−β)/N, and reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|. Equivalently: tax each page a fraction (1−β) of its score and redistribute it evenly.
PageRank: Construct the N×N matrix A as follows: Aij = βMij + (1−β)/N. Verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar. Equivalently, r is the stationary distribution of the random walk with teleports.
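Constructing A can be sketched as follows (a minimal sketch; `google_matrix` is an illustrative name, β = 0.8 as in the slides):

```python
def google_matrix(M, beta=0.8):
    """A_ij = beta*M_ij + (1-beta)/N: link-following mixed with teleports."""
    n = len(M)
    return [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)]
            for i in range(n)]

# Spider-trap example: m links only to itself (order y, a, m)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = google_matrix(M)   # entries 7/15, 1/15, 13/15 as in the slides
```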
PageRank Pages with no outlinks are “dead ends” for the random surfer Nowhere to go on next step Dead ends
PageRank Dead end example: M’soft becomes a dead end, and the mixed matrix is no longer stochastic (its m column sums to 1/5, not 1):

        [1/2 1/2  0]         [1/3 1/3 1/3]   [7/15 7/15 1/15]
  0.8 × [1/2  0   0] + 0.2 × [1/3 1/3 1/3] = [7/15 1/15 1/15]   Non-stochastic!
        [ 0  1/2  0]         [1/3 1/3 1/3]   [1/15 7/15 1/15]

  [y]   [1]   [1.000]   [0.787]   [0.648]         [0]
  [a] = [1] → [0.600] → [0.547] → [0.430] → … → [0]
  [m]   [1]   [0.600]   [0.387]   [0.333]         [0]

All the importance leaks out.
PageRank Teleport Follow random teleport links with probability 1.0 from dead ends Adjust matrix accordingly Prune and propagate Preprocess the graph to eliminate dead ends Might require multiple passes Compute PageRank on reduced graph Approximate values for dead ends by propagating values from reduced graph Dealing with dead ends
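The first option (teleport with probability 1.0 from dead ends) can be sketched as a preprocessing pass over M (illustrative helper name; a dead end shows up as an all-zero column):

```python
def fix_dead_ends(M):
    """Replace each all-zero column (a dead end) with a uniform teleport column."""
    n = len(M)
    for j in range(n):
        if sum(M[i][j] for i in range(n)) == 0.0:   # page j has no outlinks
            for i in range(n):
                M[i][j] = 1.0 / n                    # teleport with probability 1
    return M

# Dead-end example: m has no outlinks, so its column is all zeros
M = fix_dead_ends([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
```

After this adjustment every column sums to 1, so the matrix is stochastic again and the teleport-mixing of the previous slides applies unchanged.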
PageRank Computing PageRank: Key step is the matrix-vector multiplication r_new = A r_old. Easy if we have enough main memory to hold A, r_old, r_new. But say N = 1 billion pages and 4 bytes per entry: the two vectors alone need about 8GB, and matrix A has N² = 10^18 entries. 10^18 is a large number!
PageRank Rearranging the equation: r = Ar, where Aij = βMij + (1−β)/N

  ri = Σ(1≤j≤N) Aij rj
     = Σ(1≤j≤N) [βMij + (1−β)/N] rj
     = β Σ(1≤j≤N) Mij rj + (1−β)/N Σ(1≤j≤N) rj
     = β Σ(1≤j≤N) Mij rj + (1−β)/N, since |r| = 1

So r = βMr + [(1−β)/N]_N, where [x]_N denotes an N-vector with all entries equal to x.
PageRank Sparse matrix formulation: We can rearrange the PageRank equation: r = βMr + [(1−β)/N]_N, where [(1−β)/N]_N is an N-vector with all entries (1−β)/N. M is a sparse matrix! With roughly 10 links per node, it has approx 10N entries. So in each iteration, we need to: compute r_new = βM r_old, then add the constant value (1−β)/N to each entry in r_new.
PageRank Sparse matrix encoding: Encode the sparse matrix using only its nonzero entries. Space is proportional roughly to the number of links: say 10N, or 4×10×1 billion = 40GB. This still won’t fit in memory, but will fit on disk. Each row of the encoding stores: source node, out-degree, destination nodes.
PageRank Basic Algorithm: Assume we have enough RAM to fit r_new, plus some working memory; store r_old and matrix M on disk. Initialize: r_old = [1/N]_N. Iterate: perform a sequential scan of M and r_old to update r_new, then write r_new out to disk as r_old for the next iteration. Every few iterations, compute |r_new − r_old| and stop if it is below threshold (this requires reading both vectors into memory).
PageRank Update step: Initialize all entries of r_new to (1−β)/N. For each page p (out-degree n): read into memory p, n, dest_1, …, dest_n, and r_old(p); then for j = 1..n: r_new(dest_j) += β·r_old(p)/n.
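The update step can be sketched with an in-memory dict standing in for the on-disk (source, degree, destinations) rows (illustrative names; a real implementation streams M and r_old from disk):

```python
def update(edges, r_old, beta=0.8):
    """One PageRank iteration: r_new = beta*M*r_old + (1-beta)/N."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n            # initialize entries to (1-beta)/N
    for p, dests in edges.items():            # "read p, out-degree, dest_1..dest_n"
        share = beta * r_old[p] / len(dests)  # beta * r_old(p) / n
        for d in dests:
            r_new[d] += share
    return r_new

# One pass over the 3-page example graph (no dead ends, so mass is conserved)
r_new = update({0: [0, 1], 1: [0, 2], 2: [1]}, [1/3, 1/3, 1/3])
```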
PageRank In each iteration, we have to: Read rold and M Write rnew back to disk IO Cost = 2|r| + |M| What if we had enough memory to fit both rnew and rold? What if we could not even fit rnew in memory? 10 billion pages Analysis
PageRank Strip-based update Problem: thrashing
PageRank Block Update algorithm
PageRank Block Update algorithm (figure: r_new is partitioned into blocks that fit in memory, and the source/degree/destination rows of M are grouped by the block their destinations fall in, so each block of r_new is filled in one pass).
PageRank Block Update algorithm: Some additional overhead, but usually worth it. Cost per iteration: |M|(1+ε) + (k+1)|r|, where k is the number of blocks and ε accounts for the duplicated source/degree entries across blocks.
Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection
Topic-Sensitive PageRank Some problems with PageRank: It measures generic popularity of a page, so it is biased against topic-specific authorities (consider ambiguous queries, e.g., jaguar). It uses a single measure of importance, whereas other models exist, e.g., hubs-and-authorities. It is susceptible to link spam: artificial link topologies created in order to boost page rank.
Topic-Sensitive PageRank Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When the random walker teleports, he picks a page from a set S of web pages S contains only pages that are relevant to the topic E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org) For each teleport set S, we get a different rank vector rS Topic-Sensitive PageRank
Topic-Sensitive PageRank Matrix formulation: Aij = βMij + (1−β)/|S| if i is in S; Aij = βMij otherwise. Show that A is stochastic. We have weighted all pages in the teleport set S equally; we could also assign different weights to them.
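Only the teleport step changes relative to ordinary PageRank. A minimal sketch (illustrative name `tspr`; here both the teleports and the initial vector are confined to S, as in the example that follows):

```python
def tspr(edges, n, S, beta=0.8, iters=100):
    """Power iteration with teleports restricted to the topic set S."""
    r = [1.0 / len(S) if i in S else 0.0 for i in range(n)]   # all mass starts on S
    for _ in range(iters):
        r_new = [0.0] * n
        for p, dests in edges.items():
            for d in dests:
                r_new[d] += beta * r[p] / len(dests)
        leaked = 1.0 - sum(r_new)             # teleport mass: spread over S only
        for s in S:
            r_new[s] += leaked / len(S)
        r = r_new
    return r

# 3-page example graph with S = {page 0}: the topic page dominates the ranking
r = tspr({0: [0, 1], 1: [0, 2], 2: [1]}, 3, [0])
```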
Topic-Sensitive PageRank Example: suppose S = {1} and β = 0.8 on the 4-node graph from the figure:

  Node   Iter 0   Iter 1   Iter 2   …   stable
  1      1.0      0.2      0.52         0.294
  2      0        0.4      0.08         0.118
  3      0        0.4      0.08         0.327
  4      0        0        0.32         0.261

Note how we initialize the PageRank vector differently from the unbiased PageRank case: all the initial mass is on S.
Topic-Sensitive PageRank Experimental results [Haveliwala 2000] Picked 16 topics Teleport sets determined using DMOZ E.g., arts, business, sports,… “Blind study” using volunteers 35 test queries Results ranked using PageRank and TSPR of most closely related topic E.g., bicycling using Sports ranking In most cases volunteers preferred TSPR ranking How well does TSPR work?
Topic-Sensitive PageRank User can pick from a menu Use Bayesian classification schemes to classify query into a topic Can use the context of the query E.g., query is launched from a web page talking about a known topic History of queries e.g., “basketball” followed by “jordan” User context e.g., user’s My Yahoo settings, bookmarks, … Which topic ranking to use?
Outline PageRank Topic-Sensitive PageRank Hubs and Authorities Spam Detection
Hubs and Authorities: Suppose we are given a collection of documents on some broad topic (e.g., stanford, evolution, iraq), perhaps obtained through a text search. Can we organize these documents in some manner? PageRank offers one solution; HITS (Hypertext-Induced Topic Selection) is another, proposed at approximately the same time (1998).
Hubs and Authorities Interesting documents fall into two classes Authorities are pages containing useful information course home pages home pages of auto manufacturers Hubs are pages that link to authorities course bulletin list of US auto manufacturers HITS Model
Hubs and Authorities Idealized view Hubs Authorities
Hubs and Authorities A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node Hub score and Authority score Represented as vectors h and a Mutually recursive definition
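The mutual recursion can be sketched directly, with A[i][j] = 1 if page i links to page j: alternate a = Aᵀh and h = Aa, rescaling after each step (illustrative helper name; rescaling by the max entry is one common choice):

```python
def hits(A, iters=50):
    """Alternate a = A^T h and h = A a, rescaling by the max entry each time."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]  # a = A^T h
        top = max(a) or 1.0
        a = [x / top for x in a]
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        top = max(h) or 1.0
        h = [x / top for x in h]
    return h, a

# Pages 0 and 1 both link to page 2: page 2 is the authority, 0 and 1 are hubs
h, a = hits([[0, 0, 1],
             [0, 0, 1],
             [0, 0, 0]])
```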
Hubs and Authorities Transition Matrix A: HITS uses a matrix A with A[i, j] = 1 if page i links to page j, and 0 if not. Aᵀ, the transpose of A, is similar to the PageRank matrix M, but Aᵀ has 1’s where M has fractions.