
Estimating the Global PageRank of Web Communities


Presentation Transcript


  1. Estimating the Global PageRank of Web Communities Paper by Jason V. Davis & Inderjit S. Dhillon Dept. of Computer Sciences University of Texas at Austin Presentation given by Scott J. McCallen Dept. of Computer Science Kent State University December 4th 2006

  2. Localized Search Engines • What are they? • Search engines focused on a particular community • Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific) • Advantages • Better at disambiguating query terms that have several meanings • Relatively inexpensive to build and use • Use less bandwidth, space, and time • Local domains are orders of magnitude smaller than the global domain

  3. Localized Search Engines (cont'd) • Disadvantages • Lack of global information • i.e. only local PageRanks are available • Why is this a problem? • Only pages that are highly regarded within the community will have high PageRanks • What is needed is a global PageRank for the pages of a local domain • Traditionally, this can only be obtained by crawling the entire global domain

  4. Some Global Facts • 2003 study by Lyman on the global domain • 8.9 billion static pages on the internet • Approximately 18.7 kilobytes each • 167 terabytes needed to download and crawl the entire web • These resources are only available to major corporations • Local domains • May only contain a few hundred thousand pages • May already be contained on a local web server (www.cs.kent.edu) • Access to the entire dataset is far less restricted • The advantages of localized search engines become clear

  5. Global (N) vs. Local (n) Each local domain isn’t aware of the rest of the global domain. Some parts overlap, but others don’t. Overlap represents links to other domains. How is it possible to extract global information when only the local domain is available? Excluding overlap from other domains gives a very poor estimate of global rank.

  6. Proposed Solution • Find a good approximation to the global PageRank values without crawling the entire global domain • Find a supergraph of the local domain that approximates those PageRanks well • Find this supergraph by crawling as few as n or 2n additional pages, given a local domain of n pages • Essentially, add as few pages to the local domain as possible until we obtain a very good approximation of the PageRanks of the pages in the local domain

  7. PageRank - Description • Defines importance of pages based on the hyperlinks from one page to another (the web graph) • Computes the stationary distribution of a Markov chain created from the web graph • Uses the “random surfer” model to create a “random walk” over the chain

  8. PageRank Matrix • Given an m × m adjacency matrix U for the web graph, define the PageRank matrix as P_U = α U D_U^{-1} + (1 − α) v e^T • D_U is a diagonal matrix such that U D_U^{-1} is column stochastic • α is the random surfer probability, 0 ≤ α ≤ 1 • e is the vector of all 1's • v is the random surfer vector
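
To make the construction concrete, here is a minimal numpy sketch of how such a matrix could be assembled under the column-stochastic convention stated above. The function name, the uniform default for v, and the guard for dangling pages are illustrative assumptions rather than details from the paper, and a real implementation would keep U sparse instead of forming the dense matrix.

```python
import numpy as np

def pagerank_matrix(U, alpha=0.85, v=None):
    """Dense sketch of P_U = alpha * U * D_U^{-1} + (1 - alpha) * v * e^T.

    U is an m x m adjacency matrix whose column j holds the outlinks of
    page j, so that U @ inv(D_U) is column stochastic; alpha is the random
    surfer probability and v the random surfer vector.
    """
    m = U.shape[0]
    if v is None:
        v = np.full(m, 1.0 / m)               # uniform surfer vector (assumed default)
    out_degree = U.sum(axis=0).astype(float)  # column sums = out-degrees
    out_degree[out_degree == 0] = 1.0         # dangling-page guard (assumption)
    P_link = alpha * (U / out_degree)         # same as alpha * U @ D_U^{-1}
    P_jump = (1.0 - alpha) * np.outer(v, np.ones(m))  # (1 - alpha) * v * e^T
    return P_link + P_jump
```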

  9. PageRank Vector • The PageRank vector r holds the PageRank of every node in the web graph • It is defined as the dominant eigenvector of the PageRank matrix • It is computed with the power method, started from a random vector • The computation can take as much as O(m²) time for a dense graph, but in practice it is normally O(km), where k is the average number of links per page

  10. Algorithm 1 • Computing the PageRank vector based on the adjacency matrix U of the given web graph

  11. Algorithm 1 (Explanation) • Input: adjacency matrix U • Output: PageRank vector r • Method • Choose a random initial value for r^(0) • Keep iterating with the random surfer probability and vector until the convergence threshold is reached • Return the final iterate as the dominant eigenvector of the PageRank matrix built from U
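
A compact power-method sketch in the spirit of Algorithm 1, under the same column-stochastic convention; the tolerance, iteration cap, and dangling-page guard are illustrative choices, not values from the paper.

```python
import numpy as np

def pagerank_vector(U, alpha=0.85, v=None, tol=1e-8, max_iter=1000):
    """Iterate r <- alpha * U D_U^{-1} r + (1 - alpha) * v until the
    L1 change falls below tol, starting from a random stochastic vector."""
    m = U.shape[0]
    if v is None:
        v = np.full(m, 1.0 / m)
    out_degree = U.sum(axis=0).astype(float)
    out_degree[out_degree == 0] = 1.0          # dangling-page guard (assumption)
    r = np.random.rand(m)
    r /= r.sum()                               # random starting vector r^(0)
    for _ in range(max_iter):
        r_next = alpha * (U @ (r / out_degree)) + (1.0 - alpha) * v
        r_next /= r_next.sum()                 # renormalize so the entries sum to 1
        if np.abs(r_next - r).sum() < tol:     # convergence threshold reached
            break
        r = r_next
    return r_next
```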

  12. Defining the Problem (G vs. L) • For a local domain L, let G be the entire global domain, with an N × N adjacency matrix • Partition G as G = [ L  L_out ; G_out  G_within ] • i.e. we split G into blocks so that L appears as a submatrix • Assume that L has already been crawled and L_out is known

  13. Defining the Problem (p* in g) • With G partitioned this way, the actual global PageRank vector of L, taken from the global PageRank vector g, is p* = E_L g / ||E_L g||_1 • Note: E_L selects only the nodes that correspond to L from g
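
In code, this restriction amounts to picking out the entries of g that belong to L and rescaling them to sum to one. A tiny sketch; representing E_L as an index array is an assumption of this sketch.

```python
import numpy as np

def restrict_to_local(g, local_indices):
    """p* = E_L g / ||E_L g||_1: select the entries of the global PageRank
    vector g that correspond to the pages of L, then renormalize."""
    p_star = g[np.asarray(local_indices)]
    return p_star / p_star.sum()
```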

  14. Defining the Problem (n << N) • We define p as the PageRank vector computed by crawling only the local domain L • Note that p will generally be quite different from p* • Crawling more nodes of the global domain would shrink the difference, but crawling everything is not feasible • Instead, find the supergraph F of L that minimizes the difference between p and p*

  15. Defining the Problem (finding F) • We need to find the F that gives the best approximation of p* • i.e. minimize GlobalDiff(f) = || p − p* ||_1, the difference between the estimated and the actual global PageRank of the local pages • F is found with a greedy strategy, using Algorithm 2 • Essentially, start with L and at each step add the nodes in F_out that most reduce the objective, for a total of T iterations

  16. Algorithm 2

  17. Algorithm 2 (Explanation) • Input: L (local domain), Lout (outlinks from L), T (number of iterations), k (pages to crawl per iteration) • Output: p (an improved estimated PageRank vector) • Method • First set F (supergraph) and Fout equal to L and Lout • Compute the PageRank vector of F • While T has not been exceeded • Select k new nodes to crawl based on F, Fout, f • Expand F to include those new nodes and modify Fout • Compute the new PageRank vector for F • Select the elements from f that correspond to L and return p
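
The loop structure can be summarized in a short skeleton. The callables passed in (select_nodes, crawl_pages, expand_graph, compute_pagerank, restrict_to_local) are hypothetical stand-ins for the subroutines the slide describes, not functions defined in the paper.

```python
def estimate_global_pagerank(L, L_out, T, k, select_nodes, crawl_pages,
                             expand_graph, compute_pagerank, restrict_to_local):
    """Skeleton of the greedy expansion loop described above (Algorithm 2)."""
    F, F_out = L, L_out                        # start from the crawled local domain
    f = compute_pagerank(F)                    # PageRank of the current supergraph F
    for _ in range(T):
        pages = select_nodes(F, F_out, f, k)   # choose k frontier pages to crawl
        F, F_out = expand_graph(F, F_out, crawl_pages(pages))
        f = compute_pagerank(F)                # recompute PageRank on the larger F
    return restrict_to_local(f, L)             # p: the entries of f that belong to L
```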

  18. Global (N) vs. Local (n) (Again) We know how to create the PageRank vector using the power method. Using it on only the local domain gives very inaccurate estimates of the PageRank. How can we select nodes from other domains (i.e. expand the current domain) to improve accuracy? And how many more nodes can we afford to select without ending up crawling the entire global domain?

  19. Selecting Nodes • Select nodes to expand L to F • Selected nodes must bring us closer to the actual PageRank vector • Some nodes will greatly influence the current PageRank • Only want to select at most O(n) more pages than those already in L

  20. Finding the Best Nodes • For a page j in the global domain that lies on the frontier of F (F_out), adding page j to F produces the expanded graph F_j = [ F  s ; u_j^T  0 ] • u_j holds the outlinks from F to j • s holds the estimated inlinks from j into F (j has not yet been crawled) • s is estimated from the expected inlink counts of the pages already crawled
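
A sketch of how F_j could be assembled, assuming the column-as-source convention used for the PageRank matrix above (links from F to j form the new bottom row, and the estimated links from j into F form the new right-most column); the exact block layout in the paper may differ.

```python
import numpy as np

def expand_with_candidate(F, u_j, s):
    """Build F_j by bordering F with the candidate page j.

    u_j: 0/1 vector of outlinks from pages of F to j (known from the crawl).
    s:   estimated vector of inlinks from j back into F (j is uncrawled).
    """
    l = F.shape[0]
    F_j = np.zeros((l + 1, l + 1))
    F_j[:l, :l] = F
    F_j[:l, l] = s          # estimated links from j into F (new column)
    F_j[l, :l] = u_j        # known links from F to j (new row)
    return F_j
```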

  21. Finding the Best Nodes (cont'd) • We defined the PageRank vector of F to be f • The PageRank vector of F_j is f_j^+ • x_j is the PageRank of the added node j (the new entry of the PageRank vector) • Directly optimizing the objective requires us to know the global PageRank p* • How can we minimize the objective without knowing p*?

  22. Node Influence • Find the nodes in F_out that will have the greatest influence on the local domain L • Done by attaching an influence score to each node j • The influence of j is the sum, over all pages in L, of the change that adding page j makes to their PageRank values • The influence score correlates strongly with the minimization of the GlobalDiff(f_j) objective (far more so than a baseline such as the total outlink count from F to node j)
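
A small sketch of such an influence score, assuming f_j_plus is the PageRank of F_j with the entry for node j itself removed so that it lines up with f. Whether the differences are summed signed or in absolute value is not spelled out on the slide, so the absolute form here is an assumption.

```python
import numpy as np

def influence_score(f, f_j_plus, local_indices):
    """Influence of candidate page j on the local domain L: the summed
    change that adding j makes to the PageRank values of the pages in L.
    Taking absolute differences is an assumption of this sketch."""
    idx = np.asarray(local_indices)
    return float(np.abs(f_j_plus[idx] - f[idx]).sum())
```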

  23. Node Influence Results • Node Influence vs. Outlink Count on a crawl of conservative web sites

  24. Finding the Influence • The influence must be calculated for every node j in the frontier of F that is considered • Since we consider O(n) candidate pages and each calculation is O(n), we are left with an O(n²) computation • To reduce this complexity, an approximation of the influence of j may be acceptable, but how? • The power method used to compute PageRank can lead us to a good approximation • However, using Algorithm 1 requires a good starting vector

  25. PageRank Vector (again) • The PageRank power method converges at a rate equal to the random surfer probability α • With a starting vector x^(0), the error shrinks by roughly a factor of α per iteration, so the more accuracy we demand, the more iterations the process takes • Saving grace: find a very good starting vector x^(0), in which case we only need to perform one iteration of Algorithm 1
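
A back-of-the-envelope illustration of why the starting vector matters: since the error shrinks by roughly a factor of α per iteration, the closer x^(0) already is to the solution, the fewer iterations are needed to reach a given tolerance. The α and tolerance values below are illustrative only.

```python
import numpy as np

alpha = 0.85   # random surfer probability (illustrative)
eps = 1e-8     # target accuracy (illustrative)

# Error after t iterations is roughly (starting error) * alpha^t, so about
# t = log(eps / starting_error) / log(alpha) iterations are needed.
for start_err in (1.0, 1e-2, 1e-6):
    iters = int(np.ceil(np.log(eps / start_err) / np.log(alpha)))
    print(f"starting error {start_err:g}: about {iters} iterations")
```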

  26. Finding the Best x^(0) • Partition the PageRank matrix of F_j into blocks separating the previously crawled pages from the new node j: P_{F_j} = [ P_11  p_12 ; p_21^T  p_22 ]

  27. Finding the Best x^(0) • Simple approach • Use the current PageRank vector f (extended with an entry for the added node) as the starting vector • Perform one PageRank iteration • Remove the element that corresponds to the added node • Issues • The estimate of f_j^+ will have an error of at least 2αx_j • So if the PageRank of j is very high, the estimate is very poor

  28. Stochastic Complement • Writing the PageRank vector of F_j as f_j^+ = [ f_j ; x_j ], the PageRank equations in expanded form are f_j = P_11 f_j + p_12 x_j and x_j = p_21^T f_j + p_22 x_j • Solving the second equation for x_j and substituting gives f_j = ( P_11 + p_12 (1 − p_22)^{-1} p_21^T ) f_j = S f_j • Observation: S is the stochastic complement of the PageRank matrix of F_j
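
One PageRank iteration over the stochastic complement can be carried out without ever forming S explicitly. A sketch using the generic block names from above (P_11, p_12, p_21, p_22 are this transcript's notation for the partition, not necessarily the paper's).

```python
import numpy as np

def one_iteration_over_S(P11, p12, p21, p22, f):
    """One PageRank iteration over S = P11 + p12 (1 - p22)^{-1} p21^T,
    starting from the current PageRank vector f and renormalizing.
    P11 is l x l; p12 and p21 are length-l vectors; p22 is a scalar."""
    x = P11 @ f + p12 * (p21 @ f) / (1.0 - p22)   # computes S @ f without forming S
    return x / x.sum()                            # length-l estimate of f_j
```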

  29. Stochastic Complement (Observations) • The stochastic complement of an irreducible matrix is unique • The stochastic complement is also irreducible and therefore has a unique stationary distribution • With regard to the matrix S • Its subdominant eigenvalue is bounded by a quantity that approaches α as the size l of F grows, so for large l the convergence rate is essentially α

  30. The New PageRank Approximation • Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f • Advantages • The iteration starts and ends with a vector of length l • The approximation error has a lower bound of zero • Example: consider adding a node k to F that has no influence over the PageRank of F • In that case the stochastic complement yields the exact solution

  31. The Details • Begin by expanding the difference between the two PageRank vectors (before and after adding page j), using the notation introduced above

  32. The Details • Substitute the PageRank matrix P_F into the expanded difference • Summarize the resulting terms into vectors (the quantities x, y, and z used in Algorithm 3)

  33. Algorithm 3 (Explanation) • Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return) • Output: k new pages to crawl • Method • Compute the outlink sums for each page in F • Compute a scalar for every known global page j (how many pages link to j) • Compute y and z as formulated • For each of the pages in F_out • Compute x as formulated • Compute the score of each page using x, y, and z • Return the k pages with the highest scores

  34. PageRank Leaks and Flows • The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows • A flow is the increase in local PageRanks • It is represented by a scalar times a vector: the scalar is the total amount of PageRank j has to distribute, and the vector determines how it is distributed • A leak is the decrease in local PageRanks • Leaks come from the non-positive vectors x and y • x is proportional to the weighted sum of sibling PageRanks • y is an artifact of the random surfer vector

  35. Leaks and Flows [Diagram: candidate node j sends flows into the local pages, while PageRank leaks away to sibling pages and the random surfer]

  36. Experiments • Methodology • Resources are limited, so the global graph is approximated • Baseline algorithms • Random: nodes are chosen uniformly at random from the known global nodes • Outlink count: the nodes chosen have the highest number of outlinks from the current local domain
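
For comparison, the two baselines are straightforward to sketch. The data layouts here (an iterable of frontier pages, and a dict mapping each frontier page to the number of links it receives from the current local domain) are assumptions of this sketch.

```python
import numpy as np

def select_random(frontier, k, rng=None):
    """Random baseline: choose k known-but-uncrawled pages uniformly at random."""
    rng = rng or np.random.default_rng()
    frontier = list(frontier)
    return list(rng.choice(frontier, size=min(k, len(frontier)), replace=False))

def select_by_outlink_count(links_from_local, k):
    """Outlink-count baseline: choose the k frontier pages that receive the
    most links from the current local domain."""
    return sorted(links_from_local, key=links_from_local.get, reverse=True)[:k]
```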

  37. Results (Data Sets) • Data sets • Restricted to http pages that do not contain the characters ?, *, @, or = • EDU Data Set • Crawl of the top 100 computer science universities • Yielded 4.7 million pages and 22.9 million links • Politics Data Set • Crawl of the pages under the politics category of the dmoz directory • Yielded 4.4 million pages and 17.2 million links

  38. Results (EDU Data Set) • The normalized difference measures show error, while the Kendall measure shows rank similarity
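
The Kendall measure here compares how similarly two vectors order the same pages (a Kendall's-tau-style rank correlation, where 1.0 means identical ordering). A quick illustration using SciPy; the sample values are made up.

```python
from scipy.stats import kendalltau

# Made-up PageRank values for the same five local pages.
p_true = [0.40, 0.25, 0.15, 0.12, 0.08]
p_est  = [0.35, 0.30, 0.12, 0.15, 0.08]   # ranks two pages in swapped order

tau, _ = kendalltau(p_true, p_est)
print(f"Kendall's tau: {tau:.2f}")        # 1.0 would mean identical ordering
```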

  39. Results (Politics Data Set)

  40. Result Summary • The stochastic complement method outperformed the other methods in nearly every trial • The results are significantly better than the random selection approach, with minimal additional computation

  41. Conclusion • Accurate estimates of the global PageRank can be obtained using mostly local results • Expand the local graph based on influence • Crawl at most O(n) additional pages • Use the stochastic complement to accurately estimate the new PageRank vector • The method is neither computationally nor storage intensive

  42. Estimating the Global PageRank of Web Communities The End Thank You
