Local Approximation of PageRank and Reverse PageRank

Li-Tal Mashiach

Advisor: Dr. Ziv Bar-Yossef

13/03/08


Overview

  • Review of PageRank

  • Local PageRank approximation

  • Algorithm

  • Lower bounds

  • PageRank vs. Reverse PageRank

  • Applications of Reverse PageRank

PageRank

Most search engines analyze the hyperlink structure of the web to order search results

  • PageRank is an important ranking measure for all major search engines


Review of PageRank

[Formula: the PageRank of a page, annotated with the labels "Base rank", "Damping factor", "Sum of the in-neighbors’ ranks", and "Rank divided among all out-neighbors"]
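
The formula on this slide is not present in the transcript. One standard way to write PageRank that matches the four labels is sketched below. The symbols are assumed notation rather than the slide's own: \alpha is the damping factor, n is the number of pages, and outdeg(v) is the number of out-links of v.

```latex
\[
\mathrm{PR}(u) \;=\;
\underbrace{\frac{1-\alpha}{n}}_{\text{base rank}}
\;+\;
\alpha \sum_{v \,:\, v \to u}
\frac{\mathrm{PR}(v)}{\mathrm{outdeg}(v)}
\]
```

Here the sum runs over the in-neighbors of u, and each in-neighbor's rank is divided equally among its out-neighbors.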


PageRank as a Random Walk

  • A random surfer is visiting the web:

    • With some probability, the surfer selects a random out-link of the current page

    • With the remaining probability, the surfer jumps to a random web page


Global PageRank Computation

  • Run the power method

    • Initialize the rank vector

    • Repeat the update step until convergence

  • Challenges:

    • Holding the whole web graph

    • Multiplying a matrix by a vector at every iteration
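
The power method can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name, the damping value 0.85, the uniform starting vector, and the L1 stopping rule are all assumptions, and every node is assumed to have at least one out-link.

```python
def pagerank_power_method(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on a graph given as {node: [out-neighbors]}.

    Assumes every node is a key of out_links and has at least one out-link.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # initialize: uniform vector (assumed)
    for _ in range(max_iter):
        # every node starts the round with the base rank
        new_rank = {v: (1.0 - alpha) / n for v in nodes}
        for v in nodes:
            # v's rank is divided equally among its out-neighbors
            share = alpha * rank[v] / len(out_links[v])
            for w in out_links[v]:
                new_rank[w] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:  # L1 change between rounds (assumed stopping rule)
            break
    return rank


# Toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_method(graph)
```

Each round touches every edge of the graph, which is why the whole graph has to be available to the computation.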


Local PR Approximation

Global PR computation calculates the PR of all pages

Sometimes we are interested in the PR of only a small number of pages

  • A person interested in the PR of their homepage

  • An online business interested in the PR of its own website and its competitors’ websites

Do we need to calculate the PR of the whole graph for that?


Problem Statement [Chen, Gan, Suel, 2004]

  • Given: local access to a directed graph G and a target node u

  • Output: PR(u)

  • Local access: the graph is reached only through queries to a link server

  • Cost: number of queries to the link server
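
The cost model can be illustrated with a toy link server that counts queries. Everything here is hypothetical: the class name, the `query` method, and the choice to return both in-neighbors and out-neighbors per query are assumptions made for the sketch, since the slide does not specify the interface.

```python
class LinkServer:
    """Hypothetical in-memory link server that counts the queries it serves."""

    def __init__(self, edges):
        self._in = {}
        self._out = {}
        for src, dst in edges:
            self._out.setdefault(src, set()).add(dst)
            self._in.setdefault(dst, set()).add(src)
            # make sure both endpoints exist in both maps
            self._out.setdefault(dst, set())
            self._in.setdefault(src, set())
        self.queries = 0

    def query(self, node):
        """Return (in_neighbors, out_neighbors) of node; costs one query."""
        self.queries += 1
        return sorted(self._in[node]), sorted(self._out[node])


server = LinkServer([("a", "u"), ("b", "u"), ("a", "b")])
in_nbrs, out_nbrs = server.query("u")
```

An algorithm's cost in this model is simply the value of `server.queries` when it finishes.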


Another Characterization of PR [Jeh, Widom, 2003]

  • inf_t(v,u): the fraction of the PR score of v that flows to u on paths of length t

[Figure: example graph with nodes v, u1, u2, and u]


Another Characterization of PR [Jeh, Widom, 2003]

  • PR_r(u): the PR score that flows into u from nodes at distance at most r from u

Theorem: [formula not preserved in the transcript]

[Figure: example graph with nodes v, u1, u2, and u]
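
The formulas behind these two definitions are not in the transcript. A form consistent with the definitions above is sketched here, under assumed notation: \alpha is the damping factor, n is the number of nodes, outdeg(w) is the out-degree of w, and the graph is assumed to have no nodes without out-links. This may differ in presentation from the slide's own statement.

```latex
\[
\mathrm{inf}_t(v,u) \;=\;
\sum_{\substack{\text{paths } (w_0, w_1, \dots, w_t) \\ w_0 = v,\; w_t = u}}
\;\prod_{i=0}^{t-1} \frac{1}{\mathrm{outdeg}(w_i)}
\]
\[
\mathrm{PR}_r(u) \;=\; \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^{t} \sum_{v} \mathrm{inf}_t(v,u),
\qquad
\lim_{r \to \infty} \mathrm{PR}_r(u) \;=\; \mathrm{PR}(u)
\]
```

The t-th term decays like \alpha^t, so a moderate radius r already captures most of PR(u) when the walk converges quickly.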


Local PR Brute Force Algorithm [Chen, Gan, Suel, 2004]

  • Goal: calculate PR_r(u) for a sufficiently large r

  • Algorithm:

    • Crawl backwards to collect the sub-graph of radius r around u

    • For each node v at layer t, calculate inf_t(v,u)

    • Sum up the weighted influence values

[Figure: example graph with nodes v, w1, w2, and u]
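
The three steps can be sketched as a layer-by-layer backward crawl. This is an illustrative reading of the slide, not the authors' implementation: the function name, the dictionary-based graph access, the damping value 0.85, and the way queries are counted (one per distinct node whose in-links are requested) are assumptions.

```python
def local_pagerank(in_links, out_degree, u, r, n, alpha=0.85):
    """Approximate PR(u) by PR_r(u) using a backward crawl of radius r.

    in_links[v]   -> list of in-neighbors of v (each lookup stands for a query)
    out_degree[v] -> number of out-links of v
    n             -> total number of nodes in the graph
    Returns (estimate, number_of_distinct_nodes_queried).
    """
    influence = {u: 1.0}  # layer 0: inf_0(u, u) = 1
    estimate = 0.0
    fetched = set()
    for t in range(r + 1):
        # add the weighted influence of every node in layer t
        estimate += (1.0 - alpha) / n * alpha ** t * sum(influence.values())
        if t == r:
            break
        next_layer = {}
        for w, inf_w in influence.items():
            fetched.add(w)  # w's in-links are requested here
            for v in in_links.get(w, []):
                # a path v -> w -> ... -> u extends a path from w to u
                next_layer[v] = next_layer.get(v, 0.0) + inf_w / out_degree[v]
        influence = next_layer
    return estimate, len(fetched)


# Toy graph: a -> b, a -> c, b -> c, c -> a
in_links = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
out_degree = {"a": 2, "b": 1, "c": 1}
estimate, queries = local_pagerank(in_links, out_degree, "c", r=50, n=3)
```

On this three-node toy graph the crawl quickly revisits the same nodes; on a graph with high in-degree nodes each layer can be far larger than the one before it.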


Optimization by Pruning

A heuristic to reduce the cost

Prune all nodes whose influence is below some threshold

Shown empirically to perform better in some cases [Chen, Gan, Suel, 2004]
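
A pruned variant of the backward crawl might look like the sketch below. The pruning rule used here (skip a node at layer t when alpha^t times its influence falls below the threshold) is an assumption; the slide only says that low-influence nodes are pruned. The function name, parameters, and toy graph are likewise hypothetical.

```python
def local_pagerank_pruned(in_links, out_degree, u, r, n,
                          alpha=0.85, threshold=1e-3):
    """Backward crawl that does not expand nodes with low weighted influence.

    Returns (estimate, number_of_queries); a pruned node still contributes
    its own influence but its in-links are never requested.
    """
    influence = {u: 1.0}
    estimate = 0.0
    queries = 0
    for t in range(r + 1):
        estimate += (1.0 - alpha) / n * alpha ** t * sum(influence.values())
        if t == r:
            break
        next_layer = {}
        for w, inf_w in influence.items():
            if alpha ** t * inf_w < threshold:
                continue  # pruned (assumed rule): do not query w
            queries += 1
            for v in in_links.get(w, []):
                next_layer[v] = next_layer.get(v, 0.0) + inf_w / out_degree[v]
        influence = next_layer
    return estimate, queries


# Toy fragment of a larger graph (n is the assumed total size):
# "hub" has 1000 out-links, so its influence on u is tiny.
in_links = {"u": ["hub", "p"], "hub": ["x"], "p": ["y"]}
out_degree = {"hub": 1000, "p": 1, "x": 1, "y": 1}
pruned = local_pagerank_pruned(in_links, out_degree, "u", r=2, n=2000,
                               threshold=0.01)
full = local_pagerank_pruned(in_links, out_degree, "u", r=2, n=2000,
                             threshold=0.0)
```

With a threshold of 0 nothing is pruned, so the same function doubles as the unpruned baseline. Pruning saves queries at the price of underestimating the score.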


Analysis of the Algorithm

  • The number of queries the algorithm requires is bounded in terms of:

    • r: the number of iterations until the PR random walk almost converges

    • d: the maximum in-degree of the graph

  • When PR convergence is slow or the in-degree is high, the algorithm is not feasible


Limitations of the Algorithm

  • In the web graph there are a lot of web pages with high in-degree

  • Conclusion: The algorithm is frequently unsuitable for the web graph

  • Is this a limitation of this specific algorithm only?


Lower Bounds

  • Local PR approximation is hard for graphs with:

    • High in-degree nodes

    • Slow convergence of the PR random walk


Proof

  • By reduction from the OR problem

Input: m bits x_1, x_2, x_3, ..., x_m

Output: the OR of the input bits

Ω(m) queries are needed even for randomized algorithms


The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with target node u]

  • A: an algorithm that calculates local PR

  • B: an algorithm that computes the OR function

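The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with target node u]

Claim 1: Let |x| be the number of 1’s in x. Then, [formula not preserved in the transcript]

Claim 2: When [condition not preserved], [formula not preserved]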


Proof Cont.

  • Given an input x, B simulates A on G_x and u

  • If PR_x(u) ≥ p_1, then OR = 1

  • If PR_x(u) ≤ p_0, then OR = 0

  • This yields a lower bound on the maximum number of queries A uses, derived from the query lower bound for the OR problem


Conclusion

  • Local PageRank approximation is frequently infeasible on the web graph


PageRank vs. Reverse PageRank

  • The local approximation algorithm should perform better on the Reverse Web Graph
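
Reverse PageRank is PageRank computed on the graph obtained by reversing the direction of every link. A minimal sketch of that reversal follows; the function name and the dictionary representation are assumptions. A backward crawl on the reversed graph follows the original out-links, so its cost depends on out-degrees rather than in-degrees.

```python
def reverse_graph(out_links):
    """Return the graph in which every link u -> v is replaced by v -> u."""
    reversed_links = {v: [] for v in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            reversed_links.setdefault(dst, []).append(src)
    return reversed_links


web = {"a": ["b", "c"], "b": ["c"], "c": []}
reverse_web = reverse_graph(web)
```

In the toy graph, page c has in-degree 2 and out-degree 0; after the reversal these two numbers swap.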


Experimental Setup

280,000 page crawl of the www.stanford.edu domain

22,000 page crawl of the www.cnn.com site


Convergence Rate

[Figure: convergence rate]


Crawl Growth Rate

[Figure: crawl growth rate]

In-deg: 38,606; Out-deg: 255


Performance of the Algorithm

[Figure: performance of the algorithm]


Applications of Reverse PageRank

Legend: local RPR application / novel application

TrustRank

Influencers in social networks

Hub web pages

Measuring semantic relatedness

Finding crawl seeds


Influencers in Social Networks

Goal: Market a new product so that it is adopted by a large fraction of a social network

Method:

  • Initially target a few influential members

  • Trigger a word-of-mouth process

  • This results in adoption by a large number of users

How should we choose these seed members?


Why RPR? [Java et al. 2006]

  • Nodes with high RPR

    • Have short paths to many other nodes in the network

    • Are frequently the only gateways to these nodes


Influencers in Social Networks

[Figure: results for a 1-level BFS crawl and a 4-level BFS crawl]

Data: www.Livejournal.com, 3.5 million nodes


Hub Web Pages

Goal: Find good starting points for search

  • Difficult to formulate queries

  • Broad search tasks

  • Need to understand the surrounding context

Method: Find pages with short paths to many relevant pages


Why RPR? [Fogaras, 2003]

  • High RPR pages tend to have short paths to many authorities


Hub Web Pages

Fraction of hubs in the top 20 results for the queries:

1. “computer scientists”

2. “global warming”

3. “folk dancing”

4. “queen Elizabeth”

Meta-search engine over Yahoo! search


Measuring Semantic Relatedness

Goal: Find the relatedness between two concepts

  • For natural language processing applications

Method: Use a taxonomy like the ODP or Wikipedia


Why RPR?

b is a strong sub-concept of a in a taxonomy if

  • there are many short paths from a to b

RPR: a measure of b as a sub-concept of a

RPR similarity: two concepts are similar if their RPR vectors overlap significantly

  • measured as the similarity between the vectors RPR_a and RPR_b
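
The slide does not say which similarity measure is applied to the two RPR vectors. The sketch below uses cosine similarity as one plausible choice; the measure, the function name, the sparse-dictionary representation, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {key: weight}."""
    dot = sum(w * vec_b[k] for k, w in vec_a.items() if k in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector overlaps with nothing
    return dot / (norm_a * norm_b)


# Toy RPR vectors for two concepts (made-up weights, not measured data)
rpr_a = {"physics": 0.5, "relativity": 0.5}
rpr_b = {"physics": 0.5, "internet": 0.5}
similarity = cosine_similarity(rpr_a, rpr_b)
```

Two concepts whose RPR vectors share no sub-concepts score 0, and identical vectors score 1.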


Measuring Semantic Relatedness

[Figure: two panels, Relatedness to “Einstein” and Relatedness to “Computer”, with the concepts Agriculture, Physics Prize, Newton Isaac, and Internet; values shown: 0.6, 0.6, -0.4]

Data: www.dmoz.org taxonomy, WordSimilarity-353


Finding Crawl Seeds

Goal: Quickly discover new content on the web while incurring as little overhead as possible

  • Overhead: old pages / new pages

Method: Find good seeds
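
The two quantities involved can be made concrete with a small sketch. It assumes a specific reading of the slide: a bounded-depth BFS from the seeds, overhead measured as old pages fetched per new page discovered, and coverage measured as the fraction of all new pages reached. The function name and the toy graph are hypothetical.

```python
from collections import deque


def crawl_stats(out_links, seeds, known, depth):
    """BFS from seeds up to `depth` levels.

    Returns (fraction_of_new_pages_discovered, overhead), where overhead is
    old pages fetched divided by new pages discovered (assumed definition).
    """
    fetched = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        page, level = frontier.popleft()
        if level == depth:
            continue  # do not expand beyond the depth limit
        for nxt in out_links.get(page, []):
            if nxt not in fetched:
                fetched.add(nxt)
                frontier.append((nxt, level + 1))
    all_pages = set(out_links) | {w for ws in out_links.values() for w in ws}
    new_total = all_pages - known
    new_found = fetched - known
    old_fetched = fetched & known
    fraction_new = len(new_found) / len(new_total) if new_total else 0.0
    overhead = len(old_fetched) / len(new_found) if new_found else float("inf")
    return fraction_new, overhead


# Toy web: s and o1 were seen in the old crawl; n1, n2, n3 are new pages
web = {"s": ["n1", "o1"], "o1": ["n2"], "n1": [], "n2": [], "n3": []}
known = {"s", "o1"}
shallow = crawl_stats(web, ["s"], known, depth=1)
deeper = crawl_stats(web, ["s"], known, depth=2)
```

A good seed set pushes the first number up while keeping the second one low.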


Why RPR?

  • A page p has high RPR if

    • Many pages are reachable from p by short paths

    • These pages are not reachable from many other pages

[Figure: a known page u and an unknown page v]


Finding Crawl Seeds

[Figure: fraction of new pages discovered and overhead, for a 4-level BFS crawl]

Data: WebBase project, two crawls of ~1,000,000 pages, one week apart


Summary

Two graph properties make local PageRank approximation hard: high in-degree nodes and slow convergence of the PR random walk

The Web graph is not suitable for local PR approximation

The reverse Web graph is suitable for local PR approximation

RPR finds nodes that

  • have short paths to many other nodes

  • are frequently the only gateways to these nodes

Applications of RPR


Thanks!


Appendix


Proof: High In-degree, Deterministic Algorithms

  • By reduction from the majority-by-a-margin problem

Input: x_1, x_2, x_3, ..., x_m

Output: the majority

At least [bound not preserved in the transcript] queries are needed


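The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with nodes W1, W2, ..., Wm, V1, V2, V3 and target node u]

  • A: an algorithm that calculates local PR

  • B: an algorithm that computes majority-by-a-margin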

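The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with nodes W1, W2, ..., Wm, V1, V2, V3 and target node u]

Claim 1: Let |x| be the number of 1’s in x. Then, [formula not preserved in the transcript]

Claim 2: When [condition not preserved], [formula not preserved]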


Proof Cont.

  • Given an input x, B simulates A on G_x and u

  • If PR_x(u) ≥ p_1, then the majority bit of x is 1

  • If PR_x(u) ≤ p_0, then the majority bit of x is 0

  • This yields a lower bound on the maximum number of queries A uses, derived from the query lower bound for the majority-by-a-margin problem


Proof: Slow PR Convergence, Randomized Algorithms

  • By reduction from the OR problem

Input: m bits x_1, x_2, x_3, ..., x_m

Output: the OR of the input bits

Ω(m) queries are needed even for randomized algorithms


The Reduction

[Figure: an m-bit input x (shown as 0, 1, 0, ...) and the graph G_x built from it, with nodes S1, ..., Sm, T and target node u]

  • A: an algorithm that calculates local PR

  • B: an algorithm that computes the OR function

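The Reduction

[Figure: an m-bit input x (shown as 0, 1, 0, ...) and the graph G_x built from it, with nodes S1, ..., Sm, T and target node u]

Claim 1: Let |x| be the number of 1’s in x. Then, [formula not preserved in the transcript]

Claim 2: When [condition not preserved], [formula not preserved]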


Proof Cont.

  • Given an input x, B simulates A on G_x and u

  • If PR_x(u) ≥ p_1, then OR = 1

  • If PR_x(u) ≤ p_0, then OR = 0

  • This yields a lower bound on the maximum number of queries A uses, derived from the query lower bound for the OR problem


Proof: Slow PR Convergence, Deterministic Algorithms

  • By reduction from the majority-by-a-margin problem

Input: x_1, x_2, x_3, ..., x_m

Output: the majority

At least [bound not preserved in the transcript] queries are needed


The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with nodes w1, w2, w3, w4, ..., wm-1, wm and target node u]

  • A: an algorithm that calculates local PR

  • B: an algorithm that computes majority-by-a-margin


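The Reduction

[Figure: an m-bit input x (shown as 1, 1, 0, ...) and the graph G_x built from it, with nodes w1, w2, w3, w4, ..., wm-1, wm and target node u]

Claim 1: Let |x| be the number of 1’s in x. Then, [formula not preserved in the transcript]

Claim 2: When [condition not preserved], [formula not preserved]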

Proof Cont.

  • Given an input x, B simulates A on G_x and u

  • If PR_x(u) ≥ p_1, then the majority bit of x is 1

  • If PR_x(u) ≤ p_0, then the majority bit of x is 0

  • This yields a lower bound on the maximum number of queries A uses, derived from the query lower bound for the majority-by-a-margin problem

