Page Rank
AMCS/CS 340: Data Mining

Xiangliang Zhang

King Abdullah University of Science and Technology

Outline
  • PageRank
    • Introduction
    • Matrix Formation
    • Issues
  • Topic-Specific PageRank
  • Web Spam

PageRank

PageRank is a link analysis algorithm used by the Google Internet search engine.

Google ranks pages based on many variables:

  • Words
  • Page title
  • Domain name ...
  • PageRank: a measure of the importance of a web page

PageRank is what made Google the dominant search engine on the Internet.

Why is it called PageRank?

Story of Google’s PageRank

Because PageRank was developed by Larry Page (hence the name Page-Rank).

Sergey Brin, Larry Page (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine". Proceedings of the 7th international conference on World Wide Web (WWW).


Web search

  • Input: a query string, e.g. "books"
  • A word search engine returns an unordered list of 845,740,000 matches
  • PageRank™ then selects from and orders that list
  • Output: an ordered list of web pages matching the query

Challenges of Search
  • Web is big, how big?

In the October 2010 Netcraft survey, responses were received from 232,839,963 sites (a large number of stale blogs at wordpress.com and 163.com had expired from the survey).

At roughly 273 web pages per website (2005 figure), that gives an estimate of about 63.6 billion pages.

Challenges of Search
  • Web is big, how big?
    • Much duplication (30-40%)
    • Best estimate of “unique” static HTML pages comes from search engine claims
    • Google = 25 billion (?), Yahoo = 20 billion (?)
  • Web pages are not equally “important”
    • e.g., www.joe-schmoe.com vs. www.stanford.edu
    • Inlinks as votes
      • www.stanford.edu has 23,400 inlinks
      • www.joe-schmoe.com has 1 inlink
    • Are all inlinks equal?
    • Recursive question!

From Stanford CS345 Data Mining by Anand Rajaraman and Jeffrey D. Ullman

The Structure of Web
  • What is the structure of the Web?
  • How is it organized?
  • As a graph, specifically a directed graph (pages are nodes, hyperlinks are directed edges)

Good or Bad website?
  • Inlinks are “good” (recommendations)
  • Inlinks from a “good” site are better than inlinks from a “bad” site
  • but inlinks from sites with many outlinks are not as “good”...
  • “Good” and “bad” are relative.

Ranking nodes in the Graph
  • Since there is great diversity in the connectivity of the web graph, we can rank web pages by the link structure
  • Links as votes
    • Each link’s vote is proportional to the importance of its source page
    • If page P with importance x has n out-links, each link gets x/n votes
    • Page P’s own importance is the sum of the votes on its inlinks


Simple “flow” model

The web in 1839: three pages, Yahoo (y), Amazon (a), and Microsoft (m). Yahoo links to itself and to Amazon, Amazon links to Yahoo and to Microsoft, and Microsoft links to Amazon. Each page splits its importance equally among its out-links, giving the flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

Solving the flow equations
  • 3 equations, 3 unknowns, no constants
    • No unique solution
    • All solutions are equivalent up to a scale factor
  • Additional constraint forces uniqueness
    • y+a+m = 1
    • y = 2/5, a = 2/5, m = 1/5
  • Gaussian elimination method works for small examples, but we need a better method for large graphs
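
As a sanity check, here is a minimal NumPy sketch (assuming the 1839 web above: Yahoo links to itself and to Amazon, Amazon to Yahoo and to Microsoft, Microsoft to Amazon) that solves the flow equations together with the constraint y + a + m = 1:

```python
import numpy as np

# Flow equations for the 1839 web (y = Yahoo, a = Amazon, m = Microsoft):
#   y = y/2 + a/2,   a = y/2 + m,   m = a/2
# Written as (M - I) r = 0 plus the normalization y + a + m = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

A = np.vstack([M - np.eye(3),        # three (dependent) flow equations
               np.ones((1, 3))])     # the extra constraint y + a + m = 1
b = np.array([0.0, 0.0, 0.0, 1.0])

r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)   # [0.4 0.4 0.2]  ->  y = 2/5, a = 2/5, m = 1/5
```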

Transforming the Problem by Matrix

The web in 1839: the three flow equations can be written compactly as a single matrix equation, r = Mr, using the hyperlink matrix M defined below.


Matrix Formulation
  • Matrix M has one row and one column for each web page
  • Suppose page j has n out-links
    • If j → i (page j links to page i), then Mij = 1/n
    • Otherwise Mij = 0
  • M is a column stochastic matrix
    • Columns sum to 1

Hyperlink Matrix for the Yahoo (y) / Amazon (a) / Microsoft (m) example:

        y    a    m
   y   1/2  1/2   0
M= a   1/2   0    1
   m    0   1/2   0
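
Building M is mechanical. A small sketch (the helper name and the dict-of-out-links input format are illustrative, not from the slides):

```python
import numpy as np

def hyperlink_matrix(out_links, pages):
    """Column-stochastic hyperlink matrix: M[i, j] = 1/deg(j) if page j links to page i."""
    index = {p: k for k, p in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for j, targets in out_links.items():
        for i in targets:
            M[index[i], index[j]] = 1.0 / len(targets)
    return M

# Running example: y -> {y, a}, a -> {y, m}, m -> {a}
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}
M = hyperlink_matrix(out_links, ['y', 'a', 'm'])
print(M)
print(M.sum(axis=0))   # every column sums to 1
```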

Problem Formulation


  • Suppose r is a vector with one entry per web page
    • ri is the importance score of page i
    • call it the rank vector
    • |r| = 1 (the entries sum to 1)

Page j passes a vote of rj/dj to each of its dj out-links, and the importance of page i is the sum of the votes received on its in-links:

ri = Σ(j → i) rj / dj

Eigenvector formulation
  • The flow equations can be written as
  • r = Mr
  • The rank vector r is an eigenvector of the stochastic hyperlink matrix M
  • In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1

PageRank
  • How do we rank the webpages by finding the principal eigenvector?

Power iteration

Inverse iteration

QR algorithm

Power Iteration

  • Initialize r(0) = (1/N, ..., 1/N)
  • Iterate: r(k+1) = M r(k)
  • Stop when |r(k+1) − r(k)| is smaller than a tolerance
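
A minimal power-iteration sketch in NumPy; on the running 3-page example it reproduces the hand computation shown later in the "Back to Power Iteration Example" slide:

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=1000):
    """Start from the uniform vector and repeatedly apply r <- M r."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:   # L1 change between iterates
            return r_next
        r = r_next
    return r

M = np.array([[0.5, 0.5, 0.0],   # y, a, m as before
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))   # approximately [0.4 0.4 0.2]
```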


Problems

Three questions:

  • Does the sequence rk always converge?
  • Is the vector rk independent of the starting vector r0?
  • Is the vector rk a good ranking?

For the raw hyperlink matrix M, the answer is: No! Three issues have to be dealt with:

  • Dangling nodes (spider traps)
  • The second eigenvalue (convergence)
  • Looping nodes (irreducibility)


Dangling nodes (Spider Trap)

Pages without out-links: Microsoft becomes a spider trap, and power iteration drains the rank of every other page. Successive values (rows y, a, m):

y:  0.5  0.5   0.375  0.31   0.25  0.2         ...  0
a:  0    0.5   0.25   0.25   0.19  0.15  0.13  ...  0
m:  0    0     0.25   0.375  0.5   0.6   0.67  ...  1

The Dangling nodes Matrix

Pages without out-links leave an all-zero column in M. The dangling-node matrix D repairs them: for every dangling page, the corresponding column of D has each entry equal to 1/n, so the surfer jumps to a page chosen uniformly at random.

Dangling nodes

Every entry Sij of S is the probability that a surfer moves from page j to page i; if he or she reaches a dangling node, the next page is chosen uniformly at random.

The Stochastic Matrix

S = M + D

S is column stochastic: 0 ≤ Sij ≤ 1 and each column sums to 1, Σi Sij = 1

A dominant (stationary) eigenvector always exists, and the largest eigenvalue is 1:

Sr = r, λ1 = 1 [Perron-Frobenius]

ri is the probability that the surfer visits page i
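
A sketch of the dangling-node repair described above (assuming M is given as a dense NumPy array with all-zero columns for dangling pages; real implementations keep M sparse):

```python
import numpy as np

def make_stochastic(M):
    """S = M + D: replace every all-zero (dangling) column of M with the uniform column 1/n."""
    n = M.shape[0]
    S = M.astype(float).copy()
    dangling = (S.sum(axis=0) == 0)   # columns of pages with no out-links
    S[:, dangling] = 1.0 / n          # the surfer jumps to a random page from there
    return S

# Toy example: page c has no out-links, so its column in M is all zeros.
M = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = make_stochastic(M)
print(S)
print(S.sum(axis=0))   # [1. 1. 1.]
```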


The second eigenvalue

The rate at which rk converges to r is determined by |λ2|: after k iterations the error is on the order of |λ2|^k.


We are almost done

S must be primitive, i.e. |λ2| < 1, for power iteration to converge to a unique dominant eigenvector.

But S need not be primitive: it can have λ1 = 1 and |λ2| = 1 !!!

We are almost done

A matrix is reducible if it can be placed into block upper/lower-triangular form by simultaneous row/column permutations [Wolfram.com]

S must be irreducible, so that the stationary vector r has all positive entries

But S is, in general, reducible

Google Matrix

G = α S + (1 − α) U / n

Damping factor 0 ≤ α ≤ 1:

α ≈ 1 → slow convergence

α ≈ 0 → fast convergence

Trade-off: α = 0.85

U: matrix of ones (more generally, built from a personalization vector); n: total number of pages

For the Google matrix it has been proven that |λ2| = α (Haveliwala & Kamvar 2003)

G is stochastic and all of its entries are positive, hence it is irreducible and primitive: a dominant eigenvector always exists and all of its coefficients are greater than zero
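
A sketch of the resulting computation. Forming G explicitly is only feasible for tiny examples, since G is dense; the pagerank routine below therefore applies the teleport term as a rank-one correction inside the iteration instead (a standard trick, not stated on the slide):

```python
import numpy as np

def google_matrix(S, alpha=0.85):
    """G = alpha * S + (1 - alpha) * U / n, with U the all-ones matrix (dense!)."""
    n = S.shape[0]
    return alpha * S + (1.0 - alpha) * np.ones((n, n)) / n

def pagerank(S, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on G without forming G: since r sums to 1,
    G r = alpha * S r + (1 - alpha)/n * 1."""
    n = S.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = alpha * (S @ r) + (1.0 - alpha) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

S = np.array([[0.5, 0.5, 0.0],   # the 3-page example, already stochastic
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(np.allclose(google_matrix(S).sum(axis=0), 1.0))   # True: G is column stochastic
print(pagerank(S))                                      # damped ranks, close to (0.4, 0.4, 0.2)
```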



Back to Power Iteration Example

The Yahoo (y) / Amazon (a) / M'soft (m) graph and its hyperlink matrix:

        y    a    m
   y   1/2  1/2   0
M= a   1/2   0    1
   m    0   1/2   0

r = Mr. Starting from the uniform vector, power iteration gives (y, a, m):

(1/3, 1/3, 1/3) → (1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → ... → (2/5, 2/5, 1/5)

Random Walk Interpretation
  • Imagine a random web surfer
    • At any time t, surfer is on some page P
    • At time t+1, the surfer follows an outlink from P uniformly at random
    • Ends up on some page Q linked from P
    • Process repeats indefinitely
  • Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
    • p(t) is a probability distribution on pages

The stationary distribution
  • Where is the surfer at time t+1?
    • Follows a link uniformly at random
    • p(t+1) = M*p(t)
  • Suppose the random walk reaches a state such that p(t+1) = M*p(t) = p(t)
    • Then p(t) is called a stationary distribution for the random walk
  • Our rank vector r satisfies r = Mr
    • So it is a stationary distribution for the random surfer
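
The random-surfer reading can be checked empirically. A small simulation sketch (assuming the 3-page example, which has no dead ends or traps) counts how often each page is visited; the visit frequencies approach the rank vector (2/5, 2/5, 1/5):

```python
import random

# Out-links of the 3-page example: y -> {y, a}, a -> {y, m}, m -> {a}
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

def simulate_surfer(out_links, steps=1_000_000, start='y', seed=0):
    """Follow a uniformly random out-link at every step and record visit frequencies."""
    rng = random.Random(seed)
    visits = {p: 0 for p in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

print(simulate_surfer(out_links))   # roughly {'y': 0.4, 'a': 0.4, 'm': 0.2}
```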

Existence and Uniqueness

A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.

Topic-Specific Page Rank
  • Instead of generic popularity, can we measure popularity within a topic?
    • E.g., computer science, health
  • Bias the random walk
    • Random walker prefers to pick a page from a set S of web pages
    • S contains only pages that are relevant to the topic
    • e.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
  • For each set S, we get a different rank vector rS

Matrix formulation
  • Let

    Aij = β Mij + (1 − β)/|S|   if i ∈ S
    Aij = β Mij                 otherwise

  • A is stochastic
  • We have weighted all pages in set S equally
    • Could also assign different weights to them
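
A sketch of this biased construction (hypothetical helper name); run on the four-node example that follows with S = {node 1} and β = 0.8, it reproduces the stable scores shown there:

```python
import numpy as np

def topic_specific_matrix(M, topic_set, beta=0.8):
    """A[i, j] = beta * M[i, j] + (1 - beta)/|S| if i in S, else beta * M[i, j]."""
    A = beta * np.asarray(M, dtype=float)
    A[list(topic_set), :] += (1.0 - beta) / len(topic_set)
    return A

# Four-node example (0-indexed): node 0 links to 1 and 2, node 1 to 0, nodes 2 and 3 to each other.
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
A = topic_specific_matrix(M, topic_set={0}, beta=0.8)   # S = {node 1} in the slide's numbering

r = np.array([1.0, 0.0, 0.0, 0.0])   # all mass starts on the topic set
for _ in range(200):
    r = A @ r
print(np.round(r, 3))   # [0.294 0.118 0.327 0.261]
```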


Example

Suppose S = {1} and β = 0.8. Four nodes: node 1 links to nodes 2 and 3, node 2 links back to node 1, and nodes 3 and 4 link to each other.

        1    2    3    4
   1    0    1    0    0
M= 2   0.5   0    0    0
   3   0.5   0    0    1
   4    0    0    1    0

        1    2    3    4
   1   0.2   1   0.2  0.2
A= 2   0.4   0    0    0
   3   0.4   0    0   0.8
   4    0    0   0.8   0

Note how the PageRank vector is initialized: all of the mass starts on the teleport set S, which differs from the unbiased PageRank case (uniform start).

Node   Iteration 0    1     2    ...  stable
  1        1.0       0.2   0.52  ...  0.294
  2         0        0.4   0.08  ...  0.118
  3         0        0.4   0.08  ...  0.327
  4         0         0    0.32  ...  0.261


Web Spam
  • Search has become the default gateway to the web
  • Very high premium to appear on the first page of search results
    • e.g., e-commerce sites
    • advertising-driven sites

What is web spam?
  • Spamming
    • any deliberate action to boost a web page’s position in search engine results,
    • incommensurate with the page’s real value
  • Spam
    • web pages that are the result of spamming
  • Approximately 10-15% of web pages are spam

Boosting techniques
  • Term spamming
    • Manipulating the text of web pages in order to appear relevant to queries
      • Repeat one or a few specific terms, e.g., free, cheap
      • Dump a large number of unrelated terms, e.g., copy entire dictionaries
  • Link spamming
    • Creating link structures that boost page rank
      • Get as many links from accessible pages as possible to target page t
      • Construct “link farm” to get page rank multiplier effect

Detecting Spam
  • Term spamming
    • Analyze text using statistical methods e.g., Naïve Bayes classifiers
    • Similar to email spam filtering
    • Also useful: detecting approximate duplicate pages
  • Link spamming
    • Open research area
    • One approach: TrustRank

TrustRank idea
  • Basic principle: approximate isolation
    • It is rare for a “good” page to point to a “bad” (spam) page
  • Sample a set of “seed pages” from the web
  • Have an oracle (human) identify the good pages and the spam pages in the seed set
    • Expensive task, so must make seed set as small as possible
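
The propagation step of TrustRank is usually implemented as a biased PageRank whose teleport set is the oracle-approved seed set; pages that end up with very low trust are spam candidates. A hedged sketch (the matrix and seed set below are made up for illustration):

```python
import numpy as np

def trustrank(M, trusted_seeds, beta=0.85, iters=100):
    """Biased PageRank: the surfer teleports only to oracle-approved seed pages.
    M must be column stochastic; pages with low trust are spam candidates."""
    n = M.shape[0]
    teleport = np.zeros(n)
    teleport[list(trusted_seeds)] = 1.0 / len(trusted_seeds)
    t = teleport.copy()
    for _ in range(iters):
        t = beta * (M @ t) + (1.0 - beta) * teleport
    return t

# Made-up 4-page web; page 3 receives no links from the trusted region.
M = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(trustrank(M, trusted_seeds={0, 1}))   # page 3 ends up with trust 0
```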

References

  • Brin, S. & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine.
  • Eigenstructure of the Google matrix: Haveliwala & Kamvar (2003), Eldén (2003), Serra-Capizzano (2005).
  • Wills, R. Google’s PageRank: The Math Behind the Search Engine.
  • Langville, A. & Meyer, C. Google’s PageRank and Beyond.
  • Austin, D. How Google Finds Your Needle in the Web’s Haystack. AMS.

Wassily Leontief

In 1941, the Russian-American Harvard economist Wassily Leontief published a paper in which he divided a country's economy into sectors that both supply resources to and receive resources from each other, although not in equal measure.

He developed an iterative method of valuing each sector based on the importance of the sectors that supply it. Sound familiar?

In 1973, Leontief was awarded the Nobel Prize in economics for this work.

What you should know
  • How does PageRank work?
  • What are the main issues of using power iteration to get the principal eigenvector?
  • How to solve the issues?
  • How does topic-specific PageRank work?
  • What is Web Spam?
