## AMCS/CS 340: Data Mining


Outline

- PageRank
- Introduction
- Matrix Formation
- Issues
- Topic-Specific PageRank
- Web Spam


Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

PageRank

- PageRank is a link analysis algorithm used by the Google Internet search engine
- Google ranks pages based on many variables: words, page title, domain name, ...
- PageRank: a measure of the importance of a web page
- PageRank is what made Google the dominant search engine
- Why is it called PageRank?

Story of Google’s PageRank

PageRank was developed by Larry Page (hence the name Page-Rank).

Sergey Brin, Larry Page (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine". Proceedings of the 7th International Conference on World Wide Web (WWW).

e.g. ‘books’

- Input: string “books”
- A word search engine alone returns an unordered list of 845,740,000 matches
- PageRank™ then selects and orders them: the output is an ordered list of web pages matching the query

Challenges of Search

- Web is big. How big?
- “In the October 2010 survey we received responses from 232,839,963 sites.” (from Netcraft; a large number of stale blogs at wordpress.com and 163.com were expired from the survey)
- Web pages per website: 273 (2005)
- Estimate: 63.6 billion pages

Challenges of Search

- Web is big. How big?
  - Much duplication (30-40%)
  - Best estimate of “unique” static HTML pages comes from search engine claims: Google = 25 billion (?), Yahoo = 20 billion (?)
- Web pages are not equally “important”
  - e.g., www.joe-schmoe.com vs. www.stanford.edu
- Inlinks as votes
  - www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink
- Are all inlinks equal?
  - Recursive question!

From Stanford CS345 Data Mining by Anand Rajaraman, Jeffrey D. Ullman

Good or Bad website?

- Inlinks are “good” (recommendations)
- Inlinks from a “good” site are better than inlinks from a “bad” site
- but inlinks from sites with many outlinks are not as “good”...
- “Good” and “bad” are relative.


Ranking nodes in the Graph

- Since there is big diversity in the connectivity of the web graph, we can rank web pages by the link structure
- Links as votes
- Each link’s vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P’s own importance is the sum of the votes on its inlinks


Solving the flow equations

- For the y/a/m example: y = y/2 + a/2, a = y/2 + m, m = a/2
- 3 equations, 3 unknowns, no constants
- No unique solution: all solutions are equivalent up to a scale factor
- An additional constraint forces uniqueness: y + a + m = 1
- Solution: y = 2/5, a = 2/5, m = 1/5
- Gaussian elimination works for small examples, but we need a better method for large graphs
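As a sketch of the small-example approach: the constrained system can be handed to a linear solver directly. This uses NumPy; the matrix is the column-stochastic y/a/m hyperlink matrix used in the power-iteration example later in the deck.

```python
import numpy as np

# Column-stochastic link matrix for the y/a/m example:
# y links to y and a; a links to y and m; m links to a.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# r = M r alone has no unique solution (any scaling of r works),
# so replace one redundant equation with the constraint y + a + m = 1.
A = M - np.eye(3)
A[2, :] = 1.0                      # sum constraint in the last row
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)                           # [0.4 0.4 0.2]: y = a = 2/5, m = 1/5
```

This is exactly the Gaussian-elimination route the slide mentions; it works here but does not scale to a web-sized matrix.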

Transforming the Problem by Matrix

The web in 1839

Matrix Formulation

- Matrix M has one row and one column for each web page
- Suppose page j has n out-links
- If j → i, then Mij = 1/n; else Mij = 0
- M is a column-stochastic matrix: columns sum to 1
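Building M from an out-link list can be sketched as follows (the `outlinks` dictionary representation is an illustrative choice, not from the slides):

```python
import numpy as np

def hyperlink_matrix(outlinks, n):
    """M[i, j] = 1/n_j if page j (with n_j out-links) links to page i,
    else 0. Columns of M then sum to 1 for pages with out-links."""
    M = np.zeros((n, n))
    for j, targets in outlinks.items():
        for i in targets:
            M[i, j] = 1.0 / len(targets)
    return M

# y = 0, a = 1, m = 2:  y -> {y, a},  a -> {y, m},  m -> {a}
M = hyperlink_matrix({0: [0, 1], 1: [0, 2], 2: [1]}, 3)
print(M.sum(axis=0))               # [1. 1. 1.]: column stochastic
```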

Problem Formulation

- Suppose r is a vector with one entry per web page
- ri is the importance score of page i
- Call it the rank vector; |r| = 1
- Flow equation: ri = Σj→i rj / dj, where dj is the number of out-links of page j: each page j votes rj/dj to its out-links, and the importance of i is the sum of the votes received on i’s in-links

Eigenvector formulation

- The flow equations can be written r = Mr
- The rank vector r is an eigenvector of the stochastic hyperlink matrix M
- In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1

PageRank

How to rank the web pages by finding the principal eigenvector?

- Power iteration
- Inverse iteration
- QR algorithm
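A minimal power-iteration sketch on the three-page y/a/m example, cross-checked against a direct eigendecomposition (the example matrix is the one used later in the deck; the code itself is illustrative):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],     # y/a/m example, column stochastic
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# Power iteration: repeated multiplication converges to the
# principal eigenvector (eigenvalue 1) of a well-behaved M.
r = np.full(3, 1 / 3)
for _ in range(100):
    r = M @ r
print(r)                           # approx [0.4, 0.4, 0.2]

# Cross-check with a full eigendecomposition (what QR-type methods
# compute); normalize the eigenvector so its entries sum to 1.
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v /= v.sum()
print(np.allclose(r, v))           # True
```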


Problems

3 Questions:

- Does the sequence rk always converge?
- Is the vector rk independent of r0?
- Is the vector rk a good ranking?

No! 3 Ideas:

- Dangling nodes (Spider trap)
- Second eigenvector (Convergence)
- Looping nodes (Irreducible)


Dangling nodes (Spider Trap)

- Pages without outlinks
- Microsoft becomes a spider trap: under power iteration, the trap (bottom row) absorbs all the importance:

0.5, 0.5, 0.375, 0.31, 0.25, 0.2, ..., 0
0, 0.5, 0.25, 0.25, 0.19, 0.15, 0.13, ..., 0
0, 0, 0.25, 0.375, 0.5, 0.6, 0.67, ..., 1

Dangling nodes

The Stochastic Matrix: S = M + D

- Every entry Sij of S is the probability that a surfer goes from page j to page i; at a dangling node, he/she chooses any page at random
- S is stochastic: 0 ≤ Sij ≤ 1, and every column sums to 1 (Σi Sij = 1)
- A dominant (stationary) eigenvector always exists, and the largest eigenvalue is 1: Sr = r, λ1 = 1 [Perron-Frobenius]
- ri is the probability that the surfer visits page i
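The S = M + D repair can be sketched by giving every dangling (all-zero) column a uniform distribution. The example matrix here is the y/a/m web with m’s out-link removed, an illustrative assumption:

```python
import numpy as np

def stochastic_fix(M):
    """S = M + D: each all-zero column (a dangling node) is replaced
    by the uniform column 1/n, so a surfer stuck there jumps to a
    random page. The result S is column stochastic."""
    n = M.shape[0]
    S = M.copy()
    S[:, S.sum(axis=0) == 0] = 1.0 / n
    return S

# y/a/m web, but m has no out-links (third column is all zero).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = stochastic_fix(M)
print(S.sum(axis=0))               # [1. 1. 1.]
```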


The second eigenvalue

The convergence rate of rk → r is determined by |λ2|.


We are almost done

- S must be primitive (|λ2| < 1)
- But S is not primitive: λ1 = 1, λ2 = 1 !!!

We are almost done

A matrix is reducible if it can be placed into block upper/lower-triangular form by simultaneous row/column permutations [Wolfram.com]

S must be irreducible, so that the stationary vector has all positive entries

S is reducible

r

Google Matrix

Damping factor 0 ≤ α ≤ 1

α ≈ 1 convergence slow

α ≈ 0 convergence fast

U: Matrix of ones, personalization vector

n: Total number of pages

Trade-off : α = 0.85.

Google matrix, it has been proven |λ2| = α

G is stochastic, all entries are positive, means irreducible and primitive, a dominant eigenvector always exists, and its coefficients are all greater than zero

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
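A sketch of the Google matrix and its use, with α = 0.85 and U the matrix of ones as above (the 3-page S is an illustrative example, already made stochastic):

```python
import numpy as np

def google_matrix(S, alpha=0.85):
    """G = alpha*S + (1 - alpha)/n * U. Every entry of G is positive,
    so G is irreducible and primitive and power iteration converges."""
    n = S.shape[0]
    return alpha * S + (1 - alpha) / n * np.ones((n, n))

S = np.array([[0.5, 0.5, 1/3],     # stochastic (dangling node fixed)
              [0.5, 0.0, 1/3],
              [0.0, 0.5, 1/3]])
G = google_matrix(S)

r = np.full(3, 1 / 3)
for _ in range(100):
    r = G @ r                      # converges at rate |lambda_2| <= alpha
print(r)
```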


Back to Power Iteration Example

The y/a/m web (Yahoo, Amazon, M’soft). M, with one column per page:

|   | y   | a   | m |
|---|-----|-----|---|
| y | 1/2 | 1/2 | 0 |
| a | 1/2 | 0   | 1 |
| m | 0   | 1/2 | 0 |

Iterating r = Mr from r0 = (1/3, 1/3, 1/3):

(y, a, m) = (1/3, 1/3, 1/3), (1/3, 1/2, 1/6), (5/12, 1/3, 1/4), (3/8, 11/24, 1/6), ..., (2/5, 2/5, 1/5)
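The iterates above can be verified with exact arithmetic (a sketch using Python’s `fractions`):

```python
from fractions import Fraction as F

# Column-stochastic M for the y/a/m web, as on the slide.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

def step(r):
    # One power-iteration step: r <- M r
    return [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

r = [F(1, 3)] * 3
for _ in range(3):
    r = step(r)
print(r)   # [3/8, 11/24, 1/6], the third iterate on the slide
```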

Random Walk Interpretation

- Imagine a random web surfer
- At any time t, surfer is on some page P
- At time t+1, the surfer follows an outlink from P uniformly at random
- Ends up on some page Q linked from P
- Process repeats indefinitely
- Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
- p(t) is a probability distribution on pages


The stationary distribution

- Where is the surfer at time t+1?
- Follows a link uniformly at random
- p(t+1) = M*p(t)
- Suppose the random walk reaches a state such that p(t+1) = M*p(t) = p(t)
- Then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr
- So it is a stationary distribution for the random surfer


Existence and Uniqueness

A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.


Topic-Specific PageRank

- Instead of generic popularity, can we measure popularity within a topic?
- E.g., computer science, health
- Bias the random walk
- Random walker prefers to pick a page from a set S of web pages
- S contains only pages that are relevant to the topic
- e.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
- For each set S, we get a different rank vector rS


Matrix formulation

- Let Aij = βMij + (1 − β)/|S| if i ∈ S, and Aij = βMij otherwise
- A is stochastic
- We have weighted all pages in set S equally
- Could also assign different weights to them

Example

Suppose S = {1}, β = 0.8. The web has four nodes, with links 1→2, 1→3, 2→1, 3→4, 4→3:

M:

|   | 1   | 2 | 3 | 4 |
|---|-----|---|---|---|
| 1 | 0   | 1 | 0 | 0 |
| 2 | 0.5 | 0 | 0 | 0 |
| 3 | 0.5 | 0 | 0 | 1 |
| 4 | 0   | 0 | 1 | 0 |

A:

|   | 1   | 2 | 3   | 4   |
|---|-----|---|-----|-----|
| 1 | 0.2 | 1 | 0.2 | 0.2 |
| 2 | 0.4 | 0 | 0   | 0   |
| 3 | 0.4 | 0 | 0   | 0.8 |
| 4 | 0   | 0 | 0.8 | 0   |

Iteration:

| Node | 0   | 1   | 2    | ... | stable |
|------|-----|-----|------|-----|--------|
| 1    | 1.0 | 0.2 | 0.52 | ... | 0.294  |
| 2    | 0   | 0.4 | 0.08 | ... | 0.118  |
| 3    | 0   | 0.4 | 0.08 | ... | 0.327  |
| 4    | 0   | 0   | 0.32 | ... | 0.261  |

Note how the initialization of the PageRank vector differs from the unbiased PageRank case.
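A sketch reproducing this table (nodes 0-indexed, so page 1 becomes index 0; β = 0.8 as above):

```python
import numpy as np

beta = 0.8
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
S = [0]                             # topic set {page 1}, 0-indexed

A = beta * M
A[S, :] += (1 - beta) / len(S)      # teleport mass goes only to pages in S

r = np.array([1.0, 0.0, 0.0, 0.0])  # biased initialization on S
for _ in range(200):
    r = A @ r
print(np.round(r, 3))               # [0.294 0.118 0.327 0.261]
```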


Web Spam

- Search has become the default gateway to the web
- Very high premium to appear on the first page of search results
- e.g., e-commerce sites
- advertising-driven sites


What is web spam?

- Spamming
- any deliberate action to boost a web page’s position in search engine results,
- incommensurate with page’s real value
- Spam
- web pages that are the result of spamming
- Approximately 10-15% of web pages are spam


Boosting techniques

- Term spamming
- Manipulating the text of web pages in order to appear relevant to queries
- Repeat one or a few specific terms e.g., free, cheap
- Dump a large number of unrelated terms e.g., copy entire dictionaries
- Link spamming
- Creating link structures that boost page rank
- Get as many links from accessible pages as possible to target page t
- Construct “link farm” to get page rank multiplier effect


Detecting Spam

- Term spamming
- Analyze text using statistical methods e.g., Naïve Bayes classifiers
- Similar to email spam filtering
- Also useful: detecting approximate duplicate pages
- Link spamming
- Open research area
- One approach: TrustRank


TrustRank idea

- Basic principle: approximate isolation
- It is rare for a “good” page to point to a “bad” (spam) page
- Sample a set of “seed pages” from the web
- Have an oracle (human) identify the good pages and the spam pages in the seed set
- Expensive task, so must make seed set as small as possible

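TrustRank is commonly formulated as biased PageRank with the teleport distribution concentrated on the trusted seed set, so trust flows out from the good seeds along links. A sketch (the graph and parameters here are illustrative, not from the slides):

```python
import numpy as np

def trustrank(M, seeds, beta=0.85, iters=100):
    """t = beta*M*t + (1 - beta)*d, with d uniform over the trusted
    seed pages: trust propagates out along links from the good seeds."""
    n = M.shape[0]
    d = np.zeros(n)
    d[seeds] = 1.0 / len(seeds)
    t = d.copy()
    for _ in range(iters):
        t = beta * (M @ t) + (1 - beta) * d
    return t

# Illustrative 4-page web; page 0 is the human-verified good seed.
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
t = trustrank(M, seeds=[0])
print(t)    # every page reachable from the seed gets positive trust
```

Pages with little trust flowing to them are then candidates for being spam.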

References

- Sergey Brin, Larry Page (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine."
- Haveliwala & Kamvar (2003); Eldén (2003); Serra-Capizzano (2005). Eigenstructure of the Google Matrix.
- Rebecca Wills. "Google’s PageRank: The Math Behind the Search Engine."
- Amy Langville & Carl Meyer. "Google’s PageRank and Beyond."
- David Austin. "How Google Finds Your Needle in the Web’s Haystack." AMS Feature Column.

Wassily Leontief

In 1941, the Russian-American Harvard economist Wassily Leontief published a paper in which he divided a country's economy into sectors that both supply and receive resources from each other, although not in equal measure.

He developed an iterative method of valuing each sector based on the importance of the sectors that supply it. Sound familiar?

In 1973, Leontief was awarded the Nobel Prize in Economics for this work.

What you should know

- How does PageRank work?
- What are the main issues of using power iteration to get the principal eigenvector?
- How can these issues be solved?
- How does topic-specific PageRank work?
- What is Web Spam?