
CS 277: Data Mining - Mining Web Link Structure

HITS and PageRank; Google

March 27, 2013

- Based on information retrieval
- Boolean / vector model, etc.
- Based purely on 'on-page' factors, i.e., the text of the page

- Results were not very good
- Web doesn't have an editor to control quality
- Web contains deliberately misleading information (SEO)
- Great variety in types of information: Phone books, catalogs, technical reports, slide shows, ...
- Many languages, partial descriptions, jargon, ...

- How to improve the results?

- HITS
- Hubs and authorities

- PageRank
- Iterative computation
- Random-surfer model
- Refinements: Sinks and Hogs

- Google
- How Google worked in 1998
- Google over the years
- SEOs

NEXT

- Many queries are relatively broad
- "cats", "harvard", "iphone", ...

- Consequence: Abundance of results
- There may be thousands or even millions of pages that contain the search term, incl. personal homepages, rants, ...
- IR-type ranking isn't enough; still way too much for a human user to digest
- Need to further refine the ranking!

- Idea: Look for the most authoritative pages
- But how do we tell which pages these are?
- Problem: No endogenous measure of authoritativeness; it's hard to tell just by looking at the page
- Need some 'off-page' factors

- Hyperlinks encode a considerable amount of human judgment
- What does it mean when one web page links to another web page?
- Intra-domain links: Often created primarily for navigation
- Inter-domain links: Confer some measure of authority

- So, can we simply boost the rank of pages with lots of inbound links?

[Diagram: example link graph around the "A-Team" page, with inbound links from pages such as the Yahoo Directory (Team Sports), a Wikipedia "Cheesy TV Shows" page, Mr. T's page, and a Hollywood "Series to Recycle" page.]

- Idea: Give more weight to links from hub pages that point to lots of other authorities
- Mutually reinforcing relationship:
- A good hub is one that points to many good authorities
- A good authority is one that is pointed to by many good hubs

[Diagram: page A (a hub) pointing to page B (an authority); the root set R is expanded into the base set S.]

- Algorithm for a query Q:
  1. Start with a root set R, e.g., the t highest-ranked pages from the IR-style ranking for Q
  2. For each p ∈ R, add all the pages p points to, and up to d pages that point to p. Call the resulting set S.
  3. Assign each page p ∈ S an authority weight x_p and a hub weight y_p; initially, set all weights to be equal and sum to 1
  4. For each p ∈ S, compute new weights x_p and y_p as follows:
     - New x_p := sum of all y_q such that q → p is an inter-domain link
     - New y_p := sum of all x_q such that p → q is an inter-domain link
  5. Normalize the new weights such that both the sum of all the x_p and the sum of all the y_p are 1
  6. Repeat from step 4 until a fixpoint is reached
- If A is the adjacency matrix, the fixpoints are the principal eigenvectors of A^T A and A A^T, respectively
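The iterative update above can be sketched in a few lines of matrix code. This is an illustrative implementation, not a reference one: the adjacency matrix, iteration count, and normalization follow the slide, but all names are mine.

```python
import numpy as np

def hits(adj, iters=50):
    """HITS on the base set S. adj[p][q] = 1 if page p links to page q
    (inter-domain links only, as in the algorithm above)."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    x = np.full(n, 1.0 / n)   # authority weights, equal and summing to 1
    y = np.full(n, 1.0 / n)   # hub weights
    for _ in range(iters):
        x = A.T @ y           # new x_p: sum of y_q over links q -> p
        y = A @ x             # new y_p: sum of x_q over links p -> q
        x /= x.sum()          # renormalize so each weight vector sums to 1
        y /= y.sum()
    return x, y

# Two hub pages (0, 1) each linking to two authority pages (2, 3):
adj = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
x, y = hits(adj)
# x ≈ [0, 0, 0.5, 0.5] (authorities), y ≈ [0.5, 0.5, 0, 0] (hubs)
```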

J. Kleinberg, Authoritative sources in a hyperlinked environment, Proceedings of ACM SODA Conference, 1998.

HITS – Hypertext Induced Topic Selection

Every page u has two distinct measures of merit, its hub score h[u] and its authority score a[u].

Recursive quantitative definitions of hub and authority scores

Relies on query-time processing

To select base set Vq of links for query q constructed by

selecting a sub-graph R from the Web (root set) relevant to the query

selecting any node u which neighbors any r ∈ R via an inbound or outbound edge (expanded set)

To deduce hubs and authorities that exist in a sub-graph of the Web
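The root/base set construction can be sketched as follows; the in-memory link maps, names, and the parent cap d are illustrative assumptions (a real system would query a link index).

```python
def base_set(root, out_links, in_links, d=50):
    """Expand root set R into base set S: add every page a root page
    points to, plus up to d pages pointing to each root page."""
    S = set(root)
    for p in root:
        S.update(out_links.get(p, []))            # children of p
        S.update(list(in_links.get(p, []))[:d])   # up to d parents of p
    return S

# Toy link structure (hypothetical page IDs):
out_links = {1: [2], 2: [5]}
in_links = {1: [3, 4, 6]}
S = base_set({1, 2}, out_links, in_links, d=2)
# S == {1, 2, 3, 4, 5}
```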

[Diagram: a small web graph in which pages 2, 3, and 4 link to page 1, and page 1 links to pages 5, 6, and 7. Then:

h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)]

- Recursive dependency:
- a(v) = Σ h(w), summed over w ∈ pa[v]
- h(v) = Σ a(w), summed over w ∈ ch[v]

- Using linear algebra, we can prove: a(v) and h(v) converge

Find a base subgraph:

- Start with a root set R = {1, 2, 3, 4}, the nodes relevant to the topic
- Expand the root set R to include all the children and a fixed number of parents of nodes in R
- This gives a new set S (the base subgraph)

[Chart: resulting authority and hubness weights for nodes 1 through 15.]

- Improves the ranking based on link structure
- Intuition: Links confer some measure of authority
- Overall ranking is a combination of IR ranking and this

- Based on concept of hubs and authorities
- Hub: Points to many good authorities
- Authority: Is pointed to by many good hubs
- Iterative algorithm to assign hub/authority scores

- Query-specific
- No notion of 'absolute quality' of a page; ranking needs to be computed for each new query

- HITS
- Hubs and authorities

- PageRank
- Iterative computation
- Random-surfer model
- Refinements: Sinks and Hogs

- Google
- How Google worked in 1998
- Google over the years
- SEOs

NEXT

- A technique for estimating page quality
- Based on web link graph, just like HITS
- Like HITS, relies on a fixpoint computation

- Important differences to HITS:
- No hubs/authorities distinction; just a single value per page
- Query-independent

- Results are combined with IR score
- Think of it as: TotalScore = IR score * PageRank
- In practice, search engines use many other factors(for example, Google says it uses more than 200)

- Imagine a contest for The Web's Best Page
- Initially, each page has one vote
- Each page votes for all the pages it has a link to
- To ensure fairness, pages voting for more than one page must split their vote equally between them
- Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
- In practice, it's a little more complicated - but not much!

[Diagram: a small link graph over pages A through J, annotated with two questions: "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]

- Each page i is given a rank x_i
- Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:

  x_i = Σ_j x_j / N_j, summed over every page j that links to i

  where x_j = rank of page j and N_j = number of links out from page j

- How do we compute the rank values? Initialize all ranks to be equal, e.g., x_i = 1/n, and iterate until convergence

[Example: a four-node graph. Node 1 links to node 2; node 2 links to nodes 1 and 4; node 3 links to nodes 2 and 4; node 4 links to nodes 2 and 3. Each link is weighted 1/outdegree (here 1 or 0.5), giving the weight matrix W.]
- Recall r_j = importance of node j:

  r_j = Σ_i w_ij r_i,   i, j = 1, …, n

  e.g., r_2 = 1·r_1 + 0·r_2 + 0.5·r_3 + 0.5·r_4
      = dot product of the r vector with column 2 of W

- Let r = n x 1 vector of importance values for the n nodes
- Let W = n x n matrix of link weights

=> we can rewrite the importance equations as

  r = W^T r

Need to solve the importance equations for the unknown r, with known W:

  r = W^T r

We recognize this as a standard eigenvalue problem, i.e.,

  A r = λ r   (where A = W^T)

with λ = an eigenvalue = 1, and r = the eigenvector corresponding to λ = 1.

Need to solve for r in

  (W^T − λ I) r = 0

Note: W is a stochastic matrix, i.e., rows are non-negative and sum to 1.

Results from linear algebra tell us that:

(a) Since W is a stochastic matrix, W and W^T have the same eigenvalues

(b) The largest of these eigenvalues λ is always 1

(c) r is the eigenvector of W^T corresponding to this largest eigenvalue, λ = 1

Solving for this eigenvector we get

  r = [0.2 0.4 0.133 0.2667]

Results are quite intuitive, e.g., node 2 is "most important"

[Diagram: the four-node example graph again, with its weight matrix W.]
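The fixpoint can be checked numerically with power iteration on this example. The edge set below is reconstructed from the worked equation for r_2 (1→2; 2→1,4; 3→2,4; 4→2,3) and is an assumption where the original diagram was lost.

```python
import numpy as np

# Row-stochastic weight matrix, w_ij = 1/outdegree(i), for the
# four-node example (nodes are 0-indexed in code).
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
])

r = np.full(4, 0.25)     # start from equal importance
for _ in range(200):
    r = W.T @ r          # power iteration: r <- W^T r
# r ≈ [0.2, 0.4, 0.1333, 0.2667], matching the eigenvector above
```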

- Let
- N(p) = number of outgoing links from page p
- B(p) = set of back-links to page p
- Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b))
- Page p's importance is increased by the importance of its back set

- Create an m x m matrix M to capture links:
- M(i, j) = 1/n_j if page j points to page i and page j has n_j outgoing links; 0 otherwise
- Initialize all PageRanks to 1, multiply by M repeatedly until all values converge
- Computes the principal eigenvector via power iteration

[Worked example: a small web graph including Yahoo and Amazon. Running r ← M·r for multiple iterations gives successive rank vectors; the total rank always sums to the number of pages.]

[Worked example: the same graph, but with a 'dead end' (a page with no outlinks) - PageRank is lost after each round, and over many iterations the rank vector decays toward zero.]

[Worked example: the same graph, but where one page (or closed group of pages) has no links leaving it - PageRank cannot flow out and accumulates there, draining rank from the rest of the graph.]

- Remove out-degree 0 nodes (or consider them to refer back to referrer)
- Add decay factor d to deal with sinks
- Typical value: d=0.85

- PageRank has an intuitive basis in random walks on graphs
- Imagine a random surfer, who starts on a random page and, in each step,
- with probability d, clicks on a random link on the page
- with probability 1-d, jumps to a random page (bored?)

- The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page
- Transition matrix can be interpreted as a Markov Chain

[Worked example: with decay factor d = 0.85, each iteration computes

  r ← 0.85 · M r + 0.15 · v   (v = uniform vector)

Running for multiple iterations on the Yahoo/Amazon graph now converges to a well-defined ranking … though does this seem right?]

- Has become a big business
- White-hat techniques
- Google webmaster tools
- Add meta tags to documents, etc.

- Black-hat techniques
- Link farms
- Keyword stuffing, hidden text, meta-tag stuffing, ...
- Spamdexing
- Initial solution: <a rel="nofollow" href="...">...</a>
- Some people started to abuse this to improve their own rankings

- Doorway pages / cloaking
- Special pages just for search engines
- BMW Germany and Ricoh Germany banned in February 2006

- Link buying

- Estimates absolute 'quality' or 'importance' of a given page based on inbound links
- Query-independent
- Can be computed via fixpoint iteration
- Can be interpreted as the fraction of time a 'random surfer' would spend on the page
- Several refinements, e.g., to deal with sinks

- Considered relatively stable
- But vulnerable to black-hat SEO

- An important factor, but not the only one
- Overall ranking is based on many factors (Google: >200)

- Note: This is entirely speculative!

Possible negative on-page factors:
- Links to 'bad neighborhood'
- Keyword stuffing
- Over-optimization
- Hidden content (text has same color as background)
- Automatic redirect/refresh
- ...

Possible positive on-page factors:
- Keyword in title? URL?
- Keyword in domain name?
- Quality of HTML code
- Page freshness
- Rate of change
- ...

Possible negative off-page factors:
- Fast increase in number of inbound links (link buying?)
- Link farming
- Different pages served to users vs. spiders
- Content duplication
- ...

Possible positive off-page factors:
- High PageRank
- Anchor text of inbound links
- Links from authority sites
- Links from well-known sites
- Domain expiration date
- ...

Source: Web Information Systems, Prof. Beat Signer, VU Brussels

- PageRank assumes a “random surfer” who starts at any node and estimates likelihood that the surfer will end up at a particular page
- A more general notion: label propagation
- Take a set of start nodes each with a different label
- Estimate, for every node, the distribution of arrivals from each label
- In essence, captures the relatedness or influence of nodes
- Used in YouTube video matching, schema matching, …

- HITS
- Hubs and authorities

- PageRank
- Iterative computation
- Random-surfer model
- Refinements: Sinks and Hogs

- Google
- How Google worked in 1998
- Google over the years
- SEOs

NEXT

- Focus was on scalability to the size of the Web
- First to really exploit link analysis
- Started as an academic project @ Stanford; became a startup
- Our discussion will be on early Google - today they keep things secret!

- “BigFile” system for storing indices, tables
- Support for 2^64 bytes across multiple drives, filesystems
- Manages its own file descriptors, resources
- This was the predecessor to GFS

- First use: Repository
- Basically, a warehouse of every HTML page (this is the 'cached page' entry), compressed in zlib (faster than bzip)
- Useful for doing additional processing, any necessary rebuilds
- Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
- The repository is indexed (not inverted here)

- One index for looking up documents by DocID
- Done in ISAM (think of this as a B+ Tree without smart re-balancing)
- Index points to repository entries (or to URL entry if not crawled)

- One index for mapping URL to DocID
- Sorted by checksum of URL
- Compute checksum of URL, then perform binary search by checksum
- Allows update by merge with another similar file
- Why is this done?
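The lookup path can be sketched like this; the checksum function and entry layout are illustrative stand-ins, not Google's actual code. Keeping entries sorted by checksum is what makes both binary-search lookup and merge-based batch updates cheap.

```python
import hashlib
from bisect import bisect_left

def checksum(url: str) -> int:
    # Stand-in 64-bit checksum (the real engine used its own function).
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")

# URL index: (checksum, DocID) pairs kept sorted by checksum.
urls = ["a.example", "b.example", "c.example"]
index = sorted((checksum(u), doc_id) for doc_id, u in enumerate(urls))
keys = [k for k, _ in index]

def url_to_docid(url: str):
    k = checksum(url)
    i = bisect_left(keys, k)          # binary search by checksum
    if i < len(keys) and keys[i] == k:
        return index[i][1]
    return None                        # not crawled yet
```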

- The list of searchable words
- (Presumably, today it’s used to suggest alternative words as well)
- The “root” of the inverted index

- As of 1998, 14 million “words”
- Kept in memory (was 256MB)
- Two parts:
- Hash table of pointers to words and the “barrels” (partitions) they fall into
- List of words (null-separated)

- Inverted index divided into “barrels” (partitions by range)
- Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
- Two barrels: short (anchor and title); full (all text)

- Forward index uses the same barrels
- Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs

original tables from

http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm

- Used in inverted and forward indices
- Goal was to minimize the size – the bulk of data is in hit entries
- For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then):

Plain hit (2 bytes): cap: 1 bit | font: 3 bits | position: 12 bits

vs.

Fancy hit (2 bytes): cap: 1 bit | font: 3 bits (set to 7 to flag a fancy hit) | type: 4 bits | position: 8 bits

special-cased to:

Anchor hit (2 bytes): cap: 1 bit | font: 3 bits (= 7) | type: 4 bits | hash: 4 bits | pos: 4 bits
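The layouts above can be packed into 16-bit integers along these lines; this is a hypothetical re-creation of the encoding, not the original code.

```python
def plain_hit(cap: int, font: int, position: int) -> int:
    """cap: 1 bit | font: 3 bits (values 0-6) | position: 12 bits."""
    assert 0 <= font < 7 and 0 <= position < (1 << 12)
    return (cap & 1) << 15 | font << 12 | position

def fancy_hit(cap: int, hit_type: int, position: int) -> int:
    """Font field fixed at 7 to flag a fancy hit;
    type: 4 bits | position: 8 bits."""
    assert 0 <= hit_type < (1 << 4) and 0 <= position < (1 << 8)
    return (cap & 1) << 15 | 7 << 12 | hit_type << 8 | position

h = plain_hit(cap=1, font=2, position=100)
assert h < (1 << 16)        # every hit fits in 2 bytes
```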

- Single URL Server – the coordinator
- A queue that farms out URLs to crawler nodes
- Implemented in Python!

- Crawlers had 300 open connections apiece
- Each needs own DNS cache – DNS lookup is major bottleneck, as we have seen
- Based on asynchronous I/O

- Many caveats in building a “friendly” crawler (remember robot exclusion protocol?)

- Expect the unexpected
- They accidentally crawled an online game
- Huge array of possible errors: Typos in HTML tags, non-ASCII characters, kBs of zeroes in the middle of a tag, HTML tags nested hundreds deep, ...

- Social issues
- Lots of email and phone calls, since most people had not seen a crawler before:
- "Wow, you looked at a lot of pages from my web site. How did you like it?"
- "This page is copy-righted and should not be indexed"
- ...

- Typical of new services deployed "in the wild"
- We had similar experiences with our ePOST system and our measurement study of broadband networks

1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
   - IR score: dot product of count weights and type weights
   - Final rank: IR score combined with PageRank
6. If we're at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
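The doclist scan (finding a document that matches all search terms) can be sketched as a lockstep merge over per-word sorted doclists; the structure and names here are illustrative, not Google's actual code.

```python
def conjunctive_scan(doclists):
    """doclists: one sorted list of DocIDs per query word.
    Yields DocIDs that appear in every list (i.e., match all terms)."""
    pos = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(pos, doclists)):
        current = [dl[p] for p, dl in zip(pos, doclists)]
        hi = max(current)
        if min(current) == hi:                   # same DocID everywhere: match
            yield hi
            pos = [p + 1 for p in pos]
        else:                                    # advance only the lagging lists
            pos = [p + (dl[p] < hi) for p, dl in zip(pos, doclists)]

matches = list(conjunctive_scan([[1, 3, 5, 9], [2, 3, 9], [3, 4, 9, 12]]))
# matches == [3, 9]
```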

- Considers many types of information:
- Position, font size, capitalization
- Anchor text
- PageRank
- Count of occurrences (basically, TF) in a way that tapers off
- (Not clear if they did IDF at the time?)

- Multi-word queries consider proximity as well
- How?

- In 1998:
- 24M web pages
- About 55GB data w/o repository
- About 110GB with repository
- Lexicon 293MB

- Worked quite well with low-end PC
- In 2007: > 27 billion pages, >1.2B queries/day:
- Don’t attempt to include all barrels on every machine!
- e.g., 5+TB repository on special servers separate from index servers

- Many special-purpose indexing services (e.g., images)
- Much greater distribution of data (~500K PCs?), huge net BW
- Advertising needs to be tied in (>1M advertisers in 2007)

- Don’t attempt to include all barrels on every machine!

- August 2001: Search algorithm revamped
- Incorporate additional ranking criteria more easily

- February 2003: Local connectivity analysis
- More weight to links from experts' sites. Google's first patent.

- Summer 2003: Fritz
- Index updated incrementally, rather than in big batches

- June 2005: Personalized results
- Users can let Google mine their own search behavior

- December 2005: Engine update
- Allows for more comprehensive web crawling

Source: http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1

- May 2007: Universal search
- Users can get links to any medium (images, news, books, maps, etc) on the same results page

- December 2009: Real-time search
- Display results from Twitter & blogs as they are posted

- August 2010: Caffeine
- New indexing system; "50 percent fresher results"

- February 2011: Major change to algorithm
- The "Panda update" (revised since; Panda 3.3 in Feb 2012)
- "designed to reduce the rankings of low-quality sites"

- Algorithm is still updated frequently


- Social networks = graphs
- V = set of “actors” (e.g., students in a class)
- E = set of interactions (e.g., collaborations)
- Typically small graphs, e.g., |V| = 10 or 50
- Long history of social network analysis (e.g. at UCI)
- Quantitative data analysis techniques that can automatically extract “structure” or information from graphs
- E.g., who is the most important “actor” in a network?
- E.g., are there clusters in the network?

- Comprehensive reference:
- S. Wasserman and K. Faust, Social Network Analysis, Cambridge University Press, 1994.

- General idea is that some nodes are more important than others in terms of the structure of the graph
- In a directed graph, “in-degree” may be a useful indicator of importance
- e.g., for a citation network among authors (or papers)
- in-degree is the number of citations => “importance”

- However:
- “in-degree” is only a first-order measure in that it implicitly assumes that all edges are of equal importance

- w_ij = weight of the link from node i to node j
- assume Σ_j w_ij = 1 and weights are non-negative
- e.g., default choice: w_ij = 1/outdegree(i)
- more outlinks => less importance attached to each

- Define r_j = importance of node j in a directed graph:

  r_j = Σ_i w_ij r_i,   i, j = 1, …, n

- Importance of a node is a weighted sum of the importance of the nodes that point to it
- Makes intuitive sense
- Leads to a set of recursive linear equations

- Crawl the Web to get nodes (pages) and links (hyperlinks) [highly non-trivial problem!]
- Weights from each page = 1/(# of outlinks)
- Solve for the eigenvector r (for λ = 1) of the weight matrix

Computational problem:
- Solving an eigenvector equation scales as O(n^3)
- For the entire Web graph, n > 10 billion (!!)
- So direct solution is not feasible
- Can use the power method (iterative): r^(k+1) = W^T r^(k) for k = 1, 2, …

Power iteration: r^(k+1) = W^T r^(k)

- Define a suitable starting vector r^(1), e.g., all entries 1/n, or all entries = indegree(node)/|E|, etc.
- Each iteration is a matrix-vector multiplication => O(n^2); problematic?
- No: since W is highly sparse (Web pages have limited outdegree), each iteration is effectively O(n)

For sparse W, the iterations typically converge quite quickly:
- rate of convergence depends on the "spectral gap": how quickly does error(k) = (λ_2/λ_1)^k go to 0 as a function of k?
- if |λ_2| is close to 1 (= λ_1), then convergence is slow
- empirically: for a Web graph with 300 million pages, about 50 iterations to convergence (Brin and Page, 1998)

Discrete-time finite-state first-order Markov chain, K states

Transition matrix A = K x K matrix
- Entry a_ij = P(state_t = j | state_{t-1} = i), i, j = 1, …, K
- Rows sum to 1 (since Σ_j P(state_t = j | state_{t-1} = i) = 1)
- Note that P(state_t | …) only depends on state_{t-1}

P0 = initial state probability = P(state_0 = i), i = 1, …, K

[Example: K = 3. State 1 has self-loop probability 0.8 and moves to state 2 with probability 0.2; state 2 has self-loop 0.9 and moves to state 1 with 0.1; state 3 has self-loop 0.6 and moves to states 1 and 2 with 0.2 each:

  A = [[0.8, 0.2, 0.0],
       [0.1, 0.9, 0.0],
       [0.2, 0.2, 0.6]]

  P0 = [1/3 1/3 1/3]]

Irreducibility:

- A Markov chain is irreducible if there is a directed path from any node to any other node

Steady-state distribution π for an irreducible Markov chain*:

- π_i = probability that, in the long run, the chain is in state i
- The π's are the solutions to π = A^T π
- Note that this is exactly the same as our earlier recursive equations for node importance in a graph!

*Note: technically, for a meaningful solution π to exist, A must be both irreducible and aperiodic

- W is a stochastic matrix (rows sum to 1) by definition
- can interpret W as defining the transition probabilities in a Markov chain
- w_ij = probability of transitioning from node i to node j

- Markov chain interpretation: r = W^T r
- these are the solutions of the steady-state probabilities for a Markov chain

page importance = steady-state Markov probabilities = eigenvector

- Recall that for the Web model, we set w_ij = 1/outdegree(i)
- Thus, in using W for computing importance of Web pages, this is equivalent to a model where:
- We have a random surfer who surfs the Web for an infinitely long time
- At each page, the surfer randomly selects an outlink to the next page
- "importance" of a page = fraction of visits the surfer makes to that page
- This is intuitive: pages that have better connectivity will be visited more often

[Diagram: a four-page graph.]

- Page 1 is a "sink" (no outlink)
- Pages 3 and 4 are also "sinks" (no outlink from the system)
- Markov chain theory tells us that no steady-state solution exists
- depending on where you start, you will end up at 1 or at {3, 4}
- The Markov chain is "reducible"

- One simple solution to our problem is to modify the Markov chain:
- With probability α, the random surfer jumps to any random page in the system (with probability 1/n each, conditioned on such a jump)
- With probability 1−α, the random surfer selects an outlink (randomly from the set of available outlinks)

- The resulting transition graph is fully connected => the Markov system is irreducible => steady-state solutions exist
- Typically α is chosen between 0.1 and 0.2 in practice
- But now the graph is dense! However, the power iterations can be written as:

  r^(k+1) = (1−α) W^T r^(k) + (α/n) 1

- Complexity is still O(n) per iteration for sparse W
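The modified power iteration is a one-liner per step. A minimal dense sketch (a production version would keep W sparse, as the slide notes):

```python
import numpy as np

def pagerank(W, alpha=0.15, iters=100):
    """r <- (1 - alpha) * W^T r + (alpha / n) * 1, as in the update above.
    W: row-stochastic link matrix with w_ij = 1/outdegree(i)."""
    n = W.shape[0]
    r = np.full(n, 1.0 / n)       # start from the uniform distribution
    for _ in range(iters):
        r = (1 - alpha) * (W.T @ r) + alpha / n
    return r

# Two pages linking to each other: the steady state is uniform.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = pagerank(W)
# r ≈ [0.5, 0.5]
```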

- S. Brin and L. Page, The anatomy of a large-scale hypertextual search engine, in Proceedings of the 7th WWW Conference, 1998.
- PageRank = the method on the previous slide, applied to the entire Web graph
- Crawl the Web
- Store both connectivity and content

- Calculate (off-line) the “pagerank” r for each Web page using the power iteration method

- How can this be used to answer Web queries:
- Terms in the search query are used to limit the set of pages of possible interest
- Pages are then ordered for the user via precomputed pageranks
- The Google search engine combines r with text-based measures
- This was the first demonstration that link information could be used for content-based search on the Web

- PageRank algorithm was the first algorithm for link-based search
- Many extensions and improvements since then
- See papers on class Web page

- Same idea used in social networks for determining importance

- Real-world search involves many other aspects besides PageRank
- E.g., use of logistic regression for ranking
- Learns how to predict relevance of page (represented by bag of words) relative to a query, using historical click data
- See paper by Joachims on class Web page

- Additional slides (optional)
- HITS algorithm, Kleinberg, 1998

- “rich get richer” syndrome
- not as “democratic” as originally (nobly) claimed
- certainly not 1 vote per “WWW citizen”

- also: crawling frequency tends to be based on pagerank
- for detailed grumblings, see www.google-watch.org, etc.

- not query-sensitive
- random walk same regardless of query topic
- whereas real random surfer has some topic interests
- non-uniform jumping vector needed
- would enable personalization (but requires faster eigenvector convergence)
- Topic of ongoing research

- ad hoc mix of PageRank & keyword match score
- done in two steps for efficiency, not quality motivations

- e.g. [Ng & Zheng & Jordan, IJCAI-01 & SIGIR-01]
- HITS can be very sensitive to change in small fraction of nodes/edges in link structure
- PageRank much more stable, due to random jumps
- propose HITS as bidirectional random walk
- with probability d, randomly (p=1/n) jump to a node
- with probability 1-d:
- odd timestep: take random outlink from current node
- even timestep: go backward on random inlink of node

- this HITS variant seems much more stable as d increased
- issue: tuning d (d=1 most stable but useless for ranking)

[Figure: stability comparison - rankings produced by HITS vs. PageRank after randomly deleting 30% of papers.]