# Page Link Analysis and Anchor Text for Web Search (Lecture 9)

Many slides based on lectures by Chen Li (UCI) and Raymond Mooney (UTexas)

PageRank
• Intuition:
• The importance of each page should be determined by what other pages “say” about it
• One naïve implementation: count the number of pages pointing to each page (i.e., the number of in-links)
• Problem:
• We can easily fool this technique by generating many dummy pages that point to our class page
Initial PageRank Idea
• Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link.
• Initial PageRank equation for page p: R(p) = c · Σq→p R(q)/Nq
• Nq is the total number of out-links from page q.
• A page q “gives” an equal fraction of its authority to all the pages it points to (e.g., p).
• c is a normalizing constant set so that the ranks of all pages always sum to 1.

[Figure: a small web graph with example PageRank values (.08, .05, .03) on its pages]

Initial PageRank Idea (cont.)
• Can view it as a process of PageRank “flowing” from pages to the pages they cite.

[Figure: rank values (.1, .09) flowing from pages to the pages they cite]

Initial Algorithm
• Iterate the rank-flowing process until convergence:

Let S be the total set of pages.

Initialize ∀p ∈ S: R(p) = 1/|S|

Until ranks do not change (much) (convergence):

For each p ∈ S: R′(p) = Σq→p R(q)/Nq

For each p ∈ S: R(p) = c·R′(p) (normalize so that Σp R(p) = 1)

Sample Stable Fixpoint

[Figure: example graphs whose ranks (0.2, 0.4) reproduce themselves under the update, i.e., stable fixpoints of the flow]

Linear Algebra Version
• Treat R as a vector over web pages.
• Let A be a 2-d matrix over pages where
• Avu = 1/Nu if u → v, else Avu = 0
• Then R = cAR
• R converges to the principal eigenvector of A.
Example: MiniWeb
• Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft.
• Their weights are represented as a vector

[Figure: the MiniWeb link graph over Ne, Am, and MS]

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

Materials by courtesy of Jeff Ullman

Iterative computation

[Table: successive weight vectors (Ne, Am, MS) from each iteration]

Final result:

• Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.
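This result can be checked numerically. A sketch, with the MiniWeb link structure (Ne → Ne, Am; Am → Ne, MS; MS → Am) inferred from the slides' description of the example:

```python
# Column-stochastic link matrix M over [Ne, Am, MS]; column q gives 1/N_q
# of q's weight to each page q links to.  Links assumed: Ne -> {Ne, Am},
# Am -> {Ne, MS}, MS -> {Am}.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    # one iteration: r <- M r
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

# r approaches (2/5, 2/5, 1/5): Ne and Am end up equal, each twice MS.
```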


Observations
• We cannot get absolute weights:
• We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (each column sums to 1), so the iterations converge to the principal eigenvector of the link matrix M, i.e., the solution of R = MR.
Problem 1 of algorithm: dead ends!
• MS does not point to anybody
• Result: weights of the Web “leak out”

[Figure: the MiniWeb graph with MS having no out-links]

Problem 2 of algorithm: spider traps
• MS only points to itself
• Result: all weights go to MS!

[Figure: the MiniWeb graph with MS linking only to itself]

Google’s solution: “tax each page”
• Like people paying taxes, each page pays some of its weight into a public pool, which is distributed evenly to all pages.
• With tax rate t, the update becomes: R(p) = (1 − t) · Σq→p R(q)/Nq + t/|S|
• Example: assume a 20% tax rate in the “spider trap” example.
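A sketch of the taxed update applied to the spider-trap version of MiniWeb. The link structure is assumed from the slides (Ne → Ne, Am; Am → Ne, MS; MS → only itself):

```python
# Spider-trap MiniWeb; link structure assumed from the slides.
links = {
    "Ne": ["Ne", "Am"],
    "Am": ["Ne", "MS"],
    "MS": ["MS"],          # the trap: MS points only to itself
}
tax = 0.2
pages = list(links)
R = {p: 1.0 / len(pages) for p in pages}

for _ in range(100):
    # each page keeps an equal share of the taxed pool, plus (1 - tax)
    # of the weight flowing in along links
    R_new = {p: tax / len(pages) for p in pages}
    for q, outs in links.items():
        for p in outs:
            R_new[p] += (1 - tax) * R[q] / len(outs)
    R = R_new

# Converges to Ne = 7/33, Am = 5/33, MS = 21/33: MS is still heaviest,
# but no longer absorbs all the weight.
```

Without the tax term, iterating the same flow drives all weight into MS.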
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Ask.com, Microsoft, Yahoo!, etc.
HITS
• Algorithm developed by Kleinberg in 1998.
• Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
• Based on mutually recursive facts:
• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.
Hubs and Authorities
• Motivation: find web pages relevant to a topic
• E.g.: “find all web sites about automobiles”
• “Authority”: a page that offers info about a topic
• E.g.: DBLP is a page about papers
• E.g.: google.com, aj.com, teoma.com, lycos.com
• “Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
• E.g.: www.searchenginewatch.com is a hub of search engines
• http://www.ics.uci.edu/~ics214a/ points to many biblio-search engines
The hope

[Figure: hubs pointing to authorities, e.g. for the topic “long distance telephone companies”]

Base set
• Given text query (say browser), use a text index to get all pages containing browser.
• Call this the root set of pages.
• Add in any page that either
• points to a page in the root set, or
• is pointed to by a page in the root set.
• Call this the base set.
Visualization

[Figure: the base set enclosing the root set]

Assembling the base set
• Root set typically 200-1000 nodes.
• Base set may have up to 5000 nodes.
• How do you find the base set nodes?
• Follow out-links by parsing root set pages.
• Get in-links (and out-links) from a connectivity server.
• (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)
Distilling hubs and authorities
• Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
• Initialize: for all x, h(x) ← 1; a(x) ← 1
• Iteratively update all h(x), a(x)
• After iterating:
• output pages with highest h() scores as top hubs
• highest a() scores as top authorities.

Iterative update
• Repeat the following updates, for all x:
• a(x) ← Σy→x h(y)
• h(x) ← Σx→y a(y)

Scaling
• To prevent the h() and a() values from getting too big, can scale down after each iteration.
• Scaling factor doesn’t really matter:
• we only care about the relative values of the scores.
How many iterations?
• Claim: relative values of scores will converge after a few iterations:
• in fact, suitably scaled, h() and a() scores settle into a steady state!
• proof of this comes later.
• We only require the relative orders of the h() and a() scores - not their absolute values.
• In practice, ~5 iterations get you close to stability.
Iterative Algorithm
• Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities.
• Maintain for each page p ∈ S:
• Authority score: ap (vector a)
• Hub score: hp (vector h)
• Initialize all ap = hp = 1
• Maintain normalized scores: Σp ap² = 1 and Σp hp² = 1
HITS Update Rules
• Authorities are pointed to by lots of good hubs: ap = Σq→p hq
• Hubs point to lots of good authorities: hp = Σp→q aq
Illustrated Update Rules

[Figure: pages 1, 2, 3 each point to page 4, and page 4 points to pages 5, 6, 7]

a4 = h1 + h2 + h3

h4 = a5 + a6 + a7

HITS Iterative Algorithm

Initialize for all p ∈ S: ap = hp = 1

For i = 1 to k:

For all p ∈ S: ap = Σq→p hq (update auth. scores)

For all p ∈ S: hp = Σp→q aq (update hub scores)

For all p ∈ S: ap = ap/c, where c is chosen so that Σp ap² = 1 (normalize a)

For all p ∈ S: hp = hp/c, where c is chosen so that Σp hp² = 1 (normalize h)
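The loop can be sketched in Python on the small graph from the illustrated update rules (pages 1, 2, 3 point to 4, and 4 points to 5, 6, 7); the variable names are ours:

```python
from math import sqrt

# Made-up graph from the illustration: 1, 2, 3 -> 4 and 4 -> 5, 6, 7.
out_links = {1: [4], 2: [4], 3: [4], 4: [5, 6, 7], 5: [], 6: [], 7: []}
pages = list(out_links)
in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}

h = {p: 1.0 for p in pages}
a = {p: 1.0 for p in pages}

for _ in range(20):
    a = {p: sum(h[q] for q in in_links[p]) for p in pages}   # auths from hubs
    h = {p: sum(a[q] for q in out_links[p]) for p in pages}  # hubs from auths
    ca = sqrt(sum(v * v for v in a.values()))                # normalize a
    ch = sqrt(sum(v * v for v in h.values()))                # normalize h
    a = {p: v / ca for p, v in a.items()}
    h = {p: v / ch for p, v in h.items()}

# Page 4 is the top authority (pointed to by hubs 1-3) and also a strong hub.
```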

Example: MiniWeb

[Figure: hub and authority scores for Ne, Am, and MS under the HITS updates, with normalization after each iteration]

Convergence
• The algorithm converges to a fixed point if iterated indefinitely.
• Define A to be the adjacency matrix for the subgraph defined by S:
• Aij = 1 for i ∈ S, j ∈ S iff i → j; otherwise Aij = 0
• The authority vector a converges to the principal eigenvector of AᵀA
• The hub vector h converges to the principal eigenvector of AAᵀ
• In practice, 20 iterations produce fairly stable results.
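A numerical check of this claim on a tiny made-up adjacency matrix (a sketch, not lecture code): after iterating, a should satisfy the eigenvector property (AᵀA)a = λa.

```python
from math import sqrt

# A[i][j] = 1 iff page i links to page j (an invented 3-page graph).
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
n = len(A)
AT = [[A[j][i] for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

a = [1.0] * n
h = [1.0] * n
for _ in range(100):
    a = matvec(AT, h)                           # a(p) = sum of h over in-links
    h = matvec(A, a)                            # h(p) = sum of a over out-links
    na = sqrt(sum(x * x for x in a)); a = [x / na for x in a]
    nh = sqrt(sum(x * x for x in h)); h = [x / nh for x in h]

# At convergence, (A^T A) a is parallel to a; the Rayleigh quotient gives
# the top eigenvalue of A^T A.
ATAa = matvec(AT, matvec(A, a))
lam = sum(x * y for x, y in zip(ATAa, a))
```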
Results
• Authorities for query: “Java”
• java.sun.com
• comp.lang.java FAQ
• Authorities for query “search engine”
• Yahoo.com
• Excite.com
• Lycos.com
• Altavista.com
• Authorities for query “Gates”
• Microsoft.com
• roadahead.com
Result Comments
• In most cases, the final authorities were not in the initial root set generated using Altavista.
• Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
Comparison

Pagerank
• Pros:
• Hard to spam
• Computes quality signal for all pages
• Cons:
• Non-trivial to compute
• Not query specific
• Doesn’t work on small graphs
• Proven to be effective for general-purpose ranking

HITS & Variants
• Pros:
• Easy to compute
• Query specific
• Works on small graphs
• Cons:
• Real-time execution is hard [Bhar98b, Stat00]
• Local graph structure can be manufactured (spam!)
• Provides a signal only when there’s direct connectivity (e.g., home pages)
• Well suited for supervised directory construction
Tag/position heuristics
• Increase weights of terms
• in titles
• in tags
• near the beginning of the doc, its chapters and sections
Anchor text (first used in the WWW Worm - McBryan [Mcbr94])

[Figure: several pages linking to a tiger page with anchor texts “Tiger image”, “Here is a great picture of a tiger”, and “Cool tiger webpage”]

The text in the vicinity of a hyperlink is descriptive of the page it points to.

Two uses of anchor text
• When indexing a page, also index the anchor text of links pointing to it.
• Retrieve a page when query matches its anchor text.
• To weight links in the hubs/authorities algorithm.
• Anchor text usually taken to be a window of 6-8 words around a link anchor.
Indexing anchor text
• When indexing a document D, include anchor text from links pointing to D.

[Figure: www.ibm.com with in-links whose anchor texts read “Armonk, NY-based computer giant IBM announced today”, “Big Blue today announced record profits for the quarter”, and “Joe’s computer hardware links: Compaq, HP, IBM”]
Indexing anchor text
• Can sometimes have unexpected side effects - e.g., “evil empire”.
• Can index anchor text with less weight.
Weighting links
• In hub/authority link analysis, can match anchor text to query, then weight link.
Weighting links
• What is w(x,y)?
• Should increase with the number of query terms in anchor text.
• E.g.: 1+ number of query terms.

[Figure: page x links to y = www.ibm.com with anchor text “Armonk, NY-based computer giant IBM announced today”; the weight of this link for the query computer is 2]
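The rule above fits in a one-line function (the name link_weight and the tokenization are ours):

```python
def link_weight(query_terms, anchor_text):
    """w(x, y) = 1 + number of query terms appearing in the anchor text."""
    words = anchor_text.lower().split()
    return 1 + sum(1 for t in query_terms if t.lower() in words)

# The IBM example: one query term ("computer") occurs in the anchor text,
# so the link weight is 2.
w = link_weight(["computer"], "Armonk, NY-based computer giant IBM announced today")
```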

Weighted hub/authority computation
• Recall basic algorithm:
• Iteratively update all h(x), a(x);
• After iteration, output pages with
• highest h() scores as top hubs
• highest a() scores as top authorities.
• Now use weights in iteration.
• Raises scores of pages with “heavy” links.

Do we still have convergence of scores? To what?

Anchor Text
• Other applications
• Weighting/filtering links in the graph
• HITS [Chak98], Hilltop [Bhar01]
• Generating page descriptions from anchor text [Amit98, Amit00]

### Behavior-based ranking

Behavior-based ranking
• For each query Q, keep track of which docs in the results are clicked on
• On subsequent requests for Q, re-order docs in results based on click-throughs
• First due to DirectHit (later part of AskJeeves)
• Relevance assessment based on
• Behavior/usage
• vs. content
Query-doc popularity matrix B

[Figure: matrix B with queries as rows and docs as columns, entry Bqj highlighted at row q, column j]

Bqj = number of times doc j was clicked through on query q

When query q is issued again, order docs by Bqj values.
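A minimal sketch of this scheme, storing B sparsely; the function names, queries, and documents are invented for illustration:

```python
from collections import defaultdict

B = defaultdict(int)                 # B[(q, j)] = clicks on doc j for query q

def record_click(q, j):
    B[(q, j)] += 1

def reorder(q, result_docs):
    # Sort by descending click count; sorted() is stable, so ties keep
    # the original (text-score) order.
    return sorted(result_docs, key=lambda j: -B[(q, j)])

record_click("ferrari mondial", "doc2")
record_click("ferrari mondial", "doc2")
record_click("ferrari mondial", "doc3")
ranked = reorder("ferrari mondial", ["doc1", "doc2", "doc3"])
# ranked == ["doc2", "doc3", "doc1"]
```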

Issues to consider
• Weighing/combining text- and click-based scores.
• What identifies a query?
• Ferrari Mondial
• Ferrari mondial
• ferrari mondial
• “Ferrari Mondial”
• Can use heuristics, but they slow down query parsing.
Vector space implementation
• Maintain a term-doc popularity matrix C
• as opposed to query-doc popularity
• initialized to all zeros
• Each column represents a doc j
• If doc j is clicked on for query q, update Cj ← Cj + q (here q is viewed as a term vector).
• On a query q’, compute its cosine proximity to Cj for all j.
• Combine this with the regular text score.
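A hedged sketch of this vector-space variant; the helper names, queries, and documents are invented, and the final combination with the text score is left out:

```python
import math
from collections import defaultdict

C = defaultdict(lambda: defaultdict(float))   # C[doc][term], all zeros initially

def query_vector(q):
    v = defaultdict(float)
    for term in q.lower().split():
        v[term] += 1.0
    return v

def record_click(q, doc):
    # C_j <- C_j + q, with the query q viewed as a term vector
    for term, w in query_vector(q).items():
        C[doc][term] += w

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

record_click("white house", "d1")
record_click("white christmas", "d2")
# On a new query, score each doc by cosine proximity to its column of C.
scores = {doc: cosine(query_vector("white house"), C[doc]) for doc in C}
# scores["d1"] == 1.0, scores["d2"] == 0.5
```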
Issues
• Normalization of Cj after updating
• Assumption of query compositionality
• “white house” document popularity derived from “white” and “house”
• Updating - live or batch?
Basic Assumption
• Relevance can be directly measured by number of click throughs
• Valid?
Validity of Basic Assumption
• Click through to docs that turn out to be non-relevant: what does a click mean?
• Self-perpetuating ranking
• Spam
• All votes count the same
Variants
• Time spent viewing page
• Difficult session management
• Inconclusive modeling so far
• Does user back out of page?
• Does user stop searching?
• Does user transact?