
Page Link Analysis and Anchor Text for Web Search (Lecture 9)

Many slides based on lectures by Chen Li (UCI) and Raymond Mooney (UTexas)

PageRank
  • Intuition:
    • The importance of each page should be decided by what other pages “say” about this page
    • One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
  • Problem:
    • We can easily fool this technique by generating many dummy pages that point to our class page
Initial PageRank Idea
  • Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link.
  • Initial page rank equation for page p:

    R(p) = c · Σ_{q: q → p} R(q) / N_q

    • N_q is the total number of out-links from page q.
    • A page q “gives” an equal fraction of its authority to all the pages it points to (e.g., p).
    • c is a normalizing constant set so that the rank of all pages always sums to 1.
Initial PageRank Idea (cont.)
  • Can view it as a process of PageRank “flowing” from pages to the pages they cite.

[Figure: a small web graph with rank values (.03 to .1) flowing along the links from page to page.]
Initial Algorithm
  • Iterate rank-flowing process until convergence:

Let S be the total set of pages.

Initialize ∀p ∈ S: R(p) = 1/|S|

Until ranks do not change (much) (convergence):

    For each p ∈ S: R′(p) = Σ_{q: q → p} R(q) / N_q

    For each p ∈ S: R(p) = cR′(p)   (normalize so that Σ_p R(p) = 1)

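A minimal runnable sketch of this rank-flowing iteration, in Python. The dict-of-lists graph representation, the example graph, and the convergence threshold are illustrative assumptions, not part of the lecture:

```python
# Sketch of the iterate-until-convergence PageRank above.
# `out_links` maps each page to the list of pages it points to;
# every page in S must appear as a key.

def pagerank(out_links, tol=1e-8, max_iters=100):
    pages = list(out_links)
    R = {p: 1.0 / len(pages) for p in pages}          # R(p) = 1/|S|
    for _ in range(max_iters):
        R_new = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:                          # q gives R(q)/Nq to each p
                R_new[p] += R[q] / len(targets)
        c = 1.0 / sum(R_new.values())                  # normalize: ranks sum to 1
        R_new = {p: c * r for p, r in R_new.items()}
        if max(abs(R_new[p] - R[p]) for p in pages) < tol:
            break
        R = R_new
    return R_new

# Hypothetical four-page graph: D points into the graph but has no in-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))   # -> roughly A: 0.4, B: 0.2, C: 0.4, D: 0.0
```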
Sample Stable Fixpoint

[Figure: a small web graph whose ranks no longer change under the iteration; the more heavily cited pages hold rank 0.4 and the rest 0.2.]
Linear Algebra Version
  • Treat R as a vector over web pages.
  • Let A be a 2-d matrix over pages where
    • A_vu = 1/N_u if u → v, else A_vu = 0
  • Then R = cAR
  • R converges to the principal eigenvector of A.
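To check that claim numerically, here is a small numpy sketch; the example matrix is an assumption (it encodes the three-page MiniWeb introduced on the next slide):

```python
import numpy as np

# A[v, u] = 1/N_u if u -> v, else 0 (columns sum to 1 here, so c = 1).
A = np.array([[0.5, 0.5, 0.0],    # Ne <- Ne, Am
              [0.5, 0.0, 1.0],    # Am <- Ne, MS
              [0.0, 0.5, 0.0]])   # MS <- Am

R = np.ones(3) / 3
for _ in range(100):              # power iteration: R = cAR
    R = A @ R

eigvals, eigvecs = np.linalg.eig(A)
v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
v = v / v.sum()                   # rescale the principal eigenvector

print(R)   # -> [0.4 0.4 0.2]
print(v)   # -> the same vector, up to numerical error
```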
Example: MiniWeb
  • Our “MiniWeb” has only three web sites: Netscape (Ne), Amazon (Am), and Microsoft (MS).
  • Their weights are represented as a vector.

[Figure: the link graph over Ne, Am, and MS.]

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

Materials by courtesy of Jeff Ullman
Iterative computation

[Figure: the weight vector (Ne, Am, MS) over successive iterations, converging to the final result.]

Final result:
  • Netscape and Amazon have the same importance, and twice the importance of Microsoft.
  • Does it capture the intuition? Yes.
Observations
  • We cannot get absolute weights:
    • We can only know (and we are only interested in) the relative weights of the pages
  • The matrix is stochastic (the sum of each column is 1). So the iterations converge and compute the principal eigenvector of the matrix equation R = AR.
Problem 1 of algorithm: dead ends!
  • MS does not point to anybody
  • Result: weights of the Web “leak out”

[Figure: the MiniWeb with MS's out-link removed, so rank flowing into MS is lost.]
Problem 2 of algorithm: spider traps
  • MS only points to itself
  • Result: all weights go to MS!

[Figure: the MiniWeb with MS linking only to itself, so it absorbs all the rank.]
Google’s solution: “tax each page”
  • Like people paying taxes, each page pays some weight into a public pool, which will be distributed to all pages.
  • Example: assume 20% tax rate in the “spider trap” example.
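A hedged sketch of the tax fix: each page keeps a (1 − β) share of the rank that flows to it, and the taxed share β is redistributed uniformly. The 20% rate follows the slide's example; the spider-trap link matrix is assumed from the slides:

```python
import numpy as np

# Spider-trap MiniWeb: MS points only to itself (assumed link structure).
A_trap = np.array([[0.5, 0.5, 0.0],   # Ne <- Ne, Am
                   [0.5, 0.0, 0.0],   # Am <- Ne
                   [0.0, 0.5, 1.0]])  # MS <- Am, MS

beta, n = 0.2, 3                       # 20% tax rate, as in the example
R = np.ones(n) / n
for _ in range(100):
    R = (1 - beta) * (A_trap @ R) + beta / n   # taxed weight goes to the pool

print(R)   # -> roughly [0.21, 0.15, 0.64]: MS is biggest but no longer takes all
```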
The War of Search Engines
  • More companies are realizing the importance of search engines
  • More competitors in the market: Ask.com, Microsoft, Yahoo!, etc.
HITS
  • Algorithm developed by Kleinberg in 1998.
  • Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
  • Based on mutually recursive facts:
    • Hubs point to lots of authorities.
    • Authorities are pointed to by lots of hubs.
Hubs and Authorities
  • Motivation: find web pages related to a topic
    • E.g.: “find all web sites about automobiles”
  • “Authority”: a page that offers info about a topic
    • E.g.: DBLP is a page about papers
    • E.g.: google.com, aj.com, teoma.com, lycos.com
  • “Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
    • E.g.: www.searchenginewatch.com is a hub of search engines
    • http://www.ics.uci.edu/~ics214a/ points to many biblio-search engines
The hope

[Figure: hub pages on one side pointing to authority pages on the other; example topic: long-distance telephone companies.]
Base set
  • Given a text query (say, “browser”), use a text index to get all pages containing “browser”.
    • Call this the root set of pages.
  • Add in any page that either
    • points to a page in the root set, or
    • is pointed to by a page in the root set.
  • Call this the base set.
Visualization

[Figure: the root set drawn as an inner region of pages, surrounded by the larger base set of pages that link to it or are linked from it.]
Assembling the base set
  • Root set typically 200-1000 nodes.
  • Base set may have up to 5000 nodes.
  • How do you find the base set nodes?
    • Follow out-links by parsing root set pages.
    • Get in-links (and out-links) from a connectivity server.
    • (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)
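A short sketch of this assembly step. `search_index`, `get_out_links`, and `get_in_links` are hypothetical helpers standing in for the text index and the connectivity server mentioned above; the size caps echo the typical node counts on this slide:

```python
def build_base_set(query, search_index, get_out_links, get_in_links,
                   root_cap=1000, base_cap=5000):
    # Root set: pages containing the query, typically 200-1000 nodes.
    root_set = set(search_index(query)[:root_cap])
    base_set = set(root_set)
    for page in root_set:
        base_set.update(get_out_links(page))   # pages the root set points to
        base_set.update(get_in_links(page))    # pages pointing into the root set
        if len(base_set) >= base_cap:          # base set capped at ~5000 nodes
            break
    return root_set, base_set
```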
Distilling hubs and authorities
  • Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
  • Initialize: for all x, h(x) ← 1; a(x) ← 1;
  • Iteratively update all h(x), a(x);
  • After the iterations,
    • output pages with highest h() scores as top hubs
    • pages with highest a() scores as top authorities.


Iterative update
  • Repeat the following updates, for all x:

    h(x) ← Σ_{y: x → y} a(y)

    a(x) ← Σ_{y: y → x} h(y)
Scaling
  • To prevent the h() and a() values from getting too big, can scale down after each iteration.
  • Scaling factor doesn’t really matter:
    • we only care about the relative values of the scores.
How many iterations?
  • Claim: relative values of scores will converge after a few iterations:
    • in fact, suitably scaled, h() and a() scores settle into a steady state!
    • proof of this comes later.
  • We only require the relative orders of the h() and a() scores - not their absolute values.
  • In practice, ~5 iterations get you close to stability.
Iterative Algorithm
  • Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities.
  • Maintain for each page p ∈ S:
    • Authority score: a_p (vector a)
    • Hub score: h_p (vector h)
  • Initialize all a_p = h_p = 1
  • Maintain normalized scores: Σ_p (a_p)² = 1 and Σ_p (h_p)² = 1
HITS Update Rules
  • Authorities are pointed to by lots of good hubs: a_p = Σ_{q: q → p} h_q
  • Hubs point to lots of good authorities: h_p = Σ_{q: p → q} a_q
Illustrated Update Rules

[Figure: page 4 receives links from pages 1, 2, 3 and links out to pages 5, 6, 7.]

    a_4 = h_1 + h_2 + h_3

    h_4 = a_5 + a_6 + a_7
HITS Iterative Algorithm

Initialize for all p ∈ S: a_p = h_p = 1

For i = 1 to k:

    For all p ∈ S: a_p = Σ_{q: q → p} h_q      (update auth. scores)

    For all p ∈ S: h_p = Σ_{q: p → q} a_q      (update hub scores)

    For all p ∈ S: a_p = a_p/c, with c such that Σ_p (a_p/c)² = 1   (normalize a)

    For all p ∈ S: h_p = h_p/c, with c such that Σ_p (h_p/c)² = 1   (normalize h)

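The same loop as a compact Python sketch. The dict-of-lists graph representation is an assumption; the updates and the squared-sum normalization follow the algorithm above:

```python
import math

def hits(in_links, out_links, k=20):
    # in_links[p] / out_links[p]: pages pointing to / pointed to by p.
    # Both dicts must have an entry for every page in S.
    pages = list(out_links)
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(k):
        # a_p = sum of h_q over q -> p (authorities pointed to by good hubs)
        a = {p: sum(h[q] for q in in_links[p]) for p in pages}
        # h_p = sum of a_q over p -> q (hubs point to good authorities)
        h = {p: sum(a[q] for q in out_links[p]) for p in pages}
        # Normalize each vector so its squared entries sum to 1.
        ca = math.sqrt(sum(x * x for x in a.values())) or 1.0
        ch = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {p: x / ca for p, x in a.items()}
        h = {p: x / ch for p, x in h.items()}
    return a, h
```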
Example: MiniWeb

[Figure: the hub and authority vectors for Ne, Am, and MS over successive normalized iterations, and the resulting fixpoint.]
Convergence
  • Algorithm converges to a fixpoint if iterated indefinitely.
  • Define A to be the adjacency matrix for the subgraph defined by S.
    • A_ij = 1 for i ∈ S, j ∈ S iff i → j
  • Authority vector a converges to the principal eigenvector of AᵀA.
  • Hub vector h converges to the principal eigenvector of AAᵀ.
  • In practice, 20 iterations produce fairly stable results.
Results
  • Authorities for query: “Java”
    • java.sun.com
    • comp.lang.java FAQ
  • Authorities for query “search engine”
    • Yahoo.com
    • Excite.com
    • Lycos.com
    • Altavista.com
  • Authorities for query “Gates”
    • Microsoft.com
    • roadahead.com
Result Comments
  • In most cases, the final authorities were not in the initial root set generated using Altavista.
  • Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
Comparison

PageRank
  • Pros:
    • Hard to spam
    • Computes quality signal for all pages
  • Cons:
    • Non-trivial to compute
    • Not query specific
    • Doesn’t work on small graphs
  • Proven to be effective for general-purpose ranking

HITS & Variants
  • Pros:
    • Easy to compute, but real-time execution is hard [Bhar98b, Stat00]
    • Query specific
    • Works on small graphs
  • Cons:
    • Local graph structure can be manufactured (spam!)
    • Provides a signal only when there’s direct connectivity (e.g., home pages)
  • Well suited for supervised directory construction
Tag/position heuristics
  • Increase weights of terms
    • in titles
    • in tags
    • near the beginning of the doc, its chapters and sections
Anchor text (first used in the WWW Worm - McBryan [Mcbr94])

[Figure: several pages link to a tiger page with anchor texts such as “Tiger image”, “Here is a great picture of a tiger”, and “Cool tiger webpage”.]

The text in the vicinity of a hyperlink is descriptive of the page it points to.
Two uses of anchor text
  • When indexing a page, also index the anchor text of links pointing to it.
    • Retrieve a page when query matches its anchor text.
  • To weight links in the hubs/authorities algorithm.
  • Anchor text usually taken to be a window of 6-8 words around a link anchor.
Indexing anchor text
  • When indexing a document D, include anchor text from links pointing to D.

[Figure: three pages link to www.ibm.com with different surrounding text: “Armonk, NY-based computer giant IBM announced today”, “Big Blue today announced record profits for the quarter”, and the bare link “IBM” on a “Joe’s computer hardware links” page listing Compaq, HP, and IBM.]
Indexing anchor text (cont.)
  • Can sometimes have unexpected side effects - e.g., “evil empire”.
  • Can index anchor text with less weight.
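A minimal sketch of folding in-link anchor text into a document's postings at reduced weight. The postings structure, the helper name, and the 0.5 anchor weight are all illustrative assumptions:

```python
from collections import defaultdict

# postings: term -> {url: accumulated weight}
postings = defaultdict(lambda: defaultdict(float))

def index_document(url, body_text, incoming_anchor_texts, anchor_weight=0.5):
    for term in body_text.lower().split():
        postings[term][url] += 1.0                 # body terms at full weight
    for anchor in incoming_anchor_texts:
        for term in anchor.lower().split():
            postings[term][url] += anchor_weight   # anchor terms at less weight

# The anchor "IBM" on Joe's page now contributes to www.ibm.com's postings,
# so a query for "ibm" can retrieve the page even if its own text never says it.
index_document("www.ibm.com", "Big Blue announced record profits", ["IBM"])
```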
Weighting links
  • In hub/authority link analysis, can match anchor text to query, then weight link.
Weighting links (cont.)
  • What is w(x, y)?
  • Should increase with the number of query terms in the anchor text.
    • E.g.: 1 + number of query terms.

[Figure: page x links to y = www.ibm.com with anchor text “Armonk, NY-based computer giant IBM announced today”. The weight of this link for the query computer is 2.]

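The slide's example weight as a tiny function (a sketch; whitespace tokenization is an assumption):

```python
def link_weight(query_terms, anchor_text):
    # w(x, y) = 1 + number of query terms appearing in the link's anchor text.
    anchor = set(anchor_text.lower().split())
    return 1 + sum(1 for t in set(query_terms) if t.lower() in anchor)

w = link_weight(["computer"],
                "Armonk, NY-based computer giant IBM announced today")
print(w)   # -> 2, matching the example above
```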
Weighted hub/authority computation
  • Recall basic algorithm:
    • Iteratively update all h(x), a(x);
    • After iteration, output pages with
      • highest h() scores as top hubs
      • highest a() scores as top authorities.
  • Now use weights in the iteration: h(x) ← Σ_{x → y} w(x, y)·a(y) and a(x) ← Σ_{y → x} w(y, x)·h(y).
  • Raises scores of pages with “heavy” links.

Do we still have convergence of scores? To what?

Anchor Text
  • Other applications
    • Weighting/filtering links in the graph
      • HITS [Chak98], Hilltop [Bhar01]
    • Generating page descriptions from anchor text [Amit98, Amit00]
Behavior-based ranking
  • For each query Q, keep track of which docs in the results are clicked on
  • On subsequent requests for Q, re-order docs in results based on click-throughs
  • First due to DirectHit, later part of AskJeeves
  • Relevance assessment based on
    • Behavior/usage
    • vs. content
Query-doc popularity matrix B

[Figure: matrix B with queries as rows and docs as columns; the entry at row q, column j is highlighted.]

B_qj = number of times doc j is clicked through on query q.

When query q is issued again, order docs by their B_qj values.

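A minimal sketch of this click-through reordering; the nested-dict representation of B and the doc ids are assumptions:

```python
from collections import defaultdict

B = defaultdict(lambda: defaultdict(int))   # B[q][j]: clicks on doc j for query q

def record_click(query, doc):
    B[query][doc] += 1

def rerank(query, result_docs):
    # On a repeat of the query, order docs by their B_qj values.
    return sorted(result_docs, key=lambda j: B[query][j], reverse=True)

record_click("ferrari mondial", "doc7")
record_click("ferrari mondial", "doc7")
record_click("ferrari mondial", "doc2")
print(rerank("ferrari mondial", ["doc1", "doc2", "doc7"]))  # doc7, doc2, doc1
```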
Issues to consider
  • Weighing/combining text- and click-based scores.
  • What identifies a query?
    • Ferrari Mondial
    • Ferrari Mondial
    • Ferrari mondial
    • ferrari mondial
    • “Ferrari Mondial”
  • Can use heuristics, but they slow down query parsing.
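One possible canonicalization heuristic for the variants above (a sketch; collapsing case, whitespace, and surrounding quotes is an assumption, and dropping the quotes discards phrase-query intent):

```python
def canonical_query(q):
    q = q.strip().strip('"')            # drop surrounding phrase quotes
    return " ".join(q.lower().split())  # lowercase, collapse whitespace

for q in ['Ferrari Mondial', 'ferrari  mondial', '"Ferrari Mondial"']:
    print(canonical_query(q))           # all map to: ferrari mondial
```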
Vector space implementation
  • Maintain a term-doc popularity matrix C
    • as opposed to query-doc popularity
    • initialized to all zeros
  • Each column represents a doc j
    • If doc j is clicked on for query q, update C_j ← C_j + q (here q is viewed as a vector).
  • On a query q’, compute its cosine proximity to Cj for all j.
  • Combine this with the regular text score.
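A small sketch of the term-doc popularity matrix and its cosine use; the toy vocabulary, doc count, and scoring helpers are illustrative assumptions:

```python
import numpy as np

vocab = {"white": 0, "house": 1, "ferrari": 2}   # assumed toy vocabulary
C = np.zeros((len(vocab), 3))                    # term-doc popularity, 3 docs

def as_vector(query):
    v = np.zeros(len(vocab))
    for t in query.lower().split():
        if t in vocab:
            v[vocab[t]] += 1.0
    return v

def record_click(query, j):
    C[:, j] += as_vector(query)                  # C_j <- C_j + q

def click_score(query, j):
    q = as_vector(query)
    denom = np.linalg.norm(q) * np.linalg.norm(C[:, j])
    return float(q @ C[:, j]) / denom if denom else 0.0   # cosine proximity

record_click("white house", 0)
print(click_score("white house", 0))   # 1.0 for the exact clicked query
print(click_score("white", 0))         # ~0.71: compositional partial match
```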
Issues
  • Normalization of Cj after updating
  • Assumption of query compositionality
    • “white house” document popularity derived from “white” and “house”
  • Updating - live or batch?
Basic Assumption
  • Relevance can be directly measured by the number of click-throughs
  • Valid?
Validity of Basic Assumption
  • Click through to docs that turn out to be non-relevant: what does a click mean?
  • Self-perpetuating ranking
  • Spam
  • All votes count the same
Variants
  • Time spent viewing page
    • Difficult session management
    • Inconclusive modeling so far
  • Does user back out of page?
  • Does user stop searching?
  • Does user transact?