# Page Link Analysis and Anchor Text for Web Search (Lecture 9)

Many slides based on lectures by Chen Li (UCI) and Raymond Mooney (UTexas)

PageRank
• Intuition:
• The importance of each page should be determined by what other pages “say” about it
• One naïve implementation: count the number of pages pointing to each page (i.e., the number of in-links)
• Problem:
• We can easily fool this technique by generating many dummy pages that point to our class page
Initial PageRank Idea
• Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link.
• Initial PageRank equation for page p: R(p) = c · Σq→p R(q)/Nq
• Nq is the total number of out-links from page q.
• A page q “gives” an equal fraction of its authority to all the pages it points to (e.g., p).
• c is a normalizing constant set so that the ranks of all pages always sum to 1.

[Figure: a small web graph with example PageRank values (.08, .05, .03) on its pages]

Initial PageRank Idea (cont.)
• Can view it as a process of PageRank “flowing” from pages to the pages they cite.

[Figure: rank values (.1, .09) flowing from pages to the pages they cite]

Initial Algorithm
• Iterate the rank-flowing process until convergence:

Let S be the total set of pages.

Initialize ∀p ∈ S: R(p) = 1/|S|

Until ranks do not change (much) (convergence):

For each p ∈ S: R′(p) = Σq→p R(q)/Nq

For each p ∈ S: R(p) = c·R′(p) (normalize so that Σp R(p) = 1)

Sample Stable Fixpoint

[Figure: example graphs whose ranks (0.2, 0.4) reproduce themselves under the update, i.e., stable fixpoints of the flow]

Linear Algebra Version
• Treat R as a vector over web pages.
• Let A be a 2-d matrix over pages where
• Avu = 1/Nu if u → v, else Avu = 0
• Then R = cAR
• R converges to the principal eigenvector of A.
Example: MiniWeb
• Our “MiniWeb” has only three web sites: Netscape, Amazon, and Microsoft.
• Their weights are represented as a vector

[Figure: the MiniWeb link graph over Ne, Am, and MS]

For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

Materials by courtesy of Jeff Ullman

Iterative computation

[Table: successive weight vectors (Ne, Am, MS) from each iteration]

Final result:

• Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.
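This result can be checked numerically. A sketch, with the MiniWeb link structure (Ne → Ne, Am; Am → Ne, MS; MS → Am) inferred from the slides' description of the example:

```python
# Column-stochastic link matrix M over [Ne, Am, MS]; column q gives 1/N_q
# of q's weight to each page q links to.  Links assumed: Ne -> {Ne, Am},
# Am -> {Ne, MS}, MS -> {Am}.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = [1 / 3, 1 / 3, 1 / 3]
for _ in range(200):
    # one iteration: r <- M r
    r = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]

# r approaches (2/5, 2/5, 1/5): Ne and Am end up equal, each twice MS.
```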


Observations
• We cannot get absolute weights:
• We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (each column sums to 1), so the iterations converge to the principal eigenvector of the link matrix M, i.e., the solution of R = MR.
Problem 1 of algorithm: dead ends!
• MS does not point to anybody
• Result: weights of the Web “leak out”

[Figure: the MiniWeb graph with MS having no out-links]

Problem 2 of algorithm: spider traps
• MS only points to itself
• Result: all weights go to MS!

[Figure: the MiniWeb graph with MS linking only to itself]

Google’s solution: “tax each page”
• Like people paying taxes, each page pays some of its weight into a public pool, which is distributed evenly to all pages.
• With tax rate t, the update becomes: R(p) = (1 − t) · Σq→p R(q)/Nq + t/|S|
• Example: assume a 20% tax rate in the “spider trap” example.
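A sketch of the taxed update applied to the spider-trap version of MiniWeb. The link structure is assumed from the slides (Ne → Ne, Am; Am → Ne, MS; MS → only itself):

```python
# Spider-trap MiniWeb; link structure assumed from the slides.
links = {
    "Ne": ["Ne", "Am"],
    "Am": ["Ne", "MS"],
    "MS": ["MS"],          # the trap: MS points only to itself
}
tax = 0.2
pages = list(links)
R = {p: 1.0 / len(pages) for p in pages}

for _ in range(100):
    # each page keeps an equal share of the taxed pool, plus (1 - tax)
    # of the weight flowing in along links
    R_new = {p: tax / len(pages) for p in pages}
    for q, outs in links.items():
        for p in outs:
            R_new[p] += (1 - tax) * R[q] / len(outs)
    R = R_new

# Converges to Ne = 7/33, Am = 5/33, MS = 21/33: MS is still heaviest,
# but no longer absorbs all the weight.
```

Without the tax term, iterating the same flow drives all weight into MS.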
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Ask.com, Microsoft, Yahoo!, etc.
HITS
• Algorithm developed by Kleinberg in 1998.
• Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web.
• Based on mutually recursive facts:
• Hubs point to lots of authorities.
• Authorities are pointed to by lots of hubs.
Hubs and Authorities
• Motivation: find web pages relevant to a topic
• E.g.: “find all web sites about automobiles”
• “Authority”: a page that offers info about a topic
• E.g.: DBLP is a page about papers
• E.g.: google.com, aj.com, teoma.com, lycos.com
• “Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
• E.g.: www.searchenginewatch.com is a hub of search engines
• http://www.ics.uci.edu/~ics214a/ points to many biblio-search engines
The hope

[Figure: hubs pointing to authorities, e.g. for the topic “long distance telephone companies”]

Base set
• Given text query (say browser), use a text index to get all pages containing browser.
• Call this the root set of pages.
• Add in any page that either
• points to a page in the root set, or
• is pointed to by a page in the root set.
• Call this the base set.
Visualization

[Figure: the base set enclosing the root set]

Assembling the base set
• Root set typically 200-1000 nodes.
• Base set may have up to 5000 nodes.
• How do you find the base set nodes?
• Follow out-links by parsing root set pages.
• Get in-links (and out-links) from a connectivity server.
• (Actually, suffices to text-index strings of the form href=“URL” to get in-links to URL.)
Distilling hubs and authorities
• Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
• Initialize: for all x, h(x) ← 1; a(x) ← 1
• Iteratively update all h(x), a(x)
• After iterating:
• output pages with highest h() scores as top hubs
• highest a() scores as top authorities.

Iterative update
• Repeat the following updates, for all x:
• a(x) ← Σy→x h(y)
• h(x) ← Σx→y a(y)

Scaling
• To prevent the h() and a() values from getting too big, can scale down after each iteration.
• Scaling factor doesn’t really matter:
• we only care about the relative values of the scores.
How many iterations?
• Claim: relative values of scores will converge after a few iterations:
• in fact, suitably scaled, h() and a() scores settle into a steady state!
• proof of this comes later.
• We only require the relative orders of the h() and a() scores - not their absolute values.
• In practice, ~5 iterations get you close to stability.
Iterative Algorithm
• Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities.
• Maintain for each page p ∈ S:
• Authority score: ap (vector a)
• Hub score: hp (vector h)
• Initialize all ap = hp = 1
• Maintain normalized scores: Σp ap² = 1 and Σp hp² = 1
HITS Update Rules
• Authorities are pointed to by lots of good hubs: ap = Σq→p hq
• Hubs point to lots of good authorities: hp = Σp→q aq
Illustrated Update Rules

[Figure: pages 1, 2, 3 each point to page 4, and page 4 points to pages 5, 6, 7]

a4 = h1 + h2 + h3

h4 = a5 + a6 + a7

HITS Iterative Algorithm

Initialize for all p ∈ S: ap = hp = 1

For i = 1 to k:

For all p ∈ S: ap = Σq→p hq (update auth. scores)

For all p ∈ S: hp = Σp→q aq (update hub scores)

For all p ∈ S: ap = ap/c, where c is chosen so that Σp ap² = 1 (normalize a)

For all p ∈ S: hp = hp/c, where c is chosen so that Σp hp² = 1 (normalize h)
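The loop can be sketched in Python on the small graph from the illustrated update rules (pages 1, 2, 3 point to 4, and 4 points to 5, 6, 7); the variable names are ours:

```python
from math import sqrt

# Made-up graph from the illustration: 1, 2, 3 -> 4 and 4 -> 5, 6, 7.
out_links = {1: [4], 2: [4], 3: [4], 4: [5, 6, 7], 5: [], 6: [], 7: []}
pages = list(out_links)
in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}

h = {p: 1.0 for p in pages}
a = {p: 1.0 for p in pages}

for _ in range(20):
    a = {p: sum(h[q] for q in in_links[p]) for p in pages}   # auths from hubs
    h = {p: sum(a[q] for q in out_links[p]) for p in pages}  # hubs from auths
    ca = sqrt(sum(v * v for v in a.values()))                # normalize a
    ch = sqrt(sum(v * v for v in h.values()))                # normalize h
    a = {p: v / ca for p, v in a.items()}
    h = {p: v / ch for p, v in h.items()}

# Page 4 is the top authority (pointed to by hubs 1-3) and also a strong hub.
```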

Example: MiniWeb

[Figure: hub and authority scores for Ne, Am, and MS under the HITS updates, with normalization after each iteration]

Convergence
• The algorithm converges to a fixed point if iterated indefinitely.
• Define A to be the adjacency matrix for the subgraph defined by S:
• Aij = 1 for i ∈ S, j ∈ S iff i → j; otherwise Aij = 0
• The authority vector a converges to the principal eigenvector of AᵀA
• The hub vector h converges to the principal eigenvector of AAᵀ
• In practice, 20 iterations produce fairly stable results.
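A numerical check of this claim on a tiny made-up adjacency matrix (a sketch, not lecture code): after iterating, a should satisfy the eigenvector property (AᵀA)a = λa.

```python
from math import sqrt

# A[i][j] = 1 iff page i links to page j (an invented 3-page graph).
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
n = len(A)
AT = [[A[j][i] for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

a = [1.0] * n
h = [1.0] * n
for _ in range(100):
    a = matvec(AT, h)                           # a(p) = sum of h over in-links
    h = matvec(A, a)                            # h(p) = sum of a over out-links
    na = sqrt(sum(x * x for x in a)); a = [x / na for x in a]
    nh = sqrt(sum(x * x for x in h)); h = [x / nh for x in h]

# At convergence, (A^T A) a is parallel to a; the Rayleigh quotient gives
# the top eigenvalue of A^T A.
ATAa = matvec(AT, matvec(A, a))
lam = sum(x * y for x, y in zip(ATAa, a))
```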
Results
• Authorities for query: “Java”
• java.sun.com
• comp.lang.java FAQ
• Authorities for query “search engine”
• Yahoo.com
• Excite.com
• Lycos.com
• Altavista.com
• Authorities for query “Gates”
• Microsoft.com
• roadahead.com
Result Comments
• In most cases, the final authorities were not in the initial root set generated using Altavista.
• Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
Comparison

Pagerank
• Pros:
• Hard to spam
• Computes quality signal for all pages
• Cons:
• Non-trivial to compute
• Not query specific
• Doesn’t work on small graphs
• Proven to be effective for general-purpose ranking

HITS & Variants
• Pros:
• Easy to compute
• Query specific
• Works on small graphs
• Cons:
• Real-time execution is hard [Bhar98b, Stat00]
• Local graph structure can be manufactured (spam!)
• Provides a signal only when there’s direct connectivity (e.g., home pages)
• Well suited for supervised directory construction
Tag/position heuristics
• Increase weights of terms
• in titles
• in tags
• near the beginning of the doc, its chapters and sections
Anchor text (first used in the WWW Worm - McBryan [Mcbr94])

[Figure: several pages linking to a tiger page with anchor texts “Tiger image”, “Here is a great picture of a tiger”, and “Cool tiger webpage”]

The text in the vicinity of a hyperlink is descriptive of the page it points to.

Two uses of anchor text
• When indexing a page, also index the anchor text of links pointing to it.
• Retrieve a page when query matches its anchor text.
• To weight links in the hubs/authorities algorithm.
• Anchor text usually taken to be a window of 6-8 words around a link anchor.
Indexing anchor text
• When indexing a document D, include anchor text from links pointing to D.

[Figure: www.ibm.com with in-links whose anchor texts read “Armonk, NY-based computer giant IBM announced today”, “Big Blue today announced record profits for the quarter”, and “Joe’s computer hardware links: Compaq, HP, IBM”]
Indexing anchor text
• Can sometimes have unexpected side effects - e.g., “evil empire”.
• Can index anchor text with less weight.
Weighting links
• In hub/authority link analysis, can match anchor text to query, then weight link.
Weighting links
• What is w(x,y)?
• Should increase with the number of query terms in anchor text.
• E.g.: 1+ number of query terms.

[Figure: page x links to y = www.ibm.com with anchor text “Armonk, NY-based computer giant IBM announced today”; the weight of this link for the query computer is 2]
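The rule above fits in a one-line function (the name link_weight and the tokenization are ours):

```python
def link_weight(query_terms, anchor_text):
    """w(x, y) = 1 + number of query terms appearing in the anchor text."""
    words = anchor_text.lower().split()
    return 1 + sum(1 for t in query_terms if t.lower() in words)

# The IBM example: one query term ("computer") occurs in the anchor text,
# so the link weight is 2.
w = link_weight(["computer"], "Armonk, NY-based computer giant IBM announced today")
```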

Weighted hub/authority computation
• Recall basic algorithm:
• Iteratively update all h(x), a(x);
• After iteration, output pages with
• highest h() scores as top hubs
• highest a() scores as top authorities.
• Now use weights in iteration.
• Raises scores of pages with “heavy” links.

Do we still have convergence of scores? To what?

Anchor Text
• Other applications
• Weighting/filtering links in the graph
• HITS [Chak98], Hilltop [Bhar01]
• Generating page descriptions from anchor text [Amit98, Amit00]

### Behavior-based ranking

Behavior-based ranking
• For each query Q, keep track of which docs in the results are clicked on
• On subsequent requests for Q, re-order docs in results based on click-throughs
• First due to DirectHit (later part of AskJeeves)
• Relevance assessment based on
• Behavior/usage
• vs. content
Query-doc popularity matrix B

[Figure: matrix B with queries as rows and docs as columns, entry Bqj highlighted at row q, column j]

Bqj = number of times doc j was clicked through on query q

When query q is issued again, order docs by Bqj values.
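A minimal sketch of this scheme, storing B sparsely; the function names, queries, and documents are invented for illustration:

```python
from collections import defaultdict

B = defaultdict(int)                 # B[(q, j)] = clicks on doc j for query q

def record_click(q, j):
    B[(q, j)] += 1

def reorder(q, result_docs):
    # Sort by descending click count; sorted() is stable, so ties keep
    # the original (text-score) order.
    return sorted(result_docs, key=lambda j: -B[(q, j)])

record_click("ferrari mondial", "doc2")
record_click("ferrari mondial", "doc2")
record_click("ferrari mondial", "doc3")
ranked = reorder("ferrari mondial", ["doc1", "doc2", "doc3"])
# ranked == ["doc2", "doc3", "doc1"]
```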

Issues to consider
• Weighing/combining text- and click-based scores.
• What identifies a query?
• Ferrari Mondial
• Ferrari mondial
• ferrari mondial
• “Ferrari Mondial”
• Can use heuristics, but they slow down query parsing.
Vector space implementation
• Maintain a term-doc popularity matrix C
• as opposed to query-doc popularity
• initialized to all zeros
• Each column represents a doc j
• If doc j is clicked on for query q, update Cj ← Cj + q (here q is viewed as a term vector).
• On a query q’, compute its cosine proximity to Cj for all j.
• Combine this with the regular text score.
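A hedged sketch of this vector-space variant; the helper names, queries, and documents are invented, and the final combination with the text score is left out:

```python
import math
from collections import defaultdict

C = defaultdict(lambda: defaultdict(float))   # C[doc][term], all zeros initially

def query_vector(q):
    v = defaultdict(float)
    for term in q.lower().split():
        v[term] += 1.0
    return v

def record_click(q, doc):
    # C_j <- C_j + q, with the query q viewed as a term vector
    for term, w in query_vector(q).items():
        C[doc][term] += w

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

record_click("white house", "d1")
record_click("white christmas", "d2")
# On a new query, score each doc by cosine proximity to its column of C.
scores = {doc: cosine(query_vector("white house"), C[doc]) for doc in C}
# scores["d1"] == 1.0, scores["d2"] == 0.5
```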
Issues
• Normalization of Cj after updating
• Assumption of query compositionality
• “white house” document popularity derived from “white” and “house”
• Updating - live or batch?
Basic Assumption
• Relevance can be directly measured by number of click throughs
• Valid?
Validity of Basic Assumption
• Click through to docs that turn out to be non-relevant: what does a click mean?
• Self-perpetuating ranking
• Spam
• All votes count the same
Variants
• Time spent viewing page
• Difficult session management
• Inconclusive modeling so far
• Does user back out of page?
• Does user stop searching?
• Does user transact?