## Hyperlink Analysis for the Web

### We want top-ranking documents to be both relevant and authoritative

Information Retrieval

- Input: Document collection
- Goal: Retrieve documents or text with information content that is relevant to user’s information need
- Two aspects:

1. Processing the collection

2. Processing queries (searching)

Classic information retrieval

- Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
- This works because of the following assumptions in classical IR:
- Queries are long and well specified

“What is the impact of the Falklands war on Anglo-Argentinean relations”

- Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic
- The vocabulary is small and relatively well understood

Web information retrieval

- None of these assumptions hold:
- Queries are short: 2.35 terms on average
- Huge variety in documents: language, quality, duplication
- Huge vocabulary: hundreds of millions of terms
- Deliberate misinformation
- Ranking is a function of the query terms and of the hyperlink structure

Connectivity-based ranking

- Hyperlink analysis
- Idea: Mine structure of the web graph
- Each web page is a node
- Each hyperlink is a directed edge
- Ranking Returned Documents
- Query dependent ranking
- Query independent ranking

Query dependent ranking

- Assigns a score that measures the quality and relevance of a selected set of pages to a given user query.
- The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.

Building a neighborhood graph

- A start set of documents matching the query is fetched from a search engine (typically 200-1000 nodes).
- The start set is augmented by its neighborhood, which is the set of documents that either link to or are linked to by documents in the start set (up to 5000 nodes).
- Each document in both the start set and the neighborhood is modeled by a node. There exists an edge from node A to node B if and only if document A hyperlinks to document B.
- Hyperlinks between pages on the same Web host can be omitted.
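
The construction steps above can be sketched roughly as follows. This is only an illustration: `out_links` is a hypothetical in-memory link index (URL to list of linked URLs), and treating the 5000-node figure as a cap on the whole graph is an assumption, since the slide does not say how the limit is applied.

```python
from urllib.parse import urlparse

def build_neighborhood_graph(start_set, out_links, max_nodes=5000):
    """Return (nodes, edges) of the query-specific neighborhood graph.

    start_set: URLs matching the query.
    out_links: hypothetical link index, URL -> list of URLs it links to.
    """
    start = set(start_set)
    nodes = set(start)
    # Collect the forward set (pages the start set links to) and the
    # back set (pages that link into the start set).
    candidates = []
    for src, targets in out_links.items():
        for dst in targets:
            if src in start and dst not in start:
                candidates.append(dst)
            if dst in start and src not in start:
                candidates.append(src)
    for page in candidates:
        if len(nodes) >= max_nodes:  # assumed: cap applies to the whole graph
            break
        nodes.add(page)
    # One directed edge per hyperlink, skipping links within the same host.
    edges = set()
    for src in nodes:
        for dst in out_links.get(src, []):
            if dst in nodes and urlparse(src).netloc != urlparse(dst).netloc:
                edges.add((src, dst))
    return nodes, edges
```

A real system would get the start set and the in-links from a search engine and a connectivity server rather than from a dictionary.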

Neighborhood graph

- Subgraph associated to each query

[Figure: the query results (Result1 … Resultn) form the start set; the back set (b1 … bm) contains pages that link to the start set, and the forward set (f1 … fs) contains pages the start set links to.]

An edge for each hyperlink, but no edges within the same host

Hyperlink-Induced Topic Search (HITS)

- In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
- Hub pages are good lists of links on a subject.
- e.g., “Bob’s list of cancer-related links.”
- Authority pages occur recurrently on good hubs for the subject.
- Best suited for “broad topic” queries rather than for page-finding queries.
- Gets at a broader slice of common opinion.

Hubs and Authorities

- Thus, a good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition - will turn this into an iterative computation.

HITS [K’98]

- Goal: Given a query find:
- Good sources of content (authorities)
- Good sources of links (hubs)

Intuition

- Authority comes from in-edges. Being a good hub comes from out-edges.
- Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.

Distilling hubs and authorities

- Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
- Initialize: for all x, h(x) ← 1; a(x) ← 1.
- Iteratively update all h(x), a(x).
- After the iterations:
- output the pages with the highest h() scores as top hubs
- output the pages with the highest a() scores as top authorities.
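
The update formulas themselves did not survive in this transcript. The sketch below assumes the standard HITS rules, which match the Intuition slide: a(x) becomes the sum of h(y) over pages y linking to x, and h(x) becomes the sum of a(y) over pages y that x links to. The function names, the edge-list input, and the L2 scaling are illustrative choices, not taken from the slides.

```python
import math

def normalize(scores):
    """Scale scores to unit length; only relative values matter."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {n: v / norm for n, v in scores.items()}

def hits(edges, iterations=5):
    """edges: list of (source, target) pairs. Returns (hub, authority) dicts."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

For the edges `[("h1", "a1"), ("h1", "a2"), ("h2", "a1")]`, page `a1` ends up as the top authority and `h1` as the top hub, because `h1` points to both authorities.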


Scaling

- To prevent the h() and a() values from getting too big, can scale down after each iteration.
- Scaling factor doesn’t really matter:
- we only care about the relative values of the scores.

How many iterations?

- Claim: relative values of scores will converge after a few iterations:
- in fact, suitably scaled, h() and a() scores settle into a steady state!
- We only require the relative orders of the h() and a() scores - not their absolute values.
- In practice, ~5 iterations get you close to stability.

Problems with the HITS algorithm (1)

- Only a relatively small part of the Web graph is considered, so adding edges to a few nodes can change the resulting hub and authority scores considerably.
- It is relatively easy to manipulate these scores.

Problems with the HITS algorithm (2)

- We often find that the neighborhood graph contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises.
- The most highly ranked authorities and hubs tend not to be about the original topic.
- For example, when running the algorithm on the query “jaguar and car”, the computation drifted to the general topic “car” and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.

Improvements

- To avoid “undue weight” of the opinion of a single person
- All the documents on a single host have the same influence on the document they are connected to as a single document would.
- Ideas
- If there are k edges from documents on a first host to a single document on a second host, we give each edge an authority weight of 1/k.
- If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l.
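
One possible way to compute these per-edge weights is sketched below. It assumes the edge list contains each (source, target) pair once and that the host is the network-location part of the URL; the function names are illustrative.

```python
from collections import Counter
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc

def edge_weights(edges):
    """edges: list of unique (source_url, target_url) pairs.

    Returns (auth_weight, hub_weight), each mapping an edge to its weight.
    """
    # k: edges from documents on one host into a single target document.
    k = Counter((host(src), dst) for src, dst in edges)
    # l: edges from a single source document into documents on one host.
    l = Counter((src, host(dst)) for src, dst in edges)
    auth_weight = {(s, d): 1.0 / k[(host(s), d)] for s, d in edges}
    hub_weight = {(s, d): 1.0 / l[(s, host(d))] for s, d in edges}
    return auth_weight, hub_weight
```

With these weights, the authority update would add `auth_weight[edge] * h(source)` per edge and the hub update `hub_weight[edge] * a(target)`, so a host with many pages counts no more than a single document.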

Improvements

- To solve topic drift problem, content analysis can be used.
- Ideas
- Eliminating non-relevant nodes from the graph
- Regulating the influence of a node based on its relevance.

Improvements

- Computing Relevance Weights for Nodes
- The documents in the start set are used to define a broader query, and every document in the graph is matched against this query.
- Specifically, the concatenation of the first 1000 words from each start-set document is taken as the query Q, and similarity(Q, D) is computed for each document D.
- All nodes whose weights are below a threshold are pruned.

Improvements

- Regulating the Influence of a Node
- Let W[n] be the relevance weight of a node n
- W[n]*A[n] is used instead of A[n] for computing the hub scores.
- W[n]*H[n] is used instead of H[n] for computing the authority score.
- This reduces the influence of less relevant nodes on the scores of their neighbors.
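
A single relevance-weighted update step might look like the sketch below. It assumes both new score vectors are computed from the previous iteration's values and that scaling happens elsewhere; the function name and dictionary-based inputs are illustrative.

```python
def weighted_hits_step(edges, hub, auth, W):
    """One HITS update where each neighbor's score is scaled by its relevance.

    edges: list of (source, target) pairs.
    hub, auth: current scores per node.  W: relevance weight per node.
    """
    new_hub = {n: 0.0 for n in hub}
    new_auth = {n: 0.0 for n in auth}
    for src, dst in edges:
        # Hub score uses W[n] * A[n] of each page it links to.
        new_hub[src] += W[dst] * auth[dst]
        # Authority score uses W[n] * H[n] of each page linking in.
        new_auth[dst] += W[src] * hub[src]
    return new_hub, new_auth
```

A hub that links to one relevant page (weight 1.0) and one off-topic page (weight 0.1) gains almost nothing from the off-topic link.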

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Two basic suggestions:

Undirected popularity:

Each page gets a score = the number of in-links plus the number of out-links (3+2=5).

Directed popularity:

Score of a page = number of its in-links (3).
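
Both popularity scores are simple degree counts. A minimal sketch, assuming the link graph is given as a list of (source, target) pairs:

```python
from collections import Counter

def popularity_scores(edges):
    """Return (undirected, directed) popularity per page."""
    in_links, out_links = Counter(), Counter()
    for src, dst in edges:
        out_links[src] += 1
        in_links[dst] += 1
    pages = set(in_links) | set(out_links)
    directed = {p: in_links[p] for p in pages}
    undirected = {p: in_links[p] + out_links[p] for p in pages}
    return undirected, directed
```

A page with three in-links and two out-links gets an undirected score of 5 and a directed score of 3, as in the slide's example.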

Query processing

First retrieve all pages meeting the text query (say venture capital).

Order these by their link popularity (either variant on the previous page).

Spamming simple popularity

Exercise: How do you spam each of the following heuristics so your page gets a high score?

Pagerank scoring

Imagine a browser doing a random walk on web pages:

Start at a random page

At each step, go out of the current page along one of the links on that page, equiprobably

“In the steady state” each page has a long-term visit rate - use this as the page’s score.

[Figure: a page with three out-links, each followed with probability 1/3.]

Not quite enough

The web is full of dead-ends.

Random walk can get stuck in dead-ends.

Makes no sense to talk about long-term visit rates.


Teleporting

At a dead end, jump to a random web page.

At any non-dead end, with probability 10%, jump to a random web page.

With remaining probability (90%), go out on a random link.

10% - a parameter.
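
The teleporting walk can be written down as a table of transition probabilities. The sketch below assumes pages are numbered 0 to n-1 and that each page's out-links are listed without duplicates; the function name and input format are illustrative.

```python
def transition_matrix(out_links, n, teleport=0.1):
    """Row i holds the probabilities of moving from page i to every page.

    out_links: dict page index -> list of distinct target indices.
    """
    P = []
    for i in range(n):
        targets = out_links.get(i, [])
        if not targets:
            # Dead end: jump to a random page.
            row = [1.0 / n] * n
        else:
            # Teleport with probability `teleport`, otherwise follow
            # one of the page's links, chosen equiprobably.
            row = [teleport / n] * n
            for j in targets:
                row[j] += (1.0 - teleport) / len(targets)
        P.append(row)
    return P
```

Every row sums to 1, and every entry is positive, so the walk can reach any page from any other page.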

Result of teleporting

Now cannot get stuck locally.

There is a long-term rate at which any page is visited (not obvious, will show this).

How do we compute this visit rate?

Markov chains

A Markov chain consists of n states, plus an n×n transition probability matrix P.

At each step, we are in exactly one of the states.

For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.

Pii > 0 is OK.

[Figure: states i and j joined by an edge labeled Pij.]

Markov chains

Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.

Markov chains are abstractions of random walks.

Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case:

Ergodic Markov chains

A Markov chain is ergodic if

you have a path from any state to any other

For any start state, after a finite transient time T0, the probability of being in any state at a fixed time T>T0 is nonzero.

Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state.

Steady-state probability distribution.

Over a long time-period, we visit each state in proportion to this rate.

It doesn’t matter where we start.

Probability vectors

A probability (row) vector x= (x1, … xn) tells us where the walk is at any point.

E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i out of n, means we’re in state i.

More generally, the vector x= (x1, … xn) means the walk is in state i with probability xi.

Change in probability vector

If the probability vector is x= (x1, … xn) at this step, what is it at the next step?

Recall that row i of the transition probability matrix P tells us where we go next from state i.

So from x, our next state is distributed as xP.

Steady state example

The steady state looks like a vector of probabilities a= (a1, … an):

ai is the probability that we are in state i.

[Figure: two-state Markov chain with states 1 and 2 and transition probabilities 1/4 and 3/4.]

For this example, a1=1/4 and a2=3/4.
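
The example can be checked numerically. The transition matrix below is an assumption reconstructed from the stated steady state: from either state, the walk moves to state 1 with probability 1/4 and to state 2 with probability 3/4. The helper `step` computes the vector-matrix product xP.

```python
def step(x, P):
    """Return the distribution after one step: the row vector x times P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

# Assumed transition matrix for the two-state example.
P = [[0.25, 0.75],
     [0.25, 0.75]]
a = [0.25, 0.75]

# In the steady state, taking one more step leaves the distribution unchanged.
next_a = step(a, P)
```

Here `next_a` equals `a`, and starting from `[1.0, 0.0]` instead reaches the same vector after a single step because both rows of this matrix are identical.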

How do we compute this vector?

Let a= (a1, … an) denote the row vector of steady-state probabilities.

If our current position is described by a, then the next step is distributed as aP.

But a is the steady state, so a=aP.

Solving this matrix equation gives us a.

One way of computing a

Recall, regardless of where we start, we eventually reach the steady state a.

Start with any distribution (say x = (1 0 … 0)).

After one step, we’re at xP;

after two steps at $xP^2$, then $xP^3$ and so on.

“Eventually” means for “large” k, $xP^k \approx a$.

Algorithm: multiply x by increasing powers of P until the product looks stable.
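
A minimal sketch of this algorithm follows. "Looks stable" is interpreted here as no entry changing by more than a small tolerance between steps; the tolerance, the step limit, and the function name are illustrative assumptions.

```python
def steady_state(P, tol=1e-10, max_steps=1000):
    """Repeatedly multiply a start distribution by P until it stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start in the first state
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(u - v) for u, v in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x
```

For `P = [[0.5, 0.5], [0.25, 0.75]]` the result is close to (1/3, 2/3), which satisfies a = aP.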

Google’s approach

- Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
- Quality of a page is related to its in-degree
- Recursion: Quality of a page is related to
- its in-degree, and to
- the quality of pages linking to it
- PageRank[BP ‘98]

Definition of PageRank

- Consider the following infinite random walk (surf):
- Initially the surfer is at a random page
- At each step, the surfer proceeds
- to a randomly chosen web page with probability d
- to a randomly chosen successor of the current page with probability 1-d
- The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

PageRank (cont.)

By random walk theorem:

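- PageRank = stationary probability for this Markov chain, i.e.

$$PR(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in E} \frac{PR(q)}{\mathrm{outdegree}(q)}$$

where n is the total number of nodes in the graph, E is the set of hyperlinks, and outdegree(q) is the number of links leaving page q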

PageRank (cont.)

[Figure: pages A and B each link to page P; A has four out-links and B has three.]

PageRank of P is

(1-d) * (1/4 the PageRank of A + 1/3 the PageRank of B) + d/n
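
The same rule can be applied to every page at once and iterated. The sketch below follows the slides' convention that d is the probability of jumping to a random page; the dictionary input, the fixed iteration count, and the handling of dead ends (their whole score is spread evenly, as in the Teleporting slide) are assumptions for illustration.

```python
def pagerank(out_links, d=0.1, iterations=100):
    """out_links: dict page -> list of distinct successor pages."""
    pages = sorted(set(out_links) | {q for ts in out_links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives d/n from random jumps.
        new_rank = {p: d / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # Each successor gets an equal share of (1-d) * rank[p].
                share = (1.0 - d) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dead end: the remaining score is spread over all pages.
                for q in pages:
                    new_rank[q] += (1.0 - d) * rank[p] / n
        rank = new_rank
    return rank
```

For `{"A": ["P"], "B": ["P"], "P": ["A"]}`, page B has no in-links and keeps only the d/n share, while P, which both A and B link to, ranks highest.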

PageRank

- Used in Google’s ranking function
- Query-independent
- Summarizes the “web opinion” of the page importance

PageRank vs. HITS

PageRank

- Computation: once for all documents and queries (offline)
- Query-independent: requires combination with query-dependent criteria
- Hard to spam

HITS

- Computation: required for each query
- Query-dependent
- Relatively easy to spam
- Quality depends on the quality of the start set

Relevance is being modeled by cosine scores

- Authority is typically a query-independent property of a document
- Assign to each document d a query-independent quality score in [0,1]
- Denote this by g(d)

Net score

- Consider a simple total score combining cosine relevance and authority
- net-score(q,d) = g(d) + cosine(q,d)
- Can use some other linear combination than an equal weighting
- Now we seek the top K docs by net score
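
A brute-force version of this selection is sketched below, before any of the faster methods. The dictionaries `g` and `cosine` and the weight parameters are hypothetical inputs; with both weights at 1.0 the score is the equal weighting g(d) + cosine(q,d).

```python
import heapq

def top_k_by_net_score(g, cosine, k, w_g=1.0, w_cos=1.0):
    """g: doc -> quality score in [0, 1].  cosine: doc -> cosine(q, d).

    Only documents with a cosine score for the query are considered.
    """
    net = {doc: w_g * g.get(doc, 0.0) + w_cos * c for doc, c in cosine.items()}
    return heapq.nlargest(k, net.items(), key=lambda item: item[1])
```

Changing `w_g` and `w_cos` gives the other linear combinations mentioned above.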

Top K by net score – fast methods

- First idea: Order all postings by g(d)
- Key: this is a common ordering for all postings
- Thus, can concurrently traverse query terms’ postings for
- Postings intersection
- Cosine score computation

Why order postings by g(d)?

- Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
- Short of computing scores for all docs in postings
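
Because every postings list uses the same g(d) ordering, two lists can be intersected with the usual merge-style walk. The sketch below assumes each list is sorted by decreasing g(d) with ties broken by document ID, and uses a result limit as a stand-in for the time bound; all names are illustrative.

```python
def intersect_g_ordered(p1, p2, g, max_results):
    """Intersect two postings lists sorted by decreasing g(d).

    Stops early once max_results matches are found; the matches found so
    far are the highest-quality ones because of the ordering.
    """
    def order(doc):
        return (-g[doc], doc)  # position of doc in the common ordering

    i = j = 0
    matches = []
    while i < len(p1) and j < len(p2) and len(matches) < max_results:
        a, b = p1[i], p2[j]
        if a == b:
            matches.append(a)
            i += 1
            j += 1
        elif order(a) < order(b):
            i += 1
        else:
            j += 1
    return matches
```

Cosine scores would be accumulated for each match during the same traversal.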

Champion lists in g(d)-ordering

Can combine champion lists with g(d)-ordering

Maintain for each term t a champion list of the r docs with highest g(d) + tf-idf_{t,d}

Seek top-K results from only the docs in these champion lists

High and low lists

- For each term, we maintain two postings lists called high and low
- Think of high as the champion list
- When traversing postings on a query, only traverse high lists first
- If we get more than K docs, select the top K and stop
- Else proceed to get docs from the low lists
- Can be used even for simple cosine scores, without global quality g(d)
- A means for segmenting index into two tiers
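
The two-tier lookup could be sketched as follows. It assumes a conjunctive query (a document must appear in the lists of all query terms), a precomputed `score` per document, and that reaching at least K documents in the high tier is enough to stop; these choices and all names are illustrative.

```python
def tiered_top_k(query_terms, high, low, score, k):
    """high, low: dict term -> list of doc IDs.  score: doc -> score."""
    def matching(index):
        postings = [set(index.get(t, ())) for t in query_terms]
        return set.intersection(*postings) if postings else set()

    # First tier: only the high lists.
    docs = matching(high)
    if len(docs) < k:
        # Not enough results: also bring in the low lists.
        merged = {t: list(high.get(t, ())) + list(low.get(t, ()))
                  for t in query_terms}
        docs = matching(merged)
    return sorted(docs, key=lambda d: score[d], reverse=True)[:k]
```

With `k=1` the example query below is answered from the high lists alone; with `k=2` the low lists are consulted as well.

```python
high = {"a": [1, 2], "b": [2]}
low = {"a": [3], "b": [1, 3]}
score = {1: 0.5, 2: 0.9, 3: 0.7}
first = tiered_top_k(["a", "b"], high, low, score, 1)
second = tiered_top_k(["a", "b"], high, low, score, 2)
```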

Other applications

- Web Pages Collection
- The crawling process usually starts from a set of source Web pages. The Web crawler follows the source page hyperlinks to find more Web pages.
- This process is repeated on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages have been collected.
- The crawler has to decide in which order to collect hyperlinked pages that have not yet been crawled.
- The crawlers of different search engines make different decisions, and so collect different sets of Web documents.
- A crawler might try to preferentially crawl “high quality” Web pages.

Other applications

- Web Page Categorization
- Geographical Scope
- Whether a given Web page is of interest only for people in a given region or is of nation- or worldwide interest is an interesting problem for hyperlink analysis.
- A page’s hyperlink structure also reflects its range of interest.
- Local pages are mostly hyperlinked to by pages from the same region, while hyperlinks to pages of nationwide interest are roughly uniform throughout the country.
- This information lets search engines tailor query results to the region the user is in.

Reference

- M. Henzinger, “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
- J. Cho, H. García-Molina, and L. Page, “Efficient Crawling through URL Ordering,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
- S. Chakrabarti et al., “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
- K. Bharat and M. Henzinger, “Improved Algorithms for Topic Distillation in Hyperlinked Environments,” Proc. 21st Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 98), ACM Press, New York, 1998
- L. Page et al., “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford Digital Library Technologies, Working Paper 1999-0120, Stanford Univ., Palo Alto, Calif., 1998.
- I. Varlamis et al., “THESUS, a Closer View on Web Content Management Enhanced with Link Semantics”, IEEE Transactions on Knowledge and Data Engineering, vol. 16, No. 6, June 2004.
