
Link-based ranking I



Presentation Transcript


  1. Link-based ranking I

  2. Broder’s breakdown
     • First-generation ranking:
       • Ranked Boolean with TF-IDF-like factors
     • Second-generation:
       • Off-page, Web-specific factors
       • Anchor text, click-through, link analysis
       • Plus, focus on corpus-improvement
     • Third-generation (yet to come):
       • Answering the need behind the query

  3. Link-analysis
     • Made famous by Google, but used by everyone now
     • Basic idea: a “link” is an endorsement
     • Has roots in bibliographic citation analysis
       • Decades-old work for determining “important” papers

  4. Naïve approach: in-degree
     • Idea: use in-degree to rank pages
       • Off-host links are better endorsements
     • Problems:
       • Too democratic:
         • Mossberg is a better endorsement than Stata
         • A link from a 2-link page is a stronger endorsement than one from a 100-link page
       • Easy to spam (for the above reasons)
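The in-degree idea can be sketched in a few lines. This is a minimal illustration, assuming the link graph is available as (source URL, target URL) pairs; the function name `in_degree`, the `off_host_only` flag, and the example URLs are hypothetical, not from the slides.

```python
from collections import Counter
from urllib.parse import urlparse


def in_degree(edges, off_host_only=False):
    """Count in-links per target URL from (source_url, target_url) pairs."""
    counts = Counter()
    for src, dst in edges:
        if off_host_only and urlparse(src).netloc == urlparse(dst).netloc:
            continue  # skip links between pages on the same host
        counts[dst] += 1
    return counts


# Hypothetical toy link graph.
edges = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://b.example/x", "http://a.example/2"),
    ("http://c.example/y", "http://a.example/2"),
    ("http://a.example/2", "http://b.example/x"),
]

all_links = in_degree(edges)
off_host = in_degree(edges, off_host_only=True)
print(all_links.most_common())
print(off_host.most_common())
```

Every link counts the same here regardless of who made it, which is exactly the "too democratic" problem: anyone can raise a page's count by creating more linking pages.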

  5. PageRank: recursive extension of in-degree
     • Let R(P) be the “PageRank” of P
     • R(P) = j/N + (1-j) * sum_i R(b_i)/outdegree(b_i)
       • where j is a number in (0,1)
       • N is the total number of pages
       • b_i are the pages pointing to P
     • (draw picture)
     • (point out connection to previous page)
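One way to evaluate the recursive definition is to apply it repeatedly until the values settle. The sketch below assumes a small hypothetical graph in which every page has at least one out-link and every link target is itself a key of the graph; `pagerank`, the value j = 0.15, and the graph are illustrative choices, not taken from the slides.

```python
def pagerank(out_links, j=0.15, iterations=100):
    """Iterate R(P) = j/N + (1-j) * sum(R(b)/outdegree(b)) over pages b linking to P.

    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess
    for _ in range(iterations):
        new_rank = {p: j / n for p in pages}  # the j/N term
        for b, targets in out_links.items():
            # Page b splits its current rank evenly among its out-links.
            share = (1 - j) * rank[b] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Compared with plain in-degree, an endorsement from a page is worth more when that page is itself highly ranked and when it has few out-links.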

  6. Random-walk model
     • Imagine the following model of a surfer:
       • Prob. j: jump to a random page
       • Prob. 1-j: follow a random link from the current page
     • A random walk is a Markov process
       • An N-state system with an NxN “transition matrix” T of (indep) transition probabilities
       • Tik is the probability of jumping from state i to state k
       • PageRank: Tik = j/N + (1-j)/outdegree(i) if i links to k, and Tik = j/N otherwise
     • Markov processes have been studied extensively
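The transition matrix of the random surfer can be built directly from the link graph. This is a sketch under stated assumptions: every page has at least one out-link, entries for pairs with no link get only the jump term j/N, and the function name `transition_matrix` and the toy graph are hypothetical.

```python
def transition_matrix(out_links, j=0.15):
    """Build the N x N random-surfer matrix T as a list of rows.

    T[i][k] = j/N + (1-j)/outdegree(i) if page i links to page k,
    and j/N otherwise. Assumes no page has zero out-links.
    """
    pages = sorted(out_links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    T = [[j / n] * n for _ in range(n)]  # random-jump term everywhere
    for page, targets in out_links.items():
        for target in targets:
            T[index[page]][index[target]] += (1 - j) / len(targets)
    return pages, T


# Hypothetical three-page graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages, T = transition_matrix(graph)
for page, row in zip(pages, T):
    print(page, [round(x, 3) for x in row])
```

Each row sums to 1 because the surfer must go somewhere at every step, and the jump term keeps every entry strictly positive.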

  7. Markov processes
     • A sufficient condition for a Markov process to be ergodic:
       • No zeros in the transition matrix
       • ==> Can be in any state at any time step with non-zero probability
     • For ergodic Markov processes, there are unique long-term visit rates for every state, known as stationary or steady-state probabilities, that are independent of the process’ starting state

  8. Ergodic Markov processes
     • Let r be the N-dimensional vector giving the stationary prob’s for an erg. Markov process
     • The following is true: r = rT
     • Thus, r is the principal (left) eigenvector of T, that is, the eigenvector with the largest eigenvalue (which is 1 for a transition matrix)
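A small worked check of r = rT may help. The two-state matrix below is a hypothetical example chosen for easy arithmetic, not one from the slides; its stationary vector is r = (5/6, 1/6).

```latex
\begin{align*}
T &= \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix},
\qquad
r = \begin{pmatrix} \tfrac{5}{6} & \tfrac{1}{6} \end{pmatrix} \\
rT &= \begin{pmatrix}
        \tfrac{5}{6}\cdot 0.9 + \tfrac{1}{6}\cdot 0.5 &
        \tfrac{5}{6}\cdot 0.1 + \tfrac{1}{6}\cdot 0.5
      \end{pmatrix}
    = \begin{pmatrix} \tfrac{5}{6} & \tfrac{1}{6} \end{pmatrix} = r
\end{align*}
```

Multiplying by T leaves r unchanged, so r is a left eigenvector of T with eigenvalue 1.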

  9. Computing stationary prob’s
     • Start with any r0, then:
       • r1 = r0 T, r2 = r1 T (= r0*T^2), etc.
     • Converges rapidly (<== the Web has a good “expansion factor”/is “rapidly mixing” <== outlinks from a small, random set lead to a sufficiently larger set)
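The repeated multiplication r1 = r0 T, r2 = r1 T, ... can be sketched as follows. The 3x3 matrix is a hypothetical example with no zero entries, and the names `step` and `stationary`, the tolerance, and the iteration cap are illustrative assumptions.

```python
def step(r, T):
    """One step of r_next = r T (row vector times matrix)."""
    n = len(r)
    return [sum(r[i] * T[i][k] for i in range(n)) for k in range(n)]


def stationary(T, r0, tol=1e-12, max_iter=1000):
    """Repeat r <- r T until no entry changes by more than tol."""
    r = r0
    for iteration in range(1, max_iter + 1):
        r_next = step(r, T)
        if max(abs(a - b) for a, b in zip(r, r_next)) < tol:
            return r_next, iteration
        r = r_next
    return r, max_iter


# Hypothetical 3-state transition matrix with no zero entries.
T = [
    [0.05, 0.475, 0.475],
    [0.05, 0.05, 0.90],
    [0.90, 0.05, 0.05],
]

from_first, steps_first = stationary(T, [1.0, 0.0, 0.0])
from_uniform, steps_uniform = stationary(T, [1 / 3, 1 / 3, 1 / 3])
print([round(x, 4) for x in from_first], steps_first)
print([round(x, 4) for x in from_uniform], steps_uniform)
```

Both starting vectors end at the same distribution, which illustrates that the stationary probabilities do not depend on the starting state.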

  10. Searching with PageRank
      • Naïve: Find all pages that contain query terms, then rank according to PageRank
      • In reality, combine PageRank with “IR” score:
        • Weight anchor, title, header, and body hits differently (e.g., anchor hits might be weighted heavily due to extra trust)
        • Non-linear “tapering” so no single factor can overpower the others
        • At one point, AltaVista (AV) had almost 100 factors!
        • Tuning very important, very expensive
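To make the combination concrete, here is a toy scoring function with per-field weights and a saturating taper. Everything in it is an assumption for illustration: the weights, the taper curve x/(x + half_point), and the names `taper` and `combined_score` are hypothetical and do not describe any real engine's formula.

```python
def taper(x, half_point):
    """Saturating curve in [0, 1): rises quickly at first, then flattens."""
    return x / (x + half_point)


# Hypothetical weights; real systems tune many more factors.
FIELD_WEIGHTS = {"anchor": 4.0, "title": 3.0, "header": 2.0, "body": 1.0}
PAGERANK_WEIGHT = 3.0


def combined_score(hits, pagerank_relative):
    """Combine tapered per-field hit counts with a tapered link score.

    hits: field name -> number of query-term hits in that field.
    pagerank_relative: PageRank times N, so 1.0 means an average page.
    """
    ir = sum(
        weight * taper(hits.get(field, 0), half_point=2.0)
        for field, weight in FIELD_WEIGHTS.items()
    )
    link = PAGERANK_WEIGHT * taper(pagerank_relative, half_point=1.0)
    return ir + link


body_stuffed = combined_score({"body": 1000}, pagerank_relative=0.1)
well_linked = combined_score({"anchor": 3, "title": 1, "body": 5}, pagerank_relative=5.0)
print(round(body_stuffed, 3), round(well_linked, 3))
```

Because each factor is tapered, repeating a term a thousand times in the body contributes less than its weight of 1.0, so a single inflated factor cannot dominate the total.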

  11. Eigenvectors and ranking
      • Many link-based ranking schemes are based on eigenvector computations
      • Simple variations:
        • Bias jump probabilities by additional notion of “endorsement” (still query independent)
        • Bias link/jump probabilities by topic relevance (query dependence) or by “personal interests” (“personal PageRank”)
      • Next time: “Hubs and Authorities”
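Biasing the jump probabilities amounts to replacing the uniform j/N term with j times a chosen jump distribution. The sketch below assumes a hypothetical four-page graph made of two separate link clusters and a jump distribution that sums to 1; `biased_pagerank` and all the numbers are illustrative.

```python
def biased_pagerank(out_links, jump, j=0.15, iterations=100):
    """PageRank where a random jump lands on page p with probability jump[p].

    Assumes jump sums to 1 and every page has at least one out-link.
    A uniform jump distribution gives ordinary PageRank.
    """
    rank = dict(jump)  # any starting distribution works
    for _ in range(iterations):
        new_rank = {p: j * jump[p] for p in out_links}  # biased jump term
        for b, targets in out_links.items():
            share = (1 - j) * rank[b] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph with two clusters: {A, B} and {C, D}.
graph = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]}

uniform = biased_pagerank(graph, {p: 0.25 for p in graph})
# A user (or topic) that favors page C as a jump target.
personal = biased_pagerank(graph, {"A": 0.1, "B": 0.1, "C": 0.7, "D": 0.1})
print({p: round(v, 3) for p, v in uniform.items()})
print({p: round(v, 3) for p, v in personal.items()})
```

With the biased jumps, rank shifts toward C and the page it links to, while the link structure itself is unchanged.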
