
Presentation Transcript


  1. CS246 Link-Based Ranking

  2. Problems of TFIDF Vector • Works well on a small controlled corpus, but not on the Web • Top result for the “American Airlines” query: an accident report about American Airlines flights • Do users really care how many times “American Airlines” is mentioned? • Easy to spam • Ranking is based purely on page content • Authors can manipulate page content to get a high ranking • Any ideas?

  3. Link-Based Ranking • People “expect” to get the AA home page for the query “American Airlines” • Many pages point to the AA home page, but few point to the accident report • Use link counts!

  4. Simple Link Count • Still easy to spam • Create many pages and have them all link to a target page • How to avoid spam?

  5. PageRank • A page is important if it is pointed to by many important pages • PR(p) = PR(p1)/n1 + … + PR(pk)/nk, where pi is a page pointing to p and ni is the number of out-links of pi • The PageRank of p is the sum of the PageRank shares it receives from its parents • One equation for every page • N equations, N unknown variables

  6. Example: Web of 1842 • Three pages: Netscape (n), Microsoft (m), Amazon (a) • Netscape links to itself and to Amazon; Microsoft links to Amazon; Amazon links to Netscape and Microsoft • PR(n) = PR(n)/2 + PR(a)/2 • PR(m) = PR(a)/2 • PR(a) = PR(n)/2 + PR(m)
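
A short worked solution of this system (not on the slide), assuming the common normalization PR(n) + PR(m) + PR(a) = 1; the slides instead give each page one unit of importance, which only changes the overall scale:

```latex
\begin{align*}
PR(n) &= \tfrac{1}{2}PR(n) + \tfrac{1}{2}PR(a)
  &&\Rightarrow\; PR(n) = PR(a)\\
PR(m) &= \tfrac{1}{2}PR(a)\\
PR(a) &= \tfrac{1}{2}PR(n) + PR(m)
  = \tfrac{1}{2}PR(a) + \tfrac{1}{2}PR(a) = PR(a)
  &&\text{(consistent)}\\
PR(n) + PR(m) + PR(a) &= 1
  &&\Rightarrow\; PR(n) = PR(a) = \tfrac{2}{5},\quad PR(m) = \tfrac{1}{5}
\end{align*}
```

So Netscape and Amazon end up tied for the top rank, and Microsoft gets half as much.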

  7. PageRank: Matrix Notation • Web graph matrix M = { mij } • Each page i corresponds to row i and column i of the matrix M • mij = 1/n if page i is one of the n children of page j; mij = 0 otherwise • PageRank vector p = [PR(1), …, PR(N)]T • PageRank equation: p = M p
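
As a concrete instance of this definition (an illustration, not taken from the slide), here is M for the Web-of-1842 graph of slide 6, with rows and columns ordered as n, m, a. Column j shows how page j splits its importance among its children, and every column sums to 1:

```latex
M = \begin{pmatrix}
      \tfrac{1}{2} & 0 & \tfrac{1}{2}\\[2pt]
      0            & 0 & \tfrac{1}{2}\\[2pt]
      \tfrac{1}{2} & 1 & 0
    \end{pmatrix},
\qquad
\vec{p} = \begin{pmatrix} PR(n)\\ PR(m)\\ PR(a) \end{pmatrix},
\qquad
\vec{p} = M\,\vec{p}.
```

Writing out p = M p row by row recovers exactly the three equations of slide 6.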

  8. PageRank: Iterative Computation • Initially every page has a unit of importance • At each round, each page shares its importance among its children and receives new importance from its parents • Eventually the importance of each page reaches a limit • M is a stochastic matrix: each column sums to 1, so the total importance is preserved at each round
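
A minimal Python sketch of this iteration (an illustration, not the course's code), using the example matrix M above and a start vector normalized to sum to 1:

```python
import numpy as np

# Column-stochastic web graph matrix for the Web-of-1842 example;
# rows/columns ordered as Netscape, Microsoft, Amazon.
M = np.array([[0.5, 0.0, 0.5],
              [0.0, 0.0, 0.5],
              [0.5, 1.0, 0.0]])

def pagerank(M, iters=100, tol=1e-9):
    n = M.shape[0]
    p = np.full(n, 1.0 / n)      # every page starts with equal importance
    for _ in range(iters):
        p_next = M @ p           # each page shares importance with its children
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

print(pagerank(M))               # -> approximately [0.4, 0.2, 0.4]
```

The fixed point [0.4, 0.2, 0.4] matches the solution of the equation system on slide 6 (up to normalization).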

  9. Example: Web of 1842 (figure: iterative PageRank computation for Netscape, Microsoft, Amazon)

  10. PageRank: Eigenvector • PageRank equation: p = M p • p is the principal eigenvector of M

  11. PageRank: Random Surfer Model • The PageRank of a page is the probability that a Web surfer reaches the page after many clicks, following links at random
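
The random-surfer reading can be checked with a small simulation (an illustration, not from the slides): follow a random out-link on the Web-of-1842 graph for many clicks and count visit frequencies; they approach the PageRank values.

```python
import random

# Out-links in the Web-of-1842 example (assumed from the earlier equations).
links = {"Ne": ["Ne", "Am"], "MS": ["Am"], "Am": ["Ne", "MS"]}

def random_surfer(links, clicks=100_000, start="Ne"):
    visits = {page: 0 for page in links}
    page = start
    for _ in range(clicks):
        page = random.choice(links[page])   # follow a random out-link
        visits[page] += 1
    return {p: c / clicks for p, c in visits.items()}

print(random_surfer(links))   # roughly {'Ne': 0.4, 'MS': 0.2, 'Am': 0.4}
```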

  12. Problems on the Real Web • Dead end • A page with no out-links, so it has nowhere to send its importance • All importance eventually “leaks out of” the Web • Crawler trap • A group of one or more pages with no links out of the group • Accumulates all the importance of the Web

  13. Example: Dead End • No outgoing links from Microsoft, so it is a dead end (figure: Netscape/Microsoft/Amazon graph)

  14. Example: Dead End (figure: iterative computation on the graph with the dead end)

  15. Solution to Dead End • Assume the surfer jumps to a random page when it reaches a dead end (figure: Netscape/Microsoft/Amazon graph)
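
One way to implement this fix, as a sketch rather than the course's exact recipe: treat every all-zero column of M (a dead end) as if it linked to every page, i.e. replace it with a uniform 1/N column.

```python
import numpy as np

def fix_dead_ends(M):
    """Replace all-zero columns (dead ends) with a uniform 1/N column."""
    M = M.copy()
    n = M.shape[0]
    dead = (M.sum(axis=0) == 0)   # columns that distribute nothing
    M[:, dead] = 1.0 / n          # the dead end jumps to a random page
    return M

# Example: Microsoft has no out-links, so its column is all zeros.
M_dead = np.array([[0.5, 0.0, 0.5],
                   [0.0, 0.0, 0.5],
                   [0.5, 0.0, 0.0]])
print(fix_dead_ends(M_dead))
```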

  16. Example: Crawler Trap • Microsoft has only a self-link, so it forms a crawler trap (figure: Netscape/Microsoft/Amazon graph)

  17. Example: Crawler Trap (figure: iterative computation on the graph with the crawler trap)

  18. Crawler Trap: Damping Factor • “Tax” each page some fraction of its importance and distribute it equally over all pages • The taxed fraction is the probability of jumping to a random page • Assuming a 20% tax and importance normalized to sum to 1: p = 0.8 · M p + 0.2 · [1/N, …, 1/N]T, so every page receives an equal share of the taxed importance
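
A minimal Python sketch of the taxed iteration under the same normalization as before (d = 0.2 is the 20% tax, i.e. the probability of jumping to a random page):

```python
import numpy as np

def pagerank_damped(M, d=0.2, iters=100, tol=1e-9):
    """Power iteration with a damping factor: each step, (1 - d) of the
    importance flows along links and d is spread uniformly over all pages."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p_next = (1.0 - d) * (M @ p) + d / n
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p

# Crawler-trap example: Microsoft only links to itself, but the tax keeps
# it from absorbing all the importance.
M_trap = np.array([[0.5, 0.0, 0.5],
                   [0.0, 1.0, 0.5],
                   [0.5, 0.0, 0.0]])
print(pagerank_damped(M_trap))
```

Without the tax (d = 0), Microsoft would eventually absorb all the importance in this graph.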

  19. Link Spam Problem • Q: What if a spammer creates a lot of pages and links them all to a single spam page? • PageRank is better than a simple link count, but still vulnerable to link spam • Q: Any way to avoid link spam?

  20. TrustRank [Gyongyi et al. 2004] • Good pages don’t point to spam pages • Trust a page only if it is linked to by pages you already trust • Same as PageRank, except that the random-jump probability mass goes only to a set of trusted seed pages
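
A sketch of that modification (assuming the standard TrustRank formulation; the slide only states the idea): the random-jump mass is spread over a hand-picked set of trusted seed pages instead of over all pages.

```python
import numpy as np

def trustrank(M, trusted, d=0.2, iters=100):
    """Like damped PageRank, but the random jump lands only on trusted pages.
    `trusted` is a list of page indices chosen by hand (the seed set)."""
    n = M.shape[0]
    t = np.zeros(n)
    t[trusted] = 1.0 / len(trusted)      # trust vector: uniform over the seeds
    p = t.copy()
    for _ in range(iters):
        p = (1.0 - d) * (M @ p) + d * t  # jump term biased toward trusted pages
    return p
```

A group of pages with no trusted seed and no in-links from outside gets a score of zero, which is exactly the property slide 23 states.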

  21. TrustRank: Theory [Bianchini et al. 2005] • Consider a set of pages S (figure: the set S together with IN(S), OUT(S), and DP(S))

  22. TrustRank: Theory [Bianchini et al. 2005]

  23. What Does It Mean? • P_S = 0 if B_S = 0 and P_IN = 0: the total rank of S is zero if S receives no trusted random-jump bias (B_S) and no rank flows in from outside (P_IN) • You cannot improve your TrustRank simply by creating more pages and linking among them • To get a non-zero TrustRank, you need to either be trusted or get links from outside

  24. Is TrustRank the Ultimate Solution? • Not really… • Honeypot: a page with good content but hidden links to spam pages • Good users link to the honeypot because of its quality content • Blogs, forums, wikis, mailing lists • Easy to add spam links to them • Link exchange • A set of sites exchanging links to boost each other’s ranking • A never-ending rat race…

  25. Anti-Spamming at Search Engines • Anchor text • Consider what others think about your page • Give higher weight to anchors from high-PageRank pages • More difficult to spam • TrustRank • To gain importance, you need to convince many pages under others’ control, or convince the search engine itself • More difficult to spam • Give inter-site links higher weight than intra-site links

  26. Hub and Authority • A more detailed evaluation of importance • A page is useful if • it has good content, or • it has links to useful pages (a good bookmark list) • Hub/Authority • Authority: a page with good content • Hub: a page pointing to good content pages

  27. Hub/Authority: Definition • Recursive definition, similar to PageRank • Authority pages are linked to by many hub pages • Hub pages link to many authority pages • H(p) = A(p1) + … + A(pk), where p1 … pk are the pages p links to • A(p) = H(p1) + … + H(pm), where p1 … pm are the pages linking to p

  28. Hub/Authority: Matrix Notation • Web graph matrix A = { aij } • Each page i corresponds to row i and column i of the matrix A • aij = 1 if page i points to page j; aij = 0 otherwise • A is not a stochastic matrix • AT is similar to the PageRank matrix M, but without the stochastic (column-normalization) restriction

  29. Example: Web of 1842 • Hub and authority vectors indexed by [n, m, a] (Netscape, Microsoft, Amazon)
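
For concreteness (an illustration, assuming the same link structure as the PageRank example on slide 6), the link matrix A of slide 28 for this graph is:

```latex
A = \begin{pmatrix}
      1 & 0 & 1\\
      0 & 0 & 1\\
      1 & 1 & 0
    \end{pmatrix}
\qquad \text{(rows and columns ordered as } n,\, m,\, a\text{)}
```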

  30. Hub/Authority: Iterative Computation • Hub vector h and authority vector a • h = λ A a, where λ is a scaling factor that keeps the values from diverging • a = μ AT h, where μ is a scaling factor that keeps the values from diverging • Compute h and a iteratively, rescaling at each step
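
A minimal Python sketch of this iteration (an illustration, not the course's code); normalizing each vector at every step plays the role of the divergence scaling factors λ and μ:

```python
import numpy as np

# Link matrix A for the Web-of-1842 example: A[i, j] = 1 if page i points to page j.
A = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

def hits(A, iters=100):
    n = A.shape[0]
    h = np.ones(n)          # hub scores
    a = np.ones(n)          # authority scores
    for _ in range(iters):
        a = A.T @ h         # authority: sum of hub scores of the pages pointing in
        h = A @ a           # hub: sum of authority scores of the pages pointed to
        a /= a.sum()        # rescale so the values do not diverge
        h /= h.sum()
    return h, a

print(hits(A))
```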

  31. Hub/Authority: Eigenvector • h is the principal eigenvector of A AT • a is the principal eigenvector of AT A

  32. Example: Web of 1842 (figure: hub and authority scores for Netscape, Microsoft, Amazon)

  33. Hub/Authority and Root Set • Apply the equations to a small neighborhood graph (the base set) • Start with, say, 100 pages on “bicycling” (the root set) • Add the pages pointing to those 100 pages • Add the pages that the 100 pages point to • The identified pages are good “hubs” and “authorities” on “bicycling”
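
A sketch of how the base set might be assembled; pages_linking_to and pages_linked_from are hypothetical helpers standing in for whatever link index is actually available:

```python
def build_base_set(root_set, pages_linking_to, pages_linked_from):
    """Expand a topic root set (e.g. ~100 pages on 'bicycling') into the
    base set on which the Hub/Authority equations are run."""
    base = set(root_set)
    for page in root_set:
        base.update(pages_linking_to(page))    # pages pointing to the root set
        base.update(pages_linked_from(page))   # pages the root set points to
    return base
```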

  34. Hub/Authority and Web Community • Hub/Authority is often used to identify Web communities • It gives a nice notion of the “hubs” and “authorities” of a community • The hubs and authorities of a community are often tightly linked to each other

  35. Any Questions?

  36. Questions • Can we apply Hub/Authority to the entire Web like PageRank?

  37. Hub/Authority on the Entire Web? • Hub/Authority works well on a topic-specific subset, but poorly on the whole Web • Easy to spam • Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.); the page becomes a good hub page • Then add a link from that page to your home page

  38. Questions • Can we apply PageRank to a small base set?

  39. PageRank on a Small Subset • In general, PageRank works better on larger datasets • We may be able to compute a “topic-specific” PageRank • Any other way to get a “topic-specific” PageRank?

  40. Summary: Link-Based Ranking • PageRank • TrustRank (a variation of PageRank) • Hub/Authority
