Combating Web Spam with TrustRank - PowerPoint PPT Presentation

Presentation Transcript

  1. Combating Web Spam with TrustRank
     Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen
     Presented by: Mahek Jasani (USC)

  2. Web Spam
     Web spam refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines.

  3. How is it done?
     • Adding thousands of keywords to a page, often making the text invisible to humans
     • Creating a large number of bogus web pages

  4. Web Spam Detection
     Web spam detection is important, not only for search engines but also for users and content providers.

  5. Goal of this Paper
     • The goal of this paper is to assist human experts who detect web spam.
     • The methods presented in this paper can be used either:
       • as helpers in an initial screening process, suggesting pages that should be examined more closely by an expert, or
       • as a counter-bias to be applied when results are ranked.

  6. Preliminaries
     • Web model: graph G = (V, E)
     • V corresponds to web pages
     • E corresponds to the directed links that connect web pages
     [Figure: example web graph with pages 1 to 4]

  7. Web Model
     • Unreferenced pages (no inlinks) and non-referencing pages (no outlinks)
     • Transition matrix T
     • Inverse transition matrix U
     [Figure: example web graph with pages 1 to 4]
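
The two matrices are only named on this slide. As a minimal sketch, assuming the usual definitions from the TrustRank paper (T(p, q) = 1/ω(q) if q links to p, and U(p, q) = 1/ι(q) if p links to q, where ω is the out-degree and ι the in-degree, and 0 otherwise), they can be built as follows. The four-page edge list is an illustrative assumption, not necessarily the graph drawn on the slide.

```python
from fractions import Fraction

def transition_matrices(n, edges):
    """Build T and U for a graph with pages 0..n-1 and directed edges (q, p)."""
    out_deg = [0] * n
    in_deg = [0] * n
    for q, p in edges:
        out_deg[q] += 1
        in_deg[p] += 1
    T = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for q, p in edges:
        T[p][q] = Fraction(1, out_deg[q])  # T(p, q) = 1 / out-degree of q
        U[q][p] = Fraction(1, in_deg[p])   # U(q, p) = 1 / in-degree of p
    return T, U

# Assumed example: pages 1..4 stored as indices 0..3, with links 1->2, 2->3, 3->2, 3->4.
# Page 1 is unreferenced (no inlinks); page 4 is non-referencing (no outlinks).
edges = [(0, 1), (1, 2), (2, 1), (2, 3)]
T, U = transition_matrices(4, edges)
```

In this example the row of T for page 1 is all zeros because nothing links to it, and the row of U for page 4 is all zeros because it links to nothing.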

  8. PageRank
     • The PageRank algorithm uses link information to assign global importance scores to all pages on the web.
     • Two main factors affect PageRank:
       • The intuition is that a web page is important if several other pages point to it.
       • The importance of a page influences, and is influenced by, the importance of other pages.

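  9. PageRank
     • Thus the PageRank score r(p) of page p is defined as: $r(p) = \alpha \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \cdot \frac{1}{N}$
     • The equivalent matrix equation form is: $\mathbf{r} = \alpha \cdot T \cdot \mathbf{r} + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$
     • $\alpha$ is the decay factor, $\omega(q)$ is the out-degree of q, and N is the total number of pages
     • The first part comes from the pages that point to p
     • The other part is static and equal for all web pages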
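
A short sketch of how such a score could be computed by repeated substitution: each page receives a share α · r(q)/ω(q) from every page q that links to it, plus the static part (1 − α)/N. The function name, the default α = 0.85, the iteration count, and the example graphs are assumptions made for illustration.

```python
def pagerank(n, edges, alpha=0.85, iterations=50):
    """Iterate the PageRank equation on pages 0..n-1 with directed edges (q, p)."""
    out_deg = [0] * n
    for q, _ in edges:
        out_deg[q] += 1
    r = [1.0 / n] * n                      # start from a uniform score
    for _ in range(iterations):
        nxt = [(1 - alpha) / n] * n        # static part, equal for all pages
        for q, p in edges:
            nxt[p] += alpha * r[q] / out_deg[q]  # part coming from pages that point to p
        r = nxt
    return r

# Assumed examples: a 3-cycle, and a graph where page 2 is linked from pages 0 and 1.
cycle_scores = pagerank(3, [(0, 1), (1, 2), (2, 0)])
hub_scores = pagerank(3, [(0, 2), (1, 2), (2, 0)])
```

In the cycle every page ends up with the same score; in the second graph page 2, which has two inlinks, scores highest and page 1, which has none, keeps only the static part.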

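  10. Assessing Trust: Oracle Function
      • Determining whether a page is spam is subjective and requires human evaluation.
      • The oracle function models a human checking a page for spam. It is written O(p), with O(p) = 1 for a good page and O(p) = 0 for a bad (spam) page.
      • Example: [figure]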

  11. Trust Function
      • Intuition:
        • Good pages seldom point to bad ones (approximate isolation).
        • However, good pages may get tricked into pointing to bad ones.
      • Trust function T(p): the probability that a given page p is good.

  12. Trust Function
      • Ideal Trust Property
      • But, in practice, it is difficult to achieve
      • Ordered Trust Property
      • Another way is the Threshold Trust Property, which uses a threshold value δ
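
The formulas for these three properties did not survive in the transcript. As recalled from the TrustRank paper (the notation here is an assumption, with O the oracle function and δ the threshold value), they can be written as:

```latex
\begin{align*}
\text{Ideal:}     \quad & T(p) = \Pr[O(p) = 1] \\
\text{Ordered:}   \quad & T(p) < T(q) \iff \Pr[O(p) = 1] < \Pr[O(q) = 1], \\
                        & T(p) = T(q) \iff \Pr[O(p) = 1] = \Pr[O(q) = 1] \\
\text{Threshold:} \quad & T(p) > \delta \iff O(p) = 1
\end{align*}
```

The ordered property only asks that T rank pages in the same order as their probability of being good, which is weaker than the ideal property and easier to approximate.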

  13. Evaluation Metrics
      • A binary function I(T, O, p, q) is introduced to signal whether a bad page received an equal or higher trust score than a good page.
      1) Pairwise Orderedness
      • If pairord equals 1, there are no cases in which T misrated a pair. Conversely, if pairord equals 0, then T misrated all the pairs.
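
A minimal sketch of this metric, assuming pairord is the fraction of page pairs that are not misrated and that all unordered pairs of the given pages are evaluated (the paper allows a sample of pairs). The function names and the three-page example are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

def violation(t, o, p, q):
    """I(T, O, p, q): 1 if the pair is misrated, 0 otherwise."""
    if t[p] >= t[q] and o[p] < o[q]:
        return 1  # bad page p scored at least as high as good page q
    if t[p] <= t[q] and o[p] > o[q]:
        return 1  # good page p scored no higher than bad page q
    return 0

def pairord(t, o, pages):
    """Fraction of page pairs for which T did not make a mistake."""
    pairs = list(combinations(pages, 2))
    misrated = sum(violation(t, o, p, q) for p, q in pairs)
    return Fraction(len(pairs) - misrated, len(pairs))

# Assumed toy data: pages a and b are good, c is bad; b is scored below c.
oracle = {"a": 1, "b": 1, "c": 0}
trust = {"a": 0.9, "b": 0.4, "c": 0.5}
score = pairord(trust, oracle, ["a", "b", "c"])
```

Here only the pair (b, c) is misrated, so two of the three pairs are ordered correctly. Pairs whose pages have the same oracle value never count as violations.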

  14. Evaluation Metrics
      2) Precision
      • Defined as the fraction of good pages among all pages in X that have a trust score above δ.
      3) Recall
      • Defined as the ratio between the number of good pages with a trust score above δ and the total number of good pages in X.
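
The two definitions translate directly into code. This is a sketch under stated assumptions: "above δ" is read as strictly greater than δ, and the function names and six-page example data are made up for illustration.

```python
from fractions import Fraction

def precision(t, o, pages, delta):
    """Good pages above the threshold, divided by all pages above the threshold."""
    above = [p for p in pages if t[p] > delta]
    if not above:
        raise ValueError("no page has a trust score above delta")
    return Fraction(sum(1 for p in above if o[p] == 1), len(above))

def recall(t, o, pages, delta):
    """Good pages above the threshold, divided by all good pages."""
    good = [p for p in pages if o[p] == 1]
    if not good:
        raise ValueError("no good pages in the sample")
    return Fraction(sum(1 for p in good if t[p] > delta), len(good))

# Assumed toy data: pages 1, 2, 3, 6 are good; pages 4, 5 are bad.
oracle = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 1}
trust = {1: 0.9, 2: 0.8, 3: 0.3, 4: 0.7, 5: 0.1, 6: 0.2}
pages = [1, 2, 3, 4, 5, 6]
prec = precision(trust, oracle, pages, 0.5)
rec = recall(trust, oracle, pages, 0.5)
```

With δ = 0.5, pages 1, 2 and 4 are above the threshold and two of them are good, while only two of the four good pages are above it.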

  15. Computing Trust
      • Ignorant Trust Function
      • Let L = 3, seed set S = {1, 3, 6}
      • Let o and t0 denote the vectors of oracle and trust scores for each page, respectively.
      • Performance in this case: pairwise orderedness 17/21, precision 1, recall 0.5
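
The definition and the score vectors are missing from the transcript. As a sketch, assuming the ignorant trust function gives each seed page its oracle value and every other page a score of 1/2, and assuming a seven-page example in which pages 1 to 4 are good and pages 5 to 7 are bad, the performance figures quoted on this slide can be reproduced. The threshold δ = 1/2 is also an assumption.

```python
from fractions import Fraction
from itertools import combinations

def ignorant_trust(pages, seeds, oracle):
    """Seeds get their oracle value; all other pages get the 'ignorant' score 1/2."""
    half = Fraction(1, 2)
    return {p: Fraction(oracle[p]) if p in seeds else half for p in pages}

pages = range(1, 8)
oracle = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}  # assumed oracle vector o
seeds = {1, 3, 6}                                     # L = 3 oracle invocations
t0 = ignorant_trust(pages, seeds, oracle)

def misrated(p, q):
    """True if the pair (p, q) is ordered wrongly by t0 relative to the oracle."""
    return ((t0[p] >= t0[q] and oracle[p] < oracle[q]) or
            (t0[p] <= t0[q] and oracle[p] > oracle[q]))

pairs = list(combinations(pages, 2))                  # all 21 unordered pairs
pairord = Fraction(sum(not misrated(p, q) for p, q in pairs), len(pairs))

delta = Fraction(1, 2)                                # assumed threshold
above = [p for p in pages if t0[p] > delta]
good_above = [p for p in above if oracle[p] == 1]
precision = Fraction(len(good_above), len(above))
recall = Fraction(len(good_above), sum(oracle.values()))
```

Under these assumptions t0 is (1, 1/2, 1, 1/2, 1/2, 0, 1/2): the four misrated pairs are the unrated good pages 2 and 4 against the unrated bad pages 5 and 7.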

  16. Computing Trust
      • M-Step Trust Function
      • Results for different values of M: performance decreases for values of M > 2.
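
A sketch of how an M-step trust function could work, assuming the definition from the paper: seeds keep their oracle value, any other page reachable from a good seed within M links (without passing through a bad seed) gets trust 1, and everything else gets 1/2. The link structure below is a hypothetical seven-page graph, not necessarily the one on the slide.

```python
from collections import deque
from fractions import Fraction

def m_step_trust(out_links, seeds, oracle, m):
    """Breadth-first propagation of trust from good seeds, at most m links deep."""
    trust = {p: Fraction(1, 2) for p in out_links}
    for s in seeds:
        trust[s] = Fraction(oracle[s])
    bad_seeds = {s for s in seeds if oracle[s] == 0}
    good_seeds = [s for s in seeds if oracle[s] == 1]
    dist = {s: 0 for s in good_seeds}
    queue = deque(good_seeds)
    while queue:
        q = queue.popleft()
        if dist[q] == m:
            continue                      # do not follow links beyond m steps
        for p in out_links[q]:
            if p in dist or p in bad_seeds:
                continue                  # already visited, or a known bad page
            dist[p] = dist[q] + 1
            trust[p] = Fraction(1)
            queue.append(p)
    return trust

# Hypothetical graph: pages 1-4 good, 5-7 bad, with one good-to-bad link 4 -> 5.
out_links = {1: [2], 2: [3, 4], 3: [2], 4: [5], 5: [6, 7], 6: [], 7: []}
oracle = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}
seeds = {1, 3, 6}
t2 = m_step_trust(out_links, seeds, oracle, 2)
t3 = m_step_trust(out_links, seeds, oracle, 3)
```

In this graph M = 2 marks exactly the four good pages as trusted, while M = 3 follows the link 4 -> 5 and gives the bad page 5 full trust, which illustrates why performance can drop as M grows.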

  17. Trust Attenuation
      • Trust Dampening
      • Trust Splitting
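
The slide only names the two techniques. As a rough sketch, assuming the descriptions in the paper: with dampening, a page d links away from a good seed receives trust β^d for some β < 1; with splitting, a page divides its trust equally among the pages it links to, and a page's score is the sum of the fractions it receives. The function names, the value of β, and the example data are illustrative assumptions.

```python
def dampened_trust(distance, beta=0.85):
    """Trust for a page `distance` links away from a good seed (beta < 1)."""
    return beta ** distance

def split_trust(out_links, trust):
    """One round of splitting: each page passes trust / out-degree along each outlink."""
    received = {p: 0.0 for p in out_links}
    for q, targets in out_links.items():
        for p in targets:
            received[p] += trust[q] / len(targets)
    return received

# Assumed example: a links to b and c, b links to c.
out_links = {"a": ["b", "c"], "b": ["c"], "c": []}
received = split_trust(out_links, {"a": 1.0, "b": 0.5, "c": 0.0})
```

Here page c receives 0.5 from a (half of a's trust) and 0.5 from b (all of b's trust), while b receives only a's other half.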

  18. Selecting Seeds
      • Two heuristics:
        • Inverse PageRank: select pages with the maximum number of outlinks
        • High PageRank: give preference to pages with high PageRank
      • Example with L = 2: S = {1, 3} or S = {2, 3} would be desirable; inverse PageRank selects S = {1, 2}
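
One way to sketch the inverse PageRank heuristic, assuming it is computed like PageRank on the graph with all links reversed (that is, with the inverse transition matrix U in place of T), so that pages from which many others can be reached score highly. The parameter values, function names, and the star-shaped example graph are assumptions.

```python
def inverse_pagerank(n, edges, alpha=0.85, iterations=20):
    """PageRank-style iteration over reversed links; edges are (q, p) for q -> p."""
    in_deg = [0] * n
    for _, p in edges:
        in_deg[p] += 1
    s = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1 - alpha) / n] * n
        for q, p in edges:
            nxt[q] += alpha * s[p] / in_deg[p]  # score flows backwards, from p to q
        s = nxt
    return s

def select_seeds(scores, limit):
    """Indices of the `limit` pages with the highest scores."""
    return sorted(range(len(scores)), key=lambda p: -scores[p])[:limit]

# Assumed example: page 0 links to pages 1, 2 and 3.
scores = inverse_pagerank(4, [(0, 1), (0, 2), (0, 3)])
top = select_seeds(scores, 1)
```

Page 0 is selected because it is the only page with outlinks; the other three pages keep only the static part of the score.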

  19. TrustRank Algorithm
      • Input: web graph, transition matrix T, and other parameters (L = 3, α_B = 0.85, M_B = 20)
      • Select seeds
      • Rank seeds
      • Invoke the oracle function L times (assign 1 to good seeds)
      • Normalize the result of the oracle function
      • Evaluate the biased PageRank equation, with the normalized result as the static score distribution vector
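
The steps above can be sketched as follows, assuming the biased PageRank equation has the form t* = α_B · T · t* + (1 − α_B) · d, where d is the normalized static score distribution vector. The seed ordering is taken as an input here (for example, pages sorted by inverse PageRank), and the graph, seed order, and oracle are hypothetical.

```python
def trustrank(n, edges, seed_order, oracle, limit, alpha_b=0.85, m_b=20):
    """Propagate trust from the good pages among the first `limit` ranked seeds."""
    out_deg = [0] * n
    for q, _ in edges:
        out_deg[q] += 1
    d = [0.0] * n
    for p in seed_order[:limit]:           # invoke the oracle L times
        if oracle(p) == 1:
            d[p] = 1.0                     # assign 1 to good seeds
    total = sum(d)
    if total == 0:
        raise ValueError("no good seed among the evaluated pages")
    d = [x / total for x in d]             # normalize so the entries sum to 1
    t = d[:]
    for _ in range(m_b):                   # M_B iterations of biased PageRank
        nxt = [(1 - alpha_b) * x for x in d]
        for q, p in edges:
            nxt[p] += alpha_b * t[q] / out_deg[q]
        t = nxt
    return t

# Hypothetical graph: pages 0-3 good, 4-6 bad, one good-to-bad link 3 -> 4.
edges = [(0, 1), (1, 2), (1, 3), (2, 1), (3, 4), (4, 5), (4, 6)]
good = {0, 1, 2, 3}
scores = trustrank(7, edges, seed_order=[1, 4, 3, 0, 2, 5, 6],
                   oracle=lambda p: 1 if p in good else 0, limit=3)
```

In this example the oracle is asked about pages 1, 4 and 3, and only pages 1 and 3 become trusted seeds. Page 0 is good but ends with a score of 0 because it is not a seed and nothing links to it, while the bad pages 4 to 6 still receive some trust through the link 3 -> 4, with less at each further step.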

  20. Experiments
      • Data set of 31,003,946 sites
      • Seeds were selected using the inverse PageRank algorithm
      • The top 1,250 seeds were manually evaluated
      • Of those, 178 sites turned out to be good seeds

  21. Results

  22. Results

  23. Results
      • Pairwise Orderedness

  24. CONCLUSION • Guess what? QUESTIONS?