Combating Web Spam with TrustRank



**Combating Web Spam with TrustRank**
Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen
Presented by: Mahek Jasani (USC)

**Web Spam**
Web spam refers to hyperlinked pages on the WWW that are created with the intention of misleading search engines.

**How is it done?**
• Adding thousands of keywords to a page, often making the text invisible to humans
• Creating a large number of bogus web pages

**Web Spam Detection**
Web spam detection is important, not only for search engines but also for users and content providers.

**Goal of this Paper**
• The goal of this paper is to assist human experts who detect web spam.
• The methods presented in this paper can be used either:
  • as helpers in an initial scanning process, suggesting pages that should be examined more closely by an expert, or
  • as a counter-bias to be applied when the results are ranked.

**Preliminaries**
• Web model: graph G = (V, E)
• V corresponds to web pages
• E corresponds to the directed links that connect web pages
[Figure: example web graph with four pages, numbered 1 to 4]

**Web Model**
• Transition matrix T
• Inverse transition matrix U
[Figure: the four-page example graph, with one page labeled "unreferenced" and one labeled "non-referencing"]

**PageRank**
• The PageRank algorithm uses link information to assign global importance scores to all pages on the web.
• Two main ideas underlie PageRank:
  • A web page is important if several other pages point to it.
  • The importance of a page influences, and is influenced by, the importance of other pages.

**PageRank**
• The PageRank score r(p) of a page p is defined as:

$$r(p) = \alpha \cdot \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \cdot \frac{1}{N}$$

• The equivalent matrix equation is:

$$\mathbf{r} = \alpha \cdot T \cdot \mathbf{r} + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N$$

• $\alpha$ is the decay factor, $\omega(q)$ is the out-degree of q, and N is the total number of pages.
• The first term comes from the pages that point to p.
• The second term is static and equal for all web pages.

**Assessing Trust**
Oracle function
• Determining whether a page is spam is subjective and requires human evaluation.
• The oracle function models a human expert checking a page for spam.
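To make the web model and PageRank definitions above concrete, the following is a minimal pure-Python sketch. It builds the transition matrix T and the inverse transition matrix U from an edge list and iterates the PageRank matrix equation. The four-page edge list, the function names, the 0-based page numbering, and the fixed iteration count are illustrative assumptions; the edges of the slide's example graph are not preserved in this transcript. U is assumed to be the transition matrix of the graph with all links reversed.

```python
def transition_matrix(n, edges):
    """T[p][q] = 1/outdeg(q) if page q links to page p, else 0.

    Pages are numbered 0..n-1 and edges are unique (source, target) pairs.
    """
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    T = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        T[dst][src] = 1.0 / outdeg[src]
    return T


def inverse_transition_matrix(n, edges):
    """U is the transition matrix of the graph with every link reversed."""
    return transition_matrix(n, [(dst, src) for src, dst in edges])


def pagerank(T, alpha=0.85, iterations=50):
    """Iterate r = alpha * T * r + (1 - alpha) * (1/N) * 1_N from a uniform start."""
    n = len(T)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [alpha * sum(T[p][q] * r[q] for q in range(n)) + (1 - alpha) / n
             for p in range(n)]
    return r


# Hypothetical four-page graph (not the one from the slide):
# page 0 is unreferenced (no inlinks), page 3 is non-referencing (no outlinks).
EDGES = [(0, 1), (1, 2), (2, 1), (2, 3)]
T = transition_matrix(4, EDGES)
U = inverse_transition_matrix(4, EDGES)
for page, score in enumerate(pagerank(T)):
    print(f"page {page}: PageRank {score:.4f}")
```

Because page 3 has no outlinks, its column in T is all zeros, so the scores need not sum to 1. The sketch follows the formula as stated and does not add a correction for non-referencing pages.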
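[Example figure for the oracle function]

**Trust Function**
Intuition:
• Good pages seldom point to bad ones (approximate isolation of the good set).
• However, good pages may get tricked into pointing to bad ones.
• Trust function T(p): the probability that a given page p is good.

**Trust Function**
• Ideal Trust Property: $T(p) = \Pr[O(p) = 1]$
• In practice, this is difficult to achieve.
• Ordered Trust Property: $T(p) < T(q) \Leftrightarrow \Pr[O(p) = 1] < \Pr[O(q) = 1]$ and $T(p) = T(q) \Leftrightarrow \Pr[O(p) = 1] = \Pr[O(q) = 1]$
• Threshold Trust Property: $T(p) > \delta \Leftrightarrow O(p) = 1$, where $\delta$ is the threshold value.

**Evaluation Metrics**
• A binary function I(T, O, p, q) is introduced to signal whether a bad page received an equal or higher trust score than a good page.

1) Pairwise Orderedness
• For a set P of page pairs, pairwise orderedness is the fraction of pairs that T did not misrate:

$$\mathrm{pairord}(T, O, P) = \frac{|P| - \sum_{(p,q) \in P} I(T, O, p, q)}{|P|}$$

• If pairord equals 1, there are no cases in which T misrated a pair. Conversely, if pairord equals 0, then T misrated all the pairs.

**Evaluation Metrics**
2) Precision
• The fraction of good pages among all pages in X that have a trust score above $\delta$:

$$\mathrm{prec}(T, O) = \frac{|\{p \in X \mid T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X \mid T(q) > \delta\}|}$$

3) Recall
• The ratio between the number of good pages with a trust score above $\delta$ and the total number of good pages in X:

$$\mathrm{rec}(T, O) = \frac{|\{p \in X \mid T(p) > \delta \text{ and } O(p) = 1\}|}{|\{q \in X \mid O(q) = 1\}|}$$

**Computing Trust**
• Ignorant trust function
• Let L = 3 and the seed set S = {1, 3, 6}.
• Let $\mathbf{o}$ and $\mathbf{t}_0$ denote the vectors of oracle and trust scores for each page, respectively.
• Performance for this example:
  • Pairwise orderedness: 17/21
  • Precision: 1
  • Recall: 0.5

**Computing Trust**
• M-step trust function
• Evaluated using different values of M
• Performance decreases for values of M > 2.

**Trust Attenuation**
• Trust dampening: trust decreases with each link step away from the seed set.
• Trust splitting: the trust of a page is divided among the pages it links to.

**Selecting Seeds**
Two heuristics:
• Inverse PageRank: select pages with the maximum number of outlinks.
• High PageRank: give preference to pages with high PageRank.

[Figure: seed selection example, asking which seeds to choose when L = 2]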
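The evaluation metrics and the ignorant trust example from the Evaluation Metrics and Computing Trust slides above can be checked numerically. The sketch below is one way to do that under stated assumptions: the oracle vector (pages 1 to 4 good, pages 5 to 7 bad) comes from the paper's seven-page running example and is not preserved in this transcript; the ignorant trust function is assumed to give seed pages their oracle value and every other page 1/2; the threshold is assumed to be 1/2; and pairord is taken over all unordered page pairs rather than a sample. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def ignorant_trust(oracle, seeds):
    """Seed pages get their oracle value; every other page gets 1/2 (pages are 1-indexed)."""
    return [Fraction(o) if page in seeds else Fraction(1, 2)
            for page, o in enumerate(oracle, start=1)]


def pairord(trust, oracle):
    """Fraction of page pairs that the trust function does not misrate."""
    pairs = list(combinations(range(len(trust)), 2))
    violations = 0
    for p, q in pairs:
        if oracle[p] == oracle[q]:
            continue  # two good or two bad pages cannot be misordered
        good, bad = (p, q) if oracle[p] > oracle[q] else (q, p)
        if trust[bad] >= trust[good]:
            violations += 1  # a bad page scored at least as high as a good one
    return Fraction(len(pairs) - violations, len(pairs))


def precision(trust, oracle, delta):
    """Good pages among those scoring above delta (ZeroDivisionError if none do)."""
    above = [p for p in range(len(trust)) if trust[p] > delta]
    return Fraction(sum(oracle[p] for p in above), len(above))


def recall(trust, oracle, delta):
    """Good pages scoring above delta, relative to all good pages."""
    above = [p for p in range(len(trust)) if trust[p] > delta]
    return Fraction(sum(oracle[p] for p in above), sum(oracle))


# Assumed oracle vector: pages 1-4 good, pages 5-7 bad.
o = [1, 1, 1, 1, 0, 0, 0]
t0 = ignorant_trust(o, seeds={1, 3, 6})
delta = Fraction(1, 2)
print(pairord(t0, o), precision(t0, o, delta), recall(t0, o, delta))
# prints: 17/21 1 1/2
```

Under these assumptions the output matches the values on the slide. The four misrated pairs are the unevaluated good pages 2 and 4 against the unevaluated bad pages 5 and 7, which all share the score 1/2.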
Seed sets for the seed selection example:
• S = {1, 3} or S = {2, 3} (desirable)
• S = {1, 2} (selected by Inverse PageRank)

**TrustRank Algorithm**
• Input: web graph, transition matrix T, and other parameters, for example L = 3 (oracle invocations), $\alpha_B = 0.85$ (decay factor), $M_B = 20$ (iterations)
• Select seeds
• Rank seeds
• Invoke the oracle function L times (assign 1 to good seeds)
• Normalize the result of the oracle function to obtain the static score distribution vector $\mathbf{d}$
• Biased PageRank equation:

$$\mathbf{t}^* = \alpha_B \cdot T \cdot \mathbf{t}^* + (1 - \alpha_B) \cdot \mathbf{d}$$

**Experiments**
• Data set of 31,003,946 sites
• Seeds were selected using the Inverse PageRank algorithm.
• The top 1,250 seed candidates were manually evaluated.
• Of those, 178 sites turned out to be good seeds.

**Results**
• Pairwise Orderedness

**CONCLUSION**
• Guess what?

QUESTIONS?
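The steps of the TrustRank Algorithm slide can be sketched in pure Python as follows. This is an illustrative sketch, not the authors' implementation. Assumptions: the oracle is a hypothetical callable returning 1 for good pages and 0 for bad ones; seed selection runs inverse PageRank with the same decay factor and iteration count as the final biased PageRank; pages are 0-indexed; and the five-page graph is made up for the example.

```python
def column_matrix(n, edges):
    """M[p][q] = 1/outdeg(q) if q links to p, else 0 (pages 0..n-1, unique edges)."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    M = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        M[dst][src] = 1.0 / outdeg[src]
    return M


def biased_pagerank(M, d, alpha, iterations):
    """Iterate s = alpha * M * s + (1 - alpha) * d, starting from s = d."""
    n = len(M)
    s = list(d)
    for _ in range(iterations):
        s = [alpha * sum(M[p][q] * s[q] for q in range(n)) + (1 - alpha) * d[p]
             for p in range(n)]
    return s


def trustrank(n, edges, oracle, L=3, alpha_b=0.85, m_b=20):
    T = column_matrix(n, edges)
    U = column_matrix(n, [(dst, src) for src, dst in edges])  # reversed links
    # Select seeds: inverse PageRank, i.e. PageRank on the reversed graph.
    seed_scores = biased_pagerank(U, [1.0 / n] * n, alpha_b, m_b)
    # Rank seeds: most desirable candidates first.
    ranked = sorted(range(n), key=lambda p: seed_scores[p], reverse=True)
    # Invoke the oracle on the top L candidates; good seeds get score 1.
    d = [0.0] * n
    for p in ranked[:L]:
        if oracle(p) == 1:
            d[p] = 1.0
    # Normalize so that the static score distribution sums to 1.
    total = sum(d)
    if total == 0:
        raise ValueError("no good seed among the top L candidates")
    d = [x / total for x in d]
    # Biased PageRank: trust flows from the good seeds along forward links.
    return biased_pagerank(T, d, alpha_b, m_b)


# Hypothetical graph: pages 0-2 are good, pages 3-4 are spam; spam page 3 links to page 0.
GOOD = {0, 1, 2}
EDGES = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 3), (3, 0)]
scores = trustrank(5, EDGES, lambda p: 1 if p in GOOD else 0, L=3)
for page, score in enumerate(scores):
    print(f"page {page}: trust {score:.4f}")
```

In this example no good page links to a spam page, so pages 3 and 4 receive no trust regardless of which good seeds are chosen. A link from a spam page to a good page does not help the spam page, because trust only propagates along the direction of links.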