Combating Web Spam with TrustRank (Zoltan Gyongyi, Hector Garcia-Molina, Jan Pedersen)

Combating Web Spam with TrustRank (Zoltan Gyongyi, Hector Garcia-Molina, Jan Pedersen)
Jacob Kalakal Joseph
CS 586 (Fall 2011) | Class Presentation | Nov 07, 2011

Outline
• Challenge: Webspam
  • Algorithmic webspam detection is difficult
  • Human experts are slow and expensive
• Solution: TrustRank
  • Intuition
  • Algorithm
• Evaluation
  • Experiments, results and analysis
CS586-Joseph

Webspam

Webspam
• Malicious techniques to achieve better-than-deserved search engine ranks
• AKA: spamdexing, search spam, search engine spam, or search engine poisoning
• Techniques:
  • Content spam (keyword stuffing, hidden text, etc.)
  • Link spam (link farms, honey pots)

Web Model
• Node = Web page
• Edge = Link

Web Model
[Figure: a page with one outlink and two inlinks]

Web Model
• Outdegree = 1
• Indegree = 2

Web Model
• Non-referencing page (no outlinks)
• Unreferenced page (no inlinks)
• Isolated page (neither inlinks nor outlinks)

Simplifications: Collapse

Simplifications: Discard

Transition Matrix and Inverse Transition Matrix
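The matrices shown on this slide did not survive extraction. The following is a minimal sketch, assuming the definitions used in the TrustRank paper: the transition matrix entry T[p][q] is 1/outdegree(q) when q links to p, and the inverse transition matrix entry U[p][q] is 1/indegree(q) when p links to q (both 0 otherwise). The function name and the four-page graph are hypothetical, chosen only for illustration.

```python
def transition_matrices(n, edges):
    """Build (T, U) for pages 0..n-1 and directed links given as (src, dst)."""
    edge_set = set(edges)  # duplicate links count once
    outdeg = [0] * n
    indeg = [0] * n
    for src, dst in edge_set:
        outdeg[src] += 1
        indeg[dst] += 1

    T = [[0.0] * n for _ in range(n)]  # transition matrix
    U = [[0.0] * n for _ in range(n)]  # inverse transition matrix
    for src, dst in edge_set:
        T[dst][src] = 1.0 / outdeg[src]  # src passes score forward to dst
        U[src][dst] = 1.0 / indeg[dst]   # dst passes score backward to src
    return T, U


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
T, U = transition_matrices(4, edges)
```

Each column of T that belongs to a page with outlinks sums to 1, and each column of U that belongs to a page with inlinks sums to 1.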

Assessing Trust – Oracle Function
• Human evaluation: expensive and slow

Assessing Trust – Approximate Isolation

Assessing Trust – Trust Function
• Ideal Trust Property
• Ordered Trust Property
• Threshold Trust Property
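The formulas for these properties were lost in extraction. As a reading aid, they can be written as follows, assuming the notation of the TrustRank paper: O is the oracle function, T is the trust function, and δ is a threshold. The slide itself may have used different symbols.

```latex
\begin{align*}
% Oracle function: human judgment of a page p
O(p) &= \begin{cases} 0 & \text{if } p \text{ is bad (spam)} \\ 1 & \text{if } p \text{ is good} \end{cases} \\[4pt]
% Ideal trust property
T(p) &= \Pr[O(p) = 1] \\[4pt]
% Ordered trust property
T(p) < T(q) &\iff \Pr[O(p) = 1] < \Pr[O(q) = 1] \\
T(p) = T(q) &\iff \Pr[O(p) = 1] = \Pr[O(q) = 1] \\[4pt]
% Threshold trust property
T(p) > \delta &\iff O(p) = 1
\end{align*}
```

Each property is weaker than the one before it: the ordered property only asks that trust scores rank pages correctly, and the threshold property only asks that a cutoff separates good pages from bad ones.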

Evaluation Metrics
• Pairwise Orderedness
• Precision
• Recall
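The three metrics can be sketched in a few lines. This is an illustration under assumptions, not the presenter's code: it follows the paper's definitions as I understand them, where a pair counts as a violation when the trust ordering contradicts the oracle, and precision and recall are taken over pages whose trust exceeds a threshold `delta`. The page names, scores, and threshold are made up.

```python
from itertools import combinations


def pairwise_orderedness(trust, oracle, pairs):
    """Fraction of page pairs whose trust order does not contradict the oracle."""
    violations = 0
    for p, q in pairs:
        if trust[p] >= trust[q] and oracle[p] < oracle[q]:
            violations += 1
        elif trust[p] <= trust[q] and oracle[p] > oracle[q]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)


def precision_recall(trust, oracle, sample, delta):
    """Precision and recall of 'trust > delta' as a predictor of good pages."""
    above = [p for p in sample if trust[p] > delta]
    good_above = [p for p in above if oracle[p] == 1]
    good = [p for p in sample if oracle[p] == 1]
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / len(good) if good else 0.0
    return precision, recall


# Hypothetical sample: oracle value 1 = good page, 0 = spam page
trust = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
oracle = {"a": 1, "b": 0, "c": 1, "d": 0}
pairs = list(combinations(trust, 2))

pairord = pairwise_orderedness(trust, oracle, pairs)
prec, rec = precision_recall(trust, oracle, list(trust), delta=0.5)
```

In this sample only the pair (b, c) is misordered, since the spam page b outranks the good page c, so five of the six pairs are consistent with the oracle.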

TrustRank Algorithm
Intuition: Good pages point to other good pages
1. Select seeds
2. Propagate trust
3. Repeat Step 2 (decay factor = 0.85) for 20 iterations to reach convergence
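The select, propagate, repeat loop can be sketched as a biased PageRank iteration. This is a minimal sketch, assuming the update rule from the paper, t = α·T·t + (1 − α)·d, where d is uniform over the good seeds and zero elsewhere. The function name, the five-page graph, and the choice of seed are hypothetical.

```python
def trustrank(n, edges, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust from good seed pages along outlinks."""
    edge_set = set(edges)  # duplicate links count once
    outdeg = [0] * n
    for src, _ in edge_set:
        outdeg[src] += 1

    seeds = set(good_seeds)
    d = [1.0 / len(seeds) if p in seeds else 0.0 for p in range(n)]

    t = d[:]  # start from the seed distribution
    for _ in range(iterations):
        nxt = [(1 - alpha) * d[p] for p in range(n)]
        for src, dst in edge_set:
            # A page splits its dampened trust equally among its outlinks.
            nxt[dst] += alpha * t[src] / outdeg[src]
        t = nxt
    return t


# Hypothetical graph: seed 0 links to 1 and 2, 1 links to 2;
# pages 3 and 4 form a cluster that no trusted page links to.
edges = [(0, 1), (0, 2), (1, 2), (3, 4)]
scores = trustrank(5, edges, good_seeds=[0])
```

Pages 3 and 4 end with a score of zero because no trust ever reaches them, which is the behaviour the intuition on the slide relies on.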

Trust Attenuation - Dampening

Trust Attenuation - Splitting
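The two attenuation ideas can be illustrated with a short sketch. It assumes the usual reading of the terms: dampening multiplies trust by a factor β < 1 for every link followed, and splitting divides a page's trust equally among the pages it links to. The function names, the value β = 0.5, and the page names are hypothetical.

```python
def dampened_trust(distance, beta=0.5):
    """Trust reaching a page `distance` links away from a seed with trust 1."""
    return beta ** distance


def split_trust(trust, outlinks):
    """Divide a page's trust equally among the pages it links to."""
    if not outlinks:
        return {}
    share = trust / len(outlinks)
    return {target: share for target in outlinks}


def dampen_and_split(trust, outlinks, beta=0.5):
    """Combine both: each outlink receives beta * trust / outdegree."""
    return split_trust(beta * trust, outlinks)


two_hops = dampened_trust(2)                           # seed -> page -> page
shares = split_trust(1.0, ["a", "b", "c", "d"])        # four outlinks
combined = dampen_and_split(1.0, ["a", "b"])           # two outlinks
```

The combined form is the one a PageRank-style propagation applies at every step, with the decay factor playing the role of β.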

Experiments - Data Set
• AltaVista crawl of August 2003
• Simplification: several billion pages reduced to 31 million sites using a proprietary algorithm
• Observation: one third of the sites were unreferenced; however, they matter little because they receive low rankings
• The first author served as the oracle

Experiments - Seed Selection
• Two schemes
  • Inverse PageRank
  • High PageRank
• Observation
  • Some top-ranked pages were spam
• Selection
  • 31 million -> 25,000 (top inverse PR) -> 7,900 (listed sites) -> 1,250 (manual evaluation) -> 178 (seeds)
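The inverse PageRank scheme can be sketched as PageRank run on the graph with every link reversed, so that pages from which many other pages can be reached score highly. This is an illustration under assumptions: it uses the update s = α·U·s + (1 − α)/N with the inverse transition matrix U, and the function name, graph, and parameter defaults are hypothetical. The manual filtering steps listed on the slide are not modelled.

```python
def inverse_pagerank(n, edges, alpha=0.85, iterations=20):
    """Score pages by how much of the graph is reachable through their outlinks."""
    edge_set = set(edges)  # duplicate links count once
    indeg = [0] * n
    for _, dst in edge_set:
        indeg[dst] += 1

    s = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1 - alpha) / n] * n
        for src, dst in edge_set:
            # Score flows backward along the link, from dst to src.
            nxt[src] += alpha * s[dst] / indeg[dst]
        s = nxt
    return s


# Hypothetical graph: page 0 links to 1, 2 and 3; page 1 links to 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 3)]
scores = inverse_pagerank(4, edges)

# Candidate seeds in descending score order; an oracle (a human) would then
# inspect the top candidates and keep only the good ones.
ranking = sorted(range(4), key=lambda p: scores[p], reverse=True)
```

Page 0 ranks first because trust placed on it would reach every other page in this small graph.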

Experiments - Seed Selection

Experiments - % Spam per bucket

Experiments – Pairwise Orderedness

Experiments – Precision and Recall

Contributions
• Formally defined webspam and webspam detection algorithms
• Defined metrics for assessing the efficacy of algorithms (pairwise orderedness)
• Defined seed selection schemes (Inverse PR and High PR)
• Introduced TrustRank
• Experiments

Related Work
• Builds upon PageRank
• Spam detection in text (machine learning)
• Spam detection in links (graph clustering)
Future Research
• Experiment with various combinations of dampening and splitting for trust propagation
• Select seeds iteratively

References and Resources
• http://dl.acm.org/citation.cfm?id=1316740
• http://infolab.stanford.edu/~zoltan
• http://en.wikipedia.org/wiki/Spamdexing
• http://en.wikipedia.org/wiki/PageRank
