
Splog Detection: Temporal and Link Properties Solution

This presentation defines a splog detection task and proposes a solution based on temporal and link properties. It identifies the unique characteristics of splogs, formulates a time-sensitive online detection task that captures them, and describes a detection technique that exploits temporal and link regularity features.


Presentation Transcript


  1. The Splog Detection Task and A Solution Based on Temporal and Link Properties Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle Tseng Presenter: Belle Tseng NEC Laboratories America, Cupertino, CA.

  2. Problem statement Goal: combat spam in the blogosphere • What are splogs? • How to detect splogs? • How to evaluate anti-splog techniques? Approach: splog detection task & solution • Identify unique characteristics of splogs • Propose a time-sensitive online detection task that captures the unique characteristics • Propose a splog detection technique based on temporal & link properties

  3. Outline of the talk • Introduction • Splog detection task • Our detection method • Data pre-processing & annotation • Experiment results • Concluding remarks

  4. Introduction • Motivation • Related work • What are splogs?

  5. Motivation • Splogs are polluting the blogosphere… • 10-20% of blogs are splogs [1] • An average of 44 of the top 100 blog search results in three popular blog search engines came from splogs [1] • 75% of new pings came from splogs; more than 50% of claimed blogs pinging weblogs.com are splogs [2] • Research issues • What are splogs? (no concrete definition!) • How to detect splogs? (splogs are different from web spam!) • How to evaluate anti-splog techniques? (a comparative evaluation framework on the TREC dataset that captures the unique characteristics of splogs) Splog (spam + blog): a new and serious problem in the blogosphere!

  6. Related work • Web spam detection • Content analysis • [Ntoulas06]: statistical properties in content • Link analysis • [Gyongyi05]: spam mass estimation • Splog detection • [Kolari06]: apply web spam detection & topic identification techniques to splog detection However, splogs are different…

  7. Example (1): keyword stuffing

  8. Example (2): stolen content Traditional content analysis is not enough!

  9. Example (3): link farm

  10. Example (4): via trackback links Traditional link analysis is not enough!

  11. What are splogs? • Splog: a blog created by an author who has the intention of spamming • NOTE: a blog having comment spam or trackback spam is not considered a splog (Diagram legend: S = splog; W = affiliate website; Ads/PPC = profitable mechanism)

  12. Characteristics of splogs • Typical characteristics • Machine-generated content • No value-addition • Hidden agenda, usually an economic goal • Uniqueness of splogs • Dynamic content • Non-endorsement links Splog detection is different from web spam detection!

  13. Task Definition • Framework • Traditional IR-based evaluation • Proposed online evaluation

  14. Framework • Splog detector for blog search engines • Different from a web search engine: the content (feeds) keeps growing • So, time is crucial • Entries become available gradually → time delay to gather enough evidence • A splog persists in the index with growing content → detect it as soon as possible • How fast is the detector? • Make a decision with less evidence (Diagram legend: b1, b2, b3, …: downloaded blogs; e1, e2, e3, …: downloaded entries)

  15. Detection tasks • Traditional IR-based evaluation • With ground truth • K-fold cross-validation • Performance measures: precision/recall, AUC, ROC plot, etc. • Without ground truth • Performance measure: average precision at top N of the ranked list, based on pooling of multiple detection lists
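
A minimal Python sketch of the with-ground-truth offline evaluation, assuming a numeric feature matrix X and binary labels y (1 = splog) are already built; StratifiedKFold, SVC, and roc_auc_score are standard scikit-learn components, and the fold count is illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def cross_validated_auc(X, y, n_folds=5, seed=0):
    # Average ROC AUC over stratified folds, as in the IR-style offline evaluation.
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(kernel="rbf", probability=True)  # RBF-kernel SVM, as on slide 25
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))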

  16. Online evaluation • A framework to evaluate time-sensitive detection performance • B(ti): the partition consisting of blogs discovered between ti-1 and ti • pjk: detection performance at time tj on partition B(tk) • Pi: average performance over all (j, k) pairs with the same delay i = j - k
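
A minimal sketch of this time-sensitive evaluation loop, assuming blogs are grouped by discovery time into partitions B(tk) and that evaluate_at(j, k) is a hypothetical callback that runs the detector with the data available at time tj on partition B(tk); the delay-averaged Pi values follow the definition on this slide.

from collections import defaultdict

def delay_averaged_performance(evaluate_at, n_partitions):
    # P_i: performance averaged over all (j, k) pairs with delay i = j - k >= 0,
    # where evaluate_at(j, k) scores the detector at time t_j on partition B(t_k).
    by_delay = defaultdict(list)
    for j in range(n_partitions):
        for k in range(j + 1):
            by_delay[j - k].append(evaluate_at(j, k))
    return {i: sum(vals) / len(vals) for i, vals in sorted(by_delay.items())}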

  17. Detection Method • Baseline features • Temporal regularity • Link regularity

  18. Baseline features • A subset of the content features presented in [Ntoulas06] • In practice: • Extract features from 5 parts of a blog • Tokenized URLs, blog and post titles, anchor text, blog homepage content, and post (entry) content • Vectorize by word count, average word length, and a tf-idf vector • Prune rarely-used words • Feature selection using Fisher linear discriminant analysis (LDA) to avoid over-fitting
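
A minimal sketch of baseline content-feature extraction in the spirit of this slide, assuming each blog part has already been reduced to a plain-text string; TfidfVectorizer's min_df stands in for pruning rarely-used words, and the per-feature Fisher criterion is a simple stand-in for the LDA-based selection mentioned above.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def baseline_features(texts, min_df=5):
    # Word count, average word length, and a pruned tf-idf vector per document.
    counts = np.array([len(t.split()) for t in texts], dtype=float)
    avg_len = np.array([np.mean([len(w) for w in t.split()]) if t.split() else 0.0
                        for t in texts])
    vec = TfidfVectorizer(min_df=min_df)  # min_df prunes rarely-used words
    tfidf = vec.fit_transform(texts).toarray()
    return np.column_stack([counts, avg_len, tfidf])

def fisher_scores(X, y):
    # Per-feature Fisher criterion: larger means more class-discriminative.
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den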

  19. New features • Challenges • Content-based methods: suffer from more sophisticated content generation schemes • Link-based methods: suffer from different semantics of links; the link graph is more dynamic and incomplete • Observation • Content: machine-generated posts • How to capture the characteristics of machine-generated content? → temporal regularity estimation • Link: to drive traffic to a specific set of affiliate websites • How to capture the characteristics of specific linking targets? → link regularity estimation • Splogs' motivation is different from that of normal, human-generated blogs!

  20. Temporal regularity (TCR) • Temporal content regularity (TCR) • Captures the similarity between growing contents • Estimated by the autocorrelation of the content over post lags • Similarity measure: histogram intersection, i.e., the amount of common content between two posts • TCR at lag k: the average similarity between posts that are k posts apart
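
A minimal sketch of TCR under the description above: each post becomes a word histogram, similarity is the intersection of the normalized histograms, and TCR at a given lag averages that similarity over all post pairs separated by that lag. The tokenizer and function names are illustrative.

from collections import Counter

def histogram_intersection(a: Counter, b: Counter) -> float:
    # Overlap of two normalized word histograms (1.0 = identical content).
    na, nb = sum(a.values()), sum(b.values())
    if na == 0 or nb == 0:
        return 0.0
    return sum(min(a[w] / na, b[w] / nb) for w in set(a) & set(b))

def tcr(posts, lag):
    # Content autocorrelation at a given lag over a blog's time-ordered posts.
    hists = [Counter(p.lower().split()) for p in posts]
    pairs = [(hists[i], hists[i + lag]) for i in range(len(hists) - lag)]
    if not pairs:
        return 0.0
    return sum(histogram_intersection(a, b) for a, b in pairs) / len(pairs)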

  21. TCR examples

  22. Temporal regularity (TSR) • Temporal structural regularity (TSR) • Captures consistency in the timing of content creation • Estimated by the entropy of the post-time difference distribution • Post-time differences are grouped using a hierarchical clustering method • A blog's post-time entropy is normalized by the maximum observed blog entropy
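
A minimal sketch of TSR, assuming post_times is the sorted list of a blog's post timestamps; fixed-width binning of the inter-post gaps stands in for the hierarchical clustering mentioned on the slide, and the entropy is normalized by the maximum bin entropy rather than the maximum observed across blogs. Values near 0 indicate machine-like, highly regular posting.

import math

def tsr(post_times, n_bins=10):
    # post_times: sorted POSIX timestamps of a blog's entries.
    gaps = [t2 - t1 for t1, t2 in zip(post_times, post_times[1:])]
    if len(gaps) < 2:
        return 1.0
    lo, hi = min(gaps), max(gaps)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for g in gaps:
        counts[min(int((g - lo) / width), n_bins - 1)] += 1
    probs = [c / len(gaps) for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n_bins)  # normalize by the maximum possible entropy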

  23. TSR examples

  24. Link regularity (LR) • Captures consistency in the websites a blog targets • Splogs: more consistent behavior because their main intention is to drive traffic to affiliate websites • Affiliate websites: not authoritative to normal bloggers • Analyze the linking behavior using the HITS algorithm • LR: compute hub scores with out-link normalization • Splogs target a focused set of websites, while normal blogs usually have more diverse targets
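
A minimal sketch of one reading of "hub scores with out-link normalization": HITS-style iterations on a blog-to-website adjacency matrix whose rows are normalized so that each blog spreads unit weight over its targets, which favors blogs that repeatedly link to the same small set of sites. Matrix shapes and the iteration count are illustrative assumptions.

import numpy as np

def hub_scores(adj, n_iter=50):
    # adj: (n_blogs x n_sites) 0/1 numpy matrix of blog-to-website links.
    # Out-link normalization: each blog distributes unit weight over its targets.
    row_sums = adj.sum(axis=1, keepdims=True)
    A = adj / np.maximum(row_sums, 1)
    hubs = np.ones(A.shape[0])
    auth = np.ones(A.shape[1])
    for _ in range(n_iter):
        auth = A.T @ hubs                    # sites pointed to by strong hubs
        auth /= np.linalg.norm(auth) or 1.0
        hubs = A @ auth                      # blogs pointing at strong authorities
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs                              # one hub score per blog (LR feature)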

  25. Classification • Binary classification: splog or normal blog • Use an SVM classifier with a radial basis function kernel • Combine the baseline features with TCR, TSR and LR (Diagram: base-n features and regularity features R = (TCR, TSR, LR) feed into the SVM, which outputs splog / non-splog)
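
A minimal sketch of this classifier: an RBF-kernel SVM over the concatenation of baseline and regularity features, using scikit-learn; the StandardScaler and the split into X_base and X_regularity are illustrative assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_splog_classifier(X_base, X_regularity, y):
    # X_base: baseline features; X_regularity: columns [TCR, TSR, LR]; y: 1 = splog.
    X = np.hstack([X_base, X_regularity])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # RBF-kernel SVM
    clf.fit(X, y)
    return clf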

  26. Data Pre-processing & Ground Truth • Pre-processing • Annotation tool • Disagreement among annotators • Ground truth

  27. Data • TREC dataset: 100,649 feeds • Removing duplicate feeds and feeds without a homepage or permalinks → 43.6K unique blogs • Most blogs are discovered in the first week • Blogs discovered in the first week are used in the online experiment

  28. Annotation (1) • An interface for annotators • Five labels: • (N) Normal • (S) Splog • (B) Borderline • (U) Undecided • (F) Foreign

  29. Annotation (2) • Disagreement among annotators • They agree more on normal blogs but less on near-splog blogs (S/B/U) • Pooling? • Splog recognition: conservative vs. aggressive • Ground truth • 9240 blogs labeled (random & stratified sampling) • 7905 labeled as normal, 525 labeled as splogs • Low splog percentage • Some known splogs are pre-filtered • Focus on the 43.6K subset of blogs having both homepage and entries

  30. Experimental Results • Offline detection • Online detection

  31. Offline evaluation • base-n: n-dimensional baseline features • R+base-n: baseline features plus the temporal and link regularity features

  32. Online experiment (Timeline figure: the linking graph is built from Week 1 data; the testing period spans Week 2 through Week 7)

  33. Online evaluation • Without sufficient content data, the regularity features provide a significant boost to the performance

  34. Summary • Splog: a new and serious problem in the blogosphere • Detection of splogs is different from web spam detection • Identifying new detection tasks • The online evaluation measures how quickly a detector can identify splogs • Introducing useful and unique features of blogs/splogs • Temporal and link regularity measures • Annotation • Guidelines and a tool help reduce annotation effort
