
SVMs for the Blogosphere: Blog Identification and Splog Detection

Pranam Kolari, Tim Finin, Anupam Joshi. http://ebiquity.umbc.edu. Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006.


Presentation Transcript


  1. SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi http://ebiquity.umbc.edu Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006

  2. Blogosphere - the brighter side • Panel View • Market Research • PR Monitoring • From Presentations • Opinion Extraction • Demography based analysis P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection

  3. Blogosphere - the darker side (1) • From the Panel • Blogger is cracking down on splogs • SixApart and TypePad • Content Hijacking • From Presentations • Removing SPAM is an essential part of a blog search engine • Cost of cleaning up splogs and its effect on results

  4. Blogosphere - the darker side (2)

  5. The Blogosphere Information Audience BLOG HOSTS Blogger msn-spaces livejournal PING SERVERS SPLOGS SPINGS

  6. Spings – weblogs.com

  7. Spings – weblogs.com (2)

  8. Spings – weblogs.com (3)

  9. Splogs – icerocket.com

  10. Splogs – icerocket.com (2)

  11. A Featured Splog?

  12. Splogs – technorati.com (2)

  13. Splogs – The Source! “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…” “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!” “Easily Dominate Any Market, Any Search Engine, Any Keyword.” “Holy Grail Of Advertising...”

  14. Spam we target -- summarized • Non-blogs • For increased search engine exposure • Through BLOG IDENTIFICATION • Splogs • Adsense clicks for high-paying contexts (i) • Unjustifiably increase page-rank (importance) of affiliates – link farms (ii) • Combination of (i) and (ii) • Through SPLOG DETECTION

  15. This work • Can machine learning models be effective at countering splogs in the blogosphere? • How do they perform when using features local to a blog?

  16. Dataset for Training • Technorati random sampling • 500K blogs – May/June 2005 • Dropped those from top blogging hosts • Blog Identification is an easy task using just URL patterns/domains • Sampled the rest in different ways to create training datasets
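
A minimal Python sketch of the kind of URL-pattern check this slide alludes to for the top blogging hosts. The host suffixes and the helper name `is_known_blog_host` are assumptions made for illustration; the slides do not list the patterns actually used.

```python
from urllib.parse import urlparse

# Assumed host suffixes for illustration only; not taken from the slides.
KNOWN_BLOG_HOST_SUFFIXES = ("blogspot.com", "livejournal.com", "spaces.msn.com")


def is_known_blog_host(url: str) -> bool:
    """Return True if the URL's host is, or is a subdomain of, a known blog host."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in KNOWN_BLOG_HOST_SUFFIXES
    )


print(is_known_blog_host("http://example.blogspot.com/2005/06/post.html"))
print(is_known_blog_host("http://www.example.org/news/"))
```

Pages that fail such a check are the ones for which a learned classifier is needed, which is why the slide drops the top hosts before sampling.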

  17. Blog-HomePage/Non-Blog • Sampled for blog home-pages • Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs • All samples were manually verified • Training set consists of 2100 positive and 2100 negative samples – multiple languages • Let's call this (BH, NB)

  18. Blog-SubPage/Non-Blog • Sampled for local-links from BH • Sampled for out-links similar to NB • No manual verification • 2600 positive and 2600 negative samples • Let's call this (BNH, NB)

  19. Authentic Blog/Splog • Manually identified 700 splogs (English) in the BH sample • Sampled for 700 blogs from the rest • 700 positive and 700 negative samples • Let's call this (AB, S)

  20. Comparison Baselines • Blog Identification • Splog Detection is a known problem!

  21. Evaluation - Background • SVMs as implemented by libsvm • Leave-One-Out cross-validation • No stop word elimination • No stemming • Mutual Information for feature selection • Frequency count provided similar results • Binary feature encoding • Other encodings give similar results
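
The evaluation steps named on this slide can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: documents are assumed to be sets of tokens, all function names are hypothetical, and the classifier is passed in as a callable rather than calling libsvm, so any learner can be substituted.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between presence of `feature` and the class label."""
    n = len(docs)
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_feature = sum(c for (f, _), c in joint.items() if f == present) / n
        p_label = sum(c for (_, y), c in joint.items() if y == label) / n
        mi += p_joint * math.log2(p_joint / (p_feature * p_label))
    return mi


def select_features(docs, labels, k):
    """Keep the k tokens with the highest mutual information (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-mutual_information(docs, labels, t), t))
    return ranked[:k]


def binary_encode(doc, features):
    """Binary encoding: 1 if the token occurs in the document, else 0."""
    return [1 if token in doc else 0 for token in features]


def leave_one_out_accuracy(vectors, labels, fit_predict):
    """Hold out each sample once; fit_predict(train_X, train_y, x) returns a label."""
    correct = 0
    for i in range(len(vectors)):
        train_X = vectors[:i] + vectors[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_X, train_y, vectors[i]) == labels[i]:
            correct += 1
    return correct / len(vectors)
```

For simplicity the sketch selects features once over all documents; a stricter protocol would repeat the selection inside each held-out fold so the test sample never influences which features are kept.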

  22. New features for blogs • Hyper-links on a page • Tokenized by “/” and “-” • Anchor-text on a page • Meta tags • From HTML HEAD element • 4-grams • Contiguous blocks of 4 characters • Combinations • words and urls • meta and link • urls, anchors, meta
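
A rough Python sketch of how the feature types on this slide might be extracted from a page, using only the standard library. The class and function names and the `url:` / `anchor:` / `meta:` prefixes are assumptions made for illustration. As a simplification, the parser collects every `meta` tag it sees rather than only those inside the HEAD element.

```python
import re
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collects link targets, anchor text, and meta-tag content from a page."""

    def __init__(self):
        super().__init__()
        self.urls, self.anchors, self.meta = [], [], []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
            self._in_anchor = True
        elif tag == "meta" and attrs.get("content"):
            self.meta.append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.anchors.append(data.strip())


def url_tokens(url):
    """Tokenize a hyperlink by "/" and "-", as the slide describes."""
    return [t for t in re.split(r"[/-]", url.lower()) if t]


def char_4grams(text):
    """Contiguous blocks of 4 characters."""
    return [text[i:i + 4] for i in range(len(text) - 3)]


def extract_features(html):
    """Return the set of prefixed url, anchor, and meta tokens found on a page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    features = set()
    for url in parser.urls:
        features.update("url:" + t for t in url_tokens(url))
    for text in parser.anchors:
        features.update("anchor:" + w for w in text.lower().split())
    for content in parser.meta:
        features.update("meta:" + w for w in content.lower().split())
    return features
```

Prefixing each token with its source keeps the same word distinct when it appears in a URL, in anchor text, or in a meta tag, which is one simple way to build the combined feature sets listed on the slide.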

  23. Blog Identification – (BH, NB)

  24. Blog Identification – (BNH, NB)

  25. Splog Detection - (AB, S)

  26. A Quick Analysis • Ping Servers • Our analysis in December 2005 • At least 75% of pings are spings • Technorati Index • Data from week of March 20, 2006 • Random queries to sample for 10K blogs • 3K blogspot, 2.5K livejournal, 1.8K msn • We predict that 1.5K from blogspot and 250 from LJ are splogs • Overall 2.5K/10K are splogs ~ 25% of the fresh index!
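
The arithmetic behind these estimates can be checked with a few lines of Python. The counts are the rounded figures from the slide; the variable names are illustrative only.

```python
# Rounded counts taken from the slide (10K sampled blogs in total).
sampled = {"blogspot": 3000, "livejournal": 2500, "msn": 1800}
predicted_splogs = {"blogspot": 1500, "livejournal": 250}
total_sampled = 10_000
total_predicted_splogs = 2500

# Per-host splog rate for the hosts with a stated prediction.
rates = {host: predicted_splogs[host] / sampled[host] for host in predicted_splogs}
overall_rate = total_predicted_splogs / total_sampled

# Predicted splogs not attributed to blogspot or livejournal on the slide.
remaining_splogs = total_predicted_splogs - sum(predicted_splogs.values())

for host, rate in rates.items():
    print(f"{host}: {rate:.0%}")
print(f"overall: {overall_rate:.0%}, elsewhere: {remaining_splogs}")
```

On these figures, half of the sampled blogspot blogs and a tenth of the sampled livejournal blogs are predicted to be splogs, and 750 of the 2.5K predicted splogs come from hosts other than those two.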

  27. Blogosphere Spam - Summary [Diagram labels: Information, Audience; BLOG HOSTS (Blogger, msn-spaces, livejournal); PING SERVERS; percentages shown: 25%, 50%, 10%, 75%]

  28. And it's not getting easier … But spammers still leave trails that can be exploited

  29. Conclusion • Blogosphere is prone to spam at various infrastructure points • Local content based models can be quite effective by themselves • 75% of pings and further downstream, 25% of fresh content is spam • Blogger’s problem is now livejournal’s problem, and now everyone’s problem • Combining local and global splog models is our current direction

  30. Questions? • Google “Splog Detection” • memeta • http://memeta.umbc.edu • eBiquity • http://ebiquity.umbc.edu • http://ebiquity.umbc.edu/blogger • Check out Umbria’s report on splogs • http://www.umbrialistens.com/files/uploads/umbria_splog.pdf
