
A Quantitative Study of Forum Spamming Using Context-Based Analysis

Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu* (*UC Davis, ^Microsoft Research)


Presentation Transcript


  1. A Quantitative Study of Forum Spamming Using Context-Based Analysis • Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu* • *UC Davis, ^Microsoft Research

  2. A Look at the Web [Figure labels: User, Spammer]

  3. Why do we care about spam? • Users want to • Look at quality pages on the web • Interact without the trouble of moderation • Surf safely • Search engines want to • Provide good search results • Profit from ads • We want to investigate the landscape of the problem • Popular battleground: web forums

  4. Why Web Forums? • Open communities: wiki, forums, blogs • Increasingly easy to contribute

  5. Why Web Forums?

  6. How Spammers Operate [Diagram. Nodes: Spammer, Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain. Edge labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User; Returns]

  7. How to deal with the problem? • Content-based approach • Constrained by content retrieved • May be deceived by tricks like cloaking and redirection • We propose: context-based analysis

  8. Context-based Analysis • Consists of • Redirection analysis • Cloaking analysis • Sees dynamic content not served to crawlers • Uses the Strider URL Tracer • Flags large numbers of doorway pages that lead to spam domains • Based on the intuition that: • Publishing links is necessary to increase popularity • We must see the destination URL eventually

  9. Doorways & Redirections [Screenshot: Google search for "Coach handbag"]

  10. Redirection Analysis • Fed URLs to the Strider URL Tracer, which records all pages visited • Ranked the top third-party domains by number of redirections • Seeded the blacklist with known spammer domains • Identified doorway pages based on association with spammer domains • Manually investigated unknown domains to expand the blacklist
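The ranking and flagging steps on this slide can be sketched in a few lines. This is a minimal illustration, not the Strider URL Tracer itself: it assumes the tracer's output has already been reduced to a mapping from each starting URL to the list of URLs visited, and every hostname in `TRACES` is a made-up placeholder.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lowercase hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, for each third-party host, how many starting URLs led to it."""
    counts = Counter()
    for start_url, visited in traces.items():
        first_party = host(start_url)
        # Count each third-party host at most once per starting URL.
        third_parties = {host(u) for u in visited} - {first_party, ""}
        counts.update(third_parties)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return starting URLs whose trace touches a blacklisted host."""
    return sorted(
        start_url
        for start_url, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Hypothetical trace data (assumed format: start URL -> URLs visited).
TRACES = {
    "http://blog-a.example/post1": [
        "http://blog-a.example/post1",
        "http://spam-target.example/landing",
    ],
    "http://blog-b.example/post2": [
        "http://blog-b.example/post2",
        "http://spam-target.example/landing",
        "http://ads.example/click",
    ],
    "http://blog-c.example/about": ["http://blog-c.example/about"],
}

ranking = rank_third_party_domains(TRACES)
flagged = flag_doorways(TRACES, {"spam-target.example"})
```

Hosts near the top of `ranking` that are not yet on the blacklist are the candidates for the manual investigation step.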

  11. Cloaking Analysis • Diff-based check • Run URL twice – once with anti-cloaking, once without • Crawler-browser cloaking (User-agent, scripting-on/off) • Click-through cloaking (Referer)
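A diff-based check of this kind can be sketched as a comparison of the two page bodies retrieved for the same URL. The sketch below assumes both views have already been fetched (for example with a crawler User-Agent versus a browser one, or with and without a search-engine Referer). The token-level similarity measure and the 0.5 threshold are illustrative assumptions, not values from the study.

```python
from difflib import SequenceMatcher


def similarity(page_a, page_b):
    """Similarity ratio in [0, 1] between two page bodies, token by token."""
    return SequenceMatcher(None, page_a.split(), page_b.split()).ratio()


def looks_cloaked(view_without_anticloaking, view_with_anticloaking, threshold=0.5):
    """Flag a URL when its two views differ by more than the assumed threshold."""
    return similarity(view_without_anticloaking, view_with_anticloaking) < threshold


# Hypothetical page bodies for one URL.
CRAWLER_VIEW = "free ringtones download best ringtones for your phone"
BROWSER_VIEW = "sponsored listings click here for offers"

cloaked = looks_cloaked(CRAWLER_VIEW, BROWSER_VIEW)
unchanged = looks_cloaked(CRAWLER_VIEW, CRAWLER_VIEW)
```

Legitimate dynamic pages also change between fetches, so a real check would need a tolerance tuned on known-good pages.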

  12. Crawler-Browser Cloaking • Google search: ringtones download • www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html with JavaScript enabled • www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html with JavaScript disabled

  13. Crawler-Browser Cloaking

  14. Click-Through Cloaking [Screenshots: advertising page from click-throughs; cached page / scripting off / crawler view; directly visiting the page]

  15. Three Perspectives [Diagram. Perspective labels: Search User, Webhost, Spammer. Nodes: Search Engine, Comment Spam, Search Results, Doorway Pages (Splogs), Spammer Domain. Edge labels: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User; Returns]

  16. Search User

  17. Search User • Chose 9 popular forum software packages, written in Perl/PHP, hosted/unhosted • WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin • Compiled popular tags and common spam terms into a list of 190 keywords • “Myspace, jewelry, casino, shopping, baseball…” • Searched for all <keyword, forum-software> pairs in Google & MSN
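Enumerating the <keyword, forum-software> pairs is a plain Cartesian product. The sketch below uses only the five keywords quoted on the slide, since the full 190-keyword list is not given. The query format (keyword followed by the software name) is an assumption. With the full list, the same product would yield 190 × 9 = 1,710 queries per search engine.

```python
from itertools import product

# The nine forum packages named on the slide.
FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]

# Only the five keywords quoted on the slide; the full list had 190.
KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, forums):
    """Return one query string per <keyword, forum-software> pair (assumed format)."""
    return [f"{keyword} {forum}" for keyword, forum in product(keywords, forums)]


queries = build_queries(KEYWORDS, FORUM_SOFTWARE)
```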

  18. Search User • All but one of the 190 search terms returned spammed forums in the top 20 results from both Google and MSN • The only exception is “palm-texas-holdem-game” • Top 5 most spammed forums:

  19. Honeyblogs • Spammers: • Create their own doorway pages, and • Promote the doorways by posting to other people’s pages • Honeyblogs lure the spammer in: • No moderation, default accept all policy • Pinged blog aggregators with every post • Abandoned within three months

  20. Honeyblogs • 41,100 comments collected over 339 days • 19,297 comments received in the last month • Ilium – 930/1432 • Litlog – 3734/5714 • Spammer activity got me kicked off my hosting server

  21. Honeyblog Activity

  22. Honeyblog Activity [Chart; one data label reads 3142]

  23. Webhost Perspective • Focus on splog doorways • Above Numbers are lower bounds • Consider only pages using cloaking & redirection

  24. Webhost Perspective • Blogspot: 1,091 splogs • Most popular • Randomly sampled 1% of profile pages created in July and extracted all blog links – 13,389 • 60% of splogs used cloaking • 24% of splogs redirected to filldirect.com

  25. Webhost Perspective • Blogspoint: 3535 splogs • 2166 redirected to finance-web-search.com • 917 redirected to casino-web-search.com • Blogstudio: 198 splogs • 130 redirected to finance-web-search.com • 54 redirected to casino-web-search.com • Blogsharing: 82 splogs • Plumber related link spamming in splogs
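The counts on this slide are easier to compare as shares of each host's splogs. The following is a small worked calculation using only the figures above. The dictionary layout and function name are illustrative.

```python
# Splog counts per host and per redirect target, copied from the slide.
SPLOG_REDIRECTS = {
    "Blogspoint": {
        "total": 3535,
        "finance-web-search.com": 2166,
        "casino-web-search.com": 917,
    },
    "Blogstudio": {
        "total": 198,
        "finance-web-search.com": 130,
        "casino-web-search.com": 54,
    },
}


def redirect_shares(hosts):
    """Return, per host, the percentage of splogs redirecting to each domain."""
    shares = {}
    for host_name, counts in hosts.items():
        total = counts["total"]
        shares[host_name] = {
            domain: round(100 * n / total, 1)
            for domain, n in counts.items()
            if domain != "total"
        }
    return shares


shares = redirect_shares(SPLOG_REDIRECTS)
```

On both hosts, roughly three fifths of the splogs redirect to finance-web-search.com and about a quarter to casino-web-search.com. The split is nearly the same on both hosts.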

  26. Also of note… • Malicious URLs • Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities • We tested 8 known malicious URLs for presence on the web • Found 5 spammed in forums, 2 in link farms, 1 in referrer logs • Universal redirectors • Redirect the user to any URL (sometimes the destination is obfuscated): • www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here] • http://tinyurl.com/3c7twl • http://www.canadianpharmacyltd.com/group.php?id=59&aid=860 • Could be used to serve malicious URLs, particularly those on .edu and .gov sites • [1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
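When a universal redirector carries its destination in the query string, as in the first example above, the target can be read without following the link. The sketch below assumes a short list of candidate parameter names (`CANDIDATE_PARAMS` is hypothetical, not from the study) and uses a placeholder destination. Obfuscated redirectors such as the other two examples return nothing and would have to be resolved by following the redirect.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names that might carry a destination URL.
CANDIDATE_PARAMS = ("url", "u", "target", "dest", "redirect")


def redirect_target(link):
    """Return the destination embedded in a redirector link, or None."""
    query = parse_qs(urlsplit(link).query)
    for name in CANDIDATE_PARAMS:
        for value in query.get(name, []):
            # parse_qs has already percent-decoded the value.
            if value.startswith(("http://", "https://")):
                return value
    return None


# Placeholder destination appended to the redirector pattern from the slide.
open_link = (
    "http://www.rit.edu/~ksa/cgi-bin/splinks/click.cgi"
    "?num=2&url=http%3A%2F%2Fexample.com%2Fpage"
)
visible = redirect_target(open_link)
hidden = redirect_target("http://www.canadianpharmacyltd.com/group.php?id=59&aid=860")
```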

  27. Related Work (Part 1) • Diff-based cloaking • Wu & Davison – Diff-based cloaking combined with content-based analysis • Our approach detects click-through cloaking • Content-based approaches • Fetterly, Manasse and Najork – URL properties, clustering pages of similar content • Mishne, Carmel, Lempel – Compared statistical models of comments & target pages against post content • Kolari, Finin and Joshi – Meta tag text, anchor text, URLs • Our approach is complementary to content-based approaches

  28. Related Work (Part 2) • Measurements of Trust • Metaxas et al – Defined trust neighborhoods • Benczur et al – SpamRank: Identify outliers by looking at PageRank of the site and its “supporters” • Similarly, our approach propagates distrust by following redirections • Plugins to aid moderating forums/blogs • Akismet • Bad Behavior, Spam Karma • Our approach does not require cooperation from forum owners

  29. Conclusions • Context-based approach successfully detects advanced cloaking- and redirection-based spam • Spammers are pervasive • 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN • The same spammer redirects to two domains from splogs on Blogspoint and Blogstudio

  30. Future work • There is hope! • Economic solution • Identifies middlemen in online advertising • Read our WWW07 paper [1] • http://wwwcsif.cs.ucdavis.edu/~niu • http://research.microsoft.com/csm/strider/ • [1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.
