
Countering Spam Using Classification Techniques


Presentation Transcript


  1. Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008

  2. Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions

  3. Introduction • The Internet has spawned numerous information-rich environments • Email Systems • World Wide Web • Social Networking Communities • Openness facilitates information sharing, but it also makes these environments vulnerable…

  4. Denial of Information (DoI) Attacks • Deliberate insertion of low quality information (or noise) into information-rich environments • Information analog to Denial of Service (DoS) attacks • Two goals • Promotion of ideals by means of deception • Denial of access to high quality information • Spam is currently the most prominent example of a DoI attack

  5. Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions

  6. Countering Email Spam • Close to 200 billion (yes, billion) emails are sent each day • Spam accounts for around 90% of that email traffic • ~2 million spam messages every second

  7. Old Email Spam Examples

  8. Problem Description • Email spam detection can be modeled as a binary text classification problem • Two classes: spam and legitimate (non-spam) • Example of supervised learning • Build a model (classifier) based on training data to approximate the target function • Construct a function f: M → {spam, legitimate} that agrees with the true target function F: M → {spam, legitimate} as closely as possible
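
To make this supervised setup concrete, here is a minimal sketch (not part of the original slides) that learns an approximation of f from a handful of made-up labeled messages; it assumes scikit-learn is available:

```python
# Minimal sketch of spam detection as supervised binary text
# classification. The tiny labeled dataset is hypothetical, and
# scikit-learn is assumed to be installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "cheap meds buy now limited offer",       # spam
    "win a free prize click here",            # spam
    "meeting moved to 3pm see agenda",        # legitimate
    "draft of the paper attached for review", # legitimate
]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

# Learn a model f: M -> {spam, legitimate} from the training data.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)
classifier = MultinomialNB().fit(X_train, train_labels)

# Apply the learned approximation of F to an unseen message.
X_test = vectorizer.transform(["free meds click now"])
print(classifier.predict(X_test))  # expected: ['spam']
```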

  9. Problem Description (cont.) • How do we represent a message? • How do we generate features? • How do we process features? • How do we evaluate performance?

  10. How do we represent a message? • Classification algorithms require a consistent format • Salton’s vector space model (“bag of words”) is the most popular representation • Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
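
A dependency-free sketch of this representation, using a made-up vocabulary and message:

```python
import re

# Fixed vocabulary; each fi in <f1, f2, ..., fn> is the raw count of
# one vocabulary term in the message. Binary or tf-idf weights are
# common variants.
vocabulary = ["free", "meds", "meeting", "offer", "paper"]

def to_feature_vector(message):
    # Alphanumeric tokenization, then one count per vocabulary term.
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [tokens.count(term) for term in vocabulary]

print(to_feature_vector("Free meds! Special offer, free shipping"))
# -> [2, 1, 0, 1, 0]
```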

  11. How do we generate features? • Sources of information • SMTP connections • Network properties • Email headers • Social networks • Email body • Textual parts • URLs • Attachments

  12. How do we process features? • Feature Tokenization • Alphanumeric tokens • N-grams • Phrases • Feature Scrubbing • Stemming • Stop word removal • Feature Selection • Simple feature removal • Information-theoretic algorithms
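
The following sketch (illustrative only) combines these steps: alphanumeric tokenization with optional word n-grams, stop-word scrubbing, and simple document-frequency feature removal. The stop-word list and threshold are placeholders:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "and", "of", "is"}  # tiny illustrative list

def tokenize(text, n=1):
    # Alphanumeric tokens; n > 1 yields word n-grams instead.
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [w for w in words if w not in STOP_WORDS]  # scrubbing
    if n == 1:
        return words
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def select_features(documents, min_df=2):
    # Simple feature removal: drop tokens seen in fewer than min_df
    # documents. Information-theoretic selection (e.g., information
    # gain) would instead rank features by how well they separate
    # spam from legitimate messages.
    df = Counter(t for doc in documents for t in set(tokenize(doc)))
    return {t for t, count in df.items() if count >= min_df}

docs = ["Buy cheap meds now", "Cheap meds, no prescription", "Lunch at noon?"]
print(select_features(docs))  # -> {'cheap', 'meds'}
```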

  13. How do we evaluate performance? • Traditional IR metrics • Precision vs. Recall • False positives vs. False negatives • Imbalanced error costs • ROC curves
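
A minimal sketch of these metrics, treating spam as the positive class (the example predictions are made up). The error costs are imbalanced: a false positive, i.e. a legitimate message filtered as spam, usually costs far more than a false negative, which is why ROC curves and cost-sensitive thresholds matter:

```python
def evaluate(predicted, actual, positive="spam"):
    # Confusion-matrix counts, treating "spam" as the positive class.
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(a == positive != p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp, fn

pred = ["spam", "spam", "legitimate", "spam"]
true = ["spam", "legitimate", "legitimate", "spam"]
print(evaluate(pred, true))  # -> (0.666..., 1.0, 1, 0)
```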

  14. Classification History • Sahami et al. (1998) • Used a Naïve Bayes classifier • Were the first to apply text classification research to the spam problem • Pantel and Lin (1998) • Also used a Naïve Bayes classifier • Found that Naïve Bayes outperforms RIPPER
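
For illustration, a generic multinomial Naïve Bayes scorer with add-one smoothing can be written in a few lines; this is the textbook formulation, not the exact feature set or model variant used in these papers:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train(messages, labels):
    counts = {"spam": Counter(), "legitimate": Counter()}
    for text, label in zip(messages, labels):
        counts[label].update(tokenize(text))
    return counts

def classify(counts, text):
    # Score each class by the sum of log P(token | class), with
    # add-one (Laplace) smoothing; class priors omitted for brevity.
    vocab = set(counts["spam"]) | set(counts["legitimate"])
    best = None
    for label, bag in counts.items():
        total = sum(bag.values())
        score = sum(
            math.log((bag[t] + 1) / (total + len(vocab)))
            for t in tokenize(text)
        )
        best = max(best, (score, label)) if best else (score, label)
    return best[1]

model = train(
    ["cheap meds now", "win a free prize", "agenda for the meeting"],
    ["spam", "spam", "legitimate"],
)
print(classify(model, "free meds"))  # -> 'spam'
```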

  15. Classification History (cont.) • Drucker et al. (1999) • Evaluated Support Vector Machines as a solution to spam • Found that SVM is more effective than RIPPER and Rocchio • Hidalgo and Lopez (2000) • Found that decision trees (C4.5) outperform Naïve Bayes and k-NN

  16. Classification History (cont.) • Up to this point, private corpora were used exclusively in email spam research • Androutsopoulos et al. (2000a) • Created the first publicly available email spam corpus (Ling-spam) • Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier

  17. Classification History (cont.) • Androutsopoulos et al. (2000b) • Created another publicly available email spam corpus (PU1) • Confirmed earlier findings that Naïve Bayes outperforms a keyword-based filter • Carreras and Marquez (2001) • Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes

  18. Classification History (cont.) • Androutsopoulos et al. (2004) • Created 3 more publicly available corpora (PU2, PU3, and PUA) • Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB • Zhang et al. (2004) • Used Ling-spam, PU1, and the SpamAssassin corpora • Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB

  19. Classification History (cont.) • CEAS (2004 – present) • Focuses solely on email and anti-spam research • Generates a significant amount of academic and industry anti-spam research • Klimt and Yang (2004) • Published the Enron Corpus – the first large-scale corpus of legitimate email messages • TREC Spam Track (2005 – present) • Produces new corpora every year • Provides a standardized platform to evaluate classification algorithms

  20. Ongoing Research • Concept Drift • New Classification Approaches • Adversarial Classification • Image Spam

  21. Concept Drift • Spam content is extremely dynamic • Topic drift (e.g., specific scams) • Technique drift (e.g., obfuscations) • How do we keep up with the Joneses? • Batch vs. Online Learning
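
An online-learning sketch (assuming scikit-learn; the messages are made up): each incoming labeled batch nudges the model via partial_fit instead of retraining from scratch, which helps it track drifting spam content:

```python
# Online learning sketch for concept drift. HashingVectorizer avoids
# a fixed vocabulary, which also helps when spammers introduce new
# tokens. All messages below are hypothetical examples.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)
model = SGDClassifier()
classes = ["spam", "legitimate"]

def update(batch_messages, batch_labels):
    # Each call nudges the model toward the latest distribution,
    # rather than retraining from scratch (batch learning).
    X = vectorizer.transform(batch_messages)
    model.partial_fit(X, batch_labels, classes=classes)

update(["cheap meds now", "meeting at noon"], ["spam", "legitimate"])
update(["new crypto scam offer", "notes from class"], ["spam", "legitimate"])
print(model.predict(vectorizer.transform(["crypto meds offer"])))
# likely ['spam'], given the training batches above
```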

  22. New Classification Approaches • Filter Fusion • Compression-based Filtering • Network Behavioral Clustering
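
As a sketch of the compression-based idea: assign a message to the class whose corpus it adds the fewest compressed bytes to. The toy corpora below are stand-ins, and zlib is only one possible compressor:

```python
# Compression-based filtering sketch: append the new message to each
# class's corpus and see which class "explains" it better, i.e. adds
# fewer compressed bytes. The corpora here are toy stand-ins.
import zlib

SPAM_CORPUS = b"cheap meds buy now free prize win offer click here "
HAM_CORPUS = b"meeting agenda paper review draft notes lunch noon "

def compressed_size(data):
    return len(zlib.compress(data, 9))

def classify(message):
    overhead = {
        "spam": compressed_size(SPAM_CORPUS + message)
                - compressed_size(SPAM_CORPUS),
        "legitimate": compressed_size(HAM_CORPUS + message)
                      - compressed_size(HAM_CORPUS),
    }
    return min(overhead, key=overhead.get)

print(classify(b"free meds offer"))    # expected: 'spam'
print(classify(b"paper draft notes"))  # expected: 'legitimate'
```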

  23. Adversarial Classification • Classifiers assume a clear distinction between spam and legitimate features • Camouflaged messages • Mask spam content with legitimate content • Disrupt decision boundaries for classifiers

  24. Camouflage Attacks • Baseline performance • Accuracies consistently higher than 98% • Classifiers under attack • Accuracies degrade to between 50% and 70% • Retrained classifiers • Accuracies climb back to between 91% and 99%

  25. Camouflage Attacks (cont.) • Retraining postpones the problem, but it doesn’t solve it • We can identify features that are less susceptible to attack, but that’s simply another stalling technique

  26. Image Spam • What happens when an email does not contain textual features? • OCR is easily defeated • Classification using image properties
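
One hedged sketch of property-based features (assuming the Pillow library; the property set is illustrative, not the one from any particular study):

```python
# Sketch: derive classifier features from image properties rather
# than text, since OCR on obfuscated spam images is unreliable.
import os
from PIL import Image

def image_features(path):
    with Image.open(path) as img:
        width, height = img.size
        fmt = img.format
        # getcolors() returns None when there are more than maxcolors
        # distinct colors; spam images are often synthetic, with few.
        colors = img.convert("RGB").getcolors(maxcolors=1 << 16)
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height if height else 0.0,
        "file_size_bytes": os.path.getsize(path),
        "num_colors": len(colors) if colors else 1 << 16,
        "format": fmt,
    }
```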

  27. Overview • Introduction • Countering Email Spam • Problem Description • Classification History • Ongoing Research • Countering Web Spam • Problem Description • Classification History • Ongoing Research • Conclusions

  28. Countering Web Spam • What is web spam? • Traditional definition • Our definition • Between 13.8% and 22.1% of all web pages

  29. Ad Farms • Only contain advertising links (usually ad listings) • Elaborate entry pages used to deceive visitors

  30. Ad Farms (cont.) • Clicking on an entry page link leads to an ad listing • Ad syndicators provide the content • Web spammers create the HTML structures

  31. Parked Domains • Domain parking services • Provide place holders for newly registered domains • Allow ad listings to be used as place holders to monetize a domain • Inevitably, web spammers abused these services

  32. Parked Domains (cont.) • Functionally equivalent to Ad Farms • Both rely on ad syndicators for content • Both provide little to no value to their visitors • Unique Characteristics • Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.) • Typically for sale by owner (“Offer To Buy This Domain”)

  33. Parked Domains (cont.)

  34. Advertisements • Pages advertising specific products or services • Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

  35. Problem Description • Web spam detection can also be modeled as a binary text classification problem • Salton’s vector space model is quite common • Feature processing and performance evaluation are also quite similar • But what about feature generation…

  36. How do we generate features? • Sources of information • HTTP connections • Hosting IP addresses • Session headers • HTML content • Textual properties • Structural properties • URL linkage structure • PageRank scores • Neighbor properties
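
An illustrative sketch of a few such features (assuming the requests library; the exact feature set here is hypothetical, not the one used in any cited study):

```python
# Sketch: derive features from an HTTP session and the returned HTML,
# in the spirit of the sources listed above.
import re
import requests

def page_features(url):
    response = requests.get(url, timeout=10, allow_redirects=True)
    html = response.text
    visible_text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    return {
        # HTTP session features
        "num_redirects": len(response.history),
        "server_header": response.headers.get("Server", ""),
        "content_type": response.headers.get("Content-Type", ""),
        # HTML content features
        "html_length": len(html),
        "num_links": html.lower().count("<a "),
        "visible_text_ratio": len(visible_text) / max(len(html), 1),
    }
```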

  37. Classification History • Davison (2000) • Was the first to investigate link-based web spam • Built decision trees to successfully identify “nepotistic links” • Becchetti et al. (2005) • Revisited the use of decision trees to identify link-based web spam • Used link-based features such as PageRank and TrustRank scores

  38. Classification History • Drost and Scheffer (2005) • Used Support Vector Machines to classify web spam pages • Relied on content-based features as well as link-based features • Ntoulas et al. (2006) • Built decision trees to classify web spam • Used content-based features (e.g., fraction of visible content, compressibility, etc.)

  39. Classification History • Up to this point, previous web spam research was limited to small (on the order of a few thousand), private data sets • Webb et al. (2006) • Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages) • http://www.webbspamcorpus.org • Castillo et al. (2006) • Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)

  40. Classification History • Castillo et al. (2007) • Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set • Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)] • Webb et al. (2008) • Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively • Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set • Found that these classifiers are comparable to (and in many cases, better than) existing approaches

  41. Ongoing Research • Redirection • Phishing • Social Spam

  42. Redirection • 144,801 unique redirect chains (averaging 1.54 HTTP redirects per chain) • 43.9% of web spam pages use some form of HTML or JavaScript redirection
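
A simple sketch of detecting such redirection in page source (the regex patterns are illustrative and far from exhaustive, since spammers obfuscate heavily):

```python
# Flag common HTML meta-refresh and JavaScript redirection idioms.
import re

REDIRECT_PATTERNS = [
    r'<meta[^>]+http-equiv=["\']?refresh',        # HTML meta refresh
    r'(window|document|self|top)\.location\s*=',  # JS location assignment
    r'location\.(replace|assign)\s*\(',           # JS location methods
]

def uses_redirection(html):
    return any(re.search(p, html, re.IGNORECASE) for p in REDIRECT_PATTERNS)

print(uses_redirection('<meta http-equiv="refresh" content="0;url=http://x">'))
# -> True
```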

  43. Phishing • Interesting form of deception that affects email and web users • Another form of adversarial classification

  44. Social Spam • Comment spam • Bulletin spam • Message spam

  45. Conclusions • Email and web spam are currently two of the largest information security problems • Classification techniques offer an effective way to filter this low quality information • Spammers are extremely dynamic, generating various areas of important future research…

  46. Questions
