
A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail

Gordon Cormack and Thomas Lynam. Presented by Hui Fang.


Presentation Transcript


  1. A Study of Supervised Spam Detection Applied to Eight Months of Personal E-Mail • Gordon Cormack and Thomas Lynam • Presented by Hui Fang

  2. Feel free to interrupt when you have any question or comment!

  3. Detour: Some Background about E-Mail Spam • Some slides are adapted from the Tutorial on Junk Mail Filtering by Geoff Hulten and Joshua Goodman.

  4. What is Spam? • Typical legal definition • Unsolicited commercial email from someone without a pre-existing business relationship • Definition mostly used • Whatever the users think

  5. Unofficial Statistics of Spam (Feb. 3 to Feb. 12) • Spam is inconvenient, annoying, and wasteful of computer resources. • Its volume threatens to overwhelm our ability to recognize useful messages.

  6. Spam Detection • Ham vs. spam • Is this just text categorization? • What are the special challenges?

  7. Text classification alone is not enough • Spammers now often try to obscure text. • Special features are necessary. • E.g. subject line vs. body text • E.g. Mail in the middle of the night is more likely to be spam than mail in the middle of the day. • …

  8. Weather Report Guy • Content in Image Weather, Sunny, High 82, Low 81, Favorite…

  9. Secret Decoder Ring Dude • Another spam that looks easy • Is it?

  10. Secret Decoder Ring Dude • Character Encoding • HTML word breaking Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s
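
One way a filter might undo the two tricks on this slide is sketched below: strip the comment-like tags that break words apart, then decode the HTML character references. The function name and the regular expression are illustrative assumptions, not part of the original slides, and real spam would need a proper HTML parser.

```python
import html
import re

# Assumption: anything of the form <!...> is a junk, comment-like tag
# inserted only to split a word for naive tokenizers.
JUNK_TAG = re.compile(r"<![^>]*>")


def deobfuscate(text: str) -> str:
    """Remove comment-like tags, then decode HTML character references."""
    without_tags = JUNK_TAG.sub("", text)
    return html.unescape(without_tags)


print(deobfuscate("Phar&#109;acy Prod&#117;c<!LZJ>t<!LG>s"))
# prints "Pharmacy Products"
```

After this normalization the message contains ordinary words again, so a word-based classifier can see them.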

  11. Diploma Guy • Word Obscuring Dlpmoia Pragorm Caerte a mroe prosoeprus

  12. Diploma Guy • Word Obscuring Dipmloa Paogrrm Cterae a more presporous

  13. Diploma Guy • Word Obscuring Dimlpoa Pgorram Cearte a more poosperrus

  14. Diploma Guy • Word Obscuring Dpmloia Pragorm Caetre a more prorpeosus

  15. Diploma Guy • Word Obscuring Dlpmoia Pragorm Carete a mroe prorpseous
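
The scrambled variants on the Diploma Guy slides keep the first and last letter of each word and shuffle only the interior. A minimal sketch of one possible countermeasure, assuming that pattern holds: map every word to a signature made of its first letter, its sorted interior letters, and its last letter, so all scrambles of a word collide. The function name is hypothetical, and this is only an illustration, not a technique described in the slides.

```python
def scramble_signature(word: str) -> str:
    """First letter + sorted interior letters + last letter, lower-cased."""
    w = word.lower()
    if len(w) <= 3:
        # Nothing to shuffle in words this short.
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


variants = ["Dlpmoia", "Dipmloa", "Dimlpoa", "Dpmloia"]
target = scramble_signature("Diploma")
print(all(scramble_signature(v) == target for v in variants))
# prints True
```

Distinct words can share a signature, so a feature built this way is lossy; it is best treated as an extra feature next to the raw tokens.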

  16. More of Diploma Guy • Diploma Guy is good at what he does

  17. One Solution to Spam Detection • Machine Learning • Learn spam versus good

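  18. Naïve Bayes • Want: P(spam | words) • Use Bayes Rule: P(spam | words) = P(words | spam) P(spam) / P(words) • Assume independence (the probability of each word is independent of the others): P(words | spam) = P(word_1 | spam) × ... × P(word_n | spam)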
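
The Naïve Bayes idea on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions, not any particular filter's implementation: the training messages are invented, add-one (Laplace) smoothing is an added choice so that unseen words do not zero out a class, and scores are summed in log space to avoid underflow.

```python
import math
from collections import Counter


def train(messages):
    """messages: iterable of (list_of_words, label), label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab


def log_posterior(words, label, model):
    """log P(label) + sum of log P(word | label), up to a shared constant."""
    word_counts, class_counts, vocab = model
    total_msgs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_msgs)
    for w in words:
        # Add-one smoothing (an assumption of this sketch).
        score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
    return score


def classify(words, model):
    return max(("spam", "ham"), key=lambda c: log_posterior(words, c, model))


# Invented toy data, for illustration only.
toy = [
    ("free money now".split(), "spam"),
    ("only free offer".split(), "spam"),
    ("meeting agenda today".split(), "ham"),
    ("lunch today".split(), "ham"),
]
model = train(toy)
print(classify(["free", "money"], model))
print(classify(["lunch", "today"], model))
```

The denominator P(words) is dropped because it is the same for both classes and does not change which class scores higher.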

  19. A Bayesian Approach to Filtering Junk E-Mail (1998) - Sahami, Dumais, Heckerman, Horvitz • One of the first papers on using machine learning to combat spam • Used Naïve Bayes • Feature Space: Words, Phrases, Domain-Specific Features • Evaluation Data: ~1700 Messages, ~88% Spam, from a volunteer’s private e-mail

  20. A Bayesian Approach to Filtering Junk E-Mail (1998) - Sahami, Dumais, Heckerman, Horvitz • Hand-Crafted Features • 35 Phrases • ‘Free Money’ • ‘Only $’ • ‘be over 21’ • 20 Domain-Specific Features • Domain type of sender (.edu, .com, etc.) • Sender name resolutions (internal mail) • Has attachments • Time received • Percent of non-alphanumeric characters in subject • Best collection of heuristics discussed in literature • Without them: Spam precision 97.1%, Spam recall 94.3% • With them: Spam precision 100%, Spam recall 98.3%
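
A few of the hand-crafted features listed on this slide are easy to picture in code. The sketch below is a guess at how such features might be computed; the function names, the dictionary keys, and the choice to count spaces as non-alphanumeric are all assumptions of this example, not details taken from the paper.

```python
PHRASES = ("free money", "only $", "be over 21")  # phrases named on the slide


def subject_non_alnum_pct(subject: str) -> float:
    """Percent of characters in the subject that are not letters or digits.

    Assumption: whitespace counts as non-alphanumeric.
    """
    if not subject:
        return 0.0
    non_alnum = sum(1 for ch in subject if not ch.isalnum())
    return 100.0 * non_alnum / len(subject)


def extract_features(subject: str, body: str, sender: str) -> dict:
    """Build a small feature dictionary from one message (hypothetical layout)."""
    features = {
        "sender_domain_type": sender.rsplit(".", 1)[-1].lower(),
        "subject_non_alnum_pct": subject_non_alnum_pct(subject),
    }
    text = body.lower()
    for phrase in PHRASES:
        features["has_phrase:" + phrase] = phrase in text
    return features


feats = extract_features("$$$ FREE $$$", "Get FREE MONEY today", "bob@example.edu")
print(feats)
```

Features like these would be fed to the classifier alongside the word features.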

  21. A Plan for Spam (2002) – P. Graham • Widely cited in the open source community • Uses a heavily tuned version of Naïve Bayes • Feature Space: Words in header and body • Feature Selection: ~23,000 features • all that appeared more than 5 times • Evaluation Data: ~8000 messages from author; ~50% spam • Results: Spam precision 100%, Spam recall 99.5%
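
The feature-selection rule on this slide (keep every token that appeared more than 5 times) can be sketched as a simple frequency cut-off. The function name and the toy corpus are assumptions for illustration; tokenization of headers and bodies is left out.

```python
from collections import Counter


def select_features(tokenized_messages, min_count=5):
    """Return the set of tokens that occur more than min_count times overall."""
    counts = Counter(tok for msg in tokenized_messages for tok in msg)
    return {tok for tok, n in counts.items() if n > min_count}


# Invented corpus: "free" occurs 6 times, "rare" only 5 times.
corpus = [["free"]] * 6 + [["rare"]] * 5
print(select_features(corpus))
```

Only "free" survives the cut-off here, because "rare" appears exactly 5 times and the rule requires more than 5.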

  22. Algorithms Used in Spam Detection • Naïve Bayes reported to do very well • More complex algorithms have some gain

  23. Focus of the paper Which Algorithm is Best? • Very difficult to tell • No consistently-used good data set • No standard evaluation measures

  24. End of Detour

  25. Overview of the Paper • Present several evaluation measures for spam detection • Compare methods in six open-source spam filters • Analyze the experimental results

  26. Problem: Supervised Spam Detection

  27. Methods • Methods in six open-source spam filters • SpamAssassin • Bogofilter • CRM-114 • DSPAM • SpamBayes • SpamProbe

  28. Data • One person’s e-mail over eight months • From Aug. 2003 to March 2004 • Stored in the order received • 49,086 messages with judgements • 9,038 (18.4%) ham • 40,048 (81.6%) spam

  29. Evaluation Measures (1) • Contingency table of filter results • a: ham classified as ham (correct) [true negative] • b: spam classified as ham (spam misclassification) [false negative] • c: ham classified as spam (ham misclassification) [false positive] • d: spam classified as spam (correct) [true positive] • Accuracy: (a+d)/(a+b+c+d) • Spam recall: d/(b+d) • Spam precision: d/(d+c) • Ham misclassification rate: c/(a+c) • Spam misclassification rate: b/(b+d)
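
The five measures on this slide follow directly from the four counts a, b, c, d. A minimal sketch, with a hypothetical function name and made-up counts chosen only to make the arithmetic easy to check:

```python
def spam_metrics(a: int, b: int, c: int, d: int) -> dict:
    """a: ham kept, b: spam missed, c: ham wrongly flagged, d: spam caught."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "spam_recall": d / (b + d),
        "spam_precision": d / (d + c),
        "ham_misclassification_rate": c / (a + c),
        "spam_misclassification_rate": b / (b + d),
    }


# Invented counts, not results from the paper.
metrics = spam_metrics(a=90, b=20, c=10, d=180)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that spam recall and spam misclassification rate always sum to 1, since both are fractions of the same b+d spam messages.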

  30. Evaluation Measures (2) • Ham/Spam tradeoff curve, i.e. ROC curve • Single ham/spam tradeoff score: ROC area under the curve • The probability that a random spam message will receive a higher score than a random ham message
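
The probabilistic reading of the ROC area on this slide suggests a direct, if slow, way to compute it: compare every spam score with every ham score and count how often the spam score is higher. In this sketch the function name and the scores are invented, and counting ties as half a win is a common convention assumed here rather than something the slide states.

```python
def roc_area(spam_scores, ham_scores):
    """Fraction of (spam, ham) pairs in which the spam message scores higher."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(spam_scores) * len(ham_scores))


# Invented scores: higher means "more spam-like".
area = roc_area([0.9, 0.8, 0.4], [0.1, 0.5])
print(area)
```

With these scores, five of the six pairs are ordered correctly, so the area is 5/6. The pairwise loop is quadratic; a rank-based computation would be preferable for a corpus of tens of thousands of messages.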

  31. Evaluation Measures (3) • Ham/Spam learning curve

  32. Misclassification by Genre • Not all types of ham are equal • Some are more likely to be misclassified • Some are more likely to be missed if filtered • Some are more valuable • Spam can be classified by genre in the same way

  33. Conclusion • Presents several possible evaluation measures for spam detection • Compares several spam detection methods • Analyzes the experimental results • However, it would be more interesting to compare the performance of different algorithms (e.g. NB vs. SVM).

  34. The End • Thank you!
