
Adaptive Filtering: One Year On

John Graham-Cumming. Research Director, Sophos’s Anti-Spam Task Force. Author, POPFile.


Presentation Transcript


  1. Adaptive Filtering: One Year On • John Graham-Cumming • Research Director, Sophos’s Anti-Spam Task Force • Author, POPFile

  2. Adaptive Filtering • Definition: An email filter that can be taught to recognize different types of mail without writing rules. • Most use some machine learning technique: • Naïve Bayesian Classification [1] • kNN [2] • Support Vector Machines [3] • All provide some measure of “spamminess”
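A minimal sketch of how a naïve Bayesian classifier might turn training mail into a "spamminess" score. The whitespace tokenizer, add-one smoothing and equal class priors are illustrative assumptions, not details of any filter named in this talk.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes text classifier (spam vs. ham)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase, whitespace-separated tokens.
        return text.lower().split()

    def train(self, text, label):
        """Teach the filter one message; label is 'spam' or 'ham'."""
        for token in self.tokenize(text):
            self.counts[label][token] += 1
            self.totals[label] += 1

    def spamminess(self, text):
        """Return an estimate of P(spam | text) in [0, 1], assuming equal priors."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        log_odds = 0.0
        for token in self.tokenize(text):
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + vocab)
            p_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + vocab)
            log_odds += math.log(p_spam) - math.log(p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

The filter is taught by example only: calling `train(text, "spam")` or `train(text, "ham")` is the whole interface, which is what "without writing rules" amounts to in practice.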

  3. Machine Learning & Anti-spam • A little more than one year • Papers • Mar 1998: SpamCop: A Spam Classification & Organization Program [1] • Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2] • 2000: An evaluation of Naive Bayesian anti-spam filtering [3] • Aug 2002: A Plan for Spam [4] • Patents • Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set • Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email

  4. Why now? • The “Grandma Problem” • Confluence of events: • Spam getting close to 50% of all mail [1] • Email reaching 1/3 of adults in US [2] • Fast processors can handle the processing load • No other good alternatives • Laws? • Migrate from SMTP? [3]

  5. Two Routes • Open Source • Lots of open source anti-spam solutions • Many are “wannabe” solutions that simply implemented Paul Graham’s ideas • Some are interesting tools (bogofilter, POPFile, SpamBayes) • Commercial • Vendors now incorporating Adaptive Filtering into their anti-spam products • Classic tradeoff: • Free, open source, community supported • Fee, “productized”, vendor supported

  6. Practical Open Source Filters • General mail filters [1] • Aug 1996: ifile • Aug 2002: POPFile • Oct 2002: dbacl • Spam Filters [2] • Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114 • SpamAssassin (added Bayesian in 2.5)

  7. Mainstream Adaptive Filtering • General • SwiftFile (for Lotus Notes) [1] • Ella Pro (for Microsoft Outlook) [2] • Anti-spam Desktop • Mozilla 1.3, Eudora 6.0 • Microsoft MSN 8, Microsoft Outlook 2003 • AOL 9.0, Apple Mail.app (Jaguar) • Anti-spam Gateway • Sophos PureMessage 4.x • Prediction: By the end of 2004 every major email client will include adaptive filtering

  8. The Problems • Man-in-the-street Usability • False Positives • Over training • One man’s spam is another man’s ham • Internationalization

  9. Usability • Proxy, plug-in and external filters are too complex • General user needs: • Not to have to understand the underlying mechanism • Complete integration with mail client • Obvious operation (e.g. spam is moved into a folder called Spam) • Automatic whitelisting (if I send to Mom, Mom is ok)
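The automatic whitelisting idea ("if I send to Mom, Mom is ok") can be sketched with Python's standard `email` package. The helper names `recipients_of` and `is_whitelisted` and the sample messages are hypothetical, and a real mail client would read its Sent folder rather than inline strings.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def recipients_of(sent_message):
    """Return the lowercased To and Cc addresses of a message the user sent."""
    headers = sent_message.get_all("To", []) + sent_message.get_all("Cc", [])
    return {addr.lower() for _name, addr in getaddresses(headers) if addr}


def is_whitelisted(incoming_message, whitelist):
    """True if the sender is someone the user has previously written to."""
    _name, sender = parseaddr(incoming_message.get("From", ""))
    return sender.lower() in whitelist


# Hypothetical sample data: one sent message and two incoming ones.
sent = message_from_string("To: Mom <mom@example.com>\nSubject: Hi\n\nSee you soon")
from_mom = message_from_string("From: Mom <MOM@example.com>\nSubject: Re: Hi\n\nLove, Mom")
from_stranger = message_from_string("From: deals@example.net\nSubject: Offer\n\nBuy now")

whitelist = recipients_of(sent)
```

Whitelisted senders would bypass the adaptive filter entirely, so the user never has to manage the list by hand.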

  10. False Positives • False Positive == Good mail identified as bad • False Negative == Spam identified as good • People tolerate false negatives, but hate false positives • Spam filters must guard against false positives: • Bias towards False Negatives (“A Plan for Spam”) • Cross check results (SpamBayes) • High spam threshold
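One way to bias a filter towards false negatives is sketched below, loosely in the spirit of "A Plan for Spam": ham occurrences are weighted more heavily than spam occurrences, and a message is flagged only when the combined score clears a high threshold. The function names, the neutral value for unseen tokens and the default constants are assumptions for illustration.

```python
def token_spam_probability(ham_count, spam_count, n_ham, n_spam, ham_bias=2.0):
    """Per-token spam probability with ham occurrences weighted by ham_bias.

    n_ham and n_spam are the numbers of ham and spam messages trained on
    (both assumed to be greater than zero).
    """
    g = min(1.0, ham_bias * ham_count / n_ham)
    b = min(1.0, spam_count / n_spam)
    if g + b == 0:
        return 0.5  # assumption: a token never seen before is neutral evidence
    # Clamp so no single token can be treated as absolute proof.
    return max(0.01, min(0.99, b / (g + b)))


def combine(probabilities):
    """Combine per-token probabilities, treating tokens as independent."""
    spam, ham = 1.0, 1.0
    for p in probabilities:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


def is_spam(probabilities, threshold=0.9):
    """Flag a message as spam only when the score clears a high threshold."""
    return combine(probabilities) > threshold
```

With `ham_bias=2.0`, a token seen equally often in ham and spam scores about 0.33 rather than 0.5, so borderline mail leans towards the inbox.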

  11. Over Training • Occurs when user loads up adaptive filter with far more spam than ham • e.g. feeds entire spam archive into filter • Some adaptive filters then think everything is spam • For Naïve Bayes classifiers the “train on errors” methodology works well in practice. • User teaches filter only on mails it incorrectly classified • “No, that’s spam” or “No, that’s ham” button
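The "train on errors" loop can be sketched as follows. `TinyFilter` is a deliberately crude stand-in classifier invented for this example; only `train_on_errors` illustrates the methodology, and it works with any object that offers `classify` and `train`.

```python
from collections import Counter


class TinyFilter:
    """Stand-in classifier: votes by how often each token was seen per class."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        spam_votes = sum(self.counts["spam"][t] for t in tokens)
        ham_votes = sum(self.counts["ham"][t] for t in tokens)
        return "spam" if spam_votes > ham_votes else "ham"  # ties go to ham

    def train(self, text, label):
        self.counts[label].update(text.lower().split())


def train_on_errors(mail_filter, labelled_mail):
    """Train only on messages the filter got wrong; return how many that was."""
    corrections = 0
    for text, true_label in labelled_mail:
        if mail_filter.classify(text) != true_label:
            # Equivalent to the user clicking "No, that's spam" or "No, that's ham".
            mail_filter.train(text, true_label)
            corrections += 1
    return corrections


# Hypothetical mail stream with the user's verdict on each message.
mail = [
    ("buy viagra now", "spam"),
    ("buy viagra cheap", "spam"),
    ("lunch tomorrow", "ham"),
    ("viagra offer", "spam"),
]
spam_filter = TinyFilter()
corrections = train_on_errors(spam_filter, mail)
```

Because correctly classified mail is never fed back in, a large spam archive cannot swamp the ham statistics.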

  12. One man’s spam… • Can be hard to unsubscribe from legitimate bulk mail • Users tell spam filter that legitimate mail is spam • Creates false positives for other users in shared systems • e.g. I say CNET News email is spam, you want it • Ideal system has two parts • Gateway spam filter run by IT group • Individual preferences on each client

  13. Internationalization • Tokenization non-trivial for some languages • In English words are “space separated” • Thisisnotthecaseinsomeotherlanguages: • Japanese (POPFile の特別な使い方) • Different punctuation • ¿Español? «Français» • UTF-8, Unicode • أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ when decoded with the wrong character set
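The garbled Arabic on this slide is consistent with Windows-1256 bytes being displayed as Latin-1; that diagnosis is an assumption drawn from the byte values, not something the slide states. The sketch below reverses the damage using Python's standard `cp1256` and `latin-1` codecs and shows why a whitespace tokenizer (a hypothetical helper) fails on text without spaces.

```python
garbled = "ÃÎÈÇÑ æ ÊÞÇÑíÑ"

# Assumption: the original bytes were Windows-1256 (Arabic) but were decoded
# as Latin-1. Re-encoding as Latin-1 restores the bytes; decoding them with
# the right codec restores the text.
recovered = garbled.encode("latin-1").decode("cp1256")


def whitespace_tokens(text):
    """Naive tokenizer that only works for space-separated languages."""
    return text.split()


print(recovered)
print(whitespace_tokens(recovered))
print(whitespace_tokens("Thisisnotthecaseinsomeotherlanguages"))
```

A filter that tokenizes the undecoded bytes would learn tokens such as "ÃÎÈÇÑ" instead of the real words, so character set handling has to happen before tokenization.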

  14. Spammer’s Response • Overwhelm filter with “good words” • Hide those good words from people • Use HTML as trickery toolbox • Three techniques: • And the Kitchen Sink • Invisible Ink • Camouflage • More in Sophos’s Field Guide to Spam [1]

  15. And the Kitchen Sink • Throw in innocent words before or after the HTML:
      <html><body>Viagra</body></html>
      Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

  16. And the Kitchen Sink • Spammer hopes reader concentrates on the spam message part • Ineffective because user gets to see the innocent words • Spammers need ways to hide the innocent words • So they’ve taken inspiration from search engine trickery…

  17. Invisible Ink • Use HTML font colors to write white on white:
      <body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom</font></body>

  18. Invisible Ink • Easily spotted if filter groks HTML • Can confuse filters that just drop HTML tags • Spammers have noticed that Invisible Ink is being targeted • They’ve adapted…
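A filter that "groks HTML" could spot Invisible Ink along these lines. This is a minimal sketch built on Python's standard `html.parser`; the class name, the default colors and the exact-string color comparison are assumptions, and it handles only `bgcolor` on `<body>` and `color` on `<font>`.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Collects text whose font color exactly matches the page background."""

    def __init__(self):
        super().__init__()
        self.background = "white"     # assumption: default background is white
        self.color_stack = ["black"]  # assumption: default text color is black
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        elif tag == "font":
            # A <font> tag without a color inherits the enclosing color.
            color = attrs.get("color") or self.color_stack[-1]
            self.color_stack.append(color.lower())

    def handle_endtag(self, tag):
        if tag == "font" and len(self.color_stack) > 1:
            self.color_stack.pop()

    def handle_data(self, data):
        if data.strip() and self.color_stack[-1] == self.background:
            self.hidden_text.append(data.strip())


detector = InvisibleInkDetector()
detector.feed(
    "<body bgcolor=white>Viagra<font color=white>Hi, Johnny! "
    "See you soon, love Mom</font></body>"
)
detector.close()
print(detector.hidden_text)
```

Comparing color strings literally means `white` and `#ffffff` are treated as different colors, so a practical filter would normalize colors before comparing them.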

  19. Camouflage • Use very similar HTML colors:
      <body bgcolor=#113333>
      <font color=yellow>Viagra</font>
      <font color=#123939>some innocent words</font>
      </body>

  20. Camouflage • Hard to see, but “some innocent words” do appear

  21. Pythagoras Spots Spam • Foreground and background colors are coordinates in 3D • Imagine a Red axis, a Green axis and a Blue axis • Similar colors are close • Dissimilar colors are far apart • Pythagoras’ Theorem in 3D (see slide 31) gives the color distance • [Figure: RGB cube with Red, Green and Blue axes and origin (00,00,00), plotting the points (FF,FF,00), (12,39,39) and (11,33,33)]
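The color-distance check can be sketched with the colors from the Camouflage example. The helper names and the threshold of 30 are illustrative assumptions; `math.dist` requires Python 3.8 or later, and named colors such as `yellow` are written here in their hex form `#FFFF00`.

```python
import math


def parse_hex_color(value):
    """Convert '#RRGGBB' into an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(first, second):
    """Euclidean (Pythagorean) distance between two colors in RGB space."""
    return math.dist(parse_hex_color(first), parse_hex_color(second))


def is_camouflaged(foreground, background, threshold=30.0):
    # Assumption: the threshold is an arbitrary cut-off chosen for illustration.
    return color_distance(foreground, background) < threshold


background = "#113333"
print(color_distance("#123939", background))  # innocent words: very close
print(color_distance("#FFFF00", background))  # yellow "Viagra": far away
```

For the slide's colors the squared distance between `#123939` and `#113333` is 1 + 36 + 36 = 73, so the hidden text sits under 9 units from the background, while yellow is more than 300 units away.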

  22. Spammers love HTML

  23. Trick Trends - Two Increasing

  24. Tricks Make Spam Spotting Easier • Bad news for spammers: • The harder you try to obscure your messages the easier they are to filter • Spam trickery becomes the spam fingerprint • Bad news for end users: • Spammers will react by making spam more innocent: “Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com”

  25. The Filter Paradox • Do filters make spam more effective? • One spammer claimed on /. “Your filters help cut down on the complaints to ISPs […] you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems” • Time will tell

  26. The End • Following slides are for reference purposes

  27. References • Slide 2 • [1] http://www.wikipedia.org/wiki/Naive_Bayesian_classification • [2] http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html • [3] http://citeseer.nj.nec.com/tong00support.html

  28. References • Slide 3 • [1] http://citeseer.nj.nec.com/pantel98spamcop.html • [2] http://citeseer.nj.nec.com/sahami98bayesian.html • [3] http://citeseer.nj.nec.com/androutsopoulos00evaluation.html • [4] http://www.paulgraham.com/spam.html • Slide 4 • [1] Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004 • [2] US Census Bureau, 2000 • [3] One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt

  29. References • Slide 6 • [1] POPFile: http://popfile.sourceforge.net • ifile: http://www.nongnu.org/ifile/ • [2] Search SourceForge and Freshmeat • Slide 7 • [1] http://www.research.ibm.com/swiftfile/ • [2] http://www.openfieldsoftware.com/Ella.asp

  30. References • Slide 14 • [1] http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/

  31. Pythagoras in 3D • Distance $\delta$ between two points $(x, y, z)$ and $(a, b, c)$ in space • Pythagoras: $\delta^2 = \alpha^2 + \beta^2$ • Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$ • $\beta^2 = (y-b)^2$ • $\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$
