Adaptive Filtering: One Year On

John Graham-Cumming • Research Director, Sophos’s Anti-Spam Task Force • Author, POPFile

Adaptive Filtering • Definition: An email filter that can be taught to recognize different types of mail without writing rules. • Most use some machine learning technique: • Naïve Bayesian Classification [1] • k-nearest neighbors (kNN) [2] • Support Vector Machines [3] • All provide some measure of “spamminess”
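As a rough illustration of what a “spamminess” score can look like, the sketch below trains a tiny naïve Bayes classifier on a few hand-written messages. The class name, the training corpus, the whitespace tokenizer and the add-one smoothing are all assumptions made for this example; none of them is taken from the filters named in the talk.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplest possible tokenizer: lowercase, split on whitespace.
    return text.lower().split()


class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.msg_counts[label] += 1

    def spamminess(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.msg_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior, add-one smoothed so an untrained class is not impossible.
            score = math.log((self.msg_counts[label] + 1) / (total_msgs + 2))
            for word in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the product.
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700), -700)
        return 1 / (1 + math.exp(diff))


nb = NaiveBayes()
nb.train("buy cheap viagra now", "spam")
nb.train("cheap pills buy now", "spam")
nb.train("dinner with mom last night", "ham")
nb.train("see you at dinner", "ham")

print(nb.spamminess("cheap viagra"))
print(nb.spamminess("dinner with mom"))
```

The user never writes a rule here: the score comes entirely from the word counts gathered during training.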

Machine Learning & Anti-spam • A little more than one year • Papers • Mar 1998: SpamCop: A Spam Classification & Organization Program [1] • Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2] • 2000: An evaluation of Naive Bayesian anti-spam filtering [3] • Aug 2002: A Plan for Spam [4] • Patents • Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set • Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email

Why now? • The “Grandma Problem” • Confluence of events: • Spam getting close to 50% of all mail [1] • Email reaching 1/3 of adults in US [2] • Fast processors can handle the processing load • No other good alternatives • Laws? • Migrate from SMTP? [3]

Two Routes • Open Source • Lots of open source anti-spam solutions • Many are “wannabe” solutions that simply implemented Paul Graham’s ideas • Some are interesting tools (bogofilter, POPFile, SpamBayes) • Commercial • Vendors now incorporating Adaptive Filtering into their anti-spam products • Classic tradeoff: • Free, open source, community supported • Fee, “productized”, vendor supported

Practical Open Source Filters • General mail filters [1] • Aug 1996: ifile • Aug 2002: POPFile • Oct 2002: dbacl • Spam Filters [2] • Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114 • SpamAssassin (added Bayesian in 2.5)

Mainstream Adaptive Filtering • General • SwiftFile (for Lotus Notes) [1] • Ella Pro (for Microsoft Outlook) [2] • Anti-spam Desktop • Mozilla 1.3, Eudora 6.0 • Microsoft MSN 8, Microsoft Outlook 2003 • AOL 9.0, Apple Mail.app (Jaguar) • Anti-spam Gateway • Sophos PureMessage 4.x • Prediction: By end of 2004 every major email client includes adaptive filtering

The Problems • Man-in-the-street Usability • False Positives • Over training • One man’s spam is another man’s ham • Internationalization

Usability • Proxy, plug-in and external filters are too complex • General user needs: • No need to understand the underlying mechanism • Complete integration with mail client • Obvious operation (e.g. spam is moved into a folder called Spam) • Automatic whitelisting (if I send to Mom, Mom is ok)
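The automatic whitelisting idea (“if I send to Mom, Mom is ok”) can be sketched in a few lines. The `AutoWhitelist` class and its method names are hypothetical and only illustrate the idea; a real mail client would also need to persist the address set and deal with forged From headers.

```python
from email.utils import getaddresses, parseaddr


class AutoWhitelist:
    """Remember every address the user has written to."""

    def __init__(self):
        self.known = set()

    def record_outgoing(self, to_header):
        # A To: header may list several recipients.
        for _name, addr in getaddresses([to_header]):
            if addr:
                self.known.add(addr.lower())

    def is_whitelisted(self, from_header):
        _name, addr = parseaddr(from_header)
        return addr.lower() in self.known


wl = AutoWhitelist()
wl.record_outgoing("Mom <mom@example.com>, dad@example.com")
print(wl.is_whitelisted("Mom <MOM@example.com>"))
print(wl.is_whitelisted("offers@example.net"))
```

The user does nothing beyond sending mail as usual, which fits the requirement that the mechanism stay invisible.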

False Positives • False Positive == Good mail identified as bad • False Negative == Spam identified as good • People tolerate false negatives, but hate false positives • Spam filters must guard against false positives: • Bias towards False Negatives (“A Plan for Spam”) • Cross check results (SpamBayes) • High spam threshold
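The effect of a high spam threshold can be seen on a handful of made-up scores. In this sketch the sample data, the function names and the two threshold values are assumptions chosen for illustration only; the point is that raising the threshold trades false positives for false negatives.

```python
def classify(spamminess, threshold):
    # Mail is called spam only when its score reaches the threshold.
    return "spam" if spamminess >= threshold else "ham"


def error_counts(scored_mail, threshold):
    """scored_mail is a list of (spamminess, true_label) pairs."""
    false_positives = sum(
        1 for score, label in scored_mail
        if label == "ham" and classify(score, threshold) == "spam"
    )
    false_negatives = sum(
        1 for score, label in scored_mail
        if label == "spam" and classify(score, threshold) == "ham"
    )
    return false_positives, false_negatives


# Invented scores: one legitimate message happens to look fairly spammy.
SAMPLE = [
    (0.05, "ham"), (0.30, "ham"), (0.70, "ham"),
    (0.60, "spam"), (0.85, "spam"), (0.99, "spam"),
]

for threshold in (0.5, 0.9):
    fp, fn = error_counts(SAMPLE, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this sample the higher threshold removes the one false positive and lets two spams through, which is the tradeoff the slide says people prefer.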

Over Training • Occurs when user loads up adaptive filter with lots more spam than ham • e.g. feeds entire spam archive into filter • Some adaptive filters then think everything is spam • For Naïve Bayes classifiers the “train on errors” methodology works well in practice. • User teaches filter only on mails it incorrectly classified • “No, that’s spam” or “No, that’s ham” button
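The “train on errors” loop can be sketched as follows. `TinyFilter` is a deliberately crude stand-in classifier (it just compares word counts, it is not naïve Bayes) and all names here are hypothetical; what matters is that `train_on_errors` only teaches the filter when its own guess was wrong.

```python
from collections import Counter


class TinyFilter:
    """Crude stand-in classifier: compares how often words were seen per class."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        spam_hits = sum(self.counts["spam"][w] for w in words)
        ham_hits = sum(self.counts["ham"][w] for w in words)
        # Ties go to ham, so an untrained filter never blocks mail.
        return "spam" if spam_hits > ham_hits else "ham"


def train_on_errors(filt, labelled_mail):
    """Train only on messages the filter got wrong; return the correction count."""
    corrections = 0
    for text, label in labelled_mail:
        if filt.classify(text) != label:
            filt.train(text, label)  # the "No, that's spam/ham" button
            corrections += 1
    return corrections


mail = [
    ("cheap pills now", "spam"),
    ("lunch tomorrow", "ham"),
    ("cheap pills today", "spam"),
    ("pills cheap cheap", "spam"),
]
filt = TinyFilter()
print(train_on_errors(filt, mail))
```

Three of the four messages are spam, yet only the first is ever trained on, so the filter is not flooded with spam examples.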

One man’s spam… • Can be hard to unsubscribe from legitimate bulk mail • Users tell spam filter that legitimate mail is spam • Creates false positives for other users in shared systems • e.g. I say CNET News email is spam, you want it • Ideal system has two parts • Gateway spam filter run by IT group • Individual preferences on each client

Internationalization • Tokenization non-trivial for some languages • In English words are “space separated” • Thisisnotthecaseinsomeotherlanguages: • Japanese (POPFile の特別な使い方) • Different punctuation • ¿Español? «Français» • UTF-8, Unicode • أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ when decoded with the wrong character set
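Both problems on this slide are easy to reproduce. The sketch below shows whitespace tokenization failing on the Japanese example, then reproduces the garbled Arabic by encoding it as Windows-1256 and decoding it as Latin-1. The `char_bigrams` helper is one possible fallback and is an assumption of this example, not a description of how POPFile or any other filter handles Japanese.

```python
english = "this is not the case"
japanese = "POPFile の特別な使い方"

# Splitting on whitespace works for English...
print(english.split())
# ...but leaves the Japanese phrase as one undivided token.
print(japanese.split())


def char_bigrams(text):
    # Overlapping two-character tokens: a crude fallback when there are no spaces.
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(char_bigrams("の特別な使い方"))

# Same bytes, wrong character set: Windows-1256 Arabic read as Latin-1.
arabic = "أخبار و تقارير"
garbled = arabic.encode("cp1256").decode("latin-1")
print(garbled)
```

A filter that tokenizes the garbled form still sees consistent tokens, but they no longer match tokens learned from the same words in a correctly decoded message.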

Spammer’s Response • Overwhelm filter with “good words” • Hide those good words from people • Use HTML as trickery toolbox • Three techniques: • And the Kitchen Sink • Invisible Ink • Camouflage • More in Sophos’s Field Guide to Spam [1]

And the Kitchen Sink • Throw in innocent words before or after the HTML <html><body>Viagra</body></html> Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom

And the Kitchen Sink • Spammer hopes reader concentrates on the spam message part • Ineffective because user gets to see the innocent words • Spammers need ways to hide the innocent words • So they’ve taken inspiration from search engine trickery…

Invisible Ink • Use HTML font colors to write white on white <body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom</font></body>

Invisible Ink • Easily spotted if filter groks HTML • Can confuse filters that just drop HTML tags • Spammers have noticed that Invisible Ink is being targeted • They’ve adapted…
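A filter that “groks HTML” can separate what the reader sees from what is hidden. The following is a minimal sketch using Python's standard `html.parser`; the `InvisibleInkDetector` class is hypothetical, it only understands `bgcolor` on `<body>` and `color` on `<font>`, and it compares color names as exact strings.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Collect text whose font color equals the page background color."""

    def __init__(self):
        super().__init__()
        self.background = "white"  # assumed default when no bgcolor is given
        self.color_stack = []      # font colors currently in effect
        self.hidden_text = []
        self.visible_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        elif tag == "font":
            color = attrs.get("color")
            inherited = self.color_stack[-1] if self.color_stack else None
            self.color_stack.append(color.lower() if color else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self.color_stack[-1] if self.color_stack else None
        if current == self.background:
            self.hidden_text.append(text)
        else:
            self.visible_text.append(text)


detector = InvisibleInkDetector()
detector.feed("<body bgcolor=white>Viagra<font color=white>Hi, Johnny!</font></body>")
detector.close()
print("visible:", detector.visible_text)
print("hidden: ", detector.hidden_text)
```

A filter that simply drops the tags would see all of the words as ordinary text; one that tracks colors can score only the visible part, or treat the presence of hidden text as a spam signal in itself. Exact string matching would not catch two colors that differ only slightly.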

Camouflage • Use very similar HTML colors <body bgcolor=#113333> Viagra <font color=#123939>some innocent words</font> </body>

Camouflage Hard to see, but “some innocent words” do appear

Pythagoras Spots Spam • Foreground and background colors are coordinates in 3D • Imagine a Red axis, a Green axis and a Blue axis • Similar colors are close • Dissimilar colors are far apart • Pythagoras’ Theorem (3D) [1] gives the color distance [Figure: Red, Green and Blue axes meeting at (00,00,00); the points (12,39,39) and (11,33,33) lie close together, while (FF,FF,00) is far from both]
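The color distance is a direct application of the 3D formula, treating the red, green and blue components as coordinates. In this sketch the function names and the cut-off of 50 are assumptions for illustration; the talk does not state what threshold a real filter should use.

```python
import math


def parse_hex_color(value):
    # "#113333" -> (0x11, 0x33, 0x33)
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(color_a, color_b):
    # Pythagoras in 3D: square root of the summed squared differences.
    a = parse_hex_color(color_a)
    b = parse_hex_color(color_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_camouflaged(foreground, background, min_distance=50.0):
    # min_distance is an arbitrary illustrative cut-off.
    return color_distance(foreground, background) < min_distance


print(color_distance("#123939", "#113333"))
print(color_distance("#FFFF00", "#000000"))
print(is_camouflaged("#123939", "#113333"))
```

For the two similar colors the component differences are 1, 6 and 6, so the distance is the square root of 73, about 8.5. Yellow on black is more than 360 apart.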

Spammers love HTML

Trick Trends - Two Increasing

Tricks Make Spam Spotting Easier • Bad news for spammers: • The harder you try to obscure your messages the easier they are to filter • Spam trickery becomes the spam fingerprint • Bad news for end users: • Spammers will react by making spam more innocent: “Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com”

The Filter Paradox • Do filters make spam more effective? • One spammer claimed on /. “Your filters help cut down on the complaints to ISPs […] you no longer complain to uce@ftc.gov, my access providers, or anyone else who might cause me problems” • Time will tell

The End • Following slides are for reference purposes

References • Slide 2 • http://www.wikipedia.org/wiki/Naive_Bayesian_classification • http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html • http://citeseer.nj.nec.com/tong00support.html

References • Slide 3 • http://citeseer.nj.nec.com/pantel98spamcop.html • http://citeseer.nj.nec.com/sahami98bayesian.html • http://citeseer.nj.nec.com/androutsopoulos00evaluation.html • http://www.paulgraham.com/spam.html • Slide 4 • Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004 • US Census Bureau, 2000 • One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt

References • Slide 5 • POPFile: http://popfile.sourceforge.net • ifile: http://www.nongnu.org/ifile/ • Search SourceForge and Freshmeat • Slide 6 • http://www.research.ibm.com/swiftfile/ • http://www.openfieldsoftware.com/Ella.asp

References • Slide 17 • http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/

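Pythagoras in 3D • Distance $\delta$ between two points $(x, y, z)$ and $(a, b, c)$ in space • Pythagoras: $\delta^2 = \alpha^2 + \beta^2$ • Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$ • $\beta^2 = (y-b)^2$ • $\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$ [Figure: the points $(x, y, z)$ and $(a, b, c)$ joined by $\delta$, with $\alpha$ and $\beta$ as the other two sides of the right triangle]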
