Anti-SPAM experience at LAL

Michel Jouvin, LAL / IN2P3, jouvin@lal.in2p3.fr

LAL Context
• Message router: Sendmail
  • Milter API to call an external program for filtering before delivery
• Message store: Execmail IMAP
  • Derived from Cyrus v1
• Mail clients capable of message filtering
  • Mulberry, Pine, Outlook, Netscape/Mozilla, Entourage…
Anti-SPAM at LAL - HEPix - Edinburgh 2004

Policy Decisions…
• Do virus and SPAM detection at server level
• Let the user choose the final processing when there is no security problem
  • Only for SPAM, not for viruses
• Virus: forbidden extensions rather than antivirus
  • Main virus threat is during the first hours/days, when the antivirus is not yet up to date
  • + : Proactive, low resource consumption
  • - : Some useful extensions are affected (e.g. .zip)
  • Antivirus runs on the desktop
• SPAM: tagged at server level with a SPAM probability (score)
  • Some predefined filters proposed for supported clients

… Policy Decisions
• Avoid black / grey lists
  • Effective for no more than a few months (spammers work around them)
  • Negative side effects on users (blacklisted ISPs)
  • Reliance on an uncontrolled critical service (the blacklist maintainer)

Virus Protection: MIMEDefang
• Configured to remove suspect parts based on their extensions
  • Recipient still receives the message, with a text replacing the attachment
  • One header (X-MIMEdefang-action) added to help filtering
• 2 classes of suspect extensions
  • Always junk (.scr, .pif…): simply discarded
  • Sometimes useful (.exe, .zip): quarantined, retrieval possible
• MIMEDefang can call other modules
  • Embedded Perl interpreter eases calling external modules
  • Can be used to call Amavis (antivirus), SpamAssassin…
  • Can restrict calls to external modules to certain messages
    • Don’t call SpamAssassin for large messages (> 100K): never SPAM
    • Provides a significant performance improvement
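The extension policy and the size cut-off described on this slide can be sketched in a few lines. This is only an illustration in Python, not the MIMEDefang filter used at LAL (which is written in Perl): the function names, the two extension sets (limited to the examples quoted on the slide) and the exact byte value chosen for "100K" are assumptions.

```python
import os

# Hypothetical extension classes, limited to the examples given on the slide.
ALWAYS_JUNK = {".scr", ".pif"}
SOMETIMES_USEFUL = {".exe", ".zip"}

# Assumed interpretation of the slide's "> 100K" limit, in bytes.
MAX_SCAN_SIZE = 100 * 1024


def attachment_action(filename):
    """Return 'discard', 'quarantine' or 'keep' for one attachment name."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in ALWAYS_JUNK:
        return "discard"
    if ext in SOMETIMES_USEFUL:
        return "quarantine"
    return "keep"


def should_run_spam_check(message_size):
    """Skip the comparatively expensive SPAM scan for large messages."""
    return message_size <= MAX_SCAN_SIZE
```

Deciding on the extension alone is what keeps the check cheap and independent of antivirus signature updates, at the price of quarantining some legitimate attachments.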

SPAM Detection: SpamAssassin…
• At LAL: Perl module called by MIMEDefang
  • No extra process, no startup cost for every message
• Dependent on other Perl modules
  • Experienced a bad problem with HTML because of an old HTML::Parse
• Several types of filtering
  • Rule-based
  • Bayesian analysis: based on message tokenization and statistics
  • Black / grey lists

… SPAM Detection: SpamAssassin
• Computes a score (likelihood of being SPAM)
  • Score >= 5 can be considered SPAM
  • Very few false positives: always related to misconfigured clients
• Adds headers (X-Spam-Score/Status) and an attachment (SpamAssassin.Report)
  • Header and attachment list the reasons behind the score
• Possibility to modify the subject
  • LAL: prefix the subject with (SPAM ****): number of * = score / 5
• Efficient filtering possible by looking at the headers
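A minimal Python sketch of the subject tagging and of a header-based filter, under stated assumptions: the star count is taken as the score divided by 5 and rounded down (the slide does not say how it is rounded), and the X-Spam-Score header is assumed to start with a plain number. The function names are hypothetical.

```python
from email import message_from_string

SPAM_THRESHOLD = 5.0  # the slide's "score >= 5"


def tag_subject(subject, score):
    """Prefix the subject with (SPAM ***), one star per 5 points of score."""
    if score < SPAM_THRESHOLD:
        return subject
    stars = "*" * int(score // 5)  # assumption: rounded down
    return f"(SPAM {stars}) {subject}"


def is_spam(raw_message):
    """Client-side style check that looks only at the X-Spam-Score header."""
    msg = message_from_string(raw_message)
    value = msg.get("X-Spam-Score")
    if value is None:
        return False
    try:
        # Assumption: the header value begins with the numeric score.
        return float(value.split()[0]) >= SPAM_THRESHOLD
    except (ValueError, IndexError):
        return False
```

A mail client rule does the equivalent of `is_spam`: it never re-analyses the message, it only reads what the server already computed.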

Bayesian Analysis…
• Rule-based analysis is less and less effective
  • Spammers are very responsive to rule improvements
  • LAL: 30% of SPAM undetected last winter
    • Bayesian analysis was inactive because of a misconfiguration
• Bayesian analysis: based on an (old) text analysis method
  • Message is tokenized: token characters in one set, token separators in another
  • Learning phase: for each token, count every time it appears in SPAM or in HAM (non-SPAM) and compute a probability (stored in a DB)
  • Analysis: compute a probability for the message from the probabilities of the tokens it contains
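The tokenize / learn / analyse cycle can be illustrated with a textbook naive Bayes sketch in Python. This is not SpamAssassin's implementation: its tokenizer and its method for combining token probabilities differ. The token character set, the smoothing, the equal weighting of SPAM and HAM and all names below are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumed token character set; everything else acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Return the set of distinct tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))


class TokenStats:
    """In-memory stand-in for the token database."""

    def __init__(self):
        self.spam = Counter()  # token -> number of SPAM messages containing it
        self.ham = Counter()   # token -> number of HAM messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Learning phase: count each token once per message."""
        (self.spam if is_spam else self.ham).update(tokenize(text))
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def token_probability(self, token):
        """Smoothed probability that a message containing the token is SPAM."""
        p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def message_probability(self, text):
        """Analysis: combine token probabilities, assuming independence."""
        log_odds = 0.0
        for token in tokenize(text):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1 / (1 + math.exp(-log_odds))
```

Because the counts come from the site's own mail, two sites end up with different token probabilities for the same word.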

… Bayesian Analysis
• Uses message headers and content
  • Important to train the filter with original (not forwarded) messages
• Not language sensitive
• Very difficult for spammers to work around
  • Every token database is unique
• Very few false positives
  • False positive: valid message with score >= 5
  • LAL: no false positives so far (a few weeks)

Bayesian Filter Administration
• Learning phase is critical
  • Initial learning with 1000s of SPAM and HAM messages
  • LAL initial set: 5000 messages (2/3 HAM, 1/3 SPAM)
  • Must cover message diversity (language, topic…) to avoid side effects
  • Messages used for learning must be carefully (manually) sorted into SPAM and HAM
• Learning must be renewed periodically
  • Token expiration protects against evolving patterns and limits DB size
  • Auto-learn feature helps keep the database accurate
  • Need to manually feed the filter with incorrectly classified messages (false positives or false negatives) to refine the database
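Token expiration can be pictured as dropping tokens that have not been seen for some time. The Python sketch below is only an illustration of the idea: the record layout, the `expire_tokens` name and the 30-day default are assumptions, and SpamAssassin's own expiry mechanism is more elaborate.

```python
from datetime import datetime, timedelta


def expire_tokens(db, now, max_age_days=30):
    """Drop tokens not seen within max_age_days; return how many were removed.

    db maps token -> {"spam": int, "ham": int, "last_seen": datetime}.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [tok for tok, rec in db.items() if rec["last_seen"] < cutoff]
    for tok in stale:
        del db[tok]
    return len(stale)
```

Removing stale tokens bounds the database size and lets the statistics follow the current SPAM vocabulary, which is why learning has to continue after the initial training.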

Conclusions
• Pattern matching is not enough; Bayesian analysis looks promising
  • Raised SPAM detection efficiency to > 90% with initial learning
  • Hope to reach at least 95% by refining the learning
• Takes time to converge; don’t make changes every day
  • SPAM profile / volume is not the same every day
  • Needs time to stabilize (auto-learning curve)
• Validate changes
  • Keep a reference set of SPAM and HAM (needs to be updated)
• Administration load still an open question
  • How to collect / process false positives / negatives from users?
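Validating a change against a reference set amounts to re-scoring known SPAM and HAM and comparing the rates before and after. A minimal Python sketch, assuming a hypothetical `score_fn` that returns the filter's score for a message and the threshold of 5 used in this talk:

```python
def evaluate(reference, score_fn, threshold=5.0):
    """Measure a filter on a labelled reference set.

    reference: iterable of (message, is_spam) pairs.
    score_fn:  callable returning a numeric score for a message.
    """
    tp = fp = fn = tn = 0
    for message, is_spam in reference:
        flagged = score_fn(message) >= threshold
        if is_spam and flagged:
            tp += 1
        elif is_spam:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        # Fraction of reference SPAM that was detected.
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of reference HAM wrongly flagged as SPAM.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Running the same reference set before and after a configuration change gives comparable numbers, which day-to-day mail does not, since its profile and volume vary.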
