
Anti-SPAM experience at LAL


  1. Anti-SPAM experience at LAL
     Michel Jouvin, LAL / IN2P3, jouvin@lal.in2p3.fr

  2. LAL Context
     • Message router: Sendmail
       • Milter API to call an external program for filtering before delivery
     • Message store: Execmail IMAP
       • Derived from Cyrus v1
     • Mail clients capable of message filtering
       • Mulberry, Pine, Outlook, Netscape/Mozilla, Entourage…
     Anti-SPAM at LAL - HEPix - Edinburgh 2004

  3. Policy Decisions…
     • Do virus and SPAM detection at server level
     • Let the user choose final processing if not a security problem
       • Only for SPAM, not for viruses
     • Viruses: forbidden extensions rather than antivirus
       • Viruses are the main threat during the first hours/days, when the antivirus is not yet up to date
       • +: proactive, low resource consumption
       • -: blocks some useful extensions (e.g. .zip)
       • Antivirus runs on the desktop
     • SPAM: tagged at server level with a SPAM probability (score)
       • Some predefined filters proposed for supported clients

  4. … Policy Decisions
     • Avoid black / grey lists
       • Effective for no more than a few months (spammers work around them)
       • Negative side effects on users (blacklisted ISPs)
       • Reliance on an uncontrolled critical service (the blacklist maintainer)

  5. Virus Protection: MIMEDefang
     • Configured to remove suspect parts based on their extensions
       • Recipients still receive the message, with a text replacing the attachment
       • One header (X-MIMEdefang-action) added to help filtering
     • 2 classes of suspect extensions
       • Always junk mail (.scr, .pif…): just thrown away…
       • Sometimes useful (.exe, .zip): quarantined, retrieval possible
     • MIMEDefang can call other modules
       • Embedded Perl interpreter eases calling external modules
       • Can be used to call Amavis (antivirus), SpamAssassin…
     • Can restrict calls to external modules to certain messages
       • Don’t call SpamAssassin for large messages (> 100K): never SPAM
       • Provides a significant performance improvement
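
The extension policy and the size cutoff on this slide can be sketched as follows. This is only an illustration in Python: the real filter at LAL is a MIMEDefang filter written in Perl, and the function names, the exact extension sets, and the reading of "100K" as 100 KiB are assumptions, not taken from the slides.

```python
import os

# Hypothetical extension classes; the slide names only these examples.
ALWAYS_JUNK = {".scr", ".pif"}
SOMETIMES_USEFUL = {".exe", ".zip"}

# Assumption: "> 100K" is interpreted as 100 KiB.
SPAM_CHECK_MAX_BYTES = 100 * 1024


def attachment_action(filename):
    """Return 'drop', 'quarantine' or 'accept' based on the file extension."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in ALWAYS_JUNK:
        return "drop"        # always junk: thrown away
    if ext in SOMETIMES_USEFUL:
        return "quarantine"  # sometimes useful: kept for later retrieval
    return "accept"


def should_run_spam_check(message_size_bytes):
    """Skip the (expensive) spam scan for large messages."""
    return message_size_bytes <= SPAM_CHECK_MAX_BYTES
```

Matching on the lowercased extension keeps the check cheap, which fits the "low resource consumption" argument made for this approach.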

  6. SPAM Detection: SpamAssassin…
     • At LAL: Perl module called by MIMEDefang
       • No extra process, no startup cost for every message
       • Dependent on other Perl modules
       • Experienced a bad problem with HTML because of an old HTML::Parse
     • Several types of filtering
       • Rule-based
       • Bayesian analysis: based on message tokenization and statistics
       • Black / grey lists

  7. … SPAM Detection: SpamAssassin
     • Computes a score (probability of being SPAM)
       • Score >= 5 can be considered SPAM
       • Very few false positives: always related to misconfigured clients
     • Adds headers (X-Spam-Score/Status) and an attachment (SpamAssasin.Report)
       • Header and attachment list the reasons behind the score
     • Possibility to modify the subject
       • LAL: prefix the subject with (SPAM ****): number of * = score / 5
     • Efficient filtering possible by looking at the headers
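
A minimal sketch of the subject tagging and of a header-based client filter is shown below. It assumes that "score / 5" is rounded down and that the X-Spam-Score header value starts with the numeric score; both are assumptions, and the function names are hypothetical.

```python
from email import message_from_string

SPAM_THRESHOLD = 5.0  # from the slide: score >= 5 is treated as SPAM


def tag_subject(subject, score):
    """Prefix the subject with (SPAM ***), one star per 5 points of score."""
    if score < SPAM_THRESHOLD:
        return subject
    stars = "*" * int(score // 5)  # assumption: integer division
    return f"(SPAM {stars}) {subject}"


def is_spam(raw_message):
    """Client-side check that reads only the X-Spam-Score header."""
    msg = message_from_string(raw_message)
    value = msg.get("X-Spam-Score")
    if value is None:
        return False
    try:
        # Assumed header format: the numeric score comes first.
        return float(value.split()[0]) >= SPAM_THRESHOLD
    except (ValueError, IndexError):
        return False
```

Because the decision uses only a header, a mail client can apply it without scanning the message body, which is what makes the client-side filters cheap.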

  8. Bayesian Analysis…
     • Rule-based analysis is less and less efficient
       • Spammers are very responsive to rule improvements
       • LAL: 30% of SPAM undetected last winter
       • Bayesian analysis was inactive because of a misconfiguration
     • Bayesian analysis: based on an (old) text analysis method
       • Message is tokenized: tokens are drawn from one set of characters, token separators from another
       • Learning phase: for each token, count every time it appears in a SPAM or a HAM (non-SPAM) message and compute a probability (stored in a DB)
       • Analysis: compute a probability for the message from the probabilities of the tokens it contains
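
The three steps above (tokenize, learn, analyze) can be sketched with a textbook naive Bayes classifier. This is not SpamAssassin's actual implementation: the token character set, the add-one smoothing, the equal priors, and the class and method names are all assumptions chosen for illustration, and an in-memory counter stands in for the token DB.

```python
import math
import re
from collections import Counter

# Assumed token character set; everything else acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


class BayesFilter:
    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        """P(spam | token) with add-one smoothing and equal priors."""
        p_tok_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        p_tok_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return p_tok_spam / (p_tok_spam + p_tok_ham)

    def spam_probability(self, text):
        # Combine per-token probabilities in log-odds space.
        log_odds = 0.0
        for token in set(tokenize(text)):
            p = self.token_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A token never seen during learning gets a probability of 0.5 and so does not move the result, which is why the learning set has to cover the real diversity of messages.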

  9. … Bayesian Analysis
     • Uses message headers and content
       • Important to teach the filter with original (not forwarded) messages
     • Not language sensitive
     • Very difficult for spammers to work around
       • Every token database is unique
     • Very few false positives
       • False positive: valid message with score >= 5
       • LAL: no false positives so far (a few weeks)

  10. Bayesian Filter Administration
      • Learning phase is critical
        • Initial learning with 1000s of SPAM and HAM messages
        • LAL initial set: 5000 messages (2/3 HAM, 1/3 SPAM)
        • Must cover message diversity to avoid side effects (language, topic…)
        • Messages used for learning must be carefully (manually) sorted into SPAM and HAM
      • Learning must be renewed periodically
        • Token expiration protects against evolving patterns and limits DB size
        • Auto-learn feature helps keep the database accurate
        • Need to manually feed the filter with incorrectly classified messages (false positives or false negatives) to refine the database
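
Token expiration can be pictured as dropping tokens that have not been seen recently. The sketch below is purely illustrative: the age-based rule, the 90-day default, and the `last_seen` table are hypothetical and do not describe how SpamAssassin actually decides what to expire.

```python
from datetime import datetime, timedelta


def expire_tokens(last_seen, now, max_age_days=90):
    """Return a copy of the token table without stale entries.

    last_seen maps each token to the datetime it was last observed.
    max_age_days is a hypothetical retention period.
    """
    cutoff = now - timedelta(days=max_age_days)
    return {tok: seen for tok, seen in last_seen.items() if seen >= cutoff}
```

Dropping stale tokens serves both purposes named on the slide: old spam vocabulary stops influencing scores, and the database stays bounded in size.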

  11. Conclusions
      • Pattern matching is not enough, Bayesian analysis looks promising
        • Raised SPAM detection efficiency to > 90% with initial learning
        • Hope to reach at least 95% by refining learning
      • Takes time to converge, don’t make changes every day
        • SPAM profile / volume not the same every day
        • Need time to stabilize (auto-learning curve)
      • Validate changes
        • Keep a reference set of SPAM and HAM (needs to be updated)
      • Administration load still a question
        • How to collect / process false positives / negatives from users?
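
One way to validate a configuration change against a reference set is to measure the detection rate and the false positive rate before and after. The sketch below assumes a hypothetical `classify` callable that returns a score for a message, and a reference set of hand-sorted `(message, is_spam)` pairs; neither comes from the slides.

```python
def evaluate(classify, reference_set, threshold=5.0):
    """Score a reference set and report detection and false positive rates.

    classify(message) -> numeric score (hypothetical callable)
    reference_set: iterable of (message, is_spam) pairs, sorted by hand
    """
    spam_total = spam_caught = ham_total = false_positives = 0
    for message, is_spam in reference_set:
        flagged = classify(message) >= threshold
        if is_spam:
            spam_total += 1
            spam_caught += flagged
        else:
            ham_total += 1
            false_positives += flagged
    return {
        "detection_rate": spam_caught / spam_total if spam_total else 0.0,
        "false_positive_rate": false_positives / ham_total if ham_total else 0.0,
    }
```

Running the same reference set through the old and the new configuration gives two comparable pairs of numbers, which is less noisy than watching day-to-day mail where the SPAM profile and volume vary.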
