First Conference on Email and Anti-Spam (CEAS) Filtron : A Learning-Based Anti-Spam Filter Eirinaios Michelakis ( ernani@iit.demokritos.gr ), Ion Androutsopoulos ( ion@aueb.gr ), George Paliouras ( paliourg@iit.demokritos.gr ), George Sakkis ( gsakkis@rutgers.edu ),

First Conference on Email and Anti-Spam (CEAS)

Filtron: A Learning-Based Anti-Spam Filter

Eirinaios Michelakis (ernani@iit.demokritos.gr),

Ion Androutsopoulos (ion@aueb.gr),

George Paliouras (paliourg@iit.demokritos.gr),

George Sakkis (gsakkis@rutgers.edu),

Panagiotis Stamatopoulos (takis@di.uoa.gr)

Mountain View, CA, July 30th and 31st 2004

Outline
  • Spam Filtering: past, present and future
  • Anti-spam filtering with Filtron
  • In Vitro Evaluation
  • In Vivo Evaluation
  • Conclusions
Spam Filtering: past, present and future
  • Past:
    • Black-lists and white-lists of e-mail addresses
    • Handcrafted rules looking for suspicious keywords and patterns in headers
  • Present:
    • Machine learning-based filters
      • Mostly using Naïve Bayes classifier
      • Examples: Mozilla’s spam filter, POPFILE, K9
    • Signature based filtering (Vipul’s Razor)
  • Future:
    • Combination of several techniques (SpamAssassin)
Filtron: An overview
  • A multi-platform learning-based anti-spam filter.
  • Features for the simple user:
    • Personalized: based on her legitimate messages
    • Automatically updating black/white lists
    • Efficient: server-side filtering and interception rules
  • Features for the advanced user and the researcher:
    • Customizable learning component
      • Through the Weka open-source machine learning platform
    • Support for creating publicly available message collections
      • Privacy-preserving encoding of messages and user profiles
  • Portable: Implemented in Java and Tcl/Tk
  • Currently supported on POSIX-compatible mail servers (a port to MS Exchange Server is under way)
Filtron’s Architecture
[Architecture diagram; recoverable labels: Spam folders, Legitimate folders, Preprocessor, Attribute Selector, attribute set, black list, white list, Vectorizer, training vectors, Learner, induced classifier, User model]
Preprocessing
  • Break down mailbox(es) into distinct messages
  • Remove from every message:
    • mail headers
    • html tags
    • attached files
  • Remove messages with no textual content
  • Store 5 messages per sender
    • Avoids bias towards regular correspondents.
  • Remove duplicates
  • Encode messages (optional)
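
The preprocessing steps above can be sketched roughly as follows. This is a minimal Python illustration, not Filtron's own code (Filtron is implemented in Java and Tcl/Tk). The function name preprocess, the (sender, body) input format, the regex-based tag stripping, and the reading of "5 messages per sender" as an upper limit are all assumptions made for the example. Header and attachment removal is assumed to have happened already.

```python
import hashlib
import re
from collections import defaultdict

# Crude HTML tag matcher, good enough for an illustration only.
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(messages, max_per_sender=5):
    """Filter (sender, body) pairs; hypothetical helper, not Filtron's API.

    Assumes headers and attachments were stripped before this step.
    """
    kept = []
    seen_digests = set()
    per_sender = defaultdict(int)
    for sender, body in messages:
        # Remove HTML tags and collapse whitespace.
        text = " ".join(TAG_RE.sub(" ", body).split())
        if not text:
            continue  # message has no textual content
        # Treat messages with identical normalized text as duplicates.
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        # Assumed reading of the slide: keep at most N messages per sender.
        if per_sender[sender] >= max_per_sender:
            continue
        seen_digests.add(digest)
        per_sender[sender] += 1
        kept.append((sender, text))
    return kept
```

Capping the number of messages per sender keeps a few very frequent correspondents from dominating the training data, which is the bias the slide refers to.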
In Vitro Evaluation
  • We investigated the effect of:
    • Single-token versus multi-token attributes (n-grams for n=1,2,3)
    • Number of attributes (40-3000)
    • Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
    • Training corpus size (~ 10%-100% of full training corpus)
  • Cost-Sensitive Learning Formulation
    • Misclassifying a legitimate message as spam (LS) is λ times more serious an error than misclassifying a spam message as legitimate (SL)
    • Two usage scenarios (λ = 1, 9)
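
One common way to turn the cost ratio λ into a decision rule and an evaluation measure is sketched below. The slides do not spell out either formula, so the threshold λ/(1+λ), the weighted accuracy definition, and all function names are assumptions taken from the general cost-sensitive spam filtering literature, not a confirmed description of Filtron.

```python
def spam_threshold(lam):
    """Probability above which a message is called spam (assumed rule).

    Classifying as spam only when P(spam) / P(legit) > lam is equivalent
    to requiring P(spam) > lam / (1 + lam).
    """
    return lam / (1.0 + lam)


def classify(p_spam, lam):
    """Label a message given the classifier's estimated spam probability."""
    return "spam" if p_spam > spam_threshold(lam) else "legitimate"


def weighted_accuracy(n_ll, n_ss, n_ls, n_sl, lam):
    """Accuracy in which each legitimate message counts as lam messages.

    n_ll: legitimate classified as legitimate
    n_ss: spam classified as spam
    n_ls: legitimate classified as spam (the costly LS error)
    n_sl: spam classified as legitimate (the SL error)
    """
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)
```

Under this rule, λ = 1 gives a threshold of 0.5, while λ = 9 raises it to 0.9, so a message is blocked only when the classifier is very confident it is spam.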
In Vitro Evaluation (cont.)
  • Evaluation:
    • Four message collections (PU1, PU2, PU3, PUA)
    • Stratified 10-fold cross validation
  • Results:
    • No clear winner among learning algorithms with respect to accuracy
      • Efficiency (or other criteria) is more important for real usage.
    • Nevertheless, SVMs were consistently among the two best
    • No substantial improvement with n-grams (for n>1)
  • Refer to the TR for more details:
    • Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
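
Stratified 10-fold cross validation, as used in the evaluation above, splits the collection into ten parts that each preserve the overall legitimate/spam ratio. The following is a minimal sketch of such a split; the function name stratified_folds and the round-robin assignment are illustrative assumptions, and the actual experiments may have relied on Weka's own facilities.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of message indices with similar class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the test set while the remaining nine are used for training, and the reported score is the average over the ten runs.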
In Vivo Evaluation
  • Seven-month live evaluation by the third author
  • Training collection: PU3
    • 2313 legitimate / 1826 spam
  • Learning algorithm: SVM
  • Cost scenario: λ = 1
  • Retained attributes: 520 1-grams
    • Numeric values (term frequency)
  • No black-list was used
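
To illustrate what "1-gram attributes with term-frequency values" means, the sketch below maps a message to a numeric vector over a fixed attribute list. The tokenizer, the function names, and the use of raw counts as the term frequency are assumptions for the example; the slides do not say how Filtron tokenizes text or whether frequencies are normalized.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def vectorize(text, attributes):
    """Return one raw term count per retained attribute (1-gram)."""
    counts = Counter(tokenize(text))
    return [counts[attr] for attr in attributes]


# Hypothetical three-attribute set; the live evaluation retained 520.
example_vector = vectorize("Free money! free offer", ["free", "money", "viagra"])
```

Vectors of this form, one per training message, are what a learner such as an SVM would consume.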
Post-Mortem Analysis: False Positives
  • 52 false positives (out of 6732)
  • 52%: Automatically generated messages
    • subscription verifications, virus warnings, etc.
  • 22%: Very short messages
    • 3-5 words in message body
    • Along with attachments and hyperlinks
  • 26%: Short messages
    • 1-2 lines
    • Written in casual style, often exploited by spammers
    • With no attachments or hyperlinks
Post-Mortem Analysis: False Negatives
  • 173 false negatives (out of 6732)
  • 30%: “Hard Spam”
    • Little textual information, avoiding common suspicious word patterns
    • Many images and hyperlinks
    • Tricks to confuse tokenizers
  • 8%: Advertisements for pornographic sites with very casual and well-chosen vocabulary
  • 23%: Non-English messages
    • Under-represented in the training corpus
  • 30%: Encoded messages
    • BASE64 format; Filtron could not process it at that time
  • 6%: Hoax letters
    • Long formal letters (“tremendous business opportunity!”)
    • Many occurrences of the receiver’s full name
  • 3%: Short messages with unusual content
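
The BASE64-encoded false negatives above suggest decoding message bodies before tokenization. The following sketch shows how that could be done with Python's standard email package. It is an illustration of the idea, not Filtron's later implementation; the function name text_parts and the UTF-8 fallback are assumptions.

```python
from email import message_from_string


def text_parts(raw_message):
    """Return the decoded text of every text/* part of a raw message."""
    msg = message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, images, attachments
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name: fall back to UTF-8 (assumed default).
            texts.append(payload.decode("utf-8", errors="replace"))
    return texts
```

With the body decoded first, the same attribute extraction applies to encoded and plain messages alike.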
Conclusions
  • Signs of an arms race between spammers and content-based filters
  • Filtron’s performance deemed satisfactory, though it can be improved with:
    • More elaborate preprocessing to tackle spammers’ usual countermeasures (misspellings, uncommon words, text in images)
    • Regular retraining
  • Currently the most promising approach: combining machine learning with other filtering techniques
    • Collaborative filtering
    • Filtering at the transport layer