A False Positive Safe Neural Network for Spam Detection

Alexandru Catalin Cosoi (acosoi@bitdefender.com)

Does this look familiar?

Anatrim

Oh boy, it’s getting worse!!!

Bad Bad Spammer!!!
• Databases:
  • D: Random legitimate text
  • D1: Different rephrasings of a certain spam phrase
  • D2: Different rephrasings of another spam phrase
  • …
  • Dn: Different rephrasings of another spam phrase
• Create spam message script:
  • Choose a random phrase from D1
  • Choose random text from D
  • Choose a random phrase from D2
  • Choose random text from D
  • …
  • Choose a random phrase from Dn
  • Send message.
• Appeared as a consequence of botnets
• 40 samples of different subjects
• 50 samples of different titles
• 30 samples of different titles (part II)
• 60000 different combinations (40 × 50 × 30)
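The message-building script above can be sketched in a few lines of Python. This is only an illustration of the combinatorics, not the spammers' actual tooling: the pool contents and the names `count_combinations` and `build_message` are made up, and only the pool sizes (40, 50 and 30) come from the slide.

```python
import random
from math import prod

# Placeholder pools: only the sizes (40, 50, 30) come from the slide;
# the strings themselves are invented for illustration.
subjects = [f"subject variant {i}" for i in range(40)]
titles = [f"title variant {i}" for i in range(50)]
titles_part2 = [f"title (part II) variant {i}" for i in range(30)]
legit_text = [  # stands in for database D (random legitimate text)
    "Thanks for sending the report.",
    "See you at the meeting tomorrow.",
    "The weather was lovely last week.",
]

def count_combinations(pools):
    """Number of distinct messages when one phrase is picked from each pool."""
    return prod(len(pool) for pool in pools)

def build_message(pools, filler, rng):
    """Pick one phrase per pool, separated by random legitimate text."""
    parts = []
    for index, pool in enumerate(pools):
        if index > 0:
            parts.append(rng.choice(filler))
        parts.append(rng.choice(pool))
    return "\n".join(parts)

pools = [subjects, titles, titles_part2]
total = count_combinations(pools)  # 40 * 50 * 30
message = build_message(pools, legit_text, random.Random(0))
```

The count ignores the filler text, which adds even more variation between messages built from the same phrases.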

Features
• Larger time frame – KeyWord!!!!
• Weak features
  • Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”
  • Simple word combinations like “Stock alert”, “Strong buy”
  • Simple header heuristics (for both spam and ham) like: valid reply, weird message ID, forged headers
• Example:
  • Top 500 spammy words from a Bayesian dictionary
  • Some simple header heuristics from SpamAssassin’s SARE Ninjas
  • Trainer’s personal flavour
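A minimal sketch of how such weak features could be turned into a binary input vector for the network. The word and phrase lists are taken from the examples above; the two header checks are simplified stand-ins (assumptions, not actual SARE rules), and `extract_features` is a hypothetical name.

```python
import re

# Example words and phrases from the slide; a real dictionary would be much larger.
KEYWORDS = ["anatrim", "viagra", "xanax", "stock"]
PHRASES = ["stock alert", "strong buy"]

def extract_features(text, headers):
    """Return a binary feature vector (one 0/1 entry per weak feature)."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))
    features = [int(word in words) for word in KEYWORDS]
    features += [int(phrase in lowered) for phrase in PHRASES]
    # Simplified stand-ins for header heuristics (assumptions, not SARE rules).
    features.append(int("@" in headers.get("Reply-To", "")))  # plausible reply address
    features.append(int("Message-ID" not in headers))         # message ID missing
    return features

vector = extract_features(
    "STOCK ALERT: strong buy today",
    {"Reply-To": "broker@example.com"},
)
```

No single entry decides anything on its own; the classifier is meant to learn which combinations of weak features indicate spam or ham.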

Why ART?
• In a conventional neural network, training occurs by modifying the weights of each neuron
• For large amounts of data, important details can actually be forgotten
• ART solves the stability-plasticity dilemma
• Based on template detection
• An unlimited number of templates allows an unlimited number of patterns
• 2 self-organizing neural networks + a mapping module = supervised organizing neural network

Adaptive Resonance Theory
• Similar to a cluster algorithm (as many clusters as needed)
• ARTMAP = ARTa + ARTb + MapField
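To make the ARTMAP composition more concrete, here is a heavily simplified sketch on binary feature vectors. It is not the implementation from the talk: ARTb is collapsed to plain class labels (a common simplification), the map field is a list of one label per ARTa template, and the class name, parameters and toy data are all assumptions. A template is accepted when its match with the input reaches the vigilance threshold; if its label disagrees with the teacher, the threshold is raised just above that match and the search continues (match tracking).

```python
def overlap(a, b):
    """Number of positions where both binary vectors are 1."""
    return sum(x & y for x, y in zip(a, b))

class SimplifiedArtmap:
    def __init__(self, base_vigilance=0.5, beta=1.0, epsilon=1e-6):
        self.base_vigilance = base_vigilance
        self.beta = beta
        self.epsilon = epsilon
        self.templates = []  # ARTa category templates
        self.map_field = []  # label associated with each template

    def _ranked(self, pattern):
        """Template indices ordered by the ART choice function, best first."""
        return sorted(
            range(len(self.templates)),
            key=lambda j: -overlap(pattern, self.templates[j])
            / (self.beta + sum(self.templates[j])),
        )

    def train_one(self, pattern, label):
        vigilance = self.base_vigilance
        size = sum(pattern) or 1
        for j in self._ranked(pattern):
            match = overlap(pattern, self.templates[j]) / size
            if match < vigilance:
                continue
            if self.map_field[j] == label:
                # Resonance: fast learning keeps only the shared features.
                self.templates[j] = [a & b for a, b in zip(pattern, self.templates[j])]
                return j
            # Wrong label: match tracking raises vigilance and keeps searching.
            vigilance = match + self.epsilon
        # No acceptable template: commit a new one (templates grow as needed).
        self.templates.append(list(pattern))
        self.map_field.append(label)
        return len(self.templates) - 1

    def predict(self, pattern):
        size = sum(pattern) or 1
        for j in self._ranked(pattern):
            if overlap(pattern, self.templates[j]) / size >= self.base_vigilance:
                return self.map_field[j]
        return None  # nothing resonates: unknown pattern

model = SimplifiedArtmap(base_vigilance=0.5)
model.train_one([1, 1, 1, 0, 0, 0], "spam")
model.train_one([0, 0, 0, 1, 1, 1], "ham")
model.train_one([1, 1, 0, 0, 0, 0], "spam")  # refines the first spam template
model.train_one([1, 1, 0, 0, 0, 1], "ham")   # conflicts with spam, gets its own template
```

Existing templates are only ever narrowed or left alone, and new ones are added when nothing fits, which is how this family of networks learns new patterns without overwriting old ones.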

ART Vigilance
• Small value - Imprecise
• Big value - Fragmented
• A big value: accepts small errors; many small clusters; high precision
• A small value: accepts high errors; a few big clusters; errors can appear
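The effect of the vigilance parameter can be seen in a small ART1-style clustering sketch. This is a simplified, fast-learning variant written for illustration (the function name, `beta` and the toy patterns are assumptions), not the network used in the talk.

```python
def overlap(a, b):
    """Number of positions where both binary vectors are 1."""
    return sum(x & y for x, y in zip(a, b))

def art1_cluster(patterns, vigilance, beta=1.0):
    """Cluster binary patterns; returns (cluster label per pattern, templates)."""
    templates, labels = [], []
    for pattern in patterns:
        size = sum(pattern) or 1
        # Try templates in order of the choice function, best first.
        order = sorted(
            range(len(templates)),
            key=lambda j: -overlap(pattern, templates[j]) / (beta + sum(templates[j])),
        )
        chosen = None
        for j in order:
            if overlap(pattern, templates[j]) / size >= vigilance:
                # Resonance: shrink the template to the shared features.
                templates[j] = [a & b for a, b in zip(pattern, templates[j])]
                chosen = j
                break
        if chosen is None:
            # Nothing is similar enough: open a new cluster.
            templates.append(list(pattern))
            chosen = len(templates) - 1
        labels.append(chosen)
    return labels, templates

patterns = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
labels_low, _ = art1_cluster(patterns, vigilance=0.5)   # few, coarse clusters
labels_high, _ = art1_cluster(patterns, vigilance=0.9)  # more, tighter clusters
```

On this toy data the low setting merges the last two patterns into one cluster, while the high setting splits them, which is the imprecise versus fragmented trade-off described above.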

ART++

Algorithm

Corpus
• 2.5 million spam messages (sampled in waves with a high degree of variation) and around 1000 simple low-relevance text heuristics (not counting the standard header heuristics).
• The first 1000 words (ordered by discrimination, but with a minimum of 1,000 to 3,000 occurrences) from a Bayesian dictionary trained on this corpus, and also standard header heuristics.
• Almost 1 million legitimate email messages
• 75% of the message corpus was used for training the neural network, and
• 25% was used for testing the neural network.
• 1.5 days to train!!!!
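A 75/25 split like the one above could be produced as follows. The slide does not say how messages were assigned to each part, so the seeded random shuffle here is an assumption, and `split_corpus` is a hypothetical helper.

```python
import random

def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle the messages and split them into (training, testing) lists."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for message identifiers.
train, test = split_corpus(range(1000))
```

With spam arriving in waves, a purely random split can put near-duplicates of the same wave on both sides, so splitting by wave or by date may give a more honest test set.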

Results
• FP: 1% → 0.0001%
• FN: 4% → 20%
• On some corpora (TREC 2006) we had … not so great results (but current heuristics)
• FN: 35%
• FP: 2 email messages!
• At least, just a few false positives!
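For reference, a short sketch of how false positive and false negative rates are typically computed. The slide does not state its denominators, so this assumes FP is measured over legitimate mail and FN over spam; the function name and toy labels are illustrative only.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for spam/ham labels."""
    pairs = list(zip(y_true, y_pred))
    ham_total = sum(1 for truth, _ in pairs if truth == "ham")
    spam_total = sum(1 for truth, _ in pairs if truth == "spam")
    # False positive: legitimate mail flagged as spam.
    false_pos = sum(1 for truth, pred in pairs if truth == "ham" and pred == "spam")
    # False negative: spam that slipped through as ham.
    false_neg = sum(1 for truth, pred in pairs if truth == "spam" and pred == "ham")
    fp_rate = false_pos / ham_total if ham_total else 0.0
    fn_rate = false_neg / spam_total if spam_total else 0.0
    return fp_rate, fn_rate

y_true = ["spam"] * 5 + ["ham"] * 4
y_pred = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
fp_rate, fn_rate = error_rates(y_true, y_pred)
```

The two errors are not equally costly: a false positive loses a legitimate message, while a false negative only lets one more spam through, which is why a filter may accept a higher FN rate to push FP toward zero.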

Conclusions
• ART + Simple Features + Spam = Love
• ART + False Positives + Spam = OMG!!!
• (ART++) = Heuristic Filter + ARTMAP
• Must use a lot of email messages. It is very difficult to find representative samples for individual waves.
• Can also be applied to other neural networks
• Interesting PowerPoint template…

Thanks QUESTIONS?
