
A False Positive Safe Neural Network for Spam Detection

Alexandru Catalin Cosoi, acosoi@bitdefender.com


Presentation Transcript


  1. A False Positive Safe Neural Network for Spam Detection Alexandru Catalin Cosoi acosoi@bitdefender.com

  2. Does this look familiar?

  3. Anatrim

  4. Oh boy, it’s getting worse!!!

  5. Oh boy, it’s getting worse!!!

  6. Bad Bad Spammer!!!
     • Databases:
       • D: Random legitimate text
       • D1: Different rephrasings of a certain spam phrase
       • D2: Different rephrasings of another spam phrase
       • …
       • Dn: Different rephrasings of another spam phrase
     • Create spam message script:
       • Choose a random phrase from D1
       • Choose random text from D
       • Choose a random phrase from D2
       • Choose random text from D
       • …
       • Choose a random phrase from Dn
       • Send the message
     • Appeared as a consequence of botnets
     • 40 samples of different subjects
     • 50 samples of different titles
     • 30 samples of different titles (part II)
     • 60000 different combinations (40 × 50 × 30)
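The message-building script on this slide can be sketched roughly as follows. This is only an illustration of the template idea, not the spammers' actual tooling: the function names `build_spam` and `combination_count` and the toy databases are hypothetical, and the count ignores the extra variation contributed by the random legitimate text from D.

```python
import math
import random


def build_spam(legit_texts, phrase_dbs, rng):
    """Pick one phrase from each of D1..Dn, with random text from D in between."""
    parts = []
    for i, phrases in enumerate(phrase_dbs):
        parts.append(rng.choice(phrases))           # random phrase from D_i
        if i < len(phrase_dbs) - 1:
            parts.append(rng.choice(legit_texts))   # random legitimate filler from D
    return " ".join(parts)


def combination_count(phrase_dbs):
    """Number of distinct phrase selections (filler text not counted)."""
    return math.prod(len(phrases) for phrases in phrase_dbs)


# Hypothetical toy databases sized like the example on the slide.
D = ["the weather was fine today", "see you at the meeting"]
D1 = [f"subject variant {i}" for i in range(40)]
D2 = [f"title variant {i}" for i in range(50)]
D3 = [f"title part II variant {i}" for i in range(30)]

message = build_spam(D, [D1, D2, D3], random.Random(0))
total = combination_count([D1, D2, D3])  # 40 * 50 * 30
```

With the database sizes from the slide, `total` works out to 60000, which is why signature-style matching on whole messages struggles against this kind of spam.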

  7. Features
     • Larger time frame – KeyWord!!!!
     • Weak features
       • Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”
       • Simple word combinations like “Stock alert”, “Strong buy”
       • Simple header heuristics (for both spam and ham) like: valid reply, weird message ID, forged headers
     • Example:
       • Top 500 spammy words from a Bayesian dictionary
       • Some simple header heuristics from SpamAssassin’s SARE Ninjas
       • Trainer’s personal flavour
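As a rough illustration of what "weak features" could look like in code, the sketch below turns a message into a binary vector. The word and phrase lists come from the slide; the two header checks and the names `header_flags` and `extract_features` are hypothetical stand-ins, not the SARE rules or the author's actual feature set.

```python
import re

SPAM_WORDS = ["anatrim", "viagra", "xanax", "stock"]
SPAM_PHRASES = ["stock alert", "strong buy"]


def header_flags(headers):
    """Two crude, assumed header heuristics: valid reply and weird message ID."""
    message_id = headers.get("Message-ID", "")
    valid_reply = 1 if headers.get("Reply-To", "").count("@") == 1 else 0
    looks_normal = re.fullmatch(r"<[^<>@\s]+@[^<>@\s]+>", message_id)
    weird_message_id = 0 if looks_normal else 1
    return [valid_reply, weird_message_id]


def extract_features(headers, body):
    """Return a 0/1 vector: one bit per word, per phrase, per header heuristic."""
    text = body.lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    word_bits = [1 if w in words else 0 for w in SPAM_WORDS]
    phrase_bits = [1 if p in text else 0 for p in SPAM_PHRASES]
    return word_bits + phrase_bits + header_flags(headers)


features = extract_features(
    {"Reply-To": "a@b.com", "Message-ID": "<123@host>"},
    "Stock alert: Anatrim works. Strong buy!",
)
```

Each bit on its own says little about a message, which is the point of the slide: the classifier, not any single feature, carries the decision.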

  8. Why ART?
     • Training occurs by modifying the weights of each neuron
     • For large amounts of data, forgetting important details might actually happen
     • Solves the stability-plasticity dilemma
     • Based on template detection
     • An unlimited number of templates means an unlimited number of patterns
     • 2 self-organizing neural networks + a mapping module = supervised organizing neural network

  9. Adaptive Resonance Theory
     • Similar to a clustering algorithm (as many clusters as needed)
     • ARTMAP = ARTa + ARTb + MapField

  10. ART Vigilance
      • [Figure labels: small value, imprecise; big value, fragmented]
      • A big value: accepts small errors; many small clusters; high precision
      • A small value: accepts high errors; a few big clusters; errors can appear
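The effect of the vigilance parameter can be seen in a small sketch. This is a simplified ART1-style clustering loop with fast learning on binary vectors, written for illustration only; it is not the ART++ variant from the talk, and the function name `art1_cluster`, the `beta` tie-breaking constant and the toy patterns are assumptions.

```python
def art1_cluster(patterns, vigilance, beta=0.5):
    """Cluster 0/1 vectors; a new template is created when no match passes vigilance."""
    templates = []  # one binary weight vector per cluster
    labels = []
    for pattern in patterns:
        norm = sum(pattern)
        # Rank existing templates by the choice function |I AND w| / (beta + |w|).
        scored = []
        for idx, template in enumerate(templates):
            overlap = sum(p & t for p, t in zip(pattern, template))
            scored.append((overlap / (beta + sum(template)), overlap, idx))
        scored.sort(key=lambda s: s[0], reverse=True)

        chosen = None
        for _, overlap, idx in scored:
            # Vigilance test: the matched fraction of the input must reach the threshold.
            if norm > 0 and overlap / norm >= vigilance:
                chosen = idx
                break

        if chosen is None:
            templates.append(list(pattern))  # no resonance: commit a new template
            chosen = len(templates) - 1
        else:
            # Fast learning: the template shrinks to what it shares with the input.
            templates[chosen] = [p & t for p, t in zip(pattern, templates[chosen])]
        labels.append(chosen)
    return labels, templates


patterns = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
loose_labels, loose_templates = art1_cluster(patterns, vigilance=0.5)
strict_labels, strict_templates = art1_cluster(patterns, vigilance=0.9)
```

On these toy patterns the lower vigilance groups the last two inputs together, while the higher vigilance rejects the 3-of-4 match and opens an extra cluster, matching the "few big clusters" versus "many small clusters" trade-off on the slide.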

  11. ART ++

  12. Algorithm

  13. Corpus
      • 2.5 million spam messages (sampled from waves with a high degree of variation) and around 1000 simple, low-relevance text heuristics (not counting the standard header heuristics)
      • The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, plus the standard header heuristics
      • Almost 1 million legitimate email messages
      • 75% of the message corpus was used for training the neural network
      • 25% was used for testing the neural network
      • 1.5 days to train!!!!
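A 75/25 split like the one described here might be done along these lines. The slide does not say how messages were assigned to each part, so the random shuffle, the fixed seed and the name `split_corpus` are assumptions made for this sketch.

```python
import random


def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle a copy of the corpus and cut it into training and test parts."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for message identifiers.
train, test = split_corpus(range(1000))
```

For wave-based spam, a split by arrival time or by wave could give a harsher test than a random shuffle, since near-duplicates of a training message would no longer land in the test part.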

  14. Results
      • FP: 1% → 0.0001%
      • FN: 4% → 20%
      • On some corpora (TREC 2006) we had … not so great results (but with current heuristics)
        • FN: 35%
        • FP: 2 email messages!
      • At least, just a few false positives!
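For readers who want to reproduce figures of this kind, the sketch below computes false positive and false negative rates from labelled predictions. The slide does not define its denominators, so this assumes FP is measured against all legitimate messages and FN against all spam; the name `error_rates` and the sample labels are hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false positive rate, false negative rate) for 'ham'/'spam' labels."""
    fp = fn = ham = spam = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == "ham":
            ham += 1
            fp += guess == "spam"   # legitimate mail flagged as spam
        else:
            spam += 1
            fn += guess == "ham"    # spam that slipped through
    fp_rate = fp / ham if ham else 0.0
    fn_rate = fn / spam if spam else 0.0
    return fp_rate, fn_rate


truth = ["ham"] * 4 + ["spam"] * 5
guess = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "spam", "ham"]
fp_rate, fn_rate = error_rates(truth, guess)
```

Reporting the two rates separately matters here because the talk deliberately trades a higher FN rate for a much lower FP rate.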

  15. Conclusions
      • ART + Simple Features + Spam = Love
      • ART + False Positives + Spam = OMG!!!
      • ART++ = Heuristic Filter + ARTMAP
      • Must use a lot of email messages. It is very difficult to find representative samples for individual waves.
      • Can also be applied to other neural networks
      • Interesting PowerPoint template…

  16. Thanks QUESTIONS?
