
The Fight against Spam - A Machine Learning Approach

ELPUB 2007, Vienna. The Fight against Spam - A Machine Learning Approach. Jiri Hynek (jhynek@kiv.zcu.cz), Karel Jezek (jezek_ka@kiv.zcu.cz). www.textmining.cz. Contents: Stats 101; Today's Spam Types; Spammer Tricks; Text-Based Spam Filter Implementation; Results.


Presentation Transcript


  1. ELPUB 2007, Vienna. The Fight against Spam - A Machine Learning Approach. Jiri Hynek (jhynek@kiv.zcu.cz), Karel Jezek (jezek_ka@kiv.zcu.cz). www.textmining.cz

  2. Contents: • Stats 101 • Today's Spam Types • Spammer Tricks • Text-Based Spam Filter Implementation • Results

  3. Contents: Spamming is publishing: • Web spam ("comment spam") on blogs, (unmoderated) forums, wikis. Why: to trigger higher page ranking! • Unsolicited marketing spam in our e-mails, i.e. info dissemination to the public. Why: to sell products!

  4. A bit of Terminology: • "Canned meat made largely from pork" • Ham vs. Spam (Spam mail) • UCE (Unsolicited Commercial Email) • UBM (Unsolicited Bulk Mail) • EMP (Excessive Multi-Posting) • Junk mail • Bulk email

  5. Stats 101 Top five spam categories: • Online Pharmacies 20.0% • Mortgage Refinancing 9.7% • Investment/financial services 9.0% • Male products (\/i@gra, CI@1i$) 8.7% • Discount computer software 6.9% Source: Communications of the ACM, February 2007/Vol. 50 No.2

  6. Stats 101 • 1998: a mere 10% of overall mail volume. Now: 80% (Communications of the ACM, February 2007/Vol. 50 No.2) • Average spammers' revenue: $1 per 45,000 spams dispatched • A database of 100 million e-mails costs 100 dollars, spam software included (www.symantec.com)

  7. Today's Spam Types Text Spam

  8. Today's Spam Types Text Spam Commonly used phrases filtered out by antispam filters (and words to avoid, of course): • Free! • 50% off! • Click Here • Call now! • Subscribe • Earn $ • Discount! • Eliminate Debt • Double your income • You're a Winner! • Reverses Aging • Hidden • Information you requested • Stop / Stops • Lose Weight • Multi level Marketing • Million Dollars • Opportunity • Compare • Removes • Collect • Amazing • Cash Bonus • Promise You • Credit • Loans • Satisfaction Guaranteed • Serious Cash • Search Engine Listings

  9. Today's Spam Types Image-Based Spam

  10. Today's Spam Types Image-Based Spam in our mailboxes

  11. Today's Spam Types Phishing

  12. Today's Spam Types Captcha - fighting web spam

  13. Common Spammer Tricks Tricks to fool statistical spam filters: • Avoidance of keywords (such as stock, Viagra, etc.), • Frequent change of the sender's address, • Message encoding (such as base64, commonly used to transfer binary content as plain text), • Hashing (e.g. insertion of HTML tags into messages), • Use of images instead of plain text (namely GIF, JPEG, and PNG).
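
Two of the tricks on this slide, base64 encoding and HTML tags inserted inside keywords, can be undone before a text filter sees the message. The following is a minimal sketch, not the authors' implementation; the function names and the sample message are assumptions made for illustration.

```python
import base64
import re
from html import unescape

# Matches anything that looks like an HTML tag or comment, e.g. <b>, </i>, <!-- x -->
TAG_RE = re.compile(r"<[^>]*>")


def strip_html(text: str) -> str:
    """Remove tags and resolve entities so that split keywords rejoin."""
    return unescape(TAG_RE.sub("", text))


def decode_base64_body(body: str, charset: str = "utf-8") -> str:
    """Decode a base64-encoded message body into text."""
    return base64.b64decode(body).decode(charset, errors="replace")


def normalize_body(body: str, is_base64: bool) -> str:
    """Return plain text that a keyword or statistical filter can tokenize."""
    text = decode_base64_body(body) if is_base64 else body
    return strip_html(text)


# Hypothetical spam body: the keyword is split by empty tags, then base64-encoded.
sample = base64.b64encode(b"Buy V<b></b>ia<i></i>gra now").decode("ascii")
print(normalize_body(sample, is_base64=True))
```

In a real mail pipeline the encoding would be read from the Content-Transfer-Encoding header rather than passed in as a flag.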

  14. New Spammer Tricks Character Hashing: I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
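
The scrambled words above keep their first and last letters and only shuffle the interior, so one possible countermeasure is to compare words by a fingerprint made of the first letter, the sorted interior letters, and the last letter. This is a sketch under that assumption; the keyword list and function names are hypothetical and are not taken from the presentation.

```python
def fingerprint(word: str) -> str:
    """First letter + sorted interior letters + last letter, lower-cased."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


# Hypothetical list of words a filter wants to recognise.
KEYWORDS = ["finally", "lose", "weight", "believe", "amazing",
            "patch", "pounds", "guaranteed", "money"]
INDEX = {fingerprint(k): k for k in KEYWORDS}


def unscramble(word: str) -> str:
    """Map a scrambled word back to a known keyword, or return it unchanged."""
    return INDEX.get(fingerprint(word), word)


scrambled = "Amizang pacth makes you shed the ponuds".split()
print(" ".join(unscramble(w) for w in scrambled))
```

Distinct words can share a fingerprint, so a production filter would need to handle collisions rather than rely on a plain dictionary.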

  15. New Spammer Tricks Keyword masking by repeating characters: Buuuyyyy cheeeeaaap viaaagraaa Word obfuscations: \/laGr@ Need a{} Dpiloma? sh1pp1ng //orldwide S0ft T4bs Ci@li$ repl1ca w4tches from r0lex
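
Repeated characters and simple digit or symbol substitutions can be normalised before tokens reach the classifier. The sketch below assumes a small single-character substitution table and collapses runs of three or more identical characters; both choices are illustrative assumptions, not rules from the slides.

```python
import re

# Assumed single-character look-alike substitutions.
LEET = str.maketrans({"0": "o", "1": "i", "4": "a", "@": "a", "$": "s"})

# A run of three or more identical characters.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalize_token(token: str) -> str:
    """Lower-case, undo look-alike characters, collapse long character runs."""
    t = token.lower().translate(LEET)
    return REPEAT_RE.sub(r"\1", t)


for raw in ["Buuuyyyy", "cheeeeaaap", "viaaagraaa", "sh1pp1ng", "Ci@li$", "r0lex"]:
    print(raw, "->", normalize_token(raw))
```

Multi-character disguises such as `\/` for "v" are not covered by this table and would need a separate replacement pass.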

  16. New Spammer Tricks Word obfuscations: • There are 62,424 (3 x 12 x 17 x 2 x 3 x 17) ways to portray the name Viagra. In fact, there are 600,426,974,379,824,381,952 ways to spell Viagra. Source: http://cockeyed.com/lessons/viagra/viagra.html
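
The first figure on this slide is a plain product of six factors. The check below assumes that each factor is the number of look-alike variants for one of the six letter positions in the word; the slide gives only the factors, so that reading is an assumption.

```python
import math

# Factors as given on the slide; assumed to be variants per letter position.
variants_per_letter = [3, 12, 17, 2, 3, 17]

total = math.prod(variants_per_letter)
print(total)  # 62424
```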

  17. New Spammer Tricks ASCII Art:     \|||||/                         ( o   o )          -ooO--(_)--Ooo— / \

  18. New Spammer Tricks ASCII Art:

  19. New Spammer Tricks Good word attacks (Bayesian poisoning) Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
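
A toy calculation shows why padding a message with innocent news text can flip a Bayesian filter. The per-word probabilities below are invented for illustration only, and the scoring function is a generic naive Bayes log-odds sum rather than the filter used by the authors.

```python
import math

# Hypothetical P(word | spam) and P(word | ham); values are made up.
SPAM_P = {"viagra": 0.05, "cheap": 0.03,
          "weather": 0.001, "council": 0.001, "football": 0.001}
HAM_P = {"viagra": 0.0005, "cheap": 0.003,
         "weather": 0.02, "council": 0.01, "football": 0.02}


def spam_log_odds(words, prior_spam=0.5):
    """Naive Bayes log-odds of spam vs. ham; a positive score means spam."""
    score = math.log(prior_spam / (1 - prior_spam))
    for w in words:
        if w in SPAM_P:  # words unknown to the model are ignored
            score += math.log(SPAM_P[w] / HAM_P[w])
    return score


payload = ["cheap", "viagra"]
padded = payload + ["weather", "council", "football"]  # "good words" appended

print(round(spam_log_odds(payload), 2))  # positive: flagged as spam
print(round(spam_log_odds(padded), 2))   # negative: slips through as ham
```

Each appended "good" word contributes a negative term, so enough of them outweigh the few strongly spammy words.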

  20. New Spammer Tricks Good word attacks

  21. A Filter to Fight Text-Based Spam It's just another Short Document Classification Problem: • The Itemsets Filter • Plain Bayes Filter • LSI Filter • SVM Filter • GZip (Compression-based) filter
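
The idea behind a compression-based filter can be sketched in a few lines: a message is assigned to the class whose training text it compresses best with. The slides do not describe the authors' GZip filter in detail, so the snippet below is only an assumed, generic variant that uses `zlib` (the DEFLATE algorithm that gzip also uses) and two tiny made-up corpora.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the DEFLATE-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(message: str, corpora: dict) -> str:
    """Pick the label whose corpus needs the fewest extra bytes for the message."""
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + message) - compressed_size(corpus)
    return min(corpora, key=lambda label: extra_bytes(corpora[label]))


# Hypothetical miniature training corpora.
CORPORA = {
    "spam": "buy cheap pills online now, satisfaction guaranteed. "
            "double your income, click here to claim your cash bonus.",
    "ham": "meeting agenda attached, see you at the project review on monday. "
           "please send the quarterly report before lunch.",
}

print(classify("click here to claim your cash bonus", CORPORA))
print(classify("please send the quarterly report", CORPORA))
```

With realistic corpora the per-class text would be much larger, and DEFLATE's limited back-reference window becomes a practical constraint.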

  22. Standard Spam Testing Collections • PU1: a mixture of 481 spam messages and 618 legitimate messages • PU123A: four corpora (PU1, PU2, PU3, PUA), based on private mailboxes • Enron Corpus: 200,399 unique messages from 158 users (mostly managers)

  23. Itemsets Spam Filter: Results • FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake. • FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
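
The two error measures defined on this slide can be computed directly from true and predicted labels. This is a small sketch; the label strings and the function name are assumptions made for the example.

```python
def error_rates(true_labels, predicted_labels):
    """Return (FPI, FNI) as defined on the slide.

    FPI = (#ham classified as spam) / #ham
    FNI = (#spam classified as ham) / #spam
    """
    ham_total = spam_total = ham_as_spam = spam_as_ham = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == "ham":
            ham_total += 1
            ham_as_spam += pred == "spam"
        else:
            spam_total += 1
            spam_as_ham += pred == "ham"
    fpi = ham_as_spam / ham_total if ham_total else 0.0
    fni = spam_as_ham / spam_total if spam_total else 0.0
    return fpi, fni


truth = ["ham"] * 4 + ["spam"] * 4
pred = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
print(error_rates(truth, pred))  # (0.25, 0.25)
```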

  24. SVM Spam Filter: Results • FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake. • FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.

  25. GZip Spam Filter: Results • FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake. • FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter. …We will look into this in the near future

  26. Light at the end of the tunnel? • Payment per e-mail? Quite unlikely… • E-mail authentication by SIDF, the Sender ID Framework (by Microsoft): a registered list of servers of domain owners • Confirmation of the e-mail source domain (automatically, by ISPs) • Protects 40% of legitimate email sent worldwide • Helps combat phishing scams / domain spoofing (forging a sender's address)

  27. Light at the end of the tunnel? • DomainKeys Identified Mail (DKIM) • Similar technology by Yahoo, Cisco Systems, Sendmail, PGP • Based on digital signatures • An official proposed standard of the Internet Engineering Task Force (IETF)
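
To make "based on digital signatures" more concrete: a DKIM-signed message carries a `DKIM-Signature` header made of `tag=value` pairs, and the verifier fetches the signer's public key from a DNS TXT record at `<selector>._domainkey.<domain>`. The tag names used below (`v`, `a`, `d`, `s`, `h`, `bh`, `b`) come from the DKIM specification, not from the slides. The header values are hypothetical placeholders, and the sketch only parses the header; it does not query DNS or verify a signature.

```python
def parse_dkim_tags(header_value: str) -> dict:
    """Split a DKIM-Signature header value into a tag -> value dictionary."""
    tags = {}
    for part in header_value.split(";"):
        if "=" not in part:
            continue
        name, _, value = part.partition("=")
        # Folding whitespace inside values is not significant, so drop it.
        tags[name.strip()] = "".join(value.split())
    return tags


def dkim_key_record_name(tags: dict) -> str:
    """DNS name where the signer's public key is published."""
    return f"{tags['s']}._domainkey.{tags['d']}"


# Hypothetical header value; bh= and b= hold placeholders, not real hashes.
header = ("v=1; a=rsa-sha256; d=example.org; s=mail2007; "
          "h=from:to:subject; bh=PLACEHOLDER=; b=PLACEHOLDER=")

tags = parse_dkim_tags(header)
print(tags["a"], dkim_key_record_name(tags))
```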

  28. Thank You For Your Attention. Questions?
