
Classifying and Filtering Spam Using Search Engines

Oleg Kolesnikov, College of Computing, Georgia Tech


Presentation Transcript


  1. Classifying and Filtering Spam Using Search Engines • Oleg Kolesnikov, College of Computing, Georgia Tech

  2. >50% of all e-mail today is spam? Source: brightmail.com

  3. Scale • IDC: of 31bn messages sent each day, 18%, or 5.6bn, were s[pc]am messages • Brightmail decoy network stats: 6.7bn spam messages sent in March 2003, in batches varying from 100 to ~100,000 identical e-mails sent at a time

  4. Current techniques to deal with SPAM/UCE: • Blacklisting • Signature-based Filtering • Statistical/Bayesian Filtering • Heuristic Filtering • Challenge-Response Filtering • Sender-pays • Laws

  5. Blacklisting • MAPS (Mail Abuse Prevention System) RBL catches only 24% of spam, with 34% false positives (the spam police article, gaudi/gaspar) • Self-appointed sheriffs/vigilantes; legitimate businesses are increasingly caught in the crossfire, e.g. iBill was losing $100k/day during each of the four days it was blacklisted • Only a first cut at the problem: never blacklists more than 50% of the servers sending spam (Graham)

  6. Sample and Signature-based Filtering • Set up a network of DECOY e-mail addresses. Any message sent to these addresses must be spam => if the same message is sent to a protected address, it must be SPAM, too (that’s what Brightmail does) • Not very flexible -- spammers take the lead in coming up with tricks • Typical trick: make each spam message different
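
The decoy idea on this slide can be sketched in a few lines. This is only an illustration: the slides do not say how Brightmail computes its signatures, so the exact-match SHA-256 hash over normalized text, and the names `signature` and `DecoySignatureFilter`, are assumptions made for the example.

```python
import hashlib


def signature(body: str) -> str:
    """Hash of the body after lowercasing and collapsing whitespace (assumed scheme)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DecoySignatureFilter:
    """Hypothetical filter: anything seen at a decoy address is treated as spam."""

    def __init__(self):
        self.known_spam = set()

    def record_decoy_hit(self, body: str) -> None:
        # Mail that reaches a decoy address is spam by construction.
        self.known_spam.add(signature(body))

    def is_spam(self, body: str) -> bool:
        # A protected mailbox receives "the same message" if the signatures match.
        return signature(body) in self.known_spam


flt = DecoySignatureFilter()
flt.record_decoy_hit("Buy   CHEAP meds now")
print(flt.is_spam("buy cheap meds now"))        # matches after normalization
print(flt.is_spam("buy cheap meds now! x7q2"))  # a random suffix changes the hash
```

The second call shows why the slide calls the method inflexible: once each copy of a spam message differs slightly, an exact signature no longer matches.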

  7. Brightmail (used by MS/Hotmail, Earthlink, Verizon, eBay, etc.)

  8. Basic Statistical Filtering • Weakness (W): must be TRAINED; strength (S): relatively low false positives • Starts with two message corpora -- spam and legitimate • Splits messages into TOKENs • Assigns each token a probability, based on the probability of its appearance in the spam corpus, e.g. ‘naked’ may have a 67% probability of appearing in spam vs. 10% for ‘regards’ • When a new message arrives, the statistical filter takes the top N tokens whose probabilities are farthest from the neutral 50% in either direction, applies Bayes’ theorem, and comes up with a RANKING for the e-mail
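
The steps on this slide can be sketched as follows. This is a minimal illustration, not the code of any particular filter: the tokenizer, the 0.01/0.99 clamping, the value used for unseen tokens, and `top_n=15` are all assumed values, and the combination rule is the usual naive-Bayes product that treats tokens as independent.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Split a message into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def train(spam_msgs, ham_msgs):
    """Estimate, per token, how strongly it points to spam (between 0 and 1)."""
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)  # share of spam containing tok
        h = ham_counts[tok] / len(ham_msgs)    # share of legitimate mail containing tok
        probs[tok] = min(0.99, max(0.01, s / (s + h)))  # clamp: never fully certain
    return probs


def spam_score(text, probs, top_n=15, unknown=0.4):
    """Combine the top_n most extreme token probabilities into one score."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)  # farthest from 50% first
    ps = ps[:top_n]
    # Naive Bayes: prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    log_spam = sum(math.log(p) for p in ps)
    log_ham = sum(math.log(1.0 - p) for p in ps)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


probs = train(
    ["naked pics free", "free money now"],
    ["best regards john", "meeting notes regards"],
)
print(spam_score("free naked offer", probs))  # close to 1
print(spam_score("best regards", probs))      # close to 0
```

A real deployment would train on thousands of messages per corpus; the four-message corpus here only shows the mechanics.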

  9. Heuristic Filtering • What kind of filters can you come up with JUST BY LOOKING at a spam e-mail? • Sender name looks bogus? • Header fields are missing? • Lots of html? • Take all these rules and heuristic observations, assign weights/points, and put them into a database • You’ve got yourself an early version of SPAMASSASSIN
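
A weighted-rule scorer of the kind this slide describes might look like the sketch below. The three rules mirror the questions on the slide, but the rule names, point values, and threshold are made up for illustration and are not taken from SpamAssassin's rule set.

```python
import re
from email import message_from_string

TAG_RE = re.compile(r"<[a-zA-Z/][^>]*>")


def bogus_sender(msg):
    """Assumed heuristic: many digits in the local part of the From address."""
    local_part = msg.get("From", "").split("@")[0]
    return sum(ch.isdigit() for ch in local_part) >= 4


def missing_headers(msg):
    """Header fields that ordinary mail clients normally add are absent."""
    return any(name not in msg for name in ("Date", "Message-ID"))


def lots_of_html(msg):
    """Three or more HTML tags in a single-part body."""
    body = msg.get_payload() if not msg.is_multipart() else ""
    return len(TAG_RE.findall(body)) >= 3


# (rule name, test function, points): illustrative weights only
RULES = [
    ("BOGUS_SENDER", bogus_sender, 2.0),
    ("MISSING_HEADERS", missing_headers, 1.5),
    ("LOTS_OF_HTML", lots_of_html, 1.5),
]
THRESHOLD = 4.0  # assumed cut-off


def score(raw):
    """Return (total points, names of the rules that fired)."""
    msg = message_from_string(raw)
    hits = [(name, pts) for name, test, pts in RULES if test(msg)]
    return sum(pts for _, pts in hits), [name for name, _ in hits]


raw = "From: x9381776@example.com\nSubject: hi\n\n<html><b>buy</b></html>"
total, fired = score(raw)
print(total >= THRESHOLD, fired)
```

Keeping the rules in a plain table makes the weights easy to retune, which matters because rule sets of this kind age quickly.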

  10. SpamAssassin • One way to make it work (let’s say with postfix): 1) install it: perl -MCPAN -e 'install Mail::SpamAssassin' 2) train it on a database of spam and legitimate e-mails using sa-learn (part of SpamAssassin) 3) add a filter program that pipes all incoming mail through spamc, also a part of SpamAssassin: /usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $? 4) spamc adds headers, something like: X-Spam-Flag: {YES|NO}, X-Spam-Level: *** 5) the headers are caught by a user’s procmail recipe and the mail is classified appropriately
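
Step 5 is done with procmail on the slide; the same decision can be sketched in Python to show what the recipe has to look at. The folder names and the `min_stars_to_discard` cut-off are hypothetical; only the `X-Spam-Flag` and `X-Spam-Level` headers come from the slide.

```python
from email import message_from_string


def route(raw, min_stars_to_discard=10):
    """Pick a destination from the headers added by spamc (sketch only)."""
    msg = message_from_string(raw)
    flag = msg.get("X-Spam-Flag", "NO").strip().upper()
    stars = msg.get("X-Spam-Level", "").count("*")  # one star per score point
    if flag == "YES" and stars >= min_stars_to_discard:
        return "discard"      # hypothetical folder for very high scores
    if flag == "YES":
        return "spam"         # hypothetical folder for ordinary spam
    return "inbox"


tagged = "X-Spam-Flag: YES\nX-Spam-Level: ******\nSubject: offer\n\nbody"
print(route(tagged))
```

Mail that never passed through spamc has neither header and falls through to the inbox, which is the safe default.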

  11. Heuristic Filtering Two • Weakness (W): the heuristic rules database is public, which makes it relatively easy for spammers to come up with ways to bypass the system => the rules database needs to be updated frequently • May not be as effective today as other methods, such as statistical filtering

  12. Challenge-Response Filtering • Whenever you receive an e-mail from someone NOT on your whitelist, an automatic reply is sent telling what steps the sender should take to be considered for the whitelist (e.g. send you a confirmation, make a donation, solve a puzzle, etc.) • Very effective at stopping spam BUT has a number of drawbacks: valid mail delayed, kind of harsh -- some may think of it as inconsiderate and never reply, extra work for senders etc.
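
The whitelist-plus-challenge flow can be sketched as a small state machine. The class and method names are hypothetical, and actually sending the challenge e-mail is left out; the sketch only tracks who is whitelisted and which messages are being held.

```python
import secrets


class ChallengeResponse:
    """Hold mail from unknown senders until they answer a challenge (sketch)."""

    def __init__(self, whitelist):
        self.whitelist = {addr.lower() for addr in whitelist}
        self.pending = {}  # token -> (sender, list of held messages)

    def receive(self, sender, message):
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", None)
        for token, (held_sender, held) in self.pending.items():
            if held_sender == sender:      # already challenged: just hold the mail
                held.append(message)
                return ("held", token)
        token = secrets.token_hex(8)       # goes into the automatic reply
        self.pending[token] = (sender, [message])
        return ("challenge", token)

    def confirm(self, token):
        """Sender answered the challenge: whitelist them and release held mail."""
        sender, held = self.pending.pop(token, (None, []))
        if sender is not None:
            self.whitelist.add(sender)
        return held


cr = ChallengeResponse(["friend@example.com"])
action, token = cr.receive("new@example.org", "hello")
print(action)             # challenge
print(cr.confirm(token))  # ['hello']
```

The `pending` table is where the drawbacks listed on the slide show up: valid mail sits there until the sender bothers to respond.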

  13. Stats for different approaches (MessageLabs)

  14. Problems with Statistical and other keyword-dependent methods • 1) Heavily dependent on effective parsing and the presence of “true” tokens, so spammers try to fool the parsers. Examples: • White background: <font color=white>research data and other statistically strong keywords that are present in legitimate e-mails</font> • Splitting words: ch<!-- valid -->eck this p<!-- news -->orn • Adding extra characters and spaces to confuse parsers (F*R E-E) and so forth (JavaScript, fake HTML tags, browser-specific tricks) • 2) Spam may contain too little text and be TOO close to real e-mails in keywords. This is a more serious problem. I’ll give an example later.
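
To make the first problem concrete, here is a sketch of the pre-processing a tokenizer needs before it sees the "true" tokens in the first two examples. The regular expressions are illustrative and cover only those two tricks; the character-insertion trick (F*R E-E), JavaScript, and browser-specific tricks are not handled.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# White-on-white text: only the plain <font color=white> form is recognized here.
HIDDEN_RE = re.compile(
    r"""<font\s+color\s*=\s*["']?(?:white|\#fff(?:fff)?)["']?\s*>.*?</font>""",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html):
    """Approximate what the reader sees, not what the raw source contains."""
    text = COMMENT_RE.sub("", html)  # rejoin words split by HTML comments
    text = HIDDEN_RE.sub(" ", text)  # drop text hidden on a white background
    text = TAG_RE.sub(" ", text)     # remove the remaining tags
    return " ".join(text.split())


print(visible_text("ch<!-- valid -->eck this p<!-- news -->orn"))
print(visible_text("<font color=white>research data</font>Buy now"))
```

Every new hiding trick needs another rule of this kind, which is the maintenance burden the slide is pointing at.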

  15. My research • Developed and implemented a system for filtering of unwanted mail using Google • Can be used WITHOUT training

  16. Classification of current spam

  17. Thoughts • Some users must click on those ads or else there would be no spam (somebody IS interested in it after all) • There may be more of such users in the future as new regulations appear and spam becomes less of an annoyance and more of an ad • Some users may like to receive SPAM-looking messages, for instance, marketing reports, offers, etc., that look very much like spam

  18. Two main observations I use • Spam is USER-SPECIFIC • Most spammers expect users to TAKE some ACTION upon reading spam; in other words, there has to be a FEEDBACK mechanism

  19. Targeting the feedback mechanism • How effective would a spam be without an easy feedback mechanism?

  20. URLs as a feedback mechanism • Of the ~1800 spam messages I have analyzed in the classical spam corpora, ~95% contained URLs • Of the remaining 5%, approximately half seemed to be damaged submissions (e.g. MIME conversion and other types of errors); the rest consisted of two types of letters: • Messages with 1-800 numbers and faxes (including Nigerian scam) • Religious letters

  21. Basic Approach: URLSP • The basic approach was to extract URLs, apply a user-specific whitelist based on the user’s mailbox (masks such as .edu, cnn.com, etc.), and classify everything else as spam • The first version I implemented has been in use at Georgia Tech since December 2002 • Has actually been working quite well
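
The basic approach can be sketched as below. This is not the URLSP code itself: the URL regular expression, the mask-matching rule, and the decision to give messages without URLs their own `"no-url"` label are assumptions made for the example.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def host_matches(host, mask):
    """'.edu' matches any host ending in .edu; 'cnn.com' matches cnn.com and its subdomains."""
    host, mask = host.lower().rstrip("."), mask.lower()
    if mask.startswith("."):
        return host.endswith(mask)
    return host == mask or host.endswith("." + mask)


def extract_hosts(body):
    """Host names of all http(s) URLs found in the message body."""
    urls = (u.rstrip(".,;:!?)") for u in URL_RE.findall(body))
    return [urlparse(u).hostname or "" for u in urls]


def classify(body, whitelist):
    hosts = extract_hosts(body)
    if not hosts:
        return "no-url"  # assumption: the slide does not cover this case
    if all(any(host_matches(h, m) for m in whitelist) for h in hosts):
        return "legitimate"
    return "spam"        # everything not covered by the whitelist


whitelist = [".edu", "cnn.com"]
print(classify("See http://www.cc.gatech.edu/~ok and http://cnn.com/news", whitelist))
print(classify("Click http://cheap-pills.example.biz/buy", whitelist))
```

Matching on whole host-name labels matters: a mask of `cnn.com` should not accept `notcnn.com`.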

  22. Effective but rather naive • First version effective but rather naive • Granularity and false positives can be a problem

  23. Next version: Classifying URLs • CLASSIFY URLs using Google and Open Directory • Use whitelists/blacklists of categories and URLs BASED on user mailbox and individual preferences

  24. DMOZ/ODP

  25. Example • Based on files automatically generated from your mailbox, configure the system as follows (the blacklist* files are omitted): • whitelist.url: .edu, .mil, .gov, www.nmap.com, www.epic.org, www.cypherpunks.to, etc. • whitelist.cat: Top/Computers/Security/Anti_Virus/Products, Top/Computers/Security/Products_and_Tools/Cryptography/PGP, Top/Computers/Security/Products_and_Tools/Password_Tools, ...

  26. URL Classifier: Categories Extracted from SPAM • Examples of categories of URLs extracted from spam: • Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics • Top/Business/Employment/Careers • Top/Business/Financial_Services/Mortgages • Top/Business/Investing/Day_Trading/Brokerages • Top/Business/Investing/Day_Trading/Education_and_Training • Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds • Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM • Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search • Top/Shopping/Gifts/Personalized • Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts • ...
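
Once a URL has been mapped to a directory category, matching it against per-user category lists is a prefix test on the category path. In the sketch below the category lookup itself (Google / Open Directory) is not shown, the blacklist entries are hypothetical because the slides omit the blacklist files, and letting the whitelist win over the blacklist is an assumed policy.

```python
def under(category, prefix):
    """True if category equals prefix or lies below it in the directory tree."""
    return category == prefix or category.startswith(prefix + "/")


def classify_category(category, whitelist_cat, blacklist_cat):
    if category is None:
        return "unknown"      # the URL is not listed in the directory
    if any(under(category, p) for p in whitelist_cat):
        return "legitimate"   # assumed policy: whitelist is checked first
    if any(under(category, p) for p in blacklist_cat):
        return "spam"
    return "unknown"


whitelist_cat = [
    "Top/Computers/Security/Anti_Virus/Products",
    "Top/Computers/Security/Products_and_Tools/Cryptography/PGP",
]
blacklist_cat = [  # hypothetical entries, chosen from the spam categories above
    "Top/Business/Financial_Services/Mortgages",
    "Top/Business/Investing/Day_Trading",
]

print(classify_category("Top/Business/Investing/Day_Trading/Brokerages",
                        whitelist_cat, blacklist_cat))
```

Comparing on `prefix + "/"` rather than a bare `startswith` keeps a sibling such as `Top/Business/Investing/Day_Trading_News` from matching by accident.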

  27. GTUC v1.0 (Basic) • Register for a free account on a CoC-based filtering server • Forward your mail to the server • The mail will be automatically classified into three folders as it arrives: Inbox, Unknown, spam-can • Read your mail with IMAP
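
How per-URL verdicts turn into one of the three folders is not spelled out on the slide. One plausible policy, shown purely as an assumption, is that a single spam URL sends the message to spam-can, a message whose URLs are all whitelisted goes to Inbox, and anything else waits in Unknown.

```python
def folder_for(url_verdicts):
    """Map per-URL verdicts ('legitimate', 'spam', 'unknown') to a folder name."""
    if "spam" in url_verdicts:
        return "spam-can"   # one bad URL is enough (assumed policy)
    if url_verdicts and all(v == "legitimate" for v in url_verdicts):
        return "Inbox"
    return "Unknown"        # no URLs, or some could not be categorized


print(folder_for(["legitimate", "legitimate"]))
print(folder_for(["legitimate", "unknown"]))
print(folder_for(["unknown", "spam"]))
```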

  28. Spam of the future • Innovative feedback mechanisms • Appearance as close to legitimate e-mails as possible, e.g. >>> From: rcarlos@legitimate.com Hi, here is an interesting article. You should check it out -- net::“terminator_25” Roberto Carlos

  29. Solution • Current best: a combination of approaches • Categorization and URL-based filtering can help • Uncategorized URLs? Similarity + retrieval of HTML and categorization with token stats/heuristics
