Email and Spam Joshua Goodman Microsoft Corporation research.microsoft

3. Email addiction 41% check email first thing in the morning 23% have checked in bed in their pajamas

4. Overview Email Most important application Great research problems for people working on Machine Learning / Data Mining Spam Techniques spammers use Solutions to Spam Fun problems you find building real systems

5. Part 1: Email A sample of interesting machine learning / data mining email problems Finding what's important Priorities Organizing mail Auto foldering Auto tagging Finding what's interesting Automatic search Contact finding

6. Priorities (Eric Horvitz, Andy Jacobs, David Hovel, etc.) Automatically determines how important your email is Send to your cell phone Different sound/toast Uses machine learning Sent directly to you? From your manager? Uses future tense? Future dates?

7. Auto Foldering (Jake Brutlag and Chris Meek) Use machine learning to figure out automatically what folder mail goes in. Interesting text classification problem Folders contain as few as three entries Data changes over time

8. Automatic Tagging for Email (Arun C. Surendran, John C. Platt and Erin Renshaw) Automatically tag email messages to enrich search, organization, and navigation. How it works: Put messages into clusters Naming clusters is hard Use domain-dependent filtering (remove common intranet words) Use noun phrases from subjects Words do not have to occur in all messages in cluster

9. Automatic Search (Joshua Goodman and Vitor Carvalho) Automatically show users useful search results Examined over 20 factors Automatically train machine learning system to weight them. Frequency of keywords in Internet Search query logs (MSN) is third most helpful feature (after TF and IDF) Helped solve lots of linguistic problems Almost everything in query logs is a "meaningful phrase" Much easier to port to multiple languages

10. Contact Finding (T. Kristjansson, A. Culotta, P. Viola and A. McCallum) Automatically find contact information in an email message. Machine learning method: train it by showing examples

11. Other Interesting Email Research All of the research I've just shown you is from Microsoft Research Main reason: much easier to steal slides from colleagues with nearby offices Why do people in MSR spend so much time working on email problems? CALO Project "Cognitive Assistant that Learns and Organizes": DARPA funded project led by SRI, with 22 organizations participating Main way you deal with your automated assistant is through email. RADAR Project Primarily at CMU (11 research groups) (DARPA funded) Cognitive assistant that can do tasks like space planning, automated web master, etc. Primary interface to the assistant is through email

12. Part 2: Spam SPAM is the number one problem for email systems Estimates from about 71% to 87% of mail is spam If you stop 90% of the spam, over a billion spam a day will get past filters worldwide, and 20% of your inbox will be spam. Overview Techniques spammers use Solutions to Spam And some of the interesting research problems http://www.tekrati.com/research/News.asp?id=6933
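
The 20% figure on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function name is made up, and it assumes the filter never blocks good mail, which the slide does not state.

```python
def inbox_spam_fraction(spam_share: float, catch_rate: float) -> float:
    """Fraction of delivered mail that is spam.

    Assumption (for illustration only): the filter passes all good mail.
    """
    spam_delivered = spam_share * (1.0 - catch_rate)
    good_delivered = 1.0 - spam_share
    return spam_delivered / (spam_delivered + good_delivered)


# The two ends of the 71%-87% estimate, with a filter that stops 90% of spam.
for share in (0.71, 0.87):
    fraction = inbox_spam_fraction(share, 0.90)
    print(f"{share:.0%} spam in traffic -> {fraction:.0%} of inbox")
```

Under these assumptions the low-end estimate (71% spam) gives an inbox that is about 20% spam, and the high-end estimate (87%) gives about 40%.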

13. Techniques spammers use A few examples of tricks spammers use to get past spam filters Most spam filters have text classification as main or important part, often with linear models (e.g. Naive Bayes, etc.)

14. The Hitchhiker Chaffer Content Chaff Random passages from the Hitchhiker's Guide Footers from valid mail

15. Hitchhiker Chaffer's Later Work Can use hidden text, e.g. white on white or many other tricks User sees only spammy text Spam filter sees everything, including good words.

16. Hitchhiker Chaffer's Later Work Can use hidden text, e.g. white on white or many other tricks
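
Why chaff works against a linear text classifier can be shown with a toy scorer. This is a minimal sketch, not any real filter: the word weights, the threshold, and the function names are all hypothetical, chosen only to show how appended "good" words pull a summed score down.

```python
# Hypothetical per-word weights (positive = spammy, negative = good).
# A real filter would learn these from labeled mail.
WEIGHTS = {
    "mortgage": 2.0, "free": 1.5, "rates": 1.0,
    "towel": -1.5, "galaxy": -1.0, "regards": -1.0, "meeting": -1.5,
}


def spam_score(text: str) -> float:
    """Sum the weights of known words; unknown words contribute nothing."""
    return sum(WEIGHTS.get(word, 0.0) for word in text.lower().split())


def is_spam(text: str, threshold: float = 2.0) -> bool:
    """Linear decision rule: flag the message if the summed score is high."""
    return spam_score(text) > threshold


spam = "free mortgage rates"
chaffed = spam + " towel galaxy regards meeting"  # same spam plus chaff

print(spam_score(spam), is_spam(spam))
print(spam_score(chaffed), is_spam(chaffed))
```

With these made-up weights the plain message scores 4.5 and is flagged, while the chaffed copy scores -0.5 and gets through. If the chaff is hidden (white on white), the reader still sees only the spammy text.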

17. Secret Decoder Ring Looks easy Is it?

18. Secret Decoder Ring Dude Character Encoding HTML word breaking
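
Both tricks named on this slide hide words from a filter that reads the raw HTML. The sketch below shows one way a filter might approximate what the reader actually sees. It is an assumption-laden illustration: the function name is hypothetical, and a regular expression is a crude stand-in for a real HTML parser.

```python
import html
import re

# Matches HTML comments and tags. A production filter would use a real parser.
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def visible_text(raw_html: str) -> str:
    """Approximate the rendered text: drop tags/comments, then decode entities."""
    without_tags = TAG_OR_COMMENT.sub("", raw_html)
    return html.unescape(without_tags)


# "Free" broken by a comment, "mortgage" broken by an empty tag pair,
# and the "r" of "rates" written as a numeric character reference.
obscured = "Fr<!-- x -->ee mort<b></b>gage &#114;ates"
print(visible_text(obscured))
```

Removing tags without inserting spaces rejoins words that were split on purpose, but it would also glue together words separated only by block-level tags, so a real system needs to know which tags render as breaks.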

19. Diploma Guy Word Obscuring





24. More of Diploma Guy Diploma Guy is good at what he does

25. Trends in Spam Exploits (Hulten et al.)

26. Solutions to Spam Filtering Machine Learning Matching/Fuzzy Hashing (Blackhole Lists (IP addresses)) Postage Turing Tests, Money, Computation (Disposable Email Addresses) Smart Proof
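
The slide lists computation as one form of postage but does not say how it would work. One generic possibility, sketched here purely as an assumption, is a hash-based proof of work: the sender searches for a number that makes a hash of the message start with several zero bits, and the recipient verifies it with a single hash. All names and the difficulty setting are hypothetical.

```python
import hashlib
from itertools import count


def _digest(message: str, nonce: int) -> bytes:
    """SHA-256 of the message combined with a candidate nonce."""
    return hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()


def _has_leading_zero_bits(digest: bytes, bits: int) -> bool:
    """True if the first `bits` bits of the digest are all zero."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0


def pay_postage(message: str, bits: int = 12) -> int:
    """Sender's side: try nonces until one meets the difficulty target."""
    for nonce in count():
        if _has_leading_zero_bits(_digest(message, nonce), bits):
            return nonce


def check_postage(message: str, nonce: int, bits: int = 12) -> bool:
    """Recipient's side: one hash verifies the sender's work."""
    return _has_leading_zero_bits(_digest(message, nonce), bits)


stamp = pay_postage("to:alice|hello")
print(check_postage("to:alice|hello", stamp))
```

The point of the design is asymmetry: each extra bit of difficulty roughly doubles the sender's expected work, while checking stays cheap, which matters little for one message and a great deal for millions.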

27. Filtering Technique: Machine Learning Learn spam versus good Problem: need source of training data Get users to volunteer GOOD and SPAM Over 100,000 volunteers on Hotmail, over 50,000 new labeled examples/day. Use standard text classification features, but also email/spam features Time of day, number of recipients, etc. But spammers are adapting to machine learning too Images, different words, misspellings, etc.
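
Combining text features with email-specific ones might look like the sketch below. It is an illustration only: the feature names and the sample message are made up, and it assumes a single-part plain-text message with a valid Date header.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_features(raw_message: str) -> dict:
    """Bag-of-words features plus two email-specific ones (hypothetical names)."""
    msg = message_from_string(raw_message)
    features = {}
    # Standard text-classification features: one indicator per body word.
    for word in msg.get_payload().lower().split():
        features[f"word={word}"] = 1
    # Email-specific features mentioned on the slide.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    features["num_recipients"] = len(recipients)
    features["hour_sent"] = parsedate_to_datetime(msg["Date"]).hour
    return features


RAW = "\n".join([
    "From: a@example.com",
    "To: b@example.com, c@example.com",
    "Cc: d@example.com",
    "Date: 06 Jun 2005 03:15:00 -0700",
    "Subject: hi",
    "",
    "Free rates today",
])
print(extract_features(RAW))
```

The resulting dictionary can be fed to any classifier that accepts sparse named features.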

28. Finding Cool Problems by Building Systems Fun problems we found when we shipped adaptation for a spam filter Fun problems we found when we worried about losing good mail.

29. What Happened When we Shipped an Adaptive Spam Filter The first spam filter we shipped was adaptive If user corrected mistakes, we improved the filter. What to do if the user does not correct mistakes? We assumed the filter was correct For users who rarely fixed mistakes, this led to catastrophically bad results: the filter got worse and worse and worse

30. Threshold Drift: Conservative Threshold Setting

31. Threshold Drift: Lots of Spam Classified as Good

32. Threshold Drift: New Separator Parallel to Old

33. Threshold Drift: New Separator Parallel to Old

34. Adaptation with partial user feedback is hard Users may correct all errors, or only all spam, all good, 50% spam, 10% spam, no errors, etc. Need to work no matter what the user correction rate is Great problem that you find when you try to build a real system
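
The drift described on the preceding slides can be reproduced in a toy one-dimensional setting. This is a simulation sketch, not the shipped filter: the scores, the fitting rule (midpoint of the class means plus a conservative margin), and the function names are all assumptions. The user here never corrects anything, so each round retrains on the filter's own labels.

```python
from statistics import mean


def fit_threshold(scores, labels, margin):
    """Midpoint between class means, shifted up by `margin` to protect good mail."""
    good = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    return (mean(good) + mean(spam)) / 2 + margin


def self_train(scores, true_labels, margin=2.0, rounds=5):
    """Refit repeatedly on the filter's own uncorrected predictions."""
    threshold = fit_threshold(scores, true_labels, margin)
    history = [threshold]
    for _ in range(rounds):
        predicted = [s > threshold for s in scores]
        if all(predicted) or not any(predicted):
            break  # one class is empty; nothing left to fit
        threshold = fit_threshold(scores, predicted, margin)
        history.append(threshold)
    return history


scores = list(range(11))                 # toy "spamminess" scores 0..10
true_labels = [s >= 5 for s in scores]   # scores 5..10 really are spam
history = self_train(scores, true_labels)
print(history)
```

In this toy run the threshold climbs from 6.75 to 8.75, so the filter goes from catching four of the six spam messages to catching two. Every spam message that slips under the conservative threshold is treated as good mail in the next round, which pushes the separator further in the same direction.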

35. Fun problems we found when we worried about losing good mail Most machine learning focuses on accuracy Assumes all errors equally bad For spam (and most other problems) cost of deleting good mail much higher than cost of spam in inbox

36. Our technique (Scott Yih and Joshua Goodman) First, learn a model on all training data (e.g. linear classifier) Pick the subset of the data in the region you care about Find all messages, good and spam, that are more than, say, 50% likely to be spam according to the first model Train a new model on only this data At test time, use both models Works substantially better than other techniques: at the desired low false positive rate, reduce spam by 20%-40% compared to normal techniques. Can make exciting progress even in a well-explored area like text classification when you build a system.
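
The steps on this slide can be sketched with a small logistic regression written from scratch. This is a minimal illustration, not the authors' implementation: the learner, the toy data, and all names are assumptions. In particular, the slide does not say how the two models are combined at test time; the sketch assumes the second model is consulted only when the first puts a message in the region of interest.

```python
import math


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, x):
    """Probability of spam under a (weights, bias) linear model."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(X, y, epochs=500, lr=0.1):
    """Plain batch gradient descent; returns (weights, bias)."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            error = predict((weights, bias), xi) - yi
            grad_b += error
            for j, value in enumerate(xi):
                grad_w[j] += error * value
        bias -= lr * grad_b / len(X)
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
    return weights, bias


def train_two_stage(X, y, cutoff=0.5):
    """Stage 1 on all data; stage 2 only on messages stage 1 scores above cutoff."""
    first = train_logistic(X, y)
    region = [(xi, yi) for xi, yi in zip(X, y) if predict(first, xi) > cutoff]
    if len({yi for _, yi in region}) < 2:
        return first, first  # region holds a single class; reuse stage 1
    second = train_logistic([xi for xi, _ in region], [yi for _, yi in region])
    return first, second


def predict_two_stage(models, x, cutoff=0.5):
    """Assumed combination rule: stage 2 decides only inside the region."""
    first, second = models
    p_first = predict(first, x)
    return p_first if p_first <= cutoff else predict(second, x)


# Toy features: [count of spammy words, count of good words]; 1 = spam.
X = [[3, 0], [2, 0], [2, 1], [3, 1], [2, 0], [1, 1], [1, 2], [0, 2], [0, 3]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0]
models = train_two_stage(X, y)
print(predict_two_stage(models, [0, 3]), predict_two_stage(models, [3, 0]))
```

The idea behind the second stage is that it spends all of its capacity on the messages near the operating point, where a mistake means deleting good mail.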

37. Conclusion (1/2) Building systems is a great way to find interesting and important new problems Sometimes leads to fundamental research

38. Conclusion (2/2)

39. Disposable Email Addresses You have one address for each sender JOSHUAGO1895422@microsoft.com All go to same mailbox If I give you my address, and you send me spam, I just delete the address How do new senders get an address? If I send mail to 3 people, which address is it From? Hard to remember!
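
The bookkeeping behind disposable addresses is small, as the sketch below suggests. It is an illustration under stated assumptions: the class and method names are hypothetical, the seven-digit suffix merely mimics the example on the slide, and `example.com` stands in for a real domain.

```python
import secrets


class DisposableAddresses:
    """One address per sender, all delivering to the same mailbox."""

    def __init__(self, user: str, domain: str):
        self.user, self.domain = user, domain
        self._issued = {}  # address -> sender it was given to

    def issue(self, sender: str) -> str:
        """Create a fresh, unguessable address for one sender."""
        while True:
            suffix = f"{secrets.randbelow(10**7):07d}"
            address = f"{self.user}{suffix}@{self.domain}"
            if address not in self._issued:
                self._issued[address] = sender
                return address

    def revoke(self, address: str) -> None:
        """Delete an address that started receiving spam."""
        self._issued.pop(address, None)

    def accepts(self, address: str) -> bool:
        """Mail is delivered only to addresses that are still live."""
        return address in self._issued


book = DisposableAddresses("joshuago", "example.com")
for_alice = book.issue("alice")
for_bob = book.issue("bob")
book.revoke(for_alice)  # alice's address leaked to a spammer
print(book.accepts(for_alice), book.accepts(for_bob))
```

The code does nothing to answer the questions the slide raises: how a new sender obtains an address in the first place, and which address goes in the From line of a message to several people.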

40. My Favorite Solution If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail. So, when new Hotmail users sign up, send them 100 really tempting ads If they answer any of them, terminate account

41. My Favorite Solution If we could get everyone at Hotmail to never answer any spam, spammers would just give up sending to Hotmail. So, when new Hotmail users sign up, send them 100 really tempting ads If they answer any of them, terminate account Hotmail management refuses to consider this.

42. I tried to ship a grammar checker Eric Brill gave a keynote in ??? "Processing Natural Language without Natural Language Processing" All you need is lots of data You can build a grammar checker with very simple machine learning. Solve common grammar problems like "their"/"they're", etc. Makes NLP sound really boring and problems seem easy. Grammar checking is actually a very interesting problem

43. Why grammar checking is interesting (and hard) after all Product groups already had good solutions for English Wanted Brazilian Portuguese There's tons of well-edited data for English Try finding data for Brazilian Portuguese, etc. "There's no data like more data" only applies if there is more data English is uninflected, but most languages have strong inflection If you don't morphologically analyze, the vocabulary is effectively huge, multiplying the data sparsity problem

44. What else went wrong Top priority: agreement (singular/plural, gender) Traditional ML approach to grammar checking ("confusable word pairs") is local, no structure Works well for > 90% of "test" instances, because most agreement is local. People doesn't make mistakes when the subject and verb is next to each other People who make a mistake is most likely to do so when the subject and verb is far apart. Need grammar, or some other powerful technique No Brazilian Portuguese treebank Grammar checking is a great problem for NLP Trying to build a real system helps us find problems we didn't even know we had.

45. Blackhole Lists Lists of IP addresses that send spam Open relays, Open proxies, DSL/Cable lines, etc... Easy to make mistakes Open relays, DSL, Cable send good and spam... Who makes the lists? Some list-makers very aggressive Some list-makers too slow
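
Checking a sender against a blackhole list is a simple membership test, sketched below. The list contents and the function name are hypothetical, and the addresses come from ranges reserved for documentation rather than from any real list.

```python
import ipaddress

# Hypothetical entries, using documentation-only address ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # e.g. a whole DSL/cable range
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single open relay
]


def is_blackholed(sender_ip: str) -> bool:
    """True if the sending IP falls inside any listed network."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


print(is_blackholed("192.0.2.55"))     # inside the listed range
print(is_blackholed("198.51.100.18"))  # neighbor of the listed relay
```

Listing an entire range blocks every sender inside it, good or bad, which is one way the mistakes mentioned on the slide arise.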

46. Nigerian Chatter
tatyanaatkins: want to make money?
joshuagood9: how?
tatyanaatkins: have run a textile company and get pay in cheques and money orders
joshuagood9: how do I make money?
tatyanaatkins: i gt my clients to send them to u while u cash em and remove your pay then sen the rest to me
joshuagood9: Why don't you cash them yourself?
tatyanaatkins: because presently i am traveling around and this come in at a rate faster than i can
tatyanaatkins: need assistance in catching up
tatyanaatkins: if u wish i can send u the letter of incoporation
joshuagood9: yes, email it to me
joshuagood9: joshuagood9@yahoo.com
tatyanaatkins: hold on
joshuagood9: you are in nigeria?
tatyanaatkins: yes
tatyanaatkins: that's where the factory is
joshuagood9: how much will you pay me
tatyanaatkins: u get up to 200 dollars every delivery
joshuagood9: what is in a delivery? how do I get the money to you?
tatyanaatkins: i get the clients to send them to u
joshuagood9: and then what?
tatyanaatkins: u cash it and send via western union
joshuagood9: sounds easy
tatyanaatkins: yeah
joshuagood9: why do you pay me so much money?
joshuagood9: how many do I have to cash? Is one "delivery" one check? or a lot?
tatyanaatkins: cos people have eloped with my money n the past
joshuagood9: why will you trust me?
tatyanaatkins: so i have decided to pay good so we all can be satisfied
joshuagood9: that makes sense
joshuagood9: Let me call you on the phone, and we can talk about it
tatyanaatkins: ok
joshuagood9: what is your number?
tatyanaatkins: 2340833830119
joshuagood9: oh, that's international
joshuagood9: I;m at work now
joshuagood9: I'll have to call you later, from home
tatyanaatkins: ok
tatyanaatkins: are u interested?
joshuagood9: of course!
tatyanaatkins: so i'll send u your letter
joshuagood9: my letter?
tatyanaatkins: of employment
joshuagood9: oh, ok

48. Chat Bot
joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?... im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb
