Spam and Personal Privacy

Presented by: Ashley Embry

Outline
• What is Spam?
  A. Types of Spam
• Where Did the Word “spam” Originate?
• How Spam Begins: A General Explanation
• Who Has the Potential to be a Spammer?
• Statistics About Spam
• Getting Rid of Spam
• Breakdown of a Spam Filter
• Conclusions
• Questions for the Class

What is Spam?
Several definitions of spam are in common use:
• Electronic junk mail or junk newsgroup postings.
• Any unsolicited automated e-mail.
• E-mail advertising for some product sent to a mailing list or newsgroup.
In short, spam is flooding the Internet with many copies of the same message in an attempt to force the message on people who would not otherwise choose to receive it.

Types of Spam
There are two main types of spam:
1. Usenet spam is aimed at people who read newsgroups but rarely or never post and give their information away.
2. E-mail spam targets individual users with direct mail messages. E-mail spam lists are created by scanning Usenet postings, stealing Internet mailing lists, or searching the Web for addresses.

Where Did the Word “spam” Originate?
The practice of calling a flood of inappropriate postings “spam” comes from a Monty Python skit in which a couple goes into a restaurant and the wife tries to order something other than Spam. In the background, a group of Vikings sings the praises of Spam. Pretty soon the only thing that you can hear is… Like the song, spam is the endless repetition of worthless text.

Another proposal is that the term “spam” was coined by a computer lab group at the University of Southern California, who chose the name because junk mail shares many characteristics with the lunch meat Spam:
• Nobody wants it or ever asks for it.
• No one ever eats it; it is the first item to be pushed to the side when eating the entrée.
• Sometimes it is actually tasty, like the 1% of junk mail that is really useful to some people.

How Spam Begins: A General Explanation
• Spammers only need access to your address. After that, it’s just a matter of sending the e-mails.
• The primary sources that spammers use are newsgroups and chat rooms.
• The second source is the Web itself. Spammers can create search engines that look for the @ sign, which indicates an e-mail address.
• The third source is sites created specifically to attract e-mail recipients, with lures such as:
  • “Win $1 million!!! Just Click Here!”
  • “Would you like newsletters from our partners?”
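To see why an address published on a web page is so easy to collect, the "@ sign" search described above can be sketched as a pattern match. This is only an illustration: the pattern, the function name `find_addresses`, and the sample text are assumptions made for this sketch, and the pattern is deliberately simpler than the full rules for valid addresses. Running something like it over your own pages shows which addresses you are exposing.

```python
import re

# Deliberately simple, illustrative pattern: a local part, "@", then a dotted domain.
# It is an assumption for this sketch and does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_addresses(page_text):
    """Return the distinct address-like strings in page_text, in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(page_text):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up page text using reserved example domains.
sample_page = (
    "Contact jdoe@example.com or sales@example.org. "
    "Twitter: @jdoe. Again: jdoe@example.com"
)
print(find_addresses(sample_page))
```

The bare "@jdoe" handle is skipped because nothing precedes the @ sign, which shows that the @ sign alone is only a starting point for the match.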

Finally, probably the most common way to collect e-mail addresses is to search the e-mail servers of large e-mail hosting companies like Hotmail.
• The Hotmail article “A Spammer’s Paradise” reads: A dictionary attack utilizes software that opens a connection to the mail server and rapidly submits millions of random e-mail addresses. Many of these addresses have slight variations, such as "jdoe1abc@hotmail.com" and "jdoe2def@hotmail.com". The software then records the address locations and adds those addresses to the spammer's list. These lists are typically resold to many other spammers.

Who Has the Potential to be a Spammer?
Anyone can be a spammer.
Scenario: Let’s say your grandmother bakes the best banana nut bread ever created, and you want to sell the recipe for $5. You have 100 people in your personal e-mail address book. You send out an e-mail advertising, “Big Momma’s Nana Nut Bread - only $5!!!” From your 100 e-mails you get 2 orders and make $10. Imagine if you had sent out 1,000,000 e-mails…
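The arithmetic behind the scenario can be worked out in a few lines. The sketch below assumes, generously, that a mailing to a million strangers would get the same 2-in-100 response as the address book of friends and family; the function name `projected_revenue` is made up for this illustration.

```python
def projected_revenue(emails_sent, response_rate, price):
    """Estimate orders and revenue, assuming the response rate stays constant."""
    orders = round(emails_sent * response_rate)
    return orders, orders * price


# Figures from the scenario: 2 orders from 100 e-mails at $5 each.
response_rate = 2 / 100
for emails_sent in (100, 1_000_000):
    orders, revenue = projected_revenue(emails_sent, response_rate, 5)
    print(f"{emails_sent:>9,} e-mails -> {orders:,} orders, ${revenue:,}")
```

Even if the real response rate were far lower, sending e-mail costs the sender almost nothing, which is what makes the numbers attractive to a spammer.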

Statistics About Spam
In a single day in May, the No. 1 Internet service provider, AOL Time Warner (AOL), blocked 2 billion spam messages (88 per subscriber) from hitting its customers’ e-mail accounts. Microsoft (MSFT), which operates the No. 2 service provider MSN as well as Hotmail, says it blocks an average of 2.4 billion spams per day.

Getting Rid of Spam
• Avoid giving out your e-mail address to unfamiliar or unknown recipients.
• Use your e-mail application’s filtering features.
• Report the spam e-mailer to the spammer’s ISP.
• Use spam filtering software.

Breakdown of a Spam Filter
The best technology currently available to stop spam is spam filtering software. Most spam blockers use filters that search for commonly used phrases or overly aggressive writing styles found in mass e-mail marketing. Spammers try to fool the filters by changing their writing styles and formats so that their messages can sneak past. The simplest filters match keywords such as “xxx” or “viagra,” but they are also more likely to block e-mails that you do want to receive.
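A minimal sketch of a keyword filter shows both weaknesses described above. The blocklist, the function name `is_blocked`, and the three sample messages are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical blocklist for illustration, using the keywords named above.
BLOCKED_KEYWORDS = {"xxx", "viagra"}


def is_blocked(message):
    """Return True if any blocked keyword appears as a whole word in the message."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    return not BLOCKED_KEYWORDS.isdisjoint(words)


messages = [
    "Cheap VIAGRA, no prescription needed!!!",              # spam, caught
    "Ch3ap v1agra, no prescripti0n needed!!!",              # spam, slips past after respelling
    "Mom, the pharmacist asked whether Dad takes Viagra.",  # wanted mail, wrongly blocked
]
for message in messages:
    print(is_blocked(message), message)
```

The second message shows a spammer changing format to get past the filter, and the third shows a wanted e-mail being blocked.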

Example
More advanced filters, such as Bayesian filters, take this approach further and identify spam statistically, based on how often each word appears. An example of how this statistical filtering works:
• Start with one collection of spam and one of nonspam mail, each containing about 4,000 messages.
• Scan the entire text of each message in each collection.
• Consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens (words) and everything else to be a token separator (e.g., qt234abc, $75, u’tt).
• Count the number of times each token occurs in each collection. You will end up with two large tables, each showing the different tokens and how many times each one appeared in that collection’s messages.
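The tokenizing and counting steps above might look like the following sketch. The names `tokenize` and `count_tokens`, the lowercasing of tokens, and the tiny sample collections are assumptions made for illustration; a real filter would be trained on thousands of messages.

```python
import re
from collections import Counter

# Token characters per the rule above: alphanumerics, dashes, apostrophes
# (straight or curly), and dollar signs. Everything else separates tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9$'’-]+")


def tokenize(text):
    """Split text into lowercase tokens (lowercasing is an assumption of this sketch)."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def count_tokens(messages):
    """Build one table mapping each token to its total count across a collection."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Tiny made-up collections standing in for the roughly 4,000-message corpora.
spam_messages = ["Win $75 now! Click here now.", "Don't miss this promotion"]
good_messages = ["Sorry I missed your call", "Lunch at noon? Don't be late"]

bad = count_tokens(spam_messages)
good = count_tokens(good_messages)
print(bad["now"], bad["$75"], good["don't"], bad["sorry"])
```

A `Counter` returns 0 for tokens it has never seen, so the two tables can be compared token by token without special cases.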

Finally, create a third table that maps each token to the probability (ranging from .01 to .99) that an e-mail containing it is spam. When new mail arrives, it is scanned into tokens, and the fifteen tokens whose probabilities are farthest from the neutral probability of .5 are used to calculate the probability that the e-mail is spam.
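Choosing the tokens farthest from .5 can be sketched as a sort by distance from the neutral value. Several details here are assumptions of the sketch and are not specified in the slides: the function name `most_interesting`, counting each distinct token once, the probability of 0.4 given to tokens missing from the table, and the sample table itself.

```python
def most_interesting(tokens, probability_table, limit=15, unknown=0.4):
    """Pick the `limit` distinct tokens whose spam probability is farthest from 0.5.

    `unknown` is the probability given to tokens missing from the table;
    0.4 is an assumption of this sketch, not something the slides specify.
    """
    scored = {token: probability_table.get(token, unknown) for token in set(tokens)}
    # Sort by descending distance from 0.5, breaking ties alphabetically.
    ranked = sorted(scored.items(), key=lambda item: (-abs(item[1] - 0.5), item[0]))
    return ranked[:limit]


# Hypothetical probability table for illustration.
table = {"promotion": 0.99, "sorry": 0.05, "valuable": 0.82, "the": 0.5}
mail_tokens = ["the", "promotion", "is", "valuable", "sorry", "the"]
for token, probability in most_interesting(mail_tokens, table, limit=3):
    print(token, probability)
```

Neutral words such as "the" sit at or near .5 and drop out, so only strongly spam-like or strongly innocent tokens influence the final score.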

Algorithms/Programming Language
To determine the probability that a token indicates spam:

    ;; g: count in good mail (doubled); b: count in spam
    (let ((g (* 2 (or (gethash token good) 0)))
          (b (or (gethash token bad) 0)))
      ;; skip tokens seen fewer than 5 times
      (unless (< (+ g b) 5)
        (max .01
             (min .99 (float (/ (min 1 (/ b nbad))
                                (+ (min 1 (/ g ngood))
                                   (min 1 (/ b nbad)))))))))

To determine whether the e-mail is spam, using the probabilities of the 15 chosen tokens:

    (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                         probs)))))
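For readers less familiar with Lisp, the first expression above can be ported to Python roughly as follows. The function name `token_spam_probability` and the counts in the example are made up for this sketch; `good` and `bad` are the two count tables, and `ngood` and `nbad` are the numbers of messages in each collection.

```python
def token_spam_probability(token, good, bad, ngood, nbad):
    """Port of the first Lisp expression above.

    good/bad map tokens to counts in the nonspam/spam collections;
    ngood/nbad are the numbers of messages in each collection.
    Returns None for tokens seen fewer than 5 times (good counts doubled).
    """
    g = 2 * good.get(token, 0)
    b = bad.get(token, 0)
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    # Clamp to the range .01 to .99 so no single token is treated as certain.
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# Made-up counts for illustration, with 4,000 messages per collection.
good_counts = {"sorry": 300, "promotion": 0}
bad_counts = {"sorry": 30, "promotion": 120, "rare": 2}

for token in ("promotion", "sorry", "rare"):
    print(token, token_spam_probability(token, good_counts, bad_counts, 4000, 4000))
```

A token that appears only in spam is capped at .99, and a token seen too rarely gets no probability at all.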

Example token list with probabilities:
madam      0.99
promotion  0.99
shortest   0.047225013
sorry      0.0499
valuable   0.82347
*Information taken from www.paulgraham.com
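As a rough illustration, the combining formula from the previous slide can be applied to the five example tokens listed here. The function name `combined_spam_probability` is an assumption of this sketch, and a real filter would combine fifteen tokens, not five.

```python
from math import prod


def combined_spam_probability(probs):
    """Port of the second Lisp expression: prod(p) / (prod(p) + prod(1 - p))."""
    probs = list(probs)
    spam_product = prod(probs)
    good_product = prod(1 - p for p in probs)
    return spam_product / (spam_product + good_product)


# The five example tokens listed above (a real filter would use fifteen).
example_tokens = {
    "madam": 0.99,
    "promotion": 0.99,
    "shortest": 0.047225013,
    "sorry": 0.0499,
    "valuable": 0.82347,
}
score = combined_spam_probability(example_tokens.values())
print(f"combined spam probability: {score:.3f}")
```

Two tokens at .99 outweigh the two innocent-looking tokens near .05, so the combined score for this list comes out close to 1.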

Wrapping It Up
Whether spammers are constructing a mailing list or defenders are implementing a spam filtering program, both sides of the spam problem rely on the concepts and techniques of computer science.

Questions for the Class
By the end of this presentation you should be able to answer the following question: Name two techniques we learned in CIS class that are used by spammers or in spam filtering.
• Pattern matching, when searching for e-mail addresses or when evaluating words for spam tendencies.
• Writing algorithms that are eventually implemented as programs.

Bibliography
“Before Spam Brings the Web to Its Knees.” June 10, 2003. http://www.businessweek.com/technology/content/jun2003/tc20030610_1670_tc104.htm
Brain, Marshall. “How Spam Works.” http://computer.howstuffworks.com/spam.htm
“Getting Rid of Spam.” http://www.webopedia.com/DidYouKnow/Internet/2002/GettingRidofSpam.asp
Graham, Paul. “A Plan for Spam.” Aug. 2002. http://www.paulgraham.com/spam.html
Mueller, Scott H. “What is Spam?” http://spam.abuse.net/overview/whatisspam.shtml
“Origins of Spam.” http://digital.net/~gandalf/spamfaq.html#item8c
“Spam.” July 20, 2004. http://www.webopedia.com/TERM/s/spam.html
