SPAM Christian Loza Srikanth Palla Liqin Zhang
Overview • Introduction • Background • Measurement • Methods • Compare different methods • Conclusions
Introduction
If you use email, you have probably received spam recently: an unsolicited, unwanted message sent to you without your permission. Sending spam violates the Acceptable Use Policy (AUP) of almost all ISPs and can lead to the termination of the sender's account. Because the recipient directly bears the cost of delivery, storage, and processing, one could regard spam as the electronic equivalent of "postage-due" junk mail.
Introduction
Spammers frequently engage in deliberate fraud to send out their messages. Spammers often use false names, addresses, phone numbers, and other contact information to set up "disposable" accounts at various Internet service providers. They also often use falsified or stolen credit card numbers to pay for these accounts. This allows them to move quickly from one account to the next as the host ISPs discover and shut down each one.
Introduction • In recent years, spam has shown no sign of slowing its growth • This is mainly because it works • For the sender, its advantage is that it is a cheap way to increase the customer base.
Spammers frequently go to great lengths to conceal the origin of their messages. They do this by spoofing e-mail addresses. The spammer exploits the fact that the email protocol SMTP does not by itself verify the sender address, so a message can be made to appear to originate from another email address. Some ISPs and domains require the use of SMTP AUTH, allowing positive identification of the specific account from which an e-mail originates.
One cannot completely spoof an e-mail address chain, since the receiving mailserver records the actual connection from the last mailserver's IP address; however, spammers can forge the rest of the history of mailservers that the e-mail has ostensibly traversed. Spammers frequently seek out and make use of vulnerable third-party systems such as open mail relays and open proxy servers.
Address Collection
Spammers may harvest e-mail addresses from a number of sources. A popular method uses e-mail addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web for pages with addresses (such as corporate staff directories) can yield thousands of addresses, most of them deliverable.
Address Collection
Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally crawled these resources for email addresses. Many spammers use programs called web spiders to find email addresses on web pages. Because spammers offload the bulk of their costs onto others, however, they can use even more computationally expensive means to generate addresses.
Address Collection
A dictionary attack consists of an exhaustive attempt to gain access to a resource by trying all possible credentials, usually usernames and passwords. Spammers have applied this principle to guessing email addresses, for example by taking common names and generating likely email addresses for them at each of thousands of domain names. Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site.
Terminology
To better understand the concepts in this presentation, consider the following terminology.
• Mail User Agent (MUA): the program the client uses to send and receive e-mail. It is usually referred to as the "mail client." Examples are Pine and Eudora.
• Mail Transfer Agent (MTA): the program running on the server that stores and forwards e-mail messages. It is usually referred to as the "mail server program." Examples are sendmail and Microsoft Exchange Server.
In a normal configuration, sendmail sits in the background waiting for new messages. When a new connection arrives, a child process is invoked to handle the connection, while the parent process goes back to listening for new connections. When a message is received, the sendmail child process puts it into the mail queue (usually stored in /var/spool/mqueue). If it is immediately deliverable, it is delivered and removed from the queue. If it is not immediately deliverable, it will be left in the queue and the process will terminate. Messages left in the queue will stay there until the next time the queue is processed. The parent sendmail will usually fork a child process to attempt to deliver anything left in the queue at regular intervals.
Structure of E-mail Message
Email messages are composed of two parts:
1. Headers (lines of the form "field: value" which contain information about the message, such as "To:", "From:", "Date:", and "Message-ID:")
2. Body (the text of the message)
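The split between headers and body can be sketched with Python's standard `email` package. This is only an illustration: the sample message, its addresses, and the helper name `split_message` are made up here and do not come from the presentation.

```python
from email import message_from_string

# Hypothetical sample message; the addresses and text are illustrative only.
RAW_MESSAGE = (
    "From: John Doe <jdoe@example.org>\n"
    "To: John Smith <jsmith@example.org>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
)


def split_message(raw):
    """Return (headers, body): headers as a dict of field -> value, body as text."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # "field: value" lines before the blank line
    body = msg.get_payload()      # everything after the first blank line
    return headers, body


headers, body = split_message(RAW_MESSAGE)
print(headers["Subject"])
print(body)
```

Note that a plain dict keeps only one value per field name, so a real tool that needs every "Received:" line would iterate over `msg.items()` instead.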
Example
From firstname.lastname@example.org Mon Jul 5 23:46:19 1999
Received: (from johndoe@localhost) by students.uiuc.edu (8.9.3/8.9.3) id LAA05394; Mon, 5 Jul 1999 23:46:18 -0500
Received: from staff.uiuc.edu (staff.uiuc.edu [220.127.116.11]) by students.uiuc.edu (8.9.3/8.9.3) id XAA24214; Mon, 5 Jul 1999 23:46:25 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <email@example.com>
To: John Smith <firstname.lastname@example.org>
Message-Id: <199907052346.LAA05394@students.uiuc.edu>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank line.
The message body can span multiple lines.
Here is an example SMTP transaction:
1. Client connects to server's SMTP port (25).
2. Server: 220 staff.uiuc.edu ESMTP Sendmail 8.10.0/8.10.0 ready; Mon, 13 Mar 2000 14:54:08 -0600
3. Client: helo students.uiuc.edu
4. Server: 250 staff.uiuc.edu Hello email@example.com [18.104.22.168], pleased to meet you
5. Client: mail from: firstname.lastname@example.org
6. Server: 250 2.1.0 email@example.com... Sender ok
7. Client: rcpt to: firstname.lastname@example.org
8. Server: 250 2.1.5 email@example.com... Recipient ok
9. Client: data
10. Server: 354 Enter mail, end with "." on a line by itself
11. Client:
    Received: (from johndoe@localhost) by students.uiuc.edu (8.9.3/8.9.3) id LAA05394; Mon, 5 Jul 1999 23:46:18 -0500
    Date: Mon, 5 Jul 1999 23:46:18 -0500
    From: John Doe <firstname.lastname@example.org>
    To: John Smith <email@example.com>
    Message-Id: <199907052346.LAA05394@students.uiuc.edu>
    Subject: This is a subject header.

    This is the message body. It is separated from the headers by a blank line.
    The message body can span multiple lines.
    .
12. Server: 250 2.0.0 e2DKuDw34528 Message accepted for delivery
13. Client: quit
14. Server: 221 2.0.0 staff.uiuc.edu closing connection
The sender and recipient addresses used in the SMTP transaction are called the Message Envelope. Note that these addresses do not need to have any similarity to the addresses in the message headers!
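To make the envelope/header distinction concrete, the sketch below builds only the client side of such a dialogue as a list of strings, with no network I/O. The function name `smtp_client_lines` and all hosts and addresses are hypothetical; the point is that the envelope addresses are supplied separately from whatever the "From:" and "To:" headers say.

```python
def smtp_client_lines(helo_host, envelope_from, envelope_to, message_text):
    """Build the client side of a minimal SMTP dialogue (no network I/O)."""
    lines = [
        f"HELO {helo_host}",
        f"MAIL FROM:<{envelope_from}>",   # envelope sender
        f"RCPT TO:<{envelope_to}>",       # envelope recipient
        "DATA",
    ]
    for line in message_text.splitlines():
        # A message line starting with "." gets an extra "." so that it is
        # not mistaken for the end-of-data marker.
        lines.append("." + line if line.startswith(".") else line)
    lines.append(".")                     # a lone "." ends the message
    lines.append("QUIT")
    return lines


# Hypothetical message whose headers differ from the envelope addresses.
MESSAGE = (
    "From: John Doe <jdoe@example.org>\n"
    "To: John Smith <jsmith@example.org>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
)

dialogue = smtp_client_lines(
    "mail.example.net", "bounce@example.net", "someone@example.com", MESSAGE
)
for line in dialogue:
    print(line)
```

Nothing in the dialogue forces the "MAIL FROM" address to match the "From:" header, which is what makes header spoofing possible.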
Delivering Spam Messages
Early on, spammers discovered that if they sent large quantities of spam directly from their ISP accounts, recipients would complain and ISPs would shut their accounts down. Thus, one of the basic techniques of sending spam has become to send it from someone else's computer and network connection. By doing this, spammers protect themselves in several ways: they hide their tracks, get others' systems to do most of the work of delivering messages, and direct the efforts of investigators towards the other systems rather than the spammers themselves.
Mail filters
A mail filter is a piece of software that takes an email message as input. As output, it might pass the message through unchanged for delivery to the user's mailbox, redirect the message for delivery elsewhere, or even throw the message away. Some mail filters are able to edit messages during processing.
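A minimal sketch of such a filter follows, showing the outcomes described above (deliver, redirect, discard) plus an edit during processing. The rules, the word list, the `X-Example-Filter` header, and the "redirect:junk" action name are all invented for illustration and are not taken from any real filter.

```python
from email import message_from_string

# Hypothetical rule data, for illustration only.
SUSPICIOUS_WORDS = {"mortgage", "free", "winner"}


def filter_message(raw):
    """Return (action, text); action is "deliver", "redirect:junk" or "discard"."""
    msg = message_from_string(raw)
    if msg["From"] is None:
        return "discard", None                 # throw the message away
    subject_words = set((msg["Subject"] or "").lower().split())
    msg["X-Example-Filter"] = "checked"        # edit the message during processing
    if subject_words & SUSPICIOUS_WORDS:
        return "redirect:junk", msg.as_string()  # deliver elsewhere
    return "deliver", msg.as_string()          # pass through to the mailbox


spam_like = "From: a@example.org\nSubject: New mortgage rates\n\nClick here.\n"
normal = "From: b@example.org\nSubject: Meeting notes\n\nSee attached.\n"
print(filter_message(spam_like)[0])
print(filter_message(normal)[0])
```

Keyword rules like these are easy to evade, which is one motivation for the statistical methods discussed later in the presentation.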
Introduction • Application of Text Categorization • Spam classification is defined as a binary problem: an email either is spam or is not spam. • Automatic text categorization assigns emails to one of these two categories, using different methods • One of these methods is Centroid-based classification
[Figure: a sample message ("Hello, Hi, this is your opportunity to buy a house with new mortgage rates. To find more about this, just click here.") being assigned to one of two categories, SPAM or NOT SPAM]
Background • Text Classification: classify documents into categories • Spam • Non-spam • Classification process • Preprocess message • Remove tags • Stop-word removal • Word stemming • Training: build the classification model • Testing: evaluate the model
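The preprocessing steps listed above (tag removal, stop-word removal, stemming) can be sketched as follows. The stop-word list is a tiny illustrative sample, and `strip_suffix` is a toy suffix stripper assumed here for brevity; it is not the Porter stemmer or any specific algorithm used by the authors.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "at", "of", "a", "to", "is", "this", "with", "your"}


def strip_suffix(word):
    """Toy stemmer: strip a few common English suffixes (not Porter's algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message):
    """Turn a raw message into a list of stemmed content words."""
    text = re.sub(r"<[^>]+>", " ", message)         # remove HTML-style tags
    tokens = re.findall(r"[a-z]+", text.lower())    # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]


print(preprocess("<b>Buy</b> the new houses"))
```

The resulting token lists are what the training and testing stages would consume.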
Methodologies • Naive Bayes • Centroid-based • Content-based
Bayesianism
Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of a statement. It likewise applies to the degree of belief that a rational agent holds in the truth of a statement. When such degrees of belief are updated using Bayes' theorem, the process is called Bayesian inference.
Bayes' Rule
If A and B are two separate but possibly dependent random events, then:
• The probability of A and B occurring together is $\Pr[A, B]$
• The conditional probability of A, given that B occurs, is $\Pr[A \mid B]$
• The conditional probability of B, given that A occurs, is $\Pr[B \mid A]$
From elementary rules of probability:
$$\Pr[A, B] = \Pr[A \mid B]\,\Pr[B] = \Pr[B \mid A]\,\Pr[A]$$
Dividing the right-hand pair of expressions by $\Pr[B]$ gives Bayes' rule:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A]\,\Pr[A]}{\Pr[B]}$$
In problems of probabilistic inference, we are often trying to estimate the most probable underlying model for a random process, based on some observed data or evidence. If A represents a given set of model parameters, and B represents the set of observed data values, then the terms in Bayes' rule are given the following terminology:
• $\Pr[A]$ is the prior probability of the model A (in the absence of any evidence)
• $\Pr[B]$ is the probability of the evidence B
• $\Pr[B \mid A]$ is the likelihood that the evidence B was produced, given that the model was A
• $\Pr[A \mid B]$ is the posterior probability of the model being A, given that the evidence is B.
Mathematically, Bayes' rule states:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$
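As a small worked example of the rule, take A = "the message is spam" and B = "the message contains a given word". All the probabilities below are made-up numbers chosen for illustration, not measurements from the presentation.

```python
def posterior(prior, likelihood, marginal_likelihood):
    """Bayes' rule: posterior = likelihood * prior / marginal likelihood."""
    return likelihood * prior / marginal_likelihood


# Hypothetical numbers, for illustration only.
p_spam = 0.4                # prior Pr[A]
p_word_given_spam = 0.25    # likelihood Pr[B|A]
p_word_given_ham = 0.05     # Pr[B | not A]

# Marginal likelihood Pr[B], summed over both cases (spam / not spam).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these numbers, seeing the word raises the probability of spam from the prior of 0.4 to about 0.77.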
Representing E-mail for Statistical Algorithms
All statistical algorithms for spam filtering begin with a vector representation of individual e-mail messages. The length of the term vector is the number of distinct words in all the e-mail messages in the training data. The entry for a particular word in the term vector for a particular e-mail message is usually the number of occurrences of the word in that message.
The table below presents toy training data comprising four e-mail messages. These data contain ten distinct words: the, quick, brown, fox, rabbit, ran, and, run, at, and rest.

#  Message                       Spam
1  The quick brown fox           no
2  The quick rabbit ran and ran  yes
3  rabbit run run run            no
4  rabbit at rest                yes

Training data comprising four labeled e-mail messages
#  and  at  brown  fox  quick  rabbit  ran  rest  run  the
1  0    0   1      1    1      0       0    0     0    1
2  1    0   0      0    1      1       2    0     0    1
3  0    0   0      0    0      1       0    0     3    0
4  0    1   0      0    0      1       0    1     0    0

Term vectors corresponding to training data
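The term vectors for the four toy messages can be computed with a few lines of code. This is a minimal sketch: the helper names `build_vocabulary` and `term_vector` are hypothetical, and the only assumptions are lowercasing and splitting on whitespace.

```python
from collections import Counter

# The four toy training messages and their spam labels.
TRAINING = [
    ("The quick brown fox", "no"),
    ("The quick rabbit ran and ran", "yes"),
    ("rabbit run run run", "no"),
    ("rabbit at rest", "yes"),
]


def build_vocabulary(messages):
    """Sorted list of the distinct (lowercased) words in all messages."""
    return sorted({word for m in messages for word in m.lower().split()})


def term_vector(message, vocabulary):
    """Number of occurrences of each vocabulary word in the message."""
    counts = Counter(message.lower().split())
    return [counts[word] for word in vocabulary]


messages = [m for m, _ in TRAINING]
vocab = build_vocabulary(messages)
vectors = [term_vector(m, vocab) for m in messages]

print(vocab)
for row in vectors:
    print(row)
```

Each printed row is one message's term vector, with one entry per distinct word in alphabetical order.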
If the training data comprise thousands of e-mail messages, the number of distinct words often exceeds 10,000. Two simple strategies to reduce the size of the term vector somewhat are to remove “stop words” (words like and, of, the, etc.) and to reduce words to their root form, a process known as stemming (so, for example, “ran” and “run” reduce to “run”). Table 3 shows the reduced term vectors along with the spam label.
   X1     X2   X3     X4      X5    X6   Y
#  brown  fox  quick  rabbit  rest  run  Spam
1  1      1    1      0       0     0    0
2  0      0    1      1       0     2    1
3  0      0    0      1       0     3    0
4  0      0    0      1       1     0    1

Term vectors after stemming and stop-word removal, spam label coded as 0=no, 1=yes
Naive Bayes for Spam
Let $X = (X_1, \ldots, X_d)$ denote the term vector for a random e-mail message, where $d$ is the number of distinct words in the training data after stemming and stop-word removal. Let $Y$ denote the corresponding spam label. The Naive Bayes model seeks to build a model for $\Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d)$. From Bayes' theorem, we have:
$$\Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = \frac{\Pr(Y = 1)\,\Pr(X_1 = x_1, \ldots, X_d = x_d \mid Y = 1)}{\Pr(X_1 = x_1, \ldots, X_d = x_d)}$$
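A minimal sketch of a Naive Bayes spam score on the toy data follows. The "naive" step is the standard assumption that the terms are conditionally independent given the label, so the joint likelihood factorizes over terms. The multinomial word model and the add-one (Laplace) smoothing used here are assumptions made for this example; the presentation does not say which variant it uses.

```python
import math

# Reduced toy term vectors: brown, fox, quick, rabbit, rest, run
X = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 2],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 1, 1, 0],
]
Y = [0, 1, 0, 1]  # 0 = not spam, 1 = spam


def train(X, Y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    d = len(X[0])
    model = {}
    for label in sorted(set(Y)):
        rows = [x for x, y in zip(X, Y) if y == label]
        totals = [sum(col) for col in zip(*rows)]   # word counts in this class
        denom = sum(totals) + alpha * d             # add-alpha smoothing (assumed)
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_word": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def spam_probability(model, x):
    """Pr(Y = 1 | x) under the naive independence assumption."""
    scores = {
        label: p["log_prior"] + sum(c * lw for c, lw in zip(x, p["log_word"]))
        for label, p in model.items()
    }
    top = max(scores.values())                      # subtract max for stability
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    return weights[1] / sum(weights.values())       # divide by the evidence term


model = train(X, Y)
print(spam_probability(model, [0, 0, 0, 1, 1, 0]))  # "rabbit rest"
print(spam_probability(model, [1, 1, 0, 0, 0, 0]))  # "brown fox"
```

Working in log space avoids numerical underflow when messages contain many words, and the final division plays the role of the denominator in Bayes' theorem.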
Centroid-based method • The documents are represented using a Vector-space model. • Each document is represented as a Term Frequency (TF) vector
[Figure: documents d1 to d4 drawn as vectors in a term space with axes t1 and t2]
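Centroid-based method • A refinement of this model is the inverse document frequency (IDF) • This is to limit the discrimination power of frequent terms and stop words, and to emphasize words that appear in specific documents. • The IDF of term i is $\log(N / df_i)$, where $N$ is the number of documents and $df_i$ is the number of documents that contain term i • The document vectors are normalized so that document length does not matter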
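The TF-IDF weighting and normalization can be sketched as below. The natural logarithm and unit-length (Euclidean) normalization are assumptions made for this example, since the slide does not specify the log base or the norm; `tf_idf_vectors` is a hypothetical helper name.

```python
import math


def tf_idf_vectors(term_counts):
    """Weight term-frequency vectors by log(N / df_i), then scale to unit length."""
    n_docs = len(term_counts)
    n_terms = len(term_counts[0])
    # df_i: number of documents in which term i occurs at least once
    df = [sum(1 for doc in term_counts if doc[i] > 0) for i in range(n_terms)]
    idf = [math.log(n_docs / df_i) if df_i else 0.0 for df_i in df]
    vectors = []
    for doc in term_counts:
        weighted = [tf * w for tf, w in zip(doc, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        vectors.append([v / norm for v in weighted] if norm else weighted)
    return vectors


# Two toy documents over three terms; the middle term occurs in both.
counts = [[2, 1, 0], [0, 1, 3]]
for vec in tf_idf_vectors(counts):
    print(vec)
```

A term that occurs in every document gets weight log(1) = 0, which is how IDF limits the influence of very frequent terms.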
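Centroid-based method • The distance between two vectors is defined using the cosine function: $\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|}$ • Finally, one Centroid Vector C is defined for each category (spam/not spam) as the average of the document vectors in that category: $C = \frac{1}{|S|} \sum_{d \in S} d$, where $S$ is the set of training documents in the category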
Centroid-based method • We can measure the similarity between one document and the Centroid of the category with the following function: $\cos(d, C) = \frac{d \cdot C}{\|d\|\,\|C\|}$
Steps: Centroid-based Method • TRAINING Determine the document vectors using TF-IDF. [Figure: training documents d1 to d8 drawn as vectors in a term space with axes t1 and t2]
Steps: Centroid-based Method • TRAINING Calculate the centroid for the categories SPAM and NOT SPAM. [Figure: the same documents with the two centroid vectors C_SPAM and C_NOT SPAM added]
Steps: Centroid-based Method • CLASSIFICATION Given a new document dn, calculate its document vector representation (as in the training stage). [Figure: the new document vector dn in the term space with axes t1 and t2]
Steps: Centroid-based Method • CLASSIFICATION Measure the distance between the vector dn and the Centroids of the categories SPAM / NOT SPAM. [Figure: dn drawn between the centroid vectors C_SPAM and C_NOT SPAM]
Steps: Centroid-based Method • FINAL RESULT Assign the document to the category whose Centroid gives the maximum similarity: $\text{category}(d_n) = \arg\max_{i \in \{1, 2\}} \cos(d_n, C_i)$, where 1 = SPAM and 2 = NOT SPAM
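The training and classification steps can be sketched end to end as below. This is a minimal illustration on hypothetical two-dimensional document vectors (axes t1 and t2); the vectors are made up, are assumed to be already TF-IDF weighted, and the function names are not from the presentation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train(vectors, labels):
    """TRAINING: one centroid per category."""
    return {
        label: centroid([v for v, l in zip(vectors, labels) if l == label])
        for label in set(labels)
    }


def classify(centroids, vector):
    """CLASSIFICATION: the category whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical training vectors in a 2-term space (t1, t2).
docs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
labels = ["SPAM", "SPAM", "NOT SPAM", "NOT SPAM"]

centroids = train(docs, labels)
print(classify(centroids, [0.7, 0.2]))
print(classify(centroids, [0.1, 0.6]))
```

Classification costs one cosine computation per category, independent of the number of training documents, which is one practical appeal of the method.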
Analysis of Results • The standard measures of performance for text classification methods are Precision and Recall
$$P = \frac{\text{number of correctly predicted positives}}{\text{number of predicted positive examples}}$$
$$R = \frac{\text{number of correctly predicted positives}}{\text{number of all positive examples}}$$
Analysis of Results • Neither Precision nor Recall gives a good measure by itself. To get an idea of the performance, we have to combine them:
$$F = \frac{2PR}{P + R}$$
[Figure: plot with axes P and R, marking the direction in which results are better]
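The three measures can be computed from lists of true and predicted labels as sketched below. Treating "spam" as the positive class and the example labels themselves are assumptions made for illustration; they are not results from the presentation.

```python
def precision_recall_f(true_labels, predicted_labels, positive="spam"):
    """Return (P, R, F) with `positive` taken as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    correct_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    predicted_pos = sum(1 for _, p in pairs if p == positive)
    actual_pos = sum(1 for t, _ in pairs if t == positive)
    precision = correct_pos / predicted_pos if predicted_pos else 0.0
    recall = correct_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels for six messages.
truth = ["spam", "spam", "spam", "spam", "ham", "ham"]
guess = ["spam", "spam", "ham", "ham", "spam", "ham"]

p, r, f = precision_recall_f(truth, guess)
print(p, r, f)
```

Here two of the three predicted positives are correct (P = 2/3) and two of the four real positives are found (R = 1/2), so F lies between the two values.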
Some results • Compared against kNN and Naive Bayes, the Centroid method performs better
Content-based Approach • Spam can be detected • Before reading the message (non-content based): • Based on a special protocol, e.g. a VoIP protocol • Based on the address book: build an email network • Based on IP address • ... • After processing the content of the email (content based)
Content-based Approach • Non-content based approach • Removes spam messages that contain viruses or worms before they are read • Leaves some messages unlabeled • Content based method: • Widely used • May need many pre-labeled messages • Labels a message based on its content • Zdziarski said that it is possible to stop spam, and that content-based filters are the way to do it • This presentation focuses on the content based method
Content-based Methods • Bayesian based method • Centroid-based method • Machine learning method • Latent Semantic Indexing (LSI) • Contextual Network Graphs (CNG) • Rule based method • RIPPER rules: a list of predefined rules that can be changed by hand • Memory based method • Saves cost
Measurement • Accuracy: the percentage of messages classified correctly, correct / (correct + incorrect) • False positive: a message that is not spam but is misclassified as spam • Goal: • Improve accuracy • Prevent false positives
[Figure: grid of spam and non-spam messages against correct and incorrect classifications, with the false-positive cell marked]
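A minimal sketch of these two measurements is given below. It counts a false positive as a legitimate message that the filter labels as spam, which is the usual convention when spam is the positive class; the label strings and the example data are hypothetical.

```python
def accuracy_and_false_positives(true_labels, predicted_labels):
    """Return (accuracy, number of legitimate messages labeled as spam)."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == p)
    false_positives = sum(
        1 for t, p in pairs if t == "not spam" and p == "spam"
    )
    return correct / len(pairs), false_positives


# Hypothetical labels for five messages.
truth = ["spam", "spam", "not spam", "not spam", "not spam"]
guess = ["spam", "not spam", "spam", "not spam", "not spam"]

acc, fp = accuracy_and_false_positives(truth, guess)
print(acc, fp)
```

The two numbers are reported separately because a filter can have high accuracy and still lose legitimate mail, which is usually the more costly kind of error.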