Improving spam detection based on structural similarity

Improving Spam Detection Based on Structural Similarity. By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida, Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida. Presented at Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005. Presented by Jared Bott.


Improving Spam Detection Based on Structural Similarity

By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida,

Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida

Presented at Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005

Presented by Jared Bott


Outline

  • Overview

  • Concepts

  • Detecting Spam

  • Experimental Results

  • Analysis of Paper


Overview

  • New algorithm to detect spam messages

  • Uses email information that is harder to change

  • Works in conjunction with another spam classifier

    • E.g., SpamAssassin

  • Fewer false positives than the methods it was compared against


Spam Detection Problem

  • Spam detection algorithms use some part of emails to determine if a message is spam

    • Spammers change messages so that they do not meet detection criteria for spam

    • It is very easy to change a spam message's content, usernames, domains, subjects, etc.


Key Idea

  • The lists that spammers and legitimate users send messages to and from can be used as the identifiers of classes of email traffic.

    • The lists of addresses spammers send to are unlikely to be similar to those of legitimate users.

    • Lists don’t change that often


Using Lists

  • A user need not be a single email address; it can also be a domain, etc.

  • Represent each email user as a vector in a multi-dimensional conceptual space whose dimensions are all possible contacts

    • Each sender and each recipient has their own vector

  • Model relationship between senders and recipients


Constructing Vectors

  • If there is at least one email sent from sender s_i to recipient r_n, then the nth dimension of s_i’s vector is 1. Otherwise, that value is 0.

  • If there is at least one email received by recipient r_i from sender s_n, then the nth dimension of r_i’s vector is 1. Otherwise, that value is 0.
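
The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `(sender, recipient)` log format, the function name `build_contact_vectors`, and the example addresses are all assumptions. Each 0/1 vector is stored sparsely as the set of contacts whose dimension is 1.

```python
from collections import defaultdict

def build_contact_vectors(log):
    """Return (sender_vectors, recipient_vectors) as dicts of contact sets.

    `log` is an iterable of (sender, recipient) pairs, a hypothetical
    stand-in for a parsed mail log. A contact is in a user's set exactly
    when the corresponding dimension of that user's 0/1 vector is 1.
    """
    sender_vectors = defaultdict(set)
    recipient_vectors = defaultdict(set)
    for sender, recipient in log:
        sender_vectors[sender].add(recipient)      # sender -> recipients
        recipient_vectors[recipient].add(sender)   # recipient -> senders
    return dict(sender_vectors), dict(recipient_vectors)

# Hypothetical example log.
log = [
    ("alice@a.example", "bob@b.example"),
    ("alice@a.example", "bob@b.example"),   # a repeated email still yields a 1
    ("alice@a.example", "carol@c.example"),
    ("spam@x.example", "bob@b.example"),
]
senders, recipients = build_contact_vectors(log)
```

Because the vectors are binary, sending the same recipient many messages does not change the sender's vector; only the existence of at least one message matters.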



Similarity Between Senders

  • Similarity between senders s_i and s_k is the cosine of the angle between their vectors

    • cos(s_i, s_k)

    • 0 means no shared contact

    • 1 means identical contact lists

  • Among legitimate senders, a 1 means that the senders operate in the same social group.

  • Among spammers, a 1 means that the senders use the same address list or are the same person.
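
For 0/1 vectors, the cosine reduces to the number of shared contacts divided by the geometric mean of the two list sizes. The sketch below assumes each vector is stored as a set of contacts; the function name and the sample contact names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine between two binary vectors given as sets of contacts."""
    if not a or not b:
        return 0.0  # an empty contact list shares nothing with anyone
    # Dot product of 0/1 vectors = number of shared contacts;
    # the norm of a 0/1 vector = sqrt(number of contacts).
    return len(a & b) / math.sqrt(len(a) * len(b))

identical = cosine_similarity({"r1", "r2"}, {"r1", "r2"})
disjoint = cosine_similarity({"r1", "r2"}, {"r3", "r4"})
partial = cosine_similarity({"r1", "r2"}, {"r2", "r3"})
```

Here `identical` is 1.0, `disjoint` is 0.0, and `partial` is 0.5 (one shared contact out of two on each side).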


Grouping Users Into Clusters

  • Group users with similar vectors

    • Users with similar vectors are likely to have related roles, i.e. spammer or legitimate user

  • Each cluster is represented by a vector

    • This vector is the sum of all its component users’ vectors
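
A cluster vector built this way is no longer binary: each dimension counts how many members have that contact. A small sketch, assuming members are stored as sets of contacts (the function name is hypothetical, and how users are assigned to clusters is not shown here):

```python
from collections import Counter

def cluster_vector(member_vectors):
    """Sum the members' 0/1 vectors; the result maps contact -> count."""
    total = Counter()
    for contacts in member_vectors:
        total.update(contacts)  # adds 1 for every contact in this member's set
    return total

members = [{"r1", "r2"}, {"r2", "r3"}]
vec = cluster_vector(members)
```

In this example `vec` gives `r2` a weight of 2 because both members contact it, while `r1` and `r3` each get a weight of 1.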


Similarity Between a User and a Cluster

  • Similarity is derived from user to user similarity equation

    • If sender s_i is a member of cluster sc_k, then the similarity is cos(sc_k − s_i, s_i).

    • If sender s_i is not a member of cluster sc_k, then the similarity is cos(sc_k, s_i).

  • Similarity between a user and a cluster will change over time

    • Remove the user’s vector from the cluster’s vector when computing similarity and reclassifying a user
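
The two cases can be sketched as follows. This is an illustration under assumptions: vectors are stored as `Counter` objects mapping contact to weight, and the function names are hypothetical. Subtracting the user's own vector first prevents a member from looking similar to its cluster merely because it contributed to it.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine between two sparse vectors stored as contact -> weight maps."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def user_cluster_similarity(user, cluster, is_member):
    """cos(cluster - user, user) for members, cos(cluster, user) otherwise."""
    if is_member:
        rest = Counter(cluster)
        rest.subtract(user)  # remove the user's own contribution
        cluster = rest
    return cosine(cluster, user)

user = Counter({"r1": 1, "r2": 1})
other = Counter({"r1": 1, "r2": 1})
cluster = user + other  # a two-member cluster

as_member = user_cluster_similarity(user, cluster, is_member=True)
alone = user_cluster_similarity(user, Counter(user), is_member=True)
```

In the example, `as_member` is (up to rounding) 1 because the remaining member has the same contacts, while `alone` is 0.0 because a single-member cluster has nothing left once the user is removed.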


Detecting Spam

  • Two probabilities to compute

    • P_s(m) – Probability of an email m being sent by a spammer

    • P_r(m) – Probability of an email m being addressed to users that receive spam


Detecting Spam

  • When an email arrives, classify it using an auxiliary method (e.g., SpamAssassin)

  • Find the cluster (sc) the email’s sender belongs in

    • If many users in the cluster send messages that are classified as spam by auxiliary method, the probability of all the users in that cluster sending spam is high

  • Update sc’s spam probability

  • P_s(m) ← sc’s spam probability
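
The slides do not say how a cluster's spam probability is maintained. One simple possibility, shown here purely as an assumption, is to track the fraction of the cluster's messages that the auxiliary classifier labeled as spam. The class and function names are hypothetical.

```python
class ClusterStats:
    """Running spam statistics for one cluster (assumed bookkeeping)."""

    def __init__(self):
        self.spam = 0
        self.total = 0

    def update(self, auxiliary_says_spam):
        # Record one more message attributed to this cluster.
        self.total += 1
        if auxiliary_says_spam:
            self.spam += 1

    def spam_probability(self):
        # Assumption: probability = fraction of messages labeled spam so far.
        return self.spam / self.total if self.total else 0.0

def sender_probability(sender_cluster, auxiliary_says_spam):
    """Update the sender's cluster, then return it as the sender-side probability."""
    sender_cluster.update(auxiliary_says_spam)
    return sender_cluster.spam_probability()

sc = ClusterStats()
for label in [True, True, True, False]:
    p_s = sender_probability(sc, label)
```

After three messages labeled spam and one labeled legitimate, `p_s` is 0.75 under this assumed update rule.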


Detecting Spam

  • For all recipients of the email, find the cluster (rc) each one belongs to

  • Update the spam probability for each cluster

  • P_r(m) ← P_r(m) + spam probability of each rc

  • P_r(m) ← P_r(m) / number of recipients
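
The recipient-side steps amount to averaging the spam probabilities of the recipients' clusters. The sketch below assumes a hypothetical `cluster_of` mapping from recipient to cluster id and keeps per-cluster `[spam_count, total_count]` pairs; as on the sender side, treating the probability as a running fraction of auxiliary spam labels is an assumption, not something the slides specify.

```python
def recipient_probability(recipients, cluster_of, counts, auxiliary_says_spam):
    """Average the (updated) spam probabilities of the recipients' clusters.

    counts maps cluster id -> [spam_count, total_count] and is updated in place.
    """
    if not recipients:
        return 0.0
    p_r = 0.0
    for recipient in recipients:
        stats = counts[cluster_of[recipient]]
        stats[1] += 1                  # one more message seen by this cluster
        if auxiliary_says_spam:
            stats[0] += 1
        p_r += stats[0] / stats[1]     # accumulate this cluster's probability
    return p_r / len(recipients)       # divide by the number of recipients

cluster_of = {"bob": "rc1", "carol": "rc2"}
counts = {"rc1": [3, 3], "rc2": [0, 1]}
p_r = recipient_probability(["bob", "carol"], cluster_of, counts, True)
```

In this example `rc1` moves to 4/4 and `rc2` to 1/2, so `p_r` is 0.75. Note that in this sketch a cluster shared by several recipients is updated once per recipient; whether that matches the original algorithm is not stated in the slides.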


Detecting Spam

  • Compute a spam rank for the email based upon P_r(m) and P_s(m)

  • If the spam rank is above some threshold (ω), label it as spam

  • If the spam rank is below 1 − ω, label it as legitimate

  • Otherwise, keep the auxiliary method’s classification
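
The decision rule can be sketched as below. The slides do not give the formula that combines the two probabilities into a spam rank, so the plain average used here is an assumption, as are the function name and the default `omega=0.85`. The rule only makes sense for ω above 0.5, so that the two thresholds do not overlap.

```python
def classify(p_s, p_r, auxiliary_says_spam, omega=0.85):
    """Label an email from sender/recipient probabilities and a fallback label."""
    # Assumption: spam rank is the mean of the two probabilities.
    rank = (p_s + p_r) / 2
    if rank > omega:
        return "spam"
    if rank < 1 - omega:
        return "legitimate"
    # In the uncertain middle band, defer to the auxiliary classifier.
    return "spam" if auxiliary_says_spam else "legitimate"

confident_spam = classify(0.9, 0.9, auxiliary_says_spam=False)
confident_ham = classify(0.05, 0.1, auxiliary_says_spam=True)
deferred = classify(0.5, 0.5, auxiliary_says_spam=True)
```

The first two calls show the structural evidence overriding the auxiliary label; the third falls in the middle band and returns the auxiliary label unchanged.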


Experimental Results

  • Tested on a log of eight days of email from a large Brazilian university

  • Tested on a 2.8 GHz Pentium 4 with 512 MB RAM

    • Able to classify 20 messages per second

    • Faster than the average peak message arrival rate



Results

  • False positives were checked manually to determine whether they were actually spam

    • The auxiliary algorithm had more false positives


Strengths

  • Fewer false positives than SpamAssassin

  • Low-cost

  • Works with message information that doesn’t change that much


Weaknesses

  • Needs an additional message classifier, e.g., SpamAssassin

  • Requires manual tuning of the algorithm


Improvements

  • Time correlation of similar addresses

  • Collaborative filtering based upon user feedback