Spam Clustering using Wave Oriented K Means

Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN

You’ll be hearing quite a lot about…

  • Spam signatures

    • Previous approaches

    • Spam Features

  • Clustering

    • K-Means

    • K-Medoids

    • Stream clustering

  • Constraints

And we’ll connect the dots

But the essence is…

"A nation that forgets its past is doomed to repeat it."

Winston Churchill

And finally some result charts

Spam signatures

  • Strong relation with dentistry

  • Necessary evil?

  • Last resort

Spam signatures (2)

  • The most annoying problem is that they are labor-intensive

  • An extension of filtering email by hand

  • More automation is badly needed to make signatures work

Spam features

  • The ki of the spam business

  • Its DNA

  • Everything and yet nothing

  • Anything that has a constant value in a given spam wave

Email Layout

  • We noticed that although spammers change nearly everything in an email to conceal that it is spam, they tend to preserve a certain layout.

  • We encoded the layout of a message in a string of tokens such as 141L2211.

  • This later evolved into a message summary such as BWWWLWWNWWE.

  • To this day, message layout is the most effective feature

  • We also use variations of this feature for the MIME parts, for the paragraph contents and so on.
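
A minimal sketch of how such a layout summary could be computed. The slides do not say what each letter stands for, so the token alphabet here (B = beginning, W = line of words, L = line with a link, N = blank line, E = end) and the function name `layout_summary` are assumptions made for illustration only.

```python
import re

# Hypothetical link detector; the real feature extractor is not described in the slides.
LINK_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def layout_summary(body: str) -> str:
    """Encode each line of a message body as one token, framed by B and E."""
    tokens = ["B"]                      # B: beginning of message (assumed meaning)
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            tokens.append("N")          # N: blank line (assumed meaning)
        elif LINK_RE.search(stripped):
            tokens.append("L")          # L: line containing a link (assumed meaning)
        else:
            tokens.append("W")          # W: line of ordinary words (assumed meaning)
    tokens.append("E")                  # E: end of message (assumed meaning)
    return "".join(tokens)
```

Two messages with entirely different wording but the same arrangement of text lines, links and blank lines get the same summary, which is what makes the layout usable as a wave-level feature.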

Other Spam Features - headers

  • Subject length, the number of separators, the maximum length of any word

  • The number of Received fields (turned out we were drunk and high when we chose this one)

  • Whether it had a name in the From field

  • A quite nice example is the stripped date format

    • Take the date field

    • Strip it of all alphanumeric characters

    • Store what’s left

    • “ ,    :: - ()” or “,    :: +” or “,    :: + ”

  • Any more suggestions?
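
The header features above can be sketched in a few lines. The function names are hypothetical, and the definition of a "separator" (any character that is neither alphanumeric nor whitespace) is an assumption, since the slides do not define it.

```python
import re


def stripped_date_format(date_header: str) -> str:
    """Remove every ASCII letter and digit, keeping punctuation and spacing."""
    return re.sub(r"[A-Za-z0-9]", "", date_header)


def subject_features(subject: str) -> dict:
    """Subject length, separator count and longest word length."""
    words = subject.split()
    return {
        "length": len(subject),
        # Assumption: a separator is anything that is not alphanumeric or whitespace.
        "separators": sum(1 for ch in subject
                          if not ch.isalnum() and not ch.isspace()),
        "max_word_length": max((len(w) for w in words), default=0),
    }
```

For a date such as `Mon, 12 Jan 2009 10:15:30 +0200`, `stripped_date_format` leaves a comma, four spaces, two colons, a space and a plus sign, which is one of the shapes listed above.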

Other Spam Features – body

  • Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines;

    • Basically any part of the email layout that we felt was more important than the average

  • The number of links/email addresses/phone numbers

  • Bayes poison

  • Attachments

  • Etc.
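
A rough sketch of how some of these body features might be extracted. The regular expressions, the `long_paragraph_chars` threshold and the function name `body_features` are all illustrative assumptions, not the extractors used in the original work.

```python
import re

# Deliberately simple patterns, good enough for counting.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def body_features(body: str, long_paragraph_chars: int = 500) -> dict:
    """Compute a handful of layout-related features of a message body."""
    lines = body.splitlines()
    longest_blank_run = run = 0
    for line in lines:
        run = run + 1 if not line.strip() else 0   # length of the current blank run
        longest_blank_run = max(longest_blank_run, run)
    paragraphs = re.split(r"\n\s*\n", body)        # paragraphs are split on blank lines
    return {
        "length": len(body),
        "lines": len(lines),
        "max_consecutive_blank_lines": longest_blank_run,
        "has_long_paragraph": any(len(p) > long_paragraph_chars for p in paragraphs),
        "links": len(LINK_RE.findall(body)),
        "email_addresses": len(EMAIL_RE.findall(body)),
    }
```

The result mixes integers and Booleans, which is exactly the "different sticks" problem that comes up when the features are combined.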

Combining features (1)

  • One stick is easy to break

  • The Roman fasces symbolized power and authority

  • The symbol of strength through unity from the Roman Empire to the U.S.

  • The most obvious problem – our sticks are different.

    • Strings, integers, bools

    • I’ll stress this later

fasces lictoriae (bundles of the lictors)

Combining features (2)

  • If it’s an A and at the same time a B then it’s spam

  • The idea of combining features never died out

  • Started with its relaxed form – adding scores

    • If it has “Viagra” in it, increase its spam score by 10%.

  • Evolution came naturally
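
To make the contrast concrete, here is a small sketch of the strict "A and B" rule next to its relaxed, score-adding form. The feature names and weights are invented for illustration, and the score is kept in whole percentage points on the assumption that "10%" means ten points added to a running score.

```python
def strict_rule(features: dict) -> bool:
    """'If it's an A and at the same time a B then it's spam.'"""
    return bool(features.get("has_viagra") and features.get("has_link"))


# Hypothetical weights, in percentage points.
WEIGHTS = {"has_viagra": 10, "has_link": 5, "no_from_name": 5}


def spam_score(features: dict, weights: dict = WEIGHTS) -> int:
    """Relaxed form: every trait that is present adds its weight to the score."""
    return sum(w for name, w in weights.items() if features.get(name))
```

The strict rule says nothing about a message that shows only one trait, while the additive score still moves; the price is that the weights have to come from somewhere.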

National Guard Bureau insignia

Why cluster spam?

  • A “well doh” kind of slide

  • To extract the patterns we want

    • How do we combine spam traits to get a reliable spam pattern?

    • And which are the traits that matter most?

  • Agglomerative clustering is just one of many options

    • Neural Networks

    • ARTMap worked beautifully on separating ham from spam

So why agglomerative?

  • Because the problem stated before is wrong

  • We don’t just want spam patterns.

    • We want patterns for that spam wave alone

  • Most neural nets make a binary decision. We want a plurality of classes.

  • Still, there are other options, like SVMs.

    • They do not cope well with string features

    • We want something that accepts just about any feature as long as you can compute a distance

K-means and K-medoids

  • So we chose the simplest of methods – the widely popular K-Means

    • In a given feature space each item to be classified is a point.

    • The distance between the points indicates the resemblance of the original items.

    • From a given set of instances to be clustered, it creates k classes based on their similarity

  • For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids.

    • This actually solves the different stick problem

    • As usual by solving a problem we introduce a whole range of others.

  • Combining them
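
A sketch of why k-medoids copes with mixed features: it only needs a distance between two items, never their mean. The per-feature distances below (Levenshtein distance for strings, relative difference for integers, mismatch for Booleans) and the equal weighting are assumptions chosen for illustration; the slides do not specify them.

```python
from typing import Sequence, Union

Feature = Union[bool, int, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def feature_distance(x: Feature, y: Feature) -> float:
    """Dissimilarity in [0, 1] for one feature (the scaling is an assumption)."""
    if isinstance(x, bool):            # test bool first: bool is a subclass of int
        return float(x != y)
    if isinstance(x, int):
        top = max(abs(x), abs(y))
        return abs(x - y) / top if top else 0.0
    longest = max(len(x), len(y))
    return edit_distance(x, y) / longest if longest else 0.0


def distance(a: Sequence[Feature], b: Sequence[Feature]) -> float:
    """Average the per-feature dissimilarities of two items."""
    return sum(feature_distance(x, y) for x, y in zip(a, b)) / len(a)


def medoid(items: Sequence[Sequence[Feature]]) -> Sequence[Feature]:
    """The item with the smallest total distance to all the others."""
    return min(items, key=lambda c: sum(distance(c, o) for o in items))
```

The medoid is always one of the actual items, so no "average string" ever has to be invented. Note that a length-normalized edit distance is a convenient dissimilarity but is not guaranteed to obey the triangle inequality.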

An Example

  • Is it a line or a square?

  • What about string features?

Our old model

  • Focused mainly on correctly defining some powerful spam features

  • We totally neglected the clustering part

    • So we used the good old-fashioned k-means and k-medoids.

    • And they have serious drawbacks

    • A fixed number of classes.

    • Work only with an offline corpus

  • The results were... unpredictable.

  • Luck played a major role.

WOKM – Wave oriented K-Means

  • By using the simple k-means we could only cluster individual sets of emails

  • We now needed to cluster the whole incoming stream of spam

  • We also want to store a history of the clusters we extract

    • And use that information to detect spam on the user side.

    • And also to help us better classify in the future

      • Remember Churchill?

WOKM – How does it work?

  • Takes snapshots of the incoming spam stream

  • Takes in only what is new

  • Trains on those messages

  • Stores the clusters for future reference

The spam corpus

  • All the changes originate here

    • All messages have an associated distance

    • The distance from them to the closest stored cluster in the cluster history

  • New clusters must be closer to a message than the old (stored) ones

  • Constrained K-Means

    • Wagstaff & Cardie, 2001

    • “must fit” or “must not fit”

    • A history constraint
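
One way to read the points above, sketched in code: each message carries its distance to the closest stored cluster, messages already covered by the history are skipped, and a message may join a new cluster only if that cluster is closer than anything stored. The function names and the `threshold` parameter are hypothetical, and `dist` stands for whatever message distance is in use.

```python
from typing import Callable, List, Sequence, Tuple


def history_distance(msg, history: Sequence, dist: Callable) -> float:
    """Distance from msg to the closest stored cluster center (inf if none)."""
    return min((dist(msg, center) for center in history), default=float("inf"))


def select_novel(messages: Sequence, history: Sequence,
                 dist: Callable, threshold: float) -> List[Tuple]:
    """Keep (message, history distance) pairs that no stored cluster covers."""
    novel = []
    for msg in messages:
        d = history_distance(msg, history, dist)
        if d > threshold:              # assumed meaning of "not close enough"
            novel.append((msg, d))
    return novel


def may_join(msg, new_center, hist_d: float, dist: Callable) -> bool:
    """History constraint: a new cluster must be closer than every stored one."""
    return dist(msg, new_center) < hist_d
```

With numbers standing in for messages and `abs(a - b)` as the distance, a history of `[0, 10]` and a threshold of 2 would keep 5 and 30 as novel and drop 1 and 11.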

The training phase

  • While a solution has not been found:

    • Unassign all the given examples

    • Assign all examples

      • Create a given number of clusters

      • Assign what you can

      • Create some more and repeat the process

    • Recompute centers

    • Merge adjacent (similar) clusters

      • Counters the cluster inflation brought by the assign phase

    • Test solution
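
The loop above can be sketched as follows. This is a simplified reading, not the authors' implementation: it seeds one new cluster per example that fits nowhere (rather than creating clusters in batches), uses medoids as centers, and takes "test solution" to mean "the centers stopped changing". The parameters `radius`, `merge_below` and `max_rounds` are hypothetical.

```python
from typing import Callable, List, Sequence


def medoid(members: Sequence, dist: Callable):
    """The member with the smallest total distance to the others."""
    return min(members, key=lambda c: sum(dist(c, o) for o in members))


def train(examples: Sequence, dist: Callable, radius: float,
          merge_below: float, max_rounds: int = 20) -> List[dict]:
    """Return clusters as dicts with a 'center' and a list of 'members'."""
    centers: list = []
    merged: List[dict] = []
    for _ in range(max_rounds):
        # Unassign everything, then assign what fits and seed clusters for the rest.
        clusters = [{"center": c, "members": []} for c in centers]
        for x in examples:
            best = min(clusters, key=lambda cl: dist(x, cl["center"]), default=None)
            if best is None or dist(x, best["center"]) > radius:
                best = {"center": x, "members": []}
                clusters.append(best)
            best["members"].append(x)
        clusters = [cl for cl in clusters if cl["members"]]
        # Recompute centers.
        for cl in clusters:
            cl["center"] = medoid(cl["members"], dist)
        # Merge similar clusters to counter the inflation from the assign phase.
        merged = []
        for cl in clusters:
            target = next((m for m in merged
                           if dist(m["center"], cl["center"]) < merge_below), None)
            if target is None:
                merged.append(cl)
            else:
                target["members"] += cl["members"]
                target["center"] = medoid(target["members"], dist)
        # Test the solution: stop once the centers no longer change (assumed criterion).
        new_centers = [m["center"] for m in merged]
        if new_centers == centers:
            break
        centers = new_centers
    return merged
```

Because clusters are created on demand and merged afterwards, the number of classes falls out of the data and the two thresholds instead of being fixed in advance.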

What’s worth remembering

  • Accepts just about any kind of feature – Booleans, integers and strings.

  • K-means is limited because you have to know the number of classes a priori.

    • WOKM determines the optimum number of classes automatically

  • New messages will not be assigned to clusters that are not considered close enough

  • Has a fast novelty detection phase, so it can train itself only with new spam.

  • Can use the triangle inequality to speed things up.

  • (Future work) Allows us to keep track of the changes spammers make in the design of their products.

    • By watching clusters that are close to each other
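
The slides do not say how the triangle inequality is used. One well-known pruning rule, shown here as an assumption about what is meant, is that a center lying at least twice as far from the current best center as the message does cannot be closer to the message. The function name and the precomputed `center_dist` matrix are hypothetical, and `dist` must be a true metric for the shortcut to be safe.

```python
from typing import Callable, Sequence, Tuple


def nearest_center(x, centers: Sequence, dist: Callable,
                   center_dist: Sequence[Sequence[float]]) -> Tuple[int, float, int]:
    """Find the nearest center, skipping those ruled out by the triangle inequality.

    center_dist[i][j] must equal dist(centers[i], centers[j]).
    Returns (index of nearest center, its distance, number of distances evaluated).
    """
    best, best_d, evaluated = 0, dist(x, centers[0]), 1
    for j in range(1, len(centers)):
        # d(best, j) <= d(best, x) + d(x, j), hence d(x, j) >= d(best, j) - best_d.
        # If d(best, j) >= 2 * best_d, then d(x, j) >= best_d: center j cannot win.
        if center_dist[best][j] >= 2 * best_d:
            continue
        d = dist(x, centers[j])
        evaluated += 1
        if d < best_d:
            best, best_d = j, d
    return best, best_d, evaluated
```

The saving matters most when a single distance is expensive, as with edit distances over long strings, because the center-to-center distances are computed once and reused for every message.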


  • Perhaps the most exciting results – the cross-language spam clusters

  • Then in Spanish

  • We were surprised to find that this is not an isolated case. YouTube, Microsoft and Facebook fraud attempts were also found in multiple languages.

  • Then again in French (different though)

And finally the promised charts

And finally the promised charts (2)

Thank you !

