
Spam Clustering using Wave Oriented K Means

Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN


You’ll be hearing quite a lot about…

  • Spam signatures

    • Previous approaches

    • Spam Features

  • Clustering

    • K-Means

    • K-Medoids

    • Stream clustering

  • Constraints




But the essence is…

"A nation that forgets its past is doomed to repeat it."

Winston Churchill



Spam signatures

  • Strong relation with dentistry

  • Necessary Evil?

  • Last resort


Spam signatures (2)

  • The most annoying problem is that they are labor intensive

  • An extension of filtering email by hand

  • More automation is badly needed to make signatures work


Spam features

  • The ki of the spam business

  • Its DNA

  • Everything and yet nothing

  • Anything that has a constant value in a given spam wave


Email Layout

  • We noticed that although spammers tend to change everything in an email to conceal the fact that it’s actually spam, they tend to preserve a certain layout.

  • We encoded the layout of a message in a string of tokens such as 141L2211.

  • This later evolved into a message summary such as BWWWLWWNWWE (a rough sketch follows this list)

  • To this day, message layout is the most effective feature

  • We also use variations of this feature for the MIME parts, for the paragraph contents and so on.
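
A minimal sketch of how such a layout summary could be computed. The token letters below (B for a blank line, L for a line containing a link, N for a line made only of digits and punctuation, W for an ordinary word line, E for the end of the message) are assumptions chosen to match the BWWWLWWNWWE example, not the exact encoding used in the real system.

    import re

    def layout_summary(body: str) -> str:
        """Map every line of an email body to a one-letter token and
        concatenate the tokens into a layout summary string."""
        tokens = []
        for line in body.splitlines():
            stripped = line.strip()
            if not stripped:
                tokens.append("B")                         # blank line
            elif re.search(r"https?://|www\.", stripped):
                tokens.append("L")                         # line containing a link
            elif re.fullmatch(r"[\d\s.,:+()-]+", stripped):
                tokens.append("N")                         # digits/punctuation only
            else:
                tokens.append("W")                         # ordinary word line
        return "".join(tokens) + "E"                       # E marks the end of the message

    # layout_summary("Hello\n\nBuy now: http://example.com\n123 456") -> "WBLNE"

Two messages built from the same template then share the same summary string even when the visible words change.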


Other Spam Features - headers

  • Subject length, the number of separators, the maximum length of any word

  • The number of Received fields (it turned out we were drunk and high when we chose this one)

  • Whether it had a name in the From field

  • A quite nice example is the stripped date format (sketched after this list)

    • Take the date field

    • Strip it of all alpha-numeric characters

    • Store what’s left

    • “ ,    :: - ()” or “,    :: +” or “,    :: + ”

  • Any more suggestions?
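
A sketch of the stripped date feature described above, assuming an RFC 2822 style Date header:

    import re

    def stripped_date(date_header: str) -> str:
        """Remove every alphanumeric character from the Date header and keep
        only the punctuation/whitespace skeleton as the feature value."""
        return re.sub(r"[A-Za-z0-9]", "", date_header)

    # stripped_date("Mon, 12 Jan 2009 10:23:45 -0000 (GMT)") -> ",    :: - ()"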


Other Spam Features – body

  • Its length; the number of lines; whether it has long paragraphs; the number of consecutive blank lines

    • Basically any part of the email layout that we felt was more important than the average

  • The number of links/email addresses/phone numbers (a simple counting sketch follows this list)

  • Bayes poison

  • Attachments

  • Etc.
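
As an illustration of the counters mentioned above, a rough sketch using simple regular expressions; the patterns are assumptions for illustration, not the ones used in production.

    import re

    LINK_RE  = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
    EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
    PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def body_counters(body: str) -> dict:
        """Count link, email-address and phone-number-like occurrences in the body."""
        return {
            "links":  len(LINK_RE.findall(body)),
            "emails": len(EMAIL_RE.findall(body)),
            "phones": len(PHONE_RE.findall(body)),
        }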


Combining features (1)

  • One stick is easy to break

  • The Roman fasces symbolized power and authority

  • The symbol of strength through unity from the Roman Empire to the U.S.

  • The most obvious problem – our sticks are different.

    • Strings, integers, bools

    • I’ll stress this later

fasces lictoriae (bundles of the lictors)


Combining features (2)

  • If it’s an A and at the same time a B, then it’s spam

  • The idea of combining features never died out

  • It started with its relaxed form – adding scores (a toy sketch follows below)

    • if it has “Viagra” in it – increase its spam score by 10%.

  • Evolution came naturally

National Guard Bureau insignia
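
A toy sketch of the relaxed, score-adding form: independent keyword rules each add a fixed amount to the spam score. The phrases and weights are invented here purely for illustration.

    def spam_score(body: str) -> float:
        """Accumulate a spam score from independent keyword rules."""
        rules = {"viagra": 0.10, "free money": 0.15, "click here": 0.05}  # illustrative weights
        lowered = body.lower()
        return sum(weight for phrase, weight in rules.items() if phrase in lowered)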


Why cluster spam?

  • A “well doh” kind of slide

  • To extract the patterns we want

    • How do we combine spam traits to get a reliable spam pattern?

    • And which are the traits that matter most?

  • Agglomerative clustering is just one of many options

    • Neural Networks

    • ARTMap worked beautifully on separating ham from spam


So why agglomerative?

  • Because the problem stated before is wrong

  • We don’t just want spam patterns.

    • We want patterns for that spam wave alone

  • Most neural nets make a binary decision. We want a plurality of classes.

  • Still, there are other options, like SVMs.

    • They don’t handle clustering strings well

    • We want something that accepts just about any feature as long as you can compute a distance
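
A minimal sketch of what “any feature, as long as you can compute a distance” could mean in practice: one distance per feature type, each normalised to [0, 1]. The specific choices (exact match for Booleans, relative difference for integers, normalised edit distance for strings, a plain average to combine them) and the feature names in message_distance are assumptions for illustration.

    def bool_distance(a: bool, b: bool) -> float:
        return 0.0 if a == b else 1.0

    def int_distance(a: int, b: int) -> float:
        # relative difference, capped at 1
        return min(abs(a - b) / (max(abs(a), abs(b)) or 1), 1.0)

    def string_distance(a: str, b: str) -> float:
        # normalised Levenshtein edit distance
        if not a and not b:
            return 0.0
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1] / max(len(a), len(b))

    def message_distance(x: dict, y: dict) -> float:
        """Average the per-feature distances between two feature dictionaries."""
        parts = [
            string_distance(x["layout"], y["layout"]),           # e.g. BWWWLWWNWWE
            int_distance(x["subject_len"], y["subject_len"]),
            bool_distance(x["has_from_name"], y["has_from_name"]),
        ]
        return sum(parts) / len(parts)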


K-means and K-medoids

  • So we chose the simplest of methods – the widely popular K-Means

    • In a given feature space each item to be classified is a point.

    • The distance between the points indicates the resemblance of the original items.

    • From a given set of instances to be clustered, it creates k classes based on their similarity

  • For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids (sketched after this list).

    • This actually solves the different stick problem

    • As usual by solving a problem we introduce a whole range of others.

  • Combining them
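
A bare-bones k-medoids sketch: assign every message to its nearest medoid, then pick as the new medoid the member that minimises the total in-cluster distance. The distance argument can be any callable, for instance a per-feature distance like the one sketched earlier; this is an illustration, not the production implementation.

    import random

    def k_medoids(items, k, distance, iterations=20, seed=0):
        """Cluster items around k medoids using an arbitrary distance function."""
        rng = random.Random(seed)
        medoids = rng.sample(items, k)
        clusters = {}
        for _ in range(iterations):
            # assignment step: each item goes to its nearest medoid
            clusters = {i: [] for i in range(k)}
            for item in items:
                nearest = min(range(k), key=lambda i: distance(item, medoids[i]))
                clusters[nearest].append(item)
            # update step: the new medoid minimises the total distance within its cluster
            new_medoids = []
            for i in range(k):
                members = clusters[i] or [medoids[i]]
                best = min(members, key=lambda m: sum(distance(m, o) for o in members))
                new_medoids.append(best)
            if new_medoids == medoids:
                break
            medoids = new_medoids
        return medoids, clusters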


An Example

  • Is it a line or a square?

  • What about string features?


Our old model

  • We focused mainly on correctly defining some powerful spam features

  • We totally neglected the clustering part

    • So we used the good old-fashioned k-means and k-medoids.

    • And they have serious drawbacks:

      • A fixed number of classes

      • They work only with an offline corpus

  • The results were... Unpredictable.

  • Luck played a major role.


WOKM – Wave oriented K-Means

  • By using the simple k-means we could only cluster individual sets of emails

  • We now needed to cluster the whole incoming stream of spam

  • We also want to store a history of the clusters we extract

    • And use that information to detect spam on the user side.

    • And also to help us better classify in the future

      • Remember Churchill?


WOKM – How does it work?

  • Takes snapshots of the incoming spam stream

  • Takes in only what is new

  • Trains on those messages

  • Stores the clusters for future reference
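
A high-level sketch of those four steps; the callables novelty, cluster and store are hypothetical placeholders for the novelty-detection, training and history-storage stages named on this slide.

    def process_stream(spam_stream, history, novelty, cluster, store):
        """One pass over the incoming spam stream, one snapshot at a time."""
        for snapshot in spam_stream:                              # take snapshots of the stream
            fresh = [m for m in snapshot if novelty(m, history)]  # keep only what is new
            if fresh:
                new_clusters = cluster(fresh)                     # train on the new messages
                store(new_clusters, history)                      # keep them for future reference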


The spam corpus

  • All the changes originate here

    • All messages have an associated distance

    • The distance from them to the closest stored cluster in the cluster history

  • New clusters must be closer than old ones

  • Constrained K-Means

    • Wagstaff & Cardie, 2001

    • “must fit” or “must not fit”

    • A history constraint
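
A minimal sketch of the history constraint: a message only counts as fitting a candidate cluster if that cluster is closer than anything already stored in the cluster history. The attribute name center and the function names are assumptions for illustration.

    def history_distance(msg, history, distance):
        """Distance from a message to the closest cluster stored in the history."""
        return min((distance(msg, c.center) for c in history), default=float("inf"))

    def may_assign(msg, candidate_center, history, distance):
        """'Must fit'-style constraint: the new cluster has to beat the old ones."""
        return distance(msg, candidate_center) < history_distance(msg, history, distance)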


The training phase

  • While a solution has not been found:

    • Unassign all the given examples

    • Assign all examples

      • Create a given number of clusters

      • Assign what you can

      • Create some more and repeat the process

    • Recompute centers

    • Merge adjacent (similar) clusters

      • Counters the cluster inflation brought by the assign phase

    • Test solution
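
Read as a rough sketch rather than the actual implementation, the loop above might look like this; the helpers assign_or_create, recompute_centers, merge_similar and acceptable are hypothetical names for the steps listed on the slide.

    def train(examples, history, max_rounds=10):
        """Outer WOKM-style training loop: assign, recompute, merge, test, repeat."""
        # assign_or_create, recompute_centers, merge_similar and acceptable are
        # placeholders for the steps described on this slide.
        clusters = []
        for _ in range(max_rounds):
            for c in clusters:
                c.members.clear()                                    # unassign all examples
            clusters = assign_or_create(examples, clusters, history) # assign, creating clusters as needed
            recompute_centers(clusters)
            clusters = merge_similar(clusters)                       # counters the cluster inflation from assignment
            if acceptable(clusters, examples):                       # test the solution
                return clusters
        return clusters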


What’s worth remembering

  • Accepts just about any kind of feature – Booleans, integers and strings.

  • K-means is limited because you have to know the number of classes a priori.

    • WOKM determines the optimum number of classes automatically

  • New messages will not be assigned to clusters that are not considered close enough

  • Has a fast novelty detection phase, so it can train itself only with new spam.

  • Can use the triangle inequality to speed things up (a short sketch follows this list).

  • (Future work) Allows us to keep track of the changes spammers make in the design of their products.

    • By watching clusters that are close to each other
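
A short sketch of the triangle-inequality shortcut: if the distance between the current best centre and a candidate centre is at least twice the distance from the message to the current best, the candidate cannot be closer, so its distance never has to be computed. The helper center_dist (assumed to return precomputed centre-to-centre distances) is hypothetical; this is the standard argument shown only as an illustration.

    def nearest_center(msg, centers, distance, center_dist):
        """Find the closest centre, skipping candidates that the triangle
        inequality rules out: d(msg, c) >= d(best, c) - d(msg, best)."""
        best, best_d = centers[0], distance(msg, centers[0])
        for c in centers[1:]:
            if center_dist(best, c) >= 2 * best_d:  # then d(msg, c) >= best_d, so c cannot win
                continue
            d = distance(msg, c)
            if d < best_d:
                best, best_d = c, d
        return best, best_d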


Results

  • Perhaps the most exciting result – the cross-language spam clusters


Results (2)

  • Then in Spanish

  • We were surprised to find that this is not an isolated case. YouTube, Microsoft and Facebook fraud attempts were also found in multiple languages


Results (3)

  • Then again in French (a different one, though)




