
Spam Clustering using Wave Oriented K-Means

Claudiu MUSAT, Ionut GRIGORESCU, Carmen MITRICA, Alexandru TRIFAN

You’ll be hearing quite a lot about…
  • Spam signatures
    • Previous approaches
    • Spam Features
  • Clustering
    • K-Means
    • K-Medoids
    • Stream clustering
  • Constraints
But the essence is…

"A nation that forgets its past is doomed to repeat it."

Winston Churchill


Spam signatures

  • Strong relation with dentistry
  • Necessary evil?
  • Last resort
Spam signatures (2)
  • The most annoying problem is that they are labor-intensive
  • An extension of filtering email by hand
  • More automation is badly needed to make signatures work
Spam features
  • The ki of the spam business
  • Its DNA
  • Everything and yet nothing
  • Anything that has a constant value in a given spam wave
Email Layout
  • We noticed that, although spammers tend to change everything in an email to conceal the fact that it’s actually spam, they tend to preserve a certain layout.
  • We encoded the layout of a message in a string of tokens such as 141L2211.
  • This later evolved into a message summary such as BWWWLWWNWWE
  • To this day, message layout is the most effective feature
  • We also use variations of this feature for the MIME parts, for the paragraph contents and so on.
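The layout encoding can be sketched as follows. The deck does not define its token alphabets (141L2211, BWWWLWWNWWE), so the letter meanings below (B = blank line, L = line with a link, N = mostly numeric line, W = ordinary word line) are illustrative assumptions, not the actual encoding:

```python
import re

def encode_layout(body: str) -> str:
    """Encode each line of an email body as one token (hypothetical alphabet):
    B = blank, L = contains a URL, N = mostly digits, W = ordinary text."""
    tokens = []
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            tokens.append("B")
        elif re.search(r"https?://", stripped):
            tokens.append("L")
        elif sum(c.isdigit() for c in stripped) > len(stripped) / 2:
            tokens.append("N")
        else:
            tokens.append("W")
    return "".join(tokens)
```

Two spam waves that rewrite their wording but keep the same line structure would then produce identical layout strings, which is what makes the feature resilient.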
Other Spam Features - headers
  • Subject length, the number of separators, the maximum length of any word
  • The number of Received fields (in hindsight, not our best choice)
  • Whether it had a name in the from field
  • A quite nice example is the stripped date format
    • Take the date field
    • Strip it of all alpha-numeric characters
    • Store what’s left
    • “ ,    :: - ()” or “,    :: +” or “,    :: + ”
  • Any more suggestions?
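The stripped date format is simple enough to sketch exactly as the bullets describe (the sample Date: value is an assumption; the output matches one of the fingerprints quoted above):

```python
def stripped_date_format(date_field: str) -> str:
    """Strip every alphanumeric character from a Date: header, keeping only
    punctuation and whitespace as a per-spam-wave fingerprint."""
    return "".join(c for c in date_field if not c.isalnum())
```

For example, `stripped_date_format("Mon, 12 Mar 2007 10:45:08 +0200")` yields `",    :: +"`: the literal date changes with every message, but the punctuation skeleton stays constant across a wave.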
Other Spam Features – body
  • Its length; the number of lines; whether it has long paragraphs or not; the number of consecutive blank lines;
    • Basically any part of the email layout that we felt was more important than the average
  • The number of links/email addresses/phone numbers
  • Bayes poison
  • Attachments
  • Etc.
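A few of the body features above can be sketched as a small extractor (the feature names and the link regex are assumptions for illustration):

```python
import re

def body_features(body: str) -> dict:
    """Extract some of the body features listed above: total length,
    line count, number of links, longest run of consecutive blank lines."""
    lines = body.splitlines()
    longest_blank = run = 0
    for line in lines:
        run = run + 1 if not line.strip() else 0
        longest_blank = max(longest_blank, run)
    return {
        "length": len(body),
        "num_lines": len(lines),
        "num_links": len(re.findall(r"https?://\S+", body)),
        "max_blank_run": longest_blank,
    }
```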
Combining features (1)
  • One stick is easy to break
  • The Roman fasces symbolized power and authority
  • The symbol of strength through unity from the Roman Empire to the U.S.
  • The most obvious problem – our sticks are different.
    • Strings, integers, bools
    • I’ll stress this later

fasces lictoriae (bundles of the lictors)

Combining features (2)
  • If it’s an A and at the same time a B, then it’s spam
  • The idea of combining features never died out
  • Started with its relaxed form – adding scores
    • if it has “Viagra” in it – increase its spam score by 10%.
  • Evolution came naturally

National Guard Bureau insignia

Why cluster spam?
  • A “well, duh” kind of slide
  • To extract the patterns we want
    • How do we combine spam traits to get a reliable spam pattern?
    • And which are the traits that matter most?
  • Agglomerative clustering is just one of many options
    • Neural Networks
    • ARTMap worked beautifully on separating ham from spam
So why agglomerative?
  • Because the problem stated before is wrong
  • We don’t just want spam patterns.
    • We want patterns for that spam wave alone
  • Most neural nets make a binary decision. We want a plurality of classes.
  • Still, there are other options, like SVMs.
    • They don’t handle clustering of strings well
    • We want something that accepts just about any feature, as long as you can compute a distance
K-means and K-medoids
  • So we chose the simplest of methods – the widely popular K-Means
    • In a given feature space each item to be classified is a point.
    • The distance between the points indicates the resemblance of the original items.
    • From a given set of instances to be clustered, it creates k classes based on their similarity
  • For spaces where the mean of two points cannot be computed, there is a variant of k-means: k-medoids.
    • This actually solves the different stick problem
    • As usual by solving a problem we introduce a whole range of others.
  • Combining them
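The "different sticks" problem can be made concrete: k-medoids only needs a pairwise distance, so mixed strings, integers and booleans all work as long as each feature type contributes a number. The per-feature distances below are illustrative assumptions, not the paper's actual metrics:

```python
def feature_distance(x, y):
    """Hypothetical per-feature distance; any distance per type will do."""
    if isinstance(x, bool):                      # booleans: match or not
        return 0.0 if x == y else 1.0
    if isinstance(x, (int, float)):              # numbers: normalized difference
        return abs(x - y) / (1 + abs(x) + abs(y))
    n = max(len(x), len(y))                      # strings: positional mismatch rate
    return sum(a != b for a, b in zip(x.ljust(n), y.ljust(n))) / n if n else 0.0

def distance(msg_a, msg_b):
    """Average per-feature distance between two messages (same feature set)."""
    return sum(feature_distance(msg_a[f], msg_b[f]) for f in msg_a) / len(msg_a)

def medoid(cluster):
    """The medoid is the member minimizing total distance to all others --
    no mean of two strings is ever needed."""
    return min(cluster, key=lambda m: sum(distance(m, o) for o in cluster))
```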
An Example
  • Is it a line or a square?
  • What about string features?
Our old model
  • Focus mainly on correctly defining some powerful spam features
  • We totally neglected the clustering part
    • So we used the good old-fashioned k-means and k-medoids.
    • And they have serious drawbacks
    • A fixed number of classes.
    • Work only with an offline corpus
  • The results were... unpredictable.
  • Luck played a major role.
WOKM – Wave oriented K-Means
  • By using the simple k-means we could only cluster individual sets of emails
  • We now needed to cluster the whole incoming stream of spam
  • We also want to store a history of the clusters we extract
    • And use that information to detect spam on the user side.
    • And also to help us better classify in the future
      • Remember Churchill?
WOKM – How does it work?
  • Takes snapshots of the incoming spam stream
  • Takes in only what is new
  • Trains itself on those messages
  • Stores the clusters for future reference
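The four steps above can be sketched on toy one-dimensional "messages" (real messages would use the mixed-feature distance; the threshold and the single-pass grouping are assumptions, not the paper's exact procedure):

```python
def wokm_step(history, snapshot, threshold):
    """One WOKM snapshot of the spam stream (toy 1-D sketch):
    1. novelty detection: keep only messages far from every stored cluster
    2. cluster what is new (naive single-pass agglomeration)
    3. append the new cluster centers to the history for future snapshots."""
    new = [m for m in snapshot
           if all(abs(m - c) > threshold for c in history)]
    clusters = []                              # list of (center, members)
    for m in new:
        for i, (center, members) in enumerate(clusters):
            if abs(m - center) <= threshold:
                members.append(m)
                clusters[i] = (sum(members) / len(members), members)
                break
        else:                                  # nothing close enough: new cluster
            clusters.append((m, [m]))
    history.extend(center for center, _ in clusters)
    return clusters
```

The key property is that each snapshot only trains on what the history does not already explain, so old spam waves are never re-clustered.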
The spam corpus
  • All the changes originate here
    • All messages have an associated distance
    • The distance from them to the closest stored cluster in the cluster history
  • A message joins a new cluster only if that cluster is closer than any stored one
  • Constrained K-Means
    • Wagstaff & Cardie, 2001
    • “must fit” or “must not fit”
    • A history constraint
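A minimal sketch of the history constraint (function and parameter names are assumptions): assignment to a candidate cluster is vetoed whenever some stored cluster is at least as close, which is the "must not fit" side of the constraint:

```python
def constrained_assign(msg, candidate_centers, history_centers, dist):
    """Assign `msg` to its nearest candidate cluster only if that beats
    every cluster already stored in the history; otherwise return None
    (the message is explained by an old wave, not a new one)."""
    best = min(candidate_centers, key=lambda c: dist(msg, c))
    if any(dist(msg, h) <= dist(msg, best) for h in history_centers):
        return None
    return best
```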
The training phase
  • While a solution has not been found:
    • Unassign all the given examples
    • Assign all examples
      • Create a given number of clusters
      • Assign what you can
      • Create some more and repeat the process
    • Recompute centers
    • Merge adjacent (similar) clusters
      • Counters the cluster inflation brought by the assign phase
    • Test solution
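The merge step that counters cluster inflation can be sketched in one dimension (the merge threshold and averaging rule are assumptions):

```python
def merge_similar(centers, merge_threshold):
    """Repeatedly fuse the closest adjacent pair of cluster centers while
    they are nearer than `merge_threshold`, replacing them by their mean.
    Averaging adjacent sorted values keeps the list sorted."""
    centers = sorted(centers)
    merged = True
    while merged and len(centers) > 1:
        merged = False
        for i in range(len(centers) - 1):
            if centers[i + 1] - centers[i] < merge_threshold:
                centers[i:i + 2] = [(centers[i] + centers[i + 1]) / 2]
                merged = True
                break
    return centers
```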
What’s worth remembering
  • Accepts just about any kind of feature – Booleans, integers and strings.
  • K-means is limited because you have to know the number of classes a priori.
    • WOKM determines the optimum number of classes automatically
  • New messages will not be assigned to clusters that are not considered close enough
  • Has a fast novelty detection phase, so it can train itself only with new spam.
  • Can use the triangle inequality to speed things up.
  • (Future work) Allows us to keep track of the changes spammers make in the design of their products.
    • By watching clusters that are close to each other
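The triangle-inequality speed-up works because d(x, c_j) >= d(c_i, c_j) - d(x, c_i): if d(c_i, c_j) >= 2 * d(x, c_i), then c_j cannot be closer to x than c_i, so d(x, c_j) need not be computed. A sketch (in a real implementation the center-to-center distances would be precomputed once per iteration rather than inline):

```python
def nearest_center_with_pruning(x, centers, dist):
    """Find the nearest center, skipping any center c that the triangle
    inequality proves cannot beat the current best. Returns (center, skipped)."""
    best, best_d = centers[0], dist(x, centers[0])
    skipped = 0
    for c in centers[1:]:
        if dist(best, c) >= 2 * best_d:   # pruning test: c cannot win
            skipped += 1
            continue
        d = dist(x, c)
        if d < best_d:
            best, best_d = c, d
    return best, skipped
```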
Results
  • Perhaps the most exciting result – the cross-language spam clusters
Results(2)
  • Then in Spanish
  • We were surprised to find that this is not an isolated case: YouTube, Microsoft and Facebook fraud attempts were also found in multiple languages
Results(3)
  • Then again in French (a different one, though)