Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers

William Yerazunis [1], Fidelis Assis [2], Christian Siefkes [3], Shalendra Chhabra [1,4]

[1] Mitsubishi Electric Research Labs, Cambridge, MA

[2] Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil

[3] Database and Information Systems Group, Freie Universität Berlin; Berlin-Brandenburg Graduate School in Distributed Information Systems

[4] Computer Science and Engineering, University of California, Riverside, CA


KNN and Hyperspace Spam Sorting

Bayesian is Great. Why Worry?
  • Bayesian (like most typical spam filters) is only a linear classifier
    • Consider the “checkerboard” problem
  • Markovian classifiers require the nonlinear features to be textually “near” each other
    • We can’t be sure of that; spammers are clever
  • Winnow is just a different weighting plus a different chain rule
  • KNNs are a very different kind of classifier
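The “checkerboard” problem can be made concrete in a few lines. This is an illustrative sketch (not from the deck): the four corners of a 2x2 checkerboard cannot be split by any linear boundary, yet a 1-nearest-neighbor rule handles them trivially.

```python
# 2x2 checkerboard (XOR): no linear rule sign(a*x + b*y + c) labels all
# four corners correctly, but 1-nearest-neighbor carves out the nonlinear
# decision surface with no trouble.
points = [((0, 0), "ham"), ((1, 1), "ham"), ((0, 1), "spam"), ((1, 0), "spam")]

def nn_label(q):
    """Classify q by its single nearest known point (squared Euclidean)."""
    nearest = min(points, key=lambda p: (p[0][0] - q[0]) ** 2 + (p[0][1] - q[1]) ** 2)
    return nearest[1]

print(nn_label((0.1, 0.1)))  # near the (0, 0) corner -> ham
print(nn_label((0.9, 0.1)))  # near the (1, 0) corner -> spam
```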
Nonlinear Decision Surfaces and KNN / Hyperspace

Nonlinear decision surfaces require tremendous amounts of data.
KNNs have been around
  • Earliest found reference:
    • E. Fix and J. Hodges, Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties
  • In 1951!
  • Interesting theorem: Cover and Hart (1967)
    • A KNN’s error rate is within a factor of 2 of the optimal Bayes classifier’s
KNNs in one slide!
  • Start with a bunch of known things and one unknown thing.
  • Find the K known things most similar to the unknown thing.
  • Count how many of the K known things are in each class.
  • The unknown thing is of the same class as the majority of the K known things.
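The four steps above fit in a few lines of code as well. This is an illustrative sketch; the distance function stands in for whatever similarity measure the filter actually uses over message features.

```python
from collections import Counter

def knn_classify(unknown, known, k, distance):
    """Classify `unknown` by majority vote of its k nearest labeled examples.

    `known` is a list of (example, label) pairs; `distance` is any metric.
    """
    # Find the K known things most similar to the unknown thing.
    nearest = sorted(known, key=lambda pair: distance(unknown, pair[0]))[:k]
    # Count how many of the K known things are in each class.
    votes = Counter(label for _, label in nearest)
    # The unknown thing takes the class of the majority.
    return votes.most_common(1)[0][0]

# Toy 1-D example with absolute difference as the metric:
examples = [(1.0, "ham"), (1.2, "ham"), (0.9, "ham"), (5.0, "spam"), (5.5, "spam")]
print(knn_classify(1.1, examples, k=3, distance=lambda a, b: abs(a - b)))  # -> ham
```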
Issues with Standard KNNs
  • How big is the neighborhood K ?
  • How do you weight your neighbors?
    • Equal-vote?
    • Some falloff in weight?
    • Nearby interaction – the Parzen window?
  • How do you train?
    • Everything? That gets big...
    • And SLOW.
Issues with Standard KNNs
  • How big is the neighborhood?
  • We will test with 3, 7, 21 and |corpus|
  • How do we weight the neighbors?
  • We will try equal-weighting, similarity, Euclidean distance, and combinations thereof.
Issues with Standard KNNs
  • How do we train?
    • To compare with a good Markov classifier we need to use TOE – Train Only Errors
    • This is good in that it really speeds up classification and keeps the database small.
    • This is bad in that it violates the Cover and Hart assumptions, so the quality limit theorem no longer applies
    • BUT – we will train multiple passes to see if an asymptote appears.
Issues with Standard KNNs
  • We found the “bad” KNNs mimic Cover and Hart behavior: they insert basically everything into a bloated database, sometimes more than once!
  • The more accurate KNNs inserted fewer examples into their database.
How do we compare KNNs?
  • Use the TREC 2005 SA dataset.
  • 10-fold validation – train on 90%, test on 10%, repeat for each successive 10% (but remember to clear memory!)
  • Run 5 passes (find the asymptote)
  • Compare it versus the OSB Markovian tested at TREC 2005.
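The 10-fold procedure above can be sketched as follows (illustrative only; `messages` stands for the TREC 2005 SA corpus in its original order):

```python
def ten_fold_splits(messages):
    """Train on 90%, test on the held-out 10%, sliding the held-out block
    through each successive tenth of the corpus."""
    fold = len(messages) // 10
    for i in range(10):
        held_out = messages[i * fold:(i + 1) * fold]
        training = messages[:i * fold] + messages[(i + 1) * fold:]
        # Remember to clear classifier memory: each fold starts fresh.
        yield training, held_out
```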
What do we use as features?
  • Use the OSB feature set. This combines nearby words to make short phrases; the phrases are what are matched.
  • Example “this is an example” yields:
      • “this is”
      • “this <skip> an”
      • “this <skip> <skip> example”
  • These features are the measurements we classify against
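The OSB expansion can be reproduced with a short sketch. The slide lists the features anchored at the first word; a full pass pairs every word with each later word inside the window (assumed here to be 4 tokens), so the three features shown above appear among the output.

```python
def osb_features(text, window=4):
    """Generate OSB-style features: pair each word with every later word
    inside the window, marking skipped positions with <skip>."""
    words = text.split()
    feats = []
    for i in range(len(words)):
        for gap in range(1, window):
            if i + gap < len(words):
                feats.append(" ".join([words[i]] + ["<skip>"] * (gap - 1) + [words[i + gap]]))
    return feats

print(osb_features("this is an example"))
```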
Test 1: Equal Weight Voting, KNN with K = 3, 7, and 21

Asymptotic accuracy: 93%, 93%, and 94%
(good acc: 98%, spam acc 80% for K = 3 and 7;
96% and 90% for K = 21)
Time: ~50-75 milliseconds/message

Test 2: Weight by Hamming-1/2, KNN with K = 7 and 21

Asymptotic accuracy: 94% and 92%
(good acc: 98%, spam acc 85% for K = 7;
98% and 79% for K = 21)
Time: ~60 milliseconds/message
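“Hamming-1/2” is read here as weighting each neighbor’s vote by 1/d and 1/d², where d is the Hamming distance between feature sets — an assumed interpretation, sketched below. With K = |corpus| (as in the next test) the same sum simply runs over every known message.

```python
def weighted_vote(neighbors, power):
    """Sum per-class weights 1 / d**power over the K nearest neighbors.

    Each neighbor is a (label, hamming_distance) pair; power = 1 and 2
    correspond to the two distance weightings tested here."""
    scores = {}
    for label, d in neighbors:
        scores[label] = scores.get(label, 0.0) + 1.0 / max(d, 1) ** power
    return max(scores, key=scores.get)

# One very close spam neighbor can outvote two middling ham neighbors:
print(weighted_vote([("spam", 1), ("ham", 2), ("ham", 2)], power=2))  # -> spam
```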

Test 3: Weight by Hamming-1/2, KNN with K = |corpus|

Asymptotic accuracy: 97.8%
Good accuracy: 98.2%; spam accuracy: 96.9%
Time: 32 msec/message

Test 4: Weight by N-dimensional radiation model (a.k.a. “Hyperspace”)
Test 4: Hyperspace weight, K = |corpus|, d = 1, 2, 3

Asymptotic accuracy: 99.3%
Good accuracy: 99.64%, 99.66%, and 99.59%
Spam accuracy: 98.7%, 98.4%, and 98.5%
Time: 32, 22, and 22 milliseconds/message

Test 5: Compare vs. Markov OSB (thin threshold)

Asymptotic accuracy: 99.1%
Good accuracy: 99.6%; spam accuracy: 97.9%
Time: 31 msec/message

Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)
  • Thick Threshold means:
    • Test it first
    • If it is wrong, train it.
    • If it was right, but only by less than the threshold thickness, train it anyway!
  • 10.0 pR units is roughly the range between 10% and 90% certainty.
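The thick-threshold rule looks roughly like this in code. The `classify`/`learn` interface is hypothetical (a real CRM114 setup drives the classifier differently); pR is the classifier’s log-odds-style confidence margin.

```python
def thick_threshold_train(classifier, message, true_label, thick=10.0):
    """Train on every error, and ALSO on every correct classification whose
    winning margin is thinner than `thick` pR units."""
    predicted, pr_margin = classifier.classify(message)  # hypothetical interface
    if predicted != true_label or pr_margin < thick:
        classifier.learn(message, true_label)

# Minimal stand-in classifier, just to show the rule firing:
class StubClassifier:
    def __init__(self, predicted, pr_margin):
        self.predicted, self.pr_margin = predicted, pr_margin
        self.trained = []
    def classify(self, message):
        return self.predicted, self.pr_margin
    def learn(self, message, label):
        self.trained.append((message, label))

thin = StubClassifier("spam", 5.0)       # right answer, but only by 5 pR
thick_threshold_train(thin, "msg", "spam")
print(len(thin.trained))  # 1 -- trained anyway, the margin was too thin
```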
Test 6: Compare vs. Markov OSB (thick threshold = 10.0 pR)

Asymptotic accuracy: 99.5%
Good accuracy: 99.6%; spam accuracy: 99.3%
Time: 19 msec/message

Conclusions
  • Small-K KNNs are not very good for sorting spam.
  • K=|corpus| KNNs with distance weighting are reasonable.
  • K=|corpus| KNNs with hyperspace weighting are pretty good.
  • But thick-threshold trained Markovs seem to be more accurate, especially in single-pass training.
Thank you! Questions?

Full source is available at

http://crm114.sourceforge.net

(licensed under the GPL)