
SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM

Rohan Malkhare

Committee:

Dr. Eugene Fink

Dr. Dewey Rundus

Dr. Alan Hevner

Outline
  • Introduction
  • Related Work
  • Algorithm
  • Measurements
  • Implementation
  • Future Work
Introduction

Anti-spam efforts

  • Legislation
  • Technology
    • White listing of Email addresses
    • Black Listing of Email addresses/domains
    • Challenge Response mechanisms
    • Content Filtering
      • Learning Techniques
Introduction

Learning techniques for Spam classification

  • Feature Extraction
  • Assignment of weights to individual features representing the predictive strength of a feature
  • Combining weights of extracted features during classification to numerically determine whether mail is spam/legitimate
Introduction

Current algorithms

  • Words or phrases as features
  • Probabilities of occurrence in spam/legitimate collections as weights
  • Bayes' rule or one of its variants for combining weights
Related Work
  • Cohen (1996):
    • RIPPER, Rule Learning System
    • Rules in a human-comprehensible format
  • Pantel & Lin (1998):
    • Naïve-Bayes with words as features
  • Microsoft Research (1998):
    • Naïve-Bayes with the mutual information measure to select features with strongest resolving power
    • Words and domain-specific attributes of spam used as features
Related Work
  • Paul Graham (2002): A Plan for Spam
    • Very popular algorithm credited with starting the craze for Bayesian filters
    • Uses naïve Bayes with words as features
  • Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm
    • Most accurate algorithm to date (over 99.7% accuracy)
    • Distinctive because of its powerful feature extraction technique
    • Uses Bayesian chain rule for combining weights
Related Work
  • CRM114 algorithm Feature Extraction
    • Slide a Window of 5 words over the incoming text
    • Generate order-preserving sub-phrases containing all combinations of windowed words
    • For one window, 2^4 = 16 features are generated
    • Very high computational complexity
    • E.g. “Click here to buy Viagra”
      • Features generated would be “Click”, “Click here”, “Click to”, “Click buy”, “Click Viagra”, “Click here to”, “Click here buy”, etc.
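
The windowing step above can be sketched in Python. This is only an illustration of the slide's example, not CRM114's actual code: it assumes each sub-phrase keeps the first word of the window and any order-preserving subset of the other four, which is what yields 2^4 = 16 features. The function name `window_features` is hypothetical.

```python
from itertools import combinations


def window_features(window):
    """Order-preserving sub-phrases that always keep the window's first word.

    Assumption: the anchor is the first word, as in the slide's example.
    """
    first, rest = window[0], tuple(window[1:])
    features = []
    for size in range(len(rest) + 1):
        # combinations() yields subsets in their original word order
        for combo in combinations(rest, size):
            features.append(" ".join((first,) + combo))
    return features


features = window_features("Click here to buy Viagra".split())
print(len(features))
print(features[:7])
```

Sliding the window one word at a time over a long message repeats this work for every position, which is where the high computational cost comes from.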
Algorithm
  • Feature Extraction
    • Sentences in a message are identified by using the delimiting characters ‘.’, ‘?’, ‘!’, ‘;’, ‘<’, ‘>’
    • All possible word-pairings are formed from the sentences
    • Commonly occurring words are skipped
    • These word-pairings serve as features to be used for classification
Algorithm
  • Feature Extraction (continued….)
    • If the number of words in a sentence exceeds a constant K, each series of K words is treated as a separate sentence
    • Value of K is set to 20
    • E.g. from the sentence “There is a problem in the tables that have been copied to the database”, the features formed would be “problem tables”, “tables problem”, “problem copied”, “copied problem”, “problem database”, “database problem”, etc.
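
A minimal Python sketch of the extraction steps described so far, under stated assumptions: the stop-word list `COMMON_WORDS` is hypothetical (the slides do not list the skipped words), the K-word cap is assumed to apply before common words are dropped, and the function names are illustrative only.

```python
import re
from itertools import permutations

K = 20  # maximum words per sentence, as stated in the slides

# Hypothetical stop-word list: the slides do not say which common words are skipped.
COMMON_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to"}


def split_sentences(text):
    """Split text on the slide's delimiters, then cap each sentence at K words."""
    sentences = []
    for chunk in re.split(r"[.?!;<>]", text):
        words = chunk.split()
        for start in range(0, len(words), K):
            # Assumption: the K-word cap is applied before common words are dropped.
            kept = [w.lower() for w in words[start:start + K]
                    if w.lower() not in COMMON_WORDS]
            if kept:
                sentences.append(kept)
    return sentences


def extract_features(text):
    """Return every ordered pairing of distinct words within each sentence."""
    features = []
    for words in split_sentences(text):
        features.extend(f"{a} {b}" for a, b in permutations(words, 2) if a != b)
    return list(dict.fromkeys(features))  # drop duplicates, keep first-seen order


features = extract_features(
    "There is a problem in the tables that have been copied to the database"
)
print(features)
```

With this stop-word list the example sentence keeps four words, so the sketch produces 4 × 3 = 12 ordered pairs. The slides quote a different count, (n-1)*(n-2), so the original pairing rule may differ from this sketch.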
Algorithm
  • Feature Extraction (continued….)
    • Entire subject line is treated as one sentence
    • For HTML, all content within ‘<’ and ‘>’ is treated as one sentence
    • For a sentence of n words, ‘scavenger’ creates (n-1)*(n-2) features as compared to 2^(n-1) created by CRM114
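
To make the growth rates concrete, the short sketch below simply evaluates the two counts quoted on the slide, reading the CRM114 count as 2^(n-1) (consistent with the 16 features quoted earlier for a five-word window). The function names are hypothetical, and the formulas are taken at face value from the slide rather than derived.

```python
def scavenger_count(n):
    """Feature count quoted in the slides for an n-word sentence."""
    return (n - 1) * (n - 2)


def crm114_count(n):
    """Sub-phrase count quoted in the slides, read as 2^(n-1)."""
    return 2 ** (n - 1)


for n in (5, 10, 20):
    print(n, scavenger_count(n), crm114_count(n))
```

The first count grows quadratically and the second exponentially, so the gap widens quickly as sentences approach the K = 20 cap.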
Algorithm
  • Weight Assignment
    • Weights represent predictive strength of features
    • Discretized values are assigned as weights to features depending on whether the feature is ‘strong’ or ‘weak’ evidence
    • ‘Strong’ pieces of evidence should have high impact on the classification decision and ‘weak’ pieces of evidence should have low impact on the classification decision
Algorithm
  • Weight Assignment (Continued…)
    • Features are categorized as ‘strong’ or ‘weak’ pieces of evidence on the basis of frequency of occurrence in the spam/legitimate collections, exclusivity of occurrence, and heuristic rules such as the distance between the words in the word-pairing and whether the feature comes from the subject or the body.
    • Only exclusively occurring features are assigned weights
    • Features occurring in both spam and legitimate collections are ignored.
Algorithm
  • Weight Assignment (Continued…)
    • What weights should be selected for ‘strong’ and ‘weak’ evidence?
    • During classification, the class having more pieces of ‘strong’ evidence should ‘win’ regardless of the number of pieces of ‘weak’ evidence on either side.
    • In the absence of ‘strong’ evidence on either side, the class having more pieces of ‘weak’ evidence should ‘win’.
Algorithm
  • Weight Assignment (Continued…)
    • Intuitively, we would like to have as much ‘distance’ as possible between the values we choose for ‘strong’ and ‘weak’ evidence.
    • We select 0.9 as the weight for ‘strong’ evidence and 0.1 as the weight for ‘weak’ evidence.
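
The weight assignment can be sketched as follows. Only the 0.9 and 0.1 values and the exclusivity rule come from the slides; the frequency cut-off `STRONG_MIN_COUNT` is a hypothetical stand-in for the unspecified frequency criterion, and the heuristic rules (word distance, subject versus body) are omitted.

```python
from collections import Counter

STRONG, WEAK = 0.9, 0.1  # weights given in the slides
STRONG_MIN_COUNT = 3     # hypothetical cut-off; the slides give no number


def assign_weights(spam_counts, legit_counts):
    """Weight only the features that occur in exactly one collection."""
    spam_weights, legit_weights = {}, {}
    for feature, count in spam_counts.items():
        if feature not in legit_counts:  # exclusive to spam
            spam_weights[feature] = STRONG if count >= STRONG_MIN_COUNT else WEAK
    for feature, count in legit_counts.items():
        if feature not in spam_counts:  # exclusive to legitimate mail
            legit_weights[feature] = STRONG if count >= STRONG_MIN_COUNT else WEAK
    return spam_weights, legit_weights


spam_counts = Counter({"buy viagra": 5, "click here": 1, "meeting notes": 2})
legit_counts = Counter({"meeting notes": 4, "project report": 3})
spam_weights, legit_weights = assign_weights(spam_counts, legit_counts)
print(spam_weights)
print(legit_weights)
```

Here “meeting notes” receives no weight on either side because it appears in both collections.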
Algorithm
  • Combining of weights
    • Total spam evidence = sum of spam weights of matching features
    • Total legitimate evidence = sum of legitimate weights of matching features
    • If Total spam evidence >= M * Total legitimate evidence, then the message is spam
    • M is the threshold parameter, which can be used as a ‘tuning knob’
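
The decision rule is small enough to write out directly. The sketch below assumes the weights are held in two dictionaries keyed by feature string; the function name `is_spam`, the parameter name `m`, and the sample weights are illustrative only.

```python
def is_spam(features, spam_weights, legit_weights, m=1.0):
    """Apply the rule: spam if total spam evidence >= m * total legitimate evidence."""
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return spam_evidence >= m * legit_evidence


# Illustrative weights: one strong spam feature, two weak legitimate features.
spam_weights = {"buy viagra": 0.9}
legit_weights = {"project report": 0.1, "meeting notes": 0.1}
message = ["buy viagra", "project report", "meeting notes"]

print(is_spam(message, spam_weights, legit_weights))         # m = 1
print(is_spam(message, spam_weights, legit_weights, m=5.0))  # stricter threshold
```

Raising `m` demands proportionally more spam evidence before a message is flagged, trading recall for fewer false positives. Note that the rule as written also returns true when both totals are zero; how the original program treats a message with no matching features is not stated.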
Measurements
  • Precision and Recall used as parameters of measurement
    • Spam Precision = Messages correctly classified as spam / Total Messages classified as spam
    • Spam Recall = Messages correctly classified as spam / Total Spam Messages in Testing set
    • Precision gives accuracy with respect to false positives
    • Recall gives capacity of filter to catch spam
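
Both measures follow directly from the definitions above. In this sketch the labels are plain booleans (True meaning spam); the function name and the sample data are illustrative, not taken from the experiments.

```python
def spam_precision_recall(actual, predicted):
    """Return (precision, recall) for the spam class from parallel boolean lists."""
    correct_spam = sum(1 for a, p in zip(actual, predicted) if a and p)
    classified_spam = sum(1 for p in predicted if p)
    total_spam = sum(1 for a in actual if a)
    precision = correct_spam / classified_spam if classified_spam else 0.0
    recall = correct_spam / total_spam if total_spam else 0.0
    return precision, recall


actual = [True, True, True, True, False, False]
predicted = [True, True, True, False, True, False]
precision, recall = spam_precision_recall(actual, predicted)
print(precision, recall)
```

In the sample data one legitimate message is flagged (lowering precision) and one spam message slips through (lowering recall), so both values come out to 3/4.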
Measurements
  • Testing data
    • Downloaded around 5600 spam messages from http://www.spamarchive.org
    • Used around 960 legitimate mails from Dr. Fink’s mailbox
  • Cross-Validation
    • K-fold cross-validation for two values of K, K=2 and K=5
    • K=2: Dividing data into 2 equal-sized sets
    • K=5: Dividing data into 5 equal-sized sets
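
A K-fold split can be sketched as below. The slides do not say whether the messages were shuffled or how the two classes were balanced across folds, so the shuffling and the fixed seed here are assumptions, and `k_fold_splits` is a hypothetical helper name.

```python
import random


def k_fold_splits(messages, k, seed=0):
    """Yield (training, testing) lists, with each fold used once for testing."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # assumption: data is shuffled first
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal-sized sets
    for i in range(k):
        testing = folds[i]
        training = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield training, testing


splits = list(k_fold_splits(range(10), k=5))
for training, testing in splits:
    print(len(training), len(testing))
```

With K=2 each run trains on half the data and tests on the other half; with K=5 each run trains on four fifths and tests on the remaining fifth.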
Measurements
  • Comparison with Paul Graham’s naïve-bayes algorithm
  • Implemented Graham’s algorithm for two methods of feature extraction
    • Words+phrases as features
    • Feature extraction similar to ‘scavenger’
Measurements
  • Why does ‘scavenger’ perform better than naïve Bayes?
    • Powerful feature extraction (as powerful as CRM114)
    • Calculates predictive strength on basis of frequency of occurrence as well as heuristic rules
Implementation
  • Windows-PC based filter
  • Runs for individual email accounts on IMAP mail servers
  • Three Modules
    • Configuration
    • Training
    • Classification
Implementation
  • Classifier runs as a Windows Service
  • Connects to mail server every ten minutes
  • Downloads new messages, classifies them
  • Moves messages classified as spam to a pre-configured folder on the server
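
The polling cycle described above can be approximated with Python's standard `imaplib` module. This is a sketch of the workflow only, not the original Windows service: the folder name `Junk`, the use of an SSL connection, the `is_spam` callback, and both function names are assumptions. Fetching with `RFC822` marks a message as seen, which this sketch relies on as a crude way to skip already-classified mail.

```python
import imaplib
import time

POLL_SECONDS = 600  # ten minutes, per the slides


def move_spam(conn, is_spam, spam_folder="Junk"):
    """Classify unseen INBOX messages and move spam to spam_folder.

    conn is a logged-in imaplib connection; is_spam takes the raw message bytes.
    """
    conn.select("INBOX")
    _, data = conn.search(None, "UNSEEN")
    moved = 0
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")  # side effect: sets the \Seen flag
        raw_message = msg_data[0][1]
        if is_spam(raw_message):
            conn.copy(num, spam_folder)             # copy to the spam folder
            conn.store(num, "+FLAGS", "\\Deleted")  # then mark the original deleted
            moved += 1
    conn.expunge()  # remove the messages marked deleted
    return moved


def run_filter(host, user, password, is_spam):
    """Poll the server forever, reconnecting on every cycle."""
    while True:
        conn = imaplib.IMAP4_SSL(host)  # assumption: the server accepts SSL
        try:
            conn.login(user, password)
            move_spam(conn, is_spam)
        finally:
            conn.logout()
        time.sleep(POLL_SECONDS)
```

Keeping the per-cycle work in `move_spam` separates it from the connection handling, so it can be exercised against a stub connection without a live mail server.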
Future Work
  • Incorporating message headers during feature extraction step
  • Incorporating domain-specific attributes of spam during weight combination step
Publications
  • Dr. William Yerazunis (inventor of CRM114) mentioned the ‘scavenger’ algorithm at the MIT Spam Conference on Jan 16, 2004
  • To be published at the ‘First Conference on Email and Anti-Spam’ in Palo Alto, California in July 2004
Acknowledgements
  • Dr. Eugene Fink
  • Dr. Dewey Rundus, Dr. Alan Hevner
  • Dr. Paul Graham, MIT, Boston
  • Dr. William Yerazunis, MERL, Boston