SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare

SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare Committee: Dr. Eugene Fink, Dr. Dewey Rundus, Dr. Alan Hevner

Outline • Introduction • Related Work • Algorithm • Measurements • Implementation • Future Work

Introduction Anti-spam efforts • Legislation • Technology • White listing of Email addresses • Black Listing of Email addresses/domains • Challenge Response mechanisms • Content Filtering • Learning Techniques

Introduction Learning techniques for Spam classification • Feature Extraction • Assignment of weights to individual features representing the predictive strength of a feature • Combining weights of extracted features during classification to numerically determine whether mail is spam/legitimate

Introduction Current algorithms • Words or phrases as features • Probabilities of occurrence in spam/legitimate collections as weights • Bayes rule or one of its variants for combining weights

Related Work • Cohen (1996): • RIPPER, Rule Learning System • Rules in a human-comprehensible format • Pantel & Lin (1998): • Naïve-Bayes with words as features • Microsoft Research (1998): • Naïve-Bayes with the mutual information measure to select features with strongest resolving power • Words and domain-specific attributes of spam used as features

Related Work • Paul Graham (2002): A Plan for Spam • Very popular algorithm credited with starting the craze for Bayesian filters • Uses Naïve-Bayes with words as features • Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm • Most accurate algorithm to date (over 99.7% accuracy) • Distinctive because of its powerful feature extraction technique • Uses Bayesian chain rule for combining weights

Related Work • CRM114 algorithm Feature Extraction • Slide a window of 5 words over the incoming text • Generate order-preserving sub-phrases containing all combinations of windowed words • For one window, 2^4 = 16 features are generated • Very high computational complexity • E.g. “Click here to buy Viagra” • Features generated would be “Click”, “Click here”, “Click to”, “Click buy”, “Click Viagra”, “Click here to”, “Click here buy” etc.
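The windowing step described on this slide can be sketched in a few lines of Python. This is only an illustration of the slide's description, not CRM114's actual code: the function names are hypothetical, each sub-phrase is assumed to keep the first word of the window (as in the slide's examples), and only full windows are slid over the text.

```python
from itertools import combinations

def window_features(window):
    """Order-preserving sub-phrases of one window, each starting with its first word."""
    head, rest = window[0], window[1:]
    features = []
    # Choose every subset of the remaining words; combinations() keeps input order.
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            features.append(" ".join((head,) + combo))
    return features

def sbph_features(text, width=5):
    """Slide a window of `width` words over the text (assumption: full windows only)."""
    words = text.split()
    features = []
    for start in range(max(1, len(words) - width + 1)):
        features.extend(window_features(words[start:start + width]))
    return features

features = window_features("Click here to buy Viagra".split())
print(len(features))   # 16
print(features[:5])    # ['Click', 'Click here', 'Click to', 'Click buy', 'Click Viagra']
```

With four optional words per window there are 2^4 = 16 subsets, which is where the per-window feature count comes from.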

Algorithm • Feature Extraction • Sentences in a message are identified by using the delimiting characters ‘.’, ‘?’, ‘!’, ‘;’, ‘<’, ‘>’ • All possible word-pairings are formed from the sentences • Commonly occurring words are skipped • These word-pairings serve as features to be used for classification

Algorithm • Feature Extraction (continued…) • If the number of words in a sentence exceeds a constant K, then each series of K words is treated as a sentence • Value of K is set to 20 • E.g. “There is a problem in the tables that have been copied to the database” • “problem tables”, “tables problem”, “problem copied”, “copied problem”, “problem database”, “database problem” etc. are the features that would be formed out of the sentence
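A minimal Python sketch of this word-pairing step follows. Several details are assumptions, since the slides do not specify them: the `COMMON_WORDS` list is hypothetical, the cap of K words is applied before common words are skipped, and words are lowercased.

```python
import re
from itertools import permutations

# Hypothetical stop-word list; the slides do not say which common words are skipped.
COMMON_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to"}
MAX_WORDS = 20  # the constant K from the slides

def split_sentences(text):
    """Split on the delimiters named in the slides, then cap each piece at MAX_WORDS words."""
    sentences = []
    for piece in re.split(r"[.?!;<>]", text):
        words = piece.split()
        for start in range(0, len(words), MAX_WORDS):
            chunk = words[start:start + MAX_WORDS]
            if chunk:
                sentences.append(chunk)
    return sentences

def word_pairs(text):
    """Form every ordered pairing of the non-common words within each sentence."""
    features = []
    for sentence in split_sentences(text):
        kept = [w.lower() for w in sentence if w.lower() not in COMMON_WORDS]
        features.extend(f"{a} {b}" for a, b in permutations(kept, 2))
    return features

pairs = word_pairs("There is a problem in the tables that have been copied to the database.")
print(pairs[:4])  # ['problem tables', 'problem copied', 'problem database', 'tables problem']
```

Using ordered pairs means that both “problem tables” and “tables problem” appear as separate features, matching the example on the slide.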

Algorithm • Feature Extraction (continued…) • Entire subject line is treated as one sentence • For HTML, all content within ‘<’ and ‘>’ is treated as one sentence • For a sentence of n words, ‘scavenger’ creates (n-1)*(n-2) features, compared to 2^(n-1) created by CRM114

Algorithm • Weight Assignment • Weights represent predictive strength of features • Discretized values are assigned as weights to features depending on whether the feature is a ‘strong’ evidence or a ‘weak’ evidence • ‘Strong’ pieces of evidence should have high impact on the classification decision and ‘weak’ pieces of evidence should have low impact on the classification decision

Algorithm • Weight Assignment (Continued…) • Categorization of features into ‘strong’ and ‘weak’ pieces of evidence is done on the basis of frequency of occurrence of the feature in spam/legitimate collections, exclusivity of occurrence, and heuristic rules such as the distance between words in the word-pairing and whether the feature is from the subject or the body • Only exclusively occurring features are assigned weights • Features occurring in both spam and legitimate collections are ignored

Algorithm • Weight Assignment (Continued…) • What weights to select for the ‘strong’ evidences and the ‘weak’ evidences? • During classification, the class having more pieces of ‘strong’ evidence should ‘win’ regardless of the number of ‘weak’ evidences on either side. • In the absence of ‘strong’ evidences on either side, the class having more pieces of ‘weak’ evidence should ‘win’.

Algorithm • Weight Assignment (Continued…) • Intuitively, we would like to have as much ‘distance’ as possible between the values we choose for the ‘strong’ and ‘weak’ evidences • We select 0.9 as the weight for ‘strong’ evidences and 0.1 as the weight for ‘weak’ evidences

Algorithm • Combining of weights • Total spam evidence = sum of spam weights of matching features • Total legitimate evidence = sum of legitimate weights of matching features • If Total spam evidence >= M * Total legitimate evidence, then the message is spam • M is the threshold parameter, which can be used as a ‘tuning knob’
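The decision rule above can be sketched as follows. The dictionaries, feature strings and the function name `classify` are hypothetical; the weights 0.9 and 0.1 and the threshold M come from the slides.

```python
STRONG, WEAK = 0.9, 0.1  # weights for 'strong' and 'weak' evidence, as on the slides

def classify(features, spam_weights, legit_weights, m=1.0):
    """Return True (spam) when total spam evidence >= m * total legitimate evidence."""
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return spam_evidence >= m * legit_evidence

# Illustrative weight tables: each feature occurs in only one collection.
spam_weights = {"buy viagra": STRONG, "click here": WEAK}
legit_weights = {"problem tables": STRONG}

message = ["buy viagra", "click here", "problem tables"]
print(classify(message, spam_weights, legit_weights))          # True:  1.0 >= 1 * 0.9
print(classify(message, spam_weights, legit_weights, m=2.0))   # False: 1.0 <  2 * 0.9
```

Raising M demands proportionally more spam evidence before a message is flagged, which trades recall for fewer false positives. Note that with the rule exactly as written, a message with no matching features satisfies the inequality (0 >= 0); how ‘scavenger’ treats that case is not stated in the slides.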

Measurements • Precision and Recall used as parameters of measurement • Spam Precision = Messages correctly classified as spam / Total messages classified as spam • Spam Recall = Messages correctly classified as spam / Total spam messages in testing set • Precision gives accuracy with respect to false positives • Recall gives capacity of filter to catch spam
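As a small worked illustration of the two definitions, here is a Python sketch. The function name and the toy labels are made up for the example; `True` stands for spam.

```python
def spam_precision_recall(actual, predicted):
    """Compute spam precision and recall from parallel lists of booleans (True = spam)."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    predicted_spam = sum(1 for p in predicted if p)
    actual_spam = sum(1 for a in actual if a)
    # Guard against empty denominators (assumption: report 0.0 in that case).
    precision = true_pos / predicted_spam if predicted_spam else 0.0
    recall = true_pos / actual_spam if actual_spam else 0.0
    return precision, recall

actual    = [True, True, True, True, False]
predicted = [True, True, False, False, False]
print(spam_precision_recall(actual, predicted))  # (1.0, 0.5)
```

In this toy case nothing legitimate was flagged, so precision is 1.0, but only two of the four spam messages were caught, so recall is 0.5.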

Measurements • Testing data • Downloaded around 5600 spam messages from http://www.spamarchive.org • Used around 960 legitimate mails from Dr. Fink’s mailbox • Cross-Validation • K-fold cross-validation for two values of K, K=2 and K=5 • K=2: Dividing data into 2 equal-sized sets • K=5: Dividing data into 5 equal-sized sets
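A K-fold split of the kind described here might look like the sketch below. The shuffling, the fixed seed and the function name are assumptions for illustration; the slides do not say how messages were assigned to folds.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs; each of the k folds serves once as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)          # assumption: random assignment to folds
    folds = [shuffled[i::k] for i in range(k)]     # sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

for train, test in k_fold_splits(range(10), k=5):
    print(len(train), len(test))  # 8 2 on each of the five iterations
```

With K=2 each run trains on half of the data and tests on the other half; with K=5 each run trains on four fifths and tests on the remaining fifth.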

Measurements • Comparison with Paul Graham’s naïve-bayes algorithm • Implemented Graham’s algorithm for two methods of feature extraction • Words+phrases as features • Feature extraction similar to ‘scavenger’


Measurements • Why does ‘scavenger’ perform better than Naïve-Bayes? • Powerful feature extraction (as powerful as CRM114) • Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules

Implementation • Windows-PC based filter • Runs for Individual email accounts in IMAP mail servers • Three Modules • Configuration • Training • Classification

Implementation • Classifier runs as a Windows Service • Connects to mail server every ten minutes • Downloads new messages, classifies them • Moves messages classified as spam to a pre-configured folder on the server

Future Work • Incorporating message headers during feature extraction step • Incorporating domain-specific attributes of spam during weight combination step

Publications • Dr. William Yerazunis (inventor of CRM114) mentioned the ‘scavenger’ algorithm at the MIT Spam Conference on Jan 16, 2004 • To be published in the ‘First Conference on Email and Anti-Spam’ in Palo Alto, California in July 2004

Acknowledgements • Dr. Eugene Fink • Dr. Dewey Rundus, Dr. Alan Hevner • Dr. Paul Graham, MIT, Boston • Dr. William Yerazunis, MERL, Boston
