Project Presentation

Project Presentation B92902041 王　立 B92902051 陳俊甫 B92902092 張又仁 B92902095 李佳穎

Outline • Spam filtering technology • Personal issue • Statistics • Our approach

Spam filtering technology • Basic structured text filters • Whitelist/verification filters • Distributed blacklists • Pyzor • Rule-based rankings • SpamAssassin • Bayesian word distribution filters • Bayesian trigram filters

Table 1. Quantitative accuracy of spam filtering techniques

Bayesian filtering • The first two papers using a Bayesian method • Pantel and Lin • Bayesian filtering caught 92% of spam with 1.16% false positives in 1998 • Bayesian filtering was not widely used at first • Why?

Bayesian filtering (cont.) • Someone we found later • Jonathan Zdziarski • The main problem with previous work was a false positive rate that was too high • Bayesian filtering caught 99.5% of spam with 0.03% false positives in 2002 • Why so different?

Possible Reasons • Too little training data: 160 spam and 466 non-spam mails • Message headers were ignored • Tokens were stemmed, which collapsed words in a harmful way • All tokens were used, which works worse than using only the 15 most significant • No bias against false positives

Personal issue • Advantages of personalization • Makes filters more effective • Lets users decide what their own spam filter blocks • Makes it hard for spammers to tune mail to get through

Statistics • The fifteen most interesting words in this spam, with their probabilities, are: • madam 0.99 • promotion 0.99 • republic 0.99 • shortest 0.047225013 • mandatory 0.047225013 • standardization 0.07347802 • sorry 0.08221981 • supported 0.09019077 • people's 0.09019077 • enter 0.9075001 • quality 0.8921298 • organization 0.12454646 • investment 0.8568143 • very 0.14758544 • valuable 0.82347786
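The list above can be turned into a single spam score. The sketch below assumes the combination rule described in the referenced "A Plan for Spam" (product of the probabilities divided by that product plus the product of their complements); the function names are hypothetical and only the probabilities come from the slide.

```python
from math import prod

# Token probabilities copied from the slide above.
WORD_PROBS = {
    "madam": 0.99, "promotion": 0.99, "republic": 0.99,
    "shortest": 0.047225013, "mandatory": 0.047225013,
    "standardization": 0.07347802, "sorry": 0.08221981,
    "supported": 0.09019077, "people's": 0.09019077,
    "enter": 0.9075001, "quality": 0.8921298,
    "organization": 0.12454646, "investment": 0.8568143,
    "very": 0.14758544, "valuable": 0.82347786,
}


def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(probs):
    """Assumed rule: prod(p) / (prod(p) + prod(1 - p))."""
    probs = list(probs)
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


score = combined_spam_probability(most_interesting(WORD_PROBS.values()))
print(round(score, 4))
```

With these fifteen values the score comes out near 0.90, so the message would be classified as spam under a 0.9 threshold only by a narrow margin, because the list mixes strongly spammy and strongly innocent words.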

Our approach • Data set • Sparse format • Machine learning (training) • Machine learning (testing)

Data set • Source • http://iit.demokritos.gr/skel/i-config/downloads/ • Ling-spam • PU1 • PU123 • Enron-spam

Ling-spam • Collected from a mailing list (the Linguist List) • Contains 481 spam messages and 2412 non-spam messages • The legitimate mails cover similar topics • May be good for training, but does not generalize well • 4 versions of the corpus • With or without a lemmatiser • With or without a stop-list

Example Subject: want best economical hunt vacation life ? want best hunt camp vacation life , felton 's hunt camp wild wonderful west virginium . $ 50 . 0 per day pay room three home cook meal ( pack lunch want stay wood noon ) cozy accomodation . reserve space . follow season book 1998 : buck season - nov . 23 - dec . 5 doe season - announce ( please call ) muzzel loader ( deer ) - dec . 14 - dec . 19 archery ( deer ) - oct . 17 - dec . 31 turkey sesson - oct . 24 - nov . 14 e - mail us 110734 . 2622 @ compuserve . com

Features • ‘Words’ as features • Sequences of letters, digits and some symbols • Only the subject and body fields are considered • CJK text is not supported for now • Collected from spam messages only • No fixed limit on the size of the feature set • Only features that appear often enough are used
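A minimal sketch of the feature selection described above. The exact symbol set, the counting rule (here: number of spam messages containing the word) and the `min_count` threshold are assumptions for illustration; the slides do not specify them.

```python
import re
from collections import Counter

# Assumed token pattern: letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Split subject/body text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_vocabulary(spam_messages, min_count=2):
    """Map each sufficiently frequent spam word to a feature index."""
    counts = Counter()
    for message in spam_messages:
        # Count each word once per message (document frequency).
        counts.update(set(tokenize(message)))
    # Most frequent first; ties broken alphabetically for stable indices.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    frequent = [word for word, count in ranked if count >= min_count]
    return {word: index for index, word in enumerate(frequent)}


spam = [
    "Free money , please send",
    "please send your address",
    "free list",
]
print(build_vocabulary(spam))
```

Words seen in only one of the three toy messages are dropped, which mirrors the "appear often enough" rule without fixing the vocabulary size in advance.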

Example for Features • 104 please • 104 free • 103 our • 95 mail • 91 address • 86 send • 81 one • 80 information • 77 us • 77 list • 74 receive • 74 name • 73 money • … • Collected from the spam messages of the lemm_stop section

Sparse Format • Some results from lemm_stop/part1 :
0, 2:1, 3:1, 4:1, 5:1, 6:1, 10:1, 12:1, 15:1, 16:1, 20:1, …
0, 0:1, 4:1, 5:1, 6:1, 7:1, 8:1, 12:1, 16:1, 20:1, 22:1, …
0, 0:1, 4:1, 5:1, 7:1, 8:1, 11:1, 13:1, 25:1, 41:1, 53:1, …
0, 0:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 12:1, 13:1, 14:1, …
1, 0:1, 3:1, 6:1, 10:1, 17:1, 18:1, 23:1, 26:1, 28:1, …
1, 3:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 13:1, 14:1, 15:1, …
1, 0:1, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, …
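The rows above appear to hold a class label followed by `index:1` pairs for the features present in a message. A minimal sketch of writing and reading such rows, assuming that interpretation; the function names and the toy vocabulary are hypothetical.

```python
def to_sparse_line(label, tokens, vocabulary):
    """Encode a message as 'label, index:1, index:1, ...' (binary features)."""
    indices = sorted({vocabulary[t] for t in tokens if t in vocabulary})
    return ", ".join([str(label)] + [f"{index}:1" for index in indices])


def parse_sparse_line(line):
    """Decode one sparse row into (label, {feature_index: value})."""
    fields = [field.strip() for field in line.split(",") if field.strip()]
    label = int(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = int(value)
    return label, features


# Toy vocabulary for illustration only.
vocabulary = {"free": 0, "please": 1, "send": 2}
line = to_sparse_line(1, ["send", "free", "free", "unknown"], vocabulary)
print(line)
print(parse_sparse_line(line))
```

Only the indices of present features are stored, so a message with a handful of known words stays short even when the feature set grows large.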

Training methods • Naïve Bayes • k-NN, k = 3 or less • CART
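To illustrate the first method on binary sparse features, here is a small pure-Python sketch. It assumes the Bernoulli variant of Naïve Bayes with Laplace smoothing; the slides do not say which variant or library was used, and the class name and toy data are hypothetical.

```python
import math


class BernoulliNaiveBayes:
    """Naive Bayes over binary features; each row is a set of feature indices."""

    def fit(self, rows, labels, n_features):
        self.n_features = n_features
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_present = {}  # log P(feature present | class)
        self.log_absent = {}   # log P(feature absent | class)
        for c in self.classes:
            members = [row for row, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(rows))
            counts = [0] * n_features
            for row in members:
                for index in row:
                    counts[index] += 1
            # Laplace smoothing keeps every probability strictly inside (0, 1).
            probs = [(k + 1) / (len(members) + 2) for k in counts]
            self.log_present[c] = [math.log(p) for p in probs]
            self.log_absent[c] = [math.log(1.0 - p) for p in probs]
        return self

    def predict(self, row):
        def score(c):
            total = self.log_prior[c]
            for index in range(self.n_features):
                if index in row:
                    total += self.log_present[c][index]
                else:
                    total += self.log_absent[c][index]
            return total
        return max(self.classes, key=score)


# Toy data: the meaning of labels 0 and 1 is not stated in the slides.
rows = [{0, 1}, {0, 2}, {3}, {3, 4}]
labels = [1, 1, 0, 0]
model = BernoulliNaiveBayes().fit(rows, labels, n_features=5)
print(model.predict({0}), model.predict({3}))
```

Working in log space avoids underflow when the feature set is large, which matters once the vocabulary has hundreds of entries.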

Training and testing • Ling-spam is split into 10 parts • Use 9 parts for training • Use 1 part for testing
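The 9-to-1 split can be sketched as follows. Rotating the held-out part so that every part is tested once (10-fold cross-validation) is an assumption; the slides only state that nine parts train and one part tests. The function name is hypothetical.

```python
def leave_one_part_out(parts):
    """Yield (training_rows, test_rows), holding out each part in turn."""
    for held_out in range(len(parts)):
        training = [
            row
            for index, part in enumerate(parts)
            if index != held_out
            for row in part
        ]
        yield training, list(parts[held_out])


# Ten toy parts with one placeholder row each.
parts = [[f"message-{i}"] for i in range(10)]
splits = list(leave_one_part_out(parts))
print(len(splits), len(splits[0][0]), splits[0][1])
```

Averaging the test accuracy over the ten rounds gives a steadier estimate than a single split, at the cost of training ten times.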

References • Spam filtering technology • http://www-128.ibm.com/developerworks/linux/library/l-spamf.html • Better Bayesian Filtering • http://www.paulgraham.com/better.html • A Plan for Spam • http://www.paulgraham.com/spam.html
