Project Presentation

Presentation Transcript


  1. Project Presentation B92902041 王 立 B92902051 陳俊甫 B92902092 張又仁 B92902095 李佳穎

  2. Outline • Spam filtering technology • Personalization • Statistics • Our approach

  3. Spam filtering technology • basic structured text filters • whitelist/verification filters • distributed blacklists (e.g. Pyzor) • rule-based rankings (e.g. SpamAssassin) • Bayesian word distribution filters • Bayesian trigram filters

  4. Table 1. Quantitative accuracy of spam filtering techniques

  5. Bayesian filtering • The first two papers on Bayesian spam filtering appeared in 1998 • One was by Pantel and Lin • Their filter caught 92% of spam with 1.16% false positives • Bayesian filtering did not catch on at first • Why?

  6. Bayesian filtering (cont.) • Someone we found later • Jonathan Zdziarski • The main problem with the earlier work was that its false-positive rate was too high • Bayesian filtering in 2002: 99.5% of spam caught, 0.03% false positives • Why so different?

  7. Possible reasons • too little training data: 160 spam and 466 non-spam mails • message headers were ignored • tokens were stemmed, which merges distinct words and loses information • all tokens were used, which works worse than using only the 15 most significant • no bias against false positives
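
The last point, a bias against false positives, can be sketched as follows. This is a minimal illustration based on the per-token estimate described in the "A Plan for Spam" essay listed in the references: counts from legitimate mail are doubled before the ratio is taken. The function name, parameter names and the defaults `good_weight=2.0` and `min_seen=5` are assumptions made for this sketch, not code from the project.

```python
def token_spam_probability(bad, good, n_bad, n_good, good_weight=2.0, min_seen=5):
    """Estimate P(spam | token) from per-class counts.

    bad / good:     occurrences of the token in spam / legitimate mail.
    n_bad / n_good: number of spam / legitimate messages in the training set.
    """
    # Weighting legitimate counts up makes it harder for a token to look spammy,
    # which is the bias against false positives.
    weighted_good = good_weight * good
    if weighted_good + bad < min_seen:
        return None  # token is too rare to carry evidence
    bad_ratio = min(1.0, bad / n_bad)
    good_ratio = min(1.0, weighted_good / n_good)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))
```

With equal counts in both classes, the weighted estimate falls below 0.5, so an ambiguous token leans towards "legitimate".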

  8. Personalization • Advantages of per-user filters • makes filters more effective • lets each user decide what counts as spam • makes it hard for spammers to tune mail to get through

  9. Statistics • The fifteen most interesting words in this spam, with their probabilities, are: • madam 0.99 • promotion 0.99 • republic 0.99 • shortest 0.047225013 • mandatory 0.047225013 • standardization 0.07347802 • sorry 0.08221981 • supported 0.09019077 • people's 0.09019077 • enter 0.9075001 • quality 0.8921298 • organization 0.12454646 • investment 0.8568143 • very 0.14758544 • valuable 0.82347786
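
As a worked example, the fifteen probabilities above can be combined into one score. The sketch below assumes the naive Bayes combination described in the "A Plan for Spam" essay listed in the references: the product of the probabilities divided by that product plus the product of their complements. The names `TOKEN_PROBS` and `combined_spam_probability` are made up for this illustration.

```python
from math import prod

# Token probabilities copied from the slide above.
TOKEN_PROBS = {
    "madam": 0.99,
    "promotion": 0.99,
    "republic": 0.99,
    "shortest": 0.047225013,
    "mandatory": 0.047225013,
    "standardization": 0.07347802,
    "sorry": 0.08221981,
    "supported": 0.09019077,
    "people's": 0.09019077,
    "enter": 0.9075001,
    "quality": 0.8921298,
    "organization": 0.12454646,
    "investment": 0.8568143,
    "very": 0.14758544,
    "valuable": 0.82347786,
}


def combined_spam_probability(probs):
    """Combine per-token spam probabilities, treating tokens as independent."""
    probs = list(probs)
    spam_evidence = prod(probs)
    ham_evidence = prod(1.0 - p for p in probs)
    return spam_evidence / (spam_evidence + ham_evidence)


score = combined_spam_probability(TOKEN_PROBS.values())
print(f"combined spam probability: {score:.4f}")  # roughly 0.90 for these values
```

The evidence is mixed, since several words point strongly towards legitimate mail, but the combined score still lands on the spam side.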

  10. Our approach • Data set • Sparse format • Machine learning (training) • Machine learning (testing)

  11. Data set • Source • http://iit.demokritos.gr/skel/i-config/downloads/ • Ling-spam • PU1 • PU123 • Enron-spam

  12. Ling-spam • Legitimate messages collected from the Linguist mailing list • 481 spam messages and 2412 non-spam messages • The legitimate mails share similar topics • May be good for training, but may not generalize well • 4 versions of the corpus • with or without a lemmatiser • with or without a stop-list

  13. Example Subject: want best economical hunt vacation life ? want best hunt camp vacation life , felton 's hunt camp wild wonderful west virginium . $ 50 . 0 per day pay room three home cook meal ( pack lunch want stay wood noon ) cozy accomodation . reserve space . follow season book 1998 : buck season - nov . 23 - dec . 5 doe season - announce ( please call ) muzzel loader ( deer ) - dec . 14 - dec . 19 archery ( deer ) - oct . 17 - dec . 31 turkey sesson - oct . 24 - nov . 14 e - mail us 110734 . 2622 @ compuserve . com

  14. Features • ‘Words’ as features • A word is a sequence of letters, digits and some symbols • Only the subject and body fields are considered • CJK text is not supported for now • Collected from spam messages only • The feature set has no fixed size limit • Only features that appear often enough are used

  15. Example features • 104 please • 104 free • 103 our • 95 mail • 91 address • 86 send • 81 one • 80 information • 77 us • 77 list • 74 receive • 74 name • 73 money • … • Collected from the spam messages in the lemm_stop version of the corpus
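
The feature collection described on the two slides above can be sketched as follows. Everything specific here is an assumption for illustration: the token pattern `TOKEN_RE`, the `min_count` threshold, and the choice to count the number of spam messages containing a token rather than total occurrences. The slides do not state which of these the project used.

```python
import re
from collections import Counter

# Assumed token pattern: runs of ASCII letters, digits and a few symbols (no CJK).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Split subject/body text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_vocabulary(spam_messages, min_count=2):
    """Map each token seen in at least min_count spam messages to a feature index."""
    message_counts = Counter()
    for message in spam_messages:
        message_counts.update(set(tokenize(message)))  # count each message once
    # Most frequent first; ties broken alphabetically so indices are reproducible.
    ranked = sorted(message_counts.items(), key=lambda item: (-item[1], item[0]))
    frequent = [token for token, count in ranked if count >= min_count]
    return {token: index for index, token in enumerate(frequent)}
```

Only spam messages are passed in, matching the slide; legitimate mail is later encoded against the same vocabulary.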

  16. Sparse format • Sample output from lemm_stop/part1:
      0, 2:1, 3:1, 4:1, 5:1, 6:1, 10:1, 12:1, 15:1, 16:1, 20:1, …
      0, 0:1, 4:1, 5:1, 6:1, 7:1, 8:1, 12:1, 16:1, 20:1, 22:1, …
      0, 0:1, 4:1, 5:1, 7:1, 8:1, 11:1, 13:1, 25:1, 41:1, 53:1, …
      0, 0:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 12:1, 13:1, 14:1, …
      1, 0:1, 3:1, 6:1, 10:1, 17:1, 18:1, 23:1, 26:1, 28:1, …
      1, 3:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 13:1, 14:1, 15:1, …
      1, 0:1, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, …
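
A minimal sketch of writing and reading rows in this format is shown below. It assumes that the first field is the class label and that each `index:1` field marks a feature present in the message; the slide does not say which class 0 and 1 stand for, so the code treats the label as an opaque integer. The function names are hypothetical.

```python
def to_sparse_line(label, tokens, vocabulary):
    """Encode one message as 'label, i:1, j:1, ...' with sorted feature indices."""
    indices = sorted({vocabulary[token] for token in tokens if token in vocabulary})
    return ", ".join([str(label)] + [f"{index}:1" for index in indices])


def parse_sparse_line(line):
    """Decode one row back into (label, set of present feature indices)."""
    fields = [field.strip() for field in line.split(",") if field.strip()]
    label = int(fields[0])
    # Fields without a colon (such as a trailing ellipsis) are skipped.
    features = {int(field.split(":")[0]) for field in fields[1:] if ":" in field}
    return label, features
```

Because every feature is binary, a set of indices is enough to represent a message once it has been parsed.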

  17. Training methods • Naïve Bayes • k-NN, k ≤ 3 • CART (decision tree)
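
Of the three methods, naïve Bayes is the simplest to sketch on binary sparse features. The version below is a Bernoulli naïve Bayes with add-one smoothing written for illustration only; the slides do not say which variant or library the project used, and the function names and data layout (a label plus a set of feature indices per message) are assumptions.

```python
import math


def train_naive_bayes(examples, n_features):
    """examples: list of (label, set_of_feature_indices) with labels 0 or 1."""
    class_counts = {0: 0, 1: 0}
    feature_counts = {0: [0] * n_features, 1: [0] * n_features}
    for label, features in examples:
        class_counts[label] += 1
        for index in features:
            feature_counts[label][index] += 1
    model = {}
    for label in (0, 1):
        n = class_counts[label]
        log_prior = math.log((n + 1) / (len(examples) + 2))
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        present = [(count + 1) / (n + 2) for count in feature_counts[label]]
        model[label] = (log_prior, present)
    return model


def predict(model, features):
    """Return the label with the highest log-posterior for one message."""
    scores = {}
    for label, (log_prior, present) in model.items():
        score = log_prior
        for index, p in enumerate(present):
            score += math.log(p if index in features else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when the vocabulary is large, since each message contributes one factor per feature.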

  18. Training and testing • Ling-spam is split into 10 parts • 9 parts are used for training • 1 part is used for testing
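
The 9-to-1 split can be sketched as a generator over the ten parts. The slide does not say whether the held-out part is rotated; this sketch assumes it is, which gives ordinary 10-fold cross-validation. The function name and the representation of a part as a list of examples are assumptions.

```python
def cross_validation_folds(parts):
    """Yield (held_out_index, training_examples, test_examples) for each part."""
    for held_out, test_part in enumerate(parts):
        training = [
            example
            for index, part in enumerate(parts)
            if index != held_out
            for example in part
        ]
        yield held_out, training, list(test_part)
```

Each example is used for testing exactly once, so averaging the per-fold results uses the whole corpus without ever testing on training data.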

  19. References • Spam filtering technology • http://www-128.ibm.com/developerworks/linux/library/l-spamf.html • Better Bayesian Filtering • http://www.paulgraham.com/better.html • A Plan for Spam • http://www.paulgraham.com/spam.html
