
Data Mining & Machine Learning Final Project


Presentation Transcript


  1. Data Mining & Machine Learning Final Project Group 2 R95922027 李庭閣 R95922034 孔垂玖 R95922081 許守傑 R95942129 鄭力維

  2. Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

  3. Experiment Setting • Selected online corpus: Enron • Removed HTML tags • Parsed the important headers • Six folders, enron1 to enron6 • Containing 13,496 spam mails and 15,045 ham mails in total
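
A minimal preprocessing sketch in Python of the setup above, assuming the corpus is unpacked into local enron1/ … enron6/ directories with ham/ and spam/ subfolders of .txt files (the usual layout of this corpus); the paths, the regex-based tag stripping, and the helper names are illustrative, not the project's own code.

```python
import glob
import os
import re

def strip_html(text):
    """Remove HTML tags with a simple regex (sufficient for this corpus)."""
    return re.sub(r"<[^>]+>", " ", text)

def load_enron(root="."):
    """Load every mail from enron1..enron6 and return (texts, labels)."""
    texts, labels = [], []
    for i in range(1, 7):
        for label in ("spam", "ham"):
            pattern = os.path.join(root, f"enron{i}", label, "*.txt")
            for path in glob.glob(pattern):  # assumed directory layout
                with open(path, encoding="latin-1") as f:
                    texts.append(strip_html(f.read()))
                labels.append(1 if label == "spam" else 0)
    return texts, labels

texts, labels = load_enron()
print(len(texts), "mails,", sum(labels), "spam")
```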

  4. Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

  5. Feature Extraction • Transmitted Time of the Mail • Number of Receivers • Existence of Attachment • Existence of Images in the Mail • Existence of Cited URLs in the Mail • Symbols in the Mail Title • Mail Body

  6. Transmitted Time of the Mail & Number of Receivers • Spam: non-uniform distribution of transmitted time • Spam: only a single receiver

  7. Probability of being Spam for Transmitted Time & Receiver Size
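
A hedged sketch of how such per-bucket spam probabilities can be estimated by simple counting, assuming the transmitted hour (or receiver count) has already been parsed from the headers; all names and values below are illustrative.

```python
from collections import Counter

def spam_probability_by_bucket(values, labels):
    """Estimate P(spam | bucket) by counting spam and total mails per bucket."""
    total, spam = Counter(), Counter()
    for v, y in zip(values, labels):
        total[v] += 1
        if y == 1:
            spam[v] += 1
    return {v: spam[v] / total[v] for v in total}

# Example: hours of day extracted from the Date: header, with spam labels (toy data).
hours = [2, 3, 14, 2, 23, 9]
labels = [1, 1, 0, 1, 1, 0]
print(spam_probability_by_bucket(hours, labels))
```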

  8. Attachment, Images, and URL
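
The slide does not give the exact checks used, so the following sketch derives the three boolean features from a raw message with Python's standard email module under assumed conventions: any named MIME part counts as an attachment, any image/* part as an image, and any http(s) or www string in the body as a cited URL.

```python
import re
from email import message_from_string

def binary_features(raw_mail):
    """Return (has_attachment, has_image, has_url) for a raw RFC 2822 message."""
    msg = message_from_string(raw_mail)
    has_attachment = has_image = False
    body_parts = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if part.get_filename():            # assumption: any named part is an attachment
            has_attachment = True
        if ctype.startswith("image/"):
            has_image = True
        if ctype in ("text/plain", "text/html"):
            payload = part.get_payload(decode=True) or b""
            body_parts.append(payload.decode("latin-1", "ignore"))
    body = " ".join(body_parts)
    has_url = bool(re.search(r"https?://|www\.", body, re.I))
    return has_attachment, has_image, has_url
```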

  9. Symbols in Mail Titles • Title absentness: spam senders add titles now • Arabic numerals (dates, IDs): almost equal probability in spam and ham • Non-alphanumeric characters & punctuation marks: some appear more often in spam, others more often in ham
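
A small sketch of turning these title observations into features (title absent, Arabic numerals present, count of non-alphanumeric symbols); the regular expressions are assumptions, not the project's definitions.

```python
import re

def title_features(subject):
    """Features of the mail title: absence, digits, non-alphanumeric symbols."""
    subject = (subject or "").strip()
    return {
        "title_absent": int(subject == ""),
        "has_digits": int(bool(re.search(r"\d", subject))),
        "n_symbols": len(re.findall(r"[^\w\s]", subject)),  # punctuation and other symbols
    }

print(title_features("!!! FREE V1AGRA !!!"))        # many symbols, digit: spam-like
print(title_features("Meeting agenda for Monday"))  # plain title: ham-like
```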

  10. Mail Body • Build the internal structure of words • Use a good NLP tool called TreeTagger to do word stemming • Given the stemmed words appearing in each mail, build a sparse-format vector to represent the "semantics" of the mail
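
A sketch of the sparse-format vector, with a plain whitespace split standing in for the TreeTagger lemmas (running TreeTagger requires the external binary); the vocabulary and helper names are illustrative.

```python
from collections import Counter

def to_sparse_vector(stemmed_words, vocabulary):
    """Map a list of stemmed words to {word_id: count}, a sparse representation of the mail."""
    counts = Counter(w for w in stemmed_words if w in vocabulary)
    return {vocabulary[w]: c for w, c in counts.items()}

vocab = {"buy": 0, "cheap": 1, "meeting": 2, "viagra": 3}
mail = "buy cheap cheap viagra now".split()   # stand-in for TreeTagger output
print(to_sparse_vector(mail, vocab))          # {0: 1, 1: 2, 3: 1}
```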

  11. Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

  12. Naïve Bayes • Given a bag of words (x1, x2, x3, …, xn), Naïve Bayes is powerful for document classification.
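
A minimal bag-of-words Naïve Bayes sketch with scikit-learn on toy data; the project's own implementation is not shown on the slide, so this is only an illustration of the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap viagra buy now", "meeting agenda attached", "win cash now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (toy data)

vec = CountVectorizer()
X = vec.fit_transform(texts)              # bag-of-words counts
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["cheap cash now"])))   # -> [1] (spam)
```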

  13. Vector Space Model Create a word-document (mail) matrix with SRILM. For every pair of mails (columns), a similarity value can be calculated.
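
A NumPy sketch of the same idea without SRILM: a toy word-document matrix whose columns are mails, and cosine similarity between two columns; the matrix values are illustrative.

```python
import numpy as np

# Toy word-document matrix: rows = words, columns = mails (term frequencies).
M = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 0]], dtype=float)

def cosine(a, b):
    """Cosine similarity between two mail vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between mail 0 and mail 2 (columns of M).
print(cosine(M[:, 0], M[:, 2]))
```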

  14. KNN (Vector Space Model) With K = 1, the KNN classification model shows the best accuracy.
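
A sketch of the K = 1 nearest-neighbour classifier over such mail vectors, using scikit-learn with cosine distance; the vectors, labels, and metric choice are assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[2, 0, 1], [0, 3, 0], [1, 0, 1], [0, 2, 1]]   # toy mail vectors
y_train = [1, 0, 1, 0]                                   # 1 = spam, 0 = ham

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, y_train)
print(knn.predict([[1, 0, 0]]))   # the single nearest neighbour decides the label
```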

  15. Maximum Entropy • Maximize the entropy and minimize the Kullback-Leibler distance between the model and the real distribution. • The elements of the word-document matrix are reduced to binary values {0, 1}.
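
Maximum-entropy classification over binary word indicators coincides with logistic regression on {0, 1} features, so the sketch below uses scikit-learn's LogisticRegression in place of the maxent toolkit cited in [5]; the toy data is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["cheap viagra now", "project meeting notes", "free cash prize", "see you at lunch"]
labels = [1, 0, 1, 0]

vec = CountVectorizer(binary=True)            # word-document entries clipped to {0, 1}
X = vec.fit_transform(texts)
maxent = LogisticRegression().fit(X, labels)  # maximum-entropy / logistic model
print(maxent.predict(vec.transform(["free viagra now"])))
```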

  16. SVM • Binary: select the binary value {0, 1} to represent whether a word appears or not • Normalized: count the occurrences of each word and divide by the maximum occurrence count
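
A sketch of the two feature encodings fed to the SVM, using a linear SVM from scikit-learn rather than SVMlight [7]; whether the normalization divides by the per-mail or per-word maximum is not stated on the slide, so the per-mail maximum below is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [4, 1, 0],
                   [0, 0, 2]], dtype=float)   # toy word counts per mail (rows = mails)
labels = [1, 0, 1, 0]

binary = (counts > 0).astype(float)                      # {0, 1}: word appears or not
normalized = counts / counts.max(axis=1, keepdims=True)  # assumption: per-mail maximum

print(LinearSVC().fit(binary, labels).predict(binary))
print(LinearSVC().fit(normalized, labels).predict(normalized))
```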

  17. Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

  18. Single-Layered-Perceptron Hybrid Model The accuracy of the NN-based hybrid model is always the highest.
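
A sketch of the single-layered-perceptron hybrid: each base model's spam score becomes one input feature of a perceptron that learns the final decision; the scores and labels are toy values, not the project's results.

```python
from sklearn.linear_model import Perceptron

# Each row: [NB score, KNN vote, MaxEnt score, SVM score] for one mail (toy values).
model_scores = [[0.9, 1, 0.8, 0.7],
                [0.1, 0, 0.2, 0.3],
                [0.8, 1, 0.9, 0.6],
                [0.2, 0, 0.1, 0.4]]
labels = [1, 0, 1, 0]

hybrid = Perceptron().fit(model_scores, labels)  # single-layer perceptron over model outputs
print(hybrid.predict([[0.7, 1, 0.6, 0.8]]))      # combined spam/ham decision
```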

  19. Committee-Based Hybrid Model • The voting model averages the classification results, improving the filter slightly; however, voting can sometimes reduce accuracy because of misjudgments by the majority. • KNN + Naïve Bayes + Maximum Entropy • Naïve Bayes + Maximum Entropy + SVM
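
A sketch of the first committee (KNN + Naïve Bayes + Maximum Entropy) as a hard majority vote, using scikit-learn's VotingClassifier on toy data; this mirrors the voting idea but is not the project's own code.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

texts = ["cheap viagra now", "meeting at noon", "win free cash", "report attached"]
labels = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(texts)

committee = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=1)),
                ("nb", MultinomialNB()),
                ("maxent", LogisticRegression())],
    voting="hard")                    # majority vote over the three models
print(committee.fit(X, labels).predict(X))
```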

  20. Outline • Experiment setting • Feature extraction • Model training • Hybrid-Model • Conclusion • Reference

  21. Conclusion • 7 features were shown to discriminate mail type • Transmitted time & receiver count • Attachment, image, and URL • Non-alphanumeric characters & punctuation marks • 5 popular machine learning models proved suitable for spam filtering • Naïve Bayes, KNN, SVM • 2 model-combination methods were tested • Committee-based & single neural network

  22. Reference • [1] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Proc. AAAI 1998, Jul. 1998. • [2] P. Graham, "A Plan for Spam": http://www.paulgraham.com/spam.html • [3] Enron corpus: http://www.aueb.gr/users/ion/ • [4] TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html • [5] Maximum Entropy toolkit: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html • [6] SRILM: http://www.speech.sri.com/projects/srilm/ • [7] SVMlight: http://svmlight.joachims.org/
