
Spam Detection Kingsley Okeke Nimrat Virk

Spam emails, also known as junk emails, are unwanted messages sent to large numbers of recipients. They make it harder to recognise legitimate emails and can also be a threat to computer security. Everyone hates spam!


Presentation Transcript


  1. Spam Detection. Kingsley Okeke, Nimrat Virk

  2. Spam emails, also known as junk emails, are unwanted messages sent to large numbers of recipients. • They make it harder to recognise legitimate emails. • They can also be a threat to computer security. • Everyone hates spam!

  3. But how do we filter spam out from normal emails?

  4. What is Text Mining? • "Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output." (Wikipedia)

  5. Applications • Marketing: used to improve predictive analytics models for customers, e.g. open-ended questions in surveys • Online media: used by large media companies to give users a better search experience • Academic: publishers with large databases use text mining for easy information retrieval

  6. Using text mining, we can analyse patterns that are common in spam emails in order to distinguish them from ham (legitimate) emails.

  7. Steps: 1) Get some training data • A large collection of spam and normal emails • E.g. the SpamAssassin public corpus (http://www.spamassassin.org/publiccorpus/)

  8. Steps: 2) Data pre-processing a) Stop words: very common words that carry little meaning are removed, e.g. for, when, to, a, be • Domain-specific stop words are removed too, e.g. email, send
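
Stop-word removal can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the two word lists contain only the examples from the slide, and the function name `remove_stop_words` and the simple alphabetic tokenizer are assumptions made for the example.

```python
import re

# Illustrative lists only: these are the examples given on the slide.
GENERAL_STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}
ALL_STOP_WORDS = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS


def remove_stop_words(text, stop_words=ALL_STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


print(remove_stop_words("Send this email to a friend for free prizes"))
# ['this', 'friend', 'free', 'prizes']
```

A real filter would use a much longer general list and a domain list tuned to the email corpus at hand.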

  9. Steps: b) Stemming: reducing words to their stem/root by removing suffixes • E.g. discussed, discussing → discuss • Porter stemming algorithm • One of the most widely used stemming algorithms • Developed by Martin Porter http://www.tartarus.org/~martin/PorterStemmer/
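
To make the idea of stemming concrete, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the function name `naive_stem` and its three rules are assumptions chosen only to reproduce the slide's example.

```python
def naive_stem(word):
    """Strip a few common English suffixes. NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        # Keep at least three characters so short words like "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # Drop a plural "s", but leave a double "ss" (as in "discuss") alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


print([naive_stem(w) for w in ("discussed", "discussing", "discuss", "emails")])
# ['discuss', 'discuss', 'discuss', 'email']
```

Rules this crude mis-stem many words, which is why an established implementation such as the Porter stemmer linked above is normally used instead.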

  10. Steps: c) Feature Selection: what are good and bad features? • Good features: co-occur with a particular category; do not co-occur with other categories • Bad features: uniform across all categories; occur very infrequently

  11. Steps: Information Gain • A common feature selection technique used in machine learning applications • The information gain of a term t measures how much knowing whether t is present reduces the uncertainty about a document's category
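
The slide's formula is not preserved in this transcript. A commonly used definition for text categorisation is IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where H is the Shannon entropy of the category distribution. The Python sketch below computes it under that assumption; the function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(term, documents, labels):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    documents: list of token sets; labels: parallel list of class names.
    """
    with_term = [lab for doc, lab in zip(documents, labels) if term in doc]
    without_term = [lab for doc, lab in zip(documents, labels) if term not in doc]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy, made-up data: two spam and two ham documents.
docs = [{"free", "prize"}, {"free", "offer"}, {"meeting", "notes"}, {"lunch", "meeting"}]
labels = ["spam", "spam", "ham", "ham"]

print(information_gain("free", docs, labels))   # 1.0: "free" separates the classes perfectly
print(information_gain("prize", docs, labels))  # lower: "prize" appears in only one document
```

Terms would then be ranked by this score and only the top-scoring ones kept as features.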

  12. Steps: Feature Representation

  13. Steps: TF (Term Frequency) • Definition: TF = t(i,j), the frequency of term i in document j • Purpose: gives more weight to words that occur frequently in a document • TF-IDF (Term Frequency - Inverse Document Frequency) • The weight of term i in document j • Definition: TF×IDF = t(i,j) × log(N/n_i) • n_i: number of documents containing term i • N: total number of documents
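
The TF×IDF definition above translates almost directly into Python. In this sketch the natural logarithm is an assumption (the slide does not state a base), and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using t(i,j) * log(N / n_i)."""
    n_docs = len(documents)  # N: total number of documents
    # n_i: number of documents that contain term i at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # t(i,j): raw count of term i in document j
        weights.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in term_freq.items()}
        )
    return weights


# Toy, made-up documents, already tokenised.
docs = [["free", "prize", "free"], ["meeting", "notes"], ["free", "meeting"]]
for weights in tf_idf(docs):
    print(weights)
```

Note that a term appearing in every document gets weight 0, since log(N/N) = 0, which is how the IDF factor suppresses words that do not help to tell documents apart.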

  14. Steps: d) Text Classification • Done with WEKA, a machine learning toolkit • Training data is used to build a classification model • The model is built from the pre-processed data
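
The slide does not say which classifier was used in WEKA. As a stand-in, the sketch below trains a small multinomial Naive Bayes model with add-one smoothing in plain Python. It does not use WEKA, and the class name `NaiveBayes`, its methods, and the training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in, not WEKA)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        for doc, label in zip(documents, labels):
            self.term_counts[label].update(doc)
        self.vocabulary = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for term in doc:
                if term in self.vocabulary:  # ignore terms never seen in training
                    score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data, already pre-processed into tokens.
train_docs = [
    ["free", "prize", "win"],
    ["win", "free", "offer"],
    ["meeting", "notes"],
    ["lunch", "meeting", "today"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "offer", "today"]))  # spam
print(model.predict(["meeting", "today"]))        # ham
```

In practice the model would be trained on a corpus such as the SpamAssassin collection and evaluated on held-out emails rather than on a handful of hand-written examples.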

  15. END
