UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai

Machine Learning Basics with Applications to Email Spam Detection



UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai. Machine Learning Basics with Applications to Email Spam Detection. General background information about the process of machine learning. The process of email detection. Motivation of this project.


Presentation Transcript


UGR Project - Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai

Machine Learning Basics with Applications to Email Spam Detection


General background information about the process of machine learning


The process of email detection

  • Motivation of this project

  • Pre-processing of data

  • Classifier Models

    • Evaluation of classifiers


Motivation of this project

  • Spam email is a nuisance in every personal email account

    • 60% of January 2004 emails were spam

    • Fraud & Phishing

  • Spam vs. Ham email


Our Goal


Spam Email example


Ham Email example


The process of email detection

  • Motivation of this project

  • Pre-processing of data

  • Classifier Models

    • Evaluation of classifiers


Pre-processing of data

  • Convert capital letters to lowercase

  • Remove numbers and extra white space

  • Remove punctuation

  • Remove stop-words

  • Delete terms longer than 20 characters
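
The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all assumptions, since the slides do not say which tools or stop-word list were used.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; the slides do not name the list used.
STOP_WORDS = {"a", "an", "and", "are", "at", "is", "of", "the", "to", "you", "your"}
MAX_TERM_LENGTH = 20  # threshold taken from the slide

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                       # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = text.translate(PUNCT_TO_SPACE)     # remove punctuation
    terms = text.split()                      # split() also collapses extra white space
    terms = [t for t in terms if t not in STOP_WORDS]          # remove stop-words
    return [t for t in terms if len(t) <= MAX_TERM_LENGTH]     # delete terms longer than 20


sample = "WIN $1000 NOW!!!   Claim your   prize at the supercalifragilisticexpialidocious-site"
print(preprocess(sample))  # ['win', 'now', 'claim', 'prize', 'site']
```

Replacing punctuation with a space rather than deleting it keeps hyphenated or slash-joined words from fusing into one long term; that is a design choice of this sketch, not something stated on the slide.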


Pre-processing of data

  • Original Email


Pre-processing of data

  • After pre-processing


Pre-processing of data

  • Extract Terms


Pre-processing of data

  • Reduce Terms

    • Keep only words with length < 20
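
As a rough sketch of what extracting and reducing terms might look like, the Python below builds a term list from already-tokenized emails and turns each email into a vector of term counts. The helper names `build_term_list` and `term_counts` are hypothetical, and the count-vector representation is an assumption; the slides only show the resulting term lists.

```python
from collections import Counter

MAX_TERM_LENGTH = 20  # the slide keeps words with length < 20


def build_term_list(documents):
    """Sorted list of distinct terms shorter than MAX_TERM_LENGTH."""
    vocabulary = set()
    for terms in documents:
        vocabulary.update(t for t in terms if len(t) < MAX_TERM_LENGTH)
    return sorted(vocabulary)


def term_counts(terms, term_list):
    """Count vector of one document, ordered like term_list."""
    counts = Counter(terms)
    return [counts[t] for t in term_list]


# Each document is a list of terms that has already been pre-processed.
docs = [["win", "prize", "win"], ["meeting", "notes", "prize"]]
term_list = build_term_list(docs)
matrix = [term_counts(d, term_list) for d in docs]
print(term_list)  # ['meeting', 'notes', 'prize', 'win']
print(matrix)     # [[0, 0, 1, 2], [1, 1, 1, 0]]
```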


The process of email detection

  • Motivation of this project

  • Pre-processing of data

  • Classifier Models

    • Evaluation of classifiers


Different classification methods

  • K Nearest Neighbor (KNN)

  • Naive Bayes Classifier

  • Logistic Regression

  • Decision Tree Analysis


What is K Nearest Neighbor

  • Use the k "closest" samples (nearest neighbors) to perform classification
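
A minimal sketch of the idea, assuming each email is represented as a vector of term counts and that "closest" means smallest Euclidean distance (the slides do not state which distance was used). The function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy term-count vectors; the columns could stand for terms such as "free", "meeting", "prize".
train_vectors = [[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]]
train_labels = ["spam", "spam", "ham", "ham"]
print(knn_predict(train_vectors, train_labels, [2, 0, 1], k=3))  # spam
```

With a large, sparse term list most emails end up roughly equally far apart, which is one plausible reason a distance-based vote struggles on this kind of data.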


What is K Nearest Neighbor


Initial outcome and strategies for improvement

  • KNN accuracy was ~64%, which is very low

  • KNN classifier does not fit our project

  • Term-list is still too large

  • Try a different classification method and see whether its evaluation results are better than the KNN results

  • Continue to reduce the size of the term list by removing terms that are not meaningful


Steps for improvement

  • Removed sparse terms

  • Reduced length threshold

  • Created hashtable

  • Used alternative classifier

    • Naive Bayes Classifier
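
One common way to remove sparsity is to drop terms that occur in only a small fraction of the emails. The sketch below assumes that interpretation; the function name `remove_sparse_terms`, the `min_doc_fraction` parameter, and the 0.5 cut-off are hypothetical and are not taken from the slides.

```python
def remove_sparse_terms(term_list, matrix, min_doc_fraction=0.5):
    """Drop terms that occur in fewer than min_doc_fraction of the documents.

    matrix[i][j] is the count of term_list[j] in document i.
    """
    n_docs = len(matrix)
    keep = [
        j for j in range(len(term_list))
        if sum(1 for row in matrix if row[j] > 0) / n_docs >= min_doc_fraction
    ]
    reduced_terms = [term_list[j] for j in keep]
    reduced_matrix = [[row[j] for j in keep] for row in matrix]
    return reduced_terms, reduced_matrix


terms = ["free", "meeting", "prize", "zzz"]
counts = [[1, 0, 2, 0],
          [1, 1, 0, 0],
          [0, 1, 1, 0],
          [2, 0, 1, 1]]
reduced_terms, reduced_counts = remove_sparse_terms(terms, counts)
print(reduced_terms)   # ['free', 'meeting', 'prize']
print(reduced_counts)  # [[1, 0, 2], [1, 1, 0], [0, 1, 1], [2, 0, 1]]
```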


Hashtable

  • Calculate a hash key for each term in the term-list.

  • When a collision occurs, resolve it with separate chaining
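
To make the two bullets concrete, here is a small hash table with separate chaining, written as a sketch. The class name `ChainedHashTable`, the polynomial hash function, and the bucket count are assumptions; the slides do not describe the hash function the project used.

```python
class ChainedHashTable:
    """Hash table mapping term -> value, resolving collisions by separate chaining."""

    def __init__(self, n_buckets=8):
        # Each bucket is a chain (list) of (term, value) pairs.
        self.buckets = [[] for _ in range(n_buckets)]

    def _hash_key(self, term):
        # Simple polynomial string hash (an assumption, not the project's function).
        key = 0
        for ch in term:
            key = (key * 31 + ord(ch)) % len(self.buckets)
        return key

    def insert(self, term, value):
        chain = self.buckets[self._hash_key(term)]
        for i, (existing, _) in enumerate(chain):
            if existing == term:
                chain[i] = (term, value)   # term already stored: overwrite its value
                return
        chain.append((term, value))        # new term, possibly colliding: extend the chain

    def lookup(self, term):
        for existing, value in self.buckets[self._hash_key(term)]:
            if existing == term:
                return value
        return None


table = ChainedHashTable(n_buckets=2)  # only two buckets, so three terms must collide
for index, term in enumerate(["free", "prize", "meeting"]):
    table.insert(term, index)
print(table.lookup("prize"))   # 1
print(table.lookup("absent"))  # None
```

Looking a term up this way takes roughly constant time on average, instead of a scan over the whole term list.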


Naive Bayes classifier
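
The slide content for this classifier did not survive in the transcript, so the following is only a generic sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, not necessarily the variant the project used. The function names and the toy training emails are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, label in zip(documents, labels):
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return class_counts, term_counts, vocabulary


def predict_naive_bayes(model, terms):
    """Return the class with the highest log posterior for a list of terms."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)              # log prior P(class)
        total_terms = sum(term_counts[label].values())
        for t in terms:
            if t not in vocabulary:
                continue                                   # ignore terms never seen in training
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the product.
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [["win", "prize", "free"], ["free", "prize", "now"],
              ["meeting", "notes", "today"], ["project", "meeting", "notes"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_docs, train_labels)
print(predict_naive_bayes(model, ["free", "prize"]))     # spam
print(predict_naive_bayes(model, ["meeting", "notes"]))  # ham
```

Working with log probabilities avoids numeric underflow when an email contains many terms.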


Secondary Results

  • Accuracy increased from 62% to 82.36%


Suggestions for further improvement

  • Revise pre-processing

  • Apply additional classifiers


Thank you

  • Questions?

