UGR Project: Haoyu Li, Brittany Edwards, Wei Zhang, under Xiaoxiao Xu and Arye Nehorai
Machine Learning Basics with Applications to Email Spam Detection

Topics: general background information about the process of machine learning; the process of email detection; motivation of this project.

The process of email detection
  • Motivation of this project
  • Pre-processing of data
  • Classifier Models
  • Evaluation of classifiers
Motivation of this project
  • Spam email is a nuisance for every personal email account
    • 60% of January 2004 emails were spam
    • Fraud & Phishing
  • Spam vs. Ham email
Pre-processing of data
  • Convert capital letters to lowercase
  • Remove numbers and extra white space
  • Remove punctuation
  • Remove stop-words
  • Delete terms with length greater than 20. 
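The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the sample email are all assumptions made for the example, and a real stop-word list would be much longer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "you", "for"}
MAX_TERM_LENGTH = 20  # terms longer than this are deleted

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                                  # capitals -> lowercase
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(PUNCT_TO_SPACE)                # remove punctuation
    terms = text.split()                                 # split; drops extra white space
    terms = [t for t in terms if t not in STOP_WORDS]    # remove stop-words
    return [t for t in terms if len(t) <= MAX_TERM_LENGTH]

example = "WIN a FREE prize!!!  Call 555-1234 now, and claim the reward."
print(preprocess(example))
```

Splitting on white space after the substitutions takes care of the extra spaces left behind by removed numbers and punctuation, so no separate white-space pass is needed.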
Pre-processing of data
  • Original Email
Pre-processing of data
  • After pre-processing
Pre-processing of data
  • Reduce Terms
    • Keep word length <20
Different classification methods
  • K Nearest Neighbor (KNN)
  • Naive Bayes Classifier
  • Logistic Regression
  • Decision Tree Analysis
What is K Nearest Neighbor
  • Use the k "closest" samples (nearest neighbors) to perform classification
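As a rough sketch of the idea, the snippet below classifies a query email by majority vote among its k nearest training samples. The slides do not say which distance measure or value of k was used, so Euclidean distance over term-count vectors and k = 3 are assumptions, and the toy vectors are made up for illustration.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy term-count vectors over the hypothetical terms ["free", "prize", "meeting"].
train_vectors = [[3, 2, 0], [2, 1, 0], [0, 0, 2], [0, 1, 3], [1, 0, 2]]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

print(knn_predict(train_vectors, train_labels, [2, 2, 0], k=3))
print(knn_predict(train_vectors, train_labels, [0, 0, 3], k=3))
```

With two classes, an odd k avoids tied votes.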
Initial outcome and strategies for improvement
  • KNN accuracy was ~64% - very low
  • KNN classifier does not fit our project 
  • Term-list is still too large 
  • Try a different classification method and see whether its evaluation results are better than the KNN results
  • Continue to reduce size of term list by removing terms that are not meaningful
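One common way to shrink a term list is to drop terms that appear in only a small share of the emails. The sketch below does this with a document-frequency threshold. Whether this matches the project's notion of "not meaningful" is an assumption, as are the function name `remove_sparse_terms`, the 50% threshold, and the toy documents.

```python
from collections import Counter

def remove_sparse_terms(documents, min_doc_fraction=0.5):
    """Keep only terms that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    min_docs = min_doc_fraction * len(documents)
    kept = {term for term, n in doc_freq.items() if n >= min_docs}
    return sorted(kept)

# Each document is the list of terms left after pre-processing (toy data).
documents = [
    ["free", "prize", "claim"],
    ["free", "meeting", "agenda"],
    ["free", "prize", "xqzt"],
    ["meeting", "notes"],
]
print(remove_sparse_terms(documents, min_doc_fraction=0.5))
```

A smaller term list also means shorter feature vectors, which reduces the work each classifier has to do per email.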
Steps for improvement
  • Removed sparse terms
  • Reduced the length threshold
  • Created a hashtable
  • Used an alternative classifier
    • Naive Bayes classifier
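For readers unfamiliar with the Naive Bayes classifier, here is a minimal sketch. The slides do not say which variant was used, so this assumes a multinomial model over term counts with add-one (Laplace) smoothing. The function names and training data are hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(documents, labels):
    """documents: lists of terms; labels: one class label per document."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for terms, label in zip(documents, labels):
        term_counts[label].update(terms)
    vocabulary = {t for terms in documents for t in terms}
    return class_counts, term_counts, vocabulary

def predict_naive_bayes(model, terms):
    """Return the class with the highest log-posterior for the given terms."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # log prior
        total_terms = sum(term_counts[label].values())
        for t in terms:
            if t not in vocabulary:
                continue                                # skip unseen terms
            # Add-one smoothing keeps unseen (term, class) pairs from zeroing the score.
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

documents = [
    ["free", "prize", "claim"],
    ["free", "offer", "prize"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(documents, labels)
print(predict_naive_bayes(model, ["free", "prize", "now"]))
print(predict_naive_bayes(model, ["meeting", "notes"]))
```

Working with log-probabilities rather than raw products avoids numeric underflow when an email contains many terms.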
Hashtable

  • Calculate a hash key for each term in the term list.
  • When a collision occurs, use separate chaining
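The two points above can be illustrated with a small hashtable in which each bucket holds a list (chain) of entries, so terms that hash to the same bucket are stored side by side. The class name, the polynomial hash function, and the use of the table to map terms to column indices are assumptions for this sketch, since the slides do not give those details.

```python
class ChainedHashTable:
    """Hashtable that resolves collisions by separate chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash_key(self, term):
        # Simple polynomial string hash, reduced to a bucket index.
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def put(self, term, value):
        chain = self.buckets[self._hash_key(term)]
        for i, (existing, _) in enumerate(chain):
            if existing == term:
                chain[i] = (term, value)   # overwrite an existing entry
                return
        chain.append((term, value))        # new term, or collision: extend the chain

    def get(self, term, default=None):
        for existing, value in self.buckets[self._hash_key(term)]:
            if existing == term:
                return value
        return default

# Only two buckets, so three terms are guaranteed to collide somewhere.
table = ChainedHashTable(num_buckets=2)
for index, term in enumerate(["free", "prize", "claim"]):
    table.put(term, index)

print(table.get("prize"))
print([len(chain) for chain in table.buckets])
```

In practice, Python's built-in `dict` already provides hashed lookup; the hand-written version only serves to show how chaining handles collisions.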
Secondary Results
  • Accuracy increased from 62% to 82.36%
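The figure reported here presumably refers to classification accuracy, that is, the fraction of test emails whose predicted label matches the true label. Under that assumption, it can be computed as in the sketch below. The labels shown are made-up toy data, not the project's results.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Toy labels for five test emails (illustrative only).
actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]
print(f"{accuracy(predicted, actual):.2%}")
```

Accuracy alone can be misleading when one class dominates, so a spam filter is often also judged by how many ham emails it wrongly flags as spam.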
Suggestions for further improvement
  • Revise pre-processing
  • Apply additional classifiers
Thank you
  • Questions?