Bayesian Spam Filter

By Joshua Spaulding

Statement of Problem

“Spam email now accounts for more than half of all messages sent and imposes huge productivity costs…By 2007, Spam-stopping should grow to a $2.4 Billion Business.”

Technology Review 8/03

Objective

Using Bayes’ rule I will attempt to classify an email message as spam or non-spam (ham). I will use a corpus of spam and ham to determine the probability that a new email is spam given the tokens in the message.

Definition of Spam

Unsolicited automated email

Bayes’ Rule

P(A|B) = P(B|A)P(A) / P(B)

P(A|B) is the conditional probability that event A occurs given that event B has occurred;

P(B|A) is the conditional probability of event B occurring given that event A has occurred;

P(A) is the probability of event A occurring;

P(B) is the probability of event B occurring.

Bayes’ Rule

P(spam|token) = P(token|spam)P(spam) / P(token)

P(spam|token) – probability that an email is spam given that it contains the token

P(token|spam) – probability that the token appears in an email given that the email is spam

P(spam) – prior probability that an email is spam

P(token) – probability that the token appears in an email
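
The formula above can be tried out with a few lines of Java. This is a minimal sketch, not the project's actual code: the class and method names are hypothetical, and it assumes P(token) is expanded with the law of total probability, P(token) = P(token|spam)P(spam) + P(token|ham)P(ham), which the slides do not spell out.

```java
final class TokenPosterior {
    /**
     * Bayes' rule for a single token.
     *
     * @param pTokenGivenSpam fraction of spam messages that contain the token
     * @param pTokenGivenHam  fraction of ham messages that contain the token
     * @param pSpam           prior probability that a message is spam
     * @return P(spam|token)
     */
    static double spamGivenToken(double pTokenGivenSpam, double pTokenGivenHam, double pSpam) {
        double pHam = 1.0 - pSpam;
        // Assumption of this sketch: P(token) via the law of total probability.
        double pToken = pTokenGivenSpam * pSpam + pTokenGivenHam * pHam;
        if (pToken == 0.0) {
            // Token never observed in either class: no evidence either way.
            return 0.5;
        }
        return pTokenGivenSpam * pSpam / pToken;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: token in 40% of spam, 10% of ham, equal priors.
        System.out.println(spamGivenToken(0.4, 0.1, 0.5));
    }
}
```

With these made-up inputs the posterior comes out at roughly 0.8, so a token that is four times as common in spam as in ham pushes the estimate well above the 0.5 prior.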

Project Design (orig)
  • Read in a large text file containing 1,000 spam messages.
  • Read in a large text file containing 1,000 ham messages.
  • Create a file for each corpus listing each token and its number of occurrences in that corpus.
  • Create another file listing each token and the probability that an email containing it is spam, computed with Bayes’ rule.
  • When an email arrives, parse it into tokens and look up, for each token, the probability that the email is spam given that token. Then combine these probabilities to determine the probability that the email is spam.
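
The steps in this original design can be sketched in Java roughly as follows. Several things here are assumptions rather than details from the slides: the class and method names are hypothetical, corpora are passed in as in-memory lists instead of files, a token is counted once per message, both corpora are assumed non-empty, priors are taken as equal, and the per-token probabilities are combined with the product formula commonly used in naive Bayes spam filters (the slides do not say which combination rule was intended).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SpamPipelineSketch {
    /** Counts, for each token, how many messages in the corpus contain it. */
    static Map<String, Integer> countTokens(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            Set<String> seen = new HashSet<>();
            for (String token : message.toLowerCase().split("\\s+")) {
                // Skip the empty string produced by leading whitespace,
                // and count each token at most once per message.
                if (!token.isEmpty() && seen.add(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Builds token -> P(spam|token), assuming P(spam) = P(ham) = 0.5. */
    static Map<String, Double> buildTable(List<String> spam, List<String> ham) {
        Map<String, Integer> spamCounts = countTokens(spam);
        Map<String, Integer> hamCounts = countTokens(ham);
        Set<String> vocabulary = new HashSet<>(spamCounts.keySet());
        vocabulary.addAll(hamCounts.keySet());

        Map<String, Double> table = new HashMap<>();
        for (String token : vocabulary) {
            double pTokenGivenSpam = spamCounts.getOrDefault(token, 0) / (double) spam.size();
            double pTokenGivenHam = hamCounts.getOrDefault(token, 0) / (double) ham.size();
            // With equal priors, the 0.5 factors cancel out of Bayes' rule.
            table.put(token, pTokenGivenSpam / (pTokenGivenSpam + pTokenGivenHam));
        }
        return table;
    }

    /** Combines per-token probabilities, treating tokens as independent. */
    static double combine(List<Double> probabilities) {
        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : probabilities) {
            // Clamp so that a single 0.0 or 1.0 cannot decide the result alone.
            double clamped = Math.min(0.99, Math.max(0.01, p));
            spamProduct *= clamped;
            hamProduct *= 1.0 - clamped;
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

Multiplying many small probabilities can underflow on long messages, so a fuller version would probably sum logarithms instead; the direct product is kept here only because it mirrors the formula most closely.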
Project Design
  • Create a Narl model from 100 spam and 100 ham messages stored in two separate CSV files, using Narl’s built-in Excel Model function. (emailCorpus.narl)
  • Parse the body slot from emailCorpus.narl, create word nodes, and calculate the probability for each. (kb.narl)
  • Examine the incoming text body, tokenize it, and create nodeNames. If a nodeName is already in the kb, look up its probability. Otherwise, assign a probability value of “0.5”.
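
The lookup in the last step might look like the sketch below. It does not use Narl: a plain java.util.Map stands in for the kb, and the class name, method name, and the lowercasing of tokens are all assumptions made for illustration. Only the StringTokenizer and the default value of 0.5 come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

final class IncomingLookup {
    /** Probability assigned to tokens that are not in the knowledge base. */
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.5;

    /** Returns one probability per token in the body, in order of appearance. */
    static List<Double> lookUp(String body, Map<String, Double> knowledgeBase) {
        List<Double> probabilities = new ArrayList<>();
        // The default StringTokenizer delimiters are space, tab, newline,
        // carriage return, and form feed.
        StringTokenizer tokenizer = new StringTokenizer(body);
        while (tokenizer.hasMoreTokens()) {
            // Lowercasing is an assumption; the slides do not mention case handling.
            String nodeName = tokenizer.nextToken().toLowerCase();
            probabilities.add(knowledgeBase.getOrDefault(nodeName, UNKNOWN_TOKEN_PROBABILITY));
        }
        return probabilities;
    }
}
```

Using 0.5 for unknown tokens means they neither raise nor lower the estimate, which matches treating an unseen word as carrying no evidence.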
Issues
  • Text is unknown and often incomplete.
  • Java data structures
    • Vector, StringTokenizer, floating-point operations
  • Unfamiliar with Narl
Enhancements
  • Read slots other than body.
  • Read data in from another format. Gain more knowledge about the email.
  • Better error handling.
  • Read emails as they enter the mail server.
  • Regular expression matching in place of StringTokenizer.
  • Performance tuning with more data.
  • Take advantage of Narl functionality??
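
As one illustration of the regular expression enhancement, tokenization could be done with java.util.regex, which strips punctuation that a whitespace-only StringTokenizer leaves attached to words. The class name and the token pattern below are hypothetical choices for this sketch, not something specified in the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTokenizer {
    // Assumed token definition: runs of letters, digits, '$', apostrophes, and hyphens.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9$'-]+");

    /** Returns the lowercased tokens found in the body, in order of appearance. */
    static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(body);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase());
        }
        return tokens;
    }
}
```

Under this pattern, "Free!!! money,now" yields the tokens free, money, and now, whereas splitting on whitespace alone would produce "Free!!!" and "money,now".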