
Bayesian Spam Filters


reevess

Presentation Transcript


  1. Bayesian Spam Filters • Key Concepts • Conditional Probability • Independence • Bayes' Theorem

  2. Spam or Ham? FROM: Terry Delaney [removed] TO: (removed) Subject: FDA approved on-line pharmacies! click here (removed) Chose your product and site below: Canadian pharmacy (removed) - Cialis Soft Tabs - $5.78, Viagra Professional - $4.07, Soma - $1.38, Human Growth Hormone - $43.37, Meridia - $3.32, Tramadol - $2.17, Levitra - $11.97.

  3. Quick Reminders • Conditional Probability: Events E, F with $P(F) > 0$: $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$ • Independence: E and F are independent if and only if $P(E \cap F) = P(E)\,P(F)$

  4. Bayes' Theorem: A Quick Proof

  5. Proof cont.
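
A standard derivation, sketched here with events E and F as in the reminders and assuming $P(E) > 0$ and $0 < P(F) < 1$ (these conditions are assumptions of this sketch), goes through the definition of conditional probability and then splits $P(E)$ over $F$ and its complement $F^C$:

```latex
\begin{align*}
P(F \mid E) &= \frac{P(E \cap F)}{P(E)} = \frac{P(E \mid F)\,P(F)}{P(E)} \\
P(E) &= P(E \cap F) + P(E \cap F^C) = P(E \mid F)\,P(F) + P(E \mid F^C)\,P(F^C) \\
P(F \mid E) &= \frac{P(E \mid F)\,P(F)}{P(E \mid F)\,P(F) + P(E \mid F^C)\,P(F^C)}
\end{align*}
```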

  6. Applying Bayes' Theorem • Let our sample space be the set of emails. • Let S be the event that a message is spam; hence $S^C$ is the event that a message is not spam. • Let E be the event that a message contains a word w.

  7. Estimations

  8. Estimation Continued
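
One way to write the estimate, consistent with the worked examples in this deck: apply Bayes' theorem to S and E, assume $P(S) = P(S^C) = 1/2$, and replace the unknown conditional probabilities with training-set frequencies. The names $p(w)$, $q(w)$ and $r(w)$ match the examples; the counting definitions are inferred from how those examples compute them.

```latex
\begin{align*}
P(S \mid E) &= \frac{P(E \mid S)\,P(S)}{P(E \mid S)\,P(S) + P(E \mid S^C)\,P(S^C)}
             = \frac{P(E \mid S)}{P(E \mid S) + P(E \mid S^C)}
             \quad \text{when } P(S) = P(S^C) \\
p(w) &= \frac{\text{number of spam messages containing } w}{\text{number of spam messages}} \approx P(E \mid S) \\
q(w) &= \frac{\text{number of non-spam messages containing } w}{\text{number of non-spam messages}} \approx P(E \mid S^C) \\
r(w) &= \frac{p(w)}{p(w) + q(w)} \approx P(S \mid E)
\end{align*}
```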

  9. Spam based on single words? • Probabilities based on single words: Bad Idea • False positives AND false negatives aplenty • Calculate based on n words, assuming the events $E_i \mid S$ (and likewise $E_i \mid S^C$) are independent, and that $P(S) = P(S^C)$.

  10. Final Approximation
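
Under the independence and equal-prior assumptions, the n-word estimate takes the following form, which is the general case of the two-word formula used in the later example. Here $w_1, \dots, w_n$ are the words found in the message, and $p$ and $q$ are the per-word frequencies in spam and non-spam training messages:

```latex
\[
r(w_1, \dots, w_n) =
  \frac{\prod_{i=1}^{n} p(w_i)}
       {\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} q(w_i)}
\]
```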

  11. How do we use this? • The user must train the filter on messages in his/her inbox to estimate the probabilities. • The program or user must define a threshold probability. • If the estimated probability that a message is spam exceeds the threshold, the message is considered spam.
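
The train-then-threshold workflow might be sketched in Python as below. This is a minimal illustration, not a production filter: the function names (`train`, `spam_probability`, `is_spam`), the whitespace tokenizer, the handling of unseen words, and the toy messages are all assumptions made for the sketch.

```python
from collections import Counter

def train(spam_messages, ham_messages):
    """Count, per word, how many messages of each class contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    for msg in spam_messages:
        spam_counts.update(set(msg.lower().split()))  # count each word once per message
    for msg in ham_messages:
        ham_counts.update(set(msg.lower().split()))
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)

def spam_probability(message, model):
    """Estimate P(spam | words), assuming independent words and P(S) = P(S^C)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    p_product, q_product = 1.0, 1.0
    for word in set(message.lower().split()):
        p = spam_counts[word] / n_spam  # fraction of spam messages containing word
        q = ham_counts[word] / n_ham    # fraction of non-spam messages containing word
        if p == 0 and q == 0:
            continue  # assumption: a word never seen in training carries no evidence
        p_product *= p
        q_product *= q
    denom = p_product + q_product
    # assumption: fall back to 0.5 when the evidence is contradictory (both products 0)
    return p_product / denom if denom > 0 else 0.5

def is_spam(message, model, threshold=0.9):
    """Flag the message when the estimate exceeds the threshold."""
    return spam_probability(message, model) > threshold

# Toy training data, invented for illustration only.
model = train(
    ["buy viagra now", "cheap viagra"],
    ["meeting at noon", "lunch now"],
)
print(is_spam("cheap viagra", model))
print(is_spam("meeting at noon", model))
```

A real filter would also smooth zero counts, since a single word with q = 0 drives the estimate to 1 regardless of the other words.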

  12. Example • Suppose the filter has the following data • Threshold Probability: .9 • “Viagra” occurs in 250 of 2000 spam messages • “Viagra” occurs in only 5 of 1000 non-spam messages • Let’s try to estimate the probability, using the process we just defined

  13. Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 250 / 2000 = 0.125 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 5 / 1000 = 0.005

  14. Example Cont. • Since we are assuming that it is equally likely that an incoming message is or is not spam, we can estimate the probability with this equation: • $r(\text{Viagra}) = \frac{p(\text{Viagra})}{p(\text{Viagra}) + q(\text{Viagra})}$

  15. Example Cont. • $r(\text{Viagra}) = \frac{0.125}{0.125 + 0.005} = \frac{0.125}{0.130} \approx 0.962$ • Since r(Viagra) is greater than the threshold of 0.9, we can reject this message as spam.
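
The single-word arithmetic can be checked with a few lines of Python. The counts and threshold are the ones given in the example; the variable names are illustrative.

```python
# Counts from the example: "Viagra" in 250 of 2000 spam and 5 of 1000 non-spam messages.
spam_with_word, spam_total = 250, 2000
ham_with_word, ham_total = 5, 1000
threshold = 0.9

p = spam_with_word / spam_total  # 0.125, fraction of spam containing the word
q = ham_with_word / ham_total    # 0.005, fraction of non-spam containing the word
r = p / (p + q)                  # estimate of P(spam | word), assuming P(S) = P(S^C)

print(f"{r:.3f}")     # 0.962
print(r > threshold)  # True
```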

  16. Harder Stuff • Single-word detection can lead to a lot of false positives and false negatives. • To counter this, most spam filters look for the presence of multiple words.

  17. Another Example • 2000 Spam messages; 1000 real messages • “Viagra” appears in 400 spam messages • “Viagra” appears in 60 real messages • “Cialis” appears in 200 spam and 25 real messages • Threshold Probability: .9 • Let’s calculate the probability that it’s spam.

  18. Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 400 / 2000 = 0.2 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 60 / 1000 = 0.06

  19. Example Cont. • Step 3: Estimate the probability that a spam message contains the word “Cialis”. • p(Cialis) = 200 / 2000 = 0.1 • Step 4: Estimate the probability that a non-spam message contains the word “Cialis”. • q(Cialis) = 25 / 1000 = 0.025

  20. Example Cont. • Using our approximation, we have: • $r(\text{Viagra}, \text{Cialis}) = \frac{p(\text{Viagra}) \cdot p(\text{Cialis})}{p(\text{Viagra}) \cdot p(\text{Cialis}) + q(\text{Viagra}) \cdot q(\text{Cialis})}$

  21. Example Cont. • $r(\text{Viagra}, \text{Cialis}) = \frac{(0.2)(0.1)}{(0.2)(0.1) + (0.06)(0.025)} \approx 0.930$ • Since this exceeds the threshold probability of 0.9, the message is rejected as spam.
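
The two-word calculation generalizes to any number of words by multiplying the per-word frequencies. A short sketch, using the counts from this example; `combined_estimate` is a hypothetical helper name chosen here:

```python
from math import prod

def combined_estimate(p_values, q_values):
    """Multi-word estimate, assuming independent words and P(S) = P(S^C)."""
    numerator = prod(p_values)
    return numerator / (numerator + prod(q_values))

# "Viagra": 400 of 2000 spam, 60 of 1000 real; "Cialis": 200 of 2000 spam, 25 of 1000 real.
p_values = [400 / 2000, 200 / 2000]  # 0.2, 0.1
q_values = [60 / 1000, 25 / 1000]    # 0.06, 0.025
r = combined_estimate(p_values, q_values)

print(f"{r:.3f}")  # 0.930
print(r > 0.9)     # True
```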

  22. Questions?
