Bayesian Spam Filters

Bayesian Spam Filters • Key Concepts • Conditional Probability • Independence • Bayes' Theorem

Spam or Ham? FROM: Terry Delaney [removed] TO: (removed) Subject: FDA approved on-line pharmacies! click here (removed) Chose your product and site below: Canadian pharmacy (removed) - Cialis Soft Tabs - $5.78, Viagra Professional - $4.07, Soma - $1.38, Human Growth Hormone - $43.37, Meridia - $3.32, Tramadol - $2.17, Levitra - $11.97.

Quick Reminders • Conditional Probability: For events E, F with P(F) > 0, P(E | F) = P(E ∩ F) / P(F) • Independence: E and F are independent if and only if P(E ∩ F) = P(E) P(F)

Bayes' Theorem: A Quick Proof

Proof cont.
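One standard way to carry out the proof, sketched here assuming P(E) > 0 and 0 < P(F) < 1 (the slides' own steps may differ), uses only the definition of conditional probability and the fact that E splits into the part inside F and the part inside its complement F^C:

```latex
\begin{align*}
P(F \mid E) &= \frac{P(E \cap F)}{P(E)}, \qquad
P(E \mid F) = \frac{P(E \cap F)}{P(F)} \\[4pt]
\Rightarrow\quad P(E \cap F) &= P(E \mid F)\,P(F) \\[4pt]
P(E) &= P(E \cap F) + P(E \cap F^{C})
      = P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C}) \\[4pt]
\Rightarrow\quad P(F \mid E) &=
  \frac{P(E \mid F)\,P(F)}{P(E \mid F)\,P(F) + P(E \mid F^{C})\,P(F^{C})}
\end{align*}
```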

Applying Bayes' Theorem • Let our sample space be the set of emails. • Let S be the event that a message is spam; hence S^C is the event that a message is not spam. • Let E be the event that a message contains a word w.

Estimations

Estimation Continued
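A sketch of how the estimates usually enter, assuming the notation p(w), q(w), and r(w) used in the worked examples and the equal-prior assumption P(S) = P(S^C) = 1/2. The counts are taken from the messages the filter was trained on:

```latex
\begin{align*}
P(S \mid E) &= \frac{P(E \mid S)\,P(S)}
                   {P(E \mid S)\,P(S) + P(E \mid S^{C})\,P(S^{C})} \\[4pt]
p(w) &= \frac{\#\{\text{spam messages containing } w\}}{\#\{\text{spam messages}\}}
       \approx P(E \mid S) \\[4pt]
q(w) &= \frac{\#\{\text{non-spam messages containing } w\}}{\#\{\text{non-spam messages}\}}
       \approx P(E \mid S^{C}) \\[4pt]
r(w) &= \frac{p(w)}{p(w) + q(w)} \approx P(S \mid E)
       \quad \text{when } P(S) = P(S^{C}) = \tfrac{1}{2}
\end{align*}
```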

Spam based on single words? • Probabilities based on single words: Bad Idea • False positives AND false negatives aplenty • Calculate based on n words instead, where Ei is the event that the message contains word wi, assuming the events Ei | S (and likewise Ei | S^C) are independent and that P(S) = P(S^C).

Final Approximation
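Under the independence and equal-prior assumptions, the n-word estimate takes the following form. This is a reconstruction consistent with the two-word example later in the deck, where p(w_i) and q(w_i) denote the fractions of spam and non-spam training messages containing word w_i:

```latex
\[
r(w_1, \dots, w_n) =
  \frac{\prod_{i=1}^{n} p(w_i)}
       {\prod_{i=1}^{n} p(w_i) + \prod_{i=1}^{n} q(w_i)}
\]
```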

How do we use this? • The user must train the filter on messages in his/her inbox to estimate the probabilities • The program or user must define a threshold probability • If the estimated probability that a message is spam is greater than this threshold, the message is considered spam.
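The training and threshold steps can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the function names (`train`, `spam_probability`, `is_spam`) are hypothetical, words are split on whitespace only, both training sets are assumed non-empty, and spam and non-spam are assumed equally likely.

```python
from collections import Counter


def train(spam_messages, ham_messages):
    """Count, for each word, how many messages of each class contain it."""
    spam_counts = Counter()
    ham_counts = Counter()
    for text in spam_messages:
        # set() so a word is counted once per message, not once per occurrence
        spam_counts.update(set(text.lower().split()))
    for text in ham_messages:
        ham_counts.update(set(text.lower().split()))
    return spam_counts, ham_counts, len(spam_messages), len(ham_messages)


def spam_probability(word, model):
    """Estimate r(word) = p / (p + q); returns None for a word never seen."""
    spam_counts, ham_counts, n_spam, n_ham = model
    p = spam_counts[word] / n_spam  # fraction of spam messages with the word
    q = ham_counts[word] / n_ham    # fraction of non-spam messages with the word
    if p + q == 0:
        return None
    return p / (p + q)


def is_spam(word, model, threshold=0.9):
    """Flag a message containing `word` when the estimate exceeds the threshold."""
    r = spam_probability(word, model)
    return r is not None and r > threshold


# Tiny made-up training set, for illustration only.
model = train(
    ["buy viagra now", "cheap viagra", "hello friend"],
    ["hello friend", "meeting at noon"],
)
print(is_spam("viagra", model))
print(is_spam("hello", model))
```

Returning `None` for unseen words is one possible design choice; real filters often apply smoothing instead.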

Example • Suppose the filter has the following data: • Threshold Probability: 0.9 • “Viagra” occurs in 250 of 2000 spam messages • “Viagra” occurs in only 5 of 1000 non-spam messages • Let’s try to estimate the probability, using the process we just defined

Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 250 / 2000 = 0.125 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 5 / 1000 = 0.005

Example Cont. • Since we are assuming that it is equally likely that an incoming message is or is not spam, we can estimate the probability with this equation: • r(Viagra) = p(Viagra) / (p(Viagra) + q(Viagra))

Example Cont. • r(Viagra) = 0.125 / (0.125 + 0.005) = 0.125 / 0.130 ≈ 0.962 • Since r(Viagra) is greater than the threshold of 0.9, we can reject this message as spam.
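The arithmetic of this single-word example can be checked with a short script. The function name `single_word_spam_estimate` is hypothetical, and the calculation assumes spam and non-spam are equally likely, as the example does.

```python
def single_word_spam_estimate(spam_with_word, spam_total, ham_with_word, ham_total):
    """Return r = p / (p + q) for one word, assuming equal priors."""
    p = spam_with_word / spam_total  # fraction of spam messages with the word
    q = ham_with_word / ham_total    # fraction of non-spam messages with the word
    return p / (p + q)


# Counts from the example: 250 of 2000 spam, 5 of 1000 non-spam.
r = single_word_spam_estimate(250, 2000, 5, 1000)
print(f"{r:.3f}")  # prints 0.962
print(r > 0.9)     # prints True
```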

Harder Stuff • Single-word detection can lead to a lot of false positives and false negatives. • To counter this, most spam filters look for the presence of multiple words.

Another Example • 2000 spam messages; 1000 real messages • “Viagra” appears in 400 spam messages • “Viagra” appears in 60 real messages • “Cialis” appears in 200 spam and 25 real messages • Threshold Probability: 0.9 • Let’s calculate the probability that a message containing both words is spam.

Example Cont. • Step 1: Estimate the probability that a spam message contains the word “Viagra”. • p(Viagra) = 400 / 2000 = 0.2 • Step 2: Estimate the probability that a non-spam message contains the word “Viagra”. • q(Viagra) = 60 / 1000 = 0.06

Example Cont. • Step 3: Estimate the probability that a spam message contains the word “Cialis”. • p(Cialis) = 200 / 2000 = 0.1 • Step 4: Estimate the probability that a non-spam message contains the word “Cialis”. • q(Cialis) = 25 / 1000 = 0.025

Example Cont. • Using our approximation, we have: • r(Viagra, Cialis) = (p(Viagra) * p(Cialis)) / (p(Viagra) * p(Cialis) + q(Viagra) * q(Cialis))

Example Cont. • r(Viagra, Cialis) = (0.2)(0.1) / ((0.2)(0.1) + (0.06)(0.025)) = 0.02 / 0.0215 ≈ 0.930 • Since this is greater than the threshold probability of 0.9, the message will be rejected as spam.
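The multi-word estimate generalizes to any number of words by multiplying the per-word fractions. A minimal sketch follows, with the hypothetical helper `multi_word_spam_estimate`, the same equal-prior and independence assumptions as the example, and counts taken from the data above (60 of 1000 real messages gives q(Viagra) = 0.06).

```python
from math import prod


def multi_word_spam_estimate(word_counts, spam_total, ham_total):
    """word_counts: list of (spam_with_word, ham_with_word) pairs, one per word.

    Returns prod(p_i) / (prod(p_i) + prod(q_i)), assuming equal priors and
    independence of the words within each class.
    """
    p_product = prod(s / spam_total for s, _ in word_counts)
    q_product = prod(h / ham_total for _, h in word_counts)
    return p_product / (p_product + q_product)


# "Viagra": 400 spam / 60 real; "Cialis": 200 spam / 25 real.
r = multi_word_spam_estimate([(400, 60), (200, 25)], spam_total=2000, ham_total=1000)
print(f"{r:.3f}")  # prints 0.930
print(r > 0.9)     # prints True
```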

Questions?
