MICS
Improving Digest-Based Collaborative Spam Detection

Slavisa Sarafijanovic

Sabrina Perez

Jean-Yves Le Boudec

EPFL, Switzerland

MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.

Talk content
  • Digest-based filtering: global-picture overview
  • Understanding how digests work: the "Open Digest" paper [1]
  • (Very positive results and conclusions, widely cited and referred to!)
  • Understanding it better: our re-evaluation of the "Open Digest" paper results
  • (Different conclusions!)
  • Our alternative digests: results improve a lot, and we understand why
  • Understanding the "why" => further improvements possible
  • (Negative selection)
  • Conclusions

[1] "An Open Digest-based Technique for Spam Detection", E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.

Two main collaborative spam detection approaches

1) White-listing using social networks: users form a graph of trust relationships (example: the PGP graph of certificates).

2) Bulk-content detection using digests: users share digests of recently received emails and look for repeats (examples: DCC, Vipul's Razor, Commtouch).

Implementations (in both cases): centralized or decentralized, open or proprietary.

This talk (paper): the digest approach for bulk-content detection.

A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)

[Figure: a spammer sends in bulk to ~n * millions of mail users, served by ~n * 10,000 mail servers (MS); each mail server sends Query = digest to one of ~250 DCC servers and receives Reply = counter (n=3).]

  • Strengths/drawbacks:
    • fast response
    • not precise (false-positive problems)
    • limited obfuscation resistance
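The query/reply exchange above can be sketched as a toy in-memory clearinghouse. This is a minimal sketch under assumptions: the class name, the counter logic, and the `BULK_THRESHOLD` cutoff are illustrative, not DCC's actual protocol or policy.

```python
from collections import defaultdict


class Clearinghouse:
    """Toy DCC-style server: a mail server reports the digest of each
    incoming message; the reply is a counter of how many times that
    digest has been reported globally (high counter => bulk mail)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def report(self, digest_hex: str) -> int:
        # record one more sighting of this digest and return the total
        self._counts[digest_hex] += 1
        return self._counts[digest_hex]


BULK_THRESHOLD = 20  # hypothetical cutoff; real DCC policy is configurable


def is_bulk(server: Clearinghouse, digest_hex: str) -> bool:
    # a message is flagged as bulk once its digest has been reported
    # by "too many" recipients
    return server.report(digest_hex) >= BULK_THRESHOLD
```

The fast response comes from this single counter lookup; the imprecision and limited obfuscation resistance come from how the digest itself is computed, which the next slides examine.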

Reproducible evaluation of digest efficiency: the "Open Digest" paper

Producing Digests: Nilsimsa similarity hashing, as explained in the OD-paper

[Figure: an email of L characters ("Cheapest vac... Best Regards, John") is scanned with an N=5 character sliding window; each window position yields 8 trigrams (e.g. "Che", "Cha", "hea" from the window "Cheap"), each hashed (30^3 -> 2^8) to one of 256 accumulator positions, which is incremented (+1); after L-N+1 steps the accumulator is thresholded into the digest.]

  • The digest is a binary string of 256 bits.
  • Definition: the Nilsimsa Compare Value (NCV) between two digests is the number of bits at corresponding positions that are equal, minus 128.
  • Identical emails => NCV = 128; unrelated emails => NCV close to 0.

More similar emails => more similar digests => higher NCV.
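The digest construction and the NCV comparison above can be sketched as follows. This is a simplified stand-in, not the real Nilsimsa implementation: real Nilsimsa uses a fixed "tran53" lookup table and 8 specific trigram patterns per window, whereas this sketch hashes every 3-character combination in the window through MD5 and thresholds buckets at their mean count.

```python
import hashlib
import itertools


def bucket(trigram: bytes) -> int:
    # stand-in for Nilsimsa's trigram hash ("30^3 -> 2^8"): map a
    # trigram to one of the 256 accumulator positions
    return hashlib.md5(trigram).digest()[0]


def digest(text: str, window: int = 5) -> list[int]:
    # slide an N=5 character window over the email; for each position,
    # hash the trigrams inside the window into a 256-bucket accumulator
    data = text.encode()
    acc = [0] * 256
    total = 0
    for i in range(len(data) - window + 1):  # L - N + 1 steps
        for tri in itertools.combinations(data[i:i + window], 3):
            acc[bucket(bytes(tri))] += 1
            total += 1
    # threshold each bucket at the mean count: the 256-bit digest
    mean = total / 256
    return [1 if c > mean else 0 for c in acc]


def ncv(d1: list[int], d2: list[int]) -> int:
    # Nilsimsa Compare Value: number of equal bit positions, minus 128
    return sum(a == b for a, b in zip(d1, d2)) - 128
```

With this construction, two copies of the same email always give NCV = 128, and the more text two emails share, the more correlated their accumulators (and hence their digest bits) become.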

"Open Digest" paper experiments and results

  • Evaluation <= experiment:
  • spam bulk detection <= detection of similarity between two emails from the same spam bulk
  • ham miss-detection <= miss-detection of similarity between unrelated emails

Bulk detection experiment (repeated many times, to gather statistics): select an email at random from a spam corpus, obfuscate two copies of it, compute their digests (e.g. 010110...10 and 011010...11), and compare them against Threshold = 54, yielding an NCV value (integer) and a matching indicator (0/1).

  • The OD-paper only evaluates (reports) the average NCV.

OD-paper result for the "adding random text" obfuscation, and its conclusion: average NCV > threshold => bulk detection is resistant to strong obfuscation by the spammer.

"Open Digest" paper experiments and results (cont.)

Ham miss-detection experiment: for each pair of unrelated emails from a ham-and-spam corpus, compute the two digests and compare them against Threshold = 54, yielding NCV values (integer) and matching indicators (0/1).

  • OD-paper result:
  • n1 ~ 2500, n2 ~ 2500 emails
  • no matching (miss-detection) case is observed

  • OD-paper conclusion:
  • miss-detection of good emails must be very low
  • approximating the miss-detection probability with a Binomial distribution supports the observed result

Extending OD-paper experiments: spam bulk detection

Bulk detection experiment, identical to the one in the OD-paper, but we test higher obfuscation ratios (repeated many times, to gather statistics): select a spam email at random, obfuscate two copies of it, compute the digests, and compare them against Threshold = 54.

  • The OD-paper result is well recovered (blue dotted line).

OD-paper conclusion does not hold!

Even a slightly higher obfuscation ratio brings the average NCV below the threshold.

Understanding better what happens

"Compare X to Database" (generic experiment): X is selected at random EITHER from Ham Corpus 1/2 (ham to filter) OR from the Spam Corpus (obfuscation 1). Its digest is compared to each entry of a database DB of n1 spam digests (Spam Corpus, obfuscation 2) and n2 ham digests (Ham Corpus 2/2), representing "previous digest queries", against Threshold = 54, yielding NCV values (integer) and matching indicators (0/1).

We look at more metrics:

  • probability of email-to-email matching
  • average of Max(NCV)
  • NCV histogram

SPAM - DB experiment results:

The mean Max(NCV) value is not informative.

The effect of obfuscation changes gracefully: a spammer may gain by additional obfuscation.

SPAM - DB, NCV histograms: effect of obfuscation

Small obfuscation: the digests are still useful for bulk detection.

SPAM - DB, NCV histograms: effect of obfuscation (cont.)

Stronger obfuscation: most of the digests are rendered useless!

HAM - DB experiment results:

The mean Max(NCV) value is not informative.

The miss-detection probability is still too high for practical use.

HAM - DB, NCV histograms: effect of obfuscation

Spam obfuscation does not impact miss-detection of good emails.

The shifted and wide histograms explain the high false positives.

Alternative digests

Sampling strings: fixed length, random positions. Each email yields several digests (e.g. 011010...11, 101110...11, 001010...10).

Email-to-email matching: the maximum NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases).
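The sampling-and-max-matching scheme can be sketched as below. The trigram-count digest is a compact stand-in for Nilsimsa (hedged: not the real implementation), and the sample count `k`, sample length, and fixed seed are illustrative assumptions chosen so the sketch is reproducible.

```python
import hashlib
import random


def digest(s: str) -> list[int]:
    # compact Nilsimsa-like digest: count hashed trigrams into 256
    # buckets, set a bit where the count exceeds the mean (sketch)
    acc = [0] * 256
    data = s.encode()
    for i in range(len(data) - 2):
        acc[hashlib.md5(data[i:i + 3]).digest()[0]] += 1
    mean = sum(acc) / 256
    return [1 if c > mean else 0 for c in acc]


def ncv(d1: list[int], d2: list[int]) -> int:
    # equal bit positions minus 128, as before
    return sum(a == b for a, b in zip(d1, d2)) - 128


def sampled_digests(email: str, k: int = 4, length: int = 60,
                    seed: int = 0) -> list[list[int]]:
    # alternative digests: k strings of fixed length, sampled at
    # random positions of the email
    rng = random.Random(seed)
    hi = max(1, len(email) - length)
    return [digest(email[p:p + length])
            for p in (rng.randrange(hi) for _ in range(k))]


def match_score(email_a: str, email_b: str) -> int:
    # email-to-email matching: max NCV over all pairs of digests,
    # i.e. how similar the most similar sampled parts are
    da, db = sampled_digests(email_a), sampled_digests(email_b)
    return max(ncv(x, y) for x in da for y in db)
```

Because each digest covers only a short sampled string, appended random text dilutes at most some of the samples: as long as one sampled part of each copy falls inside the shared spammy content, the max NCV stays high.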

SPAM - DB experiment results (alt. digests)

Spam bulk detection is no longer vulnerable to obfuscation...

SPAM - DB (alt. digests): effect of obfuscation

... and we can see why that is.

HAM - DB experiment results (alt. digests):
  • miss-detection probability is still too high

HAM - DB (alt. digests): effect of obfuscation

What can be done to decrease ham miss-detection?

Alternative digests open new possibilities

[Figure: the digest(s) of a new email first go through negative selection against a database of good digests; only the digests that do not match are then compared to the collaborative database of digests (DB). That second part is the same as without negative selection.]
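The negative-selection step can be sketched as a filter over an email's digests, here represented as 256-bit integers for compactness. This is a hedged sketch: the function names are assumptions, and the threshold of 54 is taken from the experiments in this talk.

```python
def ncv(a: int, b: int) -> int:
    # digests as 256-bit integers; NCV = equal bit positions - 128
    return 256 - bin(a ^ b).count("1") - 128


def negative_selection(new_digests: list[int],
                       good_digests: list[int],
                       threshold: int = 54) -> list[int]:
    # keep only the digests of a new email that do NOT match any
    # digest in the local database of good (ham) digests; only these
    # survivors are then compared to the collaborative database (DB)
    return [d for d in new_digests
            if all(ncv(d, g) <= threshold for g in good_digests)]
```

Digests that resemble known-good content never reach the collaborative database, which is how negative selection attacks the ham miss-detection problem without touching the bulk-detection path.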

Conclusions
  • Use of proper metrics is crucial for drawing proper conclusions from experiments.
  • Alternative digests provide much better results, and the NCV histograms let us understand why.
  • Proper metrics are also crucial for understanding what happens...
  • ... and for understanding how to fix the problems.