
Database Techniques for fighting SPAM

Telvis Calhoun

CSc 8710 – Advanced Databases

Dr. Yingshu Li

Everybody knows about SPAM

  • Spam is unsolicited bulk email sent for profit and general mayhem.

  • Botnets = distributed networks of hijacked IPs.

  • Botnet IPs are hard to track.

  • 70 billion emails are sent per day; 70% are spam.

How Does Anti-SPAM Use DBs?

  • Spam databases collect network layer and application layer data.

  • IP Blacklisting

    • Detect a malicious host during SMTP dialog.

    • Difficult to account for DHCP-assigned IP addresses, botnet size, or good IPs used to forward spam.

  • Content Analysis

    • Detect malicious mail content.

    • Requires that MTA complete the SMTP connection.

    • Arms race between content filter designers and spammers.

Summary of DB Techniques

  • Grey Space Analysis

  • Trinity: Peer-to-Peer Database

  • Behavioral Blacklisting

  • Progressive Email Scanning

  • Content filtering using Bayesian Analysis

Grey Space Analysis

  • Characterize IP Space: Active vs. Grey Space

  • IP Flow Database

  • Detect malicious IPs by extracting dominant scanning ports (DSPs)

  • Find DSPs using relative uncertainty algorithm

Mining Technique: Relative Uncertainty

  • Determines entropy of IP ports in flows database.

  • RU := entropy of the dstPrt (destination port) distribution ÷ maximum entropy.

  • p[i] := number of flows with port[i] ÷ total flows

  • RU close to 1 indicates a nearly even distribution; RU near 0 indicates an uneven distribution dominated by a few ports.
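
The relative uncertainty measure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the maximum entropy is assumed to be the log of the number of distinct ports observed in the flows.

```python
import math
from collections import Counter

def relative_uncertainty(dst_ports):
    """RU = entropy of the destination-port distribution / maximum entropy."""
    counts = Counter(dst_ports)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # zero or one distinct port: nothing is uncertain
    # p[i] = number of flows with port i / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Assumption: maximum entropy is that of a uniform spread over observed ports.
    max_entropy = math.log2(len(counts))
    return entropy / max_entropy

even = relative_uncertainty([22, 25, 80, 443])            # one flow per port
skewed = relative_uncertainty([25] * 97 + [22, 80, 443])  # port 25 dominates
```

Here `even` comes out at the top of the scale, while `skewed` is close to 0, which is the signature of a dominant scanning port.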

Grey Space Algorithm

  • Isolate flows toward grey space

  • Find dominant scanning ports (DSPs)

  • Find outside hosts that send DSP flows toward grey and active hosts.

  • Find inside host footprint for outside hosts.

  • Classify adversary as hitter or scanner.

Focused Hitters vs Bad Scanners

  • Focused hitters tend to send tens or hundreds of flows to each grey host.

  • Bad scanners send one or a few flows to each grey host
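
A rough sketch of the hitter/scanner distinction follows. It skips the DSP step and only looks at flows per grey host. The function name, the `(src, dst)` flow format, and the threshold of 10 flows per grey host are all assumptions made for illustration; the slides only say "tens or hundreds" versus "one or a few".

```python
from collections import Counter, defaultdict

def classify_sources(flows, grey_hosts, hitter_threshold=10):
    """flows: iterable of (src_ip, dst_ip) pairs; grey_hosts: set of grey IPs."""
    per_source = defaultdict(Counter)
    for src, dst in flows:
        if dst in grey_hosts:  # keep only flows toward grey space
            per_source[src][dst] += 1
    labels = {}
    for src, per_dst in per_source.items():
        # Average number of flows sent to each grey host this source touched.
        avg = sum(per_dst.values()) / len(per_dst)
        labels[src] = "focused hitter" if avg >= hitter_threshold else "bad scanner"
    return labels

grey = {"10.0.0.5", "10.0.0.6", "10.0.0.7"}
flows = (
    [("198.51.100.1", "10.0.0.5")] * 40             # many flows to one grey host
    + [("203.0.113.9", g) for g in sorted(grey)]    # one flow to each grey host
    + [("203.0.113.9", "10.0.0.1")]                 # active host, ignored here
)
labels = classify_sources(flows, grey)
```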

Trinity: Distributed IP Reputation Database

  • Botnets send a large amount of data in a short amount of time.

  • Trinity uses a distributed in-memory hash table containing IP reputation entries.

  • Each peer has 10 to 50 megabytes of data (833K – 4.17M entries)
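
As a quick sanity check on these figures: the entry counts are consistent with roughly 12 bytes per entry, assuming 1 megabyte = 1,000,000 bytes. The 12-byte size is inferred from the numbers on the slide, not stated by the source.

```python
ENTRY_BYTES = 12  # inferred from the slide's figures, not given by the source

def entries_per_peer(megabytes, entry_bytes=ENTRY_BYTES):
    """Number of whole entries that fit in the given amount of memory."""
    return megabytes * 1_000_000 // entry_bytes

low = entries_per_peer(10)    # about 833K entries
high = entries_per_peer(50)   # about 4.17M entries
```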

Chord Distributed Hash Table

  • Distribute data over a large P2P network

    • Quickly find any given item

  • Stores key/value pairs

    • The key value controls which node(s) stores the value

    • Each node is responsible for some section of the space

  • Basic operations

    • Store(key; val)

    • val = Retrieve(key)

Chord (cont)

  • Each node chooses an n-bit ID

    • IDs are arranged in a ring

  • Each lookup key is also an n-bit ID

    • i.e., the hash of the real lookup key

    • Node IDs and keys occupy the same space!

  • Each node is responsible for storing keys “near” its ID

    • Replication is usually between the current and previous node

    • Items can be replicated at multiple successors

    • No single host holds a large fraction of the key space, which guards against DDoS.
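
The Store/Retrieve operations and the ring layout can be illustrated with a toy, single-process ring. This is a sketch only: it resolves the responsible node with a sorted list instead of Chord's finger tables, does no replication, and the class name, peer names, and 16-bit ID size are all assumptions.

```python
import bisect
import hashlib

ID_BITS = 16  # toy ID size; real deployments use far more bits

def chord_id(name, bits=ID_BITS):
    """Hash a node name or lookup key into the shared n-bit ID space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class ToyRing:
    def __init__(self, node_names):
        self.ring = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {n: {} for n in node_names}

    def owner(self, key):
        # First node whose ID is >= hash(key), wrapping around the ring.
        idx = bisect.bisect_left(self.ids, chord_id(key)) % len(self.ring)
        return self.ring[idx][1]

    def store(self, key, val):
        self.storage[self.owner(key)][key] = val

    def retrieve(self, key):
        return self.storage[self.owner(key)].get(key)

ring = ToyRing(["peer-a", "peer-b", "peer-c"])
ring.store("192.0.2.7", {"emails_last_hour": 3})
entry = ring.retrieve("192.0.2.7")
```

Because node IDs and keys are hashed into the same space, the key alone determines which peer holds the entry, so any peer can compute where to send a lookup.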

Database Updates

  • Compute the number of interval quarters since the last update. Shift and update counters accordingly.

  • Determine the site responsible for the entry and send the update over UDP. Once received by the owner site, the entry is forwarded to k peers using TCP.

  • Updates are commutative, so order doesn’t matter. Consistency is not required.

  • Even if a host goes down, the database can be rebuilt in an hour.
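
One way to read the "shift and update counters" step is as a sliding window of per-quarter counters. The sketch below assumes a one-hour interval split into four 15-minute quarters; the class name, field names, and window size are hypothetical and not taken from the Trinity paper.

```python
QUARTER_SECONDS = 15 * 60  # assumption: one-hour interval, four quarters
NUM_QUARTERS = 4

class ReputationEntry:
    def __init__(self, now):
        self.counters = [0] * NUM_QUARTERS  # counters[0] is the current quarter
        self.last_quarter = int(now // QUARTER_SECONDS)

    def update(self, now, emails=1):
        quarter = int(now // QUARTER_SECONDS)
        elapsed = max(0, quarter - self.last_quarter)
        if elapsed:
            # Shift old counts toward the end; counts older than the window drop off.
            shift = min(elapsed, NUM_QUARTERS)
            self.counters = [0] * shift + self.counters[:NUM_QUARTERS - shift]
            self.last_quarter = quarter
        self.counters[0] += emails

    def total(self):
        return sum(self.counters)

entry = ReputationEntry(now=0)
entry.update(now=10, emails=5)
entry.update(now=20, emails=2)
entry.update(now=2 * QUARTER_SECONDS + 1)   # two quarters later
snapshot = list(entry.counters)
entry.update(now=6 * QUARTER_SECONDS)       # a full interval later
```

Within a quarter the updates are plain additions, so applying them in a different order gives the same counters, which is what makes strict consistency unnecessary.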

Security

  • Secure communications for neighbors

  • Limit updates for nodes that have sent more than 100 emails in 10 minutes.

  • Falsified source IPs can cause false positives.

Clustering Technique for Behavioral Blacklisting

  • Identify spammers that attack many domains.

  • The sending pattern is the distribution and frequency of destination domains

  • Form clusters of sending patterns

  • Use clusters to ID new attacks

Spectral Clustering

  • Divide Phase – produces a tree whose leaves are elements of the set.

  • Merge Phase – Start with each leaf in its own cluster and merge going up the tree.

Vector Generation

  • Database contains: M(i,j,k)

    • Total times that IP ‘i’ sent email to domain ‘j’ in time slot ‘k’.

  • Find total flows for IP/Domain across entire time axis (M’).

  • Generate feature vector from M’

    • IP := <# flows to domain 1, # flows to domain 2, … # flows to domain j>
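
Collapsing the time axis can be sketched as follows. The nested-list layout `M[i][j][k]` and the function name are assumptions chosen for illustration; the data values are made up.

```python
def feature_vectors(M):
    """M[i][j][k] = emails from IP i to domain j in time slot k.

    Returns M', where M'[i][j] is the total over all time slots, so that
    each row of M' is the feature vector for one IP.
    """
    return [[sum(slots) for slots in per_domain] for per_domain in M]

M = [
    [[1, 0, 2], [0, 0, 0], [4, 1, 0]],  # IP 0: domains 0..2, three time slots each
    [[0, 0, 0], [3, 3, 3], [0, 0, 1]],  # IP 1
]
M_prime = feature_vectors(M)  # [[3, 0, 5], [0, 9, 1]]
```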

Clustering

  • Clusters contain IP addresses that send mail to similar sets of domains.

  • Define traffic pattern for each cluster

    • Averaging the rows (vector contents) for all IPs in the cluster.

IPxIP matrix of related spam senders

Classification

  • Input: IP vector ‘r’, a 1 x d vector

  • Use a similarity algorithm to find the closest cluster

  • Spam score is the maximum similarity of r with any cluster.
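
The cluster averaging and scoring steps might look like the sketch below. The slides do not name the similarity algorithm, so cosine similarity is used here purely as an assumption; the function names and sample vectors are also hypothetical.

```python
import math

def cluster_pattern(rows):
    """Traffic pattern of a cluster: the average of its members' feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spam_score(r, patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max(cosine(r, p) for p in patterns)

patterns = [
    cluster_pattern([[3, 0, 5], [5, 0, 3]]),  # averages to [4.0, 0.0, 4.0]
    cluster_pattern([[0, 9, 1], [0, 7, 1]]),  # averages to [0.0, 8.0, 1.0]
]
score = spam_score([2, 0, 2], patterns)  # same direction as the first pattern
```

A new IP whose vector points in the same direction as a known spamming cluster gets a score near 1 even if it has sent far fewer messages.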

Progressive Email Scanner

  • Maintains Feature Instance (FI) database

  • FI is any feature that can discriminate HAM from SPAM.

  • Dynamic Features - Use any feature that IDs mail (such as contents, network, etc.)

    • Paper only uses URL links as FIs

PEC Architecture

  • FI States

    • Grey (Ambiguous FI)

    • Black (Spam FI)

    • White (HAM FI)

  • Blacklist Module – Extracts and hashes FIs

  • Scoreboard Module – Tracks FI occurrences and timestamp (age)

Competitive Aging and Scoring System (CASS)

  • Transition between states governed by

    • Score – number of occurrences of the FI

    • Age – time since last score update.

  • When Score (R) exceeds the score threshold (S), the FI transitions from Grey to Black.

  • When Age (A) exceeds the age threshold (M), the FI transitions from Grey to White.

    • Purge
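
The state transitions can be sketched as a small state machine. The threshold values, class name, and method names are hypothetical, and Black and White are treated as terminal states here for simplicity (the purge step is not modeled).

```python
SCORE_THRESHOLD_S = 5    # hypothetical value for S
AGE_THRESHOLD_M = 3600   # hypothetical value for M, in seconds

class FeatureInstance:
    def __init__(self, now):
        self.state = "grey"
        self.score = 0          # R: number of occurrences seen
        self.last_update = now  # used to compute age A

    def expire(self, now):
        """Grey -> White when the FI has not been seen for longer than M."""
        if self.state == "grey" and now - self.last_update > AGE_THRESHOLD_M:
            self.state = "white"
        return self.state

    def observe(self, now):
        """Record one occurrence; Grey -> Black when the score exceeds S."""
        self.expire(now)
        if self.state == "grey":
            self.score += 1
            self.last_update = now
            if self.score > SCORE_THRESHOLD_S:
                self.state = "black"
        return self.state

burst = FeatureInstance(now=0)
burst_states = [burst.observe(now=t) for t in range(1, 7)]  # six quick hits

slow = FeatureInstance(now=0)
slow.observe(now=10)
slow_state = slow.expire(now=10 + AGE_THRESHOLD_M + 1)
```

A URL that shows up many times in a short window wins the race to Black, while one that goes quiet ages out to White first.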

Bayesian Content Filtering

  • Determine the probability that a message is spam based on contents

  • Use Bayesian combination of spam probabilities

Bayesian Training

  • Requires training corpus of HAM/SPAM

  • Find interesting tokens.

  • Create HAM/SPAM token tables
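
A simplified sketch of the training step is shown below. It loosely follows the per-token probability idea from "A Plan for Spam" but leaves out that essay's weighting of HAM counts and its minimum-occurrence cutoff. The tokenizer, function names, and the tiny corpus are made up for illustration.

```python
from collections import Counter

def tokenize(text):
    words = (w.strip(".,:;!?\"'").lower() for w in text.split())
    return [w for w in words if w]

def train(ham_msgs, spam_msgs):
    """Build a token -> spam probability table from two labeled corpora."""
    ham, spam = Counter(), Counter()
    for msg in ham_msgs:
        ham.update(set(tokenize(msg)))   # count messages containing the token
    for msg in spam_msgs:
        spam.update(set(tokenize(msg)))
    table = {}
    for tok in set(ham) | set(spam):
        h = ham[tok] / len(ham_msgs)     # fraction of HAM messages with tok
        s = spam[tok] / len(spam_msgs)   # fraction of SPAM messages with tok
        # Clamp so that no single token is ever treated as absolute proof.
        table[tok] = min(0.99, max(0.01, s / (h + s)))
    return table

table = train(
    ham_msgs=["Lunch meeting today", "Your prescription is ready"],
    spam_msgs=["Cheap prescription pills today", "Cheap pills"],
)
```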

Classification

Sample Message: “Just a reminder: don’t forget your allergy prescription when you visit New York City today.”

Spam Probability Table (table contents not preserved)

  • Tokenize new message

  • Calculate spam probability for each token

  • Derive overall spam probability using Bayes formula. Sample message score = 0.0

  • Non-spam tokens outweigh spam tokens to prevent false positives
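
The combination step can be sketched with the naive Bayes formula used in "A Plan for Spam". The per-token probabilities below are invented for illustration and do not reproduce the slide's spam probability table.

```python
import math

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities into one message-level probability."""
    spam = math.prod(token_probs)                  # all tokens point to SPAM
    ham = math.prod(1.0 - p for p in token_probs)  # all tokens point to HAM
    return spam / (spam + ham)

# Hypothetical probabilities for a few tokens of the sample message.
sample = {"reminder": 0.05, "prescription": 0.90, "york": 0.02, "today": 0.30}
score = combined_spam_probability(list(sample.values()))
```

Even though "prescription" looks spammy on its own, the strongly non-spam tokens pull the combined score close to 0, which is how the filter avoids a false positive here. Token probabilities should be kept strictly between 0 and 1 so the denominator cannot be zero.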

Real World Applications

Messaging Security Architecture

Summary

  • A variety of database techniques are used in Anti-Spam Technology

    • IP Blacklisting

    • Content filtering

  • Databases can contain:

    • Network traffic: IP Addresses, Domain, Ports

    • Message Content: Words, URLs, HTML Text

  • Challenges:

    • Scalability – Must handle many connections or messages

    • Minimize False Positive Rates – Avoid classifying a HAM message as SPAM.

    • Finding useful SPAM features, e.g., using machine learning techniques.

References

  • Brodsky, et al., A Distributed Content Independent Method for Spam Detection, HotBots 2007

  • Jin, et al., Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007

  • Liu, et al., High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007

  • Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007

  • Cheng, et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006

  • Graham, P., A Plan for Spam, 2002

  • Secure Computing Corporation, 2008