

Database Techniques for fighting SPAM

Telvis Calhoun

CSc 8710 – Advanced Databases

Dr. Yingshu Li

Everybody knows about SPAM
  • Spam is unsolicited bulk email sent for profit and general mayhem.
  • BOTNETs = Distributed Network of hijacked IPs.
  • IPs hard to track
  • 70 billion emails sent per day. 70% spam
How Anti-SPAM uses DBs?
  • Spam databases collect network layer and application layer data.
  • IP Blacklisting
    • Detect a malicious host during SMTP dialog.
    • Detection is difficult when IP addresses change through DHCP, when the botnet is large, or when good IPs are used to forward spam.
  • Content Analysis
    • Detect malicious mail content.
    • Requires that MTA complete the SMTP connection.
    • Arms race between content filter designers and spammers.
Summary of DB Techniques
  • Grey Space Analysis
  • Trinity: Peer-to-Peer Database
  • Behavioral Blacklisting
  • Progressive Email Scanning
  • Content filtering using Bayesian Analysis
Grey Space Analysis
  • Characterize IP Space: Active vs. Grey Space
  • IP Flow Database
  • Detect malicious IPs by extracting dominant scanning ports (DSPs)
  • Find DSPs using relative uncertainty algorithm
Mining Technique: Relative Uncertainty
  • Determines entropy of IP ports in flows database.
  • RU := entropy of the destination port (dstPrt) distribution ÷ maximum entropy.
  • p(i) := number of flows with port i ÷ total flows
  • RU close to 1 indicates a nearly even distribution; RU near 0 indicates an uneven distribution.
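
The relative uncertainty measure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `relative_uncertainty` is hypothetical, and the maximum entropy is assumed here to be the log of the number of distinct ports observed.

```python
import math
from collections import Counter

def relative_uncertainty(dst_ports):
    """Return H(X) / H_max for a list of destination ports, one per flow."""
    counts = Counter(dst_ports)
    total = len(dst_ports)
    if len(counts) < 2:
        return 0.0  # zero or one distinct port: no uncertainty to measure
    # p(i) = flows with port i / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Assumption: maximum entropy is taken over the ports actually observed.
    max_entropy = math.log2(len(counts))
    return entropy / max_entropy

even = relative_uncertainty([25, 25, 80, 80])      # flows split evenly
skewed = relative_uncertainty([25] * 99 + [80])    # one port dominates
```

A low value such as `skewed` suggests that a single port dominates the flows, which is what marks it as a candidate dominant scanning port.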
Grey Space Algorithm
  • Isolate flows toward grey space
  • Find dominant scanning ports (DSPs)
  • Find outside hosts whose DSP flows go toward grey and active hosts.
  • Find inside host footprint for outside hosts.
  • Classify adversary as hitter or scanner.
Focused Hitters vs Bad Scanners
  • Focused hitters tend to send tens or hundreds of flows to each grey host.
  • Bad scanners send one or a few flows to each grey host
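
As a rough sketch of the hitter/scanner distinction, the code below counts flows per grey host for each outside source. The function name `classify_sources`, the `(src, dst)` flow tuples, and the cutoff of 10 flows per grey host are all assumptions made for illustration; the slides only say "tens or hundreds" versus "one or a few".

```python
from collections import defaultdict

def classify_sources(flows, grey_hosts, hitter_threshold=10):
    """Label each outside source by its average number of flows per grey host."""
    per_pair = defaultdict(int)
    for src, dst in flows:
        if dst in grey_hosts:            # keep only flows toward grey space
            per_pair[(src, dst)] += 1
    totals = defaultdict(lambda: [0, 0])  # src -> [flow count, grey hosts touched]
    for (src, _dst), n in per_pair.items():
        totals[src][0] += n
        totals[src][1] += 1
    return {
        src: "hitter" if n_flows / n_hosts >= hitter_threshold else "scanner"
        for src, (n_flows, n_hosts) in totals.items()
    }

flows = [("a", "g1")] * 20 + [("b", "g1"), ("b", "g2"), ("b", "g3"), ("c", "x")]
labels = classify_sources(flows, {"g1", "g2", "g3"})
```

Here source `a` sends 20 flows to one grey host and is labeled a hitter, while `b` touches three grey hosts once each and is labeled a scanner; `c` never touches grey space and is not classified.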
Trinity: Distributed IP Reputation Database
  • Botnets send a large amount of data in a short amount of time.
  • Trinity uses distributed in-memory hash table containing IP reputation entries.
  • Each peer has 10 to 50 megabytes of data (833K – 4.17M entries)
Chord Distributed Hash Table
  • Distribute data over a large P2P network
    • Quickly find any given item
  • Stores key/value pairs
    • The key value controls which node(s) stores the value
    • Each node is responsible for some section of the space
  • Basic operations
    • Store(key; val)
    • val = Retrieve(key)
Chord (cont)
  • Each node chooses an n-bit ID
    • IDs are arranged in a ring
  • Each lookup key is also an n-bit ID
    • i.e., the hash of the real lookup key
    • Node IDs and keys occupy the same space!
  • Each node is responsible for storing keys "near" its ID
    • Replication is usually between the current and previous node
    • Items can be replicated at multiple successors
    • No single host contains a large fraction of a particular space, which guards against DDoS.
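
The ring idea can be illustrated with a small single-process sketch. It is not Chord itself: real Chord nodes route lookups through finger tables, while this sketch assumes a global view of all nodes. The names `Ring` and `ring_id`, the 16-bit ID space, and the use of SHA-1 are assumptions for the example.

```python
import bisect
import hashlib

def ring_id(name, bits=16):
    """Hash a node name or lookup key into the shared n-bit ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class Ring:
    def __init__(self, node_names, bits=16):
        self.bits = bits
        self.nodes = sorted((ring_id(n, bits), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.tables = {name: {} for name in node_names}  # per-node storage

    def owner(self, key):
        # The first node ID at or after the key ID, wrapping around the ring.
        idx = bisect.bisect_left(self.ids, ring_id(key, self.bits)) % len(self.ids)
        return self.nodes[idx][1]

    def store(self, key, val):
        self.tables[self.owner(key)][key] = val

    def retrieve(self, key):
        return self.tables[self.owner(key)].get(key)

ring = Ring(["peer-a", "peer-b", "peer-c"])
ring.store("192.0.2.7", 3)          # e.g. a reputation counter for an IP
value = ring.retrieve("192.0.2.7")
```

Because node names and keys are hashed into the same space, any peer can compute which node owns a key without consulting a central index.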
Database Updates
  • Compute the number of interval quarters since the last update. Shift and update counters accordingly.
  • Determine the site responsible for the entry and send the update over UDP. Once received by the owner site, the entry is forwarded to k peers using TCP.
  • Updates are commutative, so order doesn’t matter. Consistency is not required.
  • Even if a host goes down, the database can be rebuilt in an hour.
  • Secure communications for neighbors
  • Limit updates for nodes that have sent more than 100 emails in 10 minutes.
  • Falsified source IPs can cause false positives.
Clustering Technique for Behavioral Blacklisting
  • Identify spammers that attack many domains.
  • The sending pattern is the distribution and frequency of target domains.
  • Form clusters of sending patterns
  • Use clusters to identify new attacks
Spectral Clustering
  • Divide Phase – produces a tree whose leaves are elements of the set.
  • Merge Phase – Start with each leaf in its own cluster and merge going up the tree.
Vector Generation
  • Database contains: M(i,j,k)
    • Total times that IP ‘i’ sent email to domain ‘j’ in time slot ‘k’.
  • Find total flows for IP/Domain across entire time axis (M’).
  • Generate feature vector from M’
    • IP := <#flows to domain 1, # flows to domain 2, … #flows to domain j>
  • Clusters contain IP addresses that send mail to similar sets of domains.
  • Define traffic pattern for each cluster
    • Averaging the rows (vector contents) for all IPs in the cluster.

IPxIP matrix of related spam senders

  • Input IP vector ‘r’ := 1 x d vector
  • Use similarity algorithm to find the closest cluster
  • Spam score is the maximum similarity of r with any cluster.
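
The vector generation and scoring steps can be sketched as follows. The slides do not name the similarity algorithm, so cosine similarity is assumed here purely for illustration; the helper names and the dictionary layout of `M` are also hypothetical.

```python
import math

def feature_vector(M, ip, num_domains):
    """Sum M[(ip, domain, slot)] over all time slots to get one row of M'."""
    row = [0] * num_domains
    for (i, j, _k), count in M.items():
        if i == ip:
            row[j] += count
    return row

def cluster_pattern(rows):
    """Average the feature vectors of all IPs in a cluster, column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spam_score(r, patterns):
    """Maximum similarity of the input vector r with any cluster pattern."""
    return max(cosine(r, p) for p in patterns)

M = {("1.1.1.1", 0, 0): 2, ("1.1.1.1", 0, 1): 3, ("1.1.1.1", 2, 0): 1}
r = feature_vector(M, "1.1.1.1", 3)
patterns = [cluster_pattern([[4, 0, 1], [6, 0, 1]]), [0, 9, 0]]
score = spam_score(r, patterns)
```

In this toy data the input IP mails the same domains as the first cluster, so its score is close to 1; an IP that mailed only domain 1 would instead match the second pattern.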
Progressive Email Scanner
  • Maintains Feature Instance (FI) database
  • FI is any feature that can discriminate HAM from SPAM.
  • Dynamic Features - Use any feature that identifies mail (contents, network, etc.)
    • Paper only uses URL links as FIs
PEC Architecture
  • FI States
    • Grey (Ambiguous FI)
    • Black (Spam FI)
    • White (HAM FI)
  • Blacklist Module – Extracts and hashes FIs
  • Scoreboard Module – Tracks FI occurrences and timestamp (age)
Competitive Aging and Scoring System (CASS)
  • Transition between states governed by
    • Score – number of occurrences of the FI
    • Age – time since last score update.
  • When the score (R) exceeds the score threshold (S), the FI moves from Grey to Black.
  • When the age (A) exceeds the age threshold (M), the FI moves from Grey to White.
    • Purge
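
The state transitions can be pictured with a small scoreboard sketch. The class names `Entry` and `Scoreboard`, the numeric timestamps, and the choice to stop updating an FI once it leaves the Grey state are simplifying assumptions, not details taken from the paper; purging is omitted.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    score: int = 0
    last_update: float = 0.0
    state: str = "grey"

class Scoreboard:
    def __init__(self, score_threshold, age_threshold):
        self.S = score_threshold   # score threshold (S)
        self.M = age_threshold     # age threshold (M)
        self.entries = {}

    def observe(self, fi, now):
        """Record one occurrence of a feature instance at time `now`."""
        entry = self.entries.setdefault(fi, Entry(last_update=now))
        if entry.state == "grey":
            entry.score += 1
            entry.last_update = now
            if entry.score > self.S:      # R exceeds S: Grey to Black
                entry.state = "black"
        return entry.state

    def age_out(self, now):
        """Move grey entries whose age exceeds M to the white state."""
        for entry in self.entries.values():
            if entry.state == "grey" and now - entry.last_update > self.M:
                entry.state = "white"

board = Scoreboard(score_threshold=2, age_threshold=60)
states = [board.observe("http://example.test/a", t) for t in (0, 1, 2)]
board.observe("http://example.test/b", 0)
board.age_out(now=61)
```

The two thresholds compete: an FI that recurs quickly reaches Black before it can age out, while an FI seen only once drifts to White.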
Bayesian Content Filtering
  • Determine the probability that a message is spam based on contents
  • Use Bayesian combination of spam probabilities
Bayesian Training
  • Requires training corpus of HAM/SPAM
  • Find interesting tokens.
  • Create HAM/SPAM token tables


Sample Message: “Just a reminder: don’t forget your allergy prescription when you visit New York City today.”

[Figure: Spam Probability Table]

  • Tokenize new message
  • Calculate spam probability for each token
  • Derive overall spam probability using Bayes formula. Sample Message = 0.0
  • Non-spam tokens outweigh spam tokens to prevent false positives
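
A minimal sketch of token scoring and Bayesian combination is shown below, loosely following the approach described in Graham's "A Plan for Spam". The constants (double weight for HAM counts, clamping to 0.01 and 0.99, 0.4 for unseen tokens, the 15 most interesting tokens) and the whitespace tokenizer are assumptions for illustration, as are all function names and the toy token tables.

```python
import math

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    b = spam_counts.get(token, 0)
    g = 2 * ham_counts.get(token, 0)   # HAM counted double to limit false positives
    if b + g == 0:
        return 0.4                     # assumed probability for unseen tokens
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))

def combine(probs):
    """Bayesian combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1 - p for p in probs)
    return spam / (spam + ham)

def message_spam_prob(message, spam_counts, ham_counts, n_spam, n_ham, top=15):
    tokens = set(message.lower().split())   # naive whitespace tokenizer
    probs = [token_spam_prob(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in tokens]
    # Keep the tokens whose probability is farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:top])

spam_counts = {"viagra": 10}   # hypothetical SPAM token table
ham_counts = {"allergy": 5}    # hypothetical HAM token table
p = message_spam_prob("allergy prescription reminder", spam_counts, ham_counts, 10, 10)
```

With these toy tables the single strong HAM token pulls the combined probability close to zero, which mirrors the sample message result on the slide.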
Real World Applications

Messaging Security Architecture

  • A variety of database techniques are used in Anti-Spam Technology
    • IP Blacklisting
    • Content filtering
  • Databases can contain:
    • Network traffic: IP Addresses, Domain, Ports
    • Message Content: Words, URLs, HTML Text
  • Challenges:
    • Scalability – Must handle many connections or messages
    • Minimize False Positive Rates – Cannot classify a HAM message as SPAM.
    • Finding useful SPAM features, for example by using machine learning techniques.
References
  • Brodsky, et al, A Distributed Content Independent Method for Spam Detection, HotBots 2007
  • Jin, et al, Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007
  • Liu, et al, High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007
  • Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007
  • Cheng, et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006
  • Graham, P., A Plan for Spam, 2002
  • Secure Computing Corporation, 2008