Database Techniques for fighting SPAM Telvis Calhoun CSc 8710 – Advanced Databases Dr. Yingshu Li
Everybody knows about SPAM • Spam is unsolicited bulk email sent for profit and general mayhem. • BOTNETs = Distributed Network of hijacked IPs. • IPs hard to track • 70 billion emails sent per day. 70% spam
How Does Anti-SPAM Use DBs? • Spam databases collect network-layer and application-layer data. • IP Blacklisting • Detect a malicious host during the SMTP dialog. • Hard to keep accurate because of DHCP address changes, botnet size, and good IPs used to forward spam • Content Analysis • Detect malicious mail content. • Requires that the MTA complete the SMTP connection. • Arms race between content filter designers and spammers.
Summary of DB Techniques • Grey Space Analysis • Trinity: Peer-to-Peer Database • Behavioral Blacklisting • Progressive Email Scanning • Content filtering using Bayesian Analysis
Grey Space Analysis • Characterize IP Space: Active vs. Grey Space • IP Flow Database • Detect malicious IPs by extracting dominant scanning ports (DSPs) • Find DSPs using relative uncertainty algorithm
Mining Technique: Relative Uncertainty • Determines entropy of IP ports in flows database. • Formula := Entropy of dstPrt distribution ÷ maximum entropy. • p := number of flows with port[i] ÷ total flows • RU close to 1 shows ~even distribution, near 0 shows uneven distribution
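The relative uncertainty bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name is hypothetical, and the maximum entropy is assumed to be the log of the number of distinct ports observed.

```python
import math
from collections import Counter

def relative_uncertainty(ports):
    """Entropy of the destination-port distribution divided by the maximum entropy."""
    counts = Counter(ports)
    total = len(ports)
    if len(counts) < 2:
        return 0.0  # no flows, or a single port: nothing is uncertain
    # p_i = flows with port i / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Assumption: normalize by the log of the number of distinct observed ports.
    max_entropy = math.log2(len(counts))
    return entropy / max_entropy

even_ports = [25, 80, 443, 22]                # one flow per port
skewed_ports = [445] * 97 + [25, 80, 22]      # one port dominates
print(relative_uncertainty(even_ports))
print(relative_uncertainty(skewed_ports))
```

An even spread gives a value at 1, while a distribution dominated by one port (a candidate DSP) gives a value near 0.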
Grey Space Algorithm • Isolate flows toward grey space • Find dominant scanning ports (DSPs) • Find outside hosts with DSP flows toward grey and active hosts. • Find inside host footprint for outside hosts. • Classify adversary as hitter or scanner.
Focused Hitters vs Bad Scanners • Focused hitters tend to send tens or hundreds of flows to each grey host. • Bad scanners send one or a few flows to each grey host
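The hitter/scanner distinction can be sketched as a threshold on the average number of flows per grey host. The function name and the threshold of 10 are assumptions chosen to separate "tens or hundreds" from "one or a few"; the slides do not give a cutoff.

```python
from collections import Counter

def classify_adversary(dst_ips, grey_hosts, hitter_threshold=10.0):
    """Classify one outside host from the destination IPs of its flows.

    hitter_threshold is a hypothetical cutoff, not a value from the source.
    """
    per_host = Counter(d for d in dst_ips if d in grey_hosts)
    if not per_host:
        return "unknown"  # no flows toward grey space
    mean_flows = sum(per_host.values()) / len(per_host)
    return "focused hitter" if mean_flows >= hitter_threshold else "bad scanner"

grey = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
print(classify_adversary(["10.0.0.1", "10.0.0.2", "10.0.0.3"], grey))
print(classify_adversary(["10.0.0.1"] * 40 + ["10.0.0.2"] * 30, grey))
```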
Trinity: Distributed IP Reputation Database • Botnets send a large amount of data in a short amount of time. • Trinity uses a distributed in-memory hash table containing IP reputation entries. • Each peer has 10 to 50 megabytes of data (833K – 4.17M entries)
Chord Distributed Hash Table • Distribute data over a large P2P network • Quickly find any given item • Stores key/value pairs • The key value controls which node(s) stores the value • Each node is responsible for some section of the space • Basic operations • Store(key; val) • val = Retrieve(key)
Chord (cont) • Each node chooses an n-bit ID • IDs are arranged in a ring • Each lookup key is also an n-bit ID • i.e., the hash of the real lookup key • Node IDs and keys occupy the same space! • Each node is responsible for storing keys “near” its ID • Replication is usually between the current and previous node • Items can be replicated at multiple successors • No single host contains a large fraction of a particular space, to guard against DDoS.
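A minimal sketch of the ring idea: node names and keys are hashed into the same ID space, and each key is stored on the first node whose ID is equal to or follows the key's ID. The `Ring` class, the 16-bit ID width, and the peer names are illustrative assumptions. For brevity the lookup uses a sorted list held in one process, not Chord's distributed finger tables, and replication is omitted.

```python
import bisect
import hashlib

N_BITS = 16  # illustrative ID width; real deployments use far more bits

def chord_id(name: str) -> int:
    """Hash a node name or lookup key into the shared n-bit ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** N_BITS)

class Ring:
    def __init__(self, node_names):
        self.nodes = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.store = {name: {} for name in node_names}

    def successor(self, key_id):
        """First node whose ID is >= key_id, wrapping around the ring."""
        idx = bisect.bisect_left(self.ids, key_id) % len(self.ids)
        return self.nodes[idx][1]

    def put(self, key, val):
        self.store[self.successor(chord_id(key))][key] = val

    def get(self, key):
        return self.store[self.successor(chord_id(key))].get(key)

ring = Ring(["peer-a", "peer-b", "peer-c"])
ring.put("192.0.2.7", {"emails_last_hour": 5})
print(ring.get("192.0.2.7"))
```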
Database Updates • Compute the number of interval quarters since last update. Shift and update counters accordingly • Determine the site responsible for the entry and send it over UDP. Once received by the owner site, the entry is forwarded to k peers using TCP. • Updates are commutative, so order doesn’t matter. Consistency not required. • Even if a host goes down, the database can be rebuilt in an hour.
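The counter-shifting step might look like the sketch below. The layout is an assumption, not taken from the source: each entry is modeled as four counters, one per 15-minute quarter, with the newest quarter first.

```python
QUARTERS = 4           # assumption: four counters per entry
QUARTER_SECONDS = 900  # assumption: each counter covers 15 minutes

def update_entry(counters, last_update, now, new_emails=1):
    """Shift counters by the number of quarters elapsed, then count new emails.

    counters[0] is the current quarter; older quarters follow.
    """
    elapsed = int(now // QUARTER_SECONDS) - int(last_update // QUARTER_SECONDS)
    shift = min(max(elapsed, 0), QUARTERS)
    # Quarters that fall off the end are dropped; fresh quarters start at zero.
    shifted = [0] * shift + counters[:QUARTERS - shift]
    shifted[0] += new_emails
    return shifted, now

print(update_entry([3, 2, 1, 0], last_update=0, now=950))
```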
Security • Secure communications for neighbors • Limit updates for nodes that have sent more than 100 emails in 10 minutes. • Falsified source IPs can cause false positives.
Clustering Technique for Behavioral Blacklisting • Identify spammers that attack many domains. • Domain distribution and frequency is the sending pattern • Form clusters of sending patterns • Use clusters to ID new attack
Spectral Clustering • Divide Phase – produces a tree whose leaves are elements of the set. • Merge Phase – Start with each leaf in its own cluster and merge going up the tree.
Vector Generation • Database contains: M(i,j,k) • Total times that IP ‘i' sent email to domain ‘j’ in time slot ‘k’. • Find total flows for IP/Domain across entire time axis (M’). • Generate feature vector from M’ • IP := <#flows to domain 1, # flows to domain 2, … #flows to domain j>
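The vector generation step can be sketched as follows: collapse M(i,j,k) over the time axis to get M', then read off one row per IP. The record layout and function name are assumptions for illustration.

```python
from collections import defaultdict

def feature_vectors(records, domains):
    """Build one feature vector per IP from (ip, domain, time_slot, count) records."""
    totals = defaultdict(lambda: defaultdict(int))
    for ip, domain, _slot, count in records:
        totals[ip][domain] += count  # M'(i,j) = sum over k of M(i,j,k)
    # Each vector lists flows to domain 1, domain 2, ... in a fixed order.
    return {ip: [per_domain.get(d, 0) for d in domains]
            for ip, per_domain in totals.items()}

records = [
    ("198.51.100.1", "a.example", 0, 2),
    ("198.51.100.1", "a.example", 1, 3),
    ("198.51.100.1", "b.example", 0, 1),
    ("203.0.113.9", "b.example", 2, 4),
]
print(feature_vectors(records, ["a.example", "b.example"]))
```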
Clustering • Clusters contain IP addresses that send mail to similar sets of domains. • Define the traffic pattern for each cluster by averaging the rows (vector contents) for all IPs in the cluster. • IP×IP matrix of related spam senders
Classification • Input IP vector ‘r’ := 1 x d vector • Use similarity algorithm to find closest cluster • Spam score is the maximum similarity of r with any cluster.
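A sketch of the scoring step, assuming cosine similarity as the similarity measure (the slides do not name one) and cluster patterns computed by averaging member rows. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; an assumed stand-in for the unnamed similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_average(rows):
    """Traffic pattern of a cluster: column-wise average of its IP vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def spam_score(r, cluster_patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max(cosine(r, c) for c in cluster_patterns)

patterns = [
    cluster_average([[4, 0, 0], [6, 0, 0]]),
    cluster_average([[0, 1, 1], [0, 3, 3]]),
]
print(spam_score([2, 0, 0], patterns))
```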
Progressive Email Scanner • Maintains Feature Instance (FI) database • FI is any feature that can discriminate HAM from SPAM. • Dynamic Features - Use any feature that IDs mail (such as contents, network, etc.) • Paper only uses URL links as FIs
PEC Architecture • FI States • Grey (Ambiguous FI) • Black (Spam FI) • White (HAM FI) • Blacklist Module – Extracts and hashes FIs • Scoreboard Module – Tracks FI occurrences and timestamp (age)
Competitive Aging and Scoring System (CASS) • Transitions between states are governed by • Score – number of occurrences of the FI • Age – time since last score update. • When the score (R) exceeds the score threshold (S), the FI moves from Grey to Black. • When the age (A) exceeds the age threshold (M), the FI moves from Grey to White. • Purge
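The two Grey transitions can be sketched as a small state machine. The class and function names are hypothetical, and the purge step is left out because the slide does not define it.

```python
from dataclasses import dataclass

@dataclass
class FeatureInstance:
    score: int = 0            # R: number of occurrences seen
    last_update: float = 0.0  # time of the last score update
    state: str = "grey"

def observe(fi, now, score_threshold):
    """Count one occurrence; Grey moves to Black once R exceeds S."""
    fi.score += 1
    fi.last_update = now
    if fi.state == "grey" and fi.score > score_threshold:
        fi.state = "black"

def age_out(fi, now, age_threshold):
    """Grey moves to White once the age A exceeds M."""
    if fi.state == "grey" and now - fi.last_update > age_threshold:
        fi.state = "white"

busy = FeatureInstance()
for t in range(3):
    observe(busy, now=t, score_threshold=2)
quiet = FeatureInstance()
observe(quiet, now=0, score_threshold=2)
age_out(quiet, now=100, age_threshold=60)
print(busy.state, quiet.state)
```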
Bayesian Content Filtering • Determine the probability that a message is spam based on contents • Use Bayesian combination of spam probabilities
Bayesian Training • Requires training corpus of HAM/SPAM • Find interesting tokens. • Create HAM/SPAM token tables
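A sketch of building the token tables and deriving a per-token spam probability. It is loosely modeled on the formula in Graham's "A Plan for Spam": the doubling of ham counts and the 0.01/0.99 clamps are taken from that essay as assumptions, and the whitespace tokenizer and tiny corpus are purely illustrative.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()  # simplistic tokenizer, for illustration only

def train(ham_messages, spam_messages):
    """Return (ham, spam) token count tables."""
    ham = Counter(t for m in ham_messages for t in tokenize(m))
    spam = Counter(t for m in spam_messages for t in tokenize(m))
    return ham, spam

def token_spam_probability(token, ham, spam, n_ham, n_spam):
    g = 2 * ham[token]  # assumption: ham counts doubled to bias against false positives
    b = spam[token]
    if g + b == 0:
        return None     # token never seen in training
    ham_freq = min(1.0, g / n_ham)
    spam_freq = min(1.0, b / n_spam)
    return max(0.01, min(0.99, spam_freq / (ham_freq + spam_freq)))

ham_msgs = ["see you at lunch", "lunch at noon"]
spam_msgs = ["cheap pills now", "cheap offer now"]
ham, spam = train(ham_msgs, spam_msgs)
print(token_spam_probability("cheap", ham, spam, len(ham_msgs), len(spam_msgs)))
print(token_spam_probability("lunch", ham, spam, len(ham_msgs), len(spam_msgs)))
```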
Classification • Sample message: “Hi, Just a reminder: don’t forget your allergy prescription when you visit New York City today. Mom” • [Figure: Spam Probability Table] • Tokenize new message • Calculate spam probability for each token • Derive overall spam probability using Bayes formula. • Sample message score = 0.0 • Non-spam tokens outweigh spam tokens to prevent false positives
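The combination step can be sketched with the naive Bayes form used in Graham's "A Plan for Spam". The token probabilities below are made-up illustrative values, not the ones from the slide's table, and they are assumed to be clamped away from 0 and 1 so the denominator is never zero.

```python
import math

def combine(probabilities):
    """Combine per-token spam probabilities into one message-level probability."""
    spam = math.prod(probabilities)
    ham = math.prod(1 - p for p in probabilities)
    return spam / (spam + ham)

# Hypothetical per-token values: two ham-leaning tokens, one spam-leaning token.
print(combine([0.01, 0.01, 0.99]))
```

With two strongly non-spam tokens against one strongly spam token, the combined value stays close to 0, which is the behavior the last bullet describes.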
Real World Applications • Messaging Security Architecture • TrustedSource.org
Summary • A variety of database techniques are used in Anti-Spam Technology • IP Blacklisting • Content filtering • Databases can contain: • Network traffic: IP Addresses, Domains, Ports • Message Content: Words, URLs, HTML Text • Challenges: • Scalability – Must handle many connections or messages • Minimize False Positive Rates – Cannot classify a HAM message as SPAM. • Finding useful SPAM features using machine learning techniques.
References • Brodsky et al., A Distributed Content Independent Method for Spam Detection, HotBots 2007 • Jin et al., Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007 • Liu et al., High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007 • Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007 • Cheng et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006 • Graham, P., A Plan for Spam, www.paulgraham.com/spam.html, 2002 • Secure Computing Corporation, http://trustedsource.org, 2008