
Network-Level Spam Detection


Presentation Transcript


  1. Network-Level Spam Detection Nick Feamster, Georgia Tech

  2. Spam: More than Just a Nuisance • 95% of all email traffic • Image and PDF spam (PDF spam ~12%) • As of August 2007, one in every 87 emails constituted a phishing attack • Targeted attacks on the rise • 20k-30k unique phishing attacks per month Source: CNET (January 2008), APWG

  3. Detection • Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham • Question: What features best differentiate spam from legitimate mail? • Content-based filtering: What is in the mail? • IP address of sender: Who is the sender? • Behavioral features: How is the mail sent?

  4. Content-Based Detection: Problems • Low cost of evasion: Spammers can easily adjust and change the features of an email’s content • Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc. • High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated

  5. Another Approach: IP Addresses • Problem: IP addresses are ephemeral • Every day, 10% of senders are from previously unseen IP addresses • Possible causes • Dynamic addressing • New infections

  6. Idea: Network-Based Detection • Filter email based on how it is sent, in addition to simply what is sent. • Network-level properties are less malleable • Hosting or upstream ISP (AS number) • Membership in a botnet (spammer, hosting infrastructure) • Network location of sender and receiver • Set of target recipients

  7. Behavioral Blacklisting • Idea: Blacklist sending behavior (“Behavioral Blacklisting”) • Identify sending patterns commonly used by spammers • Intuition: It is much more difficult for a spammer to change the technique by which mail is sent than it is to change the content

  8. Improving Classification • Lower overhead • Faster detection • Better robustness (i.e., to evasion, dynamism) • Use additional features and combine for more robust classification • Temporal: interarrival times, diurnal patterns • Spatial: sending patterns of groups of senders

  9. SNARE: Automated Sender Reputation • Goal: Sender reputation from a single packet? (or at least from as little information as possible) • Lower overhead • Faster classification • Less malleable • Key challenge • What features satisfy these properties and can distinguish spammers from legitimate senders?

  10. Sender-Receiver Geodesic Distance • 90% of legitimate messages travel 2,200 miles or less
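A minimal sketch of this feature, assuming the sender and receiver IP addresses have already been geolocated to latitude/longitude (e.g., via a geolocation database, not shown here); the coordinates and the 2,200-mile threshold check are purely illustrative:

```python
# Sketch: great-circle ("geodesic") distance between sender and receiver,
# assuming both IPs have already been mapped to (latitude, longitude).
# The haversine formula is standard; the example coordinates are illustrative.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Distance in miles between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Hypothetical sender/receiver locations (Atlanta -> Seattle).
sender, receiver = (33.75, -84.39), (47.61, -122.33)
distance = haversine_miles(*sender, *receiver)
print(f"sender-receiver distance: {distance:.0f} miles")
# A simple flag in the spirit of the slide's 90th-percentile figure:
print("beyond 2,200-mile legitimate-mail percentile:", distance > 2200)
```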

  11. Density of Senders in IP Space • For spammers, the k nearest senders are much closer in IP space
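A rough sketch of one way to compute such a density feature: treat each IPv4 address as a 32-bit integer and average the distance to the k numerically nearest other senders (small averages indicate densely populated regions of address space). The sample addresses and the function name knn_ip_distance are hypothetical:

```python
# Sketch: average numeric distance to the k nearest other senders in IPv4 space.
# Spammers tend to sit in dense regions (e.g., many compromised hosts in the
# same /24s), so this average is typically much smaller for them.
import ipaddress

def knn_ip_distance(ip, other_ips, k=20):
    """Mean |ip - neighbor| over the k numerically closest other senders."""
    target = int(ipaddress.ip_address(ip))
    dists = sorted(abs(target - int(ipaddress.ip_address(o)))
                   for o in other_ips if o != ip)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")

# Illustrative sender list (documentation address ranges).
senders = ["203.0.113.5", "203.0.113.9", "203.0.113.77", "198.51.100.14"]
print(knn_ip_distance("203.0.113.6", senders, k=3))
```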

  12. Other Network-Level Features • Time-of-day at sender • Upstream AS of sender • Message size (and variance) • Number of recipients (and variance)
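A small illustrative sketch of computing some of these per-sender aggregates from simple message records; the record fields and values are placeholders, and the upstream-AS lookup (which needs external routing data) is omitted:

```python
# Sketch: per-sender aggregates for the features listed above, computed from
# simple (sender, timestamp, size, recipients) records. Values are placeholders.
from statistics import mean, variance
from datetime import datetime

messages = [  # (sender IP, timestamp, size in bytes, number of recipients)
    ("203.0.113.5", datetime(2007, 8, 1, 3, 12), 9_800, 40),
    ("203.0.113.5", datetime(2007, 8, 1, 3, 15), 10_100, 38),
    ("203.0.113.5", datetime(2007, 8, 1, 3, 20), 9_950, 41),
]

sizes = [size for _, _, size, _ in messages]
recipients = [n for _, _, _, n in messages]
hours = [ts.hour for _, ts, _, _ in messages]

features = {
    "size_mean": mean(sizes), "size_var": variance(sizes),
    "rcpt_mean": mean(recipients), "rcpt_var": variance(recipients),
    "typical_hour": mean(hours),  # crude time-of-day summary
}
print(features)
```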

  13. Combining Features • Put features into the RuleFit classifier • 10-fold cross validation on one day of query logs from a large spam filtering appliance provider • Using only network-level features • Completely automated
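A hedged sketch of this step: since a RuleFit implementation may not be readily at hand, scikit-learn's GradientBoostingClassifier is substituted here as a stand-in ensemble learner, evaluated with 10-fold cross-validation on randomly generated placeholder features and labels rather than real query logs:

```python
# Sketch: combining network-level features in a classifier with 10-fold
# cross-validation. The slide uses RuleFit; GradientBoostingClassifier is a
# stand-in. All features and labels below are random placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.uniform(0, 5000, n),       # sender-receiver geodesic distance (miles)
    rng.uniform(0, 1e6, n),        # mean distance to k nearest senders in IP space
    rng.integers(0, 24, n),        # local time-of-day at sender (hour)
    rng.integers(1, 65536, n),     # upstream AS number of sender
    rng.normal(10_000, 3_000, n),  # message size (bytes)
    rng.integers(1, 50, n),        # number of recipients
])
y = rng.integers(0, 2, n)          # 1 = spam, 0 = ham (placeholder labels)

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```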

  14. Cluster-Based Features • Construct a behavioral fingerprint for each sender • Cluster senders with similar fingerprints • Filter new senders that map to existing clusters

  15. Identifying Invariants • [Diagram: a known spammer (IP 76.17.114.xxx) and an unknown sender (IP 24.99.146.xxx), e.g., the same host after DHCP reassignment or a newly infected machine, each send spam to domain1.com, domain2.com, and domain3.com; clustering on sending behavior yields similar behavioral fingerprints.]

  16. Building the Classifier: Clustering • Feature: Distribution of email sending volumes across recipient domains • Clustering approach • Build an initial seed list of bad IP addresses • For each IP address, compute a feature vector: volume per domain per time interval • Collapse into a single IP x domain matrix • Compute clusters
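A minimal sketch of this pipeline, assuming (IP, domain, count) log records; the slide does not name the clustering algorithm, so KMeans is used here purely as a stand-in, and the records reuse the illustrative addresses from the earlier diagram:

```python
# Sketch: build the IP x domain sending-volume matrix from (ip, domain, count)
# records, normalize each row to a distribution across recipient domains, and
# cluster the rows. KMeans is a stand-in; the records are illustrative.
import numpy as np
from sklearn.cluster import KMeans

records = [  # (sender IP, recipient domain, message count in interval)
    ("76.17.114.1", "domain1.com", 120), ("76.17.114.1", "domain2.com", 110),
    ("76.17.114.1", "domain3.com", 130), ("24.99.146.2", "domain1.com", 115),
    ("24.99.146.2", "domain2.com", 125), ("24.99.146.2", "domain3.com", 105),
    ("192.0.2.7",   "domain1.com",   2), ("192.0.2.7",   "domain2.com",   1),
]

ips = sorted({ip for ip, _, _ in records})
domains = sorted({d for _, d, _ in records})
matrix = np.zeros((len(ips), len(domains)))
for ip, domain, count in records:
    matrix[ips.index(ip), domains.index(domain)] += count

# Each row becomes a distribution over domains (the input to the fingerprint).
rows = matrix / matrix.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(rows)
for ip, label in zip(ips, kmeans.labels_):
    print(ip, "-> cluster", label)
```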

  17. Clustering: Fingerprint • For each cluster, compute a fingerprint vector • New IPs will be compared to this “fingerprint” • [Figure: IP x IP matrix; intensity indicates pairwise similarity]
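A sketch of one plausible way to build and use the fingerprint: average the row distributions within each cluster, then score a new sender by cosine similarity to the nearest cluster fingerprint. The similarity measure is an assumption; the slide only says new IPs are compared to the fingerprint:

```python
# Sketch: per-cluster "fingerprint" = mean of the member rows of the
# IP x domain distribution matrix; a new IP is scored by its similarity
# to the closest fingerprint. Cosine similarity is an assumed choice.
import numpy as np

def fingerprints(rows, labels):
    """Mean domain distribution per cluster label."""
    return {c: rows[labels == c].mean(axis=0) for c in np.unique(labels)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score(new_row, fps):
    """Similarity of a new sender's domain distribution to the nearest cluster."""
    return max(cosine(new_row, fp) for fp in fps.values())

# Illustrative data: two spam-cluster senders and one low-volume sender,
# each row a distribution of volume across three recipient domains.
rows = np.array([[0.33, 0.31, 0.36],
                 [0.33, 0.36, 0.31],
                 [0.70, 0.25, 0.05]])
labels = np.array([0, 0, 1])
fps = fingerprints(rows, labels)

new_sender = np.array([0.34, 0.33, 0.33])  # unknown IP with a similar pattern
print("similarity to nearest fingerprint:", round(score(new_sender, fps), 3))
```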

  18. Evaluation • Emulate the performance of a system that could observe sending patterns across many domains • Build clusters/train on given time interval • Evaluate classification • Relative to labeled logs • Relative to IP addresses that were eventually listed

  19. Early Detection Results • Compare SpamTracker scores on “accepted” mail to the SpamHaus database • About 15% of accepted mail was later determined to be spam • Can SpamTracker catch this? • Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month • 65 emails had a score larger than 5 (85th percentile)

  20. Small Samples Work Well • Relatively small samples can achieve low false positive rates

  21. Extensions to Phishing • Goal: Detect phishing attacks based on behavioral properties of hosting site(vs. static properties of URL) • Features • URL regular expressions • Registration time of domain • Uptime of hosting site • DNS TTL and redirections • Next time: Discussion of phishing detection/integration

  22. Integration with SMITE • Sensors • Extract network features from traffic • IP addresses • Combine with auxiliary data (routing, time, etc.) • Algorithms • Clustering algorithm to identify behavioral fingerprints • Learning algorithm to classify based on multiple features • Correlation • Clusters formed by aggregating sending behavior observed across multiple sensors • Various features also require input from data collected across collections of IP addresses

  23. Summary • Spam is increasing, and spammers are becoming more agile • Content filters are falling behind • IP-based blacklists are evadable • Up to 30% of spam is not listed in common blacklists at receipt; ~20% remains unlisted after a month • Complementary approach: behavioral blacklisting based on network-level features • Blacklist based on how messages are sent • SNARE: automated sender reputation • ~90% of the accuracy of existing blacklists with lightweight features • Cluster-based features to improve accuracy and reduce the need for labeled data

  24. Improvements • Accuracy • Synthesizing multiple classifiers • Incorporating user feedback • Learning algorithms with bounded false positives • Performance • Caching/Sharing • Streaming • Security • Learning in adversarial environments

  25. Sampling: Training Time

  26. Dynamism: Accuracy over Time
