
Fighting Spam, Phishing and Online Scams at the Network Level

Nick Feamster, Georgia Tech

with Anirudh Ramachandran, Shuang Hao, Nadeem Syed, Alex Gray, Sven Krasser, Santosh Vempala



Spam: More than Just a Nuisance

  • 95% of all email traffic

    • Image and PDF Spam (PDF spam ~12%)

  • As of August 2007, one in every 87 emails constituted a phishing attack

  • Targeted attacks on the rise

    • 20k-30k unique phishing attacks per month

Source: CNET (January 2008), APWG



Filtering

  • Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham

  • Question: What features best differentiate spam from legitimate mail?

    • Content-based filtering: What is in the mail?

    • IP address of sender: Who is the sender?

    • Behavioral features: How is the mail sent?



Conventional Approach: Content Filters

  • Trying to hit a moving target...

    • Images

    • PDFs

    • Excel sheets

    • ...and even mp3s!



Problems with Content Filtering

  • Low cost of evasion: Spammers can easily adjust and change features of an email’s content

  • Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

  • High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated



Another Approach: IP Addresses

  • Problem: IP addresses are ephemeral

  • Every day, 10% of senders are from previously unseen IP addresses

  • Possible causes

    • Dynamic addressing

    • New infections
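
The churn figure above can be measured from a per-day sender log. The following is a minimal sketch, assuming a hypothetical log that maps each day to the sender IPs seen that day; the function name and the toy addresses are illustrative and not from the study.

```python
def unseen_fraction_per_day(daily_senders):
    """Map each day to the fraction of its sender IPs not seen on any earlier day."""
    seen = set()
    fractions = {}
    for day, senders in daily_senders.items():
        unique = set(senders)
        if unique:
            fractions[day] = len(unique - seen) / len(unique)
        seen |= unique
    return fractions

# Hypothetical toy log (documentation-range addresses), in chronological order.
log = {
    "day1": ["192.0.2.1", "192.0.2.2"],
    "day2": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "198.51.100.7"],
}
fractions = unseen_fraction_per_day(log)
```

The first day is always 100% unseen, so a measurement like the one on the slide only becomes meaningful after a warm-up period.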



Idea: Network-Based Filtering

  • Filter email based on how it is sent, in addition to simply what is sent.

  • Network-level properties are less malleable

    • Set of target recipients

    • Hosting or upstream ISP (AS number)

    • Membership in a botnet (spammer, hosting infrastructure)

    • Network location of sender and receiver



Challenges

  • Understanding the network-level behavior

    • What behaviors do spammers have?

    • How well do existing techniques work?

  • Building classifiers using network-level features

    • Key challenge: Which features to use?

    • Algorithms: SpamTracker and SNARE

  • Building the system

    • Dynamism: Behavior itself can change

    • Scale: Lots of email messages (and spam!) out there



Data Collection: Spam and BGP

  • Spam Traps: Domains that receive only spam

  • BGP Monitors: Watch network-level reachability

[Figure: data collection setup with trap domains labeled Domain 1 and Domain 2]

17-Month Study: August 2004 to December 2005



Data Collection: MailAvenger

  • Highly configurable SMTP server

  • Collects many useful statistics


BGP “Spectrum Agility”

  • Hijack IP address space using BGP

  • Send spam

  • Withdraw IP address

[Figure annotation: ~10 minutes]

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes:

  • 61.0.0.0/8 (AS 4678)

  • 66.0.0.0/8 (AS 21562)

  • 82.0.0.0/8 (AS 8717)

This technique accounts for somewhere between 1% and 10% of all spam (some announcements are clearly intentional; others might be route flapping).
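
One way to look for the hijack, send, withdraw pattern is to flag routes that are withdrawn soon after being announced. This is a minimal sketch, assuming a hypothetical event feed of (timestamp, kind, prefix, origin AS) tuples and a 600-second cutoff taken from the "~10 minutes" annotation on the slide; it is not the tooling used in the study.

```python
def short_lived_routes(events, max_lifetime_s=600):
    """Return (prefix, origin_as, lifetime_s) for routes withdrawn within the cutoff.

    events: iterable of (timestamp_s, kind, prefix, origin_as), sorted by time,
    where kind is "announce" or "withdraw" (a hypothetical format).
    """
    announced = {}
    flagged = []
    for ts, kind, prefix, origin_as in events:
        key = (prefix, origin_as)
        if kind == "announce":
            announced.setdefault(key, ts)  # keep the earliest announcement
        elif kind == "withdraw" and key in announced:
            lifetime = ts - announced.pop(key)
            if lifetime <= max_lifetime_s:
                flagged.append((prefix, origin_as, lifetime))
    return flagged

# Toy events: one short-lived /8 and one long-lived /24 (private-use AS number).
events = [
    (0, "announce", "61.0.0.0/8", 4678),
    (100, "announce", "198.51.100.0/24", 64500),
    (540, "withdraw", "61.0.0.0/8", 4678),
    (90000, "withdraw", "198.51.100.0/24", 64500),
]
flagged = short_lived_routes(events)
```

A short lifetime alone does not prove intent, since flapping routes look the same in this sketch; correlating flagged intervals with spam arrival times would be a further step.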



Why Such Big Prefixes?

  • Visibility: Route typically won’t be filtered (nice and short)

  • Flexibility: Client IPs can be scattered throughout dark space within a large /8

    • Same sender usually returns with different IP addresses



Characteristics of Agile Senders

  • IP addresses are widely distributed across the /8 space

  • IP addresses typically appear only once at our sinkhole

  • Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked

  • Some IP addresses were in allocated, albeit unannounced, space

  • Some AS paths associated with the routes contained reserved AS numbers



Other Findings

  • Top senders: Korea, China, Japan

    • Still about 40% of spam coming from U.S.

  • More than half of sender IP addresses appear less than twice

  • ~90% of spam sent to traps from Windows



What about IP-based blacklists?



Two Metrics

  • Completeness: The fraction of spamming IP addresses that are listed in the blacklist

  • Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
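
The two metrics can be computed directly from timestamps. The sketch below assumes two hypothetical mappings, one from each spamming IP to the time its first spam was seen and one from each listed IP to the time it was listed; the names and toy values are illustrative only.

```python
def completeness(spam_ips, listed_at):
    """Fraction of spamming IPs that appear in the blacklist at all."""
    spam_ips = set(spam_ips)
    if not spam_ips:
        return 0.0
    return len(spam_ips & set(listed_at)) / len(spam_ips)

def responsiveness(first_spam_at, listed_at):
    """Per-IP delay in seconds between first observed spam and listing.

    IPs listed before their first observed spam get a delay of 0;
    IPs that are never listed are omitted.
    """
    return {
        ip: max(0, listed_at[ip] - t0)
        for ip, t0 in first_spam_at.items()
        if ip in listed_at
    }

# Toy data: seconds since an arbitrary epoch.
first_spam_at = {"192.0.2.1": 1000, "192.0.2.2": 2000, "192.0.2.3": 3000, "192.0.2.4": 500}
listed_at = {"192.0.2.1": 4600, "192.0.2.2": 1500, "192.0.2.3": 89400}

frac_listed = completeness(first_spam_at, listed_at)
delays = responsiveness(first_spam_at, listed_at)
```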



Completeness and Responsiveness

  • 10-35% of spam is unlisted at the time of receipt

  • 8.5-20% of these IP addresses remain unlisted even after one month

Data: Trap data from March 2007, Spamhaus from March and April 2007



Completeness of IP Blacklists

~95% of bots are listed in one or more blacklists

~80% listed on average

Only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists

[Figure: fraction of all spam received vs. number of DNSBLs listing the spammer]



What’s Wrong with IP Blacklists?

  • Based on ephemeral identifier (IP address)

    • More than 10% of all spam comes from IP addresses not seen within the past two months

      • Dynamic renumbering of IP addresses

      • Stealing of IP addresses and IP address space

      • Compromised machines

  • IP addresses of senders have considerable churn

  • Often require a human to notice/validate the behavior

    • Spamming is compartmentalized by domain and not analyzed across domains



Ephemeral: Addresses Keep Changing

About 10% of IP addresses were never seen before in the trace

[Figure: Fraction of IP Addresses]



Low Volume to Each Domain

Most spammers send very little spam, regardless of how long they have been spamming.

[Figure: Amount of Spam vs. Lifetime (seconds)]



Where do we go from here?

  • Option 1: Stronger sender identity

    • Stronger sender identity/authentication may make reputation systems more effective

    • May require changes to hosts, routers, etc.

  • Option 2: Filtering based on sender behavior

    • Can be done on today’s network

    • Identifying features may be tricky, and some may require network-wide monitoring capabilities



Outline

  • Understanding the network-level behavior

    • What behaviors do spammers have?

    • How well do existing techniques work?

  • Building classifiers using network-level features

    • Key challenge: Which features to use?

    • Algorithms: SpamTracker and SNARE

  • Building the system (SpamSpotter)

    • Dynamism: Behavior itself can change

    • Scale: Lots of email messages (and spam!) out there



SpamTracker

  • Idea: Blacklist sending behavior (“Behavioral Blacklisting”)

    • Identify sending patterns commonly used by spammers

  • Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content



SpamTracker Approach

  • Construct a behavioral fingerprint for each sender

  • Cluster senders with similar fingerprints

  • Filter new senders that map to existing clusters


Some Patterns of Sending are Invariant

  • Spammer's sending pattern has not changed

  • IP Blacklists cannot make this connection

[Figure: a sender at IP address 76.17.114.xxx sends spam to domain1.com, domain2.com, and domain3.com; after DHCP reassignment, IP address 24.99.146.xxx sends spam to the same three domains]


SpamTracker: Identify Invariant

[Figure: a known spammer at IP address 76.17.114.xxx and an unknown sender at IP address 24.99.146.xxx (a new address via DHCP reassignment or infection) each send spam to domain1.com, domain2.com, and domain3.com. Clustering on sending behavior shows a similar behavioral fingerprint.]



Building the Classifier: Clustering

  • Feature: Distribution of email sending volumes across recipient domains

  • Clustering Approach

    • Build initial seed list of bad IP addresses

    • For each IP address, compute feature vector: volume per domain per time interval

    • Collapse into a single IP x domain matrix

    • Compute clusters
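
The feature-construction steps above can be sketched as follows. The formula from the original slide did not survive in this transcript, so this sketch assumes that "collapse" means summing volumes over time intervals; the record format and names are hypothetical, and the clustering step itself (spectral clustering, per the summary slide) is not shown.

```python
from collections import defaultdict

def collapse_over_time(records):
    """Sum per-interval volumes into matrix[ip][domain] = total messages.

    records: iterable of (ip, domain, interval, count) tuples.
    Assumption: the time axis is collapsed by summation.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for ip, domain, _interval, count in records:
        matrix[ip][domain] += count
    return {ip: dict(row) for ip, row in matrix.items()}

# Toy records: (sender IP, recipient domain, time interval index, message count).
records = [
    ("192.0.2.1", "domain1.com", 0, 3),
    ("192.0.2.1", "domain1.com", 1, 2),
    ("192.0.2.1", "domain2.com", 1, 4),
    ("192.0.2.9", "domain3.com", 0, 1),
]
matrix = collapse_over_time(records)
```

Each row of the resulting matrix is one sender's volume distribution across recipient domains, which is the feature named at the top of the slide.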



Clustering: Output and Fingerprint

  • For each cluster, compute fingerprint vector

  • New IPs will be compared to this “fingerprint”

[Figure: IP x IP matrix; intensity indicates pairwise similarity]
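
The fingerprint formula is missing from this transcript. One plausible reading, shown here purely as an assumption, is that a cluster's fingerprint is the element-wise average of its members' per-domain volume vectors. The function name and toy data are hypothetical.

```python
def cluster_fingerprint(rows, domains):
    """Average the per-domain volume vectors of all IPs in one cluster.

    Assumption: the fingerprint is the cluster mean; the slide's own
    formula is not available here.
    """
    if not rows:
        raise ValueError("cluster must contain at least one sender")
    n = len(rows)
    return [sum(row.get(d, 0) for row in rows) / n for d in domains]

domains = ["domain1.com", "domain2.com", "domain3.com"]
cluster = [
    {"domain1.com": 4, "domain2.com": 2},
    {"domain1.com": 2, "domain2.com": 2, "domain3.com": 2},
]
fingerprint = cluster_fingerprint(cluster, domains)
```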



Classifying IP Addresses

  • Given a “new” IP address, build a feature vector based on its sending pattern across domains

  • Compute the similarity of this sending pattern to that of each known spam cluster

    • Normalized dot product of the two feature vectors

    • Spam score is maximum similarity to any cluster
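
The scoring rule above can be sketched with cosine similarity as the "normalized dot product". That choice is an assumption: the slides later mention scores larger than 5, which suggests the actual normalization differs and is not bounded by 1. Function names and vectors are illustrative.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spam_score(vector, fingerprints):
    """Maximum similarity between a sender's vector and any cluster fingerprint."""
    return max((cosine(vector, f) for f in fingerprints), default=0.0)

# Toy fingerprints for two known spam clusters, over three recipient domains.
fingerprints = [[3.0, 2.0, 1.0], [0.0, 0.0, 5.0]]
new_sender = [6, 4, 2]        # same shape as the first fingerprint, twice the volume
score = spam_score(new_sender, fingerprints)
```

Because the similarity is scale-invariant in this sketch, a low-volume sender with the same distribution across domains scores as high as a high-volume one.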



Evaluation

  • Emulate the performance of a system that could observe sending patterns across many domains

    • Build clusters/train on given time interval

  • Evaluate classification

    • Relative to labeled logs

    • Relative to IP addresses that were eventually listed



Data

  • 30 days of Postfix logs from an email hosting service

    • Time, remote IP, receiving domain, accept/reject

    • Allows us to observe sending behavior over a large number of domains

    • Problem: About 15% of accepted mail is also spam

      • Creates problems with validating SpamTracker

  • 30 days of SpamHaus database in the month following the Postfix logs

    • Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus



Classification Results

[Figure: distribution of SpamTracker Score for Ham and Spam. Annotation: not always so accurate!]



Early Detection Results

  • Compare SpamTracker scores on “accepted” mail to the SpamHaus database

    • About 15% of accepted mail was later determined to be spam

    • Can SpamTracker catch this?

  • Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month

    • 65 emails had a score larger than 5 (85th percentile)
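
As a quick check of scale, the two counts above give the share of later-blacklisted senders' mail that SpamTracker scored highly. The variable names are illustrative.

```python
accepted_then_blacklisted = 620   # accepted emails from IPs blacklisted within a month
high_score = 65                   # of those, emails with a score larger than 5

fraction = high_score / accepted_then_blacklisted
print(f"{fraction:.1%}")  # prints 10.5%
```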



Evasion

  • Problem: Malicious senders could add noise

    • Solution: Use a smaller number of trusted domains

  • Problem: Malicious senders could change sending behavior to emulate “normal” senders

    • Need a more robust set of features…



Improving Classification

  • Lower overhead

  • Faster detection

  • Better robustness (i.e., to evasion, dynamism)

  • Use additional features and combine for more robust classification

    • Temporal: interarrival times, diurnal patterns

    • Spatial: sending patterns of groups of senders
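
The temporal features named above could be extracted along these lines. This is a minimal sketch, assuming Unix timestamps in seconds for one sender's messages; the function names are hypothetical, and the hour-of-day buckets are in UTC rather than the sender's local time.

```python
def interarrival_times(timestamps):
    """Gaps in seconds between consecutive messages from one sender."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def hourly_histogram(timestamps):
    """Message counts per hour of day (UTC), a crude diurnal profile."""
    hist = [0] * 24
    for t in timestamps:
        hist[int(t // 3600) % 24] += 1
    return hist

# Toy timestamps: three messages in the first hour, one in the second.
stamps = [0, 60, 180, 3630]
gaps = interarrival_times(stamps)
profile = hourly_histogram(stamps)
```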




SNARE: Automated Sender Reputation

  • Goal: Sender reputation from a single packet? (or at least as little information as possible)

    • Lower overhead

    • Faster classification

    • Less malleable

  • Key challenge

    • What features satisfy these properties and can distinguish spammers from legitimate senders?



Sender-Receiver Geodesic Distance

90% of legitimate messages travel 2,200 miles or less
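
A distance feature like this needs coordinates for sender and receiver, for example from an IP geolocation database (not shown). The sketch below uses the haversine great-circle formula on a spherical Earth as an approximation of geodesic distance; the function name, radius constant, and coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation

def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Approximate coordinates for Atlanta and Seoul, used only as an example pair.
d = geodesic_miles(33.749, -84.388, 37.5665, 126.978)
beyond_typical_legit_range = d > 2200  # 2,200 miles is the slide's 90th percentile
```

The 2,200-mile figure describes where 90% of legitimate mail falls; on its own it is one feature among several, not a decision rule.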



Density of Senders in IP Space

For spammers, k nearest senders are much closer in IP space
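
One way to express this density feature is the average numerical distance from a sender to its k nearest neighboring senders in IPv4 space. This is a minimal sketch under that assumption; the function name, the choice of k, and the toy addresses are hypothetical.

```python
import ipaddress

def mean_knn_distance(ip, other_senders, k=3):
    """Average numeric distance from ip to its k nearest senders in IPv4 space."""
    target = int(ipaddress.IPv4Address(ip))
    dists = sorted(
        abs(target - int(ipaddress.IPv4Address(o)))
        for o in other_senders
        if o != ip
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")

# Toy sender set: two close neighbors, one a /24 away, one far away.
others = ["192.0.2.11", "192.0.2.14", "192.0.3.10", "10.0.0.1"]
density = mean_knn_distance("192.0.2.10", others, k=2)
```

A small value means the sender sits in a densely populated neighborhood of other senders, which the slide associates with spammers.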



Combining Features

  • Put features into the RuleFit classifier

  • 10-fold cross validation on one day of query logs from a large spam filtering appliance provider

  • Using only network-level features

  • Completely automated
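
The 10-fold protocol can be sketched independently of the classifier. The code below only generates the train/test index splits, assuming contiguous folds with no shuffling; it does not implement RuleFit, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(25, k=10))
```

In practice the samples would usually be shuffled first, so that each fold mixes senders from across the day rather than covering one contiguous slice of the log.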




Real-Time Blacklist Deployment

Approach:

  • As mail arrives, lookups are received at the blacklist (BL)

  • Queries provide proxy for sending behavior

  • Train based on received data

  • Return score



Challenges

  • Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?

  • Dynamism: When to retrain the classifier, given that sender behavior changes?

  • Reliability: How should the system be replicated to better defend against attack or failure?

  • Sensor placement: Where should monitors be placed to best observe behavior/construct features?



Design Choice: Augment DNSBL

  • Expressive queries

    • SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org

      • Ans: 127.0.0.3 (=> listed in exploits block list)

    • SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net

      • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net

      • Ans: 127.1.3.97 (SpamSpotter score = -3.97)

  • Also a source of data

    • Unsupervised algorithms work with unlabeled data
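
The query format can be illustrated without touching the network. The first function builds a conventional DNSBL lookup name by reversing the IPv4 octets, as in the SpamHaus example. The second decodes a SpamSpotter-style answer; its encoding is inferred only from the single example on the slide (second octet 1 meaning a negative sign, third and fourth octets the whole and hundredths parts) and should be treated as an assumption, not a documented format.

```python
import ipaddress

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the lookup name for an IPv4 address: reversed octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def decode_score(answer):
    """Decode a 127.s.i.f answer into a signed score.

    Assumption (inferred from one example): s == 1 marks a negative score,
    i is the whole part, and f is the part in hundredths.
    """
    _, sign, whole, frac = (int(x) for x in answer.split("."))
    magnitude = whole + frac / 100
    return -magnitude if sign == 1 else magnitude

name = dnsbl_query_name("62.90.102.55")
score = decode_score("127.1.3.97")
```

Carrying the score in an A record means existing mail servers that already speak DNSBL can query the service without new client software.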



Design Choice: Sampling

Relatively small samples can achieve low false positive rates



Sampling: Training Time



Dynamism: Accuracy over Time



Improvements

  • Accuracy

    • Synthesizing multiple classifiers

    • Incorporating user feedback

    • Learning algorithms with bounded false positives

  • Performance

    • Caching/Sharing

    • Streaming

  • Security

    • Learning in adversarial environments



Summary: Network-Based Behavioral Filtering

  • Spam increasing, spammers becoming agile

    • Content filters are falling behind

    • IP-Based blacklists are evadable

      • Up to 30% of spam is not listed in common blacklists at receipt; ~20% remains unlisted after a month

  • Complementary approach: behavioral blacklisting based on network-level features

    • Blacklist based on how messages are sent

    • SpamTracker: Spectral clustering

      • Catches significant amounts of spam faster than existing blacklists

    • SNARE: Automated sender reputation

      • ~90% of the accuracy of existing blacklists, with lightweight features

    • SpamSpotter: Putting it together in an RBL system



References

  • Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006

  • Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007

  • Nadeem Syed, Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, GT-CSE-08-02

  • Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, “A Dynamic Reputation Service for Spotting Spammers”, GT-CS-08-09



Additional History: Message Size Variance

Senders of legitimate mail have a much higher variance in the sizes of messages they send

Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier

[Figure: Message Size Range for senders labeled Certain Spam, Likely Spam, Likely Ham, and Certain Ham]

