Show me the money… In search of a meaningful research agenda for addressing cybercrime

Show me the money… In search of a meaningful research agenda for addressing cybercrime • Stefan Savage & Geoff Voelker • Kirill Levchenko, Chris Kanich, Andreas Pitsillidis, Justin Ma, Brandon Enright, Qing Zhang, Christian Kreibich (ICSI), Mark Felegyhazi (ICSI), Nick Weaver (ICSI), Vern Paxson (ICSI)

Warning… • Yinglian asked Geoff or me to give a talk… • All the obvious topics failed because: • Already gave version of talk here (e.g., Spamalytics) • Work was joint and done at Microsoft (e.g., SORA) • Work was embargoed by co-authors (e.g., Cloud stuff) • So… instead this is a bit of a pastiche • Trying to explain our research agenda in e-crime (why) • Give a sketch of the projects we have in progress (what) (some only on blackboard) • Very unpolished and preliminary; questions/feedback welcome

Context • I co-direct the Collaborative Center for Internet Epidemiology and Defenses (CCIED) • UCSD/ICSI group created in response to worm threat • Funded in 2004, many strong partners • Research Agenda • Internet epidemiology: measuring/understanding attacks • Automated defenses: stopping outbreaks/attacks • Economic and legal issues: that other stuff

Many big successes… • 50+ papers, lots of tech transfer, big systems, etc. • Network Telescope • Passive monitor for > 1% of routable Internet addr space • Potemkin & GQ Honeyfarms • Active VM honeypot servers on > 250k IP addresses • Earlybird • On-line learning of new worm signatures in < 1ms

But… depressing truth • We didn’t stop Internet worms, let alone malware, let alone cybercrime… nor did anyone else • At best, moved it around a bit • By any meaningful metric the bad guys are winning… • Mistake: looking at this solely as a technical problem

Key threat transformations of the 21st century • Efficient large-scale compromises • Internet communications model • Software homogeneity • User naïveté/fatigue • Centralized control • Cheap scalability for criminal applications (e.g., spam, info theft, DDoS, etc.) • Profit-driven applications • Commodity resources (IP, bandwidth, storage, CPU) • Unique resources (PII/credentials, CD-Keys, address book, etc.)

Emergence of Economic Drivers • In last six years, emergence of profit-making malware • Anti-spam efforts force spammers to launder e-mail through compromised machines (starts with MyDoom.A, SoBig) • “Virtuous” economic cycle transforms nature of threat • Commoditization of compromised hosts • Fluid third-party exchange market (millions of hosts) • Raw bots (range from pennies to dollars) • Value added tier: SPAM proxying (more expensive) • Innovation in both host substrate and its uses • Sophisticated infection and command/control networks: platform • SPAM, piracy, phishing, identity theft, DDoS are all applications

DDoS for sale • Emergence of economic engine for Internet crime • SPAM, phishing, spyware, etc • Fluid third party markets for illicit digital goods/services • Bots ~$0.5/host, special orders, value added tiers • Cards, malware, exploits, DDoS, cashout, etc.

Botnet Spammer Rental Rates • September 2004 postings to SpecialHam.com, Spamforum.biz • “20-30k always online SOCKs4, url is de-duped and updated every 10 minutes. 900/weekly, Samples will be sent on request. Monthly payments arranged at discount prices.” • “$350.00/weekly - $1,000/monthly (USD); Type of service: Exclusive (One slot only); Always Online: 5,000 - 6,000; Updated every: 10 minutes” • “$220.00/weekly - $800.00/monthly (USD); Type of service: Shared (4 slots); Always Online: 9,000 - 10,000; Updated every: 5 minutes” • Annotated rates: 3.6 cents per bot week; 6 cents per bot week; 2.5 cents per bot week

Bot Payloads
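
The per-bot rates annotated on the slide can be approximately reproduced from the quoted prices. The sketch below is only a back-of-the-envelope check: it assumes the "900/weekly" posting is in US dollars, uses the midpoint of each quoted "always online" range, and the posting labels are made up for illustration.

```python
# Weekly prices and "always online" bot counts as quoted in the postings.
# Assumptions of this sketch: "900/weekly" is USD, and the midpoint of each
# quoted range is a fair estimate of the bots actually delivered.
postings = [
    {"name": "SOCKS4 proxies", "weekly_usd": 900.0, "bots": (20_000, 30_000)},
    {"name": "Exclusive (one slot)", "weekly_usd": 350.0, "bots": (5_000, 6_000)},
    {"name": "Shared (4 slots)", "weekly_usd": 220.0, "bots": (9_000, 10_000)},
]


def cents_per_bot_week(weekly_usd, bot_range):
    """Weekly price divided by the midpoint bot count, in US cents."""
    low, high = bot_range
    midpoint = (low + high) / 2
    return 100.0 * weekly_usd / midpoint


rates = {p["name"]: cents_per_bot_week(p["weekly_usd"], p["bots"]) for p in postings}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} cents per bot week")
```

With these assumptions the three postings come out in the low single-digit cents per bot week, in line with the slide's annotations; the exact figures depend on which point in each range is used.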

Spamalytics

Key structural asymmetries • Defenders reactive, attackers proactive • Defenses public, attacker develops/tests in private • Arms race where best case for defender is to “catch up” • New defenses expensive, new attacks cheap • Defenses sunk costs/business model, attacker agile and not tied to particular technology • Low risk to attacker, high reward to attacker • Minimal deterrence • Functional anonymity on the Internet; very hard to fix • Defenses hard to measure, attacks easy to measure • Few security metrics (no “evidence-based” security), attackers measure monetization which drives attack quality

Example: brief history of the spam arms race (anti-spam action → spammer response) • Real-time IP blacklisting → Send via open relays/proxies • Clean up open relays/proxies → Delivery via compromised botnets • Content-based learning → Content chaff, polymorphic spam generators, img spam • Site takedown → Fast-flux redirect and transparent proxies • CAPTCHAs → CAPTCHA outsourcing, OCR-based breaking

Revisiting the problem • We tend to think about this in terms of technical means for securing computer systems • Most of the $50-100B IT budget on cyber security is spent on securing the end host • AV, firewalls, IDS, encryption, etc… • Single most expensive front to secure • Single hardest front to secure • But are individual end hosts valuable to bad guys? • Maybe $1.50? Even less in bulk… not a pain point • What instead? Economically informed strategies • Offense: identify and attack economic bottlenecks in value chain • Defense: proactive techniques; minimize time-value of enablers

Elements of the Internet “underground economy” • Acquisition of illicit digital goods • Tier-1 goods (e.g. credit card data, paypal, etc) • Directly valued in “real world”; single step liquidity • Tier-2 goods (e.g. bots, malware, $ services, CAPTCHA solving) • Valued only in UE, rented for service, or used to produce value in scam • Trade/Sale in such goods • On-line markets and market enablers • Scams (capital investment to extract new value) • Combine digital goods with value creation strategy • SPAM, phishing, DDoS extortion, pump/dump, etc • Liquidation of goods (cash out) • Indirect: SPAM/Adware (potentially legal), Click fraud, pump/dump, gambling • Direct: cash out (WU, eGold, WebMoney), wire transfer, card “tracking”, mules/drops

Previous work: Underground markets • “i sell CVV2s at $0.90, hacked hosts at $8, paypals at 8, fullz at $10, and wells fargo logins. IM me at XXXX DO NOT ASK FOR TESTS OR FREE CARDS. Thank you :)” • “westernunion confirmer can confirm males and females have drops in usa I AM VERIFIED MSG ME” • We analyzed 13 million messages on “public” channel of popular trading market (think dark-QVC) • Not “english” per se; pidgin at best. Same for “russian” forums • Combination of regexps, machine learning and NLP to parse • Identified 10’s of M$ in stolen credentials

Previous work: Estimating spam profits • Key basic inequality: (Delivery Cost) < (Conversion Rate) x (Marginal Revenue) • We have some handle on two of these • Delivery cost to send spam • Outsourced cost: retail purchase price < $70/M addrs • In-house cost: development/management labor • Marginal revenue • Average pharma sale of $100, affiliate commissions ≈ 50% • Conversion rate is hard to measure directly • UCSD/ICSI study infiltrated Storm botnet and manipulated “command and control” so ~500M of the URLs it used pointed to sites under our control
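
As a rough worked example of the inequality above, the break-even conversion rate can be computed from the two quantities the slide says are known. This is a sketch under stated assumptions: it uses the $70 per million addresses figure as the per-message delivery cost (the slide gives it as an upper bound) and treats the 50% commission on a $100 sale as the spammer's marginal revenue. The variable and function names are illustrative.

```python
# Figures taken from the slide; treating them as exact is an assumption.
delivery_cost_per_msg = 70.0 / 1_000_000  # retail price: < $70 per million addresses
avg_sale_usd = 100.0                      # average pharma sale
affiliate_commission = 0.50               # spammer's share of each sale

marginal_revenue = avg_sale_usd * affiliate_commission  # $50 per conversion

# Spam pays only if: delivery cost < conversion rate * marginal revenue.
break_even_conversion = delivery_cost_per_msg / marginal_revenue
msgs_per_sale_at_break_even = 1 / break_even_conversion


def is_profitable(conversion_rate):
    """True if the basic inequality holds for this conversion rate."""
    return delivery_cost_per_msg < conversion_rate * marginal_revenue


print(f"break-even conversion rate: {break_even_conversion:.2e}")
print(f"messages per sale at break-even: {msgs_per_sale_at_break_even:,.0f}")
```

Under these numbers a campaign paying retail delivery prices needs roughly one sale per 700,000 messages to break even, which is why the conversion rate is the quantity worth measuring.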

Spam pipeline

Response rates by country

Effects of Blacklisting (CBL Feed)

Spam filtering software • The fraction of spam delivered into user inboxes depends on the spam filtering software used • Combination of site filtering (e.g., blacklists) and content filtering (e.g., spamassassin) • Difficult to generalize, but we can use our test accounts for specific services • [Figure: fraction of spam sent that was delivered to inboxes; annotations: unused, other filtering, effective, two orders of magnitude, no large aberrations based on email topic]

Pipeline stages: Sent → MTA → Inbox (shown as ---) → Visits → Conversions • 347.5M → 82.7M (24%) → 10,522 (0.003%) → 28 (0.000008%) • 83.6M → 21.1M (25%) → 3,827 (0.005%) → 316 (0.00037%) • 40.1M → 10.1M (25%) → 2,721 (0.005%) → 225 (0.00056%) • Pharma: 12M spam emails for one “purchase” • E-card: 1 in 10 visitors execute the binary
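
The two takeaways on the pipeline slide (about 12M pharma emails per purchase, and roughly 1 in 10 e-card visitors executing the binary) follow from the funnel counts. The sketch below recomputes them; it assumes the three rows appear in the order given on the slide, with the first row being the pharma campaign, and the dictionary keys are illustrative names.

```python
# Funnel counts as they appear on the slide (rounded there to one decimal).
# Assumption: the first row is the pharma campaign, the other two are e-card style.
campaigns = [
    {"sent": 347.5e6, "mta": 82.7e6, "visits": 10_522, "conversions": 28},
    {"sent": 83.6e6, "mta": 21.1e6, "visits": 3_827, "conversions": 316},
    {"sent": 40.1e6, "mta": 10.1e6, "visits": 2_721, "conversions": 225},
]


def funnel(c):
    """Stage-to-stage ratios for one campaign."""
    return {
        "mta_rate": c["mta"] / c["sent"],
        "visit_rate": c["visits"] / c["sent"],
        "conversion_rate": c["conversions"] / c["sent"],
        "sent_per_conversion": c["sent"] / c["conversions"],
        "conversions_per_visit": c["conversions"] / c["visits"],
    }


results = [funnel(c) for c in campaigns]

for r in results:
    print(
        f"MTA {r['mta_rate']:.0%}, "
        f"one conversion per {r['sent_per_conversion']:,.0f} sent, "
        f"{r['conversions_per_visit']:.1%} of visitors convert"
    )
```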

What are we doing now? • Measurement studies into e-crime economics • Spam value chain analysis • Spammers, botnets, fast flux, affiliates, processing, fulfillment • Analyzing market enablers (cost structure and characteristics) • E.g., mules, domain registration, traffic selling, de-CAPTCHA • Mining social network of underground providers • Mapping monetization via financial credential honeytokens • Value of anti-phishing mechanisms • More proactive defenses • Botnet-driven spam filtering • Proactive URL blocking via on-line learning • Proactive phishing defense via machine vision

Spam value chain market • 10,000 foot idea: • We’ve gone deep into one spam campaign • We would like to understand the relationship between all the elements of the value chain involved across the spam industry • Value-chain characterization • Front end (visible via network) • Spamming groups • Botnets (& hosters) • Fast flux networks (& hosters/registrars) • Affiliate programs (& hosters) • Back end • Payment processing • Fulfillment

Anatomy of a modern pharma spam campaign Courtesy Stuart Brown modernlifisrubbish.co.uk

Goal [Diagram nodes: Spammers A-D; Botnet A; FastFlux A-B; Affiliates A-C; Fulfillment A-C]

Automated data collection • Sources: Blacklists, WHOIS, DNS, Spam, Botnets, Bad URLs • Live URL Feeds • Follow Referrers • Render Page • Repeat • Big Database

Manual data collection • Purchase goods from sites (Visa gift cards) • E-mail confirmation records • Receipt and customer service contact • Payment records • Merchant id on CC statement • Delivery • Post-mark, wrapping, any receipts, tracking info • Contents • FT-NIR features for pills, movement matching for watches

Synthesizing into elements • Cluster spammers • Address distribution fingerprints • Cookie matching • Cluster fast flux networks • Sets of domains hosted by same DNS infra; changing A (flux), or NS records (double flux) • Cluster affiliate programs • Matching on site HTML (text features) • Matching on visual similarity (SIFT features) • Merchant id (oracle) from purchase • Cluster fulfillment • Matching features of delivered products
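
One way to read "sets of domains hosted by same DNS infra" is as connected components: two domains land in the same cluster if they share a nameserver, directly or through a chain of other domains. The sketch below shows that idea with a small union-find. It is an illustration under that assumption, not the project's actual clustering method, and every domain and nameserver name is hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: domain -> nameserver hosts seen serving it.
observations = {
    "pills-a.example": {"ns1.flux-x.example", "ns2.flux-x.example"},
    "pills-b.example": {"ns2.flux-x.example", "ns3.flux-x.example"},
    "watches-c.example": {"ns1.flux-y.example"},
    "watches-d.example": {"ns1.flux-y.example", "ns9.flux-y.example"},
    "lonely.example": {"ns1.other.example"},
}


def cluster_by_shared_infra(obs):
    """Group domains that share at least one nameserver, transitively."""
    parent = {domain: domain for domain in obs}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_domain_for = {}  # nameserver -> first domain seen using it
    for domain, servers in obs.items():
        for ns in servers:
            if ns in first_domain_for:
                union(first_domain_for[ns], domain)
            else:
                first_domain_for[ns] = domain

    clusters = defaultdict(set)
    for domain in obs:
        clusters[find(domain)].add(domain)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_by_shared_infra(observations)
for members in clusters:
    print(members)
```

The same grouping could key on A records instead of NS records to capture single flux; which signal works better in practice is an open question here.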

CAPTCHA solving analysis • Webmail based spam • Web bots hard to filter; launder reputation of Web mail provider • But bots must solve CAPTCHA to create account; key enabler • De-CAPTCHA services (as little as $1/1k solved, 33% margin) • Study: purchase solving from range of such services and join as solvers • Key questions • Quality of solving (overall and vs price) • Capacity (latency from imparted load) • Number of workers (priming) • Labor market (language queries/primes) • Relative hardness of CAPTCHAs • Do CAPTCHAs make sense?

Crawling underground social networks • Underground criminals have implicit social network • Who offers which services, who partners with whom, etc... • Use multiple pseudo-identities, but significant structure still can be reconstructed manually • Goal: build social network via crawling/datamining • Identifiers (ICQ, phone, etc.) • Web page content, linkage on forum sites (who referenced whom, etc.), twitter

Traffic selling • On-line underground market for click traffic (parallel to Google/Yahoo/MS) • For direction to particular scams (e.g., pharma, counterfeits, etc.) • For use in click fraud/PTC scams • Active purchasing of traffic streams • Characterize traffic streams themselves • Real people, country of origin, time on site, click through, etc. • Survey of subset of people (why are you here) • Differential pricing for different click streams

Financial honeytokens • Range of scams that steal financial credentials • Question: do they share monetization infrastructure? • Money mules, wire cashout, layering via purchase, carding, trading, etc. • Methodology: • Purposely “lose” financial credentials • Infostealing malware, phishing site, on open market • See how accounts are monetized • Fingerprinting test transactions • Merchant for large transfers • Exploring multiple kinds of financial credentials (e.g., Visa, paypal, banks) • Working with large bank partner

Scam domain registration • Web-based crime is built on cheap and easy domain registration, but this is little understood • We now have full feed for .com, .net and .org (others) • Look at pattern of use for scam domains (ala w/Storm) • Time to use, length of use, registrar agility, etc. • Differences between FF domains and hosting domains • Mining registrant records • Either identify template or tie into social network

Phishing defense value • We have three kinds of phishing defenses • Spam filtering: stops subset from receiving known e-mail lures • Toolbars: stop subset from clicking on a known phishing site • Takedown: stop everyone from reaching known phishing site • But… how much do they each matter (i.e., to the phisher) and which is worth additional investment? • Dataset • Categorize phish e-mail and send through current filters • Track current toolbar blacklists • Track site lifetime (i.e., takedown) • Estimating click through (Taylor webalizer trick, DNS caching)

Proactive phishing defense • Virtually all anti-phishing defenses are reactive • Proactive defense via browser-based logo identification • Phishing campaigns all use logos or variations as trust cues • SIFT feature matching invariant (rotation, shearing, scale)

Proactive phishing defense • Example warning: “You are attempting to enter data into a site that is not authorized to use the Bank of America trademark. It is likely that this is a scam” • Query brand provider (à la SPF for domains) on recognized logo: is IP address authorized to display it? • Delay notification until user attempts to enter data
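
The SPF-style check described above could look roughly like the sketch below. Everything about the policy format is an assumption made for illustration: the slide does not say how a brand provider would publish its authorized addresses, so a plain dictionary of CIDR blocks stands in for that lookup, and the brand name and address ranges are placeholders.

```python
import ipaddress

# Hypothetical brand policy: networks allowed to display a brand's logo.
# In a real deployment this would be fetched from the brand provider.
BRAND_POLICY = {
    "examplebank": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_authorized(brand, site_ip, policy=BRAND_POLICY):
    """True if site_ip falls inside any network the brand has authorized."""
    address = ipaddress.ip_address(site_ip)
    networks = policy.get(brand, [])
    return any(address in ipaddress.ip_network(net) for net in networks)


def should_warn(recognized_brand, site_ip, user_is_entering_data):
    """Warn only when a logo was recognized, the site is not authorized,
    and the user is about to enter data (notification is delayed until then)."""
    if recognized_brand is None or not user_is_entering_data:
        return False
    return not is_authorized(recognized_brand, site_ip)


print(should_warn("examplebank", "203.0.113.7", True))   # unauthorized host
print(should_warn("examplebank", "192.0.2.10", True))    # authorized host
```

Delaying the warning until data entry keeps the check out of the way on pages that merely mention or display a brand.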

Proactive detection of malicious web sites • URL = Uniform Resource Locator • Examples: http://www.cs.mcgill.ca/~icml2009/abstracts.html, http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll, http://fblight.com, http://mail.ru • Safe URL? • Web exploit? • Spam-advertised site? • Phishing site? • Predict what is safe without committing to risky actions • Joint work w/Lawrence Saul

Problem in a Nutshell • URL features to identify malicious Web sites • Different classes of URLs • Benign, spam, phishing, exploits, scams... • For now, distinguish benign vs. malicious • Practical implementation issues • Scale to large problems • Update quickly as adversary changes • Online algorithms… facebook.com fblight.com

Live URL Classification System Label Example Hypothesis

Feature vector construction • Example URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll • WHOIS registration: 3/25/2009 • Hosted from 208.78.240.0/22 • IP hosted in San Mateo • Connection speed: T1 • Has DNS PTR record? Yes • Registrant “Chad” • ... • Resulting vector: [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …] • Feature groups: host-based, lexical, real-valued • Counts shown: 60+ features, 1.8 million, 1.1 million (GROWING)
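
To make the lexical part of the feature vector concrete, the sketch below turns a URL into bag-of-words style binary features and shows the feature index growing as new tokens appear. The tokenization rules and the `host:`/`path:`/`tld:` naming are assumptions for illustration; host-based features (WHOIS, IP prefix, geography) would need external lookups and are left out.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Split a URL into hostname and path tokens (illustrative tokenization)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    features = set()
    for token in host.split("."):
        if token:
            features.add(f"host:{token}")
    for token in re.split(r"[/\.\-_?=&]+", parsed.path):
        if token:
            features.add(f"path:{token}")
    if host:
        features.add(f"tld:{host.rsplit('.', 1)[-1]}")
    return features


def to_vector(features, index):
    """Binary vector over the shared index; the index grows with new features."""
    for name in sorted(features):
        index.setdefault(name, len(index))
    vector = [0] * len(index)
    for name in features:
        vector[index[name]] = 1
    return vector


index = {}
v1 = to_vector(lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"), index)
v2 = to_vector(lexical_features("http://fblight.com"), index)
print(len(v1), len(v2))  # the second vector is longer because the index grew
```

Because every new hostname or path token adds a dimension, a dense list like this would not scale to millions of features; a sparse representation would be the natural choice in practice.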

~99% accuracy against oracle • Algorithms compared: Perceptron, LR w/ SGD, Confidence-Weighted
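
For readers unfamiliar with online learning, the sketch below shows a textbook perceptron over sparse binary features, the simplest of the algorithm families named on the slide. It is not the classifier used in the project; the labeled examples are a hypothetical toy stream, and replaying the stream several times is done only so the toy data converges (a live system would see each URL once).

```python
class OnlinePerceptron:
    """Textbook perceptron over sparse binary features (label +1 = malicious)."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(self.weights.get(f, 0.0) for f in features)

    def predict(self, features):
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        """Predict first, then adjust weights only if the prediction was wrong."""
        prediction = self.predict(features)
        if prediction != label:
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + label
            self.bias += label
        return prediction


# Hypothetical labeled stream: (feature set, label).
stream = [
    ({"host:facebook", "tld:com"}, -1),
    ({"host:fblight", "tld:com"}, +1),
    ({"host:bfuduuioo1fp", "tld:mobi", "path:ebayisapi"}, +1),
    ({"host:mail", "tld:ru"}, -1),
]

model = OnlinePerceptron()
for _ in range(10):  # replay the toy stream until a pass has no mistakes
    mistakes = sum(model.update(feats, label) != label for feats, label in stream)
    if mistakes == 0:
        break

predictions = [model.predict(feats) for feats, _ in stream]
print(predictions)
```

The update touches only the features present in the current example, which is what lets this style of learner keep up with a growing feature space.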

Spin-off: predicting exploitation • Hundreds of vulnerability disclosures each month… but which are most important? • Lots of ad hoc vulnerability severity indices (CVSS) • We treat as machine learning problem • Same model as URL classification • Features: who reported, when, type of vuln, for what system, bag of words model on free text, etc… • Label: eventual exploitation • Accurate predictions (90%+) whether vulnerability will be exploited in next month, etc. (20% better than best static choice) • Vastly better than industry indexes

Bot-based spam filter generation • Observations • Modest number of bots send most spam • Virtually all bots use templates with simple rules to describe polymorphism • Templates+dictionaries ≈ regex describing spam to be generated • If we can extract or infer these from the botnets, we have a perfect filter for all the spam generated by the botnet • Very specific filters, extremely low FP risk random letters and numbers http://www.marshal.com/trace/spam_statistics.asp phrases from a dictionary
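
The observation that templates plus dictionaries amount to a regex can be illustrated with a toy template language. The `{dict:NAME}` and `{rand:N}` macro syntax below is invented for this sketch and is not the format used by any real botnet; the template text and dictionaries are likewise hypothetical.

```python
import re

# Hypothetical template: {dict:NAME} picks a dictionary phrase,
# {rand:N} emits N random letters or digits.
template = "Subject: {dict:greeting}, save on {dict:product} today! ref {rand:6}"
dictionaries = {
    "greeting": ["Hello", "Hi there"],
    "product": ["watches", "pills"],
}

MACRO = re.compile(r"\{(dict|rand):([^}]+)\}")


def template_to_regex(template, dictionaries):
    """Compile a template into a regex matching every message it can generate."""
    parts = []
    position = 0
    for match in MACRO.finditer(template):
        # Literal text between macros must match exactly.
        parts.append(re.escape(template[position:match.start()]))
        kind, argument = match.group(1), match.group(2)
        if kind == "dict":
            alternatives = "|".join(re.escape(p) for p in dictionaries[argument])
            parts.append(f"(?:{alternatives})")
        else:
            parts.append(f"[A-Za-z0-9]{{{int(argument)}}}")
        position = match.end()
    parts.append(re.escape(template[position:]))
    return re.compile("".join(parts))


spam_filter = template_to_regex(template, dictionaries)
print(bool(spam_filter.fullmatch("Subject: Hi there, save on pills today! ref a1B2c3")))
```

Because the regex accepts only what the template can emit, a legitimate message would have to reproduce the template exactly to be caught, which is the intuition behind the low false-positive risk.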

Fully automated algorithm • Almost perfect in testing (~0 false positives, very few false negatives) • Currently in live testing • Open question • Botnet output provides implicit clustering • Can you modify algorithm to work absent these labels?

Summary • We think that the economic structures underlying e-crime may be weaker than their technical vulnerabilities • Much of our research agenda focuses on measuring or inferring quantitative empirical data about this structure • We think technical defenses make sense when they can significantly shrink the window of opportunity • We’re always interested in collaborations in this space

Questions? Collaborative Center for Internet Epidemiology and Defenses http://ccied.org Yahoo!

The spammer’s bottom line • Recall that we tracked the contents of shopping carts • Using the prices on the actual site, we can estimate the value of the purchases • 28 purchases for $2,731 over 25 days, or $100/day ($140 active) • We only interposed on a fraction of the workers • Connected to approx 1.5% of workers • Back-of-the-envelope (be very careful) $7-10k/day for all, or ~$3M/year • With a 50% affiliate commission, $1.5M/year revenue • Not enough to be profitable unless spammer = botnet owner • For self-propagation • Roughly 3-9k new bots/day
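
The back-of-the-envelope extrapolation can be written out step by step. The sketch assumes revenue scales linearly with the fraction of workers observed, which is exactly the kind of assumption the slide warns about; it reproduces the low end of the slide's $7-10k/day range, and the variable names are illustrative.

```python
# Figures from the slide.
observed_revenue_usd = 2731.0   # 28 purchases
observation_days = 25
fraction_of_workers = 0.015     # we interposed on approx 1.5% of workers
affiliate_commission = 0.50

# Assumption: the observed workers are representative, so revenue scales linearly.
per_day_observed = observed_revenue_usd / observation_days
per_day_all_workers = per_day_observed / fraction_of_workers
per_year_all_workers = per_day_all_workers * 365
spammer_share_per_year = per_year_all_workers * affiliate_commission

print(f"observed:    ${per_day_observed:,.0f}/day")
print(f"all workers: ${per_day_all_workers:,.0f}/day, ${per_year_all_workers:,.0f}/year")
print(f"spammer share at 50% commission: ${spammer_share_per_year:,.0f}/year")
```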
