Design and Evaluation of a Real-Time URL Spam Filtering Service

Design and Evaluation of a Real-Time URL Spam Filtering Service. Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security and Privacy 2011. OUTLINE. Introduction - Monarch Related Work System Design Implementation Evaluation Discussion and Conclusion.

Presentation Transcript

Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song

IEEE Symposium on Security and Privacy 2011

OUTLINE
  • Introduction - Monarch
  • Related Work
  • System Design
  • Implementation
  • Evaluation
  • Discussion and Conclusion
Spam URL
  • Advertisement
  • Harmful content
    • Phishing, malware, and scams
  • Use of compromised and fraudulent accounts
    • Email, web services
Monarch
  • Spam URL Filtering as a Service
  • Tens of millions of features
Related Work
  • “Detecting spammers on Twitter” (2010)
    • Post frequency, URLs, friends…
  • “Behind phishing: an examination of phisher modi operandi” (2008)
    • Lexical characteristics of phishing URLs
  • “Cantina: a content-based approach to detecting phishing web sites” (2007)
    • Parse HTML content
System Design

Monarch’s cloud infrastructure

  • URL Aggregation
    • Email providers and Twitter’s streaming API
  • Feature Collection
    • Visits a URL with web browsers to collect page content
System Design (cont.)

Monarch’s cloud infrastructure

  • Feature Extraction
    • Transform the raw data into a sparse feature vector
  • Classification
    • Training and testing by distributed logistic regression
Collect Raw Features – Web Browser

“A taxonomy of JavaScript redirection spam” (2007)

  • Lightweight browser not enough
    • Poor HTML parsing, lack of JavaScript and plugins
  • Instrumented version of Firefox
    • JavaScript enabled
    • Flash and Java installed
    • Visits a URL and monitors a number of details
Raw Features
  • Web Browser
    • Initial URL and Landing URL, Redirects, Sources and Frames
    • HTML Content, Page Links
    • JavaScript Events, Pop-up Windows, Plugins
    • HTTP Headers
  • DNS Resolver
    • Initial, final, and redirect URLs
  • IP Address Analysis
    • City, country, ASN
  • Proxy and Whitelist (200 domains)
Feature Vector
  • Raw Features => sparse feature vector
    • Canonicalize URLs
    • Remove obfuscation
  • Tokenize the text corpus
    • Splitting on non-alphanumeric characters

http://adl.tw/~dada/dada2.php?a=1&b=3

=> domain feature [adl, tw]

path feature [dada, dada2, php]

query parameters feature [a, 1, b, 3]

=> (…, adl:true, adm:false, …, dada:true, …, tw:true, …)

Total: 49,960,691 features (dimensions)

=> (1, 3, a, adl, b, dada, dada2, php, tw)
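The tokenization step on this slide can be illustrated with a small Python sketch. It is only an approximation of the idea, not Monarch's actual extractor: the function names (`tokenize`, `url_features`, `sparse_vector`) are hypothetical, and canonicalization and de-obfuscation are left out.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

def url_features(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }

def sparse_vector(features):
    """Sparse binary representation: the sorted set of tokens that are present.

    Every token that is absent is implicitly false, so only the
    'true' dimensions need to be stored.
    """
    return sorted({tok for toks in features.values() for tok in toks})

# The example URL from the slide.
feats = url_features("http://adl.tw/~dada/dada2.php?a=1&b=3")
vector = sparse_vector(feats)
```

Storing only the tokens that are present is what keeps a vector with tens of millions of possible dimensions manageable.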

Distributed Classifier Design
  • Linear classification
    • $x$: feature vector
    • Determine a weight vector $w$
  • A parallel online learner
    • With regularization to yield a sparse weight vector
  • Labeled data $(x_i, y_i)$, $y_i \in \{-1, +1\}$
  • Testing => $\mathrm{sign}(x \cdot w)$

-1 => non-spam site

1 => spam site
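A minimal Python sketch of linear classification over a sparse binary feature vector follows. The weights and feature names are made up for illustration, and treating a score of exactly zero as non-spam is an assumption of this sketch rather than something the slides specify.

```python
def dot(active_features, weights):
    """Sparse dot product for binary features.

    active_features: set of feature names whose value is true.
    weights: dict mapping feature name -> learned weight (missing means 0).
    """
    return sum(weights.get(name, 0.0) for name in active_features)

def classify(active_features, weights):
    """Return 1 (spam) if the score is positive, otherwise -1 (non-spam)."""
    return 1 if dot(active_features, weights) > 0 else -1

# Hypothetical weights; a real model would learn these from labeled data.
example_weights = {"adl": 1.5, "php": -0.5, "tw": -0.2}
label = classify({"adl", "php"}, example_weights)
```

Because most weights are zero under a sparse model, the score only depends on the handful of features that are both present in the URL and carry a nonzero weight.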

Training the weight vector
  • Logistic Regression
    • With subgradient L1-Regularization
  • $y_i (x_i \cdot w)$ larger => $f(w)$ smaller

(Classification margin, hyperplane)
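For reference, the textbook form of an L1-regularized logistic regression objective is sketched below. The slide does not show the formula, so this is the standard formulation rather than a transcription, and the regularization constant $\lambda$ is an assumed symbol.

$$
f(w) = \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, (x_i \cdot w)\right)\right) + \lambda \lVert w \rVert_1
$$

Each log term shrinks as the margin $y_i (x_i \cdot w)$ grows, which is the relationship the slide notes. The $\lVert w \rVert_1$ penalty pushes many weights to exactly zero, giving a sparse weight vector.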

Data Set and Assumptions
  • 1.25 million spam email URLs
  • 567,784 spam Twitter URLs
  • 9 million non-spam Twitter URLs
  • Checking all Twitter URLs against:
    • Google Safe Browsing, SURBL, URIBL, APWG, PhishTank
    • A URL is treated as spam if any of its source URLs becomes blacklisted
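The labeling rule for Twitter URLs can be sketched in Python as below. This is a simplification under stated assumptions: real blacklists are queried through their own services, whereas here the blacklist is a local set of hostnames, and the function name and example domains are hypothetical.

```python
from urllib.parse import urlsplit

def is_spam(source_urls, blacklisted_hosts):
    """Label a page as spam if any of its source URLs is on a blacklisted host.

    source_urls: every URL encountered while loading the page (assumed input).
    blacklisted_hosts: set of hostnames standing in for the combined blacklists.
    """
    return any(urlsplit(url).hostname in blacklisted_hosts for url in source_urls)

# Hypothetical data for illustration only.
blacklist = {"spam.example"}
chain = ["http://short.example/abc", "http://spam.example/landing"]
flagged = is_spam(chain, blacklist)
```

Checking every source URL, not just the one that was posted, means a page is still caught when it is reached through a shortener or redirect.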
Data Set and Assumptions (cont.)
  • On Twitter:
    • 36% scams, 60% phishing, 4% malware
Implementation
  • Amazon Web Services (AWS) infrastructure
  • URL Aggregation
    • A queue that keeps 300,000 URLs
  • Feature Collection
    • 20x6 Firefox (4.0b4) on Ubuntu 10.04
      • With a custom extension
    • Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
  • Classifier
    • Hadoop Distributed File System
    • On the 50-node cluster
Evaluation – Overall Accuracy
  • 5-fold cross-validation
  • 500,000 spam and non-spam each
  • Training set size: 400,000 examples
    • 1:1, 4:1, 10:1
  • Testing set size: 200,000 examples
    • 1:1
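As a rough illustration of 5-fold cross-validation, the Python sketch below partitions a data set into five folds and uses each fold once as the test set. The function name is hypothetical, and the sketch does not reproduce the training-set downsampling or class ratios listed on the slide.

```python
def k_fold_splits(examples, k=5):
    """Yield (train, test) pairs, using each of the k folds once as the test set."""
    # Assign examples to folds round-robin so fold sizes differ by at most one.
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test

# Tiny stand-in data set; the slide's experiment uses 1,000,000 examples.
data = list(range(10))
splits = list(k_fold_splits(data, k=5))
```

With 500,000 spam and 500,000 non-spam examples, each of the five folds holds 1,000,000 / 5 = 200,000 examples, which matches the testing set size on the slide.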
Evaluation – Accuracy Over Time

Training only once vs. retraining every four days

Evaluation – The Cost
  • For Twitter, $22,751 per month
Discussion and Conclusion
  • Evasion
    • Feature Evasion
    • Time-based Evasion
    • Crawler Evasion
  • Monarch
    • Real-time system
    • Spam URL Filtering as a Service
    • $22,751 a month