
Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”

Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”. Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker. KDD 2009. By Fu-Chi Ao. Questions: What’s the error rate? What are the relevant/dominant features out of the selected 30783 features?



  1. Q&A for “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs” • Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker • KDD 2009 • By Fu-Chi Ao

  2. Questions • What’s the error rate? • What are the relevant/dominant features out of the selected 30783 features? • Indication of TTL values? • How to construct the feature vectors? • What are the 3959 WHOIS information features?

  3. What’s the error rate? (In binary classification) • Accuracy: the proportion of correct results (true positives and true negatives) among all classified examples • Error rate = 1 – Accuracy
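The definition on this slide can be written out as a short calculation. The sketch below is illustrative only: the function name `error_rate` and the confusion-matrix counts are hypothetical and do not come from the paper's experiments.

```python
def error_rate(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the error rate of a binary classifier from its confusion counts.

    Accuracy is (tp + tn) / total, so the error rate, 1 - accuracy,
    equals the fraction of misclassified examples, (fp + fn) / total.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified example is required")
    return (fp + fn) / total


# Hypothetical counts: 1000 URLs, of which 25 are misclassified.
rate = error_rate(tp=480, tn=495, fp=5, fn=20)
accuracy = 1.0 - rate
```

With these made-up counts, 25 of 1000 URLs are misclassified, so the error rate is 2.5% and the accuracy is 97.5%.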

  4. What are the relevant/dominant features out of the selected 30783 features? • [Table: breakdown of features (non-zero, benign, malicious) for L1-regularized LR on an instance of the Yahoo-PhishTank data set] • The training phase for L1-regularized LR yields a sparse parameter vector w • The analysis can therefore focus on the smaller number of relevant features with non-zero weights
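To illustrate why L1 regularization yields a sparse w, here is a minimal sketch on a toy data set. Everything in it is an assumption made for illustration: the four feature names, the four training examples, the penalty `lam`, and the solver. Proximal gradient descent with soft-thresholding is swapped in as a simple solver; the slides do not say which solver the authors used.

```python
import math


def soft_threshold(v: float, t: float) -> float:
    """Shrink v toward zero by t; any value with |v| <= t becomes exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def train_l1_logreg(X, y, lam=0.05, step=1.0, iters=500):
    """Minimise mean logistic loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(malicious)
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # Gradient step on the loss, then the L1 proximal (shrinkage) step.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


# Hypothetical binary features and labels (1 = malicious, 0 = benign).
names = ["tld=.info", "tld=.edu", "token=www", "token=unused"]
X = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
y = [1, 1, 0, 0]

w = train_l1_logreg(X, y)
malicious = [name for name, wj in zip(names, w) if wj > 0]  # positive weight
benign = [name for name, wj in zip(names, w) if wj < 0]     # negative weight
zeroed = [name for name, wj in zip(names, w) if wj == 0.0]  # dropped by L1
```

In this toy setup the uninformative features receive a weight of exactly zero, while the two informative ones keep weights of opposite sign. That three-way split (zero, positive for malicious, negative for benign) is what the breakdown on this slide reports.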

  5. Certain “Red Flags” Indicate Malicious Intent • 1) Suspicious ownership of the site • Benign features: IP ranges belonging to Google, Yahoo and AOL • Malicious features: having an NS record in one of the IP prefixes run by GoDaddy • 2) Where the site is hosted geographically • Top-6 benign features: ‘.gov’, ‘.edu’, ‘.com’, ‘.org’, ‘.ca’ and ‘.se’ • Top-6 malicious features: ‘.info’, ‘.kr’, ‘.it’, ‘.hu’, and ‘.es’ • 3) The registration date of the site • Malicious: a recent registration or update date, or missing any of the three WHOIS dates (registration, update, expiration) • 4) What kind of connection the server is using • Top-2 benign features: T1 connection speed for the DNS A and MX records • Malicious: sites hosted on compromised machines in residential ISPs • 5) The presence of certain URL extensions • "bankofamerica.com" vs. "bankofamerica.com.cz.rnl"
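The last red flag can be made concrete with a small sketch that turns a hostname into token and TLD features. The function name `hostname_features` and the feature-name format are hypothetical; the paper's exact tokenization is not given on the slides.

```python
from urllib.parse import urlsplit


def hostname_features(url: str) -> dict:
    """Return binary features for each hostname token and for the final label (TLD)."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only fills in the hostname after a scheme
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    feats = {f"host_token={label}": 1 for label in labels}
    if labels:
        feats[f"tld={labels[-1]}"] = 1
    return feats


genuine = hostname_features("bankofamerica.com")
lookalike = hostname_features("bankofamerica.com.cz.rnl")

# Tokens the two hostnames share, and the TLD features that tell them apart.
shared = sorted(set(genuine) & set(lookalike))
tld_genuine = [name for name in genuine if name.startswith("tld=")]
tld_lookalike = [name for name in lookalike if name.startswith("tld=")]
```

Both hostnames share the `bankofamerica` and `com` tokens, but only the first one actually ends in `com`. A classifier that sees token and TLD features separately can learn that difference.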

  6. What are the relevant/dominant features out of the selected 30783 features? (cont’d) • Machine learning techniques can adapt to differing feature distributions by learning the appropriate decision rules automatically, rather than requiring the rules to be discovered and adjusted manually for each data set • The experimental results show that different data sets yield different feature distributions for distinguishing malicious from benign URLs

  7. What are the relevant/dominant features out of the selected 30783 features? (cont’d) • Automation of the classifier • It selected malicious and benign features for which domain experts had prior intuition • It also automatically selected new, non-obvious features that were highly predictive and yielded additional, substantial performance improvements

  8. Indication of TTL values? • Feature: “What is the time-to-live (TTL) value for the DNS records associated with the hostname?” • The TTL is set by an authoritative name server for a particular resource record • Low TTL values: • Some well-known larger web sites depend on low TTL values to enable quick changes to their web sites, e.g. “www.cnn.com” • Some small web sites require frequent DNS updates when their IP address changes, e.g. sites run on ADSL or cable connections with dynamic IP addresses

  9. How to construct the feature vectors? • Use the selected features to encode individual URLs as very high-dimensional feature vectors • Most features are generated by the “bag-of-words” representation of the URL, registrar name, and registrant name • Binary features are also used to encode all possible ASes, prefixes and geographic locales of an IP address • The resulting URL descriptors typically have tens of thousands of binary features • Overfitting • It is not known in advance which features are relevant • Only a subset of the generated features may correlate with malicious Web sites • When there are more features than labeled examples → prone to overfitting!

  10. Feature vector construction • Example URL: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll • WHOIS registration: 3/25/2009 • Hosted from 208.78.240.0/22 • IP hosted in San Mateo • Connection speed: T1 • Has DNS PTR record? Yes • Registrant “Chad” • ... • Resulting vector: [ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …] (segments labeled Host-based, Lexical, Real-valued) • No clear illustration for the construction methodology…
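One plausible way to get from the example URL to a vector is sketched below. This is an assumption-laden illustration, not the authors' procedure: the helper names, the feature-name format, the delimiters used for tokenizing, the choice of URL length as the real-valued feature, and the second (benign) URL with its host record are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Bag-of-words over hostname and path tokens, plus one real-valued feature."""
    parts = urlsplit(url)
    feats = {"url_length": float(len(url))}  # real-valued (assumed feature)
    for tok in re.split(r"[.\-]", parts.hostname or ""):
        if tok:
            feats[f"host:{tok}"] = 1.0
    for tok in re.split(r"[/?.=\-_&]", parts.path):
        if tok:
            feats[f"path:{tok}"] = 1.0
    return feats


def host_features(record: dict) -> dict:
    """One binary indicator per observed value of each categorical host property."""
    feats = {f"{key}={record[key]}": 1.0
             for key in ("ip_prefix", "geo", "speed", "registrant")}
    if record["has_ptr"]:
        feats["has_ptr"] = 1.0
    return feats


def to_vector(feats: dict, index: dict) -> list:
    """Place each named feature at its position in a dense vector; the rest stay 0."""
    vec = [0.0] * len(index)
    for name, value in feats.items():
        vec[index[name]] = value
    return vec


url = "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"
record = {"ip_prefix": "208.78.240.0/22", "geo": "San Mateo", "speed": "T1",
          "registrant": "Chad", "has_ptr": True}
feats = {**lexical_features(url), **host_features(record)}

# Hypothetical second URL so that the shared vocabulary produces some zeros.
other = {**lexical_features("http://www.example.edu/index.html"),
         **host_features({"ip_prefix": "192.0.2.0/24", "geo": "Exampleville",
                          "speed": "T1", "registrant": "Example University",
                          "has_ptr": True})}

index = {name: i for i, name in enumerate(sorted(set(feats) | set(other)))}
vec = to_vector(feats, index)
other_vec = to_vector(other, index)
```

The vocabulary (`index`) grows with every distinct token or property value seen in the data, which is how descriptors end up with tens of thousands of mostly-zero binary features. In practice a sparse representation such as the `feats` dictionary itself would be kept instead of the dense list.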

  11. What are the 3959 WHOIS information features? • WHOIS is a distributed database that contains contact and registration information for a domain: • the owner and registrar of the domain (including home page URL) • dates of registration, last update, and expiration • primary and secondary DNS servers • any additional status information of the domain • The features are mainly tokens in the names of the registrar and registrant of the domain name
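As a rough sketch of how such token features could be derived from a WHOIS record, the example below tokenizes the registrar and registrant names and adds indicators for missing dates (the red flag noted on slide 5). The record layout, field names, and the registrar name are hypothetical; the registrant and registration date are taken from the slide 10 example, and the missing dates are made up for illustration.

```python
import re


def whois_features(record: dict) -> dict:
    """Binary features: name tokens plus indicators for missing WHOIS dates."""
    feats = {}
    for field in ("registrar", "registrant"):
        name = (record.get(field) or "").lower()
        for tok in re.findall(r"[a-z0-9]+", name):
            feats[f"{field}:{tok}"] = 1
    for field in ("registered", "updated", "expires"):
        if not record.get(field):
            feats[f"missing:{field}"] = 1
    return feats


# Hypothetical, already-parsed WHOIS record.
record = {
    "registrar": "Example Registrar, Inc.",
    "registrant": "Chad",
    "registered": "3/25/2009",
    "updated": None,
    "expires": None,
}
features = whois_features(record)
```

Because every distinct token in a registrar or registrant name becomes its own feature, a few thousand WHOIS features can accumulate quickly across a large set of domains.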
