
Intelligent Detection of Malicious Script Code

CS194, 2007-08. Benson Luk, Eyal Reuveni, Kamron Farrokh. Advisor: Adnan Darwiche. Sponsored by Symantec.



  1. Intelligent Detection of Malicious Script Code. CS194, 2007-08. Benson Luk, Eyal Reuveni, Kamron Farrokh. Advisor: Adnan Darwiche. Sponsored by Symantec.

  2. Outline for Project
     • Phase I: Setup
       • Set up a machine for the testing environment
       • Ensure that the “whitelist” is clean
     • Phase II: Crawling
       • Modify the crawler to output only necessary data. This means:
         • Grab only the necessary information from web-crawling results
         • Listen in on Internet Explorer’s JavaScript interpreter and output relevant behavior
     • Phase III: Database
       • Research and develop an effective structure for storing data and link it to the web crawler
     • Phase IV: Analysis
       • Research and develop an effective algorithm for learning from massive amounts of data

  3. Completed Tasks – First Quarter
     • Phase I
       • Configured the machine with Norton Antivirus and the Heritrix web crawler
         • The web crawler will be used to grab additional URLs, and Norton Antivirus will be used to verify that a URL has not launched an attack
       • Created a Python script to ensure that visited sites are clean
         • It captures Norton’s web attack logs before and after loading a site in Internet Explorer, compares the logs for new entries, and signals whether the site’s data should be discarded
     • Phase II
       • Configured Heritrix to run specific crawls that target a set of domains and output minimal information
         • The purpose is to gather as many URLs with scripts as possible for a large sample base
       • Created a parser for Heritrix logs to filter out irrelevant websites
         • For example, we omit URLs that point to images, since they will not contain scripts
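The two First Quarter scripts described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes the attack log can be read as a list of text lines, and the function names and the list of image extensions are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of extensions treated as images; the slides only say
# that URLs pointing to images are omitted.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}


def new_log_entries(before_lines, after_lines):
    """Return log lines present after the page load but not before it."""
    seen = set(before_lines)
    return [line for line in after_lines if line not in seen]


def site_is_clean(before_lines, after_lines):
    """Keep a site only if loading it produced no new attack-log entries."""
    return not new_log_entries(before_lines, after_lines)


def may_contain_script(url):
    """Reject URLs whose path ends in one of the assumed image extensions."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)
```

In this sketch a site's data would be discarded whenever `site_is_clean` returns `False`, and only URLs for which `may_contain_script` returns `True` would be passed on to the later phases.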

  4. Completed Tasks – Second Quarter
     • Phase I
       • Whitelist: integrated a Symantec component to check whether a visited site is malicious, so all of the data we gather is from clean sources
       • Hard drive: installed a 750 GB hard drive

  5. Completed Tasks – Second Quarter
     • Phase II
       • Crawling: we ran a shallow crawl with 200 domains as seeds, and that is the current base of our data. The result was 18,500 URLs, which we run through our Script Listening component

  6. Completed Tasks – Second Quarter
     • Phase II
       • Script Listening: received a customizable tool from Symantec that listens to the JavaScript interpreter in Internet Explorer
       • We modified it to output the information we need: GUID -> DISPID -> ArgType -> ArgVal
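A record with those four fields could be read in Python along the following lines. The transcript does not preserve the listener's real output format, so this sketch assumes one record per line with the fields separated by " -> "; the `CallRecord` type and `parse_record` function are hypothetical names.

```python
from typing import NamedTuple


class CallRecord(NamedTuple):
    guid: str      # interface GUID of the object being called
    dispid: int    # dispatch ID of the member being invoked
    arg_type: str  # argument type as reported by the listener
    arg_val: str   # argument value, kept as text


def parse_record(line, sep=" -> "):
    """Parse one 'GUID -> DISPID -> ArgType -> ArgVal' line (assumed layout)."""
    # Split at most three times so an ArgVal containing the separator stays whole.
    parts = line.rstrip("\n").split(sep, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    guid, dispid, arg_type, arg_val = parts
    return CallRecord(guid.strip().lower(), int(dispid), arg_type.strip(), arg_val)
```

For example, `parse_record("3050f55d-98b5-11cf-bb82-00aa00bdce0b -> 1103 -> int -> 42")` would yield a record whose `dispid` is the integer 1103. Negative DispIDs are handled by `int` as well.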

  7. Completed Tasks – Second Quarter Example of data:

  8. Completed Tasks – Second Quarter
     • Phase III
       • The amount of data we have collected is too large to use in a database. The plain text file is 4 GB (~50 million function calls), and querying such a database is too slow on the computer we have.
       • Instead, we store the data as a text file and operate on it with Python scripts.
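One way to work on a multi-gigabyte text file from Python without loading it into memory is to stream it line by line and keep only running totals. The sketch below assumes the same " -> " separated layout as the listener output above, which is an assumption; `count_calls` and `count_calls_in_file` are hypothetical names, not the project's scripts.

```python
from collections import Counter


def count_calls(lines, sep=" -> "):
    """Count records per (GUID, DispID) pair from any iterable of lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(sep)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts[(parts[0].strip(), parts[1].strip())] += 1
    return counts


def count_calls_in_file(path):
    """Stream a large log file; only one line is held in memory at a time."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_calls(handle)
```

Because the file object is iterated lazily, memory use depends on the number of distinct GUID-DispID pairs (a few thousand in the result sets reported here) rather than on the size of the file.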

  9. Results and Findings – Second Quarter
     • Phase IV
       • We have analyzed data from our first two result sets
         • Crawl with 5 initial seeds
           • 3,476,348 function calls
           • 109 distinct GUIDs, 7,364 GUID-DispID pairs
         • Crawl with 15 initial seeds
           • 3,706,454 function calls
           • 95 distinct GUIDs, 5,575 GUID-DispID pairs
       • Looked at the most common functions, the most common int-argument functions, and the distribution of the argument values for these functions
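Summary figures of this kind could be computed in a single pass over the records, as in the sketch below. It assumes each record is a `(guid, dispid, arg_type, arg_val)` tuple and that integer arguments are tagged with the type string `"int"`; both are assumptions, and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(records, int_type="int", top_n=3):
    """Compute totals, distinct counts, and int-argument value distributions."""
    guids = set()
    pairs = set()
    int_calls = Counter()  # records per function that carry an int argument
    int_values = {}        # per-function Counter of the int values seen
    total = 0
    for guid, dispid, arg_type, arg_val in records:
        total += 1
        guids.add(guid)
        pairs.add((guid, dispid))
        if arg_type == int_type:
            key = (guid, dispid)
            int_calls[key] += 1
            int_values.setdefault(key, Counter())[int(arg_val)] += 1
    top = [key for key, _ in int_calls.most_common(top_n)]
    return {
        "calls": total,
        "distinct_guids": len(guids),
        "distinct_pairs": len(pairs),
        "top_int_functions": top,
        "value_distributions": {key: int_values[key] for key in top},
    }
```

The `value_distributions` entry maps each of the most common int-argument functions to a histogram of its argument values, which is the kind of data a per-function distribution plot would be drawn from.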

  10. Results and Findings – Second Quarter
     • Function 1:
       • GUID: 3050f55d-98b5-11cf-bb82-00aa00bdce0b
       • GUID object name: DispHTMLWindow2
       • DispID: 1103
       • Most popular int-argument function in both result sets
       • Mostly random distribution, but signs of regularity
       • Results from the two sets show significant differences

  11. Results and Findings – Second Quarter

  12. Results and Findings – Second Quarter
     • Function 2:
       • GUID: 3050f55f-98b5-11cf-bb82-00aa00bdce0b
       • GUID object name: DispHTMLDocument
       • DispID: 1013
       • Second most popular int-argument function in both result sets
       • Shows a regular distribution with distinct characteristics
       • Results from the two sets show significant differences

  13. Results and Findings – Second Quarter

  14. Results and Findings – Second Quarter
     • Function 3:
       • GUID: 3050f51b-98b5-11cf-bb82-00aa00bdce0b
       • GUID object name: DispHTMLIFrame
       • DispID: -2147418107
       • Third most popular int-argument function in the 1st result set, 95th most popular in the 2nd result set
       • Shows a random distribution with distinct characteristics
       • Results are dramatically different between the data sets
       • All arguments in the 2nd result set are 0

  15. Results and Findings – Second Quarter

  16. Results and Findings – Second Quarter
     • Found significant differences between the data sets in both the frequencies of specific functions and the arguments of specific functions
     • We suspect that the differences result from biases due to the small number of original seeds (5 and 15)
     • Ran a much broader crawl (200 seeds) in hopes of getting more general, unbiased results
     • Just from partial results of this crawl (roughly 8,000 websites), we have so far found:
       • A much larger average number of calls to our listener per website
       • A large percentage of function calls that take 0 arguments
     • Will post complete results once the crawl is finished
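Differences in function frequencies between two result sets could be quantified as sketched below. The slides do not say how the comparison was made; the rank lookup mirrors the "3rd versus 95th most popular" style of observation, while total variation distance is an illustrative metric chosen here and not one taken from the presentation. All function names are hypothetical.

```python
from collections import Counter


def rank_of(counts, key):
    """1-based rank of key by descending count, or None if it never occurs."""
    if key not in counts:
        return None
    ordered = [k for k, _ in counts.most_common()]
    return ordered.index(key) + 1


def relative_frequencies(counts):
    """Convert raw counts to fractions of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def total_variation_distance(counts_a, counts_b):
    """Half the summed absolute difference in relative frequency (0 to 1)."""
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    keys = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)
```

A distance near 0 would indicate that two crawls call the same functions in similar proportions, and a value near 1 would indicate almost no overlap. Computing it for the 5-seed, 15-seed, and 200-seed crawls is one possible way to check whether the differences shrink as the seed set grows.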

  17. Direction for Next Quarter
     • Further analyze the gathered data for patterns
     • Compare trends in “normal” data to what occurs in malicious scripts
