
Presentation Transcript


  1. Statistical Tools for Linking Engine-generated Malware to its Engine
Edna C. Milgo, M.S. Student in Applied Computer Science
TSYS School of Computer Science, Columbus State University
November 19th, 2009

  2. Malware
• Classes: Viruses, Worms, Trojans
• Malware-generating engines: script kiddies, morphers, metamorphic engines, and virus-generating toolkits
• State of the threat:
  • Anti-virus firms must analyze between 15,000 and 20,000 new malware instances a day. [AV Test Lab 2007, McAfee2009]
  • 1.6 million malware instances were detected in 2008. [F-Secure2008]
  • Professionals are being recruited to write stealthier malware. [ESET2008]
  • Automation is being used to generate malware. [ESET2008]
  • Generic detection performs poorly: 630 out of 1,000 new malware instances went undetected. [Team-cymru2008]

  3. Malware is Hard to Detect
• Malware can cause analyses to stop early in the program analysis pipeline.
• Static program analysis:
  • May be imprecise and inefficient (e.g., def-use analysis).
  • May be challenged by obfuscation (e.g., dead-code insertion).
• Dynamic program analysis:
  • May be challenged by malware that tests the patience of the emulator. [Aycock2005]
[Figure: the program analysis pipeline (Entry, Disassembly, Extract Procedures, Control Flow Graphs, Signature Verification, Malicious/Benign verdict, Exit), with an obfuscation example that swaps the order of two procedure calls behind the always-true opaque predicate if (x*x*(x+1)*(x+1) % 4 == 0).]

  4. Engine-Generated Malware
• An engine generates new variants at a high rate.
• Malware detectors typically store one signature per variant.
• Too many signatures challenge the detector.
[Figure: the ENGINE takes Variant 0 IN and emits Variant 1, Variant 2, Variant 3, …, Variant n OUT, all of which face the MALWARE DETECTOR.]

  5. Proposal: View the Engine as an Author
• Goal 1: Reduce the number of steps required in the program analysis pipeline.
• Goal 2: Eliminate the need for a signature per variant.
• Goal 3: Remain satisfactorily accurate.
• Proposed model [Chouchane2006]: the approach was inspired by Keselj's work on authorship analysis of natural text produced by humans. [V. Keselj2003]
[Figure: the engine's output variants are matched against a single Engine Signature in the MALWARE DETECTOR. Source: Google Images]

  6. Feature 1: Instruction Frequency Vector (IFV)
• Example program P (opcode sequence): add, push, pop, add, and, jmp, pop, and, mov, jmp, mov, push, jmp, jmp, push, jmp, add, pop, mov, add, mov, push, jmp, mov, mov, jmp, push
• IFV(P) records the count of each opcode: add 4, push 5, pop 3, and 2, jmp 7, mov 6.
• The normalized IFV(P) divides each count by the 27 opcodes in P: (0.148, 0.185, 0.111, 0.074, 0.259, 0.222). A sketch of the computation follows below.
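
The following minimal sketch (ours, not from the slides) shows how a normalized IFV can be computed from an opcode list; the function name is illustrative. Running it on the example program P reproduces the counts listed above.

    from collections import Counter

    def normalized_ifv(opcodes, instruction_set):
        # Count each tracked opcode, then divide by the sequence length.
        counts = Counter(opcodes)
        return [counts[op] / len(opcodes) for op in instruction_set]

    P = ("add push pop add and jmp pop and mov jmp mov push jmp jmp push "
         "jmp add pop mov add mov push jmp mov mov jmp push").split()
    print(normalized_ifv(P, ["add", "push", "pop", "and", "jmp", "mov"]))
    # -> [0.148, 0.185, 0.111, 0.074, 0.259, 0.222] (rounded)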

  7. IFV Classification
Steps:
• Given a sample of malicious programs and a sample of benign ones, select a set of trainers from each sample.
• Compute the IFVs of all trainers.
• Choose a threshold ε.
• Input: IFV_suspect, where 'suspect' is a program that is not among the trainers.
• Count the number of malicious training IFVs within ε of IFV_suspect.
• Count the number of benign training IFVs within ε of IFV_suspect.
• Output: the family with the most trainers within ε of IFV_suspect is declared to be that of the suspect program. If there is a tie, pick one at random. (A sketch of this rule follows below.)
[Figure: the suspect program's IFV is compared, under a distance measure, against the malicious and benign trainers falling within the threshold ε.]
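
A minimal sketch of this decision rule, assuming Euclidean distance (the slides leave the distance measure open); function names are ours:

    import math
    import random

    def classify_ifv(ifv_suspect, malicious_ifvs, benign_ifvs, epsilon):
        # Count the trainers of each class whose IFV lies within epsilon
        # of the suspect's IFV, then declare the majority class.
        n_mal = sum(math.dist(t, ifv_suspect) <= epsilon for t in malicious_ifvs)
        n_ben = sum(math.dist(t, ifv_suspect) <= epsilon for t in benign_ifvs)
        if n_mal == n_ben:
            return random.choice(["malicious", "benign"])  # tie: pick at random
        return "malicious" if n_mal > n_ben else "benign"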

  8. Experimental Setup
• Metamorphic malware (from vx.netlux.org): W32.Simile (100 samples)
• Benign programs (from download.com, sourceforge.net): 100 samples
Thanks to Jessica Turner for extracting the original variant of W32.Simile.

  9. Classifying W32.Simile vs. Benign Programs
• RI is the number of instructions considered in the IFV.
• For RI = 4 and 0.1 ≤ ε ≤ 0.7: 98% ≤ accuracy ≤ 100%.
• For RI = 5 and 0.1 ≤ ε ≤ 0.7: 96% ≤ accuracy ≤ 100%.
• Very small signatures (4 and 5 doubles per IFV).
• But this scheme still does not use a single signature per family.

  10. Feature 2: N-gram Frequency Vector (NFV)
• Example program P (opcode sequence): add, push, call, pop, call, add, push, call, pop, call, add, mov, add, add, mov, add, add, mov, add, push, call, push, call, call, pop, call, push, mov, add, mov, add, push, call, pop, pop, call, pop, call, pop, call, mov, add, mov, add
• NFV(P) records the count of each opcode n-gram; the normalized NFV(P) divides each count by the total number of n-grams in P. A sketch of the computation follows below.
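
A minimal sketch (ours) of n-gram extraction and normalization; bigrams (n = 2) are the case the later slides use:

    from collections import Counter

    def normalized_nfv(opcodes, n=2, vocabulary=None):
        # Slide a window of length n over the opcode sequence, count each
        # n-gram, and normalize by the total number of n-grams.
        ngrams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
        counts = Counter(ngrams)
        if vocabulary is None:
            vocabulary = sorted(counts)  # default: every n-gram seen, in order
        return [counts[g] / len(ngrams) for g in vocabulary]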

  11. N-Gram Authorship Attribution (Proposed)
Steps:
• Choose a set of trainers from each of the families.
• For each family, compute the average of the NFVs of the family's trainers to create a Family Signature (FS) for that family.
• Input: NFV_suspect, where 'suspect' is a program that is not among the trainers.
• Compute the distance between each of the FSs and NFV_suspect.
• Output: the suspect program is classified as a member of the family whose FS is at the shortest distance. If there are ties, choose one at random. (A sketch follows below.)
[Figure: the malware detector computes the distances D_B, D_S, D_E, D_V, D_N between NFV_suspect and the family signatures FS_B, FS_S, FS_E, FS_V, FS_N under a distance measure, and predicts the family achieving MIN(D_B, D_S, D_E, D_V, D_N).]
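
A minimal sketch of signature construction and the nearest-signature rule, again assuming Euclidean distance:

    import math
    import random

    def family_signature(training_nfvs):
        # Component-wise average of a family's training NFVs.
        return [sum(col) / len(training_nfvs) for col in zip(*training_nfvs)]

    def classify_by_signature(nfv_suspect, signatures):
        # signatures maps a family name to its FS; pick the closest FS,
        # breaking ties at random.
        dists = {fam: math.dist(fs, nfv_suspect) for fam, fs in signatures.items()}
        best = min(dists.values())
        return random.choice([fam for fam, d in dists.items() if d == best])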

  12. k-NN Classification
Steps:
• Given a sample of malicious programs and a sample of benign ones, select a set of trainers from each sample.
• Choose k > 0.
• Input: NFV_suspect of the suspect program.
• Find the k closest training NFVs (the neighbors of NFV_suspect).
• Output: the suspect program is classified as a member of the family with the most neighbors. If there are ties, choose one at random. (See the sketch after this slide.)
[Figure: NFV_suspect plotted among NGVCK, Evol, VCL, Simile, and Benign training NFVs under the distance measure.]
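
A minimal k-NN sketch under the same assumed distance; trainers carry their family labels:

    import math
    import random
    from collections import Counter

    def knn_classify(nfv_suspect, trainers, k):
        # trainers is a list of (family_label, nfv) pairs. Take the k
        # nearest, vote by family, and break ties at random.
        nearest = sorted(trainers, key=lambda t: math.dist(t[1], nfv_suspect))[:k]
        votes = Counter(label for label, _ in nearest)
        top = max(votes.values())
        return random.choice([fam for fam, v in votes.items() if v == top])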

  13. Experimental Setup
• Metamorphic malware (from vx.netlux.org): W32.Simile + W32.Evol (100 samples each)
• Malware generation toolkits (from vx.netlux.org): VCL + NGVCK (100 samples each)
• Benign programs (from download.com, sourceforge.net): 100 samples
Thanks to Yasmine Kandissounon for collecting the NGVCK and VCL variants.

  14. Ten-fold Cross Validation
• Divide each family into a training set of 90 instances and a testing set of 10 instances.
• Perform 10-fold cross validation, using a new testing set and a new training set each time.
• The cross-validation accuracy is the average over all ten validation accuracies. (A sketch of the protocol follows below.)
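
A sketch of this protocol for families of 100 samples each; classify stands for either of the classifiers sketched above:

    def ten_fold_accuracy(family_samples, classify):
        # family_samples maps a family name to its list of 100 feature
        # vectors; classify(vector, trainers) returns a family label.
        fold_accuracies = []
        for fold in range(10):
            trainers, tests = [], []
            for fam, samples in family_samples.items():
                for i, vec in enumerate(samples):
                    held_out = fold * 10 <= i < (fold + 1) * 10  # 10 per fold
                    (tests if held_out else trainers).append((fam, vec))
            correct = sum(classify(vec, trainers) == fam for fam, vec in tests)
            fold_accuracies.append(correct / len(tests))
        return sum(fold_accuracies) / 10  # average of the ten fold accuracies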

  15. Bigram Selection (Relevant Instructions)
• RI is the number of most relevant instructions across the samples used to construct the features.
• Best accuracy: 85%, obtained for RI = 3, RI = 4, and RI = 9.

  16. Bigram Selection (Relevant Bigrams)
• RB is the number of most relevant bigrams across the samples used to construct the features.
• Best accuracy: 95%, obtained for 17 doubles.
• Accuracies of 94.8% for 6, 8, and 14 doubles. (A feature-selection sketch, under an assumed relevance criterion, follows below.)
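
The slides do not state how relevance is measured. As one plausible reading, this sketch ranks bigrams by their total frequency across all training samples and keeps the top RB; the original selection method may differ:

    from collections import Counter

    def top_bigrams(training_programs, rb):
        # Assumed criterion: rank bigrams by total frequency across the
        # training opcode sequences, then keep the RB most common.
        counts = Counter()
        for opcodes in training_programs:
            counts.update(zip(opcodes, opcodes[1:]))
        return [bigram for bigram, _ in counts.most_common(rb)]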

  17. Successful Evaluation
• A single, small family signature of 17 doubles for each family induced a 95% detection accuracy:
  • W32.Simile's engine signature = (0.190, 0.030, 0.155, 0.048, 0.043, 0.057, 0.063, 0.020, 0.076, 0.022, 0.0, 0.041, 0.109, 0.0, 0.122, 0.022, 0.0)
  • W32.Evol's engine signature = (0.074, 0.026, 0.006, 0.326, 0.208, 0.014, 0.024, 0.073, 0.043, 0.048, 0.0, 0.071, 0.042, 0.0, 0.026, 0.019, 0.0)
  • W32.VCL's engine signature = (0.111, 0.238, 0.142, 0.027, 0.076, 0.063, 0.063, 0.033, 0.009, 0.018, 0.018, 0.054, 0.042, 0.0, 0.040, 0.052, 0.013)
  • W32.NGVCK's engine signature = (0.132, 0.113, 0.106, 0.048, 0.203, 0.018, 0.055, 0.038, 0.022, 0.017, 0.070, 0.122, 0.007, 0.0, 0.007, 0.020, 0.017)
  • Benign's "engine signature" = (0.165, 0.173, 0.091, 0.061, 0.052, 0.060, 0.052, 0.046, 0.060, 0.028, 0.019, 0.043, 0.024, 0.029, 0.02, 0.031, 0.029)
• A single, small family signature of 6 doubles for each family induced a 94.8% detection accuracy:
  • W32.Simile's engine signature = (0.362, 0.058, 0.295, 0.093, 0.082, 0.110)
  • W32.Evol's engine signature = (0.113, 0.039, 0.010, 0.497, 0.319, 0.021)
  • W32.VCL's engine signature = (0.176, 0.358, 0.212, 0.041, 0.115, 0.100)
  • W32.NGVCK's engine signature = (0.212, 0.182, 0.171, 0.078, 0.327, 0.029)
  • Benign's "engine signature" = (0.265, 0.279, 0.147, 0.102, 0.098, 0.098)
(A toy application of the 6-double signatures follows below.)
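
Because the 6-double signatures are listed in full above, the nearest-signature rule can be run against them directly. A toy example, assuming Euclidean distance and a made-up suspect NFV:

    import math

    SIGNATURES = {
        "W32.Simile": (0.362, 0.058, 0.295, 0.093, 0.082, 0.110),
        "W32.Evol":   (0.113, 0.039, 0.010, 0.497, 0.319, 0.021),
        "W32.VCL":    (0.176, 0.358, 0.212, 0.041, 0.115, 0.100),
        "W32.NGVCK":  (0.212, 0.182, 0.171, 0.078, 0.327, 0.029),
        "Benign":     (0.265, 0.279, 0.147, 0.102, 0.098, 0.098),
    }

    def nearest_family(nfv):
        # Return the family whose signature is closest to the suspect NFV.
        return min(SIGNATURES, key=lambda fam: math.dist(SIGNATURES[fam], nfv))

    print(nearest_family((0.35, 0.06, 0.28, 0.10, 0.09, 0.12)))  # hypothetical NFV
    # -> W32.Simile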

  18. Successful Evaluation cont'd: Re-examining Our Goals
• Goal 1: Simplified analysis. Analysis involves only the disassembly and signature-verification stages of the program analysis pipeline.
• Goal 2: One signature per family (the Family Signature).
• Goal 3: Accuracy of 95% using only 17 doubles as a signature.
[Figure: the malware-generating engine's variants (Variant 1 through Variant n) are matched against a single engine signature (ES) in the MALWARE DETECTOR.]

  19. Directions for Future Work
• Experiment with other malware instances and families.
• Address the scalability issue.
• Experiment with other feature selection methods. Could we do "better" than 95% for a signature of 17 doubles?
• Try other classifiers and other distance measures.
• Try byte NFVs instead of opcode NFVs, to take into account malware that comes only as a binary.
• Import existing forensic-linguistics methods into malware detection.

  20. References
• A paper documenting this work has been submitted for possible publication to the Journal in Computer Virology.
• E. Milgo. A Fast Approximate Detection of Win32.Simile Malware. Columbus State University Colloquium Series, Feb. 2009; Best Paper Award, 2nd place, Masters category, ACM MidSE, 2008.
• M. R. Chouchane and A. Lakhotia. Using Engine Signature to Detect Metamorphic Malware. WORM, 2006.
• M. R. Chouchane. Approximate Detection of Machine-morphed Malware. Ph.D. Dissertation, University of Louisiana at Lafayette, 2008.
• P. Ször. The Art of Computer Virus Research and Defense. 2005.
• J. D. Aycock. Computer Viruses and Malware. 2005.
• V. Keselj, F. Peng, N. Cercone, and C. Thomas. N-gram-based Author Profiles for Authorship Attribution. PACLING, 2003.
• T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan. N-gram-based Detection of New Malicious Code. COMPSAC, 2004.
• http://www.av-test.org/, 2007.
• http://resources.mcafee.com/content/AvertReportQ109, 2009.
• http://www.eset.com, 2008.
• http://www.f-secure.com/en_US/, 2008.
• http://www.team-cymru.org, 2008.
