
Configuration Fuzzing for Software Vulnerability Detection


Presentation Transcript


  1. Configuration Fuzzing for Software Vulnerability Detection Huning Dai, Chris Murphy, Gail Kaiser Columbia University

  2. Introduction • 1. The importance of Security Testing • 2. Existing Problems

  3. Related Work • Former solutions: 1. Fuzz Testing • Drawbacks: A. Randomly generated inputs may fail to satisfy syntactic constraints. B. It is hard to evaluate how much of the input/configuration space is explored. C. Limited information about the "failure".

  4. Related Work • Former solutions: 1. Fuzz Testing 2. White-box Fuzzing • Drawbacks: A. Randomly generated inputs may fail to satisfy syntactic constraints. (Fixed) B. It is hard to evaluate how much of the input/configuration space is explored. C. Limited information about the "failure". D. Overhead.

  5. Our Solution • Configuration Fuzzing A. Instead of generating random inputs, Configuration Fuzzing mutates the application configuration using a covering array algorithm. B. To increase effectiveness, Configuration Fuzzing tests are carried out “In Vivo” after the software is released, with real-world inputs and the real runtime environment. C. Instead of only checking for failure, surveillance functions are run throughout the tests; these functions check for violations of “security invariants” and log detailed information.
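As a rough illustration of points A and C, the sketch below enumerates configurations and runs a surveillance-style check of one security invariant. The option names, the invariant, and all function names are hypothetical and are not taken from the ConFu implementation; exhaustive enumeration stands in for the covering-array algorithm.

```python
# Hypothetical sketch; option names and the invariant are illustrative, not ConFu's.
import itertools
import logging
import os
import stat

# Assumed configuration options for an imaginary application under test.
CONFIG_OPTIONS = {
    "follow_symlinks": [True, False],
    "tmp_dir": ["/tmp", "./tmp"],
    "umask_lenient": [True, False],
}

def configurations():
    """Exhaustively enumerate configurations (a covering-array algorithm would
    instead select a small subset that still covers all pairs of option values)."""
    keys = sorted(CONFIG_OPTIONS)
    for values in itertools.product(*(CONFIG_OPTIONS[k] for k in keys)):
        yield dict(zip(keys, values))

def check_no_world_writable(path, config):
    """Example 'security invariant': files created by the application must not be
    world-writable. Violations are logged with context, not just flagged as failures."""
    if os.stat(path).st_mode & stat.S_IWOTH:
        logging.error("invariant violated: %s is world-writable under %s", path, config)
```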

  6. Overview • Background • Model • ConFu Framework • Evaluation and Observations • Limitations and Conclusions

  7. Supervised Machine Learning • Data sets consist of a number of examples, each of which has attributes and a label • In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label • In the second phase, the model is applied to a previously-unseen data set with unknown labels to produce a classification (or, in our case, a ranking)
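As a minimal illustration of these two phases (not the authors' code), the sketch below trains a scikit-learn logistic regression on synthetic labeled data and then applies it to a previously-unseen set to produce a ranking by predicted score.

```python
# Minimal two-phase supervised learning sketch; the learner and data are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((100, 7))                    # attributes
y_train = (X_train[:, 0] > 0.5).astype(int)       # labels

model = LogisticRegression().fit(X_train, y_train)    # phase 1: training

X_unseen = rng.random((20, 7))
scores = model.predict_proba(X_unseen)[:, 1]           # phase 2: apply the model
ranking = np.argsort(-scores)    # rank the previously-unseen examples by score
```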

  8. Related Work – Machine Learning • There has been much research into applying Machine Learning techniques to software testing, but not the other way around • Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing an implementation’s correctness

  9. Related Work – Random Testing • Parameterization generally refers to specifying data type or range of values • Our work differs from that of Thévenod-Fosse et al. [’91] on “structural statistical testing”, which focuses on path selection and coverage testing, not system testing

  10. More Related Work • Wichmann [’98] noted the role randomization could play within partition testing but focuses only on small reusable components, not system-level testing • Mayer et al. [’04] investigated the use of random testing for applications that have no test oracle, but focused on randomized software

  11. The Problem • How can we generate data sets that will reveal defects in the software given that we do not have a reliable test oracle? • We need to restrict random test data generation

  12. Analyzing the Problem Domain • Consider properties of data sets in general • Data set size: number of attributes and examples • Range of values: attributes and labels • Precision of floating-point numbers • Whether values can repeat • Consider properties of real-world data sets in the domain of interest • How alphanumeric (“categorical”) attributes are to be interpreted • Whether data values might be missing

  13. Equivalence Partitions • Data sizes of different orders of magnitude • Repeating vs. non-repeating attribute values • Missing vs. non-missing attribute values • Categorical vs. non-categorical data • 0/1 labels vs. non-negative integer labels • Predictable vs. non-predictable data sets • Used data set generator to parameterize test case selection criteria
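These partitions can be combined mechanically to parameterize test case selection. The sketch below enumerates all combinations of partition choices; the dictionary keys and values are illustrative names, not identifiers from the authors' framework.

```python
# Illustrative sketch: enumerate combinations of equivalence-partition choices.
from itertools import product

partitions = {
    "size": ["small", "medium", "large"],   # data sizes of different orders of magnitude
    "repeats": [True, False],
    "missing": [True, False],
    "categorical": [True, False],
    "labels": ["zero_one", "nonneg_int"],
    "predictable": [True, False],
}

test_cases = [dict(zip(partitions, combo)) for combo in product(*partitions.values())]
print(len(test_cases), "parameter combinations")   # 3*2*2*2*2*2 = 96
```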

  14. How Data Are Generated • M attributes and N examples • No-repeat mode: • Generate a list of integers from 1 to M*N and then randomly permute them • Repeat mode: • Each value in the data set is simply a random integer between 1 and M*N • Values can also randomly be unknown
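A sketch of this generation scheme follows; the function and parameter names are mine, not the tool's.

```python
# Sketch of the described generation scheme (names are illustrative).
import random

def generate_values(num_attrs, num_examples, repeats=True, missing_rate=0.2, seed=None):
    rng = random.Random(seed)
    total = num_attrs * num_examples
    if repeats:
        # repeat mode: each value is a random integer in [1, M*N]
        values = [rng.randint(1, total) for _ in range(total)]
    else:
        # no-repeat mode: a random permutation of 1..M*N
        values = list(range(1, total + 1))
        rng.shuffle(values)
    # values can also randomly be unknown ('?')
    values = [v if rng.random() >= missing_rate else "?" for v in values]
    # reshape into N examples (rows) of M attributes (columns)
    return [values[i * num_attrs:(i + 1) * num_attrs] for i in range(num_examples)]
```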

  15. Generating Labels • Specify percentage of “positive examples” to include in the data set • positive examples have a label of 1 • negative examples have a label of 0 • Each label is randomly assigned • Framework guarantees that the number of positive examples comes out to the right distribution • Labels are never unknown/missing
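A sketch of label generation matching this description (illustrative code, not the actual generator): the requested positive/negative split is guaranteed exactly, while each individual label is assigned at random by shuffling.

```python
# Sketch: labels with an exact positive/negative distribution, randomly assigned.
import random

def generate_labels(num_examples, pct_positive, seed=None):
    rng = random.Random(seed)
    num_pos = round(num_examples * pct_positive / 100.0)
    labels = [1] * num_pos + [0] * (num_examples - num_pos)
    rng.shuffle(labels)      # random assignment, exact overall distribution
    return labels            # labels are never unknown/missing
```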

  16. Categorical Data • For some alphanumeric attributes, expand K distinct values to K attributes • Same as data pre-processing in the real-world ranking applications of interest • Input parameter to data generation tool is of the format (a1, a2, ..., aK-1, aK, m) • a1 through aK represent the percentage distribution of those values for the categorical attribute • m is the percentage of unknown values
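A sketch of this expansion follows, assuming the last entry of the parameter tuple is the unknown percentage m and the preceding entries are the value distribution; the names are illustrative.

```python
# Illustrative sketch of categorical expansion; assumes the last entry of `params`
# is m (percent unknown) and the rest are the K value percentages.
import random

def generate_categorical(num_examples, params, seed=None):
    *dist, missing_pct = params      # e.g. (40, 10, 40, 10) -> dist=[40, 10, 40], m=10
    k = len(dist)                    # K distinct values -> K expanded 0/1 columns
    rng = random.Random(seed)
    rows = []
    for _ in range(num_examples):
        if rng.random() < missing_pct / 100.0:
            rows.append(["?"] * k)                               # unknown value
        else:
            choice = rng.choices(range(k), weights=dist)[0]
            rows.append([1 if i == choice else 0 for i in range(k)])
    return rows
```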

  17. Data Set Generator - Parameters • # of examples • # of attributes • % positive examples (label = 1) • % missing • any categorical data • repeat/no-repeat modes • Also, plug-replaceable modules for different data formats

  18. Sample Data Set • 10 examples, 7 attributes, 40% positive examples, 20% missing, (40,10,40,10) categorical attribute, repeats allowed
  27,81,88,59, ?,16,88, 1, 0, 0,0
  15,70,91,41, ?, 3, ?, ?, ?, ?,0
  82, ?,51,47, ?, 4, 1, 0, 0, 1,0
  22,72,11, ?,96,24,44, 1, 0, 0,1
   ?,77, ?,86,89,77,61, 0, 0, 1,1
  76,11, 4,51,43, ?,79, 0, 0, 1,0
   6,33, ?, ?,52,63,94, 1, 0, 0,0
  77,36,91, ?,47, 3,85, 0, 0, 1,1
   ?,17,15, 2,90,70, ?, 0, 1, 0,0
   8,58,42,41,74,87,68, 1, 0, 0,1
  (Each row is one example: the first seven columns are the attributes, the next three are the expanded categorical attribute, and the last column is the label.)

  19. Sample Data Sets • 10 examples, 7 attributes, 40% positive examples, 20% missing, (40,10,40,10) categorical attribute, repeats allowed
  First data set:
  27,81,88,59, ?,16,88, 1, 0, 0,0
  15,70,91,41, ?, 3, ?, ?, ?, ?,0
  82, ?,51,47, ?, 4, 1, 0, 0, 1,0
  22,72,11, ?,96,24,44, 1, 0, 0,1
   ?,77, ?,86,89,77,61, 0, 0, 1,1
  76,11, 4,51,43, ?,79, 0, 0, 1,0
   6,33, ?, ?,52,63,94, 1, 0, 0,0
  77,36,91, ?,47, 3,85, 0, 0, 1,1
   ?,17,15, 2,90,70, ?, 0, 1, 0,0
   8,58,42,41,74,87,68, 1, 0, 0,1
  Second data set (generated with the same parameters):
  35, 3,20,41,91, ?,32, 1, 0, 0,1
  19,50,11,57,36,94, ?, 0, 0, 1,1
  24,36,36,79,78,33,34, 0, 0, 1,0
   ?,15, ?,19,65,80,17, 1, 0, 0,0
  40,31,89,50,83,55,25, ?, ?, ?,1
  52, ?, ?, ?, ?,39,79, 0, 1, 0,0
  86,45, ?, ?,74,68,13, 0, 0, 1,0
   ?,53,91,23,11, ?,47, 1, 0, 0,0
  77,11,34,44,92, ?,63, 1, 0, 0,1
  21, 1,70, ?,16,40,63, 0, 0, 1,0

  20. The Testing Framework • Data set generator • Model comparison tool • Ranking comparison: includes metrics like normalized equivalence and AUCs • Tracing options: for generating and comparing outputs of debugging statements
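One of the ranking-comparison metrics, AUC, can be computed as in the sketch below; the "normalized equivalence" metric is defined in the authors' tech report and is not reproduced here. The labels and scores are made-up example values.

```python
# Sketch: comparing two rankings against the same 0/1 labels via AUC.
from sklearn.metrics import roc_auc_score

labels   = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]                        # true 0/1 labels
scores_a = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1, 0.3, 0.6, 0.2, 0.5]    # ranking A scores
scores_b = [0.5, 0.6, 0.4, 0.8, 0.7, 0.1, 0.3, 0.9, 0.2, 0.5]    # ranking B scores

print("AUC of ranking A:", roc_auc_score(labels, scores_a))
print("AUC of ranking B:", roc_auc_score(labels, scores_b))
```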

  21. Findings • Testing approach and framework were developed for one ML application (MartiRank) then applied to another (SVM-Light) • Only the findings most related to parameterized random testing are presented here • More details and case studies about the testing of MartiRank can be found in our tech report

  22. MartiRank and SVM • MartiRank was specifically designed for the real-world device failure application • Sorts each column (attribute) to find the best result, then the data are segmented and the process is repeated • SVM is typically a classification algorithm • Seeks to find a hyperplane that separates examples from different classes • SVM-Light has a ranking mode based on the distance from the hyperplane
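The sketch below is a highly simplified interpretation of the greedy step described for MartiRank, under the assumption that the "best result" of sorting on an attribute is measured by AUC; it is for illustration only and omits the segmentation and repetition that the real algorithm performs.

```python
# Simplified interpretation (not MartiRank itself): try sorting on each attribute,
# in each direction, and keep the attribute whose ordering ranks positives best.
import numpy as np
from sklearn.metrics import roc_auc_score

def best_single_attribute_sort(X, y):
    best = None
    for j in range(X.shape[1]):
        for direction in (1, -1):                     # ascending and descending
            auc = roc_auc_score(y, direction * X[:, j])
            if best is None or auc > best[0]:
                best = (auc, j, direction)
    return best                                       # (auc, attribute index, direction)

rng = np.random.default_rng(1)
X = rng.random((50, 7))
y = (X[:, 3] > np.median(X[:, 3])).astype(int)        # synthetic labels tied to attribute 3
print(best_single_attribute_sort(X, y))
```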

  23. Issue #1: Repeating Values • One version of MartiRank did not use “stable” sorting • Example: two examples tie on the sort attribute (both have the value 3 in the fourth column):
  91,41,19, 3,57,11,20,64,0
  36,73,47, 3,85,71,35,45,1
  • A stable sort keeps the tied rows in their original relative order:
  91,41,19, 3,57,11,20,64,0
  36,73,47, 3,85,71,35,45,1
  • An unstable sort may swap them, producing a different ranking:
  36,73,47, 3,85,71,35,45,1
  91,41,19, 3,57,11,20,64,0
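To illustrate the stability point (a generic demonstration, not MartiRank code), the sketch below sorts keys with many repeated values using NumPy's stable sort and its default quicksort, and checks whether tied rows end up in the same order.

```python
# Demonstration of why sort stability matters when keys repeat.
import numpy as np

rng = np.random.default_rng(0)
keys = rng.integers(0, 5, size=1000)            # many repeated attribute values

order_stable = np.argsort(keys, kind="stable")
order_unstable = np.argsort(keys, kind="quicksort")

# Both orders sort the keys correctly...
assert np.array_equal(keys[order_stable], keys[order_unstable])
# ...but tied rows may be placed differently, which changes the final ranking.
print("identical row order:", np.array_equal(order_stable, order_unstable))
```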

  24. Issue #2: Sparse Data Sets • How to sort on an attribute with missing values is not addressed in the MartiRank specification • Possible strategies: sort “around” missing values, randomly place missing values, or put missing values at the end
  Original data set:
  41,91, ?,32,11,43, ?,1
  19,65,80,17,78,46, ?,0
  79,78,33,34, ?,31, ?,0
   ?, ?,39,79,82,94, ?,0
  50,83,55,25, ?, ?,45,1
  57,36,94, ?,96, 7,23,1
  Three resulting orderings of the same examples, one per strategy:
  41,91, ?,32,11,43, ?,1
  57,36,94, ?,96, 7,23,1
  79,78,33,34, ?,31, ?,0
  19,65,80,17,78,46, ?,0
  50,83,55,25, ?, ?,45,1
   ?, ?,39,79,82,94, ?,0

  41,91, ?,32,11,43, ?,1
  19,65,80,17,78,46, ?,0
   ?, ?,39,79,82,94, ?,0
  57,36,94, ?,96, 7,23,1
  79,78,33,34, ?,31, ?,0
  50,83,55,25, ?, ?,45,1

  41,91, ?,32,11,43, ?,1
  50,83,55,25, ?, ?,45,1
  19,65,80,17,78,46, ?,0
  79,78,33,34, ?,31, ?,0
   ?, ?,39,79,82,94, ?,0
  57,36,94, ?,96, 7,23,1
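A sketch of the "put missing values at the end" strategy, using '?' for missing values as in the data sets above; the function name is illustrative.

```python
# Sketch: sort rows on one column, sending rows with a missing value to the end.
def sort_missing_last(rows, col):
    return sorted(rows, key=lambda r: (r[col] == "?", r[col] if r[col] != "?" else 0))

rows = [
    [41, "?", 1],
    [19, 80, 0],
    ["?", 39, 0],
    [57, 94, 1],
]
print(sort_missing_last(rows, 1))   # rows with '?' in column 1 sort to the end
```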

  25. Issue #3: Categorical Data • Discovered that refactoring of MartiRank had introduced a bug into an important calculation • A global variable was being used incorrectly • This bug did not appear in any of the tests that had only repeating values or only missing values • However, categorical data necessarily has repeating values and may have missing values

  26. Issue #4: Permuted Input Data • Randomly permuting the input data led to different models (and then different rankings) generated by SVM-Light • Caused by “chunking” the data for use by an approximating variant of the optimization algorithm
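The same permutation check can be expressed generically, as in the sketch below, which uses scikit-learn as a stand-in learner rather than SVM-Light: train on the data in its original order and on a randomly permuted copy, then compare the predictions; an order-dependent learner will show a difference.

```python
# Sketch of the permutation check: compare models trained on original vs. permuted data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 7))
y = (X[:, 0] > 0.5).astype(int)

perm = rng.permutation(len(X))
model_orig = LogisticRegression().fit(X, y)
model_perm = LogisticRegression().fit(X[perm], y[perm])

X_test = rng.random((20, 7))
diff = np.abs(model_orig.predict_proba(X_test) - model_perm.predict_proba(X_test)).max()
print("max prediction difference after permuting the training data:", diff)
```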

  27. Observations • Parameterized random testing allowed us to isolate the traits of the data sets • These traits may appear in real-world data but not necessarily in the desired combinations • Controlling the traits allowed us to pinpoint defects and discrepancies, despite the absence of a test oracle

  28. Limitations and Future Work • Test suite adequacy for coverage is not addressed or measured • Could also consider non-deterministic Machine Learning algorithms • Could also use mutation testing to evaluate the effectiveness of the approach • Should investigate creating large data sets that correlate with real-world data

  29. Conclusion • Our contribution is an approach that combines parameterization and randomness to control the properties of very large data sets • Critical for limiting the scope of individual tests and for pinpointing specific issues related to the traits of the input data

  30. Parameterizing Random Test Data According to Equivalence Classes Chris Murphy, Gail Kaiser, Marta Arias Columbia University
