

  1. Automated Whitebox Fuzz Testing. Network and Distributed System Security (NDSS) 2008, by Patrice Godefroid, Michael Y. Levin, and David Molnar. Presented by Diego Velasquez.

  2. Acknowledgments • Figures are copied from the paper. • Some slides were taken from the authors' original presentation.

  3. Outline • Summary • Goals • Motivations • Methods • Experiments • Results • Conclusions • Review • Strengths • Weaknesses • Extensions • References

  4. Goals • Propose a novel methodology that performs fuzz testing efficiently. • Introduce a new search algorithm for systematic test generation. • Showcase their system SAGE (Scalable, Automated, Guided Execution).

  5. Methods • Fuzz testing feeds random data to the inputs of an application in order to find defects in a software system. It is heavily used in security testing. • Pros: cost effective and can find many of the known kinds of bugs. • Cons: it has limitations with certain kinds of branches; for example, in project 2, triggering bug #10 requires executing the if statement below. if (address == 613 && value >= 128 && value < 255) // bug #10 printf("BUG 10 TRIGGERED"); • A random input has only about a (1 in 5000) * (127 in 2^32) chance of executing it, assuming there are only 5000 possible addresses and value is a random 32-bit input.
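A minimal sketch of what purely random (blackbox) fuzzing of this branch looks like, assuming the hypothetical project 2 setup above (5000 possible addresses, a random 32-bit value); the loop almost never prints anything, which is exactly the limitation being described:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        srand((unsigned)time(NULL));
        for (long i = 0; i < 10000000L; i++) {
            unsigned address = (unsigned)rand() % 5000;                     /* assumed: 5000 possible addresses */
            unsigned value   = ((unsigned)rand() << 16) ^ (unsigned)rand(); /* roughly 32 random bits */
            if (address == 613 && value >= 128 && value < 255)
                printf("BUG 10 TRIGGERED after %ld random tries\n", i);
        }
        return 0;
    }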

  6. Methods Cont. • Whitebox Fuzz Testing • Combine fuzz testing with dynamic test generation [2] • Run the code with some initial input • Collect constraints on the inputs with symbolic execution • Negate constraints to generate new ones • Solve the constraints with a constraint solver • Synthesize new inputs
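A minimal, self-contained illustration of one pass through this loop, using a toy one-branch parser; the machinery SAGE actually uses (x86 trace instrumentation and a constraint solver) is only described in comments here, and the solved value is hard-coded:

    #include <stdio.h>

    static int parse(unsigned char x) {      /* toy program under test */
        if (x == 0x42) return -1;            /* the branch we want to explore */
        return 0;
    }

    int main(void) {
        unsigned char seed = 0x00;           /* 1. run the code with some initial input */
        parse(seed);
        /* 2. symbolic execution of that run collects the constraint x != 0x42 */
        /* 3. negating it gives the new constraint x == 0x42                   */
        /* 4. a constraint solver returns the model x = 0x42                   */
        unsigned char next = 0x42;           /* 5. synthesize the new input             */
        if (parse(next) == -1)
            printf("synthesized input 0x%02x drives execution down the other branch\n", next);
        return 0;
    }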

  7. Methods Cont. • The Search Algorithm (Figure 1 from [1]) • Blackbox fuzzing will do poorly in this case • Dynamic test generation can do better
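The running example from the paper looks roughly like the function below (a reconstruction from memory of Figure 1 in [1], so treat the exact details as approximate): the program aborts only when enough input bytes take specific values, which blackbox fuzzing is very unlikely to stumble on.

    #include <stdlib.h>

    /* Crashes only when at least three of the four input bytes match. */
    void top(char input[4]) {
        int cnt = 0;
        if (input[0] == 'b') cnt++;
        if (input[1] == 'a') cnt++;
        if (input[2] == 'd') cnt++;
        if (input[3] == '!') cnt++;
        if (cnt >= 3) abort();   /* error */
    }

    int main(void) {
        char seed[4] = {'g', 'o', 'o', 'd'};
        top(seed);               /* the seed input does not crash */
        return 0;
    }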

  8. Methods Cont. • Dynamic Approach • Use the input 'good' as an example • Collect constraints from the trace • Create new path constraints (Figure 2 from [1])
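Concretely, for the example function above (again reconstructed from memory of Figure 2 in [1]), running on the seed "good" yields the path constraint below; negating one branch condition at a time, while keeping the earlier ones, and solving gives four new inputs:

    path constraint for "good":  i0 != 'b' && i1 != 'a' && i2 != 'd' && i3 != '!'

    negate one conjunct, keep the prefix, and solve:
      i0 == 'b'                                         ->  "bood"
      i0 != 'b' && i1 == 'a'                            ->  "gaod"
      i0 != 'b' && i1 != 'a' && i2 == 'd'               ->  "godd"
      i0 != 'b' && i1 != 'a' && i2 != 'd' && i3 == '!'  ->  "goo!"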

  9. Methods Cont. • Limitations of Dynamic Testing • Path Explosion • The number of paths does not scale in large, realistic programs. • Can be mitigated by modifying the search algorithm. • Imperfect Symbolic Execution • Can be imprecise due to complex program statements (arithmetic, pointer manipulation). • Calls into the OS are expensive to model precisely.

  10. Methods Cont. • New Generational Search Algorithm (Figures 3 and 4 from [1]) • A variant of breadth-first search with a heuristic aimed at generating more test inputs. • A test's score is the number of new code blocks it covers.

  11. Methods Cont. • Summary of the Generational Search Algorithm • Push the seed input onto the worklist • Run&Check(input): run the program and check for bugs on that input • Traverse the worklist, picking the next input based on its score • Expand child paths and add them to the child list • Traverse the child list: Run&Check each child, assign it a score, and add it to the worklist • Expand Execution • Generates the path constraint • Attempts to negate and solve each constraint in it, saving the resulting child inputs • input.bound limits the backtracking of each sub-search above the branch where the previous search ended • A simplified sketch follows below
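A simplified, runnable sketch of this generational search, specialized to the top() example above. The real algorithm (Figures 3 and 4 in [1]) derives path constraints from an x86 trace, calls a constraint solver, and orders the worklist by score, so the magic array, the toy byte-flipping "solver", the FIFO worklist, and the bound update below are illustrative assumptions rather than SAGE's actual bookkeeping:

    #include <stdio.h>
    #include <string.h>

    #define LEN 4
    static const char magic[LEN] = {'b', 'a', 'd', '!'};   /* assumed branch conditions of top() */

    typedef struct { char bytes[LEN + 1]; int bound; } input_t;

    /* Run&Check: run the program on the input and report whether it "crashes". */
    static int run_and_check(const input_t *in) {
        int cnt = 0;
        for (int j = 0; j < LEN; j++)
            cnt += (in->bytes[j] == magic[j]);
        if (cnt >= 3) { printf("crash found on input \"%s\"\n", in->bytes); return 1; }
        return 0;
    }

    int main(void) {
        input_t work[64];
        int head = 0, tail = 0;
        strcpy(work[tail].bytes, "good");      /* seed input, bound 0 */
        work[tail].bound = 0;
        tail++;
        while (head < tail) {
            input_t cur = work[head++];        /* plain FIFO here; SAGE picks by score */
            run_and_check(&cur);
            /* Expand Execution: negate each branch condition at or after cur.bound,
             * keeping the earlier conditions fixed; the toy "solver" simply writes
             * the byte that flips the comparison. */
            for (int j = cur.bound; j < LEN && tail < 64; j++) {
                input_t child = cur;
                child.bytes[j] = magic[j];
                child.bound = j + 1;           /* simplification: never re-negate position j */
                work[tail++] = child;
            }
        }
        return 0;
    }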

  12. Experiments • Can test any file-reading program running on Windows by treating the bytes read from files as symbolic input. • Another key novelty of SAGE is that it performs symbolic execution of program traces at the x86 binary level. (Figure from [2])

  13. Experiments Cont. • SAGE advantages • SAGE is machine-code based, not source based, so it can handle programs written in different languages. • Expensive to build at the beginning, but less expensive over time. • Can test after shipping: since it is based on symbolic execution of binary code, SAGE can detect bugs even after the product has shipped. • No source code is needed, unlike in other systems. • SAGE does not even need data-type or structure information that is not easily visible in machine code.

  14. Experiments Cont. • MS07-017: Vulnerabilities in Graphics Device Interface (GDI) Could Allow Remote Code Execution. • Tested on different apps such as image processors, media players, and file decoders. [2] • Many bugs found were rated as "security critical, severity 1, priority 1". [2] • Now used regularly by several teams as part of the QA process. [2]

  15. Experiments Cont. • More on MS07-017. The figure from [2] (a side-by-side dump, not reproduced here) shows the seed input file on the left and the crashing test case on the right; the two ANI files are nearly identical, differing in only a few bytes. • Only a 1 in 2^32 chance of hitting this at random!

  16. Results • Statistics from 10-hour searches on seven test applications, each seeded with a well-formed input file.

  17. Results • Focused on the Media 1 and Media 2 parsers. • Ran a SAGE search for the Media 1 parser with five "well-formed" media files and five bogus files. (Figure 7 from [1])

  18. Results • Comparison with a depth-first search (DFS) strategy • DFS ran for 10 hours on Media 2 with wff-2 and wff-3 and found nothing, while the generational search found 15 crashes • Symbolic execution is slow • Well-formed inputs work better than bogus files • Non-determinism in coverage results • The heuristic method did not have much impact • Divergences are common

  19. Results • Most bugs found are "shallow". (Figure from [2])

  20. Conclusions • Blackbox vs. Whitebox Fuzzing • Cost/precision tradeoffs • Blackbox is lightweight, easy and fast, but has poor coverage • Whitebox is smarter, but complex and slower • Recent "semi-whitebox" approaches • Less smart but more lightweight: Flayer (taint-flow analysis, may generate false alarms), Bunny-the-fuzzer (taint-flow, source-based, heuristics to fuzz based on input usage), autodafe, etc. • Which is more effective at finding bugs? It depends… • Many apps are so buggy, any form of fuzzing finds bugs! • Once low-hanging bugs are gone, fuzzing must become smarter: use whitebox and/or user-provided guidance (grammars, etc.) • Bottom line: in practice, use both! (Slide from [2])

  21. Strengths • Novel approach to fuzz testing • Introduced a new search algorithm that uses a code-coverage-maximizing heuristic • Can be applied as a black box: no source code is needed, since symbolic execution is performed at the x86 binary level • Presents results compared against previous results • Tested large applications that had been tested before and found more bugs • Introduced a full system and applied the paper's novel ideas in it

  22. Weaknesses • The results were non-deterministic • The same input, program, and approach gave different results • Only focuses on specific areas • x86 Windows applications • File-manipulating applications • Producing well-formed inputs still relies on some form of regular fuzz testing • SAGE needs help from several other tools • In my opinion the paper spends too much space on the implementation of SAGE, and the system may be too specific to Microsoft

  23. Extensions • Make SAGE more general • Easier to port to other architectures • Usable for other types of applications • Linux-based applications • A better way to create input files • Perhaps by using a grammar • Make the system deterministic • Getting different results each run makes me question its reliability.

  24. References • [1] P. Godefroid, M. Y. Levin, and D. Molnar, "Automated Whitebox Fuzz Testing," NDSS, 2008. • [2] Original presentation slides: www.truststc.org/pubs/366/15%20-%20Molnar.ppt • [3] Wikipedia, "Fuzz testing": http://en.wikipedia.org/wiki/Fuzz_testing

  25. Questions, Comments or Suggestions?
