
A Metric for Evaluating Static Analysis Tools


Presentation Transcript


  1. A Metric for Evaluating Static Analysis Tools
  Katrina Tsipenyuk, Fortify Software
  Brian Chess, Fortify Software

  2. Four Perspectives on the Problem
  • General
    • How good are software security tools today?
  • Tools vendor
    • Is my static analysis product getting better over time? (What is “better”?)
    • How much has it improved since the last release?
    • What should I focus on to improve my tool in the future?
    • If I make my tool detect a new kind of security bug, will an auditor or a developer thank me? Or both?
  • Tools user: auditor
    • Is the tool finding all the important types of security bugs?
  • Tools user: developer
    • Is the tool producing a lot of noise?
  Auditors and developers have different criteria for security tools, so we need a way to answer these questions on two scales: “Auditor” and “Developer”.

  3. Proposed Solution
  • Define metrics that model tool characteristics and conjecture a formula for calculating the score of each tool version
    • Counts of true positives (t), false positives (p), and false negatives (n)
    • 100 * t / (t + p + n), augmented by the weights and penalties below (score out of 100; see the sketch after this slide)
  • Define weights and penalties for reported results
    • Results with different reported severities should be weighted differently: High (h), Medium (m), and Low (l)
    • The false-negative penalty for a bug category should differ depending on whether the tool claims to detect that kind of bug
  • Define weights and penalties for the “Auditor” and “Developer” scales
    • Auditors tolerate false positives while developers tolerate false negatives; make the false-positive and false-negative weights different to reflect this
    • The importance and value of a vulnerability category (vc) to auditors and developers should affect the weights of the results
  • Conduct an experiment and collect the data needed to support or refute the conjecture
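The slides give the base formula but not the exact way the weights and penalties augment it. The following is a minimal Python sketch of one plausible combination, assuming true positives are weighted by severity and category value, false positives and false negatives by the per-scale penalties, and false negatives additionally by whether the tool claims to detect the category. The function name tool_score, the Finding class, and all numeric weight values are placeholder assumptions, not values from the slides or Tables 1-4.

```python
# Hypothetical sketch of the weighted score from slide 3. Only the base
# formula 100 * t / (t + p + n) and the list of weights/penalties come from
# the slides; how they combine here, and every numeric value, is assumed.

from dataclasses import dataclass

# Placeholder weights (NOT the published Table 1-4 values).
SEVERITY_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.25}   # h, m, l
VC_WEIGHT = {"high": 1.0, "low": 0.5}                         # category value (vc)
FP_PENALTY = {"auditor": 0.5, "developer": 1.0}               # per-scale false positives
FN_PENALTY = {"auditor": 1.0, "developer": 0.5}               # per-scale false negatives
FN_CLAIM_PENALTY = {"claimed": 1.0, "not_claimed": 0.5}       # Table 2 idea

@dataclass
class Finding:
    is_true_positive: bool   # confirmed by the manual audit
    severity: str            # "high", "medium", or "low"
    category_value: str      # "high" or "low" value category (vc)

def tool_score(findings, missed, scale):
    """Score one tool version on one scale ("auditor" or "developer").

    findings -- audited results reported by the tool
    missed   -- for each known bug the tool failed to report, whether the
                tool claims to detect that category ("claimed"/"not_claimed")
    """
    t = sum(SEVERITY_WEIGHT[f.severity] * VC_WEIGHT[f.category_value]
            for f in findings if f.is_true_positive)
    p = FP_PENALTY[scale] * sum(not f.is_true_positive for f in findings)
    n = FN_PENALTY[scale] * sum(FN_CLAIM_PENALTY[c] for c in missed)
    return 100.0 * t / (t + p + n) if (t + p + n) else 0.0
```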

  4. Experiment
  • Analyzed three different projects: wuftpd (C), webgoat (Java), and securibench (Java)
  • Ran four versions of the Fortify tool
  • Did a full audit of reported results for all product/version combinations (time consuming)
  • Defined weights based on our experiences with auditors and developers (see the example call after this slide)
    • Table 1 presents the chosen weights and penalties for true positives (t), false positives (p), and false negatives (n) for high-value (high vc) and low-value (low vc) categories
    • Table 2 presents the false-negative penalty per bug category, based on whether the tool claims to detect the category or not
    • Table 3 presents the High (h), Medium (m), and Low (l) severity weights
    • Table 4 presents the false-positive (p) and false-negative (n) penalties for the “Auditor” and “Developer” scales
  Table 1. Penalties with respect to category importance
  Table 2. False-negative penalty based on whether the tool claims to detect the category or not
  Table 3. Severity weights
  Table 4. “Auditor” vs. “Developer” scale penalties
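As a hypothetical illustration of how such weights feed the score, here is an example call to the tool_score sketch above. The findings and missed-bug list are invented purely for illustration; they are not the audited wuftpd, webgoat, or securibench results (those are in the handout), and the weights remain the placeholder values defined earlier rather than the values in Tables 1-4.

```python
# Invented example inputs, for illustration only.
findings = [
    Finding(True,  "high",   "high"),   # confirmed result in a high-value category
    Finding(True,  "medium", "low"),    # confirmed result in a low-value category
    Finding(False, "low",    "low"),    # false positive
]
missed = ["claimed", "not_claimed"]     # two known bugs the tool did not report

print(tool_score(findings, missed, scale="auditor"))    # score on the "Auditor" scale
print(tool_score(findings, missed, scale="developer"))  # score on the "Developer" scale
```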

  5. Experimental Results & Analysis
  • The collected data seems to indicate that we are headed in the right direction
  • Both scores for wuftpd get higher until version 3.1
    • The number of false positives decreases overall, but increases in version 3.1
  • The wuftpd “Developer” score is lower than the “Auditor” score for all four versions
    • The “Developer” false-positive penalty is higher; the tool is tuned better for Java than for C
    • After all, Fortify is a security company
  • The webgoat “Developer” score drops between versions 3.1 and 3.5
    • With the addition of multiple auditor-oriented categories
  • Both scores are best for the latest release examined (whew)
  [Charts: webgoat, securibench, and wuftpd scores; the complete set of data for one experiment is available as a handout]

  6. Conclusions & Future Work
  • The proposed approach is useful for our purposes: measuring improvements of the Fortify static analyzer
  • It is unclear whether the same approach would be useful for comparing two different tools
  • Determining an “answer key” against which to grade a tool’s results is still a hard problem
  • On our to-do list:
    • Do more audits of various projects to collect more data and adjust the weights and penalties
      • Include projects written in other languages the tool supports
    • Experiment with additional weights and penalties
      • Introduce a penalty for incorrectly reporting the severity of results
    • Define a good visual representation of the collected data
      • Make it intuitive to determine the area that needs improvement
