
A Metric for Evaluating Static Analysis Tools


Presentation Transcript


  1. A Metric for Evaluating Static Analysis Tools
  Katrina Tsipenyuk, Fortify Software
  Brian Chess, Fortify Software

  2. Four Perspectives on the Problem
  • General
    • How good are software security tools today?
  • Tools vendor
    • Is my static analysis product getting better over time? (What is “better”?)
    • How much has it improved since the last release?
    • What should I focus on to improve my tool in the future?
    • If I make my tool detect a new kind of security bug, will an auditor or a developer thank me? Or both?
  • Tools user: auditor
    • Is the tool finding all the important types of security bugs?
  • Tools user: developer
    • Is the tool producing a lot of noise?
  Auditors and developers have different criteria for security tools, so we need a way to answer these questions on two scales: “Auditor” and “Developer”.

  3. Proposed Solution
  • Define metrics that model tool characteristics and conjecture a formula for calculating the score of each tool version
    • Counts of true positives (t), false positives (p), and false negatives (n)
    • 100 * t / (t + p + n), augmented by the weights and penalties below (score out of 100; see the sketch after this slide)
  • Define weights and penalties for reported results
    • Results with different reported severities should be weighted differently: High (h), Medium (m), and Low (l)
    • The false-negative penalty for a bug category should differ depending on whether the tool claims to detect that kind of bug
  • Define weights and penalties for the “Auditor” and “Developer” scales
    • Auditors tolerate false positives while developers tolerate false negatives; make the false-positive and false-negative weights different to reflect this
    • The importance and value of a vulnerability category (vc) to auditors and developers should affect the weights of the results
  • Conduct an experiment and collect the data needed to support or refute the conjecture
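The slides give the base formula but not the exact way the weights and penalties augment it. The following is a minimal Python sketch of one plausible combination, assuming true positives are weighted by severity and category value, false positives and false negatives by the per-scale penalties, and false negatives additionally by whether the tool claims to detect the category. The function name tool_score, the Finding class, and all numeric weight values are placeholder assumptions, not values from the slides or Tables 1-4.

```python
# Hypothetical sketch of the weighted score from slide 3. Only the base
# formula 100 * t / (t + p + n) and the list of weights/penalties come from
# the slides; how they combine here, and every numeric value, is assumed.

from dataclasses import dataclass

# Placeholder weights (NOT the published Table 1-4 values).
SEVERITY_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.25}   # h, m, l
VC_WEIGHT = {"high": 1.0, "low": 0.5}                         # category value (vc)
FP_PENALTY = {"auditor": 0.5, "developer": 1.0}               # per-scale false positives
FN_PENALTY = {"auditor": 1.0, "developer": 0.5}               # per-scale false negatives
FN_CLAIM_PENALTY = {"claimed": 1.0, "not_claimed": 0.5}       # Table 2 idea

@dataclass
class Finding:
    is_true_positive: bool   # confirmed by the manual audit
    severity: str            # "high", "medium", or "low"
    category_value: str      # "high" or "low" value category (vc)

def tool_score(findings, missed, scale):
    """Score one tool version on one scale ("auditor" or "developer").

    findings -- audited results reported by the tool
    missed   -- for each known bug the tool failed to report, whether the
                tool claims to detect that category ("claimed"/"not_claimed")
    """
    t = sum(SEVERITY_WEIGHT[f.severity] * VC_WEIGHT[f.category_value]
            for f in findings if f.is_true_positive)
    p = FP_PENALTY[scale] * sum(not f.is_true_positive for f in findings)
    n = FN_PENALTY[scale] * sum(FN_CLAIM_PENALTY[c] for c in missed)
    return 100.0 * t / (t + p + n) if (t + p + n) else 0.0
```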

  4. Experiment
  • Analyzed three different projects: wuftpd (C), webgoat (Java), and securibench (Java)
  • Ran four versions of the Fortify tool
  • Did a full audit of reported results for all product/version combinations (time consuming)
  • Defined weights based on our experiences with auditors and developers (see the example call after this slide)
    • Table 1 presents the chosen weights and penalties for true positives (t), false positives (p), and false negatives (n) for high-value (high vc) and low-value (low vc) categories
    • Table 2 presents the false-negative penalty per bug category, based on whether the tool claims to detect the category or not
    • Table 3 presents the High (h), Medium (m), and Low (l) severity weights
    • Table 4 presents the false-positive (p) and false-negative (n) penalties for the “Auditor” and “Developer” scales
  Table 1. Penalties with respect to category importance
  Table 2. False-negative penalty based on whether the tool claims to detect the category or not
  Table 3. Severity weights
  Table 4. “Auditor” vs. “Developer” scale penalties
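As a hypothetical illustration of how such weights feed the score, here is an example call to the tool_score sketch above. The findings and missed-bug list are invented purely for illustration; they are not the audited wuftpd, webgoat, or securibench results (those are in the handout), and the weights remain the placeholder values defined earlier rather than the values in Tables 1-4.

```python
# Invented example inputs, for illustration only.
findings = [
    Finding(True,  "high",   "high"),   # confirmed result in a high-value category
    Finding(True,  "medium", "low"),    # confirmed result in a low-value category
    Finding(False, "low",    "low"),    # false positive
]
missed = ["claimed", "not_claimed"]     # two known bugs the tool did not report

print(tool_score(findings, missed, scale="auditor"))    # score on the "Auditor" scale
print(tool_score(findings, missed, scale="developer"))  # score on the "Developer" scale
```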

  5. Experimental Results & Analysis
  • The collected data seems to indicate that we are headed in the right direction
  • Both scores for wuftpd get higher until version 3.1
    • The number of false positives decreases overall, but increases in version 3.1
  • The wuftpd “Developer” score is lower than the “Auditor” score for all four versions
    • The “Developer” false-positive penalty is higher; the tool is tuned better for Java than for C
    • After all, Fortify is a security company
  • The webgoat “Developer” score drops between versions 3.1 and 3.5
    • With the addition of multiple auditor-oriented categories
  • Both scores are best for the latest release examined (whew)
  [Charts: webgoat, securibench, and wuftpd scores; the complete set of data for one experiment is available as a handout]

  6. Conclusions & Future Work
  • The proposed approach is useful for our purposes: measuring improvements of the Fortify static analyzer
  • It is unclear whether the same approach would be useful for comparing two different tools
  • Determining an “answer key” against which to grade a tool’s results is still a hard problem
  • On our to-do list:
    • Do more audits of various projects to collect more data and adjust the weights and penalties
      • Include projects written in other languages the tool supports
    • Experiment with additional weights and penalties
      • Introduce a penalty for incorrectly reporting the severity of results
    • Define a good visual representation of the collected data
      • Make it intuitive to determine the area that needs improvement
