
MSc Software Testing and Maintenance MSc Prófun og viðhald hugbúnaðar


Presentation Transcript


  1. MSc Software Testing and Maintenance / MSc Prófun og viðhald hugbúnaðar Fyrirlestrar 41 & 42 (Lectures 41 & 42) Comparing Bug Finding Tools Can you detect me? Dr Andy Brooks

  2. Case Study Dæmisaga • Reference • Comparing Bug Finding Tools with Reviews and Tests • Stefan Wagner, Jan Jürjens, Claudia Koller, and Peter Trischberger, Institut für Informatik, Technische Universität München, 2005 • http://www4.in.tum.de/publ/papers/SWJJCKPT05.pdf Dr Andy Brooks

  3. 1. Introduction • Software quality assurance accounts for around 50% of the development time. • Defect-detection techniques need to be improved and costs reduced. • There are a number of automated static analysis tools called bug finding tools. • Faults are the cause of failures in code. Dr Andy Brooks

  4. 1. Introduction Problem • Which kinds of defects are found by bug finding tools, reviews, and testing? • Are the same or different defects found? • How much overlap is there between the different techniques? • Do the static analysis tools produce too many false positives? • Bug reports that are not actually bugs... Dr Andy Brooks

  5. 1. Introduction Results • Bug finding tools detect only a subset of the kinds of defects that reviews find. • The tools perform better on the bug patterns they are programmed for. • Testing finds completely different defects than bug finding tools. • Bug finding tools produce many more false positives than true positives. • The results of applying bug finding tools vary according to the project being studied. Dr Andy Brooks

  6. 1. Introduction Consequences • Testing or reviews cannot be substituted by bug finding tools. • Bug finding tools could be usefully run before conducting reviews. • The false positive ratio from bug finding tools needs to be lowered to realise reductions in defect-detection effort. • Tools should be more tolerant of programming style and design. Dr Andy Brooks

  7. 1. Introduction Experimental Setup • Five Java projects • 4 industrial (telecoms company O2) • web information systems • 1 university • Technische Universität München • projects in use or in final testing • projects have an interface to a relational database • Java bug finding tools and testing were applied to all 5 projects. Dr Andy Brooks

  8. 1. Introduction Experimental Setup • A review was applied to only one project. • Reports from the bug finding tools were classified as true or false positives by experienced developers. • Defects were classified by: • severity • type Dr Andy Brooks

  9. 2. Bug Finding Tools Techniques used by tools • Bug patterns are based on experience and known pitfalls in a programming language. • Readability is checked based on coding guidelines and standards. • Dataflow and control-flow analysis. • Code annotations to allow extended static checking/model checking. • code annotation tools were ignored in this study Dr Andy Brooks

  10. 2. Bug Finding Tools The Java bug finding tools • FindBugs Version 0.8.1 • bug patterns & dataflow analysis • can detect unused variables • analyses bytecode • PMD Version 1.8 • coding standards • can detect empty try/catch blocks • can detect classes with high cyclomatic complexity • QJ Pro Version 2.1 • uses over 200 rules • can detect overly long variable names • can detect an imbalance between code and comment lines Dr Andy Brooks
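
A minimal, hypothetical Java sketch (not code from the studied projects) of the kinds of patterns these tools typically flag: a variable that is initialised but never read, for FindBugs-style dataflow analysis; an empty catch block, for a PMD-style coding-standard rule; and an overly long variable name, for a QJ Pro-style readability rule. The class and method names are invented for illustration.

    import java.io.FileReader;
    import java.io.IOException;

    public class PatternExamples {
        public void readConfig(String path) {
            int unusedCount = 0;  // initialised but never read: the kind of unused variable FindBugs can flag
            String aVeryVeryLongAndRedundantVariableNameForTheConfigPath = path;  // overly long name: QJ Pro-style readability rule
            try {
                FileReader reader = new FileReader(aVeryVeryLongAndRedundantVariableNameForTheConfigPath);
                reader.close();
            } catch (IOException e) {
                // empty catch block: the kind of swallowed exception PMD can flag
            }
        }
    }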

  11. 3. Projects • Project A • online shop • software in use for 6 months • 1066 Java classes, over 58 KLOC • Project B • pay for goods • not operational at time of study • 215 Java classes, over 24 KLOC • Project C • frontend for file converter • software in use for 3 months • over 3 KLOC and JSP code Dr Andy Brooks

  12. 3. Projects • Project D • data manager • J2EE application • 572 classes, over 34 KLOC • EstA • non-industrial, requirements editor • not extensively used • over 4 KLOC Dr Andy Brooks

  13. 4. Approach 4.1 General • Bug finding tools used on all 5 projects. • Black-box and white-box testing of all 5 projects. • One review (Project C). • Techniques used completely independently. • Warnings from the tools are called positives and experienced developers classified them as true positives or false positives. Dr Andy Brooks

  14. 4. Approach 4.1 General • Validity threats include: • one review is not representative of reviews • only 3 bug finding tools were used • there are many more and results might be different • testing of the mature projects did not reveal many faults • too little data to make accurate statistical inferences • only 5 projects were analysed • more experiments are necessary Dr Andy Brooks

  15. 4. Approach 4.2 Defect Categorisation • Defects that lead to a crash. • Defects that cause a logical failure. • Defects with insufficient error handling. • Defects that violate the principles of structured programming. • Defects that reduce code maintainability. Dr Andy Brooks
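
To make the first two categories concrete, here is a hypothetical Java sketch (not taken from the studied projects): the first method can crash with a NullPointerException (Category 1), while the second silently returns a wrong result under an assumed discount specification (Category 2).

    public class CategoryExamples {

        // Category 1: defect that leads to a crash (NullPointerException when name is null).
        public int nameLength(String name) {
            return name.trim().length();
        }

        // Category 2: defect that causes a logical failure rather than a crash.
        // Assume the specification says the discount applies from 100 units upwards;
        // ">" silently excludes exactly 100 units, so the result is wrong but nothing crashes.
        public double price(int units, double unitPrice) {
            double total = units * unitPrice;
            if (units > 100) {  // should be >= 100 under the assumed specification
                total = total * 0.9;
            }
            return total;
        }
    }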

  16. 5. Analysis • [This slide reproduces Table 1 from the paper: the positives found by each tool, aggregated over all projects.] Dr Andy Brooks

  17. 5.1 Bug Finding Tools Observations and Interpretations • Most of the true positives are Category 5 • code maintainability • Different tools find different positives. • only one defect type was found across all tools* • FindBugs is the only tool to find positives across all defect categories 1 through 5. • FindBugs detects the largest number of defect types, QJ Pro the fewest. Dr Andy Brooks

  18. 5.1 Bug Finding Tools Observations and Interpretations • True positive detection is diverse. • For the defect type common to all three tools, FindBugs finds only 4 true positives, PMD finds 29, and QJ Pro finds 30. • FindBugs and PMD have lower false positive ratios than QJ Pro. • Because all warnings have to be examined, QJ Pro is not efficient. Table 2. Average ratios of false positives for each tool and in total. Dr Andy Brooks

  19. 5.1 Bug Finding Tools Observations and Interpretations • Efficiency of the tools varied across projects. • For the Category 1 defect (“Database connection not closed”), FindBugs issued true positives for projects B and D but 46 false positives for project A. • Detection rates of true positives decrease for projects A and D for the other two tools (ignoring Category 5 defects). • Recommending a single tool is difficult. • QJ Pro is the least efficient. • FindBugs and PMD should be used in combination. • FindBugs finds many different defect types. • PMD has accurate results for Category 5 defects. Dr Andy Brooks
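
The “Database connection not closed” defect mentioned above can be pictured with this hypothetical JDBC sketch (not code from projects A, B, or D; the query, table name, and URL parameter are invented): the first method leaks the connection on every call, the second closes it in a finally block so it is released on every path.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class ConnectionExample {

        // Defective version: the connection is never closed, so repeated calls
        // eventually exhaust the database's pool of connections.
        public int countOrdersLeaky(String url) throws SQLException {
            Connection con = DriverManager.getConnection(url);
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM orders");
            rs.next();
            return rs.getInt(1);  // con is still open when the method returns
        }

        // Fixed version: the finally block closes the connection on every path
        // (in current Java, try-with-resources would be the idiomatic fix).
        public int countOrders(String url) throws SQLException {
            Connection con = DriverManager.getConnection(url);
            try {
                Statement st = con.createStatement();
                ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM orders");
                rs.next();
                return rs.getInt(1);
            } finally {
                con.close();
            }
        }
    }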

  20. 5.2 Bug Finding Tools vs. Review • An informal review was performed on project C with three developers. • no preparation • the code author was a reviewer • code was inspected at the review meeting • 19 different types of defects were found “This variable is initialised but not used.” Dr Andy Brooks

  21. [This slide reproduces data from Section 5.2 of the paper: the defect types found by the tools and by the review of project C.] Dr Andy Brooks

  22. 5.2 Bug Finding Tools vs Review Observations and Interpretations • All defect types found by the tools* were also found by the review of project C: • “Variable initialised but not used” • The tools found 7 defects. • The review found only one. • “Unnecessary if clause” • The review found 8 defects. • An if-clause with no further computation. • 7 defects required investigation of program logic. • The tools found only one. • The if-clause with no further computation. Dr Andy Brooks
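
A hypothetical sketch (not project C's code) of the two defect types named above: a variable that is initialised but never read, and an if clause whose condition is evaluated but whose branch performs no further computation.

    public class ReviewFindingExamples {

        public String status(boolean done) {
            int retries = 3;  // "Variable initialised but not used": never read anywhere
            String label = done ? "finished" : "running";

            // "Unnecessary if clause": the condition is evaluated but the branch does nothing further.
            if (label.length() > 0) {
            }
            return label;
        }
    }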

  23. 5.2 Bug Finding Tools vs Review Observations and Interpretations • But 17 additional defect types were found in the review, some of which could have been found by tools but were not: • “Database connection is not closed” was not found by the tools. • FindBugs is generally able to detect “String concatenated inside loop with ‘+’” but did not. • the rule exists to avoid creating unnecessary and unreferenced String objects • Defect types such as “Wrong result” cannot be found by static tools but can be found in a review by manually executing a test case through the code. Dr Andy Brooks
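
The “String concatenated inside loop with ‘+’” pattern, and the usual remedy, look roughly like this hypothetical sketch (the method names are invented for illustration):

    public class ConcatExample {

        // Each iteration allocates a new intermediate String object.
        public String joinSlow(String[] parts) {
            String result = "";
            for (String p : parts) {
                result = result + p + ",";
            }
            return result;
        }

        // Preferred form: a single StringBuilder avoids the intermediate objects.
        public String joinFast(String[] parts) {
            StringBuilder sb = new StringBuilder();
            for (String p : parts) {
                sb.append(p).append(",");
            }
            return sb.toString();
        }
    }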

  24. 5.2 Bug Finding Tools vs Review Observations and Interpretations • By finding more defect types, the review of project C can be thought of as more successful than any tool. • Perhaps it is beneficial to use a bug finding tool first because automated static analysis is cheap. • But bug finding tools produce many false positives and the work involved in assessing a positive as false might outweigh the benefits of automatic static analysis. Dr Andy Brooks

  25. 5.3 Bug Finding Tools vs. Testing • Several hundred test cases were executed. • Black-box test cases were based on the textual specifications and the experience of the testers. • equivalence partitioning • boundary value analysis • White-box test cases involved path testing. • Path selection criteria are not specified. Dr Andy Brooks
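
As a hypothetical illustration of boundary value analysis (none of the study's actual test cases are published), assume a specification saying a quantity is valid from 1 to 99 inclusive; a JUnit 4 test would then probe each boundary and one representative of the valid partition. The QuantityValidator class is invented for this sketch.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    class QuantityValidator {
        static boolean isValid(int quantity) {
            return quantity >= 1 && quantity <= 99;  // assumed specification: 1..99 inclusive
        }
    }

    public class QuantityValidatorTest {

        @Test
        public void valuesAtAndAroundTheBoundaries() {
            assertFalse(QuantityValidator.isValid(0));    // just below the lower boundary
            assertTrue(QuantityValidator.isValid(1));     // lower boundary
            assertTrue(QuantityValidator.isValid(50));    // representative of the valid partition
            assertTrue(QuantityValidator.isValid(99));    // upper boundary
            assertFalse(QuantityValidator.isValid(100));  // just above the upper boundary
        }
    }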

  26. 5.3 Bug Finding Tools vs. Testing • A coverage tool checked test set quality. • Coverage was high apart from project C. • “In all the other projects, class coverage was nearly 100%, method coverage was also in that area and line coverage lay between 60% and 93%.” • No stress tests were executed. • This “might have changed the results significantly”. • Defects were found only for project C and project EstA. • Other projects were “probably too mature”. Dr Andy Brooks

  27. [This slide reproduces data from Section 5.3 of the paper.] Dr Andy Brooks

  28. 5.3 Bug Finding Tools vs. Testing Observations and Interpretations • Dynamic testing found defects in Categories 1, 2, and 3, but not 4 or 5. • Category 5 defects are not detectable by dynamic testing. • Dynamic testing of project C and project EstA found completely different defects to those found by the bug finding tools. • Stress testing might have revealed the database connections that were not closed. • “Therefore, we again recommend using both techniques in a project.” Dr Andy Brooks

  29. 5.4 Defect Removal Efficiency • The total number of defects is unknown but can be estimated using all the defects found so far. • Without regard to defect severity, efficiency is poor for the tests and good for the bug finding tools. (Only one defect was found in common, between the review and the tools.) Dr Andy Brooks
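
The slide does not reproduce the paper's figures, but defect removal efficiency is conventionally estimated as the fraction of all known defects that a technique finds:

    efficiency(technique) ≈ defects found by the technique / total distinct defects found by all techniques combined

Since the true number of defects is unknown, the denominator is itself only an estimate based on everything found so far.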

  30. 5.4 Defect Removal Efficiency • With regard to severity of defect, tests and reviews are “far more efficient in finding defects of the categories 1 and 2 than the bug finding tools”. Dr Andy Brooks

  31. 6. Discussion • The results are not too surprising: • Static tools, with no model checking capabilities, are limited and cannot verify program logic. • Reviews and tests can verify program logic. • What is perhaps surprising is that not a single defect was detected by both the tools and testing. • Few defects, however, were found during testing since most of the projects were mature and already in operation. This may explain the lack of overlap. Dr Andy Brooks

  32. 6. Discussion • “A rather disillusioning result is the high ratio of false positives that are issued by the tools.” • The benefits of automated detection are outweighed by the effort of manually determining that a positive is false. • No cost/benefit analysis was performed in this study. Dr Andy Brooks

  33. 6. Discussion • Some bug finding tools make use of additional annotations that permit some checks of logic. • The number of false positives could be reduced. • Category 1 and 2 defect detection could be increased. • But savings could be outweighed by the need to add annotations to the source code. Dr Andy Brooks
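
As a hypothetical sketch of annotation-based checking in the JML/ESC-Java style (the annotation tools themselves were excluded from this study), a checker uses pre- and postconditions written in special comments to verify call sites and method bodies; the Account class is invented for illustration.

    public class Account {

        private int balance;

        //@ requires amount > 0;                          // precondition checked at every call site
        //@ ensures balance == \old(balance) + amount;    // postcondition relating the new state to the old state
        public void deposit(int amount) {
            balance = balance + amount;
        }
    }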

  34. 8. Conclusions • The work is not a comprehensive empirical study and provides only “first indications” of the effectiveness of bug finding tools relative to other techniques. • Further experimental work is needed. • Cost/benefit models need to be built. Dr Andy Brooks

  35. 8. Conclusions • Bug finding tools find: • different defects than testing • a subset of the types a review finds • Bug finding tool effectiveness varied from project to project. • Probably because of different programming style and design in use. • Andy asks: how should we incorporate the idea of maintainability into static analysis tools? Dr Andy Brooks

  36. 8. Conclusions • If the number of false positives were much lower, it would be safe to recommend using bug finding tools, reviews and testing in a combined approach. • “It probably costs more time to resolve the false positives than is saved by the automation using the tools.” Looks like another false positive and another two minutes of my time wasted... Dr Andy Brooks
