This study analyzes the correlation between state coverage and software defects in order to improve validation techniques. It examines the impact on defect detection rates and the effectiveness of using state coverage metrics alongside traditional code coverage metrics.
State coverage: an empirical analysis based on a user study
Dries Vanoverberghe, Emma Eyckmans, and Frank Piessens
Software Validation Metrics
• Software defects found after product release are expensive
  • NIST 2002: $60 billion annually
  • MS Security bulletins: around 40 per year, at $100k to $1M each
• Validating software (testing)
  • Reduces the number of defects before release
  • But not without a cost
• Make a tradeoff:
  • Estimate the remaining number of defects => software validation metrics
Example: Code coverage
• Fraction of statements/basic blocks that are executed by the test suite (sketch below)
• Principle:
  • Not executed => no defects discovered there
• Hypothesis:
  • Not executed => more likely to contain a defect
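As a concrete illustration, here is a minimal sketch of the ratio behind statement coverage; the function and the line-number sets are invented for illustration and are not part of the study:

```python
def statement_coverage(executed_lines: set, total_lines: set) -> float:
    """Fraction of statements (identified here by line number) hit by the test suite."""
    if not total_lines:
        return 1.0  # nothing to cover
    return len(executed_lines & total_lines) / len(total_lines)

# Example: a 10-statement unit of which the tests execute 7 statements.
covered = statement_coverage(set(range(1, 8)), set(range(1, 11)))
print(f"statement coverage = {covered:.0%}")  # 70%
```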
Example: Code coverage
• High statement coverage
  • Does not imply no defects: different paths may remain unexplored
• Structural coverage metrics:
  • e.g. path coverage, data flow coverage, …
  • Measure the degree of exploration
• Automatic tool assistance
  • Metrics evaluate tools rather than human effort
Problem statement
• Exploration is not sufficient
  • Tests also need to check the requirements
• Evaluate the completeness of the test oracle
  • Impossible to automate: the requirements would have to be guessed
  • Evaluation is critical!
  • No good metrics available
State coverage
• Evaluates the strength of assertions
• Idea:
  • State updates must be checked by assertions (example below)
• Hypothesis:
  • Unchecked state update => more likely to hide a defect
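To illustrate the idea, a toy sketch (the `Calendar` class and tests are invented for illustration, not the system from the study): both tests execute the state update in `add_event`, but only the second reads the updated field in an assertion, so only the second could catch a faulty update:

```python
class Calendar:
    """Toy class: the writes to self.last_event are the state updates."""
    def __init__(self):
        self.last_event = None        # state update (field assignment)

    def add_event(self, name):
        self.last_event = name        # state update (field assignment)


def test_add_event_unchecked():
    cal = Calendar()
    cal.add_event("meeting")          # executed: good for code coverage
    # no assertion reads cal.last_event, so this state update stays unchecked


def test_add_event_checked():
    cal = Calendar()
    cal.add_event("meeting")
    assert cal.last_event == "meeting"   # the state update is read in an assertion


test_add_event_unchecked()
test_add_event_checked()
```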
State coverage
• Complements code coverage
  • Not a replacement
• Metrics also assist developers
  • Code coverage => are the statements reachable?
  • State coverage => are the invariants established by the reachable statements?
State coverage
• Metric:
  • State update
    • Assignment to fields of objects
    • Return values, local variables, … also possible
• Computation (sketch below):
  • Runtime monitor
  • State coverage = (number of state updates read in assertions) / (total number of state updates)
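A minimal sketch of the computation, assuming the runtime monitor yields a log of state updates tagged with whether each one is later read inside an assertion (the log format and names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StateUpdate:
    location: str            # e.g. "Calendar.add_event:12"
    read_in_assertion: bool  # did any assertion read the written field?

def state_coverage(updates):
    """State coverage = (state updates read in assertions) / (total state updates)."""
    if not updates:
        return 1.0
    checked = sum(1 for u in updates if u.read_in_assertion)
    return checked / len(updates)

# Hypothetical monitor output for one run of the test suite:
log = [
    StateUpdate("Calendar.add_event:12", True),
    StateUpdate("Calendar.add_event:13", False),
    StateUpdate("Calendar.remove_event:27", False),
]
print(f"state coverage = {state_coverage(log):.0%}")  # 33%
```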
Design of experiment
• Existing evaluation:
  • Correlation with mutation adequacy (Koster et al.)
  • Case study by an expert user
• Goal:
  • Directly analyze the correlation with 'real' defects
  • Average users instead of experts
Hypotheses
• Hypothesis 1:
  • When state coverage increases (without increasing exploration), the number of discovered defects increases
  • Similar to the existing case study
• Hypothesis 2:
  • State coverage and the number of discovered defects are correlated
  • A much stronger claim
Structure of experiment
• Base program:
  • Small calendar management system
  • Result of a software design course
  • Existing test suite
  • Presence of software defects unknown
Structure of experiment
• Phase 1: case study
  • Extend the test suite to find defects
    • First increase code coverage
    • Then increase state coverage
  • Dry run of the experiment
    • Simplified application
    • Injected additional defects
Structure of experiment
• Phase 2: controlled user study
  • Create a new test suite
    • First increase code coverage
    • Then increase state coverage
  • Commit after each detected defect
Threats to validity
• Internal validity
  • Two sessions: no differences observed
  • Learning effect: subjects were familiar with the environment before the experiment
• External validity
  • Choice of application
  • Choice of faults
  • Subjects are students
Results
• Phase 1: case study
  • No additional defects discovered
  • No confirmation for hypothesis 1
  • Potential reasons:
    • Mostly structural faults
    • Non-structural faults were obvious
• Phase 2: controlled user study
  • No confirmation for hypothesis 1
Potential causes
• Frequency of logical faults
  • 3/20 faults are incorrect state updates
  • Only 1/14 discovered!
  • 5/14 are detected by assertions
  • Focusing on these 5 faults: higher state coverage (42% vs. 34%) for classes that detect at least one of them
• How common are logical faults?
Potential causes
• Logical faults too obvious
  • Subjects already discovered them while increasing code coverage
• State coverage is not monotonic
  • Adding new tests may decrease state coverage
  • Always relative to exploration (worked example below)
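A small worked example (numbers invented for illustration): if the existing tests exercise 4 state updates and read 3 of them in assertions, state coverage is 3/4 = 75%; adding a test that exercises 4 further state updates but asserts on only 1 of them lowers the metric to 4/8 = 50%, even though the new test checks more than nothing. The metric is therefore always relative to how much of the state the tests explore.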
Conclusions
• The experiment fails to confirm the hypotheses
  • How frequent are logical faults?
  • Combine state coverage with code coverage?
  • Or compare test suites with similar code coverage
• But state coverage is also:
  • Simple
  • Efficient