CS590Z Statistical Debugging, Xiangyu Zhang (some slides are adapted from Chao Liu)
A Very Important Principle • Traditional debugging techniques deal with a single execution (or very few). • Given a large set of executions, including both passing and failing ones, statistical debugging is often highly effective. • Such large sets of executions come from: • Failure reporting • In-house testing
Scalable Remote Bug Isolation (PLDI 2004, 2005) • Instrument and observe predicates: • Branches • Function returns (<0, <=0, >0, >=0, ==0, !=0) • Scalar pairs: for each assignment x = …, find all in-scope variables y_i and constants c_j, and track each comparison x (==, <, <=, …) y_i or c_j • Sample the predicate evaluations sparsely (Bernoulli sampling) to keep the overhead low in deployed software • Investigate how the probability of a predicate being true relates to bug manifestation.
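As a concrete illustration, the following minimal sketch shows what such instrumentation might look like after a compiler pass. The counter layout, the helper names (observe, sample, pred_count), and the 1/100 sampling rate are assumptions made for this example, not the actual CBI toolchain output.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative predicate counters: [predicate id][0] = false, [1] = true.
   Real instrumentation would emit one counter pair per static site. */
static long pred_count[3][2];

/* Bernoulli sampling: observe each evaluation with probability 1/RATE,
   so deployed runs pay only a fraction of the monitoring cost. */
#define RATE 100
static int sample(void) { return rand() % RATE == 0; }

/* Record predicate `id` evaluating to `val` (only when sampled). */
static int observe(int id, int val) {
    if (sample()) pred_count[id][val != 0]++;
    return val;
}

int foo(int x) { return x - 10; }

int main(void) {
    srand(12345);
    int a = 0;
    for (int i = 0; i < 100000; i++) {
        int r = foo(i % 20);
        observe(0, r > 0);           /* a function-return predicate: foo() > 0 */
        if (observe(1, i % 2 == 0))  /* a branch predicate */
            a = i;
        observe(2, a == i);          /* a scalar-pair predicate: a == i */
    }
    for (int id = 0; id < 3; id++)
        printf("pred %d: true %ld, false %ld\n",
               id, pred_count[id][1], pred_count[id][0]);
    return 0;
}

Only the true/false counts per predicate leave the deployed program; the statistical analysis happens offline over many such reports.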
Bug Isolation • Rank predicates by how much P being true increases the probability of failure over simply reaching the line where P is sampled.
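This intuition is what the Increase score of the PLDI'05 paper captures; a sketch of the definitions:

Failure(P) = F(P) / (F(P) + S(P))
Context(P) = F(P observed) / (F(P observed) + S(P observed))
Increase(P) = Failure(P) - Context(P)

where F(P) and S(P) count the failing and successful runs in which P was sampled and observed true, while "P observed" counts the runs in which the site of P was sampled at all, regardless of P's value. Roughly, a predicate is treated as a bug predictor only when Increase(P) is positive with statistical confidence.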
An Example

Buggy version:

void subline(char *lin, char *pat, char *sub)
{
  int i, lastm, m;
  lastm = -1;
  i = 0;
  while (lin[i] != ENDSTR) {
    m = amatch(lin, i, pat, 0);
    if (m >= 0) {
      putsub(lin, i, m, sub);
      lastm = m;
    }
    if ((m == -1) || (m == i)) {
      fputc(lin[i], stdout);
      i = i + 1;
    } else
      i = m;
  }
}

Fixed version (identical except that the first condition also requires lastm != m):

    if ((m >= 0) && (lastm != m)) {
      putsub(lin, i, m, sub);
      lastm = m;
    }

• Symptoms • 563 lines of C code • 130 out of 5542 test cases fail to give correct output • No crashes • The predicate (m >= 0) && (lastm != m) evaluates to both true and false within a single execution, so recording only whether it was ever observed true in a run is not enough
(The fixed subline code again, for reference.) Let A denote (m >= 0) and B denote (lastm != m). The buggy version takes the branch whenever A holds, and misbehaves exactly on evaluations where A ∧ ¬B holds, so conditioning on whether that event occurs in a run splits the evaluation bias of A into

P_f(A) = P̃(A | A ∧ ¬B occurred in the run)
P_t(A) = P̃(A | ¬(A ∧ ¬B), i.e., the event never occurred)

where P̃ denotes the empirical probability of A evaluating true.
Program Predicates • A predicate is a proposition about any program property • e.g., idx < BUFSIZE, a + b == c, foo() > 0, … • Each predicate can be evaluated multiple times during one execution • Every evaluation gives either true or false • Therefore, a predicate is simply a Boolean random variable, which encodes program executions from a particular aspect.
Evaluation Bias of Predicate P • Evaluation bias • Definition: the probability of P being evaluated as true within one execution • Maximum likelihood estimate: the number of true evaluations over the total number of evaluations in one run • Each run gives one observation of the evaluation bias of predicate P • Suppose we have n correct and m incorrect executions; for any predicate P, we end up with • an observation sequence for correct runs: S_p = (X'_1, X'_2, …, X'_n) • an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m) • Can we infer whether P is suspicious based on S_p and S_f?
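In symbols: if P evaluates true n_t times and false n_f times within a run, the observed evaluation bias for that run is

X = n_t / (n_t + n_f)

so X = 1 for a predicate that is always true in the run, X = 0 for one that is always false (for example, a run with evaluations T, T, T, F gives X = 3/4), and each X'_i and X_j above is one such per-run value.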
[Figure: two histograms of evaluation bias, x-axis from 0 to 1, y-axis probability, one for correct runs and one for incorrect runs] Underlying Populations • Imagine the underlying distributions of evaluation bias for correct and incorrect executions are f_p(X) and f_f(X), respectively • S_p and S_f can be viewed as random samples from the respective underlying populations • One major heuristic: the larger the divergence between f_p(X) and f_f(X), the more relevant the predicate P is to the bug
[Figure: the same two evaluation-bias histograms] Major Challenges • No knowledge of the closed forms of either distribution • Usually, we do not have enough incorrect executions to estimate f_f(X) reliably
Algorithm Outputs • A ranked list of program predicates w.r.t. a bug relevance score s(P) • Higher-ranked predicates are regarded as more relevant to the bug • What's the use? • Top-ranked predicates suggest possible buggy regions • Several predicates may point to the same region
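As one concrete and deliberately simplified instance of such a score s(P), the sketch below ranks a predicate by a Welch-style two-sample statistic over its per-run evaluation biases. This is only an illustration of the divergence heuristic from the earlier slide, not SOBER's actual statistic, and the data in main is invented.

#include <math.h>
#include <stdio.h>

/* Mean and unbiased variance of a sample of per-run evaluation biases. */
static void mean_var(const double *x, int n, double *mean, double *var) {
    double s = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    *mean = s / n;
    for (int i = 0; i < n; i++) ss += (x[i] - *mean) * (x[i] - *mean);
    *var = (n > 1) ? ss / (n - 1) : 0.0;
}

/* Welch-style divergence between passing-run biases (xp, n of them) and
   failing-run biases (xf, m of them): a large score means the two
   populations disagree, i.e., the predicate is bug-relevant under the
   heuristic from the "Underlying Populations" slide. */
static double score(const double *xp, int n, const double *xf, int m) {
    double mp, vp, mf, vf;
    mean_var(xp, n, &mp, &vp);
    mean_var(xf, m, &mf, &vf);
    double se = sqrt(vp / n + vf / m);
    return se > 0.0 ? fabs(mf - mp) / se : 0.0;
}

int main(void) {
    /* Toy data: biases of one predicate in 6 passing and 4 failing runs. */
    double pass[] = {0.10, 0.12, 0.08, 0.11, 0.09, 0.10};
    double fail[] = {0.45, 0.52, 0.40, 0.48};
    printf("s(P) = %.3f\n", score(pass, 6, fail, 4));
    return 0;
}

Computing this score for every predicate and sorting in descending order yields exactly the ranked list described above.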
Outline • Program Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Experimental Results • Localization quality metric • Software bug benchmark • Quantitative metric • Related work • Cause Transition (CT) [CZ05] • Statistical Debugging [LN+05] • Performance comparisons
Bug Benchmark • The ideal benchmark: a large number of known bugs in large-scale programs with adequate test suites • Siemens Program Suite • 130 variants of 7 subject programs, each 100-600 LOC • 130 known bugs in total, mainly logic (or semantic) bugs • Advantages • Known bugs, so judgments are objective • A large number of bugs, so comparative study is statistically significant • Disadvantages • Small-scale subject programs • The state-of-the-art performance claimed in the literature so far is the cause-transition approach [CZ05]
1st Example • [Figure: a 10-node program dependence graph illustrating the T-score computation] • T-score = 70%, i.e., 70% of the nodes must be examined, by breadth-first search from the reported location, before the fault is reached
2nd Example • [Figure: the same 10-node program dependence graph with a different reported location] • T-score = 20%
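To make the metric concrete, here is a minimal sketch of the usual T-score computation (after Renieris and Reiss): breadth-first search over the program dependence graph, starting from the reported node and stopping at the faulty node. The 10-node graph below is an invented toy, not the graph from the slides.

#include <stdio.h>

#define N 10 /* toy program dependence graph with nodes 0..9 */

/* Invented toy dependence edges; adj[u][v] = 1 if u and v are connected. */
static int adj[N][N];

static void edge(int u, int v) { adj[u][v] = adj[v][u] = 1; }

/* Breadth-first search from the reported node; returns how many nodes are
   examined (dequeued), including the faulty node itself. */
static int examined_until(int report, int fault) {
    int queue[N], head = 0, tail = 0, seen[N] = {0}, count = 0;
    queue[tail++] = report;
    seen[report] = 1;
    while (head < tail) {
        int u = queue[head++];
        count++;
        if (u == fault) return count;
        for (int v = 0; v < N; v++)
            if (adj[u][v] && !seen[v]) { seen[v] = 1; queue[tail++] = v; }
    }
    return N; /* fault unreachable: the developer examines everything */
}

int main(void) {
    /* A chain with a few branches, purely for illustration. */
    edge(0, 1); edge(1, 2); edge(2, 3); edge(3, 4);
    edge(2, 5); edge(5, 6); edge(6, 7); edge(7, 8); edge(8, 9);
    int k = examined_until(0, 6); /* report at node 0, fault at node 6 */
    printf("T-score = %d%%\n", 100 * k / N);
    return 0;
}

With this toy graph, the fault at node 6 is the 7th node examined, so 7 of 10 nodes gives a T-score of 70%, matching the first example above; lower T-scores mean less code to inspect.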
Related Work • Cause Transition (CT) approach [CZ05] • A variant of delta debugging [Z02] • Previous state-of-the-art performance holder on the Siemens suite • Published in ICSE'05, May 15, 2005 • Cons: it relies on memory abnormalities, so its applicability is restricted • Statistical Debugging (Liblit05) [LN+05] • Predicate ranking based on discriminant analysis • Published in PLDI'05, June 12, 2005 • Cons: ignores the evaluation patterns of predicates within each execution
Top-k Selection • Regardless of the specific choice of k, both Liblit05 and SOBER outperform CT, the previous state-of-the-art holder • From k=2 to 10, SOBER consistently outperforms Liblit05
Outline • Evaluation Bias of Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Case Study: bc 1.06 • bc 1.06 • 14,288 LOC • An arbitrary-precision calculator shipped with most distributions of Unix/Linux • Two bugs were localized • One was reported by Liblit in [LN+05] • One was not previously reported • Sheds some light on scalability
Outline • Evaluation Bias of Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Future Work • Further improve localization quality • Robustness to sampling • Stress-test on large-scale programs to confirm scalability to code size • …
Conclusions • We devised a principled statistical method for bug localization • No parameter-setting hassles • It handles both crashing and non-crashing bugs • Best localization quality so far
Discussion • Features • Easy to implement • Difficult to experiment with • More advanced statistical techniques may not be necessary • Go wide, not deep… • Predicates are treated as independent random variables • Can execution indexing help? • Can statistical principles be combined with slicing or IWIH?