CS590Z Statistical Debugging, Xiangyu Zhang (some slides are adapted from Chao Liu)
A Very Important Principle • Traditional debugging techniques deal with a single execution (or very few). • Given a large set of executions, including both passing and failing ones, statistical debugging is often highly effective. • Such large sets of executions come from: • Failure reporting • In-house testing
Scalable Remote Bug Isolation (PLDI 2004, 2005) • Instrument and observe predicates: • Branches • Function returns (<0, <=0, >0, >=0, ==0, !=0) • Scalar pairs: for each assignment x = …, find all in-scope variables y_i and constants c_j, and track each comparison x (==, <, <=, …) y_i or c_j • Sample the predicate evaluations sparsely (Bernoulli sampling) to keep the overhead low in deployed software • Investigate how the probability of a predicate being true relates to bug manifestation.
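As a concrete illustration, the following minimal sketch shows what such instrumentation might look like after a compiler pass. The counter layout, the helper names (observe, sample, pred_count), and the 1/100 sampling rate are assumptions made for this example, not the actual CBI toolchain output.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative predicate counters: [predicate id][0] = false, [1] = true.
   Real instrumentation would emit one counter pair per static site. */
static long pred_count[3][2];

/* Bernoulli sampling: observe each evaluation with probability 1/RATE,
   so deployed runs pay only a fraction of the monitoring cost. */
#define RATE 100
static int sample(void) { return rand() % RATE == 0; }

/* Record predicate `id` evaluating to `val` (only when sampled). */
static int observe(int id, int val) {
    if (sample()) pred_count[id][val != 0]++;
    return val;
}

int foo(int x) { return x - 10; }

int main(void) {
    srand(12345);
    int a = 0;
    for (int i = 0; i < 100000; i++) {
        int r = foo(i % 20);
        observe(0, r > 0);           /* a function-return predicate: foo() > 0 */
        if (observe(1, i % 2 == 0))  /* a branch predicate */
            a = i;
        observe(2, a == i);          /* a scalar-pair predicate: a == i */
    }
    for (int id = 0; id < 3; id++)
        printf("pred %d: true %ld, false %ld\n",
               id, pred_count[id][1], pred_count[id][0]);
    return 0;
}

Only the true/false counts per predicate leave the deployed program; the statistical analysis happens offline over many such reports.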
Bug Isolation • Rank predicates by how much P being true increases the probability of failure over simply reaching the line where P is sampled.
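This intuition is what the Increase score of the PLDI'05 paper captures; a sketch of the definitions:

Failure(P) = F(P) / (F(P) + S(P))
Context(P) = F(P observed) / (F(P observed) + S(P observed))
Increase(P) = Failure(P) - Context(P)

where F(P) and S(P) count the failing and successful runs in which P was sampled and observed true, while "P observed" counts the runs in which the site of P was sampled at all, regardless of P's value. Roughly, a predicate is treated as a bug predictor only when Increase(P) is positive with statistical confidence.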
An Example

Buggy version:

void subline(char *lin, char *pat, char *sub)
{
  int i, lastm, m;
  lastm = -1;
  i = 0;
  while (lin[i] != ENDSTR) {
    m = amatch(lin, i, pat, 0);
    if (m >= 0) {
      putsub(lin, i, m, sub);
      lastm = m;
    }
    if ((m == -1) || (m == i)) {
      fputc(lin[i], stdout);
      i = i + 1;
    } else
      i = m;
  }
}

Fixed version (identical except that the first condition also requires lastm != m):

    if ((m >= 0) && (lastm != m)) {
      putsub(lin, i, m, sub);
      lastm = m;
    }

• Symptoms • 563 lines of C code • 130 out of 5542 test cases fail to give correct output • No crashes • The predicate (m >= 0) && (lastm != m) evaluates to both true and false within a single execution, so recording only whether it was ever observed true in a run is not enough
(The fixed subline code again, for reference.) Let A denote (m >= 0) and B denote (lastm != m). The buggy version takes the branch whenever A holds, and misbehaves exactly on evaluations where A ∧ ¬B holds, so conditioning on whether that event occurs in a run splits the evaluation bias of A into

P_f(A) = P̃(A | A ∧ ¬B occurred in the run)
P_t(A) = P̃(A | ¬(A ∧ ¬B), i.e., the event never occurred)

where P̃ denotes the empirical probability of A evaluating true.
Program Predicates • A predicate is a proposition about any program property • e.g., idx < BUFSIZE, a + b == c, foo() > 0, … • Each predicate can be evaluated multiple times during one execution • Every evaluation gives either true or false • Therefore, a predicate is simply a Boolean random variable, which encodes program executions from a particular aspect.
Evaluation Bias of Predicate P • Evaluation bias • Definition: the probability of P being evaluated as true within one execution • Maximum likelihood estimate: the number of true evaluations over the total number of evaluations in one run • Each run gives one observation of the evaluation bias of predicate P • Suppose we have n correct and m incorrect executions; for any predicate P, we end up with • an observation sequence for correct runs: S_p = (X'_1, X'_2, …, X'_n) • an observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m) • Can we infer whether P is suspicious based on S_p and S_f?
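In symbols: if P evaluates true n_t times and false n_f times within a run, the observed evaluation bias for that run is

X = n_t / (n_t + n_f)

so X = 1 for a predicate that is always true in the run, X = 0 for one that is always false (for example, a run with evaluations T, T, T, F gives X = 3/4), and each X'_i and X_j above is one such per-run value.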
[Figure: two histograms of evaluation bias, x-axis from 0 to 1, y-axis probability, one for correct runs and one for incorrect runs] Underlying Populations • Imagine the underlying distributions of evaluation bias for correct and incorrect executions are f_p(X) and f_f(X), respectively • S_p and S_f can be viewed as random samples from the respective underlying populations • One major heuristic: the larger the divergence between f_p(X) and f_f(X), the more relevant the predicate P is to the bug
[Figure: the same two evaluation-bias histograms] Major Challenges • No knowledge of the closed forms of either distribution • Usually, we do not have enough incorrect executions to estimate f_f(X) reliably
Algorithm Outputs • A ranked list of program predicates w.r.t. a bug relevance score s(P) • Higher-ranked predicates are regarded as more relevant to the bug • What's the use? • Top-ranked predicates suggest possible buggy regions • Several predicates may point to the same region
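As one concrete and deliberately simplified instance of such a score s(P), the sketch below ranks a predicate by a Welch-style two-sample statistic over its per-run evaluation biases. This is only an illustration of the divergence heuristic from the earlier slide, not SOBER's actual statistic, and the data in main is invented.

#include <math.h>
#include <stdio.h>

/* Mean and unbiased variance of a sample of per-run evaluation biases. */
static void mean_var(const double *x, int n, double *mean, double *var) {
    double s = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    *mean = s / n;
    for (int i = 0; i < n; i++) ss += (x[i] - *mean) * (x[i] - *mean);
    *var = (n > 1) ? ss / (n - 1) : 0.0;
}

/* Welch-style divergence between passing-run biases (xp, n of them) and
   failing-run biases (xf, m of them): a large score means the two
   populations disagree, i.e., the predicate is bug-relevant under the
   heuristic from the "Underlying Populations" slide. */
static double score(const double *xp, int n, const double *xf, int m) {
    double mp, vp, mf, vf;
    mean_var(xp, n, &mp, &vp);
    mean_var(xf, m, &mf, &vf);
    double se = sqrt(vp / n + vf / m);
    return se > 0.0 ? fabs(mf - mp) / se : 0.0;
}

int main(void) {
    /* Toy data: biases of one predicate in 6 passing and 4 failing runs. */
    double pass[] = {0.10, 0.12, 0.08, 0.11, 0.09, 0.10};
    double fail[] = {0.45, 0.52, 0.40, 0.48};
    printf("s(P) = %.3f\n", score(pass, 6, fail, 4));
    return 0;
}

Computing this score for every predicate and sorting in descending order yields exactly the ranked list described above.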
Outline • Program Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Experimental Results • Localization quality metric • Software bug benchmark • Quantitative metric • Related work • Cause Transition (CT) [CZ05] • Statistical Debugging [LN+05] • Performance comparisons
Bug Benchmark • The ideal benchmark: a large number of known bugs in large-scale programs with adequate test suites • Siemens Program Suite • 130 variants of 7 subject programs, each 100-600 LOC • 130 known bugs in total, mainly logic (or semantic) bugs • Advantages • Known bugs, so judgments are objective • A large number of bugs, so comparative study is statistically significant • Disadvantages • Small-scale subject programs • The state-of-the-art performance claimed in the literature so far is the cause-transition approach [CZ05]
1st Example • [Figure: a 10-node program dependence graph illustrating the T-score computation] • T-score = 70%, i.e., 70% of the nodes must be examined, by breadth-first search from the reported location, before the fault is reached
2nd Example • [Figure: the same 10-node program dependence graph with a different reported location] • T-score = 20%
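To make the metric concrete, here is a minimal sketch of the usual T-score computation (after Renieris and Reiss): breadth-first search over the program dependence graph, starting from the reported node and stopping at the faulty node. The 10-node graph below is an invented toy, not the graph from the slides.

#include <stdio.h>

#define N 10 /* toy program dependence graph with nodes 0..9 */

/* Invented toy dependence edges; adj[u][v] = 1 if u and v are connected. */
static int adj[N][N];

static void edge(int u, int v) { adj[u][v] = adj[v][u] = 1; }

/* Breadth-first search from the reported node; returns how many nodes are
   examined (dequeued), including the faulty node itself. */
static int examined_until(int report, int fault) {
    int queue[N], head = 0, tail = 0, seen[N] = {0}, count = 0;
    queue[tail++] = report;
    seen[report] = 1;
    while (head < tail) {
        int u = queue[head++];
        count++;
        if (u == fault) return count;
        for (int v = 0; v < N; v++)
            if (adj[u][v] && !seen[v]) { seen[v] = 1; queue[tail++] = v; }
    }
    return N; /* fault unreachable: the developer examines everything */
}

int main(void) {
    /* A chain with a few branches, purely for illustration. */
    edge(0, 1); edge(1, 2); edge(2, 3); edge(3, 4);
    edge(2, 5); edge(5, 6); edge(6, 7); edge(7, 8); edge(8, 9);
    int k = examined_until(0, 6); /* report at node 0, fault at node 6 */
    printf("T-score = %d%%\n", 100 * k / N);
    return 0;
}

With this toy graph, the fault at node 6 is the 7th node examined, so 7 of 10 nodes gives a T-score of 70%, matching the first example above; lower T-scores mean less code to inspect.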
Related Work • Cause Transition (CT) approach [CZ05] • A variant of delta debugging [Z02] • Previous state-of-the-art performance holder on the Siemens suite • Published in ICSE'05, May 15, 2005 • Cons: it relies on memory abnormalities, so its applicability is restricted • Statistical Debugging (Liblit05) [LN+05] • Predicate ranking based on discriminant analysis • Published in PLDI'05, June 12, 2005 • Cons: ignores the evaluation patterns of predicates within each execution
Top-k Selection • Regardless of the specific choice of k, both Liblit05 and SOBER outperform CT, the previous state-of-the-art holder • From k=2 to 10, SOBER consistently outperforms Liblit05
Outline • Evaluation Bias of Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Case Study: bc 1.06 • bc 1.06 • 14,288 LOC • An arbitrary-precision calculator shipped with most distributions of Unix/Linux • Two bugs were localized • One was reported by Liblit in [LN+05] • One was not previously reported • Sheds some light on scalability
Outline • Evaluation Bias of Predicates • Predicate Rankings • Experimental Results • Case Study: bc-1.06 • Future Work • Conclusions
Future Work • Further improve localization quality • Robustness to sampling • Stress-test on large-scale programs to confirm scalability to code size • …
Conclusions • We devised a principled statistical method for bug localization • No parameter-setting hassles • It handles both crashing and non-crashing bugs • Best localization quality so far
Discussion • Features • Easy to implement • Difficult to experiment with • More advanced statistical techniques may not be necessary • Go wide, not deep… • Predicates are treated as independent random variables • Can execution indexing help? • Can statistical principles be combined with slicing or IWIH?