Automated Support for Classifying Software Failure Reports. Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang. Case Western Reserve University. Presented by: Hamid Haidarian Shahri.
Automated failure reporting • Recent software products automatically detect and report crashes/exceptions to developer • Netscape Navigator • Microsoft products • Report includes call stack, register values, other debug info
User-initiated reporting • Other products permit user to report a failure at any time • User describes problem • Application state info may also be included in report
Mixed blessing • Good news: • More failures reported • More precise diagnostic information • Bad news: • Dramatic increase in failure reports • Too many to review manually
Our approach • Help developers group reported failures with same cause – before cause is known • Provide “semi-automatic” support, using: • Execution profiling • Supervised and unsupervised pattern classification • Multivariate visualization • Initial classification is checked, refined by developer
How classification helps (Benefits) Aids prioritization and debugging: • Suggests number of underlying defects • Reflects how often each defect causes failures • Assembles evidence relevant to prioritizing, diagnosing each defect
Formal view of problem • Let F = { f1, f2, ..., fm } be set of reported failures • True failure classification: partition of F into subsets F1, F2, ..., Fk such that in each Fi all failures have same cause • Our approach produces approximate failure classification G1, G2, ..., Gp
Classification strategy (***) • Software instrumented to collect and upload profiles or captured executions for developer • Profiles of reported failures combined with those of apparently successful executions (reducing bias) • Subset of relevant features selected • Failure profiles analyzed using cluster analysis and multivariate visualization • Initial classification of failures examined, refined
Execution profiling • Our approach not limited to classifying crashes and exceptions • User may report failure well after critical events leading to failure • Profiles should characterize entire execution • Profiles should characterize events potentially relevant to failure, e.g., • Control flow, data flow, variable values, event sequences, state transitions • Full execution capture/replay permits arbitrary profiling
Feature selection • Generate candidate feature sets • Use each one to train classifier to distinguish failures from successful executions • Select features of the classifier that performs best overall • Use those features to group (cluster) related failures
Probabilistic wrapper method • Used to select features in our experiments • Due to Liu and Setiono • Random feature sets generated • Each used with one part of profile data to train classifier • Misclassification rate of each classifier estimated using another part of data (testing) • Features of classifier with smallest estimated misclassification rate used for grouping failures
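The wrapper loop on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid classifier is a stand-in chosen to keep the sketch dependency-free (the experiments used logistic regression), and all function names and parameters are hypothetical.

```python
import random

def project(profile, features):
    """Keep only the selected feature positions of one profile."""
    return [profile[i] for i in features]

def train(profiles, labels, features):
    """Stand-in classifier (assumption): one centroid per class."""
    centroids = {}
    for cls in set(labels):
        rows = [project(p, features) for p, y in zip(profiles, labels) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(model, profile, features):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    x = project(profile, features)
    return min(model, key=lambda cls: sum((a - b) ** 2 for a, b in zip(x, model[cls])))

def error_rate(model, profiles, labels, features):
    """Fraction of held-out profiles that are misclassified."""
    wrong = sum(predict(model, p, features) != y for p, y in zip(profiles, labels))
    return wrong / len(labels)

def wrapper_select(train_set, test_set, n_features, subset_size, trials, seed=0):
    """Try random feature subsets; return (error, features) of the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        feats = sorted(rng.sample(range(n_features), subset_size))
        model = train(train_set[0], train_set[1], feats)       # fit on one part
        err = error_rate(model, test_set[0], test_set[1], feats)  # score on another
        if best is None or err < best[0]:
            best = (err, feats)
    return best
```

Each data set is passed as a `(profiles, labels)` pair, with label 1 for a reported failure and 0 for an apparently successful execution. The returned feature list is what would then be used to cluster the failures.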
Logistic regression (skip) • Simple, widely-used classifier • Binary dependent variable Y • Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x)
Logistic regression cont. (skip) • Log odds ratio (logit) g(x) defined by g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x1 + ... + βpxp • Coefficients β0, ..., βp estimated from sample of x and Y values • Estimate of Y given x is 1 iff estimate of g(x) is positive
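The decision rule on this slide can be illustrated with a short sketch. It assumes the standard linear form of the logit, g(x) = β0 + β1x1 + ... + βpxp, and takes already-estimated coefficients as input; the function names are hypothetical, and coefficient estimation itself is not shown.

```python
import math

def logit(x, beta0, beta):
    """g(x) = beta0 + beta1*x1 + ... + betap*xp (standard form, assumed)."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def prob_failure(x, beta0, beta):
    """pi(x) = P(Y = 1 | x) = 1 / (1 + exp(-g(x))), computed stably."""
    g = logit(x, beta0, beta)
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)  # avoids overflow when g is very negative
    return e / (1.0 + e)

def classify(x, beta0, beta):
    """Predict Y = 1 iff the estimated logit is positive."""
    return 1 if logit(x, beta0, beta) > 0 else 0
```

A positive logit is equivalent to π(x) > 0.5, so `classify` and thresholding `prob_failure` at 0.5 agree.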
Grouping related failures Alternatives: • 1) Automatic cluster analysis • Can be fully automated • 2) Multivariate visualization • User must identify groups in display • Weaknesses of each approach offset by combining them
1) Automatic cluster analysis • Identifies clusters among objects based on similarity of feature values • Employs dissimilarity metric • e.g., Euclidean, Manhattan distance • Must estimate number of clusters • Difficult problem • Several “reasonable” ways to cluster a population may exist
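The two dissimilarity metrics named on this slide are easy to state in code. The sketch below assumes each profile is a list of numeric feature values (for example, function call counts); the helper names are hypothetical.

```python
def euclidean(a, b):
    """Square root of the sum of squared feature differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dissimilarity_matrix(profiles, metric):
    """Pairwise dissimilarities, the usual input to clustering or MDS."""
    return [[metric(p, q) for q in profiles] for p in profiles]
```

For example, `dissimilarity_matrix(profiles, manhattan)` yields a symmetric matrix with zeros on the diagonal.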
Estimating number of clusters • Widely-used metric of quality of clustering due to Calinski and Harabasz: CH(c) = [B / (c − 1)] / [W / (n − c)] • B is total between-cluster sum of squared distances • W is total within-cluster sum of squared distances from cluster centroids • c is number of clusters • n is number of objects in population • Local maxima represent alternative estimates
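A small sketch of the Calinski-Harabasz index follows, assuming its standard form CH(c) = [B / (c − 1)] / [W / (n − c)] with c the number of clusters. B is taken here as the size-weighted squared distance of each cluster centroid from the overall centroid, which is the usual definition but an assumption about what the slide intends. Function names are hypothetical.

```python
def mean(rows):
    """Component-wise mean (centroid) of a list of points."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(points, labels):
    """CH index of one clustering; larger values suggest better separation."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, c = len(points), len(clusters)
    if c < 2 or c >= n:
        raise ValueError("need 2 <= number of clusters < number of points")
    overall = mean(points)
    between = sum(len(m) * sq_dist(mean(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, mean(m)) for m in clusters.values() for p in m)
    if within == 0:
        return float("inf")  # every point coincides with its centroid
    return (between / (c - 1)) / (within / (n - c))
```

On four one-dimensional points 0, 2, 10, 12, grouping {0, 2} and {10, 12} gives B = 100 and W = 4, so the index is (100 / 1) / (4 / 2) = 50, far higher than for the grouping {0, 10} and {2, 12}.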
2) Multidimensional scaling (MDS) • Represents dissimilarities between objects by 2D scatter plot • Distances between points in display approximate dissimilarities • Small dissimilarities poorly represented with high-dimensional profiles • Our solution: hierarchical MDS (HMDS)
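How well display distances approximate the original dissimilarities can be quantified. The sketch below computes a Kruskal-style stress value for a given 2D layout; it is only an illustration of the idea, not the HMDS algorithm, and the slides do not say which goodness-of-fit measure was used. It requires Python 3.8+ for `math.dist`.

```python
import math
from itertools import combinations

def stress(dissim, layout):
    """Kruskal-style stress-1: 0.0 means the layout reproduces
    every pairwise dissimilarity exactly; larger means more distortion."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(layout)), 2):
        d = math.dist(layout[i], layout[j])  # distance in the display
        num += (dissim[i][j] - d) ** 2
        den += d ** 2
    if den == 0:
        raise ValueError("layout points must not all coincide")
    return math.sqrt(num / den)
```

Here `dissim` is a symmetric matrix of profile dissimilarities and `layout` is a list of (x, y) display coordinates in the same order.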
Confirming or refining the initial classification • Select 2+ failures from each group • Choose ones with maximally dissimilar profiles • Debug to determine if they are related • If not, split group • Examine neighboring groups to see if they should be combined
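Choosing the failures to debug first can be sketched as picking the pair with the largest dissimilarity inside a group. This is a minimal sketch for the two-failure case; the Manhattan metric and the function names are assumptions made for illustration.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute feature differences (illustrative metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def most_dissimilar_pair(group, distance=manhattan):
    """Return the two profiles in a group that are farthest apart."""
    if len(group) < 2:
        raise ValueError("group needs at least two failures")
    return max(combinations(group, 2), key=lambda pair: distance(*pair))
```

If debugging shows that the two returned failures have different causes, the group is a candidate for splitting.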
Limitations • Classification unlikely to be exact • Sampling error • Modeling error • Representation error • Spurious correlations • Form of profiling • Human judgment
Experimental validation • Implemented classification strategy with three large subject programs • GCC, Jikes, javac compilers • Failures clustered automatically (what failure?) • Resulting clusters examined manually • Most or all failures in each cluster examined
Subject programs • GCC 2.95.2 C compiler • Written in C • Used subset of regression test suite (self-validating execution tests) • 3333 tests run, 136 failures • Profiled with GNU Gcov (2214 function call counts) • Jikes 1.15 Java compiler • Written in C++ • Used Jacks test suite (self-validating) • 3149 tests run, 225 failures • Profiled with Gcov (3644 function call counts)
Subject programs cont. • javac 1.3.1_02-b02 Java compiler • Written in Java • Used Jacks test suite • 3140 tests run, 233 failures • Profiled with function-call profiler written using JVMPI (1554 call counts)
Experimental methodology (skip) • 400-500 candidate Logistic Regression (LR) models generated per data set • 500 randomly selected features per model • Model with lowest estimated misclassification rate chosen • Data partitioned into three subsets: • Train (50%): used to train candidate models • TestA (25%): used to pick best model • TestB (25%): used for final estimate of misclassification rate
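The 50% / 25% / 25% partition can be sketched as below. The slides do not say how executions were assigned to subsets, so the random shuffle (and the function name) is an assumption.

```python
import random

def three_way_split(items, seed=0):
    """Shuffle, then cut into Train (about 50%), TestA (25%), TestB (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    a = n // 2           # end of Train
    b = a + n // 4       # end of TestA
    return shuffled[:a], shuffled[a:b], shuffled[b:]
```

Train fits each candidate model, TestA picks the best one, and TestB is touched only once, for the final misclassification estimate, so that estimate is not biased by the model selection step.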
Experimental methodology cont. (skip) • Measure used to pick best model gives extra weight to misclassification of failures • Final LR models correctly classified 72% of failures and 91% of successes • Linearly dependent features omitted from fitted LR models
Experimental methodology cont. (skip) • Cluster analysis • S-Plus clustering algorithm clara • Based on k-medoids criterion • Calinski-Harabasz index plotted for 2 ≤ c ≤ 50, local maxima examined • Visualization • Hierarchical MDS (HMDS) algorithm used
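Picking candidate cluster counts from the plotted index amounts to finding its local maxima. A minimal sketch, assuming the index values have already been computed for each candidate number of clusters c (the function name and input layout are hypothetical):

```python
def local_maxima(index_by_c):
    """Return cluster counts whose index exceeds both neighbours.
    The smallest and largest c are skipped: each has only one neighbour."""
    cs = sorted(index_by_c)
    peaks = []
    for prev, cur, nxt in zip(cs, cs[1:], cs[2:]):
        if index_by_c[cur] > index_by_c[prev] and index_by_c[cur] > index_by_c[nxt]:
            peaks.append(cur)
    return peaks
```

Each returned value of c is an alternative estimate of the number of clusters, to be examined rather than accepted automatically.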
Manual examination of failures (skip) • Several GCC tests often have same source file, different optimization levels • Such tests often fail or succeed together • Hence, GCC failures were grouped manually based on • Source file • Information about bug fixes • Date of first version to pass test
Manual examination cont. (skip) • Jikes, javac failures grouped in two stages • Automatically formed clusters checked • Overlapping clusters in HMDS display checked • Activities: • Debugging • Comparing versions • Examining error codes • Inspecting source files • Checking correspondence between tests and JLS sections
GCC results
Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (136)
21 | 100 | 77 (57%)
1 | 83 | 6 (4%)
3 | 75, 75, 71 | 23 (17%)
1 | 60 | 5 (4%)
1 | 24 | 25 (18%)
GCC results cont. • HMDS display of GCC failure profiles after feature selection. Convex hulls indicate results of automatic clustering into 27 clusters. • HMDS display of GCC failure profiles after feature selection. Convex hulls indicate failures involving same defect, using HMDS (more accurate).
GCC results cont. HMDS display of GCC failure profiles before feature selection. Convex hulls indicate failures involving same defect. So feature selection helps in grouping.
javac results
Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (232)
9 | 100 | 70 (30%)
5 | 88, 85, 85, 85, 83 | 64 (28%)
4 | 75, 67, 67, 57 | 49 (21%)
2 | 50, 50 | 20 (9%)
1 | 17 | 23 (10%)
javac results cont. HMDS display of javac failures. Convex hulls indicate results of manual classification with HMDS.
Jikes results
Number of clusters | % size of largest group of failures in cluster with same cause | Total failures (225)
12 | 100 | 64 (29%)
5 | 85, 83, 80, 75, 75 | 41 (18%)
4 | 70, 67, 67, 56 | 25 (11%)
8 | 50, 50, 50, 43, 41, 33, 33, 25 | 76 (34%)
Jikes results cont. HMDS display of Jikes failures. Convex hulls indicate results of manual classification with HMDS.
Summary of results • In most automatically-created clusters, majority of failures had same cause • A few large, non-homogeneous clusters were created • Sub-clusters evident in HMDS displays • Automatic clustering sometimes splits groups of failures with same cause • HMDS displays didn’t have this problem • Overall, failures with same cause formed fairly cohesive clusters
Threats to validity • One type of program used in experiments • Hand-crafted test inputs used for profiling • Think of Microsoft..
Related work • cSlice [Agrawal, et al.] • Path spectra [Reps, et al.] • Tarantula [Jones, et al.] • Delta debugging [Hildebrandt & Zeller] • Cluster filtering [Dickinson, et al.] • Clustering IDS alarms [Julisch & Dacier]
Conclusions • Demonstrated that our classification strategy is potentially useful with compilers • Further evaluation needed with different types of software, failure reports from field • Note: Input space is huge. More accurate reporting (severity, location) could facilitate a better grouping and overcome these problems • Note: Limited labeled data available and error causes/types constantly changing (errors are debugged), so effectiveness of learning is somewhat questionable (like following your shadow)
Future work • Further experimental evaluation • Use more powerful classification, clustering techniques • Use different profiling techniques • Extract additional diagnostic information • Use techniques for classifying intrusions reported by anomaly detection systems