
Presentation Transcript


  1. SAND2010-4169C Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example Jackson Mayo, James Brandt, Frank Chen, Vincent De Sapio, Ann Gentile, Philippe Pébay, Diana Roe, David Thompson, and Matthew Wong Sandia National Laboratories Livermore, CA 28 June 2010 Workshop on Fault-Tolerance for HPC at Extreme Scale

  2. Acknowledgments • This work was supported by the U.S. Department of Energy, Office of Defense Programs • Sandia is a multiprogram laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy under contract DE-AC04-94AL85000

  3. Overview • OVIS project goals and techniques • Considerations for evaluating HPC failure prediction • Example failure mode and predictor • Example quantification of predictor effectiveness

  4. The OVIS project aims to discover HPC failure predictors • Probabilistic failure prediction can enable smarter resource management and checkpointing, and extend HPC scaling • Challenges have limited progress on failure prediction • Complex interactions among resources and environment • Scaling of data analysis to millions of observables • Relative sparsity of data on failures and causes • Need for actionable, cost-effective predictors • OVIS (http://ovis.ca.sandia.gov) is open-source software for exploration and monitoring of large-scale data streams, e.g., from HPC sensors

  5. The OVIS project aims to discover HPC failure predictors • Robust, scalable infrastructure for data collection/analysis

  6. The OVIS project aims to discover HPC failure predictors • Analysis engines that learn statistical models from data and monitor for outliers (potential failure predictors): correlative, graph clustering, and Bayesian
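
The slide's "learn a model, alarm on outliers" pattern can be sketched in a few lines of Python. This is not OVIS code; OVIS's correlative, graph-clustering, and Bayesian engines build much richer models. The metric names and sample values below are hypothetical.

```python
# Minimal sketch (not OVIS code): fit a per-metric baseline from historical
# samples and flag new observations that fall far outside it.
import numpy as np

def learn_baseline(history):
    """Fit a simple Gaussian baseline (mean, std) to historical sensor samples."""
    history = np.asarray(history, dtype=float)
    return history.mean(), history.std(ddof=1)

def is_outlier(value, baseline, n_sigma=4.0):
    """Alarm if a new sample is more than n_sigma deviations from the mean."""
    mean, std = baseline
    return abs(value - mean) > n_sigma * std

# Hypothetical per-node CPU temperature stream.
baseline = learn_baseline([41.0, 42.5, 40.8, 43.1, 42.0])
print(is_outlier(58.0, baseline))  # True: a candidate failure predictor
```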

  7. The OVIS project aims to discover HPC failure predictors • Flexible user interface for data exploration

  8. Evaluation of HPC failure prediction confronts several challenges • Lack of plausible failure predictors • Some previous studies focused on possible responses without reference to a predictor • Lack of response cost information • Diverse costs may need to be estimated (downtime, hardware, labor) • Complex temporal features of prediction and response • Cost of action or inaction depends on prediction timing • Response to an alarm (e.g., hardware replacement) can alter subsequent events • Historical data do not fully reveal what would have happened if alarms had been acted upon

  9. Two general approaches offer metrics for prediction effectiveness • Correlation of predictor with failure • Consider predictor as a classifier that converts available observations into a statement about future behavior • Simplest case: for a specific component and time frame, classifier predicts failure or non-failure (binary classification) • Use established metrics for classifier performance • Cost-benefit of response driven by predictor • Use historical data to estimate costs of acting on predictions • More stringent test because even a better-than-chance predictor may not be worth acting on • Requires choice of response and understanding of its impact on the system; may be relatable to classifier metrics
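
The binary-classification view of a predictor reduces to tabulating four outcomes over historical jobs. The following sketch is illustrative only; `score` stands in for whatever observable the predictor uses (e.g., idle-time memory usage) and `failed` for the observed outcome, both hypothetical names.

```python
# Sketch of the "predictor as binary classifier" view: for a chosen threshold,
# count true/false positives and negatives over historical (score, failed) pairs.

def confusion_counts(jobs, threshold):
    """jobs: iterable of (score, failed); predict failure if score > threshold."""
    tp = fp = tn = fn = 0
    for score, failed in jobs:
        predicted_failure = score > threshold
        if predicted_failure and failed:
            tp += 1          # correctly alarmed
        elif predicted_failure and not failed:
            fp += 1          # false alarm
        elif not predicted_failure and failed:
            fn += 1          # missed failure
        else:
            tn += 1          # correctly quiet
    return tp, fp, tn, fn

jobs = [(0.9, True), (0.2, False), (0.7, False), (0.8, True), (0.1, False)]
print(confusion_counts(jobs, threshold=0.75))  # (2, 0, 3, 0)
```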

  10. Classifier metrics assess ability to predict failure • Classifiers have been analyzed for signal detection, medical diagnostics, and machine learning • Basic construct is “receiver operating characteristic” (ROC) curve • Binary classifiers have an adjustable threshold separating the two possible predictions • Interpretation in OVIS: How extreme an outlier is alarmed? • Sweeping this threshold generates a tradeoff curve between false positives and false negatives • Statistical significance of predictor can be measured • Any definition of failure/non-failure can be used, but one motivated by costs is most relevant
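
A minimal sketch of the threshold sweep described on this slide, with hypothetical scores and outcomes rather than the Glory data. It uses the standard true/false-positive-rate form; the paper's plot of false negatives against false positives presents the same tradeoff.

```python
# Sweep the alarm threshold over all observed predictor scores and record the
# resulting (false-positive rate, true-positive rate) points of the ROC curve.
import numpy as np

def roc_points(scores, failed):
    """Return (false_pos_rate, true_pos_rate) pairs for every threshold."""
    scores = np.asarray(scores, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    points = []
    for thr in np.unique(np.concatenate(([-np.inf], scores))):
        alarm = scores > thr
        tpr = (alarm & failed).sum() / max(failed.sum(), 1)
        fpr = (alarm & ~failed).sum() / max((~failed).sum(), 1)
        points.append((fpr, tpr))
    return sorted(points)

# Hypothetical idle-time memory-usage scores and job outcomes.
print(roc_points([0.9, 0.2, 0.7, 0.8, 0.1], [True, False, False, True, False]))
```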

  11. Cost metrics assess ability to benefit from failure prediction • Given a predictor and a response, evaluate the net cost of using them versus not (or versus others) • Historical data alone may not answer this counterfactual • Alternatives are real-world trials and dynamical models • Classifier thresholds are subject to cost optimization • ROC curves allow reading off simple cost functions: constant cost per false positive and per false negative • Realistic costs may not match such binary labels • Is the cost-benefit of an alarm really governed by whether a failure occurs in the next N minutes? • If costs are available for each historical event, they can be used to optimize thresholds directly
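
The last bullet (per-event costs optimizing the threshold directly) can be sketched as follows. The event tuples and cost figures are assumptions for illustration, not the paper's estimates.

```python
# Cost-driven threshold selection: if each historical event carries an
# estimated cost for alarming vs. not alarming, pick the threshold that
# minimizes total cost, rather than pricing false positives/negatives uniformly.

def best_threshold(events):
    """events: list of (score, cost_if_alarm, cost_if_no_alarm) per job/node."""
    candidates = sorted({score for score, _, _ in events}) + [float("inf")]
    best = None
    for thr in candidates:
        total = sum(c_alarm if score > thr else c_quiet
                    for score, c_alarm, c_quiet in events)
        if best is None or total < best[1]:
            best = (thr, total)
    return best  # (threshold, total cost)

# Hypothetical events: alarming costs a reboot (in CPU-hours); staying quiet
# on a job that later fails costs its lost CPU-hours.
events = [(0.9, 0.025, 12.0), (0.3, 0.025, 0.0), (0.8, 0.025, 6.0), (0.2, 0.025, 0.0)]
print(best_threshold(events))  # (0.3, 0.05): alarm only on the two risky events
```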

  12. Real-world failure predictor illustrates evaluation issues • Out of memory (OOM) condition has been a cause of job failure on Sandia’s Glory cluster • Failure predictor: abnormally high memory usage during idle time (detectable > 2 hours before failure)
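
A sketch of how such a predictor might be applied per job; the data layout, node names, and threshold are hypothetical and not taken from Glory's monitoring.

```python
# For a queued job, inspect memory usage sampled on its nodes during the
# preceding idle window and alarm on the highest-usage node if it exceeds
# a threshold (a candidate for a clearing reboot).

def oom_alarm(idle_memory_samples, threshold):
    """idle_memory_samples: dict node -> list of memory-usage samples (fraction
    of RAM) taken while the node was idle. Returns (worst_node, peak) if the
    peak exceeds the threshold, else None."""
    worst_node, peak = None, 0.0
    for node, samples in idle_memory_samples.items():
        node_peak = max(samples)
        if node_peak > peak:
            worst_node, peak = node, node_peak
    return (worst_node, peak) if peak > threshold else None

# Hypothetical idle-time samples for the three nodes of a queued job.
samples = {"n101": [0.12, 0.15], "n102": [0.84, 0.88], "n103": [0.10, 0.11]}
print(oom_alarm(samples, threshold=0.5))  # ('n102', 0.88)
```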

  13. Real-world failure predictor illustrates evaluation issues • What is failure? Jobs terminate abnormally before system failure event(s) • Cost-benefit and ramifications? How to evaluate cost-benefit for a given action? What are the ramifications of a given action/inaction on a live system where playback is impossible? • Attribution? Event is far from the indicator/cause

  14. Definitions and assumptions allow example quantification of OOM predictor • Classifier predicts whether a job will terminate as COMPLETED (non-failure) or otherwise (failure) • Failure is predicted if memory usage (MU) on any job node during preceding idle time exceeds threshold • Response is rebooting of any node with excess MU during idle time, thus clearing memory • Cost of rebooting is 90 CPU-seconds • Does not include cycling wear or effect on job scheduling • If a job failed, rebooting its highest-MU node during the preceding idle time would have saved it • Credit given for total CPU-hours of failed job • Unrealistic assumption because not all failures are OOM
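
The bookkeeping implied by these assumptions can be sketched as below. Field names are hypothetical, and as a simplification only one reboot (of the highest-MU node) is charged per alarmed job, whereas the slide's response reboots every node with excess MU.

```python
# Net benefit at a given MU threshold under the slide's assumptions:
# each alarmed job incurs a 90 CPU-second reboot, and an alarmed job that
# actually failed is credited as saved (its full CPU-hours recovered).
REBOOT_COST_CPU_HOURS = 90.0 / 3600.0  # 90 CPU-seconds per reboot

def net_benefit(jobs, mu_threshold):
    """jobs: list of dicts with 'idle_peak_mu' (highest idle-time memory usage
    over the job's nodes), 'failed' (bool), and 'cpu_hours' (job size)."""
    benefit = 0.0
    for job in jobs:
        if job["idle_peak_mu"] > mu_threshold:
            benefit -= REBOOT_COST_CPU_HOURS        # cost of the reboot
            if job["failed"]:
                benefit += job["cpu_hours"]         # credit: job assumed saved
    return benefit

jobs = [
    {"idle_peak_mu": 0.88, "failed": True,  "cpu_hours": 512.0},
    {"idle_peak_mu": 0.15, "failed": False, "cpu_hours": 64.0},
    {"idle_peak_mu": 0.61, "failed": False, "cpu_hours": 128.0},
]
print(net_benefit(jobs, mu_threshold=0.5))  # ~511.95 CPU-hours saved
```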

  15. Example ROC curve measures prediction accuracy • Predictions of job failure/non-failure are evaluated for various MU thresholds • ROC curve shows better-than-chance accuracy • Area under curve is 0.562 vs. 0.5 for chance • Statistical significance (p ~ 0.001) via comparison to synthetic data with no MU-failure correlation • Validates ability to predict failure in this system [Figure: ROC curve of false negatives vs. false positives, swept from the lowest threshold (always alarm) to the highest threshold (never alarm)]
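
One way the area under the curve and its significance could be estimated is sketched below with synthetic data; the 0.562 and p ~ 0.001 values above come from the Glory measurements, not from this code. Significance is assessed by shuffling the failure labels to break any MU-failure correlation, as the slide describes.

```python
# Rank-based AUC (probability a failed job outscores a completed job) and a
# permutation test of its significance against label-shuffled null data.
import numpy as np

def auc(scores, failed):
    scores, failed = np.asarray(scores, dtype=float), np.asarray(failed, dtype=bool)
    pos, neg = scores[failed], scores[~failed]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, failed, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = auc(scores, failed)
    null = [auc(scores, rng.permutation(failed)) for _ in range(n_perm)]
    return (np.sum(np.asarray(null) >= observed) + 1) / (n_perm + 1)

# Hypothetical scores and outcomes.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
failed = [True, True, False, True, False, False]
print(auc(scores, failed), permutation_p_value(scores, failed))
```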

  16. Example net-benefit curve measures response effectiveness • With stated assumptions, net benefit (saved jobs minus rebooting time) is monotonic with threshold • Rebooting cost is negligible • Routine rebooting is optimal • More realistic treatment would reduce net benefit • Not all failed jobs were OOM or could be saved • Additional rebooting costs • Curve bent 80/20: smart reboot has potential value [Figure: net benefit vs. threshold, swept from the lowest threshold (always reboot) to the highest threshold (never reboot); roughly 80% of the benefit comes from 20% of the responses]
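
The 80/20 reading of the curve can be made concrete by ranking candidate reboots from most to least suspicious and accumulating the benefit each would have delivered. The data below are made up to show the roughly Pareto-shaped concentration the slide describes.

```python
# What fraction of the total benefit is captured by the top fraction of
# responses, when responses are ranked by idle-time memory usage.

def benefit_concentration(responses):
    """responses: list of (idle_peak_mu, cpu_hours_saved). Returns a list of
    (fraction_of_responses, fraction_of_total_benefit) pairs."""
    ranked = sorted(responses, key=lambda r: r[0], reverse=True)
    total = sum(saved for _, saved in ranked) or 1.0
    running, points = 0.0, []
    for i, (_, saved) in enumerate(ranked, start=1):
        running += saved
        points.append((i / len(ranked), running / total))
    return points

responses = [(0.95, 400.0), (0.90, 350.0), (0.80, 120.0), (0.70, 60.0),
             (0.60, 40.0), (0.55, 20.0), (0.50, 5.0), (0.45, 5.0),
             (0.40, 0.0), (0.35, 0.0)]
print(benefit_concentration(responses)[:3])  # top reboots deliver most of the benefit
```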

  17. Conclusion • HPC failure prediction is a valuable ingredient to improve resilience and thus scaling of applications • System complexity makes predictors difficult to discover • When a potential failure predictor is identified, quantification of effectiveness is challenging in itself • Classifier metrics evaluate correlation between predictor and failure, but do not account for feasibility/cost of response • Assessing practical value of predictor involves a response’s cost and impact on the system (often not fully understood) • At least one predictor is known (idle memory usage) • Evaluation methodology applied to this example confirms predictivity and suggests benefit from reboot response http://ovis.ca.sandia.gov ovis@sandia.gov
