
Stanford ROC Updates


Presentation Transcript


  1. Stanford ROC Updates Armando Fox

  2. Progress
  • Graduations:
     • Ben Ling (SSM, cheap-recovery session state manager)
     • Jamie Cutler (refactoring satellite ground station software architecture to apply ROC techniques)
     • Andy Huang (DStore, a persistent cluster-based hash table (CHT)):
        • Consistency model concretized
        • Cheap recovery exploited for fast recovery triggered by statistical monitoring
        • Cheap recovery exploited for online repartitioning

  3. More progress
  • George Candea: microreboots at the EJB level in J2EE apps
     • Shown to recover from a variety of injected faults
     • J2EE app session state factored out into SSM, making the J2EE app crash-only
     • Demo during the poster session
  • Emre Kiciman: Pinpoint, further exploration of anomaly-based failure detection [in a minute]

  4. Fast Recovery meets Anomaly Detection
  • Use anomaly detection techniques to infer (possible) failures
  • Act on alarms using low-overhead "micro-recovery" mechanisms
     • Microreboots in EJB apps
     • Node- or process-level reboot in DStore or SSM
  • Occasional false positives are OK since recovery is so cheap
  • These ideas will be developed at the Panel tonight, and form topics for Breakouts tomorrow

  5. Updates on Pinpoint • Emre Kıcıman and Armando Fox • {emrek, fox}@cs.stanford.edu

  6. What Is This Talk About?
  • Overview of recent Pinpoint experiments, including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

  7. Pinpoint: Overview
  • Goal: app-generic, high-level failure detection
     • For app-level faults, detection is a significant % of MTTR (75%!)
     • Existing monitors are hard to build/maintain or miss high-level faults
  • Approach: monitor, aggregate, and analyze low-level behaviors that correspond to high-level semantics
     • Component interactions
     • Structure of runtime paths
     • Analysis of per-node statistics (req/sec, mem usage, ...), without a priori thresholds
  • Assumption: anomalies are likely to be faults
     • Look for anomalies over time, or across peers in the cluster

  8. Recap: 3 Steps to Pinpoint
  • Observe low-level behaviors that reflect app-level behavior
     • Likely to change iff application behavior changes
     • App-transparent instrumentation!
  • Model normal behavior and look for anomalies
     • Assume: most of the system is working most of the time
     • Look for anomalies over time and across peers
     • No a priori app-specific info!
  • Correlate anomalous behavior to likely causes
     • Assume: an observed connection between anomaly and cause
  • Finally, notify the admin or reboot the component

  9. An Internet Service... [Diagram: HTTP frontends, application components, and databases connected by middleware]

  10. A Failure... [Diagram: the same service with one application component marked X]
  • Failures behave differently than normal
  • Look for anomalies in patterns of internal behavior

  11. Patterns: Path-shapes [Diagram: the path a request takes through HTTP frontends, application components, middleware, and databases]

  12. Patterns: Component Interactions [Diagram: interaction links among HTTP frontends, application components, middleware, and databases]

  13. Outline
  • Overview of recent Pinpoint experiments
  • Observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

  14. Compared to other anomaly detection...
  • Labeled and unlabeled training sets
     • If we know the end user saw a failure, Pinpoint can help with localization
     • But often we're trying to catch failures that end-user-level detectors would miss
     • "Ground truth" for the latter is HTML page checksums + database table snapshots
  • Current analyses are done offline
     • Eventual goal is to move online, with new models being trained and rotated in periodically
  • Alarms must be actionable
     • Microreboots (tomorrow) allow acting on alarms even when there are false positives

  15. Fault and Error Injection Behavior
  • Injected 4 types of faults and errors
     • Declared and runtime exceptions
     • Method call omissions
     • Source code bug injections (details on the next slide)
  • Results ranged in severity (% of requests affected)
  • 60% of faults caused cascades, affecting secondary requests
  • We fared most poorly on the "minor" bugs

  16. Experience w/Bug Injection
  • Wrote a Java code modifier to inject bugs
     • Injects 6 kinds of bugs into code in Petstore 1.3
     • Limited to bugs that would not be caught by the compiler and are easy to inject -> no major structural bugs
  • Double-check fault existence by checksumming HTML output (see the sketch below)
  • Not trivial to inject bugs that turn into failures!
     • 1st try: inject 5-10 bugs into random spots in each component. Ran 100 experiments; only 4 caused any changes!
     • 2nd try: exhaustive enumeration of potential "bug spots"
     • Found a total of 41 active spots out of 1000s. The rest is straight-line code with no trivial bug spots, or dead code.
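
The double-check mentioned above (did the injected bug actually change the service's output?) can be approximated by comparing digests of the same page served by a clean deployment and a fault-injected one. A minimal sketch, not the actual tool from the experiments: the URLs and class name are hypothetical, and dynamic page content (timestamps, session IDs) would need to be normalized before hashing.

```java
import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

// Hypothetical helper: decide whether an injected bug manifests in the
// service's output by comparing MD5 digests of the rendered HTML.
public class HtmlChecksum {

    // Fetch a page and return a hex MD5 digest of its bytes.
    static String digest(String pageUrl) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new URL(pageUrl).openStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed URLs: one clean deployment, one with the injected bug.
        String clean = digest("http://clean-host:8080/petstore/cart.jsp");
        String buggy = digest("http://faulty-host:8080/petstore/cart.jsp");
        System.out.println(clean.equals(buggy)
                ? "bug did not manifest in output"
                : "bug changed the HTML (active fault)");
    }
}
```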

  17. Source Code Bugs (Detail)
  • Loop errors: inverts loop conditions, injected 15. while(b) {stmt;} -> while(!b) {stmt;} (a toy sketch of this transformation follows below)
  • Misassignment: replaces the LHS of an assignment, injected 1. i=f(a); -> j=f(a);
  • Misinitialization: clears a variable initialization, injected 2. int i=20; -> int i=0;
  • Misreference: replaces a variable reference, injected 6. avail=onStock-Ordered; -> avail=onStock-onOrder;
  • Off-by-one: replaces a comparison operator, injected 17. if(a > b) {...}; -> if(a >= b) {...};
  • Synchronization: removes synchronization code, injected 0. synchronized { stmt; } -> { stmt; }
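
As a rough illustration of the loop-error case above, here is a minimal regex-based sketch that inverts one while-loop condition in a source file. This is not the Java code modifier described on the previous slide, just a toy under obvious assumptions (no nested parentheses in the condition); the class name and command-line usage are invented.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy injector for one bug type from slide 17: invert a while-loop
// condition, e.g. while (b) {...} -> while (!(b)) {...}.
public class LoopBugInjector {

    // Matches "while (<condition>)" where the condition contains no
    // nested parentheses -- good enough for a sketch.
    private static final Pattern WHILE_COND =
            Pattern.compile("while\\s*\\(([^()]+)\\)");

    static String invertNthWhile(String source, int target) {
        Matcher m = WHILE_COND.matcher(source);
        StringBuffer out = new StringBuffer();
        int seen = 0;
        while (m.find()) {
            String replacement = (seen == target)
                    ? "while (!(" + m.group(1) + "))"    // the injected bug
                    : m.group(0);                        // leave untouched
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
            seen++;
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);                    // e.g. CartBean.java
        String mutated = invertNthWhile(Files.readString(file), 0);
        Files.writeString(file, mutated);
    }
}
```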

  18. Outline
  • Overview of recent Pinpoint experiments, including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

  19. Metrics: Recall and Precision
  • Recall = C/T: how much of the target was identified
  • Precision = C/R: how much of the results were correct
  • Also, precision = 1 - false positive rate
  • [Venn diagram: Target (T) and Results (R), overlapping in Correctly Identified (C)]
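
A tiny worked example of the definitions above, treating T (truly faulty requests), R (requests flagged by the detector), and C (their intersection) as sets of request IDs; the IDs are made up.

```java
import java.util.HashSet;
import java.util.Set;

// Recall = C/T and precision = C/R, as defined on slide 19:
// T = target (truly faulty requests), R = results (requests flagged
// by the detector), C = correctly identified (their intersection).
public class RecallPrecision {
    public static void main(String[] args) {
        Set<String> target = Set.of("req1", "req2", "req3", "req4");   // T
        Set<String> results = Set.of("req2", "req3", "req9");          // R

        Set<String> correct = new HashSet<>(target);                   // C
        correct.retainAll(results);

        double recall = (double) correct.size() / target.size();       // C/T
        double precision = (double) correct.size() / results.size();   // C/R

        System.out.printf("recall=%.2f precision=%.2f%n", recall, precision);
        // With these sets: recall = 2/4 = 0.50, precision = 2/3 ~ 0.67.
    }
}
```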

  20. Metrics: Applying Recall and Precision
  • Detection: do failures in the system cause detectable anomalies?
     • Recall = % of failures actually detected as anomalies
     • Precision = 1 - (false positive rate); ~1.0 in our experiments
  • Identification (given a failure is detected):
     • Recall = how many actually-faulty requests are returned
     • Precision = what % of returned requests are faulty = 1 - (false positive rate), using HTML page checksums as ground truth
  • Workload: PetStore 1.1 and 1.3 (significantly different versions), plus RUBiS

  21. Fault Detection: Recall (All fault types) • Minor faults were hardest to detect • especially for Component Interaction

  22. FD Recall (Severe & Major Faults only) • Major faults are those that affect > 1% of requests • For these faults, Pinpoint has significantly higher recall than other low-level detectors

  23. Detecting Source Code Bugs
  • Source code bugs were the hardest to detect
     • Path-shape (PS) and component-interaction (CI) analyses individually detected 7-12% of all faults, 37% of major faults
     • HTTP detected 10% of all faults
  • We did better than HTTP logs, but that's no excuse
  • Other faults: Pinpoint strictly better than HTTP and HTML detection
  • Source code bugs: complementary; together all detectors caught 15%

  24. Faulty Request Identification
  • HTTP monitoring has perfect precision since it's a "ground truth indicator" of a server fault
  • Path-shape analysis pulls more points out of the bottom-left corner
  • [Scatter plot quadrants: failures detected but faulty requests often mis-identified (false positives); failures detected and faulty requests identified as such; failures injected but not detected; failures not detected, but few good requests marked faulty]

  26. Adjusting Precision
  • Threshold parameter = 1: recall = 68%, precision = 14%
  • Threshold parameter = 4: recall = 34%, precision = 93%
  • Low recall for faulty-request identification still detects 83% of fault experiments

  27. Outline
  • Overview of recent Pinpoint experiments, including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

  28. Outline
  • Overview of recent Pinpoint experiments, including observations on fault behaviors
  • Comparison with other app-generic fault detectors
  • Tests of Pinpoint limitations
  • Status of deployment at real sites

  29. Status of Real-World Deployment
  • Deploying parts of Pinpoint at 2 large sites
  • Site 1
     • Instrumenting middleware to collect request paths for path-shape and component-interaction analysis
     • Feasibility study completed; instrumentation in progress...
  • Site 2
     • Applying peer-analysis techniques developed for SSM and DStore
     • Metrics (e.g., req/sec, memory usage, ...) already being collected; beginning analysis and testing...

  30. Summary
  • Fault injection experiments showed a range of behavior
     • Faults cascading to other requests; a range of severity
  • Pinpoint performed better than existing low-level monitors
     • Detected ~90% of major component-level errors (exceptions, etc.)
     • Even in the worst-case experiments (source code bugs), Pinpoint provided a complementary improvement to existing low-level monitors
  • Currently validating Pinpoint in two real-world services

  31. Detail Slides

  32. Limitations: Independent Requests
  • Pinpoint assumes request-reply with independent requests
  • Monitored an RMI-based J2EE system (ECPerf 1.1)
     • ...is request-reply, but requests are not independent, nor is the unit of work (UoW) well defined. Assume: UoW = 1 RMI call.
     • Most RMI calls resulted in short paths (1 component)
     • Injected faults do not change these short paths
     • When anomalies occurred, they were rarely in the faulty path...
  • Solution? Redefine the UoW as multiple RMI calls
     • => paths capture more behavioral changes
     • => the redefined UoW is likely app-specific

  33. Limitations: Well-Defined Peers
  • Pinpoint assumes component peer groups are well defined
  • But behavior can depend on context
     • Example: a naming server in a cluster
     • Front-end servers mostly send lookup requests
     • Back-end servers mostly respond to lookups
     • Result: no component matches the "average" behavior; both front-end and back-end naming servers look "anomalous"!
  • Solution? Extend component IDs to include logical location...

  34. Bonus Slides

  35. Ex. Application-Level Failure
  • No itinerary is actually available on the page shown
     • The ticket was bought in March for travel in April
  • But the website (superficially) appears to be working
     • Heartbeats, pings, and HTTP-GET tests are not likely to detect the problem

  36. Application-Level Failures
  • Application-level failures are common
     • >60% of sites have user-visible (incl. app-level) failures [BIG-SF]
  • Detection is a major portion of recovery time
     • TellMe: detecting app-level failures is 75% of recovery time [CAK04]
     • 65% of user-visible failures were mitigable by earlier detection [OGP03]
  • Existing monitoring techniques aren't good enough
     • Low-level monitors (pings, heartbeats, HTTP error monitoring): + app-generic/low maintenance, - miss high-level failures
     • High-level, app-specific tests: - app-specific/hard to maintain, + can catch many app-level failures, - test coverage problem

  37. Testbed and Faultload
  • Instrumented JBoss/J2EE middleware
     • J2EE: state mgmt, naming, etc. -> a good layer of indirection
     • JBoss: open source; millions of downloads; real deployments
     • Track EJBs, JSPs, HTTP, RMI, JDBC, JNDI with synchronous reporting: 2-40 ms latency hit; 17% throughput decrease
  • Testbed applications: Petstore 1.3, Petstore 1.1, RUBiS, ECPerf
  • Test strategy: inject faults, measure detection rate
     • Declared and undeclared exceptions
     • Omitted calls: app not likely to handle at all
     • Source code bugs (e.g., off-by-one errors, etc.)

  38. PCFGs Model Normal Path Shapes
  • Probabilistic context-free grammar (PCFG)
     • Represents the likely calls made by each component
     • Learn probabilities of rules based on observed paths
  • Anomalous path shapes
     • Score a path by summing the deviations of P(observed calls) from average
     • Detected 90% of faults in our experiments
  • [Figure: sample paths through components A, B, C and the learned PCFG rules with their probabilities]

  39. Use PCFG to Score Paths
  • Measure the difference between the observed path and the average:
     • Score(path) = Σ_i (1/n_i - P(r_i))
  • Higher scores are anomalous
  • Detected 90% of faults in our experiments
  • [Figure: the same sample paths and learned PCFG as the previous slide]
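
Reading the formula above as: for each call r_i observed on the path, add the gap between a uniform baseline 1/n_i (n_i = number of rules learned for that caller) and the learned rule probability P(r_i). Below is a minimal sketch under that reading; the grammar, component names, and probabilities are made-up illustrations, not Pinpoint's actual data structures.

```java
import java.util.List;
import java.util.Map;

// Sketch of the path score from slide 39:
//   Score(path) = sum_i ( 1/n_i - P(r_i) )
// where r_i is the i-th rule (caller -> callees) observed on the path,
// P(r_i) is its learned probability, and n_i is the number of rules
// learned for that caller (so 1/n_i is the "average" rule probability).
// Rare rules push the score up; common rules push it down.
public class PcfgPathScore {

    // Learned PCFG: caller component -> (rule "callee set" -> probability).
    // These numbers are made up for illustration.
    static final Map<String, Map<String, Double>> PCFG = Map.of(
            "Frontend", Map.of("CartEJB", 0.7, "CatalogEJB", 0.3),
            "CartEJB",  Map.of("InventoryEJB,Database", 0.9, "Database", 0.1));

    static double score(List<String[]> path) {
        double score = 0.0;
        for (String[] step : path) {                 // step = {caller, rule}
            Map<String, Double> rules = PCFG.get(step[0]);
            double avg = 1.0 / rules.size();         // 1/n_i
            double p = rules.getOrDefault(step[1], 0.0);  // unseen rule: P = 0
            score += avg - p;
        }
        return score;
    }

    public static void main(String[] args) {
        // A path using only common rules scores low (negative here, -0.6);
        // a path taking the rare "Database"-only rule scores higher (0.2).
        System.out.println(score(List.of(
                new String[]{"Frontend", "CartEJB"},
                new String[]{"CartEJB", "InventoryEJB,Database"})));
        System.out.println(score(List.of(
                new String[]{"Frontend", "CartEJB"},
                new String[]{"CartEJB", "Database"})));
    }
}
```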

  40. Separating Good from Bad Paths
  • Use a dynamic threshold to detect anomalies
     • Alarm when unexpectedly many paths fall above the Nth percentile
  • [Plot: distribution of path scores with faults vs. the normal distribution of scores]
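
One way to realize the rule above: learn the Nth-percentile score from fault-free traffic, then alarm when the fraction of recent paths above it is far larger than expected. A minimal sketch; the 99th percentile, the sample scores, and the 2x alarm factor are assumptions for illustration, not values from the experiments.

```java
import java.util.Arrays;

// Sketch of the dynamic-threshold rule from slide 40: learn the Nth
// percentile of path scores during normal operation, then alarm when
// far more than (100-N)% of recent paths exceed it.
public class DynamicThreshold {

    // Nth percentile of an array of scores (nearest-rank method).
    static double percentile(double[] scores, double pct) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        double[] trainingScores = {0.1, 0.2, 0.15, 0.3, 0.12, 0.25, 0.18, 0.22};
        double[] recentScores   = {0.9, 0.2, 0.85, 0.95, 0.1, 0.8};

        double threshold = percentile(trainingScores, 99.0);   // N = 99 (assumed)

        long above = Arrays.stream(recentScores).filter(s -> s > threshold).count();
        double fraction = (double) above / recentScores.length;

        // Normally ~1% of paths should exceed the 99th percentile;
        // alarm if we see, say, more than twice that (factor assumed).
        boolean alarm = fraction > 2 * 0.01;
        System.out.printf("threshold=%.2f fraction above=%.2f alarm=%b%n",
                threshold, fraction, alarm);
    }
}
```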

  41. Anomalies in Component Interaction
  • Weighted links model component interaction
  • [Diagram: a component with outgoing links weighted w0=0.4, w1=0.3, w2=0.2, w3=0.1]

  42. Scoring CI Models
  • Score with a goodness-of-fit test: the probability that the same process generated both the normal pattern and the observed counts
  • Makes no assumptions about the shape of the distribution
  • [Diagram: normal link weights w0=0.4, w1=0.3, w2=0.2, w3=0.1 vs. observed counts n0=30, n1=10, n2=40, n3=20 -> "Anomaly!"]
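
The slide does not name the specific goodness-of-fit test; a chi-square statistic is one standard choice for comparing observed link counts against the proportions expected from the normal weights. A minimal sketch using the weights and counts from the slide's example; the anomaly decision would come from comparing the statistic against a critical value.

```java
// Sketch of slide 42: compare a component's observed interaction counts
// against its normal link weights with a chi-square goodness-of-fit
// statistic (one common goodness-of-fit test; the slide leaves the
// specific test unnamed).
public class LinkGoodnessOfFit {

    // chi^2 = sum_i (observed_i - expected_i)^2 / expected_i
    static double chiSquare(double[] normalWeights, long[] observedCounts) {
        long total = 0;
        for (long c : observedCounts) total += c;
        double stat = 0.0;
        for (int i = 0; i < normalWeights.length; i++) {
            double expected = normalWeights[i] * total;
            double diff = observedCounts[i] - expected;
            stat += diff * diff / expected;
        }
        return stat;
    }

    public static void main(String[] args) {
        // Numbers from the slide's example: normal weights w0..w3 and
        // observed counts n0..n3 for one component's links.
        double[] weights = {0.4, 0.3, 0.2, 0.1};
        long[] counts = {30, 10, 40, 20};

        // ~45.8 here; compare against a chi-square critical value
        // (3 degrees of freedom) to decide "Anomaly!".
        System.out.println(chiSquare(weights, counts));
    }
}
```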

  43. Two Kinds of False Positives
  • Algorithmic false positives
     • No anomaly exists, but the statistical technique made a mistake...
  • Semantic false positives
     • Correctly found an anomaly, but the anomaly is not a failure

  44. Resilient Against Semantic FPs
  • Test against normal changes
     • 1. Vary workload from "browse & purchase" to "only browse"
     • 2. Minor upgrade from Petstore 1.3.1 to 1.3.2
     • Path-shape analysis found NO differences; component-interaction changes stayed below threshold
  • For predictable, major changes: consider lowering Pinpoint sensitivity until retraining is complete
     • -> A window of vulnerability, but better than false positives
  • Q: What is the rate of normal changes? How quickly can we retrain?
     • Minor changes every day, but only to parts of the site
     • Training speed -> how quickly is the service exercised?

  45. Related Work
  • Detection and localization
     • Richardson: performance failure detection
     • Infospect: search for logical inconsistencies in observed configuration
     • Event/alarm correlation systems: use dependency models to quiesce/collapse correlated alarms
  • Request tracing
     • Magpie: tracing for performance modeling/characterization
     • Mogul: discovering majority behavior in black-box distributed systems
  • Compilers & PL
     • DIDUCE: hypothesize invariants, report when they're broken
     • Bug Isolation Project: correlate crashes with state, across real runs
     • Engler: analyze static code for patterns and anomalies -> bugs

  46. Conclusions
  • Monitoring path shapes and component interactions...
     • ...is easy to instrument and app-generic
     • ...is likely to change when the application fails
  • Model the normal pattern of behavior, look for anomalies
     • Key assumption: most of the system is working most of the time
  • Anomaly detection detects high-level failures, and is deployable
     • Resilient to (at least some) normal changes to the system
  • Current status:
     • Deploying in a real, large Internet service
     • Anomaly detection techniques for "structure-less" systems

  47. More Information
  • http://www.stanford.edu/~emrek/
  • Detecting Application-Level Failures in Component-Based Internet Services. Emre Kiciman, Armando Fox. In submission.
  • Session State: Beyond Soft State. Benjamin Ling, Emre Kiciman, Armando Fox. NSDI '04.
  • Path-Based Failure and Evolution Management. Chen, Accardi, Kiciman, Lloyd, Patterson, Fox, Brewer. NSDI '04.

  48. Localize Failures with a Decision Tree
  • Search for features that occur with bad items but not good ones
  • Decision trees
     • A classification function: each branch in the tree tests a feature; the leaves give the classification
  • Learn a decision tree to classify good/bad examples
     • But we won't use it for classification
     • Just look at the learned classifier and extract its questions as features
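
To make the extraction idea concrete, here is a minimal sketch of just the root split (a one-level stump): score each "request touched component X" feature by how well it separates failed from good requests, and report the winner as the suspect. A real decision-tree learner would use information gain and recurse; the requests, component names, and the simple separation score are invented for illustration.

```java
import java.util.List;
import java.util.Set;

// Sketch of the idea behind slide 48: rather than classify new requests,
// find the feature (here: "request touched component X") that best splits
// failed from successful requests, and report that feature as the likely
// fault location. Only the root split of the tree is shown.
public class StumpLocalizer {

    record Request(Set<String> components, boolean failed) {}

    // Pick the component whose presence/absence best separates failed from
    // good requests, i.e. maximizes |P(fail | present) - P(fail | absent)|.
    static String bestSplit(List<Request> requests, Set<String> allComponents) {
        String best = null;
        double bestGap = -1;
        for (String comp : allComponents) {
            long with = requests.stream()
                    .filter(r -> r.components().contains(comp)).count();
            long withFailed = requests.stream()
                    .filter(r -> r.components().contains(comp) && r.failed()).count();
            long without = requests.size() - with;
            long withoutFailed = requests.stream()
                    .filter(r -> !r.components().contains(comp) && r.failed()).count();
            double pWith = with == 0 ? 0 : (double) withFailed / with;
            double pWithout = without == 0 ? 0 : (double) withoutFailed / without;
            double gap = Math.abs(pWith - pWithout);
            if (gap > bestGap) { bestGap = gap; best = comp; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Request> reqs = List.of(
                new Request(Set.of("Frontend", "CartEJB"), true),
                new Request(Set.of("Frontend", "CartEJB", "CatalogEJB"), true),
                new Request(Set.of("Frontend", "CatalogEJB"), false),
                new Request(Set.of("Frontend", "CatalogEJB"), false));
        System.out.println("suspect component: "
                + bestSplit(reqs, Set.of("Frontend", "CartEJB", "CatalogEJB")));
        // Prints "CartEJB": failures correlate with requests that touched it,
        // but not with CatalogEJB or the (always-present) Frontend.
    }
}
```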

  49. Illustrative Decision Tree

  50. Results: Comparing Localization Rate
