Exploring Failure Transparency & Recovery Limits: A Detailed Study

Exploring Failure Transparency and the Limits of Generic Recovery • Dave Lowell, Compaq Western Research Lab • Subhachandra Chandra and Peter M. Chen, University of Michigan

Introduction • Failure transparency: abstraction of failure-free operation • OS recovers app after hardware, OS, and application failures • No programmer help • No slowdown • Will explore theory, performance, and limitations

Consistent recovery • Visible output equivalent to failure-free run • Equivalence: allows duplicates • Avoids the “exactly once” problem • Failure transparency: consistent recovery with generic techniques
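The notion of equivalence that tolerates duplicates can be illustrated with a small check. This is only a sketch under a stated assumption: "duplicate" is taken to mean a repeat of some visible event that was already output earlier in the same run (as happens when a recovered process replays output). The function name and the string encoding of events are hypothetical, not from the slides.

```python
from typing import Sequence


def equivalent_allowing_duplicates(expected: Sequence[str],
                                   observed: Sequence[str]) -> bool:
    """Return True if `observed` equals `expected` once repeats of
    already-emitted visible events are ignored (assumed definition)."""
    seen = set()      # visible events already output in the observed run
    next_idx = 0      # position of the next expected failure-free event
    for ev in observed:
        if next_idx < len(expected) and ev == expected[next_idx]:
            next_idx += 1          # event advances the failure-free sequence
        elif ev not in seen:
            return False           # new output that a failure-free run lacks
        seen.add(ev)               # later copies of ev count as duplicates
    return next_idx == len(expected)


failure_free = ["a", "b", "c"]
replayed = ["a", "b", "a", "b", "c"]   # output repeated after a rollback
wrong = ["a", "c"]                      # skips "b", emits "c" too early
```

Here `replayed` would be accepted and `wrong` rejected, which is why an "exactly once" guarantee is not needed for this kind of recovery.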

Guaranteeing consistent recovery • Key players: non-deterministic events, visible events, commit events • Save-work invariant (simplified): • There’s a commit after each non-deterministic event that happens-before a visible event. • Full theorem handles liveness, distinguishes causality and ordering
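The simplified Save-work invariant can be checked mechanically on a recorded event trace. The sketch below assumes a single process, so that happens-before reduces to trace order, and it assumes the commit must occur before the visible event it protects. The event names and the function are hypothetical; the full theorem on the slide (liveness, causality versus ordering) is not modeled.

```python
from typing import List

# Hypothetical event kinds for a single-process trace.
ND, VISIBLE, COMMIT, OTHER = "nd", "visible", "commit", "other"


def save_work_violations(trace: List[str]) -> List[int]:
    """Return indices of visible events that follow a non-deterministic
    event with no commit in between (simplified Save-work check)."""
    violations = []
    uncommitted_nd = False
    for i, kind in enumerate(trace):
        if kind == ND:
            uncommitted_nd = True       # work that a failure could lose
        elif kind == COMMIT:
            uncommitted_nd = False      # everything so far is now saved
        elif kind == VISIBLE and uncommitted_nd:
            violations.append(i)        # output depends on unsaved ND event
    return violations


ok_trace = [ND, OTHER, COMMIT, VISIBLE]
bad_trace = [ND, OTHER, VISIBLE, COMMIT]   # commit arrives too late
```

In this model `ok_trace` yields no violations, while `bad_trace` reports the visible event at index 2.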

Figure: protocol space (build 1). Axis: effort to identify/convert ND events. Labels shown: Commit All, CAND, CAND-LOG.

Figure: protocol space (build 2). Axes: effort to commit only visible events; effort to identify/convert ND events. Protocols shown: CPV-2PC, CBNDV-2PC, CBNDVS, CBNDVS-LOG, CPVS, CAND, CAND-LOG.

Figure: protocol space (build 3), same axes and protocols, with existing systems placed in it: Manetho, Coordinated Checkpointing, Optimistic Logging, Targon/32, Hypervisor, SBL.

Figure: trends across the protocol space, same axes. Arrow labels: increasing simplicity, application failure recovery, increasing recovery time, increasing performance.

Performance study • Discount Checking: fast checkpoints to reliable memory (Rio) • Logging and two-phase commit • Disk version • Mostly interactive applications • Localized and distributed

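Nvi Text Editor: overhead at each point in the protocol space (axes: effort to commit only visible events; effort to identify/convert ND events; two values per protocol, presumably the reliable-memory and disk versions) • CBNDVS: 1%, 42% • CBNDVS-LOG: 0%, 12% • CPVS: 1%, 44% • CAND: 1%, 43% • CAND-LOG: 0%, 13%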

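TreadMarks Barnes-Hut: overhead at each point in the protocol space (axes: effort to commit only visible events; effort to identify/convert ND events; two values per protocol, presumably the reliable-memory and disk versions) • CPV-2PC: 12%, 319% • CBNDV-2PC: 12%, 252% • CBNDVS: 101%, 5743% • CBNDVS-LOG: 73%, 4973% • CPVS: 129%, 7346% • CAND: 199%, 11499% • CAND-LOG: 126%, 7700%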

Have only considered “stop” failures • Committing everything is okay • Save-work: when we must commit • Some failures affect application state • Can we commit too much?

Dangerous Paths

Lose-work invariant • To recover from propagation failure, never commit on a “dangerous path”. • Save-work and Lose-work conflict! • Visible event on dangerous path • Can’t guarantee consistent recovery from propagation failures • Do we see this conflict in practice?
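The conflict between the two invariants can be made concrete with a toy trace. The sketch below rests on several assumptions that are not in the slides: each event carries a precomputed flag saying whether it lies on a dangerous path, the dangerous path is a contiguous suffix of the trace (once entered, it runs until the failure), and the process is single-threaded. The `Event` class and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str         # "nd", "visible", or "other"
    dangerous: bool   # True if the event lies on a dangerous path (assumed given)


def conflicting_visible_events(trace: List[Event]) -> List[int]:
    """Return indices of visible events where Save-work demands a commit
    that Lose-work forbids, under the assumptions stated above."""
    conflicts = []
    last_nd = None
    for i, ev in enumerate(trace):
        if ev.kind == "nd":
            last_nd = i
        elif ev.kind == "visible" and last_nd is not None:
            # Save-work: a commit is needed after the last ND event and
            # before this visible event. If that ND event is already on the
            # dangerous path, every such commit point is dangerous too.
            if trace[last_nd].dangerous:
                conflicts.append(i)
    return conflicts


conflict_trace = [Event("nd", False), Event("other", False),
                  Event("nd", True), Event("visible", True)]
safe_trace = [Event("nd", False), Event("other", True),
              Event("visible", True)]
```

In `conflict_trace` the visible event at index 3 cannot satisfy both invariants, whereas in `safe_trace` a commit taken right after the first event satisfies Save-work without committing on the dangerous path.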

Measuring Lose-work violations • Fault-injection study: OS crashes • Injected faults into a running kernel • Induced 350 OS crashes • Recovered nvi and postgres using Discount Checking • Results • nvi: 15% of crashes violate Lose-work • postgres: 3% of crashes violate Lose-work

Application crashes • Fault-injection study: ND bugs • nvi: 37% of crashes violate Lose-work • postgres: 33% of crashes violate Lose-work • Published bug distributions: 85-95% of application bugs are deterministic • Deterministic bugs intrinsically violate Lose-work • Perhaps > 90% of application crashes violate Lose-work!
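The "> 90%" estimate can be reproduced with a back-of-the-envelope calculation. The sketch assumes that every deterministic bug violates Lose-work and that the fault-injection rates for ND bugs (33% to 37%) apply to the remaining non-deterministic share; how the presenters actually combined the numbers is not stated on the slide, and the function name is hypothetical.

```python
def violating_fraction(deterministic_share: float,
                       nd_violation_rate: float) -> float:
    """Fraction of application crashes that violate Lose-work, assuming all
    deterministic bugs violate it and ND bugs violate it at the given rate."""
    nd_share = 1.0 - deterministic_share
    return deterministic_share + nd_share * nd_violation_rate


# Low end: 85% deterministic bugs, 33% of ND-bug crashes violate (postgres).
low = violating_fraction(0.85, 0.33)
# High end: 95% deterministic bugs, 37% of ND-bug crashes violate (nvi).
high = violating_fraction(0.95, 0.37)

print(f"estimated range: {low:.4f} to {high:.4f}")
```

Under these assumptions the estimate runs from roughly 90% to roughly 97% of application crashes.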

Conclusions • Save-work and Lose-work invariants • Save-work protocol space • Invariants fundamentally conflict • Failure transparency performance: • 0-12% overhead on reliable memory • 13-40% overhead on disk (interactive apps) • > 90% of application failures violate Lose-work
