This presentation explores failure transparency, consistent recovery, and the limits of generic recovery techniques. It covers the theory, performance, and limitations of failure transparency, with a focus on guaranteeing consistent recovery. It compares protocols such as Commit All, CAND, CAND-LOG, CPV-2PC, and CBNDV-2PC, which differ in how much effort they spend identifying or converting non-deterministic events and committing only visible events. Performance studies cover Discount Checking, logging, and two-phase commit, and fault-injection studies examine the challenges of recovering from application failures.
Exploring Failure Transparency and the Limits of Generic Recovery • Dave Lowell, Compaq Western Research Lab • Subhachandra Chandra and Peter M. Chen, University of Michigan
Introduction • Failure transparency: abstraction of failure-free operation • OS recovers app after hardware, OS, and application failures • No programmer help • No slowdown • Will explore theory, performance, and limitations
Consistent recovery • Visible output equivalent to failure-free run • equivalence: allows duplicates • avoids “exactly once” problem • Failure transparency = consistent recovery with generic techniques
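The notion of equivalence that allows duplicates can be illustrated with a small checker. This is only a sketch under assumptions that are not in the slides: visible events are modeled as distinct string labels, and a duplicate is taken to mean a repeat of an event that has already been output. The function name `is_equivalent` is hypothetical.

```python
def is_equivalent(observed, failure_free):
    """Return True if `observed` matches `failure_free` once repeats of
    already-output events are ignored (assumed reading of "allows duplicates")."""
    seen = set()      # events that have already been output at least once
    next_idx = 0      # position of the next expected failure-free event
    for ev in observed:
        if next_idx < len(failure_free) and ev == failure_free[next_idx]:
            seen.add(ev)
            next_idx += 1
        elif ev in seen:
            continue  # duplicate of earlier output: tolerated
        else:
            return False  # output the failure-free run would not produce here
    # Every failure-free event must eventually appear.
    return next_idx == len(failure_free)


reference = ["a", "b", "c"]
recovered = ["a", "b", "a", "b", "c"]  # re-emits "a", "b" after a rollback
print(is_equivalent(recovered, reference))
print(is_equivalent(["a", "c"], reference))
```

Under this reading, a recovery system may replay output after rolling back, which is what lets it sidestep the “exactly once” problem.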
Guaranteeing consistent recovery • Key players: non-deterministic events, visible events, commit events • Save-work invariant (simplified): • There’s a commit after each non-deterministic event that happens-before a visible event. • Full theorem handles liveness, distinguishes causality and ordering
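The simplified Save-work invariant can be checked mechanically on a recorded event trace. The sketch below assumes a single process, so that happens-before reduces to program order; the event labels (`"nd"`, `"visible"`, `"commit"`, `"det"`) and the function name are hypothetical. It does not cover the liveness or causality aspects handled by the full theorem.

```python
def upholds_save_work(trace):
    """Check the simplified Save-work invariant on a single-process trace.

    Assumption: events are listed in program order, so an ND event
    happens-before every later visible event in the same trace.
    """
    uncommitted_nd = False  # an ND event has occurred since the last commit
    for ev in trace:
        if ev == "nd":
            uncommitted_nd = True
        elif ev == "commit":
            uncommitted_nd = False
        elif ev == "visible" and uncommitted_nd:
            return False  # visible event depends on an uncommitted ND event
    return True


print(upholds_save_work(["nd", "commit", "visible"]))
print(upholds_save_work(["nd", "visible"]))
```

In the second trace the ND event could resolve differently after recovery, so output already shown to the user might not match any failure-free run.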
Save-work protocol space (figure, built up over four slides) • Axes: effort to commit only visible events; effort to identify/convert ND events • Protocols: Commit All, CAND, CAND-LOG, CPVS, CBNDVS, CBNDVS-LOG, CPV-2PC, CBNDV-2PC • Existing systems placed in the space: Manetho, Coordinated Checkpointing, Optimistic Logging, Targon/32, Hypervisor, SBL • Trends marked across the space: increasing simplicity, application failure recovery, increasing recovery time, increasing performance
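To make the difference between protocols in this space concrete, here is a rough sketch of three commit policies on a single-process trace. The slides do not expand the acronyms, so the readings used here are assumptions: CAND as "commit after each non-deterministic event", CPVS as "commit prior to each visible or send event", and CBNDVS as "commit between a non-deterministic event and the next visible or send event". The event labels and the function `add_commits` are hypothetical.

```python
OUTPUT_EVENTS = {"visible", "send"}


def add_commits(trace, protocol):
    """Insert "commit" events into `trace` according to an assumed policy."""
    if protocol not in {"CAND", "CPVS", "CBNDVS"}:
        raise ValueError(f"unknown protocol: {protocol}")
    out = []
    dirty = False  # an ND event has occurred since the last commit
    for ev in trace:
        if ev in OUTPUT_EVENTS:
            if protocol == "CPVS":
                out.append("commit")      # always commit before output
                dirty = False
            elif protocol == "CBNDVS" and dirty:
                out.append("commit")      # commit only if ND state is pending
                dirty = False
        out.append(ev)
        if ev == "nd":
            if protocol == "CAND":
                out.append("commit")      # commit right after every ND event
            else:
                dirty = True
    return out


trace = ["nd", "nd", "nd", "visible", "visible", "det", "visible"]
for name in ("CAND", "CPVS", "CBNDVS"):
    print(name, add_commits(trace, name).count("commit"))
```

On this trace the third policy commits least often, because it needs knowledge of both ND and visible events. That matches the two axes of the figure: more effort spent classifying events buys fewer commits.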
Performance study • Discount Checking: fast checkpoints to reliable memory (Rio) • Logging and two-phase commit • Disk version • Mostly interactive applications • Localized and distributed
Nvi Text Editor • Axes: effort to commit only visible events; effort to identify/convert ND events • Overhead per protocol (reliable memory / disk): • CBNDVS: 1% / 42% • CBNDVS-LOG: 0% / 12% • CPVS: 1% / 44% • CAND: 1% / 43% • CAND-LOG: 0% / 13%
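TreadMarks Barnes-Hut • Axes: effort to commit only visible events; effort to identify/convert ND events • Overhead per protocol (reliable memory / disk): • CPV-2PC: 12% / 319% • CBNDV-2PC: 12% / 252% • CBNDVS: 101% / 5743% • CBNDVS-LOG: 73% / 4973% • CPVS: 129% / 7346% • CAND: 199% / 11499% • CAND-LOG: 126% / 7700%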
Have only considered “stop” failures • Committing everything is okay • Save-work: when we must commit • Some failures affect application state • Can we commit too much?
Lose-work invariant • To recover from propagation failure, never commit on a “dangerous path”. • Save-work and Lose-work conflict! • Visible event on dangerous path • Can’t guarantee consistent recovery from propagation failures • Do we see this conflict in practice?
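The conflict between the two invariants can be shown on a toy trace. This sketch assumes a single process and assumes that the “dangerous path” is the stretch of execution from the point where the fault takes effect up to the crash; the slides do not define it, so `dangerous_start` and both function names are hypothetical.

```python
def upholds_save_work(trace):
    """Simplified Save-work check: no visible event after an uncommitted ND event."""
    pending_nd = False
    for ev in trace:
        if ev == "nd":
            pending_nd = True
        elif ev == "commit":
            pending_nd = False
        elif ev == "visible" and pending_nd:
            return False
    return True


def upholds_lose_work(trace, dangerous_start):
    """Lose-work check: no commit at or after the start of the dangerous path."""
    return "commit" not in trace[dangerous_start:]


# Hypothetical run: the ND event at index 3 triggers the fault, and a
# visible event occurs on the dangerous path before the crash.
prefix = ["nd", "commit", "visible"]
with_commit = prefix + ["nd", "commit", "visible", "crash"]
without_commit = prefix + ["nd", "visible", "crash"]

for trace in (with_commit, without_commit):
    print(upholds_save_work(trace), upholds_lose_work(trace, dangerous_start=3))
```

Committing before the visible event satisfies Save-work but preserves the faulty state; skipping the commit satisfies Lose-work but breaks Save-work. With a visible event on the dangerous path, no commit placement satisfies both.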
Measuring Lose-work violations • Fault-injection study: OS crashes • injected faults into running kernel • induced 350 OS crashes • recovered nvi and postgres using Discount Checking • Results • nvi: 15% of crashes violate Lose-work • postgres: 3% of crashes violate Lose-work
Application crashes • Fault-injection study: ND bugs • nvi: 37% violate Lose-work • postgres: 33% violate Lose-work • Published bug distributions: 85-95% of application bugs are deterministic • intrinsically violate Lose-work • Perhaps > 90% of app crashes violate Lose-work!
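One way to see where a figure above 90% could come from is to combine the numbers on this slide. The calculation below is an assumed reconstruction, not necessarily the authors' derivation: it treats every deterministic bug as violating Lose-work and applies the measured ND violation rates to the remaining bugs. The function name is hypothetical.

```python
def overall_violation_rate(deterministic_fraction, nd_violation_rate):
    """Fraction of application crashes violating Lose-work, assuming all
    deterministic bugs violate it and ND bugs violate it at the given rate."""
    nd_fraction = 1.0 - deterministic_fraction
    return deterministic_fraction + nd_fraction * nd_violation_rate


# Lowest case: 85% deterministic bugs, 33% ND violation rate (postgres).
low = overall_violation_rate(0.85, 0.33)
# Highest case: 95% deterministic bugs, 37% ND violation rate (nvi).
high = overall_violation_rate(0.95, 0.37)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the estimate falls between roughly 90% and 97%, in line with the slide's rough figure.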
Conclusions • Save-work and Lose-work invariants • Save-work protocol space • Invariants fundamentally conflict • Failure transparency performance: • 0-12% overhead on reliable memory • 13-40% overhead on disk (interactive apps) • > 90% of application failures violate Lose-work