
Exploring Failure Transparency and the Limits of Generic Recovery

Dave Lowell, Compaq Western Research Lab; Subhachandra Chandra and Peter M. Chen, University of Michigan.


Presentation Transcript


  1. Exploring Failure Transparency and the Limits of Generic Recovery. Dave Lowell, Compaq Western Research Lab; Subhachandra Chandra and Peter M. Chen, University of Michigan

  2. Introduction • Failure transparency: abstraction of failure-free operation • OS recovers app after hardware, OS, and application failures • No programmer help • No slowdown • Will explore theory, performance, and limitations

  3. Consistent recovery • Visible output equivalent to failure-free run • equivalence: allows duplicates • avoids “exactly once” problem • Failure transparency: consistent recovery with generic techniques
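
The equivalence on this slide can be illustrated with a minimal sketch. It assumes each visible event carries a unique identifier in the failure-free run, so that a repeat after recovery can be recognized as a duplicate; the function names are hypothetical, and liveness is ignored.

```python
from typing import Hashable, List, Sequence


def drop_duplicates(events: Sequence[Hashable]) -> List[Hashable]:
    """Keep only the first occurrence of each visible event, in order."""
    seen = set()
    kept = []
    for event in events:
        if event not in seen:
            seen.add(event)
            kept.append(event)
    return kept


def is_consistent(observed: Sequence[Hashable],
                  failure_free: Sequence[Hashable]) -> bool:
    """True if the observed output equals the failure-free output
    once duplicated visible events are ignored.

    Assumption (not from the slides): events in the failure-free run
    are distinct, e.g. tagged with sequence numbers.
    """
    return drop_duplicates(observed) == list(failure_free)


# After a rollback the application re-emits "a" and "b"; that is still consistent.
print(is_consistent(["a", "b", "a", "b", "c"], ["a", "b", "c"]))
# Losing "b" entirely is not.
print(is_consistent(["a", "c"], ["a", "b", "c"]))
```

Tolerating duplicates is what lets recovery sidestep the "exactly once" problem: the recovering process may safely repeat output it cannot prove was delivered.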

  4. Guaranteeing consistent recovery • Key players: non-deterministic events, visible events, commit events • Save-work invariant (simplified): • There’s a commit after each non-deterministic event that happens-before a visible event. • Full theorem handles liveness, distinguishes causality and ordering
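
As a rough illustration of the simplified Save-work invariant, the sketch below checks a single-process event trace, where happens-before reduces to program order. The event labels and the function name are assumptions made for this example; the full theorem (liveness, causality across processes) is not modeled.

```python
from typing import Iterable, List

# Hypothetical event labels for a single-process trace.
ND = "nd"            # non-deterministic event
VISIBLE = "visible"  # visible (output) event
COMMIT = "commit"    # commit event
OTHER = "other"      # deterministic, invisible work


def save_work_violations(trace: Iterable[str]) -> List[int]:
    """Return indices of visible events that follow a non-deterministic
    event with no commit in between (simplified Save-work check)."""
    violations = []
    uncommitted_nd = False
    for index, kind in enumerate(trace):
        if kind == ND:
            uncommitted_nd = True
        elif kind == COMMIT:
            uncommitted_nd = False
        elif kind == VISIBLE and uncommitted_nd:
            violations.append(index)
    return violations


# The commit lands between the ND event and the output: invariant upheld.
print(save_work_violations([ND, COMMIT, VISIBLE]))
# The first output depends on an uncommitted ND event: violation at index 2.
print(save_work_violations([ND, OTHER, VISIBLE, COMMIT, VISIBLE]))
```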

  5. Figure labels: Commit All; CAND; CAND-LOG. Axis: effort to identify/convert ND events

  6. Figure labels: CPV-2PC; CBNDV-2PC; CBNDVS; CBNDVS-LOG; CPVS; CAND; CAND-LOG. Axes: effort to commit only visible events; effort to identify/convert ND events

  7. Figure labels: Manetho; Coord. Checkpointing; Optimistic Logging; Targon/32; Hypervisor; SBL; CPV-2PC; CBNDV-2PC; CBNDVS; CBNDVS-LOG; CPVS; CAND; CAND-LOG. Axes: effort to commit only visible events; effort to identify/convert ND events

  8. Figure labels: increasing simplicity; application failure recovery; increasing recovery time; increasing performance. Axes: effort to commit only visible events; effort to identify/convert ND events

  9. Performance study • Discount Checking: fast checkpoints to reliable memory (Rio) • Logging and two-phase commit • Disk version • Mostly interactive applications • Localized and distributed

  10. Nvi Text Editor • CBNDVS: 1% / 42% • CBNDVS-LOG: 0% / 12% • CPVS: 1% / 44% • CAND: 1% / 43% • CAND-LOG: 0% / 13% • Axes: effort to commit only visible events; effort to identify/convert ND events

  11. TreadMarks Barnes-Hut • CPV-2PC: 12% / 319% • CBNDV-2PC: 12% / 252% • CBNDVS: 101% / 5743% • CBNDVS-LOG: 73% / 4973% • CPVS: 129% / 7346% • CAND: 199% / 11499% • CAND-LOG: 126% / 7700% • Axes: effort to commit only visible events; effort to identify/convert ND events

  12. Have only considered “stop” failures • Committing everything is okay • Save-work: when we must commit • Some failures affect application state • Can we commit too much?

  13. Dangerous Paths

  14. Dangerous Paths

  15. Lose-work invariant • To recover from propagation failure, never commit on a “dangerous path”. • Save-work and Lose-work conflict! • Visible event on dangerous path • Can’t guarantee consistent recovery from propagation failures • Do we see this conflict in practice?
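
The conflict between the two invariants can be sketched on a single-process trace. This is only an illustration under stated assumptions: the trace labels, the function name, and the `dangerous_from` index (the position where the dangerous path is taken to begin) are all hypothetical inputs, since the slides show dangerous paths only as figures.

```python
from typing import Sequence


def invariants_conflict(trace: Sequence[str], dangerous_from: int) -> bool:
    """Return True if Save-work and Lose-work cannot both hold.

    Assumed model (not from the slides): every event at index
    >= dangerous_from lies on the dangerous path. Save-work needs a
    commit between a non-deterministic ("nd") event and any later
    "visible" event. If both events are on the dangerous path, that
    commit would also fall on the path, which Lose-work forbids.
    """
    nd_on_path = False
    for kind in trace[dangerous_from:]:
        if kind == "nd":
            nd_on_path = True
        elif kind == "visible" and nd_on_path:
            return True
    return False


# ND event and output both on the dangerous path: no safe place to commit.
print(invariants_conflict(["nd", "commit", "nd", "other", "visible"], 2))
# The only ND event was committed before the path began: no conflict.
print(invariants_conflict(["nd", "commit", "other", "visible"], 2))
```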

  16. Measuring Lose-work violations • Fault-injection study: OS crashes • injected faults into running kernel • induced 350 OS crashes • recovered nvi and postgres using Discount Checking • Results • nvi: 15% of crashes violate Lose-work • postgres: 3% of crashes violate Lose-work

  17. Application crashes • Fault-injection study: ND bugs • nvi: 37% violate Lose-work • postgres: 33% violate Lose-work • Published bug distributions: 85-95% of application bugs are deterministic • intrinsically violate Lose-work • Perhaps > 90% of app crashes violate Lose-work!
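
One plausible way to arrive at the final estimate is sketched below. The way the figures are combined is an assumption, not something the slides spell out: deterministic bugs are taken to always violate Lose-work, and non-deterministic bugs to violate it at the fault-injection rates measured for nvi and postgres.

```python
def violation_fraction(deterministic_share: float,
                       nd_violation_rate: float) -> float:
    """Estimated fraction of application crashes that violate Lose-work.

    Assumed model: every deterministic bug violates the invariant, and
    a non-deterministic bug violates it with probability nd_violation_rate.
    """
    return deterministic_share + (1.0 - deterministic_share) * nd_violation_rate


# Lowest and highest combinations of the figures quoted on the slide.
low = violation_fraction(0.85, 0.33)   # 85% deterministic, postgres ND rate
high = violation_fraction(0.95, 0.37)  # 95% deterministic, nvi ND rate
print(f"estimated range: {low:.4f} to {high:.4f}")
```

Under these assumptions the estimate runs from roughly 90% to 97%, which is in line with the slide's "perhaps > 90%".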

  18. Conclusions • Save-work and Lose-work invariants • Save-work protocol space • Invariants fundamentally conflict • Failure transparency performance: • 0-12% overhead on reliable memory • 13-40% overhead on disk (interactive apps) • > 90% of application failures violate Lose-work

  19. Chart example
