
Why Do Computers Stop and What Can Be Done About It?

Jim Gray. Presentation: Joe Tucek.


Presentation Transcript


  1. Why Do Computers Stop and What Can Be Done About It? Jim Gray Presentation – Joe Tucek

  2. Introduction • Many applications require high availability • Patient monitoring • Transaction processing (banks) • Yet, computers fail • 99.6% uptime still means roughly 40 minutes of downtime a week • This paper – • What Tandem Non-Stop does • How well it works • What more can we do

  3. Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions

  4. Principles of Reliable Design • Availability is MTTF/(MTTF+MTTR) • Really small MTTR is good • “Spare modules are configured to give the appearance of instantaneous repair” • Modules must fail independently • Modules must fail stop • Mirrored hard disks • 10,000 hour MTTF, 24 hour MTTR • 1000 year MTTF for the system
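
The availability formula on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 1 hour repair time is an assumed value added only to show the effect of a smaller MTTR. It covers a single module and does not attempt to model the mirrored pair.

```python
HOURS_PER_WEEK = 7 * 24


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def weekly_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per week at a given availability."""
    return (1.0 - avail) * HOURS_PER_WEEK


# Single disk from the slide: 10,000 hour MTTF, 24 hour MTTR.
single_disk = availability(10_000, 24)

# Assumed 1 hour repair: same MTTF, but a much smaller MTTR.
fast_repair = availability(10_000, 1)

print(f"single disk availability: {single_disk:.5f}")
print(f"with 1 hour repair:       {fast_repair:.5f}")
print(f"weekly downtime at 99.6%: {weekly_downtime_hours(0.996):.2f} h")
```

Shrinking MTTR raises availability without touching MTTF, which is the point behind "the appearance of instantaneous repair".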

  5. Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions

  6. Failure Results

  7. Failure Results

  8. Failure Results

  9. Failure Results • Software and administration are • 65% of all failures • 80% of non-environmental failures • Of software and administration failures • Software is 37% • Administration is 63% • Simplifying and minimizing administrator involvement is the top priority!

  10. Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • Modularity • Heisenbugs • Process Pairs • “The” Solution • Conclusions

  11. Software Reliability-Modularity • Hardware uses fail-fast modules • Copy hardware’s successful technique • Modular processes • Cannot corrupt each other (fault isolation) • Can easily be made fail fast • Defensive programming • Can simply be killed (fail stop) • So, we can use redundant modules • But computers are deterministic automata…
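
The idea of a fail-fast software module can be sketched with ordinary OS processes. This is an illustration under stated assumptions, not Tandem's mechanism: the worker source, the `balance` input, and the `run_module` helper are all hypothetical. The worker checks its input defensively and exits at once on a bad state, and because it runs in its own process it cannot corrupt the caller's memory.

```python
import subprocess
import sys

# Hypothetical worker: validates its input and stops immediately
# (non-zero exit status) instead of continuing with a bad state.
WORKER_SOURCE = """
import sys
balance = int(sys.argv[1])
if balance < 0:
    sys.exit(1)          # fail fast: refuse to run on an invalid state
print(balance * 2)
"""


def run_module(balance: int) -> tuple[bool, str]:
    """Run the worker in its own OS process; return (ok, output)."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, str(balance)],
        capture_output=True,
        text=True,
        timeout=30,  # a hung worker raises TimeoutExpired after being killed
    )
    return result.returncode == 0, result.stdout.strip()


print(run_module(21))   # valid input: worker completes
print(run_module(-5))   # invalid input: worker stops without output
```

The caller only ever sees a clean result or a clean failure, which is what makes it practical to put a redundant module behind the same interface.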

  12. Software Reliability-Heisenbugs • What bugs are the hardest to find? • Race conditions • Limit conditions • “Unusual” system states • These are “heisenbugs” • They go away after you add debugging symbols • Or run it in a debugger • Or add that printf • Or do anything at all to find the stupid little thing • 131 of 132 faults are heisenbugs
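
Because a heisenbug depends on fleeting state, simply restarting the operation usually makes it disappear. The sketch below simulates that deterministically: `FlakyOperation`, `TransientFault`, and `run_with_retry` are hypothetical names, and the "bug" is a counter rather than a real race condition.

```python
class TransientFault(Exception):
    """Stands in for a heisenbug: a fault tied to fleeting state."""


class FlakyOperation:
    """Hypothetical operation that fails on its first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientFault("unusual state, gone on the next attempt")
        return "done"


def run_with_retry(operation, max_attempts: int = 3) -> str:
    """Restart the operation from scratch when it hits a transient fault."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # the fault is not transient after all
    raise ValueError("max_attempts must be at least 1")


operation = FlakyOperation(failures=1)
print(run_with_retry(operation), operation.calls)
```

A fault that survives every retry is re-raised, since retrying only helps when the fault really is transient.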

  13. Software Reliability-Process Pairs • Process Pairs are redundant modules • Lockstep execution for hardware faults • Manual checkpoint rollback • Still used in scientific computing • Automatic Checkpointing • Easier to do • “Delta” Checkpoints • Programmatically harder • Persistent Processes • Fast fail-over with “amnesia”
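
A checkpointing process pair can be sketched in a single process to show the data flow. This is a toy model under stated assumptions: the `Primary`, `Backup`, and `take_over` names are hypothetical, the "message" to the backup is a plain method call, and a real system would also have to order checkpoints against externally visible actions.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Passive half of the pair: remembers what the primary last sent."""
    state: dict = field(default_factory=dict)

    def full_checkpoint(self, state: dict) -> None:
        self.state = dict(state)      # copy the entire state

    def delta_checkpoint(self, changes: dict) -> None:
        self.state.update(changes)    # copy only what changed


@dataclass
class Primary:
    """Active half: applies each update, then checkpoints it."""
    backup: Backup
    state: dict = field(default_factory=dict)

    def apply(self, key: str, value: int) -> None:
        self.state[key] = value
        self.backup.delta_checkpoint({key: value})


def take_over(backup: Backup) -> Primary:
    """Promote the backup: resume from the last checkpointed state."""
    resumed = dict(backup.state)
    return Primary(backup=Backup(state=dict(resumed)), state=resumed)


backup = Backup()
primary = Primary(backup=backup)
primary.apply("orders", 1)
primary.apply("orders", 2)
primary.apply("shipped", 1)

# Simulate the primary failing; the backup continues where it left off.
new_primary = take_over(backup)
print(new_primary.state)
```

The delta variant sends less data per update but requires the programmer to identify exactly what changed, which matches the slide's "programmatically harder" remark.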

  14. Overview • Introduction • Principles of Reliable Design • Failure Results • Software Reliability Techniques • “The” Solution • Conclusions

  15. Transactions • The reason Jim Gray is famous… • Either things happen or they don’t • You never see things half-happened • What happens cannot be corrupting • What happens has happened forever • ACID • Atomicity, Consistency, Integrity, Durability
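
The "either things happen or they don't" property can be demonstrated with Python's standard `sqlite3` module. This is a small sketch: the `account` table, the names, and the `fail_midway` switch are made up for illustration, and an in-memory database shows atomicity only, not durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money in one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))


failed = transfer(conn, "alice", "bob", 30, fail_midway=True)
after_failure = balances(conn)   # the half-done debit was rolled back
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failure)
print(succeeded, after_success)
```

The failed transfer leaves no trace of the debit, so no reader of the table ever sees the money half-moved.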

  16. Transactions • Consider “persistent” process pairs • They’re easy, but they leave things half done • Why not combine them with transactions? • Of course, the OS and DB must be reliable • Still state of the art • Microreboot -- A Technique for Cheap Recovery. Candea, Kawamoto, Fujiki, Friedman, Fox. OSDI 2004.

  17. Conclusions • Computers mostly fail due to software and administration • Modular redundancy makes hardware reliable • Same can work for software • Most “hard” bugs are heisenbugs • So redundant processes solve them • Transactions are the way to store state • And yet, 20 years later, computers crash…