
Recovery Oriented Computing Embracing Failure

Recovery Oriented Computing Embracing Failure. A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001 A little of … A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper).



Presentation Transcript


  1. Recovery Oriented Computing: Embracing Failure
     • A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
     • A little of … A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper)
     Fabián E. Bustamante, Winter 2006

  2. Availability and today’s apps
     • Availability is the most important metric for modern computer systems
     • Availability used to be a solved problem
       • Expensive fault-tolerant server
       • Vendor-supplied high-availability database system
       • All in one box, well firewalled
     • Today’s apps are quite different
       • Distributed, heterogeneous environment
       • Conglomeration of interconnected systems: databases, application servers, middleware, web servers
     • So: 65% of surveyed sites suffered a customer-visible outage at least once in a 6-month period; 25% suffered 3 or more in the same period
     CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

  3. Problem with assumptions
     • Basic model
       • Hardware and software can be built with negligible failure rates
       • Failure modes of systems can be predicted and tolerated
       • Maintenance and repair are error-free procedures
     • More realistically
       • Hardware and software failures are inevitable
       • Human failures are inevitable
       • Unanticipated failures are inevitable
     • Your only option: get used to it and embrace failure, i.e. Recovery Oriented Computing (ROC)

  4. HW & SW failures are inevitable
     • Software: functionality is king; a constant race to offer new functionality → sloppy people & buggy code
     • Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
     • Scale only multiplies the problem!

  5. Human failures are inevitable
     • Large systems rely on human beings for
       • Maintenance and repair
       • Software configuration and upgrading
       • Performance tuning
       • Diagnosing and fixing failures
     • Human beings make mistakes
       • At a rate of 10-100% under stress
       • 70% of failures in electronic systems, 20-53% in missile systems, 60-70% in aircraft failures, 50% in VAX systems, 42% in Tandem systems, …
     • But modern systems do not take into account the possibility of human failure

  6. Unanticipated failures are inevitable
     • Could you solve this with good engineering?
       • Not really
     • Perrow’s work on high-risk technology
       • Large servers: complex, reasonably tightly coupled systems, performing complex tasks under human guidance … prone to “normal accidents”
       • Accidents that arise from the multiple and unexpected hidden interactions of smaller failures and the recovery systems designed to handle them

  7. Recovery Oriented Computing
     • Focus on repair instead of avoiding failures
     • Recovery needs to be a first-class part of the system
     • It must
       • Ensure problems are detected fast (for containment)
       • Provide assistance in diagnosing their root cause
     • Repair mechanisms should be trustworthy
       • Should tolerate errors during recovery
     • It’s really complementary to fault tolerance (redundancy is thus necessary)
     • Should automatically track the health of all components, so it should include fault-injection mechanisms
     • …
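
One way to see why focusing on repair can pay off is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The slides do not state this formula; the sketch below assumes it, and the numbers (1000 hours between failures, 10 hours to repair) are made up purely for illustration.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Illustrative numbers only (not taken from the slides).
baseline = availability(1000.0, 10.0)

# Option A: make failures 10x rarer (the "avoid failures" route).
longer_mttf = availability(10000.0, 10.0)

# Option B: make repair 10x faster (the "recovery-oriented" route).
faster_repair = availability(1000.0, 1.0)

# Both options yield the same availability, because only the ratio
# MTTF / MTTR matters in the formula.
same_gain = math.isclose(longer_mttf, faster_repair)
print(f"baseline={baseline:.6f} longer_mttf={longer_mttf:.6f} "
      f"faster_repair={faster_repair:.6f} same_gain={same_gain}")
```

Under this model, cutting repair time by a factor of ten buys the same availability as making failures ten times rarer, which is the quantitative side of the "focus on repair" bullet.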

  8. Undoable e-mail store
     • You have undo for Office, but not for admins?!
     • Undo for operators incorporates three steps
       • Rewind: the system is physically rolled back to before the damage
       • Repair: do not constrain admins in what repairs they can make
       • Replay: logically bring the system back to the present (to incorporate the repair)
     • Two challenges in the 3Rs model
       • Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
       • Keep the system consistent from an external observer’s point of view (even ‘after’ repair)
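
The Rewind / Repair / Replay steps can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `MailStore`, `UndoManager`, the word-based filter, and the deep-copy checkpoint are all hypothetical stand-ins, and the sketch ignores the external-consistency challenge entirely.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MailStore:
    """Hypothetical toy mail store: mailboxes plus a word-based filter."""
    mailboxes: dict = field(default_factory=dict)
    blocked_words: set = field(default_factory=set)

    def deliver(self, user: str, message: str) -> None:
        if any(word in message for word in self.blocked_words):
            return  # silently dropped by the filter
        self.mailboxes.setdefault(user, []).append(message)


class UndoManager:
    """Hypothetical undo manager: logs user requests and keeps checkpoints."""

    def __init__(self, store: MailStore) -> None:
        self.store = store
        self.timeline = []   # logged user requests, in arrival order
        self.snapshots = {}  # timeline position -> saved copy of the store

    def checkpoint(self) -> int:
        position = len(self.timeline)
        self.snapshots[position] = copy.deepcopy(self.store)
        return position

    def execute(self, user: str, message: str) -> None:
        self.timeline.append((user, message))
        self.store.deliver(user, message)

    def undo(self, position: int, repair: Callable[[MailStore], None]) -> None:
        # Rewind: restore the state saved before the damage.
        self.store = copy.deepcopy(self.snapshots[position])
        # Repair: the operator may change the system in any way.
        repair(self.store)
        # Replay: re-execute the logged requests against the repaired system.
        for user, message in self.timeline[position:]:
            self.store.deliver(user, message)


manager = UndoManager(MailStore())
start = manager.checkpoint()

# Operator mistake (not a user request, so it is not in the timeline):
manager.store.blocked_words.add("invoice")

manager.execute("alice", "invoice #1")  # lost because of the bad filter
manager.execute("alice", "hello")
before_undo = list(manager.store.mailboxes.get("alice", []))

# Undo: rewind to the checkpoint, install the intended filter, replay.
manager.undo(start, lambda store: store.blocked_words.add("lottery"))
after_undo = list(manager.store.mailboxes.get("alice", []))
print(before_undo, after_undo)
```

Only user requests are logged, so the operator's mistaken change disappears on rewind while the users' mail is re-delivered on replay. A real system would also have to deal with state that users have already observed, which is the second challenge listed above.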

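  9. Undo system architecture
     [Diagram; labels recovered from the slide]
     • Components: User, Control UI, Undo Proxy, Undo Manager, Service App, Timeline log, Time travel storage
     • Interfaces: service-specific requests, Verbs, Control
     • Annotation: “In part to make the undo manager generic” (beside the service-specific proxy and verbs)
     • Annotation: “To be able to roll back the system” (beside the timeline log and time travel storage)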
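
The proxy-and-verbs part of the architecture can be sketched as follows. The slide only names the components, so everything here is an assumption: the `Verb` record, the `UndoProxy` class, the one-line text request format, and `fake_mail_service` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verb:
    """Hypothetical generic record of one user request."""
    name: str
    args: tuple


class UndoProxy:
    """Hypothetical proxy that sits between users and the service.

    The service-specific knowledge lives in `to_verb`; the logging and
    forwarding logic only ever sees generic Verb objects.
    """

    def __init__(self, to_verb: Callable[[str], Verb],
                 service: Callable[[Verb], str]) -> None:
        self.to_verb = to_verb
        self.service = service
        self.timeline = []  # verbs in arrival order

    def handle(self, request: str) -> str:
        verb = self.to_verb(request)   # translate protocol request to verb
        self.timeline.append(verb)     # record it for later replay
        return self.service(verb)      # forward to the real service


def mail_to_verb(request: str) -> Verb:
    """Service-specific parser for a made-up one-line mail protocol."""
    command, _, rest = request.partition(" ")
    return Verb(command.upper(), tuple(rest.split()))


def fake_mail_service(verb: Verb) -> str:
    """Stand-in for the real service application."""
    return f"OK {verb.name}"


proxy = UndoProxy(mail_to_verb, fake_mail_service)
reply = proxy.handle("deliver alice hello")
print(reply, proxy.timeline)
```

Keeping the parsing in `mail_to_verb` means a different service could reuse `UndoProxy` unchanged by supplying its own translator, which is one reading of the slide's note about keeping the undo manager generic.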
