Recovery Oriented Computing Embracing Failure

A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
A little of … A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper)
Fabián E. Bustamante, Winter 2006

Availability and today’s apps
• Availability is the most important metric for modern computer systems
• Availability used to be a solved problem
  • Expensive fault-tolerant server
  • Vendor-supplied high-availability database system
  • All behind a well-firewalled box
• Today’s apps are quite different
  • Distributed, heterogeneous environment
  • Conglomeration of interconnected systems: databases, application servers, middleware, web servers
• So: 65% of surveyed sites suffered a customer-visible outage at least once in a 6-month period; 25% had three or more in the same period
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

Problem with assumptions
• Basic model
  • Hardware and software can be built with negligible failure rates
  • Failure modes of systems can be predicted and tolerated
  • Maintenance and repair are error-free procedures
• More realistically
  • Hardware and software failures are inevitable
  • Human failures are inevitable
  • Unanticipated failures are inevitable
• Your only option: get used to it and embrace failure, i.e. Recovery Oriented Computing (ROC)

HW & SW failures are inevitable
• Software: functionality is king; a constant race to offer new functionality → sloppy people & buggy code
• Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
Scale only multiplies the problem!
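The "scale multiplies the problem" point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that components fail independently and with the same probability over some period; the function name `prob_any_failure` and the 1% figure are hypothetical and do not come from the slides.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails in a period.

    Assumes each component fails independently with the same probability p,
    which is a simplification chosen for this illustration.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n components survive with probability (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Illustrative values only: a 1% per-component failure probability.
    for n in (1, 10, 100, 1000):
        print(f"{n:5d} components -> {prob_any_failure(0.01, n):.3f}")
```

Even with an individually reliable component, the chance that something in the system is broken grows quickly with the component count, which is the argument for planning around failure rather than assuming it away.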

Human failures are inevitable
• Large systems rely on human beings for
  • Maintenance and repair
  • Software configuration and upgrading
  • Performance tuning
  • Diagnosing and fixing failures
• Human beings make mistakes
  • At a rate of 10-100% under stress
  • Blamed for 70% of failures in electronic systems, 20-53% in missile systems, 60-70% in aircraft failures, 50% in VAX systems, 42% in Tandem systems, …
• But modern systems do not take into account the possibility of human failure

Unanticipated failures are inevitable
• Could you solve this with good engineering?
  • Not really
• Perrow’s work on high-risk technology
  • Large servers: complex, reasonably tightly coupled systems, performing complex tasks under human guidance … prone to “normal accidents”
  • Accidents that arise from the multiple and unexpected hidden interactions of smaller failures and the recovery systems designed to handle them

Recovery Oriented Computing
• Focus on repair instead of avoiding failures
• Recovery needs to be a first-class part of the system
• It must
  • Ensure problems are detected fast (for containment)
  • Provide assistance in diagnosing their root cause
  • Provide trustworthy repair mechanisms
  • Tolerate errors during recovery
• It’s really complementary to fault tolerance (redundancy is thus necessary)
• Should automatically track the health of all components, so it should include fault-injection mechanisms
• …
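One way to see why focusing on repair pays off is through the standard steady-state definition of availability, MTTF / (MTTF + MTTR). The slide itself does not state this formula, so treat the sketch below as an assumption-laden illustration; the function name and the hour figures are made up.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical baseline: fails every 1000 h, takes 10 h to repair.
    print("baseline      :", availability(1000, 10))
    # Doubling time-to-failure (avoiding failures) ...
    print("2x MTTF       :", availability(2000, 10))
    # ... buys exactly as much as halving time-to-repair.
    print("half the MTTR :", availability(1000, 5))
```

Under this definition, halving repair time improves availability exactly as much as doubling time between failures, so faster recovery is a direct lever on availability even when failures cannot be prevented.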

Undoable e-mail store
• You have undo for Office, but not for admins?!
• Operator undo incorporates three steps
  • Rewind: the system is physically rolled back to before the damage
  • Repair: do not constrain admins on what repairs they can do
  • Replay: logically bring the system back to the present (to incorporate the repair)
• Two challenges in the 3Rs model
  • Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
  • Keep the system consistent from an external observer’s point of view (even ‘after’ repair)
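The rewind/repair/replay cycle and the timeline log can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions, not the design from the paper: the verb format `(name, mailbox, message)`, the `UndoableStore` class, and the snapshot-per-checkpoint approach are all hypothetical, and the external-consistency challenge is not addressed at all.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Verb = Tuple[str, str, str]      # (verb name, mailbox, message): hypothetical format
State = Dict[str, List[str]]     # mailbox -> messages


def apply_verb(state: State, verb: Verb) -> None:
    """Apply one logged verb to the store state in place."""
    name, mailbox, message = verb
    if name == "deliver":
        state.setdefault(mailbox, []).append(message)
    elif name == "delete":
        if message in state.get(mailbox, []):
            state[mailbox].remove(message)
    else:
        raise ValueError(f"unknown verb: {name}")


@dataclass
class UndoableStore:
    state: State = field(default_factory=dict)
    timeline: List[Verb] = field(default_factory=list)        # timeline log
    snapshots: Dict[int, State] = field(default_factory=dict)  # by log position

    def checkpoint(self) -> int:
        """Snapshot the current state; returns the timeline position."""
        pos = len(self.timeline)
        self.snapshots[pos] = copy.deepcopy(self.state)
        return pos

    def execute(self, verb: Verb) -> None:
        """Apply a verb and record it on the timeline."""
        apply_verb(self.state, verb)
        self.timeline.append(verb)

    def rewind(self, pos: int) -> List[Verb]:
        """Rewind: restore the snapshot at pos; return the verbs logged after it."""
        self.state = copy.deepcopy(self.snapshots[pos])
        pending = self.timeline[pos:]
        self.timeline = self.timeline[:pos]
        # Snapshots taken after pos describe a discarded future.
        self.snapshots = {p: s for p, s in self.snapshots.items() if p <= pos}
        return pending

    def replay(self, pending: List[Verb]) -> None:
        """Replay: re-execute the (possibly edited) verbs in order."""
        for verb in pending:
            self.execute(verb)


def demo() -> State:
    store = UndoableStore()
    pos = store.checkpoint()
    store.execute(("deliver", "alice", "hello"))
    store.execute(("delete", "alice", "hello"))    # the operator's mistake
    store.execute(("deliver", "alice", "lunch?"))
    pending = store.rewind(pos)                             # Rewind
    repaired = [v for v in pending if v[0] != "delete"]     # Repair: edit timeline
    store.replay(repaired)                                  # Replay
    return store.state
```

In this toy run, `demo()` ends with both messages in alice's mailbox: the mistaken delete is edited out of the timeline, while the delivery that arrived after it is preserved by replay rather than lost in the rollback.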

Undo system architecture
[Figure: undo system architecture. Labeled components: User, Control UI, Undo Proxy, Undo Manager, Service App, Timeline log, Time-travel storage. Labeled flows: "Verbs" (service specific) and "Control". Annotations: "In part to make the undo manager generic" and "To be able to roll back the system".]
