Recovery Oriented Computing
Embracing Failure
A. B. Brown and D. A. Patterson, Embracing failure: a case for recovery-oriented computing (ROC), HPTS, 2001
A little of …
A. B. Brown and D. A. Patterson, Undo for operators: Building an undoable e-mail store, USENIX ATC 2003 (Best paper)
Fabián E. Bustamante, Winter 2006

Availability and today’s apps
• Availability is the most important metric for modern computer systems
• Availability used to be a solved problem
  • Expensive fault-tolerant server
  • Vendor-supplied high-availability database system
  • All behind a well-firewalled box
• Today’s apps are quite different
  • Distributed, heterogeneous environment
  • Conglomeration of interconnected systems: databases, application servers, middleware, web servers
• So: 65% of surveyed sites suffered a customer-visible outage at least once in a 6-month period; 25% suffered 3 or more in the same period
CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

Problem with assumptions
• Basic model
  • Hardware and software can be built with negligible failure rates
  • Failure modes of systems can be predicted and tolerated
  • Maintenance and repair are error-free procedures
• More realistically
  • Hardware and software failures are inevitable
  • Human failures are inevitable
  • Unanticipated failures are inevitable
• Your only option: get used to it and embrace failure, i.e., Recovery Oriented Computing (ROC)

HW & SW failures are inevitable
• Software: functionality is king; a constant race to offer new functionality → sloppy people & buggy code
• Hardware: razor-thin margins mean no $ for high-quality, fault-tolerant hardware → commodity, failure-prone hardware
Scale only multiplies the problem!

Human failures are inevitable
• Large systems rely on human beings for
  • Maintenance and repair
  • Software configuration and upgrading
  • Performance tuning
  • Diagnosing and fixing failures
• Human beings make mistakes
  • At a rate of 10-100% under stress
  • Human error is behind 70% of failures in electronic systems, 20-53% in missile systems, 60-70% of aircraft failures, 50% in VAX systems, 42% in Tandem systems, …
• But modern systems do not take into account the possibility of human failure

Unanticipated failures are inevitable
• Could you solve this with good engineering?
  • Not really
• Perrow’s work on high-risk technology
  • Large servers: complex, reasonably tightly coupled systems, performing complex tasks under human guidance … prone to “normal accidents”
  • Accidents that arise from the multiple and unexpected hidden interactions of smaller failures and the recovery systems designed to handle them

Recovery Oriented Computing
• Focus on repair instead of avoiding failures
• Recovery needs to be a first-class part of the system
• It must
  • Ensure problems are detected fast (for containment)
  • Provide assistance in diagnosing their root cause
  • Have trustworthy repair mechanisms
  • Tolerate errors during recovery
• It’s really complementary to fault-tolerance (redundancy is thus necessary)
• It should automatically track the health of all components, so it should include fault-injection mechanisms
• …
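
To make the health-tracking and fault-injection points concrete, here is a minimal sketch of a drill that deliberately breaks a component and then checks that the failure is detected and repaired. Everything in it (`Component`, `HealthMonitor`, `fault_injection_drill`, the restart-based repair) is a hypothetical illustration, not something taken from the ROC papers.

```python
class Component:
    """Hypothetical component whose failure can be switched on for testing."""

    def __init__(self, name):
        self.name = name
        self.faulty = False

    def ping(self):
        # A health probe: raises if the component is currently broken.
        if self.faulty:
            raise RuntimeError(f"{self.name} is down")
        return "ok"

    def restart(self):
        # Assumed repair action: a restart clears the fault.
        self.faulty = False


class HealthMonitor:
    """Probes every component and restarts the ones that fail the probe."""

    def __init__(self, components):
        self.components = components

    def check(self):
        # Return the names of all components that fail their probe.
        unhealthy = []
        for component in self.components:
            try:
                component.ping()
            except RuntimeError:
                unhealthy.append(component.name)
        return unhealthy

    def recover(self):
        # Restart every unhealthy component and report which ones were hit.
        unhealthy = self.check()
        for component in self.components:
            if component.name in unhealthy:
                component.restart()
        return unhealthy


def fault_injection_drill(monitor, target):
    """Inject a fault, then verify both detection and recovery."""
    target.faulty = True
    detected = target.name in monitor.check()
    monitor.recover()
    recovered = monitor.check() == []
    return detected, recovered
```

Running the drill regularly exercises the recovery path itself, which is the reason the slide ties health tracking to fault injection: a repair mechanism that is never triggered cannot be trusted.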

Undoable e-mail store
• You have undo for Office, but not for admins?!
• Operator undo incorporates three steps (the 3Rs)
  • Rewind: the system is physically rolled back to before the damage
  • Repair: do not constrain admins on what repairs they can make
  • Replay: logically bring the system back to the present, incorporating the repair
• Two challenges in the 3Rs model
  • Timeline management: record the system timeline so that you can edit it during repair and re-execute it during replay
  • Keep the system consistent from an external observer’s point of view (even ‘after’ repair)
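
The Rewind, Repair, Replay cycle can be sketched on a toy in-memory mail store. This is only an illustration under simplifying assumptions: the names `MailStore`, `UndoableStore`, and `undo` are hypothetical, rewind is a deep copy of a single snapshot, and repair is modeled purely as editing the recorded timeline.

```python
import copy


class MailStore:
    """Toy mail store: maps a user name to a list of messages."""

    def __init__(self):
        self.boxes = {}

    def deliver(self, user, msg):
        self.boxes.setdefault(user, []).append(msg)

    def delete_all(self, user):
        self.boxes[user] = []


class UndoableStore:
    """Wraps a MailStore with a snapshot and a timeline of operations."""

    def __init__(self):
        self.store = MailStore()
        # Rewind point: a physical copy of the state before any operation.
        self.snapshot = copy.deepcopy(self.store)
        # Timeline: (operation name, arguments) recorded since the snapshot.
        self.timeline = []

    def apply(self, op, *args):
        # Record the operation first, then execute it on the live store.
        self.timeline.append((op, args))
        getattr(self.store, op)(*args)

    def undo(self, repair):
        # Rewind: physically restore the state from before the damage.
        self.store = copy.deepcopy(self.snapshot)
        # Repair: the operator edits the timeline however they see fit.
        self.timeline = repair(self.timeline)
        # Replay: logically re-execute the edited timeline.
        for op, args in self.timeline:
            getattr(self.store, op)(*args)
```

For example, if an operator mistakenly calls `delete_all` for a user between two deliveries, passing a repair function that drops that entry from the timeline brings the store back with both messages present, including the one that arrived after the mistake.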

Undo system architecture
[Figure: undo system architecture. Components: User, Control UI, Undo Proxy, Undo Manager, Service App, Timeline log, Time travel storage. Edge labels: Verbs, Control. Annotations: “In part to make the undo manager generic”, “Service specific”, “To be able to roll-back the system”.]
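
One way to read the split between the proxy, the verbs, and the undo manager is sketched below. It assumes that the "service specific" annotation refers to the proxy and its verbs and that the undo manager stays generic by seeing only an abstract verb interface; that reading, and every name in the code (`Verb`, `Deliver`, `UndoManager`, `UndoProxy`), is an assumption made for illustration. The service is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Protocol


class Verb(Protocol):
    """The only thing the generic undo manager knows about an operation."""

    def execute(self, service) -> None: ...


@dataclass
class Deliver:
    """Service-specific verb: deliver a message to a user's mailbox."""

    user: str
    msg: str

    def execute(self, service):
        service.setdefault(self.user, []).append(self.msg)


class UndoManager:
    """Generic: records and re-executes verbs without interpreting them."""

    def __init__(self):
        self.timeline = []  # plays the role of the timeline log

    def record(self, verb):
        self.timeline.append(verb)

    def replay(self, service):
        for verb in self.timeline:
            verb.execute(service)


class UndoProxy:
    """Service-specific: turns each request into a verb before it runs."""

    def __init__(self, service, manager):
        self.service = service
        self.manager = manager

    def deliver(self, user, msg):
        verb = Deliver(user, msg)
        self.manager.record(verb)   # log the verb on the timeline
        verb.execute(self.service)  # then let it reach the service
```

With this split, supporting a different service would mean writing new verbs and a new proxy while reusing the manager unchanged.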