
Toward Recovery-Oriented Computing



  1. Toward Recovery-Oriented Computing
     Armando Fox, Stanford University
     David Patterson, UC Berkeley
     and a cast of tens

  2. Outline
     • Whither recovery-oriented computing?
       • the research/industry agenda of the last 15 years
       • today’s pressing problem: availability (we knew that) - but what is new/different compared to previous fault-tolerance work, databases, etc.?
     • Recovery-Oriented Computing as an approach to availability
       • motivation and philosophy
       • a sampling of research avenues
       • what ROC is not

  3. Reevaluating goals & assumptions
     • Goals of the last 15 years
       • Goal #1: Improve performance
       • Goal #2: Improve performance
       • Goal #3: Improve cost-performance
     • Assumptions
       • Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance, or repair)
       • Software will eventually be bug-free (good programmers will write bug-free code, debugging works)
       • Hardware MTBF is already very large (~100 years between failures) and will continue to increase

  4. Results of this successful agenda
     • Good news: faster computers, denser disks, lower cost
       • computation faster by >3 orders of magnitude
       • disk capacity greater by >3 orders of magnitude
       • Result: TCO dominated by administration, not hardware cost
     • Bad news: complex, brittle systems that fail frequently
       • 65% of IT managers report that their websites were unavailable to customers over a 6-month period (25%: 3 or more outages) [InternetWeek, 4/3/2000]
       • outage costs: negative press, “click-overs” to competitors, stock price, market cap…
     • Yet availability is the key metric for online services!

  5. Direct Downtime Costs (per Hour)
     Brokerage operations            $6,450,000
     Credit card authorization       $2,600,000
     Ebay (22-hour outage)             $225,000
     Amazon.com                        $180,000
     Package shipping services         $150,000
     Home shopping channel             $113,000
     Catalog sales center               $90,000
     Airline reservation center         $89,000
     Cellular service activation        $41,000
     On-line network fees               $25,000
     ATM service fees                   $14,000
     Sources: InternetWeek, 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel, 2000, p. 8 (“...based on a survey done by Contingency Planning Research”).

  6. So, what are today’s challenges?
     • We all seem to agree on goals:
       • Dave Patterson, IPTS 2002: ACME “availability, change, maintenance, evolution”
       • Jim Gray, HPTS 2001: FAASM “functionality, availability, agility, scalability, manageability”
       • Butler Lampson, SOSP 1999: “Always available, evolving while they run, growing without practical limit”
       • John Hennessy, FCRC 1999: “Availability, maintainability and ease of upgrades, scalability”
       • Fox & Brewer, HotOS 1997: BASE “best-effort service, availability, soft state, eventual consistency”
     • We’re all singing the same tune, but what is new?

  7. What’s New and Different
     • Evolution and change are integral
       • not true of many “traditional” five-nines systems: long design cycles, and changes incur high overhead for design/spec/testing
       • Last version of the space shuttle software: 1 bug in 420 KLOC, costing $35M/yr to maintain (good-quality commercial SW: ~1 bug/KLOC)
       • But a recent upgrade for GPS support required generating 2,500 pages of specs before changing anything in 6.3 KLOC (1.5% of the code)
     • Performance is still important, but the focus has changed
       • Interactive performance and availability to end users are key
       • Users appear willing to occasionally tolerate temporary degradation (“service quality”) in exchange for improved availability
       • How to capture this tradeoff: soft/stale state, partial performance degradation, imprecise answers…

  8. ROC Philosophy
     • ROC philosophy (“Peres’s Law”): “If a problem has no solution, it may not be a problem, but a fact; not to be solved, but to be coped with over time” (Shimon Peres)
     • Failures (hardware, software, operator-induced) are a fact; recovery is how we cope with them over time
     • Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)
       Rather than just making MTTF very large, make MTTR << MTTF
     • Why?
       • Human errors will still cause outages => minimize recovery time
       • Recovery time is directly measurable, and directly captures the impact on users of a specific outage incident (MTTF doesn’t)
       • Rapid evolution makes exhaustive testing/validation impossible => unexpected/transient failures will still occur
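
A quick numeric illustration of why ROC targets MTTR: in the identity above, cutting MTTR by 10x buys exactly the same availability as stretching MTTF by 10x. A minimal sketch in Python, with illustrative numbers that are assumptions rather than figures from the talk:

```python
# A minimal sketch of the availability identity above; the MTTF/MTTR
# numbers are illustrative assumptions, not figures from the talk.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

baseline   = availability(mttf_hours=30 * 24, mttr_hours=4.0)    # fail monthly, 4 h repair
mttf_x10   = availability(mttf_hours=300 * 24, mttr_hours=4.0)   # 10x the MTTF
mttr_div10 = availability(mttf_hours=30 * 24, mttr_hours=0.4)    # 10x faster recovery

print(f"baseline:     {baseline:.5f}")    # 0.99448
print(f"10x MTTF:     {mttf_x10:.5f}")    # 0.99944
print(f"MTTR cut 10x: {mttr_div10:.5f}")  # 0.99944 (same gain, and MTTR is measurable)
```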

  9. 1. Human Error Is Inevitable
     • Human error is a major factor in downtime…
       • PSTN: half of all outage incidents and outage-minutes from 1992-1994 were due to human error (including errors by phone-company maintenance workers)
       • Oracle: up to half of DB failures due to human error (1999)
       • Microsoft blamed human error for a ~24-hour outage in Jan 2001
     • Approach:
       • Learn from the psychology of human error and from disaster case studies
       • Build in system support for recovery from human errors
       • Use tools such as error injection and virtual machine technology to provide “flight simulator” training for operators

  10. The 3R undo model
     • Undo == time travel for system operators
     • Three R’s for recovery
       • Rewind: roll system state backwards in time
       • Repair: change the system to prevent the failure
         • e.g., edit history, fix a latent error, retry an unsuccessful operation, install a preventative patch
       • Replay: roll system state forward, replaying end-user interactions lost during rewind
     • All three R’s are critical
       • rewind enables undo
       • repair lets the user/administrator fix problems
       • replay preserves updates and propagates fixes forward

  11. Example e-mail scenario
     • Before undo:
       • a virus-laden message arrives
       • the user copies it into a folder without looking at it
     • Operator invokes undo (rewind) to install a virus filter (repair)
     • During replay:
       • the message is redelivered but now discarded by the virus filter
       • the copy operation is now unsafe (the source message doesn’t exist)
       • compensating action: insert a placeholder for the message
       • now the copy command can be executed, making the history replay-acceptable
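
To make the 3R cycle concrete, the toy sketch below walks this e-mail scenario through rewind, repair, and replay. All names here (MailStore, the verb log, the ".exe" filter) are hypothetical stand-ins, not the prototype's API; the real implementation (next slide) interposes on live SMTP/IMAP traffic rather than replaying an in-memory log.

```python
# Toy sketch of the 3R cycle for the e-mail scenario above. All names are
# hypothetical stand-ins, not the real prototype's interfaces.

class MailStore:
    def __init__(self):
        self.folders = {"inbox": [], "saved": []}
        self.virus_filter = None           # installed during Repair

    def deliver(self, msg):
        if self.virus_filter and self.virus_filter(msg):
            return                         # replay: message now discarded
        self.folders["inbox"].append(msg)

    def copy(self, msg, dest):
        if msg in self.folders["inbox"]:
            self.folders[dest].append(msg)
        else:
            # Compensating action: the source vanished during replay, so
            # insert a placeholder to keep the history replay-acceptable.
            self.folders[dest].append(f"[placeholder for {msg}]")

def three_r(log, repair):
    store = MailStore()                    # Rewind: back to a clean state
    repair(store)                          # Repair: e.g., add a virus filter
    for verb, args in log:                 # Replay: re-run logged interactions
        getattr(store, verb)(*args)
    return store

log = [("deliver", ("virus.exe",)), ("copy", ("virus.exe", "saved"))]
fixed = three_r(log, repair=lambda s: setattr(
    s, "virus_filter", lambda m: m.endswith(".exe")))
print(fixed.folders)   # inbox is empty; 'saved' holds only a placeholder
```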

  12. First implementation attempt
     • Undo wrapper for an open-source IMAP email store
     • [Architecture diagram: a 3R proxy interposes on SMTP and IMAP traffic in front of the email server (which includes user state, mailboxes, the application, and the operating system); the 3R layer’s state tracker controls non-overwriting storage and an undo log]

  13. 3. Handling Transient Failures via Restart
     • Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix
     • Recursive Restarts: to minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn’t help, restart the next-higher containing set, etc.
     • Partial restarts/reboots
       • Return the system (mostly) to a well-tested, well-understood start state
       • High-confidence way to reclaim stale/leaked resources
       • Unlike true checkpointing, a reboot is more likely to avoid repeated failure due to corrupted state
     • We focus on proactive restarts; restarts can also be reactive (SW rejuvenation)
       • “Easier to run a system 365 times for 1 day than 365 days”
     • Goals (see the sketch after the next slide):
       • Find the software structure that can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)
       • Develop a methodology for building and managing RR systems (concrete engineering methods)
       • Develop the tools for building, testing, deploying, and managing RR systems
       • Design for fast restartability in online-service building blocks

  14. A Hierarchy of Restartable Units
     • Siblings are highly fault-isolated
       • low level: by high-confidence, low-level, HW-assisted machinery (e.g., MMU, physical isolation)
       • higher level: by VM-level abstractions built on that machinery (e.g., JVM, HW VM, process)
     • The R-map (a hierarchy of restartable component groups) captures restart dependencies
       • Groups of restart units can be restarted by a common parent
       • Restarting a node restarts everything in its subtree
       • A failure is minimally curable at a specific node
       • Restarts farther up the tree are more expensive, but give higher confidence of curing transients
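
A compact sketch of how slides 13-14 fit together. The RNode class and probe() callback are illustrative assumptions, not the RR prototypes' interfaces: restart the minimal restartable unit first, then escalate up the r-map until the failure clears.

```python
# Sketch of recursive restart over an r-map. RNode and probe() are
# illustrative assumptions, not the actual RR prototype interfaces.

class RNode:
    def __init__(self, name, children=(), restart_cost=1):
        self.name = name
        self.children = list(children)
        self.restart_cost = restart_cost   # proxy for this subtree's MTTR

    def restart(self):
        for child in self.children:        # a node restarts its whole subtree
            child.restart()
        print(f"restarting {self.name} (cost {self.restart_cost})")

def recursive_restart(path_to_root, probe):
    """path_to_root: r-map nodes from the faulty leaf up to the root.
    Try the cheapest (minimal) restart first; escalate to larger
    containing groups, which cost more but cure transients with
    higher confidence, until probe() reports the failure is gone."""
    for node in path_to_root:
        node.restart()
        if probe():
            return node
    raise RuntimeError("failure survived a full-system restart")

# A hypothetical r-map for a small service:
cache = RNode("cache")
app = RNode("app-server", children=[cache], restart_cost=5)
node = RNode("whole-node", children=[app], restart_cost=50)

results = iter([False, True])              # pretend the app restart cures it
cured_at = recursive_restart([cache, app, node], probe=lambda: next(results))
print("cured at:", cured_at.name)          # -> app-server
```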

  15. RR-ifying a satellite ground station
     • Biggest improvement: MTTF/MTTR-based boundary redrawing
       • Ability to isolate unstable components without penalizing the whole system
       • Achieve a balanced MTTF/MTTR ratio across components at the same level
     • Lower MTTR may be strictly better than higher MTTF
       • unplanned downtime is more expensive than planned downtime, and downtime under a heavy/critical workload (e.g., a satellite pass) is more expensive than downtime under a light/non-critical workload
       • high MTTF doesn’t guarantee a failure-free operation interval, but sufficiently low MTTR may mitigate the impact of a failure
     • Current work is applying RR to a ubiquitous computing environment, a J2EE application server, and an OSGi-based platform for cars => new lessons will emerge (e.g., the r-tree needs to be an r-DAG)
     • Most of these lessons are not surprising, but RR provides a uniform framework within which to discuss them

  16. MTTR Captures Outage Costs
     • Recent software-related outages at Ebay: 4.5 hours in Apr 2002, 22 hours in Jun 1999, 7 hours in May 1999, 9 hours in Dec 1998
     • Assume two 4-hour (“newsworthy”) outages per year:
       • A = (182*24 hours) / (182*24 + 4 hours) = 99.9%
     • Dollar cost: Ebay’s policy for a >2-hour outage credits fees to all affected users (US$3-5M for Jun 1999)
     • Customer loyalty: after the Jun 1999 outage, Yahoo Auctions reported a statistically significant increase in users
     • Ebay’s market cap dropped US$4B after the Jun 1999 outage; the stock price dropped 25%
     • Newsworthy due to the number of users affected, given the length of the outage

  17. Outage costs, cont.
     • What about a 10-minute outage once per week?
       • A = (7*24 hours) / (7*24 + 1/6 hours) = 99.9% - the same
     • Can we quantify the “savings” over the previous scenario? (see the sketch below)
       • Shorter outages affect fewer users at a time
       • A typical AOL email “outage” affects 1-2% of users
       • Many short outages may affect different subsets of users
       • Shorter outages are typically not newsworthy
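
One way to quantify the "savings" is user-minutes of downtime rather than the availability percentage. A back-of-the-envelope sketch: the ~2% figure comes from the AOL observation above, while the total user count is an invented assumption.

```python
# Back-of-the-envelope comparison of the two 99.9% scenarios. The user
# count is an invented assumption; the ~2% figure comes from the slide.

users = 1_000_000

# Scenario A (slide 16): two 4-hour outages/year, all users affected.
a = 2 * (4 * 60) * users * 1.00

# Scenario B (this slide): a 10-minute outage weekly, ~2% affected each time.
b = 52 * 10 * users * 0.02

print(f"A: {a:>12,.0f} user-minutes of downtime per year")  # 480,000,000
print(f"B: {b:>12,.0f} user-minutes of downtime per year")  #  10,400,000
# Identical availability on paper, but ~46x less user-visible downtime.
```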

  18. When Low MTTR Trumps High MTTF
     • MTTR is directly measurable; MTTF usually is not
       • Component MTTFs -> tens of years
       • Software MTTF ceiling -> ~30 yrs (Gray, HDCC 01)
       • Result: “measuring” MTTF requires hundreds of system-years
       • But MTTRs are minutes to hours, even for complex SW components
     • MTTR more directly captures the impact of a specific outage
     • Very low MTTR (~10 seconds) is achievable with redundancy and failover
       • Keeps response time below the user’s threshold of distraction [Miller 1968, Bhatti et al. 2001, Zona Research 1999]

  19. Degraded Service vs. Outage
     • How about longer MTTRs (minutes or hours)?
     • Can the service be designed so that “short” outages appear to users as temporary degradation instead?
     • How much degradation will users tolerate?
       • For how long? (until they abandon the site because it feels like a true outage - abandonment can be measured)
       • How frequently?
     • Even if the above thresholds can be deduced, how do we design the service so that transient failures can be mapped onto degraded quality?

  20. Examples of degraded service
     • Goal: derive a set of service “primitives” that directly reflect parameterizable degradation due to transient failure (“theory” is too strong…)
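
As one hypothetical example of such a primitive (not from the talk): a read path that trades freshness for availability by serving a bounded-staleness cached answer when the backend is down, with the staleness budget as the degradation parameter. The names and the 300-second default below are assumptions.

```python
# A hypothetical degradation primitive: bounded-staleness reads. The
# names and the 300 s default are illustrative, not from the talk.

import time

class StaleOkCache:
    def __init__(self, fetch_live, max_staleness_s=300.0):
        self.fetch_live = fetch_live             # authoritative source (may fail)
        self.max_staleness_s = max_staleness_s   # the degradation knob
        self.value, self.stamp = None, 0.0

    def get(self):
        try:
            self.value, self.stamp = self.fetch_live(), time.time()
            return self.value, "fresh"
        except Exception:
            # Transient backend failure: degrade rather than fail, but
            # only within the staleness budget the service agreed to.
            if self.value is not None and time.time() - self.stamp < self.max_staleness_s:
                return self.value, "stale"       # degraded quality, not an outage
            raise                                # budget exhausted: a real outage

cache = StaleOkCache(fetch_live=lambda: {"items": 42})
print(cache.get())   # ({'items': 42}, 'fresh')
```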

  21. Two Frequently Asked Questions
     • Is ROC the same as autonomic computing™?
     • Are you saying we should build lousy hardware and software and mask all those failures with ROC mechanisms?

  22. 1. Does ROC == autonomic computing?
     • Self-administering?
       • For now, the focus is on empowering administrators, not eliminating them
       • Humans are good at detecting and learning from their own mistakes, so why not exploit that? (avoiding the automation irony)
       • We’re not sure we understand sysadmins’ current techniques well enough to think about automation
     • Self-healing, self-reprovisioning, self-load-balancing…?
       • Sure - Web services and datacenters already do this in many situations; many techniques and tools are “well known”
       • But do we know how (a “theory”) to design the application software to make these techniques possible?
       • Digital immune system - it’s in WinXP

  23. 2. What ROC is not
     • We do not advocate…
       • producing buggy software
       • building lousy hardware
       • slacking on design, testing, or careful administration
       • discarding existing useful techniques or tools
     • We do advocate…
       • an increased focus on lowering MTTR specifically
       • increased examination of when some guarantees can be traded for lower MTTR
       • systematic exploration of “design for fast recovery” in the context of a variety of applications
       • stealing great ideas from systems, Internet protocols, psychology, and safety-critical systems design

  24. Summary: ROC and Online Services
     • Current software realities lead to new foci
       • Rapid evolution => traditional fault-tolerance methodologies are difficult to apply
       • Human error is inevitable, but humans are good at identifying their own errors => provide facilities that allow recovery from these
       • HW and SW failures are inevitable => use redundancy and a designed-in ability to substitute temporary degradation for outages (“design for recovery”)
     • Trying to stay relevant via direct contact with designers/operators of large systems
       • Need real data on how large systems fail
       • Need real data on how different kinds of failures are perceived by users

  25. Interested in ROCing?
     • Are you willing to anonymously share failure data?
       • Already great relationships (and in some cases data-sharing agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle, Tellme, Yahoo!, and others
     • See http://roc.stanford.edu or http://roc.cs.berkeley.edu for publications, talks, research areas, etc.
     • Contact Armando Fox (fox@cs.stanford.edu) or Dave Patterson (patterson@cs.berkeley.edu)

  26. Discussion Question
     • [For discussion] So what if you pick the low-hanging fruit? The challenge is in reaching the highest leaves.
