Recovery-Oriented Computing

Recovery-Oriented Computing. Dave Patterson and Aaron Brown University of California at Berkeley {patterson,abrown}@cs.berkeley.edu In cooperation with Armando Fox, Stanford University fox@cs.stanford.edu http://roc.CS.Berkeley.EDU/ December 2001.

Presentation Transcript


  1. Recovery-Oriented Computing Dave Patterson and Aaron Brown University of California at Berkeley {patterson,abrown}@cs.berkeley.edu In cooperation with Armando Fox, Stanford University fox@cs.stanford.edu http://roc.CS.Berkeley.EDU/ December 2001

  2. The past: goals and assumptions of last 15 years • Goal #1: Improve performance • Goal #2: Improve performance • Goal #3: Improve cost-performance • Assumptions • Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair) • Software will eventually be bug free (good programmers write bug-free code, debugging works) • Hardware MTBF is already very large (~100 years between failures), and will continue to increase

  3. Today, after 15 years of improving performance • Availability is now the vital metric for servers • near-100% availability is becoming mandatory • for e-commerce, enterprise apps, online services, ISPs • but service outages are frequent • 65% of IT managers report that their websites were unavailable to customers over a 6-month period • 25%: 3 or more outages • outage costs are high • social effects: negative press, loss of customers who “click over” to competitor • $500,000 to $5,000,000 per hour in lost revenues Source: InternetWeek 4/3/2000

  4. New goals: ACME • Availability • 24x7 delivery of service to users • Change • support rapid deployment of new software, apps, UI • Maintainability • reduce burden on system administrators (cost of ownership ~5X cost of purchase) • provide helpful, forgiving sysadmin environments • Evolutionary Growth • allow easy system expansion over time without sacrificing availability or maintainability

  5. Where does ACME stand today? • Availability: failures are common • Traditional fault-tolerance doesn’t solve the problems • Change • In back-end system tiers, software upgrades difficult, failure-prone, or ignored • For application service over WWW, daily change • Maintainability • human operator error is single largest failure source? • system maintenance environments are unforgiving • Evolutionary growth • 1U-PC cluster front-ends scale, evolve well • back-end scalability still limited

  6. Improving recovery/repair improves availability • Unavailability = MTTR / MTTF • 1/10th MTTR just as valuable as 10X MTBF (assuming MTTR much less than MTTF) Recovery-Oriented Computing Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres) • Failures are a fact, and recovery/repair is how we cope with them • Since major Sys Admin job is recovery after failure, ROC also helps with maintenance • If necessary, start with clean slate, sacrifice disk space and performance for ACME
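The claim that a tenfold cut in MTTR is worth as much as a tenfold gain in MTTF can be checked with a few lines of arithmetic. The sketch below uses the slide's approximation alongside the exact steady-state form MTTR / (MTTF + MTTR); the hour values are made-up illustrations, not figures from the talk.

```python
def approx_unavailability(mttr_hours, mttf_hours):
    """Slide's approximation, valid when MTTR is much less than MTTF."""
    return mttr_hours / mttf_hours


def exact_unavailability(mttr_hours, mttf_hours):
    """Steady-state fraction of time the system is down."""
    return mttr_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only to illustrate the ratio.
MTTR = 10.0       # hours to repair
MTTF = 10_000.0   # hours between failures

baseline = approx_unavailability(MTTR, MTTF)
faster_repair = approx_unavailability(MTTR / 10, MTTF)   # 1/10th MTTR
rarer_failure = approx_unavailability(MTTR, MTTF * 10)   # 10X MTTF

print(f"baseline:       {baseline:.6f}")
print(f"1/10th MTTR:    {faster_repair:.6f}")
print(f"10X MTTF:       {rarer_failure:.6f}")
print(f"exact baseline: {exact_unavailability(MTTR, MTTF):.6f}")
```

Both improvements reduce unavailability by the same factor of ten, which is the slide's argument for investing in faster recovery when failures cannot be made much rarer.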

  7. ROC Approach • Change system administration environment to make it forgiving • Develop ACME benchmarks to test old systems and new ideas to measure improvement • Fastest to Recover from Failures v. Fastest on SPEC • Work with companies to get real data on failure causes and patterns, feedback on approach • Cluster technology that makes it possible to partition systems, insert faults, and test outputs

  8. One idea: Undo for Sysadmin • Major goal of ROC is to provide an Undo for system administration • to create an environment that forgives operator error • to let sysadmins fix latent errors even after they’re manifested • this is no ordinary word processor undo! • The Three R’s: undo meets time travel • Rewind: roll system state backwards in time • Repair: fix latent or active error • automatically or via human intervention • Redo: roll system state forward, replaying user interactions lost during rewind
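The Rewind/Repair/Redo cycle can be sketched with a checkpoint plus an operation log. This is a toy illustration, not the ROC project's design: the `ThreeRStore` class, the operation tuples, and the mailbox names are all hypothetical, and the sketch ignores dependencies between operations and effects that have already left the system.

```python
import copy


def apply_op(state, op):
    """Apply one logged operation to the state dictionary in place."""
    kind = op[0]
    if kind == "set":
        _, key, value = op
        state[key] = value
    elif kind == "delete_all":
        state.clear()
    else:
        raise ValueError(f"unknown operation: {op!r}")


class ThreeRStore:
    """Toy state store with a single checkpoint and a full operation log."""

    def __init__(self):
        self.state = {}
        self.log = []
        self.checkpoint = ({}, 0)  # (state snapshot, log position)

    def execute(self, op):
        self.log.append(op)
        apply_op(self.state, op)

    def take_checkpoint(self):
        self.checkpoint = (copy.deepcopy(self.state), len(self.log))

    def rewind(self):
        """Rewind: restore the snapshot, return the operations rolled back."""
        snapshot, position = self.checkpoint
        self.state = copy.deepcopy(snapshot)
        pending = self.log[position:]
        self.log = self.log[:position]
        return pending

    def redo(self, pending):
        """Redo: replay the (possibly repaired) operations in order."""
        for op in pending:
            self.execute(op)


store = ThreeRStore()
store.execute(("set", "alice", "mailbox-1"))
store.take_checkpoint()
store.execute(("delete_all",))               # operator mistake
store.execute(("set", "bob", "mailbox-2"))   # user work after the mistake

pending = store.rewind()                                     # Rewind
repaired = [op for op in pending if op[0] != "delete_all"]   # Repair
store.redo(repaired)                                         # Redo
print(store.state)
```

After the cycle the store holds both mailboxes: the mistaken command is gone, while the user interaction that followed it has been replayed.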

  9. Discussion Topics • Focus on recovery: appropriate? • How much of a problem is human error? • Undo: would it be useful? • How hard is it to diagnose problems? • Benchmarks to evaluate progress in this area? • Future events on this topic?

  10. Backup Slides for Questions

  11. Discussion Topics (long version) • Focus on recovery: appropriate? • is it an important part of what you do? • what other parts of the sysadmin experience (for Internet services) should we look into? • How much of a problem is human error? • are there more user errors or sysadmin errors? • Undo: would it be useful? • how far back in time would it need to go? • should it be exported to users, e.g. undo for email? • How hard is it to diagnose problems? • would you use or trust automated diagnosis tools?

  12. Discussion Topics (long version), p2 • Benchmarks to evaluate progress in this area • measuring dependability of human-operated systems • how to measure sysadmin satisfaction with system? • Future events on this topic • would you come to a workshop at the next LISA? • what do you think about a hands-on workshop where we spend the morning using various systems then spend the afternoon analyzing and discussing results?

  13. ROC Enabler: ACME benchmarks • Traditional benchmarks focus on performance • assume perfect hardware, software, human operators • New benchmarks needed to drive progress toward ACME, evaluate ROC success • How else convince skeptics to adopt new technology? • Need workload of typical failures [Figure residue; labels: normal behavior (99% conf.), QoS degradation, failure, Repair Time]
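The labels that survive from this slide's figure (normal behavior at 99% confidence, QoS degradation, repair time) suggest one way to score an availability benchmark run. The following is a minimal sketch under stated assumptions: the band is taken as mean plus or minus 2.576 sample standard deviations of fault-free measurements (which presumes roughly normal samples), repair time is the delay until QoS first re-enters that band, and all sample values are invented.

```python
import statistics

Z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution


def normal_band(baseline_samples):
    """Return (low, high) bounds of 'normal behavior' from fault-free runs."""
    mean = statistics.mean(baseline_samples)
    spread = Z_99 * statistics.stdev(baseline_samples)
    return mean - spread, mean + spread


def repair_time(samples, fault_index, band, interval_s):
    """Seconds from fault injection until QoS first returns to the band.

    Returns None if QoS never left the band or never came back
    within the recorded trace.
    """
    low, high = band
    degraded = False
    for i in range(fault_index, len(samples)):
        inside = low <= samples[i] <= high
        if not inside:
            degraded = True
        elif degraded:
            return (i - fault_index) * interval_s
    return None


# Hypothetical requests-per-second samples, one every 10 seconds.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
trace = [100, 101, 99, 60, 55, 70, 85, 98, 100]
FAULT_INDEX = 3  # sample at which the fault was injected

band = normal_band(baseline)
worst_qos = min(trace[FAULT_INDEX:])
seconds_to_repair = repair_time(trace, FAULT_INDEX, band, interval_s=10)
print(band, worst_qos, seconds_to_repair)
```

A real benchmark would repeat this over a workload of typical failures and report the distribution of degradation and repair times rather than a single trace.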

  14. Tools for Recovery: Detection • System enables input insertion, output check of all modules (including fault insertion) • To check module sanity to find failures faster • To test correctness of recovery mechanisms • insert (random) faults and known-incorrect inputs • also enables availability benchmarks • To expose & remove latent errors from system • To train/expand experience of operator • Periodic reports to management on skills • To discover if warning systems are broken
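Fault insertion as a test of detection can be illustrated with a small harness that corrupts known-good inputs and counts how many corruptions a module's sanity check flags. Everything here is hypothetical: the record layout, the additive checksum, and the function names are stand-ins, not part of any ROC tool.

```python
import random


def make_record(payload):
    """Build a known-good record with a simple additive checksum."""
    return {"payload": list(payload), "checksum": sum(payload) % 256}


def checksum_ok(record):
    """Module sanity check: payload must match the stored checksum."""
    return sum(record["payload"]) % 256 == record["checksum"]


def corrupt(record, rng):
    """Insert a fault by incrementing one randomly chosen payload byte."""
    bad = {"payload": list(record["payload"]), "checksum": record["checksum"]}
    i = rng.randrange(len(bad["payload"]))
    bad["payload"][i] = (bad["payload"][i] + 1) % 256
    return bad


def detection_rate(detector, records, rng):
    """Fraction of injected faults that the detector reports as bad."""
    caught = sum(1 for r in records if not detector(corrupt(r, rng)))
    return caught / len(records)


records = [make_record([1, 2, 3]), make_record([10, 20, 30, 40]),
           make_record([255, 0])]
rng = random.Random(0)  # fixed seed so runs are repeatable

working = detection_rate(checksum_ok, records, rng)
broken = detection_rate(lambda record: True, records, rng)  # silent detector
print(f"working detector: {working:.0%}, broken detector: {broken:.0%}")
```

The second call models a warning system that has silently stopped working: it accepts every input, so its detection rate drops to zero, which is exactly the condition that periodic fault insertion is meant to expose.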

  15. Repairing the Past (2) • 3 cases needing Undo • reverse the effects of a mistyped command (rm -rf *) • roll back a software upgrade without losing user data • “go back in time” to retroactively install virus filter on email server; effects of virus are squashed on redo • The 3 R’s vs. checkpointing, reboot, logging • checkpointing gives Rewind only • reboot may give Repair, but only for “Heisenbugs” • logging can give all 3 R’s • but need more than RDBMS logging, since system state changes are interdependent and non-transactional • 3R-logging requires careful dependency tracking, and attention to state granularity and externalized events

  16. Organizations we’re working with • Traditional Computer Companies: HP, IBM, Microsoft, ... • Internet companies: Amazon, Google, Yahoo, ... • Startup companies (wouldn’t recognize names) • Stanford University (Armando Fox) • E.g., CS 294-4 class • Visit them + twice a year, 3-day retreats at Lake Tahoe to review progress, get new insights, build team spirit, have fun, ...

  17. Systems used for ROC • ROC-I: Cluster of 64 PCs modified with ability for HW isolation, fault insertion, monitoring • Cluster of 40 IBM PCs: each with 2 GB DRAM, 1 gigabit Ethernet, gigabit switch, HW monitor, each running VMware virtual machine monitor (software layer)

  18. Why ROC? • Working on relevant, important systems topic with no jeopardy from near-term products • New, exciting research field, so lots of opportunities • Interacting with many interesting companies, research labs, Stanford University • Contact Dave Patterson (patterson@cs) or Aaron Brown (abrown@cs) or David Oppenheimer (davidopp@cs) • See http://ROC.cs.berkeley.edu