Designing For Failure
Stanford University CS 444A, Autumn 99
Software Development for Critical Applications
Armando Fox & David Dill
{fox,dill}@cs.stanford.edu
Outline • User expectations and failure semantics • Orthogonal mechanisms (again) • Case studies • TACC • Microsoft Tiger video server • Cirrus banking network • TWA flight reservations system • Mars Pathfinder • Lessons
Designing For Failure: Philosophy • Start with some “givens” • Hardware does fail • Latent bugs do occur • Nondeterministic/hard-to-reproduce bugs will happen • Requirements: • Maintain availability (possibly with degraded performance) when these things happen • Isolate faults so they don’t bring whole system down • Question: • What is “availability” from end user’s point of view? • Specifically…what constitutes correct/acceptable behavior?
User Expectations & Failure Semantics • Determine expected/common failure modes • Find “cheap” mechanisms to address them • Analysis: Invariants are your friends • Performance: keep the common case fast • Robustness: minimize dependencies on other components to avoid “domino effect” • Determine effects on performance & semantics • Unexpected/unsupported failure modes?
Example Cheap Mechanism: BASE • Best-effort Availability (like the Internet!) • Soft State (if you can tolerate temporary state loss) • Eventual Consistency (often fine in practice) • When is BASE compatible with user-perceived application semantics? • When is availability more important? (Your client-side Netscape cache? Source code library?) • When is consistency more important? (Your bank account? User profiles?) • Example: “Reload semantics”
Example: NFS (using BASE badly) • Original philosophy: “Stateless is good” • Implementation revealed: performance was bad • Caching retrofitted as a hack • Attribute cache for stat() • Stateless soft state! • Network file locking added later • Inconsistent lock state if server or client crashes User’s view of data semantics wasn’t considered from the get-go. Result: brain-dead semantics.
Example: Multicast/SRM • Expected failure: periodic router death/partition • Cheap solution: • Each router stores only local mcast routing info • Soft state with periodic refreshes • Absence of “reachability beacons” used to infer failure of upstream nodes; can look for another • Receiving duplicate refreshes is idempotent • Unsupported failure mode: frequent or prolonged router failure
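A minimal sketch of the soft-state idea behind this (invented names and constants, Python, not actual router code): entries survive only as long as they keep being refreshed, duplicate refreshes are idempotent, and the absence of refreshes is how upstream failure is inferred.

```python
import time

REFRESH_INTERVAL = 5.0          # how often neighbors re-announce (assumed value)
EXPIRY = 3 * REFRESH_INTERVAL   # missing ~3 refreshes => infer failure

class SoftStateRoutes:
    """Soft-state multicast routing table: entries live only as long as
    they keep being refreshed; duplicate refreshes simply reset the timer."""

    def __init__(self):
        self.routes = {}   # group -> (upstream_neighbor, last_refresh_time)

    def refresh(self, group, upstream):
        # Idempotent: installing the same route twice is the same as once.
        self.routes[group] = (upstream, time.time())

    def expire_stale(self):
        now = time.time()
        for group, (upstream, last) in list(self.routes.items()):
            if now - last > EXPIRY:
                # No refresh for a while: infer the upstream node failed or
                # was partitioned; drop the entry and look for another path.
                del self.routes[group]
```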
Example: TACC Platform • Expected failure mode: Load balancer death (or partition) • Cheap solution: LB state is all soft • LB periodically beacons its existence & location • Workers connect to it and send periodic load reports • If LB crashes, state is restored within a few seconds • Effect on data semantics • Cached, stale state in the front ends (FEs) allows temporary operation during LB restart • Empirically and analytically, this is not so bad
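A toy sketch of the same pattern on the load balancer side (hypothetical names and constants, not the real TACC code): every piece of balancer state can be rebuilt from the periodic load reports, so a crash costs only a few report intervals of staleness.

```python
import time

REPORT_INTERVAL = 1.0   # workers report their load this often (assumed)
STALE_AFTER = 5.0       # drop a worker that has gone quiet (assumed)

class LoadBalancer:
    """All balancer state is soft: a freshly restarted LB rebuilds its
    worker table from the load reports within a few report intervals."""

    def __init__(self):
        self.workers = {}   # worker_id -> (load, last_report_time)

    def on_load_report(self, worker_id, load):
        self.workers[worker_id] = (load, time.time())

    def pick_worker(self):
        now = time.time()
        live = {w: load for w, (load, t) in self.workers.items()
                if now - t < STALE_AFTER}
        if not live:
            return None                      # just restarted: wait for reports
        return min(live, key=live.get)       # least-loaded live worker
```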
Example: TACC Workers • Expected failure: Worker death (or partition) • Cheap solution: • Invariant: workers are restartable • Use retry count to handle pathological inputs • Effect on system behavior/semantics • Temporary extra latency for retry • Unexpected failure mode: livelock! • Timers retrofitted; what should be the behavior on timeout? (Kill request, or kill worker?) • User-perceived effects?
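A sketch of the restart-and-retry invariant with a retrofitted timer (assumed API and constants, not the actual TACC dispatcher): each attempt runs in a disposable process, a hung attempt is killed on timeout (kill the worker, not the request), and a bounded retry count guards against pathological inputs.

```python
import multiprocessing as mp
import queue

MAX_RETRIES = 2        # give up on "pathological" inputs after a few tries
TIMEOUT_SECS = 10.0    # retrofitted timer: bound each attempt

class WorkerFailure(Exception):
    pass

def _attempt(work_fn, item, q):
    q.put(work_fn(item))               # runs inside a child process

def run_worker_once(work_fn, item, timeout):
    """One attempt in a separate process, so a hung worker can be killed."""
    q = mp.Queue()
    p = mp.Process(target=_attempt, args=(work_fn, item, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()                  # livelock/hang: kill the worker only
        p.join()
        raise WorkerFailure("timeout")
    if p.exitcode != 0:
        raise WorkerFailure("worker crashed")
    try:
        return q.get(timeout=1)
    except queue.Empty:
        raise WorkerFailure("no result produced")

def dispatch(work_fn, item):
    for _ in range(MAX_RETRIES + 1):
        try:
            return run_worker_once(work_fn, item, TIMEOUT_SECS)
        except WorkerFailure:
            continue                   # user sees extra latency, not a failure
    raise WorkerFailure("giving up: probably a pathological input")
```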
Example: MS Tiger Video Fileserver • ATM-connected disks “walk down” static schedule • “Coherent hallucination”: global schedule, but each disk only knows its local piece • Pieces are passed around ring, bucket-brigade style [Diagram: cubs 0-4 on a WAN ATM interconnect, with the schedule advancing over time]
Example: Tiger, cont’d. • Once failure is detected, bypass failed cub • Why not just detect the failure first, then pass the schedule info to the successor? • Because if a cub fails, its successor is already prepared to take over that schedule slot [Diagram: cubs 0-4 on the WAN ATM interconnect, with traffic routed around the failed cub]
Example: Tiger, cont’d. • Failed cub permanently removed from ring [Diagram: cubs 0-4 on the WAN ATM interconnect, with the failed cub dropped out of the ring]
Example: Tiger, cont’d. • Expected failure mode: death of a cub (node) • Cheap solution: • All files are mirrored using a decluster factor • Cub passes its chunk of the schedule to both its successor and second successor • Effect on performance/semantics • No user-visible latency to recover from failure! • What’s the cost of this? • What would be an analogous TACC mechanism?
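A toy model of the handoff rule (invented classes, not Microsoft's code): each cub hands its schedule slot to its successor and second successor ahead of time, so a dead cub's successor can serve the declustered mirror copy with no recovery pause.

```python
class Cub:
    def __init__(self, cub_id):
        self.cub_id = cub_id
        self.alive = True
        self.slots = set()       # schedule slots this cub is prepared to serve

def pass_schedule(ring, slot, owner_idx):
    """Owner hands its slot to successor AND second successor in advance."""
    n = len(ring)
    for offset in (0, 1, 2):
        ring[(owner_idx + offset) % n].slots.add(slot)

def serve(ring, slot, owner_idx):
    """Serve a slot: the owner if alive, otherwise a successor that is
    already prepared (it has the slot and the mirrored block)."""
    n = len(ring)
    for offset in (0, 1, 2):
        cub = ring[(owner_idx + offset) % n]
        if cub.alive and slot in cub.slots:
            return cub.cub_id
    raise RuntimeError("primary and mirror both lost: user-visible frame loss")

ring = [Cub(i) for i in range(5)]
pass_schedule(ring, slot="block-17", owner_idx=2)
ring[2].alive = False                          # cub 2 dies
print(serve(ring, "block-17", owner_idx=2))    # -> 3, with no recovery latency
```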
Example: Tiger, cont’d. • Unsupported failure modes • Interconnect failure • Failure of both primary and secondary copies of a block • User-visible effect: frame loss or complete hosage • Notable differences from TACC • Schedule is global but not centralized (coherent hallucination) • State is not soft but not really “durable” either (probabilistically hard? Asymptotically hard?)
Example: CIRRUS Banking Network • CIRRUS network is a “smart switch” that runs on Tandem NonStop nodes (deployed early 80’s) • 2-phase commit for cash withdrawal: 1. Withdrawal request from ATM 2. “xact in progress” logged at Bank 3. [Commit point] cash dispensed, confirmation sent to Bank 4. Ack from bank • Non-obvious failure mode #1 after commit point: • CIRRUS switch resends #3 until reply received • If reply indicates error, manual cleanup needed
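A simplified sketch of the withdrawal flow (stub ATM/Bank classes invented for illustration, not the CIRRUS protocol code): once the commit point is passed the cash is already out, so the only options are to keep resending the confirmation or to escalate to manual cleanup.

```python
import time

MAX_RESENDS = 5          # assumed bound before escalating to manual cleanup

class ATM:
    def request_withdrawal(self, amount):
        return {"amount": amount}
    def dispense_cash(self, amount):
        print(f"dispensing {amount}")

class Bank:
    def __init__(self):
        self.log = []
    def log_in_progress(self, req):
        self.log.append(("in-progress", req))
    def confirm(self, req):
        self.log.append(("confirmed", req))
        return "ack"                       # could also be "error" or None (lost)

def withdraw(atm, bank, amount):
    req = atm.request_withdrawal(amount)   # 1. withdrawal request from ATM
    bank.log_in_progress(req)              # 2. "xact in progress" logged at bank
    atm.dispense_cash(amount)              # 3. COMMIT POINT: cash is already out
    # Past the commit point there is no abort: keep resending the
    # confirmation; an error reply means manual (out-of-band) cleanup.
    for _ in range(MAX_RESENDS):
        reply = bank.confirm(req)          # resend #3 until some reply arrives
        if reply == "ack":
            return "done"                  # 4. ack from bank
        if reply == "error":
            return "manual-cleanup"        # adjustment record + human follow-up
        time.sleep(1)                      # no reply yet: retry after a pause
    return "manual-cleanup"

print(withdraw(ATM(), Bank(), 100))
```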
ATM Example, continued • Non-obvious failure mode #2: Cash is dispensed, but CIRRUS switch never sees message #3 from ATM • Notify bank: cash was not dispensed • “Reboot” ATM-to-CIRRUS net • Query net log to see if net thinks #3 was sent; if so, and #3 arrived before end of business day, create Adjustment Record at both sides • Otherwise, resolve out-of-band using physical evidence of xact (tape logs, video cam, etc) • Role of end-user semantics for picking design point for auto vs. manual recovery (end of business day)
ATM and TWA Reservations Example • Non-obvious failure mode #3: malicious ATM fakes message #3. • Not caught or handled! “In practice, this won’t go unnoticed in the banking industry and/or by our customers.” • TWA Reservations System (~1985) • “Secondary” DB of reservations sold; points into “primary” DB of complete seating inventory • Dangling pointers and double-bookings resolved offline • Explicitly trades availability for throughput: “...a 90% utilization rate would make it impossible for us to log all our transactions”
“What Really Happened On Mars” • Sources: various posts to Risks Digest, including one from CTO of WindRiver Systems, which makes the VxWorks operating system that controls the Pathfinder. • Background concepts • Threads and mutual exclusion • Mutexes and priority inheritance • Priority inversion • How the failure was diagnosed and fixed • Lessons
Background: threads and mutexes • Threads: independent flows of control, typically within a single program, that usually share some data or resources • Can be used for performance, program structure, or both • Threads can have different priorities for sharing resources (CPU, memory, etc) • Thread scheduler (dozens of algorithms) enforces the priorities • Mutex: a lock (data structure) that enforces one-at-a-time access to a shared resource • In this case, a systemwide data bus • Usage: to exclusively access a shared resource, you acquire the mutex, do your thing, then release the mutex
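A minimal illustration of the usage pattern described above (Python threads standing in for VxWorks tasks): acquire the mutex, do your thing on the shared resource, release.

```python
import threading

bus_mutex = threading.Lock()     # one-at-a-time access to the shared "bus"
bus_log = []                     # the shared resource

def use_bus(thread_name, message):
    # Acquire the mutex, do your thing, then release (the `with` block
    # guarantees the release even if an exception is raised).
    with bus_mutex:
        bus_log.append((thread_name, message))

threads = [threading.Thread(target=use_bus, args=(f"t{i}", "hello"))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(bus_log)
```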
Background: priority inversion • Priority inheritance: a thread holding a mutex temporarily “inherits” the priority of the highest-priority thread waiting for that mutex • So, if a high-priority thread holds it (or is waiting for it), the mutex-protected work runs at high priority • Priority inversion can happen when this isn’t done: • Low-priority, infrequent thread A grabs mutex M • High-priority thread B needs to access data protected by M • Result: even though B has higher priority than A, it has to wait a long time since A holds M • This example is obvious; in practice, priority inversion is usually subtler
What Really Happened • Dramatis personae • Low-priority thread A: infrequent, short-running meteorological data collection, using bus mutex • High-priority thread B: bus manager, using bus mutex • Medium-priority thread C: long-running communications task (that doesn’t need the mutex) • Priority inversion scenario • A is scheduled, and grabs bus mutex • B is scheduled, and blocks waiting for A to release mutex • C is scheduled while B is waiting for mutex • C has higher priority than A, so it prevents A from running (and therefore B as well) • Watchdog timer notices B hasn’t run, concludes something is wrong, reboots
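A toy tick-based scheduler (thread names and priorities follow the slide; everything else is invented) that reproduces the scenario: without priority inheritance, C starves A, B never completes, and the watchdog fires; with inheritance, A is boosted while B waits, finishes quickly, and B runs.

```python
class T:
    def __init__(self, name, prio, arrives, work):
        self.name, self.prio, self.arrives, self.work = name, prio, arrives, work
        self.pc = 0
    def done(self):
        return self.pc >= len(self.work)

def simulate(priority_inheritance, ticks=10):
    A = T("A (meteo data)",  1, 0, ["lock"] + ["cpu"] * 3 + ["unlock"])
    B = T("B (bus manager)", 3, 1, ["lock", "cpu", "unlock"])
    C = T("C (comms)",       2, 2, ["cpu"] * 8)
    threads, owner, trace = [A, B, C], [None], []

    def waiting_on_mutex(t):
        return (not t.done()) and t.work[t.pc] == "lock" and owner[0] not in (None, t)

    def eff_prio(t):
        p = t.prio
        if priority_inheritance and owner[0] is t:
            waiters = [w.prio for w in threads if waiting_on_mutex(w)]
            p = max([p] + waiters)           # holder inherits waiters' priority
        return p

    def try_run(t):
        while not t.done():
            step = t.work[t.pc]
            if step == "lock":
                if owner[0] not in (None, t):
                    return False             # blocked: mutex held by someone else
                owner[0] = t; t.pc += 1
            elif step == "unlock":
                owner[0] = None; t.pc += 1
            else:                            # "cpu": consume this tick
                t.pc += 1
                while not t.done() and t.work[t.pc] == "unlock":
                    owner[0] = None; t.pc += 1
                return True
        return False

    for tick in range(ticks):
        ready = [t for t in threads if t.arrives <= tick and not t.done()]
        for t in sorted(ready, key=eff_prio, reverse=True):
            if try_run(t):
                trace.append(t.name[0])
                break
    return "".join(trace), "OK" if B.done() else "WATCHDOG RESET"

print(simulate(priority_inheritance=False))  # ('AACCCCCCCC', 'WATCHDOG RESET')
print(simulate(priority_inheritance=True))   # ('AAABCCCCCC', 'OK')
```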
Lessons They Learned • Extensive logging facilities (trace of system events) allowed diagnosis • Debugging facility (ability to execute C commands that tweak the running system…kind of like gdb) allowed problem to be fixed • Debugging facility was intended to be turned off prior to production use • Similar facility in certain commercial CPU designs!
Lessons We Should Learn • Complexity has pitfalls • Concurrency, race conditions, and mutual exclusion are hard to debug • Ousterhout. Why Threads Are a Bad Idea (For Most Purposes); Savage et al., Eraser • C.A.R. Hoare: “The unavoidable price of reliability is simplicity.” • It’s more important to get it extensible than to get it right • Even in the hardware world! • Knowledge of end-to-end semantics can be critical • Whither “transparent” OS-level mechanisms?