Designing For Failure
Stanford University CS 444A, Autumn 99
Software Development for Critical Applications
Armando Fox & David Dill
{fox,dill}@cs.stanford.edu
Outline • User expectations and failure semantics • Orthogonal mechanisms (again) • Case studies • TACC • Microsoft Tiger video server • Cirrus banking network • TWA flight reservations system • Mars Pathfinder • Lessons
Designing For Failure: Philosophy • Start with some “givens” • Hardware does fail • Latent bugs do occur • Nondeterministic/hard-to-reproduce bugs will happen • Requirements: • Maintain availability (possibly with degraded performance) when these things happen • Isolate faults so they don’t bring whole system down • Question: • What is “availability” from end user’s point of view? • Specifically…what constitutes correct/acceptable behavior?
User Expectations & Failure Semantics • Determine expected/common failure modes • Find “cheap” mechanisms to address them • Analysis: Invariants are your friends • Performance: keep the common case fast • Robustness: minimize dependencies on other components to avoid “domino effect” • Determine effects on performance & semantics • Unexpected/unsupported failure modes?
Example Cheap Mechanism: BASE • Best-effort Availability (like the Internet!) • Soft State (if you can tolerate temporary state loss) • Eventual Consistency (often fine in practice) • When is BASE compatible with user-perceived application semantics? • When is availability more important? (Your client-side Netscape cache? Source code library?) • When is consistency more important? (Your bank account? User profiles?) • Example: “Reload semantics”
Example: NFS (using BASE badly) • Original philosophy: “Stateless is good” • Implementation revealed: performance was bad • Caching retrofitted as a hack • Attribute cache for stat() • Stateless soft state! • Network file locking added later • Inconsistent lock state if server or client crashes User’s view of data semantics wasn’t considered from the get-go. Result: brain-dead semantics.
Example: Multicast/SRM • Expected failure: periodic router death/partition • Cheap solution: • Each router stores only local mcast routing info • Soft state with periodic refreshes • Absence of “reachability beacons” used to infer failure of upstream nodes; can look for another • Receiving duplicate refreshes is idempotent • Unsupported failure mode: frequent or prolonged router failure
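A minimal sketch of the soft-state idea behind this (invented names and constants, Python, not actual router code): entries survive only as long as they keep being refreshed, duplicate refreshes are idempotent, and the absence of refreshes is how upstream failure is inferred.

```python
import time

REFRESH_INTERVAL = 5.0          # how often neighbors re-announce (assumed value)
EXPIRY = 3 * REFRESH_INTERVAL   # missing ~3 refreshes => infer failure

class SoftStateRoutes:
    """Soft-state multicast routing table: entries live only as long as
    they keep being refreshed; duplicate refreshes simply reset the timer."""

    def __init__(self):
        self.routes = {}   # group -> (upstream_neighbor, last_refresh_time)

    def refresh(self, group, upstream):
        # Idempotent: installing the same route twice is the same as once.
        self.routes[group] = (upstream, time.time())

    def expire_stale(self):
        now = time.time()
        for group, (upstream, last) in list(self.routes.items()):
            if now - last > EXPIRY:
                # No refresh for a while: infer the upstream node failed or
                # was partitioned; drop the entry and look for another path.
                del self.routes[group]
```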
Example: TACC Platform • Expected failure mode: Load balancer death (or partition) • Cheap solution: LB state is all soft • LB periodically beacons its existence & location • Workers connect to it and send periodic load reports • If LB crashes, state is restored within a few seconds • Effect on data semantics • Cached, stale state in the front ends (FEs) allows temporary operation during LB restart • Empirically and analytically, this is not so bad
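A toy sketch of the same pattern on the load balancer side (hypothetical names and constants, not the real TACC code): every piece of balancer state can be rebuilt from the periodic load reports, so a crash costs only a few report intervals of staleness.

```python
import time

REPORT_INTERVAL = 1.0   # workers report their load this often (assumed)
STALE_AFTER = 5.0       # drop a worker that has gone quiet (assumed)

class LoadBalancer:
    """All balancer state is soft: a freshly restarted LB rebuilds its
    worker table from the load reports within a few report intervals."""

    def __init__(self):
        self.workers = {}   # worker_id -> (load, last_report_time)

    def on_load_report(self, worker_id, load):
        self.workers[worker_id] = (load, time.time())

    def pick_worker(self):
        now = time.time()
        live = {w: load for w, (load, t) in self.workers.items()
                if now - t < STALE_AFTER}
        if not live:
            return None                      # just restarted: wait for reports
        return min(live, key=live.get)       # least-loaded live worker
```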
Example: TACC Workers • Expected failure: Worker death (or partition) • Cheap solution: • Invariant: workers are restartable • Use retry count to handle pathological inputs • Effect on system behavior/semantics • Temporary extra latency for retry • Unexpected failure mode: livelock! • Timers retrofitted; what should be the behavior on timeout? (Kill request, or kill worker?) • User-perceived effects?
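A sketch of the restart-and-retry invariant with a retrofitted timer (assumed API and constants, not the actual TACC dispatcher): each attempt runs in a disposable process, a hung attempt is killed on timeout (kill the worker, not the request), and a bounded retry count guards against pathological inputs.

```python
import multiprocessing as mp
import queue

MAX_RETRIES = 2        # give up on "pathological" inputs after a few tries
TIMEOUT_SECS = 10.0    # retrofitted timer: bound each attempt

class WorkerFailure(Exception):
    pass

def _attempt(work_fn, item, q):
    q.put(work_fn(item))               # runs inside a child process

def run_worker_once(work_fn, item, timeout):
    """One attempt in a separate process, so a hung worker can be killed."""
    q = mp.Queue()
    p = mp.Process(target=_attempt, args=(work_fn, item, q))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()                  # livelock/hang: kill the worker only
        p.join()
        raise WorkerFailure("timeout")
    if p.exitcode != 0:
        raise WorkerFailure("worker crashed")
    try:
        return q.get(timeout=1)
    except queue.Empty:
        raise WorkerFailure("no result produced")

def dispatch(work_fn, item):
    for _ in range(MAX_RETRIES + 1):
        try:
            return run_worker_once(work_fn, item, TIMEOUT_SECS)
        except WorkerFailure:
            continue                   # user sees extra latency, not a failure
    raise WorkerFailure("giving up: probably a pathological input")
```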
Example: MS Tiger Video Fileserver • ATM-connected disks “walk down” static schedule • “Coherent hallucination”: global schedule, but each disk only knows its local piece • Pieces are passed around ring, bucket-brigade style [Diagram: cubs 0-4 on a WAN ATM interconnect, with the schedule advancing over time]
Example: Tiger, cont’d. • Once failure is detected, bypass failed cub • Why not just detect the failure first, then pass the schedule info to the successor? • Because if a cub fails, its successor is already prepared to take over that schedule slot [Diagram: cubs 0-4 on the WAN ATM interconnect, with traffic routed around the failed cub]
Example: Tiger, cont’d. • Failed cub permanently removed from ring [Diagram: cubs 0-4 on the WAN ATM interconnect, with the failed cub dropped out of the ring]
Example: Tiger, cont’d. • Expected failure mode: death of a cub (node) • Cheap solution: • All files are mirrored using a decluster factor • Cub passes its chunk of the schedule to both its successor and second successor • Effect on performance/semantics • No user-visible latency to recover from failure! • What’s the cost of this? • What would be an analogous TACC mechanism?
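A toy model of the handoff rule (invented classes, not Microsoft's code): each cub hands its schedule slot to its successor and second successor ahead of time, so a dead cub's successor can serve the declustered mirror copy with no recovery pause.

```python
class Cub:
    def __init__(self, cub_id):
        self.cub_id = cub_id
        self.alive = True
        self.slots = set()       # schedule slots this cub is prepared to serve

def pass_schedule(ring, slot, owner_idx):
    """Owner hands its slot to successor AND second successor in advance."""
    n = len(ring)
    for offset in (0, 1, 2):
        ring[(owner_idx + offset) % n].slots.add(slot)

def serve(ring, slot, owner_idx):
    """Serve a slot: the owner if alive, otherwise a successor that is
    already prepared (it has the slot and the mirrored block)."""
    n = len(ring)
    for offset in (0, 1, 2):
        cub = ring[(owner_idx + offset) % n]
        if cub.alive and slot in cub.slots:
            return cub.cub_id
    raise RuntimeError("primary and mirror both lost: user-visible frame loss")

ring = [Cub(i) for i in range(5)]
pass_schedule(ring, slot="block-17", owner_idx=2)
ring[2].alive = False                          # cub 2 dies
print(serve(ring, "block-17", owner_idx=2))    # -> 3, with no recovery latency
```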
Example: Tiger, cont’d. • Unsupported failure modes • Interconnect failure • Failure of both primary and secondary copies of a block • User-visible effect: frame loss or complete hosage • Notable differences from TACC • Schedule is global but not centralized (coherent hallucination) • State is not soft but not really “durable” either (probabilistically hard? Asymptotically hard?)
Example: CIRRUS Banking Network • CIRRUS network is a “smart switch” that runs on Tandem NonStop nodes (deployed early 80’s) • 2-phase commit for cash withdrawal: 1. Withdrawal request from ATM 2. “xact in progress” logged at Bank 3. [Commit point] cash dispensed, confirmation sent to Bank 4. Ack from bank • Non-obvious failure mode #1 after commit point: • CIRRUS switch resends #3 until reply received • If reply indicates error, manual cleanup needed
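A simplified sketch of the withdrawal flow (stub ATM/Bank classes invented for illustration, not the CIRRUS protocol code): once the commit point is passed the cash is already out, so the only options are to keep resending the confirmation or to escalate to manual cleanup.

```python
import time

MAX_RESENDS = 5          # assumed bound before escalating to manual cleanup

class ATM:
    def request_withdrawal(self, amount):
        return {"amount": amount}
    def dispense_cash(self, amount):
        print(f"dispensing {amount}")

class Bank:
    def __init__(self):
        self.log = []
    def log_in_progress(self, req):
        self.log.append(("in-progress", req))
    def confirm(self, req):
        self.log.append(("confirmed", req))
        return "ack"                       # could also be "error" or None (lost)

def withdraw(atm, bank, amount):
    req = atm.request_withdrawal(amount)   # 1. withdrawal request from ATM
    bank.log_in_progress(req)              # 2. "xact in progress" logged at bank
    atm.dispense_cash(amount)              # 3. COMMIT POINT: cash is already out
    # Past the commit point there is no abort: keep resending the
    # confirmation; an error reply means manual (out-of-band) cleanup.
    for _ in range(MAX_RESENDS):
        reply = bank.confirm(req)          # resend #3 until some reply arrives
        if reply == "ack":
            return "done"                  # 4. ack from bank
        if reply == "error":
            return "manual-cleanup"        # adjustment record + human follow-up
        time.sleep(1)                      # no reply yet: retry after a pause
    return "manual-cleanup"

print(withdraw(ATM(), Bank(), 100))
```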
ATM Example, continued • Non-obvious failure mode #2: Cash is dispensed, but CIRRUS switch never sees message #3 from ATM • Notify bank: cash was not dispensed • “Reboot” ATM-to-CIRRUS net • Query net log to see if net thinks #3 was sent; if so, and #3 arrived before end of business day, create Adjustment Record at both sides • Otherwise, resolve out-of-band using physical evidence of xact (tape logs, video cam, etc) • Role of end-user semantics for picking design point for auto vs. manual recovery (end of business day)
ATM and TWA Reservations Example • Non-obvious failure mode #3: malicious ATM fakes message #3. • Not caught or handled! “In practice, this won’t go unnoticed in the banking industry and/or by our customers.” • TWA Reservations System (~1985) • “Secondary” DB of reservations sold; points into “primary” DB of complete seating inventory • Dangling pointers and double-bookings resolved offline • Explicitly trades availability for throughput: “...a 90% utilization rate would make it impossible for us to log all our transactions”
“What Really Happened On Mars” • Sources: various posts to Risks Digest, including one from CTO of WindRiver Systems, which makes the VxWorks operating system that controls the Pathfinder. • Background concepts • Threads and mutual exclusion • Mutexes and priority inheritance • Priority inversion • How the failure was diagnosed and fixed • Lessons
Background: threads and mutexes • Threads: independent flows of control, typically within a single program, that usually share some data or resources • Can be used for performance, program structure, or both • Threads can have different priorities for sharing resources (CPU, memory, etc) • Thread scheduler (dozens of algorithms) enforces the priorities • Mutex: a lock (data structure) that enforces one-at-a-time access to a shared resource • In this case, a systemwide data bus • Usage: to exclusively access a shared resource, you acquire the mutex, do your thing, then release the mutex
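A minimal illustration of the usage pattern described above (Python threads standing in for VxWorks tasks): acquire the mutex, do your thing on the shared resource, release.

```python
import threading

bus_mutex = threading.Lock()     # one-at-a-time access to the shared "bus"
bus_log = []                     # the shared resource

def use_bus(thread_name, message):
    # Acquire the mutex, do your thing, then release (the `with` block
    # guarantees the release even if an exception is raised).
    with bus_mutex:
        bus_log.append((thread_name, message))

threads = [threading.Thread(target=use_bus, args=(f"t{i}", "hello"))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(bus_log)
```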
Background: priority inversion • Priority inheritance: a thread holding a mutex temporarily “inherits” the priority of the highest-priority thread waiting for that mutex • So, if a high-priority thread holds it (or is waiting for it), the mutex-protected work runs at high priority • Priority inversion can happen when this isn’t done: • Low-priority, infrequent thread A grabs mutex M • High-priority thread B needs to access data protected by M • Result: even though B has higher priority than A, it has to wait a long time since A holds M • This example is obvious; in practice, priority inversion is usually subtler
What Really Happened • Dramatis personae • Low-priority thread A: infrequent, short-running meteorological data collection, using bus mutex • High-priority thread B: bus manager, using bus mutex • Medium-priority thread C: long-running communications task (that doesn’t need the mutex) • Priority inversion scenario • A is scheduled, and grabs bus mutex • B is scheduled, and blocks waiting for A to release mutex • C is scheduled while B is waiting for mutex • C has higher priority than A, so it prevents A from running (and therefore B as well) • Watchdog timer notices B hasn’t run, concludes something is wrong, reboots
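A toy tick-based scheduler (thread names and priorities follow the slide; everything else is invented) that reproduces the scenario: without priority inheritance, C starves A, B never completes, and the watchdog fires; with inheritance, A is boosted while B waits, finishes quickly, and B runs.

```python
class T:
    def __init__(self, name, prio, arrives, work):
        self.name, self.prio, self.arrives, self.work = name, prio, arrives, work
        self.pc = 0
    def done(self):
        return self.pc >= len(self.work)

def simulate(priority_inheritance, ticks=10):
    A = T("A (meteo data)",  1, 0, ["lock"] + ["cpu"] * 3 + ["unlock"])
    B = T("B (bus manager)", 3, 1, ["lock", "cpu", "unlock"])
    C = T("C (comms)",       2, 2, ["cpu"] * 8)
    threads, owner, trace = [A, B, C], [None], []

    def waiting_on_mutex(t):
        return (not t.done()) and t.work[t.pc] == "lock" and owner[0] not in (None, t)

    def eff_prio(t):
        p = t.prio
        if priority_inheritance and owner[0] is t:
            waiters = [w.prio for w in threads if waiting_on_mutex(w)]
            p = max([p] + waiters)           # holder inherits waiters' priority
        return p

    def try_run(t):
        while not t.done():
            step = t.work[t.pc]
            if step == "lock":
                if owner[0] not in (None, t):
                    return False             # blocked: mutex held by someone else
                owner[0] = t; t.pc += 1
            elif step == "unlock":
                owner[0] = None; t.pc += 1
            else:                            # "cpu": consume this tick
                t.pc += 1
                while not t.done() and t.work[t.pc] == "unlock":
                    owner[0] = None; t.pc += 1
                return True
        return False

    for tick in range(ticks):
        ready = [t for t in threads if t.arrives <= tick and not t.done()]
        for t in sorted(ready, key=eff_prio, reverse=True):
            if try_run(t):
                trace.append(t.name[0])
                break
    return "".join(trace), "OK" if B.done() else "WATCHDOG RESET"

print(simulate(priority_inheritance=False))  # ('AACCCCCCCC', 'WATCHDOG RESET')
print(simulate(priority_inheritance=True))   # ('AAABCCCCCC', 'OK')
```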
Lessons They Learned • Extensive logging facilities (trace of system events) allowed diagnosis • Debugging facility (ability to execute C commands that tweak the running system…kind of like gdb) allowed problem to be fixed • Debugging facility was intended to be turned off prior to production use • Similar facility in certain commercial CPU designs!
Lessons We Should Learn • Complexity has pitfalls • Concurrency, race conditions, and mutual exclusion are hard to debug • Ousterhout. Why Threads Are a Bad Idea (For Most Purposes); Savage et al., Eraser • C.A.R. Hoare: “The unavoidable price of reliability is simplicity.” • It’s more important to get it extensible than to get it right • Even in the hardware world! • Knowledge of end-to-end semantics can be critical • Whither “transparent” OS-level mechanisms?