1 / 14

CS 603 Failure Models

CS 603 Failure Models. April 12, 2002. Fault Tolerance in Distributed Systems. Perfect world: No Failures We don’t live in a perfect world Non-distributed system Crash, you’re dead Distributed system: Redundancy Should result in less down time But does it?

Download Presentation

CS 603 Failure Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.


Presentation Transcript

  1. CS 603Failure Models April 12, 2002

  2. Fault Tolerance inDistributed Systems • Perfect world: No Failures • We don’t live in a perfect world • Non-distributed system Crash, you’re dead • Distributed system: Redundancy • Should result in less down time • But does it? • Distributed systems according to Butler Lampson A distributed system is a system in which I can’t get my work done because a computer that I’ve never heard of has failed.

  3. Analogy:Single vs. Multi-Engine Airplanes • If the engine quits on a single-engine airplane, you will land soon • In a multi-engine airplane, you can still fly • Fatal accidents per 100,000 hours flown (1997, private aircraft): • Single-engine: 1.47 • Multi-engine: 1.92 (Airlines: .038) • Why is multi-engine more dangerous?

  4. Problems with Distributed Systems • Failure more frequent • Many places to fail • More complex • More pieces • New challenges: Asynchrony, communication • Potential problems of failure greater • Single system – everything stops • Distributed – some parts continue Consistency!

  5. First Step: Goals • Availability • Can I use it now? • Reliability • Will it be up as long as I need it? • Safety • If it fails, what are the consequences? • Maintainability • How easy is it to fix if it breaks?

  6. Next Step: Failure Models • Failure: System doesn’t give desired behavior • Component-level failure (can compensate) • System-level failure (incorrect result) • Fault: Cause of failure (component-level) • Transient: Not repeatable • Intermittent: Repeats, but (apparently) independent of system operations • Permanent: Exists until component repaired • Failure Model: How the system behaves when it doesn’t behave properly

  7. Failure Model(Flaviu Cristian, 1991) • Dependency • Proper operation of Database depends on proper operation of processor, disk • Failure Classification • Type of response to failure • Failure semantics • State of system after given class of failure • Failure masking • High-level operation succeeds even if they depend on failed services

  8. Failure Classification • Correct • In response to inputs, behaves in a manner consistent with the service specification • Omission Failure • Doesn’t respond to input • Crash: After first omission failure, subsequent requests result in omission failure • Timing failure (early, late) • Correct response, but outside required time window • Response failure • Value: Wrong output for inputs • State Transition: Server ends in wrong state

  9. Crash Failure types(based on recovery behavior) • Amnesia • Server recovers to predefined state independent of operations before crash • Partial amnesia • Some part of state is as before crash, rest to predefined state • Pause • Recovers to state before omission failure • Halting • Never restarts

  10. Failure Semantics l u r f(sr) sr sr’ f(sr’) • Max delay on link: d; Max service time: p • Should get response in 2d+p • Assume omission failure only • If no response in 2d+p, resend request • What if performance failure possible? • Must distinguish between response to sr and sr’

  11. Failure Semantics • Specification for service must include • Failure-free (normal) semantics • Failure semantics (likely failure behaviors) • Multiple semantics • Combine to give (weaker) semantics • Arbitrary failure semantics: Weakest possible • Choice of failure semantics • Is class of failure likely? • Probability of type of failure • What is the cost of failure • Catastrophic?

  12. Hierarchical Failure Masking • Hierarchical failure masking • Dependency: Higher level gets (at best) failure semantics of lower level • Can compensate for lower level failure to improve this • Example: TCP fixes communication errors, so some failure semantics not propagated to higher level

  13. Group Failure Masking • Redundant servers • Failed server masked by others in group • Allows failure semantics of group to be higher than individuals • k-fault tolerant • Group can mask kconcurrent group member failures from client • May “upgrade” failure semantics • Example: Group detects non-responsive server, other member picks up the slack Omission failure becomes performance failure

  14. Additional Issues • Tradeoff of failure probabilities / semantics • Which is better? • P[Catastophic] = 0.01 & P[Performance] = 0.5 • P[Catastrophic] = 0.1 & P[Performance] = 0.01 • Techniques for managing failure • Recovery mechanisms

More Related