Fault Tolerance

  1. Fault Tolerance • Motivation: Systems need to be much more reliable than their components • Use Redundancy: Extra items that can be used to make up for failures • Types of Redundancy: • Hardware • Software • Time • Information

  2. Fault-Tolerant Scheduling • Fault Tolerance: The ability of a system to suffer component failures and still function adequately • Fault-Tolerant Scheduling: Reserve enough slack in the schedule that the system can still function despite a certain number of processor failures

  3. FT-Scheduling: Model • System Model • Multiprocessor system • Each processor has its own memory • Tasks are preloaded into assigned processors • Task Model • Tasks are independent of one another • Schedules are created ahead of time

  4. Basic Idea • Preassign backup copies, called ghosts. • Assign ghosts to the processors along with the primary copies • A ghost and a primary copy of the same task can’t be assigned to the same processor • For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
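As a quick illustration (an addition, not from the original deck), a minimal Python sketch of the no-colocation constraint; the task and processor tables are made-up assumptions:

```python
# Primaries and ghost (backup) copies are preassigned to processors;
# a ghost may never share a processor with its own primary copy.
primaries = {"T1": 0, "T2": 1, "T3": 2}   # task -> processor of primary copy
ghosts    = {"T1": 1, "T2": 2, "T3": 0}   # task -> processor of ghost copy

def valid_assignment(primaries, ghosts):
    # True iff no task's ghost sits on the same processor as its primary.
    return all(ghosts[t] != p for t, p in primaries.items())

assert valid_assignment(primaries, ghosts)
```

A full fault-tolerant scheduler would additionally check that, on each processor, the primaries plus the relevant subset of ghosts remain feasibly schedulable; that test is omitted here.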

  5. Requirements • Two main variations: • Current and future iterations of the task have to be saved if a processor fails • Only future iterations need to be saved; the current iteration can be discarded

  6. Forward and Backward Masking • Forward Masking: Mask the output of failed units without significant loss of time • Backward Masking: After detecting an error, try to fix it by recomputing or some other means

  7. Failure Types • Permanent: The fault is incurable • Transient: The unit is faulty for some time, following which it starts functioning correctly again • Intermittent: Frequently cycles between a faulty and a non-faulty state

  8. Faults and Errors • A fault is some physical defect or malfunction • An error is a manifestation of a fault • Latency: • Fault Latency: Time between occurrence of a fault and its manifestation as an error • Error Latency: Time between the generation of an error and its being caught by the system

  9. Hardware Failure Recovery • If transient, it may be enough to wait for the fault to go away and then reinvoke the computation • If permanent, reassign the tasks to other, functional, processors

  10. Faults: Output Characteristics • Stuck-at: A line is stuck at 0 or 1. • Dead: No output (e.g., high-impedance state) • Arbitrary: The output changes with time

  11. Factors Affecting HW F-Rate • Temperature • Radiation • Power surges • Mechanical shocks • HW failure rate often follows the “bathtub” curve

  12. Some Terminology • Fail-safe Systems: Systems which end up in a “safe” state upon failure • Example: All traffic lights turning red in an intersection • Fail-stop Systems: Systems that stop producing output when they fail

  13. Example of HW Redundancy • Triple-Modular Redundancy (TMR): • Three units run the same algorithm in parallel • Their outputs are voted on and the majority is picked as the output of the TMR cluster • Can forward-mask up to one processor failure
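A minimal Python sketch of the TMR voting step described above; the replica outputs are made-up values, and real TMR voters are typically implemented in hardware:

```python
from collections import Counter

def tmr_vote(outputs):
    """Return the majority value among the three replica outputs."""
    value, count = Counter(outputs).most_common(1)[0]
    if count >= 2:
        return value  # at least 2-of-3 agree: the single faulty unit is outvoted
    raise RuntimeError("no majority: more than one replica failed")

# Replica 3 has failed and produces a wrong answer; the fault is masked.
print(tmr_vote([42, 42, 17]))  # -> 42
```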

  14. Mathematical Background • Basic laws of probability • Density and distribution functions • Notion of stochastic independence • Expectation, variance, etc. • Memoryless distribution • Markov chains • Steady-state & transient solutions • Bayes’s Law

  15. Hardware FT • N-Modular Redundancy (NMR) • Basic structure • Variations • Reliability evaluation • Independent failures • Correlated failures • Voter: • Bit-by-bit comparison • Median • Formalized majority • Generalized k-plurality
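Two of the voter variations listed above, sketched in Python with assumed numeric and symbolic outputs; exact voter definitions vary across the literature:

```python
import statistics
from collections import Counter

def median_voter(outputs):
    # Median voter for numeric outputs: a minority of arbitrarily wrong
    # values cannot drag the median outside the range of the good ones.
    return statistics.median(outputs)

def k_plurality_voter(outputs, k):
    # Generalized k-plurality: accept a value if at least k units agree on it.
    value, count = Counter(outputs).most_common(1)[0]
    if count >= k:
        return value
    raise RuntimeError(f"no value reaches a {k}-plurality")

print(median_voter([9.9, 10.1, 73.0]))                    # -> 10.1
print(k_plurality_voter(["A", "A", "B", "A", "C"], k=3))  # -> A
```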

  16. Exploiting Appln Semantics • Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious) • No acceptance test is perfect: • Sensitivity: Probability of catching an incorrect output • Specificity: Probability that an output which is flagged as wrong is really wrong • Specificity = 1 - False Positive Probability
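A toy Python sketch of a range-based acceptance test of the kind described above; the task and bounds are illustrative assumptions:

```python
def acceptance_test(value, low, high):
    """Pass iff the output lies inside the application-specified range."""
    return low <= value <= high

def run_with_acceptance_test(task, low, high):
    result = task()
    if not acceptance_test(result, low, high):
        raise ValueError(f"output {result} flagged by acceptance test")
    return result

# E.g., a temperature-processing task whose plausible output range is known:
reading = run_with_acceptance_test(lambda: 98.6, low=90.0, high=110.0)
print(reading)
```

Note the trade-off the slide points at: widening the range lowers the false-positive rate but also lowers sensitivity, since more genuinely wrong outputs slip through.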

  17. Checkpointing • Store partial results in a safe place • When failure occurs, roll back to the latest checkpoint and restart • Issues: • Checkpoint positioning • Implementation • Kernel level • Application level • Correctness: Can be a problem in distributed systems
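A minimal application-level checkpointing sketch in Python, assuming a pickle-able state dictionary and a local file as the "safe place"; kernel-level checkpointing and distributed correctness are out of scope here:

```python
import os
import pickle

CKPT = "state.ckpt"  # the "safe place": a file on non-volatile storage

def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename: a crash never leaves a torn checkpoint

def restore_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)   # roll back to the latest checkpoint
    return {"i": 0, "total": 0}     # no checkpoint yet: start from scratch

state = restore_checkpoint()
for i in range(state["i"], 1_000_000):
    state["total"] += i
    state["i"] = i + 1
    if i % 100_000 == 0:            # checkpoint positioning: every 100k steps
        save_checkpoint(state)
print(state["total"])
```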

  18. Terminology • Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application • Checkpointing Latency: Time from when a checkpoint starts being taken to when it is stored in non-volatile storage.

  19. Reducing Chkptg Overhead • Buffer checkpoint writes • Don’t checkpoint “dead” variables: • Never used again by the program, or • Next operation with respect to the variable is a write • Problem is how to identify dead variables • Don’t checkpoint read-only stuff, like code

  20. Reducing Chkptg Latency • Consider compressing the checkpoint. Usefulness of this approach depends on: • Extent of the compression possible • Work required to execute the compression algorithm
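A back-of-the-envelope break-even test for these two factors; all numbers are made-up assumptions:

```python
ckpt_bytes  = 512 * 2**20   # 512 MiB checkpoint
write_bw    = 200 * 2**20   # 200 MiB/s to non-volatile storage
ratio       = 0.4           # compressed size / original size
compress_bw = 200 * 2**20   # compression throughput (CPU-bound)

plain      = ckpt_bytes / write_bw
compressed = ckpt_bytes / compress_bw + ratio * ckpt_bytes / write_bw
print(f"plain: {plain:.2f} s   compressed: {compressed:.2f} s")
# Here compressing costs 2.56 s of CPU to save only about 1.54 s of
# writing, so it loses; a faster compressor or better ratio would win.
```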

  21. Optimization of Chkptg • Objective in general-purpose systems is usually to minimize the expected execution time • Objective in real-time systems is to maximize the probability of meeting task deadlines • Need a mathematical model to determine this • Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
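One classical instance of such a model (from the general checkpointing literature, not this deck) is Young's first-order approximation for the spacing of equidistant checkpoints that minimizes expected execution time:

```python
import math

def optimal_interval(overhead_s, mtbf_s):
    # Young's first-order rule: interval ~ sqrt(2 * overhead * MTBF)
    return math.sqrt(2 * overhead_s * mtbf_s)

C, MTBF, T = 5.0, 24 * 3600.0, 8 * 3600.0   # 5 s overhead, 24 h MTBF, 8 h task
tau = optimal_interval(C, MTBF)
print(f"checkpoint every {tau:.0f} s -> about {T / tau:.0f} checkpoints")
# -> checkpoint every 930 s -> about 31 checkpoints
```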

  22. Distributed Checkpointing • Ordering of Events: • Easy to do if there’s just one thread • If there are multiple threads: • Events in the same thread are trivial to order • Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B • Given two events A and B in separate threads, • A could precede B • B could precede A • They could be concurrent
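The precedence relation sketched above is the classic "happened-before" relation; one common way to track it at run time (an addition here, not from the deck) is with Lamport logical clocks:

```python
class LamportClock:
    """One logical clock per thread/process."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time  # timestamp carried on the outgoing message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

x, y = LamportClock(), LamportClock()
a = x.local_event()   # event A in thread X
ts = x.send()         # X sends to Y after A ...
b = y.receive(ts)     # ... so event B in thread Y follows A
assert a < b          # A precedes B implies clock(A) < clock(B)
# The converse fails: clock(A) < clock(B) does not rule out concurrency.
```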

  23. Distributed Checkpointing • Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state • To avoid the domino effect, we can coordinate the checkpointing • Tightly synchronize the checkpoints in all processors • Koo-Toueg algorithm

  24. Checkptg with Clock Sync • Assume the clock skew is bounded at d and the minimum message delivery time is f • Each processor: • Takes a local checkpoint at some specified time, t • Following its checkpoint, it does not send out any messages until it is sure that any message it sends will be received only after the recipient has itself checkpointed; i.e., until t + d - f (the recipient’s clock may lag by up to d, and a message takes at least f to arrive)
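A tiny numeric illustration of the rule, with assumed values for t, d, and f:

```python
t, d, f = 100.0, 3.0, 1.0            # checkpoint time, skew bound, min delay (s)
silence_until = t + max(0.0, d - f)  # no silence needed at all if f >= d
print(f"hold outgoing messages during [{t}, {silence_until}]")
# -> hold outgoing messages during [100.0, 102.0]
```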

  25. Koo-Toueg Algorithm • A processor that wants to checkpoint, • Does so, locally • Notifies every processor that has communicated with it of the last message (timestamp or message number) it received from them • If such a processor doesn’t have a checkpoint recording the transmission of that message, it takes a checkpoint • This can result in a surge of checkpointing activity visible at the non-volatile storage
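A heavily simplified sketch of this propagation rule (the tentative/permanent two-phase commit of the full Koo-Toueg algorithm is omitted, and the message counters and process names are illustrative assumptions):

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.last_recv = {}   # sender name -> number of last message received
        self.sent = {}        # receiver name -> messages sent so far
        self.ckpt_sent = {}   # sends recorded in the latest checkpoint

    def checkpoint(self, procs):
        self.ckpt_sent = dict(self.sent)   # record local state
        # Ask every process that sent us messages to checkpoint too,
        # unless its latest checkpoint already records those sends.
        for sender, n in self.last_recv.items():
            p = procs[sender]
            if p.ckpt_sent.get(self.name, 0) < n:
                p.checkpoint(procs)

procs = {n: Process(n) for n in ("A", "B")}
procs["A"].sent = {"B": 3}; procs["B"].last_recv = {"A": 3}
procs["B"].checkpoint(procs)   # B's checkpoint forces A to checkpoint as well
print(procs["A"].ckpt_sent)    # -> {'B': 3}
```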

  26. Software Fault Tolerance • It is practically impossible to produce a large piece of software that is bug-free • E.g., even the space shuttle flew with several potentially disastrous bugs despite extensive testing • Single-version Fault Tolerance • Multi-version Fault Tolerance

  27. Fault Models • Reasonably trustworthy hardware fault models exist • Many software fault models exist in the literature, but not one can be fully trusted to represent reality

  28. Single-Version FT • Wrappers: Code “wrapped around” the software that checks for consistency and correctness • Software Rejuvenation: Reboot the machine reasonably frequently • Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
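A contrived sketch of the data-diversity idea above, assuming a routine that fails on one specific input but tolerates a tiny perturbation of it:

```python
def fragile(x):
    if x == 0.1:                    # contrived failure point
        raise ValueError("bug triggered")
    return x * x

def with_data_diversity(f, x, eps=1e-9, tries=3):
    for k in range(tries):
        try:
            return f(x + k * eps)   # k = 0 is the original input
        except Exception:
            continue                # perturb the input slightly and retry
    raise RuntimeError("all perturbed retries failed")

print(with_data_diversity(fragile, 0.1))  # succeeds on the perturbed input
```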

  29. Multi-version FT • Very, very expensive • Two basic approaches • N-version programming • Recovery Blocks

  30. N-Version Programming (NVP) • Theoretically appealing, but hard to make effective • Basic Idea: • Have N independent teams of programmers develop versions of the application independently • Run the versions in parallel and vote on their outputs • If the versions are truly independent, the ensemble will be highly reliable

  31. Failure Diversity • Effectiveness hinges on whether faults in the versions are statistically independent of one another • Forces against truly independent failures: • Common programming “culture” • Common specifications • Common algorithms • Common software/hardware platforms

  32. Failure Diversity • Incidental Diversity • Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions • Forced Diversity • Diverse specifications • Diverse programming languages • Diverse development tools and compilers • Cognitively diverse teams: Probably not realistic

  33. Experimental Results • Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent • Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI • 27 students writing code for anti-missile application • 93 correlated failures observed: if true independence had existed, we’d have expected about 5

  34. Recovery Blocks • Also uses multiple versions • Only one version is active at any time • If the output of this version fails an acceptance test, another version is activated
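A minimal recovery-block sketch in Python; the versions and acceptance test are stand-ins, and a real implementation would also roll state back before activating an alternate:

```python
import math

def recovery_block(versions, acceptance_test, *args):
    for version in versions:        # only one version active at a time
        try:
            result = version(*args)
        except Exception:
            continue                # a crash also counts as a failed version
        if acceptance_test(result):
            return result           # first output passing the test wins
    raise RuntimeError("all versions failed the acceptance test")

primary   = lambda x: x ** 0.5 if x > 1 else -1.0   # buggy below 1
alternate = lambda x: math.sqrt(x)                  # library fallback
ok        = lambda r: r >= 0                        # acceptance test

print(recovery_block([primary, alternate], ok, 0.25))  # primary fails -> 0.5
```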

  35. Byzantine Failures • The worst failure mode known • Original Motivating Problem (~1978): • A sensor needs to disseminate its output to a set of processors. How can we ensure that, • If the sensor is functioning correctly: All functional processors obtain the correct sensor reading • If the sensor is malfunctioning: All functional processors still agree on some common value for the sensor reading

  36. Byzantine Generals Problem • Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster • The overall commander communicates with his divisional commanders by means of a confidential messenger. The messenger is trustworthy and doesn’t alter the message, and the message can only be read by its intended recipient

  37. Byz Generals Problem (contd.) • If the C-in-C is loyal • He sends consistent orders to the subordinate generals • All loyal subordinates must obey his order • If the C-in-C is a traitor • All loyal subordinate generals must agree on some default action (e.g., running away)

  38. Impossibility with 3 Generals • Suppose there are 2 divisions, A and B. • Commander-in-chief is a traitor and sends message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!” • Com(A) sends a messenger to Com(B), saying “The boss told me to attack!” • Com(B) receives: • Direct order from the C-in-C saying “Retreat” • Message from Com(A) saying “I was ordered to attack”

  39. Byz. Generals Problem (contd.) • Com(B)’s dilemma: • Either the C-in-C or Com(A) is a traitor: it is impossible to know which • Further communication with Com(A) won’t add any useful information • Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action • The problem cannot be solved with 3 generals when even one of them may be a traitor

  40. Byz. Generals Problem (contd.) • Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m

  41. Byzantine Generals Algorithm • Byz(0) // no-failure algorithm • C-in-C sends his order to every subordinate • The subordinate uses the order he receives, or the default if he receives no order

  42. Byz(m) // For up to m traitors (failures) • (1) C-in-C sends his order to every subordinate G_i; let this be received as v_i • (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues • (3) For each (i,j) with i ≠ j, let w_{i,j} be the order that G_i got from G_j in step 2, or the default if no message was received. G_i takes the majority of {v_i} ∪ {w_{i,j} : j ≠ i} and uses it as the correct order to follow
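The recursion is easier to see in code. Below is a compact Python simulation of Byz(m) (the Lamport-Shostak-Pease "oral messages" algorithm); the traitor model is a deliberate simplification in which a traitor flips every order he handles, whereas a real traitor may send different orders to different recipients:

```python
from collections import Counter

FLIP = {"ATTACK": "RETREAT", "RETREAT": "ATTACK"}

def send(value, sender_is_traitor):
    # Simplified traitor model: a traitor corrupts every order he relays.
    return FLIP[value] if sender_is_traitor else value

def byz(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: decided order} after running Byz(m)."""
    # Step 1: the commander sends his order to every subordinate.
    received = {g: send(value, commander in traitors) for g in lieutenants}
    if m == 0:
        return received                       # Byz(0): use the order received
    # Step 2: each G_j acts as C-in-C in Byz(m-1) to relay its order v_j.
    relayed = {j: byz(m - 1, j, [i for i in lieutenants if i != j],
                      received[j], traitors)
               for j in lieutenants}
    # Step 3: G_i takes the majority of v_i and all relayed w_{i,j}.
    return {i: Counter([received[i]] +
                       [relayed[j][i] for j in lieutenants if j != i]
                       ).most_common(1)[0][0]
            for i in lieutenants}

# N = 4, m = 1 satisfies N > 3m: one traitor cannot break agreement.
print(byz(1, "C", ["G1", "G2", "G3"], "ATTACK", traitors={"G3"}))
# -> {'G1': 'ATTACK', 'G2': 'ATTACK', 'G3': 'ATTACK'}
```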
