Fault Tolerance

PowerPoint Slideshow about 'Fault Tolerance' - abdul-burton

Fault Tolerance
  • Motivation: Systems need to be much more reliable than their components
  • Use Redundancy: Extra items that can be used to make up for failures
  • Types of Redundancy:
    • Hardware
    • Software
    • Time
    • Information
Fault-Tolerant Scheduling
  • Fault Tolerance: The ability of a system to suffer component failures and still function adequately
  • Fault-Tolerant Scheduling: Reserve enough spare time in the schedule that the system can still function despite a certain number of processor failures
FT-Scheduling: Model
  • System Model
    • Multiprocessor system
    • Each processor has its own memory
    • Tasks are preloaded into assigned processors
  • Task Model
    • Tasks are independent of one another
    • Schedules are created ahead of time
Basic Idea
  • Preassign backup copies, called ghosts.
  • Assign ghosts to the processors along with the primary copies
    • A ghost and a primary copy of the same task can’t be assigned to the same processor
    • For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor
  • Two main variations:
    • Current and future iterations of the task have to be saved if a processor fails
    • Only future iterations need to be saved; the current iteration can be discarded
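The two assignment rules above can be checked mechanically. The following is a minimal sketch, not the scheduling algorithm itself: it assumes a single processor failure, and it simplifies "feasibly schedulable" to "total execution time fits within one frame". The names `primary`, `ghost`, `exec_time`, and `frame` are hypothetical.

```python
def survives_single_failure(primary, ghost, exec_time, frame):
    """Check a ghost assignment against the two rules, for one processor failure.

    primary, ghost: dict mapping task -> processor holding that copy.
    exec_time:      dict mapping task -> execution time.
    frame:          time available on each processor (simplifying assumption).
    """
    # Rule 1: a ghost and its primary may not share a processor.
    if any(primary[t] == ghost[t] for t in primary):
        return False

    processors = set(primary.values()) | set(ghost.values())
    for failed in processors:
        for p in processors - {failed}:
            # Rule 2: p must fit its own primaries plus the ghosts that
            # become active because their primary was on the failed processor.
            load = sum(exec_time[t] for t in primary if primary[t] == p)
            load += sum(exec_time[t] for t in ghost
                        if ghost[t] == p and primary[t] == failed)
            if load > frame:
                return False
    return True


primary = {"T1": 0, "T2": 1, "T3": 2}
ghost = {"T1": 1, "T2": 2, "T3": 0}
exec_time = {"T1": 4, "T2": 4, "T3": 4}
print(survives_single_failure(primary, ghost, exec_time, frame=10))
```

Only the ghosts whose primaries sat on the failed processor are counted, which is why a processor can hold more ghosts than it could run all at once.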
Forward and Backward Masking
  • Forward Masking: Mask the output of failed units without significant loss of time
  • Backward Masking: After detecting an error, try to fix it by recomputing or some other means
Failure Types
  • Permanent: The fault is incurable
  • Transient: The unit is faulty for some time, following which it starts functioning correctly again
  • Intermittent: Frequently cycles between a faulty and a non-faulty state
Faults and Errors
  • A fault is some physical defect or malfunction
  • An error is a manifestation of a fault
  • Latency:
    • Fault Latency: Time between occurrence of a fault and its manifestation as an error
    • Error Latency: Time between the generation of an error and its being caught by the system
Hardware Failure Recovery
  • If transient, it may be enough to wait for the fault to go away and then reinvoke the computation
  • If permanent, reassign the tasks to other, functional, processors
Faults: Output Characteristics
  • Stuck-at: A line is stuck at 0 or 1.
  • Dead: No output (e.g., high-impedance state)
  • Arbitrary: The output varies unpredictably over time
Factors Affecting HW Failure Rate
  • Temperature
  • Radiation
  • Power surges
  • Mechanical shocks
  • HW failure rate often follows the “bathtub” curve
Some Terminology
  • Fail-safe Systems: Systems which end up in a “safe” state upon failure
    • Example: All traffic lights turning red in an intersection
  • Fail-stop Systems: Systems that stop producing output when they fail
Example of HW Redundancy
  • Triple-Modular Redundancy (TMR):
    • Three units run the same algorithm in parallel
    • Their outputs are voted on and the majority is picked as the output of the TMR cluster
    • Can forward-mask up to one processor failure
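A TMR voter can be sketched in a few lines. This is an illustrative sketch that assumes the three unit outputs are exactly comparable (hashable) values; the function name `tmr_vote` is hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority of three unit outputs.

    Raises ValueError when all three disagree, i.e. when more than one
    unit has failed and the failure can no longer be masked.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: more than one unit disagrees")


print(tmr_vote(7, 7, 9))  # the faulty third unit is outvoted
```

The single faulty output is masked immediately, with no recomputation, which is what makes this forward masking.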
Mathematical Background
  • Basic laws of probability
    • Density and distribution functions
    • Notion of stochastic independence
    • Expectation, variance, etc.
  • Memoryless distribution
    • Markov chains
      • Steady-state & transient solutions
  • Bayes’s Law
Hardware FT
  • N-Modular Redundancy (NMR)
    • Basic structure
      • Variations
    • Reliability evaluation
      • Independent failures
      • Correlated failures
    • Voter:
      • Bit-by-bit comparison
      • Median
      • Formalized majority
      • Generalized k-plurality
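For the independent-failure case, NMR reliability is a binomial sum: the cluster works as long as a majority of the N units work. The sketch below assumes a perfect voter, identical unit reliability `r`, and odd `n`. The function names are hypothetical. A median voter for numeric outputs is included for comparison with bit-by-bit voting.

```python
from math import comb
from statistics import median


def nmr_reliability(n, r):
    """Probability that a majority of n independent units (each with
    reliability r) are working, assuming a perfect voter."""
    need = n // 2 + 1  # smallest majority
    return sum(comb(n, k) * r**k * (1 - r)**(n - k)
               for k in range(need, n + 1))


def median_vote(outputs):
    """Median voter: tolerates small numeric disagreement between units."""
    return median(outputs)


print(nmr_reliability(3, 0.9))
print(median_vote([10.1, 10.0, 55.0]))
```

For N = 3 the sum reduces to 3r^2 - 2r^3. Redundancy only helps when r > 0.5; below that, the cluster is less reliable than a single unit. Correlated failures break the independence assumption and make the real figure worse than this estimate.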
Exploiting Application Semantics
  • Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious)
  • No acceptance test is perfect:
    • Sensitivity: Probability of catching an incorrect output
    • Specificity: Probability that a correct output passes the test (i.e., is not flagged as faulty)
      • Specificity = 1 - False Positive Probability
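A range-based acceptance test and its two quality measures can be sketched as follows. The sketch takes specificity as 1 minus the false positive probability, as stated above, and estimates both measures from a set of outputs whose true status is known. The function names and sample data are hypothetical.

```python
def acceptance_test(value, low, high):
    """Pass (True) if the output lies inside the accepted range."""
    return low <= value <= high


def sensitivity_specificity(samples, low, high):
    """samples: list of (value, is_actually_wrong) pairs.

    Returns (sensitivity, specificity) as estimated from the samples.
    """
    tp = fn = tn = fp = 0
    for value, wrong in samples:
        flagged = not acceptance_test(value, low, high)
        if wrong and flagged:
            tp += 1      # wrong output caught
        elif wrong:
            fn += 1      # wrong output slipped through
        elif flagged:
            fp += 1      # correct output wrongly flagged
        else:
            tn += 1      # correct output passed
    return tp / (tp + fn), tn / (tn + fp)


samples = [(50, False), (60, False), (120, False), (500, True), (70, True)]
print(sensitivity_specificity(samples, low=0, high=100))
```

Narrowing the range catches more wrong outputs (higher sensitivity) but flags more correct ones (lower specificity), which is why no acceptance test is perfect.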
Checkpointing
  • Store partial results in a safe place
  • When failure occurs, roll back to the latest checkpoint and restart
  • Issues:
    • Checkpoint positioning
    • Implementation
      • Kernel level
      • Application level
    • Correctness: Can be a problem in distributed systems
  • Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application
  • Checkpointing Latency: Time between when a checkpoint starts being taken to when it is stored in non-volatile storage.
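The checkpoint-and-rollback cycle can be illustrated with a small single-process simulation. This is a sketch under stated assumptions: checkpoints are equally spaced, the "safe place" is an in-memory deep copy rather than non-volatile storage, and failures are injected once at chosen step indices. All names are hypothetical.

```python
import copy


def run_with_checkpoints(steps, state, interval, fail_at=()):
    """Apply each step function to `state`, checkpointing every `interval`
    steps. A failure injected at step index i (once) forces a rollback to
    the latest checkpoint. Returns (final_state, number_of_rollbacks)."""
    pending = set(fail_at)
    checkpoint = (0, copy.deepcopy(state))  # (next step index, saved state)
    rollbacks = 0
    i = 0
    while i < len(steps):
        if i in pending:
            pending.discard(i)  # transient failure: happens only once
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            rollbacks += 1
            continue
        state = steps[i](state)
        i += 1
        if i % interval == 0:
            checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


steps = [lambda s: s + 1] * 10
print(run_with_checkpoints(steps, 0, interval=3, fail_at={5}))
```

With a failure at step 5 and a checkpoint after step 3, only two steps are redone. Spacing checkpoints further apart lowers the checkpointing overhead but increases the work lost per rollback.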
Reducing Checkpointing Overhead
  • Buffer checkpoint writes
  • Don’t checkpoint “dead” variables:
    • Never used again by the program, or
    • Next operation with respect to the variable is a write
    • Problem is how to identify dead variables
  • Don’t checkpoint read-only stuff, like code
Reducing Checkpointing Latency
  • Consider compressing the checkpoint. Usefulness of this approach depends on:
    • Extent of the compression possible
    • Work required to execute the compression algorithm
Optimization of Checkpointing
  • Objective in general-purpose systems is usually to minimize the expected execution time
  • Objective in real-time systems is to maximize the probability of meeting task deadlines
    • Need a mathematical model to determine this
    • Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them
Distributed Checkpointing
  • Ordering of Events:
    • Easy to do if there’s just one thread
    • If there are multiple threads:
      • Events in the same thread are trivial to order
      • Event A in thread X is said to precede Event B in thread Y if some communication sent from X after event A arrives at Y before event B
      • Given two events A and B in separate threads,
        • A could precede B
        • B could precede A
        • They could be concurrent
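One common way to decide which of the three cases holds is to stamp events with vector clocks. The slides do not name a mechanism, so vector clocks are a swapped-in technique here, and all function names are hypothetical. Each thread keeps one counter per thread; a receive merges the sender's stamp into the receiver's clock.

```python
def local_event(clock, i):
    """Thread i performs an event: increment its own entry."""
    c = list(clock)
    c[i] += 1
    return tuple(c)


def receive(clock, msg_clock, i):
    """Thread i receives a message stamped msg_clock: merge, then tick."""
    merged = [max(x, y) for x, y in zip(clock, msg_clock)]
    merged[i] += 1
    return tuple(merged)


def happened_before(a, b):
    """True if the event stamped a precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def relation(a, b):
    if happened_before(a, b):
        return "A precedes B"
    if happened_before(b, a):
        return "B precedes A"
    return "concurrent"


# Thread X is index 0, thread Y is index 1.
A = local_event((0, 0), 0)          # event A in X
sent = local_event(A, 0)            # X sends a message after A
B = receive((0, 0), sent, 1)        # Y receives it: event B
C = local_event((0, 0), 1)          # an unrelated event in Y
print(relation(A, B), "|", relation(A, C))
```

Events A and C are concurrent because no chain of messages connects them in either direction.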
Distributed Checkpointing
  • Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state
  • To avoid the domino effect, we can coordinate the checkpointing
    • Tightly synchronize the checkpoints in all processors
    • Koo-Toueg algorithm
Checkpointing with Clock Synchronization
  • Assume the clock skew is bounded at d and minimum message delivery time is f
  • Each processor:
    • Takes a local checkpoint at some specified time, t
    • Following its checkpoint, it does not send out any messages until it is sure that the message will be received only after the recipient has itself checkpointed; i.e., until t+f+d
Koo-Toueg Algorithm
  • A processor that wants to checkpoint,
    • Does so, locally
    • Tells every processor that has communicated with it which message (by timestamp or message number) it last received from that processor
      • If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint
  • This can result in a surge of checkpointing activity visible at the non-volatile storage
Software Fault Tolerance
  • It is practically impossible to produce a large piece of software that is bug-free
    • E.g., Even the space shuttle flew with several potentially disastrous bugs despite extensive testing
  • Single-version Fault Tolerance
  • Multi-version Fault Tolerance
Fault Models
  • Reasonably trustworthy hardware fault models exist
  • Many software fault models exist in the literature, but not one can be fully trusted to represent reality
Single-Version FT
  • Wrappers: Code “wrapped around” the software that checks for consistency and correctness
  • Software Rejuvenation: Reboot the machine reasonably frequently
  • Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
Multi-version FT
  • Very, very expensive
  • Two basic approaches
    • N-version programming
    • Recovery Blocks
N-Version Programming (NVP)
  • Theoretically appealing, but hard to make it effective
  • Basic Idea:
    • Have N independent teams of programmers develop applications independently
    • Run them in parallel and vote on them
    • If they are truly independent, they will be highly reliable
Failure Diversity
  • Effectiveness hinges on whether faults in the versions are statistically independent of one another
  • Forces against truly independent failures:
    • Common programming “culture”
    • Common specifications
    • Common algorithms
    • Common software/hardware platforms
Failure Diversity
  • Incidental Diversity
    • Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions
  • Forced Diversity
    • Diverse specifications
    • Diverse programming languages
    • Diverse development tools and compilers
    • Cognitively diverse teams: Probably not realistic
Experimental Results
  • Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent
  • Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI
    • 27 students writing code for anti-missile application
    • 93 correlated failures observed: if true independence had existed, we’d have expected about 5
Recovery Blocks
  • Also uses multiple versions
  • Only one version is active at any time
  • If the output of this version fails an acceptance test, another version is activated
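The control flow of a recovery block can be sketched as a loop over versions guarded by an acceptance test. This is a minimal sketch that assumes the versions have no side effects, so no state needs restoring before the next version runs. The example versions and the acceptance test are hypothetical.

```python
import math


def recovery_block(versions, acceptance_test, x):
    """Run versions one at a time; return the first result that passes
    the acceptance test. A version that raises counts as having failed."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every version failed the acceptance test")


def halving_sqrt(x):
    # Deliberately faulty primary (hypothetical): only correct for x == 4.
    return x / 2


def sqrt_acceptable(x, result):
    # Acceptance test: the result squared must reproduce the input.
    return result >= 0 and abs(result * result - x) < 1e-9


print(recovery_block([halving_sqrt, math.sqrt], sqrt_acceptable, 9.0))
```

Unlike N-version programming, the alternate version runs only when the primary's output is rejected, so the cost of the extra version is paid only on failure. The scheme is only as good as its acceptance test.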
Byzantine Failures
  • The worst failure mode known
  • Original Motivating Problem (~1978):
    • A sensor needs to disseminate its output to a set of processors. How can we ensure that,
      • If the sensor is functioning correctly: All functional processors obtain the correct sensor reading
      • If the sensor is malfunctioning: All functional processors agree on the sensor reading
Byzantine Generals Problem
  • Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster
  • The overall commander communicates to his divisional commanders by means of a confidential messenger. This messenger is trustworthy and doesn’t alter the message; it can only be read by its intended recipient
Byz. Generals Problem (contd.)
  • If the C-in-C is loyal
    • He sends consistent orders to the subordinate generals
    • All loyal subordinates must obey his order
  • If the C-in-C is a traitor
    • All loyal subordinate generals must agree on some default action (e.g., running away)
Impossibility with 3 Generals
  • Suppose there are 2 divisions, A and B.
  • Commander-in-chief is a traitor and sends message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!”
  • Com(A) sends a messenger to Com(B), saying “The boss told me to attack!”
  • Com(B) receives:
    • Direct order from the C-in-C saying “Retreat”
    • Message from Com(A) saying “I was ordered to attack”
Byz. Generals Problem (contd.)
  • Com(B)’s dilemma:
    • Either the C-in-C or Com(A) is a traitor: it is impossible to know which
    • Further communication with Com(A) won’t add any useful information
    • Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action
  • The problem cannot be solved with 3 generals if even one of them may be a traitor
Byz. Generals Problem (contd.)
  • Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m
Byzantine Generals Algorithm
  • Byz(0) // no-failure algorithm
    • C-in-C sends his order to every subordinate
    • The subordinate uses the order he receives, or the default if he receives no order
  • Byz(m) // For up to m traitors (failures)
    • (1) C-in-C sends order to every subordinate, G_i: let this be received as v_i
    • (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues
    • (3) For each (i,j) such that i!=j, let w_(i,j) be the order that G_i got from G_j in step 2 or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
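The recursion in Byz(m) can be simulated directly. The sketch below makes several assumptions that are not in the slides: generals are numbered, every message is delivered (so the "no order received" default never triggers inside `send`), the default order is "retreat", ties in the majority vote fall back to the default, and a traitor follows one fixed hypothetical strategy (telling even-numbered receivers "attack" and odd-numbered ones "retreat").

```python
from collections import Counter

DEFAULT = "retreat"  # assumed default order


def majority(values):
    """Strict majority of the orders, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def byz(m, commander, subordinates, order, traitors):
    """Simulate Byz(m). Returns {subordinate: order it decides on}."""
    def send(sender, receiver, value):
        if sender in traitors:
            # Hypothetical traitor strategy: conflicting orders.
            return "attack" if receiver % 2 == 0 else "retreat"
        return value

    # Step 1: the commander sends v_i to every subordinate G_i.
    v = {i: send(commander, i, order) for i in subordinates}
    if m == 0:
        return v

    # Step 2: each G_j relays v_j to its colleagues using Byz(m-1).
    # relayed[j][i] is w_(i,j), the order G_i ends up attributing to G_j.
    relayed = {
        j: byz(m - 1, j, [g for g in subordinates if g != j], v[j], traitors)
        for j in subordinates
    }

    # Step 3: G_i takes the majority of v_i and all w_(i,j), j != i.
    return {
        i: majority([v[i]] + [relayed[j][i] for j in subordinates if j != i])
        for i in subordinates
    }


# N = 4, m = 1: loyal commander 0, traitor subordinate 3.
print(byz(1, 0, [1, 2, 3], "attack", traitors={3}))
# N = 4, m = 1: traitor commander 0.
print(byz(1, 0, [1, 2, 3], "attack", traitors={0}))
```

With N = 4 and one traitor, the loyal subordinates obey a loyal commander and agree with each other under a traitorous one. Running the same code with only two subordinates and one of them a traitor (N = 3, m = 1) leaves the loyal subordinate unable to obey a loyal commander, consistent with the N > 3m requirement.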