Fault Tolerance

  • Motivation: Systems need to be much more reliable than their components

  • Use Redundancy: Extra items that can be used to make up for failures

  • Types of Redundancy:

    • Hardware

    • Software

    • Time

    • Information

Fault-Tolerant Scheduling

  • Fault Tolerance: The ability of a system to suffer component failures and still function adequately

  • Fault-Tolerant Scheduling: Save enough time in a schedule that the system can still function despite a certain number of processor failures

FT-Scheduling: Model

  • System Model

    • Multiprocessor system

    • Each processor has its own memory

    • Tasks are preloaded into assigned processors

  • Task Model

    • Tasks are independent of one another

    • Schedules are created ahead of time

Basic Idea

  • Preassign backup copies, called ghosts.

  • Assign ghosts to the processors along with the primary copies

    • A ghost and a primary copy of the same task can’t be assigned to the same processor

    • For each processor, all the primaries and a particular subset of the ghost copies assigned to it should be feasibly schedulable on that processor


  • Two main variations:

    • Current and future iterations of the task have to be saved if a processor fails

    • Only future iterations need to be saved; the current iteration can be discarded

Forward and Backward Masking

  • Forward Masking: Mask the output of failed units without significant loss of time

  • Backward Masking: After detecting an error, try to fix it by recomputing or some other means

Failure Types

  • Permanent: The fault is incurable

  • Transient: The unit is faulty for some time, following which it starts functioning correctly again

  • Intermittent: Frequently cycles between a faulty and a non-faulty state

Faults and Errors

  • A fault is some physical defect or malfunction

  • An error is a manifestation of a fault

  • Latency:

    • Fault Latency: Time between occurrence of a fault and its manifestation as an error

    • Error Latency: Time between the generation of an error and its being caught by the system

Hardware Failure Recovery

  • If transient, it may be enough to wait for the fault to go away and then reinvoke the computation

  • If permanent, reassign the tasks to other, functional, processors

Faults: Output Characteristics

  • Stuck-at: A line is stuck at 0 or 1.

  • Dead: No output (e.g., high-impedance state)

  • Arbitrary: The output changes with time

Factors Affecting Hardware Failure Rate

  • Temperature

  • Radiation

  • Power surges

  • Mechanical shocks

  • HW failure rate often follows the “bathtub” curve

Some Terminology

  • Fail-safe Systems: Systems which end up in a “safe” state upon failure

    • Example: All traffic lights turning red in an intersection

  • Fail-stop Systems: Systems that stop producing output when they fail

Example of HW Redundancy

  • Triple-Modular Redundancy (TMR):

    • Three units run the same algorithm in parallel

    • Their outputs are voted on and the majority is picked as the output of the TMR cluster

    • Can forward-mask up to one processor failure
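
The voting step can be illustrated with a minimal sketch. The function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions made for this example, not part of the slides.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the units, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

# One faulty unit out of three is masked by the other two.
print(majority_vote([42, 42, 17]))

# If all three units disagree, there is no majority to pick.
print(majority_vote([1, 2, 3]))
```

With three units, a single wrong output is outvoted immediately, which is why TMR counts as forward masking: no recomputation is needed.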

Mathematical Background

  • Basic laws of probability

    • Density and distribution functions

    • Notion of stochastic independence

    • Expectation, variance, etc.

  • Memoryless distribution

    • Markov chains

      • Steady-state & transient solutions

  • Bayes’s Law

Hardware FT

  • N-Modular Redundancy (NMR)

    • Basic structure

      • Variations

    • Reliability evaluation

      • Independent failures

      • Correlated failures

    • Voter:

      • Bit-by-bit comparison

      • Median

      • Formalized majority

      • Generalized k-plurality
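
For the independent-failure case, the reliability of an NMR cluster can be sketched as a binomial sum. This assumes a perfect voter, identical units that each work with probability `r`, and an odd `n`; the function names are hypothetical. A simple median voter is included for comparison with bit-by-bit voting.

```python
from math import comb

def nmr_reliability(n, r):
    """Probability that a majority of n independent units is working.

    Assumes a perfect voter and that each unit works with probability r.
    """
    need = n // 2 + 1  # smallest majority
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(need, n + 1))

def median_vote(outputs):
    """Pick the middle value; tolerates small numeric disagreement between units."""
    return sorted(outputs)[len(outputs) // 2]

print(nmr_reliability(3, 0.9))          # TMR built from units with r = 0.9
print(median_vote([10.1, 99.0, 10.0]))  # the outlier 99.0 is ignored
```

Under this model, redundancy helps only when the individual units are better than a coin flip: for `r` below 0.5 the cluster is less reliable than a single unit. Correlated failures violate the independence assumption and make the real figure worse than this estimate.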

Exploiting Application Semantics

  • Acceptance Test: Specify a range outside which the output is tagged as faulty (or at least suspicious)

  • No acceptance test is perfect:

    • Sensitivity: Probability of catching an incorrect output

    • Specificity: Probability that a correct output is not flagged as wrong

      • Specificity = 1 - False Positive Probability
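
A range-based acceptance test and its two quality measures can be sketched as follows. The sample data, the bounds, and the function names are invented for illustration; specificity is computed as 1 minus the false positive probability, as on the slide.

```python
def acceptance_test(value, low, high):
    """Accept the output only if it lies inside the expected range."""
    return low <= value <= high

def evaluate(samples, low, high):
    """samples: list of (output, is_actually_wrong). Returns (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for value, is_wrong in samples:
        flagged = not acceptance_test(value, low, high)
        if is_wrong and flagged:
            tp += 1   # wrong output caught
        elif is_wrong:
            fn += 1   # wrong output slipped through
        elif flagged:
            fp += 1   # correct output flagged by mistake
        else:
            tn += 1   # correct output accepted
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)   # = 1 - false positive probability
    return sensitivity, specificity

# Hypothetical sensor outputs with an expected range of [0, 100].
samples = [(50, False), (120, True), (95, True), (101, False), (70, False)]
print(evaluate(samples, 0, 100))
```

Tightening the range raises sensitivity at the cost of specificity, and widening it does the opposite, which is why no acceptance test is perfect.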


Checkpointing

  • Store partial results in a safe place

  • When a failure occurs, roll back to the latest checkpoint and restart

  • Issues:

    • Checkpoint positioning

    • Implementation

      • Kernel level

      • Application level

    • Correctness: Can be a problem in distributed systems
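
An application-level version of checkpoint and rollback can be sketched in a few lines. Here a deep copy of the state stands in for stable storage, and the `fail_at` parameter injects a single transient failure; both are assumptions of this example.

```python
import copy

def run_with_checkpoints(n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` steps.

    Returns (result, number_of_rollbacks).
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)      # stand-in for stable storage
    rollbacks = 0
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            fail_at = None                     # the injected failure strikes once
            state = copy.deepcopy(checkpoint)  # roll back to the latest checkpoint
            rollbacks += 1
            continue
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            checkpoint = copy.deepcopy(state)  # save partial results
    return state["total"], rollbacks

print(run_with_checkpoints(10, 4, fail_at=6))
```

Only the work done since the last checkpoint is repeated after the failure, so the final result matches a failure-free run.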


  • Checkpointing Overhead: The part of the checkpointing activity that is not hidden from the application

  • Checkpointing Latency: Time from the start of taking a checkpoint until it is stored in non-volatile storage

Reducing Checkpointing Overhead

  • Buffer checkpoint writes

  • Don’t checkpoint “dead” variables:

    • Never used again by the program, or

    • Next operation with respect to the variable is a write

    • Problem is how to identify dead variables

  • Don’t checkpoint read-only stuff, like code

Reducing Checkpointing Latency

  • Consider compressing the checkpoint. Usefulness of this approach depends on:

    • Extent of the compression possible

    • Work required to execute the compression algorithm
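
Both factors can be measured directly. The sketch below assumes the checkpoint is a Python object serialized with `pickle` and compressed with `zlib`; the state used in the example is made up and compresses unusually well.

```python
import pickle
import time
import zlib

def checkpoint_compression_stats(state, level=6):
    """Return (raw_bytes, compressed_bytes, seconds_spent_compressing)."""
    raw = pickle.dumps(state)
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    return len(raw), len(packed), elapsed

# Hypothetical application state: a large, mostly empty grid.
state = {"grid": [0] * 10000, "iteration": 17}
raw_size, packed_size, seconds = checkpoint_compression_stats(state)
print(raw_size, packed_size, seconds)
```

Compression pays off only if the time saved writing fewer bytes to storage exceeds the time spent compressing, so both numbers need to be compared against the storage bandwidth of the target system.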

Optimization of Checkpointing

  • Objective in general-purpose systems is usually to minimize the expected execution time

  • Objective in real-time systems is to maximize the probability of meeting task deadlines

    • Need a mathematical model to determine this

    • Generally, we place checkpoints approximately equidistant from each other and just determine the optimal number of them

Distributed Checkpointing

  • Ordering of Events:

    • Easy to do if there’s just one thread

    • If there are multiple threads:

      • Events in the same thread are trivial to order

      • Event A in thread X is said to precede Event B in thread Y if there is some communication from X after event A that arrives at Y before event B

      • Given two events A and B in separate threads,

        • A could precede B

        • B could precede A

        • They could be concurrent
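
One common way to decide which of the three cases holds is to tag events with vector timestamps. Vector clocks are not mentioned on the slides; they are brought in here only as an illustration, and the function names are hypothetical.

```python
def happened_before(a, b):
    """True if vector timestamp a precedes vector timestamp b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def relation(a, b):
    """Classify two events, given their vector timestamps."""
    if happened_before(a, b):
        return "A precedes B"
    if happened_before(b, a):
        return "B precedes A"
    return "concurrent"   # neither event could have influenced the other

# Two threads X and Y; a timestamp is (events seen in X, events seen in Y).
# X performs event A, then sends a message that Y receives before event B.
print(relation((1, 0), (1, 1)))
# X and Y each perform one event without communicating.
print(relation((1, 0), (0, 1)))
```

The first pair is ordered because a message carried knowledge of A to Y before B happened; the second pair is concurrent because no chain of messages links the two events.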

Distributed Checkpointing

  • Domino Effect: An uncontrolled cascade of rollbacks can roll the entire system back to the starting state

  • To avoid the domino effect, we can coordinate the checkpointing

    • Tightly synchronize the checkpoints in all processors

    • Koo-Toueg algorithm

Checkpointing with Clock Synchronization

  • Assume the clock skew is bounded at d and minimum message delivery time is f

  • Each processor:

    • Takes a local checkpoint at some specified time, t

    • Following its checkpoint, it does not send out any message until it is sure that the message will be received only after the recipient has itself checkpointed; i.e., until t+f+d

Koo-Toueg Algorithm

  • A processor that wants to checkpoint,

    • Does so, locally

    • Tells every processor that has communicated with it which message (by timestamp or message number) it last received from that processor

      • If these processors don’t have a checkpoint recording the transmission of this message, they take a checkpoint

  • This can result in a surge of checkpointing activity visible at the non-volatile storage
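
The cascade described above can be sketched with message counters. This is a heavily simplified illustration of the rule on the slide only: the full Koo-Toueg protocol also uses tentative checkpoints and a commit phase, which are omitted here. The class and function names are hypothetical.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.sent = {}       # peer name -> number of messages sent to that peer
        self.received = {}   # peer name -> last message number received from it
        self.ckpt_sent = {}  # copy of `sent` as recorded in the latest checkpoint
        self.checkpoints = 0

    def send(self, other):
        self.sent[other.name] = self.sent.get(other.name, 0) + 1
        other.received[self.name] = self.sent[other.name]

def checkpoint(proc, procs):
    """Checkpoint proc, then force checkpoints on senders whose sends are unrecorded."""
    proc.ckpt_sent = dict(proc.sent)
    proc.checkpoints += 1
    for peer_name, last_msg in proc.received.items():
        peer = procs[peer_name]
        # The peer must have a checkpoint that records sending message last_msg.
        if peer.ckpt_sent.get(proc.name, 0) < last_msg:
            checkpoint(peer, procs)

a, b, c = Processor("A"), Processor("B"), Processor("C")
procs = {p.name: p for p in (a, b, c)}
a.send(b)   # A -> B
b.send(c)   # B -> C
checkpoint(c, procs)
print(a.checkpoints, b.checkpoints, c.checkpoints)
```

A single checkpoint request at C forces B and then A to checkpoint as well, which is the surge of activity at the non-volatile storage mentioned above.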

Software Fault Tolerance

  • It is practically impossible to produce a large piece of software that is bug-free

    • E.g., even the space shuttle flew with several potentially disastrous bugs despite extensive testing

  • Single-version Fault Tolerance

  • Multi-version Fault Tolerance

Fault Models

  • Reasonably trustworthy hardware fault models exist

  • Many software fault models exist in the literature, but not one can be fully trusted to represent reality

Single-Version FT

  • Wrappers: Code “wrapped around” the software that checks for consistency and correctness

  • Software Rejuvenation: Reboot the machine reasonably frequently

  • Use data diversity: Sometimes an algorithm may fail on some data but not if these data are subjected to minor perturbations
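
Data diversity can be sketched as retrying a computation on slightly perturbed input. The helper `with_data_diversity`, the perturbation sizes, and the fragile routine `sinc` are all invented for this example; the approach is only valid when the application can tolerate the small change in input.

```python
import math

def with_data_diversity(func, x, perturbations=(0.0, 1e-9, -1e-9)):
    """Try func on x, then on slightly perturbed copies of x, until one succeeds."""
    for delta in perturbations:
        try:
            return func(x + delta)
        except ArithmeticError:
            continue   # this re-expression of the input also failed
    raise RuntimeError("all re-expressed inputs failed")

def sinc(x):
    """Hypothetical fragile routine: divides by zero at exactly x = 0."""
    return math.sin(x) / x

print(with_data_diversity(sinc, 0.0))
```

The routine fails on the exact input 0 but succeeds on a neighbor of it, and the neighbor's answer is close enough to the intended one for most uses.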

Multi-version FT

  • Very, very expensive

  • Two basic approaches

    • N-version programming

    • Recovery Blocks

N-Version Programming (NVP)

  • Theoretically appealing, but hard to make effective

  • Basic Idea:

    • Have N independent teams of programmers develop applications independently

    • Run them in parallel and vote on them

    • If the versions fail independently of one another, the voted output will be highly reliable

Failure Diversity

  • Effectiveness hinges on whether faults in the versions are statistically independent of one another

  • Forces against truly independent failures:

    • Common programming “culture”

    • Common specifications

    • Common algorithms

    • Common software/hardware platforms

Failure Diversity

  • Incidental Diversity

    • Prohibit interaction between teams of programmers working on different versions and hope they produce independently failing versions

  • Forced Diversity

    • Diverse specifications

    • Diverse programming languages

    • Diverse development tools and compilers

    • Cognitively diverse teams: Probably not realistic

Experimental Results

  • Experiments suggest that correlated failures do occur at a much higher rate than would be the case if failures in the versions were stochastically independent

  • Example: Study conducted by Brilliant, Knight, and Leveson at UVa and UCI

    • 27 students writing code for anti-missile application

    • 93 correlated failures observed: if true independence had existed, we’d have expected about 5

Recovery Blocks

  • Also uses multiple versions

  • Only one version is active at any time

  • If the output of this version fails an acceptance test, another version is activated
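
The control flow of a recovery block can be sketched as follows. The versions and the acceptance test are hypothetical; a real recovery block would also restore the program state before activating an alternate, which is unnecessary here because the versions are pure functions.

```python
import math

def recovery_block(versions, acceptance_test, x):
    """Run versions one at a time; return the first result that passes the test."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue                  # a crash counts as a failed version
        if acceptance_test(x, result):
            return result
    raise RuntimeError("no version passed the acceptance test")

def buggy_sqrt(x):
    return x / 2                      # hypothetical faulty primary version

def sqrt_is_plausible(x, result):
    return abs(result * result - x) < 1e-6

print(recovery_block([buggy_sqrt, math.sqrt], sqrt_is_plausible, 9.0))
```

Unlike N-version programming, only one version runs at a time, so the scheme costs extra time only when the primary fails, but it depends entirely on the quality of the acceptance test.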

Byzantine Failures

  • The worst failure mode known

  • Original Motivating Problem (~1978):

    • A sensor needs to disseminate its output to a set of processors. How can we ensure that,

      • If the sensor is functioning correctly: All functional processors obtain the correct sensor reading

      • If the sensor is malfunctioning: All functional processors agree on the sensor reading

Byzantine Generals Problem

  • Some divisions of the Byzantine Army are besieging a city. They must all coordinate their attacks (or coordinate their retreat) to avoid disaster

  • The overall commander communicates with his divisional commanders by means of confidential messengers. A messenger is trustworthy and doesn’t alter the message, and the message can be read only by its intended recipient

Byzantine Generals Problem (contd.)

  • If the C-in-C is loyal

    • He sends consistent orders to the subordinate generals

    • All loyal subordinates must obey his order

  • If the C-in-C is a traitor

    • All loyal subordinate generals must agree on some default action (e.g., running away)

Impossibility with 3 Generals

  • Suppose there are 2 divisions, A and B.

  • Commander-in-chief is a traitor and sends message to Com(A) saying “Attack!” and to Com(B) saying “Retreat!”

  • Com(A) sends a messenger to Com(B), saying “The boss told me to attack!”

  • Com(B) receives:

    • Direct order from the C-in-C saying “Retreat”

    • Message from Com(A) saying “I was ordered to attack”

Byzantine Generals Problem (contd.)

  • Com(B)’s dilemma:

    • Either the C-in-C or Com(A) is a traitor: it is impossible to know which

    • Further communication with Com(A) won’t add any useful information

    • Not possible to ensure that if Com(A) and Com(B) are both loyal, they both agree on the same action

  • The problem cannot be solved if there are 3 generals, one of whom may be a traitor

Byzantine Generals Problem (contd.)

  • Central Result: To reach agreement with a total of N participants with up to m traitors, we must have N > 3m

Byzantine Generals Algorithm

  • Byz(0) // no-failure algorithm

    • C-in-C sends his order to every subordinate

    • The subordinate uses the order he receives, or the default if he receives no order

  • Byz(m) // For up to m traitors (failures)

    • (1) C-in-C sends order to every subordinate, G_i: let this be received as v_i

    • (2) G_i acts as the C-in-C in a Byz(m-1) algorithm to circulate this order to his colleagues

    • (3) For each (i,j) such that i!=j, let w_(i,j) be the order that G_i got from G_j in step 2 or the default if no message was received. G_i calculates the majority of the orders {v_i, w_(i,j)} and uses it as the correct order to follow
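
The recursion in Byz(m) can be sketched as a small simulation. Several details are assumptions of this example rather than part of the slides: generals are numbered, a traitor always sends "attack" to even-numbered recipients and "retreat" to odd-numbered ones, and a tie in the vote falls back to the default order "retreat".

```python
from collections import Counter

DEFAULT = "retreat"

def majority(values):
    """Most common order, or the default if the top two orders tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]

def send(sender, receiver, value, traitors):
    """Loyal senders relay faithfully; traitors follow a fixed lying pattern."""
    if sender in traitors:
        return "attack" if receiver % 2 == 0 else "retreat"
    return value

def byz(m, commander, lieutenants, value, traitors):
    """Return {lieutenant: order it settles on}. Only loyal entries are meaningful."""
    # Step 1: the commander sends its order to every lieutenant (v_i).
    received = {g: send(commander, g, value, traitors) for g in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j relays what it received using Byz(m-1).
    relayed = {}
    for j in lieutenants:
        others = [g for g in lieutenants if g != j]
        relayed[j] = byz(m - 1, j, others, received[j], traitors)
    # Step 3: lieutenant i votes over v_i and the relayed values w_(i,j).
    return {i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
            for i in lieutenants}

# Four generals, one traitor (lieutenant 3): the loyal lieutenants obey the order.
print(byz(1, 0, [1, 2, 3], "attack", traitors={3}))
# Four generals, traitorous commander: the loyal lieutenants still agree.
print(byz(1, 0, [1, 2, 3], "attack", traitors={0}))
# Three generals, one traitor: lieutenant 1 cannot recover the loyal order.
print(byz(1, 0, [1, 2], "attack", traitors={2}))
```

The first two runs satisfy N > 3m and reach agreement among the loyal generals; the third run has N = 3 and m = 1, and the loyal lieutenant ends up disobeying a loyal commander, in line with the impossibility result.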