Fault tolerance

PowerPoint Slideshow about 'Fault Tolerance' - jacoba



Fault tolerance terminology

  • “dependability” - extent to which reliance can justifiably be placed on service.

    • General concept

  • “reliability” - continuity of service

    • metric: mean time between failures (MTBF)

  • “availability” - readiness for usage

  • “safety” - avoidance of catastrophic effects on environment

  • “security” - resistance to unauthorized access.


Faults, errors, failures

  • “fault” - component malfunction

  • “error” - system state is wrong

  • “failure” - system departs from specification

[Figure: diagram relating fault, error, and failure]


System

[Figure: a System containing components, surrounded by its Environment, with fault and failure labels]


Coping with faults

  • Reduce/eliminate faults in components.

  • Fault tolerance

    • Prevent faults from becoming failures

    • usually through redundancy.


Types of faults (fault models)

Fault tolerance algorithms depend on the fault model they assume.

  • “Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.

  • “Timing fault” - response is too early or late.

  • “Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).


The agreement problem

  • Processors may fail

  • … so, use multiple processors

  • … but then, processors may disagree, causing failures.

  • Need a principled approach to distributed agreement


Example: AFTI 16 (from J. Rushby)

  • “Advanced Fighter Technology Integration” F-16

  • Triple-redundant digital flight-control system (DFCS) with analog backup

  • DFCS design was “asynchronous”

    • processors ran independently

      • sample sensor, evaluate control law, send command to actuator

      • actuator averages or selects from commands

    • General Dynamics felt synchronization would introduce a single point of failure.


AFTI 16 problems

  • Processors can get widely varying sensor readings because of timing differences

  • Reconfiguration can cause sudden changes in control (“thumps”).

    • Need to allow wide range of “plausible values” before declaring a processor “bad”

    • Bad sensor reading drags average down

    • Sensor finally crosses threshold and is called “bad”

    • average suddenly snaps back when sensor is excluded.
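
The “thump” described above can be reproduced with a toy calculation. This is only an illustrative sketch: the median-based exclusion rule, the threshold, and all readings are hypothetical values chosen for the demonstration, not details of the AFTI F-16 design.

```python
def average_with_exclusion(readings, threshold):
    """Average only the readings within `threshold` of the median (hypothetical rule)."""
    ordered = sorted(readings)
    median = ordered[len(ordered) // 2]
    trusted = [r for r in readings if abs(r - median) <= threshold]
    return sum(trusted) / len(trusted)


THRESHOLD = 3.0  # hypothetical "plausible value" window

# Two healthy sensors read 10.0 while the third drifts downward step by step.
outputs = []
for drift in [0.0, 1.0, 2.0, 3.0, 4.0]:
    readings = [10.0, 10.0, 10.0 - drift]
    outputs.append(round(average_with_exclusion(readings, THRESHOLD), 3))

# The average sinks while the bad sensor is still "plausible",
# then snaps back in one step once the sensor is excluded.
print(outputs)
```

A wider window delays the exclusion and makes the final jump larger, which is the trade-off the bullets describe.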


AFTI 16 problems (cont)

  • Processor states can diverge rapidly

    • especially when different processors go into different control modes.

  • Design complexity

    • 70% of application code was for redundancy management

    • Control laws had to be modified to ramp changes in and out smoothly


AFTI 16 flight test, Flight 36

  • “Departure” from control laws for 3 seconds

  • acceleration exceeded -4g, then +7g

  • Angle of attack went to -10 degrees, then +20 degrees

  • Aircraft rolled 360 degrees

  • Cause: side air probe cut out at high angle of attack

  • Analysis showed this would cause complete failure of DFCS for several areas of flight envelope


AFTI 16 flight 44

  • Each channel declared the others failed

    • asynchronous operation, timing skew, sensor noise

  • analog backup not selected

    • simultaneous failure of two channels not anticipated

  • Aircraft flown home on a single digital channel (not designed for this)

  • There were no hardware failures.


AFTI 16 Analysis (NASA)

  • Nearly all failure indications were design oversights related to asynchronous operation

  • Failures due to lack of understanding of interactions among

    • Air data system

    • redundancy management software

    • flight control laws (decision points, thumps, ramp-in/out)

  • Moral of the story: Reliability through redundancy is a lot harder than it looks.


Distributed consensus

  • Goal: multiple processors agree on something in the presence of various kinds of faults and errors

  • Intellectually difficult

    • Algorithms are tricky

    • Proofs are subtle

    • Sensitive to assumptions

      • Synchronous vs. asynchronous

      • Communication mechanism

      • Fault models

  • Many papers written


Synchronous vs. asynchronous

  • Synchronous: Processors run in lock-step

    • Hard to implement - model may be unrealistic

      • Requires clock synchronization.

    • Consensus is easier

  • Asynchronous: Processors run at arbitrary speed

    • Easier to implement - model is conservative

    • In most models, consensus problem is provably unsolvable.


Synchronous vs. asynchronous

  • Semi-synchronous

    • Bounds on how far out-of-sync processors can get

    • Model is fairly realistic

    • Consensus is almost as easy as synchronous


Fault models

  • Goal: Make claims such as: “the system will continue to function if any single processor stops.”

  • More conservative fault models:

    • Fault tolerance is harder

    • But, if successful, stronger claims can be made

    • Fewer assumptions = simpler FMEA, easier “certification”

  • A lot of models have been proposed.


Process fault models

  • “Stopping fault” - process stops sending messages

    • does not restart

    • does not send wrong messages

    • liberal (easy) model

  • “Byzantine fault” - process behaves arbitrarily

    • Name comes from cute “Byzantine generals” metaphor

    • May send arbitrary messages, enter arbitrary states

    • Equivalent to “evil” behavior, for our purposes


Synchronous agreement with stopping faults

  • Multiple processes want to “agree” on a value

  • Applications

    • sensor readings among redundant processors

    • decide what time it is

    • decide which of a group of processors are broken and should be removed from system.


Synchronous agreement - properties

  • Each process starts with an initial value, processes end with a decision value.

  • Agreement: all good processes decide on the same value.

  • Validity: if all processes start with the same value, that value is the final decision value.

  • Termination: All good processes eventually decide.
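
The three properties can be written as simple checks over the outcome of a run. This is a sketch for illustration only; the dictionary layout (process name mapped to value) and the function names are assumptions, not part of the original slides.

```python
def agreement_holds(decisions):
    """Agreement: all good processes decided on the same value."""
    return len(set(decisions.values())) <= 1


def validity_holds(initial_values, decisions):
    """Validity: if every process started with the same value, that value was decided."""
    starts = set(initial_values.values())
    if len(starts) != 1:
        return True  # validity says nothing when the inputs differ
    (common,) = starts
    return all(d == common for d in decisions.values())


def termination_holds(good_processes, decisions):
    """Termination: every good process has a decision."""
    return all(p in decisions for p in good_processes)


# Hypothetical run: p3 stopped, so only p1 and p2 are "good" and decide.
initial_values = {"p1": "A", "p2": "A", "p3": "A"}
decisions = {"p1": "A", "p2": "A"}

ok = (
    agreement_holds(decisions)
    and validity_holds(initial_values, decisions)
    and termination_holds(["p1", "p2"], decisions)
)
print(ok)
```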


Flood set algorithm

  • Assumption: There is a dedicated link between each pair of processes

  • No more than f processes can stop

  • Each process has an initial value v

  • Each process accumulates a set W of all the values it has ever seen.

    • On each round, every process sends its W set to every other process

    • Every process sets W to the union of the old value and all the new values coming in from others.


Flood set

  • After f+1 rounds, every process looks at W.

    • If W has only one value, choose that value.

    • Else, choose 0 (a predetermined default).
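
A minimal simulation sketch of the flood set algorithm, run for the f+1 rounds that the correctness argument relies on. The function name, the `crashes` format (process mapped to the round in which it stops and the recipients it still reaches in that round), and the process names are assumptions made for this illustration.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial: dict process -> initial value
    crashes: dict process -> (round, set of recipients reached before stopping)
    Returns the decisions of the processes that never stopped.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}  # values each process has seen
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in alive}
        stopping = set()
        for sender in alive:
            recipients = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                # A stopping process reaches only some recipients, then goes silent.
                recipients &= crashes[sender][1]
                stopping.add(sender)
            for r in recipients:
                inbox[r] |= W[sender]
        alive -= stopping
        for p in alive:
            W[p] |= inbox[p]  # union of the old set and everything received
    return {p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive}


# 3 processes, 1 fault: p3 stops in round 1 after reaching p2 but not p1.
decisions = flood_set(
    {"p1": "A", "p2": "A", "p3": "B"},
    f=1,
    crashes={"p3": (1, {"p2"})},
)
print(decisions)
```

In this run p2 learns B in round 1 and passes it to p1 in round 2, so both survivors end with the same two-element W and fall back to the default.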


Flood set correctness

  • In f+1 rounds, there must be at least one round in which no processes stop

    • At most f processes can stop, and processes cannot stop more than once.

  • If no process stops in round r, W will be the same in all good processes in subsequent rounds.

    • All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.

    • After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.


Flood set correctness

  • So, after f+1 rounds, all non-stopped processes have same W sets

    • If W has only one value, all processes pick this value.

    • Else all processes pick the default value (0).


Flood set example

  • 3 processes, 1 fault, default value = 0

  • Initial values: A, A, B

  • W in round 0: {A}, {A}, {B}

  • W in round 1: {A}, {A,B}, - (the third process dies after sending W to one process but not the other)

  • W in round 2: {A,B}, {A,B}, - (W sets for the two surviving processes are the same)

  • Final: 0, 0, - (choose default because |W|>1)


Flood set efficiency

  • O((f+1) n^2) messages

    • f+1 rounds

    • n processes send n messages per round

  • O((f+1) n^3) values are sent (each message may have a set of up to n values)


Optimized flood set

  • Note: If W has more than one element, process doesn’t need to know what is in it.

  • Idea: Every process sends only first two distinct values.

    • Every process sends its initial value on first round

    • If process receives a different value, it sends it out on next round

  • Correctness proof: run Flood and OptFlood in parallel

    • same initial values, stopping pattern

    • W sets have more than one value iff OptFlood process gets two values.
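
A sketch of the optimized variant under the same assumptions as a plain flood set simulation: f+1 synchronous rounds and a hypothetical `crashes` map (process mapped to the round in which it stops and the recipients it still reaches). Each process forwards a value only the first time it learns one that differs from its own.

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Optimized flood set: every process sends at most two distinct values."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}    # at most two distinct values
    outbox = {p: [v] for p, v in initial.items()}  # what to send in the next round
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        stopping = set()
        for sender in alive:
            recipients = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                recipients &= crashes[sender][1]
                stopping.add(sender)
            for r in recipients:
                inbox[r].extend(outbox[sender])
        alive -= stopping
        for p in alive:
            outbox[p] = []
            for v in inbox[p]:
                # Record and forward only the first value that differs from our own.
                if v not in seen[p] and len(seen[p]) < 2:
                    seen[p].append(v)
                    outbox[p].append(v)
    return {p: (seen[p][0] if len(seen[p]) == 1 else default) for p in alive}


# Same scenario as a 3-process, 1-fault flood set run: p3 reaches only p2.
opt_decisions = opt_flood(
    {"p1": "A", "p2": "A", "p3": "B"},
    f=1,
    crashes={"p3": (1, {"p2"})},
)
print(opt_decisions)
```

A process only needs to know whether it has seen one value or more than one, which is why tracking two values is enough to reproduce the flood set decision.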


OptFlood efficiency

  • 2n^2 messages

    • n processes send at most two messages to n other processes.

  • O(n^2) values are sent
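
To make the difference concrete, the two sets of bounds can be evaluated for a sample system size. The functions below simply plug numbers into the upper bounds quoted on the efficiency slides; the choice of n = 10 and f = 3 is arbitrary.

```python
def flood_set_cost(n, f):
    """Upper bounds for flood set: (f+1) n^2 messages, up to n values per message."""
    rounds = f + 1
    messages = rounds * n * n
    values = messages * n
    return messages, values


def opt_flood_cost(n):
    """Upper bounds for OptFlood: 2 n^2 messages, one value per message."""
    messages = 2 * n * n
    values = messages
    return messages, values


n, f = 10, 3
flood = flood_set_cost(n, f)
opt = opt_flood_cost(n)
print(flood, opt)
```

Note that the OptFlood bound does not grow with f, while the flood set bound does.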


Byzantine agreement

  • Goal: non-faulty processes should agree on a value.

    • E.g., message received

    • e.g., sensor value

  • Faults may cause arbitrary behavior

    • arbitrary values communicated

    • different values communicated to different receivers

  • Advantage: reduces fault analysis

  • Disadvantage: hard or impossible to do.


Byzantine agreement properties

Agreement: All good processes agree on a value

Validity: If source of value was non-faulty, agreed upon value is the same.


Asynchronous agreement

  • Asynchronous model:

    • Message transmission takes arbitrary time.

    • Processes run at arbitrary speeds.

  • Theorem: There is no deterministic algorithm that guarantees agreement in an asynchronous model with even one faulty process (a single stopping fault is enough, so one Byzantine fault is as well)

    • Fine print: Details of conditions, communication

  • This is one of the most important results about distributed systems.


Synchronous agreement

  • Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.

  • The agreement problem is solvable in this model.

  • Theorem: Tolerating k Byzantine faults requires > 3k processes.

  • So “Triple modular redundancy” can’t handle Byzantine faults.

  • Practical case: 1 Byzantine fault, 4 processes.

  • Assumes full connectivity (connections between each pair of processors).
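
The bound in the theorem is easy to turn into arithmetic. The helper names below are hypothetical; they only restate "more than 3k processes" in both directions.

```python
def min_processes(k):
    """Smallest n that satisfies n > 3k."""
    return 3 * k + 1


def max_byzantine_faults(n):
    """Largest k such that n > 3k."""
    return (n - 1) // 3


# Triple modular redundancy (n = 3) tolerates no Byzantine fault;
# four processes are the smallest system that tolerates one.
tmr_faults = max_byzantine_faults(3)
four_faults = max_byzantine_faults(4)
print(tmr_faults, four_faults, min_processes(2))
```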


Synchronous agreement with one fault

  • Single transmitter communicates value to all processes.

  • Round 0: Transmitter sends value to n-1 receivers.

    • Values are sent correctly if transmitter is not faulty.

  • Round 1: Each receiver sends value to n-2 other receivers.

    • Receivers record all values separately.

    • Intuition: receivers compare notes on what transmitter told them.

  • Each receiver chooses the majority value of all values it received.

    • If no majority, use pre-arranged default value.
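
The two rounds can be sketched as follows. This is an illustrative model, not the original authors' code: the function names and the `relay` override (used to let a faulty receiver lie in round 1) are assumptions, and the sample inputs are modeled on a faulty transmitter sending 1, 1, 2, a faulty transmitter sending 1, 2, 3, and a faulty receiver P1.

```python
from collections import Counter


def majority(values, default):
    """Return the value held by a strict majority, else the default."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def agree_one_fault(receivers, round0, relay=None, default=0):
    """round0: receiver -> value the transmitter sent it (round 0).
    relay: optional (sender, recipient) -> value, overriding an honest relay (round 1).
    Returns receiver -> decided value.
    """
    relay = relay or {}
    decisions = {}
    for me in receivers:
        held = [round0[me]]  # what the transmitter told me
        for other in receivers:
            if other != me:
                # An honest receiver relays exactly what it got in round 0.
                held.append(relay.get((other, me), round0[other]))
        decisions[me] = majority(held, default)
    return decisions


receivers = ["P1", "P2", "P3"]
ex1 = agree_one_fault(receivers, {"P1": 1, "P2": 1, "P3": 2})
ex2 = agree_one_fault(receivers, {"P1": 1, "P2": 2, "P3": 3})
ex3 = agree_one_fault(
    receivers,
    {"P1": 1, "P2": 1, "P3": 1},
    relay={("P1", "P2"): 2, ("P1", "P3"): 3},  # P1 sends bogus values
)
print(ex1, ex2, ex3)
```

In the third case the model still computes a decision for P1, but because P1 is the faulty process its result carries no guarantee.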


Example 1 - faulty transmitter

  • Round 0: faulty xmtr sends varying results to rcvrs: 1 to P1, 1 to P2, 2 to P3.

  • Round 1: rcvrs exchange the round 0 values (reliably).

  • Each of P1, P2, P3 then holds the values 1, 1, 2.

  • Finally, receivers take majority of all answers: consensus is 1.

Example 2 - faulty transmitter

  • Round 0: faulty xmtr sends varying results to rcvrs: 1 to P1, 2 to P2, 3 to P3.

  • Round 1: rcvrs exchange the round 0 values (reliably).

  • Each of P1, P2, P3 then holds the values 1, 2, 3.

  • There is no majority, so rcvrs use default (0).


Example 3 - faulty receiver

  • Round 0: xmtr (not faulty in this example) sends 1 to P1, P2, and P3.

  • Round 1: Process 1 sends bogus values (2 to P2, 3 to P3); P2 and P3 relay 1.

  • P2 holds 2, 1, 1 and P3 holds 3, 1, 1, so majority computes the correct value (1) for processes 2, 3.

  • Process 1 is broken, so its result (5 in the figure) is not required to be correct.


General case

  • Previous algorithm can be generalized to handle more Byzantine faults.

  • General results: tolerating k faults requires k+1 rounds (counting the initial transmission) and at least 3k+1 processors

  • Number of messages grows exponentially with number of rounds

  • Intuition: “Pn said that Pn-1 said that ... p1 said that p0 said that the value was x”

    • There are exponentially many chains pn ... p0.
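
The growth can be estimated by counting relay chains. The sketch assumes the generalization relays every chain of distinct processes for one more hop per round, in the style of the recursive "oral messages" scheme; the slides do not spell out the generalized algorithm, so treat the counts as an illustration of the trend rather than exact figures for any particular protocol.

```python
from math import prod


def messages_in_round(n, j):
    """Messages sent in round j (round 0 is the initial transmission).

    Round 0: n-1 messages. Each later round multiplies by the number of
    processes not yet on the chain: (n-1)(n-2)...(n-j-1).
    """
    return prod(n - i for i in range(1, j + 2))


def total_messages(n, k):
    """Total over rounds 0..k for k tolerated faults."""
    return sum(messages_in_round(n, j) for j in range(k + 1))


# Smallest system for each k (n = 3k + 1).
totals = [total_messages(3 * k + 1, k) for k in (1, 2, 3)]
print(totals)
```

For one fault and four processes this gives the 3 round 0 messages plus the 6 round 1 relays of the single-fault algorithm.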


Hybrid Byzantine agreement

  • Idea: Free bonus reliability with the purchase of Byzantine agreement.

  • Handles Byzantine faults, plus some simpler faults

  • Symmetric fault: process sends same wrong value to everyone.

  • Nonmalicious fault: process sends a recognizable error value.

  • Advantages:

    • If processors have these faults, we can tolerate more faulty processors

    • These faults are more probable than true Byzantine faults - so this increases reliability


Hybrid Byzantine agreement

  • Modify previous algorithm by adding special error value “E”.

    • Nonmalicious faults send E value (other faults may send E, also).

    • Majority algorithm first removes E values.

  • Theorem: Algorithm reaches agreement if n > 2a + 2s + b + r

    • a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).

    • Previous case: a=1, s=0, b=0, r=1, so n > 3

    • With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults.

    • or 1 Byzantine and 1 symmetric

    • ... but just 1 Byzantine in previous algorithm
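
The cases listed above can be checked mechanically. The function below encodes only the inequality as stated on the slide; the function name is hypothetical, and any further side conditions in the fine print of the theorem are not modeled.

```python
def hybrid_bound_ok(n, a, s, b, r):
    """True when n > 2a + 2s + b + r.

    a = Byzantine faults, s = symmetric faults, b = nonmalicious faults,
    r = number of rounds (excluding the first transmission).
    """
    return n > 2 * a + 2 * s + b + r


cases = {
    "previous case, 4 processes": hybrid_bound_ok(4, a=1, s=0, b=0, r=1),
    "previous case, 3 processes": hybrid_bound_ok(3, a=1, s=0, b=0, r=1),
    "6 processes, 1 Byzantine + 2 nonmalicious": hybrid_bound_ok(6, a=1, s=0, b=2, r=1),
    "6 processes, 1 Byzantine + 1 symmetric": hybrid_bound_ok(6, a=1, s=1, b=0, r=1),
}
for name, ok in cases.items():
    print(name, ok)
```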


Variations

  • Synchronous communication is difficult

    • Compromise between synchronous and asynchronous: real-time constraints.

  • “Authentication” - agreement can be made less costly by using digital signatures

    • transmitter digitally signs messages

    • processes can’t lie about who said what.

    • can handle any number of faults (in synchronous model).

  • May assume different network connectivity

    • Some links in network missing


Summary

  • Fault tolerance is tricky. Redundancy does not necessarily buy reliability.

  • Byzantine models can account for unforeseen fault types.

  • Byzantine agreement is impossible in some models.

  • There exist practical algorithms for Byzantine agreement if synchronous communication is available.

  • There are deep theoretical results in this area.