ECE 753: FAULT-TOLERANT COMPUTING

Kewal K.Saluja

Department of Electrical and Computer Engineering

Reliability Modeling and Analysis


Project

  • The deadline to decide on a project and project partner(s), i.e. the initial proposal, is March 6

    BUT

  • I would like to see Projects decided as soon as possible

    • Discuss with me

    • You can have dibs on it

    • Prepare a short summary

ECE 753 Fault Tolerant Computing


Overview

  • Introduction

  • Reliability Modeling

    • reliability block diagram

    • combinatorial model

    • Markov model

  • Other Parameters and analysis

  • General remarks and Summary

Introduction

  • References

    • Text

    • [prad:96], [swew:99], [shooman:02]

    • [triv:82] and [triv:01]

    • The text covers all the material of this part; the three books [prad:96], [swew:99], and [shooman:02] also contain sufficient material to cover this part of the course

  • Recap of definitions

  • Importance of analysis and analytical model

  • Mathematical formulation for quantitative analysis

Introduction (contd.)

  • Recap of definitions

    • Reliability R(t)

    • Availability A(t)

    • Performability and Dependability

  • Importance of analysis and analytical model

    • to evaluate a design

    • a metric to compare different designs

    • to provide feedback to the designer during early design stages

    • use a model for performance analysis

    • used for quantitative and qualitative analysis

Introduction (contd.)

  • Mathematical formulation for quantitative analysis

    • consider a large experiment with N systems

    • observation at time t

      • N0(t) - number of correctly operating systems

      • Nf(t) - number of failed systems

    • Hence

      • Reliability R(t) = N0(t)/N = 1 - Nf(t)/N, where N = N0(t) + Nf(t)

      • Unreliability Q(t) = 1 - R(t)

      • Derivative of reliability: dR/dt = -(1/N)(dNf(t)/dt)

      • dNf(t)/dt is called instantaneous failure rate of the component
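
The counting definitions above can be tried on a small data set. The following is a minimal sketch, assuming a list of observed failure times; the function name `empirical_reliability` and the sample data are hypothetical and not part of the lecture.

```python
from bisect import bisect_right


def empirical_reliability(failure_times, n_systems, t):
    """Return (R, Q) at time t for an experiment with n_systems units.

    failure_times holds the failure instants of the units that failed;
    units that never failed during the experiment are simply absent.
    """
    times = sorted(failure_times)
    n_failed = bisect_right(times, t)   # Nf(t): failures at or before t
    n_ok = n_systems - n_failed         # N0(t): units still operating
    r = n_ok / n_systems                # R(t) = N0(t)/N
    return r, 1.0 - r                   # Q(t) = 1 - R(t)


# Hypothetical data: 10 systems, 4 of which failed at the listed hours.
r, q = empirical_reliability([120.0, 340.0, 560.0, 900.0], n_systems=10, t=600.0)
print(f"R(600) = {r:.2f}, Q(600) = {q:.2f}")
```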

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Also

      • failure rate at time t

        • (instantaneous failure rate at time t) / N0(t)

        • (1/N0(t))(dNf(t)/dt) - called z(t)

        • this and the previous expressions together reduce to

          • z(t) = -(1/R(t))(dR(t)/dt)

          • z(t) is called the failure rate, hazard function or hazard rate

        • We can solve the above for R(t) provided we know the instantaneous failure rate

        • Bathtub curve for failure rate

          • implies constant failure rate during useful life

          • infant mortality and wear out periods have variable failure rates

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Reliability computation - constant failure rate

      • solve the equations - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-λt)

    • Reliability computation - time varying failure rate

      • Weibull distribution z(t) = αλ(λt)**(α-1)

      • solve the equations - exponential-type function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-(λt)**α)

    • Failure rate computation - military standard

      • function of - learning factor, quality factor, temperature factor, environmental factor, and # of pins on IC (see the slide set by Koren and Krishna)
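
As a rough check of the time-varying case, the sketch below evaluates the Weibull hazard z(t) = αλ(λt)**(α-1) and the reliability obtained by integrating it, R(t) = exp(-(λt)**α), then recovers z(t) numerically from z(t) = -(1/R)(dR/dt). The function names and parameter values are illustrative assumptions, and the code is meant for t > 0.

```python
import math


def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t)**(alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)


def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t)**alpha); alpha = 1 gives exp(-lam * t)."""
    return math.exp(-((lam * t) ** alpha))


def numeric_hazard(t, lam, alpha, h=1e-3):
    """Recover z(t) = -(1/R) dR/dt with a central difference."""
    dr = weibull_reliability(t + h, lam, alpha) - weibull_reliability(t - h, lam, alpha)
    return -(dr / (2 * h)) / weibull_reliability(t, lam, alpha)


lam, alpha, t = 0.001, 2.0, 500.0   # assumed example values
print(weibull_hazard(t, lam, alpha), numeric_hazard(t, lam, alpha))
```

With α = 1 the hazard is constant and the model reduces to the exponential case; α > 1 gives an increasing failure rate (wear out) and α < 1 a decreasing one (infant mortality).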

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Reliability computation - mean time to failure (MTTF)

      • Definition: expected time that a system will operate before the first failure occurs

      • Probability measure: S - sample space, E - event space

        • for A in E, P(A) >= 0

        • P(S) = 1

        • P(A ∪ B) = P(A) + P(B), when A and B are non-intersecting

      • Random Variable (RV) - X maps events of S to real-numbers

      • Probability distribution function of a RV

      • Probability density function (pdf) - derivative of the distribution function

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Reliability computation - mean time to failure

      • Probability density function - properties

        • always >= 0

        • integrates to 1 (between limits)

      • Expectation

        • ∫ x f(x) dx in the continuous case

        • Σ xi p(xi) in discrete case

      • Application in our case

        • unreliability Q(t) is a probability distribution function of failure - in fact it is cumulative probability that system fails in time [0,t]

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Reliability computation - MTTF and MTTR

      • Application in our case (contd.)

        • derivative of Q(t), written as f(t), is the pdf of failure - or failure density function

        • Expected value can be computed using integration and is Mean Time To Failure (MTTF)

        • constant failure rate

          • MTTF = 1/λ

      • Mean time to repair - MTTR

        • assume a constant repair rate (μ); arguments similar to those used for failure analysis give MTTR = 1/μ
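
The MTTF integral can be checked numerically. The expected value ∫ t f(t) dt equals ∫ R(t) dt after integration by parts (assuming t·R(t) goes to 0 as t grows), so a simple trapezoidal sum over R(t) should approach 1/λ for a constant failure rate. This is a minimal sketch; `mttf_numeric`, the horizon, and the value of λ are assumptions chosen for illustration.

```python
import math


def mttf_numeric(reliability, horizon, steps=100_000):
    """Trapezoidal estimate of the integral of R(t) from 0 to horizon."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * dt)
    return total * dt


lam = 0.002                       # assumed constant failure rate (per hour)
estimate = mttf_numeric(lambda t: math.exp(-lam * t), horizon=40 / lam)
print(estimate, 1 / lam)          # the two values should agree closely
```

The horizon of 40/λ is arbitrary; it only needs to be long enough that the truncated tail of R(t) is negligible.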

Introduction (contd.)

  • Mathematical formulation (contd.)

    • Reliability computation - mean time between failure (MTBF)

      • Mean time between failure - MTBF

        • use heuristic arguments to conclude

          • MTBF = (total time T)/(average number of failures)

        • can also argue MTBF = MTTF + MTTR

      • Note: often λ << μ and hence MTTF >> MTTR, therefore the terms MTTF and MTBF are used interchangeably by some practitioners

Reliability Modeling

  • Application of the previous analysis to system models

    • Assumptions

      • system consists of modules

      • each module assigned a probability of working R(t), a function of time

      • once a module fails it is assumed to yield incorrect results

      • module failures are independent

Reliability Modeling

  • Application of the previous analysis to system models

    • Reliability block diagrams

      • consider a system - microP, controller, mem, bus, …

      • the system will fail if any of the components fails

      • Rsys = P(all subsystems work correctly)

      • = P(bus correct) · P(mem correct) · … etc. (follows from the assumption that component failures are independent)

      • Rsys = Rbus · Rmem · Rmicro · Rcont

Reliability Modeling

  • Reliability block diagrams - Series Systems

    • Assume system has n components

    • All components should survive for system to operate

    • Reliability of system

      • Rsys = Π Ri(t), i = 1, …, n

    • For exponential distributions of each component

      • Rsys = Π exp(-λi t) = exp(-(λ1 + λ2 + … + λn)t) = exp(-Σ λi t)

      • Effect is that the system failure rate is the summation of failure rates of components

    • Note these are nonredundant systems

[Figure: series block diagram with blocks R1, R2, …, Rn]
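
The series formulas can be sketched in a few lines. The helper names and the three failure rates below are hypothetical; the code only illustrates that multiplying exponential reliabilities is the same as summing the failure rates.

```python
import math


def series_reliability(reliabilities):
    """Rsys = product of component reliabilities (independent failures)."""
    return math.prod(reliabilities)


def series_reliability_exponential(rates, t):
    """Rsys = exp(-(sum of failure rates) * t) for exponential components."""
    return math.exp(-sum(rates) * t)


rates = [1e-4, 2e-4, 5e-4]        # assumed failure rates per hour
t = 1000.0
by_product = series_reliability([math.exp(-lam * t) for lam in rates])
by_sum = series_reliability_exponential(rates, t)
print(by_product, by_sum, 1 / sum(rates))   # last value: series MTTF
```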

Reliability Modeling

  • Reliability block diagrams - Parallel Systems

    • Assume system with spares

    • faulty component is replaced by a spare as fault occurs

    • only one component needs to survive for the system to operate

    • Model is to represent all components connected in parallel

    • P(sys fail) = P(M1 fails).P(M2 fails). .. .P(Mn fails)

    • Rsys = 1 - P(sys fail) = 1- (1-R1)(1-R2) …(1-Rn)
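
A minimal sketch of the parallel formula, assuming independent module failures as stated earlier; the function name and the module reliabilities are illustrative only.

```python
import math


def parallel_reliability(reliabilities):
    """Rsys = 1 - product of (1 - Ri): the system fails only if all modules fail."""
    p_all_fail = math.prod(1.0 - r for r in reliabilities)
    return 1.0 - p_all_fail


two_modules = parallel_reliability([0.9, 0.9])
three_modules = parallel_reliability([0.9, 0.8, 0.7])
print(two_modules, three_modules)
```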

Reliability Modeling

  • Reliability block diagrams - Series-Parallel Systems

    • straightforward: apply the series and parallel formulas repeatedly

  • Reliability block diagrams - MTTF of system

    • 1/(system failure rate), when the system failure rate is constant

    • Series systems - 1/(sum of individual failure rates)

    • Parallel systems and series parallel systems – work out by integration from the reliability or unreliability equations
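
For parallel systems the failure rate is not constant, so the MTTF comes from integrating R(t). As a sketch under the exponential assumption: for two modules, R(t) = exp(-λ1 t) + exp(-λ2 t) - exp(-(λ1+λ2)t), and integrating term by term gives 1/λ1 + 1/λ2 - 1/(λ1+λ2). For n identical modules the same integration gives (1/λ)(1 + 1/2 + … + 1/n). The function names below are hypothetical.

```python
import math


def mttf_parallel_two(lam1, lam2):
    """MTTF of two exponential modules in parallel, from integrating R(t)."""
    return 1 / lam1 + 1 / lam2 - 1 / (lam1 + lam2)


def mttf_parallel_identical(lam, n):
    """MTTF of n identical exponential modules in parallel: (1/lam) * sum of 1/k."""
    return sum(1.0 / k for k in range(1, n + 1)) / lam


lam = 0.001                                  # assumed failure rate per hour
print(mttf_parallel_two(lam, lam))           # two identical modules
print(mttf_parallel_identical(lam, 3))       # three identical modules
```

Each added module contributes less than the previous one (1/λ, then 1/(2λ), then 1/(3λ)), which is one way to see the diminishing return from extra redundancy.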

Reliability Modeling

  • Reliability block diagrams - Non-series-parallel systems

    • Bayes rule: consider a sample space S. Partition it into B and B' (the complement of B). Now consider an event A that falls partly in B and partly in B'. Writing P(A/B) for the conditional probability of A given B, we have:

      A = (A ∩ B) ∪ (A ∩ B')

      P(A) = P[(A ∩ B) ∪ (A ∩ B')]

      = P(A ∩ B) + P(A ∩ B')

      = P(A/B)P(B) + P(A/B')P(B')

    • In general the set S can be partitioned into (B1, B2, … ,Bn)

      P(A) = Σ P(A/Bi)P(Bi)

      This can be viewed graphically also (draw a tree)


Reliability Modeling

  • Reliability block diagrams - Non-series-parallel systems

    • Example - consider the following non-series-parallel system

      [Figure: block diagram with components C1, C2, C3, C4, C5]

    • list all paths for system to survive, namely c1c4, c2c4, c2c5, c3c5

    • These paths are not disjoint; the sum of the reliabilities of all paths gives an upper bound on the system reliability

    • Exact computation is possible using Bayes rule – complete in class
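
One possible way to carry out the exact computation is to condition on c2, the component shared by two of the four paths. If c2 works, the system works when c4 or c5 works; if c2 has failed, only the paths c1c4 and c3c5 remain, and they share no components. The sketch below follows that reasoning and cross-checks it by enumerating all 32 component states. It is an illustration built from the path list above, not the in-class derivation; the names and the reliability value 0.9 are assumptions.

```python
from itertools import product

NAMES = ["c1", "c2", "c3", "c4", "c5"]
PATHS = [("c1", "c4"), ("c2", "c4"), ("c2", "c5"), ("c3", "c5")]


def reliability_by_conditioning(r):
    """P(works) = P(works / c2 up) R2 + P(works / c2 down) (1 - R2)."""
    given_up = 1 - (1 - r["c4"]) * (1 - r["c5"])                      # c4 or c5
    given_down = 1 - (1 - r["c1"] * r["c4"]) * (1 - r["c3"] * r["c5"])  # c1c4 or c3c5
    return given_up * r["c2"] + given_down * (1 - r["c2"])


def reliability_by_enumeration(r):
    """Sum the probabilities of all component states with a working path."""
    total = 0.0
    for states in product([True, False], repeat=len(NAMES)):
        up = dict(zip(NAMES, states))
        if any(all(up[c] for c in path) for path in PATHS):
            prob = 1.0
            for c in NAMES:
                prob *= r[c] if up[c] else 1 - r[c]
            total += prob
    return total


r = {name: 0.9 for name in NAMES}            # assumed component reliabilities
exact = reliability_by_conditioning(r)
check = reliability_by_enumeration(r)
path_sum = sum(r[a] * r[b] for a, b in PATHS)   # upper bound from the path list
print(exact, check, path_sum)
```

For these values the path sum exceeds 1, so the bound is loose here; it becomes more informative when path reliabilities are small.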

Reliability Modeling

  • Non series parallel systems

    Upper and lower bounds

    See the slides provided by Koren and Krishna (authors of the text)

Reliability Modeling

  • Combinatorial model

    • Consider an NMR system

    • Assume voter reliability to be 1

    • Divide all events for success into disjoint events

    • Compute probability of each event and add them

    • Example – TMR system

    • Can be used to compute MTTF

    • Can also analyze other systems such as an m-of-n system
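
The combinatorial model can be sketched as a sum over disjoint events "exactly i of n modules work", assuming identical, independent modules with reliability R and a perfect voter. For TMR (2-of-3) this reduces to 3R² - 2R³. The function names are hypothetical.

```python
from math import comb, isclose


def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent modules work."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(m, n + 1))


def tmr_reliability(r):
    """TMR with a perfect voter: all three work, or exactly two work."""
    return 3 * r**2 - 2 * r**3


print(m_of_n_reliability(2, 3, 0.9), tmr_reliability(0.9))
print(tmr_reliability(0.5))   # equals the simplex reliability at R = 0.5
```

With exponential modules, R = exp(-λt), integrating 3exp(-2λt) - 2exp(-3λt) gives a TMR MTTF of 5/(6λ), which is lower than the simplex MTTF of 1/λ even though TMR is more reliable for short missions.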

Reliability Modeling

  • Markov model

    • Difficulty with the previous models

      • incorporating repairs in the model and analysis

      • incorporating a coverage factor - for example, in a duplicated system we may be less than 100% certain that only the faulty unit will be eliminated when the system is reconfigured

    • Markov modeling - basic

      • Define the concept of state using TMR system example (8 states)

      • Transitions between states occur with certain probabilities

    • Markov model – assumption

      • Probability of transition from a state si to sj is independent of the method of arrival into state si

    • Example – develop a Markov model for a TMR in class

Reliability Modeling

  • Markov model

    • Markov model for a TMR – all details not shown

[Figure: eight-state Markov model for a TMR system; states 111, 110, 101, 011, 100, 010, 001, 000; transition labels λΔt and 1-3λΔt]

Reliability Modeling

  • Markov model - reduced

    • Reduced Markov model for a TMR system

    • Previous eight-state model can be reduced to a three-state model by merging states and re-computing the transition probabilities

  • Markov model - accounting for repairs

    • We can include links between states knowing the repair rates of components
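
A reduced three-state TMR model without repairs can be stepped numerically. The sketch below assumes the merged states "three modules good", "two modules good", and "failed", with per-step transition probabilities 3λΔt and 2λΔt respectively; these merged probabilities and all names are assumptions for illustration. The result is compared with the combinatorial TMR expression 3exp(-2λt) - 2exp(-3λt).

```python
import math


def tmr_markov_reliability(lam, t_end, dt):
    """Iterate the difference equations of a 3-state TMR model (no repairs)."""
    p3, p2, pf = 1.0, 0.0, 0.0            # start with all three modules good
    steps = round(t_end / dt)
    for _ in range(steps):
        # Right-hand side uses the probabilities from the previous step.
        p3, p2, pf = (
            (1 - 3 * lam * dt) * p3,
            3 * lam * dt * p3 + (1 - 2 * lam * dt) * p2,
            pf + 2 * lam * dt * p2,
        )
    return p3 + p2                         # system works in either good state


lam, t_end = 0.001, 500.0                  # assumed example values
numeric = tmr_markov_reliability(lam, t_end, dt=0.01)
closed_form = 3 * math.exp(-2 * lam * t_end) - 2 * math.exp(-3 * lam * t_end)
print(numeric, closed_form)
```

Repairs could be added to such a sketch by including a term that moves probability from the two-good state back to the three-good state at the repair rate.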

Reliability Modeling

  • Markov model - analyzing systems

    • Consider a duplicate compare system – no repairs

    • Develop Markov model with 3 states

    • Develop a difference equation for computing probabilities for being in different states of the system

    • Develop a differential equation model

    • Solution methods

      • Numerical approach

      • Solving differential equation

        • direct approach

        • Using Laplace transforms

Reliability Modeling

  • Markov model - analyzing systems

    • Consider a duplicate compare system – with repairs

    • Develop Markov model with 3 states

    • Develop a differential equation model

    • Solve using Laplace transforms

  • Yet one more example

    • duplicate compare system – with imperfect coverage

    • Develop Markov model with 5 states

    • Reduce model for different scenarios

Other Parameters and analysis

  • Markov model - can use other parameters

    • Safety

    • Availability

      • Consider a simplex system

      • Develop Markov model with 2 states

      • Solve the system for probability of system being in available state

      • Define and compute steady state availability

      • Provide an intuitive explanation of the computed value of steady state availability and its relation to MTTF and MTTR

    • Maintainability
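
For the simplex system with repair, a two-state model (up, down) with constant failure rate λ and repair rate μ has the solution A(t) = μ/(λ+μ) + (λ/(λ+μ))exp(-(λ+μ)t) when the system starts in the up state. The sketch below evaluates it and the steady-state value; the function names and rates are illustrative assumptions.

```python
import math


def availability(t, lam, mu):
    """A(t) for a two-state model that starts in the up state at t = 0."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


lam, mu = 0.001, 0.1                       # assumed rates per hour
print(availability(0.0, lam, mu))          # starts fully available
print(availability(1000.0, lam, mu))       # close to the steady-state value
print(steady_state_availability(1 / lam, 1 / mu))
```

Intuitively, over a long period the system alternates between an average up time of MTTF and an average down time of MTTR, so the fraction of time it is available is MTTF/(MTTF + MTTR) = μ/(λ+μ).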

General remarks

  • Voter reliability issue

  • Performance and states with degraded performance

  • Mission time improvement

  • Redundancy Ratio

  • Law of diminishing return

Summary

  • Introduction of mathematical models

  • Solving models to carry out analysis

    • Example systems

      • Duplicate

      • Duplicate with repair

      • Simplex with repair for availability
