ECE 753: FAULT-TOLERANT COMPUTING



### ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Reliability Modeling and Analysis

Project

- The deadline to decide on a project and project partner(s), i.e. the initial proposal, is March 6, BUT

- I would like to see projects decided as soon as possible
- Discuss with me
- You can have dibs on it
- Prepare a short summary

ECE 753 Fault Tolerant Computing

Overview

- Introduction
- Reliability Modeling
  - reliability block diagram
  - combinatorial model
  - Markov model
- Other Parameters and analysis
- General remarks and Summary


Introduction

- References
  - Text
  - [prad:96], [swew:99], [shooman:02]
  - [triv:82] and [triv:01]
  - The text covers all the material of this part; the three books [prad:96], [swew:99] and [shooman:02] also contain sufficient material to cover this part of the course

- Recap of definitions
- Importance of analysis and analytical model
- Mathematical formulation for quantitative analysis


Introduction (contd.)

- Recap of definitions
- Reliability R(t)
- Availability A(t)
- Performability and Dependability

- Importance of analysis and analytical model
- to evaluate a design
- a metric to compare different designs
- to provide feedback to the designer during early design stages
- use a model for performance analysis
- used for quantitative and qualitative analysis


Introduction (contd.)

- Mathematical formulation for quantitative analysis
  - consider a large experiment with N systems
  - observation at time t
    - $N_0(t)$ - number of correctly operating systems
    - $N_f(t)$ - number of failed systems, so that $N_0(t) + N_f(t) = N$
  - Hence
    - Reliability $R(t) = N_0(t)/N = 1 - N_f(t)/N$
    - Unreliability $Q(t) = 1 - R(t)$
    - Derivative of reliability: $dR(t)/dt = -(1/N)\,(dN_f(t)/dt)$
    - $dN_f(t)/dt$ is called the instantaneous failure rate of the component
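
The counting definition above can be tried on a small data set. The following is a minimal sketch in Python; the list of failure times is hypothetical and the function name `empirical_reliability` is an illustrative choice, not something taken from the slides.

```python
def empirical_reliability(failure_times, t):
    """Return R(t) = N0(t)/N for a population of N systems observed at time t."""
    n = len(failure_times)                                 # N
    n_failed = sum(1 for ft in failure_times if ft <= t)   # Nf(t)
    n_operating = n - n_failed                             # N0(t)
    return n_operating / n

# Hypothetical failure times in hours, for illustration only.
failure_times = [120, 340, 560, 610, 800, 950, 1200, 1500, 2100, 3000]

for t in (0, 600, 1000, 4000):
    r = empirical_reliability(failure_times, t)
    print(f"t = {t:5d} h   R(t) = {r:.2f}   Q(t) = {1 - r:.2f}")
```

With a larger N the step function produced this way approaches the smooth R(t) used in the rest of the analysis.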


Introduction (contd.)

- Mathematical formulation (contd.)
  - Also
    - failure rate at time t = (instantaneous failure rate at time t) / $N_0(t)$ = $(1/N_0(t))\,(dN_f(t)/dt)$, called $z(t)$
    - this and the previous expressions together reduce to $z(t) = -(1/R(t))\,(dR(t)/dt)$
    - $z(t)$ is called the failure rate, hazard function or hazard rate
  - We can solve the above for $R(t)$ provided we know the failure rate $z(t)$
  - Bathtub curve for failure rate
    - implies constant failure rate during useful life
    - infant mortality and wear-out periods have variable failure rates
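
One way to carry out the "solve for R(t)" step is to treat the hazard-rate relation as a separable differential equation. The sketch below assumes that all systems are working at t = 0, i.e. R(0) = 1, which the slides do not state explicitly.

```latex
\begin{align*}
z(t) &= -\frac{1}{R(t)}\,\frac{dR(t)}{dt} = -\frac{d}{dt}\ln R(t) \\
\ln R(t) - \ln R(0) &= -\int_0^t z(\tau)\,d\tau \\
R(t) &= \exp\!\left(-\int_0^t z(\tau)\,d\tau\right) \qquad \text{(using } R(0) = 1\text{)} \\
z(t) = \lambda \ \text{(constant)} \quad &\Longrightarrow \quad R(t) = e^{-\lambda t}
\end{align*}
```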


Introduction (contd.)

- Mathematical formulation (contd.)
  - Reliability computation - constant failure rate
    - solve the equations: exponential function for reliability and for unreliability, $R(t) = 1 - Q(t) = e^{-\lambda t}$
  - Reliability computation - time-varying failure rate
    - Weibull distribution: $z(t) = \alpha\lambda(\lambda t)^{\alpha-1}$
    - solve the equations: $R(t) = 1 - Q(t) = e^{-(\lambda t)^{\alpha}}$
  - Failure rate computation - military standard
    - function of learning factor, quality factor, temperature factor, environmental factor, and number of pins on the IC (see the slide set by Koren and Krishna)
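
The Weibull case above can be checked numerically. This is a small sketch, assuming the two-parameter form with scale λ and shape α given on the slide; the function names and the sample parameter values are illustrative only.

```python
import math

def weibull_hazard(t, lam, alpha):
    """Failure rate z(t) = alpha * lam * (lam * t)**(alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t)**alpha), obtained by integrating z(t) from 0 to t."""
    return math.exp(-((lam * t) ** alpha))

# alpha < 1: decreasing failure rate (infant mortality)
# alpha = 1: constant failure rate, R(t) reduces to exp(-lam * t)
# alpha > 1: increasing failure rate (wear-out)
lam = 0.01  # hypothetical scale parameter, per hour
for alpha in (0.5, 1.0, 2.0):
    print(alpha, weibull_hazard(50.0, lam, alpha), weibull_reliability(50.0, lam, alpha))
```

The three shape values correspond to the three regions of the bathtub curve mentioned earlier.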


Introduction (contd.)

- Mathematical formulation (contd.)
  - Reliability computation - mean time to failure (MTTF)
    - Definition: expected time that a system will operate before the first failure occurs
    - Probability measure: S - sample space, E - event space
      - for A in E, $P(A) \ge 0$
      - $P(S) = 1$
      - $P(A \cup B) = P(A) + P(B)$ when A and B are non-intersecting
    - Random variable (RV): X maps outcomes in S to real numbers
    - Probability distribution function of a RV
    - Probability density function (pdf) - derivative of the distribution function


Introduction (contd.)

- Mathematical formulation (contd.)
  - Reliability computation - mean time to failure
    - Probability density function - properties
      - always $\ge 0$
      - integrates to 1 (between the limits of its range)
    - Expectation
      - $E[X] = \int x f(x)\,dx$ in the continuous case
      - $E[X] = \sum_i x_i\,p(x_i)$ in the discrete case
    - Application in our case
      - unreliability $Q(t)$ is the probability distribution function of failure; in fact it is the cumulative probability that the system fails in the interval $[0, t]$


Introduction (contd.)

- Mathematical formulation (contd.)
  - Reliability computation - MTTF and MTTR
    - Application in our case (contd.)
      - the derivative of $Q(t)$, written $f(t)$, is the pdf of failure, or failure density function
      - its expected value, computed by integration, is the mean time to failure: $\mathrm{MTTF} = \int_0^\infty t\,f(t)\,dt$
      - for a constant failure rate $\lambda$, $\mathrm{MTTF} = 1/\lambda$
    - Mean time to repair - MTTR
      - assume a constant repair rate $\mu$; arguments similar to those used for failure analysis give $\mathrm{MTTR} = 1/\mu$
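
The MTTF result for a constant failure rate can be verified numerically. The sketch below uses the standard identity that the expected lifetime equals the integral of R(t) from 0 to infinity (obtained from the expectation integral by integration by parts); the helper `mttf_numeric`, the truncation point and the value of λ are assumptions made for illustration.

```python
import math

def mttf_numeric(reliability, t_max, steps=200_000):
    """Approximate MTTF = integral of R(t) dt over [0, t_max] (trapezoidal rule)."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

lam = 0.001  # hypothetical failure rate, per hour

# Truncate the infinite integral at 20/lam, where exp(-lam * t) is negligible.
mttf = mttf_numeric(lambda t: math.exp(-lam * t), t_max=20 / lam)
print(f"numeric MTTF = {mttf:.3f} h, analytic 1/lambda = {1 / lam:.3f} h")
```

The same helper works for any reliability function, which is useful later for redundant systems whose failure rate is not constant.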


Introduction (contd.)

- Mathematical formulation (contd.)
  - Reliability computation - mean time between failures (MTBF)
    - use heuristic arguments to conclude
      - MTBF = (total time T)/(average number of failures)
      - can also argue MTBF = MTTF + MTTR
    - Note: often $\lambda \ll \mu$ and hence MTTF $\gg$ MTTR; therefore the terms MTTF and MTBF are used interchangeably by some practitioners


Reliability Modeling

- Application of the previous analysis to system models
- Assumptions
- system consists of modules
- each module assigned a probability of working R(t), a function of time
- once a module fails it is assumed to yield incorrect results
- module failures are independent


Reliability Modeling

- Application of the previous analysis to system models
  - Reliability block diagrams
    - consider a system - microprocessor, controller, memory, bus, …
    - the system will fail if any of the components fails
    - $R_{sys}$ = P(all subsystems work correctly) = P(bus correct) · P(mem correct) · … (follows from the assumption that component failures are independent)
    - $R_{sys} = R_{bus} \cdot R_{mem} \cdot R_{micro} \cdot R_{cont}$


Reliability Modeling

- Reliability block diagrams - Series Systems
  - Assume system has n components
  - All components should survive for system to operate
  - Reliability of system
    - $R_{sys}(t) = \prod_{i=1}^{n} R_i(t)$
  - For exponential distributions of each component
    - $R_{sys}(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-(\lambda_1 + \lambda_2 + \dots + \lambda_n)t} = \exp\left(-\sum_{i=1}^{n} \lambda_i t\right)$
    - Effect is that the system failure rate is the sum of the failure rates of the components
  - Note these are nonredundant systems
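
The series rule can be sketched in a few lines of Python. The component failure rates below are hypothetical values chosen only to show that the product of exponentials equals a single exponential with the summed rate.

```python
import math

def series_reliability(failure_rates, t):
    """R_sys(t) as the product of exp(-lambda_i * t) over all components."""
    r = 1.0
    for lam in failure_rates:
        r *= math.exp(-lam * t)
    return r

# Hypothetical failure rates (per hour) for four components in series.
rates = [2e-5, 1e-5, 5e-5, 3e-5]
t = 1000.0

r_product = series_reliability(rates, t)
r_summed = math.exp(-sum(rates) * t)     # same result via the summed failure rate
mttf_series = 1.0 / sum(rates)           # MTTF of the series system

print(r_product, r_summed, mttf_series)
```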

[Figure: series reliability block diagram with blocks R1, R2, …, Rn connected in a chain]


Reliability Modeling

- Reliability block diagrams - Parallel Systems
- Assume system with spares
- faulty component is replaced by a spare as fault occurs
- only one component needs to survive for the system to operate
- Model is to represent all components connected in parallel
- $P(\text{sys fail}) = P(M_1 \text{ fails}) \cdot P(M_2 \text{ fails}) \cdots P(M_n \text{ fails})$
- $R_{sys} = 1 - P(\text{sys fail}) = 1 - (1 - R_1)(1 - R_2) \cdots (1 - R_n)$
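
A short Python sketch of the parallel rule follows. It relies on the independence assumption stated earlier; the module reliabilities are hypothetical.

```python
def parallel_reliability(module_reliabilities):
    """R_sys = 1 - product of (1 - R_i): the system fails only if every module fails."""
    p_all_fail = 1.0
    for r in module_reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

# Hypothetical module reliabilities at some fixed mission time.
print(parallel_reliability([0.9]))            # single module
print(parallel_reliability([0.9, 0.9]))       # one spare added
print(parallel_reliability([0.9, 0.8, 0.7]))  # three unequal modules
```

Each extra module multiplies the unreliability by another factor below 1, so the gain per added module shrinks as modules are added.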


Reliability Modeling

- Reliability block diagrams - Series-Parallel Systems
  - straightforward: apply the series and parallel rules repeatedly

- Reliability block diagrams - MTTF of system
  - 1/(system failure rate), when the system failure rate is constant
  - Series systems - 1/(sum of individual failure rates)
  - Parallel systems and series-parallel systems – work out by integration from the reliability or unreliability equations
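
As a worked instance of the integration route, consider a parallel system of two identical modules, each with constant failure rate λ. This particular configuration is an assumed example, not one taken from the slides; it uses the identity MTTF = ∫ R(t) dt over [0, ∞).

```latex
\begin{align*}
R_{sys}(t) &= 1 - \left(1 - e^{-\lambda t}\right)^{2} = 2e^{-\lambda t} - e^{-2\lambda t} \\
\mathrm{MTTF}_{sys} &= \int_0^\infty R_{sys}(t)\,dt
  = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda}
\end{align*}
```

The system reliability is no longer a single exponential, so the shortcut 1/(system failure rate) does not apply, and the second module raises the MTTF by only 50 percent rather than doubling it.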


Reliability Modeling

- Reliability block diagrams - Non series parallel systems
  - Bayes rule: consider a sample space S. Partition it into B and $\bar{B}$ (the complement of B). Now consider an event A that falls partly in B and partly in $\bar{B}$. We can write:

    $A = (A \cap B) \cup (A \cap \bar{B})$

    $P(A) = P[(A \cap B) \cup (A \cap \bar{B})]$

    $= P(A \cap B) + P(A \cap \bar{B})$

    $= P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})$

  - In general the set S can be partitioned into $(B_1, B_2, \dots, B_n)$, giving

    $P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$

  - This can also be viewed graphically (draw a tree)


Reliability Modeling

- Reliability block diagrams - Non series parallel systems
  - Example - consider the following non series parallel system

    [Figure: reliability block diagram with components C1, C2, C3, C4 and C5]

  - list all paths for the system to survive, namely c1c4, c2c4, c2c5, c3c5
  - These paths are not disjoint; the sum of the reliabilities of all paths gives an upper bound on the system reliability
  - Exact computation is possible using Bayes rule – complete in class
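
The exact computation can be sketched in Python using only the four success paths listed above. Conditioning on c2 is one possible choice of partition (c2 is the component shared by two paths); the slides leave the derivation for class, so this is an illustration rather than the instructor's worked solution. The component reliabilities are hypothetical, and a brute-force enumeration of all 32 component states serves as a cross-check.

```python
from itertools import product

NAMES = ["c1", "c2", "c3", "c4", "c5"]
PATHS = [("c1", "c4"), ("c2", "c4"), ("c2", "c5"), ("c3", "c5")]

def exact_by_enumeration(r):
    """Sum the probabilities of all component states in which some path is fully up."""
    total = 0.0
    for states in product([True, False], repeat=len(NAMES)):
        up = dict(zip(NAMES, states))
        if any(all(up[c] for c in path) for path in PATHS):
            p = 1.0
            for name in NAMES:
                p *= r[name] if up[name] else 1.0 - r[name]
            total += p
    return total

def exact_by_conditioning(r):
    """P(sys) = P(sys | c2 up) P(c2 up) + P(sys | c2 down) P(c2 down)."""
    # c2 up: the system works if c4 or c5 works.
    given_up = 1 - (1 - r["c4"]) * (1 - r["c5"])
    # c2 down: only the paths c1c4 and c3c5 remain, and they share no component.
    given_down = 1 - (1 - r["c1"] * r["c4"]) * (1 - r["c3"] * r["c5"])
    return r["c2"] * given_up + (1 - r["c2"]) * given_down

def path_sum_upper_bound(r):
    """Sum of path reliabilities: an upper bound that may exceed 1."""
    total = 0.0
    for path in PATHS:
        p = 1.0
        for c in path:
            p *= r[c]
        total += p
    return total

r = {name: 0.9 for name in NAMES}  # hypothetical component reliabilities
print(exact_by_enumeration(r), exact_by_conditioning(r), path_sum_upper_bound(r))
```

For highly reliable components the path-sum bound is loose, and here it even exceeds 1, which is why the exact conditioning approach is worth the extra effort.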


Reliability Modeling

- Non series parallel systems: upper and lower bounds
  - See the slides provided by Koren and Krishna (authors of the text)


Reliability Modeling

- Combinatorial model
  - Consider an NMR system
  - Assume voter reliability to be 1
  - Divide the set of events leading to success into disjoint events
  - Compute the probability of each event and add them
  - Example – TMR system
  - Can be used to compute MTTF
  - Can also analyze other systems such as an m-of-n system
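
The combinatorial approach can be sketched for a general m-of-n system. The code assumes identical, independent modules with reliability r and a perfect voter, as stated above; the disjoint success events are "exactly k modules work" for k = m, …, n. The function name is an illustrative choice.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent modules work."""
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(m, n + 1))

def tmr_reliability(r):
    """TMR is the 2-of-3 case: 3r^2(1 - r) + r^3 = 3r^2 - 2r^3."""
    return m_of_n_reliability(2, 3, r)

for r in (0.99, 0.9, 0.5, 0.4):
    print(f"module R = {r:.2f}   TMR R = {tmr_reliability(r):.4f}")
```

The output shows that TMR improves on a single module only while the module reliability is above 0.5; below that point the redundant system is less reliable than a simplex one.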


Reliability Modeling

- Markov model
  - Difficulty with the previous models
    - incorporating repairs in the model and analysis
    - incorporating a coverage factor: for example, in a duplicated system we may be less than 100% certain that only the faulty unit will be eliminated when the system is reconfigured
  - Markov modeling - basic
    - Define the concept of state using TMR system example (8 states)
    - Transitions between states occur with certain probabilities
  - Markov model – assumption
    - Probability of transition from a state $s_i$ to $s_j$ is independent of the method of arrival into state $s_i$
  - Example – develop a Markov model for a TMR in class


Reliability Modeling

- Markov model
- Markov model for a TMR – all details not shown

[Figure: state-transition diagram for the TMR system with the eight states 000, 001, 010, 011, 100, 101, 110 and 111; the transitions shown are labeled λΔt and 1 − 3λΔt]


Reliability Modeling

- Markov model - reduced
  - Reduced Markov model for a TMR system
  - The previous eight-state model can be reduced to a three-state model by merging states and recomputing the transition probabilities

- Markov model - accounting for repairs
  - We can include links between states knowing the repair rates of components
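
A reduced three-state TMR model can be stepped forward in time with a difference equation. The sketch below assumes one common way of merging the eight states: "three modules working", "two modules working" and "system failed", with identical modules of constant failure rate λ, a perfect voter and no repair. The slides do not spell out the merged states, so this grouping and all parameter values are assumptions.

```python
import math

def tmr_markov(lam, t_end, dt):
    """Iterate the difference equations of the reduced TMR model up to t_end."""
    p3, p2, pf = 1.0, 0.0, 0.0            # start with all three modules working
    steps = int(round(t_end / dt))
    for _ in range(steps):
        p3, p2, pf = (
            (1 - 3 * lam * dt) * p3,                        # stay with 3 working
            3 * lam * dt * p3 + (1 - 2 * lam * dt) * p2,    # enter or stay with 2 working
            pf + 2 * lam * dt * p2,                         # failed state is absorbing
        )
    return p3, p2, pf

lam = 0.001      # hypothetical module failure rate, per hour
t_end = 500.0
p3, p2, pf = tmr_markov(lam, t_end, dt=0.01)

r_markov = p3 + p2                                           # system works in either state
r_closed = 3 * math.exp(-2 * lam * t_end) - 2 * math.exp(-3 * lam * t_end)
print(r_markov, r_closed)
```

The closed-form value used for comparison is the combinatorial TMR result 3R² − 2R³ with R = exp(−λt), so the two models agree when there is no repair. Adding a repair rate μ would add a term moving probability from the two-working state back to the three-working state.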


Reliability Modeling

- Markov model- analyzing systems
- Consider a duplicate compare system – no repairs
- Develop Markov model with 3 states
- Develop a difference equation for computing probabilities for being in different states of the system
- Develop a differential equation model
- Solution methods
- Numerical approach
- Solving differential equation
- direct approach
- Using Laplace transforms


Reliability Modeling

- Markov model- analyzing systems
- Consider a duplicate compare system – with repairs
- Develop Markov model with 3 states
- Develop a differential equation model
- Solve using Laplace transforms

- Yet one more example
- duplicate compare system – with imperfect coverage
- Develop Markov model with 5 states
- Reduce model for different scenarios


Other Parameters and analysis

- Markov model - can use other parameters
  - Safety
  - Availability
    - Consider a simplex system
    - Develop Markov model with 2 states
    - Solve the system for the probability of the system being in the available state
    - Define and compute steady-state availability
    - Provide an intuitive explanation of the computed value of steady-state availability and its relation to MTTF and MTTR
  - Maintainability
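
The two-state availability model can be sketched as follows. It assumes a constant failure rate λ, a constant repair rate μ, and a system that starts in the working state; under those assumptions the standard solution of the two-state model is A(t) = μ/(λ+μ) + (λ/(λ+μ))·exp(−(λ+μ)t). The parameter values are hypothetical.

```python
import math

def availability(t, lam, mu):
    """A(t) for a simplex system with repair, starting in the working state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def steady_state_availability(lam, mu):
    """Limit of A(t) for large t: mu / (lam + mu)."""
    return mu / (lam + mu)

lam, mu = 0.001, 0.1                  # hypothetical rates, per hour
mttf, mttr = 1 / lam, 1 / mu

a_ss = steady_state_availability(lam, mu)
a_ratio = mttf / (mttf + mttr)        # fraction of time spent in the working state

print(availability(0.0, lam, mu), availability(1000.0, lam, mu), a_ss, a_ratio)
```

The intuition is that over a long period the system alternates between an average up time of MTTF and an average down time of MTTR, so the fraction of time it is available is MTTF/(MTTF + MTTR).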


General remarks

- Voter reliability issue
- Performance and states with degraded performance
- Mission time improvement
- Redundancy Ratio
- Law of diminishing returns


Summary

- Introduction of mathematical models
- Solving models to carry out analysis
- Example systems
  - Duplicate
  - Duplicate with repair
  - Simplex with repair for availability
