Dependability evaluation

Dependability evaluation

Dependability evaluation Reliability - continuity of correct service - As a function of time, R(t), is the conditional probability that the system performs correctly throughout the interval of time [t0, t], given that the system was performing correctly at time t0 Availability- the readiness for usage - As a function of time, A(t), is the probability that the system is operating correctly and is available to perform its functions at the instant of time t. Safety - avoidance of catastrophic consequences - As a function of time, S(t), is the probability that the system either behaves correctly or will discontinue its functions in a manner that causes no harm throughout the interval of time [t0, t], given that the system was performing correctly at time t0

Quantitative evaluation of Dependability • Does the arrival time of faults fit a probability distribution? If so, what are the parameters of that distribution? • Consider the time to failure of a system or component. It is not exactly predictable - random variable. Approaches to quantitative evaluation of failure rate, Mean Time To Failure (MTTF), Mean Time To Repair (MTTR), reliability analysis, availability analysis and safety probability theory

Basic concepts • Almost all specifications for systems mandate that certain values for dependability be achieved and, in some way, demonstrated • How do we compute the dependability of a system? - MODEL-BASED evaluation of dependability(a model is an abstraction of the system that highlightsthe important features for the objective of the study) - dependability of a system is calculated in terms of the dependability of individual components“divide And conquer approach”: the solution of the entiremodel is constructed on the basis of the solutions of individual sub-models

Model-based evaluation of dependability Dependability modelling methodologies A distinction can be made between Methodologies that employ combinatorial models Reliability Block Diagrams, Fault-tree, …. State space representation methodologiesMarkov chains, Petri nets, SANs, …

Model-based evaluation of dependability • Combinatorial methods • Do not require enumeration of the system states • Offer simple and intuitive methods of the construction and solutions of models • Inadequate to deal with systems that exhibits complex dependencies among components and repairable systems. • State based approaches • The collection of variables that define the state of the system must be defined • Require the enumeration of system states • They can be • - deterministic their behaviour of the system is exactly determined • - stocastic (probabilistic) they have probabilistic nature (appropriate in dependability evaluation)

Packages for dependability evaluation SHARPE http://people.ee.duke.edu/~kst/ SHARPE, (Symbolic Hierarchical Automated Reliability and Performance Evaluator) is a tool for specifying and analyzing performance, reliability and performability models. SURF-2 http://www.laas.fr/surf/surf-uk.html SURF-2 tool for hardware and software systems, based on numerical resolution of Markov models. System behaviour is modelized by either a Markov Chain or a Generalized Stochastic Petri Net (GSPN). UltraSAN http://www.crhc.uiuc.edu/UltraSAN/UltraSAN.html UltraSAN is a software package for hierarchical model-based evaluation of systems represented as stochastic activity networks (SANs). MOBIUS http://www.mobius.uiuc.edu/ stochastic extensions to Petri nets, Markov chains and extensions, and stochastic process algebras ……. ….

Hardware Reliability • Let X a random variable representing the time to failure of a component (time to the next failure of the component) • Assume X is a discrete random variable (possible values t1, t2, t3, …) • fX(t) probability density function (P(X=t)) • FX (t) cumulative distribution function (P(X<=t)) • Reliability of the component at time t is given by RX(t) = P[X > t] = 1 – P[X <= t] = 1 –FX (t) reliability function R(t) is the probability of not observing any failure before time t Unreliability of the component at time t is given by QX(t) = P[X <= t] = FX (t)

fx(t) hazard function zX(t) = 1 – Fx(t) Hardware Reliability The hazard function (also known as the failure rate, hazard rate, or force of mortality) is the ratio of the probability density function to the survival function, given by P[X = t] = fX(t) Survival function: P[X>t] = 1- P[X<=t] = 1 – Fx(t) Two functions that are of particular interest in reliability theory: - R(t) reliability function • z(t) failure rate (hazard rate) function

Hardware Reliability z(t) is the rate at which failures occurr in time z(t) =lconstant > 0 in the useful life period Failure rate l (failures for million hours)assume one failure every 2000 hours: l = 1/2000 In reality, z(t) is a function of time as depicted in the bathtub-shaped curve. During early life, there is a higher failure rate, calleld infant mortality, due to the failures of weaker components. Often these infant mortalities result from defetct or stress introduced in the manufacturing process. Once the infant mortalities are eliminated, the system settles into operational life, in which the failure rate is approximately constant. The system then approaches wear-out in which time and use cause the failure rate to increase.

Hardware Reliability Fault distribution model: exponential distribution f(t) = e–t F(t) = 1- e–t R(t) = e–t z(t) = the exponential relation between reliability and time is known as exponential failure law (where e is the base of the natural logarithm e= 2.718) Mean time to failure (MTTF) is the expected time that a system will operate before the first failure occurs (e.g., 2000 hours) MTTF= 1/λ • l = 1/2000 0.0005 per hour • MTTF = 2000 time to the firt failure 2000 hours

Failure Rate - Handbooks of failure rate data for various components are available from government and commercial sources. - Reliability Data Sheet of product • Commercially available databases- Military Handbook MIL-HDBK-217F- Telcordia, - PRISM User’s Manual, - International Eletrotechnical Commission (IEC) Standard 61508 - … Databases used to obtain reliability parameters in ‘’Traditional Probabilistic Risk Assessment Methods for Digital Systems’’, U.S. Nuclear Regulatory Commission, NUREG/CR-6962, October 2008

Distribution model for permanent faults MIL-HBDK-217 (Reliability Prediction of Electronic Equipment -Department of Defence) is a model for chip failure. Statistics on electronic components failures are studied since 1965 (periodically updated). Typical component failure rates in the range 0.01-1.0 per million hours. Failure rate for a single chip : l= τLτQ(C1τT τV+ C2τE) τL= learning factor, based on the maturing of the fabrication process τQ= quality factor, based on incoming screening of components τT= temperature factor, based on the ambient operating temperature and the type of semiconductor process τE= environmental factor, based on the operating environment τV= voltage stress derating factor for CMOS devices C1, C2 = complexity factors, based on the number of gates, or bits for memories in the component and number of pins. (From Reliable Computer Systems.D. P. Siewiorek R.S. Swarz, Prentice Hall, 1992)

Combinatorial models

Combinatorial models • If the system does not contain any redundancy, that is any component must function properly for the system to work, and if component failures are independent, then • the system reliability is the product of the component reliability, and it is exponential • - the failure rate of the system is the sum of the failure rates of the individual components

Combinatorial models

Combinatorial models • If the system contain redundancy, that is a subset of components must function properly for the system to work, and if component failures are independent, then • the system reliability is the reliability of a series/parallel combinatorial model

Series C2 C1 C3 source sink C2 C2 C1 C1 C3 C3 Parallel sink sink source source M of N 2 of 3 Reliability Blocks diagrams • Blocks are componentsconnected among them to represent the temporal order with which the system uses components, or the management of redundancy schemes or the success critera of the system • System failure occurs if there is no path from source to sink

Example Multiprocessor with 2 processors and three shared memories -> analysis under different conditions Series/Parallel SHARPE toolAssume Qp(t) = 0.0138 with t = 10 days Qm(t) = 0.00692 with t = 10 days

1 block arch1(k,n,pfail,mfail) 2 comp proc prob(pfail) 3 comp mem prob(mfail) 4 parallel procs proc proc 5 kofn mems k,n,mem 6 series top procs mems 7 end 8 loop k,1,3,1 9 expr 1 - sysprob(arch1;k,2,.0138,.00692) 10 end 11 end k operational memories, n operational processors, pfail value of failure probablity of processor, mfail value of failure probability of memoriesbottom-up description of the system (top: serie of parallel modules) sysprob(…) computes system failure probability 1-sysprob(…) reliability

Output: K=1 1 - sysprob(arch1;k,2,.0138,.00692): 0. 99981 K=2 1 - sysprob(arch1;k,2,.0138,.00692): 0. 99967 K=3 1 - sysprob(arch1;k,2,.0138,.00692): 0. 97920 We note that: increment of reliability is significant from three to two operational memories requirement (after ten days, one memory: 99.8%; two memories: 99.7%; three memories: 97.8%) A failure probability function can be assigned to components by specifying the failure rate: the exponential failure law is assumed. 1 block arch1(k,n,pfrate,mfrate) 2 comp proc exp(pfrate) 3 comp mem exp(mfrate) 4 parallel procs proc proc 5 kofn mems k,n,mem 6 series top procs mems 7 end lp = pfrate = 0.00139 lm = mfrate = 0.00764

Fault Trees • FT considers the combination of events that may lead to an unsdesirable situation of the system (the delivery of improper service for a Reliability study, catastrophic failures for a Safety study) • Describe the scenarios of occurrence of events at abstract level • Hierarchy of levels of events linked by logical operators • The analysis of the fault tree evaluates the probability of occurrence of the root event, in terms of the status of the leaves (faulty/non faulty) • Applicable both at design phase and operational phase

OR OR OR Fault Trees TOP EVENT G0 G2 G3 GATE SYMBOL AND E3 G4 EVENT SYMBOL E2 E1 E4 E5 Describes the Top Event (status of the system) in terms of the status (faulty/non faulty) of the Basic events (system’s components)

C2 C2 C2 AND C3 C1 C1 C1 C3 C3 OR 2 of 3 Fault Trees • Components are leaves in the tree • Component faulty corresponds to logical value true, otherwise false • Nodes in the tree are boolen AND, OR and k of N gates • The system fails if the root is true AND gate • True if all the components are true (faulty) OR gate • True if at least one of the components is true (faulty) K of N gate • True if at least k of the components are true (two or three components) (faulty)

2of3

AND AND OR P1 P2 M1 M3 M2 Example: Multiprocessor with 2 processors and three shared memories -> the computer fail if all the memories fail or all the processors fail Top event System failure A cut is defined as a set of elementary events that, according to the logic expressed by the FT, leads to the occurrence of the root event. To estimate the probability of the root event, compute the probability of occurrence for each of the cuts and combine these probabilities

Conditioning Fault Trees • If the same component appears more than once in a fault tree, it violates the independent failure assumption (conditioned fault tree) • Example Multiprocessor with 2 processors and three memories: M1 private memory of P1M2 private memory of P2, M3 shared memory. Top event system • Assume every process has its own private memory plus a shared memory. • Operational condition: at least one processor is active and can access to its private or shared memory. • repeat instruction:given a component C whether or not the component is input to more than one gate, the component is uniqueM3 is a shared memory AND OR OR AND AND

Conditioning Fault Trees • If a component C appears multiple times in the FT • Qs(t) = QS|C Fails(t) QC(t) + QS|C not Fails(t) (1-QC(t)) • where • S|C Fails is the system given that C fails • and • S|C not Fails is the system given that C has not failed

TOP 1 5 2 G1 AND 4 3 OR Minimal cut sets A cut is defined as a set of elementary events that, according to the logic expressed by the FT, leads to the occurrence of the root event. Cut Sets Top = {1}, {2} , {G1} , {5} = {1}, {2} , {3, 4} , {5} Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5}

TOP 1 5 2 G1 AND 4 3 OR Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5} independent faults of the components Qi (t) = probability that all components in the minimal cut set i are faulty Qi (t) = q1(t) q2(t) … qni(t) where ni is the number of components of the minimal cut i The numerical solution of the FT is performed by computing the probability of occurrence for each of the cuts, and by combining those probabilities to estimate the probability of the root event

TOP 1 5 2 G1 AND 4 3 OR Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5} QTop (t) = Q1 (t) + … + QN (t) N number of mininal cut sets (MCS)

Fault Trees • Definition of the Top event • Analysis of failure models of components • Minimal cut set minimal set of events that leads to the top event -> critical path of the system (#MCS =1 or #MCS = n) Analysis: - Failure probability of Basic events - Failure probability of minimal cut sets - Failure probability of Top event - Single point of failure of the system: minimal cuts with one event

source source 3 1 1 2 2 C3 C2 C1 Series sink C1 C2 Parallel sink C3 Reliability Graphs (An alternative method to Fault Trees) • Graph with nodes and arcs • The arcs represent components and have failure distributions • There are two special nodes source and sink • A failure of the system occurs if there is no path from source to sink The system is operational if there exists a path from source to sink • Reliability graphs can implement complex interactions using conditioning

Example Multiprocessor with 2 processors and three memories: M1 private memory of P1M2 private memory of P2, M3 shared memory. • I1 and I2 and the special node (namedshare) are used to model memory M3 • Associate to arc I1 and I2 and infinite failure distribution function • There is a path between source and sink iff:(P1 is operational) and (M1 or M3 are operational)(P2 is operational) and (M2 or M3 are operational)

FMEA Failure Mode Effect Analysis (FMEA):is a step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. • FMEA vulnerability to single failures is analysed(FMEA does not consider multiple failures)

FMEA FMEA is used during design to prevent failures. Later it’s used for control, before and during ongoing operation of the process. Ideally, FMEA begins during the earliest conceptual stages of design and continues throughout the life of the product or service. Begun in the 1940s by the U.S. military, FMEA was further developed by the aerospace and automotive industries. Several industries maintain formal FMEA standards FMEA: current knowledge and actions about the risks of failures

Example FMEA performed by a Bank on ATM system

FMEA tables • Identify the functionality of the system • Identify all the ways a failure could happen. These are potential failure modes. • FMEA is applied to the system and to any component. • Define a Table with the following information : • 1) potential effects of failurefor each failure mode, identify all the consequences on the component and on the system.determine how serious each effect is. This is the severity rating, or S. Severity is usually rated on a scale from 1 to 10, where 1 is insignificant and 10 is catastrophic.

FMEA tables • 2)List all possible causes for each failure mode. For each cause, determine the occurrence rating, or O. This rating estimates the probability of failure occurring for that reason during the lifetime of your scope. Occurrence is usually rated on a scale from 1 to 10, where 1 is extremely unlikely and 10 is inevitable. • 3)For each cause, identify current process controls. • These are tests, procedures or mechanisms that you have in place to keep failures from reaching the customer. These controls might prevent the cause from happening, reduce the likelihood that it will happen or detect failure after the cause has already happened but before the customer is affected.

FMEA tables • For each control, determine the detection rating, or D. This rating estimates how well the controls can detect either the cause or its failure mode after they have happened but before the customer is affected. Detection is usually rated on a scale from 1 to 10, where 1 means the control is absolutely certain to detect the problem and 10 means the control is certain not to detect the problem (or no control exists). • Calculate the risk priority number, or RPN, which equals S × O × D. Also calculate Criticality by multiplying severity by occurrence, S × O.

FMEA tables • FMEA table numbers provide guidance for ranking potential failures in the order they should be addressed. • Identify recommended actions. These actions may be additional controls to improve detection. • Note that:FMEA allows to associate a cause, i.e., the failure mode of a simple component, to the system failure event.

FT/FMEA Fault-trees often used in conjunction with FMEA • FMEA vulnerability to single failures is analysed(FMEA does not consider multiple failures) • FTallows to describe the case in which the occurrence of an event depends on multiple failures

Dependability evaluation

Dependability evaluation

Presentation Transcript

Dependability

Fundamental Concepts of Dependability

Ability and Dependability

CRT Dependability

Arcade: A formal, extensible, model-based dependability evaluation framework

UML and Dependability Analysis

Dependability

Dependability and real-time

Dependability

Dependability and Security : Positioning*

Dependability Evaluation through Markovian model

Modeling Stream Processing Applications for Dependability Evaluation

Adaptation and Dependability

Dependability Policy Development

Dependability and Security

Dependability evaluation

Basic Concepts of Dependability

Chapter 13 – Dependability engineering

Dependability Evaluation through Markovian model

Dependability QoS

Dependability: