Dependability evaluation

Dependability evaluation

Dependability evaluation Reliability - continuity of service - As a function of time, R(t), is the conditional probability that the system performs correctly throughout the interval of time [t0, t], given that the system was performing correctlyat time t0 Availability- the readiness for usage - As a function of time, A(t), is the probability that the system is operating correctly and is available to perform its functions at the instant of time t. Safety - avoidance of catastrophic consequences - As a function of time, S(t), is the probability that the system either behaves correctly or will discontinue its functions in a manner that causes no harm

Basic concepts • Almost all specifications for systems mandate that certain values for dependability be achieved and, in some way, demonstrated • How do we compute the dependability of a system? - MODEL-BASED evaluation of dependability - dependability of a system is calculated in terms of the dependability of individual components

Model-based evaluation of dependability Support analysis in all phases of the system lifecycle - Design phase evaluates various alternatives - Assessing a built system a means for providing insights into specific aspetcts and for suggesting solution for future releases analyse the effects of mantainence options and changes of the system configuration sensitivity analyses of the models for bottleneck analysis and optimizations. Models of complex systems usually requires many parameters and require to determine the value to assign to them which may be very difficult. Sensitivity analyses allows to determinate the critical parameters out of the many that are employed, that is those parameters to which the system is highly sensible.

Model-based evaluation of dependability A model is an abstraction of the system that highlights the importatnt features of the system organization and provides ways of quantifying its properties (neglecting all those details that are relevant for the implementation but that are marginal for the objective of the sudy) Main problems: - complexity - model parameters assume values whose order of magnitude differ significantly from each other Basic approach: divide and conquer. The solution of the entire model is constructed on the basis of the solutions of its individual sub-modules

Model-based evaluation of dependability Modelling methodologies Methodologies that employ combinatorial models Reliability Block Diagrams, Fault-tree, …. State space representation methodologiesMarkov chains, Petri nets, SANs, …

Model-based evaluation of dependability • Combinatorial methods • Do not require enumeration of the system states • Offer simple and intuitive methods of the construction and solutions of models • Inadequate to deal with systems that exibits complex dependencies among components and repairable systems. • State based approaches • Demand the collection of variables, the values of which define the state of the system at a given point • They can be • - deterministic their behaviour is exactly determined • - stocastic (probabilistic) they have probabilistic nature

Packages for dependability evaluation SHARPE http://people.ee.duke.edu/~kst/ SHARPE, (Symbolic Hierarchical Automated Reliability and Performance Evaluator) is a tool for specifying and analyzing performance, reliability and performability models. SURF-2 http://www.laas.fr/surf/surf-uk.html SURF-2 tool for hardware and software systems, based on numerical resolution of Markov models. System behaviour is modelized by either a Markov Chain or a Generalized Stochastic Petri Net (GSPN). UltraSAN http://www.crhc.uiuc.edu/UltraSAN/UltraSAN.html UltraSAN is a software package for hierarchical model-based evaluation of systems represented as stochastic activity networks (SANs). MOBIUS http://www.mobius.uiuc.edu/ stochastic extensions to Petri nets, Markov chains and extensions, and stochastic process algebras ……. ….

Hardware Reliability • Let X a random variable representing the time to failure of a component • fX(t) probability density function (P(X=t)) • FX (t) cumulative distribution function (P(X<=t)) • Reliability of the component at time t is given by RX(t) = P[X > t] = 1 –P[X <= t] = 1 –FX (t) Unreliability of the component at time t is given by QX(t) = P[X <= t] = FX (t) Failure rate at time tprobability that a component that has not yet failed, fails in the interval (t, t+t) as t->0 P[X ∈ (t, t + ∆t)] = fX(t)

zX(t) = P[X(t, t+ D t) | X>t] hazard rate function zX(t) = fx(t) / 1 – Fx(t) if z(t) = then R(t) = e–t the exponential relation between reliability and time is known as exponential failure law Unreliability of a component at time t Q(t) = 1 – R(t) Q(t) = 1 - e–lt Failure rate l (failures for million hours)assume one failure every 2000 hours: l = 1/2000

Combinatorial models

Combinatorial models • If the system does not contain any redundancy, that is any component must function properly for the system to work, and if component failures are independent, then • the system reliability is the product of the component reliability, and it is exponential • - the failure rate of the system is the sum of the failure rates of the individual components

Combinatorial models

Combinatorial models • If the system contain redundancy, that is a subset of components must function properly for the system to work, and if component failures are independent, then • the system reliability is the reliability of a series/parallel combinatorial model

Series C2 C1 C3 source sink C2 C2 C1 C1 C3 C3 Parallel sink sink source source M of N 2 of 3 Reliability Blocks diagrams • Blocks are componentsconnected among them to represent the temporal order with which the system uses components, or the management of redundancy schemes or the success critera of the system • System failure occurs if there is no path from source to sink

Example Multiprocessor with 2 processors and three shared memories -> assume the system fails if 2 or more of the memories fail and both processors fail (operational if at least one processor and one memory are operational) Series/Parallel SHARPE toolAssume Qp(t) = 0.0138 with t = 10 days Qm(t) = 0.00692 with t = 10 days

1 block arch1(k,n,pfail,mfail) 2 comp proc prob(pfail) 3 comp mem prob(mfail) 4 parallel procs proc proc 5 kofn mems k,n,mem 6 series top procs mems 7 end 8 loop k,1,3,1 9 expr 1 - sysprob(arch1;k,2,.0138,.00692) 10 end 11 end k operational memories, n operational processors, pfail value of failure probablity of processor, mfail value of failure probability of memoriesbottom-up description of the system (top: serie of parallel modules) sysprob(…) computes system failure probability 1-sysprob(…) reliability

Output: K=1 1 - sysprob(arch1;k,2,.0138,.00692): 0. 99981 K=2 1 - sysprob(arch1;k,2,.0138,.00692): 0. 99967 K=3 1 - sysprob(arch1;k,2,.0138,.00692): 0. 97920 We note that: increment of reliability is significant from three to two operational memories requirement (after ten days, one memory: 99.8%; two memories: 99.7%; three memories: 97.8%) A failure probability function can be assigned to components by specifying the failure rate: the exponential failure law is assumed. 1 block arch1(k,n,pfrate,mfrate) 2 comp proc exp(pfrate) 3 comp mem exp(mfrate) 4 parallel procs proc proc 5 kofn mems k,n,mem 6 series top procs mems 7 end lp = pfrate = 0.00139 lm = mfrate = 0.00764

Fault Trees • FT considers the combination of events that may lead to an unsdesirable situation of the system (the delivery of improper service for a Reliability study, catastrophic failures for a Safety study) • Describe the scenarios of occurrence of events at abstract level • Hierarchy of levels of events linked by logical operators • Applicable both at design phase and operational phase

OR OR OR Fault Trees TOP EVENT G0 G2 G3 GATE SYMBOL AND E3 G4 EVENT SYMBOL E2 E1 E4 E2 Describes the Top Event (status of the system) in terms of the status (faulty/non faulty) of the Basic events (system’s components)

C2 C2 C2 AND C3 C1 C1 C1 C3 C3 OR 2 of 3 Fault Trees • Components are leaves in the tree • Component faulty corresponds to logical value true, otherwise false • Nodes in the tree are boolen AND, OR and k of N gates • The system fails if the root is true AND gate • True if all the components are true (faulty) OR gate • True if at least one of the components is true (faulty) K of N gate • True if at least k of the components are true (two or three components) (faulty)

2of3

AND AND OR P1 P2 M1 M3 M2 Example: Multiprocessor with 2 processors and three shared memories -> the computer fail if all the memories fail and all the processors fail (operational if at least one processor and one memory are operational) Top event System failure A cut is defined as a set of elementary events that, according to the logic expressed by the FT, leads to the occurrence of the root event. To estimate the probability of the root event, compute the probability of occurrence for each of the cuts and combine these probabilities

Conditioning Fault Trees • If the same component appears more than once in a fault tree, it violates the independent failure assumption (conditioned fault tree) • Example Multiprocessor with 2 processors and three memories: M1 private memory of P1M2 private memory of P2, M3 shared memory. Top event system • Assume every process has its own private memory plus a shared memory. • Operational condition: at least one processor is active and can access to its private or shared memory. • repeat instruction:given a component C whether or not the component is input to more than one gate, the component is uniqueM3 is a shared memory AND OR OR AND AND

Conditioning Fault Trees • If a component C appears multiple times in the FT • Qs(t) = QS|C Fails(t) QC(t) + QS|C not Fails(t) (1-QC(t)) • where • S|C Fails is the system given that C fails • and • S|C not Fails is the system given that C has not failed

TOP 1 5 2 G1 AND 4 3 OR Minimal cut sets Cut Sets Top = {1}, {2} , {G1} , {5} = {1}, {2} , {3, 4} , {5} Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5}

TOP 1 5 2 G1 AND 4 3 OR Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5} Qi (t) = probability that all components in the minimal cut set i are faulty Qi (t) = q1(t) q2(t) … qni(t) -ni number of components of the minimal cut i -independent faults of the components

TOP 1 5 2 G1 AND 4 3 OR Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5} W(t) = probability that the system fails for a minimal cut set at time t W(t) = q2(t) q3(t) … qni(t) w1(t) + q1(t) q3(t) … qni(t) w2(t) + ………….. q1(t) q2(t) … qni-1(t) wni(t) wi(t) is the failure rate of component i

TOP 1 5 2 G1 AND 4 3 OR Minimal Cut Sets Top = {1}, {2} , {3, 4} , {5} QTop (t) = Q1 (t) + … + QN (t) N number of mininal cut sets (MCS)

Fault Trees • Definition of the Top event • Analysis of failure models of components • Minimal cut set minimal set of events that leads to the top event -> critical path of the system (#MCS =1 or #MCS = n) Analysis: - Failure probability of Basic events - Failure probability of minimal cut sets - Failure probability of Top event - Single point of failure of the system: minimal cuts with one event

source source 3 1 1 2 2 C3 C2 C1 Series sink C1 C2 Parallel sink C3 Reliability Graphs (An alternative method to Fault Trees) • Graph with nodes and arcs • The arcs represent components and have failure distributions • There are two special nodes source and sink • A failure of the system occurs if there is no path from source to sink The system is operational if there exists a path from source to sink • Reliability graphs can implement complex interactions using conditioning

Example Multiprocessor with 2 processors and three memories: M1 private memory of P1M2 private memory of P2, M3 shared memory. • I1 and I2 and the special node (namedshare) are used to model memory M3 • Associate to arc I1 and I2 and infinite failure distribution function • There is a path between source and sink iff:(P1 is operational) and (M1 or M3 are operational)(P2 is operational) and (M2 or M3 are operational)

FT/FMEA Failure Mode Effect Analysis (FMEA):is a step-by-step approach for identifying all possible failures in a design, a manufacturing or assembly process, or a product or service. • FMEA vulnerability to single failures is analysed(FMEA does not consider multiple failures) • FTallows to describe the case in which the occurrence of an event depends on multiple failures Fault-trees often used in conjunction with FMEA

FMEA FMEA is used during design to prevent failures. Later it’s used for control, before and during ongoing operation of the process. Ideally, FMEA begins during the earliest conceptual stages of design and continues throughout the life of the product or service. Begun in the 1940s by the U.S. military, FMEA was further developed by the aerospace and automotive industries. Several industries maintain formal FMEA standards.

Example FMEA performed by a Bank on ATM system

FMEA tables • Identify the functionality of the system • Identify all the ways a failure could happen. These are potential failure modes. • FMEA is applied to the system and to any component. • Define a Table with the following information : • 1) potential effects of failurefor each failure mode, identify all the consequences on the component and on the system.determine how serious each effect is. This is the severity rating, or S. Severity is usually rated on a scale from 1 to 10, where 1 is insignificant and 10 is catastrophic.

FMEA tables • 2)List all possible causes for each failure mode. For each cause, determine the occurrence rating, or O. This rating estimates the probability of failure occurring for that reason during the lifetime of your scope. Occurrence is usually rated on a scale from 1 to 10, where 1 is extremely unlikely and 10 is inevitable. • 3)For each cause, identify current process controls. • These are tests, procedures or mechanisms that you have in place to keep failures from reaching the customer. These controls might prevent the cause from happening, reduce the likelihood that it will happen or detect failure after the cause has already happened but before the customer is affected.

FMEA tables • For each control, determine the detection rating, or D. This rating estimates how well the controls can detect either the cause or its failure mode after they have happened but before the customer is affected. Detection is usually rated on a scale from 1 to 10, where 1 means the control is absolutely certain to detect the problem and 10 means the control is certain not to detect the problem (or no control exists). • Calculate the risk priority number, or RPN, which equals S × O × D. Also calculate Criticality by multiplying severity by occurrence, S × O.

FMEA tables • FMEA table numbers provide guidance for ranking potential failures in the order they should be addressed. • Identify recommended actions. These actions may be additional controls to improve detection. • Note that:FMEA allows to associate a cause, i.e., the failure mode of a simple component, to the system failure event.

State based methods

State based methods • State of the system a state represents all that must be known to describe the system at any given instant of timeFor reliability models, each state represents a distinct combination of failed and working modules. • State transitions govern the changes of state that occur within a system-For reliability models, as time passes the system goes from state to state as modules fail and repair. The state transitions are characterized by probabilities, such as the probability of failure and the probability of repair Random processes

Random process A random process is a collection of random variables indexed by time Examples of random process {Xt}, with time T = {1, 2, 3. …} 1) Let X1 be the result of tossing a dieLet X2 be the result of tossing a die, and so on. {Xt} represents the sequence of results of tossing a die P[X1 = 4] = 1/6 P[X4 = 4 | X2 = 2] = P[X4 = 4 ] =1/6 2) Let X1 be the result of tossing a die. Let X2 be the result of tossing a die plus X1, and so on (Xi = the result of tossing a die plus Xi-1) P[X2 = 12] = 1/36 P[X3 = 14 | X1 = 2] = 1/36

Random process Discrete-time random processif the number of time points defined for the random process is finite or countable (e.g., integers) Continuous-time random process if the number of time points defined for the random process is uncountable (e.g., real numbers) Example:Let Xt be the number of fault arrivals in a system up to time t.If t is a real number , Xt is a continuous-time random process

Random process state space State space S of a random process X:is the set of all possible values the process can take S = {y: Xt = y, for some t} If X is a random process that models a system, then the state space S of X can represent the set of all possible configuration that the system could be in

Random process state space Discrete-state random process Xif the state space of random process X is finite or countable (e.g., S = {1,2,3,…})If the number of states of the process is finite, N denotes this number. Continuous-state random process Xif the state space of random process X is infinite and uncountable (e.g., S = the set of real numbers) Let X be a random process that represents the number of bad packets received over a network. X is a discrete-state random process Let X be a random process that represents the voltage on a telephone line. X is a continuous-state random process

Markov process • Let {Xt, t>=0} be a random process. A special type of random process is called the Markov process. Basic assumption underlying Markov process: the probability of a given state transition depends only on the current state the future behaviour is independent of past values (memoryless property)

Markov process: steady-state transition probabilities • Let {Xt, t>=0} be a Markov process. The Markov process X has steady-state transition probabilities if for any pair of states i, j: The probability of transition from state i to state j does not depend by the time. This probability is called pij

Markov chain • A Markov chain is a Markov process with discrete-state space S. • A Markov chain is omogeneous if X has steady-state transition probabilities; non-omogeneous otherwise • A Markov chain is a finite-state Markov chain if the number of states is finite (N). We consider discrete-time omogeneous Markov chains (DTMC)

Transition probability matrix • If a Markov chain is finite-state, we can define the transition probability matrix P (NxN) pij = probability of moving from state i to state j in one step row i of matrix P: probability of make a transition starting from state i column j of matrix P: probability of making a transition from any state to state j

Parr Pidle Pbusy 1 2 Pcom Pr Pfi 3 Pfb Pff An example A computer is idle, working or failed. When the computer is idle jobs arrives with a given probability. When the computer is idle or busy itmay fail with probability Pfi or Pfb, respectively S={1,2,3} 1 computer idle 2 computer working 3 computer failed

Dependability evaluation