Probability Distribution of Some Time Characteristics of Fault-Tolerant Systems

Sergey Frenkel, Institute of Informatics Problems RAS, Moscow, Russia, slf-ipiran@mtu-net.ru

Fault Detection Latency
A fault may persist for a while before causing an error. This interval of time is called the fault latency, and during it the fault is a latent fault. The interval between fault occurrence and fault detection can be critical, especially for on-line testing: the later a fault is detected after its occurrence, the higher the chance that an erroneous result slips through undetected. (In off-line testing, by contrast, faults are detected before the circuit performs its intended operation.) The shorter the latency, the less damage the fault is likely to cause and the more likely the system is to recover.
                   fault occurrence          output error occurrence
|------------------------|-----------------------------|
  Fault-free operation        Fault latency time

What kind of situations should be considered (N.-J. Park et al., Reliability Modeling and Analysis of Self-Healing Massively Parallel Computing Systems, 2004)
In a steady state, the reliability at time t is R(t) = Prob(In_Op) + Prob(Healable_Failure) = 1 - Prob(Failure), with P_NON-HEAL(Failure) >> P_HEAL(Failure).

FSM in Digital System Design
[Figure labels: Informal Specification; A Behavioral Model; FSM]

FSM Fault Latency
$M = (X, Y, S, \delta, \lambda)$ is a fault-free automaton and $M_f = (X, Y, S, \delta_f, \lambda_f)$ is a faulty one. Fault latency is the length of time between the occurrence of a fault and the appearance of an error due to that fault.
                   fault occurrence          output error occurrence
|------------------------|-----------------------------|
  Fault-free operation        Fault latency time
The detection latency of a fault f is k, counted from the step $\tau$ at which the states of the two automata first differ while their outputs still agree:
$\tau = \min_t \{\, t : S_f(t) \ne S(t),\ Y(t) = Y_f(t) \,\}$, $k = \min_m \{\, m : Y(\tau + m) \ne Y_f(\tau + m) \,\}$, $m = 0, 1, \dots$
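The definition of τ and k above can be illustrated by running a fault-free and a faulty FSM side by side on the same input sequence. The following is a minimal sketch, not the author's tool: the 2-bit counter, the stuck-at-1 input fault, and the names `good_step`, `faulty_step` and `fault_latency` are hypothetical, and the sketch assumes that if the outputs differ at the very step where the states first diverge, the latency is k = 0.

```python
from typing import Callable, Optional, Sequence, Tuple

# One FSM step maps (state, input) to (next_state, output).
Step = Callable[[int, int], Tuple[int, int]]


def good_step(state: int, x: int) -> Tuple[int, int]:
    """Hypothetical fault-free FSM: a 2-bit counter that advances when x = 1."""
    nxt = (state + x) % 4
    return nxt, nxt >> 1  # the output is the high bit of the next state


def faulty_step(state: int, x: int) -> Tuple[int, int]:
    """The same FSM with its input stuck at 1 (assumed fault)."""
    return good_step(state, 1)


def fault_latency(good: Step, bad: Step, s0: int,
                  inputs: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return (tau, k) for one input sequence.

    tau is the first step at which the two state trajectories differ,
    k is the number of further steps until the outputs differ.
    Either value is None if the event is not observed within `inputs`.
    """
    s, sf = s0, s0  # both FSMs start in the same state
    tau = None
    for t, x in enumerate(inputs):
        s, y = good(s, x)
        sf, yf = bad(sf, x)
        if tau is None and s != sf:
            tau = t
        if y != yf:
            if tau is None:  # outputs differ although the states agree
                tau = t
            return tau, t - tau
    return tau, None


print(fault_latency(good_step, faulty_step, 0, [0, 0]))
```

With the inputs `[0, 0]` the states diverge at step 0 and the outputs first differ one step later, so the fault stays latent for one step; an all-ones input sequence never activates this particular fault.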

Fault: x4  1

Possible faulty behavior modes
• F - Fault-free mode. The FSM remains in the fault-free mode until a fault occurs.
• L - Latent mode. The presence of a fault cannot be detected because a test vector detecting the fault has not yet appeared at the FSM input. The FSM moves from the fault-free mode to the latent mode when a fault occurs, and leaves the latent mode when a test vector is applied to the circuit.
• S - Silent mode. The fault does not manifest itself at an output.
• E - Erroneous mode. The FSM terminates its proper functioning, i.e. a non-code output vector has been produced. The FSM can move to this mode either from the silent mode or from the latent mode.
[Figure panels: Under permanent faults; Under transient faults]

FSM-based models of the timing behavior of faulty systems
[Diagram labels: FSM-based system model; Markov chain; self-healing under transient faults; permanent fault latency; coupling of Markov chains; FSM product; single automaton; network of component FSMs]
The term "coupling" reflects the fact that the two objects are related in this way.

Product of Fault-free and Faulty FSMs (P-FSM)
The state space of the product FSM is the set of all pairs $\{(s_i, s_j^f)\}$, $i, j = 1, \dots, n$, where $s_i$ is a state of the fault-free FSM and $s_j^f$ is a state of the FSM with fault f.

Fault latency as time to absorption in a Markov chain [Shedletsky J., McCluskey E., "The Error Latency of a Fault in a Sequential Digital Circuit", IEEE Transactions on Computers, vol. 25, no. 6, 1976]
Input: $X = \{x_1, \dots, x_n\}$, where the $x_i$ are binary random variables that are independent both of each other and from one time step to the next, with $\mathrm{Prob}(x_i = 1) = p_1$ and $\mathrm{Prob}(x_i = 0) = p_0$.
Assumptions: the initial states of the two FSMs are equal, and X is applied simultaneously to both the faulty and the fault-free FSM.
In this case the transitions of the P-FSM form a Markov chain with state space $\{(s, s_f)\} \cup \{s_A\}$, where $s_A$ is an absorbing state that corresponds to the outputs of the two FSMs becoming unequal. The transition matrix has size $(n^2 + 1) \times (n^2 + 1)$, where n is the number of states of the FSM considered.
Absorbing state of the P-FSM: $\{(s, s_f) : Y(s) \ne Y(s_f)\}$.

The probabilities of absorption at exactly step n, for each initial state, are given by $P(n) = Q^{n-1} R$, where Q holds the transition probabilities among the transient states and R holds the one-step absorption probabilities. The initial distribution $\pi(0)$ is a vector of $n^2 + 1$ initial-state probabilities in which only the components for pairs (i, i) can be non-zero, because the initial states of the faulty and fault-free FSMs are assumed to be equal.
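As a minimal sketch of how the latency distribution follows from the formula above, the code below evaluates Pr(latency = n) as the initial distribution times Q^(n-1) times R. The two-state transient block `Q`, the absorption column `R` and the initial vector `pi0` are hypothetical numbers chosen only so that every row of the full chain sums to one; they are not taken from any FSM in the presentation.

```python
from fractions import Fraction


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def latency_distribution(pi0, Q, R, n_max):
    """Return [Pr(latency = n) for n = 1..n_max], i.e. pi0 * Q^(n-1) * R."""
    probs = []
    v = list(pi0)  # v holds pi0 * Q^(n-1) for the current n
    for _ in range(n_max):
        probs.append(sum(vi * ri for vi, ri in zip(v, R)))
        v = vec_mat(v, Q)
    return probs


half, quarter = Fraction(1, 2), Fraction(1, 4)
Q = [[half, quarter],        # transitions among transient (undetected) states
     [Fraction(0), half]]
R = [quarter, half]          # one-step probabilities of absorption (detection)
pi0 = [Fraction(1), Fraction(0)]

dist = latency_distribution(pi0, Q, R, 4)
print(dist, sum(dist))
```

Exact fractions are used so that the probabilities can be checked by hand; the sum over the first n steps is the probability that the fault has been detected within n steps.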

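Network of Interacting Component FSMs
$M = (X, Y, S, \delta, \lambda)$ is represented by the components $M_1 = (X^1, Y_1, S_1, \delta_1, \lambda_1)$, $M_2 = (X^2, Y_2, S_2, \delta_2, \lambda_2)$, $M_3 = (X^3, Y_3, S_3, \delta_3, \lambda_3)$, with inputs $X^1 = (X_1, P)$, $X^2 = (X_2, P)$, $X^3 = (X_3, P)$, where $X_1, X_2, X_3 \subseteq X$ and $S_1, S_2, S_3 \subseteq S$. P is a set of additional input variables that define conditions of component functioning.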

(Ilia Levin et al., Reduction of Fault Latency in Sequential Circuits by Using Decomposition, 22nd IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2007)

Here a(l) is a predefined initial state in which a component begins the testing mode.

Component FSM S3 Supervisor

Possible regimes
• $S$, $S_f$ are the sets of states of the fault-free and faulty FSMs, and the state space is $S_A = \{(S, S_f)\} \cup \{s_{Abs}\}$. The states in $S$ and $S_f$ have the form $(s_1, s_2, s_3, V)$, where $V = (m_1, m_2, m_3)$ and each $m_i$ = W or T is the status of the corresponding component: W means "working mode", T means "testing mode".
• Possible transitions:
  1. $(i_1, i_2) \to (j_1, j_2)$, with $i_1, j_1 \in S$ and $i_2, j_2 \in S_f$; V is not changed in either FSM.
  2. The same, but with a transition to an absorbing state.
  3. $(i_1, i_2) \to (j_1, j_2)$, with $i_1, j_1 \in S$ and $i_2, j_2 \in S_f$, and $V \to V'$.
  4. The same, but with a transition to an absorbing state.
• A fault can appear in any of the FSMs in either mode.

Time to absorption (S. Frenkel, A. Pechinkin, 2007)
The discrete random process S(t) = {(i, j)} is non-Markovian. Let n be the time (number of steps) elapsed since a component was switched from one mode to another.
[Figure: time axis t with the mode vectors v = (W, T, T) and v = (T, W, T); n is counted from the switch]
Then z(t) = {(i, j, n)} is a Markov chain, and {(i, j, n) : Y(i, n) ≠ Y(j, n)} is the subset of absorbing states of this chain.
The probability of having reached an absorbing state by the m-th step after a fault occurrence is $\Pr(m) = 1 - \sum_{i,j} \sum_{n=0}^{m} P_m(i, j, n)$, $m \ge 0$.
The $P_m(i, j, n)$ are computed recurrently. The computation starts from $P_0(i, j, 0) = 0$ and $P(i, j, n) = 0$ for $n > 0$, then proceeds to $P_1(i, j, 0)$ and $P_1(i, j, 1)$, $P_2(i, j, 1)$ and $P_2(i, j, 2)$, etc.

Transition probability matrix $P(i_1, j_1, n_1, i_2, j_2, n_2)$: computation
Let $i_1, i_2$ be states of the same component $A_k$, k = 1, 2, 3. In this case $n_2 = n_1 + 1$ and
$P(i_1, j_1, n_1, i_2, j_2, n_2) = \Pr(i_1, i_2)\,\Pr(j_1, j_2)\,\Pr(Y(i_1, i_2) = Y_f(j_1, j_2))\,q_i(n_1)$,
where $q_i(n) = a_i(n) / a_i(n-1)$ is the conditional probability that the fault is not detected (in the test mode) at step n given that it had not been detected at step n-1, and $a_i(n)$ is the probability that the fault is not detected up to the n-th step.
If $i_1, i_2$ are states of different components, then $n_2 = 0$.

Another model

Bounds in terms of component latencies Upper bound for the minimum: Low bound. As: Then: Also:

Kronecker-based approach (S. Frenkel, A. Pechinkin, I. Levin, 2007) • where $P_{ij}$, i, j = 1, 2, 3, is the transition probability matrix of the Markov chain corresponding to the state space of combinations of the various pairs of fault-free/faulty components, namely:

$A_{ii}$, i = 1, 2, 3, is the transition probability matrix of the product of the fault-free and faulty component i in working mode. It gives the probability that, under fault f, a transition of the fault-free component i in working mode produces the same output as the corresponding transition of the faulty component, i.e. the probability of the transition $\{(s_i, s_j^f), (s_k, s_m^f)\}$. Here the fault-free component makes the transition $s_i \to s_k$, while the faulty component makes the transition $s_j^f \to s_m^f$ under the same input vector. (The states $(s_k, s_m^f)$ may belong to other components, since in working mode a transition from a predefined state a(l) may lead to a state of any of the three components.)

$B_{jj}$, j = 1, 2, 3, is the analogous matrix of transition probabilities of the product of the fault-free and faulty component j in testing mode, for transitions inside this component.
• The $P_{ij}$ are expressed in terms of transition probability matrices $Q_{ij}$, which give the probabilities that the following transitions produce equal outputs in the fault-free and faulty components:
  - the transition of the fault-free component i, in working mode and in an inner state $j_1$, into a predefined state of type a(l);
  - the transition of component j, which is in testing mode at that time, from a state $j_2$ to a state $k_2$ within this component;
  - the transitions of the corresponding faulty components, taking into account that the predefined state a(l) is the same for the fault-free and the faulty component i, by the principal assumption above that the initial states of the fault-free and faulty FSMs are equal.

For example, $Q_{12}$ contains the probabilities of the following pairs of transitions $\{i_1, j_1, i_2, j_2\} \to \{k_1, l_1, k_2, l_2\}$:
• $i_1$ is the state of fault-free component 1 in working mode, and $j_1$ is the state of faulty component 1 in working mode;
• $i_2$, $j_2$ are the states of the fault-free and faulty component 2, respectively;
• $k_1 = l_1$ is the predefined state in component 1 (a state of type a(l));
• $k_2$ is the state of fault-free component 2 to which component 1 has moved (correspondingly, component 2 enters working mode);
• $l_2$ is the state of faulty component 2 to which the faulty working component has moved (entering testing mode at this time), i.e. $l_2$ is the state to which $j_1$ moved.

Kronecker product

Design Cost
$C_D = C_{D0} - C_{ST} - C_{SV} - C_{VT} - C_{SVT}$, with $C_{D0} = C_{S0} + C_{T0} + C_{V0}$, where $C_{ST}$, $C_{SV}$, $C_{VT}$, $C_{SVT}$ are the costs of the work that is the same for the corresponding design tasks (Synthesis (S), Testing (T), Verification (V)). We assume that these costs are independent of the order in which the tasks are carried out.

Automata Dependencies – Stochastic Automata Network Model
• An event is anything that causes a change of the SAN global state.
• The firing rate is the rate at which the event occurs.
• The probability of occurrence quantifies the choice among all transitions corresponding to a synchronizing event that can be fired from the same local state.

Types of automata interactions
• Local events in a SAN change the SAN global state by changing the local state of one single automaton.
• Synchronizing events.
[Figure label: Markov chain]
The firing of the transition from state 0(2) to state 1(2) occurs with rate λ1 (if automaton A(1) is in state 0(1)) or λ2 (if automaton A(1) is in state 2(1)).

SAN Markov Chains Infinitesimal matrix of SAN (local events only)

Let E be the set of synchronizing events. For each event e, the matrices indexed by e+ correspond to the actual synchronizing event and its rates, and the matrices indexed by e- correspond to the updating of the diagonal elements of the infinitesimal generator to reflect these transitions. Suppose that each time A(1) generates a transition from 2 to 1 (at rate λ1), it forces A(2) into state 1. The generator Q of a SAN containing N stochastic automata with E synchronizing events (and no functional transition rates) may be written as a sum of Kronecker terms.
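The formula itself did not survive in the transcript. A commonly used form of the SAN descriptor, assumed here, is $Q = \bigoplus_{i=1}^{N} Q_l^{(i)} + \sum_{e \in E} \left( \bigotimes_{i=1}^{N} Q_{e+}^{(i)} + \bigotimes_{i=1}^{N} Q_{e-}^{(i)} \right)$, where the $Q_l^{(i)}$ are the local generators. The sketch below assembles this sum for a hypothetical pair of two-state automata: the rates 1, 2 and 3 and the rule "when A(1) moves from state 1 to state 0, A(2) is forced into state 0" are illustrative assumptions, not the example from the slide.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def add(*mats):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]


I2 = [[1, 0], [0, 1]]

# Local transitions only (hypothetical rates): A(1) 0->1 at rate 1, A(2) 0->1 at rate 2.
Q1_local = [[-1, 1], [0, 0]]
Q2_local = [[-2, 2], [0, 0]]

# Synchronizing event e (hypothetical): A(1) moves 1->0 at rate 3 and forces A(2) to state 0.
Q1_e_plus = [[0, 0], [3, 0]]     # rate of the event in the automaton that owns it
Q2_e_plus = [[1, 0], [1, 0]]     # A(2) jumps to state 0 with probability 1
Q1_e_minus = [[0, 0], [0, -3]]   # diagonal correction for the event rate
Q2_e_minus = I2

# Kronecker sum of the local generators plus the e+ and e- Kronecker products.
Q = add(kron(Q1_local, I2), kron(I2, Q2_local),
        kron(Q1_e_plus, Q2_e_plus), kron(Q1_e_minus, Q2_e_minus))

for row in Q:
    print(row)
```

Global states are ordered (0,0), (0,1), (1,0), (1,1). Every row of the resulting matrix sums to zero, as required of an infinitesimal generator, and the e- term is exactly what restores this property after the e+ term has been added.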

PEPA Language
PEPA allows several components to be combined via a cooperation combinator such that the resulting composite component is again a continuous-time Markov chain.
• Prefix: the process (α, r).C executes the activity (α, r), which has the action type α and an exponentially distributed duration with rate r, and afterwards behaves like C. It is also possible to leave the rate r unspecified.
• Choice: in a process C + D all currently enabled activities in C and D are involved in a race condition, and the activity that wins this race is executed. Due to the memoryless property of the exponential distribution, all other activities are reset. If, for example, the activity (α, r) wins in the process (α, r).C + (β, v).D, then afterwards the process behaves like C.
• Cooperation: $C \bowtie_L D$ denotes the situation where the components C and D must synchronise over activities whose action type is contained in the synchronisation set L (shared activities). C and D evolve independently of each other (i.e. in parallel) until the first of the two components, say C, reaches a shared activity. From this time instant on, this shared activity is blocked in C until D also reaches a shared activity of the same action type. When this happens, the shared activity is executed simultaneously by C and D. The rate of the shared activity is determined by the smallest rate of all activities involved in the synchronisation. If one or more activities involved in the synchronisation have an unspecified rate, these activities can be regarded as passive: they are not taken into account when determining the rate of the shared activity.
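The race condition in a choice and the rate rule for a shared activity both reduce to elementary properties of exponential distributions. The sketch below is an illustration under simplifying assumptions, not part of any PEPA tool: the function names `race` and `shared_rate` are hypothetical, each component is assumed to enable a single activity of the shared type, and an unspecified (passive) rate is represented by `None`.

```python
from fractions import Fraction
from typing import Dict, Optional, Sequence, Tuple


def race(rates: Dict[str, Fraction]) -> Tuple[Fraction, Dict[str, Fraction]]:
    """Race between exponentially timed activities.

    Returns the total exit rate (the holding time is exponential with this
    rate) and, for each activity, the probability that it wins the race.
    """
    total = sum(rates.values(), Fraction(0))
    return total, {name: r / total for name, r in rates.items()}


def shared_rate(rates: Sequence[Optional[Fraction]]) -> Optional[Fraction]:
    """Rate of a shared activity: the smallest specified rate.

    Passive participants (None) are ignored; if every participant is
    passive, the rate stays unspecified.
    """
    active = [r for r in rates if r is not None]
    return min(active) if active else None


# Choice (alpha, 2).C + (beta, 6).D: alpha wins with probability 2 / (2 + 6).
exit_rate, win = race({"alpha": Fraction(2), "beta": Fraction(6)})
print(exit_rate, win["alpha"], win["beta"])

# Cooperation over one action type with rates 2, passive, and 5.
print(shared_rate([Fraction(2), None, Fraction(5)]))
```

The mean time spent before the choice is resolved is the reciprocal of the exit rate, here 1/8 time units.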

DTMC SAN • The transition matrix of the system DTMC with a synchronizing event

Time to Absorption in Joint MCs [F. Brenner, Cumulative Measures of Absorbing Joint Markov Chains and an Application to Markovian Process Algebras, 2007]
For a Markov chain $U = (U_1, \dots, U_m)$ with the marginal absorbing Markov chains $U_i$, i = 1, …, m, compute the mean time to absorption.
n-th moment of the time to absorption: - vector of initial transient state probabilities (M. Neuts, Matrix-Geometric Solutions in Stochastic Models, 1981)
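The moment formula is missing from the transcript. For a continuous-time absorbing chain with transient sub-generator T and initial transient-state vector α, the standard phase-type result is $E[\tau^n] = n!\,\alpha\,(-T)^{-n}\,\mathbf{1}$; the sketch below assumes that this is the formula intended. The matrix `T`, the vector `alpha` and the helper names are hypothetical, and exact rational arithmetic is used so the values can be verified by hand.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def absorption_moments(alpha, T, n_max):
    """Return [E[tau^1], ..., E[tau^n_max]] for initial vector alpha.

    Uses the recursion m_n = n * (-T)^(-1) * m_(n-1) with m_0 = 1,
    so that m_n = n! * (-T)^(-n) * 1.
    """
    neg_T = [[-v for v in row] for row in T]
    m = [Fraction(1)] * len(T)
    moments = []
    for n in range(1, n_max + 1):
        m = [n * x for x in solve(neg_T, m)]
        moments.append(sum(a * x for a, x in zip(alpha, m)))
    return moments


# Hypothetical chain: state 0 leaves at rate 2 (half to state 1, half to absorption),
# state 1 is absorbed at rate 3.
T = [[-2, 1], [0, -3]]
alpha = [1, 0]
print(absorption_moments(alpha, T, 2))
```

Starting in state 0, the mean time to absorption is 1/2 + (1/2)(1/3) = 2/3, which the first moment reproduces.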

Mean time to absorption:
• $P_s(T_1 > t)$ is the complementary cumulative probability distribution of the state holding time in the synchronising state s. This state holding time is exponentially distributed with rate λ(s).
• SAN component C corresponding to the state s, and all events which determine synchronising activities.
• $Y(t[T_1])$ is a process which behaves like Y(t) until $t = T_1$. For $t \ge T_1$ the process stays in the state that it occupied at the time instant $T_1$, i.e. once $Y(t[T_1])$ has reached its next embedded state.
• The time complexity is $O(N m d^2 + m N^2)$, where m is the number of marginal CTMCs, d is the maximal size of the marginal state spaces, and N is the truncation index of the infinite series, which depends on the convergence properties (eigenvalues) of the joint CTMC and on the desired accuracy.

An example of self-healing properties is an FSM with partially monotonic transition functions. A system of logical functions φ is partially monotonic in the variables x' if, for any pair of Boolean k-tuples A, B such that A ≤ B, the condition φ(A) ≤ φ(B) is satisfied.

Self-healing: formal view/informal views
An FSM may have a self-healing property for a given fault and a given input sequence even if it does not have equivalent states.
Partially monotonic FSM in cubic form: the symbol "-" means "don't care", as do the empty positions in the 3-bit input vector X. The variables as, am are state variables (as is the previous state, am is the next one), and Y is a 7-bit output vector.
Self-healing of the FSM that started from the state 1000, when the state to which the FSM should transit under input 111 was changed to 1100 by a transient (within one clock cycle) fault.

Self-Healing Time and Coupling
Given a Markov chain A defined on a state space S and a set L of its subsets ("configurations"), the expected hitting time of L (or, more simply, the hitting time) is $T_L = \max_{x \in S} E[T_x^L]$, where E[.] denotes expectation and $T_x^L = \min\{\, t : X_t \in L \mid X_0 = x \,\}$.
We model the effect of a transient fault as a change of the initial state of the corresponding Markov chain: $(X_0 = x) \to (Y_0 = y)$.
A coupling for a Markov chain A is a Markov chain defined on the product space $S \times S$, i.e. a stochastic process $(X_t, Y_t)$, t = 1, 2, …, with the properties:
1. Each of the processes $(X_t)$ and $(Y_t)$ is a faithful copy of A (given initial configurations $X_0 = x$ and $Y_0 = y$).
2. If $X_t = Y_t$, then $X_{t+1} = Y_{t+1}$.

Self-Healing and Self-Stabilization
A converges towards L if, starting from an arbitrary configuration, it reaches L within a finite number of steps.
[4] A. Dhama, O. Theel, and T. Warns, Reliability and Availability Analysis of Self-Stabilizing Systems, LNCS vol. 2280, Stabilization, Safety and Security of Distributed Systems, pp. 244-261, 2006.
[6] L. Fribourg, S. Messika, C. Picaronny, Coupling and self-stabilization, Distrib. Comput., DOI 10.1007/s00446-005-0142-7, Special issue: DISC 04, 2005.

Self-healing time: bounds
Theorem. Given a Markov chain A and an ergodic set L, if there exists a coupling of finite expected time T, then the hitting time $T_L$ satisfies $T_L \le T$.
Let $P_{x_0}$ be the initial distribution of the Markov chain for the fault-free FSM, $P_{y_0}$ the initial distribution for the faulty FSM, and P the transition matrix of the Markov chain describing the FSM. The coupling inequality [Rosenthal97] bounds the difference between the k-step distributions of the chain started in the initial state $Y_0$ and of the chain started in the initial state $X_0$:
$\sup_{A \subseteq S} \left| P_{x_0} P^k(A) - P_{y_0} P^k(A) \right| \le \Pr(T > k)$.
Correspondingly, the probability that the self-healing property is fulfilled within k steps satisfies
$\Pr(T \le k) \le 1 - \sup_{A \subseteq S} \left| P_{x_0} P^k(A) - P_{y_0} P^k(A) \right|$.
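The coupling inequality can be checked numerically on a small chain. The sketch below is an illustration under stated assumptions: the three-state transition matrix `P` is hypothetical, and the coupling used is the classical "move independently until the two copies meet, then move together" construction, which is only one of many possible couplings. The supremum over sets A is computed as the total variation distance, which equals half the sum of absolute differences.

```python
from fractions import Fraction


def step_dist(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv_distance(p, q):
    """Total variation distance, equal to sup over sets A of |p(A) - q(A)|."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def coupling_tail(P, x, y, k_max):
    """Pr(T > k), k = 0..k_max, for two copies moving independently until they meet."""
    n = len(P)
    # joint[(i, j)] is the probability mass of not-yet-coupled pairs at (i, j)
    joint = {(x, y): Fraction(1)} if x != y else {}
    tails = [sum(joint.values(), Fraction(0))]
    for _ in range(k_max):
        nxt = {}
        for (i, j), pr in joint.items():
            for a in range(n):
                for b in range(n):
                    if a != b and P[i][a] and P[j][b]:
                        nxt[(a, b)] = nxt.get((a, b), Fraction(0)) + pr * P[i][a] * P[j][b]
        joint = nxt
        tails.append(sum(joint.values(), Fraction(0)))
    return tails


def coupling_bounds(P, x, y, k_max):
    """Pairs (sup_A |P_x P^k(A) - P_y P^k(A)|, Pr(T > k)) for k = 0..k_max."""
    n = len(P)
    px = [Fraction(int(i == x)) for i in range(n)]
    py = [Fraction(int(i == y)) for i in range(n)]
    tails = coupling_tail(P, x, y, k_max)
    pairs = []
    for k in range(k_max + 1):
        pairs.append((tv_distance(px, py), tails[k]))
        px, py = step_dist(px, P), step_dist(py, P)
    return pairs


h, q = Fraction(1, 2), Fraction(1, 4)
P = [[h, h, 0], [q, h, q], [0, h, h]]   # hypothetical chain describing the FSM
bounds = coupling_bounds(P, 0, 2, 6)    # fault-free start in state 0, faulty start in state 2
for k, (tv, tail) in enumerate(bounds):
    print(k, tv, tail)
```

In every row the first number does not exceed the second, and one minus the first number is the resulting upper bound on Pr(T ≤ k) for this coupling.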
