
Failure Mode Assumptions and Assumption Coverage

Failure Mode Assumptions and Assumption Coverage. David Powell. Fault-Tolerance. Key questions: How may components fail? → Prevention strategies. At what rate may they fail? → The amount of redundancy needed. What are the important types of faults? → Types of redundancy needed.


Presentation Transcript


  1. Failure Mode Assumptions and Assumption Coverage David Powell

  2. Fault-Tolerance • Key questions • How may components fail? → Prevention strategies • At what rate may they fail? → The amount of redundancy needed • What are the important types of faults? → Types of redundancy needed • What is the relation between dependability, redundancy and faults? • General FT design guidelines

  3. An F-T Paradox/Dilemma • More faults → More redundancy → More possibility of faults • ???

  4. Solution: Some Key Steps • Classify, quantify and verify the assumptions

  5. Type of Failures

  6. Overview • Single-user service • Service Model • Potential Errors • Multiple-user service • Service Model • Potential Errors

  7. Single-user Service Model • Service items: si, i=1,2,… • Values of si: vsi • Observation time of si: tsi • Service Model: Si= <vsi, tsi> • An omniscient observer

  8. Correctness Model • Service item si is correct iff (vsi ∈ SVi) ∧ (tsi ∈ STi) • SVi and STi are respectively the specified sets of values and times for service item si

  9. Potential Errors • Arbitrary value error: si : vsi ∉ SVi • Noncode value error: si : vsi ∉ CV (CV defines a code) • Arbitrary timing error: si : tsi ∉ STi • Early timing error: si : tsi < min(STi) • Late timing error: si : tsi > max(STi) • Omission error: si : tsi = ∞ • Impromptu error: si: (vsi = )  (tsi = )
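The single-user error classes above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `Spec` and `classify` are hypothetical, the specified time set STi is assumed to be a closed interval [t_min, t_max], and an omission is modelled as an infinite observation time. The impromptu case is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Specification of one service item (hypothetical helper type)."""
    values: frozenset  # SVi: the specified set of values
    t_min: float       # min(STi), assuming STi is an interval
    t_max: float       # max(STi)


def classify(value, time, spec, code=None):
    """Return the error labels for one observed service item <value, time>.

    `code` is the optional set CV of valid code words; a wrong value that
    lies outside CV is reported as a noncode value error.
    """
    if math.isinf(time):
        # Assumption: an item that is never delivered has infinite observation time.
        return ["omission"]
    errors = []
    if value not in spec.values:
        if code is not None and value not in code:
            errors.append("noncode value")
        else:
            errors.append("arbitrary value")
    if time < spec.t_min:
        errors.append("early timing")
    elif time > spec.t_max:
        errors.append("late timing")
    return errors
```

An empty list means the item is correct in both the value and the time domain, which matches the "iff" condition of the correctness model.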

  10. Multi-user Service Model • Service item si = {si(1), si(2), …, si(n)} • Service model: <vsi(u), tsi(u)>, for all i, u • New issue: "consistency"

  11. Correctness Model • vsi(u): the value of service item i on process u • vsi: the value of service item i • SVi: the set of specified values of service item i • tsi(u): the observation time of service item i on process u • STi(u): the range of specified observation times of service item i on process u • uv: the time bound of related occurrences

  12. Examples of Potential Errors • Consistent value error • Consistent timing error • Semi-consistent value error

  13. Failure Mode Assumptions • An attempt to formalize the concept of an assumed failure mode by assertions on the sequences of service items delivered by a component

  14. Examples of Value Error Assertions • No value errors occur (Vnone): ∀i, vsi ∈ SVi • The only value errors that occur are noncode value errors (Vn): ∀i, (vsi ∈ SVi) ∨ (vsi ∉ CV) • Arbitrary value errors can occur (Varb): ∀i, (vsi ∈ SVi) ∨ (vsi ∉ SVi)
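The three value error assertions can be read as predicates over a sequence of delivered values. The sketch below is a hedged illustration, not the author's formulation: the function names are hypothetical, `values[i]` plays the role of vsi, `specs[i]` the role of SVi, and `code` the role of CV.

```python
def v_none(values, specs):
    """Vnone: every delivered value lies in its specified set."""
    return all(v in sv for v, sv in zip(values, specs))


def v_noncode(values, specs, code):
    """Vn: every value is either correct or detectably outside the code CV."""
    return all((v in sv) or (v not in code) for v, sv in zip(values, specs))


def v_arb(values, specs):
    """Varb: a tautology, so it holds for any sequence of values."""
    return all((v in sv) or (v not in sv) for v, sv in zip(values, specs))
```

Any sequence that satisfies `v_none` also satisfies `v_noncode`, and every sequence satisfies `v_arb`, so the assertions get weaker as the assumed errors get more severe.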

  15. Examples of Timing Error Assertions • No timing error occurs (Tnone) • The only timing errors are omission errors (TO) • The only timing errors are late timing errors (TL) • The only timing errors are early timing errors (TE) • Arbitrary timing error can occur (Tarb) • Permanent omission/crash (Tp) • Bounded omission degree (TBk)

  16. Timing Error Implications

  17. Failure Mode Assertions (FMA) • A complete FMA entails an assertion on errors occurring in both the value and time domains • By taking the Cartesian product of the two domains, we get a family of FMAs
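As a small illustration of the Cartesian product, the assertion names from slides 14 and 15 can be paired up mechanically. The list names are hypothetical, and TBk is treated here as a single entry even though it stands for one assertion per bound k.

```python
from itertools import product

# Assertion names as they appear on slides 14 and 15.
VALUE_ASSERTIONS = ["Vnone", "Vn", "Varb"]
TIMING_ASSERTIONS = ["Tnone", "TO", "TL", "TE", "Tarb", "Tp", "TBk"]

# Each complete FMA pairs one value assertion with one timing assertion.
fmas = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))

for value_assertion, timing_assertion in fmas:
    print(f"{value_assertion} and {timing_assertion}")
```

With three value assertions and seven timing assertions this enumerates 21 candidate FMAs.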

  18. FMA Implication Graph

  19. So what? • The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!

  20. Assumption Coverage • Establishing a link between the assumed component failure mode and system dependability • The design of an FT system relies on the assumptions it makes • The dependability of an FT system is related to the failure mode it assumes

  21. Motivation • Components may fail • They may fail in a bad way → this leads to a violation of the system's assumptions • The system, in turn, can fail • Question: to what degree does a component FMA hold true in the real system?

  22. The Coverage of the Assumption • Definition: P(X) = Pr{X = true | component failed} • P(Varb ∧ Tarb) = 1 • P(Vnone ∧ Tnone) = 0

  23. Coverage of an FT System • PS(X) = Pr{correct error processing | X = true} × Pr{X = true | component failed}
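The product form of the system coverage can be evaluated with a few lines of code. This is a minimal sketch with made-up probabilities; the function name `system_coverage` and the numbers 0.999 and 0.95 are illustrative assumptions, not values from the presentation.

```python
def system_coverage(p_processing_given_x, p_x_given_failure):
    """PS(X) = Pr{correct error processing | X} * Pr{X | component failed}."""
    for p in (p_processing_given_x, p_x_given_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing_given_x * p_x_given_failure


# Hypothetical numbers: error processing succeeds 99.9% of the time when the
# assumption X holds, and X holds for 95% of component failures.
print(system_coverage(0.999, 0.95))

# With the weakest assumption (arbitrary value and timing errors) the
# assumption coverage is 1, so only the processing term remains.
print(system_coverage(0.9, 1.0))
```

The example shows the trade-off: a stronger assumption may make error processing easier, but it lowers the second factor because fewer real failures satisfy it.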

  24. Influence of Assumption Coverage on System Dependability A Case Study

  25. The System • A system of n processors • Connected via unidirectional message-passing bus • Each processor carries out the same computation steps • The result of each processing step is communicated to all other processors • Each processor has a decision function (DF) • The DF is applied to the results received from other processors • … • Each processor and its associated bus is viewed as a single component

  26. Fail-Silent Processor Bus • A fail-silent processor • Only has semi-consistent value errors • Always produces messages on time • Or ceases to produce messages forever • If a message is delivered to a processor, it is delivered to all processors with a consistent fixed delay

  27. Fail-Consistent Processor Bus • Only semi-consistent value errors may occur • Faulty processors may send erroneous values • Consistent timing error may occur

  28. Fail-uncontrolled Processor Bus • Arbitrary timing error • Arbitrary value error

  29. Implications of Assumption Coverage • Failure mode relations • Coverage relations

  30. Dependability Expressions From Markov Models • r = e^(−λt) • λ = failure rate

  31. A Life-critical Application • System reliability objective: R > 1 − 10^−9 over 10 hours • Single processor reliability: r = e^(−λt) • 1/λ = 5 years
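A quick calculation shows why a single processor cannot meet this objective on its own. The sketch assumes a year of 365 days (8760 hours) and a constant failure rate λ; the variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24          # assumption: 365-day year
mttf_hours = 5 * HOURS_PER_YEAR    # 1/lambda = 5 years
failure_rate = 1.0 / mttf_hours    # lambda, per hour
mission_hours = 10.0               # t

# r = e^(-lambda * t): single-processor reliability over the mission.
reliability = math.exp(-failure_rate * mission_hours)

# 1 - r, computed with expm1 to avoid cancellation for small lambda * t.
unreliability = -math.expm1(-failure_rate * mission_hours)

print(f"single-processor reliability:   {reliability:.6f}")
print(f"single-processor unreliability: {unreliability:.3e}")
print(f"objective met by one processor: {unreliability < 1e-9}")
```

The single-processor unreliability comes out at roughly 2.3 × 10^−4, several orders of magnitude above the 10^−9 target, which is why redundancy and the coverage of the assumed failure mode both matter.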

  32. A Money-Critical Application • It is about availability of the system rather than reliability of the system • Please look at the paper for more details

  33. Unavailability vs. Coverage

  34. Conclusion • A formalism for describing component failure modes • Multiplicity of value and timing errors • The notion of assumption coverage • The relation between dependability, availability and assumption coverage

  35. Thank you
