Chapter 1. Introduction

stamos

Presentation Transcript


  1. Chapter 1. Introduction

  2. Fault-Tolerance • Reliability : Continuity of Service • Availability : Readiness for Usage • Safety : Avoidance of Catastrophic Consequences on the Environment • Security : Prevention of Unauthorized Access and/or Handling of Information

  3. Fault-Tolerance (2) • Fault-Tolerance : To provide service despite the presence of faults in the system • Fault-Prevention : To prevent faults from occurring or getting introduced into the system

  4. Fault-Tolerance (3) • Manual Maintenance in case of System Failures? • Unacceptability of the real-time delays caused by manual repairs • Inaccessibility of systems for manual repairs • Excessively high costs of lost time and maintenance

  5. Basic Concepts and Definitions • System : An Identifiable Mechanism that Maintains a Pattern of Behavior at an Interface between the System and its Environment • Internal State and External State (Behavior) • Specification : The expected or correct behavior of a system (Complete, Consistent, Correct)

  6. Basic Concepts and Definitions (2) • Failure : When the behavior of the system first deviates from that required by its specification. • Error : The part of the system state which is liable to lead to subsequent failure. • Fault : The cause of an error.

  7. Basic Concepts and Definitions (3) • Faults : • Transient Faults vs. Permanent Faults • Design Faults vs. Operational Faults • Fault Tolerance : The behavior of the system, despite the failure of some of its components, is consistent with its specification.

  8. Phases in Fault Tolerance • Error Detection • Damage Confinement • Error Recovery • Fault Treatment and Continued System Service

  9. Error Detection • Error Detection ? • Why Not Fault Detection or Failure Detection? • Use “Check” to Detect Errors • Replication Check • Timing Check • Structural and Coding Check • Reasonableness Check • Diagnostics Check

  10. Error Detection (2) • Replication Check : Replicate some components of the system and compare or vote on their results. • Timing Check : Use a time-out if the specification of a component includes timing constraints.
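
The two checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `replication_check` and `timing_check` are hypothetical, components are modeled as plain functions, and the time-out is implemented with a worker thread from the standard `concurrent.futures` module.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def replication_check(replicas, *args):
    """Run every replica on the same input and compare the results."""
    results = [replica(*args) for replica in replicas]
    agree = all(result == results[0] for result in results)
    return ("ok", results[0]) if agree else ("replication_error", results)

def timing_check(component, deadline_s, *args):
    """Run the component and flag an error if it misses its deadline."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(component, *args)
    try:
        return ("ok", future.result(timeout=deadline_s))
    except FutureTimeout:
        return ("timing_error", None)
    finally:
        # Do not block on a late component; its thread finishes on its own.
        pool.shutdown(wait=False)

print(replication_check([abs, abs], -3))
print(timing_check(lambda: time.sleep(0.3), 0.05))
```

Note that the timing check only detects the missed deadline; it does not stop the late component.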

  11. Error Detection (3) • Structural and Coding Checks : Check that the structure of the data is as it should be; Coding : Extra bits are added to the data bits. • Reasonableness Checks : Determine whether the state of some object in the system is reasonable (e.g., a range check).
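
As a minimal sketch of a coding check and a reasonableness check, the example below uses a single even-parity bit as the "extra bits" and a simple range test. The function names and the sample range are illustrative assumptions, not taken from the slides.

```python
def add_parity(bits):
    """Coding: append one even-parity bit to a list of 0/1 data bits."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Coding check: a valid word contains an even number of 1s."""
    return sum(word) % 2 == 0

def range_ok(value, low, high):
    """Reasonableness check: is the value inside its plausible range?"""
    return low <= value <= high

word = add_parity([1, 0, 1, 1])
corrupted = list(word)
corrupted[2] ^= 1  # simulate a single-bit error

print(parity_ok(word), parity_ok(corrupted))
print(range_ok(72, 30, 220), range_ok(-5, 30, 220))
```

A single parity bit detects any odd number of bit errors but cannot locate or correct them.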

  12. Error Detection (4) • Diagnostics Checks : Use special input values with known output values.

  13. Damage Confinement and Assessment • Damage Assessment : The flow of information between different components of the system is examined. • Damage Confinement : Firewalls - No information flow takes place across the walls.

  14. Error Recovery • Error Recovery : Remove the erroneous state • Backward Recovery : Checkpointing & Rollback • Forward Recovery : Make the state error-free by taking the necessary corrective actions.
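
Backward recovery by checkpointing and rollback might look like the sketch below. It assumes the whole state fits in memory and can be deep-copied; the `Checkpointed` class and the `transfer` operation are hypothetical names used only for illustration.

```python
import copy

class Checkpointed:
    """Holds a state plus one saved copy that can be restored later."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)

def transfer(accounts, src, dst, amount):
    accounts.state[src] -= amount
    if accounts.state[src] < 0:
        # Error detected after the state has already been modified.
        raise ValueError("negative balance")
    accounts.state[dst] += amount

accounts = Checkpointed({"a": 100, "b": 0})
accounts.checkpoint()
try:
    transfer(accounts, "a", "b", 150)
except ValueError:
    accounts.rollback()  # discard the erroneous state

print(accounts.state)
```

Forward recovery would instead repair the state in place, for example by adding the withdrawn amount back, which requires knowing exactly what damage was done.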

  15. Fault Treatment and Continued Service • Transient Error : Handled by Error Recovery. Permanent Error : Handled by ? • Fault Location : Identify the faulty component. • System Repair : Bypass the faulty component. Dynamic System Reconfiguration (Using Redundancy).

  16. Overview of Hardware Fault Tolerance • Triple Modular Redundancy (TMR) • What if two units fail, or the voting element fails? • Synchronization problem? • No error detection or recovery? [Figure: the Input feeds three modules (M); a voter (V) produces the Output]
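
The voting step of TMR can be sketched as follows. The function name `tmr` and the toy modules are assumptions for illustration, and module outputs are assumed to be hashable values that can be compared exactly.

```python
from collections import Counter

def tmr(modules, x):
    """Feed x to all three modules and return the majority output."""
    outputs = [module(x) for module in modules]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among module outputs")
    return value

good = lambda x: x * x
bad = lambda x: x * x + 1  # a faulty module

print(tmr([good, good, bad], 4))  # one fault is masked
print(tmr([good, bad, bad], 4))   # two agreeing faults outvote the good unit
```

The second call shows the limitation raised on the slide: when two units fail in the same way, the voter confidently returns the wrong value, and the voter itself remains a single point of failure.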

  17. Overview of Hardware Fault Tolerance (2) • Dynamic Redundancy • Several units, but with only one operating at a time • If a fault is detected, the faulty unit is switched out. • Cold-standby system vs. Hot-standby system • DR vs. TMR : Failure detection, Faulty unit is removed.

  18. Overview of Hardware Fault Tolerance (3) • Dynamic Redundancy : P1 : P2 ==> P3 P1, P2 ==> P4 P1,P2,P3
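
A minimal sketch of dynamic redundancy, where one unit operates and spares are switched in on failure, is given below. It makes two simplifying assumptions that are not in the slides: a fault is detected when the active unit raises an exception, and spares are tried in order. `StandbySystem` is a hypothetical name.

```python
class StandbySystem:
    """One active unit at a time; spares are switched in after a fault."""

    def __init__(self, units):
        self.spares = list(units)
        self.active = self.spares.pop(0)

    def call(self, x):
        while True:
            try:
                return self.active(x)
            except Exception:
                # Fault detected: switch out the faulty unit.
                if not self.spares:
                    raise RuntimeError("all units have failed")
                self.active = self.spares.pop(0)

def faulty(x):
    raise IOError("unit is down")

def healthy(x):
    return x + 1

system = StandbySystem([faulty, healthy])
print(system.call(1))
```

Unlike TMR, this scheme depends on explicit failure detection, and a faulty unit that returns a wrong value without raising would go unnoticed.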

  19. Overview of Hardware Fault Tolerance (4) • Coding : Detectability/Correctability of a Code. • Hamming Distance : The minimum number of bit positions in which any two words in the code differ. • d = C + D + 1, where C = # of bit errors the code can correct and D = # of bit errors the code can detect
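
To make the distance relation concrete, the sketch below computes the minimum Hamming distance of two small example codes and lists the (C, D) pairs that satisfy d = C + D + 1. The example codes are chosen for illustration, and the restriction C <= D (a corrected error must also be detected) is a standard condition assumed here rather than stated on the slide.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(words):
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming(a, b) for a, b in combinations(words, 2))

def capabilities(d):
    """All (C, D) pairs with d = C + D + 1 and C <= D."""
    return [(c, d - 1 - c) for c in range(d) if c <= d - 1 - c]

repetition = ["000", "111"]
even_parity = ["000", "011", "101", "110"]

print(code_distance(repetition), capabilities(code_distance(repetition)))
print(code_distance(even_parity), capabilities(code_distance(even_parity)))
```

For the repetition code the distance is 3, so it can either correct one error or detect two; the even-parity code has distance 2 and can only detect a single error.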

  20. Overview of Hardware Fault Tolerance (5) • Hamming Code : Bit layout C1 C2 D1 C3 D2 D3 D4, with C1 = D1 + D2 + D4, C2 = D1 + D3 + D4, C3 = D2 + D3 + D4 (+ is modulo-2 addition) • Hamming Distance = ? • 1 bit error can be detected and corrected.
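
The check-bit equations on this slide can be tried out directly. The sketch below treats "+" as XOR, keeps the slide's bit order C1 C2 D1 C3 D2 D3 D4, and recomputes the three checks on reception; read as a binary number, the failed checks give the position of the flipped bit. The function names are illustrative. With these equations the minimum distance between code words works out to 3, which matches d = C + D + 1 with C = 1 and D = 1.

```python
from itertools import combinations, product

def encode(d1, d2, d3, d4):
    """Build the 7-bit word C1 C2 D1 C3 D2 D3 D4."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    return [c1, c2, d1, c3, d2, d3, d4]

def correct(word):
    """Return (corrected word, error position), position 0 meaning no error."""
    c1, c2, d1, c3, d2, d3, d4 = word
    s1 = c1 ^ d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    s2 = c2 ^ d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    s3 = c3 ^ d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s3
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position

word = encode(1, 0, 1, 1)
damaged = list(word)
damaged[4] ^= 1  # flip position 5 (D2)
repaired, position = correct(damaged)

codewords = [encode(*bits) for bits in product([0, 1], repeat=4)]
min_distance = min(
    sum(a != b for a, b in zip(u, v)) for u, v in combinations(codewords, 2)
)
print(word, repaired, position, min_distance)
```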

  21. Overview of Hardware Fault Tolerance (6) • Cyclic Redundancy Codes (CRC) : Data/A -> (Data+R)/A • What about errors that are exactly divisible by A? • Berger Code : Count the # of 0s and append the count. 10011010 -> 10011010100
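
Both codes on this slide can be sketched with bit strings. The assumptions here are mine: the CRC uses the common "shift the data, then append the remainder" construction with modulo-2 division, the divisor `1011` is an arbitrary example for A, and all function names are hypothetical. The Berger example reproduces the slide's 3-bit count; in general a count of up to n zeros needs `n.bit_length()` bits, which is the default below.

```python
def mod2_remainder(bits, divisor):
    """Remainder of modulo-2 (XOR) long division on '0'/'1' strings."""
    work = [int(b) for b in bits]
    div = [int(b) for b in divisor]
    for i in range(len(work) - len(div) + 1):
        if work[i]:
            for j, d in enumerate(div):
                work[i + j] ^= d
    return "".join(map(str, work[-(len(div) - 1):]))

def crc_encode(data, divisor):
    """Append the remainder R so that the sent word is divisible by A."""
    padded = data + "0" * (len(divisor) - 1)
    return data + mod2_remainder(padded, divisor)

def xor_bits(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def berger_encode(data, width=None):
    """Append the number of 0s in the data as a binary count."""
    if width is None:
        width = len(data).bit_length()
    return data + format(data.count("0"), f"0{width}b")

A = "1011"
sent = crc_encode("10011010", A)
one_bit_error = xor_bits(sent, "1".ljust(len(sent), "0"))
multiple_of_a = xor_bits(sent, A.ljust(len(sent), "0"))

print(mod2_remainder(sent, A))           # zero remainder: accepted
print(mod2_remainder(one_bit_error, A))  # non-zero remainder: detected
print(mod2_remainder(multiple_of_a, A))  # zero remainder: error goes unnoticed
print(berger_encode("10011010", width=3))
```

The third line illustrates the question posed on the slide: an error pattern that is itself divisible by A leaves the remainder at zero, so the CRC cannot detect it.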
