
CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12)

CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12). Alexander A. Shvartsman Computer Science and Engineering University of Connecticut. Fault-Tolerance -- An Overview. A fundamental property of distributed systems: potential for fault tolerance


Presentation Transcript


  1. CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer Science and Engineering University of Connecticut

  2. Fault-Tolerance -- An Overview • A fundamental property of distributed systems: • potential for fault tolerance • The main tool in achieving fault tolerance is • redundancy • Distributed systems consist of multiple components: • When more than one resource is capable of performing a certain function, some fault tolerance is achievable • Goal • Take advantage of the multiplicity of resources in constructing systems that tolerate failures

  3. Fault Tolerance and Dependability • A system specification may call for fault-tolerance • By stating that the system must perform correctly • Even if certain internal or external components fail to perform according to their specifications • Additionally, the degradation in performance due to failures must be “graceful” • Dependability is a closely related notion • Trustworthiness of a computer system, i.e., • Reliance can justifiably be placed on the system’s service • Dependability is achieved in part through fault-tolerance

  4. Faults, Errors and Failures • We distinguish among faults, errors and failures: • Fault: (or defect) a component or a subsystem fails to perform according to its specification • Error: a computation enters an incorrect state as the result of a fault • Failure: a system fails to meet its specification as the result of an error • Faults may or may not lead to an error • Errors may or may not lead to a failure
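The distinction on this slide can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not part of the original notes: a hypothetical adder whose lowest output bit is stuck at 0 (the fault), and a hypothetical service whose specification is to report whether a + b >= 5.

```python
THRESHOLD = 5  # hypothetical specification: report whether a + b >= THRESHOLD


def faulty_adder(a, b, fault_active):
    """Adder whose lowest output bit is stuck at 0 while the fault is active."""
    result = a + b
    if fault_active:
        result &= ~1  # the fault: bit 0 of the sum is forced to 0
    return result


def classify(a, b, fault_active):
    """Return (fault, error, failure) for one run of the service."""
    internal = faulty_adder(a, b, fault_active)
    error = internal != a + b                  # incorrect internal state
    output = internal >= THRESHOLD             # what the service reports
    failure = output != (a + b >= THRESHOLD)   # deviation from the specification
    return fault_active, error, failure


print(classify(3, 3, True))  # fault present, sum is even: no error
print(classify(3, 4, True))  # error (6 instead of 7), but the output is still correct
print(classify(2, 3, True))  # error (4 instead of 5) changes the output: failure
```

The three calls show a fault without an error, an error without a failure, and an error that does propagate to a failure.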

  5. Fault-Tolerance -- Basic Approaches • Fault prevention: • eliminating faults • before the system is put into use or • during periodic preventive maintenance • Fault tolerance: • a system detects errors caused by faults, • corrects its state and • does not fail for as long as the faults and errors are within its design parameters • Fault masking: • a fault-tolerant system is capable of dealing with faults and errors • in a way that is transparent to the users of the system’s services

  6. Fault Classification (in order of increasing severity: crash, omission, timing, Byzantine) • Crash fault • Fail-stop processor (detectable crash) • Failure after a send/receive • Omission fault • Communication, send or receive omission • Operation • Timing fault • Processor delays • Link time-out • Byzantine fault • Arbitrary fault • Malicious behavior
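The severity ordering on this slide can be written down as an ordered enumeration. This is only a sketch: the names `FaultClass` and `tolerates` are hypothetical, and the sketch assumes the common reading in which each class includes the behaviors of the less severe classes, so that a design aimed at a more severe class also covers the milder ones.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes from the slide, numbered by increasing severity."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4


def tolerates(designed_for, observed):
    """Assumed rule: a design covers every class up to the one it targets."""
    return observed <= designed_for


print(tolerates(FaultClass.TIMING, FaultClass.CRASH))     # milder class than the target
print(tolerates(FaultClass.OMISSION, FaultClass.BYZANTINE))  # more severe than the target
```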

  7. Models of Processor Failures and Restarts • Fail-stop processors • Model assumptions, e.g., • Shared memory • Robust interconnect • Resilient memory • Timing guarantees • [Figure labels: undetectable restarts, detectable restarts, synchronous restarts, no restarts, initial faults]

  8. Fault Tolerance, Redundancy and Efficiency • Fault tolerance is achieved through redundancy • Redundancy in components/resources -- space redundancy: • additional components (hardware or software) are provided or made available to deal with errors • distributed systems have inherent redundancy • Redundancy in computation or time redundancy: • additional computation is performed to detect errors or to test components • here the cost is performance
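The two kinds of redundancy can be sketched as follows. The notes do not prescribe these techniques; majority voting over replicas and repeat-and-compare are used here only as familiar examples, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(replica_outputs):
    """Space redundancy: accept the value returned by a strict majority of replicas."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count * 2 <= len(replica_outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_twice_and_compare(compute, x):
    """Time redundancy: repeat the computation and compare results to detect an error."""
    first = compute(x)
    second = compute(x)  # the extra run is the performance cost
    if first != second:
        raise RuntimeError("results differ: error detected")
    return first


# One of three replicas returns a wrong value; the voter masks it.
print(majority_vote([7, 7, 9]))
# A deterministic, fault-free computation passes the comparison.
print(run_twice_and_compare(lambda v: v * v, 4))
```

The voter spends extra components to mask an error, while the repeated run spends extra time and only detects one, which matches the cost remark on the slide.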

  9. Combining Fault-Tolerance and Efficiency • A fundamental conflict exists between efficiency and fault tolerance: • Efficiency implies low redundancy • Fault tolerance implies high redundancy • Robustness • Property of a system that combines • Efficiency and • Fault-tolerance, e.g., correctness under failures • Achieving robustness is very challenging in many cases • Efficiency often must be traded off for fault tolerance

  10. Strategies for Fault Tolerance • Layered architecture: • a structuring technique in achieving fault tolerance • A failure of a lower-layer component may/will manifest itself as a fault to a higher layer • Error at a lower layer may be contained or masked • When this is not possible, the layer attempts • to reduce the severity of the error and • to manifest itself through a more benign failure
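A minimal sketch of the layered strategy, under stated assumptions: the lower layer is modeled as a list of hypothetical replica read functions, its failure is an exception (`LowerLayerFailure`), and the "more benign failure" is a clean, detectable exception (`BenignFailure`) rather than a wrong answer. None of these names come from the notes.

```python
class LowerLayerFailure(Exception):
    """Failure of layer N-1; layer N sees it as a fault."""


class BenignFailure(Exception):
    """Clean, detectable failure that layer N reports to layer N+1."""


def layer_n_read(replicas, key):
    """Layer N: mask lower-layer failures if any replica can answer."""
    for read in replicas:
        try:
            return read(key)   # success: the failure of earlier replicas is masked
        except LowerLayerFailure:
            continue           # contain the error and try the next replica
    # Masking is impossible: fail explicitly instead of returning bad data.
    raise BenignFailure(f"value for {key!r} unavailable")


def broken_replica(key):
    raise LowerLayerFailure("disk unreachable")


def healthy_replica(key):
    return key.upper()


print(layer_n_read([broken_replica, healthy_replica], "a"))
```

When every replica fails, the caller receives `BenignFailure`, which it can in turn treat as a fault at its own layer.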

  11. Layer Architecture for Fault-Tolerance • [Figure: layers N-1, N and N+1; within each layer a fault may lead to an error and an error may lead to a failure; a failure of one layer appears as a fault to the layer above]

  12. Phases in Fault Tolerance • Fault prevention and fault tolerance are complementary: • both are needed for dependability • Fault tolerance and its “phases” • Error detection • Tests, checks and diagnostics • Damage confinement • Dynamic assessment of damage boundaries • Static firewalls • Progress evaluation and error recovery • Backward recovery, checkpointing, roll back • Forward recovery and self-stabilization • Processor scheduling and load balancing • Fault treatment and continued system service • Fault location • System repair • Dynamic reconfiguration • Standby spare components
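Backward recovery with checkpointing and roll back can be sketched as below. This is an illustration under assumptions, not the method of the notes: the state is an in-memory Python object, the checkpoint is a deep copy, and error detection is a caller-supplied acceptance test. The names `CheckpointedState` and `apply_step` are hypothetical.

```python
import copy


class CheckpointedState:
    """Holds a mutable state together with its last saved checkpoint."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def apply_step(cs, step, is_valid):
    """Run one step; roll back if it raises or fails the acceptance test."""
    cs.checkpoint()
    try:
        step(cs.state)
    except Exception:
        cs.rollback()   # error detected through an exception
        return False
    if not is_valid(cs.state):
        cs.rollback()   # error detected by the acceptance test
        return False
    return True


account = CheckpointedState({"balance": 100})
non_negative = lambda s: s["balance"] >= 0
apply_step(account, lambda s: s.update(balance=s["balance"] - 30), non_negative)
apply_step(account, lambda s: s.update(balance=s["balance"] - 100), non_negative)
print(account.state)  # the second step was rolled back
```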

  13. Faults: Causes and Temporal Effects • Faulty system -- a system with defects • Faulty requirements • Design faults • Hardware faults • Software . . . bugs (I don’t know who put it there) • Operational faults • Faults -- temporal taxonomy • Transient fault -- limited duration • Intermittent fault -- occurs repeatedly • Permanent fault -- manifests itself until fixed • Faults and fault masking • Is fault masking “good”? • If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults? • Are faults “bad”? • Is a system containing faults necessarily defective?

  14. Models of Failure: Overall Considerations • Models need to capture/abstract/approximate reality • Type of failures -- • severity: fail-stop, malicious failures, memory contamination • Kind of failure-causing adversary -- • omniscient or oblivious; on-line adaptive or off-line. • Duration: • no-restart <-> restartable • Frequency of failures -- • rate of processor attrition (one time, arbitrary, probabilistic) • Fine/coarse granularity of failures -- • components: processors / gates, processor / thread failures • Magnitude of failures -- • total number of failures (and recoveries) during computation

  15. Designing for F/T: Evaluation Criteria • What is the cost of failure? Is it bearable? • How much is one willing to pay for fault tolerance? • Is slower response preferable to a failure? • Is higher HW cost acceptable? • Is lower HW cost acceptable as long as failures are masked? • What is the goal of building in some fault tolerance? • Elimination of (some) failures? • Reduction in the severity of failures? • Error detection? • When the failures are corrected, • Is a slower response time acceptable as long as the computation is correct? • Is a slight error acceptable as long as the computation completes within the required time?
