Enhancing Fault Tolerance in Distributed Systems: Understanding Failures and Recovery

Fault Tolerance CSCI 4780/6780

Failures in Distributed Systems
• Partial failures – characteristic of distributed systems
• Goals:
  • Construct systems which can automatically recover from partial failures
  • System should operate in an acceptable way even during failures

Basics of Dependable Systems
• Availability – Property that the system is operating correctly at a given moment
• Reliability – Property that a system can run continuously without failure
• Safety – Failures should not lead to catastrophes
• Maintainability – How easily a failed system can be repaired
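
The difference between availability and reliability can be made concrete with a small calculation. The sketch below assumes the common steady-state approximation availability = MTTF / (MTTF + MTTR), which is not stated on the slides; the function name and the parameters `mttf_hours` and `mttr_hours` are illustrative assumptions.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming steady-state behaviour.

    mttf_hours: mean time to failure (assumed input, not from the slides)
    mttr_hours: mean time to repair (assumed input, not from the slides)
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Down for about one millisecond every hour: very high availability,
# yet it cannot run continuously for long, so reliability is poor.
frequent_short = availability(mttf_hours=1.0, mttr_hours=1.0 / 3_600_000)

# Never crashes while running, but is shut down two weeks out of 52:
# reliable between shutdowns, yet availability is only 50/52.
rare_long = availability(mttf_hours=50 * 7 * 24, mttr_hours=2 * 7 * 24)

print(f"frequent short outages: {frequent_short:.7f}")
print(f"rare long outage:       {rare_long:.4f}")
```

The two cases show that a high availability figure says nothing about how long the system runs between failures, which is what reliability measures.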

Failures, Errors and Faults
• Failure – A system not meeting its promises
• Error – Part of a system’s state that may lead to failure
  • E.g., damaged packets
• Fault – Cause of an error
  • Bad transmission medium, bad disk, etc.
• Types of faults
  • Transient – Occur once and disappear
  • Intermittent – Appear, vanish and reappear
  • Permanent – Continue until repaired
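
One practical consequence of the fault types is that repeating an operation can get past a transient fault but not a permanent one. The following is a minimal sketch of that idea; `TransientFault`, `retry`, and `make_faulty` are hypothetical names invented for this illustration, not part of the course material.

```python
class TransientFault(Exception):
    """Hypothetical error type standing in for a fault that may go away."""


def retry(operation, attempts=3):
    """Call operation() up to `attempts` times; re-raise the last fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_fault = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as fault:
            last_fault = fault  # remember the fault and try again
    raise last_fault


def make_faulty(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault(f"attempt {state['calls']} failed")
        return "ok"

    return operation


# Transient fault: occurs once, so the second attempt succeeds.
transient_result = retry(make_faulty(1))

# Fault that outlasts every retry: it behaves like a permanent fault here,
# so retrying alone cannot hide it and repair is needed.
try:
    retry(make_faulty(10**9))
    permanent_masked = True
except TransientFault:
    permanent_masked = False
```

An intermittent fault sits between the two cases: whether retrying helps depends on whether the fault happens to be absent during one of the attempts.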

Failure Models
• Different types of failures.

Arbitrary Failures
• A crash failure is a benign way of halting the service
• Fail-stop failures – Halting can be detected by other processes
  • The halting server may announce its status
• Fail-silent systems – Halting is not announced
  • Other processes need to detect the failure
• Fail-safe – Server produces random output
  • Other servers can recognize the output as faulty and so detect the failure
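
The contrast between fail-stop and fail-silent behaviour can be sketched with a simple failure detector. A heartbeat timeout is assumed here as the detection mechanism; the slides do not prescribe one. The class `FailureDetector` and the parameter `timeout_s` are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class FailureDetector:
    """Tracks peers via explicit halt announcements and heartbeat timeouts."""

    def __init__(self, timeout_s: float):
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self.timeout_s = timeout_s
        self.last_seen = {}     # peer -> time of its most recent heartbeat
        self.announced = set()  # peers that announced they are halting

    def heartbeat(self, peer, now):
        """Record that `peer` was heard from at time `now`."""
        self.last_seen[peer] = now

    def announce_halt(self, peer):
        """Fail-stop case: the halting peer tells us itself."""
        self.announced.add(peer)

    def suspected(self, now):
        """Peers believed to have halted at time `now`."""
        # Fail-silent case: nothing was announced, so silence is the only clue.
        silent = {
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }
        return self.announced | silent


detector = FailureDetector(timeout_s=3.0)
for peer in ("A", "B", "C"):
    detector.heartbeat(peer, now=0.0)

detector.announce_halt("A")      # A halts and announces it (fail-stop)
detector.heartbeat("C", now=4.0)  # C keeps sending; B has gone silent

suspected_at_5 = detector.suspected(now=5.0)
```

A timeout can only suspect a peer: a slow or partitioned process looks the same as a halted one, which is why fail-silent halts are harder to handle than announced ones.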

Failure Masking by Redundancy
• Hiding failures from other processes
• Three types of redundancy
  • Information redundancy – Extra data is added to hide failures
    • E.g., Hamming codes
  • Timing redundancy – Extra actions are performed to hide failures
    • E.g., redoing a transaction
  • Physical redundancy – Extra equipment (or processes) is added to hide failures
    • Extra disks, process pools, etc.
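
As an illustration of information redundancy, here is a minimal sketch of a Hamming(7,4) code: three parity bits are added to four data bits so that any single flipped bit can be located and corrected. The slides only name Hamming codes; the specific (7,4) variant, the bit layout, and the function names are choices made for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as the codeword [p1, p2, d1, p4, d2, d3, d4]."""
    if len(data) != 4 or any(bit not in (0, 1) for bit in data):
        raise ValueError("expected exactly four bits")
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    if len(codeword) != 7 or any(bit not in (0, 1) for bit in codeword):
        raise ValueError("expected exactly seven bits")
    c = list(codeword)
    # Recompute each parity check; together they spell the error position.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


sent = hamming74_encode([1, 0, 1, 1])
received = sent.copy()
received[4] ^= 1  # simulate one bit damaged in transit
recovered = hamming74_decode(received)
assert recovered == [1, 0, 1, 1]
```

The receiver never learns that a bit was damaged unless it checks the syndrome, which is the sense in which the failure is hidden. This variant corrects only single-bit errors; two flipped bits in one codeword would be miscorrected.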

Triple Modular Redundancy
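
In triple modular redundancy, a component is replicated three times and a voter forwards the majority output, so one faulty replica is masked. The slide's figure is not available in this transcript, so the sketch below is a generic illustration of the voting step only; `vote`, `tmr`, and the replica functions are hypothetical names.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return vote([replica(x) for replica in replicas])


def good(x):
    """A correctly working replica."""
    return x * x


def faulty(x):
    """A replica with a permanent fault: it always returns the same junk."""
    return -1


# One faulty replica out of three is outvoted by the two correct ones.
masked = tmr([good, faulty, good], 7)
assert masked == good(7)
```

With two replicas faulty in different ways, no value reaches a majority and `vote` raises instead of guessing; if two replicas fail identically, the voter would pass on the wrong value, which is the limit of masking with three copies.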
