Enhancing Fault Tolerance in Distributed Systems: Understanding Failures and Recovery
This document explores the concept of fault tolerance in distributed systems, focusing on partial failures and the strategies to ensure systems can automatically recover from such incidents. It outlines key principles of dependable systems, including availability, reliability, safety, and maintainability. The text explains the differences between failures, errors, and faults, while categorizing types of faults and failures. Additionally, it discusses failure models, failure masking through redundancy techniques, and the importance of implementing physical, information, and timing redundancy to enhance system resilience.
Presentation Transcript
Fault Tolerance CSCI 4780/6780
Failures in Distributed Systems
• Partial failures – characteristic of distributed systems
• Goals:
  • Construct systems which can automatically recover from partial failures
  • System should operate in an acceptable way even during failures
Basics of Dependable Systems
• Availability – Property that the system is operating correctly at a given moment
• Reliability – Property that a system can run continuously without failure
• Safety – Failures should not lead to catastrophes
• Maintainability – How easily a failed system can be repaired
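The difference between availability and reliability is easier to see with numbers. The slide gives no formula; the sketch below assumes the common steady-state estimate, availability = MTTF / (MTTF + MTTR), where MTTF (mean time to failure) and MTTR (mean time to repair) are quantities introduced here for illustration. The function name and the sample figures are hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# System A fails every hour but recovers in about a millisecond:
# highly available, yet not reliable (it cannot run long without a failure).
system_a = steady_state_availability(mttf_hours=1.0, mttr_hours=0.001 / 3600)

# System B runs for a long time between failures but takes two weeks to repair:
# reliable, yet noticeably less available.
system_b = steady_state_availability(mttf_hours=8000.0, mttr_hours=336.0)

print(f"A: {system_a:.6f}  B: {system_b:.6f}")
```

Under these assumed figures, A scores higher on availability even though it fails far more often, which is why the two properties are listed separately.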
Failures, Errors and Faults
• Failure – A system not meeting its promises
• Error – Part of the system's state that may lead to a failure
  • E.g., damaged packets
• Fault – Cause of an error
  • E.g., a bad transmission medium or a bad disk
• Types of faults
  • Transient – Occur once and disappear
  • Intermittent – Appear, vanish and reappear
  • Permanent – Continue until repaired
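The fault, error, failure chain can be illustrated with the damaged-packet example. This is a minimal sketch, not part of the original slides: a simulated bit flip plays the role of the fault in the transmission medium, the damaged packet is the error, and a CRC-32 check (an assumption chosen here for illustration) lets the receiver catch the error before it turns into a failure. The helper names `make_packet`, `flip_bit` and `is_damaged` are hypothetical.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect damage."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def flip_bit(packet: bytes, bit_index: int) -> bytes:
    """Simulate a fault in the transmission medium by flipping one bit."""
    damaged = bytearray(packet)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def is_damaged(packet: bytes) -> bool:
    """Return True if the stored CRC does not match the payload (an error)."""
    payload, received_crc = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") != received_crc


packet = make_packet(b"balance=100")
damaged = flip_bit(packet, bit_index=5)  # the fault produces an error

print(is_damaged(packet))   # False
print(is_damaged(damaged))  # True
```

If the receiver skipped the check and acted on the damaged payload, the error would surface as a failure: the system would no longer meet its promises.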
Failure Models
• Different types of failures
Arbitrary Failures
• A crash failure is a benign way for a service to halt
• Fail-stop failures – Halting can be detected by other processes
  • The halting server may announce its status
• Fail-silent systems – Halting is not announced
  • Other processes need to detect the failure
• Fail-safe – The server produces random output
  • Other servers can recognize the output as invalid and detect the failure
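For fail-silent systems, other processes have to detect the halt themselves. One common approach, sketched here under assumptions that are not in the slides, is a timeout-based heartbeat detector: a process is suspected once no heartbeat has arrived within a chosen timeout. The class name, the `clock` parameter and the server names are hypothetical, and a fake clock keeps the demo deterministic.

```python
import time
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Suspects a process if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process_id: str) -> None:
        """Record that `process_id` was alive just now."""
        self.last_seen[process_id] = self.clock()

    def suspected(self) -> List[str]:
        """Return processes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)


# Deterministic demo: advance a fake clock instead of sleeping.
fake_now = [0.0]
detector = HeartbeatDetector(timeout=3.0, clock=lambda: fake_now[0])
detector.heartbeat("server-a")
detector.heartbeat("server-b")
fake_now[0] = 2.0
detector.heartbeat("server-a")   # server-b has gone silent
fake_now[0] = 4.0
print(detector.suspected())      # ['server-b']
```

A timeout can only raise suspicion: a slow process or a slow network looks the same as a silent halt, so a suspected process may still be alive.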
Failure Masking by Redundancy
• Hiding failures from other processes
• Three types of redundancy
  • Information redundancy – Extra data is added to hide failures
    • E.g., Hamming codes
  • Timing redundancy – Extra actions are performed to hide failures
    • E.g., redoing a transaction
  • Physical redundancy – Extra equipment (or processes) is added to hide failures
    • E.g., extra disks, process pools
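The three kinds of redundancy can each be sketched in a few lines. The code below is illustrative only and makes assumptions beyond the slides: a Hamming(7,4) code stands in for information redundancy, a simple retry loop for timing redundancy, and a majority vote over replica replies for physical redundancy. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def hamming74_encode(data: Sequence[int]) -> List[int]:
    """Information redundancy: add 3 parity bits to 4 data bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: Sequence[int]) -> List[int]:
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def retry(action: Callable[[], T], attempts: int = 3) -> T:
    """Timing redundancy: redo an action until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise   # the fault outlasted every attempt, so it is not masked
    raise AssertionError("unreachable")


def majority_vote(replies: Sequence[T]) -> T:
    """Physical redundancy: accept the value a strict majority of replicas agree on."""
    if not replies:
        raise ValueError("no replies to vote on")
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise ValueError("no strict majority")
    return value


# Information redundancy: one damaged bit is corrected.
codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1
print(hamming74_decode(codeword))        # [1, 0, 1, 1]

# Timing redundancy: an action that fails twice, then succeeds.
calls = {"n": 0}


def flaky_commit() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "committed"


print(retry(flaky_commit, attempts=3))   # committed

# Physical redundancy: one of three replicas returns a wrong value.
print(majority_vote([42, 42, 7]))        # 42
```

Retrying helps with transient and intermittent faults but not with permanent ones, and it is only safe when redoing the action has no unwanted side effects.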