Uploaded by alessa
8 SLIDES

Enhancing Fault Tolerance in Distributed Systems: Understanding Failures and Recovery

DESCRIPTION

This document explores the concept of fault tolerance in distributed systems, focusing on partial failures and the strategies to ensure systems can automatically recover from such incidents. It outlines key principles of dependable systems, including availability, reliability, safety, and maintainability. The text explains the differences between failures, errors, and faults, while categorizing types of faults and failures. Additionally, it discusses failure models, failure masking through redundancy techniques, and the importance of implementing physical, information, and timing redundancy to enhance system resilience.


Presentation Transcript


  1. Fault Tolerance CSCI 4780/6780

  2. Failures in Distributed Systems
     • Partial failures – characteristic of distributed systems
     • Goals:
         • Construct systems which can automatically recover from partial failures
         • System should operate in an acceptable way even during failures

  3. Basics of Dependable Systems
     • Availability – Property that the system is operating correctly at a given moment
     • Reliability – Property that a system can run continuously without failures
     • Safety – Failures should not lead to catastrophes
     • Maintainability – How easily a failed system can be repaired
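The difference between availability and reliability is easy to blur, so a small numeric sketch may help. The two systems and their downtime figures below are hypothetical, chosen only for illustration; the function simply computes the fraction of time a system is up.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of total time during which the system operates correctly."""
    return uptime_seconds / (uptime_seconds + downtime_seconds)


HOUR = 3600.0
DAY = 24 * HOUR

# Hypothetical system A: goes down for 1 millisecond every hour.
# It is almost always up (high availability) but never runs a full
# hour without a failure (low reliability).
system_a = availability(HOUR - 0.001, 0.001)

# Hypothetical system B: never crashes, but is shut down for 14 days a year.
# It runs for long stretches without failure (high reliability) but is
# up for a smaller fraction of the year (lower availability).
system_b = availability((365 - 14) * DAY, 14 * DAY)

print(f"A: {system_a:.7%}  B: {system_b:.2%}")
```

Under these assumed figures, system A has the higher availability even though it fails far more often, which is why the two properties are listed separately.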

  4. Failures, Errors and Faults
     • Failure – A system not meeting its promises
     • Error – Part of a system's state that may lead to a failure
         • E.g., damaged packets
     • Fault – Cause of an error
         • Bad transmission medium, bad disk, etc.
     • Types of faults
         • Transient – Occur once and disappear
         • Intermittent – Appear, vanish and reappear
         • Permanent – Persist until the faulty component is repaired
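Because a transient fault disappears on its own, simply repeating the operation is often enough to get past it, while a permanent fault will exhaust any number of retries. The following is a minimal sketch of that idea; `TransientFault`, `retry` and `make_flaky` are hypothetical names introduced here, not part of the presentation.

```python
import time


class TransientFault(Exception):
    """Hypothetical error type for a fault that may clear on a later attempt."""


def retry(operation, attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TransientFault as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed: the fault behaves as permanent from here.
    raise last_error


def make_flaky(failures_before_success):
    """Build a test operation that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("damaged packet")
        return "ok"

    return operation
```

For example, `retry(make_flaky(2))` succeeds on the third attempt, whereas `retry(make_flaky(5), attempts=3)` re-raises the last `TransientFault`.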

  5. Failure Models • Different types of failures.

  6. Arbitrary Failures
     • A crash failure is the most benign way for a server to halt
     • Fail-stop failures – Halting can be detected by other processes
         • The halting server may announce its status
     • Fail-silent systems – Halting is not announced
         • Other processes need to detect the failure
     • Fail-safe – The server produces random output
         • Other processes can detect the failure because the output is recognizable as junk
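The contrast between an announced halt and a silent one can be sketched with a timeout-based monitor. This is only one possible detection scheme, assumed here for illustration; `HeartbeatMonitor` and its methods are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks processes by periodic heartbeats and optional halt announcements."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}          # process id -> time of last heartbeat
        self.announced_halt = set()  # processes that said they are halting

    def heartbeat(self, process_id, now):
        """Record that process_id was alive at time `now`."""
        self.last_seen[process_id] = now

    def announce_halt(self, process_id):
        """Fail-stop style: the process tells others it is halting."""
        self.announced_halt.add(process_id)

    def status(self, process_id, now):
        if process_id in self.announced_halt:
            return "halted (announced)"
        last = self.last_seen.get(process_id)
        if last is None or now - last > self.timeout_seconds:
            # Fail-silent style: no announcement, only missing heartbeats.
            # A timeout cannot tell a crashed process from a slow one,
            # so the result is a suspicion, not a certainty.
            return "suspected"
        return "alive"


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.heartbeat("server-1", now=0.0)
monitor.heartbeat("server-2", now=0.0)
monitor.announce_halt("server-1")
```

With this setup, `monitor.status("server-2", now=3.0)` reports the process as alive, and the same call at `now=10.0` reports it as suspected, since nothing was announced and the heartbeat is overdue.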

  7. Failure Masking by Redundancy
     • Hiding failures from other processes
     • Three types of redundancy
         • Information redundancy – Extra data is added to hide failures
             • E.g., Hamming codes
         • Timing redundancy – Extra actions are performed to hide failures
             • E.g., redoing a transaction
         • Physical redundancy – Extra equipment (or processes) is added to hide failures
             • E.g., extra disks, process pools, etc.
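As an illustration of information redundancy, the sketch below implements the standard Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The slides mention Hamming codes only by name; the choice of the (7,4) variant and the function names are assumptions made for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit; return (data_bits, error_position).

    error_position is 1-based, or 0 when no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three syndrome bits spell out the position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


sent = hamming74_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1  # simulate a damaged bit at position 5
data, position = hamming74_decode(received)
```

The receiver recovers the original four data bits without asking for retransmission, which is the sense in which the extra bits hide the failure. Note that this code corrects only single-bit errors; two flipped bits would be miscorrected.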

  8. Triple Modular Redundancy
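The slide carries only its title, so the following is a minimal sketch of the general idea behind triple modular redundancy rather than a reconstruction of the original diagram: three replicas compute the same result and a voter passes on the majority value, masking one faulty replica. The names `majority_vote` and `tmr` are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise ValueError("no majority among replica outputs")


def tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def correct(x):
    return x + 1


def faulty(x):
    return 0  # stands in for a replica producing wrong output


result = tmr([correct, correct, faulty], 4)
```

Here the voter masks the single faulty replica and returns the value the two correct replicas agree on. If two replicas fail in different ways, no majority exists and the failure can no longer be hidden. This sketch assumes replica outputs are hashable and that replicas return a value rather than raising.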
