Middleware for Fault Tolerant Applications

Middleware for Fault Tolerant Applications • Lihua Xu and Sheng Liu • June 5, 2003

Outline • Basic technologies in fault tolerance • Middleware for fault tolerant applications • Egida • AQuA

Why Fault Tolerance? • “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” (Leslie Lamport, May 1987)

Basic Technologies in Fault Tolerant Distributed Systems • Hardened hardware component technologies • Fault detection and membership maintenance • Log-based scheme and checkpointing

Hardened hardware component technologies • Hardened processor modules: pair of self-checking processors (PSP) • RAID (redundant array of inexpensive disks): popular even in database-centric business computing applications

Fault detection and membership maintenance • Timeout • Comparison of the results of repeated or redundant executions • Error-detection and error-correction codes • Acceptance test: test the reasonableness of intermediate computation results • Membership maintenance • Simplest version: a master node makes a periodic roll-call of the other nodes • Heartbeat message exchange
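
The timeout and heartbeat ideas above can be combined into a very small failure detector. The following is only a sketch under stated assumptions: the class name `HeartbeatMonitor`, its methods, and the explicit `now` argument (used in place of a real clock so the example is deterministic) are all hypothetical and not taken from any system described in these slides.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each node and suspects silent ones.

    Hypothetical illustration: times are passed in explicitly instead of
    being read from a clock.
    """

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated
        self.last_seen = {}      # node id -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat from `node` received at time `now`."""
        self.last_seen[node] = now

    def members(self, now):
        """Nodes whose last heartbeat is within the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def suspected(self, now):
        """Nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("a", now=0.0)
monitor.heartbeat("b", now=0.0)
monitor.heartbeat("a", now=2.0)   # "b" stays silent after time 0
```

At time 4.0 node "a" is still a member and node "b" is suspected. A timeout only suspects a node; a slow node and a crashed node look the same to this detector.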

Log-based scheme and checkpointing • Log-based schemes record, on stable storage, information describing all the modifications made by a transaction to the data it accessed. • Checkpointing periodically saves state to stable storage in order to minimize the time taken to recover after a system crash.
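
How a log and a checkpoint work together can be sketched with a tiny key-value store. This is an assumption-laden illustration, not a scheme from the slides: the `Store` class, the file names, and the JSON encoding are hypothetical, and the local file system stands in for stable storage.

```python
import json
import os
import tempfile


class Store:
    """Key-value store with an append-only log and periodic checkpoints."""

    def __init__(self, directory):
        self.ckpt_path = os.path.join(directory, "checkpoint.json")
        self.log_path = os.path.join(directory, "log.jsonl")
        self.state = {}

    def put(self, key, value):
        # Record the modification on "stable storage" before applying it.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Write the full state atomically, then truncate the log, so that
        # recovery only has to replay updates made after the checkpoint.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
        open(self.log_path, "w").close()

    @classmethod
    def recover(cls, directory):
        # Load the last checkpoint, then replay the log on top of it.
        store = cls(directory)
        if os.path.exists(store.ckpt_path):
            with open(store.ckpt_path) as f:
                store.state = json.load(f)
        if os.path.exists(store.log_path):
            with open(store.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    store.state[key] = value
        return store


with tempfile.TemporaryDirectory() as d:
    s = Store(d)
    s.put("x", 1)
    s.checkpoint()
    s.put("y", 2)                 # logged, but not yet checkpointed
    recovered = Store.recover(d)  # simulates a restart after a crash
    recovered_state = recovered.state
```

Because each logged update overwrites a key, replaying an entry that is already reflected in the checkpoint is harmless, which covers a crash between writing the checkpoint and truncating the log.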

Middleware for Fault Tolerant Applications

Egida • It is an object-oriented toolkit designed to support transparent rollback recovery for low-overhead fault tolerance.

Log-based rollback recovery protocols • Log information is recorded on stable storage during failure-free execution. • That information is used to recover after a failure. • The protocols come in a set of variants, including checkpointing and message logging.

Checkpointing

Message Logging • Pessimistic logging allows processes to communicate only from recoverable states. • Optimistic logging allows processes to communicate with other processes even from states that are not yet recoverable. • Causal logging allows the possibility that a state from which a process communicates may become unrecoverable because of a failure, but only if no correct process depends on that state. • A correct process is one that exhibits no failures at any point in the execution under consideration. A process that crashes at some point is therefore “non-failed” before that point, but it is not “correct”, even before that point.
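
The pessimistic variant can be sketched as follows: every received message is logged before it is delivered, so the state from which the process later communicates is always recoverable by replay. This is a minimal sketch, not Egida code. The class `PessimisticProcess` and its methods are hypothetical, and an in-memory list stands in for stable storage.

```python
class PessimisticProcess:
    """Logs each received message before delivering it to the application."""

    def __init__(self, stable_log=None):
        # Assumption: this list stands in for synchronous stable storage.
        self.stable_log = stable_log if stable_log is not None else []
        self.total = 0  # application state: running sum of received values

    def receive(self, msg):
        self.stable_log.append(msg)  # log first (the non-deterministic event)
        self._deliver(msg)           # only then let the application see it

    def _deliver(self, msg):
        self.total += msg

    @classmethod
    def recover(cls, stable_log):
        # Rebuild the pre-crash state by replaying logged messages in order.
        process = cls(stable_log=list(stable_log))
        for msg in process.stable_log:
            process._deliver(msg)
        return process


p = PessimisticProcess()
p.receive(3)
p.receive(4)
q = PessimisticProcess.recover(p.stable_log)  # simulates restart after a crash
```

After recovery `q.total` equals `p.total`. An optimistic protocol would instead let `receive` deliver immediately and write the log asynchronously, trading recoverability guarantees for lower failure-free overhead.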

Deconstructing Log-Based Rollback-Recovery Protocols • The diversity of rollback-recovery protocols reflects the heterogeneity in the requirements of applications. • Despite this diversity, all these protocols share a simple event-driven structure and are interested in the same set of “relevant” events.

Relevant Events • Non-deterministic events • A non-deterministic event is an event whose outcome may change for different executions of the same program. • Dependency-generating events • These events can increase the number of processes that depend on the nondeterministic events executed by a process. • Output-commit events • These events can make the external environment depend on the non-deterministic events executed by a process. • Checkpointing events • These events instruct the protocols to write to stable storage the state of one or more processes. • Failure-detection events • These events are generated on detecting the failure of one or more processes.

A Simple Language for Specifying Rollback-recovery Protocols • A protocol is defined in terms of the actions it takes in response to non-deterministic events, dependency-generating events, output-commit events, checkpointing events and failure-detection events. • Implementing a specific protocol amounts to selecting the set of actions performed in response to each relevant event. • A simple language is used to specify the rollback-recovery protocols.
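
The idea that a protocol is just a choice of actions per relevant event can be illustrated with a small dispatch table. This is written in Python for illustration only and is not Egida's specification language. The names `Event`, `Protocol`, and the four action functions are hypothetical.

```python
from enum import Enum, auto


class Event(Enum):
    """The five classes of relevant events."""
    NON_DETERMINISTIC = auto()
    DEPENDENCY_GENERATING = auto()
    OUTPUT_COMMIT = auto()
    CHECKPOINTING = auto()
    FAILURE_DETECTION = auto()


# Hypothetical actions; each returns a description of what it would do.
def log_determinant(payload):
    return f"log determinant of {payload}"

def flush_log(payload):
    return "flush log to stable storage"

def take_checkpoint(payload):
    return "write checkpoint"

def roll_back(payload):
    return f"roll back after failure of {payload}"


class Protocol:
    """A protocol is a mapping from event class to a list of actions."""

    def __init__(self, actions):
        self.actions = actions
        self.trace = []  # record of the actions performed, for inspection

    def handle(self, event, payload=None):
        for action in self.actions.get(event, []):
            self.trace.append(action(payload))


# One possible selection of actions; a different selection would give a
# different protocol without changing the dispatch structure.
spec = {
    Event.NON_DETERMINISTIC: [log_determinant],
    Event.OUTPUT_COMMIT: [flush_log],
    Event.CHECKPOINTING: [take_checkpoint],
    Event.FAILURE_DETECTION: [roll_back],
}

protocol = Protocol(spec)
protocol.handle(Event.NON_DETERMINISTIC, "receive(m1)")
protocol.handle(Event.OUTPUT_COMMIT)
protocol.handle(Event.FAILURE_DETECTION, "p2")
```

Here `DEPENDENCY_GENERATING` has no actions bound, so handling it is a no-op.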

Module Definitions • To define a protocol completely, it is necessary to instantiate a set of variables which specify, for instance, the set of non-deterministic events, the form of their determinants, the implementation of stable storage, etc. • Egida identifies a set of building blocks which, when incorporated into the protocol structure, yield different rollback-recovery protocols.

Architecture

Synthesizing Protocols through Module Composition • Egida allows the co-existence of multiple implementations for each of the modules. • To synthesize a protocol, a specific implementation of each module must be selected. • Egida maintains a binding between the values for the modules and their corresponding implementations. • Therefore, synthesizing a protocol requires processing the specification along with the binding information to initialize the modules to their appropriate implementations.
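
The binding step described above can be pictured as a lookup from (module, value) pairs to implementation classes. The sketch below is hypothetical throughout: the module names, value names, classes, and the `synthesize` function are invented for illustration and do not reproduce Egida's actual modules or API.

```python
# Hypothetical module implementations.
class MemoryStorage:
    """Keeps logged data in memory (stand-in for a stable-storage module)."""

class FileStorage:
    """Would keep logged data in files."""

class IndependentCheckpointing:
    """Each process checkpoints on its own schedule."""

class CoordinatedCheckpointing:
    """Processes coordinate to take a consistent set of checkpoints."""


# Binding between module values and their implementations.
BINDINGS = {
    ("stable_storage", "memory"): MemoryStorage,
    ("stable_storage", "file"): FileStorage,
    ("checkpointing", "independent"): IndependentCheckpointing,
    ("checkpointing", "coordinated"): CoordinatedCheckpointing,
}


def synthesize(specification):
    """Instantiate one implementation per module named in the specification."""
    protocol = {}
    for module, value in specification.items():
        try:
            implementation = BINDINGS[(module, value)]
        except KeyError:
            raise ValueError(f"no implementation bound for {module}={value}")
        protocol[module] = implementation()
    return protocol


protocol = synthesize({
    "stable_storage": "memory",
    "checkpointing": "coordinated",
})
```

Adding a new implementation then means adding one class and one entry in `BINDINGS`; existing specifications are unaffected.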

Advantages • Promotes extensibility and flexibility by allowing multiple implementations of each of the core functionalities. • Facilitates rapid implementation of rollback-recovery protocols with minimal programming effort by gluing together objects from the available library of building blocks. • Enables designers of fault-tolerance protocols to develop new rollback-recovery protocols by combining different implementations of the core functionalities in novel ways.

AQuA: An Adaptive Architecture that provides dependable distributed objects

Overview • AQuA allows distributed applications to request and obtain a desired level of availability using a QuO contract through a property manager. • Fault tolerance in AQuA is provided by Proteus, which dynamically manages the replication of distributed objects to make them dependable.

Background • Ensemble group communication system: 1. ensures reliable communication between groups of processes, 2. ensures atomic delivery of multicasts to groups with changing membership, 3. detects and excludes from the group members that fail by crashing. • Maestro: object-oriented interface to Ensemble

Background (cont) • Quality Objects (QuO): 1. transmit applications’ availability requirements to Proteus, which attempts to configure the system to achieve the desired availability, 2. provide an adaptation mechanism that is used when Proteus is unable to provide a specified level of availability.

Background (cont) • Proteus: 1. a replicated dependability manager, consisting of an advisor and a protocol coordinator, 2. handlers, which implement voters and monitors in the gateway, 3. object factories, implemented on each host.

AQuA Architecture Overview

Group Structure in AQuA

Fault Tolerance in AQuA • Fault model: crash failures, value faults, time faults • Error detection: Proteus, voter, monitor • Fault treatment: Proteus manager, advisor
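
How a voter can mask a value fault among replicas can be illustrated with a simple majority vote over replies. This is a generic sketch, not AQuA's voter. The function name `majority_vote` is hypothetical, and it assumes replies are hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value given by a strict majority of replicas, else None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None


agreed = majority_vote([42, 42, 7])    # one replica returns a wrong value
undecided = majority_vote([1, 2, 3])   # no value has a majority
```

With three replicas, one value fault is masked (`agreed` is 42), while complete disagreement yields `None`, which a caller would treat as a detected but unmasked error.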
