
Middleware for Fault Tolerant Applications

Middleware for Fault Tolerant Applications. Lihua Xu and Sheng Liu, June 5, 2003. Outline: basic technologies in fault tolerance; middleware for fault tolerant applications (Egida, AQuA).


Presentation Transcript


  1. Middleware for Fault Tolerant Applications • Lihua Xu and Sheng Liu • June 5, 2003

  2. Outline • Basic technologies in fault tolerance • Middleware for fault tolerant applications • Egida • AQuA

  3. Why Fault Tolerance? • “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” (Leslie Lamport, May 1987)

  4. Basic Technologies in Fault Tolerant Distributed Systems • Hardened hardware component technologies • Fault detection and membership maintenance • Log-based scheme and checkpointing

  5. Hardened hardware component technologies • Hardened processor modules, e.g. a pair of self-checking processors (PSP) • RAID (redundant array of inexpensive disks): popular even in database-centric business computing applications.

  6. Fault detection and membership maintenance • Timeout • Comparison of the results of repeated or redundant executions • Error-detection and error-correction codes • Acceptance test: tests the reasonableness of intermediate computation results • Membership maintenance • Simplest version: a master node makes a periodic roll-call of the other nodes • Heartbeat message exchange
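The timeout and heartbeat ideas on this slide can be sketched in a few lines. This is a minimal illustration, not code from the presentation: the `HeartbeatMonitor` class, its method names, and the 3-second timeout are all assumptions, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Suspects a node when no heartbeat arrives within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from `node`.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected once its last heartbeat is older than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

    def members(self, now: float) -> set:
        # Current membership view: every known node that is not suspected.
        return set(self.last_seen) - self.suspected(now)


monitor = HeartbeatMonitor(timeout=3.0)   # hypothetical timeout value
monitor.heartbeat("a", now=0.0)
monitor.heartbeat("b", now=0.0)
monitor.heartbeat("a", now=2.5)           # "a" keeps sending; "b" goes silent
```

Asking for `monitor.suspected(4.0)` at this point reports only node "b". A real detector would also have to cope with message delay and clock drift, which this sketch ignores.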

  7. Log-based scheme and checkpointing • Log-based schemes record, on stable storage, information describing all the modifications by the transaction to the various data it accessed. • Checkpointing is a technique to minimize the time taken to recover in the event of a system crash.
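How a log and a checkpoint work together can be shown with a toy key-value store. The sketch below is an assumption-laden illustration: `RecoverableStore` is a made-up class, and two in-memory attributes stand in for stable storage.

```python
import copy


class RecoverableStore:
    """Key-value store that logs every update and can take checkpoints."""

    def __init__(self):
        self.state = {}
        # The two attributes below stand in for stable storage.
        self.checkpoint = {}
        self.log = []

    def put(self, key, value):
        # Log the modification before applying it.
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        # Save the full state; earlier log records are no longer needed.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Volatile state is lost; the checkpoint and the log survive.
        self.state = {}

    def recover(self):
        # Restore the checkpoint, then replay only the records logged after it.
        self.state = copy.deepcopy(self.checkpoint)
        for key, value in self.log:
            self.state[key] = value
        return len(self.log)   # number of records replayed


store = RecoverableStore()
store.put("x", 1)
store.put("y", 2)
store.take_checkpoint()
store.put("x", 3)
store.crash()
replayed = store.recover()
```

After recovery the state is back to `{"x": 3, "y": 2}` with a single log record replayed, which is the sense in which a checkpoint shortens recovery: only the log suffix written after it has to be processed.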

  8. Middleware for Fault Tolerant Applications

  9. Egida • It is an object-oriented toolkit designed to support transparent rollback recovery for low-overhead fault tolerance.

  10. Log-based rollback recovery protocols • Log information is recorded on stable storage during failure-free execution • That information is used to recover after a failure • The protocols come in a set of variants, including checkpointing and message logging.

  11. Checkpointing

  12. Message Logging • Pessimistic logging allows processes to communicate only from recoverable states. • Optimistic logging allows processes to communicate with other processes even from states that are not yet recoverable. • Causal logging allows the possibility that a state from which a process communicates may become unrecoverable because of a failure, but only if no correct process depends on that state. • A correct process is one that exhibits no failures at any point in the execution under consideration. A process that crashes at some point is therefore “non-failed” up to that point, but it is never “correct” in that execution.
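The difference between pessimistic and optimistic logging can be made concrete with a toy process. This is a sketch under stated assumptions, not any real protocol implementation: `LoggingProcess`, its fields, and the orphan check are hypothetical, and received messages play the role of non-deterministic events.

```python
class LoggingProcess:
    """Toy process that logs received messages (its non-deterministic events)."""

    def __init__(self, pessimistic: bool):
        self.pessimistic = pessimistic
        self.volatile = []   # determinants not yet on stable storage
        self.stable = []     # determinants on (simulated) stable storage
        self.sent = []       # (message, number of receives it depends on)

    def receive(self, msg):
        self.volatile.append(msg)

    def flush(self):
        # Move buffered determinants to stable storage.
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def send(self, msg):
        if self.pessimistic:
            # Pessimistic: communicate only from a recoverable state.
            self.flush()
        depends_on = len(self.stable) + len(self.volatile)
        self.sent.append((msg, depends_on))

    def orphans_after_crash(self):
        # A sent message is an orphan if it depends on a receive that
        # was never logged and therefore cannot be replayed.
        recoverable = len(self.stable)
        return [m for m, dep in self.sent if dep > recoverable]


def run(pessimistic):
    p = LoggingProcess(pessimistic)
    p.receive("m1")
    p.send("reply-to-m1")
    return p.orphans_after_crash()
```

With `run(True)` no message is orphaned, because the log is forced to stable storage before sending. With `run(False)` the reply depends on a receive that would be lost in a crash, which is the risk optimistic logging accepts in exchange for not blocking on stable storage.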

  13. Deconstructing Log-Based Rollback-Recovery Protocols • The diversity of rollback-recovery protocols reflects the heterogeneity in the requirements of applications. • Beneath this diversity lies a simple event-driven structure that all these protocols share: they all respond to the same set of “relevant” events.

  14. Relevant Events • Non-deterministic events • A non-deterministic event is an event whose outcome may change for different executions of the same program. • Dependency-generating events • These events can increase the number of processes that depend on the nondeterministic events executed by a process. • Output-commit events • These events can make the external environment depend on the non-deterministic events executed by a process. • Checkpointing events • These events instruct the protocols to write to stable storage the state of one or more processes. • Failure-detection events • These events are generated on detecting the failure of one or more processes.

  15. A Simple Language Specifying Rollback-recovery Protocols • A protocol is defined in terms of the actions it takes in response to non-deterministic events, dependency-generating events, output-commit events, checkpointing events and failure-detection events. • Implementing a specific protocol amounts to selecting the set of actions performed in response to each relevant event. • A simple language is used to specify the rollback-recovery protocols.
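The idea that a protocol is just a choice of actions per relevant event can be sketched as a dispatch table. Nothing below is Egida's actual specification language: the `Event` enum, the `Protocol` class, and the four action functions are hypothetical names chosen for illustration.

```python
from enum import Enum, auto


class Event(Enum):
    NON_DETERMINISTIC = auto()
    DEPENDENCY_GENERATING = auto()
    OUTPUT_COMMIT = auto()
    CHECKPOINT = auto()
    FAILURE_DETECTION = auto()


class Protocol:
    """A protocol is the set of actions it runs for each relevant event."""

    def __init__(self, actions):
        # Events with no entry trigger no action.
        self.actions = {e: list(actions.get(e, [])) for e in Event}
        self.trace = []

    def handle(self, event):
        for action in self.actions[event]:
            self.trace.append(action(event))


# Hypothetical actions; each one only returns a description of what it would do.
def log_determinant(event):
    return "log determinant"

def flush_log(event):
    return "flush log to stable storage"

def write_checkpoint(event):
    return "write checkpoint"

def roll_back(event):
    return "roll back and replay"


# One possible selection of actions, loosely in the style of pessimistic logging.
pessimistic = Protocol({
    Event.NON_DETERMINISTIC: [log_determinant, flush_log],
    Event.CHECKPOINT: [write_checkpoint],
    Event.FAILURE_DETECTION: [roll_back],
})
pessimistic.handle(Event.NON_DETERMINISTIC)
pessimistic.handle(Event.DEPENDENCY_GENERATING)   # no action selected
pessimistic.handle(Event.FAILURE_DETECTION)
```

Swapping the lists in the dictionary yields a different protocol without touching the dispatch code, which is the point the slide makes about selecting actions per event.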

  16. Module Definitions • To define a protocol completely, it is necessary to instantiate a set of variables that specify, for instance, the set of non-deterministic events, the form of their determinants, the implementation of stable storage, etc. • Egida identifies a set of building blocks which, when incorporated into the protocol structure, yield different rollback-recovery protocols.

  17. Architecture

  18. Synthesizing Protocols through Module Composition • Egida allows the co-existence of multiple implementations for each of the modules. • To synthesize a protocol, a specific implementation of each module must be selected. • Egida maintains a binding between the values for the modules and their corresponding implementations. • Therefore, synthesizing a protocol requires processing the specification along with the binding information to initialize the modules to their appropriate implementations.
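Synthesis through a binding table can be sketched as a lookup from specification values to implementation classes. The module names, values, and classes below are invented for illustration and are not Egida's real modules or API.

```python
# Hypothetical implementations of two modules.
class DiskStorage:
    name = "disk"

class MemoryStorage:
    name = "memory"

class IndependentCheckpointing:
    name = "independent"

class CoordinatedCheckpointing:
    name = "coordinated"


# Hypothetical binding table: module -> specification value -> implementation.
BINDINGS = {
    "stable_storage": {"disk": DiskStorage, "memory": MemoryStorage},
    "checkpointing": {
        "independent": IndependentCheckpointing,
        "coordinated": CoordinatedCheckpointing,
    },
}


def synthesize(spec):
    """Instantiate one implementation per module, as named in the specification."""
    missing = set(BINDINGS) - set(spec)
    if missing:
        # Every module needs a selected implementation.
        raise ValueError(f"no implementation selected for: {sorted(missing)}")
    return {module: BINDINGS[module][value]() for module, value in spec.items()}


protocol = synthesize({"stable_storage": "disk", "checkpointing": "coordinated"})
```

Adding a new implementation then means registering one more entry in `BINDINGS`; existing specifications keep working unchanged.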

  19. Advantages • Promotes extensibility and flexibility by allowing multiple implementations of each of the core functionalities. • Facilitates rapid implementation of rollback-recovery protocols with minimal programming effort by gluing together objects from the available library of building blocks. • Enables designers of fault-tolerance protocols to develop new rollback-recovery protocols by combining different implementations of the core functionalities in novel ways.

  20. AQuA: An Adaptive Architecture that provides dependable distributed objects

  21. Overview • AQuA allows distributed applications to request and obtain a desired level of availability using a QuO contract through a property manager. • Fault tolerance in AQuA is provided by Proteus, which dynamically manages the replication of distributed objects to make them dependable.

  22. Background • Ensemble group communication system: (1) ensures reliable communication between groups of processes, (2) ensures atomic delivery of multicasts to groups with changing membership, (3) detects and excludes from the group members that fail by crashing. • Maestro: object-oriented interface to Ensemble

  23. Background (cont) • Quality Objects (QuO): (1) transmit applications’ availability requirements to Proteus, which attempts to configure the system to achieve the desired availability; (2) provide an adaptation mechanism that is used when Proteus is unable to provide a specified level of availability.

  24. Background (cont) • Proteus • Dependability manager: replicated; consists of an advisor and a protocol coordinator • Handlers: implement voters and monitors in the gateway • Object factories: implemented on each host

  25. AQuA Architecture Overview

  26. Group Structure in AQuA

  27. Fault Tolerance in AQuA • Fault model: crash failures, value faults, time faults • Error detection: Proteus, voter, monitor • Fault treatment: Proteus manager advisor
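The roles of a voter (value faults) and a monitor (time faults) can be sketched with majority voting over replica replies. This is a generic illustration and not AQuA's or Proteus's actual code: the function names, the reply format, and the 0.5-second deadline are assumptions.

```python
from collections import Counter


def vote(replies):
    """Return the strict-majority value among replica replies, or None."""
    if not replies:
        return None
    value, count = Counter(replies.values()).most_common(1)[0]
    return value if count > len(replies) / 2 else None


def faulty_replicas(replies, latencies, deadline):
    """Classify replicas as value-faulty (voter) or time-faulty (monitor)."""
    agreed = vote(replies)
    # Voter: a replica whose reply differs from the majority has a value fault.
    value_faults = {r for r, v in replies.items()
                    if agreed is not None and v != agreed}
    # Monitor: a replica that answers after the deadline has a time fault.
    time_faults = {r for r, t in latencies.items() if t > deadline}
    return value_faults, time_faults


replies = {"r1": 42, "r2": 42, "r3": 41}        # hypothetical replica replies
latencies = {"r1": 0.1, "r2": 0.9, "r3": 0.2}   # hypothetical response times (s)
value_faults, time_faults = faulty_replicas(replies, latencies, deadline=0.5)
```

Here the voter flags "r3" and the monitor flags "r2". In a system like the one described, such reports would be passed on to whatever component decides on fault treatment, for example starting a replacement replica.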
