Rollback-Recovery Protocols in Message-Passing Systems

Based on "A Survey of Rollback-Recovery Protocols in Message-Passing Systems" by Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson

Motivation • Large distributed systems have vast computing potential. • In these systems, a machine can stop participating in the execution of a distributed application as a result of: • disconnection from the network • shutdown or reboot by the user • power outage. If any of these events occurs, we say that the node has failed. • The computing potential is hampered by the nodes’ susceptibility to failures. • There is a need to preserve the correctness of a distributed execution despite failures.

Rollback Recovery • Periodically use stable storage (e.g. disk) to save the processes’ state and maybe some additional useful data during failure-free execution. • A saved state of a process is called a checkpoint • Upon a failure, restart a failed process from one of the saved checkpoints • reduces the amount of lost computation • Of course, when recovering, consistency between processes must be maintained.

Flavors of Rollback Recovery • There are techniques that • rely on the application to decide when and what to save, or • provide the programmer with linguistic constructs to be added to the application. • There are also techniques, called transparent techniques, that do not require any intervention on the part of the application or the programmer. • We focus on transparent rollback recovery.

System Model • A constant number of processes (N) • Communicate only through messages • Interact with outside world through messages • Cooperate to execute a distributed program

System Model: Communication • Most protocols assume that the communication network is immune to partitioning. • Some protocols assume reliable FIFO delivery of messages. • Other protocols assume unreliable communication, which means that messages can be • lost • duplicated • reordered

System Model: Failures • A process that fails • loses its volatile state • stops execution • does not send any more messages. Such behavior is called fail-stop. • Processes have a stable storage device that survives failures. • The number of failures tolerated varies across protocols from 1 to N. • Some protocols do not tolerate failures during recovery.

Consistent System States • A global state of a message-passing system consists of: • individual states of all processes • the states of communication channels • A consistent global state is a global state in which, if a process’s state reflects a message receipt, then the state of the corresponding sender reflects sending that message.

Consistent System States (2) • Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. • The goal of a rollback-recovery protocol is to bring the system into a consistent state
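The consistency condition can be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes each process state is summarized by how many of its local events it includes, and the `Message` record and `is_consistent` function are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical message record: (sender, send_index, receiver, receive_index).
# send_index and receive_index are the positions of the send and receive
# events in the local event sequences of the sender and the receiver.
Message = Tuple[int, int, int, int]


def is_consistent(cut: Dict[int, int], messages: List[Message]) -> bool:
    """Check the consistency condition for a global state.

    cut[p] is the number of local events of process p that its state
    includes. The state is inconsistent if some process reflects a
    receipt whose send is not reflected in the sender's state.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received = recv_idx < cut[receiver]
        sent = send_idx < cut[sender]
        if received and not sent:
            return False
    return True


# One message: sent as event 2 of process 0, received as event 1 of process 1.
msgs = [(0, 2, 1, 1)]
print(is_consistent({0: 3, 1: 2}, msgs))  # send and receipt both reflected
print(is_consistent({0: 2, 1: 2}, msgs))  # receipt reflected, send is not
```

A state in which the message is sent but not yet received (for example `{0: 3, 1: 1}`) passes the check; the message is simply in transit.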

Consistent Global Checkpoints and Recovery Line • A consistent global checkpoint is a set of N checkpoints, one from each process, forming a consistent system state. • Any consistent global checkpoint can be used to restart process execution upon failure. • It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.

Orphan Messages and Orphan Processes • A message m sent by a process Pi that has failed is an orphan message if the system cannot guarantee regeneration of the same m during the recovery of Pi. • A process Pk whose state depends on a non-deterministic event (e.g. receipt of a message) that cannot be reproduced is called an orphan process. • The existence of orphan processes violates the integrity of the execution and therefore must be prevented.

In-Transit Messages • A message that has been sent but not yet received is called an in-transit message. • Do rollback-recovery protocols have to guarantee the delivery of in-transit messages? • That depends on whether reliable communication is assumed.

In-Transit Messages: Reliable Communication • Reliable communication protocols cannot ensure reliability of message delivery if processes fail. • For example, if an in-transit message is lost because the intended receiver has failed, then • conventional communication protocols will generate a timeout and inform the sender that the message cannot be delivered. • In a rollback-recovery system, however, the receiver will eventually recover, and therefore the system must: • mask the timeout from the application program at the sender process, and • make in-transit messages available to the intended receiver process after it recovers.

In-Transit Messages: Unreliable Communication • If unreliable communication is assumed, then: • In-transit messages lost due to failure of the receiver cannot be distinguished from those lost due to communication failures. • Loss of an in-transit message is a legal event. • Therefore, the recovery protocol need not handle in-transit messages in any special way.

Interactions with the Outside World • A message-passing system often interacts with the outside world to receive input data or show the outcome of a computation. • If a failure occurs, the outside world cannot be relied on to roll back: • a printer cannot roll back the effects of printing a character • an automatic teller machine cannot recover the money that it dispensed to a customer.

Interactions with the Outside World: Output Messages • It is therefore necessary that the outside world perceive a consistent behavior of the system despite failures. • Before sending output to the outside world, the system must ensure that the state from which the output is sent can be recovered. • This is commonly called the output commit problem

Interactions with the Outside World: Input Messages • Input messages that a system receives from the outside world may not be reproducible during recovery • It may not be possible for the outside world to regenerate them. • Recovery protocols must arrange to save these input messages so that they can be retrieved when needed for execution replay after a failure. • A common approach is to save each input message on stable storage before allowing the application program to process it.

Stable Storage • Rollback recovery uses stable storage to save checkpoints, event logs, and other recovery-related information so that they survive failures. • Stable storage in rollback recovery is only an abstraction. • It is often confused with the disk storage used to implement it.

Stable Storage (2) • There are different implementation styles of stable storage: • In a system that tolerates only a single failure, stable storage may consist of the volatile memory of another process. • In a system that wishes to tolerate an arbitrary number of transient failures, stable storage may consist of a local disk in each host. • In a system that tolerates non-transient failures, stable storage must consist of a persistent medium outside the host on which a process is running. A replicated file system is a possible implementation in such systems.

Garbage Collection • As the application progresses and more recovery information is collected, a subset of the stored recovery information may become useless. • Deletion of such useless recovery information is called garbage collection. • A common approach to garbage collection is to identify the recovery line and discard all data relating to events that occurred before that line. • For example, processes that coordinate their checkpoints to form consistent states will always restart from the most recent checkpoint of each process, and so all previous checkpoints can be discarded.

Z-Cycles and Z-Paths • A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. • Let $\rightarrow$ denote Lamport’s happened-before relation. • Let $c_{i,x}$ denote the $x$-th checkpoint of process $P_i$. • Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint). • Let $\mathit{send}_i$ and $\mathit{deliver}_i$ be the communication events of process $P_i$.

Definition of Z-Path • Given two checkpoints $c_{i,x}$ and $c_{j,y}$, a Z-path exists between $c_{i,x}$ and $c_{j,y}$ if and only if one of the following two conditions holds: 1. $x < y$ and $i = j$; or 2. there exists a sequence of messages $[m_0, m_1, \ldots, m_n]$, $n \ge 0$, such that: • $c_{i,x} \rightarrow \mathit{send}_i(m_0)$; • $\forall l < n$, either $\mathit{deliver}_k(m_l)$ and $\mathit{send}_k(m_{l+1})$ are in the same checkpoint interval, or $\mathit{deliver}_k(m_l) \rightarrow \mathit{send}_k(m_{l+1})$; and • $\mathit{deliver}_j(m_n) \rightarrow c_{j,y}$.

Z-Cycles and Z-Paths (2) • A Z-cycle is a Z-path that begins and ends with the same checkpoint. • In the accompanying figure, $[m_5, m_4, m_3]$ is a Z-cycle that starts and ends at checkpoint $c_{2,2}$; $[m_1, m_2]$ and $[m_3, m_4]$ are Z-paths between $c_{0,1}$ and $c_{2,2}$.
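The Z-path definition can be turned into a search over messages. The sketch below is illustrative only and rests on several assumptions that are not in the slides: checkpoints are indexed from 0, each event is tagged with an "epoch" (the index of the latest local checkpoint taken before it, so an interval starts with its earlier checkpoint), and the names `Msg`, `has_z_path`, and `in_z_cycle` are hypothetical. Under these assumptions the chaining condition between consecutive messages reduces to comparing epochs on the shared process.

```python
from collections import deque
from typing import List, NamedTuple, Tuple


class Msg(NamedTuple):
    sender: int
    send_epoch: int   # index of the sender's latest checkpoint before the send
    receiver: int
    recv_epoch: int   # index of the receiver's latest checkpoint before delivery


Checkpoint = Tuple[int, int]  # (process id, checkpoint index)


def has_z_path(src: Checkpoint, dst: Checkpoint, msgs: List[Msg]) -> bool:
    (i, x), (j, y) = src, dst
    if i == j and x < y:                       # condition 1
        return True
    # Condition 2: start with messages sent by process i after c_{i,x}.
    frontier = deque(k for k, m in enumerate(msgs)
                     if m.sender == i and m.send_epoch >= x)
    seen = set(frontier)
    while frontier:
        m = msgs[frontier.popleft()]
        if m.receiver == j and m.recv_epoch < y:   # delivered before c_{j,y}
            return True
        for k, nxt in enumerate(msgs):
            # The next message is sent by the receiver of m, either in the
            # same checkpoint interval as the delivery or in a later one.
            if (k not in seen and nxt.sender == m.receiver
                    and nxt.send_epoch >= m.recv_epoch):
                seen.add(k)
                frontier.append(k)
    return False


def in_z_cycle(c: Checkpoint, msgs: List[Msg]) -> bool:
    return has_z_path(c, c, msgs)


# P1 sends m_a after its checkpoint 1; P0 receives it in its interval 0.
# P0 sent m_b earlier in that same interval; P1 received it before checkpoint 1.
msgs = [Msg(1, 1, 0, 0), Msg(0, 0, 1, 0)]
print(in_z_cycle((1, 1), msgs))
```

In this small example checkpoint (1, 1) lies on a Z-cycle, while the initial checkpoint (0, 0) does not.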

The Z-Cycles Theory • The Z-cycle theory was first introduced as a framework for reasoning about consistent system states. • The theory has proved a powerful tool for reasoning about a class of protocols known as communication-induced checkpointing. • In particular, it has been proven that a checkpoint involved in a Z-cycle cannot become part of a consistent state in a system that uses only checkpoints.

Types of Rollback Recovery Protocols

Checkpoint-based and Log-based Recovery Protocols • Checkpoint-based rollback recovery protocols, a.k.a. checkpointing protocols, rely only on checkpointing to achieve fault-tolerance. • Log-based rollback recovery protocols, a.k.a. logging protocols, combine checkpointing with logging of non-deterministic events.

Checkpoint-Based Protocols • Rely only on checkpointing to achieve fault-tolerance • Upon a failure, strive to restore the system to the most recent consistent set of checkpoints (a.k.a. recovery line) • The checkpointing protocols differ in the amount of cooperation between processes.

Classification of Checkpoint-based Protocols • Uncoordinated checkpointing – each process takes its checkpoints independently • Coordinated checkpointing – processes coordinate their checkpoints in order to save a system-wide consistent state • Communication-induced checkpointing – forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

Uncoordinated Checkpointing • A.k.a. independent checkpointing • A process decides when to take a checkpoint independently of other processes • chooses the most convenient time • for example, when the amount of state information is small • The processes record dependencies among the checkpoints during failure-free execution, in order to determine a consistent global checkpoint during recovery. • Uncoordinated checkpointing protocols inherently suffer from the domino effect.

Rollback Propagation and The Domino Effect • Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back. • This is commonly called rollback propagation. • If the processes have to roll back to the beginning of the computation, this is called the domino effect. Figure: failure of P2 causes rollback to the beginning of the computation.

Monitoring the Dependencies • Let $c_{i,x}$ be the $x$-th checkpoint of process $P_i$. We call $x$ the checkpoint index. • Let $I_{i,x}$ denote the interval between checkpoints $c_{i,x-1}$ and $c_{i,x}$. We call it the checkpoint interval. • If process $P_i$ at interval $I_{i,x}$ sends a message $m$ to $P_j$, it piggybacks the pair $(i,x)$ on $m$. • When $P_j$ receives $m$ during interval $I_{j,y}$, it records the dependency from $I_{i,x}$ to $I_{j,y}$. • The dependency is later saved onto stable storage when $P_j$ takes checkpoint $c_{j,y}$.
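The piggybacking scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the `Process` class, its fields, and the dictionary message format are hypothetical, the first interval of each process is taken to be interval 1 (following an initial checkpoint 0), and an in-memory list stands in for stable storage.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, List, Set, Tuple

Interval = Tuple[int, int]               # (process id, checkpoint index)
Dependency = Tuple[Interval, Interval]   # (sending interval, receiving interval)


@dataclass
class Process:
    pid: int
    index: int = 1                       # index x of the current interval
    pending: Set[Dependency] = field(default_factory=set)
    # Stand-in for stable storage: (checkpoint index, dependencies saved with it).
    stable: List[Tuple[int, FrozenSet[Dependency]]] = field(default_factory=list)

    def send(self, payload: Any) -> Dict[str, Any]:
        # Piggyback the pair (i, x) on the outgoing message.
        return {"payload": payload, "tag": (self.pid, self.index)}

    def receive(self, message: Dict[str, Any]) -> Any:
        # Record the dependency from the sender's interval to the current one.
        self.pending.add((message["tag"], (self.pid, self.index)))
        return message["payload"]

    def checkpoint(self) -> None:
        # Taking the checkpoint with the current index closes the current
        # interval and saves the dependencies recorded during it.
        self.stable.append((self.index, frozenset(self.pending)))
        self.pending.clear()
        self.index += 1


p0, p1 = Process(0), Process(1)
p1.receive(p0.send("hello"))
p1.checkpoint()
print(p1.stable)
```

A real implementation would also save the process state with each checkpoint; only the dependency bookkeeping is shown here.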

Monitoring the Dependencies (2) • The recorded dependencies are used at recovery time to calculate the recovery line. Two methods are used: • Rollback-dependency graphs • Checkpoint graphs

Rollback-Dependency Graphs • Consider the system at the time of a failure. Let $C$ be the set of all checkpoints, $F$ the set of failure points of the failed processes, and $L$ the set of current states of the living processes. • Denote the current state of a process $P_i$ (failed or living) that follows a checkpoint $c_{i,x}$ by $c_{i,x+1}$. • A rollback-dependency graph is a graph $G(V,E)$ such that: • $V = C \cup F \cup L$ • $E$ contains an edge from $c_{i,x}$ to $c_{j,y}$ if and only if either (1) $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or (2) $i = j$ and $y = x + 1$. • If there is an edge from $c_{i,x}$ to $c_{j,y}$ and a failure forces $I_{i,x}$ to be rolled back, then $I_{j,y}$ must also be rolled back. • This is why it is called a “rollback-dependency graph”.

Rollback-Dependency Graphs (2) • Algorithm to discover the recovery line: • Mark the failure points. • Mark all the nodes reachable from the failure points. • In each process, the latest unmarked checkpoint belongs to the recovery line. Figure: rollback-dependency graph.
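The marking procedure can be sketched as a plain reachability search. The code below is an illustration, not the authors' implementation. It assumes that nodes are `(process id, index)` pairs, that the highest index of each process is its failure point or current state, and that every process has an initial checkpoint with index 0 that no message is received into; `recovery_line` and its parameters are hypothetical names.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[int, int]  # (process id, index)


def recovery_line(top: Dict[int, int],
                  msg_edges: Iterable[Tuple[Node, Node]],
                  failed: Iterable[int]) -> Dict[int, int]:
    """top[p] is the index of p's failure point (if failed) or current state."""
    succ: Dict[Node, List[Node]] = {}
    for p, t in top.items():                 # edges between consecutive nodes
        for x in range(t):
            succ.setdefault((p, x), []).append((p, x + 1))
    for src, dst in msg_edges:               # edges induced by messages
        succ.setdefault(src, []).append(dst)

    # Mark the failure points and everything reachable from them.
    marked: Set[Node] = set()
    stack = [(p, top[p]) for p in failed]
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(succ.get(node, []))

    # In each process, the latest unmarked node belongs to the recovery line.
    return {p: max(x for x in range(t + 1) if (p, x) not in marked)
            for p, t in top.items()}


# P0 sends a message in its interval 2; P1 receives it in its interval 1.
edges = [((0, 2), (1, 1))]
print(recovery_line({0: 2, 1: 2}, edges, failed=[0]))
print(recovery_line({0: 2, 1: 2}, edges, failed=[1]))
```

When P0 fails, the message it sent is undone, so P1 must roll back past the checkpoint that recorded its receipt. When P1 fails instead, P0 keeps its current state.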

Checkpoint Graphs • Checkpoint graphs are similar to rollback-dependency graphs, except: • when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, a directed edge is drawn from $c_{i,x-1}$ to $c_{j,y}$ (instead of from $c_{i,x}$ to $c_{j,y}$). • failure points are not included in $V$ ($V = C \cup L$). Figure: the same message drawn as an edge from $c_{i,x}$ to $c_{j,y}$ in the rollback-dependency graph and from $c_{i,x-1}$ to $c_{j,y}$ in the checkpoint graph.

Checkpoint Graphs (2) • A checkpoint graph represents the happened-before relationship between the checkpoints. • The recovery line is calculated by the rollback propagation algorithm, which at each step rolls back the processes according to the recorded dependencies. Figure: checkpoint graph.

Rollback Propagation Algorithm
    include last checkpoint of each failed process as an element in set RootSet;
    include current state of each surviving process as an element in RootSet;
    mark all checkpoints reachable by following at least one edge from any member of RootSet;
    while (at least one member of RootSet is marked)
        replace each marked element in RootSet by the last unmarked checkpoint of the same process;
        mark all checkpoints reachable by following at least one edge from any member of RootSet
    end
    RootSet is the recovery line.
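A runnable rendering of the rollback propagation pseudocode might look like the sketch below. It is illustrative only and makes these assumptions: nodes are `(process id, index)` pairs in a checkpoint graph, consecutive nodes of the same process are connected, index 0 is an initial checkpoint that is never the target of a message edge, and `top[p]` is the last checkpoint of a failed process or the current state of a surviving one. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[int, int]  # (process id, index)


def reachable(roots: Iterable[Node], succ: Dict[Node, List[Node]]) -> Set[Node]:
    """Nodes reachable from any root by following at least one edge."""
    seen: Set[Node] = set()
    stack = [n for r in roots for n in succ.get(r, [])]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ.get(n, []))
    return seen


def rollback_propagation(top: Dict[int, int],
                         msg_edges: Iterable[Tuple[Node, Node]]) -> Dict[int, int]:
    succ: Dict[Node, List[Node]] = {}
    for p, t in top.items():                 # edges between consecutive nodes
        for x in range(t):
            succ.setdefault((p, x), []).append((p, x + 1))
    for src, dst in msg_edges:               # checkpoint-graph message edges
        succ.setdefault(src, []).append(dst)

    root = dict(top)                         # RootSet, one entry per process
    marked = reachable(root.items(), succ)
    while any(node in marked for node in root.items()):
        for p, x in list(root.items()):
            if (p, x) in marked:
                # Replace by the last unmarked checkpoint of the same process.
                root[p] = max(k for k in range(x) if (p, k) not in marked)
        marked |= reachable(root.items(), succ)
    return root                              # the recovery line


# P0 failed with last checkpoint 1; P1 survives with current state 2.
# A message sent after checkpoint (0, 1) was received in P1's interval 1,
# which gives the checkpoint-graph edge (0, 1) -> (1, 1).
print(rollback_propagation({0: 1, 1: 2}, [((0, 1), (1, 1))]))
```

In the example, P1's current state and its checkpoint 1 both depend on the lost send, so P1 is rolled back to its initial checkpoint while P0 restarts from checkpoint 1.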

Rollback-Dependency Graphs vs. Checkpoint Graphs • The rollback-dependency graph and checkpoint graph approaches are equivalent: • they always produce the same recovery line (as indeed they do in the example). Figure panels: rollback-dependency graph; checkpoint graph.

Recovery • In order to be able to calculate the recovery line some process needs to collect all the dependency data recorded by all the processes. • A process recovering from a failure broadcasts a dependency request message • Each process that receives a dependency request • stops execution • replies with the local dependency information • Then, the initiator • calculates the recovery line based on the received data • broadcasts a rollback request message containing the recovery line

Recovery (2) • A process whose current state belongs to the recovery line resumes execution. • Otherwise, it rolls back to a checkpoint indicated by the recovery line. Figure: recovery line across processes P0, P1, and P2, with labels A, B, C and messages m0 to m3.

Garbage Collection • In order to prevent memory overflow and reduce storage overhead, only useful checkpoints should be kept. • Any checkpoint that precedes the recovery lines for all possible combinations of process failures can be discarded.

Garbage Collection Algorithm • Build a rollback-dependency graph as if all the processes have failed. • Run the algorithm for discovery of the recovery line. • The resulting recovery line is called the global recovery line. • All the checkpoints taken before the global recovery line are obsolete.
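The garbage collection steps can be sketched by treating every process as failed and then listing the checkpoints that precede the resulting line. The code is a self-contained illustration under assumptions: nodes are `(process id, index)` pairs in a rollback-dependency graph, `top[p]` is the index of the current state of process p (treated as a failure point), index 0 is an initial checkpoint, and `obsolete_checkpoints` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[int, int]  # (process id, index)


def obsolete_checkpoints(top: Dict[int, int],
                         msg_edges: Iterable[Tuple[Node, Node]]
                         ) -> Tuple[Dict[int, int], List[Node]]:
    succ: Dict[Node, List[Node]] = {}
    for p, t in top.items():                 # edges between consecutive nodes
        for x in range(t):
            succ.setdefault((p, x), []).append((p, x + 1))
    for src, dst in msg_edges:               # edges induced by messages
        succ.setdefault(src, []).append(dst)

    # Build the graph as if all processes have failed: every top node
    # is a failure point, so marking starts from all of them.
    marked: Set[Node] = set()
    stack = list(top.items())
    while stack:
        node = stack.pop()
        if node not in marked:
            marked.add(node)
            stack.extend(succ.get(node, []))

    # Global recovery line: latest unmarked checkpoint of each process.
    line = {p: max(x for x in range(t + 1) if (p, x) not in marked)
            for p, t in top.items()}
    # Every checkpoint taken before the global recovery line is obsolete.
    obsolete = sorted((p, x) for p in top for x in range(line[p]))
    return line, obsolete


# Each process has checkpoints 0..2 and current state 3. P0 sent a message
# in its interval 3 that P1 received in its interval 2.
line, garbage = obsolete_checkpoints({0: 3, 1: 3}, [((0, 3), (1, 2))])
print(line, garbage)
```

In the example the message dependency holds P1's part of the global recovery line back by one checkpoint, so one fewer of its checkpoints can be reclaimed.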

Garbage Collection Example • As the example shows, when the global recovery line is unable to advance because of rollback propagation, a large number of non-obsolete checkpoints may need to be retained.

Disadvantages of Uncoordinated Checkpointing • Susceptible to the domino effect • Processes can take useless checkpoints that will never be part of a consistent global state; such checkpoints • incur storage overhead • do not advance the recovery line • A process needs to maintain multiple checkpoints and to run a garbage collector to reclaim checkpoints that are no longer needed • Not suitable for output commit, because output commit requires global coordination to compute the recovery line

Coordinated Checkpointing • The processes cooperate in order to form a consistent global checkpoint. • Each process needs to maintain only one checkpoint on stable storage at any time. • No need for garbage collection • Reduced storage overhead • Recovery is less complicated than in uncoordinated checkpointing. • Expensive output commit • A global checkpoint is needed before output can be committed to the outside world.

Preventing Dependencies • The main purpose of coordination is to avoid dependencies between the local checkpoints belonging to the same logical global checkpoint. • The coordinated checkpointing protocols differ in the way they prevent dependencies.

Classification of Coordinated Checkpointing Protocols

Blocking Checkpoint Coordination • Blocking checkpoint coordination is the most straightforward approach to implementing coordinated checkpointing. • A coordinator process orchestrates the checkpointing by sending a checkpoint request to each process. • not very scalable
