Checkpointing-based Rollback Recovery for Parallel Applications on the InteGrade Grid Middleware

Raphael Y. de Camargo, Andrei Goldchleger, Fabio Kon, Alfredo Goldman
Department of Computer Science, University of São Paulo, Brazil
Middleware 2004 – Toronto, Canada
2nd International Workshop on Grid Computing

Summary
• Introduction
• InteGrade Grid middleware
• BSP Computing Model
• Checkpointing-based Rollback Recovery
• Checkpointing Infrastructure
• Preliminary Experiments
• Conclusions

Introduction
Grid Computing:
• Grid computing integrates and leverages computing resources distributed across LANs and WANs
• Besides dedicated computing resources, it can also use the idle computing power of commodity workstations (opportunistic computing)
Challenges:
• The environment is composed of shared user workstations spread across many different LANs
• Machines may fail, become inaccessible, or switch from idle to busy very rapidly
• A fault-tolerance mechanism is therefore a major requirement for such a system

InteGrade Grid Middleware
Objectives:
• Use the idle computing power of commodity workstations (opportunistic computing)
• Allow organizations to increase their available computing power without buying extra hardware
• Preserve the quality of service for machine owners who share their computing resources
Implementation Status:
• Basic architecture already implemented
• Uses CORBA distributed object technology for communication
• Supports the execution of sequential, BSP, and bag-of-tasks applications

InterCluster InteGrade Architecture
• GRM (Global Resource Manager): manages the grid resources and schedules applications for execution
• ASCT (Application Submission and Control Tool): allows users to submit and control applications on the grid
• LRM (Local Resource Manager): manages a node's resources
• Runtime libraries: provide support for running parallel applications

BSP Parallel Computing Model
• Computation is performed as a sequence of parallel supersteps
• Each superstep is composed of computation and communication, with a synchronization barrier at the end
• Data communicated during a superstep becomes available to the other processes only in the next superstep
• Two communication mechanisms:
  - Direct Remote Memory Access (DRMA)
  - Bulk Synchronous Message Passing (BSMP)
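
The visibility rule above (data written in one superstep is readable only after the barrier) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it is not BSPlib or the InteGrade BSP library, and the names `BspSim`, `sim_put`, `sim_read`, and `sim_sync` are hypothetical. Each simulated process owns one integer cell, and writes are double-buffered until the barrier.

```c
#include <assert.h>
#include <string.h>

#define NPROCS 4

/* Hypothetical simulation of BSP visibility: one cell per simulated
 * process, with writes buffered until the synchronization barrier. */
typedef struct {
    int visible[NPROCS];     /* values readable in the current superstep */
    int pending[NPROCS];     /* values written during the current superstep */
    int has_pending[NPROCS]; /* 1 if pending[i] holds an undelivered write */
    int superstep;           /* index of the current superstep */
} BspSim;

void sim_init(BspSim *s) { memset(s, 0, sizeof *s); }

/* DRMA-style put: queue a write to the cell owned by process `dest`. */
void sim_put(BspSim *s, int dest, int value) {
    assert(dest >= 0 && dest < NPROCS);
    s->pending[dest] = value;
    s->has_pending[dest] = 1;
}

/* Read the cell of process `pid` as seen in the current superstep. */
int sim_read(const BspSim *s, int pid) {
    assert(pid >= 0 && pid < NPROCS);
    return s->visible[pid];
}

/* Barrier: deliver all queued writes and start the next superstep. */
void sim_sync(BspSim *s) {
    for (int i = 0; i < NPROCS; i++) {
        if (s->has_pending[i]) {
            s->visible[i] = s->pending[i];
            s->has_pending[i] = 0;
        }
    }
    s->superstep++;
}
```

A value passed to `sim_put` is not returned by `sim_read` until `sim_sync` has been called, which mirrors the superstep boundary described on the slide.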

Checkpointing-based Rollback Recovery
Checkpointing:
• Consists of periodically saving the application state into a checkpoint, so that the state can later be recovered from it
Checkpointing-based Rollback Recovery:
• The process of restarting an application from an intermediate execution point after a failure is detected
Two approaches to checkpointing:
• System-level:
  - The memory space and processor registers of the application are saved into the checkpoint
• Application-level:
  - The application is responsible for providing the data to be saved and for reconstructing its state from the checkpoint

Application-level checkpointing
• The application is responsible for:
  - Specifying which data needs to be saved
  - Recovering its state from a previous checkpoint
Advantages
• Semantic information about the data being saved makes it possible to generate portable checkpoints
• Only the data needed to recover the application state has to be saved
Disadvantages
• The source code must be instrumented with checkpointing code
• Access to the application source code is required
• Forced checkpoints cannot be generated

Checkpointing of Parallel Applications
• For parallel applications, we must consider the dependencies among application processes created by message exchanges
• Global checkpoint: a collection containing one checkpoint from every application process. In the diagram, the global checkpoint s1 is inconsistent while the global checkpoint s2 is consistent
• BSP applications: consistency can be guaranteed by generating the checkpoints right after the synchronization phases
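
One common way to state consistency is that a global checkpoint must not record the receipt of a message whose send it does not record. The sketch below checks that condition under simplifying assumptions: each process has a local integer clock, an event is recorded by a checkpoint if it happened before the checkpoint time, and the `Message` type and `is_consistent` function are hypothetical names that are not part of InteGrade.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one message exchange. */
typedef struct {
    int sender;     /* index of the sending process */
    int send_time;  /* local time of the send event on the sender */
    int receiver;   /* index of the receiving process */
    int recv_time;  /* local time of the receive event on the receiver */
} Message;

/* ckpt_time[p] is the local time at which process p took its checkpoint.
 * Returns 1 if no message is recorded as received but not as sent. */
int is_consistent(const int *ckpt_time, const Message *msgs, size_t n) {
    assert(ckpt_time != NULL && (msgs != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        int received = msgs[i].recv_time < ckpt_time[msgs[i].receiver];
        int sent     = msgs[i].send_time < ckpt_time[msgs[i].sender];
        if (received && !sent)
            return 0;  /* orphan message: the receive is saved, the send is not */
    }
    return 1;
}
```

In a BSP application, every message sent during a superstep is delivered at the barrier, so checkpoints taken by all processes right after the same barrier record both the send and the receive of each message, and this check cannot fail.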

Checkpointing Infrastructure
• Pre-compiler
  - Instruments the source code of a C/C++ application with checkpointing code
• Runtime libraries
  - Allow saving the application state into a checkpoint and recovering the data from a previous checkpoint
• ExecutionMonitor
  - Keeps information about the applications running on the grid, allowing them to be restarted in case of failure

PreCompiler
• Based on OpenC++, which lets us use compile-time reflection to instrument the application source code with checkpointing code
• Needs to modify the application code in order to save the following data:
  - Execution stack: contains runtime data from the functions that are active at a particular moment during application execution
  - Position counter: the current position in the program
  - The heap: contains memory chunks allocated by commands such as malloc and new
  - Global variables

Saving and Recovering the Execution Stack State
• Execution stack state: not directly accessible from application code
• Saving the execution stack state: save a list of the currently active functions and the values of their local variables
• Recovering the execution stack state: call the functions in the saved list, declare the local variables, and recover their values from the checkpoint. The remaining code is skipped
• Position counter: the process state is saved only at certain points in the source code, marked by a call to a function such as checkpoint_candidate()

Saving Local Vars and Pointers
Local Variables:
• An auxiliary stack keeps the addresses of the local variables that are currently in scope
• Local variable addresses are pushed onto the stack just after their declaration and popped when the variables leave scope
• During checkpoint generation, the values stored at these addresses are saved in the checkpoint
Pointers:
• For pointers, it is first necessary to dereference the pointer
• When saving pointers with multiple levels of indirection, it is necessary to follow the pointer graph structure
• Special care is needed with graphs containing cycles and when multiple pointers reference the same memory chunk
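
The auxiliary stack described above can be sketched as an array of (address, size) pairs. This is a minimal illustration, not the InteGrade checkpointing library: the names `AuxStack`, `aux_push`, `aux_npop`, `aux_save`, and `aux_restore` are hypothetical, the capacity is fixed, and the values are copied into a memory buffer rather than a checkpoint file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AUX_MAX 64

/* Hypothetical auxiliary stack of in-scope local variables. */
typedef struct { void *addr; size_t size; } AuxEntry;
typedef struct { AuxEntry e[AUX_MAX]; size_t top; } AuxStack;

/* Register a local variable right after its declaration. */
void aux_push(AuxStack *s, void *addr, size_t size) {
    assert(s->top < AUX_MAX);
    s->e[s->top].addr = addr;
    s->e[s->top].size = size;
    s->top++;
}

/* Unregister the n most recently registered variables (scope exit). */
void aux_npop(AuxStack *s, size_t n) {
    assert(n <= s->top);
    s->top -= n;
}

/* Copy the current values of all registered variables into buf.
 * Returns the number of bytes written, or (size_t)-1 if buf is too small. */
size_t aux_save(const AuxStack *s, unsigned char *buf, size_t cap) {
    size_t off = 0;
    for (size_t i = 0; i < s->top; i++) {
        if (s->e[i].size > cap - off)
            return (size_t)-1;
        memcpy(buf + off, s->e[i].addr, s->e[i].size);
        off += s->e[i].size;
    }
    return off;
}

/* Copy saved values back into the registered variables, in push order.
 * Returns the number of bytes read, or (size_t)-1 if buf is too short. */
size_t aux_restore(const AuxStack *s, const unsigned char *buf, size_t len) {
    size_t off = 0;
    for (size_t i = 0; i < s->top; i++) {
        if (s->e[i].size > len - off)
            return (size_t)-1;
        memcpy(s->e[i].addr, buf + off, s->e[i].size);
        off += s->e[i].size;
    }
    return off;
}
```

Restoring works only if the variables are registered in the same order as when the buffer was saved, which is why recovery re-enters the same functions and re-declares the same locals before reading their values back.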

Saving the Heap Memory
HeapManager
• Maintains a list of the currently allocated chunks of memory
• Each entry includes the chunk's memory address, its size, and a flag that indicates whether the chunk has already been saved during checkpoint generation
• Updated before memory allocation and deallocation calls such as malloc and free for C, and new and delete for C++
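
A heap manager of this kind can be sketched as a linked list of chunk records. The code below is an assumption-laden illustration rather than the InteGrade HeapManager: `hm_malloc`, `hm_free`, `hm_find`, `hm_mark_saved`, and `hm_reset_flags` are hypothetical names, and for simplicity the wrapper registers a chunk right after allocating it. The saved flag is what prevents a chunk from being written twice when several pointers, or a cycle of pointers, lead to it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one allocated chunk. */
typedef struct Chunk {
    void *addr;          /* start address of the chunk */
    size_t size;         /* size in bytes */
    int saved;           /* 1 if already written to the current checkpoint */
    struct Chunk *next;
} Chunk;

typedef struct { Chunk *head; } HeapManager;

/* Allocate memory and register the new chunk. */
void *hm_malloc(HeapManager *hm, size_t size) {
    void *p = malloc(size);
    if (p == NULL) return NULL;
    Chunk *c = malloc(sizeof *c);
    if (c == NULL) { free(p); return NULL; }
    c->addr = p; c->size = size; c->saved = 0;
    c->next = hm->head;
    hm->head = c;
    return p;
}

/* Unregister and free a chunk; unknown pointers are ignored. */
void hm_free(HeapManager *hm, void *p) {
    for (Chunk **pp = &hm->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->addr == p) {
            Chunk *dead = *pp;
            *pp = dead->next;
            free(dead->addr);
            free(dead);
            return;
        }
    }
}

/* Find the chunk that contains address p, or NULL if p is not on the heap. */
Chunk *hm_find(HeapManager *hm, const void *p) {
    uintptr_t q = (uintptr_t)p;
    for (Chunk *c = hm->head; c != NULL; c = c->next) {
        uintptr_t base = (uintptr_t)c->addr;
        if (q >= base && q - base < c->size) return c;
    }
    return NULL;
}

/* Returns 1 if the chunk must be written now, 0 if it was already saved. */
int hm_mark_saved(Chunk *c) {
    assert(c != NULL);
    if (c->saved) return 0;
    c->saved = 1;
    return 1;
}

/* Clear all saved flags before generating the next checkpoint. */
void hm_reset_flags(HeapManager *hm) {
    for (Chunk *c = hm->head; c != NULL; c = c->next) c->saved = 0;
}
```

When the checkpointing code follows a pointer, it can call `hm_find` to locate the chunk and write the chunk only if `hm_mark_saved` returns 1, so traversal of a cyclic structure terminates.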

Classes, Structures and BSP Calls
BSP
• The standard functions bsp_begin and bsp_synch are replaced by functions from the checkpointing library
• During reinitialization, calls to functions that modify the state of the BSP library (e.g., bsp_pushregister) must be re-executed
Structures
• Saved in the same way as local variables
• The pointers present in the structure must be followed
Classes
• Introspection is used to add methods for saving and restoring the class members

Precompiler – Example of Instrumented Code
    int function() {
        int lastFunctionCalled = -1;
        int localVar = 0;
        /* Register the local variables with the checkpointing library */
        ckp_push_data(&lastFunctionCalled, sizeof(int));
        ckp_push_data(&localVar, sizeof(int));
        /* On recovery, restore the locals and jump to the saved position */
        if (ckpRecovering == 1) {
            ckp_get_data(&lastFunctionCalled, sizeof(int));
            ckp_get_data(&localVar, sizeof(int));
            if (lastFunctionCalled == 0)
                goto ckp0;
        }
        // Do computations (...)
    ckp0:
        lastFunctionCalled = 0;  /* position counter: about to call functionA */
        functionA();
        // Do computations (...)
        /* Unregister both local variables before returning */
        ckp_npop_data(2);
        return localVar;
    }

Checkpointing Runtime Library
Checkpointing Library:
• Provides the functionality for maintaining a stack of local variables, managing the heap state, and saving the data to a checkpoint
• Provides a timer that applications can set to ensure a minimum time between checkpoints
• Checkpoints are currently architecture dependent and are saved to a file in the file system
BSP Ckp Library:
• Provides specific functionality for checkpointing BSP applications:
  - bsp_begin_ckp( ): registers some addresses necessary for checkpointing coordination and initializes the timer
  - bsp_synch_ckp( ): tests whether the timer has expired and, if so, signals the other processes to generate a new checkpoint
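
The minimum-interval timer can be sketched as a small state machine consulted at each synchronization point. This is a hypothetical illustration, not the actual timer of the checkpointing library or the body of bsp_synch_ckp: the names `IntervalTimer`, `interval_timer_start`, and `interval_timer_should_checkpoint` are assumptions, and the current time is passed in as seconds so that the logic does not depend on any particular clock API.

```c
#include <assert.h>

/* Hypothetical timer enforcing a minimum time between checkpoints. */
typedef struct {
    double last_ckp_s;      /* time of the last checkpoint, in seconds */
    double min_interval_s;  /* minimum time between two checkpoints */
} IntervalTimer;

/* Start the timer; the first checkpoint is allowed min_interval_s later. */
void interval_timer_start(IntervalTimer *t, double now_s, double min_interval_s) {
    assert(min_interval_s >= 0.0);
    t->last_ckp_s = now_s;
    t->min_interval_s = min_interval_s;
}

/* Called at a synchronization point. Returns 1 and restarts the timer if
 * the minimum interval has elapsed, otherwise returns 0. */
int interval_timer_should_checkpoint(IntervalTimer *t, double now_s) {
    if (now_s - t->last_ckp_s < t->min_interval_s)
        return 0;
    t->last_ckp_s = now_s;
    return 1;
}
```

Because the decision is taken only at synchronization points, the actual time between checkpoints is the minimum interval rounded up to the next barrier.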

Application Execution Monitoring and Reinitialization
LRM
• Captures the exit status of running applications and sends it to the ExecutionMonitor
• If a process was explicitly killed by the signals SIGTERM or SIGKILL, it is restarted
Execution Monitor:
• Contains a list of the applications running on the nodes of its cluster
• Schedules new executions of failed processes with the GRM
GRM
• Detects when a node or LRM fails and notifies the Execution Monitor
• Node failures are reported to the GRM
BSP Applications
• For BSP applications, all the processes in the application are reinitialized
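
On a POSIX system, the exit status returned by `waitpid` is enough to tell whether a process was killed by SIGTERM or SIGKILL. The sketch below shows one way to make that test; it is not the InteGrade LRM code, and `killed_by_term_or_kill` and the demo helper `run_child` are hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a waitpid() status says the process was terminated by
 * SIGTERM or SIGKILL, which is the case in which it should be restarted. */
int killed_by_term_or_kill(int status) {
    if (!WIFSIGNALED(status))
        return 0;
    int sig = WTERMSIG(status);
    return sig == SIGTERM || sig == SIGKILL;
}

/* Demo helper: fork a child that raises `sig` (or exits normally when
 * sig is 0) and store its wait status. Returns 0 on success, -1 on error. */
int run_child(int sig, int *status) {
    assert(status != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);  /* no effect for SIGKILL, harmless */
            raise(sig);
        }
        _exit(0);
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}
```

A monitor built this way would forward the status of a normally exiting process unchanged and request a restart only when `killed_by_term_or_kill` returns 1.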

Preliminary Experiments
Sequence similarity application:
• Compares two sequences of characters and computes the similarity between them according to given criteria
• Used in bioinformatics to compare DNA sequences
• Was parallelized using the BSP computing model
Experiments were performed on a cluster of ten 1.4 GHz machines connected by a 100 Mbps Fast Ethernet network.

Conclusions
• We described a checkpointing-based rollback recovery mechanism for applications running on the InteGrade Grid middleware
• This mechanism will allow better resource utilization in the Grid, since it will be possible to migrate processes between nodes
• Preliminary experiments indicate that the checkpointing overhead can be low enough for use with long-running BSP parallel applications

Ongoing Work
• Improve pre-compiler support for C++
• Support for portable checkpoints
  - Allows better resource utilization in heterogeneous environments
• Robust storage system for checkpoints
  - Data saved in a distributed way
  - Some degree of replication to provide fault tolerance
• Implement an efficient process migration mechanism on InteGrade
  - Can be used for both fault tolerance and dynamic adaptation

Questions?
