FTOP: A library for fault tolerance in a cluster

  1. FTOP: A library for fault tolerance in a cluster. R. Badrinath, Rakesh Gupta, Nisheeth Shrivastava

  2. Why FTOP ? • A fault-tolerant environment built for PVM. • Implements a transparent fault tolerance technique using Checkpointing and Rollback Recovery (C&RR) for PVM-based distributed applications. • Handles issues related to in-transit messages, routing of messages to migrated tasks, and open files. • Works entirely at user level; no kernel changes are needed. • Intended to be extensible to other C&RR schemes.

  3. FTOP assumptions • Assumes a homogeneous Linux cluster with PVM running on every host. • One of the hosts is configured as a Global Resource Manager (GRM), which is assumed to be fault free. ... (impl.!) • Another host, also assumed to be fault free, is configured as the stable storage; its file system is NFS-mounted on all other hosts. (Using NFS has problems?) • Assumes reliable FIFO channels between hosts in the cluster. • Handles task/node crash failures only.

  4. System and Fault Model • The system consists of: • a set of workstations, • connected through a high-speed LAN, • with stable storage accessible to all workstations (assumed to be fault proof). • A fault can be: • a network failure, • a node failure. • Fail-stop model.

  5. Implementation: Checkpointing • Non-blocking coordinated checkpointing. • What is checkpointed? • The process context (PC value, registers, etc.). • The process control state (e.g. pid, parent pid, descriptors of open files). • The process address space (the text, data, and stack areas). • Where are the checkpoints stored? • On stable storage (assumed to be failure proof). • Two checkpoint files for each process.
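As a rough illustration of the control state recorded alongside the memory image, a per-task record might look like the C struct below. The layout and the field names are assumptions made for illustration only, not FTOP's actual checkpoint format.

```c
/* Hypothetical sketch of the per-task control state a C&RR library in
 * this style would save next to the memory image.  Field names and the
 * fixed-size layout are illustrative only. */
#include <sys/types.h>

#define MAX_OPEN_FILES 64

struct open_file_state {
    char  path[256];    /* file name (reported by lsof at checkpoint time) */
    int   fd;           /* descriptor number                               */
    int   flags;        /* open mode: O_RDONLY, O_RDWR, ...                */
    off_t offset;       /* current file pointer obtained via lseek()       */
};

struct ckpt_control_state {
    pid_t pid;                                /* process id                */
    pid_t ppid;                               /* parent process id         */
    int   tid;                                /* PVM task id at checkpoint */
    int   nfiles;                             /* open files recorded below */
    struct open_file_state files[MAX_OPEN_FILES];
};
```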

  6. How we checkpoint • The process context (PC, register values, etc.): uses the signal mechanism. On receiving a signal, a process saves its state on the stack, which can then be checkpointed. Uses setjmp() and longjmp(). • The process memory regions: "RO" sections are not checkpointed; other sections are checkpointed by writing them to a file. The /proc file system provides the section boundaries. • The process control state: written to a regular file named after the task id.
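A minimal sketch of this signal-driven approach is shown below, assuming the handler records the register context with setjmp() and dumps the writable regions listed in /proc/self/maps to a single file. The checkpoint file name, the helper names, and the simplified error handling are assumptions, and async-signal-safety details are glossed over; this is not FTOP's code.

```c
/* Sketch of signal-driven checkpointing: the SIGUSR1 handler saves the
 * register context with setjmp() and writes every writable region from
 * /proc/self/maps to a checkpoint file.  Illustrative only: error
 * handling and async-signal-safety concerns are glossed over. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static jmp_buf ckpt_ctx;               /* pc, stack pointer, registers */

static void dump_writable_regions(FILE *out)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[256], perms[8];
    unsigned long start, end;

    while (maps && fgets(line, sizeof line, maps)) {
        /* a maps line starts with "start-end perms ..." */
        if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3 &&
            perms[1] == 'w') {         /* "RO" sections are skipped */
            fwrite(&start, sizeof start, 1, out);
            fwrite(&end, sizeof end, 1, out);
            fwrite((const void *)start, 1, end - start, out);
        }
    }
    if (maps)
        fclose(maps);
}

static void ckpt_handler(int sig)
{
    (void)sig;
    if (setjmp(ckpt_ctx) == 0) {       /* 0: we are taking a checkpoint */
        FILE *out = fopen("ckpt.img", "w");
        if (!out)
            return;
        fwrite(&ckpt_ctx, sizeof ckpt_ctx, 1, out);
        dump_writable_regions(out);
        fclose(out);
        /* at this point the task would send TM_CKPTDONE to its daemon */
    }
    /* a non-zero return means we got here via longjmp() during recovery */
}

int main(void)
{
    signal(SIGUSR1, ckpt_handler);
    for (;;)                           /* application work would go here */
        pause();
}
```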

  7. Checkpoint Protocol • [Time diagram of the checkpointing protocol among the GRM, the PVM daemon (PVMd), and a task, showing SIGALARM, SM_CKPTSIGNAL, SIGUSR1, TM_CKPTDONE, SM_CKPTDONE, SM_CKPTCOMMIT, and a final SIGUSR1.] • The protocol is based on the two-phase commit protocol.

  8. Checkpoint Protocol (contd.) • [Diagram: the same protocol across two hosts, Host 1 and Host 2, each running a PVMd with Tasks 1-3; the GRM exchanges SM_CKPTSIGNAL, SM_CKPTDONE, and SM_CKPTCOMMIT with the daemons, which deliver SIGUSR1 to their local tasks and receive TM_CKPTDONE from them.]

  9. Other Messages • Two more messages are required for the consistency of the checkpoints taken: • TM_Ckptsignal (from a task to its daemon), • DM_Ckptsignal (from one daemon to another daemon). • To allow checkpointing to be partly non-blocking, these messages precede any application message sent while the checkpoint protocol is in progress, i.e. after a process has taken a checkpoint and before that checkpoint is committed.

  10. Other Messages (contd.) • On TM_Ckptsignal, if the application message is destined for a local task, the daemon determines the status of that task and delivers the message to the destination only if it has completed its checkpoint. • If the application message is bound for a foreign task, the daemon sends DM_Ckptsignal to the destination daemon before sending the application message.
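The daemon's send path while a checkpoint is in progress can then be pictured roughly as below. The helper functions, the status query, and the hold-back behaviour are hypothetical stand-ins for the daemon's internals, not FTOP's actual code.

```c
/* Sketch of a daemon's send path while the checkpoint protocol is in
 * progress: a local message is delivered only if the destination has
 * already taken its checkpoint; a remote message is preceded by a
 * DM_Ckptsignal marker.  All helpers are illustrative stubs. */
#include <stdio.h>

enum ckpt_status { CKPT_NOT_TAKEN, CKPT_TAKEN, CKPT_COMMITTED };

struct message { int dst_tid; const char *payload; };

/* stubs standing in for the real daemon internals */
static int is_local_task(int tid)                   { return tid < 0x40000; }
static enum ckpt_status task_status(int tid)        { (void)tid; return CKPT_TAKEN; }
static void deliver_local(const struct message *m)  { printf("deliver %s to t%x\n", m->payload, m->dst_tid); }
static void hold_message(const struct message *m)   { printf("hold %s for t%x\n", m->payload, m->dst_tid); }
static void send_dm_ckptsignal(int tid)             { printf("DM_Ckptsignal precedes msg for t%x\n", tid); }
static void forward_remote(const struct message *m) { printf("forward %s to t%x\n", m->payload, m->dst_tid); }

static void daemon_send_during_ckpt(const struct message *m)
{
    if (is_local_task(m->dst_tid)) {
        /* deliver only once the destination has taken its own checkpoint */
        if (task_status(m->dst_tid) != CKPT_NOT_TAKEN)
            deliver_local(m);
        else
            hold_message(m);
    } else {
        /* the marker goes ahead of the application message */
        send_dm_ckptsignal(m->dst_tid);
        forward_remote(m);
    }
}

int main(void)
{
    struct message local  = { 0x00042, "msg-A" };   /* local destination   */
    struct message remote = { 0x40007, "msg-B" };   /* foreign destination */
    daemon_send_during_ckpt(&local);
    daemon_send_during_ckpt(&remote);
    return 0;
}
```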

  11. Recovery • Fault detection: • daemons detect node failure • and inform the GRM through an SM_HOSTX message. • Fault assessment: • the GRM finds all the failed tasks. • Fault recovery: • the GRM spawns the failed tasks on appropriate hosts. Each failed task starts from the beginning and then copies its last checkpoint into its own address space.

  12. Recovery (contd.) • Recovering tasks • The local state of each task is restored using the setjmp() and longjmp() calls: setjmp() is called before checkpointing begins, and longjmp() is called after the address space has been restored from the checkpoint file. • Note the issues related to: • processes that started after the recovery line, • processes that exited normally after the recovery line.
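A minimal sketch of that restore path, matching the checkpoint-file layout assumed in the earlier handler sketch: read the saved context and memory regions back, then longjmp() into the recorded context. A real implementation must ensure the restore loop does not clobber its own stack while regions are being rewritten; that detail is glossed over here, and all names are illustrative.

```c
/* Sketch of the recovery path: a freshly spawned task reads the saved
 * jmp_buf and the (start, end, bytes) region records back from the
 * checkpoint file, then longjmp()s into the recorded context.  A real
 * implementation would restore the regions from a scratch stack so the
 * restoring code is not overwritten; that is glossed over here. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf ckpt_ctx;

static void restore_from_checkpoint(const char *path)
{
    FILE *in = fopen(path, "r");
    unsigned long start, end;

    if (!in)
        return;

    /* the saved context comes first, then the memory-region records */
    if (fread(&ckpt_ctx, sizeof ckpt_ctx, 1, in) != 1) {
        fclose(in);
        return;
    }

    while (fread(&start, sizeof start, 1, in) == 1 &&
           fread(&end, sizeof end, 1, in) == 1) {
        if (fread((void *)start, 1, end - start, in) != end - start)
            break;                     /* truncated checkpoint file */
    }
    fclose(in);

    longjmp(ckpt_ctx, 1);              /* resume where setjmp() was called */
}

int main(void)
{
    /* the restarted task begins from main() and then jumps back into the
       state captured by its last committed checkpoint */
    restore_from_checkpoint("ckpt.img");
    return 0;
}
```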

  13. Recovery (contd.) • The GRM starts the recovery protocol: • it calculates the recovery line • and transmits to every process the file id of its last committed checkpoint (the integer 1 or 2). • Each process restores its checkpointed image. • Processes are not allowed to send or receive application messages during the recovery stage.

  14. Recovery Protocol • [Time diagram of the recovery protocol among the GRM, PVMd, and a task: after the HOSTX (host failure) notification, the parties exchange SM_RECOVER, SIGUSR2, TM_RECOVERYDONE, SM_RECOVERYDONE, SM_RECOVERYCOMMIT, and a final SIGUSR2.]

  15. Other Issues • In-transit messages: • logging: reliable communication model, part of the checkpoint; • replaying: before any future interaction. • Routing: • why is it a problem? • maintain a route table: what to keep… • Open files: • why are they a problem? • how to handle them… • Reconnecting with the daemon.

  16. Handling Routing • The tid (task identifier) is used as the address of a message in PVM. Failed tasks get a new tid when they recover; other tasks do not know about this change, which causes routing problems. • A mapping table of the oldest and the most recent tid of each task is maintained. • The header of each message is parsed; if the message is destined for one of the failed tasks, the address field is replaced with the most recent tid of that task.
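The mapping table and the header rewrite can be pictured with the small sketch below; the table layout, the sample tid values, and the function names are assumptions made for illustration rather than FTOP's implementation.

```c
/* Sketch of tid translation for recovered tasks: a table maps the
 * original tid of each failed task to its most recent tid, and the
 * daemon rewrites the destination field of every message header it
 * routes.  Layout and names are illustrative only. */
#include <stdio.h>

#define MAX_TASKS 128

static struct { int old_tid; int new_tid; } tid_map[MAX_TASKS];
static int n_mapped = 0;

/* record that a failed task has been respawned under a new tid */
static void record_respawn(int old_tid, int new_tid)
{
    int i;
    for (i = 0; i < n_mapped; i++) {
        if (tid_map[i].old_tid == old_tid) {   /* task failed more than once */
            tid_map[i].new_tid = new_tid;
            return;
        }
    }
    if (n_mapped < MAX_TASKS) {
        tid_map[n_mapped].old_tid = old_tid;
        tid_map[n_mapped].new_tid = new_tid;
        n_mapped++;
    }
}

/* rewrite a destination tid if it belongs to a respawned task */
static int translate_dst(int dst_tid)
{
    int i;
    for (i = 0; i < n_mapped; i++)
        if (tid_map[i].old_tid == dst_tid)
            return tid_map[i].new_tid;
    return dst_tid;                            /* task never failed */
}

int main(void)
{
    record_respawn(0x40003, 0x40017);          /* hypothetical tids */
    printf("0x40003 -> 0x%x\n", translate_dst(0x40003));
    printf("0x40001 -> 0x%x\n", translate_dst(0x40001));
    return 0;
}
```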

  17. Handling Open files • lsof, a Linux utility, provides the list of all open files, their descriptors, and their modes. An lseek call provides the file pointer. • All of this information (file name, descriptor, mode, and file pointer) is stored with the checkpoint image of the process. • The state of each file is restored from this information at recovery time. • The file content itself may also need to be checkpointed.
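The save and restore of one open file's state can be sketched as below, assuming the checkpointer already knows the file's name and open mode (FTOP obtains them with lsof); the struct, the function names, and the sample file are illustrative.

```c
/* Sketch of saving and restoring one open file's state: record the
 * name, descriptor, open mode and current offset at checkpoint time,
 * then reopen the file on the same descriptor and reposition it at
 * recovery.  The name and mode are assumed to be already known (FTOP
 * gets them from lsof); everything here is illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct file_state {
    char  path[256];
    int   fd;
    int   flags;
    off_t offset;
};

static void save_file_state(int fd, const char *path, int flags,
                            struct file_state *st)
{
    strncpy(st->path, path, sizeof st->path - 1);
    st->path[sizeof st->path - 1] = '\0';
    st->fd = fd;
    st->flags = flags;
    st->offset = lseek(fd, 0, SEEK_CUR);   /* current file pointer */
}

static int restore_file_state(const struct file_state *st)
{
    int fd = open(st->path, st->flags);
    if (fd < 0)
        return -1;
    if (fd != st->fd) {                    /* reattach the original fd */
        dup2(fd, st->fd);
        close(fd);
    }
    lseek(st->fd, st->offset, SEEK_SET);   /* restore the file pointer */
    return 0;
}

int main(void)
{
    struct file_state st;
    char buf[16];
    int fd = open("/etc/hostname", O_RDONLY);

    if (fd < 0)
        return 1;
    if (read(fd, buf, sizeof buf) < 0)     /* advance the offset a little */
        return 1;
    save_file_state(fd, "/etc/hostname", O_RDONLY, &st);
    close(fd);                             /* as if the task had failed */
    return restore_file_state(&st) == 0 ? 0 : 1;
}
```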

  18. Reconnecting with the Daemon • A task is connected to the virtual machine through the PVM daemon. A failed task, when spawned on a new host, needs to reconnect to the daemon. • It connects to the new daemon through the Unix domain socket whose name is advertised by the daemon in a host-specific file. It also cleans up the old socket information.
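A reconnect along those lines might look like the sketch below: read the socket name the daemon advertises in a host-specific file, then connect to it over a Unix domain socket. The advertisement file path and the function names used here are assumptions for illustration, not PVM's actual conventions.

```c
/* Hypothetical sketch of a recovered task reattaching to its local PVM
 * daemon: read the Unix domain socket name advertised in a host-specific
 * file and connect to it.  The advertisement file path is an assumption
 * for illustration. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int reconnect_to_daemon(const char *addr_file)
{
    struct sockaddr_un addr;
    char sockname[sizeof addr.sun_path];
    FILE *f;
    int fd;

    /* the daemon writes the name of its listening socket into addr_file */
    f = fopen(addr_file, "r");
    if (!f || !fgets(sockname, sizeof sockname, f)) {
        if (f)
            fclose(f);
        return -1;
    }
    fclose(f);
    sockname[strcspn(sockname, "\n")] = '\0';

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sockname, sizeof addr.sun_path - 1);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;    /* the task is attached to the new daemon again */
}

int main(void)
{
    int fd = reconnect_to_daemon("/tmp/pvmd-sockname");  /* hypothetical path */
    if (fd >= 0)
        close(fd);
    return fd >= 0 ? 0 : 1;
}
```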

  19. Testing • Testing environment: • Hosts: 3-5 Pentium III machines running Red Hat Linux 7.1. • Channel: 100 Mbps Ethernet LAN. • Failure simulation: by removing a host from the virtual machine. • Test cases: • Matrix multiplication. • PVMPOV (a full-featured distributed ray tracer built on PVM). • Others for correctness: simple file I/O, "ping-pong", etc.

  20. Overheads • Checkpointing overhead for the Matrix multiplication program:
  Checkpointing interval (secs):  10   20   30   40   infinity
  Running time (secs):            36   18   11    8    3
  • Checkpointing overhead for the PVMPOV program:
  Checkpointing interval (secs):   30    60    90   120   150   180   210   240
  Running time (secs):            254   247   246   241   241   238   236   228

  21. Conclusion and future work • Builds fault tolerance into standard PVM while staying entirely at the user level. • Able to roll back open files and in-transit messages. • In future work we wish to handle device associations, which may require explicit OS support. • We also intend to integrate well-known optimizations into the checkpointing protocol. • We also aim to support other C&RR schemes.
