
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems





Presentation Transcript


  1. Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems G. (John) Janakiraman, Jose Renato Santos, Dinesh Subhraveti§, Yoshio Turner HP Labs §: Currently at Meiosys, Inc.

  2. Broad Opportunity for Checkpoint-Restart in Server Management • Fault tolerance (minimize unplanned downtime) • Recover by restarting from checkpoint • Minimize planned downtime • Migrate application before hardware/OS maintenance • Resource management • Manage resource allocation in shared computing environments by migrating applications

  3. Need for General-Purpose Checkpoint-Restart • Existing checkpoint-restart methods are too limited: • No support for many OS resources that commercial applications use (e.g., sockets) • Limited to applications using specific libraries • Require application source and recompilation • Require use of specialized operating systems • Need a practical checkpoint-restart mechanism that is capable of supporting a broad class of applications

  4. Cruz: Our Solution for General-Purpose Checkpoint-Restart on Linux • Application-transparent: supports applications without modifications or recompilation • Supports a broad class of applications (e.g., databases, parallel MPI apps, desktop apps) • Comprehensive support for user-level state, kernel-level state, and distributed computation and communication state • Supported on unmodified Linux base kernel – checkpoint-restart integrated via a kernel module

  5. Cruz Overview Builds on Columbia Univ.’s Zap process migration Our Key Extensions • Support for migrating networked applications, transparent to communicating peers • Enables role in managing servers running commercial applications (e.g., databases) • General method for checkpoint-restart of TCP/IP-based distributed applications • Also enables efficiencies compared to library-specific approaches

  6. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  7. Zap (Background) [Figure: applications grouped into pods; the Zap layer sits between the pods and Linux, intercepting system calls] • Process migration mechanism • Kernel module implementation • Virtualization layer groups processes into pods with private virtual name space • Intercepts system calls to expose only virtual identifiers (e.g., vpid) • Preserves resource names and dependencies across migration • Mechanism to checkpoint and restart pods • User and kernel-level state • Primarily uses system call handlers • File system not saved or restored (assumes a network file system)
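The pod name-space idea can be sketched in a few lines. This is a hypothetical user-space illustration only; the `Pod` class and its methods are invented for the example, while real Zap does this inside a kernel module via system-call interposition:

```python
# Illustrative sketch of Zap-style identifier virtualization (not the
# real kernel-module implementation): a pod keeps a private vpid name
# space so processes see stable identifiers across migration.

class Pod:
    def __init__(self):
        self._v2r = {}        # vpid -> real pid on the current host
        self._r2v = {}        # real pid -> vpid
        self._next_vpid = 1

    def register(self, real_pid):
        """A process enters the pod: assign it a stable virtual pid."""
        vpid, self._next_vpid = self._next_vpid, self._next_vpid + 1
        self._v2r[vpid] = real_pid
        self._r2v[real_pid] = vpid
        return vpid

    def getpid(self, real_pid):
        """Intercepted getpid(): expose only the virtual identifier."""
        return self._r2v[real_pid]

    def rebind(self, old_to_new):
        """After restart on a new host, map vpids to new real pids;
        the vpids the application saw are preserved."""
        self._v2r = {v: old_to_new[r] for v, r in self._v2r.items()}
        self._r2v = {r: v for v, r in self._v2r.items()}
```

A process that called getpid() before migration sees the same vpid after rebind(), which is the property that lets resource names and dependencies survive a move between hosts.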

  8. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  9. Migrating Networked Applications • Migration must be transparent to remote peers to be useful in server management scenarios • Peers, including unmodified clients, must not perceive any change in the IP address of the application • Communication state of live connections must be preserved • No prior solution for these (including original Zap) • Our Solution: • Provide unique IP address to each pod that persists across migration • Checkpoint and restore the socket control state and socket data buffer state of all live sockets

  10. Network Address Migration [Figure: pod's DHCP client obtains the pod address over VIF eth0:1 on host interface eth0 [IP-1, MAC-h1]: 1. ioctl() 2. MAC-p1 3. dhcprequest(MAC-p1) to DHCP server 4. dhcpack(IP-p1)] • Pod attached to virtual interface with own IP & MAC addr. • Implemented by using Linux's virtual interfaces (VIFs) • IP address assigned statically or through a DHCP client running inside the pod (using pod's MAC address) • Intercept bind() & connect() to ensure pod processes use pod's IP address • Migration: delete VIF on source host & create on new host • Migration limited to subnet
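The bind()/connect() interception can be illustrated with a tiny address-rewrite rule. This is a sketch: `POD_IP` and the rewrite policy are assumptions for the example (whether only wildcard binds or all binds are rewritten is an implementation detail; the sketch rewrites all):

```python
# Sketch of the intercepted bind(): sockets created inside a pod are
# forced onto the pod's own virtual-interface address, so peers only
# ever see the pod IP, which moves with the pod at migration time.

POD_IP = "10.0.0.42"   # hypothetical address assigned to the pod's VIF

def rewrite_bind_addr(addr):
    """Wildcard and host-local addresses are replaced by the pod IP."""
    host, port = addr
    if host != POD_IP:            # includes "" and "0.0.0.0" wildcards
        return (POD_IP, port)
    return addr
```

Because every socket ends up bound to the pod address rather than a host address, deleting the VIF on the source host and recreating it on the destination is enough to make the move invisible to remote peers.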

  11. Communication State Checkpoint and Restore Communication state: • Control: Socket data structure, TCP connection state • Data: contents of send and receive socket buffers Challenges in communication state checkpoint and restore: • Network stack will continue to execute even after application processes are stopped • No system call interface to read or write control state • No system call interface to read send socket buffers • No system call interface to write receive socket buffers • Consistency of control state and socket buffer state

  12. Communication State Checkpoint [Figure: live communication state vs. checkpoint state for one socket — the control block (timers, options, copied_seq, write_seq, snd_una, rcv_nxt) plus send and receive socket buffers] • Acquire network stack locks to freeze TCP processing • Save receive buffers using the socket receive system call in peek mode • Save send buffers by walking kernel structures • Copy control state from kernel structures • Modify two sequence numbers in the saved state to reflect empty socket buffers: indicate current send buffers not yet written by the application, and current receive buffers all consumed by the application • Note: checkpoint does not change the live communication state
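The peek-mode save of the receive buffers can be demonstrated with a plain socket pair standing in for a live TCP connection; MSG_PEEK copies queued data without consuming it, which is why the checkpoint leaves the live state untouched:

```python
import socket

# Checkpoint-side sketch: save receive-buffer contents non-destructively.
a, b = socket.socketpair()           # stand-in for a live TCP connection
a.sendall(b"in-flight data")

saved = b.recv(64, socket.MSG_PEEK)  # checkpoint: peek, do not consume
later = b.recv(64)                   # the application still gets the data
assert saved == later == b"in-flight data"

# The saved control state is then adjusted so both buffers look empty,
# per the slide: receive data is marked already consumed (e.g.,
# copied_seq advanced to rcv_nxt) and send data is marked not yet
# written (e.g., write_seq rolled back to snd_una).
a.close(); b.close()
```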

  13. Communication State Restore [Figure: checkpoint state copied into a new socket's control block and buffers; send data rewritten via write(), receive data delivered to the application through the intercepted receive system call] • Create a new socket • Copy control state in checkpoint to socket structure • Restore checkpointed send buffer data using the socket write call • Deliver checkpointed receive buffer data to application on demand • Copy checkpointed receive buffer data to a special buffer • Intercept receive system call to deliver data from special buffer until buffer is emptied
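The restore-side receive interception reduces to "drain a staged buffer first, then fall through to the real socket". Here is a minimal sketch; the `RestoredSocket` wrapper is invented for illustration:

```python
import socket

class RestoredSocket:
    """Serves checkpointed receive data before any live data."""
    def __init__(self, sock, staged):
        self._sock = sock       # new socket with restored control state
        self._staged = staged   # receive buffer saved at checkpoint

    def recv(self, n):
        if self._staged:                    # intercepted path
            data, self._staged = self._staged[:n], self._staged[n:]
            return data
        return self._sock.recv(n)           # buffer drained: normal path

# Usage: data queued before the checkpoint arrives first, then new data.
a, b = socket.socketpair()
rs = RestoredSocket(b, staged=b"saved at checkpoint")
a.sendall(b"sent after restart")
assert rs.recv(64) == b"saved at checkpoint"
assert rs.recv(64) == b"sent after restart"
a.close(); b.close()
```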

  14. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  15. Checkpoint-Restart of Distributed Applications [Figure: processes on each node communicate through a library over TCP/IP; the nodes are linked by a communication channel] • State of processes and messages in channel must be checkpointed and restored consistently • Prior approaches are specific to a particular library (e.g., modify the library to capture and restore messages in the channel) • Cruz preserves TCP connection state and IP addresses of each pod, implicitly preserving global communication state • Transparently supports TCP/IP-based distributed applications • Enables efficiencies compared to library-based implementations

  16. Checkpoint-Restart of Distributed Applications in Cruz [Figure: pods of processes on each node communicate through a library over TCP/IP across the communication channel] • Global communication state saved and restored by saving and restoring TCP communication state for each pod • Messages in flight need not be saved since the TCP state will trigger retransmission of these messages at restart • Eliminates O(N²) step to flush channel for capturing messages in flight • Eliminates need to re-establish connections at restart • Preserving pod's IP address across restart eliminates need to re-discover process locations in library at restart

  17. Consistent Checkpoint Algorithm in Cruz (Illustrative) [Figure: the Coordinator sends <checkpoint> to an Agent on each node; each Agent disables pod communication (using netfilter rules in Linux) and saves pod state, then replies <done>; the Coordinator then sends <continue>; each Agent enables pod communication, resumes the pod, and replies <continue-done>] • Algorithm has O(N) complexity (blocking algorithm shown for simplicity) • Can be extended to improve robustness and performance, e.g.: • Tolerate Agent & Coordinator failures • Overlap computation and checkpointing using copy-on-write • Allow nodes to continue without blocking for all nodes to complete checkpoint • Reduce checkpoint size with incremental checkpoints
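The two-phase message flow on the slide can be simulated in a few lines. This toy model (function names are invented) shows why coordination cost is O(N): one <checkpoint>/<done> and one <continue>/<continue-done> exchange per node:

```python
# Toy simulation of the blocking coordinated checkpoint (illustrative;
# real Cruz agents disable pod communication with netfilter rules).

def agent_handle(node, cmd, log):
    if cmd == "checkpoint":
        log += [f"{node}: disable pod comm", f"{node}: save pod state"]
        return "done"
    if cmd == "continue":
        log += [f"{node}: enable pod comm", f"{node}: resume pod"]
        return "continue-done"

def coordinator(nodes):
    log = []
    # Phase 1: every pod freezes communication and saves its state ...
    assert all(agent_handle(n, "checkpoint", log) == "done" for n in nodes)
    # Phase 2: ... and only after every <done> arrives do pods resume.
    assert all(agent_handle(n, "continue", log) == "continue-done" for n in nodes)
    return log

log = coordinator(["node1", "node2"])
```

The blocking structure guarantees that no pod resumes sending before every pod has saved its state, which is what makes the per-pod TCP snapshots globally consistent.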

  18. Outline • Zap (Background) • Migrating Networked Applications • Network Address Migration • Communication State Checkpoint and Restore • Checkpoint-Restart of Distributed Applications • Evaluation • Related Work • Future Work • Summary

  19. Evaluation • Cruz implemented for Linux 2.4.x on x86 • Functionality verified on several applications, e.g., MySQL, K Desktop Environment, and a multi-node MPI benchmark • Cruz incurs negligible runtime overhead (less than 0.5%) • Initial study shows performance overhead of coordinating checkpoints is negligible, suggesting the scheme is scalable

  20. Performance Result – Negligible Coordination Overhead • Checkpoint behavior for Semi-Lagrangian atmospheric model benchmark in configurations from 2 to 8 nodes • Negligible latency in coordinating checkpoints (time spent in non-local operations) suggests scheme is scalable • Coordination latency of 400-500 microseconds is a small fraction of the overall checkpoint latency of about 1 second

  21. Related Work • MetaCluster product from Meiosys • Capabilities similar to Cruz (e.g., checkpoint and restart of unmodified distributed applications) • Berkeley Labs Checkpoint Restart (BLCR) • Kernel-module based checkpoint-restart for single node • No identifier virtualization – restart will fail in the event of an identifier (e.g., pid) conflict • No support for handling communication state – relies on application or library changes • MPVM, CoCheck, LAM-MPI • Library-specific implementations of parallel application checkpoint-restart with disadvantages described earlier

  22. Future Work Many areas for future work, e.g., • Improve portability across kernel versions by minimizing direct access to kernel structures • Recommend additional kernel interfaces when advantageous (e.g., accessing socket attributes) • Implement performance optimizations to the coordinated checkpoint-restart algorithm • Evaluate performance on a wide range of applications and cluster configurations • Support systems with newer interconnects and newer communication abstractions (e.g., InfiniBand, RDMA)

  23. Summary • Cruz, a practical checkpoint-restart system for Linux • No change to applications or to base OS kernel needed • Novel mechanisms to support checkpoint-restart of a broader class of applications • Migrating networked applications transparent to communicating peers • Consistent checkpoint-restart of general TCP/IP-based distributed applications • Cruz’s broad capabilities will drive its use in solutions for fault tolerance, online OS maintenance, and resource management

  24. http://www.hpl.hp.com/research/dca

