DMTCP: A New Linux Checkpointing Mechanism For Vanilla Universe Jobs

  • Why checkpoint at all?

  • Problems with Condor’s Standard Universe

    • Single process.

    • No pthreads.

    • No mmap() support.

    • Jobs must be re-linked (with condor_compile) into a static executable.

  • DMTCP removes these restrictions!

What is DMTCP?

  • Distributed Multi-Threaded CheckPointing.

  • Works with Linux Kernel 2.6.9 and later.

  • Supports sequential and multi-threaded computations on a single host or across multiple hosts.

  • Entirely in user space (no kernel modules or root privilege).

  • Transparent (no recompiling, no re-linking).

  • Written at Northeastern U. and MIT and under active development for 4+ years.

  • LGPL’d and freely available.

  • No remote I/O (unlike the Standard Universe).

Process Structure


[Figure: Processes 1 through N, each containing a DMTCP checkpoint thread (CT) and user threads (T). The CT signals the user threads with SIGUSR2 and communicates over a network socket.]

How Does It Work?

  • ./dmtcp_checkpoint a.out # also starts a coordinator if none is running

  • ./dmtcp_command -c # asks the coordinator to checkpoint

  • ./dmtcp_restart ckpt_a.out_*.dmtcp

  • Coordinator is a stateless synchronization server for the distributed checkpointing algorithm.

  • Checkpoint/restart time depends mainly on the size of memory, disk write speed, and synchronization overhead.
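To make the coordinator's role as a synchronization server concrete, here is a minimal sketch of the barrier bookkeeping it might do, assuming each process reports when it finishes a checkpoint phase. The phase names and `coord_peer_done()` are hypothetical, not DMTCP's API; in this sketch the coordinator holds only a transient arrival count and never any application data.

```c
/* Hypothetical sketch of a checkpoint coordinator's barrier bookkeeping. */

/* Phases in the order the checkpoint steps are performed. */
enum ckpt_phase {
    PHASE_RUNNING,         /* user threads running normally */
    PHASE_SUSPENDED,       /* user threads stopped */
    PHASE_LEADERS_ELECTED, /* shared fd leaders chosen */
    PHASE_DRAINED,         /* kernel buffers drained */
    PHASE_WRITTEN,         /* checkpoint images on disk */
    PHASE_REFILLED         /* kernel buffers refilled */
};

struct coordinator {
    int num_peers;         /* processes taking part in the checkpoint */
    int arrived;           /* peers that have finished the current phase */
    enum ckpt_phase phase; /* last phase that every peer completed */
};

/* Record that one peer finished the next phase. Returns 1 if this was the
   last peer, meaning everyone may proceed (a real coordinator would
   broadcast that over its sockets), otherwise 0. */
int coord_peer_done(struct coordinator *c) {
    if (++c->arrived < c->num_peers)
        return 0;
    c->arrived = 0;
    /* After the refill barrier, peers resume and the cycle starts over. */
    c->phase = (c->phase == PHASE_REFILLED) ? PHASE_RUNNING
                                            : (enum ckpt_phase)(c->phase + 1);
    return 1;
}
```

A real coordinator also has to track peers connecting and disconnecting; this sketch models only the counting that makes every process wait until all of them have finished a phase.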

How Does It Work?

  • LD_PRELOAD: Transparently preloads the checkpoint libraries, which install libc wrappers and the checkpointing code.

  • SIGUSR2: Sent internally by the checkpoint thread to the user threads.

  • Wrappers: Only on less frequently used libc calls:

    • fork, exec, system, pipe, bind, listen, setsockopt, connect, accept, clone, close, ptsname, openlog, closelog, signal, sigaction, sigvec, sigblock, sigsetmask, sigprocmask, rt_sigprocmask, pthread_sigmask

    • Overhead is negligible, since heavily used calls such as read() and write() are not wrapped.
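A libc wrapper of the kind LD_PRELOAD installs can be sketched as follows, assuming glibc's `dlsym(RTLD_NEXT, ...)` to reach the real libc function. It wraps `close()`, one of the calls listed above, and only counts calls at the point where a checkpointer would update its file descriptor table. The counter and `close_wrapper_calls()` are hypothetical; this is not DMTCP's code.

```c
#define _GNU_SOURCE             /* for RTLD_NEXT */
#include <dlfcn.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical sketch of an LD_PRELOAD-style libc wrapper (not DMTCP code). */

static int (*real_close)(int) = NULL;
static unsigned long close_calls = 0;

int close(int fd) {
    if (real_close == NULL) {
        /* Find the next close() after this object in symbol lookup order,
           i.e. the one in libc. (Lazy init is not thread-safe; a real
           wrapper would resolve it in a constructor or with pthread_once.) */
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
        if (real_close == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }
    close_calls++;              /* a checkpointer would update its fd table here */
    return real_close(fd);
}

unsigned long close_wrapper_calls(void) {
    return close_calls;
}
```

Built with `gcc -shared -fPIC -o libwrap.so wrap.c -ldl` and run as `LD_PRELOAD=./libwrap.so ./a.out`, the wrapper intercepts the program's dynamically resolved calls to `close()` without recompiling or re-linking it.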

How Does It Work?

  • Additional wrappers when process id & thread id virtualization is enabled

    • getpid, getppid, gettid, tcgetpgrp, tcsetpgrp, getpgrp, setpgrp, getsid, setsid, kill, tkill, tgkill, wait, waitpid, waitid, wait3, wait4
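Process id virtualization can be pictured as a translation table: the application keeps seeing the pids it saw before the checkpoint, and wrappers such as kill() translate them into whatever pids the kernel assigned after restart. A minimal sketch, assuming a small fixed-size table; the names `vpid_set`, `vpid_to_real`, `real_to_vpid`, and `virtual_kill` are hypothetical, not DMTCP's internals.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical sketch of pid virtualization (not DMTCP's data structures). */
#define MAX_PIDS 64

struct pid_entry {
    pid_t virt;   /* pid the application saw originally; stable across restarts */
    pid_t real;   /* pid the kernel assigned in the current run */
};

static struct pid_entry table[MAX_PIDS];
static int entries = 0;

/* Add a mapping, or update it after restart gives the process a new real pid. */
int vpid_set(pid_t virt, pid_t real) {
    for (int i = 0; i < entries; i++) {
        if (table[i].virt == virt) {
            table[i].real = real;
            return 0;
        }
    }
    if (entries == MAX_PIDS)
        return -1;
    table[entries].virt = virt;
    table[entries].real = real;
    entries++;
    return 0;
}

/* Used by wrappers that pass a pid to the kernel (kill, waitpid, ...). */
pid_t vpid_to_real(pid_t virt) {
    for (int i = 0; i < entries; i++)
        if (table[i].virt == virt)
            return table[i].real;
    return -1;
}

/* Used by wrappers that return a pid to the application (getpid, wait, ...). */
pid_t real_to_vpid(pid_t real) {
    for (int i = 0; i < entries; i++)
        if (table[i].real == real)
            return table[i].virt;
    return -1;
}

/* What a kill() wrapper might do. Process-group targets (pid <= 0) are
   not handled in this sketch. */
int virtual_kill(pid_t virt, int sig) {
    pid_t real = vpid_to_real(virt);
    if (real < 0) {
        errno = ESRCH;
        return -1;
    }
    return kill(real, sig);
}
```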

How Does It Work?

  • Checkpoint image compression on-the-fly (default).

  • Currently only supports programs dynamically linked against libc. Support for static libc.a is feasible, but not implemented.

  • Stays close to POSIX API standards.
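On-the-fly compression means the image is compressed as it is written, not in a second pass over a finished file. A minimal sketch, assuming a `gzip` binary on the PATH and using `popen()` to stream the image through it; the function names and the file name in the test are hypothetical, and this is not necessarily how DMTCP invokes its compressor.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical sketch: compress a checkpoint image while writing it by
   streaming it through an external gzip process. Assumes gzip is on the
   PATH and that `path` contains no single quotes (it is not escaped). */
int write_image_compressed(const char *path, const void *image, size_t len) {
    char cmd[512];
    if (snprintf(cmd, sizeof cmd, "gzip -c > '%s'", path) >= (int)sizeof cmd)
        return -1;
    FILE *gz = popen(cmd, "w");        /* our writes become gzip's stdin */
    if (gz == NULL)
        return -1;
    size_t written = fwrite(image, 1, len, gz);
    int status = pclose(gz);           /* waits for gzip to finish */
    return (written == len && status == 0) ? 0 : -1;
}

/* Restart side: decompress the image back into memory. `cap` must hold
   the whole image. Returns the number of bytes read, or -1 on error. */
long read_image_compressed(const char *path, void *buf, size_t cap) {
    char cmd[512];
    if (snprintf(cmd, sizeof cmd, "gzip -dc '%s'", path) >= (int)sizeof cmd)
        return -1;
    FILE *gz = popen(cmd, "r");
    if (gz == NULL)
        return -1;
    size_t n = fread(buf, 1, cap, gz);
    int status = pclose(gz);
    return status == 0 ? (long)n : -1;
}
```

Running the compressor as a separate process lets compression overlap with writing, at the cost of depending on an external program being present on the execute machine.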

A Checkpoint Under DMTCP

  • Checkpoint libraries (preloaded via LD_PRELOAD) present in the executable's memory.

  • Ask the coordinator for a checkpoint via dmtcp_command.

  • Now what happens?

A Checkpoint Under DMTCP

  • Suspend user threads with SIGUSR2.

  • Elect shared file descriptor leaders.

  • Drain kernel buffers and do network handshake with peers.

  • Write checkpoint to disk.

  • Refill kernel buffers.

  • Resume user threads.
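The first and last steps above (suspend and resume) can be sketched with POSIX signals: the checkpoint thread sends SIGUSR2 to each user thread, whose handler reports in and then waits until the checkpoint is done. This is a minimal sketch under stated assumptions, not DMTCP's implementation: a real checkpointer also saves each thread's register state inside the handler, and the `ckpt_*` names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

/* Hypothetical sketch: a checkpoint thread parks user threads with SIGUSR2. */
static sem_t parked;            /* posted once per user thread that has stopped */
static atomic_int frozen = 0;   /* 1 while a checkpoint is in progress */

/* Runs on the user thread that received SIGUSR2. */
static void on_sigusr2(int sig) {
    (void)sig;
    sem_post(&parked);                         /* async-signal-safe */
    struct timespec nap = {0, 1000000};        /* 1 ms */
    while (atomic_load(&frozen))
        nanosleep(&nap, NULL);                 /* async-signal-safe */
}

int ckpt_install(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;
    return sem_init(&parked, 0, 0);
}

/* First step, on the checkpoint thread: returns once every user thread is parked. */
void ckpt_suspend_all(const pthread_t *threads, int n) {
    atomic_store(&frozen, 1);
    for (int i = 0; i < n; i++)
        pthread_kill(threads[i], SIGUSR2);
    for (int i = 0; i < n; i++)
        while (sem_wait(&parked) == -1 && errno == EINTR)
            ;                                  /* retry if interrupted */
}

/* Last step: let the parked threads return from the handler. */
void ckpt_resume_all(void) {
    atomic_store(&frozen, 0);
}
```

Between `ckpt_suspend_all()` and `ckpt_resume_all()` the checkpoint thread can safely perform the middle steps, because no user thread is changing memory or file descriptors.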

Where Is the Checkpoint?

  • In the cwd of the application.

    • A set of ckpt_<exec>_<id>.dmtcp files.

  • In the cwd of the coordinator.

    • A restart script.

    • The script may need tweaking depending on circumstances.

A Restart Under DMTCP

  • The restart process is loaded into memory.

  • Reopen files and recreate ptys.

  • Recreate and reconnect sockets.

  • Fork into user processes.

  • Rearrange file descriptors to initial layout.

  • Restore memory and threads.

  • Refill kernel buffers.

  • Resume user threads.
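One way to drain and refill kernel buffers is a marker ("cookie") protocol: the sender writes a known marker, and the receiver reads until the marker arrives, so everything before it is the data that was in flight. After the checkpoint, or at restart, the saved bytes are written back so the application reads them as if nothing had happened. This sketch puts both ends of the connection in one process, uses a hypothetical cookie string, and ignores application data that happens to contain the cookie; it illustrates the idea, not DMTCP's wire protocol.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical marker; a real implementation must make it unambiguous. */
static const char DRAIN_COOKIE[] = "SKETCH-DRAIN-COOKIE";

/* Drain one direction of a connection: `tx` is the sending end, `rx` the
   receiving end (in a real system they live in different processes).
   Copies the bytes that were in flight into buf and returns their count,
   or -1 on error or if buf is too small. */
ssize_t drain_connection(int tx, int rx, char *buf, size_t cap) {
    const size_t clen = sizeof DRAIN_COOKIE - 1;
    if (write(tx, DRAIN_COOKIE, clen) != (ssize_t)clen)
        return -1;
    size_t len = 0;
    /* Read until the received bytes end with the cookie. */
    while (len < clen || memcmp(buf + len - clen, DRAIN_COOKIE, clen) != 0) {
        if (len == cap)
            return -1;
        ssize_t r = read(rx, buf + len, cap - len);
        if (r <= 0)
            return -1;
        len += (size_t)r;
    }
    return (ssize_t)(len - clen);   /* strip the cookie itself */
}

/* Refill: push the saved bytes back into the kernel buffer so the
   application finds them unread when it resumes. */
int refill_connection(int tx, const char *buf, size_t len) {
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(tx, buf + off, len - off);
        if (w <= 0)
            return -1;
        off += (size_t)w;
    }
    return 0;
}
```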

Supported OS Features

  • Threads, mutexes/semaphores, fork, exec

    • Shared memory (via mmap), TCP/IP sockets, UNIX domain sockets, pipes, ptys, terminal modes, ownership of controlling terminals, signal handlers, open and/or shared fds, I/O (including the readline library), parent-child process relationships, process id & thread id virtualization, session and process group ids, and more…

  • Trying to keep the implementation small!

Supported Applications

  • MPICH-2, OpenMPI, SciPy/IPython, Python

    • cmsRun, Perl, Ruby, PHP, GHCi (Glasgow Haskell Compiler interactive environment), OCaml, Octave, Macaulay2, gnuplot, slsh (S-Lang scripts), MzScheme, GST (GNU Smalltalk virtual machine), tcsh, dash, csh, tclsh (Tcl-based interpreter), SQLite.

    • And many others!

Planned Application Support

  • Bash, gcl (GNU Common Lisp), Maxima (built on gcl), and the Sun JVM.

  • These programs do their own memory management with sbrk(), which triggers a bug in DMTCP.

  • A fix is planned and will go in soon.

Planned Application Support

  • Matlab

    • Running the Matlab binary directly, without graphics, works; but the matlab command uses bash, which needs the sbrk() fix.

Condor/DMTCP Integration

  • Experimental at this time.

    • Determining scalability, stability, and the extent of "weird edge cases" when DMTCP is combined with Condor.

  • Completely outside of Condor source code.

    • A vanilla universe job called "shim_dmtcp" that wraps the user's job and its standard files (stdin, stdout, stderr) with DMTCP.

    • A submit description file which transfers needed dmtcp files over to the remote side and saves intermediate checkpoints.

    • No remote I/O!
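One decision any wrapper like shim_dmtcp has to make is whether to start the job fresh under dmtcp_checkpoint or to resume it with dmtcp_restart from images that Condor transferred back. A minimal sketch of that decision, assuming checkpoint images match `ckpt_*.dmtcp` in the job's working directory; `shim_choose()` and `shim_build_argv()` are hypothetical, and the actual shim (a script) may work differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <glob.h>
#include <stddef.h>

/* Hypothetical shim logic: resume from checkpoint images if any exist,
   otherwise start the job fresh under DMTCP. */
enum shim_mode { SHIM_FRESH_START, SHIM_RESTART };

/* Looks for checkpoint images in the cwd. Caller must globfree(images). */
enum shim_mode shim_choose(glob_t *images) {
    int rc = glob("ckpt_*.dmtcp", 0, NULL, images);
    return (rc == 0 && images->gl_pathc > 0) ? SHIM_RESTART : SHIM_FRESH_START;
}

/* Builds a NULL-terminated argv for execvp(). job_argv is "a.out arg0 ...".
   Returns the number of entries (excluding NULL), or -1 if cap is too small. */
int shim_build_argv(enum shim_mode mode, const glob_t *images,
                    char **job_argv, int job_argc, char **out, size_t cap) {
    int n = 0;
    if (mode == SHIM_RESTART) {
        if (cap < images->gl_pathc + 2)
            return -1;
        out[n++] = "dmtcp_restart";
        for (size_t i = 0; i < images->gl_pathc; i++)
            out[n++] = images->gl_pathv[i];
    } else {
        if (cap < (size_t)job_argc + 2)
            return -1;
        out[n++] = "dmtcp_checkpoint";
        for (int i = 0; i < job_argc; i++)
            out[n++] = job_argv[i];
    }
    out[n] = NULL;
    return n;
}
```

A shim built this way would redirect the standard files and then call `execvp(out[0], out)`, so the same submit file serves both the first run and every restart after eviction.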

Submit File Example

universe = vanilla
executable = shim_dmtcp
arguments = logfile stdinf stdoutf stderrf a.out arg0 arg1…
should_transfer_files = YES
when_to_transfer_output = ON_EVICT_OR_EXIT
transfer_input_files = <dmtcp libraries and programs>, \
                       a.out, stdinf, stdoutf, stderrf
environment = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null
kill_sig = 2
output = shim.$(Cluster).$(Process).out
error = shim.$(Cluster).$(Process).err
log = shim.log
queue


Condor/DMTCP Integration

  • Early Results

    • It works with our test case across thousands of jobs.

    • Problems

      • Restarting a checkpoint taken on a PAE (Physical Address Extension) kernel on a non-PAE kernel, or vice versa, is a challenge.

      • DMTCP’s API needs some improvement.

      • Coordinator failure means job failure.

      • Shim script is clunky, e.g. no streaming I/O.

  • Next: integration into our standard universe (stduniv) test suite for full regression testing.

Future Condor Integration

  • Add WantCheckpoint = True and CheckpointMethod = DMTCP for a vanilla universe job.

  • Condor takes care of wrapping the job with DMTCP and transferring the needed DMTCP files; no shim-script voodoo.

  • Condor should honor CheckpointPlatform for vanilla universe jobs, so that in a segmented (heterogeneous) pool a checkpoint is only restarted on a compatible machine.

  • Parallel universe support with a single coordinator.

  • Doug Thain’s Parrot for remote I/O.

  • C/C++ runtime library compatibility issues.

    • Recompile DMTCP on slot before job execution?

  • Dynamic library incompatibilities.

  • No Checkpoint Server.

    • Condor file transfer protocol enhancement?

  • Debugging methods and practices?

Further Reading

  • “DMTCP: Transparent Checkpointing for Cluster Computation and the Desktop”


  • Source Code


