Keith D. Ball, PhD Evergrid, Inc.

User-Friendly Checkpointing and Stateful Preemption in HPC Environments Using Evergrid Availability Services
Keith D. Ball, PhD, Evergrid, Inc.
Oklahoma Supercomputing Symposium, October 3, 2007

Overview
• Challenges for Large HPC Systems
• Availability Services (AVS) Checkpointing
• AVS Performance
• Preemptive Scheduling
• AVS Integration with LSF on Topdawg
• Conclusions
• About Evergrid

Challenges for Large HPC Systems
Robustness & fault tolerance:
• Long runs + many nodes = increased likelihood of failure
• Need to insure the real-time and compute-time “investment” in long computations
Scheduling:
• Need a “stateful preemption” capability for efficient and optimal fair-share scheduling
• Without stateful preemption, high-priority jobs terminate low-priority jobs, forcing them to restart from the beginning; this increases average throughput time and decreases the utilization rate
Maintenance:
• Quiescing the system takes a long time; it is hard to do scheduled (or emergency) maintenance without killing jobs

Relentless Effect of Scale
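To make the "long runs + many nodes" point concrete, here is a minimal sketch under a simplifying assumption that is not from the talk: node failures are independent and exponentially distributed with a common mean time between failures (MTBF). The function name and the 5-year MTBF figure are hypothetical and chosen only for illustration.

```python
import math


def prob_job_interrupted(nodes: int, run_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one of `nodes` fails during a run.

    Assumes independent, exponentially distributed node failures
    (an illustrative model, not a measurement from any real system).
    """
    # The system failure rate is the sum of the per-node rates: nodes / MTBF.
    return 1.0 - math.exp(-nodes * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    mtbf = 5 * 365 * 24  # hypothetical per-node MTBF: 5 years, in hours
    for n in (1, 16, 128, 1024):
        p = prob_job_interrupted(n, run_hours=24, node_mtbf_hours=mtbf)
        print(f"{n:5d} nodes, 24 h run: P(at least one failure) = {p:.3f}")
```

Under this model the interruption probability grows with the product of node count and run time, which is why a failure rate that is tolerable for a single node becomes a practical problem for a long job on a large cluster.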

What Happens in the Real World? Source: D. Reed, High-end computing: The challenge of scale, May 2004

Solution Requirements
How about a checkpoint/restart (CP/R) capability? To be useful in HPC systems, it needs the following features:
• “Just works”: allows users to do their research (and not more programming!)
• No recoding or recompiling: allows application developers to focus on their domain (and not system programming)
• Requires transparent, standardized CP/R: restart and/or migrate applications between machines without side effects
• CP/R must be automatic and integrate with existing use of resource managers and schedulers to fully realize its potential

Evergrid Availability Services (AVS)
• Implemented via the dynamically linked library libavs.so
• Uses the LD_PRELOAD environment variable = no recompiling!
• Completely asynchronous and concurrent CP/R
• Incremental checkpointing
• Fixed upper bound on checkpoint file size
• Tunable “page” size
• Application/OS transparent
• Migration capable
• Stateful preemption for both serial and parallel jobs
• Integrates with commercial and open-source queuing systems: LSF, PBS, Torque/Maui, etc.
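The combination of incremental checkpointing, a tunable page size, and a bounded checkpoint size can be illustrated with a toy model. The sketch below is not AVS code and makes no claim about how libavs.so works internally: the class name, the hash-based change detection, and the in-memory store are all assumptions made for illustration (a real implementation would more likely track dirty pages through the memory system than by hashing).

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical default for the tunable "page" size


class IncrementalCheckpointer:
    """Toy page-level incremental checkpointer (illustration only)."""

    def __init__(self, page_size: int = PAGE_SIZE):
        self.page_size = page_size
        self.stored_pages = {}  # page index -> latest saved bytes
        self.page_digests = {}  # page index -> digest of latest saved bytes

    def checkpoint(self, memory: bytes) -> int:
        """Save only pages that changed since the last checkpoint.

        Returns the number of pages written in this increment.
        """
        written = 0
        n_pages = (len(memory) + self.page_size - 1) // self.page_size
        for index in range(n_pages):
            page = memory[index * self.page_size:(index + 1) * self.page_size]
            digest = hashlib.sha256(page).digest()
            if self.page_digests.get(index) != digest:
                # Overwrite in place: each page is stored at most once,
                # so total storage never exceeds the memory image size.
                self.stored_pages[index] = page
                self.page_digests[index] = digest
                written += 1
        # Drop pages that no longer exist if the image shrank.
        for index in [i for i in self.stored_pages if i >= n_pages]:
            del self.stored_pages[index]
            del self.page_digests[index]
        return written

    def restore(self) -> bytes:
        """Rebuild the full memory image from the stored pages."""
        return b"".join(self.stored_pages[i] for i in sorted(self.stored_pages))


if __name__ == "__main__":
    ckpt = IncrementalCheckpointer(page_size=4)
    image = bytearray(b"aaaabbbbccccdddd")
    print("pages written:", ckpt.checkpoint(bytes(image)))  # first checkpoint
    image[5] = ord("X")  # dirty one byte in the second page
    print("pages written:", ckpt.checkpoint(bytes(image)))  # only dirty pages
    assert ckpt.restore() == bytes(image)
```

In this model a smaller page size writes less data per increment when changes are scattered, at the cost of more bookkeeping per page, which is one plausible reason to make the page size tunable.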

Technology: OS Abstraction Layer
[Figure labels: Application, App Lib, OS Abstraction, OS (one stack per server); User Space, AVS, System Space; Server/OS Pool; Interconnect]
OS Abstraction Layer:
• Decouples applications from the operating system
• Transparent fault tolerance for stateful applications
• Pre-emptive scheduling
Key Features:
• Distributed: N nodes running the same/different apps
• Transparent: No modifications to OS or application
• Performance: <5% overhead

What Do We Checkpoint?
AVS virtualizes the following resources used by applications to ensure transparent CP/R:
• Process: process ID, process group ID, thread ID, fork() parent/child, shared memory, semaphores
• Memory: heap, mmap()’d pages, stack, registers, selected shared libs
• Files: open descriptors, STDIO streams, STDIN, STDOUT, file contents (COW), links, directories
• Network: BSD sockets, IP addresses, MVAPICH 0.9.8, OFED, VAPI

Checkpoint Storage Modes
Shared-filesystem checkpointing:
• Best for jobs using fewer (< 16) processors
• Works with NFS, Lustre, GPFS, SAN, …
Local-disk checkpointing:
• More efficient for large distributed computations
• Provides for “mirrored checkpointing”: a backup of the checkpoint in case checkpointing fails or ruins the local copy
• Provides redundancy: the checkpoint is automatically recovered from the mirror if the local disk or machine fails
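The mirrored-checkpointing idea (fall back to the mirror when the local copy is lost or damaged) can be sketched as follows. This is an illustration under stated assumptions, not the AVS mechanism: the function names, the checksum sidecar file, and the use of two plain directories to stand in for the local disk and the mirror are all hypothetical.

```python
import hashlib
import os
import tempfile


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_mirrored(name: str, data: bytes, local_dir: str, mirror_dir: str) -> None:
    """Write the checkpoint and its checksum to both the local and mirror directories."""
    for directory in (local_dir, mirror_dir):
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
        with open(path + ".sha256", "w") as f:
            f.write(_digest(data))


def read_with_fallback(name: str, local_dir: str, mirror_dir: str) -> bytes:
    """Return the first copy whose checksum verifies, preferring the local one."""
    for directory in (local_dir, mirror_dir):
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as f:
                data = f.read()
            with open(path + ".sha256") as f:
                expected = f.read().strip()
        except OSError:
            continue  # this copy is missing (e.g. the disk or machine was lost)
        if _digest(data) == expected:
            return data  # copy is intact
    raise RuntimeError("no valid checkpoint copy found")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as mirror:
        write_mirrored("job.ckpt", b"application state", local, mirror)
        os.remove(os.path.join(local, "job.ckpt"))  # simulate losing the local copy
        assert read_with_fallback("job.ckpt", local, mirror) == b"application state"
        print("recovered checkpoint from mirror")
```

The checksum covers the second case named on the slide: a checkpoint that exists locally but was ruined by a failed write is rejected, and the mirror copy is used instead.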

Local Disk Checkpointing & Mirroring

Interoperability
• Application types: parallel/distributed, serial, shared-memory (testing)
• MPI implementations: stock MPICH, MVAPICH (customized), OpenMPI (underway)
• Interconnect fabrics: InfiniBand, Ethernet (p4, “GigE”), 10GigE; potential Myrinet support via OpenMPI
• Operating systems: RHEL 4, 5 (+ CentOS, Fedora); SLES 9, 10
• Architecture: 64-bit Linux (x86_64)
Supported platforms, apps, etc. are customer-driven

Tested Codes
QA-certified codes and compilers, with many more in the pipeline
• Benchmarks: Linpack, NAS, STREAM, IOzone, TI-06 (DoD) apps
• Compilers: Pathscale, Intel Fortran, Portland Group, GNU Compilers
• Commercial codes: LS-DYNA, StarCD (CFD); Cadence and other EDA apps underway
• Academic codes: LAMMPS, Amber, VASP, MPIBlast, ClustalW-MPI, ARPS, WRF, HYCOM

Runtime & Checkpoint Overhead
Virtualization and checkpoint overheads are negligible (< 5%) with most workloads

Memory Overhead On a per node basis, the RAM overhead is constant:

Preemptive Scheduling
[Figure labels: High Priority Queue, Running Jobs, Checkpoints, Low Priority Queue]
Increases server utilization & job throughput by 10-50%, depending on the priority mix
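The benefit of stateful preemption over kill-and-restart can be seen in a toy single-resource model. The sketch below is an illustration only: the function name and the example schedule are hypothetical, and the model ignores checkpoint and restart overhead as well as queueing effects, so it does not reproduce the 10-50% figure above.

```python
def low_priority_finish(work_hours, preemptions, stateful):
    """Return (finish_time, wasted_hours) for one low-priority job.

    preemptions: list of (start_time, duration) for high-priority jobs,
                 sorted by start time and non-overlapping.
    stateful:    True  -> progress is checkpointed and later resumed;
                 False -> the job is killed and restarts from the beginning.
    """
    clock = 0.0   # wall-clock time at which the low-priority job may run
    done = 0.0    # useful progress of the current attempt, in hours
    wasted = 0.0  # compute hours thrown away by restarts
    for start, duration in preemptions:
        available = start - clock
        if done + available >= work_hours:
            break  # the job finishes before this preemption arrives
        done += available
        if not stateful:
            wasted += done  # all progress so far is lost
            done = 0.0
        clock = start + duration  # resource is free again after the hipri job
    return clock + (work_hours - done), wasted


if __name__ == "__main__":
    # Hypothetical schedule: a 24-hour job preempted twice for 5 hours each.
    schedule = [(10, 5), (20, 5)]
    for stateful in (True, False):
        finish, wasted = low_priority_finish(24, schedule, stateful)
        mode = "stateful preemption" if stateful else "kill and restart"
        print(f"{mode}: finishes at t={finish} h, wasted compute={wasted} h")
```

In this model the wasted hours under kill-and-restart are compute time the cluster spent on work that had to be redone, which is the utilization loss that stateful preemption avoids.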

Integration with LSF: Topdawg @ OU
Topdawg cluster at OSCER:
• 512 dual-core Xeon 3.20 GHz, 2MB cache, 4GB RAM
• RHEL 4.3, kernel 2.6.9-55.EL_lustre-1.4.11smp
• Using Platform LSF 6.1 for resource manager and scheduler
Objective: set up two queues with preemption (“lowpri” and “hipri”)
• lowpri: long/unlimited run time, but preemptable by hipri
• hipri: time-limited, but can preempt lowpri jobs
• e.g., a long-running (24-hour) low-priority clustalw-mpi job can be preempted by 4-6 hour ARPS and WRF jobs

Integration with LSF: Topdawg @ OU
Checkpointing and preemption under LSF:
• Uses echkpnt and erestart for checkpointing/preempting and restarting
• Allows use of custom methods “echkpnt.xxx” and “erestart.xxx”
• Checkpoint method defined as an environment variable, or in lsf.conf
• Checkpoint directory, interval, and preemption defined in lsb.queues
Evergrid integration of AVS into LSF:
• Introduces methods echkpnt.evergrid and erestart.evergrid to handle start, checkpointing, and restart under AVS
• Uses Topdawg variables MPI_INTERCONNECT, MPI_COMPILER to determine parallel vs. serial, IB vs. p4, run-time compiler libs
• User sources only one standard Evergrid script from within the bsub script!

Integration with LSF: Topdawg @ OU
In the environment (/etc/bashrc, /etc/csh.cshrc):
    export EVERGRID_BASEDIR=/opt/evergrid
Before starting a job:
    export MPI_COMPILER=gcc
    export MPI_INTERCONNECT=infiniband
At the top of your bsub script:
    ## Load the Evergrid and LSF integration:
    source $EVERGRID_BASEDIR/bsub/env/evergrid-avs-lsf.src
Submitting a long-term preemptable job:
    bsub -q lowpri < clustalw-job.bsub
Submitting a high-priority job:
    bsub -q hipri < arps-job.bsub

What’s Underway
• Working with OpenMPI
  - Support for Myrinet
• Growing list of supported applications
  - EDA, simulation, …
• Configure LSF for completely transparent integration

Conclusions
Evergrid’s Availability Services provides:
• Transparent, scalable checkpointing for HPC applications
• Compute time overhead of < 5% for most applications
• Bounded, nominal memory overhead
• Eliminates impacts of hardware faults
• Ensures jobs run to completion
• Seamless integration into resource managers and schedulers for preemptive scheduling, maintenance, and job recovery

Reference
Ruscio, J.F., Heffner, M.A., and Varadarajan, S., IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, 26-30 March 2007, pp. 1-10.

About Evergrid
• Vision: Build a vertically integrated management system that makes multi-datacenter scale-out infrastructures behave as a single managed entity
• HPC: Cluster Availability Management Suite (CAMS): Availability Services (AVS), Resource Manager (RSM)
• Enterprise: DataCenter Management Suite: AVS, RSM Enterprise, Live Migration, Load Manager, Provisioning
• Founded: Feb. 2004 by Dr. Srinidhi Varadarajan (VA Tech, “SystemX”), B. J. Arun
• Team: 50+ employees; R&D in Blacksburg, VA and Pune, India; HQ in Fremont, CA
• Patents: 1 patent pending, 6 patents filed, 2 in process

Acknowledgements
• NSF (funding for original research)
• OSCER (Henry Neeman, Brett Zimmerman, David Akin, Jim White)

Finding Out More
To find out more about Evergrid Software, contact: Keith Ball, keith.ball@evergrid.com
Sales: Natalie Van Unen, 617-784-8445, natalie.vanunen@evergrid.com
Partnering opportunities: Mitchell Ratner, 510-668-0500 ext. 5058, mitchell.ratner@evergrid.com
http://www.evergrid.com
Note: Evergrid will be at booth 2715 at the SC07 conference in Reno, Nevada, Nov 13-15. Come by for a demo and presentation on other products.
