
MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, Th. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov. Cluster & GRID group, LRI, University of Paris South. fci@lri.fr, www.lri.fr/~fci



  1. MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes
     G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, Th. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov
     Cluster & GRID group, LRI, University of Paris South. fci@lri.fr, www.lri.fr/~fci
     SC 2002

  2. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  3. Large scale parallel and distributed systems and node volatility
     • Industry and academia are building larger and larger computing facilities for technical computing (research and production).
     • Platforms with 1000s of nodes are becoming common: tera-scale machines (US ASCI, French Tera), large scale clusters (Score III, etc.), Grids, PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC).
     • These large scale systems experience frequent failures/disconnections:
       - the full-system MTBF of ASCI-Q is estimated analytically at a few hours (Petrini, LANL: "Scaling to Thousands of Processors with Buffered Coscheduling", Scaling to New Heights Workshop, Pittsburgh, May 2002); a 5-hour job on 4096 processors has less than a 50% chance of terminating;
       - PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).
     • When failures/disconnections cannot be avoided, they become a characteristic of the system, called volatility.
     • Many HPC applications use the message passing paradigm.
     • We need a volatility tolerant message passing environment.

  4. Related work
     Fault tolerant message passing: a long history of research! Three main parameters distinguish the proposed fault tolerance techniques.
     • Transparency: application checkpointing, MP API + fault management, or automatic.
       - application checkpointing: the application stores intermediate results and restarts from them;
       - MP API + fault management: the message passing API returns errors to be handled by the programmer;
       - automatic: the runtime detects faults and handles recovery.
     • Checkpoint coordination: none, coordinated, or uncoordinated.
       - coordinated: all processes are synchronized and the network is flushed before the checkpoint; all processes roll back from the same snapshot;
       - uncoordinated: each process checkpoints independently of the others, and each process is restarted independently of the others.
     • Message logging: none, pessimistic, optimistic, or causal.
       - pessimistic: all messages are logged on reliable media and used for replay;
       - optimistic: all messages are logged on non-reliable media; if one node fails, replay is done according to the other nodes' logs; if more than one node fails, all roll back to the last coherent checkpoint;
       - causal: optimistic + an antecedence graph, which reduces the recovery time.

  5. Related work
     A classification of fault tolerant message passing environments, considering (A) the level in the software stack where fault tolerance is managed (framework, API, communication library) and (B) the fault tolerance technique (automatic vs. non automatic; checkpoint based vs. log based with optimistic, pessimistic or causal logging):
     • Cocheck: independent of MPI [Ste96]
     • Starfish: enrichment of MPI [AF99]
     • Clip: semi-transparent checkpoint [CLP97]
     • Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
     • Manetho: causal logging + coordinated checkpoint, n faults [EZ92]
     • Egida [RAV99]
     • Sender based message logging: 1 fault, sender based [JZ87]
     • Pruitt 98: 2 faults, sender based [PRU98]
     • MPI-FT: n faults, centralized server [LNLE00]
     • FT-MPI: modification of MPI routines, user fault treatment [FD00]
     • MPI/FT: redundancy of tasks [BNC01]
     • MPICH-V: n faults, distributed logging
     Until now, no automatic/transparent, n-fault tolerant, scalable message passing environment.

  6. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  7. Objectives and constraints
     Programmer's view unchanged: one PC client calls MPI_Send(), another calls MPI_Recv(), exactly as with any MPI (see the sketch below).
     Goal: execute existing or new MPI applications.
     Problems:
     1) volatile nodes (any number, at any time);
     2) firewalls (PC Grids);
     3) non-named receptions (they should be replayed in the same order as in the previous, failed execution).
     Objective summary:
     1) automatic fault tolerance;
     2) transparency for the programmer and user;
     3) tolerate n faults (n being the number of MPI processes);
     4) firewall bypass (tunnel) for cross-domain execution;
     5) scalable infrastructure/protocols;
     6) no global synchronization (checkpoint/restart);
     7) theoretical verification of the protocols.
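     To make the unchanged programmer's view concrete, here is a minimal, generic MPI program in C (not code from the MPICH-V distribution); under MPICH-V it would only need to be re-linked against libmpichv instead of libmpich, with no source change. Run it with at least two processes.

     #include <mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         int rank, value = 0;
         MPI_Status status;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         if (rank == 0) {
             value = 42;
             MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* ordinary MPI_Send */
         } else if (rank == 1) {
             MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* ordinary MPI_Recv */
             printf("rank 1 received %d\n", value);
         }

         MPI_Finalize();
         return 0;
     }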

  8. Uncoordinated checkpoint/restart
     Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes. This requires a global synchronization and a network flush (failure detection, global stop), and on a failure all nodes roll back to the synchronized checkpoint: not scalable.
     Uncoordinated checkpoint:
     • no global synchronization (scalable);
     • nodes may checkpoint at any time (independently of the others);
     • requires logging non-deterministic events: in-transit messages.

  9. Pessimistic message logging on Channel Memories
     Distributed pessimistic remote logging:
     • a set of reliable nodes called "Channel Memories" (CM) logs every message;
     • all communications are implemented by one PUT and one GET operation to the CM;
     • PUT and GET operations are transactions;
     • when a process restarts, it replays all its communications using the Channel Memory;
     • the CM stores and delivers messages in FIFO order, ensuring a consistent state for each receiver;
     • the CM also works as a tunnel for firewall-protected nodes (PC Grids).
     A toy sketch of the PUT/GET pattern follows below.
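     A toy, single-process C illustration of the PUT/GET pattern above: each receiver has a home Channel Memory modeled as a FIFO queue, a send is one PUT to the receiver's home CM, and a receive is one GET from the process's own home CM. The data structures and function names are invented for this sketch; they are not the ch_cm implementation.

     #include <stdio.h>
     #include <string.h>

     #define MAX_MSGS 16
     typedef struct { int src; char payload[64]; } msg_t;
     typedef struct { msg_t q[MAX_MSGS]; int head, tail; } channel_memory_t;

     static channel_memory_t cm[2];               /* one home CM per receiver */

     /* PUT transaction: log the message in the destination's home CM. */
     static void cm_put(int dst, int src, const char *data)
     {
         msg_t *m = &cm[dst].q[cm[dst].tail++ % MAX_MSGS];
         m->src = src;
         strncpy(m->payload, data, sizeof m->payload - 1);
     }

     /* GET transaction: deliver the next logged message in FIFO order. */
     static int cm_get(int me, msg_t *out)
     {
         if (cm[me].head == cm[me].tail) return -1;   /* nothing logged yet */
         *out = cm[me].q[cm[me].head++ % MAX_MSGS];
         return 0;
     }

     int main(void)
     {
         msg_t m;
         cm_put(1, 0, "hello through the CM");  /* node 0 sends to node 1 */
         if (cm_get(1, &m) == 0)                /* node 1 gets from its home CM */
             printf("node 1 got \"%s\" from node %d\n", m.payload, m.src);
         return 0;
     }

     Because every message is logged in the CM before delivery, a restarted receiver can simply replay the same sequence of GETs and observe the same message order as the failed execution.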

  10. Putting all together: sketch of an execution with a crash
     On a crash, the failed process rolls back to its latest checkpoint and replays its communications from the Channel Memory. The worst condition combines an in-transit message with a checkpoint in progress.
     [Figure: pseudo time scale showing the processes, the message logs in the Channel Memories and the checkpoint images stored on the checkpoint server.]

  11. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  12. Global architecture
     MPICH-V consists of:
     • a communication library: an MPICH device built on Channel Memories;
     • a run-time: executes/manages the instances of MPI processes on nodes.
     It only requires re-linking the application with libmpichv instead of libmpich.
     [Figure: Dispatcher, Channel Memories, Checkpoint Servers and (possibly firewalled) computing nodes connected through the network.]

  13. Dispatcher (stable)
     • Initializes the execution: distributes the roles (CM, CS and computing nodes) to participant nodes, launches the appropriate jobs and checks readiness.
     • Launches the instances of MPI processes on the nodes.
     • Monitors the node state (alive signal, or time-out).
     • Reschedules the MPI process instances of dead nodes on available nodes (a monitoring sketch follows below).
     [Figure: the Dispatcher distributes roles, collects alive signals from the nodes and launches a new MPI process instance after a failure.]
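     A toy C sketch of the time-out based monitoring just described: each MPI process instance refreshes an alive timestamp, and a task whose timestamp is too old is rescheduled on another available node and restarted from its last checkpoint. Names, structures and the time-out value are illustrative assumptions, not the MPICH-V dispatcher code.

     #include <stdio.h>
     #include <time.h>

     #define NTASKS  4
     #define TIMEOUT 30          /* seconds without an alive signal => node presumed dead */

     typedef struct { int node; time_t last_alive; } task_t;

     static int pick_available_node(void) { static int next = 100; return next++; }

     static void monitor_step(task_t tasks[], int n)
     {
         time_t now = time(NULL);
         for (int i = 0; i < n; i++) {
             if (now - tasks[i].last_alive > TIMEOUT) {
                 int new_node = pick_available_node();
                 printf("task %d: node %d timed out, restarting on node %d "
                        "from its last checkpoint\n", i, tasks[i].node, new_node);
                 tasks[i].node = new_node;
                 tasks[i].last_alive = now;      /* the restarted instance starts fresh */
             }
         }
     }

     int main(void)
     {
         task_t tasks[NTASKS];
         for (int i = 0; i < NTASKS; i++) {
             tasks[i].node = i;
             /* pretend task 2 has not sent an alive signal for a long time */
             tasks[i].last_alive = time(NULL) - (i == 2 ? 2 * TIMEOUT : 0);
         }
         monitor_step(tasks, NTASKS);
         return 0;
     }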

  14. Channel Memory (stable)
     • Out-of-core message storage on disc, with garbage collection: messages older than the current checkpoint image of each node are removed.
     • In-memory FIFO message queues, ensuring a total order on each receiver's messages.
     • Multithreaded server: polls, treats an event and releases the other threads; incoming messages are PUT transactions (+ control), outgoing messages are GET transactions (+ control).
     • Open sockets: one per attached node, one per home checkpoint server of an attached node, and one for the dispatcher.

  15. Mapping Channel Memories to nodes
     Using several CMs raises coordination constraints:
     1) force a total order on the messages of each receiver;
     2) avoid coordination messages among CMs.
     Our solution (sketched below):
     • each node is "attached" to exactly one "home" CM;
     • a node receives messages from its home CM;
     • a node sends messages to the home CM of the destination node.
     [Figure: nodes 0, 1, 2, ... each attached to a home Channel Memory.]
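     A minimal C sketch of the home-CM rule: a node always GETs from its own home CM and PUTs to the home CM of the destination rank. The modulo assignment below is an assumption made for illustration; how nodes are actually attached to their home CM at deployment time is not shown.

     #include <stdio.h>

     #define NUM_CM 3

     /* Illustrative mapping only: rank -> home Channel Memory. */
     static int home_cm_of(int rank) { return rank % NUM_CM; }

     int main(void)
     {
         int src = 0, dst = 2;
         printf("rank %d receives from CM %d and sends to rank %d via CM %d\n",
                src, home_cm_of(src), dst, home_cm_of(dst));
         return 0;
     }

     With this rule, all messages destined to a given receiver funnel through a single CM, which is what lets the CM impose a total order on that receiver's messages without any CM-to-CM coordination.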

  16. Checkpoint Server (stable)
     • Checkpoint images are stored on reliable media: one file per node, named by the node.
     • Multiprocess server: polls, treats an event and dispatches the job to other processes; incoming messages are checkpoint PUT transactions, outgoing messages are checkpoint GET transactions (+ control).
     • Open sockets: one per attached node and one per home CM of the attached nodes.

  17. Node (volatile): checkpointing
     • User-level checkpoint: Condor Stand Alone Checkpointing (CSAC).
     • Clone checkpointing + non-blocking checkpoint: on a checkpoint order, the process (1) forks; the child (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit(), while the parent resumes execution; on restart, CSAC resumes just after (4), reopens the sockets and returns.
     • The checkpoint image is sent to the CS on the fly (not stored locally).
     • The checkpoint order is triggered locally (not by a dispatcher signal).
     A fork-based sketch of this clone-checkpoint idea follows below.
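     A C sketch of the clone-checkpoint idea: on a locally triggered checkpoint order the process forks, the parent resumes the computation immediately, and the child produces the checkpoint image. The real node uses CSAC's ckpt_and_exit() and streams the image to the checkpoint server; send_image_to_checkpoint_server() below is a hypothetical placeholder.

     #include <stdio.h>
     #include <sys/types.h>
     #include <sys/wait.h>
     #include <unistd.h>

     static void send_image_to_checkpoint_server(void)
     {
         /* Placeholder: in MPICH-V the child terminates ongoing communications,
            closes its sockets and calls CSAC's ckpt_and_exit(); the image is
            sent to the checkpoint server on the fly, not stored locally. */
         printf("child %d: streaming checkpoint image to the CS\n", (int)getpid());
     }

     static void on_checkpoint_order(void)
     {
         pid_t pid = fork();
         if (pid == 0) {                       /* child: do the checkpoint */
             send_image_to_checkpoint_server();
             _exit(0);
         }
         /* parent: resumes the MPI computation without blocking */
         printf("parent %d: computation continues while child %d checkpoints\n",
                (int)getpid(), (int)pid);
     }

     int main(void)
     {
         on_checkpoint_order();
         /* ... the parent would keep computing here ... */
         wait(NULL);   /* toy program only: reap the checkpointing child before exiting */
         return 0;
     }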

  18. Library: based on MPICH
     • A new device: the 'ch_cm' device.
     • All ch_cm device functions are blocking communication functions built over the TCP layer.
     • Layering (sketched below): the MPI binding (e.g. MPI_Send) goes through the abstract device interface (MPID_SendControl, MPID_SendChannel) down to the channel interface of the CM device:
       - _cmbsend: blocking send;
       - _cmbrecv: blocking receive;
       - _cmprobe: check for any available message;
       - _cmfrom: get the source of the last message;
       - _cmInit: initialize the client;
       - _cmFinalize: finalize the client.
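     A stub-only C sketch of this layering, reusing the function names from the slide: the MPI binding goes through the abstract device interface down to the blocking ch_cm channel send. The bodies are placeholders (printouts instead of real TCP transactions to the CM).

     #include <stdio.h>

     /* ch_cm channel interface: blocking primitive over TCP (stub). */
     static int _cmbsend(int dst, const void *buf, int len)
     {
         (void)buf;
         printf("_cmbsend: PUT %d bytes to the home CM of rank %d\n", len, dst);
         return 0;
     }

     /* ADI layer: control and data both go through the blocking channel send. */
     static int MPID_SendControl(int dst, int tag) { return _cmbsend(dst, &tag, (int)sizeof tag); }
     static int MPID_SendChannel(int dst, const void *buf, int len) { return _cmbsend(dst, buf, len); }

     /* MPI binding (drastically simplified). */
     static int my_MPI_Send(const void *buf, int len, int dst, int tag)
     {
         MPID_SendControl(dst, tag);
         return MPID_SendChannel(dst, buf, len);
     }

     int main(void)
     {
         double x = 3.14;
         return my_MPI_Send(&x, (int)sizeof x, 1, 0);
     }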

  19. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  20. Experimental platform
     • Icluster-Imag: 216 PIII 733 MHz nodes, 256 MB per node.
     • 5 subsystems of 32 to 48 nodes on 100BaseT switches (~4.8 Gb/s aggregate per subsystem), with a ~1 Gb/s switch mesh between subsystems.
     • Linux, PGI Fortran or GCC compiler.
     • Very close to a typical building LAN.
     • Node volatility is simulated; XtremWeb is used as the software environment (launching MPICH-V).
     • NAS BT benchmark: a complex application (high comm/comp ratio).

  21. Basic performance
     RTT ping-pong: 2 nodes, 2 Channel Memories, blocking communications; mean over 100 measurements.
     [Figure: RTT (seconds) vs. message size (0 to 384 kB) for P4 (10.5 MB/s) and the ch_cm device with 1 CM, in-core and out-of-core (best 5.6 MB/s), roughly a factor of 2 apart.]
     • Performance degradation of a factor of 2 compared to P4, but MPICH-V tolerates an arbitrary number of faults.
     • This is reasonable since every message crosses the network twice (store and forward through the CM).
     A minimal ping-pong measurement sketch follows below.
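     A generic MPI ping-pong sketch of this kind of RTT measurement (illustrative, not the benchmark code actually used): rank 0 sends a buffer to rank 1 and waits for it to come back, and the round-trip time is averaged over 100 iterations. Run with at least two processes.

     #include <mpi.h>
     #include <stdio.h>
     #include <stdlib.h>

     int main(int argc, char **argv)
     {
         int rank, i, iters = 100, size = 64 * 1024;
         char *buf = malloc(size);
         MPI_Status st;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         double t0 = MPI_Wtime();
         for (i = 0; i < iters; i++) {
             if (rank == 0) {
                 MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                 MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
             } else if (rank == 1) {
                 MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                 MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
             }
         }
         if (rank == 0)
             printf("mean RTT for %d bytes: %g s\n", size, (MPI_Wtime() - t0) / iters);

         MPI_Finalize();
         free(buf);
         return 0;
     }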

  22. Impact of sharing a Channel Memory
     Individual communication time according to the number of nodes attached to 1 CM (simultaneous communications): asynchronous token ring (#tokens = #nodes), mean over 100 executions. The tokens rotate simultaneously around the ring, so there are always #nodes communications at the same time.
     [Figure: communication time (seconds) vs. token size (0 to 384 kB) for 1, 2, 4, 8 and 12 nodes sharing one CM.]
     • The CM response time (as seen by a node) increases linearly with the number of nodes.
     • The standard deviation is below 3% across nodes: fair distribution of the CM resource.
     A sketch of the token-ring measurement follows below.
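     A sketch of an asynchronous token ring of this kind (illustrative, not the actual benchmark code): every rank owns one token and all tokens rotate around the ring at the same time, so there are always as many messages in flight as there are nodes. MPI_Sendrecv_replace keeps the sketch deadlock-free.

     #include <mpi.h>
     #include <stdio.h>
     #include <stdlib.h>

     int main(int argc, char **argv)
     {
         int rank, size, step, toksize = 64 * 1024;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         char *token = malloc(toksize);
         int next = (rank + 1) % size, prev = (rank - 1 + size) % size;

         double t0 = MPI_Wtime();
         for (step = 0; step < size; step++)      /* one full turn of the ring */
             MPI_Sendrecv_replace(token, toksize, MPI_BYTE, next, 0,
                                  prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         if (rank == 0)
             printf("mean per-hop time with %d nodes: %g s\n",
                    size, (MPI_Wtime() - t0) / size);

         free(token);
         MPI_Finalize();
         return 0;
     }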

  23. Impact of the number of threads in the Channel Memory
     Individual communication time according to the number of nodes attached to 1 CM and the number of threads in the CM: asynchronous token ring (#tokens = #nodes), mean over 100 executions.
     [Figure: communication time (seconds) vs. token size (bytes) for different numbers of CM threads.]
     • Increasing the number of threads reduces the CM response time, whatever the number of nodes using the same CM.

  24. Impact of remote checkpoint on node performance
     Time between the reception of a checkpoint signal and the actual restart: fork, checkpoint, compress, transfer to the CS, way back, decompress, restart.
     [Figure: checkpoint RTT (seconds), local disc vs. remote over 100BaseT Ethernet, for the checkpointed benchmarks bt.A.1 (201 MB), bt.A.4 (43 MB), bt.B.4 (21 MB) and bt.w.4 (2 MB); remote overheads range from +2% to +28%.]
     • The cost of a remote checkpoint is close to that of a local checkpoint (it can be as low as 2%), because compression and transfer are overlapped.

  25. Stressing the checkpoint server: checkpoint RTT for simultaneous checkpoints
     RTT experienced by every node for simultaneous checkpoints (checkpoint signals are synchronized), according to the number of checkpointing nodes, for BT.A.1 on a single CS.
     [Figure: checkpoint RTT (seconds) vs. number of simultaneous checkpoints on a single CS.]
     • The RTT increases almost linearly with the number of nodes once network saturation is reached (going from 1 to 2 nodes).

  26. Impact of checkpointing on application performance
     Performance reduction for NAS BT.A.4 according to the number of consecutive checkpoints; a single checkpoint server for 4 MPI tasks (P4 driver); checkpoints are performed at random times on each node (no synchronization).
     [Figure: relative performance (%) vs. number of checkpoints during BT.A.4, for blocking and non-blocking checkpointing, on uni-processor and dual-processor nodes.]
     • When 4 checkpoints are performed per process, performance is about 94% of that of a non-checkpointed execution.
     • Several nodes can use the same CS.

  27. Performance of re-execution
     Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes.
     [Figure: re-execution time (seconds) vs. token size (0 to 256 kB) for 0 to 8 restarted nodes.]
     • Re-execution is faster than the original execution because the messages are already stored in the CM (logged by the previous execution).
     • The system can survive the crash of all MPI processes.

  28. Global operation performance
     MPI all-to-all for 9 nodes (1 CM).
     [Figure: all-to-all completion times; a factor of about 3 (x3) separates the compared configurations.]

  29. Putting all together: performance scalability
     Performance of MPI-PovRay (a parallelized version of the PovRay ray tracer):
     • 1 CM for 8 MPI processes;
     • rendering a complex 450x350 scene;
     • the comm/comp ratio is about 10% for 16 MPI processes.
     [Figure: execution time vs. number of processors.]
     • MPICH-V provides performance similar to P4, plus fault tolerance, at the cost of 1 CM for every 8 nodes.

  30. Putting all together: performance with volatile nodes
     Performance of BT.A.9 with frequent faults (about 1 fault every 110 seconds):
     • 3 CMs, 2 CSs (4 nodes on one CS, 5 on the other);
     • 1 checkpoint every 130 seconds on each node (not synchronized).
     [Figure: total execution time (seconds) vs. number of faults during the execution (0 to 10); the base execution without checkpoints and faults takes about 610 s.]
     • The overhead of checkpointing is about 23%.
     • With 10 faults, performance is 68% of that without faults.
     • MPICH-V allows the application to survive node volatility (1 fault every 2 minutes).
     • The performance degradation with frequent faults stays reasonable.

  31. Putting all together: MPICH-V vs. MPICH-P4 on NAS BT
     • 1 CM per MPI process, 1 CS for 4 MPI processes;
     • 1 checkpoint every 120 seconds on each node.
     [Figure: execution times for MPICH-P4 and for MPICH-V with CM but no logs, CM with logs, and CM+CS+checkpoint.]
     • MPICH-V as a whole compares favorably to MPICH-P4 for all configurations on this platform for BT class A.
     • The differences in communication times are due to the way asynchronous communications are handled by each environment.

  32. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  33. Future work
     • Channel Memories reduce the communication performance:
       - change the packet transit from store-and-forward to wormhole;
       - or remove the CMs (for clusters): log messages on the nodes and store the communication causality vectors separately on the CSs.
     • Remove the need for stable resources: add redundancy to the Channel Memories, Checkpoint Servers and Dispatcher.

  34. Outline • Introduction • Motivations & Objectives • Architecture • Performance • Future work • Concluding remarks

  35. Concluding remarks
     MPICH-V is:
     • a full-fledged fault tolerant MPI environment (library + runtime);
     • based on uncoordinated checkpointing + distributed pessimistic message logging;
     • built from Channel Memories, Checkpoint Servers, a Dispatcher and computing nodes.
     Main results:
     • raw communication performance (RTT) is about half that of MPICH-P4;
     • scalability is as good as that of P4 (128 nodes) for MPI-PovRay;
     • MPICH-V allows the application to survive node volatility (1 fault every 2 minutes);
     • when frequent faults occur, the performance degradation stays reasonable;
     • NAS BT performance is comparable to MPICH-P4 (up to 25 nodes).
     www.lri.fr/~fci/Group
