
A Multi-Protocols Fault Tolerant MPI



  1. A Multi-Protocols Fault Tolerant MPI. Franck Cappello, Grand Large, PCRI/INRIA, LRI, University of Paris South. fci@lri.fr, www.lri.fr/~gk/MPICH-V. Current main contributors: Aurélien Bouteiller, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier and Vincent Néri, Franck Cappello. Other contributors: Hinde Bouziane, Boris Quinson, George Bosilca. CCGSC 2004.

  2. Outline: Introduction; A generic fault tolerance framework; 4 protocols: MPICH-V1, V2, Vcausal, VCL; Comparison; What we have learned; Ongoing work.

  3. MPICH-V Objectives. Main goals: I) study, design and implement existing and new fault tolerant MPI protocols; II) compare them fairly in the contexts of clusters and grids. Fault tolerance context: a) fault model: machine crash (according to a fault detector); b) distributed execution model: piecewise deterministic (each process execution is modeled as a sequence of state intervals bounded by message reception events). MPICH-V objectives: 1) automatic fault tolerance; 2) transparency for the programmer and user; 3) tolerate n faults (n being the number of MPI processes); 4) scalable infrastructure and protocols; 5) theoretical verification of the protocols.
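The piecewise deterministic assumption is what makes log-based recovery possible: between two receptions a process evolves deterministically, so replaying the same ordered sequence of receptions reproduces the same state. A minimal sketch of the idea in C follows; the identifiers (state_interval, logged_recv) are illustrative assumptions, not MPICH-V code.

```c
/* Sketch: state intervals under the piecewise deterministic model.
 * Identifier names are illustrative, not taken from MPICH-V. */
#include <mpi.h>
#include <stdint.h>

static uint64_t state_interval = 0;   /* current state interval of this process */

/* Each delivery closes the current state interval and opens the next one.
 * Logging which message was delivered, and in which interval, is enough to
 * replay the same delivery order, hence the same intervals, after a crash. */
void logged_recv(void *buf, int count, MPI_Datatype type, int src, int tag,
                 MPI_Comm comm, MPI_Status *status)
{
    MPI_Recv(buf, count, type, src, tag, comm, status);
    state_interval++;   /* a reception event: boundary between two intervals */
}
```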

  4. Fault tolerance in message passing distributed systems. Fault tolerant message passing has a long history of research. The proposed fault tolerance techniques mainly differ along the following parameters:
  - Transparency: application checkpointing, MP API + fault management, or automatic. Application checkpointing: applications store intermediate results and can restart from them. MP API + FM: the message passing API returns errors that are handled by the programmer (FT-MPI). Automatic: the runtime detects faults and handles recovery.
  - Checkpoint coordination: coordinated or uncoordinated. Coordinated: all processes are synchronized and compute a snapshot; all processes roll back from the same snapshot. Uncoordinated: each process checkpoints independently of the others; each process is restarted independently of the others.
  - Message logging: none, pessimistic, optimistic, or causal. Pessimistic: all messages are logged on reliable media and used for replay. Optimistic: all messages are logged on non-reliable media; if one node fails, replay is done according to the other nodes' logs; if more than one node fails, rolling back to the last coherent checkpoint may lead to the domino effect. Causal: optimistic logging plus an antecedence graph, which reduces the recovery time.

  5. Related work: a classification of fault tolerant message passing environments. Automatic approaches split into checkpoint-based and message-logging families; controlled (non-automatic) approaches expose fault handling at the API level.
  - Coordinated checkpoint: Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; Clip, semi-transparent checkpoint [CLP97]; LAM, Parakeet, Meiosys (OS level).
  - Communication-induced checkpointing (CIC): [FTSC99], strong drawbacks.
  - Uncoordinated checkpoint + optimistic log (sender based): optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; SB, 1 fault, sender based [JZ87]; Pruitt 98, 2 faults, sender based [PRU98].
  - Uncoordinated checkpoint + causal log: Manetho, n faults [EZ92]; Egida, framework [RAV99].
  - Uncoordinated checkpoint + pessimistic log: MPI-FT, n faults, centralized server [LNLE00]; MPICH-V (communication library level).
  - Controlled, API level: FT-MPI, modification of MPI routines, user fault treatment [FD00]; MPI/FT, redundancy of tasks [BNC01].
  Conclusions when the project started: 1) no automatic, transparent, n-fault tolerant, scalable message passing environment existed; 2) there was no fair comparison between protocols on well accepted benchmarks.

  6. Outline: Introduction; A generic fault tolerance framework; 4 protocols: MPICH-V1, V2, Vcausal, VCL; Comparison; What we have learned after 3 years; Ongoing work.

  7. MPICH-V components. Several stable components (Dispatcher, Fault detector, Checkpoint Scheduler, Event Loggers, Checkpoint servers, Channel Memories) and 2 components on every volatile computing node (checkpoint library + daemon implementing the FT protocol on top of the network). An MPICH-V protocol uses a subset of these components.

  8. Computing node. Each node runs the MPI process, linked with the checkpoint library, and a V daemon handling Send and Receive. For every delivered message the daemon produces a reception event (sender ID, sender logical clock, receiver logical clock) for the Event Logger, and keeps a copy of the sent payload (message ID + payload) in memory or on disk. Checkpoints use a clone of the process (copy on write), so checkpointing is non blocking; the checkpoint image is stored remotely on the Checkpoint Server and/or locally, depending on the protocol.
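The slide lists the three fields of a reception event. The following C sketch shows what such a record, and the payload copy kept for replay, could look like; the type and field names are assumptions for illustration, not the actual MPICH-V structures.

```c
/* Illustrative sketch of the two pieces of information kept per message;
 * names and layout are assumptions, not MPICH-V code. */
#include <stddef.h>
#include <stdint.h>

/* Reception event, shipped to the Event Logger: enough to replay delivery order. */
typedef struct {
    int      sender_rank;    /* sender ID */
    uint64_t sender_clock;   /* sender logical clock (its send count) */
    uint64_t receiver_clock; /* receiver logical clock (its delivery count) */
} reception_event_t;

/* Message copy kept by the sender or by a Channel Memory, depending on the
 * protocol, so the payload can be re-sent during replay after a crash. */
typedef struct {
    reception_event_t id;    /* which delivery this payload corresponds to */
    size_t            len;   /* payload length in bytes */
    char             *payload;
} logged_message_t;
```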

  9. MPICH-V as seen by users. An MPICH device, "ch_v", replacing the P4 one, plus a modification of the runtime: it executes and manages the instances of MPI processes on the nodes and takes a description of the fault tolerant environment (where to run the specific fault tolerant components such as the Dispatcher and the Fault detector).

  10. Outline: Introduction; A generic fault tolerance framework; 4 protocols: MPICH-V1, V2, Vcausal, VCL; Comparison; What we have learned; Ongoing work.

  11. Experimental platform. 32-node cluster with a 100 Mbit/s Ethernet network (other experiments on Myrinet and SCI are reported in the papers). Each node: Athlon XP 2800+ (2 GHz) CPU, 1 GB memory (DDR SDRAM), 70 GB IDE ATA 100 hard drive, 100 Mbit/s Ethernet NIC; all nodes connected by a 48-port Ethernet switch. Linux 2.4.20, MPICH 1.2.5; benchmarks compiled with GCC -O3 and PGF77. Tests run in dedicated mode; each measurement is repeated 5 times (only the mean is presented). Microbenchmarks use the NetPIPE utility. A single checkpoint server is used for all experiments.

  12. MPICH-V1 protocol [SC2002]. Pessimistic remote message logging with uncoordinated checkpointing: communications between the volatile nodes (0, 1, 2, ...) transit through stable Channel Memories (CM), which log them; nodes checkpoint independently to the checkpoint servers and, on restart, replay their messages from the Channel Memories.

  13. MPICH-V1 basic performance. RTT ping-pong between 2 nodes with 2 Channel Memories, blocking communications; the figure plots round-trip time (seconds, mean over 100 measurements) against message size from 0 to 384 kB for P4 and for ch_cm with 1 CM (in-core, out-of-core, and out-of-core best). P4 reaches about 10.5 MB/s while ch_cm reaches about 5.6 MB/s (roughly 2x slower); V1 latency is 154 us (2x P4). Performance degradation of a factor 2 compared to P4, but MPICH-V tolerates an arbitrary number of faults.
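The latency and bandwidth numbers above come from a NetPIPE-style blocking ping-pong. For reference, here is a minimal MPI round-trip benchmark of the same kind; it is a generic sketch, not the NetPIPE source, and the iteration count and message size are arbitrary examples.

```c
/* Minimal blocking ping-pong RTT benchmark, NetPIPE-style (generic sketch). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 100, size = 64 * 1024;   /* 100 round trips of 64 kB */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;   /* mean round-trip time */
    if (rank == 0)
        printf("msg %d bytes: RTT %.1f us, bandwidth %.2f MB/s\n",
               size, rtt * 1e6, 2.0 * size / rtt / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```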

  14. MPICH-V2 protocol [SC2003]. Pessimistic sender based message logging with uncoordinated checkpointing. The goal is to improve bandwidth through direct communications, i.e. to remove the Channel Memories. With pessimistic message logging only the reception events need to be saved reliably; the message payloads are kept by their senders (sender based logging) and must be covered by the checkpoints. If B crashes and the reception event is not logged, how will we know that M should be received before sending M'? Hence the protocol: 1) send information about each reception to the Event Logger (EL); 2) when sending, first wait for the EL acknowledgment of the previous reception events, and store the message payload in memory (a memory-mapped file); 3) after a crash, retrieve the ordered list of receptions from the EL; 4) contact the initial senders for replay.
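The four steps can be summarized by a sketch of the receive and send paths. The following C fragment is an interpretation of the slide, not the actual ch_v daemon code; all function names (el_log_event, el_wait_acks, payload_store, net_send) are hypothetical.

```c
/* Sketch of MPICH-V2's pessimistic sender-based logging (hypothetical names). */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primitives provided by the V daemon / Event Logger (EL). */
void el_log_event(int sender, uint64_t sender_clock, uint64_t recv_clock);
void el_wait_acks(void);                               /* block until the EL acked prior events */
void payload_store(int dest, const void *buf, size_t len);  /* memory-mapped payload log */
void net_send(int dest, const void *buf, size_t len);

/* Receiver side: every delivery is reported to the Event Logger. */
void on_deliver(int sender, uint64_t sender_clock, uint64_t recv_clock)
{
    el_log_event(sender, sender_clock, recv_clock);
}

/* Sender side (pessimistic condition): a message may leave only after the EL
 * has acknowledged the reception events this process reported earlier, and
 * its payload is kept locally so the original sender can replay it later. */
void on_send(int dest, const void *buf, size_t len)
{
    el_wait_acks();
    payload_store(dest, buf, len);
    net_send(dest, buf, len);
}

/* Recovery (not shown): fetch the ordered reception list from the EL,
 * then contact the initial senders to replay the logged payloads. */
```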

  15. MPICH-V2 bandwidth and latency. Acceptable bandwidth, but very high latency. Latency for a 0 byte MPI message: MPICH-P4 77 us, MPICH-V1 154 us, MPICH-V2 277 us. The V2 latency is high because of the event logging: a receiving process can send a new message only once the reception event has been successfully logged (6 TCP messages for a ping-pong).

  16. MPICH-Vcausal protocol [Cluster 2004]. Causal message logging with uncoordinated checkpointing. The goal is to improve latency through asynchronous event logging: every reception event must still be saved, so events not yet logged are piggybacked onto outgoing MPI messages; both the causality information and the payloads (sender based logging) must be covered by the checkpoints. The protocol: 1) send information about each reception to the Event Logger asynchronously (in total order); 2) the EL acknowledges previous events asynchronously; 3) if some events have not yet been acknowledged by the EL, piggyback the causality information on outgoing messages; 4) after a crash, retrieve the ordered list of receptions from the EL and from the other nodes; 5) contact the initial senders for replay.
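Compared with V2, the only change on the critical path is that the sender no longer blocks on the Event Logger acknowledgment: whatever has not been acknowledged yet is attached to the outgoing message. A hedged sketch of this send path follows; the names are hypothetical, not ch_v code, and the fixed buffer of 64 pending events is an arbitrary example.

```c
/* Sketch of the MPICH-Vcausal send path (hypothetical names, not ch_v code). */
#include <stddef.h>
#include <stdint.h>

typedef struct { int sender; uint64_t sender_clock, recv_clock; } event_t;

/* Hypothetical primitives. */
size_t collect_unacked_events(event_t *out, size_t max); /* events the EL has not acked yet */
void   payload_store(int dest, const void *buf, size_t len);
void   net_send_with_causality(int dest, const event_t *ev, size_t n,
                               const void *buf, size_t len);

void on_send(int dest, const void *buf, size_t len)
{
    event_t pending[64];
    /* Do not block on the Event Logger: whatever it has not yet acknowledged
     * is piggybacked on the message, so the causality survives a crash even
     * if the asynchronous logging has not completed. */
    size_t n = collect_unacked_events(pending, 64);
    payload_store(dest, buf, len);                       /* sender-based payload log */
    net_send_with_causality(dest, pending, n, buf, len); /* piggybacked causal info */
}
```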

  17. MPICH-Vcausal latency. Latency for a 0 byte MPI message: MPICH-P4 77 us, MPICH-V1 154 us, MPICH-V2 277 us, MPICH-Vcausal 120 us. The Vcausal latency is lower thanks to asynchronous event logging.

  18. MPICH-Vcausal bandwidth. However, the bandwidth is lower because of the causal information piggybacked on the messages.

  19. MPICH-V/CL protocol [Cluster 2003]. Coordinated checkpointing (Chandy-Lamport), the reference protocol for coordinated checkpointing. 1) On receiving a checkpoint tag, start a checkpoint and keep storing incoming messages; 2) store all in-transit incoming messages in the checkpoint image; 3) send the checkpoint tag to all neighbors in the topology; the checkpoint is finished once a tag has been received from every neighbor; 4) after a crash, all nodes retrieve their checkpoint images from the Checkpoint Server; 5) deliver the stored in-transit messages to the restarted processes.
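MPICH-V/CL applies the classical Chandy-Lamport snapshot algorithm at the channel level. A compact sketch of the tag and message handling is shown below; the helper names and the fixed topology size are assumptions made for illustration, not the actual implementation.

```c
/* Chandy-Lamport style coordinated checkpointing, as in MPICH-V/CL
 * (conceptual sketch; helper names and topology size are assumptions). */
#include <stdbool.h>
#include <stddef.h>

#define NB_NEIGHBORS 4                  /* example topology size */

static bool in_checkpoint = false;
static bool tag_from[NB_NEIGHBORS];     /* neighbors whose checkpoint tag has arrived */

void take_local_checkpoint(void);       /* hypothetical helpers */
void record_in_transit(int from, const void *msg, size_t len);
void send_tag_to_all_neighbors(void);

/* The checkpoint scheduler starts the wave by injecting the first tags. */
void on_checkpoint_tag(int from)
{
    if (!in_checkpoint) {               /* first tag seen: snapshot local state now */
        in_checkpoint = true;
        take_local_checkpoint();
        send_tag_to_all_neighbors();    /* propagate the wave */
    }
    tag_from[from] = true;              /* channel from this neighbor is now cut; the
                                           checkpoint completes once every neighbor's
                                           tag has been received */
}

void on_message(int from, const void *msg, size_t len)
{
    /* A message arriving on a channel whose tag has not been seen yet was in
     * transit when the snapshot started: store it in the checkpoint image so
     * it can be redelivered to the restarted process after a rollback. */
    if (in_checkpoint && !tag_from[from])
        record_in_transit(from, msg, len);
    /* ...then deliver the message to the local MPI process as usual. */
}
```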

  20. MPICH-V/CL latency. Latency for a 0 byte MPI message: MPICH-P4 77 us, MPICH-V1 154 us, MPICH-V2 277 us, MPICH-Vcausal 120 us, MPICH-V/CL 110 us. The remaining latency gap between CL and P4 comes from the protocol implementation.

  21. MPICH-V/CL bandwidth. The CL bandwidth is lower than P4's because of the protocol implementation.

  22. Other considered fault tolerant protocols (not discussed here): coordinated "node" checkpointing, i.e. checkpointing the MPI process together with its protocol stack (Parakeet, MEIOSYS, Score); Manetho causal message logging; Logon causal message logging. These were tested with and without an Event Logger.

  23. Outline: Introduction; A generic fault tolerance framework; 4 protocols: MPICH-V1, V2, Vcausal, VCL; Comparison; What we have learned; Ongoing work.

  24. NAS Benchmark Class A and B (results figures).

  25. NAS Benchmark Class A and B (results figures annotated with the dominant effects: bandwidth, latency, full duplex, logging overhead, implementation overhead).

  26. Performance when faults occur: NAS BT Class B on 25 nodes with an increasing number of non-overlapping faults, up to 1F/2M (+100%).

  27. Outline: Introduction; A generic fault tolerance framework; 4 protocols: MPICH-V1, V2, Vcausal, VCL; Comparison; What we have learned; Ongoing work.

  28. What we have learned. 1) Message logging protocols are competitive with coordinated checkpointing. 2) Message logging protocols tolerate higher fault frequencies. 3) Contrary to general belief, coordinated checkpointing provides very good performance in the presence of faults (with local checkpoint copies). 4) The synchronization time is not the first limiting factor of CL; the main one is the stress on the checkpoint server during checkpoints and restarts, which can be reduced by keeping local copies of the checkpoint images. 5) The three causal message logging protocols tested provide similar performance; Event Loggers are critical to reach high performance. No single protocol outperforms the others on all benchmarks, so it is important to have the choice and to select the best one according to the application.

  29. Ongoing work: improve performance (zero copy for Myrinet, Infiniband, etc.); migration of running MPI applications across heterogeneous networks; MPICH-V3 for the Grid (cluster-of-clusters environments), which raises the issues of combining FT protocols and of hierarchical FT protocols; out-of-core scheduling of MPI applications (using checkpointing to schedule a set of MPI applications under memory constraints).

  30. Questions? www.lri.fr/~gk/MPICH-V

  31. References.
  1) Aurélien Bouteiller, Boris Collin, Thomas Herault, Pierre Lemarinier, Franck Cappello, "Comparison of Causal Message Log for fault tolerant MPI", research report.
  2) Aurélien Bouteiller, Thomas Herault, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "MPICH-V project", submitted to a journal.
  3) Aurélien Bouteiller, Thomas Herault, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "Coordinated Checkpoint versus Causal Message Log for fault tolerant MPI", IEEE Cluster 2004, San Diego.
  4) Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "Coordinated Checkpoint versus Message Log for fault tolerant MPI", FGCS, extended version of 6), 2004.
  5) Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "MPICH-V3: design of a fault tolerant MPI for Grids of Clusters", poster, IEEE/ACM SC 2003, Phoenix, USA, November 2003.
  6) Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, Franck Cappello, "Coordinated Checkpoint versus Message Log for fault tolerant MPI", IEEE Cluster 2003, Hong Kong, December 2003.
  7) Aurélien Bouteiller, Franck Cappello, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette, "MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging", IEEE/ACM SC 2003, Phoenix, USA, November 2003.
  8) G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov, "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes", IEEE/ACM SC 2002, Baltimore, November 2002.
