FT-MPI: Fault Tolerant MPI
Brian Alexander
CSS 534, Spring 2014
The problem
• Every member of MPI_COMM_WORLD is expected to complete its task. If one process fails, then all communicators that include that process become invalid.
• In most MPI implementations there’s no way to recover from a failed task.
• With only a few nodes, failures are relatively rare and easy to recover from. With hundreds or thousands of nodes, failures are more frequent and the problem is bigger.
[Figure: timeline of four user processes (ranks 0 to 3): Init() barrier, Send()/Recv(), data processing, Finalize() barrier. A crash is marked on every rank.]
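The scaling point above can be made concrete with a back-of-the-envelope model. The sketch below assumes every node fails independently with the same probability p during a run. That assumption, and the function names, are illustrative and do not come from the slides.

```c
#include <stdio.h>

/* Probability that at least one of n nodes fails during a run,
 * ASSUMING independent failures with per-node probability p.
 * Both the model and the name are illustrative. */
double any_failure_probability(int n, double p)
{
    double all_survive = 1.0;
    for (int i = 0; i < n; i++) {
        all_survive *= (1.0 - p);   /* node i survives with probability 1 - p */
    }
    return 1.0 - all_survive;
}

/* Print the job-level failure probability for a few cluster sizes. */
void print_failure_table(double p)
{
    const int sizes[] = {4, 100, 1000};
    for (int i = 0; i < 3; i++) {
        printf("%5d nodes: %.4f\n", sizes[i],
               any_failure_probability(sizes[i], p));
    }
}
```

Under this model with p = 0.001, a 4-node job sees a failure in about 0.4% of runs, while a 1000-node job sees one in about 63% of runs. Since one failed process invalidates the whole job, this is the number that matters.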
The (not-optimal) solution
• The whole system needs to be checkpointed so that if a process fails, the program can restart without losing all progress.
• This hurts performance because the checkpoint operation isn’t free, and the ranks need to be synchronized and the message buffers cleared before the checkpoint.
[Figure: timeline of four user processes (ranks 0 to 3): Init() barrier, Send()/Recv(), Checkpoint(), data processing, a crash on one rank, Finalize() barrier.]
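A minimal single-process sketch of the checkpoint/restart idea follows. It leaves out the expensive parts named above (synchronizing the ranks and draining message buffers) and only shows how saved state lets a restarted run skip finished work. The `SolverState` type, the function names, and the simulated crash are all hypothetical.

```c
#include <stdio.h>

/* Hypothetical application state; a real solver would carry far more. */
typedef struct {
    int    iteration;
    double partial_sum;
} SolverState;

/* Write the state to path. Returns 0 on success, -1 on failure. */
int checkpoint_save(const char *path, const SolverState *s)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0) return -1;
    return written == 1 ? 0 : -1;
}

/* Read the state from path. On failure, *s is left untouched. */
int checkpoint_load(const char *path, SolverState *s)
{
    SolverState tmp;
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(&tmp, sizeof tmp, 1, f);
    fclose(f);
    if (got != 1) return -1;
    *s = tmp;
    return 0;
}

/* Sum 0..total-1, checkpointing every `interval` iterations and resuming
 * from an existing checkpoint. A crash is simulated by returning 1 when
 * iteration crash_at is reached (pass -1 for no crash).
 * Returns 0 when finished, -1 if a checkpoint could not be written. */
int run_solver(const char *path, int total, int interval, int crash_at,
               SolverState *out)
{
    SolverState s = {0, 0.0};
    checkpoint_load(path, &s);          /* no checkpoint: start from zero */
    while (s.iteration < total) {
        if (s.iteration == crash_at) { *out = s; return 1; }
        s.partial_sum += (double)s.iteration;
        s.iteration++;
        if (s.iteration % interval == 0 && checkpoint_save(path, &s) != 0) {
            *out = s;
            return -1;
        }
    }
    *out = s;
    return 0;
}
```

Work done after the last checkpoint is lost in a crash, so the interval trades checkpoint overhead against the amount of recomputation after a restart.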
What is FT-MPI?
• Developed in the late 1990s and early 2000s as part of the HARNESS (Heterogeneous Adaptable Reconfigurable Networked Systems) project, funded by the Department of Energy.
• It’s an implementation of MPI that allows for user-level recovery after a process has failed.
• Fault recovery is done without affecting the performance of MPI as a whole (bandwidth versus message size is as good as in other MPI implementations).
From http://web.eecs.utk.edu/~dongarra/lyon2002/Fagg.ppt
FT-MPI solution
• Instead of having only two states for processes (OK, FAILED), FT-MPI uses three (OK, PROBLEM, FAILED) and gives recovery options to the user.
• Instead of crashing, FT-MPI can:
  • “Blank” (ignore) the failed process.
  • Shrink the size of MPI_COMM_WORLD by removing the failed process.
  • Rebuild failed processes or rebuild all processes (like starting from a checkpoint).
• FT-MPI doesn’t handle data or process recovery. That has to be done by the user.
[Figure: timeline of four user processes (ranks 0 to 3): Init() barrier, Send()/Recv(), data processing, ProcessFailed(), Rebuild(), Recover, Finalize() barrier.]
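The three recovery options can be pictured as operations on the table that maps ranks to processes. The toy model below is not FT-MPI code and does not use the FT-MPI API. The enum values, `SLOT_EMPTY`, and `apply_recovery` are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical names for the three options listed above. */
typedef enum { MODE_BLANK, MODE_SHRINK, MODE_REBUILD } RecoveryMode;

#define SLOT_EMPTY (-1)   /* marks a rank whose process is gone */

/* procs[i] holds the id of the process occupying rank i.
 * After rank `failed` dies, rewrite the table in place according to
 * `mode` and return the new communicator size. `replacement` is the id
 * of the respawned process and is only used by MODE_REBUILD. */
int apply_recovery(int *procs, int size, int failed,
                   RecoveryMode mode, int replacement)
{
    if (procs == NULL || failed < 0 || failed >= size) {
        return size;                      /* nothing to do */
    }
    switch (mode) {
    case MODE_BLANK:                      /* keep the size, leave a hole */
        procs[failed] = SLOT_EMPTY;
        return size;
    case MODE_SHRINK:                     /* close the gap, ranks renumber */
        for (int i = failed; i < size - 1; i++) {
            procs[i] = procs[i + 1];
        }
        return size - 1;
    case MODE_REBUILD:                    /* same rank, new process */
        procs[failed] = replacement;
        return size;
    }
    return size;
}
```

The choice matters to application code. After a shrink, every rank above the failed one changes, so rank-based data decomposition must be recomputed. After a blank, ranks are stable but sends to the hole have to be skipped. After a rebuild, the new process starts without the data the old one held.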
FT-MPI Recovery
[Flow diagram with two columns and an error path]
• Normal startup (all ranks): MPI_Init(…), Install Error Handler & Set LongJMP, SolveProblem(…), MPI_Finalize(…)
• Error path: ErrorHandler(…), Recover(), Make LongJMP
• Recovery (failed rank): MPI_Init(…), I Am Recovered, Recover(), SolveProblem(…), MPI_Finalize(…)
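The control flow in the diagram above relies on C's setjmp/longjmp. The sketch below shows that pattern in plain C with no MPI calls: a restart point is set before the solver runs, and the error handler recovers and then jumps back to it. The failure is simulated, and every function name here is a stand-in rather than part of FT-MPI.

```c
#include <setjmp.h>

static jmp_buf restart_point;   /* where the error handler jumps back to */
static int failures_left;       /* simulated failures still to happen */
static int recover_calls;       /* how many times recovery ran */

/* Stand-in for user-level recovery (restoring data, fixing ranks). */
static void recover(void)
{
    recover_calls++;
}

/* Stand-in for the handler the MPI runtime would invoke on a failure. */
static void error_handler(void)
{
    recover();
    longjmp(restart_point, 1);  /* unwind to the restart point, no return */
}

/* Stand-in solver: hits a simulated failure while any remain. */
static int solve_problem(void)
{
    if (failures_left > 0) {
        failures_left--;
        error_handler();
    }
    return 42;                  /* arbitrary "answer" for the sketch */
}

/* Set the restart point, then run the solver. Returns the solver result
 * and reports through *recoveries how often the error path was taken. */
int run_with_recovery(int simulated_failures, int *recoveries)
{
    failures_left = simulated_failures;
    recover_calls = 0;
    if (setjmp(restart_point) != 0) {
        /* Reached via longjmp: recovery already ran, fall through
         * and call the solver again. */
    }
    int result = solve_problem();
    *recoveries = recover_calls;
    return result;
}
```

The state shared across the jump lives in static variables. Local variables changed between setjmp and longjmp have indeterminate values afterwards unless they are declared volatile, so the sketch avoids them.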
FT-MPI limitations
• The implementation was created in the early 2000s based on the MPI 1.2 spec (with parts of the MPI-2 spec) and hasn’t been updated.
• Some MPI implementations (like OpenMPI) claim to include some of FT-MPI’s features, but they generally don’t work. For these, checkpointing is the most commonly used method of fault tolerance.
• The implementation is at the MPI level. If you’re working in a system that already includes MPI at a lower level, you may need to rewrite some of the lower-level code to handle the new error states.
• FT-MPI is built into HARNESS and also works with PVM (Parallel Virtual Machine), but if you don’t want to use HARNESS it may be more complicated to use FT-MPI.
Recommendations
• For small, simple problems:
  • Checkpointing and the fault tolerance in MPI implementations like OpenMPI should be good enough.
  • Use checkpointing if your program runs long enough that a crash would cause considerable lost data or time.
• For large-scale or grid systems:
  • If it makes sense for your problem to be able to recover on the fly and you can do so without losing data, then you may want to use FT-MPI.
  • Before choosing FT-MPI, make sure you don’t need any features outside of MPI 1.2.
  • If map-reduce makes sense for the problem, use Hadoop.
Resources and references
• http://icl.cs.utk.edu/harness/
• http://icl.cs.utk.edu/graphics/posters/files/FT-MPI-2006.pdf
• http://web.eecs.utk.edu/~dongarra/lyon2002/Fagg.ppt
• Fagg, G. E., & Dongarra, J. (2000). FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 346-353. Retrieved from http://dl.acm.org/citation.cfm?id=648137.746632
• Gropp, W., & Lusk, E. (2004). Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications, 18(3), 363-372.
• HARNESS: http://www.csm.ornl.gov/harness/