
MPI—The Best High Performance Programming Model for Clusters and Grids

William Gropp, Mathematics and Computer Science, Argonne National Laboratory (www.mcs.anl.gov/~gropp). Outline: MPI features that make it successful; why MPI is the programming model of choice; special cluster and grid needs.


Presentation Transcript


  1. MPI—The Best High Performance Programming Model for Clusters and Grids
  William Gropp
  www.mcs.anl.gov/~gropp
  Mathematics and Computer Science, Argonne National Laboratory

  2. Outline
  • MPI features that make it successful
  • Why is MPI the programming model of choice?
  • Special cluster and grid needs
    • Fault tolerance
    • Latency tolerance
    • Communication tuning

  3. The Success of MPI
  • Applications
    • Most recent Gordon Bell prize winners use MPI
    • 26 TF climate simulation on the Earth Simulator; 16 TF DNS
  • Libraries
    • Growing collection of powerful software components
    • MPI programs with no MPI calls (all communication is inside libraries)
  • Tools
    • Performance tracing (Vampir, Jumpshot, etc.)
    • Debugging (TotalView, etc.)
  • Results
    • Papers: http://www.mcs.anl.gov/mpi/papers
  • Beowulf
    • Ubiquitous parallel computing
  • Grids
    • MPICH-G2 (MPICH over Globus, http://www3.niu.edu/mpi)

  4. Why Was MPI Successful?
  • It addresses all of the following issues:
    • Portability
    • Performance
    • Simplicity and Symmetry
    • Modularity
    • Composability
    • Completeness

  5. Performance
  • Performance must be competitive
  • Pay attention to memory motion
  • Leave freedom for implementers to exploit any special features
  • The Standard document requires careful reading
  • Not all implementations are perfect
  • MPI ping-pong performance should be asymptotically as good as that of any other interprocess communication mechanism; the slide's annotation "these should be the same" refers to this comparison (see the timing sketch below)
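A minimal sketch of the kind of ping-pong measurement this slide refers to, assuming two ranks exchanging a small buffer; the message size and iteration count are illustrative, not from the original talk.

```c
/* Sketch: MPI ping-pong round-trip timing between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000, nbytes = 1024;   /* illustrative sizes */
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round-trip time: %g us\n", 1e6 * (t1 - t0) / iters);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

The point of the comparison is that this loop, run over an MPI implementation, should approach the latency and bandwidth of the underlying interconnect's native communication mechanism.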

  6. Parallel Computing and Uniprocessor Performance
  [Diagram: two nodes, each with CPUs, cache, and main memory, connected to remote memory. The annotation marks the gap between CPU/cache and main memory as the hardest one to bridge, not the gap to remote memory.]
  • Deeper memory hierarchy
  • Synchronization/coordination costs

  7. Simplicity and Symmetry
  • MPI is organized around a small number of concepts
  • The number of routines is not a good measure of complexity
    • Fortran: large number of intrinsic functions
    • C and Java runtimes are large
    • Development frameworks: hundreds to thousands of methods
  • This doesn't bother millions of programmers
    • Why should it bother us?

  8. Is Ease of Use the Overriding Goal?
  • MPI is often described as "the assembly language of parallel programming"
    • C and Fortran have been described as "portable assembly languages"
  • Ease of use is important. But completeness is more important.
    • Don't force users to switch to a different approach as their application evolves
  • Don't forget: no application wants to use parallelism
    • Parallelism is used only to get adequate performance
    • Ease of use at the expense of performance is irrelevant

  9. Fault Tolerance in MPI
  • Can MPI be fault tolerant?
    • What does that mean?
  • Implementation vs. specification
    • Work to be done on the implementations
    • Work to be done on the algorithms
      • Semantically meaningful and efficient collective operations
  • Use MPI at the correct level
    • Build libraries to encapsulate important programming paradigms
  • (The following slides are joint work with Rusty Lusk)

  10. Myths and Facts
  • Myth: MPI behavior is defined by its implementations.
    Fact: MPI behavior is defined by the Standard document at http://www.mpi-forum.org.
  • Myth: MPI is not fault tolerant.
    Fact: This statement is not well formed. Its truth depends on what it means, and one can't tell from the statement itself. More later.
  • Myth: All processes of MPI programs exit if any one process crashes.
    Fact: Sometimes they do; sometimes they don't; sometimes they should; sometimes they shouldn't. More later.
  • Myth: Fault tolerance means reliability.
    Fact: These are completely different. Again, definitions are required.

  11. More Myths and Facts
  • Myth: Fault tolerance is independent of performance.
    Fact: In general, no. Perhaps for some (weak) aspects, yes. Support for fault tolerance will negatively impact performance.
  • Myth: Fault tolerance is a property of the MPI standard (which it doesn't have).
    Fact: Fault tolerance is not a property of the specification, so the specification can neither have it nor lack it.
  • Myth: Fault tolerance is a property of an MPI implementation (which most don't have).
    Fact: Fault tolerance is a property of a program. Some implementations make it easier to write fault-tolerant programs than others do.

  12. What Is Fault Tolerance Anyway?
  • A fault-tolerant program can "survive" (in some sense we need to discuss) a failure of the infrastructure (machine crash, network failure, etc.)
  • This is not, in general, completely attainable. (What if all processes crash?)
  • How much is recoverable depends on how much state the failed component holds at the time of the crash.
    • In many master/slave algorithms, a slave holds only a small amount of easily recoverable state (the most recent subproblem it received).
    • In most mesh algorithms, a process may hold a large amount of difficult-to-recover state (data values for some portion of the grid/matrix).
    • Communication networks hold varying amounts of state in communication buffers.

  13. What Does the Standard Say About Errors?
  • A set of errors is defined, to be returned by MPI functions if MPI_ERRORS_RETURN is set.
    • Implementations are allowed to extend this set.
  • It is not required that subsequent operations work after an error is returned. (Or that they fail, either.)
  • It may not be possible for an implementation to recover from some kinds of errors even enough to return an error code (and such implementations are conforming).
  • Much is left to the implementation; some conforming implementations may return errors in situations where other conforming implementations abort. (See the "quality of implementation" issue in the Standard.)
  • Implementations are allowed to trade performance against fault tolerance to meet the needs of their users (see the error-handling sketch below).
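A minimal sketch of the error-handling model described above: the default MPI_ERRORS_ARE_FATAL handler is replaced by MPI_ERRORS_RETURN (here via the MPI-2 call MPI_Comm_set_errhandler), and the return code of a call is checked. The deliberately invalid destination rank is only there to provoke an error for illustration.

```c
/* Sketch: ask MPI to return error codes instead of aborting, then check them. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, rc, data = 42, msglen;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Destination rank "size" does not exist, so this should fail. */
    rc = MPI_Send(&data, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &msglen);
        fprintf(stderr, "rank %d: send failed: %s\n", rank, msg);
        /* The Standard does not promise that further MPI calls work here. */
    }

    MPI_Finalize();
    return 0;
}
```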

  14. Some Approaches to Fault Tolerance in MPI Programs
  • Master/slave algorithms using intercommunicators
    • No change to existing MPI semantics
    • MPI intercommunicators generalize the well-understood two-party model to groups of processes, allowing either the master or the slave to be a parallel program optimized for performance.
  • Checkpointing
    • In wide use now
    • Plain vs. fancy
    • MPI-IO can help make it efficient
  • Extending MPI with some new objects in order to allow a wider class of fault-tolerant programs
    • The "pseudo-communicator"
  • Another approach: change the semantics of existing MPI functions
    • No longer MPI (semantics, not syntax, defines MPI)

  15. A Fault-Tolerant MPI Master/Slave Program
  • The master process comes up alone first
    • Size of MPI_COMM_WORLD = 1
  • It creates slaves with MPI_Comm_spawn
    • Gets back an intercommunicator for each one
    • Sets MPI_ERRORS_RETURN on each
  • The master communicates with each slave using its particular communicator
    • MPI_Send/MPI_Recv to/from rank 0 in the remote group
  • The master maintains state information so it can restart each subproblem in case of failure
    • It may start a replacement slave with MPI_Comm_spawn
  • Slaves may themselves be parallel
    • Size of MPI_COMM_WORLD > 1 on the slaves
  • This allows the programmer to control the tradeoff between fault tolerance and performance (see the sketch below)
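A condensed sketch of the master side of this pattern. The slave executable name "slave", the task count, and the trivial round-robin scheduling are illustrative assumptions; a real master would keep a work queue, respawn failed slaves, and resubmit their saved subproblems.

```c
/* Sketch of the master: spawn slaves, talk over intercommunicators,
   treat a failed send/recv as "resubmit this subproblem". */
#include <mpi.h>
#include <stdio.h>

#define NSLAVES 4

int main(int argc, char **argv)
{
    MPI_Comm slave[NSLAVES];
    int i, rc, task, result;

    MPI_Init(&argc, &argv);   /* master starts alone: MPI_COMM_WORLD size 1 */

    for (i = 0; i < NSLAVES; i++) {
        /* "slave" is a hypothetical executable name for this sketch. */
        MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &slave[i], MPI_ERRCODES_IGNORE);
        MPI_Comm_set_errhandler(slave[i], MPI_ERRORS_RETURN);
    }

    for (task = 0; task < 100; task++) {
        i = task % NSLAVES;                 /* trivial scheduling for brevity */
        rc = MPI_Send(&task, 1, MPI_INT, 0, 0, slave[i]);
        if (rc == MPI_SUCCESS)
            rc = MPI_Recv(&result, 1, MPI_INT, 0, 0, slave[i], MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            /* Slave i is suspect: a real master would respawn it with
               MPI_Comm_spawn and resubmit the saved subproblem. */
            fprintf(stderr, "task %d failed on slave %d; resubmitting\n", task, i);
        }
    }

    MPI_Finalize();
    return 0;
}
```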

  16. Checkpointing
  • Application-driven vs. externally driven
    • The application knows when the message-passing subsystem is quiescent
    • Checkpointing every n timesteps allows very long (months) ASCI computations to proceed routinely in the face of outages
    • Externally driven checkpointing requires much more cooperation from the MPI implementation, which may impact performance
  • MPI-IO can help with large, application-driven checkpoints (see the sketch below)
  • "Extreme" checkpointing: MPICH-V (Paris group)
    • All messages logged
    • States periodically checkpointed asynchronously
    • Local state can be restored from a checkpoint plus the message log since the last checkpoint
    • Not high performance
    • Scalability challenges
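A minimal sketch of an application-driven checkpoint using MPI-IO, assuming each rank owns a contiguous block of local_n doubles; the file name and layout are illustrative. The application would call this every n timesteps, at a point where it knows no messages are in flight.

```c
/* Sketch: application-driven checkpoint of a distributed array with MPI-IO. */
#include <mpi.h>
#include <stdio.h>

/* Write each rank's block at its own offset in one shared file. */
void checkpoint(double *u, int local_n, int step, MPI_Comm comm)
{
    char name[64];
    int rank;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Comm_rank(comm, &rank);
    snprintf(name, sizeof(name), "ckpt_%06d.dat", step);  /* illustrative name */
    offset = (MPI_Offset)rank * local_n * sizeof(double);

    MPI_File_open(comm, name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* Collective write lets the implementation aggregate I/O requests. */
    MPI_File_write_at_all(fh, offset, u, local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```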

  17. Latency Tolerance
  • Memory systems have a 100+ to 1 latency ratio to the CPU
  • Cluster interconnects have a 10,000+ to 1 latency ratio to the CPU
  • Grid interconnects have a 10,000,000+ to 1 latency ratio to the CPU
    • The first two you might just possibly fix with clever engineering; fixing the last requires warp-drive technology
  • The classic approach is to split operations into separate initiation and completion steps
    • The LogP model separates overhead from irreducible latency
  • Programmers are rarely good at writing programs with split operations

  18. Split-Phase Operations
  • MPI nonblocking operations provide split operations (see the sketch below)
    • MPI-2 adds generalized requests (with the same wait/test)
    • "Everything's a request" (almost, sigh)
  • MPI-2 RMA operations are nonblocking but with separate completion operations
    • Chosen to make the RMA API fast
    • Nonblocking ≠ concurrent
  • Parallel I/O in MPI
    • Aggregates, not POSIX streams
    • Rational atomicity: permits sensible caching strategies while maintaining reasonable semantics
    • Well suited to grid-scale remote I/O
    • Two-phase collective I/O provides a model for two-phase collective operations
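A minimal sketch of the split-phase (initiate/complete) style with MPI nonblocking point-to-point operations. The neighbor-exchange setting, buffer size, and interior_work callback are illustrative assumptions, not code from the talk.

```c
/* Sketch: split-phase communication with MPI_Irecv/MPI_Isend + MPI_Waitall. */
#include <mpi.h>

#define N 1024

/* Exchange halo buffers with one neighbor while doing interior work. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int neighbor,
                          MPI_Comm comm, void (*interior_work)(void))
{
    MPI_Request req[2];

    /* Initiation phase: post the receive first, then the send. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* Latency tolerance: compute on data that does not depend on the halo. */
    interior_work();

    /* Completion phase: both operations finish here. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```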

  19. MPI Implementation Choices for Latency Tolerance
  • MPI implementations can be optimized for clusters and grids
  • Example: point-to-point "rendezvous"
    • Typical 3-way exchange: sender requests; receiver acks with "ok to send"; sender delivers data
    • Alternative "receiver requests" 2-way exchange: receiver sends a "request to receive" to a designated sender; sender delivers data
      • MPI_ANY_SOURCE receives interfere with this optimization
    • Even better: use MPI RMA, so the sender delivers data to a previously agreed location (see the sketch below)
  • Example: collective algorithms
    • Typical: simple minimum-spanning-tree algorithms, etc.
    • Grids and SMP clusters: topology-aware algorithms (MagPIE; MPICH-G2)
    • Switched networks: scatter/gather-based algorithms instead of fanout trees
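A minimal sketch of the RMA alternative mentioned above, assuming two ranks have agreed in advance that the data goes at offset 0 of an exposed window; fence synchronization is used for brevity, and the buffer contents are illustrative. Run with at least two processes.

```c
/* Sketch: sender delivers data directly into a pre-agreed RMA window. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, i;
    double buf[N], target[N];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) buf[i] = rank;   /* illustrative payload */

    /* Every rank exposes "target" as the previously agreed destination. */
    MPI_Win_create(target, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)                   /* sender: no request/ack round trip */
        MPI_Put(buf, N, MPI_DOUBLE, 1 /* target rank */, 0 /* offset */,
                N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);           /* completion: data is visible on rank 1 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```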

  20. Communication Tuning for Grids
  • Quality of service
    • MPI attributes provide a modular way to pass information to the implementation
    • The only change to user code is an MPI_Comm_set_attr call (see the sketch below)
  • TCP recovery
    • Recovery from transient TCP failures
    • It is hard to get TCP state from the TCP API (how much data was actually delivered?)
  • Messages are not streams
    • The user buffer can be sent in any order
    • Allows aggressive (but good-citizen) UDP-based communication
      • Aggregate acks/nacks
      • Compare to "infinite window" TCP (receive buffer)
    • 80%+ of bandwidth achievable on a long-haul system
    • Contention management can maintain "good Internet behavior"
      • Actually reduces network load by reducing the number of acks and retransmits; makes better use of network bandwidth (use it or lose it)
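A minimal sketch of the attribute mechanism this slide refers to. The keyval is created by the application and the QoS hint string is purely hypothetical; a grid-aware implementation would document its own predefined key and hint format.

```c
/* Sketch: attach a (hypothetical) QoS hint to a communicator via attributes. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int qos_keyval;
    static char hint[] = "bandwidth=10Mb/s";   /* hypothetical hint string */

    MPI_Init(&argc, &argv);

    /* Create a key; a grid-enabled MPI would predefine one instead. */
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &qos_keyval, NULL);

    /* The only change to user code: attach the hint to the communicator. */
    MPI_Comm_set_attr(MPI_COMM_WORLD, qos_keyval, hint);

    /* ... normal MPI communication; the implementation may consult the hint ... */

    MPI_Comm_free_keyval(&qos_keyval);
    MPI_Finalize();
    return 0;
}
```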

  21. MPICH-2
  • Original goals of MPICH, plus:
    • Scalability to 100,000 processes
    • Improved performance in multiple areas
    • Portability to new interconnects
      • New RDMA-oriented abstract communication device implementation
    • Fast, scalable, modular process management interface
    • Thread safety
    • Full MPI-2 Standard (I/O, RMA, dynamic processes, more)
  • Not yet released
    • Detailed design complete and publicly available
    • Core functionality (point-to-point and collective operations) from MPI-1 complete
    • MPI-1 part to be released this fall

  22. Early Results on the Channel/TCP Device
  • Conclusion: little added overhead over low-level communication
  • But this will become more critical with a high-performance network

  23. Conclusion
  • MPI provides an effective programming model for clusters and for the grid
    • Performance focus
    • Modularity and composability encourage software tools
      • Libraries can work safely together
    • Programs can provide sufficient performance on enough platforms to justify their development
  • MPI is not the final programming model
    • Any replacement must be better than MPI across the spectrum, not just better for one or two capabilities
