
MPICH2—A New Start for MPI Implementations



  1. MPICH2—A New Start for MPI Implementations William D. Gropp, Mathematics and Computer Science, Argonne National Laboratory, www.mcs.anl.gov/~gropp

  2. The newest edition of Using MPI-2, translated by Takao Hatazaki, www.pearsoned.co.jp/washo/prog/wa_pro61-j.html

  3. MPICH2 Team • Bill Gropp • Rusty Lusk • Rajeev Thakur • Rob Ross • David Ashton • Brian Toonen • Rob Latham

  4. What’s New • Pre-pre-pre-pre-release version available for groups that expect to perform research on MPI implementations with MPICH2 • Contains • Most of MPI-1, service functions from MPI-2 • C, Fortran bindings • Example devices for TCP • Documentation

  5. MPICH2 Research • An all-new implementation is our vehicle for research in • Thread safety and efficiency (e.g., avoiding thread locks) • Optimized MPI datatypes • Optimized Remote Memory Access (RMA) • High scalability (64K MPI processes and more) • Exploiting Remote Direct Memory Access (RDMA) capable networks • All of MPI-2, including dynamic process management, parallel I/O, RMA • Usability and robustness • Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation • Allow extensive runtime error checking but do not require it

  6. Some Target Platforms • Clusters (TCP, UDP, Infiniband, Myrinet, Proprietary Interconnects, …) • Clusters of SMPs • Grids (UDP, TCP, Globus I/O, … ) • BlueGene/x • 64K processors; 64K address spaces • ANL/IBM developing MPI and process management for BG/L • Other systems • ANL/Cray developing MPI and PVFS for Red Storm

  7. Structure of MPICH-2 • [Architecture diagram: under ADI-3, a multi-method device and the Channel Interface with targets TCP, Myrinet and other NICs, Portals, MM, and BG/L; under ADIO, PVFS and existing parallel file systems; under PMI, the MPD (Python) and fork process managers, with vendor process managers for others; components are marked as existing (Unix, Windows) or in progress]

  8. The Major Components • PMI • Process Manager Interface • Provides a scalable interface to both process creation and communication setup • Designed to permit many implementations, including with/without daemons and with 3rd-party process managers • ADIO • I/O interface. No change (except for error reporting and request management) from current ROMIO, at least this year • ADI3 • New device aimed at higher-performance networks and new network capabilities
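
To make the PMI idea concrete, here is a minimal sketch of how a device might exchange connection information through a PMI-style key-value space. The routine names follow the PMI-1 client interface that ships with MPICH2, but treat the exact signatures, the "pmi.h" header path, and the exchange_business_cards helper as illustrative assumptions rather than a fixed API.

    #include <stdio.h>
    #include <stdlib.h>
    #include "pmi.h"    /* PMI client header from MPICH2 (path assumed) */

    /* Publish this process's listening port, then look up rank 0's. */
    int exchange_business_cards(int listen_port)
    {
        int rank, spawned;
        char kvsname[256], key[64], value[64];

        PMI_Init(&spawned);
        PMI_Get_rank(&rank);
        PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

        /* put our "business card" into the shared key-value space */
        snprintf(key, sizeof(key), "port-%d", rank);
        snprintf(value, sizeof(value), "%d", listen_port);
        PMI_KVS_Put(kvsname, key, value);
        PMI_KVS_Commit(kvsname);
        PMI_Barrier();               /* everyone has published */

        /* any process's card can now be fetched by key */
        PMI_KVS_Get(kvsname, "port-0", value, sizeof(value));
        return atoi(value);
    }

Because the device only sees this key-value interface, the same code runs whether the processes were started by MPD, by fork, or by a third-party process manager.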

  9. Needs of an MPI Implementation • Point-to-point communication • Can work with polling • Cancelling a send • Requires some agent (interrupt-driven receive, separate thread, guaranteed timer) • Active target RMA • Can work with polling • Performance may require “bypass” • Passive target RMA • Requires some agent • For some operations, the agent may be provided by special hardware capabilities

  10. The Layers • ADI3 • Full-featured interface, closely matched to MPI point-to-point and RMA operations • Most MPI communication routines perform (optional) error checking and then “call” an ADI3 routine • Modular design allows replacement of parts, e.g., datatypes • New Channel Interface • Much smaller than ADI-3, easily implemented on most platforms • Nonblocking design is more robust and efficient than the MPICH-1 version

  11. Expose Structures To All Levels of the Implementation • All MPI opaque objects are defined structs for all levels (ADI, channel, and lower) • All objects have a handle that encodes the type of the object within the handle value • Permits runtime type checking of handles • Null handles are now distinct • Easier detection of misused values • Fortran integer-valued handles simplify the implementation for 64-bit systems • Consistent mechanism to extend definitions to support the needs of particular devices • Defined fields simplify much code • E.g., direct access to the rank and size of communicators
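
As a hypothetical illustration of this encoding (the real MPICH2 bit layout and kind codes are internal details and may differ), a handle might reserve a few high bits for the object kind, letting any level verify a handle before dereferencing it:

    /* Hypothetical handle layout: object kind in the upper bits.
       Field widths and kind codes are assumptions for illustration. */
    #define HANDLE_KIND_SHIFT  28
    #define HANDLE_KIND_MASK   0xF0000000
    #define HANDLE_KIND_COMM   0x1
    #define HANDLE_KIND_DTYPE  0x2

    #define HANDLE_GET_KIND(h) \
        (((unsigned)(h) & HANDLE_KIND_MASK) >> HANDLE_KIND_SHIFT)

    /* Runtime type check before using a communicator handle */
    int comm_handle_valid(int handle)
    {
        return HANDLE_GET_KIND(handle) == HANDLE_KIND_COMM;
    }

Because the handle is a plain integer, the same value works unchanged as a Fortran INTEGER on 64-bit systems, and a null handle can be given its own distinct kind code.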

  12. Special Case:Predefined Objects • Many predefined objects contain all information within the handle • Predefined MPI datatype handles contain • Size in bytes • Fact that the handle is a predefined datatype • No other data needed by most MPI routines • Eliminates extra loads, pointer chasing, and setup at MPI_Init time • Predefined attributes handled separately from general attributes • Special case anyway, since C and Fortran versions are different for the predefined attributes • Other predefined objects initialized only on demand • Handle always valid • Data areas may not be initialized until needed • Example: names (MPI_Type_set_name) on datatypes

  13. Builtin MPI Datatypes • [Bit-layout diagram: the handle packs a code identifying it as a datatype, a builtin flag, an index to a struct, and the size in bytes] • MPI_DATATYPE_NULL has the datatype code in its upper bits • The index is used to implement MPI_Type_set/get_name
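
Continuing the hypothetical layout sketched after slide 11, a builtin-datatype handle can carry its size directly, so MPI_Type_size on a predefined type needs no memory reference at all; the field positions below are assumptions:

    /* Assumed layout for builtin datatype handles:
       [ kind | builtin flag | index to struct | size in bytes ] */
    #define DTYPE_SIZE_MASK    0x000000FF   /* assumed low bits: size  */
    #define DTYPE_BUILTIN_FLAG 0x08000000   /* assumed builtin marker  */

    /* Size of a predefined datatype without touching memory */
    static inline int builtin_dtype_size(int handle)
    {
        return handle & DTYPE_SIZE_MASK;    /* e.g., an int type -> 4 */
    }

This is what eliminates the extra loads, pointer chasing, and MPI_Init-time setup mentioned on the previous slide: most MPI routines never need to follow the index to the struct.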

  14. Channel “CH3” • One possible implementation design for ADI3 • Others possible and underway • Thread safe by design • Requires* atomic operations • Nonblocking design • Requires some completion handle, so • Delayed allocation • Only allocate/initialize if a communication operation did not complete

  15. An Example: CH3 Implementation over TCP • Pollable and active-message data paths • RMA Path

  16. Typical CH3 Routines • Begin messages • CH3_iStartMsg, CH3_iStartMsgv • CH3_iStartRead • CH3_Request_create • Send or receive data using an existing request (e.g., incremental datatype handling, rendezvous) • CH3_iSend, CH3_iSendv • CH3_iWrite • CH3_iRead • Ensure progress • CH3_Progress, CH3_Progress_start, CH3_Progress_end, CH3_Progress_poke, CH3_Progress_signal_completion • CH3_Init, CH3_Finalize, CH3_InitParent

  17. Implementation Sketch

    MPID_Send( ... )
    {
        decide if eager based on message size and flow control
        if (eager) {
            create packet on stack, fill in as eager-send packet
            if (data contiguous) {
                request = CH3_iStartMsgv( iov )
                // request is null if the message was sent immediately
            }
            else {
                create pack buffer, pack data into buffer
                request = CH3_iStartMsgv( iov )
                if (!request)
                    free pack buffer
                else
                    save location of pack buffer in request
            }
        }
        else { // rendezvous
            create packet on stack
            request = CH3_Request_create()
            fill in request
            fill in packet as rendezvous request-to-send (include request id)
            CH3_iSend( request, packet )
        }
        return request
    }

  18. Example Channel/TCP Internal Profiling

  19. Example: Cost of Preallocating Requests

  20. Extra Wrinkles • Some fast networks do not guarantee ordered delivery (probably all of the very fastest) • MPI, fortunately, does not require ordered delivery • Except for message envelopes (message headers are ordered) • What should a channel interface look like? Should it require ordering (streams)?

  21. BlueGene/L

  22. Preserving Message Ordering • Diversion • Consider message matching for trace tools such as Jumpshot. How are arrows drawn? • If communication is single threaded, the trace tool can “replay” the messages to perform matching • For multithreaded (MPI_THREAD_MULTIPLE), the easiest solution is a message sequence number on envelopes • But that’s (almost) all we need to handle unordered message delivery! • Need to count data segments • Need to handle data that arrives before its envelope • Can be handled as a low (but not zero) probability case • Solution: • Provide hooks in the channel device for sequence numbers on envelopes; use these for MPI_THREAD_MULTIPLE and unordered networks • Other wrinkles handled below the channel device • OK, because this does not force ordered, stream-like messaging on the low levels
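
A minimal sketch of envelope matching with sequence numbers (all names here are hypothetical, not the MPICH2 structs): the receiver delivers envelope n only after envelopes 0..n-1 from the same source, parking early arrivals on an out-of-order list.

    /* Hypothetical per-source matching state for an unordered network */
    typedef struct envelope {
        unsigned seq;                /* sender-assigned sequence number */
        struct envelope *next;
        /* ... context id, tag, payload info ... */
    } envelope;

    typedef struct {
        unsigned next_seq;           /* next sequence we may deliver */
        envelope *early;             /* parked out-of-order arrivals */
    } source_state;

    extern void deliver(envelope *e);    /* hand to MPI matching */

    void on_envelope_arrival(source_state *s, envelope *e)
    {
        if (e->seq != s->next_seq) {     /* arrived early: park it */
            e->next = s->early;
            s->early = e;
            return;
        }
        deliver(e);
        s->next_seq++;
        /* drain any parked envelopes that are now in order */
        for (int found = 1; found; ) {
            found = 0;
            for (envelope **pp = &s->early; *pp; pp = &(*pp)->next) {
                if ((*pp)->seq == s->next_seq) {
                    envelope *hit = *pp;
                    *pp = hit->next;
                    deliver(hit);
                    s->next_seq++;
                    found = 1;
                    break;
                }
            }
        }
    }

Since early arrival is a low-probability case, the linear scan of the parked list is acceptable; the common path is a single compare and increment.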

  23. BG/L and the MPICH2 Architecture • [Layer diagram, interface vs. implementation: Message Passing Interface / MPI (types, key values, notion of requests); Abstract Device Interface / MPID (transform to pt2pt ops); Channel Interface / CH3 (request progress engine); transports: TCP/IP and the BG/L torus, tree, and GI networks] • Special opportunities: • collective bypass • scalable buffer mgmnt • out-of-order network

  24. [Call-path diagram on BG/L: the user calls MPI_Send or MPI_Bcast; MPI invokes MPID_Send in the ADI-3 implementation (MPICH2 Abstract Device Interface/MPID); that calls CH3_iStartMsgv in the channel implementation (Channel Interface/CH3); Channel_Write hands off to the Torus Message Layer (Torus_Send) in BG/L software; the Torus Packet Layer reaches the torus hardware via lfpdux() and sfpdx()]

  25. CH3 Summary • Nonblocking interface for correctness and “0 copy” transfers • struct iovec routines to provide “0 copy” for headers and data • Lazy request creation to avoid unnecessary operations when data can be sent immediately (low latency case); routines to reuse requests during incremental transfers • Thread-safe message queue manipulation routines • Supports both polling and preemptive progress
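
For instance, a send path can describe the packet header and the user buffer as two iovec entries, so a TCP channel can hand both to a single gathering write with no staging copy. This is a sketch under assumed names (pkt_header, send_eager), not the CH3 source:

    #include <sys/types.h>
    #include <sys/uio.h>                 /* struct iovec, writev */

    struct pkt_header { int context_id, tag, len; };  /* assumed header */

    /* Gathering write of header + user data: no intermediate copy.
       Caller must handle short writes by retrying the remainder. */
    ssize_t send_eager(int sockfd, struct pkt_header *hdr,
                       void *user_buf, size_t user_len)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;       iov[0].iov_len = sizeof(*hdr);
        iov[1].iov_base = user_buf;  iov[1].iov_len = user_len;
        return writev(sockfd, iov, 2);
    }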

  26. 100Mb Ethernet TCP

  27. 100Mb Ethernet TCP

  28. GigE TCP Results

  29. GigE TCP Results

  30. Enhancing Collective Performance • MPICH-1 algorithms are a combination of purely functional and minimum spanning tree (MST) algorithms • Better algorithms, based on scatter/gather operations, exist for large messages • E.g., see van de Geijn for the 1-D mesh • And better algorithms, based on MST, exist for small messages • Correct implementations must be careful with MPI datatypes • Rajeev Thakur and I have developed and implemented algorithms for switched networks that provide much better performance • The following results are for MPICH-1 and will be in the next MPICH-1 release

  31. Bcast with Scatter/Gather • Implement MPI_Bcast(buf,n,…) as MPI_Scatter(buf, n/p,…, buf+rank*n/p,…) followed by MPI_Allgather(buf+rank*n/p, n/p,…, buf,…) • [Diagram: the scatter and allgather phases across processes P0–P7]
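
A runnable sketch of this broadcast for the simple case where the buffer divides evenly among the p processes; a production version must also handle remainders, general datatypes, and the small-message regime where an MST algorithm wins. MPI_IN_PLACE is used so the scatter and allgather can operate directly on overlapping regions of buf:

    #include <mpi.h>

    /* Broadcast n bytes from rank 0 as scatter + allgather.
       Sketch only: assumes p divides n evenly. */
    void bcast_scatter_allgather(void *buf, int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int chunk = n / p;
        char *mine = (char *)buf + rank * chunk;

        /* Phase 1: root keeps its own chunk in place; each other
           process receives its chunk at offset rank*chunk */
        MPI_Scatter(buf, chunk, MPI_BYTE,
                    rank == 0 ? MPI_IN_PLACE : (void *)mine,
                    chunk, MPI_BYTE, 0, comm);

        /* Phase 2: in-place allgather reassembles the full buffer
           everywhere; each rank contributes buf + rank*chunk */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_BYTE,
                      buf, chunk, MPI_BYTE, comm);
    }

The point of the two-phase scheme is bandwidth: each process sends and receives about 2n bytes total instead of the n*log(p) moved by a plain MST broadcast of a large message.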

  32. Collective Performance

  33. And We’ll Go Faster Yet • MPICH-2 enables additional collective optimizations: • Pipelining of long messages • Store-and-forward of communication buffers eliminates extra copies, particularly for non-contiguous messages • And custom algorithms • Each collective operation may be replaced, on a per-communicator basis, with a separate routine (see the sketch below) • Unlike MPICH-1, each algorithm is contained within the file implementing that particular collective routine, so only the collective routines actually used are loaded
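
The per-communicator replacement can be pictured as a table of function pointers carried by the communicator; the type and field names below (coll_fns, bcast_dispatch, default_bcast) are assumptions for illustration, not the MPICH-2 definitions:

    #include <mpi.h>

    /* Assumed per-communicator table of collective implementations */
    struct coll_fns {
        int (*Bcast)(void *buf, int count, MPI_Datatype dt,
                     int root, MPI_Comm comm);
        int (*Barrier)(MPI_Comm comm);
        /* ... one slot per collective operation ... */
    };

    extern int default_bcast(void *, int, MPI_Datatype, int, MPI_Comm);

    /* Dispatch inside MPI_Bcast: use the override when a device has
       installed one for this communicator, else the default algorithm */
    int bcast_dispatch(struct coll_fns *fns, void *buf, int count,
                       MPI_Datatype dt, int root, MPI_Comm comm)
    {
        if (fns && fns->Bcast)
            return fns->Bcast(buf, count, dt, root, comm);
        return default_bcast(buf, count, dt, root, comm);
    }

On BG/L, for example, such a table is what allows the collective-bypass path: a torus- or tree-specific broadcast can be installed for MPI_COMM_WORLD without touching the point-to-point code.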

  34. Memory Footprint • Processor-rich systems have less memory per node, making lean implementations of libraries important • MPICH-2 addresses the memory “footprint” of the implementation through • Library design and organization that reduce the number of extraneous routines needed only to satisfy (normally unused) references • Don’t even think of a (classic, independent on-demand) shared library approach on 64K nodes • Lazy initialization of objects that are not commonly used • Use of callbacks to bring in code only when needed

  35. Reducing the Memory Footprint • MPI Groups • Groups are rarely used by applications • Groups are not used internally in the implementation of MPI routines, not even within the communicators • Datatype names • MPI-2 allows users to set and get names on all MPI datatypes, including predefined ones • Rather than preinitializing all names during MPI_Init, names are set on first use of MPI_Type_set_name or MPI_Type_get_name • Buffered Send • Proper completion requires checking during MPI_Finalize for pending buffered sends • The first use of MPI_Buffer_attach adds a bsend callback to a list of routines that are called during MPI_Finalize (see the sketch below)
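
A sketch of the finalize-callback idea (the registry and the bsend_flush_pending routine are hypothetical names, not the MPICH2 internals): the first MPI_Buffer_attach registers a cleanup routine, so programs that never use buffered sends pay nothing at MPI_Finalize.

    /* Hypothetical callback registry; bounds checking omitted */
    typedef void (*finalize_cb)(void *state);

    #define MAX_FINALIZE_CB 16
    static struct { finalize_cb fn; void *state; }
        cb_list[MAX_FINALIZE_CB];
    static int cb_count = 0;

    void register_finalize_cb(finalize_cb fn, void *state)
    {
        cb_list[cb_count].fn = fn;
        cb_list[cb_count].state = state;
        cb_count++;
    }

    /* Called from MPI_Finalize: only registered work is done */
    void run_finalize_cbs(void)
    {
        while (cb_count > 0) {
            cb_count--;
            cb_list[cb_count].fn(cb_list[cb_count].state);
        }
    }

    /* The first MPI_Buffer_attach would then do something like:
       register_finalize_cb(bsend_flush_pending, bsend_state);   */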

  36. Thread Safety • Internal APIs are defined to be thread-safe, usually by providing an atomic operation. Locks are not part of the internal API • They may be used to implement the API, but they are not required. Locks are bad but sometimes inevitable • The level of thread safety is settable at configure time and/or run time from the same source tree

  37. Reference Counting MPI Objects • Most MPI objects have reference-count semantics • Free only happens when the reference count is zero • Important for nonblocking (split-phase) operations • Updating the reference count must be atomic • Using a non-atomic update is one of the most common and most difficult-to-find errors in multithreaded programs • MPICH2 uses • MPIU_Object_add_ref(pointer_to_object) • MPIU_Object_release_ref(pointer_to_object,&inuse) • If inuse is false, the object can be freed • Why not use fetch-and-increment? • Not available on IA32 • The above API uses atomic instructions on all common platforms
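
A minimal sketch of that API using C11 atomics (MPICH2 itself predates C11 and uses platform-specific atomic instructions; the object layout here is an assumption):

    #include <stdatomic.h>

    typedef struct {
        atomic_int ref_count;
        /* ... object payload ... */
    } MPIU_Object;

    void MPIU_Object_add_ref(MPIU_Object *p)
    {
        atomic_fetch_add(&p->ref_count, 1);
    }

    void MPIU_Object_release_ref(MPIU_Object *p, int *inuse)
    {
        /* fetch_sub returns the old value: old == 1 means this call
           dropped the last reference and the caller may free it */
        *inuse = (atomic_fetch_sub(&p->ref_count, 1) != 1);
    }

Note that the API reports "still in use" rather than exposing the count, so it can be implemented with whatever atomic primitive the platform offers: compare-and-swap, locked add, or load-linked/store-conditional.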

  38. Managing the Source • Error messages • Short text, long text • Instance-specific text • Tools extract and write error message routines • Source markings for error tests • Leads to… • Testing • Coverage tests • Problem: ignoring error-handling code, particularly for low-likelihood errors • Exploit error messaging macros and comment markers
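
As an illustration of the idea (the macro and report_error helper below are simplified assumptions, not the MPICH2 macros): each error site names a catalog id plus instance-specific arguments, and a tool scans the source for the marker to generate the message catalog and the coverage annotations.

    /* Simplified stand-in for MPICH2's error-reporting macros */
    extern int report_error(int error_class, const char *msgid,
                            const char *file, int line, ...);

    #define MPIR_ERR_SET(err_, class_, msgid_, ...) \
        ((err_) = report_error((class_), (msgid_), \
                               __FILE__, __LINE__, __VA_ARGS__))

    /* A scanner can extract the id and the instance-specific format
       from calls such as:
       MPIR_ERR_SET(mpi_errno, MPI_ERR_ARG, "**bufsize",
                    "Buffer size %d is too small", size);
       deriving the short text from the id and the long text from
       the message catalog. */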

  39. Build Environment • Goal: no replicated data • Approach: solve everything with indirection • Suite of scripts that derive and write the necessary files • Autoconf 2.13/2.52 • “simplemake” creates Makefile.in, understands libraries whose source files are in different directories, and works around make bugs; will create Microsoft Visual Studio project files • “codingcheck” looks for source-code problems • “extracterrmsgs” creates message catalogs • Fortran and C++ interfaces derived from mpi.h

  40. MPI-2 In MPICH-2 • Everything is in progress: • I/O will have a first port (error messaging and generalized requests); later versions will exploit more of the MPICH-2 internals • MPI_Comm_spawn and friends will appear soon, using the PMI interface • RMA will appear over the next year • The challenge is low latency and high performance • The key is to exploit the semantics of the operations and the work on BSP, generalized for MPI

  41. Conclusion • MPICH2 is ready for researchers • New design is cleaner, more flexible, faster • Raises the bar on performance for all MPI implementations
