
MPICH2—A New Start for MPI Implementations



  1. MPICH2—A New Start for MPI Implementations William D. Gropp, Mathematics and Computer Science, Argonne National Laboratory, www.mcs.anl.gov/~gropp

  2. The newest edition of Using MPI-2, translated by Takao Hatazaki, www.pearsoned.co.jp/washo/prog/wa_pro61-j.html

  3. MPICH2 Team • Bill Gropp • Rusty Lusk • Rajeev Thakur • Rob Ross • David Ashton • Brian Toonen • Rob Latham

  4. What’s New • Pre-pre-pre-pre-release version available for groups that expect to perform research on MPI implementations with MPICH2 • Contains • Most of MPI-1, service functions from MPI-2 • C, Fortran bindings • Example devices for TCP • Documentation

  5. MPICH2 Research • An all-new implementation is our vehicle for research in • Thread safety and efficiency (e.g., avoiding thread locks) • Optimized MPI datatypes • Optimized Remote Memory Access (RMA) • High scalability (64K MPI processes and more) • Exploiting Remote Direct Memory Access (RDMA) capable networks • All of MPI-2, including dynamic process management, parallel I/O, RMA • Usability and robustness • Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation • Allow extensive runtime error checking but do not require it

  6. Some Target Platforms • Clusters (TCP, UDP, Infiniband, Myrinet, Proprietary Interconnects, …) • Clusters of SMPs • Grids (UDP, TCP, Globus I/O, … ) • BlueGene/x • 64K processors; 64K address spaces • ANL/IBM developing MPI and process management for BG/L • Other systems • ANL/Cray developing MPI and PVFS for Red Storm

  7. Structure of MPICH-2 • [Architecture diagram: under ADI-3, a multi-method device and the Channel Interface with targets TCP, Myrinet and other NICs, Portals, MM, and BG/L; under ADIO, PVFS and existing parallel file systems; under PMI, the MPD (Python) and fork process managers, with vendor process managers for others; components are marked as existing (Unix, Windows) or in progress]

  8. The Major Components • PMI • Process Manager Interface • Provides a scalable interface to both process creation and communication setup • Designed to permit many implementations, including with/without daemons and with 3rd-party process managers • ADIO • I/O interface. No change (except for error reporting and request management) from current ROMIO, at least this year • ADI3 • New device aimed at higher-performance networks and new network capabilities
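
To make the PMI idea concrete, here is a minimal sketch of how a device might exchange connection information through a PMI-style key-value space. The routine names follow the PMI-1 client interface that ships with MPICH2, but treat the exact signatures, the "pmi.h" header path, and the exchange_business_cards helper as illustrative assumptions rather than a fixed API.

    #include <stdio.h>
    #include <stdlib.h>
    #include "pmi.h"    /* PMI client header from MPICH2 (path assumed) */

    /* Publish this process's listening port, then look up rank 0's. */
    int exchange_business_cards(int listen_port)
    {
        int rank, spawned;
        char kvsname[256], key[64], value[64];

        PMI_Init(&spawned);
        PMI_Get_rank(&rank);
        PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

        /* put our "business card" into the shared key-value space */
        snprintf(key, sizeof(key), "port-%d", rank);
        snprintf(value, sizeof(value), "%d", listen_port);
        PMI_KVS_Put(kvsname, key, value);
        PMI_KVS_Commit(kvsname);
        PMI_Barrier();               /* everyone has published */

        /* any process's card can now be fetched by key */
        PMI_KVS_Get(kvsname, "port-0", value, sizeof(value));
        return atoi(value);
    }

Because the device only sees this key-value interface, the same code runs whether the processes were started by MPD, by fork, or by a third-party process manager.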

  9. Needs of an MPI Implementation • Point-to-point communication • Can work with polling • Cancelling a send • Requires some agent (interrupt-driven receive, separate thread, guaranteed timer) • Active target RMA • Can work with polling • Performance may require “bypass” • Passive target RMA • Requires some agent • For some operations, the agent may be provided by special hardware capabilities

  10. The Layers • ADI3 • Full-featured interface, closely matched to MPI point-to-point and RMA operations • Most MPI communication routines perform (optional) error checking and then “call” an ADI3 routine • Modular design allows replacement of parts, e.g., datatypes • New Channel Interface • Much smaller than ADI-3, easily implemented on most platforms • Nonblocking design is more robust and efficient than the MPICH-1 version

  11. Expose Structures To All Levels of the Implementation • All MPI opaque objects are defined structs for all levels (ADI, channel, and lower) • All objects have a handle that encodes the type of the object within the handle value • Permits runtime type checking of handles • Null handles are now distinct • Easier detection of misused values • Fortran integer-valued handles simplify the implementation for 64-bit systems • Consistent mechanism to extend definitions to support the needs of particular devices • Defined fields simplify much code • E.g., direct access to the rank and size of communicators
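
As a hypothetical illustration of this encoding (the real MPICH2 bit layout and kind codes are internal details and may differ), a handle might reserve a few high bits for the object kind, letting any level verify a handle before dereferencing it:

    /* Hypothetical handle layout: object kind in the upper bits.
       Field widths and kind codes are assumptions for illustration. */
    #define HANDLE_KIND_SHIFT  28
    #define HANDLE_KIND_MASK   0xF0000000
    #define HANDLE_KIND_COMM   0x1
    #define HANDLE_KIND_DTYPE  0x2

    #define HANDLE_GET_KIND(h) \
        (((unsigned)(h) & HANDLE_KIND_MASK) >> HANDLE_KIND_SHIFT)

    /* Runtime type check before using a communicator handle */
    int comm_handle_valid(int handle)
    {
        return HANDLE_GET_KIND(handle) == HANDLE_KIND_COMM;
    }

Because the handle is a plain integer, the same value works unchanged as a Fortran INTEGER on 64-bit systems, and a null handle can be given its own distinct kind code.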

  12. Special Case:Predefined Objects • Many predefined objects contain all information within the handle • Predefined MPI datatype handles contain • Size in bytes • Fact that the handle is a predefined datatype • No other data needed by most MPI routines • Eliminates extra loads, pointer chasing, and setup at MPI_Init time • Predefined attributes handled separately from general attributes • Special case anyway, since C and Fortran versions are different for the predefined attributes • Other predefined objects initialized only on demand • Handle always valid • Data areas may not be initialized until needed • Example: names (MPI_Type_set_name) on datatypes

  13. Builtin MPI Datatypes • [Bit-layout diagram: the handle packs a code identifying it as a datatype, a builtin flag, an index to a struct, and the size in bytes] • MPI_DATATYPE_NULL has the datatype code in its upper bits • The index is used to implement MPI_Type_set/get_name
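
Continuing the hypothetical layout sketched after slide 11, a builtin-datatype handle can carry its size directly, so MPI_Type_size on a predefined type needs no memory reference at all; the field positions below are assumptions:

    /* Assumed layout for builtin datatype handles:
       [ kind | builtin flag | index to struct | size in bytes ] */
    #define DTYPE_SIZE_MASK    0x000000FF   /* assumed low bits: size  */
    #define DTYPE_BUILTIN_FLAG 0x08000000   /* assumed builtin marker  */

    /* Size of a predefined datatype without touching memory */
    static inline int builtin_dtype_size(int handle)
    {
        return handle & DTYPE_SIZE_MASK;    /* e.g., an int type -> 4 */
    }

This is what eliminates the extra loads, pointer chasing, and MPI_Init-time setup mentioned on the previous slide: most MPI routines never need to follow the index to the struct.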

  14. Channel “CH3” • One possible implementation design for ADI3 • Others possible and underway • Thread safe by design • Requires* atomic operations • Nonblocking design • Requires some completion handle, so • Delayed allocation • Only allocate/initialize if a communication operation did not complete

  15. An Example: CH3 Implementation over TCP • Pollable and active-message data paths • RMA Path

  16. Typical CH3 Routines • Begin messages • CH3_iStartMsg, CH3_iStartMsgv • CH3_iStartRead • CH3_Request_create • Send or receive data using an existing request (e.g., incremental datatype handling, rendezvous) • CH3_iSend, CH3_iSendv • CH3_iWrite • CH3_iRead • Ensure progress • CH3_Progress, CH3_Progress_start, CH3_Progress_end, CH3_Progress_poke, CH3_Progress_signal_completion • CH3_Init, CH3_Finalize, CH3_InitParent

  17. Implementation Sketch

    MPID_Send( ... )
    {
        decide if eager based on message size and flow control
        if (eager) {
            create packet on stack, fill in as eager-send packet
            if (data contiguous) {
                request = CH3_iStartMsgv( iov )
                // request is null if the message was sent immediately
            }
            else {
                create pack buffer, pack data into buffer
                request = CH3_iStartMsgv( iov )
                if (!request)
                    free pack buffer
                else
                    save location of pack buffer in request
            }
        }
        else { // rendezvous
            create packet on stack
            request = CH3_Request_create()
            fill in request
            fill in packet as rendezvous request-to-send (include request id)
            CH3_iSend( request, packet )
        }
        return request
    }

  18. Example Channel/TCP Internal Profiling

  19. Example: Cost of Preallocating Requests

  20. Extra Wrinkles • Some fast networks do not guarantee ordered delivery (probably all of the very fastest) • MPI, fortunately, does not require ordered delivery • Except for message envelopes (message headers are ordered) • What should a channel interface look like? Should it require ordering (streams)?

  21. BlueGene/L

  22. Preserving Message Ordering • Diversion • Consider message matching for trace tools such as Jumpshot. How are arrows drawn? • If communication is single threaded, the trace tool can “replay” the messages to perform matching • For multithreaded (MPI_THREAD_MULTIPLE), the easiest solution is a message sequence number on envelopes • But that’s (almost) all we need to handle unordered message delivery! • Need to count data segments • Need to handle data that arrives before its envelope • Can be handled as a low (but not zero) probability case • Solution: • Provide hooks in the channel device for sequence numbers on envelopes; use these for MPI_THREAD_MULTIPLE and unordered networks • Other wrinkles handled below the channel device • OK, because this does not force ordered, stream-like messaging on the low levels
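
A minimal sketch of envelope matching with sequence numbers (all names here are hypothetical, not the MPICH2 structs): the receiver delivers envelope n only after envelopes 0..n-1 from the same source, parking early arrivals on an out-of-order list.

    /* Hypothetical per-source matching state for an unordered network */
    typedef struct envelope {
        unsigned seq;                /* sender-assigned sequence number */
        struct envelope *next;
        /* ... context id, tag, payload info ... */
    } envelope;

    typedef struct {
        unsigned next_seq;           /* next sequence we may deliver */
        envelope *early;             /* parked out-of-order arrivals */
    } source_state;

    extern void deliver(envelope *e);    /* hand to MPI matching */

    void on_envelope_arrival(source_state *s, envelope *e)
    {
        if (e->seq != s->next_seq) {     /* arrived early: park it */
            e->next = s->early;
            s->early = e;
            return;
        }
        deliver(e);
        s->next_seq++;
        /* drain any parked envelopes that are now in order */
        for (int found = 1; found; ) {
            found = 0;
            for (envelope **pp = &s->early; *pp; pp = &(*pp)->next) {
                if ((*pp)->seq == s->next_seq) {
                    envelope *hit = *pp;
                    *pp = hit->next;
                    deliver(hit);
                    s->next_seq++;
                    found = 1;
                    break;
                }
            }
        }
    }

Since early arrival is a low-probability case, the linear scan of the parked list is acceptable; the common path is a single compare and increment.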

  23. BG/L and the MPICH2 Architecture • [Layer diagram, interface vs. implementation: Message Passing Interface / MPI (types, key values, notion of requests); Abstract Device Interface / MPID (transform to pt2pt ops); Channel Interface / CH3 (request progress engine); transports: TCP/IP and the BG/L torus, tree, and GI networks] • Special opportunities: • collective bypass • scalable buffer mgmnt • out-of-order network

  24. [Call-path diagram on BG/L: the user calls MPI_Send or MPI_Bcast; MPI invokes MPID_Send in the ADI-3 implementation (MPICH2 Abstract Device Interface/MPID); that calls CH3_iStartMsgv in the channel implementation (Channel Interface/CH3); Channel_Write hands off to the Torus Message Layer (Torus_Send) in BG/L software; the Torus Packet Layer reaches the torus hardware via lfpdux() and sfpdx()]

  25. CH3 Summary • Nonblocking interface for correctness and “0 copy” transfers • struct iovec routines to provide “0 copy” for headers and data • Lazy request creation to avoid unnecessary operations when data can be sent immediately (low latency case); routines to reuse requests during incremental transfers • Thread-safe message queue manipulation routines • Supports both polling and preemptive progress
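
For instance, a send path can describe the packet header and the user buffer as two iovec entries, so a TCP channel can hand both to a single gathering write with no staging copy. This is a sketch under assumed names (pkt_header, send_eager), not the CH3 source:

    #include <sys/types.h>
    #include <sys/uio.h>                 /* struct iovec, writev */

    struct pkt_header { int context_id, tag, len; };  /* assumed header */

    /* Gathering write of header + user data: no intermediate copy.
       Caller must handle short writes by retrying the remainder. */
    ssize_t send_eager(int sockfd, struct pkt_header *hdr,
                       void *user_buf, size_t user_len)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;       iov[0].iov_len = sizeof(*hdr);
        iov[1].iov_base = user_buf;  iov[1].iov_len = user_len;
        return writev(sockfd, iov, 2);
    }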

  26. 100Mb Ethernet TCP

  27. 100Mb Ethernet TCP

  28. GigE TCP Results

  29. GigE TCP Results

  30. Enhancing Collective Performance • MPICH-1 algorithms are a combination of purely functional and minimum spanning tree (MST) algorithms • Better algorithms, based on scatter/gather operations, exist for large messages • E.g., see van de Geijn for the 1-D mesh • And better algorithms, based on MST, exist for small messages • Correct implementations must be careful with MPI datatypes • Rajeev Thakur and I have developed and implemented algorithms for switched networks that provide much better performance • The following results are for MPICH-1 and will be in the next MPICH-1 release

  31. Bcast with Scatter/Gather • Implement MPI_Bcast(buf,n,…) as MPI_Scatter(buf, n/p,…, buf+rank*n/p,…) followed by MPI_Allgather(buf+rank*n/p, n/p,…, buf,…) • [Diagram: the scatter and allgather phases across processes P0–P7]
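
A runnable sketch of this broadcast for the simple case where the buffer divides evenly among the p processes; a production version must also handle remainders, general datatypes, and the small-message regime where an MST algorithm wins. MPI_IN_PLACE is used so the scatter and allgather can operate directly on overlapping regions of buf:

    #include <mpi.h>

    /* Broadcast n bytes from rank 0 as scatter + allgather.
       Sketch only: assumes p divides n evenly. */
    void bcast_scatter_allgather(void *buf, int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int chunk = n / p;
        char *mine = (char *)buf + rank * chunk;

        /* Phase 1: root keeps its own chunk in place; each other
           process receives its chunk at offset rank*chunk */
        MPI_Scatter(buf, chunk, MPI_BYTE,
                    rank == 0 ? MPI_IN_PLACE : (void *)mine,
                    chunk, MPI_BYTE, 0, comm);

        /* Phase 2: in-place allgather reassembles the full buffer
           everywhere; each rank contributes buf + rank*chunk */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_BYTE,
                      buf, chunk, MPI_BYTE, comm);
    }

The point of the two-phase scheme is bandwidth: each process sends and receives about 2n bytes total instead of the n*log(p) moved by a plain MST broadcast of a large message.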

  32. Collective Performance

  33. And We’ll Go Faster Yet • MPICH-2 enables additional collective optimizations: • Pipelining of long messages • Store-and-forward of communication buffers eliminates extra copies, particularly for non-contiguous messages • And custom algorithms • Each collective operation may be replaced, on a per-communicator basis, with a separate routine (see the sketch below) • Unlike MPICH-1, each algorithm is contained within the file implementing that particular collective routine, so only the collective routines actually used are loaded
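
The per-communicator replacement can be pictured as a table of function pointers carried by the communicator; the type and field names below (coll_fns, bcast_dispatch, default_bcast) are assumptions for illustration, not the MPICH-2 definitions:

    #include <mpi.h>

    /* Assumed per-communicator table of collective implementations */
    struct coll_fns {
        int (*Bcast)(void *buf, int count, MPI_Datatype dt,
                     int root, MPI_Comm comm);
        int (*Barrier)(MPI_Comm comm);
        /* ... one slot per collective operation ... */
    };

    extern int default_bcast(void *, int, MPI_Datatype, int, MPI_Comm);

    /* Dispatch inside MPI_Bcast: use the override when a device has
       installed one for this communicator, else the default algorithm */
    int bcast_dispatch(struct coll_fns *fns, void *buf, int count,
                       MPI_Datatype dt, int root, MPI_Comm comm)
    {
        if (fns && fns->Bcast)
            return fns->Bcast(buf, count, dt, root, comm);
        return default_bcast(buf, count, dt, root, comm);
    }

On BG/L, for example, such a table is what allows the collective-bypass path: a torus- or tree-specific broadcast can be installed for MPI_COMM_WORLD without touching the point-to-point code.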

  34. Memory Footprint • Processor-rich systems have less memory per node, making lean implementations of libraries important • MPICH-2 addresses the memory “footprint” of the implementation through • Library design and organization that reduce the number of extraneous routines needed only to satisfy (normally unused) references • Don’t even think of a (classic, independent on-demand) shared library approach on 64K nodes • Lazy initialization of objects that are not commonly used • Use of callbacks to bring in code only when needed

  35. Reducing the Memory Footprint • MPI Groups • Groups are rarely used by applications • Groups are not used internally in the implementation of MPI routines, not even within the communicators • Datatype names • MPI-2 allows users to set and get names on all MPI datatypes, including predefined ones • Rather than preinitializing all names during MPI_Init, names are set on first use of MPI_Type_set_name or MPI_Type_get_name • Buffered Send • Proper completion requires checking during MPI_Finalize for pending buffered sends • The first use of MPI_Buffer_attach adds a bsend callback to a list of routines that are called during MPI_Finalize (see the sketch below)
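
A sketch of the finalize-callback idea (the registry and the bsend_flush_pending routine are hypothetical names, not the MPICH2 internals): the first MPI_Buffer_attach registers a cleanup routine, so programs that never use buffered sends pay nothing at MPI_Finalize.

    /* Hypothetical callback registry; bounds checking omitted */
    typedef void (*finalize_cb)(void *state);

    #define MAX_FINALIZE_CB 16
    static struct { finalize_cb fn; void *state; }
        cb_list[MAX_FINALIZE_CB];
    static int cb_count = 0;

    void register_finalize_cb(finalize_cb fn, void *state)
    {
        cb_list[cb_count].fn = fn;
        cb_list[cb_count].state = state;
        cb_count++;
    }

    /* Called from MPI_Finalize: only registered work is done */
    void run_finalize_cbs(void)
    {
        while (cb_count > 0) {
            cb_count--;
            cb_list[cb_count].fn(cb_list[cb_count].state);
        }
    }

    /* The first MPI_Buffer_attach would then do something like:
       register_finalize_cb(bsend_flush_pending, bsend_state);   */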

  36. Thread Safety • Internal APIs are defined to be thread-safe, usually by providing an atomic operation. Locks are not part of the internal API • They may be used to implement the API, but they are not required. Locks are bad but sometimes inevitable • The level of thread safety is settable at configure time and/or run time from the same source tree

  37. Reference Counting MPI Objects • Most MPI objects have reference-count semantics • Free only happens when the reference count is zero • Important for nonblocking (split-phase) operations • Updating the reference count must be atomic • Using a non-atomic update is one of the most common and most difficult-to-find errors in multithreaded programs • MPICH2 uses • MPIU_Object_add_ref(pointer_to_object) • MPIU_Object_release_ref(pointer_to_object,&inuse) • If inuse is false, the object can be freed • Why not use fetch-and-increment? • Not available on IA32 • The above API uses atomic instructions on all common platforms
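
A minimal sketch of that API using C11 atomics (MPICH2 itself predates C11 and uses platform-specific atomic instructions; the object layout here is an assumption):

    #include <stdatomic.h>

    typedef struct {
        atomic_int ref_count;
        /* ... object payload ... */
    } MPIU_Object;

    void MPIU_Object_add_ref(MPIU_Object *p)
    {
        atomic_fetch_add(&p->ref_count, 1);
    }

    void MPIU_Object_release_ref(MPIU_Object *p, int *inuse)
    {
        /* fetch_sub returns the old value: old == 1 means this call
           dropped the last reference and the caller may free it */
        *inuse = (atomic_fetch_sub(&p->ref_count, 1) != 1);
    }

Note that the API reports "still in use" rather than exposing the count, so it can be implemented with whatever atomic primitive the platform offers: compare-and-swap, locked add, or load-linked/store-conditional.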

  38. Managing the Source • Error messages • Short text, long text • Instance-specific text • Tools extract and write error message routines • Source markings for error tests • Leads to… • Testing • Coverage tests • Problem: ignoring error-handling code, particularly for low-likelihood errors • Exploit error messaging macros and comment markers
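
As an illustration of the idea (the macro and report_error helper below are simplified assumptions, not the MPICH2 macros): each error site names a catalog id plus instance-specific arguments, and a tool scans the source for the marker to generate the message catalog and the coverage annotations.

    /* Simplified stand-in for MPICH2's error-reporting macros */
    extern int report_error(int error_class, const char *msgid,
                            const char *file, int line, ...);

    #define MPIR_ERR_SET(err_, class_, msgid_, ...) \
        ((err_) = report_error((class_), (msgid_), \
                               __FILE__, __LINE__, __VA_ARGS__))

    /* A scanner can extract the id and the instance-specific format
       from calls such as:
       MPIR_ERR_SET(mpi_errno, MPI_ERR_ARG, "**bufsize",
                    "Buffer size %d is too small", size);
       deriving the short text from the id and the long text from
       the message catalog. */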

  39. Build Environment • Goal: no replicated data • Approach: solve everything with indirection • Suite of scripts that derive and write the necessary files • Autoconf 2.13/2.52 • “simplemake” creates Makefile.in, understands libraries whose source files are in different directories, and works around make bugs; will create Microsoft Visual Studio project files • “codingcheck” looks for source-code problems • “extracterrmsgs” creates message catalogs • Fortran and C++ interfaces derived from mpi.h

  40. MPI-2 In MPICH-2 • Everything is in progress: • I/O will have a first port (error messaging and generalized requests); later versions will exploit more of the MPICH-2 internals • MPI_Comm_spawn and friends will appear soon, using the PMI interface • RMA will appear over the next year • The challenge is low latency and high performance • The key is to exploit the semantics of the operations and the work on BSP, generalized for MPI

  41. Conclusion • MPICH2 is ready for researchers • New design is cleaner, more flexible, faster • Raises the bar on performance for all MPI implementations
