
CSE 160 – Lecture 16

Presentation Transcript


  1. CSE 160 – Lecture 16 MPI Concepts, Topology and Synchronization

  2. So Far … • We have concentrated on generic message passing algorithms • We have used PVM to implement codes (simpler on workstations) • MPI is a de facto standard for message passing and introduces some important concepts

  3. Message Addressing • Identify an endpoint • Use a tag to distinguish a particular message • pvm_send(dest, tag) • MPI_SEND(buf, count, datatype, dest, tag, comm) • Receiving • recv(src, tag); recv(*, tag); recv(src, *); recv(*, *) • What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
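  A minimal sketch (not from the slides) of this addressing in C: an explicit (dest, tag) on the send, and a wildcard receive that matches any source and any tag and then inspects the status to see which message actually arrived. It assumes the program is run with at least two processes.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, value = 42;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          /* explicit endpoint (dest = 1) and tag (7) */
          MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* wildcard receive: recv(*, *) in the slide's notation */
          MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          printf("got %d from rank %d with tag %d\n",
                 value, status.MPI_SOURCE, status.MPI_TAG);
      }
      MPI_Finalize();
      return 0;
  }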

  4. Level 0 Issues • Basic Pt-2-Pt Message Passing is straightforward, but how does one … • Make it go fast • Eliminate extra memory copies • Take advantage of specialized hardware • Move complex data structures (packing) • Receive from one-of-many (wildcarding) • Synchronize a group of tasks • Recover from errors • Start tasks • Build safe libraries • Monitor tasks • …

  5. MPI-1 addresses many of the level 0 issues (but not all)

  6. A long history of research efforts in message passing • P4 • Chameleon • Parmacs • TCGMSG • CHIMP • NX (Intel i860, Paragon) • PVM • … And these begot MPI

  7. So What is MPI? • It is a standard message-passing API • Specifies many variants of send/recv • 9 send interface calls • E.g., synchronous send, asynchronous send, ready send, asynchronous ready send • Plus other defined APIs • Process topologies • Group operations • Derived datatypes • Profiling API (standard way to instrument MPI code) • Implemented and optimized by machine vendors • Should you use it? Absolutely!
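  As an illustration of two of those send variants, here is an assumed fragment (to be placed between MPI_Init and MPI_Finalize, with matching receives posted on rank 1): a synchronous send, which completes only once the matching receive has started, and a non-blocking send, which returns immediately and must be completed with MPI_Wait.

  int buf = 1;
  MPI_Request req;
  MPI_Status status;

  /* synchronous send: does not complete until the matching receive has begun */
  MPI_Ssend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

  /* non-blocking (asynchronous) send: returns immediately */
  MPI_Isend(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
  /* ... overlap other work here ... */
  MPI_Wait(&req, &status);   /* buf must not be modified until this completes */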

  8. So What’s Missing in MPI-1? • Process control • How do you start 100 tasks? • How do you kill/signal/monitor remote tasks? • I/O • Addressed in MPI-2 • Fault-tolerance • One MPI process dies, the rest eventually hang • Interoperability • No standard set for running a single parallel job across architectures (e.g., cannot split a computation between x86 Linux and Alpha)

  9. Communication Envelope • (src, tag) is enough to distinguish messages; however, there is a problem: • How does one write safe libraries that also use message passing? • Tags used in a library can be the same as tags used in the calling code, so messages could get confused • Message envelope – add another, system-assigned tag to a message • This label defines a message envelope

  10. Envelopes [Diagram: messages labeled (Tag 100, Src A), (Tag 100, Src B), and (Tag 200, Src A) are shown grouped into two different envelopes] • Context tag – messages are received relative to a context

  11. Communicators • Messages ride inside message envelopes, and receives can only match candidate messages relative to an envelope. • MPI implements these envelopes as communicators. • A communicator is more than just an envelope – it also encompasses a group of processes • Communicator = context + group of tasks
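  A common pattern (sketched here, not shown on the slide, with hypothetical function names) is for a library to duplicate the caller's communicator when it initializes: the duplicate has the same group of processes but a fresh context, so the library's messages can never match receives posted by user code, which is exactly the "safe library" problem raised above.

  /* Hypothetical library initializer: lib_comm is private to the library. */
  static MPI_Comm lib_comm;

  void my_lib_init(MPI_Comm user_comm)
  {
      /* same group of processes, but a new communication context */
      MPI_Comm_dup(user_comm, &lib_comm);
  }

  void my_lib_finalize(void)
  {
      MPI_Comm_free(&lib_comm);
  }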

  12. MPI_COMM_WORLD • MPI programs are static – they can never grow or shrink • There is one default communicator called MPI_COMM_WORLD • All processes are members of COMM_WORLD • Each process is labeled 0 – N, for a communicator of size N+1 • Similar to a PVM group.

  13. How do you build new groups? • All new communicators are derived from existing communicators • All communicators have MPI_COMM_WORLD as an “ancestor” • New communicators are formed synchronously • All members of a new communicator call the construction routine at the same time. • MPI_COMM_DUP(), MPI_COMM_SPLIT()
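  A small sketch (assumed, not from the slides) of forming new communicators collectively with MPI_Comm_split: every process in MPI_COMM_WORLD calls it, and processes that pass the same "color" end up together in a new communicator with fresh ranks.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int world_rank, sub_rank;
      MPI_Comm sub_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      /* even-ranked and odd-ranked processes go into separate communicators */
      MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
      MPI_Comm_rank(sub_comm, &sub_rank);
      printf("world rank %d has rank %d in its sub-communicator\n",
             world_rank, sub_rank);

      MPI_Comm_free(&sub_comm);
      MPI_Finalize();
      return 0;
  }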

  14. Communicators as groups [Diagram: six processes in MPI_COMM_WORLD (ranks 0–5) regrouped into smaller communicators, each of which renumbers its members starting from rank 0]

  15. Running MPI Programs • The MPI-1 Standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program. • In general, starting an MPI program is dependent on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables. • mpiexec <args> is part of MPI-2, as a recommendation, but not a requirement • You can use mpiexec for MPICH and mpirun for other systems

  16. Finding Out About the Environment • Two important questions that arise early in a parallel program are: • How many processes are participating in this computation? • Which one am I? • MPI provides functions to answer these questions: • MPI_Comm_size reports the number of processes. • MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
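  A minimal "who am I" program using these two calls (a sketch that should compile against any MPI implementation):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("I am process %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }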

  17. MPI Datatypes • The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as: • predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION) • a contiguous array of MPI datatypes • a strided block of datatypes • an indexed array of blocks of datatypes • an arbitrary structure of datatypes • There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
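  For example, a row of an N x N matrix stored column-major can be described with a strided (vector) datatype. The fragment below is an assumed sketch, meant to sit inside an MPI program with a matching receive posted elsewhere; it sends one row without first copying it into a contiguous buffer.

  /* Assumed fragment: a is a column-major N x N array of doubles,
     so consecutive elements of one row are N doubles apart in memory. */
  #define N 4
  double a[N * N];
  MPI_Datatype row_type;

  /* N blocks of 1 double each, with a stride of N elements between blocks */
  MPI_Type_vector(N, 1, N, MPI_DOUBLE, &row_type);
  MPI_Type_commit(&row_type);

  /* send row 0 to rank 1 with tag 0 (a matching receive is assumed) */
  MPI_Send(&a[0], 1, row_type, 1, 0, MPI_COMM_WORLD);
  MPI_Type_free(&row_type);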

  18. MPI Tags • Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message. • Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive. • Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.

  19. Why Datatypes? • Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication). • Specifying application-oriented layout of data in memory • reduces memory-to-memory copies in the implementation • allows the use of special hardware (scatter/gather) when available • Theoretically simplifies the programmer’s life • How would you build “datatypes” in PVM? With sequences of pvm_pk*() packing calls
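  For comparison, the PVM equivalent of a structured message is a sequence of explicit pack calls into the send buffer before the send. A rough sketch (assumed, not from the slides; dest_tid and msgtag are placeholders assumed to be set elsewhere):

  /* Assumed PVM fragment: pack an int and a float array into the active
     send buffer, then ship it; the receiver must unpack in the same order. */
  int n = 10;
  float x[10];

  pvm_initsend(PvmDataDefault);     /* start a new, portably encoded send buffer */
  pvm_pkint(&n, 1, 1);              /* pack one int, stride 1 */
  pvm_pkfloat(x, n, 1);             /* pack n floats, stride 1 */
  pvm_send(dest_tid, msgtag);       /* dest_tid and msgtag assumed defined */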

  20. Introduction to Collective Operations in MPI • Collective operations are called by all processes in a communicator. • MPI_BCAST distributes data from one process (the root) to all others in a communicator. • How does this differ from pvm_mcast(), pvm_bcast()? • MPI_REDUCE combines data from all processes in a communicator and returns it to one process. • In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.

  21. Example: PI in C - 1
  #include "mpi.h"
  #include <stdio.h>
  #include <math.h>
  int main(int argc, char *argv[])
  {
      int done = 0, n, myid, numprocs, i, rc;
      double PI25DT = 3.141592653589793238462643;
      double mypi, pi, h, sum, x, a;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      while (!done) {
          if (myid == 0) {
              printf("Enter the number of intervals: (0 quits) ");
              scanf("%d", &n);
          }
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
          if (n == 0) break;

  22. Example: PI in C - 2
          h = 1.0 / (double) n;
          sum = 0.0;
          for (i = myid + 1; i <= n; i += numprocs) {
              x = h * ((double)i - 0.5);
              sum += 4.0 / (1.0 + x*x);
          }
          mypi = h * sum;
          MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (myid == 0)
              printf("pi is approximately %.16f, Error is %.16f\n",
                     pi, fabs(pi - PI25DT));
      }
      MPI_Finalize();
      return 0;
  }

  23. Sources of Deadlocks • Send a large message from process 0 to process 1 • If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive) • What happens with: Process 0: Send(1), Recv(1); Process 1: Send(0), Recv(0) • This is called “unsafe” because it depends on the availability of system buffers (PVM does buffering)
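  The "unsafe" exchange from the slide, written out as an assumed fragment (rank, sbuf, rbuf, COUNT, and status are declared elsewhere): if neither MPI_Send can complete without buffering, both processes block in the send and never reach their receive.

  /* Both ranks send first, then receive: deadlocks if the sends are not buffered. */
  if (rank == 0) {
      MPI_Send(sbuf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(rbuf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
  } else if (rank == 1) {
      MPI_Send(sbuf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      MPI_Recv(rbuf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
  }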

  24. Some Solutions to the “unsafe” Problem • Order the operations more carefully: Process 0: Send(1), Recv(1); Process 1: Recv(0), Send(0) • Use non-blocking operations: Process 0: Isend(1), Irecv(1), Waitall; Process 1: Isend(0), Irecv(0), Waitall
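  The same exchange made safe with non-blocking operations (a sketch under the same assumptions as the previous fragment): both transfers are posted first and completed together, so neither process can block the other.

  int partner = 1 - rank;   /* assumes exactly two ranks exchanging data */
  MPI_Request reqs[2];
  MPI_Status  stats[2];

  /* post both operations, then wait for both to complete */
  MPI_Isend(sbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(rbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, stats);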

  25. Extending the Message-Passing Interface • Dynamic Process Management • Dynamic process startup • Dynamic establishment of connections • One-sided communication • Put/get • Other operations • Parallel I/O • Other MPI-2 features • Generalized requests • Bindings for C++/ Fortran-90; interlanguage issues

  26. When to use MPI • Portability and Performance • Irregular Data Structures • Building Tools for Others • Libraries • Need to manage memory on a per-processor basis

  27. When not to use MPI • Regular computation matches HPF • But see PETSc/HPF comparison (ICASE 97-72) • Solution (e.g., library) already exists • http://www.mcs.anl.gov/mpi/libraries.html • Require Fault Tolerance • Sockets • Distributed Computing • CORBA, DCOM, etc.

  28. Synchronization • Messages serve two purposes: • Moving data • Synchronizing processes • When does synchronization occur for • Blocking send? • Non-blocking send? • Dependencies on implementation

  29. Common Synchronization Constructs • Barrier • All processes stop, no information passed • Broadcast • All processes get information from the root • How does this differ from a barrier? • Ideas on how to implement an MPI_BCAST? (see the sketch below) • Reduction • Data is reduced (summed, multiplied, counted, …) • Ideas on how to implement MPI_REDUCE?
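  One possible answer to the MPI_BCAST question, sketched with point-to-point calls (an illustration only, not how any particular implementation works): a binomial-tree broadcast in which every process that already holds the data forwards it each round, so the number of holders doubles per step and the broadcast takes about log2(P) rounds.

  /* Hypothetical tree broadcast of one int from rank 0 over a communicator of
     size p: in the round with offset "step", ranks below step forward the
     value to rank + step. Assumed to be called by every process in comm. */
  void tree_bcast(int *value, MPI_Comm comm)
  {
      int rank, p, step;
      MPI_Status status;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &p);

      for (step = 1; step < p; step *= 2) {
          if (rank < step && rank + step < p) {
              MPI_Send(value, 1, MPI_INT, rank + step, 0, comm);
          } else if (rank >= step && rank < 2 * step) {
              MPI_Recv(value, 1, MPI_INT, rank - step, 0, comm, &status);
          }
      }
  }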

  30. Synchronization Trees • The topology of the messages that implement a reduction is important. • The LogP lecture illustrated a tree that was neither linear nor binary • The major idea was to maintain concurrency • MPI defines many operations as (potentially) synchronizing, so that implementers have a choice in how to optimize them.
