
CSE 160 – Lecture 16

Presentation Transcript


  1. CSE 160 – Lecture 16 MPI Concepts, Topology and Synchronization

  2. So Far … • We have concentrated on generic message passing algorithms • We have used PVM to implement codes (simpler on workstations) • MPI is a de facto standard for message passing and introduces some important concepts

  3. Message Addressing • Identify an endpoint • Use a tag to distinguish a particular message • pvm_send(dest, tag) • MPI_SEND(buf, count, datatype, dest, tag, comm) • Receiving • recv(src, tag); recv(*, tag); recv(src, *); recv(*, *) • What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
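  A minimal sketch (not from the slides) of this addressing in C: an explicit (dest, tag) on the send, and a wildcard receive that matches any source and any tag and then inspects the status to see which message actually arrived. It assumes the program is run with at least two processes.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, value = 42;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          /* explicit endpoint (dest = 1) and tag (7) */
          MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* wildcard receive: recv(*, *) in the slide's notation */
          MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &status);
          printf("got %d from rank %d with tag %d\n",
                 value, status.MPI_SOURCE, status.MPI_TAG);
      }
      MPI_Finalize();
      return 0;
  }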

  4. Level 0 Issues • Basic Pt-2-Pt Message Passing is straightforward, but how does one … • Make it go fast • Eliminate extra memory copies • Take advantage of specialized hardware • Move complex data structures (packing) • Receive from one-of-many (wildcarding) • Synchronize a group of tasks • Recover from errors • Start tasks • Build safe libraries • Monitor tasks • …

  5. MPI-1 addresses many of the level 0 issues (but not all)

  6. A long history of research efforts in message passing • P4 • Chameleon • Parmacs • TCGMSG • CHIMP • NX (Intel i860, Paragon) • PVM • … And these begot MPI

  7. So What is MPI? • It is a standard message-passing API • Specifies many variants of send/recv • 9 send interface calls • E.g., synchronous send, asynchronous send, ready send, asynchronous ready send • Plus other defined APIs • Process topologies • Group operations • Derived datatypes • Profiling API (standard way to instrument MPI code) • Implemented and optimized by machine vendors • Should you use it? Absolutely!
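  As an illustration of two of those send variants, here is an assumed fragment (to be placed between MPI_Init and MPI_Finalize, with matching receives posted on rank 1): a synchronous send, which completes only once the matching receive has started, and a non-blocking send, which returns immediately and must be completed with MPI_Wait.

  int buf = 1;
  MPI_Request req;
  MPI_Status status;

  /* synchronous send: does not complete until the matching receive has begun */
  MPI_Ssend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

  /* non-blocking (asynchronous) send: returns immediately */
  MPI_Isend(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
  /* ... overlap other work here ... */
  MPI_Wait(&req, &status);   /* buf must not be modified until this completes */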

  8. So What’s Missing in MPI-1? • Process control • How do you start 100 tasks? • How do you kill/signal/monitor remote tasks? • I/O • Addressed in MPI-2 • Fault-tolerance • One MPI process dies, the rest eventually hang • Interoperability • No standard set for running a single parallel job across architectures (e.g., cannot split a computation between x86 Linux and Alpha)

  9. Communication Envelope • (src, tag) is enough to distinguish messages; however, there is a problem: • How does one write safe libraries that also use message passing? • Tags used in a library can be the same as tags used in the calling code, so messages could get confused • Message envelope – add another, system-assigned tag to a message • This label defines a message envelope

  10. Envelopes [Diagram: messages labeled (Tag 100, Src A), (Tag 100, Src B), and (Tag 200, Src A) are shown grouped into two different envelopes] • Context tag – messages are received relative to a context

  11. Communicators • Messages ride inside message envelopes, and receives can only match candidate messages relative to an envelope. • MPI implements these envelopes as communicators. • A communicator is more than just an envelope – it also encompasses a group of processes • Communicator = context + group of tasks
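  A common pattern (sketched here, not shown on the slide, with hypothetical function names) is for a library to duplicate the caller's communicator when it initializes: the duplicate has the same group of processes but a fresh context, so the library's messages can never match receives posted by user code, which is exactly the "safe library" problem raised above.

  /* Hypothetical library initializer: lib_comm is private to the library. */
  static MPI_Comm lib_comm;

  void my_lib_init(MPI_Comm user_comm)
  {
      /* same group of processes, but a new communication context */
      MPI_Comm_dup(user_comm, &lib_comm);
  }

  void my_lib_finalize(void)
  {
      MPI_Comm_free(&lib_comm);
  }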

  12. MPI_COMM_WORLD • MPI programs are static – they can never grow or shrink • There is one default communicator called MPI_COMM_WORLD • All processes are members of COMM_WORLD • Each process is labeled 0 – N, for a communicator of size N+1 • Similar to a PVM group.

  13. How do you build new groups? • All new communicators are derived from existing communicators • All communicators have MPI_COMM_WORLD as an “ancestor” • New communicators are formed synchronously • All members of a new communicator call the construction routine at the same time. • MPI_COMM_DUP(), MPI_COMM_SPLIT()
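  A small sketch (assumed, not from the slides) of forming new communicators collectively with MPI_Comm_split: every process in MPI_COMM_WORLD calls it, and processes that pass the same "color" end up together in a new communicator with fresh ranks.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int world_rank, sub_rank;
      MPI_Comm sub_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      /* even-ranked and odd-ranked processes go into separate communicators */
      MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
      MPI_Comm_rank(sub_comm, &sub_rank);
      printf("world rank %d has rank %d in its sub-communicator\n",
             world_rank, sub_rank);

      MPI_Comm_free(&sub_comm);
      MPI_Finalize();
      return 0;
  }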

  14. Communicators as groups [Diagram: six processes in MPI_COMM_WORLD (ranks 0–5) regrouped into smaller communicators, each of which renumbers its members starting from rank 0]

  15. Running MPI Programs • The MPI-1 Standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program. • In general, starting an MPI program is dependent on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables. • mpiexec <args> is part of MPI-2, as a recommendation, but not a requirement • You can use mpiexec for MPICH and mpirun for other systems

  16. Finding Out About the Environment • Two important questions that arise early in a parallel program are: • How many processes are participating in this computation? • Which one am I? • MPI provides functions to answer these questions: • MPI_Comm_size reports the number of processes. • MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
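  A minimal "who am I" program using these two calls (a sketch that should compile against any MPI implementation):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("I am process %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }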

  17. MPI Datatypes • The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as: • predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION) • a contiguous array of MPI datatypes • a strided block of datatypes • an indexed array of blocks of datatypes • an arbitrary structure of datatypes • There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
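  For example, a row of an N x N matrix stored column-major can be described with a strided (vector) datatype. The fragment below is an assumed sketch, meant to sit inside an MPI program with a matching receive posted elsewhere; it sends one row without first copying it into a contiguous buffer.

  /* Assumed fragment: a is a column-major N x N array of doubles,
     so consecutive elements of one row are N doubles apart in memory. */
  #define N 4
  double a[N * N];
  MPI_Datatype row_type;

  /* N blocks of 1 double each, with a stride of N elements between blocks */
  MPI_Type_vector(N, 1, N, MPI_DOUBLE, &row_type);
  MPI_Type_commit(&row_type);

  /* send row 0 to rank 1 with tag 0 (a matching receive is assumed) */
  MPI_Send(&a[0], 1, row_type, 1, 0, MPI_COMM_WORLD);
  MPI_Type_free(&row_type);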

  18. MPI Tags • Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message. • Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive. • Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.

  19. Why Datatypes? • Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication). • Specifying application-oriented layout of data in memory • reduces memory-to-memory copies in the implementation • allows the use of special hardware (scatter/gather) when available • Theoretically simplifies the programmer’s life • How would you build “datatypes” in PVM? With sequences of pvm_pk*() packing calls
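  For comparison, the PVM equivalent of a structured message is a sequence of explicit pack calls into the send buffer before the send. A rough sketch (assumed, not from the slides; dest_tid and msgtag are placeholders assumed to be set elsewhere):

  /* Assumed PVM fragment: pack an int and a float array into the active
     send buffer, then ship it; the receiver must unpack in the same order. */
  int n = 10;
  float x[10];

  pvm_initsend(PvmDataDefault);     /* start a new, portably encoded send buffer */
  pvm_pkint(&n, 1, 1);              /* pack one int, stride 1 */
  pvm_pkfloat(x, n, 1);             /* pack n floats, stride 1 */
  pvm_send(dest_tid, msgtag);       /* dest_tid and msgtag assumed defined */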

  20. Introduction to Collective Operations in MPI • Collective operations are called by all processes in a communicator. • MPI_BCAST distributes data from one process (the root) to all others in a communicator. • How does this differ from pvm_mcast(), pvm_bcast()? • MPI_REDUCE combines data from all processes in a communicator and returns it to one process. • In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.

  21. Example: PI in C - 1
  #include "mpi.h"
  #include <stdio.h>
  #include <math.h>
  int main(int argc, char *argv[])
  {
      int done = 0, n, myid, numprocs, i, rc;
      double PI25DT = 3.141592653589793238462643;
      double mypi, pi, h, sum, x, a;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);
      while (!done) {
          if (myid == 0) {
              printf("Enter the number of intervals: (0 quits) ");
              scanf("%d", &n);
          }
          MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
          if (n == 0) break;

  22. Example: PI in C - 2
          h = 1.0 / (double) n;
          sum = 0.0;
          for (i = myid + 1; i <= n; i += numprocs) {
              x = h * ((double)i - 0.5);
              sum += 4.0 / (1.0 + x*x);
          }
          mypi = h * sum;
          MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (myid == 0)
              printf("pi is approximately %.16f, Error is %.16f\n",
                     pi, fabs(pi - PI25DT));
      }
      MPI_Finalize();
      return 0;
  }

  23. Sources of Deadlocks • Send a large message from process 0 to process 1 • If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive) • What happens with: Process 0: Send(1), Recv(1); Process 1: Send(0), Recv(0) • This is called “unsafe” because it depends on the availability of system buffers (PVM does buffering)
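  The "unsafe" exchange from the slide, written out as an assumed fragment (rank, sbuf, rbuf, COUNT, and status are declared elsewhere): if neither MPI_Send can complete without buffering, both processes block in the send and never reach their receive.

  /* Both ranks send first, then receive: deadlocks if the sends are not buffered. */
  if (rank == 0) {
      MPI_Send(sbuf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(rbuf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
  } else if (rank == 1) {
      MPI_Send(sbuf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      MPI_Recv(rbuf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
  }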

  24. Some Solutions to the “unsafe” Problem • Order the operations more carefully: Process 0: Send(1), Recv(1); Process 1: Recv(0), Send(0) • Use non-blocking operations: Process 0: Isend(1), Irecv(1), Waitall; Process 1: Isend(0), Irecv(0), Waitall
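  The same exchange made safe with non-blocking operations (a sketch under the same assumptions as the previous fragment): both transfers are posted first and completed together, so neither process can block the other.

  int partner = 1 - rank;   /* assumes exactly two ranks exchanging data */
  MPI_Request reqs[2];
  MPI_Status  stats[2];

  /* post both operations, then wait for both to complete */
  MPI_Isend(sbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(rbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, stats);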

  25. Extending the Message-Passing Interface • Dynamic Process Management • Dynamic process startup • Dynamic establishment of connections • One-sided communication • Put/get • Other operations • Parallel I/O • Other MPI-2 features • Generalized requests • Bindings for C++/ Fortran-90; interlanguage issues

  26. When to use MPI • Portability and Performance • Irregular Data Structures • Building Tools for Others • Libraries • Need to manage memory on a per-processor basis

  27. When not to use MPI • Regular computation matches HPF • But see PETSc/HPF comparison (ICASE 97-72) • Solution (e.g., library) already exists • http://www.mcs.anl.gov/mpi/libraries.html • Require Fault Tolerance • Sockets • Distributed Computing • CORBA, DCOM, etc.

  28. Synchronization • Messages serve two purposes: • Moving data • Synchronizing processes • When does synchronization occur for • Blocking send? • Non-blocking send? • Dependencies on implementation

  29. Common Synchronization Constructs • Barrier • All processes stop, no information passed • Broadcast • All processes get information from the root • How does this differ from a barrier? • Ideas on how to implement an MPI_BCAST? (see the sketch below) • Reduction • Data is reduced (summed, multiplied, counted, …) • Ideas on how to implement MPI_REDUCE?
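  One possible answer to the MPI_BCAST question, sketched with point-to-point calls (an illustration only, not how any particular implementation works): a binomial-tree broadcast in which every process that already holds the data forwards it each round, so the number of holders doubles per step and the broadcast takes about log2(P) rounds.

  /* Hypothetical tree broadcast of one int from rank 0 over a communicator of
     size p: in the round with offset "step", ranks below step forward the
     value to rank + step. Assumed to be called by every process in comm. */
  void tree_bcast(int *value, MPI_Comm comm)
  {
      int rank, p, step;
      MPI_Status status;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &p);

      for (step = 1; step < p; step *= 2) {
          if (rank < step && rank + step < p) {
              MPI_Send(value, 1, MPI_INT, rank + step, 0, comm);
          } else if (rank >= step && rank < 2 * step) {
              MPI_Recv(value, 1, MPI_INT, rank - step, 0, comm, &status);
          }
      }
  }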

  30. Synchronization Trees • The topology of the messages that implement a reduction is important. • The LogP lecture illustrated a tree that was neither linear nor binary • The major idea was to maintain concurrency • MPI defines many operations as (potentially) synchronizing, so that implementers have a choice in how to optimize them.
