Presentation Transcript


  1. Standard Description Performance

  2. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  3. Message passing (1) • (diagram: N nodes, each with its own CPU and RAM, connected by a network) • Problem: • We have N nodes • All nodes are connected by a network • How can we exploit the global machine formed by these N nodes?

  4. Message passing (2) • One answer: message passing • Execute one process per processor • Explicitly exchange data between processes • Explicitly synchronize the different processes • Two types of data transfer: • Only one process initiates the communication: ‘one-sided’ • Both processes cooperate in the communication: ‘cooperative’

  5. Two types of data transfer • ‘One-sided’ communications • No rendez-vous protocol • The target process is not warned when its local memory is read or written • Costly synchronization • Function prototypes: • put(remote_process, data) • get(remote_process, data) • Cooperative communications • The communication involves both processes • Implicit synchronization in the simple case • Function prototypes: • send(destination, data) • recv(source, data) • (diagram: put()/get() versus send()/recv() between processes)

  6. MPI (Message Passing Interface) • Standard developed by academic and industrial partners • Objective: to specify a portable message-passing library • Relies on an execution environment for launching and connecting all the processes • Provides: • Synchronous and asynchronous communications • Global communications • Separate communication domains

  7. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions (example HelloWorld_MPI.c) • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  8. MPI Programming Structure • Follows the SPMD programming model • All processes are launched at the same time • Same program for every process • Processes differentiate their roles by a rank number • Typical structure: sequential section, MPI initialization, parallel initialization, computation, communications, synchronization, end of parallel section, sequential section • (diagram: non-parallel section, parallel section initialization, multinode parallel section (MPI), parallel section termination) • Remark: most implementations advise limiting the final sequential section (after MPI termination) to the exit call

  9. Basic functions • MPI environment initialization • C : MPI_Init(&argc, &argv); • Fortran : call MPI_Init(ierror) • MPI environment termination (programs are recommended to exit right after this call) • C : MPI_Finalize(); • Fortran : call MPI_Finalize(ierror) • Getting the process rank • C : MPI_Comm_rank(MPI_COMM_WORLD, &rank); • Fortran : call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror) • Getting the total number of processes • C : MPI_Comm_size(MPI_COMM_WORLD, &size); • Fortran : call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)

  10. HelloWorld_MPI.c

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rang, nprocs;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rang);   /* rank of this process */
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* total number of processes */
          printf("hello, I am %d (of %d processes)\n", rang, nprocs);
          MPI_Finalize();
          return 0;
      }

  11. MPI data types

      C types:
        MPI_CHAR            signed char
        MPI_SHORT           signed short int
        MPI_INT             signed int
        MPI_LONG            signed long int
        MPI_UNSIGNED_CHAR   unsigned char
        MPI_UNSIGNED_SHORT  unsigned short int
        MPI_UNSIGNED        unsigned int
        MPI_UNSIGNED_LONG   unsigned long int
        MPI_FLOAT           float
        MPI_DOUBLE          double
        MPI_LONG_DOUBLE     long double
        MPI_BYTE, MPI_PACKED

      Fortran types:
        MPI_INTEGER           INTEGER
        MPI_REAL              REAL
        MPI_DOUBLE_PRECISION  DOUBLE PRECISION
        MPI_COMPLEX           COMPLEX
        MPI_LOGICAL           LOGICAL
        MPI_CHARACTER         CHARACTER(1)
        MPI_BYTE, MPI_PACKED

  12. User data types • By default, MPI exchanges data as arrays of basic MPI datatypes • It is possible to create user-defined datatypes to simplify communication operations (avoiding manual buffering and linearization) • User datatypes replace the obsolete MPI_PACK mechanism • A user datatype consists of a sequence of basic types and a sequence of memory offsets (displacements) • Registration: MPI_Type_commit(&type); • Destruction: MPI_Type_free(&type); • (see the sketch below)
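
  A minimal sketch of such a datatype for a C struct, using MPI_Type_create_struct (MPI-2); the struct particle_t and all names below are illustrative, not taken from the slides:

      #include <mpi.h>
      #include <stddef.h>                     /* offsetof */

      typedef struct { int id; double x[3]; } particle_t;

      void build_particle_type(MPI_Datatype *newtype)
      {
          int          blocklens[2] = { 1, 3 };
          MPI_Aint     disps[2]     = { offsetof(particle_t, id),
                                        offsetof(particle_t, x) };
          MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

          /* sequence of basic types + sequence of offsets, as described above */
          MPI_Type_create_struct(2, blocklens, disps, types, newtype);
          MPI_Type_commit(newtype);           /* register the type before use */
      }

      /* usage: one particle per message, e.g.
         MPI_Send(&p, 1, particle_type, dest, tag, MPI_COMM_WORLD);
         then MPI_Type_free(&particle_type) when done */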

  13. Contexts and tags • We need to distinguish different messages at reception • Contexts distinguish, for instance, a point-to-point communication from a global communication • Every message is sent within a context and must be received in the same context • Contexts are automatically managed by MPI • Communication tags identify one communication among several • When communications are asynchronous, tags make it possible to match them • For reception, the next message regardless of its tag can be accepted by specifying MPI_ANY_TAG • Tag management is up to the MPI programmer (see the sketch below)
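
  A small sketch of a tag-agnostic receive, assuming an int payload; buf and the printf are illustrative:

      int buf;
      MPI_Status status;

      /* accept the next matching message, whatever its tag or sender */
      MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);

      /* the actual tag and sender are reported in the status object */
      printf("received tag %d from rank %d\n", status.MPI_TAG, status.MPI_SOURCE);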

  14. Communication domains • Processes can be grouped in a communication domain called a communicator • Every process has a rank number in each group it belongs to • MPI_COMM_WORLD is the default communication domain; it gathers all processes and is created at initialization • More generally, every operation applies to a single set of processes specified by its communicator • Each domain constitutes a distinct communication context

  15. Split a communicator (1/2): groups • To create a new domain, first you have to create a new group of processes: • int MPI_Comm_group(MPI_Comm comm, MPI_Group *group); • int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup); • int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup); • Set of operations on the groups: • int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • Destruction of a group: • int MPI_Group_free(MPI_Group *group) ;
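
  A sketch of building a sub-group containing only the even ranks of MPI_COMM_WORLD; MAX_PROCS and the variable names are illustrative:

      #define MAX_PROCS 1024                 /* illustrative upper bound on nprocs */

      MPI_Group world_group, even_group;
      int ranks[MAX_PROCS], nprocs, i, n_even = 0;

      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);

      for (i = 0; i < nprocs; i += 2)
          ranks[n_even++] = i;               /* ranks to keep in the new group */

      MPI_Group_incl(world_group, n_even, ranks, &even_group);
      /* even_group can now be turned into a communicator (next slide),
         and released later with MPI_Group_free(&even_group) */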

  16. Split a communicator (2/2): communicators • Associating a communicator with a group: • int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm) ; • Dividing a domain into sub-domains: • int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm) ; • MPI_Comm_split is a collective operation on the initial communicator comm • Every process gives a color; all processes with the same color end up in the same newcomm • The MPI_UNDEFINED color lets a process stay out of any new communicator • Every process gives a key; processes with the same color are ranked in newcomm according to these keys • A group is implicitly created for each communicator created this way • Communicator destruction: • int MPI_Comm_free(MPI_Comm *comm) ; • (see the sketch below)
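
  A minimal MPI_Comm_split sketch, putting even and odd ranks in separate sub-communicators while keeping their relative order:

      int rank;
      MPI_Comm newcomm;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* color = rank % 2 selects the sub-domain, key = rank keeps the ordering */
      MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);

      /* ... collective operations restricted to newcomm ... */

      MPI_Comm_free(&newcomm);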

  17. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications (example Jeton.c) • Asynchronous communications • Global communications (example trace.c) • MPI-2 • One-sided communications • I/O • Dynamicity

  18. Point-to-point communications • Send and receive data between a pair of processes • Both processes take part in the communication: one sends the data, the other posts the matching receive • Communications are identified by tags • The type and the size of the data must be specified

  19. Basic communication functions • Blocking send (synchronous between the computation and the send operation; the call returns when the send buffer can be reused): • int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) ; • The tag identifies the message • Blocking receive: • int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) ; • The tag must match the tag of the sent message • MPI_ANY_SOURCE can be specified to receive from any process

  20. Jeton.c (a token circulating around a ring: 0 → 1 → 2 → … → np-1 → 0)

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int me, prec, suiv, np;
          int jeton = 0;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          /* ring topology: predecessor and successor of this rank */
          if (me == 0) prec = np - 1; else prec = me - 1;
          if (me == np - 1) suiv = 0; else suiv = me + 1;

          /* rank 0 injects the token */
          if (me == 0)
              MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);

          while (1) {   /* the token circulates around the ring forever */
              MPI_Recv(&jeton, 1, MPI_INT, prec, 0, MPI_COMM_WORLD, &status);
              MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);
          }

          MPI_Finalize();   /* never reached in this example */
          return 0;
      }

  21. Synchronism and asynchronism (1) • To solve some deadlocks, and to allow overlapping communications with computation, one can use non-blocking functions • In that case, the communication scheme is the following: • Initiation of the non-blocking communication (by one or both of the processes) • The matching call (blocking or non-blocking) is issued by the other process • … computation … • Completion of the communication (blocking operation until the transfer is performed)

  22. Synchronism and asynchronism (2) • Non-blocking functions: • int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request); • int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) ; • The request handle tracks the state of a non-blocking communication. To wait for its completion, one can call: • int MPI_Wait(MPI_Request *request, MPI_Status *status) ; • (see the sketch below)
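
  A sketch of an overlapped exchange with a partner rank, assuming sendbuf, recvbuf, COUNT and partner are defined elsewhere; MPI_Waitall is a standard convenience variant of MPI_Wait not shown on the slide:

      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

      /* ... useful computation here, overlapped with the transfers ... */

      MPI_Waitall(2, reqs, stats);   /* block until both communications complete */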

  23. Synchronism and asynchronism (3) • Data can be exchanged by blocking or non-blocking functions; several functions control how the send and the receive operations are coupled • The communication mode is selected by a prefix letter on the send call (MPI_[*]send): • Synchronous send ([S], MPI_Ssend): finishes when the corresponding receive is posted (strongly coupled to the reception, no buffering) • Buffered send ([B], MPI_Bsend): a buffer is used; the send ends when the user buffer has been copied to the system buffer (not coupled to the reception) • Standard send (MPI_Send): the send ends when the emission buffer can be reused (the MPI implementation decides between buffering and coupling to the reception) • Ready send ([R], MPI_Rsend): the user guarantees that the receive is already posted when this function is called (coupled to the reception, no buffering) • (see the sketch below)
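
  A buffered-mode sketch, assuming COUNT, data, dest and tag are defined elsewhere; the buffer sizing follows the usual MPI_Pack_size + MPI_BSEND_OVERHEAD recipe:

      #include <stdlib.h>

      int   bufsize;
      char *buf;

      MPI_Pack_size(COUNT, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
      bufsize += MPI_BSEND_OVERHEAD;         /* per-message overhead required by MPI */
      buf = malloc(bufsize);
      MPI_Buffer_attach(buf, bufsize);       /* system buffer used by MPI_Bsend */

      MPI_Bsend(data, COUNT, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

      MPI_Buffer_detach(&buf, &bufsize);     /* waits until buffered messages are sent */
      free(buf);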

  24. Collective or global operations • To simplify communication operations involving multiple processes, one can use collective operations on a communicator • Typical operations: • Reductions • Data exchange: • Broadcast • Scatter • Gather • All-to-All • Explicit synchronization

  25. Reductions (1) • A reduction is an arithmetic operation performed across data distributed over a set of processes • Prototype : • C : int MPI_Reduce(void * sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator); • Fortran : MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror) • With MPI_Reduce(), only the root process gets the result • With MPI_Allreduce(), all processes get the result (see the sketch below)
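
  A short sketch of the all-reduce variant; compute_partial_sum() is a hypothetical local computation:

      double local_sum  = compute_partial_sum();   /* hypothetical local work */
      double global_sum = 0.0;

      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      /* every rank now holds the same global_sum */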

  26. Reductions (2) • Available operations:

      MPI_MIN      Minimum
      MPI_MAX      Maximum
      MPI_SUM      Sum
      MPI_PROD     Element-by-element product
      MPI_LAND     Logical AND
      MPI_BAND     Bitwise AND
      MPI_LOR      Logical OR
      MPI_BOR      Bitwise OR
      MPI_LXOR     Logical exclusive OR
      MPI_BXOR     Bitwise exclusive OR
      MPI_MINLOC   Minimum and its location
      MPI_MAXLOC   Maximum and its location

  27. Broadcast • A broadcast distributes the same data to all processes • One-to-all communication, from a specified ‘root’ process to all processes of a communicator • Prototypes : • C : int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); • Fortran : MPI_Bcast(buffer, count, datatype, root, communicator, ierror) • (diagram: the buffer on root = 1 is copied to ranks 0 … np-1) • (see the sketch below)
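
  A minimal sketch: the root sets a run parameter and broadcasts it; nsteps, root and rank are illustrative:

      int root = 0;
      int nsteps = 0;

      if (rank == root)
          nsteps = 1000;                     /* e.g. read from a configuration file */

      /* after the call, every rank holds the same value of nsteps */
      MPI_Bcast(&nsteps, 1, MPI_INT, root, MPI_COMM_WORLD);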

  28. Scatter • One-to-all operation: a different piece of data is sent to each receiver process according to its rank • Prototypes : • C : int MPI_Scatter(void * sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator); • Fortran : MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror) • The ‘send’ parameters are only significant on the sender (root) process • (diagram: sendbuf on root = 2 split into pieces delivered to the recvbuf of ranks 0 … np-1)

  29. Gather • All-to-one operation: a different piece of data is received from each process by one receiver process • Prototypes : • C : int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator); • Fortran : MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror) • The ‘receive’ parameters are only significant on the receiver (root) process • (diagram: the sendbuf of ranks 0 … np-1 collected into the recvbuf of root = 3) • (see the sketch below)
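
  A gather sketch where each rank contributes one double and root collects them in rank order; rank, nprocs and root are assumed to be already set:

      #include <stdlib.h>

      double  mine = (double)rank;           /* illustrative per-rank contribution */
      double *all  = NULL;

      if (rank == root)                      /* recv parameters only matter on root */
          all = malloc(nprocs * sizeof(double));

      MPI_Gather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
      /* on root, all[i] now holds the value sent by rank i */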

  30. All-to-All • All-to-all operation: each process sends a different piece of data to every other process, according to their ranks • There is no root process; prototypes : • C : int MPI_Alltoall(void * sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator); • Fortran : MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror) • (diagram: block i of each sendbuf goes to block of rank i’s recvbuf)

  31. Explicit Synchronization • Synchronization barrier: all processes of a communicator wait for the last one to enter the barrier before continuing their execution • On machines with a hardware barrier (such as SGI machines and the Cray T3E), the MPI barrier is slower than the hardware one • Prototype • C : int MPI_Barrier(MPI_Comm communicator); • Fortran : MPI_Barrier(communicator, ierror)

  32. Matrix trace (1) • Computing the trace of an N x N matrix A • The trace is the sum of the diagonal elements of a square matrix • The sum can easily be split over multiple processes, with a final reduction producing the complete trace

  33. Matrix trace (2.1)

      #include <stdio.h>
      #include <mpi.h>

      #define N 128                  /* illustrative size; suppose N = tranche * np */

      int main(int argc, char **argv)
      {
          int me, np, root = 0;
          int i, tranche;
          double A[N][N];
          double buffer[N], diag[N];
          double traceA, trace_loc;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          tranche = N / np;

          /* Initialization of A done by process 0 */
          /* ... */

          /* buffering the diagonal elements on the root process */
          if (me == root) {
              for (i = 0; i < N; i++)
                  buffer[i] = A[i][i];
          }

          /* the scatter distributes the buffered elements among the processes */
          MPI_Scatter(buffer, tranche, MPI_DOUBLE,
                      diag,   tranche, MPI_DOUBLE, root, MPI_COMM_WORLD);

  34. Matrix trace (2.2)

          /* Each process computes its partial trace */
          trace_loc = 0.0;
          for (i = 0; i < tranche; i++)
              trace_loc += diag[i];

          /* Then we do the reduction */
          MPI_Reduce(&trace_loc, &traceA, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

          if (me == root)
              printf("The trace of A is: %f\n", traceA);

          MPI_Finalize();
          return 0;
      }

  35. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  36. One-sided communications (1/2) • No synchronization of the two processes during communications • Allows emulating shared memory (Remote Memory Access) • Defining the part of memory other processes may access: • MPI_Win_create() • MPI_Win_free() • One-sided communication functions: • MPI_Put() • MPI_Get() • MPI_Accumulate() • Operations for MPI_Accumulate(): MPI_SUM, MPI_LAND, MPI_REPLACE

  37. One-sided communications (2/2) • Active synchronization function • MPI_Win_fence() • Takes a memory window win as parameter • Collective operation (barrier) over all processes of the group attached to win (MPI_Win_get_group()) • Acts as a synchronization barrier that completes every RMA transfer using the window win • Passive synchronization functions • MPI_Win_lock() and MPI_Win_unlock() • Classic lock/unlock (mutex-style) functions • The initiator of the communication is solely responsible for the synchronization • When MPI_Win_unlock() returns, every transfer operation is finished • (see the sketch below)
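
  A fence-based RMA sketch, assuming COUNT and rank are defined elsewhere: a window is created over a local array and rank 0 writes into rank 1’s copy:

      double local[COUNT];                   /* memory exposed to the other processes */
      MPI_Win win;

      MPI_Win_create(local, COUNT * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);                 /* open the access epoch (active target) */
      if (rank == 0)
          /* write COUNT doubles into rank 1's window, starting at displacement 0 */
          MPI_Put(local, COUNT, MPI_DOUBLE, 1, 0, COUNT, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);                 /* close the epoch: all transfers completed */

      MPI_Win_free(&win);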

  38. Parallel Input/Output • Intelligent management of I/O is mandatory for parallel applications • MPI-IO is a set of functions for optimised I/O • It extends the classical file access functions with: • Collective file accesses • Shared or individual file offsets • Blocking or non-blocking reads and writes • Views (for accessing non-contiguous file regions) • The syntax is similar to the MPI communication functions (see the sketch below)
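
  A sketch where each rank writes its own block of doubles at a rank-dependent offset in a shared file; the file name, COUNT and localbuf are illustrative:

      MPI_File   fh;
      MPI_Status status;
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);

      MPI_File_open(MPI_COMM_WORLD, "data.out",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* collective write: every rank participates, each at its own offset */
      MPI_File_write_at_all(fh, offset, localbuf, COUNT, MPI_DOUBLE, &status);

      MPI_File_close(&fh);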

  39. Dynamic allocation of processes • Dynamic change of the number of processes • Spawning new processes during execution • The MPI_Comm_spawn() function creates a new set of processes on other processors • An inter-communicator links the parent domain to the new domain gathering the spawned processes • The MPI_Intercomm_merge() function builds a single intra-communicator out of an inter-communicator • MPI-2 allows a dynamic MPMD style with MPI_Comm_spawn_multiple() • Querying the MPI_UNIVERSE_SIZE attribute with MPI_Comm_get_attr() gives the maximum possible number of MPI processes • Process destruction • There is no explicit function for killing an MPI process • For an MPI process to exit, its MPI_COMM_WORLD communicator must contain only finalizing processes • All inter-communicators must be closed before finalization • (see the sketch below)
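
  A spawning sketch: the parent launches 4 copies of a worker executable and merges the resulting inter-communicator; "worker" is an illustrative program name:

      MPI_Comm intercomm, merged;
      int errcodes[4];

      MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                     0 /* root */, MPI_COMM_WORLD, &intercomm, errcodes);

      /* build a single intra-communicator containing parents and children;
         high = 0 places the parent group first in the new ordering */
      MPI_Intercomm_merge(intercomm, 0, &merged);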

  40. Remarks and conclusion • MPI has become, thanks to the distributed computing community, a standard library for message passing • MPI-2 goes beyond the classic SPMD message-passing model of MPI-1 • Numerous implementations exist, for most architectures • Plenty of documentation and publications are available

  41. Some pointers • MPI standard official site • http://www-unix.mcs.anl.gov/mpi/ • The MPI forum • http://www.mpi-forum.org/ • Book: MPI, The Complete Reference (Marc Snir et al.) • http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

  42. Standard Description Performance

  43. Contents • MPI implementation • Performance metrics • High performance networks • Communication type / 0-copy

  44. MPI implementations • LAM-MPI • Optimised for collective operations • MPICH • Easy writing of new low-level drivers • Open MPI • Tries to combine the performance and the ease of use of the two previous ones • Conforms to MPI-2 • IBM / NEC / FUJITSU… • Complete and high-performance MPI-2 implementations • Target specific architectures

  45. Performance metrics • Comparison criteria • Latency • Bandwidth • Collective operations • Overlapping capabilities • Real applications • Measuring tools • Round-trip time (ping-pong) • NetPipe • NAS benchmarks • CG • LU • BT • FT

  46. High performance networks (1/3) Technologies • Myrinet • Connectionless, reliable API • Registered buffers • Fully programmable DMA NIC processor • Up to 2 Gb/s full-duplex bandwidth with Myrinet 2000 • SCINet • Torus-topology network with static routing • No need to register buffers • Very low latency (suitable for RMA) • Up to 2 Gb/s • Gigabit Ethernet • No need to register buffers • DMA operations • High latency • Up to 1 Gb/s and 10 Gb/s bandwidth • Infiniband • Reliable Connection mode and Unreliable Datagram mode • Registered buffers • Queued DMA operations • Up to 10 Gb/s bandwidth

  47. High performance networks (2/3) Technologies • Myrinet • Socket-GM • MPICH-GM • SCINet • No functional socket API • SCI-MPICH • Gigabit Ethernet • Must use the socket interface • Infiniband • IPoIB • LAM-MPI, MPICH, MPI/pro, etc.

  48. High performance networks (3/3) Technologies

  49. Eager vs Rendez-vous (1/2) • Eager protocol • The message is sent without any control • Better latency • Copied into a buffer if the receiver has not posted the reception yet • Memory-consuming for long messages • Used only for short messages (typically < 64 KB) • Rendez-vous protocol • Sender and receiver are synchronized • Higher latency • Zero-copy • Better bandwidth • Reduces the memory consumption

  50. Eager vs Rendez-vous (2/2)
