Presentation Transcript


  1. Standard Description Performance

  2. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  3. Message passing (1) • (diagram: N nodes, each with its own CPU and RAM, connected by a network) • Problem: • We have N nodes • All nodes are connected by a network • How can we exploit the global machine formed by these N nodes?

  4. Message passing (2) • One answer: message passing • Execute one process per processor • Explicitly exchange data between processes • Explicitly synchronize the different processes • Two types of data transfer: • Only one process initiates the communication: ‘one-sided’ • Both processes cooperate in the communication: ‘cooperative’

  5. Two types of data transfer • ‘One-sided’ communications • No rendez-vous protocol • The target process is not warned when its local memory is read or written • Costly synchronization • Function prototypes: • put(remote_process, data) • get(remote_process, data) • Cooperative communications • The communication involves both processes • Implicit synchronization in the simple case • Function prototypes: • send(destination, data) • recv(source, data) • (diagram: put()/get() versus send()/recv() between processes)

  6. MPI (Message Passing Interface) • Standard developed by academic and industrial partners • Objective: to specify a portable message-passing library • Relies on an execution environment for launching and connecting all the processes • Provides: • Synchronous and asynchronous communications • Global communications • Separate communication domains

  7. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions (example HelloWorld_MPI.c) • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  8. MPI Programming Structure • Follows the SPMD programming model • All processes are launched at the same time • Same program for every process • Processes differentiate their roles by a rank number • Typical structure: sequential section, MPI initialization, parallel initialization, computation, communications, synchronization, end of parallel section, sequential section • (diagram: non-parallel section, parallel section initialization, multinode parallel section (MPI), parallel section termination) • Remark: most implementations advise limiting the final sequential section (after MPI termination) to the exit call

  9. Basic functions • MPI environment initialization • C : MPI_Init(&argc, &argv); • Fortran : call MPI_Init(ierror) • MPI environment termination (programs are recommended to exit right after this call) • C : MPI_Finalize(); • Fortran : call MPI_Finalize(ierror) • Getting the process rank • C : MPI_Comm_rank(MPI_COMM_WORLD, &rank); • Fortran : call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror) • Getting the total number of processes • C : MPI_Comm_size(MPI_COMM_WORLD, &size); • Fortran : call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)

  10. HelloWorld_MPI.c

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rang, nprocs;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rang);   /* rank of this process */
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* total number of processes */
          printf("hello, I am %d (of %d processes)\n", rang, nprocs);
          MPI_Finalize();
          return 0;
      }

  11. MPI data types

      C types:
        MPI_CHAR            signed char
        MPI_SHORT           signed short int
        MPI_INT             signed int
        MPI_LONG            signed long int
        MPI_UNSIGNED_CHAR   unsigned char
        MPI_UNSIGNED_SHORT  unsigned short int
        MPI_UNSIGNED        unsigned int
        MPI_UNSIGNED_LONG   unsigned long int
        MPI_FLOAT           float
        MPI_DOUBLE          double
        MPI_LONG_DOUBLE     long double
        MPI_BYTE, MPI_PACKED

      Fortran types:
        MPI_INTEGER           INTEGER
        MPI_REAL              REAL
        MPI_DOUBLE_PRECISION  DOUBLE PRECISION
        MPI_COMPLEX           COMPLEX
        MPI_LOGICAL           LOGICAL
        MPI_CHARACTER         CHARACTER(1)
        MPI_BYTE, MPI_PACKED

  12. User data types • By default, MPI exchanges data as arrays of basic MPI datatypes • It is possible to create user-defined datatypes to simplify communication operations (avoiding manual buffering and linearization) • User datatypes replace the obsolete MPI_PACK mechanism • A user datatype consists of a sequence of basic types and a sequence of memory offsets (displacements) • Registration: MPI_Type_commit(&type); • Destruction: MPI_Type_free(&type); • (see the sketch below)
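
  A minimal sketch of such a datatype for a C struct, using MPI_Type_create_struct (MPI-2); the struct particle_t and all names below are illustrative, not taken from the slides:

      #include <mpi.h>
      #include <stddef.h>                     /* offsetof */

      typedef struct { int id; double x[3]; } particle_t;

      void build_particle_type(MPI_Datatype *newtype)
      {
          int          blocklens[2] = { 1, 3 };
          MPI_Aint     disps[2]     = { offsetof(particle_t, id),
                                        offsetof(particle_t, x) };
          MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

          /* sequence of basic types + sequence of offsets, as described above */
          MPI_Type_create_struct(2, blocklens, disps, types, newtype);
          MPI_Type_commit(newtype);           /* register the type before use */
      }

      /* usage: one particle per message, e.g.
         MPI_Send(&p, 1, particle_type, dest, tag, MPI_COMM_WORLD);
         then MPI_Type_free(&particle_type) when done */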

  13. Contexts and tags • We need to distinguish different messages at reception • Contexts distinguish, for instance, a point-to-point communication from a global communication • Every message is sent within a context and must be received in the same context • Contexts are automatically managed by MPI • Communication tags identify one communication among several • When communications are asynchronous, tags make it possible to match them • For reception, the next message regardless of its tag can be accepted by specifying MPI_ANY_TAG • Tag management is up to the MPI programmer (see the sketch below)
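
  A small sketch of a tag-agnostic receive, assuming an int payload; buf and the printf are illustrative:

      int buf;
      MPI_Status status;

      /* accept the next matching message, whatever its tag or sender */
      MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);

      /* the actual tag and sender are reported in the status object */
      printf("received tag %d from rank %d\n", status.MPI_TAG, status.MPI_SOURCE);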

  14. Communication domains • Processes can be grouped in a communication domain called a communicator • Every process has a rank number in each group it belongs to • MPI_COMM_WORLD is the default communication domain; it gathers all processes and is created at initialization • More generally, every operation applies to a single set of processes specified by its communicator • Each domain constitutes a distinct communication context

  15. Split a communicator (1/2): groups • To create a new domain, first you have to create a new group of processes: • int MPI_Comm_group(MPI_Comm comm, MPI_Group *group); • int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup); • int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup); • Set of operations on the groups: • int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr) ; • Destruction of a group: • int MPI_Group_free(MPI_Group *group) ;
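
  A sketch of building a sub-group containing only the even ranks of MPI_COMM_WORLD; MAX_PROCS and the variable names are illustrative:

      #define MAX_PROCS 1024                 /* illustrative upper bound on nprocs */

      MPI_Group world_group, even_group;
      int ranks[MAX_PROCS], nprocs, i, n_even = 0;

      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);

      for (i = 0; i < nprocs; i += 2)
          ranks[n_even++] = i;               /* ranks to keep in the new group */

      MPI_Group_incl(world_group, n_even, ranks, &even_group);
      /* even_group can now be turned into a communicator (next slide),
         and released later with MPI_Group_free(&even_group) */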

  16. Split a communicator (2/2): communicators • Associating a communicator with a group: • int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm) ; • Dividing a domain into sub-domains: • int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm) ; • MPI_Comm_split is a collective operation on the initial communicator comm • Every process gives a color; all processes with the same color end up in the same newcomm • The MPI_UNDEFINED color lets a process stay out of any new communicator • Every process gives a key; processes with the same color are ranked in newcomm according to these keys • A group is implicitly created for each communicator created this way • Communicator destruction: • int MPI_Comm_free(MPI_Comm *comm) ; • (see the sketch below)
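
  A minimal MPI_Comm_split sketch, putting even and odd ranks in separate sub-communicators while keeping their relative order:

      int rank;
      MPI_Comm newcomm;

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      /* color = rank % 2 selects the sub-domain, key = rank keeps the ordering */
      MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);

      /* ... collective operations restricted to newcomm ... */

      MPI_Comm_free(&newcomm);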

  17. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications (example Jeton.c) • Asynchronous communications • Global communications (example trace.c) • MPI-2 • One-sided communications • I/O • Dynamicity

  18. Point-to-point communications • Send and receive data between a pair of processes • Both processes take part in the communication: one sends the data, the other posts the matching receive • Communications are identified by tags • The type and the size of the data must be specified

  19. Basic communication functions • Blocking send (synchronous between the computation and the send operation; the call returns when the send buffer can be reused): • int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) ; • The tag identifies the message • Blocking receive: • int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) ; • The tag must match the tag of the sent message • MPI_ANY_SOURCE can be specified to receive from any process

  20. Jeton.c (a token circulating around a ring: 0 → 1 → 2 → … → np-1 → 0)

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int me, prec, suiv, np;
          int jeton = 0;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          /* ring topology: predecessor and successor of this rank */
          if (me == 0) prec = np - 1; else prec = me - 1;
          if (me == np - 1) suiv = 0; else suiv = me + 1;

          /* rank 0 injects the token */
          if (me == 0)
              MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);

          while (1) {   /* the token circulates around the ring forever */
              MPI_Recv(&jeton, 1, MPI_INT, prec, 0, MPI_COMM_WORLD, &status);
              MPI_Send(&jeton, 1, MPI_INT, suiv, 0, MPI_COMM_WORLD);
          }

          MPI_Finalize();   /* never reached in this example */
          return 0;
      }

  21. Synchronism and asynchronism (1) • To solve some deadlocks, and to allow overlapping communications with computation, one can use non-blocking functions • In that case, the communication scheme is the following: • Initiation of the non-blocking communication (by one or both of the processes) • The matching call (blocking or non-blocking) is issued by the other process • … computation … • Completion of the communication (blocking operation until the transfer is performed)

  22. Synchronism and asynchronism (2) • Non-blocking functions: • int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request); • int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) ; • The request handle tracks the state of a non-blocking communication. To wait for its completion, one can call: • int MPI_Wait(MPI_Request *request, MPI_Status *status) ; • (see the sketch below)
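
  A sketch of an overlapped exchange with a partner rank, assuming sendbuf, recvbuf, COUNT and partner are defined elsewhere; MPI_Waitall is a standard convenience variant of MPI_Wait not shown on the slide:

      MPI_Request reqs[2];
      MPI_Status  stats[2];

      MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

      /* ... useful computation here, overlapped with the transfers ... */

      MPI_Waitall(2, reqs, stats);   /* block until both communications complete */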

  23. Synchronism and asynchronism (3) • Data can be exchanged by blocking or non-blocking functions; several functions control how the send and the receive operations are coupled • The communication mode is selected by a prefix letter on the send call (MPI_[*]send): • Synchronous send ([S], MPI_Ssend): finishes when the corresponding receive is posted (strongly coupled to the reception, no buffering) • Buffered send ([B], MPI_Bsend): a buffer is used; the send ends when the user buffer has been copied to the system buffer (not coupled to the reception) • Standard send (MPI_Send): the send ends when the emission buffer can be reused (the MPI implementation decides between buffering and coupling to the reception) • Ready send ([R], MPI_Rsend): the user guarantees that the receive is already posted when this function is called (coupled to the reception, no buffering) • (see the sketch below)
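
  A buffered-mode sketch, assuming COUNT, data, dest and tag are defined elsewhere; the buffer sizing follows the usual MPI_Pack_size + MPI_BSEND_OVERHEAD recipe:

      #include <stdlib.h>

      int   bufsize;
      char *buf;

      MPI_Pack_size(COUNT, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
      bufsize += MPI_BSEND_OVERHEAD;         /* per-message overhead required by MPI */
      buf = malloc(bufsize);
      MPI_Buffer_attach(buf, bufsize);       /* system buffer used by MPI_Bsend */

      MPI_Bsend(data, COUNT, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

      MPI_Buffer_detach(&buf, &bufsize);     /* waits until buffered messages are sent */
      free(buf);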

  24. Collective or global operations • To simplify communication operations involving multiple processes, one can use collective operations on a communicator • Typical operations: • Reductions • Data exchange: • Broadcast • Scatter • Gather • All-to-All • Explicit synchronization

  25. Reductions (1) • A reduction is an arithmetic operation performed across data distributed over a set of processes • Prototype : • C : int MPI_Reduce(void * sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator); • Fortran : MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror) • With MPI_Reduce(), only the root process gets the result • With MPI_Allreduce(), all processes get the result (see the sketch below)
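
  A short sketch of the all-reduce variant; compute_partial_sum() is a hypothetical local computation:

      double local_sum  = compute_partial_sum();   /* hypothetical local work */
      double global_sum = 0.0;

      MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      /* every rank now holds the same global_sum */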

  26. Reductions (2) • Available operations:

      MPI_MIN      Minimum
      MPI_MAX      Maximum
      MPI_SUM      Sum
      MPI_PROD     Element-by-element product
      MPI_LAND     Logical AND
      MPI_BAND     Bitwise AND
      MPI_LOR      Logical OR
      MPI_BOR      Bitwise OR
      MPI_LXOR     Logical exclusive OR
      MPI_BXOR     Bitwise exclusive OR
      MPI_MINLOC   Minimum and its location
      MPI_MAXLOC   Maximum and its location

  27. Broadcast • A broadcast distributes the same data to all processes • One-to-all communication, from a specified ‘root’ process to all processes of a communicator • Prototypes : • C : int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); • Fortran : MPI_Bcast(buffer, count, datatype, root, communicator, ierror) • (diagram: the buffer on root = 1 is copied to ranks 0 … np-1) • (see the sketch below)
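
  A minimal sketch: the root sets a run parameter and broadcasts it; nsteps, root and rank are illustrative:

      int root = 0;
      int nsteps = 0;

      if (rank == root)
          nsteps = 1000;                     /* e.g. read from a configuration file */

      /* after the call, every rank holds the same value of nsteps */
      MPI_Bcast(&nsteps, 1, MPI_INT, root, MPI_COMM_WORLD);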

  28. Scatter • One-to-all operation: a different piece of data is sent to each receiver process according to its rank • Prototypes : • C : int MPI_Scatter(void * sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator); • Fortran : MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror) • The ‘send’ parameters are only significant on the sender (root) process • (diagram: sendbuf on root = 2 split into pieces delivered to the recvbuf of ranks 0 … np-1)

  29. Gather • All-to-one operation: a different piece of data is received from each process by one receiver process • Prototypes : • C : int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator); • Fortran : MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror) • The ‘receive’ parameters are only significant on the receiver (root) process • (diagram: the sendbuf of ranks 0 … np-1 collected into the recvbuf of root = 3) • (see the sketch below)
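
  A gather sketch where each rank contributes one double and root collects them in rank order; rank, nprocs and root are assumed to be already set:

      #include <stdlib.h>

      double  mine = (double)rank;           /* illustrative per-rank contribution */
      double *all  = NULL;

      if (rank == root)                      /* recv parameters only matter on root */
          all = malloc(nprocs * sizeof(double));

      MPI_Gather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, root, MPI_COMM_WORLD);
      /* on root, all[i] now holds the value sent by rank i */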

  30. All-to-All • All-to-all operation: each process sends a different piece of data to every other process, according to their ranks • There is no root process; prototypes : • C : int MPI_Alltoall(void * sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator); • Fortran : MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror) • (diagram: block i of each sendbuf goes to block of rank i’s recvbuf)

  31. Explicit Synchronization • Synchronization barrier: all processes of a communicator wait for the last one to enter the barrier before continuing their execution • On machines with a hardware barrier (such as SGI machines and the Cray T3E), the MPI barrier is slower than the hardware one • Prototype • C : int MPI_Barrier(MPI_Comm communicator); • Fortran : MPI_Barrier(communicator, ierror)

  32. Matrix trace (1) • Computing the trace of an N x N matrix A • The trace is the sum of the diagonal elements of a square matrix • The sum can easily be split over multiple processes, with a final reduction producing the complete trace

  33. Matrix trace (2.1)

      #include <stdio.h>
      #include <mpi.h>

      #define N 128                  /* illustrative size; suppose N = tranche * np */

      int main(int argc, char **argv)
      {
          int me, np, root = 0;
          int i, tranche;
          double A[N][N];
          double buffer[N], diag[N];
          double traceA, trace_loc;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &me);
          MPI_Comm_size(MPI_COMM_WORLD, &np);

          tranche = N / np;

          /* Initialization of A done by process 0 */
          /* ... */

          /* buffering the diagonal elements on the root process */
          if (me == root) {
              for (i = 0; i < N; i++)
                  buffer[i] = A[i][i];
          }

          /* the scatter distributes the buffered elements among the processes */
          MPI_Scatter(buffer, tranche, MPI_DOUBLE,
                      diag,   tranche, MPI_DOUBLE, root, MPI_COMM_WORLD);

  34. Matrix trace (2.2)

          /* Each process computes its partial trace */
          trace_loc = 0.0;
          for (i = 0; i < tranche; i++)
              trace_loc += diag[i];

          /* Then we do the reduction */
          MPI_Reduce(&trace_loc, &traceA, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

          if (me == root)
              printf("The trace of A is: %f\n", traceA);

          MPI_Finalize();
          return 0;
      }

  35. Contents • Introduction to MPI • Message passing • Different types of communication • MPI functionalities • MPI structures • Basic functions • Data types • Contexts and tags • Groups and communication domains • Communication functions • Point-to-point communications • Asynchronous communications • Global communications • MPI-2 • One-sided communications • I/O • Dynamicity

  36. One-sided communications (1/2) • No synchronization of the two processes during communications • Allows emulating shared memory (Remote Memory Access) • Defining the part of memory other processes may access: • MPI_Win_create() • MPI_Win_free() • One-sided communication functions: • MPI_Put() • MPI_Get() • MPI_Accumulate() • Operations for MPI_Accumulate(): MPI_SUM, MPI_LAND, MPI_REPLACE

  37. One-sided communications (2/2) • Active synchronization function • MPI_Win_fence() • Takes a memory window win as parameter • Collective operation (barrier) over all processes of the group attached to win (MPI_Win_get_group()) • Acts as a synchronization barrier that completes every RMA transfer using the window win • Passive synchronization functions • MPI_Win_lock() and MPI_Win_unlock() • Classic lock/unlock (mutex-style) functions • The initiator of the communication is solely responsible for the synchronization • When MPI_Win_unlock() returns, every transfer operation is finished • (see the sketch below)
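
  A fence-based RMA sketch, assuming COUNT and rank are defined elsewhere: a window is created over a local array and rank 0 writes into rank 1’s copy:

      double local[COUNT];                   /* memory exposed to the other processes */
      MPI_Win win;

      MPI_Win_create(local, COUNT * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);                 /* open the access epoch (active target) */
      if (rank == 0)
          /* write COUNT doubles into rank 1's window, starting at displacement 0 */
          MPI_Put(local, COUNT, MPI_DOUBLE, 1, 0, COUNT, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);                 /* close the epoch: all transfers completed */

      MPI_Win_free(&win);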

  38. Parallel Input/Output • Intelligent management of I/O is mandatory for parallel applications • MPI-IO is a set of functions for optimised I/O • It extends the classical file access functions with: • Collective file accesses • Shared or individual file offsets • Blocking or non-blocking reads and writes • Views (for accessing non-contiguous file regions) • The syntax is similar to the MPI communication functions (see the sketch below)
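
  A sketch where each rank writes its own block of doubles at a rank-dependent offset in a shared file; the file name, COUNT and localbuf are illustrative:

      MPI_File   fh;
      MPI_Status status;
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);

      MPI_File_open(MPI_COMM_WORLD, "data.out",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* collective write: every rank participates, each at its own offset */
      MPI_File_write_at_all(fh, offset, localbuf, COUNT, MPI_DOUBLE, &status);

      MPI_File_close(&fh);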

  39. Dynamic allocation of processes • Dynamic change of the number of processes • Spawning new processes during execution • The MPI_Comm_spawn() function creates a new set of processes on other processors • An inter-communicator links the parent domain to the new domain gathering the spawned processes • The MPI_Intercomm_merge() function builds a single intra-communicator out of an inter-communicator • MPI-2 allows a dynamic MPMD style with MPI_Comm_spawn_multiple() • Querying the MPI_UNIVERSE_SIZE attribute with MPI_Comm_get_attr() gives the maximum possible number of MPI processes • Process destruction • There is no explicit function for killing an MPI process • For an MPI process to exit, its MPI_COMM_WORLD communicator must contain only finalizing processes • All inter-communicators must be closed before finalization • (see the sketch below)
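
  A spawning sketch: the parent launches 4 copies of a worker executable and merges the resulting inter-communicator; "worker" is an illustrative program name:

      MPI_Comm intercomm, merged;
      int errcodes[4];

      MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                     0 /* root */, MPI_COMM_WORLD, &intercomm, errcodes);

      /* build a single intra-communicator containing parents and children;
         high = 0 places the parent group first in the new ordering */
      MPI_Intercomm_merge(intercomm, 0, &merged);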

  40. Remarks and conclusion • MPI has become, thanks to the distributed computing community, a standard library for message passing • MPI-2 goes beyond the classic SPMD message-passing model of MPI-1 • Numerous implementations exist, for most architectures • Plenty of documentation and publications are available

  41. Some pointers • MPI standard official site • http://www-unix.mcs.anl.gov/mpi/ • The MPI forum • http://www.mpi-forum.org/ • Book: MPI, The Complete Reference (Marc Snir et al.) • http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

  42. Standard Description Performance

  43. Contents • MPI implementation • Performance metrics • High performance networks • Communication type / 0-copy

  44. MPI implementations • LAM-MPI • Optimised for collective operations • MPICH • Easy writing of new low-level drivers • Open MPI • Tries to combine the performance and the ease of use of the two previous ones • Conforms to MPI-2 • IBM / NEC / FUJITSU… • Complete and high-performance MPI-2 implementations • Target specific architectures

  45. Performance metrics • Comparison criteria • Latency • Bandwidth • Collective operations • Overlapping capabilities • Real applications • Measuring tools • Round-trip time (ping-pong) • NetPipe • NAS benchmarks • CG • LU • BT • FT

  46. High performance networks (1/3) Technologies • Myrinet • Connectionless, reliable API • Registered buffers • Fully programmable DMA NIC processor • Up to 2 Gb/s full-duplex bandwidth with Myrinet 2000 • SCINet • Torus-topology network with static routing • No need to register buffers • Very low latency (suitable for RMA) • Up to 2 Gb/s • Gigabit Ethernet • No need to register buffers • DMA operations • High latency • Up to 1 Gb/s and 10 Gb/s bandwidth • Infiniband • Reliable Connection mode and Unreliable Datagram mode • Registered buffers • Queued DMA operations • Up to 10 Gb/s bandwidth

  47. High performance networks (2/3) Technologies • Myrinet • Socket-GM • MPICH-GM • SCINet • No functional socket API • SCI-MPICH • Gigabit Ethernet • Must use the socket interface • Infiniband • IPoIB • LAM-MPI, MPICH, MPI/pro, etc.

  48. High performance networks (3/3) Technologies

  49. Eager vs Rendez-vous (1/2) • Eager protocol • The message is sent without any control • Better latency • Copied into a buffer if the receiver has not posted the reception yet • Memory-consuming for long messages • Used only for short messages (typically < 64 KB) • Rendez-vous protocol • Sender and receiver are synchronized • Higher latency • Zero-copy • Better bandwidth • Reduces the memory consumption

  50. Eager vs Rendez-vous (2/2)
