
Crash Course In Message Passing Interface


Presentation Transcript


  1. Crash Course In Message Passing Interface Adam Simpson NCCS User Assistance

  2. Interactive job on mars
  $ qsub -I -A UT-NTNLEDU -l walltime=02:00:00,size=16
  $ module swap PrgEnv-cray PrgEnv-gnu
  $ cd /lustre/medusa/$USER
  Once in the interactive job you may submit MPI jobs using aprun

  3. Terminology • Process • Instance of program being executed • Has own virtual address space (memory) • Often referred to as a task • Message • Data to be sent from one process to another • Message Passing • The act of transferring a message from one process to another

  4. Why message passing? • Traditional distributed memory systems [Diagram: four nodes, each with its own CPU and memory, connected by a network]

  5. Why message passing? • Traditional distributed memory systems [Diagram: the same four nodes; one CPU-plus-memory unit is labeled as a node]

  6. Why message passing? • Traditional distributed memory systems • Pass data between processes running on different nodes [Diagram: messages passed over the network between nodes]

  7. Why message passing? • Modern hybrid shared/distributed memory systems [Diagram: four nodes, each with several CPUs sharing one memory, connected by a network]

  8. Why message passing? • Modern hybrid shared/distributed memory systems • Need support for inter/intranode communication [Diagram: multi-CPU nodes communicating both within a node and across the network]

  9. What is MPI? • A message passing library specification • Message Passing Interface • Community driven • Forum composed of academic and industry researchers • De facto standard for distributed message passing • Available on nearly all modern HPC resources • Currently on 3rd major specification release • Several open and proprietary implementations • Primarily C/Fortran implementations • OpenMPI, MPICH, MVAPICH, Cray MPT, Intel MPI, etc. • Bindings available for Python, R, Ruby, Java, etc.

  10. What about Threads and Accelerators? • MPI complements shared memory models • Hybrid method • MPI between nodes, threads/accelerators on node • Accelerator support • Pass accelerator memory addresses to MPI functions • DMA from interconnect to GPU memory
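A minimal sketch of the hybrid setup described above: MPI_Init_thread replaces MPI_Init and lets the program request a level of thread support. The thread-level constants are standard MPI; everything else here is illustrative.

  #include "stdio.h"
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int provided;
      /* Request MPI_THREAD_FUNNELED: only the main thread will make MPI calls.
         The library reports the level it actually provides. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      if (provided < MPI_THREAD_FUNNELED)
          printf("Requested thread support level not available\n");
      /* ... spawn threads / launch accelerator kernels here ... */
      MPI_Finalize();
      return 0;
  }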

  11. MPI Functions • MPI consists of hundreds of functions • Most users will only use a handful • All functions prefixed with MPI_ • C functions return an integer error • MPI_SUCCESS if no error • Fortran subroutines take an integer error as the last argument • MPI_SUCCESS if no error
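A minimal sketch of checking one of these return codes (by default MPI errors abort the job, so explicit checks like this are optional; the handling shown is illustrative):

  int err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (err != MPI_SUCCESS) {
      /* Terminate all ranks in the communicator with the error code. */
      MPI_Abort(MPI_COMM_WORLD, err);
  }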

  12. More terminology • Communicator • An object that represents a group of processes that can communicate with each other • MPI_Comm type • Predefined communicator: MPI_COMM_WORLD • Rank • Within a communicator each process is given a unique integer ID • Ranks start at 0 and are incremented contiguously • Size • The total number of ranks in a communicator

  13. Basic functions int MPI_Init(int *argc, char ***argv) • argc • Pointer to the number of arguments • argv • Pointer to the argument vector • Initializes the MPI environment • Must be called before any other MPI call

  14. Basic functions int MPI_Finalize(void) • Cleans up the MPI environment • Call right before exiting the program • Calling MPI functions after MPI_Finalize is undefined

  15. Basic functions int MPI_Comm_rank(MPI_Comm comm, int *rank) • comm is an MPI communicator • Usually MPI_COMM_WORLD • rank • Will be set to the rank of the calling process in the communicator comm

  16. Basic functions int MPI_Comm_size(MPI_Comm comm, int *size) • comm is an MPI communicator • Usually MPI_COMM_WORLD • size • Will be set to the number of ranks in the communicator comm

  17. Hello MPI_COMM_WORLD
  #include "stdio.h"
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);

      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      printf("Hello from rank %d of %d total\n", rank, size);

      MPI_Finalize();

      return 0;
  }

  18. Hello MPI_COMM_WORLD • Compile • The mpicc wrapper is used to link in MPI libraries and includes • On a Cray system use cc instead • Uses a standard C compiler, such as gcc, under the hood $ cc hello_mpi.c -o hello.exe

  19. Hello MPI_COMM_WORLD • Run • mpirun -n # ./a.out launches # copies of a.out • On a Cray system use aprun -n # ./a.out • Ability to control mapping of processes onto hardware
  $ aprun -n 3 ./hello.exe
  Hello from rank 0 of 3 total
  Hello from rank 2 of 3 total
  Hello from rank 1 of 3 total

  20. Hello MPI_COMM_WORLD • Run • mpirun -n # ./a.out launches # copies of a.out • On a Cray system use aprun -n # ./a.out • Ability to control mapping of processes onto hardware
  $ aprun -n 3 ./hello.exe
  Hello from rank 0 of 3 total
  Hello from rank 2 of 3 total
  Hello from rank 1 of 3 total
  Note the order

  21. MPI_Datatype • Many MPI functions require a datatype • Not all datatypes are contiguous in memory • User defined datatypes possible • Built in types for all intrinsic C types • MPI_INT, MPI_FLOAT, MPI_DOUBLE, … • Built in types for all intrinsic Fortran types • MPI_INTEGER, MPI_REAL, MPI_COMPLEX,…
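As an illustration of a user-defined datatype, a minimal sketch using MPI_Type_contiguous, one of the standard type constructors (the name triple is illustrative):

  /* Describe three contiguous doubles as a single datatype. */
  MPI_Datatype triple;
  MPI_Type_contiguous(3, MPI_DOUBLE, &triple);
  MPI_Type_commit(&triple);   /* a constructed type must be committed before use */
  /* It can now be used anywhere a built-in type is expected, e.g.
     MPI_Send(buf, 1, triple, dest, tag, MPI_COMM_WORLD); */
  MPI_Type_free(&triple);     /* release the type when done */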

  22. Point to Point • Point to Point routines • Involves two and only two processes • One process explicitly initiates send operation • One process explicitly initiates receive operation • Several send/receive flavors available • Blocking/non-blocking • Buffered/non-buffered • Combined send-receive • Basis for more complicated messaging

  23. Point to Point int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) • buf • Initial address of send buffer • count • Number of elements to send • datatype • Datatype of each element in send buffer • dest • Rank of destination • tag • Integer tag used by receiver to identify message • comm • Communicator • Returns after the send buffer is ready to reuse • The message may not yet have been received on the destination rank

  24. Point to Point int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) • buf • Initial address of receive buffer • count • Maximum number of elements that can be received • datatype • Datatype of each element in receive buffer • source • Rank of source • tag • Integer tag used to identify message • comm • Communicator • status • Struct containing information on the received message • Returns after the receive buffer is ready to reuse

  25. Point to Point: try 1
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int rank, other_rank, send_buff, recv_buff, tag;
      MPI_Status status;

      tag = 5;
      send_buff = 10;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if(rank==0)
          other_rank = 1;
      else
          other_rank = 0;

      MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD, &status);
      MPI_Send(&send_buff, 1, MPI_INT, other_rank, tag, MPI_COMM_WORLD);

      MPI_Finalize();

      return 0;
  }

  26. Point to Point: try 1 • Compile $ cc dead.c -o lock.exe • Run $ aprun -n 2 ./lock.exe … … … deadlock

  27. Deadlock • The problem • Both processes are waiting to receive a message • Neither process ever sends a message • Deadlock as both processes wait forever • Easier to deadlock than one might think • Buffering can mask logical deadlocks

  28. Point to Point: Fix 1
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int rank, other_rank, recv_buff;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if(rank==0){
          other_rank = 1;
          MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
          MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
      } else {
          other_rank = 0;
          MPI_Send(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD);
          MPI_Recv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &status);
      }

      MPI_Finalize();

      return 0;
  }
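The combined send-receive mentioned on slide 22 is another way to remove the ordering problem; a minimal sketch using the variables from Fix 1 (the tag value 5 is unchanged):

  /* MPI_Sendrecv issues the send and the receive as a single call,
     so neither rank has to order them by hand. */
  MPI_Sendrecv(&rank,      1, MPI_INT, other_rank, 5,
               &recv_buff, 1, MPI_INT, other_rank, 5,
               MPI_COMM_WORLD, &status);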

  29. Non blocking P2P • Non blocking Point to Point functions • Allow Send and Receive to not block on CPU • Return before buffers are safe to reuse • Can be used to prevent deadlock situations • Can be used to overlap communication and computation • Calls prefixed with “I”, because they return immediately

  30. Non blocking P2P int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • buf • Initial address of send buffer • count • Number of elements to send • datatype • Datatype of each element in send buffer • dest • Rank of destination • tag • Integer tag used by receiver to identify message • comm • Communicator • request • Object used to keep track of the status of the send operation

  31. Non blocking P2P int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • buf • Initial address of receive buffer • count • Maximum number of elements that can be received • datatype • Datatype of each element in receive buffer • source • Rank of source • tag • Integer tag used to identify message • comm • Communicator • request • Object used to keep track of the status of the receive operation

  32. Non blocking P2P int MPI_Wait(MPI_Request *request, MPI_Status *status) • request • The request you want to wait on to complete • status • Status struct containing information on the completed request • Will block until the specified request operation is complete • MPI_Wait, or similar, must be called if a request is used

  33. Point to Point: Fix 2
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int rank, other_rank, recv_buff;
      MPI_Request send_req, recv_req;
      MPI_Status send_stat, recv_stat;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if(rank==0)
          other_rank = 1;
      else
          other_rank = 0;

      MPI_Irecv(&recv_buff, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &recv_req);
      MPI_Isend(&rank, 1, MPI_INT, other_rank, 5, MPI_COMM_WORLD, &send_req);

      MPI_Wait(&recv_req, &recv_stat);
      MPI_Wait(&send_req, &send_stat);

      MPI_Finalize();

      return 0;
  }
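The two MPI_Wait calls can also be collapsed into a single MPI_Waitall call (the "or similar" on slide 32); a minimal sketch assuming the requests are packed into an array:

  MPI_Request reqs[2] = { recv_req, send_req };
  MPI_Status  stats[2];
  /* Blocks until both requests have completed. */
  MPI_Waitall(2, reqs, stats);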

  34. Collectives • Collective routines • Involves all processes in communicator • All processes in communicator must participate • Serve several purposes • Synchronization • Data movement • Reductions • Several routines originate or terminate at a single process known as the “root”

  35. Collectives int MPI_Barrier(MPI_Comm comm) • Blocks each process in comm until all processes have reached it • Used to synchronize the processes in comm
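A common use is lining ranks up before and after timing a region; a minimal sketch (the timed work is illustrative):

  MPI_Barrier(MPI_COMM_WORLD);        /* make sure every rank starts together */
  double t_start = MPI_Wtime();
  /* ... work to be timed ... */
  MPI_Barrier(MPI_COMM_WORLD);        /* make sure every rank has finished */
  double t_elapsed = MPI_Wtime() - t_start;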

  36. Collectives: Broadcast [Diagram: the root rank sends a copy of its buffer to every rank in the communicator (ranks 0-3)]

  37. Collectives int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm) • buf • Initial address of buffer (send buffer on root, receive buffer elsewhere) • count • Number of elements to send • datatype • Datatype of each element in the buffer • root • Rank of the process that broadcasts buf • comm • Communicator

  38. Collectives: Scatter [Diagram: the root splits its send buffer into equal chunks and sends one chunk to each rank (ranks 0-3)]

  39. Collectives int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) • sendbuf • Initial address of send buffer on the root process • sendcount • Number of elements sent to each process • sendtype • Datatype of each element in send buffer • recvbuf • Initial address of receive buffer on each process • recvcount • Number of elements in receive buffer • recvtype • Datatype of each element in receive buffer • root • Rank of the process that scatters sendbuf • comm • Communicator
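A minimal sketch of a scatter, assuming size ranks and a root that hands each rank one int (buffer names are illustrative):

  int send_data[size];     /* only needs to be filled in on the root rank */
  int my_value;
  /* Each rank, including the root, receives one element of send_data. */
  MPI_Scatter(send_data, 1, MPI_INT,
              &my_value, 1, MPI_INT,
              root, MPI_COMM_WORLD);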

  40. Collectives: Gather [Diagram: each rank sends its buffer to the root, which collects them in rank order (ranks 0-3)]

  41. Collectives int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm) • sendbuf • Initial address of send buffer on each process • sendcount • Number of elements to send • sendtype • Datatype of each element in send buffer • recvbuf • Initial address of receive buffer on the root process • recvcount • Number of elements received from each process • recvtype • Datatype of each element in receive buffer • root • Rank of the process that gathers into recvbuf • comm • Communicator
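A minimal sketch of a gather, the mirror image of the scatter above: each rank contributes one int and the root collects them in rank order (buffer names are illustrative):

  int recv_data[size];     /* only significant on the root rank */
  MPI_Gather(&rank, 1, MPI_INT,
             recv_data, 1, MPI_INT,
             root, MPI_COMM_WORLD);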

  42. Collectives: Reduce [Diagram: one value per rank (ranks 0-3) is combined element-wise with an operation such as +, -, or * and the single result is delivered to the root]

  43. Collectives int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • sendbuf • Initial address of send buffer • recvbuf • Buffer that receives the reduced result on the root rank • count • Number of elements in sendbuf • datatype • Datatype of each element in send buffer • op • Reduce operation (MPI_MAX, MPI_MIN, MPI_SUM, …) • root • Rank of the process that receives the reduced value in recvbuf • comm • Communicator
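A minimal sketch of a sum reduction, assuming each rank contributes one int (variable names are illustrative):

  int local_value = rank;
  int total = 0;
  /* Element-wise sum across all ranks; the result is defined only on root. */
  MPI_Reduce(&local_value, &total, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
  if (rank == root)
      printf("Sum of all ranks = %d\n", total);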

  44. Collectives: Example
  #include "stdio.h"
  #include "mpi.h"

  int main(int argc, char **argv)
  {
      int rank, root, bcast_data;
      root = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if(rank == root)
          bcast_data = 10;

      MPI_Bcast(&bcast_data, 1, MPI_INT, root, MPI_COMM_WORLD);

      printf("Rank %d has bcast_data = %d\n", rank, bcast_data);

      MPI_Finalize();

      return 0;
  }

  45. Collectives: Example
  $ cc collectives.c -o coll.exe
  $ aprun -n 4 ./coll.exe
  Rank 0 has bcast_data = 10
  Rank 1 has bcast_data = 10
  Rank 3 has bcast_data = 10
  Rank 2 has bcast_data = 10
