
High Performance Parallel Programming


Presentation Transcript


  1. High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division

  2. High Performance Parallel Programming • Lecture 4: Message Passing Interface 3 High Performance Parallel Programming

  3. So Far.. • Messages • source, dest, data, tag, communicator • Communicators • MPI_COMM_WORLD • Point-to-point communications • different modes - standard, synchronous, buffered, ready • blocking vs non-blocking • Derived datatypes • construct then commit High Performance Parallel Programming

  4. Ping-pong exercise: program
/**********************************************************************
 * This file has been written as a sample solution to an exercise in a
 * course given at the Edinburgh Parallel Computing Centre. It is made
 * freely available with the understanding that every copy of this file
 * must include this header and that EPCC takes no responsibility for
 * the use of the enclosed teaching material.
 *
 * Authors: Joel Malard, Alan Simpson
 *
 * Contact: epcc-tec@epcc.ed.ac.uk
 *
 * Purpose: A program to experiment with point-to-point
 *          communications.
 *
 * Contents: C source code.
 *
 ********************************************************************/
High Performance Parallel Programming

  5. #include <stdio.h>
#include <mpi.h>

#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101

float buffer[100000];
long float_size;

void processor_A(void), processor_B(void);

int main(int argc, char *argv[])
{
    int rank;
    extern long float_size;

    MPI_Init(&argc, &argv);
    MPI_Type_extent(MPI_FLOAT, &float_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();
    MPI_Finalize();
    return 0;
}

  6. void processor_A( void )
{
    int i, length, ierror;
    MPI_Status status;
    double start, finish, time;
    extern float buffer[100000];
    extern long float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");
    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping, MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong, MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        printf("%d\t%f\t%f\n", length, time / 200.,
               (float)(2 * float_size * 100 * length) / time);
    }
}

  7. void processor_B( void )
{
    int i, length, ierror;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping, MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong, MPI_COMM_WORLD);
        }
    }
}

  8. Ping-pong exercise: results High Performance Parallel Programming

  9. Ping-pong exercise: results 2 High Performance Parallel Programming

  10. Running ping-pong
compile: mpicc ping_pong.c -o ping_pong
submit: qsub ping_pong.sh
where ping_pong.sh is:
#PBS -q exclusive
#PBS -l nodes=2
cd <your sub_directory>
mpirun ping_pong
High Performance Parallel Programming

  11. Collective communication • Communications involving a group of processes • Called by all processes in a communicator • for sub-groups need to form a new communicator • Examples • Barrier synchronisation • Broadcast, Scatter, Gather • Global sum, Global maximum, etc. High Performance Parallel Programming

  12. Characteristics • Collective action over a communicator • All processes must communicate • Synchronisation may or may not occur • All collective operations are blocking • No tags • Receive buffers must be exactly the right size • Collective communications and point-to-point communications cannot interfere High Performance Parallel Programming

  13. MPI_Barrier • Blocks each calling process until all other members have also called it. • Generally used to synchronise between phases of a program • Only one argument - no data is exchanged MPI_Barrier(comm) High Performance Parallel Programming
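A minimal sketch (not from the slides) of using MPI_Barrier to separate two phases of work; the printf calls stand in for real phase computation:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: phase 1\n", rank);   /* phase 1 work would go here */
    MPI_Barrier(MPI_COMM_WORLD);          /* no rank starts phase 2 until all have reached the barrier */
    printf("rank %d: phase 2\n", rank);

    MPI_Finalize();
    return 0;
}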

  14. Broadcast • Copies data from a specified root process to all other processes in the communicator • all processes must specify the same root • other arguments same as for point-to-point • datatypes and sizes must match MPI_Bcast(buffer, count, datatype, root, comm) • Note: MPI does not support a multicast function High Performance Parallel Programming
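A minimal broadcast sketch (not from the slides); the value 42 and root rank 0 are arbitrary choices for illustration:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 42;                                    /* only the root has the value initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank must pass the same root (0) */
    printf("rank %d now has n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}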

  15. Scatter, Gather • Scatter and Gather are inverse operations • Note that all processes partake - even root • Scatter: (diagram: before, the root holds a b c d e; after, each process holds one element of a b c d e) High Performance Parallel Programming

  16. Gather • Gather: (diagram: before, each process holds one of a b c d e; after, the root holds the full array a b c d e) High Performance Parallel Programming

  17. MPI_Scatter, MPI_Gather MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm) • sendcount in scatter and recvcount in gather refer to the size of each individual message (sendtype = recvtype => sendcount = recvcount) • total type signatures must match High Performance Parallel Programming
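A minimal scatter sketch (not from the slides), assuming one int per process; the buffer names are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, gsize, mine, *sendbuf = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    if (rank == 0) {                           /* only the root fills the full send buffer */
        sendbuf = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++)
            sendbuf[i] = i * i;
    }
    /* sendcount = 1: each rank receives one element of the root's buffer */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, mine);

    if (rank == 0)
        free(sendbuf);
    MPI_Finalize();
    return 0;
}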

  18. Example
MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize * 100 * sizeof(int));
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);
High Performance Parallel Programming

  19. More routines MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm) MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm) (diagram: Allgather leaves every process with the complete gathered array; Alltoall gives each process one block from every other process) High Performance Parallel Programming
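A minimal MPI_Allgather sketch (not from the slides); each rank contributes its rank number and every rank ends up with the full array:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, gsize, *all;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    all = (int *)malloc(gsize * sizeof(int));
    /* unlike MPI_Gather there is no root: all ranks receive the gathered array */
    MPI_Allgather(&rank, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d: last element = %d\n", rank, all[gsize - 1]);

    free(all);
    MPI_Finalize();
    return 0;
}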

  20. Vector routines MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm) MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm) MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm) MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm) • Allow send/recv to be from/to non-contiguous locations in an array • Useful if sending different counts at different times High Performance Parallel Programming
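A minimal MPI_Gatherv sketch (not from the slides), assuming rank i sends i+1 elements; the counts and displacement arithmetic are the point of the example:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, gsize, mycount, *mydata;
    int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    mycount = rank + 1;                    /* rank i contributes i + 1 elements */
    mydata = (int *)malloc(mycount * sizeof(int));
    for (i = 0; i < mycount; i++)
        mydata[i] = rank;

    if (rank == 0) {                       /* only the root needs counts, displacements and space */
        recvcounts = (int *)malloc(gsize * sizeof(int));
        displs = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++) {
            recvcounts[i] = i + 1;
            displs[i] = (i == 0) ? 0 : displs[i - 1] + recvcounts[i - 1];
        }
        recvbuf = (int *)malloc((displs[gsize - 1] + recvcounts[gsize - 1]) * sizeof(int));
    }
    MPI_Gatherv(mydata, mycount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}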

  21. Global reduction routines • Used to compute a result which depends on data distributed over a number of processes • Examples: • global sum or product • global maximum or minimum • global user-defined operation • Operation should be associative • aside: remember floating-point operations technically aren’t associative but we usually don’t care - can affect results in parallel programs though High Performance Parallel Programming

  22. Global reduction (cont.) MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm) • combines count elements from each sendbuf using op and leaves results in recvbuf on process root • e.g. MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm) (diagram: each process contributes two elements in s; the element-wise sums, 8 and 9, arrive in r on process 1) High Performance Parallel Programming
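A runnable version (my sketch, not from the slides) of the MPI_Reduce call quoted above, with each rank contributing two ints:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, s[2], r[2] = {0, 0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s[0] = rank;
    s[1] = 2 * rank;
    /* element-wise sum of every rank's s[] ends up in r[] on root 1, as in the slide's call */
    MPI_Reduce(s, r, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
    if (rank == 1)
        printf("sums: %d %d\n", r[0], r[1]);

    MPI_Finalize();
    return 0;
}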

  23. Reduction operators
MPI_MAX      Maximum
MPI_MIN      Minimum
MPI_SUM      Sum
MPI_PROD     Product
MPI_LAND     Logical AND
MPI_BAND     Bitwise AND
MPI_LOR      Logical OR
MPI_BOR      Bitwise OR
MPI_LXOR     Logical XOR
MPI_BXOR     Bitwise XOR
MPI_MAXLOC   Max value and location
MPI_MINLOC   Min value and location
High Performance Parallel Programming

  24. User-defined operators In C the operator is defined as a function of type typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype); In Fortran you must write a function function <user_function>(invec(*), inoutvec(*), len, type) where the function has the following schema: for (i = 1 to len) inoutvec(i) = inoutvec(i) op invec(i) Then MPI_Op_create(user_function, commute, op) returns a handle op of type MPI_Op High Performance Parallel Programming
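A sketch (not from the slides) of MPI_Op_create in C, using a hypothetical element-wise "maximum of absolute values" operator; it assumes the reduction is over MPI_INT:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* inoutvec[i] = max(|invec[i]|, |inoutvec[i]|); assumes the datatype is MPI_INT */
void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i, a, b;
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (i = 0; i < *len; i++) {
        a = abs(in[i]);
        b = abs(inout[i]);
        inout[i] = (a > b) ? a : b;
    }
}

int main(int argc, char *argv[])
{
    int rank, s, r;
    MPI_Op op;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(absmax, 1, &op);          /* commute = 1: the operation is commutative */
    s = (rank % 2) ? -rank : rank;          /* mix of signs so the absolute value matters */
    MPI_Reduce(&s, &r, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("largest |value| = %d\n", r);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}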

  25. Variants MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm) • All processes involved receive identical results MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm) • Acts as if a reduce was performed and then each process receives recvcounts(myrank) elements of the result. High Performance Parallel Programming
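A minimal MPI_Allreduce sketch (not from the slides); every rank contributes rank+1 and every rank receives the same total:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, local, total;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;
    /* behaves like MPI_Reduce followed by a broadcast: no root argument */
    MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %d\n", rank, total);

    MPI_Finalize();
    return 0;
}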

  26. Reduce-scatter (diagram: five processes each contribute a row of values; the element-wise sums are then split across the processes according to rc)
int *s, *r, *rc;
int rank, gsize;
...
/* e.g. rc = {1, 2, 0, 1, 1} */
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);
High Performance Parallel Programming

  27. Scan MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm) • Performs a prefix reduction on data across the group: recvbuf on rank myrank holds op applied to the sendbufs of ranks 0..myrank • e.g. MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm); (diagram: each process's recvbuf holds the element-wise running sums of the sendbufs up to and including its own) High Performance Parallel Programming
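A minimal MPI_Scan sketch (not from the slides) computing an inclusive prefix sum of one int per rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, s, prefix;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s = rank + 1;
    /* rank i receives 1 + 2 + ... + (i + 1): the sum over ranks 0..i */
    MPI_Scan(&s, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}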

  28. Further topics • Error-handling • Errors are handled by an error handler • MPI_ERRORS_ARE_FATAL - default for MPI_COMM_WORLD • MPI_ERRORS_RETURN - MPI state is undefined after an error • MPI_Error_string(errorcode, string, resultlen) • Message probing • Messages can be probed before being received • Note - a subsequent wildcard receive may match a different message than the one probed • blocking and non-blocking • Persistent communications High Performance Parallel Programming
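A probing sketch (not from the slides, and needing at least two ranks): MPI_Probe plus MPI_Get_count lets the receiver size its buffer, and receiving from the probed source and tag avoids the wildcard pitfall mentioned above:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, count, *buf;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[3] = {1, 2, 3};
        MPI_Send(data, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);    /* blocks until a matching message is pending */
        MPI_Get_count(&status, MPI_INT, &count);     /* how many ints it contains */
        buf = (int *)malloc(count * sizeof(int));
        MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);
        printf("received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}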

  29. Assignment 2. • Write a general procedure to multiply 2 matrices • Start with • http://www.hpc.unimelb.edu.au/cs/assignment2/ • This is a harness for last year's assignment • Last year I asked them to optimise first • This year just parallelise • Next Tuesday I will discuss strategies • That doesn't mean don't start now… • Ideas are available in various places… High Performance Parallel Programming

  30. High Performance Parallel Programming Tomorrow - matrix multiplication High Performance Parallel Programming
