Download Presentation
## Advanced MPI programming

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Advanced MPI programming**Julien Langou George Bosilca**Outline**• Point-to-point communications, group and communicators, data-type, collective communications • 4 real life applications to play with**Point-to-point Communications**• Data transfer from one process to another • The communication requires a sender and a receiver, i.e.there must be a way to identify processes • There must be a way to describe the data • There must be a way to identify messages**Who are you ?**• Processes are identified by a double: • Communicator: a “stream like” safe place to pass messages • Rank: the relative rank of the remote process in the group of processes attached to the corresponding communicator • Messages are identified by a tag: an integer that allow you to differentiate messages in the same communicator**What data to exchange ?**• The data is described by a triple: • The address of the buffer where the data is located in the memory of the current process • A count: how many elements make up the message • A data-type: a description of the memory layout involved in the message**The basic send**MPI_Send(buf, count, type, dest, tag, communicator) • Once this function returns the data have been copied out of the user memory and the buffer can be reused • This operation may require system buffers, and in this case it will block until enough space is available • The completion of the send DO NOT state anything about the reception state of the message**The basic receive**MPI_Recv(buf, count, type, dest, tag, communicator, status) • When this call returns the data is located in the user memory • Receiving fewer than count is permitted, but receiving more is forbidden • The status contain information about the received message (MPI_ANY_SOURCE, MPI_ANY_TAG)**The MPI_Status**• Structure in C (MPI_Status), array in Fortran (integer[MPI_STATUS_SIZE]) • Contain at least 3 fields (integers): • MPI_TAG: the tag of the received message • MPI_SOURCE: the rank of the source process in the corresponding communicator • MPI_ERROR: the error raised by the receive (usually MPI_SUCCESS).**Non Blocking Versions**MPI_Isend(buf, count, type, dest, tag, communicator, req) MPI_Irecv (buf, count, type, dest, tag, communicator, req) • The memory pointer by buf (and described by the count and type) should not be touched until the communication is completed • req is a handle to an MPI_Request which can be used to check the status of the communication …**Testing for Completion**Blocking MPI_Wait(req, status) Non Blocking MPI_Test(req, flag, status) • Test or wait the completion status of the request • If the request is completed: • Fill the status with the related information • Release the request and replace it by MPI_REQUEST_NULL**Multiple Completions**• MPI_Waitany( count, array_of_requests, index, status ) • MPI_Testany( count, array_of_requests, index, flag, status) • MPI_Waitall( count, array_of_requests, array_of_statuses) • MPI_Testall( count, array_of_requests, flag, array_of_statuses ) • MPI_Waitsome( count, array_of_requests, outcount, array_of_indices, array_of_statuses )**Synchronous Communications**• MPI_Ssend, MPI_Issend • No restrictions … • Does not complete until the corresponding receive has been posted, and the operation has been started • It doesn’t means the operation is completed • Can be used instead of sending ack • Provide a simple way to avoid unexpected messages (i.e. no buffering on the receiver side)**Buffered Communications**• MPI_Bsend, MPI_Ibsend • MPI_Buffer_attach, MPI_Buffer_detach • Buffered sends do not rely on system buffers • The buffer is managed by MPI • The buffer should be large enough to contains the largest message send by one send • The user cannot use the buffer as long as it is attached to the MPI library**Ready Communications**• MPI_Rsend, MPI_Irsend • Can ONLY be used if the matching receive has been posted • It’s the user responsibility to insure the previous condition, otherwise the outcome is undefined • Can be efficient in some situations as it avoid the rendezvous handshake • Should be used with extreme caution**Persistent Communications**• MPI_{BR}Send_init, MPI_Recv_init • MPI_Start, MPI_Startall • Reuse the same requests for multiple communications • Provide only a half-binding: the send is not connected to the receive • A persistent send can be matched by any kind of receive**Point to point communication**• 01-gs (simple) • 02-nanopse (tuning) • 03-summa (tuning) • Groups and communicators • 03-summa • Data-types • 04-lila • Collective • MPI_OP: 04-lila**Gram-Schmidt algorithm**The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications. Input: A is block distributed by rows Output: Q is block distributed by rows R is global Q1 R A1 Q2 A2 Q3 A3 19**Example of applications: iterative methods.**• in iterative methods with multiple right-hand sides (block iterative methods:) • Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk). • BlockGMRES, BlockGCR, BlockCG, BlockQMR, … • in iterative methods with a single right-hand side • s-step methods for linear systems of equations (e.g. A. Chronopoulos), • LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder) implemented in PETSc, • Recent work from M. Hoemmen and J. Demmel (U. California at Berkeley). • in iterative eigenvalue solvers, • PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC), • HYPRE (Lawrence Livermore National Lab.) through BLOPEX, • Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), • PRIMME (A. Stathopoulos, Coll. William & Mary ), • And also TRLAN, BLZPACK, IRBLEIGS. 20**Example of applications:**in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers), in dense large and more square QR factorization where they are used as the panel factorization step, or more simply in linear least squares problems which the number of equations is extremely larger than the number of unknowns. 21**Example of applications:**in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers), in dense large and more square QR factorization where they are used as the panel factorization step, or more simply in linear least squares problems which the number of equations is extremely larger than the number of unknowns. The main characteristics of those three examples are that there is only one column of processors involved but several processor rows, all the data is known from the beginning, and the matrix is dense. 22**Example of applications:**in block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers), in dense large and more square QR factorization where they are used as the panel factorization step, or more simply in linear least squares problems which the number of equations is extremely larger than the number of unknowns. The main characteristics of those three examples are that there is only one column of processors involved but several processor rows, all the data is known from the beginning, and the matrix is dense. Various methods already exist to perform the QR factorization of such matrices: Gram-Schmidt (mgs(row),cgs), Householder (qr2, qrf), or CholeskyQR. We present a new method: Allreduce Householder (rhh_qr3, rhh_qrf). 23**An example with modified Gram-Schmidt.**A nonsingular m x 3 Q = A r11= || Q1 ||2 Q1 = Q1 / r11 = Q1 / r11 r12 = Q1T Q2 Q2 = Q2 – Q1 r12 r22 = || Q2 ||2 Q2 = Q2 / r22 r13 = Q1T Q3 Q3 = Q3 – Q1 r13 r23 = Q2T Q3 Q3 = Q3 – Q2 r23 r33 = || Q3 ||2 Q3 = Q3 / r33 A = QR QTQ = I3**The CholeskyQR Algorithm**AiT Ci Ai SYRK: C:= ATA ( mn2) CHOL: R := chol( C ) ( n3/3 ) TRSM: Q := A\R ( mn2) ) chol ( R C \ R Qi Ai 26**Bibligraphy**• A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM Journal on Scientific Computing, 23(6):2165-2182, 2002. • Popularized by iterative eigensolver libraries: • PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC), • HYPRE (Lawrence Livermore National Lab.) through BLOPEX, • Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk), • PRIMME (A. Stathopoulos, Coll. William & Mary ).**Parallel distributed CholeskyQR**1. AiT Ci Ai 1: SYRK: C:= ATA ( mn2) 2: MPI_Reduce: C:= sumprocs C (on proc 0) 3: CHOL: R := chol( C ) ( n3/3 ) 4: MPI_Bdcast Broadcast the R factor on proc 0 to all the other processors 5: TRSM: Q := A\R ( mn2) 2. + + + C3 C2 C4 C C1 3-4. ) chol ( R C \ 5. R Qi Ai • This method is extremely fast. For two reasons: • first, there is only one or two communications phase, • second, the local computations are performed with fast operations. • Another advantage of this method is that the resulting code is exactly four lines, • so the method is simple and relies heavily on other libraries. • Despite all those advantages, • this method is highly unstable. The CholeskyQR method in the parallel distributed context can be described as follows: 28**CholeskyQR - code**AiT 1. Ci Ai 2. + + + C3 C2 C4 C C1 3-4. ) chol ( R C \ 5. R Qi Ai int choleskyqr_A_v0(int mloc, int n, double *A, int lda, double *R, int ldr, MPI_Comm mpi_comm){ int info; cblas_dsyrk( CblasColMajor, CblasUpper, CblasTrans, n, mloc, 1.0e+00, A, lda, 0e+00, R, ldr ); MPI_Allreduce( MPI_IN_PLACE, R, n*n, MPI_DOUBLE, MPI_SUM, mpi_comm ); lapack_dpotrf( lapack_upper, n, R, ldr, &info ); cblas_dtrsm( CblasColMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit, mloc, n, 1.0e+00, R, ldr, A, lda ); return0; } (And OK, you might want to add an MPI user defined datatype to send only the upper part of R) 29**Efficient enough?**In this experiment, we fix the problem: m=100,000 and n=50. Time (sec) Performance (MFLOP/sec/proc) # of procs # of procs MFLOP/sec/proc Time in sec**Stability**m=100, n=50 || A – Q R ||2 / || A || 2 κ2(A) || I –Q TQ ||2 κ2(A)**performance analysis tool in PESCAN**Optimization of codes through performance tools Portability of codes while keeping good performances**after**before idum = 1 do i = 1,nnodes do j = 1,ivunpn2(i) combuf1(idum) = psi(ivunp2(idum)) idum = idum + 1 enddo enddo --------------------------------------------------------------------------------------- idum1 = 1 idum2 = 1 do i = 1, nnodes call mpi_isend(combuf1(idum1),ivunpn2(i),mpi_double_complex, & i-1,inode,mpi_comm_world,ireq(2*i-1),ierr) idum1 = idum1 + ivunpn2(i) call mpi_irecv(combuf2(idum2),ivpacn2(i),mpi_double_complex, & i-1, i, mpi_comm_world, ireq(2*i), ierr) idum2 = idum2 + ivpacn2(i) enddo call mpi_waitall( 2 * nnodes, ireq, MPI_STATUSES_IGNORE ) --------------------------------------------------------------------------------------- idum = 1 do i = 1,nnodes do j = 1,ivpacn2(i) psiy(ivpac2(idum)) = combuf2(idum) idum = idum + 1 enddo enddo -------------------------------------------------------------------------------------- call mpi_barrier(mpi_comm_world,ierr) idum = 1 do i = 1,nnodes call mpi_isend(combuf1(idum),ivunpn2(i),mpi_double_complex,i-1, & inode,mpi_comm_world,ireq(i),ierr) idum = idum + ivunpn2(i) enddo idum = 1 do i = 1,nnodes call mpi_recv(combuf2(idum),ivpacn2(i),mpi_double_complex,i-1,i, & mpi_comm_world,mpistatus,ierr) idum = idum + ivpacn2(i) enddo do i = 1, nnodes call mpi_wait(ireq(i), mpistatus, ierr) end do call mpi_barrier(mpi_comm_world,ierr) -------------------------------------------------------------------------------------**Original code:**13.2% of the overall time at barrier Modified code: Removing most of the barrier report a part of the problem on other synchronization point but we observe 6% of improvements**Lots of self-communication. After investigation, some copies**can be avoided. • Try different variants of asynchronous communication, removing barrier,….**Cd 83 – Se 81 on 16 processors**• In this example, we see that both modifications are useful and they are working fine together for small number of processors**Comparison of different matrix-vector products**implementations ZB4096_Ecut30 (order = 2,156,241) – time for 50 matrix-vector products • - me2me: 10% improvements on the matrix-vector products, less buffer used • asynchronous communication are not a good idea, (this is why the original code have them) • barrier are useful for large matrices (this why the original code have them)**after**before idum = 1 do i = 1,nnodes do j = 1,ivunpn2(i) combuf1(idum) = psi(ivunp2(idum)) idum = idum + 1 enddo enddo --------------------------------------------------------------------------------------- idum1 = 1 idum2 = 1 do i = 1, nnodes call mpi_isend(combuf1(idum1),ivunpn2(i),mpi_double_complex, & i-1,inode,mpi_comm_world,ireq(2*i-1),ierr) idum1 = idum1 + ivunpn2(i) call mpi_irecv(combuf2(idum2),ivpacn2(i),mpi_double_complex, & i-1, i, mpi_comm_world, ireq(2*i), ierr) idum2 = idum2 + ivpacn2(i) enddo call mpi_waitall( 2 * nnodes, ireq, MPI_STATUSES_IGNORE ) --------------------------------------------------------------------------------------- idum = 1 do i = 1,nnodes do j = 1,ivpacn2(i) psiy(ivpac2(idum)) = combuf2(idum) idum = idum + 1 enddo enddo -------------------------------------------------------------------------------------- call mpi_barrier(mpi_comm_world,ierr) idum = 1 do i = 1,nnodes call mpi_isend(combuf1(idum),ivunpn2(i),mpi_double_complex,i-1, & inode,mpi_comm_world,ireq(i),ierr) idum = idum + ivunpn2(i) enddo idum = 1 do i = 1,nnodes call mpi_recv(combuf2(idum),ivpacn2(i),mpi_double_complex,i-1,i, & mpi_comm_world,mpistatus,ierr) idum = idum + ivpacn2(i) enddo do i = 1, nnodes call mpi_wait(ireq(i), mpistatus, ierr) end do call mpi_barrier(mpi_comm_world,ierr) ------------------------------------------------------------------------------------- This is just one way of doing it : Solution use (and have) tuned MPI Global Communications (here MPI_Alltoallv )**First rule in linear algebra: Have an efficient DGEMM**• All the dense linear algebra operations rely on an efficient DGEMM (matrix Matrix Multiply) • This is by far the easiest operation O(n3) in Dense Linear Algebra. • So if we can not implement DGEMM correctly (peak performance), we will not be able to do much for the others operations.**Blocked LU and QR algorithms (LAPACK)**LAPACK block LU (right-looking): dgetrf LAPACK block QR (right-looking): dgeqrf U R L V A(1) A(1) dgetf2 dgeqf2 + dlarft Panel factorization lu( ) qr( ) dtrsm (+ dswp) \ Update of the remaining submatrix dgemm dlarfb - - U R L V A(2) A(2) 45**Blocked LU and QR algorithms (LAPACK)**LAPACK block LU (right-looking): dgetrf U L A(1) dgetf2 Panel factorization Latency bounded: more than nb AllReduce for n*nb2 ops lu( ) dtrsm (+ dswp) CPU - bandwidth bounded: the bulk of the computation: n*n*nb ops highly paralleliable, efficient and saclable. \ Update of the remaining submatrix dgemm - U L A(2)**Parallel Distributed MM algorithms**C11 C12 C13 C11 C12 C13 A11 A12 A13 B11 B12 B13 C21 C22 C23 C21 C22 C23 A21 A22 A23 B21 B22 B23 β + α * C31 C32 C33 C31 C32 C33 A31 A32 A33 B31 B32 B33**Parallel Distributed MM algorithms**A B A B A B + = * * * A B A B + + + … * * Use the outer version of the matrix-matrix multiply algorithm.**Parallel Distributed MM algorithms**A B A B A B + = * * * A B A B + + + … * * PDGEMM : For k = 1:nb:n, C C A B + * End For**Parallel Distributed MM algorithms**B11 B12 B13 A11 C11 A12 C12 A13 C13 B21 B22 B23 A21 C21 A22 C22 A23 C23 B31 B32 B33 A31 C31 A32 C32 A33 C33