
Presentation Transcript


  1. MSc in High Performance Computing, Computational Chemistry Module. Lecture 8: Introducing One-sided Communications. Martyn F Guest, Huub van Dam and Paul Sherwood, CCLRC Daresbury Laboratory, p.sherwood@daresbury.ac.uk

  2. Outline of the Lecture • One-sided vs two-sided communication strategies • Implementation in Global Arrays and ARMCI • An example: 1-D data transpose • Practical Session • Programming a matrix multiply using one-sided communication primitives

  3. Review of Message Passing Concepts • Messages are the only form of communication • all communication is therefore explicit • Most systems use the SPMD model • all processes run exactly the same code • each has a unique ID • processes can take different branches in the same code • Basic form is point-to-point (a minimal sketch follows below) • collective communications implement more complicated patterns that occur in many codes
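As a reminder of the two-sided SPMD style that the later slides contrast against, here is a minimal MPI sketch (ours, not from the original lecture): every process runs the same executable, branches on its rank, and data moves only because one rank calls MPI_Send while another calls the matching MPI_Recv.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID for this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {                    /* same code, different branch */
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", token);
        }
    }

    MPI_Finalize();
    return 0;
}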

  4. Two-sided communication • Communication is broken down into messages, with a sending and a receiving node; the consequences are: • The receiving node must enter a receive call (e.g. MPI_Recv) from the user's own code • This must be designed into the parallelisation algorithm • E.g. alternating compute and communication phases, cf. the systolic loop algorithm in MD • If task sizes are unpredictable, this may lead to inefficiency due to load-balancing issues • Program complexity is increased by the need to ensure both sending and receiving nodes are available (see the ring-exchange sketch below)
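To make the pairing requirement concrete, the sketch below (ours, not the lecture's) performs the same ring shift that the SHMEM example on slide 7 performs one-sidedly, but with two-sided MPI: every process must post both a send and the matching receive, here combined in MPI_Sendrecv, so the communication phase has to be designed into both ends of the exchange.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, right, left, recvd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;              /* neighbour we send to   */
    left  = (rank + size - 1) % size;       /* neighbour we hear from */

    /* both the send and the receive appear explicitly in the user's code */
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &recvd, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("%d got value %d\n", rank, recvd);

    MPI_Finalize();
    return 0;
}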

  5. One-Sided Communication • A one-sided communication is initiated by the node wishing to read or write memory on a remote node • Some typical one-sided operations: • put – write to remote memory • get – read from remote memory • accumulate – read from remote memory, add a contribution and write the result back • The user code running on the node owning the memory does not explicitly handle the request for data • One-sided operations are naturally supported by shared-memory systems (they look the same as local memory access; see the sketch below) • On distributed-memory systems messages must be generated to transfer data
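To illustrate the last two points, here is a toy OpenMP sketch (our illustration, not part of the lecture): on a shared-memory system a "put" into another worker's slot is just an ordinary store into a shared array, followed by a synchronisation point; a distributed-memory system has to turn the same operation into messages behind the scenes.

#include <omp.h>
#include <stdio.h>

#define NTHREADS 4

int main(void)
{
    int shared_buf[NTHREADS] = {0};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        int target = (me + 1) % omp_get_num_threads();

        shared_buf[target] = me;   /* the "put": an ordinary store into shared memory */
        #pragma omp barrier        /* make sure all puts are visible before reading */

        printf("worker %d sees value %d\n", me, shared_buf[me]);
    }
    return 0;
}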

  6. One-Sided Communication • Implementations of one-sided communications • SHMEM • Offered by Cray to provide a virtual shared-memory environment • Active Messages • Framework for providing handler functions to be called when messages of particular types arrive • E.g. LAPI from IBM, as used on HPCx • MPI-2 • Standardised implementation • Communication is organised into epochs on memory "windows", separated by synchronisation calls • It is not possible to overlap one- and two-sided message-passing phases • Global Arrays • Toolkit to support one-sided programming • Built on calls to SHMEM, LAPI, etc.

  7. One-Sided Approaches • Vendor-provided tools • Cray T3D and T3E systems provided the SHMEM library • Subsequently implemented by other vendors (e.g. Quadrics)

#include <stdio.h>
#include <stdlib.h>
#include "shmem.h"

int main(int argc, char **argv)
{
    int my_pe, num_pe, target, source, *value, *marker;

    /* initialise SHMEM, then allocate the buffers from the symmetric heap
       so that remote PEs can address them */
    shmem_init();
    my_pe  = _my_pe();
    num_pe = _num_pes();
    value  = (int *)shmalloc(sizeof(int));
    marker = (int *)shmalloc(sizeof(int));
    *value  = -1;
    *marker = 0;

    /* identify the target and source processes in a ring */
    target = (my_pe == (num_pe - 1)) ? 0 : (my_pe + 1);
    source = (my_pe == 0) ? (num_pe - 1) : (my_pe - 1);
    shmem_barrier_all();

    /* write to the target, then set a flag so the target knows the data has arrived */
    shmem_int_p(value, my_pe, target);
    shmem_fence();
    shmem_int_p(marker, 1, target);

    if (*marker == 0) shmem_int_wait(marker, 0);
    printf("%d got value %d from %d\n", my_pe, *value, source);

    shmem_barrier_all();
    shfree(value);
    shfree(marker);
    return 0;
}

  8. Global Arrays • Distributed dense arrays that can be accessed through a shared-memory-like style • Physically distributed data • Single, shared data structure / global indexing, e.g. access A(4,3) rather than buf(7) on task 2 • Global address space

  9. Global Arrays (cont.) • Shared-memory model in the context of distributed dense arrays • Much simpler than message passing for many applications • Complete environment for parallel code development • Compatible with MPI • Data locality control similar to the distributed memory/message passing model • Extensible • Scalable

  10. Remote Data Access in GA • Message Passing: • identify size and location of data blocks • loop over processors: • if (me = P_N) then • pack data in local message buffer • send block of data to message buffer on P0 • else if (me = P0) then • receive block of data from P_N in message buffer • unpack data from message buffer to local buffer • endif • end loop • copy local data on P0 to local buffer • Global Arrays: NGA_Get(g_a, lo, hi, buffer, ld) • g_a is the global array handle, lo and hi are the global lower and upper indices of the data patch, buffer and ld are the local buffer and its array of strides (a complete C sketch follows below)
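For reference, a small C sketch of the GA side of this comparison, assuming GA's C bindings (GA_Initialize, NGA_Create, NGA_Get and friends): it creates a distributed 2-D array and fetches a patch with a single one-sided call, using global indices rather than per-task buffer offsets. Sizes and the patch chosen are purely illustrative.

#include <mpi.h>
#include <stdio.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    int dims[2] = {100, 100}, chunk[2] = {-1, -1};   /* -1: let GA choose the distribution */
    int lo[2], hi[2], ld[1];
    double patch[4][4];

    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);                /* memory allocator used internally by GA */

    int g_a = NGA_Create(C_DBL, 2, dims, "array A", chunk);
    GA_Zero(g_a);

    /* one-sided read of the global patch A(4:7, 3:6), whichever task owns it (0-based indices) */
    lo[0] = 4; lo[1] = 3;
    hi[0] = 7; hi[1] = 6;
    ld[0] = 4;                                       /* leading dimension of the local buffer */
    NGA_Get(g_a, lo, hi, patch, ld);

    GA_Sync();
    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}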

  11. GA Example: 1-D Transpose • Take a 1-D array A and store it in a distributed fashion (g_a) • Perform the transpose operation B(i) = A(n-i+1) for all i • Assume that each processor only needs to work with one patch to complete the operation • (a1, a2, a3, ..., an) becomes (an, ..., a3, a2, a1)

  12. GA Example: 1-D Transpose

  13. Example: 1-D Transpose (cont.)
#define NDIM 1
#define TOTALELEMS 197
#define MAXPROC 128

      program main
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer dims(3), chunk(3), nprocs, me, i, lo(3), hi(3), lo1(3)
      integer hi1(3), lo2(3), hi2(3), ld(3), nelem
      integer g_a, g_b, a(MAXPROC*TOTALELEMS), b(MAXPROC*TOTALELEMS)
      integer heap, stack, ichk, ierr
      logical status

      heap = 300000
      stack = 300000

  14. Example: 1-D Transpose (cont.)
c     initialize communication library
      call mpi_init(ierr)
c     initialize ga library
      call ga_initialize()
      me = ga_nodeid()
      nprocs = ga_nnodes()
      dims(1) = nprocs*TOTALELEMS + nprocs/2   ! Unequal data distribution
      ld(1) = MAXPROC*TOTALELEMS
      chunk(1) = TOTALELEMS                    ! Minimum amount of data on each processor
      status = ma_init(MT_F_DBL, stack/nprocs, heap/nprocs)
c     create a global array
      status = nga_create(MT_F_INT, NDIM, dims, "array A", chunk, g_a)
      status = ga_duplicate(g_a, g_b, "array B")
c     initialize data in GA
      do i = 1, dims(1)
         a(i) = i
      end do
      lo1(1) = 1
      hi1(1) = dims(1)
      if (me.eq.0) call nga_put(g_a, lo1, hi1, a, ld)
      call ga_sync()                           ! Make sure data is distributed before continuing

  15. Example: 1-D Transpose (cont.)
c     transpose data locally
      call nga_distribution(g_a, me, lo, hi)
      call nga_get(g_a, lo, hi, a, ld)         ! Use locality
      nelem = hi(1) - lo(1) + 1
      do i = 1, nelem
         b(i) = a(nelem - i + 1)
      end do
c     transpose data globally
      lo2(1) = dims(1) - hi(1) + 1
      hi2(1) = dims(1) - lo(1) + 1
      call nga_put(g_b, lo2, hi2, b, ld)
      call ga_sync()                           ! Make sure transposition is complete

  16. Example: 1-D Transpose (cont.)
c     check transpose
      call nga_get(g_a, lo1, hi1, a, ld)
      call nga_get(g_b, lo1, hi1, b, ld)
      ichk = 0
      do i = 1, dims(1)
         if (a(i).ne.b(dims(1)-i+1).and.me.eq.0) then
            write(6,*) "Mismatch at ", i
            ichk = ichk + 1
         endif
      end do
      if (ichk.eq.0.and.me.eq.0) write(6,*) "Transpose OK"

      status = ga_destroy(g_a)                 ! Deallocate memory for arrays
      status = ga_destroy(g_b)
      call ga_terminate()
      call mpi_finalize(ierr)
      stop
      end

  17. Instrumenting single-sided memory access • Approach 1: Instrument the puts, gets and the data server • Advantage: robust and accurate • Disadvantage: one does not always have access to the source of the data server • Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages • Advantage: no instrumentation of the data server required • Disadvantage: timings of the messages are inaccurate for non-blocking communications, and trace displays can show flashing lines caused by the synchronisation corrections applied to the timers of different processors • In our work with Global Arrays we have taken Approach 2 (a sketch of such a wrapper follows below)
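A minimal sketch of what Approach 2 might look like in practice (illustrative only; the wrapper name and counters are ours, not part of GA): each get is timed on the initiating process and the whole cost is attributed to that side, which is exactly where the inaccuracy for non-blocking transfers creeps in.

#include <mpi.h>
#include "ga.h"

/* hypothetical per-process instrumentation state */
static double get_time  = 0.0;
static long   get_calls = 0;

/* wrapper used in place of NGA_Get: times the call on the initiating side only */
void traced_NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[])
{
    double t0 = MPI_Wtime();
    NGA_Get(g_a, lo, hi, buf, ld);
    get_time  += MPI_Wtime() - t0;   /* whole transfer charged to this process */
    get_calls += 1;                  /* no cooperation from the data server needed */
}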

  18. GA vs. MPI-2 • MPI-2 now provides a portable mechanism for one-sided communications • Memory is associated with one-sided communications by defining windows • One-sided (put/get) operations occur in well-defined regions of the code separated by fence calls (see the sketch below) • There are restrictions on what a code can do between synchronisation points, e.g. (point-to-point) messages, local compute, etc. • A standard: vendors will implement it • Global Arrays • Designed to make operations as light-weight as possible • Minimal synchronisation required • Works to exploit overlap of communication and computation • Not a standard, so portability problems on new platforms (will the OpenFabrics Alliance [www.openfabrics.org] cure this?)
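For comparison, a minimal MPI-2 one-sided sketch (ours, not from the lecture): memory is exposed through a window, and the puts are confined to an access epoch delimited by MPI_Win_fence calls, which is the kind of synchronisation overhead GA tries to minimise.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, target, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* expose one int per process through an RMA window */
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    target = (rank + 1) % size;             /* same ring pattern as the SHMEM example */

    MPI_Win_fence(0, win);                  /* open the access epoch */
    MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                  /* close the epoch: puts are now complete */

    printf("%d got value %d\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}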
