
Presentation Transcript


  1. MSc in High Performance Computing, Computational Chemistry Module. Lecture 8: Introducing One-sided Communications. Martyn F Guest, Huub van Dam and Paul Sherwood, CCLRC Daresbury Laboratory, p.sherwood@daresbury.ac.uk

  2. Outline of the Lecture • One-sided vs two-sided communication strategies • Implementation in Global Arrays and ARMCI • An example: 1-D data transpose • Practical Session • Programming a matrix multiply using one-sided communication primitives

  3. Review of Message Passing Concepts • Messages are the only form of communication • all communication is therefore explicit • Most systems use the SPMD model • all processes run exactly the same code • each has a unique ID • processes can take different branches in the same code • Basic form is point-to-point (a minimal sketch follows below) • collective communications implement more complicated patterns that occur in many codes
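As a reminder of the two-sided SPMD style that the later slides contrast against, here is a minimal MPI sketch (ours, not from the original lecture): every process runs the same executable, branches on its rank, and data moves only because one rank calls MPI_Send while another calls the matching MPI_Recv.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID for this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {                    /* same code, different branch */
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", token);
        }
    }

    MPI_Finalize();
    return 0;
}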

  4. Two-sided communication • Communication is broken down into messages, with a sending and a receiving node; the consequences are: • The receiving node must enter a receive call (e.g. MPI_Recv) from the user's own code • This must be designed into the parallelisation algorithm • E.g. alternating compute and communication phases, cf. the systolic loop algorithm in MD • If task sizes are unpredictable, this may lead to inefficiency due to load-balancing issues • Program complexity is increased by the need to ensure both sending and receiving nodes are available (see the ring-exchange sketch below)
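To make the pairing requirement concrete, the sketch below (ours, not the lecture's) performs the same ring shift that the SHMEM example on slide 7 performs one-sidedly, but with two-sided MPI: every process must post both a send and the matching receive, here combined in MPI_Sendrecv, so the communication phase has to be designed into both ends of the exchange.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, right, left, recvd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;              /* neighbour we send to   */
    left  = (rank + size - 1) % size;       /* neighbour we hear from */

    /* both the send and the receive appear explicitly in the user's code */
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &recvd, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("%d got value %d\n", rank, recvd);

    MPI_Finalize();
    return 0;
}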

  5. One-Sided Communication • A one-sided communication is initiated by the node wishing to read or write memory on a remote node • Some typical one-sided operations: • put – write to remote memory • get – read from remote memory • accumulate – read from remote memory, add a contribution and write the result back • The user code running on the node owning the memory does not explicitly handle the request for data • One-sided operations are naturally supported by shared-memory systems (they look the same as local memory access; see the sketch below) • On distributed-memory systems messages must be generated to transfer data
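To illustrate the last two points, here is a toy OpenMP sketch (our illustration, not part of the lecture): on a shared-memory system a "put" into another worker's slot is just an ordinary store into a shared array, followed by a synchronisation point; a distributed-memory system has to turn the same operation into messages behind the scenes.

#include <omp.h>
#include <stdio.h>

#define NTHREADS 4

int main(void)
{
    int shared_buf[NTHREADS] = {0};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        int target = (me + 1) % omp_get_num_threads();

        shared_buf[target] = me;   /* the "put": an ordinary store into shared memory */
        #pragma omp barrier        /* make sure all puts are visible before reading */

        printf("worker %d sees value %d\n", me, shared_buf[me]);
    }
    return 0;
}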

  6. One-Sided Communication • Implementations of one-sided communications • SHMEM • Offered by Cray to provide a virtual shared-memory environment • Active Messages • Framework for providing handler functions to be called when messages of particular types arrive • E.g. LAPI from IBM, as used on HPCx • MPI-2 • Standardised implementation • Communication is organised into epochs on memory "windows", separated by synchronisation calls • It is not possible to overlap one- and two-sided message-passing phases • Global Arrays • Toolkit to support one-sided programming • Built on calls to SHMEM, LAPI, etc.

  7. One-Sided Approaches • Vendor-provided tools • Cray T3D and T3E systems provided the SHMEM library • Subsequently implemented by other vendors (e.g. Quadrics)

#include <stdio.h>
#include <stdlib.h>
#include "shmem.h"

int main(int argc, char **argv)
{
    int my_pe, num_pe, target, source, *value, *marker;

    /* initialise SHMEM, then allocate the buffers from the symmetric heap
       so that remote PEs can address them */
    shmem_init();
    my_pe  = _my_pe();
    num_pe = _num_pes();
    value  = (int *)shmalloc(sizeof(int));
    marker = (int *)shmalloc(sizeof(int));
    *value  = -1;
    *marker = 0;

    /* identify the target and source processes in a ring */
    target = (my_pe == (num_pe - 1)) ? 0 : (my_pe + 1);
    source = (my_pe == 0) ? (num_pe - 1) : (my_pe - 1);
    shmem_barrier_all();

    /* write to the target, then set a flag so the target knows the data has arrived */
    shmem_int_p(value, my_pe, target);
    shmem_fence();
    shmem_int_p(marker, 1, target);

    if (*marker == 0) shmem_int_wait(marker, 0);
    printf("%d got value %d from %d\n", my_pe, *value, source);

    shmem_barrier_all();
    shfree(value);
    shfree(marker);
    return 0;
}

  8. Global Arrays • Distributed dense arrays that can be accessed through a shared-memory-like style • Physically distributed data • Single, shared data structure / global indexing, e.g. access A(4,3) rather than buf(7) on task 2 • Global address space

  9. Global Arrays (cont.) • Shared-memory model in the context of distributed dense arrays • Much simpler than message passing for many applications • Complete environment for parallel code development • Compatible with MPI • Data locality control similar to the distributed memory/message passing model • Extensible • Scalable

  10. Remote Data Access in GA • Message Passing: • identify size and location of data blocks • loop over processors: • if (me = P_N) then • pack data in local message buffer • send block of data to message buffer on P0 • else if (me = P0) then • receive block of data from P_N in message buffer • unpack data from message buffer to local buffer • endif • end loop • copy local data on P0 to local buffer • Global Arrays: NGA_Get(g_a, lo, hi, buffer, ld) • g_a is the global array handle, lo and hi are the global lower and upper indices of the data patch, buffer and ld are the local buffer and its array of strides (a complete C sketch follows below)
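For reference, a small C sketch of the GA side of this comparison, assuming GA's C bindings (GA_Initialize, NGA_Create, NGA_Get and friends): it creates a distributed 2-D array and fetches a patch with a single one-sided call, using global indices rather than per-task buffer offsets. Sizes and the patch chosen are purely illustrative.

#include <mpi.h>
#include <stdio.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    int dims[2] = {100, 100}, chunk[2] = {-1, -1};   /* -1: let GA choose the distribution */
    int lo[2], hi[2], ld[1];
    double patch[4][4];

    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);                /* memory allocator used internally by GA */

    int g_a = NGA_Create(C_DBL, 2, dims, "array A", chunk);
    GA_Zero(g_a);

    /* one-sided read of the global patch A(4:7, 3:6), whichever task owns it (0-based indices) */
    lo[0] = 4; lo[1] = 3;
    hi[0] = 7; hi[1] = 6;
    ld[0] = 4;                                       /* leading dimension of the local buffer */
    NGA_Get(g_a, lo, hi, patch, ld);

    GA_Sync();
    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}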

  11. GA Example: 1-D Transpose • Take a 1-D array A and store it in a distributed fashion (g_a) • Perform the transpose operation B(i) = A(n-i+1) for all i • Assume that each processor only needs to work with one patch to complete the operation • (a1, a2, a3, ..., an) becomes (an, ..., a3, a2, a1)

  12. GA Example: 1-D Transpose

  13. Example: 1-D Transpose (cont.)
#define NDIM 1
#define TOTALELEMS 197
#define MAXPROC 128

      program main
      implicit none
#include "mafdecls.fh"
#include "global.fh"
      integer dims(3), chunk(3), nprocs, me, i, lo(3), hi(3), lo1(3)
      integer hi1(3), lo2(3), hi2(3), ld(3), nelem
      integer g_a, g_b, a(MAXPROC*TOTALELEMS), b(MAXPROC*TOTALELEMS)
      integer heap, stack, ichk, ierr
      logical status

      heap = 300000
      stack = 300000

  14. Example: 1-D Transpose (cont.)
c     initialize communication library
      call mpi_init(ierr)
c     initialize ga library
      call ga_initialize()
      me = ga_nodeid()
      nprocs = ga_nnodes()
      dims(1) = nprocs*TOTALELEMS + nprocs/2   ! Unequal data distribution
      ld(1) = MAXPROC*TOTALELEMS
      chunk(1) = TOTALELEMS                    ! Minimum amount of data on each processor
      status = ma_init(MT_F_DBL, stack/nprocs, heap/nprocs)
c     create a global array
      status = nga_create(MT_F_INT, NDIM, dims, "array A", chunk, g_a)
      status = ga_duplicate(g_a, g_b, "array B")
c     initialize data in GA
      do i = 1, dims(1)
         a(i) = i
      end do
      lo1(1) = 1
      hi1(1) = dims(1)
      if (me.eq.0) call nga_put(g_a, lo1, hi1, a, ld)
      call ga_sync()                           ! Make sure data is distributed before continuing

  15. Example: 1-D Transpose (cont.)
c     transpose data locally
      call nga_distribution(g_a, me, lo, hi)
      call nga_get(g_a, lo, hi, a, ld)         ! Use locality
      nelem = hi(1) - lo(1) + 1
      do i = 1, nelem
         b(i) = a(nelem - i + 1)
      end do
c     transpose data globally
      lo2(1) = dims(1) - hi(1) + 1
      hi2(1) = dims(1) - lo(1) + 1
      call nga_put(g_b, lo2, hi2, b, ld)
      call ga_sync()                           ! Make sure transposition is complete

  16. Example: 1-D Transpose (cont.)
c     check transpose
      call nga_get(g_a, lo1, hi1, a, ld)
      call nga_get(g_b, lo1, hi1, b, ld)
      ichk = 0
      do i = 1, dims(1)
         if (a(i).ne.b(dims(1)-i+1).and.me.eq.0) then
            write(6,*) "Mismatch at ", i
            ichk = ichk + 1
         endif
      end do
      if (ichk.eq.0.and.me.eq.0) write(6,*) "Transpose OK"

      status = ga_destroy(g_a)                 ! Deallocate memory for arrays
      status = ga_destroy(g_b)
      call ga_terminate()
      call mpi_finalize(ierr)
      stop
      end

  17. Instrumenting single-sided memory access • Approach 1: Instrument the puts, gets and the data server • Advantage: robust and accurate • Disadvantage: one does not always have access to the source of the data server • Approach 2: Instrument the puts and gets only, cheating on the source and destination of the messages • Advantage: no instrumentation of the data server required • Disadvantage: timings of the messages are inaccurate for non-blocking communications, and trace displays can show flashing lines caused by the synchronisation corrections applied to the timers of different processors • In our work with Global Arrays we have taken Approach 2 (a sketch of such a wrapper follows below)
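A minimal sketch of what Approach 2 might look like in practice (illustrative only; the wrapper name and counters are ours, not part of GA): each get is timed on the initiating process and the whole cost is attributed to that side, which is exactly where the inaccuracy for non-blocking transfers creeps in.

#include <mpi.h>
#include "ga.h"

/* hypothetical per-process instrumentation state */
static double get_time  = 0.0;
static long   get_calls = 0;

/* wrapper used in place of NGA_Get: times the call on the initiating side only */
void traced_NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[])
{
    double t0 = MPI_Wtime();
    NGA_Get(g_a, lo, hi, buf, ld);
    get_time  += MPI_Wtime() - t0;   /* whole transfer charged to this process */
    get_calls += 1;                  /* no cooperation from the data server needed */
}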

  18. GA vs. MPI-2 • MPI-2 now provides a portable mechanism for one-sided communications • Memory is associated with one-sided communications by defining windows • One-sided (put/get) operations occur in well-defined regions of the code separated by fence calls (see the sketch below) • There are restrictions on what a code can do between synchronisation points, e.g. (point-to-point) messages, local compute, etc. • A standard: vendors will implement it • Global Arrays • Designed to make operations as light-weight as possible • Minimal synchronisation required • Works to exploit overlap of communication and computation • Not a standard, so portability problems on new platforms (will the OpenFabrics Alliance [www.openfabrics.org] cure this?)
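For comparison, a minimal MPI-2 one-sided sketch (ours, not from the lecture): memory is exposed through a window, and the puts are confined to an access epoch delimited by MPI_Win_fence calls, which is the kind of synchronisation overhead GA tries to minimise.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, target, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* expose one int per process through an RMA window */
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    target = (rank + 1) % size;             /* same ring pattern as the SHMEM example */

    MPI_Win_fence(0, win);                  /* open the access epoch */
    MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                  /* close the epoch: puts are now complete */

    printf("%d got value %d\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}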
