
Computational Methods in Astrophysics



  1. Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap

  2. Today’s Lecture Distributed Memory Computing II • User defined (derived) datatypes, packing, communicators • Abstract topologies, Cartesian primitives, performance and scaling

  3. Build defined types from built-in • MPI comes with plenty of built-in datatypes

  4. Making a trivial new datatype • MPI_Type_contiguous(n,oldtype,newtype,ierror) • Produces a new datatype that is n copies of the old • Each entry is displaced by the extent of oldtype • oldtype can be a derived datatype as well • Must “commit” the type before sending • Use MPI_Type_commit(newtype,ierr) • Free up with MPI_Type_free(newtype,ierr) – only a finite number of datatype handles is available • Equivalent calls:
      MPI_Send(buffer,count,datatype,dest,tag,comm,ierr)
and
      MPI_Type_contiguous(count,datatype,newtype,ierr)
      MPI_Type_commit(newtype,ierr)
      MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
      MPI_Type_free(newtype,ierr)
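For reference, a minimal C sketch of the same sequence (the buffer, count, dest and tag variables are assumed to exist in the caller; this is an illustration, not the lecture’s own code):

    #include <mpi.h>

    /* Hedged sketch: send n doubles through a contiguous derived type.
       The message is identical to MPI_Send(buffer, n, MPI_DOUBLE, ...). */
    void send_contiguous(double *buffer, int n, int dest, int tag)
    {
        MPI_Datatype newtype;

        MPI_Type_contiguous(n, MPI_DOUBLE, &newtype);  /* n copies of MPI_DOUBLE  */
        MPI_Type_commit(&newtype);                     /* commit before first use */
        MPI_Send(buffer, 1, newtype, dest, tag, MPI_COMM_WORLD);
        MPI_Type_free(&newtype);                       /* handles are finite      */
    }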

  5. Arbitrary datatypes • To derive a datatype from another, specify a typemap: a list of datatypes and byte displacements • The type signature is the list of basic datatypes within the new datatype • The type signature controls how sent and received data are interpreted; the displacements indicate where to find the information in the buffer

  6. Vector datatypes • MPI_Type_vector(count, blocklength, stride, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Each block is blocklength data items long • Separation between successive starting blocks is stride • Function MPI_Type_contiguous can be thought of as a special case of MPI_Type_vector: • MPI_Type_contiguous(count, oldtype, newtype, ierr) = MPI_Type_vector(count, 1, 1, oldtype, newtype,ierr)
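A common use of MPI_Type_vector is sending a single column of a row-major C array, where the column elements are separated by a stride of one row. A hedged sketch (the array shape and helper name are illustrative only):

    #include <mpi.h>

    #define NROWS 8
    #define NCOLS 10

    /* Hedged sketch: one column of a row-major C array is NROWS elements,
       each one NCOLS elements apart, which MPI_Type_vector describes directly. */
    void send_column(double a[NROWS][NCOLS], int col, int dest, int tag)
    {
        MPI_Datatype coltype;

        MPI_Type_vector(NROWS,          /* count: number of blocks       */
                        1,              /* blocklength: 1 element each   */
                        NCOLS,          /* stride between block starts   */
                        MPI_DOUBLE, &coltype);
        MPI_Type_commit(&coltype);
        MPI_Send(&a[0][col], 1, coltype, dest, tag, MPI_COMM_WORLD);
        MPI_Type_free(&coltype);
    }

The matching receive can use the same coltype, or receive into a contiguous buffer of NROWS doubles, since only the type signatures have to agree.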

  7. Vector Example • Let oldtype = {(double, 0), (char, 8)} (named_double) • Call MPI_Type_vector(2, 3, 4, named_double, six_named_doubles,ierr) • New MPI data type, which has the following map: • newtype = {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)} • Two blocks of data are taken, each containing 3 structures of the type named_double • The 3 structures within the block are concatenated contiguously • The stride is set to extent of 4 objects of the same type = 64 bytes, since each named_double must have an extent of 16 bytes (word alignment)

  8. Indexed data types • MPI_Type_indexed(count,array_of_blocklengths, array_of_displacements, oldtype, newtype, ierror) • The function selects count blocks of data of type oldtype • Block n is array_of_blocklengths(n) data items long • List of displacements is defined relative to the first item • MPI_Type_vector can be thought of as a special case of MPI_Type_indexed • All array_of_blocklengths entries = vector blocklen • List of displacements are multiples of the stride

  9. Example of using indexing • Suppose you only need a subset of an array • Could send many short messages • Could pack a datatype • Better solution: create a datatype that includes the displacements you need • Example: list of offsets 9, 27, 39, 142

  10. Example cont • In this case we have 4 elements (=n_to_move) • Array of displacements (offset()) • Each part of the list is a single element • Block lengths are all unity (blocksize()) • The list datatype will be denoted ltype • The following code performs the implicit packing for you:
      MPI_Type_indexed(n_to_move,blocksize,offset,ltype,newtype,ierr)
      MPI_Type_commit(newtype,ierr)
      MPI_Send(buffer,1,newtype,dest,tag,comm,ierr)
      MPI_Type_free(newtype,ierr)

  11. When to use the H versions • Data may be stored in a dynamically allocated region • There is then no single buffer to measure displacements from • MPI_Address(location,address,ierr) returns the byte address of the element location (the newer name is MPI_Get_address) • Create the list of byte displacements for Hindexed using MPI_Address • MPI_BOTTOM can then be used as the starting address for the MPI_Send • MPI_Send(MPI_BOTTOM,1,mytype,dest,tag,comm,ierr)
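A hedged C sketch of the same idea using the current names MPI_Get_address and MPI_Type_create_hindexed (the MPI-2 replacements for MPI_Address and MPI_Type_hindexed); the two-array layout and function name are illustrative:

    #include <mpi.h>

    /* Hedged sketch: x and y are separately allocated, so there is no single
       buffer to measure displacements from.  Absolute byte addresses are
       gathered and the send is made relative to MPI_BOTTOM. */
    void send_two_blocks(double *x, int nx, double *y, int ny, int dest, int tag)
    {
        MPI_Aint     disp[2];
        int          blocklen[2];
        MPI_Datatype type;

        blocklen[0] = nx;
        blocklen[1] = ny;
        MPI_Get_address(x, &disp[0]);    /* absolute byte address of each block */
        MPI_Get_address(y, &disp[1]);

        MPI_Type_create_hindexed(2, blocklen, disp, MPI_DOUBLE, &type);
        MPI_Type_commit(&type);

        /* Displacements are absolute, so the send starts from MPI_BOTTOM */
        MPI_Send(MPI_BOTTOM, 1, type, dest, tag, MPI_COMM_WORLD);
        MPI_Type_free(&type);
    }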

  12. Packing • On some occasions packing may be desirable over specifying a datatype • e.g. packing across multiple arrays • MPI_Pack allows a user to incrementally add data into a user provided buffer • MPI_Pack(inbuf,incount,datatype,outbuf,outcount,position,comm,ierr) • Packed data is given the MPI_PACKED datatype • MPI_Pack_size can be used to determine how large the required buffer is in bytes • Specify number of elements, MPI datatype, communicator • Communicator is required since heterogeneous environment may require more packing than expected
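A hedged C sketch of sizing a pack buffer with MPI_Pack_size (the helper name is illustrative); note the communicator argument, for the heterogeneity reason given above:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hedged sketch: ask MPI how many bytes are needed to pack nints ints and
       nchars chars for this communicator, rather than guessing with sizeof(). */
    char *alloc_pack_buffer(int nints, int nchars, MPI_Comm comm, int *bufsize)
    {
        int size_ints, size_chars;

        MPI_Pack_size(nints,  MPI_INT,  comm, &size_ints);
        MPI_Pack_size(nchars, MPI_CHAR, comm, &size_chars);

        *bufsize = size_ints + size_chars;
        return malloc(*bufsize);     /* caller frees after the MPI_Send */
    }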

  13. Equivalent examples – both datatypes send the same message

Struct datatype:
    int i;
    char c[100];
    MPI_Aint disp[2];
    int blocklen[2] = {1, 100};
    MPI_Datatype type[2] = {MPI_INT, MPI_CHAR};
    MPI_Datatype Type;
    /* create datatype */
    MPI_Get_address(&i, &(disp[0]));
    MPI_Get_address(c, &(disp[1]));
    MPI_Type_create_struct(2, blocklen, disp, type, &Type);
    MPI_Type_commit(&Type);
    /* send */
    MPI_Send(MPI_BOTTOM, 1, Type, 1, 0, MPI_COMM_WORLD);

Packed datatype:
    int i;
    char c[100];
    char buffer[110];
    int position = 0;
    /* pack */
    MPI_Pack(&i, 1, MPI_INT, buffer, 110, &position, MPI_COMM_WORLD);
    MPI_Pack(c, 100, MPI_CHAR, buffer, 110, &position, MPI_COMM_WORLD);
    /* send */
    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

  14. Receiving & unpacking packed messages • Note the MPI_PACKED datatype can be used to receive any message • To unpack, make successive calls to MPI_Unpack(inbuf,insize,position,outbuf,outcount,datatype,comm,ierr) • The first call starts at position 0 • The second call starts at the position returned by the first MPI_Unpack (returned in the position variable)

  15. MPI_Unpack example
    int i;
    char c[100];
    MPI_Status status;
    char buffer[110];
    int position = 0;
    /* receive */
    MPI_Recv(buffer, 110, MPI_PACKED, 1, 0, MPI_COMM_WORLD, &status);
    /* unpack */
    MPI_Unpack(buffer, 110, &position, &i, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, 110, &position, c, 100, MPI_CHAR, MPI_COMM_WORLD);

  16. Note on implementation • A packing unit may contain metadata (recall issue of heterogeneity in communicator) • e.g. 64 processors of standard type, 1 additional graphics processor of opposite endian • Any header will contain information on encoding and the size of the unit for error checking • Implies you can’t concatenate two packed messages and send in a single operation • Similarly can’t separate an incoming message into two sections and unpack separately

  17. Topologies, Cartesian Primitives and messaging performance • Applicable to applications where data structures reflect a specific domain decomposition • Optional attribute that can be given to an intracommunicator only – not to intercommunicators • May possibly help the runtime mapping of processes onto hardware • We’ll examine one code example that brings together a number of features we have discussed and places them in context

  18. Inadequacies of linear ranking • Linear ranking implies a ring-like abstract topology for the processes • Few applications have a communication pattern that logically reflects a ring • Two or three dimensional grids/tori are very common • We call the logical process arrangement defined by the communication pattern the “virtual topology” • Do not confuse the virtual topology with the machine topology • The virtual topology must be mapped to the hardware via an embedding • This may or may not be beneficial

  19. Graphs & Virtual Topologies • The communication pattern defines a graph • Communication graphs are assumed to be symmetric: if a communicates with b, then b can communicate with a • Communicators allow messaging between any pair of nodes, so a virtual topology may necessarily neglect some communication modes • Cartesian example: a 3x4 2d virtual topology; the upper number is the rank, the lower value gives (column,row)

  20. Specific example: Poisson equation on a 2d grid • Grid points xi = i/(n+1), where n defines the number of interior points; the same spacing is used in the y axis • On the unit square the grid spacing is h = 1/(n+1) • The equation is approximated by the 5-point stencil (i-1,j), (i+1,j), (i,j-1), (i,j+1) about interior point (i,j):
      ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4*u(i,j) ) / h**2 = f(i,j)
• Solve by guessing for u initially, then iteratively updating u (Jacobi iteration):
      unew(i,j) = 0.25*( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h**2*f(i,j) )

  21. 2d Poisson continued • Calculation of the updated uk involves looping over all grid points • How do we decompose the problem in parallel? • Simplest choice: a 1-d decomposition (slabs), e.g. Rank=0 and Rank=1 each holding one slab • Solid lines denote data stored on the node • Dotted lines denote all data required for the calculation • Nodes will always require data that they do not hold locally – ghost cells

  22. Changes to the loop structure • The loop is now over the node’s start and end points • Data arrays must include space for the j+1 and j-1 entries in the y direction • Ghost data must be communicated • MPI_Cart topologies create the necessary facilities for this messaging to be done quickly and easily

Single processor:
      integer i,j,n
      double precision u(0:n+1,0:n+1)
      double precision unew(0:n+1,0:n+1)
      do j=1,n
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                        -h**2*f(i,j))
         end do
      end do

Parallel node:
      integer i,j,n
      double precision u(0:n+1,s-1:e+1)
      double precision unew(0:n+1,s-1:e+1)
      do j=s,e
         do i=1,n
            unew(i,j)= 0.25*( u(i-1,j) + u(i,j+1)+u(i,j-1)+u(i+1,j)
     +                        -h**2*f(i,j))
         end do
      end do

  23. MPI_Cart_create • Creates a communicator with a Cartesian topology • The 2nd through 5th arguments define the new topology; the 1st argument is the initial communicator, the 6th argument is the new communicator • Example for a logical 2d grid of processors:
      integer dims(2)
      logical isperiodic(2),reorder
      dims(1)=4                  ! dimensions of the axes
      dims(2)=3
      isperiodic(1)=.false.      ! are the axis directions periodic?
      isperiodic(2)=.false.      ! (consider a grid on the Earth, for example)
      reorder=.true.             ! rank in the new communicator may differ
                                 ! from the old (may improve performance)
      ndim=2
      call MPI_Cart_create(MPI_COMM_WORLD,
     +     ndim,dims,isperiodic,reorder,
     +     comm2d,ierr)

  24. Important issue • The topology MPI_Cart_create initializes is related to the communication required by the domain decomposition • It is not the implicit communication structure of the algorithm • For the 1d decomposition all x values are contained on the node – only y boundaries need be exchanged • In the 2d Poisson example we can choose either a 1d or 2d decomposition

  25. Finding your coordinates & neighbours in the new communicator • MPI_Cart_get(comm1d,1,dims,isperiodic,coords,ierr) • Returns dims and isperiodic as defined in the communicator • The calling process’s coords are also returned • You may also want to convert a rank to coordinates: MPI_Cart_coords(comm1d,rank,maxdims,coords,ierr) • If you call it with your own rank it returns your own coords • When moving data we frequently wish to shift across one axis direction synchronously on all processors • If you just want to send and receive ghost cell values to/from neighbours there is a simple mechanism for determining the neighbouring ranks: MPI_Cart_shift
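A hedged C sketch pulling these calls together for a 2-d Cartesian communicator (comm2d is assumed to come from an earlier MPI_Cart_create; the function name is illustrative):

    #include <mpi.h>

    /* Hedged sketch: find this process's coordinates and its neighbours
       above and below along axis 1 of a 2-d Cartesian communicator. */
    void query_neighbours(MPI_Comm comm2d)
    {
        int rank, coords[2], nbrbottom, nbrtop;

        MPI_Comm_rank(comm2d, &rank);
        MPI_Cart_coords(comm2d, rank, 2, coords);   /* my position in the grid */

        /* Shift of +1 along direction 1: the source is the lower neighbour,
           the destination the upper one; MPI_PROC_NULL marks a missing
           neighbour at a non-periodic boundary. */
        MPI_Cart_shift(comm2d, 1, 1, &nbrbottom, &nbrtop);
    }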

  26. MPI_Cart_shift • MPI_Cart_shift(comm,direction,shift,src,dest,ierr) • Uses the calling process’s position in the Cartesian communicator • Input the direction to shift (note: axes are numbered from 0 to n-1) • Input the displacement: >0 = upward shift, <0 = downward shift • Returns the rank of the neighbour in the upper direction (dest) • Returns the rank of the neighbour in the lower direction (src) • If there is no neighbour, MPI_PROC_NULL is returned • A message to this rank is ignored, and completes immediately • Used for boundary exchanges, where data must move upward and downward (e.g. between Rank=0 and Rank=1 in the 1-d decomposition)

  27. Remaining ingredients • We still need to compute the start and end points for each node • If n is divisible by nprocs (best for load balance):
      s = 1 + myrank * (n/nprocs)
      e = s + (n/nprocs) - 1
• When nprocs does not divide n, a naïve solution is:
      s = 1 + myrank * (n/nprocs)
      if (myrank .eq. nprocs-1) then
         e = n
      else
         e = s + floor(n/nprocs) - 1
      end if
• floor(x) returns the largest integer value no greater than x • It is better to use MPE_Decomp1d to even things out (see the ANL/MSU MPI website, and the sketch below), though the load balance still won’t be perfect
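A hedged C sketch of an even 1-d block decomposition, similar in spirit to MPE_Decomp1d (the function name is illustrative): the first n mod nprocs ranks take one extra point, so no rank holds more than one point above the average.

    /* Hedged sketch: 1-based start/end indices for rank myrank out of nprocs,
       distributing n points as evenly as possible. */
    void decomp1d(int n, int nprocs, int myrank, int *s, int *e)
    {
        int base  = n / nprocs;     /* minimum points per rank    */
        int extra = n % nprocs;     /* leftover points to spread  */

        if (myrank < extra) {
            *s = 1 + myrank * (base + 1);
            *e = *s + base;         /* base+1 points on this rank */
        } else {
            *s = 1 + myrank * base + extra;
            *e = *s + base - 1;     /* base points on this rank   */
        }
    }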

  28. Prototype code (from http://www-unix.mcs.anl.gov/mpi/usingmpi/examples/intermediate/oned_f.htm)
      call MPI_CART_CREATE( MPI_COMM_WORLD, 1, numprocs, .false.,
     $                      .true., comm1d, ierr )
c
c Get my position in this communicator, and my neighbors
c
      call MPI_COMM_RANK( comm1d, myid, ierr )
      call MPI_Cart_shift( comm1d, 0, 1, nbrbottom, nbrtop, ierr )
c Compute the actual decomposition
      call MPE_DECOMP1D( ny, numprocs, myid, s, e )
c Initialize the right-hand-side (f) and the initial solution guess (a)
      call onedinit( a, b, f, nx, s, e )
c
c Actually do the computation.  Note the use of a collective operation to
c check for convergence, and a do-loop to bound the number of iterations.
c
      do 10 it=1, maxit
         call exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( a, f, nx, s, e, b )    ! Jacobi sweep
         call exchng1( b, nx, s, e, comm1d, nbrbottom, nbrtop )
         call sweep1d( b, f, nx, s, e, a )
         dwork = diff( a, b, nx, s, e )       ! Convergence test
         call MPI_Allreduce( dwork, diffnorm, 1, MPI_DOUBLE_PRECISION,
     $        MPI_SUM, comm1d, ierr )
         if (diffnorm .lt. 1.0e-5) goto 20
c        if (myid .eq. 0) print *, 2*it, ' Difference is ', diffnorm
 10   continue
      if (myid .eq. 0) print *, 'Failed to converge'
 20   continue
      if (myid .eq. 0) then
         print *, 'Converged after ', 2*it, ' Iterations'
      endif
The routine alternately computes into a and then b – hence the 2*it at the end

  29. Which routine determines performance (other than comp. kernel)? • exchng1 controls the messaging and how this is coded will strongly affect overall execution time • Recall from last lecture, 5(!) different ways to do communication: • Blocking sends and receives • very bad idea, potentially unsafe (although in this case it is OK) • Ordered send • match send and receive perfectly • Sendrecv • systems deals with buffering • Buffered Bsend • Nonblocking Isend

  30. Option #1: Simple sends and receives
      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
C
      call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 0, comm1d, status, ierr )
      call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &               nbrbottom, 1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &               nbrtop, 1, comm1d, status, ierr )
      return
      end

  31. Option #2: Matched sends and receives
      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop, rank, coord
      integer status(MPI_STATUS_SIZE), ierr
!
      call MPI_COMM_RANK( comm1d, rank, ierr )
      call MPI_CART_COORDS( comm1d, rank, 1, coord, ierr )
      if (mod( coord, 2 ) .eq. 0) then
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
      else
         call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 0, comm1d, status, ierr )
         call MPI_SEND( a(1,e), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 0, comm1d, ierr )
         call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION,
     &                  nbrtop, 1, comm1d, status, ierr )
         call MPI_SEND( a(1,s), nx, MPI_DOUBLE_PRECISION,
     &                  nbrbottom, 1, comm1d, ierr )
      endif
      return
      end

  32. Option #3: Sendrecv
      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
c
      call MPI_SENDRECV(
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, status, ierr )
      call MPI_SENDRECV(
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, status, ierr )
      return
      end

  33. Option #4: Buffered Bsend
      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      use mpi
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status(MPI_STATUS_SIZE), ierr
      call MPI_BSEND( a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &                0, comm1d, ierr )
      call MPI_RECV( a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &               0, comm1d, status, ierr )
      call MPI_BSEND( a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom,
     &                1, comm1d, ierr )
      call MPI_RECV( a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop,
     &               1, comm1d, status, ierr )
      return
      end
You must also provide buffer space and connect to it via MPI_BUFFER_ATTACH (see the Using MPI book)
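The attach step mentioned above could look like the following hedged C sketch (the helper name and the assumption of two outstanding bsends of nx doubles are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    /* Hedged sketch: reserve buffer space for two outstanding MPI_Bsend calls
       of nx doubles each, sized with MPI_Pack_size plus MPI_BSEND_OVERHEAD. */
    void attach_bsend_buffer(int nx)
    {
        int   size;
        void *buf;

        MPI_Pack_size(nx, MPI_DOUBLE, MPI_COMM_WORLD, &size);
        size = 2 * (size + MPI_BSEND_OVERHEAD);
        buf  = malloc(size);
        MPI_Buffer_attach(buf, size);
        /* later: MPI_Buffer_detach(&buf, &size); free(buf); */
    }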

  34. Option #5: Nonblocking Isend
      subroutine exchng1( a, nx, s, e, comm1d, nbrbottom, nbrtop )
      include 'mpif.h'
      integer nx, s, e
      double precision a(0:nx+1,s-1:e+1)
      integer comm1d, nbrbottom, nbrtop
      integer status_array(MPI_STATUS_SIZE,4), ierr, req(4)
C
      call MPI_IRECV (
     &     a(1,s-1), nx, MPI_DOUBLE_PRECISION, nbrbottom, 0,
     &     comm1d, req(1), ierr )
      call MPI_IRECV (
     &     a(1,e+1), nx, MPI_DOUBLE_PRECISION, nbrtop, 1,
     &     comm1d, req(2), ierr )
      call MPI_ISEND (
     &     a(1,e), nx, MPI_DOUBLE_PRECISION, nbrtop, 0,
     &     comm1d, req(3), ierr )
      call MPI_ISEND (
     &     a(1,s), nx, MPI_DOUBLE_PRECISION, nbrbottom, 1,
     &     comm1d, req(4), ierr )
C
      call MPI_WAITALL ( 4, req, status_array, ierr )
      return
      end

  35. Comparison of code speed for the different communication methods • Figure: execution time vs. Ncpu (problem size chosen to be large enough to decompose) • The blocking send is by far the worst performer – what happened?

  36. What happens in the blocking send version? • The blocking communication can create a serial cascade • The “top” process sends a message to a null process, which completes immediately • It can then receive from another process • The ability to receive then cascades down through the process ranks • Figure: ranks P0–P5 each posting SEND then RECV vs. time – completely serial communication

  37. Observed scaling • Even non-blocking sends do not scale better than 20% efficiency on 64 nodes – why? • Figure: speed-up vs. Ncpu

  38. Scalability Analysis • Examine the execution time to see if time in communication is a bottleneck • Time to send n bytes: Timecomm = s + r*n • s = start-up overhead (latency) • r = time to send one byte (~1/bandwidth) • Time in the exchng1 routine ~ 2(s + r*n), where n = 8*nx for double precision data • Communication time per process is fixed regardless of the number of processors – the communication doesn’t scale

  39. 2-d topology communication time • Excluding processors on the domain edge, each process exchanges its x boundary, followed by its y boundary • For a square domain with p total processors each boundary holds n/sqrt(p) points • Communication time therefore decreases as 1/sqrt(p) – moderately scalable • Note that for a small number of processors the 1d decomposition is actually better
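Putting rough numbers on this with the cost model from the previous slide (a hedged estimate, assuming a square n x n grid of double precision values split over p processes):

    1-d decomposition: two exchanges of a full row of n points
        T_1d ~ 2 ( s + r*8n )                 independent of p
    2-d decomposition: four exchanges of an edge of n/sqrt(p) points
        T_2d ~ 4 ( s + r*8n/sqrt(p) )         data term falls as 1/sqrt(p)

For small p the doubled latency term (4s versus 2s) can outweigh the smaller data term, which is why the 1d decomposition can win there.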

  40. Execution time comparison • Figure: execution time vs. NCPU

  41. Changes required for the 2-d domain decomposition • Need to update process topology • MPI_Cart_create(MPI_COMM_WORLD,2,.. • Need two calls to MPI_Cart_shift • One for each axis direction • Body of sweep routine must change loop limits • Now have limits in x direction • Arrays must also be updated for smaller sizes • Lastly, exchng1 must be updated to exchng2 • Boundary exchanges in both directions

  42. 2d: User derived datatype for boundary exchanges • The x-boundary consists of non-contiguous memory locations • This is unavoidable because the 2d array is mapped into linear memory • Use MPI_Type_vector, discussed previously

  43. Solution • Number of blocks = (y end)-(y start)+1 = ey-sy+3 (e.g. if end=4, start=2 there are 3 entries) • 1 entry per block • Stride = (x end)-(x start)+1 = ex-sx+3 • Remember the stride is the displacement between the starts of successive blocks • Create and commit:
      call MPI_Type_vector( ey-sy+3, 1, ex-sx+3,
     &     MPI_DOUBLE_PRECISION, stridetype, ierr )
      call MPI_Type_commit( stridetype, ierr )

  44. exchng2d
      subroutine exchng2( a, sx, ex, sy, ey,
     &                    comm2d, stridetype,
     &                    nbrleft, nbrright, nbrtop, nbrbottom )
      include 'mpif.h'
      integer sx, ex, sy, ey, stridetype
      double precision a(sx-1:ex+1, sy-1:ey+1)
      integer nbrleft, nbrright, nbrtop, nbrbottom, comm2d
      integer status(MPI_STATUS_SIZE), ierr, nx
c
      nx = ex - sx + 1
c These are just like the 1-d versions, except for less data
      call MPI_SENDRECV( a(sx,ey), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 0,
     &                   a(sx,sy-1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 0, comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy), nx, MPI_DOUBLE_PRECISION,
     &                   nbrbottom, 1,
     &                   a(sx,ey+1), nx, MPI_DOUBLE_PRECISION,
     &                   nbrtop, 1, comm2d, status, ierr )
c
c This uses the vector datatype stridetype
      call MPI_SENDRECV( a(ex,sy), 1, stridetype, nbrright, 0,
     &                   a(sx-1,sy), 1, stridetype, nbrleft, 0,
     &                   comm2d, status, ierr )
      call MPI_SENDRECV( a(sx,sy), 1, stridetype, nbrleft, 1,
     &                   a(ex+1,sy), 1, stridetype, nbrright, 1,
     &                   comm2d, status, ierr )
      return
      end

  45. Overlapping communication and computation • The expensive nature of communication argues for overlapping (pipelining!) it with computation • Increasing divide between CPU speed and message latency means this will only be more important in the future • Not always achievable – e.g. master-slave task decomposition • Nonblocking communication facilitates this kind of overlap • 2d Poisson example • Interior of arrays can be calculated without the ghost cells

  46. Computation that can be performed without ghost cells • All blue points are contained on the node • Red points denote ghost cells • Blue points contained within the inner rectangle use only local data • Figure: 2d domain decomposition

  47. Steps in comms/comp overlapping version • Indicate where data from other processes is to be received • Post non-blocking sends to other processes • Perform local computation • Check for arrival of data from other processes • Complete remaining computation Final algorithm is more complicated than the non-overlapping version, but all the necessary code has already been written. This algorithmic structure is used universally to overlap comms & comps.

  48. Structure of computation of final pieces • Post 4 MPI_Irecv’s and 4 MPI_Isend’s • The handles from the 4 Irecv’s determine which part of the remaining cells to compute • Use MPI_WAITANY to test whether operations have completed • Ignore completion of the sends (but must still ensure they have completed)
      do k=1,8
         call MPI_WAITANY( 8, requests, idx, status, ierr )
         ! Use the request index to determine which edge
         if (idx.eq.1) then
            do j=sy,ey          ! Left hand cells
               unew(si,j) = ...
            end do
         else if (idx.eq.2) then
            do j=sy,ey          ! Right hand cells
               unew(ei,j) = ...
            end do
         else if (idx.eq.3) then
            do i=sx,ex          ! Upper cells
               unew(i,ej) = ...
            end do
         else if (idx.eq.4) then
            do i=sx,ex          ! Lower cells
               unew(i,sj) = ...
            end do
         end if
      end do

  49. Note about overlapping comms and comps • Nonblocking operations were introduced first and foremost to allow the writing of safe programs • Many implementations cannot physically overlap without additional hardware • Communication processor or system communication thread • Thus nonblocking operations may allow the overlapping, but they are not required to • If code that overlaps comms/comps does not increase performance it may not be your fault

  50. 3d • Problem sizes scale cubically in three dimensions • The number of grid cells frequently exceeds 10^7 • The large amount of work makes 3d codes natural candidates for parallelization • Domain decomposition is expected to work best with a 3d virtual topology • MPI_Dims_create(nodes,ndims,dims) divides nodes into the best load-balanced topology • e.g. MPI_Dims_create(6,3,dims) gives dims=(3,2,1) • Values of dims may be set before the call, e.g. with dims=(0,3,0), MPI_Dims_create(6,3,dims) gives dims=(2,3,1) • If nodes is not divisible by the product of the preset dims values an error is returned
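A hedged C sketch of setting up such a 3d topology (the function name is illustrative; periodicity is switched off here):

    #include <mpi.h>

    /* Hedged sketch: let MPI factor the process count into a balanced 3-d grid
       and build the matching Cartesian communicator. */
    void make_3d_topology(MPI_Comm *comm3d)
    {
        int nprocs;
        int dims[3]    = {0, 0, 0};   /* 0 = let MPI_Dims_create choose   */
        int periods[3] = {0, 0, 0};   /* non-periodic in every direction  */

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 3, dims);          /* e.g. 6 -> (3,2,1) */
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods,
                        1 /* allow rank reordering */, comm3d);
    }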
