An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations

Presentation Transcript


    1. An HPspmd Analysis of the Gadget 2 Code for Cosmological Simulations Bryan Carpenter University of Southampton

    2. Background HPspmd model was introduced around 1997† as a reaction to (the failure of) High Performance Fortran (HPF). The idea was to keep the data model of HPF, which we thought was good, but abandon the execution model, which was both inflexible and hard to implement. † See papers at www.hpjava.org

    3. The HPspmd Model The idea of HPspmd is to extend a base language with a specific set of syntax extensions for representing distributed data (distributed arrays). The extended language does not contain any communication or synchronization primitives. The programming model is SPMD, and all communication between nodes goes through libraries. HPspmd itself only provides a framework for writing libraries that act on distributed data, including communication libraries.

    4. HPJava, and other Base Languages It was always the intention that HPspmd could be bound to various base languages. The first full implementation was its binding to Java: HPJava, finally released in 2003. Up to now HPJava is the only full implementation of HPspmd (other than early C++ prototypes). But a full binding to, say, C or C++ is also possible (and desirable?).

    5. Goals of This Work Revisit HPspmd, with a particular eye to a possible role in programming multicore systems. Push the HPspmd model beyond its original zone of application—to a highly irregular production code for astrophysics. See what we learn.

    6. HPspmd

    7. HPF Distributed Array Example

    8. HPF Parallel Assignments
        ! Parallel loop
        FORALL (I = 2:8, J = 1:8) A(I, J) = (I + 1) * (J + 1)
        ! Communication (implicit)
        A(1, :) = A(8, :)
        Communication can happen almost anywhere.

    9. HPspmd Distributed Array Example

    10. HPspmd Parallel Assignments
        // Parallel loop
        overall(i = x for 1:7)
          overall(j = y for :)
            a[i, j] = (i` + 2) * (j` + 2);
        // Communication (always explicit)
        Adlib.remap(a[[0, :]], a[[7, :]]);
        In an element reference, the subscript must be a distributed index (like i or j) in the range of the array. Section subscripts—anything goes.

    11. HPF Collapsed Dimensions

    12. HPspmd Sequential Dimensions

    13. Working with Sequential Dimensions
        overall(i = x for 1:7)
          for(int j = 0; j < N; j++)
            a[i, j] = (i` + 2) * (j + 2);
        overall is a general control construct—it can include (almost) arbitrary blocks of code. There are constraints on nesting of overalls, and on calls to collective ops inside overall. See later.

    14. HPspmd “Co-arrays”

    15. Working with Co-arrays
        overall(me = procs for :)
          for(int i = 0; i < N; i++)
            a[me, i] = me` * N + i` ;
        Just a special case of HPspmd distributed arrays, where the distributed range is a process dimension. The distribution format is reminiscent of Co-array Fortran, but there is no communication in the language—still use remap() etc.

    16. HPspmd and Multicore HPspmd model originally designed for distributed memory architectures. There are some arguments why it might be a good model for shared memory and (especially) hybrid architectures.

    17. It isn’t MPI In principle you can do explicit message-passing in HPspmd programs, but not usually necessary. High level collective communications like remap() can be implemented efficiently by MPI, or by copying in shared memory, or a mixture. See also other collective primitives proposed later in this talk.

    18. It isn’t OpenMP It doesn’t require shared memory for efficient implementation. Even primitives built into PGAS languages like Co-array Fortran have a bias towards shared memory model. HPspmd model is agnostic about communication, leaving it to libraries. Existing and proposed libraries are “above” message-passing or shared memory level.

    19. Cache Issues In cache-coherent processors, common advice to reduce false sharing, fragmentation, etc is to use “higher dimensional arrays” to keep partitions contiguous in the address space†. This is just how HPspmd distributed arrays are implemented on shared memory processors. † See for example Parallel Computer Architecture, Culler and Pal Singh, 1999, §5.6
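    To make the layout point concrete, here is a small C sketch (sizes and names are invented for illustration, not taken from any real code): a 2D grid block-partitioned over a PB x PB grid of workers, stored once as a flat 2D array and once as a 4D array in which each worker's block is contiguous.

        /* Hypothetical layout sketch.  In the flat array, the rows belonging to
         * one worker's block are scattered through the address space, so blocks
         * can share cache lines and pages with their neighbours; in the 4D
         * array each block is a single contiguous chunk. */
        #define PB 2                 /* workers per grid dimension (assumed) */
        #define B  512               /* block size per dimension (assumed)   */
        #define N  (PB * B)

        static double flat[N][N];              /* block (p,q): rows p*B..p*B+B-1, cols
                                                  q*B..q*B+B-1 -- not contiguous       */
        static double blocked[PB][PB][B][B];   /* block (p,q) is blocked[p][q]: one
                                                  contiguous B*B region                */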

    20. Gadget

    21. Gadget-2 A free production code for cosmological N-body and hydrodynamic computations†. Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching. Written in C—already fully parallelized using MPI. Versions used in various astrophysics research papers, including the Millennium Simulation. † http://www.mpa-garching.mpg.de/gadget

    22. Millennium Simulation Follows the evolution of 10^10 dark matter “particles” from the early Universe to the current day†. Performed on 512 nodes of an IBM p690. Used 1 TB of distributed memory. 350,000 CPU hours – 28 days elapsed time. Floating point performance around 0.2 TFLOPS. Around 20 TB total data generated. † Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution, Springel et al. Gadget home page.

    23. Dynamics in Gadget Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity, plus hydrodynamic forces on gas. The classical N-body problem.
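    Concretely, the gravitational part is the standard Newtonian N-body equation of motion (written here in generic notation; Gadget additionally applies a softening at small separations, omitted in this sketch):

        \ddot{\mathbf{r}}_i = -G \sum_{j \neq i} \frac{m_j \,(\mathbf{r}_i - \mathbf{r}_j)}{|\mathbf{r}_i - \mathbf{r}_j|^3}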

    24. Gadget Main Loop Schematic view of the Gadget code:
        … Initialize …
        while (not done) {
          move_particles() ;              // update positions
          domain_Decomposition() ;
          compute_accelerations() ;
          advance_and_find_timesteps() ;  // update velocities
        }
        Most of the interesting work happens in domain_Decomposition() and compute_accelerations().

    25. Computing Forces The function compute_accelerations() must compute the forces experienced by all particles. In particular it must compute the gravitational force. Because gravity is a long range force, every particle affects every other. Naively, the total cost is O(N^2). With N ≈ 10^10, this is infeasible. Need some kind of approximation. Intuitive approach: when computing the force on particle i, group together particles in regions far from i, and treat each group as a single particle, located at its centre of gravity.

    26. Barnes-Hut Tree

    27. Barnes-Hut Force Computation To compute the BH force on a particle i, traverse the tree starting from the root: if a node n is far from i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n; if node n is close to i, visit the children of n and recurse. On average, the number of nodes “opened” to compute the force on i is O(log N), cf. visiting O(N) particles in the naïve algorithm. A sketch of this traversal follows below.
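    A minimal C sketch of the traversal (data structure and names are hypothetical, not Gadget's; the usual opening criterion compares the node size s to the distance d against an angle parameter, s/d < theta):

        #include <math.h>

        #define THETA 0.5   /* opening angle; smaller means more accurate (assumed value) */

        struct bhnode {
          double com[3];            /* centre of mass of the node          */
          double mass;              /* total mass contained in the node    */
          double size;              /* side length of the node's cube      */
          struct bhnode *child[8];  /* octree children, NULL if absent     */
        };

        /* Accumulate the acceleration on a particle at pos[] into acc[].
         * G is the gravitational constant; softening is omitted. */
        static void bh_force(const struct bhnode *n, const double pos[3],
                             double acc[3], double G)
        {
          if (n == NULL || n->mass == 0.0)
            return;

          double d[3] = { n->com[0] - pos[0], n->com[1] - pos[1], n->com[2] - pos[2] };
          double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
          double r  = sqrt(r2);
          if (r == 0.0)
            return;                          /* the particle itself */

          int is_leaf = 1;
          for (int c = 0; c < 8; c++)
            if (n->child[c] != NULL) { is_leaf = 0; break; }

          if (is_leaf || n->size / r < THETA) {
            /* far enough (or a single particle): use the node's centre of mass */
            double f = G * n->mass / (r2 * r);
            for (int k = 0; k < 3; k++)
              acc[k] += f * d[k];
          } else {
            /* too close: "open" the node and recurse into its children */
            for (int c = 0; c < 8; c++)
              bh_force(n->child[c], pos, acc, G);
          }
        }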

    28. BH in Gadget In Gadget, the preferred method of dealing with the long-range part of the gravitational force is the TreePM algorithm (see later). But the BH tree is still the fundamental data structure—traversals of it are used to locate particles near a given particle:
        Short-range part of the gravity force in TreePM
        Estimation of density in SPH
        Computation of hydrodynamic forces

    29. TreePM Artificially split gravitational potential into two parts:
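    A sketch of the standard split (the form used in the Gadget-2 paper, up to notation): the potential is multiplied by a Gaussian cutoff in Fourier space to give the long-range part, leaving a short-range remainder that decays quickly in real space,

        \phi_k = \phi_k^{\rm long} + \phi_k^{\rm short}, \qquad
        \phi_k^{\rm long} = \phi_k \, e^{-k^2 r_s^2}, \qquad
        \phi^{\rm short}(\mathbf{r}) = -G \sum_i \frac{m_i}{r_i}\,{\rm erfc}\!\left(\frac{r_i}{2 r_s}\right),

    where r_s is the split scale and r_i is the distance to particle i. The long-range part is computed on the particle-mesh grid with an FFT; the short-range part is handled by the tree walk.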

    30. Smoothed Particle Hydrodynamics Uses discrete particles to represent state of fluid. Equations of motion dependent on mass density, estimated within a smoothing radius. Force calculation has 2 distinct phases: Calculate density, adaptively setting smoothing radius to give enough neighbours. Calculate forces based on the smoothing kernel.
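    For reference, the density estimate in the first phase has the usual SPH form (generic notation, not Gadget's exact symbols):

        \rho_i = \sum_j m_j \, W(|\mathbf{r}_i - \mathbf{r}_j|, h_i)

    where W is the smoothing kernel and the smoothing length h_i is adapted so that a roughly fixed number of neighbours lies inside it; the second phase then sums hydrodynamic forces over that same neighbour set.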

    31. Domain Decomposition Can’t just divide space in a fixed way, because some regions will have many more particles than others – poor load balancing. Can’t just divide particles in a fixed way, because particles move independently through space, and we want to maintain physically close particles on the same processor, as far as practical – communication problem.

    32. Peano-Hilbert Curve

    33. Decomposition based on P-H Curve Sort particles by position on the Peano-Hilbert curve, then divide evenly into P domains (a sketch follows below). Characteristics of this decomposition:
        Good load balancing.
        Domains simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space.
        Domains have a relatively simple mapping to BH octree nodes.
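    A serial C sketch of the decomposition idea (names are hypothetical; ph_key() stands in for a 3D Peano-Hilbert key function such as the one Gadget provides, and in the real code the sort is a distributed sort across processes):

        #include <stdlib.h>

        typedef unsigned long long ph_key_t;

        struct part { double pos[3]; ph_key_t key; };

        /* Assumed helper: maps a 3D position onto the 1D Peano-Hilbert curve. */
        extern ph_key_t ph_key(const double pos[3]);

        static int cmp_key(const void *a, const void *b)
        {
          ph_key_t ka = ((const struct part *) a)->key;
          ph_key_t kb = ((const struct part *) b)->key;
          return (ka > kb) - (ka < kb);
        }

        /* Sort n particles by P-H key and cut the sorted sequence into P
         * nearly equal domains; first[p] is the index of the first particle
         * of domain p, with first[P] == n. */
        static void decompose(struct part *parts, int n, int P, int *first)
        {
          for (int i = 0; i < n; i++)
            parts[i].key = ph_key(parts[i].pos);

          qsort(parts, n, sizeof parts[0], cmp_key);

          for (int p = 0; p <= P; p++)
            first[p] = (int) ((long long) n * p / P);
        }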

    34. Distribution of the BH Tree in Gadget († Ibid.)

    35. Communication in Gadget Can identify 4 recurring “non-trivial” patterns:
        Distributed sort of particles, according to P-H key: implements domain decomposition.
        Export of particles to other nodes, for calculation of remote contributions to force, density, etc, and retrieval of results.
        Projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.
        Distributed Fast Fourier Transform.

    36. HPspmd Analysis of Gadget

    37. Approach As a first step, we go through the whole code, and annotate it with the changes that would be needed to make it into HPspmd code. Gadget is 16,000+ lines of C code. Gadget is more complex than any previous HPspmd code, so we expect that new communication functions may be needed. Also open to the possibility of new language extensions.

    38. Use of Co-Arrays Immediately clear the translation will make extensive use of “co-arrays”. General HPspmd distributed arrays are not suitable for structures like the Nodes array that holds the BH tree, because HPspmd arrays only allow very regular patterns of local access. Similar considerations apply to other major arrays, like the particles array P.

    39. The Particles Array Consider P, which holds NumPart particles in each node, where in general NumPart is different for each node. Could use a general HPspmd distributed array with an IrregRange†:
        Range x = new IrregRange(procs, NumPart);
        struct particle_data [[-]] P = new struct particle_data [[x]];
        † Called GEN_BLOCK in HPF 2.0

    40. Possible IrregRange Array for P Assumes NumParts = [6, 8, 5].

    41. Using General Distributed Array This would often be convenient, e.g. in move_particles():
        overall(i = x for :)
          for(j = 0 ; j < 3 ; j++)
            P[i].Pos[j] += P[i].Vel[j] * dt_drift;
        and even, in domain_Decomposition():
        sort(P, domain_compare_key);
        where sort() is a notional distributed sort.

    42. Problem with General Array But, in compute_potential(), for example:
        overall(i = x for :) {
          pos_x = P[i].Pos[0];
          …
          … Traverse down from root of BH tree …
          if(… node contains no'th single particle…) {
            dx = P[no].Pos[0] - pos_x;
            …
          }
        }
        Reference P[no] is illegal in HPspmd!!

    43. Co-Array for P Assumes NumParts = [6, 8, 5].

    44. Using a Co-Array Now, in move_particles(), we need:
        overall(me = procs for :)
          for(i = 0 ; i < NumParts[me] ; i++) {
            for(j = 0 ; j < 3 ; j++)
              P[me, i].Pos[j] += P[me, i].Vel[j] * dt_drift;
          }
        and, notionally, in domain_Decomposition():
        sort(P, NumParts, domain_compare_key);

    45. Different Views? HPJava actually has a mechanism for viewing an array as a “high-level” distributed array in one part of the code, and a “co-array” in another†. This mechanism was designed for library-writers. Complex, and I was reluctant to introduce it into “user code”. First pass “translation” of Gadget uses co-arrays for most of the major arrays. †Mechanism is called a splitting section.

    46. Programming Style Use of HPspmd programming features is largely limited to patterns like this:
        … Declaration of co-arrays, etc …
        … Global computation …
        overall(me = procs for :) {
          … Local computation …
        }
        … Collective communication …
        [Any of above in sequential control flow]

    47. Programming Discipline Syntactically, distributed array elements can only be accessed inside overall. “Global variables”—variables declared outside overall, not distributed array elements—should not be assigned inside overall. In principle, message passing is allowed inside overall (“local computation” section), but collective communication (outside overall) is preferred.

    48. Basic Communication Gadget makes extensive use of MPI collectives. These map easily to methods from the Adlib collective library—always more succinctly. The translation uses Adlib collectives such as remap() and sum().
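    For example (a sketch with assumed variable names), the kind of reduction Gadget performs with an MPI collective collapses, in the translation, to the one-line Adlib.sum() call that appears on the boilerplate slide below:

        #include <mpi.h>

        /* Sketch only: subtract the number of particles processed on all tasks
         * from the global "still to do" count.  In the HPspmd translation this
         * whole helper becomes the single collective call
         *     ntotleft -= Adlib.sum(ndone);                                   */
        static long long update_ntotleft(long long ntotleft, int ndone)
        {
          int ndonetot = 0;
          MPI_Allreduce(&ndone, &ndonetot, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
          return ntotleft - ndonetot;
        }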

    49. Non-trivial Communication Patterns Will consider in detail two recurring patterns in the Gadget code: Basic communication pattern in compute_potential(). Projection to particle mesh in TreePM.

    50. Sketch of compute_potential()
        while(... some local particles not processed ...) {
          for(... all local particles, while send buffer not exhausted ...) {
            ... Compute local contribution to potential, and flag exports ...
            ... Add exported particles to send buffer ...
          }
          ... Sort export buffer by destination ...
          while(... there are more peers in export list ...) {
            ... Send exports to all peers possible (without exhausting recv buffers); recv particles for local computation ...
            ... Compute local contribution for received particles ...
            for(... all peer nodes communicated with above ...) {
              ... Send results back to exporters; recv our results ...
              ... Add the results to the P array ...
            }
          }
        }

    51. Notes on compute_potential() Highlighted code is essentially application specific. Remaining code is essentially boilerplate code that recurs in:
        compute_potential()
        gravity_tree()
        gravity_forcetest()
        density()
        hydro_force()

    52. “Boilerplate Code” while(ntotleft > 0) { overall(me = procs for :) { int nexport = 0; for(j = 0 ; j < NTask ; j++) nsend_local[me, j] = 0; ndone[me] = 0; while(i < num[me] && nexport < bunchSize - NTask) { if(condition(a[me, i])) { ndone[me]++; for(j = 0; j < NTask; j++) Exportflag[me, j] = 0; ... Compute local contribution to potential, and flag exports ... for(j = 0 ; j < NTask ; j++) if(Exportflag[me, j]) { setDataIn(dataIn[me, nexport], a[me, i], ...) nexport++; nsend_local[me, j]++; } } i++; } qsort(dataIn[[me, 0:nexport]], compare_key); for(j = 1, noffset[me, 0] = 0; j < NTask; j++) noffset[me, j] = noffset[me, j - 1] + nsend_local[me, j - 1]; } Adlib.remap(nsend, nsend_local); /* now do the particles that need to be exported */ for(level = 1; level < (1 << PTask); level++) { for(j = 0; j < NTask; j++) nbuffer[j] = 0; for(ngrp = level; ngrp < (1 << PTask); ngrp++) { maxfill = 0; for(j = 0; j < NTask; j++) if((j ^ ngrp) < NTask) if(maxfill < nbuffer[j] + nsend[j ^ ngrp, j]) maxfill = nbuffer[j] + nsend[j ^ ngrp, j]; if(maxfill >= bunchSize) break; overall(me = procs for :) { recvTask = me` ^ ngrp; if(recvTask < NTask) { /* get the particles */ int recv_offset = nbuffer[me`]; int recv_count = nsend[recvTask, me`]; int send_offset = noffset[me, recvTask]; int send_count = nsend_local[me, recvTask]; MPlib.sendrecv(dataIn[[send_offset: send_offset + send_count - 1]], world.getProc(recvTask), TAG_A, dataGet[[recv_offset: recv_offset + recv_count - 1]] world.getProc(recvTask), TAG_A, MPI_COMM_WORLD, &status); } } for(j = 0; j < NTask; j++) if((j ^ ngrp) < NTask) nbuffer[j] += nsend[j ^ ngrp, j]; } ... Compute local contributions for received particles ... /* get the result */ for(j = 0; j < NTask; j++) nbuffer[j] = 0; for(ngrp = level; ngrp < (1 << PTask); ngrp++) { maxfill = 0; for(j = 0; j < NTask; j++) if((j ^ ngrp) < NTask) if(maxfill < nbuffer[j] + nsend[j ^ ngrp, j]) maxfill = nbuffer[j] + nsend[j ^ ngrp, j]; if(maxfill >= bunchSize) break; overall(me = procs for :) { recvTask = me` ^ ngrp; if(recvTask < NTask) { /* get the particles */ int recv_offset = noffset[me, recvTask]; int recv_count = nsend_local[me, recvTask]; int send_offset = nbuffer[me`]; int send_count = nsend[recvTask, me`]; MPlib.sendrecv(dataResult[[send_offset: send_offset + send_count - 1]], world.getProc(recvTask), TAG_B, dataOut[[recv_offset: recv_offset + recv_count - 1]], world.getProc(recvTask), TAG_B, MPI_COMM_WORLD, &status); /* add the result to the particles */ for(j = 0; j < nsend_local[recvTask]; j++) { source = j + noffset[me, recvTask]; ... Add the results back to the P array ... } } } for(j = 0; j < NTask; j++) if((j ^ ngrp) < NTask) nbuffer[j] += nsend[j ^ ngrp, j]; } level = ngrp - 1; } ntotleft -= Adlib.sum(ndone); }

    53. Remarks We expect this pure message-passing code is quite efficient. But presumably hard to maintain, and certainly not in the HPspmd spirit, which eschews low-level messaging. Too much code that is communication specific rather than application specific. Can we find a better abstraction?

    54. A Different Point of View Yes, I think so. Let’s try considering this fragment as “the main loop”:
        for(... all local particles, [while send buffer not exhausted] ...) {
          ... Compute local contribution to potential, and flag exports ...
          ... [Add exported particles to send buffer] ...
        }
        Now suppose that the highlighted elision, rather than just “flagging exports”, calls a library method, passing it the values to be exported.

    55. The invoke() Method This library method, which I’ll suggestively call invoke(), can internally load values into the send buffer, and keep track of whether the buffer is full. When the buffer is full, invoke() internally triggers all of the remaining actions on the earlier slide—including all communication with peers—then returns control to the application-specific main loop, which resumes with a cleared buffer.

    56. A Better Code Organization So then we could implement exactly equivalent user code like this:
        for(... all local particles ...) {
          ... Compute local contribution to potential, and pass exports to invoke() ...
        }
        But invoke() must somehow trigger the user code segments indicated earlier by:
        ... Compute local contribution for received particles ...
        ... Add the results to the P array ...

    57. Callbacks Clearly these must be callbacks of some kind. Let’s regard the first callback as “remote implementation code” and the second as a “response handler”. The structure we now have is an asynchronous remote method invocation! But “behind the scenes” it is implemented efficiently and collectively, aggregating atomic invocations.

    58. Putting Things Together
        ComputePotentialProxy [[-]] proxies = new ComputePotentialProxy [[procs]] ;
        ComputePotentialHandler [[-]] handlers = new ComputePotentialHandler [[procs]] ;
        … Instantiate custom proxies and handlers …
        RMISchedule schedule = new RMISchedule(proxies, handlers);
        overall(me = procs for :)
          for(i = 0 ; i < NumPart[me] ; i++) {
            ... Computing local contribution to potential …
            proxies[me].invoke(remoteTask, i, GravDataIn);   // export
            ...
          }
        schedule.complete();

    59. Defining Callback Methods
        class ComputePotentialHandler extends RMIHandler {
          … Declare fields and constructor …
          gravdata_in invoke(gravdata_in GravDataGet) {
            ... Compute local contribution for received particles ...
          }
        }
        class ComputePotentialProxy extends RMIProxy {
          … Declare fields and constructor …
          void handleResponse(int context, gravdata_in GravDataOut) {
            ... Add the results to the P array ...
          }
        }

    60. Projection to Particle Mesh The second of the “non-trivial” communication patterns identified earlier comes from the TreePM algorithm: projection of particle density to a regular grid for calculation of F_long; distribution of results back to irregularly distributed particles.

    61. Sketch of pmpotential_periodic()
        overall(me = procs for :) {
          workspace = new fftw_real [[dimx, dimy, dimz]];
          ... Project local particle masses into workspace ...
          ... Accumulate workspace into rhogrid ...
        }
        ... Do the FFT of the density field ...
        ... Multiply with Green's function for the potential ...
        ... Reverse the FFT to get the potential ...
        overall(me = procs for :) {
          workspace = new fftw_real [[dimx, dimy, dimz]];
          ... Get workspace from rhogrid ...
          ... Read out local potential from workspace ...
        }

    62. Notes on pmpotential_periodic() Computes the F_long contribution to the potential. workspace is large enough to cover all particles from P in the local domain, and maps to a subsection of the full PM distributed array, rhogrid. Highlighted code is essentially application independent.

    63. Accumulate workspace into rhogrid... for(level = 0; level < (1 << PTask); level++) /* note: for level=0, target is the same task */ { sendTask = me`; recvTask = me` ^ level; if(recvTask < NTask) { /* check how much we have to send */ sendmin = 2 * PMGRID; sendmax = -1; for(slab_x = meshmin[me, 0]; slab_x < meshmax[me, 0] + 2; slab_x++) if(slab_to_task[slab_x % PMGRID] == recvTask) { if(slab_x < sendmin) sendmin = slab_x; if(slab_x > sendmax) sendmax = slab_x; } if(sendmax == -1) sendmin = 0; /* check how much we have to receive */ recvmin = 2 * PMGRID; recvmax = -1; for(slab_x = meshmin_list[3 * recvTask]; slab_x < meshmax_list[3 * recvTask] + 2; slab_x++) if(slab_to_task[slab_x % PMGRID] == sendTask) { if(slab_x < recvmin) recvmin = slab_x; if(slab_x > recvmax) recvmax = slab_x; } if(recvmax == -1) recvmin = 0; if((recvmax - recvmin) >= 0 || (sendmax - sendmin) >= 0) /* ok, we have a contribution to the slab */ { recv_dimx = meshmax_list[3 * recvTask + 0] - meshmin_list[3 * recvTask + 0] + 2; recv_dimy = meshmax_list[3 * recvTask + 1] - meshmin_list[3 * recvTask + 1] + 2; recv_dimz = meshmax_list[3 * recvTask + 2] - meshmin_list[3 * recvTask + 2] + 2; forcegrid = new fftw_real [[sendmax - sendmin + 1, dimy, dimz]]; if(level > 0) { MPlib.sendrecv(workspace[[sendmin - meshmin[0]:sendmax - mashmin[0], :, :]], world.getProc(recvTask), TAG_PERIODIC_C, forcegrid, world.getProc(recvTask), TAG_PERIODIC_C, MPI_COMM_WORLD, &status); } else { HPspmd.copy(forcegrid, workspace[[sendmin - meshmin[0]:sendmax - mashmin[0], :, :]]); } for(slab_x = recvmin; slab_x <= recvmax; slab_x++) { slab_xx = (slab_x % PMGRID) - first_slab_of_task[me`]; if(slab_xx >= 0 && slab_xx < slabs_per_task[me]) { for(slab_y = meshmin_list[3 * recvTask + 1]; slab_y <= meshmax_list[3 * recvTask + 1] + 1; slab_y++) { slab_yy = slab_y; if(slab_yy >= PMGRID) slab_yy -= PMGRID; for(slab_z = meshmin_list[3 * recvTask + 2]; slab_z <= meshmax_list[3 * recvTask + 2] + 1; slab_z++) { slab_zz = slab_z; if(slab_zz >= PMGRID) slab_zz -= PMGRID; rhogrid[me, slab_xx, slab_yy, slab_zz] += forcegrid[slab_x - recvmin, slab_y - meshmin_list[3 * recvTask + 1]), slab_z - meshmin_list[3 * recvTask + 2]]; } } } } } } }

    64. Get workspace from rhogrid… for(level = 0; level < (1 << PTask); level++) /* note: for level=0, target is the same task */ { sendTask = me`; recvTask = me` ^ level; if(recvTask < NTask) { /* check how much we have to send */ sendmin = 2 * PMGRID; sendmax = -PMGRID; for(slab_x = meshmin_list[3 * recvTask]; slab_x < meshmax_list[3 * recvTask] + 2; slab_x++) if(slab_to_task[(slab_x + PMGRID) % PMGRID] == sendTask) { if(slab_x < sendmin) sendmin = slab_x; if(slab_x > sendmax) sendmax = slab_x; } if(sendmax == -PMGRID) sendmin = sendmax + 1; /* check how much we have to receive */ recvmin = 2 * PMGRID; recvmax = -PMGRID; for(slab_x = meshmin[me, 0]; slab_x < meshmax[me, 0] + 2; slab_x++) if(slab_to_task[(slab_x + PMGRID) % PMGRID] == recvTask) { if(slab_x < recvmin) recvmin = slab_x; if(slab_x > recvmax) recvmax = slab_x; } if(recvmax == -PMGRID) recvmin = recvmax + 1; if((recvmax - recvmin) >= 0 || (sendmax - sendmin) >= 0) /* ok, we have a contribution to the slab */ { recv_dimx = meshmax_list[3 * recvTask + 0] - meshmin_list[3 * recvTask + 0] + 2; recv_dimy = meshmax_list[3 * recvTask + 1] - meshmin_list[3 * recvTask + 1] + 2; recv_dimz = meshmax_list[3 * recvTask + 2] - meshmin_list[3 * recvTask + 2] + 2; ncont = 1; cont_sendmin[0] = sendmin; cont_sendmax[0] = sendmax; cont_sendmin[1] = sendmax + 1; cont_sendmax[1] = sendmax; cont_recvmin[0] = recvmin; cont_recvmax[0] = recvmax; cont_recvmin[1] = recvmax + 1; cont_recvmax[1] = recvmax; for(slab_x = sendmin; slab_x <= sendmax; slab_x++) { if(slab_to_task[(slab_x + PMGRID) % PMGRID] != me`) { /* non-contiguous */ cont_sendmax[0] = slab_x - 1; while(slab_to_task[(slab_x + PMGRID) % PMGRID] != me`) slab_x++; cont_sendmin[1] = slab_x; ncont++; } } for(slab_x = recvmin; slab_x <= recvmax; slab_x++) { if(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask) { /* non-contiguous */ cont_recvmax[0] = slab_x - 1; while(slab_to_task[(slab_x + PMGRID) % PMGRID] != recvTask) slab_x++; cont_recvmin[1] = slab_x; if(ncont == 1) ncont++; } } for(rep = 0; rep < ncont; rep++) { sendmin = cont_sendmin[rep]; sendmax = cont_sendmax[rep]; recvmin = cont_recvmin[rep]; recvmax = cont_recvmax[rep]; forcegrid = new fftw_real [[sendmax - sendmin + 1, recv_dimy, recv_dimz]] /* prepare what we want to send */ if(sendmax - sendmin >= 0) { for(slab_x = sendmin; slab_x <= sendmax; slab_x++) { slab_xx = ((slab_x + PMGRID) % PMGRID) - first_slab_of_task[me`]; for(slab_y = meshmin_list[3 * recvTask + 1]; slab_y < meshmax_list[3 * recvTask + 1] + 2; slab_y++) { slab_yy = (slab_y + PMGRID) % PMGRID; for(slab_z = meshmin_list[3 * recvTask + 2]; slab_z <= meshmax_list[3 * recvTask + 2] + 2; slab_z++) { slab_zz = (slab_z + PMGRID) % PMGRID; forcegrid[slab_x - sendmin, slab_y - meshmin_list[3 * recvTask + 1], slab_z - meshmin_list[3 * recvTask + 2]] = rhogrid[me, slab_xx, slab_yy, slab_zz]; } } } } if(level > 0) { MPlib.sendrecv(forcegrid, world.getProc(recvTask), TAG_PERIODIC_D, workspace [[recvmin - (meshmin[0] - 2):recvmax - (meshmin[0] - 2), :, :]], world.getProc(recvTask), TAG_PERIODIC_D, MPI_COMM_WORLD, &status); } else { HPspmd.copy(workspace [[recvmin - (meshmin[0] - 2):recvmax - (meshmin[0] - 2), :, :]], forcegrid); } } } } }

    65. Recurrences Essentially similar code recurs in:
        pmforce_periodic()
        pmpotential_periodic()
        pmforce_nonperiodic()
        pmpotential_nonperiodic()

    66. A “put” Schedule Modelled on the earlier ideas about RMISchedule, we could implement a schedule—SectionPutSumSchedule, say—that has an equivalent MPI implementation, but hides all the explicit messaging.

    67. Rewriting pmpotential_periodic()
        SectionPutSumProxy [[-]] proxies = new SectionPutSumProxy [[procs]];
        SectionPutSumSchedule schedule = new SectionPutSumSchedule(proxies, rhogrid);
        overall(me = procs for :) {
          workspace = new fftw_real [[dimx, dimy, dimz]];
          ... Project local particle masses into workspace ...
          proxies[me].put(new int [] {meshminx, meshminy, meshminz}, workspace);
        }
        schedule.complete();
        … etc …

    68. Finishing Touches Likewise we could introduce a SectionGetSchedule for the second part of pmpotential_periodic(). The proxy would have an application-specific response-handler callback, similar to RMISchedule. We estimate this API would reduce the implementation of this single function in Gadget from 400+ lines to nearer 200. The underlying implementation is still essentially the same in the MPI case, but it allows other implementations.

    69. Conclusions

    70. HPspmd HPspmd is a model for parallel programming that introduces a few concepts into a base language, for representing distributed data. Goal: minimal language extensions to support libraries that work with distributed data—including communication libraries. Originally inspired by fairly “regular” applications, and pure distributed memory.

    71. HPspmd Analysis of Gadget A highly irregular application. The focus was on “co-arrays”—the most primitive kind of distributed array in HPspmd. Nevertheless, we were able to propose new high-level communication APIs—very much in the HPspmd spirit—that should make Gadget-like codes much more maintainable.

    72. Multicore We argue that the high-level abstraction of communication in HPspmd (and to some extent the data model itself) lends itself naturally to multiple implementations—including hybrid parallelism across clusters of multicore processors. The new communication APIs proposed here may be especially interesting in this respect. By construction they can be efficiently implemented in MPI. But their nature suggests they will also be directly implementable on shared memory.
