
Parallelizing Highly Dynamic N-Body Simulations

Presentation Transcript


  1. Parallelizing Highly Dynamic N-Body Simulations Joachim Stadel stadel@physik.uzh.ch University of Zurich Institute for Theoretical Physics

  2. Meaning of “highly dynamic” • The push to ever higher resolution means that there is an ever-widening range of timescales which need to be captured by the simulation. • In rocky planet formation simulations there is a big range from the outermost orbital timescale down to the timescale of a collision between planetesimals (years to hours). • Including supermassive black holes in galaxy simulations also creates a huge range of time scales. • Galaxy formation simulations with hydro, cooling, star formation, feedback, etc. also have enormous ranges of dynamical times (as well as large numbers of particles).

  3. Multiple time-stepping integrators work quite well. Despite the fact that it is not symplectic, the KDK multi-stepping integrator has been found to work quite well for cosmological applications. For planet formation problems one needs to be much more careful and use a high-order Hermite, SyMBA-like, or Mercury-like integration scheme with the tree code.
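
A minimal sketch of a single KDK (kick-drift-kick) leapfrog step for one particle is given below for reference; the struct, the function names and the abstract force call are illustrative, not the pkdgrav2 implementation (a real multi-stepping code applies kicks per rung and shares drifts across rungs).

    /* One KDK (kick-drift-kick) leapfrog step for a single particle. */
    typedef struct {
        double r[3];   /* position     */
        double v[3];   /* velocity     */
        double a[3];   /* acceleration */
    } Particle;

    /* Stand-in for the tree/FMM force evaluation. */
    extern void compute_accel(Particle *p);

    void kdk_step(Particle *p, double dt)
    {
        for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * p->a[i];  /* half kick  */
        for (int i = 0; i < 3; i++) p->r[i] += dt * p->v[i];        /* full drift */
        compute_accel(p);                                           /* new forces */
        for (int i = 0; i < 3; i++) p->v[i] += 0.5 * dt * p->a[i];  /* half kick  */
    }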

  4. Time-step criteria, old and new. There isn’t time to go into detail on the method of deciding an appropriate time-step for a particle. Traditionally dt ~ sqrt(eps/acc) (softening over acceleration) has been used in cosmology, but we really want something that is representative of the local dynamical time.
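
As a rough illustration of the two flavors of criterion, here is a sketch assuming a softening eps, an acceleration magnitude acc, and an estimate of the local density rho; the dimensionless prefactors eta and the particular density estimate are assumptions, not the expressions used in pkdgrav2.

    #include <math.h>

    /* Traditional cosmology criterion: dt = eta * sqrt(eps / |a|). */
    double dt_eps_over_acc(double eta, double eps, double acc)
    {
        return eta * sqrt(eps / acc);
    }

    /* Dynamical-time-flavored criterion: dt ~ eta / sqrt(G * rho),
     * with rho some estimate of the local (e.g. enclosed) density. */
    double dt_dynamical(double eta, double G, double rho)
    {
        return eta / sqrt(G * rho);
    }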

  5. Rung distribution as a function of time. This is even a modest example. The red line shows where half of the theoretical work lies if even high-density particles take the same amount of time to calculate their force as low-density particles. The real half-work line can lie even deeper in the time-step hierarchy.
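
For context, the rungs form a power-of-two time-step hierarchy: rung r takes steps of dt_base / 2^r. A minimal sketch of mapping a particle's desired time step onto a rung is shown below; the names and the rung cap are illustrative.

    /* Smallest rung r such that dt_base / 2^r <= dt_wanted;
     * rung 0 is the longest (base) step. */
    int choose_rung(double dt_base, double dt_wanted, int max_rung)
    {
        int r = 0;
        while (r < max_rung && dt_base / (double)(1 << r) > dt_wanted)
            r++;
        return r;
    }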

  6. The Central 50 kpc of the GHALO Simulation 1 billion particles of 1000 solar masses each form a galactic sized dark matter halo with a wealth of substructure. This simulation required 1.6 million CPU hours (2008) and ran on 1000 cores.

  7. Density profiles of the successively zoomed simulations. Each higher resolution seems to converge on the Einasto profile, which becomes shallower than r^-1 towards the centre and does not converge to a fixed power law.
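
For reference, the Einasto profile referred to here is usually written as follows (a standard form, not taken from the slide); its logarithmic slope steepens continuously with radius rather than settling onto a fixed power law:

    \rho(r) = \rho_{-2}\,\exp\!\left\{-\frac{2}{\alpha}\left[\left(\frac{r}{r_{-2}}\right)^{\alpha} - 1\right]\right\},
    \qquad
    \frac{\mathrm{d}\ln\rho}{\mathrm{d}\ln r} = -2\left(\frac{r}{r_{-2}}\right)^{\alpha}.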

  8. Keeping the same large-scale modes allows a 1-1 comparison of the simulation at different resolutions, although small differences in the accuracy parameters set during the simulation will cause slightly different orbits of the smaller halos (it is chaotic after all).

  9. Actual refinement strategy in 2 cases: GHALO & SR. These refinements are created by the codes of Doug Potter (Uni Zurich), who has parallelized both the GRAFIC1 and, more significantly, the GRAFIC2 codes of Bertschinger. Creating such initial conditions is still a very difficult job, but it is getting more automated now.

  10. No mixing of low-res particles inside of the halo.

  11. On big steps where N_active is large all is good.

  12. On smaller sub-steps there is a lot of imbalance.

  13. Where all the time goes for each rung. The lower-left panel shows a case where the network protocol was not working properly, so the dark green bars are not representative. Scaling is just OK at 2000 processors after a lot of tuning of the code. Note that at 2000 processors the light green bars exceed the dark red of step 0 (forces on all particles)!

  14. HPC Hardware …how it is getting more parallel …and how to deal with this.

  15. HPC Hardware Paradigms • Multi-core nodes connected to an HPC network: 8 to 24 cores/node, 1000s of nodes! • Multi-GPU nodes connected to an HPC network: 512 cores/GPU (Fermi), 4 GPUs/node, 1000s of nodes? • Each level comes with a bandwidth bottleneck.

  16. Load Balancing Techniques 1: MIMD. DOMAIN DECOMPOSITION: • High degree of data locality. • Very good computation/communication ratio if a lot of particles are active in each domain. • Difficult to balance computing and memory usage. • Adapting to changing computation is expensive. THREAD SCHEDULING: • Adapting to changing computation is cheap. • Can balance memory use and computation. • Lower degree of data locality. • Overheads for switching threads. • Duplication of work when a small number of particles are active.

  17. Load Balancing Techniques 1: MIMD. DOMAIN DECOMPOSITION: • pkdgrav2 & gasoline. • Typically MPI code. • ORB or space-filling-curve decomposition. THREAD SCHEDULING: • ChaNGa: CHARM++. • Objects called chares, each of which holds O(100-1000) particles including all parent tree cells. • These “tree pieces” are scheduled dynamically. • Very good scaling beyond 1000 CPUs (e.g. Blue Gene). • Particularly good scaling on other tasks such as neighbor searching!
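
To illustrate the space-filling-curve option mentioned above, here is a generic 3-D Morton (Z-order) key routine; it is not the specific key or decomposition used by pkdgrav2, gasoline or ChaNGa. Sorting particles by such a key and cutting the sorted list into pieces of roughly equal work is one simple form of space-filling-curve domain decomposition.

    #include <stdint.h>

    /* Spread the low 21 bits of x so that two zero bits separate
     * consecutive input bits (standard Morton-key bit trick). */
    static uint64_t spread_bits(uint64_t x)
    {
        x &= 0x1fffffULL;
        x = (x | (x << 32)) & 0x001f00000000ffffULL;
        x = (x | (x << 16)) & 0x001f0000ff0000ffULL;
        x = (x | (x << 8))  & 0x100f00f00f00f00fULL;
        x = (x | (x << 4))  & 0x10c30c30c30c30c3ULL;
        x = (x | (x << 2))  & 0x1249249249249249ULL;
        return x;
    }

    /* 63-bit Morton key from integer grid coordinates (21 bits each). */
    uint64_t morton_key(uint32_t ix, uint32_t iy, uint32_t iz)
    {
        return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2);
    }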

  18. Load Balancing Techniques 2:SIMD • For on-chip SSE instructions and for the cores of a GPU we need a different approach. • Each core should execute exactly the same instruction stream, but use different data. • Load balance is very good as long as not too much dummy “filler” data needs to be used. Vectors need to be a multiple of nCore in length. • How can we do this in the context of a tree code where branching (different instructions) seems to be inherent in the tree walk algorithm?
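
To make the dummy “filler” point concrete, here is a sketch of padding a particle-particle interaction list up to a multiple of the vector width with zero-mass entries; the struct and names are illustrative, not taken from any particular code.

    /* Pad an interaction list with zero-mass dummies so its length is a
     * multiple of the SIMD width: every lane then runs the same
     * instructions and the dummies contribute no force (m = 0). */
    typedef struct { float x, y, z, m; } PPEntry;

    int pad_to_width(PPEntry *list, int n, int width)
    {
        /* Nonzero coordinates help avoid r = 0 in unsoftened kernels. */
        PPEntry dummy = { 1.0f, 1.0f, 1.0f, 0.0f };
        while (n % width != 0)
            list[n++] = dummy;     /* caller must have allocated the slack */
        return n;                  /* new, padded length */
    }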

  19. SIMD Parallelizing the Tree Code I won’t talk about the details of domain decomposition or thread scheduling since these are quite complicated and also of little use for GPU programming. We will see that no branches are needed to accomplish a tree code!

  20. [Figure: the walk state is tracked by the checklist C, the cell-cell (CC) interaction list, the particle-particle (PP) interaction list, and a stack S, shown alongside a tree of cells a–p.] Cell c is larger than b, so it is opened and its children f and g are appended to the checklist.

  21. [Figure: walk state.] Cell b is larger than f, but g is far enough away. Cell f stays on the checklist while g is moved to the CC list.

  22. [Figure: walk state.] The checklist has been processed and we add to the local expansion at b due to g. This is the main calculation in the FMM code (usually hundreds of cells).

  23. [Figure: walk state.] Cell f is far enough away, but cell e is too close and larger.

  24. [Figure: walk state.] Cell l is far enough away, but cell k is too close yet smaller than d.

  25. [Figure: walk state.] Again we have completely processed the checklist for cell d. We evaluate the local expansions for f and l about d and add this to the current L’.

  26. [Figure: walk state.] This process continues, but let us assume that h, j and k are all leaf cells. This means they must interact by P-P.

  27. [Figure: walk state.] The checklist is now empty, as it should be, and we can process the PP interactions. Then we pop the stack and go to the other child of d.

  28. [Figure: walk state.] We are again at leaf cell j.

  29. [Figure: walk state.] Again we can perform the interactions, followed by popping the stack.

  30. [Figure: walk state.] We proceed in the same manner from cell e now. Note again that we have also taken the old value of L from the stack as a starting point, which included the interaction with cell g.

  31. There are 4 cases in this example. • Case 0: Cell stays on the checklist. • Case 1: Cell is opened and its children are placed at the end of the checklist. • Case 2: Cell is moved to the PP list. • Case 3: Cell is moved to the CC list. • An Opening Value of 0 to 3 can be calculated arithmetically from the current cell and the cell on the checklist, i.e., no if-then-else! • Suppose we could magically have the mapping from checklist entries to each of the 4 processing lists (based on the particular opening value). • Then each processor could independently move entries to their lists using the mapping. • All cores then do case 1, then case 2, finally case 3.
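
As an illustration of computing such an opening value arithmetically, here is a sketch for the four cases above; the opening-radius test, the cell fields and the fixed 1/theta factor are simplified assumptions, and the real test in pkdgrav2 involves more cases and about 80 operations (see slide 33).

    /* Branch-free opening value for one checklist cell relative to the
     * current cell.  Case numbering follows the list above:
     *   0 = stays on checklist, 1 = open, 2 = to PP list, 3 = to CC list. */
    typedef struct {
        float cx, cy, cz;   /* cell center          */
        float bmax;         /* cell size/max radius */
        int   is_leaf;      /* leaf (bucket) cell?  */
    } Cell;

    int opening_value(const Cell *cur, const Cell *chk, float inv_theta)
    {
        float dx = chk->cx - cur->cx;
        float dy = chk->cy - cur->cy;
        float dz = chk->cz - cur->cz;
        float d2 = dx*dx + dy*dy + dz*dz;
        float open = chk->bmax * inv_theta;       /* opening radius       */

        int far    = (d2 > open * open);          /* multipole acceptable */
        int leaf   = chk->is_leaf;
        int larger = (chk->bmax > cur->bmax);

        /* far -> 3; near leaf -> 2; near and larger -> 1; otherwise 0.
         * The comparisons yield 0/1, so no if-then-else is needed. */
        return 3*far + 2*(1 - far)*leaf + (1 - far)*(1 - leaf)*larger;
    }

Each core can evaluate this for its share of the checklist and then, using the map described on the next slide, move entries to the stay/open/PP/CC lists without branching.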

  32. For simplicity assume 3 cases and 3 cores. [Figure: 12 checklist elements distributed over the 3 threads/cores, their case values, and the resulting map.] We have perfect load balancing in the determination of the cases and have had to process 3 dummy data elements out of 12 in the second part. This map can also be calculated in parallel if each thread counts the number of occurrences of each case. The map is then found by doing a running sum (a parallel SCAN operation) over the case counts and threads, which costs O(log P).
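
Here is a sketch of the counting and running-sum step, written serially for clarity; the names are illustrative, and in the parallel version the running sum over (case, thread) is the O(log P) SCAN mentioned above.

    #define NCASE   3   /* number of cases in this simplified example */
    #define NTHREAD 3   /* number of threads/cores                    */

    /* caseval[i]   = case (0..NCASE-1) of checklist element i
     * thread_of[i] = thread that owns element i
     * offset[c][t] = where thread t starts writing its case-c entries
     *                in the case-c processing list. */
    void build_map(const int *caseval, const int *thread_of, int n,
                   int offset[NCASE][NTHREAD])
    {
        int count[NCASE][NTHREAD] = {{0}};

        /* 1. each thread counts its occurrences of each case */
        for (int i = 0; i < n; i++)
            count[caseval[i]][thread_of[i]]++;

        /* 2. exclusive running sum over threads, per case */
        for (int c = 0; c < NCASE; c++) {
            int running = 0;
            for (int t = 0; t < NTHREAD; t++) {
                offset[c][t] = running;
                running += count[c][t];
            }
            /* 'running' is now the final length of the case-c list */
        }
    }

Each thread t then scatters its elements, writing each case-c entry at offset[c][t] and incrementing it, so all threads fill the per-case lists independently.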

  33. Pkdgrav2: more cases to deal with. • In reality we have 5 cases without softening and in pkdgrav2 we have actually 9 cases. • There are about 80 operations (floating point and logical) to calculate the opening value. • Typically each cell in the tree requires the processing of 1000s of checklist elements and 100s to 1000s of processing elements. • Even if only a single particle is active we get significant speed-up in the simulation as long as the simulation is not too small.

  34. Prospects It is still early days in implementing code like this, but I am confident that this will allow big speedups in tough, highly dynamic simulations.

  35. Prospects for pkdgrav2 • Currently I am developing this method for use with SSE/OpenMP/MPI within pkdgrav2. • I also plan to later try thread scheduling by popping work off the stack, and also pushing things onto it when lists get too long. • I find speedups of 6 or so on 8 cores with as few as 100 checklist elements; below this the speedup drops quickly.

  36. Prospects for GPUs and GPU clusters • I have started exploring the implementation of this method on GPUs, where the host currently does all tree building and the GPU only does the walk algorithm (calculates forces). • We want to have the entire code on the GPU so that we can simulate a modest number of particles, O(1 million), with many, many timesteps all within the GPU memory → rocky planet formation simulations. • In the long run we would like the host to do the fuzzy edges of the domains, while the GPU is given the majority of the work in the interior of the domain.

  37. Questions? Thank you.

  38. Discussion Session – Baryons? • Baryons and precision cosmology: the mass function of clusters is uncertain. • Stanek and Rudd (http://arxiv.org/abs/0809.2805): hydro simulations of the cluster mass function with and without gas cooling and preheating imply ~10% uncertainty in the M500 mass function. • Teyssier and Davide’s paper (http://arxiv.org/abs/1003.4744) on AGN effects on the cluster mass profile. • The Rudd paper (http://arxiv.org/abs/astro-ph/0703741) on baryons and the matter power spectrum. • A precision solution for the baryon effects involves cooling and feedback, so it may require modeling star formation correctly over the entire cosmic history, beginning with the first stars. Baryon physics might also explain many CDM challenges.

  39. Discussion Session – basic assumptions • The expansion of an inhomogeneous universe versus the global expansion factor used in simulations (back-reaction and related issues); some people are very worried about it. • Buchert: http://arxiv.org/abs/0707.2153 • In principle, this could create an apparent accelerating expansion without dark energy. The consensus in the field appears to be that the effect is too small, but it may be important for precision dark energy measurements. • What about the finite speed of light in the very large volume simulations now being performed for surveys?

  40. Discussion session – numerical issues? • Lack of numerical standards for precision simulations: previous standards become obsolete as particle numbers and resolution capabilities improve. Every simulation requires convergence tests to be trusted, and few do them; the starting redshift is a good example. • Dark matter detection in a cosmological context: if dark matter is detected, simulations (including any baryon influences) will be needed to constrain the dark matter particle properties. If small-scale dark matter structure is important, simulations will need to resolve down to the microhalo scale (~10^20 particles per halo). • To what extent can we trust simulations of quite different models? Some examples include warm dark matter simulations, f(R) gravity models, or non-Gaussian fluctuations (f_NL). • Can we make any use of GRID computing, or is it all hype?
