
ENZO and Extreme Scale AMR for Hydrodynamic Cosmology






Presentation Transcript


  1. ENZO and Extreme Scale AMR for Hydrodynamic Cosmology. Michael L. Norman, UC San Diego and SDSC, mlnorman@ucsd.edu

  2. What is ENZO?
  • A parallel AMR application for astrophysics and cosmology simulations
  • Hybrid physics: fluid + particle + gravity + radiation
  • Block-structured AMR
  • MPI or hybrid parallelism
  • Under continuous development since 1994
    • Greg Bryan and Mike Norman @ NCSA
    • Shared memory → distributed memory → hierarchical memory
  • C++/C/Fortran, >185,000 LOC
  • Community code in widespread use worldwide
    • Hundreds of users, dozens of developers
  • Version 2.0 @ http://enzo.googlecode.com

  3. Two primary application domains
  • Astrophysical fluid dynamics (e.g., supersonic turbulence)
  • Hydrodynamic cosmology (e.g., large-scale structure)

  4. ENZO physics. Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.

  5. Enzo meshing
  • Berger-Colella structured AMR
  • Cartesian base grid and subgrids
  • Hierarchical timestepping (see the sketch below)
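The hierarchical timestepping can be pictured as a recursion over levels: each finer level takes several smaller substeps per coarse step before control returns to the parent. Below is a minimal C++ sketch of that control flow; the class and method names are illustrative, not Enzo's actual EvolveLevel interface.

```cpp
// Minimal sketch of Berger-Colella hierarchical timestepping: each finer level
// takes `refinement_factor` substeps per parent step, recursing depth-first
// before the parent advances again.  Illustrative names, not Enzo's EvolveLevel.
#include <vector>

struct Grid {
  void AdvanceHydro(double dt) { (void)dt; /* update fluid state on this patch */ }
};

struct Hierarchy {
  std::vector<std::vector<Grid*>> level_grids;  // grids indexed by refinement level
  int refinement_factor = 2;                    // time refinement between levels
  int max_level = 2;

  void EvolveLevel(int level, double dt) {
    for (Grid* g : level_grids[level])
      g->AdvanceHydro(dt);                      // advance every patch on this level
    if (level < max_level) {
      double dt_fine = dt / refinement_factor;
      for (int sub = 0; sub < refinement_factor; ++sub)
        EvolveLevel(level + 1, dt_fine);        // finer level takes smaller substeps
    }
    // flux correction and projection of fine data back to this level go here
  }
};
```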

  6. AMR = collection of grids (patches); each grid is a C++ object. [Figure: nested grid patches at Levels 0, 1, and 2]
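As a concrete illustration of "each grid is a C++ object", the toy class below owns its refinement level, dimensions, and one field array. It is a stand-in for the idea on the slide, not Enzo's actual grid class (which has hundreds of members and methods).

```cpp
// Toy version of "each grid is a C++ object": a patch owning its level,
// dimensions, and one field array.  Not Enzo's actual grid class.
#include <array>
#include <cstddef>
#include <vector>

class Grid {
public:
  Grid(int level, std::array<int,3> dims)
      : level_(level), dims_(dims),
        density_(static_cast<std::size_t>(dims[0]) * dims[1] * dims[2], 0.0) {}

  int Level() const { return level_; }

  double& Density(int i, int j, int k) {        // flattened 3D indexing
    return density_[(static_cast<std::size_t>(k) * dims_[1] + j) * dims_[0] + i];
  }

private:
  int level_;                    // refinement level (0 = root grid)
  std::array<int,3> dims_;       // cells per dimension, including ghost zones
  std::vector<double> density_;  // one of many baryon field arrays in practice
};
```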

  7. Unigrid = collection of Level 0 grid patches

  8. Evolution of Enzo Parallelism
  • Shared memory (PowerC) parallel (1994-1998)
    • SMP and DSM architectures (SGI Origin 2000, Altix)
    • Parallel DO across grids at a given refinement level, including the block-decomposed base grid
    • O(10,000) grids
  • Distributed memory (MPI) parallel (1998-2008)
    • MPP and SMP cluster architectures (e.g., IBM PowerN)
    • Level 0 grid partitioned across processors
    • Level >0 grids within a processor executed sequentially
    • Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing; see the sketch below)
    • O(100,000) grids
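The greedy load balancing mentioned above can be sketched as: sort grids by estimated cost, then repeatedly hand the next grid to the currently least-loaded rank. This is an illustrative reconstruction, not Enzo's code (which also ships the grid data to the new owner via MPI messages).

```cpp
// Illustrative reconstruction of greedy load balancing: assign grids, heaviest
// first, to whichever rank currently has the least accumulated work.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct GridWork { int grid_id; double cost; };   // cost ~ number of cells, say

// Returns owner[grid_id] = assigned MPI rank; grid_id is assumed to be 0..N-1.
std::vector<int> GreedyAssign(std::vector<GridWork> grids, int nprocs) {
  std::sort(grids.begin(), grids.end(),
            [](const GridWork& a, const GridWork& b) { return a.cost > b.cost; });

  using Load = std::pair<double, int>;           // (accumulated load, rank)
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least_loaded;
  for (int p = 0; p < nprocs; ++p) least_loaded.push({0.0, p});

  std::vector<int> owner(grids.size(), -1);
  for (const GridWork& g : grids) {
    auto [load, rank] = least_loaded.top();      // rank with least work so far
    least_loaded.pop();
    owner[g.grid_id] = rank;
    least_loaded.push({load + g.cost, rank});
  }
  return owner;
}
```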

  9. Projection of refinement levels: 160,000 grid patches at 4 refinement levels

  10. 1 MPI task per processor. Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels.

  11. Evolution of Enzo Parallelism
  • Hierarchical memory (MPI+OpenMP) parallel (2008-)
    • SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
    • Level 0 grid partitioned across shared-memory nodes/multicore processors
    • Parallel DO across grids at a given refinement level within a node
    • Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
    • O(1,000,000) grids

  12. N MPI tasks per SMP node, M OpenMP threads per task. Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels. Each grid at a level is handled by an OpenMP thread (see the sketch below).
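A minimal sketch of this hybrid scheme, assuming one MPI rank per root-grid tile and OpenMP threads looping over that rank's grids at a level. It is illustrative structure only, not Enzo's actual driver.

```cpp
// Sketch of the hybrid layout: each MPI rank owns one root-grid tile plus its
// subgrids; OpenMP threads process that rank's grids on a level concurrently.
#include <mpi.h>
#include <omp.h>
#include <vector>

struct Grid {
  void AdvanceHydro(double dt) { (void)dt; /* update this patch's fluid state */ }
};

void EvolveLevelHybrid(std::vector<Grid>& grids_on_level, double dt) {
  // Parallel DO across the grids at one refinement level; dynamic scheduling
  // gives the statistical load balance mentioned on the slide.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(grids_on_level.size()); ++i)
    grids_on_level[i].AdvanceHydro(dt);
}

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  std::vector<Grid> my_grids(64);     // this rank's grids on one level (toy size)
  EvolveLevelHybrid(my_grids, 1.0e-3);

  MPI_Finalize();
  return 0;
}
```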

  13. Enzo on petascale platforms: Enzo on Cray XT5
  [Figure: 1% of the 6400³ simulation volume]
  • Non-AMR 6400³, 80 Mpc box
  • 15,625 (25³) MPI tasks, 256³ root grid tiles
  • 6 OpenMP threads per task
  • 93,750 cores (see the arithmetic check below)
  • 30 TB per checkpoint/re-start/data dump
    • >15 GB/sec read, >7 GB/sec write
  • Benefit of threading: reduce MPI overhead & improve disk I/O
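The task and core counts on this slide follow directly from the tile decomposition of the root grid:

$$\frac{6400^3}{256^3} = 25^3 = 15{,}625 \ \text{MPI tasks}, \qquad 15{,}625 \times 6 \ \text{threads} = 93{,}750 \ \text{cores}.$$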

  14. Enzo on petascale platforms: Enzo on Cray XT5
  [Figure: 10⁵ spatial dynamic range]
  • AMR 1024³, 50 Mpc box, 7 levels of refinement
  • 4096 (16³) MPI tasks, 64³ root grid tiles
  • 1 to 6 OpenMP threads per task, i.e., 4096 to 24,576 cores
  • Benefit of threading
    • Thread count can be increased as memory use grows
    • Reduces replication of grid hierarchy data

  15. Using MPI+threads to access more RAM as the AMR calculation grows in size

  16. Enzo on petascale platforms: Enzo-RHD on Cray XT5 (cosmic reionization)
  • Including radiation transport is ~10x more expensive
    • LLNL Hypre multigrid solver dominates run time
    • Near-ideal scaling to at least 32K MPI tasks
  • Non-AMR 1024³, 8 and 16 Mpc boxes
  • 4096 (16³) MPI tasks, 64³ root grid tiles

  17. Blue Waters Target Simulation: Re-Ionizing the Universe
  • Cosmic reionization is a weak-scaling problem
    • Large volumes at a fixed resolution to span the range of scales
  • Non-AMR 4096³ with ENZO-RHD
  • Hybrid MPI and OpenMP
    • SMT and SIMD tuning
  • 128³ to 256³ root grid tiles
  • 4-8 OpenMP threads per task
  • 4-8 TBytes per checkpoint/re-start/data dump (HDF5)
    • In-core intermediate checkpoints (?)
  • 64-bit arithmetic, 64-bit integers and pointers
  • Aiming for 64-128 K cores (see the arithmetic below)
  • 20-40 M hours (?)
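Two rough consistency checks (the per-dump field count is my assumption, not stated on the slide): with 128³ tiles the 4096³ root grid yields 32³ = 32,768 MPI tasks, which at 4 threads per task is about 128 K cores; and since one 64-bit field at 4096³ occupies about 0.55 TB, the quoted 4-8 TB per dump corresponds to on the order of 8-16 field variables.

$$\frac{4096^3}{128^3} = 32^3 = 32{,}768 \ \text{tasks}, \quad 32{,}768 \times 4 \approx 1.3\times10^{5} \ \text{cores}; \qquad 4096^3 \times 8\ \text{bytes} \approx 0.55\ \text{TB per field}.$$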

  18. • ENZO's AMR infrastructure limits scalability to O(10⁴) cores
  • We are developing a new, extremely scalable AMR infrastructure called Cello
    • http://lca.ucsd.edu/projects/cello
  • ENZO-P will be implemented on top of Cello to scale to the petascale and beyond

  19. Current capabilities: AMR vs. tree code

  20. Cello extreme AMR framework: design principles
  • Hierarchical parallelism and load balancing to improve localization
  • Relax global synchronization to a minimum
  • Flexible mapping between data structures and concurrency
  • Object-oriented design
  • Build on best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++); see the sketch below
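A minimal sketch of the Charm++ model the last bullet refers to: AMR blocks as elements of a chare array whose execution the runtime schedules and migrates. The module, class, and method names here are illustrative assumptions, not Cello's actual interface; the interface (.ci) declarations that charmc would translate into the generated headers are shown as a comment.

```cpp
// block.cpp -- minimal Charm++ chare-array sketch (illustrative, not Cello's API).
// A matching interface file block.ci, compiled with charmc, would declare:
//   mainmodule block {
//     readonly CProxy_Main mainProxy;
//     mainchare Main { entry Main(CkArgMsg* m); entry void done(); };
//     array [1D] Block { entry Block(); entry void compute(); };
//   };
#include "block.decl.h"

/*readonly*/ CProxy_Main mainProxy;
static const int kNumBlocks = 8;

class Main : public CBase_Main {
  int remaining_;
 public:
  Main(CkArgMsg* m) : remaining_(kNumBlocks) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Block blocks = CProxy_Block::ckNew(kNumBlocks);  // create the block chares
    blocks.compute();                         // broadcast; the runtime schedules them
  }
  void done() { if (--remaining_ == 0) CkExit(); }          // all blocks reported back
};

class Block : public CBase_Block {
 public:
  Block() {}
  Block(CkMigrateMessage* m) : CBase_Block(m) {}  // lets the runtime migrate blocks
  void compute() {
    // ... advance this AMR block's patch data one step ...
    mainProxy.done();
  }
};

#include "block.def.h"
```

The point of the sketch is the design choice named on the slide: work units are objects invoked by asynchronous method calls, so scheduling, migration, and fault tolerance are handled by the runtime rather than by ranks stepping in lockstep.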

  21. Cello extreme AMR framework: approach and solutions
  • Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in terms of both size and depth
  • Patch-local adaptive time steps
  • Flexible hybrid parallelization strategies
  • Hierarchical load balancing approach based on actual performance measurements
  • Dynamical task scheduling and communication
  • Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
  • Variable AMR grid block sizes while keeping parallel task sizes fixed
  • Addressing numerical precision and range issues that arise in particularly deep AMR hierarchies
  • Detecting and handling hardware or software faults during run-time to improve software resilience and enable software self-management

  22. Improving the AMR mesh: patch coalescing

  23. Improving the AMR mesh: targeted refinement

  24. Improving the AMR mesh: targeted refinement with backfill

  25. Cello software components http://lca.ucsd.edu/projects/cello

  26. Roadmap

  27. Enzo Resources
  • Enzo website (code, documentation)
    • http://lca.ucsd.edu/projects/enzo
  • 2010 Enzo User Workshop slides
    • http://lca.ucsd.edu/workshops/enzo2010
  • yt website (analysis and vis.)
    • http://yt.enzotools.org
  • Jacques website (analysis and vis.)
    • http://jacques.enzotools.org/doc/Jacques/Jacques.html

  28. Backup slides

  29. Grid Hierarchy Data Structure [Figure: diagram of a three-level hierarchy with grids labeled (level, index), e.g., (0,0), (1,0), (2,0), (2,1)]
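The diagram's linked structure can be expressed in C++ roughly as below. This is modeled loosely on the idea of Enzo's HierarchyEntry; the member names are chosen for illustration.

```cpp
// Sketch of the linked grid-hierarchy structure in the diagram: each entry
// points to its parent, to the next grid on the same level, and to the first
// grid on the next finer level.
class Grid;  // the patch object holding the field data (see the earlier sketch)

struct HierarchyEntry {
  HierarchyEntry* NextGridThisLevel = nullptr;  // sibling chain on this level
  HierarchyEntry* NextGridNextLevel = nullptr;  // first child on the finer level
  HierarchyEntry* ParentGrid        = nullptr;  // owning grid on the coarser level
  Grid*           GridData          = nullptr;  // the patch itself
};

// Walk one level by following the sibling chain.
inline int CountGridsOnLevel(const HierarchyEntry* first) {
  int n = 0;
  for (const HierarchyEntry* e = first; e != nullptr; e = e->NextGridThisLevel) ++n;
  return n;
}
```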

  30. Scaling the AMR grid hierarchy in depth and breadth [Figure: tree of grids labeled (level, index); depth = refinement level, breadth = number of siblings]

  31. 1024³, 7-level AMR stats

  32. Current MPI Implementation [Figure: a real grid object holds grid metadata + physics data; a virtual grid object holds grid metadata only]
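A sketch of the real/virtual distinction in the diagram: every rank keeps the lightweight metadata for every grid, but only the owning rank allocates the physics (field) data. The types and names are illustrative, not Enzo's.

```cpp
// Real vs. virtual grids: metadata is replicated everywhere, field storage
// exists only on the owning rank.
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

struct GridMetadata {                 // replicated on every MPI rank
  int level;
  int owner_rank;
  std::array<double,3> left_edge, right_edge;
  std::array<int,3>    dims;
};

struct GridObject {
  GridMetadata meta;
  std::unique_ptr<std::vector<double>> fields;  // null on non-owning ranks ("virtual")

  bool IsReal(int my_rank) const { return meta.owner_rank == my_rank; }

  void AllocateIfOwned(int my_rank, int nfields) {
    if (!IsReal(my_rank)) return;               // virtual grids stay metadata-only
    std::size_t ncells =
        static_cast<std::size_t>(meta.dims[0]) * meta.dims[1] * meta.dims[2];
    fields = std::make_unique<std::vector<double>>(ncells * nfields, 0.0);
  }
};
```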

  33. Scaling the AMR grid hierarchy
  • A flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
    • For very large grid counts, this dominates the memory requirement (not the physics data!)
  • The hybrid parallel implementation helps a lot!
    • Hierarchy metadata is now replicated only in every SMP node instead of every processor
    • We would prefer fewer SMP nodes (8192-4096) with bigger core counts (32-64) (= 262,144 cores)
    • The communication burden is partially shifted from MPI to intra-node memory accesses
  (A rough memory estimate follows below.)
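A rough illustration of why replicated metadata dominates (the 1 KB-per-grid figure is an assumed, hypothetical value, not taken from the slide):

$$10^{6} \ \text{grids} \times 1 \ \text{KB/grid} \approx 1 \ \text{GB of hierarchy metadata per flat-MPI process},$$

independent of how few grids that process actually owns; keeping one copy per SMP node instead of per core divides this overhead by the node's core count.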

  34. Cello extreme AMR framework
  • Targeted at fluid, particle, or hybrid (fluid+particle) simulations on millions of cores
  • Generic AMR scaling issues:
    • Small AMR patches restrict available parallelism
    • Dynamic load balancing
    • Maintaining data locality for deep hierarchies
    • Re-meshing efficiency and scalability
    • Inherently global multilevel elliptic solves
    • Increased range and precision requirements for deep hierarchies
