
Challenges Beyond the Software Pipeline

LLNL-PRES-481274. Bert Still, AX Division. Presented to: Salishan 2011 High Speed Computing Conference, Gleneden Beach, OR, April 25-28, 2011.


Presentation Transcript


  1. Challenges Beyond the Software Pipeline (LLNL-PRES-481274)
  Bert Still, AX Division
  Presented to: Salishan 2011 High Speed Computing Conference, Gleneden Beach, OR, April 25-28, 2011
  This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

  2. Summary
  • Integrated multiphysics codes are complex, and will provide a substantial challenge at exascale
  • Heterogeneous and hierarchical characteristics of the exascale machine may limit the amount of concurrency available through a traditional software-pipeline approach
  • Moving to a multi-pipelined approach may provide a mechanism for overlapped computation, communication, I/O and viz
  • Developing an application which can support the multi-pipelined model will present significant challenges and collaboration opportunities

  3. Integrated Codes provide the greatest exascale challenge
  • Often > 10 physics packages
  • Many different spatial, temporal scales
  • Algorithms tuned for minimal turn-around time instead of maximal computational efficiency
  • Multi-language (Fortran, Fortran90, C, C++, Python)
  • Variety of parallelism approaches
  • Diverse memory and processor performance needs
  • Steerable / interactive interfaces
  • 30+ third party libraries
  • Broad computational application space
  • Long life-time projects with >1 million lines of code
  (Figure: complex hydrodynamics of an ICF capsule)

  4. Multiphysics codes represent a significant investment: 100s of staff-years
  (Figure: Scaled Thermal Explosion eXperiment – temperature, pressure, and materials)
  • Physics models/geometries are complicated
  • Inter-relationships are carefully orchestrated for accuracy
  • Results are validated/verified against experimental data
  • A complete multiphysics code redesign requires abandoning substantial prior investment
  • Can existing discretization methods evolve for exascale efficiency? If so, how?
  • New code projects must build on a mature, standardized programming model
  • Research projects will provide:
    – assessment techniques for modification/re-use of codes
    – the pathway for modification
  • BUT IS THAT ENOUGH?

  5. Efficiently using heterogeneous exascale node hardware will require substantial code changes
  (Figure: today's homogeneous node vs. a future heterogeneous node with accelerators)
  • Number of nodes will increase ~10x
  • Cores per node will increase 100-1000x
  • Total memory per node remains the same (e.g., 32 GB)
  • Memory/core and bandwidth/core drop dramatically
  • Today: parallelize with MPI
  • Future: MPI + Memory Model + Threading, or something else…? (a minimal MPI + threading sketch follows below)
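
The "MPI + Memory Model + Threading" direction can be illustrated with a minimal MPI + OpenMP hybrid, intended to run with one MPI rank per node and OpenMP threads across the node's cores. This is a hedged sketch only: the array, loops, and reduction are placeholders and are not drawn from any LLNL integrated code.

    /* Minimal MPI + OpenMP hybrid sketch: one MPI rank per node,
     * OpenMP threads across the node's cores. The array and loops
     * are placeholders, not from any actual integrated code. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request thread support so threads could make MPI calls later. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1000000;
        double *u = malloc(n * sizeof(double));

        /* Threads share the node's memory; MPI still handles inter-node data. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            u[i] = (double)i / n;

        /* Thread-level reduction on the node, then MPI reduction across nodes. */
        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+ : local)
        for (int i = 0; i < n; i++)
            local += u[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g\n", global);
        free(u);
        MPI_Finalize();
        return 0;
    }

The open question on the slide is whether this kind of two-level model (or something beyond it) can cope with the memory-per-core drop described above.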

  6. Significant changes are required for integrated code (IC) performance at exascale
  (Figure: simulation of the Searchlight NIF experiment – 3M total 3D zones, 77k unknowns per zone)
  Research projects are in progress to explore whether existing multiphysics codes can efficiently use exascale hardware.
  • LLNL integrated codes are optimized for:
    – large memory, large bandwidth per task
    – store/fetch data > re-compute
  • Exascale architectures:
    – small memory, small bandwidth per task
    – some re-compute > store/fetch data (see the sketch below)
  • The issue is global: it impacts all integrated codes (weapons, climate, ICF, combustion, structural mechanics, aerodynamics, etc.)
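
To make the store/fetch vs. re-compute trade-off concrete, here is a hedged sketch: a derived per-zone quantity is either precomputed into an extra array (memory- and bandwidth-heavy) or recomputed inside the consuming loop (flop-heavy). The "sound speed" formula and function names are invented for illustration and are not from any LLNL code.

    /* Sketch of the store/fetch vs. re-compute trade-off. The derived
     * quantity (a placeholder "sound speed") and the flux formula are
     * invented for illustration. */
    #include <math.h>
    #include <stddef.h>

    /* Petascale habit: precompute and store an extra per-zone array,
     * then stream it back through limited memory bandwidth. */
    void flux_stored(const double *p, const double *rho,
                     double *cs, double *flux, size_t nzones)
    {
        for (size_t i = 0; i < nzones; i++)
            cs[i] = sqrt(1.4 * p[i] / rho[i]);      /* extra array traffic */
        for (size_t i = 0; i < nzones; i++)
            flux[i] = rho[i] * cs[i];               /* read it back again  */
    }

    /* Exascale-leaning variant: recompute the derived quantity in the
     * consuming loop, trading flops for memory footprint and bandwidth. */
    void flux_recomputed(const double *p, const double *rho,
                         double *flux, size_t nzones)
    {
        for (size_t i = 0; i < nzones; i++)
            flux[i] = rho[i] * sqrt(1.4 * p[i] / rho[i]);
    }

Whether the recomputing variant wins depends on how expensive the derived quantity is relative to the cost of moving it through memory, which is exactly the balance the slide says will shift at exascale.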

  7. Today we decompose the problem into domains …
  (Figure: a 64-domain Metis decomposition of a 50K-zone mesh)

  8. … and perform the physics simulation on each domain concurrently in a tightly coupled software pipeline
  (Figure: the same 64-domain Metis decomposition of a 50K-zone mesh; the full physics code, with packages A, B and C, runs on every processor)
  This approach might work as long as every package scales, but it does not fully exploit asynchronicity. A minimal sketch of this lockstep pattern follows.
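
In code terms, the tightly coupled pipeline looks roughly like the sketch below: every MPI rank holds one domain and runs every package in the same fixed order each cycle, separated by global synchronization. The "packages" here are trivial placeholders, and real codes synchronize through halo exchanges and collectives rather than bare barriers.

    /* Sketch of today's tightly coupled software pipeline: every rank holds
     * one domain and runs every package in the same fixed order each cycle.
     * The "packages" are trivial placeholders, not real physics. */
    #include <mpi.h>
    #include <stdio.h>

    #define NZONES 1024

    static void package_A(double *u, int n) { for (int i = 0; i < n; i++) u[i] += 1.0; }
    static void package_B(double *u, int n) { for (int i = 0; i < n; i++) u[i] *= 0.5; }
    static void package_C(double *u, int n) { for (int i = 0; i < n; i++) u[i] -= 0.1; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double u[NZONES] = {0.0};

        for (int cycle = 0; cycle < 10; cycle++) {
            package_A(u, NZONES);           /* every rank runs A ...        */
            MPI_Barrier(MPI_COMM_WORLD);    /* ... and waits for the others */
            package_B(u, NZONES);           /* then every rank runs B       */
            MPI_Barrier(MPI_COMM_WORLD);
            package_C(u, NZONES);           /* a package that scales poorly */
            MPI_Barrier(MPI_COMM_WORLD);    /* stalls all ranks right here  */
        }
        if (rank == 0)
            printf("u[0] after 10 cycles = %g\n", u[0]);
        MPI_Finalize();
        return 0;
    }

The point of the sketch is the structure, not the arithmetic: every package sits on the same critical path, so the slowest-scaling package limits the whole code.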

  9. "Billion-way parallel" may not be the best way to think about an exascale machine
  • Perhaps it is better to think about exascale as "societal parallelism": decompose tasks into separate calculations coupled by data dependencies
  • "Billion-way parallel" leads naturally to thinking in an SIMD fashion following a software pipeline – this may not be the best approach
  • Exascale machines will have substantial heterogeneity and multiple levels of hierarchy

  10. Mesh decomposition can impact load balance, communication, algorithm behavior and convergence
  (Figure: a structured mesh with a bad decomposition vs. an unstructured mesh with a nice decomposition)
  • Achieving good mesh decomposition remains hard
  • Graph-based partitioning (e.g. with Metis; see the sketch below) may give suboptimal decompositions on regular grids, or even disconnected domains
  • Simplistic partitioning of complicated meshes can lead to workload imbalance and extra inter-processor communication
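
The Metis decomposition shown on the previous slides boils down to a graph-partitioning call. A minimal, hedged sketch with METIS 5 follows; the 4-vertex ring graph stands in for a real mesh-connectivity graph with one vertex per zone, and no vertex or edge weights are supplied.

    /* Minimal METIS 5 sketch: partition a tiny made-up connectivity graph
     * (a 4-vertex ring) into 2 parts. A real mesh would have one vertex per
     * zone and weights reflecting per-zone work. Link with -lmetis. */
    #include <metis.h>
    #include <stdio.h>

    int main(void)
    {
        idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
        /* Compressed adjacency of the ring 0-1-2-3-0 */
        idx_t xadj[]   = {0, 2, 4, 6, 8};
        idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 2, 0};
        idx_t part[4];

        int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                     NULL, NULL, NULL,   /* no weights      */
                                     &nparts, NULL, NULL,
                                     NULL,               /* default options */
                                     &objval, part);
        if (rc == METIS_OK)
            for (idx_t v = 0; v < nvtxs; v++)
                printf("zone %d -> domain %d\n", (int)v, (int)part[v]);
        return 0;
    }

Good partitions minimize the edge cut (communication) while balancing vertex weights (work); as the slide notes, getting both right on real multiphysics meshes remains hard.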

  11. What is a multi-pipelined approach?
  • Think "space sharing" rather than "time sharing"
  • Decompose tasks rather than just data
  • Each pipeline executes (mostly) independently [MPMD]
    – on a different processor set (a group)
    – data flows within the pipeline
  • A pipeline may have data dependencies on other pipelines (client/server)
    – consumer → upstream, producer → downstream
    – there is no resource dependence between pipelines
  • A pipeline may model a physical process, for example, and be a multi-petascale calculation
  (A sketch of splitting MPI ranks into pipeline groups follows below.)
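
With today's MPI, the simplest way to approximate this space sharing is to split MPI_COMM_WORLD into one communicator per pipeline, each group then running its own package loop. This is a hedged sketch; the three-way split and the rank-to-pipeline assignment are illustrative only.

    /* Sketch of space sharing with MPI: split the world communicator into
     * per-pipeline groups, each of which runs its own (mostly independent)
     * loop. The 3-way split and rank assignment are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Assign each rank to pipeline A (0), B (1), or C (2). */
        int pipeline = (3 * world_rank) / world_size;

        MPI_Comm pipe_comm;                 /* intra-pipeline communicator */
        MPI_Comm_split(MPI_COMM_WORLD, pipeline, world_rank, &pipe_comm);

        int pipe_rank, pipe_size;
        MPI_Comm_rank(pipe_comm, &pipe_rank);
        MPI_Comm_size(pipe_comm, &pipe_size);
        printf("world rank %d -> pipeline %c, local rank %d of %d\n",
               world_rank, "ABC"[pipeline], pipe_rank, pipe_size);

        /* Each pipeline would now run its own package loop on pipe_comm,
         * exchanging only the cross-pipeline data it depends on. */
        MPI_Comm_free(&pipe_comm);
        MPI_Finalize();
        return 0;
    }

Group sizes would in practice be chosen per package, not evenly; the communicator split only provides the "processor set" piece of the picture, not the data-dependency coupling described next.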

  12. Moving beyond the software pipeline provides a mechanism for exploiting additional concurrency
  Overlapped computation in three pipelines, A, B and C:
  • Packages "A" and "B" are on different sets of processors
  • At each loop, "B" can begin execution once package "A" delivers the required data
  • "A" can loop back after getting its dependent data from "B"
  • Both run simultaneously for part of the calculation
  • "C" can start once its dependencies are resolved
  (A sketch of the overlapped A/B exchange follows below.)
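
A hedged sketch of the A/B overlap, assuming A's ranks and B's ranks form disjoint sets and exchange only a small coupling buffer each cycle. The tags, buffer size, 1:1 partner mapping, and even/odd rank assignment are made up for illustration.

    /* Sketch of overlapped execution of packages A and B on disjoint rank
     * sets: B starts as soon as A delivers its coupling data, and A loops
     * back after receiving B's feedback. Tags, buffers, and the even/odd
     * partner mapping are illustrative placeholders. */
    #include <mpi.h>

    #define NCOUPLE 4096
    enum { TAG_A_TO_B = 100, TAG_B_TO_A = 101 };

    /* Run by ranks in pipeline A; 'partner' is the matching B rank. */
    static void pipeline_A(int partner, int ncycles)
    {
        double out[NCOUPLE] = {0}, in[NCOUPLE];
        for (int cycle = 0; cycle < ncycles; cycle++) {
            /* ... package A computes, filling 'out' ... */
            MPI_Request reqs[2];
            MPI_Isend(out, NCOUPLE, MPI_DOUBLE, partner, TAG_A_TO_B,
                      MPI_COMM_WORLD, &reqs[0]);       /* lets B start        */
            MPI_Irecv(in, NCOUPLE, MPI_DOUBLE, partner, TAG_B_TO_A,
                      MPI_COMM_WORLD, &reqs[1]);
            /* ... A keeps working on anything that does not need B ... */
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* B's feedback in 'in' */
            /* ... A folds 'in' into its next cycle and loops back ... */
        }
    }

    /* Run by ranks in pipeline B, mirroring A's exchange. */
    static void pipeline_B(int partner, int ncycles)
    {
        double in[NCOUPLE], out[NCOUPLE] = {0};
        for (int cycle = 0; cycle < ncycles; cycle++) {
            MPI_Recv(in, NCOUPLE, MPI_DOUBLE, partner, TAG_A_TO_B,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* wait for A's data */
            /* ... package B computes with 'in', filling 'out' ... */
            MPI_Send(out, NCOUPLE, MPI_DOUBLE, partner, TAG_B_TO_A,
                     MPI_COMM_WORLD);                    /* feedback to A     */
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size % 2 != 0)                    /* sketch needs an even count  */
            MPI_Abort(MPI_COMM_WORLD, 1);
        if (rank % 2 == 0) pipeline_A(rank + 1, 10);  /* even ranks act as A */
        else               pipeline_B(rank - 1, 10);  /* odd ranks act as B  */
        MPI_Finalize();
        return 0;
    }

Only the coupling data crosses pipelines; each side's main computation proceeds on its own processor set, which is the overlap the slide describes.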

  13. There may be several advantages to this approach, beyond overlapping computation
  • Once "A" has completed calculating, it can flip to a "read state":
    – send data to "B", start a package checkpoint, start in-line viz
    – receive dependent data from "B", flip back to "read/write" and loop
    (a sketch of overlapping the send, checkpoint, and viz follows below)
  • This approach could alleviate some of the I/O bandwidth pressure
  • It requires support from the hardware, runtime, I/O, viz, and programming models, but it does exploit multiple types of concurrency
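
A hedged sketch of the "read state" idea, written as a helper that A's ranks might call at the end of each cycle: once A's data is final (and therefore read-only), the nonblocking send to B and a nonblocking MPI-I/O checkpoint write can be in flight at the same time, and in-line viz could read the same buffer. The function name, file naming, partner rank, and buffer size are placeholders.

    /* Sketch of the "read state" overlap for package A: once A's data for
     * the cycle is final, the send to pipeline B and a package checkpoint
     * proceed concurrently. File naming, partner, and buffer sizes are
     * placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    #define NFIELD 4096
    #define TAG_A_TO_B 100

    void a_cycle_end(const double *field, int partner, int cycle)
    {
        MPI_Request reqs[2];

        /* 1. Hand the coupling data to pipeline B (nonblocking). */
        MPI_Isend(field, NFIELD, MPI_DOUBLE, partner, TAG_A_TO_B,
                  MPI_COMM_WORLD, &reqs[0]);

        /* 2. Start a package-level checkpoint of the same read-only data
         *    (one file per rank, written with nonblocking MPI-I/O). */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char fname[64];
        snprintf(fname, sizeof fname, "a_checkpoint_rank%04d.dat", rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)cycle * NFIELD * sizeof(double);
        MPI_File_iwrite_at(fh, offset, field, NFIELD, MPI_DOUBLE, &reqs[1]);

        /* 3. In-line viz could consume 'field' here, since the data stays
         *    read-only until the waits below complete. */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* flip back to read/write */
        MPI_File_close(&fh);
    }

Whether this actually relieves I/O bandwidth pressure depends on the file system and runtime truly overlapping the write with computation, which is exactly the runtime and I/O support the slide calls out.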

  14. Applications face several additional challenges before taking advantage of multiple pipelines
  • New algorithms must be developed, validated, and verified
    – identify package dependencies
    – establish stability/convergence of the new methods
  • The programming model must support co-routines
  • Tools will be needed to address dependency analysis, communication bottlenecks and hotspots, and race conditions
  • Extra communication → more load on the network. Is that OK?

  15. Summary
  • Integrated multiphysics codes are complex, and will provide a substantial challenge at exascale
  • Heterogeneous and hierarchical characteristics of the exascale machine may limit the amount of concurrency available through a traditional software-pipeline approach
  • Moving to a multi-pipelined approach may provide a mechanism for overlapped computation, communication, I/O and viz
  • Developing an application which can support the multi-pipelined model will present significant challenges and collaboration opportunities

  16. Many thanks to the following people for their helpful discussions
  T. Adams (LLNL), M. Anderson (HQ/LANL), A. Arsenlis (LLNL), R. Barrett (SNL), J. Belak (LLNL), J. Bell (LBNL), M. Bement (LANL), R. Bond (SNL), T. Brunner (LLNL), J. Chen (SNL), J. Cohen (LLNL), L. Cox (LANL), D. Daniel (LANL), T. DeGroot (LLNL), B. de Supinski (LLNL), M. Dorr (LLNL), C. Ferenbaugh (LANL), T. Germann (LANL), R. Harrison (ORNL), P. Henning (LANL), R. Hornung (LLNL), J. Johannes (SNL), J. Keasler (LLNL), A. Koniges (LBNL), S. Langer (LLNL), B. Miller (LLNL), K. Mish (SNL), R. Neely (LLNL), A. Nichols (LLNL), B. Pudliner (LLNL), R. Robey (LANL), M. Schulz (LLNL), J. Shalf (LBNL), M. Steinkamp (LANL), S. Swaminarayan (LANL), D. Womble (SNL), M. Zika (LLNL)
  Most sincere apologies to anyone I've missed.

  17. Backup Viewgraphs

  18. ALE3D (with Co-op framework) enables MPMD task parallelism and load balancing, leading to material science breakthroughs
  (Figure: adaptive sampling couples a finite element model (ale3d) to grain-scale material model servers (mspB), using microstructure generation, direct numerical simulations, and a Remote Method Invocation / ServerProxy pattern during material model evaluation)
  (Figure: examples of adaptive sampling used to understand materials and develop models – Taylor impact, biaxial bulge, shear banding, wire draw)

  19. Exascale Applications Overview
  • Exascale calculations will span a range from small (capacity) to very large (capability)
  • Applications will span a range
  • Integrated Codes will provide the greatest challenge at exascale – we expect that there will be changes
  • We are investigating relevant IC algorithms on exascale hardware
  • We could expect some IC packages to evolve, but others will need to be replaced
  • As the codes change, requirements extrapolated from the petascale may no longer be accurate

  20. Exascale hardware will deviate substantially from today's machines
  (Figure: today's homogeneous node vs. a future heterogeneous node – CPU cores, accelerators, memory)
  • Today's homogeneous node: ~140 Gflop/s, 32 GB, no accelerators, ~16 cores, up to 96k nodes; programmed in C/C++, Fortran + MPI
  • Future heterogeneous node, Swim Lane 1: 1-2 Tflop/s, 32 GB, accelerators, 1,000 cores, 500k - 1M nodes
  • Future heterogeneous node, Swim Lane 2: 10 Tflop/s, 32 GB, accelerators = GPUs, 10,000 cores, 100k nodes
  • Future programming: C/C++, Fortran + CUDA/OpenCL + MPI + threads, or Chapel/X10/Fortress ...?
  • What is the programming model of the future?

  21. Exascale simulations will span a large range of sizes, from capacity to capability
  • Components of Defense Application Modeling (DAM):
    – Physics and Engineering Models (PEM)
    – Integrated Codes (IC)
    – DAM applications: Uncertainty Quantification (UQ) and Validation and Verification (V&V)
  • Exascale simulations will support all of these
  • Multiple terascale and petascale runs will be done (some simultaneously) to explore parameter space
  • High resolution/fidelity simulations will be done to enhance physics model accuracy
