
Scalable Scientific Applications: Characteristics & Future Directions

Douglas B. Kothe and Richard Barrett, Ricky Kendall, Bronson Messer, Trey White
Leadership Computing Facility, National Center for Computational Sciences, Oak Ridge National Laboratory


Presentation Transcript


  1. Scalable Scientific Applications: Characteristics & Future Directions. Douglas B. Kothe and Richard Barrett, Ricky Kendall, Bronson Messer, Trey White. Leadership Computing Facility, National Center for Computational Sciences, Oak Ridge National Laboratory

  2. Science Teams Have Specific PF Objectives

  3. Application Requirements at the PF • Application categories analyzed: science motivation and impact; science quality and productivity; application models, algorithms, and software; application footprint on platform; data management and analysis; early-access science-at-scale scenarios • Results: 100+ page Application Requirements Document published in Jul 07 • New methods for categorizing platform and application attributes were devised and used in the analysis, and are guiding tactical infrastructure purchase and deployment • But still too qualitative! More work to do…

  4. Application Codes in 2008: An Incomplete List • Astrophysics • CHIMERA, GenASiS, 3DHFEOS, Hahndol, SNe, MPA-FT, SEDONA, MAESTRO, AstroGK • Biology • NAMD, LAMMPS • Chemistry • CPMD, CP2K, MADNESS, NWChem, Parsec, Quantum ESPRESSO, RMG, GAMESS • Nuclear Physics • ANGFMC, MFDn, NUCCOR, HFODD • Engineering • Fasel, S3D, Raptor, MFIX, Truchas, BCFD, CFL3D, OVERFLOW, MDOPT • High Energy Physics • CPS, Chroma, MILC • Fusion • AORSA, GYRO, GTC, XGC • Materials Science • VASP, LS3DF, DCA++, QMCPACK, RMG, WL-LSMS, WL-AMBER, QMC • Accelerator Physics • Omega3P, T3P • Atomic Physics • TDCC, RMPS, TDL • Space Physics • Pogorelov • Climate & Geosciences • MITgcm, PFLOTRAN, POP, CCSM (CAM, CICE, CLM, POP) • Computer Science (Tools) • Active Harmony, IPM, KOJAK, mpiP, PAPI, PMaC, Sca/LAPACK, SvPablo, TAU

  5. Apps Teams Are Reasonably Adept at Using our Current Systems* • *Is the “field of dreams” approach inadequate (too little, too late)? • What is “effective utilization”? Scaling? Percent of peak (Jacobi vs. MG)? (a rough percent-of-peak calculation is sketched below) • Current SC apps range from 2-70% of peak: what’s the goal? • Remember, we improve what we measure, so let’s have the right metrics and measures • My $0.02: the science and engineering achievements on these systems are the legacy
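
A rough illustration of the percent-of-peak metric referenced above. All numbers here (sustained GF/s, node count, per-core peak) are made-up assumptions, not measurements of any code on these slides.

```python
# Minimal sketch (hypothetical numbers): percent of peak as one possible
# "effective utilization" metric.
measured_gflops = 28_000.0      # sustained GF/s reported by the app (assumed)
nodes = 1_000                   # assumed job size
cores_per_node = 4
peak_gflops_per_core = 10.4     # assumed per-core peak

peak_gflops = nodes * cores_per_node * peak_gflops_per_core
percent_of_peak = 100.0 * measured_gflops / peak_gflops
print(f"{percent_of_peak:.1f}% of peak")   # ~67% with these assumed numbers
```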

  6. Science Workload: Job Sizes and Resource Usage of Key Applications

  7. Preparing for the Exascale: Long-Term Science Drivers and Requirements • We have recently surveyed, analyzed, and documented the science drivers and application requirements envisioned for exascale leadership systems in the 2020 timeframe • These studies help to • Provide a roadmap for the ORNL Leadership Computing Facility • Uncover application needs and requirements • Focus our efforts on those disruptive technologies and research areas in need of our and the HPC community’s attention

  8. What Will an EF System Look Like? • All projections are daunting • Based on projections of existing technology both with and without “disruptive technologies” • Assumed to arrive in 2016-2020 timeframe • Example 1 • 115K nodes @ 10 TF per node, 50-100 PB, optical interconnect, 150-200 GB/s injection B/W per node, 50 MW • Examples 2-4 (DOE “Townhall” report*) • *www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
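
A quick sanity check that Example 1's node count and per-node peak add up to roughly an exaflop; the numbers are taken from the slide above.

```python
# Aggregate peak for Example 1: 115K nodes at 10 TF per node.
nodes = 115_000
tf_per_node = 10.0
peak_pf = nodes * tf_per_node / 1_000        # TF -> PF
print(f"{peak_pf:.0f} PF, i.e. about {peak_pf / 1_000:.2f} EF")
```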

  9. Science Prospects and Benefits with High End Computing (EF?) in the Next Decade

  10. Science Case: Climate • Mitigation: evaluate strategies and inform policy decisions for climate stabilization; 100-1000 year simulations • Adaptation: decadal forecasts and regional impacts; prepare for committed climate change; 10-100 year simulations • Overall goal: resolve clouds, forecast weather and extreme events, provide quantitative mitigation strategies • At 250 TF: initial simulations with dynamic carbon cycle and limited chemistry (mitigation); decadal simulations with high-resolution ocean at 1/10° (adaptation) • At 1 PF: full chemistry, carbon/nitrogen/sulfur cycles, ice-sheet model, multiple ensembles (mitigation); high-resolution (1/4°) atmosphere, land, and sea ice, as well as ocean (adaptation) • At sustained PF: increased resolution, longer simulations, more ensembles for reliable projections, coupling with socio-economic and biodiversity models (mitigation); limited cloud-resolving simulations, large-scale data assimilation (adaptation) • At 1 EF: multi-century ensemble projections for detailed comparisons of mitigation strategies (mitigation); full cloud-resolving simulations, decadal forecasts of regional impacts and extreme-event statistics (adaptation)

  11. Barriers in Ultrascale Climate Simulation: Attacking the Fourth Dimension, Parallel in Time • Problem • Climate models use explicit time stepping • Time step must go down as resolution goes up • Time stepping is serial • Single-process performance is stagnating • More parallel processes do not help! • Possible complementary solutions • Implicit time stepping • High-order in time • “Fast” bases: curvelets and multi-wavelets • “Parareal” parallel in time (a toy parareal sketch follows this slide) • Progress • Implicit version of HOMME for global shallow-water equations: 10x speedup for steady-state test case • High-order single-step time integration • Single-cycle multi-grid linear solver for 1D • Pure advection with curvelets and multi-wavelets • Near-term plans • Scale, tune, and precondition implicit HOMME • Single-cycle multi-grid linear solver for 2D • “Parareal” for Burgers’ equation (1D nonlinear) Ref: Trey White (ORNL)
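
To make the "parareal" idea above concrete, here is a minimal serial sketch for the scalar ODE du/dt = -λu. The propagators, step counts, and iteration count are toy assumptions; in a real climate code the fine propagator would run concurrently across time slices on separate processor groups.

```python
# Serial sketch of the parareal parallel-in-time iteration for du/dt = -lam*u.
import numpy as np

lam, T, n_slices = 2.0, 2.0, 10
dT = T / n_slices

def G(u, dt):                       # coarse propagator: one forward-Euler step
    return u * (1.0 - lam * dt)

def F(u, dt, m=100):                # fine propagator: m small forward-Euler steps
    for _ in range(m):
        u = u * (1.0 - lam * dt / m)
    return u

U = np.empty(n_slices + 1)
U[0] = 1.0
for n in range(n_slices):           # initial coarse sweep (serial)
    U[n + 1] = G(U[n], dT)

for k in range(5):                  # parareal iterations
    F_old = np.array([F(U[n], dT) for n in range(n_slices)])  # parallelizable step
    G_old = np.array([G(U[n], dT) for n in range(n_slices)])
    for n in range(n_slices):       # cheap serial correction sweep
        U[n + 1] = G(U[n], dT) + F_old[n] - G_old[n]

print(U[-1], np.exp(-lam * T))      # iterate approaches the exact solution
```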

  12. Science Case: Astrophysics (CHIMERA) • Goal: explanation and prediction of core-collapse SNe; put general relativity, the dense-matter EOS, and stellar evolution theories to the test • At 250 TF: the interplay of several important phenomena: hydrodynamic instabilities, the role of nuclear burning, neutrino transport • At 1 PF: determine the nature of the core-collapse supernova explosion mechanism; fully integrated, 3D neutrino radiation hydrodynamics simulations with nuclear burning • At sustained PF: detailed nucleosynthesis (element production) from core-collapse SNe; a large nuclear network capable of isotopic prediction (along with energy production) • At 1 EF: precision prediction of the complete observable set from core-collapse SNe (nucleosynthesis, gravitational waves, neutrino signatures, light output); tests general relativity and provides information about the dense-matter equation of state, along with detailed knowledge of stellar evolution; full 3D Boltzmann neutrino transport, 3D MHD/RHD, nuclear burning

  13. Requirements Gathering • Consult literature and existing documentation • Construct a survey eliciting speculative requirements for scientific applications on HPC platforms in 2010–2020 • Pass the survey to leading computational scientists in a broad range of scientific domains • Analyze and validate the survey results (hard) • Make informed decisions and take action

  14. Survey Questions • What are some possible science drivers and urgent problems that would require Leadership Computing in 2010–2020? • What are some looming computational challenges that will need resolution in 2010–2020? • What are some science objectives and outcomes that Leadership Computing could enable in 2010–2020? • What are some improvement goals for science-simulation fidelity that Leadership Computing could enable in 2010–2020? • What are some possible changes in physical model attributes for Leadership-Computing applications in 2010–2020? • What major software-development projects could occur in your application area in 2010–2020? • What major algorithm changes could occur for your applications in 2010–2020? • What libraries and development tools may need to be developed or significantly improved for Leadership Computing in 2010–2020? • How might system-attribute priorities change for Leadership Computing for your application? • In what ways might or should your workflow in 2010–2020 be different from today? • Are there any disruptive technologies that might affect your applications?

  15. Findings in Models and Algorithms • The seven algorithm types are scattered broadly among science domains, with no one particular algorithm being ubiquitous and no one algorithm going unused. • Structured grids and dense linear algebra continue to dominate, but other algorithm categories will become more common. • Compared to the Seven Dwarfs for current applications, we project a significant increase in Monte Carlo and increases in unstructured grids, sparse linear algebra, and particle methods, as well as a relative decrease in FFTs • These projections reflect the expectation of much-greater parallelism in architectures and the resulting need for very high scalability • Load balancing, scalable sparse solver, and random number generator algorithms will be more important. • Some important algorithms are not captured in the Seven Dwarfs • Categories expected by application scientists to be of growing importance in 2010–2020 include adaptive mesh refinement, implicit nonlinear systems, data assimilation, agent-based methods, parameter continuation, and optimization
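
One concrete facet of the growing importance of random number generator algorithms noted above is producing independent, reproducible streams across very many ranks. Below is a minimal sketch using NumPy's SeedSequence spawning; the rank count and seed are illustrative assumptions.

```python
# One reproducible, statistically independent random stream per (imagined) MPI rank.
import numpy as np

n_ranks = 100_000                       # imagine one stream per rank (assumed)
root = np.random.SeedSequence(2008)     # single master seed for reproducibility
children = root.spawn(n_ranks)          # independent child seeds

# Each rank would build only its own generator from children[rank]:
rank = 42
rng = np.random.default_rng(children[rank])
print(rng.random(3))                    # this rank's private stream
```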

  16. Findings in Software • “Hero developer” mode is a dead end: it does not scale, and no single person can adequately understand the breadth and depth of the issues • Success requires computer scientists, algorithm developers, application developers, and end-user scientists working together in a tightly integrated manner • Must develop a means of interface between the heterogeneous computer, the developer, and the end-user scientist • Must raise the level of abstraction • The current approach based on low-level constructs places constraints on performance: it over-constrains the compiler and runtime system • Raising the abstraction level allows for increased algorithm experimentation, incorporation of intent in data structures, flexible memory organization, and inclusion of fault-tolerance constructs • Enables exploration of power-aware algorithms • Frees us from heroic software efforts having to be the norm

  17. Findings in Software • Application development and maintenance tools and practices need to fundamentally change • Productivity improvement is an important metric and guide for tool and software choices • Fault tolerance and V&V software components must be used to improve reliability and robustness of application software • Knowledge discovery techniques and tools should be explored to help with bug detection, simulation steering, and data feature extraction and correlation • A holistic view of application data (from input to archival) is needed to most effectively deliver tools for the end-to-end workflow performed by scientists

  18. Applications Analyzed • CHIMERA • Astrophysics: core-collapse supernova explosion mechanism • S3D • Turbulent combustion: lifted flame stabilization in diesel & gas turbine engines • GTC • Fusion: Analyze and validate CTEM and ETG core turbulence • POP • Global ocean circulation: Eddy-resolved flow with biogeochemistry • DCA++ • High-temperature superconductivity: Effect of charge & spin inhomogeneities in the Hubbard model superconducting state • MADNESS • Chemistry: neutron & x-ray spectra of cuprates; dynamics of few-electron systems; metal oxide surfaces in catalytic processes • PFLOTRAN • Reactive flows in porous media: Uranium migration and CO2 sequestration in subsurface geologic formations

  19. Application Requirements and Workload Reinforce a Balanced System Assertion • Applications analyzed represent almost one half of our 2008 allocation • A broad range of compute/communicate workloads must be supported; where an application falls in this space depends upon the science, the application within that science, and the problem tackled by the application (a sketch of computing this split from profile data follows this slide) • Application requirements call for breadth in models, algorithms, software, and scaling type • Physical models: coupled continuum conservation laws, radiation transport, many-body Schrödinger, plasma physics, Maxwell’s equations, turbulence • Numerical algorithms: each of the “7 dwarfs” is required • Software implementation: all popular languages are required • Science drivers: strong scaling (time to solution) and weak scaling (bigger problem) • Application readiness action plans are in place and being followed • [Figure: CHIMERA, POP, GTC, MADNESS, S3D, PFLOTRAN, and DCA++ plotted in the computation (0–100%) vs. communication (0–100%) plane]
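
A minimal sketch of how an application might be placed in the computation-vs-communication plane of the figure above. The timings are invented; in practice they would come from a profiling tool such as mpiP or IPM.

```python
# Place one run in the computation/communication plane from profiled times.
profile = {                      # seconds spent per category for one run (assumed)
    "compute":        860.0,
    "mpi_pt2pt":       95.0,
    "mpi_collective":  45.0,
}
total = sum(profile.values())
comm = profile["mpi_pt2pt"] + profile["mpi_collective"]
print(f"computation   {100 * profile['compute'] / total:5.1f}%")
print(f"communication {100 * comm / total:5.1f}%")
```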

  20. Resource Utilization by Science Applications: Science Dictates the Requirements

  21. Example: PF Performance Observations and Readiness Plan for Some of our Key Apps

  22. Accelerating Development & Readiness • Automated diagnostics • Drivers: performance analysis, application verification, S/W debugging, H/W-fault detection and correction, failure prediction and avoidance, system tuning, and requirements analysis • Hardware latency • Won’t improve nearly as much as flop rate, parallelism, or B/W in the coming years • Can S/W strategies mitigate high H/W latencies? • Hierarchical algorithms • Applications will require algorithms aware of the system hierarchy (compute/memory) • In addition to hybrid data parallelism and file-based checkpointing, algorithms may need to include dynamic decisions between recomputing and storing (see the sketch below), fine-scale task-data hybrid parallelism, and in-memory checkpointing • Parallel programming models • Improved programming models are needed to allow the developer to identify an arbitrary number of levels of parallelism and map them onto hardware hierarchies at runtime • Models continue to be coupled into larger models, driving the need for arbitrary hierarchies of task and data parallelism
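
A minimal sketch of the dynamic recompute-versus-store decision mentioned under hierarchical algorithms above. The sustained flop rate and memory bandwidth are illustrative assumptions that a real code would measure or tune.

```python
# Decide whether redoing arithmetic beats reloading stored results.
def cheaper_to_recompute(flops_needed, bytes_if_stored,
                         flop_rate=5e12,          # sustained flop/s per node (assumed)
                         mem_bandwidth=2e11):     # sustained bytes/s per node (assumed)
    """True when recomputation costs less time than the memory traffic to reload."""
    recompute_time = flops_needed / flop_rate
    reload_time = bytes_if_stored / mem_bandwidth
    return recompute_time < reload_time

# Example: 200 flops per point vs. reloading 80 bytes per point of cached data.
print(cheaper_to_recompute(flops_needed=200, bytes_if_stored=80))   # True here
```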

  23. Accelerating Development & Readiness • Solver technology and innovative solution techniques • Global communication operations across 10^6-10^8 processors will be prohibitively expensive; solvers will have to eliminate global communication where feasible and mitigate its effects where it cannot be avoided. Research on more effective local preconditioners will become a very high priority • If increases in memory B/W continue to lag the number of cores added to each socket, further research is needed into ways to effectively trade flops for memory loads/stores (a toy example follows this slide) • Accelerated time integration • Are we ignoring the time dimension along which to exploit parallelism? (Ex: climate) • Model coupling • Coupled models require effective methods to implement, verify, and validate the couplings, which can occur across wide spatial and temporal scales. The coupling requirements drive the need for robust methods for downscaling, upscaling, and coupled nonlinear solving • Evaluation of the accuracy and importance of couplings drives the need for methods for validation, uncertainty analysis, and sensitivity analysis of these complex models • Maintaining current libraries • The reliance of current HPC applications on libraries will grow • Libraries must perform as HPC systems grow in parallelism and complexity
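
As a toy example of trading flops for memory loads/stores (referenced above), the sketch below applies a 1-D variable-coefficient Laplacian matrix-free, recomputing the coefficients from the coordinates on every application instead of streaming a stored sparse matrix through memory. The coefficient function and problem size are assumptions.

```python
# Matrix-free operator application: more flops, far less memory traffic than
# loading a stored sparse matrix on every solver iteration.
import numpy as np

n = 1_000_000
x = np.linspace(0.0, 1.0, n)
u = np.random.rand(n)

def apply_laplacian_matrix_free(u):
    # Coefficients are recomputed from x on every application (extra flops)
    # rather than read back from a stored sparse matrix (memory loads).
    k = 1.0 + 0.5 * np.sin(x)                    # assumed coefficient field
    out = np.zeros_like(u)
    out[1:-1] = k[1:-1] * (u[:-2] - 2.0 * u[1:-1] + u[2:])
    return out

y = apply_laplacian_matrix_free(u)               # no matrix is ever stored
```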

  24. PF Survey Findings (with some opinion) • A rigorous & evolving apps requirements process pays dividends • Needs to be quantitative: apps cannot “lie” with performance analysis • Algorithm development is evolutionary • Can we break this mold? • Ex: explore new parallel dimensions (time, energy) • Hybrid/multi-level programming models virtually nonexistent • No algorithm “sweet spots” (one size fits all) • But algorithm footprints share characteristics • V&V and SQA not in good standing • Ramifications for compute systems as well as for the apps results generated • No one is really clamoring for new languages • MPI until the water gets too hot (frog analogy) • Apps lifetimes are >3-5x machine lifetimes • Refactoring is a way of life • Fault tolerance via defensive checkpointing is the de facto standard • Won’t this eventually bite us? It artificially drives I/O demands (see the checkpoint-interval sketch below) • Weak or strong scaling or both (no winner) • Data analytics paradigm must change • The middleware layer is surprisingly stable and agnostic across apps (and should expand!)
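
To illustrate why defensive checkpointing "artificially drives I/O demands", here is Young's first-order estimate of the optimal checkpoint interval, tau ≈ sqrt(2 · C · MTBF). The checkpoint cost and MTBF values are assumptions.

```python
# Young's estimate of the optimal defensive-checkpoint interval and its I/O overhead.
import math

checkpoint_cost_s = 600.0            # time to write one defensive checkpoint (assumed)
system_mtbf_s = 24 * 3600.0          # mean time between interrupts (assumed)

tau_opt = math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)
overhead = checkpoint_cost_s / tau_opt
print(f"checkpoint every ~{tau_opt / 3600:.1f} h, ~{100 * overhead:.0f}% of time writing I/O")
```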

  25. Summary & Recommendations: EF Survey • We are in danger of failing because of a software crisis unless concerted investments are undertaken to close the H/W-S/W gap • H/W has gotten way ahead of the S/W (same ole – same ole?) • Structured grids and dense linear algebra continue to dominate, but … • Increase projected for Monte Carlo algorithms, unstructured grids, sparse linear algebra, and particle methods (relative decrease in FFTs) • Increasing importance for AMR, implicit nonlinear systems, data assimilation, agent-based methods, parameter continuation, and optimization • Priority of computing system attributes • Increase: interconnect bandwidth, memory bandwidth, mean time to interrupt, memory latency, and interconnect latency • Reflects the desire to increase computational efficiency in order to exploit peak flops • Decrease: disk latency, archival storage capacity, disk bandwidth, wide area network bandwidth, and local storage capacity • Reflects the expectation that computational efficiency will not increase • Per-core requirements remain relatively static, while aggregate requirements will grow with the system

  26. Summary & Recommendations: EF Survey • System software must possess more stability, reliability, and fault tolerance during application execution • New fault tolerance paradigms must be developed and integrated into applications • Job management and efficient scheduling of those resources will be a major obstacle faced by computing centers • Systems must be much better “science producers” • Strong software engineering practices must be applied to systems to ensure good end-to-end productivity • Data analytics must empower scientists to ask “what-if” questions, providing S/W & H/W infrastructure capable of answering these questions in a timely fashion (Google desktop) • Strong data management will become an absolute at the exascale • Just like H/W requires disruptive technologies for acceleration of natural evolutionary paths, so too will algorithm, software, and physical model development efforts need disruptive technologies (invest now!)

  27. Fusion Simulation Project: Where to Find 12 Orders in 10 Years? (Hardware: 3 orders, Software: 9 orders) • 1.5 orders: increased processor speed and efficiency • 1.5 orders: increased concurrency • 1 order: higher-order discretizations • Same accuracy can be achieved with many fewer elements • 1 order: flux-surface-following gridding • Less resolution required along than across field lines • 4 orders: adaptive gridding • Zones requiring refinement are <1% of ITER volume and resolution requirements away from them are ~10^2 less severe • 3 orders: implicit solvers • Mode growth time is 9 orders longer than the Alfvén-limited CFL
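
A trivial tally confirming that the itemized gains on this slide sum to the advertised 12 orders of magnitude, split 3 (hardware) and 9 (software and algorithms).

```python
# Tally the orders of magnitude listed on the slide above.
hardware = [1.5, 1.5]                # processor speed/efficiency, concurrency
software = [1.0, 1.0, 4.0, 3.0]      # high order, flux-surface grids, AMR, implicit
print(sum(hardware), sum(software), sum(hardware) + sum(software))   # 3.0 9.0 12.0
```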

  28. A View from Berkeley (John Shalf)* • Need better benchmarks and better performance models • For reliable extrapolated code requirements • Power is driving daunting concurrency • Scalable programming models • Need to exploit hierarchical machine architecture • Hybrid processors • More concurrency; need a more generalized approach • Apps must deal with platform reliability • Don’t forget autotuning (a toy block-size search is sketched below) • Shows the value of good compilers and associated R&D • Fast, robust I/O is hard • Scaling and concurrency are outstripping our ability to do rigorous V&V • Application code complexity has outgrown available tools • Frameworks and community codes can work, but with certain “rules of engagement” • *ASCAC Fusion Simulation Project Review panel presentation (4/30/08)
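
A toy illustration of the autotuning point above: empirically choosing a block size for a simple array update by timing candidates on the target machine. The kernel and the candidate block sizes are assumptions, not drawn from any real autotuner.

```python
# Pick the fastest block size for a blocked array update by measurement.
import time
import numpy as np

a = np.random.rand(4_000_000)

def blocked_scale(a, block):
    for start in range(0, a.size, block):
        a[start:start + block] *= 1.0000001
    return a

best = None
for block in (1_024, 8_192, 65_536, 524_288):
    t0 = time.perf_counter()
    blocked_scale(a, block)
    dt = time.perf_counter() - t0
    if best is None or dt < best[1]:
        best = (block, dt)
print(f"selected block size: {best[0]} ({best[1] * 1e3:.1f} ms)")
```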

  29. Questions? Doug Kothe (kothe@ornl.gov)
