Plan of presentation

Preconditioning Implicit Methods for Coupled Physics ProblemsDavid KeyesCenter for Computational ScienceOld Dominion University&Institute for Scientific Computing ResearchLawrence Livermore National Laboratory

Plan of presentation • Background/progress report • Fusion applications in the TOPS SciDAC portfolio • Center for Extended Magnetohydrodynamic Modeling (CEMM) • Center for Magnetic Reconnection Studies (CMRS) • Parallel nonlinearly implicit solvers in TOPS • PETSc’s domain-decomposed JFNK framework • Hypre’s multilevel preconditioners • What’s wrong with this picture? • Curse of the fast waves • Curse of the splitting errors • Curse of the “brute force” algebraic solvers • Current and future needs/plans • Physics-based preconditioning • Coupling multi-physics applications

Acknowledgments • For mentoring, application codes, and some slides, our primary SciDAC fusion collaborators to date: • Steve Jardin and Jin Chen (CEMM) • Amitava Bhattacharjee and Kai Germaschewski (CMRS) • For solver codes, new development, and some slides, our TOPS partners: • Barry Smith and the PETSc team (ANL) • Rob Falgout and the Hypre team (LLNL) • For putting it all together: • Florin Dobrian, TOPS project post-doc (ODU) • For future algorithmic philosophy: • Xiao-Chuan Cai (CU Boulder) and Dana Knoll (LANL) • For support and computational resources: • DOE SciDAC program and earlier computational math investments • NERSC and ORNL computing centers

Two fusion energy applications • Center for Magnetic Reconnection Studies (CMRS, based at Iowa) • Idealized 2D Cartesian geom., periodic BCs, simple FD discretization • Mix of primitive variables and streamfunctions • Sequential nonlinearly coupled explicit evolution, w/2 Poisson inversions on each timestep • Want from TOPS: • Now: scalable Jacobian-free Newton-Krylov nonlinearly implicit solver for higher resolution in 3D (and for AMR) • Later: physics-based preconditioning for nonlinearly implicit solver • Center for Extended MHD Modeling (CEMM, based at PPPL) • Realistic toroidal geom., unstructured mesh, hybrid FE/FD discretization • Fields expanded in scalar potentials, and streamfunctions • Operator-split, linearized, w/11 potential solves in each poloidal cross-plane/step (90% exe. time) • Parallelized w/PETSc (Tang et al., SIAM PP01, Chen et al., SIAM AN02, Jardin et al., SIAM CSE03) • Want from TOPS: • Now: scalable linear implicit solver for much higher resolution (and for AMR) • Later: fully nonlinearly implicit solvers and coupling to other codes

CEMM profile from SciDAC website Center for Extended Magnetohydrodynamic Modeling This research project will develop computer codes that will enable a realistic assessment of the mechanisms leading to disruptive and other stability limits in the present and next generation of fusion devices. With an improvement in the efficiency of codes and with the extension of the leading 3D nonlinear magneto-fluid models of hot, magnetized fusion plasmas, this research will pioneer new plasma simulations of unprecedented realism and resolution. These simulations will provide new insights into low frequency, long-wavelength non-linear dynamics in hot magnetized plasmas, some of the most critical and complex phenomena in plasma and fusion science. The underlying models will be validated through cross-code and experimental comparisons. PI: Steve JardinPrinceton Plasma Physics Lab

CMRS profile from SciDAC website Magnetic Reconnection: Applications to Sawtooth Oscillations, Error Field Induced Islands and the Dynamo Effect The research goals of this project include producing a unique high performance code and using this code to study magnetic reconnection in astrophysical plasmas, in smaller scale laboratory experiments, and in fusion devices. The modular code that will be developed will be a fully three-dimensional, compressible Hall MHD code with options to run in slab, cylindrical and toroidal geometry and flexible enough to allow change in algorithms as needed. The code will use adaptive grid refinement, will run on massively parallel computers, and will be portable and scalable. The research goals include studies that will provide increased understanding of sawtooth oscillations in tokamaks, magnetotail substorms, error-fields in tokamaks, reverse field pinch dynamos, astrophysical dynamos, and laboratory reconnection experiments. PI: Amitava BhattacharjeeUniversity of Iowa

Summary of progress on CEMM • TOPS running full-up M3D code on PPPL’s preferred NERSC “Seaborg” platform • Using up to 75 processors per poloidal cross-plane in weak scaling – fixed memory per processor (parallel concurrency can be 10-100 times this in toroidal direction), PETSc handles all aspects of the distributed data structures • Have integrated Hypre as preconditioner to PETSc and created new Hypre coarsening rules to accommodate a peculiarity of M3D’s Dirichlet boundary conditions • 1-level additive Schwarz preconditioner has excellent per-iteration implementation efficiency, but nonscalable convergence • Algebraic Multigrid preconditioner has perfect convergence scalability, but (presently) high set-up time and poor set-up scalability, and mildly degrading per-iteration implementation efficiency • PLAN: separately tune each of the 11 different Poisson solves per timestep • PLAN: work on improving Hypre solves and amortizing Hypre set-ups • PLAN (with APDEC team): incorporate AMR

Hypre’s AMG in CEMM app • M3D’s custom discretization and parallel data structures dictated by geometry and physics: periodic toroidally, unstructured poloidally, fit to flux surfaces, partitioned into 3n2 chunks poloidally times arbitrary toroidal decomposition, some mesh anisotropy • Mesh mapped and reordered using PETSc’s index sets, solved by GMRES, preconditioned with Hypre’s parallel algebraic MG solver of Ruge-Stueben type figures c/o S. Jardin, CEMM

Hypre’s AMG in CEMM app • Iteration count results below are averaged over 19 different PETSc SLESSolve calls in initialization and first timestep loop for this operator-split unsteady code, abcissa is number of procs in scaled problem; problem size ranges from 12K to 303K unknowns (approx 4K per processor) iterations size or procs

Hypre’s AMG in CEMM app, cont. • Scaled speedup timing results below are summed over 19 different PETSc SLESSolve calls in initialization and first timestep loop for this operator-split unsteady code • Much of AMG cost is coarse-grid formation (preprocessing); in production, these coarse hierarchies will be saved for reuse (same linear systems are called in each timestep loop), making AMG less expensive time size or procs

Hypre’s AMG in CEMM app, cont.

Summary of progress on CMRS • CMRS team has provided TOPS with discretization of model 2D multicomponent MHD evolution code in PETSc’s FormFunctionLocal format using DMMG and automatic differentiation for Jacobian objects • TOPS has implemented fully nonlinearly implicit GMRES-MG-ILU parallel solver with custom deflation of nullspace in CMRS’s doubly periodic formulation • CMRS and TOPS reproduce the same dynamics on the same grids with the same time-stepping, up to a finite-time singularity due to collapse of current sheet (that falls below presently uniform mesh resolution) • TOPS code, being implicit, can choose timesteps an order of magnitude larger, with potential for higher ratio in more physically realistic parameter regimes, but is still slower in wall-clock time • PLAN: tune PETSc solver by profiling, blocking, reuse, etc. • PLAN: go to higher-order in time • PLAN: identify the numerical complexity benefits from implicitness (in suppressing fast timescales) and quantify (explicit versus implicit) • PLAN (with APDEC team): incorporate AMR

Vorticity, early time Vorticity, later time ex29.c in PETSc 2.5.1 zoom 2D Hall MHDsawtooth instability (Porcelli et al., 1993, 1999) Model equations: Equilibrium: figures c/o A. Bhattacharjee, CMRS

PETSc’s DMMG in CMRS app • Mesh and time refinement studies of CMRS Hall magnetic reconnection model problem (4 mesh sizes, dt=0.1 (nondimensional, near CFL limit for fastest wave) on left, dt=0.8 on right) • Measure of functional inverse to thickness of current sheet versus time, for 0<t<200 (nondimensional), where singularity occurs around t=215

PETSc’s DMMG in CMRS app, cont. • Implicit timestep increase studies of CMRS Hall magnetic reconnection model problem, on finest (192192) mesh of previous slide, in absolute magnitude, rather than semi-log

Optimizer Sens. Analyzer Time integrator Nonlinear solver Eigensolver Linear solver Indicates dependence Scope for TOPS • Design and implementation of “solvers” • Time integrators • Nonlinear solvers • Constrained optimizers • Linear solvers • Eigensolvers • Software integration • Performance optimization (w/ sens. anal.) (w/ sens. anal.)

Jacobian-Free Newton-Krylov Method • In the Jacobian-Free Newton-Krylov (JFNK) method for F(u)=0, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products • These are approximated by the Fréchet derivatives so that the actual Jacobian elements are never explicitly needed, where  is chosen with a fine balance between approximation and floating point rounding error (or by automatic differentiation) • Build preconditioner using convenient approximations without losing Newton convergence in the outer loop

Background of PETSc Library • Developed at ANL to support research, prototyping, and production parallel solutions of operator equations in message-passing environments • Distributed data structures as fundamental objects - index sets, vectors/gridfunctions, and matrices/arrays • Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensibly • Portable, and callable from C, C++, Fortran • Uniform high-level API, with multi-layered entry • Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter) Seehttp://www.mcs.anl.gov/petsc

User Code/PETSc Interactions Main Routine Timestepping Solvers (TS) Nonlinear Solvers (SNES) Linear Solvers (SLES) PETSc PC KSP Application Initialization Function Evaluation Jacobian Evaluation Post- Processing User code PETSc code

AMG added here User Code/PETSc Interactions Main Routine Timestepping Solvers (TS) Nonlinear Solvers (SNES) Linear Solvers (SLES) PETSc PC KSP Application Initialization Function Evaluation Jacobian Evaluation Post- Processing User code PETSc code AD code

Background of Hypre Library • Developed at LLNL to support research, prototyping, and production parallel solutions of linear equations in message-passing environments • Object-oriented design similar to PETSc • Concentrates on linear problems only • Richer in preconditioners than PETSc, with focus on algebraic multigrid, including manyresearch prototypes not yet in the public release • Includes other preconditioners, including sparse approximate inverse (Parasails) and parallel ILU (Euclid) Seehttp://www.llnl.gov/CASC/hypre/

AMG Framework error damped by pointwise relaxation Choose coarse grids, transfer operators, etc. to eliminate Algebraic multigrid background • AMG assumes only information about the underlying matrix structure • AMG automatically forms coarse operators, based on problem data, using heuristics • There are many AMG algorithms, and the area is under active theoretical and experimental development algebraically smooth error Recur, as in geometric MG

64K DOFs 200M DOFs Sample of Hypre’s scaled efficiency

What’s wrong with this picture? • Curse of the fast waves • MHD phenomena include waves at many different timescales, some of which (e.g., Whistler, Alfvén) are not dynamically relevant to some simulations but suppress the timestep of explicit methods due to instability • Curse of the splitting errors • Traditional operator-split approaches to advancing evolutionary PDEs incur first-order-in-time splitting errors that suppress the timestep of semi-implicit methods due to inaccuracy • Curse of the “brute force” algebraic solvers • Fully implicit methods to overcome timestep limitations demand powerful solvers, whose cost per iteration for large timesteps may take back any savings they provide by allowing larger timesteps for a given accuracy

What do we need? • To overcome the algebraic complexity of implicit solvers: • Knowledge of the physics already present in traditional operator-split approaches • To begin to couple existing large-scale nonlinear physics codes: • An algebraic framework like Schwarz methods for linear problems, which shows how to use conveniently obtained solutions to subproblems of the new fully coupled problem (e.g., the solver procedures in the original component codes) to arrive at the solution of the full problem, faster and more reliably than by nonlinear Picard iteration (“successive substitutions”) alone • TOPS is addressing these near-downstream issues: • Physics-based preconditioning (work w/ D. Knoll’s LANL group) • Nonlinear Schwarz (work w/ X.-C. Cai’s Boulder group (part of TOPS)) • Nonlinear solver interface (work w/ SciDAC CCA project)

Philosophy of Jacobian-free NK • To evaluate the linear residual, we use the true F’(u) , giving a true Newton step and asymptotic quadratic Newton convergence • To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations: • combinations of operator-split Jacobians (for reasons of physics or reasons of numerics) • Jacobian of related discretization (for “fast” solves) • Jacobian of lower-order discretization (for more stability, less storage) • Jacobian with “lagged” values for expensive terms (for less computation per degree of freedom) • Jacobian stored in lower precision (for less memory traffic per preconditioning step) • Jacobian blocks decomposed for parallelism

Philosophy of Jacobian-free NK, cont. • These motivations are not new; most large-scale application codes also take “short cuts” on the approximate Jacobian operator to be inverted – showing physical intuition • The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian • This leads to a weakly nonlinearly converging “defect correction method” • Defect correction: • in contrast to preconditioned Newton:

Physics-based preconditioning:example of 1D Shallow Water Equations • Continuity • Momentum • Gravity wave speed • Typically , but stability restrictions require timesteps based on the CFL criterion for the fastest wave, for an explicit method • One can solve fully implicitly, or one can filter out the gravity wave by solving semi-implicitly

Physics-based preconditioning, cont. • Continuity (*) • Momentum (**) • Solving (**) for and substituting into (*), where

Physics-based preconditioning, cont. • After the parabolic equation is spatially discretized and solved for , then can be found from • One scalar parabolic solve and one scalar explicit update replace an implicit hyperbolic system • Similar tricks are employed in aerodynamics (sound waves), MHD (Alfvén waves), etc. • Temporal truncation error remains in (**), due to lagged advection, but in “delta” form it makes a great preconditioner • Mousseau et al. for gravity waves (generalization of this example) • Pernice et al. for acoustic waves (SIMPLE scheme) • Chacon et al. for Whistler waves in MHD (very intricate; see JCP)

Newton Outside Physics-based Preconditioning • When the shallow-water wave splitting is used as a solver, it leaves a first-order in time splitting error • In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner • The true Jacobian is never formed yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

Physics-based preconditioning • In Newton iteration, one seeks to obtain a correction (“delta”) to solution, by inverting the Jacobian matrix on (the negative of) the nonlinear residual: • A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps • This implicitly defined mapping from residual to “delta” is a naturalpreconditioner • TOPS Software must (and will) accommodate this!

1D Shallow water preconditioning • Define continuity residual for each timestep: • Define momentum residual for each timestep: • Continuity delta-form (*): • Momentum delta form (**):

1D Shallow water preconditioning, cont. • Solving (**) for and substituting into (*), • After this parabolic equation is solved for  , we have • This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep • Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used

Operator-split preconditioning • Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately • Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain • Suppose and a preconditioner is to be constructed, where and are each “easy” to invert • Form a preconditioned vector from as follows: • Equivalent to replacing with • First-order splitting error, yet often used as a solver!

Operator-split preconditioning, cont. • Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions • On a small regular 2D grid with a five-point stencil: • R is trivially invertible in block diagonal form • S is invertible with one multilevel solve per field = + J S R

Operator-split preconditioning, cont. • Preconditioners assembled from just the “strong” elements of the Jacobian, alternating the source term and the diffusion term operators, are competitive in convergence rates with block-ILU on the Jacobian • particularly, since the decoupled scalar diffusion systems are amenable to simple multigrid treatment – not as trivial for the coupled system • The decoupled preconditioners store many fewer elements and significantly reduce memory bandwidth requirements and are expected to be much faster per iteration when carefully implemented • See “alternative block factorization” by Bank et al.; incorporated into SciDAC TSI solver by D’Azevedo

Nonlinear Schwarz preconditioning • Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free • It replaces with a new nonlinear system possessing the same root, • Define a correction to the partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: where is nonzero only in the components of the partition • Then sum the corrections:

Nonlinear Schwarz, cont. • It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root then and have the same unique root • To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any : • the residual • the Jacobian-vector product • Remarkably, (Cai-Keyes, 2000) it can be shown that where and • All required actions are available in terms of !

Stagnation beyond critical Re Difficulty at critical Re Convergence for all Re Additive Schwarz Preconditioned Inexact Newton (ASPIN) Newton’s method Example of nonlinear Schwarz

Coupling using Nonlinear Schwarz • Consider systems F1(u1 ,u2)=0 and F2(u1 ,u2)=0 • Couple together as F(u)=0 • Apply nonlinear Schwarz with nonoverlapping partitions corresponding to original systems • Then where (gives insight into coupling direction/scaling)

Abstract Gantt Chart for TOPS Each color module represents an algorithmic research idea on its way to becoming part of a supported community software tool. At any moment (vertical time slice), TOPS has work underway at multiple levels. While some codes are in applications already, they are being improved in functionality and performance as part of the TOPS research agenda. Dissemination Applications Integration Hardened Codes e.g.,PETSc Research Implementations e.g.,TOPSLib Algorithmic Development e.g., ASPIN time

Summary • Important codes in fusion MHD are bottlenecked without powerful parallel implicit solvers • These same codes often contain valuable approximate solvers being employed as the only solver, thereby limiting timestep effectiveness • Coupling such codes together will only impose further limits on the timestep – it is likely to be smaller than the smallest timestep permitted by any component code, to accommodate higher levels of coupling • Nonlinearly implicit solvers for each component can leverage existing solvers as physics-based preconditioners, if software allows, strengthening the components even prior to their integration • Nonlinear versions of Schwarz contain promise for preconditioning coupled codes • Doing physics is more fun than doing driven cavities 

For more information ... Come to poster session tonight, see Knoll & Keyes (2002) JCP preprint at http://www.math.odu.edu/~keyes and see http://www.tops-scidac.org

Plan of presentation