David E. Keyes Center for Computational Science Old Dominion University

Algorithms and Software for Terascale Computation of PDEs and PDE-constrained Optimization David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering (ICASE) NASA Langley Research Center Institute for Scientific Computing Research (ISCR) Lawrence Livermore National Laboratory

Plan of presentation • Imperative of “optimal” algorithms for terascale computing • Basic domain decomposition and multilevel algorithmic concepts • Illustrations of solver performance on ASCI platforms (with a little help from my friends…) • Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE • Conclusions

Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Environment global climate contaminant transport Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineeringcrash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Applied Physics radiation transport supernovae Experiments dangerous Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineering crash testing aerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Experiments expensive Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.

Engineeringcrash testingaerodynamics Lasers & Energycombustion ICF Biology drug design genomics Terascale simulation has been “sold” Experiments prohibited or impossible Applied Physics radiation transport supernovae Experiments dangerous Experiments difficult to instrument Environment global climate contaminant transport Experiments controversial Experiments expensive Scientific Simulation However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

Plan Develop Use Large platforms have been provided 100+ Tflop / 30 TB Livermore 50+ Tflop / 25 TB 7.2 Tflop/s LINPACK 30+ Tflop / 10 TB Capability 10+ Tflop / 4 TB White 3+ Tflop / 1.5 TB Blue Livermore Red 1+ Tflop / 0.5 TB ‘97 ‘98 ‘99 ‘00 ‘01 ‘02 ‘03 ‘04 ‘05 ‘06 Time (CY) Sandia Los Alamos ASCI program of the U.S. DOE has roadmap to go to 100 Tflop/s by 2006 www.llnl.gov/asci/platforms

Caltech Argonne NCSA/PACI 8 TF 240 TB SDSC 4.1 TF 225 TB NSF’s 13.6 TF TeraGrid coming on line TeraGrid: NCSA, SDSC, Caltech, Argonne www.teragrid.org Site Resources Site Resources 26 HPSS HPSS 4 24 External Networks External Networks 8 5 40Gb/s External Networks External Networks Site Resources Site Resources HPSS UniTree c/o I. Foster

Earth Simulator Bird’s-eye View of the Earth Simulator System Disks Cartridge Tape Library System Processor Node (PN) Cabinets 35.6 Tflop/s LINPACK Interconnection Network (IN) Cabinets Air Conditioning System 65m Power Supply System 50m Double Floor for IN Cables

Cross-section of Earth Simulator Building Lightning protection system Air-conditioning return duct Double floor for IN cables and air-conditioning Power supply system Air-conditioning system Seismic isolation system

Earth Simulator Power plant Building for computer system Building for operation and research

Building platforms is the “easy” part • Algorithms must be • highly concurrent and straightforward to load balance • latency tolerant • cache friendly (temporal and spatial locality of reference) • highly scalable (in the sense of convergence) • Goal for algorithmic scalability: fill up memory of arbitrarily large machines while preserving constant* running times with respect to proportionally smaller problem on one processor • Domain-decomposed multilevel methods “natural” for all of these • Domain decomposition also “natural” for software engineering * or logarithmically growing

Algorithmic requirementsfrom architecture • Must run on physically distributed memory units connected by message-passing network, each serving one or more processors with multiple levels of cache “horizontal” aspects “vertical” aspects T3E message passing, shared memory threads register blocking, cache blocking, prefetching

200 unscalable 150 Steel/rubber composite Parallel multigrid c/o M. Adams, Berkeley-Sandia 100 Time to Solution 50 scalable 0 1000 1 10 100 Problem Size (increasing with number of processors) The solver is a key part, but not the only part, of the simulation that needs to be scalable Keyword: “Optimal” • Convergence rate nearly independent of discretization parameters • Multilevel schemes for rapid linear convergence of linear problems • Newton-like schemes for quadratic convergence of nonlinear problems • Convergence rate as independent as possible of physical parameters • Continuation schemes • Physics-based preconditioning

Why Optimal Algorithms? • The more powerful the computer, the greater the importance of optimality • Example: • Suppose Alg1 solves a problem in time CN2, where N is the input size • Suppose Alg2 solves the same problem in time CN • Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors • In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger

Why Optimal?, cont. • Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time • Is 10,000 processors a reasonable expectation? • Yes, we have it today (ASCI White)! • Could computational scientists really use 10,000X? • Of course; we are approximating the continuum • A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface • In 2D 10,000X disappears fast; in 3D even faster • However, these machines are expensive (ASCI White is $100M, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them

Consider, e.g., the implicitly discretized parabolic case Decomposition strategies for Lu=f in  • Operator decomposition • Function space decomposition • Domain decomposition

Operator decomposition • Consider ADI • Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep • two sparse matrix-vector multiplies • two sets of unidirectional bandsolves • Parallelism within each substep • But global data exchanges between bandsolve substeps

System of ordinary differential equations • Perhaps are diagonal matrices • Perfect parallelism across spectral index • But global data exchanges to transformback to physical variables at each step Function space decomposition • Consider a spectral Galerkin method

= Domain decomposition • Consider restriction and extension operators for subdomains, , and for possible coarse grid, • Replace discretized with • Solve by a Krylov method, e.g., CG • Matrix-vector multiplies with • parallelism on each subdomain • nearest-neighbor exchanges, global reductions • possible small global system (not needed for parabolic case)

Comparison • Operator decomposition (ADI) • natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) • Function space decomposition (Fourier) • natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) • Domain decomposition (Schwarz) • natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem

Theoretical scaling of domain decomposition(for three common network topologies) • With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects: • optimal number of processors scales linearlywith problem size (“scalable”, assumes one subdomain per processor) • With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects: • optimal number of processors scales as three-fourths power of problem size (“almost scalable”) • With linear-time (common bus) network: • optimal number of processors scales as one-fourth power of problem size (*not* scalable) • bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node!

} the meat of the talk Some “Advanced” Concepts • Polynomial combinations of Schwarz projections • Schwarz-Schur combinations • Schwarz on Schur-reduced system • Schwarz inside Schur-reduced system • Nonlinear Schwarz new! Three Basic Concepts • Iterative correction • Schwarz preconditioning • Schur preconditioning

Iterative correction • The most basic idea in iterative methods • Evaluate residual accurately, but solve approximately, where is an approximate inverse to A • A sequence of complementary solves can be used, e.g., with first and then one has • Optimal polynomials of lead to various preconditioned Krylov methods • Scale recurrence, e.g., with , leads to multilevel methods

smoother A Multigrid V-cycle Restrictiontransfer from fine to coarse grid coarser grid has fewer cells (less work & storage) Prolongationtransfer from coarse to fine grid First Coarse Grid Recursively apply this idea until we have an easy problem to solve Multilevel Preconditioning Finest Grid

Schwarz Preconditioning • Given A x = b , partition x into subvectors, corresp. to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of • Let Boolean rectangular matrix extract the subset of : • Let The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support

Preconditioning Type in 2D in 3D Point Jacobi Ο(N1/2) Ο(N1/3) Domain Jacobi (=0) Ο((NP)1/4) Ο((NP)1/6) 1-level Additive Schwarz Ο(P1/2) Ο(P1/3) 2-level Additive Schwarz Ο(1) Ο(1) Iteration count estimates from the Schwarz theory • Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system • In terms of N and P, where for d-dimensional isotropic problems, N=h-d and P=H-d, for mesh parameter hand subdomain diameter H, iteration counts may be estimated as follows:

Schur Preconditioning • Given a partition • Condense: • Let M be a good preconditioner for S • Then is a preconditioner for A • Moreover, solves with may be done approximately if all degrees of freedom are retained

This leads to algorithm “Hybrid II” in S-B-G’96: • Convenient for “SPMD” (single prog/multiple data) Schwarz polynomials • Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations • We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)

“Balancing Neumann-Neumann” alg Schwarz-on-Schur • Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem • “Neumann-Neumann” alg • Multigrid on the Schur complement

Schwarz-inside-Schur • Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization • Constraint • Objective • Lagrangian • Form the gradient of the Lagrangian with respect to each of x, u, and :

Newton Reduced SQP solves the Schur complement system H u = g , where H is the reduced Hessian Schwarz-inside-Schur • Equality constrained optimization leads to the KKT system for states x , designs u , and multipliers  • Then

Schwarz-inside-Schur, cont. • Problems • is the Jacobian of a PDE  huge! • involve Hessians of objective and constraints  second derivatives and huge • H is unreasonable to form, store, or invert • Solutions • Use Schur preconditioning on full system • Form forward action of Hessians by automatic differentiation (vector-to-vector map) • Form approximate inverse action of state Jacobian and its transpose by Schwarz

Nonlinear Schwarz preconditioning • Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free • It replaces with a new nonlinear system possessing the same root, • Define a correction to the partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: where is nonzero only in the components of the partition • Then sum the corrections:

Nonlinear Schwarz, cont. • It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root then and have the same unique root • To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any : • The residual • The Jacobian-vector product • Remarkably, (Cai-Keyes, 2000) it can be shown that where and • All required actions are available in terms of !

Stagnation beyond critical Re Difficulty at critical Re Convergence for all Re Additive Schwarz Preconditioned Inexact Newton (ASPIN) Newton’s method Experimental example of nonlinear Schwarz

Then and — the Schwarz formula! “Unreasonable effectiveness” of Schwarz • When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! Let be a complete set of orthonormal row eigenvectors for A : or • Good decompositions are a compromise between conditioning and parallel complexity, in practice

Some anecdotal successes • Newton-Krylov-Schwarz Computational aerodynamics (Anderson et al., NASA, ODU, ANL) • FETI (Schwarz on Schur) Computational structural dynamics (Farhat & Pierson, CU-Boulder) • LNKS (Schwarz inside Schur) PDE-constrained optimization (Biros, NYU & Ghattas, CMU)

Newton-Krylov-Schwarz Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994) Newton nonlinear solver asymptotically quadratic Krylov accelerator spectrally adaptive Schwarz preconditioner parallelizable

Jacobian-Free Newton-Krylov Method • In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products • These are approximated by the Fréchet derivatives so that the actual Jacobian elements are never explicitly needed, where  is chosen with a fine balance between approximation and floating point rounding error  Schwarz preconditions, using approximate elements

Computational Aerodynamics Transonic “Lambda” Shock, Mach contours on surfaces mesh c/o D. Mavriplis, ICASE Implemented in PETSc www.mcs.anl.gov/petsc

128 nodes 43min Four orders of magnitude in 13 years 3072 nodes 2.5min, 226Gf/s 11M unknowns 15µs/unknown 70% efficient Fixed-size Parallel Scaling Results This scaling study, featuring our widest range of processor number, was done for the incompressiblecase. c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith

Fixed-size Parallel Scaling Results on ASCI RedONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)

PDE Workingsets • Smallest: data for single stencil • Largest: data for entire subdomain • Intermediate: data for a neighborhood collection of stencils, reused as possible

Processor Clock MHz Peak Mflop/s Opt. % of Peak Opt. Mflop/s Reord. Only Mflop/s Interl. only Mflop/s Orig. Mflop/s Orig. % of Peak Factor of Five! R10000 250 500 25.4 127 74 59 26 5.2 P3 200 800 20.3 163 87 68 32 4.0 P2SC (2 card) 120 480 21.4 101 51 35 13 2.7 P2SC (4 card) 120 480 24.3 117 59 40 15 3.1 604e 332 664 9.9 66 43 31 15 2.3 Alpha 21164 450 900 8.3 75 39 32 14 1.6 Alpha 21164 600 1200 7.6 91 47 37 16 1.3 Ultra II 300 600 12.5 75 42 35 18 3.0 Ultra II 360 720 13.0 94 54 47 25 3.5 Ultra II/HPC 400 800 8.9 71 47 36 20 2.5 Pent. II/LIN 400 400 20.8 83 52 47 33 8.3 Pent. II/NT 400 400 19.5 78 49 49 31 7.8 Pent. Pro 200 200 21.0 42 27 26 16 8.0 Pent. Pro 333 333 18.8 60 40 36 21 6.3 Improvements Resulting from Locality Reordering

Traffic decreases as cache gets bigger or subdomains get smaller Cache Traffic for PDEs • As successive workingsets “drop” into a level of memory, capacity (and with effort conflict) misses disappear, leaving only compulsory, reducing demand on main memory bandwidth

FETI-DP for structural mechanics • Numerically scalable, hardware scalable solutions for realistic solid/shell models • Used in Sandia codes Salinas, Adagio, Andante 60mdof 18mdof 30mdof 9mdof 4mdof 1mdof c/o C. Farhat and K. Pierson

David E. Keyes Center for Computational Science Old Dominion University