Presentation Transcript

  1. Algorithms for Cluster Computing Based on Domain Decomposition David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering NASA Langley Research Center Institute for Scientific Computing Research Lawrence Livermore National Laboratory

  2. Plan of Presentation • Introduction (5) • Imperative of domain decomposition and multilevel methods for terascale computing (5) • Basic and “advanced” domain decomposition and multilevel algorithmic concepts (14) • Vignettes of domain decomposition and multilevel performance (13) (with a little help from my friends…) • Agenda for future research (6) • Terascale Optimal PDE Simulations (TOPS) software project (14) • Conclusions (4)

  3. Terascale simulation has been “sold” • Application areas: Engineering (structural dynamics, electromagnetics), Lasers & Energy (combustion, ICF modeling), Chemistry & Bio (materials modeling, drug design), Applied Physics (radiation transport, hydrodynamics), Environment (global climate, groundwater flow) • Motivation for scientific simulation: experiments may be prohibited, dangerous, expensive, or controversial • However, it is far from proven! To meet expectations, we need to handle problems of multiple physical scales.

  4. Large platforms have been provided • The ASCI program of the U.S. DOE has a roadmap to reach 100 Tflop/s by 2006 (www.llnl.gov/asci/platforms) • Capability roadmap (CY ’97–’06), with machines at Sandia, Los Alamos, and Livermore: Red, 1+ Tflop / 0.5 TB; Blue (Livermore), 3+ Tflop / 1.5 TB; White (Livermore), 10+ Tflop / 4 TB; then 30+ Tflop / 10 TB, 50+ Tflop / 25 TB, and 100+ Tflop / 30 TB • Roadmap phases: Plan, Develop, Use

  5. NSF’s 13.6 TF TeraGrid coming on line • TeraGrid sites: NCSA, SDSC, Caltech, Argonne (www.teragrid.org) • NCSA/PACI: 8 TF, 240 TB; SDSC: 4.1 TF, 225 TB • Sites linked by 40 Gb/s external networks; each site couples its compute resources to archival storage (HPSS, UniTree)

  6. Algorithmic requirements from architecture • Must run on physically distributed memory units connected by a message-passing network (e.g., Cray T3E), each serving one or more processors with multiple levels of cache • “Horizontal” aspects: message passing, shared-memory threads • “Vertical” aspects: register blocking, cache blocking, prefetching

  7. Building platforms is the “easy” part • Algorithms must be • highly concurrent and straightforward to load balance • latency tolerant • cache friendly (good temporal and spatial locality) • highly scalable (in the sense of convergence) • Domain decomposed multilevel methods “natural” for all of these • Domain decomposition also “natural” for software engineering • Fortunate that mathematicians built up its theory in advance of requirements!

  8. Decomposition strategies for Lu = f in Ω • Consider the implicitly discretized parabolic case • Operator decomposition • Function space decomposition • Domain decomposition

  9. Operator decomposition • Consider ADI • Iteration matrix consists of four multiplicative substeps per timestep • two sparse matrix-vector multiplies • two sets of unidirectional bandsolves • Parallelism within each substep • But global data exchanges between bandsolve substeps
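To make the substep structure concrete, the following minimal sketch (not part of the original talk) advances one Peaceman-Rachford ADI step for the 2D heat equation on a uniform grid: the explicit applications of the transverse difference operator are the sparse matrix-vector multiplies, and the two tridiagonal solves are the unidirectional bandsolves. Grid size, timestep, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_banded

# Peaceman-Rachford ADI sketch for u_t = u_xx + u_yy on the unit square,
# homogeneous Dirichlet boundary conditions; all parameters illustrative.
n, dt = 64, 1.0e-3
h = 1.0 / (n + 1)
r = dt / (2.0 * h * h)

def second_diff(u, axis):
    """Apply the 1D second-difference operator along one axis (a sparse mat-vec)."""
    d2 = -2.0 * u + np.roll(u, 1, axis=axis) + np.roll(u, -1, axis=axis)
    if axis == 0:                       # undo the periodic wrap-around (Dirichlet BCs)
        d2[0, :] -= u[-1, :]; d2[-1, :] -= u[0, :]
    else:
        d2[:, 0] -= u[:, -1]; d2[:, -1] -= u[:, 0]
    return d2 / (h * h)

# banded storage of the tridiagonal factor (I - (dt/2) * D2) for solve_banded
ab = np.zeros((3, n))
ab[0, 1:] = -r; ab[1, :] = 1.0 + 2.0 * r; ab[2, :-1] = -r

def adi_step(u):
    # substep 1: implicit in x, explicit in y (bandsolves down the columns)
    rhs = u + (dt / 2.0) * second_diff(u, axis=1)
    ustar = solve_banded((1, 1), ab, rhs)
    # substep 2: implicit in y, explicit in x (bandsolves across the rows)
    rhs = ustar + (dt / 2.0) * second_diff(ustar, axis=0)
    return solve_banded((1, 1), ab, rhs.T).T

u = np.random.rand(n, n)
for _ in range(10):
    u = adi_step(u)
```

Note how the two bandsolve substeps sweep in different directions over the same data, which is the source of the global (transpose-like) data exchange in a row-distributed parallel implementation.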

  10. Function space decomposition • Consider a spectral Galerkin method • It yields a system of ordinary differential equations for the expansion coefficients • Perhaps the resulting (mass and stiffness) matrices are diagonal • Perfect parallelism across the spectral index • But global data exchanges are needed to transform back to physical variables at each step
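A minimal illustration of this point, assuming a 1D periodic heat equation and a Fourier basis (both are stand-ins, not from the talk): in spectral space each coefficient evolves independently, so the update is perfectly parallel across the spectral index, while the FFT and inverse FFT that move between spectral and physical variables are the global, all-to-all exchanges.

```python
import numpy as np

# Fourier-Galerkin sketch for u_t = u_xx on a periodic interval.
# Each mode evolves independently; the transforms are the global exchanges.
n, L, dt = 128, 2 * np.pi, 1.0e-3
x = np.linspace(0.0, L, n, endpoint=False)
k = np.fft.fftfreq(n, d=L / n) * 2 * np.pi     # wavenumbers

u = np.exp(np.sin(x))                          # illustrative initial condition
uhat = np.fft.fft(u)                           # global transform (all-to-all)
for _ in range(100):
    uhat *= np.exp(-(k ** 2) * dt)             # diagonal update, mode by mode
u_new = np.real(np.fft.ifft(uhat))             # back to physical variables
```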

  11. Domain decomposition • Consider restriction and extension operators for the subdomains, R_i and R_i^T, and for a possible coarse grid, R_0 and R_0^T • Replace the discretized system Au = f with the preconditioned system B^{-1}Au = B^{-1}f, where B^{-1} = Σ_i R_i^T A_i^{-1} R_i (optionally augmented with a coarse-grid term R_0^T A_0^{-1} R_0) • Solve by a Krylov method • Matrix-vector multiplies with B^{-1}A require • parallelism on each subdomain • nearest-neighbor exchanges, global reductions • possible small global system (not needed for parabolic case)

  12. Comparison • Operator decomposition (ADI) • natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) • Function space decomposition (Fourier) • Natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) • Domain decomposition (Schwarz) • Natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem (Of course, domain decomposition can be interpreted as a special operator or function space decomposition)

  13. Theoretical Scaling of Domain Decomposition for Three Common Network Topologies • With tree-based (logarithmic) global reductions and scalable nearest neighbor hardware: • optimal number of processors scales linearly with problem size (scalable, assumes one subdomain per processor) • With 3D torus-based global reductions and scalable nearest neighbor hardware: • optimal number of processors scales as three-fourths power of problem size (almost scalable) • With common bus network (heavy contention): • optimal number of processors scales as one-fourth power of problem size (not scalable) • bad news for conventional Beowulf clusters, but see 2000 Bell Prize “price-performance awards” using multiple NICs per Beowulf node!

  14. Basic Concepts • Iterative correction • Schwarz preconditioning • Schur preconditioning “Advanced” Concepts • Polynomial combinations of Schwarz projections • Schwarz-Schur combinations • Neumann-Neumann/FETI (Schwarz on Schur) • LNKS (Schwarz inside Schur) • Nonlinear Schwarz

  15. Iterative correction • The most basic idea in iterative methods: u ← u + B^{-1}(f − Au) • Evaluate the residual accurately, but solve only approximately, where B^{-1} is an approximate inverse to A • A sequence of complementary solves can be used, e.g., with B_1 first and then B_2 one has u ← u + (B_1^{-1} + B_2^{-1} − B_2^{-1}AB_1^{-1})(f − Au) • Optimal polynomials of B^{-1}A lead to various preconditioned Krylov methods • Scale recurrence, e.g., with a coarse-grid approximate inverse B_2^{-1} = R_0^T(R_0AR_0^T)^{-1}R_0, leads to multilevel methods
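A minimal sketch of the basic correction u ← u + B^{-1}(f − Au), using a Jacobi-type diagonal approximate inverse purely for illustration; the matrix, right-hand side, and tolerance are arbitrary.

```python
import numpy as np

# Iterative correction sketch: residual evaluated accurately,
# correction solved only approximately via B^{-1} = D^{-1} (Jacobi).
rng = np.random.default_rng(0)
n = 200
A = (np.diag(4.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
     + np.diag(-np.ones(n - 1), -1))
f = rng.standard_normal(n)

Binv = np.diag(1.0 / np.diag(A))     # crude approximate inverse
u = np.zeros(n)
for it in range(500):
    res = f - A @ u                  # accurate residual
    u = u + Binv @ res               # approximate solve / correction
    if np.linalg.norm(res) < 1e-8 * np.linalg.norm(f):
        break
```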

  16. Multilevel Preconditioning: a Multigrid V-cycle • Apply the smoother on the finest grid • Restriction: transfer from fine to coarse grid; the coarser grid has fewer cells (less work & storage) • Solve the correction problem on the first coarse grid • Prolongation: transfer from coarse back to fine grid • Recursively apply this idea until we have an easy problem to solve
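The following sketch spells out one two-grid cycle for a 1D Poisson problem: pre-smooth, restrict the residual, solve the coarse problem, prolongate the correction, post-smooth. Weighted Jacobi, full weighting, and linear interpolation are illustrative choices, not the specific components used in the applications cited later.

```python
import numpy as np

# Two-grid "V-cycle" sketch for -u'' = f on (0,1), Dirichlet BCs.
def poisson(n):
    h = 1.0 / (n + 1)
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def smooth(A, u, f, sweeps=3, omega=2.0 / 3.0):
    D = np.diag(A)
    for _ in range(sweeps):                              # weighted-Jacobi smoother
        u = u + omega * (f - A @ u) / D
    return u

def two_grid(A, Ac, u, f):
    u = smooth(A, u, f)                                  # pre-smooth on fine grid
    r = f - A @ u
    rc = 0.25 * (r[0:-2:2] + 2 * r[1:-1:2] + r[2::2])    # full-weighting restriction
    ec = np.linalg.solve(Ac, rc)                         # coarse-grid solve
    e = np.zeros_like(u)                                 # linear-interpolation prolongation
    e[1:-1:2] = ec
    e[0:-2:2] += 0.5 * ec
    e[2::2] += 0.5 * ec
    return smooth(A, u + e, f)                           # post-smooth on fine grid

nf = 63                                                  # fine grid (2^k - 1 points)
A, Ac = poisson(nf), poisson((nf - 1) // 2)
f, u = np.ones(nf), np.zeros(nf)
for _ in range(10):
    u = two_grid(A, Ac, u, f)
```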

  17. Schwarz Preconditioning • Given Ax = b, partition x into subvectors x_i, corresponding to subdomains Ω_i of the domain Ω of the PDE, nonempty, possibly overlapping, whose union is all of the elements of x • Let the Boolean rectangular matrix R_i extract the i-th subset of x: x_i = R_i x • Let A_i = R_i A R_i^T • The Boolean matrices R_i and R_i^T are gather/scatter operators, mapping between a global vector and its subdomain support
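Using the gather/scatter operators just defined, a one-level additive Schwarz preconditioner is B^{-1} = Σ_i R_i^T A_i^{-1} R_i. The sketch below realizes the R_i as overlapping index sets for a 1D model matrix and uses the result as a preconditioner inside a Krylov (CG) solve; the matrix, sizes, overlap, and subdomain count are arbitrary illustrations.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import LinearOperator, cg, splu

# One-level additive Schwarz sketch: B^{-1} = sum_i R_i^T A_i^{-1} R_i,
# with each R_i realized as an overlapping index set (gather/scatter).
n = 400
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

nsub, size, overlap = 8, n // 8, 5      # illustrative partitioning
index_sets = [np.arange(max(0, i * size - overlap),
                        min(n, (i + 1) * size + overlap)) for i in range(nsub)]
local_lu = [splu(A[idx, :][:, idx].tocsc()) for idx in index_sets]

def apply_schwarz(r):
    z = np.zeros_like(r)
    for idx, lu in zip(index_sets, local_lu):
        z[idx] += lu.solve(r[idx])      # accumulate R_i^T A_i^{-1} R_i r
    return z

B = LinearOperator((n, n), matvec=apply_schwarz)
x, info = cg(A, b, M=B)                 # Krylov (CG) accelerated Schwarz
```

In a distributed-memory implementation each index set lives on its own processor, the local solves proceed concurrently, and only the overlap values and the Krylov inner products require communication.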

  18. Iteration Count Estimates from the Schwarz Theory • Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system • In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

| Preconditioning Type | in 2D | in 3D |
| --- | --- | --- |
| Point Jacobi | O(N^{1/2}) | O(N^{1/3}) |
| Domain Jacobi (δ = 0) | O((NP)^{1/4}) | O((NP)^{1/6}) |
| 1-level Additive Schwarz | O(P^{1/3}) | O(P^{1/3}) |
| 2-level Additive Schwarz | O(1) | O(1) |

  19. Schur Preconditioning • Given a partition of the unknowns into interior (i) and interface (Γ) blocks, A = [A_ii, A_iΓ; A_Γi, A_ΓΓ] • Condense onto the interface: S = A_ΓΓ − A_Γi A_ii^{-1} A_iΓ, g = f_Γ − A_Γi A_ii^{-1} f_i • Let M be a good preconditioner for S • Then the block factor [A_ii, 0; A_Γi, M] is a preconditioner for A • Moreover, solves with A_ii may be done approximately if all degrees of freedom are retained
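A small numerical sketch of the condensation step: partition the unknowns into interior and interface blocks, form the Schur complement and the condensed right-hand side, solve on the interface, and back-substitute for the interior unknowns. The test matrix and block sizes are random and purely illustrative.

```python
import numpy as np

# Schur-complement sketch: condense onto the interface (g) block,
# then recover the interior (i) block by back-substitution.
rng = np.random.default_rng(1)
ni, ng = 50, 10
n = ni + ng
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
f = rng.standard_normal(n)

Aii, Aig = A[:ni, :ni], A[:ni, ni:]
Agi, Agg = A[ni:, :ni], A[ni:, ni:]
fi, fg = f[:ni], f[ni:]

S = Agg - Agi @ np.linalg.solve(Aii, Aig)         # Schur complement
g = fg - Agi @ np.linalg.solve(Aii, fi)           # condensed right-hand side

xg = np.linalg.solve(S, g)                        # interface solve
xi = np.linalg.solve(Aii, fi - Aig @ xg)          # back-substitute for interiors

assert np.allclose(np.concatenate([xi, xg]), np.linalg.solve(A, f))
```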

  20. Schwarz polynomials • Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations • We may solve the fine subdomains concurrently (additively) and follow with a coarse grid correction (redundantly/cooperatively) • This leads to “Hybrid II” in Smith, Bjorstad & Gropp
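One possible realization of such a polynomial (an assumption of this sketch, not necessarily the exact Hybrid II formula of Smith, Bjorstad & Gropp) is to apply the subdomain solves additively and then a coarse-grid correction multiplicatively on the updated residual, i.e., B^{-1} = B_f^{-1} + R_0^T A_0^{-1} R_0 (I − A B_f^{-1}). The coarse space below is a simple aggregation (piecewise-constant) space chosen only for illustration.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Hybrid Schwarz polynomial sketch: additive subdomain solves followed
# multiplicatively by a coarse correction.  All choices are illustrative.
n, nsub = 64, 8
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
blocks = np.array_split(np.arange(n), nsub)

# illustrative coarse space: piecewise-constant over the subdomains
R0 = np.zeros((nsub, n))
for j, idx in enumerate(blocks):
    R0[j, idx] = 1.0
A0 = R0 @ A @ R0.T

def apply_fine(r):                      # concurrent (additive) subdomain solves
    z = np.zeros_like(r)
    for idx in blocks:
        z[idx] = np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return z

def apply_hybrid(r):                    # B_f^{-1} r + R0^T A0^{-1} R0 (r - A B_f^{-1} r)
    z = apply_fine(r)
    return z + R0.T @ np.linalg.solve(A0, R0 @ (r - A @ z))

M = LinearOperator((n, n), matvec=apply_hybrid)
u, info = gmres(A, np.ones(n), M=M)
```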

  21. Schwarz-on-Schur • Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem • Neumann-Neumann: a weighted sum of (approximate) inverses of local Schur complements • Balancing Neumann-Neumann: adds a coarse-level correction to the Neumann-Neumann preconditioner • Multigrid on the Schur complement

  22. Schwarz-inside-Schur • Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers λ • The KKT matrix couples Hessian (W) blocks of the Lagrangian with the state and design Jacobian blocks J_x and J_u of the constraints • Then Newton Reduced SQP solves the Schur complement system H u = g, where H is the reduced Hessian

  23. Schwarz-inside-Schur, cont. • Problems • J_x is the Jacobian of a PDE ⇒ huge! • The W blocks involve Hessians of the objective and constraints ⇒ second derivatives, and also huge • H is unreasonable to form, store, or invert • Solutions • Use Schur preconditioning on the full system • Form the forward action of the Hessians by automatic differentiation • Form the approximate inverse action of the state Jacobian and its transpose by Schwarz

  24. Nonlinear Schwarz Preconditioning • Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free • It replaces F(u) = 0 with a new nonlinear system possessing the same root, ℱ(u) = 0 • Define a correction δ_i(u) to the i-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: R_i F(u + δ_i(u)) = 0, where δ_i(u) is nonzero only in the components of the i-th partition • Then sum the corrections: ℱ(u) ≡ Σ_i δ_i(u)

  25. Nonlinear Schwarz, cont. • It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then F(u) = 0 and ℱ(u) = 0 have the same unique root • To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any u, v: • The residual ℱ(u) = Σ_i δ_i(u) • The Jacobian-vector product ℱ′(u) v • Remarkably (Cai-Keyes, 2000), it can be shown that ℱ′(u) is well approximated by the Schwarz-preconditioned Jacobian Σ_i R_i^T J_i^{-1} R_i J, where J = F′(u) and J_i = R_i J R_i^T • All required actions are available in terms of F!
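A minimal sketch of the nonlinearly preconditioned residual for a toy 1D problem: each partition contributes a correction obtained from a local nonlinear solve, and the corrections are summed. The discretized equation, the partitioning, and the use of a generic nonlinear solver (scipy's fsolve) are all illustrative stand-ins for the subdomain Newton solves used in ASPIN.

```python
import numpy as np
from scipy.optimize import fsolve

# Nonlinear Schwarz (ASPIN-style) residual sketch for -u'' + u^3 = 1 on (0,1).
n, nsub = 32, 4
h = 1.0 / (n + 1)

def F(u):
    """Nonlinear residual, homogeneous Dirichlet boundary conditions."""
    up = np.concatenate(([0.0], u, [0.0]))
    return -(up[2:] - 2 * up[1:-1] + up[:-2]) / h**2 + u**3 - 1.0

partitions = np.array_split(np.arange(n), nsub)

def nonlinear_schwarz_residual(u):
    calF = np.zeros_like(u)
    for idx in partitions:
        def local(d_local):
            d = np.zeros_like(u)
            d[idx] = d_local             # correction nonzero only on this partition
            return F(u + d)[idx]         # local system R_i F(u + delta_i) = 0
        calF[idx] = fsolve(local, np.zeros(len(idx)))
    return calF

u = np.zeros(n)
print(np.linalg.norm(nonlinear_schwarz_residual(u)))
```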

  26. Experimental Example of Nonlinear Schwarz • Newton’s method: difficulty at the critical Re, stagnation beyond the critical Re • Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re

  27. “Unreasonable effectiveness” of Schwarz • When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! • Let {r_i} be a complete set of orthonormal row eigenvectors for A: r_i A = a_i r_i, or A = Σ_i r_i^T a_i r_i • Then take x_i = r_i x and A_i = r_i A r_i^T = a_i, so that B^{-1} = Σ_i r_i^T a_i^{-1} r_i = Σ_i r_i^T (r_i A r_i^T)^{-1} r_i = A^{-1} — the Schwarz formula! • Good decompositions are a compromise between conditioning and parallel complexity, in practice
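The claim is easy to check numerically: if the "subdomains" are taken to be the spans of the orthonormal eigenvectors of a symmetric A, the additive Schwarz sum of partial inverses reproduces A^{-1} exactly. The test matrix below is random and illustrative.

```python
import numpy as np

# Check of the "Schwarz formula": with r_i the orthonormal row eigenvectors
# of a symmetric A, sum_i r_i^T (r_i A r_i^T)^{-1} r_i equals A^{-1}.
rng = np.random.default_rng(2)
n = 20
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(rng.uniform(1.0, 10.0, n)) @ Q.T   # SPD test matrix

eigvals, eigvecs = np.linalg.eigh(A)
Binv = np.zeros((n, n))
for i in range(n):
    r = eigvecs[:, i][None, :]                     # 1 x n row "restriction"
    Ai = r @ A @ r.T                               # 1 x 1 local matrix
    Binv += r.T @ np.linalg.inv(Ai) @ r            # r_i^T A_i^{-1} r_i

assert np.allclose(Binv, np.linalg.inv(A))
```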

  28. Some anecdotal successes • Newton-Krylov-Schwarz • Computational aerodynamics (Anderson et al., NASA, ODU, ANL) • Free convection (Shadid & Tuminaro, Sandia) • Pseudo-spectral Schwarz • Incompressible flow (Fischer & Tufo, ANL) • FETI • Computational structural dynamics (Farhat, CU-Boulder & Pierson, Sandia) • LNKS • PDE-constrained optimization (Biros, NYU & Ghattas, CMU)

  29. Newton-Krylov-Schwarz • Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994) • Newton: nonlinear solver, asymptotically quadratic • Krylov: accelerator, spectrally adaptive • Schwarz: preconditioner, parallelizable

  30. Jacobian-Free Newton-Krylov Method • In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products Jv • These are approximated by the Fréchet derivative Jv ≈ [F(u + εv) − F(u)] / ε, so that the actual Jacobian elements are never explicitly needed; ε is chosen with a fine balance between approximation and floating-point rounding error • Schwarz preconditions, using approximate elements
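A minimal JFNK sketch, assuming an illustrative 1D nonlinear test problem and no preconditioner: the outer loop is Newton, the inner solve is GMRES on a matrix-free operator whose action is the finite-difference Fréchet approximation, with one conventional choice of the differencing parameter ε; globalization (line search, trust region) is omitted.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Jacobian-free Newton-Krylov sketch: GMRES solves J du = -F with
# J v approximated by a finite-difference Frechet derivative.
def F(u):
    n = len(u); h = 1.0 / (n + 1)
    up = np.concatenate(([0.0], u, [0.0]))
    return -(up[2:] - 2 * up[1:-1] + up[:-2]) / h**2 + np.exp(u) - 10.0

def jfnk_solve(F, u0, newton_tol=1e-8, max_newton=20):
    u = u0.copy()
    for _ in range(max_newton):
        Fu = F(u)
        if np.linalg.norm(Fu) < newton_tol:
            break
        def jv(v):
            # eps balances finite-difference truncation against rounding error
            eps = (np.sqrt(np.finfo(float).eps) * (1.0 + np.linalg.norm(u))
                   / max(np.linalg.norm(v), 1e-30))
            return (F(u + eps * v) - Fu) / eps
        J = LinearOperator((len(u), len(u)), matvec=jv)
        du, _ = gmres(J, -Fu)            # inner Krylov solve, Jacobian never formed
        u = u + du                       # (globalization omitted in this sketch)
    return u

u = jfnk_solve(F, np.zeros(64))
```

A Schwarz preconditioner, built from approximate Jacobian elements as on the earlier slides, would be passed to the inner GMRES solve through its M argument.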

  31. Computational Aerodynamics • Transonic “Lambda” shock; Mach contours on surfaces • Mesh c/o D. Mavriplis, ICASE • Implemented in PETSc, www.mcs.anl.gov/petsc

  32. Fixed-size Parallel Scaling Results • 128 nodes: 43 min; 3072 nodes: 2.5 min at 226 Gflop/s • 11M unknowns, 15 µs/unknown, 70% efficient • Four orders of magnitude in 13 years • This scaling study, featuring our widest range of processor number, was done for the incompressible case • c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith

  33. Fixed-size Parallel Scaling Results on ASCI Red • ONERA M6 wing test case, tetrahedral grid of 2.8 million vertices, on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)

  34. PDE Workingsets • Smallest: data for single stencil • Largest: data for entire subdomain • Intermediate: data for a neighborhood collection of stencils, reused as possible

  35. Improvements Resulting from Locality Reordering (a factor of five!)

| Processor | Clock (MHz) | Peak (Mflop/s) | Opt. % of Peak | Opt. (Mflop/s) | Reord. only (Mflop/s) | Interl. only (Mflop/s) | Orig. (Mflop/s) | Orig. % of Peak |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R10000 | 250 | 500 | 25.4 | 127 | 74 | 59 | 26 | 5.2 |
| P3 | 200 | 800 | 20.3 | 163 | 87 | 68 | 32 | 4.0 |
| P2SC (2 card) | 120 | 480 | 21.4 | 101 | 51 | 35 | 13 | 2.7 |
| P2SC (4 card) | 120 | 480 | 24.3 | 117 | 59 | 40 | 15 | 3.1 |
| 604e | 332 | 664 | 9.9 | 66 | 43 | 31 | 15 | 2.3 |
| Alpha 21164 | 450 | 900 | 8.3 | 75 | 39 | 32 | 14 | 1.6 |
| Alpha 21164 | 600 | 1200 | 7.6 | 91 | 47 | 37 | 16 | 1.3 |
| Ultra II | 300 | 600 | 12.5 | 75 | 42 | 35 | 18 | 3.0 |
| Ultra II | 360 | 720 | 13.0 | 94 | 54 | 47 | 25 | 3.5 |
| Ultra II/HPC | 400 | 800 | 8.9 | 71 | 47 | 36 | 20 | 2.5 |
| Pent. II/LIN | 400 | 400 | 20.8 | 83 | 52 | 47 | 33 | 8.3 |
| Pent. II/NT | 400 | 400 | 19.5 | 78 | 49 | 49 | 31 | 7.8 |
| Pent. Pro | 200 | 200 | 21.0 | 42 | 27 | 26 | 16 | 8.0 |
| Pent. Pro | 333 | 333 | 18.8 | 60 | 40 | 36 | 21 | 6.3 |

  36. Cache Traffic for PDEs • As successive working sets “drop” into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses, reducing demand on main memory bandwidth • Traffic decreases as cache gets bigger or subdomains get smaller

  37. Transport Modeling • MPSalsa/Aztec Newton-Krylov-Schwarz solver, www.cs.sandia.gov/CRF/MPSalsa • Thermal convection problem (Ra = 1000), 3D results on 1 to 512 processors • Average iterations per Newton step vs. total unknowns: N^0.45 for 1-level DD, N^0.24 for 2-level DD with approximate coarse solve, N^0 for 2-level DD with exact coarse solve • Figure: temperature iso-lines on slice plane, velocity iso-surfaces and streamlines in 3D • Newton-Krylov solver with Aztec non-restarted GMRES, 1-level domain decomposition preconditioner with ILUT subdomain solver, and ML 2-level DD with Gauss-Seidel subdomain solver • Coarse solver: “Exact” = SuperLU (1 proc), “Approx” = one step of ILU (8 proc. in parallel) • c/o J. Shadid and R. Tuminaro

  38. Incompressible Flow • Nek5000: unstructured spectral element code • unsteady Navier-Stokes solver • high-order tensor-product polynomials in space (N ~ 5-15) • high-order operator splitting in time • two-level overlapping additive Schwarz for pressure • spectral elements taken as subdomains • fast local solves (tensor-product diagonalization) • fast coarse-grid solver: sparse exact factorization yields x_0 = A_0^{-1} b = X X^T b as a parallel matrix-vector product • low communication; scales to 10,000 processors • ASCI Red scaling to P = 4096, 376 Gflop/s; 27 M gridpoint Navier-Stokes calculation; dual MPI multithreading • Figures: transition near roughness element; transition in arterio-venous graft, Re_v = 2700 • c/o P. Fischer and H. Tufo, www.mcs.anl.gov/appliedmath/Flow/cfd.html

  39. FETI-DP for structural mechanics • Numerically scalable, hardware scalable solutions for realistic solid/shell models • Used in Sandia applications Salinas, Adagio, Andante • Model sizes shown: 1 mdof, 4 mdof, 9 mdof, 18 mdof, 30 mdof, 60 mdof • c/o C. Farhat and K. Pierson

  40. PDE-constrained Optimization • Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc • Optimal control of laminar viscous flow • optimization variables are surface suction/injection • objective is minimum drag • 700,000 states; 4,000 controls • 128 Cray T3E processors • ~5 hrs for optimal solution (~1 hr for analysis) • Figures: wing tip vortices, no control (l) vs. optimal control (r); optimal boundary controls shown as velocity vectors • c/o G. Biros and O. Ghattas, www.cs.nyu.edu/~biros/veltisto/

  41. Agenda for future research • High concurrency (100,000 processors) • Asynchrony • Fault tolerance • Automated tuning • Integration of simulation with studies of sensitivity, stability, and optimization

  42. High Concurrency • Today: 10,000 processors in a single room with a tightly coupled network • DD computations scale well, when provided with • a network rich enough for parallel near-neighbor communication • fast global reductions (complexity sublinear in processor count) • Future: 100,000 processors, in a room or as part of a grid • Most phases of DD computations scale well • favorable surface-to-volume comm-to-comp ratio • However, latencies will nix frequent exact reductions • Paradigm: extrapolate data in retarded messages; correct (if necessary) when the message arrives, as in the C(p,q,j) schemes of Garbey and Tromeur-Dervout

  43. Asynchrony • Today: A priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points • Good load balance is critical to parallel scalability on 1,000 processors and more • Future: Adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points • Need algorithms with looser outer loops than global Newton-Krylov • Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, like inexact Newton? • Paradigm: nonlinear Schwarz with regional (not global) nonlinear solvers, where most execution time is spent

  44. Fault Tolerance • Today: Fault tolerance is not a driver in most scientific application code projects • FT handled as follows: • Detection of wrong: system (in hardware), framework (by runtime environment), library (in math or comm library) • Notification of application: interrupt (signal sent to job), error code returned by app process • Recovery: restart from checkpoint, migration of task to new hardware, reassignment of work to remaining tasks • Future: With 100,000 processors or worldwide networks, MTBF will be in minutes • Checkpoint-restart could take longer than the time to the next failure • Paradigm: naturally fault-tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL • c/o A. Geist

  45. Automated Tuning • Today: Knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: • problem size/processor ratio • outer solver type • Krylov solver type • DD preconditioner type • maximum subspace dimensions • overlaps • fill levels • inner tolerances • potentially many others • Future: Less knowledgeable users required to employ parallel iterative solvers in taxing applications • Need safe defaults and automated tuning strategies • Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as the objective function and algorithm tuning parameters as design variables, to tune the solver in preproduction trial executions

  46. Integrated Software • Today: Each analysis is a “special effort” • Optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare • Future: Analysis increasingly an “inner loop” around which more sophisticated science-driven tasks are wrapped • Need PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) exposed to optimization/sensitivity/stability algorithms • Paradigm: integrated software based on common distributed data structures

  47. • Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups • For FY2002, 51 new projects at $57M/year total: approximately one-third for ISICs, a third for grid infrastructure and collaboratories, a third for applications groups • 5 Tflop/s IBM SP platforms “Seaborg” at NERSC (#3 in the latest “Top 500”) and “Cheetah” at ORNL (being installed now) available for SciDAC

  48. Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC Nine institutions, five years, 24 co-PIs

  49. Driven by three SciDAC applications groups • LBNL-led “21st Century Accelerator” designs • ORNL-led core collapse supernovae simulations • PPPL-led magnetic fusion energy simulations • intended for many others • TOPS: not just algorithms, but vertically integrated software suites • portable, scalable, extensible, tunable implementations • starring PETSc and hypre, among other existing packages

  50. Background of PETSc Library (in which the FUN3D example was implemented) • Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC • Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays • Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly • Portable, and callable from C, C++, Fortran • Uniform high-level API, with multi-layered entry • Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter) • See http://www.mcs.anl.gov/petsc
