Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Optimizing Krylov Subspace Methods
Jim Demmel, EECS & Math Departments, UC Berkeley
demmel@cs.berkeley.edu
Outline of Lecture 8: Optimizing Krylov Subspace Methods
• k steps of a typical iterative solver for Ax=b or Ax=λx
  • Does k SpMVs with a starting vector (e.g. with b, if solving Ax=b)
  • Finds the "best" solution among all linear combinations of these k+1 vectors
  • Many such "Krylov Subspace Methods": Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
  • Assume the matrix is "well-partitioned," with modest surface-to-volume ratio
  • Parallel implementation
    • Conventional: O(k log p) messages, because of k calls to SpMV
    • New: O(log p) messages - optimal
  • Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data - optimal
• Lots of speedup possible (modeled and measured)
  • Price: some redundant computation
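To make the baseline concrete, here is a minimal sketch (my own illustration, not code from the lecture) of the conventional "Akx" computation that k steps of such a solver perform implicitly: k back-to-back sparse matrix-vector products, each of which re-reads A (serial case) or triggers another round of messages (parallel case). The function name akx_naive and the SciPy test matrix are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def akx_naive(A, x, k):
    """Return [x, Ax, A^2 x, ..., A^k x] via k separate SpMVs."""
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])          # one SpMV = one more pass over A
    return np.column_stack(V)        # n x (k+1) Krylov basis

n, k = 1000, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
W = akx_naive(A, b, k)               # the k+1 vectors a Krylov method combines
print(W.shape)                       # (1000, 9)
```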
Locally Dependent Entries for [x,Ax], A tridiagonal, 2 processors: these can be computed without communication. (Figure: the entries of x, Ax, …, A8x, split between Proc 1 and Proc 2.)
Locally Dependent Entries for [x,Ax,A2x], A tridiagonal, 2 processors: these can be computed without communication.
Locally Dependent Entries for [x,Ax,…,A3x], A tridiagonal, 2 processors: these can be computed without communication.
Locally Dependent Entries for [x,Ax,…,A4x], A tridiagonal, 2 processors: these can be computed without communication.
Locally Dependent Entries for [x,Ax,…,A8x], A tridiagonal, 2 processors: these can be computed without communication. k=8-fold reuse of A.
Remotely Dependent Entries for [x,Ax,…,A8x], A tridiagonal, 2 processors. (Figure: the entries of x, Ax, …, A8x across Proc 1 and Proc 2.) One message gets all the data needed to compute the remotely dependent entries, not k=8 messages. This minimizes the number of messages = latency cost. Price: redundant work, governed by the "surface/volume ratio".
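As a concrete illustration of the one-message idea (my own sketch, not the lecture's code; it assumes a periodic tridiagonal stencil so there are no boundary special cases), each "processor" below fetches k ghost values per side once, then computes its rows of x, Ax, …, Akx with purely local work, redoing a small amount of boundary work that its neighbor also does.

```python
import numpy as np

def stencil_chunk(v):
    """Apply y_i = -v_{i-1} + 2*v_i - v_{i+1}; the two end entries of the
    input have no neighbors, so the output is 2 entries shorter."""
    return -v[:-2] + 2.0 * v[1:-1] - v[2:]

def local_akx(x, lo, hi, k):
    """Rows lo..hi-1 of [x, Ax, ..., A^k x] for the periodic tridiagonal
    operator, using a single ghost exchange of k values per side."""
    n = len(x)
    idx = np.arange(lo - k, hi + k) % n           # the one "message": 2k ghost values
    v = x[idx].copy()
    cols = [v[k: k + (hi - lo)].copy()]
    for j in range(1, k + 1):
        v = stencil_chunk(v)                      # local work only; chunk shrinks by 1 per side
        cols.append(v[k - j: k - j + (hi - lo)].copy())
    return np.column_stack(cols)

n, k = 16, 4
x = np.random.default_rng(0).standard_normal(n)
V1 = local_akx(x, 0, 8, k)                        # "Proc 1": no further messages needed
V2 = local_akx(x, 8, n, k)                        # "Proc 2": no further messages needed

# Check against the explicit (periodic) tridiagonal matrix.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, -1] = A[-1, 0] = -1.0
ref = np.column_stack([np.linalg.matrix_power(A, j) @ x for j in range(k + 1)])
assert np.allclose(np.vstack([V1, V2]), ref)
```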
Spyplot of A with sparsity pattern of a 5-point stencil, natural order.
Spyplot of A with sparsity pattern of a 5-point stencil, natural order, assigned to p=9 processors. For an n x n mesh assigned to p processors, each processor needs 2n data from its neighbors.
Spyplot of A with sparsity pattern of a 5-point stencil, nested dissection order.
Spyplot of A with sparsity pattern of a 5-point stencil, nested dissection order. For an n x n mesh assigned to p processors, each processor needs 4n/p1/2 data from its neighbors.
Remotely Dependent Entries for [x,Ax,…,A3x], A with sparsity pattern of a 5-point stencil.
Remotely Dependent Entries for [x,Ax,A2x,A3x], A irregular, multiple processors.
Reducing redundant work for a tridiagonal matrix.
Summary of Parallel Optimizations so far, for the "Akx kernel": (A,x,k) → [Ax,A2x,…,Akx]
• Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally; #flops grows by a factor 1+O(k/(n/p)) = 1 + O(k/data_per_proc) ≈ 1
• Pretty good: 2D mesh (n x n): need O(n/p1/2) data from neighbors, versus O(n2/p) locally; #flops grows by 1+O(k/(data_per_proc)1/2)
• Less good: 3D mesh (n x n x n): need O(n2/p2/3) data from neighbors, versus O(n3/p) locally; #flops grows by 1+O(k/(data_per_proc)1/3)
• Possible to reduce #messages from O(k) to O(1)
  • Depends on the matrix being "well-partitioned": each processor only needs data from a few neighbors
• Price: redundant computation of some entries of Ajx
  • Amount of redundant work depends on the "surface-to-volume ratio" of the partition: ideally each processor needs only a little data from its neighbors, compared to what it owns itself
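As a quick numerical illustration of the overhead factors above (my own helper, ignoring the hidden constants; the n, p, k values are arbitrary): the extra-flops factor behaves like 1 + k / data_per_proc^(1/d) for a d-dimensional mesh.

```python
def flop_growth(n, p, k, d):
    """1 + k / data_per_proc**(1/d), ignoring constants."""
    data_per_proc = n**d / p                # mesh points owned by one processor
    return 1.0 + k / data_per_proc**(1.0 / d)

for d, n in [(1, 10**6), (2, 10**3), (3, 10**2)]:
    print(f"{d}D mesh, n={n}, p=1024, k=8: factor ~ {flop_growth(n, 1024, 8, d):.4f}")
```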
Predicted speedup for model Petascale machine of [Ax,A2x,…,Akx] for 2D (n x n) mesh. (Plot axes: k and log2 n.)
Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax,A2x,…,Akx] for 2D (n x n) mesh. (Plot axes: k and log2 n.)
Predicted ratio of extra arithmetic for model Petascale machine of [Ax,A2x,…,Akx] for 2D (n x n) mesh. (Plot axes: k and log2 n.)
Predicted speedup for model Petascale machine of [Ax,A2x,…,Akx] for 3D (n x n x n) mesh. (Plot axes: k and log2 n.)
Predicted fraction of time spent doing arithmetic for model Petascale machine of [Ax,A2x,…,Akx] for 3D (n x n x n) mesh. (Plot axes: k and log2 n.)
Predicted ratio of extra arithmetic for model Petascale machine of [Ax,A2x,…,Akx] for 3D (n x n x n) mesh. (Plot axes: k and log2 n.)
Minimizing #messages versus #words_moved
• Parallel case: can't reduce the #words needed from other processors; they are required to get the right answer
• Sequential case: can reduce the #words needed from slow memory
  • Assume A, x too large to fit in fast memory
  • Naïve implementation of [Ax,A2x,…,Akx] by k calls to SpMV needs to move A, x between fast and slow memory k times
  • We will move them one time – clearly optimal
Sequential [x,Ax,…,A4x], with memory hierarchy. One read of the matrix from slow memory, not k=4. Minimizes words moved = bandwidth cost. No redundant work.
Remotely Dependent Entries for [x,Ax,…,A3x], A 100x100 with bandwidth 2. Only ~25% of A and the vectors fit in fast memory.
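A minimal sketch of the sequential idea (mine, not the SC09 implementation; the tridiagonal SciPy matrix and the block size are arbitrary choices): sweep over row blocks once, and for each block read only a small extended submatrix of A, from which all k powers restricted to that block can be computed in fast memory. A is thus read roughly once instead of k times.

```python
import numpy as np
import scipy.sparse as sp

def akx_blocked(A, x, k, block):
    """[x, Ax, ..., A^k x] for a tridiagonal A, in one sweep over row blocks."""
    n = A.shape[0]
    out = np.empty((n, k + 1))
    out[:, 0] = x
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        e0, e1 = max(lo - k, 0), min(hi + k, n)   # block rows plus a k-row halo
        A_sub = A[e0:e1, e0:e1].toarray()         # the only read of this piece of A
        v = x[e0:e1].copy()
        for j in range(1, k + 1):
            v = A_sub @ v                         # all k powers, entirely in "fast memory"
            out[lo:hi, j] = v[lo - e0: hi - e0]   # rows of A^j x owned by this block
    return out

n, k = 100, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
x = np.linspace(0.0, 1.0, n)
W = akx_blocked(A, x, k, block=25)
Ad = A.toarray()
ref = np.column_stack([np.linalg.matrix_power(Ad, j) @ x for j in range(k + 1)])
assert np.allclose(W, ref)
```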
In what order should the sequential algorithm process a general sparse matrix?
• For a band matrix, we obviously want to process the matrix "from left to right", since the data we need for the next partition is already in fast memory
• Not obvious what the best order is for a general matrix
• We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP)
  • One vertex per matrix partition
  • The weight of edge (j, k) is the memory cost of processing partition k right after partition j
  • TSP: find the lowest-cost "tour" visiting all vertices
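A toy sketch of that formulation (my own; the cost matrix here is random, standing in for the slow-memory re-read cost between partitions): a cheap greedy nearest-neighbor tour is one plausible heuristic, since solving the TSP exactly is NP-hard.

```python
import numpy as np

def greedy_tour(C, start=0):
    """Visit every partition once, always moving to the cheapest unvisited one."""
    n = len(C)
    order, visited = [start], {start}
    while len(order) < n:
        last = order[-1]
        nxt = min((j for j in range(n) if j not in visited), key=lambda j: C[last, j])
        order.append(nxt)
        visited.add(nxt)
    return order

rng = np.random.default_rng(1)
C = rng.uniform(1.0, 10.0, size=(6, 6))   # hypothetical partition-to-partition memory costs
print(greedy_tour(C))                     # an ordering of the 6 partitions
```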
What about multicore?
• Two kinds of communication to minimize
  • Between processors on the chip
  • Between on-chip cache and off-chip DRAM
• Use a hybrid of both techniques described so far
  • Use the parallel optimization so each core can work independently
  • Use the sequential optimization to minimize the off-chip DRAM traffic of each core
Speedups on Intel Clovertown (8 core). Test matrices include stencils and practical matrices. See the SC09 paper on bebop.cs.berkeley.edu for details.
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b,Ab,…,Akb} minimizing || Ax-b ||2
• Cost of k steps of standard GMRES vs new GMRES:
Standard GMRES:
  for i=1 to k
    w = A · v(i-1)
    MGS(w, v(0),…,v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H
  Sequential: #words_moved = O(k·nnz) from SpMV + O(k2·n) from MGS
  Parallel: #messages = O(k) from SpMV + O(k2·log p) from MGS
Communication-Avoiding GMRES (CA-GMRES):
  W = [ v, Av, A2v, …, Akv ]
  [Q,R] = TSQR(W)   … "Tall Skinny QR"
  build H from R, solve LSQ problem
  Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
  Parallel: #messages = O(1) from computing W + O(log p) from TSQR
• Oops – W comes from the power method, so precision is lost!
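The TSQR building block can be sketched in a few lines (my own illustration of the reduction idea, not the lecture's implementation; tsqr_R and nblocks are names chosen here): QR each row block independently, then QR the stacked small R factors. Only the small k x k R blocks need to be combined (communicated), never the tall columns of W.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """R factor of a tall-skinny W via one reduction level."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]   # local QRs (independent)
    return np.linalg.qr(np.vstack(Rs), mode="r")       # combine only the small R's

rng = np.random.default_rng(0)
W = rng.standard_normal((4000, 9))          # stand-in for [v, Av, ..., A^8 v]
R1 = tsqr_R(W)
R2 = np.linalg.qr(W, mode="r")
assert np.allclose(np.abs(R1), np.abs(R2))  # R is unique up to the sign of each row
```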
How to make CA-GMRES Stable?
• Use a different polynomial basis for the Krylov subspace
• Not the monomial basis W = [v, Av, A2v, …], but instead [v, p1(A)v, p2(A)v, …]
• Possible choices:
  • Newton basis WN = [v, (A – θ1 I)v, (A – θ2 I)(A – θ1 I)v, …], where the shifts θi are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so "free")
  • Chebyshev basis WC = [v, T1(A)v, T2(A)v, …], where the Ti(z) are chosen to be small in a region of the complex plane containing the large eigenvalues
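A minimal sketch of building the Newton basis described above (my own; the shifts here are just evenly spaced numbers standing in for Ritz values, and the comparison matrix is a 1D Laplacian): each new vector is (A − θj I) applied to the previous one.

```python
import numpy as np
import scipy.sparse as sp

def newton_basis(A, v, shifts):
    """W = [v, (A - s1 I)v, (A - s2 I)(A - s1 I)v, ...]."""
    W = [v]
    for s in shifts:
        W.append(A @ W[-1] - s * W[-1])      # apply (A - s*I) to the latest vector
    return np.column_stack(W)

n, k = 500, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
v = np.ones(n) / np.sqrt(n)
shifts = np.linspace(0.1, 3.9, k)            # hypothetical eigenvalue estimates in (0, 4)
W_newton = newton_basis(A, v, shifts)
W_mono = np.column_stack([np.linalg.matrix_power(A.toarray(), j) @ v for j in range(k + 1)])
print(np.linalg.cond(W_newton), "vs monomial:", np.linalg.cond(W_mono))
```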
"Monomial" basis [Ax,…,Akx] fails to converge; the Newton polynomial basis does converge.
Summary of what is known (1/3), open problems
• GMRES
  • Can independently choose k to optimize speed and the restart length r to optimize convergence
  • Need to "co-tune" the Akx kernel and TSQR
  • Know how to use more stable polynomial bases
  • Proven speedups
• Can similarly reorganize other Krylov methods
  • Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  • Conjugate Gradients, for Ax = b
  • Biconjugate Gradients, for Ax = b
  • BiCGStab?
  • Other Krylov methods?
Summary of what is known (2/3), open problems
• Preconditioning: MAx = Mb
• Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
• Can think of this as the union of two of the earlier kernels, [x, (MA)x, (MA)2x, …, (MA)kx] and [x, Ax, (AM)Ax, (AM)2Ax, …, (AM)kAx]
• For which preconditioners M can we minimize communication?
  • Easy: diagonal M
  • How about block-diagonal M?
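For concreteness, a small sketch of that kernel (my own, with a block-Jacobi M built from inverses of diagonal blocks of A as a stand-in preconditioner): apply A and M alternately and keep every intermediate vector.

```python
import numpy as np
import scipy.sparse as sp

def precond_powers(A, M, x, k):
    """[x, Ax, MAx, AMAx, MAMAx, ...]: 2k operator applications in all."""
    vecs = [x]
    for j in range(2 * k):
        op = A if j % 2 == 0 else M          # alternate A, M, A, M, ...
        vecs.append(op @ vecs[-1])
    return np.column_stack(vecs)

n, k, nb = 200, 3, 50
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
M = sp.block_diag([np.linalg.inv(A[i:i+nb, i:i+nb].toarray())
                   for i in range(0, n, nb)], format="csr")  # block-Jacobi preconditioner
V = precond_powers(A, M, np.ones(n), k)
print(V.shape)                               # (n, 2k+1)
```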
Examine [x,Ax,MAx,AMAx,MAMAx,…] for A tridiagonal, M block-diagonal.
Summary of what is known (3/3), open problems
• Preconditioning: MAx = Mb
• Need a different communication-avoiding kernel: [x, Ax, MAx, AMAx, MAMAx, AMAMAx, …]
• For block-diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
• Can take advantage of the low rank to minimize communication: yi = ((AM)k)ij · xj = U·V·xj = U·(V·xj), so V·xj is computed on proc j and sent to proc i, and yi is reconstructed on proc i
• Works (in principle) for Hierarchically Semi-Separable M
• How does it work in practice?
• For details (and open problems) see M. Hoemmen's PhD thesis
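A toy illustration of the low-rank trick in that bullet (mine; U, V, and the rank r are made-up stand-ins for the factored off-diagonal block): if the block coupling processor i to processor j factors as U·V with small rank r, processor j sends only the length-r vector V·xj rather than its whole subvector.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 300, 4                        # subvector length, off-diagonal block rank
U = rng.standard_normal((m, r))      # held by processor i
V = rng.standard_normal((r, m))      # held by processor j
x_j = rng.standard_normal(m)

msg = V @ x_j                        # what crosses the network: r words, not m
y_contrib = U @ msg                  # processor i reconstructs its contribution to y_i
assert np.allclose(y_contrib, (U @ V) @ x_j)
print("words sent:", msg.size, "instead of", x_j.size)
```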
Of things not said (much about) …
• Other Communication-Avoiding (CA) dense factorizations
  • QR with column pivoting (tournament pivoting)
  • Cholesky with diagonal pivoting
  • GE with "complete pivoting"
  • LDLT? Maybe with complete pivoting?
• CA-Sparse factorizations
  • Cholesky, assuming good separators
  • Lower bounds from Lipton/Rose/Tarjan
  • Matrix multiplication, anything else?
• Strassen, and all linear algebra with #words_moved = O(n^ω / M^(ω/2 - 1))
  • Parallel?
• Extending CA-Krylov methods to the "bottom solver" in multigrid with Adaptive Mesh Refinement (AMR)
Summary: Time to redesign all dense and sparse linear algebra. Don't Communic…
Extra slides