Presentation Transcript


  1. CS294, Lecture #5, Fall 2011: Communication-Avoiding Algorithms. www.cs.berkeley.edu/~odedsc/CS294. Communication Avoiding Algorithms for Dense Linear Algebra: LU, QR, and Cholesky decompositions, and Sparse Cholesky decompositions. Jim Demmel and Oded Schwartz. Based on: Bootcamp 2011, CS267, Summer-school 2010, material from Laura Grigori, and many papers.

  2. Outline for today
  • Recall communication lower bounds
  • What a "matching upper bound", i.e. a CA algorithm, might have to do
  • Summary of what is known so far
  • Case studies of CA dense algorithms
    • Last time, case study I: matrix multiplication
    • Case study II: LU, QR
    • Case study III: Cholesky decomposition
  • Case studies of CA sparse algorithms
    • Sparse Cholesky decomposition

  3. Communication lower bounds
  • Applies to algorithms satisfying the technical conditions discussed last time ("smells like 3-nested loops").
  • For each processor:
    M = size of fast memory of that processor
    G = #operations (multiplies) performed by that processor
    #words_moved to/from fast memory = Ω( max( G / M^(1/2), #inputs + #outputs ) )
    #messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )
  (A small calculator for these bounds follows.)
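To make the bounds concrete, here is a small helper that evaluates them for a given flop count G and fast-memory size M. This is a minimal sketch in Python; the function name and interface are mine, and the constants hidden inside the Ω are dropped.

```python
def communication_lower_bounds(G, M, n_inputs=0, n_outputs=0, max_message_size=None):
    """Evaluate the slide-3 lower bounds, ignoring the constants hidden in Omega.

    G: flops (multiplies) done by this processor; M: fast-memory size in words.
    max_message_size defaults to M (a message can at most fill fast memory)."""
    if max_message_size is None:
        max_message_size = M
    words = max(G / M ** 0.5, n_inputs + n_outputs)   # bandwidth lower bound
    messages = words / max_message_size               # latency lower bound ~ G / M^(3/2)
    return words, messages

# Example: classical O(n^3) matrix multiply, n = 4096, fast memory of 2^20 words
words, messages = communication_lower_bounds(G=4096 ** 3, M=2 ** 20)
```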

  4. Summary of dense sequential algorithms attaining communication lower bounds
  • Algorithms shown minimizing #messages use a (recursive) block layout.
  • Not possible with columnwise or rowwise layouts.
  • Many references (see reports); only some are shown, plus ours.
  • Cache-oblivious algorithms are underlined, green entries are ours, ? is unknown/future work. (The summary table itself is not reproduced in this transcript.)

  5. Summary of dense parallel algorithms attaining communication lower bounds
  • Assume n x n matrices on P processors.
  • Minimum memory per processor: M = O(n^2/P).
  • Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  6. Summary of dense parallel algorithms attaining communication lower bounds (conventional approach)
  • Assume n x n matrices on P processors.
  • Minimum memory per processor: M = O(n^2/P).
  • Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  7. Summary of dense parallel algorithms attaining communication lower bounds (conventional approach, continued)
  • Assume n x n matrices on P processors.
  • Minimum memory per processor: M = O(n^2/P).
  • Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  8. Summary of dense parallel algorithms attaining communication lower bounds (better)
  • Assume n x n matrices on P processors.
  • Minimum memory per processor: M = O(n^2/P).
  • Recall lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  9. Can we do even better?
  • Assume n x n matrices on P processors.
  • Why just one copy of the data: M = O(n^2/P) per processor?
  • Increase M to reduce the lower bounds:
    #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
    #messages = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

  10. Recall Sequential One-sided Factorizations (LU, QR)
  • Classical approach:
    for i = 1 to n: update column i; update trailing matrix
    #words_moved = O(n^3)
  • Blocked approach (LAPACK):
    for i = 1 to n/b: update block i of b columns; update trailing matrix
    #words_moved = O(n^3/M^(1/3))
  • Recursive approach (sketched below):
    func factor(A):
      if A has 1 column, update it
      else: factor(left half of A); update right half of A; factor(right half of A)
    #words_moved = O(n^3/M^(1/2))
  • None of these approaches minimizes #messages or addresses parallelism. Need another idea.
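The recursive approach is easy to prototype in numpy. The following is a teaching sketch of unpivoted recursive LU, not the communication-tuned code from the slides; it assumes a float matrix with at least as many rows as columns and nonsingular leading blocks.

```python
import numpy as np

def recursive_lu(A):
    """In-place recursive LU without pivoting (sketch of the 'Recursive Approach')."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]          # one column: scale entries below the pivot
        return A
    k = n // 2
    recursive_lu(A[:, :k])           # factor the left half (all rows)
    # update the right half: U12 = L11^{-1} A12, then the Schur complement
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]
    recursive_lu(A[k:, k:])          # factor the trailing (right) half
    return A

# Usage: the result holds L (unit lower part) and U (upper part) in one array
A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 7., 9.]])
A_lu = recursive_lu(A.copy())
```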

  11. Minimizing latency in the sequential case requires new data structures
  • To minimize latency, we need to load/store a whole rectangular subblock of the matrix with one "message".
  • Incompatible with conventional columnwise (rowwise) storage: rows (columns) are not in contiguous memory locations.
  • Blocked storage: store the matrix as a matrix of b x b blocks, each block stored contiguously. OK for one level of the memory hierarchy, but what if there are more?
  • Recursive blocked storage: store each block recursively using subblocks. Also known as "space-filling curves" or "Morton ordering" (an indexing sketch follows).
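For illustration, the position of a block in a recursive blocked (Z-order / Morton) layout can be computed by interleaving the bits of its block coordinates. This helper is hypothetical, not part of the slides.

```python
def morton_index(i, j):
    """Z-order (Morton) position of block (i, j) in a recursive blocked layout.
    Interleaves the bits of the two nonnegative block coordinates."""
    z = 0
    bit = 0
    while (i >> bit) or (j >> bit):
        z |= ((i >> bit) & 1) << (2 * bit + 1)   # bits of i go to odd positions
        z |= ((j >> bit) & 1) << (2 * bit)       # bits of j go to even positions
        bit += 1
    return z

# Blocks (0,0), (0,1), (1,0), (1,1) map to positions 0, 1, 2, 3,
# and the same pattern repeats recursively at every level of the hierarchy.
```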

  12. Different Data Layouts for Parallel GE
  1) 1D Column Blocked Layout: bad load balance, P0 idle after the first n/4 steps.
  2) 1D Column Cyclic Layout: load balanced, but can't easily use BLAS2 or BLAS3.
  3) 1D Column Block Cyclic Layout: can trade load balance against BLAS2/3 performance by choosing the block size b, but factorization of a block column is a bottleneck.
  4) Block Skewed Layout: complicated addressing; may not want full parallelism in each column and row.
  5) 2D Row and Column Blocked Layout: bad load balance, P0 idle after the first n/2 steps.
  6) 2D Row and Column Block Cyclic Layout: the winner! (An owner-mapping sketch follows.)
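As a concrete illustration of layout 6, the owner of a matrix entry in a 2D block-cyclic layout can be computed directly. The helper below is mine, not part of the slides; it follows the usual ScaLAPACK-style convention with zero-based indices and process (0,0) holding the first block.

```python
def block_cyclic_owner(i, j, nb, pr, pc):
    """Process (row, col) in a pr x pc grid that owns entry (i, j) of a matrix
    distributed 2D block-cyclically with square blocks of size nb."""
    return ((i // nb) % pr, (j // nb) % pc)

# Example: entry (5, 13) with 2x2 blocks on a 2x3 process grid
assert block_cyclic_owner(5, 13, nb=2, pr=2, pc=3) == (0, 0)
```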

  13. Distributed GE with a 2D Block Cyclic Layout (figure-only slide)

  14. Matrix multiply of green = green − blue · pink (figure-only slide)

  15. Where is the communication bottleneck?
  • In both the sequential and the parallel case: panel factorization.
  • Can we retain partial pivoting? No.
  • "Proof" for the parallel case: each choice of pivot requires a reduction (to find the maximum entry) per column, so #messages = Ω(n) or Ω(n log P), versus the goal of O(P^(1/2)).
  • Solution: first for QR (TSQR), then for LU.

  16. TSQR: QR of a Tall, Skinny matrix
  W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30] = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]
  [R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]
  [R01; R11] = Q02·R02
  Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
  (A numpy sketch of the reduction follows.)
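The R factor produced by this reduction is easy to prototype with numpy. A sketch only: it keeps just R, discards the implicitly represented Q factors, and assumes the number of row blocks is a power of two.

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall-skinny W via TSQR with a binary reduction tree:
    leaf QRs give R00..R30, stacked pairs are refactored into R01, R11,
    and a final QR gives R02."""
    Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(W, nblocks, axis=0)]
    while len(Rs) > 1:                                  # one tree level per pass
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

# R agrees with a direct QR of W up to the signs of its rows
W = np.random.default_rng(1).standard_normal((4000, 8))
assert np.allclose(np.abs(tsqr_r(W)), np.abs(np.linalg.qr(W, mode='r')))
```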

  17. TSQR: An Architecture-Dependent Algorithm
  • Parallel (binary tree): local QRs of W0..W3 give R00, R10, R20, R30; pairs combine into R01, R11; a final QR gives R02.
  • Sequential (flat tree): factor W0 into R00, fold in W1 to get R01, then W2 to get R02, then W3 to get R03.
  • Dual core: a hybrid of the two trees.
  • Multicore / Multisocket / Multirack / Multisite / Out-of-core: ? Can choose the reduction tree dynamically.

  18. Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
  W (n x b) = [W1; W2; W3; W4]
  • Factor each block: Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'.
  • Stack pairs and factor again: [W1'; W2'] = P12·L12·U12, choose b pivot rows W12'; [W3'; W4'] = P34·L34·U34, choose b pivot rows W34'.
  • Final round: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows.
  • Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
  (A small playoff sketch follows.)
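A tournament-pivoting "playoff" can be sketched with scipy's dense LU serving as the local GEPP. The helpers below are hypothetical illustrations, not the CALU production code; they assume each block has at least b rows and that the number of blocks is a power of two.

```python
import numpy as np
from scipy.linalg import lu

def pivot_rows(block, b):
    """Indices (into `block`) of the b rows chosen by GEPP on this block."""
    p, _, _ = lu(block)               # block = p @ l @ u
    return p[:, :b].argmax(axis=0)    # rows that land in the first b pivot positions

def tournament_pivot(W, b, nblocks=4):
    """Pick b candidate pivot rows per block with GEPP, then run pairwise
    playoffs up the reduction tree until b winners remain; returns row
    indices into the panel W (shape m x b, m divisible by nblocks)."""
    m = W.shape[0]
    starts = np.arange(nblocks) * (m // nblocks)
    cands = [s + pivot_rows(W[s:s + m // nblocks], b) for s in starts]
    while len(cands) > 1:             # one tournament round per pass
        merged = []
        for a, c in zip(cands[0::2], cands[1::2]):
            rows = np.concatenate([a, c])
            merged.append(rows[pivot_rows(W[rows], b)])
        cands = merged
    return cands[0]

# Example: pick 2 pivot rows from a 16 x 2 panel
# rows = tournament_pivot(np.random.default_rng(0).standard_normal((16, 2)), b=2)
```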

  19. Minimizing Communication in TSLU
  • Parallel (binary tree): local LU on each block W1..W4, then pairwise LU of the stacked candidate pivot rows up the tree.
  • Sequential (flat tree): LU of W1, then fold in W2, W3, W4 one block at a time.
  • Dual core: a hybrid of the two trees.
  • Multicore / Multisocket / Multirack / Multisite / Out-of-core: can choose the reduction tree dynamically, as before.

  20. Making TSLU Numerically Stable
  • Stability goal: make ||A − PLU|| very small: O(machine_epsilon · ||A||).
  • Details matter: going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U; only tournament pivoting is stable.
  • Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: it gives the same results as PP applied to a different input matrix whose entries are blocks taken from the input A.
  • CALU - Communication-Avoiding LU for general A: use TSLU for the panel factorizations, then apply the updates to the rest of the matrix.
  • Cost: redundant panel factorizations.
    • Extra flops for one reduction step: n/b panels × O(n·b^2) flops per panel.
    • For the sequential algorithm this is n^2·b; with b = M^(1/2) the extra flops are n^2·M^(1/2), which is < n^3 since M < n^2.
    • For the parallel 2D algorithm there are log P reduction steps, n/b panels, and n·b^2/P^(1/2) flops per panel; with b = n/(log^2(P)·P^(1/2)) the extra flops are (n/b)·(n·b^2)·log P / P^(1/2) = n^2·b·log P / P^(1/2) = n^3/(P·log P) < n^3/P.
  • Benefit:
    • Stable in practice, but not the same pivot choice as GEPP.
    • One reduction operation per panel: reduces latency to the minimum.

  21. Sketch of TSLU Stability Proof
  • Consider A = [A11 A12; A21 A22; A31 A32] with TSLU of the first columns done by the reduction tree (assume w.l.o.g. that A21 already contains the desired pivot rows).
  • Write down the resulting LU decomposition after the first n/2 steps (the block factors L'11, L21, L31, U'11, U'12, U21 and the Schur complement As32; the full block matrices from the slide are not reproduced here).
  • Claim: As32 can be obtained by GEPP on a larger matrix G formed from blocks of A.
  • GEPP applied to G pivots no rows, and
    L31·L21^(-1)·A21·U'11^(-1)·U'12 + As32 = L31·U21·U'11^(-1)·U'12 + As32 = A31·U'11^(-1)·U'12 + As32 = L'31·U'12 + As32 = A32.
  • Need to show that As32 is the same as above.

  22. Stability of LU using TSLU: CALU (1/2)
  • Worst-case analysis.
  • Pivot growth factor from one panel factorization with b columns:
    TSLU: proven 2^(bH), where H = 1 + height of the reduction tree; attained 2^b, same as GEPP.
  • Pivot growth factor on an m x n matrix:
    TSLU: proven 2^(nH-1); attained 2^(n-1), same as GEPP.
  • |L| can be as large as 2^((b-2)H-b+1), but CALU is still stable.
  • There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other.
  • Discovered sparse matrices with O(2^n) pivot growth.

  23. Stability of LU using TSLU: CALU (2/2)
  • Empirical testing: both random matrices and "special ones".
  • Both binary tree (BCALU) and flat tree (FCALU).
  • 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors.
  • See [D., Grigori, Xiang, 2010] for details.

  24. Performance vs ScaLAPACK • TSLU • IBM Power 5 • Up to 4.37x faster (16 procs, 1M x 150) • Cray XT4 • Up to 5.52x faster (8 procs, 1M x 150) • CALU • IBM Power 5 • Up to 2.29x faster (64 procs, 1000 x 1000) • Cray XT4 • Up to 1.81x faster (64 procs, 1000 x 1000) • See INRIA Tech Report 6523 (2008), paper at SC08

  25. Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU (plot over log2(p) and log2(n^2/p) = log2(memory_per_proc); figure not reproduced)

  26. Case Study III: Cholesky Decomposition

  27. LU and Cholesky decompositions
  • LU decomposition of A: A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.
  • Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we require its diagonal elements to be positive).
  • Efficient parallel and sequential algorithms?
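As a quick check of the definition (not from the slides), numpy's built-in factorization confirms A = L·L^T for a symmetric positive definite A:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)          # symmetric positive definite by construction
L = np.linalg.cholesky(A)            # lower triangular with positive diagonal
assert np.allclose(L @ L.T, A)       # A = L * L^T
```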

  28. Application of the generalized form to Cholesky decomposition (1)
  Generalized form: for all (i,j) ∈ S,
    C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, for k1, k2, … ∈ S_ij, and other arguments )

  29. By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a], which follows [Irony, Toledo, Tiskin 04] and is based on [Loomis & Whitney 49]: the communication cost (bandwidth) of any classical algorithm for Cholesky decomposition (for dense matrices) is Ω(n^3 / M^(1/2)) words sequentially (and Ω((n^3/P) / M^(1/2)) per processor in parallel).

  30. Some Sequential Classical Cholesky Algorithms

  31. Some Parallel Classical Cholesky Algorithms

  32. Upper bounds – Supporting Data Structures
  • Column-major formats: Full, Old Packed, Rectangular Full Packed.
  • Block-contiguous formats: Blocked, Recursive Format, Recursive Full Packed [Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04].

  33. Upper bounds – Supporting Data Structures
  • Rules of thumb:
    • Don't mix column/row major with blocked: decide which one fits your algorithm.
    • Converting between the two once in your algorithm is OK for dense instances, e.g., changing the DS at the beginning of the algorithm.

  34. The Naïve Algorithms
  • Naïve left-looking and right-looking Cholesky (a loop-level sketch of the right-looking variant follows).
  • Bandwidth: Θ(n^3). Latency: Θ(n^3/M).
  • Assuming a column-major DS.
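Here is a loop-level sketch of the naive right-looking variant in numpy, written for clarity with no attention to communication; the function is illustrative only and assumes a float SPD input.

```python
import numpy as np

def right_looking_cholesky(A):
    """Unblocked right-looking Cholesky, in place: for each column, scale by the
    pivot square root, then immediately rank-1 update the trailing matrix."""
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])
        A[k+1:, k] /= A[k, k]                            # scale column k
        # rank-1 update of the trailing (n-k-1) x (n-k-1) submatrix
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k+1:, k])
    return np.tril(A)                                    # L; the strictly upper part is junk
```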

  35. Rectangular Recursive [Toledo 97]
  Factor an m x n matrix whose columns are split at n/2 (block rows A11/A21/A31 on the left, A12/A22/A32 on the right):
  • Recurse on the left rectangular half, giving [L11; L21; L31].
  • Update: [A22; A32] := [A22; A32] − [L21; L31]·L21^T.
  • Recurse on the right rectangular half, giving [L22; L32]; set L12 := 0.
  • Base case (a single column): L := A / (A11)^(1/2).

  36. Rectangular Recursive [Toledo 97]
  • Toledo's algorithm guarantees stability for LU.
  • BW(m,n) = 2m if n = 1; otherwise BW(m,n) = BW(m, n/2) + BW_mm(m − n/2, n/2, n/2) + BW(m − n/2, n/2).
  • Bandwidth: O(n^3/M^(1/2) + n^2·log n) — optimal bandwidth (for almost all M).

  37. Rectangular Recursive [Toledo 97]
  • Is it latency optimal?
  • If we use a column-major DS, then L = Ω(n^3/M), a factor of M^(1/2) away from the optimum Θ(n^3/M^(3/2)). This is due to the matrix multiply.
  • Open problem: can we prove a latency lower bound for all classical matrix-multiply algorithms that use a column-major DS?
  • Note: we don't need such a lower bound here, as the matrix-multiply algorithm is known.

  38. Rectangular Recursive [Toledo 97]
  • Is it latency optimal?
  • If we use a contiguous-block DS, then L = Θ(n^2). This is due to the forward column updates.
  • The optimum is Θ(n^3/M^(3/2)), and n^2 ≤ n^3/M^(3/2) only when n ≥ M^(3/2). Outside this range the algorithm is not latency optimal.

  39. Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]
  Split A into n/2 x n/2 blocks and:
  1. Recurse on A11, giving L11.
  2. L21 = RTRSM(A21, L11^T)  (RTRSM: recursive triangular solver).
  3. A22 = A22 − L21·L21^T.
  4. Recurse on the updated A22, giving L22; the A12 block of the factor is set to 0.

  40. Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]
  • Same recursion as above: recurse on A11; L21 = RTRSM(A21, L11^T); A22 = A22 − L21·L21^T; recurse on A22; A12 = 0.
  • BW(n) = 3n^2 if 3n^2 ≤ M; otherwise BW(n) = 2·BW(n/2) + O(n^3/M^(1/2) + n^2).

  41. Square Recursive algorithm [Ahmed, Pingali 00], first analyzed in [Ballard, Demmel, Holtz, S. 09]
  • Uses recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99] and a recursive RTRSM.
  • Bandwidth: O(n^3/M^(1/2)). Latency: O(n^3/M^(3/2)). Assuming a block-recursive DS.
  • Optimal bandwidth, optimal latency, cache oblivious; not stable for LU. (A numpy sketch follows.)
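A minimal numpy sketch of the square recursive scheme; the rtrsm helper below is a stand-in that calls a dense solver instead of recursing, and the whole thing assumes a float SPD input.

```python
import numpy as np

def rtrsm(A, L):
    """Solve X · L^T = A for X, with L lower triangular (stand-in for the
    recursive triangular solve; a real RTRSM would recurse on blocks)."""
    return np.linalg.solve(L, A.T).T

def recursive_cholesky(A):
    """Square recursive Cholesky: recurse on A11, triangular-solve for L21,
    form the Schur complement, recurse on it."""
    n = A.shape[0]
    if n == 1:
        return np.array([[np.sqrt(A[0, 0])]])
    k = n // 2
    L11 = recursive_cholesky(A[:k, :k])
    L21 = rtrsm(A[k:, :k], L11)                    # L21 · L11^T = A21
    S = A[k:, k:] - L21 @ L21.T                    # Schur complement of A11
    L22 = recursive_cholesky(S)
    return np.block([[L11, np.zeros((k, n - k))],  # the A12 block of L is zero
                     [L21, L22]])
```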

  42. Parallel Algorithm (on a Pr x Pc block-cyclic process grid; the 3 x 3 grid figure is not reproduced)
  1. Find the Cholesky factor locally on a diagonal block; broadcast it downwards.
  2. Parallel triangular solve; broadcast the result to the right.
  3. Parallel symmetric rank-b update.

  43. Parallel Algorithm (continued)
  • Optimal (up to a log P factor) only for 2D. A new 2.5D algorithm extends to any memory size M.
  • Low-order term, as b ≤ n/P^(1/2).

  44. Summary of algorithms for dense Cholesky decomposition
  • Tight lower bound for Cholesky decomposition.
  • Sequential: bandwidth Θ(n^3/M^(1/2)), latency Θ(n^3/M^(3/2)).
  • Parallel: bandwidth and latency lower bounds tight up to an O(log P) factor.

  45. Sparse Algorithms: Cholesky Decomposition. Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet, '00]

  46. Lower bounds for sparse direct methods
  • The geometric-embedding bound for sparse 3-nested-loops-like algorithms may be loose:
    BW = Ω(#flops / M^(1/2)), L = Ω(#flops / M^(3/2)).
  • These bounds may be smaller than the NNZ (number of non-zeros) in the input/output.
  • For example, computing the product of two (block) diagonal matrices may involve no communication in parallel.

  47. Lower bounds for sparse matrix multiplication
  • Compute C = A·B, with A, B, C of size n x n, each having d·n nonzeros uniformly distributed; #flops is at most d^2·n.
  • Lower bound on communication in the sequential case: BW = Ω(d·n), L = Ω(d·n/M).
  • Two types of algorithms, depending on how their communication pattern is designed:
    • Taking into account the nonzero structure of the matrices (example: Buluc, Gilbert): BW = L = Ω(d^2·n).
    • Oblivious to the nonzero structure of the matrices (example: sparse Cannon, Buluc and Gilbert, for the parallel case): BW = Ω(d·n/p^(1/2)), #messages = Ω(p^(1/2)).
  • [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]

  48. Lower bounds for sparse Cholesky factorization
  • Consider A of size k^s x k^s resulting from a finite-difference operator on a regular grid of dimension s ≥ 2 with k^s nodes.
  • Its Cholesky factor L contains a dense lower triangular matrix of size k^(s-1) x k^(s-1).
  • BW = Ω(W / M^(1/2)) and L = Ω(W / M^(3/2)), where W = k^(3(s-1))/2 for a sequential algorithm and W = k^(3(s-1))/(2p) for a parallel algorithm. (A small evaluator follows.)
  • This result applies more generally to a matrix A whose graph G = (V,E) has the following property for some s: every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V − W; then the Cholesky factor of A contains a dense s x s submatrix.
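A small evaluator for these bounds on the model problem. It is an order-of-magnitude sketch that drops the constants hidden in the Ω, with the /M^(1/2) and /M^(3/2) pattern taken from the preceding slide; the function name and interface are mine.

```python
def sparse_cholesky_lower_bounds(k, s, M, p=1):
    """Lower-bound estimates for the s-dimensional k x ... x k model problem.

    W is the flop count associated with the dense k^(s-1) x k^(s-1) trailing
    block; p = 1 gives the sequential case."""
    W = k ** (3 * (s - 1)) / (2 * p)
    return {"words_moved": W / M ** 0.5, "messages": W / M ** 1.5}

# Example: 2D grid (s = 2) with k = 1024, fast memory of 2^20 words
print(sparse_cholesky_lower_bounds(k=1024, s=2, M=2 ** 20))
```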

  49. Optimal algorithms for sparse Cholesky factorization
  • PSPASES, with an optimal layout, could attain the lower bound in parallel for 2D/3D regular grids:
    • uses nested dissection to reorder the matrix;
    • distributes the matrix using the subtree-to-subcube algorithm;
    • the factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization.
  • The sequential multifrontal algorithm attains the lower bound:
    • the factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization.

  50. Open Problems / Work in Progress (1/2)
  • Rank-Revealing CAQR (RRQR)
    • A·P = Q·R, with P a permutation chosen to put the "most independent" columns first.
    • Does LAPACK geqp3 move O(n^3) words, or fewer?
    • Better: Tournament Pivoting (TP) on blocks of columns.
  • CALU with more reliable pivoting (Gu, Grigori, Khabou et al.)
    • Replace tournament pivoting of the panel (TSLU) with a rank-revealing QR of the panel.
  • LU with Complete Pivoting (GECP)
    • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the "most independent" rows and columns first.
    • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows.
  • LU of A = A^T, with symmetric pivoting
    • A = P·L·L^T·P^T, P a permutation, L lower triangular.
    • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
    • CA approach: choose a block with one step of CA-GECP, permute rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman.
  • A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al.)
    • L triangular, P a permutation, B banded.
    • B tridiagonal: due to Aasen.
    • CA approach: B block tridiagonal; stability still in question.
