
  1. Minimizing Communication in Numerical Linear Algebra
  www.cs.berkeley.edu/~demmel
  Case Study: Matrix Multiply
  Jim Demmel, EECS & Math Departments, UC Berkeley, demmel@cs.berkeley.edu

  2. Why Matrix Multiplication?
  • An important kernel in many problems
  • Appears in many linear algebra algorithms
    • Bottleneck for dense linear algebra
    • Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
  • Optimization ideas can be used in other problems
  • The best case for optimization payoffs
  • The most-studied algorithm in high performance computing

  3. Matrix-multiply, optimized several ways
  [Figure: Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops]

  4. Note on Matrix Storage
  • A matrix is a 2-D array of elements, but memory addresses are “1-D”
  • Conventions for matrix layout
    • by column, or “column major” (Fortran default): A(i,j) at A + i + j*n
    • by row, or “row major” (C default): A(i,j) at A + i*n + j
    • recursive (later)
  • Column major (for now)

  Column major:          Row major:
   0  5 10 15             0  1  2  3
   1  6 11 16             4  5  6  7
   2  7 12 17             8  9 10 11
   3  8 13 18            12 13 14 15
   4  9 14 19            16 17 18 19

  [Figure: a row of a column-major matrix is scattered across cachelines. Source: Larry Carter, UCSD]
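
  As a quick sanity check, here is a minimal Python sketch of the two addressing formulas (the function names and the 5×4 example are ours, not from the slides):

```python
# Emulate a 5x4 matrix stored in flat 1-D memory under both conventions.
n_rows, n_cols = 5, 4

def addr_col_major(i, j):
    # Fortran-style: A(i,j) at A + i + j*n_rows
    return i + j * n_rows

def addr_row_major(i, j):
    # C-style: A(i,j) at A + i*n_cols + j
    return i * n_cols + j

# Row i=1 is strided (cacheline-unfriendly) in column major, contiguous in row major:
print([addr_col_major(1, j) for j in range(n_cols)])  # [1, 6, 11, 16]
print([addr_row_major(1, j) for j in range(n_cols)])  # [4, 5, 6, 7]
```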

  5. Using a Simple(r) Model of Memory to Optimize
  • Assume just 2 levels in the hierarchy, fast and slow
  • All data initially in slow memory
  • m = number of memory elements (words) moved between fast and slow memory
  • tm = time per slow memory operation
  • f = number of arithmetic operations
  • tf = time per arithmetic operation << tm
  • q = f / m = average number of flops per slow memory access
  • Minimum possible time = f * tf, when all data in fast memory
  • Actual time = f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
  • Larger q means time closer to minimum f * tf
    • q ≥ tm/tf needed to get at least half of peak speed
  • Computational intensity q: key to algorithm efficiency
  • Machine balance tm/tf: key to machine efficiency
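
  A small sketch of this two-level model in Python; tf and tm below are illustrative values we assume, not measurements:

```python
def model_time(f, m, tf, tm):
    # Actual time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q)), with q = f/m
    q = f / m
    return f * tf * (1 + (tm / tf) / q)

tf, tm = 1e-9, 1e-7   # assumed flop time and slow-memory time (tm/tf = 100)
f = 2e6               # arithmetic operations
for q in (1, tm / tf, 10 * tm / tf):
    slowdown = model_time(f, f / q, tf, tm) / (f * tf)
    print(f"q = {q:6.0f}: {slowdown:.2f}x the minimum time f*tf")
# q = tm/tf gives exactly 2x the minimum, i.e., half of peak speed
```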

  6. Warm up: Matrix-vector multiplication
  {implements y = y + A*x}
  for i = 1:n
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

  [Diagram: y(i) = y(i) + A(i,:) * x(:)]

  7. Warm up: Matrix-vector multiplication
  {read x(1:n) into fast memory}
  {read y(1:n) into fast memory}
  for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
  {write y(1:n) back to slow memory}
  • m = number of slow memory refs = 3n + n^2
  • f = number of arithmetic operations = 2n^2
  • q = f / m ≈ 2
  • Matrix-vector multiplication limited by slow memory speed
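
  The counts above in executable form (a sketch; the function name is ours):

```python
def matvec_counts(n):
    m = n + 2 * n + n * n   # read x, read and write y, read each row of A once
    f = 2 * n * n           # one multiply and one add per A(i,j)
    return f, m, f / m

f, m, q = matvec_counts(1000)
print(f, m, q)   # q approaches 2 as n grows
```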

  8. Modeling Matrix-Vector Multiplication
  • Compute time for an n×n = 1000×1000 matrix
  • Time = f * tf + m * tm = f * tf * (1 + tm/tf * 1/q)
    = 2*n^2 * tf * (1 + tm/tf * 1/2)
  • For tf and tm, using data from R. Vuduc’s PhD (pp. 351-3)
    http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
  • For tm use minimum-memory-latency / words-per-cache-line
    (machine balance; q must be at least this for ½ peak speed)

  9. Simplifying Assumptions
  • What simplifying assumptions did we make in this analysis?
    • Ignored parallelism between memory operations and arithmetic within the processor
      • Sometimes drop the arithmetic term in this type of analysis
    • Assumed fast memory was large enough to hold three vectors
      • Reasonable if we are talking about any level of cache
      • Not if we are talking about registers (~32 words)
    • Assumed the cost of a fast memory access is 0
      • Reasonable if we are talking about registers
      • Not necessarily if we are talking about cache (1-2 cycles for L1)
    • Assumed memory latency is constant
  • Could simplify even further by ignoring memory operations in the x and y vectors
    • Mflop rate/element = 2 / (2*tf + tm)

  10. Validating the Model
  • How well does the model predict actual performance?
    • Actual DGEMV: most highly optimized code for the platform
  • Model sufficient to compare across machines
  • But under-predicts performance on the most recent ones, due to the latency estimate

  11. Naïve Matrix Multiply
  {implements C = C + A*B}
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
  • Algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory
  • q potentially as large as 2*n^3 / 3*n^2 = O(n)

  [Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]

  12. Naïve Matrix Multiply
  {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}
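
  A direct Python transcription (0-based indices; numpy used only for test data), useful as a reference point for the blocked version later:

```python
import numpy as np

def naive_matmul(A, B, C):
    # {implements C = C + A*B}, exactly the three nested loops above
    n = C.shape[0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

n = 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(naive_matmul(A, B, np.zeros((n, n))), A @ B)
```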

  13. Naïve Matrix Multiply
  Number of slow memory references on unblocked matrix multiply:
  m =  n^3   to read each column of B n times
    + n^2   to read each row of A once
    + 2n^2  to read and write each element of C once
    = n^3 + 3n^2
  So q = f / m = 2n^3 / (n^3 + 3n^2) ≈ 2 for large n: no improvement over matrix-vector multiply
  • Inner two loops are just matrix-vector multiply, of row i of A times B
  • Similar for any other order of the 3 loops

  14. Matrix-multiply, optimized several ways
  [Figure: Speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops]

  15. Naïve Matrix Multiply on RS/6000
  [Plot: running time vs. matrix size; time grows like T ≈ N^4.7]
  • Size 2000 took 5 days; size 12000 would take 1095 years
  • O(N^3) performance would have constant cycles/flop, but performance looks like O(N^4.7)
  Slide source: Larry Carter, UCSD

  16. Naïve Matrix Multiply on RS/6000
  [Plot annotations, as problem size grows:]
  • Page miss every iteration
  • TLB miss every iteration
  • Cache miss every 16 iterations
  • Page miss every 512 iterations
  Slide source: Larry Carter, UCSD

  17. Blocked (Tiled) Matrix Multiply
  Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
  for i = 1 to N
    for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
        {read block A(i,k) into fast memory}
        {read block B(k,j) into fast memory}
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
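
  A runnable sketch of the tiled loop nest (numpy slices stand in for "blocks in fast memory"; the block multiply is delegated to @):

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    # N = n/b block rows/columns; each C(i,j) block is updated by N block multiplies
    n = C.shape[0]
    assert n % b == 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # "read block C(i,j) into fast memory"
            for k in range(0, n, b):
                # "read blocks A(i,k) and B(k,j)"; do a matrix multiply on blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 128, 32
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, np.zeros((n, n)), b), A @ B)
```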

  18. Blocked (Tiled) Matrix Multiply
  Recall:
    m is the amount of memory traffic between slow and fast memory
    the matrix has n×n elements, and N×N blocks each of size b×b
    f is the number of floating point operations, 2n^3 for this problem
    q = f / m is our measure of algorithm efficiency in the memory system
  So:
  m =  N*n^2  to read each block of B N^3 times (N^3 * b^2 = N^3 * (n/N)^2 = N*n^2)
    + N*n^2  to read each block of A N^3 times
    + 2n^2   to read and write each block of C once
    = (2N + 2) * n^2
  So computational intensity q = f / m = 2n^3 / ((2N + 2) * n^2) ≈ n/N = b for large n
  So we can improve performance by increasing the block size b
  Can be much faster than matrix-vector multiply (q = 2)

  19. Analyzing Machine Speed Limits
  The blocked algorithm has computational intensity q ≈ b
  • The larger the block size, the more efficient our algorithm will be
  • Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
  • Assume your fast memory has size Mfast:
    3b^2 ≤ Mfast, so q ≈ b ≤ (Mfast/3)^(1/2)
  • To build a machine to run matrix multiply at 1/2 the peak arithmetic speed of the machine, we need fast memory of size
    Mfast ≥ 3b^2 ≈ 3q^2 = 3(tm/tf)^2
  • This size is reasonable for L1 cache, but not for register sets
  • Note: the analysis assumes it is possible to schedule the instructions perfectly
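
  A back-of-envelope instance of this bound, assuming a machine balance tm/tf = 50 (an illustrative value, not from the slides):

```python
balance = 50                 # assumed tm/tf
q = balance                  # need q >= tm/tf to reach half of peak
Mfast = 3 * q**2             # words: Mfast >= 3*b^2 ~ 3*q^2
print(Mfast, "words ~", Mfast * 8 / 1024, "KiB with 8-byte words")
# 7500 words ~ 58.6 KiB: plausible for an L1 cache, far too large for a register set
```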

  20. Limits to Optimizing Matrix Multiply
  • The blocked algorithm changes the order in which values are accumulated into each C(i,j), by applying commutativity and associativity
    • Gets slightly different answers from the naïve code, because of roundoff; OK
  • The previous analysis showed that the blocked algorithm has computational intensity q ≈ b ≤ (Mfast/3)^(1/2)
  • There is a lower bound result that says we cannot do any better than this (using only associativity)
  • Theorem (Hong & Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O((Mfast)^(1/2))
    • Does not apply to algorithms like Strassen

  21. Lower bound for all “direct” linear algebra
  • Let M = “fast” memory size
    #words_moved = Ω(#flops / M^(1/2))
    #messages_sent = Ω(#flops / M^(3/2))
  • Holds for sequential algorithms for
    • Matmul, BLAS, LU, QR, eig, SVD, …
    • Some whole programs (sequences of these operations, no matter how they are interleaved, e.g., computing A^k)
    • Dense and sparse matrices (where #flops << n^3)
    • Some graph-theoretic algorithms (e.g., Floyd-Warshall)
  • Proof in Lecture 3

  22. What if there are more than 2 levels of memory?
  • Recall: the goal is to minimize communication between all levels
  • The tiled algorithm requires finding a good block size
    • Machine dependent
    • Need to “block” the b×b matrix multiply in the innermost loop
      • 1 level of memory ⇒ 3 nested loops (naïve algorithm)
      • 2 levels of memory ⇒ 6 nested loops
      • 3 levels of memory ⇒ 9 nested loops …
  • Cache-oblivious algorithms offer an alternative
    • Treat the n×n matrix multiply as a set of smaller problems
    • Eventually, these will fit in cache
    • Minimizes # words moved between every level of the memory hierarchy (between L1 and L2 cache, L2 and L3, L3 and main memory, etc.), at least asymptotically

  23. Recursive Matrix Multiplication (RMM) (1/2)
  • For simplicity: square matrices with n = 2^m
  • C = A · B, in terms of n/2 × n/2 blocks:

    [C11 C12]   [A11 A12]   [B11 B12]   [A11·B11 + A12·B21   A11·B12 + A12·B22]
    [C21 C22] = [A21 A22] · [B21 B22] = [A21·B11 + A22·B21   A21·B12 + A22·B22]

  • True whether each Aij etc. is 1×1 or n/2 × n/2

  func C = RMM(A, B, n)
    if n = 1, C = A * B, else
      { C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
        C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
        C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
        C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2) }
    return
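
  A minimal numpy sketch of RMM for n a power of 2, recursing all the way down to 1×1 exactly as in the pseudocode (real codes cut off the recursion much earlier):

```python
import numpy as np

def rmm(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    return np.block([
        [rmm(A11, B11) + rmm(A12, B21), rmm(A11, B12) + rmm(A12, B22)],
        [rmm(A21, B11) + rmm(A22, B21), rmm(A21, B12) + rmm(A22, B22)],
    ])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(rmm(A, B), A @ B)
```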

  24. Recursive Matrix Multiplication (2/2)
  (RMM as on the previous slide.)
  A(n) = # arithmetic operations in RMM(·, ·, n)
       = 8 · A(n/2) + 4(n/2)^2 if n > 1, else 1
       = 2n^3 … same operations as usual, in a different order
  M(n) = # words moved between fast and slow memory by RMM(·, ·, n)
       = 8 · M(n/2) + 4(n/2)^2 if 3n^2 > Mfast, else 3n^2
       = O(n^3 / (Mfast)^(1/2) + n^2) … same as blocked matmul

  25. Recursion: Cache-Oblivious Algorithms
  • Recursion for general A (m×n) * B (n×p):
    • Case 1: m ≥ max{n,p}: split A horizontally
    • Case 2: n ≥ max{m,p}: split A vertically and B horizontally
    • Case 3: p ≥ max{m,n}: split B vertically
  • Attains the lower bound in the O() sense
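
  A sketch of the three-case rectangular recursion (the cutoff and function name are ours; @ stands in for the micro-kernel discussed on the next slide):

```python
import numpy as np

def co_matmul(A, B, cutoff=16):
    m, n = A.shape
    p = B.shape[1]
    if max(m, n, p) <= cutoff:
        return A @ B                          # stand-in for a tuned micro-kernel
    if m >= max(n, p):                        # Case 1: split A horizontally
        h = m // 2
        return np.vstack([co_matmul(A[:h], B, cutoff),
                          co_matmul(A[h:], B, cutoff)])
    if n >= max(m, p):                        # Case 2: split A vertically, B horizontally
        h = n // 2
        return (co_matmul(A[:, :h], B[:h], cutoff) +
                co_matmul(A[:, h:], B[h:], cutoff))
    h = p // 2                                # Case 3: split B vertically
    return np.hstack([co_matmul(A, B[:, :h], cutoff),
                      co_matmul(A, B[:, h:], cutoff)])

A, B = np.random.rand(40, 70), np.random.rand(70, 30)
assert np.allclose(co_matmul(A, B), A @ B)
```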

  26. Experience with Cache-Oblivious Algorithms
  • In practice, need to cut off the recursion well before 1×1 blocks
    • Call a “micro-kernel” for small blocks, e.g., 16×16
  • Implementing a high-performance cache-oblivious code is not easy
    • Using a fully recursive approach with a highly optimized recursive micro-kernel, Pingali et al. report that they never got more than 2/3 of peak
  • Issues with the cache-oblivious (recursive) approach
    • Recursive micro-kernels yield less performance than iterative ones using the same scheduling techniques
    • Pre-fetching is needed to compete with the best code: not well understood in the context of cache-oblivious codes
  (Unpublished work, presented at LACSI 2006)

  27. Minimizing latency requires new data structures
  • To minimize latency, need to load/store a whole rectangular subblock of the matrix with one “message”
    • Incompatible with conventional columnwise (rowwise) storage
    • Ex: rows (columns) are not in contiguous memory locations
  • Blocked storage: store as a matrix of b×b blocks, each block stored contiguously
    • OK for one level of the memory hierarchy; what if there are more?
  • Recursive blocked storage: store each block using subblocks
    • Also known as “space-filling curves” or “Morton ordering”
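
  A sketch of the Morton-ordering idea: the index of block (i,j) comes from interleaving the bits of i and j, so nearby blocks stay nearby in memory at every scale (function name is ours):

```python
def morton_index(i, j, bits=16):
    # Interleave the bits of i and j (i in the odd positions, j in the even)
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z

# Storage order of a 4x4 grid of blocks:
for i in range(4):
    print([morton_index(i, j) for j in range(4)])
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```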

  28. Comparing matrix-vector to matrix-matrix mult
  [Figure; data source: Jack Dongarra. Matrix-matrix mult ≡ DGEMM, matrix-vector mult ≡ DGEMV]

  29. How hard is hand-tuning matmul, anyway?
  • Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
  • Students given “blocked” code to start with
  • Still hard to get close to vendor-tuned performance (ACML)
  • For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/

  30. How hard is hand-tuning matmul, anyway?

  31. What part of the Matmul Search Space Looks Like
  [Figure: a 2-D slice of a 3-D register-tile search space; x-axis = number of columns in the register block, y-axis = number of rows in the register block. The dark blue region was pruned. Finding a needle in a haystack! (Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler)]

  32. Automatic Performance Tuning
  • Goal: let the machine do the hard work of writing fast code
  • What does tuning of matmul, dense BLAS, FFTs, signal processing, etc. have in common?
    • Can do the tuning off-line: once per architecture and algorithm
    • Can take as much time as necessary (hours, a week…)
    • At run-time, algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
    • Examples: PHiPAC, ATLAS, FFTW, Spiral
  • Can’t always do tuning off-line
    • Algorithm and implementation may strongly depend on data known only at run-time
    • Ex: a sparse matrix’s nonzero pattern determines both the best data structure and the implementation of sparse-matrix-vector multiplication (SpMV)
    • Part of the search for the best algorithm must be done (very quickly!) at run-time
    • Example: OSKI

  33. Autotuning Matmul with ATLAS (n = 500)
  • ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor
  • ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, and Demmel
  Source: Jack Dongarra

  34. Parallel matrix-matrix multiplication
  • Consider distributed-memory machines
    • Each processor has its own private memory
    • Communication by sending messages over a network
    • Examples: MPI, UPC, Titanium
  • First question: how is the matrix initially distributed across the processors?

  35. Different Parallel Data Layouts for Matrices (not all!)
  1) 1D column blocked layout
  2) 1D column cyclic layout
  3) 1D column block cyclic layout (block size b)
  4) Row versions of the previous layouts
  5) 2D row and column blocked layout
  6) 2D row and column block cyclic layout (generalizes the others)

  36. Parallel Matrix-Vector Product
  • Compute y = y + A*x, where A is a dense matrix
  • Layout: 1D row blocked
    • A(i) refers to the n/p by n block row that processor i owns
    • x(i) and y(i) similarly refer to the segments of x, y owned by i
  • Algorithm:
    Foreach processor i
      Broadcast x(i)
      Compute y(i) = A(i)*x
  • Algorithm uses the formula
    y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)
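
  A serial sketch of this algorithm (the broadcast is modeled by letting every "processor" see all of x; sizes are illustrative):

```python
import numpy as np

n, p = 12, 4
w = n // p                            # rows per processor
A, x = np.random.rand(n, n), np.random.rand(n)
y = np.zeros(n)
for i in range(p):                    # foreach processor i
    # after all the broadcasts, processor i holds the full x
    y[i*w:(i+1)*w] += A[i*w:(i+1)*w, :] @ x    # y(i) = y(i) + A(i)*x
assert np.allclose(y, A @ x)
```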

  37. Matrix-Vector Product y = y + A*x
  • A column layout of the matrix eliminates the broadcast of x
    • But adds a reduction to update the destination y
  • A 2D blocked layout uses a broadcast and a reduction, both on a subset of processors
    • sqrt(p) processors for a square processor grid

  38. Parallel Matrix Multiply
  • Computing C = C + A*B
  • Using the basic algorithm: 2*n^3 flops
  • Depends on:
    • Data layout
    • Topology of machine
    • Scheduling of communication
  • Use of a simple performance model for algorithm design
    • Message Time = “latency” + #words * time-per-word = α + #words * β
  • Efficiency:
    • serial time / (p * parallel time)
    • perfect (linear) speedup ⇔ efficiency = 1

  39. Matrix Multiply with 1D Column Layout
  [Diagram: block columns owned by processors p0 … p7]
  • Assume matrices are n×n and n is divisible by p
    • (May be a reasonable assumption for analysis, not for code)
  • A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
  • B(i,j) is the n/p by n/p subblock of B(i) in rows j*n/p through (j+1)*n/p
  • Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)

  40. Matrix Multiply: 1D Layout on Bus or Ring
  • Algorithm uses the formula
    C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
  • First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (Ethernet)
  • Next consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously

  41. MatMul: 1D layout on Bus without Broadcast
  Naïve algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
    for i = 0 to p-1
      for j = 0 to p-1 except i
        if (myproc == i) send A(i) to processor j
        if (myproc == j)
          receive A(i) from processor i
          C(myproc) = C(myproc) + A(i)*B(i,myproc)
        barrier
  Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: α + β*n^2/p

  42. Naïve MatMul (continued)
  Cost of inner loop:
    computation: 2*n*(n/p)^2 = 2*n^3/p^2
    communication: α + β*n^2/p … approximately
  • Only one pair of processors (i and j) is active on any iteration, and of those, only i is doing computation
    ⇒ the algorithm is almost entirely serial
  • Running time = (p*(p-1) + 1)*computation + p*(p-1)*communication
    ≈ 2*n^3 + p^2*α + p*n^2*β
  • This is worse than the serial time, and grows with p
  • When might you still want to do this?

  43. Matmul for 1D layout on a Processor Ring
  • Pairs of adjacent processors can communicate simultaneously
    Copy A(myproc) into Tmp
    C(myproc) = C(myproc) + Tmp*B(myproc, myproc)
    for j = 1 to p-1
      Send Tmp to processor myproc+1 mod p
      Receive Tmp from processor myproc-1 mod p
      C(myproc) = C(myproc) + Tmp*B(myproc-j mod p, myproc)
  • Need to be careful about talking to neighboring processors
    • May want double buffering in practice for overlap
    • Ignoring deadlock details in code
  • Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2
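
  A serial simulation of the ring algorithm (a Python list stands in for the per-processor memories, and the rotation of Tmp models the send/receive pair; names follow the pseudocode above):

```python
import numpy as np

def ring_matmul(A, B, p):
    n = A.shape[0]
    w = n // p
    C = np.zeros((n, n))
    Tmp = [A[:, i*w:(i+1)*w].copy() for i in range(p)]  # Copy A(myproc) into Tmp
    for j in range(p):
        for me in range(p):
            src = (me - j) % p        # Tmp on "me" currently holds A(src)
            # C(me) = C(me) + Tmp * B(src, me)
            C[:, me*w:(me+1)*w] += Tmp[me] @ B[src*w:(src+1)*w, me*w:(me+1)*w]
        Tmp = Tmp[-1:] + Tmp[:-1]     # send Tmp to myproc+1 mod p, receive from myproc-1
    return C

n, p = 12, 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(ring_matmul(A, B, p), A @ B)
```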

  44. Matmul for 1D layout on a Processor Ring
  • Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2
  • Total Time = 2*n*(n/p)^2 + (p-1) * Time of inner loop
    ≈ 2*n^3/p + 2*p*α + 2*β*n^2
  • (Nearly) optimal for 1D layout on ring or bus, even with broadcast:
    • Perfect speedup for arithmetic
    • A(myproc) must move to each other processor; costs at least (p-1) * cost of sending n*(n/p) words
  • Parallel Efficiency = 2*n^3 / (p * Total Time)
    = 1/(1 + α * p^2/(2*n^3) + β * p/(2*n))
    = 1/(1 + O(p/n))
  • Grows to 1 as n/p increases (or α and β shrink)

  45. MatMul with 2D Layout
  • Consider processors in a 2D grid (physical or logical)
  • Processors communicate with 4 nearest neighbors
  • Assume p processors form a square s×s grid, s = p^(1/2)

  [Diagram: 3×3 grid of processors p(i,j); the C grid = the A grid * the B grid, blockwise]

  46. Cannon’s Algorithm
  … C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  … assume s = sqrt(p) is an integer
  forall i = 0 to s-1                        … “skew” A
    left-circular-shift row i of A by i
    … so that A(i,j) is overwritten by A(i, (j+i) mod s)
  forall i = 0 to s-1                        … “skew” B
    up-circular-shift column i of B by i
    … so that B(i,j) is overwritten by B((i+j) mod s, j)
  for k = 0 to s-1                           … sequential
    forall i = 0 to s-1 and j = 0 to s-1     … all processors in parallel
      C(i,j) = C(i,j) + A(i,j)*B(i,j)
    left-circular-shift each row of A by 1
    up-circular-shift each column of B by 1
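
  A serial numpy simulation of Cannon's algorithm (np.roll plays the role of the circular shifts; the block-grid bookkeeping is ours):

```python
import numpy as np

def cannon(A, B, s):
    n = A.shape[0]
    b = n // s
    # View matrices as s x s grids of b x b blocks: Xb[i, j] is block (i,j)
    Ab = A.reshape(s, b, s, b).swapaxes(1, 2).copy()
    Bb = B.reshape(s, b, s, b).swapaxes(1, 2).copy()
    Cb = np.zeros_like(Ab)
    for i in range(s):                            # skew
        Ab[i] = np.roll(Ab[i], -i, axis=0)        # row i of A left by i
        Bb[:, i] = np.roll(Bb[:, i], -i, axis=0)  # column i of B up by i
    for _ in range(s):
        for i in range(s):
            for j in range(s):
                Cb[i, j] += Ab[i, j] @ Bb[i, j]   # all processors in parallel
        Ab = np.roll(Ab, -1, axis=1)              # each row of A left by 1
        Bb = np.roll(Bb, -1, axis=0)              # each column of B up by 1
    return Cb.swapaxes(1, 2).reshape(n, n)

n, s = 12, 3
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(cannon(A, B, s), A @ B)
```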

  47. Cannon’s Matrix Multiplication
  Example: C(1,2) = A(1,0) * B(0,2) + A(1,1) * B(1,2) + A(1,2) * B(2,2)

  48. Initial Step to Skew Matrices in Cannon
  • Initial blocked input:
    A(0,0) A(0,1) A(0,2)      B(0,0) B(0,1) B(0,2)
    A(1,0) A(1,1) A(1,2)      B(1,0) B(1,1) B(1,2)
    A(2,0) A(2,1) A(2,2)      B(2,0) B(2,1) B(2,2)
  • After skewing, before the initial block multiplies:
    A(0,0) A(0,1) A(0,2)      B(0,0) B(1,1) B(2,2)
    A(1,1) A(1,2) A(1,0)      B(1,0) B(2,1) B(0,2)
    A(2,2) A(2,0) A(2,1)      B(2,0) B(0,1) B(1,2)

  49. Shifting Steps in Cannon
  All blocks of A must multiply all like-colored blocks of B
  • First step (after skewing):
    A(0,0) A(0,1) A(0,2)      B(0,0) B(1,1) B(2,2)
    A(1,1) A(1,2) A(1,0)      B(1,0) B(2,1) B(0,2)
    A(2,2) A(2,0) A(2,1)      B(2,0) B(0,1) B(1,2)
  • Second step (rows of A shifted left by 1, columns of B shifted up by 1):
    A(0,1) A(0,2) A(0,0)      B(1,0) B(2,1) B(0,2)
    A(1,2) A(1,0) A(1,1)      B(2,0) B(0,1) B(1,2)
    A(2,0) A(2,1) A(2,2)      B(0,0) B(1,1) B(2,2)
  • Third step:
    A(0,2) A(0,0) A(0,1)      B(2,0) B(0,1) B(1,2)
    A(1,0) A(1,1) A(1,2)      B(0,0) B(1,1) B(2,2)
    A(2,1) A(2,2) A(2,0)      B(1,0) B(2,1) B(0,2)

  50. Cost of Cannon’s Algorithm
  forall i = 0 to s-1                        … recall s = sqrt(p)
    left-circular-shift row i of A by i      … cost ≤ s*(α + β*n^2/p)
  forall i = 0 to s-1
    up-circular-shift column i of B by i     … cost ≤ s*(α + β*n^2/p)
  for k = 0 to s-1
    forall i = 0 to s-1 and j = 0 to s-1
      C(i,j) = C(i,j) + A(i,j)*B(i,j)        … cost = 2*(n/s)^3 = 2*n^3/p^(3/2)
    left-circular-shift each row of A by 1   … cost = α + β*n^2/p
    up-circular-shift each column of B by 1  … cost = α + β*n^2/p
  • Total Time = 2*n^3/p + 4*s*α + 4*β*n^2/s
  • Parallel Efficiency = 2*n^3 / (p * Total Time)
    = 1/(1 + α * 2*(s/n)^3 + β * 2*(s/n))
    = 1/(1 + O(sqrt(p)/n))
  • Grows to 1 as n/s = n/sqrt(p) = sqrt(data per processor) grows
  • Better than the 1D layout, which had efficiency = 1/(1 + O(p/n))
