
EE 193: Parallel Computing

Presentation Transcript


  1. EE 193: Parallel Computing Fall 2017 Tufts University Instructor: Joel Grodstein joel.grodstein@tufts.edu Lecture 7: Matrix multiply

  2. Goals • Primary goals: • Understand block-matrix algorithms

  3. Matrix multiplication • Matrix multiplication is the poster child for parallel computation. Some uses: • Computer graphics (e.g., linear algebra for transforming points between coordinate spaces) • Anything that uses linear algebra • Neural nets (perhaps the biggest use of GPGPUs) • And that's good – because matrix multiplication is easy, right? • It's easy to get it correct. • But not so easy to make it fast. EE 193 Joel Grodstein

  4. Matrix times vector • Let's start simple: a (4x4 matrix) * (4x1 vector). Write the eqns. Write the code. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 x0 x1 x2 x3 y0 y1 y2 y3 * = If N were big, what do you think about the temporal and spatial locality of this code? • for r=0…N-1 • for k=0…N-1 • y[r] += A[r,k]*x[k]; EE 193 Joel Grodstein
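A minimal runnable C++ sketch of this loop nest, for reference (the flat row-major arrays and the function name matvec are illustrative assumptions, not code from the slides):

    #include <vector>

    // Hypothetical sketch: A is an NxN matrix stored row-major in a flat array,
    // x and y are length-N vectors. Each output y[r] is the dot product of
    // row r of A with x.
    void matvec(int N, const std::vector<float>& A,
                const std::vector<float>& x, std::vector<float>& y)
    {
        for (int r = 0; r < N; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)     // walks row r of A sequentially
                sum += A[r*N + k] * x[k];   // A[r,k] * x[k]
            y[r] = sum;
        }
    }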

  5. Matrix times vector • Temporal: • each element of A is used only once. • each element of x is used N times. • this is only useful if x fits into cache (hopefully easy) • Spatial: • The inner loop accesses memory in row-major order • for r=0…N-1 • for k=0…N-1 • y[r] += A[r,k]*x[k]; A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 x0 x1 x2 x3 y0 y1 y2 y3 * = EE 193 Joel Grodstein

  6. Data for HSX and Pascal • Look at Haswell server compute and memory EE 193 Joel Grodstein

  7. Xeon compute-to-memory ratio • 125G operands/sec • Max DRAM bandwidth = 25G floats/sec • Note that 125 > 25. So • If we are reading our data from main memory (not from caches)… • then we cannot keep all FPUs in all cores busy • Note that 125/25 ≈ 5 • If every piece of data gets used once (from memory) and four more times (from cache)… • then we can keep all FPUs in all cores busy • Now let’s look at a consequence of this EE 193 Joel Grodstein

  8. Compute/memory ratio • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • How many computations are there? N² FMACs. • How many floats are read/written? N² + 2N. • The compute/memory ratio is roughly 1. • We want every piece of data to be reused 4 times – but it’s only used once, period • So a Xeon cannot get data in from memory fast enough to keep the FPUs busy. • Even perfect caches cannot fix this! A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 x0 x1 x2 x3 y0 y1 y2 y3 * = EE 193 Joel Grodstein

  9. What does it all mean? • Big-data problems do not fit in your registers. • They're unlikely to fit in L1. • They may fit in L2… or L3… or main memory… • If they fit into L1, all of this is irrelevant • since the BW and latency from L1 to RF is so good. • If they fit into main memory… • A chain is no stronger than its weakest link. • Cache can buffer, reuse, even out flow, etc. • If the compute/memory ratio is <5, then HSX cannot use all of its computes • But if the average DRAM bandwidth is the weak link, there is no way to fix that. EE 193 Joel Grodstein

  10. A slightly different problem • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • Related problem: what if A is fixed, and we want to stream in lots of x vectors. • Consider the work for one vector x producing one vector y, and assuming A is already in cache. • How many computations are there? N² FMACs. • How many floats are read/written? 2N. • The compute/memory ratio is N/2. • So HSX can deal with this for N≥10. • Note that computer graphics typically has only N=4. But the graphics pipeline has more than just one matrix multiply. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 x0 x1 x2 x3 y0 y1 y2 y3 * = EE 193 Joel Grodstein

  11. More on matrix*vector • for i=0…N-1 • for k=0…N-1 • y[i] += A[i,k]*x[k]; • Keep looking at the same problem: A is fixed, and we want to stream in lots of x vectors. • In this example, do we have any need for caches? • We need someplace to store the A matrix. • Our code may not wind up accessing the various x vectors exactly in address-sequential order; a cache can store the ones we skip until we use them. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 x0 x1 x2 x3 y0 y1 y2 y3 * = EE 193 Joel Grodstein

  12. Matrix times matrix • Let's move on to a bigger problem: matrix*matrix C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 * = • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein
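For reference, a straight C++ rendering of this r/c/k loop nest (the flat row-major storage and the function name are assumptions made for the sketch; C is assumed to start out zeroed):

    #include <vector>

    // Naive r/c/k ordering: the inner loop walks a row of A (good spatial
    // locality) but a column of B (one useful element per cache line).
    void matmul_rck(int N, const std::vector<float>& A,
                    const std::vector<float>& B, std::vector<float>& C)
    {
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                for (int k = 0; k < N; ++k)
                    C[r*N + c] += A[r*N + k] * B[k*N + c];
    }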

  13. Matrix times matrix • Let's move on to a bigger problem: matrix*matrix A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = How many computations are there? N³ FMACs. How many floats are read/written? 3N². The compute/memory ratio is N/3. So, in principle, this should be easy! • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein

  14. Matrix access patterns • Access pattern for A? • row major, repeat each line N times • row will mostly be accessed from a small, fast cache • Access pattern for B? • column major, repeat entire matrix N times • accessing from main memory, and slowly • Access pattern for C? • row major, repeat each element N times • extremely fast access A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = Can we fix this? • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein

  15. Our trick: swap the 2nd and 3rd loops. • Any change in the end-result computation? • We still execute the final line with the same r,c,k triplets; just in a different order. And adding the same numbers in a different order doesn't change the result. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for r=0…N-1 • for c=0…N-1 • for k=0…N-1 • C[r,c] += A[r,k]*B[k,c]; * = EE 193 Joel Grodstein

  16. In-class exercise • Draw matrix multiplication for 2x2 matrices on the board • Divide the class into two groups • one group works on the original loop ordering, and the other on the swapped ordering • Each group manually runs the code, and notes which A*B products got summed into which locations in C. • At the end, compare to see if they both added the same product pairs EE 193 Joel Grodstein

  17. Matrix access patterns • Access pattern for A? • row major, repeat each element N times • extremely fast access • Access pattern for B? • row major, repeat entire matrix N times • but if we cannot fit B into the L3, we must still keep reloading it from memory  • Access pattern for C? • row major, repeat each row N times • row will mostly be accessed from a small, fast cache A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = Can we fix this? • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein
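The same sketch with the 2nd and 3rd loops swapped, matching the r/k/c ordering above (again a hedged illustration, not the course's reference code):

    #include <vector>

    // r/k/c ordering: A[r,k] is hoisted into a register and reused N times,
    // while the inner loop walks row k of B and row r of C sequentially.
    void matmul_rkc(int N, const std::vector<float>& A,
                    const std::vector<float>& B, std::vector<float>& C)
    {
        for (int r = 0; r < N; ++r)
            for (int k = 0; k < N; ++k) {
                float a = A[r*N + k];
                for (int c = 0; c < N; ++c)
                    C[r*N + c] += a * B[k*N + c];
            }
    }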

  18. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; (r,k) = (0,0), (0,1), (0,2), (0,3) EE 193 Joel Grodstein

  19. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; (r,k) = (1,0), (1,1), (1,2), (1,3) EE 193 Joel Grodstein

  20. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; (r,k) = (2,0), (2,1), (2,2), (2,3) EE 193 Joel Grodstein

  21. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; (r,k) = (3,0), (3,1), (3,2), (3,3) EE 193 Joel Grodstein

  22. Let's go parallel. Assign each thread to just a few r values. We want to assign each thread to one core, and have each thread only access as much memory as can fit in L2. Does it work? • Each thread must now access only its own rows of A and C • Each thread still accesses every element of B; and once per value of r. • Conclusion: each thread must be able to hold all of B in its cache. Other than that, it parallelizes quite well. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for r=0…N-1 • for k=0…N-1 • for c=0…N-1 • C[r,c] += A[r,k]*B[k,c]; * = EE 193 Joel Grodstein
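A sketch of this row-partitioned parallelization with C++ threads (the chunking into contiguous row bands, the thread count, and all names are assumptions). No critical section is needed because each thread writes only its own rows of C:

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each thread owns the rows [r0, r1) of A and C, but reads all of B.
    static void rows_worker(int N, int r0, int r1,
                            const std::vector<float>& A,
                            const std::vector<float>& B,
                            std::vector<float>& C)
    {
        for (int r = r0; r < r1; ++r)
            for (int k = 0; k < N; ++k) {
                float a = A[r*N + k];
                for (int c = 0; c < N; ++c)
                    C[r*N + c] += a * B[k*N + c];
            }
    }

    void matmul_row_parallel(int N, const std::vector<float>& A,
                             const std::vector<float>& B, std::vector<float>& C,
                             int n_threads)
    {
        std::vector<std::thread> pool;
        int band = (N + n_threads - 1) / n_threads;   // rows per thread
        for (int t = 0; t < n_threads; ++t) {
            int r0 = t * band, r1 = std::min(N, r0 + band);
            if (r0 < r1)
                pool.emplace_back(rows_worker, N, r0, r1,
                                  std::cref(A), std::cref(B), std::ref(C));
        }
        for (auto& th : pool) th.join();
    }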

  23. Block-matrix algorithms • What happens if B doesn't fit into our relevant cache? Is this a real concern? • YES. Matrices can be big, and we want to be able to do linear algebra on small CPUs (e.g., for Folding@home)… and GPUs have small caches. • Let's try to change our algorithm to work even on small caches and big matrices. EE 193 Joel Grodstein

  24. Baby step • Consider the two programs: • for a=0..1 • for b=0..3 • i=4*a+b; • do something with i; • for i=0..7 • do something with i; • Do they get the same results? • Unless the "do something" is very tricky, yes they do. EE 193 Joel Grodstein

  25. Interpretation: matrix blocks A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 • Break the 4x4 matrix into 4 2x2 blocks • Block 0,0 • Block 0,1 • Block 1,0 • Block 1,1 EE 193 Joel Grodstein

  26. Do we believe that this has the same result as before? • Yes. But it is getting a bit confusing. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for rb=0…Nb-1 • for ri=0..Ni-1 (pick the row) • for kb=0..Nb-1 • for ki=0..Ni-1 (pick k) • for cb=0..Nb-1 • for ci=0..Ni-1 (pick the column) • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein

  27. Now swap some loops A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 * = • for rb=0…Nb-1 • for ri=0…Ni-1 • for kb=0…Nb-1 • for ki=0…Ni-1 • for cb=0…Nb-1 • for ci=0…Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; EE 193 Joel Grodstein

  28. What? Doesn't it only take two numbers to specify a block??? A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; * = Pick the block Pick the offset within a block EE 193 Joel Grodstein
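A hedged C++ sketch of this six-loop block version (it assumes Ni divides N evenly just to keep the sketch short; the function name is made up for illustration):

    #include <vector>

    // Blocked rb/kb/cb ordering with Nb = N/Ni blocks per dimension.
    // The three inner loops touch exactly one Ni x Ni block each of A, B, and C.
    void matmul_blocked(int N, int Ni, const std::vector<float>& A,
                        const std::vector<float>& B, std::vector<float>& C)
    {
        int Nb = N / Ni;                      // assumes N is a multiple of Ni
        for (int rb = 0; rb < Nb; ++rb)
          for (int kb = 0; kb < Nb; ++kb)
            for (int cb = 0; cb < Nb; ++cb)
              for (int ri = 0; ri < Ni; ++ri)
                for (int ki = 0; ki < Ni; ++ki) {
                    int r = rb*Ni + ri, k = kb*Ni + ki;
                    float a = A[r*N + k];
                    for (int ci = 0; ci < Ni; ++ci) {
                        int c = cb*Ni + ci;
                        C[r*N + c] += a * B[k*N + c];
                    }
                }
    }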

  29. Yes, it only takes two numbers to specify a block. But A, B and C each use a different index pair. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; * = rb cb kb Pick the three blocks 0 0 0 0 0 1 EE 193 Joel Grodstein

  30. Yes, it only takes two numbers to specify a block. But A, B and C each use a different index pair. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; * = rb cb kb Pick the block 0 0 1 0 1 1 EE 193 Joel Grodstein

  31. Yes, it only takes two numbers to specify a block. But A, B and C each use a different index pair. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; * = rb cb kb Pick the block 1 0 1 0 1 1 EE 193 Joel Grodstein

  32. Why is this useful? • We only work on three blocks at a time • We can pick the block size so that three blocks will fit into cache • Why isn’t this a free lunch? • Instead of traversing entire rows of the matrix, we only traverse rows of the block → take less advantage of spatial locality. • The cache turns over after each block triplet → our compute/mem ratio is really Ni/3 rather than N/3. Basically, we're not getting as much temporal reuse. A00A01A02A03 A10A11A12A13 A20A21A22A23 A30A31A32A33 B00B01B02B03 B10B11B12B13 B20B21B22B23 B30B31B32B33 C00C01C02C03 C10C11C12C13 C20C21C22C23 C30C31C32C33 • for each triplet of three blocks specified by rb, cb, kb • for each element combination ri, ci, ki within the blocks • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; * = EE 193 Joel Grodstein
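To make the cache constraint concrete (the 256 KB L2 is an assumed figure, not a number from the slides): three Ni x Ni blocks of 4-byte floats need 3·Ni²·4 bytes, so requiring 3·Ni²·4 ≤ 262144 gives Ni ≤ √(262144/12) ≈ 147; rounding down to a power of two, Ni = 128 fits with room to spare (3·128²·4 = 192 KB).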

  33. Look at blocks another way • Multiply two 16x16 matrices. = * EE 193 Joel Grodstein

  34. Look at blocks another way • Row 4, column 0 = * EE 193 Joel Grodstein

  35. Look at blocks another way • Row 4, column 1 = * EE 193 Joel Grodstein

  36. Look at blocks another way • Row 5, column 0 = * EE 193 Joel Grodstein

  37. Look at blocks another way • Block 1,0 = * EE 193 Joel Grodstein

  38. Look at blocks another way • Interpretation: block[1,0] = row[1] ∙ column[0] • The output block depends only on the input blocks. • We reused the same rows/columns over & over. = * EE 193 Joel Grodstein

  39. Block-matrix and threads • We have a block-matrix algorithm • We even understand it  • It should make good use of caches, and thus be quite fast. • Next problems: • ordering the inner and outer loops • multithreading • Simplifying assumption: • Pick the block size so that it fits into the L2 (or lower) cache • Use the L3 (i.e., ring cache) to get “some” reuse of blocks without going to main memory EE 193 Joel Grodstein

  40. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; Pick the offset within a block • Does it matter what order we put the three inner loops? After all, blocks fit into the L2. • Probably. It would be nice to get a lot of hits from the L1, to make sure our cores don’t wait for memory EE 193 Joel Grodstein

  41. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; Pick the block • Does it matter what order we put the three outer loops? After all, there’s enough computation that the initial block-loading time doesn’t matter. • Perhaps. But that depends on a lot of factors, and it’s easy enough to pick a good loop ordering. That will let us reload blocks from the L3 rather than from main memory. • What order to use? • rb/kb/cb looks good, for the same reasons as before. EE 193 Joel Grodstein

  42. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; Pick the block Pick the offset within a block • How many threads should we use? • Start with one thread per block triplet (i.e., one thread per rb/kb/cb triplet). We can change that if we want to. This goes in a thread EE 193 Joel Grodstein

  43. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; Pick the block thread (thread_func, rb, kb, cb); Pick the offset within a block • Start with one thread per block triplet (i.e., one thread per rb/kb/cb triplet). • How many threads are there? Nb³. • If we have 1Kx1K matrices, and 64x64 blocks, then how many threads are there? • 1K/64 = 16. So Nb=16, and there are 16³ = 4K threads • What do you think about the code above? • Way too many threads to launch all at once! This goes in a thread EE 193 Joel Grodstein

  44. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • thread (thread_func, rb, kb, cb); Pick the block • Another issue: do we need a critical section to avoid multiple threads writing the same data values? • Yes. All the threads with the same rb and cb try to write the same locations of the output matrix. • How many threads are fighting each other? • Nb (i.e., all the different values of kb) • If we use a critical section to protect the “C[r,c] += A[r,k]*B[k,c]”, how will that affect performance? • Basically, no two threads can run at the same time any more  • If we make each thread keep its own private work area and then merge them, how will that work? • Each thread will need to keep an entire block as its own work space. That’s potentially a lot of space. And the merging is (#threads)·Ni² flops, which is non-trivial EE 193 Joel Grodstein

  45. for rb=0…Nb-1 • for kb=0..Nb-1 • for cb=0..Nb-1 • for ri=0..Ni-1 • for ki=0..Ni-1 • for ci=0..Ni-1 • r = rb*Ni+ri; k = kb*Ni+ki; c = cb*Ni+ci; • C[r,c] += A[r,k]*B[k,c]; Pick the block • do { • launch 4 threads; • wait for them to finish; • } until you’ve gone through every one of the Nb³ triplets (in Nb³/4 groups) Pick the offset within a block • Any ideas? • What if we only launch a few at a time? With 4 cores, maybe launch 4 at a time? • Does it fix the problem of having too many threads at once? • Yes • Do we still need a critical section? • Probably not, since no two threads running at the same time share the same rb and cb, so they never write the same block of C. This goes in a thread EE 193 Joel Grodstein
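A hedged C++ sketch of this launch-in-groups scheme (the group size of 4, the names, and the flat row-major storage are all assumptions; the per-triplet work is the same three inner loops as in the blocked sketch above):

    #include <functional>
    #include <thread>
    #include <vector>

    // Work for one rb/kb/cb block triplet: the three inner loops over ri/ki/ci.
    static void block_triplet(int N, int Ni, int rb, int kb, int cb,
                              const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::vector<float>& C)
    {
        for (int ri = 0; ri < Ni; ++ri)
            for (int ki = 0; ki < Ni; ++ki) {
                int r = rb*Ni + ri, k = kb*Ni + ki;
                float a = A[r*N + k];
                for (int ci = 0; ci < Ni; ++ci) {
                    int c = cb*Ni + ci;
                    C[r*N + c] += a * B[k*N + c];
                }
            }
    }

    // Launch 'group' threads at a time (e.g., one per core) and join them before
    // launching the next group. With cb varying fastest and group <= Nb, no two
    // live threads share the same (rb, cb), so they write disjoint blocks of C
    // and no critical section is needed.
    void matmul_grouped(int N, int Ni, const std::vector<float>& A,
                        const std::vector<float>& B, std::vector<float>& C,
                        int group = 4)
    {
        int Nb = N / Ni;                      // assumes N is a multiple of Ni
        std::vector<std::thread> pool;
        for (int rb = 0; rb < Nb; ++rb)
          for (int kb = 0; kb < Nb; ++kb)
            for (int cb = 0; cb < Nb; ++cb) {
                pool.emplace_back(block_triplet, N, Ni, rb, kb, cb,
                                  std::cref(A), std::cref(B), std::ref(C));
                if ((int)pool.size() == group) {
                    for (auto& th : pool) th.join();
                    pool.clear();
                }
            }
        for (auto& th : pool) th.join();      // any leftover partial group
    }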

  46. Results on Norbert • for kb=0…Nb-1 • for rb=0..Nb-1 • for cb=0..Nb-1 • Spawn a thread to compute C[r,c] += A[r,k]*B[k,c]; • (for every ri,ci,ki in the three blocks) • if (we've spawned as many threads as cores) • wait for them all to finish; Launching a thread takes .01ms, so .3s total • Very slow. Why? • Another problem, too. • 1Kx1K matrices, 32x32 blocks → Nb=32. • How many threads? 32x32x32 = 32K. • How many groups of 32 threads (there are 32 cores)? 1024. • Different threads go at different rates. We have 1024 iterations of "wait for the slowest thread to finish." Hugely inefficient. EE 193 Joel Grodstein

  47. What to do? • for kb=0…Nb-1 • for rb=0..Nb-1 • for cb=0..Nb-1 • Spawn a thread to compute C[r,c] += A[r,k]*B[k,c]; • (for every r,c,k in the three blocks) • if (we've spawned as many threads as cores) • wait for them all to finish; • What are your thoughts on how to fix this? • We launched too many threads • We kept having the fastest thread wait for the slowest • You get to take your best shot in HW #3. • But you can brainstorm together here first EE 193 Joel Grodstein

  48. Brainstorming • Brainstorm ideas and write some metacode on the board • Questions to ask: • you don’t want to launch a million threads. Roughly how many threads do you want to launch? • Can you divide up the work between threads so that you don’t need a critical section? • You must initialize the output matrix to 0 before threads start summing partial products into it. Can you distribute this work among threads also? EE 193 Joel Grodstein

  49. SMT • Onto the next problem  • We've assumed 1 thread/core until now. Any problem with that? • SMT is often a good thing. It allows one thread to compute while the other thread is stalled anyway. • If we write the 1st thread really well, perhaps it will never stall anyway? • Let's try to use hyperthreading. • Can we spawn 2 threads/core instead of 1? EE 193 Joel Grodstein

  50. C++ threads can really suck • Let's say we divvy up the work on a block triplet among two threads in the same core • But wait – we have a big problem • The C++ thread library does not let you assign threads to cores; it lets the O/S do its best  • There’s no guarantee our two threads, using the same block triplet, will be assigned to the same core. • In fact, even if we only have 1 thread/core, a stupid enough O/S might put all the threads on 1 core. EE 193 Joel Grodstein
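On Linux you can drop below std::thread and set affinity through the native handle; this is a non-portable sketch (glibc only, may need _GNU_SOURCE), not something the C++ standard provides:

    #include <pthread.h>   // pthread_setaffinity_np (GNU extension)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <thread>

    // Pin an already-running std::thread to one core. Returns true on success.
    // std::thread has no portable equivalent; this relies on native_handle()
    // being a pthread_t on Linux.
    bool pin_to_core(std::thread& t, int core_id)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
    }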
