CSE 160 - Lecture 11

  1. CSE 160 - Lecture 11 Computation/Communication Analysis - Matrix Multiply

  2. Granularity • Machine granularity has been defined as MFLOPS / (MB/sec) = flops/byte • This ratio indicates the machine's balance between computation speed and communication speed. • For parallel computation it is important to understand how much computation can be accomplished in the time it takes to send/receive a message

  3. Message Startup Latency • Granularity as defined only tells part of the story • If it told the whole story, then message startup latency would not be important. • Message startup latency - the time it takes to start sending a message of any length • This latency is approximated by measuring the latency of a zero-byte message • There are other measures that are important, too.
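
A common way to account for both effects is the simple latency + bandwidth model, t(n) = alpha + n/beta. The small C sketch below is a generic illustration (not from the slides); the 10 microsecond / 100 MB/sec constants anticipate the Myrinet numbers on the next slide.

#include <stdio.h>

/* time in microseconds to send n bytes, given startup latency alpha (us)
   and bandwidth beta (bytes per microsecond) */
static double msg_time_us(double n_bytes, double alpha_us, double beta_bytes_per_us) {
    return alpha_us + n_bytes / beta_bytes_per_us;
}

int main(void) {
    for (int n = 0; n <= 4096; n += 1024)           /* 100 MB/sec = 100 bytes/us */
        printf("%5d bytes -> %6.2f us\n", n, msg_time_us(n, 10.0, 100.0));
    return 0;
}

For small messages the alpha term dominates, which is exactly why the zero-byte latency matters.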

  4. Back of the envelope calculations • Suppose we have a 733 MHz Pentium III, Myrinet (100 MB/sec), and a zero-length message latency of 10 microseconds • Granularity is 733/100 = 7.3 • ~7 flops can be computed for every byte of data sent • A double-precision float is 8 bytes • 8*7 = 56 flops for every DP float sent on the network (hmmm) • In 10 microseconds the CPU can accomplish ~7330 flops • Every float takes ~0.08 microseconds to transmit • 100 MB/sec = 100 bytes/microsecond • 125 floats can be transmitted in one startup latency
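
The same arithmetic, written as a small C program so it can be checked or redone for a different machine; the constants are the slide's assumptions (733 MHz taken as ~733 Mflops, 100 MB/sec, 10 us latency).

#include <stdio.h>

int main(void) {
    double mflops     = 733.0;   /* peak, assuming ~1 flop per cycle */
    double mb_per_sec = 100.0;   /* Myrinet bandwidth */
    double latency_us = 10.0;    /* zero-byte message latency */

    double granularity    = mflops / mb_per_sec;        /* flops per byte  ~7.3  */
    double flops_per_dp   = 8.0 * granularity;          /* 8-byte double   ~56   */
    double flops_per_lat  = mflops * latency_us;        /* flops/us * us   ~7330 */
    double us_per_float   = 8.0 / mb_per_sec;           /* ~0.08 us */
    double floats_per_lat = latency_us / us_per_float;  /* ~125 */

    printf("granularity:        %.1f flops/byte\n", granularity);
    printf("flops per DP float: %.0f\n", flops_per_dp);
    printf("flops per latency:  %.0f\n", flops_per_lat);
    printf("us per float:       %.2f\n", us_per_float);
    printf("floats per latency: %.0f\n", floats_per_lat);
    return 0;
}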

  5. First Interpretation • For a 50:50 balance (comp/comm) • Compute ~7330 flops, transmit 1 float • If done serially (compute -> message -> compute -> …) • Throughput of the CPU is cut in half • Only computing 1/2 of the time, messaging the other half • For a 90:10 balance (comp/comm) • Compute 9*7330 ≈ 66,000 flops per float transmitted • If done serially, 90% of the time is spent computing, 10% messaging

  6. One more calculation • 733 MHz PIII, 100 Mbit Ethernet (roughly 10 MB/sec), 100 microsecond latency • Granularity is 733/10 ≈ 73 • In one latency period the CPU can do ~73,300 flops • 90:10 requires 9 * 73,300 ≈ 660,000 flops per float transmitted • If latency isn't the constraint, but transmission time is, then we have to balance the computation time with the communication time
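
Both 90:10 figures fall out of one formula: to spend a fraction f of the time computing, a node must do f/(1-f) times the flops that fit in one message startup. A minimal C sketch of that arithmetic, using the two machines' parameters from the slides (the 10 MB/sec figure is the slide's rough conversion of 100 Mbit Ethernet):

#include <stdio.h>

/* flops needed per message so that a fraction comp_frac of the time is compute */
static double flops_per_msg(double comp_frac, double mflops, double latency_us) {
    double flops_per_latency = mflops * latency_us;   /* flops doable during one startup */
    return (comp_frac / (1.0 - comp_frac)) * flops_per_latency;
}

int main(void) {
    printf("Myrinet  90:10 -> %.0f flops per float\n", flops_per_msg(0.9, 733.0, 10.0));
    printf("Ethernet 90:10 -> %.0f flops per float\n", flops_per_msg(0.9, 733.0, 100.0));
    return 0;
}

This prints roughly 66,000 and 660,000, matching the two back-of-the-envelope slides.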

  7. Matrix - Matrix Multiply • Given two square matrices (N x N) A, B, we want to multiply them together • The total number of flops is 2N³ = O(N³) • There are numerous ways to efficiently parallelize matrix multiply; we'll pick a simple method and analyze its communication and computation costs

  8. Matrix Multiply - Review (4x4 example)

    [A11 A12 A13 A14]   [B11 B12 B13 B14]   [C11 C12 C13 C14]
    [A21 A22 A23 A24] = [B21 B22 B23 B24] * [C21 C22 C23 C24]
    [A31 A32 A33 A34]   [B31 B32 B33 B34]   [C31 C32 C33 C34]
    [A41 A42 A43 A44]   [B41 B42 B43 B44]   [C41 C42 C43 C44]

In general, entry Akm is the dot product of the kth row of B with the mth column of C, e.g.
A32 = B31*C12 + B32*C22 + B33*C32 + B34*C42

  9. How much computation for NxN? • Akm = Σj Bkj Cjm, j = 1, 2, …, N • Count multiplies – N • Count adds – N • So every element of A takes 2N flops • There are N*N elements in the result • Total is (flops/element)*(#elements) • (2N)*(N*N) = 2N³
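
A plain triple-loop multiply with a flop counter makes the 2N³ count concrete. This is a sketch (row-major N x N arrays, A = B * C as in the slides), not the lecture's code:

#include <stdio.h>
#include <stdlib.h>

/* A = B * C, counting one multiply and one add per inner-loop step */
static long long matmul(double *A, const double *B, const double *C, int N) {
    long long flops = 0;
    for (int k = 0; k < N; k++)
        for (int m = 0; m < N; m++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++) {
                sum += B[k*N + j] * C[j*N + m];
                flops += 2;
            }
            A[k*N + m] = sum;
        }
    return flops;
}

int main(void) {
    int N = 4;
    double *A = calloc(N*N, sizeof(double));
    double *B = calloc(N*N, sizeof(double));
    double *C = calloc(N*N, sizeof(double));
    printf("N=%d: %lld flops (2N^3 = %d)\n", N, matmul(A, B, C, N), 2*N*N*N);
    free(A); free(B); free(C);
    return 0;
}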

  10. The matrix elements can be matrices! • Matrix elements can themselves be matrices. • E.g. B31 C12 would itself be a matrix multiply • We can think about matrix blocks of size qXq. • Total computation is (# of qXq block multiplies)*2q³, which is still 2N³ • Let's formulate two parallel algorithms for MM
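
The same computation restructured over qXq blocks: the three outer loops walk the (N/q) x (N/q) block grid, and each step is a qXq block multiply-accumulate. A sketch assuming N is divisible by q and that A starts zeroed:

/* blocked A = B * C; A must be zero-initialized by the caller */
void blocked_matmul(double *A, const double *B, const double *C, int N, int q) {
    for (int kb = 0; kb < N; kb += q)            /* block row of A */
        for (int mb = 0; mb < N; mb += q)        /* block col of A */
            for (int jb = 0; jb < N; jb += q)    /* accumulate Bkj * Cjm */
                for (int k = kb; k < kb + q; k++)
                    for (int m = mb; m < mb + q; m++) {
                        double sum = 0.0;
                        for (int j = jb; j < jb + q; j++)
                            sum += B[k*N + j] * C[j*N + m];
                        A[k*N + m] += sum;
                    }
}

There are (N/q)³ block multiplies of 2q³ flops each, so the total work is unchanged; only its order is.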

  11. First question to ask • What can be computed independently? • Does the computation of Akm depend at all on the computation of Apq? • Another way of asking: does the order in which we compute the elements of A matter? • Not for MM • In general, calculations will depend on previous results; this might limit the amount of parallelism

  12. Simple Parallelism • Divide the computation of A into blocks. • Assign each block to a different processor (16 in this case). • The computation of each block of A can be done in parallel. • In theory, this should be 16X faster than on one processor.

    [A11 A12 A13 A14]
    [A21 A22 A23 A24]    Computation of A mapped onto a 4x4 processor grid
    [A31 A32 A33 A34]
    [A41 A42 A43 A44]
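
One possible mapping of blocks to processors for this picture (the slide does not fix a numbering; 0-based, row-major ranks are an assumption):

/* block (i, j) of A, i, j = 0..p-1, is computed by rank i*p + j on a p x p grid */
static int owner_of_block(int i, int j, int p) {
    return i * p + j;          /* e.g. block (2,1) on a 4x4 grid -> rank 9 */
}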

  13. Next Questions to Ask • How will the three matrices be stored in a parallel program? Two choices: • Every node gets a complete copy of all the matrices (could run out of memory) • Distribute the storage to different computers so that each “node” holds only some part of each matrix • Can compute on much larger matrices!

  14. Every Processor gets a copy of the Matrices • This is a good paradigm for a shared-memory machine • Every processor can share the single copy of the matrices • For distributed-memory machines, we need p*p times more total memory on a pXp processor grid. • May not be practical for extremely large matrices – one 1000x1000 DP matrix is 8 MB. • If we ignore the cost of the initial distribution of the matrices across the multiple memories, then parallel MM runs p*p times faster than on a single CPU.
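
A quick check of the memory cost of full replication, using the slide's 1000x1000 double-precision example (the 4x4 grid size here is just an illustration):

#include <stdio.h>

int main(void) {
    int N = 1000, p = 4;                              /* p x p grid => p*p nodes */
    double one_matrix_mb = (double)N * N * 8 / 1e6;   /* 8 MB per DP matrix */
    double per_node_mb   = 3 * one_matrix_mb;         /* full copies of A, B, C */
    double total_mb      = p * p * per_node_mb;       /* p*p replicated copies */
    printf("%.0f MB/matrix, %.0f MB/node, %.0f MB across the grid\n",
           one_matrix_mb, per_node_mb, total_mb);
    return 0;
}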

  15. Case Two • Matrices A, B, and C are distributed so that each processor has only some part of the matrix. • We call this a parallel data decomposition • Examples include Block, Cyclic and Strip • Why would you do this? • If you only need to multiply two matrices, then today’s machines can locally store fairly large matrices, but • Codes often contain 10’s to 100’s of similar sized matrices. • Local memory starts to become a relatively scarce commodity

  16. Let’s Pick A Row Decomposition • Basic idea: assume P processors and N = q*P • Assign q rows of each matrix to each processor, so every block Akm is qXq

    Proc 0:  [A11 A12 A13 A14]   [B11 B12 B13 B14]   [C11 C12 C13 C14]
    Proc 1:  [A21 A22 A23 A24] = [B21 B22 B23 B24] * [C21 C22 C23 C24]
    Proc 2:  [A31 A32 A33 A34]   [B31 B32 B33 B34]   [C31 C32 C33 C34]
    Proc 3:  [A41 A42 A43 A44]   [B41 B42 B43 B44]   [C41 C42 C43 C44]

  17. What do we need to compute a block? • Consider a block Akm • We need the kth row of B, and the mth column of C • Where are these located? • All of the kth row of B is on the same processor that holds the kth row of A • Why? Our chosen decomposition of the data • The mth column is distributed among all processors • Why? Our chosen decomposition.

  18. Let’s assume some memory constraints • Each process has just enough memory to hold 3 qxN matrices • enough to get all the data in place to easily compute a row/column dot product • Can do this on each processor one qXq block at a time. • How much computation is needed to compute a single qXq entry in a row? • 2q³ flops • How much data needs to be transferred? • 3q² floats – need qXq blocks of data from the 3 neighbors (the rest of a column of C)

  19. Basic Algorithm - SPMD
Assume my processor id is z (z = 1, …, p), p is the number of processors, and blocks are qXq.

    for (j = 1; j <= p; j++) {
        for (k = 1; k <= p, k != z; k++)
            send Czj to processor k
        for (k = 1; k <= p, k != z; k++)
            receive Ckj from processor k
        compute Azj locally
        barrier();
    }
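
A minimal MPI rendering of this algorithm in C, under a few assumptions not in the slide: 0-based ranks, row-major qXN local slabs of A, B, and C on each rank, and MPI_Allgather in place of the explicit send/receive loops (it performs exactly that exchange of block-column j of C):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* copy block j (columns j*q .. j*q+q-1) of a q x N row-major slab into a contiguous q*q buffer */
static void pack_block(const double *slab, int q, int N, int j, double *blk) {
    for (int r = 0; r < q; r++)
        memcpy(blk + r*q, slab + r*N + j*q, q * sizeof(double));
}

/* A, B, C are this rank's q x N slabs; rank z holds block-row z of each matrix */
void row_decomposed_mm(double *A, const double *B, const double *C, int N, MPI_Comm comm) {
    int z, p;
    MPI_Comm_rank(comm, &z);
    MPI_Comm_size(comm, &p);
    int q = N / p;                                      /* assume N divisible by p */
    double *myblk = malloc((size_t)q * q * sizeof(double));
    double *colC  = malloc((size_t)p * q * q * sizeof(double));  /* block-column j of C */

    for (int j = 0; j < p; j++) {
        /* every rank contributes its Czj and ends up with all of block-column j
           (the slide's send/receive loops, done as one collective) */
        pack_block(C, q, N, j, myblk);
        MPI_Allgather(myblk, q*q, MPI_DOUBLE, colC, q*q, MPI_DOUBLE, comm);

        /* Azj = sum over k of Bzk * Ckj, all of which is now local */
        for (int r = 0; r < q; r++)
            for (int c = 0; c < q; c++) {
                double sum = 0.0;
                for (int k = 0; k < p; k++)
                    for (int t = 0; t < q; t++)
                        sum += B[r*N + k*q + t] * colC[(size_t)k*q*q + t*q + c];
                A[r*N + j*q + c] = sum;
            }
        MPI_Barrier(comm);                              /* matches the slide's barrier() */
    }
    free(myblk);
    free(colC);
}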

  20. Let’s compute some work • For each iteration, each process does the following • Computes 2q³ flops • Sends/receives 3q² floats • We’ve added the transmission of data to the basic algorithm. • For what size q does the time required for data transmission balance the time for computation?

  21. Some Calculations • Using just the granularity measure, how do we choose a minimum q (for 50:50)? • On Myrinet, need to perform 56 flops for every float transferred • Need to perform 2q³ flops/iteration. Assume a flop costs 1 unit. • Each float transferred “costs” 56 units • So when is • flop cost >= transfer cost • 2q³ >= 3*56*q², i.e., q >= 84 • For a block of that size, at 733 Mflops the 2q³ flops take about 1.6 ms
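
The balance condition reduces to q >= 1.5 * (flops per float), which a few lines of C can confirm along with the compute time per block (constants are the Myrinet numbers from the earlier slides):

#include <stdio.h>
#include <math.h>

int main(void) {
    double flops_per_float = 56.0;                 /* 7 flops/byte * 8 bytes */
    double mflops = 733.0;
    double q = ceil(1.5 * flops_per_float);        /* 2q^3 >= 3*56*q^2  =>  q >= 84 */
    double compute_us = 2.0 * q * q * q / mflops;  /* 2q^3 flops at 733 flops/us */
    printf("q >= %.0f, compute time per block ~ %.2f ms\n", q, compute_us / 1000.0);
    return 0;
}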

  22. What does 50:50 really mean? • On two processors, it takes the same amount of time as on one • On four, with scaling, it goes only twice as fast • 50% efficiency

  23. Final Thoughts • This decomposition of data is not close to optimal • We could get more parallelism by running on a pXp processor grid and having each processor do just one multiply • MM is highly parallelizable and better algorithms get good efficiencies even on workstations.
