
Introduction to Parallel Computing



Presentation Transcript


  1. Introduction to Parallel Computing

2. Multiprocessor Architectures
• Message-Passing Architectures
  • Separate address space for each processor.
  • Processors communicate via message passing.
• Shared-Memory Architectures
  • Single address space shared by all processors.
  • Processors communicate by memory reads/writes.
  • SMP or NUMA.
  • Cache coherence is an important issue.
• Lots of middle ground and hybrids; no clear consensus on terminology.

3. Message-Passing Architecture
[Diagram: each processor has its own cache and local memory; the processors are connected by an interconnection network and exchange data by messages.]

4. Shared-Memory Architecture
[Diagram: processors 1…N, each with its own cache, connected through an interconnection network to shared memory modules 1…M.]

5. Shared-Memory Architecture: SMP and NUMA
• SMP = Symmetric Multiprocessor
  • All memory is equally close to all processors.
  • Typical interconnection network is a shared bus.
  • Easier to program, but doesn’t scale to many processors.
• NUMA = Non-Uniform Memory Access
  • Each memory is closer to some processors than to others.
  • a.k.a. “Distributed Shared Memory”.
  • Typical interconnection network is a grid or hypercube.
  • Harder to program, but scales to more processors.

6. Shared-Memory Architecture: Cache Coherence
• Effective caching reduces memory contention.
• Processors must see a single consistent memory.
• Many different consistency models; weak consistency is sufficient.
• Snoopy cache coherence for bus-based SMPs.
• Distributed directories for NUMA.
• Many implementation issues: multiple levels, I/D separation, cache-line size, update policy, etc.
• Usually you don’t need to know all the details.

7. Example: Quad-Processor Pentium Pro
• SMP, bus interconnection.
• 4 x 200 MHz Intel Pentium Pro processors.
• 8 + 8 KB L1 cache per processor.
• 512 KB L2 cache per processor.
• Snoopy cache coherence.
• Systems from Compaq, HP, IBM, NetPower.
• Runs Windows NT, Solaris, Linux, etc.

8. Diplopodus
• Beowulf-based cluster of Linux/Intel workstations.
• 24 PCs connected by a 100 Mbit switch.
• Node configuration: 2 x 500 MHz Pentium III, 512 MB RAM, 12-16 GB disk.

9. The first program
• Purpose: illustrate notation.
• Given
  • Length of vectors M.
  • Data x_m, y_m, m = 0,1,…,M-1 of real numbers, and two real scalars α and β.
• Compute
  • z = αx + βy, i.e., z[m] = α x[m] + β y[m] for m = 0,1,…,M-1.

10. Program Vector_Sum_1

declare
  m: integer;
  x, y, z: array[0,1,…,M-1] of real;
initially
  <; m : 0 ≤ m < M :: x[m] = x_m, y[m] = y_m>
assign
  <|| m : 0 ≤ m < M :: z[m] = α x[m] + β y[m]>
end

Here || is a concurrency operator: if two operations O1 and O2 are separated by ||, i.e. O1 || O2, then the two operations can be performed concurrently, independently of each other. In addition, <|| m : 0 ≤ m < M :: O_m> is short for O_0 || O_1 || … || O_{M-1}, meaning that all M operations can be done concurrently.
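As a point of reference only (not in the original notes), here is a minimal C sketch of the same computation; the names vector_sum_1, M, alpha and beta are chosen here. Each iteration reads and writes only its own index m, which is exactly the independence that the || operator expresses.

    /* Sketch: z = alpha*x + beta*y for vectors of length M.
     * Every iteration touches only its own index m, so the
     * iterations may be executed in any order, or concurrently. */
    void vector_sum_1(int M, double alpha, double beta,
                      const double *x, const double *y, double *z)
    {
        for (int m = 0; m < M; m++)
            z[m] = alpha * x[m] + beta * y[m];
    }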

11. Sequential assignment
initially a = 1, b = 2
assign a := b; b := a
results in a = b = 2.

Concurrent assignment
initially a = 1, b = 2
assign a := b || b := a
results in a = 2, b = 1.
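The two cases can be mimicked in C as follows (an illustrative sketch, not from the notes); the concurrent assignment corresponds to evaluating both right-hand sides before either variable is updated.

    #include <stdio.h>

    int main(void)
    {
        /* Sequential assignment: a := b; b := a  gives  a = b = 2. */
        int a = 1, b = 2;
        a = b;                       /* a = 2 */
        b = a;                       /* b = 2 */

        /* Concurrent assignment: a := b || b := a  gives  a = 2, b = 1.
         * Both right-hand sides are evaluated before anything is written. */
        int c = 1, d = 2;
        int rhs_c = d, rhs_d = c;    /* read both old values first */
        c = rhs_c;                   /* c = 2 */
        d = rhs_d;                   /* d = 1 */

        printf("sequential: a=%d b=%d, concurrent: c=%d d=%d\n", a, b, c, d);
        return 0;
    }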

12. A model of a parallel computer
• P processors (nodes); p = 0,1,…,P-1.
• All processors are identical.
• All processors compute sequentially.
• Every node can communicate with any other node.
• The communication is handled by mechanisms for sending and receiving data at each processor.

13. Data distribution
• Suppose we distribute a vector x with M elements x_0,…,x_{M-1} over a collection of P identical computers.
• On each computer, define the index set J_p = {0,1,…,I_p-1}, where I_p is the number of indices stored at processor p.
• Assume I_0 + I_1 + … + I_{P-1} = M, so that
  x = (x_0,…,x_{I_0-1}, …, x_{M-1}),
  where the first I_0 entries are stored on processor 0, the next I_1 entries on processor 1, and so on, with the last I_{P-1} entries stored on processor P-1.

14. A proper data distribution defines a one-to-one mapping μ from a global index m to a local index i on a processor p.
• For a global index m, μ(m) gives a unique local index i on a unique processor p.
• Similarly, a local index i on processor p is uniquely mapped to a global index m = μ^{-1}(p,i).
• Globally: x_0,…,x_{M-1}.
• Locally: x_0,…,x_{I_0-1} on processor 0; x_0,…,x_{I_1-1} on processor 1; …; x_0,…,x_{I_{P-1}-1} on processor P-1.
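One common concrete realization of such a mapping is a block (contiguous) distribution. The C helpers below are an illustration only (the names are ours, not from the notes): block_size gives I_p, local_to_global implements μ^{-1}(p,i), and global_to_local implements μ(m), with the first M mod P processors holding one extra element.

    /* Block distribution of M global indices over P processors.
     * Processor p stores I_p = M/P (+1 for the first M%P processors)
     * consecutive global indices. */

    int block_size(int M, int P, int p)            /* I_p */
    {
        return M / P + (p < M % P ? 1 : 0);
    }

    int block_start(int M, int P, int p)           /* first global index on p */
    {
        int q = M / P, r = M % P;
        return p * q + (p < r ? p : r);
    }

    int local_to_global(int M, int P, int p, int i)   /* m = mu^{-1}(p, i) */
    {
        return block_start(M, P, p) + i;
    }

    void global_to_local(int M, int P, int m, int *p, int *i)   /* mu(m) = (p, i) */
    {
        int q = M / P, r = M % P;
        if (m < r * (q + 1)) {       /* inside the blocks of size q+1 */
            *p = m / (q + 1);
            *i = m % (q + 1);
        } else {                     /* inside the blocks of size q */
            *p = r + (m - r * (q + 1)) / q;
            *i = (m - r * (q + 1)) % q;
        }
    }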

15. Purpose:
• Derive a multicomputer version of Vector_Sum_1.
• Given
  • Length of vectors M.
  • Data x_m, y_m, m = 0,1,…,M-1 of real numbers, and two real scalars α and β.
  • Number of processors P.
  • Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
  • A one-to-one mapping between global and local indices.
• Compute z = αx + βy, i.e., z[m] = α x[m] + β y[m] for m = 0,1,…,M-1.

16. Program Vector_Sum_2 (one identical instance runs on each processor p, concurrently for p = 0,…,P-1)

declare
  i: integer;
  x, y, z: array[J_p] of real;
initially
  <; i : i ∈ J_p :: x[i] = x_{μ^{-1}(p,i)}, y[i] = y_{μ^{-1}(p,i)}>
assign
  <|| i : i ∈ J_p :: z[i] = α x[i] + β y[i]>
end

Notice that we have one program for each processor, all programs being identical. In each program the identifier p is known, and the mapping μ is also assumed to be known. The result is stored in a distributed vector z.
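For comparison only, the same computation written with MPI in C (a sketch; the original notes use the abstract notation above, not MPI, and the values of M, alpha, beta and the stand-in data are chosen here). Each process owns its block of x, y and z and updates it without any communication.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int P, p;
        MPI_Comm_size(MPI_COMM_WORLD, &P);   /* number of processors */
        MPI_Comm_rank(MPI_COMM_WORLD, &p);   /* this processor's id  */

        const int M = 1000;                  /* global vector length (example) */
        const double alpha = 2.0, beta = 3.0;
        int Ip = M / P + (p < M % P ? 1 : 0);     /* block distribution */

        double *x = malloc(Ip * sizeof *x);
        double *y = malloc(Ip * sizeof *y);
        double *z = malloc(Ip * sizeof *z);
        for (int i = 0; i < Ip; i++) { x[i] = 1.0; y[i] = 2.0; }  /* stand-in data */

        for (int i = 0; i < Ip; i++)         /* the entire parallel computation */
            z[i] = alpha * x[i] + beta * y[i];

        free(x); free(y); free(z);
        MPI_Finalize();
        return 0;
    }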

17. Performance analysis
Let P be the number of processors, and let T = T(P) denote the execution time of a program on this multicomputer. Performance analysis is the study of the properties of T(P). In order to analyze concurrent algorithms, we have to assume certain properties of the computer. In fact, these assumptions are rather strict and thus leave out a lot of existing computers. On the other hand, without these assumptions the analysis tends to be extremely complicated.

18. Observation
• Let T(1) be the fastest possible scalar (single-processor) execution time; then T(P) ≥ T(1)/P. This relation states a bound on how fast a computation can be done on a parallel computer compared with a scalar computer.

19. Definitions
• Speed-up: the speed-up of a P-node computation with execution time T(P) is given by S(P) = T(1)/T(P).
• Efficiency: the efficiency of a P-node computation with speed-up S(P) is given by η(P) = S(P)/P.
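For instance (with timings invented purely for illustration): if T(1) = 100 s and T(8) = 20 s, then S(8) = 100/20 = 5 and η(8) = 5/8 ≈ 0.63.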

20. Discussion
• Suppose we are in the optimal situation, i.e., we have T(P) = T(1)/P. Then the speed-up is given by S(P) = T(1)/T(P) = P, and the efficiency is η(P) = S(P)/P = 1.

21. More generally we have T(P) ≥ T(1)/P, which implies that S(P) = T(1)/T(P) ≤ P and η(P) = S(P)/P ≤ 1. In practical computations we are pleased if we are close to the optimal results: a speed-up close to P and an efficiency close to 1 is very good. Practical details often result in weaker performance than the analysis would suggest.

22. Efficiency modelling
• Goal: estimate how fast a certain algorithm can run on a multicomputer. The models depend on the following parameters:
• τ_A = arithmetic time: the time of one single arithmetic operation. Integer operations are ignored, and all nodes are assumed equal.
• τ_C(L) = message exchange time: the time it takes to send a message of length L (in proper units) from one processor to another. We assume this time is the same for any pair of processors.
• τ_L = latency: the start-up time of a communication, i.e., the time it takes to send a message of length zero.
• 1/β = bandwidth: the maximum rate (in proper units) at which messages can be exchanged.

23. Efficiency modelling
In our efficiency models we will assume a linear relation between the message exchange time and the length of the message: τ_C(L) = τ_L + β·L.
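Written out as code, the model is a one-liner (the parameter names are ours; the notes only fix the linear form):

    /* Linear communication-cost model: tau_C(L) = tau_L + beta * L,
     * where tau_L is the latency and 1/beta is the bandwidth. */
    double comm_time(double tau_L, double beta, double L)
    {
        return tau_L + beta * L;
    }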

24. Analysis of Vector_Sum_2
Recall the program (one instance per processor p = 0,…,P-1):

declare
  i: integer;
  x, y, z: array[J_p] of real;
initially
  <; i : i ∈ J_p :: x[i] = x_{μ^{-1}(p,i)}, y[i] = y_{μ^{-1}(p,i)}>
assign
  <|| i : i ∈ J_p :: z[i] = α x[i] + β y[i]>
end

Recall that J_p = {0,1,…,I_p-1}, and define I = max_p I_p. Then a model of the execution time is given by T(P) = 3 max_p I_p τ_A = 3 I τ_A. Notice that there are three arithmetic operations (two multiplications and one addition) for each entry of the array.

25. Load balancing
• Obviously, we would like to balance the load of the processors: each of them should perform approximately the same number of operations (recall that we assume all processors have the same capacity).
• In the notation used in the present vector operation, we have load balance if I is as small as possible.
• In the case that M (the number of array entries) is a multiple of P (the number of processors), we have load balance if I = M/P, meaning that there are equally many vector entries on each processor.

26. Speed-up
For this problem the speed-up is S(P) = T(1)/T(P) = 3M τ_A / (3I τ_A) = M/I. If the problem is load balanced, we have I = M/P and thus S(P) = P, which is optimal. Notice that we are typically interested in very large values of M, say M = 10^6–10^9, while the number of processors P is usually below 1000.

27. The communication cost
In the above example, no communication at all was necessary. In the next example, one real number must be communicated. This changes the analysis a bit!

28. The communication cost
• Purpose: derive a multicomputer program for the computation of an inner product.
• Given
  • Length of vectors M.
  • Data x_m, y_m, m = 0,1,…,M-1 of real numbers.
  • Number of processors P.
  • Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
  • A one-to-one mapping between global and local indices.
• Compute δ = (x,y), i.e., δ = x[0]y[0] + x[1]y[1] + … + x[M-1]y[M-1].

29. Program Inner_Product (one identical instance runs on each processor p, concurrently for p = 0,…,P-1)

declare
  i: integer;
  w: array[0,1,…,P-1] of real;
  x, y: array[J_p] of real;
initially
  <; i : i ∈ J_p :: x[i] = x_{μ^{-1}(p,i)}, y[i] = y_{μ^{-1}(p,i)}>
assign
  w[p] = <+ i : i ∈ J_p :: x[i] y[i]>;
  send w[p] to all;
  <; q : 0 ≤ q < P and q ≠ p :: receive w[q] from q>;
  δ = <+ q : 0 ≤ q < P :: w[q]>;
end
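Again purely as an illustration (not part of the original notes; M and the stand-in data are chosen here), the same algorithm with MPI in C. MPI_Allgather reproduces the "send w[p] to all, receive w[q] from every q" pattern literally; in practice one would usually call MPI_Allreduce instead.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int P, p;
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &p);

        const int M = 1000;                       /* global length (example) */
        int Ip = M / P + (p < M % P ? 1 : 0);     /* block distribution */

        double *x = malloc(Ip * sizeof *x);
        double *y = malloc(Ip * sizeof *y);
        for (int i = 0; i < Ip; i++) { x[i] = 1.0; y[i] = 2.0; }  /* stand-in data */

        double wp = 0.0;                          /* w[p]: local partial sum */
        for (int i = 0; i < Ip; i++)
            wp += x[i] * y[i];

        double *w = malloc(P * sizeof *w);        /* exchange all partial sums */
        MPI_Allgather(&wp, 1, MPI_DOUBLE, w, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        double delta = 0.0;                       /* the inner product (x, y) */
        for (int q = 0; q < P; q++)
            delta += w[q];

        if (p == 0) printf("(x,y) = %g\n", delta);
        free(x); free(y); free(w);
        MPI_Finalize();
        return 0;
    }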

30. Performance modelling of Inner_Product
• Recall J_p = {0,1,…,I_p-1} and I = max_p I_p.
• A model of the execution time for Inner_Product is given by
  T(P) = (2I-1) τ_A + (P-1) τ_C(1) + (P-1) τ_A.
Here the first term arises from the sum of x[i]y[i] over the local i values (I_p multiplications and I_p-1 additions). The second term arises from the cost of sending one real number from one processor to all the others. The third term arises from the computation of the inner product based on the partial sums from all processors.

31. Simplifications
Assume I = M/P, i.e., a load-balanced problem. Assume (as always) P ≪ M, and τ_C(1) = γ τ_A (for practical computers γ is quite large, 50-1000). We then have T(P) ≈ 2I τ_A + P τ_C(1), or T(P) ≈ (2M/P + γP) τ_A.

32. Example I
• Choosing M = 10^5 and γ = 50, we get T(P) = (2·10^5/P + 50P) τ_A.

33. Example II
• Choosing M = 10^7 and γ = 50, we get T(P) = (2·10^7/P + 50P) τ_A.
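The model is easy to tabulate; the short C program below (ours, using the formula from the notes) prints T(P)/τ_A for M = 10^5, γ = 50 and compares with the processor count that minimizes the model, P* = sqrt(2M/γ), obtained by setting dT/dP = 0.

    #include <math.h>
    #include <stdio.h>

    /* Evaluate T(P) = (2M/P + gamma*P) * tau_A, reported in units of tau_A. */
    int main(void)
    {
        const double M = 1e5, gam = 50.0;
        for (int P = 1; P <= 1024; P *= 2) {
            double T = 2.0 * M / P + gam * P;
            printf("P = %4d   T(P)/tau_A = %10.1f\n", P, T);
        }
        printf("model minimum near P* = sqrt(2M/gamma) = %.1f\n",
               sqrt(2.0 * M / gam));
        return 0;
    }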

34. Speed-up
For this problem the speed-up is
S(P) = T(1)/T(P) ≈ [(2M + γ) τ_A] / [(2M/P + γP) τ_A] = P [1 + γ/(2M)] / [1 + γP²/(2M)].
Optimal speed-up is characterized by S(P) ≈ P, so we must require γP²/(2M) ≪ 1 for this to be the case.
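For instance, with the numbers from Example I (M = 10^5, γ = 50) the condition reads P² ≪ 2M/γ = 4000, i.e. P well below about 63; with M = 10^7 (Example II) the corresponding bound is roughly 630.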
