
Lecture # 3: A Primer on How to Design Parallel Algorithms


Presentation Transcript


1. Lecture # 3: A Primer on How to Design Parallel Algorithms — Shantanu Dutt, University of Illinois at Chicago

2. Parallel Algorithm Design—A Process/Data View
• Break up the computation in either a data-parallel or a functional-parallel way, depending on the problem at hand
• Data-parallel breakup:
  • Easier to generalize than a functional breakup, and more structured
  • Input data parallel (e.g., the max and general associative computation, finite-element computations—the heat equation problem), or
  • Output data parallel (e.g., matrix multiplication)
• Each data partition defines a computational partition, or process, that works on that partition
• 1st attempt: break up somehow, then determine how to communicate to solve the problem
• 2nd attempt: break up so that the required communication complexity (topology-independent) is minimized
• 3rd attempt: break up so that the communication complexity on a particular topology is minimized
• I/P data that is linear or unstructured (e.g., associative functions) is easy to partition into P pieces of size N/P (see the index-range sketch below)
• I/P data that is dimensional or structured (e.g., a finite-element grid) can be partitioned in the same dimensionality or in fewer dimensions; the choice is based on the resulting communication pattern and the target topology
[Figure: a finite-element grid, with inter-process communication shown for (a) a 2D (full-dimensional) partition and (b) a 1D (reduced-dimensional) partition]
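
As an aside on the "P pieces of size N/P" bullet above, a tiny C helper sketching how such a 1D block breakup is commonly computed; the handling of a remainder when P does not divide N is an added assumption, not something stated on the slide.

#include <stdio.h>

/* Compute the half-open index range [lo, hi) of the i-th of P roughly equal
 * pieces of an N-element input; the first N % P pieces get one extra element. */
static void block_range(long N, int P, int i, long *lo, long *hi) {
    long base = N / P, extra = N % P;
    *lo = i * base + (i < extra ? i : extra);
    *hi = *lo + base + (i < extra ? 1 : 0);
}

int main(void) {
    long N = 1000, lo, hi;
    int P = 8;
    for (int p = 0; p < P; p++) {          /* one piece per process */
        block_range(N, P, p, &lo, &hi);
        printf("process %d owns [%ld, %ld)\n", p, lo, hi);
    }
    return 0;
}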

3. An example parallel algorithm for a finite-element computation
• Simple, structured, and sparse communication is needed.
• Example: the heat equation
  • The initial temperature is zero on the boundaries and high in the middle
  • The boundary temperature is held at zero
  • The calculation of an element depends on its neighboring elements' data
[Figure: the input data laid out as data1, data2, ..., dataN]
(Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur)

4. Master/worker pseudocode (the master can be one of the workers)

find out if I am MASTER or WORKER
if I am MASTER
    initialize array
    send each WORKER starting info and subarray
    do until all WORKERS converge
        gather from all WORKERS convergence data
        broadcast to all WORKERS convergence signal
    end do
    receive results from each WORKER
else if I am WORKER
    receive from MASTER starting info and subarray
    do until solution converged {
        update time
        send (non-blocking?) neighbors my border info
        receive (non-blocking?) neighbors border info
        update interior of my portion of solution array (see the computation in the serial code)
        wait for non-blocking communication (if any) to complete
        update border of my portion of solution array
        determine if my solution has converged
        if so { send MASTER convergence signal
                receive from MASTER convergence signal }
    end do }
    send MASTER results
endif

Serial code:
    do y = 2, N-1
        do x = 2, M-1
            u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y)] + cy*[u1(x,y+1) + u1(x,y-1)]   /* cx, cy are constants */
        enddo
    enddo
    u1 = u2;

[Figure: the problem grid, the master, and the workers]
(Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur)
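
The worker side of this pseudocode might look roughly as follows in C with MPI, assuming a 1D row-block partition of an N×N grid with P dividing N. This is only a sketch: the convergence test uses an MPI_Allreduce in place of the master's gather-and-broadcast, the constants (N, MAX_ITERS, TOL, cx, cy) are illustrative, and the update formula is copied from the slide's serial code as-is.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define N         512      /* global grid is N x N (illustrative) */
#define MAX_ITERS 1000
#define TOL       1e-4

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int rows = N / P;                 /* rows per process; assume P divides N */
    /* local block of rows plus one halo row above and one below */
    double *u1 = calloc((size_t)(rows + 2) * N, sizeof(double));
    double *u2 = calloc((size_t)(rows + 2) * N, sizeof(double));
    double cx = 0.1, cy = 0.1;        /* constants, as in the serial code */

    /* initial condition: zero on the boundaries, hot in the middle */
    for (int i = 1; i <= rows; i++) {
        int gr = rank * rows + (i - 1);              /* global row index */
        for (int j = N / 4; j < 3 * N / 4; j++)
            if (gr > N / 4 && gr < 3 * N / 4) u1[i * N + j] = 100.0;
    }

    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < MAX_ITERS; iter++) {
        MPI_Request req[4];
        /* "send/receive neighbors my border info" (non-blocking) */
        MPI_Isend(&u1[1 * N],          N, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(&u1[rows * N],       N, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(&u1[0],              N, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
        MPI_Irecv(&u1[(rows + 1) * N], N, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* "update interior of my portion" while the halo exchange is in flight */
        double local_diff = 0.0;
        for (int i = 2; i < rows; i++)
            for (int j = 1; j < N - 1; j++) {
                u2[i*N+j] = u1[i*N+j] + cx * (u1[(i+1)*N+j] + u1[(i-1)*N+j])
                                      + cy * (u1[i*N+j+1]   + u1[i*N+j-1]);
                local_diff = fmax(local_diff, fabs(u2[i*N+j] - u1[i*N+j]));
            }

        /* "wait for non-blocking communication to complete", then "update border" */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        int border[2] = {1, rows};
        for (int b = 0; b < (rows > 1 ? 2 : 1); b++) {
            int i = border[b], gr = rank * rows + (i - 1);
            if (gr == 0 || gr == N - 1) continue;    /* physical boundary stays 0 */
            for (int j = 1; j < N - 1; j++) {
                u2[i*N+j] = u1[i*N+j] + cx * (u1[(i+1)*N+j] + u1[(i-1)*N+j])
                                      + cy * (u1[i*N+j+1]   + u1[i*N+j-1]);
                local_diff = fmax(local_diff, fabs(u2[i*N+j] - u1[i*N+j]));
            }
        }

        /* convergence test: an MPI_Allreduce stands in for the MASTER's
           gather and broadcast of convergence signals */
        double global_diff;
        MPI_Allreduce(&local_diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        if (global_diff < TOL) break;
        memcpy(&u1[N], &u2[N], (size_t)rows * N * sizeof(double));   /* u1 = u2 */
    }

    if (rank == 0) printf("done\n");
    free(u1); free(u2);
    MPI_Finalize();
    return 0;
}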

5. Parallel Algorithm Design—A Process/Data View: Dimensional Partitioning
• A full-dimensional partition has the smallest circumference-to-area ratio (boundary data that must be communicated to neighboring processes, versus the total of N/P data elements per process), and thus the least topology-independent communication cost.
• A 2D partition (Fig. a) of a 2D grid of n×n points, where n = sqrt(N), gives partitions of size [n/sqrt(P)] × [n/sqrt(P)]. The total data that must be communicated per process (per iteration of a finite-element computation) is 2·[n/sqrt(P)] (for corner partitions) to 4·[n/sqrt(P)] (for interior partitions).
• A 1D partition (Fig. b) of the same grid gives partitions of size n × (n/P), requiring n to 2n data items to be communicated, which is about sqrt(P) times more data than for a 2D partition.
• However, once the topology of the target parallel system is considered, a 1D partition is suitable for both a linear and a 2D topology, while a 2D partition is suitable for a 2D topology but expensive in terms of average message hops (and thus also in terms of contention/conflicts) on a linear topology.
[Figure: inter-process communication on a finite-element grid for (a) a 2D (full-dimensional) partition and (b) a 1D (reduced-dimensional) partition]
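
Writing out the counts above for interior partitions (corner/end partitions communicate roughly half as much), with n = sqrt(N):

\[
D_{\mathrm{2D}} = 4\,\frac{n}{\sqrt{P}} = 4\sqrt{\frac{N}{P}},
\qquad
D_{\mathrm{1D}} = 2n = 2\sqrt{N},
\qquad
\frac{D_{\mathrm{1D}}}{D_{\mathrm{2D}}} = \frac{2\sqrt{N}}{4\sqrt{N/P}} = \frac{\sqrt{P}}{2} = \Theta(\sqrt{P}).
\]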

6. An example parallel algorithm for a finite-element computation: Analysis
• Let L be the maximum distance of a grid point from the heat source. The number of iterations needed for convergence will be some function f(L) of L.
• Assume a linear (1D) partition of the input data and near-neighbor communication only (possible on most topologies for a linear data partitioning).
• Per-iteration computation time per process = Θ(N/P). Communication time per iteration per process = Θ(sqrt(N)).
• All computations in an iteration occur in parallel (there is no data dependency among the computations in each process for the same iteration). Communication also occurs completely in parallel within an iteration, and near-neighbor communication ⇒ no contention. So the parallel time per iteration is Θ(max(N/P, sqrt(N))).
• Number of parallel iterations?
• Parallel time Tp(P) = ?
• Sequential time Tseq = Tp(1) = ?
• Speedup Sp(P) = ?  Efficiency Ep(P) = ?
(These blanks are left to be filled in; one plausible completion is sketched below.)
[Figure: the input data laid out as data1, data2, ..., dataN]
(Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur)
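
One plausible completion of the blanks, assuming f(L) parallel iterations and the per-iteration costs stated on this slide:

\[
T_p(P) = f(L)\cdot\Theta\!\left(\max\left(\frac{N}{P},\,\sqrt{N}\right)\right),
\qquad
T_{\mathrm{seq}} = T_p(1) = f(L)\cdot\Theta(N),
\]
\[
S_p(P) = \frac{T_{\mathrm{seq}}}{T_p(P)} = \Theta\!\left(\min\left(P,\,\sqrt{N}\right)\right),
\qquad
E_p(P) = \frac{S_p(P)}{P} = \Theta\!\left(\min\left(1,\,\frac{\sqrt{N}}{P}\right)\right).
\]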

7. Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
• Processes are mapped to data (and thus to computation).
• Assumptions:
  • Total grid points: N, in a sqrt(N) × sqrt(N) arrangement.
  • Communication time = ts + k·th + m·tw, where ts is the startup time at the source and target processors, k is the number of hops, th is the set-up time per switch, m is the message size in "flits" (the number of bits that can be transmitted in parallel), and tw is the propagation delay per flit. ts does not affect congestion, so for congestion/delay purposes we approximate the communication time here by k·th + m·tw (written out below).
• Analysis (1D data partition, linear topology): all communication is near-neighbor, hence parallel per iteration and without contention. Parallel communication time per iteration of a finite-element computation = th + sqrt(N)·tw ~ Θ(sqrt(N)) for large N.
[Figure: (a) 2D (full-dimensional) partition; (b) 1D (reduced-dimensional) partition; (c) communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue) to a linear array of processors 1–8; the 2D partition gives rise to 4-hop messages]
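
The cost model above, written out, together with the 1D-partition-on-linear-array case that follows from it (ts dropped, as stated; k = 1 hop and m = sqrt(N) boundary values per message):

\[
t_{\mathrm{comm}} = t_s + k\,t_h + m\,t_w \;\approx\; k\,t_h + m\,t_w,
\qquad
t_{\mathrm{comm}}^{\mathrm{1D,\,linear}} \approx t_h + \sqrt{N}\,t_w = \Theta(\sqrt{N}).
\]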

8. Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
• Analysis (2D data partition, linear topology):
  • The near-neighbor part of the communication is parallel and without contention. So the parallel communication time for this part, per iteration of the FE computation, = th + (sqrt(N)/2)·tw = Θ(sqrt(N)/2) for large N.
  • The second-dimension communication travels a distance of 4 hops (in general, about sqrt(P) hops):
    • The message from 4 → 8 takes time 4·th + (sqrt(N)/4)·tw ~ Θ(sqrt(N)/4) for large N. Assuming cut-through/wormhole routing, in which an entire path is first reserved and the data then passes through multiple interconnects as if they were a single interconnect, the messages from 3 → 7, 2 → 6, and 1 → 5 are blocked until the 4 → 8 communication completes. After it finishes, say, the 3 → 7 communication occurs while the others remain blocked.
    • Communication is thus sequentialized for 4-hop messages along intersecting routes. So communication along dimension 2 of the data is not fully parallel (the messages are started in parallel but become sequentialized due to contention) and takes a total parallel time of 4·(4·th + (sqrt(N)/4)·tw) ~ Θ(sqrt(N)) for large N, the same order as for the 1D partition, but more time in terms of the th terms (16·th vs. th).
  • The total parallel communication time is thus Θ(1.5·sqrt(N)), worse than for the linear (1D) partitioning (worked out below).
[Figure: as on the previous slide, (a) 2D partition, (b) 1D partition, and (c) communication patterns, including the 4-hop messages, on a linear array of processors 1–8]
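
Putting the two parts above together for an interior process (cut-through routing assumed):

\[
t_{\mathrm{near}} \approx t_h + \frac{\sqrt{N}}{2}\,t_w,
\qquad
t_{\mathrm{4hop}} \approx 4\left(4\,t_h + \frac{\sqrt{N}}{4}\,t_w\right) = 16\,t_h + \sqrt{N}\,t_w,
\]
\[
t_{\mathrm{total}} \approx t_{\mathrm{near}} + t_{\mathrm{4hop}} = 17\,t_h + 1.5\sqrt{N}\,t_w = \Theta\!\left(1.5\sqrt{N}\right),
\]
versus Θ(sqrt(N)), with only a single th term, for the 1D partition on the same linear array.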

9. Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
• Analysis (1D data partition, 2D topology): all communication is near-neighbor, hence parallel per iteration and without contention. Parallel communication time per iteration of a finite-element computation = th + sqrt(N)·tw ~ Θ(sqrt(N)) for large N.
• Analysis (2D data partition, 2D topology): all communication is near-neighbor and hence without contention. However, each processor has two communication steps that are partly sequentialized:
  • Worst case: parallel communication time per iteration of an FE computation = [th + (sqrt(N)/2)·tw] + [th + (sqrt(N)/4)·tw] ~ Θ(0.75·sqrt(N)) for large N.
  • Best (close to average) case: the bulk of both communications (the data transfer) takes place in parallel, but the startup time (ts) is sequentialized. So the parallel communication time per iteration = max(th + (sqrt(N)/2)·tw, th + (sqrt(N)/4)·tw) ~ Θ(0.5·sqrt(N)) for large N.
[Figure: (a) 2D (full-dimensional) partition; (b) 1D (reduced-dimensional) partition; (c) communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue), and the corresponding processes, to a 2D array/mesh of processors 1–8. The latter is equivalent to embedding a linear array onto a 2D array so as to minimize average dilation (the factor by which the length of a near-neighbor communication or link increases in the host/target topology).]
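
The worst- and best-case figures above, written out:

\[
t_{\mathrm{worst}} \approx \left(t_h + \frac{\sqrt{N}}{2}\,t_w\right) + \left(t_h + \frac{\sqrt{N}}{4}\,t_w\right) = \Theta\!\left(0.75\sqrt{N}\right),
\qquad
t_{\mathrm{best}} \approx \max\!\left(t_h + \frac{\sqrt{N}}{2}\,t_w,\; t_h + \frac{\sqrt{N}}{4}\,t_w\right) = \Theta\!\left(0.5\sqrt{N}\right).
\]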

10. Parallel Algorithm Design—A Process/Data View: O/P Data Partitioning
• Output data partitioning, e.g., matrix multiplication C = A × B: each process computes a block of the output data C (one or more Cij's).
• Issues of input data distribution among the processes:
  • If, to compute element Cij, row Ai and column Bj are already available at the process, then no communication is needed to move input data to the right processes.
  • If each process computing a block of C instead holds similar blocks of the A and B data, then data must be moved around the processes so that the one computing Cij has Ai and Bj available. This requires partial all-to-all broadcasts; see below.
• Each process broadcasts/multicasts the block of A data it has to the processes in its "row", and similarly its block of B data to the processes in its column (a sketch of this phase follows this slide).
• After this stage, every process computing an element Cij has row Ai and column Bj.
• The time complexity of this phase depends on the interconnection topology of the multiprocessor.
[Figure: the blocked view of C = A × B, with the A blocks broadcast along process rows and the B blocks along process columns]
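
A minimal C/MPI sketch of the input-redistribution phase described above, assuming a square sqrt(P)×sqrt(P) process grid in which process (i,j) initially holds blocks Aij and Bij. The block size b, the dummy input data, and the use of MPI_Allgather over row/column communicators to realize the partial all-to-all broadcasts are assumptions of this sketch, not the slide's prescribed implementation.

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int q = (int)(sqrt((double)P) + 0.5);      /* assume P is a perfect square */
    int row = rank / q, col = rank % q;
    int b = 64;                                /* block dimension (illustrative) */

    /* local input blocks A[row][col], B[row][col]; dummy data here */
    double *Ablk = malloc((size_t)b * b * sizeof(double));
    double *Bblk = malloc((size_t)b * b * sizeof(double));
    double *Cblk = calloc((size_t)b * b, sizeof(double));
    for (int i = 0; i < b * b; i++) { Ablk[i] = 1.0; Bblk[i] = 1.0; }

    /* communicators for "my row" and "my column" of the process grid */
    MPI_Comm rowc, colc;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowc);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colc);

    /* partial all-to-all broadcast: gather the block-row of A and the
       block-column of B that this process needs for its C block */
    double *Arow = malloc((size_t)q * b * b * sizeof(double));
    double *Bcol = malloc((size_t)q * b * b * sizeof(double));
    MPI_Allgather(Ablk, b * b, MPI_DOUBLE, Arow, b * b, MPI_DOUBLE, rowc);
    MPI_Allgather(Bblk, b * b, MPI_DOUBLE, Bcol, b * b, MPI_DOUBLE, colc);

    /* C[row][col] = sum over k of A[row][k] * B[k][col], all blocks now local */
    for (int k = 0; k < q; k++)
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++)
                for (int t = 0; t < b; t++)
                    Cblk[i*b + j] += Arow[k*b*b + i*b + t] * Bcol[k*b*b + t*b + j];

    free(Ablk); free(Bblk); free(Cblk); free(Arow); free(Bcol);
    MPI_Comm_free(&rowc); MPI_Comm_free(&colc);
    MPI_Finalize();
    return 0;
}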

11. Parallel Algorithm Design—Strategy Summary
• Data partitioning: defines the processes; based partly on the amount and pattern of communication needed and on the target topology.
• Determine the communication pattern required.
• Granularity of the breakup: N/P for an input-data-parallel breakup, or M/P for an output-data-parallel breakup, can be decided either from a given P, or from the overheads (i.e., with P to be determined) so as to optimize the parallel time Tp, the efficiency Ep, or the work efficiency WEp (see below).
• Map processes to processors in the target topology so as to:
  • reduce average message hops (try to maximize near-neighbor communication)
  • maximize the parallelization of message passing across processes
  • minimize conflicts among message paths (conflict depends partly on the type of routing flow control: store-and-forward or wormhole/cut-through)
• Overall goals (collected as formulas below):
  • minimize the parallel time Tp (i.e., maximize the speedup Sp)
  • maximize the efficiency Ep = Sp/P
  • maximize the work efficiency WEp = Sp/WSp, where WSp = Wp/Ws is the total work scale-up (correlated with total power consumed): Wp is the total work done across the P processors and Ws is the total work of the sequential computation.
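
The overall goals above, as formulas:

\[
S_p = \frac{T_{\mathrm{seq}}}{T_p},
\qquad
E_p = \frac{S_p}{P},
\qquad
WE_p = \frac{S_p}{WS_p}
\quad\text{with}\quad
WS_p = \frac{W_p}{W_s}.
\]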

12. MIMD/SPMD Coding Style

Loop
    Do some computation;
    send data
    receive data
    Do some more computation; etc.
End Loop
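
A minimal C/MPI skeleton in this SPMD style; the computation steps are hypothetical placeholders, and the ring exchange simply stands in for the "send data / receive data" steps.

#include <mpi.h>

/* hypothetical placeholders for the application's computation */
static void do_some_computation(int rank) { (void)rank; }
static void do_more_computation(int rank) { (void)rank; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int right = (rank + 1) % P, left = (rank + P - 1) % P;
    double out = (double)rank, in = 0.0;

    for (int step = 0; step < 10; step++) {   /* Loop */
        do_some_computation(rank);            /*   Do some computation */
        /* send data / receive data: here, a ring exchange with the neighbors */
        MPI_Sendrecv(&out, 1, MPI_DOUBLE, right, 0,
                     &in,  1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        do_more_computation(rank);            /*   Do some more computation */
    }                                         /* End Loop */

    MPI_Finalize();
    return 0;
}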
