Creating Coarse-grained Parallelism for Loop Nests

Creating Coarse-grained Parallelism for Loop Nests Chapter 6, Sections 6.3 through 6.9 Yaniv Carmeli

Last time… • Single loop methods • Privatization • Loop distribution • Alignment • Loop Fusion

This time… • Perfect Loop Nests • Loop Interchange • Loop Selection • Loop Reversal • Loop Skewing • Profitibility-Based Methods

This time… • Imperfectly Nested Loops • Multilevel Loop Fusion • Parallel Code Generation • Packaging Parallelism • Strip Mining • Pipeline Parallelism • Guided Self Scheduling

Loop Interchange A(I+1, J) = A(I, J) + B(I, J) ENDDO ENDDO DO I = 1, N PARALLEL DO J = 1 , M DO I = 1 , N A(I+1, J) = A(I, J) + B(I, J) ENDDO END PARALLEL DO DO J = 1, M D = ( < , = ) • Vectorization: OK • Parallelization: Problematic • Vectorization: Bad • Parallelization: Good

Loop Interchange (Cont.) Best we can do DO I = 1, N PARALLEL DO J = 1, M A(I+1, J+1) = A(I, J) + B(I, J) END PARALLEL DO ENDDO DO I = 1, N DO J = 1, M A(I+1, J+1) = A(I, J) + B(I, J) ENDDO ENDDO D = ( < , < ) • Loop Interchange doesn’t work, as both loops carry dependence!! When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?

Loop Interchange (Cont.) • Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest contains only ‘=‘ entries. • Proof. If. A column with only “=“ entries represents a loop that can be interchanged, and carries no dependence. • Only If. There is a non “=“ entry in that column: • If it is “>” – Can’t interchange loops (dependence will be reversed) • If it is “<“ – Can interchange, but can’t shake the dependece (Will not allow parallelization anyway...)

Loop Interchange (Cont.) • Working with direction matrix • 1. Move loops with all “=“ entries into outermost position and parallelize it. Remove the column from the matrix • 2. Move loops with most “<“ entries into next outermost position and sequentialize it, eliminate the column and any rows representing carried dependences • 3. Repeat step 1

< = = = = < < < < Loop Interchange (Cont.) • Example: DO I = 1, N DO J = 1, M DO K = 1, L A(I+1, J,K) = A(I, J,K) + X1 B(I, J,K+1) = B(I, J,K) + X2 C(I+1, J+1,K+1) = C(I, J,K) + X3 ENDDO ENDDO ENDO DO I = 1, N PARALLEL DO J = 1, M DO K = 1, L A(I+1, J,K) = A(I, J,K) + X1 B(I, J,K+1) = B(I, J,K) + X2 C(I+1, J+1,K+1) = C(I, J,K) + X3 ENDDO END PARALLEL DO ENDO

< < = = < = < = < = = < = < = = = = < = = = = < < < = = < = < = < = = < = < = = = = < = = = = < Loop Selection – Optimal? • Is the approach of selecting the loop with the most ‘ < ‘ directions optimal? • Will result in NO parallelization for this matrix • While other selections may allow parallelization Is it possible to derive a selection heuristic that provides optimal code?

< < = = < = < = < = = < = < = = = = < = = = = < Loop Selection • The problem of loop selection is NP-complete • Loop selection is best done by a heuristic! • Favor the selection of loops that must be sequentialized before parallelism can be uncovered.

= < = < = < = < < < = > Heuristic Loop Selection (Cont.) • Example of principals involvedin heuristic loop selection DO I = 2, N DO J = 2, M DO K = 2, L A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1) ENDDO ENDDO ENDDO DO J = 2, M DO I = 2, N PARALLEL DO K = 2, L A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1) END PARALLEL DO ENDDO ENDDO The J-loop must be sequentialized because of the first dependence The I-loop must be sequentialized because of the fourth dependence

= < > < = > = < < < = < Loop Reversal • Using loop reversal to create coarse-grained parallelism. Consider: DO I = 2, N+1 DO J = 2, M+1 DO K = 1, L A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDO ENDDO DO K = L, 1, -1 DO I = 2, N+1 DO J = 2, M+1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDO ENDDO DO K = L, 1, -1 PARALLEL DO I = 2, N+1 PARALLEL DO J = 2, M+1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) END PARALLEL DO END PARALLEL DO ENDDO DO I = 2, N+1 DO J = 2, M+1 DO K = L, 1, -1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDO ENDDO

0 1 0 1 0 0 0 0 1 0 0 0 = < = < = = = = < = = = = < < < = < = = < = = = Loop Skewing Skewed using k = K + I + J yield: DO I = 2, N+1 DO J = 2, M+1 DO k = I+J+1, I+J+L A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J) B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J) ENDDO ENDDO ENDDO DO I = 2, N+1 DO J = 2, M+1 DO K = 1, L A(I, J, K) = A(I, J-1, K) + A(I-1, J, K) B(I, J, K+1) = B(I, J, K) + A(I, J, K) ENDDO ENDDO ENDDO DO k = 5, N+M+1 PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-L-2) PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1) A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J) B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J) END PARALLEL DO END PARALLEL DO ENDDO

Loop Skewing - Main Benefits • Eliminate “>” signs in the matrix • Transforms skewed loops in such a way, that after outward interchange, it will carry all dependences formerly carried by the loop with respect to which it is skewed

Loop Skewing - Drawback • The resulting parallelism is usually unbalanced. (The resulting loop executes a variable amount of iterations each time). • As we shall see – It’s not really a problem for asynchronous parallelism (unlike vectorization).

Loop Skewing (Cont.) • Updated strategy • Parallelize outermost loop if possible • Sequentializes at most one outer loop to find parallelism in the next loop • If 1 and 2 fail, try skewing • If 3 fails, sequentialize the loop that can be moved to the outermost position and cover the most other loops

In Practice – • Sometimes we get much worse execution times, than we would have gotten parallelizing less\different loops.

Profitability-Based Methods • Static performance estimation function • No need to be accurate, just good at selecting the better of two alternatives • Key considerations • Cost of memory references • Sufficiency of granularity

Profitability-Based Methods (Cont.) • Impractical to choose from all arrangements • Consider only subset of the possible code arrangements, based on properties of the cost function • In our case: consider only the inner-most loop

Profitability-Based Methods (Cont.) A possible cost evaluation heuristics: • Subdivide all the references in the loop body into reference groups • Two references are in the same group if: • There is a loop independent dependence between them. • There is a constant-distance loop carried dependence between them.

Profitability-Based Methods (Cont.) A possible cost evaluation heuristics: • Determine whether subsequent accesses to the same reference are • Loop invariant • Cost = 1 • Unit stride • Cost = number of iterations / cache line size • Non-unit stride • Cost = number of iterations

Profitability-Based Methods (Cont.) A possible cost evaluation heuristics: • Compute loop cost:

Profitability-Based Methods: Example DO I = 1, N DO J = 1, N DO K = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDO ENDDO

Profitability-Based Methods: Example DO I = 1, N DO J = 1, N DO K = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDO ENDDO Inner-most loop C A B COST K 1 N N/L N3(1+1/L)+N2 Worst J N 1 N 2N3+N2 I N/L N/L 1 2N3/L+N2 Best

Profitability-Based Methods: Example • Reorder loop from innermost to outermost by increasing loop cost: I,K,J DO J = 1, N DO K = 1, N DO I = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDO ENDDO Can’t always have desired loop order (as some permutations are illegal) - Try to find the possible permutation closest to the desired one.

Profitability-Based Methods (Cont.) • Goal: Given a desired loop order and a direction matrix for a loop nest - find the legal permutation closest to the desired one. • Method: Until there are no more loops: Choose from all the loops that can be interchanged to the outermost position, the one that is outermost in the desired permutation. Drop that loop. It can be shown that if a legal permutation with the desired innermost loop in the innermost position exists – this algorithm will find such a permutation.

Profitability-Based Methods (Cont.) DO J = 1, N DO K = 1, N DO I = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDO ENDDO • For performance reasons – the compiler may mark the inner loop as “not meant for parallelization” (sequential performance utilizes locality in memory accesses).

Multilevel Loop Fusion • Commonly used for imperfect loop nests • Used after maximal loop distribution

Multilevel Loop Fusion PARALLEL DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C ENDDO END PARALLEL DO PARALLEL DO J = 1, M DO I = 1, N B(I+1, J) = B(I, J) + D ENDDO END PARALLEL DO DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C ENDDO ENDDO DO I = 1, N DO J = 1, M B(I+1, J) = B(I, J) + D ENDDO ENDDO DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C B(I+1, J) = B(I, J) + D ENDDO ENDDO After distribution each nest is better with a different outer loop – Can’t fuse!

i,j A j i B C j D Multilevel Loop Fusion (Cont.) DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X ENDDO ENDDO DO I = 1, N DO J = 1, M B(I+1, J) = A(I, J) + B(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X ENDDO ENDDO DO I = 1, N DO J = 1, M B(I+1, J) = A(I, J) + B(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X B(I+1, J) = A(I, J) + B(I,J) C(I, J+1) = A(I, J) + B(I,J) D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO Which loop should be fused into the A loop?

2 barriers j AB j D i C Multilevel Loop Fusion (Cont.) Fusing A loop with B loop PARALLEL DO J = 1, M DO I = 1, N A(I, J) = A(I, J) + X B(I+1, J) = A(I, J) + B(I,J) ENDDO ENDDO PARALLEL DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO ENDDO PARALLEL DO J = 1, M DO I = 1, N D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO

1 barrier i i AC AC j D j j B BD Multilevel Loop Fusion (Cont.) Fusing A loop with C loop PARALLEL DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X C(I, J+1) = A(I, J) + C(I,J) ENDDO ENDDO PARALLEL DO J = 1, M DO I = 1, N B(I+1, J) = A(I, J) + B(I,J) ENDDO ENDDO PARALLEL DO J = 1, M DO I = 1, N D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO PARALLEL DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X C(I, J+1) = A(I, J) + C(I,J) ENDDO ENDDO PARALLEL DO J = 1, M DO I = 1, N B(I+1, J) = A(I, J) + B(I,J) D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO Now we can also fuse B-D

i,j A j i B C j D A barrier is inevitable!! Multilevel Loop Fusion (Cont.) • Decision making needs look-ahead • Strategy: Fuse with the loop that cannot be fused with one of its successors Rationale: If it can’t be fused with its successors – a barrier will be formed anyway.

Parallel Code Generation Code generation scheme: Parallelize(l,D) • Try methods for perfect nests (loop interchange, loop skewing, loop reversal), and stop if parallelism is found. • If nest can be distributed: distribute, run recursively on the distributed nests, and merge. • Else sequentialize outer loop, eliminate the dependences it carries, and try recursively on each of the loops nested in it.

Parallel Code Generation procedure Parallelize(l, Dl); ParallelizeNest(l, success); //(try methods for perfect nests..) if ¬success then begin if l can be distributed then begin distribute l into loop nests l1, l2, …, ln; for i:=1 to n do begin Parallelize(li, Di); end Merge({l1, l2, …, ln}); end

Parallel Code Generation (Cont.) else begin // if l cannot be distributed then for each outer loop l0 nested in l do begin let D0 be the set of dependences between statements in l0 less dependences carried by l; Parallelize(l0,D0); end let S - the set of outer loops and statements loops left in l; If ||S||>1 then Merge(S); end end end Parallelize

I loop can be parallelized Both loops can be parallelized Parallel Code Generation (Cont.) DO J = 1, M DO I = 1, N A(I+1, J+1) = A(I+1, J) + C ENDDO END DO DO J = 1, M DO I = 1, N X(I, J) = A(I, J) + C ENDDO END DO PARALLEL DO I = 1, N DO J = 1, M A(I+1, J+1) = A(I+1, J) + C ENDDO END PARALLEL DO DO J = 1, M DO I = 1, N X(I, J) = A(I, J) + C ENDDO END DO PARALLEL DO I = 1, N DO J = 1, M A(I+1, J+1) = A(I+1, J) + C ENDDO END PARALLEL DO PARALLEL DO J = 1, M DO I = 1, N !Left sequential for memory hierarchy X(I, J) = A(I, J) + C ENDDO END PARALLEL DO Type: (I-loop, parallel) Try distribution… DO J = 1, M DO I = 1, N A(I+1, J+1) = A(I+1, J) + C X(I, J) = A(I, J) + C ENDDO ENDDO Different types – can’t fuse Now fusing… Type: (J-loop, parallel) Both loops carry dependence – loop interchange will not find sufficient parallelism.

A B C D A C D Parallel Code Generation (Cont.) J loop, parallel PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X ENDDO ENDPARALLEL DO PARALLEL DO J = 1, M DO I = 1, N B(I+1, J) = A(I, J) + B(I,J) ENDDO END PARALLEL DO PARALLEL DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO END PARALLEL DO PARALLEL DO J = 1, M DO J = 1, N D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO END PARLLEL DO PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X ENDDO DO I = 1, N B(I+1, J) = A(I, J) + B(I,J) ENDDO END PARALLEL DO PARALLEL DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO END PARALLEL DO PARALLEL DO J = 1, M DO J = 1, N D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO END PARLLEL DO PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X B(I+1, J) = A(I, J) + B(I,J) ENDDO END PARALLEL DO PARALLEL DO I = 1, N DO J = 1, M C(I, J+1) = A(I, J) + C(I,J) ENDDO END PARALLEL DO PARALLEL DO J = 1, M DO J = 1, N D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO END PARLLEL DO DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X B(I+1, J) = A(I, J) + B(I,J) C(I, J+1) = A(I, J) + C(I,J) D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDO ENDDO J loop, parallel I loop, parallel J loop, parallel

Erlebacher PARALLEL DO J = 1, JMAXDO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1) DO K = 2, N-1PARALLEL DO J = 1, JMAXD DO I = 1, IMAXD F(I, J, K) = (F(I, J, K) – A(K) * F(I, J, K-1)) * B(K) PARALLEL DO J = 1, JMAXDDO I = 1, IMAXD TOT(I, J) = 0.0 PARALLEL DO J = 1, JMAXDDO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1) DO K = 2, N-1PARALLEL DO J = 1, JMAXD DO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K) DO J = 1, JMAXDO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1) DO K = 2, N-1DO J = 1, JMAXD DO I = 1, IMAXD F(I, J, K) = (F(I, J, K) – A(K) * F(I, J, K-1)) * B(K) DO J = 1, JMAXDDO I = 1, IMAXD TOT(I, J) = 0.0 DO J = 1, JMAXDDO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1) DO K = 2, N-1DO J = 1, JMAXD DO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)

L1 L2 L3 L4 L5 Erlebacher PARALLEL DO J= 1, MAXD L1 : DO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1) L2: DO K = 2, N – 1 DO I = 1, IMAXD F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K) L3: DO I = 1, IMAXD TOT(I, J) = 0.0 L4: DO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1) L5: DO K = 2, N-1 DO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K) END PARALLEL DO

Erlebacher PARALLEL DO J = 1, JMAXD DO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1) TOT(I, J) = 0.0 TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1) ENDDO DO K = 2, N-1 DO I = 1, IMAXD F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K) TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K) ENDDO ENDDO END PARALLEL DO

Packaging of Parallelism • Trade off between parallelism and granularity of synchronization. • Larger granularity work-units means synchronization needs to be done less frequently, but at a cost of less parallelism, and poorer load balance.

Strip Mining • Converts available parallelism into a form more suitable for the hardware DO I = 1, N A(I) = A(I) + B(I) ENDDO • Interruptions may be disastrous k = CEIL (N / P) PARALLEL DO I = 1, N, k DO i = I, MIN(I + k-1, N) A(i) = A(i) + B(i) ENDDO END PARALLEL DO The value of P is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3)

Strip Mining (Cont.) • What if the execution time varies among iteraions? PARALLEL DO I = 1, N DO J = 2, I A(J, I) = A(J-1, I) * 2.0 ENDDO END PARALLEL DO Solution: smaller unit size to allow more balanced distribution

Pipeline Parallelism • Fortran command DOACROSS – pipelines parallel loop iterations with cross-iteration synchronization. • Useful where parallelization is not available • High synchronization costs DOACROSS I = 2, N S1: A(I) = B(I) + C(I) POST(EV(I)) IF (I>2) WAIT (EV(I-1)) S2: C(I) = A(I-1) + A(I) ENDDO

Scheduling Parallel Work Load balance Little Sychro.

Scheduling Parallel Work • Parallel execution is slower than serial execution if • Bakery-counter scheduling • Moderate synchronization overhead N- number of iterations B- time of one iteration p- number of processors σ0- constant overhead per processor

Guided Self-Scheduling • Incorporates some level of static scheduling to guide dynamic self-scheduling • Schedules groups of iterations • Going from large to small chunks of work • Iterations dispensed at time t follows:

Guided Self-Scheduling (Cont.) • GSS: (20 iteration, 4 processors) • Not completely balanced • Required synchronization: 9 In bakery counter: 20

Creating Coarse-grained Parallelism for Loop Nests

Creating Coarse-grained Parallelism for Loop Nests

Presentation Transcript

Enhancing Fine-Grained Parallelism

Enhancing Fine-Grained Parallelism

Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

A Coarse-grained Model for the Formation of Caveolae

Parametric Tiling of Affine Loop Nests *

Coarse-grained Word Sense Disambiguation

Compiling for Coarse-Grained Adaptable Architectures

Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How

Coarse-Grained Transactions

Coarse-Grained Transactions

Fine-grained and Coarse-grained Word Sense Disambiguation

Atomistic vs. Coarse Grained Simulations

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Coarse-Grained Coherence

Coarse-Grained Theory of Surface Nanostructure Formation

parXXL : A Fine Grained Development Environment on Coarse Grained Architectures

Commutativity and Coarse-Grained Transactions

Instruction Level Parallelism: Loop Level Parallelism

Enhancing Fine-Grained Parallelism

Coarse Grained Interoperability scenarios