
### Enhancing Fine-Grained Parallelism

Chapter 5 of Allen and Kennedy

Optimizing Compilers for Modern Architectures

Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:

- Loop Interchange
- Scalar Expansion
- Scalar Renaming
- Array Renaming

Prelude: A Long Time Ago...

procedure codegen(R, k, D);

// R is the region for which we must generate code.
// k is the minimum nesting level of possible parallel loops.
// D is the dependence graph among statements in R.

find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the dependence graph D restricted to R;

construct Rp from R by reducing each Si to a single node and compute Dp, the dependence graph naturally induced on Rp by D;

let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent with Dp (use topological sort to do the numbering);

for i = 1 to m do begin

  if pi is cyclic then begin
    generate a level-k DO statement;
    let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;
    codegen(pi, k+1, Di);
    generate the level-k ENDDO statement;
  end
  else
    generate a vector statement for pi in r(pi)-k+1 dimensions, where r(pi) is the number of loops containing pi;

end
end codegen

Prelude: A Long Time Ago...

- codegen tries to find parallelism using the transformations of loop distribution and statement reordering
- If we deal with loops containing cyclic dependences early in the loop nest, we can potentially vectorize more loops
- Goal in Chapter 5: explore further transformations that exploit parallelism
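The codegen procedure above can be sketched in executable form. The sketch below uses Python rather than Fortran, and its encoding of the dependence graph as (src, dst, level) triples and its emitted strings are illustrative assumptions, not the book's representation; Tarjan's algorithm supplies the maximal strongly-connected regions in reverse topological order.

```python
def sccs(nodes, succ):
    """Tarjan's algorithm: maximal strongly-connected regions,
    returned in reverse topological order (sinks first)."""
    idx, low, on, stack, comps = {}, {}, set(), [], []
    def dfs(v):
        idx[v] = low[v] = len(idx)
        stack.append(v); on.add(v)
        for w in succ(v):
            if w not in idx:
                dfs(w); low[v] = min(low[v], low[w])
            elif w in on:
                low[v] = min(low[v], idx[w])
        if low[v] == idx[v]:
            c = []
            while True:
                w = stack.pop(); on.discard(w); c.append(w)
                if w == v: break
            comps.append(frozenset(c))
    for v in nodes:
        if v not in idx:
            dfs(v)
    return comps

def codegen(region, k, D, emit):
    """region: ordered statement names; D: set of (src, dst, level) edges.
    Emits a level-k DO around each recurrence and recurses at level k+1;
    acyclic nodes become vector statements (the real codegen also computes
    their dimensionality r(pi)-k+1, omitted here)."""
    internal = {(s, t, l) for (s, t, l) in D if s in region and t in region}
    succ = lambda v: [t for (s, t, _) in internal if s == v]
    for pi in reversed(sccs(list(region), succ)):   # topological order
        cyclic = len(pi) > 1 or any(s == t for (s, t, _) in internal
                                    if s in pi and t in pi)
        if cyclic:
            emit(f"DO level-{k}")
            Di = {(s, t, l) for (s, t, l) in internal
                  if s in pi and t in pi and l >= k + 1}
            codegen(sorted(pi), k + 1, Di, emit)
            emit(f"ENDDO level-{k}")
        else:
            emit(f"vector {next(iter(pi))}")

# Hypothetical region: S2 and S3 form a level-1 recurrence whose edge
# S2 -> S3 is carried at level 2; S1 feeds them but is acyclic.
out = []
codegen(["S1", "S2", "S3"], 1,
        {("S1", "S2", 1), ("S2", "S3", 2), ("S3", "S2", 1)}, out.append)
```

Running this, the recurrence {S2, S3} is wrapped in a level-1 DO and fully vectorized one level deeper, while S1 becomes a standalone vector statement.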

Motivational Example

DO J = 1, M

DO I = 1, N

T = 0.0

DO K = 1,L

T = T + A(I,K) * B(K,J)

ENDDO

C(I,J) = T

ENDDO

ENDDO

codegen will not uncover any vector operations. However, by scalar expansion, we can get:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

C(I,J) = T$(I)

ENDDO

ENDDO


Motivational Example II

- Loop Distribution gives us:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

ENDDO

DO I = 1, N

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

ENDDO

DO I = 1, N

C(I,J) = T$(I)

ENDDO

ENDDO

Motivational Example III

Finally, interchanging I and K loops, we get:

DO J = 1, M

T$(1:N) = 0.0

DO K = 1,L

T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)

ENDDO

C(1:N,J) = T$(1:N)

ENDDO

- A couple of new transformations used:
- Loop interchange
- Scalar Expansion
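As a sanity check (not from the slides), the chain of scalar expansion, loop distribution, and I/K interchange can be verified numerically. This Python sketch runs the original scalar loop nest and the transformed form side by side; the list comprehension stands in for the vector statement over T$(1:N).

```python
import random

N, M, L = 4, 3, 5
A = [[random.random() for _ in range(L)] for _ in range(N)]
B = [[random.random() for _ in range(M)] for _ in range(L)]

# Original: scalar temporary T, no vectorizable statements
C1 = [[0.0] * M for _ in range(N)]
for J in range(M):
    for I in range(N):
        T = 0.0
        for K in range(L):
            T += A[I][K] * B[K][J]
        C1[I][J] = T

# Transformed: T expanded to T$(1:N), loops distributed, I and K
# interchanged, so each statement operates on a full length-N vector
C2 = [[0.0] * M for _ in range(N)]
for J in range(M):
    T_ = [0.0] * N                                        # T$(1:N) = 0.0
    for K in range(L):
        T_ = [T_[I] + A[I][K] * B[K][J] for I in range(N)]  # vector update
    for I in range(N):
        C2[I][J] = T_[I]                                  # C(1:N,J) = T$(1:N)

assert all(abs(C1[I][J] - C2[I][J]) < 1e-12
           for I in range(N) for J in range(M))
```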

Loop Interchange

DO I = 1, N

DO J = 1, M

S A(I,J+1) = A(I,J) + B    ! DV: (=, <)

ENDDO

ENDDO

- Applying loop interchange:

DO J = 1, M

DO I = 1, N

S A(I,J+1) = A(I,J) + B    ! DV: (<, =)

ENDDO

ENDDO

- leads to:

DO J = 1, M

S A(1:N,J+1) = A(1:N,J) + B

ENDDO
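Because the dependence direction vector here has "=" in the I position, the interchange is legal, so both loop orders must compute the same array. A quick Python check of that claim (array sizes and initial values are arbitrary):

```python
N, M, B = 5, 4, 2.0

def run(order):
    # 1-based indexing as in the Fortran; extra row/column for J+1
    A = [[float(i + j) for j in range(M + 2)] for i in range(N + 2)]
    pairs = ([(I, J) for I in range(1, N + 1) for J in range(1, M + 1)]
             if order == "IJ" else
             [(I, J) for J in range(1, M + 1) for I in range(1, N + 1)])
    for I, J in pairs:
        A[I][J + 1] = A[I][J] + B     # S, direction vector (=, <)
    return A

assert run("IJ") == run("JI")   # interchange preserves the result
```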

Loop Interchange

- Loop interchange is a reordering transformation
- Why?
- Think of statements being parameterized with the corresponding iteration vector
- Loop interchange merely changes the execution order of these statements.
- It does not create new instances, or delete existing instances

DO J = 1, M

DO I = 1, N

S <some statement>

ENDDO

ENDDO

- If interchanged, S(2, 1) will execute before S(1, 2)

Loop Interchange: Safety

- Safety: not all loop interchanges are safe

DO J = 1, M

DO I = 1, N

A(I,J+1) = A(I+1,J) + B

ENDDO

ENDDO

- Direction vector (<, >)
- If we interchange loops, we violate the dependence
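The violation can be observed directly: with direction vector (<, >), interchanging the loops makes the read of A(I+1,J) see the updated value instead of the original one, so the two orders compute different arrays. An illustrative Python check:

```python
N, M, B = 4, 3, 1.0

def run(order):
    A = [[float(10 * i + j) for j in range(M + 2)] for i in range(N + 2)]
    pairs = ([(I, J) for J in range(1, M + 1) for I in range(1, N + 1)]
             if order == "JI" else
             [(I, J) for I in range(1, N + 1) for J in range(1, M + 1)])
    for I, J in pairs:
        A[I][J + 1] = A[I + 1][J] + B   # direction vector (<, >)
    return A

# Interchanging reverses the endpoints of the (<, >) dependence,
# so the results differ
assert run("JI") != run("IJ")
```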

Loop Interchange: Safety

- A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

Loop Interchange: Safety

- A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

Loop Interchange: Safety

- Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
- The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.

Loop Interchange: Safety

DO I = 1, N

DO J = 1, M

DO K = 1, L

A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)

ENDDO

ENDDO

ENDDO

- The direction matrix for the loop nest is:

< < =

< = >

- Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
- Follows from Theorem 5.1 and Theorem 2.3
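Theorem 5.2 gives a direct legality test: permute the columns of the direction matrix and reject if any row's leftmost non-"=" entry is ">". A small Python implementation, checked against the direction matrix above:

```python
def legal(direction_matrix, perm):
    """Theorem 5.2: a loop permutation is legal iff, after permuting the
    columns the same way, no row has '>' as its leftmost non-'=' entry."""
    for row in direction_matrix:
        for d in (row[p] for p in perm):
            if d == '>':
                return False     # '>' reached before any '<': illegal
            if d == '<':
                break            # dependence carried here; rest is free
    return True

DM = [list("<<="),   # A(I+1,J+1,K) <- A(I,J,K)
      list("<=>")]   # A(I+1,J+1,K) <- A(I,J+1,K+1)

assert legal(DM, [0, 1, 2])       # original order I, J, K
assert legal(DM, [1, 0, 2])       # J, I, K: rows become (<,<,=), (=,<,>)
assert not legal(DM, [2, 1, 0])   # K, J, I: second row becomes (>,=,<)
```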

Loop Interchange: Profitability

- Profitability depends on architecture

DO I = 1, N

DO J = 1, M

DO K = 1, L

S A(I+1,J+1,K) = A(I,J,K) + B

ENDDO

ENDDO

ENDDO

- For SIMD machines with a large number of functional units:

DO I = 1, N

S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B

ENDDO

- Not suitable for vector register machines

Loop Interchange: Profitability

- For Vector machines, we want to vectorize loops with stride-one memory access
- Since Fortran stores in column-major order:
- useful to vectorize the I-loop
- Thus, transform to:

DO J = 1, M

DO K = 1, L

S A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

ENDDO

Loop Interchange: Profitability

- MIMD machines with vector execution units: want to cut down synchronization costs
- Hence, shift K-loop to outermost level:

PARALLEL DO K = 1, L

DO J = 1, M

A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

END PARALLEL DO

Loop Shifting

- Motivation: Identify loops which can be moved and move them to “optimal” nesting levels
- Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.

Loop Shifting

DO I = 1, N

DO J = 1, N

DO K = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

- S has true, anti-, and output dependences on itself, so codegen fails: a recurrence exists at the innermost level
- Use loop shifting to move K-loop to the outermost level:

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

Loop Shifting

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

codegen vectorizes to:

DO K = 1, N

FORALL J=1,N

A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)

END FORALL

ENDDO
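Since the K loop carries no dependence, shifting it outward is safe and the result is unchanged; per-element, both versions even add the products in the same K order. An illustrative Python check of the shifted, vectorized form against the original nest:

```python
import random

N = 4
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[random.random() for _ in range(N)] for _ in range(N)]

# Original I, J, K order
A1 = [[1.0] * N for _ in range(N)]
for I in range(N):
    for J in range(N):
        for K in range(N):
            A1[I][J] += B[I][K] * C[K][J]

# K shifted outermost; inner loops model FORALL J / A(1:N,J) vector update
A2 = [[1.0] * N for _ in range(N)]
for K in range(N):
    for J in range(N):
        for I in range(N):
            A2[I][J] += B[I][K] * C[K][J]

assert A1 == A2   # same additions per element, in the same K order
```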

Loop Shifting

- Change body of codegen:

if pi is cyclic then
  if k is the deepest loop in pi
    then try_recurrence_breaking(pi, D, k)
  else begin
    select_loop_and_interchange(pi, D, k);
    generate a level-k DO statement;
    let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;
    codegen(pi, k+1, Di);
    generate the level-k ENDDO statement
  end
end

Loop Shifting

procedure select_loop_and_interchange(i, D, k)
  if the outermost carried dependence in i is at level p > k then
    shift loops at levels k, k+1, ..., p-1 inside the level-p loop, making it into the level-k loop;
  return;
end select_loop_and_interchange

Loop Selection

- Consider:

DO I = 1, N

DO J = 1, M

S A(I+1,J+1) = A(I,J) + A(I+1,J)

ENDDO

ENDDO

- Direction matrix:

< <

= <

- Loop shifting algorithm will fail to uncover vector loops; however, interchanging the loops can lead to:

DO J = 1, M

A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)

ENDDO

- Need a more general algorithm
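The interchange is legal here: permuting the columns of the direction matrix gives rows (<, <) and (<, =), neither of which has ">" as its leftmost non-"=" entry, and after interchange the inner I loop carries no dependence. A quick Python check that both orders agree (initial values arbitrary):

```python
N, M = 5, 4

def run(order):
    A = [[float(i + 2 * j) for j in range(M + 2)] for i in range(N + 2)]
    pairs = ([(I, J) for I in range(1, N + 1) for J in range(1, M + 1)]
             if order == "IJ" else
             [(I, J) for J in range(1, M + 1) for I in range(1, N + 1)])
    for I, J in pairs:
        # direction matrix rows (<, <) and (=, <)
        A[I + 1][J + 1] = A[I][J] + A[I + 1][J]
    return A

# Legal interchange: with J outermost, the inner I loop carries no
# dependence and becomes the vector statement A(2:N+1,J+1) = ...
assert run("IJ") == run("JI")
```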

Loop Selection

- Loop selection:
- Select a loop at nesting level p ≥ k that can be safely moved outward to level k, and shift the loops at levels k, k+1, ..., p-1 inside it


Loop Selection

- Heuristics for selecting the loop level:
- If the level-k loop carries no dependence, let p be the smallest integer such that the level-p loop carries a dependence (the loop-shifting heuristic).
- If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the pth. If no such loop exists, let p = k.

Example direction vectors:

= = < > = . . .

= = = < < . . .

= = < = = . . .

Scalar Expansion

DO I = 1, N

S1 T = A(I)

S2 A(I) = B(I)

S3 B(I) = T

ENDDO

- Scalar Expansion:

DO I = 1, N

S1 T$(I) = A(I)

S2 A(I) = B(I)

S3 B(I) = T$(I)

ENDDO

T = T$(N)

- leads to:

S1 T$(1:N) = A(1:N)

S2 A(1:N) = B(1:N)

S3 B(1:N) = T$(1:N)

T = T$(N)
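This expansion can be checked numerically: the original loop swaps A and B element by element, and the expanded, vectorized form must produce the same arrays plus the final value T = T$(N). An illustrative Python sketch:

```python
N = 6

def original(A, B):
    A, B = A[:], B[:]
    T = 0.0
    for I in range(N):
        T = A[I]           # S1
        A[I] = B[I]        # S2
        B[I] = T           # S3
    return A, B, T

def expanded(A, B):
    A, B = A[:], B[:]
    T_ = A[:]              # S1: T$(1:N) = A(1:N)
    A = B[:]               # S2: A(1:N) = B(1:N)
    B = T_[:]              # S3: B(1:N) = T$(1:N)
    return A, B, T_[N - 1] # T = T$(N)

A0 = [float(i) for i in range(N)]
B0 = [float(10 + i) for i in range(N)]
assert original(A0, B0) == expanded(A0, B0)
```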

Scalar Expansion

- However, not always profitable. Consider:

DO I = 1, N

T = T + A(I) + A(I+1)

A(I) = T

ENDDO

- Scalar expansion gives us:

T$(0) = T

DO I = 1, N

S1 T$(I) = T$(I-1) + A(I) + A(I+1)

S2 A(I) = T$(I)

ENDDO

T = T$(N)

Scalar Expansion: Safety

- Scalar expansion is always safe
- When is it profitable?
- Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions.
- However, we want to predict when expansion is profitable
- Dependences due to reuse of memory location vs. reuse of values
- Dependences due to reuse of values must be preserved
- Dependences due to reuse of memory location can be deleted by expansion

Scalar Expansion: Covering Definitions

- A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

DO I = 1, 100

S1 T = X(I)

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

S2 Y(I) = T

ENDIF

ENDDO

- In both example loops, S1 is a covering definition for T.

Scalar Expansion: Covering Definitions

- A covering definition does not always exist:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

- In SSA terms: there is no covering definition for a variable T if the SSA edge out of the first assignment to T goes to a φ-function later in the loop, which merges its values with those from another control-flow path through the loop

Scalar Expansion: Covering Definitions

- We will consider a collection of covering definitions
- There is a collection C of covering definitions for T in a loop if either:
- There exists no φ-function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or
- The φ-function within the loop has no SSA edge to any φ-function, including itself

Scalar Expansion: Covering Definitions

- Remember the loop which had no covering definition:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

- To form a collection of covering definitions, we can insert dummy assignments:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

Scalar Expansion: Covering Definitions

- Algorithm to insert dummy assignments and compute the collection, C, of covering definitions:
- Central idea: look for parallel paths to a φ-function following the first assignment, until no more exist

Scalar Expansion: Covering Definitions

Detailed algorithm:

- Let S0 be the φ-function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
- Let S1 be the first definition of T in the loop. Add S1 to C.
- If the SSA successor of S1 is a φ-function S2 that is not equal to S0, then push S2 onto the stack and mark it.
- While the stack is non-empty:
- pop the φ-function S from the stack;
- add all SSA predecessors of S that are not φ-functions to C;
- if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C;
- for each unmarked φ-function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
- for each unmarked φ-function S4 that can be reached from S by a single SSA edge and that is not dominated by S in the control flow graph, mark S4 and push it onto the stack.

Scalar Expansion: Covering Definitions

Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:

- Create an array T$ of appropriate length
- For each S in the covering definition collection C, replace the T on the left-hand side by T$(I).
- For every other definition of T, and every use of T in the loop body reachable by SSA edges that do not pass through S0 (the φ-function at the beginning of the loop), replace T by T$(I).
- For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
- If S0 is not null, then insert T$(0) = T before the loop.
- If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, where U is the loop upper bound.

Scalar Expansion: Covering Definitions

The original loop, and the version after inserting the dummy covering definition:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

After scalar expansion:

T$(0) = T

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T$(I) = X(I)

ELSE

T$(I) = T$(I-1)

ENDIF

S2 Y(I) = T$(I)

ENDDO
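The dummy assignment T = T becomes T$(I) = T$(I-1), so the expanded loop must reproduce the original's behavior for any guard pattern, including uses of T that see a value from an earlier iteration (or from before the loop). An illustrative Python check with random guards; the entry value of T is arbitrary:

```python
import random

def original(Avals, X):
    T = -1.0                       # value of T on loop entry (arbitrary)
    Y = [0.0] * 100
    for I in range(100):
        if Avals[I] > 0:
            T = X[I]               # S1
        Y[I] = T                   # S2: may see an older T
    return Y

def expanded(Avals, X):
    T_ = [0.0] * 101
    T_[0] = -1.0                   # T$(0) = T
    Y = [0.0] * 100
    for I in range(1, 101):
        if Avals[I - 1] > 0:
            T_[I] = X[I - 1]       # S1
        else:
            T_[I] = T_[I - 1]      # dummy T = T becomes T$(I) = T$(I-1)
        Y[I - 1] = T_[I]           # S3
    return Y

Avals = [random.choice([-1, 1]) for _ in range(100)]
X = [random.random() for _ in range(100)]
assert original(Avals, X) == expanded(Avals, X)
```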

Deletable Dependences

- Uses of T before the covering definitions are expanded as T$(I-1); all other uses are expanded as T$(I). The deletable dependences are then:
- Backward carried antidependences
- Backward carried output dependences
- Forward carried output dependences
- Loop-independent antidependences into the covering definition
- Loop-carried true dependences from a covering definition

Scalar Expansion

procedure try_recurrence_breaking(i, D, k)
  if k is the deepest loop in i then begin
    remove deletable edges in i;
    find the set {SC1, SC2, ..., SCn} of maximal strongly-connected regions in D restricted to i;
    if there are vector statements among the SCi then
      expand scalars indicated by deletable edges;
    codegen(i, k, D restricted to i);
  end
end try_recurrence_breaking

Scalar Expansion: Drawbacks

- Expansion increases memory requirements
- Solutions:
- Expand in a single loop
- Strip mine loop before expansion
- Forward substitution:

DO I = 1, N

T = A(I) + A(I+1)

A(I) = T + B(I)

ENDDO

- After forward substitution:

DO I = 1, N

A(I) = A(I) + A(I+1) + B(I)

ENDDO

Scalar Renaming

DO I = 1, 100

S1 T = A(I) + B(I)

S2 C(I) = T + T

S3 T = D(I) - B(I)

S4 A(I+1) = T * T

ENDDO

- Renaming scalar T:

DO I = 1, 100

S1 T1 = A(I) + B(I)

S2 C(I) = T1 + T1

S3 T2 = D(I) - B(I)

S4 A(I+1) = T2 * T2

ENDDO

Scalar Renaming

- Vectorization will then lead to:

S3 T2$(1:100) = D(1:100) - B(1:100)

S4 A(2:101) = T2$(1:100) * T2$(1:100)

S1 T1$(1:100) = A(1:100) + B(1:100)

S2 C(1:100) = T1$(1:100) + T1$(1:100)

T = T2$(100)

Scalar Renaming

- The renaming algorithm partitions all definitions and uses into equivalence classes, each of which can occupy a different memory location:
- Use the definition-use graph to:
- Pick a definition
- Add all uses that the definition reaches to the equivalence class
- Add all definitions that reach any of those uses...
- ...until a fixed point is reached
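The fixed point described above is just the connected components of the def-use graph, so a union-find structure computes the equivalence classes directly. A Python sketch; the def/use names and reaching pairs are hypothetical:

```python
class DSU:
    """Minimal union-find with path halving."""
    def __init__(self):
        self.p = {}
    def find(self, x):
        self.p.setdefault(x, x)
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def partition(du_edges):
    """Each (definition, use) reaching pair must share a memory location;
    connected components of the def-use graph are the equivalence classes."""
    dsu = DSU()
    for d, u in du_edges:
        dsu.union(d, u)
    groups = {}
    for x in {n for e in du_edges for n in e}:
        groups.setdefault(dsu.find(x), set()).add(x)
    return sorted(map(frozenset, groups.values()), key=min)

# Hypothetical chains for T in the renaming example: the first definition
# reaches only u1, the second only u2, so T splits into two scalars.
classes = partition([("d1", "u1"), ("d2", "u2")])
assert classes == [frozenset({"d1", "u1"}), frozenset({"d2", "u2"})]
```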

Scalar Renaming: Profitability

- Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle
- Relatively cheap to use scalar renaming
- Usually done by compilers when calculating live ranges for register allocation

Array Renaming

DO I = 1, N

S1 A(I) = A(I-1) + X

S2 Y(I) = A(I) + Z

S3 A(I) = B(I) + C

ENDDO

- Dependences: S1 δ S2 (loop-independent true), S2 δ⁻¹ S3 (loop-independent anti), S3 δ₁ S1 (loop-carried true), S1 δ⁰ S3 (loop-independent output)
- Rename A(I) to A$(I):

DO I = 1, N

S1 A$(I) = A(I-1) + X

S2 Y(I) = A$(I) + Z

S3 A(I) = B(I) + C

ENDDO

- Dependences remaining: S1 δ S2 and S3 δ₁ S1
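Renaming the writes in S1 into A$ deletes the antidependence and the output dependence while preserving every value flow, so the renamed loop must produce identical results. An illustrative Python check (X, Z, C and the initial arrays are arbitrary):

```python
import random

n = 50
X, Z, Cc = 1.5, 2.5, 3.5
A0 = [random.random() for _ in range(n + 1)]
Bv = [random.random() for _ in range(n + 1)]

# Original loop
A = A0[:]
Y = [0.0] * (n + 1)
for I in range(1, n + 1):
    A[I] = A[I - 1] + X          # S1
    Y[I] = A[I] + Z              # S2
    A[I] = Bv[I] + Cc            # S3 overwrites A(I)
A_orig, Y_orig = A, Y

# After renaming A(I) in S1/S2 to A$(I)
A = A0[:]
A_ = [0.0] * (n + 1)             # A$
Y = [0.0] * (n + 1)
for I in range(1, n + 1):
    A_[I] = A[I - 1] + X         # S1: writes A$ instead of A
    Y[I] = A_[I] + Z             # S2: reads A$
    A[I] = Bv[I] + Cc            # S3 still writes A
assert A == A_orig and Y == Y_orig
```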

Array Renaming: Profitability

- Examining dependence graph and determining minimum set of critical edges to break a recurrence is NP-complete!
- Solution: determine edges that are removed by array renaming and analyze effects on dependence graph
- procedure array_partition:
- Assumes no control flow in loop body
- identifies collections of references to arrays which refer to the same value
- identifies deletable output dependences and antidependences
- Use this procedure to generate code
- Minimize amount of copying back to the “original” array at the beginning and the end
