- By
**corby** - Follow User

- 202 Views
- Uploaded on

Download Presentation
## Enhancing Fine-Grained Parallelism

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Enhancing Fine-Grained Parallelism

Chapter 5 of Allen and Kennedy

Optimizing Compilers for Modern Architectures

Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:

- Loop Interchange
- Scalar Expansion
- Scalar Renaming
- Array Renaming

Prelude: A Long Time Ago...

procedurecodegen(R, k, D);

// R is the region for which we must generate code.

// k is the minimum nesting level of possible parallel loops.

// D is the dependence graph among statements in R..

find the set {S1, S2, ... , Sm} of maximal strongly-connected regions in the dependence graph D restricted to R

construct Rp from R by reducing each Si to a single node and compute Dp, the dependence graph naturally induced on Rp by D

let {p1, p2, ... , pm} be the m nodes of Rp numbered in an order consistent with Dp (use topological sort to do the numbering);

for i = 1 to m do begin

if piis cyclic then begin

generate a level-k DO statement;

let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;

codegen (pi, k+1, Di);

generate the level-k ENDDO statement;

end

else

generate a vector statement for pi in r(pi)-k+1 dimensions, where r (pi) is the number of loops containing pi;

end

end

Prelude: A Long Time Ago...

- Codegen: tries to find parallelism using transformations of loop distribution and statement reordering
- If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
- Goal in Chapter 5: To explore other transformations to exploit parallelism

Motivational Example

DO J = 1, M

DO I = 1, N

T = 0.0

DO K = 1,L

T = T + A(I,K) * B(K,J)

ENDDO

C(I,J) = T

ENDDO

ENDDO

codegen will not uncover any vector operations. However, by scalar expansion, we can get:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

C(I,J) = T$(I)

ENDDO

ENDDO

Motivational Example

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

C(I,J) = T$(I)

ENDDO

ENDDO

Motivational Example II

- Loop Distribution gives us:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

ENDDO

DO I = 1, N

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

ENDDO

DO I = 1, N

C(I,J) = T$(I)

ENDDO

ENDDO

Motivational Example III

Finally, interchanging I and K loops, we get:

DO J = 1, M

T$(1:N) = 0.0

DO K = 1,L

T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)

ENDDO

C(1:N,J) = T$(1:N)

ENDDO

- A couple of new transformations used:
- Loop interchange
- Scalar Expansion

Loop Interchange

DO I = 1, N

DO J = 1, M

S A(I,J+1) = A(I,J) + B •DV: (=, <)

ENDDO

ENDDO

- Applying loop interchange:

DO J = 1, M

DO I = 1, N

S A(I,J+1) = A(I,J) + B •DV: (<, =)

ENDDO

ENDDO

- leads to:

DO J = 1, M

S A(1:N,J+1) = A(1:N,J) + B

ENDDO

Loop Interchange

- Loop interchange is a reordering transformation
- Why?
- Think of statements being parameterized with the corresponding iteration vector
- Loop interchange merely changes the execution order of these statements.
- It does not create new instances, or delete existing instances

DO J = 1, M

DO I = 1, N

S <some statement>

ENDDO

ENDDO

- If interchanged, S(2, 1) will execute before S(1, 2)

Loop Interchange: Safety

- Safety: not all loop interchanges are safe

DO J = 1, M

DO I = 1, N

A(I,J+1) = A(I+1,J) + B

ENDDO

ENDDO

- Direction vector (<, >)
- If we interchange loops, we violate the dependence

Loop Interchange: Safety

- A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

Loop Interchange: Safety

- A dependence is interchange-sensitiveif it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

Loop Interchange: Safety

- Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
- The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.

Loop Interchange: Safety

DO I = 1, N

DO J = 1, M

DO K = 1, L

A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)

ENDDO

ENDDO

ENDDO

- The direction matrix for the loop nest is:

< < =

< = >

- Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
- Follows from Theorem 5.1 and Theorem 2.3

Loop Interchange: Profitability

- Profitability depends on architecture

DO I = 1, N

DO J = 1, M

DO K = 1, L

S A(I+1,J+1,K) = A(I,J,K) + B

ENDDO

ENDDO

ENDDO

- For SIMD machines with large number of FU’s:

DO I = 1, N

S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B

ENDDO

- Not suitable for vector register machines

Loop Interchange: Profitability

- For Vector machines, we want to vectorize loops with stride-one memory access
- Since Fortran stores in column-major order:
- useful to vectorize the I-loop
- Thus, transform to:

DO J = 1, M

DO K = 1, L

S A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

ENDDO

Loop Interchange: Profitability

- MIMD machines with vector execution units: want to cut down synchronization costs
- Hence, shift K-loop to outermost level:

PARALLEL DO K = 1, L

DO J = 1, M

A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

END PARALLEL DO

Loop Shifting

- Motivation: Identify loops which can be moved and move them to “optimal” nesting levels
- Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.
- Proof:

Loop Shifting

DO I = 1, N

DO J = 1, N

DO K = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

- S has true, anti and output dependences on itself, hence codegen will fail as recurrence exists at innermost level
- Use loop shifting to move K-loop to the outermost level:

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

Loop Shifting

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

codegen vectorizes to:

DO K = 1, N

FORALL J=1,N

A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)

END FORALL

ENDDO

Loop Shifting

- Change body of codegen:

ifpi is cyclic then

if k is the deepest loop in pi

thentry_recurrence_breaking(pi, D, k)

else begin

select_loop_and_interchange(pi, D, k);

generate a level-k DO statement;

let Di be the dependence graph consisting of

all dependence edges in D that are at level

k+1 or greater and are internal to pi;

codegen (pi, k+1, Di);

generate the level-k ENDDO statement

end

end

Loop Shifting

procedure select_loop_and_interchange(i, D, k)

if the outermost carried dependence in i is at

level p>k then

shift loops at level k,k+1,...,p-1 inside

the level-p loop, making it into the level-k

loop;

return;

end select_loop_and_interchange

Loop Selection

- Consider:

DO I = 1, N

DO J = 1, M

S A(I+1,J+1) = A(I,J) + A(I+1,J)

ENDDO

ENDDO

- Direction matrix:

< <

= <

- Loop shifting algorithm will fail to uncover vector loops; however, interchanging the loops can lead to:

DO J = 1, M

A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)

ENDDO

- Need a more general algorithm

Loop Selection

- Loop selection:
- Select a loop at nesting level p k that can be safely moved outward to level k and shift the loops at level k, k+1, …, p-1 inside it

Direction vector

Loop Selection- Heuristics for selecting loop level
- If the level-k loop carries no dependence, then let p be the smallest integer such that the level-p loop carries a dependence. (loop-shifting heuristic.)
- If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the pth. If no such loop exists, let p = k.

= = < > = . . .

= = = < < . . .

= = < = = . . .

Scalar Expansion

DO I = 1, N

S1 T = A(I)

S2 A(I) = B(I)

S3 B(I) = T

ENDDO

- Scalar Expansion:

DO I = 1, N

S1 T$(I) = A(I)

S2 A(I) = B(I)

S3 B(I) = T$(I)

ENDDO

T = T$(N)

- leads to:

S1 T$(1:N) = A(1:N)

S2 A(1:N) = B(1:N)

S3 B(1:N) = T$(1:N)

T = T$(N)

Scalar Expansion

- However, not always profitable. Consider:

DO I = 1, N

T = T + A(I) + A(I+1)

A(I) = T

ENDDO

- Scalar expansion gives us:

T$(0) = T

DO I = 1, N

S1 T$(I) = T$(I-1) + A(I) + A(I+1)

S2 A(I) = T$(I)

ENDDO

T = T$(N)

Scalar Expansion: Safety

- Scalar expansion is always safe
- When is it profitable?
- Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions.
- However, we want to predict when expansion is profitable
- Dependences due to reuse of memory location vs. reuse of values
- Dependences due to reuse of values must be preserved
- Dependences due to reuse of memory location can be deleted by expansion

Scalar Expansion: Covering Definitions

- A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

DO I = 1, 100

S1 T = X(I)

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

S2 Y(I) = T

ENDIF

ENDDO

covering

covering

Scalar Expansion: Covering Definitions

- A covering definition does not always exist:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

- In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a -function later in the loop which merges its values with those for another control flow path through the loop

Scalar Expansion: Covering Definitions

- We will consider a collection of covering definitions
- There is a collection C of covering definitions for T in a loop if either:
- There exists no -function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or,
- The -function within the loop has no SSA edge to any -function including itself

Scalar Expansion: Covering Definitions

- Remember the loop which had no covering definition:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

- To form a collection of covering definitions, we can insert dummy assignments:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

Scalar Expansion: Covering Definitions

- Algorithm to insert dummy assignments and compute the collection, C, of covering definitions:
- Central idea: Look for parallel paths to a -function following the first assignment, until no more exist

Scalar Expansion: Covering Definitions

Detailed algorithm:

- Let S0 be the -function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
- Let S1 be the first definition of T in the loop. Add S1 to C.
- If the SSA successor of S1 is a -function S2 that is not equal to S0, then push S2 onto the stack and mark it;
- While the stack is non-empty,
- pop the -function S from the stack;
- add all SSA predecessors that are not -functions to C;
- if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C;
- for each unmarked -function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
- for each unmarked -function S4 that can be reached from S by a single SSA edge and which is not predominated by S in the control flow graph mark S4 and push it onto the stack.

Scalar Expansion: Covering Definitions

Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:

- Create an array T$ of appropriate length
- For each S in the covering definition collection C, replace the T on the left-hand side by T$(I).
- For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S0, the -function at the beginning of the loop, replace T by T$(I).
- For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
- If S0 is not null, then insert T$(0) = T before the loop.
- If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound.

Scalar Expansion: Covering Definitions

After inserting covering definitions:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

After scalar expansion:

T$(0) = T

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T$(I) = X(I)

ELSE

T$(I) = T$(I-1)

ENDIF

S2 Y(I) = T$(I)

ENDDO

Deletable Dependences

- Uses of T before covering definitions are expanded as T$(I - 1)
- All other uses are expanded as T$(I) then the deletable dependences are:
- Backward carried antidependences
- Backward carried output dependences
- Forward carried output dependences
- Loop-independent antidependences into the covering definition
- Loop-carried true dependences from a covering definition

Scalar Expansion

procedure try_recurrence_breaking(i, D, k)

if k is the deepest loop in ithen begin

remove deletable edges in i;

find the set {SC1, SC2, ..., SCn} of maximal strongly-connected regions in D restricted to i;

if there are vector statements among SCithen

expand scalars indicated by deletable edges;

codegen(i, k, D restricted to i);

end try_recurrence_breaking

Scalar Expansion: Drawbacks

- Expansion increases memory requirements
- Solutions:
- Expand in a single loop
- Strip mine loop before expansion
- Forward substitution:

DO I = 1, N

T = A(I) + A(I+1)

A(I) = T + B(I)

ENDDO

DO I = 1, N

A(I) = A(I) + A(I+1) + B(I)

ENDDO

Scalar Renaming

DO I = 1, 100

S1 T = A(I) + B(I)

S2 C(I) = T + T

S3 T = D(I) - B(I)

S4 A(I+1) = T * T

ENDDO

- Renaming scalar T:

DO I = 1, 100

S1T1 = A(I) + B(I)

S2 C(I) = T1 + T1

S3T2 = D(I) - B(I)

S4 A(I+1) = T2 * T2

ENDDO

Scalar Renaming

- will lead to:

S3 T2$(1:100) = D(1:100) - B(1:100)

S4 A(2:101) = T2$(1:100) * T2$(1:100)

S1 T1$(1:100) = A(1:100) + B(1:100)

S2 C(1:100) = T1$(1:100) + T1$(1:100)

T = T2$(100)

Scalar Renaming

- Renaming algorithm partitions all definitions and uses into equivalent classes, each of which can occupy different memory locations:
- Use the definition-use graph to:
- Pick definition
- Add all uses that the definition reaches to the equivalence class
- Add all definitions that reach any of the uses…
- ..until fixed point is reached

Scalar Renaming: Profitability

- Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle
- Relatively cheap to use scalar renaming
- Usually done by compilers when calculating live ranges for register allocation

Array Renaming

DO I = 1, N

S1 A(I) = A(I-1) + X

S2 Y(I) = A(I) + Z

S3 A(I) = B(I) + C

ENDDO

- S1 S2 S2 -1 S3 S3 1 S1 S1 0 S3
- Rename A(I) to A$(I):

DO I = 1, N

S1 A$(I) = A(I-1) + X

S2 Y(I) = A$(I) + Z

S3 A(I) = B(I) + C

ENDDO

- Dependences remaining: S1 S2 andS3 1 S1

Array Renaming: Profitability

- Examining dependence graph and determining minimum set of critical edges to break a recurrence is NP-complete!
- Solution: determine edges that are removed by array renaming and analyze effects on dependence graph
- procedure array_partition:
- Assumes no control flow in loop body
- identifies collections of references to arrays which refer to the same value
- identifies deletable output dependences and antidependences
- Use this procedure to generate code
- Minimize amount of copying back to the “original” array at the beginning and the end

Download Presentation

Connecting to Server..