enhancing fine grained parallelism n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Enhancing Fine-Grained Parallelism PowerPoint Presentation
Download Presentation
Enhancing Fine-Grained Parallelism

Loading in 2 Seconds...

play fullscreen
1 / 46

Enhancing Fine-Grained Parallelism - PowerPoint PPT Presentation


  • 192 Views
  • Uploaded on

Enhancing Fine-Grained Parallelism. Chapter 5 of Allen and Kennedy. Optimizing Compilers for Modern Architectures. Fine-Grained Parallelism. Techniques to enhance fine-grained parallelism: Loop Interchange Scalar Expansion Scalar Renaming Array Renaming. We fail here.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Enhancing Fine-Grained Parallelism' - corby


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
enhancing fine grained parallelism

Enhancing Fine-Grained Parallelism

Chapter 5 of Allen and Kennedy

Optimizing Compilers for Modern Architectures

fine grained parallelism
Fine-Grained Parallelism

Techniques to enhance fine-grained parallelism:

  • Loop Interchange
  • Scalar Expansion
  • Scalar Renaming
  • Array Renaming
prelude a long time ago

We fail here

Prelude: A Long Time Ago...

procedurecodegen(R, k, D);

// R is the region for which we must generate code.

// k is the minimum nesting level of possible parallel loops.

// D is the dependence graph among statements in R..

find the set {S1, S2, ... , Sm} of maximal strongly-connected regions in the dependence graph D restricted to R

construct Rp from R by reducing each Si to a single node and compute Dp, the dependence graph naturally induced on Rp by D

let {p1, p2, ... , pm} be the m nodes of Rp numbered in an order consistent with Dp (use topological sort to do the numbering);

for i = 1 to m do begin

if piis cyclic then begin

generate a level-k DO statement;

let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;

codegen (pi, k+1, Di);

generate the level-k ENDDO statement;

end

else

generate a vector statement for pi in r(pi)-k+1 dimensions, where r (pi) is the number of loops containing pi;

end

end

prelude a long time ago1
Prelude: A Long Time Ago...
  • Codegen: tries to find parallelism using transformations of loop distribution and statement reordering
  • If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
  • Goal in Chapter 5: To explore other transformations to exploit parallelism
motivational example
Motivational Example

DO J = 1, M

DO I = 1, N

T = 0.0

DO K = 1,L

T = T + A(I,K) * B(K,J)

ENDDO

C(I,J) = T

ENDDO

ENDDO

codegen will not uncover any vector operations. However, by scalar expansion, we can get:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

C(I,J) = T$(I)

ENDDO

ENDDO

motivational example1
Motivational Example

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

C(I,J) = T$(I)

ENDDO

ENDDO

motivational example ii
Motivational Example II
  • Loop Distribution gives us:

DO J = 1, M

DO I = 1, N

T$(I) = 0.0

ENDDO

DO I = 1, N

DO K = 1,L

T$(I) = T$(I) + A(I,K) * B(K,J)

ENDDO

ENDDO

DO I = 1, N

C(I,J) = T$(I)

ENDDO

ENDDO

motivational example iii
Motivational Example III

Finally, interchanging I and K loops, we get:

DO J = 1, M

T$(1:N) = 0.0

DO K = 1,L

T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)

ENDDO

C(1:N,J) = T$(1:N)

ENDDO

  • A couple of new transformations used:
    • Loop interchange
    • Scalar Expansion
loop interchange
Loop Interchange

DO I = 1, N

DO J = 1, M

S A(I,J+1) = A(I,J) + B •DV: (=, <)

ENDDO

ENDDO

  • Applying loop interchange:

DO J = 1, M

DO I = 1, N

S A(I,J+1) = A(I,J) + B •DV: (<, =)

ENDDO

ENDDO

  • leads to:

DO J = 1, M

S A(1:N,J+1) = A(1:N,J) + B

ENDDO

loop interchange1
Loop Interchange
  • Loop interchange is a reordering transformation
  • Why?
    • Think of statements being parameterized with the corresponding iteration vector
    • Loop interchange merely changes the execution order of these statements.
    • It does not create new instances, or delete existing instances

DO J = 1, M

DO I = 1, N

S <some statement>

ENDDO

ENDDO

  • If interchanged, S(2, 1) will execute before S(1, 2)
loop interchange safety
Loop Interchange: Safety
  • Safety: not all loop interchanges are safe

DO J = 1, M

DO I = 1, N

A(I,J+1) = A(I+1,J) + B

ENDDO

ENDDO

  • Direction vector (<, >)
  • If we interchange loops, we violate the dependence
loop interchange safety1
Loop Interchange: Safety
  • A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
loop interchange safety2
Loop Interchange: Safety
  • A dependence is interchange-sensitiveif it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.
loop interchange safety3
Loop Interchange: Safety
  • Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
  • The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.
loop interchange safety4
Loop Interchange: Safety

DO I = 1, N

DO J = 1, M

DO K = 1, L

A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)

ENDDO

ENDDO

ENDDO

  • The direction matrix for the loop nest is:

< < =

< = >

  • Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
  • Follows from Theorem 5.1 and Theorem 2.3
loop interchange profitability
Loop Interchange: Profitability
  • Profitability depends on architecture

DO I = 1, N

DO J = 1, M

DO K = 1, L

S A(I+1,J+1,K) = A(I,J,K) + B

ENDDO

ENDDO

ENDDO

  • For SIMD machines with large number of FU’s:

DO I = 1, N

S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B

ENDDO

  • Not suitable for vector register machines
loop interchange profitability1
Loop Interchange: Profitability
  • For Vector machines, we want to vectorize loops with stride-one memory access
  • Since Fortran stores in column-major order:
    • useful to vectorize the I-loop
  • Thus, transform to:

DO J = 1, M

DO K = 1, L

S A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

ENDDO

loop interchange profitability2
Loop Interchange: Profitability
  • MIMD machines with vector execution units: want to cut down synchronization costs
  • Hence, shift K-loop to outermost level:

PARALLEL DO K = 1, L

DO J = 1, M

A(2:N+1,J+1,K) = A(1:N,J,K) + B

ENDDO

END PARALLEL DO

loop shifting
Loop Shifting
  • Motivation: Identify loops which can be moved and move them to “optimal” nesting levels
  • Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.
  • Proof:
loop shifting1
Loop Shifting

DO I = 1, N

DO J = 1, N

DO K = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

  • S has true, anti and output dependences on itself, hence codegen will fail as recurrence exists at innermost level
  • Use loop shifting to move K-loop to the outermost level:

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

loop shifting2
Loop Shifting

DO K= 1, N

DO I = 1, N

DO J = 1, N

S A(I,J) = A(I,J) + B(I,K)*C(K,J)

ENDDO

ENDDO

ENDDO

codegen vectorizes to:

DO K = 1, N

FORALL J=1,N

A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)

END FORALL

ENDDO

loop shifting3
Loop Shifting
  • Change body of codegen:

ifpi is cyclic then

if k is the deepest loop in pi

thentry_recurrence_breaking(pi, D, k)

else begin

select_loop_and_interchange(pi, D, k);

generate a level-k DO statement;

let Di be the dependence graph consisting of

all dependence edges in D that are at level

k+1 or greater and are internal to pi;

codegen (pi, k+1, Di);

generate the level-k ENDDO statement

end

end

loop shifting4
Loop Shifting

procedure select_loop_and_interchange(i, D, k)

if the outermost carried dependence in i is at

level p>k then

shift loops at level k,k+1,...,p-1 inside

the level-p loop, making it into the level-k

loop;

return;

end select_loop_and_interchange

loop selection
Loop Selection
  • Consider:

DO I = 1, N

DO J = 1, M

S A(I+1,J+1) = A(I,J) + A(I+1,J)

ENDDO

ENDDO

  • Direction matrix:

< <

= <

  • Loop shifting algorithm will fail to uncover vector loops; however, interchanging the loops can lead to:

DO J = 1, M

A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)

ENDDO

  • Need a more general algorithm
loop selection1
Loop Selection
  • Loop selection:
    • Select a loop at nesting level p  k that can be safely moved outward to level k and shift the loops at level k, k+1, …, p-1 inside it
loop selection2

Loop p

Direction vector

Loop Selection
  • Heuristics for selecting loop level
    • If the level-k loop carries no dependence, then let p be the smallest integer such that the level-p loop carries a dependence. (loop-shifting heuristic.)
    • If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the pth. If no such loop exists, let p = k.

= = < > = . . .

= = = < < . . .

= = < = = . . .

scalar expansion
Scalar Expansion

DO I = 1, N

S1 T = A(I)

S2 A(I) = B(I)

S3 B(I) = T

ENDDO

  • Scalar Expansion:

DO I = 1, N

S1 T$(I) = A(I)

S2 A(I) = B(I)

S3 B(I) = T$(I)

ENDDO

T = T$(N)

  • leads to:

S1 T$(1:N) = A(1:N)

S2 A(1:N) = B(1:N)

S3 B(1:N) = T$(1:N)

T = T$(N)

scalar expansion1
Scalar Expansion
  • However, not always profitable. Consider:

DO I = 1, N

T = T + A(I) + A(I+1)

A(I) = T

ENDDO

  • Scalar expansion gives us:

T$(0) = T

DO I = 1, N

S1 T$(I) = T$(I-1) + A(I) + A(I+1)

S2 A(I) = T$(I)

ENDDO

T = T$(N)

scalar expansion safety
Scalar Expansion: Safety
  • Scalar expansion is always safe
  • When is it profitable?
    • Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions.
    • However, we want to predict when expansion is profitable
  • Dependences due to reuse of memory location vs. reuse of values
    • Dependences due to reuse of values must be preserved
    • Dependences due to reuse of memory location can be deleted by expansion
scalar expansion covering definitions
Scalar Expansion: Covering Definitions
  • A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

DO I = 1, 100

S1 T = X(I)

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

S2 Y(I) = T

ENDIF

ENDDO

covering

covering

scalar expansion covering definitions1
Scalar Expansion: Covering Definitions
  • A covering definition does not always exist:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

  • In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a -function later in the loop which merges its values with those for another control flow path through the loop
scalar expansion covering definitions2
Scalar Expansion: Covering Definitions
  • We will consider a collection of covering definitions
  • There is a collection C of covering definitions for T in a loop if either:
    • There exists no -function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or,
    • The -function within the loop has no SSA edge to any -function including itself
scalar expansion covering definitions3
Scalar Expansion: Covering Definitions
  • Remember the loop which had no covering definition:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

  • To form a collection of covering definitions, we can insert dummy assignments:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

scalar expansion covering definitions4
Scalar Expansion: Covering Definitions
  • Algorithm to insert dummy assignments and compute the collection, C, of covering definitions:
    • Central idea: Look for parallel paths to a -function following the first assignment, until no more exist
scalar expansion covering definitions5
Scalar Expansion: Covering Definitions

Detailed algorithm:

  • Let S0 be the -function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
  • Let S1 be the first definition of T in the loop. Add S1 to C.
  • If the SSA successor of S1 is a -function S2 that is not equal to S0, then push S2 onto the stack and mark it;
  • While the stack is non-empty,
    • pop the -function S from the stack;
    • add all SSA predecessors that are not -functions to C;
    • if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C;
    • for each unmarked -function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
    • for each unmarked -function S4 that can be reached from S by a single SSA edge and which is not predominated by S in the control flow graph mark S4 and push it onto the stack.
scalar expansion covering definitions6
Scalar Expansion: Covering Definitions

Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:

  • Create an array T$ of appropriate length
  • For each S in the covering definition collection C, replace the T on the left-hand side by T$(I).
  • For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S0, the -function at the beginning of the loop, replace T by T$(I).
  • For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
  • If S0 is not null, then insert T$(0) = T before the loop.
  • If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound.
scalar expansion covering definitions7
Scalar Expansion: Covering Definitions

After inserting covering definitions:

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ENDIF

S2 Y(I) = T

ENDDO

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T = X(I)

ELSE

S2 T = T

ENDIF

S3 Y(I) = T

ENDDO

After scalar expansion:

T$(0) = T

DO I = 1, 100

IF (A(I) .GT. 0) THEN

S1 T$(I) = X(I)

ELSE

T$(I) = T$(I-1)

ENDIF

S2 Y(I) = T$(I)

ENDDO

deletable dependences
Deletable Dependences
  • Uses of T before covering definitions are expanded as T$(I - 1)
  • All other uses are expanded as T$(I) then the deletable dependences are:
    • Backward carried antidependences
    • Backward carried output dependences
    • Forward carried output dependences
    • Loop-independent antidependences into the covering definition
    • Loop-carried true dependences from a covering definition
scalar expansion2
Scalar Expansion

procedure try_recurrence_breaking(i, D, k)

if k is the deepest loop in ithen begin

remove deletable edges in i;

find the set {SC1, SC2, ..., SCn} of maximal strongly-connected regions in D restricted to i;

if there are vector statements among SCithen

expand scalars indicated by deletable edges;

codegen(i, k, D restricted to i);

end try_recurrence_breaking

scalar expansion drawbacks
Scalar Expansion: Drawbacks
  • Expansion increases memory requirements
  • Solutions:
    • Expand in a single loop
    • Strip mine loop before expansion
    • Forward substitution:

DO I = 1, N

T = A(I) + A(I+1)

A(I) = T + B(I)

ENDDO

DO I = 1, N

A(I) = A(I) + A(I+1) + B(I)

ENDDO

scalar renaming
Scalar Renaming

DO I = 1, 100

S1 T = A(I) + B(I)

S2 C(I) = T + T

S3 T = D(I) - B(I)

S4 A(I+1) = T * T

ENDDO

  • Renaming scalar T:

DO I = 1, 100

S1T1 = A(I) + B(I)

S2 C(I) = T1 + T1

S3T2 = D(I) - B(I)

S4 A(I+1) = T2 * T2

ENDDO

scalar renaming1
Scalar Renaming
  • will lead to:

S3 T2$(1:100) = D(1:100) - B(1:100)

S4 A(2:101) = T2$(1:100) * T2$(1:100)

S1 T1$(1:100) = A(1:100) + B(1:100)

S2 C(1:100) = T1$(1:100) + T1$(1:100)

T = T2$(100)

scalar renaming2
Scalar Renaming
  • Renaming algorithm partitions all definitions and uses into equivalent classes, each of which can occupy different memory locations:
    • Use the definition-use graph to:
    • Pick definition
    • Add all uses that the definition reaches to the equivalence class
    • Add all definitions that reach any of the uses…
    • ..until fixed point is reached
scalar renaming profitability
Scalar Renaming: Profitability
  • Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle
  • Relatively cheap to use scalar renaming
  • Usually done by compilers when calculating live ranges for register allocation
array renaming
Array Renaming

DO I = 1, N

S1 A(I) = A(I-1) + X

S2 Y(I) = A(I) + Z

S3 A(I) = B(I) + C

ENDDO

  • S1  S2 S2 -1 S3 S3 1 S1 S1 0 S3
  • Rename A(I) to A$(I):

DO I = 1, N

S1 A$(I) = A(I-1) + X

S2 Y(I) = A$(I) + Z

S3 A(I) = B(I) + C

ENDDO

  • Dependences remaining: S1  S2 andS3 1 S1
array renaming profitability
Array Renaming: Profitability
  • Examining dependence graph and determining minimum set of critical edges to break a recurrence is NP-complete!
  • Solution: determine edges that are removed by array renaming and analyze effects on dependence graph
  • procedure array_partition:
    • Assumes no control flow in loop body
    • identifies collections of references to arrays which refer to the same value
    • identifies deletable output dependences and antidependences
  • Use this procedure to generate code
    • Minimize amount of copying back to the “original” array at the beginning and the end