compiling array assignments
Download
Skip this Video
Download Presentation
Compiling Array Assignments

Loading in 2 Seconds...

play fullscreen
1 / 40

Compiling Array Assignments - PowerPoint PPT Presentation


  • 182 Views
  • Uploaded on

Compiling Array Assignments. Allen and Kennedy, Chapter 13. Fortran 90. Fortran 90: successor to Fortran 77 Slow to gain acceptance: Need better/smarter compiler techniques to achieve same level of performance as Fortran 77 compilers

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Compiling Array Assignments' - baakir


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
compiling array assignments

Compiling Array Assignments

Allen and Kennedy, Chapter 13

fortran 90
Fortran 90
  • Fortran 90: successor to Fortran 77
  • Slow to gain acceptance:
    • Need better/smarter compiler techniques to achieve same level of performance as Fortran 77 compilers
  • This chapter focuses on a single new feature - the array assignment statement: A[1:100] = 2.0
    • Intended to provide direct mechanism to specify parallel/vector execution
  • This statement must be implemented for the specific available hardware. In an uniprocessor, the statement must be converted to a scalar loop: Scalarization
fortran 903
Fortran 90
  • Range of a vector operation in Fortran 90 denoted by a triplet: <lower bound: upper bound: increment>

A[1:100:2] = B[2:51] + 3.0

  • Semantics of Fortran 90 require that for vector statements, all inputs to the statement are fetched before any results are stored
outline
Outline
  • Simple scalarization
  • Safe scalarization
  • Techniques to improve on safe scalarization
    • Loop reversal
    • Input prefetching
    • Loop splitting
  • Multidimensional scalarization
  • A framework for analyzing multidimensional scalarization
scalarization
Scalarization
  • Replace each array assignment by a corresponding DO loop
  • Is it really that easy?
  • Two key issues:
    • Wish to avoid generating large array temporaries
    • Wish to optimize loops to exhibit good memory hierarchy
    • performance
simple scalarization
Simple Scalarization
  • Consider the vector statement:

A(1:200) = 2.0 * A(1:200)

  • A scalar implementation:

S1 DO I = 1, 200

S2 A(I) = 2.0 * A(I)

ENDDO

  • However, some statements cause problems:

A(2:201) = 2.0 * A(1:200)

  • If we naively scalarize:

DO i = 1, 200

A(i+1) = 2.0 * A(i)

ENDDO

simple scalarization7
Simple Scalarization
  • Meaning of statements changed by the scalarization process: scalarization faults
  • Naive algorithm which ignores such scalarization faults:

procedure SimpleScalarize(S)

let V0 = (L0: U0: D0 ) be the vector iteration specifier on left side of S;

// Generate the scalarizing loop

let I be a new loop index variable;

generate the statement DO I = L0 ,U0 ,D ;

for each vector specifier V = (L: U: D) in S do

replace V with (I+L-L0 );

generate an ENDDO statement;

end SimpleScalarize

scalarization faults
Scalarization Faults
  • Why do scalarization faults occur?
  • Vector operation semantics: All values from the RHS of the assignment should be fetched before storing into the result
  • If a scalar operation stores into a location fetched by a later operation, we get a scalarization fault
  • Principle 13.1: A vector assignment generates a scalarization fault if and only if the scalarized loop carries a true dependence.
  • These dependences are known as scalarization dependences
  • To preserve correctness, compiler should never produce a scalarization dependence
safe scalarization
Safe Scalarization
  • Naive algorithm for safe scalarization: Use temporary storage to make sure scalarization dependences are not created
  • Consider:

A(2:201) = 2.0 * A(1:200)

  • can be split up into:

T(1:200) = 2.0 * A(1:200)

A(2:201) = T(1:200)

  • Then scalarize using SimpleScalarize

DO I = 1, 200

T(I) = 2.0 * A(I)

ENDDO

DO I = 2, 201

A(I) = T(I-1)

ENDDO

safe scalarization10
Safe Scalarization
  • Procedure SafeScalarize implements this method of scalarization
  • Good news:
    • Scalarization always possible by using temporaries
  • Bad News:
    • Substantial increase in memory use due to temporaries
    • More memory operations per array element
  • We shall look at a number of techniques to reduce the effects of these disadvantages
loop reversal
Loop Reversal

A(2:256) = A(1:255) + 1.0

  • SimpleScalarize will produce a scalarization fault
  • Solution: Loop reversal

DO I = 256, 2, -1

A(I) = A(I+1) + 1.0

ENDDO

loop reversal12
Loop Reversal
  • When can we use loop reversal?
    • Loop reversal maps dependences into antidependences
    • But also maps antidependences into dependences

A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

  • After scalarization:

DO I = 2, 257

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

  • Loop Reversal gets us:

DO I = 257, 2

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

  • Thus, cannot use loop reversal in presence of antidependences
input prefetching
Input Prefetching

A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

  • Causes a scalarization fault when naively scalarized to:

DO I = 2, 257

A(I) = ( A(I-1) + A(I+1) ) / 2.0

ENDDO

  • Problem: Stores into first element of the LHS in the previous iteration
  • Input prefetching: Use scalar temporaries to store elements of input and output arrays
input prefetching14
Input Prefetching
  • A first-cut at using temporaries:

DO I = 2, 257

T1 = A(I-1)

T2 = ( T1 + A(I+1) ) / 2.0

A(I) = T2

ENDDO

  • T1holds element of input array, T2 holds element of output array
  • But this faces the same problem. Can correct by moving assignment to T1 into previous iteration...
input prefetching15
Input Prefetching

T1 = A(1)

DO I = 2, 256

T2 = ( T1 + A(I+1) ) / 2.0

T1 = A(I)

A(I) = T2

ENDDO

T2 = ( T1 + A(257) ) / 2.0

A(I) = T2

  • Note: We are using scalar replacement, but the motivation for doing so is different than in Chapter 8
input prefetching16
Already seen in Chapter 8, we need as many temporaries as the dependence threshold + 1.

Example:

DO I = 2, 257

A(I+2) = A(I) + 1.0

ENDDO

Can be changed to:

T1 = A(1)

T2 = A(2)

DO I = 2, 255

T3 = T1 + 1.0

T1 = T2

T2 = A(I+2)

A(I+2) = T3

ENDDO

T3 = T1 + 1.0

T1 = T2

A(258) = T3

T3 = T1 + 1.0

A(259) = T3

Input Prefetching
input prefetching17
Input Prefetching
  • Can also unroll the loop and eliminate register to register copies
  • Principle 13.2: Any scalarization dependence with a threshold known at compile time can be corrected by input prefetching.
input prefetching18
Input Prefetching
  • Sometimes, even when a scalarization dependence does not have a constant threshold, input prefetching can be used effectively

A(1:N) = A(1:N) / A(1)

  • which can be naively scalarized as:

DO i = 1, N

A(i) = A(i) / A(1)

ENDDO

  • true dependence from first iteration to every other iteration
  • antidependence from first iteration to itself
  • Via input prefetching, we get:

tA1 = A(1)

DO i = 1, N

A(i) = A(i) / tA1

ENDDO

loop splitting
Loop Splitting
  • Problem with using input prefetching with thresholds > 1: Temporaries must be saved for each iteration of the loop up to the threshold
  • Can potentially use loop splitting to solve this problem

A(3:6) = ( A(1:4) + A(5:8) ) / 2.0

  • Scalarization loop:

DO I = 3, 6

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

  • True dependence and antidependence with threshold of 2
  • Split up into two independent loops which do not interact with each other...
loop splitting20
Loop Splitting

DO I = 3, 6

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

  • Both true and anti dependence has threshold of 2
  • With loop splitting becomes:

DO I = 3, 5, 2

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

DO I = 4, 6, 2

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

  • Note that the threshold becomes 1 for each loop. Could have produced incorrect results if threshold of antidependence was not divisible by 2
loop splitting21
Loop Splitting
  • Can write the splitting as a nested pair of loops:

DO i1 = 3, 4

DO i2 = i1, 6, 2

A(i2) = (A(i2-2) + A(i2+2) ) / 2.0

ENDDO

ENDDO

  • The inner loop carries a scalarization dependence with threshold 1 and the outer loop carries no dependence. Can apply input prefetching:

DO i1 = 3, 4

T1 = A(i1-2)

DO i2 = i1, 6, 2

T2 = (T1 + A(i2 + 2)) / 2.0

T1 = A(i2)

A(i2) = T2

ENDDO

ENDDO

loop splitting22
Loop Splitting
  • Principle 13.3: Any scalarization loop in which all true dependences have the same constant threshold T and all antidependences have a threshold that is divisible by T can be transformed, using input prefetching and loop splitting, so that all scalarization dependences are eliminated.
scalarization algorithm
Scalarization Algorithm
  • Revised scalarization Algorithm:

procedureFullScalarize

for each vector statement Sdo begin

compute the dependences of S on itself as though S had been scalarized;

ifS has no scalarization dependences upon itself thenSimpleScalarize(S);

elseifS has scalarization dependences, but no self antidependences

then begin

SimpleScalarize(S);

reverse the scalarization loop;

end

scalarization algorithm24
Scalarization Algorithm

else if all scalarization dependences have a threshold of 1 then begin

SimpleScalarize(S);

InputPrefetch(S);

end

else if all scalarization dependences for S have the same constant threshold T

and all antidependences have thresholds that are divisible by T

thenSplitLoop(S);

else if all antidependences for S have the same constant threshold T

and all true dependences have thresholds that are divisible by Tthen begin

reverse the loop;

SplitLoop(S);

end

elseSafeScalarize(S, SL);

end

endFullScalarize

multidimensional scalarization
Multidimensional Scalarization
  • Vector statements in Fortran 90 in more than 1 dimension:

A(1:100, 1:100) = B(1:100, 1, 1:100)

  • corresponds to:

DO J = 1, 100

A(1:100, J) = B(1:100, 1, J)

ENDDO

  • Scalarization in multiple dimensions:

A(1:100, 1:100) = 2.0 * A(1:100, 1:100)

  • Obvious Strategy: convert each vector iterator into a loop:

DO J = 1, 100, 1

DO I = 1, 100

A(I,J) = 2.0 * A(I,J)

ENDDO

ENDDO

multidimensional scalarization26
Multidimensional Scalarization
  • What should the order of the loops be after scalarization?
    • Familiar question: We dealt with this issue in Loop Selection/Interchange in Chapter 5
  • Profitability of a particular configuration depends on target architecture
    • For simplicity, we shall assume shorter strides through memory are better
    • Thus, optimal choice for innermost loop is the leftmost vector iterator
multidimensional scalarization27
Multidimensional Scalarization
  • Extending previous results to multiple dimensions:
    • Each vector iterator is scalarized separately, starting from the leftmost vector iterator in the innermost loop and the rest of the iterators from left to right
  • Once the ordering is available:

1. Test to see if the loop carries a scalarization dependence. If not, then proceed to the next loop.

    • If the scalarization loop carries only true dependences, reverse the loop and proceed to the next loop.
    • Apply input prefetching, with loop splitting where appropriate, to eliminate dependences to which it applies. Observe, however, that in outer loops, prefetching is done for a single submatrix (the remaining dimensions).

4. Otherwise, the loop carries a scalarization fault that requires temporary storage. Generate a scalarization that utilizes temporary storage and terminate the scalarization test for this loop, since temporary storage will eliminate all scalarization faults.

outer loop prefetching
Outer Loop Prefetching

A(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

  • If we try to scalarize this (keeping the column iterator in the innermost loop) we get a true scalarization dependence (<, >) involving the second input and an antidependence (>, <) involving the first input
  • Cannot use loop reversal...
outer loop prefetching29
Outer Loop Prefetching

A(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

  • We can use input prefetching on the outer loop. The temporaries will be arrays:

T0(1:N) = A(2:N+1, 0)

DO j = 1, N-1

T1(1:N)=( A(0:N-1, j+1) + T0(1:N) ) / 2.0

T0(1:N) = A(2:N+1, j)

A(1:N, j) = T1(1:N)

ENDDO

T1(1:N) = ( A(0:N-1, N) + T0(1:N) ) / 2.0

A(1:N, N) = T1(1:N)

  • Total temporary space required = 2 rows of original matrix
  • Better than storage required for copy of the result matrix
loop interchange
Loop Interchange
  • Sometimes, there is a tradeoff between scalarization and optimal memory hierarchy usage

A(2:100, 3:101) = A(3:101, 1:201:2)

  • If we scalarize this using the prescribed order:

DO I = 3, 101

DO 100 J = 2, 100

A(J,I) = A(J+1,2*I-5)

ENDDO

ENDDO

  • Dependences (<, >) (I = 3, 4) and (>, >) (I = 6, 7)
  • Cannot use loop reversal, input prefetching
  • Can use temporaries
loop interchange31
Loop Interchange
  • However, we can use loop interchange to get:

DO J = 2, 100

DO I = 3, 101

A(J,I) = A(J+1,2*I-5)

ENDDO

ENDDO

  • Not optimal memory hierarchy usage, but reduction of temporary storage
  • Loop interchange is useful to reduce size of temporaries
  • It can also eliminate scalarization dependences
general multidimensional scalarization
General Multidimensional Scalarization
  • Goal: To vectorize a single statement which has m vector dimensions
    • Given an ideal order of scalarization (l1, l2, ..., lm)
    • (d1, d2, ..., dn) be direction vectors for all true and antidependences of the statement upon itself
    • The scalarization matrix is a n  m matrix of these direction vectors
  • For instance:

A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) + A(1:N, 2:N+1, 0:N-1)

> = <

< > =

general multidimensional scalarization33
General Multidimensional Scalarization
  • If we examine any column of the direction matrix, we can immediately see if the corresponding loop can be safely scalarized as the outermost loop of the nest:
    • If all entries of the column are = or >, it can be safely scalarized as the outermost loop without loop reversal.
    • If all entries are = or <, it can be safely scalarized with loop reversal.
    • If it contains a mixture of < and >, it cannot be scalarized by simple means.
general multidimensional scalarization34
General Multidimensional Scalarization
  • Once a loop has been selected for scalarization, the dependences carried by that loop, any dependence whose direction vector does not contain a = in the position corresponding to the selected loop may be eliminated from further consideration.
  • In our example, if we move the second column to the outside, we get:

> = < = > <

< > = > < =

  • Scalarization in this way will reduce the matrix to:

< >

a complete scalarization algorithm
A Complete Scalarization Algorithm

procedureCompleteScalarize(S, loop_list)

let M be the scalarization direction matrix resulting from scalarization S to loop_list;

while there are more loops to be scalarized do begin

let l be the first loop in loop list that can be simply scalarized with or without loop reversal (determine by examining the columns of M from left to right);

if there is no such l then begin

let l be the first loop on loop_list;

section l by input prefetching

if the previous step fails then

section S using the naive temporary method and exit;

end

a complete scalarization algorithm36
A Complete Scalarization Algorithm

else // make l the outermost loop

section l directly or with loop reversal, depending on the entries in the column of M corresponding to l (if l is the last loop, use hardware section length);

remove l from loop_list;

let M\' be M with the column corresponding to l and the rows corresponding to non “=“ entries in that column eliminated;

M := M\';

end //while

endCompleteScalarize

  • Time Complexity = O(m2 n) where
    • m = number of loops
    • n = number of dependences
a complete scalarization algorithm37
A Complete Scalarization Algorithm
  • Correctness follows from the definition of the scalarization matrix and the method to remove dependences from the matrix
  • For a given statement S and loop list, Scalarize produces a correct scalarization with the following properties:
    • input prefetching is applied to the innermost loop possible
    • the order of scalarization loops is the closest possible to the order specified on input among scalarizations with property (1).
scalarization example
Scalarization Example

DO J = 2, N-1

A(2:N-1,J) = A(1:N-2,J) + A(3:N,J) + &

A(2:N-1,J-1) + A(2:N-1,J+1)/4.

ENDDO

  • Loop carried true dependence, antidependence
  • Naive compiler could generate:

DO J = 2, N-1

DO i = 2, N-1

T(i-1) = (A(i-1,J) + A(i+1,J) + &

A(i,J-1) + A(i,J+1) )/4

ENDDO

DO i = 2, N-1

A(i,J) = T(i-1)

ENDDO

ENDDO

  • 2  (N-2)2accesses to memory due to array T
scalarization example39
Scalarization Example
  • However, can use input prefetching to get:

DO J = 2, N-1

tA0 = A(1, J)

DO i = 2, N-2

tA1 = (tA0+A(i+1,J)+A(i,J-1)+A(i,J+1))/4

tA0 = A(i-1, J)

A(i,J) = tA1

ENDDO

tA1 = (tA0+A(N,J)+A(N-1,J-1)+A(N-1,J+1))/4

A(N-1,J) = tA1

ENDDO

  • If temporaries are allocated to registers, no more memory accesses than original Fortran 90 program
post scalarization issues
Post Scalarization Issues
  • Issues due to scalarization:
    • Generates many individual loops
    • These loops carry no dependences. So reuse of quantities in registers is not common
  • Solution: Use loop interchange, loop fusion, unroll-and-jam, and scalar replacement
ad