Compiling array assignments l.jpg
This presentation is the property of its rightful owner.
Sponsored Links
1 / 40

Compiling Array Assignments PowerPoint PPT Presentation


  • 134 Views
  • Uploaded on
  • Presentation posted in: General

Compiling Array Assignments. Allen and Kennedy, Chapter 13. Fortran 90. Fortran 90: successor to Fortran 77 Slow to gain acceptance: Need better/smarter compiler techniques to achieve same level of performance as Fortran 77 compilers

Download Presentation

Compiling Array Assignments

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Compiling array assignments l.jpg

Compiling Array Assignments

Allen and Kennedy, Chapter 13


Fortran 90 l.jpg

Fortran 90

  • Fortran 90: successor to Fortran 77

  • Slow to gain acceptance:

    • Need better/smarter compiler techniques to achieve same level of performance as Fortran 77 compilers

  • This chapter focuses on a single new feature - the array assignment statement: A[1:100] = 2.0

    • Intended to provide direct mechanism to specify parallel/vector execution

  • This statement must be implemented for the specific available hardware. In an uniprocessor, the statement must be converted to a scalar loop: Scalarization


Fortran 903 l.jpg

Fortran 90

  • Range of a vector operation in Fortran 90 denoted by a triplet: <lower bound: upper bound: increment>

    A[1:100:2] = B[2:51] + 3.0

  • Semantics of Fortran 90 require that for vector statements, all inputs to the statement are fetched before any results are stored


Outline l.jpg

Outline

  • Simple scalarization

  • Safe scalarization

  • Techniques to improve on safe scalarization

    • Loop reversal

    • Input prefetching

    • Loop splitting

  • Multidimensional scalarization

  • A framework for analyzing multidimensional scalarization


Scalarization l.jpg

Scalarization

  • Replace each array assignment by a corresponding DO loop

  • Is it really that easy?

  • Two key issues:

    • Wish to avoid generating large array temporaries

    • Wish to optimize loops to exhibit good memory hierarchy

    • performance


Simple scalarization l.jpg

Simple Scalarization

  • Consider the vector statement:

    A(1:200) = 2.0 * A(1:200)

  • A scalar implementation:

    S1 DO I = 1, 200

    S2 A(I) = 2.0 * A(I)

    ENDDO

  • However, some statements cause problems:

    A(2:201) = 2.0 * A(1:200)

  • If we naively scalarize:

    DO i = 1, 200

    A(i+1) = 2.0 * A(i)

    ENDDO


Simple scalarization7 l.jpg

Simple Scalarization

  • Meaning of statements changed by the scalarization process: scalarization faults

  • Naive algorithm which ignores such scalarization faults:

    procedure SimpleScalarize(S)

    let V0 = (L0: U0: D0 ) be the vector iteration specifier on left side of S;

    // Generate the scalarizing loop

    let I be a new loop index variable;

    generate the statement DO I = L0 ,U0 ,D ;

    for each vector specifier V = (L: U: D) in S do

    replace V with (I+L-L0 );

    generate an ENDDO statement;

    end SimpleScalarize


Scalarization faults l.jpg

Scalarization Faults

  • Why do scalarization faults occur?

  • Vector operation semantics: All values from the RHS of the assignment should be fetched before storing into the result

  • If a scalar operation stores into a location fetched by a later operation, we get a scalarization fault

  • Principle 13.1: A vector assignment generates a scalarization fault if and only if the scalarized loop carries a true dependence.

  • These dependences are known as scalarization dependences

  • To preserve correctness, compiler should never produce a scalarization dependence


Safe scalarization l.jpg

Safe Scalarization

  • Naive algorithm for safe scalarization: Use temporary storage to make sure scalarization dependences are not created

  • Consider:

    A(2:201) = 2.0 * A(1:200)

  • can be split up into:

    T(1:200) = 2.0 * A(1:200)

    A(2:201) = T(1:200)

  • Then scalarize using SimpleScalarize

    DO I = 1, 200

    T(I) = 2.0 * A(I)

    ENDDO

    DO I = 2, 201

    A(I) = T(I-1)

    ENDDO


Safe scalarization10 l.jpg

Safe Scalarization

  • Procedure SafeScalarize implements this method of scalarization

  • Good news:

    • Scalarization always possible by using temporaries

  • Bad News:

    • Substantial increase in memory use due to temporaries

    • More memory operations per array element

  • We shall look at a number of techniques to reduce the effects of these disadvantages


Loop reversal l.jpg

Loop Reversal

A(2:256) = A(1:255) + 1.0

  • SimpleScalarize will produce a scalarization fault

  • Solution: Loop reversal

    DO I = 256, 2, -1

    A(I) = A(I+1) + 1.0

    ENDDO


Loop reversal12 l.jpg

Loop Reversal

  • When can we use loop reversal?

    • Loop reversal maps dependences into antidependences

    • But also maps antidependences into dependences

      A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

  • After scalarization:

    DO I = 2, 257

    A(I) = ( A(I-1) + A(I+1) ) / 2.0

    ENDDO

  • Loop Reversal gets us:

    DO I = 257, 2

    A(I) = ( A(I-1) + A(I+1) ) / 2.0

    ENDDO

  • Thus, cannot use loop reversal in presence of antidependences


Input prefetching l.jpg

Input Prefetching

A(2:257) = ( A(1:256) + A(3:258) ) / 2.0

  • Causes a scalarization fault when naively scalarized to:

    DO I = 2, 257

    A(I) = ( A(I-1) + A(I+1) ) / 2.0

    ENDDO

  • Problem: Stores into first element of the LHS in the previous iteration

  • Input prefetching: Use scalar temporaries to store elements of input and output arrays


Input prefetching14 l.jpg

Input Prefetching

  • A first-cut at using temporaries:

    DO I = 2, 257

    T1 = A(I-1)

    T2 = ( T1 + A(I+1) ) / 2.0

    A(I) = T2

    ENDDO

  • T1holds element of input array, T2 holds element of output array

  • But this faces the same problem. Can correct by moving assignment to T1 into previous iteration...


Input prefetching15 l.jpg

Input Prefetching

T1 = A(1)

DO I = 2, 256

T2 = ( T1 + A(I+1) ) / 2.0

T1 = A(I)

A(I) = T2

ENDDO

T2 = ( T1 + A(257) ) / 2.0

A(I) = T2

  • Note: We are using scalar replacement, but the motivation for doing so is different than in Chapter 8


Input prefetching16 l.jpg

Already seen in Chapter 8, we need as many temporaries as the dependence threshold + 1.

Example:

DO I = 2, 257

A(I+2) = A(I) + 1.0

ENDDO

Can be changed to:

T1 = A(1)

T2 = A(2)

DO I = 2, 255

T3 = T1 + 1.0

T1 = T2

T2 = A(I+2)

A(I+2) = T3

ENDDO

T3 = T1 + 1.0

T1 = T2

A(258) = T3

T3 = T1 + 1.0

A(259) = T3

Input Prefetching


Input prefetching17 l.jpg

Input Prefetching

  • Can also unroll the loop and eliminate register to register copies

  • Principle 13.2: Any scalarization dependence with a threshold known at compile time can be corrected by input prefetching.


Input prefetching18 l.jpg

Input Prefetching

  • Sometimes, even when a scalarization dependence does not have a constant threshold, input prefetching can be used effectively

    A(1:N) = A(1:N) / A(1)

  • which can be naively scalarized as:

    DO i = 1, N

    A(i) = A(i) / A(1)

    ENDDO

  • true dependence from first iteration to every other iteration

  • antidependence from first iteration to itself

  • Via input prefetching, we get:

    tA1 = A(1)

    DO i = 1, N

    A(i) = A(i) / tA1

    ENDDO


Loop splitting l.jpg

Loop Splitting

  • Problem with using input prefetching with thresholds > 1: Temporaries must be saved for each iteration of the loop up to the threshold

  • Can potentially use loop splitting to solve this problem

    A(3:6) = ( A(1:4) + A(5:8) ) / 2.0

  • Scalarization loop:

    DO I = 3, 6

    A(I) = (A(I-2) + A(I+2) ) / 2.0

    ENDDO

  • True dependence and antidependence with threshold of 2

  • Split up into two independent loops which do not interact with each other...


Loop splitting20 l.jpg

Loop Splitting

DO I = 3, 6

A(I) = (A(I-2) + A(I+2) ) / 2.0

ENDDO

  • Both true and anti dependence has threshold of 2

  • With loop splitting becomes:

    DO I = 3, 5, 2

    A(I) = (A(I-2) + A(I+2) ) / 2.0

    ENDDO

    DO I = 4, 6, 2

    A(I) = (A(I-2) + A(I+2) ) / 2.0

    ENDDO

  • Note that the threshold becomes 1 for each loop. Could have produced incorrect results if threshold of antidependence was not divisible by 2


Loop splitting21 l.jpg

Loop Splitting

  • Can write the splitting as a nested pair of loops:

    DO i1 = 3, 4

    DO i2 = i1, 6, 2

    A(i2) = (A(i2-2) + A(i2+2) ) / 2.0

    ENDDO

    ENDDO

  • The inner loop carries a scalarization dependence with threshold 1 and the outer loop carries no dependence. Can apply input prefetching:

    DO i1 = 3, 4

    T1 = A(i1-2)

    DO i2 = i1, 6, 2

    T2 = (T1 + A(i2 + 2)) / 2.0

    T1 = A(i2)

    A(i2) = T2

    ENDDO

    ENDDO


Loop splitting22 l.jpg

Loop Splitting

  • Principle 13.3: Any scalarization loop in which all true dependences have the same constant threshold T and all antidependences have a threshold that is divisible by T can be transformed, using input prefetching and loop splitting, so that all scalarization dependences are eliminated.


Scalarization algorithm l.jpg

Scalarization Algorithm

  • Revised scalarization Algorithm:

    procedureFullScalarize

    for each vector statement Sdo begin

    compute the dependences of S on itself as though S had been scalarized;

    ifS has no scalarization dependences upon itself thenSimpleScalarize(S);

    elseifS has scalarization dependences, but no self antidependences

    then begin

    SimpleScalarize(S);

    reverse the scalarization loop;

    end


Scalarization algorithm24 l.jpg

Scalarization Algorithm

else if all scalarization dependences have a threshold of 1 then begin

SimpleScalarize(S);

InputPrefetch(S);

end

else if all scalarization dependences for S have the same constant threshold T

and all antidependences have thresholds that are divisible by T

thenSplitLoop(S);

else if all antidependences for S have the same constant threshold T

and all true dependences have thresholds that are divisible by Tthen begin

reverse the loop;

SplitLoop(S);

end

elseSafeScalarize(S, SL);

end

endFullScalarize


Multidimensional scalarization l.jpg

Multidimensional Scalarization

  • Vector statements in Fortran 90 in more than 1 dimension:

    A(1:100, 1:100) = B(1:100, 1, 1:100)

  • corresponds to:

    DO J = 1, 100

    A(1:100, J) = B(1:100, 1, J)

    ENDDO

  • Scalarization in multiple dimensions:

    A(1:100, 1:100) = 2.0 * A(1:100, 1:100)

  • Obvious Strategy: convert each vector iterator into a loop:

    DO J = 1, 100, 1

    DO I = 1, 100

    A(I,J) = 2.0 * A(I,J)

    ENDDO

    ENDDO


Multidimensional scalarization26 l.jpg

Multidimensional Scalarization

  • What should the order of the loops be after scalarization?

    • Familiar question: We dealt with this issue in Loop Selection/Interchange in Chapter 5

  • Profitability of a particular configuration depends on target architecture

    • For simplicity, we shall assume shorter strides through memory are better

    • Thus, optimal choice for innermost loop is the leftmost vector iterator


Multidimensional scalarization27 l.jpg

Multidimensional Scalarization

  • Extending previous results to multiple dimensions:

    • Each vector iterator is scalarized separately, starting from the leftmost vector iterator in the innermost loop and the rest of the iterators from left to right

  • Once the ordering is available:

    1. Test to see if the loop carries a scalarization dependence. If not, then proceed to the next loop.

    • If the scalarization loop carries only true dependences, reverse the loop and proceed to the next loop.

    • Apply input prefetching, with loop splitting where appropriate, to eliminate dependences to which it applies. Observe, however, that in outer loops, prefetching is done for a single submatrix (the remaining dimensions).

      4. Otherwise, the loop carries a scalarization fault that requires temporary storage. Generate a scalarization that utilizes temporary storage and terminate the scalarization test for this loop, since temporary storage will eliminate all scalarization faults.


Outer loop prefetching l.jpg

Outer Loop Prefetching

A(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

  • If we try to scalarize this (keeping the column iterator in the innermost loop) we get a true scalarization dependence (<, >) involving the second input and an antidependence (>, <) involving the first input

  • Cannot use loop reversal...


Outer loop prefetching29 l.jpg

Outer Loop Prefetching

A(1:N, 1:N) =

(A(0:N-1, 2:N+1) + A(2:N+1, 0:N-1)) / 2.0

  • We can use input prefetching on the outer loop. The temporaries will be arrays:

    T0(1:N) = A(2:N+1, 0)

    DO j = 1, N-1

    T1(1:N)=( A(0:N-1, j+1) + T0(1:N) ) / 2.0

    T0(1:N) = A(2:N+1, j)

    A(1:N, j) = T1(1:N)

    ENDDO

    T1(1:N) = ( A(0:N-1, N) + T0(1:N) ) / 2.0

    A(1:N, N) = T1(1:N)

  • Total temporary space required = 2 rows of original matrix

  • Better than storage required for copy of the result matrix


Loop interchange l.jpg

Loop Interchange

  • Sometimes, there is a tradeoff between scalarization and optimal memory hierarchy usage

    A(2:100, 3:101) = A(3:101, 1:201:2)

  • If we scalarize this using the prescribed order:

    DO I = 3, 101

    DO 100 J = 2, 100

    A(J,I) = A(J+1,2*I-5)

    ENDDO

    ENDDO

  • Dependences (<, >) (I = 3, 4) and (>, >) (I = 6, 7)

  • Cannot use loop reversal, input prefetching

  • Can use temporaries


Loop interchange31 l.jpg

Loop Interchange

  • However, we can use loop interchange to get:

    DO J = 2, 100

    DO I = 3, 101

    A(J,I) = A(J+1,2*I-5)

    ENDDO

    ENDDO

  • Not optimal memory hierarchy usage, but reduction of temporary storage

  • Loop interchange is useful to reduce size of temporaries

  • It can also eliminate scalarization dependences


General multidimensional scalarization l.jpg

General Multidimensional Scalarization

  • Goal: To vectorize a single statement which has m vector dimensions

    • Given an ideal order of scalarization (l1, l2, ..., lm)

    • (d1, d2, ..., dn) be direction vectors for all true and antidependences of the statement upon itself

    • The scalarization matrix is a n  m matrix of these direction vectors

  • For instance:

    A(1:N, 1:N, 1:N) = A(0:N-1, 1:N, 2:N+1) + A(1:N, 2:N+1, 0:N-1)

    > = <

    < >=


General multidimensional scalarization33 l.jpg

General Multidimensional Scalarization

  • If we examine any column of the direction matrix, we can immediately see if the corresponding loop can be safely scalarized as the outermost loop of the nest:

    • If all entries of the column are = or >, it can be safely scalarized as the outermost loop without loop reversal.

    • If all entries are = or <, it can be safely scalarized with loop reversal.

    • If it contains a mixture of < and >, it cannot be scalarized by simple means.


General multidimensional scalarization34 l.jpg

General Multidimensional Scalarization

  • Once a loop has been selected for scalarization, the dependences carried by that loop, any dependence whose direction vector does not contain a = in the position corresponding to the selected loop may be eliminated from further consideration.

  • In our example, if we move the second column to the outside, we get:

    > = < = > <

    < > = > < =

  • Scalarization in this way will reduce the matrix to:

    < >


A complete scalarization algorithm l.jpg

A Complete Scalarization Algorithm

procedureCompleteScalarize(S, loop_list)

let M be the scalarization direction matrix resulting from scalarization S to loop_list;

while there are more loops to be scalarized do begin

let l be the first loop in loop list that can be simply scalarized with or without loop reversal (determine by examining the columns of M from left to right);

if there is no such l then begin

let l be the first loop on loop_list;

section l by input prefetching

if the previous step fails then

section S using the naive temporary method and exit;

end


A complete scalarization algorithm36 l.jpg

A Complete Scalarization Algorithm

else // make l the outermost loop

section l directly or with loop reversal, depending on the entries in the column of M corresponding to l (if l is the last loop, use hardware section length);

remove l from loop_list;

let M' be M with the column corresponding to l and the rows corresponding to non “=“ entries in that column eliminated;

M := M';

end //while

endCompleteScalarize

  • Time Complexity = O(m2 n) where

    • m = number of loops

    • n = number of dependences


A complete scalarization algorithm37 l.jpg

A Complete Scalarization Algorithm

  • Correctness follows from the definition of the scalarization matrix and the method to remove dependences from the matrix

  • For a given statement S and loop list, Scalarize produces a correct scalarization with the following properties:

    • input prefetching is applied to the innermost loop possible

    • the order of scalarization loops is the closest possible to the order specified on input among scalarizations with property (1).


Scalarization example l.jpg

Scalarization Example

DO J = 2, N-1

A(2:N-1,J) = A(1:N-2,J) + A(3:N,J) + &

A(2:N-1,J-1) + A(2:N-1,J+1)/4.

ENDDO

  • Loop carried true dependence, antidependence

  • Naive compiler could generate:

    DO J = 2, N-1

    DO i = 2, N-1

    T(i-1) = (A(i-1,J) + A(i+1,J) + &

    A(i,J-1) + A(i,J+1) )/4

    ENDDO

    DO i = 2, N-1

    A(i,J) = T(i-1)

    ENDDO

    ENDDO

  • 2  (N-2)2accesses to memory due to array T


Scalarization example39 l.jpg

Scalarization Example

  • However, can use input prefetching to get:

    DO J = 2, N-1

    tA0 = A(1, J)

    DO i = 2, N-2

    tA1 = (tA0+A(i+1,J)+A(i,J-1)+A(i,J+1))/4

    tA0 = A(i-1, J)

    A(i,J) = tA1

    ENDDO

    tA1 = (tA0+A(N,J)+A(N-1,J-1)+A(N-1,J+1))/4

    A(N-1,J) = tA1

    ENDDO

  • If temporaries are allocated to registers, no more memory accesses than original Fortran 90 program


Post scalarization issues l.jpg

Post Scalarization Issues

  • Issues due to scalarization:

    • Generates many individual loops

    • These loops carry no dependences. So reuse of quantities in registers is not common

  • Solution: Use loop interchange, loop fusion, unroll-and-jam, and scalar replacement


  • Login