
Loop Tiling for Iterative Stencil Computations


Presentation Transcript


  1. Loop Tiling for Iterative Stencil Computations Marta Jiménez

  2. What is an Iterative Stencil Computation?
     [Figure: matrix A]
         DO K = 1, NITER        /* time-step loop */
           do J = ...
             do I = ...
               {A(I,J), A(I+1,J),…}
             enddo
           enddo
           {wrapped-around computations}
         ENDDO
     • ISC often performed for PDE, GM, IP
       • swim, tomcatv, mgrid (from SPEC95 benchmark)
       • Jacobi
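As a concrete instance of the loop structure on this slide, here is a minimal Python sketch of a Jacobi-style iterative stencil. The 4-point averaging stencil, the periodic boundary copies standing in for the "wrapped-around computations", and the names `jacobi_step`, `wrap_around` and `iterate` are all assumptions made for illustration; they are not taken from the talk.

```python
def jacobi_step(a):
    """One sweep of a 4-point averaging stencil over the interior of grid a."""
    n, m = len(a), len(a[0])
    new = [row[:] for row in a]          # read from a, write to new
    for j in range(1, n - 1):
        for i in range(1, m - 1):
            new[j][i] = 0.25 * (a[j - 1][i] + a[j + 1][i]
                                + a[j][i - 1] + a[j][i + 1])
    return new


def wrap_around(a):
    """Assumed wrapped-around computation: periodic copies into the border."""
    n, m = len(a), len(a[0])
    for j in range(n):                   # left/right border columns
        a[j][0] = a[j][m - 2]
        a[j][m - 1] = a[j][1]
    for i in range(m):                   # top/bottom border rows
        a[0][i] = a[n - 2][i]
        a[n - 1][i] = a[1][i]
    return a


def iterate(a, niter):
    """Time-step loop: a stencil sweep followed by the wrap-around code."""
    for _ in range(niter):
        a = wrap_around(jacobi_step(a))
    return a
```

The time-step loop encloses both a loop nest and extra border code, which is why the nest is neither perfectly nested nor freely permutable.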

  3. Loop Tiling
     • Loop Tiling
       • divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
       • can be applied hierarchically (Multilevel Tiling)
     • Current algorithms for Loop Tiling are limited to loops that:
       • are “perfectly” nested
       • are fully permutable
       • define a rectangular IS
     • However, in iterative stencil computations, loops are:
       • NOT perfectly nested
       • NOT fully permutable
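For the easy case that current algorithms handle (a perfectly nested, fully permutable, rectangular loop nest), tiling can be sketched in Python as below. The function name `tiled_iterations` and the tile sizes `tj`, `ti` are hypothetical; the sketch only records the order in which iterations are visited.

```python
def tiled_iterations(n, m, tj, ti):
    """Visit the n x m iteration space tile by tile (tiles of tj x ti)."""
    order = []
    for jj in range(0, n, tj):                      # loop over tiles in J
        for ii in range(0, m, ti):                  # loop over tiles in I
            for j in range(jj, min(jj + tj, n)):    # iterations inside a tile
                for i in range(ii, min(ii + ti, m)):
                    order.append((j, i))
    return order
```

Every iteration is still executed exactly once; only the order changes, which is legal here precisely because the nest is fully permutable.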

  4. Today’s talk
     • Show how Loop Tiling can be applied to iterative stencil computations
       • based on Song & Li’s paper [PLDI99]
       • define a Program Model
       • 1 level of 1D-Tiling (cache)
       • program example: SWIM
     • 2 levels of Tiling
       • 2D-Tiling at the cache level
       • 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])
     • Performance Results
       • Loop Tiling on EV5 & EV6

  5. Steps
     1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
     2- Perform 2D-Tiling for the Cache Level
     3- Perform 1D-Tiling for the Register Level

  6. 1st Step: achieve desired program model
     • Program Model:
         DO K = 1, NITER        /* time-step loop */
           do J1 = LJ1, UJ1
             do I1 = LI1, UI1
               {A(I,J), A(I+1,J),…}
             enddo
           enddo
           . . .
           do Jm = LJm, UJm
             do Im = LIm, UIm
               {A(I,J), A(I+1,J),…}
             enddo
           enddo
         ENDDO
     • Usually, programs are NOT directly written in this form
     • We must apply a set of transformations to achieve this program model

  7. SWIM original code
     Main program:
         ... initializations
      90 NCYCLE = NCYCLE + 1
         CALL CALC1
         CALL CALC2
         IF (NCYCLE >= ITMAX) STOP
         IF (NCYCLE <= 1) THEN
           CALL CALC3Z
         ELSE
           CALL CALC3
         ENDIF
         GO TO 90
     Subroutines:
         SUBROUTINE CALCX
         do J = 1,N
           do I = 1,M
             ...
           enddo
         enddo
     c   wrapped-around computations
         do J = 1, N
           ...
         enddo
         do I = 1, M
           ...
         enddo
     • Transformations
       • Inline subroutines
       • Convert GO TO into DO-loop
       • Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
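The GO TO conversion and peeling can be checked with a small Python sketch that only records the order of subroutine calls. The function names `goto_driver` and `peeled_driver` are hypothetical, and the sketch assumes ITMAX >= 2; the loop bounds `K = 2, ITMAX-1` are the ones shown on the following slides.

```python
def goto_driver(itmax):
    """Call order of the original driver (label 90 ... GO TO 90)."""
    trace = []
    ncycle = 0
    while True:
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:          # IF (NCYCLE >= ITMAX) STOP
            break
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")
    return trace


def peeled_driver(itmax):
    """Call order after converting to a DO-loop and peeling (assumes itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]       # peeled first iteration, NCYCLE = 1
    for _k in range(2, itmax):                 # DO K = 2, ITMAX-1
        trace += ["CALC1", "CALC2", "CALC3"]   # no IF left inside the loop
    trace += ["CALC1", "CALC2"]                # peeled last iteration, NCYCLE = ITMAX
    return trace
```

Peeling the first and last iterations removes both NCYCLE tests from the loop body, which is what lets the remaining K-loop match the program model.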

  8. Wrapped-around Computations
         DO K = 2, ITMAX-1
           do J = 1,N
             do I = 1,M
               ...
             enddo
           enddo
           wrapped-around comp
           do J = 1, N
             ...
           enddo
           do I = 1, M
             ...
           enddo
           ...
           do J = 1,N
             do I = 1,M
               ...
             enddo
           enddo
           ...
           ...
         ENDDO
     [Figure: J-I iteration spaces of the loop nests, labeled CALC1, CALC2, CALC3]

  9. Wrapped-around Computations
     • Projection along direction I
         DO K = 2, ITMAX-1
           do J = 1,N
             ...
           enddo
     c     wrapped-around comp
           do J = 1, N
             ...
           enddo
           do J = 1,N
             ...
           enddo
     c     wrapped-around comp
           do J = 1, N
             ...
           enddo
           ...
         ENDDO
     [Figure: iteration space along J]
     • Another way of dealing with the wrapped-around computations is performing code sinking

  10. 1st Step: achieved program model
      • Flow dependencies & iteration space for SWIM (projection along direction I)
          DO K = 2, ITMAX-1
            do J = 1,N
              ...
            enddo
            wrapped-around
            do J = 1,N
              ...
            enddo
            wrapped-around
            do J = 1,N
              ...
            enddo
            wrapped-around
          ENDDO
      [Figure: J axis (1 to N) against the K-loop (time), showing CALC1, CALC2, CALC3 at K=2 and K=3]

  11. Steps
      1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
      2- Perform 2D-Tiling for the Cache Level
      3- Perform 1D-Tiling for the Register Level

  12. 1D-Tiling
      [Figure: J axis (1 to N) at K=2, K=3, K=4, with OFFSET-i and SLOPE marked]
      • Dependencies are violated
      • Tiling parameters: SLOPE, OFFSETS-i
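The role of SLOPE can be illustrated with a deliberately simplified Python sketch: a single 3-point stencil on a 1D array with fixed end points, no wrapped-around code, and the full time history kept so that an illegal order is detected (it reads an uncomputed `None`). These simplifications and the names `untiled` and `skewed_tiled` are assumptions for illustration and are not the SWIM code or Song & Li's algorithm.

```python
def untiled(a0, nsteps):
    """Reference: nsteps sweeps of a 3-point stencil with fixed end points."""
    a = a0[:]
    n = len(a)
    for _ in range(nsteps):
        a = [a[0]] + [(a[j - 1] + a[j] + a[j + 1]) / 3
                      for j in range(1, n - 1)] + [a[-1]]
    return a


def skewed_tiled(a0, nsteps, tile, slope=1):
    """Tile loop outermost; each tile shifts left by slope per time step."""
    n = len(a0)
    # hist[k][j] = value of point j after k steps; None means not computed yet
    hist = [a0[:]] + [[a0[0]] + [None] * (n - 2) + [a0[-1]]
                      for _ in range(nsteps)]
    for jj in range(1, n - 2 + slope * nsteps + 1, tile):   # tile loop
        for k in range(1, nsteps + 1):                      # time-step loop
            lo = max(jj - slope * k, 1)
            hi = min(jj + tile - 1 - slope * k, n - 2)
            for j in range(lo, hi + 1):
                hist[k][j] = (hist[k - 1][j - 1] + hist[k - 1][j]
                              + hist[k - 1][j + 1]) / 3
    return hist[nsteps]
```

With `slope=1` every value a point needs from the previous time step lies in the same or an earlier tile, so the result matches the untiled sweep. With `slope=0` (rectangular tiles) the second time step of the first tile needs a value from the next tile, and the sketch fails with a `TypeError`, which is the dependence violation the slide refers to.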

  13. 2D-Tiling
      [Figure: J axis (1 to N) and I axis (1 to M) of the iteration space over the K (time-step) loop]
      • Tiling parameters: SLOPE, OFFSETS-i for each tiled dimension (J and I)
      • Computed using the JI-loop distance subgraph

  14. JI-loop Distance Subgraph
      [Figure: graph with nodes JI1-loop, JI2-loop, JI3-loop; edges are flow dependencies, anti-dependencies and output dependencies, labeled with distance vectors [0,0,0], [1,0,0], [1,-1,0], [1,0,-1], [1,-1,-1]]
      • Each node represents a JI-loop nest
      • Each edge represents a dependence (distance vector)

  15. Wrapped-around Computations
      • SWIM: projection along direction I
          DO K = 2, ITMAX-1
            do J = 1,N
              ...
            enddo
            wrapped-around
            do J = 1,N
              ...
            enddo
            wrapped-around
            do J = 1,N
              ...
            enddo
            wrapped-around
          ENDDO
      [Figure: J axis (1 to N) against the K-loop (time) at K=2 and K=3]
      • Backward dependencies with large distances make Tiling not profitable
        • apply Circular Loop Skewing to shorten backward dependencies

  16. Circular Loop Skewing
      • Shortens backward dependencies by changing the iteration order
      [Figure: J axis (1, 2, …, N) at K=2 and K=3, with BETA-i and DELTA marked]
      • CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)

  17. Circular Loop Skewing
      [Figure: J axis (1 to N) at K=2 and K=3, with BETA-i and DELTA marked]
          DO K = 2, ITMAX-1
            do JX = 1+BETA1+DELTA(K-2), N+BETA1+DELTA(K-2)
              J = MOD(JX-1, N) + 1
              ...
            enddo
            wrapped-around
            do JX = 1+BETA2+DELTA(K-2), N+BETA2+DELTA(K-2)
              J = MOD(JX-1, N) + 1
              ...
            enddo
            wrapped-around
            do JX = 1+BETA3+DELTA(K-2), N+BETA3+DELTA(K-2)
              J = MOD(JX-1, N) + 1
              ...
            enddo
            wrapped-around
          ENDDO
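The effect of the skewed bounds and the MOD mapping can be seen in a short Python sketch that lists the J values one skewed loop visits. It reads `DELTA(K-2)` as the product DELTA*(K-2), which is an assumption about the slide's notation; the function name `cls_order` is hypothetical.

```python
def cls_order(n, beta, delta, k):
    """J values visited by one circularly skewed loop at time step k."""
    lo = 1 + beta + delta * (k - 2)      # skewed lower bound of JX
    hi = n + beta + delta * (k - 2)      # skewed upper bound of JX
    # J = MOD(JX-1, N) + 1 maps JX back into 1..N
    return [(jx - 1) % n + 1 for jx in range(lo, hi + 1)]
```

For N = 6, BETA = 1, DELTA = 2 and K = 3, the loop starts at J = 4 and wraps around to finish at J = 3: each J is still visited exactly once per time step, only the starting point rotates.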

  18. 2nd Step: 2D-Tiling for cache level
      • SWIM: projection along direction I
      • CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
      • Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
          DO JJ = ...
            DO II = ...
              DO K = ...
                if (first tile) then
                  do JX = ...
                    offsets iter.
                  enddo
                endif
                do JX = ...
                  Iter. inside tile
                enddo
                do JX = ...
                  Iter. inside tile
                enddo
                do JX = ...
                  Iter. inside tile
                enddo
              ENDDO
      [Figure: tiled J axis (1 to N) at K=2, K=3, K=4, with the three loops numbered 1, 2, 3]

  19. Steps
      1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
      2- Perform 2D-Tiling for the Cache Level
      3- Perform 1D-Tiling for the Register Level

  20. 3rd Step: 1D-Tiling for register level
          DO JJ = ...
            DO II = ...
              DO K = ...
                ...
                do JX = LJ, UJ
                  J = MOD(JX-1, N)+1
                  do IX = LI, UI
                    I = MOD(IX-1, M)+1
                    [loop body: {I,J}]
                  enddo
                enddo
                ...
              ENDDO
      [Figure: J axis (N-2, N-1, N, 1, 2) and I axis (M-2, M-1, M, 1, 2), with the unrolled iterations marked]
      • The MOD operation introduced by CLS prevents us from fully unrolling the loop
        • First apply Index Set Splitting to loop J

  21. Index Set Splitting
      • ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space
          DO JJ = ...
            DO II = ...
              DO K = ...
                ...
                do JX = LJ, min(N,UJ)
                  J = JX
                  do IX = ...
                  enddo
                enddo
                do JX = max(N+1,LJ), UJ
                  J = JX-N
                  do IX = ...
                  enddo
                enddo
                ...
              ENDDO
      [Figure: J axis (N-2, N-1, N, 1, 2) and I axis (M-2, M-1, M, 1, 2), with the ISS split marked]
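A quick way to see that the split loops visit the same J values as the MOD version is the Python sketch below. It assumes 1 <= LJ and UJ <= 2N, so that JX wraps around at most once; the names `with_mod` and `with_iss` are hypothetical.

```python
def with_mod(lj, uj, n):
    """J values of the original loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def with_iss(lj, uj, n):
    """J values after index set splitting (assumes 1 <= lj and uj <= 2*n)."""
    js = []
    for jx in range(lj, min(n, uj) + 1):        # part that has not wrapped: J = JX
        js.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):    # part that has wrapped: J = JX - N
        js.append(jx - n)
    return js
```

Inside each of the two loops J is a plain affine function of JX, so consecutive iterations touch consecutive J values and the loop can be unrolled.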

  22. 3rd Step: 1D-Tiling for register level
          DO JJ = ...
            DO II = ...
              DO K = ...
                ...
                do JX = LJ, min(N,UJ)-3+1, 3
                  J = JX
                  do IX = ...
                    [loop body: {J}]
                    [loop body: {J+1}]
                    [loop body: {J+2}]
                  enddo
                enddo
                do JX = JX, min(N,UJ)
                  J = JX
                  do IX = ...
                    [loop body: {J}]
                  enddo
                enddo
                ...
              ENDDO
      [Figure: J axis (N-2, N-1, N, 1, 2) and I axis (M-2, M-1, M, 1, 2), with the ISS split marked]
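The register-level tiling on this slide (loop J unrolled by 3 with the copies placed inside the I loop, followed by a cleanup loop) can be sketched in Python as follows. The function name `register_tiled` is hypothetical, and the sketch only records the order of (J, I) iterations for one of the two loops produced by ISS.

```python
def register_tiled(lj, uj, li, ui):
    """Iteration order after unrolling loop J by 3 inside the I loop."""
    order = []
    jx = lj
    while jx <= uj - 3 + 1:                  # do JX = LJ, UJ-3+1, 3
        for ix in range(li, ui + 1):
            order += [(jx, ix), (jx + 1, ix), (jx + 2, ix)]   # bodies {J},{J+1},{J+2}
        jx += 3
    while jx <= uj:                          # cleanup loop: do JX = JX, UJ
        for ix in range(li, ui + 1):
            order.append((jx, ix))
        jx += 1
    return order
```

Three consecutive J iterations now share one pass over I, so values reused across J can stay in registers; the cleanup loop handles the iterations left over when the trip count is not a multiple of 3.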

  23. Code Transformations Summary
      1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
         • Inline subroutines
         • Convert GOTO into DO-loop
         • Peel iterations of the time-step loop to eliminate IF-statements
      2- Perform 2D-Tiling for the Cache Level
         • Construct JI-loop distance subgraph
         • Compute DELTA and BETAs and apply CLS to shorten backward dependencies
         • Update JI-loop distance subgraph
         • Compute OFFSETs and SLOPE and tile the IS
      3- Perform 1D-Tiling for the Register Level
         • Index Set Splitting
         • Tiling in a straightforward manner

  24. Performance Results (SWIM)
      • Architecture: EV56 (500MHz, L1:8KB, L2:96KB), EV6 (500MHz, L1:64KB, L2:4MB)
      • Compiler Invocation:
        • f77 -O5 -arch ev56 (EV5)
        • kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
      • Programs:
        • 1D-Tiling for the Cache Level: loop J, TS=4 (EV5), TS=8 (EV6)
        • 2D-Tiling for the Cache Level: TSIxJ=32x16 (EV5), TSIxJ=40x12 (EV6)
        • 1D-Tiling for the register level: loop J, TS=4 (EV5 & EV6)
      • Execution time:
        • EV5: 1519s, 1533s, 1023s, 999s, 1009s, 677s
        • EV6: 439s, 658s, 294s, 371s, 578s, 296s
      [Figure: speedup chart with bars ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

  25. Performance Results EV5 (SWIM)
      • Architecture: EV56 (500MHz, L1:8KB, L2:96KB)
      • Compiler invocations:
        • base: kf77 -O5 -arch ev56
        • no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..
      [Figure: speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

  26. Performance Results EV6 (SWIM)
      • Architecture: EV6 (500MHz, L1:64KB, L2:4MB)
      • Compiler invocations:
        • base: f77 -O5 -arch ev6
        • no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..
      [Figure: speedup over ORI (base) for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]

  27. Code for Result Verification
      • NEW in SPEC2000!!
          DO K = 2, ITMAX-1
            ...
            do J = 1,N
              ...
            enddo
      c     result verification
            IF (MOD(K,MPRINT).eq.0) THEN
              do I =
                do J =
                  UCHECK = UCHECK + {UNEW(I,J)}
                enddo
                UNEW(I,I) = . . .
              enddo
              PRINTS
            ENDIF
            do J = 1,N
              ...
            enddo
          ENDDO
      • Apply strip-mining to loop K (only useful if MPRINT is large)
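Strip-mining loop K around the verification code can be sketched in Python as below. The sketch is simplified: it treats each time step as a single unit and runs the check after the whole step, whereas on the slide the check sits between two J loops. The names `checked_every_step` and `strip_mined` are hypothetical.

```python
def checked_every_step(itmax, mprint):
    """Original shape: the MPRINT test is evaluated inside every time step."""
    trace = []
    for k in range(2, itmax):                # DO K = 2, ITMAX-1
        trace.append(("step", k))
        if k % mprint == 0:                  # IF (MOD(K,MPRINT).eq.0)
            trace.append(("check", k))
    return trace


def strip_mined(itmax, mprint):
    """Loop K strip-mined so each strip ends where a check may be due."""
    trace = []
    k_lo = 2
    while k_lo <= itmax - 1:
        # the strip ends at the next multiple of mprint, or at the last time step
        k_hi = min((k_lo + mprint - 1) // mprint * mprint, itmax - 1)
        for k in range(k_lo, k_hi + 1):      # inner K loop contains no IF
            trace.append(("step", k))
        if k_hi % mprint == 0:               # verification once per strip
            trace.append(("check", k_hi))
        k_lo = k_hi + 1
    return trace
```

The inner K loop is free of the IF and can be tiled over at most MPRINT time steps, which is why the transformation only pays off when MPRINT is large.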
