Parametric Tiling of Affine Loop Nests *

Parametric Tiling of Affine Loop Nests* Sanket Tavarageri 1 Albert Hartono 1 Muthu Baskaran 1 Louis-Noel Pouchet 1 J. “Ram” Ramanujam 2 P. “Saday” Sadayappan 1 1 Ohio State University2 Louisiana State University *Supported by US NSF

j j i i Loop Tiling • A key loop transformation for: • Efficient coarse-grained parallel execution • Data locality optimization for (i=1; i<=7; i++) for (j=1; j<=6; j++) S(i,j); for (it=1; it<=7; it+=Ti) for (jt=1; jt<=6; jt+=Tj) for (i=it; i<min(7,it+Ti-1); i++) for (j=jt; j<min(6,jt+Tj-1); j++) S(i,j); Inter-tile loops Intra-tile loops

Rectangular Tileability Legality of rectangular tiling: Atomic execution of each tile No cyclic dependence between tiles Data dependence lexicographically positive in all space dimensions Unimodular transformations (e.g., skewing) used as a pre-processing step to make rectangular tiling valid j i j 1 . = j’ 1 0 0 1 1 0 i i‘ j’ i’ Skewing

Parametric Tiling for (i=1; i<=N; i++) for (j=1; j<=N; j++) S(i,j); Tile loop i with tile size Ti Tile loop j with tile size Tj Performance of tiled code can vary greatly with choice of tile sizes → Model-driven and/or empirical search for best tile sizes Parametric tile sizes Not fixed at compile time Runtime parameters Valuable for: Auto-tuning systems Generalized “ATLAS” for (it=1; it<=N; it+=Ti) for (jt=1; jt<=N; jt+=Tj) for (i=it; i<min(N,it+Ti-1); i++) for (j=jt; j<min(N,jt+Tj-1); j++) S(i,j);

Approaches to Loop Tiling TLOG and HiTLOG Handles only perfectly nested loops Tile sizes can be runtime parameters Does not address parallelism Pluto Handles imperfectly nested loops Tile sizes must be fixed at compile time Addresses parallelism PrimeTile Handles imperfectly nested loops Tile sizes can be runtime parameters Does not address parallelism DynTile and PTile (this work): Systems with all positive features of existing tiling tools: Handle imperfectly nested loops Tile sizes can be runtime parameters Address parallelism Support multilevel tiling

Tiled Code Generation with Polyhedral Model i ≤ N j i ≥ 1 Assume: Rectangular tiling is valid. Tile sizes = 32 x 32 Tiled loop: for (it=0; it<=floord(N,32); it++) for (jt=0; jt<=floord(N,32); jt++) for (i=max(1,32*it); i<=min(N,32*it+31); i++) for (j=max(1,32*jt); j<=min(N,32*jt+31); j++) S(i,j); Original loop: for (i=1; i<=N; i++) for (j=1; j<=N; j++) S(i,j); j ≤ N N … 2 Statement domain: -32 0 1 0 0 0 32 0 -1 0 0 31 0 -32 0 1 0 0 0 32 0 -1 0 31 0 0 0 0 0 ≤ i-32∙it i-32∙it ≤ 31 0 ≤ j-32∙jt j-32∙jt ≤ 31 j ≥ 1 1 it jt it’ = it jt’ = jt it’ jt’ 1 0 0 0 0 0 0 1 0 0 0 0 it jt i … N 1 2 Affine schedule: 0 0 0 0 0 0 0 0 0 0 0 0 . 1 0 0 -1 -1 0 1 0 0 1 0 -1 0 -1 1 0 i j N i j 1 0 0 0 0 1 ≤ i i ≤ N 1 ≤ j j ≤ N . i j N1 >= i’ j’ = 1 0 0 0 0 1 0 0 i’ = i j’ = j Constraint of polyhedral model: Inequalities of the loop bounds must be linear in terms of loop iterators and problem sizes

PrimeTile: Approach to Sequential Parametric Tiling Recursive level-by-level generation of tiling loops by non-polyhedral AST processing No full tiles Full tiles j Full tiles (loop i) Partial tile (loop i) i for (i=lbi; i<=ubi; i++) for (j=lbj(i); j<=ubj(i); j++) S(i,j); Output pseudocode: for it { } [epilog i] [compute lbv] [compute ubv] if (lbv<ubv) { } else { [untiled j] } [prolog j] [full tiles j] [epilog j]

PrimeTile: Multi-Level Tiling Essential for: Exploiting data locality in deep multi-level memory hierarchies Approach: Boundary tiles can be recursively tiled using smaller tile sizes j i 1 levels of tiling 2 levels of tiling 3 levels of tiling

DynTile: Parametric Tiling (Multi Statement Domains) One-trip loop for i for j1=l2-1,l2-1 S1(i,j1) for j2=l2,u2 S2(i,j2) for j3 S3(i,j3) for i S1(i) for j2=l2,u2 S2(i,j2) for j3 S3(i,j3) Pre-processing to embed in common space

DynTile: Parametric Tiling (Multiple Statement Domains) Convex Hull S3 /* Inter-tile loops */ for it { for jt { } } /* Intra-tile loops*/ for i for j1 S1(i,j1) for j2 S2(i,j2) for j3 S3(i,j3) S2 S1 j i

DynTile: Wave-front Parallelism After sequential tiling: If no loop carried dependences exist, then each tiling loop is directly parallelizable If none of the tiling loops is parallel, then wave-front parallelization is always possible (all points in the same wavefront are independent of each other) i j 1 . = 1 0 0 1 1 0 i‘ j’ j’ j wavefrontk+4 wavefrontk+3 wavefrontk+2 wavefrontk+1 i wavefrontk i’

DynTile: Inspector Code for Dynamic Scheduled Parallel Execution /** Inter-tile loops */ for it { for jt { } } Step 1: Count #wavefronts and #tiles in each wavefront Step 2: Allocate bins to store wavefronts /** * Intra-tile loops * (treated as * a black box) */ Step 3: Fill the bins with its corresponding tiles Step 4: Execute in parallel all tiles in each bin w1 has 2 tiles w2 has 3 tiles w3 has 4 tiles w4 has 3 tiles w5 has 4 tiles #wavefronts = 5 jt Tile iteration space it w1 w2 w3 w4 w5 for each bin w { #pragmaomp parallel for for each tile in w { } } w1 w2 w3 w4 w5 /** Intra-tile loops (treated as a black box) */

DynTile: Implementation Pre-process Sequence of loop nests Statement polyhedra + Affine transforms (for rectangular tileability) Modified CLooG Pluto Tileable loop code (with preserved embedding information) DynTile Loop ASTs Parser + AST Generator Clan Statement polyhedra Convex Hull Generator (using ISL) Tiling Transformer Inspector Code Specifier Convex-hull loop AST Tiled loop ASTs Parallel tiled loop ASTs Code Generator Parallel tiled loop code

PTile: Loop Generation • Representation of Statement Domains • Set of affine inequalities • S: • v1 , v2 , …, vn are loop variables (v1 outermost and vn innermost) • p1 , p2 , …, pk are program parameters • Bounds of vi ,r ≤ i ≤ n, r ≥ 1 • max(f1(v1 , v2 , …, vr-1 , p1 , p2 , …, pk , c), … , ft(v1 , v2 , …, vr-1 , p1 , p2 , …, pk , c) ) ≤ vi ≤ min(g1(v1 , v2 , …, vr-1 , p1 , p2 , …, pk , c), … , gs(v1 , v2 , …, vr-1 , p1 , p2 , …, pk , c) ) • Bounds are dependent on outer loop variables and parameters (row echelon form)

Loop Generation (cont.) B P C v1 . vn p1 . pk 1 B11 0 0 … 0 B21 B22 0 … 0 . . . Bn1 Bn2 … Bnn P11 P12 … P1k P21 P22 … P2k . . . Pn1 Pn2 … Pnk c1 c2 . . . cn . ≥ 0 row echelon form – suitable for generating loop code to scan iteration points represented by the system v p 1 B | P | C . ≥ 0

Parametric Sequential Tiling • Tiling transformation • Express each variable vj in terms of inter-tile (tile) co-ordinates tj, intra-tile co-ordinates uj and tile sizes sj • vj = sj .tj + uj and 0 ≤ uj ≤ sj -1 • S’ : • S’ is equivalent to S t u p s 1 B.s | B | P | 0 | C . ≥ 0 0 | I | 0 | 0 | 0 0 | -I | 0 | I | -1 Not in Row echelon form for t But in Row echelon form for u I : Identity matrix

Parametric Sequential Tiling (cont.) • To derive a system in row echelon form for all variables • Create a system ST with only tile variables, program parameters and tile sizes (also parameters) • Relaxed projection to eliminate intra-tile variables uj • In ST , Bij.uj = • All solutions to S’ also satisfy ST • ST: • B.s has same nonzero structure as B => Row echelon form for t, where s is a diagonal matrix of parametric tile sizes 0 if Bij ≤ 0 Bij . (sj -1) if Bij > 0 t p s 1 . B.s | P | B+ | C’ ≥ 0

Parametric Sequential Tiling (cont.) t u p s 1 B.s | B | P | 0 | C S’: In row echelon form for u - To generate intra-tile loops . 0 | I | 0 | 0 | 0 ≥ 0 0 | -I | 0 | I | -1 In row echelon form for t - To generate tile loops ST: t u p s 1 B.s | 0 | P | B+ | C’ In row echelon form for t and u - To generate tile loops and intra-tile loops B.s | B | P | 0 | C ST|S’ : . ≥ 0 t p s 1 0 | I | 0 | 0 | 0 . B.s | P | B+ | C’ ≥ 0 0 | -I | 0 | I | -1

Parallel Non-parameterized Tiling /* Original loops */ for (i=1; i<=N; i++) for (j=1; j<=N; j++) for (k=i; k<=N; k++) S(i,j,k); /* Sequential tiled loops */ for (it=⌈-6/8⌉; it<=⌊N/8⌋; it++) for (jt=⌈-6/8⌉; jt<=⌊N/8⌋; jt++) for (kt=⌈(it*8-7)/8⌉; kt<=⌊N/8⌋; kt++) // intra-tile loops i,j,k Tiling (8x8x8 tile sizes) Original loop constraints Introduce new wavefront constraints (for loop kt) w = it+jt+kt Use Fourier Motzkin Elimination to derive new wavefront constraints (for loops w,it,jt )

Parallel Non-parameterized Tiling (cont.) This works when tile sizes are fixed When tile sizes are parametric, Fourier Motzkin Elimination becomes problematic Sign of the coefficient in the combined inequalities can be indeterminate impossible to determine whether the new inequality is a lower-bound or upper-bound inequality /* Parallel tiled loops */ for (w=⌈-7/8⌉; w<=⌊3*N/8⌋; w++) /* sequential */ for (it=max(⌈-6/8⌉, ⌈(4*w-N)/4⌉); it<=min(⌊N/8⌋, ⌊(8*w+7)/16⌋); it++) /* parallel */ for (jt=max(⌈-6/8⌉, ⌈(8*w-8*it-N)/8⌉); jt<=min(⌊N/8⌋, ⌊(8*w-16*it+7)/8⌋); jt++) /* parallel */ for (kt=max(⌈(it*8-7)/8⌉, w-it-jt); kt<=min(⌊N/8⌋, w-it-jt); kt++) /* one-trip-count */ // intra-tile loops i,j,k

Parallel Parametric Tiling Introduce an outermost wavefront loop Optimize the innermost iterator using wavefront inequalities w-t1-…-tn-1 ≤ tn≤ w-t1-…-tn-1 /* Parallel tiled loops */ for (w=wmin; w<=wmax; w++) /* sequential */ for (it=lbit; it<=ubit; it++) /* parallel */ for (jt=lbjt; jt<=ubjt; jt++) /* parallel */ for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */ // intra-tile loops i,j,k

Static Determination of Lowest and Highest Wavefront Numbers The outermost tiling loop enumerates the wavefront numbers from lowest (wmin) to highest (wmax) The values of wmin and wmax can be determined at compile time using ILP solvers such as PIP/PipLib Similarly, parametric bound values of each tiling loop variable (tjmin and tjmax for 1 ≤ j ≤ n) can also be computed using ILP solver. Lexicographic minimum point in each loop level, e.g., 1, 1 Lowest wavefront number e.g., wmin=⌊1/Ti⌋+⌊1/Tj⌋ Global parameter values (affine inequalities) ILP Solver Lexicographic maximal point in each loop level, e.g., 200,2*N Original point loops (affine inequalities) Highest wavefront number e.g., wmax=⌊200/Ti⌋+⌊(2*N)/Tj⌋

Parallel Parametric Tiling Introduce an outermost wavefront loop Utilize ILP solver to derive wmin and wmax Optimize the innermost iterator using wavefront inequalities w-t1-…-tn-1 ≤ tn≤ w-t1-…-tn-1 /* Parallel tiled loops */ for (w=wmin; w<=wmax; w++) /* sequential */ for (it=lbit; it<=ubit; it++) /* parallel */ for (jt=lbjt; jt<=ubjt; jt++) /* parallel */ for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */ // intra-tile loops i,j,k Correct code, but may visit many empty tiles

Parallel Parametric Tiling (cont.) Optimize using bounded wavefront inequalities Utilize ILP solver to derive parametric bound values tjmin, tjmax for 1 ≤ j ≤ n /* Parallel tiled loops */ for (w=wmin; w<=wmax; w++) /* sequential */ for (it=max(lbit, w-jtmax-ktmax); it<=min(ubit, w-jtmin-ktmin); it++) /* parallel */ for (jt=max(lbjt, w-it-ktmax); jt<=min(ubjt, w-it-ktmin); jt++) /* parallel */ for (kt=max(lbkt, w-it-jt); kt<=min(ubkt, w-it-jt); kt++) /* one-trip-count */ // intra-tile loops i,j,k Tighter loop bounds, but may still visit empty tiles

Parallel Parametric Tiling (cont.) Optimize using Relaxed Symbolic Fourier Motzkin Elimination (RSFME) /* Original loops */ for (i=1; i<=N; i++) for (j=1; j<=N; j++) for (k=i; k<=N; k++) S(i,j,k); /* Sequential tiled loops */ for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++) for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊N/Tj⌋; jt++) for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++) // intra-tile loops i,j,k Very tight loop bounds, with negligible overhead of scanning empty tiles No ambiguous signs encountered

Ambiguous Sign Resolution Resolving ambiguous sign in RSFME  Relaxation step Replace the tile loop variables with their parametric bounded values (tjmin and tjmax) /* Original loops */ for (i=1; i<=N; i++) for (j=1; j<=N-i; j++) for (k=i; k<=N; k++) S(i,j,k); /* Sequential tiled loops */ for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++) for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊(N-it*Ti)/Tj⌋; jt++) for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++) // intra-tile loops i,j,k Ambiguous sign encountered Use itmin and itmax to resolve sign ambiguity: (7a.1) w-N/Tj-N/Tk+ itmin *Ti/Tj<=it (w*Tj*Tk-N*Tj-N*Tk+ itmin *Ti*Tk)/(Tj*Tk)<=it w-N/Tj-N/Tk<=it-it*Ti/Tj (7a.2) it*Ti/Tj<= itmax -w+N/Tj+N/Tk it<=( itmax *Tj*Tk-w*Tj*Tk+N*Tj+N*Tk)/(Ti*Tk)

PTile: Prototype Implementation Pre-process Sequence of loop nests Statement polyhedra + Affine transforms (for rectangular tileability) Modified CLooG Pluto Tileable loop code (with preserved embedding information) PTile Loop ASTs Parser + AST Generator Clan Statement polyhedra Convex Hull Generator (using ISL) WavefrontParallelizer + RSFME Tiling Transformer Convex-hull loop AST Sequential tiled loop ASTs Parallel tiled loop ASTs Code Generator Parallel tiled loop code

PTile, DynTile, PrimeTile: Experiments Main comparison: PTile, DynTile and PrimeTile AMD Opteron 2380: Dual-socket quad-coreAMD Opteron 2380 processors running at 2.6 GHz with 256+256KB L1 cache, 2 MB of L2 cache Compilers: GCC version 4.4.0 ICC version 11.0 Experiments: With and without vectorization For parallel runs, used OpenMP Benchmarks: 2-D FDTD, Cholesky, DTRMM, LU

Results - 1 PTile: The RSFME relaxation step was never needed in these and other benchmarks that we have tested with Control overhead: PrimeTile has simple loop bounds but larger code size PTile and DynTile generate more complex loop bounds For 2D-FDTD, there is a 20% to 40% difference in execution time due to control overhead; for the other benchmarks, no significant difference

Results - 2 Sequential Parallel Bench Compiler PrimeTile DynTile PTile DynTile PTile 2d-fdtd gcc-novec 43.84s 49.19s 56.78s 9.32s 10.98s 2d-fdtd gcc-vec 43.82s 49.22s 56.85s 9.37s 10.98s 2d-fdtd icc-novec 40.27s 48.12s 54.29s 13.30s 12.96s 2d-fdtd icc-vec 40.52s 49.61s 54.63s 13.03s 13.18s cholesky gcc-novec 6.13s 10.50s 13.43s 1.91s 2.81s cholesky gcc-vec 6.08s 10.46s 13.45s 1.89s 2.82s cholesky icc-novec 5.63s 5.86s 8.19s 1.21s 2.40s cholesky icc-vec 5.36s 5.74s 8.22s 1.27s 2.61s dtrmm gcc-novec 9.29s 14.34s 18.99s 2.55s 4.50s dtrmm gcc-vec 9.25s 14.57s 18.99s 2.54s 3.69s dtrmm icc-novec 9.84s 9.19s 13.27s 2.17s 3.22s dtrmm icc-vec 9.91s 9.12s 13.44s 2.33s 3.27s lu gcc-novec 8.30s 9.15s 10.98s 2.56s 2.94s lu gcc-vec 8.29s 9.15s 10.98s 2.98s 2.43s lu icc-novec 6.30s 5.63s 7.49s 6.18s 1.60s lu icc-vec 6.36s 5.58s 6.52s 6.36s 1.62s

Results - 3 Sequential: PrimeTile performs best; DynTile is close gcc has more trouble optimizing code from DynTile than code from PrimeTile (difference between icc and gcc) PTile is slower because the order of execution of tiles impacts locality Parallel: DynTile performs better than PTile (except for LU – we need to understand this better) All tiles in a waveftont are executed in parallel with DynTile, where as the OpenMP parallel pragma works only with the outermost tiled parallel loop in PTile Vectorization: complexity of loop bounds in generated code appear to make it difficult for the compiler to vectorize

Summary Developed DynTile and PTile, two parametric tiling systems with the following features Handle imperfectly nested loops Allow tile sizes to be run time parameters Address parallelism Support multi-level tiling Ongoing: Much more extensive set of experiments to understand and improve the efficiency of the approaches for generation of parallel parametrically tiled code

Parametric Tiling of Affine Loop Nests *