Jeyarajan Thiyagalingam Olav Beckmann and Paul H.J. Kelly

Improving the Performance of Morton Layout by Array Alignment and Loop Unrolling: Reducing the Price of Naivety. Jeyarajan Thiyagalingam, Olav Beckmann and Paul H.J. Kelly. Software Performance Optimisation Group, Imperial College, London

Motivation
• Consider two code variants of a matrix multiply
IJK Variant
    for( i=0; i<N; i++ )
      for( j=0; j<N; j++ )
        for( k=0; k<N; k++ )
          C[i,j] += A[i,k] * B[k,j]
IKJ Variant
    for( i=0; i<N; i++ )
      for( k=0; k<N; k++ )
        for( j=0; j<N; j++ )
          C[i,j] += A[i,k] * B[k,j]
• Both code variants are valid, apparently same complexity.
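The two variants can be written out as compilable C. This is a minimal sketch, assuming square row-major matrices stored as flat arrays of doubles; the function names are illustrative and do not come from the slides.

```c
#include <stddef.h>

/* Both functions compute C += A * B for n x n row-major matrices.
   The caller is expected to zero C first. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];  /* inner loop walks down a column of B */
}

void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];  /* inner loop walks along rows of B and C */
}
```

The only difference is the loop order, which changes the inner-loop stride through B from n elements to one element.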

The price of naivety • Depending on problem size and architecture, the IKJ variant can be up to 10 times faster than IJK.

Performance Programming Model • Naively-written code can suffer a factor-of-10 performance hit • Sometimes the compiler can help; none of the compilers we used interchanged these loops. • A robust performance programming model would have to account for the capabilities of the compiler • Offering a clear Performance Programming Model should be part of Compiler Research.

Compromise – blocked layout
[Figure: two 8x8 grids of element numbers 0-63, one in row-major order and one in blocked order]
• Reason for differences in performance:
  • Row-major traversal uses 4 words per block
  • But column-major traversal uses only 1 word per block
  • Bandwidth wasted with CM
• Blocked: 4-word cache block contains 2x2 subarray:
  • Row-major traversal uses 2 words per block
  • Column-major traversal uses 2 words per block
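The words-per-block figures can be checked by counting how many distinct cache lines a traversal touches. The sketch below assumes an 8x8 array, 4-word cache lines, a base address on a line boundary, and 2x2 blocks laid out in row-major order of blocks; the block ordering and all names are assumptions made for illustration.

```c
#include <stddef.h>

enum { N = 8, LINE_WORDS = 4 };   /* 8x8 array, 4-word cache lines */

/* Row-major offset of element (i,j). */
size_t offset_rm(size_t i, size_t j) { return i * N + j; }

/* Blocked offset: 2x2 blocks stored one after another (blocks in row-major
   order), elements inside each block in row-major order. */
size_t offset_blocked(size_t i, size_t j)
{
    size_t block = (i / 2) * (N / 2) + (j / 2);
    return block * 4 + (i % 2) * 2 + (j % 2);
}

/* Count distinct cache lines touched when walking row 0 (by_row != 0)
   or column 0 (by_row == 0). */
size_t lines_touched(size_t (*offset)(size_t, size_t), int by_row)
{
    unsigned char seen[N * N / LINE_WORDS] = {0};
    size_t count = 0;
    for (size_t t = 0; t < N; t++) {
        size_t off = by_row ? offset(0, t) : offset(t, 0);
        size_t line = off / LINE_WORDS;
        if (!seen[line]) { seen[line] = 1; count++; }
    }
    return count;
}
```

Under these assumptions, the row-major layout touches 2 lines for a row and 8 for a column (4 words and 1 word used per line), while the blocked layout touches 4 lines in both directions (2 words used per line).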

Recursively-blocked layout
[Figure: 8x8 grid of element numbers 0-63 in recursively-blocked (Z-Morton) order]
• Real machines have deep memory hierarchies
• Therefore, need to apply blocking recursively
• Layout of the blocks: Z-Morton (one of a number of space-filling curves)

Morton Layout – A Compromise • Morton storage layout is unbiased towards either row- or column-major traversal.

So have we solved the problem? • Unfortunately, the basic Morton Scheme often performs disappointingly. • At least Morton does not seem to suffer from pathological drops in performance.

Alignment • The statement that Morton layout is unbiased turns out to be based on the assumption that a cache line maps to the start of a Morton block.

Alignment • It turns out that Morton layout is only unbiased for even power-of-two cache line sizes. • The same problems happen when the base address is mis-aligned.

Alignment • We calculated miss-rates systematically for all levels of memory hierarchy • In each case, we calculated the miss-rates for all possible alignments of the base address. • The difference in miss-rates between best and worst alignment of the base address of Morton arrays can be up to a factor of 1.5 for even power-of-two cache lines, a factor of 2 for odd power-of-two cache lines.
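The effect of line size and base alignment can be illustrated by counting distinct cache lines touched along one row and one column of a Morton-ordered array. This is only a static count for a single traversal, not the miss-rate calculation described above; the 16x16 size, the bit placement (row bits in odd positions) and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum { DIM = 16 };   /* hypothetical 16x16 array stored in Morton order */

static uint32_t spread_bits(uint32_t x)   /* bit b of x -> bit 2b */
{
    uint32_t r = 0;
    for (unsigned b = 0; b < 16; b++)
        r |= ((x >> b) & 1u) << (2 * b);
    return r;
}

static uint32_t morton_offset(uint32_t i, uint32_t j)
{
    return (spread_bits(i) << 1) | spread_bits(j);
}

/* Distinct cache lines touched by walking row 0 (by_row != 0) or column 0
   (by_row == 0). line_words is the line size in array elements; base is the
   distance of element (0,0) from a line boundary, in elements.
   Requires line_words >= 1 and base < DIM * DIM. */
size_t lines_touched(size_t line_words, size_t base, int by_row)
{
    unsigned char seen[2 * DIM * DIM] = {0};
    size_t count = 0;
    for (uint32_t t = 0; t < DIM; t++) {
        size_t off = base + (by_row ? morton_offset(0, t) : morton_offset(t, 0));
        size_t line = off / line_words;
        if (!seen[line]) { seen[line] = 1; count++; }
    }
    return count;
}
```

With these assumptions, an aligned array with 4-element lines touches 8 lines in either direction, whereas 8-element lines give 4 lines for the row but 8 for the column. Shifting the base by 3 elements with 4-element lines raises the counts to 12 and 16.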

Alignment • The overall miss-rates drop exponentially with block size, but access times are generally assumed to increase geometrically with block size.

Alignment
• With canonical layouts, it is often necessary to pad the row or column length in order to avoid pathological behaviour.
• Finding the right amount of padding is not trivial.
• Theoretically, one should align the base address of Morton arrays to the largest significant block size in the memory hierarchy, i.e. page size.
• Aliasing in the memory hierarchy can spoil the theory.
• For example, on Pentium 4, the following aliasing patterns cause problems:
  • 2K – map to same L1 cache line
  • 16K – aliases in store-forwarding logic
  • 32K – map to the same L2 cache line
  • 64K – indistinguishable in L1 cache
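Page-aligning the base address can be sketched with the C11 `aligned_alloc` function. The 4096-byte page size and the helper name are assumptions; real code would query the operating system for the page size, and this sketch does nothing about the aliasing patterns listed above.

```c
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: 4 KB pages */

/* Allocate n doubles with the base address aligned to ASSUMED_PAGE_SIZE.
   Returns NULL on failure; release with free(). */
double *alloc_page_aligned(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* aligned_alloc expects the size to be a multiple of the alignment */
    size_t rounded = (bytes + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE * ASSUMED_PAGE_SIZE;
    return aligned_alloc(ASSUMED_PAGE_SIZE, rounded);
}
```

A Morton array for a 1024x1024 matrix would then be obtained with `alloc_page_aligned(1024 * 1024)`.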

Address calculation
• With lexicographic (aka canonical) layout, it’s easy to calculate the offset S of A[i,j] in an N×M array A:
  • Srm(i,j) = N*i + j
  • Scm(i,j) = i + M*j
  (if N and M are powers of two, this is bit-wise concatenation of i and j)
• In loops, the multiplication is replaced by an increment
• When unrolling loops, the address calculation can be strength-reduced.
• How can we calculate the Morton offset?
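A short sketch of the lexicographic case follows. To avoid ambiguity about which dimension N and M denote, the row length is passed explicitly as `cols`; that parameter and the function names are assumptions for illustration.

```c
#include <stddef.h>

/* Row-major offset in an array whose rows hold `cols` elements. */
size_t offset_rm(size_t i, size_t j, size_t cols) { return i * cols + j; }

/* Same offset when cols == 1 << log2_cols and j < cols:
   the bits of i are concatenated with the bits of j. */
size_t offset_rm_pow2(size_t i, size_t j, unsigned log2_cols)
{
    return (i << log2_cols) | j;
}

/* Sum column j. The multiplication i * cols is replaced by adding cols
   to a running offset on each iteration (strength reduction). */
double sum_column(const double *a, size_t rows, size_t cols, size_t j)
{
    double s = 0.0;
    size_t off = j;
    for (size_t i = 0; i < rows; i++, off += cols)
        s += a[off];
    return s;
}
```

For example, with 8 elements per row, element (3,5) sits at offset 3*8 + 5 = 29, which is also (3 << 3) | 5.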

Address calculation
• Morton indices can be calculated by using the bit-concatenation idea of RM/CM for power-of-two arrays recursively:
• For a 2x2 array, if i and j are the indices, then the location is (i << 1) | j.
• Let D0(i) = i_n 0 … i_1 0 i_0 0 (the bits of i in the odd bit positions)
• Let D1(i) = 0 i_n … 0 i_1 0 i_0 (the bits of i in the even bit positions)
• Then Smz(i,j) = D0(i) | D1(j)
• Dilation is rather expensive for an inner loop
• Strength reduction (Wise et al.):
  • D0(i+1) = ((D0(i) | Ones0) + 1) & Ones1
  • D1(i+1) = ((D1(i) | Ones1) + 1) & Ones0
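The dilation and the strength-reduced increment can be sketched for 16-bit indices as follows. The mask values are an assumption: Ones0 is taken to have 1s in the even bit positions and Ones1 in the odd positions, which is the reading under which the increment formulas hold for D0 and D1 as defined above. The function names are illustrative.

```c
#include <stdint.h>

#define ONES0 0x55555555u   /* 1s in even bit positions (assumed meaning of Ones0) */
#define ONES1 0xAAAAAAAAu   /* 1s in odd bit positions (assumed meaning of Ones1) */

/* D1: spread the low 16 bits of x so that bit b moves to bit 2b. */
uint32_t dilate1(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* D0: the same spread, shifted left by one so that bit b moves to bit 2b + 1. */
uint32_t dilate0(uint32_t x) { return dilate1(x) << 1; }

/* Smz(i,j) = D0(i) | D1(j) */
uint32_t morton_offset(uint32_t i, uint32_t j) { return dilate0(i) | dilate1(j); }

/* Given D(i), produce D(i + 1) without dilating again: fill the gap bits
   with 1s so the carry ripples through them, add 1, then clear the gaps. */
uint32_t next_dilated0(uint32_t d0) { return ((d0 | ONES0) + 1u) & ONES1; }
uint32_t next_dilated1(uint32_t d1) { return ((d1 | ONES1) + 1u) & ONES0; }
```

As a check against the 2x2 case, `morton_offset(1, 0)` is 2 and `morton_offset(1, 1)` is 3, matching (i << 1) | j.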

Address calculation
• Idea: use lookup tables for D0(i) and D1(j): A[MortonTabEven[i] + MortonTabOdd[j]]
• When can we do strength reduction?
  • In general Smz(i,j+1) could be anywhere
  • D0(i + 1) = ???
  • D0(i + k) = D0(i) + D0(k) if i’s and k’s bits do not overlap.
• We can do strength reduction D0(i + k) = D0(i) + D0(k) as long as i = 2^n (or, more generally, i is a multiple of 2^n) and k < 2^n
• With this, we can do loop unrolling
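A sketch of the lookup tables and of the additive property that makes unrolling possible. The table contents are an assumption inferred from the slide's indexing expression: `MortonTabEven` is taken to hold D0 (value bits in odd positions) and `MortonTabOdd` to hold D1. The table size and helper names are hypothetical.

```c
#include <stdint.h>

enum { TAB_SIZE = 1024 };   /* hypothetical maximum array dimension */

static uint32_t MortonTabEven[TAB_SIZE];   /* assumed to hold D0: bits in positions 1, 3, 5, ... */
static uint32_t MortonTabOdd[TAB_SIZE];    /* assumed to hold D1: bits in positions 0, 2, 4, ... */

static uint32_t spread_bits(uint32_t x)    /* bit b of x -> bit 2b */
{
    uint32_t r = 0;
    for (unsigned b = 0; b < 16; b++)
        r |= ((x >> b) & 1u) << (2 * b);
    return r;
}

void init_morton_tables(void)
{
    for (uint32_t v = 0; v < TAB_SIZE; v++) {
        MortonTabOdd[v]  = spread_bits(v);
        MortonTabEven[v] = spread_bits(v) << 1;
    }
}

/* Offset of element (i,j), following A[MortonTabEven[i] + MortonTabOdd[j]]. */
uint32_t morton_lookup(uint32_t i, uint32_t j)
{
    return MortonTabEven[i] + MortonTabOdd[j];
}

/* Returns 1 if D(i + k) == D(i) + D(k) holds in both tables, else 0.
   Requires i + k < TAB_SIZE. */
int additive(uint32_t i, uint32_t k)
{
    return MortonTabEven[i + k] == MortonTabEven[i] + MortonTabEven[k]
        && MortonTabOdd[i + k]  == MortonTabOdd[i]  + MortonTabOdd[k];
}
```

After `init_morton_tables()`, `additive(i, k)` returns 1 whenever i is a multiple of 4 and k < 4, which is the case an unroll factor of 4 relies on. It returns 0 for `additive(1, 1)`, where the bits of i and k overlap.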

Unrolled Code with Strength-Reduction
/* sz must be a multiple of 4; A, B and C are stored in Morton order.
   Declared void because the function returns no value. */
void mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
{
  unsigned i, j, k;
  for (i = 0; i < sz; i++) {
    unsigned int t1i = MortonTabOdd[i];
    for (j = 0; j < sz; j++) {
      unsigned int t0j = MortonTabEven[j];
      for (k = 0; k < sz; k += 4) {
        /* one pair of table lookups per four iterations of k */
        unsigned int t0k = MortonTabEven[k];
        unsigned int t1k = MortonTabOdd[k];
        C[t1i+t0j] += A[t1i+t0k]      * B[t1k+t0j];
        C[t1i+t0j] += A[t1i+t0k +  2] * B[t1k+t0j + 1];
        C[t1i+t0j] += A[t1i+t0k +  8] * B[t1k+t0j + 4];
        C[t1i+t0j] += A[t1i+t0k + 10] * B[t1k+t0j + 5];
      }
    }
  }
}
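One way to gain confidence in the constant offsets 2, 8, 10 and 1, 4, 5 is to compare the kernel against a plain row-major multiply on a small matrix. The harness below is a sketch under stated assumptions: `FLOATTYPE` is taken to be `double`, the tables are filled so that `MortonTabEven` has value bits in odd positions and `MortonTabOdd` in even positions (the reading under which the offsets work out), and `SZ`, `spread_bits`, `init_tables` and `max_error` are hypothetical names.

```c
typedef double FLOATTYPE;   /* assumption: the slide does not show this typedef */
enum { SZ = 8 };            /* hypothetical matrix size, a multiple of 4 */

static unsigned MortonTabEven[SZ];   /* value bits in positions 1, 3, 5, ... */
static unsigned MortonTabOdd[SZ];    /* value bits in positions 0, 2, 4, ... */

static unsigned spread_bits(unsigned x)   /* bit b of x -> bit 2b */
{
    unsigned r = 0;
    for (unsigned b = 0; b < 8; b++)
        r |= ((x >> b) & 1u) << (2 * b);
    return r;
}

static void init_tables(void)
{
    for (unsigned v = 0; v < SZ; v++) {
        MortonTabOdd[v]  = spread_bits(v);
        MortonTabEven[v] = spread_bits(v) << 1;
    }
}

/* Kernel as on the slide, declared void here since it returns nothing. */
static void mmijk_unrolled(unsigned sz, FLOATTYPE *A, FLOATTYPE *B, FLOATTYPE *C)
{
    for (unsigned i = 0; i < sz; i++) {
        unsigned t1i = MortonTabOdd[i];
        for (unsigned j = 0; j < sz; j++) {
            unsigned t0j = MortonTabEven[j];
            for (unsigned k = 0; k < sz; k += 4) {
                unsigned t0k = MortonTabEven[k];
                unsigned t1k = MortonTabOdd[k];
                C[t1i + t0j] += A[t1i + t0k]      * B[t1k + t0j];
                C[t1i + t0j] += A[t1i + t0k + 2]  * B[t1k + t0j + 1];
                C[t1i + t0j] += A[t1i + t0k + 8]  * B[t1k + t0j + 4];
                C[t1i + t0j] += A[t1i + t0k + 10] * B[t1k + t0j + 5];
            }
        }
    }
}

/* Largest absolute difference between the kernel's result and a plain
   row-major IJK multiply on the same small integer-valued inputs. */
double max_error(void)
{
    static FLOATTYPE A[SZ * SZ], B[SZ * SZ], C[SZ * SZ];   /* Morton order */
    static double a[SZ][SZ], b[SZ][SZ];                    /* row-major copies */
    double worst = 0.0;

    init_tables();
    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++) {
            unsigned m = MortonTabOdd[i] + MortonTabEven[j];
            a[i][j] = 1.0 + i + 2.0 * j;
            b[i][j] = 3.0 * i - j;
            A[m] = a[i][j];
            B[m] = b[i][j];
            C[m] = 0.0;
        }
    mmijk_unrolled(SZ, A, B, C);
    for (unsigned i = 0; i < SZ; i++)
        for (unsigned j = 0; j < SZ; j++) {
            double ref = 0.0;
            for (unsigned k = 0; k < SZ; k++)
                ref += a[i][k] * b[k][j];
            double diff = C[MortonTabOdd[i] + MortonTabEven[j]] - ref;
            if (diff < 0.0) diff = -diff;
            if (diff > worst) worst = diff;
        }
    return worst;
}
```

Because the inputs are small integers, every product and sum is exact in double precision, so `max_error()` should return exactly 0.0 if the offsets are consistent with the tables.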

So have we solved the problem? • Unrolling significantly reduces the overhead of the Basic Morton Scheme. • IKJ is still faster than IJK – might be due to having two table lookups in the inner loop.

Benchmarks • Suite of simple numerical kernels operating on 2D arrays of doubles • Used the compilers and flags which the vendors used for their SPEC-CFP2000 results

Experimental Setup • We used identical clusters of (student) lab machines during off-peak periods • Extensive scripting to automate data collection • Dixon Test to remove outliers from the measurements • Use median instead of mean. • Overall more than 26M measurements
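The choice of median over mean can be illustrated with a small helper. This is a generic sketch, not the authors' scripts; the function names are hypothetical and the Dixon test itself is not shown.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n > 0 timings. Sorts the array in place. */
double median(double *t, size_t n)
{
    qsort(t, n, sizeof *t, cmp_double);
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}
```

For timings such as 1.02, 0.98, 1.00, 9.70 and 1.01 seconds, the median stays at 1.01 while the single disturbed run pulls the mean well above every undisturbed measurement.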

Architectures
• AMD, Thunderbird 1.8GHz, 512MB DDR-RAM
  • 64KB, 2-way, 64Byte block L1 cache, 256KB, 8-way 64B block L2.
  • Intel C Compiler v7.1 for Linux. Flags: “-xK -ipo -static +FDO”
• Pentium III, Coppermine 450MHz, 256MB SDRAM
  • 16KB, 4-way, 32Byte block L1 cache, 256KB, 8-way 32B block L2.
  • Intel C Compiler v7.1 for Linux. Flags: “-xK -ipo -O3 -static +FDO”
• Sun, SunFire 6800, UltraSparc III 750MHz
  • 64KB, 4-way, 32Byte block L1 cache, 8MB Direct Mapped L2 Cache
  • Sun Workshop Compiler V6. Flags: “-fast -xcrossfile -xalias_level=std +FDO”
• Alpha, Compaq AlphaServer ES40, 21264 (EV6) 500MHz
  • 64KB, 2-way, 64Byte block L1 cache, 4MB Direct Mapped L2 Cache
  • Compaq C Compiler V6. Flags: “-arch ev6 -fast -O4”
• Pentium 4, 2.0GHz, 512MB DDR-RAM
  • 8KB, 8-way, 64Byte block L1 cache, 256KB, 8-way 64B block L2.
  • Intel C Compiler v7 for Linux. Flags: “-xW -ipo -O3 -static +FDO”

Alpha(L1:64KB/2-w/64Byte, L2:4MB/DM)

Athlon(L1:64KB/2-w/64Byte, L2:256KB/8-w/64B)

Pentium III(L1:16KB/4-w/32Byte, L2:256KB/8-w/32B)

Pentium 4(L1:8KB/8-w/64Byte, L2:256KB/8-w/64B)

Sparc (L1:64KB/4-w/32Byte, L2:8MB Direct-mapped/64Byte)

Summary • The Basic Morton Scheme often performs disappointingly. • Page-aligning the base address theoretically maximises spatial locality. • Unrolling is facilitated by carefully aligning the start iteration of unrolled loops to power-of-two indices into the array. • With base-address alignment and unrolling for strength-reduction of index calculation, Morton layout is beginning to actually work.

Future Work • Larger Factors of Unrolling • Until now only factor 4 hand-unrolled • We have used code generation to unroll by larger factors, and it seems that there are more improvements to be had. • Prefetching • It’s likely that hardware prefetching will fetch the wrong things • Turn off hardware prefetching, use the right, compiler-directed prefetching instead. • Tiling • Storage layout transformations and iteration space transformations are complementary • But we should do both.
