Cache Optimizations & the Loop Nest Optimizer


Improvement Opportunities
• A program runs slowly when not all resources are used:
  • processor:
    • opportunities to go superscalar (ILP) are not used
    • scheduling of instructions is not optimal (too many wait states)
  • memory access:
    • not all data in a cache line is used (spatial locality)
    • data in the cache is not reused (temporal locality)
• Performance analysis is used to diagnose the problem.
• The compiler will attempt to optimize the program for the given architecture:
  • data structures can inhibit compiler optimizations
  • the presentation of the algorithm can inhibit compiler optimizations
• It is often necessary to rewrite the critical parts of the code (loops) so that the compiler can do better performance optimization.
• It therefore pays to understand compiler optimization techniques.

Compiler Optimization Techniques
• The following optimizations are built into the compiler:
  • general:
    • procedure inlining
    • data and array padding
  • loop based (loop nests, which implies usage of multi-dimensional arrays; enabled at -O3 or with -LNO:opt=[1|0]):
    • loop interchange
    • outer and inner loop unrolling
    • cache blocking
    • loop fusion (merge) and fission (split)
  • code generation:
    • software pipelining
    • instruction reordering
• Presenting the algorithm in the program such that the compiler can apply these optimization techniques leads to optimal program performance on the machine.

Scalar Architecture: Cache System
• The hierarchy of memory devices:
  [Figure: speed of access (1/clock; axis ticks 1, 0.1, 0.01) plotted against device capacity for registers, the cache subsystem, memory and disk. Capacity labels: 64 reg, 32 KB (L1), 8 MB (L2), ~1 - 100s GB. Latency labels: ~2-3 cy, ~10 cy, ~100 cy, ~4000 cy.]
• The goal of the memory hierarchy:
  • access speed ~ fastest memory
  • effective capacity ~ size of largest memory
• -> Programs should follow the principle of locality (use items in the cache):
  • Spatial locality of reference (use all words in a cache line)
  • Temporal locality of reference (use the same cache line)

Scalar Architecture: Cache Organization
• The goal of scalar optimization:
  • Spatial locality of reference (use all words in a cache line)
  • Temporal locality of reference (use the same cache line)
• Example: L2 cache on the O2K (e.g. 8 MB or 2097152 words), cache lines of 32 words
  [Figure: words in memory grouped into cache lines; a load instruction (ld) for 1 word triggers a cache line transfer]
• A cache hit will load the word from the cache
• A cache miss will load the whole cache line from memory

Problems of Scalar Optimization
    DO i=1,n
      DO j=1,n
        DO k=1,n
          C(i,j)=C(i,j) + A(i,k)*B(k,j)
        ENDDO
      ENDDO
    ENDDO
  [Figure: C = A x B with the i, j, k traversal directions and the cache lines marked]
• each C(i,j) value is accumulated in a register from the products A(i,k)*B(k,j)
• B is traversed in sequence of cache lines (spatial locality)
• A accesses only 1 word from each cache line (no locality)
• for A and B there is no reuse of cache lines (if n is large)
• This is a problem only if A, B, C do not fit into the cache
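The locality problem above can be sketched in C, with one caveat: C stores arrays row-wise, so the roles of A and B are swapped relative to the Fortran slide (here it is b that is strided in the naive order). The function names and the flat n*n array layout are illustrative assumptions, not code from the original deck.

```c
#include <stddef.h>

/* i-j-k order: the inner loop walks b down a column (stride n in row-major C). */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];      /* accumulate in a register */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* i-k-j order: the inner loop walks b and c along rows (stride 1). */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];      /* invariant in the inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Both routines compute the same product; only the order in which memory is touched differs, which matters once the arrays no longer fit in cache.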

Loop Nest Optimizer • LNO performs loop restructuring to optimize data access: • loop interchange • loop unrolling • loop blocking for cache • loop fusion • loop fission • pre-fetching • LNO is controlled with compiler options and/or compiler directives or pragmas; same options for both • LNO is the default at -O3, but can be turned on/off individually by -LNO:opt=[1|0] • directives/pragma syntax: • Fortran: C*$* keyword [=value(s)] • C/C++ : #pragma keyword [=value(s)] • directives/pragmas can be disabled with the compiler switch -LNO:ignore_pragmas

Array Indexing
• There are several ways to index arrays:
  Direct addressing (++):
      DO j=1,M
        DO i=1,N
          … A(i,j) …
        ENDDO
      ENDDO
  Explicit addressing (+):
      DO j=1,M
        DO i=1,N
          … A(i+(j-1)*N) …
        ENDDO
      ENDDO
  Loop carried addressing (-):
      DO j=1,M
        DO i=1,N
          k = k + 1
          … A(k) …
        ENDDO
      ENDDO
  Indirect addressing (--):
      DO j=1,M
        DO i=1,N
          … A(index(i,j)) …
        ENDDO
      ENDDO
• The addressing scheme has an impact on performance
• Arrays should be accessed in the most natural, direct way so that the compiler can apply loop optimization techniques

Data Storage in Memory
• Data storage order is language dependent:
  • Fortran stores multi-dimensional arrays "column-wise": in A(I,J) the leftmost index changes fastest
  • C stores multi-dimensional arrays "row-wise": in a[i][j] the rightmost index changes fastest
• Accessing array elements in storage order greatly improves performance for arrays that do not fit in the cache(s)
  [Figure: layout in memory of A(I,J) and a[i][j]]

Loop Interchange: FORTRAN Original loop: c*$* no interchange DO I=1,N DO J=1,M C(I,J)=A(I,J)+B(I,J) ENDDO ENDDO Interchanged loops: c*$* interchange(J,I) DO J=1,M DO I=1,N C(I,J)=A(I,J)+B(I,J) ENDDO ENDDO J I J I M A(I,J) B(I,J) C(I,J) N Storage order Access order • The distribution of data in memory is not changed. Only the access pattern is changed • Compiler can do this optimization automatically -LNO:interchange=[on|off](default on) M A(I,J) B(I,J) C(I,J) N

Index Reversal • Index reversal on B: i.e. B(I,J) replaced by B(J,I) must be done everywhere in the program • This has to be done manually, there is no compiler optimization that does index reversal. Original loop: DO I=1,N DO J=1,M C(I,J)=A(I,J)+B(J,I) ENDDO ENDDO The access is poor for A and C, while it is optimal for B Interchanged loops + Index reversal: DO J=1,M DO I=1,N C(I,J)=A(I,J)+B(I,J) ENDDO ENDDO interchange will be good for A and C, it will be bad for B

The Significance of Loop Interchange • Run time in seconds obtained on an Origin 3000: • loop order R12K@400MHz • (8 MB cache) • i,j,k 535.0 • j,i,k 32.0 • k,j,i 11.0 DO I=1,700 DO J=1,700 DO K=1,700 A(I,J,K)=A(I,J,K)+B(I,J,K)*C(I,J,K) ENDDO ENDDO ENDDO

Loop Interchange in C
• In C, the situation is exactly the opposite to Fortran:
  Original loop:
      #pragma no interchange
      for(j=0; j<m; j++)
        for(i=0; i<n; i++)
          c[i][j]=a[i][j]+b[j][i];
  Addressing of b[j][i] is optimal; addressing of c[i][j] and a[i][j] is poor.
  Interchanged loop:
      #pragma interchange(i,j)
      for(i=0; i<n; i++)
        for(j=0; j<m; j++)
          c[i][j]=a[i][j]+b[j][i];
  Index reversal loop:
      for(j=0; j<m; j++)
        for(i=0; i<n; i++)
          c[j][i]=a[j][i]+b[j][i];
• The performance benefits in C are the same as in Fortran
• In most practical situations, loop interchange (supported by the compiler) is much easier to achieve than index reversal.

Array Placement Effects
• "Poor" data placement in memory can lead to cache thrashing.
• There are 2 techniques built into the compiler to avoid cache thrashing:
  • array padding
  • leading dimension extension
• NOTE: the leading dimension of arrays should be an odd number. If the multi-dimensional array has small extents (e.g. a(64,64,64,..)), several leading dimensions should be odd numbers.

Direct-Mapped Caches: Thrashing (Virtual) memory A(1) A(2) A(8191) 32 KB A(8192) B(1) A(8187) A(8185) A(8186) A(8189) A(8192) A(8191) A(8190) A(8188) A(4) A(2) A(3) A(5) A(8) A(7) A(1) A(6) B(8191) B(8192) COMMON //A(8192), B(8192) DO I=1,N PROD = PROD + A(I)*B(I) ENDDO Registers in the CPU Direct mapped cache (32 KB) Cache line: 4 words 1 2 2047 2048 Thrashing: every memory reference results in a cache miss Location in the cache: (memory-address) mod (cache-size) in this case loc(A(1)) mod 32KB = loc(B(1)) mod 32KB [because B(1) = A(1) + 8192; 8192*4B mod 32KB = 0]
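The mapping rule on this slide can be written out as a small C helper. This is a minimal sketch under the slide's assumptions (32 KB direct-mapped cache, 4-word lines of 4-byte words); the function name is hypothetical.

```c
#include <stddef.h>

/* Line index in a direct-mapped cache: (address mod cache size) / line size. */
size_t direct_mapped_line(size_t byte_addr, size_t cache_bytes, size_t line_bytes)
{
    return (byte_addr % cache_bytes) / line_bytes;
}
```

With cache_bytes = 32768 and line_bytes = 16, any address of A(1) and the address 8192*4 bytes further on (B(1)) give the same line index, which is why every reference in the A(I)*B(I) loop evicts the line the other array just loaded.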

Set-Associative Caches (Virtual) memory A(1) A(2) A(8191) 32 KB A(8192) B(1) A(4095) A(4089) A(4090) A(4091) A(4092) A(4093) A(4094) B(4093) A(4096) B(4094) B(4089) B(4096) B(4091) B(4095) B(4092) B(4090) A(5) A(7) A(8) A(4) A(3) A(2) A(1) A(6) B(3) B(8) B(7) B(1) B(2) B(5) B(4) B(6) B(8191) B(8192) COMMON //A(8192), B(8192) DO I=1,N PROD = PROD + A(I)*B(I) ENDDO Registers in the CPU 2 way set associative cache (32 KB) Cache line: 4 words 1 2 Set select (1bit) (LRU) 1023 1024 No Thrashing: conflicting cache lines are stored into a different set Location in the cache: (memory-address) mod (cache-size) in this case loc(A(1)) mod 16KB = loc(B(1)) mod 16KB BUT A DIFFERENT SET!

Array Padding: Example
Assume a 32 KB cache.
  Without padding:
      COMMON // A(1024,1024), B(1024,1024), C(1024,1024)
      DO J=1,1024
        DO I=1,1024
          A(I,J) = A(I,J)+B(I,J)*C(I,J)
        ENDDO
      ENDDO
  Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
  position in the cache: C(1,1) = B(1,1), since (1024*1024*4) mod 32KB = 0
  With padding:
      COMMON // A(1024,1024), pad1(129), B(1024,1024), pad2(129), C(1024,1024)
      DO J=1,1024
        DO I=1,1024
          A(I,J) = A(I,J)+B(I,J)*C(I,J)
        ENDDO
      ENDDO
  Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
  position in the cache: C(1,1) = B(130,1) mod 32KB
• Padding causes the cache lines to be placed in different cache locations
• The compiler will try to do padding automatically
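The address arithmetic on this slide can be checked with a few lines of C. This is a sketch under the slide's assumptions (32 KB cache, 4-byte reals); the function and constant names are illustrative.

```c
#include <stddef.h>

enum { CACHE_BYTES = 32 * 1024, REAL_BYTES = 4 };

/* Byte distance from B(1,1) to C(1,1), reduced modulo the cache size.
   b_elems is the number of elements in B, pad_elems the padding after it. */
size_t relative_cache_offset(size_t b_elems, size_t pad_elems)
{
    return ((b_elems + pad_elems) * REAL_BYTES) % CACHE_BYTES;
}
```

Without padding the offset is 0, so B(1,1) and C(1,1) compete for the same cache location; with 129 elements of padding the offset becomes 129*4 = 516 bytes, so the two arrays are staggered in the cache.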

Maxwell Code Example
      REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) !Electric field
      REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) !Magnetic field
      …
      DO K=2,NZ-1
        DO J=2,NY-1
          DO I=2,NX-1
            HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
     &                         +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
            HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
     &                         +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
            HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
     &                         +(EX(I,J,K)-EX(I,J-1,K))*CHDY
          ENDDO
        ENDDO
      ENDDO
• here NX=NY=NZ = 32, 64, 128, 256 (i.e. with real*4 elements: 0.8MB, 6.3MB, 50MB, 403MB)
• Reusing the loads from the previous iteration (I-1) gives in total 13 memory operations (6H+7E) -> minimum 13 cycles/iteration
• There are 18 floating point operations in this code: 18/(13*2) = 69% of peak, where peak is 800 Mflop/s on the R10000@400MHz processor
• Compiling with:
      -mips4 -O3 -LNO:opt=0 -OPT:reorg_common=off
  (to show the effect of the compiler not performing the necessary optimizations) gives a performance on this code of 4.6 Mflop/s

Maxwell Example - continued
• Problem: the array dimensions are small even numbers, powers of 2, and the arrays map to the same location in both the 1st level and the 2nd level caches
• In general:
  • primary cache 32 KB = 2(way-set-ass) * 4(size-real) * 4096
  • secondary cache 8 MB = 2(way-set-ass) * 4(size-real) * 1048576
• Print the position of the arrays in memory with the code:
      INTEGER*8 aEX
      aEX = %LOC(EX(1,1,1))
      PRINT *,'Addr EX=',mod(aEX,4096), mod(aEX,1048576),'words'
• For the Maxwell example with NX=NY=NZ=64 the print shows:
      Addr EX= 3720 470664
      Addr EY= 3720 470664
      …….. etc.
      Addr HZ= 3720 470664
  All arrays map to the same locations in both caches
• The compiler is able to pad the arrays automatically. Compiling with the default optimizations, -mips4 -O3, gives a performance of 162 Mflop/s

Dangers of Array Padding
• The compiler will automatically pad local data
• -O3 optimization will automatically pad common blocks
• Padding of common blocks is safe as long as the Fortran standard is not violated. An example of a violation, where the loop runs off the end of A into B:
      SUBROUTINE SUB
      COMMON // A(512,512), B(512,512)
      DO I=1, 2*512*512
        A(I,1) = 0.0
      ENDDO
      END
• Fix the violation, or do not use this optimization, either by compiling at a lower optimization level or by using the explicit compiler flag:
      -OPT:reorg_common=off

Variable Length Arrays (VLA)
• The SGI compiler supports Variable Length Arrays in C and Fortran
• It is standard in F90 and an SGI extension in F77:
      SUBROUTINE NAME1(N,M)
      DIMENSION R(N,M)
      ……… etc. …
      END
• In C it is an SGI extension:
      void name1(int m, int n){
        double r[m][n][n+m];
        …… etc. …..
      }
• These arrays are created on the stack, as opposed to a location in a static area
• VLAs are very handy as scratch arrays, since they are created each time execution enters the subroutine and destroyed at exit
• Unlike static arrays, VLAs allow for proper aliasing and alignment considerations by the compiler

Loop Unrolling
• Loop unrolling: perform multiple loop iterations at the same time
  Original loop:
      DO I=1,N,1
        …(I)…
      ENDDO
  Unrolled loop:
      DO I=1,N-mod(N,UNROLL),UNROLL
        …(I)…
        …(I+1)…
        …(I+2)…
        …(I+UNROLL-1)…
      ENDDO
  Cleanup loop:
      DO I=N-mod(N,UNROLL)+1,N
        …(I)…
      ENDDO
• Advantages of loop unrolling:
  • more opportunities for super-scalar code
  • more data re-use & pseudo-prefetch
  • exploit presence of cache lines
  • reduction in loop overhead (minor)
• NOTE: Inner loops should "never" be unrolled by hand:
  • the compiler will typically unroll the inner loop by the amount necessary for SWP
• Directive: C*$* unroll(p)
  • p = 0: default unrolling
  • p = 1: no unrolling
  • p = UNROLL: unroll by that factor, plus the cleanup loop
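For illustration only (the slide advises leaving inner-loop unrolling to the compiler), here is a C sketch of what an unroll-by-4 transformation with a cleanup loop looks like for a simple sum. The function name and the use of four partial sums are assumptions of this sketch; note that separate partial sums change the floating-point summation order.

```c
#include <stddef.h>

/* Sum of b[0..n-1], unrolled by 4 with a cleanup loop for the remainder. */
double sum_unrolled4(const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t main_end = n - n % 4;    /* number of elements in full groups of 4 */
    size_t i;

    for (i = 0; i < main_end; i += 4) {
        s0 += b[i + 0];             /* four independent accumulators expose ILP */
        s1 += b[i + 1];
        s2 += b[i + 2];
        s3 += b[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)              /* cleanup: the remaining n mod 4 elements */
        sum += b[i];
    return sum;
}
```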

Prefetch Data from Memory for(i=0; i<n; i+=4){ a += b[i+0]; a += b[i+1]; a += b[i+2]; a += b[i+3]; } • Reordering instructions in unrolled loop leads to effective (pseudo-) prefetch of the data • no instruction overhead; compiler does this optimization automatically. • Explicit (manual) prefetch for memory: • prefetch to 1st level cache should be done in form of pseudo-prefetch • compiler will insert prefetch to 2nd level cache automatically (LNO) • manual prefetch to 2nd level cache can be done with compiler directive: • same in C with the corresponding #pragma directive for(i=0; i<n; i+=4){ t = b[i+3]; a += b[i+0]; a += b[i+1]; a += b[i+2]; a += t; } C*$* prefetch_ref=a(1) c*$* prefetch_ref=a(1+16) do I=1,n c*$* prefetch_ref=a(I+32),stride=16,kind=rd sum = sum + a(I) enddo

Outer Loop Unrolling DO I=1,N,4 ! Unrolling by 4 DO J=1,N A(I+0)=A(I+0)+B(I+0,J)*C(J) A(I+1)=A(I+1)+B(I+1,J)*C(J) A(I+2)=A(I+2)+B(I+2,J)*C(J) A(I+3)=A(I+3)+B(I+3,J)*C(J) ENDDO ENDDO • the unroll factor should match the cache line size • mostly 1st level cache optimization • if the data fits into the 2nd level cache, this is good optimization to use DO I=1,N DO J=1,N A(I)=A(I)+B(I,J)*C(J) ENDDO ENDDO Problem: A(I) is constant for the inner loop J C(J) is traversed each I iteration B(I,J) is traversed poorly Unrolling the outer loop will load the complete cache line of B in to the registers -> data re-use one 1st level cache line -LNO:outer_unroll=n

Blocking for Cache (tiling) DO I=1,N …. (I) …. ENDDO DO i1=1,N,nb DO I=i1,min(i1+nb-1,N) …. (I) …. ENDDO ENDDO The inner loop is traversed only in the range of nb at a time • Blocking for cache: • An optimization that applies to data sets that do not fit into the (2nd level) data cache • A way to increase spatial locality of reference (i.e. exploit full cache lines) • A way to increase temporal locality of reference (i.e. to improve data re-use) • It is beneficial mostly with multi-dimensional arrays -LNO:blocking=[on|off] (default on) -LNO:blocking_size=n1,n2 (for L1 and L2) By default L1=32KB and L2=1MB use -LNO:cs2=8M to specify the 8MB L2 cache

Blocking: Example
• The following loop nest:
      for(i=0; i<n; i++)
        for(j=0; j<m; j++)
          x[i][j] = y[i] + z[j];
  • x[i][j] is traversed in order
  • y[i] is loop invariant
  • z[j] is traversed sequentially
  • changing the loop order is not beneficial in this case
  • z[j] is reused for each i iteration
  • for large n the array z will not be reused from the cache
• Blocking the loops for cache:
      for(it=0; it<n; it += nb)
        for(jt=0; jt<m; jt += nb)
          for(i=it; i<min(it+nb,n); i++)
            for(j=jt; j<min(jt+nb,m); j++)
              x[i][j] = y[i] + z[j];
  • nb elements of the z array will be brought into the cache and reused nb times before moving on to the next tile
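A self-contained C sketch of the blocked loop nest, with the i loop bounded by it+nb and the j loop by jt+nb. The flat row-major layout for x, the helper min_sz and the function name are assumptions made so the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* x is an n-by-m row-major array: x[i*m + j] = y[i] + z[j], tiled nb-by-nb. */
void add_outer_blocked(size_t n, size_t m, size_t nb,
                       double *x, const double *y, const double *z)
{
    assert(nb > 0);                 /* a zero tile size would never advance */
    for (size_t it = 0; it < n; it += nb)
        for (size_t jt = 0; jt < m; jt += nb)
            for (size_t i = it; i < min_sz(it + nb, n); i++)
                for (size_t j = jt; j < min_sz(jt + nb, m); j++)
                    x[i * m + j] = y[i] + z[j];
}
```

Within one tile the same nb elements of z are touched for up to nb consecutive values of i, so they can be served from cache rather than reloaded from memory.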

Loop Fusion
• Loop fusion (merging two or more loops together):
  • fusing loops that refer to the same data enhances temporal locality
  • a larger loop body allows more effective scalar optimizations
• Example:
  Original loops:
      for(i=0; i<n; i++) a[i] = b[i] + 1;
      for(i=0; i<n; i++) c[i] = a[i]/2;
      for(i=0; i<n; i++) d[i] = 1/c[i+1];
  Fused loops:
      for(i=0; i<n; i++){
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
      }
      for(i=0; i<n; i++) d[i] = 1/c[i+1];
  Fusing more loops with loop peeling:
      a[0] = b[0] + 1;
      c[0] = a[0]/2;
      for(i=1; i<n; i++){
        a[i] = b[i] + 1;
        c[i] = a[i]/2;
        d[i-1] = 1/c[i];
      }
      d[n-1] = 1/c[n];
• loop peeling can be used to break data dependencies when fusing loops
• sometimes temporary arrays can be replaced by scalars (this optimization has to be done manually)
• The compiler will attempt to fuse loops if they are adjacent, i.e. there is no code between the loops to be fused
      -LNO:fusion=[0,1,2] (default 1)
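The peeling step can be checked with a runnable C sketch. It assumes c has n+1 elements (the last one is read but never written by these loops) and that the last peeled statement is d[n-1] = 1/c[n], which is what the original third loop computes in its final iteration. Function names are illustrative.

```c
#include <stddef.h>

/* Reference: the three separate loops. c must have n+1 elements. */
void three_loops(size_t n, double *a, const double *b, double *c, double *d)
{
    for (size_t i = 0; i < n; i++) a[i] = b[i] + 1.0;
    for (size_t i = 0; i < n; i++) c[i] = a[i] / 2.0;
    for (size_t i = 0; i < n; i++) d[i] = 1.0 / c[i + 1];
}

/* Fused: iteration 0 of a/c is peeled off the front, the last d off the back. */
void fused_peeled(size_t n, double *a, const double *b, double *c, double *d)
{
    if (n == 0) return;
    a[0] = b[0] + 1.0;
    c[0] = a[0] / 2.0;
    for (size_t i = 1; i < n; i++) {
        a[i] = b[i] + 1.0;
        c[i] = a[i] / 2.0;
        d[i - 1] = 1.0 / c[i];      /* c[i] already holds its new value */
    }
    d[n - 1] = 1.0 / c[n];          /* c[n] is never written by these loops */
}
```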

Loop Fusion in Array Assignments
• Loop fusion is instrumental in generating good F90 code
  F90 code sequence:
      A(1:N) = B(1:N)+1
      C(1:N) = A(1:N)/2
      D(1:N) = 1/C(2:N+1)
  The compiler will typically generate the following instruction sequence:
      Allocate T(1:N)
      DO I=1,N
        T(I) = B(I)+1
      ENDDO
      DO I=1,N
        A(I) = T(I)
      ENDDO
      DO I=1,N
        T(I) = A(I)/2
      ENDDO
      DO I=1,N
        C(I) = T(I)
      ENDDO
      DO I=1,N
        T(I) = 1/C(I+1)
      ENDDO
      DO I=1,N
        D(I) = T(I)
      ENDDO
• the compiler can optimize the loop sequence by fusion
• preserving data dependencies, this can be fused:
  Fused loops:
      DO I=1,N
        A(I) = B(I)+1
        C(I) = A(I)/2
      ENDDO
      DO I=1,N
        D(I) = 1/C(I+1)
      ENDDO
• Further peeling to break data dependencies will merge the two remaining loops
• for this optimization to work automatically, the array assignments (loops) must be adjacent, with no code placed between them

Loop Fission
• Loop fission (splitting) or loop distribution:
  • improves memory locality by splitting out loops that refer to different independent arrays
  Original loop:
      for(i=1; i<n; i++){
        a[i] = a[i] + b[i-1];
        b[i] = c[i-1]*x + y;
        c[i] = 1/b[i];
        d[i] = sqrt(c[i]);
      }
  After fission:
      for(i=0; i<n-1; i++){
        b[i+1] = c[i]*x + y;
        c[i+1] = 1/b[i+1];
      }
      for(i=0; i<n-1; i++)
        a[i+1] = a[i+1] + b[i];
      for(i=0; i<n-1; i++)
        d[i+1] = sqrt(c[i+1]);
      i = n;   /* value of the loop index after the original loop */
• -LNO:fission=[0,1,2] (default 1) attempts to distribute inner loops:
  • 0: no fission
  • 1: normal fission
  • 2: fission tried before fusion

LNO: Gather-Scatter • Special form of loop fission: • If the loop to be optimized contains conditional execution, it is often faster to evaluate all the conditions first. • The computationally intensive loop runs only over the indices for which the condition was true and can be better optimized (SWP) • LNO will not evaluate the nested IF conditions, unless -LNO:gather_scatter=2 is used do I=1,n deref_gs(inc_0+1) = I if(c(I) .gt. 0) then inc_0 = inc_0 + 1 endif enddo do ind_0=0,inc_0-1 I=deref_gs(ind_0+1) a(I) = c(I)/b(I) c(I) = c(I)*b(I) b(I) = 2*b(I) enddo end Subroutine fred(a,b,c,n) real*8 a(n), b(n), c(n) do I=1,n if(c(I) .gt. 0) then a(I) = c(I)/b(I) c(I) = c(I)*b(I) b(I) = 2*b(I) endif enddo end Conditional execution removed
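The same gather-scatter transformation can be sketched in C. The index array idx plays the role of the compiler-generated deref_gs array and must be supplied by the caller with at least n entries; its name and the function name are assumptions of this sketch, not what the compiler emits.

```c
#include <stddef.h>

/* Gather the indices with c[i] > 0 first, then run the work loop
   over those indices only, with no conditional in its body. */
void update_gather(size_t n, double *a, double *b, double *c, size_t *idx)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        idx[count] = i;             /* store unconditionally ... */
        count += (c[i] > 0.0);      /* ... keep the entry only if the test passed */
    }
    for (size_t k = 0; k < count; k++) {
        size_t i = idx[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
        b[i] = 2.0 * b[i];
    }
}
```

The second loop has a straight-line body, which is the property that lets a software pipeliner schedule it well.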

LNO: Vector Intrinsics SUBROUTINE VFRED(A,N) REAL*8 A(N) DO I=1,N A(I) = A(I) + COS(A(I)) ENDDO END CALL VCOS$(A(1),DEREF_SE1_F8(1), %VAL(N-1),%VAL(1), %VAL(1)) DO I=1,N A(I) = A(I) + DEREF_SE1_F8(I) ENDDO • Most intrinsics have their “vector” equivalents. The compiler will automatically substitute vector intrinsics where legal, when the functions are invoked in a loop: • Vector intrinsics are faster if N>10 for most intrinsics • Vector intrinsics have different precision rules (1 or 2 ulp less) • illegal arguments cannot be trapped with the vector intrinsics • -LNO:vintr=off to disable the generation of the vector intrinsics

Vector Intrinsics: Performance

Data Dependence in Loops
• In loops, each statement can be executed many times.
• loop carried data dependence:
  • dependence between statements in different iterations
• loop independent data dependence:
  • dependence between statements in the same iteration
• lexically forward dependence:
  • the source precedes the target lexically
• lexically backward dependence:
  • the opposite of the above
  • the right-hand side of an assignment precedes the left-hand side
• example:
      (1) for( i=2; i<9; i++){
      (2)   x[i] = y[i] + z[i];
      (3)   a[i] = x[i-1] + 1;
      (4) }
• unroll to analyze:
  • loop carried, lexically forward dependence
  [Figure: dependence graph S2 -> S3 with distance (1)]

Specifying the Dependency Rules • In the following example: • if K>N no dependency; if K<N there is a dependency. • The value of K is unknown to the compiler , thus the • compiler will assume dependencies. • The ivdep directive can be used to • communicate to the compiler the • data dependency rules. • IVDEP = Ignore Vector DEPendency SUBROUTINE DAXPYI(N,X,K,A) INTEGER N,K REAL*8 X(N),A DO I=1,N X(K+I) = X(K+I) + A*X(I) ENDDO END Compiler schedules: K<N (dependence) 14% peak K>N (no dependence) 33% peak SUBROUTINE DAXPYI(N,X,K,A) INTEGER N,K REAL*8 X(N),A cdir$ ivdep DO I=1,N X(K+I) = X(K+I) + A*X(I) ENDDO END

The IVDEP Directive • With indexed addressing IVDEP is the only way to specify no data dependencies to the compiler: • here ivdep means that the integer values stored in indx array are all different, I.e. indx is a permutation array • assuming no data dependencies will produce faster processor code, because compiler has less constraints on ordering the load-store instructions • The IVDEP directive to the compiler is not part of the language and its interpretation is not standardized. void update(int n, float *a, float *b, int *indx, float s) { int i; #pragma ivdep for(i=0; i<n; i++) a[indx[i]] += s*b[i]; }

Three Types of IVDEP Directive
• The IVDEP directive to the compiler is not part of any language and its interpretation is not standardized.
      CDIR$ IVDEP
      DO I=1,N
        A(INDEX(1,I)) = B(I)
        A(INDEX(2,I)) = C(I)
      ENDDO
• Default interpretation (SGI default behaviour):
  • A and B and C are independent, which breaks both lexically forward (i+k) and backward (i-k) dependencies
  • index(1,i) != index(1,j)
  • index(2,i) != index(2,j)
  • but for some i: index(1,*) == index(2,*)
• The default interpretation can be changed with the -OPT: compiler option. Possible other interpretations:
  • break only lexically backward dependencies (Cray IVDEP), i.e. assume only index(*,i) != index(*,i-k) (-OPT:cray_ivdep=on)
  • there are no dependencies whatsoever (Liberal IVDEP, enable with -OPT:liberal_ivdep=on)

The Argument Alias Problem
  In Fortran, the compiler assumes A and B do not overlap:
      SUBROUTINE COPY(A,B,N)
      REAL*8 A(N),B(N)
      DO I=1,N
        B(I) = A(I)
      ENDDO
      END
  In C, the compiler assumes pointers a and b can point to the same address:
      void copy(double *a, double *b, int n)
      {
        int i;
        for(i=0; i<n; i++)
          b[i] = a[i];
      }
• In Fortran, it is a mistake to invoke copy with overlapping arguments. The compiler will perform optimizations assuming A and B are not aliases over the computational range.
• In C, argument aliases are allowed. Therefore optimizations (SWP) that change the original order of loads and stores are not possible. There are several ways to remove this restriction:
  • the ivdep pragma
  • the compiler optimization flag: -OPT:alias=memory-access-model
  • the restrict keyword

Aliases: the Optimizer Options
• These options work over the whole compilation unit.
• -OPT:alias=[any,typed,unnamed,restrict,disjoint]
• any is the default: any pair of memory references may be aliased.
• Of the other memory access models, the most important are:
  • restrict: assume that any pair of memory references that are named differently do not point to the same regions in memory
        float *p, *q
        *p does not alias with *q, q, p or any global variable
  • disjoint: assume the same restrictions as "restrict"; in addition, any pointer de-referencing will point to a non-overlapping region in memory
        float *p, *q
        *p does not alias with *q, q, p or any global variable
        *p does not alias with **q, **p, ***q, etc.

The restrict Keyword
• The Numerical C Extensions Group X3J11.1 proposed (1993) a restrict keyword as the way to specify pointer access models.
• The restrict semantics:
  • assume that de-referencing the qualified pointer is the only way the program can access the memory pointed to by that pointer
  • loads and stores through such a pointer do not alias with any other loads and stores, except those through the same pointer
      void copy(double * restrict a, double * restrict b, int n)
      {
        int i;
        for(i=0; i<n; i++)
          b[i] = a[i];
      }
  • in this example, it is sufficient to indicate restrict b, since it is necessary to qualify only the pointers being stored through
  • to enable the restrict keyword it is necessary to use the compiler flag (7.2 and 7.3 compilers): -LANG:restrict
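The point that only the stored-through pointer needs the qualifier can be sketched as follows. This uses the restrict keyword as later standardized in C99, so no -LANG:restrict flag is assumed; the function name is hypothetical.

```c
/* Only b is stored through, so qualifying b alone tells the compiler that
   stores to b[i] cannot change any a[j] it has yet to load. */
void copy_restrict_dst(const double *a, double *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        b[i] = a[i];
}
```

As with the Fortran version, calling this with overlapping a and b breaks the promise made by the qualifier and the behaviour is undefined.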

Alias in Storage Allocation • Program data can be stored in memory in 2 ways: • Storage in global area • memory pages are allocated statically, i.e. all data is put at a fixed (virtual) address at load time • loading such data takes often 2 instructions, since the load immediate instruction in MIPS is limited by 64 KB offset: ldadr R1,addr #load base pointer • ldw R2,R1+offset #load base+offset • COMMON block data, global data, SAVE data, malloc, mmap • compilation with -static: all variables are allocated in global area • Storage on the stack • memory pages are allocated dynamically during program exec • each subroutine gets new stack area for local data • loading data from the stack requires single instruction ldw R2,TOS+offset #load TopOfStack+offset • local (automatic) variables, temporary storage, alloca data • Routines called from a parallel region : • Allocate private stack area • Variables allocated on private stack are private. • Variables in global area are shared (aliases).

Procedure Inlining
• Inlining: replace a function call by that function's source code
      DO I=1,N
        call DO_WORK(A(I),C(I))
      ENDDO

      Subroutine DO_WORK(X,Y)
      Y=1+X*(1+X*0.5)
      END
• Advantages:
  • increased opportunities for processor optimizations
  • more opportunities for Loop Nest optimizations
• Candidates for inlining are modules that:
  • are "small", i.e. not much source code
  • are called very often (typically in a loop)
  • do not take much time per call
• Inhibitors to inlining:
  • mismatches in the subroutine arguments (type or shape)
  • no inlining across languages (e.g. Fortran calls a C subroutine)
  • no static (SAVE) local variables
  • no varargs routines, no recursive routines
  • no functions with alternate entry points
  • no nested subroutines (like in F90)
• Compiler options:
      -INLINE:list=[on|off] (default off)
      -INLINE:must=sub1:never=sub2
      -IPA:inline=[on|off] (default on)

Software Pipelining (SWP) • The software pipelining is the way to mix iterations in a loop such that all processor execution slots are filled: • SWP is performed by the Code Generator (CG), that also unrolls inner loop to achieve the best SWP schedule (-O3 opt level). This can be computationally intensive. • Vector loops well-suited for SWP; short loops may run slower with SWP • Inhibitors to SWP: • loops with subroutine (or intrinsic) calls cannot be SWP-ed • loops with complicated conditionals or branching • loops that are too long cannot be software pipelined because compiler runs out of available registers (loop fission) • data dependence between iterations are harder to SWP

Summary • Scalar optimization: • improving ILP by code transformation and grouping independent instructions • improving memory access by restructuring loop nests to take better advantage of memory hierarchy • compilers are good at instruction level optimizations and loop transformations. It depends on the language, however: • F77 is the easiest for compiler to work with • C is more difficult • F90/C++ are most complex for compiler optimizations • the user is responsible to present the code in a way that allows for compiler optimizations: • don’t violate the language standard • write clean and clear code • consider the data structures for (false) sharing and alignment • consider the data structures for data dependencies • most natural presentation of algorithms using multi-dimensional arrays

Case Study: Vector Update
Scalar Optimization Techniques

Vector Update Code
      ll=0
      do jj=1,nj
        do ii=1,ni
          ll=ll+1
          res=0
          do n=1,nib
            na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
            nb=n+(jj-1)*nib
            ndb1=nmb1/2
            naa1=nma1+na
            nbb1=ndb1+nb
            res=res+p(naa1)*dp(nbb1)
          end do
          nde1=nme1/2
          lle1=nde1+ll
          dp(lle1)=dp(lle1)+res
        end do
      end do
• Profiling tells us that we spend most time in this part
• This is the net result of all the computations
• Time spent (sec):
      L1 Cache   L2 Cache   TLB   Execution
         50         37      215      286
