Modifying Program Code to Take Advantage of Compiler Optimizations

www.intel.com/software/products Intel Confidential IA64_Tools_Overview2.ppt

Data Issues: Responsible Pointer Usage
• Compiler alias analysis limits optimizations
• The developer knows the application: tell the compiler!
• Avoid pointing to the same memory address with two different pointers
• Use array notation when possible
• Avoid pointer arithmetic if possible
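
To see why possible aliasing limits the optimizer, here is a minimal C sketch. The function name add_twice is hypothetical and not part of the original slides. Because a and b may point to the same address, the compiler has to reload *b after the first store and cannot fold the two additions into one.

```c
/* Hypothetical illustration: adds *b to *a twice. */
void add_twice(int *a, const int *b)
{
    *a += *b;   /* if a and b alias, this store also changes *b */
    *a += *b;   /* so *b must be reloaded before this addition  */
}
```

With distinct objects x = 1 and y = 2, add_twice(&x, &y) leaves x equal to 5. With a single object z = 1, add_twice(&z, &z) leaves z equal to 4, so rewriting the body as *a += 2 * *b would only be valid if the compiler knew the pointers never alias.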

Data Issues: Pointer Disambiguation
• -Oa file.c (Windows), -fno-alias file.c (Linux)
  • All pointers in file.c are assumed not to alias
• -Ow file.c (Windows), not (yet) available on Linux
  • Assume no aliasing within functions (i.e., pointer arguments are unique)
• -Qrestrict file.c (Windows), -restrict (Linux)
  • Restrict qualifier: enables pointer disambiguation
• -Za file.c (Windows), -ansi (Linux)
  • Enforce strict ANSI compliance (requires that pointers to different data types are not aliased)
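
As a sketch of what the restrict qualifier looks like in source code, the following uses the standard C99 restrict keyword. The function name scale_add and its parameters are hypothetical. The qualifier is a promise from the developer that dst and src never overlap; calling the function with overlapping arrays is undefined behavior.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = dst[i] + src[i] * d.
   restrict tells the compiler that dst and src do not alias,
   so it is free to reorder or overlap loads and stores. */
void scale_add(size_t n, double *restrict dst,
               const double *restrict src, double d)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + src[i] * d;
}
```

Unlike a file-wide no-alias switch, the qualifier applies only to the pointers it is written on, which keeps the promise local and visible in the code.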

HLO: High Level Optimizations (available at -O3)
• Prefetch
• Loop interchange
• Unrolling
• Cache blocking
• Unroll-and-jam
• Scalar replacement
• Redundant zero-trip elimination
• Data dependence analysis
• Reuse analysis
• Loop recovery
• Canonical expressions
• Loop fusion
• Loop distribution
• Loop reversal
• Loop skewing
• Loop peeling
• Scalar expansion
• Register blocking

HLO: Data Prefetching
• Adds prefetch instructions using selective prefetching
• Works for arrays, pointers, C structures, and C/C++ parameters
Original loop:
    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
      end_for
    end_for
With prefetching:
    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
        if (mod(j,8) == 0) lfetch.nta(A[j+d, i])
        if (i == 1) lfetch.nt1(B[0, j+d])
      end_for
    end_for
• Goal: issue one prefetch instruction per cache line
• Itanium cache lines: L1 32B, L2 64B, L3 64B
• Itanium 2 cache lines: L1 64B, L2 128B, L3 128B
• -O3 does this for you
• "Let the compiler do the work!"

HLO Demo: Loop Interchange
    for (i = 0; i < NUM; i++) {
      for (j = 0; j < NUM; j++) {
        for (k = 0; k < NUM; k++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }
• Callouts on the code: k is the fast inner loop index; the last subscript is the consecutive memory index
• Note: the c[i][j] term is constant in the inner loop
• Interchange to allow unit-stride memory access
Lab: Matrix with Loop Interchange, -O2

HLO: Unit Stride Memory Access (C/C++ example; Fortran is the opposite)
• Non-unit-stride data access: in b[k][j], incrementing k gets non-consecutive memory elements
• Unit-stride data access: in a[i][k], incrementing k gets consecutive memory elements
[Figure: element layout of array b (b00 ... bN-1N-1, rows indexed by k, columns by j) and array a (a00 ... aN-1N-1, rows indexed by i, columns by k)]

HLO Demo: Loop after Interchange
    for (i = 0; i < NUM; i++) {
      for (k = 0; k < NUM; k++) {
        for (j = 0; j < NUM; j++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }
• Note: the a[i][k] term is constant in the inner loop
• Two loads, one store, one FMA: F/M = 0.33, unit stride
Lab: Matrix with Loop Interchange, -O3
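
The two loop orders can be compared side by side in a small self-contained C sketch. The size NUM = 8 and the function names matmul_ijk and matmul_ikj are assumptions made for illustration. Both functions add the products for each c[i][j] in the same order of k, so they compute the same result; only the memory access pattern of the inner loop differs.

```c
#define NUM 8  /* hypothetical small size, for illustration only */

/* Original order (i, j, k): the inner loop walks b[k][j] down a
   column, a stride of NUM doubles between consecutive accesses. */
void matmul_ijk(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            for (int k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged order (i, k, j): the inner loop walks c[i][j] and
   b[k][j] along a row (unit stride); a[i][k] is loop invariant. */
void matmul_ikj(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i++)
        for (int k = 0; k < NUM; k++)
            for (int j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

Both functions accumulate into c, so the caller is expected to zero c first.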

HLO: Unit Stride Memory Access (C/C++)
• All data accesses are unit stride
• b[k][j]: j is the fastest incremented index, giving consecutive memory access
• a[i][k]: k is the next fastest loop index and the consecutive memory index
[Figure: element layout of array b (b00 ... bN-1N-1, rows indexed by k, columns by j) and array a (a00 ... aN-1N-1, rows indexed by i, columns by k)]

HLO Demo: Loop Unrolling
Original loop (N=1025, M=5):
    DO I = 1, N
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO
Outer loop unrolled by 4, with preconditioning loop:
    II = IMOD(N,4)
    DO I = 1, II
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO
    DO I = II+1, N, 4
      DO J = 1, M
        A(J,I)   = B(J,I)   + C(J,I)   * D
        A(J,I+1) = B(J,I+1) + C(J,I+1) * D
        A(J,I+2) = B(J,I+2) + C(J,I+2) * D
        A(J,I+3) = B(J,I+3) + C(J,I+3) * D
      ENDDO
    ENDDO
• Unroll the largest loops
• If the loop size is known, the preconditioning loop can be eliminated by choosing an unroll factor that divides the trip count evenly
Lab: Matrix with Loop Unrolling by 2
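
The same transformation can be sketched in C. The function name scaled_add_unroll4 and the flat row-major layout (element (i, j) stored at index i*m + j) are assumptions made for this illustration. The preconditioning loop handles the first n % 4 rows, and the unrolled loop starts at the first row the preconditioning loop did not handle, so it never reads or writes past row n-1.

```c
#include <stddef.h>

/* Hypothetical sketch: a = b + c * d over n rows of m elements,
   with the outer (row) loop unrolled by 4. */
void scaled_add_unroll4(size_t n, size_t m, double *a,
                        const double *b, const double *c, double d)
{
    size_t rem = n % 4;

    /* Preconditioning loop: the first n % 4 rows, one at a time. */
    for (size_t i = 0; i < rem; i++)
        for (size_t j = 0; j < m; j++)
            a[i * m + j] = b[i * m + j] + c[i * m + j] * d;

    /* Main loop: the remaining row count is a multiple of 4. */
    for (size_t i = rem; i < n; i += 4)
        for (size_t j = 0; j < m; j++) {
            a[i * m + j]       = b[i * m + j]       + c[i * m + j]       * d;
            a[(i + 1) * m + j] = b[(i + 1) * m + j] + c[(i + 1) * m + j] * d;
            a[(i + 2) * m + j] = b[(i + 2) * m + j] + c[(i + 2) * m + j] * d;
            a[(i + 3) * m + j] = b[(i + 3) * m + j] + c[(i + 3) * m + j] * d;
        }
}
```

If n is known at compile time to be a multiple of 4, rem is zero and the preconditioning loop does no work.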

HLO: Loop Unrolling - Candidates
• If the trip count is low and known at compile time, it may make sense to fully unroll
• Poor candidates (similar issues for SWP or the vectorizer):
  • Low trip count loops: for (j=0; j < N; j++) where N=4 at runtime
  • Fat loops: the loop body already has a lot of computation taking place
  • Loops containing procedure calls
  • Loops with branches

HLO: Loop Unrolling - Benefits
• Benefits
  • Performs more computations per loop iteration
  • Reduces the effect of loop overhead
  • Can increase the floating-point to memory access ratio (F/M)
• Costs
  • Register pressure
  • Code bloat

HLO Lab: Loop Unrolling - Example
    for (i = 0; i < NUM; i = i + 2) {
      for (k = 0; k < NUM; k = k + 2) {
        for (j = 0; j < NUM; j++) {
          c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
          c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
          c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
        }
      }
    }
• The a[...][...] terms are loop invariant in the inner loop
• Unrolling all loops by 4 results in (per iteration) 32 loads, 16 stores, 64 FMA: F/M = 1.33
Demo Lab: Matrix with Loop Unrolling by 4
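
A runnable C sketch of the unroll-by-2 form follows. The size NUM = 8, the function name matmul_unroll2, and the scalar temporaries a00 to a11 are assumptions made for illustration; the sketch also requires NUM to be even. Holding the four loop-invariant a terms in scalars makes the per-iteration count easy to read: 4 loads (two c, two b), 2 stores, and 4 multiply-adds, so F/M is about 0.67, counted the same way as the 0.33 figure for the interchanged loop.

```c
#define NUM 8  /* hypothetical size; must be even for this sketch */

/* i and k loops unrolled by 2; c must be zeroed by the caller. */
void matmul_unroll2(double c[NUM][NUM], double a[NUM][NUM], double b[NUM][NUM])
{
    for (int i = 0; i < NUM; i += 2) {
        for (int k = 0; k < NUM; k += 2) {
            /* Loop invariant in j: read once per (i, k) pair. */
            double a00 = a[i][k],     a01 = a[i][k + 1];
            double a10 = a[i + 1][k], a11 = a[i + 1][k + 1];

            for (int j = 0; j < NUM; j++) {
                c[i][j]     = c[i][j]     + a00 * b[k][j] + a01 * b[k + 1][j];
                c[i + 1][j] = c[i + 1][j] + a10 * b[k][j] + a11 * b[k + 1][j];
            }
        }
    }
}
```

The price is four extra live values in registers, which is the register pressure cost that grows as the unroll factor increases.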

HLO: Cache Blocking
• Use when the arrays in a loop do not all fit in cache
• Effective for huge out-of-core memory applications
• Effective for large out-of-cache applications
• Work on "neighborhoods" of data and keep these neighborhoods in cache
• Helps reduce TLB and cache misses
Original loop:
    for i = 1, 1000
      for j = 1, 1000
        for k = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for
Blocked loop:
    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for k = v, v+19
          for j = u, u+19
            for i = 1, 1000
              A[i, j, k] = A[i, j, k] + B[i, k, j]
            end_for
          end_for
        end_for
      end_for
    end_for
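
The blocking idea can be sketched in C on a reduced two-dimensional version of the loop, with the i dimension dropped for brevity. The function name add_transposed_blocked, the flat row-major layout, and the block parameter are assumptions made for this illustration. Each pass works on one block-by-block tile of a and the matching tile of b, so both tiles can stay in cache while they are in use.

```c
#include <stddef.h>

/* Hypothetical sketch: a[j][k] += b[k][j] for n x n row-major arrays,
   processed in tiles of block x block elements (block must be > 0). */
void add_transposed_blocked(size_t n, size_t block,
                            double *a, const double *b)
{
    for (size_t jj = 0; jj < n; jj += block) {
        size_t j_end = (jj + block < n) ? jj + block : n;

        for (size_t kk = 0; kk < n; kk += block) {
            size_t k_end = (kk + block < n) ? kk + block : n;

            /* One tile: a "neighborhood" of a and of b. */
            for (size_t j = jj; j < j_end; j++)
                for (size_t k = kk; k < k_end; k++)
                    a[j * n + k] += b[k * n + j];
        }
    }
}
```

Clamping j_end and k_end lets n be any size, not only a multiple of block. A suitable block value depends on the cache size of the target machine and is best found by measurement.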
