
Computer Aided Hand Tuning

Antoine Monsifrot, François Bodin, CAPS Team. June 2001. Outline: why CBR driven code tuning, approach, system overview, tuning cases, examples, conclusion.


Presentation Transcript


  1. Computer Aided Hand Tuning. Antoine Monsifrot, François Bodin, CAPS Team. June 2001.

  2. Overview
     • Why CBR driven code tuning?
     • Approach
     • System overview
     • Tuning cases
     • Examples
     • Conclusion

  3. Introduction
     • Execution speed depends on the code structure and on the processor architecture.
     • Compiler optimizations frequently fail because compilers:
         • are unable to analyze the programs (aliasing, ...)
         • must preserve program semantics
         • have little knowledge of the application or of the target architecture
         • ignore most of the existing libraries

  4. CBR Driven Code Tuning?
     • Case-based reasoning
         • no knowledge formalization needed
         • 4 main operations: identification, retrieval, reuse, retention
     • Defining a tuning case: abstracting loop performance properties
     • User interaction

  5. System Overview

  6. A Tuning Case
     • A goal and a target machine
     • A program transformation
     • A set of indices
         • data about the code that indicates the optimisation opportunity
         • abstraction of code properties
     • High probability of recognising a code structure we know how to optimise (compilers need to be conservative)
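The three ingredients of a tuning case listed above can be pictured as a small record. The following is only a sketch under assumptions: the type names, field names, and the matching rule are hypothetical and are not taken from the CAHT system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract indices for one loop nest (illustrative only). */
typedef struct {
    int depth;          /* loop nest depth */
    int perfect_nest;   /* 1 if the nest is perfectly nested */
    int has_goto;       /* 1 if the body contains a goto */
    int has_call;       /* 1 if the body contains a function call */
    int column_access;  /* 1 if some array is traversed with a large stride */
} LoopIndices;

/* A tuning case: a goal, a target machine, a transformation,
 * and a predicate over the indices that signals the opportunity. */
typedef struct {
    const char *goal;
    const char *target_machine;
    const char *transformation;
    int (*matches)(const LoopIndices *);
} TuningCase;

/* Assumed rule: suggest tiling for clean, perfect nests of depth >= 2
 * that contain a large-stride array access. */
static int tiling_matches(const LoopIndices *ix) {
    assert(ix != NULL);
    return ix->depth >= 2 && ix->perfect_nest && ix->column_access
        && !ix->has_goto && !ix->has_call;
}

static const TuningCase TILING_FOR_TLB = {
    "reduce TLB misses",    /* goal */
    "hypothetical target",  /* target machine (placeholder) */
    "loop tiling",          /* program transformation */
    tiling_matches          /* indices test */
};
```

Because the predicate only recognises structures that are known to be worth optimising, a match is a suggestion to the user, not a guarantee that the transformation is legal.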

  7. Abstract performance indices
     • Based on
         • execution time
         • code properties: data locality, parallelism, floating point operations, libraries
     • Abstractions
         • data accesses
         • data dependencies
         • arithmetic expressions
         • code patterns

  8. Static Indices
     • Loop nest structure: depth, gotos, function call
     • Array accesses: access strides
     • Expression patterns: div/div, power, sparse accesses, ...
     • Loop patterns: Blas, LU, Jacobi, SOR
     • Parallelism: data dependencies

     Dynamic Indices
     • Execution time and frequency: etime, tcov

     Example loop nest:

         do k = 1,npts
           do j = 2,npts
             a(j,k) = a(j-1,k) + a(j,k)**2
             if (a(j,k) .eq. 0) then
               goto 4
             endif
             a(j,k) = a(j,k) + 1
       4     a(j,k) = a(j-1,k) / a(j,k)
           enddo
         enddo
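As an illustration of the "access strides" index, here is a minimal sketch of how a stride could be derived for an affine access to a column-major (Fortran-style) two-dimensional array. The function name and its parameters are assumptions made for this example; the slides do not describe how CAHT computes strides.

```c
#include <assert.h>

/* Stride, in array elements, between two consecutive iterations of a loop
 * for an access a(r, c) to a column-major array with leading dimension ld,
 * where the loop variable appears with coefficient coef_row in the row
 * subscript r and coef_col in the column subscript c.
 * Element offset = (r - 1) + (c - 1) * ld, so one step moves by
 * coef_row + coef_col * ld elements. */
static long access_stride(long coef_row, long coef_col, long ld) {
    assert(ld > 0);
    return coef_row + coef_col * ld;
}
```

For the access a(j,k) in the loop nest above, the inner loop over j gives `access_stride(1, 0, ld)`, a unit stride, while a loop over k would give `access_stride(0, 1, ld)`, a stride of a full column.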

  9. Computing Cases
     • For each loop, all cases are checked.

         char *ComputeCase1(Indices[]){ …}
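A minimal sketch of this "check every case against every loop" step might look as follows. The index layout, the two case functions, and `check_loop` are hypothetical; only the idea of one function per case, taking the indices and returning a string, comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of the index vector computed for one loop. */
enum { IDX_DEPTH, IDX_PERFECT, IDX_COLUMN_ACCESS, IDX_LARGE_BODY, IDX_COUNT };

/* A case inspects the indices and returns a suggested transformation,
 * or NULL when the loop does not exhibit the case. */
typedef const char *(*CaseFn)(const int indices[]);

static const char *case_tiling(const int ix[]) {
    return (ix[IDX_DEPTH] >= 2 && ix[IDX_PERFECT] && ix[IDX_COLUMN_ACCESS])
               ? "tiling" : NULL;
}

static const char *case_distribution(const int ix[]) {
    return (!ix[IDX_PERFECT] && ix[IDX_LARGE_BODY]) ? "distribution" : NULL;
}

static const CaseFn ALL_CASES[] = { case_tiling, case_distribution };
#define NUM_CASES (sizeof ALL_CASES / sizeof ALL_CASES[0])

/* Run every case on one loop; store the suggestions in out[] and
 * return how many cases matched (at most max_out are stored). */
static size_t check_loop(const int ix[], const char *out[], size_t max_out) {
    size_t n = 0;
    assert(ix != NULL && out != NULL);
    for (size_t c = 0; c < NUM_CASES && n < max_out; ++c) {
        const char *suggestion = ALL_CASES[c](ix);
        if (suggestion != NULL)
            out[n++] = suggestion;
    }
    return n;
}
```

Calling `check_loop` once per loop of the program yields, for each loop, the list of transformations to propose to the user.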

  10. Cases Example: Tiling for TLB
      Indices → transformation
      • no perfect loop nest; large body → distribution
      • affine loop; line array accesses; column array accesses → tiling
      • no negative component in dependence vectors; uniform dependencies → skewing
      • Combined cases: distribution + tiling; skewing + tiling
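To make the tiling case concrete, here is a small sketch of a loop nest that accesses one array by lines and another by columns, first in its original form and then tiled. The transpose kernel and the tile size are assumptions chosen for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32 /* assumed tile size; in practice tuned to page and cache sizes */

/* Original nest: a is read along rows, b is written along columns. */
static void transpose_naive(size_t n, const double *a, double *b) {
    assert(a != b); /* out-of-place only */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

/* Tiled nest: the same iterations, visited block by block so that each
 * block of a and b touches only a few pages at a time. */
static void transpose_tiled(size_t n, const double *a, double *b) {
    assert(a != b); /* out-of-place only */
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    b[j * n + i] = a[i * n + j];
        }
}
```

Both functions compute the same result; the tiled version only changes the order in which the iterations are executed, which is why the case requires an affine loop with both line and column accesses.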

  11. Loop Benchmark
      • 64 loop nests
      • 3.3 Mflop, 54.1 Mflop

            DO 3200 I = 1,NSIZE2
              DO 3170 J = 1,NSIZE1
                IF (B2(J,I) .EQ. 0.0) GO TO 3130
                A2(J,I) = C2(J,I)*B2(J,I)
                GO TO 3170
       3130     CONTINUE
                B2(J,I) = C2(J,I)*A2(J,I)
       3170   CONTINUE
       3200 CONTINUE

      • 44 are compiler friendly
      • 40 are improved by KAP
      • 13 do not exhibit a case
      • 12 exhibit a case
          • 5 parallel loops not parallelized by KAP
          • 1 sorted else if
          • 1 condition on loop index
          • 3 loop nests with loops to merge
          • 2 matrix multiply
      http://www.netlib.org/benchmark/parallel

  12. An Application Example: DeFT
      • A real application: Gaussian Density Functional Program
      • 75863 lines of Fortran code (comments included)
      • Two main routines:
          • gridwork: 47.5%, 1015 lines
          • x_annihilate: 29.7%, 269 lines
      http://www.ccl.net/cca/software/SOURCES/FORTRAN/DeFT/index.shtml

  13. DeFT Examples
      Examples of cases found:

      • Parallel loop

            do 1012 i=1,ihits
              ii=iwkvec(i)
              …...
              do 1012 j=1,ihits
                jj=iwkvec(j)
                …...
                do 1015 k=1,npts
       1015     wf(k,ii)=wf(k,ii)+factor*fv(k,jj)
                if((nfunctional.gt.0).and.(ipart.eq.0)) then
                  do 1016 k=1,npts
                    wfx(k,ii)=wfx(k,ii)+factor*fvx(k,jj)
                    wfy(k,ii)=wfy(k,ii)+factor*fvy(k,jj)
       1016       wfz(k,ii)=wfz(k,ii)+factor*fvz(k,jj)
                endif
       1012 continue

      • Matrix Multiplication (Blas)

            do 1029 k = 1,n
              ...
              do 1029 j = istart(myid+1),iend(myid+1)
                do 1029 i = 1,n
       1029 overlap(i,j) = overlap(i,j) + coeff(i,k)*coeff(j,k)

      • Fusion

            do 1011 k=istart(myid+1),iend(myid+1)
       1011 veci(k)=coeff(k,i)
            do 1012 k=istart(myid+1),iend(myid+1)
       1012 vecj(k)=coeff(k,j)
            do 1013 k=istart(myid+1),iend(myid+1)
       1013 coeff(k,i)=coeff(k,i)+s(i)*(vecj(k)-tau(i)*veci(k))
            do 1014 k=istart(myid+1),iend(myid+1)
       1014 coeff(k,j)=coeff(k,j)-s(i)*(veci(k)+tau(i)*vecj(k))
            do 1015 k=istart(myid+1),iend(myid+1)
       1015 veci(k)=smat(k,i)
            do 1016 k=istart(myid+1),iend(myid+1)
       1016 vecj(k)=smat(k,j)

      Results on a 4-processor SGI Onyx:
      • Sequential: 121s
      • KAP: 140s
      • CAHT: 85s
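The fusion case can be sketched in C on a simplified version of the six loops above. This is an illustration under assumptions, not the DeFT code: the columns `coeff(:,i)`, `coeff(:,j)`, `smat(:,i)`, and `smat(:,j)` are passed as separate arrays `ci`, `cj`, `si`, `sj`, the factors `s(i)` and `tau(i)` are passed as scalars, and columns i and j are assumed to be distinct.

```c
#include <assert.h>
#include <stddef.h>

/* Six separate loops over the same range, as in the original code. */
static void update_unfused(size_t n, double *ci, double *cj,
                           const double *si, const double *sj,
                           double *veci, double *vecj, double s, double tau) {
    assert(ci != cj); /* assumed: columns i and j are distinct */
    for (size_t k = 0; k < n; ++k) veci[k] = ci[k];
    for (size_t k = 0; k < n; ++k) vecj[k] = cj[k];
    for (size_t k = 0; k < n; ++k) ci[k] = ci[k] + s * (vecj[k] - tau * veci[k]);
    for (size_t k = 0; k < n; ++k) cj[k] = cj[k] - s * (veci[k] + tau * vecj[k]);
    for (size_t k = 0; k < n; ++k) veci[k] = si[k];
    for (size_t k = 0; k < n; ++k) vecj[k] = sj[k];
}

/* Fused version: every statement at iteration k only touches element k,
 * so keeping the statement order inside one loop preserves the result. */
static void update_fused(size_t n, double *ci, double *cj,
                         const double *si, const double *sj,
                         double *veci, double *vecj, double s, double tau) {
    assert(ci != cj); /* assumed: columns i and j are distinct */
    for (size_t k = 0; k < n; ++k) {
        veci[k] = ci[k];
        vecj[k] = cj[k];
        ci[k] = ci[k] + s * (vecj[k] - tau * veci[k]);
        cj[k] = cj[k] - s * (veci[k] + tau * vecj[k]);
        veci[k] = si[k];
        vecj[k] = sj[k];
    }
}
```

The fused loop traverses each array once instead of several times and pays the loop overhead once, which is the benefit the "loops to merge" case is looking for.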

  14. Conclusion
      • Case-based reasoning provides a promising framework for code tuning
      • Tuning the cases may be difficult: the compiler must be taken into account (e.g. unrolling)
      • Integration of dynamic data and assembly code properties
      • Learning techniques for case tuning
