
Code Tuning and Optimization

Code Tuning and Optimization. Kadin Tseng, Boston University, Scientific Computing and Visualization. Outline: Introduction, Timing, Example Code, Profiling, Cache, Tuning, Parallel Performance. Timing: where is most time being used? Tuning: how to speed it up; often as much art as science.


Presentation Transcript


  1. Code Tuning and Optimization. Kadin Tseng, Boston University, Scientific Computing and Visualization.

  2. Outline: Introduction, Timing, Example Code, Profiling, Cache, Tuning, Parallel Performance

  3. Introduction
     • Timing: where is most time being used?
     • Tuning: how to speed it up; often as much art as science
     • Parallel performance: how to assess how well parallelization is working

  4. Timing

  5. Timing
     • When tuning or parallelizing a code, you need to assess the effectiveness of your efforts
     • You can time the whole code and/or specific sections
     • Some types of timers:
       • Unix time command
       • function/subroutine calls
       • profiler

  6. CPU Time or Wall-Clock Time?
     • CPU time: how much time the CPU is actually crunching away
     • User CPU time: time spent executing your source code
     • System CPU time: time spent in system calls such as I/O
     • Wall-clock time: what you would measure with a stopwatch

  7. CPU Time or Wall-Clock Time? (cont’d)
     • Both are useful
     • For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close
     • If you prompt for keyboard input, wall-clock time will accumulate while you get a cup of coffee, but CPU time will not

  8. CPU Time or Wall-Clock Time? (3)
     • Parallel runs: you want wall-clock time, since CPU time will stay about the same or even increase as the number of processors is increased
     • Wall-clock time may not be accurate if you are sharing processors
     • Wall-clock timings should always be performed in batch mode

  9. Unix Time Command
     • Easiest way to time code
     • Simply type time before your run command
     • Output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)

  10. Unix Time Command (cont’d)
        katana:~ % time mycode
        1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
      Fields, left to right:
      • 1.570u: user CPU time (s)
      • 0.010s: system CPU time (s)
      • 0:01.77: wall-clock time (s)
      • 89.2%: (u+s)/wc
      • 75+1450k: avg. shared + unshared text space
      • 0+0io: input + output operations
      • 64pf+0w: page faults + no. times proc. was swapped

  11. Unix Time Command (3): Bourne shell results
        $ time mycode
        real 0m1.62s    (wall-clock time)
        user 0m1.57s    (user CPU time)
        sys  0m0.03s    (system CPU time)

  12. Example Code

  13. Example Code
      • Simulation of response of eye to stimuli (CNS Dept.)
      • Based on Grossberg & Todorovic paper
      • Contains 6 levels of response
      • Our code only contains levels 1 through 5
      • Level 6 takes a long time to compute, and would skew our timings!

  14. Example Code (cont’d)
      • All calculations are done on a square array
      • Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)

  15. Level 1 Equations
      • Computational domain is a square
      • Defines square array I over the domain (initial condition)
      • [Figure labels: bright, dark]

  16. Level 2 Equations (Ipq = initial condition)

  17. Level 3 Equations

  18. Level 4 Equations

  19. Level 5 Equation

  20. Exercise 1
      • Copy files from the /scratch disk:
          katana% cp /scratch/kadin/tuning/* .
      • Choose C (gt.c and gt.h) or Fortran (gt.f90)
      • Compile with no optimization (-O0 is capital oh, zero; -o is small oh):
          pgcc -O0 -o gt gt.c
          pgf90 -O0 -o gt gt.f90
      • Submit the rungt script to the batch queue:
          katana% qsub -b y rungt

  21. Exercise 1 (cont’d)
      • Check status: qstat -u username
      • After the run has completed, a file named rungt.o?????? will appear, where ?????? represents the process number
      • The file contains the result of the time command
      • Write down the wall-clock time
      • Re-compile using -O3
      • Re-run and check the time

  22. Function/Subroutine Calls
      • Often need to time part of a code
      • Timers can be inserted in source code
      • Language-dependent

  23. cpu_time
      • Intrinsic subroutine in Fortran
      • Returns user CPU time (in seconds)
      • No system time is included
        real :: t1, t2
        call cpu_time(t1)   ! Start timer
        ... perform computation here ...
        call cpu_time(t2)   ! Stop timer
        print*, 'CPU time = ', t2-t1, ' sec.'

  24. system_clock
      • Intrinsic subroutine in Fortran
      • Good for measuring wall-clock time

  25. system_clock (cont’d)
      • t1 and t2 are tic counts
      • count_rate is an optional argument containing tics/sec.
        integer :: t1, t2, count_rate
        call system_clock(t1, count_rate)   ! Start clock
        ... perform computation here ...
        call system_clock(t2)               ! Stop clock
        print*, 'wall-clock time = ', &
                real(t2-t1)/real(count_rate), ' sec'

  26. times
      • Can be called from C to obtain CPU time
      • Can also get system time with tms_stime
        #include <stdio.h>
        #include <sys/times.h>
        #include <unistd.h>
        int main(void) {
            long tics_per_sec;
            float tic1, tic2;
            struct tms timedat;
            tics_per_sec = sysconf(_SC_CLK_TCK);
            times(&timedat);                 // start clock
            tic1 = timedat.tms_utime;
            /* ... perform computation here ... */
            times(&timedat);                 // stop clock
            tic2 = timedat.tms_utime;
            printf("CPU time = %5.2f\n", (tic2 - tic1) / (float)tics_per_sec);
            return 0;
        }

  27. gettimeofday
      • Can be called from C to obtain wall-clock time
        #include <stdio.h>
        #include <sys/time.h>
        int main(void) {
            struct timeval t;
            double t1, t2;
            gettimeofday(&t, NULL);          // start clock
            t1 = t.tv_sec + 1.0e-6 * t.tv_usec;
            /* ... perform computation here ... */
            gettimeofday(&t, NULL);          // stop clock
            t2 = t.tv_sec + 1.0e-6 * t.tv_usec;
            printf("wall-clock time = %5.3f\n", t2 - t1);
            return 0;
        }

  28. MPI_Wtime
      • Convenient wall-clock timer for MPI codes

  29. MPI_Wtime (cont’d)
      Fortran:
        double precision t1, t2
        t1 = mpi_wtime()   ! Start clock
        ... perform computation here ...
        t2 = mpi_wtime()   ! Stop clock
        print*, 'wall-clock time = ', t2-t1
      C:
        double t1, t2;
        t1 = MPI_Wtime();   // start clock
        /* ... perform computation here ... */
        t2 = MPI_Wtime();   // stop clock
        printf("wall-clock time = %5.3f\n", t2-t1);

  30. omp_get_wtime
      • Convenient wall-clock timer for OpenMP codes
      • Resolution available by calling omp_get_wtick()

  31. omp_get_wtime (cont’d)
      Fortran:
        double precision t1, t2, omp_get_wtime
        t1 = omp_get_wtime()   ! Start clock
        ... perform computation here ...
        t2 = omp_get_wtime()   ! Stop clock
        print*, 'wall-clock time = ', t2-t1
      C:
        double t1, t2;
        t1 = omp_get_wtime();   // start clock
        /* ... perform computation here ... */
        t2 = omp_get_wtime();   // stop clock
        printf("wall-clock time = %5.3f\n", t2-t1);

  32. Timer Summary

  33. Exercise 2
      • Put a wall-clock timer around each “level” in the example code
      • Print the time for each level
      • Compile and run

  34. Profiling

  35. Profilers
      • A profile tells you how much time is spent in each routine
      • Gives a level of granularity not available with the previous timers
      • e.g., a function may be called from many places
      • Various profilers are available, e.g.
        • gprof (GNU): function-level profiling
        • pgprof (Portland Group): function- and line-level profiling

  36. gprof
      • Compile with -pg
      • When you run the executable, the file gmon.out will be created
      • gprof executable > myprof
      • This processes gmon.out into myprof
      • For multiple processes (MPI), copy or link gmon.out.n to gmon.out, then run gprof

  37. gprof (cont’d)
        ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

          %   cumulative    self                 self      total
         time   seconds    seconds      calls   ms/call   ms/call  name
         20.5     89.17     89.17          10   8917.00  10918.00  .conduct [5]
          7.6    122.34     33.17         323    102.69    102.69  .getxyz [8]
          7.5    154.77     32.43                                  .__mcount [9]
          7.2    186.16     31.39      189880      0.17      0.17  .btri [10]
          7.2    217.33     31.17                                  .kickpipes [12]
          5.1    239.58     22.25   309895200      0.00      0.00  .rmnmod [16]
          2.3    249.67     10.09         269     37.51     37.51  .getq [24]

  38. gprof (3)
        ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                          called/total      parents
        index  %time    self descendents  called+self   name        index
                                          called/total      children

                        0.00      340.50       1/1        .__start [2]
        [1]     78.3    0.00      340.50       1      .main [1]
                        2.12      319.50      10/10       .contrl [3]
                        0.04        7.30      10/10       .force [34]
                        0.00        5.27       1/1        .initia [40]
                        0.56        3.43       1/1        .plot3da [49]
                        0.00        1.27       1/1        .data [73]

  39. pgprof
      • Compile with a Portland Group compiler
        • pgf90 (pgf95, etc.)
        • pgcc
        • -Mprof=func (similar to -pg)
      • Run the code
      • pgprof -exe executable
      • Pops up a window with the flat profile

  40. pgprof (cont’d)

  41. pgprof (3)
      • To save profile data to a file:
        • Re-run pgprof using the -text flag
        • At the command prompt, type p > filename
        • filename is the name you want to give the profile file
        • Type quit to get out of the profiler
      • Close pgprof as soon as you’re through
        • Leaving the window open ties up a license (only a few are available)

  42. Line-Level Profiling
      • Times individual lines
      • For pgprof, compile with the flag -Mprof=line
      • The optimizer will re-order lines
        • The profiler will lump lines in some loops or other constructs
        • You may or may not want to compile without optimization
      • In the flat profile, double-click on a function to get line-level data

  43. Line-Level Profiling (cont’d)

  44. Cache

  45. Cache
      • Cache is a small chunk of fast memory between the main memory and the registers
      • [Figure: memory hierarchy: registers, primary cache, secondary cache, main memory]

  46. Cache (cont’d)
      • If variables are used repeatedly, code will run faster, since cache memory is much faster than main memory
      • Variables are moved from main memory to cache in lines
      • L1 cache line sizes on our machines:
        • Opteron (katana cluster): 64 bytes
        • Xeon (katana cluster): 64 bytes
        • Power4 (p-series): 128 bytes
        • PPC440 (Blue Gene): 32 bytes
        • Pentium III (linux cluster): 32 bytes

  47. Cache (3)
      • Why not just make the main memory out of the same stuff as cache?
        • Expensive
        • Runs hot
        • This was actually done in Cray computers
          • Liquid cooling system

  48. Cache (4)
      • Cache hit: required variable is in cache
      • Cache miss: required variable is not in cache
      • If the cache is full, something else must be thrown out (sent back to main memory) to make room
      • Want to minimize the number of cache misses

  49. Cache (5)
      • “Mini” cache holds 2 lines, 4 words each
        for(i=0; i<10; i++) x[i] = i;
      • [Figure: empty 2-line cache; main memory holds x[0] through x[9], a, b, ...]

  50. Cache (6)
        for(i=0; i<10; i++) x[i] = i;
      • Will ignore i for simplicity
      • Need x[0], which is not in cache: cache miss
      • Load line from memory into cache
      • Next 3 loop indices result in cache hits
      • [Figure: cache now holds the line x[0], x[1], x[2], x[3]; main memory holds x[0] through x[9], a, b, ...]
