
Code Tuning and Parallelization on Boston University’s Scientific Computing Facility




  1. Code Tuning and Parallelization on Boston University’s Scientific Computing Facility Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization

  2. Outline • Introduction • Timing • Profiling • Cache • Tuning • Timing/profiling exercise • Parallelization

  3. Introduction • Tuning • Where is most time being used? • How to speed it up • Often as much art as science • Parallelization • After serial tuning, try parallel processing • MPI • OpenMP

  4. Timing

  5. Timing • When tuning/parallelizing a code, need to assess effectiveness of your efforts • Can time whole code and/or specific sections • Some types of timers • unix time command • function/subroutine calls • profiler

  6. CPU or Wall-Clock Time? • both are useful • for parallel runs, really want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased • CPU time doesn’t account for wait time • wall-clock time may not be accurate if sharing processors • wall-clock timings should always be performed in batch mode

  7. Unix Time Command • easiest way to time code • simply type time before your run command • output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)

  8. Unix time Command (cont’d) • tcsh results:

twister:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

• fields, in order: user CPU time (s), system CPU time (s), wall-clock time, (u+s)/wall-clock as a percentage, avg. shared + unshared text space, input + output operations, page faults + no. of times the process was swapped

  9. Unix Time Command (3) • bsh results:

$ time mycode
Real   1.62     wall-clock time (s)
User   1.57     user CPU time (s)
System 0.03     system CPU time (s)

  10. Function/Subroutine Calls • often need to time part of code • timers can be inserted in source code • language-dependent

  11. cpu_time • intrinsic subroutine in Fortran • returns user CPU time (in seconds) • no system time is included • 0.01 sec. resolution on p-series

real :: t1, t2
call cpu_time(t1)
... do stuff to be timed ...
call cpu_time(t2)
print*, 'CPU time = ', t2-t1, ' sec.'

  12. system_clock • intrinsic subroutine in Fortran • good for measuring wall-clock time • on p-series: • resolution is 0.01 sec. • max. time is 24 hr.

  13. system_clock (cont’d) • t1 and t2 are tic counts • count_rate is an optional argument containing tics/sec.

integer :: t1, t2, count_rate
call system_clock(t1, count_rate)
... do stuff to be timed ...
call system_clock(t2)
print*, 'wall-clock time = ', &
    real(t2-t1)/real(count_rate), ' sec'

  14. times • can be called from C to obtain CPU time • 0.01 sec. resolution on p-series • can also get system time with tms_stime

#include <sys/times.h>
#include <unistd.h>
int main(){
    int tics_per_sec;
    float tic1, tic2;
    struct tms timedat;
    tics_per_sec = sysconf(_SC_CLK_TCK);
    times(&timedat);
    tic1 = timedat.tms_utime;
    ... do stuff to be timed ...
    times(&timedat);
    tic2 = timedat.tms_utime;
    printf("CPU time = %5.2f\n",
           (float)(tic2-tic1)/(float)tics_per_sec);
}

  15. gettimeofday • can be called from C to obtain wall-clock time • msec resolution on p-series

#include <sys/time.h>
int main(){
    struct timeval t;
    double t1, t2;
    gettimeofday(&t, NULL);
    t1 = t.tv_sec + 1.0e-6*t.tv_usec;
    ... do stuff to be timed ...
    gettimeofday(&t, NULL);
    t2 = t.tv_sec + 1.0e-6*t.tv_usec;
    printf("wall-clock time = %5.3f\n", t2-t1);
}

  16. MPI_Wtime • convenient wall-clock timer for MPI codes • msec resolution on p-series

  17. MPI_Wtime (cont’d) • Fortran:

double precision t1, t2
t1 = mpi_wtime()
... do stuff to be timed ...
t2 = mpi_wtime()
print*, 'wall-clock time = ', t2-t1

• C:

double t1, t2;
t1 = MPI_Wtime();
... do stuff to be timed ...
t2 = MPI_Wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

  18. omp_get_wtime • convenient wall-clock timer for OpenMP codes • resolution available by calling omp_get_wtick() • 0.01 sec. resolution on p-series

  19. omp_get_wtime (cont’d) • Fortran:

double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime()
... do stuff to be timed ...
t2 = omp_get_wtime()
print*, 'wall-clock time = ', t2-t1

• C:

double t1, t2;
t1 = omp_get_wtime();
... do stuff to be timed ...
t2 = omp_get_wtime();
printf("wall-clock time = %5.3f\n", t2-t1);

  20. Timer Summary

  21. Profiling

  22. Profilers • profile tells you how much time is spent in each routine • various profilers available, e.g. • gprof (GNU) • pgprof (Portland Group) • Xprofiler (AIX)

  23. gprof • compile with -pg • file gmon.out will be created when you run • gprof executable > myprof • for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof

  24. gprof (cont’d) • call graph profile:

ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                  called/total     parents
index  %time  self  descendents   called+self   name        index
                                  called/total     children
              0.00    340.50       1/1       .__start [2]
[1]    78.3   0.00    340.50       1         .main [1]
              2.12    319.50      10/10      .contrl [3]
              0.04      7.30      10/10      .force [34]
              0.00      5.27       1/1       .initia [40]
              0.56      3.43       1/1       .plot3da [49]
              0.00      1.27       1/1       .data [73]

  25. gprof (3) • flat profile:

ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %   cumulative    self              self     total
 time   seconds   seconds    calls   ms/call  ms/call  name
 20.5     89.17     89.17       10   8917.00 10918.00  .conduct [5]
  7.6    122.34     33.17      323    102.69   102.69  .getxyz [8]
  7.5    154.77     32.43                              .__mcount [9]
  7.2    186.16     31.39   189880      0.17     0.17  .btri [10]
  7.2    217.33     31.17                              .kickpipes [12]
  5.1    239.58     22.25 309895200     0.00     0.00  .rmnmod [16]
  2.3    249.67     10.09      269     37.51    37.51  .getq [24]

  26. pgprof • compile with Portland Group compiler • pgf95 (pgf90, etc.) • pgcc • -Mprof=func • similar to -pg • run code • pgprof -exe executable • pops up window with flat profile

  27. pgprof (cont’d)

  28. pgprof (3) • line-level profiling • -Mprof=line • optimizer will re-order lines • profiler will lump lines in some loops or other constructs • you may or may not want to compile without optimization • in flat profile, double-click on function

  29. pgprof (4)

  30. xprofiler • AIX (twister) has a graphical interface to gprof • compile with -g -pg -Ox • Ox represents whatever level of optimization you’re using (e.g., O5) • run code • produces gmon.out file • type xprofiler mycode • where mycode is your code run command

  31. xprofiler (cont’d)

  32. xprofiler (3) • filled boxes represent functions or subroutines • “fences” represent libraries • left-click a box to get function name and timing information • right-click on a box to get source code or other information

  33. xprofiler (4) • can also get same profiles as from gprof by using menus • report flat profile • report call graph profile

  34. Cache

  35. Cache • Cache is a small chunk of fast memory between the main memory and the registers • memory hierarchy: registers → primary cache → secondary cache → main memory

  36. Cache (cont’d) • Variables are moved from main memory to cache in lines • L1 cache line sizes on our machines • Opteron (katana cluster) 64 bytes • Power4 (p-series) 128 bytes • PPC440 (Blue Gene) 32 bytes • Pentium III (linux cluster) 32 bytes • If variables are used repeatedly, code will run faster since cache memory is much faster than main memory

  37. Cache (cont’d) • Why not just make the main memory out of the same stuff as cache? • Expensive • Runs hot • This was actually done in Cray computers • Liquid cooling system

  38. Cache (cont’d) • Cache hit • Required variable is in cache • Cache miss • Required variable not in cache • If cache is full, something else must be thrown out (sent back to main memory) to make room • Want to minimize number of cache misses

  39. Cache example • “mini” cache holds 2 lines, 4 words each

for(i=0; i<10; i++)
    x[i] = i;

• main memory holds x[0] through x[9] plus a and b, stored in 4-word lines: (x[0] x[1] x[2] x[3]) (x[4] x[5] x[6] x[7]) (x[8] x[9] a b)

  40. Cache example (cont’d) • we will ignore i for simplicity • need x[0], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits • cache now holds (x[0] x[1] x[2] x[3])

  41. Cache example (cont’d) • need x[4], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits • cache now holds (x[0] x[1] x[2] x[3]) and (x[4] x[5] x[6] x[7])

  42. Cache example (cont’d) • need x[8], not in cache → cache miss • load line from memory into cache • no room in cache! • replace old line • cache now holds (x[8] x[9] a b) and (x[4] x[5] x[6] x[7])

  43. Cache (cont’d) • Contiguous access is important • In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …

  44. Cache (cont’d) • In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …

  45. Cache (cont’d) • Rule: Always order your loops appropriately • will usually be taken care of by optimizer • suggestion: don’t rely on optimizer!

C:
for(i=0; i<N; i++){
    for(j=0; j<N; j++){
        a[i][j] = 1.0;
    }
}

Fortran:
do j = 1, n
    do i = 1, n
        a(i,j) = 1.0
    enddo
enddo

  46. Tuning Tips

  47. Tuning Tips • Some of these tips will be taken care of by compiler optimization • It’s best to do them yourself, since compilers vary

  48. Tuning Tips (cont’d) • Access arrays in contiguous order • For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad:
for(j=0; j<N; j++){
    for(i=0; i<N; i++){
        a[i][j] = 1.0;
    }
}

Good:
for(i=0; i<N; i++){
    for(j=0; j<N; j++){
        a[i][j] = 1.0;
    }
}

  49. Tuning Tips (3) • Eliminate redundant operations in loops

Bad:
for(i=0; i<N; i++){
    x = 10;
    …
}

Good:
x = 10;
for(i=0; i<N; i++){
    …
}

  50. Tuning Tips (4) • Eliminate if statements within loops • They may inhibit pipelining

for(i=0; i<N; i++){
    if(i==0)
        perform i=0 calculations
    else
        perform i>0 calculations
}
