
Enlightening Your Code: Optimization


Presentation Transcript


  1. Enlightening Your Code: Optimization By: Kyle Rollin Special Thanks to Ross Walker

  2. Outline of Topics • Compiler Optimization • Mathematical Considerations • Looping and Structure • Managing Memory • Utilizing efficient I/O • Vectorization

  3. Benefits of Optimization • Reduced Runtime (Wallclock) • Less Resource Use • More Efficient Use of Specific Hardware Architecture • Better chance of successful MRAC/LRAC proposals

  4. REMEMBER: WALLCLOCK TIME IS EVERYTHING! • The only metric that ultimately matters is Wallclock time to solution. • Wallclock is how long it takes to run your simulation. • Wallclock is how much you get charged for. • Wallclock is how long your code is blocking other users from using the machine. • Lies, Damn Lies and Statistics... • TFlops/PFlops numbers are ‘worse’ than statistics. • You can easily write a code that gets very high Flops numbers but has a longer time to solution.

  5. Why optimize your code for single CPU performance? Disadvantages • Time consuming – do not waste weeks optimizing a one-off code that will run for 1 hour. • Can make the code harder to read and debug. • Can adversely affect parallel scaling. • Different architectures can respond in different ways.

  6. When to optimize? • Code optimization is an iterative process requiring time, energy and thought. • Performance tuning is recommended for: 1) production codes that are widely distributed and often used in the research community; 2) projects with limited allocation (to maximize available compute hours).

  7. Optimization Strategy • Aim: To minimize the amount of work the compiler is required to do. DO NOT rely on the compiler to optimize bad code. Use a two-pronged approach: 1) Write easily optimizable code to begin with. 2) Once the code is debugged and working, try more aggressive optimizations.

  8. What compilers will not do. • They will not reorder large memory arrays. • They will not factor/simplify large equations. • They will not replace inefficient logic. • They will not vectorize large complex loops. • They will not fix inefficient memory access. • They will not detect and remove excess work. • They will not automatically invert divisions. • ...

  9. How can we help the compiler? • We should aim to help the compiler as much as possible by designing our code to match the machine architecture. • Try to always move linearly in memory (Cache hits). • Factor / Simplify equations as much as possible. • Think carefully about how we do things. Is there a more efficient approach? • Remove excess work. • Avoid branching inside loops. I will take you through a series of simple examples that you can then use to try and optimize the example code.

  10. The Power 4 Architecture • Remember: Clock speed is NOT everything.

  11. Why is the Power 4 chip faster for a lower clock speed? • The speed comes from the inherent parallelism designed into the chip. • 2 Floating point units per core. • 2 Fixed point units per core. Each FP unit can perform 1 fused multiply add (FMA) per clock tick. Thus the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle. • You should design your code with this in mind.
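    As a worked example (using the 1.5 GHz P655+ clock speed quoted later for Datastar, and assuming the peak rate is sustained): 2 FMA units x 2 floating point operations per FMA x 1.5 GHz = 6 GFlop/s theoretical peak per core.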

  12. The Power 4 Architecture (Datastar)

  13. Compiler Optimization

  14. Compiler Optimization • Choosing the right compiler is important: xlc_r, xlC_r, xlf_r, xlf90_r, xlf95_r, icc, ifort, gcc, g++, g77

  15. Compiler Optimization • Compilers with parallel capabilities: mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, mpxlf95_r, mpicc, mpif77, mpif90

  16. Compiler Optimization IBM Compiler Flags (Fortran and C) • -O3; -O2; -O1 • Optimization levels 3, 2, and 1 • -qstrict • Used with -O3 to ensure compiler optimization does not alter program semantics. Use only when necessary, as it may reduce optimization. • -q64 • Compiles in 64-bit mode. This flag should be used only with thread-safe compilers (xlf_r, xlc_r, etc.). • -qarch=pwr4; -qtune=pwr4 • Produce an object containing instructions that run on, and are tuned for, the POWER4 hardware platform.

  17. Compiler Optimization Intel Compiler Flags (Fortran and C) • -O3; -O2; -O1 • Optimization levels 3, 2, and 1 • -tpp2 • Optimize for Intel Itanium 2 processors (default).

  18. Compiler Optimization

  19. Compiler Optimization • Example IBM compilation commands: • Example Intel compilation commands:
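    For example (an illustrative sketch using only the flags listed on the previous slides; program and file names are hypothetical):

      # IBM XL Fortran on Datastar (POWER4, 64-bit, thread-safe compiler)
      xlf90_r -O3 -qstrict -q64 -qarch=pwr4 -qtune=pwr4 -o myprog.x myprog.f

      # Intel Fortran on an Itanium 2 system
      ifort -O3 -tpp2 -o myprog.x myprog.f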

  20. The Test Codes • Located in: • /dsgpfs/projects/workshop/optimization/examples • Create a directory on /gpfs • cd /gpfs • mkdir <your_username> • Copy the examples to your gpfs directory • cd /gpfs/<your_username> • cp -r /dsgpfs/projects/workshop/optimization/examples/* . Available Text Editors: vi, emacs, pico

  21. The Test Codes • /gpfs/projects/workshop/optimization/contest/ is the location of an example piece of code that we will attempt to optimize. • This code solves the following ‘fictional’ equation:
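    Inferred from the main loop shown on the next slide, the quantity being computed appears to be (in LaTeX notation):

      E_{total} = \sum_{i=1}^{N} \sum_{\substack{j < i \\ r_{ij} \le cut}} \left[ \frac{e^{r_{ij} q_i}\, e^{r_{ij} q_j}}{r_{ij}} + \frac{1}{a} \right], \qquad r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}

    where N is the number of atoms (natom), q_i are the values in the q array, cut is the cutoff distance, and a is a constant.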

  22. The Test Codes The main loop of the code is as follows:

      total_e = 0.0d0
      cut_count = 0
      do i = 1, natom
        do j = 1, natom
          if ( j < i ) then   !Avoid double counting.
            vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
                 + (coords(3,i)-coords(3,j))**2   !X^2 + Y^2 + Z^2
            rij = sqrt(vec2)
            !Check if this is below the cut off
            if ( rij <= cut ) then
              cut_count = cut_count + 1   !Increment the counter of pairs below cutoff
              current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
              total_e = total_e + current_e + 1.0d0/a
            end if
          end if
        end do
      end do

  23. Mathematical Considerations

  24. Mathematical Considerations • Different mathematical operations have varying efficiencies: • Assignment, Addition, Multiplication: Fast • Division, Exponentials: Slow • Square Roots: Very Slow
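    For example (an illustrative sketch, not from the original slides): when a distance is only needed to test against a cutoff, compare squared quantities and defer the expensive square root until it is actually required.

      ! Slow: a square root is taken for every pair just to test the cutoff
      rij = sqrt(vec2)
      if ( rij <= cut ) then
        ...
      end if

      ! Faster: compare squared values; take the sqrt only inside the branch
      cut2 = cut*cut
      if ( vec2 <= cut2 ) then
        rij = sqrt(vec2)
        ...
      end if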

  25. Mathematical Considerations • The compiler will not simplify or factorize for you, even if doing so is more efficient. • Which of these is a more optimized representation?
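    As an illustration of the idea (a hypothetical polynomial, not the slide's original expressions):

      ! Naive form: roughly 6 multiplications and 3 additions per evaluation
      y = a*x*x*x + b*x*x + c*x + d

      ! Factored (Horner) form: 3 multiplications and 3 additions
      y = ((a*x + b)*x + c)*x + d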

  26. Mathematical Considerations • Whenever possible, invert divisions. Examine the following poorly written snippet:

      do i = 1, n
        temp = temp + (i / x)
      end do

    • Instead, divide once and then multiply in the loop:

      inv_x = 1.0d0 / x
      do i = 1, n
        temp = temp + (i * inv_x)
      end do

  27. Managing Memory • Swapping • Swapping is used to replace segments of data in main memory. • In the case that more memory is needed than is available in the main memory, segments of memory are swapped to and from the disk. • Swapping is generally inefficient.

  28. Looping and Structure

  29. Looping and Structure • It is important to make sure the compiler recognizes expressions that remain constant while a loop is executing. Consider:

      do i = 1, n
        do j = 1, n+1
          print *, "Hello World"
        end do
      end do

    • In this case some compilers evaluate n+1 each time the inner loop iterates, even though its value never changes.

  30. Looping and Structure • A quick fix for this problem is to store the value in a variable:

      do i = 1, n
        n_plus_one = n + 1
        do j = 1, n_plus_one
          print *, "Hello World"
        end do
      end do

  31. Looping and Structure • When possible, avoid excessive array lookups:

      do i = 1, N
        do j = 1, N
          ...
          sum = sum + x(j)*x(i)
        end do
      end do

    • In the improved version below, x(i) has been factored out of the inner loop and xi is treated as a scalar:

      do i = 1, N
        xi = x(i)
        do j = 1, N
          ...
          sum = sum + x(j)*xi
        end do
      end do

  32. Looping and Structure • Always try to move linearly through memory. • This maximizes the number of cache hits. • In FORTRAN this means that the first index should always be looped over in the innermost loop. • Here is an example of how NOT to do it:

      do i = 1, Ni
        do j = 1, Nj
          do k = 1, Nk
            v = x(i,j,k)
            ...
          end do
        end do
      end do

  33. Looping and Structure • This is how it should be done:

      do k = 1, Nk
        do j = 1, Nj
          do i = 1, Ni
            v = x(i,j,k)
            ...
          end do
        end do
      end do

    • Notice that by reversing the loop order, we now move linearly through the x array. • Note: In C, arrays are laid out in the reverse (row-major) order, so the last index should vary fastest in the innermost loop.

  34. Looping and Structure • Sometimes it is more efficient to “inline” a function instead of calling it, especially if the function is small:

      v = 0
      do k = 1, 50
        v = square(v)
        ...
      end do

      v = 0
      do k = 1, 50
        v = v**2
        ...
      end do

    • “Inlining” a function that is too large will result in an overly large executable, so some judgment is required.

  35. Looping and Structure • If you have multiple nested loops that pass entirely through an array, it can often be beneficial to replace a multi-dimensional array with a 1-D array and do the index arithmetic yourself. E.g.

      Nitr = Nk*Nj*Ni
      do k = 1, Nitr
        v = x(k)
        ...
      end do

    • This avoids unnecessary pointer lookups and can make it easier to vectorize the code.
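    A minimal sketch of the index arithmetic this implies, assuming a column-major Fortran array x(Ni,Nj,Nk) viewed as a 1-D array x1d of length Ni*Nj*Nk (the names here are hypothetical):

      ! x(i,j,k) corresponds to:
      idx = i + (j-1)*Ni + (k-1)*Ni*Nj
      v = x1d(idx)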

  36. Looping and Structure • Sometimes (you need to test), especially on the Power 4 architecture, it can help to split an array of the form (1:x,N), where x is a small constant (ca. 3), into a set of x 1-D arrays. E.g.

      do i = 1, N
        ...
        do j = 1, N
          vec2 = (xi-crd(1,j))**2 + (yi-crd(2,j))**2 + (zi-crd(3,j))**2
          ...
        end do
      end do

    BECOMES

      do i = 1, N
        ...
        do j = 1, N
          vec2 = (xi-x(j))**2 + (yi-y(j))**2 + (zi-z(j))**2
          ...
        end do
      end do

    The reasoning is that the Power 4 chip has multiple cache lines, so by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)

  37. Effective I/O

  38. Effective I/O • Oftentimes, it’s not *how* your program runs, but *where* your program runs that affects I/O performance. • /GPFS is a partition optimized for file I/O by specific disk technologies. • It is recommended for use with most jobs, especially where large amounts of data are produced or read, where I/O performance is important, or where parallel I/O is performed.

  39. Effective I/O • Whenever possible, read your data from a binary file instead of an ASCII file. • Binary File: fast • ASCII File: slow
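    A minimal Fortran sketch (file names and array shapes are hypothetical): unformatted (binary) I/O avoids the text conversion that formatted (ASCII) I/O performs on every value.

      ! ASCII (formatted) read: every number is parsed from text - slow
      open(unit=10, file='coords.txt', form='formatted')
      read(10,*) ((coords(k,i), k=1,3), i=1,natom)
      close(10)

      ! Binary (unformatted) read: raw bytes copied straight into the array - fast
      open(unit=11, file='coords.bin', form='unformatted')
      read(11) coords
      close(11)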

  40. Vectorization

  41. Vectorization • Vectorization is a technique by which small, uniform loops can be optimized using a specific vector math library. The particular library depends on the compiler: • xlc/xlf: MASS (Mathematical Acceleration SubSystem) • ifort/icc: MKL (Math Kernel Library) • Often you can set a flag and let the compiler try to vectorize loops for you automatically.

  42. Vectorization • Vectorization is beneficial on a scalar machine because: • The machine is actually superscalar... • Has a pipeline for doing computation. • Has multiple floating point units. • Vectorization makes it easy for the compiler to use multiple floating point units, as we have a large number of operations that are all independent of each other. • Vectorization makes it easy to fill the pipeline.

  43. Vectorization • On Power 4 a single FMA has to go through 6 stages in the pipeline. • After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. Processor can now start processing stage 1 of the second FMA operation in parallel with stage 2 of first operation etc etc...

  44. Vectorization • After 6 operations pipelining of subsequent FMA operations gives one result every clock cycle. • Pipeline latency is thus hidden beyond 6 operations. • Power 4 chip has two floating point units so needs a minimum of 12 independent FMA operations to be fully pipelined. • Thus if we can split our calculation up into a series of long ‘independent’ vector calculations we can maximize the floating point performance of the chip.

  45. Vectorization Scalar Code:

      do i = 2, N
        do j = 1, i-1
          vec2 = ...
          onerij = 1.0d0/sqrt(vec2)
          rij = vec2*onerij
          exp1 = exp(rij)
          esum = esum + exp1*onerij
        end do
      end do

  46. Vectorization Vector Code:

    Scalar version (repeated from the previous slide for comparison):

      do i = 2, N
        do j = 1, i-1
          vec2 = ...
          onerij = 1.0d0/sqrt(vec2)
          rij = vec2*onerij
          exp1 = exp(rij)
          esum = esum + exp1*onerij
        end do
      end do

    Vector version:

      loopcount = 0
      do i = 2, N
        do j = 1, i-1
          loopcount = loopcount + 1
          vec2 = ...
          vectmp1(loopcount) = vec2
        end do
      end do

      !Begin vector operations
      call vrsqrt(vectmp2,vectmp1,loopcount)   !vectmp2 now contains onerij
      vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount)   !vectmp1 now contains rij
      call vexp(vectmp1,vectmp1,loopcount)     !vectmp1 now contains exp(rij)

      do i = 1, loopcount
        esum = esum + vectmp1(i)*vectmp2(i)
      end do

  47. IBM’s vector math library • IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be compiled into your code with: • xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f • Support is provided for a number of vector functions including: vrec (vectored inverse), vexp (vectored exp), vsqrt (vectored sqrt), vrsqrt (vectored invsqrt), vlog (vectored ln), vcos (vectored cosine), vtanh (vectored tanh)

  48. The Competition

  49. The Competition • As mentioned previously, an example piece of code is available in /gpfs/projects/workshop/optimization/contest/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING) • Provided in this directory is the unoptimized code (original.f), a script to compile it (compile.x), and a LoadLeveler script to submit it on 1 CPU (llsubmit submit_orig.ll) • There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.

  50. The Competition • Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single P655+ (1.5 GHz) Datastar node in the shortest time possible. • There is also a parallel version of the original and optimized code which I have run on 16 CPUs (2 x 1.5 GHz nodes) of Datastar.
