
Enlightening Your Code: Optimization


Presentation Transcript


  1. Enlightening Your Code: Optimization By: Kyle Rollin Special Thanks to Ross Walker

  2. Outline of Topics • Compiler Optimization • Mathematical Considerations • Looping and Structure • Managing Memory • Utilizing efficient I/O • Vectorization

  3. Benefits of Optimization • Reduced Runtime (Wallclock) • Less Resource Use • More Efficient Use of Specific Hardware Architecture • Better chance of successful MRAC/LRAC proposals

  4. REMEMBER: WALLCLOCK TIME IS EVERYTHING! • The only metric that ultimately matters is Wallclock time to solution. • Wallclock is how long it takes to run your simulation. • Wallclock is how much you get charged for. • Wallclock is how long your code is blocking other users from using the machine. • Lies, Damn Lies and Statistics... • TFlops/PFlops numbers are ‘worse’ than statistics. • You can easily write a code that gets very high Flops numbers but has a longer time to solution.

  5. Why optimize your code for single CPU performance? Disadvantages • Time consuming – do not waste weeks optimizing a one-off code that will run for 1 hour. • Can make the code harder to read and debug. • Can adversely affect parallel scaling. • Different architectures can respond in different ways.

  6. When to optimize? • Code optimization is an iterative process requiring time, energy and thought. • Performance tuning is recommended for: 1) production codes that are widely distributed and often used in the research community; 2) projects with limited allocation (to maximize available compute hours).

  7. Optimization Strategy • Aim: To minimize the amount of work the compiler is required to do. DO NOT rely on the compiler to optimize bad code. Use a two-pronged approach: 1) Write easily optimizable code to begin with. 2) Once the code is debugged and working, try more aggressive optimizations.

  8. What compilers will not do. • They will not reorder large memory arrays. • They will not factor/simplify large equations. • They will not replace inefficient logic. • They will not vectorize large complex loops. • They will not fix inefficient memory access. • They will not detect and remove excess work. • They will not automatically invert divisions. • ...

  9. How can we help the compiler? • We should aim to help the compiler as much as possible by designing our code to match the machine architecture. • Try to always move linearly in memory (Cache hits). • Factor / Simplify equations as much as possible. • Think carefully about how we do things. Is there a more efficient approach? • Remove excess work. • Avoid branching inside loops. I will take you through a series of simple examples that you can then use to try and optimize the example code.

  10. The Power 4 Architecture • Remember: Clock speed is NOT everything.

  11. Why is the Power 4 chip faster for a lower clock speed? • The speed comes from the inherent parallelism designed into the chip. • 2 Floating point units per core. • 2 Fixed point units per core. Each FP unit can perform 1 fused multiply add (FMA) per clock tick. Thus the two floating point units can perform a theoretical maximum of 4 floating point operations per clock cycle. • You should design your code with this in mind.
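    As a worked example (using the 1.5 GHz P655+ clock speed quoted later for Datastar, and assuming the peak rate is sustained): 2 FMA units x 2 floating point operations per FMA x 1.5 GHz = 6 GFlop/s theoretical peak per core.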

  12. The Power 4 Architecture (Datastar)

  13. Compiler Optimization

  14. Compiler Optimization • Choosing the right compiler is important: xlc_r, xlC_r, xlf_r, xlf90_r, xlf95_r, icc, ifort, gcc, g++, g77

  15. Compiler Optimization • Compilers with parallel capabilities: mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r, mpxlf95_r, mpicc, mpif77, mpif90

  16. Compiler Optimization IBM Compiler Flags (Fortran and C) • -O3; -O2; -O1 • Optimization levels 3, 2, and 1 • -qstrict • Used with -O3 to ensure compiler optimization does not alter program semantics. Use only when necessary, as it may reduce optimization. • -q64 • Compiles in 64-bit mode. This flag should be used only with thread-safe compilers (xlf_r, xlc_r, etc.). • -qarch=pwr4; -qtune=pwr4 • Produce an object containing instructions that run on, and are tuned for, the POWER4 hardware platform.

  17. Compiler Optimization Intel Compiler Flags (Fortran and C) • -O3; -O2; -O1 • Optimization levels 3, 2, and 1 • -tpp2 • Optimize for Intel Itanium 2 processors (default).

  18. Compiler Optimization

  19. Compiler Optimization • Example IBM compilation commands: • Example Intel compilation commands:
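    For example (an illustrative sketch using only the flags listed on the previous slides; program and file names are hypothetical):

      # IBM XL Fortran on Datastar (POWER4, 64-bit, thread-safe compiler)
      xlf90_r -O3 -qstrict -q64 -qarch=pwr4 -qtune=pwr4 -o myprog.x myprog.f

      # Intel Fortran on an Itanium 2 system
      ifort -O3 -tpp2 -o myprog.x myprog.f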

  20. The Test Codes • Located in: • /dsgpfs/projects/workshop/optimization/examples • Create a directory on /gpfs • cd /gpfs • mkdir <your_username> • Copy the examples to your gpfs directory • cd /gpfs/<your_username> • cp -r /dsgpfs/projects/workshop/optimization/examples/* . Available Text Editors: vi, emacs, pico

  21. The Test Codes • /gpfs/projects/workshop/optimization/contest/ is the location of an example piece of code that we will attempt to optimize. • This code solves the following ‘fictional’ equation:
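    Inferred from the main loop shown on the next slide, the quantity being computed appears to be (in LaTeX notation):

      E_{total} = \sum_{i=1}^{N} \sum_{\substack{j < i \\ r_{ij} \le cut}} \left[ \frac{e^{r_{ij} q_i}\, e^{r_{ij} q_j}}{r_{ij}} + \frac{1}{a} \right], \qquad r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}

    where N is the number of atoms (natom), q_i are the values in the q array, cut is the cutoff distance, and a is a constant.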

  22. The Test Codes The main loop of the code is as follows:

      total_e = 0.0d0
      cut_count = 0
      do i = 1, natom
        do j = 1, natom
          if ( j < i ) then   !Avoid double counting.
            vec2 = (coords(1,i)-coords(1,j))**2 + (coords(2,i)-coords(2,j))**2 &
                 + (coords(3,i)-coords(3,j))**2   !X^2 + Y^2 + Z^2
            rij = sqrt(vec2)
            !Check if this is below the cut off
            if ( rij <= cut ) then
              cut_count = cut_count + 1   !Increment the counter of pairs below cutoff
              current_e = (exp(rij*q(i))*exp(rij*q(j)))/rij
              total_e = total_e + current_e + 1.0d0/a
            end if
          end if
        end do
      end do

  23. Mathematical Considerations

  24. Mathematical Considerations • Different mathematical operations have varying efficiencies: • Assignment, Addition, Multiplication: Fast • Division, Exponentials: Slow • Square Roots: Very Slow
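    For example (an illustrative sketch, not from the original slides): when a distance is only needed to test against a cutoff, compare squared quantities and defer the expensive square root until it is actually required.

      ! Slow: a square root is taken for every pair just to test the cutoff
      rij = sqrt(vec2)
      if ( rij <= cut ) then
        ...
      end if

      ! Faster: compare squared values; take the sqrt only inside the branch
      cut2 = cut*cut
      if ( vec2 <= cut2 ) then
        rij = sqrt(vec2)
        ...
      end if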

  25. Mathematical Considerations • The compiler will not simplify or factorize for you, even if doing so is more efficient. • Which of these is a more optimized representation?
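    As an illustration of the idea (a hypothetical polynomial, not the slide's original expressions):

      ! Naive form: roughly 6 multiplications and 3 additions per evaluation
      y = a*x*x*x + b*x*x + c*x + d

      ! Factored (Horner) form: 3 multiplications and 3 additions
      y = ((a*x + b)*x + c)*x + d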

  26. Mathematical Considerations • Whenever possible, invert divisions. Examine the following poorly written snippet:

      do i = 1, n
        temp = temp + (i / x)
      end do

    • Instead, divide once and then multiply in the loop:

      inv_x = 1.0d0 / x
      do i = 1, n
        temp = temp + (i * inv_x)
      end do

  27. Managing Memory • Swapping • Swapping is used to replace segments of data in main memory. • In the case that more memory is needed than is available in the main memory, segments of memory are swapped to and from the disk. • Swapping is generally inefficient.

  28. Looping and Structure

  29. Looping and Structure • It is important to make sure the compiler recognizes expressions that remain constant while a loop is executing. Consider:

      do i = 1, n
        do j = 1, n+1
          print *, "Hello World"
        end do
      end do

    • In this case some compilers evaluate n+1 each time the inner loop iterates, even though its value never changes.

  30. Looping and Structure • A quick fix for this problem is to store the value in a variable:

      do i = 1, n
        n_plus_one = n + 1
        do j = 1, n_plus_one
          print *, "Hello World"
        end do
      end do

  31. Looping and Structure • When possible, avoid excessive array lookups:

      do i = 1, N
        do j = 1, N
          ...
          sum = sum + x(j)*x(i)
        end do
      end do

    • In the improved version below, x(i) has been factored out of the inner loop and xi is treated as a scalar:

      do i = 1, N
        xi = x(i)
        do j = 1, N
          ...
          sum = sum + x(j)*xi
        end do
      end do

  32. Looping and Structure • Always try to move linearly through memory. • This maximizes the number of cache hits. • In FORTRAN this means that the first index should always be looped over in the innermost loop. • Here is an example of how NOT to do it:

      do i = 1, Ni
        do j = 1, Nj
          do k = 1, Nk
            v = x(i,j,k)
            ...
          end do
        end do
      end do

  33. Looping and Structure • This is how it should be done:

      do k = 1, Nk
        do j = 1, Nj
          do i = 1, Ni
            v = x(i,j,k)
            ...
          end do
        end do
      end do

    • Notice that by reversing the loop order, we now move linearly through the x array. • Note: In C, arrays are laid out in the reverse (row-major) order, so the last index should vary fastest in the innermost loop.

  34. Looping and Structure • Sometimes it is more efficient to “inline” a function instead of calling it, especially if the function is small:

      v = 0
      do k = 1, 50
        v = square(v)
        ...
      end do

      v = 0
      do k = 1, 50
        v = v**2
        ...
      end do

    • “Inlining” a function that is too large will result in an overly large executable, so some judgment is required.

  35. Looping and Structure • If you have multiple nested loops that pass entirely through an array, it can often be beneficial to replace a multi-dimensional array with a 1-D array and do the index arithmetic yourself. E.g.

      Nitr = Nk*Nj*Ni
      do k = 1, Nitr
        v = x(k)
        ...
      end do

    • This avoids unnecessary pointer lookups and can make it easier to vectorize the code.
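    A minimal sketch of the index arithmetic this implies, assuming a column-major Fortran array x(Ni,Nj,Nk) viewed as a 1-D array x1d of length Ni*Nj*Nk (the names here are hypothetical):

      ! x(i,j,k) corresponds to:
      idx = i + (j-1)*Ni + (k-1)*Ni*Nj
      v = x1d(idx)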

  36. Looping and Structure • Sometimes (you need to test), especially on the Power 4 architecture, it can help to split an array of the form (1:x,N), where x is a small constant (ca. 3), into a set of x 1-D arrays. E.g.

      do i = 1, N
        ...
        do j = 1, N
          vec2 = (xi-crd(1,j))**2 + (yi-crd(2,j))**2 + (zi-crd(3,j))**2
          ...
        end do
      end do

    BECOMES

      do i = 1, N
        ...
        do j = 1, N
          vec2 = (xi-x(j))**2 + (yi-y(j))**2 + (zi-z(j))**2
          ...
        end do
      end do

    The reasoning is that the Power 4 chip has multiple cache lines, so by splitting the array into 3 separate arrays the compiler can utilize 3 pipelines simultaneously. (You need to test whether this works for your code!)

  37. Effective I/O

  38. Effective I/O • Oftentimes, it’s not *how* your program runs, but *where* your program runs that affects I/O performance. • /GPFS is a partition optimized for file I/O by specific disk technologies. • It is recommended for use with most jobs, especially where large amounts of data are produced or read, where I/O performance is important, or where parallel I/O is performed.

  39. Effective I/O • Whenever possible, read your data from a binary file instead of an ASCII file. • Binary File: fast • ASCII File: slow
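    A minimal Fortran sketch (file names and array shapes are hypothetical): unformatted (binary) I/O avoids the text conversion that formatted (ASCII) I/O performs on every value.

      ! ASCII (formatted) read: every number is parsed from text - slow
      open(unit=10, file='coords.txt', form='formatted')
      read(10,*) ((coords(k,i), k=1,3), i=1,natom)
      close(10)

      ! Binary (unformatted) read: raw bytes copied straight into the array - fast
      open(unit=11, file='coords.bin', form='unformatted')
      read(11) coords
      close(11)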

  40. Vectorization

  41. Vectorization • Vectorization is a technique by which small, uniform loops can be optimized using a specific vector math library. The particular library depends on the compiler: • xlc/xlf: MASS (Mathematical Acceleration SubSystem) • ifort/icc: MKL (Math Kernel Library) • Often you can set a flag and let the compiler try to vectorize loops for you automatically.

  42. Vectorization • Vectorization is beneficial on a scalar machine because: • The machine is actually superscalar... • Has a pipeline for doing computation. • Has multiple floating point units. • Vectorization makes it easy for the compiler to use multiple floating point units, as we have a large number of operations that are all independent of each other. • Vectorization makes it easy to fill the pipeline.

  43. Vectorization • On Power 4 a single FMA has to go through 6 stages in the pipeline. • After 1 clock cycle the first FMA operation has completed stage 1 and moves on to stage 2. Processor can now start processing stage 1 of the second FMA operation in parallel with stage 2 of first operation etc etc...

  44. Vectorization • After 6 operations pipelining of subsequent FMA operations gives one result every clock cycle. • Pipeline latency is thus hidden beyond 6 operations. • Power 4 chip has two floating point units so needs a minimum of 12 independent FMA operations to be fully pipelined. • Thus if we can split our calculation up into a series of long ‘independent’ vector calculations we can maximize the floating point performance of the chip.

  45. Vectorization Scalar Code:

      do i = 2, N
        do j = 1, i-1
          vec2 = ...
          onerij = 1.0d0/sqrt(vec2)
          rij = vec2*onerij
          exp1 = exp(rij)
          esum = esum + exp1*onerij
        end do
      end do

  46. Vectorization Vector Code:

    Scalar version (repeated from the previous slide for comparison):

      do i = 2, N
        do j = 1, i-1
          vec2 = ...
          onerij = 1.0d0/sqrt(vec2)
          rij = vec2*onerij
          exp1 = exp(rij)
          esum = esum + exp1*onerij
        end do
      end do

    Vector version:

      loopcount = 0
      do i = 2, N
        do j = 1, i-1
          loopcount = loopcount + 1
          vec2 = ...
          vectmp1(loopcount) = vec2
        end do
      end do

      !Begin vector operations
      call vrsqrt(vectmp2,vectmp1,loopcount)   !vectmp2 now contains onerij
      vectmp1(1:loopcount) = vectmp1(1:loopcount) * vectmp2(1:loopcount)   !vectmp1 now contains rij
      call vexp(vectmp1,vectmp1,loopcount)     !vectmp1 now contains exp(rij)

      do i = 1, loopcount
        esum = esum + vectmp1(i)*vectmp2(i)
      end do

  47. IBM’s vector math library • IBM has a specially tuned vector math library for use on Datastar. This library is called libmassvp4 and can be compiled into your code with: • xlf90 -o foo.x -O3 -L/usr/lib -lmassvp4 foo.f • Support is provided for a number of vector functions including: vrec (vectored inverse), vexp (vectored exp), vsqrt (vectored sqrt), vrsqrt (vectored invsqrt), vlog (vectored ln), vcos (vectored cosine), vtanh (vectored tanh)

  48. The Competition

  49. The Competition • As mentioned previously, an example piece of code is available in /gpfs/projects/workshop/optimization/contest/ on Datastar. (COPY THIS TO YOUR OWN DIRECTORY ON GPFS BEFORE EDITING) • Provided in this directory is the unoptimized code (original.f), a script to compile it (compile.x), and a LoadLeveler script to submit it on 1 CPU (llsubmit submit_orig.ll) • There is also a set of optimized codes that I have created (both scalar and vector). These are read-protected at present and will be made readable to all at the end of today.

  50. The Competition • Your mission (should you choose to accept it) is to optimize this example code so that it gives the same answer (to 12 s.f.) on a single P655+ (1.5 GHz) Datastar node in the shortest time possible. • There is also a parallel version of the original and optimized code which I have run on 16 CPUs (2 x 1.5 GHz nodes) of Datastar.
