## Libraries and Their Performance


Frank V. Hale, Thomas M. DeBoni
NERSC User Services Group

### Part I: Single Node Performance Measurement

- Use of hpmcount for measurement of total code performance
- Use of the HPM Toolkit for measurement of code section performance
- Vector operations generally give better performance than scalar (indexed) operations
- Shared-memory (SMP) parallelism can be very effective and easy to use

### Demonstration Problem

- Compute π using random points in the unit square (ratio of points in the unit circle to points in the unit square)
- Use an input file with a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1; unformatted, 8-byte floating-point numbers (1 gigabyte of data)

### A First Fortran Code

```fortran
% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
        read(10)x
        read(10)y
        if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .  ((4.*circle)/points)
      end
```

### Compile and Run with hpmcount

```shell
% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit
```

### Some Observations

- Performance is not very good at all: less than 1 Mflip/s (peak is 1,500 Mflip/s per processor)
- Scalar approach to computation
- Scalar I/O mixed with scalar computation

Suggestions:

- Separate I/O from computation
- Use vector operations on dynamically allocated vector data structures

### A Second Code, Fortran 90

```fortran
% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
c     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ", &
        ((4.*circle)/points)
      end
```

### Observations on Second Code

- Operations on whole vectors should be faster, but
- No real improvement in the performance of the total code was observed
- Suspect that most of the time is being spent on I/O
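The Monte Carlo estimate that both Fortran codes compute can be sketched in a few lines of Python. This is a hypothetical stand-in for estpi1.f: it draws random points directly instead of reading the 1 GB input file, but applies the same in-circle test.

```python
import random

def estimate_pi(points, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the inscribed circle approaches pi/4."""
    rng = random.Random(seed)
    circle = 0
    for _ in range(points):
        x, y = rng.random(), rng.random()
        # Squared distance from the square's center (0.5, 0.5);
        # the inscribed circle has radius 0.5, so compare against 0.25.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            circle += 1
    return 4.0 * circle / points

print(estimate_pi(100_000))
```

With 100,000 points the estimate typically lands within a few hundredths of π; like the Fortran version, accuracy improves only as the square root of the number of points.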
- I/O is now separate from computation, so the code is easy to instrument in sections

### Instrument Code Sections with HPM Toolkit

Four sections to be separately measured:

- Data structure initialization
- Read data
- Estimate π
- Write output

Calls to f_hpmstart and f_hpmstop go around each section.

### Instrumented Code (1 of 2)

```fortran
% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
```

### Instrumented Code (2 of 2)

```fortran
      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ", &
        ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end
```

### Notes on Instrumented Code

- The entire executable code is enclosed between f_hpminit and f_hpmterminate
- Each code section is enclosed between f_hpmstart and f_hpmstop
- Descriptive text labels appear in the output file(s)

### Compile and Run with HPM Toolkit

```shell
% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit
```

### Notes on Use of HPM Toolkit

- Must load the hpmtoolkit module
- Need to include the header file f_hpm.h in the Fortran code, and give preprocessor directions to the compiler with -qsuffix
- Performance output goes to a file named like perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id
- Message from the sample executable: libHPM output in perfhpm0000.21410

### Comparison of Code Sections

10,000,000 points

### Observations on Sections

- Optimization of the estimation of π has little effect, because
- The code spends 99% of its time reading the data
- Can the I/O be optimized?

### Reworking the I/O

- Whole-array I/O versus scalar I/O
- The scalar I/O file (one number per record) is twice as big: 8 bytes for the number, 8 bytes for the end-of-record marker
- The whole-array I/O file has only one end-of-record marker
- Only one call to the Fortran read routine is needed for whole-array I/O: read(10)xy
- Need some fancy array footwork to sort x(1), y(1), x(2), y(2), ... x(n), y(n) out of the xy array:
  x = xy(1::2)
  y = xy(2::2)

### Revised Data Structures and I/O

```fortran
% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)
```

### Vector I/O Code Sections

10,000,000 points

### Observations on New Sections

- The time spent reading the data as a vector rather than a scalar dropped from 89.9 to 3.16 seconds, a reduction of 96% of the I/O time
- There was no performance penalty for the additional data structure complexity
- I/O design can have very significant performance impacts!
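The stride-2 slices estpi4.f uses to split the interleaved xy array have direct analogues in most languages. A minimal Python sketch, with a hypothetical 6-element array standing in for the 1 GB file (note Python slices are 0-based, so Fortran's xy(1::2)/xy(2::2) shift by one):

```python
# Interleaved data as stored in the file: x1, y1, x2, y2, x3, y3
xy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

x = xy[0::2]  # every other element starting at the first  -> x values
y = xy[1::2]  # every other element starting at the second -> y values

print(x)  # [1.0, 3.0, 5.0]
print(y)  # [2.0, 4.0, 6.0]
```

The point of the exercise is unchanged: one bulk read followed by in-memory slicing replaces millions of tiny record reads.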
- Total code performance with hpmcount is now 15.4 Mflip/s, a 20-fold improvement over the 0.801 Mflip/s of the scalar I/O code

### Automatic Shared-Memory (SMP) Parallelization

- IBM Fortran provides a -qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node
- The default number of threads is 16; the number of threads is controlled by the OMP_NUM_THREADS environment variable
- Allows use of the SMP version of the ESSL library, -lesslsmp

### Compiler Options

- The source code is the same as the previous vector-operation example, estpi4.f
- Compiler options -qsmp and -lesslsmp enable automatic shared-memory parallelism (SMP)
- Compiler command line:

```shell
xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT \
  -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f
```

### SMP Code Sections

10,000,000 points

### Observations on SMP Code

- The computational section now shows 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s on a 16-processor node
- The computational section is now 12 times faster, with no changes to the source code
- Recommendation: always use thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise
- There are no explicit parallelism directives in the source code; all threading is within the library

### Too Many Threads Can Spoil Performance

- Each node has 16 processors; having more threads than processors will usually not improve performance

### Sidebar: Cost of a Misaligned Common Block

- User code with Fortran 77-style common blocks may receive an innocuous-looking warning: 1514-008 (W) Variable ... is misaligned. This may affect the efficiency of the code.
- How much can this affect the efficiency of the code?
- Test: put arrays x and y in a misaligned common block, with a 1-byte character in front of them

### Potential Cost of Misaligned Common Blocks

- 10,000,000 points used for computing π
- Properly aligned, dynamically allocated x and y used 0.064 seconds at 1,100 Mflip/s
- Misaligned, statically allocated x and y in a common block used 0.834 seconds at 88.4 Mflip/s
- Common block misalignment slowed the computation by a factor of 12

### Part I Conclusion

- hpmcount can be used to measure the performance of the total code
- The HPM Toolkit can be used to measure the performance of discrete code sections
- Optimization effort must be focused effectively
- Fortran 90 vector operations are generally faster than Fortran 77 scalar operations
- Automatic SMP parallelization may provide an easy performance boost
- I/O may be the largest factor in "whole code" performance
- Misaligned common blocks can be very expensive

### Part II: Comparing Libraries

- In the rich user environment on seaborg, there are many alternative ways to do the same computation
- The HPM Toolkit provides the tools to compare alternative approaches to the same computation

### Dot Product Functions

- User-coded scalar computation
- User-coded vector computation
- Single-processor ESSL ddot
- Multi-threaded SMP ESSL ddot
- Single-processor IMSL ddot
- Single-processor NAG f06eaf
- Multi-threaded SMP NAG f06eaf

### Sample Problem

- Test the Cauchy-Schwarz inequality for N vectors of length N: (X·Y)² <= (X·X)(Y·Y)
- Generate 2N random numbers (array x2)
- Use the first N for X; (X·X) is computed once
- Vary vector Y: for i = 1, n, set y = 2.0*x2(i:n+(i-1))
- The first Y is 2X, the second Y is 2*x2(2:N+1), etc.
- Compute (2*N)+1 dot products of length N

### Instrumented Code Section for Dot Products

```fortran
      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = ddot(n,y,1,y,1)
        xy = ddot(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
```

### Two User-Coded Functions

```fortran
      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
        dp = dp + x(i)*y(i)   ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)       ! User vector computation
      return
      end
```

### Compile and Run User Functions

```shell
module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat
```

### Compile and Run ESSL Versions

```shell
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -lessl"
$FC -o libs1 libs1.f
./libs1 <libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp"
$FC -o libs1smp libs1.f
./libs1smp <libs.dat
```

### Compile and Run the IMSL Version

```shell
module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl
```

### Compile and Run NAG Versions

```shell
module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag
module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave "
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64
```

### First Comparison of Dot Product (N=100,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 246 | 203 | 1.72 | |
| User Vector | 249 | 201 | 1.74 | |
| ESSL | 145 | 346 | 1.01 | |
| ESSL-SMP | 408 | 123 | 2.85 | Slowest |
| IMSL | 143 | 351 | 1.00 | Fastest |
| NAG | 250 | 200 | 1.75 | |
| NAG-SMP | 180 | 278 | 1.26 | |

### Comments on First Comparisons

- The best results, by just a little, were obtained with the IMSL library, with ESSL a close second
- Third best was the NAG-SMP routine, which benefits from multi-threaded computation
- The user-coded routines and NAG were about 75% slower than the ESSL and IMSL routines; in general, library routines are highly optimized and better than user-coded routines
- The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (the default is 16)

### ESSL-SMP Performance vs. Number of Threads

- All for N=100,000
- Number of threads controlled by the OMP_NUM_THREADS environment variable

### Revised First Comparison of Dot Product (N=100,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 246 | 203 | 4.9 | |
| User Vector | 249 | 201 | 5.0 | |
| ESSL | 145 | 346 | 2.9 | |
| ESSL-SMP | 50 | 1000 | 1.0 | Fastest (4 threads) |
| IMSL | 143 | 351 | 2.9 | |
| NAG | 250 | 200 | 5.0 | Slowest |
| NAG-SMP | 180 | 278 | 3.6 | |

Tuning the number of threads is very, very important for SMP codes!

### Scaling Up the Problem

- The first comparisons, for N=100,000, computed 200,001 dot products of vectors of length 100,000
- The second comparison, for N=200,000, computes 400,001 dot products of vectors of length 200,000
- This increases the computational complexity by a factor of 4

### Second Comparison of Dot Product (N=200,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 1090 | 183 | 2.17 | |
| User Vector | 1180 | 169 | 2.35 | Slowest |
| ESSL | 739 | 271 | 1.47 | |
| ESSL-SMP | 503 | 398 | 1.00 | Fastest |
| IMSL | 725 | 276 | 1.44 | |
| NAG | 1120 | 179 | 2.23 | |
| NAG-SMP | 864 | 231 | 1.72 | |

### Comments on Second Comparisons (N=200,000)

- Now the best results come from the ESSL-SMP library, with the default 16 threads
- The next best group is ESSL, IMSL, and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine
- The worst results were seen from NAG (single thread) and the user-coded routines
- What is the impact of the number of threads on ESSL-SMP performance, given that it is already the best?

### ESSL-SMP Performance vs. Number of Threads

- All for N=200,000
- Number of threads controlled by the OMP_NUM_THREADS environment variable

### Revised Second Comparison of Dot Product (N=200,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 1090 | 183 | 7.5 | |
| User Vector | 1180 | 169 | 8.1 | Slowest |
| ESSL | 739 | 271 | 5.1 | |
| ESSL-SMP | 146 | 1370 | 1.0 | Fastest (6 threads) |
| IMSL | 725 | 276 | 5.0 | |
| NAG | 1120 | 179 | 7.7 | |
| NAG-SMP | 864 | 231 | 5.9 | |
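The benchmark's underlying computation, sliding Y through the x2 array and checking the Cauchy-Schwarz inequality, can be sketched in Python. This is a hypothetical stand-in: `ddot` here is a plain Python function playing the role of the ESSL/IMSL/NAG routines compared above, and n is kept small.

```python
import random

def ddot(n, x, y):
    """Plain dot product, standing in for the library ddot routines."""
    return sum(x[i] * y[i] for i in range(n))

n = 1000
rng = random.Random(0)
x2 = [rng.random() for _ in range(2 * n)]  # 2N random numbers
x = x2[:n]                                 # first N are X
xx = ddot(n, x, x)                         # (X.X) computed once

# Slide Y through x2 as in the benchmark: the i-th Y is 2*x2[i:i+n],
# giving 2N+1 dot products in total (1 for xx, 2 per loop iteration).
diffs = []
for i in range(n):
    y = [2.0 * v for v in x2[i:i + n]]
    yy = ddot(n, y, y)
    xy = ddot(n, x, y)
    diffs.append(xx * yy - xy * xy)

# Cauchy-Schwarz: (X.Y)^2 <= (X.X)(Y.Y), so every difference should be
# non-negative (up to floating-point rounding).
print(all(d > -1e-6 for d in diffs))  # True
```

As in the Fortran version, the work is dominated by the repeated dot products, which is what makes the choice of ddot implementation the deciding performance factor.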