## Libraries and Their Performance


Frank V. Hale, Thomas M. DeBoni
NERSC User Services Group

### Part I: Single Node Performance Measurement

- Use of hpmcount for measurement of total code performance
- Use of the HPM Toolkit for measurement of code section performance
- Vector operations generally give better performance than scalar (indexed) operations
- Shared-memory (SMP) parallelism can be very effective and easy to use

### Demonstration Problem

- Compute π using random points in the unit square (ratio of points in the unit circle to points in the unit square)
- Use an input file with a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1; unformatted, 8-byte floating-point numbers (1 gigabyte of data)

### A First Fortran Code

```fortran
% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
        read(10)x
        read(10)y
        if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .  ((4.*circle)/points)
      end
```

### Compile and Run with hpmcount

```shell
% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit
```

### Some Observations

- Performance is not very good at all: less than 1 Mflip/s (peak is 1,500 Mflip/s per processor)
- Scalar approach to computation
- Scalar I/O mixed with scalar computation

Suggestions:

- Separate I/O from computation
- Use vector operations on dynamically allocated vector data structures

### A Second Code, Fortran 90

```fortran
% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
c     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ", &
        ((4.*circle)/points)
      end
```

### Observations on Second Code

- Operations on whole vectors should be faster, but
- No real improvement in the performance of the total code was observed
- Suspect that most of the time is being spent on I/O
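The Monte Carlo estimate that both Fortran codes compute can be sketched in a few lines of Python. This is a hypothetical stand-in for estpi1.f: it draws random points directly instead of reading the 1 GB input file, but applies the same in-circle test.

```python
import random

def estimate_pi(points, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the inscribed circle approaches pi/4."""
    rng = random.Random(seed)
    circle = 0
    for _ in range(points):
        x, y = rng.random(), rng.random()
        # Squared distance from the square's center (0.5, 0.5);
        # the inscribed circle has radius 0.5, so compare against 0.25.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            circle += 1
    return 4.0 * circle / points

print(estimate_pi(100_000))
```

With 100,000 points the estimate typically lands within a few hundredths of π; like the Fortran version, accuracy improves only as the square root of the number of points.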
- I/O is now separate from computation, so the code is easy to instrument in sections

### Instrument Code Sections with HPM Toolkit

Four sections to be separately measured:

- Data structure initialization
- Read data
- Estimate π
- Write output

Calls to f_hpmstart and f_hpmstop go around each section.

### Instrumented Code (1 of 2)

```fortran
% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
```

### Instrumented Code (2 of 2)

```fortran
      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ", &
        ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end
```

### Notes on Instrumented Code

- The entire executable code is enclosed between f_hpminit and f_hpmterminate
- Each code section is enclosed between f_hpmstart and f_hpmstop
- Descriptive text labels appear in the output file(s)

### Compile and Run with HPM Toolkit

```shell
% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit
```

### Notes on Use of HPM Toolkit

- Must load the hpmtoolkit module
- Need to include the header file f_hpm.h in the Fortran code, and give preprocessor directions to the compiler with -qsuffix
- Performance output goes to a file named like perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id
- Message from the sample executable: libHPM output in perfhpm0000.21410

### Comparison of Code Sections

10,000,000 points

### Observations on Sections

- Optimization of the estimation of π has little effect, because
- The code spends 99% of its time reading the data
- Can the I/O be optimized?

### Reworking the I/O

- Whole-array I/O versus scalar I/O
- The scalar I/O file (one number per record) is twice as big: 8 bytes for the number, 8 bytes for the end-of-record marker
- The whole-array I/O file has only one end-of-record marker
- Only one call to the Fortran read routine is needed for whole-array I/O: read(10)xy
- Need some fancy array footwork to sort x(1), y(1), x(2), y(2), ... x(n), y(n) out of the xy array:
  x = xy(1::2)
  y = xy(2::2)

### Revised Data Structures and I/O

```fortran
% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)
```

### Vector I/O Code Sections

10,000,000 points

### Observations on New Sections

- The time spent reading the data as a vector rather than a scalar dropped from 89.9 to 3.16 seconds, a reduction of 96% of the I/O time
- There was no performance penalty for the additional data structure complexity
- I/O design can have very significant performance impacts!
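The stride-2 slices estpi4.f uses to split the interleaved xy array have direct analogues in most languages. A minimal Python sketch, with a hypothetical 6-element array standing in for the 1 GB file (note Python slices are 0-based, so Fortran's xy(1::2)/xy(2::2) shift by one):

```python
# Interleaved data as stored in the file: x1, y1, x2, y2, x3, y3
xy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

x = xy[0::2]  # every other element starting at the first  -> x values
y = xy[1::2]  # every other element starting at the second -> y values

print(x)  # [1.0, 3.0, 5.0]
print(y)  # [2.0, 4.0, 6.0]
```

The point of the exercise is unchanged: one bulk read followed by in-memory slicing replaces millions of tiny record reads.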
- Total code performance with hpmcount is now 15.4 Mflip/s, a 20-fold improvement over the 0.801 Mflip/s of the scalar I/O code

### Automatic Shared-Memory (SMP) Parallelization

- IBM Fortran provides a -qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node
- The default number of threads is 16; the number of threads is controlled by the OMP_NUM_THREADS environment variable
- Allows use of the SMP version of the ESSL library, -lesslsmp

### Compiler Options

- The source code is the same as the previous vector-operation example, estpi4.f
- Compiler options -qsmp and -lesslsmp enable automatic shared-memory parallelism (SMP)
- Compiler command line:

```shell
xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT \
  -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f
```

### SMP Code Sections

10,000,000 points

### Observations on SMP Code

- The computational section now shows 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s on a 16-processor node
- The computational section is now 12 times faster, with no changes to the source code
- Recommendation: always use thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise
- There are no explicit parallelism directives in the source code; all threading is within the library

### Too Many Threads Can Spoil Performance

- Each node has 16 processors; having more threads than processors will usually not improve performance

### Sidebar: Cost of a Misaligned Common Block

- User code with Fortran 77-style common blocks may receive an innocuous-looking warning: 1514-008 (W) Variable ... is misaligned. This may affect the efficiency of the code.
- How much can this affect the efficiency of the code?
- Test: put arrays x and y in a misaligned common block, with a 1-byte character in front of them

### Potential Cost of Misaligned Common Blocks

- 10,000,000 points used for computing π
- Properly aligned, dynamically allocated x and y used 0.064 seconds at 1,100 Mflip/s
- Misaligned, statically allocated x and y in a common block used 0.834 seconds at 88.4 Mflip/s
- Common block misalignment slowed the computation by a factor of 12

### Part I Conclusion

- hpmcount can be used to measure the performance of the total code
- The HPM Toolkit can be used to measure the performance of discrete code sections
- Optimization effort must be focused effectively
- Fortran 90 vector operations are generally faster than Fortran 77 scalar operations
- Automatic SMP parallelization may provide an easy performance boost
- I/O may be the largest factor in "whole code" performance
- Misaligned common blocks can be very expensive

### Part II: Comparing Libraries

- In the rich user environment on seaborg, there are many alternative ways to do the same computation
- The HPM Toolkit provides the tools to compare alternative approaches to the same computation

### Dot Product Functions

- User-coded scalar computation
- User-coded vector computation
- Single-processor ESSL ddot
- Multi-threaded SMP ESSL ddot
- Single-processor IMSL ddot
- Single-processor NAG f06eaf
- Multi-threaded SMP NAG f06eaf

### Sample Problem

- Test the Cauchy-Schwarz inequality for N vectors of length N: (X·Y)² <= (X·X)(Y·Y)
- Generate 2N random numbers (array x2)
- Use the first N for X; (X·X) is computed once
- Vary vector Y: for i = 1, n, set y = 2.0*x2(i:n+(i-1))
- The first Y is 2X, the second Y is 2*x2(2:N+1), etc.
- Compute (2*N)+1 dot products of length N

### Instrumented Code Section for Dot Products

```fortran
      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = ddot(n,y,1,y,1)
        xy = ddot(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
```

### Two User-Coded Functions

```fortran
      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
        dp = dp + x(i)*y(i)   ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)       ! User vector computation
      return
      end
```

### Compile and Run User Functions

```shell
module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat
```

### Compile and Run ESSL Versions

```shell
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -lessl"
$FC -o libs1 libs1.f
./libs1 <libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp -lesslsmp"
$FC -o libs1smp libs1.f
./libs1smp <libs.dat
```

### Compile and Run the IMSL Version

```shell
module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl
```

### Compile and Run NAG Versions

```shell
module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag
module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave "
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64
```

### First Comparison of Dot Product (N=100,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 246 | 203 | 1.72 | |
| User Vector | 249 | 201 | 1.74 | |
| ESSL | 145 | 346 | 1.01 | |
| ESSL-SMP | 408 | 123 | 2.85 | Slowest |
| IMSL | 143 | 351 | 1.00 | Fastest |
| NAG | 250 | 200 | 1.75 | |
| NAG-SMP | 180 | 278 | 1.26 | |

### Comments on First Comparisons

- The best results, by just a little, were obtained with the IMSL library, with ESSL a close second
- Third best was the NAG-SMP routine, which benefits from multi-threaded computation
- The user-coded routines and NAG were about 75% slower than the ESSL and IMSL routines; in general, library routines are highly optimized and better than user-coded routines
- The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (the default is 16)

### ESSL-SMP Performance vs. Number of Threads

- All for N=100,000
- Number of threads controlled by the OMP_NUM_THREADS environment variable

### Revised First Comparison of Dot Product (N=100,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 246 | 203 | 4.9 | |
| User Vector | 249 | 201 | 5.0 | |
| ESSL | 145 | 346 | 2.9 | |
| ESSL-SMP | 50 | 1000 | 1.0 | Fastest (4 threads) |
| IMSL | 143 | 351 | 2.9 | |
| NAG | 250 | 200 | 5.0 | Slowest |
| NAG-SMP | 180 | 278 | 3.6 | |

Tuning the number of threads is very, very important for SMP codes!

### Scaling Up the Problem

- The first comparisons, for N=100,000, computed 200,001 dot products of vectors of length 100,000
- The second comparison, for N=200,000, computes 400,001 dot products of vectors of length 200,000
- This increases the computational complexity by a factor of 4

### Second Comparison of Dot Product (N=200,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 1090 | 183 | 2.17 | |
| User Vector | 1180 | 169 | 2.35 | Slowest |
| ESSL | 739 | 271 | 1.47 | |
| ESSL-SMP | 503 | 398 | 1.00 | Fastest |
| IMSL | 725 | 276 | 1.44 | |
| NAG | 1120 | 179 | 2.23 | |
| NAG-SMP | 864 | 231 | 1.72 | |

### Comments on Second Comparisons (N=200,000)

- Now the best results come from the ESSL-SMP library, with the default 16 threads
- The next best group is ESSL, IMSL, and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine
- The worst results were seen from NAG (single thread) and the user-coded routines
- What is the impact of the number of threads on ESSL-SMP performance, given that it is already the best?

### ESSL-SMP Performance vs. Number of Threads

- All for N=200,000
- Number of threads controlled by the OMP_NUM_THREADS environment variable

### Revised Second Comparison of Dot Product (N=200,000)

| Version | Wall Clock (sec) | Mflip/s | Scaled Time (1=Fastest) | Notes |
|---|---|---|---|---|
| User Scalar | 1090 | 183 | 7.5 | |
| User Vector | 1180 | 169 | 8.1 | Slowest |
| ESSL | 739 | 271 | 5.1 | |
| ESSL-SMP | 146 | 1370 | 1.0 | Fastest (6 threads) |
| IMSL | 725 | 276 | 5.0 | |
| NAG | 1120 | 179 | 7.7 | |
| NAG-SMP | 864 | 231 | 5.9 | |
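The benchmark's underlying computation, sliding Y through the x2 array and checking the Cauchy-Schwarz inequality, can be sketched in Python. This is a hypothetical stand-in: `ddot` here is a plain Python function playing the role of the ESSL/IMSL/NAG routines compared above, and n is kept small.

```python
import random

def ddot(n, x, y):
    """Plain dot product, standing in for the library ddot routines."""
    return sum(x[i] * y[i] for i in range(n))

n = 1000
rng = random.Random(0)
x2 = [rng.random() for _ in range(2 * n)]  # 2N random numbers
x = x2[:n]                                 # first N are X
xx = ddot(n, x, x)                         # (X.X) computed once

# Slide Y through x2 as in the benchmark: the i-th Y is 2*x2[i:i+n],
# giving 2N+1 dot products in total (1 for xx, 2 per loop iteration).
diffs = []
for i in range(n):
    y = [2.0 * v for v in x2[i:i + n]]
    yy = ddot(n, y, y)
    xy = ddot(n, x, y)
    diffs.append(xx * yy - xy * xy)

# Cauchy-Schwarz: (X.Y)^2 <= (X.X)(Y.Y), so every difference should be
# non-negative (up to floating-point rounding).
print(all(d > -1e-6 for d in diffs))  # True
```

As in the Fortran version, the work is dominated by the repeated dot products, which is what makes the choice of ddot implementation the deciding performance factor.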