
Numerical Libraries in High Performance Computing



  1. Numerical Libraries in High Performance Computing Dmitry Pekurovsky dmitry@sdsc.edu CIEG workshop, April 25 2007

  2. Tutorial Outline • General comments • Overview of commonly used libraries • MASS • ScaLAPACK and its components • BLAS, BLACS, PBLAS, LAPACK • PETSc • NAG • ESSL, PESSL • FFTW • P3DFFT • Lab assignment

  3. Numerical libraries • Use numerical (a.k.a. math or scientific) libraries whenever possible • Minimizes code development effort • Improves performance • Provides robust, accurate, up-to-date numerical algorithms • A library is a collection (archive) of precompiled routines • Use nm -a to look inside if needed • Linking: mpxlf … /lib/loc/libnum.a OR mpxlf … -L/lib/loc -lnum • Reminder: the order of linking matters • If two libraries contain a routine with the same name, the first one to appear on the link command line is used • -l options should come after the object files (.o) of your source • If library A calls routines from library B, -lA has to come before -lB

  4. Calling Fortran Libraries from C/C++ • Fortran uses column-major array storage, C/C++ uses row-major • In C array indices begin with 0, but in Fortran they begin with 1 • Pass arguments by reference • Pass characters as strings • In C++, prototype the functions to be called as follows (IBM XL C++ extension): • extern "Fortran" func(int &n,…); • Link with the -lxlf90_r flag added • More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
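To make these conventions concrete, the following minimal C++ sketch calls the Fortran BLAS routine DSCAL (x = alpha*x). It assumes the IBM XL C++ extern "Fortran" linkage shown above; the array size and link flags are illustrative only, and with other compilers one would typically declare extern "C" void dscal_(int*, double*, double*, int*) and pass pointers.

    // Hedged sketch: calling the Fortran BLAS routine DSCAL from C++.
    // Assumes IBM XL C++'s extern "Fortran" linkage as on the slide above;
    // all scalar arguments are passed by reference, the array by address.
    extern "Fortran" void dscal(int &n, double &alpha, double *x, int &incx);

    int main() {
        int n = 100, incx = 1;
        double alpha = 2.0;
        double x[100];                          // C indices 0..n-1; Fortran sees x(1..n)
        for (int i = 0; i < n; ++i) x[i] = 1.0;
        dscal(n, alpha, x, incx);               // x is scaled by alpha in place
        return 0;                               // link with a BLAS (e.g. -lessl) and -lxlf90_r
    }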

  5. Third-party libraries • Advantages: • Portable code • Extended set of features • Disadvantages: • Suboptimal performance • Sometimes costly, though many are free • Sometimes you need to install them yourself • A good set is already installed on SDSC systems • Look in the /usr/local/apps directory for 64-bit apps, in /usr/local/apps32 for 32-bit apps • Examples: ScaLapack, PETSc, FFTW, NAG

  6. Vendor-supplied libraries (specific to a given architecture) • Advantages: • Highly optimized performance (for a given processor, memory system etc) • Installed, tested, supported • Usually free for the user • Disadvantages: • Sometimes code not portable to other platforms • Some features may not be implemented • Examples on IBM: ESSL, MASS • Examples on Intel: MKL

  7. MASS – Mathematical Acceleration SubSystem • IBM product • Scalar Library Functions • Replace standard system intrinsics from libm • Call from Fortran or C • No source code modification required • Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log pow(C/C++) x**y(Fortran) • Provide better performance • Sometimes less accurate results (usually in the last bit) • Linking: -L/usr/local/apps/mass -lmass

  8. Scalar MASS Performance (cycles per call, length 1000 loop)
Function  Range   ----604e----    ----pwr3----    ----pwr4+---    ----pwr5----
                  libm mass  sp   libm mass  sp   libm mass  sp   libm mass  sp
acos      B         91   91 1.0     70   69 1.0    108  116 0.9    113  115 1.0
acosh     G        318  182 1.7    256  159 1.6    483  253 1.9    527  248 2.1
asin      B         84   83 1.0     63   63 1.0    100  101 1.0    112  116 1.0
asinh     D        352  232 1.5    264  178 1.5    463  301 1.5    482  286 1.7
atan2     D        601  103 5.8    441   79 5.6    754  100 7.5    847   95 8.9
atanh     B        244  143 1.7    186  116 1.6    300  169 1.8    312  165 1.9
cbrt      D        279  282 1.0    216  216 1.0    328  333 1.0    335  336 1.0
cos       B         53   25 2.1     38   16 2.4     53   21 2.5     65   29 2.2
cos       D         76   45 1.7     58   37 1.6     74   58 1.3     86   60 1.4
cosh      D        191   45 4.2    154   32 4.8    237   46 5.2    255   50 5.1
cosisin   B        111   58 1.9     82   34 2.4    101   46 2.2    123   50 2.5
cosisin   D        161   97 1.7    125   77 1.6    151  102 1.5    170  105 1.6
div       D         32   32 1.0    8.8  8.8 1.0     28   28 1.0     29   29 1.0
exp       D        120   35 3.4    106   27 3.9    162   35 4.6    168   40 4.2
expm1     D        133   41 3.2    111   28 4.0    218   43 5.1    198   43 4.6
(libm = cycles per call with libm, mass = cycles per call with MASS, sp = speedup)

  9. Scalar MASS Performance (cycles per call, length 1000 loop), cont.
Function  Range   ----604e----    ----pwr3----    ----pwr4+---    ----pwr5----
                  libm mass  sp   libm mass  sp   libm mass  sp   libm mass  sp
log       C        134   55 2.4    105   52 2.0    204   84 2.4    210   86 2.4
log10     C        133   56 2.4    107   57 1.9    206   90 2.3    207   92 2.3
log1p     H        157   58 2.7    124   47 2.6    201   68 3.0    220   70 3.1
pow       C        309  113 2.7    235   90 2.6    453  146 3.1    463  150 3.1
qdrt      C        134   98 1.4    116   88 1.3    250  156 1.6    283  159 1.8
rcbrt     D        309  309 1.0    236  237 1.0    354  354 1.0    362  363 1.0
rec       D         32   32 1.0    9.2  8.8 1.0     29   29 1.0     29   29 1.0
rqdrt     C        149  100 1.5    126   95 1.3    243  155 1.6    276  153 1.8
rsqrt     C         86   52 1.7     70   52 1.3    120   73 1.6    156   74 2.1
sin       B         52   27 1.9     37   16 2.3     54   24 2.3     66   32 2.1
sin       D         78   45 1.7     61   37 1.6     77   57 1.4     82   60 1.4
sincos    B        108   56 1.9     80   34 2.4    101   48 2.1    121   54 2.2
sinh      D        250   50 5.0    197   35 5.6    345   52 6.6    329   55 6.0
sqrt      C         69   50 1.4     59   46 1.3    128   72 1.8    143   73 2.0
tan       D        137   72 1.9    112   43 2.6    194   62 3.1    183   64 2.9
tanh      F        256   64 4.0    199   47 4.2    357   65 5.5    333   67 5.0
(columns as on the previous slide)

  10. Vector MASS library - MASSV • Provides optimized versions of intrinsic functions for vectorized operations • Example: replace for(i=0; i < N; i++) Z[i] = exp(x[i]); with a single call vexp(Z,x,N); • Takes advantage of pipelining: the evaluations Z(1)=exp(x(1)), Z(2)=exp(x(2)), Z(3)=exp(x(3)), Z(4)=exp(x(4)), … are overlapped instead of being computed one at a time

  11. Vector MASS library cont. • Some source code modification required • Note: accuracy may differ even from scalar MASS functions (usually in the last bit) • Library source code available for portability to other platforms • Linking: -lmassvp4 on POWER4 platforms, otherwise -lmassv • MASS URL: http://www-306.ibm.com/software/awdtools/mass/
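As a concrete illustration of the vexp example on slide 10, here is a small hedged C sketch; it assumes the usual MASSV convention that the element count is passed by address (Fortran style), which should be checked against the massv header on the target system.

    /* Hedged sketch: replacing a scalar exp() loop with the vector MASS routine vexp.
       Assumed prototype (Fortran-style, count passed by address) -- verify against massv.h. */
    void vexp(double y[], double x[], int *n);

    void fill_exp(double *z, double *x, int n)
    {
        /* scalar version:  for (int i = 0; i < n; i++) z[i] = exp(x[i]);  */
        vexp(z, x, &n);       /* one call, pipelined across all n elements */
    }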

  12. Vector MASS Performance -- double precision functions (cycles per evaluation, length 1000 loop)
Function  Range   -----604e-----    -----pwr3-----    -----pwr4+----    -----pwr5-----
                  libm  msv    sp   libm  msv    sp   libm  msv    sp   libm  msv    sp
vacos     B         91   41   2.2     70   17   4.1    108   24   4.5    113   23   4.9
vacosh    G        318   39   8.2    256   15  17.1    483   21  23.0    527   19  27.7
vasin     B         84   41   2.0     63   17   3.7    100   24   4.2    112   23   4.9
vasinh    D        352   44   8.0    264   18  14.7    463   23  20.1    482   22  21.9
vatan2    D        601   40  15.0    441   27  16.3    754   60  12.6    847   57  14.9
vatanh    B        244   41   6.0    186   16  11.6    300   21  14.3    312   19  16.4
vcbrt     D        279   20  14.0    216  8.8  24.5    328   11  29.8    335   12  27.9
vcos      B         53  9.3   5.7     38    4   9.5     53  6.5   8.2     65  6.5  10.0
vcos      D         76   27   2.8     58   12   4.8     74   20   3.7     86   20   4.3
vcosh     D        191   25   7.6    154  9.2  16.7    237   14  16.9    255   13  19.6
vcosisin  B        111   19   5.8     82    8  10.3    101   13   7.8    123   12  10.3
vcosisin  D        161   29   5.6    125   12  10.4    151   21   7.2    170   20   8.5
vdiv      D         32   11   2.9    8.8  6.8   1.3     28   13   2.2     29   13   2.2
vexp      D        120   16   7.5    106  6.4  16.6    162   14  11.6    168   13  12.9
vexpm1    D        133   19   7.0    111    8  13.9    218   12  18.2    198   12  16.5
vlog      C        134   20   6.7    105  7.2  14.6    204  9.7  21.0    210  9.5  22.1
(libm = cycles with libm; msv = cycles with the vector MASS library: massv on 604e, massvp3/p4/p5 on POWER3/4/5; sp = speedup)

  13. Vector MASS Performance -- double precision functions (cycles per evaluation, length 1000 loop), cont.
Function  Range   -----604e-----    -----pwr3-----    -----pwr4+----    -----pwr5-----
                  libm  msv    sp   libm  msv    sp   libm  msv    sp   libm  msv    sp
vlog10    C        133   19   7.0    107  7.6  14.1    206   10  20.6    207  9.9  20.9
vlog1p    H        157   26   6.0    124   11  11.3    201   13  15.5    220   12  18.3
vpow      C        309   52   5.9    235   20  11.8    453   29  15.6    463   30  15.4
vqdrt     C        134   15   8.9    116    6  19.3    250  8.2  30.5    283    8  35.4
vrcbrt    D        309   20  15.5    236  8.8  26.8    354   11  32.2    362   11  32.9
vrec      D         32   10   3.2    9.2  6.4   1.4     29   12   2.4     29   11   2.6
vrqdrt    C        149   14  10.6    126    6  21.0    243  7.8  31.2    276    7  39.4
vrsqrt    C         86   16   5.4     70  8.8   8.0    120   18   6.7    156   17   9.2
vsin      B         52   11   4.7     37  4.8   7.7     54  7.2   7.5     66  6.9   9.6
vsin      D         78   27   2.9     61   12   5.1     77   20   3.9     82   19   4.3
vsinh     D        250   27   9.3    197   10  19.7    345   15  23.0    329   13  25.3
vsqrt     C         69   16   4.3     59  8.8   6.7    128   17   7.5    143   17   8.4
vtan      D        137   31   4.4    112   13   8.6    194   19  10.2    183   17  10.8
vtanh     F        256   35   7.3    199   13  15.3    357   19  18.8    333   18  18.5
(columns as on the previous slide)

  14. ScaLapack • Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack • Developed by a team from the University of Tennessee, University of California Berkeley, ORNL, Rice University, UCLA, UIUC, and others • Supported in commercial packages: • NAG Parallel Library • IBM PESSL • CRAY Scientific Library • VNI IMSL • Fujitsu, HP/Convex, Hitachi, NEC • Handles dense and banded matrices

  15. (Diagram comparing the serial and parallel library components; figure not reproduced in the transcript)

  16. BLAS – Basic Linear Algebra Subprograms

  17. BLAS, cont. • Use blocked operations • Optimized for a given memory hierarchy – e.g. good use of cache • Other libraries build on top of BLAS • On any given system, try to use a system-optimized version
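As an illustration of a blocked Level-3 BLAS operation, here is a hedged C sketch of a matrix multiply through the portable CBLAS interface; it assumes a cblas.h is provided by the installed BLAS (ESSL's own C interface to dgemm has a slightly different calling sequence).

    /* Hedged sketch: C = alpha*A*B + beta*C using the Level-3 BLAS routine DGEMM
       through the portable CBLAS interface. */
    #include <cblas.h>

    void multiply(int n, const double *A, const double *B, double *C)
    {
        /* Row-major storage here; a Fortran BLAS called directly expects column-major. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,          /* M, N, K        */
                    1.0, A, n,        /* alpha, A, lda  */
                    B, n,             /* B, ldb         */
                    0.0, C, n);       /* beta, C, ldc   */
    }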

  18. LAPACK • Linear Algebra PACKage • System of linear equations • Eigenvalue problems • Built on top of BLAS • On SDSC systems, installed at /usr/local/apps/LAPACK
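For reference, a hedged C sketch of a typical serial LAPACK call for a dense linear solve; it assumes the common trailing-underscore naming for Fortran symbols (on AIX with xlf the symbol may be plain dgesv) and column-major storage of A.

    /* Hedged sketch: solve A x = b with LAPACK's DGESV from C.
       A (n x n, column-major) is overwritten by its LU factors; b is overwritten by x. */
    #include <stdlib.h>

    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    int solve(int n, double *a, double *b)
    {
        int nrhs = 1, info;
        int *ipiv = malloc(n * sizeof(int));   /* pivot indices */
        dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
        free(ipiv);
        return info;                           /* 0 means success */
    }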

  19. BLACS - Basic Linear Algebra Communication Subprograms • Built on top of MPI

  20. BLACS basics
nprow=1; npcol=3; order='r'; zero=0;
blacs_get(zero, zero, icontxt);
blacs_gridinit(icontxt, order, nprow, npcol);
blacs_gridinfo(icontxt, nprow, npcol, myrow, mycol);
• icontxt plays a role similar to MPI_COMM_WORLD • blacs_gridinit specifies the process-grid topology • blacs_gridinfo returns this process's grid information • Here we define a 1 x 3 grid (a 3-element row of processes)
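The same grid setup written from C, as a hedged sketch; the Cblacs_* names below are the conventional C bindings distributed with BLACS/ScaLAPACK and should be confirmed against the local installation.

    /* Hedged sketch: creating a 1 x 3 BLACS process grid from C. */
    extern void Cblacs_pinfo(int *mypnum, int *nprocs);
    extern void Cblacs_get(int icontxt, int what, int *val);
    extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
    extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
    extern void Cblacs_gridexit(int icontxt);
    extern void Cblacs_exit(int doneflag);

    int main(void)
    {
        int iam, nprocs, icontxt, nprow = 1, npcol = 3, myrow, mycol;

        Cblacs_pinfo(&iam, &nprocs);            /* my rank and the total process count         */
        Cblacs_get(-1, 0, &icontxt);            /* default system context (cf. MPI_COMM_WORLD) */
        Cblacs_gridinit(&icontxt, "Row", nprow, npcol);            /* row-major 1 x 3 grid     */
        Cblacs_gridinfo(icontxt, &nprow, &npcol, &myrow, &mycol);  /* my grid coordinates      */

        /* ... PBLAS / ScaLAPACK calls using icontxt ... */

        Cblacs_gridexit(icontxt);
        Cblacs_exit(0);
        return 0;
    }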

  21. PBLAS • Parallel version of BLAS • Built on top of BLAS and BLACS

  22. ScaLapack data layout

  23. Block-Cyclic Partitioning (1D)

  24. Block-cyclic partitioning (2D) • Example: a 9 x 9 global matrix with block size 2 x 2, distributed cyclically over a 2 x 3 PE grid; each PE(i,j) holds the 2 x 2 blocks assigned to its row and column of the grid (figure not reproduced in the transcript)

  25. Block-cyclic partitioning (2D), cont.
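To make the mapping explicit, here is a small illustrative C function (not from the slides) that computes, for one dimension, which process owns a given global index under block-cyclic distribution, and where that index lands locally.

    /* Illustrative sketch of the block-cyclic mapping used by ScaLAPACK, per dimension:
       block size nb, nprocs processes along this dimension, 0-based global index. */
    int owner(int gindex, int nb, int nprocs)
    {
        return (gindex / nb) % nprocs;            /* process coordinate owning this index */
    }

    int local_index(int gindex, int nb, int nprocs)
    {
        int block  = gindex / nb;                 /* which global block the index is in   */
        int offset = gindex % nb;                 /* position inside that block           */
        return (block / nprocs) * nb + offset;    /* position in the local sub-matrix     */
    }

    /* Example from slide 24: 9 x 9 matrix, 2 x 2 blocks, 2 x 3 grid.  Global row 4
       (0-based, i.e. the fifth row) is in block 2, so it belongs to process row 2 % 2 = 0. */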

  26. ScaLapack – Array Descriptors

  27. Syntax for descinit
descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)
In/Out  Argument  Description
OUT     idesc     Descriptor
IN      m         Row size (global)
IN      n         Column size (global)
IN      mb        Row block size
IN      nb        Column block size
IN      i         Starting grid row
IN      j         Starting grid column
IN      icon      BLACS context
IN      mla       Leading dimension of local matrix
OUT     ier       Error number
The starting grid location is usually (i=0, j=0).
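A hedged C sketch of a typical descinit call follows; it assumes the Fortran symbol carries a trailing underscore (it may be plain descinit on AIX) and that the BLACS context and the local leading dimension were obtained as on the earlier slides.

    /* Hedged sketch: fill a ScaLAPACK descriptor for a global m x n matrix
       distributed in mb x nb blocks over the BLACS grid icontxt. */
    extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                          int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);

    void make_descriptor(int desc[9], int m, int n, int mb, int nb,
                         int icontxt, int lld)     /* lld = local leading dimension */
    {
        int irsrc = 0, icsrc = 0, info;            /* distribution starts at grid (0,0) */
        descinit_(desc, &m, &n, &mb, &nb, &irsrc, &icsrc, &icontxt, &lld, &info);
        /* info == 0 on success */
    }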

  28. ScaLapack functionality

  29. ScaLapack API • Prepend LAPACK equivalent names with P • Routine names have the form PXYYZZZ, where X encodes the data type, YY the matrix type, and ZZZ the computation performed

  30. ScaLapack API, cont. • Matrix Types (YY) PXYYZZZ • DB General Band (diagonally dominant-like) • DT general Tridiagonal (Diagonally dominant-like) • GB General Band • GE GEneral matrices (e.g., unsymmetric, rectangular, etc.) • GG General matrices, Generalized problem • HE Complex Hermitian • OR Orthogonal Real • PB Positive definite Banded (symmetric or Hermitian) • PO Positive definite (symmetric or Hermitian) • PT Positive definite Tridiagonal (symmetric or Hermitian) • ST Symmetric Tridiagonal Real • SY SYmmetric • TR TRiangular (possibly quasi-triangular) • TZ TrapeZoidal • UN UNitary complex

  31. ScaLapack API, cont. Drivers (ZZZ) PXYYZZZ • SL Linear Equations (SVX*) • SV LU Solver • VD Singular Value • EV Eigenvalue (EVX**) • GVX Generalized Eigenvalue  * Expert driver: estimates condition numbers, provides error bounds, equilibrates the system. ** Computes selected eigenvalues

  32. ScaLapack function call example Solving Ax=B for a general square matrix A (N x N), B (N x NRHS) CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO ) • Global indices: IA = JA = IB = JB = 1 • IPIV: integer array of length N; on output it describes the permutation matrix (row i of the matrix was interchanged with row IPIV(i))

  33. ScaLapack performance tips • Use the native (system-optimized) BLAS • Choose the number of processors so that the local matrix does not become too large (about 1000 x 1000 per processor is a good target) • Try to use a square processor grid (Nrow = Ncol) • Use a block size of about 64

  34. PETSc • Portable, Extensible Toolkit for Scientific Computation • Intended to facilitate the scalable (parallel) solution of linear and nonlinear partial differential equations • Focus on problems discretized using semi-implicit or fully implicit methods • Developed by the Mathematics and Computer Science (MCS) Division at Argonne National Laboratory

  35. PETSc • Fully usable from Fortran, C, and C++ • Uses MPI for all parallel communications • Object-oriented • Designed for both efficiency and ease of use • Available freely from http://www.mcs.anl.gov/petsc • Installed on Datastar at /usr/local/apps/petsc-2.3.1

  36. PETSc features • Parallel Vectors and discrete functions • Distributed arrays (for regular grid problems) • Parallel vector scatter/gathers (for unstructured grid problems) • Parallel (sparse) matrices • Parallel Krylov subspace methods • Parallel preconditioners • Parallel (Newton-based) nonlinear solvers • Parallel time-stepping code

  37. PETSc numerical components
• Nonlinear solvers: Newton-based methods (line search, trust region), other
• Time steppers: Euler, backward Euler, pseudo time-stepping, other
• Krylov subspace methods: GMRES, CG, CGS, Bi-CG-stab, TFQMR, Richardson, Chebyshev, other
• Preconditioners: additive Schwarz, block Jacobi, ILU, ICC, LU (sequential only), other
• Matrices: compressed sparse row (AIJ), blocked compressed sparse row (BAIJ), block diagonal, dense, other
• Index sets: indices, block indices, stride, other
• Vectors
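To show how these components appear in user code, here is a heavily hedged outline of a PETSc linear solve in C; argument lists have changed between PETSc releases (the slides refer to petsc-2.3.1), so the calls below indicate the general shape of the API rather than an exact, version-specific program, and matrix/vector assembly is elided.

    /* Heavily hedged outline: the KSP (Krylov) / PC / Mat / Vec objects from the diagram.
       Exact signatures vary by PETSc version; assembly of A, b, x is elided. */
    #include "petscksp.h"

    int main(int argc, char **argv)
    {
        Mat A;  Vec x, b;  KSP ksp;

        PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);

        /* ... create and assemble the parallel matrix A and vectors b, x
           (MatCreate/MatSetValues/MatAssemblyBegin/End, VecCreate/VecSetValues ...) */

        KSPCreate(PETSC_COMM_WORLD, &ksp);           /* Krylov solver context               */
        KSPSetOperators(ksp, A, A,                   /* operator and preconditioning matrix */
                        DIFFERENT_NONZERO_PATTERN);  /* flag used in 2.x; dropped later     */
        KSPSetFromOptions(ksp);                      /* choose method/PC at run time        */
        KSPSolve(ksp, b, x);                         /* solve A x = b                       */

        /* ... KSPDestroy / MatDestroy / VecDestroy ... */
        PetscFinalize();
        return 0;
    }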

  38. MPI use features • Communicators and attributes to manage messages inside PETSc objects • Nonblocking sends and receives to overlap communication with computation • Global reduction operations • Derived data types

  39. NAG – Numerical Algorithms Group ® • Proprietary package, currently installed on Datastar in /usr/local/apps/nag (64-bit only) • Differential Equations • Linear Equations • Interpolation, Curve and Surface Fitting • Linear Algebra • Eigensystem Analysis • Statistical Analysis • Random Number Generator • Fourier Transforms, Summation of Series • Time series analysis • Mesh generation

  40. ESSL - Engineering and Scientific Subroutine Library • IBM product – optimized for IBM platforms • Over 400 high performance mathematical functions • C/C++ and Fortran, 32 and 64 bit, thread-safe • API is Fortran-based. From C/C++ must use Fortran calling conventions, e.g. column-major array • Linking: -lessl • ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf

  41. ESSL features • Numerical functions categories • Linear Algebra - vector-scalar, sparse vector-scalar, matrix-vector, sparse matrix-vector • Matrix Operations - addition, subtraction, multiplication, rank-k updates, transpose • Linear Algebra Equation solvers - dense, banded, sparse, linear least-squares • Eigensystem Analysis - standard, generalized • Signal Processing - Fourier transform, convolutions, correlations • Sorting & Searching • Interpolation - polynomial, cubic spline • Numerical Quadrature • Random Number Generation • Utilities

  42. ESSL example – 1D real-to-complex Fast Fourier Transform (FFT)
#include <essl.h>
#include <stdlib.h>

int main()
{
   …
   double R[M][N], *work1, *work2;      /* M sequences of N real points (row-major in C)  */
   dcmplx C[M][N/2+1];                  /* transformed output (double-precision complex)  */
   int nwork1, nwork2;                  /* work-array sizes; set per the ESSL User Guide  */
   Init(R, N, M);                       /* user routine filling the input data            */
   …
   work1 = malloc(nwork1*sizeof(double));
   work2 = malloc(nwork2*sizeof(double));
   /* first call (init=1) initializes the work arrays, second call (init=0) transforms */
   drcft(1, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
   drcft(0, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
   …
}

  43. PESSL (Parallel ESSL) • Invokes ESSL • Uses MPI for communication • PESSL categories: • Subset of Level 2, 3 PBLAS • Subset of ScaLAPACK (dense, banded) • Sparse routines • Subset of ScaLAPACK eigensystem routines • Fourier transforms • Uniform random number generation • BLACS • Link with -lpessl

  44. FFTW – Fastest Fourier Transform in the West • Third-party software, free – http://www.fftw.org • Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s • 1-, 2-, and 3-dimensional FFTs, and related functions (sin/cos transforms, convolution, etc.) • Uses "plans" that hold the transform parameters, e.g. fftw_plan plan1; plan1 = fftw_plan_dft_2d(N, M, …); fftw_execute(plan1); • Work arrays are allocated and initialized behind the scenes • FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT planner flags trade planning time for transform speed • +'s: portable, free • -'s: performance about 3 times worse than ESSL's FFTs!
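To illustrate the plan/execute workflow end to end, here is a small hedged C example of a 1D real-to-complex transform with the standard FFTW 3 API (array sizes and input values are arbitrary).

    /* Hedged sketch: 1D real-to-complex FFT with FFTW 3's plan/execute interface. */
    #include <fftw3.h>

    int main(void)
    {
        int n = 1024;
        double *in        = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n/2 + 1));

        /* Plan once (FFTW_MEASURE / FFTW_PATIENT search longer for a faster plan),
           then execute as many times as needed on the same arrays.                 */
        fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);

        for (int i = 0; i < n; i++) in[i] = (double)i;  /* fill input after planning */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }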

  45. 3D FFT in parallel • FFTW version 3 does not have an MPI implementation • PESSL and other libraries use 1D processor decomposition (slabs/planes) • Limitation: Np <= N • Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain of 4096^3 grid points • 2D decomposition is crucial for scaling on petascale platforms

  46. 2D decomposition (diagram of pencil decomposition along the x, y, and z axes; figure not reproduced)

  47. 2D decomposition • 2D decomposition: processors form a square grid P1 x P2 • Columns (pencils) instead of slabs • Local geometry: • Stage 1: N x K x M • Stage 2: K x N x M • Stage 3: K x M x N • K = N/P1, M = N/P2 • At each stage, transform along the local dimension (N) • Software scaling limit removed! Now the limit is P <= N^2

  48. Performance of DNS (2D) on Blue Gene at SDSC and IBM's T.J. Watson Lab, and on SDSC's Datastar (performance plot) • VN: two processors per node • CO: one processor per node

  49. P3DFFT • Open source library for efficient, highly scalable 3D FFT on parallel platforms • Built on top of an optimized 1D FFT library • Currently ESSL or FFTW • In the future, more libraries • Uses 2D decomposition, which includes 1D option. • Developed at SDSC • Outcome of a Strategic Applications Collaborations Project (DNS turbulence code) • Available at http://www.sdsc.edu/us/resources/p3dfft.php

  50. P3DFFT: features • Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms • Written in Fortran 90 with MPI • Implemented as an F90 module • Single or double precision • Arbitrary dimensions • Handles many uneven cases (Ni does not have to be divisible by Pj) • Assumes Ni >= Pj • Can do either in-place or out-of-place transforms • For out-of-place transforms, the input and output arrays must not overlap • Memory requirements: input and output arrays + 1 buffer
