
Numerical Libraries in High Performance Computing

Dmitry Pekurovsky

[email protected]

CIEG workshop, April 25, 2007

Tutorial Outline
  • General comments
  • Overview of commonly used libraries
    • MASS
    • ScaLAPACK and its components
      • BLAS, BLACS, PBLAS, LAPACK
    • PETSc
    • NAG
    • ESSL, PESSL
    • FFTW
    • P3DFFT
  • Lab assignment
Numerical libraries
  • Use numerical (a.k.a. math or scientific) libraries whenever possible
    • Minimize code development effort
    • Improved performance
    • Robust, accurate, up-to-date numerical algorithms
  • Collection (archive) of precompiled routines
    • Use nm -a to look inside if needed
  • Linking: mpxlf … /lib/loc/libnum.a OR

mpxlf … -L/lib/loc -lnum

    • Reminder: order of linking matters
      • In case two libraries contain the same routine name, the first one to appear on the linking command line gets used
      • -l options should come after the object files (.o) of your source
      • If library A calls routines from library B, -lA has to come before -lB.
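For example, with hypothetical libraries libA.a and libB.a, where routines in A call routines in B:

mpxlf myprog.o -L/lib/loc -lA -lB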
Calling Fortran Libraries from C/C++
  • Fortran uses column-major, C/C++ uses row-major array storage
  • In C array counts begin with 0, but in Fortran they begin with 1
  • Pass arguments by reference
  • Pass characters as strings
  • In C++, prototype functions to be called as follows:
    • extern "Fortran" func(int &n,…); (an IBM XL C++ extension; other compilers typically need extern "C" plus the Fortran symbol name, often with a trailing underscore)
  • Link with –lxlf90_r flag added
  • More details: see http://www.sdsc.edu/us/newsletter/common/newsletter.php?issue=2005-10&corner=software or ask SDSC consulting
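As an illustration, a minimal sketch of calling the Fortran BLAS routine DAXPY from C, assuming IBM XL naming conventions (lowercase symbol, no trailing underscore; many other compilers expect daxpy_ instead):

#include <stdio.h>

/* Fortran routine: all arguments are passed by reference */
void daxpy(int *n, double *alpha, double *x, int *incx, double *y, int *incy);

int main(void) {
    int n = 3, inc = 1;
    double alpha = 2.0;
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {10.0, 10.0, 10.0};
    daxpy(&n, &alpha, x, &inc, y, &inc);     /* y = alpha*x + y */
    printf("%f %f %f\n", y[0], y[1], y[2]);  /* prints 12 14 16 */
    return 0;
}

Link with a BLAS implementation (e.g., -lessl on IBM systems) and -lxlf90_r, as noted above.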
Third-party libraries
  • Advantages:
    • Portable code
    • Extended set of features
  • Disadvantages:
    • Suboptimal performance
    • Sometimes costly, though many are free
    • Sometimes, need to install yourself
  • A good set is already installed on SDSC systems
    • Look in /usr/local/apps directory for 64-bit apps, in /usr/local/apps32 for 32-bit apps
  • Examples: ScaLapack, PETSc, FFTW, NAG
Vendor-supplied libraries (specific to a given architecture)
  • Advantages:
    • Highly optimized performance (for a given processor, memory system etc)
    • Installed, tested, supported
    • Usually free for the user
  • Disadvantages:
    • Sometimes code not portable to other platforms
    • Some features may not be implemented
  • Examples on IBM: ESSL, MASS
  • Examples on Intel: MKL
MASS – Mathematical Acceleration SubSystem
  • IBM product
  • Scalar Library Functions
    • Replace standard system intrinsics from libm
    • Call from Fortran or C
    • No source code modification required
    • Double precision routines: sin cos tan atan sqrt sinh cosh tanh atan2 rsqrt exp dnint log pow(C/C++) x**y(Fortran)
    • Provide better performance
    • Sometimes less accurate results (usually in the last bit)
    • Linking: -L/usr/local/apps/mass -lmass
Scalar MASS Performance (cycles per call, length 1000 loop)

Per CPU column: libm cycles / MASS cycles / speedup (Rng = input-range code from the original table).

Function  Rng | 604e          | pwr3          | pwr4+         | pwr5
acos      B   | 91/91/1.0     | 70/69/1.0     | 108/116/0.9   | 113/115/1.0
acosh     G   | 318/182/1.7   | 256/159/1.6   | 483/253/1.9   | 527/248/2.1
asin      B   | 84/83/1.0     | 63/63/1.0     | 100/101/1.0   | 112/116/1.0
asinh     D   | 352/232/1.5   | 264/178/1.5   | 463/301/1.5   | 482/286/1.7
atan2     D   | 601/103/5.8   | 441/79/5.6    | 754/100/7.5   | 847/95/8.9
atanh     B   | 244/143/1.7   | 186/116/1.6   | 300/169/1.8   | 312/165/1.9
cbrt      D   | 279/282/1.0   | 216/216/1.0   | 328/333/1.0   | 335/336/1.0
cos       B   | 53/25/2.1     | 38/16/2.4     | 53/21/2.5     | 65/29/2.2
cos       D   | 76/45/1.7     | 58/37/1.6     | 74/58/1.3     | 86/60/1.4
cosh      D   | 191/45/4.2    | 154/32/4.8    | 237/46/5.2    | 255/50/5.1
cosisin   B   | 111/58/1.9    | 82/34/2.4     | 101/46/2.2    | 123/50/2.5
cosisin   D   | 161/97/1.7    | 125/77/1.6    | 151/102/1.5   | 170/105/1.6
div       D   | 32/32/1.0     | 8.8/8.8/1.0   | 28/28/1.0     | 29/29/1.0
exp       D   | 120/35/3.4    | 106/27/3.9    | 162/35/4.6    | 168/40/4.2
expm1     D   | 133/41/3.2    | 111/28/4.0    | 218/43/5.1    | 198/43/4.6

Scalar MASS Performance (cycles per call, length 1000 loop), cont.

Function  Rng | 604e          | pwr3          | pwr4+         | pwr5
log       C   | 134/55/2.4    | 105/52/2.0    | 204/84/2.4    | 210/86/2.4
log10     C   | 133/56/2.4    | 107/57/1.9    | 206/90/2.3    | 207/92/2.3
log1p     H   | 157/58/2.7    | 124/47/2.6    | 201/68/3.0    | 220/70/3.1
pow       C   | 309/113/2.7   | 235/90/2.6    | 453/146/3.1   | 463/150/3.1
qdrt      C   | 134/98/1.4    | 116/88/1.3    | 250/156/1.6   | 283/159/1.8
rcbrt     D   | 309/309/1.0   | 236/237/1.0   | 354/354/1.0   | 362/363/1.0
rec       D   | 32/32/1.0     | 9.2/8.8/1.0   | 29/29/1.0     | 29/29/1.0
rqdrt     C   | 149/100/1.5   | 126/95/1.3    | 243/155/1.6   | 276/153/1.8
rsqrt     C   | 86/52/1.7     | 70/52/1.3     | 120/73/1.6    | 156/74/2.1
sin       B   | 52/27/1.9     | 37/16/2.3     | 54/24/2.3     | 66/32/2.1
sin       D   | 78/45/1.7     | 61/37/1.6     | 77/57/1.4     | 82/60/1.4
sincos    B   | 108/56/1.9    | 80/34/2.4     | 101/48/2.1    | 121/54/2.2
sinh      D   | 250/50/5.0    | 197/35/5.6    | 345/52/6.6    | 329/55/6.0
sqrt      C   | 69/50/1.4     | 59/46/1.3     | 128/72/1.8    | 143/73/2.0
tan       D   | 137/72/1.9    | 112/43/2.6    | 194/62/3.1    | 183/64/2.9
tanh      F   | 256/64/4.0    | 199/47/4.2    | 357/65/5.5    | 333/67/5.0

Vector MASS library - MASSV
  • Provides optimized versions of intrinsic functions for vectorized operations
    • Example

for(i=0; i < N; i++)
    Z[i] = exp(x[i]);

Replace with

vexp(Z, x, &N);   /* MASSV routines take the vector length by reference */

    • Takes advantage of pipelining

[Diagram: the scalar loop evaluates Z(1)=exp(x(1)), Z(2)=exp(x(2)), Z(3)=exp(x(3)), Z(4)=exp(x(4)) one at a time, while vexp keeps several evaluations in flight at once in the floating-point pipeline]

Vector MASS library cont.
  • Some source code modification required
  • Note: accuracy may differ even from scalar MASS functions (usually in the last bit)
  • Library source code available for portability to other platforms
  • Linking: -lmassvp4 on POWER4 platforms, otherwise -lmassv
  • MASS URL: http://www-306.ibm.com/software/awdtools/mass/
Vector MASS Performance -- double precision functions (cycles per evaluation, length 1000 loop)

Per CPU column: libm cycles / vector-MASS cycles (massv, massvp3, massvp4, massvp5) / speedup (Rng = input-range code).

Function  Rng | 604e          | pwr3          | pwr4+         | pwr5
vacos     B   | 91/41/2.2     | 70/17/4.1     | 108/24/4.5    | 113/23/4.9
vacosh    G   | 318/39/8.2    | 256/15/17.1   | 483/21/23.0   | 527/19/27.7
vasin     B   | 84/41/2.0     | 63/17/3.7     | 100/24/4.2    | 112/23/4.9
vasinh    D   | 352/44/8.0    | 264/18/14.7   | 463/23/20.1   | 482/22/21.9
vatan2    D   | 601/40/15.0   | 441/27/16.3   | 754/60/12.6   | 847/57/14.9
vatanh    B   | 244/41/6.0    | 186/16/11.6   | 300/21/14.3   | 312/19/16.4
vcbrt     D   | 279/20/14.0   | 216/8.8/24.5  | 328/11/29.8   | 335/12/27.9
vcos      B   | 53/9.3/5.7    | 38/4/9.5      | 53/6.5/8.2    | 65/6.5/10.0
vcos      D   | 76/27/2.8     | 58/12/4.8     | 74/20/3.7     | 86/20/4.3
vcosh     D   | 191/25/7.6    | 154/9.2/16.7  | 237/14/16.9   | 255/13/19.6
vcosisin  B   | 111/19/5.8    | 82/8/10.3     | 101/13/7.8    | 123/12/10.3
vcosisin  D   | 161/29/5.6    | 125/12/10.4   | 151/21/7.2    | 170/20/8.5
vdiv      D   | 32/11/2.9     | 8.8/6.8/1.3   | 28/13/2.2     | 29/13/2.2
vexp      D   | 120/16/7.5    | 106/6.4/16.6  | 162/14/11.6   | 168/13/12.9
vexpm1    D   | 133/19/7.0    | 111/8/13.9    | 218/12/18.2   | 198/12/16.5
vlog      C   | 134/20/6.7    | 105/7.2/14.6  | 204/9.7/21.0  | 210/9.5/22.1

Vector MASS Performance - double precision functions (cycles per evaluation, length 1000 loop), cont.

Function  Rng | 604e          | pwr3          | pwr4+         | pwr5
vlog10    C   | 133/19/7.0    | 107/7.6/14.1  | 206/10/20.6   | 207/9.9/20.9
vlog1p    H   | 157/26/6.0    | 124/11/11.3   | 201/13/15.5   | 220/12/18.3
vpow      C   | 309/52/5.9    | 235/20/11.8   | 453/29/15.6   | 463/30/15.4
vqdrt     C   | 134/15/8.9    | 116/6/19.3    | 250/8.2/30.5  | 283/8/35.4
vrcbrt    D   | 309/20/15.5   | 236/8.8/26.8  | 354/11/32.2   | 362/11/32.9
vrec      D   | 32/10/3.2     | 9.2/6.4/1.4   | 29/12/2.4     | 29/11/2.6
vrqdrt    C   | 149/14/10.6   | 126/6/21.0    | 243/7.8/31.2  | 276/7/39.4
vrsqrt    C   | 86/16/5.4     | 70/8.8/8.0    | 120/18/6.7    | 156/17/9.2
vsin      B   | 52/11/4.7     | 37/4.8/7.7    | 54/7.2/7.5    | 66/6.9/9.6
vsin      D   | 78/27/2.9     | 61/12/5.1     | 77/20/3.9     | 82/19/4.3
vsinh     D   | 250/27/9.3    | 197/10/19.7   | 345/15/23.0   | 329/13/25.3
vsqrt     C   | 69/16/4.3     | 59/8.8/6.7    | 128/17/7.5    | 143/17/8.4
vtan      D   | 137/31/4.4    | 112/13/8.6    | 194/19/10.2   | 183/17/10.8
vtanh     F   | 256/35/7.3    | 199/13/15.3   | 357/19/18.8   | 333/18/18.5

ScaLapack
  • Scalable Linear Algebra PACKage, http://www.netlib.org/scalapack
  • Developing team from University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc.
  • Support in Commercial Packages
    • NAG Parallel Library
    • IBM PESSL
    • CRAY Scientific Library
    • VNI IMSL
    • Fujitsu, HP/Convex, Hitachi, NEC
  • Handles dense and banded matrices
BLAS, cont.
  • Use blocked operations
  • Optimized for a given memory hierarchy – e.g. good use of cache
  • Other libraries build on top of BLAS
  • On any given system, try to use a system-optimized version
LAPACK
  • Linear Algebra PACKage
    • System of linear equations
    • Eigenvalue problems
  • Built on top of BLAS
  • On SDSC systems, installed at /usr/local/apps/LAPACK
BLACS basics

nprow=1; npcol=3;

order='r'; zero=0;

blacs_get(zero,zero,icontxt);

blacs_gridinit(icontxt,order,nprow,npcol);

blacs_gridinfo(icontxt,nprow,npcol,myrow,mycol);

  • icontxt like MPI_COMM_WORLD
  • blacs_gridinit specifies topology
  • blacs_gridinfo gets processor information
  • Here we define a 1 x 3 grid of processes
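A minimal C sketch of the same grid setup, assuming an installation that provides the BLACS C interface (the Cblacs_* names):

#include <mpi.h>

/* BLACS C interface prototypes (often not supplied in a header) */
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);

int main(int argc, char **argv) {
    int ictxt, nprow = 1, npcol = 3, myrow, mycol;
    MPI_Init(&argc, &argv);
    Cblacs_get(0, 0, &ictxt);                    /* get the default system context */
    Cblacs_gridinit(&ictxt, "R", nprow, npcol);  /* 1 x 3 grid, row-major order */
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);
    /* ... PBLAS/ScaLAPACK calls using ictxt ... */
    Cblacs_gridexit(ictxt);
    MPI_Finalize();
    return 0;
}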
PBLAS
  • Parallel version of BLAS
  • Built on top of BLAS and BLACS
Block-cyclic partitioning (2D)

[Figure: a 9 x 9 global matrix is partitioned into 2 x 2 blocks, which are dealt out cyclically over a 2 x 3 PE grid, PE(0,0) through PE(1,2); each PE ends up with a contiguous local array holding its share of the blocks]
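To make the mapping concrete, a minimal sketch (illustrative only, not a ScaLAPACK routine) of which process-grid row owns a given global row under block-cyclic distribution:

/* 0-based global row index i, row block size mb, nprow process rows */
int owner_row(int i, int mb, int nprow) {
    return (i / mb) % nprow;
}

/* With mb = 2 and nprow = 2 as in the figure: rows 0-1 -> grid row 0,
   rows 2-3 -> grid row 1, rows 4-5 -> grid row 0, and so on.
   Columns map the same way using nb and npcol. */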
Syntax for descinit

descinit(idesc, m, n, mb, nb, i, j, icon, mla, ier)

In/Out  Arg    Description
OUT     idesc  Descriptor
IN      m      Row size (global)
IN      n      Column size (global)
IN      mb     Row block size
IN      nb     Column block size
IN      i      Starting grid row
IN      j      Starting grid column
IN      icon   BLACS context
IN      mla    Leading dimension of local matrix
OUT     ier    Error number

The starting grid location is usually (i=0, j=0).
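For example (illustrative values), a descriptor for the 9 x 9 global matrix with 2 x 2 blocks shown earlier, on BLACS context icontxt:

descinit(idesc, 9, 9, 2, 2, 0, 0, icontxt, mla, ier)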

ScaLapack API
  • Prepend LAPACK equivalent names with P

Routine names have the form PXYYZZZ:
  • P: parallel (ScaLAPACK) prefix
  • X: data type
  • YY: matrix type
  • ZZZ: computation performed
ScaLapack API, cont.
  • Matrix Types (YY) PXYYZZZ
  • DB General Band (diagonally dominant-like)
  • DT general Tridiagonal (Diagonally dominant-like)
  • GB General Band
  • GE GEneral matrices (e.g., unsymmetric, rectangular, etc.)
  • GG General matrices, Generalized problem
  • HE Complex Hermitian
  • OR Orthogonal Real
  • PB Positive definite Banded (symmetric or Hermitian)
  • PO Positive definite (symmetric or Hermitian)
  • PT Positive definite Tridiagonal (symmetric or Hermitian)
  • ST Symmetric Tridiagonal Real
  • SY SYmmetric
  • TR TRiangular (possibly quasi-triangular)
  • TZ TrapeZoidal
  • UN UNitary complex
ScaLapack API, cont.

Drivers (ZZZ) PXYYZZZ

  • SL: Linear Equations (SVX*)
  • SV: LU Solver
  • VD: Singular Value
  • EV: Eigenvalue (EVX**)
  • GVX: Generalized Eigenvalue

* Also estimates condition numbers, provides error bounds, and equilibrates the system.

** Computes selected eigenvalues.

ScaLapack function call example

Solving Ax=B for a general square matrix

A (N x N), B(N x NRHS)

CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

  • Global indices: IA = JA = IB = JB = 1
  • Ipiv: Integer array (N), on output containing permutation matrix (row i of the matrix was interchanged with row ipiv(i))
ScaLapack performance tips
  • Use native BLAS
  • Choose a suitable number of processors, so as not to make the local matrix size too large (1000x1000 is about right)
  • Try to use square processor grid (Nrow=Ncol)
  • Block size = 64
PETSc
  • Portable Extensible Toolkit for Scientific Computation
  • Intended to facilitate the scalable (parallel) solution of linear and non-linear partial differential equations
  • Focus on problems discretized using semi or fully implicit methods
  • Developed by Argonne National Laboratory, MCS
PETSc
  • Fully usable from Fortran, C, and C++
  • Uses MPI for all parallel communications
  • Object-oriented
  • Designed for both efficiency and ease of use
  • Available freely from http://www.mcs.anl.gov/petsc
  • Installed on Datastar at /usr/local/apps/petsc-2.3.1
PETSc features
  • Parallel Vectors and discrete functions
  • Distributed arrays (for regular grid problems)
  • Parallel vector scatter/gathers (for unstructured grid problems)
  • Parallel (sparse) matrices
  • Parallel Krylov subspace methods
  • Parallel preconditioners
  • Parallel (Newton-based) nonlinear solvers
  • Parallel time-stepping code
PETSc Numerical components
  • Nonlinear solvers: Newton-based methods (line search, trust region), other
  • Time steppers: Euler, backward Euler, pseudo time-stepping, other
  • Krylov subspace methods: GMRES, CG, CGS, Bi-CG-stab, TFQMR, Richardson, Chebyshev, other
  • Preconditioners: additive Schwarz, block Jacobi, ILU, ICC, LU (sequential only), other
  • Matrices: compressed sparse row (AIJ), blocked compressed sparse row (BAIJ), block diagonal, dense, other
  • Index sets: indices, block indices, stride, other
  • Vectors

A minimal usage sketch of these components follows.
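A minimal sketch of the typical PETSc solve pattern (create objects, set values, solve). The names follow recent PETSc releases; the PETSc 2.3 installation mentioned earlier differs in some signatures:

#include <petscksp.h>

int main(int argc, char **argv) {
    Mat A; Vec x, b; KSP ksp;
    PetscInt i, n = 10;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* parallel sparse matrix; here just a diagonal test matrix */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    for (i = 0; i < n; i++) {
        PetscScalar v = 2.0;
        MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* right-hand side b = 1; x holds the solution */
    VecCreate(PETSC_COMM_WORLD, &b);
    VecSetSizes(b, PETSC_DECIDE, n);
    VecSetFromOptions(b);
    VecDuplicate(b, &x);
    VecSet(b, 1.0);

    /* Krylov solver; method and preconditioner selectable at run time,
       e.g. -ksp_type gmres -pc_type bjacobi */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
}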

MPI use features
  • Communicators and attributes to handle managing messages inside PETSc objects
  • Nonblocking sends and receives to overlap communication with computation
  • Global reduction operations
  • Derived data types
NAG – Numerical Algorithms Group ®
  • Proprietary package, currently installed on Datastar in /usr/local/apps/nag (64-bit only)
    • Differential Equations
    • Linear Equations
    • Interpolation, Curve and Surface Fitting
    • Linear Algebra
    • Eigensystem Analysis
    • Statistical Analysis
    • Random Number Generator
    • Fourier Transforms, Summation of Series
    • Time series analysis
    • Mesh generation
ESSL - Engineering and Scientific Subroutine Library
  • IBM product – optimized for IBM platforms
  • Over 400 high performance mathematical functions
  • C/C++ and Fortran, 32 and 64 bit, thread-safe
    • API is Fortran-based. From C/C++ must use Fortran calling conventions, e.g. column-major array
  • Linking: -lessl
  • ESSL User Guide: http://publib.boulder.ibm.com/epubs/pdf/am501403.pdf
ESSL features
  • Numerical functions categories
    • Linear Algebra - vector-scalar, sparse vector-scalar, matrix-vector, sparse matrix-vector
    • Matrix Operations - addition, subtraction, multiplication, rank-k updates, transpose
    • Linear Algebra Equation solvers - dense, banded, sparse, linear least-squares
    • Eigensystem Analysis - standard, generalized
    • Signal Processing - Fourier transform, convolutions, correlations
    • Sorting & Searching
    • Interpolation - polynomial, cubic spline
    • Numerical Quadrature
    • Random Number Generation
    • Utilities
ESSL example – 1D real-to-complex Fast Fourier Transform (FFT)

#include <essl.h>     /* ESSL prototypes and the dcmplx type */
#include <stdlib.h>

#define M 64          /* number of transforms */
#define N 256         /* transform length */

int main() {
    double R[M][N], *work1, *work2;
    dcmplx C[M][N/2+1];
    /* work-area sizes: illustrative values; see the ESSL manual
       for the exact formulas, which depend on N and M */
    int nwork1 = 60000, nwork2 = 60000;

    Init(R, N, M);    /* user-supplied routine that fills the input */
    work1 = malloc(nwork1*sizeof(double));
    work2 = malloc(nwork2*sizeof(double));
    /* first call (init flag = 1) initializes the work area; the
       second call (init flag = 0) performs the M transforms of length N */
    drcft(1, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
    drcft(0, R, N, C, N/2+1, N, M, 1, 1.0, work1, nwork1, work2, nwork2);
}

PESSL
  • Invokes ESSL
  • Uses MPI for communication
  • PESSL Categories
    • Subset of Level 2, 3 PBLAS
    • Subset of ScaLAPACK (dense, banded)
    • Sparse Routines
    • Subset of ScaLAPACK Eigensystem Routines
    • Fourier Transforms
    • Uniform Random Number Generation
    • BLACS
  • Link with -lpessl


FFTW – Fastest Fourier Transform in the West
  • Third party software, free – http://www.fftw.org
  • Latest version installed on SDSC systems in /usr/local/apps/fftw312d and /usr/local/apps/fftw312s
  • 1,2, and 3 dimensional FFTs, and related functions (sin/cos transforms, convolution etc)
  • Using “plans”
    • Initialize transform parameters

fftw_plan plan1;

/* e.g., a 1D complex transform of length N from array A to array B */
plan1 = fftw_plan_dft_1d(N, A, B, FFTW_FORWARD, FFTW_ESTIMATE);

fftw_execute(plan1);

    • Work arrays allocated and initialized behind the scenes
  • FFTW_ESTIMATE, FFTW_MEASURE, FFTW_PATIENT planner flags trade planning time for transform speed
  • Pros: portable, free
  • Cons: performance about 3 times worse than ESSL’s FFTs!

3D FFT in parallel
  • FFTW version 3 does not have an MPI implementation
  • PESSL and other libraries use 1D processor decomposition (slabs/planes)
    • Limitation: Np <= N
    • Not enough for some state-of-the-art simulations, for example turbulence in a periodic domain at 4096³
  • 2D decomposition is crucial for scaling on Petascale platforms
2D decomposition
  • 2D decomposition: processors form a square grid P1 x P2
    • Columns (pencils) instead of slabs
    • Local geometry:
      • Stage 1: N x K x M
      • Stage 2: K x N x M
      • Stage 3: K x M x N
      • K = N/P1, M=N/P2
      • At each stage, transform along the local dimension (N)
  • Software scaling limit removed! Now the limit is P <= N²
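An illustrative example (hypothetical sizes): for N = 4096 on P = 8192 cores arranged as P1 = 64, P2 = 128, each task holds a pencil of size 4096 x 64 x 32 at stage 1 (K = N/P1 = 64, M = N/P2 = 32); a 1D decomposition would already be impossible here, since P > N.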
Performance of DNS (2D) on Blue Gene at SDSC and IBM’s T.J. Watson Lab, and on SDSC’s Datastar

VN: Two processors per node

CO: One processor per node

P3DFFT
  • Open source library for efficient, highly scalable 3D FFT on parallel platforms
  • Built on top of an optimized 1D FFT library
    • Currently ESSL or FFTW
    • In the future, more libraries
  • Uses 2D decomposition, which includes 1D as a special case.
  • Developed at SDSC
  • Outcome of a Strategic Applications Collaborations Project (DNS turbulence code)
  • Available at http://www.sdsc.edu/us/resources/p3dfft.php
P3DFFT: features
  • Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms
  • Written in Fortran 90 with MPI
  • Implemented as F90 module
  • Single or double precision
  • Arbitrary dimensions
    • Handles many uneven cases (Ni does not have to be divisible by Pj)
    • Assumes Ni >= Pj
  • Can do either in-place or out-of-place transforms
    • In the out-of-place case, the input and output arrays should not overlap
  • Memory requirements: input and output arrays + 1 buffer
P3DFFT: Storage
  • R2C transform input:
    • Global: (Nx,Ny,Nz) real array
    • Local: (Nx,Ny/P1,Nz/P2) real array
  • R2C output:
    • Global: (Nx/2,Ny,Nz) complex array
      • Conjugate symmetry: F(N-i) = F*(i)
      • F(N/2+1) through F(N-1) are redundant
      • F(0) and F(N/2) are real – pack them in one pseudo-complex pair
      • Total N/2 complex storage units
    • Local: (Nx/(2P1),Ny/P2,Nz) complex
  • C2R: input and output interchanged from R2C
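For illustration (hypothetical sizes): with Nx = Ny = Nz = 128 and P1 = 4, P2 = 8, the R2C input is a local (128, 32, 16) real array and the output a local (16, 16, 128) complex array.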
Building P3DFFT
  • Subdirectory build/
    • Edit makefile for your architecture
    • Specify -DSINGLE_PREC or -DDOUBLE_PREC
      • Default is single precision
    • Specify -DESSL or -DFFTW
    • Choose P3DFFT_ROOT
    • Type ‘make’ – will compile and install the library in P3DFFT_ROOT directory
  • On SDSC platforms, it is already installed in /usr/local/apps/p3dfft
  • Subdirectory sample/
    • Contains sample programs
  • For manual see README file in upper level directory
Using P3DFFT
  • ‘use p3dfft’
    • Defines mytype – 4 or 8 byte reals
  • Initialization: p3dfft_setup(proc_dims,nx,ny,nz)
    • proc_dims(2) – two integers, P1 and P2
    • For 1D decomposition, set P1 = 1
    • Sets up the square 2D grid of MPI communicators in rows and columns
    • Initializes local array dimensions
  • Local dimensions: get_dims(istart,iend,isize,Q)

integer istart(3),iend(3),isize(3)

      • Set Q=1 for R2C input, Q=2 for R2C output
Using P3DFFT (2)
  • Now allocate your arrays
  • When ready to call Fourier Transform:
    • p3dfft_ftran_r2c(IN,OUT,flg_inplace) – Forward transform
      • exp(-ikx/N)
    • p3dfft_btran_c2r(IN,OUT,flg_inplace) – Backward (inverse) transform
      • exp(ikx/N)
    • flg_inplace: boolean, true if doing an in-place transform
Future work
  • Include more array distribution schemes
  • Expand options for memory ordering
  • Include more transform types
    • Derivatives
    • Complex-to-complex
  • C interface
  • Convenience functions
  • Break down in stages
  • Expand choice of 1D FFT libraries
  • Questions? Comments?

[email protected]

Acknowledgements
  • Amit Majumdar, SDSC
  • ScaLapack development team http://www.netlib.org/scalapack
  • ACTS tools presentation from NERSC http://acts.nersc.gov
Lab assignment
  • Copy tar file ~dmitry/scalapack_examples.tar
  • Untar
  • Type ‘make’
  • Test the example programs on several processor counts - do they scale?
  • Modify source code to vary:
    • matrix size
    • block sizes
  • Rewrite example1.f in C or C++