
Scaling of a Pseudospectral DNS Turbulence Code and a Parallel 3D FFT Library


Presentation Transcript


  1. Scaling of a Pseudospectral DNS Turbulence Code and a Parallel 3D FFT Library Dmitry Pekurovsky dmitry@sdsc.edu San Diego Supercomputer Center

  2. Examples of turbulence are common • Disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales

  3. DNS examples of special interest • Study effects of Reynolds and Schmidt numbers on local isotropy and intermittency • Study mixing of passive contaminants and active chemical species over a range of molecular diffusivities, which requires simulations at high Reynolds and Schmidt numbers • Study dispersion in the Lagrangian frame • Capture effects of anisotropy • Include Coriolis forces (rotating frame)

  4. Powerful computers and efficient algorithms are required for DNS • Capturing the physics at small scales requires simulation on grids of the largest possible size • Largest simulations in the U.S.: 2048³ • Worldwide: Earth Simulator, 4096³ (Kaneda et al. 2002) • Desired in the near future: 4096³ and beyond, with more complex physics • NSF turbulence model problem: 12288³ • Memory requirements grow as N³, the number of grid points • Computational work grows even faster than the number of grid points, since the number of time steps must also grow to ensure stability
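To make those growth rates concrete, here is a back-of-the-envelope estimate; the scaling argument is the standard one for pseudospectral DNS and is not a figure from the slides:

% Memory: one field on an N^3 grid
M \sim N^3
% Work per time step is dominated by the 3D FFTs:
W_{\mathrm{step}} \sim N^3 \log N
% CFL-type stability forces \Delta t \propto 1/N, so the step count
% grows like N and the total work is
W_{\mathrm{total}} \sim N^4 \log N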

  5. DNS code P.K. Yeung, D. Donzis (Georgia Tech) • Code written in Fortran 90 with MPI • Time evolution: 2nd-order Runge-Kutta • Spatial derivative calculation: pseudospectral method • Typically, FFTs are done in all 3 dimensions • Consider the 3D FFT as a compute-intensive kernel representative of the performance characteristics of the full code (the 3D FFT is ~70% of the computation and ~95% of the communication) • Input is real, output is complex; or vice versa • SDSC SAC (Strategic Applications Collaboration) project • Focus on scalability

  6. Typical parallel 3D FFT today • Grid: N x N x N – how to do a parallel FFT? • Transpose strategy, as opposed to direct strategy • That is, make sure all data in the direction of the 1D transform reside in one processor’s memory • Parallelize over the remaining dimensions • 1D decomposition • Each CPU has several planes (slabs) • Local geometry: N x N x L, where L = N/P • Transform in the 2 dimensions local to the planes first • Use an established library for the 1D FFT, e.g. ESSL or FFTW • Transpose data to localize the third dimension (see the sketch after this slide) • Local geometry: N x L x N • Complete the transform by operating along the third dimension • Almost everyone has been using this approach. Examples: • PESSL • NAS Parallel Benchmarks
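A minimal sketch of the slab-transpose step, assuming P divides N and using MPI_Alltoall; the array names, packing order, and placeholder comments are illustrative, not taken from the DNS code:

program slab_transpose
  use mpi
  implicit none
  integer, parameter :: N = 64          ! global grid size (illustrative)
  integer :: ierr, P, rank, L, q, i, j, k, blk, pos
  real(8), allocatable :: slab(:,:,:)   ! local slab: N x N x L (z-slabs)
  real(8), allocatable :: pencil(:,:,:) ! after transpose: N x L x N
  real(8), allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  L = N / P                             ! assumes P divides N
  allocate(slab(N,N,L), pencil(N,L,N))
  allocate(sendbuf(N*N*L), recvbuf(N*N*L))
  slab = real(rank, 8)                  ! stand-in for 2D-transformed planes

  ! Pack: block q holds the y-range destined for rank q
  blk = N * L * L
  do q = 0, P - 1
     pos = q * blk
     do k = 1, L
        do j = 1, L
           do i = 1, N
              pos = pos + 1
              sendbuf(pos) = slab(i, q*L + j, k)
           end do
        end do
     end do
  end do

  call MPI_Alltoall(sendbuf, blk, MPI_DOUBLE_PRECISION, &
                    recvbuf, blk, MPI_DOUBLE_PRECISION, &
                    MPI_COMM_WORLD, ierr)

  ! Unpack: the block from rank q carries global z-planes q*L+1 .. (q+1)*L
  do q = 0, P - 1
     pos = q * blk
     do k = 1, L
        do j = 1, L
           do i = 1, N
              pos = pos + 1
              pencil(i, j, q*L + k) = recvbuf(pos)
           end do
        end do
     end do
  end do
  ! ... now do the 1D FFTs along the third (fully local) dimension ...
  call MPI_Finalize(ierr)
end program slab_transpose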

  7. 1D decomposition scaling • Communication: 1 call to MPI_Alltoall or an equivalent all-to-all exchange • A demanding operation: performance depends on the interconnect’s bisection bandwidth. Definition: • Cut the network into two halves and count the links cut • This parameter depends on the bandwidth of each link and on the network topology (e.g. fat-tree for Datastar, 3D torus for Blue Gene) • Can scale OK (depending on the interconnect), but only up to a point: P <= N • For a 4096³ grid, no more than 4096 processors • => At petascale, a new approach is needed

  8. One possible solution • Hybrid MPI/OpenMP approach, using shared-memory multiprocessor platforms and/or multi-core processors • Each MPI task spawns Q threads • Do MPI communication over P = N tasks, with Q threads working on one plane (see the sketch below) • Often does not work well due to issues such as false sharing and thread-management overhead • Still limited to N x Q cores
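A toy illustration of the threaded part of that idea; fft_1d here is a hypothetical stand-in for an ESSL/FFTW call (it just scales the data so the sketch runs on its own):

program hybrid_sketch
  use omp_lib
  implicit none
  integer, parameter :: N = 256
  real(8) :: plane(N, N)
  integer :: j
  plane = 1.0d0
  ! Q threads (set via OMP_NUM_THREADS) split the rows of one plane
  !$omp parallel do schedule(static)
  do j = 1, N
     call fft_1d(plane(:, j))
  end do
  !$omp end parallel do
contains
  ! Hypothetical stand-in for a 1D FFT call (ESSL/FFTW in the real code)
  subroutine fft_1d(x)
    real(8), intent(inout) :: x(:)
    x = 2.0d0 * x
  end subroutine fft_1d
end program hybrid_sketch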

  9. 2D decomposition [figure: pencil decomposition of the grid, with the x, y, and z axes labeled]

  10. 2D decomposition • 2D decomposition: processors form a 2D grid P1 x P2 • Columns (pencils) instead of slabs • Local geometry: • Stage 1: N x K x M • Stage 2: K x N x M • Stage 3: K x M x N • K = N/P1, M = N/P2 • At each stage, transform along the local dimension (N) • Software scaling limit removed! Now the limit is P <= N² • One way to organize the stages is sketched below
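The two transposes can be organized with row and column sub-communicators over the P1 x P2 grid. A minimal sketch; the row-major rank layout and the grid-shape choice are illustrative assumptions:

program grid_comms
  use mpi
  implicit none
  integer :: ierr, rank, P, P1, P2, row, col
  integer :: row_comm, col_comm
  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  P1 = int(sqrt(real(P)))           ! illustrative grid shape
  do while (mod(P, P1) /= 0)        ! make sure P1 divides P
     P1 = P1 - 1
  end do
  P2 = P / P1
  row = rank / P2                   ! row-major rank layout (assumption)
  col = mod(rank, P2)
  call MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm, ierr)
  call MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm, ierr)
  ! One transpose is an all-to-all within row_comm (P2 tasks each),
  ! the other within col_comm (P1 tasks each)
  call MPI_Finalize(ierr)
end program grid_comms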

  11. Performance of DNS (2D) on Blue Gene at SDSC and IBM’s T.J. Watson Lab, and on SDSC’s Datastar • VN: two processors per node (virtual node mode) • CO: one processor per node (coprocessor mode)

  12. DNS (2D) scaling on BG/W

  13. 2D versus 1D performance

  14. Communication and performance: 2D vs. 1D • 1D: 1 all-to-all exchange over P tasks • 2D: 2 all-to-all exchanges in groups: • P2 groups of P1 tasks each • P1 groups of P2 tasks each • Which is better? • Message size decreases as the exchange group size increases (see the estimate below) • Some links are unused in each of the 2 transposes, and the data volume is the same as in the 1D case • 1D decomposition ends up winning • But again: 1D is only good up to P = N
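A quick way to quantify the message-size point; the notation follows the slides, with each task holding N³/P points:

% An all-to-all over a group of G tasks, each holding N^3/P points,
% sends one message of size
m(G) = \frac{N^3}{P \, G}
% to each of the G - 1 partners.
% 1D: G = P           =>  m = N^3 / P^2  (the smallest messages)
% 2D: G = P_1 or P_2  =>  larger messages, but two exchanges instead of one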

  15. Choosing P1 and P2 • P1 x P2 = P • If P <= N, choose P1 = 1 (1D decomposition is better) • If P > N, in most cases it’s best to choose P1 = P/N • Better communication pattern • In some cases, better cache use • BG observation: a sweet spot occurs when P1 = Px of the torus, due to use of the wraparound links • How do I know Px on my machine? • Ask your consultant • The heuristic is sketched in code below
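The slide's heuristic in a few lines of Fortran; the example values of N and P are arbitrary:

program choose_grid
  implicit none
  integer :: N, P, P1, P2
  N = 4096                 ! grid size (example)
  P = 16384                ! processor count (example)
  if (P <= N) then
     P1 = 1                ! 1D decomposition is better
  else
     P1 = P / N            ! the slide's rule of thumb for P > N
  end if
  P2 = P / P1
  print *, 'P1 =', P1, '  P2 =', P2   ! here: P1 = 4, P2 = 4096
end program choose_grid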

  16. Choosing P1 and P2, cont’d

  17. P3DFFT • Open-source library for efficient, highly scalable 3D FFT on parallel platforms • Built on top of an optimized 1D FFT library • Currently ESSL or FFTW • In the future, more libraries • Uses 2D decomposition, with an option for 1D • Available at http://www.sdsc.edu/us/resources/p3dfft.php

  18. P3DFFT: more features • Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms • Written in Fortran 90 with MPI • Implemented as an F90 module • Single or double precision • Arbitrary dimensions • Handles many uneven cases (Pj does not have to evenly divide Ni) • Assumes Ni >= Pj • Can do either an in-place or an out-of-place transform • For out-of-place, the input and output arrays should not overlap • Memory requirements: input and output arrays + 1 buffer

  19. P3DFFT: Storage • R2C transform input: • Global: (Nx,Ny,Nz) real array • Local: (Nx,Ny/P1,Nz/P2) real array • R2C output: • Global: (Nx/2,Ny,Nz) complex array • Conjugate symmetry: F(N-i) = F*(i) • F(N/2+1) through F(N-1) are redundant • F(0) and F(N/2) are real – pack them in one pseudo-complex pair • Total N/2 complex storage units • Local: (Nx/(2P1),Ny/P2,Nz) complex • C2R: input and output interchanged from R2C
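The storage bookkeeping on this slide, written out; this just restates the slide's packing of F(0) and F(N/2):

% Conjugate symmetry of a length-N real-to-complex transform:
F(N-i) = F^{*}(i), \qquad i = 1, \dots, N-1
% so only F(0), ..., F(N/2) are independent. F(0) and F(N/2) are
% purely real and can share one complex slot:
\big(F(0),\, F(N/2)\big) \;\mapsto\; F(0) + \mathrm{i}\, F(N/2)
% leaving N/2 complex storage units along the transformed dimension.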

  20. Building P3DFFT • Subdirectory build/ • Edit the makefile for your architecture • Specify -DSINGLE_PREC or -DDOUBLE_PREC • Default is single precision • Specify -DESSL or -DFFTW • Choose P3DFFT_ROOT • Type ‘make’ – this compiles and installs the library in the P3DFFT_ROOT directory • On SDSC platforms, it is already installed in /usr/local/apps/p3dfft • Subdirectory sample/ • Contains sample programs • For the manual, see the README file in the top-level directory

  21. Using P3DFFT • ‘use p3dfft’ • Defines mytype – 4- or 8-byte reals • Initialization: p3dfft_setup(proc_dims,nx,ny,nz) • proc_dims(2) – two integers, P1 and P2 • For 1D decomposition, set P1 = 1 • Sets up the P1 x P2 grid of MPI communicators in rows and columns • Initializes local array dimensions • Local dimensions: get_dims(istart,iend,isize,Q) integer istart(3),iend(3),isize(3) • Set Q=1 for R2C input, Q=2 for R2C output • A minimal initialization sketch follows
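A minimal initialization sketch using only the calls named on this slide; the grid size and the choice of a 1D decomposition are illustrative:

program p3dfft_init_demo
  use p3dfft
  use mpi
  implicit none
  integer, parameter :: nx = 128, ny = 128, nz = 128
  integer :: ierr, nproc, proc_dims(2)
  integer :: istart(3), iend(3), isize(3)
  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  proc_dims(1) = 1               ! P1 = 1: 1D decomposition
  proc_dims(2) = nproc           ! P2 = P
  call p3dfft_setup(proc_dims, nx, ny, nz)
  call get_dims(istart, iend, isize, 1)   ! Q=1: R2C input layout
  ! istart/iend/isize now give this task's portion of the real array
  call MPI_Finalize(ierr)
end program p3dfft_init_demo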

  22. Using P3DFFT (2) • Now allocate your arrays • When ready to call the Fourier transform: • p3dfft_ftran_r2c(IN,OUT,flg_inplace) – forward transform, kernel exp(-2πikx/N) • p3dfft_btran_c2r(IN,OUT,flg_inplace) – backward (inverse) transform, kernel exp(+2πikx/N) • flg_inplace: boolean, true if doing an in-place transform • A complete round-trip sketch follows
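Putting slides 21 and 22 together, a minimal self-contained round trip; the normalization at the end follows the lab-assignment recipe on slide 25, and the uniform initial field is just a placeholder:

program p3dfft_roundtrip
  use p3dfft
  use mpi
  implicit none
  integer, parameter :: nx = 128, ny = 128, nz = 128
  integer :: ierr, nproc, proc_dims(2)
  integer :: istart(3), iend(3), isize(3)
  integer :: fstart(3), fend(3), fsize(3)
  real(mytype),    allocatable :: u(:,:,:)
  complex(mytype), allocatable :: uhat(:,:,:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  proc_dims = (/ 1, nproc /)                ! 1D decomposition for simplicity
  call p3dfft_setup(proc_dims, nx, ny, nz)
  call get_dims(istart, iend, isize, 1)     ! Q=1: real-space layout
  call get_dims(fstart, fend, fsize, 2)     ! Q=2: Fourier-space layout
  allocate(u(isize(1), isize(2), isize(3)))
  allocate(uhat(fsize(1), fsize(2), fsize(3)))
  u = 1.0_mytype                            ! placeholder initial field
  call p3dfft_ftran_r2c(u, uhat, .false.)   ! forward transform
  call p3dfft_btran_c2r(uhat, u, .false.)   ! backward transform
  u = u / (real(nx, mytype) * real(ny, mytype) * real(nz, mytype))
  ! u should now match the original field to round-off
  call MPI_Finalize(ierr)
end program p3dfft_roundtrip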

  23. Implementation and performance • Each processor does a (large) number of 1D FFTs • Memory access stride is a crucial parameter • FFT in X, stride 1: 28% of peak on Datastar • Transpose in X & Y • FFT in Y, stride Nx/P1: 25% of peak on Datastar • Transpose in Y & Z • FFT in Z, stride (Nx Ny)/P: 10% of peak on Datastar • The alternative is to rearrange the data in memory so that the stride is 1; this involves an extra pass through memory • ESSL is about 3 times faster than FFTW

  24. Future work • Include more array distribution schemes • Expand options for memory ordering • Include more transform types • Derivatives • Complex-to-complex • C interface • Convenience functions • Break down in stages • Expand choice of 1D FFT libraries • Questions? Comments? dmitry@sdsc.edu

  25. Lab assignments • Write a program that would: • initialize P3DFFT • allocate input and output 3D arrays (choose a global size that is not too big: 128³ or 256³ is reasonable) • initialize the arrays to your favorite functional form, such as a sine wave • do a forward transform • compare the data with what you expect • do a backward transform • after multiplying by a normalization factor, compare the results with the original array • Time the computational part • Run at a few grid sizes and processor counts, and study performance • Do the same for an in-place transform (output overwrites input) • If you get lost, look in the sample/ directory. But try not to look!
