
Stencil Framework for Portable High Performance Computing


Presentation Transcript


  1. Stencil Framework for Portable High Performance Computing Naoya Maruyama RIKEN Advanced Institute for Computational Science April 2013 @ NCSA, IL, USA

  2. Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development

  3. Multi-GPU Application Development • Non-unified programming models • MPI for inter-node parallelism • CUDA/OpenCL/OpenACC for accelerators • Optimization • Blocking • Overlapped computation and communication → Good performance, but low programmer productivity

  4. Goal • High-performance, highly productive programming for heterogeneous clusters Approach • High-level abstractions for structured parallel programming • Simplified programming models • Portability across platforms • Without sacrificing too much performance

  5. Physis (Φύσις) Framework [SC’11] • Stencil DSL • Declarative • Portable • Global-view • C-based • DSL compiler • Target-specific code generation and optimizations • Automatic parallelization • Generated backends: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL

  void diffusion(int x, int y, int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
      + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
      + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0);
  }

  6. DSL Overview • C + custom data types and intrinsics • Grid data types • PSGrid3DFloat, PSGrid3DDouble, etc. • Dense Cartesian domain types • PSDomain1D, PSDomain2D, and PSDomain3D • Intrinsics • Runtime management • Grid object management (PSGrid3DFloatNew, etc.) • Grid accesses (PSGridCopyin, PSGridGet, etc.) • Applying stencils to grids (PSStencilMap, PSStencilRun) • Grid reductions (PSGridReduce) • A usage sketch of these intrinsics follows below
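
  The sketch below assembles these intrinsics into a minimal host driver for the diffusion kernel. It is illustrative only: PSInit, PSFinalize, PSGridCopyout, PSGridFree, and the PSGridReduce/PS_SUM form are assumed runtime-management names that the slides do not show; only the identifiers listed above are confirmed.

  /* Minimal sketch of a Physis host driver. PSInit, PSFinalize, PSGridCopyout,
     PSGridFree, and the PSGridReduce/PS_SUM form are assumptions for
     illustration; the other intrinsics are the ones named on this slide. */
  void run_diffusion(int argc, char **argv, float *host_buf,
                     int NX, int NY, int NZ) {
    PSInit(&argc, &argv, NX, NY, NZ);                /* assumed runtime setup */
    PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ); /* grid object management */
    PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
    PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
    PSGridCopyin(g1, host_buf);                      /* host-to-grid transfer */
    /* Apply the diffusion stencil, ping-ponging g1 and g2 for 100 iterations */
    PSStencilRun(PSStencilMap(diffusion, d, g1, g2),
                 PSStencilMap(diffusion, d, g2, g1), 100);
    float sum;
    PSGridReduce(&sum, PS_SUM, g1);                  /* assumed reduction form */
    PSGridCopyout(g1, host_buf);                     /* assumed grid-to-host transfer */
    PSGridFree(g1);                                  /* assumed cleanup */
    PSGridFree(g2);
    PSFinalize();
  }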

  7. Writing Stencils • Stencil kernel: a C function describing a single flow of scalar execution on one grid element • Executed over specified rectangular domains • Offsets must be constant • PSGridEmit issues a write to grid g2 • Periodic access is possible with PSGridGetPeriodic (see the sketch below)

  void diffusion(const int x, const int y, const int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
      + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
      + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0*t);  /* issues a write to grid g2 */
  }
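
  As a brief illustration of the periodic access mentioned above, the sketch below replaces PSGridGet with PSGridGetPeriodic in the X dimension; it assumes PSGridGetPeriodic takes the same (grid, x, y, z) arguments as PSGridGet, which the slide implies but does not show.

  /* Sketch: periodic boundary access in X, assuming PSGridGetPeriodic has the
     same argument list as PSGridGet. */
  void diffusion_periodic_x(const int x, const int y, const int z,
                            PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGetPeriodic(g1,x-1,y,z)   /* wraps around at the X edges */
      + PSGridGetPeriodic(g1,x+1,y,z)
      + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
      + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0*t);
  }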

  8. Applying Stencils to Grids • Map: creates a stencil closure that encapsulates a stencil and its grids • Run: iteratively executes stencil closures • Grouping by PSStencilRun is the target for the kernel-fusion optimization

  PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
  PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
  PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
  PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
               PSStencilMap(diffusion, d, g2, g1, 0.5),
               10);

  9. Implementation • DSL translator • Translates intrinsic calls into runtime API calls • Generates GPU kernels with boundary exchanges based on static analysis • Uses the ROSE compiler framework (LLNL) • Runtime • Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory [Figure: Physis code → DSL translator → implementation source code → executable code]

  10. CUDA Thread Blocking • Each thread sweeps points in the Z dimension • The X and Y dimensions are blocked with A×B thread blocks, where A and B are user-configurable parameters (64×4 by default) • A hand-written sketch of this pattern follows below
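
  The hand-written CUDA sketch below shows the same blocking pattern in isolation, ahead of the generated code on the next slide. The flat row-major indexing and the nx/ny/nz parameters are illustrative assumptions, not Physis' internal data layout.

  /* Sketch of the thread blocking described above: a 2D grid of 64x4 thread
     blocks covers the X-Y plane, and each thread marches along Z. */
  __global__ void sweep_z(const float *in, float *out, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    for (int z = 0; z < nz; ++z) {
      size_t idx = (size_t)x + (size_t)y * nx + (size_t)z * nx * ny;
      out[idx] = in[idx];   /* a real stencil would read neighbor points here */
    }
  }

  /* Host-side launch with the default 64x4 blocking:
     dim3 block(64, 4, 1);
     dim3 grid((nx + 63) / 64, (ny + 3) / 4, 1);
     sweep_z<<<grid, block>>>(d_in, d_out, nx, ny, nz); */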

  11. Example: 7-point Stencil GPU Code

  __device__ void kernel(const int x, const int y, const int z,
                         __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
    float v = *__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
      + *__PSGridGetAddrFloat3D_0_fw(g,x+1,y,z)
      + *__PSGridGetAddrFloat3D_0_bw(g,x-1,y,z)
      + *__PSGridGetAddrFloat3D_1_fw(g,x,y+1,z)
      + *__PSGridGetAddrFloat3D_1_bw(g,x,y-1,z)
      + *__PSGridGetAddrFloat3D_2_bw(g,x,y,z-1)
      + *__PSGridGetAddrFloat3D_2_fw(g,x,y,z+1);
    *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
  }

  __global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                        __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x + offset0;
    int y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
    if (x < dom.local_min[0] || x >= dom.local_max[0] ||
        y < dom.local_min[1] || y >= dom.local_max[1])
      return;
    for (int z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
      kernel(x, y, z, &g, &g2);
    }
  }

  12. Example: 7-point Stencil CPU Code

  static void __PSStencilRun_0(int iter, void **stencils) {
    struct dim3 block_dim(64, 4, 1);
    struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
    cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
    struct dim3 s0_grid_dim((int)ceil(__PSGetLocalSize(0) / (double)64),
                            (int)ceil(__PSGetLocalSize(1) / (double)4), 1);
    __PSDomainSetLocalSize(&s0->dom);
    s0->g = __PSGetGridByID(s0->__g_index);
    s0->g2 = __PSGetGridByID(s0->__g2_index);
    int i;
    for (i = 0; i < iter; ++i) {
      {
        int fw_width[3] = {1L, 1L, 1L};
        int bw_width[3] = {1L, 1L, 1L};
        __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
      }
      __PSStencilRun_kernel<<<s0_grid_dim, block_dim>>>(
          __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
          *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g)),
          *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2)));
    }
    cudaThreadSynchronize();
  }

  13. Optimization: Overlapped Computation and Communication 1. Copy boundaries from GPU to CPU (for non-unit-stride cases) 2. Compute interior points 3. Exchange boundaries with neighbors 4. Compute boundaries [Figure: timeline showing the boundary transfers and exchanges (steps 1 and 3) overlapping the interior computation (step 2)] A sketch of this pipeline follows below.
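
  The sketch below outlines this four-step pipeline for one Z-boundary plane using CUDA streams and MPI. Every name here (exchange_halo_z, the buffers, the offsets, the commented-out kernels) is a hypothetical illustration; the actual generated code appears on the next slide.

  /* Illustrative sketch of overlapping interior computation with halo
     communication for one boundary plane. All names are hypothetical. */
  #include <mpi.h>
  #include <cuda_runtime.h>

  void exchange_halo_z(float *d_grid, float *h_send, float *h_recv,
                       size_t send_off, size_t recv_off, size_t plane_elems,
                       int up, int down,
                       cudaStream_t interior_stream, cudaStream_t boundary_stream) {
    size_t bytes = plane_elems * sizeof(float);
    MPI_Request reqs[2];
    /* 1. Copy the outgoing boundary plane from GPU to CPU
          (needed for non-unit-stride boundaries). */
    cudaMemcpyAsync(h_send, d_grid + send_off, bytes,
                    cudaMemcpyDeviceToHost, boundary_stream);
    /* 2. Launch the interior-point kernel on its own stream so it overlaps
          with the transfers, e.g.
          interior_kernel<<<grid_i, block_i, 0, interior_stream>>>(...);  (hypothetical) */
    cudaStreamSynchronize(boundary_stream);
    /* 3. Exchange boundary planes with the neighboring ranks. */
    MPI_Irecv(h_recv, (int)plane_elems, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(h_send, (int)plane_elems, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    cudaMemcpyAsync(d_grid + recv_off, h_recv, bytes,
                    cudaMemcpyHostToDevice, boundary_stream);
    /* 4. Compute the boundary points once the halo has arrived, e.g.
          boundary_kernel<<<grid_b, block_b, 0, boundary_stream>>>(...);  (hypothetical) */
    cudaDeviceSynchronize();
  }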

  14. Optimization Example: 7-Point Stencil CPU Code

  for (i = 0; i < iter; ++i) {
    /* Computing interior points on a dedicated stream */
    __PSStencilRun_kernel_interior<<<s0_grid_dim, block_dim, 0, stream_interior>>>(
        __PSGetLocalOffset(0), __PSGetLocalOffset(1), __PSDomainShrink(&s0->dom, 1),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g)),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2)));
    /* Boundary exchange */
    int fw_width[3] = {1L, 1L, 1L};
    int bw_width[3] = {1L, 1L, 1L};
    __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    /* Computing boundary planes concurrently, one stream per plane */
    __PSStencilRun_kernel_boundary_1_bw<<<1, dim3(1,128,4), 0, stream_boundary_kernel[0]>>>(
        __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g)),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2)));
    __PSStencilRun_kernel_boundary_1_bw<<<1, dim3(1,128,4), 0, stream_boundary_kernel[1]>>>(
        __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g)),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2)));
    …
    __PSStencilRun_kernel_boundary_2_fw<<<1, dim3(128,1,4), 0, stream_boundary_kernel[11]>>>(
        __PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g)),
        *((__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2)));
    cudaThreadSynchronize();
  }
  cudaThreadSynchronize();

  15. Local Optimization • Register blocking • Reuse loaded grid elements with registers

  Original:
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[i][j][k] + f[i][j][k-1] + f[i][j][k+1]);
  }

  Optimized:
  double kc = f[i][j][0];
  double kn = f[i][j][1];
  for (int k = 1; k < n-1; ++k) {
    double kp = kc;
    kc = kn;
    kn = f[i][j][k+1];
    g[i][j][k] = a*(kc + kp + kn);
  }

  16. Local Optimization • Common subexpression elimination in offset computation • Eliminates intra- and inter-iteration common subexpressions

  Original:
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[i+j*n+k*n*n] + f[i+j*n+(k-1)*n*n] + f[i+j*n+(k+1)*n*n]);
  }

  Optimized:
  int idx = i+j*n+n*n;
  for (int k = 1; k < n-1; ++k) {
    g[i][j][k] = a*(f[idx] + f[idx-n*n] + f[idx+n*n]);
    idx += n*n;
  }

  17. Evaluation • Performance and productivity • Sample code • 7-point diffusion kernel (#stencils: 1) • Jacobi kernel from the Himeno benchmark (#stencils: 1) • Seismic simulation (#stencils: 15) • Platform • Tsubame 2.0 • Node: 2× Westmere-EP 2.9 GHz + 3× Tesla M2050 • Dual InfiniBand QDR with a full-bisection-bandwidth fat tree

  18. Productivity • Physis code is similar in size to the sequential C code

  19. Optimization Effect 7-point diffusion stencil on 1 GPU (Tesla M2050)

  20. Diffusion Weak Scaling

  21. Seismic Weak Scaling Problem size: 256x256x256 per GPU

  22. Diffusion Strong Scaling Problem size: 512x512x4096

  23. Himeno Strong Scaling Problem size XL (1024x1024x512)

  24. Ongoing Work • Auto-tuning • Preliminary auto-tuning for the CUDA backend is available • Supporting different accelerators • Supporting more complex problems • Stencils with limited data dependencies • Hierarchically organized problems • Work unit: dense structured grids • Overall problem domain: sparsely connected work units • Examples • NICAM: an icosahedral-grid climate model • UPACS: a fluid simulation code for engineering problems

  25. Further Information • Code is available at http://github.com/naoyam/physis • Maruyama et al., “Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers,” SC’11, 2011.

  26. Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development

  27. Map Reduce for K • In-memory, MPI-based MapReduce for the K computer • Implemented as a C library • Provides (a subset of) Hadoop-like programming interfaces • Strong focus on scalable processing of large data sets on K • Fast data loading via the Tofu network • Also supports standard MPI clusters • Application examples • Metagenome sequence analysis • Replica-exchange MD • Runs hundreds of NAMD instances as Map tasks • An illustrative sketch of the Hadoop-like C interface follows below
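
  To give a flavor of what a Hadoop-like map/reduce interface looks like in C, here is a purely hypothetical word-count sketch: the callback shapes and the emit mechanism are invented for illustration and are not the library's actual API.

  /* Hypothetical sketch of Hadoop-style map and reduce callbacks in C.
     These signatures are invented for illustration only; they are not the
     actual API of the library described above. */
  #include <string.h>

  typedef struct { const char *key; long value; } kv_t;

  /* Map callback: emit one (word, 1) pair per word in the input record. */
  static void count_words_map(const char *record, void (*emit)(kv_t)) {
    char buf[1024];
    strncpy(buf, record, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
      kv_t kv = { tok, 1 };
      emit(kv);
    }
  }

  /* Reduce callback: sum the counts gathered for one key across all ranks. */
  static void count_words_reduce(const char *key, const long *values, int n,
                                 void (*emit)(kv_t)) {
    long sum = 0;
    for (int i = 0; i < n; ++i) sum += values[i];
    kv_t kv = { key, sum };
    emit(kv);
  }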

  28. Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development

  29. HPC Mini Applications • A set of mini-apps derived from national key applications • Source code will be publicly released around Q1-Q2 ’14 • Expected final number of apps: < 20 • Part of the national pre-exascale projects • Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)

  30. Mini-App Methodology • Requested application submissions from current users of Japanese HPC systems • Documentation • Mathematical background • Target capability and capacity • Input data sets • Validation methods • Application code • Full-scale applications • Capable of running complete end-to-end simulations • 17 submissions so far • Stripped-down applications • Simplified applications containing only the essential parts of the code • 6 submissions so far

  31. Mini-App Methodology • Deriving “mini” applications from the submitted applications • Understanding performance-critical patterns • Computation, memory access, file I/O, network I/O • Reducing codebase size • Removing code not used for the target problems • Applying standard software engineering practices (e.g., DRY) • Refactoring into reference implementations • Performance modeling • In collaboration with the ASPEN project (Jeff Vetter, ORNL) • (Optional) Versions optimized for specific architectures

  32. Mini-App Example: Molecular Dynamics • Kernels: pairwise force calculation (see the sketch below) + long-range updates • Two alternative algorithms that solve equivalent problems • FFT-based Particle Mesh Ewald • Bottlenecked by all-to-all communication at scale • Fast Multipole Method • Tree-based problem formulation with no all-to-all communication • Simplified problem settings • Only simulates water molecules in the NVE setting • Can reduce the codebase significantly • Easier to create input data sets of different scales • Two reference implementations to study the performance implications of algorithmic differences • MARBLE (20K SLOC) • MODYLAS (16K SLOC)
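
  As a rough illustration of the pairwise-force kernel mentioned above (not code from MARBLE or MODYLAS), the sketch below shows a direct O(N²) Lennard-Jones force loop; the flat xyz data layout and the epsilon/sigma parameters are assumptions for illustration only.

  /* Illustrative O(N^2) pairwise Lennard-Jones force loop. NOT code from
     MARBLE or MODYLAS; the data layout and parameters are assumptions. */
  void pairwise_lj_forces(int n, const double *pos, double *force,
                          double epsilon, double sigma) {
    for (int i = 0; i < 3 * n; ++i) force[i] = 0.0;
    for (int i = 0; i < n; ++i) {
      for (int j = i + 1; j < n; ++j) {
        double dx = pos[3*i]   - pos[3*j];
        double dy = pos[3*i+1] - pos[3*j+1];
        double dz = pos[3*i+2] - pos[3*j+2];
        double r2  = dx*dx + dy*dy + dz*dz;
        double sr2 = sigma * sigma / r2;
        double sr6 = sr2 * sr2 * sr2;
        /* force magnitude over distance, from dU/dr of 4*eps*((s/r)^12-(s/r)^6) */
        double f = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2;
        force[3*i]   += f * dx;  force[3*j]   -= f * dx;
        force[3*i+1] += f * dy;  force[3*j+1] -= f * dy;
        force[3*i+2] += f * dz;  force[3*j+2] -= f * dz;
      }
    }
  }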
