1 / 32

Generating GPU-Accelerated Code From a High-level Domain-specific Language

Generating GPU-Accelerated Code From a High-level Domain-specific Language. Graham Markall Software Performance Optimisation Group Imperial College London http://www.doc.ic.ac.uk/~grm08 Joint work with David Ham and Paul Kelly. October 2009. TexPoint fonts used in EMF.

cordero
Download Presentation

Generating GPU-Accelerated Code From a High-level Domain-specific Language

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Generating GPU-Accelerated Code From a High-level Domain-specific Language Graham Markall Software Performance Optimisation Group Imperial College London http://www.doc.ic.ac.uk/~grm08 Joint work with David Ham and Paul Kelly October 2009 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAA

  2. Problem & Proposal • How do we exploit multicore architectures to improve the performance of the Assembly Phase? • Writing code for new architectures is time-consuming and error-prone • Provide hardware-independent abstraction for the specification of finite element methods. • Future proofing of code • Easier development • Background: • Conjugate Gradient GPU Solver 10x faster than one CPU core • Solvers are generic – someone else will solve this problem

  3. This Study • We present a pilot study into using the Unified Form Language to generate CUDA code. • Why bother with part 1? • To prove we can speed up assembly using GPUs • To provide a guide for the output we expect from the compiler • To experiment with different performance optimisations • Part 1: • Nvidia Tesla Architecture & CUDA • Test Problems • Translation Methodology • Performance Optimisations • Performance Results • Part 2: • UFL • UFL Compiler Design • Test Results • Discussion

  4. NVIDIA Tesla Architecture & CUDA • GT200 Architecture: • 1-4GiB RAM • For high performance: • Coalescing: • Use many threads (10000+) • Caches: • Texture cache (read-only) • Shared memory Data transfer Data transfer 64B window 16 threads (half-warp)

  5. The Test Problems • Test_laplacian: Solves ∆u = f on unit square • Analytical solution: Allows us to examine the accuracy of the computed solution • Test_advection_diffusion: • Advection-Diffusion is more representative: • Time-dependent, nonlinear • multiple assemble/solve

  6. From Fortran to CUDA (Test_laplacian) • Assembly Loop in Fortran: • Assembly Loop in CUDA: 1 Element doele=1,num_ele call assemble(ele,A,b) end do callpetsc_solve(x,A,b) call gpu_assemble() call gpucg_solve(x) All Elements

  7. From Fortran to CUDA (Adv.-Diff.) • Original: • Assemble • Solve • Output of solve input to next Assemble • CUDA: • Avoid transferring the solution at every iteration • Upload initial conditions • Iterate • Transfer solution when required

  8. Performance Optimisations x1 y1 x2 y2 x3 y3 • CoalescingMaximise memorybandwidth 1 2 3 n-2 n-1 n 1 2 3 x1 y1 x2 y2 x3 y3 (x1,y1) n-2 n-1 n (x3,y3) (x2,y2) ... ...

  9. Performance Optimisations x1 y1 x2 y2 x3 y3 • CoalescingMaximise memorybandwidth • Specialisation of Kernels (reduced register usage) 1 2 3 n-2 n-1 n 1 2 3 x1 y1 x2 y2 x3 y3 (x1,y1) n-2 n-1 n (x3,y3) (x2,y2) for(int x=0; x<nodes; x++) { for(int y=0; y<nodes; y++) { ...; } } for(int x=0; x<3; x++) { for(int y=0; y<3; y++) { ...; } } ... ...

  10. Performance Optimisations x1 y1 x2 y2 x3 y3 • CoalescingMaximise memorybandwidth • Specialisation of Kernels (reduced register usage) • Texture Memory for matrix sparsity 1 2 3 n-2 n-1 n 1 2 3 x1 y1 x2 y2 x3 y3 (x1,y1) n-2 n-1 n (x3,y3) (x2,y2) for(int x=0; x<nodes; x++) { for(int y=0; y<nodes; y++) { ...; } } for(int x=0; x<3; x++) { for(int y=0; y<3; y++) { ...; } } row_ptr col_idx ... ... val

  11. Performance Results • For Advection-Diffusion problem. Test setup: • Nvidia280GTX – 1GB RAM (use Tesla C1060 for 4GB) • Intel Core 2 Duo E8400 @ 3.00GHz • 2GB RAM in host machine • Intel C++ and Fortran Compilers V10.1 • V11.0 suffers from bugs and cannot compile Fluidity • CPU Implementations compiled with –O3 flags • CUDA Implementation compiled using NVCC 2.2 • Run problem for 200 timesteps • Increasingly fine meshes • Increasing element count • Five runs of each problem • Averages reported • Double Precision computations

  12. Advection Diffusion Assembly Time

  13. Speedup in the Assembly Phase

  14. Overall Speedup (Assemble & Solve)

  15. Proportion of GPU Time in each Kernel • Which kernels should we focus on optimising? • Addto kernels: 84% of execution time

  16. The Impact of Atomic Operations Colouring Optimisation on GPUs: [1] D. Komatitsch, D.Michea and G. Erlebacher. Porting a high-order finite-element earthquake modelling application to Nvidia graphics cards using CUDA. J. Par. Dist. Comp., 69(5):451-460, 2009

  17. Summary of Part 1 • 8x Speedup over 2 CPU Cores for assembly • 6x Speedup overall • Further performance gains from: • Colouring Elements & Non-atomic ops [1] • Alternative matrix storage formats • Fusing kernels [2], Mesh partitioning [3] Fusing kernels: [2] J. Filipovic, I. Peterlik and J. Fousek. GPU Acceleration of Equations Assembly in Finite Elements Method – Preliminary Results. In SAAHPC: Symposium on Application Accelerators in HPC, July 2009. Mesh partitioning: [3] A. Klockner, T. Warburton, J. Bridge and J. S. Hesthaven. Nodal Discontinuous Galerkin methods on graphics processors. Journal of Computational Physics, in press, 2009.

  18. Part 2: A UFL [4] Example (Laplacian) • Solving: • Weak form: • (Ignoring boundaries) • Close to mathematical notation • No implementation details • Allows code generation for multiple backends and choice of optimisations to be explored. Psi = state.scalar_fields(“psi”) v=TestFunction(Psi) u=TrialFunction(Psi) f=Function(Psi, “sin(x[0])+cos(x[1])”) A=dot(grad(v),grad(u))*dx RHS=v*f*dx Solve(Psi,A,RHS) [4] M. Alnaes and Anders Logg. Unified Form Language Specification and User’s Manual.http://www.fenics.org/pub/documents/ufl/ufl-user-manual/ufl-user-manual.pdf Retrieved 15 Sep 2009.

  19. From UFL to CUDA • We “parse” UFL using the ufl.algorithmspackage • Leading to the creation of a DAG representing the assembly: • Similar for RHS Intermediate representation: Frontend Backend UFL CUDA IR

  20. Testing Frontend: psi = state.scalar_fields("psi") v = TestFunction(P) u = TrialFunction(P) f = Function(P) f.name="shape_rhs" A = dot(grad(v),grad(u))*dx solve(P, A, f) Backend (example): stringList *params = new stringList(); (*params).push_back(string("val")); (*params).push_back(string("size_val")); (*params).push_back(string("ele_psi")); (*params).push_back(string("lmat")); (*params).push_back(string("n")); launchList.push_back( kernelLaunch("matrix_addto",params)); Hand Translation: Generated Code:

  21. Testing - continued • Helmholtz equation: • Weak form: • A=(dot(grad(v), grad(u))+(20)*dot(v,u))*dx • Add extra calls to shape_shape and matrix_addto • FEniCSDolfin solution: Generated code solution:

  22. Conclusions • We obtain speedups of 8x over 2 core CPU in the assembly phase using CUDA • An overall speedup of 6x over 2 cores • Generation of CUDA code from UFL source is feasible • UFL is “future proof” • UFL is easier to use than CUDA, Fortran etc. • Allows automated exploration of optimisations • Other backends(Cell, multicore CPU, Larrabee etc.) should be possible

  23. Further work • On the UFL Compiler: • Support for a more complete subset of UFL • Development of a more expressive intermediate representation • Facilitates the development of other backends • Generation of kernels from IR • Automatic tuning • On the Conjugate Gradient Solver: • Integration with blocked SpMV implementation [5] • Expect: further performance improvements Blocked SpMV: [5] A. Monakov and A. Avetisyan. Implementing Blocked Sparse Matrix-Vector Multiplication on Nvidia GPUs. In SAMOS IX: International Symposium on Systems, Architectures, Modeling and Simulation,July 2009.

  24. Spare Slides

  25. Test Advection Diffusion UFL Advection: T=state.scalar_fields(Tracer) U=state.vector_fields(Velocity) UNew=state.vector_fields(NewVelocity) # We are solving for the Tracer, T. t=Function(T) p=TrialFunction(T) q=TestFunction(T) #The value of the advecting velocity U is known. u=Function(U) unew=Function(UNew) #Mass matrix. M=p*q*dx #Solve for T1-T4. rhs=dt*dot(grad(q),u)*t*dx t1=solve(M,rhs) rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t1)*dx t2=solve(M,rhs) rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t2)*dx t3=solve(M,rhs) #Solve for T at the next time step. rhs=action(M,t) + 1.0/6.0*t1 + 1.0/3.0*t2 + 1.0/3.0*t3 + 1.0/6.0*t4 t=solve(M,t) Diffusion: mu=state.tensor_fields(TracerDiffusivity) i,j=indices(2) M=p*q*dx d=-grad(q)[i]*mu[i,j]*grad(p)[j]*dx A=m-0.5*d rhs=action(M+0.5*d,t) T=solve(A,rhs)

  26. Memory Bandwidth Utilisation Orange: Using Atomic operations Blue: Using non-atomic operations

  27. Proportion of GPU Time in each Kernel Orange: Using Atomic operations Blue: Using non-atomic operations

  28. Assembly Throughput

  29. Code Generation • List of variables, kernels and parameters passed to backend. • Using the ROSE Compiler Infrastructure [6]. • CUDA Keywords (__global__, <<<...>>> notation) inserted as arbitrary strings. AST Initialisation cudaMalloc() cudaBindTexture() cudaMemcpy() Assembly kernel<<<.>>>() gpu_assemble.cu Finalisation cudaFree() cudaUnbindTexture() Streaming cudaMemcpy() Declarations Int, double, ...

  30. NVIDIA Tesla Architecture & CUDA • GT200 Architecture • 10 TPCs • 8 Banks of DRAM: 1-4GiB • y = αx + yin C: • CUDA Kernel: void daxpy(double a, double* x, double* y, int n) { for (int i=0; i<n; i++) y[i] = y[i] + a*x[i]; } __global__ void daxpy(double a, double* x, double* y, int n) { for (int i=T_ID; i<n; i+=T_COUNT) y[i] = y[i] + a*x[i]; }

  31. Variable naming • How do we ensure the output of a kernel is correctly input to successive kernels? Consistently invent names. Output: dshape_psi Input: dshape_psi Output: lmat_psi_psi Input: lmat_psi_psi Psi = state.scalar_fields(“psi”) v=TestFunction(Psi) u=TrialFunction(Psi) f=Function(Psi, “sin(x[0])+cos(x[1])”) A=dot(grad(v),grad(u))*dx RHS=v*f*dx Solve(Psi,A,RHS)

  32. Memory Bandwidth Utilisation

More Related