
GPU Acceleration of Finite Element Computations


Presentation Transcript


  1. GPU Acceleration of Finite Element Computations Graham Markall MSc Student, Software Performance Optimisation Group Imperial College London http://www.doc.ic.ac.uk/~grm08 Joint work with Paul Kelly, Tristan Perryman, Francis Russell, Anton Lokhmotov, Tony Field June 2009

  2. The Test Problem – From FLUIDITY [4]
  • FLUIDITY: Fortran unstructured-mesh finite element code (Applied Modelling & Computation Group, Dept. of Earth Science & Engineering, IC)
  • Solves ∆u = f on the unit square
  • Analytical solution known – allows us to examine the accuracy of the computed solution
  • Uses the PETSc [1] conjugate gradient solver
  • Pipeline: Read Input → Evaluate Shape Functions → Assemble System Matrix → Solve System → Analyse Solution → Output Solution
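
  The analytical solution itself is not preserved in this transcript. Purely as an illustration of the manufactured-solution idea (not the function used in the talk): taking u(x, y) = sin(πx)·sin(πy) on the unit square gives f = ∆u = −2π²·sin(πx)·sin(πy), with u = 0 on the boundary, so the computed nodal values can be compared directly against u to measure the discretisation error.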

  3. This Work
  • Tristan Perryman authored the original GPU Conjugate Gradient (CG) solver, using Compressed Sparse Row (CSR) format
  • CG is a simple iterative method; its kernels can be reused later for other solvers
  • My contribution, the result of part of an MSc Individual Study Option (ISO) in DoC:
  • Addition of a Jacobi preconditioner
  • Analysis of numerical accuracy and performance
  • Outline: background & test problem; solver structure & example kernel; numerical and performance results; exploration of SpMV optimisations; resources for SpMV & solvers

  4. Solver Structure – Solving Ax = b
  FORTRAN:
    ! ... assemble system
    call gpucg_solve(...)
  C / CUDA: gpucg_solve(...)
    copy A and b to GPU
    precondition matrix
    while (!stop)
      1 iteration
      copy δ from GPU
      eval. stop cond.
    endwhile
    copy x back from GPU
  • Kernels: SpMV, axpy, etc.
  • cuBLAS kernels are dense only – no SpMV
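
  As a structural sketch only (not the original gpucg_solve source, which is linked from slide 15), the host-side driver implied by this slide might look like the following; csr_matrix and the device pointer names are hypothetical, and the per-iteration kernel launches are left as comments.

  #include <cuda_runtime.h>

  /* Hypothetical CSR container -- the real solver's data layout may differ. */
  typedef struct {
      int n, nnz;
      int *row_ptr, *col_idx;
      double *val;
  } csr_matrix;

  void gpucg_solve(const csr_matrix *A, const double *b, double *x,
                   double tol, int max_iter)
  {
      int *d_row_ptr, *d_col_idx;
      double *d_val, *d_b, *d_x, *d_delta;

      /* Copy A (CSR arrays) and b to the GPU. */
      cudaMalloc((void**)&d_row_ptr, (A->n + 1) * sizeof(int));
      cudaMalloc((void**)&d_col_idx, A->nnz * sizeof(int));
      cudaMalloc((void**)&d_val,     A->nnz * sizeof(double));
      cudaMalloc((void**)&d_b,       A->n * sizeof(double));
      cudaMalloc((void**)&d_x,       A->n * sizeof(double));
      cudaMalloc((void**)&d_delta,   sizeof(double));
      cudaMemcpy(d_row_ptr, A->row_ptr, (A->n + 1) * sizeof(int), cudaMemcpyHostToDevice);
      cudaMemcpy(d_col_idx, A->col_idx, A->nnz * sizeof(int),     cudaMemcpyHostToDevice);
      cudaMemcpy(d_val,     A->val,     A->nnz * sizeof(double),  cudaMemcpyHostToDevice);
      cudaMemcpy(d_b,       b,          A->n * sizeof(double),    cudaMemcpyHostToDevice);
      cudaMemset(d_x, 0, A->n * sizeof(double));   /* start from the zero vector */

      /* Precondition the matrix (e.g. Jacobi) -- kernel launch(es) elided. */

      double delta = 1.0;
      for (int it = 0; it < max_iter && delta > tol * tol; ++it) {
          /* One CG iteration: SpMV, dot-product and axpy kernel launches, e.g.
             csr_spmv<<<grid, block>>>(A->n, d_p, d_q);
             The kernels leave the residual norm in d_delta.                  */

          /* Copy δ from the GPU and evaluate the stopping condition on the host. */
          cudaMemcpy(&delta, d_delta, sizeof(double), cudaMemcpyDeviceToHost);
      }

      /* Copy x back from the GPU. */
      cudaMemcpy(x, d_x, A->n * sizeof(double), cudaMemcpyDeviceToHost);

      cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_val);
      cudaFree(d_b); cudaFree(d_x); cudaFree(d_delta);
  }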

  5. The SpMV Kernel – 90% of solver time

  __global__ void csr_spmv(int n, double* src, double* dest)
  {
      int elem;
      for (elem = THREAD_ID; elem < n; elem += THREAD_COUNT) {
          dest[elem] = 0;
          int a = row_ptr[elem];
          int b = row_ptr[elem+1];
          for (int k = a; k < b; k++)
              dest[elem] += fetch_double(tex_val, k-1)
                          * src[tex1Dfetch(tex_col_idx, k-1) - 1];
      }
  }

  • 1 thread : several rows
  • Matrix (row_ptr, col_idx, val) stored in texture memory
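
  The fetch_double helper is not shown in the deck. Because double-precision values cannot be fetched from a texture directly on this hardware, a common idiom (used, for example, in NVIDIA's SpMV code [5]) is to bind the value array as int2 and reassemble the double from its two halves; a sketch, assuming tex_val is declared that way:

  texture<int2, 1, cudaReadModeElementType> tex_val;   /* bound to the CSR val array */

  static __inline__ __device__ double fetch_double(texture<int2, 1, cudaReadModeElementType> t, int i)
  {
      int2 v = tex1Dfetch(t, i);            /* the two 32-bit halves of the double */
      return __hiloint2double(v.y, v.x);    /* reassemble high and low words       */
  }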

  6. Testing Numerical Accuracy
  • Test setup: PETSc 2.3.3, NVIDIA GTX 280, using double precision
  • (The accuracy results themselves are not preserved in this transcript)

  7. Performance Results – Solution Time
  • Test setup: Intel Core 2 Duo E8400 @ 3 GHz, 6 MB cache, 2 GB RAM; NVIDIA GTX 280; PETSc 2.3.3, CUDA 2.1
  • ~3 GFLOP/s achieved – hardware maximum: 78 GFLOP/s
  • (The timing results themselves are not preserved in this transcript)
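
  The gap between ~3 GFLOP/s and the 78 GFLOP/s double-precision peak is largely explained by SpMV being memory-bandwidth bound (a back-of-envelope estimate, not a figure from the talk): each nonzero contributes 2 flops but requires at least 12 bytes of matrix data (an 8-byte value plus a 4-byte column index), so even at the GTX 280's ~141.7 GB/s peak memory bandwidth the kernel cannot exceed roughly (2 / 12) × 141.7 ≈ 24 GFLOP/s, and the irregular access to src pushes the achievable rate well below that.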

  8.–12. SpMV Performance Analysis
  (An animated sequence in the original deck: the csr_spmv kernel from slide 5 is repeated on each frame, and the accompanying memory-access diagram is summarised here from its surviving labels.)
  • The diagram shows the CSR arrays row_ptr, col_idx and val alongside the src and dest vectors
  • Memory is delivered to a half-warp of 16 threads through 64-byte windows
  • Successive frames number elements 0–11 to trace which entries of col_idx, val and src each thread reads as it walks its rows: adjacent threads work on different rows, so at any given step their reads fall in different parts of val and col_idx and do not coalesce within a single 64-byte window

  13. SpMV Optimisations [2]
  • Blocking: store the matrix as small dense blocks (e.g. a 2×2-blocked matrix, or one 4×4 block) instead of a flat vector of matrix values
  • Cache optimisations: only 16 KB of shared memory
  • Reordering: increase spatial/temporal locality
  • Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, pages 1–12 of SC '07 [2]
  • Concurrent Number Cruncher [3] SpMV performance: 2×2 blocks 300% faster than CSR; 4×4 blocks 50% faster than 2×2
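
  To make the blocking idea concrete, here is a hypothetical sketch (not code from the solver or from [3]) of an SpMV over a 2×2 block-CSR matrix, one thread per block-row; each block contributes four multiply-adds against two consecutive entries of src, so values and column indices are fetched once per block rather than once per nonzero.

  /* Hypothetical 2x2 block-CSR SpMV. Names are illustrative. Assumes the matrix
     dimension is even and each block is stored row-major: [v00 v01 v10 v11]. */
  __global__ void bcsr2x2_spmv(int nblockrows,
                               const int    *blk_row_ptr,   /* nblockrows+1 entries */
                               const int    *blk_col_idx,   /* block column indices */
                               const double *blk_val,       /* 4 values per block   */
                               const double *src, double *dest)
  {
      for (int br = blockIdx.x * blockDim.x + threadIdx.x;
           br < nblockrows;
           br += gridDim.x * blockDim.x)
      {
          double y0 = 0.0, y1 = 0.0;                 /* the block-row's two output rows */
          for (int b = blk_row_ptr[br]; b < blk_row_ptr[br + 1]; ++b) {
              int c = 2 * blk_col_idx[b];            /* first source column of this block */
              double x0 = src[c], x1 = src[c + 1];
              const double *v = blk_val + 4 * b;
              y0 += v[0] * x0 + v[1] * x1;
              y1 += v[2] * x0 + v[3] * x1;
          }
          dest[2 * br]     = y0;
          dest[2 * br + 1] = y1;
      }
  }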

  14. Exploration of Optimisations
  • Three matrices dumped from the test problem (shown as figures in the original slide)
  • Cache & reordering optimisations unexplored, due to time constraints within the ISO

  15. Resources
  • Solver source code: http://www.doc.ic.ac.uk/~grm08/
  • Concurrent Number Cruncher [3]: http://alice.loria.fr/index.php/publications.html?redirect=1&Paper=NumberCruncher@2007
  • NVIDIA SpMV library [5]: http://www.nvidia.com/object/nvidia_research_pub_001.html
  • Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, pages 1–12 of SC '07, ACM, 2007 [2]
  • Baskaran & Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Technical Report RC24704, 2008 [6]
  • Monakov & Avetisyan, Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs, pages 288–296 of LNCS 5657 (SAMOS '09), 2009 [7]

  16. References
  [1] Balay, S., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., & Zhang, H. 2006 (Sept.). PETSc Users Manual. Tech. Rep. ANL-95/11 – Revision 2.3.2, Argonne National Laboratory. See http://www.mcs.anl.gov/petsc
  [2] Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., & Demmel, J. 2007. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Pages 1–12 of: SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM.
  [3] Buatois, L., Caumon, G., & Levy, B. 2007. Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU. In: High Performance Computation Conference (HPCC), Springer Lecture Notes in Computer Science.
  [4] Gorman, G., Piggot, M., & Farrell, P. About FLUIDITY. http://amcg.ese.ic.ac.uk/index.php?title=FLUIDITY – retrieved 28 June 2009.
  [5] Bell, N., & Garland, M. 2008. Efficient Sparse Matrix-Vector Multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, December 2008.
  [6] Baskaran, M. M., & Bordawekar, R. 2008. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704.
  [7] Monakov, A., & Avetisyan, A. 2009. Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs. Pages 288–296 of: LNCS 5657, Proceedings of SAMOS '09. Springer-Verlag Berlin Heidelberg.

  17. Spare slides

  18. Conjugate Gradient & PCG Source: Shewchuk, J.R. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. School of Computer Science, Carnegie Mellon University, 1994.
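
  As context for the Jacobi preconditioner mentioned on slide 3, a minimal sketch (not the original implementation) of the preconditioner-apply step z = M⁻¹r used inside PCG, with M taken as the diagonal of A; it amounts to one extra element-wise kernel launch per iteration:

  __global__ void jacobi_apply(int n, const double *diag,   /* diagonal of A      */
                               const double *r, double *z)  /* residual, output z */
  {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < n;
           i += gridDim.x * blockDim.x)
          z[i] = r[i] / diag[i];    /* z = D^{-1} r, one element per thread visit */
  }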

  19. (Blank slide)
