
GPU Acceleration of Finite Element Computations


Presentation Transcript


  1. GPU Acceleration of Finite Element Computations Graham Markall MSc Student, Software Performance Optimisation Group Imperial College London http://www.doc.ic.ac.uk/~grm08 Joint work with Paul Kelly, Tristan Perryman, Francis Russell, Anton Lokhmotov, Tony Field June 2009

  2. The Test Problem – From FLUIDITY [4]
  • FLUIDITY: Fortran unstructured-mesh finite element code (Applied Modelling & Computation Group, Dept. of Earth Science & Engineering, IC)
  • Solves ∆u = f on the unit square
  • Analytical solution known – allows us to examine the accuracy of the computed solution
  • Uses the PETSc [1] conjugate gradient solver
  • Pipeline: Read Input → Evaluate Shape Functions → Assemble System Matrix → Solve System → Analyse Solution → Output Solution
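
  The analytical solution itself is not preserved in this transcript. Purely as an illustration of the manufactured-solution idea (not the function used in the talk): taking u(x, y) = sin(πx)·sin(πy) on the unit square gives f = ∆u = −2π²·sin(πx)·sin(πy), with u = 0 on the boundary, so the computed nodal values can be compared directly against u to measure the discretisation error.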

  3. This Work
  • Tristan Perryman authored the original GPU Conjugate Gradient (CG) solver, using Compressed Sparse Row (CSR) format
  • CG is a simple iterative method; its kernels can be reused later for other solvers
  • My contribution, the result of part of an MSc Individual Study Option (ISO) in DoC:
  • Addition of a Jacobi preconditioner
  • Analysis of numerical accuracy and performance
  • Outline: background & test problem; solver structure & example kernel; numerical and performance results; exploration of SpMV optimisations; resources for SpMV & solvers

  4. Solver Structure – Solving Ax = b
  FORTRAN:
    ! ... assemble system
    call gpucg_solve(...)
  C / CUDA: gpucg_solve(...)
    copy A and b to GPU
    precondition matrix
    while (!stop)
      1 iteration
      copy δ from GPU
      eval. stop cond.
    endwhile
    copy x back from GPU
  • Kernels: SpMV, axpy, etc.
  • cuBLAS kernels are dense only – no SpMV
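
  As a structural sketch only (not the original gpucg_solve source, which is linked from slide 15), the host-side driver implied by this slide might look like the following; csr_matrix and the device pointer names are hypothetical, and the per-iteration kernel launches are left as comments.

  #include <cuda_runtime.h>

  /* Hypothetical CSR container -- the real solver's data layout may differ. */
  typedef struct {
      int n, nnz;
      int *row_ptr, *col_idx;
      double *val;
  } csr_matrix;

  void gpucg_solve(const csr_matrix *A, const double *b, double *x,
                   double tol, int max_iter)
  {
      int *d_row_ptr, *d_col_idx;
      double *d_val, *d_b, *d_x, *d_delta;

      /* Copy A (CSR arrays) and b to the GPU. */
      cudaMalloc((void**)&d_row_ptr, (A->n + 1) * sizeof(int));
      cudaMalloc((void**)&d_col_idx, A->nnz * sizeof(int));
      cudaMalloc((void**)&d_val,     A->nnz * sizeof(double));
      cudaMalloc((void**)&d_b,       A->n * sizeof(double));
      cudaMalloc((void**)&d_x,       A->n * sizeof(double));
      cudaMalloc((void**)&d_delta,   sizeof(double));
      cudaMemcpy(d_row_ptr, A->row_ptr, (A->n + 1) * sizeof(int), cudaMemcpyHostToDevice);
      cudaMemcpy(d_col_idx, A->col_idx, A->nnz * sizeof(int),     cudaMemcpyHostToDevice);
      cudaMemcpy(d_val,     A->val,     A->nnz * sizeof(double),  cudaMemcpyHostToDevice);
      cudaMemcpy(d_b,       b,          A->n * sizeof(double),    cudaMemcpyHostToDevice);
      cudaMemset(d_x, 0, A->n * sizeof(double));   /* start from the zero vector */

      /* Precondition the matrix (e.g. Jacobi) -- kernel launch(es) elided. */

      double delta = 1.0;
      for (int it = 0; it < max_iter && delta > tol * tol; ++it) {
          /* One CG iteration: SpMV, dot-product and axpy kernel launches, e.g.
             csr_spmv<<<grid, block>>>(A->n, d_p, d_q);
             The kernels leave the residual norm in d_delta.                  */

          /* Copy δ from the GPU and evaluate the stopping condition on the host. */
          cudaMemcpy(&delta, d_delta, sizeof(double), cudaMemcpyDeviceToHost);
      }

      /* Copy x back from the GPU. */
      cudaMemcpy(x, d_x, A->n * sizeof(double), cudaMemcpyDeviceToHost);

      cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_val);
      cudaFree(d_b); cudaFree(d_x); cudaFree(d_delta);
  }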

  5. The SpMV Kernel – 90% of solver time

  __global__ void csr_spmv(int n, double* src, double* dest)
  {
      int elem;
      for (elem = THREAD_ID; elem < n; elem += THREAD_COUNT) {
          dest[elem] = 0;
          int a = row_ptr[elem];
          int b = row_ptr[elem+1];
          for (int k = a; k < b; k++)
              dest[elem] += fetch_double(tex_val, k-1)
                          * src[tex1Dfetch(tex_col_idx, k-1) - 1];
      }
  }

  • 1 thread : several rows
  • Matrix (row_ptr, col_idx, val) stored in texture memory
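
  The fetch_double helper is not shown in the deck. Because double-precision values cannot be fetched from a texture directly on this hardware, a common idiom (used, for example, in NVIDIA's SpMV code [5]) is to bind the value array as int2 and reassemble the double from its two halves; a sketch, assuming tex_val is declared that way:

  texture<int2, 1, cudaReadModeElementType> tex_val;   /* bound to the CSR val array */

  static __inline__ __device__ double fetch_double(texture<int2, 1, cudaReadModeElementType> t, int i)
  {
      int2 v = tex1Dfetch(t, i);            /* the two 32-bit halves of the double */
      return __hiloint2double(v.y, v.x);    /* reassemble high and low words       */
  }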

  6. Testing Numerical Accuracy
  • Test setup: PETSc 2.3.3, NVIDIA GTX 280, using double precision
  • (The accuracy results themselves are not preserved in this transcript)

  7. Performance Results – Solution Time
  • Test setup: Intel Core 2 Duo E8400 @ 3 GHz, 6 MB cache, 2 GB RAM; NVIDIA GTX 280; PETSc 2.3.3, CUDA 2.1
  • ~3 GFLOP/s achieved – hardware maximum: 78 GFLOP/s
  • (The timing results themselves are not preserved in this transcript)
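
  The gap between ~3 GFLOP/s and the 78 GFLOP/s double-precision peak is largely explained by SpMV being memory-bandwidth bound (a back-of-envelope estimate, not a figure from the talk): each nonzero contributes 2 flops but requires at least 12 bytes of matrix data (an 8-byte value plus a 4-byte column index), so even at the GTX 280's ~141.7 GB/s peak memory bandwidth the kernel cannot exceed roughly (2 / 12) × 141.7 ≈ 24 GFLOP/s, and the irregular access to src pushes the achievable rate well below that.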

  8.–12. SpMV Performance Analysis
  (An animated sequence in the original deck: the csr_spmv kernel from slide 5 is repeated on each frame, and the accompanying memory-access diagram is summarised here from its surviving labels.)
  • The diagram shows the CSR arrays row_ptr, col_idx and val alongside the src and dest vectors
  • Memory is delivered to a half-warp of 16 threads through 64-byte windows
  • Successive frames number elements 0–11 to trace which entries of col_idx, val and src each thread reads as it walks its rows: adjacent threads work on different rows, so at any given step their reads fall in different parts of val and col_idx and do not coalesce within a single 64-byte window

  13. SpMV Optimisations [2]
  • Blocking: store the matrix as small dense blocks (e.g. a 2×2-blocked matrix, or one 4×4 block) instead of a flat vector of matrix values
  • Cache optimisations: only 16 KB of shared memory
  • Reordering: increase spatial/temporal locality
  • Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, pages 1–12 of SC '07 [2]
  • Concurrent Number Cruncher [3] SpMV performance: 2×2 blocks 300% faster than CSR; 4×4 blocks 50% faster than 2×2
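
  To make the blocking idea concrete, here is a hypothetical sketch (not code from the solver or from [3]) of an SpMV over a 2×2 block-CSR matrix, one thread per block-row; each block contributes four multiply-adds against two consecutive entries of src, so values and column indices are fetched once per block rather than once per nonzero.

  /* Hypothetical 2x2 block-CSR SpMV. Names are illustrative. Assumes the matrix
     dimension is even and each block is stored row-major: [v00 v01 v10 v11]. */
  __global__ void bcsr2x2_spmv(int nblockrows,
                               const int    *blk_row_ptr,   /* nblockrows+1 entries */
                               const int    *blk_col_idx,   /* block column indices */
                               const double *blk_val,       /* 4 values per block   */
                               const double *src, double *dest)
  {
      for (int br = blockIdx.x * blockDim.x + threadIdx.x;
           br < nblockrows;
           br += gridDim.x * blockDim.x)
      {
          double y0 = 0.0, y1 = 0.0;                 /* the block-row's two output rows */
          for (int b = blk_row_ptr[br]; b < blk_row_ptr[br + 1]; ++b) {
              int c = 2 * blk_col_idx[b];            /* first source column of this block */
              double x0 = src[c], x1 = src[c + 1];
              const double *v = blk_val + 4 * b;
              y0 += v[0] * x0 + v[1] * x1;
              y1 += v[2] * x0 + v[3] * x1;
          }
          dest[2 * br]     = y0;
          dest[2 * br + 1] = y1;
      }
  }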

  14. Exploration of Optimisations
  • Three matrices dumped from the test problem (shown as figures in the original slide)
  • Cache & reordering optimisations unexplored, due to time constraints within the ISO

  15. Resources
  • Solver source code: http://www.doc.ic.ac.uk/~grm08/
  • Concurrent Number Cruncher [3]: http://alice.loria.fr/index.php/publications.html?redirect=1&Paper=NumberCruncher@2007
  • NVIDIA SpMV library [5]: http://www.nvidia.com/object/nvidia_research_pub_001.html
  • Williams et al., Optimization of sparse matrix-vector multiplication on emerging multicore platforms, pages 1–12 of SC '07, ACM, 2007 [2]
  • Baskaran & Bordawekar, Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Technical Report RC24704, 2008 [6]
  • Monakov & Avetisyan, Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs, pages 288–296 of LNCS 5657 (SAMOS '09), 2009 [7]

  16. References
  [1] Balay, S., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., & Zhang, H. 2006 (Sept.). PETSc Users Manual. Tech. Rep. ANL-95/11 – Revision 2.3.2, Argonne National Laboratory. See http://www.mcs.anl.gov/petsc
  [2] Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., & Demmel, J. 2007. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Pages 1–12 of: SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM.
  [3] Buatois, L., Caumon, G., & Levy, B. 2007. Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU. In: High Performance Computation Conference (HPCC), Springer Lecture Notes in Computer Science.
  [4] Gorman, G., Piggot, M., & Farrell, P. About FLUIDITY. http://amcg.ese.ic.ac.uk/index.php?title=FLUIDITY – retrieved 28 June 2009.
  [5] Bell, N., & Garland, M. 2008. Efficient Sparse Matrix-Vector Multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, December 2008.
  [6] Baskaran, M. M., & Bordawekar, R. 2008. Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Technical Report RC24704.
  [7] Monakov, A., & Avetisyan, A. 2009. Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs. Pages 288–296 of: LNCS 5657, Proceedings of SAMOS '09. Springer-Verlag Berlin Heidelberg.

  17. Spare slides

  18. Conjugate Gradient & PCG Source: Shewchuk, J.R. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. School of Computer Science, Carnegie Mellon University, 1994.
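
  As context for the Jacobi preconditioner mentioned on slide 3, a minimal sketch (not the original implementation) of the preconditioner-apply step z = M⁻¹r used inside PCG, with M taken as the diagonal of A; it amounts to one extra element-wise kernel launch per iteration:

  __global__ void jacobi_apply(int n, const double *diag,   /* diagonal of A      */
                               const double *r, double *z)  /* residual, output z */
  {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < n;
           i += gridDim.x * blockDim.x)
          z[i] = r[i] / diag[i];    /* z = D^{-1} r, one element per thread visit */
  }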

  19. (Blank slide)
