Parallel Solving massive linear equations with CUDA

Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca Parallel Solving massive linear equations with CUDA

Introduction • Specify the main issue on the problem • Introduction of CLapack and CULA • CPU solving routine • Design and Implement parallel solving routine • Naïve • Coalesced access • Conclusion CPU VS. GPU: Mythbusters Demo GPU versus CPU

FEM The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain e.g. Approximation of a circle with large number of lines In this case, the deformable object will be decomposed into many tetrahedrawith spring-damper model between every nodes. FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.

Main issue: Linear equations Main issue: : the force on each node : displacement for each node (it will be three-dimensional which includes ) : the top-level matrix (global matrix). As the linear equations go through the equation above, the unknown displacement/force can be simply represented by the where a is the inverse of the matrix K

How to get • Condition: non-singular matrix (full rank, e.g. square matrix) • General: Coppersmith-Winograd: • Inefficient for the high-dimension matrix • Better to avoid [1] [1].http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm

(C)Lapack (C)Lapack • Provides many routines for solving systems of linear equations such as: • Linear least squares • Eigenvalue problems • Singular value decomposition • by applying LU, QR, Cholesky and Schurdecomposition • A free software library for numerical linear algebra • CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler

LU decomposition dgesv() • DGESV computes the solution to a real system of linear equations • A * X = B, • where A is an N-by-N matrix and X and B are N-by-NRHS matrices. • The LU decomposition with partial pivoting and row interchanges is used to factor A as • A = P * L * U, • where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. Limitation: Fits for small n Constrains by the full rank Hard to be parallel processed

QR decomposition dgels() • solve overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A. • It is assumed that A has full rank. • If m>= n: find the least squares solution of an overdetermined system • minimize || B - Ax || • if m< n: find the minimum norm solution of an underdetermined system • A x = B

CULA • What is CULA • CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

Y=Ab • Another question: • Options: • CUBLAS Library • Simple dgemv on CPU • Parallel dgemv on GPU(with and without coalesced read)

CUBLAS Library Intro The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA®CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU). Basic step to use the CUBLAS library: • Allocate matrices and vectors in the GPU • Fill the data • Call the sequence of desired CUBLAS functions • Upload the results from the GPU memory space back to the host. The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.

CUBLAS<t>gemv() cublasSgemv() • This function performs the matrix-vector multiplication • y = α op ( A ) x + β y • where A is a m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs

Simple Dgemv on CPU

Parallel Sgemv on GPU • Method • For each threads in blocks run a loop with: • in this case, • Kernel implementation • unsigned intid = blockIdx.x * blockDim.x + threadIdx.x; • for(unsigned int k = 0;k< Row; k++) • { • if(i < size) c[i] += a[id* Row + k] * b[k]; • }

Experimental configuration • Test Platform • Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz • GeForce GTX 295 (only use single GPU) • CUDA 5.5 • Data • rand() in C single/double precision • Matrix size: ( ) • Timing function • CPU: clock_t • GPU: CUDA event • *Time compare excludes transfer between system and GPU memory

Cont'd • CLapack • dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO) • dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info) • CUBLAS Library • cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

Cont'd • Simple Dgemv (CPU) • void simple_sgemv(float *A, float *B, float *C,unsigned int size) • Simple Dgemv(GPU) • __global__ void CUParaSgemv(float *a, float *b, float *c,unsigned int size

Performance CLapack

Cont'd CPU VS. GPU(and CUBLAS) Not coalesced access!

Coalesced Access threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read. MartixA: 0 1 2 3 4 5 6 7 8 9 a b Thread 0: 0 4 8 Thread 1: 1 5 9 Thread 2: 2 6 a Thread 3: 3 7 b Thread 0: 0 1 2 Thread 1: 3 4 5 Thread 2: 6 7 8 Thread 3: 9 a b

Cont’d Naïve VS. Coalesced Access

Analysis N = 1200

Cont’d N = 3000

Conclusion • The parallel compute method on GPU have a significant effect when small data has been involved in solving routine. • Data(Matrix) pre-processing also accounts for a large proportion of time.

Future work • Multicore compute on GPU • Reduce the pre-process time on data • LU, QR in CUDA version • Very Large Matrix K decomposition

Questions & Comments?

Thank you!

References • LU Matrix Decomposition in parallel with CUDA: • http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/ • QR Decomposition on GPUs • http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf • The QR Algorithm for Finding Eigenvectors • http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

Appendix A:Algorithm LU decomposition Algorithm: Pivot Pivot

QR Decomposition Methods • Gram‐Schmidt Algorithm • Givens rotation • Householder reflection

Appendix B:Other methods on GPU • CUFFT • Fast Fourier Transform (FFT) library is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and It supports the following features: • 1D, 2D, and 3D transforms of complex and real-valued data • Batch execution for doing multiple transforms of any dimension in parallel • 2D and 3D transform sizes in the range [2, 16384] in any dimension • 1D transform sizes up to 8 million elements • In place and out of place transforms for real and complex data • Double precision transforms on compatible hardware (GT200 and later GPUs) • Support for streamed execution, enabling simultaneous computation together with data movement

Parallel Solving massive linear equations with CUDA

Parallel Solving massive linear equations with CUDA

Presentation Transcript

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

2.4 Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving linear equations

Solving Linear EQUATIONS

Solving Linear Equations

Solving Linear Equations

SOLVING LINEAR EQUATIONS

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations

Solving Linear Equations