1 / 31

# Parallel Solving massive linear equations with CUDA - PowerPoint PPT Presentation

Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca. Parallel Solving massive linear equations with CUDA. Introduction. Specify the main issue on the problem Introduction of CLapack and CULA CPU solving routine

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Parallel Solving massive linear equations with CUDA

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

#### Presentation Transcript

School of Electrical and Computer Engineering

University of Ottawa, Ottawa, Canada

zxia010@uottawa.ca

Parallel Solving massive linear equations with CUDA

• Specify the main issue on the problem

• Introduction of CLapack and CULA

• CPU solving routine

• Design and Implement parallel solving routine

• Naïve

• Coalesced access

• Conclusion

CPU VS. GPU: Mythbusters Demo GPU versus CPU

The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems

It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain

e.g. Approximation of a circle with large number of lines

In this case, the deformable object will be decomposed into many tetrahedrawith spring-damper model between every nodes.

FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.

Main issue:

: the force on each node

: displacement for each node (it will be three-dimensional which includes )

: the top-level matrix (global matrix).

As the linear equations go through the equation above, the unknown displacement/force can be simply represented by the where a is the inverse of the matrix K

• Condition: non-singular matrix (full rank, e.g. square matrix)

• Inefficient for the high-dimension matrix

• Better to avoid

[1]

(C)Lapack

• Provides many routines for solving systems of linear equations such as:

• Linear least squares

• Eigenvalue problems

• Singular value decomposition

• by applying LU, QR, Cholesky and Schurdecomposition

• A free software library for numerical linear algebra

• CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler

dgesv()

• DGESV computes the solution to a real system of linear equations

• A * X = B,

• where A is an N-by-N matrix and X and B are N-by-NRHS matrices.

• The LU decomposition with partial pivoting and row interchanges is used to factor A as

• A = P * L * U,

• where P is a permutation matrix, L is unit lower triangular, and U is upper triangular.

Limitation:

Fits for small n

Constrains by the full rank

Hard to be parallel processed

dgels()

• solve overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A.

• It is assumed that A has full rank.

• If m>= n: find the least squares solution of an overdetermined system

• minimize || B - Ax ||

• if m< n: find the minimum norm solution of an underdetermined system

• A x = B

• What is CULA

• CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

• Another question:

• Options:

• CUBLAS Library

• Simple dgemv on CPU

• Parallel dgemv on GPU(with and without coalesced read)

Intro

The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA®CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU).

Basic step to use the CUBLAS library:

• Allocate matrices and vectors in the GPU

• Fill the data

• Call the sequence of desired CUBLAS functions

• Upload the results from the GPU memory space back to the host.

The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.

cublasSgemv()

• This function performs the matrix-vector multiplication

• y = α op ( A ) x + β y

• where A is a m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A

Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs

Simple Dgemv on CPU

Parallel Sgemv on GPU

• Method

• For each threads in blocks run a loop with:

• in this case,

• Kernel implementation

• unsigned intid = blockIdx.x * blockDim.x + threadIdx.x;

• for(unsigned int k = 0;k< Row; k++)

• {

• if(i < size) c[i] += a[id* Row + k] * b[k];

• }

• Test Platform

• Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz

• GeForce GTX 295 (only use single GPU)

• CUDA 5.5

• Data

• rand() in C single/double precision

• Matrix size: ( )

• Timing function

• CPU: clock_t

• GPU: CUDA event

• *Time compare excludes transfer between system and GPU memory

• CLapack

• dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO)

• dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info)

• CUBLAS Library

• cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

• Simple Dgemv (CPU)

• void simple_sgemv(float *A, float *B, float *C,unsigned int size)

• Simple Dgemv(GPU)

• __global__ void CUParaSgemv(float *a, float *b, float *c,unsigned int size

CLapack

CPU VS. GPU(and CUBLAS)

Not coalesced access!

threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.

MartixA:

0 1 2 3

4 5 6 7

8 9 a b

Thread 0: 0 4 8

Thread 1: 1 5 9

Thread 2: 2 6 a

Thread 3: 3 7 b

Thread 0: 0 1 2

Thread 1: 3 4 5

Thread 2: 6 7 8

Thread 3: 9 a b

Naïve VS. Coalesced Access

N = 1200

N = 3000

• The parallel compute method on GPU have a significant effect when small data has been involved in solving routine.

• Data(Matrix) pre-processing also accounts for a large proportion of time.

• Multicore compute on GPU

• Reduce the pre-process time on data

• LU, QR in CUDA version

• Very Large Matrix K decomposition

• LU Matrix Decomposition in parallel with CUDA:

• http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/

• QR Decomposition on GPUs

• http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf

• The QR Algorithm for Finding Eigenvectors

• http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

LU decomposition Algorithm:

Pivot

Pivot

• Gram‐Schmidt Algorithm

• Givens rotation

• Householder reflection

Appendix B:Other methods on GPU

• CUFFT

• Fast Fourier Transform (FFT) library is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and It supports the following features:

• 1D, 2D, and 3D transforms of complex and real-valued data

• Batch execution for doing multiple transforms of any dimension in parallel

• 2D and 3D transform sizes in the range [2, 16384] in any dimension

• 1D transform sizes up to 8 million elements

• In place and out of place transforms for real and complex data

• Double precision transforms on compatible hardware (GT200 and later GPUs)

• Support for streamed execution, enabling simultaneous computation together with data movement