Parallel Solving of Massive Linear Equations with CUDA

Zheng Xia

School of Electrical and Computer Engineering

University of Ottawa, Ottawa, Canada

- Statement of the main problem
- Introduction to CLAPACK and CULA
- CPU solving routine
- Design and implementation of the parallel solving routine
  - Naïve
  - Coalesced access
- Conclusion

CPU VS. GPU: Mythbusters Demo GPU versus CPU

The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems.

It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain.

e.g. approximation of a circle with a large number of line segments

In this case, the deformable object is decomposed into many tetrahedra, with a spring-damper model between every pair of nodes.

FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.

Main issue: F = K * u

F: the force on each node

u: the displacement of each node (three-dimensional, with x, y and z components)

K: the top-level matrix (global matrix).

Once the linear equations are assembled in the form above, the unknown displacement can be written simply as u = K^(-1) * F, where K^(-1) is the inverse of the matrix K.

How to get K^(-1)

- Condition: non-singular matrix (square and full rank)
- General: Coppersmith-Winograd: O(n^2.376) [1]
  - Inefficient for high-dimensional matrices
  - Better to avoid

[1] http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm

(C)LAPACK

- Provides routines for solving systems of linear equations and related problems such as:
  - Linear least squares
  - Eigenvalue problems
  - Singular value decomposition
- Applies LU, QR, Cholesky and Schur decompositions
- A free software library for numerical linear algebra
- CLAPACK's goal is to provide LAPACK for those who do not have access to a Fortran compiler

dgesv()

- DGESV computes the solution to a real system of linear equations
  - A * X = B,
- where A is an N-by-N matrix and X and B are N-by-NRHS matrices.
- LU decomposition with partial pivoting and row interchanges is used to factor A as
  - A = P * L * U,
- where P is a permutation matrix, L is unit lower triangular, and U is upper triangular.
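
The factor-then-substitute idea behind dgesv() can be sketched in plain C. This is only an illustration of LU decomposition with partial pivoting for a single right-hand side, not the LAPACK implementation; the names `lu_solve` and `absd` and the row-major storage are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper (hypothetical; avoids depending on libm). */
static double absd(double v) { return v < 0.0 ? -v : v; }

/* Solve A * x = b for one right-hand side.
 * a:   n-by-n matrix, row-major, overwritten with the L and U factors.
 * b:   length-n vector, overwritten with the solution x.
 * piv: length-n output, piv[k] is the row swapped with row k at step k.
 * Returns 0 on success, or k + 1 if the pivot at step k is exactly zero. */
int lu_solve(size_t n, double *a, double *b, size_t *piv)
{
    assert(a != NULL && b != NULL && piv != NULL);
    for (size_t k = 0; k < n; k++) {
        /* Partial pivoting: pick the row with the largest |a[i][k]|, i >= k. */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (absd(a[i * n + k]) > absd(a[p * n + k]))
                p = i;
        piv[k] = p;
        if (a[p * n + k] == 0.0)
            return (int)(k + 1);
        if (p != k) {
            /* Row interchange, applied to the matrix and to b. */
            for (size_t j = 0; j < n; j++) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            double tb = b[k];
            b[k] = b[p];
            b[p] = tb;
        }
        /* Eliminate below the pivot; the multipliers (entries of L) are
         * stored in place, and b is updated in the same pass. */
        for (size_t i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for (size_t j = k + 1; j < n; j++)
                a[i * n + j] -= m * a[k * n + j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution with the upper-triangular factor U. */
    for (size_t i = n; i-- > 0;) {
        double s = b[i];
        for (size_t j = i + 1; j < n; j++)
            s -= a[i * n + j] * b[j];
        b[i] = s / a[i * n + i];
    }
    return 0;
}
```

The elimination loop over rows below the pivot depends on the pivot chosen in the same step, so the steps themselves run one after another; only the work inside a step is independent.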

Limitation:

Suitable only for small n

Constrained by the full-rank requirement

Hard to parallelize

dgels()

- Solves overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A.
- It is assumed that A has full rank.
- If M >= N: finds the least squares solution of an overdetermined system
  - minimize || B - A * X ||
- If M < N: finds the minimum norm solution of an underdetermined system
  - A * X = B
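
The overdetermined case (M >= N) can be illustrated with a small least-squares solver in C. LAPACK's dgels() uses a QR or LQ factorization as stated above; the sketch below swaps in a modified Gram-Schmidt orthogonalization without normalization, purely to keep the example short. The names `ls_solve_mgs` and `col_dot` and the row-major storage are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of columns j and k of a row-major m-by-n matrix. */
static double col_dot(size_t m, size_t n, const double *a, size_t j, size_t k)
{
    double s = 0.0;
    for (size_t i = 0; i < m; i++)
        s += a[i * n + j] * a[i * n + k];
    return s;
}

/* Minimize || b - A * x || for m >= n and A of full column rank.
 * a: m-by-n, row-major, overwritten by orthogonal (unnormalized) columns Q.
 * b: length m, overwritten by the residual b - A * x.
 * r: n-by-n scratch; its upper triangle receives the unit upper-triangular
 *    factor R with A = Q * R (the lower triangle is not written).
 * x: length n, receives the solution.
 * Returns 0 on success, -1 if a column becomes exactly zero. */
int ls_solve_mgs(size_t m, size_t n, double *a, double *b, double *r, double *x)
{
    assert(m >= n);
    for (size_t j = 0; j < n; j++) {
        double d = col_dot(m, n, a, j, j);   /* squared norm of q_j */
        if (d == 0.0)
            return -1;
        r[j * n + j] = 1.0;
        /* Remove the q_j component from the remaining columns. */
        for (size_t k = j + 1; k < n; k++) {
            double rjk = col_dot(m, n, a, j, k) / d;
            r[j * n + k] = rjk;
            for (size_t i = 0; i < m; i++)
                a[i * n + k] -= rjk * a[i * n + j];
        }
        /* Project b onto q_j; x[j] temporarily holds that coefficient. */
        double c = 0.0;
        for (size_t i = 0; i < m; i++)
            c += a[i * n + j] * b[i];
        c /= d;
        x[j] = c;
        for (size_t i = 0; i < m; i++)
            b[i] -= c * a[i * n + j];
    }
    /* Back substitution with the unit upper-triangular R. */
    for (size_t j = n; j-- > 0;)
        for (size_t k = j + 1; k < n; k++)
            x[j] -= r[j * n + k] * x[k];
    return 0;
}
```

For example, fitting a line y = x0 + x1 * t through the points (0, 1), (1, 3), (2, 5) uses the 3-by-2 matrix with rows (1, 0), (1, 1), (1, 2) and recovers x0 = 1, x1 = 2.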

- What is CULA
- CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

- Another question:

- Options:
- CUBLAS Library
- Simple dgemv on CPU
- Parallel dgemv on GPU (with and without coalesced read)

Intro

The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of an NVIDIA Graphics Processing Unit (GPU).

Basic steps for using the CUBLAS library:

- Allocate matrices and vectors in the GPU memory space
- Fill them with data
- Call the sequence of desired CUBLAS functions
- Copy the results from the GPU memory space back to the host

The CUBLAS library also provides helper functions for writing data to and retrieving data from the GPU.

cublasSgemv()

- This function performs the matrix-vector multiplication
  - y = α op(A) x + β y
- where A is an m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A, op(A) = A if trans == CUBLAS_OP_N, A^T if trans == CUBLAS_OP_T, and A^H if trans == CUBLAS_OP_C.

Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs
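
The operation y = α op(A) x + β y can be written as a plain CPU reference in C, which is handy for checking GPU results. This is a sketch of the arithmetic only, not the CUBLAS API: the name `ref_sgemv` and the integer `trans` flag are assumptions for this example, and only the non-transposed and transposed cases are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y = alpha * op(A) * x + beta * y for a column-major m-by-n A.
 * trans == 0: op(A) = A   (x has n entries, y has m entries)
 * trans != 0: op(A) = A^T (x has m entries, y has n entries)
 * lda is the leading dimension (distance between columns), lda >= m. */
void ref_sgemv(int trans, size_t m, size_t n, float alpha, const float *a,
               size_t lda, const float *x, float beta, float *y)
{
    assert(lda >= m);
    size_t ylen = trans ? n : m;
    size_t xlen = trans ? m : n;
    for (size_t i = 0; i < ylen; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < xlen; k++) {
            /* Element (row, col) of A is stored at a[row + col * lda]. */
            float aik = trans ? a[k + i * lda] : a[i + k * lda];
            acc += aik * x[k];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Column-major storage means that consecutive memory locations hold consecutive rows of one column, which is the opposite of the usual C row-major layout.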

- Method
  - Each thread runs a loop that accumulates one element of the result: c[id] = sum over k of a[id * Row + k] * b[k]
- Kernel implementation

    /* One thread per output element; id is the global thread index. */
    unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size) {
        for (unsigned int k = 0; k < Row; k++)
            c[id] += a[id * Row + k] * b[k];
    }

- Test Platform
- Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz
- GeForce GTX 295 (only use single GPU)
- CUDA 5.5

- rand() in C single/double precision
- Matrix size: ( )

- CPU: clock_t
- GPU: CUDA event
- *The timing comparison excludes transfers between system memory and GPU memory
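
The CPU side of this measurement can be sketched in C with clock_t, using rand() to generate the test data. The names `buffer_t`, `fill_random` and `time_cpu_ms` are hypothetical helpers introduced for this example, and averaging over several repetitions is an assumption rather than something the slides state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* A flat array of floats, e.g. a matrix stored contiguously. */
typedef struct {
    float *data;
    size_t count;
} buffer_t;

/* Fill a buffer with pseudo-random values in [0, 1] using rand(). */
void fill_random(void *arg)
{
    buffer_t *buf = (buffer_t *)arg;
    for (size_t i = 0; i < buf->count; i++)
        buf->data[i] = (float)rand() / (float)RAND_MAX;
}

/* Average processor time, in milliseconds, of fn(arg) over reps calls. */
double time_cpu_ms(void (*fn)(void *), void *arg, unsigned reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (unsigned r = 0; r < reps; r++)
        fn(arg);
    clock_t stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Note that clock() reports processor time used by the program rather than wall-clock time. On the GPU side the corresponding interval is taken between two CUDA events recorded around the kernel launch, so that host-device copies stay outside the measured region.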

- CLapack
- dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO)
- dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info)

- CUBLAS Library
- cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

- Simple Dgemv (CPU)
- void simple_sgemv(float *A, float *B, float *C, unsigned int size)

- Simple Dgemv (GPU)
- __global__ void CUParaSgemv(float *a, float *b, float *c, unsigned int size)

CLapack

CPU VS. GPU (and CUBLAS)

Not coalesced access!

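When threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc in the same step, the accesses form a coalesced read.

Matrix A (elements labeled in memory order):

0 1 2 3

4 5 6 7

8 9 a b

Coalesced access (in each step the four threads read consecutive elements):

Thread 0: 0 4 8

Thread 1: 1 5 9

Thread 2: 2 6 a

Thread 3: 3 7 b

Naïve access (each thread reads its own run of consecutive elements, so in each step the threads are 3 elements apart):

Thread 0: 0 1 2

Thread 1: 3 4 5

Thread 2: 6 7 8

Thread 3: 9 a b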
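
The two thread-to-element listings above can be reproduced with two index formulas, sketched here in C for 4 threads and 3 loop steps. The listing that starts with "Thread 0: 0 4 8" corresponds to `coalesced_index`, and the one that starts with "Thread 0: 0 1 2" corresponds to `naive_index`, which is the a[id * Row + k] pattern of the kernel. All function names are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Element read by thread tid at loop step k when each thread owns a run
 * of `steps` consecutive elements (the a[id * Row + k] pattern). */
size_t naive_index(size_t tid, size_t k, size_t steps)
{
    return tid * steps + k;
}

/* Element read by thread tid at loop step k when the data is laid out so
 * that the `threads` elements needed in one step are adjacent in memory. */
size_t coalesced_index(size_t tid, size_t k, size_t threads)
{
    return k * threads + tid;
}

typedef size_t (*index_fn)(size_t tid, size_t k, size_t param);

/* Returns 1 if, in every step, thread t reads the element right after the
 * one read by thread t - 1; returns 0 otherwise. */
int all_steps_consecutive(index_fn f, size_t threads, size_t steps,
                          size_t param)
{
    assert(f != NULL && threads > 0);
    for (size_t k = 0; k < steps; k++)
        for (size_t t = 1; t < threads; t++)
            if (f(t, k, param) != f(t - 1, k, param) + 1)
                return 0;
    return 1;
}
```

In a matrix-vector kernel, moving from the first formula to the second amounts to storing the matrix so that the elements needed by neighbouring threads in the same loop step sit next to each other, for instance by transposing it beforehand.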

Naïve VS. Coalesced Access

N = 1200

N = 3000

- The parallel computing method on the GPU has a significant effect when small data is involved in the solving routine.
- Data (matrix) pre-processing also accounts for a large proportion of the time.

- Multicore compute on GPU
- Reduce the pre-process time on data
- LU, QR in CUDA version

- Very Large Matrix K decomposition

Questions & Comments?

Thank you!

- LU Matrix Decomposition in parallel with CUDA:
- http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/
- QR Decomposition on GPUs
- http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf
- The QR Algorithm for Finding Eigenvectors
- http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

LU decomposition algorithm:

[Figure: LU decomposition steps with the pivot elements marked]

QR Decomposition Methods

- Gram‐Schmidt Algorithm
- Givens rotation
- Householder reflection

- CUFFT
- The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. The CUFFT library supports the following features:
- 1D, 2D, and 3D transforms of complex and real-valued data
- Batch execution for doing multiple transforms of any dimension in parallel
- 2D and 3D transform sizes in the range [2, 16384] in any dimension
- 1D transform sizes up to 8 million elements
- In place and out of place transforms for real and complex data
- Double precision transforms on compatible hardware (GT200 and later GPUs)

- Support for streamed execution, enabling simultaneous computation together with data movement