Zheng Xia
1 / 31

Parallel Solving massive linear equations with CUDA - PowerPoint PPT Presentation

  • Uploaded on

Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca. Parallel Solving massive linear equations with CUDA. Introduction. Specify the main issue on the problem Introduction of CLapack and CULA CPU solving routine

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Parallel Solving massive linear equations with CUDA' - tokala

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Parallel solving massive linear equations with cuda

Zheng Xia

School of Electrical and Computer Engineering

University of Ottawa, Ottawa, Canada


Parallel Solving massive linear equations with CUDA


  • Specify the main issue on the problem

  • Introduction of CLapack and CULA

  • CPU solving routine

  • Design and Implement parallel solving routine

    • Naïve

    • Coalesced access

  • Conclusion

CPU VS. GPU: Mythbusters Demo GPU versus CPU

Parallel solving massive linear equations with cuda

The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems

It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain

e.g. Approximation of a circle with large number of lines

In this case, the deformable object will be decomposed into many tetrahedrawith spring-damper model between every nodes.

FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.

Main issue linear equations
Main issue: Linear equations

Main issue:

: the force on each node

: displacement for each node (it will be three-dimensional which includes )

: the top-level matrix (global matrix).

As the linear equations go through the equation above, the unknown displacement/force can be simply represented by the where a is the inverse of the matrix K

Parallel solving massive linear equations with cuda

How to get

  • Condition: non-singular matrix (full rank, e.g. square matrix)

  • General: Coppersmith-Winograd:

  • Inefficient for the high-dimension matrix

  • Better to avoid



C lapack


  • Provides many routines for solving systems of linear equations such as:

    • Linear least squares

    • Eigenvalue problems

    • Singular value decomposition

  • by applying LU, QR, Cholesky and Schurdecomposition

  • A free software library for numerical linear algebra

  • CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler

Lu decomposition
LU decomposition


  • DGESV computes the solution to a real system of linear equations

  • A * X = B,

  • where A is an N-by-N matrix and X and B are N-by-NRHS matrices.

  • The LU decomposition with partial pivoting and row interchanges is used to factor A as

  • A = P * L * U,

  • where P is a permutation matrix, L is unit lower triangular, and U is upper triangular.


Fits for small n

Constrains by the full rank

Hard to be parallel processed

Qr decomposition
QR decomposition


  • solve overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A.

  • It is assumed that A has full rank.

  • If m>= n: find the least squares solution of an overdetermined system

  • minimize || B - Ax ||

  • if m< n: find the minimum norm solution of an underdetermined system

  • A x = B

Parallel solving massive linear equations with cuda

  • What is CULA

  • CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

Parallel solving massive linear equations with cuda

  • Another question:

  • Options:

    • CUBLAS Library

    • Simple dgemv on CPU

    • Parallel dgemv on GPU(with and without coalesced read)

Cublas library
CUBLAS Library


The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA®CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU).

Basic step to use the CUBLAS library:

  • Allocate matrices and vectors in the GPU

  • Fill the data

  • Call the sequence of desired CUBLAS functions

  • Upload the results from the GPU memory space back to the host.

    The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.

Cublas t gemv


  • This function performs the matrix-vector multiplication

  • y = α op ( A ) x + β y

  • where A is a m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A

Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs

Simple dgemv on cpu
Simple Dgemv on CPU

Parallel sgemv on gpu
Parallel Sgemv on GPU

  • Method

  • For each threads in blocks run a loop with:

  • in this case,

  • Kernel implementation

  • unsigned intid = blockIdx.x * blockDim.x + threadIdx.x;

  • for(unsigned int k = 0;k< Row; k++)

  • {

  • if(i < size) c[i] += a[id* Row + k] * b[k];

  • }

Experimental configuration
Experimental configuration

  • Test Platform

  • Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz

  • GeForce GTX 295 (only use single GPU)

  • CUDA 5.5

  • Data

    • rand() in C single/double precision

    • Matrix size: ( )

  • Timing function

    • CPU: clock_t

    • GPU: CUDA event

    • *Time compare excludes transfer between system and GPU memory

  • Cont d

    • CLapack

      • dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO)

      • dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info)

    • CUBLAS Library

      • cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

    Cont d1

    • Simple Dgemv (CPU)

      • void simple_sgemv(float *A, float *B, float *C,unsigned int size)

    • Simple Dgemv(GPU)

      • __global__ void CUParaSgemv(float *a, float *b, float *c,unsigned int size



    Cont d2


    Not coalesced access!

    Coalesced access
    Coalesced Access

    threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.


    0 1 2 3

    4 5 6 7

    8 9 a b

    Thread 0: 0 4 8

    Thread 1: 1 5 9

    Thread 2: 2 6 a

    Thread 3: 3 7 b

    Thread 0: 0 1 2

    Thread 1: 3 4 5

    Thread 2: 6 7 8

    Thread 3: 9 a b

    Cont d3

    Naïve VS. Coalesced Access


    N = 1200

    Cont d4

    N = 3000


    • The parallel compute method on GPU have a significant effect when small data has been involved in solving routine.

    • Data(Matrix) pre-processing also accounts for a large proportion of time.

    Future work
    Future work

    • Multicore compute on GPU

    • Reduce the pre-process time on data

      • LU, QR in CUDA version

    • Very Large Matrix K decomposition


    • LU Matrix Decomposition in parallel with CUDA:

    • http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/

    • QR Decomposition on GPUs

    • http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf

    • The QR Algorithm for Finding Eigenvectors

    • http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

    Appendix a algorithm
    Appendix A:Algorithm

    LU decomposition Algorithm:



    Parallel solving massive linear equations with cuda

    QR Decomposition Methods

    • Gram‐Schmidt Algorithm

    • Givens rotation

    • Householder reflection

    Appendix b other methods on gpu
    Appendix B:Other methods on GPU

    • CUFFT

    • Fast Fourier Transform (FFT) library is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and It supports the following features:

      • 1D, 2D, and 3D transforms of complex and real-valued data

      • Batch execution for doing multiple transforms of any dimension in parallel

      • 2D and 3D transform sizes in the range [2, 16384] in any dimension

      • 1D transform sizes up to 8 million elements

      • In place and out of place transforms for real and complex data

      • Double precision transforms on compatible hardware (GT200 and later GPUs)

    • Support for streamed execution, enabling simultaneous computation together with data movement