Zheng Xia
Sponsored Links
This presentation is the property of its rightful owner.
1 / 31

Parallel Solving massive linear equations with CUDA PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Zheng Xia School of Electrical and Computer Engineering University of Ottawa, Ottawa, Canada zxia010@uottawa.ca. Parallel Solving massive linear equations with CUDA. Introduction. Specify the main issue on the problem Introduction of CLapack and CULA CPU solving routine

Download Presentation

Parallel Solving massive linear equations with CUDA

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Zheng Xia

School of Electrical and Computer Engineering

University of Ottawa, Ottawa, Canada


Parallel Solving massive linear equations with CUDA


  • Specify the main issue on the problem

  • Introduction of CLapack and CULA

  • CPU solving routine

  • Design and Implement parallel solving routine

    • Naïve

    • Coalesced access

  • Conclusion

CPU VS. GPU: Mythbusters Demo GPU versus CPU


The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems

It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain

e.g. Approximation of a circle with large number of lines

In this case, the deformable object will be decomposed into many tetrahedrawith spring-damper model between every nodes.

FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.

Main issue: Linear equations

Main issue:

: the force on each node

: displacement for each node (it will be three-dimensional which includes )

: the top-level matrix (global matrix).

As the linear equations go through the equation above, the unknown displacement/force can be simply represented by the where a is the inverse of the matrix K

How to get

  • Condition: non-singular matrix (full rank, e.g. square matrix)

  • General: Coppersmith-Winograd:

  • Inefficient for the high-dimension matrix

  • Better to avoid





  • Provides many routines for solving systems of linear equations such as:

    • Linear least squares

    • Eigenvalue problems

    • Singular value decomposition

  • by applying LU, QR, Cholesky and Schurdecomposition

  • A free software library for numerical linear algebra

  • CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler

LU decomposition


  • DGESV computes the solution to a real system of linear equations

  • A * X = B,

  • where A is an N-by-N matrix and X and B are N-by-NRHS matrices.

  • The LU decomposition with partial pivoting and row interchanges is used to factor A as

  • A = P * L * U,

  • where P is a permutation matrix, L is unit lower triangular, and U is upper triangular.


Fits for small n

Constrains by the full rank

Hard to be parallel processed

QR decomposition


  • solve overdetermined or underdetermined real linear systems involving an M-by-N matrix A, or its transpose, using a QR or LQ factorization of A.

  • It is assumed that A has full rank.

  • If m>= n: find the least squares solution of an overdetermined system

  • minimize || B - Ax ||

  • if m< n: find the minimum norm solution of an underdetermined system

  • A x = B


  • What is CULA

  • CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.


  • Another question:

  • Options:

    • CUBLAS Library

    • Simple dgemv on CPU

    • Parallel dgemv on GPU(with and without coalesced read)

CUBLAS Library


The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA®CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Unit (GPU).

Basic step to use the CUBLAS library:

  • Allocate matrices and vectors in the GPU

  • Fill the data

  • Call the sequence of desired CUBLAS functions

  • Upload the results from the GPU memory space back to the host.

    The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.



  • This function performs the matrix-vector multiplication

  • y = α op ( A ) x + β y

  • where A is a m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars. Also, for matrix A

Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs

Simple Dgemv on CPU

Parallel Sgemv on GPU

  • Method

  • For each threads in blocks run a loop with:

  • in this case,

  • Kernel implementation

  • unsigned intid = blockIdx.x * blockDim.x + threadIdx.x;

  • for(unsigned int k = 0;k< Row; k++)

  • {

  • if(i < size) c[i] += a[id* Row + k] * b[k];

  • }

Experimental configuration

  • Test Platform

  • Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz

  • GeForce GTX 295 (only use single GPU)

  • CUDA 5.5

  • Data

    • rand() in C single/double precision

    • Matrix size: ( )

  • Timing function

    • CPU: clock_t

    • GPU: CUDA event

    • *Time compare excludes transfer between system and GPU memory

  • Cont'd

    • CLapack

      • dgesv_ (N, NRHS, A, LDA, IPIV, B, LDB, INFO)

      • dgels_ (const char *trans, const int *M, const int *N, const int *nrhs, double *A, const int *lda, double *b, const int *ldb, double *work, const int * lwork, int *info)

    • CUBLAS Library

      • cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)


    • Simple Dgemv (CPU)

      • void simple_sgemv(float *A, float *B, float *C,unsigned int size)

    • Simple Dgemv(GPU)

      • __global__ void CUParaSgemv(float *a, float *b, float *c,unsigned int size





    Not coalesced access!

    Coalesced Access

    threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.


    0 1 2 3

    4 5 6 7

    8 9 a b

    Thread 0: 0 4 8

    Thread 1: 1 5 9

    Thread 2: 2 6 a

    Thread 3: 3 7 b

    Thread 0: 0 1 2

    Thread 1: 3 4 5

    Thread 2: 6 7 8

    Thread 3: 9 a b


    Naïve VS. Coalesced Access


    N = 1200


    N = 3000


    • The parallel compute method on GPU have a significant effect when small data has been involved in solving routine.

    • Data(Matrix) pre-processing also accounts for a large proportion of time.

    Future work

    • Multicore compute on GPU

    • Reduce the pre-process time on data

      • LU, QR in CUDA version

    • Very Large Matrix K decomposition

    Questions & Comments?

    Thank you!


    • LU Matrix Decomposition in parallel with CUDA:

    • http://www.noctua-blog.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/

    • QR Decomposition on GPUs

    • http://www.ece.neu.edu/groups/nucar/GPGPU/GPGPU-2/Kerr.pdf

    • The QR Algorithm for Finding Eigenvectors

    • http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Eric-Mikida-Fall-2011.pdf

    Appendix A:Algorithm

    LU decomposition Algorithm:



    QR Decomposition Methods

    • Gram‐Schmidt Algorithm

    • Givens rotation

    • Householder reflection

    Appendix B:Other methods on GPU

    • CUFFT

    • Fast Fourier Transform (FFT) library is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and It supports the following features:

      • 1D, 2D, and 3D transforms of complex and real-valued data

      • Batch execution for doing multiple transforms of any dimension in parallel

      • 2D and 3D transform sizes in the range [2, 16384] in any dimension

      • 1D transform sizes up to 8 million elements

      • In place and out of place transforms for real and complex data

      • Double precision transforms on compatible hardware (GT200 and later GPUs)

    • Support for streamed execution, enabling simultaneous computation together with data movement

  • Login