Zheng Xia
School of Electrical and Computer Engineering
University of Ottawa, Ottawa, Canada
email@example.com
Parallel Solving of Massive Linear Equations with CUDA
Introduction
Specify the main issue of the problem
Introduction of CLAPACK and CULA
CPU solving routine
CPU vs. GPU: Mythbusters demo, GPU versus CPU
The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems.
It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain.
e.g., approximating a circle with a large number of straight line segments.
In this case, the deformable object is decomposed into many tetrahedra, with a spring-damper model between every pair of connected nodes.
FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.
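The circle example can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `polygon_perimeter` and the chosen segment counts are assumptions, not part of the original slides. It shows the perimeter of an inscribed regular polygon approaching the circumference 2πr as the number of segments grows, which is the same idea FEM applies to a complex domain.

```python
import math


def polygon_perimeter(radius, n_segments):
    """Perimeter of a regular polygon with n_segments sides inscribed in a circle."""
    # Each side is a chord subtending an angle of 2*pi/n_segments at the centre.
    side = 2.0 * radius * math.sin(math.pi / n_segments)
    return n_segments * side


if __name__ == "__main__":
    exact = 2.0 * math.pi  # circumference of the unit circle
    for n in (6, 24, 96, 384):
        approx = polygon_perimeter(1.0, n)
        print(f"n = {n:4d}  perimeter = {approx:.6f}  error = {exact - approx:.2e}")
```

The error shrinks roughly with the square of the number of segments, so more (smaller) elements give a better approximation at a higher computational cost.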
The linear system is $F = K u$, where:
$F$: the force on each node
$u$: the displacement of each node (three-dimensional, with $x$, $y$, and $z$ components)
$K$: the top-level matrix (global matrix).
From the equation above, the unknown displacement can be written simply as $u = a F$, where $a = K^{-1}$ is the inverse of the matrix $K$.
Suitable only for small n
Requires K to have full rank (to be invertible)
Hard to parallelize
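To illustrate the inverse-based approach and its full-rank constraint, here is a minimal pure-Python sketch. Writing the force vector as F and the displacement as u is an assumption made for this example, as are the function names `invert` and `solve_by_inverse` and the small 3x3 matrix, which is a made-up stand-in for a global matrix K. The inverse a is computed by Gauss-Jordan elimination, then u = a F.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment K with the identity: [K | I] is reduced to [I | K^-1].
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (not full rank)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def solve_by_inverse(K, F):
    """Return u = a F, where a is the inverse of K."""
    a = invert(K)
    return [sum(a_ij * f_j for a_ij, f_j in zip(row, F)) for row in a]


if __name__ == "__main__":
    # Hypothetical 3x3 system, chosen only for illustration.
    K = [[2.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 2.0]]
    F = [1.0, 0.0, 1.0]
    print(solve_by_inverse(K, F))
```

Each elimination step depends on the result of the previous one, and the whole procedure costs on the order of n³ operations, which is consistent with the drawbacks listed above.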
The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of an NVIDIA Graphics Processing Unit (GPU).
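Basic steps to use the CUBLAS library: allocate the required matrices and vectors in GPU memory, fill them with data, call the desired sequence of CUBLAS functions, and copy the results from GPU memory back to the host.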
The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.
Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs
CPU vs. GPU (and CUBLAS)
Not coalesced access!
If threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc, the read is coalesced.
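Data in memory (12 elements, 0 to b):
0 1 2 3
4 5 6 7
8 9 a b
Coalesced: at each step the four threads read adjacent elements
Thread 0: 0 4 8
Thread 1: 1 5 9
Thread 2: 2 6 a
Thread 3: 3 7 b
Not coalesced: each thread reads its own contiguous chunk, so at each step the threads are three elements apart
Thread 0: 0 1 2
Thread 1: 3 4 5
Thread 2: 6 7 8
Thread 3: 9 a b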
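The two thread-to-element assignments listed above can be reproduced with a short Python sketch. It only models index arithmetic on the host. The helper names `access_addresses` and `is_coalesced` are hypothetical, and the "consecutive threads read consecutive elements" test is a simplification of real hardware rules, which also depend on alignment and transaction size.

```python
def access_addresses(n_threads, n_elems, coalesced):
    """Return, for each step, the element index read by every thread."""
    per_thread = n_elems // n_threads
    steps = []
    for step in range(per_thread):
        if coalesced:
            # Thread t reads element step * n_threads + t:
            # neighbouring threads touch neighbouring elements.
            row = [step * n_threads + t for t in range(n_threads)]
        else:
            # Thread t owns a contiguous chunk and reads t * per_thread + step:
            # neighbouring threads are per_thread elements apart.
            row = [t * per_thread + step for t in range(n_threads)]
        steps.append(row)
    return steps


def is_coalesced(steps):
    """True if, at every step, consecutive threads read consecutive elements."""
    return all(b - a == 1 for row in steps for a, b in zip(row, row[1:]))


if __name__ == "__main__":
    for label, flag in (("coalesced", True), ("chunked", False)):
        steps = access_addresses(4, 12, flag)
        print(label, steps, "coalesced:", is_coalesced(steps))
```

In a CUDA kernel the first pattern corresponds to indexing with `step * blockDim.x + threadIdx.x`, while the second corresponds to `threadIdx.x * per_thread + step`.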
Naïve vs. coalesced access (results shown for N = 1200 and N = 3000)
LU decomposition Algorithm:
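As a reference point, here is a minimal sketch of one standard variant of LU decomposition (Doolittle's method, without pivoting), followed by forward and back substitution. It is not necessarily the variant used in the original slides. The function names `lu_decompose` and `lu_solve` and the 2x2 example are assumptions made for illustration, and a production solver would add pivoting for numerical stability.

```python
def lu_decompose(A):
    """Doolittle LU decomposition without pivoting: A = L U, with unit-diagonal L."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U uses the rows of U already computed above it.
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        if abs(U[i][i]) < 1e-12:
            raise ValueError("zero pivot encountered; pivoting would be required")
        # Column i of L, below the diagonal.
        for r in range(i + 1, n):
            L[r][i] = (A[r][i] - sum(L[r][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U


def lu_solve(L, U, b):
    """Solve L U x = b by forward substitution, then back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward: L y = b
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # backward: U x = y
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


if __name__ == "__main__":
    A = [[4.0, 3.0], [6.0, 3.0]]
    L, U = lu_decompose(A)
    print(L, U, lu_solve(L, U, [10.0, 12.0]))
```

For a fixed step i, the entries of row i of U (and of column i of L) are independent of one another, which is what makes this formulation a candidate for one GPU thread per entry.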