Parallel Solving Massive Linear Equations with CUDA
Zheng Xia
School of Electrical and Computer Engineering
University of Ottawa, Ottawa, Canada
Outline: Introduction; Specify the main issue of the problem; Introduction of CLAPACK and CULA; CPU solving routine
CPU vs. GPU: Mythbusters demo (GPU versus CPU)
The finite element method (FEM) is a numerical technique for finding approximate solutions to boundary value problems
It encompasses all the methods for connecting many simple element equations over many small subdomains, named finite elements, to approximate a more complex equation over a larger domain
e.g. approximating a circle with a large number of straight line segments
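The circle example can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `polygon_perimeter` and the choice of an inscribed regular polygon are assumptions made here, not something taken from the slides. It shows that the summed length of many short straight segments approaches the true circumference as the number of segments grows, which is the same idea FEM applies to a complex domain.

```python
import math

def polygon_perimeter(radius, n_segments):
    """Perimeter of a regular polygon with n_segments sides inscribed in a circle."""
    # Each side is a chord that subtends an angle of 2*pi / n_segments.
    side = 2.0 * radius * math.sin(math.pi / n_segments)
    return n_segments * side

exact = 2.0 * math.pi  # circumference of the unit circle
for n in (6, 24, 96, 384):
    approx = polygon_perimeter(1.0, n)
    print(f"n={n:4d}  perimeter={approx:.6f}  error={exact - approx:.2e}")
```

The error shrinks as the segment count grows, at the cost of more elements to process.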
In this case, the deformable object is decomposed into many tetrahedra, with a spring-damper model between every pair of nodes.
FEM mesh created by an analyst prior to finding a solution to a magnetic problem using FEM software.
The node equations take the form $F = K u$, where
$F$: the force on each node
$u$: displacement for each node (three-dimensional, with x, y and z components)
$K$: the top-level matrix (global matrix).
Since all the linear equations follow the equation above, the unknown displacement can be written simply as $u = aF$, where $a = K^{-1}$ is the inverse of the matrix $K$.
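A tiny worked example may help here. The sketch below assumes the relation is written as F = K u (the symbol names F and u, the two-spring setup, and the helper names `inverse_2x2` and `mat_vec` are all illustrative assumptions). Two springs are connected in series to a wall, a force is applied to the free end, and the displacements are recovered by multiplying the force vector by the inverse of K.

```python
def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix; fails if the matrix is not full rank."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (not full rank)")
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Hypothetical system: wall --k1-- node 1 --k2-- node 2, force applied at node 2.
k1, k2 = 100.0, 200.0
K = [[k1 + k2, -k2],
     [-k2, k2]]          # global stiffness matrix
F = [0.0, 10.0]          # external force on each node

K_inv = inverse_2x2(K)
u = mat_vec(K_inv, F)    # displacement of each node, roughly 0.1 and 0.15
print(u)
```

The closed-form inverse only works for this toy size. For the large matrices produced by a real mesh, forming the inverse explicitly is exactly the expensive step that motivates the decomposition methods and the GPU.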
How to get the inverse of K?
Suitable only for small n
Constrained by the full-rank requirement
Hard to parallelize
The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of an NVIDIA Graphics Processing Unit (GPU).
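Basic steps to use the CUBLAS library: allocate the required matrices and vectors in GPU memory, fill them with data, call the desired sequence of CUBLAS functions, and copy the results from GPU memory back to the host.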
The CUBLAS library also provides helper functions for writing and retrieving data from the GPU.
Limitation: attaches to a single GPU and does not auto-parallelize across multiple GPUs
CPU vs. GPU (and CUBLAS)
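Coalesced vs. non-coalesced access
If threads 0, 1, 2, and 3 read global memory addresses 0x0, 0x4, 0x8, and 0xc at the same time, the read is coalesced.
Element layout in memory:
0 1 2 3
4 5 6 7
8 9 a b
Coalesced access: at each step the four threads read four consecutive elements.
Thread 0: 0 4 8
Thread 1: 1 5 9
Thread 2: 2 6 a
Thread 3: 3 7 b
Not coalesced access! Each thread walks its own contiguous chunk, so at each step the threads read elements three apart.
Thread 0: 0 1 2
Thread 1: 3 4 5
Thread 2: 6 7 8
Thread 3: 9 a b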
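The two thread-to-element assignments listed above can be checked with a small host-side simulation. This Python sketch does not run on a GPU. It only reproduces the index arithmetic for four threads and twelve elements, and the names `interleaved_index`, `chunked_index`, and `is_coalesced` are hypothetical. Real hardware coalesces accesses per warp of 32 threads, so treat this as a model of the idea rather than of the device.

```python
N_THREADS = 4   # threads in the toy example
N_STEPS = 3     # elements read by each thread

def interleaved_index(thread, step):
    # Thread t reads t, t + 4, t + 8: neighbours in the same step are adjacent.
    return step * N_THREADS + thread

def chunked_index(thread, step):
    # Thread t reads 3t, 3t + 1, 3t + 2: neighbours in the same step are 3 apart.
    return thread * N_STEPS + step

def is_coalesced(index_fn):
    """True if, at every step, threads 0..3 touch consecutive elements."""
    for step in range(N_STEPS):
        touched = [index_fn(t, step) for t in range(N_THREADS)]
        if any(b - a != 1 for a, b in zip(touched, touched[1:])):
            return False
    return True

for name, index_fn in (("interleaved", interleaved_index), ("chunked", chunked_index)):
    for step in range(N_STEPS):
        print(name, "step", step, [index_fn(t, step) for t in range(N_THREADS)])
    print(name, "coalesced:", is_coalesced(index_fn))
```

With 4-byte elements, the interleaved pattern at step 0 corresponds to addresses 0x0, 0x4, 0x8, and 0xc, which matches the coalesced read described in the slide.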
Naïve vs. Coalesced Access (figure panels: N = 1200 and N = 3000)
Questions & Comments?
LU decomposition Algorithm:
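The algorithm listing for this slide did not survive, so the following is only a minimal sketch of one common variant, Doolittle LU decomposition without pivoting, and may differ from what the slide showed. The function names `lu_decompose` and `lu_solve` and the 2x2 test system are assumptions chosen for illustration. The matrix is factored once into a unit lower-triangular L and an upper-triangular U, after which each right-hand side is solved by forward and back substitution instead of forming an inverse.

```python
def lu_decompose(a):
    """Doolittle LU factorisation (no pivoting): returns (lower, upper) with a = lower * upper."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of U.
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        if upper[i][i] == 0.0:
            raise ZeroDivisionError("zero pivot: this sketch does no row pivoting")
        # Column i of L, below the diagonal.
        for j in range(i + 1, n):
            lower[j][i] = (a[j][i] - sum(lower[j][k] * upper[k][i] for k in range(i))) / upper[i][i]
    return lower, upper

def lu_solve(lower, upper, b):
    """Solve lower * upper * x = b by forward then back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = b
        y[i] = b[i] - sum(lower[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: U x = y
        x[i] = (y[i] - sum(upper[i][k] * x[k] for k in range(i + 1, n))) / upper[i][i]
    return x

A = [[4.0, 3.0],
     [6.0, 3.0]]
L, U = lu_decompose(A)
x = lu_solve(L, U, [10.0, 12.0])
print(L, U, x)
```

A production solver would add partial pivoting for numerical stability, which this sketch deliberately leaves out to keep the structure visible.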
QR Decomposition Methods
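The slide content for the QR methods is missing as well. As a hedged illustration of one such method, the sketch below uses modified Gram-Schmidt. Householder reflections and Givens rotations are other standard choices, and nothing here indicates which one the presentation used. The function name `qr_gram_schmidt` and the test matrix are assumptions.

```python
import math

def qr_gram_schmidt(a):
    """Modified Gram-Schmidt: returns (q, r) with a = q * r, q orthonormal columns, r upper triangular."""
    m, n = len(a), len(a[0])
    columns = [[a[i][j] for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = columns[j][:]
        # Remove the components along the already-computed orthonormal vectors.
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for qk, vk in zip(q, v)]
        r[j][j] = math.sqrt(sum(vk * vk for vk in v))
        if r[j][j] == 0.0:
            raise ValueError("columns are linearly dependent (matrix not full rank)")
        q_cols.append([vk / r[j][j] for vk in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r

A = [[3.0, 1.0],
     [4.0, 2.0]]
Q, R = qr_gram_schmidt(A)
print(Q)
print(R)
```

Once Q and R are available, a system A x = b reduces to the triangular system R x = Qᵀ b, which is solved by back substitution.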