# Jacobi Iterative technique on Multi GPU platform

By Ishtiaq Hossain and Venkata Krishna Nimmagadda.

Application of Jacobi Iteration
• Cardiac tissue is modeled as a grid of cells.
• Each GPU thread calculates the voltage at one cell. This calculation requires the voltage values of the neighboring cells.
• Two different models are shown in the bottom right corner.
• $V_{\text{cell}_0}$ in the current time step is calculated from the values of the surrounding cells in the previous time step, which avoids synchronization issues.
• $V_{\text{cell}_0}^{k} = f\left(V_{\text{cell}_1}^{k-1} + V_{\text{cell}_2}^{k-1} + V_{\text{cell}_3}^{k-1} + \dots + V_{\text{cell}_N}^{k-1}\right)$

where $N$ can be 6 or 18.
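
The update rule above can be sketched on the CPU as follows. This is a minimal Python illustration, not the authors' CUDA kernel: it covers only the 6-neighbor case, holds boundary cells fixed, and uses a hypothetical `average_of_six` as a stand-in for `f`, which the slides do not define.

```python
def average_of_six(neighbor_sum):
    """Hypothetical stand-in for f: the mean of the six face neighbors."""
    return neighbor_sum / 6.0


def jacobi_step(v_prev, f):
    """Return a new 3D grid whose interior cells are f(sum of 6 neighbors in v_prev).

    v_prev is only read, never written, so the order in which cells are
    visited does not matter. Boundary cells keep their old values (assumption).
    """
    nx, ny, nz = len(v_prev), len(v_prev[0]), len(v_prev[0][0])
    v_next = [[row[:] for row in plane] for plane in v_prev]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                neighbor_sum = (v_prev[i - 1][j][k] + v_prev[i + 1][j][k]
                                + v_prev[i][j - 1][k] + v_prev[i][j + 1][k]
                                + v_prev[i][j][k - 1] + v_prev[i][j][k + 1])
                v_next[i][j][k] = f(neighbor_sum)
    return v_next
```

Because every cell reads only from the previous step's grid, each loop iteration is independent, which is what lets one GPU thread handle one cell without locking.
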

Application of Jacobi Iteration
• Initial values are provided to start the computation.
• In a single time step, the ODE and PDE parts are evaluated sequentially and their results are added.
• By solving the finite difference equations, each thread calculates the voltage value of its cell for the time step.
• Figure 1 shows a healthy cell's voltage curve over time.

Figure 1

The Time Step

Vtemp2 is generated in every iteration for all the cells in the grid.

Calculating Vtemp2 requires the Vtemp2 values of the previous time step.

Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to generate the voltage values for that time step.
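
The structure of one time step can be sketched as below. This is a hedged Python outline under stated assumptions: the slides do not define Vtemp1, so it is passed in as given data, and `smooth_1d` is a hypothetical one-dimensional sweep used only to make the example runnable.

```python
def smooth_1d(values):
    """Hypothetical Jacobi sweep: interior cells become the mean of their neighbors."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = 0.5 * (values[i - 1] + values[i + 1])
    return out


def run_time_step(v_temp1, v_temp2_prev, sweep, n_iters):
    """Iterate Vtemp2 starting from the previous time step, then add Vtemp1.

    Returns the voltages for this time step and the final Vtemp2, which
    seeds the next time step.
    """
    v_temp2 = list(v_temp2_prev)
    for _ in range(n_iters):
        v_temp2 = sweep(v_temp2)  # a fresh Vtemp2 for every cell, every iteration
    voltage = [a + b for a, b in zip(v_temp1, v_temp2)]
    return voltage, v_temp2
```

For example, `run_time_step([1.0, 1.0, 1.0], [0.0, 4.0, 2.0], smooth_1d, 1)` smooths the middle cell once and then adds the two arrays element by element.
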

Memory Coalescing
    typedef struct __align__(N)
    {
        int a[N];
        int b[N];
        /* ... further members elided on the slide ... */
    } NODE;

    /* ... */

    NODE nodes[N*N];

N*N blocks of N threads each are launched, so that the N threads of a block access values in consecutive memory locations.
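
Why this layout puts neighboring threads on consecutive addresses can be checked with a little offset arithmetic. The sketch below uses Python's `ctypes` purely as a calculator for struct layout. It assumes thread `t` of block `b` reads `nodes[b].a[t]`, mirrors only the two members shown on the slide, and adds a hypothetical per-thread `Cell` struct as a contrast that is not from the slides.

```python
import ctypes

N = 8  # array length per NODE; hypothetical value for this sketch


class Node(ctypes.Structure):
    """Mirror of the slide's NODE, with only the two members it shows."""
    _fields_ = [("a", ctypes.c_int * N), ("b", ctypes.c_int * N)]


class Cell(ctypes.Structure):
    """Hypothetical contrast: one small struct per thread."""
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_int)]


def node_offset(block, thread, field):
    """Byte offset of nodes[block].<field>[thread] from the start of nodes."""
    base = block * ctypes.sizeof(Node) + getattr(Node, field).offset
    return base + thread * ctypes.sizeof(ctypes.c_int)


def cell_offset(thread, field):
    """Byte offset of cells[thread].<field> from the start of cells."""
    return thread * ctypes.sizeof(Cell) + getattr(Cell, field).offset


def strides(offsets):
    """Distances in bytes between the addresses used by neighboring threads."""
    return [b - a for a, b in zip(offsets, offsets[1:])]
```

With arrays inside the struct, `strides` for the threads of one block is one `int` per thread, so the accesses form a single contiguous run. With one struct per thread, the stride grows with the struct size and the accesses are spread out.
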

Design of Data Structure

(Chart not reproduced; axis: time in milliseconds.)

Serial vs. Single GPU

(Charts not reproduced; axes: time in seconds.)

• Hey serial, what takes you so long?
• 128 x 128 x 128 gives us 309 secs
• Enormous speedup

Step 1 Lessons learnt
• Choose a data structure that maximizes memory coalescing.
• The mechanics of serial code and parallel code are very different.
• Develop algorithms that address the areas where the serial code takes a long time.

Multi GPU Approach

Using OpenMP for launching host threads.

PDE is solved using Jacobi Iteration

ODE is solved using the forward Euler method

Data partitioning and kernel invocation for GPU computation.
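
The forward Euler method named above advances a value by its current slope times the step size. A minimal Python sketch follows; the decay equation used in the usage note is a hypothetical test problem, not the cardiac cell model, which the slides do not give.

```python
def forward_euler(v0, dv_dt, dt, n_steps):
    """Integrate dV/dt = dv_dt(V) from v0 with n_steps explicit Euler steps of size dt."""
    v = v0
    for _ in range(n_steps):
        v = v + dt * dv_dt(v)  # the slope is evaluated at the old value only
    return v
```

For instance, `forward_euler(1.0, lambda v: -v, 0.1, 10)` multiplies the value by 0.9 ten times. Because each step needs only the old value of the same cell, the ODE part parallelizes trivially with one thread per cell.
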

Inter GPU data partitioning
• Input data: 2D array of structures. Structures contain arrays.
• Data resides in host memory.

Interface Region

• Let both cubes have dimensions s x s x s.
• The interface region of the left one is $2s^2$.
• The interface region of the right one is $3s^2$.
• After division, the data is copied into the device (global) memory of each GPU.
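
One way to read the $2s^2$ and $3s^2$ figures is that every face a cube shares with a neighboring cube contributes $s^2$ interface cells. The Python sketch below counts interface cells under that assumption. The `layout` and `pos` tuples are hypothetical, since the slide's figure is not reproduced here, and cells on a shared edge are counted once per face.

```python
def shared_faces(layout, pos):
    """Number of faces the cube at pos shares with neighbors in a layout of cubes.

    layout is (cubes along x, y, z); pos is the cube's index along each axis.
    """
    count = 0
    for axis in range(3):
        if pos[axis] > 0:
            count += 1  # there is a neighbor on the lower side
        if pos[axis] < layout[axis] - 1:
            count += 1  # there is a neighbor on the upper side
    return count


def interface_cells(layout, pos, s):
    """Interface region size for an s x s x s cube: s*s cells per shared face."""
    return shared_faces(layout, pos) * s * s
```

Under this reading, a cube in a 2 x 2 x 1 arrangement has two shared faces and one in a 2 x 2 x 2 arrangement has three, so the choice of cut directly sets how much data each GPU must exchange.
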

Solving PDEs using multiple GPUs
• During each Jacobi iteration, threads share data with one another through global memory.
• Threads in the interface region need data from other GPUs.
• Data is shared between GPUs through host memory.
• A separate kernel is launched that handles the interface region computation and copies the result back to device memory, which keeps the GPUs synchronized.
• Once the PDE calculation is completed for one time step, all values are written back to host memory.
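
The exchange through host memory can be sketched on the CPU. This is an illustrative Python model under simplifying assumptions: a one-dimensional grid, two "devices" represented by two lists, a single ghost cell on each side of the cut, and a dictionary standing in for the host buffer. All names are hypothetical.

```python
def sweep(values):
    """One Jacobi sweep: interior cells become the mean of their neighbors; ends stay fixed."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = 0.5 * (values[i - 1] + values[i + 1])
    return out


def split_with_ghosts(values):
    """Split the grid in two and append one ghost cell at the cut on each side."""
    mid = len(values) // 2
    left = values[:mid] + [0.0]   # last entry is the ghost cell
    right = [0.0] + values[mid:]  # first entry is the ghost cell
    return left, right


def exchange_via_host(left, right):
    """Copy each side's edge cell to the host, then into the other side's ghost cell."""
    host = {"left_edge": left[-2], "right_edge": right[1]}  # device to host
    left[-1] = host["right_edge"]                           # host to device
    right[0] = host["left_edge"]


def multi_device_jacobi(values, n_iters):
    """Run n_iters sweeps on two partitions, exchanging edges before every sweep."""
    left, right = split_with_ghosts(values)
    for _ in range(n_iters):
        exchange_via_host(left, right)
        left, right = sweep(left), sweep(right)
    return left[:-1] + right[1:]  # drop the ghost cells
```

The partitioned run gives the same values as sweeping the whole grid at once, because the ghost cells are refreshed before each sweep reads them.
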

Solving PDEs using multiple GPUs

(Timeline figure not reproduced. Stages shown against time: host to device copy, GPU computation, interface region computation, device to host copy.)

The circus of Inter GPU sync
• Ghost cell computing!
  • Pad with dummy cells at the inter-GPU interfaces to reduce communication.
• Let's make the other CPU cores work.
  • 4 of the 8 CPU cores hold contexts.
  • Use the 4 free cores for the interface computation.
• Simple is best.
  • Launch new kernels with different dimensions to handle the cells at the interface.
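
One common reading of the ghost cell idea is that a wider band of padding lets each GPU run several sweeps between exchanges. The Python sketch below illustrates that reading on a one-dimensional grid split in two. It is an assumption about what the slide means, not the authors' implementation, and the function names are hypothetical.

```python
def sweep(values):
    """One Jacobi sweep: interior cells become the mean of their neighbors; ends stay fixed."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = 0.5 * (values[i - 1] + values[i + 1])
    return out


def jacobi_with_wide_ghosts(values, width, n_rounds):
    """Run n_rounds * width sweeps on two partitions with only n_rounds exchanges.

    Each partition is padded with `width` ghost cells copied from its neighbor.
    Every local sweep invalidates one more ghost layer, so after `width` sweeps
    the ghosts are stale but the owned cells are still exact.
    Requires width <= len(values) // 2.
    """
    mid = len(values) // 2
    left_owned, right_owned = values[:mid], values[mid:]
    exchanges = 0
    for _round in range(n_rounds):
        left = left_owned + right_owned[:width]    # refresh the right-hand ghosts
        right = left_owned[-width:] + right_owned  # refresh the left-hand ghosts
        exchanges += 1
        for _step in range(width):
            left, right = sweep(left), sweep(right)
        left_owned, right_owned = left[:mid], right[width:]
    return left_owned + right_owned, exchanges
```

With `width=2`, six sweeps need three exchanges instead of six. The price is redundant computation on the ghost cells, so the best width depends on how expensive a transfer is relative to a sweep.
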

Various Stages

Interestingly, solving the PDE with Jacobi iteration takes most of the time.

Scalability

A = 32 x 32 x 32 cells executed by each GPU

B = 32 x 32 x 32 cells executed by each GPU

C = 32 x 32 x 32 cells executed by each GPU

D = 32 x 32 x 32 cells executed by each GPU

Step 2 Lessons Learnt
• The Jacobi iterative technique scales well.
• Interface selection is very important.
• Making a multi-GPU program generic takes a lot of effort from the programmer.