- 80 Views
- Uploaded on
- Presentation posted in: General

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Nolan Goodnight Cliff Woolley Gregory LewinDavid Luebke Greg Humphreys

University of Virginia

augmented by Klaus Mueller, Stony Brook University

- Why do we port algorithms to the GPU?
- How much faster can we expect it to be, really?
- What is the challenge in porting?

Problem: Implement a Boundary Value Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and engineering applications, e.g.:

- Heat transfer
- Fluid flow

- Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms
- Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
- Very similar to our system
- Developed concurrently
- Complementary approach

- Very similar to our system

Problem domain is a warped disc:

regular grid

regular grid

- Boundary value problems are sometimes governedby PDEs of the form:
L=f

- L is some operator
- is the problem domain
- f is a forcing function (source term)
- Given L and f, solve for .

Heat Transfer

- Find a steady-state temperature distribution T in a solid of thermal conductivity k with thermal source S
- This requires solving a Poisson equation of the form:
k2T = -S

- This is a BVP where L is the Laplacian operator 2
All our applications require a Poisson solver.

- Most such problems cannot be solved analytically
- Instead, discretize onto a grid to form a set of linear equations, then solve:
- Direct elimination
- Gauss-Seidel iteration
- Conjugate-gradient
- Strongly implicit procedures
- Multigrid method

- Iteratively corrects an approximation to the solution
- Operates at multiple grid resolutions
- Low-resolution grids are used to correct higher-resolution grids recursively
- Very fast, especially for large grids: O(n)

- Use coarser grid levels to recursively correct an approximation to the solution
- may converge slowly on fine grid -> restrict to course grid
- push out long wavelength errors quickly (single grid solvers only smooth out high frequency errors)

- may converge slowly on fine grid -> restrict to course grid
- Algorithm:
- smooth
- residual
- restrict
- recurse

- interpolate

= Li-f

For each step of the algorithm:

- Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)
- Set the target buffer for rendering
- Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)
- Render a grid-sized quad with multitexturing

source buffer texture

source buffer texture

render target buffer

render target buffer

fragment program

- Solution buffer: four-channel floating point pixel buffer (p-buffer)
- one channel each for solution, residual, source term, and a debugging term
- toggle front and back surfaces used to hold old and new solution

- Operator map: contains the discretized operator (e.g., Laplacian)
- Red-black map: accelerate odd-even tests (see later)

- Jacobi method
- one matrix row:
- calculate new value for each solution vector element:
- in our application, the aij are the Laplacian (sparse matrix):

- Also factor in the source term
- Use Red-black map to update only half of the grid cells in each pass
- converges faster in practice
- known as red-black iteration
- requires two passes per iteration

- Apply operator (Laplacian) and source term to the current solution
- residual = k2T + S

- Store result in the target surface
- Use occlusion query to determine if all solution fragments are below threshold (e < threshold)
- occlusion query = true means all fragments are below threshold
- this is an L norm, which may be too strict
- less strict norms L1, L2, will require reduction or fragment accumulation register (not available yet), could run in CPU instead

- Average (restrict) current residual into coarser grid
- Iterate/smooth on coarser grid, solving k2 = -S
- Interpolate correction back into finer grid
- or restrict once more -> recursion
- use bilinear interpolation

- Update grid with this correction
- Iterate/smooth on fine grid

- Dirichlet (prescribed)
- Neumann (prescribed derivative)
- Mixed (coupled value and derivative)
- Uk: value at grid point k
- nk: normal at grid point k

- Periodic boundaries result in toroidal mapping
- Apply boundary conditions in smoothing pass

- Only need to compute at boundaries
- boundaries need significantly more computations
- restrict computations to boundaries

- GPUs do not allow branching
- or better, both branches are executed and the invalid fragment is discarded
- even more wasteful

- decompose domain into boundary and interior areas
- use general (boundary) and fastpath (interior) shaders
- run these in two separate passes, on respective domains

- Detect steady-state natively on GPU
- Minimize shader length
- Use special-case whenever possible
- Limit context switches

- How to detect convergence?
- L1 norm - average error
- L2 norm – RMS error (common in visual sim)
- L norm – max error (common in sci/eng apps)
- Can use occlusion query!

secs to steady statevs. grid size

- Minimize number of registers used
- Vectorize as much as possible
- Use the rasterizer to perform computations of linearly-varying values
- Pre-compute invariants on CPU
- Compute texture coodinate offsets in vertex shader

- Fast-path vs. slow-path
- write several variants of each fragment program to handle boundary cases
- eliminates conditionals in the fragment program
- equivalent to avoiding CPU inner-loop branching

fast path, no boundaries

slow path with boundaries

- Fast-path vs. slow-path
- write several variants of each fragment program to handle boundary cases
- eliminates conditionals in the fragment program
- equivalent to avoiding CPU inner-loop branching

secs per v-cyclevs. grid size

- Find best packing data of multiple grid levelsinto the pbuffer surfaces - many p-buffers

- Find best packing data of multiple grid levelsinto the pbuffer surfaces - two p-buffers

- Find best packing data of multiple grid levelsinto the pbuffer surfaces - a single p-buffer
- Still one front- and one back surface for iterative smoothing

- Remove context switching
- Can introduce operations with undefined results: reading/writing same surface
- Why do we need to do this?
- there is a chance that we write and read from the same surface at the same time

- Can we get away with it?
- Yes, we can. Just need to be careful to avoid these conflicts

- What about RGBA parallelism?
- was not used in this implemtation, may give another boost of factor 4

- Performance:

secs to steady statevs. grid size

Compute 4 values at a time

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead

- Possible additional vectorization:

Stacked domain

- Performance:

secs to steady statevs. grid size

CPU

GPU

What we need going forward:

- Superbuffers
- or: Universal support for multiple-surface pbuffers
- or: Cheap context switching

- Developer tools
- Debugging tools
- Documentation

- Global accumulator
- Ever increasing amounts of precision, memory
- Textures bigger than 2048 on a side

Hardware

David Kirk

Matt Papakipos

Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann

Fragment Programming

James Percy

Matt Pharr

General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck

Funding

NSF Award #0092793