A multigrid solver for boundary value problems using programmable graphics hardware
Download
1 / 35

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware - PowerPoint PPT Presentation


  • 99 Views
  • Uploaded on

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware. Nolan Goodnight Cliff Woolley Gregory Lewin David Luebke Greg Humphreys. University of Virginia. augmented by Klaus Mueller, Stony Brook University. General-Purpose GPU Programming.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware' - rylee-hubbard


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
A multigrid solver for boundary value problems using programmable graphics hardware

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

Nolan Goodnight Cliff Woolley Gregory LewinDavid Luebke Greg Humphreys

University of Virginia

augmented by Klaus Mueller, Stony Brook University


General purpose gpu programming
General-Purpose GPU Programming Programmable Graphics Hardware

  • Why do we port algorithms to the GPU?

  • How much faster can we expect it to be, really?

  • What is the challenge in porting?


Case study
Case Study Programmable Graphics Hardware

Problem: Implement a Boundary Value Problem (BVP) solver using the GPU

Could benefit an entire class of scientific and engineering applications, e.g.:

  • Heat transfer

  • Fluid flow


Related work
Related Work Programmable Graphics Hardware

  • Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms

  • Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

    • Very similar to our system

      • Developed concurrently

      • Complementary approach


Driving problem fluid mechanics sim
Driving problem: Fluid mechanics sim Programmable Graphics Hardware

Problem domain is a warped disc:

regular grid

regular grid


Bvps background
BVPs: Background Programmable Graphics Hardware

  • Boundary value problems are sometimes governedby PDEs of the form:

    L=f

    • L is some operator

    •  is the problem domain

    • f is a forcing function (source term)

    • Given L and f, solve for .


Bvps example
BVPs: Example Programmable Graphics Hardware

Heat Transfer

  • Find a steady-state temperature distribution T in a solid of thermal conductivity k with thermal source S

  • This requires solving a Poisson equation of the form:

    k2T = -S

  • This is a BVP where L is the Laplacian operator 2

    All our applications require a Poisson solver.


Bvps solving
BVPs: Solving Programmable Graphics Hardware

  • Most such problems cannot be solved analytically

  • Instead, discretize onto a grid to form a set of linear equations, then solve:

    • Direct elimination

    • Gauss-Seidel iteration

    • Conjugate-gradient

    • Strongly implicit procedures

    • Multigrid method


Multigrid method
Multigrid method Programmable Graphics Hardware

  • Iteratively corrects an approximation to the solution

  • Operates at multiple grid resolutions

  • Low-resolution grids are used to correct higher-resolution grids recursively

  • Very fast, especially for large grids: O(n)


Multigrid method1
Multigrid method Programmable Graphics Hardware

  • Use coarser grid levels to recursively correct an approximation to the solution

    • may converge slowly on fine grid -> restrict to course grid

      • push out long wavelength errors quickly (single grid solvers only smooth out high frequency errors)

  • Algorithm:

    • smooth

    • residual 

    • restrict

      • recurse

    • interpolate

 = Li-f


Implementation overview
Implementation - Overview Programmable Graphics Hardware

For each step of the algorithm:

  • Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.)

  • Set the target buffer for rendering

  • Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation)

  • Render a grid-sized quad with multitexturing

source buffer texture

source buffer texture

render target buffer

render target buffer

fragment program


Implementation overview1
Implementation - Overview Programmable Graphics Hardware


Input buffers
Input buffers Programmable Graphics Hardware

  • Solution buffer: four-channel floating point pixel buffer (p-buffer)

    • one channel each for solution, residual, source term, and a debugging term

    • toggle front and back surfaces used to hold old and new solution

  • Operator map: contains the discretized operator (e.g., Laplacian)

  • Red-black map: accelerate odd-even tests (see later)


Smoothing
Smoothing Programmable Graphics Hardware

  • Jacobi method

    • one matrix row:

    • calculate new value for each solution vector element:

    • in our application, the aij are the Laplacian (sparse matrix):


Smoothing1
Smoothing Programmable Graphics Hardware

  • Also factor in the source term

  • Use Red-black map to update only half of the grid cells in each pass

    • converges faster in practice

    • known as red-black iteration

    • requires two passes per iteration


Calculate residual
Calculate residual Programmable Graphics Hardware

  • Apply operator (Laplacian) and source term to the current solution

    • residual  = k2T + S

  • Store result in the target surface

  • Use occlusion query to determine if all solution fragments are below threshold (e < threshold)

    • occlusion query = true means all fragments are below threshold

    • this is an L norm, which may be too strict

    • less strict norms L1, L2, will require reduction or fragment accumulation register (not available yet), could run in CPU instead


Multigrid reduction and refinement
Multigrid reduction and refinement Programmable Graphics Hardware

  • Average (restrict) current residual into coarser grid

  • Iterate/smooth on coarser grid, solving k2  = -S

  • Interpolate correction back into finer grid

    • or restrict once more -> recursion

    • use bilinear interpolation

  • Update grid with this correction

  • Iterate/smooth on fine grid


Boundary conditions
Boundary conditions Programmable Graphics Hardware

  • Dirichlet (prescribed)

  • Neumann (prescribed derivative)

  • Mixed (coupled value and derivative)

    • Uk: value at grid point k

    • nk: normal at grid point k

  • Periodic boundaries result in toroidal mapping

  • Apply boundary conditions in smoothing pass


Boundary conditions1
Boundary conditions Programmable Graphics Hardware

  • Only need to compute at boundaries

    • boundaries need significantly more computations

    • restrict computations to boundaries

  • GPUs do not allow branching

    • or better, both branches are executed and the invalid fragment is discarded

    • even more wasteful

  • decompose domain into boundary and interior areas

    • use general (boundary) and fastpath (interior) shaders

    • run these in two separate passes, on respective domains


Optimizing the solver
Optimizing the Solver Programmable Graphics Hardware

  • Detect steady-state natively on GPU

  • Minimize shader length

  • Use special-case whenever possible

  • Limit context switches


Optimizing the solver steady state
Optimizing the Solver: Steady-state Programmable Graphics Hardware

  • How to detect convergence?

    • L1 norm - average error

    • L2 norm – RMS error (common in visual sim)

    • L norm – max error (common in sci/eng apps)

      • Can use occlusion query!

secs to steady statevs. grid size


Optimizing the solver shader length
Optimizing the Solver: Shader length Programmable Graphics Hardware

  • Minimize number of registers used

  • Vectorize as much as possible

  • Use the rasterizer to perform computations of linearly-varying values

  • Pre-compute invariants on CPU

  • Compute texture coodinate offsets in vertex shader


Optimizing the solver special case
Optimizing the Solver: Special-case Programmable Graphics Hardware

  • Fast-path vs. slow-path

    • write several variants of each fragment program to handle boundary cases

    • eliminates conditionals in the fragment program

    • equivalent to avoiding CPU inner-loop branching

fast path, no boundaries

slow path with boundaries


Optimizing the solver special case1
Optimizing the Solver: Special-case Programmable Graphics Hardware

  • Fast-path vs. slow-path

    • write several variants of each fragment program to handle boundary cases

    • eliminates conditionals in the fragment program

    • equivalent to avoiding CPU inner-loop branching

secs per v-cyclevs. grid size


Optimizing the solver context switching
Optimizing the Solver: Context-switching Programmable Graphics Hardware

  • Find best packing data of multiple grid levelsinto the pbuffer surfaces - many p-buffers


Optimizing the solver context switching1
Optimizing the Solver: Context-switching Programmable Graphics Hardware

  • Find best packing data of multiple grid levelsinto the pbuffer surfaces - two p-buffers


Optimizing the solver context switching2
Optimizing the Solver: Context-switching Programmable Graphics Hardware

  • Find best packing data of multiple grid levelsinto the pbuffer surfaces - a single p-buffer

  • Still one front- and one back surface for iterative smoothing


Optimizing the solver context switching3
Optimizing the Solver: Context-switching Programmable Graphics Hardware

  • Remove context switching

    • Can introduce operations with undefined results: reading/writing same surface

    • Why do we need to do this?

      • there is a chance that we write and read from the same surface at the same time

    • Can we get away with it?

      • Yes, we can. Just need to be careful to avoid these conflicts

    • What about RGBA parallelism?

      • was not used in this implemtation, may give another boost of factor 4


Data layout
Data Layout Programmable Graphics Hardware

  • Performance:

secs to steady statevs. grid size


Data layout1

Compute 4 values at a time Programmable Graphics Hardware

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead

Data Layout

  • Possible additional vectorization:

Stacked domain


Results cpu vs gpu
Results: CPU vs. GPU Programmable Graphics Hardware

  • Performance:

secs to steady statevs. grid size


Applications flow simulation
Applications – Flow Simulation Programmable Graphics Hardware


Applications high dynamic range
Applications – High Dynamic Range Programmable Graphics Hardware

CPU

GPU


Conclusions
Conclusions Programmable Graphics Hardware

What we need going forward:

  • Superbuffers

    • or: Universal support for multiple-surface pbuffers

    • or: Cheap context switching

  • Developer tools

    • Debugging tools

    • Documentation

  • Global accumulator

  • Ever increasing amounts of precision, memory

    • Textures bigger than 2048 on a side


Acknowledgements

Hardware Programmable Graphics Hardware

David Kirk

Matt Papakipos

Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann

Fragment Programming

James Percy

Matt Pharr

General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck

Funding

NSF Award #0092793

Acknowledgements