A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware Nolan Goodnight Cliff Woolley Gregory LewinDavid Luebke Greg Humphreys University of Virginia augmented by Klaus Mueller, Stony Brook University

General-Purpose GPU Programming • Why do we port algorithms to the GPU? • How much faster can we expect it to be, really? • What is the challenge in porting?

Case Study Problem: Implement a Boundary Value Problem (BVP) solver using the GPU Could benefit an entire class of scientific and engineering applications, e.g.: • Heat transfer • Fluid flow

Related Work • Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms • Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid • Very similar to our system • Developed concurrently • Complementary approach

Driving problem: Fluid mechanics sim Problem domain is a warped disc: regular grid regular grid

BVPs: Background • Boundary value problems are sometimes governedby PDEs of the form: L=f • L is some operator •  is the problem domain • f is a forcing function (source term) • Given L and f, solve for .

BVPs: Example Heat Transfer • Find a steady-state temperature distribution T in a solid of thermal conductivity k with thermal source S • This requires solving a Poisson equation of the form: k2T = -S • This is a BVP where L is the Laplacian operator 2 All our applications require a Poisson solver.

BVPs: Solving • Most such problems cannot be solved analytically • Instead, discretize onto a grid to form a set of linear equations, then solve: • Direct elimination • Gauss-Seidel iteration • Conjugate-gradient • Strongly implicit procedures • Multigrid method

Multigrid method • Iteratively corrects an approximation to the solution • Operates at multiple grid resolutions • Low-resolution grids are used to correct higher-resolution grids recursively • Very fast, especially for large grids: O(n)

Multigrid method • Use coarser grid levels to recursively correct an approximation to the solution • may converge slowly on fine grid -> restrict to course grid • push out long wavelength errors quickly (single grid solvers only smooth out high frequency errors) • Algorithm: • smooth • residual  • restrict • recurse • interpolate  = Li-f

Implementation - Overview For each step of the algorithm: • Bind as texture maps the buffers that contain the necessary data (current solution, residual, source terms, etc.) • Set the target buffer for rendering • Activate a fragment program that performs the necessary kernel computation (smoothing, residual calculation, restriction, interpolation) • Render a grid-sized quad with multitexturing source buffer texture source buffer texture render target buffer render target buffer fragment program

Implementation - Overview

Input buffers • Solution buffer: four-channel floating point pixel buffer (p-buffer) • one channel each for solution, residual, source term, and a debugging term • toggle front and back surfaces used to hold old and new solution • Operator map: contains the discretized operator (e.g., Laplacian) • Red-black map: accelerate odd-even tests (see later)

Smoothing • Jacobi method • one matrix row: • calculate new value for each solution vector element: • in our application, the aij are the Laplacian (sparse matrix):

Smoothing • Also factor in the source term • Use Red-black map to update only half of the grid cells in each pass • converges faster in practice • known as red-black iteration • requires two passes per iteration

Calculate residual • Apply operator (Laplacian) and source term to the current solution • residual  = k2T + S • Store result in the target surface • Use occlusion query to determine if all solution fragments are below threshold (e < threshold) • occlusion query = true means all fragments are below threshold • this is an L norm, which may be too strict • less strict norms L1, L2, will require reduction or fragment accumulation register (not available yet), could run in CPU instead

Multigrid reduction and refinement • Average (restrict) current residual into coarser grid • Iterate/smooth on coarser grid, solving k2  = -S • Interpolate correction back into finer grid • or restrict once more -> recursion • use bilinear interpolation • Update grid with this correction • Iterate/smooth on fine grid

Boundary conditions • Dirichlet (prescribed) • Neumann (prescribed derivative) • Mixed (coupled value and derivative) • Uk: value at grid point k • nk: normal at grid point k • Periodic boundaries result in toroidal mapping • Apply boundary conditions in smoothing pass

Boundary conditions • Only need to compute at boundaries • boundaries need significantly more computations • restrict computations to boundaries • GPUs do not allow branching • or better, both branches are executed and the invalid fragment is discarded • even more wasteful • decompose domain into boundary and interior areas • use general (boundary) and fastpath (interior) shaders • run these in two separate passes, on respective domains

Optimizing the Solver • Detect steady-state natively on GPU • Minimize shader length • Use special-case whenever possible • Limit context switches

Optimizing the Solver: Steady-state • How to detect convergence? • L1 norm - average error • L2 norm – RMS error (common in visual sim) • L norm – max error (common in sci/eng apps) • Can use occlusion query! secs to steady statevs. grid size

Optimizing the Solver: Shader length • Minimize number of registers used • Vectorize as much as possible • Use the rasterizer to perform computations of linearly-varying values • Pre-compute invariants on CPU • Compute texture coodinate offsets in vertex shader

Optimizing the Solver: Special-case • Fast-path vs. slow-path • write several variants of each fragment program to handle boundary cases • eliminates conditionals in the fragment program • equivalent to avoiding CPU inner-loop branching fast path, no boundaries slow path with boundaries

Optimizing the Solver: Special-case • Fast-path vs. slow-path • write several variants of each fragment program to handle boundary cases • eliminates conditionals in the fragment program • equivalent to avoiding CPU inner-loop branching secs per v-cyclevs. grid size

Optimizing the Solver: Context-switching • Find best packing data of multiple grid levelsinto the pbuffer surfaces - many p-buffers

Optimizing the Solver: Context-switching • Find best packing data of multiple grid levelsinto the pbuffer surfaces - two p-buffers

Optimizing the Solver: Context-switching • Find best packing data of multiple grid levelsinto the pbuffer surfaces - a single p-buffer • Still one front- and one back surface for iterative smoothing

Optimizing the Solver: Context-switching • Remove context switching • Can introduce operations with undefined results: reading/writing same surface • Why do we need to do this? • there is a chance that we write and read from the same surface at the same time • Can we get away with it? • Yes, we can. Just need to be careful to avoid these conflicts • What about RGBA parallelism? • was not used in this implemtation, may give another boost of factor 4

Data Layout • Performance: secs to steady statevs. grid size

Compute 4 values at a time Requires source, residual, solution values to be in different buffers Complicates boundary calculations Adds setup and teardown overhead Data Layout • Possible additional vectorization: Stacked domain

Results: CPU vs. GPU • Performance: secs to steady statevs. grid size

Applications – Flow Simulation

Applications – High Dynamic Range CPU GPU

Conclusions What we need going forward: • Superbuffers • or: Universal support for multiple-surface pbuffers • or: Cheap context switching • Developer tools • Debugging tools • Documentation • Global accumulator • Ever increasing amounts of precision, memory • Textures bigger than 2048 on a side

Hardware David Kirk Matt Papakipos Driver Support Nick Triantos Pat Brown Stephen Ehmann Fragment Programming James Percy Matt Pharr General-purpose GPU Mark Harris Aaron Lefohn Ian Buck Funding NSF Award #0092793 Acknowledgements

A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware