GPU-Efficient Recursive Filtering and Summed-Area Tables

D. Nehab¹, A. Maximo¹, R. S. Lima², H. Hoppe³
¹IMPA, ²Digitok, ³Microsoft Research

Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain
[Figure: input, prologue, output]
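
The feedback and the sequential dependency chain can be illustrated with a minimal first-order sketch. The function name `causal_first_order`, the recurrence y[i] = x[i] + a * y[i-1], and its sign convention are assumptions made for illustration; the slides do not give the formula.

```python
def causal_first_order(x, a, prologue=0.0):
    """First-order causal recursive filter (illustrative convention):
    y[i] = x[i] + a * y[i-1], with y[-1] = prologue."""
    y = []
    prev = prologue  # the output "before" the first sample
    for xi in x:
        prev = xi + a * prev  # each output depends on the previous output
        y.append(prev)
    return y

# Impulse response with a = 0.5: the feedback makes it decay geometrically.
print(causal_first_order([1.0, 0.0, 0.0, 0.0], a=0.5))
```

Because every output needs the one before it, a single row or column cannot be split among threads without extra work, which is what the later slides address.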

Applications of recursive filtering
• B-Spline (or other) interpolation
[Figure: input, recursive preprocessing step, coefficients, interpolation (from coefficients)]

Applications of recursive filtering
• B-Spline (or other) interpolation
• Fast, wide, Gaussian-blur approximation
• Summed-area tables
[Figure: input, recursive filters, blurred]

Causality and order
• Recursive filters can be causal or anticausal
• Causal goes forward, anticausal goes in the reverse direction
• Filter order is simply the number r of feedback terms
[Figure: input, epilogue, output]
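
As a rough sketch of these definitions, the code below implements an order-r causal filter and obtains the anticausal one by running the same recurrence over the reversed sequence. The names `causal` and `anticausal`, the coefficient list `coeffs`, and the sign convention y[i] = x[i] + sum_k coeffs[k] * y[i-1-k] are illustrative assumptions, not taken from the slides.

```python
def causal(x, coeffs, prologue=None):
    """Order-r causal filter, r = len(coeffs) >= 1 (illustrative convention):
    y[i] = x[i] + sum_k coeffs[k] * y[i-1-k].
    prologue[k] holds y[-1-k]; it defaults to zeros."""
    r = len(coeffs)
    hist = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        yi = xi + sum(c * h for c, h in zip(coeffs, hist))
        y.append(yi)
        hist = [yi] + hist[:-1]  # newest output first, drop the oldest
    return y

def anticausal(y, coeffs, epilogue=None):
    """Same recurrence run from the last sample to the first:
    z[i] = y[i] + sum_k coeffs[k] * z[i+1+k].
    epilogue[k] holds z[n+k]; it defaults to zeros."""
    return causal(y[::-1], coeffs, epilogue)[::-1]

print(causal([1.0, 0.0, 0.0], [0.5]))      # first order, forward
print(anticausal([0.0, 0.0, 1.0], [0.5]))  # first order, reverse
```

The prologue plays the role of initial feedback for the causal pass and the epilogue does the same for the anticausal pass.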

Filter sequences and separability
• Often, sequences of recursive filters are needed
• Independent columns
  • Causal
  • Anticausal
• Independent rows
  • Causal
  • Anticausal

Algorithm RT
• The baseline algorithm
• Process columns in parallel, then rows in parallel
  • Ruijters et al. 2010 “GPU prefilter […]”
[Figure: input, stages (column processing, row processing), output]
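
The column-then-row structure of the baseline can be sketched on the CPU with NumPy, where each loop iteration updates all columns at once and stands in for one thread per column. This is not the implementation of Ruijters et al.; the names `causal_axis0`, `filter_columns`, `rt_like`, and the first-order recurrence with coefficient `a` are assumptions for illustration.

```python
import numpy as np

def causal_axis0(img, a):
    """Causal first-order pass down every column at once:
    out[i] = img[i] + a * out[i-1], with zero prologue."""
    out = np.empty(img.shape, dtype=float)
    prev = np.zeros(img.shape[1])
    for i in range(img.shape[0]):  # sequential along a column,
        prev = img[i] + a * prev   # but vectorized across columns
        out[i] = prev
    return out

def filter_columns(img, a):
    """Causal pass followed by anticausal pass along columns."""
    y = causal_axis0(img, a)
    return causal_axis0(y[::-1], a)[::-1]

def rt_like(img, a):
    """Process all columns, then all rows (rows via transposition)."""
    cols = filter_columns(img, a)
    return filter_columns(cols.T, a).T

img = np.zeros((3, 3))
img[1, 1] = 1.0
print(rt_like(img, 0.5))
```

In this arrangement every pass reads and writes the whole image, which is the IO cost the overlapped algorithms later reduce.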

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curve: RT]

Optimization roadmap
• Modern GPUs have several hundred cores
• Latency hiding requires many times more tasks
• Images are not large enough: must parallelize further

Increasing parallelism
• Similar to parallel prefix-sum algorithms
  • Sengupta et al. 2007 “Scan primitives for GPU computing”
  • Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
  • Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results
[Figure: blocks and their incomplete prologues, marked ✗]

Fixing incomplete prologues
[Figure: incomplete prologue (✗) corrected using linearity and superposition]
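
A minimal first-order sketch of this fix, under assumed conventions (recurrence y[i] = x[i] + a * y[i-1], block size `b`, hypothetical function names): each block is first filtered with a zero prologue, and the true prologue is then recovered by superposition, since a prologue p alone contributes a**b * p to the last output of a block of length b.

```python
def causal_block(x, a, prologue=0.0):
    """y[i] = x[i] + a * y[i-1], with y[-1] = prologue (illustrative)."""
    y, prev = [], prologue
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y

def blocked_causal(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1 (independent per block): filter with zero prologue and keep
    # only the last output, the "incomplete prologue" of the next block.
    incomplete = [causal_block(blk, a)[-1] for blk in blocks]
    # Stage 2 (short sequential pass over blocks): fix the prologues.
    # By linearity, output(input, p) = output(input, 0) + output(0, p),
    # and output(0, p) ends with a**len(block) * p.
    prologues = [0.0]
    for blk, p_bar in zip(blocks[:-1], incomplete[:-1]):
        prologues.append(p_bar + a ** len(blk) * prologues[-1])
    # Stage 3 (independent per block): refilter with the fixed prologue.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal_block(blk, a, p))
    return out

x = [float(i % 7) for i in range(20)]
print(blocked_causal(x, 0.5, 4) == causal_block(x, 0.5))
```

Stages 1 and 3 have no dependencies between blocks, so they expose the extra parallelism; only the per-block prologue values pass through the sequential stage.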

Algorithm 2
• Adds block parallelism
  • Sung et al. 1986 “Efficient […] recursive […]”, or
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Figure: input, stages (fix, fix, fix, fix), output]

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2]

Optimization roadmap
• Modern GPUs have several hundred cores
• Latency hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping

Causal-anticausal overlapping
• Start anticausal processing before causal is done
  • Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
  • Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results

Fixing twice-incomplete epilogues
• Repeatedly apply linearity and superposition
• Tedious derivation, simple result
[Figure: corrected epilogue, corrected prologue, twice-incomplete epilogue]
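
For a first-order filter the correction can be worked out directly, as in the sketch below. It is not the authors' general derivation: the recurrences y[i] = x[i] + a * y[i-1] and z[i] = y[i] + a * z[i+1], the block size `b`, and all function names are assumptions. Each block stores an incomplete prologue and a twice-incomplete epilogue (computed with zero prologue and zero epilogue); the epilogue fix then adds the effect of the true prologue and of the true epilogue coming from the next block.

```python
def causal(x, a, p=0.0):
    """y[i] = x[i] + a * y[i-1], with y[-1] = p (illustrative)."""
    y, prev = [], p
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y

def anticausal(y, a, e=0.0):
    """z[i] = y[i] + a * z[i+1], with z[n] = e (illustrative)."""
    return causal(y[::-1], a, e)[::-1]

def overlapped(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    m_blocks = len(blocks)
    # Stage 1 (per block): causal then anticausal with zero borders.
    p_bar, e_bar = [], []
    for blk in blocks:
        y_bar = causal(blk, a)
        z_bar = anticausal(y_bar, a)
        p_bar.append(y_bar[-1])  # incomplete prologue for next block
        e_bar.append(z_bar[0])   # twice-incomplete epilogue for previous
    # Stage 2a: fix prologues, left to right.
    P = [0.0] * m_blocks
    for m in range(1, m_blocks):
        P[m] = p_bar[m - 1] + a ** len(blocks[m - 1]) * P[m - 1]
    # Stage 2b: fix epilogues, right to left. By superposition, the true
    # first output of block m is e_bar[m], plus the response to the true
    # prologue P[m] alone (factor c), plus a**n times the true epilogue.
    E = [0.0] * m_blocks
    for m in range(m_blocks - 1, 0, -1):
        n = len(blocks[m])
        c = anticausal(causal([0.0] * n, a, 1.0), a)[0]
        E[m - 1] = e_bar[m] + c * P[m] + a ** n * E[m]
    # Stage 3 (per block): both passes with fixed borders; the causal
    # result stays local to the block and is never stored.
    out = []
    for blk, p, e in zip(blocks, P, E):
        out.extend(anticausal(causal(blk, a, p), a, e))
    return out

x = [float(i % 7) for i in range(20)]
ref = anticausal(causal(x, 0.5), 0.5)
print(max(abs(u - v) for u, v in zip(overlapped(x, 0.5, 4), ref)))
```

The factor `c` is obtained here by filtering a zero block with a unit prologue; in a real kernel it would presumably be precomputed once per block size.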

Algorithm 4
• Adds causal-anticausal overlapping
  • Eliminates reading and writing causal results
  • Both in column and in row processing
  • Modest increase in computation
[Figure: input, stages (fix both, fix both), output]

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2, 4]

Algorithm 5
• Adds row-column overlapping
  • Eliminates reading and writing column results
  • Modest increase in computation
[Figure: input, stages (fix all!), output]

Start from input and global borders

Load blocks into shared memory

Compute & store incomplete borders

All borders in global memory

Fix incomplete borders

Fix twice-incomplete borders

Fix thrice-incomplete borders

Fix four-times-incomplete borders

Done fixing all borders

Load blocks into shared memory

Finish causal columns

Finish anticausal columns

Finish causal rows

Finish anticausal rows

Store results to global memory

Done!

Row-column overlapping rules
• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
• Alg. 5 adds row-column overlapping
  • Eliminates additional 2hw of IO
  • Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2, 4, 5]

Second-order filter benchmarks
• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
  • Added complexity outweighs IO reduction
  • Balance will change (hardware, compiler, implementation)
[Plot: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: 4₂, 5₂]

Gaussian blur results
• CUFFT is in frequency domain
  • Complexity: […]
• DIR is direct convolution
  • Complexity: […]
  • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Overlapped recursive
  • 3rd-order approximation
  • Complexity: […]
  • van Vliet et al. 1998 “Recursive Gaussian derivative filters”
  • Implemented as 5₁ fused with 4₂
• Recursive approximation is faster
  • Even for modest-size images
  • Also for modest standard deviations
[Plot: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT]

Summed-area table benchmarks
• Harris et al. 2008, GPU Gems 3
  • “Parallel prefix-scan […]”
  • Multi-scan + transpose + multi-scan
  • Implemented with CUDPP
• Hensley 2010, Gamefest
  • “High-quality depth of field”
  • Multi-wave method
• Our improvements
  + Specialized row and column kernels
  + Save only incomplete borders
  + Fuse row and column stages
• Overlapped SAT
  • Row-column overlapping
  • First-order filter, unit coefficient, no anticausal component
[Plot: Summed-area Table (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008]]
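
The last bullet can be checked with a small NumPy sketch: a causal first-order pass with unit feedback coefficient along columns and then rows yields the summed-area table, the same result as two cumulative sums. The names `causal_axis0` and `summed_area_table` are illustrative, and this sequential CPU version does not reproduce the overlapped GPU algorithm.

```python
import numpy as np

def causal_axis0(img, a):
    """Causal first-order pass down every column:
    out[i] = img[i] + a * out[i-1], with zero prologue."""
    out = np.empty(img.shape, dtype=float)
    prev = np.zeros(img.shape[1])
    for i in range(img.shape[0]):
        prev = img[i] + a * prev
        out[i] = prev
    return out

def summed_area_table(img):
    """Unit coefficient, causal only: columns first, then rows."""
    cols = causal_axis0(img, 1.0)
    return causal_axis0(cols.T, 1.0).T

img = np.arange(12, dtype=float).reshape(3, 4)
sat = summed_area_table(img)
print(np.allclose(sat, img.cumsum(axis=0).cumsum(axis=1)))
```

Each entry of the table holds the sum of all input values above and to the left of it, inclusive, so the bottom-right entry equals the sum of the whole image.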

Future work
• Volumetric processing
  • Overlapping should generalize
  • Not enough shared memory (yet?)
• CPU implementation
  • Blocking should increase L1 cache effectiveness
  • Is doubling the amount of computation worth it?
• Solving general narrow-banded linear systems
  • Overlapping back- and forward-substitution

Conclusions
• Recursive filters are useful in many applications
  • Cubic and quintic B-Spline interpolation
  • Gaussian-blur approximation
  • Summed-area table computation
• We introduced parallel algorithms for GPUs
  • Overlapping reduces IO requirements
  • Leads to faster algorithms
• Code is available from project page
  • Most is already there, rest is on the way
