GPU-Efficient Recursive Filtering and Summed-Area Tables. D. Nehab¹, A. Maximo¹, R. S. Lima², H. Hoppe³ (¹IMPA, ²Digitok, ³Microsoft Research)
Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs [Diagram: input, prologue, output]
Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs • Sequential dependency chain [Diagram: input, prologue, output]
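To make the feedback and the sequential dependency chain concrete, here is a minimal sketch of a first-order causal recursive filter. The recurrence `y[i] = b0*x[i] - a1*y[i-1]`, the sign convention, and the names `causal_first_order`, `a1`, `b0`, and `prologue` are assumptions chosen for illustration; the slides do not fix a notation.

```python
def causal_first_order(x, a1, b0=1.0, prologue=0.0):
    """Assumed convention: y[i] = b0*x[i] - a1*y[i-1], with y[-1] = prologue."""
    y = []
    prev = prologue              # the single output that precedes the input
    for xi in x:
        prev = b0 * xi - a1 * prev   # each output depends on the previous output
        y.append(prev)
    return y

# Impulse response: the feedback makes the response decay geometrically.
response = causal_first_order([1.0, 0.0, 0.0, 0.0], a1=-0.5)
print(response)  # [1.0, 0.5, 0.25, 0.125]
```

Because each `y[i]` needs `y[i-1]`, the loop over one row or column cannot be parallelized naively.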
Applications of recursive filtering • B-Spline (or other) interpolation [Diagram: input → recursive preprocessing step → coefficients → interpolation (from coefficients)]
Applications of recursive filtering • B-Spline (or other) interpolation • Fast, wide, Gaussian-blur approximation • Summed-area tables [Diagram: input → recursive filters → blurred]
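As an illustration of the recursive preprocessing step, the sketch below computes cubic B-spline coefficients for a 1D signal with one causal and one anticausal first-order pass. It follows the well-known Unser/Thévenaz formulation with mirror boundary conditions, which is an assumption here: the slides do not state which boundary handling the authors use. The function names are hypothetical.

```python
import math

def cubic_bspline_prefilter(s):
    """Cubic B-spline coefficients for samples s (mirror boundaries assumed)."""
    n = len(s)
    if n == 1:
        return list(s)
    z = math.sqrt(3.0) - 2.0           # pole of the cubic B-spline prefilter
    c = [6.0 * v for v in s]           # overall gain (1 - z)(1 - 1/z) = 6
    # Exact causal initialization for the mirrored signal (period 2n - 2).
    acc = c[0] + z ** (n - 1) * c[n - 1]
    for k in range(1, n - 1):
        acc += (z ** k + z ** (2 * n - 2 - k)) * c[k]
    c[0] = acc / (1.0 - z ** (2 * n - 2))
    for k in range(1, n):              # causal pass (forward)
        c[k] += z * c[k - 1]
    # Anticausal initialization, then anticausal pass (backward).
    c[n - 1] = (z / (z * z - 1.0)) * (z * c[n - 2] + c[n - 1])
    for k in range(n - 2, -1, -1):
        c[k] = z * (c[k + 1] - c[k])
    return c

def resample_at_integers(c):
    """Evaluate the spline at the sample positions: (c[k-1] + 4c[k] + c[k+1]) / 6."""
    n = len(c)
    def mirror(k):
        return -k if k < 0 else (2 * n - 2 - k if k >= n else k)
    return [(c[mirror(k - 1)] + 4.0 * c[k] + c[mirror(k + 1)]) / 6.0
            for k in range(n)]

samples = [1.0, 4.0, 2.0, 7.0, 3.0]
coeffs = cubic_bspline_prefilter(samples)
recovered = resample_at_integers(coeffs)   # matches samples up to rounding
```

Evaluating the spline at the original sample positions reproduces the input, which is the property that makes the prefilter necessary before interpolation.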
Causality and order • Recursive filters can be causal or anticausal • Causal filters run forward, anticausal filters run in the reverse direction • Filter order is simply the number r of feedback terms [Diagram: input, epilogue, output]
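A small sketch of both directions for a filter of order r. The recurrence `y[i] = x[i] - sum(a[k-1]*y[i-k])` and the names `causal`, `anticausal`, `prologue`, and `epilogue` are assumptions for illustration; the anticausal filter is written as the causal one applied to the reversed sequence.

```python
def causal(x, a, prologue=None):
    """Order-r causal filter, r = len(a) >= 1 (assumed convention):
    y[i] = x[i] - sum(a[k-1] * y[i-k] for k in 1..r).
    prologue holds the r outputs that precede x, oldest first."""
    r = len(a)
    hist = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        yi = xi - sum(a[k] * hist[-1 - k] for k in range(r))
        y.append(yi)
        hist = hist[1:] + [yi]       # keep only the last r outputs
    return y

def anticausal(x, a, epilogue=None):
    """Same recurrence run from the last sample to the first.
    epilogue holds the r outputs that follow x, nearest first."""
    rev_epilogue = list(reversed(epilogue)) if epilogue is not None else None
    return list(reversed(causal(list(reversed(x)), a, rev_epilogue)))

first_order = anticausal([0.0, 0.0, 0.0, 1.0], [-0.5])
second_order = causal([1.0, 0.0, 0.0, 0.0, 0.0], [-1.0, -1.0])
print(first_order)   # [0.125, 0.25, 0.5, 1.0]
print(second_order)  # [1.0, 1.0, 2.0, 3.0, 5.0]
```

The second call uses r = 2 feedback terms; with both coefficients set to -1 the impulse response is the Fibonacci sequence, which shows how each output depends on the two previous ones.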
Filter sequences and separability • Often, sequences of recursive filters are needed • Independent columns: causal, then anticausal • Independent rows: causal, then anticausal
Algorithm RT • The baseline algorithm • Process columns in parallel, then rows in parallel • Ruijters et al. 2010 “GPU prefilter […]” [Diagram: input → column processing → row processing → output]
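The structure of the baseline can be emulated sequentially as follows. This is only a sketch of the column-then-row ordering, assuming a first-order filter with the same coefficient `a1` in every pass and zero initial conditions; it is not the CUDA implementation. On a GPU, each list comprehension would correspond to one thread per column or per row.

```python
def causal(x, a1):
    """Assumed convention: y[i] = x[i] - a1*y[i-1], zero initial condition."""
    y, prev = [], 0.0
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    return y

def causal_anticausal(x, a1):
    forward = causal(x, a1)                    # causal pass
    return causal(forward[::-1], a1)[::-1]     # anticausal pass on its result

def transpose(img):
    return [list(col) for col in zip(*img)]

def filter_rt(img, a1):
    # Column processing: every column is independent of the others.
    cols = [causal_anticausal(c, a1) for c in transpose(img)]
    # Row processing: every row of the column-filtered image is independent.
    return [causal_anticausal(r, a1) for r in transpose(cols)]

image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0                              # unit impulse at the center
result = filter_rt(image, a1=-0.5)
h = causal_anticausal([0.0, 0.0, 1.0, 0.0, 0.0], a1=-0.5)
```

For the impulse image, `result[i][j]` equals `h[i] * h[j]`, which is the separability the slide relies on. The parallelism is limited to the number of columns or rows, and each of the four passes reads and writes the whole image.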
First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” [Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT.]
Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further
Increasing parallelism • Similar to parallel prefix-sum algorithms • Sengupta et al. 2007 “Scan primitives for GPU computing” • Dotsenko et al. 2008 “Fast scan algorithms […]” • Compute and store incomplete prologues • Fix incomplete prologues • Somewhat more complicated than a recursive invocation • Use prologues to compute and store causal results [Diagram: blocks with incomplete prologues]
Fixing incomplete prologues [Diagram labels: superposition, linearity]
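The three steps (compute incomplete prologues, fix them, recompute) can be sketched for a first-order filter as below. The recurrence `y[i] = x[i] - a1*y[i-1]`, the block size `b`, and all names are assumptions for illustration. The fix uses linearity and superposition: the response of a block to its true prologue `p` alone is `(-a1)**(i+1) * p`, so the true last output of a block is its incomplete value plus `(-a1)**b * p`.

```python
def causal(x, a1, prologue=0.0):
    """Assumed convention: y[i] = x[i] - a1*y[i-1], with y[-1] = prologue."""
    y, prev = [], prologue
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    return y

def blocked_causal(x, a1, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1 (parallel over blocks): incomplete prologues, i.e. the last
    # output of each block computed as if nothing preceded the block.
    incomplete = [causal(blk, a1)[-1] for blk in blocks]
    # Stage 2 (short sequential pass over blocks): fix the prologues.
    carry = (-a1) ** b               # effect of a unit prologue after b samples
    prologues = [0.0]
    for p_bar in incomplete[:-1]:
        prologues.append(p_bar + carry * prologues[-1])
    # Stage 3 (parallel over blocks): filter each block with its true prologue.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal(blk, a1, p))
    return out

signal = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0, 5.0, 3.0]
blocked = blocked_causal(signal, a1=-0.5, b=4)
reference = causal(signal, a1=-0.5)
```

`blocked` matches `reference` up to rounding. Only one value per block crosses block boundaries, which is what lets stages 1 and 3 run with one task per block instead of one per row or column.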
Algorithm 2 • Adds block parallelism • Sung et al. 1986 “Efficient […] recursive […]”, or • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [Diagram: input → stages with per-block fixes → output]
First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2.]
Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further • FLOP/IO ratio of recursive filters is too low • Can use even more FLOPs but must reduce IO • To do so, we introduce overlapping
Causal-anticausal overlapping • Start anticausal processing before causal is done • Saves reading and writing causal results! • Compute and store incomplete prologues & epilogues • Fix incomplete prologues & twice-incomplete epilogues • Twice-incomplete epilogues are trickier • Use them to compute and store anticausal results
Fixing twice-incomplete epilogues • Repeatedly apply linearity and superposition • Tedious derivation, simple result [Equation labels: corrected epilogue, corrected prologue, twice-incomplete epilogue]
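A first-order sketch of the overlapped scheme, under stated assumptions: both passes use `y[i] = x[i] - a1*y[i-1]` with the same `a1`, boundaries are zero, and the input length is a multiple of the block size `b`. The names and the derivation below are this sketch's own, not the authors' formulas. With `c = -a1`, the first anticausal output of a block is the twice-incomplete epilogue, plus `s` times the block's true prologue, plus `c**b` times the block's true epilogue, where `s` is the block's response to a unit prologue.

```python
def causal(x, a1, prologue=0.0):
    y, prev = [], prologue
    for xi in x:
        prev = xi - a1 * prev
        y.append(prev)
    return y

def anticausal(y, a1, epilogue=0.0):
    return causal(y[::-1], a1, epilogue)[::-1]

def overlapped(x, a1, b):
    assert len(x) % b == 0, "sketch assumes equal-size blocks"
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    decay = (-a1) ** b
    # Effect of a unit prologue on the first anticausal output of a block.
    s = anticausal(causal([0.0] * b, a1, 1.0), a1)[0]
    # Stage 1 (parallel): incomplete prologue and twice-incomplete epilogue.
    p_bar, e_bar = [], []
    for blk in blocks:
        y_bar = causal(blk, a1)                  # zero prologue
        p_bar.append(y_bar[-1])
        e_bar.append(anticausal(y_bar, a1)[0])   # zero epilogue, incomplete input
    # Stage 2 (sequential, forward): true prologue of every block.
    p = [0.0]
    for m in range(len(blocks) - 1):
        p.append(p_bar[m] + decay * p[m])
    # Stage 3 (sequential, backward): true epilogue of every block.
    e = [0.0] * len(blocks)
    for m in range(len(blocks) - 1, 0, -1):
        e[m - 1] = e_bar[m] + s * p[m] + decay * e[m]
    # Stage 4 (parallel): causal and anticausal passes fused per block.
    out = []
    for blk, pm, em in zip(blocks, p, e):
        out.extend(anticausal(causal(blk, a1, pm), a1, em))
    return out

signal = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0, 5.0, 3.0, 5.0, -8.0]
fused = overlapped(signal, a1=-0.5, b=4)
reference = anticausal(causal(signal, a1=-0.5), a1=-0.5)
```

In stage 4 the causal result of a block never leaves the block, which is the IO saving the slide refers to; the price is that each block is filtered twice.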
Algorithm 4 • Adds causal-anticausal overlapping • Eliminates reading and writing causal results • Both in column and in row processing • Modest increase in computation [Diagram: input → stages (fix both, fix both) → output]
First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation [Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2, 4.]
Algorithm 5 • Adds row-column overlapping • Eliminates reading and writing column results • Modest increase in computation [Diagram: input → stages (fix all!) → output]
Row-column overlapping rules • Fixing thrice-incomplete row-prologues • Fixing four-times-incomplete row-epilogues
First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation • Alg. 5 adds row-column overlapping • Eliminates additional 2hw of IO • Modest increase in computation [Plot: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: RT, 2, 4, 5.]
Second-order filter benchmarks • Alg. 4₂ uses causal-anticausal overlapping • Alg. 5₂ adds row-column overlapping • Added complexity outweighs IO reduction • Balance will change (hardware, compiler, implementation) [Plot: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: 4₂, 5₂.]
Gaussian blur results • CUFFT is in frequency domain • complexity • DIR is direct convolution • complexity • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA” • Overlapped recursive • 3rd-order approximation • complexity • van Vliet et al. 1998 “Recursive Gaussian derivative filters” • Implemented as 5₁ fused with 4₂ • Recursive approximation is faster • Even for modest-size images • Also for modest standard deviations [Plot: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT.]
Summed-area table benchmarks • Harris et al. 2008, GPU Gems 3 • “Parallel prefix-scan […]” • Multi-scan + transpose + multi-scan • Implemented with CUDPP • Hensley 2010, Gamefest • “High-quality depth of field” • Multi-wave method • Our improvements: + specialized row and column kernels + save only incomplete borders + fuse row and column stages • Overlapped SAT • Row-column overlapping • First-order filter, unit coefficient, no anticausal component [Plot: Summed-area Table (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008].]
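The last bullet can be read as follows: a summed-area table is a causal first-order filter with unit feedback (a running sum) applied along rows and then along columns, with no anticausal pass. The sketch below is a plain sequential version for illustration, not the overlapped GPU algorithm; `summed_area_table` and `box_sum` are hypothetical names.

```python
from itertools import accumulate

def summed_area_table(img):
    """sat[i][j] = sum of img[u][v] for all u <= i and v <= j."""
    rows = [list(accumulate(r)) for r in img]          # running sum along rows
    cols = [list(accumulate(c)) for c in zip(*rows)]   # then along columns
    return [list(r) for r in zip(*cols)]               # back to row-major order

def box_sum(sat, top, left, bottom, right):
    """Sum over the inclusive rectangle, using at most four table lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
print(sat)                        # [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
print(box_sum(sat, 1, 1, 2, 2))   # 28, i.e. 5 + 6 + 8 + 9
```

Once the table is built, any axis-aligned box sum costs a constant number of lookups regardless of the box size.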
Future work • Volumetric processing • Overlapping should generalize • Not enough shared memory (yet?) • CPU implementation • Blocking should increase L1 cache effectiveness • Is doubling the amount of computation worth it? • Solving general narrow-banded linear systems • Overlapping back- and forward-substitution
Conclusions • Recursive filters are useful in many applications • Cubic and quintic B-Spline interpolation • Gaussian-blur approximation • Summed-area table computation • We introduced parallel algorithms for GPUs • Overlapping reduces IO requirements • Leads to faster algorithms • Code is available from project page • Most is already there, rest is on the way