
GPU-Efficient Recursive Filtering and Summed-Area Tables



Presentation Transcript


  1. GPU-Efficient Recursive Filtering and Summed-Area Tables Jeremiah van Oosten Reinier van Oeveren

  2. Table of Contents • Introduction • Related Works • Prefix Sums and Scans • Recursive Filtering • Summed-Area Tables • Problem Definition • Parallelization Strategies • Baseline (Algorithm RT) • Block Notation • Inter-block Parallelism • Kernel Fusion (Algorithm 2) • Overlapping • Causal-Anticausal overlapping (Algorithm 3 & 4) • Row-Column Causal-Anticausal overlapping (Algorithm 5) • Summed-Area Tables • Overlapped Summed-Area Tables (Algorithm SAT) • Results • Conclusion

  3. Introduction

  4. Introduction • Linear filtering is commonly used to blur, sharpen or down-sample images. • A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).

  5. Introduction • The cost of the image filter can be reduced by using a recursive filter, in which case previous results are used to compute the current value. • The cost is then reduced to O(hwr), where r is the number of recursive feedbacks.

  6. Recursive Filters • At each step, the filter produces an output element as a linear combination of the input element and previously computed output elements. • Total cost: O(hwr).
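
The recurrence described on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' code: the sign convention (feedback terms are subtracted), the name `causal_filter`, and the "oldest first" layout of `prologue` are assumptions made for illustration.

```python
def causal_filter(x, a, prologue):
    """Order-r causal recursive filter over a 1D sequence.

    Assumed convention (not taken from the slides):
        y[i] = x[i] - sum(a[k-1] * y[i-k] for k in 1..r)
    Outputs "before the start" come from `prologue`, oldest first.
    """
    r = len(a)
    assert r >= 1 and len(prologue) == r
    history = list(prologue)          # last r outputs, oldest first
    y = []
    for xi in x:
        # history[-1 - k] is the output computed k+1 steps ago
        yi = xi - sum(a[k] * history[-1 - k] for k in range(r))
        y.append(yi)
        history = history[1:] + [yi]  # slide the window by one output
    return y


# With a single feedback coefficient of -1 the filter is a running sum.
print(causal_filter([1, 2, 3, 4], [-1.0], [0.0]))
```

Each output costs r multiply-adds regardless of how wide the equivalent direct filter would be, which is where the O(hwr) total comes from.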

  7. Recursive Filters • Applications of recursive filters • Low-pass filtering, such as Gaussian kernels • Inverse convolution • Summed-area tables [Figure: an input image and its blurred result after recursive filtering]

  8. Causality • Recursive filters can be causal or anticausal (or non-causal). • Causal filters operate on previous values. • Anticausal filters operate on “future” values.

  9. Anticausal • Anticausal filters operate on “future” values.
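
An anticausal filter can be sketched as the mirror image of a causal one: it walks the sequence from the last element to the first and feeds back outputs that lie further along the sequence. The function name, the subtractive sign convention, and the "nearest first" layout of `epilogue` are assumptions for illustration only.

```python
def anticausal_filter(y, a, epilogue):
    """Order-r anticausal recursive filter over a 1D sequence.

    Assumed convention (not taken from the slides):
        z[i] = y[i] - sum(a[k-1] * z[i+k] for k in 1..r)
    Outputs "past the end" come from `epilogue`, nearest first.
    """
    r = len(a)
    assert r >= 1 and len(epilogue) == r
    future = list(epilogue)           # next r outputs, nearest first
    z = [0.0] * len(y)
    for i in range(len(y) - 1, -1, -1):
        zi = y[i] - sum(a[k] * future[k] for k in range(r))
        z[i] = zi
        future = [zi] + future[:-1]   # newest output becomes the nearest
    return z


# With a single feedback coefficient of -1 this yields suffix sums.
print(anticausal_filter([1, 2, 3, 4], [-1.0], [0.0]))
```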

  10. Filter Sequences • It is often required to perform a sequence of recursive image filters. • Independent Columns • Causal • Anticausal • Independent Rows • Causal • Anticausal [Diagram labels: P, P’, X, Y, Z, U, V, E’, E]

  11. Maximizing Parallelism • The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU. • The latest GPU from NVIDIA has 2,668 shader cores. Processing even large images (2048x2048) will not make full use of all available cores. • Underutilization of the GPU cores does not allow for latency hiding. • We need a way to make better use of the GPU without increasing I/O.

  12. Overlapping • In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables”, Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters.

  13. Block Partitioning • Partition the image into 2D blocks of size b × b.

  14. Related Works

  15. Prefix Sums and Scans • A prefix sum is a simple case of a first-order recursive filter. • A scan generalizes the recurrence using an arbitrary binary associative operator. • Parallel prefix sums and scans are important building blocks for numerous algorithms. • [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007] • An optimized implementation comes with the CUDPP library [2011].
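
To make the difference between a prefix sum and a general scan concrete, here is a small sketch in plain Python. Both function names are hypothetical. The second function follows the step-doubling scheme commonly attributed to Hillis and Steele; it is simulated sequentially here and is not claimed to be what CUDPP does internally.

```python
from operator import add


def inclusive_scan(xs, op):
    """Sequential inclusive scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out = []
    acc = None
    for i, x in enumerate(xs):
        acc = x if i == 0 else op(acc, x)
        out.append(acc)
    return out


def doubling_scan(xs, op):
    """Scan in about log2(n) rounds; each round could run fully in parallel.

    Correct only because `op` is associative.
    """
    out = list(xs)
    step = 1
    while step < len(out):
        prev = out[:]                       # snapshot read by every "thread"
        for i in range(step, len(out)):
            out[i] = op(prev[i - step], prev[i])
        step *= 2
    return out


data = [3, 1, 4, 1, 5]
print(inclusive_scan(data, add))   # prefix sum
print(inclusive_scan(data, max))   # running maximum, same code path
print(doubling_scan(data, add))    # same prefix sum, fewer dependent rounds
```

Associativity is what allows the partial results to be regrouped, so the dependent chain shrinks from n steps to roughly log2(n) rounds.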

  16. Recursive Filtering • A generalization of the prefix sum using a weighted combination of prior outputs. • This can be implemented as a scan operation with redefined basic operators. • Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.
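
One way to read "a scan with redefined basic operators" is sketched below for a first-order filter. Each input element is turned into an affine map `y -> m*y + v`, and the scan operator composes two such maps. The pair encoding, the names `combine` and `first_order_filter_by_scan`, the zero initial condition, and the subtractive sign convention are all assumptions for illustration.

```python
def combine(left, right):
    """Compose two affine maps: apply `left` first, then `right`.

    A pair (m, v) stands for the map y -> m * y + v.
    The operator is associative, so it is a valid scan operator.
    """
    m1, v1 = left
    m2, v2 = right
    return (m2 * m1, m2 * v1 + v2)


def first_order_filter_by_scan(x, a1):
    """Compute y[i] = x[i] - a1 * y[i-1] (with y[-1] = 0) as a scan."""
    elems = [(-a1, xi) for xi in x]   # one affine map per input element
    out, acc = [], None
    for e in elems:                   # sequential here; any scan would do
        acc = e if acc is None else combine(acc, e)
        out.append(acc[1])            # offset part is the filter output
    return out


print(first_order_filter_by_scan([1, 2, 3, 4], -1.0))
```

Because `combine` is associative, the sequential loop could be swapped for any parallel scan without changing the result (up to floating-point rounding).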

  17. Recursive Filtering • Sung and Mitra [1986] use block parallelism and split the computation into two parts: • One computation based only on the block data, assuming zero initial conditions. • One computation based only on the initial conditions, assuming zero block data.
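
The two-part split works because the filter is linear. The sketch below checks this on a small second-order example: filtering the data with zero initial conditions, filtering zeros with the real initial conditions, and adding the two gives the same output as filtering directly. The helper names, coefficient values, and sign convention are illustrative assumptions.

```python
def causal_filter(x, a, prologue):
    """y[i] = x[i] - sum(a[k-1] * y[i-k]); prologue holds y[-r..-1], oldest first."""
    r = len(a)
    history = list(prologue)
    y = []
    for xi in x:
        yi = xi - sum(a[k] * history[-1 - k] for k in range(r))
        y.append(yi)
        history = history[1:] + [yi]
    return y


def split_filter(x, a, prologue):
    """Same result as causal_filter, computed as two independent parts."""
    zero_state = causal_filter(x, a, [0.0] * len(a))         # block data only
    zero_input = causal_filter([0.0] * len(x), a, prologue)  # initial conditions only
    return [u + v for u, v in zip(zero_state, zero_input)]


x = [1.0, 2.0, 3.0]
a = [-0.5, 0.25]          # hypothetical feedback coefficients
p = [4.0, 8.0]            # hypothetical initial conditions
print(causal_filter(x, a, p))
print(split_filter(x, a, p))
```

The first part needs nothing from neighbouring blocks, which is what makes it attractive for block-parallel execution.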

  18. Summed-Area Tables • Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads. [Figure: a rectangle of given width and height whose corner samples UL, UR, LL and LR are combined with + and − signs]
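
A small sketch of the four-read lookup follows. The table is padded with a leading row and column of zeros so that no boundary checks are needed; that padding, the half-open rectangle convention, and the function names are choices made for this example rather than anything stated on the slide.

```python
def summed_area_table(img):
    """sat[i][j] = sum of img over rows < i and columns < j (zero-padded)."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        row_sum = 0
        for j in range(w):
            row_sum += img[i][j]
            sat[i + 1][j + 1] = sat[i][j + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 using 4 reads."""
    return (sat[bottom][right]    # LR
            - sat[top][right]     # UR
            - sat[bottom][left]   # LL
            + sat[top][left])     # UL


def box_average(sat, top, left, bottom, right):
    area = (bottom - top) * (right - left)
    return box_sum(sat, top, left, bottom, right) / area


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
sat = summed_area_table(img)
print(box_sum(sat, 1, 1, 3, 3))      # 5 + 6 + 8 + 9
print(box_average(sat, 0, 0, 3, 3))  # mean of all nine pixels
```

The cost of a lookup is independent of the rectangle size, which is the property the slide refers to.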

  19. Summed-Area Tables • The paper titled “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image. (A 256x256 image requires 16 passes to compute.) [Figure: passes alternating between Image A and Image B]

  20. Summed-Area Tables • In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing the result in intermediate shared memory. Now a 256x256 image only requires 4 passes when reading 16 samples per pass.
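
The pass counts quoted on the last two slides are consistent with a simple model, sketched here: each pass multiplies the distance covered along one dimension by the number of samples read, and rows and columns are processed separately. The model and the function name are assumptions for illustration; the slides only give the resulting counts.

```python
def passes_needed(size, samples_per_pass=2):
    """Passes for a size x size image under the assumed doubling model."""
    per_dimension = 0
    reach = 1
    while reach < size:
        reach *= samples_per_pass   # each pass widens the summed span
        per_dimension += 1
    return 2 * per_dimension        # horizontal passes plus vertical passes


print(passes_needed(256, 2))    # 2 samples per pass: 8 + 8 passes
print(passes_needed(256, 16))   # 16 samples per pass: 2 + 2 passes
```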

  21. Problem Definition

  22. Problem Definition • Causal recursive filters of order r are characterized by a set of r feedback coefficients. • Given a prologue vector (the initial conditions) and an input vector of any size, the filter produces an output vector of the same size as the input.

  23. Problem Definition • Causal recursive filters depend on a prologue vector. • The anticausal filter is similar: given an input vector and an epilogue vector, it produces an output vector of the same size as the input.

  24. Problem Definition • For row processing, we define an extended causal filter and an extended anticausal filter.

  25. Problem Definition • With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left). • Independent Columns • Causal • Anticausal • Independent Rows • Causal • Anticausal [Diagram labels: P, P’, X, Y, Z, U, V, E’, E]

  26. Problem Definition • The goal is to implement this algorithm on the GPU to make full use of all available resources. • Maximize occupancy by splitting the problem up to make use of all cores. • Reduce I/O to global memory. • Must break the dependency chain in order to increase task parallelism. • Primary design goal: Increase the amount of parallelism without increasing memory I/O.

  27. Prior Parallelization Strategies

  28. Prior Parallelization strategies • Baseline algorithm ‘RT’ • Block notation • Inter-block parallelism • Kernel fusion

  29. Algorithm Ruijters & Thévenaz • Independent row and column processing • Step RT1: In parallel for each column in X, apply the causal filter sequentially and store Y. • Step RT2: In parallel for each column in Y, apply the anticausal filter sequentially and store Z. • Step RT3: In parallel for each row in Z, apply the causal filter sequentially and store U. • Step RT4: In parallel for each row in U, apply the anticausal filter sequentially and store V.
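
The four-pass structure of this baseline can be sketched on a tiny image. The sketch uses a first-order filter with zero prologues and epilogues to stay short, and runs the per-column and per-row work in ordinary loops where a GPU would assign one thread per column or row. Function names, the single coefficient `a`, and the sign convention are assumptions for illustration.

```python
def causal(x, a):
    """First-order causal pass with zero prologue: y[i] = x[i] - a * y[i-1]."""
    y, prev = [], 0.0
    for xi in x:
        prev = xi - a * prev
        y.append(prev)
    return y


def anticausal(x, a):
    """First-order anticausal pass: the causal pass run on the reversed data."""
    return causal(x[::-1], a)[::-1]


def transpose(m):
    return [list(col) for col in zip(*m)]


def algorithm_rt(X, a):
    """Four sequential passes; every column (or row) is independent."""
    cols = transpose(X)
    Y = [causal(c, a) for c in cols]        # pass 1: down each column
    Z = [anticausal(c, a) for c in Y]       # pass 2: up each column
    rows = transpose(Z)
    U = [causal(r, a) for r in rows]        # pass 3: right along each row
    V = [anticausal(r, a) for r in U]       # pass 4: left along each row
    return V


# A single bright pixel spreads over the image after all four passes.
print(algorithm_rt([[1, 0], [0, 0]], -1.0))
```

Every pass reads and writes the whole image, and only one task exists per column or row, which is the limitation the later slides set out to address.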

  30. Algorithm RT in diagram form [Diagram: input, stages and output; column processing followed by row processing]

  31. Algorithm RT performance • Completion time in steps scales with 4r (four passes of an order-r filter). • The step count and the total bandwidth usage are stated in terms of: • the number of streaming multiprocessors • the number of cores (per processor) • w = width of the input image • h = height of the input image • r = order of the applied filter

  32. Block notation (1) • Partition the input image into b × b blocks • b = number of threads in a warp (= 32) • Notation is introduced for: • a block of a matrix, identified by its block index • its column-prologue submatrix • its column-epilogue submatrix • For rows we have (similar) transposed operators.

  33. Block notation (1 cont’d)

  34. Block notation (2) • Tail and head operators select prologue- and epilogue-shaped submatrices.

  35. Block notation (3) • Result: a blocked version of the problem definition.

  36. Some useful key properties (1) • Superposition (based on linearity) • The effects of the input and of the prologue/epilogue on the output can be computed independently.

  37. Some useful key properties (2) • Express the filters as matrix products. • The products involve identity matrices and precomputed matrices that depend only on the feedback coefficients of the causal and anticausal filters, respectively. Details in the paper.

  38. Inter-block parallelism (1) • Perform block computation independently. [Diagram labels: output block; superposition; prologue / tail of the previous output block]

  39. Inter-block parallelism (2) [Equation labels: first term; second term; incomplete causal output]

  40. Inter-block parallelism (3) • Recall relations (1) and (2). • Algorithm 1 • 1.1 In parallel for all m, compute and store each incomplete prologue. • 1.2 Sequentially for each m, compute and store the complete prologue according to (1), using the previously computed incomplete prologues. • 1.3 In parallel for all m, compute and store each output block using (2) and the previously computed prologues.
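
The three stages can be sketched in one dimension for a first-order filter. In that special case the influence of a prologue value on the last output of a block of length n is simply `(-a)**n` times the prologue, which stands in for the precomputed matrices mentioned earlier. The function names, block size, and sign convention are assumptions for illustration; the loops marked "parallel" are independent per block and would map to separate GPU tasks.

```python
def causal(x, a, p=0.0):
    """First-order causal filter: y[i] = x[i] - a * y[i-1], with y[-1] = p."""
    y, prev = [], p
    for xi in x:
        prev = xi - a * prev
        y.append(prev)
    return y


def algorithm1(x, a, b):
    """Blocked first-order causal filter in three stages (1D sketch)."""
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (parallel over blocks): filter each block with a zero
    # prologue and keep only the tail of that incomplete output.
    tails = [causal(blk, a)[-1] for blk in blocks]

    # Stage 2 (sequential over blocks): fix each tail by adding the
    # effect of the previous, already completed, prologue.
    prologues = [0.0]
    for blk, tail in zip(blocks, tails):
        prologues.append(tail + (-a) ** len(blk) * prologues[-1])

    # Stage 3 (parallel over blocks): filter each block again, this time
    # starting from its correct prologue.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal(blk, a, p))
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(algorithm1(x, -1.0, 4))
print(causal(x, -1.0))
```

Only stage 2 is sequential, and it touches one value per block instead of every element, so the dependency chain is shortened by a factor of the block size.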

  41. Inter-block parallelism (4) • Processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1. • The independent tasks hide memory access latency. • However, the memory bandwidth usage is now significantly higher than for Algorithm RT (this can be solved).

  42. Kernel fusion (1) • Original idea: Kirk & Hwu [2010] • Use the output of one kernel as input for the next without going through global memory. • Fused kernel: code from both kernels, but intermediate results are kept in shared memory.

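  43. Kernel fusion (2) • Use Algorithm 1 for all filters, then fuse. • Fuse the last stage of the causal column filter with the first stage of the anticausal column filter. • Fuse the last stage of the anticausal column filter with the first stage of the causal row filter. • Fuse the last stage of the causal row filter with the first stage of the anticausal row filter. • We aimed for bandwidth reduction. Did it work? Comparing the bandwidth usage of Algorithm 1 and Algorithm 2: yes, it did!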

  44. Kernel fusion (3), Algorithm 2 [Diagram: input, stages with four “fix” steps, output] • For the full algorithm in text, please see the paper.

  45. Kernel fusion (4) • Further I/O reduction is still possible by recomputing intermediary results instead of storing them in memory. • This reduces bandwidth further (good) but increases the number of steps (bad*). • Bandwidth usage is then less than that of Algorithm RT, at the cost of more computation. • *Future hardware may tip the balance in favor of more computation.

  46. Overlapping

  47. Causal-Anticausal Overlapping • Overlapping is introduced to reduce I/O to global memory. • It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output block. • This is called causal-anticausal overlapping.

  48. Causal-Anticausal Overlapping • Recall that we can express the filter so that the input and the prologue or epilogue can be computed independently and later added together.

  49. Causal-Anticausal Overlapping • Using the previous properties, we can split the dependency chains of anticausal epilogues.

  50. Causal-Anticausal Overlapping • Which can be further simplified to:
