Directcompute accelerated separable filtering
Sponsored Links
This presentation is the property of its rightful owner.
1 / 29

DirectCompute Accelerated Separable Filtering PowerPoint PPT Presentation


  • 112 Views
  • Uploaded on
  • Presentation posted in: General

DirectCompute Accelerated Separable Filtering. Separable Filters. Much faster than executing a box filter Classically performed by the Pixel Shader Consists of a horizontal and vertical pass Source image over-sampling increases with kernel size Shader is usually TEX instruction limited.

Download Presentation

DirectCompute Accelerated Separable Filtering

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


DirectCompute Accelerated Separable Filtering

AMD‘s Favorite Effects


Separable Filters

  • Much faster than executing a box filter

  • Classically performed by the Pixel Shader

  • Consists of a horizontal and vertical pass

  • Source image over-sampling increases with kernel size

    • Shader is usually TEX instruction limited

AMD‘s Favorite Effects


Separable? – Who Cares 

  • In many cases developers use this technique even though the filter may not actually be separable

    • Results are often still acceptable

    • Much faster than performing a real box filter

    • Accelerates many bilateral cases

AMD‘s Favorite Effects


Typical Pipeline Steps

Source

RT

Intermediate

RT

Destination RT

Horizontal Pass

Vertical Pass

AMD‘s Favorite Effects


Use Bilinear HW filtering?

  • Bilinear filter HW can halve the number of ALU and TEX instructions

    • Just need to compute the correct sampling offsets

  • Not possible with more advanced filters

    • Usually because weighting is a dynamic operation

    • Think about bilateral cases...

AMD‘s Favorite Effects


Where to start with DirectCompute

  • Is the Pixel Shader version TEX or ALU limited?

    • You need to know what to optimize for!

    • Use IHV tools to establish this

  • Achieving peak performance is not easy – so write a highly configurable kernel

    • Will allow you to easily experiment and fine tune

AMD‘s Favorite Effects


Thread Group Shared Memory (TGSM)

  • TGSM can be used to reduce TEX ops

  • TGSM can also be used to cache results

    • Thus saving ALU ops too

  • Load a sensible run length – base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32)

    • Choose a good common factor (multiples of 64)

AMD‘s Favorite Effects


Kernel #1

128 threads load 128 texels

  • Redundant compute threads 

...........

Kernel Radius

128 – ( Kernel Radius * 2 ) threads compute results

AMD‘s Favorite Effects


Avoid Redundant Threads

  • Should ensure that all threads in a group have useful work to do – wherever possible

  • Redundant threads will not be reassigned work from another group

  • This would involve alot of redundancy for a large kernel diameter

AMD‘s Favorite Effects


Kernel #2

Kernel Radius * 2 threads

load 1 extra texel each

128 threads load 128 texels

  • No redundant compute threads 

...........

Kernel Radius

128 threads compute results

AMD‘s Favorite Effects


Multiple Pixels per Thread

  • Allows for natural vectorization

    • 4 works well on AMD HW

    • Doesn‘t hurt performance on scalar HW

  • Possible to cache TGSM reads on General Purpose Registers (GPRs)

    • Quartering TGSM reads - absolute winner!!

AMD‘s Favorite Effects


Kernel #3

Kernel Radius * 2 threads

load 1 extra texel each

32 threads load 128 texels

  • Compute threads not a multiple of 64 

...........

Kernel Radius

32 threads compute 128 results

AMD‘s Favorite Effects


Multiple Lines per Thread Group

  • Process multiple lines per thread group

    • Better than one long line

    • 2 or 4 works well

  • Improved texture cache efficiency

  • Compute threads back to a multiple of 64

AMD‘s Favorite Effects


Kernel #4

Kernel Radius * 4 threads

load 1 extra texel each

64 threads load 256 texels

...........

...........

Kernel Radius

64 threads compute 256 results

AMD‘s Favorite Effects


Kernel Diameter

  • Kernel diameter needs to be > 7 to see a DirectCompute win

    • Otherwise the overhead cancels out the advantage

  • The larger the kernel diameter the greater the win

AMD‘s Favorite Effects


Use Packing in TGSM

  • Use packing to reduce storage space required in TGSM

    • Only have 32k per SIMD

  • Reduces reads/writes from TGSM

  • Often a uint is sufficient for color filtering

  • Use SM5.0 instructions f32tof16(), f16tof32()

AMD‘s Favorite Effects


High Definition Ambient Occlusion

Depth + Normals

=

*

HDAO buffer

Original Scene

Final Scene

AMD‘s Favorite Effects


Perform at Half Resolution

  • HDAO at full resolution is expensive

  • Running at half resolution captures more occlusion – and is obviously much faster

  • Problem: Artifacts are introduced when combined with the full resolution scene

AMD‘s Favorite Effects


Bilateral Dilate & Blur

HDAO buffer doesn‘t match with scene

A bilateral dilate & blur fixes the issue

AMD‘s Favorite Effects


New Pipeline...

½ Res

Still much faster than performing at full res!

Horizontal Pass

Vertical Pass

Bilinear Upsample

Intermediate UAV

Dilated & Blurred

AMD‘s Favorite Effects


Pixel Shader vs DirectCompute

*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~2.53x to ~3.17x faster than the Pixel Shader

AMD‘s Favorite Effects


Depth of Field

  • Many techniques exist to solve this problem

  • A common technique is to figure out how blurry a pixel should be

    • Often called the Cirle of Confusion (CoC)

  • A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect

AMD‘s Favorite Effects


The Pipeline...

Vertical Pass

Horizontal Pass

Intermediate UAV

CoC

AMD‘s Favorite Effects


Shogun 2: DoF OFF

AMD‘s Favorite Effects


Shogun 2: DoF ON

AMD‘s Favorite Effects


Pixel Shader vs DirectCompute

*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~1.48x to ~1.86x faster than the Pixel Shader

AMD‘s Favorite Effects


Summary

  • DirectCompute greatly accelerates larger kernel diameter filters

  • Allows for filtering at full resolution

  • For access to source code:

    • HDAO11: jon.story@amd.com

    • DoF11: nicolas.thibieroz@amd.com

AMD‘s Favorite Effects


Questions?takahiro.harada@amd.comholger.gruen@amd.comjon.story@amd.comPlease fill in the feedback forms!

AMD‘s Favorite Effects


  • Login