- 98 Views
- Uploaded on
- Presentation posted in: General

DirectCompute Accelerated Separable Filtering

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

AMD‘s Favorite Effects

- Much faster than executing a box filter
- Classically performed by the Pixel Shader
- Consists of a horizontal and vertical pass
- Source image over-sampling increases with kernel size
- Shader is usually TEX instruction limited

AMD‘s Favorite Effects

- In many cases developers use this technique even though the filter may not actually be separable
- Results are often still acceptable
- Much faster than performing a real box filter
- Accelerates many bilateral cases

AMD‘s Favorite Effects

Source

RT

Intermediate

RT

Destination RT

Horizontal Pass

Vertical Pass

AMD‘s Favorite Effects

- Bilinear filter HW can halve the number of ALU and TEX instructions
- Just need to compute the correct sampling offsets

- Not possible with more advanced filters
- Usually because weighting is a dynamic operation
- Think about bilateral cases...

AMD‘s Favorite Effects

- Is the Pixel Shader version TEX or ALU limited?
- You need to know what to optimize for!
- Use IHV tools to establish this

- Achieving peak performance is not easy – so write a highly configurable kernel
- Will allow you to easily experiment and fine tune

AMD‘s Favorite Effects

- TGSM can be used to reduce TEX ops
- TGSM can also be used to cache results
- Thus saving ALU ops too

- Load a sensible run length – base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32)
- Choose a good common factor (multiples of 64)

AMD‘s Favorite Effects

128 threads load 128 texels

- Redundant compute threads

...........

Kernel Radius

128 – ( Kernel Radius * 2 ) threads compute results

AMD‘s Favorite Effects

- Should ensure that all threads in a group have useful work to do – wherever possible
- Redundant threads will not be reassigned work from another group
- This would involve alot of redundancy for a large kernel diameter

AMD‘s Favorite Effects

Kernel Radius * 2 threads

load 1 extra texel each

128 threads load 128 texels

- No redundant compute threads

...........

Kernel Radius

128 threads compute results

AMD‘s Favorite Effects

- Allows for natural vectorization
- 4 works well on AMD HW
- Doesn‘t hurt performance on scalar HW

- Possible to cache TGSM reads on General Purpose Registers (GPRs)
- Quartering TGSM reads - absolute winner!!

AMD‘s Favorite Effects

Kernel Radius * 2 threads

load 1 extra texel each

32 threads load 128 texels

- Compute threads not a multiple of 64

...........

Kernel Radius

32 threads compute 128 results

AMD‘s Favorite Effects

- Process multiple lines per thread group
- Better than one long line
- 2 or 4 works well

- Improved texture cache efficiency
- Compute threads back to a multiple of 64

AMD‘s Favorite Effects

Kernel Radius * 4 threads

load 1 extra texel each

64 threads load 256 texels

...........

...........

Kernel Radius

64 threads compute 256 results

AMD‘s Favorite Effects

- Kernel diameter needs to be > 7 to see a DirectCompute win
- Otherwise the overhead cancels out the advantage

- The larger the kernel diameter the greater the win

AMD‘s Favorite Effects

- Use packing to reduce storage space required in TGSM
- Only have 32k per SIMD

- Reduces reads/writes from TGSM
- Often a uint is sufficient for color filtering
- Use SM5.0 instructions f32tof16(), f16tof32()

AMD‘s Favorite Effects

Depth + Normals

=

*

HDAO buffer

Original Scene

Final Scene

AMD‘s Favorite Effects

- HDAO at full resolution is expensive
- Running at half resolution captures more occlusion – and is obviously much faster
- Problem: Artifacts are introduced when combined with the full resolution scene

AMD‘s Favorite Effects

HDAO buffer doesn‘t match with scene

A bilateral dilate & blur fixes the issue

AMD‘s Favorite Effects

½ Res

Still much faster than performing at full res!

Horizontal Pass

Vertical Pass

Bilinear Upsample

Intermediate UAV

Dilated & Blurred

AMD‘s Favorite Effects

*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~2.53x to ~3.17x faster than the Pixel Shader

AMD‘s Favorite Effects

- Many techniques exist to solve this problem
- A common technique is to figure out how blurry a pixel should be
- Often called the Cirle of Confusion (CoC)

- A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect

AMD‘s Favorite Effects

Vertical Pass

Horizontal Pass

Intermediate UAV

CoC

AMD‘s Favorite Effects

Shogun 2: DoF OFF

AMD‘s Favorite Effects

Shogun 2: DoF ON

AMD‘s Favorite Effects

*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~1.48x to ~1.86x faster than the Pixel Shader

AMD‘s Favorite Effects

- DirectCompute greatly accelerates larger kernel diameter filters
- Allows for filtering at full resolution
- For access to source code:
- HDAO11: [email protected]
- DoF11: [email protected]

AMD‘s Favorite Effects

AMD‘s Favorite Effects