2. DirectCompute Accelerated Separable Filtering
AMD's Favorite Effects, 28th February 2011
3. Separable Filters
Much faster than executing a box filter
Classically performed by the Pixel Shader
Consists of a horizontal and vertical pass
Source image over-sampling increases with kernel size
Shader is usually TEX instruction limited
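The two-pass structure can be checked with a small CPU reference. The sketch below is plain Python rather than shader code, and the function names and the clamp-to-edge border handling are assumptions made for illustration. It shows that a horizontal pass followed by a vertical pass reproduces a full 2D box filter while reading 2*d texels per pixel instead of d*d.

```python
def clamp(i, lo, hi):
    return max(lo, min(hi, i))

def box_pass(img, radius, horizontal):
    """One 1D box pass along rows (horizontal=True) or columns (False)."""
    h, w = len(img), len(img[0])
    d = 2 * radius + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for k in range(-radius, radius + 1):
                if horizontal:
                    total += img[y][clamp(x + k, 0, w - 1)]
                else:
                    total += img[clamp(y + k, 0, h - 1)][x]
            out[y][x] = total / d
    return out

def box_separable(img, radius):
    """Horizontal pass, then vertical pass over the intermediate image."""
    return box_pass(box_pass(img, radius, True), radius, False)

def box_full_2d(img, radius):
    """Reference: every output pixel reads the whole d x d neighbourhood."""
    h, w = len(img), len(img[0])
    d = 2 * radius + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
            out[y][x] = total / (d * d)
    return out

def taps_per_pixel(radius):
    """Texel reads per output pixel: (full 2D filter, two separable passes)."""
    d = 2 * radius + 1
    return d * d, 2 * d
```

For a diameter of 7 this is 49 reads against 14, and the gap widens as the kernel grows.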
4. Separable? Who Cares?
In many cases developers use this technique even though the filter may not actually be separable
Results are often still acceptable
Much faster than performing a real box filter
Accelerates many bilateral cases
5. Typical Pipeline Steps
6. Use Bilinear HW filtering?
Bilinear filter HW can halve the number of ALU and TEX instructions
Just need to compute the correct sampling offsets
Not possible with more advanced filters
Usually because weighting is a dynamic operation
Think about bilateral cases...
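One common way to compute those sampling offsets, shown here as a CPU-side sketch rather than the presenters' code, is to merge each pair of neighbouring taps into a single fetch placed between them. The helper names, the Gaussian weights and the 1D `sample_linear` emulation of a bilinear fetch are all assumptions for illustration. The merge only works because the weights are fixed up front, which is why it breaks down when weighting is dynamic.

```python
import math

def gaussian_weights(radius, sigma):
    """Normalized, static 1D Gaussian weights for taps -radius..radius."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [x / s for x in w]

def merge_taps(weights):
    """Pair taps (1,2), (3,4), ... on each side of the centre.

    Returns (offset, weight) pairs; each pair costs one filtered fetch.
    """
    radius = len(weights) // 2
    taps = [(0.0, weights[radius])]  # centre tap stays on its own
    for i in range(1, radius + 1, 2):
        w1 = weights[radius + i]
        w2 = weights[radius + i + 1] if i + 1 <= radius else 0.0
        w = w1 + w2
        # Fractional position where linear interpolation yields w1 : w2
        off = (i * w1 + (i + 1) * w2) / w
        taps.append((off, w))
        taps.append((-off, w))  # symmetric kernel, mirrored tap
    return taps

def sample_linear(row, x):
    """Emulates a linearly filtered fetch along one axis, clamp addressing."""
    x0 = math.floor(x)
    f = x - x0
    last = len(row) - 1
    a = row[min(max(x0, 0), last)]
    b = row[min(max(x0 + 1, 0), last)]
    return a * (1.0 - f) + b * f

def blur_pixel(row, centre, taps):
    """Filters one pixel using the merged taps."""
    return sum(w * sample_linear(row, centre + off) for off, w in taps)
```

With a 9-tap kernel (`radius = 4`), `merge_taps` returns 5 fetches, and `blur_pixel` matches the plain 9-tap weighted sum up to floating-point rounding.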
7. Where to start with DirectCompute
Is the Pixel Shader version TEX or ALU limited?
You need to know what to optimize for!
Use IHV tools to establish this
Achieving peak performance is not easy – so write a highly configurable kernel
This will allow you to easily experiment and fine-tune
8. Thread Group Shared Memory (TGSM)
TGSM can be used to reduce TEX ops
TGSM can also be used to cache results
Thus saving ALU ops too
Load a sensible run length – base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32)
Choose a good common factor (multiples of 64)
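The TEX saving can be made concrete with a CPU-side emulation of one thread group. In this sketch the `tgsm` list stands in for shared memory, and the function names, the clamp addressing and the layout (a run of output pixels plus an apron of `radius` texels on each side) are assumptions rather than the presenters' kernel. Each texel is fetched once into `tgsm`, and every output pixel then reads its taps from there.

```python
def filter_run_cached(line, start, run_length, weights):
    """Filters run_length pixels of one line via a shared cache.

    Returns (outputs, number_of_texture_fetches).
    """
    radius = len(weights) // 2
    last = len(line) - 1
    # Cooperative load: one fetch per texel, apron included
    tgsm = [line[min(max(i, 0), last)]
            for i in range(start - radius, start + run_length + radius)]
    fetches = len(tgsm)
    outputs = [sum(w * tgsm[p + k] for k, w in enumerate(weights))
               for p in range(run_length)]
    return outputs, fetches

def filter_run_direct(line, start, run_length, weights):
    """Pixel-shader style: every pixel fetches all of its own taps."""
    radius = len(weights) // 2
    last = len(line) - 1
    fetches = 0
    outputs = []
    for p in range(start, start + run_length):
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w * line[min(max(p + k - radius, 0), last)]
            fetches += 1
        outputs.append(acc)
    return outputs, fetches
```

For a run of 64 pixels and a 17-tap kernel, the cached version issues 80 fetches where the direct version issues 1088.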
9. Kernel #1
Redundant compute threads
10. Avoid Redundant Threads
Ensure that all threads in a group have useful work to do wherever possible
Redundant threads will not be reassigned work from another group
This would involve a lot of redundancy for a large kernel diameter
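The kernel diagrams did not survive in this transcript, so the following is only a back-of-the-envelope sketch under an assumed layout: every thread in the group loads one texel, and the threads sitting in the apron (`radius` on each side) have no output pixel to compute. The function name is hypothetical.

```python
def redundant_threads(threads_per_group, kernel_diameter):
    """Returns (active, idle, idle_fraction) for the compute phase.

    Assumed layout: one texel loaded per thread, and only threads with a
    full kernel footprint inside the loaded run produce an output.
    """
    radius = kernel_diameter // 2
    active = max(threads_per_group - 2 * radius, 0)
    idle = threads_per_group - active
    return active, idle, idle / threads_per_group
```

Under this assumption a 128-thread group idles 6 threads at diameter 7, but 34 threads (about 27%) at diameter 35.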
11. Kernel #2
No redundant compute threads
12. Multiple Pixels per Thread
Allows for natural vectorization
4 works well on AMD HW
Doesn't hurt performance on scalar HW
Possible to cache TGSM reads on General Purpose Registers (GPRs)
Quartering TGSM reads - absolute winner!!
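Why the TGSM reads drop so sharply can be seen by emulating a single thread. In this sketch (names and layout are assumptions) the thread copies the window it needs from `tgsm` into a local list standing in for GPRs, then computes several adjacent pixels from that list. Adjacent pixels share most of their taps, so the window is only `diameter + pixels_per_thread - 1` wide.

```python
def thread_filter(tgsm, first, weights, pixels_per_thread=4):
    """Computes adjacent output pixels from one shared window.

    `first` is the TGSM index of the leftmost tap of the first pixel.
    Returns (outputs, number_of_tgsm_reads).
    """
    diameter = len(weights)
    # Read each needed TGSM value once into "registers"
    regs = tgsm[first:first + diameter + pixels_per_thread - 1]
    reads = len(regs)
    outputs = [sum(w * regs[p + k] for k, w in enumerate(weights))
               for p in range(pixels_per_thread)]
    return outputs, reads

def reads_per_pixel(diameter, pixels_per_thread):
    """TGSM reads per output pixel when the window is cached in registers."""
    return (diameter + pixels_per_thread - 1) / pixels_per_thread
```

With a diameter of 17, one pixel per thread costs 17 TGSM reads per pixel, while four pixels per thread costs 5, which approaches a quarter as the diameter grows.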
13. Kernel #3
Compute threads not a multiple of 64
14. Multiple Lines per Thread Group
Process multiple lines per thread group
Better than one long line
2 or 4 works well
Improved texture cache efficiency
Compute threads back to a multiple of 64
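The thread-count arithmetic can be sketched as follows. The run length of 128 pixels per line used in the usage note is an assumed figure, not one stated on the slides, and the helper names are hypothetical.

```python
def compute_threads(run_length, pixels_per_thread, lines_per_group):
    """Threads needed when each thread produces several pixels of one line."""
    return (run_length // pixels_per_thread) * lines_per_group

def fills_wavefronts(threads, wavefront_size=64):
    """True when the group size is a whole number of wavefronts."""
    return threads % wavefront_size == 0
```

With an assumed run of 128 pixels and 4 pixels per thread, one line needs 32 threads, which is only half of a 64-wide wavefront. Processing 2 lines per group gives 64 threads and 4 lines gives 128, both whole multiples of 64.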
15. Kernel #4
16. Kernel Diameter
Kernel diameter needs to be > 7 to see a DirectCompute win
Otherwise the overhead cancels out the advantage
The larger the kernel diameter the greater the win
17. Use Packing in TGSM
Use packing to reduce storage space required in TGSM
Only 32 KB is available per SIMD
Reduces reads/writes from TGSM
Often a uint is sufficient for color filtering
Use the SM5.0 intrinsics f32tof16() and f16tof32()
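The packing idea can be emulated on the CPU. In the sketch below, the Python functions `f32tof16` and `f16tof32` are stand-ins named after the HLSL intrinsics and built on the standard `struct` module's half-float format; `pack_half2` and `unpack_half2` are hypothetical helpers. Rounding and out-of-range behaviour may differ from the GPU (`struct` raises `OverflowError` for values beyond the half-float range), so treat this as an illustration of the bit layout only.

```python
import struct

def f32tof16(value):
    """Float -> 16-bit half pattern held in the low bits of an int."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def f16tof32(bits):
    """Low 16 bits interpreted as a half float -> Python float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def pack_half2(x, y):
    """Stores two values in one 32-bit word: x in the low half, y in the high."""
    return f32tof16(x) | (f32tof16(y) << 16)

def unpack_half2(packed):
    """Inverse of pack_half2."""
    return f16tof32(packed), f16tof32(packed >> 16)
```

Packed this way, an RGBA value occupies two 32-bit words instead of four, halving both the storage and the number of TGSM reads and writes for that value.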
18. High Definition Ambient Occlusion
19. Perform at Half Resolution
HDAO at full resolution is expensive
Running at half resolution captures more occlusion – and is obviously much faster
Problem: Artifacts are introduced when combined with the full resolution scene
20. Bilateral Dilate & Blur
21. New Pipeline...
22. Pixel Shader vs DirectCompute
23. Depth of Field
Many techniques exist to solve this problem
A common technique is to figure out how blurry a pixel should be
Often called the Circle of Confusion (CoC)
A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect
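A minimal 1D sketch of a CoC-weighted Gaussian pass is given below. The exact weighting used in the presenters' implementation is not given in this transcript, so the scheme here is an assumption: each neighbour's Gaussian weight is scaled by that neighbour's CoC (in the range 0 to 1), so in-focus pixels contribute little to their surroundings, and the centre tap always keeps its full weight. The function name is hypothetical.

```python
import math

def dof_blur_1d(color, coc, radius, sigma):
    """One CoC-weighted Gaussian pass over a single line.

    color: per-pixel values; coc: per-pixel blurriness in [0, 1].
    """
    last = len(color) - 1
    out = []
    for x in range(len(color)):
        acc = 0.0
        total_w = 0.0
        for k in range(-radius, radius + 1):
            i = min(max(x + k, 0), last)  # clamp addressing
            g = math.exp(-(k * k) / (2.0 * sigma * sigma))
            # Assumed weighting: neighbours are scaled by their own CoC
            w = g if k == 0 else g * coc[i]
            acc += w * color[i]
            total_w += w
        out.append(acc / total_w)  # renormalize, weights vary per pixel
    return out
```

Because the weights depend on the sampled data, they must be computed and renormalized per pixel, which is the dynamic weighting case where precomputed bilinear offsets do not apply.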
24. The Pipeline...
27. Pixel Shader vs DirectCompute
28. Summary
DirectCompute greatly accelerates larger kernel diameter filters
Allows for filtering at full resolution
For access to source code:
HDAO11: jon.story@amd.com
DoF11: nicolas.thibieroz@amd.com
29. Questions?
takahiro.harada@amd.com
holger.gruen@amd.com
jon.story@amd.com
Please fill in the feedback forms!