
DirectCompute Accelerated Separable Filtering

Separable Filters. Much faster than executing a box filter. Classically performed by the Pixel Shader. Consists of a horizontal and a vertical pass. Source image over-sampling increases with kernel size. Shader is usually TEX instruction limited. 28th February 2011. AMD


Presentation Transcript


    2. DirectCompute Accelerated Separable Filtering. 28th February 2011. AMD's Favorite Effects.

    3. Separable Filters. Much faster than executing a box filter. Classically performed by the Pixel Shader. Consists of a horizontal and a vertical pass. Source image over-sampling increases with kernel size. Shader is usually TEX instruction limited.
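As a rough CPU-side illustration of the two-pass idea (not the presentation's shader code), the Python sketch below runs a 1D kernel horizontally and then vertically and compares the result with the equivalent full 2D filter. The function names, the clamp-to-edge addressing and the 3-tap kernel are assumptions made for this example.

```python
def clamp(i, lo, hi):
    return max(lo, min(hi, i))

def blur_1d(img, kernel, horizontal):
    """Apply a 1D kernel along rows (horizontal=True) or columns, clamping at edges."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, wt in enumerate(kernel):
                o = k - r
                sx = clamp(x + o, 0, w - 1) if horizontal else x
                sy = y if horizontal else clamp(y + o, 0, h - 1)
                acc += wt * img[sy][sx]
            out[y][x] = acc
    return out

def blur_separable(img, kernel):
    """Horizontal pass followed by a vertical pass."""
    return blur_1d(blur_1d(img, kernel, True), kernel, False)

def blur_full_2d(img, kernel):
    """Reference: the full 2D filter whose weights are kernel[ky] * kernel[kx]."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, wy in enumerate(kernel):
                for kx, wx in enumerate(kernel):
                    sy = clamp(y + ky - r, 0, h - 1)
                    sx = clamp(x + kx - r, 0, w - 1)
                    acc += wy * wx * img[sy][sx]
            out[y][x] = acc
    return out

def taps_per_pixel(diameter):
    """(two-pass taps, full 2D taps) per output pixel for a given kernel diameter."""
    return 2 * diameter, diameter * diameter
```

For a kernel diameter of 9, `taps_per_pixel(9)` gives 18 taps for the two passes against 81 for the full 2D filter, which is where the speed-up comes from.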

    4. Separable? Who Cares? In many cases developers use this technique even though the filter may not actually be separable. Results are often still acceptable. Much faster than performing a real box filter. Accelerates many bilateral cases.

    5. Typical Pipeline Steps

    6. Use Bilinear HW filtering? Bilinear filter HW can halve the number of ALU and TEX instructions. Just need to compute the correct sampling offsets. Not possible with more advanced filters, usually because weighting is a dynamic operation. Think about bilateral cases...
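One common way to compute those sampling offsets, shown here as a hedged Python sketch rather than the presenters' code, is to merge each pair of adjacent taps into a single linearly filtered fetch placed at the weighted average of the two texel offsets. `sample_linear` is a hypothetical stand-in for the hardware filter, and the weights must be static and non-negative, which is why the trick does not carry over to dynamically weighted (bilateral) filters.

```python
def sample_linear(row, pos):
    """Emulate 1D linear texture filtering at a fractional texel position (clamped)."""
    pos = max(0.0, min(len(row) - 1.0, pos))
    i = int(pos)
    j = min(i + 1, len(row) - 1)
    f = pos - i
    return row[i] * (1.0 - f) + row[j] * f

def merge_taps(side_weights):
    """side_weights holds the weights for offsets 1, 2, 3, ... (even count).
    Returns (offset, weight) pairs, each replacing two adjacent point taps."""
    merged = []
    for i in range(0, len(side_weights), 2):
        w1, w2 = side_weights[i], side_weights[i + 1]
        o1, o2 = i + 1, i + 2
        w = w1 + w2
        merged.append(((o1 * w1 + o2 * w2) / w, w))
    return merged

def blur_point(row, x, center, side_weights):
    """Reference: one fetch per tap."""
    acc = center * row[x]
    for i, w in enumerate(side_weights):
        o = i + 1
        acc += w * (row[x + o] + row[x - o])
    return acc

def blur_bilinear(row, x, center, side_weights):
    """Same filter using one linearly filtered fetch per pair of taps."""
    acc = center * row[x]
    for off, w in merge_taps(side_weights):
        acc += w * (sample_linear(row, x + off) + sample_linear(row, x - off))
    return acc
```

With four weights per side, the point-sampled version issues 9 fetches per pixel and the merged version 5, while both produce the same result up to floating-point rounding.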

    7. Where to start with DirectCompute. Is the Pixel Shader version TEX or ALU limited? You need to know what to optimize for! Use IHV tools to establish this. Achieving peak performance is not easy, so write a highly configurable kernel. This will allow you to easily experiment and fine tune.

    8. Thread Group Shared Memory (TGSM). TGSM can be used to reduce TEX ops. TGSM can also be used to cache results, thus saving ALU ops too. Load a sensible run length; base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32). Choose a good common factor (multiples of 64).
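To get a feel for the TEX saving, the small Python sketch below counts texture fetches per output pixel under a simplified model: a pixel shader fetches every tap itself, while a thread group loads one run of pixels plus a kernel-radius apron on each side into TGSM once. The model, the function names and the example numbers are assumptions for illustration, not measurements from the talk.

```python
def fetches_per_pixel_pixel_shader(radius):
    """Every output pixel fetches its full kernel: 2 * radius + 1 texels."""
    return 2 * radius + 1

def fetches_per_pixel_tgsm(run_length, radius):
    """A thread group fetches run_length + 2 * radius texels once and then
    produces run_length output pixels from shared memory."""
    return (run_length + 2 * radius) / run_length
```

For a radius of 8 and a run length of 128 (a multiple of 64), the model gives 17 fetches per pixel for the pixel shader against 1.125 for the TGSM version.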

    9. Kernel #1: redundant compute threads?

    10. Avoid Redundant Threads. Should ensure that all threads in a group have useful work to do, wherever possible. Redundant threads will not be reassigned work from another group. This would involve a lot of redundancy for a large kernel diameter.

    11. Kernel #2: no redundant compute threads?

    12. Multiple Pixels per Thread. Allows for natural vectorization. 4 works well on AMD HW. Doesn't hurt performance on scalar HW. Possible to cache TGSM reads in General Purpose Registers (GPRs). Quartering TGSM reads: absolute winner!!
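One plausible reading of the register caching point, sketched in Python under assumptions of my own: a thread that produces 4 adjacent pixels can read the D + 3 shared-memory values it needs once into local variables (standing in for GPRs) and reuse them, instead of issuing 4 * D separate reads. The list `tgsm` and both function names are hypothetical.

```python
def filter_four_naive(tgsm, first, kernel):
    """Compute 4 adjacent outputs, re-reading shared memory for every tap."""
    d = len(kernel)
    reads = 0
    outs = []
    for p in range(4):
        acc = 0.0
        for k in range(d):
            acc += kernel[k] * tgsm[first + p + k]
            reads += 1
        outs.append(acc)
    return outs, reads

def filter_four_cached(tgsm, first, kernel):
    """Compute the same 4 outputs after reading each needed value once."""
    d = len(kernel)
    regs = [tgsm[first + i] for i in range(d + 3)]  # D + 3 shared-memory reads
    reads = d + 3
    outs = []
    for p in range(4):
        acc = 0.0
        for k in range(d):
            acc += kernel[k] * regs[p + k]
        outs.append(acc)
    return outs, reads
```

With a 9-tap kernel the naive version performs 36 reads and the cached one 12; the ratio approaches one quarter as the kernel diameter grows.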

    13. Kernel #3: compute threads not a multiple of 64?

    14. Multiple Lines per Thread Group. Process multiple lines per thread group. Better than one long line. 2 or 4 works well. Improved texture cache efficiency. Compute threads back to a multiple of 64.
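The thread-count arithmetic behind the last few slides can be sketched as follows. The run length of 128 and the helper names are hypothetical; the actual kernel configurations were shown in diagrams that are not part of this transcript.

```python
WAVEFRONT = 64  # AMD wavefront size quoted in the slides

def threads_per_group(run_length, pixels_per_thread, lines_per_group):
    """Threads needed when each thread outputs pixels_per_thread pixels of a line."""
    if run_length % pixels_per_thread != 0:
        raise ValueError("run_length must be divisible by pixels_per_thread")
    return (run_length // pixels_per_thread) * lines_per_group

def is_wavefront_multiple(thread_count):
    return thread_count % WAVEFRONT == 0
```

Under these assumptions, a 128-pixel run with 4 pixels per thread needs only 32 threads, which is not a multiple of 64; processing 2 lines per group brings the count back to 64.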

    15. Kernel #4

    16. Kernel Diameter. Kernel diameter needs to be > 7 to see a DirectCompute win; otherwise the overhead cancels out the advantage. The larger the kernel diameter, the greater the win.

    17. Use Packing in TGSM. Use packing to reduce storage space required in TGSM. Only have 32k per SIMD. Reduces reads/writes from TGSM. Often a uint is sufficient for color filtering. Use SM5.0 instructions f32tof16(), f16tof32().
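The packing idea can be tried out on the CPU with Python's `struct` module, whose `e` format is an IEEE half float. The functions below only borrow the names of the HLSL intrinsics as stand-ins, and their rounding behaviour may differ from the GPU's; `pack_two` and `unpack_two` are hypothetical helpers.

```python
import struct

def f32tof16(x):
    """Return the 16-bit half-float bit pattern of x as an int."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def f16tof32(bits):
    """Convert the low 16 bits of an int from a half-float pattern to a float."""
    return struct.unpack('<e', struct.pack('<H', bits & 0xFFFF))[0]

def pack_two(a, b):
    """Store two values as halves in one 32-bit uint: a in the low bits, b in the high bits."""
    return f32tof16(a) | (f32tof16(b) << 16)

def unpack_two(u):
    """Inverse of pack_two."""
    return f16tof32(u), f16tof32(u >> 16)
```

Packed this way, an RGBA value takes two uints in shared memory instead of four floats, halving both the storage and the number of reads and writes.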

    18. High Definition Ambient Occlusion

    19. Perform at Half Resolution. HDAO at full resolution is expensive. Running at half resolution captures more occlusion, and is obviously much faster. Problem: artifacts are introduced when combined with the full resolution scene.

    20. Bilateral Dilate & Blur

    21. New Pipeline...

    22. Pixel Shader vs DirectCompute

    23. Depth of Field. Many techniques exist to solve this problem. A common technique is to figure out how blurry a pixel should be, often called the Circle of Confusion (CoC). A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect.
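The slide does not say exactly how the CoC enters the weights, so the Python sketch below is only one possible scheme, chosen for illustration: each neighbouring tap's Gaussian weight is scaled by that neighbour's CoC (assumed to be normalized to 0..1), the center tap keeps its full weight, and the result is renormalized. All names and the weighting rule itself are assumptions.

```python
import math

def gaussian_weights(radius, sigma):
    """Unnormalized Gaussian weights for offsets -radius .. +radius."""
    return [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]

def dof_blur_1d(color, coc, radius, sigma):
    """One 1D pass of a CoC-weighted Gaussian blur with clamped addressing."""
    g = gaussian_weights(radius, sigma)
    n = len(color)
    out = []
    for x in range(n):
        acc = 0.0
        wsum = 0.0
        for k, gw in enumerate(g):
            o = k - radius
            s = max(0, min(n - 1, x + o))
            # Assumed rule: in-focus neighbours (CoC near 0) contribute little.
            w = gw if o == 0 else gw * coc[s]
            acc += w * color[s]
            wsum += w
        out.append(acc / wsum)
    return out
```

With every CoC at 0 the pass returns the input unchanged, and with every CoC at 1 it reduces to an ordinary normalized Gaussian blur.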

    24. The Pipeline...

    27. Pixel Shader vs DirectCompute

    28. Summary. DirectCompute greatly accelerates larger kernel diameter filters. Allows for filtering at full resolution. For access to source code: HDAO11: jon.story@amd.com; DoF11: nicolas.thibieroz@amd.com.

    29. Questions? takahiro.harada@amd.com, holger.gruen@amd.com, jon.story@amd.com. Please fill in the feedback forms!
