
An Optimized Diffusion Depth Of Field Solver (DDOF)



Presentation Transcript


  1. An Optimized Diffusion Depth Of Field Solver (DDOF) • Holger Gruen, AMD • AMD's Favorite Effects

  2. Agenda • Motivation • Recap of a high-level explanation of DDOF • Recap of earlier DDOF solvers • A vanilla Cyclic Reduction (CR) DDOF solver • A DX11-optimized CR solver for DDOF • Results

  3. Motivation • The solver presented at GDC 2010 [RS2010] has some weaknesses • It is a great implementation, but its memory requirements and runtime are too high for many game developers • We are looking for a faster and more memory-efficient solver

  4. Diffusion DOF recap 1 • DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (Circle of Confusion) at each pixel into account • It interprets the input image as a heat distribution • It uses the CoC at a pixel to derive a per-pixel heat conductivity

  5. Diffusion DOF recap 2 • Blurring is done by time-stepping a differential equation that models the diffusion of heat • The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping • This requires solving a tri-diagonal linear system for each row and then for each column of the input

  6. DDOF Tri-diagonal system • X: row/col of the input image • abc: derived from the CoC at each pixel of an input row/col • Y: resulting blurred row/col
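The equation shown on this slide did not survive in the transcript. As a sketch only, the general form of a tri-diagonal system can be written as follows, assuming that the slide's abc are the three diagonals and that X and Y are the right-hand side and the unknowns; the index names and the convention that a_1 and c_n are unused are assumptions, not taken from the slides.

```latex
\begin{pmatrix}
b_1 & c_1    &         &         \\
a_2 & b_2    & c_2     &         \\
    & \ddots & \ddots  & \ddots  \\
    &        & a_n     & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\qquad\Longleftrightarrow\qquad
a_i\, y_{i-1} + b_i\, y_i + c_i\, y_{i+1} = x_i, \quad i = 1, \dots, n
```

One such system is solved per image row, and then one per image column.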

  7. Solver recap 1 • The GDC 2010 solver [RS2010] is a 'hybrid' solver • It performs three PCR (parallel cyclic reduction) steps upfront • It then runs a serial 'sweep' algorithm to solve the small resulting systems • See [ZCO2010] for details on other hybrid solvers
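The serial sweep can be illustrated with the textbook Thomas algorithm. This is a minimal CPU sketch and not the [RS2010] shader code; the function name `sweep_solve` and the list-based layout are assumptions made for illustration. The `c_mod` and `x_mod` arrays play the role of the modified coefficients that a GPU version has to keep in a scratch-pad.

```python
def sweep_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] serially.

    a[0] and c[-1] are ignored. Hypothetical helper for illustration only.
    """
    n = len(x)
    c_mod = [0.0] * n  # modified upper-diagonal coefficients
    x_mod = [0.0] * n  # modified right-hand side
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    # Forward sweep: eliminate the lower diagonal, one row after another.
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    # Backward sweep: each unknown depends on the one after it.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops carry a dependency from one iteration to the next, which is why the sweep cannot be spread across threads within a single system.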

  8. Solver recap 2 • The GDC 2010 solver [RS2010] has drawbacks • It uses a large UAV as a read/write scratch-pad to store the modified coefficients of the sweep algorithm • GPUs without a read/write cache will suffer • At high resolutions, three PCR steps still leave tri-diagonal systems of substantial size • This means a serial (sweep) algorithm is run on a 'big' system

  9. Solver recap 3 • Cyclic Reduction (CR) solver • Used by [Kass2006] in the original DDOF paper • Runs in two phases: • a reduction phase • a backward substitution phase

  10. Solver recap 4 • According to [ZCO2010]: • The CR solver has the lowest computational complexity of all solvers • It suffers from a lack of parallelism, though: • at the end of the reduction phase • at the start of the backward substitution phase
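The two phases of cyclic reduction can be sketched on the CPU as below. This is a recursive illustration under assumed names (`cr_solve`, list-based diagonals with `a[0]` and `c[-1]` unused), not the GPU implementation from the talk. Each level keeps only the odd-indexed equations, so the number of independent equations halves per level; this is the shrinking parallelism noted above.

```python
def cr_solve(a, b, c, x):
    """Cyclic reduction for a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i].

    Hypothetical sketch; a[0] and c[-1] are treated as unused.
    """
    n = len(x)
    if n == 1:
        return [x[0] / b[0]]  # system of size 1: solve directly
    # Reduction phase: fold each odd equation's neighbours into it.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-k1 * a[i - 1])
        rc.append(-k2 * c[i + 1] if has_right else 0.0)
        rb.append(b[i] - k1 * c[i - 1] - (k2 * a[i + 1] if has_right else 0.0))
        rx.append(x[i] - k1 * x[i - 1] - (k2 * x[i + 1] if has_right else 0.0))
    reduced_y = cr_solve(ra, rb, rc, rx)  # half-size system
    # Backward substitution phase: recover the even-indexed unknowns.
    y = [0.0] * n
    for j, i in enumerate(range(1, n, 2)):
        y[i] = reduced_y[j]
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Within one level, every iteration of the reduction loop and of the substitution loop is independent, so each level maps to one parallel pass.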

  11. Passes of a Vanilla CR Solver • [Diagram] Pass 1 constructs abc from the CoC • Reduce passes repeatedly shrink the input image X and abc, stopping at size 1 • Solve for the first y • Substitute passes then expand the solution back up to the blurred image Y

  12. Vanilla Solver Results • Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200) • Memory footprint prohibitively high: >200 MB at 1600x1200 • Need an answer to the lack-of-parallelism problem; an answer is given in [ZCO2010]

  13. Vanilla CR Solver • [Diagram] Same passes as before: construct abc from the CoC, reduce X and abc down to size 1, solve for the first y, substitute back up to the blurred image Y • Stopping at size 1 and solving for the first y is what kills parallelism

  14. Keeping the parallelism high • [Diagram] Construct abc from the CoC, then reduce X and abc • Stop at a reasonable size • Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010]) • Substitute back up to the blurred image Y
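A minimal sketch of parallel cyclic reduction (PCR), the kind of solver the slide suggests for the system left over once reduction stops early. The name `pcr_solve` and the list layout are assumptions for illustration; the inner loop over `i` is written serially here but has no dependencies between iterations, so it corresponds to one thread per equation.

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction for a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i].

    Hypothetical sketch; a[0] and c[-1] are forced to zero.
    """
    n = len(x)
    a, b, c, x = list(a), list(b), list(c), list(x)
    a[0] = 0.0
    c[-1] = 0.0
    stride = 1
    while stride < n:
        na, nb, nc, nx = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        for i in range(n):  # every equation is updated independently
            lo, hi = i - stride, i + stride
            nb[i], nx[i] = b[i], x[i]
            if lo >= 0:  # eliminate the coupling to y[i - stride]
                k1 = a[i] / b[lo]
                na[i] = -k1 * a[lo]
                nb[i] -= k1 * c[lo]
                nx[i] -= k1 * x[lo]
            if hi < n:  # eliminate the coupling to y[i + stride]
                k2 = c[i] / b[hi]
                nc[i] = -k2 * c[hi]
                nb[i] -= k2 * a[hi]
                nx[i] -= k2 * x[hi]
        a, b, c, x = na, nb, nc, nx
        stride *= 2  # remaining couplings are now twice as far apart
    # All off-diagonal couplings are gone: each equation is b[i]*y[i] = x[i].
    return [x[i] / b[i] for i in range(n)]
```

PCR does more arithmetic than CR, but it keeps all equations busy in every step, which is the trade-off that makes it attractive for the small remaining system.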

  15. Memory Optimizations 1 • [Diagram] Same passes as before: construct abc from the CoC, reduce X and abc, stop at a reasonable size, solve for Y at that resolution, substitute back up to the blurred image Y

  16. Memory Optimizations 1 • [Diagram] Same passes, annotated with surface formats: the X chain, the abc chain and the Y chain all use rgba32f surfaces

  17. Memory Optimizations 1 • [Diagram] The X chain and the Y chain now use rgba16f surfaces; the abc chain stays rgba32f • This saves a significant amount of memory • We found no artifacts when going from rgba32f to rgba16f
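To put the format change into numbers, here is a small sketch of the size of a single full-resolution surface in each format. It assumes one uncompressed surface with no padding or alignment and ignores the lower-resolution chain; the helper name `surface_bytes` is hypothetical, and the figures are plain arithmetic rather than measurements from the talk.

```python
def surface_bytes(width, height, bytes_per_texel):
    """Size of one tightly packed 2D surface in bytes (no padding assumed)."""
    return width * height * bytes_per_texel

WIDTH, HEIGHT = 1600, 1200
RGBA32F_BYTES = 4 * 4  # four channels, 4 bytes each
RGBA16F_BYTES = 4 * 2  # four channels, 2 bytes each

size_32f = surface_bytes(WIDTH, HEIGHT, RGBA32F_BYTES)
size_16f = surface_bytes(WIDTH, HEIGHT, RGBA16F_BYTES)
print(f"rgba32f: {size_32f / 1e6:.2f} MB, rgba16f: {size_16f / 1e6:.2f} MB")
```

Halving the texel size halves every surface in the X and Y chains, not only the full-resolution one.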

  18. Memory Optimizations 2 • [Diagram] Same passes and surface formats as on the previous slide (X and Y chains rgba16f, abc chain rgba32f) • This again saves a significant amount of memory, as this is the biggest surface used by the solver

  19. Memory Optimizations 2 • [Diagram] X and Y chains rgba16f; the abc chain (rgba32f) now starts at the first reduced level • Skip the abc construction pass and compute abc on the fly during the first reduction pass

  20. Intermediate Results 1600x1200

  21. Memory Optimizations 3 • [Diagram] X and Y chains rgba16f; abc chain (rgba32f) starting at the first reduced level • Skip the abc construction pass and compute abc during the first reduction pass • Yet again this saves a significant amount of memory!

  22. Memory Optimizations 3 • [Diagram] X (rgba16f) → reduce4 → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → substitute4 → Y (rgba16f) • Reduce 4-to-1 in a special first reduction pass • Skip the abc construction pass and compute abc during the first reduction pass • Substitute 1-to-4 in a special substitution pass

  23. Intermediate Results 1600x1200

  24. DX11 Memory Optimizations 1 • [Diagram] X (rgba16f) → reduce4 → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → substitute4 → Y (rgba16f) • Reduce 4-to-1 in a special first reduction pass • Skip the abc construction pass and compute abc during the first reduction pass • Substitute 1-to-4 in a special substitution pass

  25. DX11 Memory Optimizations 1 • [Diagram] Same passes as on the previous slide • Pack abc and X into one rgba_uint surface

  26. Using SM5 for data packing • X (rgba16f) and abc (rgba32f) are packed into four uints • First uint: pack the x and y channels of X: (f32tof16(X.x) + (f32tof16(X.y) << 16))

  27. Using SM5 for data packing • Second uint: the upper 26 bits of abc.x plus the lowest 6 bits of the half-precision X.z: (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F) • Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z

  28. Using SM5 for data packing • Third uint: the upper 26 bits of abc.y plus the middle 6 bits (bits 6 to 11) of the half-precision X.z: (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F) • Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z

  29. SM5 Memory Optimizations 1 • Fourth uint: the upper 26 bits of abc.z plus the remaining top 4 bits (bits 12 to 15) of the half-precision X.z: (asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F) • Steal the 6 lowest mantissa bits of abc.z to store some bits of X.z
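The packing scheme from the last four slides can be checked on the CPU with a short round-trip sketch. The functions `f32tof16`, `f16tof32`, `asuint` and `asfloat` below are Python stand-ins for the HLSL intrinsics of the same names, built on the standard `struct` module; their rounding and overflow behaviour may differ from the GPU. The `pack` and `unpack` helpers are hypothetical, and the unpacking step is an assumption, since the slides only show the packing side.

```python
import struct

def f32tof16(v):
    """Half-precision bit pattern of v (stand-in for the HLSL intrinsic)."""
    return struct.unpack('<H', struct.pack('<e', v))[0]

def f16tof32(h):
    """Float value of a 16-bit half-precision bit pattern."""
    return struct.unpack('<e', struct.pack('<H', h & 0xFFFF))[0]

def asuint(v):
    """Bit pattern of v as a 32-bit float."""
    return struct.unpack('<I', struct.pack('<f', v))[0]

def asfloat(u):
    """Float value of a 32-bit bit pattern."""
    return struct.unpack('<f', struct.pack('<I', u & 0xFFFFFFFF))[0]

def pack(X, abc):
    """Pack X.xyz (as halves) and abc.xyz (as floats) into four uints."""
    xz = f32tof16(X[2])
    return (
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),
        (asuint(abc[0]) & 0xFFFFFFC0) | (xz & 0x3F),          # bits 0..5 of X.z
        (asuint(abc[1]) & 0xFFFFFFC0) | ((xz >> 6) & 0x3F),   # bits 6..11 of X.z
        (asuint(abc[2]) & 0xFFFFFFC0) | ((xz >> 12) & 0x3F),  # bits 12..15 of X.z
    )

def unpack(p):
    """Assumed inverse of pack(): returns (X.xyz, abc.xyz)."""
    xz = (p[1] & 0x3F) | ((p[2] & 0x3F) << 6) | ((p[3] & 0x3F) << 12)
    X = (f16tof32(p[0] & 0xFFFF), f16tof32(p[0] >> 16), f16tof32(xz))
    abc = tuple(asfloat(v & 0xFFFFFFC0) for v in p[1:])
    return X, abc
```

X survives the round trip at half precision unchanged, while each abc component loses its 6 lowest mantissa bits, a relative error below 2^-17 on top of the 32-bit float rounding.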

  30. Sample Screenshot

  31. Abs(Packed-Unpacked) x 255.0f

  32. DX11 Memory Optimizations 2 • The solver does a horizontal and a vertical pass • The chain of lower-resolution RTs needs to exist twice: • a horizontal reduction/substitution chain • a vertical reduction/substitution chain • How can DX11 help?

  33. DX11 Memory Optimizations 2 • UAVs allow us to reuse the data of the horizontal chain for the vertical chain • A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly: ~40% lower fps • We stayed with RTs, as memory use was already quite low • Use UAVs only if you are really concerned about memory

  34. Final Results 1600x1200

  35. Future Work • Look into compute shader (CS) acceleration of the solver: • the 4-to-1 reduction pass • the 1-to-4 substitution pass • Look into using heat diffusion for other effects, e.g. motion blur

  36. Conclusion • The optimized CR solver is fast and memory-efficient • Used in Dragon Age 2 • 4A Games is considering its use for new projects • Detailed description in 'Game Engine Gems 2' • Mail me (holger.gruen@amd.com) if you want access to the sources

  37. References • [Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar Technical Memo #06-01 • [ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010 • [RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010 • [Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010

  38. Backup

  39. Results 1920x1200
