Technology Behind AMD’s Forward+ Rendering Demo

Technology Behind AMD’s “Leo Demo” Jay McKee, MTS Engineer, AMD

Why Forward Rendering? • Complex materials • Multiple light types • Supports hardware anti-aliasing • Efficient memory usage • Supports transparency • BUT, previously could not support a large number of lights

Forward+ Rendering • Modified forward renderer: add a compute shader for light culling and modify the main light loop. • Lighting and shading are done in the same place, so all information is preserved.

Forward+ Rendering (continued) • No limits on parameters for lights and materials • Omni • Spot • Cinematic (arbitrary falloffs, barndoor) • BRDF per material instance • Simple design, concentrate on rendering, not engine maintenance.

Important DX11 features • Compute Shaders • UAV support.

Compute Shaders • The Leo demo uses two compute shaders: • One for culling lights. • Another for spawning Virtual Point Lights (VPLs) for indirect lighting. • Culling 3,072 lights takes 1.7 ms on a high-end GPU.

UAVs • Array(s) of scene light information. • Array of u32 light indices for storing start/end lights per-tile. • Array of material instance data

Algorithm summary • Depth Pre-Pass • Light Culling • The screen is divided into tiles, and the compute shader is launched per tile. • Light info such as position, radius, direction, and length is passed to the light culling compute shader. • The light culling shader projects light bounds to screen-space tiles and uses scene depth from the z pre-pass for z testing against light volumes. • Outputs a UAV describing the per-tile light list start/end, along with a large UAV holding a u32 array of light indices. • The output UAVs are passed to the main light shaders for looping through lights per pixel.
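The culling step above can be sketched on the CPU as a reference. This is a minimal C++ sketch, not the demo's compute shader: it assumes each light has already been projected to a screen-space circle plus a view-space depth interval, and that per-tile min/max depth is available from the pre-pass. The names ScreenLight, TileDepth, CulledLights and cullLights are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, already-projected light bounds: screen-space circle + depth interval.
struct ScreenLight { float cx, cy, radius, zNear, zFar; };

// Per-tile scene depth bounds, as a depth pre-pass reduction would produce them.
struct TileDepth { float minZ, maxZ; };

struct CulledLights {
    std::vector<uint32_t> start;    // first entry in `indices` for each tile
    std::vector<uint32_t> end;      // one past the last entry for each tile
    std::vector<uint32_t> indices;  // flat list of light indices, tile after tile
};

CulledLights cullLights(const std::vector<ScreenLight>& lights,
                        const std::vector<TileDepth>& tileDepth,
                        uint32_t width, uint32_t height, uint32_t tileRes)
{
    const uint32_t tilesX = (width + tileRes - 1) / tileRes;
    const uint32_t tilesY = (height + tileRes - 1) / tileRes;
    CulledLights out;
    out.start.resize(tilesX * tilesY);
    out.end.resize(tilesX * tilesY);

    for (uint32_t ty = 0; ty < tilesY; ++ty) {
        for (uint32_t tx = 0; tx < tilesX; ++tx) {
            const uint32_t tile = ty * tilesX + tx;
            const float x0 = float(tx * tileRes), x1 = x0 + float(tileRes);
            const float y0 = float(ty * tileRes), y1 = y0 + float(tileRes);
            out.start[tile] = uint32_t(out.indices.size());
            for (uint32_t i = 0; i < lights.size(); ++i) {
                const ScreenLight& l = lights[i];
                // Circle vs. rectangle: closest point of the tile to the circle centre.
                const float px = std::min(std::max(l.cx, x0), x1);
                const float py = std::min(std::max(l.cy, y0), y1);
                const float dx = l.cx - px, dy = l.cy - py;
                const bool hitsTile = dx * dx + dy * dy <= l.radius * l.radius;
                // Depth interval overlap against the tile's min/max scene depth.
                const bool hitsDepth = l.zFar >= tileDepth[tile].minZ &&
                                       l.zNear <= tileDepth[tile].maxZ;
                if (hitsTile && hitsDepth) out.indices.push_back(i);
            }
            out.end[tile] = uint32_t(out.indices.size());
        }
    }
    return out;
}
```

The serial loop writes each tile's list contiguously, so start/end fall out of the running size of the index list. On the GPU the tiles run in parallel and have to reserve their ranges explicitly.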

Algorithm summary continued • Render scene materials • Base light accumulation function • Use screen x, y location to determine tileID • From tileID, get light start and end indices • From start index to end index, loop • Entry is index into light array. • Accumulate light hitting pixel • Returns total direct and indirect light hitting pixel.
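As a rough illustration of the base light accumulation function, here is a CPU-side C++ sketch of the per-pixel lookup: screen position to tile ID, tile ID to start/end, then a loop over the index list. The TileLists layout and the scalar per-light intensity are assumptions made for brevity; the demo evaluates full light types in a pixel shader.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU mirror of the culling outputs.
struct TileLists {
    uint32_t tileRes;                // tile edge length in pixels
    uint32_t width;                  // screen width in pixels
    std::vector<uint32_t> start;     // per-tile first entry in `indices`
    std::vector<uint32_t> end;       // per-tile one-past-last entry
    std::vector<uint32_t> indices;   // flat list of light indices
};

// Screen x, y -> tile ID (row-major tiles, rounding the tile count up).
uint32_t tileIndex(const TileLists& t, float x, float y) {
    const uint32_t tilesX = (t.width + t.tileRes - 1) / t.tileRes;
    return uint32_t(x) / t.tileRes + (uint32_t(y) / t.tileRes) * tilesX;
}

// Sums a stand-in scalar intensity for every light listed for the pixel's tile.
float accumulate(const TileLists& t, const std::vector<float>& lightIntensity,
                 float x, float y) {
    const uint32_t tile = tileIndex(t, x, y);
    float total = 0.0f;
    for (uint32_t k = t.start[tile]; k < t.end[tile]; ++k) {
        const uint32_t lightIdx = t.indices[k];  // entry is an index into the light array
        total += lightIntensity[lightIdx];
    }
    return total;
}
```

Only lights that survived culling for the pixel's tile are visited, which is what keeps the per-pixel loop short even with thousands of lights in the scene.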

Algorithm summary continued • Material shader • Decides what to do with total incoming light • Passed into material’s BRDF for example • Uses light accumulation building blocks • Env. lighting, base light accumulation, BRDF, etc. are put together for final pixel color.

Light Culling Shader Details (1/3)
// 1. prepare
float4 frustum[4];
float minZ, maxZ;
{
    ConstructFrustum( frustum );
    minZ = thread_REDUCE(MIN, depth );
    maxZ = thread_REDUCE(MAX, depth );
    ldsMinZ = SIMD_REDUCE(MIN, minZ );
    ldsMaxZ = SIMD_REDUCE(MAX, maxZ );
    minZ = ldsMinZ;
    maxZ = ldsMaxZ;
}
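The min/max reduction in step 1 can be illustrated serially. The C++ sketch below assumes a row-major depth buffer and scans one tile's pixels for its depth bounds. DepthRange and tileDepthRange are hypothetical names, and the real shader performs this reduction in parallel across the threads of a work group.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct DepthRange { float minZ, maxZ; };

// Serial stand-in for the per-tile MIN/MAX reduction over the depth pre-pass output.
DepthRange tileDepthRange(const std::vector<float>& depth, int width, int height,
                          int tileX, int tileY, int tileRes)
{
    DepthRange r{ std::numeric_limits<float>::max(),
                  std::numeric_limits<float>::lowest() };
    // Clip the tile against the screen so partial edge tiles are handled.
    const int x1 = std::min((tileX + 1) * tileRes, width);
    const int y1 = std::min((tileY + 1) * tileRes, height);
    for (int y = tileY * tileRes; y < y1; ++y) {
        for (int x = tileX * tileRes; x < x1; ++x) {
            const float z = depth[y * width + x];
            r.minZ = std::min(r.minZ, z);
            r.maxZ = std::max(r.maxZ, z);
        }
    }
    return r;
}
```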

Light Culling Shader Details (2/3)
__local u32 ldsNLights = 0;
__local u32 ldsLightBuffer[MAX];
// 2. overlap check, accumulate in LDS
for(int i=threadIdx; i<nLights; i+=WG_SIZE)
{
    Light light = fetchAndTransform( lightBuffer[ i ] );
    if( overlaps( light, frustum ) && overlaps( light, minZ, maxZ ) )
    {
        AtomicAppend( ldsLightBuffer, i );
    }
}
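The slides do not show the two overlaps() tests, so the following C++ sketch is one plausible form rather than the demo's code. It assumes a spherical light bound in view space, four tile side planes with inward-facing normals, and depth measured along +z. All type and function names here are hypothetical.

```cpp
#include <array>

struct Vec3 { float x, y, z; };
struct Plane { Vec3 n; float d; };   // points inside satisfy dot(n, p) + d >= 0
struct Sphere { Vec3 c; float r; };  // assumed light bound: centre and radius

float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Conservative sphere vs. tile frustum test: reject only if the sphere lies
// entirely behind one of the four side planes.
bool overlapsFrustum(const Sphere& s, const std::array<Plane, 4>& frustum) {
    for (const Plane& p : frustum) {
        if (signedDistance(p, s.c) < -s.r) return false;
    }
    return true;
}

// Depth test against the tile's min/max scene depth from the pre-pass.
bool overlapsDepth(const Sphere& s, float minZ, float maxZ) {
    return s.c.z + s.r >= minZ && s.c.z - s.r <= maxZ;
}
```

The plane test can report a few false positives near tile corners. That only costs some extra lights in a tile's list and never drops a light that actually reaches the tile.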

Light Culling Shader Details (3/3)
// 3. export to global
__local u32 ldsOffset;
if( threadIdx == 0 )
{
    // reserve a contiguous range in the global index buffer for this tile
    ldsOffset = AtomAdd( ldsNLights );
    globalLightStart[tileIdx] = ldsOffset;
    globalLightEnd[tileIdx] = ldsOffset + ldsNLights;
}
for(int i=threadIdx; i<ldsNLights; i+=WG_SIZE)
{
    int dstIdx = ldsOffset + i;
    globalLightIndexBuffer[dstIdx] = ldsLightBuffer[i];
}
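The export step reserves a range in the shared index buffer with a single atomic add and then copies the tile's local list into it. A C++ sketch of that idea using std::atomic follows. GlobalLists and exportTile are hypothetical names, and the index buffer is assumed to be pre-sized large enough.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical CPU mirror of the global output buffers.
struct GlobalLists {
    std::atomic<uint32_t> counter{0};   // next free slot in `indices`
    std::vector<uint32_t> start, end;   // per-tile ranges
    std::vector<uint32_t> indices;      // pre-sized shared index buffer
};

// Reserves a contiguous range for one tile, then copies its local list there.
void exportTile(GlobalLists& g, uint32_t tile, const std::vector<uint32_t>& local) {
    const uint32_t n = uint32_t(local.size());
    // fetch_add returns the previous value, so concurrent tiles get disjoint ranges.
    const uint32_t offset = g.counter.fetch_add(n);
    g.start[tile] = offset;
    g.end[tile] = offset + n;
    for (uint32_t i = 0; i < n; ++i) {
        g.indices[offset + i] = local[i];
    }
}
```

Because each tile takes whatever offset the counter holds when it finishes, tile ranges appear in the index buffer in completion order, not tile order. This is why start/end are stored per tile.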

Light Accumulation Pseudo-code
// BaseLighting.inc
// THIS INC FILE IS ALL THE COMMON LIGHTING CODE
StructuredBuffer<float4> LightParams : register(u0);
StructuredBuffer<uint> LowerBoundLights : register(u1);
StructuredBuffer<uint> UpperBoundLights : register(u2);
StructuredBuffer<int2> LightIndexBuffer : register(u3);
uint GetTileIndex(float2 screenPos)
{
    float tileRes = (float)m_tileRes;
    uint numCellsX = (m_width + m_tileRes - 1)/m_tileRes;
    uint tileIdx = floor(screenPos.x/tileRes)+floor(screenPos.y/tileRes)*numCellsX;
    return tileIdx;
}

Light Accumulation (2):
StartHLSL BaseLightLoopBegin
// THIS IS A MACRO, INCLUDED IN MATERIAL SHADERS
uint tileIdx = GetTileIndex( pixelScreenPos );
uint startIdx = LowerBoundLights[tileIdx];
uint endIdx = UpperBoundLights[tileIdx];
[loop]
for ( uint lightListIdx = startIdx; lightListIdx < endIdx; lightListIdx++ )
{
    int lightIdx = LightIndexBuffer[lightListIdx];
    // Set common light parameters
    float ndotl = max(0, dot(normal, lightVec));
    float3 directLight = 0;
    float3 indirectLight = 0;

Light Accumulation (3):
    if( lightIdx >= numDirectLightsThisFrame )
    {
        CalculateIndirectLight(lightIdx, indirectLight);
    }
    else
    {
        if( IsConeLight( lightIdx ) )
        { // <<== Can add more light types here
            CalculateDirectSpotlight(lightIdx, directLight);
        }
        else
        {
            CalculateDirectSpherelight(lightIdx, directLight);
        }
    }
    float3 incomingLight = (directLight + indirectLight)*ndotl;
    float shadowTerm = CalcShadow();
EndHLSL
StartHLSL BaseLightLoopEnd
}
EndHLSL

Material Shader Template:
#include "BaseLighting.inc"
float4 PS ( PSInput i ) : SV_TARGET
{
    float3 totalDiffuse = 0;
    float3 totalSpec = GetEnvLighting();
    $include BaseLightLoopBegin
    // unique material code goes here!! Light accumulation on the pixel for a given light
    // we have total incoming light and direct/indirect light components as well as material params and shadow term
    // use these building blocks to integrate lighting terms
    totalDiffuse += GetDiffuse(incomingLight);
    totalSpec += CalcPhong(incomingLight);
    $include BaseLightLoopEnd
    float3 finalColor = totalDiffuse + totalSpec;
    return float4( finalColor, 1 );
}

Debug Mode Demo

Benchmark 3k dynamic lights

Compute-based Deferred vs. Forward+ Takahiro Harada, Jay McKee, Jason C. Yang, Forward+: Bringing Deferred Lighting to the Next Level, Eurographics Short Paper (2012)

Depth Pre-Pass Critical • Pixel overdraw cripples this technique, so a depth pre-pass is required. • The depth pre-pass is a good opportunity to use MRT to generate other full-screen data needed for post-fx and other render fx (optional).

Other important points • The Xbox 360 has good bandwidth, so given the limitations of forward rendering, deferred makes a lot of sense there. • However, ALU computation is growing at a faster rate than bandwidth, so it is more and more feasible to just do the calculations rather than read and write so much data. • Dynamic branching penalties are not nearly as bad as before. As an optimization, the compute shader can, for example, sort by light type to minimize penalties. • All the CPU-side "light management" code that decides which lights hit each object for setting constant registers can be ditched!
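The sort-by-light-type optimization mentioned above might look like the following on the CPU. This is a sketch under assumptions: the LightType enum and the per-light type array are hypothetical, and the demo would do the equivalent inside the culling compute shader.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical light categories; the slides name sphere, cone and indirect (VPL) lights.
enum class LightType : uint8_t { Sphere = 0, Cone = 1, Indirect = 2 };

// Reorders one tile's slice [start, end) of the index list so that lights of
// the same type are adjacent, keeping the original order within each type.
void sortTileByType(std::vector<uint32_t>& indices, uint32_t start, uint32_t end,
                    const std::vector<LightType>& types)
{
    std::stable_sort(indices.begin() + start, indices.begin() + end,
                     [&types](uint32_t a, uint32_t b) { return types[a] < types[b]; });
}
```

With each tile's list grouped by type, neighbouring pixels running the light loop tend to take the same branch on the same iteration, which is the point of the optimization.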

Summary • Modified forward renderer that handles scenes with 1000s of lights. • Hardware anti-aliasing (MSAA) “automatic” • Bandwidth friendly. • Makes the most of the GPU's ALU power (which is growing faster than bandwidth)

Thanks! Contact: Takahiro.Harada@amd.com jay.mckee@amd.com jasonc.yang@amd.com Leo Demo website: http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx Eurographics 2012: 'Forward+: Bringing Deferred Lighting to the Next Level'
