DirectX 9 & Radeon 9700 Performance Optimizations

Richard Huddy, RHuddy@ati.com


Presentation Transcript


  1. DirectX 9 & Radeon 9700 Performance Optimizations Richard Huddy RHuddy@ati.com

  2. DirectX 9 and Radeon 9700 considerations • Resources • Sorting and Clearing • Vertex Buffers and Index Buffers • Render States • How to draw primitives • Vertex Data • Vertex Shaders • Pixel Shaders • Textures • Targets (both Z and color) • Miscellaneous

  3. General resource management • Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc) • “Most important” is “most frequently used” • Never call Create in your main loop • So create the main colour and Z buffers before you do anything else… • The “main buffer” is the one through which the largest number of pixels pass…

  4. Sorting • Sort roughly front to back • There’s a staggering amount of hardware devoted to making this highly efficient • Sort by vertex shader …or… • Sort by pixel shader, or • sort by texture • When you change VS or PS it’s good to go back to that shader as soon as possible… • Short shaders are faster^2 when sorted
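
The sorting advice on this slide can be sketched as a compound sort key. This is a minimal illustration, not the author's code: `DrawItem`, `depthBucket` and `sortDraws` are hypothetical names, the shader and texture IDs are assumed to be assigned by the application, and putting the coarse depth bucket ahead of the shader keys is one possible ordering among several.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; not a Direct3D type.
struct DrawItem {
    std::uint16_t vertexShader;  // app-assigned vertex shader ID
    std::uint16_t pixelShader;   // app-assigned pixel shader ID
    std::uint16_t texture;       // app-assigned texture ID
    float         viewDepth;     // distance from the camera
};

// Coarse depth bucket: "roughly" front to back, so exact depth order
// does not break up runs of draws that share the same shaders.
inline std::uint32_t depthBucket(float viewDepth, float bucketSize) {
    return viewDepth <= 0.0f ? 0u
                             : static_cast<std::uint32_t>(viewDepth / bucketSize);
}

// Sort by depth bucket first, then by vertex shader, pixel shader, texture.
// bucketSize must be greater than zero.
inline void sortDraws(std::vector<DrawItem>& draws, float bucketSize) {
    auto key = [bucketSize](const DrawItem& d) {
        return std::make_tuple(depthBucket(d.viewDepth, bucketSize),
                               d.vertexShader, d.pixelShader, d.texture);
    };
    std::stable_sort(draws.begin(), draws.end(),
                     [&key](const DrawItem& a, const DrawItem& b) {
                         return key(a) < key(b);
                     });
}
```

A larger `bucketSize` favours state grouping; a smaller one favours depth order. Which trade-off wins depends on the scene and would need measuring.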

  5. Clearing • Ideally use Clear once per frame (not less) • Always clear the whole render target • Don’t track dirty regions at all • Always clear colour, Z and stencil together unless you can just clear Z/stencil • Most importantly don’t force us to preserve stencil • Don’t use 2 triangles to clear… • Using Clear() is the way to get all the fancy Z buffer hardware working for you

  6. Vertex Buffers • Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc • Try to always use DISCARD at the start of the frame on dynamic VB’s • Specify write-only whenever possible • Use the default pool whenever possible • Roughly 2 – 4 MB for best performance • This allows large batches • And gives the driver sufficient granularity
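
The "standard VB handling algorithm" mentioned above is usually described as appending with NOOVERWRITE and wrapping with DISCARD. The sketch below models only the bookkeeping, without touching Direct3D: `LockMode`, `LockRequest` and `DynamicVbCursor` are hypothetical names, and the two enum values stand in for the `D3DLOCK_NOOVERWRITE` and `D3DLOCK_DISCARD` lock flags.

```cpp
#include <cstddef>

// Stand-ins for D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD (hypothetical enum).
enum class LockMode { NoOverwrite, Discard };

// Where to lock and with which flag.
struct LockRequest {
    std::size_t offset;
    LockMode    mode;
};

// Tracks the write position inside one dynamic vertex buffer.
class DynamicVbCursor {
public:
    explicit DynamicVbCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Call once at the start of each frame: the next lock will discard.
    void beginFrame() { forceDiscard_ = true; }

    // Reserve 'bytes' (assumed <= capacity). Appends with NoOverwrite while
    // there is room; otherwise restarts at offset 0 with Discard.
    LockRequest reserve(std::size_t bytes) {
        if (forceDiscard_ || cursor_ + bytes > capacity_) {
            forceDiscard_ = false;
            cursor_ = bytes;
            return {0, LockMode::Discard};
        }
        const std::size_t offset = cursor_;
        cursor_ += bytes;
        return {offset, LockMode::NoOverwrite};
    }

private:
    std::size_t capacity_;
    std::size_t cursor_ = 0;
    bool        forceDiscard_ = true;  // first lock ever also discards
};
```

In a real application the returned offset and mode would be passed on to the vertex buffer's lock call; that part is left out here so the logic can be read on its own.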

  7. Index Buffers • Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible • i.e. Use 32 bit indices only if you need to • Use 16 bit indices whenever you can • All recent ATI hardware treats Index Buffers as ‘first class citizens’ • They don’t have to be copied about before the chip gets access • So keep them out of system memory
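
As a rough illustration of "use 16 bit indices whenever you can", a tool or loader could pick the index width from the largest index in the mesh. `IndexFormat`, `chooseIndexFormat` and `narrowTo16` are hypothetical helpers; the enum values stand in for `D3DFMT_INDEX16` and `D3DFMT_INDEX32`, and the sketch assumes the full 16-bit range is usable as indices.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-ins for D3DFMT_INDEX16 / D3DFMT_INDEX32 (hypothetical enum).
enum class IndexFormat { Index16, Index32 };

// Pick the smallest index width that can address every referenced vertex.
inline IndexFormat chooseIndexFormat(const std::vector<std::uint32_t>& indices) {
    if (indices.empty()) {
        return IndexFormat::Index16;
    }
    const std::uint32_t maxIndex =
        *std::max_element(indices.begin(), indices.end());
    return maxIndex <= 0xFFFFu ? IndexFormat::Index16 : IndexFormat::Index32;
}

// Convert to 16-bit indices; only valid when chooseIndexFormat()
// returned Index16 for the same data.
inline std::vector<std::uint16_t> narrowTo16(const std::vector<std::uint32_t>& indices) {
    std::vector<std::uint16_t> out;
    out.reserve(indices.size());
    for (std::uint32_t i : indices) {
        out.push_back(static_cast<std::uint16_t>(i));
    }
    return out;
}
```

Meshes that exceed the 16-bit range can often be split into several draws so that each part still fits, which keeps the index buffer at half the size.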

  8. Updating Index and Vertex Buffers • IBs and VBs which are optimally located need to be updated with sequential DWORD writes. • AGP memory and LVM both benefit from this treatment…

  9. Handling Render States • Prefer minimal state blocks • ‘minimal’ means you should weed out any redundant state changes where possible • If 5% of state changes are redundant that’s OK • If 50% are redundant then get it fixed! • The expensive state changes: • Switching between VS and FF • Switching Vertex Shader • Changing Texture
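
One way to weed out redundant state changes, and to measure how many there are, is a small shadow cache in front of the device. The class below is an assumed design rather than anything from the slides: `RenderStateCache` is a hypothetical name and the state and value are plain integers instead of Direct3D enums.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Shadow copy of render states that filters out redundant changes.
class RenderStateCache {
public:
    // Returns true when the change is real and should be forwarded to the
    // device (e.g. via SetRenderState); false when it is redundant.
    bool set(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        ++applied_;
        return true;
    }

    // Share of requested changes that were redundant, in [0, 1].
    double redundantFraction() const {
        const std::size_t total = applied_ + redundant_;
        return total == 0 ? 0.0
                          : static_cast<double>(redundant_) / static_cast<double>(total);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::size_t applied_ = 0;
    std::size_t redundant_ = 0;
};
```

Logging `redundantFraction()` once per frame gives a direct reading against the 5% and 50% figures on the slide.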

  10. How to draw primitives • DrawIndexedPrimitive( strip or list ) • Indexing is a big win on real world data • Long strips beat everything else • Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists) • Make sure your VB’s and IB’s are in optimal memory for best performance • Give the card hundreds of polys per call • Small batches kill performance
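
The strip-versus-list rule of thumb can be checked offline. The sketch below counts degenerate triangles in a strip's index sequence and applies the ~20% threshold; `StripStats`, `analyseStrip` and `preferList` are hypothetical names, and measuring the share against the strip's total triangle count is an assumption, since the slide does not say what the percentage is relative to.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct StripStats {
    std::size_t total;       // triangles the strip encodes, degenerates included
    std::size_t degenerate;  // triangles with a repeated index
};

// Walk a triangle strip's indices and count degenerate triangles.
inline StripStats analyseStrip(const std::vector<std::uint16_t>& strip) {
    StripStats s{0, 0};
    for (std::size_t i = 2; i < strip.size(); ++i) {
        ++s.total;
        if (strip[i] == strip[i - 1] || strip[i] == strip[i - 2] ||
            strip[i - 1] == strip[i - 2]) {
            ++s.degenerate;
        }
    }
    return s;
}

// True when degenerates exceed the threshold share of the strip.
inline bool preferList(const StripStats& s, double threshold = 0.20) {
    return s.total > 0 &&
           static_cast<double>(s.degenerate) / static_cast<double>(s.total) > threshold;
}
```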

  11. Vertex data • Don’t scatter it around • Fewer streams give better cache behaviour • Compress it if you can • 16 bits or less per component • Even if it costs you 1 or 2 ops in the shader… • Try to avoid spilling into AGP • Because AGP has high latency • pow2 sizes help – 32 bytes is best • Work the cache on the GPU • Avoid random access patterns where possible by reordering vertex data before the main loop… • That’s at app start up or at authoring time
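
To make "16 bits or less per component" and "32 bytes is best" concrete, here is a minimal sketch of packing unit-range floats into signed 16-bit values, in the spirit of the normalized short vertex declaration types. `packSnorm16`, `unpackSnorm16` and the `PackedVertex` layout are hypothetical; the layout is just one arrangement that happens to total 32 bytes.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack a float in [-1, 1] into a signed, normalized 16-bit value.
// Out-of-range input is clamped.
inline std::int16_t packSnorm16(float v) {
    const float clamped = std::max(-1.0f, std::min(1.0f, v));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// The matching expansion; on the GPU this is the "1 or 2 ops" (or free
// fixed-function expansion) that the compression costs.
inline float unpackSnorm16(std::int16_t packed) {
    return static_cast<float>(packed) / 32767.0f;
}

// Hypothetical 32-byte vertex: full-precision position, packed everything else.
struct PackedVertex {
    float        position[3];  // 12 bytes
    std::int16_t normal[4];    //  8 bytes, w unused
    std::int16_t uv0[2];       //  4 bytes
    std::int16_t uv1[2];       //  4 bytes
    std::uint8_t colour[4];    //  4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should stay at 32 bytes");
```

Texture coordinates outside [-1, 1] would need a scale factor in a shader constant; that is omitted here.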

  12. Compiling and Linking shaders • Do this all “up front” • It may not be obvious to you - but you have to actually use a shader to force its complete instantiation in DirectX 9 • So, if you’re not careful you may get linking happening in your main loop • And linking may be time consuming • Draw a little of everything before you start for real. Think of this as priming the caches…

  13. Vertex shaders I • Shorter shaders are faster – no surprises here… • Avoid all unnecessary writes • This includes the output registers of the VS • So use the write masks aggressively • Pack constants as much as possible • Prefer locality of reference on constants too… • Be aware of the expansion of macros but prefer them anyway if they match exactly what you want • Pack your shader constant updates • You should optimise the algorithm and leave the object-code optimisation to the driver/runtime

  14. Vertex shaders II • Branches and conditionals are fast so use them aggressively • That’s not like the CPU where branches are slow… • Longer shaders allow better batching • Shorter shaders are also more cache friendly • i.e. it’s usually faster to switch to the previous shader than to any other • But the shorter your shaders are… • …the more of them fit into the cache.

  15. Vertex shaders III • API Change: • Now you don’t “mov” to the address register, you use “mova” • And this performs round to nearest, not floor • And now A0 is a 4d register • A0.x, A0.y, A0.z, A0.w

  16. Pixel shaders I • API change to accommodate MET’s: • You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour • And the write has to be with a mov instruction • If you write to oC[n] you must write to all elements from oC[0] to oC[n-1] • i.e. Writes must be contiguous starting at oC0 • But the writes can happen in any order • You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)

  17. Pixel shaders II • Shorter is much faster • It’s much easier to be pixel limited than vertex limited • Short shaders are more cache friendly • Be aggressive with write masks • Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out) • Generally prefer to spend cycles on shader ops rather than using texture lookups • Because memory latency is the enemy here

  18. Pixel shaders III • Dual issue? • But that’s not in the 2.0 shader spec… • But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed • And that means it has dual issue hardware ready for you to use

  19. Pixel shaders IV • Example: Diffuse + specular lighting
      Original version (total: 8 instructions):
        …
        dp3 r0, r1, r0          // N.H
        dp3 r2, r1, r2          // N.L
        mul r2, r2, r3          // * color
        mul r2, r2, r4          // * texture
        mul r0.r, r0.r, r0.r    // spec^2
        mul r0.r, r0.r, r0.r    // spec^4
        mul r0.r, r0.r, r0.r    // spec^8
        mad r0.rgb, r0.r, r5, r2
        …
      Colour and alpha split out (optimized to 5 “DI” instructions):
        …
        dp3 r0, r1, r0          // N.H
        dp3 r2.r, r1, r2        // N.L
        mul r6.a, r0.r, r0.r    // spec^2
        mul r2.rgb, r2.r, r3    // * color
        mul r6.a, r6.a, r6.a    // spec^4
        mul r2.rgb, r2, r4      // * texture
        mul r6.a, r6.a, r6.a    // spec^8
        mad r0.rgb, r6.a, r5, r2
        …

  20. Pixel shaders V • Texture instructions • Avoid TEXDEPTH to retain the early Z-reject • If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant] • Register usage • Minimize total number of registers used • No problems with dependency

  21. Vertex and Pixel shaders • If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then… • …maybe you should consider HLSL? • In most cases it is as good as hand written assembler • And much faster to author… • Perfect for prototyping • And for release code where you use D3DX

  22. Textures I • API addition • SetSamplerState() • Handles the now-decoupled texture sampler setup. • You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways • Texture coordinates are now just iterated floats • Samplers handle clamp, wrap, bias and filter modes • You have 8 texture coordinates • And 16 texture samplers • texld r11, t7, s15 (all register numbers are max)

  23. Textures II • Use compressed textures • Do you need a good compressor? • Use smaller textures • Use 16 bit textures in preference to 32 bit • Use textures with few components • Use an L8 or A8 format if that’s what you want • Pack textures together • e. g. If you’re using two 2D textures then consider using a single RGBA texture • Texture performance is bandwidth limited

  24. Textures III • Filtering modes • Use trilinear filtering to improve texture cache coherency • Only use anisotropic or tri-linear filtering when they make sense - they are more expensive • Avoid using anisotropic filtering with bumpmapping • Avoid using tri-linear anisotropic filtering unless the quality win justifies it • More costly filtering is more affordable with longer pixel shaders

  25. Targets • Always clear the whole of the target • Present(): • WASSTILLDRAWING makes a comeback • Please use it! • Because using it properly will gain you CPU cycles - and that’s typically your scarcest resource

  26. Depth Buffer I • Never lock depth buffers • Clearing depth buffers • Clear the whole surface • When stencil is present clear both depth and stencil simultaneously • If possible disable depth buffering when alpha blending (i.e. drawing HUD’s) • Use as few depth buffers as possible… • i.e. re-use them across multiple render targets

  27. Depth Buffer II • Efficiently use Hyper-Z • Render front to back • Make Znear, Zfar close to active depth range of the scene • The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!

  28. Occlusion query • New to DirectX 9 • In GL you have HP_occlusion_test and NV_occlusion_query to avoid the need for locks • Not free, but much cheaper than Lock() • Supported on all ATI hardware since the Radeon 8500 • CreateQuery(OCCLUSION, ppQuery) • Issue(Begin/End) • GetData() returns S_OK to signal completion - but please don’t spin waiting for the answer…

  29. AGP 8X • Is fast at ~2GB per second • But has high latency compared to LVM • And is 10 times slower than LVM • Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM • (LVM = Local Video Memory)

  30. User clip planes • User clip planes are much more efficient than texkill because: • They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels • It’s important always to kill data at the earliest stage possible in the pipeline • Plus, clipping is essentially a geometric operation • All hardware which supports ps1.4 supports user clip planes in hardware

  31. Sky box. First or last? • Draw it last because: • That’s a rough front to back sort • In this case you know that most sky pixels will fail the Z test. • Draw it first because: • That way you don’t need any Z tests • In this case you know that most sky pixels would pass the Z test

  32. So, here is our target: • DX9 style mainstream graphics (per frame): • > 500K triangles • < 500 DrawIndexedPrimitive() calls • < 500 VertexBuffer switches • < 200 different textures • < 200 State change groups • Few calls to SetRenderTarget - aim for 0 to 4... • 1 pass per poly is typical, but 2 is sometimes smart • Runs at monitor refresh rate • Which gives more than 40 million polys per second • And everything goes through the programmable pipeline • No occurrences of Lock(0), DrawPrimitive(), DPUP()
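
The arithmetic behind these targets can be checked in a few lines. The 85 Hz refresh rate below is an assumption chosen for illustration (the slide only says "monitor refresh rate"), and the function names are hypothetical.

```cpp
#include <cstdint>

// Triangles per second for a given per-frame budget and refresh rate.
constexpr std::int64_t trianglesPerSecond(std::int64_t trianglesPerFrame,
                                          std::int64_t refreshHz) {
    return trianglesPerFrame * refreshHz;
}

// Average triangles per draw call for a given per-frame budget.
constexpr std::int64_t averageBatchSize(std::int64_t trianglesPerFrame,
                                        std::int64_t drawCalls) {
    return trianglesPerFrame / drawCalls;
}

// 500K triangles per frame at an assumed 85 Hz: 42.5 million per second.
static_assert(trianglesPerSecond(500000, 85) == 42500000, "budget check");

// 500K triangles spread over 500 calls: 1000 triangles per call on average.
static_assert(averageBatchSize(500000, 500) == 1000, "batch check");
```

At the lower bounds of the target, 500K triangles in fewer than 500 calls works out to at least 1000 triangles per call on average, which is in line with the earlier advice to avoid small batches.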

  33. Questions… ? Richard Huddy RHuddy@ati.com
