Directx 9 radeon 9700 performance optimizations
1 / 33

DirectX 9 & Radeon 9700 Performance Optimizations - PowerPoint PPT Presentation

  • Uploaded on

DirectX 9 & Radeon 9700 Performance Optimizations. Richard Huddy [email protected] DirectX 9 and Radeon 9700 considerations. Resources Sorting and Clearing Vertex Buffers and Index Buffers Render States How to draw primitives Vertex Data Vertex Shaders Pixel Shaders Textures

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'DirectX 9 & Radeon 9700 Performance Optimizations' - brie

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Directx 9 radeon 9700 performance optimizations l.jpg

DirectX 9 & Radeon 9700Performance Optimizations

Richard Huddy

[email protected]

Directx 9 and radeon 9700 considerations l.jpg
DirectX 9 and Radeon 9700 considerations

  • Resources

  • Sorting and Clearing

  • Vertex Buffers and Index Buffers

  • Render States

  • How to draw primitives

  • Vertex Data

  • Vertex Shaders

  • Pixel Shaders

  • Textures

  • Targets (both Z and color)

  • Miscellaneous

General resource management l.jpg
General resource management

  • Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc)

  • “Most important” is “most frequently used”

  • Never call Create in your main loop

    • So create the main colour and Z buffers before you do anything else…

      • The “main buffer” is the one through which the largest number of pixels pass…

Sorting l.jpg

  • Sort roughly front to back

    • There’s a staggering amount of hardware devoted to making this highly efficient

  • Sort by vertex shader


    • Sort by pixel shader, or

    • sort by texture

  • When you change VS or PS it’s good to go back to that shader as soon as possible…

  • Short shaders are faster^2 when sorted

  • Clearing l.jpg

    • Ideally use Clear once per frame (not less)

      • Always clear the whole render target

        • Don’t track dirty regions at all

      • Always clear colour, Z and stencil together unless you can just clear Z/stencil

        • Most importantly don’t force us to preserve stencil

    • Don’t use 2 triangles to clear…

    • Using Clear() is the way to get all the fancy Z buffer hardware working for you

    Vertex buffers l.jpg
    Vertex Buffers

    • Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc

    • Try to always use DISCARD at the start of the frame on dynamic VB’s

    • Specify write-only whenever possible

    • Use the default pool whenever possible

    • Roughly 2 – 4 MB for best performance

      • This allows large batches

      • And gives the driver sufficient granularity

    Index buffers l.jpg
    Index Buffers

    • Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible

      • i.e. Use 32 bit indices only if you need to

      • Use 16 bit indices whenever you can

    • All recent ATI hardware treats Index Buffers as ‘first class citizens’

      • They don’t have to be copied about before the chip gets access

      • So keep them out of system memory

    Updating index and vertex buffers l.jpg
    Updating Index and Vertex Buffers

    • IBs and VBs which are optimally located need to be updated with sequential DWORD writes.

    • AGP memory and LVM both benefit from this treatment…

    Handling render states l.jpg
    Handling Render States

    • Prefer minimal state blocks

      • ‘minimal’ means you should weed out any redundant state changes where possible

        • If 5% of state changes are redundant that’s OK

        • If 50% are redundant then get it fixed!

    • The expensive state changes:

      • Switching between VS and FF

      • Switching Vertex Shader

      • Changing Texture

    How to draw primitives l.jpg
    How to draw primitives

    • DrawIndexedPrimitive( strip or list )

      • Indexing is a big win on real world data

      • Long strips beat everything else

      • Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists)

      • Make sure your VB’s and IB’s are in optimal memory for best performance

      • Give the card hundreds of polys per call

        • Small batches kill performance

    Vertex data l.jpg
    Vertex data

    • Don’t scatter it around

      • Fewer streams give better cache behaviour

    • Compress it if you can

      • 16 bits or less per component

      • Even if it costs you 1 or 2 ops in the shader…

    • Try to avoid spilling into AGP

      • Because AGP has high latency

    • pow2 sizes help – 32 bytes is best

      • Work the cache on the GPU

    • Avoid random access patterns where possible by reordering vertex data before the main loop…

      • That’s at app start up or at authoring time

    Compiling and linking shaders l.jpg
    Compiling and Linking shaders

    • Do this all “up front”

      • It may not be obvious to you - but you have to actually use a shader to force it’s complete instantiation in DirectX 9

      • So, if you’re not careful you may get linking happening in your main loop

      • And linking may be time consuming 

      • Draw a little of everything before you start for real. Think of this as priming the caches…

    Vertex shaders i l.jpg
    Vertex shaders I

    • Shorter shaders are faster – no surprises here…

    • Avoid all unnecessary writes

      • This includes the output registers of the VS

      • So use the write masks aggressively

      • Pack constants as much as possible

      • Prefer locality of reference on constants too…

    • Be aware of the expansion of macros but prefer them anyway if they match exactly what you want

    • Pack your shader constant updates

    • You should optimise the algorithm and leave the object-code optimisation to the driver/runtime

    Vertex shaders ii l.jpg
    Vertex shaders II

    • Branches and conditionals are fast so use them agressively

      • That’s not like the CPU where branches are slow…

      • Longer shaders allow better batching

    • Shorter shaders are also more cache friendly

      • i.e. it’s usually faster to switch to the previous shader than to any other

      • But the shorter your shaders are…

      • …the more of them fit into the cache.

    Vertex shaders ii15 l.jpg
    Vertex shaders II

    • API Change:

      • Now you don’t “mov” to the address register, you use “mova”

      • And this performs round to nearest, not floor

      • And now A0 is a 4d register

        • A0.x, A0.y, A0.z, A0.w

    Pixel shaders i l.jpg
    Pixel shaders I

    • API change to accommodate MET’s:

      • You now have to explicitly write to oC0, oC1, oC2 and 0C3 to set the output colour

      • And the write has to be with a mov instruction

      • If you write to 0C[n] you must write to all elements from oC[0] to 0c[n-1]

        • i.e. Writes must be contiguous starting at oC0

        • But the writes can happen in any order

    • You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)

    Pixel shaders ii l.jpg
    Pixel shaders II

    • Shorter is much faster

      • It’s much easier to be pixel limited than vertex limited

      • Short shaders are more cache friendly

      • Be aggressive with write masks

      • Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out)

    • Generally prefer to spend cycles on shader ops rather than using texture lookups

      • Because memory latency is the enemy here

    Pixel shaders iii l.jpg
    Pixel shaders III

    • Dual issue?

      • But that’s not in the 2.0 shader spec…

      • But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed

      • And that means it has dual issue hardware ready for you to use

    Pixel shaders iv l.jpg
    Pixel shaders IV

    • Example : Diffuse + specular lighting

    dp3 r0, r1, r0 // N.H

    dp3 r2, r1, r2 // N.L

    mul r2, r2, r3 // * color

    mul r2, r2, r4 // * texture

    mul r0.r, r0.r, r0.r // spec^2

    mul r0.r, r0.r, r0.r // spec^4

    mul r0.r, r0.r, r0.r // spec^8

    mad r0.rgb, r0.r, r5, r2

    Total: 8 instructions

    dp3 r0, r1, r0 // N.H

    dp3 r2.r, r1, r2 // N.L

    mul r6.a, r0.r, r0.r // spec^2

    mul r2.rgb, r2.r, r3 // * color

    mul r6.a, r6.a, r6.a // spec^4

    mul r2.rgb, r2, r4 // * texture

    mul r6.a, r6.a, r6.a // spec^8

    mad r0.rgb, r6.a, r5, r2

    Optimized to 5 “DI” instructions

    Pixel shaders iv20 l.jpg
    Pixel shaders IV

    • Texture instructions

      • Avoid TEXDEPTH to retain the early Z-reject

      • If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant]

    • Register usage

      • Minimize total number of registers used

      • No problems with dependency

    Vertex and pixel shaders l.jpg
    Vertex and Pixel shaders

    • If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then…

    • …maybe you should consider HLSL?

    • In most cases it is as good as hand written assembler

    • And much faster to author…

      • Perfect for prototyping

      • And for release code where you use D3DX

    Textures i l.jpg
    Textures I

    • API addition

      • SetSamplerState()

      • Handles the now-decoupled texture sampler setup.

      • You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways

        • Texture coordinates are now just iterated floats

        • Samplers handle clamp, wrap, bias and filter modes

      • You have 8 texture coordinates

      • And 16 texture samplers

        • texld r11, t7, s15 (all register numbers are max)

    Textures ii l.jpg
    Textures II

    • Use compressed textures

      • Do you need a good compressor?

    • Use smaller textures

    • Use 16 bit textures in preference to 32 bit

    • Use textures with few components

      • Use an L8 or A8 format if that’s what you want

    • Pack textures together

      • e. g. If you’re using two 2D textures then consider using a single RGBA texture

    • Texture performance is bandwidth limited

    Textures iii l.jpg
    Textures III

    • Filtering modes

      • Use trilinear filtering to improve texture cache coherency

      • Only use anisotropic or tri-linear filtering when they make sense - they are more expensive

      • Avoid using anisotropic filtering with bumpmapping

      • Avoid using tri-linear anisotropic filtering unless the quality win justifies it

      • More costly filtering is more affordable with longer pixel shaders

    Targets l.jpg

    • Always clear the whole of the target

    • Present():

      • WASSTILLDRAWING makes a comeback

      • Please use it!

      • Because using it properly will gain you CPU cycles - and that’s typically your scarcest resource

    Depth buffer i l.jpg
    Depth Buffer I

    • Never lock depth buffers

    • Clearing depth buffers

      • Clear the whole surface

      • When stencil is present clear both depth and stencil simultaneously

    • If possible disable depth buffering when alpha blending (i.e. drawing HUD’s)

    • Use as few depth buffers as possible…

      • i.e. re-use them across multiple render targets

    Depth buffer ii l.jpg
    Depth Buffer II

    • Efficiently use Hyper-Z

      • Render front to back

      • Make Znear, Zfar close to active depth range of the scene

      • The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!

    Occlusion query l.jpg
    Occlusion query

    • New to DirectX 9

      • In GL you have HP_occlusion_query and NV_occlusion_query to avoid the need for locks

        • Not free, but much cheaper than Lock()

    • Supported on all ATI hardware since the Radeon 8500

    • CreateQuery(OCCLUSION, ppQuery)

    • Issue(Begin/End)

    • GetData() returns S_OK to signal completion - but please don’t spin waiting for the answer…

    Agp 8x l.jpg
    AGP 8X

    • Is fast at ~2GB per second

    • But has high latency compared to LVM

    • And is 10 times slower than LVM

    • Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM

      • (LVM = Local Video Memory)

    User clip planes l.jpg
    User clip planes

    • User clip planes are much more efficient than texkill because:

      • They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels

      • It’s important always to kill data at the earliest stage possible in the pipeline

    • Plus, clipping is essentially a geometric operation

    • All hardware which supports ps1.4 supports user clip planes in hardware

    Sky box first or last l.jpg
    Sky box. First or last?

    • Draw it last because:

      • That’s a rough front to back sort

      • In this case you know that most sky pixels will fail the Z test.

    • Draw it first because:

      • That way you don’t need any Z tests

      • In this case you know that most sky pixels would pass the Z test

    So here is our target l.jpg
    So, here is our target:

    • DX9 style mainstream graphics (per frame):

      • > 500K triangles

      • < 500 DrawIndexedPrimitive() calls

      • < 500 VertexBuffer switches

      • < 200 different textures

      • < 200 State change groups

      • Few calls to SetRenderTarget - aim for 0 to 4...

      • 1 pass per poly is typical, but 2 is sometimes smart

      • Runs at monitor refresh rate

      • Which gives more than 40 million polys per second

        • And everything goes through the programmable pipeline

      • No occurrences of Lock(0), DrawPrimitive(), DPUP()