1 / 50

Graphics Performance: Balancing the Rendering Pipeline

Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka Introduction At a minimum, PC is a 2 processor system CPU GPU Maximum efficiency IFF All processors are busy All the time GPU CPU AGP Bus Actually, It’s Worse GPU Vertex Processing CPU AGP Bus

adamdaniel
Download Presentation

Graphics Performance: Balancing the Rendering Pipeline

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Graphics Performance: Balancing the Rendering Pipeline CemCebenoyan and Matthias Wloka

  2. Introduction • At a minimum, PC is a 2 processor system • CPU • GPU • Maximum efficiency IFF • All processors are busy • All the time GPU CPU AGP Bus

  3. Actually, It’s Worse GPU Vertex Processing CPU AGP Bus Application Triangle Setup API Large Cache Fragment Shading Framebuffer Access

  4. Multi-Processor System • Conceptually, 5 processors • CPU • Vertex-processor(s) • Setup processor(s) • Fragment processor(s) • Blending processor(s) • All connected via some form of cache • To smooth data flow • To keep things humming

  5. MP Systems Become Inefficient If… • One or more processors sync to each other • For example, frame-buffer lock • Insures that all caches drain • Insures that all processors idle (CPU and GPU!) • Overhead in restarting the processors • A single processor bottlenecks all others

  6. Overview • CPU • AGP Bus • Vertex Processing • Triangle Setup • Rasterization • Memory bandwidth • Writing to and blending with video memory

  7. Overview: For Each Stage • What are its characteristics? • How does it behave? • How to measure whether it is the bottleneck • How to influence it

  8. CPU Characteristics • Stay within on-chip cache for maximum performance • Use CPU for • Collision detection • Physics • AI • Etc.

  9. CPU Characteristics (cont.) • Note that graphics is capable of • 20+ MTri/s (2 year old high-end) • 20+ MTri/s (integrated graphics) • 100+ MTri/s (current high-end) • CPU also responsible for pushing data to GPU • Cannot look at every triangle • Don’t limit graphics with CPU processing

  10. CPU Measurement • Use VTune • Or any other profiler • Most games are CPU-limited • Little to no time in the graphics driver: • CPU is the bottleneck • Faster GPU will NOT result in faster graphics • Use VTune to track where you spend your time • Optimize those places

  11. CPU Measurement (cont.) • But even if most time is spent in graphics driver: • CPU might still be the bottleneck • Faster GPU will NOT result in faster graphics • Use Nvidia Stats-driver (NVTune) to trace into the GPU • Timing graphics calls is pointless • Remember the large cache between CPU/GPU • Use Nvidia Stats-driver (NVTune) instead • NVTune available from Nvidia’s registered developer site

  12. CPU Common Problems • Small batches of geometry being sent to the GPU • 100 triangles per batch should be your minimum • Would like to see ~500 triangles/batch • Up to 10,000 triangles/batch • Combination of causes kill your performance • Runtime • Driver • Hardware

  13. CPU: Batch Size Characteristic

  14. CPU: Batching Solutions • Sort by render-state • Texture switches • Combine textures into one large (4kx4k) texture • Modify uv-coordinates accordingly • Tessellate geometry to overcome mirroring and wrapping • Mip-mapping works just fine • Transform switches • Pre-transform on the CPU into world-space • Replicate data into VBs (costs AGP memory)

  15. Other Common CPU Problems • Specify vertex buffers as WRITEONLY • Minimize state changes • consider using a PURE device, iff you are optimal • Do not lock and read data from GPU • Multi-processor sync!

  16. AGP Bus Characteristics • AGP 4x supports 20+ MTri/s • Even if all vertices and indices are dynamic • BenMark5 does just that • http://developer.nvidia.com/view.asp?IO=BenMark5 • Too often AGP 4x support is busted • Use BenMark5 to test for AGP 4x support • AGP Bus through-put influenced by • Size of vertex format of dynamically written vertices • How many vertices are dynamically written

  17. AGP Bus Characteristics (cont.) • But if frame-buffer and textures exceed video-memory, AGP is also used • to transfer STATIC vertices to GPU every frame • to transfer textures to GPU every frame • Make sure you avoid partial writes • See “Fast AGP Writes for Dynamic Vertex Data” by Dean Macri for details • Always modify all vertex-data, • even if only some data changes • Pentium 3: write in 32 byte chunks • Pentium 4: write in 64 byte chunks

  18. AGP Bus Characteristics (cont.) • GPU caches vertex fetches • Hitting this cache causes no data to cross the bus • Cache has 32-byte lines • Vertex sizes that are multiples of 32 are beneficial • See also http://developer.nvidia.com/view.asp?IO=Vertex_Buffer_Statistics

  19. AGP Bus Characteristics

  20. AGP Bus Measurement • You can tell you’re bound by the bus if: • Increasing/decreasing vertex format size significantly impacts performance • Best to increase vertex format size using components not needed by rasterizer • for example, normals

  21. Increasing AGP Bus Performance • Make sure frame buffer and textures fit into video-memory • Decrease number of dynamic objects (vertices) • Use vertex-shaders to animate static VBs! • Decrease vertex size • Let vertex-shader generate vertex-components! • Compress components and use vertex shader to decompress • For example, use 16bit short normals • Reorder vertices in VB to be sequential in use • Can use NVTriStrip to do this • Pad to multiples of 32-bytes

  22. Vertex Processing Characteristics • Each vertex is transformed and lit • Performance correlates directly to • Number of vertices processed • Length of vertex shader or • Fixed-function factors, such as • Number of active lights • Type of lights • Specular on/off • LOCALVIEWER on/off • Texgen on/off • GPU core clock frequency

  23. Vertex Processing Characteristics

  24. Vertex Processing Characteristics • After processing, vertices land in post-TnL FIFO • GeForce1/2/4 MX: effectively 10 entries • GeForce3/4 Ti: effectively 18 entries • Cache-hit saves: • all TnL work! • Everything before TnL in the pipeline • Only works with indexed primitives

  25. Vertex Processing Performance • Do not be afraid to use triangles • Rarely the bottleneck • Even if it is, it would make us happy • A lot of vertex processing power available • 6 * 6 pixel-quad with 2 tris is not vertex bound • If you can tell an object is made from triangles, you are not using enough triangles • ~10k triangles/frame is off by 2 (two!) orders of magnitude

  26. Code Creatures Demo • Grass scenes are NOT vertex-bound • In excess of 1,000,000 tris/frame for opening scene • ~250k tris/frame minimum • CodeCreatures demo available from: http://www.codecult.de/

  27. Vertex Processing Measurement • You are bound by vertex processing if: • Increasing/decreasing vertex shader length significantly influences performance • Adding unnecessary instructions may be optimized out by driver, though • Instead, use instructions that access constant memory to add zero to a result, for example • Fixed-function TnL performance improves when • Reducing number of lights • Turning off texgen • Simplifying light types

  28. Improving Vertex Processing • Optimize for the post-TnL vertex cache • Use indexed primitives • Access vertices mostly sequentially, revisiting only recently accessed vertices • Let NVTriStrip or ID3DXMesh do the work • Turn off unnecessary calculations • LOCALVIEWER often unnecessary for specular • Prefer cheap approximations for lighting and other math when using vertex shaders

  29. Improving Vertex Processing (cont.) • Optimize your vertex shaders • Use swizzling/masking extensively • Question all MOV instructions • Storing lookup tables in constant memory • for example, to compute sin/cos • See “Implementation of ‘Missing’ Vertex Shader Instructions” for more ideas • http://developer.nvidia.com/view.asp?IO=Implementation_Missing_Instructions

  30. Improving Vertex Processing (cont.) • Consider moving per-vertex work to per-pixel • Consider using ‘shader-LODing’ • Do far-away objects really need 4-bone skinning? • Can always increase screen-res/use AA to NOT be vertex-processing bound!

  31. Triangle Setup Characteristics • Triangle setup is never the bottleneck • Except when rating the GPU • Since it is the fastest stage • Setup speed influenced by: • Number of triangles • Vertex attributes needed by rasterization • Extremely small triangles running very simple TnL • i.e., degenerate triangles! • No TnL cost, since most likely hits post-TnL cache • No fill-cost, since rejected in setup

  32. Measuring/Improving Triangle Setup • Has never come up • Reduce ratio of degenerate triangles to real triangles • Reduce unnecessary components written out from the vertex shader

  33. Rasterization Characteristics • Prefer the term “fragment” to “pixel” • May not correspond to any pixel in framebuffer, for example, due to z/stencil/alpha tests • May correspond to more than one pixel due to multisampling • Commonly referred to as “fill-rate”

  34. Fill-Rate Characteristics • Fill-rate is function of • number of fragments filled • cost of each fragment • GPU’s core clock • Parallel SIMD operation, processes • Up to 4 pixels per clock on GeForce1/2/3/4 Ti • Up to 2 pixels per clock on GeForce2 MX / 4 MX • Broken into a number of parts: • Texture fetching • Texture addressing operations • Color blending operations

  35. Texture Fetching Characteristics • Texture fetches are • From AGP to local video-memory, only if frame-buffer and textures exceed video-memory (to be avoided), then • From local video-memory to on-chip cache

  36. Texture Fetching Characteristics (cont.) • Minimize cache-misses: • Use mip-mapping! • Avoid LOD bias to sharpen: it hurts caching and adds aliasing • Prefer anisotropic filtering for sharpening • Use DXT everywhere you can • Texture size as big as needed and no bigger • Texture format as small as possible • 16 vs. 32 bit • Localize texture access • E.g., normal texture reads • Dependent texture reads are less local • Per-pixel reflection potentially really bad

  37. Texture Fetching Characteristics (cont.) • Number of samples taken also affects performance: • Trilinear filtering cuts fillrate in half • Anisotropic even worse • Depending on level of anisotropy • The hardware is intelligent in this regard, you only pay for the anisotropy you use

  38. Texture Addressing Characteristics • Different texture addressing operations have wildly different performance characteristics • But texture cache hits/misses more significant

  39. Texture Addressing Characteristics • Also, every two textures cuts fill-rate in half: • 1 or 2 textures runs at full speed • 3 or 4 textures runs at half speed (two clocks)

  40. Color Blending Characteristics • Color blending operations also called ‘Register Combiners’ • 1 or 2 instructions (combiners) – full speed • 3 or 4 instructions (combiners) – half speed • 5 or 6 instructions (combiners) – one third speed • 7 or 8 instructions (combiners) – one quarter speed • These numbers are for GF3 / 4 Ti • But if using 4 textures • Already at half-speed or less • Using up to 4 combiners is free

  41. Fill-Rate Measurement • You are bound by fill-rate, if • Reducing texture sizes • Or better turning off texturing • Increases performance significantly • Turning on / off trilinear affects performance • Increasing texture units used to 4, but not actually fetching from any textures (using pixel shader instructions like texcoord), causes you to slow down

  42. Improving Fill-Rate • Render z-only pass first • Because z-optimizations happen before rasterization • Helps with memory bandwidth as well • Even for older chips without z-optimizations • Do everything to reduce texture cache misses • Turn on anisotropic, but turn off trilinear filtering • Mip-map transitions are less visible with anisotropic filtering on

  43. Improving Fill-Rate (cont.) • Consider palletized normal maps for compression • Consider moving per-pixel work to per-vertex • Consider ‘shader LODing’ • Turn off detail map computations in the distance

  44. Memory Bandwidth Characteristics • Memory bandwidth is often the bottleneck • especially at high resolutions • Memory bandwidth influenced by: • Screen and render-target resolutions • Render-target color / z bit depth • FSAA • Texture sizes and formats (texture fetching) • Overdraw complexity • Alpha blending • GPU’s memory-interface width • Memory clock

  45. Memory Bandwidth Characteristics • FSAA hits memory bandwidth exclusively • no fill-rate hit with multi-sample • Failing the z/stencil/alpha test means • Pixel color is not written • Z is not written

  46. Measuring Memory Bandwidth • Switch frame-buffer format to 16bit • Switch all render-targets to 16bit • If performance doubles • App was 100% memory-bandwidth bound • If performance unchanged • App is not memory-bandwidth bound

  47. Improving Memory Bandwidth • Overdraw • Reduce as much as possible • Lightly sort objects front to back • All architectures benefit, since z-test fails • Reduce blending as much as possible • Always enable alpha-test when blending • Tweak test-value as much as possible • Consider using 2-pass alpha-test/-blend technique • Always clear z/stencil (using clear()) • Do not clear color if not necessary • Writing z from shader destroys early z

  48. Improving Memory Bandwidth (cont.) • Prefer FSAA over high resolution • Consider using z-only pass • Turn off z-writing for all subsequent passes

  49. Conclusion • A lot of different performance bottle-necks • Know which one to tweak • Use suggestions here to • make things faster w/o making it visibly worse • Make things prettier for free!

  50. Questions… ? cem@nvidia.com mwloka@nvidia.com http://developer.nvidia.com

More Related