
Caching Strategies for Textures



  1. Caching Strategies for Textures Paul Arthur Navratil

  2. Overview • Conceptual summary • Design and Analysis of a Cache Architecture for Texture Mapping (Hakura and Gupta 1997) • Prefetching in a Texture Cache Architecture (Igehy, Eldridge, and Proudfoot 1998) • Discussion!

  3. Mip mapping • Achieves acceptable texture-mapping performance • Interpolating between fixed levels of detail costs a constant amount of computation per fragment (see the sketch below) • Reduces aliasing [Williams p.4] • Uses memory efficiently • Memory access pattern is well understood
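  A minimal sketch of the trilinear mip-map lookup, to make the constant per-fragment cost concrete: eight texel fetches and seven lerps, independent of texture size or screen coverage. The fetch() hook and the coordinate convention are assumptions for illustration, not the papers' interface.

    #include <cmath>

    struct Texel { float r, g, b, a; };

    Texel fetch(int lod, int u, int v);  // assumed: one texel from mip level lod

    Texel lerp(const Texel& x, const Texel& y, float t) {
        return { x.r + (y.r - x.r) * t, x.g + (y.g - x.g) * t,
                 x.b + (y.b - x.b) * t, x.a + (y.a - x.a) * t };
    }

    // Bilinear sample within a single mip level: four adjacent texels.
    Texel bilinear(int lod, float u, float v) {
        int   u0 = (int)std::floor(u), v0 = (int)std::floor(v);
        float fu = u - u0,             fv = v - v0;
        Texel t00 = fetch(lod, u0, v0),     t10 = fetch(lod, u0 + 1, v0);
        Texel t01 = fetch(lod, u0, v0 + 1), t11 = fetch(lod, u0 + 1, v0 + 1);
        return lerp(lerp(t00, t10, fu), lerp(t01, t11, fu), fv);
    }

    // Trilinear: blend bilinear results from the two nearest mip levels.
    Texel trilinear(float u, float v, float lod) {
        int l0 = (int)std::floor(lod);
        Texel fine   = bilinear(l0,     u,        v);
        Texel coarse = bilinear(l0 + 1, 0.5f * u, 0.5f * v);  // next level is half-res
        return lerp(fine, coarse, lod - (float)l0);
    }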

  4. Hakura and Gupta: Problem • Motivation: texture mapping needs high-bandwidth, low-latency memory access • Previous work uses brute force • Dedicated DRAM for each fragment generator [Akeley p.3] • An SGI RealityEngine can have 320MB of texture memory, but only 16MB of unique texture memory!

  5. Hakura and Gupta: Idea • Observation: if textures exhibit spatial and temporal locality, design the system to exploit it • Use an SRAM cache for each fragment generator • Keep a single, shared DRAM texture memory • Advantages • Unique texture memory is larger • Lower overall cost (small SRAM caches instead of replicated, dedicated DRAM) • SRAM gives higher bandwidth and lower latency

  6. Hakura and Gupta: Locality • Mip mapping has inherent spatial locality • Trilinear interpolation reads four contiguous texels on each of two levels, with texel area close to pixel area • Texture mapping has two kinds of temporal locality • Overlapping texel usage across contiguously generated fragments • Textures repeated across the image [color images.ps]

  7. Hakura and Gupta: Caching • Observation: Increase in DRAM density has decreased DRAM bandwidth! • Cache decreases bandwidth requirement by decreasing accesses to texture memory • Block transfers from memory to cache maximize DRAM bandwidth utilization • Texture memory can be shared (not dedicated) • No cache coherence issues • Cache characterized by: • Cache size • Cache line size • Associativity • Which combination is best?

  8. Hakura and Gupta: Texture Representation in Memory • Base case: linear (non-blocked) • Williams' original representation misses spatial locality • Use contiguous RGBA values per texel [Hakura p.5] • Observations: • Gradual level-of-detail change uses more of a fetched cache line • Larger line sizes reduce the cold-miss rate • Principle of Texture Thrift: the amount of texture information required to render is proportional to the resolution of the image, and is independent of the number of surfaces and the size of the textures [Peachey 90] • In the examples, the working set is limited to one texture • Worst case is bounded by either texture size or screen size • This representation is sensitive to the texture's orientation on screen (see the sketch below)
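  To see the orientation sensitivity, here is a sketch of the linear layout with contiguous RGBA per texel; the width and cache-line size are assumed values for illustration.

    #include <cstddef>

    constexpr int kTexWidth   = 1024;  // assumed texture width in texels
    constexpr int kTexelBytes = 4;     // packed RGBA, one byte per channel

    size_t linear_address(int u, int v) {
        return ((size_t)v * kTexWidth + u) * kTexelBytes;
    }

    // With 64-byte cache lines, (u, v) and (u+1, v) usually share a line, but
    // (u, v) and (u, v+1) lie kTexWidth * kTexelBytes = 4096 bytes apart, so a
    // texture walked vertically touches a new line for every texel.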

  9. Hakura and Gupta: Texture Representation in Memory • Blocked case: convert the 2-D array into a 4-D array • Address calculation becomes a two-step process (see the sketch below) • Block size remains constant across mipmap levels • Observations: • Reduces dependency on texture orientation and exploits spatial locality • Lowest miss rates occur when block size matches cache line size [Hakura p.7] • Increasing line size alone yields worse miss rates • A 2-way associative cache avoids conflicts between blocks of different mipmap levels (see Igehy)
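  A sketch of the two-step blocked address calculation, assuming a block edge of B = 4 so that one block of packed RGBA texels exactly fills a 64-byte cache line (values chosen for illustration).

    #include <cstddef>

    constexpr int kB          = 4;     // assumed block edge in texels
    constexpr int kTexWidth   = 1024;  // assumed texture width in texels
    constexpr int kTexelBytes = 4;     // packed RGBA

    size_t blocked_address(int u, int v) {
        size_t block    = (size_t)(v / kB) * (kTexWidth / kB) + (size_t)(u / kB);  // step 1: which block
        size_t in_block = (size_t)(v % kB) * kB + (size_t)(u % kB);                // step 2: texel within block
        return (block * kB * kB + in_block) * kTexelBytes;
    }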

  10. Hakura and Gupta: Rasterization • Rasterization order affects the texture access pattern, and thus cache behavior as well • Use tiling (chunking) to exploit spatial locality (see the sketch below) • If tiles are too large, the working set exceeds the cache size and capacity misses result [Hakura p.9] • Smaller triangles in the image reduce this effect
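  A sketch of tiled traversal; the tile size and the per-fragment hook are assumptions. The point is that fragments within one tile reuse nearby texels while they are still resident in the cache, instead of streaming across an entire scan line.

    void shade_fragment(int x, int y);  // assumed per-fragment texture + shade hook

    void rasterize_tiled(int width, int height, int tile /* e.g. 32 */) {
        for (int ty = 0; ty < height; ty += tile)
            for (int tx = 0; tx < width; tx += tile)
                for (int y = ty; y < ty + tile && y < height; ++y)
                    for (int x = tx; x < tx + tile && x < width; ++x)
                        shade_fragment(x, y);
    }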

  11. Hakura and Gupta: Performance • Rendering performance and memory bandwidth are good measures of a texture mapping system • Fragment generator observations • The machine must access more than one texel per cycle • It must hide memory latency to achieve maximum throughput (address precomputation) • SRAM cache observations • Use multiple banks with interleaved lines for multi-texel access (see the sketch below) • Interleave texels within each block • Without multi-texel access, trilinear interpolation can complete only one interpolation every two cycles!
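  A sketch of one possible texel-to-bank interleaving: indexing four banks by the low bit of each coordinate guarantees that the 2x2 footprint of a bilinear fetch lands in four distinct banks, so all four texels can be read in a single cycle. The exact mapping is an assumption, not the paper's circuit.

    int bank_of(int u, int v) {
        return ((v & 1) << 1) | (u & 1);  // banks 0..3
    }

    // For any (u, v), the footprint (u,v), (u+1,v), (u,v+1), (u+1,v+1)
    // covers banks {0, 1, 2, 3} exactly once; no conflicts within one
    // bilinear access.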

  12. Hakura and Gupta: Conclusions • Caching yields a three-fold to fifteen-fold reduction in memory bandwidth requirements • The cache should be at least 16 KB and 2-way associative • Long cache lines utilize DRAM bandwidth better (with a slight increase in total traffic) • Block size should match cache line size • Rasterization should follow a tiled pattern

  13. Igehy et al: Problem • Motivation: memory bandwidth and latency are (becoming) the bottleneck for texture systems • Previous work shows the benefits of caching [Hakura97; Cox98], but fails to hide memory latency • Little literature on prefetching texels: • used in some systems, but the algorithms are not described (proprietary), e.g. [Torborg and Kajiya 1996]

  14. Igehy et al: Idea • Combine prefetching and caching in an architecture with a clear, public description • Advantages: • Simple • Robust to variations in bandwidth requirements and latencies • Achieves within 3% of the performance of a zero-latency system

  15. Igehy et al: Traditional Prefetching (no cache) • When a fragment is ready for texturing, queue it and request its texels • The fragment stays in the queue for a time equal to the memory latency • If the queue is sized correctly, latency is fully masked (see the sketch below) • Problems: • When the product of request rate and latency is large, data prefetched early can be evicted before it is used • Tags must be checked at double rate to maximize throughput (once at prefetch, once at read) • The prefetch buffer must grow as request rate and latency increase
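  A software sketch of the traditional scheme (the hooks and the one-texel-per-fragment simplification are assumptions): requests are issued the moment a fragment is ready, and the fragment waits in a FIFO whose depth covers the memory latency.

    #include <queue>
    #include <cstdint>

    struct Fragment { int x, y; uint64_t texel_addr; };

    void request_texel(uint64_t addr);           // assumed async memory request
    float read_texel(uint64_t addr);             // assumed: data has arrived
    void shade(const Fragment& f, float texel);  // assumed shading hook

    std::queue<Fragment> fifo;

    void issue(const Fragment& f) {
        request_texel(f.texel_addr);  // prefetch starts immediately
        fifo.push(f);                 // fragment waits roughly one memory latency
    }

    void retire() {
        // If the FIFO depth (in cycles) covers the memory latency, the texel
        // has arrived by the time its fragment reaches the head; otherwise
        // the pipeline stalls here.
        Fragment f = fifo.front();
        fifo.pop();
        shade(f, read_texel(f.texel_addr));
    }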

  16. Igehy et al: Texture Prefetching • Differences from traditional prefetching: • Tag checks occur once per texel, before cache access • A reorder buffer handles early return of texel data • New cache blocks are committed to the cache only when the associated fragment reaches the head of the queue (see the sketch below) • Cache organization: • Four banks each, with adjacent mipmap levels in alternating banks • Data interleaved so the four accesses of a bilinear interpolation can occur in parallel • Can serve 8 requests in parallel, which is enough for trilinear interpolation
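  A rough software sketch of the prefetching cache (the real design is a hardware pipeline; every name here is an assumption). The tag is checked exactly once, at issue time, so later fragments that need the same block see a hit; data that returns early waits in a reorder buffer and is committed only when its fragment reaches the head of the FIFO.

    #include <queue>
    #include <unordered_set>
    #include <cstdint>

    struct Block { unsigned char bytes[64]; };            // assumed line payload

    void request_block(uint64_t tag);                     // assumed async DRAM read
    Block take_from_reorder_buffer(uint64_t tag);         // assumed: blocks until arrival
    void install_in_cache(uint64_t tag, const Block& b);  // assumed cache fill

    std::unordered_set<uint64_t> tags;  // stand-in for the cache's tag store

    struct Pending { uint64_t tag; bool was_miss; };
    std::queue<Pending> fifo;

    void issue(uint64_t tag) {
        bool miss = tags.insert(tag).second;  // the single tag check per block
        if (miss) request_block(tag);         // later fragments now see a hit
        fifo.push({tag, miss});
    }

    void retire() {
        Pending p = fifo.front(); fifo.pop();
        if (p.was_miss)  // commit to the cache only at the head of the queue
            install_in_cache(p.tag, take_from_reorder_buffer(p.tag));
        // ...read texels from the cache, filter, and shade the fragment...
    }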

  17. Igehy et al: Texture Properties • The effectiveness of texture caching is scene dependent • Observation: the unique-texel-to-fragment ratio is a lower bound on the number of texels that must be fetched per frame (unless inter-frame locality is exploited) • A low unique-texel-to-fragment ratio is what we want! (see the sketch below) • The ratio is affected by: • Magnification (lowers the ratio) • Repetition (lowers the ratio if the cache holds the entire texture) • Minification (ratio depends on the texel-area-to-pixel-area ratio)
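  A sketch of how the ratio could be measured from an address trace, with the simplifying assumption that each fragment logs a single texel address; a ratio well below 1.0 means most fetches can be cache hits.

    #include <unordered_set>
    #include <vector>
    #include <cstdint>

    double unique_texel_ratio(const std::vector<uint64_t>& trace) {
        std::unordered_set<uint64_t> unique(trace.begin(), trace.end());
        return (double)unique.size() / (double)trace.size();
    }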

  18. Igehy et al: Memory Organization • Use a 6-D version of Hakura's blocked texture representation [Igehy p.5] • Rasterize in a tiled pattern (not scan-line order) • Cache associativity does not appreciably affect the miss rate • The design minimizes conflict misses • General formula for determining associativity: • m independent n-way associative caches can handle a rate of m bilinear accesses (four texels each) per cycle to m*n textures (or texture levels in a mipmap) • e.g., two independent 2-way caches (m = 2, n = 2) sustain two bilinear accesses per cycle across up to four active textures or mipmap levels

  19. Igehy et al: Bandwidth • Average texel requests per frame are not enough to determine actual requirements • High-request bursts occur [Igehy p.6] • e.g. color map vs. light map • When the system misses ideal (zero-latency) performance, bandwidth is to blame [Igehy p.8] • e.g. AGP vs. NUMA

  20. Igehy et al: Conclusions • A system that approximates zero latency is possible • Achieved 97% utilization of available resources • The fragment queue should slightly exceed the latency of the memory system, to account for miss bursts • Reserve a reorder-buffer slot when the memory request is made, to avoid deadlock

  21. Discussion!
