
Caching Strategies for Textures



  1. Caching Strategies for Textures Paul Arthur Navratil

  2. Overview • Conceptual summary • Design and Analysis of a Cache Architecture for Texture Mapping (Hakura and Gupta 1997) • Prefetching in a Texture Cache Architecture (Igehy, Eldridge, and Proudfoot 1998) • Discussion!

  3. Mip mapping • Achieves acceptable texture-mapping performance • Interpolating between fixed levels of detail costs a constant amount of computation per fragment (see the sketch below) • Reduces aliasing [Williams p.4] • Uses memory efficiently • Memory access pattern is well understood
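  A minimal sketch of the trilinear mip-map lookup, to make the constant per-fragment cost concrete: eight texel fetches and seven lerps, independent of texture size or screen coverage. The fetch() hook and the coordinate convention are assumptions for illustration, not the papers' interface.

    #include <cmath>

    struct Texel { float r, g, b, a; };

    Texel fetch(int lod, int u, int v);  // assumed: one texel from mip level lod

    Texel lerp(const Texel& x, const Texel& y, float t) {
        return { x.r + (y.r - x.r) * t, x.g + (y.g - x.g) * t,
                 x.b + (y.b - x.b) * t, x.a + (y.a - x.a) * t };
    }

    // Bilinear sample within a single mip level: four adjacent texels.
    Texel bilinear(int lod, float u, float v) {
        int   u0 = (int)std::floor(u), v0 = (int)std::floor(v);
        float fu = u - u0,             fv = v - v0;
        Texel t00 = fetch(lod, u0, v0),     t10 = fetch(lod, u0 + 1, v0);
        Texel t01 = fetch(lod, u0, v0 + 1), t11 = fetch(lod, u0 + 1, v0 + 1);
        return lerp(lerp(t00, t10, fu), lerp(t01, t11, fu), fv);
    }

    // Trilinear: blend bilinear results from the two nearest mip levels.
    Texel trilinear(float u, float v, float lod) {
        int l0 = (int)std::floor(lod);
        Texel fine   = bilinear(l0,     u,        v);
        Texel coarse = bilinear(l0 + 1, 0.5f * u, 0.5f * v);  // next level is half-res
        return lerp(fine, coarse, lod - (float)l0);
    }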

  4. Hakura and Gupta: Problem • Motivation: texture mapping needs high-bandwidth, low-latency memory access • Previous work uses brute force • Dedicated DRAM for each fragment generator [Akeley p.3] • An SGI RealityEngine can have 320MB of texture memory, but only 16MB of unique texture memory!

  5. Hakura and Gupta: Idea • Observation: if textures exhibit spatial and temporal locality, design the system to exploit it • Use an SRAM cache for each fragment generator • Keep a single, shared DRAM texture memory • Advantages • Unique texture memory is larger • Lower overall cost (small SRAM caches instead of replicated, dedicated DRAM) • SRAM gives higher bandwidth and lower latency

  6. Hakura and Gupta: Locality • Mip mapping has inherent spatial locality • Trilinear interpolation reads four contiguous texels on each of two levels, with texel area close to pixel area • Texture mapping has two kinds of temporal locality • Overlapping texel usage across contiguously generated fragments • Textures repeated across the image [color images.ps]

  7. Hakura and Gupta: Caching • Observation: Increase in DRAM density has decreased DRAM bandwidth! • Cache decreases bandwidth requirement by decreasing accesses to texture memory • Block transfers from memory to cache maximize DRAM bandwidth utilization • Texture memory can be shared (not dedicated) • No cache coherence issues • Cache characterized by: • Cache size • Cache line size • Associativity • Which combination is best?

  8. Hakura and Gupta: Texture Representation in Memory • Base case: linear (non-blocked) • Williams' original representation misses spatial locality • Use contiguous RGBA values per texel [Hakura p.5] • Observations: • Gradual level-of-detail change uses more of a fetched cache line • Larger line sizes reduce the cold-miss rate • Principle of Texture Thrift: the amount of texture information required to render is proportional to the resolution of the image, and is independent of the number of surfaces and the size of the textures [Peachey 90] • In the examples, the working set is limited to one texture • Worst case is bounded by either texture size or screen size • This representation is sensitive to the texture's orientation on screen (see the sketch below)
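  To see the orientation sensitivity, here is a sketch of the linear layout with contiguous RGBA per texel; the width and cache-line size are assumed values for illustration.

    #include <cstddef>

    constexpr int kTexWidth   = 1024;  // assumed texture width in texels
    constexpr int kTexelBytes = 4;     // packed RGBA, one byte per channel

    size_t linear_address(int u, int v) {
        return ((size_t)v * kTexWidth + u) * kTexelBytes;
    }

    // With 64-byte cache lines, (u, v) and (u+1, v) usually share a line, but
    // (u, v) and (u, v+1) lie kTexWidth * kTexelBytes = 4096 bytes apart, so a
    // texture walked vertically touches a new line for every texel.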

  9. Hakura and Gupta: Texture Representation in Memory • Blocked case: convert the 2-D array into a 4-D array • Address calculation becomes a two-step process (see the sketch below) • Block size remains constant across mipmap levels • Observations: • Reduces dependency on texture orientation and exploits spatial locality • Lowest miss rates occur when block size matches cache line size [Hakura p.7] • Increasing line size alone yields worse miss rates • A 2-way associative cache avoids conflicts between blocks of different mipmap levels (see Igehy)
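  A sketch of the two-step blocked address calculation, assuming a block edge of B = 4 so that one block of packed RGBA texels exactly fills a 64-byte cache line (values chosen for illustration).

    #include <cstddef>

    constexpr int kB          = 4;     // assumed block edge in texels
    constexpr int kTexWidth   = 1024;  // assumed texture width in texels
    constexpr int kTexelBytes = 4;     // packed RGBA

    size_t blocked_address(int u, int v) {
        size_t block    = (size_t)(v / kB) * (kTexWidth / kB) + (size_t)(u / kB);  // step 1: which block
        size_t in_block = (size_t)(v % kB) * kB + (size_t)(u % kB);                // step 2: texel within block
        return (block * kB * kB + in_block) * kTexelBytes;
    }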

  10. Hakura and Gupta: Rasterization • Rasterization order affects the texture access pattern, and thus cache behavior as well • Use tiling (chunking) to exploit spatial locality (see the sketch below) • If tiles are too large, the working set exceeds the cache size and capacity misses result [Hakura p.9] • Smaller triangles in the image reduce this effect
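  A sketch of tiled traversal; the tile size and the per-fragment hook are assumptions. The point is that fragments within one tile reuse nearby texels while they are still resident in the cache, instead of streaming across an entire scan line.

    void shade_fragment(int x, int y);  // assumed per-fragment texture + shade hook

    void rasterize_tiled(int width, int height, int tile /* e.g. 32 */) {
        for (int ty = 0; ty < height; ty += tile)
            for (int tx = 0; tx < width; tx += tile)
                for (int y = ty; y < ty + tile && y < height; ++y)
                    for (int x = tx; x < tx + tile && x < width; ++x)
                        shade_fragment(x, y);
    }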

  11. Hakura and Gupta: Performance • Rendering performance and memory bandwidth are good measures of a texture mapping system • Fragment generator observations • The machine must access more than one texel per cycle • It must hide memory latency to achieve maximum throughput (address precomputation) • SRAM cache observations • Use multiple banks with interleaved lines for multi-texel access (see the sketch below) • Interleave texels within each block • Without multi-texel access, trilinear interpolation can complete only one interpolation every two cycles!
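  A sketch of one possible texel-to-bank interleaving: indexing four banks by the low bit of each coordinate guarantees that the 2x2 footprint of a bilinear fetch lands in four distinct banks, so all four texels can be read in a single cycle. The exact mapping is an assumption, not the paper's circuit.

    int bank_of(int u, int v) {
        return ((v & 1) << 1) | (u & 1);  // banks 0..3
    }

    // For any (u, v), the footprint (u,v), (u+1,v), (u,v+1), (u+1,v+1)
    // covers banks {0, 1, 2, 3} exactly once; no conflicts within one
    // bilinear access.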

  12. Hakura and Gupta: Conclusions • Caching yields a three-fold to fifteen-fold reduction in memory bandwidth requirements • The cache should be at least 16 KB and 2-way associative • Long cache lines utilize DRAM bandwidth better (with a slight increase in total traffic) • Block size should match cache line size • Rasterization should follow a tiled pattern

  13. Igehy et al: Problem • Motivation: memory bandwidth and latency are (becoming) the bottleneck for texture systems • Previous work shows the benefits of caching [Hakura97; Cox98], but fails to hide memory latency • Little literature on prefetching texels: • used in some systems, but the algorithms are not described (proprietary), e.g. [Torborg and Kajiya 1996]

  14. Igehy et al: Idea • Combine prefetching and caching in an architecture with a clear, public description • Advantages: • Simple • Robust to variations in bandwidth requirements and latencies • Achieves within 3% of the performance of a zero-latency system

  15. Igehy et al: Traditional Prefetching (no cache) • When a fragment is ready for texturing, queue it and request its texels • The fragment stays in the queue for a time equal to the memory latency • If the queue is sized correctly, latency is fully masked (see the sketch below) • Problems: • When the product of request rate and latency is large, data prefetched early can be evicted before it is used • Tags must be checked at double rate to maximize throughput (once at prefetch, once at read) • The prefetch buffer must grow as request rate and latency increase
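  A software sketch of the traditional scheme (the hooks and the one-texel-per-fragment simplification are assumptions): requests are issued the moment a fragment is ready, and the fragment waits in a FIFO whose depth covers the memory latency.

    #include <queue>
    #include <cstdint>

    struct Fragment { int x, y; uint64_t texel_addr; };

    void request_texel(uint64_t addr);           // assumed async memory request
    float read_texel(uint64_t addr);             // assumed: data has arrived
    void shade(const Fragment& f, float texel);  // assumed shading hook

    std::queue<Fragment> fifo;

    void issue(const Fragment& f) {
        request_texel(f.texel_addr);  // prefetch starts immediately
        fifo.push(f);                 // fragment waits roughly one memory latency
    }

    void retire() {
        // If the FIFO depth (in cycles) covers the memory latency, the texel
        // has arrived by the time its fragment reaches the head; otherwise
        // the pipeline stalls here.
        Fragment f = fifo.front();
        fifo.pop();
        shade(f, read_texel(f.texel_addr));
    }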

  16. Igehy et al: Texture Prefetching • Differences from traditional prefetching: • Tag checks occur once per texel, before cache access • A reorder buffer handles early return of texel data • New cache blocks are committed to the cache only when the associated fragment reaches the head of the queue (see the sketch below) • Cache organization: • Four banks each, with adjacent mipmap levels in alternating banks • Data interleaved so the four accesses of a bilinear interpolation can occur in parallel • Can serve 8 requests in parallel, which is enough for trilinear interpolation
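  A rough software sketch of the prefetching cache (the real design is a hardware pipeline; every name here is an assumption). The tag is checked exactly once, at issue time, so later fragments that need the same block see a hit; data that returns early waits in a reorder buffer and is committed only when its fragment reaches the head of the FIFO.

    #include <queue>
    #include <unordered_set>
    #include <cstdint>

    struct Block { unsigned char bytes[64]; };            // assumed line payload

    void request_block(uint64_t tag);                     // assumed async DRAM read
    Block take_from_reorder_buffer(uint64_t tag);         // assumed: blocks until arrival
    void install_in_cache(uint64_t tag, const Block& b);  // assumed cache fill

    std::unordered_set<uint64_t> tags;  // stand-in for the cache's tag store

    struct Pending { uint64_t tag; bool was_miss; };
    std::queue<Pending> fifo;

    void issue(uint64_t tag) {
        bool miss = tags.insert(tag).second;  // the single tag check per block
        if (miss) request_block(tag);         // later fragments now see a hit
        fifo.push({tag, miss});
    }

    void retire() {
        Pending p = fifo.front(); fifo.pop();
        if (p.was_miss)  // commit to the cache only at the head of the queue
            install_in_cache(p.tag, take_from_reorder_buffer(p.tag));
        // ...read texels from the cache, filter, and shade the fragment...
    }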

  17. Igehy et al: Texture Properties • The effectiveness of texture caching is scene dependent • Observation: the unique-texel-to-fragment ratio is a lower bound on the number of texels that must be fetched per frame (unless inter-frame locality is exploited) • A low unique-texel-to-fragment ratio is what we want! (see the sketch below) • The ratio is affected by: • Magnification (lowers the ratio) • Repetition (lowers the ratio if the cache holds the entire texture) • Minification (ratio depends on the texel-area-to-pixel-area ratio)
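  A sketch of how the ratio could be measured from an address trace, with the simplifying assumption that each fragment logs a single texel address; a ratio well below 1.0 means most fetches can be cache hits.

    #include <unordered_set>
    #include <vector>
    #include <cstdint>

    double unique_texel_ratio(const std::vector<uint64_t>& trace) {
        std::unordered_set<uint64_t> unique(trace.begin(), trace.end());
        return (double)unique.size() / (double)trace.size();
    }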

  18. Igehy et al: Memory Organization • Use a 6-D version of Hakura's blocked texture representation [Igehy p.5] • Rasterize in a tiled pattern (not scan-line order) • Cache associativity does not appreciably affect the miss rate • The design minimizes conflict misses • General formula for determining associativity: • m independent n-way associative caches can handle a rate of m bilinear accesses (four texels each) per cycle to m*n textures (or texture levels in a mipmap) • e.g., two independent 2-way caches (m = 2, n = 2) sustain two bilinear accesses per cycle across up to four active textures or mipmap levels

  19. Igehy et al: Bandwidth • Average texel requests per frame are not enough to determine actual requirements • High-request bursts occur [Igehy p.6] • e.g. color map vs. light map • When the system misses ideal (zero-latency) performance, bandwidth is to blame [Igehy p.8] • e.g. AGP vs. NUMA

  20. Igehy et al: Conclusions • A system that approximates zero latency is possible • Achieved 97% utilization of available resources • The fragment queue should slightly exceed the latency of the memory system, to account for miss bursts • Reserve a reorder-buffer slot when the memory request is made, to avoid deadlock

  21. Discussion!
