
Advanced Cache Optimizations


Presentation Transcript


  1. Advanced Computer Architecture: Advanced Cache Optimizations. Omid Fatemi

  2. Six Basic Cache Optimizations • Reducing hit time • Giving Reads Priority over Writes • e.g., Read complete before earlier writes in write buffer • Avoiding Address Translation during Cache Indexing • Reducing Miss Penalty • Multilevel Caches • Reducing Miss Rate • Larger Block size (Compulsory misses) • Larger Cache size (Capacity misses) • Higher Associativity (Conflict misses)

  3. Eleven Advanced Cache Optimizations • Reducing hit time: small and simple caches, way prediction, trace caches • Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches • Reducing miss penalty: critical word first, merging write buffers • Reducing miss rate: compiler optimizations • Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching

  4. Fast Hit Times via Small and Simple Caches • Why does the Alpha 21164 have only an 8 KB instruction cache and an 8 KB data cache, backed by a 96 KB second-level cache? • A small data cache keeps access time short and supports a fast clock rate • Direct mapped, on chip • A smaller cache can fit on chip • In a direct-mapped cache, data transmission is overlapped with the tag check • 1st suggestion: keep the tags on chip and the data off chip (usually done for next-level caches) • 2nd suggestion: use direct mapping • The size of the L1 caches has recently not increased between generations • The L1 caches are the same size in the Alpha 21264 and 21364, UltraSPARC II and III, and AMD K6 and Athlon • The L1 data cache size was actually reduced from 16 KB in the Pentium III to 8 KB in the Pentium 4

  5. Small and Simple Caches • Make the cache smaller to improve hit time • Or (probably better), add a new, smaller L0 cache between the existing L1 cache and the CPU • Keep the L1 cache on the same chip as the CPU • Physically close to the functional units that access it • Keep the L1 design simple, e.g. direct-mapped • Avoids multiple tag comparisons • The data can be read in parallel with the tag comparison and forwarded before the check completes • Reduces effective hit time
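
As an illustration of why direct-mapped lookups are fast, here is a minimal C sketch (not from the slides; the 8 KB size, 32-byte blocks, and names are assumptions) of how an address is split into tag, index, and offset, with the data read able to proceed in parallel with the tag check:

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 8 KB direct-mapped cache, 32-byte blocks. */
#define BLOCK_SIZE   32
#define NUM_LINES    (8 * 1024 / BLOCK_SIZE)    /* 256 lines */
#define OFFSET_BITS  5                          /* log2(32)  */
#define INDEX_BITS   8                          /* log2(256) */

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} line_t;

static line_t cache[NUM_LINES];

/* Returns 1 on a hit. In a real direct-mapped cache the data read and the
 * tag comparison proceed in parallel; the data is simply discarded if the
 * tag does not match, which is why hits are fast. */
int dm_lookup(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    return cache[index].valid && cache[index].tag == tag;
}

int main(void)
{
    uint32_t addr = 0x12345678;
    /* Install the block, then look it up again. */
    uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    cache[index].valid = 1;
    cache[index].tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("lookup: %s\n", dm_lookup(addr) ? "hit" : "miss");
    return 0;
}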

  6. Cache Access Times http://research.compaq.com/wrl/people/jouppi/CACTI.html

  7. Reducing Misses via Way Prediction and Pseudo-Associativity • How can we combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache? • Pseudo-associativity: divide the cache; on a miss, check the other half of the cache to see if the block is there, and if so count a pseudo-hit (slow hit) • Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles (variable) • Better for caches not tied directly to the processor (L2) • Used in the MIPS R10000 L2 cache, similar in the UltraSPARC • In way prediction, the way within the set is predicted instead: the Alpha 21264 uses way prediction in its instruction cache (predictor bits select which block in the set to try first) • Hit time: one clock cycle; pseudo-hit time: three clock cycles • (Figure: timeline with hit time < pseudo-hit time < miss penalty) • Miss rate (+), hardware complexity (2)

  8. Way Prediction • Keep per-set way-prediction bits that predict which block in the set will be accessed next • Only one tag is matched in the first cycle; on a mispredict, the other blocks must be examined • Beneficial in two ways • Fast data access: the data can be accessed without waiting for the full tag comparison • Low power: only a single tag is matched, as long as the majority of predictions are correct • Different systems use variations of this concept
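
A minimal C sketch of the idea (an illustration, not the Alpha 21264's actual mechanism; sizes and latencies are assumed): a 2-way set-associative cache with one predictor bit per set, where the predicted way is checked first and the other way only on a mispredict:

#include <stdint.h>
#include <stdio.h>

#define SETS        256
#define OFFSET_BITS 5

typedef struct { int valid; uint32_t tag; } way_t;

static way_t ways[SETS][2];
static int   pred[SETS];         /* one way-predictor bit per set */

/* Returns the access latency in cycles for this hypothetical cache:
 * 1 cycle on a correct prediction, 2 on a mispredicted hit,
 * and a fixed miss penalty otherwise. */
int access(uint32_t addr, int miss_penalty)
{
    uint32_t set = (addr >> OFFSET_BITS) % SETS;
    uint32_t tag = addr >> OFFSET_BITS;
    int p = pred[set];

    if (ways[set][p].valid && ways[set][p].tag == tag)
        return 1;                            /* fast hit, prediction right */

    if (ways[set][1 - p].valid && ways[set][1 - p].tag == tag) {
        pred[set] = 1 - p;                   /* retrain the predictor      */
        return 2;                            /* slow (mispredicted) hit    */
    }
    return miss_penalty;                     /* miss                       */
}

int main(void)
{
    /* Install a block in way 1 of its set, then access it twice. */
    uint32_t addr = 0x4A80;
    uint32_t set  = (addr >> OFFSET_BITS) % SETS;
    ways[set][1].valid = 1;
    ways[set][1].tag   = addr >> OFFSET_BITS;
    printf("first access:  %d cycles\n", access(addr, 20)); /* mispredict: 2 */
    printf("second access: %d cycles\n", access(addr, 20)); /* predicted:  1 */
    return 0;
}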

  9. Fast (Instruction Cache) Hit Times via Trace Cache • Key idea: pack multiple non-contiguous basic blocks (separated by branches, BR, in the original figure) into one contiguous trace cache line • A single fetch brings in multiple basic blocks • The trace cache is indexed by the start address and the next n branch predictions

  10. Trace Caches • Supply enough instructions per cycle without dependencies • Finding ILP beyond 4 instructions per cycle • Don’t limit instructions in a static cache block to spatial locality • Find dynamic sequence of instructions including taken branches • NetBurst (P4) uses trace caches • Addresses are no longer aligned. • Same instruction is stored more than once • If part of multiple traces

  11. Trace Cache Implementation • A trace cache is a dynamic instruction cache • Block size is (for example) 8 words • The first time, a block is fetched normally (from the I-cache or memory) • An entry is allocated in the trace cache holding the 8 instructions actually fetched sequentially in the block • The tag is the address of the first instruction in the block together with the (multi-bit) branch prediction obtained for that address • Later, if the (address + prediction) pair hits in the trace cache, that block of instructions is used; otherwise the I-cache is used • Note that one start address can have multiple traces depending on the prediction • In principle, a trace can span multiple branches with more complex prediction information • But this can create (potentially) many cache entries for the same starting block of instructions: an efficiency concern
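
A rough, purely illustrative C sketch (not the Pentium 4 organization; all sizes and names are assumed) of a trace-cache lookup keyed by the start address plus the predicted outcomes of the next branches:

#include <stdint.h>

#define TRACE_LEN     8      /* instructions per trace line (example)     */
#define TRACE_ENTRIES 512
#define PRED_BITS     3      /* predictions for the next 3 branches       */

typedef struct {
    int      valid;
    uint32_t start_pc;
    uint8_t  branch_preds;          /* low PRED_BITS bits used            */
    uint32_t insts[TRACE_LEN];      /* the dynamically fetched trace      */
} trace_t;

static trace_t tcache[TRACE_ENTRIES];

/* Hit only if both the start PC and the branch-prediction pattern match;
 * otherwise the normal instruction cache is used and a new trace would be
 * built (allocation path omitted in this sketch). */
trace_t *tc_lookup(uint32_t pc, uint8_t preds)
{
    uint8_t  masked = preds & ((1u << PRED_BITS) - 1);
    trace_t *t = &tcache[(pc ^ masked) % TRACE_ENTRIES];
    if (t->valid && t->start_pc == pc && t->branch_preds == masked)
        return t;
    return 0;    /* miss: fall back to the I-cache */
}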

  12. Trace Cache in Pentium IV • The x86 CISC instruction set is a disaster to implement as a pipelined processor • So the Pentium III & IV translate x86 instructions, in order, into (possibly tens of) simple RISC-like micro-ops • The main CPU is then a fully speculative, out-of-order processor • Trace cache of 12K micro-ops

  13. Fast Hits by Pipelining Cache • pipelining cache access: • the effective latency of a first level cache hit can be multiple clock cycles, • giving fast cycle time and slow hits. • For example: • Pentium takes one clock cycle to access the instruction cache • Pentium Pro through Pentium III: two clocks • Pentium 4: four clocks. • This split increases the number of pipeline stages • greater penalty on mispredicted branches • more clock cycles between the issue of the load and the use of the data. • really increasing the bandwidth of instructions rather than decreasing the actual latency of a cache hit.

  14. Miss Penalty/Rate Reduction: Non-blocking Caches to Reduce Stalls on Misses • A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss • Requires an out-of-order execution CPU • "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests • "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses • Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses • Requires multiple memory banks (otherwise multiple outstanding misses cannot be serviced in parallel) • The Pentium Pro allows 4 outstanding memory misses
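
Non-blocking caches are commonly built around miss status holding registers (MSHRs) that track outstanding misses; the C sketch below is a simplified assumption of that bookkeeping, not any particular processor's design:

#include <stdint.h>

#define MAX_OUTSTANDING 4    /* e.g., the Pentium Pro allowed 4 outstanding misses */

typedef struct {
    int      valid;
    uint32_t block_addr;     /* block currently being fetched from memory */
} mshr_t;

static mshr_t mshr[MAX_OUTSTANDING];

enum { MERGED, ALLOCATED, STALL };

/* On a miss: if the block is already being fetched, merge with that MSHR
 * (a secondary miss); if a free MSHR exists, allocate it and keep serving
 * hits; if all MSHRs are busy, the cache finally has to stall. */
int handle_miss(uint32_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return MERGED;
        if (!mshr[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        mshr[free_slot].valid = 1;
        mshr[free_slot].block_addr = block_addr;
        return ALLOCATED;
    }
    return STALL;
}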

  15. Miss Penalty in Out-of-Order Execution • What is the miss penalty: • the full latency of the miss to memory? • or just the "exposed" (non-overlapped) latency during which the processor must stall? • This question does not arise in processors that stall until the data miss completes • Redefine memory stalls to get a new definition of miss penalty as non-overlapped latency: • Memory stall cycles / Instruction = Misses / Instruction x (Total miss latency - Overlapped miss latency) • Length of memory latency: what to consider as the start and the end of a memory operation • Length of latency overlap: ?
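
As an illustrative example with assumed numbers (not taken from the slides): with 0.02 misses per instruction, a total miss latency of 100 cycles, and 60 of those cycles overlapped with useful execution, memory stall cycles per instruction = 0.02 x (100 - 60) = 0.8, rather than the 0.02 x 100 = 2.0 cycles a fully blocking view would charge.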

  16. Average Memory Stall Time

  17. Value of Hit Under Miss for SPEC (8 KB data cache, direct mapped, 32-byte blocks, 16-cycle miss penalty) • Going from a blocking cache to hit under 1, 2, and 64 misses: • FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26 • Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19

  18. Increase Cache Bandwidth: Multibanked Caches • Adjacent blocks are placed in different cache banks (sequential interleaving) • Banks can be accessed in parallel • Overlaps the latencies of accessing each word • Can use a narrow bus (and still have fast sequential access) • Accessed words are returned sequentially • Fits well with sequential access, e.g. of the words in a cache block • Independent memory banks also allow multiple simultaneous cache misses • L2 cache of the AMD Opteron: two banks; L2 cache of the Sun Niagara: four banks
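
A minimal C sketch of sequential interleaving, assuming four banks and 64-byte blocks (illustrative numbers): consecutive block addresses rotate through the banks, so independent accesses can proceed in parallel:

#include <stdio.h>
#include <stdint.h>

#define NUM_BANKS  4
#define BLOCK_SIZE 64

/* Sequential interleaving: block number modulo the number of banks. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_BANKS;
}

int main(void)
{
    /* Four consecutive blocks land in four different banks. */
    for (uint32_t addr = 0; addr < 4 * BLOCK_SIZE; addr += BLOCK_SIZE)
        printf("block at 0x%03x -> bank %u\n", (unsigned)addr, bank_of(addr));
    return 0;
}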

  19. Reduce Miss Penalty: Early Restart and Critical Word First • Don't wait for the full block to be loaded before restarting the CPU • Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution • Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first • Generally useful only with large blocks • Spatial locality can work against the benefit: the CPU often wants the next sequential word, so it is not clear how much early restart helps • Commonly used in modern processors
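
A small C sketch of the wrapped ("requested word first") fill order, assuming an 8-word block: the missed word is fetched first and the remaining words wrap around it:

#include <stdio.h>

#define WORDS_PER_BLOCK 8

int main(void)
{
    int requested = 5;   /* word within the block that caused the miss */

    /* Critical word first: fetch word 5, then 6, 7, 0, 1, 2, 3, 4.
     * The CPU restarts as soon as word 5 arrives (early restart). */
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("fetch word %d\n", (requested + i) % WORDS_PER_BLOCK);
    return 0;
}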

  20. Reduce Miss Penalty: Merging Write Buffer • If the write buffer already contains other modified blocks, their addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry • If so, the new data is combined with that entry; this is called write merging (new data simply replaces any old data for the same word) • Multiword writes are usually faster than writes performed one word at a time • Write-buffer entries are typically as wide as the memory path • Valid bits are kept per sub-word to indicate which parts of the entry hold data • Also reduces stalls due to the buffer being full

  21. Write Merging Example
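
The original figure is not reproduced here; instead, a minimal C sketch of the merging check (the 4-word entry width and the names are assumptions): a new store is absorbed into an existing valid entry for the same block rather than taking a new slot:

#include <stdint.h>
#include <string.h>

#define ENTRY_WORDS 4        /* entry width = memory path width (example) */
#define NUM_ENTRIES 4

typedef struct {
    int      used;
    uint32_t block_addr;                 /* word address / ENTRY_WORDS */
    uint32_t data[ENTRY_WORDS];
    uint8_t  valid[ENTRY_WORDS];         /* per-word valid bits        */
} wb_entry_t;

static wb_entry_t wb[NUM_ENTRIES];

/* word_addr is in word units. Returns 1 if the store was merged or
 * buffered, 0 if the buffer is full (the CPU would stall until an
 * entry drains to memory). */
int write_buffer_store(uint32_t word_addr, uint32_t value)
{
    uint32_t block = word_addr / ENTRY_WORDS;
    uint32_t off   = word_addr % ENTRY_WORDS;

    for (int i = 0; i < NUM_ENTRIES; i++)          /* try to merge */
        if (wb[i].used && wb[i].block_addr == block) {
            wb[i].data[off]  = value;              /* new data replaces old */
            wb[i].valid[off] = 1;
            return 1;
        }
    for (int i = 0; i < NUM_ENTRIES; i++)          /* else take a free entry */
        if (!wb[i].used) {
            memset(&wb[i], 0, sizeof wb[i]);
            wb[i].used = 1;
            wb[i].block_addr = block;
            wb[i].data[off]  = value;
            wb[i].valid[off] = 1;
            return 1;
        }
    return 0;                                      /* buffer full: stall */
}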

  22. Reducing Misses by Compiler Optimizations • Instructions • Reorder procedures in memory so as to reduce conflict misses • Use profiling to look at conflicts (with tools developed for this purpose) • McFarling [1989] reduced cache misses by 75% for an 8 KB direct-mapped cache with 4-byte blocks, in software • Data • Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays • Loop interchange: change the nesting of loops to access data in the order it is stored in memory • Loop fusion: combine 2 independent loops that have the same loop structure and overlapping variables • Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of walking down whole columns or rows • Miss rate (+), no hardware complexity

  23. Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality

  24. Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

  25. Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improves temporal locality

  26. Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

• Two inner loops: • Read all N x N elements of z[] • Read N elements of 1 row of y[] repeatedly • Write N elements of 1 row of x[] • Capacity misses are a function of N and cache size: • if all 3 N x N matrices (4-byte elements) fit in the cache, there are no capacity misses; otherwise ... • Idea: compute on a B x B submatrix that fits in the cache

  27. Blocking Example: access pattern after blocking

  28. Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

• B is called the blocking factor • Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2

  29. Reducing Conflict Misses by Blocking • Conflict misses in caches that are not fully associative, as a function of blocking factor • Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite both fitting in the cache

  30. Effect of Compiler Optimizations

  31. Reducing Penalty/Rate by Hardware Prefetching of Instructions & Data • (Figure: CPU with register file and L1 instruction cache; on a miss the requested block goes to L1 and the prefetched block goes into a stream buffer in front of the unified L2 cache) • Prefetching relies on having extra memory bandwidth that can be used without penalty • Instruction prefetching • The Alpha 21064 fetches 2 blocks on a miss • The extra block is placed in a "stream buffer" • On a miss, the stream buffer is checked • If the block hits in the stream buffer, it is moved into the cache and the next block is prefetched

  32. Data Prefetching • Prefetching works with data blocks too: • Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43% • Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches • The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages • Prefetching is invoked on 2 successive L2 cache misses to a page if the distance between those cache blocks is < 256 bytes

  33. Issues in Prefetching • Usefulness: prefetches should produce hits • Timeliness: not too late and not too early • Cache and bandwidth pollution • (Figure: CPU with register file, L1 instruction and data caches, and prefetched data placed alongside the unified L2 cache)

  34. Hardware Data Prefetching • Prefetch-on-miss: • prefetch block b + 1 upon a miss on block b • One Block Lookahead (OBL) scheme: • initiate a prefetch for block b + 1 when block b is accessed • Why is this different from simply doubling the block size? • Can be extended to N-block lookahead • Strided prefetch: • if a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, and so on • Example: the IBM Power5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
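
A simplified C sketch of strided hardware prefetching (an illustration, not the Power5 implementation; the table size and policy are assumed): a small table keyed by the load PC remembers the last address and stride, and issues a prefetch once the same stride is seen twice:

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64

typedef struct {
    uint32_t pc;          /* load instruction that trained this entry */
    uint32_t last_addr;
    int32_t  stride;
    int      confirmed;   /* stride observed at least twice           */
} stride_entry_t;

static stride_entry_t table[TABLE_SIZE];

/* Called on every load; returns the address to prefetch, or 0 for none. */
uint32_t train(uint32_t pc, uint32_t addr)
{
    stride_entry_t *e = &table[pc % TABLE_SIZE];
    uint32_t prefetch = 0;

    if (e->pc == pc) {
        int32_t s = (int32_t)(addr - e->last_addr);
        if (s != 0 && s == e->stride) {
            e->confirmed = 1;
            prefetch = addr + (uint32_t)s;       /* prefetch one stride ahead */
        } else {
            e->stride = s;
            e->confirmed = 0;
        }
    } else {
        e->pc = pc; e->stride = 0; e->confirmed = 0;
    }
    e->last_addr = addr;
    return e->confirmed ? prefetch : 0;
}

int main(void)
{
    for (uint32_t a = 0x1000; a < 0x1400; a += 0x40)   /* 64-byte stride */
        printf("access 0x%x -> prefetch 0x%x\n",
               (unsigned)a, (unsigned)train(0x400, a));
    return 0;
}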

  35. 3. Reducing Penalty/Rate by Software Prefetching of Data • The compiler inserts prefetch instructions to request data before they are needed • Data prefetch • Cache prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC V9) • Faulting or non-faulting • Most processors now offer non-faulting cache prefetches (non-binding prefetch) • In case of an exception they simply turn into no-ops • Useful with non-blocking (lockup-free) caches • Miss rate (+), hardware complexity (3)
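
As a hedged illustration of what the inserted prefetches look like, written by hand here with the GCC/Clang __builtin_prefetch intrinsic (the prefetch distance of 16 elements is an assumed tuning parameter, not a recommendation from the slides):

#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead. The prefetch is
 * non-faulting/non-binding: an invalid address is simply dropped. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 16;                /* assumed prefetch distance */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low locality */);
        s += a[i];
    }
    return s;
}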

  36. Summary of Memory Optimizations

  37. 5. Victim Caches • A small, fully associative cache between a cache and its refill path • Does not change the clock rate • Contains blocks that were discarded from the cache because of a miss: the "victims" • These blocks are checked on a miss • If the block is found there, the victim block and the cache block are swapped • Miss rate (+), hardware complexity (2)

  38. 5. Reducing Misses via a "Victim Cache" • How can we combine the fast hit time of direct mapping and still avoid conflict misses? • Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses for a 4 KB direct-mapped data cache • Used in Alpha and HP machines • The AMD Athlon has a victim cache with eight entries
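
A minimal C sketch of the victim-cache check (sizes and structure are assumptions): on a main-cache miss the small fully associative victim cache is searched, and on a victim hit the two blocks swap places:

#include <stdint.h>

#define MAIN_LINES   128     /* direct-mapped main cache (example size) */
#define VICTIM_LINES 4       /* small, fully associative victim cache   */

typedef struct { int valid; uint32_t block_addr; } line_t;

static line_t main_cache[MAIN_LINES];   /* indexed by block_addr % MAIN_LINES */
static line_t victim[VICTIM_LINES];

/* Returns 0 on a main-cache hit, 1 on a victim hit (blocks are swapped),
 * and 2 on a miss that must be serviced by the next level. */
int lookup(uint32_t block_addr)
{
    uint32_t idx = block_addr % MAIN_LINES;

    if (main_cache[idx].valid && main_cache[idx].block_addr == block_addr)
        return 0;

    for (int i = 0; i < VICTIM_LINES; i++)
        if (victim[i].valid && victim[i].block_addr == block_addr) {
            line_t tmp = main_cache[idx];       /* swap victim and main block */
            main_cache[idx] = victim[i];
            victim[i] = tmp;
            return 1;
        }
    return 2;
}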
