
CMPS 255 Computer Architecture Cache performance and alternatives PH 5.4




  1. CMPS 255 Computer Architecture: Cache Performance and Alternatives (PH 5.4)

  2. Sources of Cache Misses
  • Compulsory (cold start or process migration, first reference):
    • The first access to a block; a "cold" fact of life, and there is not much you can do about it. If you run millions of instructions, compulsory misses are insignificant.
    • Solution: increase the block size (this increases the miss penalty, and very large blocks can increase the miss rate).
  • Capacity:
    • The cache cannot contain all the blocks accessed by the program.
    • Solution: increase the cache size (which may increase the access time).
  • Conflict (collision):
    • Multiple memory locations map to the same cache location.
    • Solution 1: increase the cache size.
    • Solution 2: increase the associativity (stay tuned; this may increase the access time).

  3. Handling Cache Misses (Single Word Blocks)
  • Read misses (Instruction Cache and Data Cache):
    • Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, and send the requested word to the processor; then let the pipeline resume.
  • Write misses (Data Cache only), handled in one of three ways:
    • Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may mean evicting a dirty block in a write-back cache), write the word from the processor into the cache, then let the pipeline resume; or
    • Write allocate: just write the word into the cache, updating both the tag and the data; there is no need to check for a cache hit and no need to stall; or
    • No-write allocate: skip the cache write (but invalidate that cache block, since it would now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); there is no need to stall as long as the write buffer isn't full.

  4. Measuring Cache Performance
  • CPU execution time can be divided into:
    • the clock cycles spent executing the program, and
    • the clock cycles spent waiting for the memory system.
  • Assuming cache hit costs are included as part of the normal CPU execution cycles:
    CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
    CPU time = IC × (CPIideal + Memory-stall cycles per instruction) × CC
  • Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls):
    Read-stall cycles = reads/program × read miss rate × read miss penalty
    Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
  • For write-through caches, this simplifies to (ignoring write buffer stalls):
    Memory-stall cycles = accesses/program × miss rate × miss penalty
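As a minimal sketch of these formulas (the function names below are illustrative, not from the slides), the write-through simplification and the CPU-time equation can be written as:

```python
# Sketch of the formulas above; names are my own, not from the slides.
def stall_cycles_per_instr(accesses_per_instr, miss_rate, miss_penalty):
    """Memory-stall cycles = accesses/instruction x miss rate x miss penalty."""
    return accesses_per_instr * miss_rate * miss_penalty

def cpu_time(ic, cpi_ideal, stalls_per_instr, clock_cycle):
    """CPU time = IC x (CPI_ideal + memory-stall cycles per instruction) x CC."""
    return ic * (cpi_ideal + stalls_per_instr) * clock_cycle
```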

  5. Impacts of Cache Performance
  • Consider a processor with a CPIideal of 2, a 100-cycle miss penalty, 36% load/store instructions, a 2% Instruction cache miss rate, and a 4% Data cache miss rate.
  • The number of memory miss cycles for instructions, in terms of IC, is:
    Instruction miss cycles = IC × 2% × 100 = 2.00 × IC
  • Since the frequency of all loads and stores is 36%, the number of memory miss cycles for data references is:
    Data miss cycles = IC × 36% × 4% × 100 = 1.44 × IC
  • The total number of memory-stall cycles is then:
    Memory-stall cycles = 2.00 × IC + 1.44 × IC = 3.44 × IC
    (more than three cycles of memory stall per instruction)
  • Accordingly, the total CPI including memory stalls is:
    CPIstalls = 2 + 3.44 = 5.44
  • An ideal CPU (with no memory stalls) would be 5.44/2 = 2.72 times faster.

  6. Impacts of Cache Performance
  • What if CPIideal is reduced to 1 because of pipelining?
    • The system with cache misses would then have a CPI of 1 + 3.44 = 4.44.
    • An ideal CPU would be 4.44/1 = 4.44 times faster.
    • The fraction of execution time spent on memory stalls would have risen from 3.44/5.44 = 63% to 3.44/4.44 = 77%.
  • What if CPIideal is reduced to 0.5? To 0.25?
  • What if the Data cache miss rate went up by 1%? By 2%?
  • What if the processor clock rate is doubled (doubling the miss penalty in cycles)?
  • (The sketch below works through the CPIideal what-ifs.)
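A quick way to explore the CPIideal what-ifs is to recompute CPIstalls from the slide-5 parameters. This small self-contained script (variable names are my own) sweeps CPIideal:

```python
# Slide-5 parameters: 2% I-cache and 4% D-cache miss rates, 36% loads/stores,
# 100-cycle miss penalty. Stall cycles/instruction = 2.00 + 1.44 = 3.44.
stalls = 0.02 * 100 + 0.36 * 0.04 * 100

for cpi_ideal in (2.0, 1.0, 0.5, 0.25):
    cpi_stalls = cpi_ideal + stalls
    print(f"CPI_ideal={cpi_ideal:5.2f}  CPI_stalls={cpi_stalls:5.2f}  "
          f"stall fraction={stalls / cpi_stalls:.0%}")
```

Running it shows the stall fraction climbing from 63% at CPIideal = 2 to 77% at 1, and higher still at 0.5 and 0.25: the faster the core, the more memory stalls dominate.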

  7. Average Memory Access Time (AMAT)
  • A larger cache will have a longer access time, and an increase in hit time will likely add another stage to the pipeline. At some point, the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance.
  • Average Memory Access Time (AMAT) is the average time to access memory, considering both hits and misses:
    AMAT = Hit time + Miss rate × Miss penalty
  • What is the AMAT for a CPU with a 1 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time of 1 clock cycle?
    AMAT = 1 + 0.05 × 20 = 2 ns (i.e., 2 cycles per access)
  • What is the AMAT for a processor with a 20 ps clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
    AMAT = 1 + 0.02 × 50 = 2 cycles (i.e., 40 ps)
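Both examples reduce to one line of arithmetic; a minimal sketch (the function name is my own):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate x miss penalty (all in clock cycles)."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))   # 2.0 cycles -> 2 ns at a 1 ns clock
print(amat(1, 0.02, 50))   # 2.0 cycles -> 40 ps at a 20 ps clock
```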

  8. Performance Summary
  • The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI):
    • Increasing the clock rate: memory speed is unlikely to improve as fast as the processor cycle time. When calculating CPIstall, the cache miss penalty is measured in the processor clock cycles needed to handle a miss.
    • Decreasing the base CPI: the lower the CPIideal, the more pronounced the impact of stalls.
  • A greater proportion of time is spent on memory stalls.
  • We can't neglect cache behavior when evaluating system performance.

  9. Reducing Cache Miss Rates #1 - Associative Caches
  • Allow more flexible block placement:
    • In a direct-mapped cache, a memory block maps to exactly one cache block.
    • At the other extreme, a memory block could be allowed to map to any cache block: a fully associative cache.
      • Requires all entries to be searched at once.
      • Needs a comparator per entry (expensive).
    • A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative).
      • A memory block maps to a unique set, specified by the index field:
        set = (block address) modulo (# sets in the cache)
      • Each set contains n entries, and a block can be placed in any way of that set (so there are n choices).
      • Needs only n comparators (less expensive).

  10. Associative Cache Example
  • Direct-mapped placement:
    • Memory block 12 can be found in only one cache block: (12 modulo 8) = 4.
  • Two-way set-associative (four sets):
    • Memory block 12 must be in set (12 modulo 4) = 0;
    • within that set, the memory block could be in either element.
  • Fully associative placement:
    • Memory block 12 can appear in any of the eight cache blocks.
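The three placements differ only in how many sets the 8-entry cache is divided into. A small sketch reproducing the block-12 example (names are my own):

```python
def set_index(block_addr, num_sets):
    """A block maps to set (block address) modulo (# sets in the cache)."""
    return block_addr % num_sets

blocks, block_addr = 8, 12
for ways in (1, 2, 8):        # direct mapped, 2-way, fully associative
    num_sets = blocks // ways
    print(f"{ways}-way: block {block_addr} -> set "
          f"{set_index(block_addr, num_sets)} ({ways} candidate frame(s))")
```

This prints set 4 for direct mapped, set 0 for 2-way, and set 0 (the only set, all eight frames eligible) for fully associative, matching the slide.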

  11. Spectrum of Associativity
  • For a cache with 8 entries, a set-associative cache reduces cache misses by reducing conflicts between blocks that would have been mapped to the same cache block frame in a direct-mapped cache:
    • 1-way set associative (direct mapped): 8 sets, 1 block frame per set
    • 2-way set associative: 4 sets, 2 block frames per set
    • 4-way set associative: 2 sets, 4 block frames per set
    • 8-way set associative: 1 set, 8 block frames per set; in this case it becomes fully associative, since the total number of block frames is 8.

  12. Another Reference String Mapping
  • Consider the main-memory word reference string 0 4 0 4 0 4 0 4 on a direct-mapped cache with four single-word blocks. Start with an empty cache, all blocks initially marked as not valid.
  • Words 0 and 4 both map to cache index 00 (with tags 00 and 01 respectively), so each reference evicts the other:
    miss 0, miss 4, miss 0, miss 4, miss 0, miss 4, miss 0, miss 4
  • 8 requests, 8 misses.
  • This "ping-pong" effect is due to conflict misses: two memory locations that map into the same cache block. (The sketch below reproduces it.)
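A minimal direct-mapped simulation of this reference string (assuming four one-word blocks; all names are my own) shows the ping-pong:

```python
def direct_mapped_misses(refs, num_blocks=4):
    """Count misses for block-address refs in a direct-mapped, one-word-block cache."""
    frames = [None] * num_blocks       # each frame holds one block address
    misses = 0
    for block in refs:
        idx = block % num_blocks       # index = block address mod # blocks
        if frames[idx] != block:       # invalid or tag mismatch -> miss
            frames[idx] = block        # install the block, evicting the old one
            misses += 1
    return misses

print(direct_mapped_misses([0, 4, 0, 4, 0, 4, 0, 4]))   # -> 8 (all misses)
```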

  13. Set Associative Cache Example
  • [Figure: a two-way set-associative cache (Way 0 and Way 1, sets 0 and 1, each entry holding V, Tag, and Data) in front of a 16-block main memory with block addresses 0000xx through 1111xx.]
  • One-word blocks: the two low-order address bits define the byte within the (32-bit) word.
  • Q2: How do we find it? Use the next 1 low-order memory address bit to determine which cache set the block maps to (i.e., block address modulo the number of sets in the cache).
  • Q1: Is it there? Compare all the cache tags in the selected set against the high-order 3 memory address bits to tell whether the memory block is in the cache.

  14. Associativity Example
  • Compare 4-block caches (direct mapped, 2-way set associative, and fully associative) on the block access sequence 0, 8, 0, 6, 8:
    • Direct mapped
    • 2-way set associative
    • Fully associative
  • (The simulation sketch below tallies the misses for each organization.)
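A small n-way LRU simulator (my own sketch, not from the slides) reproduces the textbook tallies for this sequence: 5 misses direct mapped, 4 misses 2-way, and 3 misses fully associative.

```python
from collections import OrderedDict

def misses(refs, num_blocks=4, ways=1):
    """Count misses for an n-way set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # per-set LRU order
    count = 0
    for block in refs:
        s = sets[block % num_sets]        # block maps to set (addr mod # sets)
        if block in s:
            s.move_to_end(block)          # hit: mark most recently used
        else:
            count += 1                    # miss: fetch the block
            if len(s) == ways:
                s.popitem(last=False)     # set full: evict least recently used
            s[block] = True
    return count

seq = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):                    # direct mapped, 2-way, fully associative
    print(f"{ways}-way: {misses(seq, ways=ways)} misses")
```

Note how 0, 8, and 6 all collide in one set of the 2-way cache, yet associativity still saves a miss; full associativity saves two.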

  15. Four-Way Set Associative Cache
  • [Figure: a four-way set-associative cache with 1024 one-word block frames, so 1024/4 = 2^8 = 256 sets; Way 0 through Way 3 each have V, Tag, and Data arrays indexed 0 through 255.]
  • The 32-bit address (bits 31 ... 0) splits into a 22-bit Tag, an 8-bit Index that selects the set, and a 2-bit Byte offset.
  • Four tag comparators and a 4-to-1 multiplexor ("4x1 select") produce the Hit signal and deliver the 32-bit Data word.

  16. Range of Set Associative Caches
  • The block address divides into a Tag (used for the tag compare), an Index (selects the set), a Block offset (selects the word in the block), and a Byte offset.
  • For a fixed-size cache, each increase in associativity by a factor of two doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
  • Decreasing associativity toward direct mapped (only one way) means smaller tags and only a single comparator; increasing associativity toward fully associative (only one set) means the tag is all the bits except the block and byte offsets.

  17. Range of Set Associative Caches - Example
  • Given a cache of 4096 blocks, a 4-word block size, and a 32-bit address:
    • 16 (or 2^4) bytes per block leaves 32 - 4 = 28 bits to be used for index and tag.
  • Find the total number of sets and the total number of tag bits for:
    • Direct mapped: the same number of sets as blocks, so log2(4096) = 12 bits of index; the total number of tag bits is (28 - 12) × 4096 = 16 × 4096 ≈ 66 K tag bits.
    • Two-way set associative: 2048 sets; the total number of tag bits is (28 - 11) × 2 × 2048 = 34 × 2048 ≈ 70 K tag bits.
    • Four-way set associative: 1024 sets; the total number of tag bits is (28 - 10) × 4 × 1024 = 72 × 1024 ≈ 74 K tag bits.
    • Fully associative: one set with 4096 blocks; the total number of tag bits is 28 × 4096 × 1 ≈ 115 K tag bits.
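These counts follow mechanically from the index width. A short script (assuming the slide's 4096-block, 4-word-block, 32-bit-address cache; names are my own) checks them:

```python
from math import log2

blocks, addr_bits, offset_bits = 4096, 32, 4      # 4-word blocks: 16 = 2**4 bytes

for ways in (1, 2, 4, blocks):                    # DM, 2-way, 4-way, fully assoc.
    num_sets = blocks // ways
    index_bits = int(log2(num_sets))              # 0 bits when fully associative
    tag_bits = addr_bits - offset_bits - index_bits
    total = tag_bits * blocks                     # one tag per block frame
    print(f"{ways:4}-way: {num_sets:4} sets, tag = {tag_bits} bits, "
          f"total = {total} bits (~{total / 1000:.0f} K)")
```

The output (~66 K, ~70 K, ~74 K, ~115 K) matches the slide; each doubling of associativity moves one bit from the index into every stored tag.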

  18. Cache Replacement Policy
  • When a cache miss occurs, the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data. Such a block is selected by one of three methods:
    • Random:
      • Any block is randomly selected for replacement, providing uniform allocation.
      • Simple to build in hardware; the most widely used cache replacement strategy.
    • Least recently used (LRU):
      • Accesses to blocks are recorded, and the block replaced is the one that has not been used for the longest period of time.
      • Full LRU is expensive to implement as the number of blocks to be tracked increases, and it is usually approximated by block-usage bits that are cleared at regular time intervals.
    • First in, first out (FIFO):
      • Because LRU can be complicated to implement, FIFO approximates it by evicting the oldest block rather than the least recently used one.
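For concreteness, here is a minimal sketch of victim selection under each policy for a single set (class and method names are my own; this models the bookkeeping, not hardware):

```python
import random
from collections import OrderedDict, deque

class LRUSet:
    """Tracks full LRU order for one set; hardware usually only approximates this."""
    def __init__(self):
        self.order = OrderedDict()           # least recently used tag first
    def access(self, tag):
        self.order.pop(tag, None)
        self.order[tag] = True               # (re)insert as most recently used
    def victim(self):
        return next(iter(self.order))        # evict the least recently used

class FIFOSet:
    """Evicts by insertion age only; hits do not refresh a block's position."""
    def __init__(self):
        self.queue = deque()
    def insert(self, tag):
        self.queue.append(tag)
    def victim(self):
        return self.queue.popleft()          # evict the oldest inserted block

def random_victim(tags):
    """Random policy: uniformly select any resident block for replacement."""
    return random.choice(tags)
```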

  19. Cache Organization & Placement Strategies
  • Placement strategies, i.e., the mapping of a main-memory data block onto a cache block frame address, divide caches into three organizations:
    • Direct-mapped cache: a block can be placed in only one location (cache block frame), given by the mapping function:
      index = (Block address) MOD (Number of blocks in cache)
      • Least complex to implement.
    • Fully associative cache: a block can be placed anywhere in the cache (no mapping function).
      • The most complex cache organization to implement.
    • Set-associative cache: a block can be placed in a restricted set of places, i.e., cache block frames. A set is a group of block frames in the cache. A block is first mapped onto a set and then placed anywhere within that set. The set is chosen by:
      index = (Block address) MOD (Number of sets in cache)
      • If there are n blocks in a set, the placement is called n-way set associative.

  20. Costs of Set Associative Caches
  • When a miss occurs, which way's block do we pick for replacement?
    • Direct mapped: no choice.
    • Set associative:
      • Prefer a non-valid entry, if there is one.
      • Otherwise, choose among the entries in the set:
        • LRU: choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
        • Random: gives approximately the same performance as LRU for high associativity.
  • N-way set-associative cache costs:
    • N comparators (delay and area).
    • MUX delay (way selection) before the data is available.
    • The data is available only after way selection and the Hit/Miss decision, so it is not possible to just assume a hit, continue, and recover later if it was a miss. In a direct-mapped cache, by contrast, the cache block is available before the Hit/Miss decision.

  21. Reducing Cache Miss Rates #2 - Multilevel Caches
  • Use multiple levels of caches:
    • The primary (Level-1) cache is attached to the CPU: small, but fast.
    • The Level-2 cache services misses from the primary cache: larger and slower, but still faster than main memory.
    • Main memory services L2 cache misses.
    • Some high-end systems include an L3 cache.

  22. Reducing Cache Miss Rates #2 - Multilevel Caches
  • Example: given a processor with a base CPI of 1.0, a clock rate of 4 GHz, a 2% miss rate per instruction at the primary cache, and a main memory access time of 100 ns:
  • How much faster will the processor be if we add a secondary cache that reduces the miss rate to main memory to 0.5% and has a 5 ns access time?
  • With just the primary cache:
    • Miss penalty = 100 ns / 0.25 ns per cycle = 400 cycles
    • Total CPI = CPIideal + Memory-stall cycles per instruction = 1 + 0.02 × 400 = 9
  • With a two-level cache:
    • Miss penalty for an access that hits in the L2 cache = 5 ns / 0.25 ns per cycle = 20 cycles
    • If the miss also has to go to main memory, the total miss penalty is the secondary cache access time plus the main memory access time.
    • Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
  • Performance ratio = 9/3.4 = 2.6
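The same arithmetic as a self-contained check (variable names are my own):

```python
clock_ghz = 4.0
cycle_ns = 1.0 / clock_ghz                     # 0.25 ns per cycle at 4 GHz
mem_penalty = 100 / cycle_ns                   # 400 cycles to main memory
l2_penalty = 5 / cycle_ns                      # 20 cycles to the L2 cache

cpi_l1_only = 1.0 + 0.02 * mem_penalty         # 1 + 0.02 x 400 = 9.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty   # 3.4
print(cpi_l1_only / cpi_two_level)             # ~2.6x faster with an L2 cache
```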

  23. Reducing Cache Miss Rates #2 - Multilevel Caches
  • With an L1 cache only: a processor with a CPIideal of 2, a 100-cycle miss penalty, 36% load/store instructions, a 2% Instruction cache miss rate, and a 4% Data cache miss rate.
    • The number of memory-stall cycles per instruction is:
      Memory-stall cycles = 2.00 + 1.44 = 3.44
    • Accordingly, the total CPI including memory stalls is:
      CPIstalls = 2 + 3.44 = 5.44
  • With an L2 cache added: a CPIideal of 2, a 100-cycle miss penalty to main memory, a 25-cycle miss penalty to the L2 cache, 36% load/store instructions, 2% Instruction cache and 4% Data cache miss rates to L1, and a 0.5% L2 cache miss rate:
    CPIstalls = 2 + 0.02 × 25 + 0.36 × 0.04 × 25 + 0.005 × 100 + 0.36 × 0.005 × 100 = 3.54
    (as compared to 5.44 with no L2 cache)
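The slide-23 figure can be verified the same way (again a sketch; variable names are mine):

```python
ls_frac, i_miss, d_miss, l2_miss = 0.36, 0.02, 0.04, 0.005
l2_penalty, mem_penalty = 25, 100

stalls = (i_miss * l2_penalty                  # I-cache misses served by L2
          + ls_frac * d_miss * l2_penalty      # D-cache misses served by L2
          + l2_miss * mem_penalty              # instruction refs missing in L2
          + ls_frac * l2_miss * mem_penalty)   # data refs missing in L2
print(2 + stalls)                              # 3.54, vs. 5.44 with no L2 cache
```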

  24. Multilevel Cache Design Considerations
  • The design considerations for L1 and L2 caches are very different:
    • The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller, with smaller block sizes.
    • The secondary cache(s) should focus on reducing the miss rate, to reduce the penalty of long main memory access times: larger, with larger block sizes and higher levels of associativity.
  • The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so L1 can be smaller (i.e., faster) but have a higher miss rate.
  • For the L2 cache, hit time is less important than miss rate:
    • The L2 cache hit time determines the L1 cache's miss penalty.
    • The L2 cache's local miss rate is much larger (>>) than the global miss rate.
