
  1. Computer Architecture Lecture Notes, Spring 2005, Dr. Michael P. Frank Competency Area 6: Cache Memory

  2. Introduction • Memory is important to performance. • Users want large memories with fast access times; ideally, unlimited fast memory. • To use an analogy, think of a bookshelf containing many books: • Suppose you are writing a paper on birds. You go to the bookshelf, pull out some of the books on birds and place them on the desk. As you start to look through them you realize that you need more references. So you go back to the bookshelf and get more books on birds and put them on the desk. Now as you begin to write your paper, you have many of the references you need on the desk in front of you. • This is an example of the principle of locality, which states that programs access a relatively small portion of their address space at any instant of time.

  3. Temporal & Spatial Locality • There are two types of locality: TEMPORAL LOCALITY (locality in time) If an item is referenced, it will likely be referenced again soon. Data is reused. SPATIAL LOCALITY (locality in space) If an item is referenced, items at neighboring addresses will likely be referenced soon. • Most programs contain natural locality in their structure. For example, most programs contain loops in which the instructions and data need to be accessed repeatedly. This is an example of temporal locality. • Instructions are usually accessed sequentially, so they contain a high amount of spatial locality. • Also, data accesses to the elements of an array are another example of spatial locality.

  4. Memory Hierarchy • We can exploit the natural locality in programs by implementing the memory of a computer as a memory hierarchy. • A memory hierarchy consists of multiple levels of memory with different speeds and sizes. The fastest memories are more expensive, and usually much smaller in size (see figure). • This gives the user the illusion of a memory that is as large as the largest level yet nearly as fast as the fastest level. • This is accomplished by using efficient methods for memory structure and organization.

  5. Memory Technologies • Memory hierarchies are built using three main technologies: • Main memory is implemented using dynamic random access memory (DRAM): • the value is stored as a charge on a capacitor • very small, but slower than SRAM • Memory levels closer to the CPU (a.k.a. cache) are implemented using static random access memory (SRAM): • the value is stored on a pair of inverting gates • very fast, but takes up more space than DRAM • Memory levels farthest away from the CPU are usually implemented using magnetic disks. They are the largest and slowest levels in the hierarchy.

  6. Basic Caching Concepts • The memory system is organized as a hierarchy, with the level closest to the processor being a subset of any level further away, and all of the data stored at the lowest level (see figure). • Data is copied between only two adjacent levels at any given time. We call the minimum unit of information contained in a two-level hierarchy a block (the highlighted square shown in the figure). • If data requested by the user appears in some block in the upper level, it is known as a hit. If data is not found in the upper level, it is known as a miss.

  7. Blocks vs. Frames • A block is a fixed-size chunk of (possibly mutable) data that has an associated logical address in the machine’s memory address space. • There may be copies of a given block in various places, • some more up-to-date than others • Like versions of a document • A (block) frame is a block-sized piece of physical hardware where a block could be placed. • Like a physical picture frame you place a document copy in. [Figure: a particular block (e.g., 1001010110101111) and the block frames that could hold it]

  8. Four Questions to Ask re: a Cache Design • Consider any level in a memory hierarchy. • Remember: a block is the unit of data transfer. • Between the given level, and the levels below it • The level’s design is described by four behaviors: • Block Placement: • Where could a new block be placed in the given level? • Block Identification: • How is an existing block found, if it is in the level? • Block Replacement: • Which existing block should be replaced, if necessary? • Write Strategy: • How are writes to the block handled?

  9. The Three Major Placement Schemes

  10. Direct-Mapped Identification [Figure: the full byte address is split into tag, frame #, and offset fields; the frame # decodes to select one row of block frames, the stored tag is compared against the address tag, and on a match (hit) the mux select picks out the requested data word.]

  11. Direct-Mapped Placement • A block can only go into one frame in the cache • Determined by the block’s address (in memory space) • The frame number for block placement is usually given by some low-order bits of the block’s address. • This can also be expressed as: (Frame number) = (Block address) mod (Number of frames in cache) • Note that in a direct-mapped cache, • block placement & replacement choices are both completely determined by the address of the new block that is to be accessed.

  12. Direct-Mapped Cache Example • Here is a typical architecture for a direct-mapped cache. • The block size is 1 word of 4 bytes (32 bits). • The byte address is broken down as follows: • addr[1:0] = 2-bit byte offset within word • addr[11:2] = 10-bit block frame index • addr[31:12] = 20-bit block tag • On a read, if the tag matches, then this is the data we want. • We get a cache hit. • But, what happens if the tag bits don’t match?
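As a concrete illustration of this address breakdown, here is a minimal C sketch (not from the original notes) that extracts the tag, frame index, and byte offset fields, assuming the 20/10/2-bit split described above; the example address is arbitrary:

```c
#include <stdint.h>
#include <stdio.h>

/* Field widths for the example direct-mapped cache above:
   2-bit byte offset, 10-bit frame index, 20-bit tag.       */
#define OFFSET_BITS 2
#define INDEX_BITS  10

int main(void) {
    uint32_t addr   = 0x12345678;                                        /* arbitrary byte address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                  /* addr[1:0]   */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* addr[11:2]  */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* addr[31:12] */
    printf("tag=0x%05x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```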

  13. Cache Miss Behavior • If the tag bits do not match, then a cache miss occurs. • Upon a cache miss: • The CPU is stalled, freezing the contents of internal and programmer-visible registers, while waiting for the result from memory. • The desired block of data is fetched from memory and placed in the cache. • Execution is restarted at the cycle that caused the cache miss. • Recall that we have two different types of memory accesses: • reads (loads) or writes (stores). • Thus, overall we can have 4 kinds of cache events: • read hits, read misses, write hits and write misses.

  14. Fully-Associative Placement • One alternative to direct-mapped is: • Allow the block to fill any empty frame in the cache. • How do we then locate the block later? • We can associate each stored block with a tag • that identifies the block’s home address in main memory. • When the block is needed, we can use the cache as an associative memory, using the tag to match against all frames in parallel, to pull out the appropriate block. • Another alternative to direct-mapped is placement under full program control. • A register file can be viewed as basically a small programmer-controlled cache (with 1-word blocks).

  15. Fully-Associative Identification [Figure: the address is split into a block address and an offset; the block address is compared in parallel against the tags of all block frames, and a match (hit) drives the mux select that delivers the data word.] • Note that, compared to Direct: • More address bits have to be stored with each block frame. • A comparator is needed for each frame, to do the parallel associative lookup.

  16. Set-Associative Placement • The block address determines not a single frame, but a frame set. • A frame set is several frames, grouped together. (Frame set #) = (Block address) mod (# of frame sets) • The block can be placed associatively anywhere within that frame set. • Where? This is part of the placement strategy. • If there are n frames in each frame set, the scheme is called “n-way set-associative”. • Direct mapped = 1-way set-associative. • Fully associative = There is only 1 frame set.
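A minimal C sketch of the set-mapping arithmetic above; the cache parameters (256 frames, 4-way set-associative) and the block address are assumptions chosen only for illustration:

```c
#include <stdio.h>

/* Illustrative parameters: 256 frames organized 4-way set-associative. */
#define NUM_FRAMES 256
#define WAYS       4
#define NUM_SETS   (NUM_FRAMES / WAYS)   /* 64 frame sets */

int main(void) {
    unsigned long block_addr = 0x1A2B3;        /* arbitrary block address */
    /* (Frame set #) = (Block address) mod (# of frame sets) */
    unsigned set = block_addr % NUM_SETS;
    printf("block 0x%lX maps to set %u (any of %d ways within it)\n",
           block_addr, set, WAYS);
    return 0;
}
```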

  17. Set-Associative Identification [Figure: the address is split into tag, set #, and offset fields; here a 2-bit set select chooses one of 4 = 2² separate sets, the tags in that set are compared in parallel, and a hit drives the mux select for the data word.] • Intermediate between direct-mapped and fully-associative in the number of tag bits that need to be associated with cache frames. • Still need a comparator for each frame (but only those in one set need be activated).

  18. Cache size equation • Simple equation for the size of a cache: (Cache size) = (Block size) × (Number of sets) × (Set associativity) • Can relate to the sizes of the various address fields: (Block size) = 2^(# of offset bits) (Number of sets) = 2^(# of index bits) (# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits) [Figure: memory address divided into tag, index, and offset fields]
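The equations above can be checked with a small C sketch; the parameters (64-byte blocks, 128 sets, 4-way associativity, 32-bit addresses) are assumptions chosen only for illustration:

```c
#include <stdio.h>

/* Illustrative parameters: 64-byte blocks, 128 sets, 4-way associative,
   32-bit memory addresses.                                              */
int main(void) {
    int block_size = 64, num_sets = 128, assoc = 4, addr_bits = 32;

    int cache_size  = block_size * num_sets * assoc;       /* data capacity in bytes */
    int offset_bits = 0, index_bits = 0;
    while ((1 << offset_bits) < block_size) offset_bits++; /* log2(block size)      */
    while ((1 << index_bits)  < num_sets)   index_bits++;  /* log2(number of sets)  */
    int tag_bits = addr_bits - index_bits - offset_bits;

    printf("cache size = %d bytes, offset = %d bits, index = %d bits, tag = %d bits\n",
           cache_size, offset_bits, index_bits, tag_bits);
    return 0;
}
```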

  19. Replacement Strategies Replacement strategy: Which existing block do we replace, when a new block comes in? • With a direct-mapped cache: • There’s only one choice! (Same as placement.) • With a (fully- or set-) associative cache: • If any frame in the set is empty, pick one of those. • Otherwise, there are many possible strategies: • (Pseudo-) random: Simple, fast, and fairly effective • Least-recently used (LRU), and approximations thereof • Makes little difference in larger caches • First in, first out (FIFO) – Use time since block was read • May be easier to track than time since last access
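As an illustration of LRU within one frame set, here is a hedged C sketch (the data structure and names are illustrative, not part of the notes) that picks an empty frame if one exists and otherwise the least-recently-used frame:

```c
#include <stdio.h>

#define WAYS 4   /* illustrative: 4-way set-associative */

/* One frame in a set: valid bit, tag, and the time of its last use. */
struct frame { int valid; unsigned tag; unsigned long last_used; };

/* Pick the frame to replace in one set: an empty frame if available,
   otherwise the least-recently-used one.                             */
int choose_victim(struct frame set[WAYS]) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid) return i;                    /* empty frame: use it       */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                 /* older use -> better victim */
    }
    return victim;
}

int main(void) {
    struct frame set[WAYS] = {
        {1, 0x12, 40}, {1, 0x34, 10}, {1, 0x56, 99}, {1, 0x78, 25}
    };
    printf("replace way %d\n", choose_victim(set));     /* way 1: last used at t=10 */
    return 0;
}
```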

  20. Write Strategies • Most accesses are reads, not writes • Especially if instruction reads are included • Optimize for reads! • Direct mapped can return value before valid check • Writes are more difficult, because: • We can’t write to cache till we know the right block • Object written may have various sizes (1-8 bytes) • When to synchronize cache with memory? • Write through - Write to cache & to memory • Prone to stalls due to high mem. bandwidth requirements • Write back - Write to memory upon replacement • Memory may be left out of date for a long time

  21. Action on Cache Hits vs. Misses • Read hits: • Desirable • Read misses: • stall the CPU, fetch the block from memory, deliver it to the cache, restart • Write hits: • replace data in cache and memory at the same time (write-through strategy) • write the data only into the cache (write-back strategy); it is written to main memory only when it is replaced • Write misses: • No write-allocate: write the data to memory only. • Write-allocate: read the entire block into the cache, then write the word

  22. Cache Hits vs. Cache Misses • Consider the write-through strategy: every block written to the cache is automatically written to memory. • Pro: Simple; memory is always up-to-date with the cache • No write-back required on block replacement. • Con: Creates lots of extra traffic on the memory bus. • Write hit time may be increased if the CPU must wait for the bus. • One solution to the write time problem is to use a write buffer to store the data while it is waiting to be written to memory. • After storing data in the cache and write buffer, the processor can continue execution. • Alternately, a write-back strategy writes data to main memory only when a block is replaced. • Pros: Reduces memory bandwidth used by writes. • Cons: May increase the miss penalty due to the write-back.

  23. Hit/Miss Rate, Hit Time, Miss Penalty • Several important quantities used in cache performance analysis relate to hits and misses: • The hit rate or hit ratio is • the fraction of memory accesses found in the upper level. • The miss rate (= 1 − hit rate) is • the fraction of memory accesses not found in the upper level. • The hit time is • the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss. • The miss penalty is • the time needed to replace a block in the upper level with the corresponding block from the lower level. • may include the time to write back an evicted block.

  24. Cache Performance Analysis • Performance is always a key issue when designing caches. • We consider improving cache performance by: • (1) reducing the miss rate, and • (2) reducing the miss penalty. • For (1), we can reduce the probability that different memory blocks will contend for the same cache location. • For (2), we can add additional levels to the hierarchy, which is called multilevel caching. • We can determine the CPU time as: (CPU time) = (CPU execution clock cycles + Memory-stall clock cycles) × (Clock cycle time)

  25. Cache Performance • The memory-stall clock cycles come from cache misses. • They can be defined as the sum of the stall cycles coming from writes + those coming from reads: Memory-stall clock cycles = Read-stall cycles + Write-stall cycles, where Read-stall cycles = (Reads / Program) × (Read miss rate) × (Read miss penalty) Write-stall cycles = (Writes / Program) × (Write miss rate) × (Write miss penalty) + (Write buffer stalls)
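A small C sketch evaluating the read/write stall breakdown above; all counts, miss rates, and penalties are made-up illustrative values (write-buffer stalls are assumed negligible here):

```c
#include <stdio.h>

/* Illustrative figures only: read/write counts, miss rates, and penalties. */
int main(void) {
    double reads  = 1.0e6, writes = 0.25e6;       /* memory reads / writes in the program */
    double read_miss_rate  = 0.04, write_miss_rate = 0.06;
    double read_penalty    = 40.0, write_penalty   = 40.0;   /* cycles per miss */

    double read_stalls  = reads  * read_miss_rate  * read_penalty;
    double write_stalls = writes * write_miss_rate * write_penalty;
    printf("memory-stall cycles = %.0f\n", read_stalls + write_stalls);
    return 0;
}
```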

  26. Cache Performance Example Assume an instruction cache miss rate for the gcc benchmark of 2% and a data cache miss rate of 4%. If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine would run with a perfect cache that never missed. Use the following instruction mix for gcc: * Note that we use memory for both instructions and data so we can apply cache architecture designs to both memory units.
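The instruction-mix table referenced above is not reproduced here. Assuming, purely for illustration, that loads and stores make up 36% of the instructions, the arithmetic of this example can be sketched in C as follows:

```c
#include <stdio.h>

/* Parameters from the example; the 36% load/store frequency is an assumed
   instruction-mix figure, since the mix table is not reproduced here.     */
int main(void) {
    double cpi_base     = 2.0;    /* CPI with a perfect cache          */
    double i_miss_rate  = 0.02;   /* instruction cache miss rate       */
    double d_miss_rate  = 0.04;   /* data cache miss rate              */
    double ldst_freq    = 0.36;   /* assumed fraction of loads/stores  */
    double miss_penalty = 40.0;   /* cycles per miss                   */

    /* Stall cycles per instruction: every instruction is fetched, and a
       fraction ldst_freq of instructions also accesses data memory.     */
    double stalls_per_inst = i_miss_rate * miss_penalty
                           + ldst_freq * d_miss_rate * miss_penalty;
    double cpi_real = cpi_base + stalls_per_inst;

    printf("CPI with stalls = %.2f, speedup with a perfect cache = %.2f\n",
           cpi_real, cpi_real / cpi_base);
    return 0;
}
```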

  27. Cache Performance • Example (on own) Suppose we increase the performance of the machine in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the machine be with the faster clock, assuming the same miss rate as previous example?

  28. Cache Performance Formulas • Useful formulas for analyzing ISA/cache interactions: • (CPU time) = [(CPU cycles) + (Memory stall cycles)] × (Clock cycle time) • (Memory stall cycles) = (Instruction count) × (Accesses per instruction) × (Miss rate) × (Miss penalty) • But these are not the best measures for cache design by themselves: • They focus on time per program, not per access • But accesses-per-program isn’t up to the cache design • We can limit our attention to individual accesses • They neglect the hit penalty • Cache design may affect the # of cycles taken even by a cache hit • They neglect cycle length • which may be impacted by a poor cache design

  29. A Key Cache Performance Metric: AMAT (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty) • The times T_acc, T_hit, and T_miss can all be either: • Real time (e.g., nanoseconds) • Or, number of clock cycles • In contexts where the cycle time is known to be a constant • Important: • T_miss means the extra (not total) time for a miss • in addition to T_hit, which is incurred by all accesses [Figure: the CPU accesses the cache with the hit time; on a miss, the lower levels of the hierarchy are accessed, incurring the miss penalty]
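A one-line evaluation of the AMAT formula in C, using illustrative numbers (1-cycle hit time, 5% miss rate, 40-cycle miss penalty):

```c
#include <stdio.h>

/* Illustrative numbers: 1-cycle hit, 5% miss rate, 40-cycle miss penalty. */
int main(void) {
    double hit_time = 1.0, miss_rate = 0.05, miss_penalty = 40.0;
    double amat = hit_time + miss_rate * miss_penalty;   /* average memory access time */
    printf("AMAT = %.2f cycles\n", amat);                /* 1 + 0.05*40 = 3.00 cycles  */
    return 0;
}
```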

  30. More Cache Performance Metrics • Can split access time into instructions & data: (Avg. mem. access time) = (% instruction accesses) × (inst. mem. access time) + (% data accesses) × (data mem. access time) • Another simple formula: (CPU time) = (CPU execution clock cycles + Memory stall clock cycles) × (cycle time) • Useful for exploring ISA changes • Can break stalls into reads and writes: (Memory stall cycles) = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty) • where each access time above is itself (hit time) + (miss rate) × (miss penalty)

  31. Factoring out Instruction Count • Gives (lumping together reads & writes): (Memory stall cycles) = (Instruction count) × (Memory accesses per instruction) × (Miss rate) × (Miss penalty) • May replace (Memory accesses per instruction) × (Miss rate) with (Misses per instruction): (Memory stall cycles) = (Instruction count) × (Misses per instruction) × (Miss penalty) • So that miss rates aren’t affected by redundant accesses to the same location within an instruction.

  32. Improving Cache Performance • Consider the cache performance equation: (Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty), where (Miss rate) × (Miss penalty) is the “amortized miss penalty”. • It obviously follows that there are three basic ways to improve cache performance: • A. Reducing miss rate (5.3) • B. Reducing miss penalty (5.4) • C. Reducing hit time (5.5) • Note that by Amdahl’s Law, there will be diminishing returns from reducing only the hit time or the amortized miss penalty by itself, instead of both together.
