
Memory Hierarchy Design



  1. Memory Hierarchy Design • Motivated by a combination of the programmer's desire for unlimited fast memory and economical considerations, and based on: • the principle of locality, and • the cost/performance ratio of memory technologies (fast → small, large → slow), to achieve a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. • Hierarchy: CPU/register file (RF) → cache (C) → main memory (MM) → disk memory/I/O devices (DM) • Speed in descending order: RF > C > MM > DM • Space in ascending order: RF < C < MM < DM

  2. Memory Hierarchy Design • The gaps in speed and size between the different levels continue to widen:

  3. Memory Hierarchy Design • Cache performance review: Memory stall cycles = Number_of_misses * Miss_penalty = IC * Miss_per_instr * Miss_penalty = IC * MAPI * Miss_rate * Miss_penalty where MAPI stands for memory accesses per instruction • Four Fundamental Memory Hierarchy Design Issues: • Block placement issue: where can a block, the atomic memory unit in cache-memory transactions, be placed in the upper level? • Block identification issue: how is a block found if it is in the upper level? • Block replacement issue: which block should be replaced on a miss? • Write strategy issue: what happens on a write?
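To make the stall-cycle formula concrete, here is a minimal C sketch; the input values (10^9 instructions, 1.5 memory accesses per instruction, 2% miss rate, 100-cycle penalty) are illustrative assumptions, not figures from the slides.

    /* Minimal sketch of Memory stall cycles = IC * MAPI * Miss_rate * Miss_penalty. */
    #include <stdio.h>

    int main(void) {
        double ic           = 1e9;    /* instruction count (assumed)      */
        double mapi         = 1.5;    /* memory accesses per instruction  */
        double miss_rate    = 0.02;   /* misses per memory access         */
        double miss_penalty = 100.0;  /* cycles per miss                  */

        double stall_cycles = ic * mapi * miss_rate * miss_penalty;
        printf("memory stall cycles = %.0f\n", stall_cycles);  /* 3e9 here */
        return 0;
    }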

  4. Memory Hierarchy Design • Placement: three approaches: • fully associative: any block from main memory can be placed in any block frame. It is flexible but expensive because of the fully associative search • direct mapping: each block from memory is placed in one fixed block frame given by the mapping function: (Block Address) MOD (Number of blocks in cache) • set associative: a compromise between fully associative and direct mapping; the cache is divided into sets of block frames, and each block from memory is first mapped to a fixed set, within which it can be placed in any block frame. Mapping to a set follows the function, called bit selection: (Block Address) MOD (Number of sets in cache)
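A minimal C sketch of the two mapping functions above, assuming an illustrative geometry of 512 block frames and 2-way set associativity (not numbers from the slides):

    #include <stdint.h>

    enum { NUM_BLOCKS = 512, ASSOC = 2, NUM_SETS = NUM_BLOCKS / ASSOC };

    /* direct mapped: the block can go in exactly one frame */
    uint64_t direct_frame(uint64_t block_addr) { return block_addr % NUM_BLOCKS; }

    /* set associative: bit selection picks one set; within the set the
       block may occupy any of the ASSOC frames */
    uint64_t set_index(uint64_t block_addr)    { return block_addr % NUM_SETS; }

    /* fully associative: no mapping function -- any frame may hold the block */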

  5. Memory Hierarchy Design • Identification: • Each block frame in the cache has an address tag giving the block's address in memory • All possible tags are searched in parallel • A valid bit is attached to the tag to indicate whether the block frame contains valid information • An address A issued by the CPU is divided into a block address field and a block offset field: • block address = A / (block size) • block offset = A MOD (block size) • the block address is further divided into tag and index: • the index selects the set in which the block may reside • the tag is compared against the stored tags to determine a hit or a miss
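A small C sketch of this address decomposition, assuming 64-byte blocks and 256 sets for illustration; the tag returned here is what would be compared, together with the valid bit, against the stored tags of the selected set:

    #include <stdint.h>

    enum { BLOCK_SIZE = 64, NUM_SETS = 256 };   /* illustrative geometry */

    typedef struct { uint64_t tag, index, offset; } AddrFields;

    AddrFields split(uint64_t a) {
        AddrFields f;
        f.offset = a % BLOCK_SIZE;              /* block offset = A mod block size */
        uint64_t block_addr = a / BLOCK_SIZE;   /* block address = A / block size  */
        f.index  = block_addr % NUM_SETS;       /* selects the set to search       */
        f.tag    = block_addr / NUM_SETS;       /* compared, with the valid bit,
                                                   to decide hit or miss           */
        return f;
    }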

  6. Memory Hierarchy Design • Replacement on a cache miss: • The more choices for replacement, the more expensive the hardware; direct mapping is the simplest (only one candidate) • Random vs. least-recently used (LRU): the former gives uniform allocation and is simple to build, while the latter can exploit temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler to build • (figure: data cache misses per 1000 instructions for the different policies)
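A rough sketch of why true LRU costs more hardware than random replacement: every access must update per-way bookkeeping. The structure and names below are illustrative assumptions, not a real design; touch() would also be called after installing a new block.

    #define WAYS 4

    typedef struct {
        unsigned long tag[WAYS];
        int valid[WAYS];
        int age[WAYS];                 /* 0 = most recently used */
    } Set;

    /* on a hit to `way`: ways younger than it age by one, it becomes youngest */
    static void touch(Set *s, int way) {
        for (int w = 0; w < WAYS; w++)
            if (s->valid[w] && s->age[w] < s->age[way])
                s->age[w]++;
        s->age[way] = 0;
    }

    /* on a miss: pick a free frame if any, otherwise the oldest way */
    static int victim(const Set *s) {
        int oldest = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w]) return w;
            if (s->age[w] > s->age[oldest]) oldest = w;
        }
        return oldest;
    }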

  7. Memory Hierarchy Design • Write strategies: • Most cache accesses are reads: with 10% stores and 37% loads per 100 instructions (and every instruction fetch a read), only about 10/147 ≈ 7% of all memory accesses are writes • Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads. Fortunately, reads are easy in direct mapping: the data read and the tag comparison can be done in parallel (what about associative mapping?); writes are harder: • cannot overlap tag checking with writing the block (writing is destructive) • the CPU specifies the write size: only 1 - 8 bytes. Thus write strategies often distinguish cache designs. On a write hit: • write through (or store through): • ensures consistency at the cost of memory and bus bandwidth • write stalls may be alleviated by using write buffers • write back (store in): • minimizes memory and bus traffic at the cost of weakened consistency • uses a dirty bit to indicate modification • read misses may result in writes (why?) • On a write miss: • write allocate (fetch on write) • no-write allocate (write around)
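A simplified C sketch contrasting the two write-hit policies (single block, flat memory array; names are hypothetical). It also shows why a read miss can cause a write under write back: the evicted block must be written out if dirty.

    #include <string.h>

    enum { BLOCK_SIZE = 64 };

    typedef struct {
        unsigned char data[BLOCK_SIZE];
        int dirty;                            /* used only by write back */
    } CacheBlock;

    /* write through: update the block and the backing memory on every write
       (the memory update is often queued in a write buffer) */
    void write_through(CacheBlock *blk, unsigned char *memory_block,
                       int offset, unsigned char value) {
        blk->data[offset] = value;
        memory_block[offset] = value;
    }

    /* write back: update only the block and mark it dirty */
    void write_back(CacheBlock *blk, int offset, unsigned char value) {
        blk->data[offset] = value;
        blk->dirty = 1;
    }

    /* on replacement, a dirty block must be written out first -- this is why
       a read miss may result in a write */
    void evict(CacheBlock *blk, unsigned char *memory_block) {
        if (blk->dirty)
            memcpy(memory_block, blk->data, BLOCK_SIZE);
        blk->dirty = 0;
    }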

  8. Memory Hierarchy Design • An Example: The Alpha 21264 Data Cache • Cache size = 64 KB, block size = 64 B, two-way set associative, write back, write allocate on a write miss. • What is the index size? Number of sets = 64K / (64 * 2) = 2^16 / 2^(6+1) = 2^9 = 512, so the index is 9 bits

  9. Memory Hierarchy Design • Cache Performance: • Memory access time is an indirect measure of performance and it is not a substitute for execution time:
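The formula referred to here is not reproduced in the transcript; presumably it is the standard average memory access time (AMAT) expression and its connection to CPU time, stated below as an assumption:

\[ \text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty} \]
\[ \text{CPU time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory stall cycles}}{\text{instruction}} \right) \times \text{Clock cycle time} \]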

  10. Memory Hierarchy Design • Example 1: How much does cache help in performance?

  11. Memory Hierarchy Design • Example 2: What’s the relationship between AMAT and CPU Time?

  12. Memory Hierarchy Design • Improving Cache Performance • The average memory access time can be improved by reducing any of its three parameters: • R1 → reducing miss rate; • R2 → reducing miss penalty; • R3 → reducing hit time; • Four categories of cache organizations that help reduce these parameters: • Organizations that help reduce miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations; • Organizations that help reduce miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches; • Organizations that help reduce miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, and compiler prefetching; • Organizations that help reduce hit time: small and simple caches, avoiding address translation, pipelined cache access, and trace caches.

  13. Memory Hierarchy Design • Reducing Miss Rate • There are three kinds of cache misses, classified by cause: • Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from main memory. Also called cold-start misses; • Capacity: the cache cannot hold all the blocks needed during execution, so capacity misses occur when blocks are discarded and later retrieved; • Conflict: due to a mapping that confines blocks to a restricted area of the cache (e.g., direct mapping, set associative); also called collision misses or interference misses • While the 3-C characterization gives insight into causes, it is at times too simplistic (and the categories are inter-dependent). For example, it ignores replacement policies.

  14. Memory Hierarchy Design Roles of 3-C

  15. Memory Hierarchy Design Roles of 3-C

  16. Memory Hierarchy Design • First Miss Rate Reduction Technique: Larger Block Size • Takes advantage of spatial locality → reduces compulsory misses • Increases miss penalty (it takes longer to fetch a block) • Increases conflict misses and/or capacity misses (fewer blocks fit in the cache) • Must strike a delicate balance among miss penalty (MP), miss rate (MR), and AMAT in choosing an appropriate block size

  17. Memory Hierarchy Design • First Miss Rate Reduction Technique: Larger Block Size • Example: Find the optimal block size in terms of AMAT, given a miss penalty of 40 cycles of overhead plus 2 cycles per 16 bytes, and the miss rates of the table below. • Solution: AMAT = HT + MR * MP = HT + MR * (40 + block_size * 2 / 16) • High latency and high bandwidth encourage larger block sizes • Low latency and low bandwidth encourage smaller block sizes
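A short C sketch of the calculation, assuming a 1-cycle hit time and placeholder miss rates (the slide's miss-rate table is not reproduced in this transcript):

    #include <stdio.h>

    int main(void) {
        int    block_size[] = {16, 32, 64, 128, 256};
        double miss_rate[]  = {0.085, 0.072, 0.065, 0.063, 0.068}; /* hypothetical */
        double hit_time     = 1.0;                                 /* cycles       */

        for (int i = 0; i < 5; i++) {
            double miss_penalty = 40.0 + 2.0 * block_size[i] / 16.0;
            double amat = hit_time + miss_rate[i] * miss_penalty;
            printf("block %3d B: MP = %5.1f cycles, AMAT = %.3f cycles\n",
                   block_size[i], miss_penalty, amat);
        }
        return 0;
    }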

  18. Memory Hierarchy Design • Second Miss Rate Reduction Technique: Larger Caches • An obvious way to reduce capacity misses. • Drawback: longer hit time and higher cost. • Popular for off-chip caches (2nd- and 3rd-level caches). • Third Miss Rate Reduction Technique: Higher Associativity • Reduces the miss rate (fewer conflict misses) • Rules of Thumb: • 8-way set associative is almost as effective as fully associative; • the miss rate of a 1-way cache of size N is about equal to the miss rate of a 2-way cache of size 0.5N • The higher the associativity, the longer the hit time (why?) • A higher miss rate rewards higher associativity.

  19. Memory Hierarchy Design • Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches • Way prediction selects one likely block among those in a set, thus requiring only one tag comparison (if the prediction hits). • Preserves the advantages of direct mapping (why?); • In case of a misprediction, the other block(s) are checked. • Pseudoassociative (also called column associative) caches • Operate exactly as direct-mapped caches on a hit, thus again preserving the advantages of direct mapping; • In case of a miss, another block is checked (as if in a set-associative cache) by simply inverting the most significant bit of the index field to find the other block in the "pseudoset" (see the sketch below). • real hit time < pseudo-hit time • too many pseudo hits would defeat the purpose
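A tiny sketch of the second-probe index computation described above; the 9-bit index width is an illustrative assumption:

    #include <stdint.h>

    enum { INDEX_BITS = 9 };    /* assumed index width */

    /* on a miss in the primary frame, flip the most significant index bit
       to locate the other frame of the "pseudoset" */
    uint32_t pseudo_index(uint32_t index) {
        return index ^ (1u << (INDEX_BITS - 1));
    }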

  20. Memory Hierarchy Design • Fifth Miss Rate Reduction Technique: Compiler Optimizations

  21. Memory Hierarchy Design • Fifth Miss Rate Reduction Technique: Compiler Optimizations

  22. Memory Hierarchy Design • Fifth Miss Rate Reduction Technique: Compiler Optimizations

  23. Memory Hierarchy Design • Fifth Miss Rate Reduction Technique: Compiler Optimizations • Blocking: improve temporal and spatial locality • multiple arrays are accessed both row-major and column-major, i.e., orthogonal accesses that cannot be helped by the earlier methods • concentrate on submatrices, or blocks • All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and the cache size in this case.

  24. Memory Hierarchy Design • Fifth Miss Rate Reduction Technique: Compiler Optimizations • To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor (see the sketch below). • The total number of memory words accessed becomes 2N^3/B + N^2 • Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
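A C sketch of the blocked (tiled) version of the X = X + Y*Z computation discussed above, in the spirit of the textbook transformation; the values of N and B below are illustrative assumptions:

    enum { N = 512, B = 64 };        /* B chosen so a B*B submatrix fits in cache */

    double X[N][N], Y[N][N], Z[N][N];

    void blocked_matmul(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B && j < N; j++) {
                        double r = 0.0;
                        for (int k = kk; k < kk + B && k < N; k++)
                            r += Y[i][k] * Z[k][j];  /* B*B submatrix of Z stays resident */
                        X[i][j] += r;                /* accumulate the partial product    */
                    }
    }

The two outer loops restrict each pass to a B*B tile, so Z is reused out of the cache (temporal locality) while Y is streamed a row segment at a time (spatial locality).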

  25. Memory Hierarchy Design • First Miss Penalty Reduction Technique: Multilevel Caches • To keep up with the widening gap between CPU and main memory, try to: • make the cache faster, and • make the cache larger, by adding another, larger but slower cache between the first-level cache and main memory.

  26. Memory Hierarchy Design • First Miss Penalty Reduction Technique: Multilevel Caches • Local miss rate vs. global miss rate: • Local miss rate is defined as the number of misses in a cache divided by the number of memory accesses that reach that cache • Global miss rate is defined as the number of misses in the cache divided by the total number of memory accesses generated by the CPU
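Stated as the standard two-level formulas (an assumption, since the slide's figure is not reproduced here): for a second-level cache,

\[ \text{Global miss rate}_{L2} = \text{Miss rate}_{L1} \times \text{Local miss rate}_{L2} \]
\[ \text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \left( \text{Hit time}_{L2} + \text{Local miss rate}_{L2} \times \text{Miss penalty}_{L2} \right) \]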

  27. Memory Hierarchy Design • Second Miss Penalty Reduction Technique: Critical Word First and Early Restart • The CPU needs just one word of the block at a time: • critical word first: fetch the required word first, and • early restart: as soon as the required word arrives, send it to the CPU. • Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses • Serve reads before writes have completed: • while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory; • instead of waiting for the write buffer to empty before processing a read miss, the write buffer is checked for contents that might satisfy the missing read (see the sketch below); • in a write-back scheme, the dirty copy being replaced is first written to a write buffer instead of to memory, thus improving performance.
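A rough C sketch of the write-buffer check on a read miss; the buffer structure and names are assumptions for illustration, not a real design:

    #include <stdbool.h>
    #include <stdint.h>

    enum { WB_ENTRIES = 4 };

    typedef struct {
        uint64_t block_addr[WB_ENTRIES];
        bool     valid[WB_ENTRIES];
    } WriteBuffer;

    /* true if the read miss conflicts with a pending buffered write;
       if false, the read can be serviced before the writes drain */
    bool read_miss_conflicts(const WriteBuffer *wb, uint64_t miss_block_addr) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb->valid[i] && wb->block_addr[i] == miss_block_addr)
                return true;    /* must forward from or drain this entry first */
        return false;
    }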

  28. Memory Hierarchy Design • Fourth Miss Penalty Reduction Technique: Merging Write Buffer • Improves the efficiency of write buffers, which are used by both write-through and write-back caches: • multiple single-word writes to the same block are combined into one write-buffer entry that would otherwise hold a multi-word write • reduces stalls due to the write buffer being full

  29. Memory Hierarchy Design • Fifth Miss Penalty Reduction Technique: Victim Cache • Victim caches attempt to reduce the miss penalty by: • adding a small fully associative cache that holds recently discarded blocks (victims) and is checked on a miss • Proven to be effective, especially for small direct-mapped caches; e.g., a 4-entry victim cache can remove 20% or more of the conflict misses!

  30. Memory Hierarchy Design • Reducing Cache Miss Penalty or Miss Rate via Parallelism • Nonblocking Caches (lock-free caches): allow the cache to keep supplying hits (and to overlap further misses) while a miss is outstanding, i.e., "hit under miss" and "miss under miss" • Hardware Prefetching of Instructions and Data: on a miss, fetch the requested block and also prefetch the next sequential block(s) into the cache or into an external stream buffer

  31. Memory Hierarchy Design • Reducing Cache Miss Penalty or Miss Rate via Parallelism • Compiler-Controlled Prefetching: compiler inserts prefetch instructions

  32. Memory Hierarchy Design • Reducing Cache Miss Penalty or Miss Rate via Parallelism • Compiler-Controlled Prefetching: An Example
for (i := 0; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1)
        a[i][j] := b[j][0] * b[j+1][0]
• 16-byte blocks, 8 KB cache, 1-way, write back, 8-byte elements; what kind of locality, if any, exists for a and b? • a has 3 rows and 100 columns; spatial locality: even-indexed elements miss and odd-indexed elements hit (two 8-byte elements per 16-byte block), leading to 3*100/2 = 150 misses • b has 101 rows and 3 columns; no spatial locality, but there is temporal locality: the same element is used in the jth and (j+1)st iterations, and the same elements are reused for every value of i. For i = 0, b[j+1][0] misses for each j (100 misses) plus one miss for b[0][0] at j = 0, for a total of 101 misses • Assuming a large miss penalty (50 cycles), prefetches must be issued at least 7 iterations ahead. Splitting the loop into two, we have

  33. Memory Hierarchy Design • Reducing Cache Miss Penalty or Miss Rate via Parallelism • Compiler-Controlled Prefetching: An Example (continued)
for (j := 0; j < 100; j := j+1) {
    prefetch(b[j+7][0]);        /* prefetch 7 iterations ahead */
    prefetch(a[0][j+7]);
    a[0][j] := b[j][0] * b[j+1][0];
}
for (i := 1; i < 3; i := i+1)
    for (j := 0; j < 100; j := j+1) {
        prefetch(a[i][j+7]);
        a[i][j] := b[j][0] * b[j+1][0];
    }
• Assuming that each iteration of the pre-split loop takes 7 cycles and that there are no conflict or capacity misses, the original loop takes a total of 7*300 + 251*50 = 14650 cycles (iteration cycles plus cache-miss cycles), whereas the split loops take a total of (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3450 cycles

  34. Memory Hierarchy Design • Reducing Cache Miss Penalty or Miss Rate via Parallelism • Compiler-Controlled Prefetching: An Example (continued) • the first loop takes 9 cycles per iteration (because of the two prefetch instructions) • the second loop takes 8 cycles per iteration (because of the single prefetch instruction) • during the first 7 iterations of the first loop, array a incurs 4 cache misses • array b incurs 7 cache misses • during the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each • array b does not incur any cache misses in the second loop!

  35. Memory Hierarchy Design • First Hit Time Reduction Technique: Small and Simple Caches • smaller is faster: • a small index → less address-decoding time • a small cache can fit on the same chip as the processor • low associativity: in addition to a simpler/shorter tag check, a 1-way (direct-mapped) cache allows overlapping the tag check with the transmission of data, which is not possible with any higher associativity! • Second Hit Time Reduction Technique: Avoid Address Translation during Indexing • Make the common case fast: • use the virtual address for the cache, because most memory accesses (more than 90%) hit in the cache, resulting in a virtual cache

  36. Memory Hierarchy Design • Second Hit Time Reduction Technique: Avoid Address Translation during Indexing • Make the common case fast: • there are at least three important performance aspects that relate directly to virtual-to-physical translation: • improperly organized or insufficiently sized TLBs may create excess TLB misses, adding to program execution time • for a physically addressed cache, the TLB access must occur before the cache access, extending the cache access time • two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they draw pages whose lower page-address bits (and upper cache-address bits) are identical • problems with virtual caches: • page-level protection must still be enforced no matter what during address translation (solution: copy the protection information from the TLB on a miss and hold it in a field for future virtual indexing/tagging) • when a process is switched in/out, the entire cache has to be flushed because the same virtual addresses will map to different physical addresses each time, i.e., the problem of context switching (solution: a process-identifier tag, PID)

  37. Memory Hierarchy Design • Second Hit Time Reduction Technique: Avoid Address Translation during Indexing • problems with virtual caches (continued): • different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases • HW solution: guarantee every cache block a unique physical address • SW solution: force aliases to share some address bits (e.g., page coloring) • compromise: virtually indexed, physically tagged caches • Third Hit Time Reduction Technique: Pipelined Cache Access • reduce the clock cycle time and increase the number of pipeline stages – increases instruction throughput • Fourth Hit Time Reduction Technique: Trace Caches • Find a dynamic sequence of instructions, including taken branches, to load into a cache block: • put traces of the executed instructions into cache blocks as determined by the CPU • branch prediction is folded into the cache and must be validated along with the addresses for a fetch to be valid • disadvantage: the same instructions may be stored multiple times

  38. Memory Hierarchy Design • Main Memory and Organizations for Improving Performance

  39. Memory Hierarchy Design • Main Memory and Organizations for Improving Performance • Wider main memory bus • cache miss penalty decreases proportionally • cost: • wider bus (x n) and multiplexer (x n), • expandability (x n), and • error correction is more expensive • Simple interleaved memory • potential parallelism from multiple DRAM banks • send the address to and access multiple banks in parallel, but transmit the data sequentially (4 + 24 + 4x4 = 44 cycles → 16/44 ≈ 0.4 bytes/cycle) • Independent memory banks

  40. Memory Hierarchy Design • Main Memory and Organizations for Improving Performance

  41. Memory Hierarchy Design • Main Memory and Organizations for Improving Performance

  42. Memory Hierarchy Design • Virtual Memory

  43. Memory Hierarchy Design • Virtual Memory

  44. Memory Hierarchy Design • Virtual Memory

  45. Memory Hierarchy Design • Virtual Memory • Fast address translation: an example – the Alpha 21264 data TLB • the ASN is used as a PID for virtual caches; the TLB is not flushed on a context switch but only when ASNs are recycled • fully associative placement

  46. Memory Hierarchy Design • Virtual Memory • What is the optimal page size? – It depends: • page table size ∝ 1/(page size): larger pages mean a smaller page table • a large page size makes a virtual cache of that size possible (avoiding the aliasing problem), thus reducing cache hit time • transfer of larger pages (over the network or from disk) is more efficient • a small TLB favors larger pages (fewer entries map more memory) • main drawbacks of a large page size: • internal fragmentation: wasted storage • process startup time: large context-switching overhead

  47. Memory Hierarchy Design • Summarizing Virtual Memory & Caches • A hypothetical memory hierarchy going from virtual address to L2 cache access:

  48. Memory Hierarchy Design • Protection and Examples of Virtual Memory • The invention of multiprogramming led to the need to share computer resources such as CPU, memory, and I/O among multiple programs, whose instantiations are called "processes"; • Time-sharing of computer resources by multiple processes requires that processes take turns using the resources, and the designers of operating systems and computers must ensure that switching among different processes, also called "context switching", is done correctly: • the computer designer must ensure that the CPU portion of the process state can be saved and restored; • the operating-system designer must guarantee that processes do not interfere with each other's computations • Protecting processes: • Base and Bound – each process falls in a pre-defined portion of the address space; that is, an address is valid if Base ≤ Address ≤ Bound, where the OS defines and keeps the values of Base and Bound in two registers (see the sketch below).
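A minimal sketch of the base-and-bound check, assuming the OS has already loaded the two privileged registers; any access outside the range raises a protection fault:

    #include <stdbool.h>
    #include <stdint.h>

    /* hardware check applied to every user-mode address */
    bool access_is_valid(uint64_t addr, uint64_t base, uint64_t bound) {
        return base <= addr && addr <= bound;   /* otherwise: protection fault */
    }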

  49. Memory Hierarchy Design • Protection and Examples of Virtual Memory • The computer designer's responsibilities in helping the OS designer protect processes from each other: • provide two modes to distinguish a user process from a kernel process (equivalently, a supervisor or executive process); • provide a portion of the CPU state, including the base/bound registers, the user/kernel mode bit(s), and the exception enable/disable bit, that a user process can use but cannot write; and • provide mechanisms by which the CPU can switch between user mode and supervisor mode. • While base-and-bound constitutes the minimum protection system, virtual memory offers a more fine-grained alternative to this simple model: • address translation provides an opportunity to check for possible violations – the read/write and user/kernel signals from the CPU are checked against the permission flags that the virtual memory system (or OS) marks on individual pages, to detect stray memory accesses; • depending on the designer's concerns, protection can be either relaxed or escalated; with escalated protection, multiple levels of access permission can be used, much like the military classification system.

  50. Memory Hierarchy Design • Protection and Examples of Virtual Memory • A Paged Virtual Memory Example – the Alpha Memory Management and the 21264 TLBs (one for instructions and one for data) • A combination of segmentation and paging, with 48-bit virtual addresses; the 64-bit address space is divided into three segments: seg0 (bits 63-47 = 0...0), kseg (bits 63-46 = 0...10), and seg1 (bits 63-46 = 1...11) • Advantages: segmentation divides the address space and conserves page-table space, while paging provides virtual memory, relocation, and protection • Even with segmentation, the size of the page tables for the 64-bit address space is alarming, so Alpha uses a three-level hierarchical page table, with each page table contained in one page (a sketch of such a walk follows):
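A sketch of a three-level page-table walk of the kind described above. The field widths (three 10-bit levels plus a 13-bit offset for 8 KB pages) are plausible Alpha-style values but are assumptions here, and real hardware stores physical frame numbers rather than the pointers used in this simplification:

    #include <stdint.h>

    #define L1_BITS     10
    #define L2_BITS     10
    #define L3_BITS     10
    #define OFFSET_BITS 13                        /* 8 KB pages (assumed) */

    typedef struct PTE {
        struct PTE *next;    /* next-level table (a PFN in real hardware) */
        uint64_t    pfn;     /* page frame number, used at the last level */
        int         valid;
    } PTE;

    uint64_t translate(const PTE *l1_table, uint64_t va) {
        uint64_t off = va & ((1ull << OFFSET_BITS) - 1);
        uint64_t i3  = (va >> OFFSET_BITS) & ((1u << L3_BITS) - 1);
        uint64_t i2  = (va >> (OFFSET_BITS + L3_BITS)) & ((1u << L2_BITS) - 1);
        uint64_t i1  = (va >> (OFFSET_BITS + L3_BITS + L2_BITS)) & ((1u << L1_BITS) - 1);

        const PTE *e1 = &l1_table[i1];
        if (!e1->valid) return (uint64_t)-1;      /* page fault */
        const PTE *e2 = &e1->next[i2];
        if (!e2->valid) return (uint64_t)-1;      /* page fault */
        const PTE *e3 = &e2->next[i3];
        if (!e3->valid) return (uint64_t)-1;      /* page fault */
        return (e3->pfn << OFFSET_BITS) | off;    /* physical address */
    }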
