
Caches and Virtual Memory



  1. ENGS 116 Lecture 13 — Caches and Virtual Memory
Vincent H. Berk, October 31st, 2008
Reading for Today: Sections C.1 – C.3 (Jouppi article)
Reading for Monday: Sections C.4 – C.7
Reading for Wednesday: Sections 5.1 – 5.3

  2. ENGS 116 Lecture 12 — Improving Cache Performance
• Average memory-access time (AMAT) = Hit time + Miss rate × Miss penalty (ns or clocks)
• Improve performance by:
1. Reducing the miss rate
2. Reducing the miss penalty
3. Reducing the time to hit in the cache
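
As a quick illustration of the formula above, here is a minimal Python sketch; the hit time, miss rate, and miss penalty are assumed example values, not numbers from the lecture.

```python
# Minimal sketch of the AMAT formula from the slide above.
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = Hit time + Miss rate * Miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed example: 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6.0
print(amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0))
```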

  3. ENGS 116 Lecture 12 — Reducing Miss Rate
• Larger blocks
• Larger cache
• Higher associativity

  4. ENGS 116 Lecture 12 — Classifying Misses: 3 Cs
• Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses even in an infinite cache)
• Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
• Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set-associative cache of size X)
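
The 3Cs taxonomy can be made concrete with a small trace-driven sketch: a miss is compulsory if even an infinite cache would miss, capacity if a fully associative LRU cache of the same total size would also miss, and conflict otherwise. The trace, block numbering, and LRU policy below are illustrative assumptions.

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    seen = set()                 # "infinite cache": first touch = compulsory
    fully_assoc = OrderedDict()  # fully associative LRU cache of size X
    direct = {}                  # direct-mapped cache: set index -> block
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        if block not in seen:             # misses even in an infinite cache
            counts["compulsory"] += 1
            seen.add(block)
        elif block not in fully_assoc:    # misses even when fully associative
            counts["capacity"] += 1
        elif direct.get(block % num_blocks) != block:
            counts["conflict"] += 1       # only the placement policy hurt us
        # Update the fully associative LRU state.
        if block in fully_assoc:
            fully_assoc.move_to_end(block)
        else:
            fully_assoc[block] = True
            if len(fully_assoc) > num_blocks:
                fully_assoc.popitem(last=False)   # evict LRU block
        direct[block % num_blocks] = block        # update direct-mapped state
    return counts

# Blocks 0 and 8 collide in a 4-block direct-mapped cache: their
# re-references show up as conflict misses, not capacity misses.
print(classify_misses([0, 8, 0, 8, 1, 2, 3, 0], num_blocks=4))
```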

  5. ENGS 116 Lecture 12 — 3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type vs. cache size (1–128 KB) for 1-, 2-, 4-, and 8-way associativity, decomposed into Conflict, Capacity, and Compulsory components. Compulsory misses are vanishingly small.]

  6. ENGS 116 Lecture 12 — 2:1 Cache Rule
[Figure: the same 3Cs miss-rate decomposition vs. cache size (1–128 KB) for 1-, 2-, 4-, and 8-way associativity.]
2:1 Cache Rule: miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2

  7. ENGS 116 Lecture 12 — 3Cs Relative Miss Rate
[Figure: miss rate per type as a percentage (0–100%) vs. cache size (1–128 KB) for 1-, 2-, 4-, and 8-way associativity.]
Good: insight. Flaws: holds only for a fixed block size.

  8. ENGS 116 Lecture 12 — How Can We Reduce Misses?
• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume total cache size is not changed
• What happens if we:
1) Change block size: which of the 3Cs is obviously affected?
2) Change associativity: which of the 3Cs is obviously affected?
3) Change compiler: which of the 3Cs is obviously affected?

  9. ENGS 116 Lecture 12 1. Reduce Misses via Larger Block Size

  10. ENGS 116 Lecture 12 — 2. Reduce Misses: Larger Cache Size
• Obvious improvement, but:
• Longer hit time
• Higher cost
• Each cache size favors a block size, based on memory bandwidth
AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

  11. ENGS 116 Lecture 12 — 3. Reduce Misses via Higher Associativity
• 2:1 Cache Rule: Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
• Beware: execution time is the final measure!
• Will clock cycle time increase?
• 8-way performs almost as well as fully associative

  12. ENGS 116 Lecture 12 — Example: Avg. Memory Access Time vs. Miss Rate
• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the CCT of a direct-mapped cache

Cache Size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

(In the original slide, red entries marked cases where A.M.A.T. is not improved by more associativity.)
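
A hedged sketch of the trade-off this table captures: higher associativity lowers the miss rate but stretches the clock cycle time, so the hit-time term grows. The CCT factors come from the slide; the miss rates are illustrative placeholders, not the SPEC92 data behind the table.

```python
# Relative clock cycle time per associativity, from the slide above.
CCT = {1: 1.00, 2: 1.10, 4: 1.12, 8: 1.14}

def amat_cycles(assoc, hit_cycles, miss_rate, miss_penalty_cycles):
    # The hit time scales with the slower clock of a more associative cache;
    # the miss penalty is taken as fixed here (an assumption).
    return CCT[assoc] * hit_cycles + miss_rate * miss_penalty_cycles

# Placeholder miss rates: associativity helps, but with diminishing returns.
for assoc, miss_rate in [(1, 0.050), (2, 0.041), (4, 0.038), (8, 0.037)]:
    print(f"{assoc}-way:", round(amat_cycles(assoc, 1.0, miss_rate, 25.0), 3))
```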

  13. ENGS 116 Lecture 12 — Reducing Miss Penalty
• Multilevel caches
• Read priority over write
AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

  14. ENGS 116 Lecture 12 — 1. Reduce Miss Penalty: L2 Caches
• L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
• Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
• Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
• Global miss rate is what matters — it indicates what fraction of memory accesses from the CPU go all the way to main memory
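
The L2 equations translate directly into code; the numbers below are assumed for illustration. Note how a local L2 miss rate of 50% looks alarming while the global rate (4% × 50% = 2%) is what actually reaches main memory.

```python
# Sketch of the L2 equations above; all rates and penalties are assumed.
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2,
                   miss_penalty_l2):
    # Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 * Miss Penalty_L2
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# 1 + 0.04 * (10 + 0.5 * 200) = 5.4 cycles; global L2 miss rate = 0.04 * 0.5
print(amat_two_level(1.0, 0.04, 10.0, 0.5, 200.0))
```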

  15. ENGS 116 Lecture 12 — Comparing Local and Global Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache
• Global miss rate is close to the single-level cache rate, provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Cost & A.M.A.T.: generally fast hit times and fewer misses
• Since L2 hits are few, target miss reduction

  16. ENGS 116 Lecture 12 — L2 Cache Block Size & A.M.A.T.
• 32 KB L1, 8-byte path to memory

  17. ENGS 116 Lecture 12 — 2. Reduce Miss Penalty: Read Priority over Write on Miss
• Write-through caches with write buffers create RAW conflicts between main-memory reads on cache misses and buffered writes
• If we simply wait for the write buffer to empty, the read miss penalty may increase (by 50% on the old MIPS 1000)
• Instead, check write buffer contents before the read; if there are no conflicts, let the memory access continue
• Write back?
• Read miss replacing a dirty block
• Normal: write the dirty block to memory, then do the read
• Instead: copy the dirty block to a write buffer, then do the read, then do the write
• CPU stalls less frequently since it restarts as soon as the read is finished
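
A minimal sketch of the "check write buffer contents before read" idea: on a read miss, a pending buffered write to the same address is forwarded; otherwise the read proceeds ahead of the buffered writes. The class structure and names are assumptions, not a description of any real machine's buffer.

```python
from collections import deque

class WriteBuffer:
    def __init__(self):
        self.pending = deque()        # (address, data) writes not yet drained

    def enqueue(self, addr, data):
        self.pending.append((addr, data))

    def read_miss(self, addr, memory):
        # RAW check: a buffered write to the same address must win.
        for a, d in reversed(self.pending):   # newest write takes priority
            if a == addr:
                return d
        return memory.get(addr, 0)    # no conflict: read bypasses the writes

memory = {0x40: 7}
wb = WriteBuffer()
wb.enqueue(0x40, 99)
print(wb.read_miss(0x40, memory))     # 99: forwarded from the write buffer
print(wb.read_miss(0x80, memory))     # 0: read proceeds past pending writes
```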

  18. ENGS 116 Lecture 12 — Reducing Hit Time
• Avoiding address translation in the index
AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

  19. ENGS 116 Lecture 13 — 1. Fast Hits by Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or Virtual Cache, vs. a Physical Cache
• Every time the process is switched, the cache logically must be flushed; otherwise we get false hits. Cost is the time to flush plus “compulsory” misses from the empty cache
• Must handle aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
• Solution to aliases: HW guarantees each block a unique physical address, OR page coloring is used to ensure virtual and physical addresses match in the last x bits
• Solution to cache flush: add a process-identifier tag that identifies the process as well as the address within the process: cannot get a hit if the wrong process
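
A sketch of the process-identifier fix, under assumed field widths: the cache tag comparison includes an address-space identifier (ASID), so a context switch changes the current ASID instead of flushing the cache.

```python
# Direct-mapped virtual cache with ASID-extended tags; widths are assumed.
cache = {}   # index -> (asid, tag, data)

def vcache_lookup(vaddr, current_asid, index_bits=8, offset_bits=5):
    index = (vaddr >> offset_bits) & ((1 << index_bits) - 1)
    tag = vaddr >> (offset_bits + index_bits)
    entry = cache.get(index)
    if entry and entry[0] == current_asid and entry[1] == tag:
        return entry[2]        # hit: right process AND right address
    return None                # miss (a matching tag from the wrong
                               # process still misses, so no false hits)

cache[0x12] = (7, 0xABC, "data")
print(vcache_lookup((0xABC << 13) | (0x12 << 5), current_asid=7))  # "data"
print(vcache_lookup((0xABC << 13) | (0x12 << 5), current_asid=9))  # None
```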

  20. ENGS 116 Lecture 13 — Virtually Addressed Caches
[Figure: three organizations.
Conventional organization: CPU → TB (translation buffer) → cache → MEM; translate on every access.
Virtually addressed cache: CPU → cache (VA tags) → TB → MEM; translate only on miss; synonym problem.
Overlapped organization: cache access overlapped with VA translation, with a physically addressed L2; requires the cache index to remain invariant across translation.]

  21. ENGS 116 Lecture 13 — 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
[Address layout: Page Address <31:12> | Page Offset <11:0>, overlaid with Address Tag | Index | Block Offset]
• If the index is in the physical part of the address, tag access can start in parallel with translation, so the comparison is against the physical tag
• This limits the cache to the page size: what if we want bigger caches while using the same trick?
• Higher associativity moves the barrier to the right
• Page coloring (the OS, in software, requires that all aliases share their lower address bits; this leads to set-associative pages!)
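
The page-size limit can be checked arithmetically: the index and block-offset bits together must fit within the page offset. A sketch, with assumed cache parameters:

```python
import math

def index_fits_in_page(cache_bytes, block_bytes, assoc, page_bytes):
    # Bits below the tag must be untranslated (i.e., within the page offset).
    sets = cache_bytes // (block_bytes * assoc)
    index_plus_offset = int(math.log2(sets)) + int(math.log2(block_bytes))
    return index_plus_offset <= int(math.log2(page_bytes))

# 8 KB direct-mapped cache, 32 B blocks, 4 KB pages: 13 bits > 12 -> fails
print(index_fits_in_page(8 * 1024, 32, 1, 4 * 1024))    # False
# Doubling associativity moves the barrier: a 2-way 8 KB cache fits
print(index_fits_in_page(8 * 1024, 32, 2, 4 * 1024))    # True
```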

  22. ENGS 116 Lecture 14 — Virtual Memory
• Virtual address (2^32, 2^64) to physical address (2^28) mapping
• Virtual memory in terms of a cache: What is the cache block? What is a cache miss?
• How is virtual memory different from caches?
• What controls replacement
• Size (transfer unit, mapping mechanisms)
• Lower-level use

  23. ENGS 116 Lecture 14
Figure 5.36 — The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D. [Figure: virtual address space 0–28K on the left; the four pages map to non-contiguous page frames in physical main memory, with the remaining pages residing on disk.]

  24. ENGS 116 Lecture 14 Figure 5.37 Typical ranges of parameters for caches and virtual memory.

  25. ENGS 116 Lecture 14 — Virtual Memory
• 4 Questions for Virtual Memory (VM):
• Q1: Where can a block be placed in the upper level? Fully associative, set associative, or direct mapped?
• Q2: How is a block found if it is in the upper level?
• Q3: Which block should be replaced on a miss? Random or LRU?
• Q4: What happens on a write? Write back or write through?
• Other issues: size; pages or segments or hybrid

  26. ENGS 116 Lecture 14
Figure 5.40 — The mapping of a virtual address to a physical address via a page table. [Figure: virtual page number + page offset → page table lookup → physical address in main memory.]
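
A minimal sketch of the translation in Figure 5.40, assuming 8 KB pages (13 offset bits, as in the Alpha slides that follow) and a toy page table:

```python
PAGE_OFFSET_BITS = 13                  # 8 KB pages, an assumption
page_table = {0x5: 0xA3}               # VPN -> physical frame (illustrative)

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS            # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    frame = page_table[vpn]                    # KeyError here = page fault
    return (frame << PAGE_OFFSET_BITS) | offset

print(hex(translate(0xA123)))                  # -> 0x146123
```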

  27. ENGS 116 Lecture 14 — Fast Translation: Translation Buffer (TLB)
[Figure: TLB entry with page-frame address <30>, V/R/W bits <1>/<2>/<2>, tag <21>, and physical page number <21>; a 32:1 MUX selects the matching entry; the high-order 21 bits (physical page number) are concatenated with the low-order 13 bits of the page offset to form a 34-bit physical address.]
• Cache of translated addresses
• Data portion usually includes physical page frame number, protection field, valid bit, use bit, and dirty bit
• Alpha 21064 data TLB: 32-entry, fully associative
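
A hedged sketch of a fully associative TLB in front of the page table, in the spirit of the 21064's 32-entry data TLB; the FIFO replacement policy here is an assumption, since the slide does not specify one.

```python
from collections import OrderedDict

class TLB:
    def __init__(self, entries=32):
        self.entries = entries
        self.map = OrderedDict()           # VPN -> physical frame number

    def lookup(self, vpn, page_table):
        if vpn in self.map:                # TLB hit: no page-table walk
            return self.map[vpn]
        frame = page_table[vpn]            # miss: walk the table, then fill
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)   # evict oldest entry (FIFO)
        self.map[vpn] = frame
        return frame

tlb = TLB()
print(tlb.lookup(0x5, {0x5: 0xA3}))        # miss: walks table, fills TLB
print(tlb.lookup(0x5, {}))                 # hit: page table not consulted
```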

  28. ENGS 116 Lecture 14 — Selecting a Page Size
• Reasons for a larger page size:
• Page table size is inversely proportional to the page size; memory is therefore saved (see the sketch after this list)
• Fast cache hit time is easy when cache ≤ page size (VA caches); a bigger page makes this feasible as the cache grows in size
• Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
• The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size:
• Fragmentation: don't waste storage; data must be contiguous within a page
• Quicker process start-up for small processes
• Hybrid solution: multiple page sizes
• Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)
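
The page-table-size argument from the first bullet is simple arithmetic. A back-of-envelope sketch, assuming a flat single-level table with 8-byte PTEs (the Alpha PTE size from the next slide) over a 43-bit virtual address space:

```python
def flat_page_table_bytes(va_bits, page_bytes, pte_bytes=8):
    num_pages = 2 ** va_bits // page_bytes
    return num_pages * pte_bytes

# Doubling the page size halves the table; 8x larger pages -> 8x smaller
# table. The absurd flat-table sizes also show why the Alpha uses a
# multi-level page table (next slide).
for page in (8 * 1024, 64 * 1024):
    print(f"{page // 1024} KB pages:",
          flat_page_table_bytes(43, page) // (1024 * 1024), "MB of PTEs")
```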

  29. ENGS 116 Lecture 14 — Alpha VM Mapping
[Figure: the page table base register points to the L1 page table in main memory; each level's page table entry points to the next level's one-page table; the L3 entry supplies the physical page-frame number, which is concatenated with the page offset to form the physical address.]
• “64-bit” address divided into 3 segments:
• seg0 (bit 63 = 0): user code/heap
• seg1 (bit 63 = 1, bit 62 = 1): user stack
• kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
• Virtual address fields: seg0/seg1 selector (000…0 or 111…1), level1 (10 bits), level2 (10 bits), level3 (10 bits), page offset (13 bits)
• Three-level page table, each level occupying one page; PTEs are 8 bytes with 32-bit address fields
• Alpha uses only 43 bits of VA
• (future minimum page size up to 64 KB gives 55 bits of VA)
• PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)
• What do you do?
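
A sketch of the three-level walk, with 10-bit indices at each level and a 13-bit page offset as on the slide; the nested dictionaries stand in for the one-page tables of 8-byte PTEs, and their contents are illustrative.

```python
def walk(l1_table, vaddr):
    i1 = (vaddr >> 33) & 0x3FF        # level-1 index (10 bits)
    i2 = (vaddr >> 23) & 0x3FF        # level-2 index (10 bits)
    i3 = (vaddr >> 13) & 0x3FF        # level-3 index (10 bits)
    offset = vaddr & 0x1FFF           # 13-bit page offset (8 KB pages)
    frame = l1_table[i1][i2][i3]      # each level is one page of PTEs
    return (frame << 13) | offset     # concatenate frame and offset

tables = {1: {2: {3: 0xA3}}}          # toy page tables (assumed contents)
va = (1 << 33) | (2 << 23) | (3 << 13) | 0x123
print(hex(walk(tables, va)))          # 0xA3 << 13 | 0x123 = 0x146123
```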

  30. Memory Hierarchy

  31. ENGS 116 Lecture 14 — Protection
• Prevent separate processes from accessing each other's memory
• A violation causes a segmentation fault: SIGSEGV
• Useful for multitasking systems
• An operating system issue
• Each process has its own state:
• Page tables
• Heap, text, stack pages
• Registers, PC
• To prevent processes from modifying their own page tables:
• Rings of protection, kernel vs. user
• To prevent processes from modifying other processes' memory:
• Page tables point to distinct physical pages

  32. ENGS 116 Lecture 14 — Protection 2
• Each page needs:
• A PID bit
• Read/Write/Execute bits
• Each process needs:
• Stack frame page(s)
• Text or code pages
• Data or heap pages
• State bookkeeping:
• PC and other CPU status registers
• State of all registers

  33. ENGS 116 Lecture 14 — Alpha 21064
• Separate instruction & data TLBs & caches
• TLBs fully associative
• TLB updates in SW (“Private Arch Lib”)
• Caches 8 KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction-stream buffer
• 2 MB L2 cache, direct mapped, write back (off-chip)
• 256-bit path to main memory, 4 × 64-bit modules
• Victim buffer: to give read priority over write
• 4-entry write buffer between D$ & L2$
[Figure: datapath showing the instruction and data caches with the write buffer, stream buffer, and victim buffer.]
