
ECE200 – Computer Organization


Presentation Transcript


  1. ECE200 – Computer Organization Chapter 7 – Large and Fast: Exploiting Memory Hierarchy

  2. Homework 7 • 7.2, 7.3, 7.5, 7.6, 7.7, 7.11, 7.15, 7.20, 7.21, 7.27, 7.32

  3. Outline for Chapter 7 lectures • Motivation for, and concept of, memory hierarchies • Caches • Main memory • Characterizing memory hierarchy performance • Virtual memory • Real memory hierarchies

  4. The memory dilemma • Ch 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle • Reality check • In high performance machines, programs may require hundreds of megabytes or even gigabytes of memory to run • Embedded processors have smaller needs but there is also less room for on-chip memory • Basic problem • We need much more memory than can fit on the microprocessor chip • But we do not want to incur stall cycles every time the pipeline accesses instructions or data • At the same time, we need the memory to be economical for the machine to be competitive in the market

  5. Solution: a hierarchy of memories

  6. Another view

  7. Typical characteristics of each level • First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside • 16-64KB for each cache (desktop/server machine) • Fast, power-hungry, not-so-dense, static RAM (SRAM) • Second level (L2) consists of another larger unified cache • Holds both instructions and data • 256KB-4MB • On or off-chip • SRAM • Third level is main memory • 64-512MB • Slower, lower-power, denser dynamic RAM (DRAM) • Final level is I/O (e.g., disk)

  8. Caches and the pipeline • L1 instruction and data caches and L2 cache [figure: pipeline with separate L1 instruction and data caches, an L2 cache, and a connection to main memory]

  9. Memory hierarchy operation (1) Search L1 for the instruction or data If found (cache hit), done (2) Else (cache miss), search L2 cache If found, place it in L1 and repeat (1) (3) Else, search main memory If found, place it in L2 and repeat (2) (4) Else, get it from I/O (Chapter 8) Steps (1)-(3) are performed in hardware • 1-3 cycles to get from L1 caches • 5-20 cycles to get from L2 cache • 50-200 cycles to get from main memory [figure: L1, L2, and main memory levels]
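
A minimal C sketch of this search order; the lookup/fill helpers are hypothetical stand-ins for the cache hardware, and the latencies are midpoints of the cycle counts above:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helpers standing in for the cache hardware; each lookup
     * returns true on a hit and charges that level's latency. */
    static int cycles = 0;
    static bool lookup_l1(unsigned addr) { (void)addr; cycles += 2;   return false; }
    static bool lookup_l2(unsigned addr) { (void)addr; cycles += 10;  return false; }
    static void read_main_memory(unsigned addr) { (void)addr; cycles += 100; }
    static void fill_l2(unsigned addr) { (void)addr; }
    static void fill_l1(unsigned addr) { (void)addr; }

    /* Steps (1)-(3): search L1, then L2, then main memory, filling the
     * smaller levels on the way back up. */
    static void access_memory(unsigned addr) {
        if (lookup_l1(addr)) return;       /* (1) L1 hit: done                */
        if (!lookup_l2(addr)) {            /* (2) L1 miss: search L2          */
            read_main_memory(addr);        /* (3) L2 miss: go to main memory  */
            fill_l2(addr);                 /*     place the block in L2       */
        }
        fill_l1(addr);                     /*     place the block in L1       */
    }

    int main(void) {
        access_memory(0x1000);             /* misses at every level in this stub */
        printf("total latency: %d cycles\n", cycles);   /* 112 cycles */
        return 0;
    }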

  10. Principle of locality • Programs access a small portion of memory within a short time period • Temporal locality: recently accessed memory locations will likely be accessed soon • Spatial locality: memory locations near recently accessed locations will likely be accessed soon

  11. POL makes memory hierarchies work • A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory • Cheap, abundant main memory is accessed more rarely • ⇒ Memory hierarchy operates at nearly the speed of expensive on-chip SRAM with about the cost of main memory (DRAMs)
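
A back-of-the-envelope check of that claim, using the latencies from slide 9 and assumed hit rates (the exact numbers are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double l1_time = 2, l2_time = 10, mm_time = 100;  /* cycles, from slide 9 ranges */
        double l1_hit = 0.95, l2_hit = 0.90;              /* assumed local hit rates     */

        /* Each level's latency is paid only when the level above it misses. */
        double avg = l1_time + (1 - l1_hit) * (l2_time + (1 - l2_hit) * mm_time);
        printf("average access time = %.1f cycles\n", avg);   /* 3.0 cycles */
        return 0;
    }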

  12. Caches • Caches are small, fast, memories that hold recently accessed instructions and/or data • Separate L1 instruction and L1 data caches • Need simultaneous access of instructions and data in pipelines • L2 cache holds both instructions and data • Simultaneous access not as critical since >90% of the instructions and data will be found in L1 • PC or effective address from L1 is sent to L2 to search for the instruction or data

  13. How caches exploit the POL • On a cache miss, a block of several instructions or data, including the requested item, is returned • The entire block is placed into the cache so that future searches for items in the block will be successful [figure: a block holding instruction_i (the requested instruction) through instruction_i+3]

  14. How caches exploit the POL • Consider the sequence of instruction and data accesses in this loop with a block size of 4 words

    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
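
For reference, one reading of that loop in C (my rendering, not from the slides), with comments on where each kind of locality appears:

    #include <stdio.h>

    /* Walks an array backward, adding k to each word; a[] and k stand in for
     * the array pointed to by $s1 and the value in $s2. */
    static void add_k(int *a, int n, int k) {
        for (int i = n - 1; i >= 0; i--)
            a[i] += k;                    /* the lw / addu / sw of the slide */
        /* Temporal locality: the same five instructions execute every
         * iteration, so after the first pass they hit in the I-cache.
         * Spatial locality: consecutive iterations touch adjacent words, so
         * with 4-word blocks one data-cache miss serves about four iterations. */
    }

    int main(void) {
        int a[8] = {0};
        add_k(a, 8, 3);
        printf("a[0] = %d\n", a[0]);      /* 3 */
        return 0;
    }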

  15. Searching the cache • The cache is much smaller than main memory • Multiple memory blocks must share the same cache location [figure: many memory blocks mapping to one cache block]

  16. Searching the cache • Need a way to determine whether the desired instruction or data is held in the cache • Need a scheme for replacing blocks when a new block needs to be brought in on a miss

  17. Cache organization alternatives • Direct mapped: each block can be placed in only one cache location • Set associative: each block can be placed in any of n cache locations • Fully associative: each block can be placed in any cache location

  18. Cache organization alternatives • Searching for block 12 in caches of size 8 blocks [figure: where block 12 can be placed under each organization]

  19. Searching a direct mapped cache • Need log2(number of sets) address bits (the index) to select the block's cache location • The block offset is used to select the desired byte, half-word, or word within the block • The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location • Example: a data cache with 16 byte blocks and 8 sets needs 4 block offset bits and 3 index bits, leaving 25 tag bits [figure: memory address divided into tag, index, and block offset fields]

  20. Searching a direct mapped cache • The block is placed in the set selected by its index bits • Number of sets = cache size/block size [figure: same example as the previous slide: 16 byte blocks, 8 sets, 4 offset bits, 3 index bits, 25 tag bits]

  21. Direct mapped cache organization • 64KB instruction cache with 16 byte (4 word) blocks • 4K sets (64KB/16B) ⇒ need 12 address bits to pick the set
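
Tying slides 19-21 together, a small C sketch that splits an address into tag, index, and block offset for a direct-mapped cache; the two configurations in main() are the ones used on the slides:

    #include <stdint.h>
    #include <stdio.h>

    static unsigned log2u(uint32_t x) {    /* x assumed to be a power of two */
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    /* cache_size and block_size are in bytes; direct mapped, so
     * number of sets = cache size / block size. */
    static void split_address(uint32_t addr, uint32_t cache_size, uint32_t block_size) {
        uint32_t sets        = cache_size / block_size;
        unsigned offset_bits = log2u(block_size);
        unsigned index_bits  = log2u(sets);

        uint32_t offset = addr & (block_size - 1);
        uint32_t index  = (addr >> offset_bits) & (sets - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("tag=0x%x index=%u offset=%u (%u offset, %u index, %u tag bits)\n",
               tag, index, offset, offset_bits, index_bits,
               32 - offset_bits - index_bits);
    }

    int main(void) {
        split_address(0x12345678, 8 * 16, 16);     /* slides 19-20: 8 sets -> 4 + 3 + 25 bits  */
        split_address(0x12345678, 64 * 1024, 16);  /* slide 21: 4K sets   -> 4 + 12 + 16 bits  */
        return 0;
    }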

  22. Direct mapped cache organization • The data section of the cache holds the instructions

  23. Direct mapped cache organization • The tag section holds the part of the memory address used to distinguish different blocks

  24. Direct mapped cache organization • A valid bit associated with each set indicates if the instructions are valid or not

  25. Direct mapped cache access • The index bits are used to select one of the sets

  26. Direct mapped cache access • The data, tag, and Valid bit from the selected set are simultaneously accessed

  27. Direct mapped cache access • The tag from the selected entry is compared with the tag field of the address

  28. Direct mapped cache access • A match between the tags and a Valid bit that is set indicates a cache hit

  29. Direct mapped cache access • The block offset selects the desired instruction
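
Putting slides 25-29 together, a minimal C sketch of the direct-mapped lookup; the field widths follow the 64KB, 16-byte-block example, and the data structures are illustrative rather than the actual hardware:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS            4096           /* 64KB / 16B blocks       */
    #define WORDS_PER_BLOCK 4              /* 16-byte (4-word) blocks */

    struct dm_set {
        bool     valid;
        uint32_t tag;
        uint32_t data[WORDS_PER_BLOCK];    /* the cached instructions */
    };

    static struct dm_set cache[SETS];

    /* Returns true on a hit and writes the selected instruction to *insn. */
    static bool dm_lookup(uint32_t addr, uint32_t *insn) {
        uint32_t offset = (addr >> 2) & (WORDS_PER_BLOCK - 1);  /* word in block  */
        uint32_t index  = (addr >> 4) & (SETS - 1);             /* selects a set  */
        uint32_t tag    = addr >> 16;                           /* remaining bits */

        struct dm_set *set = &cache[index];      /* slide 25: index selects the set */
        if (set->valid && set->tag == tag) {     /* slides 27-28: tag match + valid */
            *insn = set->data[offset];           /* slide 29: offset picks the word */
            return true;                         /* cache hit                       */
        }
        return false;                            /* cache miss                      */
    }

    int main(void) {
        uint32_t addr = 0x00400048;
        cache[(addr >> 4) & (SETS - 1)] = (struct dm_set){
            .valid = true, .tag = addr >> 16, .data = {0x11, 0x22, 0x33, 0x44} };
        uint32_t insn = 0;
        bool hit = dm_lookup(addr, &insn);
        printf("hit=%d insn=0x%x\n", hit, insn);   /* hit=1 insn=0x33 */
        return 0;
    }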

  30. Set associative cache • The block is placed in one way of the set selected by its index bits • Number of sets = cache size/(block size * ways) [figure: a set with ways 0-3]

  31. Set associative cache operation • The index bits are used to select one of the sets

  32. Set associative cache operation • The data, tag, and Valid bit from all ways of the selected entry are simultaneously accessed

  33. Set associative cache operation • The tags from all ways of the selected entry are compared with the tag field of the address

  34. Set associative cache operation • A match between the tags and a Valid bit that is set indicates a cache hit (hit in way1 shown)

  35. Set associative cache operation • The data from the way that had a hit is returned through the MUX
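
The corresponding sketch for a set-associative lookup; the loop over ways plays the role of the parallel tag comparators and the MUX on the slide, and the 4 ways and 256 sets are assumed for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4
    #define SETS 256       /* number of sets = cache size / (block size * ways) */

    struct line { bool valid; uint32_t tag; uint32_t data[4]; };
    static struct line cache[SETS][WAYS];

    /* Compare the tag against every way of the selected set; a valid match
     * in any way is a hit, and that way's word is returned (the MUX). */
    static bool sa_lookup(uint32_t index, uint32_t tag, uint32_t offset, uint32_t *out) {
        for (int way = 0; way < WAYS; way++) {
            struct line *l = &cache[index][way];
            if (l->valid && l->tag == tag) {
                *out = l->data[offset];
                return true;
            }
        }
        return false;
    }

    int main(void) {
        cache[5][2] = (struct line){ .valid = true, .tag = 0xABC, .data = {1, 2, 3, 4} };
        uint32_t v = 0;
        bool hit = sa_lookup(5, 0xABC, 1, &v);
        printf("hit=%d data=%u\n", hit, v);     /* hit=1 data=2 */
        return 0;
    }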

  36. Fully associative cache • A block can be placed in any cache location [figure: fully associative cache with 1024 entries, each holding a valid bit, tag, and data, a tag comparator per entry, and a MUX selecting the data of the hitting entry]

  37. Different degrees of associativity • Four different caches of size 8 blocks

  38. Cache misses • A cache miss occurs when the block is not found in the cache • The block is requested from the next level of the hierarchy • When the block returns, it is loaded into the cache and provided to the requester • A copy of the block remains in the lower levels of the hierarchy • The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses) • The hit rate is 1 − miss rate [figure: L1, L2, and main memory levels]
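
A quick worked example of those two definitions (the counts are made up):

    #include <stdio.h>

    int main(void) {
        long accesses = 1000, misses = 50;           /* illustrative counts only */
        double miss_rate = (double)misses / accesses;
        printf("miss rate = %.1f%%, hit rate = %.1f%%\n",
               100 * miss_rate, 100 * (1 - miss_rate));   /* 5.0%, 95.0% */
        return 0;
    }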

  39. Classifying cache misses • Compulsory misses • Caused by the first access to a block that has never been in the cache • Capacity misses • Due to the cache not being big enough to hold all the blocks that are needed • Conflict misses • Due to multiple blocks competing for the same set • A fully associative cache with a “perfect” replacement policy has no conflict misses

  40. Cache miss classification examples • Direct mapped cache of size two blocks • Blocks A and B map to set 0, C and D to set 1 • Access pattern 1: A, B, C, D, A, B, C, D • Access pattern 2: A, A, B, A [figure: cache contents of sets 0 and 1 after each access, with the miss type of each access left to classify]
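
A small C sketch that classifies the misses in those two patterns, using the usual trick of comparing against a fully-associative cache of the same total size (LRU here, a common stand-in for the "perfect" policy on slide 39) to separate conflict from capacity misses. The block numbering A=0, B=2, C=1, D=3 is my own choice to reproduce the stated set mapping:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_BLOCKS 2    /* total capacity in blocks for both cache models */

    static int  dm_tag[NUM_BLOCKS];            /* direct mapped: one block per set   */
    static bool dm_valid[NUM_BLOCKS];
    static int  fa_tag[NUM_BLOCKS], fa_count;  /* fully assoc., MRU first, LRU last  */
    static bool seen[8];                       /* blocks referenced before           */

    static bool fa_access(int blk) {           /* true on hit; updates LRU order     */
        for (int i = 0; i < fa_count; i++) {
            if (fa_tag[i] == blk) {
                for (int j = i; j > 0; j--) fa_tag[j] = fa_tag[j - 1];
                fa_tag[0] = blk;               /* move to most-recently-used slot    */
                return true;
            }
        }
        if (fa_count < NUM_BLOCKS) fa_count++;
        for (int j = fa_count - 1; j > 0; j--) fa_tag[j] = fa_tag[j - 1];
        fa_tag[0] = blk;                       /* miss: insert at MRU, LRU falls off */
        return false;
    }

    static void run(const char *name, const int *blk, const char *label, int n) {
        memset(dm_valid, 0, sizeof dm_valid);
        memset(seen, 0, sizeof seen);
        fa_count = 0;
        printf("%s\n", name);
        for (int i = 0; i < n; i++) {
            int set = blk[i] % NUM_BLOCKS;               /* A,B -> set 0; C,D -> set 1 */
            bool fa_hit = fa_access(blk[i]);
            if (dm_valid[set] && dm_tag[set] == blk[i]) {
                printf("  %c: hit\n", label[i]);
            } else {
                const char *kind = !seen[blk[i]] ? "compulsory"
                                 : fa_hit        ? "conflict"   /* fully assoc. would hit */
                                                 : "capacity";
                printf("  %c: miss (%s)\n", label[i], kind);
                dm_tag[set] = blk[i];
                dm_valid[set] = true;
            }
            seen[blk[i]] = true;
        }
    }

    int main(void) {
        int p1[] = {0, 2, 1, 3, 0, 2, 1, 3};   /* A,B,C,D,A,B,C,D */
        int p2[] = {0, 0, 2, 0};               /* A,A,B,A         */
        run("Pattern 1:", p1, "ABCDABCD", 8);
        run("Pattern 2:", p2, "AABA", 4);
        return 0;
    }

With a two-block cache, the second pass of pattern 1 misses even in the fully associative model, so those misses count as capacity; the final access of pattern 2 hits in the fully associative model, so it is a conflict miss.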

  41. Reducing capacity misses • Increase the cache size • More cache blocks can be simultaneously held in the cache • Drawback: increased access time

  42. Reducing compulsory misses • Increase the block size • Each miss results in more words being loaded into the cache • Block size should only be increased to a certain point! • As block size is increased: • Fewer cache sets (increased contention) • Larger percentage of the block may not be referenced

  43. Reducing conflict misses • Increase the associativity • More locations in which a block can be held • Drawback: increased access time

  44. Cache miss rates for SPEC92

  45. Block replacement policy • Determines what block to replace on a cache miss to make room for the new block • Least recently used (LRU) • Pick the one that has been unused for the longest time • Based on temporal locality • Requires ordering bits to be kept with each set • Too expensive beyond 4-way • Random • Pseudo-randomly pick a block • Generally not as effective as LRU (higher miss rates) • Simple even for highly associative organizations • Not most recently used (NMRU) • Keep track of which block was accessed last • Randomly pick a block from other than that one • Compromise between LRU and random
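
A sketch of one way to keep the LRU ordering bits mentioned above, using a per-way age counter for a single 4-way set (illustrative only):

    #include <stdio.h>

    #define WAYS 4

    /* age[w] == 0 means way w was used most recently; WAYS-1 means least recently. */
    static int age[WAYS] = {0, 1, 2, 3};

    /* Call on every access to 'way': ways that were younger than it age by one. */
    static void lru_touch(int way) {
        for (int w = 0; w < WAYS; w++)
            if (age[w] < age[way])
                age[w]++;
        age[way] = 0;
    }

    /* On a miss, the victim is the way with the largest age. */
    static int lru_victim(void) {
        for (int w = 0; w < WAYS; w++)
            if (age[w] == WAYS - 1)
                return w;
        return 0;
    }

    int main(void) {
        lru_touch(2); lru_touch(0); lru_touch(2);
        printf("victim = way %d\n", lru_victim());   /* way 3: idle the longest */
        return 0;
    }

With 2-bit ages this scheme costs 8 bits of ordering state per 4-way set, and the bookkeeping grows quickly with associativity, which is why LRU is described above as too expensive beyond 4-way.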

  46. Cache writes • The L1 data cache needs to handle writes (stores) in addition to reads (loads) • Need to check for a hit before writing • Don’t want to write over another block on a miss • Requires a two cycle operation (tag check followed by write) • Write back cache • Check for hit • If a hit, write the byte, halfword, or word to the correct location in the block • If a miss, request the block from the next level in the hierarchy • Load the block into the cache, and then perform the write

  47. Cache writes and block replacement • With a write back cache, when a block is written to, copies of the block in the lower levels are not updated • If this block is chosen for replacement on a miss, we need to save it to the next level • Solution: • A dirty bit is associated with each cache block • The dirty bit is set if the block is written to • A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block

  48. Cache writes and block replacement • (1) cache miss in L1D • (2) dirty block written to L2 • (3) read request sent to L2 • (4) block loaded into L1D cache [figure: L1D, L2, and main memory with these four steps annotated]
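
A sketch of the write-hit and replacement paths with a dirty bit, following steps (1)-(4) above; the L2 helpers are stubs standing in for the next level, not a real interface:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_BYTES 16

    struct line {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    };

    /* Stub L2 interface. */
    static void l2_write_block(uint32_t tag, const uint8_t *d) {
        (void)d; printf("  write back dirty block tag=0x%x to L2\n", tag);
    }
    static void l2_read_block(uint32_t tag, uint8_t *d) {
        memset(d, 0, BLOCK_BYTES); printf("  fetch block tag=0x%x from L2\n", tag);
    }

    /* (1) miss in L1D: evict the victim, writing it back only if dirty. */
    static void handle_miss(struct line *set, uint32_t new_tag) {
        if (set->valid && set->dirty)
            l2_write_block(set->tag, set->data);   /* (2) dirty victim goes to L2    */
        l2_read_block(new_tag, set->data);         /* (3) read request sent to L2    */
        set->tag = new_tag;                        /* (4) new block installed in L1D */
        set->valid = true;
        set->dirty = false;
    }

    /* A store that hits just updates the block and marks it dirty;
     * the copies in L2 and main memory are now stale. */
    static void store_word(struct line *set, uint32_t offset, uint32_t value) {
        memcpy(&set->data[offset], &value, sizeof value);
        set->dirty = true;
    }

    int main(void) {
        struct line set = {0};
        handle_miss(&set, 0x100);      /* cold miss: nothing to write back         */
        store_word(&set, 4, 0xDEAD);   /* write hit: block becomes dirty           */
        handle_miss(&set, 0x200);      /* conflict: dirty block written back first */
        return 0;
    }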

  49. Main memory systems • The main memory (MM) lies in the memory hierarchy between the cache hierarchy and I/O • MM is too large to fit on the microprocessor die • MM controller is on the microprocessor die or a separate chip • DRAMs lie on Dual In-line Memory Modules (DIMMs) on the printed circuit board • ~10-20 DRAMs per DIMM • Interconnect width is typically ½ to ¼ of the cache block • Cache block is returned to L2 in multiple transfers [figure: CPU with L1 instruction/data caches and L2 cache, connected through an interconnect and memory controller to DIMMs and to I/O (disk, etc.)]
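
A quick check of the last two bullets, with assumed sizes (a 64-byte L2 block and a 16-byte-wide interconnect, neither of which is specified on the slide):

    #include <stdio.h>

    int main(void) {
        int block_bytes = 64, interconnect_bytes = 16;   /* assumed sizes */
        printf("%d transfers to return one cache block\n",
               block_bytes / interconnect_bytes);        /* 4 transfers */
        return 0;
    }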

  50. Main memory systems • DRAM characteristics • Storage cells lie in a grid of rows and columns • Row decoder selects the row • Column latches latch data from the row • Column MUX selects a subset of the data to be output • Access time ~40ns from row address to data out [figure: DRAM array with the address feeding a row decoder, column latches, and a column MUX driving data out]
