
ECE200 – Computer Organization


Presentation Transcript


  1. ECE200 – Computer Organization Chapter 7 – Large and Fast: Exploiting Memory Hierarchy

  2. Homework 7 • 7.2, 7.3, 7.5, 7.6, 7.7, 7.11, 7.15, 7.20, 7.21, 7.27, 7.32

  3. Outline for Chapter 7 lectures • Motivation for, and concept of, memory hierarchies • Caches • Main memory • Characterizing memory hierarchy performance • Virtual memory • Real memory hierarchies

  4. The memory dilemma • Ch 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle • Reality check • In high performance machines, programs may require hundreds of megabytes or even gigabytes of memory to run • Embedded processors have smaller needs but there is also less room for on-chip memory • Basic problem • We need much more memory than can fit on the microprocessor chip • But we do not want to incur stall cycles every time the pipeline accesses instructions or data • At the same time, we need the memory to be economical for the machine to be competitive in the market

  5. Solution: a hierarchy of memories

  6. Another view

  7. Typical characteristics of each level • First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside • 16-64KB for each cache (desktop/server machine) • Fast, power-hungry, not-so-dense, static RAM (SRAM) • Second level (L2) consists of another larger unified cache • Holds both instructions and data • 256KB-4MB • On or off-chip • SRAM • Third level is main memory • 64-512MB • Slower, lower-power, denser dynamic RAM (DRAM) • Final level is I/O (e.g., disk)

  8. Caches and the pipeline • L1 instruction and data caches and L2 cache [figure: pipeline with separate L1 instruction and data caches, an L2 cache, and a connection to main memory]

  9. Memory hierarchy operation (1) Search L1 for the instruction or data If found (cache hit), done (2) Else (cache miss), search L2 cache If found, place it in L1 and repeat (1) (3) Else, search main memory If found, place it in L2 and repeat (2) (4) Else, get it from I/O (Chapter 8) Steps (1)-(3) are performed in hardware • 1-3 cycles to get from L1 caches • 5-20 cycles to get from L2 cache • 50-200 cycles to get from main memory [figure: L1, L2, and main memory levels]
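
A minimal C sketch of this search order; the lookup/fill helpers are hypothetical stand-ins for the cache hardware, and the latencies are midpoints of the cycle counts above:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helpers standing in for the cache hardware; each lookup
     * returns true on a hit and charges that level's latency. */
    static int cycles = 0;
    static bool lookup_l1(unsigned addr) { (void)addr; cycles += 2;   return false; }
    static bool lookup_l2(unsigned addr) { (void)addr; cycles += 10;  return false; }
    static void read_main_memory(unsigned addr) { (void)addr; cycles += 100; }
    static void fill_l2(unsigned addr) { (void)addr; }
    static void fill_l1(unsigned addr) { (void)addr; }

    /* Steps (1)-(3): search L1, then L2, then main memory, filling the
     * smaller levels on the way back up. */
    static void access_memory(unsigned addr) {
        if (lookup_l1(addr)) return;       /* (1) L1 hit: done                */
        if (!lookup_l2(addr)) {            /* (2) L1 miss: search L2          */
            read_main_memory(addr);        /* (3) L2 miss: go to main memory  */
            fill_l2(addr);                 /*     place the block in L2       */
        }
        fill_l1(addr);                     /*     place the block in L1       */
    }

    int main(void) {
        access_memory(0x1000);             /* misses at every level in this stub */
        printf("total latency: %d cycles\n", cycles);   /* 112 cycles */
        return 0;
    }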

  10. Principle of locality • Programs access a small portion of memory within a short time period • Temporal locality: recently accessed memory locations will likely be accessed soon • Spatial locality: memory locations near recently accessed locations will likely be accessed soon

  11. POL makes memory hierarchies work • A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory • Cheap, abundant main memory is accessed more rarely • ⇒ Memory hierarchy operates at nearly the speed of expensive on-chip SRAM with about the cost of main memory (DRAMs)
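
A back-of-the-envelope check of that claim, using the latencies from slide 9 and assumed hit rates (the exact numbers are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double l1_time = 2, l2_time = 10, mm_time = 100;  /* cycles, from slide 9 ranges */
        double l1_hit = 0.95, l2_hit = 0.90;              /* assumed local hit rates     */

        /* Each level's latency is paid only when the level above it misses. */
        double avg = l1_time + (1 - l1_hit) * (l2_time + (1 - l2_hit) * mm_time);
        printf("average access time = %.1f cycles\n", avg);   /* 3.0 cycles */
        return 0;
    }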

  12. Caches • Caches are small, fast, memories that hold recently accessed instructions and/or data • Separate L1 instruction and L1 data caches • Need simultaneous access of instructions and data in pipelines • L2 cache holds both instructions and data • Simultaneous access not as critical since >90% of the instructions and data will be found in L1 • PC or effective address from L1 is sent to L2 to search for the instruction or data

  13. How caches exploit the POL • On a cache miss, a block of several instructions or data, including the requested item, is returned • The entire block is placed into the cache so that future searches for items in the block will be successful [figure: a block holding instruction_i (the requested instruction) through instruction_i+3]

  14. How caches exploit the POL • Consider the sequence of instruction and data accesses in this loop with a block size of 4 words

    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
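
For reference, one reading of that loop in C (my rendering, not from the slides), with comments on where each kind of locality appears:

    #include <stdio.h>

    /* Walks an array backward, adding k to each word; a[] and k stand in for
     * the array pointed to by $s1 and the value in $s2. */
    static void add_k(int *a, int n, int k) {
        for (int i = n - 1; i >= 0; i--)
            a[i] += k;                    /* the lw / addu / sw of the slide */
        /* Temporal locality: the same five instructions execute every
         * iteration, so after the first pass they hit in the I-cache.
         * Spatial locality: consecutive iterations touch adjacent words, so
         * with 4-word blocks one data-cache miss serves about four iterations. */
    }

    int main(void) {
        int a[8] = {0};
        add_k(a, 8, 3);
        printf("a[0] = %d\n", a[0]);      /* 3 */
        return 0;
    }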

  15. Searching the cache • The cache is much smaller than main memory • Multiple memory blocks must share the same cache location [figure: many memory blocks mapping to one cache block]

  16. Searching the cache • Need a way to determine whether the desired instruction or data is held in the cache • Need a scheme for replacing blocks when a new block needs to be brought in on a miss

  17. Cache organization alternatives • Direct mapped: each block can be placed in only one cache location • Set associative: each block can be placed in any of n cache locations • Fully associative: each block can be placed in any cache location

  18. Cache organization alternatives • Searching for block 12 in caches of size 8 blocks [figure: where block 12 can be placed under each organization]

  19. Searching a direct mapped cache • Need log2(number of sets) address bits (the index) to select the block's cache location • The block offset is used to select the desired byte, half-word, or word within the block • The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location • Example: a data cache with 16 byte blocks and 8 sets needs 4 block offset bits and 3 index bits, leaving 25 tag bits [figure: memory address divided into tag, index, and block offset fields]

  20. Searching a direct mapped cache • The block is placed in the set selected by its index bits • Number of sets = cache size/block size [figure: same example as the previous slide: 16 byte blocks, 8 sets, 4 offset bits, 3 index bits, 25 tag bits]

  21. Direct mapped cache organization • 64KB instruction cache with 16 byte (4 word) blocks • 4K sets (64KB/16B) ⇒ need 12 address bits to pick the set
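
Tying slides 19-21 together, a small C sketch that splits an address into tag, index, and block offset for a direct-mapped cache; the two configurations in main() are the ones used on the slides:

    #include <stdint.h>
    #include <stdio.h>

    static unsigned log2u(uint32_t x) {    /* x assumed to be a power of two */
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    /* cache_size and block_size are in bytes; direct mapped, so
     * number of sets = cache size / block size. */
    static void split_address(uint32_t addr, uint32_t cache_size, uint32_t block_size) {
        uint32_t sets        = cache_size / block_size;
        unsigned offset_bits = log2u(block_size);
        unsigned index_bits  = log2u(sets);

        uint32_t offset = addr & (block_size - 1);
        uint32_t index  = (addr >> offset_bits) & (sets - 1);
        uint32_t tag    = addr >> (offset_bits + index_bits);

        printf("tag=0x%x index=%u offset=%u (%u offset, %u index, %u tag bits)\n",
               tag, index, offset, offset_bits, index_bits,
               32 - offset_bits - index_bits);
    }

    int main(void) {
        split_address(0x12345678, 8 * 16, 16);     /* slides 19-20: 8 sets -> 4 + 3 + 25 bits  */
        split_address(0x12345678, 64 * 1024, 16);  /* slide 21: 4K sets   -> 4 + 12 + 16 bits  */
        return 0;
    }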

  22. Direct mapped cache organization • The data section of the cache holds the instructions

  23. Direct mapped cache organization • The tag section holds the part of the memory address used to distinguish different blocks

  24. Direct mapped cache organization • A valid bit associated with each set indicates if the instructions are valid or not

  25. Direct mapped cache access • The index bits are used to select one of the sets

  26. Direct mapped cache access • The data, tag, and Valid bit from the selected set are simultaneously accessed

  27. Direct mapped cache access • The tag from the selected entry is compared with the tag field of the address

  28. Direct mapped cache access • A match between the tags and a Valid bit that is set indicates a cache hit

  29. Direct mapped cache access • The block offset selects the desired instruction
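
Putting slides 25-29 together, a minimal C sketch of the direct-mapped lookup; the field widths follow the 64KB, 16-byte-block example, and the data structures are illustrative rather than the actual hardware:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS            4096           /* 64KB / 16B blocks       */
    #define WORDS_PER_BLOCK 4              /* 16-byte (4-word) blocks */

    struct dm_set {
        bool     valid;
        uint32_t tag;
        uint32_t data[WORDS_PER_BLOCK];    /* the cached instructions */
    };

    static struct dm_set cache[SETS];

    /* Returns true on a hit and writes the selected instruction to *insn. */
    static bool dm_lookup(uint32_t addr, uint32_t *insn) {
        uint32_t offset = (addr >> 2) & (WORDS_PER_BLOCK - 1);  /* word in block  */
        uint32_t index  = (addr >> 4) & (SETS - 1);             /* selects a set  */
        uint32_t tag    = addr >> 16;                           /* remaining bits */

        struct dm_set *set = &cache[index];      /* slide 25: index selects the set */
        if (set->valid && set->tag == tag) {     /* slides 27-28: tag match + valid */
            *insn = set->data[offset];           /* slide 29: offset picks the word */
            return true;                         /* cache hit                       */
        }
        return false;                            /* cache miss                      */
    }

    int main(void) {
        uint32_t addr = 0x00400048;
        cache[(addr >> 4) & (SETS - 1)] = (struct dm_set){
            .valid = true, .tag = addr >> 16, .data = {0x11, 0x22, 0x33, 0x44} };
        uint32_t insn = 0;
        bool hit = dm_lookup(addr, &insn);
        printf("hit=%d insn=0x%x\n", hit, insn);   /* hit=1 insn=0x33 */
        return 0;
    }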

  30. Set associative cache • The block is placed in one way of the set selected by its index bits • Number of sets = cache size/(block size * ways) [figure: a set with ways 0-3]

  31. Set associative cache operation • The index bits are used to select one of the sets

  32. Set associative cache operation • The data, tag, and Valid bit from all ways of the selected entry are simultaneously accessed

  33. Set associative cache operation • The tags from all ways of the selected entry are compared with the tag field of the address

  34. Set associative cache operation • A match between the tags and a Valid bit that is set indicates a cache hit (hit in way1 shown)

  35. Set associative cache operation • The data from the way that had a hit is returned through the MUX
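
The corresponding sketch for a set-associative lookup; the loop over ways plays the role of the parallel tag comparators and the MUX on the slide, and the 4 ways and 256 sets are assumed for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4
    #define SETS 256       /* number of sets = cache size / (block size * ways) */

    struct line { bool valid; uint32_t tag; uint32_t data[4]; };
    static struct line cache[SETS][WAYS];

    /* Compare the tag against every way of the selected set; a valid match
     * in any way is a hit, and that way's word is returned (the MUX). */
    static bool sa_lookup(uint32_t index, uint32_t tag, uint32_t offset, uint32_t *out) {
        for (int way = 0; way < WAYS; way++) {
            struct line *l = &cache[index][way];
            if (l->valid && l->tag == tag) {
                *out = l->data[offset];
                return true;
            }
        }
        return false;
    }

    int main(void) {
        cache[5][2] = (struct line){ .valid = true, .tag = 0xABC, .data = {1, 2, 3, 4} };
        uint32_t v = 0;
        bool hit = sa_lookup(5, 0xABC, 1, &v);
        printf("hit=%d data=%u\n", hit, v);     /* hit=1 data=2 */
        return 0;
    }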

  36. Fully associative cache • A block can be placed in any cache location [figure: fully associative cache with 1024 entries, each holding a valid bit, tag, and data, a tag comparator per entry, and a MUX selecting the data of the hitting entry]

  37. Different degrees of associativity • Four different caches of size 8 blocks

  38. Cache misses • A cache miss occurs when the block is not found in the cache • The block is requested from the next level of the hierarchy • When the block returns, it is loaded into the cache and provided to the requester • A copy of the block remains in the lower levels of the hierarchy • The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses) • The hit rate is 1 − miss rate [figure: L1, L2, and main memory levels]
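
A quick worked example of those two definitions (the counts are made up):

    #include <stdio.h>

    int main(void) {
        long accesses = 1000, misses = 50;           /* illustrative counts only */
        double miss_rate = (double)misses / accesses;
        printf("miss rate = %.1f%%, hit rate = %.1f%%\n",
               100 * miss_rate, 100 * (1 - miss_rate));   /* 5.0%, 95.0% */
        return 0;
    }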

  39. Classifying cache misses • Compulsory misses • Caused by the first access to a block that has never been in the cache • Capacity misses • Due to the cache not being big enough to hold all the blocks that are needed • Conflict misses • Due to multiple blocks competing for the same set • A fully associative cache with a “perfect” replacement policy has no conflict misses

  40. Cache miss classification examples • Direct mapped cache of size two blocks • Blocks A and B map to set 0, C and D to set 1 • Access pattern 1: A, B, C, D, A, B, C, D • Access pattern 2: A, A, B, A [figure: cache contents of sets 0 and 1 after each access, with the miss type of each access left to classify]
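
A small C sketch that classifies the misses in those two patterns, using the usual trick of comparing against a fully-associative cache of the same total size (LRU here, a common stand-in for the "perfect" policy on slide 39) to separate conflict from capacity misses. The block numbering A=0, B=2, C=1, D=3 is my own choice to reproduce the stated set mapping:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_BLOCKS 2    /* total capacity in blocks for both cache models */

    static int  dm_tag[NUM_BLOCKS];            /* direct mapped: one block per set   */
    static bool dm_valid[NUM_BLOCKS];
    static int  fa_tag[NUM_BLOCKS], fa_count;  /* fully assoc., MRU first, LRU last  */
    static bool seen[8];                       /* blocks referenced before           */

    static bool fa_access(int blk) {           /* true on hit; updates LRU order     */
        for (int i = 0; i < fa_count; i++) {
            if (fa_tag[i] == blk) {
                for (int j = i; j > 0; j--) fa_tag[j] = fa_tag[j - 1];
                fa_tag[0] = blk;               /* move to most-recently-used slot    */
                return true;
            }
        }
        if (fa_count < NUM_BLOCKS) fa_count++;
        for (int j = fa_count - 1; j > 0; j--) fa_tag[j] = fa_tag[j - 1];
        fa_tag[0] = blk;                       /* miss: insert at MRU, LRU falls off */
        return false;
    }

    static void run(const char *name, const int *blk, const char *label, int n) {
        memset(dm_valid, 0, sizeof dm_valid);
        memset(seen, 0, sizeof seen);
        fa_count = 0;
        printf("%s\n", name);
        for (int i = 0; i < n; i++) {
            int set = blk[i] % NUM_BLOCKS;               /* A,B -> set 0; C,D -> set 1 */
            bool fa_hit = fa_access(blk[i]);
            if (dm_valid[set] && dm_tag[set] == blk[i]) {
                printf("  %c: hit\n", label[i]);
            } else {
                const char *kind = !seen[blk[i]] ? "compulsory"
                                 : fa_hit        ? "conflict"   /* fully assoc. would hit */
                                                 : "capacity";
                printf("  %c: miss (%s)\n", label[i], kind);
                dm_tag[set] = blk[i];
                dm_valid[set] = true;
            }
            seen[blk[i]] = true;
        }
    }

    int main(void) {
        int p1[] = {0, 2, 1, 3, 0, 2, 1, 3};   /* A,B,C,D,A,B,C,D */
        int p2[] = {0, 0, 2, 0};               /* A,A,B,A         */
        run("Pattern 1:", p1, "ABCDABCD", 8);
        run("Pattern 2:", p2, "AABA", 4);
        return 0;
    }

With a two-block cache, the second pass of pattern 1 misses even in the fully associative model, so those misses count as capacity; the final access of pattern 2 hits in the fully associative model, so it is a conflict miss.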

  41. Reducing capacity misses • Increase the cache size • More cache blocks can be simultaneously held in the cache • Drawback: increased access time

  42. Reducing compulsory misses • Increase the block size • Each miss results in more words being loaded into the cache • Block size should only be increased to a certain point! • As block size is increased: • Fewer cache sets (increased contention) • Larger percentage of the block may not be referenced

  43. Reducing conflict misses • Increase the associativity • More locations in which a block can be held • Drawback: increased access time

  44. Cache miss rates for SPEC92

  45. Block replacement policy • Determines what block to replace on a cache miss to make room for the new block • Least recently used (LRU) • Pick the one that has been unused for the longest time • Based on temporal locality • Requires ordering bits to be kept with each set • Too expensive beyond 4-way • Random • Pseudo-randomly pick a block • Generally not as effective as LRU (higher miss rates) • Simple even for highly associative organizations • Not most recently used (NMRU) • Keep track of which block was accessed last • Randomly pick a block from other than that one • Compromise between LRU and random
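
A sketch of one way to keep the LRU ordering bits mentioned above, using a per-way age counter for a single 4-way set (illustrative only):

    #include <stdio.h>

    #define WAYS 4

    /* age[w] == 0 means way w was used most recently; WAYS-1 means least recently. */
    static int age[WAYS] = {0, 1, 2, 3};

    /* Call on every access to 'way': ways that were younger than it age by one. */
    static void lru_touch(int way) {
        for (int w = 0; w < WAYS; w++)
            if (age[w] < age[way])
                age[w]++;
        age[way] = 0;
    }

    /* On a miss, the victim is the way with the largest age. */
    static int lru_victim(void) {
        for (int w = 0; w < WAYS; w++)
            if (age[w] == WAYS - 1)
                return w;
        return 0;
    }

    int main(void) {
        lru_touch(2); lru_touch(0); lru_touch(2);
        printf("victim = way %d\n", lru_victim());   /* way 3: idle the longest */
        return 0;
    }

With 2-bit ages this scheme costs 8 bits of ordering state per 4-way set, and the bookkeeping grows quickly with associativity, which is why LRU is described above as too expensive beyond 4-way.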

  46. Cache writes • The L1 data cache needs to handle writes (stores) in addition to reads (loads) • Need to check for a hit before writing • Don’t want to write over another block on a miss • Requires a two cycle operation (tag check followed by write) • Write back cache • Check for hit • If a hit, write the byte, halfword, or word to the correct location in the block • If a miss, request the block from the next level in the hierarchy • Load the block into the cache, and then perform the write

  47. Cache writes and block replacement • With a write back cache, when a block is written to, copies of the block in the lower levels are not updated • If this block is chosen for replacement on a miss, we need to save it to the next level • Solution: • A dirty bit is associated with each cache block • The dirty bit is set if the block is written to • A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block

  48. Cache writes and block replacement • (1) cache miss in L1D • (2) dirty block written to L2 • (3) read request sent to L2 • (4) block loaded into L1D cache [figure: L1D, L2, and main memory with these four steps annotated]
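
A sketch of the write-hit and replacement paths with a dirty bit, following steps (1)-(4) above; the L2 helpers are stubs standing in for the next level, not a real interface:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_BYTES 16

    struct line {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    };

    /* Stub L2 interface. */
    static void l2_write_block(uint32_t tag, const uint8_t *d) {
        (void)d; printf("  write back dirty block tag=0x%x to L2\n", tag);
    }
    static void l2_read_block(uint32_t tag, uint8_t *d) {
        memset(d, 0, BLOCK_BYTES); printf("  fetch block tag=0x%x from L2\n", tag);
    }

    /* (1) miss in L1D: evict the victim, writing it back only if dirty. */
    static void handle_miss(struct line *set, uint32_t new_tag) {
        if (set->valid && set->dirty)
            l2_write_block(set->tag, set->data);   /* (2) dirty victim goes to L2    */
        l2_read_block(new_tag, set->data);         /* (3) read request sent to L2    */
        set->tag = new_tag;                        /* (4) new block installed in L1D */
        set->valid = true;
        set->dirty = false;
    }

    /* A store that hits just updates the block and marks it dirty;
     * the copies in L2 and main memory are now stale. */
    static void store_word(struct line *set, uint32_t offset, uint32_t value) {
        memcpy(&set->data[offset], &value, sizeof value);
        set->dirty = true;
    }

    int main(void) {
        struct line set = {0};
        handle_miss(&set, 0x100);      /* cold miss: nothing to write back         */
        store_word(&set, 4, 0xDEAD);   /* write hit: block becomes dirty           */
        handle_miss(&set, 0x200);      /* conflict: dirty block written back first */
        return 0;
    }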

  49. Main memory systems • The main memory (MM) lies in the memory hierarchy between the cache hierarchy and I/O • MM is too large to fit on the microprocessor die • MM controller is on the microprocessor die or a separate chip • DRAMs lie on Dual In-line Memory Modules (DIMMs) on the printed circuit board • ~10-20 DRAMs per DIMM • Interconnect width is typically ½ to ¼ of the cache block • Cache block is returned to L2 in multiple transfers [figure: CPU with L1 instruction/data caches and L2 cache, connected through an interconnect and memory controller to DIMMs and to I/O (disk, etc.)]
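
A quick check of the last two bullets, with assumed sizes (a 64-byte L2 block and a 16-byte-wide interconnect, neither of which is specified on the slide):

    #include <stdio.h>

    int main(void) {
        int block_bytes = 64, interconnect_bytes = 16;   /* assumed sizes */
        printf("%d transfers to return one cache block\n",
               block_bytes / interconnect_bytes);        /* 4 transfers */
        return 0;
    }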

  50. Main memory systems • DRAM characteristics • Storage cells lie in a grid of rows and columns • Row decoder selects the row • Column latches latch data from the row • Column MUX selects a subset of the data to be output • Access time ~40ns from row address to data out [figure: DRAM array with the address feeding a row decoder, column latches, and a column MUX driving data out]
