
Chapter 7 Large and Fast: Exploiting Memory Hierarchy



  1. Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Bo Cheng

  2. Principle of locality programs access a relatively small portion of their address space at a given time. • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. • Spatial locality (locality in space): if an item is referenced, items whose addresses are close will tend to be referenced soon.
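
A minimal sketch (a hypothetical example, not taken from the slides) of how both kinds of locality show up in an ordinary loop:

```python
# Illustrative sketch of locality in a simple loop (hypothetical example).
# 'total' is reused on every iteration            -> temporal locality
# data[i] is read in consecutive-address order    -> spatial locality
data = list(range(1_000_000))

total = 0
for i in range(len(data)):
    total += data[i]   # sequential accesses touch neighboring addresses
```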

  3. Basic Structure

  4. The Principle • By combining two concepts (locality and hierarchy): • Temporal locality => keep most recently accessed data items closer to the processor • Spatial locality => move blocks consisting of multiple contiguous words to upper levels of the hierarchy

  5. Memory Hierarchy (I)

  6. Memory Hierarchy (II) • Data is copied between adjacent levels • The minimum unit of information copied is a block • If the requested data appears in some block in the upper level, this is called a hit; otherwise it is a miss, and a block containing the requested data is copied from a lower level • The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level; the miss rate (1.0 − hit rate) is the fraction not found in the upper level • Hit time: the time to access the upper level, including the time to determine whether the access is a hit or a miss • Miss penalty: the time to replace a block in the upper level

  7. Memory Hierarchy (II)

  8. Moore's Law

  9. Cache • A safe place for hiding or storing things • The level of the memory hierarchy between the processor and main memory • Refers to any storage managed to take advantage of locality of access • Motivation: • high processor cycle speed • low memory cycle speed • fast access to recently used portions of a program's code and data

  10. The Basic Cache Concept 1. The CPU is requesting data item Xn 2. The request results in a miss 3. The word Xn is brought from memory into cache

  11. Direct Mapped Cache • Each memory location is mapped to exactly one location in the cache • Mapping: (block address) modulo (number of blocks in the cache) • Answers two crucial questions: • How do we know if a data item is in the cache? • If it is, how do we find it?
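
A minimal sketch of this mapping rule; the block size and number of cache blocks below are illustrative assumptions, not taken from a particular slide:

```python
# Sketch: splitting a 32-bit byte address for a direct-mapped cache.
# Assumed parameters: 4-byte (one-word) blocks, 1024 cache blocks.
BLOCK_BYTES = 4          # -> 2 byte-offset bits
NUM_BLOCKS  = 1024       # -> 10 index bits

def map_address(addr):
    block_address = addr // BLOCK_BYTES
    index = block_address % NUM_BLOCKS     # which cache block to look in
    tag   = block_address // NUM_BLOCKS    # stored tag identifies the block
    return tag, index

print(map_address(0x1234_5678))            # (tag, index) for one sample address
```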

  12. The Example of Direct-Mapped Cache

  13. Address Fields: Upper Bits (tag) and Lower Bits (cache index)

  14. Cache Contents • Tag: identifies whether a word in the cache corresponds to the requested word • Valid bit: indicates whether an entry contains a valid address • Data • For a 32-bit address, n index bits, and one-word blocks: tag size m = 32 − n − 2 (e.g., 32 − 10 − 2 = 20 bits for n = 10) • Cache size = 2^n x (valid + tag + data) = 2^n x (1 + m + 4*8) bits

  15. Direct-Mapped Example • How many total bits are required for a direct-mapped cache with 16 KB of data, 4-word blocks, and a 32-bit address? • 16 KB = 4K words = 2^10 blocks of 4 words, so n = 10 index bits • The block offset takes 4 bits (2 byte-offset + 2 word-offset bits), so n + m + 4 = 32 and the tag is m = 32 − 10 − 4 = 18 bits • Data per block = 4 x 4 x 8 = 128 bits • Total bits = 2^10 x (1 + 18 + 128) = 2^10 x 147 = 147 Kbits (about 18.4 KB of storage for 16 KB of data)
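
The same arithmetic can be written out directly; this sketch just reproduces the numbers above:

```python
# Recomputing the example: 16 KB of data, 4-word blocks, 32-bit addresses.
ADDR_BITS       = 32
DATA_BYTES      = 16 * 1024
WORDS_PER_BLOCK = 4

blocks      = DATA_BYTES // (WORDS_PER_BLOCK * 4)   # 1024 blocks
index_bits  = blocks.bit_length() - 1                # n = 10
offset_bits = 2 + 2                                  # byte offset + word offset
tag_bits    = ADDR_BITS - index_bits - offset_bits   # m = 18
data_bits   = WORDS_PER_BLOCK * 32                   # 128 bits per block

total_bits = blocks * (1 + tag_bits + data_bits)     # valid + tag + data
print(total_bits, total_bits / 1024)                 # 150528 bits = 147 Kbits
```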

  16. Mapping an Address to a Cache Block (Source: http://www.faculty.uaf.edu/ffdr/EE443/)

  17. Block Size vs. Miss Rate

  18. Handling Cache Misses • Stall the entire pipeline and fetch the requested word • Steps to handle an instruction cache miss: • Send the original PC value (current PC − 4) to the memory • Instruct main memory to perform a read and wait for the memory to complete its access • Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on • Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache

  19. Write-Through • A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two • Write buffer: a queue that holds data while the data are waiting to be written to memory

  20. Write-Back • A scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced • Pro: improves performance, especially when writes are frequent (and could not be absorbed by a write buffer) • Con: more complex to implement
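
A toy sketch contrasting the two policies. This is a hypothetical, word-granularity model; real caches write whole blocks and track a dirty bit per block:

```python
# Toy model of write-through vs. write-back (hypothetical, word-granularity).
class WriteThroughCache:
    def __init__(self):
        self.cache, self.memory = {}, {}

    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value      # memory is updated on every write

class WriteBackCache:
    def __init__(self):
        self.cache, self.memory, self.dirty = {}, {}, set()

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)           # defer the memory update

    def evict(self, addr):
        if addr in self.dirty:         # write back only if the block was modified
            self.memory[addr] = self.cache[addr]
            self.dirty.discard(addr)
        self.cache.pop(addr, None)
```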

  21. Cache Performance • CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time • Memory-stall clock cycles = Read-stall cycles + Write-stall cycles • Read-stall cycles = (Reads/Program) x Read miss rate x Read miss penalty • Write-stall cycles = (Writes/Program) x Write miss rate x Write miss penalty + Write buffer stalls • Combined: Memory-stall clock cycles = (Memory accesses/Program) x Miss rate x Miss penalty = (Instructions/Program) x (Misses/Instruction) x Miss penalty
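
These formulas translate directly into a small helper; this is only a sketch, and the parameter names and the numbers in the example call are made up:

```python
# Sketch of the CPU-time formula with memory stalls (names are illustrative).
def cpu_time(instructions, base_cpi, misses_per_instruction, miss_penalty, clock_cycle):
    execution_cycles    = instructions * base_cpi
    memory_stall_cycles = instructions * misses_per_instruction * miss_penalty
    return (execution_cycles + memory_stall_cycles) * clock_cycle

# 1M instructions, base CPI 2, 0.02 misses/instruction, 100-cycle penalty, 0.5 ns cycle
print(cpu_time(1_000_000, 2.0, 0.02, 100, 0.5e-9))   # CPU time in seconds
```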

  22. Example: with a base CPI of 2 and 1.38 memory-stall cycles per instruction, the total CPI is 2 + 1.38 = 3.38 (Source: http://www.faculty.uaf.edu/ffdr/EE443/)

  23. What if …. • What if the processor is made faster, but the memory system stays the same? • Suppose the machine is sped up by improving the CPI from 2 to 1 without increasing the clock rate • A system with a perfect cache would then be 2.38 / 1 = 2.38 times faster • The fraction of time spent on memory stalls rises from 1.38/3.38 = 41% to 1.38/2.38 = 58%
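
A quick check of these numbers, assuming (as in the example above) that memory stalls stay at 1.38 cycles per instruction:

```python
# Reproducing the what-if arithmetic: stalls fixed at 1.38 cycles/instruction.
stalls = 1.38
for base_cpi in (2, 1):
    total_cpi = base_cpi + stalls
    print(total_cpi, f"stall fraction = {stalls / total_cpi:.0%}")
# prints 3.38 (41%) and 2.38 (58%)
```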

  24. What if ….

  25. Our Observations • Relative cache penalties increase as a processor becomes faster • The lower the CPI, the more pronounced the impact of stall cycles • If the main memory system stays the same, a higher CPU clock rate leads to a larger miss penalty

  26. Decreasing miss ratio with associative cache • direct-mapped cache: A cache structure in which each memory location is mapped to exactly one location in the cache. • set-associative cache: A cache that has a fixed number of locations (at least two) where each block can be placed. • fully associative cache: A cache structure in which a block can be placed in any location in the cache.

  27. Example: placing block address 12 in an eight-block cache • Direct mapped: (12 mod 8) = 4, so it can go only in block 4 • Two-way set associative (four sets): (12 mod 4) = 0, so it can go in either block of set 0 • Fully associative: it can appear in any of the eight cache blocks
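
The same arithmetic as a short sketch, using the eight-block cache and block address 12 from the figure:

```python
# Where block address 12 can go in an 8-block cache, for each organization.
block_addr, num_blocks = 12, 8

direct_mapped     = block_addr % num_blocks      # exactly one place: block 4
num_sets          = num_blocks // 2              # 2-way set associative -> 4 sets
set_index         = block_addr % num_sets        # set 0 (either block within the set)
fully_associative = list(range(num_blocks))      # any of the eight blocks

print(direct_mapped, set_index, fully_associative)
```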

  28. One More Example – Direct Mapped 5 Misses

  29. Two-Way Set Associative Cache • Which block to replace? The most commonly used scheme is LRU • Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time • 4 misses on the same reference sequence
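
A minimal simulation of a two-way set-associative cache with LRU replacement. The four-block cache size and the block-address sequence 0, 8, 0, 6, 8 are assumptions on my part; they reproduce the 5/4/3 miss counts quoted on these slides:

```python
# Sketch: 2-way set-associative cache with LRU, counting misses.
from collections import OrderedDict

def simulate_two_way(refs, num_blocks=4, ways=2):
    sets = [OrderedDict() for _ in range(num_blocks // ways)]
    misses = 0
    for block in refs:
        s = sets[block % len(sets)]
        if block in s:
            s.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(s) == ways:         # set full: evict the least recently used block
                s.popitem(last=False)
            s[block] = True
    return misses

print(simulate_two_way([0, 8, 0, 6, 8]))   # -> 4 misses (direct mapped: 5, fully associative: 3)
```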

  30. The Implementation of 4-Way Set Associative Cache

  31. Fully Associative Cache: 3 misses • Increasing the degree of associativity → decrease in miss rate

  32. Performance of Multilevel Caches • Miss penalty to main memory: 100 ns / 0.2 ns = 500 clock cycles • Miss penalty to the secondary cache: 5 ns / 0.2 ns = 25 clock cycles • Multilevel total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction = 1 + (25 x 2%) + (500 x 0.5%) = 1 + 0.5 + 2.5 = 4.0 • Single-level total CPI = 1 + Memory-stall cycles per instruction = 1 + 500 x 2% = 1 + 10 = 11 • Speedup with the secondary cache = 11 / 4 = 2.8
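
Written out as a sketch; the 0.2 ns cycle, 5 ns L2 access, and 100 ns main-memory access times are read off the ratios above:

```python
# Multilevel cache example: 0.2 ns cycle, 5 ns L2 access, 100 ns main-memory access.
cycle_ns    = 0.2
l2_penalty  = 5   / cycle_ns     # 25 cycles
mem_penalty = 100 / cycle_ns     # 500 cycles

base_cpi = 1.0
single_level_cpi = base_cpi + 0.02 * mem_penalty                       # 1 + 10  = 11
multilevel_cpi   = base_cpi + 0.02 * l2_penalty + 0.005 * mem_penalty  # 1 + 0.5 + 2.5 = 4
print(single_level_cpi, multilevel_cpi, single_level_cpi / multilevel_cpi)  # speedup ~ 2.8
```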

  33. Designing the Memory System to Support Caches (I) • Consider hypothetical memory system parameters: • 1 memory bus clock cycle to send the address • 15 memory bus clock cycles to initiate each DRAM access • 1 memory bus clock cycle to transfer a word of data • a cache block of 4 words • a 1-word-wide bank of DRAMs • The miss penalty is: 1 + 4 x 15 + 4 x 1 = 65 clock cycles • Number of bytes transferred per clock cycle per miss: (4*4) / 65 ≈ 0.25
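
The same miss-penalty arithmetic as a sketch, using the parameters listed above:

```python
# Miss penalty for a 4-word block with a one-word-wide memory bank.
send_address    = 1        # bus cycles to send the address
dram_access     = 15       # bus cycles to initiate each DRAM access
transfer_word   = 1        # bus cycles to transfer one word
words_per_block = 4

miss_penalty    = send_address + words_per_block * dram_access + words_per_block * transfer_word
bytes_per_cycle = (words_per_block * 4) / miss_penalty

print(miss_penalty, round(bytes_per_cycle, 2))   # 65 cycles, 0.25 bytes per cycle
```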

  34. Designing the Memory System to Support Caches (II)

  35. Virtual Memory • The technique in which main memory acts as a "cache" for the secondary storage • automatically manages main memory and secondary storage • Motivation • allow efficient sharing of memory among multiple programs • remove the programming burdens of a small, limited amount of main memory

  36. Basic Concepts of Virtual Memory • Virtual memory allows each program to exceed the size of primary memory • It automatically manages two levels of the memory hierarchy: • Main memory (physical memory) • Secondary storage • Same concepts as in caches, different terminology: • A virtual memory block is a page • A virtual memory miss is a page fault • The CPU produces a virtual address, which is translated to a physical address used to access main memory. This process (accomplished by a combination of HW and SW) is called memory mapping or address translation. (Source: http://www.faculty.uaf.edu/ffdr/EE443/)

  37. Mapping from a Virtual to a Physical Address: virtual address space 2^32 = 4 GB; physical memory 2^30 = 1 GB

  38. High Cost of a Miss • Page fault takes millions of cycles to process • E.g., main memory is 100,000 times faster than disk • This time is dominated by the time it takes to get the first word for typical page size • Key decisions: • Page size large enough to amortize the high access time • Pick organization that reduces page fault rate (e.g., fully associative placement of pages) • Handle page faults in software (overhead is small compared to disk access times) and use clever algorithms for page placement • Use write-back

  39. Page Table • Contains the virtual-to-physical address translations in a virtual memory system • Resides in memory • Indexed with the page number from the virtual address • Contains the corresponding physical page number • Each program has its own page table • Hardware includes a register pointing to the start of the page table (the page table register)
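
A minimal sketch of the lookup. The dictionary-based page table and its entries are hypothetical; a real page table is a dense array indexed by hardware:

```python
# Sketch of a page-table lookup: the virtual page number indexes the table,
# which supplies the physical page number (assuming 4 KB pages).
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 5: 12}       # hypothetical virtual -> physical page mappings

def translate(virtual_addr):
    vpn    = virtual_addr // PAGE_SIZE
    offset = virtual_addr %  PAGE_SIZE
    if vpn not in page_table:
        raise RuntimeError("page fault")   # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))           # VPN 1 -> physical page 3 -> 0x3234
```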

  40. Page Table Size • For example, consider: • 32-bit virtual addresses • 4-KB page size • 4 B per page table entry • Number of page table entries = 2^32 / 2^12 = 2^20 • Size of page table = 2^20 x 4 B = 4 MB
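
The same calculation as a short sketch:

```python
# Page-table size: 32-bit virtual addresses, 4 KB pages, 4-byte entries.
entries    = 2**32 // 2**12        # 2^20 entries
size_bytes = entries * 4           # 4 MB
print(entries, size_bytes / 2**20, "MB")
```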

  41. Page Faults • A page fault occurs when the valid bit (V) is found to be 0: • Control is transferred to the operating system (via the exception mechanism) • The operating system must find the requested page in the next level of the hierarchy and decide where to place it in main memory • Where is the page on disk? The information can be found either in the page table itself or in a separate structure • The OS creates the space on disk for all the pages of a process at the time it creates the process; at the same time, it creates a data structure that records the location of each page

  42. The Translation-Lookaside Buffer (TLB) • Each memory access by a program requires two memory accesses: • Obtain the physical address (reference the page table) • Get the data • Because of the spatial and temporal locality within each page, a translation for a virtual page will likely be needed again in the near future • To speed this process up, include a special cache (the TLB) that keeps track of recently used translations
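
Extending the page-table sketch above with a toy TLB that is checked first; this is a hypothetical structure with no size limit or replacement policy shown:

```python
# Toy TLB in front of the page table (hypothetical structure).
PAGE_SIZE  = 4096
page_table = {0: 7, 1: 3, 5: 12}     # hypothetical virtual -> physical page mappings
tlb        = {}                      # small cache of recently used translations

def translate(virtual_addr):
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn in tlb:                   # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                            # TLB miss: consult the page table in memory
        ppn = page_table[vpn]        # (a missing entry here would be a page fault)
        tlb[vpn] = ppn
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x5123)), hex(translate(0x5FFF)))   # second lookup hits the TLB
```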

  43. The Translation-Lookaside Buffer (TLB)

  44. Processing read/write requests

  45. Where Can a Block Be Placed? 1. An increase in the degree of associativity usually decreases the miss rate. 2. The improvement in miss rate comes from reduced competition for the same location.

  46. How Is a Block Found?

  47. What block is replaced on a miss? • Which blocks are candidates for replacement: • In a fully associative cache – all blocks are candidates • In a set-associative cache – all the blocks in the set • In a direct-mapped cache – there is only one candidate • In set-associative and fully associative caches, use one of two strategies: • 1. Random (use hardware assistance to make it fast) • 2. LRU (least recently used) – usually too complicated even for four-way associativity

  48. How Are Writes Handled? • There are two basic options: • Write-through – the information is written both to the block in the cache and to the block in the lower level of the memory hierarchy • Write-back – the modified block is written to the lower level only when it is replaced • ADVANTAGES of WRITE-THROUGH: • Misses are cheaper and simpler • Easier to implement (although it usually requires a write buffer) • ADVANTAGES of WRITE-BACK: • The CPU can write at the rate that the cache can accept • Writes can be combined • Effective use of bandwidth (writing the entire block) • Virtual memory is a special case – only write-back is practical

  49. The Big Picture • 1. Where to place a block? • One place (direct-mapped) • A few places (set-associative) • Any place (fully associative) • 2. How to find a block? • Indexing (direct-mapped) • Limited search (set-associative) • Full search (fully associative) • Separate lookup table (page table) • 3. Which block should be replaced on a cache miss? • Random • LRU • 4. What happens on a write? • Write-through • Write-back

  50. The 3Cs • Compulsory misses – caused by the first access to a block that has never been in the cache (cold-start misses) • INCREASE THE BLOCK SIZE (increase in miss penalty) • Capacity misses – caused when the cache cannot contain all the blocks needed by the program. Blocks are being replaced and later retrieved again. • INCREASE THE SIZE (access time increases as well) • Conflict misses – occur when multiple blocks compete for the same set (collision misses) • INCREASE ASSOCIATIVITY (may slow down access time)
