Cache Configurations and Performance: Write-Through vs Write-Back

Understand the different cache configurations and their impact on performance. Learn about write-through and write-back strategies, and their effects on data consistency and bus traffic.

Presentation Transcript


  1. Outline • Cache writes • DRAM configurations • Performance • Associative caches • Multi-level caches

  2. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
     Address fields, high to low: Tag | Index | Block Offset | Byte Offset
     Index  Valid  Tag  Data
     00     1      01
     01     1      11
     10     1      00
     11     0      00
     Reference stream (hit/miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000

  5. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
     Index  Valid  Tag  Data
     00     1      01   M[64-79]
     01     1      11
     10     1      00
     11     0      00
     Reference stream (hit/miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000

  6. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
     Index  Valid  Tag  Data
     00     1      01   M[64-79]
     01     1      11   M[208-223]
     10     1      00
     11     0      00
     Reference stream (hit/miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000

  7. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
     Index  Valid  Tag  Data
     00     1      01   M[64-79]
     01     1      11   M[208-223]
     10     1      00   M[32-47]
     11     0      00
     Reference stream (hit/miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000

  8. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
     Index  Valid  Tag  Data
     00     1      01   M[64-79]
     01     1      11   M[208-223]
     10     1      00   M[32-47]
     11     0      00   Not Valid
     Reference stream (hit/miss): 0b01001000, 0b00010100, 0b00111000, 0b00010000

  10. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
      Index  Valid  Tag  Data
      00     1      01   M[64-79]
      01     1      11   M[208-223]
      10     1      00   M[32-47]
      11     0      00
      Reference stream (hit/miss): 0b01001000 H, 0b00010100, 0b00111000, 0b00010000

  12. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
      Index  Valid  Tag  Data
      00     1      01   M[64-79]
      01     1      00   M[16-31]
      10     1      00   M[32-47]
      11     0      00
      Reference stream (hit/miss): 0b01001000 H, 0b00010100 M, 0b00111000, 0b00010000

  14. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
      Index  Valid  Tag  Data
      00     1      01   M[64-79]
      01     1      00   M[16-31]
      10     1      00   M[32-47]
      11     1      00   M[48-63]
      Reference stream (hit/miss): 0b01001000 H, 0b00010100 M, 0b00111000 M, 0b00010000

  16. Direct-mapped Cache (block size = 4 words, word size = 4 bytes)
      Index  Valid  Tag  Data
      00     1      01   M[64-79]
      01     1      00   M[16-31]
      10     1      00   M[32-47]
      11     1      00   M[48-63]
      Reference stream (hit/miss): 0b01001000 H, 0b00010100 M, 0b00111000 M, 0b00010000 H
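The lookup sequence above can be reproduced with a short sketch (my own illustration, not code from the lecture; the initial state and addresses come from the slides):

```python
# Direct-mapped cache from the slides: 4 sets, 16-byte blocks (4 words x 4 bytes).
BLOCK_SIZE = 16
NUM_SETS = 4

# Initial state per the slides: index -> (valid, tag)
cache = {0b00: (1, 0b01), 0b01: (1, 0b11), 0b10: (1, 0b00), 0b11: (0, 0b00)}

def access(addr):
    block = addr // BLOCK_SIZE        # strip the byte and block offsets
    index = block % NUM_SETS          # low bits of the block number
    tag = block // NUM_SETS           # remaining high bits
    valid, cached_tag = cache[index]
    if valid and cached_tag == tag:
        return "H"
    cache[index] = (1, tag)           # allocate the block on a miss
    return "M"

results = [access(a) for a in (0b01001000, 0b00010100, 0b00111000, 0b00010000)]
print(results)                        # ['H', 'M', 'M', 'H'], as on slide 16
```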

  17. Cache Writes • There are multiple copies of the data lying around • L1 cache, L2 cache, DRAM • Do we write to all of them? • Do we wait for the write to complete before the processor can proceed?

  20. Do we write to all of them? • Write-through - write to all levels of the hierarchy • Write-back - write to the lower level only when the cache line gets evicted from the cache • Write-back creates inconsistent data - different values for the same item in cache and DRAM; the out-of-date copy is stale • The newer data, held only in the highest cache level, is referred to as dirty • If all copies match, they are clean
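The dirty/clean/stale vocabulary can be made concrete with a toy write-back model (an illustration of mine, not the lecture's code):

```python
# One cached line with a dirty bit; DRAM holds the backing copy.
dram = {0: 100}
line = {"addr": 0, "data": 100, "dirty": False}

def store(value):
    line["data"] = value
    line["dirty"] = True      # cache and DRAM now disagree; the DRAM copy is stale

def evict():
    if line["dirty"]:         # only dirty lines need a write-back
        dram[line["addr"]] = line["data"]
    line["dirty"] = False     # copies match again: the line is clean

store(42)
print(dram[0], line["dirty"])   # 100 True  -> DRAM still holds the stale value
evict()
print(dram[0], line["dirty"])   # 42 False  -> consistent again after write-back
```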

  21. Write-Through: sw $3, 0($5) [Diagram: CPU → L1 → L2 Cache → DRAM]

  22. Write-Back: sw $3, 0($5) [Diagram: CPU → L1 → L2 Cache → DRAM]

  23. Write-through vs Write-back • Which performs the write faster? • Which has faster evictions from a cache? • Which causes more bus traffic?

  26. Write-through vs Write-back • Which performs the write faster? Write-back - it only writes the L1 cache • Which has faster evictions from a cache? Write-through - no write-back involved, just overwrite the tag • Which causes more bus traffic? Write-through - DRAM is written on every store, while write-back writes only on eviction
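The bus-traffic answer can be checked by counting DRAM writes over a store stream (a rough sketch of my own; the single-line cache and the addresses are made up):

```python
# Count DRAM writes for each policy over the same stream of stores.
stores = [0, 0, 0, 16, 16, 0]        # byte addresses; blocks are 16 bytes
BLOCK = 16

wt_writes = len(stores)              # write-through: one DRAM write per store

wb_writes = 0
cached = None                        # single-line cache holding one dirty block
for addr in stores:
    block = addr // BLOCK
    if cached is not None and cached != block:
        wb_writes += 1               # evicting a dirty block writes it back once
    cached = block
wb_writes += 1                       # the final dirty block is written back too

print(wt_writes, wb_writes)          # 6 vs 3: write-through causes more traffic
```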

  28. Does the processor wait for the write? • Write buffer - an intermediate queue for pending writes • Any load must check the write buffer in parallel with the cache access • Buffer values are more recent than cache values
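A minimal sketch of the write-buffer rule (my own simplification: the store only enters the buffer here, so a load must find the value there rather than in the cache):

```python
cache = {0x10: 1}          # stale cached value
write_buffer = {}          # pending writes the processor did not wait for

def store(addr, value):
    write_buffer[addr] = value   # queued; the processor proceeds immediately

def load(addr):
    # Check the buffer in parallel with (here: before) the cache access,
    # because buffered values are more recent than cached ones.
    if addr in write_buffer:
        return write_buffer[addr]
    return cache[addr]

store(0x10, 99)
print(load(0x10))          # 99: the buffered value wins over the stale cached 1
```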

  29. Outline • Cache writes • DRAM configurations • Performance • Associative caches

  31. Challenge • DRAM is designed for density, not speed • DRAM is slower than the bus • We are allowed to change the width, the number of DRAMs, and the bus protocol, but the access latency stays slow. • Widening anything increases the cost by quite a bit.

  33. Narrow Configuration [Diagram: CPU - Cache - Bus - DRAM] • Given: 1 clock cycle per request; 15 cycles/word DRAM latency; 1 cycle/word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/word × 8 words + 1 cycle/word × 8 words = 129 cycles

  35. Wide Configuration [Diagram: CPU - Cache - Bus - DRAM] • Given: 1 clock cycle per request; 15 cycles/2 words DRAM latency; 1 cycle/2 words bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/2 words × 8 words = 65 cycles

  37. Interleaved Configuration [Diagram: CPU - Cache - Bus - two DRAM banks] • Given: 1 clock cycle per request; 15 cycles/word DRAM latency; 1 cycle/word bus latency • If a cache block is 8 words, what is the miss penalty of an L2 cache miss? • The two banks overlap their accesses, so the DRAM term is halved: 1 cycle + 15 cycles/2 words × 8 words + 1 cycle/word × 8 words = 69 cycles
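The three miss-penalty calculations can be captured in one helper (the function and parameter names are mine):

```python
def miss_penalty(request, dram_cycles, dram_words, bus_cycles, bus_words, block=8):
    # request cycles + DRAM accesses for the block + bus transfers for the block
    return request + dram_cycles * (block / dram_words) + bus_cycles * (block / bus_words)

narrow      = miss_penalty(1, 15, 1, 1, 1)   # 1 + 15*8 + 1*8
wide        = miss_penalty(1, 15, 2, 1, 2)   # 1 + 15*4 + 1*4
interleaved = miss_penalty(1, 15, 2, 1, 1)   # banks overlap: 15 cycles per 2 words
print(narrow, wide, interleaved)             # 129.0 65.0 69.0
```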

  38. Recent DRAM trends • Fewer, bigger DRAMs • New bus protocols (RAMBUS) • Small DRAM caches (page mode) • SDRAM (synchronous DRAM) - one request plus a burst length nets several consecutive responses

  39. Outline • Cache writes • DRAM configurations • Performance • Associative caches

  40. Performance • Execution time = (CPU cycles + Memory-stall cycles) × clock cycle time • Memory-stall cycles = (accesses/program) × (misses/access) × (cycles/miss) = (memory accesses/program) × miss rate × miss penalty = (instructions/program) × (misses/instruction) × miss penalty

  41. Example 1 • instruction cache miss rate: 2% • data cache miss rate: 3% • miss penalty: 50 cycles • ld/st instructions are 25% of instructions • CPI with perfect cache is 2.3 • How much faster is the computer with a perfect cache?


  47. Example 1 • misses/instruction = (I-accesses/instruction) × I-miss-rate + (D-accesses/instruction) × D-miss-rate = 1 × .02 + .25 × .03 = .02 + .0075 = .0275 • Memory-stall cycles = I × .0275 × 50 = 1.375 × I • ExecT = (CPU CPI × I + memory-stall cycles) × clk = (2.3 × I + 1.375 × I) × clk = 3.675 × I × clk • speedup = 3.675 × I × clk / (2.3 × I × clk) = 1.6
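Example 1 worked in code (the variable names are mine; the numbers are the slides'):

```python
i_mr, d_mr = 0.02, 0.03        # instruction / data cache miss rates
ldst_frac  = 0.25              # loads and stores per instruction
penalty    = 50                # cycles per miss
cpi_perf   = 2.3               # CPI with a perfect cache

misses_per_inst = 1 * i_mr + ldst_frac * d_mr        # .02 + .0075 = .0275
stall_cpi       = misses_per_inst * penalty          # 1.375 stall cycles/instruction
speedup         = (cpi_perf + stall_cpi) / cpi_perf  # 3.675 / 2.3
print(round(misses_per_inst, 4), round(stall_cpi, 3), round(speedup, 2))
# 0.0275 1.375 1.6
```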

  50. Example 2 • Double the clock rate from Example 1. What is the ideal speedup when taking the memory system into account? • How long is the miss penalty now? 100 cycles - the DRAM latency is unchanged in wall-clock time, but each cycle is now half as long • Memory-stall cycles = I × .0275 × 100 = 2.75 × I
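The slide stops at the stall cycles; a possible completion (my arithmetic, under the assumption above that the miss penalty doubles to 100 cycles while the cycle time halves) is:

```python
misses_per_inst = 0.0275                     # from Example 1
cpi_perf = 2.3

stall_cpi = misses_per_inst * 100            # 2.75 stall cycles per instruction
old_time = (cpi_perf + 0.0275 * 50) * 1.0    # Example 1, in old clock-cycle units
new_time = (cpi_perf + stall_cpi) * 0.5      # cycle time is halved
print(round(stall_cpi, 2), round(old_time / new_time, 2))
# 2.75 1.46 -- well short of the ideal 2x speedup
```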
