1 / 37

3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption

3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption. Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto , and Kazuaki Murakami Kyushu University. Outline. Why 3D? Will 3D always work well? Support Adaptive Execution! Conclusions.

quito
Download Presentation

3D Implemented SRAM/DRAM Hybrid Cache Architecture for High-Performance and Low Power Consumption

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 3D Implemented SRAM/DRAM Hybrid Cache Architecturefor High-Performance and Low Power Consumption Koji Inoue, Shinya Hashiguchi, Shinya Ueno, Naoto Fukumoto, and Kazuaki Murakami Kyushu University

  2. Outline • Why 3D? • Will 3D always work well? • Support Adaptive Execution! • Conclusions

  3. Outline • Why 3D? • Will 3D always work well? • Support Adaptive Execution! • Conclusions

  4. Wire-bonding (WB) 3D stacking (System-in-Package or SiP) Package-on-Package (POP) 3D stacking From 2D to 3D! • Stack Multiple Dies • Connect Dies with Through Silicon Vias TSV Multi-Level 3D IC Sensor IO RF Analog DRAM Source: Yuan Xie, “3D IC Design/Architecture,” Coolchips Special Session, 2009 Processor

  5. Chip Implementation Examplesfrom ISSCC’09 • Image Sensors • SRAM for SoCs • DRAM • Multi-core + SRAMconnected with wireless TSVs 8Gb 3D DRAM(Samsung) SRAM+Multicore(Keio Univ.) SRAM for SoCs(NEC) Image Sensor(MIT) U. Kang et al., “8Gb DDR3 DRAM Using Through-Silicon-Via Technology,” ISSCC’09. H. Saito et al., “A Chip-Stacked Memory for On-Chip SRAM-Rich SoCs and Processors, “ ISSCC’09. V. Suntharalingam et al., “A 4-Side Tileable Back Illuminated 3D-Integrated Mpixel CMOS Image Sensor,” ISSCC’09. K. Niitsu et al., “An Inductive-Coupling Link for 3D Integration of a 90nm CMOS Processor and a 65nm CMOS SRAM,” ISSCC’09.

  6. Why 3D? (1/3) • Wire Length Reduction • Replace long, high capacitance wires by TSVs • Low Latency, Low Energy • Small footprint

  7. Why 3D? (2/3) • Integration • From “Off-Chip” to “On-Chip” • Improved Communication • Low Latency, High Bandwidth, and Low Energy • Heterogeneous Integration • E.g. Emerging Devices

  8. 100 10 1 0.1 Performance Fine Process Performance Improvement (times) Power Consumption Stacking 180 130 90 65 45 32 22 15 12 Process node(nm) Why 3D? (3/3) N.Miyakawa,”3D Stacking Technology for Improvement of System Performance,” International Trade Partners Conference, Nov.2008

  9. Outline • Why 3D? • Will 3D always work well? • Support Adaptive Execution! • Conclusions

  10. Importance of On-Chip Caches • Memory-Wall Problem • Memory bandwidth does not scale with the # of cores • Growing speed gap between processor cores and DRAMs • So, Becomes more serious • Let’s increase on-chip cache capacity, but… • Requires large chip area Core 2 Duo Pentium4 Core Bus Level 2 cache 4MB L2 Cache 1MB L2 Cache http://www.atmarkit.co.jp/ ,http://www.chip-architect.com/

  11. Will 3D always work well? “Stacking a DRAM Cache” Main-Memory Access Time L1 Miss Rate L1 Hit Time L2 Hit Time L2 Miss Rate Ave. Memory Acc. Time ? 32MB DRAM Cache Core Core 4MB Cache Core Core Tag RAM

  12. Cache-Size Sensitivity Varies among Programs! Sensitive! Insensitive! Sensitive! Sensitive! Insensitive! Insensitive!

  13. 2D vs. 3D 172.mgrid LU Profit 171.swim 3D 32MB DRAM Cache 5 FMM 4 Ocean 3 181.mcf 256.bzip2 2 WaterSpatial Better Cholesky 1 188.ammp Barnes 0 0 2D 2MB SRAM Cache 100 50 80 FFT 179.art 300.twolf 301.apsi 60 100 40 HTL2_OVERHEAD[cc] MRL2_REDUTION[points] 20 0 150 200 Profit

  14. Appropriate Cache Size Varies within Programs! The lower, the better Ocean

  15. Outline • Why 3D? • Will 3D always work well? • Adaptive Execution! • Conclusions

  16. Will 3D always work well? “Stacking a DRAM Cache” Main-Memory Access Time L1 Miss Rate L1 Hit Time L2 Hit Time L2 Miss Rate Ave. Memory Acc. Time ? 32MB DRAM Cache Core Core 4MB Cache Core Core Tag RAM

  17. SRAM/DRAM Hybrid Cache Architecture • Support Two Operation Modes • High-Speed, Small Cache Mode (or SRAM Cache Mode) • Low-Speed, Large Cache Mode (or DRAM Cache Mode) • Adapt to variation of application behavior 32MB DRAM Cache Core Core 32MB DRAM Cache (Power Gated) 4MB Tag SRAM Core Core 4MB Cache SRAM Cache Mode DRAM Cache Mode

  18. Microarchitecture (1/2) Way 0 Way 1 Tag Tag 32MB DRAM Cache Core Core 4MB Tag SRAM Way 0 Way 1 2way set-associative DRAM Cache 2way set-associative SRAM Cache

  19. Microarchitecture (2/2) 64b physical address Offset Assume Ld==Ls==64B Tag field SARM(Size: Cs, Block: Ls, Asso. Ws) Index DARM(Size: d Block: Ld, Asso. Wd) = = MUX MUX = = Data (SRAM) MUX Hit/Miss (SRAM) Data (DRAM) Hit/Miss (DRAM)

  20. How to Adapt FFT • Static Approach • Optimizes at program level • Does not change it during execution • Needs a static analysis • Dynamic Approach • Optimizes at interval level (or phase level) • Needs a run-time profiling FMM Barnes Ocean

  21. Experimental Set Up • Processor: In-Order • Benchmarks: SPEC CPU 2000, Splash2 • The operation mode is set at the beginning of the program execution (and is maintained until the end) • Assume an appropriate operation mode is know for each benchmark 2D-BASE 3D-CONV 3D-HYBRID L1D, L1I Caches: 32KB Access Lat.:2clock cycles Core@3GHz Core@3GHz • L2 SRAM Cache • 2MB, 64B Block • 8way • Lat. 6 clock cycles • 3D DRAM Cache • 32MB, 64B Block • 8way • Lat. 28 clock cycles L1 I/D L1 I/D 3D DRAM L2 Cache 2D SRAM L2Cache Lat.:181 clock cycles Main memory Main memory

  22. Evaluation Results 2D-BASE3D-CONV3D-HYBRID Benchmark Programs

  23. How to Adapt FFT • Static Approach • Optimizes at program level • Does not change it during execution • Needs a static analysis • Dynamic Approach • Optimizes at interval level (or phase level) • Needs a run-time profiling FMM Barnes Ocean

  24. Run-Time Mode Selection • Divide Program Execution into “epochs”, e.g. 200K L2 Misses • Predict an Appropriate Operation Mode for Next Epoch • On SRAM mode, a small tag RAM which stores sampled tags is used to predict DRAM mode miss rates Hardware Support for Measurement if then transit from SRAM mode to DRAM mode! N-2 N-1 N N+1 epoch Operation Mode SRAMCache Mode DRAMCache Mode

  25. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

  26. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

  27. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

  28. Outline • Why 3D? • Will 3D always work well? • Adaptive Execution! • Conclusions

  29. Conclusions • The 3D solution is one of the most promising ways to achieve… • High performance • Low energy • It does not ALWAYS work well! • Run-time adaptive execution by considering memory access behavior

  30. Acknowledgement • This research was supported in part by New Energy and Industrial Technology Development Organization

  31. Backup Slides

  32. How to Adapt FFT • Static Approach • Optimizes at program level • Does not change it during execution • Needs a static analysis • Dynamic Approach • Optimizes at interval level (or phase level) • Needs a run-time profiling FMM Barnes Ocean

  33. Run-Time Mode Selection • Divide Program Execution into “epochs”, e.g. 200K L2 Misses • Predict an Appropriate Operation Mode for Next Epoch • On SRAM mode, a small tag RAM which stores sampled tags is used to predict DRAM mode miss rates Hardware Support for Measurement if then transit from SRAM mode to DRAM mode! N-1 N N+1 N+2 epoch Operation Mode SRAMCache Mode DRAMCache Mode

  34. Experimental Set Up • Processor: In-Order • Benchmarks: SPEC CPU 2000, Splash2 SRAMCache Mode DRAMCache Mode Size:32KB Access Time : 2clock cycles Core Core • Size : 2MB • Access Time : 6clock cycles • Size: 32MB • Access Time : 28clock cycles L1 Cache L1 Cache L2 Cache L2 Cache Access Time:181clock cycles Main Memory Main Memory

  35. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

  36. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

  37. Results 2D-SRAMDRAM-STACKHYBRID-IDEALHYBRID

More Related