
Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies



Presentation Transcript


  1. Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies Aniruddha Udipi N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni and D. Newell University of Utah and Intel STL

  2. Motivation • Many-core designs require large cache capacity for performance • SRAM has low latency and consumes less power • DRAM has 8X the density but poor latency/power characteristics • Can we design a hybrid SRAM-DRAM cache to take advantage of both technologies? • Can we build a customized on-chip network specifically targeted at such a design? University of Utah

  3. Proposal - 3D Stacked Hybrid Cache [Figure: a DRAM bank stacked directly above an SRAM bank] • Not an option in conventional 2D design • 3D mixed-process stacking enables a single vertical SRAM/DRAM bank University of Utah

  4. Executive Summary • 3D stacked hybrid cache design • Synergistic proposals to improve performance and power efficiency • Optimizing Capacity • Reconfigurable cache hierarchy • Optimizing Communication • Page coloring for effective data placement - reduced communication • Tailor-made on-chip interconnection network - quicker communication • Up to 62% performance increase University of Utah

  5. Outline • Overview of 3D Technology • Technique I - Reconfigurable Cache Hierarchy • Technique II - Page coloring • Technique III - On-chip Interconnection Network • Evaluation • Conclusions University of Utah

  6. 3D Technology [Figure: cross-section of a two-die stack showing I/O bumps, through-silicon vias (TSVs), bulk Si, active Si, metal layers, die-to-die vias, and the heat sink. Source: Black et al., MICRO '06] + Mixed-process integration possible + High-speed vertical interconnects - Thermal issues University of Utah

  7. Baseline Model • Upper die: 16 SRAM banks with a grid-based on-chip network • Lower die: 16 processing cores University of Utah

  8. Outline • Overview of 3D Technology • Technique I - Reconfigurable Cache Hierarchy • Technique II - Page coloring • Technique III - On-chip Interconnection Network • Evaluation • Conclusions University of Utah

  9. Technique I - Reconfigurable hierarchy • Increase capacity by stacking a DRAM bank on each SRAM cache bank; reconfigure bank size based on demand • More compelling with 3D and NUCA • Spare capacity on the third die does not interfere with the layout of the second die or steal capacity from neighboring caches • The cache is already partitioned into NUCA banks, so additional banks do not complicate the logic much • Access time grows less than linearly with capacity • Dramatic increase in capacity, but no gradation: only two choices • Turn off the DRAM bank for small working-set sizes University of Utah

  10. Proposed Reconfigurable Cache Model [Figure: three stacked dies] • Top die: 16 DRAM banks and no interconnect • Inter-die via pillar to access the portion of L2 in DRAM (not shown: one pillar per sector) • Middle die: 16 SRAM banks and the tree interconnect • Inter-die via pillar to send requests from a core to the L2 SRAM (not shown: one pillar for each core) • Bottom die: 16 cores University of Utah

  11. Proposed Reconfiguration Policy • Simple heuristic for enabling/disabling a DRAM bank, applied every reconfiguration interval (sketched below): • If usage is low and the cache-bank miss rate is low, disable the DRAM bank above • If usage is high and the cache-bank miss rate is high, enable the DRAM bank above • The reconfiguration interval is 10 million cycles • All cores are stalled for 100K cycles during reconfiguration University of Utah
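A minimal sketch of this per-bank heuristic, for illustration only: the interval and stall duration come from the slide, while the usage and miss-rate thresholds and the counter names are assumptions, not values from the paper.

```python
# Hedged sketch of the per-bank reconfiguration heuristic described above.
# Thresholds and counter names are illustrative assumptions.

RECONFIG_INTERVAL_CYCLES = 10_000_000   # reconfiguration interval (from the slide)
STALL_CYCLES = 100_000                  # cores stall this long during reconfiguration

USAGE_THRESHOLD = 0.5        # assumed: fraction of SRAM ways recently used
MISS_RATE_THRESHOLD = 0.05   # assumed: cache-bank miss-rate cutoff

class BankStats:
    """Counters gathered for one SRAM bank over a reconfiguration interval."""
    def __init__(self):
        self.accesses = 0
        self.misses = 0
        self.used_way_fraction = 0.0   # assumed usage metric

def reconfigure_bank(stats: BankStats, dram_enabled: bool) -> bool:
    """Return the new enable state for the DRAM bank stacked above this SRAM bank."""
    miss_rate = stats.misses / stats.accesses if stats.accesses else 0.0
    usage = stats.used_way_fraction

    if usage < USAGE_THRESHOLD and miss_rate < MISS_RATE_THRESHOLD:
        return False            # small working set: disable DRAM above, save power
    if usage >= USAGE_THRESHOLD and miss_rate >= MISS_RATE_THRESHOLD:
        return True             # capacity pressure: enable DRAM above
    return dram_enabled         # otherwise keep the current configuration
```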

  12. Cache Organization [Figure: one cache bank under low vs. high access pressure. Low pressure: DRAM disabled, 4 SRAM data ways, 1 MB total capacity. High pressure: DRAM enabled with 32 ways, 2 SRAM data ways, the adaptive arrays become tag arrays for the ways in DRAM, 9 MB total capacity] University of Utah

  13. Cache Organization • SRAM banks have three memory arrays - tag array, data array, and adaptive array (which can act as either tag or data) • Whenever a DRAM bank is switched on, the tags for its ways are implemented in part of the SRAM • Quick lookup of tags • The increased capacity manifests as additional ways • Cache lines in SRAM need not be flushed on reconfiguration • Two data ways remain available at low latency; moving MRU data into these ways further increases efficiency University of Utah

  14. Why is this better than an L2/L3 hierarchy? • An L2/L3 organization pays an additional access penalty on an L2 miss before the L3 is accessed to service the request • In our scheme, we look up all tags in parallel in the SRAM • An additional level implies additional coherence complexity • Our experiments show non-trivial performance degradation when implementing SRAM/DRAM as L2/L3 compared to our scheme University of Utah
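A hedged sketch of the lookup flow implied here: because the tags for both SRAM and DRAM ways live in SRAM (slide 13), a single tag probe decides where the data resides, so a DRAM hit does not pay a separate miss-then-lookup penalty. Way counts follow slide 12; the data structures are illustrative, not the paper's implementation.

```python
# Sketch of a lookup in the hybrid bank: all tags are kept in SRAM, so the
# SRAM-way and DRAM-way tags are checked in one (parallel) probe.
# Way counts follow slide 12; structure and latencies are assumptions.

SRAM_DATA_WAYS = 2    # SRAM data ways when DRAM is enabled
DRAM_WAYS = 32        # ways provided by the stacked DRAM bank

def lookup(set_tags_sram, set_tags_dram, tag):
    """Return ('sram', way), ('dram', way), or ('miss', None) for one set."""
    for way, t in enumerate(set_tags_sram):
        if t == tag:
            return ('sram', way)    # served from a low-latency SRAM data way
    for way, t in enumerate(set_tags_dram):
        if t == tag:
            return ('dram', way)    # data fetched from the DRAM die above
    return ('miss', None)           # single miss decision; no second-level lookup
```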

  15. Outline • Overview of 3D Technology • Technique I - Reconfigurable Cache Hierarchy • Technique II - Page coloring • Technique III - On-chip Interconnection Network • Evaluation • Conclusions University of Utah

  16. Technique II - Page Coloring • OS can control what physical page number is assigned to each virtual page, thus controlling the index • It can be manipulated to redirect cache line placements [Figure: address breakdown. Physical address: physical page number | offset. Cache view: tag | index | offset. The page color is the part of the physical page number that overlaps the cache index] University of Utah
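A minimal sketch of the bit manipulation involved, assuming 4 KB pages (as in the methodology slide), a 16-bank L2, and bank selection by the low-order page-color bits; the field widths and the frame-selection helper are illustrative assumptions.

```python
# Sketch: page-color bits in the physical page number select the L2 bank.
# 4 KB pages come from the methodology slide; the bank-select bit position
# and the OS-side frame picker are illustrative assumptions.

PAGE_SIZE = 4096          # 4 KB pages
NUM_BANKS = 16            # 16 L2 banks
PAGE_OFFSET_BITS = 12     # log2(PAGE_SIZE)

def page_color(phys_addr: int) -> int:
    """Low-order bits of the physical page number that overlap the cache index."""
    ppn = phys_addr >> PAGE_OFFSET_BITS
    return ppn & (NUM_BANKS - 1)

def pick_frame(free_frames, target_bank: int) -> int:
    """OS-side view: choose a free physical frame whose color maps to target_bank."""
    for frame in free_frames:
        if frame & (NUM_BANKS - 1) == target_bank:   # frame number's low bits = color
            return frame
    raise MemoryError("no free frame of the requested color")
```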

  17. Page Coloring • Page coloring employed to map data to banks based on proximity to cores. • We assume an offline oracle page-coloring implementation • Policies depend upon 2 criteria: • Knowledge of a page being private or shared • Knowledge of a page being data or code • More capacity pressure on banks carrying shared data University of Utah

  18. Proposed Page Coloring Schemes [Figure legend: shared data + code, private page, shared data, private code] • Share4:D+I - shared data & code mapped to the central 4 banks • Rp:I+Share4:D - shared data to the central 4 banks; code replicated • Share16:D+I - shared data + code distributed to all 16 banks University of Utah
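The three schemes can be summarized as a simple placement policy. In the sketch below, the central banks are assumed to be 5, 6, 9, and 10 (consistent with the shared banks called out in the evaluation), and the "local bank = core id" proximity mapping is an assumption for illustration.

```python
# Sketch of the three page-coloring schemes as a bank-selection policy.
# Central-bank numbering follows the shared banks named in the evaluation
# slides; the core-to-local-bank mapping is an assumption.

CENTRAL_BANKS = [5, 6, 9, 10]
ALL_BANKS = list(range(16))

def assign_bank(scheme: str, is_shared: bool, is_code: bool,
                requesting_core: int, page_id: int) -> int:
    """Return the bank a page should be colored into under each scheme."""
    local_bank = requesting_core             # assumed core-to-bank proximity mapping

    if not is_shared:
        return local_bank                    # private pages go to the local bank

    if scheme == "Share4:D+I":               # shared data & code in the central 4 banks
        return CENTRAL_BANKS[page_id % 4]
    if scheme == "Rp:I+Share4:D":            # code replicated locally, data centralized
        return local_bank if is_code else CENTRAL_BANKS[page_id % 4]
    if scheme == "Share16:D+I":              # shared data & code spread over all banks
        return ALL_BANKS[page_id % 16]
    raise ValueError(f"unknown scheme: {scheme}")
```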

  19. Outline • Overview of 3D Technology • Technique I - Reconfigurable Cache Hierarchy • Technique II - Page coloring • Technique III - On-chip Interconnection Network • Evaluation • Conclusions University of Utah

  20. Technique III - Interconnection network [Figure: proposed tree topology showing links and routers; several routers are saved relative to the grid] University of Utah

  21. On-chip tree network • Predictable traffic pattern • Data moves between the core and either the shared central banks or its private overhead bank • Decreased router overhead • Saves energy and time University of Utah
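To make "decreased router overhead" concrete, here is an illustrative comparison of routers traversed under the baseline 4x4 grid and a shallow two-level tree; the exact tree topology and depth in the paper are not reproduced, so the numbers only show the trend.

```python
# Illustrative only: routers traversed per bank-to-bank transfer under the
# baseline 4x4 grid vs. an assumed two-level tree (4 leaf routers + 1 root).

import itertools

def grid_routers(src: int, dst: int, width: int = 4) -> int:
    """Routers traversed in a width x width mesh (Manhattan hops + source router)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy) + 1

def tree_routers(src: int, dst: int, leaves_per_router: int = 4) -> int:
    """Routers traversed in the assumed two-level tree."""
    if src // leaves_per_router == dst // leaves_per_router:
        return 1          # same leaf router
    return 3              # leaf router, root, leaf router

pairs = list(itertools.permutations(range(16), 2))
avg_grid = sum(grid_routers(s, d) for s, d in pairs) / len(pairs)
avg_tree = sum(tree_routers(s, d) for s, d in pairs) / len(pairs)
print(f"grid: {avg_grid:.2f} routers/transfer, tree: {avg_tree:.2f} routers/transfer")
# Fewer routers per transfer (and 5 routers total instead of 16) is the source
# of the energy and latency savings claimed on this slide.
```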

  22. Synergy between proposals [Figure: page coloring, tree network, and hybrid 3D cache reinforcing one another] • No search (S-NUCA) • Radiating traffic pattern • Increased bank capacity with low latency • No spills into neighboring banks University of Utah

  23. Outline • Overview of 3D Technology • Technique I - Reconfigurable Cache Hierarchy • Technique II - Page coloring • Technique III - On-chip Interconnection Network • Evaluation • Conclusions University of Utah

  24. Methodology • Intel ManySim trace-based simulator • CACTI cache model for area, power and access latencies • HotSpot 4.0 for thermal evaluation • 16 cores, 32 nm process, 4 GHz clock • 4 KB page granularity • 1 MB SRAM banks and 8 MB DRAM banks • SAP, SPECjbb, TPC-C and TPC-E commercial multi-threaded workload traces University of Utah

  25. Workload Characterization • Working set size of code pages is 0.6% of data pages • Average code page access count is 57% University of Utah

  26. Page Coloring Evaluation • Code replication is favorable when capacity is available • Capacity constraints favor distributing shared pages University of Utah

  27. Interconnect Evaluation • Without page coloring, most accesses are random • With code replication, most accesses are local • Network power savings of up to 48% University of Utah

  28. Hybrid Cache Evaluation • The reconfigurable cache (with code replication) performs 55% better than Base-1 • ~5% IPC drop, in exchange for power savings [Figure: evaluated configurations - Proposed Chip: cores + SRAM L2 + stacked DRAM L2 ways; Base-No-PC: cores + SRAM L2; Base-2x-No-PC: cores + double-capacity SRAM L2; Base-3-level: cores + SRAM L2 + stacked DRAM L3] University of Utah

  29. SRAM-DRAM Hits without Reconfiguration Most accesses are to SRAM ways except in shared banks (5,6,9,10) University of Utah

  30. SRAM-DRAM Hits with Reconfiguration University of Utah

  31. Reconfiguration Policy • Shared banks have DRAM always enabled • SPECjbb: DRAM always enabled, since the majority of its pages are private University of Utah

  32. Related Work • Reconfigurable caches in 2D: Ranganathan et al. (ISCA '00), Balasubramonian et al. (MICRO '00), Zhang et al. (ISCA '03) • 3D cache hierarchy: Liu et al. (IEEE D&T '05), Loi et al. (DAC '06), Kgil et al. (ASPLOS '06), Loh (ISCA '08) • Page coloring for NUCA: Cho et al. (MICRO '06), Awasthi et al. (HPCA '09), Chaudhuri (HPCA '09) • 3D NUCA interconnect: Li et al. (ISCA '06) • Ours is the first paper to propose a hybrid SRAM/DRAM cache, a tailor-made tree network, and their combination into a 3D hierarchy University of Utah

  33. Key Contributions • A synergistic cache design • Communication- and capacity-optimized 3D cache • Reconfigurable cache to improve performance while reducing power • OS-based page coloring for reduced communication • Tailor-made on-chip network for quicker communication • Significant increase in efficiency • Performance improvement of up to 62% • Network power savings of up to 48% • Typical thermal effect of +7 °C University of Utah

  34. Thank you.. • Questions? University of Utah
