
Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scalability
36th International Symposium on Computer Architecture (ISCA 2009)

Brian Rogers†‡, Anil Krishna†‡, Gordon Bell‡, Ken Vu‡, Xiaowei Jiang†, Yan Solihin†
† NC State University


Presentation Transcript


  1. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scalability. 36th International Symposium on Computer Architecture. Brian Rogers†‡, Anil Krishna†‡, Gordon Bell‡, Ken Vu‡, Xiaowei Jiang†, Yan Solihin†. † NC State University

  2. As Process Technology Scales … [Figure: successive chips pack ever more cores (P) and caches ($) while still attaching to the same DRAM] Scaling the Bandwidth Wall -- ISCA 2009

  3. Problem
  • Core growth >> memory bandwidth growth
    • Cores: ~exponential growth (driven by Moore's Law)
    • Bandwidth: much slower growth (pin and power limitations)
  • At each relative technology generation T: (# cores = 2^T) >> (bandwidth = B^T)
  • Some key questions (our contributions):
    • How constraining is the increasing gap between # of cores and available memory bandwidth?
    • How should future CMPs be designed; how should we allocate transistors to caches and cores?
    • What techniques can best reduce memory traffic demand?
  • Approach: build an analytical CMP memory bandwidth model

  4. Agenda
  • Background / Motivation
  • Assumptions / Scope
  • CMP Memory Traffic Model
  • Alternate Views of Model
  • Memory Traffic Reduction Techniques: Indirect, Direct, Dual
  • Conclusions

  5. Assumptions / Scope
  • Homogeneous cores
  • Single-threaded cores (multi-threading adds to the problem)
  • Co-scheduled sequential applications
    • Multi-threaded apps with data sharing evaluated separately
  • Enough work to keep all cores busy
  • Workloads static across technology generations
  • Equal amount of cache per core
  • Power/energy constraints outside the scope of this study

  6. Agenda
  • Background / Motivation
  • Assumptions / Scope
  • CMP Memory Traffic Model
  • Alternate Views of Model
  • Memory Traffic Reduction Techniques: Indirect, Direct, Dual
  • Conclusions

  7. Cache Miss Rate vs. Cache Size
  • The relationship follows the Power Law (Hartstein et al.'s √2 rule):
      M = M0 · R^(−α)
    where M0 = baseline miss rate, R = new cache size / old cache size, and α = sensitivity of the workload to cache-size change
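The √2 rule drops straight out of the formula. A minimal sketch (the function name and the 10% baseline miss rate are illustrative, not from the slides):

```python
def miss_rate(m0: float, r: float, alpha: float) -> float:
    """Power-law miss rate: M = M0 * R**(-alpha)."""
    return m0 * r ** (-alpha)

# With alpha = 0.5, doubling the cache (R = 2) divides the miss
# rate by sqrt(2) -- hence the "sqrt(2) rule".
m = miss_rate(0.10, 2.0, 0.5)   # 0.10 / 1.414 ~ 0.071
```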

  8. CMP Traffic Model
  • Express chip area in terms of Core Equivalent Areas (CEAs): core = 1 CEA, unit of cache = 1 CEA
  • P = # cores, C = # cache CEAs, N = P + C, S = C/P
  • Assume non-core and non-cache components require a constant fraction of the area
  • Add a # of cores term to the miss-rate law for the CMP model: Traffic ∝ P · M = P · M0 · S^(−α)

  9. CMP Traffic Model (2)
  • P = # cores, C = # cache CEAs, N = P + C, S = C/P
  • Going from CMP1 = <P1, C1> to CMP2 = <P2, C2>
  • Remove common terms and express M2 in terms of M1: M2 = M1 · (S2/S1)^(−α)

  10. One Generation of Scaling
  • Baseline processor: 8 cores, 8 cache CEAs
    • N1 = 16, P1 = 8, C1 = 8, S1 = 1, and ~fully utilized BW
    • α = 0.5
  • How many cores are possible if 32 CEAs are now available?
  • Ideal scaling = 2x the # of cores at each successive technology generation
  [Chart: ideal scaling vs. BW-limited scaling]
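The BW-limited answer can be reproduced numerically. Under the model, total traffic scales as P · S^(−α); holding it at the baseline level (8 cores at S = 1, i.e. 8 units) while splitting 32 CEAs between cores and cache, bisection finds the largest feasible core count. A sketch under those assumptions (`supported_cores` is an illustrative helper, not code from the paper):

```python
def supported_cores(n2: float, budget: float, alpha: float = 0.5) -> float:
    """Largest P2 such that traffic P2 * S2**(-alpha) stays within `budget`,
    where S2 = (n2 - P2) / P2 is cache CEAs per core.  Traffic rises
    monotonically with P2 (more cores, less cache each), so bisect."""
    lo, hi = 1e-6, n2 - 1e-6          # lo feasible, hi infeasible
    for _ in range(100):
        mid = (lo + hi) / 2
        s2 = (n2 - mid) / mid
        if mid * s2 ** (-alpha) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Baseline: P1 = 8, S1 = 1 => traffic budget of 8 * 1**(-0.5) = 8 units.
p2 = supported_cores(32, 8.0)         # ~11 cores, versus 16 under ideal scaling
```

The result is consistent with slide 13's "2x die area: 1.4x cores" figure (8 × 1.4 ≈ 11): the remaining ~21 CEAs must go to cache just to hold traffic flat.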

  11. Agenda
  • Background / Motivation
  • Assumptions / Scope
  • CMP Memory Traffic Model
  • Alternate Views of Model
  • Memory Traffic Reduction Techniques: Indirect, Direct, Dual
  • Conclusions

  12. CMP Design Constraint
  • P = # cores, C = # cache CEAs, N = P + C, S = C/P
  • If available off-chip BW grows by a factor of B, total memory traffic should grow by at most a factor of B each generation
  • Write S2 in terms of P2 and N2: S2 = (N2 − P2)/P2
  • New technology: N2 CEAs, B bandwidth growth => solve numerically for P2, the # of cores that can be supported
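The "solve for P2 numerically" step can be sketched with bisection. A self-contained sketch (the bandwidth growth factor B = 1.4 and the helper name are illustrative, not values from the paper):

```python
def supported_cores(n2: float, traffic1: float, b: float,
                    alpha: float = 0.5) -> float:
    """Largest P2 whose traffic P2 * S2**(-alpha), with S2 = (n2 - P2) / P2,
    stays within b * traffic1 (off-chip bandwidth grew by a factor of b)."""
    lo, hi = 1e-6, n2 - 1e-6          # traffic is increasing in P2, so bisect
    for _ in range(100):
        mid = (lo + hi) / 2
        s2 = (n2 - mid) / mid
        if mid * s2 ** (-alpha) <= b * traffic1:
            lo = mid
        else:
            hi = mid
    return lo

# Baseline traffic of 8 units (8 cores at S = 1), 32 CEAs in the new generation:
p_flat = supported_cores(32, 8.0, 1.0)   # BW flat: ~11 cores
p_grow = supported_cores(32, 8.0, 1.4)   # BW grows 1.4x: ~13 cores
```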

  13. Scaling Under Area Constraints
  • With an increasing # of CEAs available, how many cores can be supported at a constant BW requirement?
    • 2x die area: 1.4x cores
    • 4x die area: 1.9x cores
    • 8x die area: 2.4x cores
    • 16x die area: 3.2x cores
    • …

  14. Agenda
  • Background / Motivation
  • Assumptions / Scope
  • CMP Memory Traffic Model
  • Alternate Views of Model
  • Memory Traffic Reduction Techniques: Indirect, Direct, Dual
  • Conclusions

  15. Categories of Techniques
  • Indirect: cache compression, DRAM caches, 3D-stacked cache, unused data filter, smaller cores
  • Direct: link compression, sectored caches
  • Dual: cache + link compression, small cache lines, data sharing

  16. Indirect – DRAM Cache
  • F – influenced by increased density
  [Chart: scaling vs. ideal scaling]

  17. Direct – Link Compression
  • R – influenced by compression ratio
  [Chart: scaling vs. ideal scaling]
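In the model, a link compression ratio R shrinks every transferred line by a factor of R, which is equivalent to multiplying the traffic budget by R. A self-contained sketch (R = 1.5 is purely illustrative, and `supported_cores` is a hypothetical helper that bisects on the core count):

```python
def supported_cores(n2: float, budget: float, alpha: float = 0.5) -> float:
    """Largest P2 with traffic P2 * ((n2 - P2) / P2)**(-alpha) <= budget;
    traffic is increasing in P2, so bisection converges to the boundary."""
    lo, hi = 1e-6, n2 - 1e-6
    for _ in range(100):
        mid = (lo + hi) / 2
        s2 = (n2 - mid) / mid
        if mid * s2 ** (-alpha) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

R = 1.5                                   # illustrative compression ratio
base = supported_cores(32, 8.0)           # no compression: ~11 cores
comp = supported_cores(32, 8.0 * R)       # compressed links fit 1.5x the traffic
```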

  18. Dual – Small Cache Lines
  • F, R – influenced by % unused data
  [Chart: scaling vs. ideal scaling]

  19. Dual – Data Sharing
  • Please see the paper for details on the modeling of sharing
  • Data sharing is unlikely to provide a scalable solution

  20. Summary of Individual Techniques
  [Chart comparing the indirect, direct, and dual techniques]

  21. Summary of Combined Techniques

  22. Conclusions
  • Contributions:
    • A simple, powerful analytical CMP memory traffic model
    • Quantify the significance of the memory BW wall: only ~10% of chip area can go to cores after 4 generations under a constant traffic requirement
    • Guide the design (cores vs. cache) of future CMPs: given a fixed chip area and BW scaling, how many cores?
    • Evaluate memory traffic reduction techniques: combinations can enable ideal scaling for several generations
  • Need bandwidth-efficient computing:
    • Hardware/architecture level: DRAM caches, cache/link compression, prefetching, smarter memory controllers, etc.
    • Technology level: 3D chips, optical interconnects, etc.
    • Application level: working-set reduction, locality enhancement, data vs. pipelined parallelism, computation vs. communication trade-offs, etc.

  23. Questions? Thank you. Brian Rogers, bmrogers@ece.ncsu.edu
