
Towards Scalable and Energy-Efficient Memory System Architectures



Presentation Transcript


  1. Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian School of Computing University of Utah

  2. Main Memory Problems • 1. Energy • 2. High capacity at high bandwidth • 3. Reliability [diagram: processor connected to two DIMMs]

  3. Motivation: Memory Energy • Contributions of memory to overall system energy: 25-40%, per IBM, Sun, and Google server data summarized by Meisner et al., ASPLOS’09 • HP servers: 175 W out of ~785 W for 256 GB memory (HP power calculator) • Intel SCC: memory controller contributes 19-69% of chip power, ISSCC’10

  4. Motivation: Reliability • DRAM data from Schroeder et al., SIGMETRICS’09: 25K-70K errors per billion device hours per Mbit; 8% of DRAM DIMMs affected by errors every year • DRAM error rates may get worse as scalability limits are reached; PCM (hard and soft) error rates expected to be high as well • Primary concern: storage and energy overheads for error detection and correction • ECC support is not too onerous; chipkill is much worse

  5. Motivation: Capacity, Bandwidth [diagram: processor connected to two DIMMs]

  6. Motivation: Capacity, Bandwidth • Cores are increasing, but pins are not [diagram: processor connected to two DIMMs]

  7. Motivation: Capacity, Bandwidth • Cores are increasing, but pins are not • High channel frequency → fewer DIMMs • Can’t have high capacity, high bandwidth, and low energy: pick 2 of the 3! • Will eventually need disruptive shifts: NVM, optics [diagram: processor connected to two DIMMs]

  8. Memory System Basics • Multiple on-chip memory controllers that handle multiple 64-bit channels [diagram: processor with memory controllers (M), each driving its own DIMM]

  9. Memory System Basics: FB-DIMM • FB-DIMM: can boost capacity with narrow channels and buffering at each DIMM [diagram: processor with memory controllers (M), each driving a daisy chain of DIMMs]

  10. What’s a Rank? • Rank: the DRAM chips required to provide the 64b output expected by a JEDEC standard bus • For example: eight x8 DRAM chips [diagram: memory controller driving a 64b bus to a DIMM with eight x8 chips]

  11. What’s a Bank? • Bank: a portion of a rank that is tied up when servicing a request; multiple banks in a rank enable parallel handling of multiple requests [diagram: one bank spanning the eight x8 chips of a rank]

  12. What’s an Array? • Array: a matrix of cells • One array provides 1 bit/cycle • Each array reads out an entire row • Large array → high density [diagram: a bank within each x8 chip, composed of arrays]
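
A note on how channels, ranks, banks, rows, and columns map onto an address: the sketch below decodes a physical address into those fields. The field widths (13 column bits for an 8 KB row, 8 banks, 2 ranks, 2 channels) are assumptions chosen for illustration, not values from the talk or any JEDEC standard.

```python
# Illustrative decode of a physical address into DRAM coordinates.
# Field widths are assumptions for this sketch, not standard-mandated values.
COL_BITS, BANK_BITS, RANK_BITS, CHAN_BITS = 13, 3, 1, 1   # 8 KB rows, 8 banks, 2 ranks, 2 channels

def decode(addr):
    col = addr & ((1 << COL_BITS) - 1); addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1); addr >>= BANK_BITS
    rank = addr & ((1 << RANK_BITS) - 1); addr >>= RANK_BITS
    chan = addr & ((1 << CHAN_BITS) - 1); addr >>= CHAN_BITS
    return {"channel": chan, "rank": rank, "bank": bank, "row": addr, "column": col}

print(decode(0x3FA21C40))   # -> {'channel': 1, 'rank': 0, 'bank': 0, 'row': 4071, 'column': 7232}
```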

  13. What’s a Row Buffer? [diagram: a DRAM array with a wordline selecting a row onto the bitlines; RAS latches the row into the row buffer, and CAS selects bits from the row buffer to the output pin]

  14. Row Buffer Management • Row buffer: collection of rows read out by arrays in a bank • Row buffer hits incur low latency and low energy • Bitlines must be precharged before a new row can be read • Open page policy: delays the precharge until a different row is encountered • Close page policy: issues the precharge immediately
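
A minimal sketch of the two policies, using a single bank and made-up latencies (the three timing constants are placeholders, not datasheet values); it only illustrates why row buffer hits are cheap under an open page policy.

```python
# Toy single-bank model comparing open-page and close-page policies.
# Latencies are illustrative placeholders in arbitrary cycles.
T_PRECHARGE, T_ACTIVATE, T_CAS = 15, 15, 15

def access(requested_row, open_row, policy):
    """Return (latency, new open_row); open_row is None when the bank is precharged."""
    if policy == "open" and open_row == requested_row:
        return T_CAS, requested_row                       # row buffer hit
    latency = T_ACTIVATE + T_CAS
    if policy == "open" and open_row is not None:
        latency += T_PRECHARGE                            # must first close the old row
    return latency, (requested_row if policy == "open" else None)

def total_latency(rows, policy):
    open_row, total = None, 0
    for r in rows:
        lat, open_row = access(r, open_row, policy)
        total += lat
    return total

stream = [3, 3, 3, 7, 7, 3]                               # some row-buffer locality
print(total_latency(stream, "open"), total_latency(stream, "close"))   # 165 180
```

With locality, open page wins; a stream that never reuses a row would flip the comparison, because every open-page miss then also pays the delayed precharge.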

  15. Primary Sources of Energy Inefficiency • Overfetch: 8 KB of data read out for each cache line request • Poor row buffer hit rates: diminished locality in multi-cores • Electrical medium: bus speeds have been increasing • Reliability measures: overhead in building a reliable system from inherently unreliable parts
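
To put the overfetch figure in perspective: assuming a 64-byte cache line (a typical size, not stated on the slide), a request uses only 64 B of the 8 KB that is read out, i.e. 64/8192 ≈ 0.8% of the activated bits.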

  16. SECDED Support [diagram: 64-bit data word plus 8-bit ECC] • One extra x8 chip per rank • Storage and energy overhead of 12.5% • Cannot handle complete failure in one chip
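
The 12.5% figure follows from the usual SEC-DED construction: a 64-bit word needs 8 check bits, i.e. one extra x8 chip alongside eight data chips. The sketch below computes the check-bit count from the generic Hamming bound; it is a textbook calculation, not the exact code used by any particular DIMM.

```python
# Check bits for a SEC-DED (single-error-correct, double-error-detect) code:
# smallest r with 2**r >= data_bits + r + 1 for single-error correction,
# plus one extra parity bit for double-error detection.
def secded_check_bits(data_bits):
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for word in (8, 64):
    c = secded_check_bits(word)
    print(f"{word}-bit word: {c} check bits -> {c / word:.1%} overhead")
# 8-bit word: 5 check bits -> 62.5% overhead
# 64-bit word: 8 check bits -> 12.5% overhead
```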

  17. Chipkill Support I [diagram: 64-bit data word plus 8-bit ECC, at most one bit from each DRAM chip] • Use 72 DRAM chips to read out 72 bits • Dramatic increase in activation energy and overfetch • Storage overhead is still 12.5%

  18. Chipkill Support II [diagram: 8-bit data word plus 5-bit ECC, at most one bit from each DRAM chip] • Use 13 DRAM chips to read out 13 bits • Storage and energy overhead: 62.5% • Other options exist; trade-off between energy and storage
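
Running the SEC-DED sketch from slide 16 with an 8-bit data word gives the 5 check bits used here, which is where the 62.5% (5/8) overhead comes from; the wider 64-bit word of the previous slide amortizes its 8 check bits down to 12.5%, but at the cost of activating 72 chips per access.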

  19. Summary So Far • We now understand… • why memory energy is a problem - overfetch, row buffer miss rates • why reliability incurs high energy overheads - chipkill support requires high activation per useful bit • why capacity and bandwidth increases cost energy - need high frequency and buffering per hop

  20. Crucial Timing • Disruptive changes may be compelling today… • Increasing role of memory energy • Increasing role of memory errors • Impact of multi-core: high bandwidth needs, loss of locality • Emerging technologies (NVM, optics) - will require a revamp of memory architecture - ideas can be easily applied to NVM - role of DRAM may change

  21. Attacking the Problem • Find ways to maximize row buffer utility • Find ways to reduce overfetch • Treat reliability as a first-class design constraint • Use photonics and 3D to boost capacity and bandwidth • Solutions must be very cost-sensitive

  22. Maximizing Row Buffer Locality • Micro-pages (ASPLOS’10) • Handling multiple memory controllers (PACT’10) • On-going work: better write scheduling, better bank management (data mapping, row closure)

  23. Micro-Pages • Key observation: most accesses to a page are localized to a small region (micro-page)

  24. Solution • Identify hot micro-pages • Co-locate hot micro-pages in reserved DRAM rows • Memory controller keeps track of re-direction • Low overheads if applications have few hot micro-pages that account for most memory accesses [diagram: processor with memory controller redirecting accesses within a DIMM]
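
A minimal sketch of the redirection mechanism, under assumed parameters (1 KB micro-pages, a fixed hotness threshold, a small pool of reserved slots); the actual identification policy and overheads are those of the ASPLOS’10 paper, not this toy.

```python
from collections import Counter

MICRO_PAGE = 1024        # assumed micro-page size in bytes
HOT_THRESHOLD = 64       # assumed access count before a micro-page is treated as hot

class MicroPageRemapper:
    """Toy memory-controller-side remapper: count accesses per micro-page and
    redirect hot micro-pages into a small set of reserved DRAM row slots."""
    def __init__(self, reserved_slots):
        self.counts = Counter()
        self.remap = {}                         # micro-page id -> reserved slot id
        self.free_slots = list(reserved_slots)

    def translate(self, addr):
        mpage, offset = divmod(addr, MICRO_PAGE)
        self.counts[mpage] += 1
        if (mpage not in self.remap and self.free_slots
                and self.counts[mpage] >= HOT_THRESHOLD):
            self.remap[mpage] = self.free_slots.pop()    # co-locate this hot micro-page
        if mpage in self.remap:
            return self.remap[mpage] * MICRO_PAGE + offset
        return addr

mc = MicroPageRemapper(reserved_slots=range(8))
```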

  25. Results • Overall 9% improvement in performance and 15% reduction in energy

  26. Handling Multiple Memory Controllers • Data mapping across multiple memory controllers is key: • Must equalize load and queuing delays • Must minimize “distance” • Must maximize row buffer hit rates [diagram: multiple memory controllers (M), each with its own DIMM]

  27. Solution • Cost function to guide initial page placement • Similar cost function to guide page migration • Initial page placement improves performance by 7%, page migration by 9% • Row buffer hit rates can be doubled
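
The exact cost function is in the PACT’10 paper; the sketch below is only an assumed form that combines the three goals from the previous slide, with made-up weights and inputs.

```python
# Hypothetical cost function for picking a memory controller for a page.
# Terms mirror the goals on slide 26; weights and field names are illustrative assumptions.
W_LOAD, W_DIST, W_LOCAL = 1.0, 0.5, 0.25

def placement_cost(stats):
    return (W_LOAD * stats["queue_occupancy"]         # equalize load / queuing delay
            + W_DIST * stats["hops_from_requester"]   # minimize "distance"
            - W_LOCAL * stats["expected_row_hits"])   # maximize row buffer hit rate

def choose_controller(controller_stats):
    return min(controller_stats, key=lambda c: placement_cost(controller_stats[c]))

ctrls = {
    "MC0": {"queue_occupancy": 12, "hops_from_requester": 1, "expected_row_hits": 4},
    "MC1": {"queue_occupancy": 3, "hops_from_requester": 3, "expected_row_hits": 1},
}
print(choose_controller(ctrls))   # -> MC1 with these made-up numbers
```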

  28. Reducing Overfetch • Key idea: eliminate overfetch by employing smaller arrays and activating a single array in a single chip • Single Subarray Access (SSA), ISCA’10 • Positive effects: minimizes activation energy; small activation footprint, so more arrays can be asleep longer; enables higher parallelism and reduces queuing delays • Negative effects: longer transfer time; drop in density; no row buffer hits; vulnerable to chip failure; change to standards
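
A rough back-of-the-envelope check of the key idea: if a conventional access activates a full 8 KB row across the rank (the overfetch figure from slide 15) while an SSA access activates only a cache-line-sized chunk in one chip (64 B assumed), the activation footprint shrinks by about 8192/64 = 128x; the measured dynamic-energy savings on the next slide are smaller because activation is only one component of dynamic energy.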

  29. Energy Results • Dynamic energy reduction of 6x • In some cases, 3x reduction in leakage

  30. Performance Results • SSA better on half the programs (mem-intensive ones)

  31. Support for Reliability • Checksum support per row allows low-cost error detection • Can build a 2nd-tier error-correction scheme, based on RAID [diagram: DRAM chips holding data rows with per-row checksums, plus a parity chip covering them] • Reads: single array read • Writes: two array reads and two array writes
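
The read/write counts follow from the standard RAID-style small-write update, where the new parity is computed as parity XOR old-data XOR new-data. The sketch below illustrates just that update; the per-row checksums and the actual data layout across chips are simplified away.

```python
# RAID-style small write across DRAM chips: parity' = parity ^ old_data ^ new_data.
def write_with_parity(data_chips, parity_chip, idx, new_value):
    old_value = data_chips[idx]         # array read 1: old data
    old_parity = parity_chip[0]         # array read 2: old parity
    data_chips[idx] = new_value         # array write 1: new data
    parity_chip[0] = old_parity ^ old_value ^ new_value    # array write 2: new parity

data = [0b1010, 0b0110, 0b1111]
parity = [data[0] ^ data[1] ^ data[2]]
write_with_parity(data, parity, 1, 0b0001)
assert parity[0] == data[0] ^ data[1] ^ data[2]   # parity still covers every data chip
```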

  32. Capacity and Bandwidth • Silicon photonics to break the pin barrier at the processor • But, several concerns at the DIMM: • Breaking the DRAM pin barrier will impact cost! • High capacity → daisy-chaining and loss of power • High static power for photonics; need high utilization • Scheduling for large capacities

  33. Exploiting 3D Stacks (ISCA’11) [diagram: processor and memory controller connected by a waveguide to 3D DRAM stacks on the DIMM, each stack sitting on an interface die with a stack controller] • Interface die for photonic penetration • Does not impact DRAM design • Few photonic hops; high utilization • Interface die schedules low-level operations

  34. Packet-Based Scheduling Protocol • High capacity → high scheduling complexity • Move to a packet-based interface: • Processor issues an address request • Processor reserves a slot for data return • Scheduling minutiae are handled by stack controller • Data is returned at the correct time • Back-up slot in case deadline is not met • Better plug’n’play • Reduced complexity at processor • Can handle heterogeneity
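
A minimal sketch of the processor-side view under this protocol, with an assumed slot width and back-up offset; the real reservation and deadline mechanisms are those of the ISCA’11 design, not this toy.

```python
import itertools

SLOT_CYCLES = 4        # assumed width of a data-return slot, in channel cycles
BACKUP_OFFSET = 16     # assumed gap before the back-up slot

class PacketScheduler:
    """Toy processor-side scheduler: the processor only reserves a data-return slot;
    all low-level DRAM scheduling is left to the stack controller on the DIMM."""
    def __init__(self):
        self.next_free_slot = 0
        self.ids = itertools.count()

    def issue_request(self, addr, now, expected_latency):
        # Reserve the earliest free return slot at or after the expected completion time.
        slot = max(self.next_free_slot, now + expected_latency)
        self.next_free_slot = slot + SLOT_CYCLES
        # Back-up slot used if the stack controller misses the primary deadline
        # (not exclusively reserved in this simplification).
        return {"id": next(self.ids), "addr": addr,
                "return_slot": slot, "backup_slot": slot + BACKUP_OFFSET}

sched = PacketScheduler()
print(sched.issue_request(addr=0x1000, now=0, expected_latency=50))
```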

  35. Summary • Treat reliability as a first-order constraint • Possible to use photonics to break the pin barrier without disrupting memory chip design: boosts bandwidth and capacity! • Can reduce memory chip energy by reducing overfetch and with better row buffer management

  36. Acks • Terrific students in the Utah Arch group • Prof. Al Davis (Utah) and collaborators at HP, Intel, IBM • Funding from NSF, Intel, HP, University of Utah
