
Towards Scalable and Energy-Efficient Memory System Architectures



Presentation Transcript


  1. Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan, Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor School of Computing University of Utah

  2. Towards Scalable and Energy-Efficient Memory System Architectures

  3. Convergence of Technology Trends [Diagram: energy constraints, new memory technologies, the bandwidth/capacity/locality demands of multi-cores, and reliability requirements all converge, pointing to an overhaul of the main memory architecture!]

  4. High Level Approach • Explore changes to memory chip microarchitecture • Must cause minimal disruption to density • Explore changes to interfaces and standards • Major change appears inevitable! • Explore system and memory controller innovations • Most attractive, but order-of-magnitude improvement unlikely • Design solutions that are technology-agnostic

  5. Projects • Memory Chip • Reduce overfetch • Support reliability • Handle PCM drift • Promote read/write parallelism • Memory Interface • Interface with photonics • Organize channel for high capacity • Memory Controller • Maximize use of row buffer • Schedule for low latency and energy • Exploit mini-ranks DIMM CPU … MC

  6. Talk Outline • Mature work: • SSA architecture – Single Subarray Access (ISCA’10) • Support for reliability (ISCA’10) • Interface with photonics (ISCA’11) • Micro-pages – data placement for row buffer efficiency (ASPLOS’10) • Handling multiple memory controllers (PACT’10) • Managing resistance drift in PCM cells (NVMW’11) • Preliminary work: • Handling read/write parallelism • Enabling high capacity • Handling DMA scheduling • Exploiting rank subsetting for performance and thermals

  7. Minimizing Overfetch with Single Subarray Access Ani Udipi Primary Impact DIMM CPU … MC

  8. Problem 1 - DRAM Chip Energy • On every DRAM access, multiple arrays in multiple chips are activated • Was useful when there was good locality in access streams • Open page policy • Helped keep density high and reduce cost-per-bit • With multi-thread, multi-core and multi-socket systems, there is much more randomness • “Mixing” of access streams when finally seen by the memory controller

  9. Rethinking DRAM Organization • Limited use for designs based on locality • As much as 8 KB read in order to service a 64-byte cache line request • Termed “overfetch” • Substantially increases energy consumption • Need a new architecture that • Eliminates overfetch • Increases parallelism • Increases opportunity for power-down • Allows efficient reliability
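A quick back-of-the-envelope sketch of the overfetch just described; the 8 KB row and 64-byte cache line are the figures quoted on the slide, everything else is simple arithmetic:

```python
# Rough overfetch arithmetic for a conventional DRAM access, using the
# figures quoted above (an 8 KB row activated to serve a 64-byte line).
ROW_BUFFER_BYTES = 8 * 1024   # bytes brought to the sense amps per activate
CACHE_LINE_BYTES = 64         # bytes the processor actually asked for

utilization = CACHE_LINE_BYTES / ROW_BUFFER_BYTES
print(f"Row-buffer utilization for one request: {utilization:.2%}")  # ~0.78%
print(f"Overfetch factor: {ROW_BUFFER_BYTES // CACHE_LINE_BYTES}x")  # 128x
```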

  10. Proposed Solution – SSA Architecture [Diagram: a DIMM of DRAM chips behind a shared ADDR/CMD bus from the memory controller. Within one DRAM chip, each bank is divided into subarrays with short bitlines and a local row buffer; a single subarray supplies a full 64-byte cache line, which travels over a global interconnect to the chip I/O. The data bus back to the memory controller is split into eight narrow 8-bit links, one per chip.]

  11. SSA Basics • Entire DRAM chip divided into small “subarrays” • Width of each subarray is exactly one cache line • Fetch entire cache line from a single subarray in a single DRAM chip – SSA • Groups of subarrays combined into “banks” to keep peripheral circuit overheads low • Close page policy and “posted-RAS” • Data bus to processor essentially split into 8 narrow buses
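A minimal sketch of how an address might be steered to a single subarray under SSA. The geometry below (8 chips per DIMM, 8 banks per chip, 128 subarrays per bank, 64-byte lines) is an illustrative assumption, not the paper's exact parameters:

```python
from collections import namedtuple

SSALocation = namedtuple("SSALocation", "chip bank subarray row")

# Illustrative geometry: 8 chips/DIMM, 8 banks/chip, 128 subarrays/bank,
# and a subarray row that is exactly one 64-byte cache line wide.
CHIPS, BANKS, SUBARRAYS, LINE_BYTES = 8, 8, 128, 64

def ssa_map(phys_addr: int) -> SSALocation:
    """Map a physical address to the single subarray holding its cache line."""
    line = phys_addr // LINE_BYTES              # cache-line index
    chip = line % CHIPS                         # spread consecutive lines across chips
    bank = (line // CHIPS) % BANKS
    subarray = (line // (CHIPS * BANKS)) % SUBARRAYS
    row = line // (CHIPS * BANKS * SUBARRAYS)
    return SSALocation(chip, bank, subarray, row)

print(ssa_map(0x1234_5678))
```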

  12. SSA Architecture Impact • Energy reduction • Dynamic – fewer bitlines activated • Static – smaller activation footprint – more and longer spells of inactivity – better power down • Latency impact • Limited pins per cache line – serialization latency • Higher bank-level parallelism – shorter queuing delays • Area increase • More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)
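To make the serialization trade-off concrete, a back-of-the-envelope comparison; the 8-bit-per-chip link is from the slides above, while the DDR3-1600-style data rate is an assumed example figure:

```python
CACHE_LINE_BITS = 64 * 8
DATA_RATE_MTPS = 1600            # mega-transfers/s per pin (assumed DDR3-1600)

def serialization_ns(bus_width_bits: int) -> float:
    """Time to stream one cache line over a bus of the given width."""
    transfers = CACHE_LINE_BITS / bus_width_bits
    return transfers / (DATA_RATE_MTPS * 1e6) * 1e9

print(f"64-bit wide rank : {serialization_ns(64):.1f} ns")  # ~5 ns
print(f"8-bit SSA chip   : {serialization_ns(8):.1f} ns")   # ~40 ns
```

The extra serialization latency is what the higher bank-level parallelism and shorter queuing delays of SSA are trading against.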

  13. Area Impact • Smaller arrays – more peripheral overhead • More wiring overhead in the on-chip interconnect between arrays and pin pads • We did a best-effort area impact calculation using a modified version of CACTI 6.5 • Analytical model, has its limitations • More feedback in this specific regard would be awesome! • More info on exactly where in the hierarchy overfetch stops would be great too

  14. Support for Chipkill Reliability Ani Udipi Primary Impact DIMM CPU … MC

  15. Problem 2 – DRAM Reliability • Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip • One example of existing systems • Consider a baseline 64-bit word plus 8-bit ECC • Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable! • Reading 72 chips – significant overfetch! • Chipkill is even more of a concern for SSA since the entire cache line comes from a single chip

  16. Proposed Solution [Diagram: a RAID-5-like layout across the nine DRAM devices of a DIMM. Each cache line L0..L63 is stored with a local checksum C on a single device; a global parity line P0..P7 is rotated across the devices, one per stripe. L – cache line, C – local checksum, P – global parity.]

  17. Chipkill design • Two-tier error protection • Tier-1 protection – self-contained error detection • 8-bit checksum/cache line – 1.625% storage overhead • Every cache line read is now slightly longer • Tier-2 protection – global error correction • RAID-like striped parity across 8+1 chips • 12.5% storage overhead • Error-free access (common case) • reads touch 1 chip • writes touch 2 chips – leads to some bank contention • 12% IPC degradation • Erroneous access • 9 chip operation
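A toy model of the two-tier idea, assuming hypothetical helper names: tier 1 is a per-line 8-bit checksum for self-contained detection, tier 2 is RAID-5-style XOR parity striped across 8+1 chips, which lets the line on a failed chip be rebuilt from the surviving seven lines plus parity:

```python
NUM_DATA_CHIPS = 8  # plus one parity chip per stripe, as in RAID-5

def checksum8(line: bytes) -> int:
    """Tier 1: 8-bit additive checksum stored alongside each cache line."""
    return sum(line) & 0xFF

def parity_line(lines: list[bytes]) -> bytes:
    """Tier 2: XOR parity across the cache lines of a stripe."""
    parity = bytearray(len(lines[0]))
    for line in lines:
        for i, b in enumerate(line):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving_lines: list[bytes], parity: bytes) -> bytes:
    """Rebuild the line on a failed chip from the surviving lines + parity."""
    return parity_line(surviving_lines + [parity])

# Example: 8 lines of 64 bytes, chip 3 fails, its line is reconstructed.
stripe = [bytes([chip] * 64) for chip in range(NUM_DATA_CHIPS)]
parity = parity_line(stripe)
rebuilt = recover(stripe[:3] + stripe[4:], parity)
assert rebuilt == stripe[3] and checksum8(rebuilt) == checksum8(stripe[3])
```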

  18. Questions • What are the common failure modes in DRAM? PCM? • Do entire chips fail? • Do parts of chips fail? • Which parts? Bitlines? Wordlines? Capacitors? • Entire arrays? • Entire banks? • I/O? • Should all these failures be handled the same way?

  19. Designing Photonic Interfaces Ani Udipi Primary Impact DIMM CPU … MC

  20. Problem 3 – Memory interconnect • Electrical interconnects are not scaling well • Where can photonics make an impact, both on energy and performance? • Various levels in the DRAM interconnect • Memory cell to sense-amp - addressed by SSA • Row buffer to I/O – currently electrical (on-chip) • I/O pins to processor – currently electrical (off-chip) • Photonic interconnects • Large static power component – laser/ring tuning • Much lower dynamic component – relatively unaffected by distance • Electrical interconnects • Relatively small static component • Large dynamic component • Cannot overprovision photonic bandwidth, use only where necessary
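A hedged sketch of the trade-off being argued here: photonic links pay a large always-on static cost (laser and ring trimming) but very little per bit, while electrical links are the reverse. The energy numbers below are placeholders for illustration, not measured values:

```python
# Placeholder energy parameters (illustrative only, not measured values).
PHOTONIC_STATIC_W     = 0.5    # laser + ring-trimming power, always on
PHOTONIC_PJ_PER_BIT   = 0.1    # nearly distance-independent
ELECTRICAL_STATIC_W   = 0.05
ELECTRICAL_PJ_PER_BIT = 2.0    # dominated by dynamic switching energy

def energy_joules(static_w, pj_per_bit, seconds, bits):
    return static_w * seconds + pj_per_bit * 1e-12 * bits

# At high utilization the photonic link wins; at low utilization the static
# laser/tuning power dominates and the electrical link is cheaper.
for gbps in (1, 10, 100, 1000):
    bits = gbps * 1e9
    p = energy_joules(PHOTONIC_STATIC_W, PHOTONIC_PJ_PER_BIT, 1.0, bits)
    e = energy_joules(ELECTRICAL_STATIC_W, ELECTRICAL_PJ_PER_BIT, 1.0, bits)
    print(f"{gbps:5d} Gb/s: photonic {p:.3f} J vs electrical {e:.3f} J")
```

This is the reasoning behind "cannot overprovision photonic bandwidth, use only where necessary".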

  21. Consideration 1 – How much photonics on a die? [Chart: comparison of photonic energy vs. electrical energy.]

  22. Consideration 2 - Increasing Capacity • 3D stacking is imminent • There will definitely be several dies on the channel • Each die has photonic components that are constantly burning static power • Need to minimize this! • TSVs available within a stack; best of both worlds • Large bandwidth • Low static energy • Need to exploit this!

  23. Proposed Design [Diagram: the processor and its memory controller connect over a photonic waveguide to a DIMM of 3D stacks; in each stack, the DRAM chips sit on top of a photonic interface die that also acts as the stack controller.]

  24. Proposed Design – Interface Die • Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies • Use photonics where there is heavy utilization – shared bus between processor and interface die i.e. the off-chip interconnect • Helps break pin barrier for efficient I/O, substantially improves socket-edge BW • On-stack, where there is low utilization, use efficient low-swing interconnects and TSVs

  25. Advantages of the proposed system • Reduction in energy consumption • Fewer photonic resources, without loss in performance • Rings, couplers, trimming • Industry considerations • Does not affect design of commodity memory dies • Same memory die can be used with both photonic and electrical systems • Same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, Memristors

  26. Problem 4 – Communication Protocol • Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface • The need to handle heterogeneous memory modules, each with its own maintenance requirements, further complicates scheduling • Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing) • Heavy pressure on the address/command bus – several commands to micro-manage every operation of the DRAM • Several independent banks – the controller must maintain large amounts of state to schedule requests efficiently • Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction

  27. Proposed Solution – Packet-based interface • Release most of the tight control the memory controller holds today • Move mundane tasks to the memory modules themselves (on the interface die) – make them more autonomous • maintenance operations (refresh, scrub, etc.) • routine operations (DRAM precharge, NVM wear handling) • timing control (DRAM alone has almost 20 different timing constraints to be respected) • coding and any other special requirements • The only information the memory module needs is the address and a read/write identifier; time slots are reserved a priori for data return
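A minimal sketch of what a request packet in such an interface might carry, with everything else (timing, refresh, wear handling) left to the module's interface die. The field and class names are illustrative assumptions, not the proposed protocol's actual format:

```python
from dataclasses import dataclass
from enum import Enum

class Op(Enum):
    READ = 0
    WRITE = 1

@dataclass
class MemRequestPacket:
    """What the controller sends: just the address, the operation, and the
    pre-reserved return slot.  No RAS/CAS/precharge micro-management and no
    per-bank timing state -- the interface die handles all of that locally."""
    address: int
    op: Op
    return_slot: int          # time slot reserved a priori for the data
    payload: bytes = b""      # only used for writes

# Example: a 64-byte write followed by a read to the same line.
reqs = [
    MemRequestPacket(0x1F40, Op.WRITE, return_slot=12, payload=bytes(64)),
    MemRequestPacket(0x1F40, Op.READ,  return_slot=13),
]
for r in reqs:
    print(r.op.name, hex(r.address), "slot", r.return_slot)
```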

  28. Advantages • Better interoperability, plug and play • As long as the interface die has the necessary information, everything is interchangeable • Better support for heterogeneous systems • Allows easier data movement, for example between DRAM and NVM on the same channel • Reduces memory controller complexity • Allows innovation and value addition in the memory, without being constrained by processor-side support • Reduces bit-transport energy on the address/command bus

  29. Data Placement with Micro-Pages To Boost Row Buffer Utility Kshitij Sudan Primary Impact DIMM CPU … MC

  30. DRAM Access Inefficiencies • Overfetch due to large row-buffers • 8 KB read into the row buffer for a 64-byte cache line • Row-buffer utilization for a single request < 1% • Diminishing locality in multi-cores • Increasingly randomized memory access stream • Row-buffer hit rates are bound to go down • Open page policy and FR-FCFS request scheduling • Memory controller schedules requests to open row-buffers first • Goal • Improve row-buffer hit-rates for Chip Multi-Processors

  31. Key Observation – Cache Block Access Patterns Within OS Pages • For heavily accessed pages in a given time interval, accesses usually touch only a few cache blocks • Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row

  32. Basic Idea [Diagram: 4 KB OS pages in DRAM memory are managed as 1 KB micro-pages; the hottest micro-pages are gathered into a reserved DRAM region while the coldest micro-pages stay in place.]

  33. Hardware-Assisted Migration (HAM) [Diagram: baseline 4 GB main memory with a 4 MB reserved DRAM region. The CPU issues a memory request for physical address X in page A; a mapping table of old-address → new-address entries redirects X to the new address Y inside the reserved region.]
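A sketch of the hardware-assisted remapping step described above. The sizes (4 KB OS pages, 1 KB micro-pages, a 4 MB reserved region) follow the slides; the table structure and method names are illustrative assumptions:

```python
OS_PAGE = 4 * 1024
MICRO_PAGE = 1 * 1024            # OS pages are managed as 1 KB micro-pages
RESERVED_REGION_BASE = 0x0       # 4 MB reserved DRAM region for hot micro-pages

class MappingTable:
    """Old-address -> new-address table consulted on every memory request."""
    def __init__(self):
        self.remap = {}          # micro-page number -> reserved-region slot

    def migrate(self, old_addr: int, slot: int) -> None:
        """Move a hot micro-page into the reserved region (triggered when the
        access-count heuristic marks it hot)."""
        self.remap[old_addr // MICRO_PAGE] = slot

    def translate(self, phys_addr: int) -> int:
        mp = phys_addr // MICRO_PAGE
        if mp in self.remap:
            return (RESERVED_REGION_BASE
                    + self.remap[mp] * MICRO_PAGE
                    + phys_addr % MICRO_PAGE)
        return phys_addr         # not migrated: use the original address

table = MappingTable()
table.migrate(0x0040_0400, slot=7)          # micro-page holding address X
print(hex(table.translate(0x0040_0410)))    # -> address Y inside the region
```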

  34. Results • 5M-cycle epochs; ROPS, HAM, and ORACLE schemes [Plot: percent change in performance] • Apart from the average 9% performance gain, our schemes also save DRAM energy at the same time!

  35. Conclusions • On average, for applications with room for improvement and with our best performing scheme • Average performance ↑ 9% (max. 18%) • Average memory energy consumption ↓ 18% (max. 62%) • Average row-buffer utilization ↑ 38% • Hardware-assisted migration offers better returns due to the lower overheads of TLB shoot-downs and misses

  36. Data Placement Across Multiple Memory Controllers Kshitij Sudan Primary Impact DIMM CPU … MC

  37. DRAM NUMA Latency [Diagram: two multi-core sockets, each with on-chip memory controllers (MCs) and locally attached DIMMs on their memory channels, linked by a QPI interconnect; a request that crosses the socket boundary to a remote MC and its DIMMs incurs additional (NUMA) latency.]

  38. Problem Summary • Pin limitations → increasing queuing delay • Almost 8x increase in queuing delays from single core/one thread to 16 cores/16 threads • Multi-cores → increasing row-buffer interference • Increasingly randomized memory access stream • Longer on- and off-chip wire delays → increasing NUMA factor • NUMA factor already at 1.5x today • Goal • Improve application performance by reducing queuing delays and NUMA latency

  39. Policies to Manage Data Placement Among MCs • Adaptive First Touch (AFT) • Assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the cost function cost_j = α·load_j + β·rowhits_j + λ·distance_j • Dynamic Page Migration • Programs change phases → imbalance in MC load • Migrate pages between MCs at runtime, picking the destination MC k that minimizes cost_k = Λ·distance_k + Γ·rowhits_k • Integrating Heterogeneous Memory Technologies • Extend the AFT cost function: cost_j = α·load_j + β·rowhits_j + λ·distance_j + Ƭ·LatencyDimmCluster_j + µ·Usage_j
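A sketch of how Adaptive First Touch might evaluate its cost function when a new virtual page is first touched. The weights, the per-MC statistics, and the sign convention for row hits are illustrative placeholders, not the paper's tuned values:

```python
from dataclasses import dataclass

@dataclass
class MCStats:
    load: float        # queue occupancy at this memory controller
    row_hits: float    # recent row-buffer hit rate (higher is better)
    distance: float    # on-chip hops from the requesting core

# Illustrative weights for cost_j = a*load_j + b*row_hits_j + c*distance_j.
# row_hits is weighted negatively so a higher hit rate lowers the cost.
A, B, C = 1.0, -0.5, 0.25

def aft_cost(mc: MCStats) -> float:
    return A * mc.load + B * mc.row_hits + C * mc.distance

def place_page(mcs: list[MCStats]) -> int:
    """Adaptive First Touch: assign the new page to the cheapest MC."""
    return min(range(len(mcs)), key=lambda j: aft_cost(mcs[j]))

mcs = [MCStats(0.8, 0.6, 1), MCStats(0.3, 0.4, 3), MCStats(0.5, 0.7, 2)]
print("Chosen MC:", place_page(mcs))
```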

  40. Summary • Multiple on-chip MCs will be common in future CMPs • Multiple cores sharing one MC, MCs controlling different types of memories • Intelligent data mapping needed • Adaptive First Touch policy (AFT) • Increases performance by 6.5% in a homogeneous hierarchy and by 1.6% in a DRAM–PCM hierarchy • Dynamic page migration, an improvement on AFT • Further improves over AFT – by 8.9% over the baseline in the homogeneous case, and by 4.4% in the best-performing DRAM–PCM hierarchy

  41. Managing Resistance Drift in PCM Cells Manu Awasthi Primary Impact DIMM CPU … MC

  42. Quick Summary • Multi-level cells in PCM appear imminent • A number of proposals exist to handle hard errors and lifetime issues of PCM devices • Resistance drift is a less explored phenomenon • Will become increasingly significant as the number of levels per cell increases – the primary cause of “soft errors” • Naïve techniques based on DRAM-like refresh will be extremely costly in both latency and energy • Need to explore holistic solutions to counter drift

  43. What is Resistance Drift? [Diagram: cell resistance vs. time for a multi-level PCM cell, with the four states 11, 10, 01, 00 spanning the range from crystalline (low resistance) to amorphous (high resistance). A cell programmed at point A at time T0 drifts upward in resistance so that by point B at time Tn it has crossed into the neighboring state – ERROR!]

  44. Resistance Drift Data [Plot: resistance drift over time for cells programmed to the four states (11), (01), (10), (00).]

  45. Resistance Drift – Issues • Programmed resistance drifts according to the power-law equation Rdrift(t) = R0 × t^α • R0 and α usually follow a Gaussian distribution • Time to drift (error) depends on the programmed resistance (R0) and the drift coefficient (α) • Drift is highly unpredictable!!
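Given the power-law model above, a short sketch of how long a cell takes to drift across a state boundary; the values of R0, α, and the boundary are illustrative, not device data:

```python
# Drift model from the slide: R_drift(t) = R0 * t**alpha  (t in seconds).
def time_to_cross(r0: float, alpha: float, boundary: float) -> float:
    """Time until the drifting resistance crosses the next state boundary."""
    return (boundary / r0) ** (1.0 / alpha)

# Illustrative numbers: a cell programmed at 100 kOhm with the band edge at
# 300 kOhm.  A small drift coefficient keeps the cell in band practically
# forever; a worst-case coefficient crosses the boundary within hours.
for alpha in (0.02, 0.05, 0.10):
    t = time_to_cross(1e5, alpha, 3e5)
    print(f"alpha={alpha:.2f}: crosses boundary after {t:.3g} s")
```

This spread is exactly why a single worst-case scrub rate is so expensive, as the next slides argue.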

  46. Resistance Drift – How it Happens [Histogram: number of cells vs. resistance for the four states 11, 10, 01, 00, showing each state's programmed resistance R0 drifting to Rt over time until cells cross into the neighboring state – ERROR!] • Median-case cell: typical R0, typical α • Worst-case cell: high R0, high α • The scrub rate will be dictated by the worst-case R0 and worst-case α • Naive refresh/scrub will be extremely costly!

  47. Architectural Solutions – Headroom • Assumes support for Light Array Reads for Drift Detection (LARDDs) and ECC-N • Headroom-h scheme – a scrub is triggered if N-h errors are detected • Decreases the probability of errors slipping through • Increases the frequency of full scrubs and hence decreases lifetime • Gradual Headroom scheme: start with a low LARDD frequency and increase it as errors accumulate [Flowchart: after N cycles, read the line and check for errors; if fewer than N-h errors are found, wait and repeat; otherwise scrub the line.]
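A sketch of the Headroom-h control loop from the flowchart above, assuming an ECC code that can correct up to N errors per line; the callback names standing in for the array read and the scrub are hypothetical:

```python
N = 8        # errors correctable by ECC-N on one line (assumed)
H = 2        # headroom: scrub as soon as N - H errors are seen

def lardd_check(line_id: int, count_errors, scrub_line) -> None:
    """One Light Array Read for Drift Detection (LARDD) pass over a line.

    `count_errors` and `scrub_line` are hypothetical callbacks standing in
    for the array read + ECC decode and the full rewrite, respectively."""
    errors = count_errors(line_id)
    if errors >= N - H:
        scrub_line(line_id)      # re-program the line before errors exceed N

# Gradual Headroom: start with a long LARDD interval and shrink it (raise the
# LARDD frequency) as the observed error count grows, instead of always
# scanning at the worst-case rate.
def next_interval(current_interval_s: float, errors: int) -> float:
    return current_interval_s / 2 if errors >= N - H - 1 else current_interval_s
```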

  48. Reducing Overheads with a Circuit-Level Solution • Invoking ECC on every LARDD increases energy consumption • A parity-like error-detection circuit is used to signal the need for a full-fledged ECC error detection • The number of drift-prone states in each line is counted when the line is written into memory (a single bit records odd/even) • At every LARDD, the parity is verified • Reduces the need for an ECC read-compare at every LARDD cycle
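A sketch of that cheap check: at write time, count how many cells hold drift-prone states and keep one parity bit of the count; at each LARDD, recount and compare before paying for a full ECC decode. Which states count as drift-prone is an assumption here (the intermediate states 01 and 10):

```python
DRIFT_PRONE = {0b01, 0b10}   # assumed: intermediate MLC states drift the most

def drift_parity(cells: list[int]) -> int:
    """One bit: is the number of drift-prone cells in the line odd or even?"""
    return sum(1 for c in cells if c in DRIFT_PRONE) & 1

def lardd_fast_check(cells_now: list[int], stored_parity: int) -> bool:
    """Return True if the full ECC read-compare can be skipped."""
    return drift_parity(cells_now) == stored_parity

line_at_write = [0b11, 0b01, 0b10, 0b00] * 64   # 256 two-bit cells
parity = drift_parity(line_at_write)
# Later, one cell has drifted out of state 01: the drift-prone count changes,
# the parity flips, and only then is the full ECC check triggered.
drifted = list(line_at_write); drifted[1] = 0b00
print(lardd_fast_check(drifted, parity))   # False -> do the full ECC check
```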

  49. More Solutions • Precise Writes • More write iterations to program the state closer to the mean, reducing the chance of drift • Increases energy consumption and write time, and decreases lifetime! • Non-Uniform Guardbanding • Baseline: the resistance range is equally divided among all n states • Expand the resistance range of drift-prone states at the expense of non-drift-prone ones
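A small sketch of non-uniform guardbanding: instead of splitting the resistance range evenly among the n states, drift-prone states get wider bands at the expense of the stable ones. The overall range, the weights, and which states are treated as drift-prone are illustrative assumptions:

```python
# Illustrative resistance range (log10 ohms) and per-state width weights.
R_MIN, R_MAX = 3.0, 7.0
STATES = ["11", "10", "01", "00"]
WEIGHTS = {"11": 0.5, "10": 1.5, "01": 1.5, "00": 0.5}  # wider bands for the
                                                        # assumed drift-prone states

def guardbands(weights: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Divide the resistance range into bands proportional to the weights."""
    total = sum(weights.values())
    bands, lo = {}, R_MIN
    for s in STATES:
        hi = lo + (R_MAX - R_MIN) * weights[s] / total
        bands[s] = (lo, hi)
        lo = hi
    return bands

for state, (lo, hi) in guardbands(WEIGHTS).items():
    print(f"state {state}: {lo:.2f} .. {hi:.2f} log10(ohm)")
```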

  50. Results [Plot: number of errors vs. LARDD interval (seconds).]
