
Exploiting 3D-Stacked Memory Devices


Presentation Transcript


  1. Exploiting 3D-Stacked Memory Devices Rajeev Balasubramonian School of Computing University of Utah Oct 2012

  2. Power Contributions [Chart: processor vs. memory percentage of total server power]

  3. Power Contributions [Chart continued: processor vs. memory percentage of total server power]

  4. Example IBM Server Source: P. Bose, WETI Workshop, 2012

  5. Reasons for Memory Power Increase
  • Innovations for the processor, but not for memory
  • Harder to get to memory (buffer chips)
  • New workloads that demand more memory:
    SAP HANA in-memory databases, SAS in-memory analytics

  6. The Cost of Data Movement
  • 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)
  • 1 instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)
  • Fetching a 256-bit block from a distant cache bank: 1.2 nJ
    (NSF CPOM Workshop report)
  • Fetching a 256-bit block from an HMC device: 2.68 nJ
  • Fetching a 256-bit block from a DDR3 device: 16.6 nJ
    (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
    (compared in the sketch below)
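To put these figures in perspective, here is a quick back-of-the-envelope script (a Python sketch; the numbers are taken directly from the bullets above) that converts each fetch energy to pJ/bit and compares it against a single FP MAC:

```python
# Per-fetch energies from the slide above, in picojoules (256-bit block).
FETCH_ENERGY_PJ = {
    "distant cache bank": 1200,    # 1.2 nJ
    "HMC device": 2680,            # 2.68 nJ
    "DDR3 device": 16600,          # 16.6 nJ
}
FP_MAC_PJ = 50  # one 64-bit double-precision FP multiply-accumulate

for source, pj in FETCH_ENERGY_PJ.items():
    print(f"{source:>18}: {pj / 256:5.1f} pJ/bit, "
          f"{pj / FP_MAC_PJ:5.1f}x the energy of an FP MAC")
```

A single DDR3 fetch costs over 300x the energy of the MAC operation it might feed, which motivates the rest of the talk.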

  7. Memory Basics [Diagram: host multi-core processor with multiple on-chip memory controllers (MCs), each driving a memory channel]

  8. FB-DIMM [Diagram: host processor whose MCs drive daisy-chained FB-DIMM channels]

  9. SMB/SMI [Diagram: host processor reaching memory through SMB buffer chips on SMI links]

  10. Micron Hybrid Memory Cube Device

  11. HMC Architecture [Diagram: host multi-core processor with MCs linked to HMC devices]

  12. Key Points
  • HMC allows the logic layer to easily reach the DRAM chips
  • Open question: new functionalities on the logic chip
    (cores, routing, refresh, scheduling)
  • Data transfer out of the HMC is just as expensive as before
  • Near Data Computing ... to cut off-HMC data movement
  • Intelligent Network-of-Memories ... to reduce hops

  13. Near Data Computing (NDC)

  14. Timely Innovation
  • A low-cost way to achieve NDC
  • Workloads that are embarrassingly parallel
  • Workloads that are increasingly memory-bound
  • Mature frameworks (MapReduce) already in place

  15. Open Questions • What workloads will benefit from this? • What causes the benefit?

  16. Workloads
  • Initial focus on MapReduce, but any workload with localized
    data access patterns is a good fit
  • Map phase in MapReduce: the dataset is partitioned and each
    Mapper works on its own "split"; embarrassingly parallel,
    localized data access, often the bottleneck;
    e.g., count word occurrences in each individual document
  • Reduce phase in MapReduce: aggregates the results of many
    Mappers; requires random access to data, but handles less data
    than the Mappers; e.g., summing up the occurrences of each word
    (see the sketch below)
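As a concrete, purely illustrative version of the word-count example above, here is a minimal Python sketch of the two phases; the function names and toy data are assumptions, not part of any NDC framework:

```python
from collections import Counter
from functools import reduce

def mapper(split: str) -> Counter:
    # Map phase: embarrassingly parallel, touches only its own split.
    return Counter(split.split())

def reducer(counts: list[Counter]) -> Counter:
    # Reduce phase: aggregates the results of many Mappers.
    return reduce(lambda a, b: a + b, counts, Counter())

splits = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reducer([mapper(s) for s in splits])
print(word_counts.most_common(2))  # [('the', 3), ('fox', 2)]
```

In NDC, each `mapper` call would run on an ND-Core next to its split, and only the small per-split counts would cross the off-HMC links to the reducer on the host.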

  17. Baseline Architecture [Diagram: host processor with four MCs]
  • Mappers and Reducers both execute on the host processor
  • Many simple cores are better than a few complex cores
  • 2 sockets, 256 GB memory, 260 W processing power budget,
    512 ARM cores (EE-Cores) per socket, each core at 876 MHz

  18. NDC Architecture [Diagram: host processor with MCs and HMC stacks]
  • Mappers execute on the ND-Cores; Reducers execute on the host
    processor
  • 32 ND-Cores per HMC; 2048 total ND-Cores and 1024 total
    EE-Cores; 260 W total processing power budget

  19. NDC Memory Hierarchy [Diagram: MCs and the vault hierarchy]
  • Memory latency excludes delay for link queuing and traversal
  • Many row-buffer hits
  • Private L1 instruction and data caches per ND-Core
  • Each vault reserves space for intermediate outputs and for
    Mapper/runtime code and data

  20. Methodology
  • Three workloads (sketched below):
    Range-Aggregate: count occurrences of one thing
    Group-By: count occurrences of everything
    Equi-Join: for two tables, count the record pairs whose join
    attributes match
  • Dataset: 1998 World Cup web server logs
  • Simulations of individual Mappers and Reducers on EE-Cores with
    the TRAX simulator
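To make the three kernels concrete, here is a hypothetical Python sketch of each over a toy log table; the record layout and field names are illustrative assumptions, not the actual World Cup log schema:

```python
from collections import Counter

# Each log record is a (client, url) pair; fields are illustrative.
logs = [("c1", "/a"), ("c2", "/a"), ("c1", "/b"), ("c3", "/a")]

# Range-Aggregate: count occurrences of one thing.
hits_on_a = sum(1 for _, url in logs if url == "/a")

# Group-By: count occurrences of everything.
hits_per_url = Counter(url for _, url in logs)

# Equi-Join: pair up records from two tables whose join attribute
# (the client id) matches.
clients = [("c1", "US"), ("c2", "UK")]
joined = [(c, country, url)
          for c, country in clients
          for c2, url in logs if c == c2]

print(hits_on_a, hits_per_url, len(joined))  # 3 Counter(...) 3
```

All three kernels stream over the log records, which is why a Mapper running them inside an HMC vault sees localized, row-buffer-friendly access patterns.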

  21. Single Thread Performance

  22. Effect of Bandwidth

  23. Exec Time vs. Frequency

  24. Maximizing the Power Budget

  25. Scaling the Core Count

  26. Energy Reduction

  27. Results Summary
  • Execution time reductions of 7%-89%
  • NDC performance scales better with core count
  • Energy reductions of 26%-91%
  • No bandwidth limitation
  • Lower memory access latency
  • Lower bit-transport energy

  28. Intelligent Network of Memories • How should several HMCs be connected to the processor? • How should data be placed in these HMCs?

  29. Contributions
  • Evaluation of different network topologies
    (route adaptivity does help)
  • Page placement to bring popular data to nearby HMCs
    (percolate-down based on page access counts)
  • Use of router bypassing under low load
  • Use of deep sleep modes for distant HMCs

  30. Topologies

  31. Topologies

  32. Topologies [Diagrams: (d) F-Tree, (e) T-Tree]

  33. Network Properties
  • Supports 44-64 HMC devices with 2-4 rings
  • Adaptive routing (deadlock avoidance based on timers)
  • An entire page resides in one ring, but its cache lines are
    striped across the channels (see the sketch below)
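A hedged sketch of what such a page-to-ring, line-to-channel mapping could look like; the page size, line size, channel count, and the `page_to_ring` table are all illustrative assumptions:

```python
PAGE_SIZE = 4096      # bytes per page; assumed
LINE_SIZE = 64        # bytes per cache line; assumed
NUM_CHANNELS = 4      # channels per ring; assumed

page_to_ring = {}     # filled in by the OS page-placement policy

def locate(addr: int) -> tuple[int, int]:
    """Map a physical address to (ring, channel).

    The whole page lives in one ring, but consecutive cache
    lines within the page are striped across the channels.
    """
    page = addr // PAGE_SIZE
    ring = page_to_ring.get(page, 0)         # default: nearest ring
    line = (addr % PAGE_SIZE) // LINE_SIZE
    channel = line % NUM_CHANNELS
    return ring, channel

ring, channel = locate(0x12345)  # -> (0, 1) with an empty table
```

Keeping a page in one ring bounds the hop count for every line in it, while the channel striping preserves intra-page bandwidth.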

  34. Percolate-Down Page Placement
  • New pages are placed in the nearest ring
  • Periodically, inactive pages are demoted to the next ring;
    the demotion thresholds matter because of queuing delays
  • Activity is tracked with the multi-queue algorithm: hierarchical
    queues where each entry has a timer and an access count; an
    entry is demoted to a lower queue when its timer expires and
    promoted to a higher queue when its access count is high
    (sketched below)
  • Page migration is off the critical path, striped across many
    channels, and uses the under-utilized distant links
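A minimal sketch of the multi-queue tracker described above, assuming illustrative values for the queue count, timer length, and promotion threshold:

```python
NUM_QUEUES = 4       # hierarchical queues; assumed
TIMER = 1000         # accesses before an idle entry expires; assumed
PROMOTE_AT = 16      # access count that earns a promotion; assumed

table = {}           # page -> [queue_level, expiry_time, access_count]
now = 0              # logical clock, advanced on every access

def access(page: int) -> None:
    global now
    now += 1
    level, _, count = table.get(page, [0, 0, 0])
    count += 1
    if count >= PROMOTE_AT and level < NUM_QUEUES - 1:
        level, count = level + 1, 0          # promote to higher queue
    table[page] = [level, now + TIMER, count]

def expire() -> None:
    # Demote entries whose timer ran out to the next lower queue.
    for page, (level, expiry, count) in list(table.items()):
        if now > expiry and level > 0:
            table[page] = [level - 1, now + TIMER, 0]
```

Pages sitting in the lowest queue when a demotion period ends are the candidates to percolate down to the next ring.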

  35. Router Bypassing
  • Topologies with more links and adaptive routing (T-Tree) perform
    better, but distant links experience relatively low load
  • The T-Tree requires a complex router, but that router can often
    be bypassed

  36. Power-Down Modes
  • Activity shifts to nearby rings, leaving distant HMCs
    under-utilized
  • The DRAM layers (PD-0) and the SerDes circuits (PD-1) can be
    powered off
  • 26% energy saving for a 5% performance penalty

  37. Methodology
  • 128-thread traces of NAS parallel benchmarks (capacity
    requirements of nearly 211 GB)
  • Detailed simulation of 1-billion-access memory traces, plus
    confirmatory page-access simulations of the entire application
  • Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for the
    HMC logic layer, 3.9 pJ/bit for a 5x5 router (used in the
    estimate below)
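Using the per-bit figures above, here is a rough per-fetch energy estimate for a 256-bit block that crosses some number of routers; the hop counts and the simple additive model are assumptions:

```python
DRAM_PJ_PER_BIT = 3.7    # DRAM access
LOGIC_PJ_PER_BIT = 6.8   # HMC logic layer
ROUTER_PJ_PER_BIT = 3.9  # one 5x5 router traversal
BITS = 256

def fetch_energy_nj(hops: int) -> float:
    # One DRAM access, one logic-layer traversal, one router per hop.
    pj = (DRAM_PJ_PER_BIT + LOGIC_PJ_PER_BIT
          + hops * ROUTER_PJ_PER_BIT) * BITS
    return pj / 1000

for hops in (0, 1, 2, 3):
    print(f"{hops} hops: {fetch_energy_nj(hops):.2f} nJ")
```

With zero hops this roughly reproduces the 2.68 nJ HMC fetch energy quoted on slide 6; each additional router hop adds about 1 nJ, which is why iNoM works to keep popular pages in Ring-0.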

  38. Results – Normalized Exec Time
  • The T-Tree with percolate-down placement reduces execution time
    by 50%
  • 86% of flits bypass the router
  • 88% of requests are serviced by Ring-0

  39. Results – Energy

  40. Summary
  • Must reduce data movement on off-chip memory links
  • NDC reduces energy and improves performance by overcoming the
    bandwidth wall
  • More work is required to analyze workloads, build software
    frameworks, analyze thermals, etc.
  • iNoM uses OS page placement to minimize hops for popular data
    and to increase power-down opportunities
  • Path diversity is useful, and the router overhead is small

  41. Acknowledgements
  • Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor,
    Jeff Jestes, Al Davis, Feifei Li
  • Group funded by: NSF, HP, Samsung, IBM

  42. Backup Slide

  43. Backup Slide
