
Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers

Presentation Transcript


  1. Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers
  Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis (University of Utah)

  2. Takeaway
  • Multiple, on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  • NUMA memory hierarchies across multiple sockets
  • Intelligent data mapping required to reduce average memory access delay
  • Hardware-software co-design approach required for efficient data placement, with minimum software involvement
  • Data placement needs to be aware of system parameters: row-buffer hit rates, queuing delays, physical proximity, etc.

  3. NUMA - Today
  [Figure: conceptual representation of a four-socket Nehalem machine. Each socket holds four cores and an on-chip memory controller (MC) attached to its local DIMMs over a memory channel; the sockets are connected by a QPI interconnect.]

  4. NUMA - Future
  [Figure: a future CMP with multiple on-chip MCs. Sixteen cores, each with a private L2 cache, share an on-chip interconnect; four memory controllers (MC1-MC4) each drive a memory channel to their own DRAM DIMMs.]

  5. Local Memory Access
  • Accessing local memory is fast!
  [Figure: the address is sent to the MC in Socket 1 and the data returns from the locally attached DIMMs.]

  6. Problem 1 - Remote Memory Access
  • Data for Core N can be anywhere!
  [Figure: the address leaves Socket 1, since the requested page may reside behind any MC in the system.]

  7. Problem 1 - Remote Memory Access (continued)
  • Data for Core N can be anywhere!
  [Figure: the data is found behind a remote socket's MC, so the request and reply must cross the socket interconnect.]

  8. Memory Access Stream – Single Core
  [Figure: a memory controller request queue whose entries almost all belong to a single program on CPU 0.]
  • Single cores executed a handful of context-switched programs
  • Spatio-temporal locality can be exploited!

  9. Problem 2 - Memory Access Stream - CMPs
  [Figure: the same request queue now holds requests from Prog 0-6 running on CPUs 0-6.]
  • Memory accesses from cores get interleaved, leading to loss of spatio-temporal locality (a toy illustration follows this slide)
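
  The loss can be illustrated with a toy row-buffer model (the addresses, the 4 KB row size, and the helper below are made up purely for illustration, not taken from the paper): a single core streaming through its own page keeps the open row hot, while the same references interleaved with a second core's stream almost never hit.

      # Toy model of DRAM row-buffer hits: an access hits if it falls in the
      # same row as the previous access to this bank. Addresses and the row
      # size are hypothetical.
      def row_hit_rate(addrs, row_bytes=4096):
          hits, open_row = 0, None
          for a in addrs:
              row = a // row_bytes
              hits += (row == open_row)
              open_row = row
          return hits / len(addrs)

      single = [0x1000 + 64 * i for i in range(8)]      # one core, one page
      other  = [0x80000 + 64 * i for i in range(8)]     # a second core's page
      mixed  = [x for pair in zip(single, other) for x in pair]
      print(row_hit_rate(single), row_hit_rate(mixed))  # 0.875 vs. 0.0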

  10. Problem 3 – Increased Overheads for Memory Accesses
  [Figure: queuing delays increase sharply when going from 1 core/1 thread to 16 cores/16 threads.]

  11. Problem 4 – Pin Limitations
  [Figure: a 16-core chip with 4 MCs contrasted with one surrounded by 16 MCs, one per core.]
  • Pin bandwidth is limited: the number of MCs cannot keep growing with core counts
  • A small number of MCs will have to handle all traffic

  12. Problems Summary - I
  • Pin limitations imply an increase in queuing delay
    • Almost 8x increase in queuing delays from 1 core/1 thread to 16 cores/16 threads
  • Multi-core implies an increase in row-buffer interference
    • Increasingly randomized memory access stream
    • Row-buffer hit rates bound to go down
  • Longer on- and off-chip wire delays imply an increase in NUMA factor
    • NUMA factor already at 1.5 today

  13. Problems Summary - II
  • DRAM access time in systems with multiple on-chip MCs is governed by:
    • Distance between the requesting core and the responding MC
    • Load on the on-chip interconnect
    • Average queuing delay at the responding MC
    • Bank and rank contention at the target DIMM
    • Row-buffer hit rate at the responding MC
  • Bottom line: intelligent management of data is required

  14. Adaptive First Touch Policy
  • Basic idea: assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the cost function
    cost_j = α × load_j + β × rowhits_j + λ × distance_j
    where load_j is a measure of queuing delay at MC j, rowhits_j a measure of locality at its DRAM, and distance_j a measure of physical proximity
  • The constants α, β and λ can be made programmable (a sketch of the resulting placement decision follows this slide)
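
  A minimal sketch of such a placement decision, assuming hypothetical per-MC statistics (load, rowhits, distance) are exposed to the allocator; the cost function follows the slide, and the example weights are illustrative only:

      from dataclasses import dataclass

      @dataclass
      class MCStats:            # hypothetical per-controller statistics
          load: float           # measure of queuing delay at this MC
          rowhits: float        # measure of row-buffer locality at its DRAM
          distance: float       # measure of physical proximity to the core

      def aft_pick_mc(stats, alpha, beta, lam):
          """Assign a first-touched page to the MC j minimizing
          cost_j = alpha*load_j + beta*rowhits_j + lam*distance_j."""
          cost = lambda j: (alpha * stats[j].load
                            + beta * stats[j].rowhits
                            + lam * stats[j].distance)
          return min(range(len(stats)), key=cost)

      # Example: four MCs, as in the 16-core CMP of slide 4.
      mcs = [MCStats(0.7, 0.2, 1), MCStats(0.3, 0.5, 2),
             MCStats(0.4, 0.4, 3), MCStats(0.9, 0.1, 2)]
      print(aft_pick_mc(mcs, alpha=1.0, beta=1.0, lam=0.1))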

  15. Dynamic Page Migration Policy
  • Programs change phases!
    • They can completely stop touching new pages
    • They can change the frequency of access to a subset of pages
  • This leads to imbalance in MC accesses
  • For long-running programs with varying working sets, AFT can lead to some MCs getting overloaded
  • Solution: dynamically migrate pages between MCs at runtime to decrease the imbalance

  16. Dynamic Page Migration Policy
  [Figure: the 16-core CMP again; three MCs are lightly loaded while one, the donor MC, is heavily loaded.]

  17. Dynamic Page Migration Policy
  [Figure: the migration steps on the same CMP: select a recipient MC, select N pages at the heavily loaded donor MC, and copy those N pages from the donor to the recipient MC.]

  18. Dynamic Page Migration Policy - Challenges
  • Selecting the recipient MC: move pages to the MC k with the least value of the cost function
    cost_k = Λ × distance_k + Γ × rowhits_k
    i.e., move pages to a physically proximal MC while minimizing interference at the recipient MC
  • Selecting the N pages to migrate: the best value of N is selected empirically and can also be made programmable
  (A sketch of this selection follows this slide.)
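
  A sketch of the recipient-side decision, reusing the hypothetical MCStats records from the AFT sketch above; the donor exclusion and the "take the last N pages" placeholder are assumptions for illustration, not the paper's policy:

      def pick_recipient_mc(stats, donor, Lam, Gamma):
          """Pick the MC k (k != donor) minimizing
          cost_k = Lam*distance_k + Gamma*rowhits_k, i.e. a physically
          proximal MC where migrated pages cause the least row-buffer
          interference."""
          candidates = [k for k in range(len(stats)) if k != donor]
          return min(candidates,
                     key=lambda k: Lam * stats[k].distance
                                   + Gamma * stats[k].rowhits)

      def pages_to_migrate(donor_pages, n):
          """N is chosen empirically (and could be made programmable); this
          placeholder simply takes the N most recently recorded pages."""
          return donor_pages[-n:]

      # Example, continuing from the AFT sketch: MC3 is the overloaded donor.
      recipient = pick_recipient_mc(mcs, donor=3, Lam=1.0, Gamma=0.5)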

  19. Dynamic Page Migration Policy - Overheads
  • Pages are physically copied to new addresses
    • The original address mapping has to be invalidated
    • Cache lines belonging to the copied pages have to be invalidated
  • Copying pages can block resources, leading to unnecessary stalls
  • Immediate TLB invalidates can force accesses to stall or go to memory even though the data is still present at the donor
  • Solution: lazy copying, essentially a delayed write-back

  20. Issues with TLB Invalidates
  [Figure: while pages A and B are being copied from the donor MC to the recipient MC, the OS sends TLB invalidates to Cores 1, 3, 5 and 12; a read to the remapped page (A' -> A) must stall until the copy completes.]

  21. Lazy Copying
  [Figure: instead of invalidating, the OS updates the TLBs of Cores 1, 3, 5 and 12 to mark the page read-only, flushes its dirty cache lines, and copies pages A and B from the donor to the recipient MC; reads of A' -> A continue to be serviced during the copy and are redirected to the recipient MC only once the copy is complete. A toy sketch of this sequence follows.]
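
  A toy, simulation-level sketch of that sequence, where "dram" is a dict of frame -> contents and "tlbs" holds one per-core mapping; every structure and step here is an illustrative stand-in for the OS and hardware mechanisms, not the actual implementation:

      def lazy_copy(vpage, old_frame, new_frame, dram, tlbs):
          # 1. Downgrade the mapping to read-only everywhere; reads are still
          #    served from old_frame at the donor MC, so no core has to stall.
          for tlb in tlbs:
              tlb[vpage] = (old_frame, "ro")

          # 2. Dirty cache lines for the page would be flushed here so that
          #    dram[old_frame] is up to date (a no-op in this toy model).

          # 3. Copy the page into the frame owned by the recipient MC.
          dram[new_frame] = dram[old_frame]

          # 4. Only once the copy is complete, switch every mapping to the
          #    new frame and reclaim the old one -- the delayed write-back.
          for tlb in tlbs:
              tlb[vpage] = (new_frame, "rw")
          del dram[old_frame]

      # Example: a page visible to two cores migrates from frame 7 (donor MC)
      # to frame 21 (recipient MC).
      dram = {7: b"page-data"}
      tlbs = [{0x40: (7, "rw")}, {0x40: (7, "rw")}]
      lazy_copy(0x40, 7, 21, dram, tlbs)
      assert tlbs[0][0x40] == (21, "rw") and 21 in dram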

  22. Methodology
  • Simics-based simulation platform
  • DRAMSim-based DRAM timing
  • DRAM energy figures from CACTI 6.5
  • Baseline: assign pages to the closest MC

  23. Results - Throughput
  • Throughput improvement over the baseline: AFT 17.1%, Dynamic Page Migration 34.8%

  24. Results – DRAM Locality
  • DRAM locality improvement: AFT 16.6%, Dynamic Page Migration 22.7%
  • Standard deviation across MCs goes down, i.e., increased fairness

  25. Results – Reasons for Benefits

  26. Sensitivity Studies
  • Lazy copying does help, a little: an average 3.2% improvement over migration without lazy copying
  • Terms/variables in the cost function: very sensitive to load and row-buffer hit rates, much less so to distance
  • Cost of TLB shootdowns: negligible, since they are fairly uncommon
  • Physical placement of MCs (center vs. periphery): most workloads are agnostic to the placement

  27. Summary
  • Multiple, on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  • Intelligent data mapping will be needed to reduce average memory access delay
  • Adaptive First Touch policy
    • Increases performance by 17.1%
    • Decreases DRAM energy consumption by 14.1%
  • Dynamic page migration, an improvement on AFT
    • A further 17.7% improvement over AFT; 34.8% over the baseline
    • Increases energy consumption by 5.2%

  28. Thank You • http://www.cs.utah.edu/arch-research
