1 / 37

Memory System Performance in a NUMA Multicore Multiprocessor

Memory System Performance in a NUMA Multicore Multiprocessor. Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich. Summary. NUMA multicore systems are unfair to local memory accesses Local execution sometimes suboptimal. Outline. NUMA multicores: how it happened

agalia
Download Presentation

Memory System Performance in a NUMA Multicore Multiprocessor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Memory System Performance in a NUMA Multicore Multiprocessor Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich

  2. Summary • NUMA multicore systems are unfair to local memory accesses • Local execution sometimes suboptimal

  3. Outline • NUMA multicores: how it happened • Experimental evaluation: Intel Nehalem • Bandwidth sharing model • The next generation: Intel Westmere

  4. NUMA multicores: how it happened First generation: SMP 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 BusC BusC BusC BusC BusC BusC BusC BusC Northbridge MC MC DRAM memory

  5. NUMA multicores: how it happened Next generation: NUMA 0 1 2 3 4 5 6 7 BusC BusC BusC BusC IC IC Northbridge MC MC MC DRAM memory DRAM memory

  6. NUMA multicores: how it happened Next generation: NUMA 0 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 MC MC IC IC DRAM memory DRAM memory

  7. NUMA multicores: how it happened Next generation: NUMA 0 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 MC MC IC IC DRAM memory DRAM memory

  8. Bandwidth sharing • Frequent scenario:bandwidth sharedbetween cores • Sharing model for the Intel Nehalem 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 MC IC IC MC DRAM memory DRAM memory

  9. Outline • NUMA multicores: how it happened • Experimental evaluation: Intel Nehalem • Bandwidth sharing model • The next generation: Intel Westmere

  10. Evaluation system Intel Nehalem E5520 2 x 4 cores 8 MB level 3 cache 12 GB DDR3 RAM 5.86 GT/s QPI Processor 0 Processor 1 0 1 2 3 4 5 6 7 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue Global Queue MC QPI QPI MC QPI QPI DRAM memory DRAM memory

  11. Bandwidth sharing: local accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

  12. Bandwidth sharing: remote accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 4 5 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

  13. Bandwidth sharing: combined accesses Processor 0 Processor 1 0 1 2 3 4 5 6 7 0 4 5 3 Level 3 cache Level 3 cache Global Queue Global Queue Global Queue Global Queue MC QPI QPI MC DRAM memory DRAM memory DRAM memory

  14. Global Queue • Mechanism to arbitrate between different types of memory accesses • We look at fairness of the Global Queue: • local memory accesses • remote memory accesses • combined memory accesses

  15. Benchmark program • STREAM triad for (i=0; i<SIZE; i++) { a[i]=b[i]+SCALAR*c[i]; } • Multiple co-executing triad clones

  16. Multi-clone experiments • All memory allocated on Processor 0 • Local clones: Remote clones: • Example benchmark configurations: C C (0L, 3R) (2L, 3R) (2L, 0R) C C C C C C C C C C Processor 0 Processor 1 Processor 0 Processor 1

  17. GQ fairness: local accesses Processor 0 Processor 1 Total bandwidth [GB/s] 0 1 2 3 4 5 6 7 C C C C Cache Cache GQ GQ IMC QPI QPI IMC DRAM memory DRAM DRAM

  18. GQ fairness: remote accesses Processor 0 Processor 1 Total bandwidth [GB/s] 0 1 2 3 4 5 6 7 C C C C Cache Cache GQ GQ IMC QPI QPI IMC DRAM memory DRAM DRAM

  19. Global Queue fairness • Global Queue fair when there areonly local/remote accesses in the system • What about combined accesses?

  20. GQ fairness: combined accesses Execute clones in all possible configurations (2L, 3R)

  21. GQ fairness: combined accesses Execute clones in all possible configurations

  22. GQ fairness: combined accesses Total bandwidth [GB/s]

  23. GQ fairness: combined accesses Execute clones in all possible configurations

  24. Combined accesses Total bandwidth [GB/s]

  25. Combined accesses • In configuration (4L, 1R)remote clone gets30% more bandwidth than a local clone • Remote execution can be better than local

  26. Outline • NUMA multicores: how it happened • Experimental evaluation: Intel Nehalem • Bandwidth sharing model • The next generation: Intel Westmere

  27. Bandwidth sharing model 0 1 2 3 4 5 6 7 C C Level 3 cache Level 3 cache Global Queue Global Queue IMC QPI QPI IMC DRAM memory DRAM memory DRAM memory

  28. Sharing factor () • Characterizes the fairness of the Global Queue • Dependence of sharing factor on contention?

  29. Contention affects sharing factor Processor 0 Processor 0 C DRAM QPI C contenders C C C

  30. Contention affects sharing factor Sharing factor ()

  31. Combined accesses Total bandwidth [GB/s]

  32. Contention affects sharing factor • Sharing factor decreases with contention • With local contention remote execution becomes more favorable

  33. Outline • NUMA multicores: how it happened • Experimental evaluation: Intel Nehalem • Bandwidth sharing model • The next generation: Intel Westmere

  34. The next generation Intel Westmere X5680 2 x 6 cores 12 MB level 3 cache 144 GB DDR3 RAM 6.4 GT/s QPI 6 7 8 9 A B 0 1 2 3 4 5 Level 3 cache Level 3 cache Global Queue Global Queue IMC QPI QPI IMC DRAM memory DRAM memory

  35. The next generation Total bandwidth [GB/s]

  36. Conclusions • Optimizing for data locality can be suboptimal • Applications: • OS scheduling (see ISMM’11 paper) • data placement and computation scheduling

  37. Thank you! Questions?

More Related