Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Presentation Transcript

  1. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

  2. NUMA multicores [Diagram: two processors with cores 0-3, per-processor caches, memory controllers (MC), interconnect links (IC), and DRAM memory attached to each processor]

  3. NUMA multicores. Two problems: • NUMA: interconnect overhead [Diagram: processes A and B with memories MA and MB; accesses to memory on the other processor's DRAM cross the interconnect]

  4. NUMA multicores. Two problems: • NUMA: interconnect overhead • multicore: cache contention [Diagram: processes A and B sharing a cache while their memories MA and MB reside in DRAM]

  5. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  6. Multi-clone experiments • Intel Xeon E5520 • 4 clones of soplex (SPEC CPU2006): local clones and remote clones • Memory behavior of unrelated programs [Diagram: two processors (cores 0-3 and 4-7) with shared caches, MC/IC, and DRAM; clones C and their memories M placed across the two nodes]
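
For readers who want to reproduce a placement like this, the sketch below shows one way to create a "local" or "remote" clone with libnuma: the process is pinned to node 0 and its working buffer is allocated either on node 0 or on node 1. The talk does not show this code; the buffer size, node numbering, and the memset loop are illustrative assumptions.

```c
/* Minimal sketch of a "local" vs. "remote" clone setup with libnuma
 * (compile with -lnuma; run with argument "remote" for the remote case).
 * A local clone runs on the node that holds its memory; a remote clone's
 * memory lives on the other node, so every miss crosses the interconnect. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int cpu_node = 0;
    int mem_node = (argc > 1 && strcmp(argv[1], "remote") == 0) ? 1 : 0;

    numa_run_on_node(cpu_node);                      /* pin the process to node 0 */

    size_t size = 512UL * 1024 * 1024;               /* illustrative working set */
    char *buf = numa_alloc_onnode(size, mem_node);   /* memory local or remote */
    if (!buf) return 1;

    for (int iter = 0; iter < 100; iter++)           /* stream over the buffer */
        memset(buf, iter, size);

    numa_free(buf, size);
    return 0;
}
```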

  7. [Diagram: five clone/memory placements with local bandwidths of 0%, 32%, 57%, 80%, and 100%]

  8. Performance of schedules • Which is the best schedule? • Baseline: single-program execution mode [Diagram: a single clone C with its memory M]

  9. Execution time [Chart: slowdown relative to baseline for local clones, remote clones, and the average, for each clone placement]

  10. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  11. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  12. Step 1: Maximum-local mapping [Diagram: processes A, B, C, D with memories MA, MB, MC, MD; each process is placed on the processor whose DRAM holds its memory]
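
A minimal sketch of what Step 1 amounts to in user space, assuming (as stated later in the talk) that each process's memory is allocated entirely on one processor and that this node is known: pin every process to the cores of the node that holds its memory. The process list, node IDs, and the 4-cores-per-node layout below are illustrative, not taken from the talk.

```c
/* Sketch of maximum-local mapping: pin each process to the cores of the
 * NUMA node that holds its memory, so all its accesses stay local. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

#define CORES_PER_NODE 4   /* e.g. cores 0-3 on node 0, cores 4-7 on node 1 */

struct proc { pid_t pid; int mem_node; };

/* Restrict the process to the cores of the node holding its memory. */
static int map_local(const struct proc *p)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < CORES_PER_NODE; c++)
        CPU_SET(p->mem_node * CORES_PER_NODE + c, &set);
    return sched_setaffinity(p->pid, sizeof(set), &set);
}

int main(void)
{
    /* illustrative processes A-D and the nodes holding MA-MD */
    struct proc procs[] = { {1234, 0}, {1235, 0}, {1236, 1}, {1237, 1} };
    for (unsigned i = 0; i < sizeof(procs) / sizeof(procs[0]); i++)
        if (map_local(&procs[i]) != 0)
            perror("sched_setaffinity");
    return 0;
}
```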

  13. Default OS scheduling [Diagram: processes A-D placed by the default OS scheduler across both processors, regardless of where their memories MA-MD reside]

  14. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  15. Step 2: Cache-aware refinement. In an SMP: [Diagram: processes A-D and memories MA-MD on the two processors]

  16. Step 2: Cache-aware refinement. In an SMP: [Diagram: processes A-D sharing the two caches while their memories MA-MD remain in DRAM]

  17. Step 2: Cache-aware refinement. In an SMP: [Diagram: refined placement of processes A-D; chart: per-process performance degradation and NUMA penalty]

  18. Step 2: Cache-aware refinement. In a NUMA: [Diagram: processes A-D and memories MA-MD on the two processors]

  19. Step 2: Cache-aware refinement. In a NUMA: [Diagram: processes A-D and memories MA-MD on the two processors]

  20. Step 2: Cache-aware refinement. In a NUMA: [Diagram: one process moved to the other processor; chart: per-process performance degradation, NUMA penalty, and NUMA allowance]

  21. Performance factors. Two factors cause performance degradation: • NUMA penalty: slowdown due to remote memory access • cache pressure: local processes contribute misses per kilo-instruction (MPKI), remote processes contribute MPKI x NUMA penalty
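
The two contributions named on this slide can be written down directly. The sketch below computes the cache pressure on one processor: a local process contributes its MPKI, a remote process contributes its MPKI scaled by its NUMA penalty. The struct layout and field names are my own, not from the talk.

```c
/* Sketch of the cache-pressure metric: sum per-task contributions,
 * scaling remote tasks by their NUMA penalty. */
struct task_info {
    double mpki;          /* last-level cache misses per kilo-instruction */
    double numa_penalty;  /* slowdown factor when memory is remote */
    int    is_remote;     /* 1 if the task's memory is on the other node */
};

static double cache_pressure(const struct task_info *tasks, int n)
{
    double pressure = 0.0;
    for (int i = 0; i < n; i++)
        pressure += tasks[i].is_remote
                  ? tasks[i].mpki * tasks[i].numa_penalty
                  : tasks[i].mpki;
    return pressure;
}
```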

  22. Implementation • User-mode extension to the Linux scheduler • Performance metrics: hardware performance counter feedback • NUMA penalty: perfect information from program traces, or an estimate based on MPKI • All memory for a process allocated on one processor
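
A rough sketch of the counter-feedback part of such a user-mode extension, assuming Linux's perf_event_open interface: sample cache misses and retired instructions for a pid and derive its MPKI, which the scheduler extension could feed into the cache-pressure metric. This is not the authors' implementation; the sampling interval and error handling are minimal and illustrative.

```c
/* Sketch: sample a pid's cache misses and instructions and print MPKI. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int open_counter(pid_t pid, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.inherit = 1;                          /* include child threads */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
    int misses = open_counter(pid, PERF_COUNT_HW_CACHE_MISSES);
    int instrs = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS);
    if (misses < 0 || instrs < 0) { perror("perf_event_open"); return 1; }

    for (;;) {
        sleep(1);                              /* illustrative sampling interval */
        uint64_t m = 0, i = 0;
        if (read(misses, &m, sizeof(m)) != sizeof(m) ||
            read(instrs, &i, sizeof(i)) != sizeof(i))
            break;
        double mpki = i ? 1000.0 * (double)m / (double)i : 0.0;
        printf("pid %d: MPKI = %.2f\n", (int)pid, mpki);
        /* a scheduler extension would feed this into the cache-pressure
           metric and adjust placements with sched_setaffinity() */
    }
    return 0;
}
```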

  23. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  24. Workloads • SPEC CPU2006 subset • 11 multi-program workloads (WL1 to WL11): 4-program workloads (WL1 to WL9) and 8-program workloads (WL10, WL11) [Chart: NUMA penalty of the selected programs, ranging from CPU-bound to memory-bound]

  25. Memory allocation setup • Where the memory of each process is allocated influences performance • Controlled setup: memory allocation maps

  26. Memory allocation maps [Diagram: processes A-D with memories MA-MD; allocation map 0000 places all four memories on Processor 0's DRAM]

  27. Memory allocation maps [Diagram: allocation map 0000 (all memories on Processor 0) shown next to allocation map 0011 (MA and MB on Processor 0, MC and MD on Processor 1)]

  28. Memory allocation maps [Diagram: allocation map 0000 is unbalanced (all memories on one processor), allocation map 0011 is balanced (memories split across both processors)]
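
The allocation maps can be read digit by digit: digit i names the processor whose DRAM holds process i's memory. The small sketch below decodes the two maps shown on these slides; the balanced/unbalanced check is my interpretation of the slide, not code from the talk.

```c
/* Sketch: decode an allocation map string and report whether the
 * memories are balanced across the two processors. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *maps[] = { "0000", "0011" };
    for (int m = 0; m < 2; m++) {
        int on_node1 = 0, n = (int)strlen(maps[m]);
        for (int i = 0; i < n; i++) {
            printf("map %s: process %c's memory on processor %c\n",
                   maps[m], 'A' + i, maps[m][i]);
            if (maps[m][i] == '1') on_node1++;
        }
        printf("map %s is %s\n", maps[m],
               (2 * on_node1 == n) ? "balanced" : "unbalanced");
    }
    return 0;
}
```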

  29. Evaluation • Baseline: Linux average (the Linux scheduler is non-deterministic, so we average performance degradation over all possible placements) • N-MASS with perfect NUMA penalty information

  30. WL9: Linux average [Chart: average slowdown relative to single-program mode]

  31. WL9: N-MASS [Chart: average slowdown relative to single-program mode]

  32. WL1: Linux average and N-MASS [Chart: average slowdown relative to single-program mode]

  33. N-MASS performance • N-MASS reduces performance degradation by up to 22% • Which factor is more important: interconnect overhead or cache contention? • Compare: maximum-local alone vs. N-MASS (maximum-local + cache-aware refinement step)

  34. Data locality vs. cache balancing (WL9) [Chart: performance improvement relative to Linux average]

  35. Data locality vs. cache balancing (WL1) [Chart: performance improvement relative to Linux average]

  36. Data locality vs. cache balancing • Data locality is more important than cache balancing • Cache balancing gives performance benefits mostly with unbalanced allocation maps • What if information about the NUMA penalty is not available?

  37. Estimating NUMA penalty [Chart: NUMA penalty as a function of MPKI] • The NUMA penalty is not directly measurable • Estimate: fit a linear regression onto MPKI data
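
A minimal sketch of the estimation idea, assuming the regression is a simple least-squares line over (MPKI, measured penalty) pairs; the training data points below are made up for illustration only.

```c
/* Sketch: fit numa_penalty ~ a + b * MPKI by least squares and predict
 * the penalty of a new program from its MPKI alone. */
#include <stdio.h>

static void fit_line(const double *x, const double *y, int n,
                     double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* illustrative training data: (MPKI, measured NUMA penalty) */
    double mpki[]    = { 0.5, 2.0, 5.0, 10.0, 20.0 };
    double penalty[] = { 1.02, 1.05, 1.12, 1.20, 1.35 };
    double a, b;
    fit_line(mpki, penalty, 5, &a, &b);
    printf("estimated penalty(MPKI) = %.3f + %.3f * MPKI\n", a, b);
    printf("program with MPKI 8.0 -> estimated penalty %.2f\n", a + b * 8.0);
    return 0;
}
```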

  38. Estimate-based N-MASS: performance [Chart: performance improvement relative to Linux average]

  39. Conclusions • N-MASS: a NUMA-multicore-aware scheduler • Data locality optimizations are more beneficial than cache contention avoidance • Better performance metrics are needed for scheduling

  40. Thank you! Questions?