

Presentation Transcript


  1. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

  2. NUMA multicores [Figure: two processors (0 and 1), each with its own cores, cache, memory controller (MC), interconnect links (IC), and locally attached DRAM memory]

  3. NUMA multicores. Two problems: • NUMA: interconnect overhead [Figure: processes A and B and their memory MA and MB placed on the two-processor system; accesses to memory on the other processor cross the interconnect (IC)]

  4. NUMA multicores. Two problems: • NUMA: interconnect overhead • multicore: cache contention [Figure: processes A and B sharing a cache while accessing their memory MA and MB]

  5. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  6. Multi-clone experiments • Intel Xeon E5520 • 4 clones of soplex (SPEC CPU2006): local clones and remote clones • Memory behavior of unrelated programs [Figure: two quad-core processors (cores 0-7) with their caches and DRAM; clone processes C and their memory M placed either on the local or on the remote processor]

  7. [Figure: the five possible clone-placement schedules (1-5), with local bandwidth of 100%, 80%, 57%, 32%, and 0%]

  8. Performance of schedules • Which is the best schedule? • Baseline: single-program execution mode [Figure: the baseline configuration, a single clone C running alone with its memory M]

  9. Execution time [Figure: slowdown relative to the baseline for each schedule, shown separately for local clones, remote clones, and the average]

  10. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  11. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  12. Step 1: Maximum-local mapping [Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two processors (cores 0-7); each process runs on the processor whose DRAM holds its memory]
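A rough illustration of Step 1 (a sketch only: the slides do not show the actual implementation, and the types, names, and the spill rule for a full node are assumptions; two nodes with four cores each, matching the Xeon E5520 used earlier):

```c
#include <stdio.h>

#define NUM_NODES      2   /* processors (NUMA nodes)          */
#define CORES_PER_NODE 4   /* cores per processor (Xeon E5520) */

struct process {
    char name;      /* e.g. 'A' .. 'D'                                */
    int  mem_node;  /* node whose DRAM holds the process' memory      */
    int  core;      /* core assigned by the mapping (filled in below) */
};

/* Step 1 (sketch): place each process on a core of the node that holds its
 * memory; if that node has no free core left, fall back to the other node. */
static void maximum_local_map(struct process *p, int n)
{
    int used[NUM_NODES] = { 0 };

    for (int i = 0; i < n; i++) {
        int node = p[i].mem_node;
        if (used[node] >= CORES_PER_NODE)
            node = 1 - node;              /* spill: memory stays remote */
        p[i].core = node * CORES_PER_NODE + used[node]++;
    }
}

int main(void)
{
    struct process procs[] = {
        { 'A', 0, -1 }, { 'B', 0, -1 }, { 'C', 1, -1 }, { 'D', 1, -1 }
    };

    maximum_local_map(procs, 4);
    for (int i = 0; i < 4; i++)
        printf("%c -> core %d (memory on node %d)\n",
               procs[i].name, procs[i].core, procs[i].mem_node);
    return 0;
}
```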

  13. Default OS scheduling [Figure: processes A, B, C, D spread across cores 0-7 by the default OS scheduler, with their memory MA, MB, MC, MD in DRAM]

  14. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  15. Step 2: Cache-aware refinement. In an SMP: [Figure: processes A, B, C, D and their memory MA, MB, MC, MD mapped across the two processors and caches]

  16. Step 2: Cache-aware refinement. In an SMP: [Figure: an alternative mapping of A, B, C, D across the two caches]

  17. Step 2: Cache-aware refinement. In an SMP: [Figure: mapping of A, B, C, D across the two caches, with the performance degradation and NUMA penalty of each process shown as a bar chart]

  18. Step 2: Cache-aware refinement. In a NUMA: [Figure: processes A, B, C, D and their memory MA, MB, MC, MD mapped across the two processors]

  19. Step 2: Cache-aware refinement. In a NUMA: [Figure: an alternative NUMA mapping of A, B, C, D]

  20. Step 2: Cache-aware refinement. In a NUMA: [Figure: refined mapping of A, B, C, D, with the performance degradation, NUMA penalty, and NUMA allowance of each process shown as a bar chart]

  21. Performance factors. Two factors cause performance degradation: • NUMA penalty: slowdown due to remote memory access • cache pressure: local processes contribute their cache misses per kilo-instruction (MPKI), remote processes contribute MPKI x NUMA penalty
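A per-cache pressure value along these lines can be computed directly from the two factors. The sketch below uses made-up type and field names (the slides do not show the metric's implementation): local processes are charged their MPKI, remote processes their MPKI weighted by the NUMA penalty.

```c
/* Cache pressure (sketch): local processes contribute their cache misses per
 * kilo-instruction (MPKI); remote processes contribute MPKI scaled by the
 * NUMA penalty of their remote memory accesses. */
struct task_info {
    double mpki;          /* cache misses per 1000 instructions          */
    double numa_penalty;  /* slowdown factor of remote accesses (>= 1)   */
    int    is_local;      /* 1 if the task's memory is on the local node */
};

static double cache_pressure(const struct task_info *t, int n)
{
    double pressure = 0.0;
    for (int i = 0; i < n; i++)
        pressure += t[i].is_local ? t[i].mpki
                                  : t[i].mpki * t[i].numa_penalty;
    return pressure;
}
```

The cache-aware refinement step can then compare the pressure of the two caches and move processes until they are roughly balanced, as long as the added NUMA penalty stays within the NUMA allowance.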

  22. Implementation • User-mode extension to the Linux scheduler • Performance metrics: hardware performance counter feedback • NUMA penalty: perfect information from program traces, or an estimate based on MPKI • All memory for a process allocated on one processor
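A minimal user-space sketch of the placement constraint (all memory of a process allocated on one processor), using libnuma. This is only an illustration under the assumption that NUMA node numbers match the processors; it is not the authors' scheduler extension, which the slides do not show.

```c
/* Compile with:  gcc -o bind bind.c -lnuma   (requires libnuma headers) */
#include <numa.h>
#include <stdio.h>

/* Run the calling process on the CPUs of `node` and restrict all of its
 * memory allocations to that node's DRAM. */
static int bind_process_to_node(int node)
{
    struct bitmask *nodes;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return -1;
    }
    if (numa_run_on_node(node) < 0) {      /* pin to the node's CPUs  */
        perror("numa_run_on_node");
        return -1;
    }
    nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, node);
    numa_set_membind(nodes);               /* allocate only on `node` */
    numa_free_nodemask(nodes);
    return 0;
}

int main(void)
{
    if (bind_process_to_node(0) == 0)
        puts("bound to node 0: CPUs and memory allocations");
    return 0;
}
```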

  23. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  24. Workloads • SPEC CPU2006 subset • 11 multi-program workloads (WL1–WL11): 4-program workloads (WL1–WL9) and 8-program workloads (WL10, WL11) [Figure: NUMA penalty of the selected benchmarks, ranging from CPU-bound to memory-bound]

  25. Memory allocation setup • Where the memory of each process is allocated influences performance • Controlled setup: memory allocation maps

  26. Memory allocation maps [Figure: processes A, B, C, D on the two processors with all of their memory MA, MB, MC, MD allocated in one processor's DRAM; allocation map: 0000]

  27. Memory allocation maps [Figure: two configurations side by side, allocation maps 0000 and 0011, showing where the memory MA, MB, MC, MD of processes A, B, C, D is placed]

  28. Memory allocation maps [Figure: allocation map 0000 (unbalanced: all memory on one processor) next to allocation map 0011 (balanced: memory split between the two processors)]
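Read this way, an allocation map is one digit per process giving the node that holds that process's memory. The helpers below are purely illustrative and rely on that inferred encoding, which the slides state only through the examples 0000 and 0011.

```c
#include <stdio.h>
#include <string.h>

/* Allocation map helpers (encoding inferred from the slides: character i is
 * the node holding the memory of process i, e.g. "0011"). */
static void decode_map(const char *map, int *mem_node, int n)
{
    for (int i = 0; i < n && map[i]; i++)
        mem_node[i] = map[i] - '0';
}

/* Balanced when both nodes hold memory for the same number of processes. */
static int is_balanced(const char *map)
{
    int on_node1 = 0, n = (int)strlen(map);
    for (int i = 0; i < n; i++)
        on_node1 += map[i] - '0';
    return 2 * on_node1 == n;
}

int main(void)
{
    int nodes[4];
    decode_map("0011", nodes, 4);
    printf("process C's memory is on node %d\n", nodes[2]);
    printf("0000 balanced? %d   0011 balanced? %d\n",
           is_balanced("0000"), is_balanced("0011"));
    return 0;
}
```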

  29. Evaluation • Baseline: Linux average • Linux scheduler non-deterministic • average performance degradation in all possible cases • N-MASS with perfect NUMA penalty information

  30. WL9: Linux average Average slowdown relative to single-program mode

  31. WL9: N-MASS Average slowdown relative to single-program mode

  32. WL1: Linux average and N-MASS Average slowdown relative to single-program mode

  33. N-MASS performance • N-MASS reduces performance degradation by up to 22% • Which factor more important: interconnect overhead or cache contention? • Compare: - maximum-local - N-MASS (maximum-local + cache refinement step)

  34. Data-locality vs. cache balancing (WL9) Performance improvement relative to Linux average

  35. Data-locality vs. cache balancing (WL1) Performance improvement relative to Linux average

  36. Data locality vs. cache balancing • Data-locality more important than cache balancing • Cache-balancing gives performance benefits mostly with unbalanced allocation maps • What if information about NUMA penalty not available?

  37. Estimating NUMA penalty • The NUMA penalty is not directly measurable • Estimate: fit a linear regression onto MPKI data [Figure: NUMA penalty plotted against MPKI with the fitted regression]
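One way to obtain such an estimate is an ordinary least-squares fit of the penalty against MPKI. The sketch below shows only the mechanics; the calibration points are hypothetical, since the slides do not list the measured data.

```c
#include <stdio.h>

/* Least-squares fit of a line  penalty ~= a + b * mpki  to calibration
 * points, used to predict the NUMA penalty of a process from its MPKI
 * when no trace-based penalty information is available. */
static void fit_line(const double *x, const double *y, int n,
                     double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* hypothetical calibration points: (MPKI, measured NUMA penalty) */
    double mpki[]    = { 0.5, 2.0, 5.0, 10.0, 20.0 };
    double penalty[] = { 1.02, 1.05, 1.12, 1.25, 1.45 };
    double a, b;

    fit_line(mpki, penalty, 5, &a, &b);
    printf("estimated penalty(mpki) = %.3f + %.3f * mpki\n", a, b);
    return 0;
}
```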

  38. Estimate-based N-MASS: performance Performance improvement relative to Linux average

  39. Conclusions • N-MASS: NUMA-multicore-aware scheduler • Data locality optimizations more beneficial than cache contention avoidance • Better performance metrics needed for scheduling

  40. Thank you! Questions?
