Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Presentation Transcript

  1. Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

  2. NUMA multicores [Diagram: two processors with cores 0-3, per-processor caches, memory controllers (MC), interconnect links (IC), and DRAM memory attached to each processor]

  3. NUMA multicores. Two problems: • NUMA: interconnect overhead [Diagram: processes A and B with memories MA and MB; accesses to memory on the other processor's DRAM cross the interconnect]

  4. NUMA multicores. Two problems: • NUMA: interconnect overhead • multicore: cache contention [Diagram: processes A and B sharing a cache while their memories MA and MB reside in DRAM]

  5. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  6. Multi-clone experiments • Intel Xeon E5520 • 4 clones of soplex (SPEC CPU2006): local clones and remote clones • Memory behavior of unrelated programs [Diagram: two processors (cores 0-3 and 4-7) with shared caches, MC/IC, and DRAM; clones C and their memories M placed across the two nodes]
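
For readers who want to reproduce a placement like this, the sketch below shows one way to create a "local" or "remote" clone with libnuma: the process is pinned to node 0 and its working buffer is allocated either on node 0 or on node 1. The talk does not show this code; the buffer size, node numbering, and the memset loop are illustrative assumptions.

```c
/* Minimal sketch of a "local" vs. "remote" clone setup with libnuma
 * (compile with -lnuma; run with argument "remote" for the remote case).
 * A local clone runs on the node that holds its memory; a remote clone's
 * memory lives on the other node, so every miss crosses the interconnect. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int cpu_node = 0;
    int mem_node = (argc > 1 && strcmp(argv[1], "remote") == 0) ? 1 : 0;

    numa_run_on_node(cpu_node);                      /* pin the process to node 0 */

    size_t size = 512UL * 1024 * 1024;               /* illustrative working set */
    char *buf = numa_alloc_onnode(size, mem_node);   /* memory local or remote */
    if (!buf) return 1;

    for (int iter = 0; iter < 100; iter++)           /* stream over the buffer */
        memset(buf, iter, size);

    numa_free(buf, size);
    return 0;
}
```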

  7. [Diagram: five clone/memory placements with local bandwidths of 0%, 32%, 57%, 80%, and 100%]

  8. Performance of schedules • Which is the best schedule? • Baseline: single-program execution mode [Diagram: a single clone C with its memory M]

  9. Execution time [Chart: slowdown relative to baseline for local clones, remote clones, and the average, for each clone placement]

  10. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  11. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  12. Step 1: Maximum-local mapping [Diagram: processes A, B, C, D with memories MA, MB, MC, MD; each process is placed on the processor whose DRAM holds its memory]
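
A minimal sketch of what Step 1 amounts to in user space, assuming (as stated later in the talk) that each process's memory is allocated entirely on one processor and that this node is known: pin every process to the cores of the node that holds its memory. The process list, node IDs, and the 4-cores-per-node layout below are illustrative, not taken from the talk.

```c
/* Sketch of maximum-local mapping: pin each process to the cores of the
 * NUMA node that holds its memory, so all its accesses stay local. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

#define CORES_PER_NODE 4   /* e.g. cores 0-3 on node 0, cores 4-7 on node 1 */

struct proc { pid_t pid; int mem_node; };

/* Restrict the process to the cores of the node holding its memory. */
static int map_local(const struct proc *p)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < CORES_PER_NODE; c++)
        CPU_SET(p->mem_node * CORES_PER_NODE + c, &set);
    return sched_setaffinity(p->pid, sizeof(set), &set);
}

int main(void)
{
    /* illustrative processes A-D and the nodes holding MA-MD */
    struct proc procs[] = { {1234, 0}, {1235, 0}, {1236, 1}, {1237, 1} };
    for (unsigned i = 0; i < sizeof(procs) / sizeof(procs[0]); i++)
        if (map_local(&procs[i]) != 0)
            perror("sched_setaffinity");
    return 0;
}
```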

  13. Default OS scheduling [Diagram: processes A-D placed by the default OS scheduler across both processors, regardless of where their memories MA-MD reside]

  14. N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: • Step 1: maximum-local mapping • Step 2: cache-aware refinement

  15. Step 2: Cache-aware refinement. In an SMP: [Diagram: processes A-D and memories MA-MD on the two processors]

  16. Step 2: Cache-aware refinement. In an SMP: [Diagram: processes A-D sharing the two caches while their memories MA-MD remain in DRAM]

  17. Step 2: Cache-aware refinement. In an SMP: [Diagram: refined placement of processes A-D; chart: per-process performance degradation and NUMA penalty]

  18. Step 2: Cache-aware refinement. In a NUMA: [Diagram: processes A-D and memories MA-MD on the two processors]

  19. Step 2: Cache-aware refinement. In a NUMA: [Diagram: processes A-D and memories MA-MD on the two processors]

  20. Step 2: Cache-aware refinement. In a NUMA: [Diagram: one process moved to the other processor; chart: per-process performance degradation, NUMA penalty, and NUMA allowance]

  21. Performance factors. Two factors cause performance degradation: • NUMA penalty: slowdown due to remote memory access • cache pressure: local processes contribute misses per kilo-instruction (MPKI), remote processes contribute MPKI x NUMA penalty
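
The two contributions named on this slide can be written down directly. The sketch below computes the cache pressure on one processor: a local process contributes its MPKI, a remote process contributes its MPKI scaled by its NUMA penalty. The struct layout and field names are my own, not from the talk.

```c
/* Sketch of the cache-pressure metric: sum per-task contributions,
 * scaling remote tasks by their NUMA penalty. */
struct task_info {
    double mpki;          /* last-level cache misses per kilo-instruction */
    double numa_penalty;  /* slowdown factor when memory is remote */
    int    is_remote;     /* 1 if the task's memory is on the other node */
};

static double cache_pressure(const struct task_info *tasks, int n)
{
    double pressure = 0.0;
    for (int i = 0; i < n; i++)
        pressure += tasks[i].is_remote
                  ? tasks[i].mpki * tasks[i].numa_penalty
                  : tasks[i].mpki;
    return pressure;
}
```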

  22. Implementation • User-mode extension to the Linux scheduler • Performance metrics: hardware performance counter feedback • NUMA penalty: perfect information from program traces, or an estimate based on MPKI • All memory for a process allocated on one processor
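
A rough sketch of the counter-feedback part of such a user-mode extension, assuming Linux's perf_event_open interface: sample cache misses and retired instructions for a pid and derive its MPKI, which the scheduler extension could feed into the cache-pressure metric. This is not the authors' implementation; the sampling interval and error handling are minimal and illustrative.

```c
/* Sketch: sample a pid's cache misses and instructions and print MPKI. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int open_counter(pid_t pid, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.inherit = 1;                          /* include child threads */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
    int misses = open_counter(pid, PERF_COUNT_HW_CACHE_MISSES);
    int instrs = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS);
    if (misses < 0 || instrs < 0) { perror("perf_event_open"); return 1; }

    for (;;) {
        sleep(1);                              /* illustrative sampling interval */
        uint64_t m = 0, i = 0;
        if (read(misses, &m, sizeof(m)) != sizeof(m) ||
            read(instrs, &i, sizeof(i)) != sizeof(i))
            break;
        double mpki = i ? 1000.0 * (double)m / (double)i : 0.0;
        printf("pid %d: MPKI = %.2f\n", (int)pid, mpki);
        /* a scheduler extension would feed this into the cache-pressure
           metric and adjust placements with sched_setaffinity() */
    }
    return 0;
}
```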

  23. Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation

  24. Workloads • SPEC CPU2006 subset • 11 multi-program workloads (WL1 to WL11): 4-program workloads (WL1 to WL9) and 8-program workloads (WL10, WL11) [Chart: NUMA penalty of the selected programs, ranging from CPU-bound to memory-bound]

  25. Memory allocation setup • Where the memory of each process is allocated influences performance • Controlled setup: memory allocation maps

  26. Memory allocation maps [Diagram: processes A-D with memories MA-MD; allocation map 0000 places all four memories on Processor 0's DRAM]

  27. Memory allocation maps [Diagram: allocation map 0000 (all memories on Processor 0) shown next to allocation map 0011 (MA and MB on Processor 0, MC and MD on Processor 1)]

  28. Memory allocation maps [Diagram: allocation map 0000 is unbalanced (all memories on one processor), allocation map 0011 is balanced (memories split across both processors)]
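
The allocation maps can be read digit by digit: digit i names the processor whose DRAM holds process i's memory. The small sketch below decodes the two maps shown on these slides; the balanced/unbalanced check is my interpretation of the slide, not code from the talk.

```c
/* Sketch: decode an allocation map string and report whether the
 * memories are balanced across the two processors. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *maps[] = { "0000", "0011" };
    for (int m = 0; m < 2; m++) {
        int on_node1 = 0, n = (int)strlen(maps[m]);
        for (int i = 0; i < n; i++) {
            printf("map %s: process %c's memory on processor %c\n",
                   maps[m], 'A' + i, maps[m][i]);
            if (maps[m][i] == '1') on_node1++;
        }
        printf("map %s is %s\n", maps[m],
               (2 * on_node1 == n) ? "balanced" : "unbalanced");
    }
    return 0;
}
```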

  29. Evaluation • Baseline: Linux average (the Linux scheduler is non-deterministic, so we average performance degradation over all possible placements) • N-MASS with perfect NUMA penalty information

  30. WL9: Linux average [Chart: average slowdown relative to single-program mode]

  31. WL9: N-MASS [Chart: average slowdown relative to single-program mode]

  32. WL1: Linux average and N-MASS [Chart: average slowdown relative to single-program mode]

  33. N-MASS performance • N-MASS reduces performance degradation by up to 22% • Which factor is more important: interconnect overhead or cache contention? • Compare: maximum-local alone vs. N-MASS (maximum-local + cache-aware refinement step)

  34. Data locality vs. cache balancing (WL9) [Chart: performance improvement relative to Linux average]

  35. Data locality vs. cache balancing (WL1) [Chart: performance improvement relative to Linux average]

  36. Data locality vs. cache balancing • Data locality is more important than cache balancing • Cache balancing gives performance benefits mostly with unbalanced allocation maps • What if information about the NUMA penalty is not available?

  37. Estimating NUMA penalty [Chart: NUMA penalty as a function of MPKI] • The NUMA penalty is not directly measurable • Estimate: fit a linear regression onto MPKI data
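
A minimal sketch of the estimation idea, assuming the regression is a simple least-squares line over (MPKI, measured penalty) pairs; the training data points below are made up for illustration only.

```c
/* Sketch: fit numa_penalty ~ a + b * MPKI by least squares and predict
 * the penalty of a new program from its MPKI alone. */
#include <stdio.h>

static void fit_line(const double *x, const double *y, int n,
                     double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* illustrative training data: (MPKI, measured NUMA penalty) */
    double mpki[]    = { 0.5, 2.0, 5.0, 10.0, 20.0 };
    double penalty[] = { 1.02, 1.05, 1.12, 1.20, 1.35 };
    double a, b;
    fit_line(mpki, penalty, 5, &a, &b);
    printf("estimated penalty(MPKI) = %.3f + %.3f * MPKI\n", a, b);
    printf("program with MPKI 8.0 -> estimated penalty %.2f\n", a + b * 8.0);
    return 0;
}
```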

  38. Estimate-based N-MASS: performance [Chart: performance improvement relative to Linux average]

  39. Conclusions • N-MASS: a NUMA-multicore-aware scheduler • Data locality optimizations are more beneficial than cache contention avoidance • Better performance metrics are needed for scheduling

  40. Thank you! Questions?