This presentation examines cache contention and interconnect overhead in NUMA multicore systems and evaluates N-MASS, a NUMA- and cache-aware scheduling scheme, using experimental data and several scheduling strategies.
Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich
NUMA multicores
[Figure: two processors (cores 0–3), each with a shared cache and memory controller (MC), connected by the interconnect (IC) to local and remote DRAM memory]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
[Figure: processes A and B whose memory (MA, MB) resides on the remote DRAM, so their accesses cross the interconnect]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
• multicore: cache contention
[Figure: processes A and B sharing Processor 0's cache while accessing their memory (MA, MB) on the remote DRAM]
Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation
Multi-clone experiments
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006): local clones and remote clones
• Memory behavior of unrelated programs
[Figure: two processors (cores 0–7) with shared caches; clones (C) run on cores while their memory (M) sits on local or remote DRAM]
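The clone placements above can be reproduced with standard Linux tools. Below is a minimal sketch (not the authors' harness) that pins each soplex clone to a core and binds its memory to a NUMA node with numactl; the binary path, input file, and core/node numbering are placeholders that depend on the machine (check `numactl --hardware`).

```python
# Minimal sketch (not the authors' harness): launch soplex clones pinned to
# one core each, with all of their memory bound to a chosen NUMA node via
# numactl. Binary path, input file, and core/node numbers are placeholders.
import subprocess

def launch_clone(core, mem_node, binary="./soplex", args=("ref.mps",)):
    """Run one clone on `core` with its memory allocated on `mem_node`."""
    cmd = ["numactl",
           f"--physcpubind={core}",   # pin the clone to a single core
           f"--membind={mem_node}",   # allocate all of its memory on this node
           binary, *args]
    return subprocess.Popen(cmd)

# Example: two local clones (cores and memory on node 0) and two remote
# clones (cores on node 0, memory on node 1), as in the experiments above.
clones = [launch_clone(core=0, mem_node=0),
          launch_clone(core=1, mem_node=0),
          launch_clone(core=2, mem_node=1),
          launch_clone(core=3, mem_node=1)]
for c in clones:
    c.wait()
```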
[Figure: five schedules placing the four clones across the two processors, giving local-memory bandwidth shares of 100%, 80%, 57%, 32%, and 0%]
Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode
[Figure: a single clone running alone with its memory on the local DRAM]
Execution time
[Figure: slowdown relative to the baseline for local clones, remote clones, and the average, across the schedules]
Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
• Step 1: maximum-local mapping
• Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Figure: processes A, B, C, D and their memories MA, MB, MC, MD distributed across the two processors' DRAM]
Default OS scheduling
[Figure: the default OS schedule places A, B, C, D on cores without regard to where MA, MB, MC, MD reside]
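In contrast to the default OS schedule above, Step 1 places each process on the processor that holds its memory. The following is a minimal sketch of that idea, assuming the memory node of every process is already known; the function and data-structure names are illustrative, not the actual N-MASS code.

```python
# Minimal sketch of Step 1 (maximum-local mapping): put each process on a
# core of the NUMA node that holds its memory, spilling over only when that
# node has no free cores left.

def maximum_local_mapping(mem_node_of, cores_per_node):
    """mem_node_of:    dict process -> NUMA node holding its memory
    cores_per_node: dict node -> list of free core ids on that node"""
    free = {node: list(cores) for node, cores in cores_per_node.items()}
    mapping = {}
    for proc, node in mem_node_of.items():
        if free[node]:                       # a local core is still available
            mapping[proc] = free[node].pop(0)
        else:                                # node full: fall back to the emptiest node
            spill_node = max(free, key=lambda n: len(free[n]))
            mapping[proc] = free[spill_node].pop(0)
    return mapping

# Example: A and B have memory on node 0, C and D on node 1.
print(maximum_local_mapping(
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}))
# -> {'A': 0, 'B': 1, 'C': 4, 'D': 5}
```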
Step 2: Cache-aware refinement
In an SMP:
[Figure: alternative placements of A, B, C, D across the two caches, compared by per-process performance degradation]
Step 2: Cache-aware refinement
In a NUMA system: moving a process away from its memory adds a NUMA penalty, so cache-aware moves are limited by a NUMA allowance.
[Figure: placements of A, B, C, D compared by per-process performance degradation, with NUMA penalty and NUMA allowance marked]
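A minimal sketch of the refinement idea, assuming per-process estimates of cache pressure and NUMA penalty are available (the next slide describes how they are derived); the names, the greedy order, and the allowance parameter are illustrative, not the published algorithm.

```python
# Minimal sketch of Step 2 (cache-aware refinement): starting from the
# maximum-local mapping, move a process to the other processor's cache only
# if the NUMA penalty it would pay stays within the allowance granted by the
# cache pressure it relieves.

def refine(mapping, cache_pressure, numa_penalty, allowance=1.0):
    """mapping:        dict process -> node it currently runs on (memory-local)
    cache_pressure: dict process -> contention it adds to its current cache
    numa_penalty:   dict process -> slowdown factor if it runs remotely"""
    load = {n: sum(cache_pressure[p] for p, m in mapping.items() if m == n)
            for n in set(mapping.values())}
    hot = max(load, key=load.get)            # most pressured cache
    cold = min(load, key=load.get)           # least pressured cache
    # Try to move processes off the hot cache, cheapest NUMA penalty first.
    for proc in sorted((p for p, n in mapping.items() if n == hot),
                       key=lambda p: numa_penalty[p]):
        if load[hot] <= load[cold]:
            break
        relieved = cache_pressure[proc]
        if numa_penalty[proc] <= allowance * relieved:   # within the allowance
            mapping[proc] = cold
            load[hot] -= relieved
            load[cold] += relieved   # simplification: a remote process would
                                     # really add MPKI x NUMA penalty there
    return mapping
```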
Performance factors
Two factors cause performance degradation:
• NUMA penalty: slowdown due to remote memory access
• cache pressure: for local processes, cache misses per 1000 instructions (MPKI); for remote processes, MPKI × NUMA penalty
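A small illustration of the cache-pressure metric just described; the numbers are made up.

```python
# Cache pressure as stated on the slide: a local process contributes its
# MPKI; a remote process contributes MPKI x NUMA penalty.

def cache_pressure(mpki, numa_penalty, runs_locally):
    """Contention a process adds to the cache it runs on."""
    return mpki if runs_locally else mpki * numa_penalty

# Example: a process with MPKI = 5 and a NUMA penalty of 1.2 puts more
# pressure on its cache when it runs remotely.
print(cache_pressure(5.0, 1.2, runs_locally=True))   # 5.0
print(cache_pressure(5.0, 1.2, runs_locally=False))  # 6.0
```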
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics: hardware performance counter feedback
• NUMA penalty: perfect information from program traces, or an estimate based on MPKI
• All memory for a process allocated on one processor
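One plausible way for a user-mode scheduler to obtain this counter feedback is the Linux `perf` tool; the slides do not say which interface N-MASS uses, so treat this as an assumption, and note that event names and output format vary across perf versions.

```python
# Minimal sketch: sample cache misses per 1000 instructions (MPKI) for a
# running process with `perf stat`. Assumes Linux perf is installed and the
# generic "instructions" / "cache-misses" events are available.
import subprocess

def sample_mpki(pid, seconds=1):
    cmd = ["perf", "stat", "-x", ",",                    # machine-readable output
           "-e", "instructions,cache-misses",
           "-p", str(pid), "--", "sleep", str(seconds)]  # measure for `seconds`
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    counts = {"instructions": 0, "cache-misses": 0}
    for line in out.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].strip().isdigit():
            continue                                     # skip "<not counted>" etc.
        for event in counts:
            if event in fields[2]:
                counts[event] = int(fields[0].strip())
    instr = counts["instructions"]
    return 1000.0 * counts["cache-misses"] / instr if instr else 0.0
```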
Outline • NUMA: experimental evaluation • Scheduling • N-MASS • N-MASS evaluation
Workloads
• SPEC CPU2006 subset, ranging from CPU-bound to memory-bound
• 11 multi-program workloads (WL1–WL11): 4-program workloads (WL1–WL9) and 8-program workloads (WL10, WL11)
[Figure: NUMA penalty of the selected benchmarks]
Memory allocation setup • Where the memory of each process is allocated influences performance • Controlled setup: memory allocation maps
Memory allocation maps
• An allocation map has one digit per process: digit i gives the processor that holds process i's memory.
• Map 0000 places all memories (MA–MD) on Processor 0 (unbalanced); map 0011 places MA, MB on Processor 0 and MC, MD on Processor 1 (balanced).
[Figure: allocation maps 0000 and 0011 for processes A–D]
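A minimal sketch of decoding such a map into a per-process memory node, following the encoding in the examples above; the function name and workload list are illustrative.

```python
# Decode an allocation map: character i gives the NUMA node that should hold
# the memory of the i-th process in the workload.

def allocation_plan(alloc_map, workload):
    """alloc_map: e.g. "0011"; workload: list of process names, same length."""
    return {proc: int(node) for proc, node in zip(workload, alloc_map)}

print(allocation_plan("0000", ["A", "B", "C", "D"]))
# {'A': 0, 'B': 0, 'C': 0, 'D': 0}   -> unbalanced: all memory on node 0
print(allocation_plan("0011", ["A", "B", "C", "D"]))
# {'A': 0, 'B': 0, 'C': 1, 'D': 1}   -> balanced: memory split across nodes
```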
Evaluation
• Baseline: Linux average. Because the Linux scheduler is non-deterministic, the baseline is the average performance degradation over all possible cases.
• Evaluated: N-MASS with perfect NUMA penalty information
WL9: Linux average
[Figure: average slowdown relative to single-program mode]
WL9: N-MASS
[Figure: average slowdown relative to single-program mode]
WL1: Linux average and N-MASS
[Figure: average slowdown relative to single-program mode]
N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare: maximum-local alone vs. N-MASS (maximum-local + cache-aware refinement step)
Data locality vs. cache balancing (WL9)
[Figure: performance improvement relative to Linux average]
Data locality vs. cache balancing (WL1)
[Figure: performance improvement relative to Linux average]
Data locality vs. cache balancing
• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Estimating the NUMA penalty
• The NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto MPKI data
[Figure: NUMA penalty as a function of MPKI with the fitted regression line]
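A minimal sketch of such a fit with NumPy; the sample (MPKI, penalty) points below are made up, standing in for profiled benchmark data.

```python
# Fit a linear model "NUMA penalty ~ a * MPKI + b" and use it to estimate the
# penalty of a process from its measured MPKI.
import numpy as np

mpki    = np.array([0.5, 2.0, 5.0, 10.0, 20.0])     # misses per 1000 instructions
penalty = np.array([1.01, 1.05, 1.12, 1.22, 1.45])  # measured slowdown factors

slope, intercept = np.polyfit(mpki, penalty, deg=1)

def estimate_numa_penalty(m):
    """Predicted slowdown of running remotely for a process with MPKI m."""
    return slope * m + intercept

print(round(estimate_numa_penalty(8.0), 3))
```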
Estimate-based N-MASS: performance
[Figure: performance improvement relative to Linux average]
Conclusions
• N-MASS: a NUMA multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling