
(Mis) Understanding the NUMA Memory System Performance of Multithreaded Workloads



  1. (Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads. Zoltán Majó, Thomas R. Gross. Department of Computer Science, ETH Zurich, Switzerland

  2. NUMA-multicore memory system. Figure: two processors, each with its own cores, last-level cache, memory controller (MC), interconnect (IC), and local DRAM. A thread T on Processor 0 sees the following access latencies: LOCAL_CACHE 38 cycles, REMOTE_CACHE 186 cycles, LOCAL_DRAM 190 cycles, REMOTE_DRAM 310 cycles. All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO '09], Molka [PACT '09]).

  3. Experimental setup • Three benchmark programs from PARSEC • streamcluster, ferret, and dedup • Input sizes grown to put more pressure on the memory system • Intel Westmere • 4 processors, 32 cores • 3 execution scenarios • w/o NUMA: Sequential • w/o NUMA: Parallel (8 cores / 1 processor) • w/ NUMA: Parallel (32 cores / 4 processors)

  4. Execution scenarios. Figure: the 32 threads (T) mapped onto the cores of Processors 0-3; each processor has its own last-level cache, memory controller (MC), interconnect (IC), and DRAM.

  5. Parallel performance

  6. CPU cycle breakdown • dedup: good scaling (26X) • streamcluster: poor scaling (11X)

  7. Outline • Introduction • Performance analysis • Data locality • Prefetcher effectiveness • Source-level optimizations • Performance evaluation • Conclusions

  8. Data locality • Page placement policy • Commonly used policy: first-touch (default in Linux) • Measurement: data locality of the benchmarks • Data locality [%] = remote memory references / total memory references • Read transfers measured at the processor's uncore
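A minimal sketch (not from the talk) of how the first-touch policy determines placement on Linux: physical pages are allocated on the NUMA node of the thread that first writes them, so data initialized by a single thread ends up entirely on that thread's node, regardless of which threads use it later.

    /* First-touch sketch: malloc() only reserves virtual pages; the
     * physical page behind each element is allocated on the NUMA node
     * of the thread that performs the first write below. */
    #include <stdlib.h>

    #define N (64 * 1024 * 1024)

    int main(void) {
        double *a = malloc(N * sizeof(double));
        for (size_t i = 0; i < N; i++)
            a[i] = 0.0;        /* first touch: all pages land on the */
                               /* node running this (single) thread  */
        free(a);
        return 0;
    }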

  9. NUMA-multicore memory system. Figure: the full four-processor machine (Processors 0-3, cores 0-31), each processor with its own last-level cache, memory controller (MC), interconnect (IC), and local DRAM; every access is either LOCAL_CACHE, REMOTE_CACHE, LOCAL_DRAM, or REMOTE_DRAM.

  10. Data locality

  11. Inter-processor data sharing. Causes of data sharing • streamcluster: data points to be clustered • ferret and dedup: in-memory databases

  12. Prefetcher performance • Experiment • Run each benchmark with the prefetcher on/off • Compare performance • Causes of prefetcher inefficiency • ferret and dedup: hash-based memory access patterns • streamcluster: random shuffling

  13. streamcluster: random shuffling. while (input = read_data_points()) { clusters = process(input); } Randomly shuffle the data points to increase the probability that each point is compared to each cluster.
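A hedged sketch of what the pointer-based shuffle looks like (the Point record and function name are illustrative, not the PARSEC source): only the point records are permuted, while the coordinate blocks they reference stay at their original addresses, so a sequential walk over the shuffled array dereferences coordinates scattered across memory and the hardware prefetcher sees no regular stream.

    /* Illustrative pointer-based shuffle (names are hypothetical).
     * Swapping Point records permutes the coord pointers only; the
     * coordinate data itself is left where it was allocated. */
    #include <stdlib.h>

    typedef struct {
        float *coord;    /* points into one large coordinate array */
        float  weight;
    } Point;

    static void shuffle_points(Point *points, long n) {
        for (long i = 0; i < n - 1; i++) {
            long j = i + rand() % (n - i);   /* Fisher-Yates style swap */
            Point tmp = points[i];
            points[i] = points[j];
            points[j] = tmp;
        }
    }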

  14. streamcluster: prefetcher effectiveness. Figure: original data layout (before shuffling); the points and their coordinates A-H are laid out contiguously and accessed in order by threads T0 and T1.

  15. streamcluster: prefetcher effectiveness. Figure: data layout after the pointer-based shuffle; the points array now references the coordinates A-H in shuffled order, so threads T0 and T1 no longer access them sequentially.

  16. streamcluster: prefetcher effectiveness. Figure: data layout after the pointer-based shuffle, showing thread T0's scattered accesses to the coordinates A-H.

  17. Outline • Introduction • Performance analysis • Data locality • Prefetcher effectiveness • Source-level optimizations • Prefetching • Data locality • Performance evaluation • Conclusions

  18. streamcluster: Optimizing prefetching. Copy-based shuffle. Performance improvement over pointer-based shuffle • Westmere: 12% • Nehalem: 60%. Figure: after the copy-based shuffle, the coordinates are stored contiguously in shuffled order (G B C H F E A D) and scanned in order by threads T0 and T1.
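A hedged sketch of the copy-based alternative (again with hypothetical names, reusing the Point record from the earlier sketch): after the points have been permuted, their coordinate blocks are copied into one freshly allocated array in the new order, so each thread again scans a contiguous region and the prefetcher can follow the stream, consistent with the improvements reported on the slide.

    /* Illustrative copy-based shuffle: pack the coordinates into a new
     * contiguous array in shuffled order and repoint the records at it,
     * restoring a streaming access pattern for the prefetcher. */
    #include <stdlib.h>
    #include <string.h>

    static void copy_based_shuffle(Point *points, long n, int dim) {
        float *packed = malloc((size_t)n * dim * sizeof(float));
        for (long i = 0; i < n; i++) {
            memcpy(&packed[(size_t)i * dim], points[i].coord, dim * sizeof(float));
            points[i].coord = &packed[(size_t)i * dim];
        }
        /* the old coordinate array can be freed once nothing points into it */
    }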

  19. Data locality optimizations. Control the mapping of data and computations: • Data placement • Supported by numa_alloc(), move_pages() • First-touch: also OK if data is accessed at a single processor • Interleaved page placement: reduce interconnect contention [Lachaize et al. USENIX ATC '12, Dashti et al. ASPLOS '13] • Computation scheduling • Threads: affinity scheduling, supported by sched_setaffinity() • Loop parallelism: rely on OpenMP static loop scheduling • Pipeline parallelism: locality-aware task dispatch
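A minimal sketch of the two mechanisms named above (assumes libnuma is installed, link with -lnuma; error handling is omitted and the core numbering is machine specific and purely illustrative): pin the calling thread to the cores of one processor with sched_setaffinity() and allocate its working set on that processor's memory node with numa_alloc_onnode().

    /* Co-location sketch: pin the calling thread to node `node` and
     * place `bytes` of data in the same node's DRAM.
     * Assumes cores node*8 .. node*8+7 belong to `node` (machine specific). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>

    static void *alloc_and_pin_on_node(size_t bytes, int node) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int cpu = 0; cpu < 8; cpu++)
            CPU_SET(node * 8 + cpu, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);   /* pin calling thread  */

        return numa_alloc_onnode(bytes, node);       /* local data placement */
    }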

  20. streamcluster. Figure: the shuffled coordinates (G C B H F E A D) are split between the processors; thread T0 executes at Processor 0 and its data is placed at Processor 0, while thread T1 executes at Processor 1 and its data is placed at Processor 1.

  21. ferret. Figure: the ferret pipeline (Stage 1: Input, Stage 2: Segment, Stage 3: Extract, Stage 4: Index, Stage 5: Rank, Stage 6: Output) with its image database; the stages' thread pools execute partly at Processor 0 and partly at Processor 1.

  22. ferret. Figure: the NUMA-aware version replicates the Index stage (Index' and Index''), places one partition of the image database at Processor 0 and the other at Processor 1, and executes each Index instance at the processor holding its partition.

  23. Performance evaluation • Two parameters with a major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario notation: schedule / placement • Scenario 1: default / FT (schedule: default, placement: first-touch (FT))

  24. default / FT. Figure: threads (T) run on the cores of all four processors, each processor with its own DRAM.

  25. default / FT. Figure: same thread layout as the previous slide (animation step).

  26. default / FT. Figure: with first-touch placement, the data (D) ends up in the DRAM of Processors 0 and 1 only, while threads (T) run on all four processors.

  27. Performance evaluation • Two parameters with a major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario notation: schedule / placement • Scenario 1: default / FT (schedule: default, placement: first-touch (FT)) → change placement → Scenario 2: default / INTL (schedule: default, placement: interleaved (INTL))

  28. default / FT → default / INTL. Figure: the thread schedule is unchanged while the data placement moves from first-touch to interleaved across the four processors' DRAM.

  29. Performance evaluation • Two parameters with a major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario notation: schedule / placement • Scenario 1: default / FT → change placement → Scenario 2: default / INTL → change schedule → Scenario 3: NUMA / INTL (schedule: NUMA-aware, placement: interleaved (INTL))

  30. default / INTL → NUMA / INTL. Figure: the data (D) stays interleaved across the DRAM of all four processors while the schedule changes from default to NUMA-aware.

  31. Performance evaluation • Two parameters with a major effect on NUMA performance • Data placement • Schedule of computations • Execution scenario notation: schedule / placement • Scenario 1: default / FT → Scenario 2: default / INTL → Scenario 3: NUMA / INTL → change placement → Scenario 4: NUMA / NUMA (schedule: NUMA-aware, placement: NUMA-aware (NA))

  32. NUMA / NUMA. Figure: with NUMA-aware scheduling and NUMA-aware placement, each processor's threads (T) access data (D) placed in their local DRAM.

  33. Performance evaluation: ferret (default / FT). Charts: uncore transfers [×10^9] and improvement over default / FT; figure shows the default / FT thread and data placement.

  34. Performance evaluation: ferret (default / INTL). Charts: uncore transfers [×10^9] and improvement over default / FT; figure shows the default / INTL thread and data placement.

  35. Performance evaluation: ferret (NUMA / INTL). Charts: uncore transfers [×10^9] and improvement over default / FT; figure shows the NUMA / INTL thread and data placement.

  36. Performance evaluation: ferret (NUMA / NUMA). Charts: uncore transfers [×10^9] and improvement over default / FT; figure shows the NUMA / NUMA thread and data placement.

  37. Performance evaluation (cont'd). Charts: streamcluster and dedup.

  38. Data locality optimizations: summary • Improving data locality works better than merely avoiding interconnect contention • Interleaved placement is easy to control • Data locality: lack of tools for implementing the optimizations • Other options • Data replication • Automatic data migration
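To illustrate why interleaved placement is easy to control: with libnuma a single allocation can be interleaved explicitly, or an unmodified program can be run under numactl --interleave=all. A minimal sketch (assumes libnuma, link with -lnuma):

    /* Interleaved placement sketch: pages of this allocation are
     * distributed round-robin across all NUMA nodes, spreading traffic
     * over every memory controller instead of concentrating it on one. */
    #include <numa.h>

    static double *alloc_interleaved(size_t n) {
        return numa_alloc_interleaved(n * sizeof(double));
        /* release later with numa_free(p, n * sizeof(double)) */
    }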

  39. Performance evaluation: ferret. Charts: uncore transfers [×10^9] and improvement over default / FT.

  40. Conclusions • Details matter • Prefetcher efficiency • Data locality • Substantial improvements • Benchmarking on NUMA-multicores is far from easy • Two aspects to consider: data placement and computation scheduling • Appreciate memory system details to avoid misconceptions • Limited support for understanding hardware bottlenecks

  41. Thank you for your attention!
