When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency


Presentation Transcript


  1. When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency Gaurav Chadha, Scott Mahlke, Satish Narayanasamy University of Michigan

  2. Motivation • Hardware trends • CMPs are ubiquitous. • More and more cores in a system. • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3. • Server: Tilera. • Multi-threaded applications are pervasive. • But do we always want to maximize the number of threads? No.

  3. Run fewer threads: DVFS • Most multi-threaded applications stop scaling beyond a certain number of cores. • It becomes counter-productive to run more threads. • The maximum power budget is fixed for a system. • Fewer cores can “borrow” power from disabled cores. • Intel Turbo Boost increases frequency in steps of 133 MHz. [Figure: frequency vs. active cores; see the sketch below.]
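
To make the power-borrowing idea concrete, here is a minimal C sketch of a Turbo Boost-style frequency model: each disabled core frees power budget that lets the remaining active cores step their frequency up in 133 MHz increments. The linear two-steps-per-idle-core model and all constants are illustrative assumptions, not Intel's actual boost algorithm or the talk's measured numbers.

```c
/* Minimal sketch (not Intel's actual algorithm): estimate the boosted
 * frequency of the active cores when some cores are disabled, assuming
 * a fixed chip power budget and 133 MHz boost steps. */
#include <stdio.h>

#define BASE_FREQ_MHZ 1100   /* all-cores-active frequency (assumed) */
#define STEP_MHZ       133   /* Turbo Boost frequency step           */
#define TOTAL_CORES      6

/* Hypothetical model: each disabled core frees power worth a fixed
 * number of boost steps for the cores that remain active. */
static int boosted_freq_mhz(int active_cores, int steps_per_idle_core)
{
    int idle = TOTAL_CORES - active_cores;
    return BASE_FREQ_MHZ + idle * steps_per_idle_core * STEP_MHZ;
}

int main(void)
{
    for (int active = TOTAL_CORES; active >= 1; active--)
        printf("%d active cores -> %d MHz\n",
               active, boosted_freq_mhz(active, 2));
    return 0;
}
```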

  4. Scalability: Problems • Too many threads • Increased contention for shared resources. • Increased synchronization costs. • Too few threads • Underutilization of resources.

  5. Scalability: Fewer threads are better • 4 threads are best for streamcluster.

  6. Scalability: Fewer threads are as good • ferret, facesim, x264, and dedup show poor scalability.

  7. Scalability: Opportunities • Run fewer threads. • Disable some cores and increase the frequency of the active ones.

  8. Run fewer threads: DVFS [Slides 8-13 animate one figure: six cores all running at 1.1 GHz; as cores are disabled one by one, the remaining active cores are clocked higher. Frequency row shown (GHz): 3.6 2.8 2.2 1.8 1.4 1.1, i.e., the fewer the active cores, the higher the frequency each can run at.]

  14. Run fewer threads: DVFS • DVFS makes the case for fewer threads more compelling. • With fewer threads we can increase frequency and reduce contention. • 5 out of 11 benchmarks benefit from running fewer threads. • But who decides the best number of threads? [Figure repeats the frequency rows: all six cores at 1.1 GHz vs. 3.6 2.8 2.2 1.8 1.4 1.1 GHz as cores are disabled.]

  15. DVFS in current systems • The programmer decides how many threads to run (e.g., 32 threads on 32 cores). • But inputs change, system resources change, hardware configurations differ, and program characteristics change. [Figure: execution progress with 10, then 12, then 16 threads stalled; Turbo Boost raises the frequency from 1.1 GHz to 1.4 GHz as threads stall.]

  16. Our system [Same figure as slide 15, but detection logic pro-actively disables more threads: with 10, 12, then 16 threads stalled/disabled over execution progress, the freed power budget lets Turbo Boost raise the frequency from 1.1 GHz to 1.4 GHz.]

  17. Less Is MOre (LIMO) • Less Is MOre for efficiency. • Observation: most programs do not scale beyond a certain limit, and DVFS can help provide better performance. • LIMO is a runtime system that monitors shared-resource contention (shared cache, shared program variables), pro-actively disables threads, and employs DVFS (a sketch of the loop follows below).
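
As a rough illustration of the runtime loop this slide describes, the following C sketch shows one LIMO-style epoch: sample contention, and if it is high, shed a thread and spend the freed power budget on DVFS. The counter-reading helpers are stubs and the thresholds are assumptions; the paper's actual monitoring mechanism and policy details differ.

```c
/* A minimal sketch of a LIMO-style control loop. Real LIMO samples
 * hardware performance counters and synchronization stalls; the
 * thresholds and helper names below are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

static int active_threads = 16;

/* Stubs standing in for performance-counter sampling. */
static double cache_miss_rate(void)     { return 0.12; } /* shared-cache misses per access */
static double sync_stall_fraction(void) { return 0.30; } /* fraction of time stalled on locks */

static void disable_one_thread(void) { active_threads--; }
static void raise_frequency(void)    { /* a real system would invoke DVFS here */ }

/* Called periodically; the slides mention a 100-million-instruction epoch. */
static void limo_epoch(void)
{
    bool cache_thrashing = cache_miss_rate()     > 0.10;  /* assumed threshold */
    bool lock_contention = sync_stall_fraction() > 0.25;  /* assumed threshold */

    if ((cache_thrashing || lock_contention) && active_threads > 1) {
        disable_one_thread();  /* pro-actively shed parallelism    */
        raise_frequency();     /* spend freed power budget on DVFS */
    }
    printf("epoch: %d active threads\n", active_threads);
}

int main(void) { limo_epoch(); return 0; }
```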

  18. Outline

  19. Roadblocks: Shared Cache • Roadblocks • Physical shared resources: shared cache • Program-level shared resources

  20. Roadblocks: Shared Cache • Abstract representation of most multi-threaded programs. • The peak performance point shifts depending on working set size and shared cache size; see the sketch below. [Figure: performance vs. thread count, with the best performance where the working set just fits in the shared cache, falling off when it does not fit or is too large.]
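
A minimal C sketch of the cache roadblock, assuming (purely for illustration) equal per-thread working sets: the useful thread count is capped by how many working sets fit in the shared cache. All sizes are made up.

```c
/* Sketch of the shared-cache roadblock: if each thread's working set
 * is roughly equal, the number of working sets that fit in the shared
 * cache bounds the useful thread count. Sizes are illustrative. */
#include <stdio.h>

int main(void)
{
    long cache_bytes       = 8L * 1024 * 1024;  /* 8 MB shared LLC  */
    long per_thread_ws     = 2L * 1024 * 1024;  /* 2 MB working set */
    int  requested_threads = 16;

    int fit = (int)(cache_bytes / per_thread_ws);   /* 4 threads fit */
    int run = fit < requested_threads ? fit : requested_threads;

    printf("run %d of %d requested threads\n", run, requested_threads);
    return 0;
}
```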

  21. Roadblocks: Program Resources • Roadblocks • Physical shared resources: shared cache • Program-level shared resources: synchronization stalls (locks)

  22. Roadblocks: Program Resources [Figure: performance vs. thread count; increased parallelism gives more performance up to a best-performance point, after which increased synchronization costs hurt performance. A toy model of this trade-off follows below.]
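
The trade-off in this figure can be captured by a toy model in C: ideal parallel gain grows linearly with thread count while synchronization cost grows superlinearly, so the curve peaks at an intermediate thread count. The constants are made up, not measured.

```c
/* Toy model of the synchronization roadblock: speedup grows with
 * thread count but each extra thread adds lock contention.
 * Constants are illustrative only. */
#include <stdio.h>

static double speedup(int t)
{
    double parallel_gain = (double)t;     /* ideal linear scaling           */
    double sync_cost     = 0.05 * t * t;  /* contention grows superlinearly */
    return parallel_gain - sync_cost;
}

int main(void)
{
    int best = 1;
    for (int t = 1; t <= 32; t++)
        if (speedup(t) > speedup(best)) best = t;
    printf("best thread count under this model: %d\n", best);  /* prints 10 */
    return 0;
}
```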

  23. LIMO [Animated walkthrough over execution progress:] • 10, then 12 threads stall; LIMO pro-actively disables more threads and raises frequency: 20 threads at 1.1 GHz give 20 × 1.1 = 22, while 16 threads at 1.4 GHz give 16 × 1.4 = 22.4. • After 100 million instructions, a working set size estimate is calculated; the working set of 10 threads fits in the cache, so 6 threads are disabled. • Disabling further pays off again: 10 threads at 1.4 GHz give 10 × 1.4 = 14, while 8 threads at 1.8 GHz give 8 × 1.8 = 14.4. • Frequencies rise from 1.1 GHz through 1.4 GHz to 1.8 GHz; 16 threads end up stalled/disabled, then 8 disabled. (The slide's arithmetic is transcribed in the sketch below.)
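
The slide's decision arithmetic, transcribed directly in C: threads × frequency serves as a rough throughput proxy, and a thread-count reduction is accepted when the proxy does not drop.

```c
/* Direct transcription of the slide's two decisions: LIMO disables
 * threads when threads x frequency (a rough throughput proxy) does
 * not drop after the trade. */
#include <stdio.h>

static double throughput_proxy(int threads, double ghz) { return threads * ghz; }

int main(void)
{
    /* Step 1: 20 threads @ 1.1 GHz vs 16 threads @ 1.4 GHz */
    printf("%.1f vs %.1f\n", throughput_proxy(20, 1.1),   /* 22.0 */
                             throughput_proxy(16, 1.4));  /* 22.4 */
    /* Step 2: 10 threads @ 1.4 GHz vs 8 threads @ 1.8 GHz */
    printf("%.1f vs %.1f\n", throughput_proxy(10, 1.4),   /* 14.0 */
                             throughput_proxy(8,  1.8));  /* 14.4 */
    return 0;
}
```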

  24. Methodology: Configuration • Modified timing simulator FeS2, which uses Simics. • Hardware configuration: [table shown on slide]

  25. Methodology: Simulation • 9 evenly spaced checkpoints. • Timing simulations start from these checkpoints. • 80 million useful instructions simulated per checkpoint. • Statistics cleared after the first 20 million. • Useful instructions: committed in user mode, excluding spin loops. • Benchmarks from the PARSEC benchmark suite, the Apache web server (httpd), and the speech recognition benchmark (sphinx) from ALP.

  26. Example perf. breakdown: ferret [Slides 26-30 step through the same performance-breakdown chart for ferret, building it up one component at a time.]

  31. % Performance Improvement [Chart of per-benchmark performance improvement, annotated: good scalability; reduced synchronization stalls; reduced thrashing in the shared cache.]

  32. Conclusion • Scalability is difficult to achieve and predict. • Determining the best number of threads is hard: contention arises in shared hardware resources and in program-level shared objects. • LIMO frees the programmer from this burden: it monitors shared-resource contention (shared cache, shared program variables), pro-actively disables threads, and employs DVFS. • 14% average performance improvement over running the default maximum number of threads.

  33. Thank you!
