Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core and Frequency Scaling

Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami (Kyushu University)


Presentation Transcript


  1. Optimizing Power-Performance Trade-off for Parallel Applications through Dynamic Core and Frequency Scaling. Satoshi Imamura, Hiroshi Sasaki, Naoto Fukumoto, Koji Inoue, Kazuaki Murakami. Kyushu University

  2. Many-core Processors • Multi-core processors are currently mainstream • Core counts per chip increase as process technology shrinks • The many-core processor era is coming • Tens to hundreds of cores on a chip • Execute multi-threaded programs for high performance. TILERA "TILE-Gx100" block diagram: http://www.tilera.com/products/processors/TILE-Gx_Family

  3. Challenge of Many-core • Demand for low power consumption • Ex: large-scale data centers • Reduce peak power consumption by power capping • Programs need to be efficiently executed under a power consumption constraint

  4. Two Knobs to Determine Performance • CPU frequency & the number of cores • Characteristics of multi-threaded programs differ among and within programs • Sensitivity to CPU frequency • Parallelism • Need to choose the proper configuration according to the kind of program and its behavior

  5. Experimental Environment • 32-core AMD four-socket system • [Block diagram: CPU0-CPU3, each with cores C0-C3 and per-core L2 caches, a shared L3, and a memory controller] • Conventional execution & power constraint: the power when all 32 cores run at 0.8 GHz

  6. Characteristics among Programs • [IPS graphs for blackscholes, dedup, and x264]

  7. Characteristics within a Program • [IPS over execution, measured at core counts of 4, 8, 12, 16, and 32; higher is better] • IPS: Instructions Per Second

  8. Our Goal • Maximize performance of parallel programs on many-core processors under a power consumption constraint • Variety of characteristics among/within programs • Sensitivity to CPU frequency • Scalability with core counts • Choose the optimal trade-off point between core count and CPU frequency dynamically

  9. Overview of DCFS (Dynamic Core and Frequency Scaling) • Optimize core count and CPU frequency dynamically according to the characteristics of the program • High parallelism (e.g., blackscholes): parallel processing with the maximum available core count • Medium/low parallelism (e.g., dedup): restrict the number of active cores and reallocate the power budget to increase CPU frequency

  10. DCFS Algorithm • Two phases • In the Training phase: change the configuration of core count and CPU frequency periodically, measure IPS during execution with each configuration, and estimate the optimal configuration using the measured IPS • In the Execution phase: execute with the optimal configuration and detect behavior changes of the executed program • [Timeline: a Training phase followed by repeated Execution phases over execution time]
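The Training-phase estimation described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation; `measure_ips` is a stand-in for a real hardware-counter measurement taken over one training period.

```python
def measure_ips(cores, freq_ghz):
    """Stand-in for a per-period hardware-counter measurement.
    This toy model saturates at 8 cores, mimicking a program
    with medium parallelism."""
    return min(cores, 8) * freq_ghz * 1e9

def training_phase(configs):
    """Run one training period per configuration and return the
    (core count, frequency) pair with the highest measured IPS."""
    best, best_ips = None, 0.0
    for cores, freq in configs:
        ips = measure_ips(cores, freq)
        if ips > best_ips:
            best, best_ips = (cores, freq), ips
    return best, best_ips

# Candidate configurations: fewer active cores allow a higher
# frequency under the same power budget (values are illustrative).
configs = [(32, 0.8), (16, 1.2), (8, 1.9), (4, 2.4)]
best, ips = training_phase(configs)
```

For this toy model the search settles on 8 cores at 1.9 GHz: adding cores beyond the program's parallelism buys no IPS, so the power is better spent on frequency.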

  11. How to Find the Best Configuration • Find the best core count for each CPU frequency • Decrement the core count until IPS declines • Select the configuration with the highest IPS • [Example: x264]
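The per-frequency search on this slide, decrementing the core count until IPS declines, could look like the following sketch (hypothetical code; `measure_ips` is again a stand-in for a real measurement):

```python
def best_cores_for_freq(freq, core_options, measure_ips):
    """core_options is in descending order; stop as soon as a
    smaller core count measures lower IPS than the best so far."""
    best_cores = core_options[0]
    best_ips = measure_ips(best_cores, freq)
    for cores in core_options[1:]:
        ips = measure_ips(cores, freq)
        if ips < best_ips:
            break            # IPS declined: stop decrementing
        best_cores, best_ips = cores, ips
    return best_cores, best_ips

# Toy workload whose IPS saturates at 8 cores.
saturating = lambda cores, freq: min(cores, 8) * freq
cores, ips = best_cores_for_freq(1.9, [32, 16, 12, 8, 4], saturating)
```

The early stop keeps the Training phase short: configurations past the knee of the scalability curve are never measured.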

  12. Evaluation Result • DCFS-3, DCFS-10: our proposed technique without detection of behavior changes; execution with the configuration estimated in the Training phase for a constant 3 or 10 seconds • DCFS-WD: our proposed technique with detection of behavior changes • [Results grouped into middle/low-parallelism and high-parallelism programs]

  13. Evaluation Result • Almost no performance improvement for high-parallelism programs • Execution with all cores maximizes their performance • Performance degradation due to the overhead of the Training phase

  14. Evaluation Result • Almost no performance improvement despite middle/low parallelism • These are the two most memory-bound programs in PARSEC* • Small performance improvement from increasing CPU frequency • *Bienia, C. et al., "PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors", IISWC, 2008

  15. Evaluation Result • Performance improvement for middle/low-parallelism programs • 35% improvement for dedup • 20% improvement on average over four programs • 6% improvement on average over all programs

  16. Conclusions • Challenge of many-core processors: maximizing performance under a power constraint • Proposed technique: DCFS • Optimize core count and CPU frequency dynamically • Detect behavior changes of the executed program • Evaluation • Max 35% performance improvement • 6% performance improvement on average over ten benchmarks • No performance improvement for high-parallelism and memory-bound programs

  17. Future Work • Improve the algorithm of our technique to find the best configuration and to detect behavior changes • Evaluate under different power consumption constraints • Evaluate on different platforms

  18. Thank you for your attention. I would appreciate it if you could ask me questions slowly.

  19. [Per-benchmark graphs: blackscholes, bodytrack, canneal, dedup, ferret, freqmine]

  20. [Per-benchmark graphs: streamcluster, swaptions, vips, x264]

  21. Backup Slides

  22. Experimental Environment • 32-core AMD four-socket system • [Block diagram: CPU0-CPU3, each with cores C0-C3 and per-core L2 caches, a shared L3, and a memory controller]

  23. Power Constraint Assumption • Power consumption constraint: the power when all cores run at the minimum available CPU frequency (conventional execution) • The maximum CPU frequency is determined by the core count under this constraint

  24. How to Determine Max CPU Frequency • Power consumption when n cores run at frequency f and supply voltage V (dynamic power): P = α · C · n · V² · f • Power consumption constraint: the power when all N cores run at the minimum operating point, i.e., P_budget = α · C · N · V_min² · f_min • For each core count n, choose the maximum CPU frequency and supply voltage satisfying α · C · n · V² · f ≤ P_budget • α: switching activity of the circuit, N: total number of cores, C: capacitance per core, f_min: minimum operating frequency, V_min: minimum supply voltage
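Under this dynamic-power model, picking the maximum operating point for a given active core count can be sketched as below. All numeric values (switching activity, capacitance, operating points) are made-up placeholders, not the paper's measured parameters.

```python
ALPHA = 0.5                # switching activity (assumed)
C = 1.0e-9                 # capacitance per core, farads (assumed)
N = 32                     # total number of cores
FMIN, VMIN = 0.8e9, 0.90   # minimum frequency (Hz) and voltage (V), assumed

def dyn_power(n, f, v):
    """Dynamic power for n active cores: alpha * C * n * V^2 * f."""
    return ALPHA * C * n * v * v * f

# Budget: the power when all N cores run at the minimum operating point.
P_BUDGET = dyn_power(N, FMIN, VMIN)

# Available frequency/voltage operating points, highest first (assumed).
FV_PAIRS = [(2.4e9, 1.30), (1.9e9, 1.20), (1.4e9, 1.05), (0.8e9, 0.90)]

def max_operating_point(n_active):
    """Highest (f, V) pair whose power for n_active cores fits the budget."""
    for f, v in FV_PAIRS:
        if dyn_power(n_active, f, v) <= P_BUDGET:
            return f, v
    return None
```

With these placeholder numbers, all 32 cores can only run at the minimum point, while 4 active cores fit under the budget even at the highest point, which is exactly the trade-off DCFS exploits.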

  25. Implementation of DCFS • Training phase • Change the configuration periodically • Execute with each configuration for a short period (the "training period") • Measure IPS as an indicator of performance • Compare the measured IPS values to estimate the optimal configuration • Execution phase • Execute with the optimal configuration • Measure IPS periodically to detect phase changes of the program • No need for static analysis or modification of programs

  26. Detailed Implementation of DCFS • Periodically read performance counters using Linux "perf-tools" • Allocate threads to the specified cores using the Linux standard API "sched_setaffinity(2)" • Training period: 30 ms • Measure IPS every 1 second to detect phase changes (IPS increases or decreases by more than 10%)
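The 10% phase-change trigger mentioned above can be sketched as a small helper (hypothetical code, not the authors' implementation):

```python
def behavior_changed(prev_ips, cur_ips, threshold=0.10):
    """True when IPS moved by more than `threshold` (10% by default)
    relative to the previous one-second measurement."""
    return abs(cur_ips - prev_ips) > threshold * prev_ips

# A 20% jump triggers retraining; a 5% wiggle does not.
changed = behavior_changed(1.0e9, 1.2e9)
stable = behavior_changed(1.0e9, 1.05e9)
```

When the trigger fires, DCFS-WD would re-enter the Training phase to re-estimate the optimal configuration for the new program phase.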

  27. The Way to Change Core Counts • Use "Thread Packing*" • Change the core count while the number of threads stays constant • No need to modify source code ⇒ easy implementation • [Diagram: the same set of threads packed onto fewer active cores, leaving the remaining cores idle] • *Cochran, R. et al., "Pack & Cap: Adaptive DVFS and Thread Packing Under Power Caps", MICRO, 2011
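Thread packing via CPU affinity can be sketched with Python's wrapper around `sched_setaffinity(2)` (Linux-only; a minimal sketch, not the Pack & Cap implementation):

```python
import os

def pack_onto_cores(pid, n_cores):
    """Restrict the task (and, via inheritance, any threads it
    spawns afterwards) to CPUs 0 .. n_cores-1. The thread count
    is unchanged; only the set of usable cores shrinks."""
    os.sched_setaffinity(pid, set(range(n_cores)))

# Example: pack the calling process onto a single core
# (pid 0 means "the calling task" for sched_setaffinity).
pack_onto_cores(0, 1)
```

In a real implementation each already-running worker thread would be pinned individually, since sched_setaffinity acts per task; packing everything onto fewer cores is what frees power budget for a frequency increase.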

  28. Benchmarks • 10 benchmarks from PARSEC 2.1* • Input set size: native *Bienia, C. et al, “The PARSEC benchmark suite: Characterization and architectural implications”, PACT, 2008

  29. Analysis of canneal & streamcluster • [Graphs for canneal and streamcluster] • These are the two most memory-bound programs in PARSEC* • Small performance improvement from increasing CPU frequency • *Bienia, C. et al., "PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors", IISWC, 2008

  30. Analysis of dedup • DCS@0.8GHz: control only the core count dynamically • 4% overhead from the Training phase • DCFS achieves high performance by scaling both the core count and CPU frequency • [Annotations: core counts chosen over the run]

  31. Experiment Environment (Xeon)

  32. Maximum CPU Frequency and Supply Voltage for Each Core Counts (Xeon)

  33. Evaluation Result (Xeon) • Performance degradation for all programs except swaptions • They show high or moderate scalability, so execution with all 12 cores maximizes performance ⇒ degradation is due to the overhead of the Training phase • swaptions: high performance only when executed with power-of-two core counts ⇒ execution with eight cores maximizes performance

  34. Analysis of ferret • Performance improves as the core count increases • Execution with all cores maximizes performance • Performance degradation due to the overhead of the Training phase
