Accurate Power and Energy Measurement on Kepler-based Tesla GPUs

Accurate Power and Energy Measurement on Kepler-based Tesla GPUs • Martin Burtscher • Department of Computer Science

Introduction • GPU-based accelerators • Quickly spreading in PCs and even handheld devices • Widely used in high-performance computing • Power and energy efficiency • Heat dissipation is a problem • Electric bill and battery life are of growing concern • Exascale requires 50x boost in performance per watt • Important research area • Need to develop techniques to reduce power and energy • Have to be able to measure power/energy of programs

GPU Power Sensors • Hardware • High-end compute GPUs include power sensors • For example, K20/K40 Tesla cards have built-in sensor • These cards are the target of this talk • Software • Can query sensor with NVIDIA Management Library • http://developer.nvidia.com/nvidia-management-library-nvml

Problems • Power sensor data behaves strangely • Running the same kernel twice yields different energy • First launch: 114 J, second launch: 147 J (29% more energy) • Running a kernel 2x as long more than doubles energy • 1x input: 732 J, 2x input: 1579 J (8% above doubling) • Power sensor sampling rate varies greatly • Ranges from 0.266 ms to 130 ms (7.7 Hz to 3760 Hz)
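The percentages and sampling rates quoted on this slide can be re-derived from the raw figures. The short check below is only arithmetic on the numbers shown above; the variable names are illustrative.

```python
# Figures quoted on the "Problems" slide.
first_launch_j, second_launch_j = 114.0, 147.0
one_x_input_j, two_x_input_j = 732.0, 1579.0
shortest_interval_s, longest_interval_s = 0.266e-3, 130e-3

# Extra energy of the second launch relative to the first.
extra_second_launch = second_launch_j / first_launch_j - 1.0
# Energy of the 2x input relative to twice the 1x energy.
extra_over_doubling = two_x_input_j / (2.0 * one_x_input_j) - 1.0
# Sampling rate is the reciprocal of the sampling interval.
max_rate_hz = 1.0 / shortest_interval_s
min_rate_hz = 1.0 / longest_interval_s

print(f"second launch uses {extra_second_launch:.0%} more energy")
print(f"2x input is {extra_over_doubling:.0%} above doubling")
print(f"sampling rate spans {min_rate_hz:.1f} Hz to {max_rate_hz:.0f} Hz")
```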

Methodology • Hardware • Two K20c, two K20m, two K20X, and two K40m GPUs • Measurement • Query power and time in loop on “idle” CPU core • Test code • Compute-intensive regular n-body kernel • Constant computation rate of over 2 TFlops on a K20c • No data dependences; vary n to adjust kernel runtime
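The measurement step above (query power and time in a loop) might look roughly like the sketch below. It is not the code used in the talk: `sample_power` and `fake_sensor` are hypothetical names, and the sensor is a stand-in so the example runs without a GPU. On real hardware the callable could wrap NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts.

```python
import time

def sample_power(read_power_w, duration_s, clock=time.perf_counter):
    """Poll read_power_w() in a tight loop for duration_s seconds.

    Returns a list of (seconds_since_start, watts) pairs. The time stamp is
    taken with a high-resolution clock just before each sensor query.
    """
    samples = []
    start = clock()
    while True:
        elapsed = clock() - start
        if elapsed >= duration_s:
            break
        samples.append((elapsed, read_power_w()))
    return samples

def fake_sensor():
    """Stand-in for a real sensor query; always reports 45 W (assumption)."""
    return 45.0

samples = sample_power(fake_sensor, duration_s=0.01)
print(f"collected {len(samples)} samples in 10 ms")
```

Recording a time stamp with every sample matters later, because the intervals between sensor updates turn out to be far from uniform.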

Expected Power Profile • Kernel starts executing • Kernel stops executing • GPU idle power • Measurement loop runtime

Measured Power Profile • Macroscopic phenomena • 3 s, 5 s, 4 s • Switch to step shape • Power ramps up slowly • Power ramps down slowly • Idle power reached

Energy = Area Under Power Curve • Unclear how big energy is • Missing energy? • Delayed energy? • Integrate to where?
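As a minimal sketch of "energy = area under the power curve", the area can be approximated with the trapezoidal rule over time-stamped samples. The function name and the sample values are made up for illustration, and the sketch does not answer the slide's question of where the integration should end.

```python
def energy_joules(samples):
    """Trapezoidal area under a power curve.

    samples: list of (seconds, watts) pairs sorted by time.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Hypothetical profile: 50 W for 1 s, a 0.5 s ramp, then 100 W for 1.5 s.
profile = [(0.0, 50.0), (1.0, 50.0), (1.5, 100.0), (3.0, 100.0)]
print(f"{energy_joules(profile):.1f} J")
```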

Ramp-up Behavior of 2 Short Runs • Ramp down doesn’t follow • 2nd run starts higher but also follows curve • Short run same as longer run

Ramp-down Behavior of Several Runs • Driver lowers power level • Shape depends on power at t2 • Shape always the same • Steps down every second • Power increases after kernel done

Sampling Interval Lengths • Driver activity can prevent sampling • Very long interval • Wide range of intervals • Short intervals

Sampling Interval Lengths (zoomed-in) • Sampled power only ever changes after long interval • Identical values • Very long interval • Many short intervals

Correcting the Measurements

Sampling Frequency • Eliminate redundant samples • Only sample once every 15 ms (66.7 Hz) • Cannot accurately measure kernels under ~150 ms • Account for the variation in interval length • Use high-resolution time stamps • Example: energy from t1 to t4 • Dotted (fixed intervals): 1205 J • Solid (variable intervals): 1066 J • 13% discrepancy
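The two ideas on this slide can be sketched as follows: drop samples that arrive less than 15 ms after the last kept one, then integrate using the actual time stamps. How the slide's "fixed intervals" figure was computed is not stated, so `energy_fixed` below is only one plausible reading (every sample is assumed to span the same average interval). All names and sample values are illustrative.

```python
MIN_INTERVAL_S = 0.015  # 15 ms, i.e. about 66.7 Hz

def thin_samples(samples, min_interval_s=MIN_INTERVAL_S):
    """Keep a sample only if at least min_interval_s passed since the last kept one."""
    kept = []
    for t, p in samples:
        if not kept or t - kept[-1][0] >= min_interval_s:
            kept.append((t, p))
    return kept

def energy_variable(samples):
    """Trapezoidal energy using the recorded time stamps."""
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))

def energy_fixed(samples):
    """Trapezoidal energy that ignores time stamps and assumes equal spacing."""
    if len(samples) < 2:
        return 0.0
    dt = (samples[-1][0] - samples[0][0]) / (len(samples) - 1)
    powers = [p for _, p in samples]
    return sum(0.5 * (p0 + p1) * dt for p0, p1 in zip(powers, powers[1:]))

# Hypothetical raw samples with bursts of short intervals and one long gap.
raw = [(0.000, 50.0), (0.001, 50.0), (0.002, 50.0), (0.020, 60.0),
       (0.021, 60.0), (0.150, 100.0), (0.170, 100.0)]
thinned = thin_samples(raw)
variable_j = energy_variable(thinned)
fixed_j = energy_fixed(thinned)
print(f"kept {len(thinned)} of {len(raw)} samples")
print(f"variable intervals: {variable_j:.3f} J, fixed intervals: {fixed_j:.3f} J")
```

Even on this tiny made-up trace the two integrals disagree, which is the kind of discrepancy the slide reports at a much larger scale.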

True Power • Sensor hardware • Seems to asymptotically approach true power • Reminiscent of capacitor charging • True instant power • P_true is a function of the slope of the power profile, dP_sensor/dt, and the power measured by the sensor, P_sensor: P_true = P_sensor + C × dP_sensor/dt • “Capacitance” of sensor • C ≈ 0.84 s on all tested K20 GPUs
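To illustrate the formula, the sketch below synthesizes a lagging sensor trace and then applies P_true = P_sensor + C × dP_sensor/dt, estimating the slope from neighboring samples. The simulated sensor is an assumption: it is the first-order lag that the formula implies, driven by a made-up 50 W / 150 W rectangular profile. Function names are illustrative, not taken from the K20Power tool.

```python
import math

C_SECONDS = 0.84  # value reported in the talk for the tested K20 GPUs

def correct_power(times, sensor_w, c=C_SECONDS):
    """Apply P_true = P_sensor + C * dP_sensor/dt using neighbor-sample slopes."""
    n = len(times)
    if n < 2:
        return list(sensor_w)
    corrected = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slope = (sensor_w[hi] - sensor_w[lo]) / (times[hi] - times[lo])
        corrected.append(sensor_w[i] + c * slope)
    return corrected

def true_power(t):
    """Hypothetical profile: 50 W idle, 150 W while a kernel runs from 1 s to 6 s."""
    return 150.0 if 1.0 <= t < 6.0 else 50.0

# Synthesize what a lagging sensor would report, sampled every 15 ms.
dt = 0.015
times = [i * dt for i in range(600)]
sensor = [50.0]
for t in times[:-1]:
    target = true_power(t)
    sensor.append(target + (sensor[-1] - target) * math.exp(-dt / C_SECONDS))

corrected = correct_power(times, sensor)
i = 100  # sample at t = 1.5 s, half a second into the kernel
print(f"t = {times[i]:.2f} s: sensor {sensor[i]:.1f} W, corrected {corrected[i]:.1f} W")
```

Half a second into the synthetic kernel the raw reading is still far below the true level, while the corrected value is already close to it.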

Back-calculated from Expected Profile • Minimized absolute errors to determine C • ‘Capacitor’ function matches measured values perfectly
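One way to determine C by minimizing absolute errors is a simple grid search, sketched below on synthetic data. The slide does not say which optimization procedure was used, so the grid search, the candidate range, and the rectangular 50 W / 150 W expected profile are all assumptions made for this example.

```python
import math

TRUE_C = 0.84  # used only to synthesize the example sensor trace
DT = 0.015     # 15 ms sampling interval

def expected_power(t):
    """Hypothetical rectangular profile: 50 W idle, 150 W during the kernel."""
    return 150.0 if 1.0 <= t < 6.0 else 50.0

# Synthetic sensor trace that lags the expected profile.
times = [i * DT for i in range(600)]
sensor = [50.0]
for t in times[:-1]:
    target = expected_power(t)
    sensor.append(target + (sensor[-1] - target) * math.exp(-DT / TRUE_C))

def corrected(i, c):
    """Corrected power at sample i for a candidate C."""
    lo, hi = max(i - 1, 0), min(i + 1, len(times) - 1)
    slope = (sensor[hi] - sensor[lo]) / (times[hi] - times[lo])
    return sensor[i] + c * slope

def total_abs_error(c):
    """Sum of absolute differences between corrected and expected power."""
    return sum(abs(corrected(i, c) - expected_power(times[i]))
               for i in range(len(times)))

# Try C from 0.50 s to 1.20 s in steps of 0.01 s and keep the best fit.
candidates = [round(0.50 + 0.01 * k, 2) for k in range(71)]
best_c = min(candidates, key=total_abs_error)
print(f"best C on the synthetic trace: {best_c:.2f} s")
```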

Corrected Power Profile • Wobbles due to sampling errors • ‘Active idle’ power level • Corrected profile matches expected rectangular profile

Correction of 2 Short Runs • Corrected power profile matches expected profile

Second K20c GPU • Identical to original K20c

K20m GPU • Similar profile but higher power level

K20X GPU • Profile is good, no correction needed! • Huge 600 ms gap

K40m GPU • K40m again requires correction

Application to Full CUDA Program • Implementation of Barnes Hut n-body algorithm • Taken from LonestarGPU benchmark suite • Contains multiple regular and irregular kernels • Highly optimized, but still suffers from load imbalance, divergence, and uncoalesced accesses • Main kernel is ‘regularized’ (warp-based) • Image credit: NASA/JPL-Caltech/SSC

Barnes Hut Power Profile (1 Step) • Slow then fast drop-off • “Wave” in profile • Original profile is hard to interpret

Barnes Hut Power Profile (Kernels) • Slow then fast drop-off • “Wave” in profile • Original profile is hard to interpret

Corrected Barnes Hut Power Profile • Corrected profile reveals important info • Regularized main kernel • Two similar irreg. kernels • Decrease due to load imbal. • One more irreg. kernel • Very short regular kernel

K20Power Tool • Output • Corrected profile and corresponding ‘active’ energy • Features • Computes instant power using ‘capacitor’ formula • Employs high-resolution time steps • Samples at true frequency of 66.7 Hz • Dissemination • Open source, research license • http://cs.txstate.edu/~burtscher/research/K20power/

Marcher System • Tool will be part of Marcher system at Texas State • NSF-funded green computing infrastructure • Marcher is a power-measurable cluster system • 832 general-purpose cores • 12,000 GPU and MIC cores • 1.2 TB of DDR3 with power throttling and scaling • 50 TB of hybrid storage with hard drives and SSDs • Component-level power measurement tools (e.g., CPU, DRAM, Disk, GPU, Xeon Phi)

Summary • Correctly measuring K20/K40 power and energy • Sample at 66.7 Hz and include time stamps • Compute true power with presented formula • Use neighboring power samples to approximate slope • Compute true energy by integrating true power • Over intervals where power is above ‘active idle’ • K20Power tool • Software tool that implements this methodology • Paper at http://cs.txstate.edu/~burtscher/papers/gpgpu14.pdf
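The last step of the summary, integrating power only over intervals where it is above the ‘active idle’ level, could be sketched as below. The threshold value, the function name, and the sample data are hypothetical; the input is assumed to be an already-corrected power profile, and this is not the K20Power implementation.

```python
def active_energy(times, power_w, threshold_w):
    """Integrate power over intervals whose endpoints both exceed threshold_w."""
    total = 0.0
    for i in range(1, len(times)):
        if power_w[i - 1] > threshold_w and power_w[i] > threshold_w:
            total += 0.5 * (power_w[i - 1] + power_w[i]) * (times[i] - times[i - 1])
    return total

# Hypothetical corrected profile: 25 W idle, 100 W while the kernel runs.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
power = [25.0, 25.0, 100.0, 100.0, 100.0, 25.0]
active_j = active_energy(times, power, threshold_w=30.0)
print(f"active energy: {active_j:.1f} J")
```

Requiring both endpoints to be above the threshold is a conservative choice for this sketch; it excludes the transition intervals at the start and end of the active region.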

Acknowledgments • Collaborators • Ivan Zecena and Ziliang Zong • U.S. National Science Foundation • DUE-1141022, CNS-1217231, and CNS-1305359 • NVIDIA Corporation • Grants and equipment donations • Texas State University • Research Enhancement Program
