
Green Governors: A Framework for Continuously Adaptive DVFS

Presentation Transcript


  1. Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden

  2. Introduction
  • Optimize power efficiency
    • Reduce power without harming performance
  • Goal: minimize power-efficiency metrics
    • Energy delay product (EDP), energy delay squared product (ED2P), etc.
  • Exploit memory slack
    • Applications with many LLC misses → memory becomes the bottleneck
    • Performance insensitive to processor frequency
    • Scaling frequency down → high energy benefit at low performance cost
  • Develop analytical models to predict the impact of frequency scaling
    • No empirical parameters
    • No training period
    • Suitable for run-time use

  3. Modeling DVFS
  • Theoretical (work in simulator)
    • Extend previous interval-based models (Karkhanis and Smith, ISCA 2004; Eyerman et al., ACM TOCS, 2010) → two models for runtime DVFS management
    • Miss-based & stall-based models → differ in accuracy and ease of implementation
    • Estimate energy benefits and performance loss
    • G. Keramidas, V. Spiliopoulos, and S. Kaxiras. "Interval-Based Models for Run-Time DVFS Orchestration in SuperScalar Processors." Proc. of Int. Conference on Computing Frontiers, 2010
  • Implementation in real hardware
    • Apply the model for power-performance adaptation in real processors
    • Case study: Intel Core i7
    • Approximate models based on available performance monitoring hardware
    • Estimate power characteristics of real hardware
    • V. Spiliopoulos, S. Kaxiras, and G. Keramidas. "Green Governors: A Framework for Continuously Adaptive DVFS." International Green Computing Conference (IGCC'11)

  4. Interval-based Performance Model
  • Break the execution time of a program into intervals
  • Steady-state intervals: the IPC is limited by the machine width and the program's ILP
  • Miss intervals: introduce stall cycles due to branch mispredictions, on-chip instruction/data misses, and LLC misses (off-chip misses)
  [Figure: instruction rate (IPC) over cycles, showing steady-state IPC and miss intervals for branch mispredictions, on-chip instruction/data misses, and off-chip LLC misses]

  5. Interval-based DVFS Model (step 1)
  • Miss intervals and frequency scaling (time measured in cycles)
    • Branch-misprediction miss intervals → same penalty (in cycles) at all frequencies
    • On-chip data/instruction miss intervals → same penalty (in cycles) at all frequencies
    • LLC (off-chip) miss intervals → for DVFS, only account for this interval type
  [Figure: instruction rate (IPC) over cycles, as in the previous slide, with the LLC (off-chip) miss interval highlighted]

  6. Interval-based DVFS Model (step 2)
  • LLC miss interval and frequency scaling
  • Model core frequency scaling as a change in memory latency measured in cycles
  • Example: memory access time = 100 ns
    • f = 1 GHz → T = 1 ns → mem_lat = 100 cycles
    • f = 500 MHz → T = 2 ns → mem_lat = 50 cycles
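
To make the conversion concrete, here is a small C sketch (my own illustration, not code from the talk; the helper name mem_lat_cycles is made up): the latency of a fixed-time memory access, expressed in core cycles, scales linearly with the core frequency.

```c
#include <stdio.h>

/* DRAM access time in nanoseconds is fixed under DVFS, so its cost in
 * core cycles scales with the core frequency: cycles = ns * GHz. */
static double mem_lat_cycles(double mem_access_ns, double freq_ghz)
{
    return mem_access_ns * freq_ghz;
}

int main(void)
{
    printf("f = 1 GHz   -> mem_lat = %.0f cycles\n", mem_lat_cycles(100.0, 1.0));
    printf("f = 500 MHz -> mem_lat = %.0f cycles\n", mem_lat_cycles(100.0, 0.5));
    return 0;
}
```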

  7. Interval-based DVFS Model (step 2)
  • LLC miss interval and frequency scaling
  • Model core frequency scaling as a change in memory latency in cycles
  [Figure: anatomy of an LLC (off-chip) miss interval over cycles: steady-state IPC, IQ drain, full stall for the memory latency, ROB fill, and ramp-up back to steady-state IPC]

  8. Frequency Scaling == Change in Memory Latency
  • Lower frequency: lower memory latency (in cycles), smaller full-stall area
  • Other areas (ROB fill, IQ drain and ramp-up) remain intact
  [Figure: the same LLC miss interval at two frequencies; the full-stall area shrinks with the memory latency while IQ drain, ROB fill and ramp-up are unchanged]

  9. DVFS Target: Eliminate the Slack
  • Scale frequency down until the memory latency (in cycles) shrinks to the ROB fill time
  • At that point there is no more available slack due to off-chip misses
  • Further frequency reduction → performance penalty
  [Figure: the LLC miss interval before and after scaling; the full-stall area disappears once the memory latency matches the ROB fill time]

  10. Elastic and Non-Elastic Areas
  • Target: eliminate "slack" by reducing memory latency, but:
    • ROB fill area: DOES NOT shrink → inelastic area
    • Full-stall, IQ drain and ramp-up: DO shrink → elastic areas
  [Figure: LLC miss interval with its elastic (full-stall, IQ drain, ramp-up) and inelastic (ROB fill) areas marked]

  11. Two Simple Interval-Based Models
  • Stall-based model
    • Fed by in-core information
    • Assumes all stalls scale with frequency
    • Disregards the ROB fill area
    • Can be used in real hardware
  • Miss-based model
    • Fed by information from the memory system
    • Accounts for both elastic and inelastic areas
    • Required information not available in current hardware

  12. Stall-based Model
  • Assume (all) stalls scale with f
    • Not true due to ROB fill
  • Exec cycles at f/k: c_init − stalls + stalls/k
  [Figure: LLC miss interval with the measured stall cycles marked]

  13. Miss-based Model
  • Assumes the whole miss interval scales with f
  • Exec cycles at f/k: c_init − misses*mem_lat + (misses*mem_lat)/k
  [Figure: LLC miss interval with the full memory latency marked]

  14. Miss-based Model, more…
  • Important implication for overlapping misses!
  • Stalls of misses that occur under an outstanding miss do not scale, because of the inelastic ROB fill
  • The miss-based model therefore predicts execution cycles based on the number of clusters of misses (both predictors are sketched below)
  [Figure: two misses (Miss1, Miss2) issued a distance d apart; their memory-latency windows overlap and form a single miss cluster]
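
Both predictors reduce to one-line formulas. The following C sketch is my own illustration (variable names such as c_init, stalls and clusters are mine, not the authors'); the miss-based variant counts clusters of overlapping misses, as described above.

```c
/* Predicted execution cycles when scaling the core from f to f/k.
 * c_init   : measured cycles at the original frequency f
 * stalls   : stall cycles attributed to LLC misses (stall-based input)
 * clusters : number of clusters of overlapping LLC misses (miss-based input)
 * mem_lat  : memory latency in cycles at frequency f
 */
static double cycles_stall_based(double c_init, double stalls, double k)
{
    /* Assumes every stall cycle is elastic (ignores the ROB fill area). */
    return c_init - stalls + stalls / k;
}

static double cycles_miss_based(double c_init, double clusters,
                                double mem_lat, double k)
{
    /* Scales one memory latency per miss cluster, so overlapping misses
     * are not double-counted. */
    return c_init - clusters * mem_lat + clusters * mem_lat / k;
}
```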

  15. Real Hardware Approximations
  • Cannot apply the miss-based model
    • No "clusters of misses" counter available
  • Cannot apply the stall-based model as is
    • No "stalls due to LLC misses" counter available
  • Approximate stall-based model
    • Approximate LLC stalls with the minimum of all pipeline stalls and the worst-case stalls due to LLC misses (LLC misses * mem_lat)
  • Good accuracy
    • Predict execution time going from f_min to f_max and vice versa
    • Less than 5% average error
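
A minimal C sketch of that approximation (illustrative only; the inputs stand for Core i7 counter readings, not exact event names):

```c
/* LLC-miss stall cycles are not directly countable on the Core i7, so
 * approximate them with the smaller of (a) all pipeline stall cycles and
 * (b) the worst case of every LLC miss exposing a full memory latency. */
static double approx_llc_stalls(double pipeline_stalls,
                                double llc_misses, double mem_lat)
{
    double worst_case = llc_misses * mem_lat;
    return pipeline_stalls < worst_case ? pipeline_stalls : worst_case;
}
```

The approximated stalls can then be fed to the stall-based predictor sketched after slide 14.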

  16. Measuring power

  17. Power prediction
  • Previous researchers correlated total power (P = a·C·f·V² + P_static) with performance-counter events
  • We correlate the effective capacitance C in P = a·C·f·V² + P_static with performance-counter events
  • Run a set of benchmarks
    • Compute the effective capacitance (the a·C product) of benchmark i from its measured power: C_i = (P_i − P_static) / (f·V²)
    • Estimate Ĉ_i as a weighted sum of performance-counter event rates
    • Minimize the squared error Σ_i (C_i − Ĉ_i)² to obtain the per-event weights
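
For concreteness, a sketch of the training objective (my own C illustration; the flat rates[] layout and the function name are assumptions, and the actual minimization can be done with any off-the-shelf least-squares solver):

```c
/* Sum of squared errors between the measured effective capacitances C[i]
 * and the model estimate sum_j w[j] * rates[i][j], where rates[i*n_events+j]
 * is the rate of performance-counter event j in benchmark i.  The weights
 * w[] are chosen offline, at a single frequency, to minimize this value. */
static double capacitance_fit_error(int n_bench, int n_events,
                                    const double *C, const double *rates,
                                    const double *w)
{
    double err = 0.0;
    for (int i = 0; i < n_bench; i++) {
        double c_hat = 0.0;
        for (int j = 0; j < n_events; j++)
            c_hat += w[j] * rates[i * n_events + j];
        double diff = C[i] - c_hat;
        err += diff * diff;
    }
    return err;
}
```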

  18. Power prediction
  • Only need to train the model at a single frequency
  • Prediction at other frequencies: reuse the counter-based estimate of C and evaluate P = C·f·V² + P_static with the target frequency and voltage
  • Events monitored:
    • Uops executed
    • L2 misses
    • L2 accesses
    • Resource stalls
    • FP operations
    • Branch mispredictions
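
At run time the counter-based capacitance estimate is reused at the target frequency and voltage. A hedged C sketch (the six-event layout and all names are assumptions based on the list above):

```c
/* Monitored events: uops executed, L2 misses, L2 accesses,
 * resource stalls, FP operations, branch mispredictions. */
#define N_EVENTS 6

/* Effective capacitance from per-cycle event rates and fitted weights. */
static double effective_capacitance(const double rates[N_EVENTS],
                                    const double weights[N_EVENTS])
{
    double c_eff = 0.0;
    for (int j = 0; j < N_EVENTS; j++)
        c_eff += weights[j] * rates[j];
    return c_eff;
}

/* Predicted power at a target operating point (f, V): P = C*f*V^2 + P_static. */
static double predict_power(double c_eff, double freq_hz, double volt,
                            double p_static)
{
    return c_eff * freq_hz * volt * volt + p_static;
}
```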

  19. Implementing Linux Frequency Governors
  • Linux kernel module that selects the frequency
  • Window-based approach
    • Run the application for a time window
    • Estimate performance (using the stall-based model) and power at any candidate frequency
    • Scale the frequency based on the policy of interest (a sketch of this selection loop follows below)
  • Implement different policies
    • Optimize EDP/ED2P with or without performance constraints
    • Single- & multi-process management
  • Experimental framework
    • Intel Core i7
    • SPEC2006 benchmark suite
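
Putting the sketches together, one possible shape of the per-window decision is shown below (illustrative C only; the real governor is a Linux kernel module and its structure is not given in the slides).

```c
/* Choose the P-state that minimizes predicted EDP for the last window.
 * freqs[]/volts[] : available frequencies (Hz) and voltages
 * f_cur           : frequency the window was measured at
 * c_init          : cycles executed during the window
 * llc_stalls      : approximated LLC stall cycles (see slide 15)
 * c_eff, p_static : effective capacitance estimate and static power
 */
static int pick_best_pstate(const double freqs[], const double volts[],
                            int n_states, double f_cur, double c_init,
                            double llc_stalls, double c_eff, double p_static)
{
    int best = 0;
    double best_edp = -1.0;

    for (int i = 0; i < n_states; i++) {
        double k = f_cur / freqs[i];                          /* target = f_cur/k */
        double cycles = c_init - llc_stalls + llc_stalls / k; /* stall-based model */
        double t = cycles / freqs[i];                         /* predicted runtime */
        double p = c_eff * freqs[i] * volts[i] * volts[i] + p_static;
        double edp = p * t * t;                               /* energy * delay */
        if (best_edp < 0.0 || edp < best_edp) {
            best_edp = edp;
            best = i;
        }
    }
    return best;   /* index of the EDP-optimal frequency */
}
```

An ED2P policy would weight delay once more (p * t * t * t), and a performance-constrained variant would skip candidate frequencies whose predicted slowdown exceeds the allowed limit.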

  20. Intel i7 single process (OptEDP)

  21. Intel i7 single process (OptEDPlimit)

  22. Intel i7 multi-process (OptEDP)

  23. Conclusions
  • DVFS modeling in simulators
  • Implement the model in real processors
    • Apply, explain and validate our model for SPEC2006
    • Contribution: optimize power efficiency using Linux frequency governors
  • Other uses of the models
    • PowerSleuth: combine the models with phase detection to characterize the power behavior of applications
  • Future work
    • Multi-threaded applications

  24. Thank You! Any questions?
