
10 years of research on Power Management (now called green computing)


Presentation Transcript


  1. 10 years of research on Power Management (now called green computing). Rami Melhem, Daniel Mosse, Bruce Childers

  2. Introduction • Power management in real-time systems • Power management in multi-core processors • Performance-Resilience-Power Tradeoff • Management of memory power • Phase Change Memory

  3. Power management techniques. Two common techniques: • Throttling: turn off (or change the mode of) unused components; this requires predicting usage patterns to avoid the time and energy overhead of on/off or mode switching. • Frequency and voltage scaling: scale down a core's speed (frequency and voltage). Designing power-efficient components is orthogonal to power management.

  4. Frequency/voltage scaling: gracefully reduce performance. Power dissipation: Pd = C·f^3 + Pind, where C·f^3 is the dynamic component and Pind is static power, independent of f. [Figure: power vs. time, showing the C·f^3 dynamic component, the Pind static component, and idle time.] When the frequency is halved: • time is doubled • C·f^3 is divided by 8 • the energy due to C·f^3 is divided by 4 • the energy due to Pind is doubled.
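
A minimal sketch (not from the slides; C, Pind and the cycle count W are made-up values) that checks the halved-frequency arithmetic against the P = C·f^3 + Pind model:

```python
C, P_ind = 1.0, 2.0      # dynamic coefficient and static power (arbitrary units)
W = 100.0                # work, in cycles

def energy(f):
    t = W / f                      # execution time at frequency f (doubles when f halves)
    e_dyn = (C * f**3) * t         # energy of the dynamic component: C*W*f^2
    e_stat = P_ind * t             # energy of the static component: P_ind*W/f
    return e_dyn, e_stat

e1_dyn, e1_stat = energy(1.0)
e2_dyn, e2_stat = energy(0.5)
print(e1_dyn / e2_dyn)    # 4.0 -> dynamic energy is divided by 4
print(e2_stat / e1_stat)  # 2.0 -> static energy is doubled
```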

  5. Different goals of power management: • minimize total energy consumption (static energy decreases with speed, dynamic energy increases with speed) • minimize the energy-delay product (takes performance into consideration) • minimize the maximum temperature • maximize performance given a power budget • minimize energy given a deadline • minimize energy given reliability constraints. [Figure: energy per cycle, C·f^2 + Pind/f, and energy·delay, C·f + Pind/f^2, as functions of the speed f; each curve has its own minimizing frequency.]
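
As a worked example of the "minimize energy given a deadline" goal (my derivation, not from the slides): with P = C·f^3 + Pind and W cycles of work, energy per cycle is C·f^2 + Pind/f, minimized at f* = (Pind/(2C))^(1/3); with a deadline D the chosen speed is max(f*, W/D).

```python
C, P_ind = 1.0, 2.0
W, D = 100.0, 150.0

f_star = (P_ind / (2 * C)) ** (1 / 3)   # minimizer of C*f^2 + P_ind/f
f = max(f_star, W / D)                  # never miss the deadline

def energy_per_cycle(freq):
    return C * freq**2 + P_ind / freq

# Numerical sanity check: f_star beats its neighbors.
for candidate in (0.9 * f_star, f_star, 1.1 * f_star):
    print(candidate, energy_per_cycle(candidate))
print("chosen speed:", f)
```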

  6. DVS in real-time systems: utilize slack to slow down future tasks (proportional, greedy, aggressive, ...). [Figure: CPU speed over time, between Smin and Smax. Static scaling picks one speed from the worst-case execution time and the deadline; dynamic scaling re-computes the speed at power management points (PMPs), based on the remaining time.]
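
A minimal sketch of the greedy variant (assumed semantics; the slides only name the schemes): at each PMP, give all accumulated slack to the next stretch of execution by running just fast enough to finish the remaining worst case by the deadline.

```python
def pmp_speed(wcet_remaining, now, deadline, s_min, s_max):
    """Speed (fraction of nominal) chosen at a power management point.

    wcet_remaining: remaining worst-case execution time at nominal speed.
    """
    time_left = deadline - now
    s = wcet_remaining / time_left     # just-in-time speed
    return min(max(s, s_min), s_max)   # clamp to the feasible range

# Half the worst-case work remains, but most of the time budget is intact,
# so the task can slow down to half speed.
print(pmp_speed(wcet_remaining=5.0, now=2.0, deadline=12.0,
                s_min=0.2, s_max=1.0))   # 0.5
```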

  7. Implementation of power management points • PMPs can be implemented as periodic OS interrupts • Difficulty: the OS does not know how much execution remains, since paths through branches and loops have different minimum, average, and maximum lengths • The compiler can insert code to provide hints to the OS.

  8. Example of compiler/OS collaboration. At a power management hint (PMH), the compiler records the worst-case execution time, based on the longest remaining path. At a power management point, the OS uses knowledge about the current load to set the speed. [Figure: PMHs placed along an execution path with min, average, and max remaining times.]
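
A minimal sketch (hypothetical names) of this split: compiler-inserted hints keep a shared remaining-WCET estimate current, and the periodic PMP interrupt turns it into a frequency.

```python
state = {"wcet_remaining": 0.0}

def pmh(wcet_of_longest_remaining_path):
    # Inserted by the compiler where the longest remaining path is known.
    state["wcet_remaining"] = wcet_of_longest_remaining_path

def pmp(now, deadline, s_min=0.2, s_max=1.0):
    # Periodic OS interrupt: scale speed to the hinted remaining worst case.
    s = state["wcet_remaining"] / (deadline - now)
    return min(max(s, s_min), s_max)

pmh(4.0)                                   # hint: at most 4 time units remain
print(pmp(now=4.0, deadline=14.0))         # 0.4
```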

  9. Compiler/OS collaboration: PMHs are power management hints, PMPs are power management points. The compiler (which knows the task) uses static analysis to place PMHs in the application source code; the OS/HW (which knows the system) combines them with run-time information in the interrupts that execute PMPs. [Figure: timeline of PMHs in the application and of the interrupts executing PMPs.]

  10. DVS for multiple cores. To derive a simple analytical model, assume Amdahl's law: p% of the computation can be perfectly parallelized. Manage energy by determining: • the speed for the serial section • the number of cores used in the parallel section • the speed in the parallel section. [Figure: one core vs. two cores; using more cores permits slowing down the parallel section.]
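
A minimal sketch (my formulation under Amdahl's law, with an assumed per-core power model P = C·f^3 + Pind) that searches a small grid of configurations for the cheapest one meeting a deadline:

```python
C, P_ind = 1.0, 0.5
p, D = 0.8, 1.0          # parallel fraction and deadline; work is 1 unit

def time_energy(n, f_ser, f_par):
    t_ser = (1 - p) / f_ser                   # serial section, one core
    t_par = p / (n * f_par)                   # parallel section, n cores
    e_ser = (C * f_ser**3 + P_ind) * t_ser
    e_par = n * (C * f_par**3 + P_ind) * t_par
    return t_ser + t_par, e_ser + e_par

best = min(((n, fs, fp) + time_energy(n, fs, fp)
            for n in range(1, 9)
            for fs in (0.6, 0.8, 1.0)
            for fp in (0.3, 0.5, 0.8, 1.0)
            if time_energy(n, fs, fp)[0] <= D),
           key=lambda r: r[4])
print(best)   # (cores, serial speed, parallel speed, time, energy)
```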

  11. Mapping streaming applications to CMPs. Streaming applications are prevalent: audio, video, real-time tasks, cognitive applications. Constraints: inter-arrival time (T) and end-to-end delay (D). A power-aware mapping to CMPs must determine speeds, account for communication, and exclude faulty cores.

  12. Mapping a linear task graph onto a linear pipeline of cores, when the number of stages equals the number of cores: find the stage time t_stage that minimizes the total energy sum of (e_i + e'_i), subject to the timing constraints, where e_i is the energy for executing stage i, e'_i is the energy for moving data from stage i-1 to stage i, t_i is the time for executing stage i, and t'_i is the time for moving data from stage i-1 to stage i.

  13. Mapping a linear task graph onto a linear pipeline when the number of stages exceeds the number of cores: group the stages so that the number of groups equals the number of cores, and use a dynamic programming approach to explore the possible groupings (see the sketch below). A faster solution can still guarantee optimality within a specified error bound.
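
A minimal sketch of the grouping step (a standard contiguous-partition dynamic program; the slides do not give the exact recurrence), minimizing the bottleneck group time for k cores:

```python
from functools import lru_cache

def group_stages(times, k):
    """Minimal bottleneck time over all ways to split `times` into k
    contiguous groups (one group per core)."""
    n = len(times)
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, cores):
        if cores == 1:                          # one core takes all the rest
            return prefix[n] - prefix[i]
        return min(max(prefix[j] - prefix[i],   # first group: stages i..j-1
                       best(j, cores - 1))      # rest on the remaining cores
                   for j in range(i + 1, n - cores + 2))

    return best(0, k)

print(group_stages([2, 1, 3, 2, 4], 3))   # 5.0: groups [2,1], [3,2], [4]
```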

  14. Mapping a non-linear task graph onto a CMP. Timing constraints are conventionally satisfied through load-balanced mapping. Additional constraints: minimize energy consumption, maximize performance for a given energy budget, and avoid faulty cores. [Figure: a task graph with tasks A through K mapped onto a grid of cores running at maximum, medium, or minimum speed.]

  15. Turn OFF some PEs. [Figure: the same task graph mapped with per-core speed/voltage levels between fmax and fmin; processing elements (PEs) with no assigned tasks are turned off.]

  16. DVS using machine learning • Characterize the execution state of a core by: the rate of instruction execution (IPC), the number of memory accesses per instruction, and the average memory access time (which depends on the other threads) • During training, record the core frequency and the energy consumption for each state, and determine the optimal frequency for each state • During execution, periodically: estimate the current state (through run-time measurements), assume that the future is a continuation of the present, and set the frequency to the best one recorded during training (see the sketch below). [Figure: a four-core CMP, each core with private L1 and L2 caches, sharing a memory controller (MC) and memory (M).]
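
A minimal sketch (the bins and table entries are hypothetical) of the runtime side: quantize the measured state and look up the best frequency recorded for it during training.

```python
def quantize(ipc, mem_per_instr, avg_mem_latency):
    # Coarse state bins; the thresholds are made up for illustration.
    return (ipc >= 1.0, mem_per_instr >= 0.1, avg_mem_latency >= 50.0)

# Built during training: state -> frequency (GHz) with the lowest energy.
policy = {
    (True,  False, False): 2.4,   # compute-bound: run fast
    (False, True,  True ): 1.2,   # memory-bound: slow down, save energy
}

def pick_frequency(ipc, mem_per_instr, avg_mem_latency, default=1.8):
    # Assume the near future is a continuation of the present interval.
    return policy.get(quantize(ipc, mem_per_instr, avg_mem_latency), default)

print(pick_frequency(ipc=1.6, mem_per_instr=0.02, avg_mem_latency=12.0))  # 2.4
```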

  17. Statistical learning applied to DVS in CMPs. [Figure: in the training phase, a learning engine feeds an automatic policy generator; at runtime, the integrated DVS policy determines frequencies and voltages.]

  18. Energy-reliability tradeoff, using time redundancy (checkpointing and rollbacks). If you have time slack: 1) add checkpoints, 2) reserve recovery time, 3) reduce the processing speed. For a given number of checkpoints, we can find the speed that minimizes energy consumption while guaranteeing recovery and timeliness. [Figure: execution below Smax, with checkpoints and recovery time reserved before the deadline.]

  19. Optimal number of checkpoints. More checkpoints mean more overhead, but less recovery slack must be reserved. For a given slack (C/D) and checkpoint overhead (r/C), we can find the number of checkpoints that minimizes energy consumption while guaranteeing recovery and timeliness (see the sketch below). [Figure: energy as a function of the number of checkpoints, with a clear minimum; C is the computation time, D the deadline, and r the per-checkpoint overhead.]
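
A minimal sketch (my reconstruction of the slide's reasoning, with an assumed power model P = C_dyn·f^3 + Pind): with n checkpoints of overhead r each, one segment of length C/n must be re-executable before D; whatever time remains is used to lower the speed.

```python
C_dyn, P_ind = 1.0, 0.2
C, D, r = 8.0, 12.0, 0.2     # work at nominal speed, deadline, checkpoint cost

def energy(n):
    budget = D - n * r - C / n      # time left after overhead + recovery reserve
    if budget <= 0:
        return float("inf")         # cannot guarantee timeliness
    f = C / budget                  # slowest speed that still fits
    if f > 1.0:
        return float("inf")         # would exceed Smax
    return (C_dyn * f**3 + P_ind) * (C / f)

best_n = min(range(1, 20), key=energy)
print(best_n, energy(best_n))       # 6 checkpoints for these numbers
```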

  20. Faults are rare events. If a fault occurs, execution may continue at Smax after recovery.

  21. Non-uniform checkpointing. Observation: if a fault occurs, execution may continue at Smax after recovery. Disadvantage: this increases energy consumption when a fault occurs (a rare event). Advantage: recovery in an early section can use the slack created by executing later sections at Smax. This requires non-uniform checkpoints.

  22. Triple modular redundancy (TMR) vs. duplex. Duplex: compare the two results and roll back if they differ. TMR: vote and exclude the faulty result. The efficiency of TMR vs. duplex depends on • the static power (λ) • the checkpoint overhead • the load. [Figure: at load = 0.7, the plane of static power (λ, roughly 0.1 to 0.2) vs. checkpoint overhead (roughly 0.02 to 0.035) divides into a region where TMR is more energy efficient and one where duplex is.]

  23. Add memory power to the mix. Example: DRAM and SRAM modules can be switched between different power states (modes), but not for free: there is a mode-transition power overhead and a mode-transition time overhead. [Figure: DRAM power states: active (779.1 mW), standby (275.0 mW), power-down (150 mW), and self-refresh (20.87 mW), with 5 ns transitions between the upper states and a 1000 ns transition out of self-refresh.]
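
A minimal sketch of the standard break-even reasoning behind such mode switching (the simplifying assumption that the exit latency is spent at active power is mine; the powers and latencies are from the slide):

```python
P_ACTIVE = 779.1   # mW
states = {
    # name: (power in mW, exit latency in ns)
    "standby":      (275.0, 5.0),
    "power-down":   (150.0, 5.0),
    "self-refresh": (20.87, 1000.0),
}

def best_state(idle_ns):
    """Lowest-energy state for an idle gap of idle_ns nanoseconds."""
    candidates = {"active": P_ACTIVE * idle_ns}          # mW*ns units
    for name, (p, exit_ns) in states.items():
        candidates[name] = p * idle_ns + P_ACTIVE * exit_ns
    return min(candidates, key=candidates.get)

print(best_state(5))        # 'active': the gap is too short to pay the overhead
print(best_state(100))      # 'power-down'
print(best_state(100_000))  # 'self-refresh': a long gap amortizes the 1000 ns
```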

  24. OS-assisted memory power management • Keep a histogram of bank-access patterns and idle-time distributions • Use machine learning techniques to select the optimal threshold for turning banks off.

  25. Example of compiler-assisted memory power management: code transformations that increase memory idle time (the time between memory accesses). [Example from the slide: the access sequence Load x, Store x, Load z, Load y, Store z, Store y is transformed so that related accesses are clustered (Load x, Load y, ..., Store y, Load z, ..., Store z), lengthening the idle gaps between memory accesses.]

  26. Another example of compiler-assisted memory power management: algorithms that use the access pattern to allocate memory to banks in a way that maximizes bank idle times. [Example from the slide: for arrays A[], B[], C[], D[] accessed in the order A, D, B, C, B, allocate either A[], B[] to one bank and C[], D[] to another, or A[], D[] to one and C[], B[] to the other, whichever keeps each bank idle longer.]

  27. Phase change memory (PCM): a power-saving memory technology • Solid-state memory made of a germanium-antimony alloy • Switching between states is thermal (not electrical) • Samsung, Intel, Hitachi and IBM have developed PCM prototypes (to replace Flash).

  28. Properties of PCM • Non-volatile, but faster than Flash • Byte-addressable, but denser and cheaper than DRAM • No static power consumption and very low switching power • Not susceptible to SEUs (single-event upsets), and hence does not need error-detecting or error-correcting codes • Errors occur only during writes (not reads), so a simple read-after-write suffices to detect them.

  29. So, where is the catch? • Slower than DRAM: a factor of 2 for reads and 10 for writes • Low endurance: a cell fails after about 10^7 writes (as opposed to 10^15 for DRAM) • Asymmetric energy consumption: a write is more expensive than a read • Asymmetry in bit writing: writing 0s is faster than writing 1s.

  30. Goal: use PCM as main memory. Traditional architecture: the CPU reaches DRAM through a memory controller. Proposed architecture: the CPU reaches PCM through a memory manager (MM) and an acceleration/endurance buffer (AEB). Advantages: cheaper, denser, and lower power consumption.

  31. Dealing with asymmetric reads/writes • Use coherence algorithms in which writes are not on the critical path • Design algorithms with "read rather than write" in mind • Take advantage of the fact that writing 0s is faster than writing 1s: pre-write a block with 1s as soon as the block becomes dirty in the cache, and on write-back only write the 0s (see the sketch below).
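
A minimal bit-level illustration (not a hardware design) of the pre-write idea, on 8-bit words:

```python
def prewrite(block_size):
    # As soon as the block turns dirty in the cache, set every PCM bit to 1
    # (the slow value), off the critical path.
    return [0xFF] * block_size

def writeback(prewritten, data):
    # On write-back only 0s remain to be written: ANDing the data into the
    # all-ones block never has to flip a bit from 0 to 1.
    return [p & d for p, d in zip(prewritten, data)]

block = prewrite(4)
print([hex(w) for w in writeback(block, [0x12, 0xF0, 0xFF, 0x00])])
# ['0x12', '0xf0', '0xff', '0x0']
```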

  32. Dealing with low write endurance (write minimization) • Block (or page) allocation algorithms should not be oblivious to the status of the block, for wear minimization • Modify the cache replacement algorithm, e.g. LRR (least recently read) replacement • Give lower priority to dirty pages in cache replacement • Use coherence algorithms that minimize writes (write-through is not good) • Read/compare/write: write a bit only if it differs from the current content (see the sketch below).
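
A minimal sketch of read/compare/write: read the stored word first, then touch only the differing bits.

```python
def bits_to_write(current, new):
    """Number of bit flips a read/compare/write actually performs."""
    diff = current ^ new              # 1 wherever the stored bit must change
    return bin(diff).count("1")

# A naive write of an 8-bit word touches 8 cells; comparing first reduces
# that to the true difference (here, a single bit).
print(bits_to_write(0b10110100, 0b10110110))   # 1
```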

  33. Wear leveling • Memory allocation decisions should consider the age of blocks (age = the number of write cycles exerted so far); see the sketch below • Periodically change the physical location of a page (write it to a location different from the one it was read from) • Consider memory as a consumable resource that can be periodically replaced.
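
A minimal sketch (a hypothetical allocator, not the authors' design) of age-aware block allocation: always hand out the least-worn free block, and advance a block's age by the writes it absorbs.

```python
import heapq

class WearAwareAllocator:
    def __init__(self, num_blocks):
        # Min-heap of (age, block): the least-worn free block comes out first.
        self.free = [(0, b) for b in range(num_blocks)]
        heapq.heapify(self.free)

    def allocate(self):
        return heapq.heappop(self.free)        # (age, block)

    def release(self, block, age, writes_done):
        # Return the block with its age advanced by the writes it absorbed.
        heapq.heappush(self.free, (age + writes_done, block))

alloc = WearAwareAllocator(3)
age, block = alloc.allocate()                  # block 0, age 0
alloc.release(block, age, writes_done=100)
print(alloc.allocate())                        # (0, 1): a fresh block is preferred
```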

  34. Detailed architecture. [Figure: block diagram of the memory manager: a CPU bus interface and request controller feed a requests buffer (CPU, R/W, size, tag entries) and a tag array with valid/dirty/address entries, managed by an FSM with a busy bitmap and in-flight buffers (SRAM); a control bus and a data bus connect a DRAM controller/DMAC for the AEB's page cache and a PCM controller/DMAC for the PCM pages area, which also holds a spare table and spares.]

  35. Conclusion. It is essential to manage the tradeoffs between time constraints (deadlines or rates), energy constraints, and reliability constraints, across the compiler, the OS, and the hardware.
