
Overview



Presentation Transcript


  1. Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

  2. Power modeling • Research Power Simulators • Wattch – Brooks and Martonosi ISCA2000 • SimplePower – Vijaykrishnan et al (Penn State) ISCA2000 • TEMPEST – Dhodapkar et al (Intel/Wisconsin) • PowerAnalyzer – Umich/Colorado • AccuPower – SUNY Binghamton • Industry Power Simulators • IBM PowerTimer – Brooks and Bose PACS2000 • Intel ALPS – Gunther, et al.

  3. Power: The Basics • Dynamic power vs. Static power • Dynamic: “switching” power • Static: “leakage” power • Dynamic power dominates, but static power increasing in importance • Trends in each • Static power: steady, per-cycle energy cost • Dynamic power: capacitive and short-circuit • Capacitive power: charging/discharging at transitions from 0→1 and 1→0 • Short-circuit power: power due to brief short-circuit current during transitions • Mostly focus on capacitive, but recent work on others

  4. Capacitive Power Dissipation • Power ~ ½ C V² A f • Capacitance (C): function of wire length, transistor size • Supply voltage (V): has been dropping with successive fab generations • Activity factor (A): how often, on average, do wires switch? • Clock frequency (f): increasing… • [Figure: inverter with Vdd, Vin, Vout, and load capacitance CL]
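The relation Power ~ ½ C V² A f on this slide can be evaluated directly. The following is a minimal sketch; the function name and the numeric values are illustrative assumptions, not figures from the presentation.

```python
def capacitive_power(c_farads, vdd_volts, activity, freq_hz):
    """Dynamic (capacitive) power estimate: P ~ 1/2 * C * V^2 * A * f."""
    return 0.5 * c_farads * vdd_volts ** 2 * activity * freq_hz

# Illustrative values only (assumed): 1 nF switched capacitance,
# activity factor 0.2, 1 GHz clock.
p_base = capacitive_power(1e-9, 1.5, 0.2, 1e9)   # supply at 1.5 V
p_low_v = capacitive_power(1e-9, 1.2, 0.2, 1e9)  # supply at 1.2 V

# Power scales with V^2, so the ratio is (1.2 / 1.5)^2.
print(p_base, p_low_v, p_low_v / p_base)
```

Because voltage enters quadratically, lowering the supply voltage gives a larger reduction than an equal relative change in C, A, or f.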

  5. Short-Circuit Power Dissipation • Short-Circuit Current caused by finite-slope input signals • Direct Current Path between VDD and GND when both NMOS and PMOS transistors are conducting

  6. Leakage Power • Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage
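The exponential dependence mentioned on this slide can be illustrated with a simplified textbook form of the subthreshold current at zero gate bias, I ≈ I0 · exp(−Vth / (n · kT/q)). This is a hedged sketch: the prefactor `i0`, the slope factor `n`, and the sample values are assumptions, and the model ignores the temperature dependence of Vth and I0 that a real device model would include.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant over electron charge, in V/K

def subthreshold_leakage(i0, vth, temp_k, n=1.5):
    """Simplified subthreshold current at Vgs = 0 (assumed model)."""
    v_thermal = K_OVER_Q * temp_k  # thermal voltage kT/q
    return i0 * math.exp(-vth / (n * v_thermal))

# Assumed sample points: same device at two temperatures and two thresholds.
cool = subthreshold_leakage(1.0, 0.30, 300.0)
hot = subthreshold_leakage(1.0, 0.30, 360.0)
high_vth = subthreshold_leakage(1.0, 0.40, 300.0)

print(hot / cool)       # greater than 1: leakage rises with temperature
print(cool / high_vth)  # greater than 1: leakage rises as Vth drops
```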

  7. Modeling Hierarchy and Tool Flow

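  8. Analysis Abstraction Levels • Abstraction levels, from highest to lowest: Application, Behavioral, Architectural (RTL), Logic (Gate), Transistor (Circuit) • At the highest level (Application): most analysis capacity, worst analysis accuracy, fastest analysis speed, least analysis resources, most energy savings • At the lowest level (Transistor): least analysis capacity, best analysis accuracy, slowest analysis speed, most analysis resources, least energy savings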

  9. Power/Performance abstractions • Low-level: • Hspice • PowerMill • Medium-Level: • RTL Models • Architecture-level: • PennState SimplePower • Intel Tempest • Princeton Wattch • IBM PowerTimer • Umich/Colorado PowerAnalyzer

  10. Low-level models: Hspice • Extracted netlists from circuit/layout descriptions • Diffusion, gate, and wiring capacitance is modeled • Analog simulation performed • Detailed device models used • Large systems of equations are solved • Can estimate dynamic and leakage power dissipation within a few percent • Slow, only practical for 10-100K transistors • PowerMill (Synopsys) is similar but about 10x faster

  11. Medium-level models: RTL • Logic simulation obtains switching events for every signal • Structural VHDL or Verilog with zero or unit-delay timing models • Capacitance estimates performed • Device Capacitance • Gate sizing estimates performed, similar to synthesis • Wiring Capacitance • Wire load estimates performed, similar to placement and routing • Switching event and capacitance estimates provide dynamic power estimates

  12. Architecture level models • Bottom-up Approach: • Estimate “CV²f” via analytical models • Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode) • Top-Down Approach • Estimate “CV²f” via empirical measurements • Tools: PowerTimer, AccuPower, Most Industrial Tools • Estimate “A” via statistics from architectural-performance simulators • Power ~ ½ CV²Af
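The split described on this slide, a per-unit CV²f term combined with an activity factor A taken from performance-simulator statistics, can be sketched as follows. The function names, the normalization of A by cycles and ports, and the sample numbers are all assumptions made for illustration.

```python
def activity_factor(access_count, cycles, ports=1):
    """Fraction of available port-cycles in which the unit was accessed."""
    if cycles <= 0 or ports <= 0:
        raise ValueError("cycles and ports must be positive")
    return access_count / (cycles * ports)

def unit_power(cv2f_watts, access_count, cycles, ports=1):
    """Power ~ 1/2 * (C V^2 f) * A, with A derived from simulator counts."""
    return 0.5 * cv2f_watts * activity_factor(access_count, cycles, ports)

# Assumed example: a 2-port unit accessed 500 times over 1000 cycles,
# with a CV^2f term of 8 W obtained analytically or empirically.
a = activity_factor(500, 1000, ports=2)
p = unit_power(8.0, 500, 1000, ports=2)
print(a, p)
```

The CV²f term is where the bottom-up and top-down approaches differ; the activity term is computed the same way in both.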

  13. Analytical Models: Capacitance • Requires modeling wire length and estimating transistor sizes • Related to RC Delay analysis for speed along critical path • But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one.

  14. Register File: Capacitance Analysis • [Figure labels: Pre-Charge, Decoders, Wordlines (Number of Entries), Bitlines (Data Width of Entries), Number of Ports, Cell, Cell Access Transistors (N1), Bit, Sense Amps]

  15. Register File Model: Accuracy • Validated against a register file schematic used in internal Intel design • Compared capacitance values with estimates from a layout-level Intel tool • Interconnect capacitance had largest errors • Model neglects poly connections • Differences in wire lengths -- difficult to tell wire distances of schematic nodes (Numbers in Percent)

  16. Different Circuit Design Styles • RTL and architectural-level power estimation requires the tool/user to make assumptions about circuit design style • Static vs. Dynamic logic • Single vs. Double-ended bitlines in register files/caches • Sense Amp designs • Transistor and buffer sizings • Generic solutions are difficult because many styles are popular • Within individual companies, circuit design styles may be fixed

  17. Clock Gating: What, why, when? • Dynamic Power is dissipated on clock transitions • Gating off clock lines when they are unneeded reduces activity factor • But putting extra gate delays into clock lines increases clock skew • End results: • Clock gating complicates design analysis but saves power. • [Figure labels: Clock, Gate, Gated Clock]

  18. Wattch: An Overview • Wattch’s Design Goals • Flexibility • Planning-stage info • Speed • Modularity • Reasonable accuracy • Overview of Features • Parameterized models for different CPU units • Can vary size or design style as needed • Abstract signal transition models for speed • Can select different conditional clocking and input transition models as needed • Based on SimpleScalar (has been ported to many simulators) • Modular: Can add new models for new units studied

  19. Unit Modeling • Modeling Capacitance • Models depend on structure, bitwidth, design style, etc. • E.g., may model capacitance of a register file with bitwidth & number of ports as input parameters • Modeling Activity Factor • Use cycle-level simulator to determine number and type of accesses • reads, writes, how many ports • Abstract model of bitline activity • [Figure: Parameterized Register File Power Model with inputs Number of entries, Data width of entries, # Read Ports, # Write Ports, Bitline Activity, and Number of Active Ports, producing a Power Estimate]

  20. One Cycle in Wattch • On each cycle: • determine which units are accessed • model execution time issues • model per-unit energy/power based on which units used and how many ports.
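The per-cycle bookkeeping described above can be sketched as a loop over a trace of unit accesses. This is not Wattch's actual code: the unit names, the per-access energies, and the simplification that idle units cost nothing in a cycle are all assumptions for illustration.

```python
# Hypothetical per-port-access energies in nanojoules (assumed values).
ENERGY_PER_ACCESS_NJ = {"icache": 1.0, "regfile": 0.4, "alu": 0.3}

def cycle_energy(accesses):
    """Energy for one cycle; accesses maps unit name -> ports used."""
    return sum(ENERGY_PER_ACCESS_NJ[unit] * ports
               for unit, ports in accesses.items())

def run(trace):
    """Return per-cycle energies and their total for a list of cycles."""
    per_cycle = [cycle_energy(accesses) for accesses in trace]
    return per_cycle, sum(per_cycle)

# Three assumed cycles: a busy cycle, a fetch-only cycle, and an idle cycle.
trace = [
    {"icache": 1, "regfile": 2, "alu": 1},
    {"icache": 1},
    {},
]
per_cycle, total = run(trace)
print(per_cycle, total)
```

A conditional clocking model that charges idle units a fraction of their full power would replace the "idle costs nothing" assumption inside `cycle_energy`.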

  21. Units Modeled by Wattch • Array Structures • Caches, Reg Files, Map/Bpred tables • Content-Addressable Memories (CAMs) • TLBs, Issue Queue, Reorder Buffer • Complex combinational blocks • ALUs, Dependency Check • Clocking network • Global Clock Drivers, Local Buffers

  22. PowerTimer • IBM Tool First Developed During Summer of 2000 • Continued Development: 2001 => Today • Methodology Applied to Research and Product Power-Performance Simulators within IBM • Currently in Beta-Release • Working towards Full Academic Release

  23. PowerTimer: Empirical Power Pre-silicon, POWER4-like superscalar design

  24. Processor Power Density Pre-silicon, POWER4-like superscalar design Originally presented at PACS2002

  25. PowerTimer • [Flow diagram elements: Circuit Power Data (Macros); Tech Parms; uArch Parms; Program Executable or Trace; Architectural Performance Simulator, producing CPI and AF/SF Data; Compute Sub-Unit Power; Power] • SubUnit Power = f(SF, uArch, Tech)

  26. PowerTimer: Energy Models • Energy models for uArch structures formed by summation of circuit-level macro data • Sub-Units (uArch-level Structures) • Macro1: Power = C1*SF + HoldPower • Macro2: Power = C2*SF + HoldPower • … • MacroN: Power = Cn*SF + HoldPower • [Figure: SF Data feeds the Energy Models, which produce the Power Estimate]
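The summation on this slide, where each macro contributes Power = Ci*SF + HoldPower and a sub-unit is the sum of its macros, can be sketched as below. The coefficients and hold powers are made-up numbers, and giving each macro its own hold power is an assumption about how the slide's "HoldPower" term is meant.

```python
def macro_power(c, sf, hold_power):
    """Per-macro linear model: Power = C * SF + HoldPower."""
    return c * sf + hold_power

def subunit_power(macros, sf):
    """Sum the macro models; macros is a list of (C, HoldPower) pairs."""
    return sum(macro_power(c, sf, hold) for c, hold in macros)

# Assumed sub-unit made of two macros, evaluated at 10% and 0% switching.
macros = [(2.0, 0.5), (4.0, 1.0)]
p_active = subunit_power(macros, 0.10)
p_hold = subunit_power(macros, 0.0)
print(p_active, p_hold)
```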

  27. Empirical Estimates with CPAM • Estimate power under “Input Hold” and “Input Switching” Modes • Input Hold: All Macro Inputs (Except Clocks) Held • Can also collect data for Clock Gate Signals • Input Switching: Apply Random Switching Patterns with 50% Switching on Input Pins • [Figure: Macro with Macro Inputs; measured results are 0% Switching (Hold Power) and 50% Switching Power]

  28. Example Unit • Made up of 5 macros

  29. PowerTimer: Models f(SF) • Assumption: Power linearly dependent on Switching Factor • This separates Clock Power and Switching Power • At 0% SF, Power = Clock Power (significant without clock gating) • [Figure labels: Switching Power, Clock Power]
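Under the linearity assumption, the two measurements described earlier (hold power at 0% switching and power at 50% switching) are enough to define the line. The sketch below shows one way to do that; the function names and sample powers are assumptions, not PowerTimer's interface.

```python
def fit_linear_sf_model(hold_power, power_at_50):
    """Return (slope, intercept) of Power = slope * SF + intercept,
    fitted through the 0% and 50% switching measurements."""
    slope = (power_at_50 - hold_power) / 0.5
    return slope, hold_power

def power_at(sf, slope, intercept):
    """Evaluate the fitted line at switching factor sf (0.0 to 1.0)."""
    return slope * sf + intercept

# Assumed measurements: 1.0 W with inputs held, 3.0 W at 50% switching.
slope, intercept = fit_linear_sf_model(1.0, 3.0)
p_10 = power_at(0.10, slope, intercept)
print(slope, intercept, p_10)
```

The intercept is the clock (hold) power and the slope term is the switching power, which is the separation the slide refers to.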

  30. Key Activity Data • SF => Moves along the Switching Power Curve • Estimated on a per-unit basis from RTL Analysis • AF => Moves along the Clock Power Curve • Extracted from Microarchitectural Statistics (Turandot) • [Figure labels: Changes in SF, Changes in AF]

  31. Microarchitectural Statistics • Stats are very similar to tracking used in Wattch, etc • Differences: • Clock Gating Modes (3 modes) • Customized Scaling Based on Circuit Style (4 styles) • Clock Gating Modes: • P_constrained = P_unconstrained (not clock-gateable) • P_constrained_1 = AF * (Pclock + Plogic) (common) • P_constrained_2 = AF * Pclock + Plogic (rare) • P_constrained_3 = Pclock + AF * Plogic (very rare) • Scaling Based on Circuit Styles • AF_1 = #valid (Latch-and-Mux, No Stall Gating) • AF_2 = #valid - #stalls (Latch-and-Mux, With Stall Gating) • AF_3 = #writes (Arrays that only gate updates) • AF_4 = #writes + #reads (Arrays, RAM Macros)
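The clock gating modes and circuit-style scalings listed above translate directly into two small lookup functions. In this sketch, mode 0 stands for the non-gateable case, and dividing the event counts by the cycle count to obtain AF is an assumption, since the slide gives only the counts.

```python
def constrained_power(mode, af, p_clock, p_logic):
    """Power under the clock gating modes listed on the slide."""
    if mode == 0:   # not clock-gateable: P_constrained = P_unconstrained
        return p_clock + p_logic
    if mode == 1:   # common: clock and logic both scale with AF
        return af * (p_clock + p_logic)
    if mode == 2:   # rare: only clock power scales with AF
        return af * p_clock + p_logic
    if mode == 3:   # very rare: only logic power scales with AF
        return p_clock + af * p_logic
    raise ValueError("unknown clock gating mode")

def activity_count(style, valid=0, stalls=0, writes=0, reads=0):
    """Event count used for AF under the four circuit styles."""
    if style == 1:  # latch-and-mux, no stall gating
        return valid
    if style == 2:  # latch-and-mux, with stall gating
        return valid - stalls
    if style == 3:  # arrays that only gate updates
        return writes
    if style == 4:  # arrays, RAM macros
        return writes + reads
    raise ValueError("unknown circuit style")

# Assumed statistics: 600 valid cycles and 100 stalls out of 1000 cycles.
cycles = 1000
af = activity_count(2, valid=600, stalls=100) / cycles
print(af, [constrained_power(m, af, 2.0, 4.0) for m in range(4)])
```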

  32. Clock Gating: Valid-Bit Gating • Latch-Based Structures: Execute Pipelines, Issue Queues • [Figure: Clock distributed to latch stages, each with a valid bit (V)]

  33. Clock Gating Modes • P_constrained_1 = AF * (Pclock + Plogic) • [Figure labels: clock, valid, Plogic, Pclock] • P_constrained_2 = AF * Pclock + Plogic • [Figure labels: clock, Selection Logic, valid, Pclock, Plogic]

  34. Valid-bit Gating, Stalls? • Option 1: Stalls cannot be gated • Option 2: Stalls can be gated • [Figure labels, both options: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]

  35. Scaling: Array Structures • Option 1: Reads and Writes Eligible to Gate for Power • [Figure labels: Write Bitline, Read Bitline, read_wordline_active, read_gate, write_wordline_active, write_gate, Cell, read_data, write_data]

  36. Scaling: Array Structures • Option 2: Only Writes Eligible to Gate for Power • [Figure labels: Write Bitline, read_entry_0, read_entry_1, read_entry_2, read_entry_n, read_data, write_wordline_active, write_gate, Cell, write_data]

  37. 12 Clock Gating Modes

  38. PowerTimer Observations • PowerTimer works well for POWER4-like estimates and derivatives • Scales the base microarchitecture quite well • E.g. optimal power-performance pipelining study • Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise) • Chip bit-level SFs are quite low (5-15%) • Most (60-70%) power is dissipated while maintaining state (arrays, latches, clocks) • Much state is not available in early-stage timers

  39. Comparing models: Flexibility • Flexibility necessary for certain studies • Resource tradeoff analysis • Modeling different architectures • Purely analytical tools provide fully-parameterizable power models • Within this methodology, circuit design styles could also be studied • PowerTimer scales power models in a user-defined manner for individual sub-units • Constrained to structures and circuit-styles currently in the library • Perhaps Mixed Mode tools could be very useful

  40. Comparing models: Accuracy • PowerTimer -- Based on validation of individual pieces • Extensive validation of the performance model (AFs) • Power estimates from circuits are accurate • Circuit designers must vouch for clock gating scenarios • Certain assumptions will limit accuracy or require more in-depth analysis • Analytical Tools • Inherent Issues • Analytical estimates cannot be as accurate as SPICE analysis (“C” estimates, CV² approximation) • Practical Issues • Without industrial data, must estimate transistor sizing, bits per structure, circuit choices

  41. Comparing models: Speed • Performance simulation is slow enough! • Post-Processing vs. Run-Time Estimates • Wattch’s per-cycle power estimates: roughly 30% overhead • Post-processing (per-program power estimates) would be much faster (minimal overhead) • PowerTimer allows both no-overhead post-processing and run-time analysis for certain studies (di/dt, thermal) • Some clock gating modes may require run-time analysis • Third Option: Bit Vector Dumps • Flexible Post-Processing → Huge Output Files

  42. Power modeling summary • Wattch provides excellent relative accuracy • Underestimates full chip power (some units not modeled, etc) • PowerTimer models based on circuit-level power analysis • Inaccuracy is introduced in SF/AF and scaling assumptions

  43. Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

  44. Existing Work • Research Ideas • DEETM – Huang and Torrellas MICRO2000 • DTM – Brooks and Martonosi HPCA2001 • Control-Theoretic DTM – Skadron, Abdelzaher, Stan HPCA2002 • Thermal Scheduling – Cai, Lim, Daasch WCED2002 • Commercial Products • PowerPC G3 Microprocessor • Pentium III • Pentium 4

  45. Overview • Hard to optimize power-performance at design time for all cases • Forces conservative choices for issues like cooling, current delivery, resource sizes • Want to explore dynamic power optimizations for run-time power management • Dynamic Voltage/Frequency Scaling [Burd, 2000] • Dynamic Hardware Resizing [Albonesi, 1999] • Fetch Throttling [Sanchez, 1997] • Global Clock Gating [Gunther, 2001] • Speculation Control [Manne, 1998] • Dynamic Thermal Management [Brooks, 2001][Huang, 2000]

  46. Important to optimize P & T early • [Figure labels: 12FO4, 14FO4, 18FO4, 23FO4, Maximum Power Budget]

  47. Dynamic Thermal Management • Goal: • Provide dynamic techniques to cool chip when needed • Exploit natural variations due to different applications, phase behavior, … • Allow designers to target average, rather than worst-case behavior • Design Decisions: • Mechanism & policy for triggering response? • What should response be? • How to select DTM trigger levels?

  48. Power consumption impacts cost • System costs associated with power dissipation: • Thermal control cost • Heatsinks, fans • Power delivery • Power supply • Decoupling caps… • [Figure label: CPU] • From: Gunther, et al. “Managing the Impact of Increasing Microprocessor Power Consumption,” Intel Technology Journal, Q1, 2001

  49. Average and Worst Case Power • System costs are constrained by worst case power dissipation • Average case power dissipation can often be much lower • Aggressive Clock Gating • Application variations • Underutilized resources • Not enough ILP • Floating point units during integer code execution • Currently about a 30% difference • Likely to further diverge…

  50. Dynamic Thermal Management • [Figure: Temperature vs. Time. Levels: Designed for Cooling Capacity w/out DTM, Designed for Cooling Capacity w/ DTM, DTM Trigger Level. The gap between the two design levels is labeled System Cost Savings. Curves: DTM Disabled, DTM/Response Engaged]
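The trigger-level idea in this figure can be sketched as a simple threshold policy that engages a response when the temperature reaches the DTM trigger level. The optional lower release threshold (hysteresis), the function name, and the temperature samples are assumptions added for illustration; the presentation does not prescribe a specific policy here.

```python
def dtm_states(temps, trigger, release=None):
    """Return, per sample, whether the DTM response is engaged.

    Engages when temperature >= trigger. Disengages when temperature
    drops below release (defaults to trigger, i.e. no hysteresis).
    """
    if release is None:
        release = trigger
    engaged = False
    states = []
    for t in temps:
        if not engaged and t >= trigger:
            engaged = True
        elif engaged and t < release:
            engaged = False
        states.append(engaged)
    return states

# Assumed temperature samples (degrees C), trigger at 80, release at 75.
temps = [70, 81, 79, 76, 74, 82]
states = dtm_states(temps, trigger=80, release=75)
print(states)
```

Choosing the trigger and release levels is the "how to select DTM trigger levels" design decision raised earlier in the deck; a lower trigger gives more thermal margin at the cost of engaging the response more often.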
