Overview

Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

Power modeling • Research Power Simulators • Wattch – Brooks and Martonosi ISCA2000 • SimplePower – Vijaykrishnan et al (Penn State) ISCA2000 • TEMPEST – Dhodapkar et al (Intel/Wisconsin) • PowerAnalyzer – Umich/Colorado • AccuPower – SUNY Binghamton • Industry Power Simulators • IBM PowerTimer – Brooks and Bose PACS2000 • Intel ALPS – Gunther, et al.

Power: The Basics • Dynamic power vs. Static power • Dynamic: “switching” power • Static: “leakage” power • Dynamic power dominates, but static power is increasing in importance • Trends in each • Static power: steady, per-cycle energy cost • Dynamic power: capacitive and short-circuit • Capacitive power: charging/discharging at transitions from 0→1 and 1→0 • Short-circuit power: power due to brief short-circuit current during transitions • Mostly focus on capacitive, but recent work on others

Capacitive Power Dissipation • Power ~ ½ CV²Af • Capacitance (C): function of wire length, transistor size • Supply voltage (V): has been dropping with successive fab generations • Activity factor (A): how often, on average, do wires switch? • Clock frequency (f): increasing… • [Figure: circuit with supply Vdd, input Vin, output Vout, and load capacitance CL]
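The relation on this slide can be evaluated directly. The sketch below is only illustrative: the function name and all numeric values (1 nF of switched capacitance, 1.5 V and 1.2 V supplies, an activity factor of 0.2, a 1 GHz clock) are assumptions chosen for the example, not figures from the talk.

```python
def capacitive_power(c_farads, vdd_volts, activity, freq_hz):
    """Dynamic capacitive power: P ~ 1/2 * C * V^2 * A * f (watts)."""
    return 0.5 * c_farads * vdd_volts ** 2 * activity * freq_hz


# Hypothetical operating points: same C, A, and f, two supply voltages.
base = capacitive_power(1e-9, 1.5, 0.2, 1e9)
scaled = capacitive_power(1e-9, 1.2, 0.2, 1e9)

print(f"1.5 V: {base:.3f} W, 1.2 V: {scaled:.3f} W, ratio {scaled / base:.2f}")
```

Because power is quadratic in V, dropping the supply from 1.5 V to 1.2 V in this example cuts capacitive power to (1.2/1.5)² = 64% of its original value with everything else held fixed.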

Short-Circuit Power Dissipation • Short-Circuit Current caused by finite-slope input signals • Direct Current Path between VDD and GND when both NMOS and PMOS transistors are conducting

Leakage Power • Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage
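To make the exponential dependence concrete, here is a minimal sketch using the textbook first-order subthreshold expression, I ≈ I0 · exp(−Vth / (n · kT/q)) at zero gate bias. This model is an assumption brought in for illustration, not one given in the talk; the prefactor `i0_amps` and slope factor `n` are hypothetical, and the sketch ignores the fact that the threshold voltage itself shifts with temperature.

```python
import math

BOLTZMANN = 1.380649e-23          # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def subthreshold_leakage(temp_kelvin, vth_volts, i0_amps=1e-6, n=1.5):
    """First-order subthreshold current at Vgs = 0 (illustrative only)."""
    thermal_voltage = BOLTZMANN * temp_kelvin / ELECTRON_CHARGE  # kT/q
    return i0_amps * math.exp(-vth_volts / (n * thermal_voltage))


# Hypothetical comparison points.
cool = subthreshold_leakage(300.0, 0.30)
hot = subthreshold_leakage(360.0, 0.30)
low_vth = subthreshold_leakage(300.0, 0.20)

print(f"hot/cool = {hot / cool:.1f}x, low-Vth/nominal = {low_vth / cool:.1f}x")
```

Under these assumptions both a higher temperature and a lower threshold voltage raise leakage, matching the trend the slide describes.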

Modeling Hierarchy and Tool Flow

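Analysis Abstraction Levels • Abstraction levels, from highest to lowest: Application, Behavioral, Architectural (RTL), Logic (Gate), Transistor (Circuit) • At the highest level (Application): analysis capacity is Most, analysis accuracy is Worst, analysis speed is Fastest, analysis resources are Least, energy savings are Most • At the lowest level (Transistor/Circuit): analysis capacity is Least, analysis accuracy is Best, analysis speed is Slowest, analysis resources are Most, energy savings are Least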

Power/Performance abstractions • Low-level: • Hspice • PowerMill • Medium-Level: • RTL Models • Architecture-level: • PennState SimplePower • Intel Tempest • Princeton Wattch • IBM PowerTimer • Umich/Colorado PowerAnalyzer

Low-level models: Hspice • Extracted netlists from circuit/layout descriptions • Diffusion, gate, and wiring capacitances are modeled • Analog simulation performed • Detailed device models used • Large systems of equations are solved • Can estimate dynamic and leakage power dissipation within a few percent • Slow, only practical for 10-100K transistors • PowerMill (Synopsys) is similar but about 10x faster

Medium-level models: RTL • Logic simulation obtains switching events for every signal • Structural VHDL or Verilog with zero or unit-delay timing models • Capacitance estimates performed • Device Capacitance • Gate sizing estimates performed, similar to synthesis • Wiring Capacitance • Wire load estimates performed, similar to placement and routing • Switching event and capacitance estimates provide dynamic power estimates

Architecture level models • Bottom-up Approach: • Estimate “CV²f” via analytical models • Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode) • Top-Down Approach: • Estimate “CV²f” via empirical measurements • Tools: PowerTimer, AccuPower, Most Industrial Tools • Estimate “A” via statistics from architectural-performance simulators • Power ~ ½ CV²Af

Analytical Models: Capacitance • Requires modeling wire length and estimating transistor sizes • Related to RC Delay analysis for speed along critical path • But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one.
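The contrast between delay analysis and power analysis can be sketched in a few lines. The wire lengths and the per-micron capacitance below are hypothetical values, and the uniform capacitance-per-length model is a simplifying assumption.

```python
def longest_wire_capacitance(wire_lengths_um, cap_ff_per_um):
    """What a delay-oriented analysis cares about: the single longest wire."""
    return max(wire_lengths_um) * cap_ff_per_um


def total_wire_capacitance(wire_lengths_um, cap_ff_per_um):
    """What a power analysis needs: every wire that can switch."""
    return sum(wire_lengths_um) * cap_ff_per_um


# Hypothetical wires in one block (micrometers) and capacitance per micrometer (fF).
lengths = [100, 40, 25, 10]
longest_ff = longest_wire_capacitance(lengths, 0.2)
total_ff = total_wire_capacitance(lengths, 0.2)

print(f"longest wire: {longest_ff:.1f} fF, all wires: {total_ff:.1f} fF")
```

An error in any short wire is invisible to the delay estimate but still contributes to the power estimate, which is why the summed figure is harder to get right.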

Register File: Capacitance Analysis • [Figure: register file array schematic. Labeled components: decoders, wordlines, bitlines, bit pre-charge, cells, cell access transistors (N1), sense amps. Labeled dimensions: number of entries (wordlines), data width of entries (bitlines), number of ports]

Register File Model: Accuracy • Validated against a register file schematic used in internal Intel design • Compared capacitance values with estimates from a layout-level Intel tool • Interconnect capacitance had largest errors • Model neglects poly connections • Differences in wire lengths -- difficult to tell wire distances of schematic nodes (Numbers in Percent)

Different Circuit Design Styles • RTL and architectural-level power estimation requires the tool/user to make assumptions about circuit design style • Static vs. Dynamic logic • Single vs. Double-ended bitlines in register files/caches • Sense Amp designs • Transistor and buffer sizings • Generic solutions are difficult because many styles are popular • Within individual companies, circuit design styles may be fixed

Clock Gating: What, why, when? • Dynamic Power is dissipated on clock transitions • Gating off clock lines when they are unneeded reduces activity factor • But putting extra gate delays into clock lines increases clock skew • End result: • Clock gating complicates design analysis but saves power • [Figure: Clock and Gate signals producing a Gated Clock]

Wattch: An Overview • Overview of Features • Parameterized models for different CPU units • Can vary size or design style as needed • Abstract signal transition models for speed • Can select different conditional clocking and input transition models as needed • Based on SimpleScalar (has been ported to many simulators) • Modular: Can add new models for new units studied • Wattch’s Design Goals • Flexibility • Planning-stage info • Speed • Modularity • Reasonable accuracy

Unit Modeling • Modeling Capacitance • Models depend on structure, bitwidth, design style, etc. • E.g., may model capacitance of a register file with bitwidth & number of ports as input parameters • Modeling Activity Factor • Use cycle-level simulator to determine number and type of accesses • Reads, writes, how many ports • Abstract model of bitline activity • [Figure: parameterized register file power model. Inputs: number of entries, data width of entries, # read ports, # write ports, bitline activity, number of active ports. Output: power estimate]

One Cycle in Wattch • On each cycle: • determine which units are accessed • model execution time issues • model per-unit energy/power based on which units used and how many ports.
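The per-cycle accounting described above can be sketched as follows. This is not Wattch's code: the unit table, the energy values, the linear scaling with ports used, and the 10% idle floor are all assumptions made for illustration.

```python
# Hypothetical unit table: name -> (energy per cycle in nJ with all ports active, port count).
UNITS = {
    "regfile": (2.0, 4),
    "dcache": (3.0, 2),
    "alu": (1.0, 1),
}


def cycle_energy_nj(ports_used, idle_fraction=0.1):
    """Energy for one cycle, given how many ports of each unit were accessed.

    Assumed policy: an accessed unit scales linearly with the number of
    ports used; an idle unit still burns idle_fraction of its maximum.
    """
    total = 0.0
    for name, (max_energy, num_ports) in UNITS.items():
        used = min(ports_used.get(name, 0), num_ports)
        if used == 0:
            total += idle_fraction * max_energy
        else:
            total += max_energy * used / num_ports
    return total


def run_energy_nj(trace):
    """Sum per-cycle energy over a trace of per-cycle access dictionaries."""
    return sum(cycle_energy_nj(cycle) for cycle in trace)


# Hypothetical three-cycle access trace from a performance simulator.
trace = [{"regfile": 2, "alu": 1}, {"dcache": 1}, {}]
print(f"total energy: {run_energy_nj(trace):.2f} nJ")
```

In a real flow the access trace would come from the cycle-level simulator, and the per-unit energies from the capacitance models.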

Units Modeled by Wattch • Array Structures • Caches, Reg Files, Map/Bpred tables • Content-Addressable Memories (CAMs) • TLBs, Issue Queue, Reorder Buffer • Complex combinational blocks • ALUs, Dependency Check • Clocking network • Global Clock Drivers, Local Buffers

PowerTimer • IBM Tool First Developed During Summer of 2000 • Continued Development: 2001 => Today • Methodology Applied to Research and Product Power-Performance Simulators within IBM • Currently in Beta-Release • Working towards Full Academic Release

PowerTimer: Empirical Power Pre-silicon, POWER4-like superscalar design

Processor Power Density Pre-silicon, POWER4-like superscalar design Originally presented at PACS2002

PowerTimer • [Figure: tool flow. Labeled elements: Program Executable or Trace, Architectural Performance Simulator, CPI, AF/SF Data, Circuit Power Data (Macros), Tech Parms, uArch Parms, Compute Sub-Unit Power with SubUnit Power = f(SF, uArch, Tech), Power]

PowerTimer: Energy Models • Energy models for uArch structures formed by summation of circuit-level macro data • [Figure: sub-units (uArch-level structures) built from macros. Macro1: Power = C1*SF + HoldPower; Macro2: Power = C2*SF + HoldPower; …; MacroN: Power = Cn*SF + HoldPower. SF data feeds the energy models, which produce the power estimate]

Empirical Estimates with CPAM • Estimate power under “Input Hold” and “Input Switching” Modes • Input Hold: All Macro Inputs (Except Clocks) Held • Can also collect data for Clock Gate Signals • Input Switching: Apply Random Switching Patterns with 50% Switching on Input Pins • [Figure: macro and its inputs, measured at 0% switching (hold power) and at 50% switching]

Example Unit • Made up of 5 macros

PowerTimer: Models f(SF) • Assumption: Power linearly dependent on Switching Factor • This separates Clock Power and Switching Power • At 0% SF, Power = Clock Power (significant without clock gating) • [Figure: power versus SF, with Clock Power and Switching Power components]

Key Activity Data • SF => Moves along the Switching Power Curve • Estimated on a per-unit basis from RTL Analysis • AF => Moves along the Clock Power Curve • Extracted from Microarchitectural Statistics (Turandot) • [Figure: power curves annotated with changes in SF and changes in AF]
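Given the linearity assumption, a macro's model Power = C*SF + HoldPower can be fitted from the two measurement points described earlier (0% and 50% input switching), and a sub-unit estimate is the sum over its macros. The sketch below is an illustration under that reading; the class name, field names, and milliwatt figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Macro:
    hold_power_mw: float    # measured with inputs held (0% switching)
    power_at_50_mw: float   # measured with 50% random input switching

    def slope(self):
        """The coefficient C in Power = C*SF + HoldPower (mW per unit SF)."""
        return (self.power_at_50_mw - self.hold_power_mw) / 0.5

    def power(self, sf):
        """Linear interpolation/extrapolation in switching factor sf (0..1)."""
        return self.slope() * sf + self.hold_power_mw


def subunit_power(macros, sf):
    """Sub-unit power as the sum of its macros' linear models."""
    return sum(m.power(sf) for m in macros)


# Hypothetical sub-unit made of two macros, evaluated at a 10% switching factor.
unit = [Macro(10.0, 20.0), Macro(5.0, 8.0)]
print(f"power at SF=0.10: {subunit_power(unit, 0.10):.1f} mW")
print(f"hold power (SF=0): {subunit_power(unit, 0.0):.1f} mW")
```

At SF = 0 the estimate reduces to the summed hold power, which is the clock-dominated component the slide refers to.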

Microarchitectural Statistics • Stats are very similar to tracking used in Wattch, etc • Differences: • Clock Gating Modes (3 modes) • Customized Scaling Based on Circuit Style (4 styles) • Clock Gating Modes: • P_constrained = P_unconstrained (not clock-gateable) • P_constrained_1 = AF * (Pclock + Plogic) (common) • P_constrained_2 = AF * Pclock + Plogic (rare) • P_constrained_3 = Pclock + AF * Plogic (very rare) • Scaling Based on Circuit Styles • AF_1 = #valid (Latch-and-Mux, No Stall Gating) • AF_2 = #valid - #stalls (Latch-and-Mux, With Stall Gating) • AF_3 = #writes (Arrays that only gate updates) • AF_4 = #writes + #reads (Arrays, RAM Macros)
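The gating modes and activity-factor definitions listed above translate directly into code. In this sketch the mode and style numbering follows the slide; treating mode 0 as "not clock-gateable" and normalizing the event counts by the cycle count to obtain a fraction are assumptions, as are the function names.

```python
def gated_power(mode, af, p_clock, p_logic):
    """Constrained power for one sub-unit under a given clock gating mode."""
    if mode == 0:   # not clock-gateable: P_constrained = P_unconstrained
        return p_clock + p_logic
    if mode == 1:   # common: both clock and logic power scale with AF
        return af * (p_clock + p_logic)
    if mode == 2:   # rare: only clock power scales with AF
        return af * p_clock + p_logic
    if mode == 3:   # very rare: only logic power scales with AF
        return p_clock + af * p_logic
    raise ValueError(f"unknown gating mode: {mode}")


def activity_factor(style, cycles, valid=0, stalls=0, reads=0, writes=0):
    """Activity factor per circuit style, normalized per cycle (assumption)."""
    if style == 1:    # latch-and-mux, no stall gating
        events = valid
    elif style == 2:  # latch-and-mux, with stall gating
        events = valid - stalls
    elif style == 3:  # arrays that only gate updates
        events = writes
    elif style == 4:  # arrays, RAM macros
        events = writes + reads
    else:
        raise ValueError(f"unknown circuit style: {style}")
    return events / cycles


# Hypothetical statistics: 100 cycles, 60 valid, 10 of them stalled.
af = activity_factor(2, cycles=100, valid=60, stalls=10)
print(f"AF = {af:.2f}, mode-1 power = {gated_power(1, af, 4.0, 6.0):.1f}")
```

Three gating modes combined with four circuit styles give the twelve combinations mentioned later in the deck.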

Clock Gating: Valid-Bit Gating • Latch-Based Structures: Execute Pipelines, Issue Queues • [Figure: clock gated by valid bits (V) across a row of latches]

Clock Gating Modes • P_constrained_1 = AF * (Pclock + Plogic) • [Figure labels: clock, valid, Plogic, Pclock] • P_constrained_2 = AF * Pclock + Plogic • [Figure labels: clock, Selection Logic, valid, Pclock, Plogic]

Valid-bit Gating, Stalls? • Option 1: Stalls cannot be gated • Option 2: Stalls can be gated • [Figures, one per option, with signals: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]

Scaling: Array Structures • Option 1: Reads and Writes Eligible to Gate for Power • [Figure labels: Write Bitline, Read Bitline, read_wordline_active, read_gate, write_wordline_active, write_gate, Cell, read_data, write_data]

Scaling: Array Structures • Option 2: Only Writes Eligible to Gate for Power • [Figure labels: Write Bitline, read_entry_0, read_entry_1, read_entry_2, read_entry_n, read_data, write_wordline_active, write_gate, Cell, write_data]

12 Clock Gating Modes

PowerTimer Observations • PowerTimer works well for POWER4-like estimates and derivatives • Scales base microarchitecture quite well • E.g. optimal power-performance pipelining study • Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise) • Chip bit-level SFs are quite low (5-15%) • Most (60-70%) power is dissipated while maintaining state (arrays, latches, clocks) • Much state is not available in early-stage timers

Comparing models: Flexibility • Flexibility necessary for certain studies • Resource tradeoff analysis • Modeling different architectures • Purely analytical tools provide fully-parameterizable power models • Within this methodology, circuit design styles could also be studied • PowerTimer scales power models in a user-defined manner for individual sub-units • Constrained to structures and circuit-styles currently in the library • Perhaps Mixed Mode tools could be very useful

Comparing models: Accuracy • PowerTimer -- Based on validation of individual pieces • Extensive validation of the performance model (AFs) • Power estimates from circuits are accurate • Circuit designers must vouch for clock gating scenarios • Certain assumptions will limit accuracy or require more in-depth analysis • Analytical Tools • Inherent Issues • Analytical estimates cannot be as accurate as SPICE analysis (“C” estimates, CV² approximation) • Practical Issues • Without industrial data, must estimate transistor sizing, bits per structure, circuit choices

Comparing models: Speed • Performance simulation is slow enough! • Post-Processing vs. Run-Time Estimates • Wattch’s per-cycle power estimates: roughly 30% overhead • Post-processing (per-program power estimates) would be much faster (minimal overhead) • PowerTimer allows both no-overhead post-processing and run-time analysis for certain studies (di/dt, thermal) • Some clock gating modes may require run-time analysis • Third Option: Bit Vector Dumps • Flexible Post-Processing → Huge Output Files

Power modeling summary • Wattch provides excellent relative accuracy • Underestimates full chip power (some units not modeled, etc) • PowerTimer models based on circuit-level power analysis • Inaccuracy is introduced in SF/AF and scaling assumptions

Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

Existing Work • Research Ideas • DEETM – Huang and Torrellas MICRO2000 • DTM – Brooks and Martonosi HPCA2001 • Control-Theoretic DTM – Skadron, Abdelzaher, Stan HPCA2002 • Thermal Scheduling – Cai, Lim, Daasch WCED2002 • Commercial Products • PowerPC G3 Microprocessor • Pentium III • Pentium 4

Overview • Hard to optimize power-performance at design time for all cases • Forces conservative choices for issues like cooling, current delivery, resource sizes • Want to explore dynamic power optimizations for run-time power management • Dynamic Voltage/Frequency Scaling [Burd, 2000] • Dynamic Hardware Resizing [Albonesi, 1999] • Fetch Throttling [Sanchez, 1997] • Global Clock Gating [Gunther, 2001] • Speculation Control [Manne, 1998] • Dynamic Thermal Management [Brooks, 2001][Huang, 2000]

Important to optimize P & T early • [Figure: plot with curves labeled 12FO4, 14FO4, 18FO4, and 23FO4 against a Maximum Power Budget line]

Dynamic Thermal Management • Goal: • Provide dynamic techniques to cool chip when needed • Exploit natural variations due to different applications, phase behavior, … • Allow designers to target average, rather than worst-case behavior • Design Decisions: • Mechanism & policy for triggering response? • What should response be? • How to select DTM trigger levels?

Power consumption impacts cost • System costs associated with power dissipation: • Thermal control cost • Heatsinks, fans • Power delivery • Power supply • Decoupling caps… • From: Gunther, et al. “Managing the Impact of Increasing Microprocessor Power Consumption,” Intel Technology Journal, Q1, 2001

Average and Worst Case Power • System costs are constrained by worst case power dissipation • Average case power dissipation can often be much lower • Aggressive Clock Gating • Application variations • Underutilized resources • Not enough ILP • Floating point units during integer code execution • Currently about a 30% difference • Likely to further diverge…

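Dynamic Thermal Management • [Figure: temperature versus time. Marked levels: Designed for Cooling Capacity w/out DTM, Designed for Cooling Capacity w/ DTM, DTM Trigger Level. The gap between the two cooling capacities is labeled System Cost Savings. Curves: DTM Disabled, DTM/Response Engaged]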
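The idea in the figure, a response that engages whenever temperature reaches the trigger level, can be sketched with a toy simulation. Everything here is an assumption for illustration: the first-order thermal model, its constants (`ambient_c`, `r_thermal`, `alpha`), the halved-power response, and the constant 60 W workload are hypothetical and are not taken from the talk.

```python
def simulate_dtm(power_trace_w, trigger_c, ambient_c=45.0,
                 r_thermal=0.8, alpha=0.1, throttle=0.5):
    """Toy DTM loop: throttle power whenever temperature is at or above trigger.

    Temperature relaxes toward ambient + r_thermal * power by a fraction
    alpha each step (a crude first-order thermal model).
    """
    temp = ambient_c
    temps, engaged = [], []
    for power in power_trace_w:
        active = temp >= trigger_c
        if active:
            power *= throttle           # assumed response: halve dissipated power
        steady_state = ambient_c + r_thermal * power
        temp += alpha * (steady_state - temp)
        temps.append(temp)
        engaged.append(active)
    return temps, engaged


# Hypothetical worst-case workload: constant 60 W for 200 steps.
workload = [60.0] * 200
temps_off, engaged_off = simulate_dtm(workload, trigger_c=float("inf"))
temps_on, engaged_on = simulate_dtm(workload, trigger_c=80.0)

print(f"peak without DTM: {max(temps_off):.1f} C")
print(f"peak with DTM:    {max(temps_on):.1f} C")
```

With the response enabled the peak stays close to the trigger level, so the cooling solution could be sized for that level rather than for the unthrottled worst case, at the cost of reduced performance while the response is engaged.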
