Lower Power Design Guide
June 7, 1998
Prof. Jun-Dong Cho, Sungkyunkwan University
http://vlsicad.skku.ac.kr

Contents
1. Introduction: Trends for High-Level Lower Power Design
2. Power Management: Clock/Cache/Memory Management
3. Architecture Level Design: Architecture Trade-offs, Transformation
Systems with limited room for heat sinks
Lowering power with fixed performance: DSPs in modems and cellular phones
Reliability: increasing power → increasing electromigration; 40-year reliability guarantee (product life cycle in the telecommunications industry)
Adding fans to remove heat causes reliability to plummet.
Higher power leads to higher packaging costs: a 2-watt package can cost four times as much as a 1-watt package
Myriad Constraints: timing, power, testability, area, packaging, time-to-market.
Ad-hoc design: lacks a systematic process leading to universal applicability.
Motivation
MPU1: low-end microprocessor for embedded use
MPU2: high-end CPU with large amount of cache
ASSP1: MPEG2 decoder
ASSP2: ATM switch
InfoPad (University of California, Berkeley): weight < 1 pound,
0.5 W (reflective color display) + 0.5 W (computation, communication, I/O support) = 1 W (Alpha chip: 25 W; StrongARM: 215 MHz at 2.0 V, 0.3 W),
runtime 50 hours; target: 100 MIPS/mW.
Deep-submicron (0.35-0.18 µm) with low voltage for a portable full-motion video terminal; 0.5 µm: 40 AA NiMH; 1 µm: 1 AA NiMH
System-On-A-Chip to reduce external Interconnection Capacitances
Power Management: shut down idle units
Power optimization techniques in the software, architecture, logic/circuit, and layout phases reduce operations, frequency, capacitance, and switching activity while maintaining the same throughput.
Current Design Issues in Lower Power Problem
Short-circuit power (10-30%): short-circuit current flows during transitions
Switching (or capacitive) power (70-90%): charging/discharging of capacitive loads during transitions
Power Component
LCD: 54.1%, HDD 16.8%, CPU 10.7%, VGA/VRAM 9.6%, SysLogic 4.5%, DRAM 1.1%, Others: 3.2%
Display mode: CPU is in sleep-mode (55 minutes), LCD (VRAM + LCDC)
CPU mode: display is idle (5 minutes); looking up, data retrieval
Handwriting recognition: biggest power consumer (memory, system bus active)
Power Consumption in Multimedia Systems
DPM (Dynamic Power Management): stops the clock switching, produced by the clock generators, for a specific unit. The clock regenerators produce two clocks, C1 and C2. Logic: 0.3%; power savings: 10-20%.
SPM (Static Power Management): saves power dissipation in the steady state. When the system (or subsystem) remains idle for a significant period of time, the entire chip (or subsystem) is shut down.
Identify power hungry modules and look for opportunities to reduce power
If f is increased, one has to increase the transistor size or Vdd.
Power Management
Caches are powered down when idle.
Block power management (sleep, standby mode) scheme by enabling the clock
Clock power management scheme by adding a clock generation block
Power Management
Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity
Temporal locality: average lifetimes of variables (less temporary storage; future accesses are likely to refer to data referenced in the recent past).
Precompute physical capacitance of interconnect and switching activity (number of bus accesses)
Architecture-driven voltage scaling: choose a more parallel architecture
Supply voltage scaling: lowering Vdd reduces energy but increases delay
System-Level Solutions
Up to 40% of the on-chip power is dissipated on the buses!
Computational work varies with time. An approach to reduce the energy consumption of such systems beyond shut down involves the dynamic adjustment of supply voltage based on computational workload.
The basic idea is to lower the supply voltage when the workload is low, rather than running at a fixed supply and idling for some fraction of the time.
The supply voltage and clock rate are increased during high-workload periods.
Variable Supply Voltage Block Diagram
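The benefit of adjusting the supply to the workload can be illustrated with a minimal sketch. It assumes a first-order model that is not stated in the slides: energy per cycle scales with the square of the supply voltage, and the clock rate scales linearly with the supply voltage. The function names and the `v_min` floor are hypothetical.

```python
def normalized_energy_fixed(workload: float) -> float:
    """Fixed supply: run the needed fraction of cycles at full voltage, then shut down.

    workload is the fraction of peak throughput required (0..1).
    Energy is normalized to a full-load period at full voltage.
    """
    full_voltage = 1.0
    return workload * full_voltage ** 2


def normalized_energy_variable(workload: float, v_min: float = 0.3) -> float:
    """Variable supply: slow the clock to match the workload and lower the voltage.

    Assumption (illustrative only): clock rate is proportional to supply voltage,
    so the normalized voltage equals the workload, bounded below by v_min.
    """
    voltage = max(workload, v_min)
    return workload * voltage ** 2


if __name__ == "__main__":
    for w in (1.0, 0.5, 0.1):
        print(w, normalized_energy_fixed(w), normalized_energy_variable(w))
```

Under these assumptions a half-loaded period costs 0.5 units with shut-down alone but 0.125 units with a scaled supply, which is why voltage scaling goes beyond shut-down.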
The basic idea of averaging: two samples are buffered and their workloads are averaged.
The averaged workload is then used as the effective workload to drive the power supply.
Using a ping-pong buffering scheme, data samples $I_{n+2}$ and $I_{n+3}$ are buffered while $I_n$ and $I_{n+1}$ are processed.
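The averaging step can be sketched as follows. This is an illustration only: the function name and the sample workload values are hypothetical, and the sketch models just the arithmetic of averaging pairs of buffered samples, not the buffering hardware.

```python
from typing import List


def averaged_workloads(workloads: List[float], window: int = 2) -> List[float]:
    """Average the workloads of consecutive, non-overlapping groups of samples.

    Each group stands for the samples held in one half of the ping-pong buffer;
    the group's mean is the effective workload that drives the power supply.
    """
    averaged = []
    for start in range(0, len(workloads) - window + 1, window):
        group = workloads[start:start + window]
        averaged.append(sum(group) / window)
    return averaged


if __name__ == "__main__":
    raw = [0.2, 1.0, 0.4, 0.6]          # hypothetical per-sample workloads
    effective = averaged_workloads(raw)
    print(effective)
    # Averaging lowers the peak workload the supply has to track.
    print(max(raw), max(effective))
```

Because a heavy sample is paired with a lighter one, the peak effective workload is lower than the peak raw workload, so the supply can stay at a lower voltage.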
At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.
To first order, $P = C \cdot \frac{f}{2} \cdot V_{dd}^2$
$(1.15\,C)\,(0.58\,V)^2\,(f)$
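As a quick arithmetic check, the expression above can be compared with a reference design. The sketch assumes, as an interpretation not spelled out in the slides, that the reference has capacitance C, supply V, and frequency f, and that power is proportional to C times the square of V times f; the function name is hypothetical.

```python
def relative_power(cap_scale: float, vdd_scale: float, freq_scale: float) -> float:
    """Power relative to the reference design, using P proportional to C * V^2 * f."""
    return cap_scale * vdd_scale ** 2 * freq_scale


if __name__ == "__main__":
    # 15% more capacitance, supply scaled to 0.58 of the reference, same frequency.
    ratio = relative_power(1.15, 0.58, 1.0)
    print(f"{ratio:.3f}")  # prints 0.387
```

Under this reading, the scaled design consumes roughly 39% of the reference power despite the added capacitance.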
The compiler takes responsibility for finding the operations that can be issued in parallel and for creating a single very long instruction containing them. VLIW instruction decoding is easier than superscalar decoding because of the fixed format and the lack of instruction dependencies.
However, the fixed format may limit how operations can be combined.
Intel P6: CISC instructions are combined on chip to provide a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.
As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.
VLIW architectures, being simpler than N-issue machines, could be considered promising architectures for achieving high speed and low power simultaneously.
6% more logic
Example: ABCS protocol
(a) Non-local implementation from Hyper (b) Local implementation from Hyper-LP
The majority of execution cycles in signal processing programs are used for loop execution.
40% reduction in power with a 2% area increase.
Architecture of Control Logic in Microprocessor
State Transition Diagram
Binary Code Mapping
Hardware Implementation
State/Instruction Encoding
If a transition e has a higher switching probability (e.g., S0 = branch, S1 = compare), then encode S0 and S1 in Gray-code style.
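The effect of Gray-code style encoding on frequent transitions can be sketched as below. The four states, the transition probabilities, and the function names are hypothetical and chosen only for illustration; the cost measure is the expected number of state-register bit flips per transition.

```python
from typing import Dict, Tuple


def gray_code(index: int) -> int:
    """Return the reflected binary Gray code of index."""
    return index ^ (index >> 1)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def expected_bit_flips(encoding: Dict[str, int],
                       transition_prob: Dict[Tuple[str, str], float]) -> float:
    """Expected number of bit flips per state transition."""
    return sum(prob * hamming_distance(encoding[src], encoding[dst])
               for (src, dst), prob in transition_prob.items())


# Hypothetical transition probabilities between four states.
probs = {("S0", "S1"): 0.5, ("S1", "S2"): 0.3, ("S2", "S3"): 0.2}
states = ["S0", "S1", "S2", "S3"]

binary_encoding = {name: i for i, name in enumerate(states)}
gray_encoding = {name: gray_code(i) for i, name in enumerate(states)}

if __name__ == "__main__":
    print(expected_bit_flips(binary_encoding, probs))  # S1 -> S2 flips two bits
    print(expected_bit_flips(gray_encoding, probs))    # every listed transition flips one bit
```

With plain binary codes the S1 to S2 transition (01 to 10) toggles two bits, while the Gray-coded assignment keeps every listed transition at a single bit flip.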
Optimum voltage for low-power is around 1.5V
further by the logic synthesis process before mapping to layout.
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
Where $N_{trans}$ is the number of transitions, $n_{states}$ is the number of states, $B_f$ is the bus factor, and $C_{lc}$ is the capacitance switched in any local controller in one sample period. $B_f$ is the ratio of the number of bus accesses to the number of busses.
Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,
Determine operating clock frequency
Resizing non-critical-path transistors (In-Place Optimization)
Critical path in synchronous sequential logic
Critical Path
Flip-flop insertion to minimize hazard activity: moving a flip-flop in a circuit
A fourth-order parallel-form
(a) Local assignment (2 global transfers)
(b) Non-local assignment (20 global transfers)
By sampling a steady state signal at a register input,
no more glitches are propagated through the next
Where $N_{R_i}$ is the number of resources of class $R_i$,
$d_{R_i}$ is the duration of a single operation,
$O_{R_i}$ is the number of operations.
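The formula these symbols belong to is not reproduced in the text. As a hedged illustration only, one common way to combine them is a lower bound on schedule length: a resource class with O operations of duration d on N units needs at least ceil(O * d / N) control steps. The sketch below assumes that bound; the function name and the example numbers are hypothetical.

```python
import math
from typing import Dict, Tuple

# Each resource class maps to (N, d, O):
# N = number of resources, d = duration of one operation, O = number of operations.
ResourceClasses = Dict[str, Tuple[int, int, int]]


def min_control_steps(classes: ResourceClasses) -> int:
    """Assumed lower bound on schedule length: the most heavily loaded class dominates."""
    return max(math.ceil(ops * duration / count)
               for count, duration, ops in classes.values())


if __name__ == "__main__":
    example = {
        "multiplier": (2, 2, 6),  # 6 two-step multiplications on 2 multipliers
        "alu": (2, 1, 5),         # 5 one-step ALU operations on 2 ALUs
    }
    print(min_control_steps(example))  # prints 6
```

Dependencies between operations can only lengthen the schedule, so a bound of this kind is useful for a quick feasibility check before scheduling.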
HAL Example
ASAP Scheduling
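ASAP (as soon as possible) scheduling places each operation in the earliest control step allowed by its predecessors. A minimal sketch follows, assuming unit-delay operations and unlimited resources; the small data-flow graph is hypothetical and is not the HAL example from the slide.

```python
from typing import Dict, List


def asap_schedule(preds: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each operation to its earliest control step (steps start at 1).

    preds maps an operation to the operations whose results it consumes.
    Assumes an acyclic graph and unit-delay operations.
    """
    step: Dict[str, int] = {}

    def visit(op: str) -> int:
        if op not in step:
            # One step after the latest predecessor; step 1 if there are none.
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step


if __name__ == "__main__":
    dfg = {"o1": [], "o2": [], "o3": ["o1", "o2"], "o4": ["o3"], "o5": []}
    print(asap_schedule(dfg))
```

Running the same procedure on the reversed graph from a fixed deadline gives the ALAP schedule, and the difference between the two is each operation's mobility.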
Probability of scheduling operations into control steps after operation o3 is scheduled to step s2
Force-Directed Scheduling Example
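The probabilities used in force-directed scheduling can be sketched as follows. The sketch assumes the usual uniform model, in which an operation is equally likely to land in any step of its time frame (ASAP step to ALAP step), and that all operations use the same resource type. The time frames below are hypothetical, not the ones in the figure.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def distribution_graph(frames: Dict[str, Tuple[int, int]],
                       n_steps: int) -> List[Fraction]:
    """Sum, per control step, the probabilities of all operations landing there.

    frames maps an operation to its (earliest, latest) control step, 1-based.
    """
    dg = [Fraction(0)] * n_steps
    for first, last in frames.values():
        prob = Fraction(1, last - first + 1)  # uniform over the time frame
        for step in range(first, last + 1):
            dg[step - 1] += prob
    return dg


frames = {"o1": (1, 1), "o2": (1, 2), "o3": (1, 3)}
before = distribution_graph(frames, 3)

# Fixing o3 to step 2 collapses its time frame to a single step.
frames_fixed = dict(frames, o3=(2, 2))
after = distribution_graph(frames_fixed, 3)

if __name__ == "__main__":
    print(before)
    print(after)
```

In this example, fixing o3 to step 2 balances the expected resource usage of steps 1 and 2; force-directed scheduling chooses assignments that flatten the distribution in this way.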
The scheduled DFG logic
DFG with mobility labeling (inside <>)
Ready operation list / resource constraint
List Scheduling
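List scheduling under a resource constraint can be sketched as below. The sketch assumes unit-delay operations, a single resource class with `n_units` units per control step, and mobility as the priority (lower mobility first). The graph and mobility values are hypothetical.

```python
from typing import Dict, List


def list_schedule(preds: Dict[str, List[str]],
                  mobility: Dict[str, int],
                  n_units: int) -> Dict[str, int]:
    """Schedule operations step by step from a ready list.

    In each control step, the ready operations (all predecessors finished in an
    earlier step) are sorted by mobility and at most n_units of them are issued.
    Assumes an acyclic graph and n_units >= 1.
    """
    scheduled: Dict[str, int] = {}
    step = 0
    while len(scheduled) < len(preds):
        step += 1
        ready = [op for op in preds
                 if op not in scheduled
                 and all(scheduled.get(p, step) < step for p in preds[op])]
        # Lowest mobility first; the name breaks ties deterministically.
        ready.sort(key=lambda op: (mobility[op], op))
        for op in ready[:n_units]:
            scheduled[op] = step
    return scheduled


if __name__ == "__main__":
    dfg = {"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["c", "d"]}
    mob = {"a": 0, "b": 0, "c": 1, "d": 0, "e": 0}
    print(list_schedule(dfg, mob, n_units=2))
```

With two units, operation c is deferred to step 2 because its mobility lets it wait, while the critical operations a and b are issued first.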
Partial schedule of five nodes
Priority list
Static-List Scheduling
The final schedule
DFG2 after redundant operation insertion
DFG Restructuring
Multiplexer-oriented datapath
Bus-oriented datapath
Datapath interconnections
Scheduled DFG
Lifetime intervals of variable
Clique-partitioning solution
Register Allocation Using Clique Partitioning
Sorted variable lifetime intervals
Five-register allocation result
Register Allocation: Left-Edge Algorithm
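The left-edge algorithm can be sketched as follows: sort the variable lifetimes by start time, then fill one register at a time with the earliest-starting variables that do not overlap. The sketch treats a lifetime as the half-open interval [start, end), an assumption made here so that a variable may reuse a register in the step where the previous one dies; the variable names and intervals are hypothetical.

```python
from typing import Dict, List, Tuple


def left_edge(lifetimes: Dict[str, Tuple[int, int]]) -> List[List[str]]:
    """Assign variables to registers; returns one list of variable names per register."""
    remaining = sorted(lifetimes.items(), key=lambda item: (item[1][0], item[1][1]))
    registers: List[List[str]] = []
    while remaining:
        current: List[str] = []
        leftover = []
        last_end = None
        for name, (start, end) in remaining:
            if last_end is None or start >= last_end:
                current.append(name)   # fits after the previous variable in this register
                last_end = end
            else:
                leftover.append((name, (start, end)))  # overlaps; try the next register
        registers.append(current)
        remaining = leftover
    return registers


if __name__ == "__main__":
    intervals = {"v1": (0, 3), "v2": (0, 1), "v3": (1, 4), "v4": (3, 5), "v5": (4, 6)}
    print(left_edge(intervals))
```

For interval lifetimes like these, the number of registers produced equals the maximum number of variables alive at the same time, which is why the left-edge algorithm is a common alternative to clique partitioning.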
the data path structure graph
the controller state machine graph
the interface graph (between data path control inputs and the controller output signals)