L4: Architectural Level Design

Prof. Jun-Dong Cho, Sungkyunkwan University

http://vlsicad.skku.ac.kr


System-Level Solutions

Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.

Temporal locality: shorter average variable lifetimes need less temporary storage, and data referenced in the recent past is likely to be accessed again soon.

Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses).

Architecture-driven voltage scaling: choose a more parallel architecture.

Supply voltage scaling: lowering Vdd reduces energy but increases delay.


Software Power Issues

Up to 40% of the on-chip power is dissipated on the buses!

  • System software: OS, BIOS, compilers

  • Software can affect energy consumption at various levels

  • Inter-instruction effects

    • The energy cost of an instruction varies depending on the previous instruction

    • For example, XOR BX,1; ADD AX,DX

    • Iest = (319.2 + 313.6)/2 = 316.4 mA, Iobs = 323.2 mA

    • The difference is defined as the circuit state overhead

    • The overhead needs to be specified as a function of pairs of instructions

  • Other inter-instruction effects arise from pipeline stalls and cache misses

    • Instruction reordering can improve the cache hit ratio
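The circuit state overhead in the example above can be worked through as a short calculation. This is a minimal Python sketch: the function name and the dictionary layout are illustrative assumptions, and the currents are the ones quoted on the slide.

```python
# Base currents (mA) of each instruction measured in isolation,
# as quoted in the example above.
BASE_CURRENT_MA = {"XOR BX,1": 319.2, "ADD AX,DX": 313.6}

def circuit_state_overhead(pair, observed_ma):
    """Return (estimated current, overhead) in mA for an alternating pair."""
    estimated = sum(BASE_CURRENT_MA[name] for name in pair) / len(pair)
    return estimated, observed_ma - estimated

estimated, overhead = circuit_state_overhead(("XOR BX,1", "ADD AX,DX"), 323.2)
print(f"estimated = {estimated:.1f} mA, overhead = {overhead:.1f} mA")
```

The overhead is simply the gap between the observed current and the average of the two base currents, which is why it has to be tabulated per instruction pair.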


Software Power Optimization

  • Instruction packing

    • reduces cache misses, which carry a high power penalty

    • example: the Fujitsu DSP permits an ALU operation and a memory data transfer to be packed into one instruction

  • Instruction ordering

    • attempts to minimize the energy associated with the circuit state effect

    • reorders instructions to minimize the total power for a given table of pairwise overheads

  • Operand swapping

    • minimizes the switching activity associated with the operands

    • attempts to swap the operands to the ALU or FPU
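Instruction ordering can be illustrated as a search for the ordering with the lowest summed pairwise overhead. The sketch below is a brute-force Python illustration only: the overhead table and the instruction names are hypothetical, the overhead is assumed to be symmetric, and data dependencies between instructions, which a real scheduler must respect, are ignored.

```python
from itertools import permutations

# Hypothetical circuit-state overhead (arbitrary units) between instruction pairs.
OVERHEAD = {("ADD", "MUL"): 5, ("ADD", "LOAD"): 1, ("MUL", "LOAD"): 4}

def pair_cost(a, b):
    """Overhead of executing b right after a (assumed symmetric)."""
    if a == b:
        return 0
    return OVERHEAD.get((a, b), OVERHEAD.get((b, a)))

def sequence_cost(seq):
    """Total overhead of a sequence: sum over adjacent instruction pairs."""
    return sum(pair_cost(a, b) for a, b in zip(seq, seq[1:]))

def best_order(instructions):
    """Exhaustively pick the ordering with the lowest total overhead."""
    return min(permutations(instructions), key=sequence_cost)

order = best_order(["ADD", "MUL", "LOAD"])
print(order, sequence_cost(order))
```

Exhaustive search is only practical for a handful of instructions; it is used here to keep the idea visible.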


Software Power Optimization (continued)

  • Minimizing memory access costs

    • minimize the number of memory accesses required by an algorithm

    • example

  • Memory bank assignment

    • formulated as a graph partitioning problem

    • each group corresponds to a memory bank

    • the optimum code sequence can vary when dual loads are used

Before:

    FOR i := 1 TO N DO
      B[i] = f(A[i]);
    END_FOR;
    FOR i := 1 TO N DO
      C[i] = g(B[i]);
    END_FOR;

After:

    FOR i := 1 TO N DO
      B[i] = f(A[i]);
      C[i] = g(B[i]);
    END_FOR;

[Figure: access graph for a code fragment with nodes a, b, c, d, e, and the partitioned access graph with the nodes split between Bank A and Bank B.]
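Memory bank assignment as graph partitioning can be sketched as follows. The access graph below is hypothetical (its edges and weights are not taken from the slide): an edge weight counts how often two variables are needed together, and the sketch looks for the two-bank split that puts the most co-accessed pairs into different banks, so that they can be fetched with dual loads. Exhaustive search is used only because the example is tiny.

```python
from itertools import combinations

# Hypothetical access graph: weight = how often two variables are used together.
EDGES = {("a", "b"): 3, ("a", "c"): 1, ("b", "d"): 2, ("c", "e"): 4, ("d", "e"): 1}
NODES = sorted({v for edge in EDGES for v in edge})

def cut_weight(bank_a):
    """Total weight of edges whose endpoints lie in different banks."""
    return sum(w for (u, v), w in EDGES.items() if (u in bank_a) != (v in bank_a))

def best_partition():
    """Try every non-trivial subset as Bank A and keep the heaviest cut."""
    best = (set(), 0)
    for size in range(1, len(NODES)):
        for subset in combinations(NODES, size):
            weight = cut_weight(set(subset))
            if weight > best[1]:
                best = (set(subset), weight)
    return best

bank_a, weight = best_partition()
bank_b = set(NODES) - bank_a
print(sorted(bank_a), sorted(bank_b), weight)
```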


Power Management Mode

  • Support power management

    • easy control for applications and the OS

  • APM: Advanced Power Management

    • power states: Full On, APM Enabled, APM Standby, APM Suspend, Off

[Figure: APM system. APM-aware applications and APM-aware device drivers sit on top of the operating system and its APM driver (OS dependent). Below them, the BIOS and the APM BIOS (OS independent) control the APM BIOS controlled hardware and the add-in devices.]


Power Management Mode (continued)

  • APM state transitions

    • into APM Enabled: APM Enable, Enable Call

    • back to Full On: APM Disable, Disable Call

    • into APM Standby: Short Inactivity, Standby Call

    • into APM Suspend: Long Inactivity, Suspend Interrupt, Suspend Call (Hibernation is shown next to this state)

    • into Off: Off Switch, Off Call

    • out of Off: On Switch

[Figure: APM state diagram with the states Full On, APM Enabled, APM Standby, APM Suspend, and Off, a "Power Managed" label on the APM states, and two axes: power usage (increase) and device responsiveness (decrease).]
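The state transitions can be written down as a small table-driven state machine. This Python sketch is illustrative only: the state and event identifiers are made up for the example rather than taken from the APM specification, the choice of source state for each event is an assumption where the diagram is ambiguous, and resume events are omitted.

```python
# (current state, event) -> next state; source states are assumptions of this sketch.
TRANSITIONS = {
    ("FULL_ON", "enable_call"): "APM_ENABLED",
    ("APM_ENABLED", "disable_call"): "FULL_ON",
    ("APM_ENABLED", "short_inactivity"): "APM_STANDBY",
    ("APM_STANDBY", "long_inactivity"): "APM_SUSPEND",
    ("OFF", "on_switch"): "FULL_ON",
}

def next_state(state, event):
    """Return the state after an event; unknown events leave the state unchanged."""
    if event == "off_switch" and state != "OFF":
        return "OFF"  # the off switch is available from every powered state
    return TRANSITIONS.get((state, event), state)

def run(events, state="FULL_ON"):
    """Apply a list of events in order and return every state visited."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited

print(run(["enable_call", "short_inactivity", "long_inactivity", "off_switch", "on_switch"]))
```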


Power Management Mode (continued)

  • PowerPC 603

    • Doze: clocks running to the data cache, snooping logic, and time base/decrementer only

    • Nap: clocks running to the time base/decrementer only

    • Sleep: all clocks stopped, no external input clock

  • MIPS 4200

    • Reduced power: clocks at 1/4 of the bus clock frequency

  • Hitachi SH7032

    • Sleep: CPU clocks stopped, peripherals remain clocked

    • Standby: all clocks stopped, peripherals initialized


Power Optimization

  • Modeling and Technology

  • Circuit Design Level

  • Logic and Module Design Level

  • Architecture and System Design Level

  • Some Design Examples

    • ARM7TDMI


Some Design Examples

ARM7TDMI core

  • size: 1 mm² @ 0.25 µm

  • power:

    • [email protected] 5V

    • 143 MIPS/W

  • features

    • 32-bit addressing

    • 32x8 DSP multiplier

    • 32-bit register bank and ALU

    • 32-bit barrel shifter

  • Thumb instruction set

    • compressed 32-bit ARM instructions

    • high code density

Processor   System         Power (W)   MIPS/W
ARM7D       33 MHz, 5 V    0.165       185
ARM7TDMI    33 MHz, 5 V    0.181       143
PC403GA     40 MHz, 5 V    1           39
V810        25 MHz, 5 V    0.5         36
68349       25 MHz, 5 V    0.96        9
29200       16 MHz, 5 V    1.1         7
486DX       33 MHz, 5 V    4.5         6
i960SA      16 MHz, 5 V    1.25        4
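The MIPS/W column is throughput divided by power, so multiplying the two numeric columns recovers the approximate throughput of each processor. A small Python check using four of the rows above (the dictionary and function names are illustrative):

```python
# (power in W, MIPS per W) copied from the table above.
ROWS = {
    "ARM7D": (0.165, 185),
    "ARM7TDMI": (0.181, 143),
    "486DX": (4.5, 6),
    "i960SA": (1.25, 4),
}

def throughput_mips(power_w, mips_per_w):
    """Approximate throughput implied by a power and an efficiency figure."""
    return power_w * mips_per_w

for name, (power_w, mips_per_w) in ROWS.items():
    print(f"{name}: about {throughput_mips(power_w, mips_per_w):.1f} MIPS")
```

Read this way, the ARM cores and the 486DX deliver throughput of the same order, but the 486DX spends far more power to do so.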


Processor with Power Management

  • Clock power management

    • basic logical method

      • gated clocking

    • hardware method

      • external pin + control register bit

    • software method

      • specific instructions + control register bit


Avoiding Wasteful Computation

  • Preservation of data correlation

  • Distributed computing / locality of reference

  • Application-specific processing

  • Demand-driven operation

  • Transformation for memory size reduction

  • Consider two arrays, A and C, that are already available in memory.

  • When A is consumed, another array B is generated; when C is consumed, a scalar value D is produced.

  • Memory size can be reduced by executing the j loop before the i loop, so that C is consumed before B is generated and the same memory space can be used for both arrays.
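The effect of the loop order on memory size can be sketched with a simple liveness count. The loop bodies are not shown in this transcript, so the Python model below makes assumptions: the i loop consumes A and produces B, the j loop consumes C and produces the scalar D, every array holds n words, an array is freed as soon as its loop has finished, and B must be fully allocated while A is still being read.

```python
def peak_words(order, n):
    """Peak number of live words for a given loop order ("i" and "j")."""
    live = {"A": n, "C": n}          # A and C are already in memory
    peak = sum(live.values())
    for loop in order:
        if loop == "i":              # consumes A, produces array B
            live["B"] = n
            peak = max(peak, sum(live.values()))
            del live["A"]
        elif loop == "j":            # consumes C, produces scalar D
            live["D"] = 1
            peak = max(peak, sum(live.values()))
            del live["C"]
    return peak

n = 100
print("i then j:", peak_words(("i", "j"), n))
print("j then i:", peak_words(("j", "i"), n))
```

Running the j loop first frees C before B is created, so B can take over the space C occupied.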


Avoiding Wasteful Computation (continued)


Architecture Lower Power Design

  • Optimum supply voltage architecture through hardware duplication (trading area for lower power) and/or pipelining

    • Fewer, more complex instructions require less encoding but larger decode logic!

  • Use a small set of complex instructions with a shorter instruction length (e.g., Hitachi SH: 16-bit fixed-length instructions, arithmetic instructions use only two operands; NEC V800: variable-length instructions, with decoding overhead)

  • Superscalar: CPI < 1 through parallel instruction execution. VLIW architecture.


Variable Supply Voltage Block Diagram

Computational work varies with time. An approach to reducing the energy consumption of such systems beyond shutdown involves dynamically adjusting the supply voltage based on the computational workload.

The basic idea is to lower the supply voltage when the workload is low, rather than running at a fixed supply and idling for some fraction of the time.

The supply voltage and clock rate are increased during high-workload periods.


Power Reduction using Variable Supply

  • Circuits with a fixed supply voltage work at a fixed speed and idle if the data sample requires less than the maximum amount of computation. Power is then reduced only in a linear fashion, since the energy per operation is fixed.

  • If the workload for a given sample period is less than peak, the delay of the processing element can be increased by a factor of 1/workload without loss in throughput, allowing the processor to operate at a lower supply voltage. Thus, the energy per operation varies with the workload.
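The difference between the two cases can be made concrete with a small numerical sketch. The Python code below rests on assumptions that are not in the slides: a first-order delay model, delay proportional to Vdd / (Vdd - Vt)^2, a threshold voltage of 0.7 V, and a maximum supply of 3.3 V. Energy per operation is taken to scale as Vdd^2.

```python
def gate_delay(vdd, vt=0.7):
    """First-order CMOS delay model in arbitrary units (an assumption)."""
    return vdd / (vdd - vt) ** 2

def supply_for_workload(workload, vmax=3.3, vt=0.7):
    """Lowest supply whose delay is at most 1/workload times the delay at vmax."""
    target = gate_delay(vmax, vt) / workload
    lo, hi = vt + 1e-6, vmax
    for _ in range(60):              # bisection; delay falls as vdd rises
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid                 # too slow at mid, need a higher supply
        else:
            hi = mid
    return hi

def normalized_power(workload, variable_supply, vmax=3.3, vt=0.7):
    """Power relative to full workload at vmax."""
    vdd = supply_for_workload(workload, vmax, vt) if variable_supply else vmax
    return workload * (vdd / vmax) ** 2

for w in (1.0, 0.5, 0.25):
    fixed = normalized_power(w, variable_supply=False)
    variable = normalized_power(w, variable_supply=True)
    print(f"workload {w}: fixed supply {fixed:.3f}, variable supply {variable:.3f}")
```

With a fixed supply the power falls in proportion to the workload; with a variable supply it falls faster, because each operation also costs less energy.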


Data Driven Signal Processing

The basic idea of averaging: two samples are buffered and their workloads are averaged.

The averaged workload is then used as the effective workload to drive the power supply.

Using a ping-pong buffering scheme, data samples I(n+2) and I(n+3) are buffered while I(n) and I(n+1) are processed.
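The averaging step can be sketched in a few lines of Python. The function name and the example workloads are illustrative; the sketch only shows how each buffered pair is replaced by its mean before it drives the supply.

```python
def effective_workloads(workloads):
    """Average each buffered pair and use the mean for both samples of the pair."""
    if len(workloads) % 2:
        raise ValueError("expects an even number of samples")
    averaged = []
    for i in range(0, len(workloads), 2):
        mean = (workloads[i] + workloads[i + 1]) / 2
        averaged += [mean, mean]
    return averaged

# Hypothetical per-sample workloads as fractions of the peak.
print(effective_workloads([0.2, 1.0, 0.4, 0.6]))
```

Averaging smooths out the peaks, so the supply does not have to follow every individual sample.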


Datapath Parallelization


Memory Parallelization

To first order, P = C · (f/2) · Vdd^2


Pipelined Micro-P


Architecture Trade-Off

Ppipeline = (1.15 C) (0.58 V)^2 (f) = 0.39 P

Pparallel = (2.15 C) (0.58 V)^2 (0.5 f) = 0.36 P

[Figure: pipelined implementation]
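The two ratios follow directly from P = C · V^2 · f with the scaling factors given above. A short Python check (the function name is illustrative):

```python
def relative_power(cap_factor, volt_factor, freq_factor):
    """Power relative to the reference design, from P = C * V^2 * f."""
    return cap_factor * volt_factor ** 2 * freq_factor

pipelined = relative_power(1.15, 0.58, 1.0)
parallel = relative_power(2.15, 0.58, 0.5)
print(f"pipelined: {pipelined:.2f} P, parallel: {parallel:.2f} P")
```

In both cases the saving comes from the squared voltage term, which outweighs the extra capacitance.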


Different Classes of RISC Micro-P


Application Specific Coprocessor

  • DSPs are increasingly called upon to perform tasks for which they are not ideally suited, for example, Viterbi decoding.

  • They may also take considerably more energy than a custom solution.

  • Use the DSP for the portions of algorithms for which it is well suited, and craft an application-specific coprocessor (i.e., custom hardware) for the other tasks.

  • This is an example of the difference between power and energy.

  • The application-specific coprocessor may actually consume more power than the DSP, but it may be able to accomplish the same task in far less time, resulting in a net energy savings.

  • Power consumption varies dramatically with the instruction being executed.
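The distinction between power and energy is a one-line calculation, energy = power × time. The figures in this Python sketch are hypothetical and chosen only to show a coprocessor that draws more power yet uses less energy.

```python
def energy_mj(power_mw, time_ms):
    """Energy in millijoules from power in milliwatts and time in milliseconds."""
    return power_mw * time_ms / 1000.0

# Hypothetical figures for the same task on two implementations.
dsp_energy = energy_mj(power_mw=100.0, time_ms=10.0)
coprocessor_energy = energy_mj(power_mw=150.0, time_ms=2.0)
print(f"DSP: {dsp_energy:.2f} mJ, coprocessor: {coprocessor_energy:.2f} mJ")
```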


Clock per Instruction (CPI)


Superpipeline Micro-P


VLIW Architecture

The compiler takes responsibility for finding the operations that can be issued in parallel and creating a single very long instruction containing these operations. VLIW instruction decoding is easier than superscalar decoding because of the fixed format and because there are no instruction dependencies to check.

The fixed format could place more limitations on the combination of operations.

Intel P6: CISC instructions are combined on chip to provide a set of micro-operations (i.e., a long instruction word) that can be executed in parallel.

As power becomes a major issue in the design of fast microprocessors, the simpler architecture is the better one.

VLIW architectures, being simpler than N-issue machines, can be considered promising for achieving high speed and low power simultaneously.


Architecture Optimization

2's complement architecture

  • correlator example

    • 64 MHz random input

    • 64 kHz accumulated output

    • 1024 length

  • accumulator acts as a low-pass filter

    • higher-order bits have little switching activity

  • high switching activity of the adder

    • all of the input bits to the adder switch each time the input changes sign

[Figure: 2's complement accumulator with signals in_latched, current_sum, and add_out, bus widths of 4 and 14 bits, and clocks at 64 MHz and 64 kHz. Plot: transition activity (0.0 to 1.0) versus bit position (0 to 12) for add_out, in_latched (sign-extension), and current_sum.]


Architecture Optimization (continued)

Sign-magnitude architecture

  • low switching activity in the high-order bits

  • no sign extension is performed

  • higher-order bits only need an incrementer

  • power is not sensitive to very rapid fluctuations in the input data

[Figure: sign-magnitude accumulator with the sign bit (to control), accumulators POSACC and NEGACC on gated clocks, a final subtractor, bus widths of 3, 4, 13, and 14 bits, and clocks at 64 MHz and 64 kHz. Plot: transition activity (0.0 to 1.0) versus bit position (0 to 12) for sum (2's complement), suma, sumb, and suma + sumb (sign-magnitude).]

Input pattern                         2's complement (mW)   Sign-magnitude (mW)
constant (7, 7, …)                    1.97                  2.25
ramp (-7, -6, .., 6, 7, ..)           2.13                  2.51
random                                3.42                  2.51
min->max->min (-7, +7, -7, +7, …)     5.28                  2.46
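The reason the min->max->min pattern is so costly in 2's complement can be seen by counting bit toggles on the sign-extended input word. This Python sketch is a simplified illustration, not a model of the full accumulator: it assumes a 14-bit datapath and looks only at the input word presented to the adder.

```python
def twos_complement(x, bits=14):
    """Sign-extended 2's complement encoding of x in the given word width."""
    return x & ((1 << bits) - 1)

def sign_magnitude(x, bits=14):
    """Sign-magnitude encoding: top bit is the sign, the rest is |x|."""
    sign = (1 << (bits - 1)) if x < 0 else 0
    return sign | abs(x)

def toggles(words):
    """Number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

pattern = [-7, 7, -7, 7]
tc = toggles([twos_complement(x) for x in pattern])
sm = toggles([sign_magnitude(x) for x in pattern])
print("2's complement toggles:", tc)
print("sign-magnitude toggles:", sm)
```

Every sign change flips all of the sign-extension bits in 2's complement, whereas in sign-magnitude only the sign bit changes.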


Architecture Optimization (continued)

Ordering of input signals

  • the ordering of operations can result in reduced switching activity

  • example: multiplication with a constant, IN + (IN >> 7) + (IN >> 8)

  • topology II

    • the output of the first adder has a small amplitude -> lower switching activity

    • switched 30% less

[Figure: two adder topologies for IN + (IN >> 7) + (IN >> 8), each with an intermediate output SUM1 and a final output SUM2. For each topology, a plot of transition activity (0.0 to 0.4) versus bit position (0 to 12) for SUM1 and SUM2.]
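The effect of operation ordering can be reproduced qualitatively with a toggle count. The Python sketch below makes several assumptions not stated in the slides: topology I is taken to add IN and IN >> 7 first, topology II is taken to add the two shifted terms first, and the input is an unsigned 12-bit uniformly random sequence. It does not attempt to reproduce the 30% figure, which depends on the input data used in the original experiment.

```python
import random

def toggles(values):
    """Number of bit flips between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

def simulate(samples):
    """Return total adder-output toggles for topology I and topology II."""
    sum1_i = [x + (x >> 7) for x in samples]            # I: large term first
    sum1_ii = [(x >> 7) + (x >> 8) for x in samples]    # II: small terms first
    sum2 = [x + (x >> 7) + (x >> 8) for x in samples]   # same final result
    return toggles(sum1_i) + toggles(sum2), toggles(sum1_ii) + toggles(sum2)

rng = random.Random(0)
samples = [rng.randrange(1 << 12) for _ in range(1000)]
total_i, total_ii = simulate(samples)
print("topology I toggles:", total_i)
print("topology II toggles:", total_ii)
```

Because the first adder in topology II only ever sees small values, its upper output bits stay at zero and contribute no switching.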


Architecture Optimization (continued)

Reducing glitching activity

  • static designs can exhibit spurious transitions

    • caused by the finite propagation delay from one logic block to the next

  • important to balance all signal paths and reduce the logic depth

  • multiple-input addition

    • 4-input case: chained activity is 1.5 times larger than in the tree implementation

    • 8-input case: chained activity is 2.5 times larger than in the tree implementation

[Figure: chained implementation, ((A + B) + C) + D, and tree implementation, (A + B) + (C + D), of a four-input addition.]
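The difference in logic depth between the two structures is easy to quantify. This Python sketch only computes the number of adder stages on the longest path, assuming two-input adders; it does not model the glitching itself.

```python
from math import ceil, log2

def chain_depth(n_inputs):
    """Adder stages on the longest path of a chained n-input addition."""
    return n_inputs - 1

def tree_depth(n_inputs):
    """Adder stages on the longest path of a balanced tree addition."""
    return ceil(log2(n_inputs))

for n in (4, 8):
    print(f"{n} inputs: chain depth {chain_depth(n)}, tree depth {tree_depth(n)}")
```

The chain also has unequal path lengths to each adder, which is what lets early outputs change again when later inputs arrive.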


Synchronous vs. Asynchronous Systems

  • Synchronous system: a signal path starts from a clocked flip-flop, passes through combinational gates, and ends at another clocked flip-flop. The clock signals do not participate in computation but are required for synchronization. As technology advances, systems tend to get bigger and bigger, and as a result the delay on the clock wires can no longer be ignored. Clock skew is thus becoming a bottleneck for many system designers. Many gates switch unnecessarily just because they are connected to the clock, and not because they have to process new inputs. The biggest gate is the clock driver itself, which must always switch.

  • Asynchronous system (self-timed): an input signal (request) starts the computation on a module and an output signal (acknowledge) signifies the completion of the computation and the availability of the requested data. Asynchronous systems are potentially responsive to transitions on any of their inputs at any time, since they have no clock with which to sample their inputs.


Synchronous vs. Asynchronous Systems (continued)

  • More difficult to implement, requiring explicit synchronization between communicating blocks without clocks

  • If the signal feeds directly into conventional gate-level circuitry, invalid logic levels could propagate throughout the system.

  • Glitches, which are filtered out by the clock in synchronous designs, may cause an asynchronous design to malfunction.

  • Asynchronous designs are not widely used because designers can't find the supporting design tools and methodologies they need.

  • The DCC error corrector of a compact cassette player uses 80% less power than its synchronous counterpart.

  • Offers more architectural options and freedom, encourages distributed, localized control, and offers more freedom to adapt the supply voltage


Asynchronous Modules


Example: ABCS protocol

6% more logic


Control Synthesis Flow


Pipelined Self-Timed Micro-P


Programming Style


Speed vs. Power Optimization

