
SoC Low-Power Design Techniques


Jun-Dong Cho

SungKyunKwan University

VADA Lab.

- Introduction
- SOC Design Trends
- System Level Low Power Design
- Architecture Level Low Power Design
- Conclusion

- Expected to integrate more and more complex functions
- Web-browsing, real-time video processing, speech recognition and synthesis

- Average operating power at or below 100mW and standby power levels at or below 2mW
- Performance levels must increase from 300 million operations per second (MOPS) today to 2500 MOPS in 2016

Achieving functionality while

maximizing battery life and minimizing size

Example applications: GPS, noise-cancellation headphones, cochlear implant, cellular phone, medical watch, hearing aid, portable audio, digital still camera, digital radio

- How accurate should I make my FDCT?

- The new version of ITRS predicts that Moore’s law will continue on a two to three year cycle throughout this period (2001-2016)
- One of the key design challenges is to effectively use the dramatically increasing transistor counts, given certain power and productivity constraints
- “Bottom-up”: based on system constraints
- “Top-down”: based on design resource constraints

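Energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility:

- Signal-processing ASIC: 200 MOPS/mW
- Reconfigurable architecture: 10-80 MOPS/mW
- Signal processors (ASIPs, DSPs): 3 MOPS/mW
- Embedded processor (ARM): 0.5 MOPS/mW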

- WiFi: 10-100 Mbit/s, unlicensed band (OFDM, M-ary coding)
- 3G: 0.1-2 Mbit/s, wide-area cellular (CDMA, GMSK)
- Bluetooth: 0.8 Mbit/s, cable replacement (frequency hopping)
- ZigBee: 0.02-0.2 Mbit/s, low power, low cost (QPSK)
- UWB: recently allowed by the FCC (short pulses with no carrier, bi-phase or PPM)

[Figure: data rate (10 kbit/s to 100 Mbit/s, log scale) versus carrier frequency (0 to 6 GHz) for UWB, 802.11a/b/g, 3G, Bluetooth, and ZigBee]

[Figure: cost ($0.10 to $1000, log scale) versus carrier frequency (0 to 6 GHz) for UWB, 802.11a/b/g, 3G, Bluetooth, and ZigBee]

[Figure: power (1 mW to 10 W, log scale) versus carrier frequency (0 to 6 GHz) for UWB, 802.11a/b/g, 3G, Bluetooth, and ZigBee]

- Practical reasons
(Reducing power requirements of high throughput portable applications)

- Financial reasons
(Reducing packaging costs and achieving memory savings)

- Technological reasons
(Excessive heat prevents the realization of high density chips and limits their functionalities)

- Portable devices: Battery life-time
- Telecom and military: Reliability (reduced power decreases electromigration, hence increases reliability)
- High volume products: Unit cost
(reduced power decreases packaging cost)

- ADVANTAGES
- Smaller geometries
- Higher clock frequencies

- DISADVANTAGES
- Higher power consumption
- Lower reliability

- Average power consumption of a node that switches in every period T
(each period has a 0→1 or a 1→0 transition)

- Average power consumption of a node with partial activity
(only a fraction of the periods has a transition)
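The two cases above can be sketched numerically. The slide's own formulas are not reproduced in this transcript, so the sketch assumes the standard CMOS switching model in which each transition dissipates C·VDD²/2 on average; the function name and the example values (1 pF, 3.3 V, 100 MHz) are illustrative assumptions.

```python
def dynamic_power(c_load, vdd, freq, activity=1.0):
    """Average switching power in watts (standard CMOS model, assumed here).

    c_load   -- switched node capacitance in farads
    vdd      -- supply voltage in volts
    freq     -- clock frequency 1/T in hertz
    activity -- fraction of periods in which the node makes a transition
    """
    # Each 0->1 or 1->0 transition dissipates C * Vdd^2 / 2 on average.
    return 0.5 * activity * c_load * vdd ** 2 * freq


# Node that switches in every period versus one that switches in 25% of them.
full = dynamic_power(1e-12, 3.3, 100e6)
partial = dynamic_power(1e-12, 3.3, 100e6, activity=0.25)
print(f"every period: {full * 1e6:.1f} uW, 25% activity: {partial * 1e6:.1f} uW")
```

Because power scales with the square of VDD but only linearly with activity and capacitance, the later slides concentrate on techniques that allow the supply voltage to be lowered.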

- Power dissipation in logic blocks consists of both dynamic (switching) and static (standby) components

- Memory power is due primarily to row/column decoders and bit and word line switching activity
- Consider the power dissipated when the bitlines are switched by approximately VDD during write cycles

- Low-power digital SOC designs of the future will be 90-95% memory and 5-10% logic, including overhead
- Future chips may be dominated by memory due to power and resource constraints

- Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations
- All-in-one approach (SoC): I/O pin and buffer reduction
- Voltage-reducible hardware
- 2-D pipelining (systolic arrays)
- Parallel processing

- Voltage and process scaling
- Design methodologies
- Power-aware design flows and tools, trade area for lower power

- Architecture Design
- Power down techniques
- Clock gating, dynamic power management

- Dynamic voltage scaling based on workload
- Power conscious RT/ logic synthesis
- Better cell library design and resizing methods
- Cap. reduction, threshold control, transistor layout

- Fast and accurate analysis in the design process
- Power budgeting
- Knowledge-based architectural and implementation decisions
- Package selection
- Power hungry module identification

- Detailed and comprehensive analysis at the later stages
- Satisfaction of power budget and constraints
- Hot spots

- Algorithm selection / algorithm transformation
- Identification of hot spots
- Low Power data encoding
- Quality of Service vs. Power
- Low Power Memory mapping
- Resource Sharing / Allocation

- C/C++ Compilation
- Program Execution
- Building design representation
- Loading profiling data
- Setting constraints
- Power estimation
- Identification of Hot Spots

- Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
- Five instructions in parallel (IU, FPU, BPU, LSU, SRU), RISC
- FPU is pipelined so a multiply-add instruction can be issued every clock cycle
- Low-power 3.3-volt design
- The 603e provides four software-controllable power-saving modes.

- Copper Processor with SOI
- IBM’s Blue Logic ASIC: new design reduces power by a factor of 10

- How Does SOI Reduce Capacitance ?

- Junction capacitance is eliminated by placing an insulator (similar to glass) between the impurities and the silicon substrate
- high performance, low power, low soft error

- Motivation: Aluminum resists the flow of electricity as wires are made thinner and narrower.
- Performance: 40% speed-up
- Cost: 30% less expensive
- Power: Less power from batteries
- Chip Size: 60% smaller than Aluminum chip

- Circuit function
- Circuit technology
- Input probabilities
- Circuit topology

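- Signal probability of a signal g(t) is given by

$$P(g) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} g(t)\,dt$$

- Signal activity of a logic signal g(t) is given by

$$A(g) = \lim_{T \to \infty} \frac{n_g(T)}{T}$$

where n_g(T) is the number of transitions of g(t) in the time interval between -T/2 and T/2.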

Factors Influencing Ceff:

- Assume that there are M mutually independent signals g1, g2, ..., gM, each having a signal probability Pi and a signal activity Ai, for i = 1, ..., M.
- For static CMOS, the signal probability at the output of a gate is determined by the probability of 1s (or 0s) in the logic description of the gate.

| Gate | Input probabilities | Output probability |
|---|---|---|
| AND | P1, P2 | P1P2 |
| OR | P1, P2 | 1-(1-P1)(1-P2) |
| Inverter | P1 | 1-P1 |
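The gate-level probabilities can be checked with a short sketch. It assumes independent inputs, as the slide does. The additional assumption that consecutive output values are independent, so that P(0→1) = (1 - P)·P, is a common textbook simplification and is not stated in the slides; all function names are illustrative.

```python
def p_not(p1):
    """Signal probability at the output of an inverter."""
    return 1.0 - p1


def p_and(p1, p2):
    """Signal probability at the output of a 2-input AND gate."""
    return p1 * p2


def p_or(p1, p2):
    """Signal probability at the output of a 2-input OR gate."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)


def p_nor(p1, p2):
    """NOR is an OR followed by an inverter."""
    return p_not(p_or(p1, p2))


def rising_transition_probability(p_out):
    """P(0->1) at a static CMOS output, assuming successive values are independent."""
    return (1.0 - p_out) * p_out


# Equiprobable inputs: a NOR output is 1 a quarter of the time.
p = p_nor(0.5, 0.5)
print(p, rising_transition_probability(p))
```

For equiprobable inputs the NOR output is high with probability 1/4, which gives a power-consuming 0→1 transition probability of 3/16 under these assumptions.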

Factors Influencing Ceff:

- Transistors connected to the same input are turning on and off simultaneously when the input changes
- CL of a static CMOS gate is charged to VDD any time a 0→1 transition at the output node is required.
- CL of a static CMOS gate is discharged to ground any time a 1→0 transition at the output node is required.

NOR Gate

- State transition diagram of the NOR gate


- Signal activity calculation: Boolean Difference

- It signifies the condition under which output f is sensitized to input xi
- If the primary inputs to function f are not spatially correlated, the signal activity at f is
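The formulas for this slide are missing from the transcript. The sketch below therefore assumes the commonly used definitions: the Boolean difference ∂f/∂xi = f(xi=1) XOR f(xi=0), and the activity at f as the sum over inputs of P(∂f/∂xi)·A(xi) for spatially uncorrelated inputs. It enumerates the truth table, so it is only meant for small functions; all names are illustrative.

```python
from itertools import product


def boolean_difference_probability(f, n, i, probs):
    """P(df/dx_i = 1) for independent inputs with P(x_j = 1) = probs[j]."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        x = [0] * n
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= probs[j] if b else 1.0 - probs[j]
        x[i] = 0
        f0 = f(*x)
        x[i] = 1
        f1 = f(*x)
        if f0 != f1:  # output is sensitized to x_i for this assignment
            total += weight
    return total


def signal_activity(f, n, probs, activities):
    """Activity at f: sum of P(df/dx_i) * A(x_i) over all inputs."""
    return sum(
        boolean_difference_probability(f, n, i, probs) * activities[i]
        for i in range(n)
    )


and2 = lambda a, b: a & b
xor2 = lambda a, b: a ^ b
print(signal_activity(and2, 2, [0.5, 0.5], [1.0, 1.0]))
print(signal_activity(xor2, 2, [0.5, 0.5], [1.0, 1.0]))
```

An XOR output is always sensitized to both inputs, so it inherits the full activity of each, whereas an AND gate passes a transition on one input only while the other input is 1.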

- Strategy:
1. Modify the architecture of the system so as to make it faster.

2. Reduce VDD so as to restore the original speed. Power consumption has decreased.

- The most common architectural changes rely on the exploitation of parallelization and pipelining.
- Drawback:
The additional circuitry required to compensate the speed degradation may dominate, and the power consumption may increase.

- Consequence:
Parallelism and pipelining do not always pay-off.

Ppar=0.36Pref

Ppar=0.2Pref
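The ratios above follow from P ∝ C·VDD²·f. The scaling factors in this sketch are not given in the slides; they are assumptions taken from the widely cited adder-comparator datapath example (capacitance 2.15x, supply 0.58x, clock 0.5x for the parallel version; 2.5x, 0.4x, 0.5x for the parallel-pipelined version).

```python
def relative_power(cap_factor, vdd_factor, freq_factor):
    """Power relative to the reference design, using P ~ C * Vdd^2 * f."""
    return cap_factor * vdd_factor ** 2 * freq_factor


# Assumed factors (not stated in the slides): duplicated datapath at half clock rate.
parallel = relative_power(cap_factor=2.15, vdd_factor=0.58, freq_factor=0.5)

# Assumed factors: parallel and pipelined, allowing an even lower supply voltage.
parallel_pipelined = relative_power(cap_factor=2.5, vdd_factor=0.4, freq_factor=0.5)

print(f"parallel: {parallel:.2f} Pref, parallel + pipelined: {parallel_pipelined:.2f} Pref")
```

The extra capacitance from duplicated hardware is what the drawback above refers to: if it grows faster than VDD² shrinks, the transformation does not pay off.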

- The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.

Loop overhead is cut in half because two iterations are performed in each iteration.

If array elements are assigned to registers, register locality is improved because A(i) and A(i +1) are used twice in the loop body.

Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
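The loop from the slide is not reproduced in this transcript. As a minimal sketch, assume a hypothetical loop body a[i] = a[i] + a[i-1] * a[i+1], unrolled with factor u = 2; the function names are illustrative.

```python
def rolled(a):
    """Reference loop: one element per iteration."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_two(a):
    """Same computation with the body replicated twice and a step of 2."""
    a = list(a)
    last = len(a) - 1  # exclusive upper bound of the original loop
    i = 1
    while i + 1 < last:
        # a[i] and a[i + 1] are each used twice in this body.
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i < last:  # leftover iteration when the trip count is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


data = [1, 2, 3, 4, 5, 6, 7]
print(rolled(data) == unrolled_by_two(data))
```

The unrolled version performs the loop test and index update half as often, and the cleanup iteration keeps the result identical for any array length.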

Two output samples are computed in parallel based on two input samples.

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,

The transformation yields a critical path of 3, so the supply voltage can be lowered.
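The filter equations are missing from the transcript. As a sketch, assume a first-order recursive filter y(n) = x(n) + A·y(n-1). Unrolling once and applying distributivity and constant propagation gives y(n) = x(n) + A·x(n-1) + A²·y(n-2), so both outputs of a pair depend only on y(n-2). The function names are illustrative.

```python
def iir_direct(x, a, y0=0.0):
    """Assumed reference filter: y(n) = x(n) + a * y(n-1)."""
    y, prev = [], y0
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y


def iir_unrolled(x, a, y0=0.0):
    """Two outputs per iteration, both computed from y(n-2) only."""
    a2 = a * a  # constant propagation: A*A is computed once
    y, prev = [], y0
    for n in range(0, len(x) - 1, 2):
        y_first = x[n] + a * prev
        # Distributivity: x(n+1) + a*(x(n) + a*prev) = x(n+1) + a*x(n) + a2*prev
        y_second = x[n + 1] + a * x[n] + a2 * prev
        y.extend((y_first, y_second))
        prev = y_second
    if len(x) % 2:  # leftover sample for odd-length input
        y.append(x[-1] + a * prev)
    return y


samples = [1.0, 0.5, -0.25, 2.0, 1.5]
print(iir_direct(samples, 0.5))
print(iir_unrolled(samples, 0.5))
```

Since the second output no longer waits for the first, the two can be evaluated in parallel, which is what shortens the critical path per pair of samples.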

[Figure: example sequence of 5-bit words on a bus]

- Bus-invert (BI) code
- Appropriate for random data patterns
- Redundant code (1 extra bus line)
- Reduce avg. transitions up to 25%
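A minimal sketch of bus-invert coding, following the rule described by Stan and Burleson: if the Hamming distance between the present bus state (including the invert line) and the next data word exceeds n/2, the word is sent inverted with the invert line set. The function names and the test pattern are illustrative assumptions.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_line) pairs."""
    mask = (1 << width) - 1
    bus, inv, encoded = 0, 0, []
    for word in words:
        # Transitions needed to send the word as-is (invert line back to 0).
        distance = hamming(bus, word) + inv
        if 2 * distance > width:
            bus, inv = ~word & mask, 1
        else:
            bus, inv = word, 0
        encoded.append((bus, inv))
    return encoded


def bus_invert_decode(encoded, width):
    mask = (1 << width) - 1
    return [(~bus & mask) if inv else bus for bus, inv in encoded]


def count_transitions(encoded):
    """Total line toggles, data lines plus the invert line, starting from all zeros."""
    prev_bus, prev_inv, total = 0, 0, 0
    for bus, inv in encoded:
        total += hamming(prev_bus, bus) + (prev_inv ^ inv)
        prev_bus, prev_inv = bus, inv
    return total


words = [0x00, 0xFF, 0x00, 0xFF]
coded = bus_invert_encode(words, 8)
raw = count_transitions([(w, 0) for w in words])
print(raw, count_transitions(coded), bus_invert_decode(coded, 8) == words)
```

The alternating pattern is a best case for the code: only the invert line toggles. For random data the saving is much smaller, consistent with the upper bound quoted above.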

[Figure: example 4-bit bus patterns X and Z, and the bus-invert circuit built from D registers, a majority voter, and inverters (inv)]

R. J. Fletcher, “Integrated circuit having outputs configured for reduced state changes,” May 1987, U.S. Patent 4667337.

M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Tr. on VLSI Systems, Mar. 1995, pp. 49-58.

- Partition the chip into multiple sub-units each of which is designed to operate at a specific supply voltage

[Figure: chip partitioned into blocks labeled FAST or SLOW, each with a 5 V or 3 V supply]

VADA Lab's low-power IPs

- Low-Power Equalizer for xDSL: 21% power reduction, SNR = 40 dB
- Next-generation low-power security processor chip design for smart cards: ECC, Rijndael, DES, SHA
- Maximizing Memory Data Reuse for Lower Power Motion Estimation: 33% power reduction, 52 MHz, 2.1x area increase (SCI journal paper)
- OFDM-based high-speed wireless LAN platform: 20.7 MHz, 237,000 gates
- Low-power design and co-design of a Double Dwell Searcher for IS-95 based CDMA: 67% power reduction, 41% area reduction
- Fast and Low Power Viterbi Search Engine using Inverse Hidden Markov Model: 68% power reduction, 71% speed improvement, 1.9x area increase
- Samsung Humantech Paper Award, 2002
- Highly Flexible Design of OFDM Transceiver for DVB-T (in development)

- Use of alternative number systems
- Scheduling/ordering
- Algorithm substitution
- Signal and statistical analysis

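- Use of a Logarithmic Number System (LNS)
- Logarithmic number system
- Applied to the FFT, the largest of the arithmetic modules
- The look-up table is the main factor in the size
- A number is split into a sign and a magnitude, and the base-2 logarithm of the magnitude is computed.
- The resulting log value is represented as a binary number whose range is limited to n bits.

- LNS arithmetic
- Multiplication: one addition
- Addition/subtraction: an addition, a subtraction, and a look-up table

- Arithmetic accuracy
- No BER degradation when the fractional part has 2 or more bits

- Power consumption
- In experiments, power consumption is reduced by up to about 60% compared with a conventional butterfly FFT
- 7.8 mW -> 3.1 mW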
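The LNS operations can be sketched in a few lines. This is an illustration under stated assumptions, not the FFT design from the slides: it uses 8 fractional bits, handles only nonzero values, and implements only same-sign addition. Subtraction would need a second table for log2(1 - 2^-d). All names and parameters are hypothetical.

```python
import math

FRAC_BITS = 8            # assumed fixed-point precision of the log value
SCALE = 1 << FRAC_BITS


def to_lns(x):
    """Encode nonzero x as (is_negative, fixed-point log2 of |x|)."""
    return (x < 0, round(math.log2(abs(x)) * SCALE))


def from_lns(v):
    negative, exponent = v
    magnitude = 2.0 ** (exponent / SCALE)
    return -magnitude if negative else magnitude


def lns_mul(a, b):
    """Multiplication becomes one fixed-point addition of the logs."""
    return (a[0] != b[0], a[1] + b[1])


# Look-up table for log2(1 + 2**-d), indexed by the fixed-point difference d.
ADD_LUT = [
    round(math.log2(1.0 + 2.0 ** (-d / SCALE)) * SCALE)
    for d in range(16 * SCALE)
]


def lns_add_same_sign(a, b):
    """log2(x + y) = max + log2(1 + 2**-(max - min)); needs a subtract, a LUT read, an add."""
    if a[0] != b[0]:
        raise ValueError("this sketch only adds operands with the same sign")
    hi, lo = max(a[1], b[1]), min(a[1], b[1])
    d = hi - lo
    correction = ADD_LUT[d] if d < len(ADD_LUT) else 0
    return (a[0], hi + correction)


print(from_lns(lns_mul(to_lns(3.0), to_lns(-4.0))))
print(from_lns(lns_add_same_sign(to_lns(3.0), to_lns(5.0))))
```

The table length grows with the number of fractional bits, which is why the look-up table dominates the size of an LNS datapath.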

- Coefficient ordering
- The order of operations is changed to reduce the power consumption of a low-power radix-4 pipeline FFT processor
- Reduces switching activity on the fixed coefficient input of the complex multiplier

- New commutator architecture
- Uses an additional dual-port RAM

- Results of coefficient ordering
- Power reductions of 23% and 9% for 16-point and 64-point FFTs, respectively
- The benefit diminishes for larger FFTs

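- Applied to a 64-point FFT
- The 64-point FFT equations are rewritten through an algorithmic transformation
- The FFT is decomposed into two 8-point FFTs in a two-dimensional structure.
- Complex multiplication is implemented by shift-and-add.

- Power consumption
- Average dynamic power of 41 mW at 20 MHz with a 1.8 V supply in an in-house 0.25 µm BiCMOS technology

- Share of power consumption
- About half of the total power is consumed in the complex multiplier.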

- Analysis of butterfly multiplications
- Coefficient multiplication
- In a generic stage, a total of 0.25*M+3 of the M coefficients are equal to 1

- Clock gating can be used when (cosine, sine) is (1, 0)

- Coefficient multiplication
- Frequency division duplex modem
- 4096 DMT with the 4.3125 kHz tone spacing of the ETSI standard
- 41% of the carriers are upstream, 26% are downstream, and the remaining 30% are unused.

- 1024 DMT with the 4.3125 kHz tone spacing of the ETSI standard
- The figures are 13%, 68%, and 18%, respectively.

- 59-87% of the IFFT (upstream) inputs are 0 and 31-74% of the FFT (downstream) inputs are 0.
- Clock gating is possible.
- It can be applied at the initial input stage.

- 50% of the total power
- FIR (massively pipelined circuit):
- Video processing: edge detection
- Voice processing (data transmission such as xDSL)
- Telephony: 50% (70%/30%) idle, because the two parties do not talk at the same time
- With every clock cycle, data are loaded into the working register banks, even if the data have not changed.

[Figure: NIC power versus time with PSM off and PSM on; labeled values are 750 mW, 50 mW, and a 100 ms interval]

- Sleep to save energy, periodically wake to check for pending data
- PSM protocol: when to sleep and when to wake?

- A PSM-static protocol has a regular sleep/wake cycle

Measurements of Enterasys Networks RoamAbout 802.11 NIC

[Figure: timing diagrams with PSM off and PSM on, showing SYN, ACK, and DATA exchanged between the mobile device, access point, and server from 0 ms to 200 ms, with the device's AWAKE and SLEEP periods marked]

Ronny Krashinsky and Hari Balakrishnan, MIT

If PSM-static is too coarse-grained, it harms performance by delaying network data

If PSM-static is too fine-grained, it wastes energy by waking unnecessarily

Solution: dynamically adapt to network activity to maintain performance while minimizing energy

- Stay awake to avoid delaying very fast RTTs
- Back off (listen to fewer beacons) while idle

Compromise between performance and energy
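The two rules can be illustrated with a small schedule generator. This is a hypothetical sketch of the idea only, not the protocol published by the cited authors: the awake window, the doubling rule, and the cap on skipped beacons are all assumed parameters.

```python
def wake_times(horizon_ms, beacon_ms=100, awake_ms=100, max_skip=8):
    """Times (ms after the last network activity) at which the NIC listens to a beacon.

    The device first stays awake for awake_ms so that fast round trips are
    not delayed, then sleeps and listens to progressively fewer beacons.
    """
    times = []
    t = awake_ms
    skip = 1  # number of beacon intervals to sleep before the next wake-up
    while True:
        t += skip * beacon_ms
        if t > horizon_ms:
            break
        times.append(t)
        skip = min(skip * 2, max_skip)  # back off while idle, up to a cap
    return times


adaptive = wake_times(2000)
static = list(range(100, 2001, 100))  # PSM-static: listen to every beacon
print(adaptive)
print(len(static), "wake-ups for PSM-static versus", len(adaptive))
```

The cap on the back-off bounds the extra delay an idle device can add to incoming data, which is the compromise between performance and energy noted above.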

- Most Computationally demanding part of Video Encoding
- Example: CCIR 601 format
- 720 by 576 pixel
- 16 by 16 macro block (n = 16)
- 32 by 32 search area (p = 8)
- 25 Hz Frame rate (f frame = 25)
- 9 Giga Operations/Sec is needed for Full Search Block Matching Algorithm.
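The 9 GOPS figure can be reproduced from the parameters above. The sketch assumes a search over (2p+1)² candidate positions and three operations per pixel comparison (subtract, absolute value, accumulate); both assumptions are conventional but not stated in the slides.

```python
def full_search_ops_per_second(width, height, n, p, frame_rate, ops_per_pixel=3):
    """Operation count for full-search block matching."""
    macroblocks = (width // n) * (height // n)   # blocks per frame
    candidates = (2 * p + 1) ** 2                # displacements in [-p, +p] squared
    return macroblocks * candidates * n * n * ops_per_pixel * frame_rate


ops = full_search_ops_per_second(width=720, height=576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")
```

The cost grows with (2p+1)², so shrinking the search range when motion is small reduces the work roughly quadratically.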

- Adjusting the search area at frame-rate according to the changing characteristics of video sequences
- Reducing Power Consumption by avoiding unnecessary computation

Motion Vector Distributions

From P. Pirsch et al, VLSI Architectures for Video Compression, Proc. Of IEEE, 1995

| | 1st Iter | 2nd Iter | 3rd Iter |
|---|---|---|---|
| Worst-case error | -25% | -6% | -1.6% |
| Prob. of Error<1% | 10% | 70% | 99.8% |

With an 8 by 8 multiplier, the exact result can be obtained at a maximum of seven iteration steps (worst case)

RTL-level low-power design and implementation of the Searcher Engine in the MSM (Mobile Station Modem) chip for CDMA handsets. Operating frequency: 12.5 MHz

Low-power design using data-flow-graph-based rescheduling, pre-computation, strength reduction, and a synchronous accumulator reduces area and power by up to 67.68% and 41.35%, respectively. San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999.

- Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.

Figure 1. Detailed block diagram

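- A device that takes the pilot channel transmitted by the base station in an IS-95 based DS/CDMA system as input and acquires initial synchronization
- Types of searcher
- Correlator-based and matched-filter-based approaches
- This design uses correlator-based serial search with the double-dwell scheme.

- Local (handset) PN code generator
- Generated using 15 registers.
- Generator polynomial

- The pilot channel transmitted by the base station is despread with the PN sequence generated in the handset.
- The despread result is accumulated Nc times (coherent accumulation count), followed by energy calculation (squaring).
- Each energy value is compared with the first threshold; if it exceeds the threshold, non-coherent accumulation (Nn) is performed in the next stage.
- Otherwise, the PN sequence is advanced by one chip and the process is repeated on the incoming signal.
- The result of non-coherent accumulation is compared with the second threshold.
- If it exceeds the threshold, the search terminates; otherwise, the PN sequence is advanced by one chip and the process is repeated.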
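The search procedure can be sketched in software. This is a simplified, noiseless model under stated assumptions: a single real-valued channel instead of I/Q, a short 5-stage LFSR (period 31) standing in for the 15-register generator, and illustrative thresholds. None of the names or parameter values come from the slides.

```python
def pn_sequence(length):
    """+1/-1 chips from a 5-stage LFSR with period 31 (stand-in for the real generator)."""
    state = [1, 0, 0, 0, 0]
    chips = []
    for _ in range(length):
        chips.append(1 if state[4] else -1)
        feedback = state[4] ^ state[1]
        state = [feedback] + state[:-1]
    return chips


def coherent_energy(received, local, offset, start, nc):
    """Despread nc chips, accumulate coherently, then square."""
    period = len(local)
    acc = sum(
        received[start + i] * local[(start + i + offset) % period]
        for i in range(nc)
    )
    return acc * acc


def double_dwell_search(received, local, nc, nn, theta1, theta2):
    """Return the first PN phase that passes both dwells, or None."""
    for offset in range(len(local)):
        # First dwell: one coherent accumulation compared with theta1.
        if coherent_energy(received, local, offset, 0, nc) <= theta1:
            continue  # advance the local PN sequence by one chip
        # Second dwell: non-coherent accumulation of nn energies.
        total = sum(
            coherent_energy(received, local, offset, k * nc, nc)
            for k in range(nn)
        )
        if total > theta2:
            return offset
    return None


local = pn_sequence(31)
true_offset = 7
received = [local[(i + true_offset) % 31] for i in range(2 * 31)]
print(double_dwell_search(received, local, nc=31, nn=2, theta1=100, theta2=1000))
```

The cheap first dwell rejects most wrong phases quickly, so the longer second dwell runs only on candidates that look promising.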

- Precomputation for external idleness : M. Alidina, 1994

- A comparator example : Shrinivas Devadas, 1994
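The comparator example can be sketched as follows, assuming the usual form of this technique: the most significant bits are examined first, and when they differ the result is already known, so the registers feeding the remaining bits are not loaded. The function name and return convention are illustrative.

```python
def comparator_with_precomputation(a, b, width):
    """Return (a > b, lower_bits_loaded) for unsigned width-bit operands."""
    msb = width - 1
    a_msb, b_msb = (a >> msb) & 1, (b >> msb) & 1
    if a_msb != b_msb:
        # The MSBs decide the result; the lower-bit registers stay disabled.
        return a_msb > b_msb, False
    low_mask = (1 << msb) - 1
    return (a & low_mask) > (b & low_mask), True


# Exhaustive check over all 4-bit operand pairs.
pairs = [(a, b) for a in range(16) for b in range(16)]
results = [comparator_with_precomputation(a, b, 4) for a, b in pairs]
loaded = sum(1 for _, was_loaded in results if was_loaded)
print(f"lower bits loaded in {loaded} of {len(pairs)} comparisons")
```

For uniformly distributed operands the most significant bits differ half of the time, so the main comparator logic sees no input switching in about half of the cycles.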

The three input ALU consumes much less power than an ALU and an ASU

A drawback of using a 3I-ALU is the added complexity in calculating the carry and overflow.

[Figure: searcher datapath combining RX and TX (and -TX) signals on the I and Q channels]

Coherent accumulation stage

- Uses a carry-save adder (or 3-input ALU)

Threshold comparison

- Applies pre-computation

Energy calculation stage

- Reorders the data flow to reduce the number of multiplications

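[Figure: searcher block diagram: XOR despreading, CSA-based coherent accumulation stage, absolute value and maximum-value selection, comparison with θ1, energy calculation stage (squaring), non-coherent accumulation stage (adder), comparison with θ2]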

[Figure: throughput versus C/I for QPSK R=1/4, 8PSK R=1/4, 16QAM R=1/4, and 16QAM R=1/2, showing the hull of AMC and the modulation/coding transition 8PSK->16QAM]