L6: Lower Power Architecture Design, 1999. 8.2, Prof. Jun-Dong Cho (조준동), SungKyunKwan University, http://vada.skku.ac.kr SungKyunKwan Univ.
Through WAVE PIPELINING
Wave-pipelining on FPGA
• Problems with pipelining
• Balanced partitioning
• Delay element overhead
• Tclk > Tmax - Tmin + clock skew + setup/hold time
• Increased area, power, and total latency
• Clock distribution problem
• Wave pipelining = high throughput without such overhead = ideal pipelining
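The clock-period bound on this slide can be evaluated numerically. The following is a minimal sketch, not taken from the lecture: the function names and delay values are illustrative assumptions, and the comparison bound for a conventionally registered block (Tclk >= Tmax + clock skew + setup time) is the usual textbook simplification.

```python
def min_clock_period_wave(t_max, t_min, t_skew, t_setup_hold):
    """Bound from the slide: Tclk > Tmax - Tmin + clock skew + setup/hold time.

    Only the spread between the longest and shortest path delays matters,
    which is why balancing path delays is so important.
    """
    return (t_max - t_min) + t_skew + t_setup_hold


def min_clock_period_registered(t_max, t_skew, t_setup):
    """Assumed textbook bound for a conventionally registered block."""
    return t_max + t_skew + t_setup


# Illustrative delays in nanoseconds (hypothetical values).
T_MAX, T_MIN, T_SKEW, T_SETUP_HOLD = 10.0, 8.0, 0.5, 0.5

wave_bound = min_clock_period_wave(T_MAX, T_MIN, T_SKEW, T_SETUP_HOLD)
registered_bound = min_clock_period_registered(T_MAX, T_SKEW, T_SETUP_HOLD)
print(f"wave-pipelined bound: {wave_bound} ns")
print(f"registered bound:     {registered_bound} ns")
```

With these assumed numbers the wave-pipelined bound shrinks as Tmin approaches Tmax. If Tmin were 0, the spread would equal Tmax and the advantage would disappear.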
FPGA on WavePipeline
• LUT delay is similar across a variety of logic functions.
• Equal delays can therefore be constructed.
• FPGA element delay (wire, LUT, interconnection)
• Powerful layout editor
• Fast design cycle
WP advantages
• Area efficient: no registers, clock distribution network, or clock buffers required.
• Low power dissipation
• Higher throughput
• Low latency
Disadvantages
• Degraded performance in certain cases
• Difficult to achieve sharp rise and fall times in synchronous design
• Layout is critical for balancing the delay
• Parameter variation: power supply and temperature dependence
Experimental Results, by Jae-Hyung Lee (이재형), SKKU
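Observation
• The WP multiplier needed many additional LUTs to adjust delay, so it showed no large gain in power consumption.
• On an FPGA, adjusting delay with dedicated delay elements, rather than with LUTs or net delay, would be more effective.
• In addition, a multiplier designed so that all paths have the same number of logic levels would be easier to wave-pipeline and could offer large gains in power and area over a pipelined structure.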
VON NEUMANN VERSUS HARVARD
Power vs Area of Micro-coded Microprocessor
At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.
Memory Architecture
Exploiting Locality for Low-Power Design
• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow graph representation.
• Two nodes are tightly connected on the flow graph if the shortest distance between them, measured in number of edges traversed, is low.
• Power consumption (mW) in the maximally time-shared and fully-parallel versions of the QMF sub-band coder filter
• Improvement of a factor of 10.5 at the expense of a 20% increase in area
• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.
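The "shortest distance in number of edges" criterion can be computed with a breadth-first search. The sketch below is illustrative only: the small flow graph, the node names, and the choice to treat edges as undirected are assumptions, not part of the lecture or of Hyper-LP.

```python
from collections import deque


def edge_distance(edges, src, dst):
    """Return the fewest edges between src and dst, or None if unreachable.

    Edges are treated as undirected here (an assumption), since closeness
    on the flow graph is about connectivity rather than data direction.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical flow graph: a chain of multiply (m) and add (a) operations.
flow_graph = [("in", "m1"), ("m1", "a1"), ("a1", "m2"), ("m2", "a2"), ("a2", "out")]

print(edge_distance(flow_graph, "m1", "a1"))  # adjacent operations
print(edge_distance(flow_graph, "m1", "a2"))  # operations further apart
```

Pairs with a small distance would be candidates for the same spatially local cluster. What counts as "low" is a design threshold that the slide does not specify.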
Cascade filter layouts: (a) Non-local implementation from Hyper, (b) Local implementation from Hyper-LP
Frequency Multipliers and Dividers
Low Power DSP
• Instruction buffer (or cache): exploit locality to reduce program memory accesses.
• Decoded Instruction Buffer
• Store the decoded result of a loop's first iteration in RAM and reuse it.
• Eliminates the fetch/decode steps.
• 30~40% power saving
Stage-Skip Pipeline
• Power savings are achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except in the first iteration.
• DIB = Decoded Instruction Buffer
• 40% power savings on a DSP or RISC processor.
Stage-Skip Pipeline
• Selector: selects the output of either the instruction decoder or the DIB.
• The decoded instruction signals for a loop are temporarily stored in the DIB and are reused in each iteration of the loop.
• The power wasted in a conventional pipeline is saved in the stage-skip pipeline by stopping instruction fetching and decoding during each loop execution.
Stage-Skip Pipeline
The majority of execution cycles in signal processing programs are spent in loop execution: 40% reduction in power with a 2% area increase.
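The selector and DIB behavior described on these slides can be mimicked with a small software model. This is a sketch under stated assumptions: `decode`, `run_loop`, and the toy instruction strings are hypothetical, and the model only counts fetch/decode activations; it does not estimate power.

```python
def decode(instruction):
    """Hypothetical decoder: split a text instruction into (opcode, operands)."""
    opcode, *operands = instruction.split()
    return (opcode, tuple(operands))


def run_loop(loop_body, iterations, use_dib):
    """Execute a loop body and count how often fetch/decode is activated.

    With use_dib=True, decoded signals are stored in the DIB during the
    first iteration and the selector reuses them in later iterations.
    """
    dib = {}  # decoded instruction buffer, indexed by position in the loop
    fetch_decode_count = 0
    executed = []
    for _ in range(iterations):
        for pc, instruction in enumerate(loop_body):
            if use_dib and pc in dib:
                decoded = dib[pc]  # selector picks the DIB output
            else:
                decoded = decode(instruction)  # selector picks the decoder
                fetch_decode_count += 1
                if use_dib:
                    dib[pc] = decoded
            executed.append(decoded)
    return fetch_decode_count, executed


# Toy multiply-accumulate loop body (illustrative instructions only).
body = ["load r1 x", "load r2 h", "mac acc r1 r2", "djnz loop"]

conventional_count, conventional_trace = run_loop(body, 10, use_dib=False)
stage_skip_count, stage_skip_trace = run_loop(body, 10, use_dib=True)
print(conventional_count, stage_skip_count)
```

Both runs execute the same decoded instructions. Only the number of fetch/decode activations differs, which is the activity the stage-skip pipeline aims to remove.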
Two’s complement implementation of an accumulator
Sign magnitude implementation of an accumulator.
Number representation trade-off for arithmetic
Signal statistics for Sign Magnitude implementation of the accumulator datapath assuming random inputs.
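One side of this trade-off is switching activity when a signal swings around zero. The sketch below counts bit toggles between consecutive words in each representation. The helper names, the 8-bit width, and the sample sequence are assumptions for illustration, and the code models only the encoded data words, not the accumulator datapaths shown in the figures.

```python
def twos_complement(value, bits):
    """Encode a signed integer as a two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits):
    """Encode a signed integer as a sign bit followed by the magnitude."""
    if abs(value) >= 1 << (bits - 1):
        raise ValueError("magnitude does not fit in the given width")
    sign_bit = (1 << (bits - 1)) if value < 0 else 0
    return sign_bit | abs(value)


def toggle_count(words):
    """Count bit positions that change between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


BITS = 8
samples = [1, -1, 2, -2, 1, -1]  # small values alternating in sign

tc_toggles = toggle_count([twos_complement(s, BITS) for s in samples])
sm_toggles = toggle_count([sign_magnitude(s, BITS) for s in samples])
print("two's complement toggles:", tc_toggles)
print("sign magnitude toggles:  ", sm_toggles)
```

For this sequence, two's complement flips most of the sign-extension bits at every zero crossing, while sign magnitude changes only the sign bit and a few low-order bits. The price is more complex addition and subtraction logic in sign magnitude.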