Low Power High Level Synthesis. August 1999. Prof. Jun-Dong Cho, SungKyunKwan University. http://vada.skku.ac.kr SungKyunKwan Univ.
System Partitioning
• Decide which components of the system will be realized in hardware and which will be implemented in software.
• High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance.
• Decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings. The drawback is the increase in area. E.g., a non-distributed and a distributed design of a vector quantizer.
Circuit Partitioning
• Graph and physical representation
VHDL example (figure: behavioral description, process communication, control/data flow graph)
Clustering Example
• Two-cluster partition
• Three-cluster partition
Clustering (Cont’d)
Multilevel Kernighan-Lin
Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right.
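The coarsening step described on this slide can be sketched as follows. This is a minimal illustration, not the METIS implementation: the function names `heavy_edge_matching` and `coarsen`, the dictionary-based graph format, and the visiting order (increasing node index) are all assumptions made for the example.

```python
def heavy_edge_matching(num_nodes, edges):
    """Greedy heavy edge matching (illustrative sketch).

    edges: {(u, v): weight} for an undirected graph without self-loops.
    Returns mate[u] = matched partner of u (or u itself if unmatched).
    """
    adj = {n: [] for n in range(num_nodes)}
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    mate = {}
    for u in range(num_nodes):
        if u in mate:
            continue
        # Among still-unmatched neighbours, pick the heaviest incident edge.
        candidates = [(w, v) for v, w in adj[u] if v not in mate]
        if candidates:
            _, v = max(candidates)
            mate[u], mate[v] = v, u
        else:
            mate[u] = u
    return mate


def coarsen(node_w, edges, mate):
    """Collapse matched pairs; node and edge weights are summed."""
    coarse_id, next_id = {}, 0
    for u in sorted(mate):
        if u not in coarse_id:
            coarse_id[u] = coarse_id[mate[u]] = next_id
            next_id += 1
    coarse_w = [0] * next_id
    for u, w in enumerate(node_w):
        coarse_w[coarse_id[u]] += w
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = coarse_id[u], coarse_id[v]
        if cu != cv:  # edges inside a collapsed pair disappear
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_id, coarse_w, coarse_edges
```

On a 4-node ring with edge weights 5, 1, 4, 1, the two heavy edges are collapsed, so only the two light edges (total weight 2) remain as candidates for the cut.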
Multilevel Kernighan-Lin
Given a partition (Nc+,Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+,N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below: Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
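The expansion in step (3) amounts to copying each coarse node's side to the fine nodes it contains. The sketch below assumes a hypothetical `coarse_id` map (fine node to coarse node) and side labels "+" and "-"; the Kernighan-Lin refinement of step (4) is not shown.

```python
def project_partition(coarse_id, coarse_side):
    """Give every fine node the side of the coarse node that contains it."""
    return {u: coarse_side[c] for u, c in coarse_id.items()}


def cut_weight(edges, side):
    """Total weight of edges whose endpoints lie on different sides."""
    return sum(w for (u, v), w in edges.items() if side[u] != side[v])
```

The cut weight of the projected partition equals that of the coarse partition, which is why the refinement step can start from it directly.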
High-Level Synthesis Steps
High-Level Synthesis
Behavioral description of the circuit (fragment as extracted from the figure): for(I=0;I<=2;I=I+1) begin @(posedge clk); if(fgb[I]%8; begin p=rgb[I]%8; g=filter(x,y)*8; end ............
Behavioral elements: instructions, operations, variables, arrays, signals
High-level synthesis tasks: scheduling, memory inferencing, register sharing, control inferencing
RTL (register transfer level) architecture: control, datapath, memory; built from operators, registers, memory, multiplexers
Control constraints
High-Level Synthesis
• The allocation task determines the type and quantity of resources used in the RTL design. It also determines the clocking scheme, memory hierarchy and pipelining style. To perform the required trade-offs, the allocation task must determine the exact area and performance values.
• The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units.
• The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e., they will never execute simultaneously.
Example of the High-Level Synthesis Process
Low Power Scheduling
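Low-Power Methods Proposed at the High Level
• Operator sharing among sibling operations [Fang, 96]
• Resource sharing that takes data correlation into account [Gebotys, 97]
• Shutting down functional units (demand-driven operation) [Alidina, 94]
• Exploiting the regularity of operations [Rabaey, 96]
• Using dual supply voltages [Sarrafzadeh, 96]
• Minimizing spurious operations [Hwang, 96]
• Minimizing switching activity with a minimum-cost flow algorithm, plus minimizing capacitance by simplifying the interconnect structure [Cho, 97]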
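Power Consumption Model of a Register
$\mathrm{Power(Register)} = \mathrm{switching}(x) \times (C_{out,Mux} + C_{in,Register}) + \mathrm{switching}(y) \times (C_{out,Register} + C_{in,DeMux})$
Since $\mathrm{switching}(x) = \mathrm{switching}(y)$, $\mathrm{Power(Register)} = \mathrm{switching}(y) \times C_{total}$, where $C_{total}$ is the sum of the four capacitances above.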
CDFG (control data flow graph) Representation of a Circuit
e=a+b; g=c+d; f=e+b; h=f*g;
Figure: inputs a, b, c, d; operation +1 produces e, +2 produces g, +3 produces f, *1 produces h.
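As a rough illustration of how the four statements above map to a data flow graph, the sketch below parses them into nodes and dependency edges. The function names and the dictionary format are assumptions for the example; only two-operand `+` and `*` statements are handled, and control flow is ignored.

```python
import re


def build_dfg(statements):
    """Return ({result: (operator, [operands])}, [(producer, consumer), ...])."""
    nodes = {}
    for stmt in statements:
        target, lhs, op, rhs = re.fullmatch(
            r"(\w+)=(\w+)([+*])(\w+)", stmt).groups()
        nodes[target] = (op, [lhs, rhs])
    # An edge exists when an operand is itself the result of another operation.
    edges = [(src, dst)
             for dst, (_, srcs) in nodes.items()
             for src in srcs if src in nodes]
    return nodes, edges


def primary_inputs(nodes):
    """Operands that are not produced by any operation in the graph."""
    used = {s for _, srcs in nodes.values() for s in srcs}
    return sorted(used - set(nodes))
```

For the slide's statements, the edges are e to f, f to h and g to h, and the primary inputs are a, b, c and d.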
Schematic to CDFG of FIR3
Determining the Number of Registers and Resources (figure labels: variables a, b, c, d, e, f, g, h; steps 1, 2, 3, 4)
High-Level Power Estimation
• Pcore = PDP + PMEM + PCNTR + PPROC
• PDP = PREG + PMUX + PFU + PINT, where
• PREG is the power of the registers
• PMUX is the power of the multiplexers
• PFU is the power of the functional units
• PINT is the power of the physical interconnect capacitance
High-Level Power Estimation: PREG
• Compute the lifetimes of all the variables in the given VHDL code.
• Represent the lifetime of each variable as a vertical line from statement $i$ through statement $i + n$ in the column $j$ reserved for the corresponding variable $v_j$.
• Determine the maximum number $N$ of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
• Estimate the minimal number $N$ of sets of registers necessary to implement the code by using register sharing. Register sharing is applied within each group of variables that have the same bit-width $b_i$.
• Select a possible mapping of variables into registers by using register sharing.
• Compute the number $w_i$ of writes to the variables mapped to the same set of registers. Estimate $n_i$ for each set of registers by dividing $w_i$ by the number of statements $S$: $n_i = w_i / S$; hence $TR_{i,max} = n_i f_{clk}$.
• Power in latches and flip-flops is consumed not only during output transitions, but also on every clock edge by the internal clock buffers.
• The non-switching power $P_{NSK}$ dissipated by internal clock buffers accounts for 30% of the average power for a 0.38-micron technology operating at 3.3 V.
• In total,
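The lifetime analysis in the first bullets can be sketched as a sweep over statement indices. The names `max_overlapping_lifetimes` and `transition_rate`, and the inclusive (first statement, last statement) interval format, are assumptions for illustration and do not come from the slides.

```python
def max_overlapping_lifetimes(lifetimes):
    """lifetimes: {variable: (first_stmt, last_stmt)}, both ends inclusive.

    Returns the largest number of lifetimes crossed by any horizontal
    cut-line, i.e. a lower bound on the number of registers needed.
    """
    events = []
    for start, end in lifetimes.values():
        events.append((start, 1))      # variable becomes live
        events.append((end + 1, -1))   # variable is dead after its last use
    live = peak = 0
    for _, delta in sorted(events):    # at equal positions, -1 sorts before +1
        live += delta
        peak = max(peak, live)
    return peak


def transition_rate(num_writes, num_statements, f_clk):
    """Writes per statement, scaled by the clock frequency (w_i / S * f_clk)."""
    return num_writes / num_statements * f_clk
```

With lifetimes a: 1 to 3, b: 2 to 4, c: 4 to 6 and d: 5 to 6, no cut-line crosses more than two lifetimes, so two shared registers suffice for variables of equal bit-width.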
PCNTR
• After scheduling, the control is defined and optimized by the hardware mapper and further by the logic synthesis process before mapping to layout.
• Like interconnect, therefore, the control needs to be estimated statistically.
• Global control model
• Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
Where Ntrans is the number of transitions, Nstates is the number of states, Clc is the capacitance switched in any local controller in one sample period, and Bf is the ratio of the number of bus accesses to the number of busses.
Ntrans
• The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching and the statistics of the inputs.
Factors of the coarse-grained model (obtained by a switch-level simulator)
Low Power Scheduling and Binding
(a) Scheduling without considering low power (b) Scheduling that considers low power
The coarse-grained model provides a fast estimation of the power consumption when no information on the activity of the input data to the functional units is available.
Fine-grained model: used when information on the activity of the input data to the functional units is available.
Effect of the operand activity on the power consumption of an 8 × 8-bit Booth multiplier (figure labels: AHD, input data)
High-Level Power Estimation: PMUX and PFU
Loop Interchange
If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than the execution order in (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
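The effect of loop order on a column-major matrix can be illustrated with a toy model that counts how often consecutive accesses fall in different cache lines. This is only a sketch: the function name, the loop-order labels and the one-line cache model are assumptions, and the two orders are not mapped to the slide's (a) and (b) labels because the figure is not available.

```python
def cache_line_changes(order, rows, cols, line_size):
    """Count accesses to A[i][j] that touch a different line than the previous one.

    order: "row_outer" iterates i in the outer loop, j in the inner loop;
           "col_outer" iterates j in the outer loop, i in the inner loop.
    A is stored column-major, so element (i, j) lives at address j * rows + i.
    """
    if order == "row_outer":
        accesses = [(i, j) for i in range(rows) for j in range(cols)]
    else:
        accesses = [(i, j) for j in range(cols) for i in range(rows)]
    changes, previous_line = 0, None
    for i, j in accesses:
        line = (j * rows + i) // line_size
        if line != previous_line:
            changes += 1
        previous_line = line
    return changes
```

For an 8 × 8 matrix with 4 elements per line, the order with the row index in the inner loop walks memory sequentially and changes line 16 times, while the other order changes line on all 64 accesses.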
Motion Estimation
Motion Estimation (low power)
Matrix-vector product algorithm
Retiming
• Flip-flop insertion to minimize hazard activity
• Moving a flip-flop in a circuit
Exploiting spatial locality for interconnect power reduction (figure labels: Global, Local, Adder1, Adder2)
Balancing maximal time-sharing and fully-parallel implementation
A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers), (b) non-local assignment (20 global transfers)
Retiming/pipelining for Critical path
Effective Resource Utilization
Hazard propagation elimination by clocked sampling
By sampling a steady-state signal at a register input, no further glitches are propagated through the subsequent combinational logic.
Regularity
• Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.
Module Selection
• Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (or where to put registers), such that a minimal hardware cost is obtained under given timing and throughput constraints.
• Full pipelining is ineffective when the clock period mismatches the execution times of the operators. Performing operations in sequence without intermediate buffering can reduce the critical path.
• When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.
• During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
Estimation
• Estimate min and max bounds on the required resources to
  • delimit the design space (min bounds serve as an initial solution)
  • serve as entries in a resource utilization table, which guides the transformation, assignment and scheduling operations
• Max bound on execution time is tmax: topological ordering of the DFG using ASAP and ALAP
• Minimum bounds on the number of resources for each resource class, where NRi is the number of resources of class Ri, dRi is the duration of a single operation, and ORi is the number of operations
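The bound formula itself is not legible on the slide. A commonly used form that is consistent with the symbols defined above is $N_{R_i} \ge \lceil O_{R_i} d_{R_i} / t_{max} \rceil$: all operations of a class must fit into the available time on the allocated units. The sketch below assumes that form; the function name and integer time units are illustrative assumptions.

```python
def min_resources(num_ops, duration, t_max):
    """Assumed lower bound: ceil(num_ops * duration / t_max), in integers.

    num_ops  -- number of operations of the resource class (O_Ri)
    duration -- cycles taken by one operation (d_Ri)
    t_max    -- available execution time in cycles
    """
    total_busy_cycles = num_ops * duration
    return -(-total_busy_cycles // t_max)  # ceiling division without floats
```

For example, six 2-cycle multiplications in a 4-cycle budget need at least three multipliers under this bound; data dependencies may push the actual requirement higher.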
Exploring the Design Space
• Find the minimal-area solution subject to the timing constraints.
• By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining and tree height reduction can be applied.
• After an acceptable graph is obtained, the resource allocation process is initiated:
  • change the available hardware (FUs, registers, busses)
  • redistribute the time allocation over the sub-graphs
  • transform the graph to reduce the hardware requirements
• Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.
Data path Synthesis
Scheduling and Binding
• The scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle.
• Sharing: bind a resource to more than one operation.
  • Operations must not execute concurrently.
• The graph is scheduled hierarchically in a bottom-up fashion.
• Power tradeoffs
  • Shorter schedules enable supply voltage (Vdd) scaling
  • The schedule directly impacts resource sharing
  • Energy consumption depends on what the previous instruction was
  • Reordering to minimize the switching on the control path
• Clock selection
  • Eliminate slacks
  • Choose the optimal system clock period
ASAP Scheduling
• Algorithm
• HAL Example
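The algorithm figure is not reproduced in the text. As a minimal sketch of ASAP (as soon as possible) scheduling, assuming unit-delay operations and no resource limits, each operation is placed one step after its latest predecessor. The example graph reuses the earlier statements e=a+b; g=c+d; f=e+b; h=f*g with operation names +1, +2, +3, *1; it is not the HAL example.

```python
def asap_schedule(preds):
    """preds: {operation: [predecessor operations]}; every operation takes 1 step.

    Returns {operation: control step}, with the first step numbered 1.
    """
    step = {}

    def visit(op):
        if op not in step:
            # One step after the latest predecessor; step 1 if there is none.
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step


example = {"+1": [], "+2": [], "+3": ["+1"], "*1": ["+3", "+2"]}
```

Here `asap_schedule(example)` places +1 and +2 in step 1, +3 in step 2 and *1 in step 3.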
ALAP Scheduling
• Algorithm
• HAL Example
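A matching sketch for ALAP (as late as possible) scheduling, under the same assumptions of unit-delay operations and no resource limits: each operation is placed one step before its earliest successor, and operations without successors go in the last step of a given latency bound. The graph again reuses e=a+b; g=c+d; f=e+b; h=f*g rather than the HAL example.

```python
def alap_schedule(preds, latency):
    """preds: {operation: [predecessor operations]}; latency: last allowed step.

    Returns {operation: control step}, with every operation taking 1 step.
    """
    succs = {op: [] for op in preds}
    for op, op_preds in preds.items():
        for p in op_preds:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            # One step before the earliest successor; last step if there is none.
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]

    for op in preds:
        visit(op)
    return step


example = {"+1": [], "+2": [], "+3": ["+1"], "*1": ["+3", "+2"]}
```

With a latency of 3 steps, +2 moves to step 2 while the other operations keep their ASAP steps. The difference between an operation's ALAP and ASAP steps is its mobility, which force-directed scheduling uses as the time frame.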
Force Directed Scheduling
• Used as priority function.
• Force is related to concurrency. Sort operations for least force.
• Mechanical analogy: Force = constant × displacement.
  • constant = operation-type distribution
  • displacement = change in probability
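The analogy above can be sketched for operations of a single type. The sketch assumes each operation is equally likely to land in any step of its time frame (ASAP step to ALAP step); the names `distribution` and `self_force`, the frame format and the operations a, b, c are hypothetical. Predecessor and successor forces, which a full force-directed scheduler also adds, are left out.

```python
def distribution(frames):
    """frames: {operation: (asap_step, alap_step)} for one operation type.

    Returns {step: expected number of operations of this type in that step}.
    """
    dg = {}
    for lo, hi in frames.values():
        prob = 1.0 / (hi - lo + 1)
        for s in range(lo, hi + 1):
            dg[s] = dg.get(s, 0.0) + prob
    return dg


def self_force(frames, op, target, dg):
    """Force of fixing `op` at `target`: sum of distribution x probability change."""
    lo, hi = frames[op]
    prob = 1.0 / (hi - lo + 1)
    return sum(dg[s] * ((1.0 if s == target else 0.0) - prob)
               for s in range(lo, hi + 1))


frames = {"a": (1, 1), "b": (1, 2), "c": (1, 2)}
dg = distribution(frames)
```

In this example step 1 is more crowded than step 2, so placing b in step 2 gives a negative force and would be preferred over step 1.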
Force Directed Scheduling
Example: Operation V6
Force-Directed Scheduling
• Algorithm (Paulin)