Ahmed Hemani www.it.kth.se/~hemani IL2200 - High Level Synthesis
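High Level Synthesis
• Algorithm: WHILE G < K LOOP F := E*(A+B); G := (A+B)*(C+D); END LOOP;
• Library: functional units such as +, -, *, <
• Constraints: Area; Time (clock period, number of clock steps); Power
• Controller: PLA and latches
• Datapath: registers X and Y, comparator <, adder + and multiplier * operating on A, B, C, D, E and K to produce F and G
(Figure: the algorithm, library and constraints feed high level synthesis, which produces the controller and the datapath.)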
Control & Data Flow Graph
WHILE G < K LOOP F := E*(A+B); G := (A+B)*(C+D); END LOOP;
(Figure: control and data flow graph of the loop, with inputs A, B, C, D, E and K, operations <, +, +, *, *, and outputs F and G.)
• Set of operations: V
• Data dependencies: D ⊆ V × V
• Control dependencies: C ⊆ V × V
• Nodes and edges have placeholders for synthesized information
• Compiler-like optimisation at source code level
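The sets V, D and C above can be written down directly. The following is a minimal sketch, assuming the shared subexpression A+B is computed once (operation +1), that +2 is C+D, *1 is E*(A+B) and *2 is (A+B)*(C+D). The operation names follow the later slides; the helper names are hypothetical, and the loop-carried dependency from G back to the comparison is left out for brevity.

```python
# Operations of the loop "WHILE G < K LOOP F := E*(A+B); G := (A+B)*(C+D); END LOOP;"
V = {"<", "+1", "+2", "*1", "*2"}

# Data dependencies D, a subset of V x V: (producer, consumer) pairs.
D = {("+1", "*1"), ("+1", "*2"), ("+2", "*2")}

# Control dependencies C, a subset of V x V: the loop test guards every body operation.
C = {("<", op) for op in V if op != "<"}


def is_relation_on(relation, nodes):
    """True if every pair in the relation only mentions known operations."""
    return all(a in nodes and b in nodes for a, b in relation)


def predecessors(op, relation):
    """Operations that must produce a result before `op` can execute."""
    return {a for a, b in relation if b == op}
```

For example, predecessors("*2", D) yields the two additions, which is the information a scheduler needs to order the operations.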
Reuse
The cornerstone of the algorithmic synthesis optimisation strategy.
• Operations share the same hardware resource
• Operations that reuse a resource are never executed at the same time
• Operations with potentially different functional requirements are assigned to the same resource type
• Reuse is possible for operations in mutually exclusive control branches, or assigned to different states in a state machine
Allocate type and amount of resource
• Spread across the entire synthesis process
• The algorithm specifies the functional requirement
• Many units in the library satisfy the requirement
• Constraints guide the selection
• Judiciously maximize the reuse potential
• Adds information like delay, area and power
• For operations +1 and +2 to reuse the same adder, it is essential to allocate an adder that can serve both of them
(Figure: the graph annotated with bit widths 16 and 32; operations +1 and +2 are mapped to add32, and *1 and *2 to mult32.)
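The last point can be illustrated with a small selection routine. This is a sketch under stated assumptions: the library entries other than add32 and mult32, the area figures and the per-operation bit widths are all made up for illustration and are not taken from the slides.

```python
# Hypothetical library: unit name -> (operation kind, bit width, relative area).
LIBRARY = {
    "add16": ("add", 16, 1.0),
    "add32": ("add", 32, 2.0),
    "mult16": ("mul", 16, 6.0),
    "mult32": ("mul", 32, 20.0),
}

# Assumed kind and operand width of each operation in the example graph.
OP_KIND = {"+1": "add", "+2": "add", "*1": "mul", "*2": "mul"}
OP_WIDTH = {"+1": 16, "+2": 32, "*1": 32, "*2": 32}


def allocate_shared_unit(ops):
    """Pick the smallest-area library unit that can serve every operation in `ops`."""
    kinds = {OP_KIND[op] for op in ops}
    if len(kinds) != 1:
        raise ValueError("operations of different kinds cannot share this unit")
    kind = kinds.pop()
    width = max(OP_WIDTH[op] for op in ops)  # the unit must cover the widest operation
    candidates = [
        (area, name)
        for name, (unit_kind, unit_width, area) in LIBRARY.items()
        if unit_kind == kind and unit_width >= width
    ]
    if not candidates:
        raise LookupError("no library unit satisfies the requirement")
    return min(candidates)[1]
```

With these assumed widths, +1 alone would fit a 16-bit adder, but sharing one adder between +1 and +2 forces the 32-bit unit.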
Schedule operations
• The algorithm specifies the relative order
• The soul of algorithmic synthesis
• Tightly coupled to allocation
• Time-constrained scheduling and area-constrained scheduling
• Area-time trade-offs
(Figure: alternative schedules of +1, +2, *1 and *2 over control steps 1 to 4, together with the functional units each schedule requires.)
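The two scheduling styles can be sketched on the example graph. The code below is an illustration, not the method used in the course: it assumes every operation takes one control step, uses ASAP scheduling as the time-driven case and a simple greedy list scheduler as the area-constrained case. Function and variable names are hypothetical.

```python
from collections import defaultdict

# Operation -> resource kind, listed in topological order.
OPS = {"+1": "add", "+2": "add", "*1": "mul", "*2": "mul"}
# Operation -> operations whose results it consumes.
DEPS = {"+1": [], "+2": [], "*1": ["+1"], "*2": ["+1", "+2"]}


def asap(ops, deps):
    """As-soon-as-possible schedule: ignores resource limits."""
    step = {}
    for op in ops:  # relies on topological order of `ops`
        step[op] = 1 + max((step[p] for p in deps[op]), default=0)
    return step


def list_schedule(ops, deps, limits):
    """Greedy schedule with at most limits[kind] units per control step.

    Assumes every limit is at least 1, otherwise the loop cannot finish.
    """
    step = {}
    current = 1
    while len(step) < len(ops):
        used = defaultdict(int)
        for op, kind in ops.items():
            if op in step:
                continue
            # Ready only if all operands were produced in an earlier step.
            ready = all(p in step and step[p] < current for p in deps[op])
            if ready and used[kind] < limits[kind]:
                step[op] = current
                used[kind] += 1
        current += 1
    return step
```

With unlimited resources the example finishes in two control steps but needs two adders and two multipliers in parallel; restricting the datapath to one adder and one multiplier stretches it to three control steps. That is the area-time trade-off in miniature.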
Storage Synthesis
• The algorithm does not specify the registers
• New registers are architected to hold the values that cross clock step boundaries
• Registers are necessary to reuse resources
• Optimized using lifetime analysis
• Strongly influenced by scheduling
• Algorithmic synthesis generates too many registers
(Figure: the three-step schedule of +1, +2, *1 and *2 with inputs A, B, C, D, E, intermediate values X and Y, and outputs F and G.)
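Lifetime analysis can be sketched for the intermediate values of the example. The sketch assumes the three-step schedule in which +1 runs in step 1, +2 and *1 in step 2, and *2 in step 3, with X the result of +1 and Y the result of +2. A value is taken to be alive from the step that produces it until the last step that reads it; the function names are hypothetical.

```python
# Assumed schedule: operation -> control step.
SCHEDULE = {"+1": 1, "+2": 2, "*1": 2, "*2": 3}
# Intermediate values: which operation produces them and which ones read them.
PRODUCER = {"X": "+1", "Y": "+2"}
CONSUMERS = {"X": ["*1", "*2"], "Y": ["*2"]}


def lifetimes(schedule, producer, consumers):
    """Map each value to (step it is produced in, last step it is read in)."""
    life = {}
    for value, op in producer.items():
        born = schedule[op]
        last_use = max(schedule[c] for c in consumers[value])
        life[value] = (born, last_use)
    return life


def registers_needed(life):
    """Largest number of values alive across any boundary between steps b and b+1."""
    last_step = max((end for _, end in life.values()), default=1)
    return max(
        (sum(1 for born, last in life.values() if born <= b < last)
         for b in range(1, last_step)),
        default=0,
    )
```

Here X must be held across both step boundaries and Y across the second one, so two registers are enough for the intermediates. A different schedule changes these lifetimes, which is why storage synthesis is strongly influenced by scheduling.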
Interconnect Synthesis
• Interconnect elements like multiplexers and busses implement the control flow
• Interconnect elements are also instrumental in implementing reuse
• For every reused resource an interconnect element is architected
(Figure: the three-step schedule, the resulting datapath with registers X and Y, one adder and one multiplier, and the datapath after operand interchange.)
Register Merging
• Registers can be reused by doing lifetime analysis of the values they hold
• The lifetimes of the values held in registers Y and G do not overlap
(Figure: the three-step schedule, the datapath after operand interchange, and the datapath after register merging, in which Y and G share one register.)
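Merging registers whose lifetimes do not overlap can be sketched with the classic left-edge algorithm. The slides do not name an algorithm, so this is a swapped-in technique, and the lifetimes below are assumed half-open intervals [start, end) chosen to be consistent with the three-step schedule.

```python
# Assumed lifetimes as half-open intervals [start, end) in control-step boundaries.
LIFE = {"X": (1, 3), "Y": (2, 3), "F": (2, 4), "G": (3, 4)}


def overlap(a, b):
    """True if two half-open intervals share any point in time."""
    return a[0] < b[1] and b[0] < a[1]


def left_edge(life):
    """Pack values into registers: sort by start, reuse the first register that is free."""
    registers = []  # each register is the list of values it holds over time
    for name in sorted(life, key=lambda v: life[v]):
        for reg in registers:
            # The register is free once its most recent value has ended.
            if life[reg[-1]][1] <= life[name][0]:
                reg.append(name)
                break
        else:
            registers.append([name])
    return registers
```

Under these assumed lifetimes G overlaps neither X nor Y, so four values fit in three registers. The left-edge pass happens to place G with X, while the slide merges G with Y; both choices are legal, and which one is cheaper depends on the interconnect it implies.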
(Figure: area plotted against the number of control steps, 2 to 4, for two schedules of the example. Area is counted in ports, busses, registers, adders and multipliers.)
• Design with 2 multipliers: 2 ports, 3 registers, 5 busses, 1 adder, 2 multipliers
• Design with 1 multiplier: 3 ports, 4 registers, 6 busses, 1 adder, 1 multiplier
FIR Basics: HW Impl. perspective
• Two vectors of size k:
• x, the samples vector, and
• h, the impulse response of the filter, also known as the coefficients
• The x vector is also known as the delay line because it preserves the previous k-1 samples (the delayed samples)
• A new sample x(0) is taken every sample period, marked by the sample clock
• When a new sample arrives, the previous samples are shifted, so that the oldest sample x(k-1) is shifted out
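The behaviour described above can be modelled in a few lines. This is a software sketch, not a hardware description: the class name and method are hypothetical, and the delay line is assumed to start filled with zeros.

```python
from collections import deque


class FirFilter:
    """Behavioural model of a k-tap FIR filter with an explicit delay line."""

    def __init__(self, h):
        self.h = list(h)  # impulse response (coefficients)
        k = len(self.h)
        # Delay line x(0)..x(k-1), assumed to be cleared at start-up.
        self.x = deque([0] * k, maxlen=k)

    def step(self, sample):
        """One sample-clock tick: shift in the new sample and return the output."""
        # appendleft puts the sample at x(0); maxlen drops the oldest x(k-1).
        self.x.appendleft(sample)
        return sum(hi * xi for hi, xi in zip(self.h, self.x))
```

Feeding a single 1 followed by zeros makes the filter emit its coefficients one per sample period, which is a quick sanity check that the shift direction is right.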
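Algorithm or ???
(Figure: 5-tap FIR data flow graph. Coefficients c0 to c4 are multiplied with samples x0 to x4 by five multipliers, and the products are summed by four adders.)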
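(Figure: 5-tap FIR data flow graph with coefficients c0 to c4, samples x0 to x4, five multipliers and four adders, with control step C_step 1 marked. Figure labels: Sample Clk, System Clk, Critical Path, Delay Line, Adders, Multipliers, Registers, Multiplexors.)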
(Figure: 5-tap FIR data flow graph with coefficients c0 to c4, samples x0 to x4, five multipliers and four adders. Figure labels: Sample Clk, System Clk, Critical Path, Delay Line, Adders, Multipliers, Registers, Multiplexors.)
Multiply Add Accumulate (MAC)
(Figure: coefficients c0 ... cn-1 and samples x0 ... xn-1 feed a single multiplier ×, followed by an accumulating adder +.)
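A MAC-based datapath can be modelled as a loop that handles one tap per cycle. The sketch below is illustrative and assumes one multiply-accumulate per system clock cycle; the function name and the cycle counter are hypothetical.

```python
def mac_fir(c, x):
    """Compute one FIR output with a single multiplier and an accumulator.

    Returns the result and the number of MAC cycles spent on it.
    """
    acc = 0
    cycles = 0
    for ci, xi in zip(c, x):
        acc += ci * xi  # one multiply and one accumulate per cycle
        cycles += 1
    return acc, cycles
```

Under this assumption an n-tap filter needs n MAC cycles per output sample, so the system clock has to run at least n times faster than the sample clock.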
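(Figure: 5-tap FIR data flow graph with coefficients c0 to c4, samples x0 to x4, five multipliers and four adders, operands reordered. Figure labels: Sample Clk, System Clk, Critical Path, Delay Line, Adders, Multipliers, Registers, Multiplexors.)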
Symmetric FIR Filter
• Symmetry: ci = c(k-1-i), where k is the number of taps
• For k = 5: c0.x0 + c1.x1 + c2.x2 + c3.x3 + c4.x4
• With c0 = c4 and c1 = c3 this becomes c0.(x0 + x4) + c1.(x1 + x3) + c2.x2
• Roughly halves the number of multiplications (here from 5 to 3)
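The saving can be checked numerically. The following sketch compares the direct form with the folded form and counts multiplications; the function names are hypothetical and the coefficient values are arbitrary.

```python
def direct_fir(x, c):
    """Direct form: one multiplication per tap."""
    return sum(ci * xi for ci, xi in zip(c, x))


def symmetric_fir(x, c):
    """Folded form for symmetric coefficients: add mirrored samples first.

    Returns the result and the number of multiplications used.
    """
    k = len(c)
    if any(c[i] != c[k - 1 - i] for i in range(k)):
        raise ValueError("coefficients are not symmetric")
    acc = 0
    mults = 0
    for i in range(k // 2):
        acc += c[i] * (x[i] + x[k - 1 - i])  # pre-add, then one multiply
        mults += 1
    if k % 2:  # the middle tap has no mirror partner
        acc += c[k // 2] * x[k // 2]
        mults += 1
    return acc, mults
```

The price of the saved multipliers is one extra pre-adder per folded pair of taps.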
Structure of FIR
• The top level should be structural VHDL
• The FIR components: FSM, Coeff ROM, Delay Line and FIR Arithmetic
• The components should be behavioural, or behavioural RTL where necessary
(Figure: block diagram with FSM, Coeff ROM, Delay Line and FIR Arithmetic.)
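Fully Parallel
• The Delay Line: implemented as a shift register (x0 to x4)
• The Coefficients: hardwired (c0 to c4)
• Adder tree: four adders sum the products
(Figure: shift register x0 to x4, hardwired coefficients c0 to c4, and an adder tree.)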
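The adder tree of the fully parallel architecture can be modelled as a pairwise reduction. This is a sketch assuming a balanced tree in which an odd leftover value is passed unchanged to the next level; the function names are hypothetical and the input list must not be empty.

```python
def adder_tree(values):
    """Sum values pairwise, level by level.

    Returns (sum, number of adder levels, number of adders).
    """
    level = list(values)
    depth = 0
    adders = 0
    while len(level) > 1:
        # Add neighbouring pairs; each pair costs one adder.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:  # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth, adders


def fully_parallel_fir(c, x):
    """All products are formed in parallel, then reduced by the adder tree."""
    products = [ci * xi for ci, xi in zip(c, x)]
    return adder_tree(products)
```

For five taps this uses four adders arranged in three levels, so the critical path is one multiplier plus three adders, compared with four adders if the same products were summed in a chain.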