ECE 565 High-Level Synthesis--Introduction. Shantanu Dutt, ECE Dept., UIC. HLS Flow. Code/Algorithm → Architecture (interconnected functional units (FUs), memory units (MUs) via muxes, demuxes, tristate buffers, buses, dedicated interconnects).
ECE 565 High-Level Synthesis--Introduction
Shantanu Dutt, ECE Dept., UIC
HLS Flow
• Code/Algorithm → Architecture (interconnected functional units (FUs), memory units (MUs) via muxes, demuxes, tristate buffers, buses, dedicated interconnects)
• Classically, these 3 stages were performed sequentially; currently they are performed together, which leads to better optimization
HLS Flow (contd)
• (Binding) Allocation: simple counting of FUs after the above 2 stages
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+)
a) Non-overlapped scheduling
[Figure: datapath with registers a, b, c, d, x, y, z (load signals lda, ldb, ldc, ldd, ldx, ldy, ldz), one multiplier (X), one adder (+), two 2-input muxes (mux1, mux2; inputs I0, I1), and demuxes]
[Figure: schedule over cc's 1-6. X executes c1(1), c1(2); + executes c2(1), c3(1), c2(2), c3(2)]
Controller FSM (entered from Reset; state labels on the slide: cc 3i, cc 3(i+1), cc 3(i+2)):
• (c1) [x ← a × b]: ldx=1
• (c2) [y ← c+d]: mux1=0, mux2=0, demux=0, ldy=1
• (c3) [z ← x+y]: lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
• Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted (e.g., reg. "a" is loaded at the edge following the cc in which lda = 1).
• Note: Unspecified control signals have either an inactive value or, if such a concept doesn't exist for the control signal, the don't-care value.
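The three-state controller above can be sketched as a small cycle-level simulation. This is only an illustration under stated assumptions: the wiring (mux1/mux2 select the adder inputs c, d when 0 and x, y when 1; demux routes the adder output to y when 0 and to z when 1) is inferred from the control values listed on the slide, and preloading the first operand set at Reset is an assumption. The names `STATES` and `simulate_non_overlapped` are hypothetical.

```python
# Control word per state, copied from the slide; unspecified signals are
# treated as inactive (0).
STATES = [
    ("c1", {"ldx": 1}),
    ("c2", {"mux1": 0, "mux2": 0, "demux": 0, "ldy": 1}),
    ("c3", {"mux1": 1, "mux2": 1, "demux": 1, "ldz": 1,
            "lda": 1, "ldb": 1, "ldc": 1, "ldd": 1}),
]


def simulate_non_overlapped(input_sets):
    """Return (list of z outputs, number of clock cycles used)."""
    pending = list(input_sets)
    outputs, cycles = [], 0
    if not pending:
        return outputs, cycles
    # Assumption: Reset preloads registers a, b, c, d with the first operands.
    reg = dict(zip("abcd", pending.pop(0)), x=0, y=0, z=0)
    while True:
        for _name, cs in STATES:
            nxt = dict(reg)  # registers update at the end of the cycle
            if cs.get("ldx"):
                nxt["x"] = reg["a"] * reg["b"]          # the single multiplier
            add_out = ((reg["x"] if cs.get("mux1") else reg["c"])
                       + (reg["y"] if cs.get("mux2") else reg["d"]))  # the single adder
            if cs.get("ldy") and not cs.get("demux"):
                nxt["y"] = add_out
            if cs.get("ldz") and cs.get("demux"):
                nxt["z"] = add_out
            if cs.get("lda") and pending:
                nxt.update(zip("abcd", pending[0]))     # load next operand set
            reg = nxt
            cycles += 1
        outputs.append(reg["z"])
        if not pending:
            return outputs, cycles
        pending.pop(0)


print(simulate_non_overlapped([(2, 3, 4, 5), (1, 1, 1, 1)]))
```

Each iteration occupies three clock cycles because the states c1, c2, c3 never share a cycle, even though the multiplier and the adder are separate units.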
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+)
b) Overlapped scheduling
[Figure: same datapath as in (a): registers a, b, c, d, x, y, z with their load signals, one multiplier (X), one adder (+), muxes mux1 and mux2, and demuxes]
[Figure: schedule over cc's 1-6. X executes c1(1), c1(2); + executes c2(1), c3(1), c2(2), c3(2)]
Controller FSM (entered from Reset; state labels on the slide: cc 3i, cc 3(i+1)):
• (c1, c2) [y ← c+d, x ← a × b]: ldc=1, ldd=1, mux1=0, mux2=0, demux=0, ldx=1, ldy=1
• (c3) [z ← x+y]: lda=1, ldb=1, mux1=1, mux2=1, demux=1, ldz=1
• For 4 iterations, the overlapped schedule takes 9 cc's versus 12 cc's for the non-overlapped schedule.
• Overlapped schedule: time for n iterations = 2n+1 cc's
• Throughput = n/(2n+1) ~ 0.5 outputs/cc
• Non-overlapped schedule: time for n iterations = 3n cc's
• Throughput = n/3n ~ 0.33 outputs/cc
• ~50% throughput improvement using an overlapped schedule (0.5 vs. 0.33 outputs/cc), i.e., ~33% fewer cc's per output
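The cycle counts and throughputs quoted above can be checked with a few lines of arithmetic. The function names below are hypothetical; the formulas 3n and 2n+1 are the ones given on the slide.

```python
def cycles_non_overlapped(n):
    """Clock cycles for n iterations when c1, c2, c3 run back to back."""
    return 3 * n


def cycles_overlapped(n):
    """Clock cycles for n iterations when c1 and c2 share a cycle (2n + 1)."""
    return 2 * n + 1


def throughput(n, cycles):
    """Outputs produced per clock cycle."""
    return n / cycles


for n in (4, 1000):
    t_non = throughput(n, cycles_non_overlapped(n))
    t_ovl = throughput(n, cycles_overlapped(n))
    print(n, cycles_non_overlapped(n), cycles_overlapped(n),
          round(t_non, 3), round(t_ovl, 3), round(t_ovl / t_non, 3))
```

For large n, the ratio of the two throughputs approaches 3/2.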
Simple HLS Examples (contd)
• Some DFG control operation nodes:
  - Selector: inputs in1, in2 and a condition (T/F); output out
  - Distributor: input in and a condition (T/F); outputs out1, out2
• Conditional code: If (a > b) then c ← a-b; Else c ← b-a;
• Possible DFGs corresponding to the above conditional code: [Figure: DFGs not recoverable from the extracted text]
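To make the two control nodes concrete, here is a minimal sketch that models a selector and a distributor as plain functions and uses each to express the conditional code above. The slide's own DFGs did not survive extraction, so the two formulations below are illustrative assumptions, not a reconstruction of the figure; all function names are hypothetical.

```python
def selector(cond, in_true, in_false):
    """Selector node: forward one of two inputs depending on the condition."""
    return in_true if cond else in_false


def distributor(cond, value):
    """Distributor node: route one input to the T output or the F output.

    Returns (out_true, out_false); the output not taken carries None.
    """
    return (value, None) if cond else (None, value)


def diff_with_selector(a, b):
    # Both subtractions are evaluated; the selector picks the result.
    return selector(a > b, a - b, b - a)


def diff_with_distributors(a, b):
    # The operands are steered to exactly one subtraction.
    cond = a > b
    a_t, a_f = distributor(cond, a)
    b_t, b_f = distributor(cond, b)
    return a_t - b_t if cond else b_f - a_f


print(diff_with_selector(7, 3), diff_with_distributors(3, 7))
```

The selector form needs two subtractions (or two uses of one subtractor), while the distributor form activates only the branch that is taken.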
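Simple HLS Examples (contd)
• Iterative code: while (a > b) a ← a-b;
[Figure: DFG and datapath for the loop. Registers a, b, r1 (load signals lda, ldb, ldr1); a mux (select sel, inputs 0/1) and a distributor implemented as a demux (initialized to F); the comparison a > b is done on the adder (+) as a + b' + 1 with cin = 1, where b'+1 is the 2's complement representation of -b; the flag s xor ovfl (= 1: -ve, = 0: +ve) goes to the FSM; the final a is loaded with ldfina]
• Scheduling & binding: operations c1 and c2 are both bound to the adder (+) in successive cc's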
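The comparison through the adder can be illustrated in software. The sketch below computes a + b' + 1 on a fixed bit width, derives the "negative" flag as sign xor overflow, and uses it to drive the loop. It is a sketch under assumptions: the 8-bit default width, the function names, the treatment of a = b (exit when the difference is zero), and the restriction to b > 0 are not taken from the slide.

```python
def subtract_with_flags(a, b, bits=8):
    """Compute a - b as a + b' + 1 (cin = 1) on `bits`-bit 2's complement.

    Returns (result_bits, negative), where negative = sign xor overflow
    is 1 exactly when the true difference a - b is negative.
    """
    mask = (1 << bits) - 1
    ua = a & mask
    nb = ~b & mask                      # b' (bitwise complement of b)
    result = (ua + nb + 1) & mask       # adder output with carry-in 1
    sign = result >> (bits - 1)
    sa, snb = ua >> (bits - 1), nb >> (bits - 1)
    # Overflow: both addends have the same sign but the result sign differs.
    ovfl = int(sa == snb and sign != sa)
    return result, sign ^ ovfl


def iterate(a, b, bits=8):
    """while (a > b): a = a - b, with the test done via the adder flags."""
    limit = 1 << (bits - 1)
    if not (0 < b < limit and -limit <= a < limit):
        raise ValueError("assumes 0 < b and operands within the bit width")
    while True:
        diff, negative = subtract_with_flags(a, b, bits)
        if negative or diff == 0:       # a <= b: leave the loop
            return a
        a = diff                        # same adder result is reused as new a


print(iterate(17, 5))  # 17 -> 12 -> 7 -> 2
```

Because the difference computed for the test is also the next value of a, one adder can serve both the comparison and the loop body.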
Delay Nodes in DFGs
• A delay node is generally implemented as a register; a delay node thus becomes a state variable.
Delay Nodes in DFGs (contd)
[Figure: transformation of a delay node in the DFG, and its mapping to a register in the architecture]
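A delay node behaving as a register can be sketched as follows. The example DFG, y[n] = x[n] + x[n-1], and the names `DelayNode` and `run_dfg` are hypothetical and chosen only to show the stored value acting as a state variable.

```python
class DelayNode:
    """One-sample delay: outputs the value stored on the previous clock."""

    def __init__(self, initial=0):
        self.state = initial            # register contents (state variable)

    def clock(self, value):
        """Return the stored value, then load the new one (register update)."""
        out = self.state
        self.state = value
        return out


def run_dfg(samples):
    """Evaluate y[n] = x[n] + x[n-1] with x[-1] = 0."""
    delay = DelayNode()
    return [x + delay.clock(x) for x in samples]


print(run_dfg([1, 2, 3]))  # [1, 3, 5]
```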
Detailed HLS Example (contd)
[Figure: the synthesized architecture]
• Note: It is not clear how register allocation has been done; it is sub-optimal.
Detailed HLS Example--Register Allocation (contd)
• In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
• Graph coloring in general is NP-hard
• The above type of conflict graph is called an interval graph (each node corresponds to a 1-dimensional interval, here a variable's lifetime)
• For interval graphs, min. graph coloring can be solved optimally in linear time once the intervals are sorted by left endpoint (using the left-edge algorithm that we will see later for channel routing)
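The register-allocation idea above can be sketched as a greedy pass over lifetimes sorted by their left edge. This is a minimal first-fit variant in the spirit of the left-edge algorithm, not necessarily the formulation used later in the course. Lifetimes are assumed to be half-open intervals [start, end), so a variable that dies at cc t can share a register with one born at cc t. The function name and the example variables are hypothetical.

```python
def allocate_registers(lifetimes):
    """Assign a register index to each variable.

    lifetimes: dict mapping variable name -> (start, end), half-open.
    Returns a dict mapping variable name -> register index.
    """
    register_free_at = []   # register_free_at[r] = end of last lifetime in r
    assignment = {}
    # Visit variables in order of increasing left edge (lifetime start).
    for var, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for r, free_at in enumerate(register_free_at):
            if free_at <= start:        # no overlap: reuse this register
                register_free_at[r] = end
                assignment[var] = r
                break
        else:                           # every register is still live
            register_free_at.append(end)
            assignment[var] = len(register_free_at) - 1
    return assignment


example = {"v1": (0, 3), "v2": (1, 4), "v3": (3, 5), "v4": (4, 6)}
print(allocate_registers(example))
# {'v1': 0, 'v2': 1, 'v3': 0, 'v4': 1}
```

A new register is opened only when all existing ones hold a variable that is live at the new variable's start, so the number of registers used equals the maximum number of simultaneously live variables, which is the minimum possible. The inner scan makes this sketch slower than linear; keeping the free registers in a priority queue would remove that cost.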