High-Level Synthesis with LegUp: A Crash Course for Users and Researchers
Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi
11 February 2013
ACM FPGA Symposium, Monterey, CA
Dept. of Electrical and Computer Engineering, University of Toronto
*US Bureau of Labour Statistics ‘08
int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return (sum);
}
....
[Figure: LegUp design flow. A C compiler turns the program code into a binary for a self-profiling MIPS processor. The profiling data (execution cycles, power, cache misses) suggests program segments to target to HW. High-level synthesis hardens those program segments in the FPGA fabric, and the altered SW binary calls the HW accelerators.]
[Figure: target system on an Altera DE2 or DE4 board. The FPGA (Cyclone II or Stratix IV) contains the MIPS processor and the hardware accelerators with their memories, connected through the Avalon interface to an on-chip cache memory and a memory controller for the off-chip memory.]
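[Figure: hardware profiler next to the MIPS processor. Its inputs are the PC and the instruction (instr) from the instruction cache, which passes through an op decoder.]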
Address hash (in hardware):
tAddr += V1
tAddr += (tAddr << 8)
tAddr ^= (tAddr >> 4)
b = (tAddr >> B1) & B2
a = (tAddr + (tAddr << A1)) >> A2
fNum = (a ^ tab[b])
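The hash steps above translate almost line for line into C. The following is a minimal sketch, not the profiler's actual implementation: the slide does not give values for V1, A1, A2, B1, B2 or tab, so the `hash_params` struct, the 32-bit width and the function name `addr_hash` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter bundle: the slide names V1, A1, A2, B1, B2 and tab
 * but does not state their values. */
struct hash_params {
    uint32_t v1;         /* additive constant V1 */
    unsigned a1, a2;     /* shift amounts A1 and A2 */
    unsigned b1;         /* shift amount B1 */
    uint32_t b2;         /* mask B2 */
    const uint8_t *tab;  /* lookup table with at least b2 + 1 entries */
};

/* Hashes a target address to a function number, following the slide's steps.
 * A 32-bit address is assumed. */
static uint32_t addr_hash(uint32_t t_addr, const struct hash_params *p)
{
    assert(p->a1 < 32 && p->a2 < 32 && p->b1 < 32);  /* keep shifts defined */
    t_addr += p->v1;
    t_addr += t_addr << 8;
    t_addr ^= t_addr >> 4;
    uint32_t b = (t_addr >> p->b1) & p->b2;
    uint32_t a = (t_addr + (t_addr << p->a1)) >> p->a2;
    return a ^ p->tab[b];
}
```

In such a scheme the constants and the table would presumably be generated per program so that each function's address maps to a distinct function number.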
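[Figure, continued: the hashed target address gives the function #. A call stack of function numbers is updated on call and ret. A data counter for the current function is incremented when the PC changes and is reset on ret or call. A counter storage memory holds the counts for all functions, addressed by F# (the popped F# on a return).]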
See paper IEEE ASAP’11
int main () {
    …
    sum = dotproduct(N);
    ...
}
int dotproduct(int N) {
    …
    for (i=0; i<N; i++) {
        sum += A[i] * B[i];
    }
    return sum;
}
#define dotproduct_DATA   (volatile int *) 0xf0000000
#define dotproduct_STATUS (volatile int *) 0xf0000008
#define dotproduct_ARG1   (volatile int *) 0xf000000C
int legup_dotproduct(int N) {
    *dotproduct_ARG1 = (volatile int) N;   /* pass the argument */
    *dotproduct_STATUS = 1;                /* start the accelerator */
    return *dotproduct_DATA;               /* read back the result */
}
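The wrapper's three register accesses can be tried on a host machine if the base address becomes a parameter. This is a sketch under assumptions: the struct `accel_regs` and the function `accel_call` are hypothetical names, and the field at offset 0x4 is not shown on the slide, so it is treated here as padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register layout implied by the slide's addresses (base 0xf0000000). */
struct accel_regs {
    volatile int32_t data;    /* +0x0: return value */
    volatile int32_t gap;     /* +0x4: not shown on the slide; padding here */
    volatile int32_t status;  /* +0x8: written with 1 by the wrapper */
    volatile int32_t arg1;    /* +0xC: first argument */
};

/* Same three steps as legup_dotproduct, with the register block passed in. */
static int accel_call(struct accel_regs *regs, int n)
{
    assert(regs != NULL);
    regs->arg1 = n;       /* pass the argument */
    regs->status = 1;     /* request a run */
    return regs->data;    /* read back the result */
}
```

On the target the pointer would presumably be `(struct accel_regs *) 0xf0000000`; on a host, an ordinary struct instance with a preset `data` field stands in for the accelerator.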
HLS configuration: set_accelerator_function "dotproduct"
HLS output: HW accelerator
int main () {
    …
    sum = legup_dotproduct(N);
    ...
}
SW: runs on the MIPS processor
[Figure: 32-bit address format. Bits 31 to 23 hold a 9-bit tag and bits 22 to 0 hold a 23-bit index. The tag selects a BRAM (the example shows BRAM Tag=2 and BRAM Tag=3 holding elements 0 to 99 of arrays A and B), and the index selects the element, for example Tag=2 Index=3, Tag=3 Index=7 and Tag=2 Index=13.]
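The 9-bit tag / 23-bit index split of a 32-bit address can be expressed with a shift and a mask. This is a small sketch; the helper names `addr_tag`, `addr_index` and `make_addr` are made up for illustration, and the index is treated as an opaque 23-bit value.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 23u
#define TAG_BITS   9u
#define INDEX_MASK ((UINT32_C(1) << INDEX_BITS) - 1u)

/* Bits 31..23: which memory (BRAM) the address refers to. */
static uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }

/* Bits 22..0: position inside that memory. */
static uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }

/* Builds an address from a tag and an index. */
static uint32_t make_addr(uint32_t tag, uint32_t index)
{
    assert(tag < (UINT32_C(1) << TAG_BITS) && index <= INDEX_MASK);
    return (tag << INDEX_BITS) | index;
}
```

For example, `make_addr(2, 13)` yields 0x0100000D, and decoding that value gives back tag 2 and index 13.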
Hybrid (software/hardware):
18x less energy than software
Results now considerably better than LegUp 1.0 release
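C Program:
int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
....
[Figure: the compiler (LLVM) turns the C program into a control flow graph (CFG) with basic blocks BB0, BB1 and BB2. A second view shows the operations inside the basic blocks: loads, additions (+) and a store.]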
C Program -> C Compiler (LLVM) -> Optimized LLVM IR -> Allocation -> Scheduling -> Binding -> RTL Generation -> Synthesizable Verilog
Additional input to the flow: Target H/W Characterization
[Figure: from schedule to hardware. The schedule assigns the loads, additions (+) and the store to FSM states 0 to 3. The datapath implements the scheduled operations with a 2-port RAM, adders, a shifter (<<) and flip-flops (FF).]
Data flow graph (DFG): already accessible in LLVM.
[Figure: example DFG with add, shift, sub, mod, xor, shr and or operations.]
[Figure: DFG of additions (+) with nodes A to H.]
Say we want to schedule this when only 2 adders are available in the HW (lab #2).
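To make the effect of an adder limit concrete, here is a sketch of greedy list scheduling, a simple stand-in technique that is not necessarily the scheduler LegUp uses. The dependence graph is hypothetical (the figure's edges are not recoverable): A to D are independent additions, E uses A and B, F uses C and D, G uses E and F, and H uses G. Every addition is assumed to take one cycle.

```c
#include <assert.h>

#define MAX_OPS 16

/* Hypothetical DFG: row i lists the operations whose results op i consumes.
 * Index 0..7 stands for A..H. */
static const int example_dep[8][MAX_OPS] = {
    [4] = {[0] = 1, [1] = 1},   /* E uses A and B */
    [5] = {[2] = 1, [3] = 1},   /* F uses C and D */
    [6] = {[4] = 1, [5] = 1},   /* G uses E and F */
    [7] = {[6] = 1},            /* H uses G */
};

/* Greedy list scheduling of 1-cycle additions on a limited number of adders.
 * start[i] receives the cycle of op i; the return value is the schedule
 * length in cycles. The graph must be acyclic. */
static int list_schedule(int n, const int dep[][MAX_OPS], int adders, int start[])
{
    assert(n > 0 && n <= MAX_OPS && adders >= 1);
    int done = 0, cycle = 0;
    for (int i = 0; i < n; i++) start[i] = -1;        /* -1 = not scheduled */
    while (done < n) {
        int used = 0;                                  /* adders busy this cycle */
        for (int i = 0; i < n && used < adders; i++) {
            if (start[i] >= 0) continue;
            int ready = 1;
            for (int j = 0; j < n; j++)                /* inputs must come from earlier cycles */
                if (dep[i][j] && (start[j] < 0 || start[j] >= cycle)) ready = 0;
            if (ready) { start[i] = cycle; used++; done++; }
        }
        cycle++;
    }
    return cycle;
}
```

On this hypothetical graph, 2 adders give a 5-cycle schedule, while 8 adders give 4 cycles, which is the length of the longest dependence chain.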
[Figure labels: A B C E F D G H; operations, edge costs, hardware functional units.]
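for (int i = 0; i < N; i++) {
    sum[i] = a + b + c + d;
}
[Figure: the three additions are scheduled in cycles 1, 2 and 3 (a + b, then + c, then + d); the point where the pipeline reaches steady state is marked.]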
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}
%i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
%scevgep5 = getelementptr %b, %i.04
%0 = load %scevgep5
%scevgep6 = getelementptr %c, %i.04
%1 = load %scevgep6
%2 = add nsw i32 %1, %0
%scevgep = getelementptr %a, %i.04
store %2, %scevgep
%3 = add %i.04, 1
%exitcond = eq %3, 100
br %exitcond, %bb2, %bb
[Figure: cycle-by-cycle schedule of successive iterations of the a[i] = b[i] + c[i] loop. Overlapping the iterations leads to a memory port conflict.]
Finding the initiation interval (II):
1. II = minII (input: # of functional units)
2. Attempt to modulo schedule loop with II
3. Fail: II = II + 1, then repeat step 2
4. Success: done
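The II search above can be sketched in a few dozen lines. This is a simplified greedy version under stated assumptions, not LegUp's scheduler: the `op` and `carried` structs, the two resource classes, and the function names are all hypothetical, each operation occupies its resource for one cycle, and minII is derived from resource counts only.

```c
#include <assert.h>
#include <string.h>

enum { RES_MEM, RES_ADD, NUM_RES };
#define MAX_OPS 8
#define MAX_II  16

struct op {
    int res;       /* resource class occupied for one cycle */
    int latency;   /* cycles until the result is available */
    int npreds;    /* number of same-iteration predecessors */
    int preds[2];  /* their indices; each must be smaller than this op's index */
};

/* Loop-carried dependence: op `to` of iteration k+dist needs op `from` of iteration k. */
struct carried { int from, to, dist; };

/* Lower bound on II from resource usage alone. */
static int min_ii(const struct op *ops, int n, const int units[NUM_RES])
{
    int uses[NUM_RES] = {0}, mii = 1;
    for (int i = 0; i < n; i++) uses[ops[i].res]++;
    for (int r = 0; r < NUM_RES; r++) {
        int need = (uses[r] + units[r] - 1) / units[r];   /* ceiling division */
        if (need > mii) mii = need;
    }
    return mii;
}

/* One greedy scheduling attempt at a fixed II; returns 1 on success, 0 on failure. */
static int try_ii(const struct op *ops, int n, const int units[NUM_RES],
                  const struct carried *c, int nc, int ii, int start[])
{
    int mrt[NUM_RES][MAX_II];                  /* modulo reservation table */
    memset(mrt, 0, sizeof mrt);
    for (int i = 0; i < n; i++) {
        int ready = 0, t;
        for (int p = 0; p < ops[i].npreds; p++) {
            int j = ops[i].preds[p];
            if (start[j] + ops[j].latency > ready) ready = start[j] + ops[j].latency;
        }
        for (t = ready; t < ready + ii; t++)   /* ii consecutive steps cover every slot */
            if (mrt[ops[i].res][t % ii] < units[ops[i].res]) break;
        if (t == ready + ii) return 0;         /* resource is full in every slot */
        mrt[ops[i].res][t % ii]++;
        start[i] = t;
    }
    for (int k = 0; k < nc; k++)               /* loop-carried dependences must hold */
        if (start[c[k].from] + ops[c[k].from].latency > start[c[k].to] + c[k].dist * ii)
            return 0;
    return 1;
}

/* Start at minII and raise II until an attempt succeeds; -1 if none does. */
static int modulo_schedule(const struct op *ops, int n, const int units[NUM_RES],
                           const struct carried *c, int nc, int start[])
{
    assert(n > 0 && n <= MAX_OPS);
    for (int r = 0; r < NUM_RES; r++) assert(units[r] >= 1);
    for (int ii = min_ii(ops, n, units); ii <= MAX_II; ii++)
        if (try_ii(ops, n, units, c, nc, ii, start)) return ii;
    return -1;
}
```

With two loads and one store per iteration and an assumed two memory ports, the search starts at II = 2 and succeeds immediately. A loop whose store feeds the next iteration's load starts at II = 1 and, with assumed latencies of 2 (load), 1 (add) and 1 (store), fails three times before succeeding at II = 4.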
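i: time step at which an operation is scheduled
i + k*II, k = 0 to N-1: time steps at which the same operation executes in the other iterations
(i-1) mod II + 1: its time slot modulo II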
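The two expressions can be checked with a couple of one-line helpers. This sketch assumes time steps and slots are both counted from 1, as the "+ 1" in the slot expression suggests; the function names are hypothetical.

```c
#include <assert.h>

/* Time step at which the operation scheduled at step i executes in
 * iteration k (k = 0 .. N-1), for initiation interval ii. */
static int step_in_iteration(int i, int k, int ii)
{
    assert(i >= 1 && k >= 0 && ii >= 1);
    return i + k * ii;
}

/* Slot (1 .. ii) that time step i falls into when time is folded modulo ii. */
static int modulo_slot(int i, int ii)
{
    assert(i >= 1 && ii >= 1);
    return (i - 1) % ii + 1;
}
```

Because every repetition of an operation is a multiple of II away from the first, all of them land in the same slot: with II = 3, steps 2, 5, 8 and so on all map to slot 2.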
[Figure: pipelined execution of iterations i=0 to i=4 through Stage 1, Stage 2 and Stage 3 (3 cycles). The schedule is divided into a prologue, a kernel (steady state) and an epilogue.]
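The prologue, kernel and epilogue regions follow directly from which stages are busy in a given cycle. The sketch below assumes II = 1, one cycle per stage, 0-based cycle and stage numbers, and at least as many iterations as stages; all names are hypothetical.

```c
#include <assert.h>

enum phase { PROLOGUE, KERNEL, EPILOGUE };

/* Total cycles for n iterations through `stages` one-cycle stages at II = 1. */
static int total_cycles(int n, int stages)
{
    assert(stages >= 1 && n >= stages);
    return n + stages - 1;
}

/* Iteration occupying `stage` in `cycle`, or -1 if that stage is idle. */
static int iteration_in_stage(int cycle, int stage, int n)
{
    int it = cycle - stage;
    return (it >= 0 && it < n) ? it : -1;
}

/* Region of the schedule that `cycle` belongs to. */
static enum phase phase_of(int cycle, int n, int stages)
{
    if (cycle < stages - 1) return PROLOGUE;   /* pipeline still filling */
    if (cycle < n) return KERNEL;              /* every stage is busy */
    return EPILOGUE;                           /* pipeline draining */
}
```

For 5 iterations and 3 stages this gives 7 cycles in total: 2 prologue cycles, 3 kernel cycles and 2 epilogue cycles.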
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        a[j] = b[i] + a[j-1];
a[j-1] depends on the previous iteration of the inner loop.
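A dependence on an earlier iteration puts a floor under the initiation interval: the producing chain must finish before the consuming iteration needs the value. The sketch below uses the standard bound ceil(delay / distance) from the modulo scheduling literature; the function name and the latencies in the example are assumptions, not figures from the slides.

```c
#include <assert.h>

/* Smallest II that lets a value produced in one iteration reach the
 * iteration `distance` later, when the dependence chain takes `delay` cycles. */
static int recurrence_min_ii(int delay, int distance)
{
    assert(delay >= 0 && distance >= 1);
    return (delay + distance - 1) / distance;   /* ceiling division */
}
```

If the chain load, add, store took 2 + 1 + 1 = 4 cycles (assumed latencies) and the value is needed one iteration later, as with a[j-1], the bound is II >= 4.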