High-Level Synthesis with LegUp: A Crash Course for Users and Researchers

Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi

11 February 2013

ACM FPGA Symposium, Monterey, CA

Dept. of Electrical and Computer Engineering, University of Toronto


[World map: cities with LegUp downloads, including Berlin, Hong Kong, New York City, and Tokyo]


Tutorial Outline

  • Overview of LegUp and its algorithms (60 min)

  • Labs (“hands on” via VirtualBox)

    • Lab 1: Using the LegUp Framework (30 min)

    • Break

    • Lab 2: Adding resource constraints (30 min)

    • Lab 3: Changing How LegUp implements hardware (30 min)


Project Motivation

  • Hardware design has advantages over software:

    • Speed

    • Energy-efficiency

  • Hardware design is difficult and skills are rare:

    • 10 software engineers for every hardware engineer*

  • We need a CAD flow that simplifies hardware design for software engineers

*U.S. Bureau of Labor Statistics, 2008


Top-Level Vision

int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
....

[Flow diagram: program code -> C compiler -> self-profiling MIPS processor -> profiling data (execution cycles, power, cache misses) -> suggested program segments to target to HW -> high-level synthesis -> hardened program segments on the FPGA fabric; the altered SW binary calls the HW accelerators]


LegUp: Key Features

  • C to Verilog high-level synthesis

  • Many benchmarks (incl. 12 CHStone)

  • MIPS processor (Tiger)

  • Hardware profiler

  • Automated verification tests

  • Open source, freely downloadable

    • Like ABC (Synthesis) or VPR (Place & Route)

    • 600+ downloads since March 2011

    • http://legup.eecg.utoronto.ca


System Architecture

[Diagram: ALTERA DE2 or DE4 board. On the FPGA (Cyclone II or Stratix IV), the MIPS processor and hardware accelerators (each with local memory) communicate over the AVALON INTERFACE and share an on-chip cache memory; a memory controller connects to off-chip memory]


High-Level Synthesis Framework

  • Leverage LLVM compiler infrastructure:

    • Language support: C/C++

    • Standard compiler optimizations

    • More on this shortly

  • We support a large subset of ANSI C


Hardware Profiler Architecture

  • Monitor instr. bus to detect function call/ret.

  • Call: Hash (in HW) from function address to index; push to stack.

  • Ret: pop function index from stack.

  • Use function indexes to associate profiling data (e.g. cycles, power) with counters.

MIPS P

instr

Instr. $

PC

Op Decoder

tAddr+= V1

tAddr += (tAddr << 8)

tAddr ^= (tAddr >> 4)

b = (tAddr >> B1) & B2

a = (tAddr + (tAddr << A1)) >> A2

fNum = (a ^ tab[b])

Address Hash

(in hardware)

ret

call

target

address

Call Stack

counter

0

1

0

1

function #

reset

Data Counter(for current function)

(ret | call)

Popped F#

0

+

count

Incr. when PC changes

F#

Counter Storage

Memory

(for all functions)

PC

count

See our IEEE ASAP’11 paper for details.
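For reference, a minimal C model of the hash above. The constants V1, B1, B2, A1, A2 and the table tab are per-program tuning parameters; the values below are placeholders, not LegUp's actual settings:

#include <stdint.h>

/* Placeholder hash parameters; tuned per program in the real profiler. */
#define V1 0x9E3779B9u   /* additive constant (hypothetical value) */
#define B1 10            /* table-index shift (hypothetical value) */
#define B2 0xFFu         /* table-index mask (hypothetical value) */
#define A1 3             /* final-mix shift (hypothetical value) */
#define A2 16            /* final-mix shift (hypothetical value) */

static uint32_t tab[256];   /* correction table, filled when the profiler is configured */

/* Map a call's target address to a compact function number. */
uint32_t hash_function_addr(uint32_t tAddr) {
    tAddr += V1;
    tAddr += tAddr << 8;
    tAddr ^= tAddr >> 4;
    uint32_t b = (tAddr >> B1) & B2;
    uint32_t a = (tAddr + (tAddr << A1)) >> A2;
    return a ^ tab[b];
}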


Processor/Accelerator Hybrid Flow

int main() {
    sum = dotproduct(N);
    ...
}

int dotproduct(int N) {
    for (i = 0; i < N; i++) {
        sum += A[i] * B[i];
    }
    return sum;
}


Processor/Accelerator Hybrid Flow

The program itself is unchanged; LegUp also generates a software wrapper that drives the accelerator through memory-mapped registers (write the argument to ARG1, write 1 to STATUS to start, read the result from DATA):

#define dotproduct_DATA (volatile int *) 0xf0000000
#define dotproduct_STATUS (volatile int *) 0xf0000008
#define dotproduct_ARG1 (volatile int *) 0xf000000C

int legup_dotproduct(int N) {
    *dotproduct_ARG1 = (volatile int) N;
    *dotproduct_STATUS = 1;
    return *dotproduct_DATA;
}


Processor/Accelerator Hybrid Flow

The function is marked for hardware in the Tcl configuration:

    set_accelerator_function “dotproduct”

HLS then compiles dotproduct() into a HW accelerator.


Processor/Accelerator Hybrid Flow

Finally, the call site in main() is rewritten to invoke the generated wrapper:

    sum = legup_dotproduct(N);

The rest of the program runs as SW on the MIPS processor.


How Does LegUp Handle Memory and Pointers?

  • LegUp stores each array in a separate FPGA BRAM

  • BRAM data width matches the data in the array

  • Each BRAM is identified by a 9-bit tag

  • Addresses consist of the RAM tag and array index:

  • A shared memory controller uses the tag bits to determine which BRAM to read from or write to

  • The array index is the address passed to the BRAM

Address layout: bits 31-23 hold the 9-bit tag; bits 22-0 hold the 23-bit index.
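In C terms, the encoding is a shift and an OR. A small sketch (the macro and function names here are ours, for illustration):

#include <stdint.h>

#define INDEX_BITS 23
#define INDEX_MASK ((1u << INDEX_BITS) - 1)   /* low 23 bits: 0x007FFFFF */

/* Pack a 9-bit BRAM tag and a 23-bit array index into a 32-bit address. */
static inline uint32_t make_addr(uint32_t tag, uint32_t index) {
    return (tag << INDEX_BITS) | (index & INDEX_MASK);
}
/* Example: make_addr(2, 3) == 0x01000003, the address of A[3] on the next slide. */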


Pointer Example

  • We have two arrays in the C function:

    • int A[100], B[100]

  • Tag 0 is reserved for NULL pointers

  • Tag 1 is reserved for off-chip memory

  • Assign tag 2 to array A and tag 3 to array B

  • Address of A[3]: Tag=2, Index=3 (bits 31-23 = 2, bits 22-0 = 3, i.e. 0x01000003)
  • Address of B[7]: Tag=3, Index=7 (0x01800007)


Shared Memory Controller

  • Arrays A and B each occupy a 100-element BRAM (BRAM Tag=2 holds A, BRAM Tag=3 holds B)
  • Load from pointer D: the tag field selects which BRAM's output to use, and the index field addresses within that BRAM

[Diagram: for address Tag=2, Index=13, the index reads element 13 of both BRAMs in parallel and the tag steers a mux to return A[13]]
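The steering logic can be modeled behaviorally in C as follows (a sketch, not LegUp's generated RTL; the names are ours, and the index is assumed to be in range):

#include <stdint.h>

#define INDEX_BITS 23
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

static int A[100], B[100];   /* each array lives in its own BRAM in hardware */

/* Behavioral model of the shared memory controller's load path:
   the tag selects the BRAM, the index addresses within it. */
int mem_load(uint32_t addr) {
    uint32_t tag   = addr >> INDEX_BITS;
    uint32_t index = addr & INDEX_MASK;
    switch (tag) {
        case 2:  return A[index];   /* BRAM Tag=2 holds A */
        case 3:  return B[index];   /* BRAM Tag=3 holds B */
        default: return 0;          /* tag 0: NULL; tag 1: off-chip (not modeled) */
    }
}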


Core Benchmarks (+Many More)

  • 12 CHStone Benchmarks (JIP’09) and Dhrystone

    • Too large/complex for academic HLS tools
    • Not supported by academic tools
  • Include golden input/output test vectors


Experimental Results: LegUp 1.0 (2011) for Cyclone II

  1. Pure software on MIPS
  Hybrid (software/hardware):
  2. Second most compute-intensive function (and its descendants) in HW
  3. Same as 2, but with the most compute-intensive function in HW as well
  4. Pure hardware using LegUp
  5. Pure hardware using eXCite (commercial tool)


Experimental Results


Comparison: LegUp vs. eXCite

  • Benchmarks compiled to hardware

  • eXCite: Commercial high-level synthesis tool

    • Couldn’t compile Dhrystone


Energy Consumption

18x less energy than software


Current Release: LegUp 3.0

  • Loop pipelining

  • Dual and multi-ported memory support

  • Bitwidth minimization

  • Multi-pumping DSP units for area reduction

  • Alias analysis for dependency checks

  • Parallel accelerators via Pthreads & OpenMP

  • Results are now considerably better than the LegUp 1.0 release


LegUp 3.0 vs. LegUp 1.0


LLVM Compiler and HLS Algorithms


LLVM Compiler

  • Open-source compiler framework.

    • http://llvm.org

  • Used by Apple, NVIDIA, AMD, others.

  • Competitive quality with gcc.

  • LegUp HLS is a “back-end” of LLVM.

  • LLVM: low-level virtual machine.


LLVM Compiler

  • LLVM will compile C code into a control flow graph (CFG)

  • LLVM will perform standard optimizations

    • 50+ different optimizations in LLVM

[Diagram: the C program below is compiled by LLVM into a CFG of basic blocks BB0, BB1, BB2]

int FIR(int ntaps, int sum) {
    int i;
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}


Control Flow Graph

  • A control flow graph is composed of basic blocks
  • A basic block is a sequence of instructions terminated by exactly one branch
    • Each basic block can be represented by an acyclic data flow graph:

[DFG for one basic block: three loads feed two adds, whose result is stored]


LLVM Details

  • Instructions in basic blocks are either primitive computational operations:
    • shift, add, divide, xor, and, etc.
  • or control-flow operations:
    • branch, call, etc.

  • The CDFG is represented in LLVM’s intermediate representation (IR)

    • IR is machine-independent assembly code.


High-Level Synthesis Flow

C Program -> C Compiler (LLVM) -> Optimized LLVM IR -> Allocation -> Scheduling -> Binding -> RTL Generation -> Synthesizable Verilog

  • Allocation is guided by target H/W characterization
  • Scheduling is guided by user constraints:
    • Timing
    • Resource


Scheduling

  • Scheduling is the task of assigning operations to clock cycles; the schedule is implemented with a finite state machine

[Schedule/FSM: the three loads, two adds, and store of the DFG assigned across States 0-3]


Binding

  • Binding is the task of assigning scheduled operations to functional units in the datapath

[Datapath: the scheduled loads share a 2-port RAM and the adds share adders, with flip-flops holding intermediate values]


High-Level Synthesis: Scheduling


SDC Scheduling

  • SDC = System of Difference Constraints

    • Cong, Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”. DAC 2006: 433-438.

  • Basic idea: formulate scheduling as a mathematical optimization problem

    • Linear objective function + linear constraints (==, <=, >=).

  • The problem is a linear program (LP)
    • Solvable in polynomial time with standard LP solvers; the difference-constraint matrix is totally unimodular, so an optimal solution is integral (valid cycle numbers)


Define Variables

  • For each operation i to schedule, create a variable t_i
  • The t_i's will hold the cycle # in which each op is scheduled
  • Here we have: t_add, t_shift, t_sub

[DFG: an add and a shift both feed a subtract; the data flow graph is already accessible in LLVM]


Dependency Constraints

  • In this example, the subtract can only happen after the add and the shift:
    • t_sub - t_add >= 0
    • t_sub - t_shift >= 0
  • Hence the name difference constraints


Handling Clock Period Constraints

  • Target clock period: P (e.g., 10 ns)
  • For each chain of dependent operations in the DFG, estimate the path delay D (using LegUp's delay models)
    • E.g.: D from mod -> or = 23 ns
  • Compute: R = ceiling(D/P) - 1
    • E.g.: R = ceiling(23/10) - 1 = 2
  • Add the difference constraint:
    • t_or - t_mod >= 2

[DFG: a chain of dependent operations mod -> xor -> shr -> or]


Resource Constraints

  • Restriction on # of operations of a given type that can execute in a cycle

  • Why do we need it?

    • Want to use dual-port RAMs in FPGA

      • Allow up to 2 load/store operations in a cycle

    • Floating point

      • Do not want to instantiate many FP cores of a given type, probably just one

      • Scheduling must honour # of FP cores available


Resource Constraints in SDC

  • Res-constrained scheduling is NP-hard.

  • Implemented approach in [Cong & Zhang DAC2006]

[DFG: eight additions, A through H, with dependencies among them]

Say we want to schedule when we only have 2 adders in the HW (lab #2).


Add SDC Constraints

  • Generate a topological ordering of the resource-constrained operations
  • Say we are constrained to 2 adders in HW
  • Starting at C (the third add) in the ordering, create the constraint: t_C - t_A > 0
  • Next consider E, and add the constraint: t_E - t_B > 0
  • Continue to the end of the ordering
  • The resulting schedule will have <= 2 adds per cycle

Topological ordering: A B C E F D G H
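A minimal C sketch of this constraint generation, assuming ops[] already holds the resource-constrained operations in topological order (the data structures are ours, for illustration):

#include <stdio.h>

#define NUM_ADDERS 2   /* resource constraint: 2 adders, as in lab #2 */

/* For each op beyond the first NUM_ADDERS in the topological ordering,
   emit the SDC constraint t_ops[i] - t_ops[i - NUM_ADDERS] > 0. */
void add_resource_constraints(const char *ops[], int n) {
    for (int i = NUM_ADDERS; i < n; i++)
        printf("t_%s - t_%s > 0\n", ops[i], ops[i - NUM_ADDERS]);
}

int main(void) {
    const char *ops[] = { "A", "B", "C", "E", "F", "D", "G", "H" };
    add_resource_constraints(ops, 8);   /* first line printed: t_C - t_A > 0 */
    return 0;
}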


ASAP Objective Function

  • Minimize the sum of the variables: min Σ t_i
  • Operations will be scheduled as early as possible, subject to the constraints
  • The resulting LP is solvable in polynomial time
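Putting the pieces together for the small add/shift/sub example, the complete SDC formulation can be written as the following LP (a sketch using only the constraints built on the preceding slides):

\begin{align*}
\text{minimize}\quad & t_{add} + t_{shift} + t_{sub} \\
\text{subject to}\quad & t_{sub} - t_{add} \ge 0 \\
& t_{sub} - t_{shift} \ge 0 \\
& t_i \ge 0 \quad \text{for all operations } i
\end{align*}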


High-Level Synthesis: Binding


High-Level Synthesis: Binding

  • Weighted bipartite matching-based binding

    • Huang, Chen, Lin, Hsu, “Data path allocation based on bipartite weighted matching”. DAC 1990: 499-504.

  • Finds the minimum weighted matching of a bipartite graph at each step

    • Solve using the Hungarian Method (polynomial)

[Bipartite graph: operations on one side, hardware functional units on the other, connected by edges with costs]
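To make the matching step concrete, here is a toy minimum-cost assignment of 3 operations to 3 functional units, solved by brute force over all 3! permutations. LegUp uses the polynomial-time Hungarian method instead, and the costs below are made up for illustration:

#include <stdio.h>

int main(void) {
    /* cost[op][unit]: e.g., extra multiplexing incurred if op binds to unit */
    int cost[3][3] = { {1, 3, 3}, {2, 1, 3}, {3, 2, 1} };
    int perm[6][3] = { {0,1,2}, {0,2,1}, {1,0,2}, {1,2,0}, {2,0,1}, {2,1,0} };
    int best = -1, best_cost = 0;

    /* Try every one-to-one assignment and keep the cheapest. */
    for (int p = 0; p < 6; p++) {
        int c = cost[0][perm[p][0]] + cost[1][perm[p][1]] + cost[2][perm[p][2]];
        if (best < 0 || c < best_cost) { best = p; best_cost = c; }
    }
    for (int op = 0; op < 3; op++)
        printf("operation %d -> unit %d\n", op, perm[best][op]);
    printf("total cost: %d\n", best_cost);
    return 0;
}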


Binding

  • Bind the following scheduled program
  • Resource sharing: the schedule requires 3 multipliers
  • Binding assigns operations to functional units one cycle at a time; the running count of operations bound to each of the three multipliers grows as follows:
    • After cycle 1: 1, 1, 1
    • After cycle 2: 2, 2, 1
    • After cycle 3: 2, 2, 2
    • After cycle 4: 3, 2, 2
  • The resulting sharing determines the required multiplexing on each unit's inputs


High-Level Synthesis: Challenges

  • It is easy to extract instruction-level parallelism from dependencies within a basic block
  • But C code is inherently sequential, and it is difficult to extract higher-level parallelism

  • Coarse-grained parallelism:

    • function pipelining

  • Fine-grained parallelism:

    • loop pipelining


Loop Pipelining


Motivating Example

  • Cycles: 3N
  • Adders: 3
  • Utilization: 33%

for (int i = 0; i < N; i++) {
    sum[i] = a + b + c + d;
}

[DFG: a+b in cycle 1, +c in cycle 2, +d in cycle 3; a chain of three adders, each busy one cycle out of three]


Loop Pipelining

Steady State

  • Cycles: N+2 (~1 cycle per iteration)

  • Adders: 3

  • Utilization: 100% in steady state


Loop Pipelining Example

for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

  • Each iteration requires:

    • 2 loads from memory

    • 1 store

  • No dependencies between iterations


Loop Pipelining Example

for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

  • Cycle latency of operations:

    • Load: 2 cycles

    • Store: 1 cycle

    • Add: 1 cycle

  • Single memory port


LLVM Instructions

for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

The loop body compiles to the following IR (annotations added):

%i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]  ; i (0 on entry, %3 from the back-edge)
%scevgep5 = getelementptr %b, %i.04          ; &b[i]
%0 = load %scevgep5                          ; load b[i]
%scevgep6 = getelementptr %c, %i.04          ; &c[i]
%1 = load %scevgep6                          ; load c[i]
%2 = add nsw i32 %1, %0                      ; b[i] + c[i]
%scevgep = getelementptr %a, %i.04           ; &a[i]
store %2, %scevgep                           ; a[i] = b[i] + c[i]
%3 = add %i.04, 1                            ; i + 1
%exitcond = eq %3, 100                       ; i + 1 == 100 ?
br %exitcond, %bb2, %bb                      ; exit loop or continue


Scheduling LLVM Instructions

for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

  • Each iteration requires:
    • 2 loads from memory
    • 1 store
  • There are no dependencies between iterations

[Schedule chart: the instructions above laid out cycle by cycle; the two loads and the store contend for the single memory port, causing a memory port conflict]


Loop Pipelining Example

for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

  • Initiation Interval (II)

    • Constant time interval between starting successive iterations of the loop

  • The loop requires 6 cycles per iteration (II=6)

  • Can we do better?


Minimum Initiation Interval

  • Resource minimum II:
    • Due to the limited # of functional units
    • ResMII = ceiling( uses of a functional unit / # of functional units )
  • Recurrence minimum II:
    • Due to loop-carried dependencies
  • Minimum II = max(ResMII, RecMII)


Resource Constraints

  • Assume unlimited functional units (adders, …)
  • Only constraint: single-ported memory controller
  • Reservation table: [each iteration's 2 loads and 1 store all need the single memory port]
  • The resource minimum initiation interval is ResMII = ceiling(3/1) = 3


Iterative Modulo Scheduling

  • There are no loop-carried dependencies, so Minimum II = ResMII = 3
  • Iterative: it is not always possible to schedule the loop at the minimum II

[Flow: set II = minII; attempt to modulo schedule the loop with this II; on failure, set II = II + 1 and retry; on success, done]


Iterative Modulo Scheduling

  • Operations in the loop that execute in cycle i must also execute in cycles i + k*II, for k = 0 to N-1
  • Therefore, to detect resource conflicts, look in the reservation table under slot (i-1) mod II + 1
  • Hence the name “modulo scheduling”


New Pipelined Schedule


Modulo Reservation Table

  • The store couldn’t be scheduled in cycle 6
  • Slot = (6-1) mod 3 + 1 = 3
  • That slot was already taken by an earlier load
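A minimal C sketch of the modulo reservation table check for the single memory port, using 1-indexed cycles as on the slide (the data structures are ours, for illustration):

#include <stdbool.h>

#define II 3
static bool mem_port_taken[II + 1];   /* slots 1..II for the one memory port */

/* Map a 1-indexed cycle to its modulo reservation table slot. */
static int mrt_slot(int cycle) { return (cycle - 1) % II + 1; }

/* Try to reserve the memory port in the given cycle. */
bool try_schedule_mem_op(int cycle) {
    int slot = mrt_slot(cycle);
    if (mem_port_taken[slot])
        return false;   /* conflict: e.g., the store in cycle 6 maps to slot 3 */
    mem_port_taken[slot] = true;
    return true;
}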


Iterative Modulo Scheduling

  • Now we have a valid schedule for II=3

  • We need to construct the loop kernel, prologue, and epilogue

  • The loop kernel is what is executed when the pipeline is in steady state

    • The kernel is executed every II cycles

  • First we divide the schedule into stages of II cycles each


Pipeline Stages

[Chart: the pipelined schedule divided into stages of II = 3 cycles each, labeled Stage 1 to 3]


Pipelined Loop Iterations

[Chart: iterations i=0 through i=4 overlapped, each starting 3 cycles after the previous one and passing through Stages 1-3. The ramp-up region is the prologue, the steady-state region where all three stages are busy is the kernel (executed every II cycles), and the drain region is the epilogue]
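Conceptually, the pipelined loop corresponds to software-pipelined C like the sketch below (hand-written for intuition only; LegUp emits an FSM in Verilog, not restructured C, and the exact stage assignment may differ):

/* Software-pipelined form of: for (i = 0; i < N; i++) a[i] = b[i] + c[i];
   Stage 1: load b[i]; Stage 2: load c[i] and add; Stage 3: store a[i].
   Assumes N >= 2. */
void pipelined_loop(int *a, const int *b, const int *c, int N) {
    /* Prologue: fill the pipeline. */
    int b0 = b[0];                  /* iteration 0, stage 1 */
    int sum0 = b0 + c[0];           /* iteration 0, stage 2 */
    int b1 = b[1];                  /* iteration 1, stage 1 */

    /* Kernel: steady state; one iteration completes per II cycles. */
    for (int i = 2; i < N; i++) {
        a[i - 2] = sum0;            /* iteration i-2, stage 3 */
        sum0 = b1 + c[i - 1];       /* iteration i-1, stage 2 */
        b1 = b[i];                  /* iteration i,   stage 1 */
    }

    /* Epilogue: drain the pipeline. */
    a[N - 2] = sum0;
    a[N - 1] = b1 + c[N - 1];
}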


Loop Dependencies

for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        a[j] = b[i] + a[j-1];   /* a[j-1]: depends on the previous iteration */

  • Loop-carried dependencies like this may force a non-trivial recurrence minimum II
  • Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies
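For reference, the standard recurrence bound from the modulo scheduling literature (the quantity RecMII refers to; not spelled out on the slide) is:

\[
\text{RecMII} \;=\; \max_{\text{dependence cycles } c} \left\lceil \frac{\sum_{e \in c} \text{delay}(e)}{\sum_{e \in c} \text{distance}(e)} \right\rceil
\]

where delay(e) is the latency along dependence edge e and distance(e) is its iteration distance.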


Limitations and Current Research


LegUp HLS Limitations

  • HLS will likely do better for datapath-oriented parts of a design.

  • Results likely quite sensitive to how loops are structured in your C code.

  • Difficult for HLS to “beat” hand-optimized, structured HW design.


FPGA/Altera-Specific Aspects of LegUp

  • Memory

    • On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)

  • IP cores

    • Divider, floating point units

  • On-chip SOC interconnect

    • Avalon interface

  • LegUp-generated Verilog is fairly FPGA-agnostic:
    • Not difficult to migrate to target ASICs


Current Research Work

  • Impact of compiler optimizations on HLS

  • Enhanced parallel accelerator support

    • Combining Pthreads+OpenMP

  • Smaller processor

  • Improved loop pipelining

  • Software fallback for bitwidth-optimized accelerators

  • Enhanced GUI to display CDFG connected with the schedule


Current Work: PCIe Support

  • Enable use of LegUp-generated accelerators in an HPC environment

    • Communicating with an x86 processor via PCIe

  • Message passing or memory transfers

    • Software API for fpga_malloc, fpga_free, send, receive

  • DE4 / Stratix IV support in next LegUp release
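The slide names the API functions but not their signatures; a hypothetical host-side usage sketch follows (the prototypes are our guesses, not the released interface):

#include <stddef.h>

/* Hypothetical prototypes; see the LegUp release for the real API. */
void *fpga_malloc(size_t size);
void  fpga_free(void *ptr);
void  send(void *dst, const void *src, size_t size);
void  receive(void *dst, const void *src, size_t size);

int run_accelerator(const int *input, int *output, int n) {
    void *dev_in  = fpga_malloc(n * sizeof(int));   /* allocate FPGA-side memory */
    void *dev_out = fpga_malloc(n * sizeof(int));
    if (!dev_in || !dev_out)
        return -1;

    send(dev_in, input, n * sizeof(int));       /* host -> FPGA transfer */
    /* ... start the accelerator and wait for completion ... */
    receive(output, dev_out, n * sizeof(int));  /* FPGA -> host transfer */

    fpga_free(dev_in);
    fpga_free(dev_out);
    return 0;
}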


On to the Labs!

