High-Level Synthesis with LegUp: A Crash Course for Users and Researchers

Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi

11 February 2013

ACM FPGA Symposium, Monterey, CA

Dept. of Electrical and Computer Engineering, University of Toronto


[World map: LegUp downloads around the globe, including Berlin, Hong Kong, New York City, and Tokyo.]


Tutorial Outline

  • Overview of LegUp and its algorithms (60 min)

  • Labs (“hands on” via VirtualBox)

    • Lab 1: Using the LegUp Framework (30 min)

    • Break

    • Lab 2: Adding resource constraints (30 min)

    • Lab 3: Changing how LegUp implements hardware (30 min)


Project Motivation

  • Hardware design has advantages over software:

    • Speed

    • Energy-efficiency

  • Hardware design is difficult and skills are rare:

    • 10 software engineers for every hardware engineer*

  • We need a CAD flow that simplifies hardware design for software engineers

*US Bureau of Labor Statistics ’08


Top-Level Vision

    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }

[Flow diagram: program code is compiled by a C compiler for a self-profiling MIPS processor. Profiling data (execution cycles, power, cache misses) suggests program segments to target to HW. High-level synthesis hardens those segments for the FPGA fabric, and the software binary is altered to call the HW accelerators.]


LegUp: Key Features

  • C to Verilog high-level synthesis

  • Many benchmarks (incl. 12 CHStone)

  • MIPS processor (Tiger)

  • Hardware profiler

  • Automated verification tests

  • Open source, freely downloadable

    • Like ABC (Synthesis) or VPR (Place & Route)

    • 600+ downloads since March 2011

    • http://legup.eecg.utoronto.ca


System Architecture

[Block diagram: a MIPS processor and hardware accelerators (each with local memory) communicate over the Avalon interface with an on-chip cache memory and a memory controller that accesses off-chip memory. The FPGA is a Cyclone II or Stratix IV on an Altera DE2 or DE4 board.]


High-Level Synthesis Framework

  • Leverage LLVM compiler infrastructure:

    • Language support: C/C++

    • Standard compiler optimizations

    • More on this shortly

  • We support a large subset of ANSI C


Hardware Profiler Architecture

  • Monitor instr. bus to detect function call/ret.

  • Call: Hash (in HW) from function address to index; push to stack.

  • Ret: pop function index from stack.

  • Use function indexes to associate profiling data (e.g. cycles, power) with counters.

MIPS P

instr

Instr. $

PC

Op Decoder

tAddr+= V1

tAddr += (tAddr << 8)

tAddr ^= (tAddr >> 4)

b = (tAddr >> B1) & B2

a = (tAddr + (tAddr << A1)) >> A2

fNum = (a ^ tab[b])

Address Hash

(in hardware)

ret

call

target

address

Call Stack

counter

0

1

0

1

function #

reset

Data Counter(for current function)

(ret | call)

Popped F#

0

+

count

Incr. when PC changes

F#

Counter Storage

Memory

(for all functions)

PC

count

See paper IEEE ASAP’11
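The hash datapath above can be mirrored in software. A minimal C sketch, assuming placeholder values for the per-program parameters V1, B1, B2, A1, A2 and an all-zero fixup table tab[] — the real values are generated per program by LegUp’s profiler tools:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed placeholder parameters -- the real values are tuned per program. */
#define V1 0x9E3779B9u
#define B1 4u
#define B2 0xFFu
#define A1 2u
#define A2 8u

static const uint32_t tab[256] = {0};  /* fixup table (all-zero here) */

/* Hash a call/ret target address to a function number (fNum). */
static uint32_t profiler_hash(uint32_t tAddr) {
    tAddr += V1;
    tAddr += tAddr << 8;
    tAddr ^= tAddr >> 4;
    uint32_t b = (tAddr >> B1) & B2;             /* fixup-table index */
    uint32_t a = (tAddr + (tAddr << A1)) >> A2;  /* raw hash value    */
    return a ^ tab[b];
}
```

In hardware this is a short combinational path, so a function index is available as soon as a call or return is decoded.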


Processor/Accelerator Hybrid Flow

    int main() {
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }


Processor/Accelerator Hybrid Flow

    int main() {
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }


Processor/Accelerator Hybrid Flow

    int main() {
        sum = dotproduct(N);
        ...
    }

HLS directive:

    set_accelerator_function "dotproduct"

[The designated function is synthesized into a HW accelerator.]


Processor/Accelerator Hybrid Flow

    int main() {
        sum = dotproduct(N);
        ...
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }

The call is rewritten to use the wrapper:

    sum = legup_dotproduct(N);


Processor/Accelerator Hybrid Flow

    int main() {
        sum = legup_dotproduct(N);
        ...
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }

[The remaining SW, including the wrapper, runs on the MIPS processor; dotproduct runs as a HW accelerator.]


How Does LegUp Handle Memory and Pointers?

  • LegUp stores each array in a separate FPGA BRAM

  • BRAM data width matches the data in the array

  • Each BRAM is identified by a 9-bit tag

  • Addresses consist of the RAM tag and array index:

  • A shared memory controller uses the tag bit to determine which BRAM to read or write from

  • The array index is the address passed to the BRAM

     31         23 22                    0
    +------------+------------------------+
    |  9-bit Tag |      23-bit Index      |
    +------------+------------------------+


Pointer Example

  • We have two arrays in the C function:

    • int A[100], B[100]

  • Tag 0 is reserved for NULL pointers

  • Tag 1 is reserved for off-chip memory

  • Assign tag 2 to array A and tag 3 to array B

  • Address of A[3]: Tag = 2, Index = 3
  • Address of B[7]: Tag = 3, Index = 7
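The tag/index arithmetic is easy to sketch in C (the helper names here are ours, not LegUp’s):

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 23                        /* index occupies bits 22..0 */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)  /* 0x007FFFFF                */

/* The 9-bit tag occupies bits 31..23. */
static uint32_t make_ptr(uint32_t tag, uint32_t index) {
    return (tag << INDEX_BITS) | (index & INDEX_MASK);
}
static uint32_t ptr_tag(uint32_t p)   { return p >> INDEX_BITS; }
static uint32_t ptr_index(uint32_t p) { return p & INDEX_MASK; }
```

So the address of A[3] encodes to 0x01000003 and the address of B[7] to 0x01800007.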


Shared Memory Controller

  • Arrays A and B are each stored in a 100-element BRAM

  • Load from pointer D:

[Diagram: a load from pointer D sends the 9-bit tag (e.g., Tag=2, Index=13) to the memory controller, which steers the access to the matching BRAM (Tag=2 holds A[0..99], Tag=3 holds B[0..99]); the 23-bit index (13) selects the element, returning A[13].]


Core Benchmarks (+Many More)

  • 12 CHStone Benchmarks (JIP’09) and Dhrystone

    • Too large/complex for academic HLS tools

  • Include golden input/output test vectors

  • Not supported by academic tools


Experimental Results: LegUp 1.0 (2011) for Cyclone II

  1. Pure software on MIPS

  Hybrid (software/hardware):

  2. Second most compute-intensive function (and descendants) in HW

  3. Same as 2, but with the most compute-intensive function also in HW

  4. Pure hardware using LegUp

  5. Pure hardware using eXCite (commercial tool)


Experimental Results


Comparison: LegUp vs. eXCite

  • Benchmarks compiled to hardware

  • eXCite: Commercial high-level synthesis tool

    • Couldn’t compile Dhrystone


Energy Consumption

18x less energy than software


Current Release: LegUp 3.0

  • Loop pipelining

  • Dual and multi-ported memory support

  • Bitwidth minimization

  • Multi-pumping DSP units for area reduction

  • Alias analysis for dependency checks

  • Parallel accelerators via Pthreads & OpenMP

    Results now considerably better than LegUp 1.0 release


LegUp 3.0 vs. LegUp 1.0


LLVM Compiler and HLS Algorithms


LLVM Compiler

  • Open-source compiler framework.

    • http://llvm.org

  • Used by Apple, NVIDIA, AMD, others.

  • Competitive quality with gcc.

  • LegUp HLS is a “back-end” of LLVM.

  • LLVM: low-level virtual machine.


LLVM Compiler

  • LLVM will compile C code into a control flow graph (CFG)

  • LLVM will perform standard optimizations

    • 50+ different optimizations in LLVM

    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }

[Diagram: the LLVM compiler transforms the C program into a CFG of basic blocks BB0, BB1, BB2.]


Control Flow Graph

  • Control flow graph is composed of basic blocks

  • A basic block is a sequence of instructions terminated with exactly one branch

    • Can be represented by an acyclic data flow graph:

[Diagram: basic block BB1’s instructions — three loads feeding two adds and a store — form an acyclic data flow graph.]


LLVM Details

  • Instructions in basic blocks are primitive computational operations:

    • shift, add, divide, xor, and, etc.

  • Or are control-flow operations:

    • branch, call, etc.

  • The CDFG is represented in LLVM’s intermediate representation (IR)

    • IR is machine-independent assembly code.


High-Level Synthesis Flow

    C Program
       ↓
    C Compiler (LLVM)
       ↓
    Optimized LLVM IR  ←  Target H/W Characterization
       ↓
    Allocation → Scheduling → Binding → RTL Generation
       ↓
    Synthesizable Verilog

  • User Constraints (inputs to scheduling):
    • Timing
    • Resource


Scheduling

  • Scheduling is the task of assigning operations to clock cycles, implemented with a finite state machine

[Diagram: the data flow graph’s operations are scheduled into FSM states — the loads in the early states, the adds in the middle states, and the store in the final state.]


Binding

  • Binding is the task of assigning scheduled operations to functional units in the datapath

[Diagram: the scheduled loads are bound to a 2-port RAM and the two adds share a single adder, with FFs holding intermediate values between states.]


High-Level Synthesis: Scheduling


SDC Scheduling

  • SDC = System of Difference Constraints

    • Cong, Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”. DAC 2006: 433-438.

  • Basic idea: formulate scheduling as a mathematical optimization problem

    • Linear objective function + linear constraints (==, <=, >=).

  • The problem is a linear program (LP)

    • Solvable in polynomial time with standard solvers


Define Variables

  • For each operation i to schedule, create a variable t_i.

  • The t_i’s will hold the cycle # in which each op is scheduled.

  • Here we have:

    • t_add, t_shift, t_sub

[DFG: an add (+) and a shift (<<) both feed a subtract (–); the data flow graph is already accessible in LLVM.]


Dependency Constraints

  • In this example, the subtract can only happen after the add and shift.

  • t_sub – t_add >= 0

  • t_sub – t_shift >= 0

  • Hence the name difference constraints.

[DFG: add and shift feed sub.]


Handling Clock Period Constraints

  • Target period: P (e.g., 10 ns)

  • For each chain of dependent operations in the DFG, estimate the path delay D (using LegUp’s delay models)

    • E.g.: D from mod -> or = 23 ns.

  • Compute: R = ceiling(D/P) - 1

    • E.g.: R = 2

  • Add the difference constraint:

    • t_or – t_mod >= 2

[DFG: a chain mod → xor → shr → or.]
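Because every constraint has the form t_v − t_u >= d, an ASAP schedule can be computed without a general LP solver by relaxing the constraints to a fixpoint (a longest-path computation). A C sketch using the mod → xor → shr → or chain above; the zero-cycle dependency edges between adjacent operations are our assumption about this example’s DFG:

```c
#include <assert.h>
#include <string.h>

enum { MOD, XOR, SHR, OR, NOPS };

struct constr { int u, v, d; };  /* encodes t[v] - t[u] >= d */

/* ASAP schedule: start every op at cycle 0, then relax each difference
   constraint until nothing changes (longest path from the sources). */
static void sdc_asap(const struct constr *c, int nc, int *t, int n) {
    memset(t, 0, n * sizeof *t);
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < nc; i++) {
            if (t[c[i].v] < t[c[i].u] + c[i].d) {
                t[c[i].v] = t[c[i].u] + c[i].d;
                changed = 1;
            }
        }
    }
}
```

With dependency constraints of distance 0 (chaining allowed) plus the clock-period constraint t_or − t_mod >= 2, the or operation lands in cycle 2.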


Resource Constraints

  • Restriction on # of operations of a given type that can execute in a cycle

  • Why do we need it?

    • Want to use dual-port RAMs in FPGA

      • Allow up to 2 load/store operations in a cycle

    • Floating point

      • Do not want to instantiate many FP cores of a given type, probably just one

      • Scheduling must honour # of FP cores available


Resource Constraints in SDC

  • Res-constrained scheduling is NP-hard.

  • We implemented the approach in [Cong & Zhang, DAC 2006]

[DFG: eight additions A–H; suppose we want to schedule them with only 2 adders in the HW (lab #2).]


Add SDC Constraints

  • Generate a topological ordering of the resource-constrained operations.

  • Say constrained to 2 adders in HW.

  • Starting at C in the ordering, create a constraint: t_C – t_A > 0

  • Next consider E; add constraint: t_E – t_B > 0

  • Continue to the end

  • The resulting schedule will have <= 2 adds / cycle

A B C E F D G H
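In general, with k units available, each resource-constrained operation gets a difference constraint against the operation k positions earlier in the topological ordering. A small C sketch of the constraint generation (the struct encoding is ours):

```c
#include <assert.h>

struct rc { char u, v; };  /* encodes t[v] - t[u] >= 1, i.e. t[v] - t[u] > 0 */

/* Given a topological ordering ord[0..n-1] of the resource-constrained
   ops and k available units, emit one constraint per op from position
   k onward. Returns the number of constraints written. */
static int resource_constraints(const char *ord, int n, int k, struct rc *out) {
    int nc = 0;
    for (int i = k; i < n; i++) {
        out[nc].u = ord[i - k];  /* must be scheduled at least 1 cycle earlier */
        out[nc].v = ord[i];
        nc++;
    }
    return nc;
}
```

For the ordering A B C E F D G H with 2 adders, this generates t_C − t_A > 0 first and t_E − t_B > 0 second, as on the slide.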


ASAP Objective Function

  • Minimize the sum of the t_i variables

  • Operations will be scheduled as early as possible, subject to the constraints

  • The resulting LP is solvable in polynomial time


High-Level Synthesis: Binding


High-Level Synthesis: Binding

  • Weighted bipartite matching-based binding

    • Huang, Chen, Lin, Hsu, “Data path allocation based on bipartite weighted matching”. DAC 1990: 499-504.

  • Finds the minimum weighted matching of a bipartite graph at each step

    • Solve using the Hungarian Method (polynomial)

[Bipartite graph: operations on one side, hardware functional units on the other, joined by edges with costs.]
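At slide scale, the minimum-weight matching can even be brute-forced, which makes the objective concrete; the Hungarian Method computes the same answer in polynomial time. A C sketch with made-up edge costs, where cost[i][j] is the cost of binding operation i to functional unit j:

```c
#include <assert.h>
#include <limits.h>

#define N 3  /* 3 operations, 3 functional units (illustrative size) */

/* Exhaustively try every assignment of operations to distinct units,
   tracking the minimum total edge cost. */
static void search(const int cost[N][N], int op, int used, int acc, int *best) {
    if (op == N) {
        if (acc < *best) *best = acc;
        return;
    }
    for (int j = 0; j < N; j++)
        if (!(used & (1 << j)))
            search(cost, op + 1, used | (1 << j), acc + cost[op][j], best);
}

static int min_cost_matching(const int cost[N][N]) {
    int best = INT_MAX;
    search(cost, 0, 0, 0, &best);
    return best;
}
```

Edge costs typically reflect added multiplexing and wiring, so the matching steers each operation toward a unit it can share cheaply.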


Binding

  • Bind the following scheduled program


Binding

  • Resource Sharing: requires 3 multipliers


Binding

  • Functional Units

[Animation: binding proceeds cycle by cycle; over cycles 1–4 the per-unit assignment counts grow 1,1,1 → 2,2,1 → 2,2,2 → 3,2,2, and the required multiplexing is 3, 2, 2.]


High-Level Synthesis: Challenges

  • Easy to extract instruction level parallelism using dependencies within a basic block

  • But C code is inherently sequential and it is difficult to extract higher level parallelism

  • Coarse-grained parallelism:

    • function pipelining

  • Fine-grained parallelism:

    • loop pipelining


Loop Pipelining


Motivating Example

  • Cycles: 3N
  • Adders: 3
  • Utilization: 33%

    for (int i = 0; i < N; i++) {
        sum[i] = a + b + c + d;
    }

[Schedule: cycle 1 computes a + b, cycle 2 adds c, cycle 3 adds d — only one of the three adders is busy each cycle.]


Loop Pipelining

[Pipelined schedule showing steady state.]

  • Cycles: N+2 (~1 cycle per iteration)

  • Adders: 3

  • Utilization: 100% in steady state


Loop Pipelining Example

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

  • Each iteration requires:

    • 2 loads from memory

    • 1 store

  • No dependencies between iterations


Loop Pipelining Example

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

  • Cycle latency of operations:

    • Load: 2 cycles

    • Store: 1 cycle

    • Add: 1 cycle

  • Single memory port


LLVM Instructions

    %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
    %scevgep5 = getelementptr %b, %i.04
    %0 = load %scevgep5
    %scevgep6 = getelementptr %c, %i.04
    %1 = load %scevgep6
    %2 = add nsw i32 %1, %0
    %scevgep = getelementptr %a, %i.04
    store %2, %scevgep
    %3 = add %i.04, 1
    %exitcond = eq %3, 100
    br %exitcond, %bb2, %bb

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }



Scheduling LLVM Instructions

[Schedule table: each LLVM instruction assigned to a cycle.]

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

  • Each iteration requires:

    • 2 loads from memory

    • 1 store

  • There are no dependencies between iterations


Scheduling LLVM Instructions

[Schedule table: the loads and the store contend for the single memory port — a memory port conflict.]

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

  • Each iteration requires:
    • 2 loads from memory
    • 1 store
  • There are no dependencies between iterations


Loop Pipelining Example

    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

  • Initiation Interval (II)

    • Constant time interval between starting successive iterations of the loop

  • The loop requires 6 cycles per iteration (II=6)

  • Can we do better?


Minimum Initiation Interval

  • Resource minimum II:

    • Due to limited # of functional units

    • ResMII = ceil( uses of a functional unit / # of functional units )

  • Recurrence minimum II:

    • Due to loop carried dependencies

  • Minimum II = max(ResMII, RecMII)
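These formulas compute directly. A C sketch; the example numbers (2 loads + 1 store sharing a single memory port) come from the next slides, and treating RecMII as 1 when there are no loop-carried dependencies is our assumption:

```c
#include <assert.h>

/* ResMII = ceil(uses of a functional unit / # of that unit). */
static int res_mii(int uses, int units) {
    return (uses + units - 1) / units;  /* ceiling division */
}

/* Minimum II = max(ResMII, RecMII). */
static int min_ii(int res, int rec) {
    return res > rec ? res : rec;
}
```

With 3 memory operations per iteration and 1 port, ResMII = 3, so the loop cannot start a new iteration more often than every 3 cycles.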


Resource Constraints

  • Assume unlimited functional units (adders, …)

  • Only constraint: single ported memory controller

  • Reservation table:

  • The resource minimum initiation interval is 3


Iterative Modulo Scheduling

  • There are no loop carried dependencies so Minimum II = ResMII = 3

  • Iterative: Not always possible to schedule the loop for minimum II

[Flow: II = minII → attempt to modulo schedule the loop with II; on failure, II = II + 1 and retry; on success, done.]


Iterative Modulo Scheduling

  • Operations in the loop that execute in cycle i must also execute in cycles i + k*II, for k = 0 to N-1

  • Therefore, to detect resource conflicts, look in the reservation table under slot: (i-1) mod II + 1

  • Hence the name “modulo scheduling”


New Pipelined Schedule


Modulo Reservation Table

  • Store couldn’t be scheduled in cycle 6

  • Slot = (6-1) mod 3 + 1 = 3

  • Already taken by an earlier load
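The slot arithmetic is a one-liner. A C sketch of the modulo reservation table for the single memory port, assuming (consistent with the example) loads issued in cycles 1 and 3 and the store attempted in cycle 6, with II = 3:

```c
#include <assert.h>
#include <string.h>

#define II 3

/* An op issued in (1-based) cycle c occupies reservation slot
   (c-1) mod II + 1 of its resource. */
static int slot_of(int cycle) { return (cycle - 1) % II + 1; }

/* Reserve the memory port for `cycle`; returns 0 on a conflict. */
static int try_reserve(int table[II + 1], int cycle) {
    int s = slot_of(cycle);
    if (table[s]) return 0;   /* slot already taken */
    table[s] = cycle;
    return 1;
}
```

The store in cycle 6 hashes to slot (6−1) mod 3 + 1 = 3, which the earlier load already holds, so the scheduler must try a different cycle.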


Iterative Modulo Scheduling

  • Now we have a valid schedule for II=3

  • We need to construct the loop kernel, prologue, and epilogue

  • The loop kernel is what is executed when the pipeline is in steady state

    • The kernel is executed every II cycles

  • First we divide the schedule into stages of II cycles each


Pipeline Stages

[Diagram: the schedule divided into stages of II = 3 cycles each: stages 1, 2, 3.]

Pipelined Loop Iterations

[Diagram: iterations i=0..4 start 3 cycles (one II) apart, so stages 1–3 of successive iterations overlap. The ramp-up is the prologue, the steady state where all stages are busy is the kernel, and the ramp-down is the epilogue.]


Loop Dependencies

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[j] = b[i] + a[j-1];   // depends on previous iteration

  • May cause a non-zero recurrence min II.

  • Several papers in FPGA 2013 deal with discovering/optimizing loop dependencies


Limitations and Current Research


LegUp HLS Limitations

  • HLS will likely do better for datapath-oriented parts of a design.

  • Results likely quite sensitive to how loops are structured in your C code.

  • Difficult for HLS to “beat” optimized structured HW design.


FPGA/Altera-Specific Aspects of LegUp

  • Memory

    • On-chip (AltSyncRAM), off-chip (DDR2/SDRAM controller)

  • IP cores

    • Divider, floating point units

  • On-chip SOC interconnect

    • Avalon interface

  • LegUp-generated Verilog is fairly FPGA-agnostic:

    • Not difficult to migrate to target ASICs


Current Research Work

  • Impact of compiler optimizations on HLS

  • Enhanced parallel accelerator support

    • Combining Pthreads+OpenMP

  • Smaller processor

  • Improved loop pipelining

  • Software fallback for bitwidth-optimized accelerators

  • Enhanced GUI to display CDFG connected with the schedule


Current Work: PCIe Support

  • Enable use of LegUp-generated accelerators in an HPC environment

    • Communicating with an x86 processor via PCIe

  • Message passing or memory transfers

    • Software API for fpga_malloc, fpga_free, send, receive

  • DE4 / Stratix IV support in next LegUp release


On to the Labs!


  • Login