Accelerator Compiler

Zhiduo Liu, Supervisor: Guy Lemieux, Sep. 28th, 2012



Accelerator Compiler for the VENICE Vector Processor

Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012



Outline:

Motivation

Background

Implementation

Results

Conclusion




Motivation

Parallel platforms: FPGA, multi-core, many-core, GPU, computer clusters, vector processor

Languages, APIs, and tools: VHDL, Verilog, System Verilog, Bluespec, ParC, Cilk, Erlang, OpenMP, OpenCL, aJava, SSE, MPI, OpenGL, Pthread, X10, CUDA, StreamIt, Sh, OpenHMPP, Fortress, Sponge, Chapel


Motivation

Simplification: one Single Description instead of a separate program for each platform



Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.




Complicated

[Figure: VENICE architecture diagram with blocks labeled ALIGN, WR RD, ALIGN, EX1, EX2, and ACCUM]


Program in VENICE assembly

  • Allocate vectors in scratchpad
  • Move data from main memory to scratchpad
  • Wait for DMA transaction to be completed
  • Setup for vector instructions
  • Perform vector computations
  • Wait for vector operations to be completed
  • Move data from scratchpad to main memory
  • Wait for DMA transaction to be completed
  • Deallocate memory from scratchpad

    #include "vector.h"

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        const int data_len = sizeof ( A );
        int *va = ( int *) vector_malloc ( data_len );
        vector_dma_to_vector ( va, A, data_len );
        vector_wait_for_dma ();
        vector_set_vl ( data_len / sizeof (int) );
        vector ( SVW, VADD, va, 42, va );
        vector_instr_sync();
        vector_dma_to_host ( A, va, data_len );
        vector_wait_for_dma ();
        vector_free ();
    }


Program in Accelerator

  • Create a Target
  • Create Parallel Array objects
  • Write expressions
  • Call ToArray to evaluate expressions
  • Delete Target object

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        Target *tgt = CreateVectorTarget();
        IPA b = IPA( A, sizeof (A)/sizeof (int));
        IPA c = b + 42;
        tgt->ToArray( c, A, sizeof (A)/sizeof (int));
        tgt->Delete();
    }

Other devices are targeted by changing only the line that creates the Target:

    Target *tgt = CreateMulticoreTarget();
    Target *tgt = CreateDX9Target();
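The Accelerator program only describes the computation; nothing is evaluated until ToArray is called. A minimal sketch of that deferred-evaluation idea in plain C++ follows. The names `Node`, `LazyArray`, `from_data`, and `to_array` are hypothetical stand-ins and are not the Accelerator API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical expression node: a data leaf, a scalar constant, or an add.
struct Node {
    enum Kind { Data, Const, Add } kind;
    std::vector<int> data;            // used by Data
    int value = 0;                    // used by Const
    std::shared_ptr<Node> lhs, rhs;   // used by Add
};

// Hypothetical stand-in for a parallel array handle: it holds a graph, not results.
struct LazyArray {
    std::shared_ptr<Node> node;
};

LazyArray from_data(const int* p, std::size_t n) {
    auto nd = std::make_shared<Node>();
    nd->kind = Node::Data;
    nd->data.assign(p, p + n);
    return LazyArray{nd};
}

// operator+ records the operation instead of computing it.
LazyArray operator+(const LazyArray& a, int scalar) {
    auto c = std::make_shared<Node>();
    c->kind = Node::Const;
    c->value = scalar;
    auto add = std::make_shared<Node>();
    add->kind = Node::Add;
    add->lhs = a.node;
    add->rhs = c;
    return LazyArray{add};
}

// Evaluate one element of the recorded expression.
int eval_at(const Node& n, std::size_t i) {
    switch (n.kind) {
        case Node::Data:  return n.data[i];
        case Node::Const: return n.value;
        case Node::Add:   return eval_at(*n.lhs, i) + eval_at(*n.rhs, i);
    }
    return 0;
}

// Only here is the whole expression evaluated, in the role ToArray plays on a target.
void to_array(const LazyArray& e, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = eval_at(*e.node, i);
}
```

Because the handle carries the whole graph, a back-end can inspect and optimize the expression before generating any device code, which is the property the compiler in this talk relies on.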


Assembly Programming:

  • Write Assembly
  • Compile with Gcc (Doesn’t compile? Rewrite.)
  • Download to board
  • Get Result (Result incorrect? Rewrite.)

Accelerator Programming:

  • Write in Accelerator
  • Compile with Microsoft Visual Studio (Doesn’t compile? Or result incorrect? Rewrite.)
  • Compile with Gcc
  • Download to board
  • Get Result


  • Assembly Programming:
      • Hard to program
      • Long debug cycle
      • Not portable
      • Manual – Not always optimal or correct (wysiwyg)

  • Accelerator Programming:
      • Easy to program
      • Easy to debug
      • Can also target other devices
      • Automated compiler optimizations




    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        Target *tgtVector = CreateVectorTarget();
        const int length = 8192;
        int a[] = {1,2,3,4, … , 8192};
        int d[length];
        IPA A = IPA( a, length);
        IPA B = Evaluate( Rotate(A, [1]) + 1 );
        IPA C = Evaluate( Abs( A + 2 ));
        IPA D = ( A + B ) * C ;
        tgtVector->ToArray( D, d, length * sizeof(int));
        tgtVector->Delete();
    }

[Figure: expression graph for D. The root is ×; its operands are + (A, + (Rot A, 1)) and Abs (+ (A, 2)).]




[Figure: the same expression graph with the rotation folded into the leaf: the Rot node above A is replaced by the leaf A (rot).]
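Once the rotation lives in the leaf, no separate rotate operation is needed: the rotated operand can be read from the same buffer at an offset, which is how `va+1` is used in the generated code later in the talk. A minimal host-side sketch, assuming Rotate(A, [1]) means out[i] = A[(i + 1) mod n]; the helper names `make_padded` and `rotate_plus_one` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy A into a buffer with one extra slot that repeats A[0], so that
// reading from offset 1 yields A rotated by one element.
std::vector<int> make_padded(const std::vector<int>& a) {
    std::vector<int> buf(a);
    if (!a.empty()) buf.push_back(a.front());   // wrap-around element
    return buf;
}

// B = Rotate(A, 1) + 1, computed by reading the padded buffer at offset 1.
std::vector<int> rotate_plus_one(const std::vector<int>& a) {
    if (a.empty()) return {};
    std::vector<int> buf = make_padded(a);
    std::vector<int> b(a.size());
    const int* rotated = buf.data() + 1;        // plays the role of the "A (rot)" leaf
    for (std::size_t i = 0; i < a.size(); ++i) b[i] = rotated[i] + 1;
    return b;
}
```

The extra slot is consistent with the generated code allocating 32772 bytes rather than 32768 for the buffers that hold A, although the slides do not state that reason explicitly.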


[Figure: the graph divided into three sub-graphs: B = A (rot) + 1, C = Abs (A + 2), and D = (A + B) × C.]


Combine Operations

[Figure: C = Abs (A + 2) is rewritten as a single absolute-add node, C = A |+| 2. B = A (rot) + 1 and D = (A + B) × C are unchanged.]
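Combining operations can be pictured as a small tree rewrite: an Abs node whose operand is an add collapses into one fused node. The sketch below assumes a toy IR; `Expr`, `leaf`, `node`, `combine`, and the op names are hypothetical, since the compiler's actual IR is not shown in the slides.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical IR node for illustration only.
struct Expr {
    std::string op;                   // "leaf", "add", "abs", "mul", "absadd"
    std::string name;                 // leaf label, e.g. "A" or "2"
    std::unique_ptr<Expr> lhs, rhs;   // "abs" uses lhs only
};

std::unique_ptr<Expr> leaf(std::string n) {
    std::unique_ptr<Expr> e(new Expr());
    e->op = "leaf";
    e->name = std::move(n);
    return e;
}

std::unique_ptr<Expr> node(std::string op, std::unique_ptr<Expr> l,
                           std::unique_ptr<Expr> r = nullptr) {
    std::unique_ptr<Expr> e(new Expr());
    e->op = std::move(op);
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Bottom-up rewrite: abs(add(x, y)) becomes a single absadd(x, y) node.
std::unique_ptr<Expr> combine(std::unique_ptr<Expr> e) {
    if (!e) return e;
    e->lhs = combine(std::move(e->lhs));
    e->rhs = combine(std::move(e->rhs));
    if (e->op == "abs" && e->lhs && e->lhs->op == "add") {
        std::unique_ptr<Expr> fused = std::move(e->lhs);
        fused->op = "absadd";
        return fused;                 // the abs node is dropped
    }
    return e;
}

// Count operation nodes (everything that is not a leaf).
int count_ops(const Expr* e) {
    if (!e || e->op == "leaf") return 0;
    return 1 + count_ops(e->lhs.get()) + count_ops(e->rhs.get());
}
```

For C = Abs (A + 2) the rewrite turns two operation nodes into one, so one vector instruction and one intermediate result are saved.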


Scratchpad Memory

“Virtual Vector Register File”

Number of vector registers = ?

Vector register size = ?


Evaluation Order

[Figure: the sub-graphs for B, C, and D with numeric labels on the nodes, used to choose the evaluation order.]
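The slide's numeric labels cannot be matched to nodes in this transcript. One classic way to choose an evaluation order that keeps few temporaries live is Sethi-Ullman labeling; whether the compiler uses exactly this scheme is an assumption of the sketch below, and `Tree`, `mk`, `assign_labels`, and `evaluation_order` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Tree {
    std::string name;
    std::unique_ptr<Tree> lhs, rhs;   // both null for a leaf
    int label = 0;                    // temporaries needed to evaluate this subtree
};

std::unique_ptr<Tree> mk(const std::string& name,
                         std::unique_ptr<Tree> l = nullptr,
                         std::unique_ptr<Tree> r = nullptr) {
    std::unique_ptr<Tree> t(new Tree());
    t->name = name;
    t->lhs = std::move(l);
    t->rhs = std::move(r);
    return t;
}

// Sethi-Ullman labeling: a leaf needs 1; a binary node needs max(l, r),
// or l + 1 when both children need the same amount.
int assign_labels(Tree* t) {
    if (!t->lhs && !t->rhs) return t->label = 1;
    if (!t->rhs) return t->label = assign_labels(t->lhs.get());
    int l = assign_labels(t->lhs.get());
    int r = assign_labels(t->rhs.get());
    return t->label = (l == r) ? l + 1 : std::max(l, r);
}

// Emit operations in post-order, visiting the child with the larger label first.
void evaluation_order(const Tree* t, std::vector<std::string>& out) {
    if (!t) return;
    const Tree* first = t->lhs.get();
    const Tree* second = t->rhs.get();
    if (first && second && second->label > first->label) std::swap(first, second);
    evaluation_order(first, out);
    evaluation_order(second, out);
    if (t->lhs) out.push_back(t->name);   // leaves are operands, not operations
}
```

Evaluating the more demanding subtree first means its temporaries are released before the cheaper subtree is started, which lowers the peak number of buffers.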


Count number of virtual vector registers

[Figure: animation stepping through the sub-graphs for B, C, and D to count the vector registers they need.]
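Counting virtual vector registers amounts to finding the largest number of vector values that are alive at the same time. The sketch below does this over a straight-line instruction list; it is an illustration under stated assumptions, not the compiler's procedure, which the slides show only as an animation. `Instr` and `count_buffers` are hypothetical names, and the sketch assumes a destination may reuse the buffer of a source that is read for the last time by the same instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Instr {
    std::string dest;
    std::vector<std::string> srcs;   // vector operands only; scalars are omitted
};

// Returns the peak number of vector values that must be resident at once.
std::size_t count_buffers(const std::vector<std::string>& inputs,
                          const std::vector<Instr>& prog) {
    // Record the index of the last instruction that reads each value.
    std::map<std::string, std::size_t> last_use;
    for (std::size_t i = 0; i < prog.size(); ++i)
        for (const std::string& s : prog[i].srcs) last_use[s] = i;

    std::set<std::string> live(inputs.begin(), inputs.end());
    std::size_t peak = live.size();
    for (std::size_t i = 0; i < prog.size(); ++i) {
        for (const std::string& s : prog[i].srcs) {
            if (last_use[s] == i) live.erase(s);   // source dies here
        }
        live.insert(prog[i].dest);                 // result becomes live
        peak = std::max(peak, live.size());
    }
    return peak;
}
```

Applied to the running example written as t1 = A (rot) + 1, t2 = A |+| 2, t3 = A + t1, t4 = t2 × t3, the function returns 3.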




“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = Capacity/3
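The sizing rule is plain division: the scratchpad capacity is split evenly among the registers counted in the previous step. A small sketch of the arithmetic follows; the function names are hypothetical, rounding down to whole elements is an assumption, and the capacity used in the usage note is an illustrative value rather than a VENICE figure.

```cpp
#include <cassert>
#include <cstddef>

// Split a scratchpad of capacity_bytes evenly into num_regs vector registers
// and return how many elements of elem_bytes each register holds.
std::size_t vector_register_elems(std::size_t capacity_bytes,
                                  std::size_t num_regs,
                                  std::size_t elem_bytes) {
    if (num_regs == 0 || elem_bytes == 0) return 0;
    return (capacity_bytes / num_regs) / elem_bytes;
}

// Number of passes needed to cover total_elems with registers of reg_elems.
std::size_t num_chunks(std::size_t total_elems, std::size_t reg_elems) {
    if (reg_elems == 0) return 0;
    return (total_elems + reg_elems - 1) / reg_elems;
}
```

With a hypothetical 12288-byte scratchpad, 3 registers, and 4-byte words, each register holds 1024 elements, so an 8192-element array would be processed in 8 chunks.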


Convert to LIR

[Figure: the sub-graphs for B, C (A |+| 2), and D with numbered nodes, as input to the conversion to LIR.]



Code Generation

    #include "vector.h"

    int main(){
        int A[8192] = {1,2,3,4, … 8192};
        int *va = ( int *) vector_malloc ( 32772 );
        int *vb = ( int *) vector_malloc ( 32768 );
        int *vc = ( int *) vector_malloc ( 32768 );
        int *vd = ( int *) vector_malloc ( 32772 );
        int *vtemp = va;
        int i;    /* declared outside the loop because it is used after it */
        vector_dma_to_vector ( va, A, 32772 );
        for(i=0; i<4; i++){
            vector_set_vl ( 1024 );
            /* swap the two input buffers */
            vtemp = va;
            va = vd;
            vd = vtemp;
            /* wait for the outstanding DMA transfer */
            vector_wait_for_dma ();
            /* start loading the next chunk */
            if(i<3)
                vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );
            /* write the previous chunk's result back to the host */
            if(i>0){
                vector_instr_sync();
                vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
            }
            vector ( SVW, VADD, vb, 1, va+1 );
            vector_abs ( SVW, VADD, vc, 2, va );
            vector ( VVW, VADD, vb, vb, va );
            vector ( VVW, VADD, vc, vc, vb );
        }
        /* write back the result of the last chunk */
        vector_instr_sync();
        vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );
        vector_wait_for_dma ();
        vector_free ();
    }
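The loop structure of the generated code (prime one buffer, then in each iteration swap buffers, prefetch the next chunk, write back the previous result, and compute the current chunk) can be modeled on the host. In the sketch below the "DMA" is a plain synchronous copy, so it shows the buffer rotation and ordering rather than real overlap of transfer and compute; `double_buffered_add` is a hypothetical name and the computation is reduced to adding a scalar.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-only model of a double-buffered, chunked vector loop.
std::vector<int> double_buffered_add(const std::vector<int>& in, int scalar,
                                     std::size_t chunk) {
    std::vector<int> out(in.size());
    if (in.empty() || chunk == 0) return out;
    const std::size_t n_chunks = (in.size() + chunk - 1) / chunk;
    std::vector<int> buf_a(chunk), buf_b(chunk), result(chunk);
    std::vector<int>* cur = &buf_a;    // chunk being computed
    std::vector<int>* next = &buf_b;   // chunk being prefetched

    // Length of chunk c; the last chunk may be shorter.
    auto chunk_len = [&](std::size_t c) {
        return std::min(chunk, in.size() - c * chunk);
    };

    // Prime the pipeline: load chunk 0 into the prefetch buffer.
    std::copy(in.begin(), in.begin() + chunk_len(0), next->begin());

    for (std::size_t i = 0; i < n_chunks; ++i) {
        std::swap(cur, next);                        // prefetched data becomes current
        if (i + 1 < n_chunks) {                      // start loading the next chunk
            std::copy(in.begin() + (i + 1) * chunk,
                      in.begin() + (i + 1) * chunk + chunk_len(i + 1),
                      next->begin());
        }
        if (i > 0) {                                 // write back the previous result
            std::copy(result.begin(), result.begin() + chunk_len(i - 1),
                      out.begin() + (i - 1) * chunk);
        }
        for (std::size_t k = 0; k < chunk_len(i); ++k)   // compute the current chunk
            result[k] = (*cur)[k] + scalar;
    }
    // Drain: write back the result of the last chunk.
    std::copy(result.begin(), result.begin() + chunk_len(n_chunks - 1),
              out.begin() + (n_chunks - 1) * chunk);
    return out;
}
```

The extra input buffer is the cost of double buffering: one more region of scratchpad is reserved so that a transfer can be in flight while the vector engine works on the other.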


Expression Graph

  • Convert to IR

IR

  • CSE
  • Combine Memory transforms
  • Sub-divide IR
  • Move Bounds to Leaves
  • Constant folding
  • Combine Operations
  • Evaluation Ordering
  • Need Double buffering?
  • Buffer Counting
  • Convert To LIR

LIR

  • Calculate Register Size
  • Allocate Memory
  • Initialize Memory
  • Transfer Data To Scratchpad
  • Set VL
  • Write Vector Instructions
  • Transfer Result To Host

VENICE Code




[Figure: results chart with a 370x callout]



Compare to Intel CPU

Compile Time



Using smaller data types

Speedup using bytes

Speedup using halfwords





Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable third-party back-end support to provide a sustainable solution for emerging hardware.


Thank you!



Look-up Table



“Virtual Vector Register File”

Number of vector registers = 4

Vector register size = 1024



Combine Operators for Motion Estimation


Performance Degradation on median

Human-written compare-and-swap

    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap

    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
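To make the human-written sequence easier to follow, here is an element-wise scalar model of it. The instruction semantics are assumptions inferred from how the sequence is used and are not confirmed by the slides: VSUB is taken to compute dest = srcA - srcB, and VCMV_LTZ to copy srcA into dest only where srcB is negative. Overflow in the subtraction is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of the four-instruction compare-and-swap.
// After the call, v_min[i] <= v_max[i] for every element.
void compare_and_swap(std::vector<int>& v_min, std::vector<int>& v_max) {
    assert(v_min.size() == v_max.size());
    for (std::size_t i = 0; i < v_min.size(); ++i) {
        int tmp = v_min[i];                 // VOR      v_tmp, v_min, v_min (copy)
        int sub = v_max[i] - v_min[i];      // VSUB     v_sub, v_max, v_min
        if (sub < 0) v_min[i] = v_max[i];   // VCMV_LTZ v_min, v_max, v_sub
        if (sub < 0) v_max[i] = tmp;        // VCMV_LTZ v_max, v_tmp, v_sub
    }
}
```

The hand-written version works in place and needs one subtraction plus two conditional moves after the copy, whereas the compiler-generated listing issues four conditional moves after its subtraction.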



Double Buffering

