Accelerator Compiler for the VENICE Vector Processor

Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012


Outline:

Motivation

Background

Implementation

Results

Conclusion



Motivation

Hardware platforms: FPGA, Multi-core, GPU, Many-core, Computer clusters, Vector Processor

Languages, APIs and tools: VHDL, Verilog, System Verilog, Bluespec, ParC, Cilk, Erlang, OpenMP, OpenCL, aJava, SSE, MPI, OpenGL, Pthread, X10, CUDA, StreamIt, Sh, OpenHMPP, Fortress, Sponge, Chapel


Simplification



Motivation

Single Description


Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.


Background


Complicated

[Figure: VENICE architecture diagram; visible labels: ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]


Program in VENICE assembly

  • Allocate vectors in scratchpad
  • Move data from main memory to scratchpad
  • Wait for DMA transaction to be completed
  • Setup for vector instructions
  • Perform vector computations
  • Wait for vector operations to be completed
  • Move data from scratchpad to main memory
  • Wait for DMA transaction to be completed
  • Deallocate memory from scratchpad

#include "vector.h"

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );              /* length in bytes */
    int *va = ( int *) vector_malloc ( data_len );
    vector_dma_to_vector ( va, A, data_len );
    vector_wait_for_dma ();
    vector_set_vl ( data_len / sizeof (int) );      /* vector length in elements */
    vector ( SVW, VADD, va, 42, va );               /* add the scalar 42 to every element */
    vector_instr_sync();
    vector_dma_to_host ( A, va, data_len );
    vector_wait_for_dma ();
    vector_free ();
}


Program in Accelerator

  • Create a Target

  • Create Parallel Array objects

  • Write expressions

  • Call ToArray to evaluate expressions

  • Delete Target object

#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};
    Target *tgt = CreateVectorTarget();
    IPA b = IPA( A, sizeof (A)/sizeof (int));
    IPA c = b + 42;
    tgt->ToArray( c, A, sizeof (A)/sizeof (int));
    tgt->Delete();
}

Alternative targets, in place of the CreateVectorTarget() line:

    Target *tgt = CreateMulticoreTarget();
    Target *tgt = CreateDX9Target();


Assembly Programming: Write Assembly → Compile with Gcc (Doesn’t compile? Back to writing) → Download to board → Get Result (Result incorrect? Back to writing)

Accelerator Programming: Write in Accelerator → Compile with Microsoft Visual Studio (Doesn’t compile, or result incorrect? Back to writing) → Compile with Gcc → Download to board → Get Result


  • Assembly Programming:
      • Hard to program
      • Long debug cycle
      • Not portable
      • Manual: not always optimal or correct (WYSIWYG)

  • Accelerator Programming:
      • Easy to program
      • Easy to debug
      • Can also target other devices
      • Automated compiler optimizations


Implementation


#include "Accelerator.h"
using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    Target *tgtVector = CreateVectorTarget();
    const int length = 8192;
    int a[] = {1,2,3,4, … , 8192};
    int d[length];
    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;
    tgtVector->ToArray( D, d, length * sizeof(int));
    tgtVector->Delete();
}
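The program above only builds an expression graph; nothing is computed until an evaluation is requested. The following is a minimal host-only sketch of that idea, not the Accelerator implementation. `Expr`, `Node`, `Input`, and `ToVector` are hypothetical names, `Abs`, `Rotate`, and `Evaluate` are stand-ins that only mirror the names on the slide, and the rotation direction (element i reads element i + k, with k >= 0) is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// One node of a lazily built expression graph (hypothetical layout).
struct Node {
    enum Kind { Leaf, AddScalar, Add, Mul, AbsOp, Rot } kind;
    std::vector<int> data;            // payload of a Leaf
    int scalar;                       // operand of AddScalar, distance of Rot
    std::shared_ptr<Node> lhs, rhs;   // operand sub-graphs
};

struct Expr { std::shared_ptr<Node> node; };

static Expr make(Node::Kind kind, int scalar, Expr lhs = {}, Expr rhs = {}) {
    auto n = std::make_shared<Node>();
    n->kind = kind;
    n->scalar = scalar;
    n->lhs = lhs.node;
    n->rhs = rhs.node;
    return Expr{n};
}

// Wraps host data as a leaf of the graph.
Expr Input(std::vector<int> values) {
    Expr e = make(Node::Leaf, 0);
    e.node->data = std::move(values);
    return e;
}

// The operators only record graph nodes; no arithmetic happens here.
Expr operator+(const Expr& a, int s)         { return make(Node::AddScalar, s, a); }
Expr operator+(const Expr& a, const Expr& b) { return make(Node::Add, 0, a, b); }
Expr operator*(const Expr& a, const Expr& b) { return make(Node::Mul, 0, a, b); }
Expr Abs(const Expr& a)                      { return make(Node::AbsOp, 0, a); }
Expr Rotate(const Expr& a, int k)            { return make(Node::Rot, k, a); }

// Walks the graph and computes the result element by element.
std::vector<int> ToVector(const Expr& e) {
    const Node& n = *e.node;
    if (n.kind == Node::Leaf) return n.data;
    std::vector<int> a = ToVector(Expr{n.lhs});
    std::vector<int> b;
    if (n.rhs) b = ToVector(Expr{n.rhs});
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        switch (n.kind) {
        case Node::AddScalar: out[i] = a[i] + n.scalar; break;
        case Node::Add:       out[i] = a[i] + b[i]; break;
        case Node::Mul:       out[i] = a[i] * b[i]; break;
        case Node::AbsOp:     out[i] = std::abs(a[i]); break;
        case Node::Rot:       // assumes a non-negative rotation distance
            out[i] = a[(i + static_cast<std::size_t>(n.scalar)) % a.size()];
            break;
        default: break;
        }
    }
    return out;
}

// Forces a sub-graph now and returns its result as a new leaf.
Expr Evaluate(const Expr& e) { return Input(ToVector(e)); }
```

With `A = Input({1, 2, 3, 4})`, building `B`, `C`, and `D` as on the slide and calling `ToVector(D)` walks three small graphs, one per `Evaluate` call plus the final one, rather than a single large graph.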

[Figure sequence: expression graph for D = (A + B) × C, with B = Rot(A) + 1 and C = Abs(A + 2). In later frames the rotation is folded into its operand, shown as A (rot), and the graph is split into separate sub-graphs for B, C, and D.]

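Combine Operations

[Figure: before and after combining operations. In sub-graph C, the Abs node and the + node beneath it (A + 2) are merged into a single node written |+|. Sub-graphs B = A (rot) + 1 and D = (A + B) × C are unchanged.]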
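The merge of Abs and + into a single |+| node can be sketched as a bottom-up tree rewrite. This is an illustrative sketch only: `IrNode`, `makeNode`, `combineOperations`, and `countOperations` are hypothetical names, and the compiler's actual IR is not shown in the slides.

```cpp
#include <memory>
#include <string>

// Hypothetical IR node: an operator or leaf name plus up to two operands.
struct IrNode {
    std::string op;                    // "abs", "+", "|+|", or a leaf name
    std::shared_ptr<IrNode> lhs, rhs;
};
using IrPtr = std::shared_ptr<IrNode>;

IrPtr makeNode(const std::string& op, IrPtr lhs = nullptr, IrPtr rhs = nullptr) {
    return std::make_shared<IrNode>(IrNode{op, lhs, rhs});
}

// Rewrites abs(x + y) into one fused "|+|" node, visiting children first.
IrPtr combineOperations(const IrPtr& n) {
    if (!n) return nullptr;
    IrPtr l = combineOperations(n->lhs);
    IrPtr r = combineOperations(n->rhs);
    if (n->op == "abs" && l && l->op == "+")
        return makeNode("|+|", l->lhs, l->rhs);
    return makeNode(n->op, l, r);
}

// Counts operator nodes, i.e. nodes that have at least one operand.
int countOperations(const IrPtr& n) {
    if (!n) return 0;
    int self = (n->lhs || n->rhs) ? 1 : 0;
    return self + countOperations(n->lhs) + countOperations(n->rhs);
}
```

For the sub-graph C = Abs(A + 2), the rewrite reduces two operator nodes to one, so a back end with a fused absolute-add operation could emit a single instruction for it.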


Scratchpad Memory

“Virtual Vector Register File”

Number of vector registers = ?

Vector register size = ?


Evaluation Order

[Figure: the sub-graphs for B, C, and D with numeric labels used to choose the evaluation order. The individual label values cannot be matched to nodes from the transcript.]


Count number of virtual vector registers

[Figure: animation over the sub-graphs B = A (rot) + 1, C = |A + 2|, and D = (A + B) × C, stepping through the nodes to count the virtual vector registers required.]


“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = Capacity/3
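One way to arrive at a count of 3 and a size of Capacity/3 is a liveness sweep over the operations in evaluation order. The sketch below is an assumption about how such a count could be made, not the compiler's algorithm: `Instr`, `countVirtualRegisters`, and `registerSizeInElements` are hypothetical names, and it assumes a destination may reuse the storage of a source that is read for the last time by the same operation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One operation in evaluation order: a destination value and its sources.
struct Instr {
    std::string dst;
    std::vector<std::string> srcs;
};

// Returns the peak number of values that must be held at the same time.
int countVirtualRegisters(const std::vector<Instr>& prog, const std::string& output) {
    std::map<std::string, std::size_t> lastUse;
    for (std::size_t i = 0; i < prog.size(); ++i)
        for (const std::string& s : prog[i].srcs) lastUse[s] = i;
    lastUse[output] = prog.size();          // the result must outlive every operation

    std::set<std::string> live;
    std::size_t peak = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        // Sources read for the last time here free their register first,
        // so the destination can take over that storage.
        for (const std::string& s : prog[i].srcs)
            if (lastUse[s] == i) live.erase(s);
        live.insert(prog[i].dst);
        peak = std::max(peak, live.size());
    }
    return static_cast<int>(peak);
}

// Splits the scratchpad evenly across the virtual registers.
std::size_t registerSizeInElements(std::size_t capacityElements, int numRegisters) {
    return capacityElements / static_cast<std::size_t>(numRegisters);
}
```

For the running example (load A, B from A, C from A, T = A + B, D = T × C) the sweep peaks at three live values: A, B, and C. The capacity passed to `registerSizeInElements` is whatever the target scratchpad provides; no specific value is given in the slides.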


Convert to LIR

[Figure: the sub-graphs for B, C (with the combined |+| node), and D, annotated with numeric labels during conversion to LIR.]





Code Generation

#include "vector.h"

int main(){
    int A[8192] = {1,2,3,4, … 8192};
    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;
    int i;                                   /* declared here because it is read after the loop */
    vector_dma_to_vector ( va, A, 32772 );
    for(i=0; i<4; i++){
        vector_set_vl ( 1024 );
        vtemp = va;                          /* swap the two input buffers */
        va = vd;
        vd = vtemp;
        vector_wait_for_dma ();
        if(i<3)
            vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );   /* prefetch the next chunk */
        if(i>0){
            vector_instr_sync();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );     /* write back the previous result */
        }
        vector ( SVW, VADD, vb, 1, va+1 );   /* B: rotated A plus 1 */
        vector_abs ( SVW, VADD, vc, 2, va ); /* C: combined |A + 2| */
        vector ( VVW, VADD, vb, vb, va );    /* A + B */
        vector ( VVW, VADD, vc, vc, vb );    /* combine vc and vb into the result */
    }
    vector_instr_sync();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );             /* write back the last result */
    vector_wait_for_dma ();
    vector_free ();
}
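The double-buffering pattern in the generated code (load the next chunk while the current one is processed, then swap buffers) can be modeled on the host as follows. This is a sequential sketch with no real DMA or overlap; `loadChunk` and `runDoubleBuffered` are hypothetical names. It assumes each chunk is loaded with one extra element so that the rotated operand can be read at offset +1, that the rotation wraps at the end of the array, and that the final operation is a multiply as in the source expression D = (A + B) * C.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Stand-in for a DMA read: copies chunkLen + 1 elements starting at `start`,
// wrapping around the end of the host array.
static void loadChunk(const std::vector<int>& host, std::size_t start,
                      std::size_t chunkLen, std::vector<int>& buf) {
    buf.resize(chunkLen + 1);
    for (std::size_t k = 0; k <= chunkLen; ++k)
        buf[k] = host[(start + k) % host.size()];
}

// Computes D = (A + B) * C with B = Rot(A) + 1 and C = |A + 2|, one chunk
// at a time. Assumes host.size() is a multiple of chunkLen.
std::vector<int> runDoubleBuffered(const std::vector<int>& host, std::size_t chunkLen) {
    const std::size_t numChunks = host.size() / chunkLen;
    std::vector<int> result(host.size());
    std::vector<int> current, next;

    loadChunk(host, 0, chunkLen, current);            // initial transfer
    for (std::size_t c = 0; c < numChunks; ++c) {
        if (c + 1 < numChunks)                        // prefetch into the idle buffer
            loadChunk(host, (c + 1) * chunkLen, chunkLen, next);

        for (std::size_t k = 0; k < chunkLen; ++k) {  // "vector" work on the current buffer
            int a = current[k];
            int b = current[k + 1] + 1;               // rotated operand at offset +1
            int absSum = std::abs(a + 2);
            result[c * chunkLen + k] = (a + b) * absSum;
        }
        std::swap(current, next);                     // the prefetched buffer becomes current
    }
    return result;
}
```

On hardware the prefetch and the vector work would run concurrently, which is why two input buffers are needed; here they simply run back to back.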


Expression Graph → IR:

  • Convert to IR
  • CSE
  • Combine Memory transforms
  • Sub-divide IR
  • Move Bounds to Leaves
  • Constant folding
  • Combine Operations
  • Evaluation Ordering
  • Buffer Counting
  • Convert To LIR

LIR → VENICE Code:

  • Calculate Register Size
  • Allocate Memory
  • Initialize Memory
  • Transfer Data To Scratchpad
  • Set VL
  • Write Vector Instructions
  • Transfer Result To Host
  • Need Double buffering?


Results



Compare to Intel CPU

Compile Time


Using smaller data types

Speedup using bytes

Speedup using halfwords



Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable third-party back-end support to provide a sustainable solution for future emerging hardware.




“Virtual Vector Register File”

Number of vector registers = 4

Vector register size = 1024



Performance Degradation on median

Human-written compare-and-swap:

    int *v_min = v_input1;
    int *v_max = v_input2;
    vector ( VVW, VOR, v_tmp, v_min, v_min );
    vector ( VVW, VSUB, v_sub, v_max, v_min );
    vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap:

    vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
    vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
    vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
    vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
    vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
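The two listings above can be checked against each other with a scalar model of the conditional-move instructions. This is a sketch under assumptions: `vsub`, `cmvLtz`, `cmvGtez`, `humanCompareSwap`, and `compilerCompareSwap` are hypothetical helpers, VSUB is assumed to compute first source minus second source, and the conditional moves are assumed to copy a source element only where the condition element is negative (LTZ) or non-negative (GTEZ).

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// dst[i] = a[i] - b[i]  (assumed operand order for VSUB)
static void vsub(Vec& dst, const Vec& a, const Vec& b) {
    for (std::size_t i = 0; i < dst.size(); ++i) dst[i] = a[i] - b[i];
}

// dst[i] = src[i] where cond[i] < 0  (assumed meaning of VCMV_LTZ)
static void cmvLtz(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] < 0) dst[i] = src[i];
}

// dst[i] = src[i] where cond[i] >= 0  (assumed meaning of VCMV_GTEZ)
static void cmvGtez(Vec& dst, const Vec& src, const Vec& cond) {
    for (std::size_t i = 0; i < dst.size(); ++i)
        if (cond[i] >= 0) dst[i] = src[i];
}

// Hand-written form: works in place on the two inputs, four vector operations.
void humanCompareSwap(Vec& vMin, Vec& vMax) {
    Vec tmp = vMin;                  // models VOR tmp, min, min (a copy)
    Vec sub(vMin.size());
    vsub(sub, vMax, vMin);           // negative where the pair is out of order
    cmvLtz(vMin, vMax, sub);
    cmvLtz(vMax, tmp, sub);
}

// Compiler-generated form: writes fresh outputs, five vector operations.
void compilerCompareSwap(const Vec& in1, const Vec& in2, Vec& vMin, Vec& vMax) {
    vMin.assign(in1.size(), 0);
    vMax.assign(in1.size(), 0);
    Vec sub(in1.size());
    vsub(sub, in1, in2);
    cmvGtez(vMin, in2, sub);
    cmvLtz(vMin, in1, sub);
    cmvGtez(vMax, in1, sub);
    cmvLtz(vMax, in2, sub);
}
```

Both forms produce the same minimum and maximum vectors. The hand-written version reuses its inputs as outputs and needs one fewer vector operation per compare-and-swap, which is consistent with the slowdown the slide reports for the median benchmark.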


