slide1

Accelerator Compiler

for the VENICE Vector Processor

Zhiduo Liu

Supervisor: Guy Lemieux

Sep. 28th, 2012

slide2
Outline:

Motivation

Background

Implementation

Results

Conclusion

slide4

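Motivation

[Figure: word cloud of parallel hardware platforms and the languages, APIs and tools used to program them.]

Platforms: FPGA, GPU, Multi-core, Many-core, Vector Processor, Computer clusters

Languages, APIs and tools: VHDL, Verilog, System Verilog, Bluespec, OpenMP, OpenCL, OpenGL, OpenHMPP, CUDA, MPI, Pthread, SSE, Cilk, ParC, Erlang, aJava, X10, StreamIt, Sh, Fortress, Sponge, Chapel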

slide5

Motivation

Simplification

[Figure: the word cloud of platforms, languages, APIs and tools from the previous slide, repeated.]

slide6
Motivation

Single Description

slide7
Contributions

The compiler serves as a new back-end of a single-description multiple-device language.

The compiler makes VENICE easier to program and debug.

The compiler provides auto-parallelization and optimization.

[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.

[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant, G. Lemieux, “VEGAS: soft vector processor with scratchpad memory,” in FPGA 2011.

slide8
Outline:

Motivation

Background

Implementation

Results

Conclusion

slide9

Complicated

[Figure: block diagram with labels ALIGN, WR/RD, ALIGN, EX1, EX2 and ACCUM.]

slide10

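#include "vector.h"

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};
    const int data_len = sizeof ( A );                 // size of A in bytes

    int *va = ( int *) vector_malloc ( data_len );     // allocate a vector in scratchpad
    vector_dma_to_vector ( va, A, data_len );          // main memory -> scratchpad
    vector_wait_for_dma ();

    vector_set_vl ( data_len / sizeof (int) );         // vector length in elements
    vector ( SVW, VADD, va, 42, va );                  // va = 42 + va
    vector_instr_sync();                               // wait for vector operations

    vector_dma_to_host ( A, va, data_len );            // scratchpad -> main memory
    vector_wait_for_dma ();
    vector_free ();                                    // deallocate scratchpad memory
}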

Program in VENICE assembly

  • Allocate vectors in scratchpad
  • Move data from main memory to scratchpad
  • Wait for DMA transaction to be completed
  • Setup for vector instructions
  • Perform vector computations
  • Wait for vector operations to be completed
  • Move data from scratchpad to main memory
  • Wait for DMA transaction to be completed
  • Deallocate memory from scratchpad
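
As a point of reference, the effect of the single vector instruction in this program can be modelled on the host in plain C++. This is a hedged sketch, not VENICE code: `svw_add` and `run_example` are hypothetical helpers, and the model assumes that the scalar-vector add writes `scalar + src[i]` to `dst[i]` for the first `vl` words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a scalar-vector word add (SVW, VADD):
// dst[i] = scalar + src[i] for the first vl elements.
// This is an assumption about the instruction's effect, not the VENICE API.
void svw_add(std::vector<std::int32_t>& dst, std::int32_t scalar,
             const std::vector<std::int32_t>& src, std::size_t vl) {
    assert(vl <= src.size() && vl <= dst.size());
    for (std::size_t i = 0; i < vl; ++i) {
        dst[i] = scalar + src[i];
    }
}

// Mirrors the slide's program: add 42 to each of the eight input words in place.
std::vector<std::int32_t> run_example() {
    std::vector<std::int32_t> a = {1, 2, 3, 4, 5, 6, 7, 8};
    svw_add(a, 42, a, a.size());
    return a;
}
```

The model leaves out everything that makes the real program long: scratchpad allocation, the two DMA transfers and the synchronization calls.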
slide11

Program in Accelerator

  • Create a Target
  • Create Parallel Array objects
  • Write expressions
  • Call ToArray to evaluate expressions
  • Delete Target object

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    int A[] = {1,2,3,4,5,6,7,8};

    Target *tgt = CreateVectorTarget();               // create a target
    IPA b = IPA( A, sizeof (A)/sizeof (int));         // create a parallel array object
    IPA c = b + 42;                                   // write an expression
    tgt->ToArray( c, A, sizeof (A)/sizeof (int));     // evaluate the expression
    tgt->Delete();                                    // delete the target object
}

Other targets are selected by changing only the target-creation line:

Target *tgt = CreateMulticoreTarget();

Target *tgt = CreateDX9Target();
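
The steps "write expressions" and "call ToArray to evaluate expressions" imply that an expression such as `b + 42` is only recorded when written and computed later. A minimal sketch of that deferred-evaluation idea is shown here. `LazyArray` and `to_array` are hypothetical names invented for illustration; this is not the Accelerator API and says nothing about how Accelerator represents expressions internally.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parallel-array handle: it stores a recipe
// (a closure) instead of results, so nothing is computed until to_array().
class LazyArray {
public:
    explicit LazyArray(const std::vector<int>& data)
        : size_(data.size()),
          eval_([data](std::size_t i) { return data[i]; }) {}

    // Writing an expression only wraps the existing recipe in a new one.
    LazyArray operator+(int scalar) const {
        std::function<int(std::size_t)> inner = eval_;
        return LazyArray(size_, [inner, scalar](std::size_t i) {
            return inner(i) + scalar;
        });
    }

    // Evaluation happens here, analogous in spirit to calling ToArray.
    std::vector<int> to_array() const {
        std::vector<int> out(size_);
        for (std::size_t i = 0; i < size_; ++i) {
            out[i] = eval_(i);
        }
        return out;
    }

private:
    LazyArray(std::size_t n, std::function<int(std::size_t)> f)
        : size_(n), eval_(std::move(f)) {}

    std::size_t size_;
    std::function<int(std::size_t)> eval_;
};
```

Because the whole expression is known before anything runs, a back-end is free to analyse and rewrite it first, which is the kind of freedom a target-specific compiler relies on.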

slide12

Assembly Programming:

  • Write assembly
  • Compile with GCC (doesn’t compile? go back to the source)
  • Download to board
  • Get result (result incorrect? go back to the source)

Accelerator Programming:

  • Write in Accelerator
  • Compile with Microsoft Visual Studio (doesn’t compile, or result incorrect? go back to the source)
  • Compile with GCC
  • Download to board
  • Get result

slide13

Assembly Programming:

  • Hard to program
  • Long debug cycle
  • Not portable
  • Manual: not always optimal or correct (WYSIWYG)

Accelerator Programming:

  • Easy to program
  • Easy to debug
  • Can also target other devices
  • Automated compiler optimizations
slide14
Outline:

Motivation

Background

Implementation

Results

Conclusion

slide17

#include "Accelerator.h"

using namespace ParallelArrays;
using namespace MicrosoftTargets;

int main()
{
    Target *tgtVector = CreateVectorTarget();

    const int length = 8192;
    int a[] = {1,2,3,4, … , 8192};
    int d[length];

    IPA A = IPA( a, length);
    IPA B = Evaluate( Rotate(A, [1]) + 1 );
    IPA C = Evaluate( Abs( A + 2 ));
    IPA D = ( A + B ) * C ;

    tgtVector->ToArray( D, d, length * sizeof(int));
    tgtVector->Delete();
}

[Figure: expression graph for D. The root × has two operands: Abs(A + 2) and A + (Rot(A) + 1).]

slide18

[Figure: expression graph for D, unchanged from the previous slide: D = (A + (Rot(A) + 1)) × Abs(A + 2).]

slide19

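[Figure: expression graph for D with the Rot node merged into its operand, shown as the leaf A (rot): D = (A + (A(rot) + 1)) × Abs(A + 2).]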

slide20

[Figure: the expression graph shown as three sub-graphs: B = A(rot) + 1, C = Abs(A + 2) and D = (A + B) × C.]

slide23

Combine Operations

[Figure: the three sub-graphs before combining: C = Abs(A + 2), B = A(rot) + 1 and D = (A + B) × C.]

slide24

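Combine Operations

[Figure: the three sub-graphs after combining. In C, the Abs and + nodes are merged into a single |+| node: C = |A + 2|. B = A(rot) + 1 and D = (A + B) × C are unchanged.]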
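
The merge of Abs and + into a single |+| node can be pictured as a small tree rewrite. The sketch below is illustrative only: the `Node` type, the `Op` names and the `combine` function are assumptions made for this example, not the compiler's actual intermediate representation.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

enum class Op { Leaf, Add, Abs, AbsAdd };

// Hypothetical expression node: leaves carry a name, Abs uses lhs only.
struct Node {
    Op op;
    std::string name;
    std::shared_ptr<Node> lhs, rhs;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr leaf(const std::string& n) {
    return std::make_shared<Node>(Node{Op::Leaf, n, nullptr, nullptr});
}
NodePtr add(NodePtr a, NodePtr b) {
    return std::make_shared<Node>(Node{Op::Add, "", std::move(a), std::move(b)});
}
NodePtr abs_of(NodePtr a) {
    return std::make_shared<Node>(Node{Op::Abs, "", std::move(a), nullptr});
}

// Bottom-up rewrite: Abs(Add(x, y)) becomes one AbsAdd(x, y) node.
NodePtr combine(const NodePtr& n) {
    if (!n || n->op == Op::Leaf) return n;
    NodePtr l = combine(n->lhs);
    NodePtr r = combine(n->rhs);
    if (n->op == Op::Abs && l && l->op == Op::Add) {
        return std::make_shared<Node>(Node{Op::AbsAdd, "", l->lhs, l->rhs});
    }
    return std::make_shared<Node>(Node{n->op, "", l, r});
}

// Text rendering, with |x + y| standing for the combined node.
std::string show(const NodePtr& n) {
    switch (n->op) {
        case Op::Leaf:   return n->name;
        case Op::Add:    return "(" + show(n->lhs) + " + " + show(n->rhs) + ")";
        case Op::Abs:    return "Abs" + show(n->lhs);
        case Op::AbsAdd: return "|" + show(n->lhs) + " + " + show(n->rhs) + "|";
    }
    return "";
}
```

Applied to the sub-graph for C, `combine(abs_of(add(leaf("A"), leaf("2"))))` yields one node where there were two, so one instruction can be emitted instead of two.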

slide25
Scratchpad Memory

“Virtual Vector Register File”

slide27
“Virtual Vector Register File”

Number of vector registers = ?

Vector register size = ?

slide29

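Evaluation Order

[Figure: the sub-graphs for B, C and D with nodes and edges annotated by small integers (0 to 5) used to order evaluation. The individual annotations are not recoverable from this transcript.]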
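
The slide does not state how the evaluation order is chosen. One textbook technique for ordering the evaluation of an expression tree so that few registers are live at once is Sethi-Ullman labelling, sketched below as an illustration of the general idea. It is not claimed to be the scheme used by this compiler; the `Expr` type and the simplified rule that every leaf costs one register are assumptions of the example.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical binary expression tree: a leaf has no children.
struct Expr {
    std::string name;
    std::shared_ptr<Expr> lhs, rhs;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(const std::string& n) {
    return std::make_shared<Expr>(Expr{n, nullptr, nullptr});
}
ExprPtr node(const std::string& n, ExprPtr l, ExprPtr r) {
    return std::make_shared<Expr>(Expr{n, std::move(l), std::move(r)});
}

// Registers needed to evaluate a subtree without spilling
// (simplified variant: every leaf needs one register).
int need(const Expr& e) {
    if (!e.lhs && !e.rhs) return 1;
    assert(e.lhs && e.rhs);
    const int l = need(*e.lhs);
    const int r = need(*e.rhs);
    return l == r ? l + 1 : std::max(l, r);
}

// Append operators in evaluation order, visiting the more demanding
// child first so its result occupies a single register afterwards.
void order(const Expr& e, std::vector<std::string>& out) {
    if (!e.lhs && !e.rhs) return;
    const Expr* first = e.lhs.get();
    const Expr* second = e.rhs.get();
    if (need(*second) > need(*first)) std::swap(first, second);
    order(*first, out);
    order(*second, out);
    out.push_back(e.name);
}
```

For a tree shaped like D = (A + B) × C, the sum is evaluated before the product regardless of which side of × it appears on.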

slide30

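Count number of virtual vector registers

[Figure: animation spanning slides 30 to 58 that walks over the sub-graphs for B, C and D while counting virtual vector registers. The transcript text is identical on every step of the animation.]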

slide59
“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = ?

slide60
“Virtual Vector Register File”

Number of vector registers = 3

Vector register size = Capacity/3
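
The rule "register size = capacity / number of registers" can be written out as a short calculation. The sketch below is an assumption-laden illustration: `plan_registers`, the word-based units and the strip count (how many passes are needed to cover an input longer than one register) are introduced here for the example and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>

struct RegisterPlan {
    std::size_t words_per_register;  // virtual vector register size, in words
    std::size_t strips;              // passes needed to cover the whole input
};

// Split a scratchpad of `capacity_words` evenly among `num_registers`
// virtual vector registers, then count how many strips of that size
// are needed to process `length` words.
RegisterPlan plan_registers(std::size_t capacity_words,
                            std::size_t num_registers,
                            std::size_t length) {
    assert(num_registers > 0);
    const std::size_t size = capacity_words / num_registers;
    assert(size > 0);
    const std::size_t strips = (length + size - 1) / size;  // ceiling division
    return RegisterPlan{size, strips};
}
```

For example, with a hypothetical capacity of 3072 words and 3 registers, each register holds 1024 words, and an 8192-word input takes 8 strips.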

slide61

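Convert to LIR

[Figure: the sub-graphs C = |A + 2|, B = A(rot) + 1 and D = (A + B) × C, with nodes and edges annotated by small integers (1 to 5). The individual annotations are not recoverable from this transcript.]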

slide68

#include "vector.h"

int main(){
    int A[8192] = {1,2,3,4, … 8192};

    int *va = ( int *) vector_malloc ( 32772 );
    int *vb = ( int *) vector_malloc ( 32768 );
    int *vc = ( int *) vector_malloc ( 32768 );
    int *vd = ( int *) vector_malloc ( 32772 );
    int *vtemp = va;
    int i;                                   // declared here because it is used after the loop

    vector_dma_to_vector ( va, A, 32772 );   // load the first block
    for(i=0; i<4; i++){
        vector_set_vl ( 1024 );
        vtemp = va;                          // swap the two input buffers
        va = vd;
        vd = vtemp;
        vector_wait_for_dma ();
        if(i<3)
            vector_dma_to_vector ( va, A+(i+1)*1024, 32772 );   // start loading the next block
        if(i>0){
            vector_instr_sync();
            vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );     // write back the previous result
        }
        vector ( SVW, VADD, vb, 1, va+1 );   // B = Rot(A) + 1
        vector_abs ( SVW, VADD, vc, 2, va ); // C = |A + 2|
        vector ( VVW, VADD, vb, vb, va );    // A + B
        vector ( VVW, VMUL, vc, vc, vb );    // D = (A + B) * C
    }
    vector_instr_sync();
    vector_dma_to_host ( A+(i-1)*1024, vc, 32768 );             // write back the last result
    vector_wait_for_dma ();
    vector_free ();
}

Code Generation
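
When checking generated code like this, a plain host-side reference for the same expression is useful. The sketch below computes D = (A + B) × C with B = Rot(A) + 1 and C = |A + 2|, processing the input in strips the way a strip-mined loop would. It is a hedged model, not VENICE code: `strip_mined_d` is a hypothetical helper, and the rotation direction `Rot(A)[i] = A[(i + 1) % n]` is an assumption suggested by the `va+1` operand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Host-side reference for D = (A + B) * C, with B = Rot(A) + 1 and C = |A + 2|.
// Assumption: Rot(A)[i] = A[(i + 1) % n].
// The input is processed in strips of `vl` elements; the result does not
// depend on `vl`, which is what makes strip-mining a safe transformation here.
std::vector<std::int64_t> strip_mined_d(const std::vector<std::int32_t>& a,
                                        std::size_t vl) {
    assert(vl > 0);
    const std::size_t n = a.size();
    std::vector<std::int64_t> d(n);
    for (std::size_t start = 0; start < n; start += vl) {
        const std::size_t end = (start + vl < n) ? start + vl : n;
        for (std::size_t i = start; i < end; ++i) {
            const std::int64_t x = a[i];
            const std::int64_t b = static_cast<std::int64_t>(a[(i + 1) % n]) + 1;
            const std::int64_t c = std::llabs(x + 2);
            d[i] = (x + b) * c;
        }
    }
    return d;
}
```

Each strip reads one element beyond its own range (the rotated neighbour), so a buffer holding a strip of the input needs room for one extra word.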

slide69

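[Figure: flowchart of the compiler passes, from Expression Graph through IR and LIR to VENICE Code.]

Expression Graph to IR, then passes on the IR:

  • Convert to IR
  • CSE
  • Combine Memory transforms
  • Sub-divide IR
  • Move Bounds to Leaves
  • Constant folding
  • Combine Operations
  • Evaluation Ordering
  • Buffer Counting
  • Convert To LIR

LIR to VENICE Code:

  • Calculate Register Size
  • Allocate Memory
  • Initialize Memory
  • Transfer Data To Scratchpad
  • Set VL
  • Write Vector Instructions
  • Transfer Result To Host

Decision node in the chart: Need Double buffering?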

slide70
Outline:

Motivation

Background

Implementation

Results

Conclusion

slide75
Using smaller data types

Speedup using bytes

Speedup using halfwords

slide76
Outline:

Motivation

Background

Implementation

Results

Conclusion

slide77
Conclusions:

The compiler greatly improves the programming and debugging experience for VENICE.

The compiler produces highly optimized VENICE code and achieves performance close to or better than hand-optimized code.

The compiler demonstrates the feasibility of using high-abstraction languages, such as Microsoft Accelerator, with pluggable third-party back-end support to provide a sustainable solution for future emerging hardware.

slide82
“Virtual Vector Register File”

Number of vector registers = 4

Vector register size = 1024

slide84
Performance Degradation on median

Human-written compare-and-swap:

int *v_min = v_input1;
int *v_max = v_input2;
vector ( VVW, VOR, v_tmp, v_min, v_min );
vector ( VVW, VSUB, v_sub, v_max, v_min );
vector ( VVW, VCMV_LTZ, v_min, v_max, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_tmp, v_sub );

Compiler-generated compare-and-swap:

vector ( VVW, VSUB, v_sub, v_input1, v_input2 );
vector ( VVW, VCMV_GTEZ, v_min, v_input2, v_sub );
vector ( VVW, VCMV_LTZ, v_min, v_input1, v_sub );
vector ( VVW, VCMV_GTEZ, v_max, v_input1, v_sub );
vector ( VVW, VCMV_LTZ, v_max, v_input2, v_sub );
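
The two sequences can be compared with a scalar model of a single vector lane. This sketch rests on stated assumptions rather than on the VENICE specification: `VSUB` is taken to compute first source minus second, the conditional moves are taken to copy the source into the destination when the flag operand is negative (`LTZ`) or non-negative (`GTEZ`), and the third conditional move of the compiler-generated sequence is taken to write `v_max`. All function names are hypothetical.

```cpp
#include <cassert>
#include <utility>

// One-lane models of the conditional moves (assumed semantics).
void cmv_ltz(int& dst, int src, int flag)  { if (flag < 0)  dst = src; }
void cmv_gtez(int& dst, int src, int flag) { if (flag >= 0) dst = src; }

// Human-written version: 4 vector instructions, sorts the pair in place.
std::pair<int, int> human_cas(int in1, int in2) {
    int v_min = in1;
    int v_max = in2;
    int v_tmp = v_min | v_min;      // VOR of a value with itself acts as a copy
    int v_sub = v_max - v_min;      // VSUB
    cmv_ltz(v_min, v_max, v_sub);   // if max < min, take max as the new min
    cmv_ltz(v_max, v_tmp, v_sub);   // ... and the old min as the new max
    return {v_min, v_max};
}

// Compiler-generated version: 5 vector instructions, every output is
// written by two complementary conditional moves.
std::pair<int, int> compiler_cas(int in1, int in2) {
    int v_min = 0;
    int v_max = 0;
    int v_sub = in1 - in2;          // VSUB
    cmv_gtez(v_min, in2, v_sub);
    cmv_ltz(v_min, in1, v_sub);
    cmv_gtez(v_max, in1, v_sub);
    cmv_ltz(v_max, in2, v_sub);
    return {v_min, v_max};
}
```

Both versions return the same (min, max) pair; the compiler-generated one simply spends five instructions where the hand-written one spends four, which is consistent with the slowdown reported for median.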
