Presentation Transcript

Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan

[Figure: app.c compiled into a chain of loop accelerators (LA)]
Automated C to Gates Solution
  • SoC design
    • 10-100 Gops within a 200 mW power budget
    • Low-level tools ineffective at this scale
  • Automated accelerator synthesis for the whole application
    • Correct by construction
    • Increased designer productivity
    • Faster time to market
[Figure: H.264 encoder block diagram (Motion Estimator, Motion Predictor, Transform, Quantizer, Coder, Inverse Quantizer, Inverse Transform) and W-CDMA transmitter block diagram (CRC, Conv./Turbo coder, Block Interleaver, Spreader/Scrambler, OVSF Generator, RRC Filter, Baseband Transmitter)]
Streaming Applications
  • Data “streaming” through kernels
  • Kernels are tight loops
    • FIR, Viterbi, DCT
  • Coarse grain dataflow between kernels
    • Sub-blocks of images, network packets
Software Overview

[Figure: tool flow: (1) whole application, (2) frontend analyses produce a loop graph, (3) system-level synthesis produces an accelerator pipeline with SRAM buffers, (4) multifunction accelerators]

[Figure: DCT loop graph: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Input Specification
  • Sequential C program
  • Kernel specification
    • Perfectly nested FOR loop
    • Wrapped inside a C function
    • All data accesses made explicit
  • System specification
    • Function with the main input/output
    • Local arrays to pass data between kernels
    • Sequence of calls to kernels

    row_trans(char inp[8][8], char out[8][8]) {
      for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
          . . . = inp[i][j];
          out[i][j] = . . . ;
        }
      }
    }

    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }

Performance Specification

[Figure: 1024 x 768 input image divided into 8x8 blocks]

  • High-performance DCT
    • Process one 1024x768 image every 2 ms
    • Given a 400 MHz clock:
      • One image every 800,000 cycles
      • One block every 64 cycles
  • Low-performance DCT
    • Process one 1024x768 image every 4 ms
    • One block every 128 cycles

[Figure: the DCT loop graph (inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out); one traversal of the graph is a task, producing the output coefficients]

Performance goal: task throughput, expressed as the number of cycles between tasks

Building Blocks

[Figure: kernels 1-4 connected through SRAM buffers (tmp1-tmp3); several kernels can share one multifunction loop accelerator [CODES/ISSS '06]]


System Schema Overview

[Figure: timeline of successive tasks; kernels 1-5 are mapped onto loop accelerators LA 1-LA 3, and consecutive tasks are offset by the task throughput]

[Figure: two three-kernel pipelines (K1 → K2 → K3, each loop with TC=100): at II=1 each loop finishes in 100 cycles, sustaining one task per 100 cycles (high performance); at II=2 each loop takes 200 cycles, sustaining one task per 200 cycles (low performance)]

Cost Components
  • Cost of loop accelerator data path
    • Cost of FUs, shift registers, muxes, interconnect
  • Initiation interval (II)
    • Key parameter that decides LA cost
      • Low II → high performance → high cost
    • Loop execution time ≈ (trip count) x II
    • Appropriate II chosen to satisfy task throughput


[Figure: kernels K1-K4 (each TC=100) grouped onto loop accelerators; grouping more loops onto one LA keeps it occupied longer within each task]

Cost Components (contd.)
  • Grouping of loops into a multifunction LA
    • More loops in a single LA → LA occupied for longer time in current task

(Throughput = 1 task / 200 cycles; LA 1 occupied for 200 cycles)

[Figure: three II=1, TC=100 kernels (K1-K3) on LA 1-LA 3 with intermediate buffers tmp1 and tmp2; successive tasks overlap in time, each task advancing through the pipeline 100 cycles behind its predecessor]

Cost Components (contd.)
  • Cost of SRAM buffers for intermediate arrays
  • More buffers → more task overlap → higher performance

(tmp1 buffer in use by LA 2; adjacent tasks use different buffers)

ILP Formulation
  • Variables
    • II for each loop
    • Which loops are combined into a single LA
    • Number of buffers for each temporary array
  • Objective function
    • Cost of LAs + cost of buffers
  • Constraints
    • Overall task throughput must be achieved

Non-linear LA Cost

[Figure: relative LA cost vs. initiation interval, over the range IImin to IImax]

IImin ≤ II ≤ IImax

II = 1*II1 + 2*II2 + 3*II3 + . . . . + 14*II14 and 0 ≤ IIi ≤ 1

Cost(II) = C1*II1 + C2*II2 + C3*II3 + . . . . + C14*II14

Multifunction Accelerator Cost
  • Impractical to obtain accurate cost of all combinations
  • C_LA = 0.5 * (SUM_CLA + MAX_CLA), i.e., midway between the sum and the max of the member accelerators' costs

[Figure: four LAs under three sharing scenarios:
  • Worst case, no sharing: cost = sum
  • Realistic case, some sharing: cost between sum and max
  • Best case, full sharing: cost = max]

[Figure: four synthesized designs for the "Simple" benchmark at task throughputs of 512, 1536, 1792, and 2048 cycles]

Case Study: “Simple” benchmark

(Loop graph, TC=256)

Beamformer
  • Beamformer benchmark: 10 loops; memory accounts for 60% to 70% of total cost
  • Up to 20% cost savings from hardware sharing in multifunction accelerators
  • Systems at lower throughput have over-designed LAs
    • Not profitable to pick a lower-performance LA
  • Memory buffer cost is significant
    • A high-performance producer-consumer pair is better than adding more buffers
Conclusions
  • Automated design is realistic for a system of loops
  • Designers can move up the abstraction hierarchy
  • Observations
    • Macro-level hardware sharing can achieve significant cost savings
    • Memory cost is significant – datapath and memory cost must be optimized simultaneously
  • ILP formulation is tractable
    • Solver took less than 1 minute for systems with 30 loops