Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines
Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke

University of Michigan

1


[Figure: app.c mapped to a pipeline of loop accelerators (LAs)]

Automated C to Gates Solution

  • SoC design

    • 10-100 Gops, 200 mW power budget

    • Low level tools ineffective

  • Automated accelerator synthesis for whole application

    • Correct by construction

    • Increase designer productivity

    • Faster time to market

2


[Figure: two streaming applications. H.264 encoder: Image → Transform → Quantizer → Coder → Coded Image, with a feedback path through Inverse Quantizer, Inverse Transform, Motion Estimator, and Motion Predictor. W-CDMA transmitter: Data in → CRC → Conv./Turbo → Block Interleaver → Spreader/Scrambler (fed by OVSF Generator) → RRC Filter → Baseband Transmitter → Data out]

Streaming Applications

  • Data “streaming” through kernels

  • Kernels are tight loops

    • FIR, Viterbi, DCT

  • Coarse grain dataflow between kernels

    • Sub-blocks of images, network packets

3


System Schema Overview

[Figure: kernels 1-5 partitioned into loop accelerators LA 1-LA 3; three successive tasks stream through the accelerator pipeline over time, defining task throughput]

4


[Figure: DCT dataflow — inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Input Specification

  • System specification

    • Function with main input/output

    • Local arrays to pass data

    • Sequence of calls to kernels

  • Sequential C program

  • Kernel specification

    • Perfectly nested FOR loop

    • Wrapped inside C function

    • All data access made explicit

dct(char inp[8][8], char out[8][8]) {
  char tmp1[8][8], tmp2[8][8];
  row_trans(inp, tmp1);
  col_trans(tmp1, tmp2);
  zigzag_trans(tmp2, out);
}

row_trans(char inp[8][8], char out[8][8]) {
  for (i = 0; i < 8; i++) {
    for (j = 0; j < 8; j++) {
      . . . = inp[i][j];
      out[i][j] = . . . ;
    }
  }
}

col_trans(char inp[8][8], char out[8][8]);
zigzag_trans(char inp[8][8], char out[8][8]);

5


[Figure: execution timeline of kernels K1-K4 (TC=100 each) grouped onto three LAs across successive tasks, with time marks at 100, 200, 300, and 400 cycles]

System Level Decisions

  • Throughput of each LA – Initiation Interval

  • Grouping of loops into a multifunction LA

    • More loops in a single LA → LA occupied longer per task

Example: LA 1 is occupied for 200 cycles per task, so throughput = 1 task / 200 cycles.

6


[Figure: overlapped task timeline — K1, K2, K3 (TC=100, II=1) on LAs 1-3; with double-buffered tmp1 and tmp2 arrays, LA 1 starts the next task's K1 while LAs 2 and 3 still process the previous task]

System Level Decisions (contd.)

  • Cost of SRAM buffers for intermediate arrays

  • More buffers → more task overlap → higher performance

Adjacent tasks use different copies of each buffer (e.g., the tmp1 copy still in use by LA 2 is not overwritten by the next task).

7


Case Study: "Simple" Benchmark

[Figure: loop graph of the "Simple" benchmark (TC=256) and alternative groupings of its loops into one to four LAs, with resulting per-task occupancies of 512, 1536, 1792, and 2048 cycles]

8


Prescribed Throughput Accelerators

  • Traditional behavioral synthesis

    • Directly translate C operators into gates

  • Our approach: Application-centric Architectures

    • Achieve fixed throughput

    • Maximize hardware sharing

[Figure: traditional synthesis maps an operation graph directly to a datapath; our approach maps the application to an architecture]

9


Loop Accelerator Template

  • Hardware realization of modulo scheduled loop

  • Parameterized execution resources, storage, connectivity

10


Loop Accelerator Design Flow

[Figure: design flow — C code (.c) plus a performance target (throughput) feeds FU allocation, yielding an abstract architecture; modulo scheduling produces scheduled ops; building the datapath gives a concrete architecture; instantiating the architecture emits Verilog (.v) and control signals, realizing the loop accelerator (FUs, RF)]

11


[Figure: an accelerator pipeline of five single-function loop accelerators (LA1-LA5) versus a pipeline in which LA2/LA3 and LA4/LA5 are merged into multifunction loop accelerators]

Multifunction Accelerator

  • Map multiple loops to single accelerator

  • Improve hardware efficiency via reuse

  • Opportunities for sharing

    • Disjoint stages (loops 2, 3)

    • Pipeline slack (loops 4, 5)

[Figure: application loop graph — Loop 1 branches on frame type into Loop 2 or Loop 3, which feed Loops 4 and 5]

12


[Figure: Loop 1 and Loop 2 each pass through a cost-sensitive modulo scheduler; a datapath union merges the resulting FU datapaths into one multifunction accelerator]

  • 43% average savings over sum of accelerators

  • Smart union within 3% of joint scheduling solution

13


Challenges: Throughput-Enabling Transformations

  • Algorithm-level pipeline retiming

    • Splitting loops based on tiling

    • Co-scheduling adjacent loops

[Figure: loop graph before and after retiming — the critical Loop 2 is split by tiling into Loops 2a and 2b, and Loops 3 and 4 are co-scheduled into one stage]

14


Challenges: Programmable Loop Accelerator

  • Support bug fixes, evolving standards

  • Accelerate loops not known at design time

  • Minimize additional control overhead

[Figure: programmable loop accelerator — FUs, memory, and local memory joined by an interconnect, sequenced by a control unit issuing control signals every II cycles]

15


Challenges: Timing Aware Synthesis

  • Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance

  • Strategies to eliminate long wires

    • Preemptive: predict & prevent long wires

    • Reactive: use feedback from floorplanner

      - Insert flip-flop on long path

      - Reschedule with added latency

[Figure: a long wire from FU1 through FU2 to FU3 is broken by an inserted flip-flop]

16


Challenges: Adaptable Voltage/Frequency Levels


  • Allow voltage scaling beyond margins

  • Using shadow latches in loop accelerator

    • Localized error detection

    • Control is predefined: simple error recovery

[Figure: flip-flop augmented with a shadow latch — the shadow latch samples D on a delayed clock; a mismatch with Q raises an error signal; extra queue entries between FUs support recovery]

17


For More Information

  • Visit http://cccp.eecs.umich.edu

18

