
Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke

University of Michigan

Automated C to Gates Solution

[Figure: app.c compiled into a pipeline of loop accelerators (LAs)]

  • SoC design

    • 10-100 Gops, 200 mW power budget

    • Low-level tools ineffective

  • Automated accelerator synthesis for whole application

    • Correct by construction

    • Increase designer productivity

    • Faster time to market

Streaming Applications

[Figures: H.264 encoder pipeline (Image → Transform → Quantizer → Coder → Coded Image, with Inverse Quantizer, Inverse Transform, Motion Estimator, and Motion Predictor in the feedback path) and W-CDMA transmitter pipeline (Data in → CRC → Conv./Turbo coder → Block Interleaver → Spreader/Scrambler with OVSF Generator → RRC Filter → Baseband Transmitter → Data out)]

  • Data “streaming” through kernels

  • Kernels are tight loops

    • FIR, Viterbi, DCT (a FIR kernel is sketched below)

  • Coarse grain dataflow between kernels

    • Sub-blocks of images, network packets
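
As a concrete example of such a kernel (the tap count, coefficients, and array sizes below are invented; this specific loop is not from the slides), a FIR filter is exactly this kind of tight loop nest:

#include <stdio.h>

#define TAPS 4
#define N    16

int main(void) {
    int coef[TAPS] = {1, 2, 2, 1};                 /* filter coefficients (assumed) */
    int x[N + TAPS], y[N];
    for (int i = 0; i < N + TAPS; i++) x[i] = i;   /* input samples */

    /* FIR kernel: a tight loop nest, the kind of kernel a loop accelerator implements */
    for (int i = 0; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < TAPS; j++)
            acc += coef[j] * x[i + j];
        y[i] = acc;
    }

    for (int i = 0; i < N; i++) printf("%d ", y[i]);
    printf("\n");
    return 0;
}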

System Schema Overview

[Figure: kernels 1-5 mapped onto loop accelerators LA 1, LA 2, and LA 3; successive tasks stream through the accelerator pipeline over time, and the rate at which they complete is the task throughput]

Input Specification

[Figure: DCT system dataflow (inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out)]

  • System specification

    • Function with main input/output

    • Local arrays to pass data

    • Sequence of calls to kernels

  • Sequential C program

  • Kernel specification

    • Perfectly nested FOR loop

    • Wrapped inside C function

    • All data access made explicit

dct(char inp[8][8],
    char out[8][8]) {
  char tmp1[8][8], tmp2[8][8];
  row_trans(inp, tmp1);
  col_trans(tmp1, tmp2);
  zigzag_trans(tmp2, out);
}

row_trans(char inp[8][8],
          char out[8][8]) {
  for(i=0; i<8; i++) {
    for(j=0; j<8; j++) {
      . . . = inp[i][j];
      out[i][j] = . . . ;
    }
  }
}

col_trans(char inp[8][8],
          char out[8][8]);

zigzag_trans(char inp[8][8],
             char out[8][8]);


System Level Decisions

[Figure: kernels K1-K4 (each with trip count TC=100) grouped into LA 1, LA 2, and LA 3; the task timeline (cycles 100-400) shows each LA's occupancy per task]

  • Throughput of each LA – Initiation Interval

  • Grouping of loops into a multifunction LA

    • More loops in a single LA → LA occupied for a longer time per task

[Figure annotation: LA 1 is occupied for 200 cycles per task, so throughput = 1 task / 200 cycles]
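
To make the throughput arithmetic concrete, here is a minimal sketch (not taken from the slides; the struct layout, loop list, and the LA grouping are assumptions): each loop mapped to an LA occupies it for roughly II × trip count cycles per task, and the most heavily occupied LA bounds how often a new task can start.

#include <stdio.h>

/* Hypothetical description of one loop mapped to an accelerator. */
struct loop {
    int ii;          /* initiation interval of the modulo schedule */
    int trip_count;  /* iterations per task                        */
    int la;          /* which loop accelerator (LA) runs the loop  */
};

int main(void) {
    /* Assumed grouping: K1,K2 -> LA 0; K3 -> LA 1; K4 -> LA 2; II = 1, TC = 100. */
    struct loop loops[] = {
        {1, 100, 0}, {1, 100, 0}, {1, 100, 1}, {1, 100, 2}
    };
    int occupancy[3] = {0, 0, 0};

    /* Each loop occupies its LA for roughly II * trip_count cycles per task. */
    for (int i = 0; i < 4; i++)
        occupancy[loops[i].la] += loops[i].ii * loops[i].trip_count;

    /* The most heavily occupied LA limits how often a new task can start. */
    int bottleneck = 0;
    for (int la = 0; la < 3; la++)
        if (occupancy[la] > bottleneck) bottleneck = occupancy[la];

    printf("throughput = 1 task / %d cycles\n", bottleneck);
    return 0;
}

With K1 and K2 sharing one LA, the bottleneck occupancy is 200 cycles, matching the 1 task / 200 cycles figure on the slide.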


System Decisions (contd.)

[Figure: kernels K1-K3 (each II=1, TC=100) mapped to LA 1, LA 2, and LA 3, with SRAM buffers tmp1 and tmp2 passing data between them; two task timelines (cycles 100-300) show how the buffers constrain task overlap]

  • Cost of SRAM buffers for intermediate arrays

  • More buffers → more task overlap → higher performance

[Figure annotations: tmp1 buffer in use by LA 2; adjacent tasks use different buffers]
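
The buffering trade-off can be illustrated with a small double-buffering sketch (purely illustrative; the stand-in kernels, NBUF, and the sequential software model of what is really concurrent hardware are all assumptions): with two copies of tmp1, the producer LA can fill one copy for the next task while the consumer LA still reads the other copy for the previous task.

#include <stdio.h>
#include <string.h>

#define NBUF 2                           /* number of copies of the tmp1 buffer */
static char tmp1[NBUF][8][8];            /* replicated SRAM buffer              */

/* Stand-in kernels; the real bodies are the loop nests of the DCT example. */
static void row_trans(char inp[8][8], char out[8][8]) { memcpy(out, inp, 64); }
static void col_trans(char inp[8][8], char out[8][8]) { memcpy(out, inp, 64); }

int main(void) {
    char inp[8][8] = {{0}}, out[8][8];
    for (int task = 0; task < 4; task++) {
        /* LA 1 (producer) writes copy task % NBUF; LA 2 (consumer) reads the
         * copy written for the previous task, so adjacent tasks can overlap. */
        row_trans(inp, tmp1[task % NBUF]);
        if (task > 0)
            col_trans(tmp1[(task - 1) % NBUF], out);
        printf("task %d writes tmp1[%d]\n", task, task % NBUF);
    }
    return 0;
}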


Case Study: "Simple" Benchmark

[Figure: loop graph for the benchmark (TC=256) and alternative groupings of its loops into loop accelerators LA 1 through LA 4, with resulting per-task occupancies of 512, 1536, 1792, and 2048 cycles]



Prescribed Throughput Accelerators

  • Traditional behavioral synthesis

    • Directly translate C operators into gates

  • Our approach: Application-centric Architectures

    • Achieve fixed throughput

    • Maximize hardware sharing

[Figure: the traditional flow maps the operation graph directly onto a datapath, whereas the application-centric flow maps the application onto an architecture]



Loop Accelerator Template

  • Hardware realization of modulo scheduled loop

  • Parameterized execution resources, storage, connectivity

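
Since the template is the hardware form of a modulo schedule, a minimal software-pipelining sketch may help (the loop, its two stages, and the II = 1 schedule are invented for illustration): the loop body is split into stages so that a new iteration starts every II cycles, and the accelerator realizes that overlapped steady state directly in its datapath and control.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {1,2,3,4,5,6,7,8}, b[N] = {1,1,1,1,1,1,1,1};
    int d[N] = {0}, c[N];

    /* Original loop body has two dependent ops per iteration:
     *   t = a[i] * b[i];   (stage 1)
     *   c[i] = t + d[i];   (stage 2)
     * Modulo scheduling with II = 1 overlaps stage 2 of iteration i with
     * stage 1 of iteration i+1, so a new iteration starts every cycle.  */
    int t = a[0] * b[0];                 /* prologue: stage 1 of iteration 0 */
    for (int i = 0; i < N - 1; i++) {    /* steady-state kernel              */
        c[i] = t + d[i];                 /* stage 2 of iteration i           */
        t = a[i + 1] * b[i + 1];         /* stage 1 of iteration i+1         */
    }
    c[N - 1] = t + d[N - 1];             /* epilogue: stage 2 of last iter   */

    for (int i = 0; i < N; i++) printf("%d ", c[i]);
    printf("\n");
    return 0;
}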


Loop Accelerator Design Flow

[Figure: design flow from C code (.c) plus a performance requirement (throughput), through FU allocation (abstract architecture), modulo scheduling (ops scheduled over time), and datapath construction (concrete architecture of FUs and register files), to an instantiated architecture emitted as Verilog (.v) with control signals and synthesized into the loop accelerator]
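
A minimal sketch of the FU allocation step, using the usual resource-bound reasoning for modulo scheduling (the opcode classes, op counts, and target II below are assumptions): to sustain an initiation interval II, the abstract architecture needs at least ceil(ops of a class / II) function units of that class.

#include <stdio.h>

/* Hypothetical opcode classes for the example. */
enum { OP_ADD, OP_MUL, OP_MEM, NUM_CLASSES };

/* Minimum FUs per class so the loop body can issue one iteration every II
 * cycles: at least ceil(count[c] / II) units of class c (the FU side of the
 * resource-constrained lower bound on II). */
static void alloc_fus(const int count[NUM_CLASSES], int ii, int fus[NUM_CLASSES]) {
    for (int c = 0; c < NUM_CLASSES; c++)
        fus[c] = (count[c] + ii - 1) / ii;   /* ceiling division */
}

int main(void) {
    /* Assumed loop body: 6 adds, 2 multiplies, 3 memory ops; target II = 2. */
    int count[NUM_CLASSES] = {6, 2, 3};
    int fus[NUM_CLASSES];
    alloc_fus(count, 2, fus);
    printf("ADD FUs: %d, MUL FUs: %d, MEM FUs: %d\n",
           fus[OP_ADD], fus[OP_MUL], fus[OP_MEM]);   /* 3, 1, 2 */
    return 0;
}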


Multifunction Accelerator

[Figure: five single-loop accelerators (LA1-LA5) forming an accelerator pipeline, versus the same pipeline with loops combined into multifunction loop accelerators]

  • Map multiple loops to single accelerator

  • Improve hardware efficiency via reuse

  • Opportunities for sharing

    • Disjoint stages (loops 2, 3)

    • Pipeline slack (loops 4, 5)

[Figure: application task graph in which Loop 1 feeds a "Frame Type?" decision selecting either Loop 2 or Loop 3, followed by Loop 4 and Block 5]


Datapath Union

[Figure: Loop 1 and Loop 2 are each mapped onto FUs by a cost-sensitive modulo scheduler, and their datapaths are merged by taking the union of the required hardware]

  • 43% average savings over sum of accelerators

  • Smart union within 3% of joint scheduling solution

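
A minimal sketch of why a union datapath is cheaper than the sum of separate accelerators (the per-loop FU requirements and the FU-count-only cost model are assumptions; a real datapath union also merges storage and interconnect): for each FU class the merged accelerator needs only the maximum count any single loop requires.

#include <stdio.h>

#define NUM_CLASSES 3     /* e.g. ADD, MUL, MEM; illustrative classes */
#define NUM_LOOPS   2

int main(void) {
    /* Assumed per-loop FU requirements after modulo scheduling. */
    int need[NUM_LOOPS][NUM_CLASSES] = {
        {3, 1, 2},   /* loop 1 */
        {2, 2, 1}    /* loop 2 */
    };

    int sum = 0, uni = 0;
    for (int c = 0; c < NUM_CLASSES; c++) {
        int mx = 0;
        for (int l = 0; l < NUM_LOOPS; l++) {
            sum += need[l][c];                    /* separate accelerators   */
            if (need[l][c] > mx) mx = need[l][c]; /* shared (union) datapath */
        }
        uni += mx;
    }
    printf("sum of accelerators: %d FUs, union: %d FUs\n", sum, uni);  /* 11 vs 7 */
    return 0;
}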



Challenges: Throughput Enabling Transformations

  • Algorithm-level pipeline retiming

    • Splitting loops based on tiling (see the sketch below)

    • Co-scheduling adjacent loops

[Figure: original loop graph (Loop 1, Loop 2 as the critical loop, Loop 3, Loop 4) and the transformed graph in which Loop 2 is split into Loop 2a and Loop 2b and Loops 3 and 4 are co-scheduled]
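
A minimal sketch of splitting a loop based on tiling (the kernel and the halfway split point are invented for illustration): the critical loop's iteration space is divided so each piece can become its own pipeline stage, cutting that stage's per-task occupancy roughly in half.

#include <assert.h>

#define N 256

/* Original critical loop: one pipeline stage occupies ~N * II cycles per task. */
void loop2(int x[N], int y[N]) {
    for (int i = 0; i < N; i++)
        y[i] = x[i] * 3 + 1;
}

/* Tiled split: loop2a and loop2b can be mapped to separate pipeline stages,
 * each occupying roughly half as long per task, so tasks overlap more. */
void loop2a(int x[N], int y[N]) {
    for (int i = 0; i < N / 2; i++)
        y[i] = x[i] * 3 + 1;
}

void loop2b(int x[N], int y[N]) {
    for (int i = N / 2; i < N; i++)
        y[i] = x[i] * 3 + 1;
}

int main(void) {
    int x[N], y1[N], y2[N];
    for (int i = 0; i < N; i++) x[i] = i;
    loop2(x, y1);
    loop2a(x, y2);
    loop2b(x, y2);
    for (int i = 0; i < N; i++) assert(y1[i] == y2[i]);  /* the split preserves results */
    return 0;
}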



Challenges: Programmable Loop Accelerator

  • Support bug fixes, evolving standards

  • Accelerate loops not known at design time

  • Minimize additional control overhead

[Figure: programmable loop accelerator datapath: FUs and MEM connected by an interconnect, with a local control memory supplying the control signals for each cycle of the II]



Challenges: Timing Aware Synthesis

  • Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance

  • Strategies to eliminate long wires

    • Preemptive: predict & prevent long wires

    • Reactive: use feedback from floorplanner

      - Insert flip-flop on long path

      - Reschedule with added latency

[Figure: placed FUs (FU1, FU2, FU3) with a long wire broken by an inserted flip-flop]
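
A minimal sketch of the reactive strategy (the wire delays, cycle-time budget, and the crude assumption that an inserted register roughly halves the remaining path are all invented): edges whose floorplan-estimated delay exceeds the clock period get a flip-flop, which adds a cycle of latency to that connection, and the loop is then modulo scheduled again with the new latency.

#include <stdio.h>

#define NUM_EDGES 3

struct edge {
    const char *from, *to;   /* producing and consuming FU           */
    double wire_delay_ns;    /* delay estimated from the floorplan   */
    int extra_latency;       /* pipeline flip-flops inserted on edge */
};

int main(void) {
    double cycle_time_ns = 2.0;                       /* assumed clock budget */
    struct edge edges[NUM_EDGES] = {
        {"FU1", "FU2", 0.8, 0},
        {"FU2", "FU3", 1.1, 0},
        {"FU1", "FU3", 2.7, 0},                       /* long wire */
    };

    for (int i = 0; i < NUM_EDGES; i++) {
        /* Reactive fix: break any wire longer than one cycle with flip-flops. */
        while (edges[i].wire_delay_ns > cycle_time_ns) {
            edges[i].extra_latency++;                 /* insert a flip-flop */
            edges[i].wire_delay_ns /= 2.0;            /* crude model: the register
                                                         roughly halves the path */
        }
        if (edges[i].extra_latency)
            printf("%s -> %s: +%d cycle(s); re-run modulo scheduling with this latency\n",
                   edges[i].from, edges[i].to, edges[i].extra_latency);
    }
    return 0;
}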



Challenges: Adaptable Voltage/Frequency Levels


  • Allow voltage scaling beyond margins

  • Use shadow latches in the loop accelerator

    • Localized error detection

    • Control is predefined: simple error recovery

[Figure: a flip-flop (D, Q, CLK) paired with a shadow latch driven through a delay; a mismatch between the two raises an error signal. In the loop accelerator, shadow latches sit at the FU outputs and extra queue entries are provided for recovery]
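
A minimal C model of the shadow-latch check (purely illustrative; the example values are invented): the main flip-flop samples the FU output at the clock edge, the shadow latch samples the same signal after a short delay, and any mismatch flags a timing error that the accelerator's predefined control sequence can recover from.

#include <stdio.h>
#include <stdbool.h>

/* One monitored FU output bit: main flip-flop vs. delayed shadow latch. */
struct shadowed_ff {
    bool main_q;     /* value captured at the clock edge   */
    bool shadow_q;   /* value captured a short delay later */
};

/* Returns true if the late-settling value disagrees with the captured one. */
static bool clock_and_check(struct shadowed_ff *ff, bool value_at_edge, bool value_after_delay) {
    ff->main_q = value_at_edge;
    ff->shadow_q = value_after_delay;
    return ff->main_q != ff->shadow_q;   /* error: the timing margin was violated */
}

int main(void) {
    struct shadowed_ff ff;
    /* Cycle 1: the FU output settled in time, so no error. */
    bool err1 = clock_and_check(&ff, 1, 1);
    /* Cycle 2: voltage scaled too low, the output flipped late, so an error is
     * flagged and the correct (shadow) value is used while the pipeline recovers. */
    bool err2 = clock_and_check(&ff, 0, 1);
    printf("cycle 1 error=%d, cycle 2 error=%d\n", err1, err2);
    return 0;
}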



For More Information

  • Visit http://cccp.eecs.umich.edu
