Presentation Transcript

Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan

[Figure: app.c compiled into a chain of loop accelerators (LA)]
Automated C to Gates Solution
  • SoC design
    • 10-100 Gops within a 200 mW power budget
    • Low-level tools ineffective at this scale
  • Automated accelerator synthesis for the whole application
    • Correct by construction
    • Increased designer productivity
    • Faster time to market
[Figure: H.264 encoder block diagram (Motion Estimator, Motion Predictor, Transform, Quantizer, Coder, Inverse Quantizer, Inverse Transform) and W-CDMA transmitter block diagram (CRC, Conv./Turbo coder, Block Interleaver, Spreader/Scrambler, OVSF Generator, RRC Filter, Baseband Transmitter)]
Streaming Applications
  • Data “streaming” through kernels
  • Kernels are tight loops
    • FIR, Viterbi, DCT
  • Coarse grain dataflow between kernels
    • Sub-blocks of images, network packets
Software Overview

[Figure: tool flow: (1) whole application, (2) frontend analyses produce a loop graph, (3) system-level synthesis produces an accelerator pipeline with SRAM buffers, (4) multifunction accelerators]

[Figure: DCT loop graph: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Input Specification
  • Sequential C program
  • Kernel specification
    • Perfectly nested FOR loop
    • Wrapped inside a C function
    • All data accesses made explicit
  • System specification
    • Function with the main input/output
    • Local arrays to pass data between kernels
    • Sequence of calls to kernels

    row_trans(char inp[8][8], char out[8][8]) {
      for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
          . . . = inp[i][j];
          out[i][j] = . . . ;
        }
      }
    }

    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

    dct(char inp[8][8], char out[8][8]) {
      char tmp1[8][8], tmp2[8][8];
      row_trans(inp, tmp1);
      col_trans(tmp1, tmp2);
      zigzag_trans(tmp2, out);
    }

Performance Specification

[Figure: 1024 x 768 input image divided into 8x8 blocks]

  • High-performance DCT
    • Process one 1024x768 image every 2 ms
    • Given a 400 MHz clock:
      • One image every 800,000 cycles
      • One block every 64 cycles
  • Low-performance DCT
    • Process one 1024x768 image every 4 ms
    • One block every 128 cycles

[Figure: the DCT loop graph (inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out); one traversal of the graph is a task, producing the output coefficients]

Performance goal: task throughput, expressed as the number of cycles between tasks

Building Blocks

[Figure: kernels 1-4 connected through SRAM buffers (tmp1-tmp3); several kernels can share one multifunction loop accelerator [CODES/ISSS '06]]


System Schema Overview

[Figure: timeline of successive tasks; kernels 1-5 are mapped onto loop accelerators LA 1-LA 3, and consecutive tasks are offset by the task throughput]

[Figure: two three-kernel pipelines (K1 → K2 → K3, each loop with TC=100): at II=1 each loop finishes in 100 cycles, sustaining one task per 100 cycles (high performance); at II=2 each loop takes 200 cycles, sustaining one task per 200 cycles (low performance)]

Cost Components
  • Cost of loop accelerator data path
    • Cost of FUs, shift registers, muxes, interconnect
  • Initiation interval (II)
    • Key parameter that decides LA cost
      • Low II → high performance → high cost
    • Loop execution time ≈ (trip count) x II
    • Appropriate II chosen to satisfy task throughput


[Figure: kernels K1-K4 (each TC=100) grouped onto loop accelerators; grouping more loops onto one LA keeps it occupied longer within each task]

Cost Components (contd.)
  • Grouping of loops into a multifunction LA
    • More loops in a single LA → LA occupied for longer time in current task

(Throughput = 1 task / 200 cycles; LA 1 occupied for 200 cycles)

[Figure: three II=1, TC=100 kernels (K1-K3) on LA 1-LA 3 with intermediate buffers tmp1 and tmp2; successive tasks overlap in time, each task advancing through the pipeline 100 cycles behind its predecessor]

Cost Components (contd.)
  • Cost of SRAM buffers for intermediate arrays
  • More buffers → more task overlap → higher performance

(tmp1 buffer in use by LA 2; adjacent tasks use different buffers)

ILP Formulation
  • Variables
    • II for each loop
    • Which loops are combined into a single LA
    • Number of buffers for each temporary array
  • Objective function
    • Cost of LAs + cost of buffers
  • Constraints
    • Overall task throughput must be achieved

Non-linear LA Cost

[Figure: relative LA cost vs. initiation interval, over the range IImin to IImax]

IImin ≤ II ≤ IImax

II = 1*II1 + 2*II2 + 3*II3 + . . . . + 14*II14 and 0 ≤ IIi ≤ 1

Cost(II) = C1*II1 + C2*II2 + C3*II3 + . . . . + C14*II14

Multifunction Accelerator Cost
  • Impractical to obtain accurate cost of all combinations
  • C_LA = 0.5 * (SUM_CLA + MAX_CLA), i.e., midway between the sum and the max of the member accelerators' costs

[Figure: four LAs under three sharing scenarios:
  • Worst case, no sharing: cost = sum
  • Realistic case, some sharing: cost between sum and max
  • Best case, full sharing: cost = max]

[Figure: four synthesized designs for the "Simple" benchmark at task throughputs of 512, 1536, 1792, and 2048 cycles]

Case Study: “Simple” benchmark

(Loop graph, TC=256)

Beamformer
  • Beamformer benchmark: 10 loops; memory accounts for 60% to 70% of total cost
  • Up to 20% cost savings from hardware sharing in multifunction accelerators
  • Systems at lower throughput have over-designed LAs
    • Not profitable to pick a lower-performance LA
  • Memory buffer cost is significant
    • A high-performance producer-consumer pair is better than adding more buffers
Conclusions
  • Automated design is realistic for a system of loops
  • Designers can move up the abstraction hierarchy
  • Observations
    • Macro-level hardware sharing can achieve significant cost savings
    • Memory cost is significant – datapath and memory cost must be optimized simultaneously
  • ILP formulation is tractable
    • Solver took less than 1 minute for systems with 30 loops