1 / 21

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability. Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan April 8, 2008. Introduction.

haamid
Download Presentation

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability Kevin Fan, Hyunchul Park,Manjunath Kudlur, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan April 8, 2008 1

  2. Introduction • Emerging applications have high performance, cost, energy demands • H.264, wireless, software radio, signal processing • 10-100 Gops required • 200 mW power budget • Applications dominated by tight loops processing large amounts of streaming data iPhone board 2

  3. LD +/- * Loop Accelerators C Code Hardware Loop 3

  4. Hardware Implementations • Customization gets order-of-magnitude performance and efficiency wins • Viterbi: 100x speedup vs. ARM9 FPGAs General PurposeProcessors DSPs CGRAs Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 4

  5. What About Programmability? • Software changes – bug fixes, evolving standards • dct_8x8() from H.264 reference implementation for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) { i=pos_scan[coeff_ctr][0]; j=pos_scan[coeff_ctr][1]; run++; ilev=0; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { MCcoeff = MC(coeff_ctr); runs[MCcoeff]++; } m7 = &curr_res[block_y + j][block_x]; level = iabs (m7[i]); if (img->AdaptiveRounding) { fadjust8x8[j][block_x+i] = 0; } if (level != 0) { nonzero = TRUE; if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) { *coeff_cost += MAX_VALUE; img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]); img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff]; ++scan_pos; runs[MCcoeff]=-1; } else { *coeff_cost += MAX_VALUE; ACLevel[scan_pos ] = isignab(level,m7[i]); ACRun [scan_pos++] = run; run=-1; // reset zero level counter } level = isignab(level, m7[i]); ilev = level; } } Version 13.0 Version 13.1 Version 13.2 5

  6. Programmable Loop Accelerator • Reusable hardware → reduced NRE costs • Generalize accelerator without losing efficiency FPGAs General PurposeProcessors DSPs CGRAs Programmable Loop Accelerators Flexibility Multifunction Loop Accelerators Loop Accelerators, ASICs Efficiency, Performance 6

  7. Loop 1 Flexible Accelerators • Generalize accelerator architecture • Map new loops to existing hardware SynthesisSystem Hardware Compiler Loop 2 7

  8. Loop Accelerator Architecture CRF Point-to-point Connections … … … … … … FSM Local Mem BR + & MEM Controlsignals • Hardware realization of modulo scheduled loop • Parameterized execution resources, storage, connectivity 8

  9. Programmable Accelerator Architecture • Generalize architectural features that limit programmability CRF Literals Point-to-point Connections Bus … … … … … … Control Memory Local Mem BR +/- &/| MEM Controlsignals RR RR RR RR • ~50% area overhead vs. non-programmable accelerator 9

  10. Mapping Loops onto Hardware Processor Accelerator FUs Storage Connectivity ALU ALU LD +/- * CRF 8 16 8 10

  11. +3 +2 +5 +4 +3 +2 +5 +4 Scheduling Example MEM ADDER1 ADDER2 LD1 Time +2 +3 LD1 II=2 +4 +5 +2 +3 LD1 +4 +5 ? +2 +3 LD1 +4 11

  12. Modulo Scheduling for LAs • Large search space, few solutions • Op-centric approaches unable to find solutions • Satisfiability Modulo Theory (SMT) formulation to solve linear and SAT constraints simultaneously Loop Move Insertion SMT Scheduling Register Allocation Control Signals Machine description Increment II 12

  13. SMT Formulation • Boolean variables Xi,f,t are true if operation i is scheduled on FU f at time slot t. • Integer variables Si represent stage of operation i. i lat(i) dist(i,j) sched_time(j)  sched_time(i) + lat(i) – dist(i,j)  II j ( Xi,fi,ti Xj,fj,tj )  ( ) Sj II + tj  Si  II + ti + lat(i) – dist(i,j)  II • More details in paper 13

  14. FU FU Hardware Loop Loop Loop Loop Loop Loop Measuring Programmability • How well can different loops be mapped onto the same hardware? • Performance matters – how much does II increase? • Need set of loops with different degrees of similarity ? 14

  15. Graph Perturbation • Synthetically generated graphs • More perturbations → less similar to original graph • Iteratively apply random transformations: 15

  16. Results – Perturbed Graphs Average II increase Base II 4 8 7 2 4 4 4 4 6 9 MPEG4 Signal processing Image Math 16

  17. Results – Restricted Datapath 17

  18. Conclusion • Increase flexibility of customized hardware without sacrificing performance, efficiency • Successfully map loops to heterogeneous hardware • Compile times of 5 minutes – 1 hour • Software changing faster than hardware → patchable ASIC 18

  19. Questions? 19

  20. 20

  21. Results – Cross Compilation 21

More Related