
Increasing Hardware Efficiency with Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan. October 25, 2006.


Presentation Transcript


  1. Increasing Hardware Efficiency with Multifunction Loop Accelerators
     Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
     Advanced Computer Architecture Laboratory, University of Michigan
     October 25, 2006

  2. Introduction
     • Emerging applications have high performance, cost, and energy demands
     • H.264, wireless, software radio, signal processing
     • 10-100 Gops required
     • 200 mW power budget
     • Applications dominated by tight loops processing large amounts of streaming data
     [Figure: CPU with accelerators]

  3. Loop Accelerators
     • Order-of-magnitude performance and efficiency wins
     • Viterbi: 100x speedup vs. ARM9
     • Automated C-to-gates solution
     • Correct by construction
     • Close designer productivity gap
     • Achieve short time-to-market
     [Figure: .c source to loop accelerator]

  4. Prescribed Throughput Accelerators
     • Traditional behavioral synthesis
     • Directly translate C operators into gates
     • Our approach: application-centric architectures
     • Achieve fixed throughput
     • Maximize hardware sharing
     [Figure: operation graph to datapath; application to architecture]

  5. Outline
     • Loop accelerator schema and design flow
     • Cost sensitive scheduling
     • Designing multifunction accelerators
       • Naïve
       • Joint scheduling
       • Datapath union
     • Synthesis results

  6. Loop Accelerator Template
     • Hardware realization of modulo scheduled loop
     • Parameterized execution resources, storage, connectivity

  7. Loop Accelerator Design Flow
     [Figure: design flow. Labels: C code (.c) and performance (throughput) target; FU allocation; abstract architecture; modulo schedule; scheduled ops; build datapath; concrete architecture; instantiate architecture; synthesize; loop accelerator Verilog (.v) and control signals]

  8. Datapath Derived from Schedule
     • Schedule to abstract architecture (FUs)
     • Determine register and interconnect requirements from schedule
     Source code:
       r1 = Mem[r2]
       r3 = r1 + 12
     [Figure: schedule with LOAD on FU1 at time 1 and ADD on FU2 at time 4; resulting datapath with MEM unit, adder, and constant 12]
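The step from schedule to register and interconnect requirements can be sketched in a few lines. This is only an illustration, not the authors' tool: the `schedule` and `edges` structures and the function name are hypothetical, and it assumes each FU writes into its own storage whose depth is set by the longest-lived value it produces, with operation latency ignored.

```python
# Hypothetical schedule record: operation -> (functional unit, issue time).
# The two entries mirror the slide: LOAD on FU1 at time 1, ADD on FU2 at time 4.
schedule = {"LOAD": ("FU1", 1), "ADD": ("FU2", 4)}

# Data dependences as (producer, consumer) pairs: r1 flows from LOAD to ADD.
edges = [("LOAD", "ADD")]


def datapath_requirements(schedule, edges):
    """Return (wires, storage_depth) implied by a schedule.

    wires: set of (source FU, destination FU) connections.
    storage_depth: per-FU number of cycles its longest-lived result
    must be held (assumption: one storage structure per FU).
    """
    wires = set()
    storage_depth = {}
    for producer, consumer in edges:
        src_fu, t_def = schedule[producer]
        dst_fu, t_use = schedule[consumer]
        wires.add((src_fu, dst_fu))
        lifetime = t_use - t_def  # cycles the value stays live
        storage_depth[src_fu] = max(storage_depth.get(src_fu, 0), lifetime)
    return wires, storage_depth


wires, storage_depth = datapath_requirements(schedule, edges)
print(wires, storage_depth)
```

For the slide's two-operation example, the sketch reports one FU1-to-FU2 connection and a value that must be held for three cycles.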

  9. Cost Sensitive Scheduling
     • Traditional scheduling is hardware unaware
     • Intelligent scheduling needed to reduce hardware cost
     • 27% cost reduction with same performance [MICRO ’05]
     [Figure: two schedules of the operations +1, +2, LD1, LD2 on FU1, FU2, FU3 over time, with the resulting datapaths]
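One way to see why a hardware-unaware schedule can cost more is to price each FU by the set of operation kinds it must support. The sketch below is illustrative only: the cost table, the two assignments, and the function name are assumptions, not data from the talk, and both assignments are assumed to meet the same throughput.

```python
# Hypothetical relative gate costs per capability (not from the talk).
CAPABILITY_COST = {"add": 1.0, "load": 2.0}


def datapath_cost(assignment):
    """Sum the cost of every capability each FU needs.

    assignment maps an FU name to the kinds of the operations placed on it.
    An FU pays once per distinct kind, however many ops of that kind it runs.
    """
    return sum(
        CAPABILITY_COST[kind]
        for kinds in assignment.values()
        for kind in set(kinds)
    )


# Hardware-unaware placement: each FU ends up needing an adder and a memory port.
hardware_unaware = {"FU1": ["add", "load"], "FU2": ["add", "load"]}

# Cost-sensitive placement: like operations grouped onto the same FU.
cost_sensitive = {"FU1": ["add", "add"], "FU2": ["load", "load"]}

unaware_cost = datapath_cost(hardware_unaware)
sensitive_cost = datapath_cost(cost_sensitive)
print(unaware_cost, sensitive_cost)
```

A real cost-sensitive scheduler would also account for registers and interconnect; this toy model prices FU capabilities only.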

  10. Multifunction Accelerator
     • Map multiple loops to single accelerator
     • Improve hardware efficiency via reuse
     • Opportunities for sharing
       • Disjoint stages (loops 2, 3)
       • Pipeline slack (loops 4, 5)
     [Figure: application with Loop 1, a "Frame Type?" branch, Loop 2, Loop 3, Loop 4, Block 5; accelerator pipeline of single-function loop accelerators LA1 through LA5 compared with a pipeline using multifunction loop accelerators]

  11. Design Strategies
     • Naïve method: design single-function accelerators and place them side by side
     • Misses potential hardware sharing of FUs, storage, interconnect
     [Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting FU datapaths sit side by side as the multifunction datapath]

  12. Joint Scheduling
     • Loops are independent: the number of possible schedules is exponential in the number of loops
     • Infeasible for even modest problems
     [Figure: Loop 1 and Loop 2 pass through a single joint cost sensitive modulo scheduler, producing two schedules (Op1, Op2, Op3, … over time) on a shared set of FUs]
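The exponential growth is easy to quantify with a back-of-the-envelope count. The sketch below assumes, purely for illustration, that each loop has some number of candidate schedules; the function names and the figure of 1000 candidates per loop are hypothetical.

```python
def joint_search_space(schedules_per_loop):
    """Combinations a joint scheduler faces: the product over loops,
    because any schedule of one loop can pair with any schedule of another."""
    total = 1
    for count in schedules_per_loop:
        total *= count
    return total


def independent_search_space(schedules_per_loop):
    """Candidates examined when each loop is scheduled on its own: the sum."""
    return sum(schedules_per_loop)


# Hypothetical: four loops with 1000 candidate schedules each.
candidates = [1000] * 4
joint = joint_search_space(candidates)
independent = independent_search_space(candidates)
print(joint, independent)
```

With these made-up numbers, scheduling the loops separately touches 4,000 candidates, while the joint problem has 10^12 combinations.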

  13. Multifunction Gate Costs
     • 43% average savings over sum of accelerators
     [Chart: gate costs for items labeled A through J]

  14. Datapath Union
     [Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting FU datapaths are merged by a datapath union step]

  15. Datapath Union
     • Combine similar components → better hardware sharing → lower cost
     • Trade off FU and register cost
     • Combining dissimilar FUs can enable register cost savings
     • ILP formulation minimizes FU and register cost
     [Figure: FUs of Accel 1 and Accel 2 (+, -, *, M) merged into a multifunction accelerator with combined FUs such as +/-, */-, M/*, M/+]
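The difference between placing accelerators side by side and taking a union can be shown with a deliberately simple model. This is not the ILP formulation from the talk: it only merges FUs of identical type, ignores registers and interconnect, and the FU inventories below are hypothetical.

```python
from collections import Counter

# Hypothetical FU inventories of two independently designed accelerators.
# "M" stands for a memory unit, as in the slide's figure labels.
accel_1 = Counter({"+": 2, "*": 1, "M": 1})
accel_2 = Counter({"+": 1, "-": 1, "M": 2})

# Naive method: both datapaths side by side, so FU counts add up.
side_by_side = accel_1 + accel_2

# Simple union: only one loop runs at a time, so for each FU type the
# larger of the two requirements is enough (Counter's | takes the max).
union = accel_1 | accel_2

naive_fu_count = sum(side_by_side.values())
union_fu_count = sum(union.values())
print(naive_fu_count, union_fu_count)
```

Merging dissimilar FUs (for example an adder with a subtractor) and trading FU cost against register cost, as the slide describes, would need a cost model and an optimizer on top of this.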

  16. Multifunction Gate Costs
     • Smart union within 3% of joint scheduling solution
     [Chart: gate costs for items labeled A through J]

  17. Conclusion
     • Multifunction accelerators are highly effective in exploiting coarse-grained hardware sharing
     • Joint scheduling achieves 43% average cost savings, but is impractical
     • Smart union of independent accelerators achieves 40% average savings
     • Compile times of 5 minutes to 1 hour

  18. Questions?
