Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan.

Presentation Transcript


  1. Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
     Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
     Advanced Computer Architecture Laboratory, University of Michigan

  2. Introduction
     • Emerging applications have high performance, cost, and energy demands
     • H.264, wireless, software radio, signal processing
     • 10-100 Gops required
     • 200 mW power budget
     • Applications are dominated by tight loops processing large amounts of streaming data
     [Figure: handset features including 3.5G (HSDPA), WiMax, stereo headset, TV out, memory card, 20 GB HD, PC / Mac connectivity. Source: ARM 2005]

  3. Loop Accelerators
     • Order-of-magnitude performance and efficiency wins
     • Viterbi: 100x speedup vs. ARM9
     • Automated C-to-gates solution
     • Correct by construction
     • Close designer productivity gap
     • Achieve short time-to-market

  4. Loop Accelerator Template
     • Hardware realization of modulo scheduled loop
     • Parameterized execution resources, storage, connectivity

  5. Loop Accelerator Design Flow
     [Figure: five-step flow. (1) FU Alloc takes C code (.c) and a performance (throughput) target and yields an abstract architecture; (2) Modulo Schedule yields scheduled ops (Op1, Op2, Op3, … over time); (3) Build Datapath yields a concrete architecture (FUs, RF); (4) Instantiate Arch emits Verilog (.v) and control signals; (5) Synthesize produces the loop accelerator.]

  6. Modulo Scheduling and Datapath Derivation
     • Schedule to abstract architecture (FUs)
     • Determine register and interconnect requirements from schedule
     Source code:
         r1 = Mem[r2]
         r3 = r1 + 12
     [Figure: schedule placing LOAD on FU1 at time 1 and ADD on FU2 at time 4; derived datapath with MEM, an adder, and the constant 12]

  7. Cost Sensitive Scheduling
     • Traditional scheduling is hardware unaware
     • Intelligent scheduling is needed to reduce hardware cost
     • Different scheduling alternatives are not equal
     [Figure: alternative placements of ops +1, +2, LD1, LD2 on FU1, FU2, FU3 over time]

  8. Scheduling to Reduce Cost
     • Hardware cost is a function of the final schedule
     • Increased hardware sharing = reduced cost
     • Reusing hardware is “free”
     • Traditional metrics (register pressure) are not sufficient
     [Figure: FUs with shared storage, annotated "No additional cost for longer lifetime"]

  9. Initial Approach: Greedy
     Hardware cost = FU cost + Storage cost + Wire cost
     • Standard iterative modulo scheduler, augmented with hardware cost model
     • Choose the alternative which increases cost the least
         while unscheduled ops remain {
           get valid alternatives for op
           for each alternative {
             get hardware cost
           }
           schedule op using min-cost alternative
           update hardware cost model
         }
     [Figure: operators +, -, *, <<]
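The greedy loop on slide 9 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' scheduler: only FU cost is modeled (storage and wire cost are left out), the `OP_COST` table and the function name `greedy_schedule` are hypothetical, every op has the same latency, and there is no eviction or backtracking as a full iterative modulo scheduler would have.

```python
OP_COST = {"add": 1, "mul": 8, "ld": 4}  # hypothetical relative FU costs

def greedy_schedule(ops, deps, num_fus, ii, latency=1):
    """Place each op on the (FU, time) alternative that adds the least FU cost.

    ops  : list of (name, kind), assumed topologically ordered
    deps : dict mapping an op name to the names it depends on
    """
    fu_kinds = [set() for _ in range(num_fus)]  # op kinds each FU must implement
    busy = set()                                # occupied (fu, time mod ii) slots
    placed = {}                                 # name -> (fu, time)
    for name, kind in ops:
        earliest = max((placed[p][1] + latency for p in deps.get(name, [])), default=0)
        best = None
        # ii consecutive start times cover every modulo slot exactly once
        for t in range(earliest, earliest + ii):
            for fu in range(num_fus):
                if (fu, t % ii) in busy:
                    continue
                # reusing an FU that already implements this kind is free
                delta = 0 if kind in fu_kinds[fu] else OP_COST[kind]
                if best is None or (delta, t, fu) < best:
                    best = (delta, t, fu)
        if best is None:
            raise RuntimeError(f"no free slot for {name}; raise II or add FUs")
        _, t, fu = best
        fu_kinds[fu].add(kind)
        busy.add((fu, t % ii))
        placed[name] = (fu, t)
    total = sum(OP_COST[k] for kinds in fu_kinds for k in kinds)
    return placed, total

ops = [("a", "add"), ("b", "add"), ("c", "ld")]
placed, total = greedy_schedule(ops, deps={}, num_fus=2, ii=2)
print(placed, total)
```

In this toy run the second add is delayed by one cycle so that it can reuse the adder already instantiated on the first FU, which is the kind of sharing the cost model is meant to reward. Each decision is still local to one op, which is why a greedy scheduler can settle in a local minimum.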

  10. Results: Greedy Scheduling
     • Local scope → local minima
     • Much more cost savings possible
     • 5% average cost savings
     [Chart: cost breakdown by FU, Storage, MUX]

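  11. Optimal Modulo Scheduling
     • Search space: each op in the loop (Op1, Op2, Op3, …) is assigned one (FU #, time) pair, e.g. (1,0), (1,1), (2,0), (2,1), (3,0), (3,1)
     • Storage cost = $\sum_i \text{width}_i \times \text{depth}_i$
     • FU cost = $\sum_i \text{cost}(\text{FU}_i)$
     • Optimal modulo scheduling extends [Eichenberger ’97]
     [Figure: example loop with ops +1, +2, LD3, +4, -5]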
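The two cost terms on slide 11 are simple sums, which the following sketch evaluates. The `RegisterFile` type, the `FU_COST` table, and the numbers are hypothetical placeholders; the talk does not give concrete cost values, and a real model would also need to put both terms in a common unit.

```python
from dataclasses import dataclass

@dataclass
class RegisterFile:
    width: int  # bits per entry
    depth: int  # number of entries

FU_COST = {"add": 1.0, "mul": 8.0, "ld": 4.0}  # hypothetical relative costs

def storage_cost(register_files):
    """Sum of width_i * depth_i over all register files."""
    return sum(rf.width * rf.depth for rf in register_files)

def fu_cost(fu_kinds):
    """Sum of cost(FU_i) over all instantiated FUs."""
    return sum(FU_COST[kind] for kind in fu_kinds)

print(storage_cost([RegisterFile(32, 2), RegisterFile(16, 1)]))
print(fu_cost(["add", "mul"]))
```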

  12. Results: Optimal Scheduling
     • 27% average cost savings
     [Chart: cost breakdown by FU, Storage, MUX]

  13. Problem Decomposition
     • Exact solutions are not practical
     • (#FU × II × stages)^#ops possible schedules
     • 20 lines of C code → 100 hours
     • Excessive runtimes even for modest-size loops
     • Decompose into more manageable sub-problems
     • Partitioned scheduling
     • Time-space decomposition
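To see how quickly the bound on slide 13 grows, the sketch below evaluates (#FU × II × stages)^#ops for a few loop sizes. The parameter values are illustrative assumptions, not figures from the talk, and `schedule_space_size` is a hypothetical helper name.

```python
def schedule_space_size(num_fus: int, ii: int, stages: int, num_ops: int) -> int:
    """Upper bound on candidate schedules: every op independently picks
    one (FU, slot within II, stage) triple."""
    return (num_fus * ii * stages) ** num_ops

# Hypothetical accelerator: 4 FUs, II = 4, 3 stages
for num_ops in (5, 10, 20):
    print(num_ops, schedule_space_size(4, 4, 3, num_ops))
```

Because the op count sits in the exponent, doubling the loop body squares the size of the search space, which is consistent with the slide's observation that even modest loops lead to excessive runtimes.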

  14. Partitioned Scheduling
     • Partition the operations into small groups
     • Schedule groups of operations sequentially
     • Account for hardware contribution of previously scheduled groups
     • Backtrack if infeasible state reached
     [Figure: ops 1-5 split into groups, each group passed in turn to the optimal modulo scheduler]
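The partitioned approach on slide 14 can be sketched as follows. This is a simplified illustration under stated assumptions: an exhaustive search over (FU, modulo slot) pairs stands in for the optimal scheduler applied to each group, only FU cost is modeled, dependences and stages are ignored, and `OP_COST`, `group_placements`, and `schedule_groups` are hypothetical names. In this reduced model a group becomes infeasible only when the modulo slots run out, so the backtracking step is shown structurally rather than exercised in a realistic way.

```python
from itertools import product

OP_COST = {"add": 1, "mul": 8, "ld": 4}  # hypothetical relative FU costs

def group_placements(group, fu_kinds, busy, num_fus, ii):
    """Enumerate every feasible placement of one group, cheapest first."""
    slots = list(product(range(num_fus), range(ii)))
    options = []
    for choice in product(slots, repeat=len(group)):
        if len(set(choice)) < len(choice) or busy & set(choice):
            continue  # two ops share a slot, or the slot belongs to an earlier group
        kinds = [set(k) for k in fu_kinds]  # copy hardware built by earlier groups
        delta = 0
        for (_, kind), (fu, _) in zip(group, choice):
            if kind not in kinds[fu]:
                kinds[fu].add(kind)
                delta += OP_COST[kind]
        options.append((delta, choice, kinds))
    options.sort(key=lambda o: (o[0], o[1]))
    return options

def schedule_groups(groups, num_fus, ii, fu_kinds=None, busy=frozenset()):
    """Schedule groups in order; return (cost, placement) or None if infeasible."""
    if fu_kinds is None:
        fu_kinds = [set() for _ in range(num_fus)]
    if not groups:
        return 0, {}
    head, rest = groups[0], groups[1:]
    for delta, choice, kinds in group_placements(head, fu_kinds, busy, num_fus, ii):
        result = schedule_groups(rest, num_fus, ii, kinds, busy | set(choice))
        if result is not None:  # otherwise backtrack to the next alternative
            cost, placed = result
            placed.update({name: slot for (name, _), slot in zip(head, choice)})
            return delta + cost, placed
    return None

groups = [[("a", "add"), ("b", "add")], [("c", "add"), ("d", "ld")]]
print(schedule_groups(groups, num_fus=2, ii=2))
```

Each group sees the FUs already paid for by earlier groups, so reuse across groups is free, but the search never revisits an earlier group to improve cost, only to recover feasibility. That limited view is one plausible reading of why many small partitions behave much like the greedy scheduler.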

  15. Operation Partitioning
     • Traditional partitioning: minimize edge cuts
     • Does not necessarily lead to good cost
     • Goal: maximize hardware sharing opportunities within a group
     [Figure: dataflow graph of +, LD, <<, and * ops grouped by operation type]

  16. Results: Partitioned Scheduling
     • 8% average cost savings
     • With a large number of partitions, results are similar to greedy
     [Chart: cost breakdown by FU, Storage, MUX]

  17. Partition Size for Sharp
     • Improve cost by considering more ops at a time

  18. Time-Space Decomposition
     • Reduce scheduling complexity
     • View all operations together
     • Optimize for register depth during time assignment, register width and FU cost during space assignment
     [Figure: two orderings for five ops on FU1, FU2, FU3. Time, space: assign times first (time 0: ops 1, 2, 5; time 1: ops 3, 4), then FUs. Space, time: assign FUs first (FU 1: ops 1, 5; FU 2: ops 2, 4; FU 3: op 3), then times.]
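The "time, space" ordering from slide 18 can be sketched as two independent phases. This is a rough illustration, not the authors' formulation: ASAP placement stands in for the register-depth objective of the time phase, a greedy FU choice stands in for the FU-cost objective of the space phase, register width is not modeled, and `OP_COST`, `assign_times`, and `assign_fus` are hypothetical names.

```python
OP_COST = {"add": 1, "mul": 8, "ld": 4}  # hypothetical relative FU costs

def assign_times(ops, deps, num_fus, ii, latency=1):
    """Time phase: ASAP start times with at most num_fus ops per modulo slot."""
    load = [0] * ii
    times = {}
    for name, _ in ops:  # ops assumed topologically ordered
        t = max((times[p] + latency for p in deps.get(name, [])), default=0)
        for _ in range(ii):
            if load[t % ii] < num_fus:
                break
            t += 1
        else:
            raise RuntimeError("every modulo slot is full; II is too small")
        load[t % ii] += 1
        times[name] = t
    return times

def assign_fus(ops, times, num_fus, ii):
    """Space phase: times are fixed; pick the free FU that adds the least cost."""
    fu_kinds = [set() for _ in range(num_fus)]
    busy = set()
    fus = {}
    for name, kind in ops:
        slot = times[name] % ii
        # the time phase guarantees at least one FU is free in this slot
        free = [f for f in range(num_fus) if (f, slot) not in busy]
        fu = min(free, key=lambda f: 0 if kind in fu_kinds[f] else OP_COST[kind])
        fu_kinds[fu].add(kind)
        busy.add((fu, slot))
        fus[name] = fu
    cost = sum(OP_COST[k] for kinds in fu_kinds for k in kinds)
    return fus, cost

ops = [("l1", "ld"), ("l2", "ld"), ("s", "add")]
deps = {"s": ["l1", "l2"]}
times = assign_times(ops, deps, num_fus=2, ii=2)
fus, cost = assign_fus(ops, times, num_fus=2, ii=2)
print(times, fus, cost)
```

Splitting the decision this way means each phase searches a much smaller space than the joint (FU, time) problem while still looking at all ops at once. The "space, time" ordering would run the two phases in the opposite order.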

  19. Results: Time-Space Scheduling
     • Time, space: 19% average cost savings
     • Space, time: 20% average cost savings
     [Chart: cost breakdown by FU, Storage, MUX]

  20. Real Cost Savings
     • Viterbi, naïve scheduler: 0.66 mm²
     • Viterbi, space-time decomposed scheduler: 0.37 mm²
     • 43.2% overall area savings

  21. Conclusion
     • Automated C-to-loop-accelerator synthesis system
     • Modulo scheduler must be cost aware
     • Decomposition methods make the problem tractable
     • 20% average cost savings with space-time decomposition
     • Importance of a global view of all operations
     • Individual savings up to 43%
     • Compile times of 1 to 30 minutes

  22. Questions?
     • For more information: http://cccp.eecs.umich.edu
