Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures

Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan.

Presentation Transcript


  1. Modulo Graph Embedding: Mapping Applications onto Coarse-Grained Reconfigurable Architectures • Hyunchul Park, Kevin Fan, Manjunath Kudlur, Scott Mahlke • Advanced Computer Architecture Lab, University of Michigan

  2. Coarse-Grained Reconfigurable Architecture (CGRA) • Array of PEs connected by a mesh-like interconnect • Characterized by array size, node functionalities, interconnect, register file configurations • Executes compute-intensive kernels in multimedia applications [Figure: a PE with Config, FU, and LRF]

  3. CGRA: Attractive Alternative to ASICs • Suitable for running multimedia applications on embedded systems • High computation throughput • Low power consumption and scalability • High flexibility with fast configuration • MorphoSys: 8x8 array with RISC processor; SIMD-style execution of loops • PipeRench: 1-D reconfigurable hardware; virtualizes the hardware pipeline • ADRES: 8x8 array with tightly coupled VLIW; modulo scheduling with simulated annealing

  4. Scheduling in CGRA • Different from conventional VLIW • Sparse interconnect and distributed register files • No dedicated routing resources • Need a good compiler to exploit the abundance of computing resources [Figure: CGRA with FU0-FU3, each with its own LRF, vs. conventional VLIW with FU0-FU3 and a central RF]

  5. Objectives of This Work • Modulo scheduling technique for CGRAs • Exploit loop-level parallelism by overlapping execution of iterations • Targeting low-cost CGRAs • Achieve a quality schedule under hardware restrictions • Fast compilation time

  6. Modulo Scheduling Basics • Expose loop-level parallelism by overlapping execution of iterations • Initiation interval (II) • A new iteration starts every II cycles [Figure: overlapped execution of iterations, each made of operations A, B, C, with consecutive iterations offset by II]
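The overlap described on slide 6 can be illustrated with a little arithmetic. The sketch below is only an illustration: it assumes a three-operation iteration (A, B, C) with one cycle per operation and II = 1, values chosen for the example rather than taken from the slides, and the function names are hypothetical.

```python
def iteration_start_cycles(num_iterations, ii):
    """Start cycle of each iteration when a new one begins every `ii` cycles."""
    return [i * ii for i in range(num_iterations)]


def total_cycles(num_iterations, ii, schedule_length):
    """Cycles needed for all iterations: last start plus one iteration's length."""
    return (num_iterations - 1) * ii + schedule_length


# Assumed example: 6 iterations of A, B, C (schedule length 3), II = 1.
starts = iteration_start_cycles(6, ii=1)
overlapped = total_cycles(6, ii=1, schedule_length=3)
sequential = 6 * 3  # the same work with no overlap between iterations

print(starts, overlapped, sequential)
```

A smaller II means more overlap between iterations, which is why a modulo scheduler tries to find a valid schedule at the lowest II the resources allow.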

  7. Modulo Scheduling for CGRA • Mapping DFG onto 3-D scheduling space • Limited number of scheduling slots: (number of PEs) x II • Minimize routing cost (number of slots used for routing) • Sparse interconnect and distributed register files • Ensure routability of operands [Figure: a DFG mapped onto the scheduling space of a 4x4 CGRA, with II slots along the time axis]

  8. Our Approach • Systematic approach to generate a good schedule in reasonable time • Minimize routing cost: convert the scheduling problem into graph embedding; leverage a graph embedding algorithm • Ensure routability of operands: skewed scheduling space; create a narrow but tall scheduling space

  9. (1) Minimize Routing Cost • Routing cost: number of PEs used for routing • Determined by positions of producer and consumer • Minimize distance between producers and consumers • Height-based list scheduling • Schedule operations in the order of dependence height • Place consumers close to producers • Need to carefully place operations at the same height
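The height-based ordering on slide 9 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the DFG edges are invented for the example, and the convention that an operation with no consumers has height 1 is an assumption.

```python
def dependence_heights(consumers):
    """Height of an op = 1 + the largest height among its consumers.

    `consumers` maps each operation to the list of operations that read
    its result. Ops with no consumers get height 1 (assumed convention).
    """
    heights = {}

    def height(op):
        if op not in heights:
            heights[op] = 1 + max((height(c) for c in consumers[op]), default=0)
        return heights[op]

    for op in consumers:
        height(op)
    return heights


def scheduling_order(consumers):
    """Operations sorted by decreasing height; ties broken by op id."""
    heights = dependence_heights(consumers)
    return sorted(consumers, key=lambda op: (-heights[op], op))


# Hypothetical 7-operation DFG: 0,1 feed 4; 2,3 feed 5; 4,5 feed 6.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
heights = dependence_heights(dfg)
order = scheduling_order(dfg)
print(heights, order)
```

The ordering alone does not say where to place operations that share a height, which is the gap the affinity heuristic on the following slides addresses.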

  10. Scheduling Example: Routing Cost [Figure: a DFG with operations 0-6 mapped onto a 1x4 CGRA (PE 0-PE 3) in two ways; one mapping has routing cost 2, the other routing cost 0] • Common consumer information is important!

  11. Affinity Graph Heuristic • Consider placement of operations with the same height together • Use common consumer information • Affinity value between operations • Measured by the distance of common consumers in DFG • Construct affinity graph • Nodes: operations; edges: affinity values • Place operations with affinity edges close to each other
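One way to picture the affinity graph on slide 11 is the sketch below. The slides only say that affinity is measured by the distance of common consumers in the DFG, so the weighting used here (1 divided by the distance to the nearest common consumer) is an assumption made for illustration, not the authors' formula. The DFG and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def descendant_distances(consumers, op):
    """BFS distance from `op` to each direct or indirect consumer of its value."""
    dist = {}
    queue = deque([(op, 0)])
    while queue:
        node, d = queue.popleft()
        for c in consumers[node]:
            if c not in dist:
                dist[c] = d + 1
                queue.append((c, d + 1))
    return dist


def affinity(consumers, a, b):
    """Assumed weighting: 1 / distance to the nearest common consumer, else 0."""
    da = descendant_distances(consumers, a)
    db = descendant_distances(consumers, b)
    common = set(da) & set(db)
    if not common:
        return 0.0
    nearest = min(max(da[c], db[c]) for c in common)
    return 1.0 / nearest


def affinity_graph(consumers, ops):
    """Weighted edges between every pair of `ops` that shares a consumer."""
    edges = {}
    for a, b in combinations(ops, 2):
        w = affinity(consumers, a, b)
        if w > 0:
            edges[(a, b)] = w
    return edges


# Hypothetical DFG: 0,1 feed 4; 2,3 feed 5; 4,5 feed 6.
dfg = {0: [4], 1: [4], 2: [5], 3: [5], 4: [6], 5: [6], 6: []}
edges = affinity_graph(dfg, [0, 1, 2, 3])  # the operations at the top height
print(edges)
```

In this example operations 0 and 1 get a stronger edge than 0 and 2 because their common consumer is one step away instead of two, so a placer following the heuristic would try to keep 0 and 1 adjacent.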

  12. Affinity Graph Example [Figure: a DFG with operations 0-5 at heights 3, 2, and 1, the affinity graph derived from it, and a bad and a good mapping onto a 2x4 CGRA] • Drawing affinity graph onto scheduling space

  13. Leveraging Graph Embedding • Graph embedding • Drawing a graph onto a target space • Grid layout algorithm by Li & Kurata • Embed complicated biochemical networks onto 2-D grid space • Simulated annealing • Our scheduling problem is a graph embedding problem • Draw affinity graph onto scheduling space, minimizing edge length [Figure: process flow of grid layout [Li 2005]]
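To make "draw the affinity graph onto the scheduling space, minimizing edge length" concrete, here is a generic simulated-annealing sketch that places nodes on a small grid so that weighted Manhattan edge length shrinks. It is not the Li & Kurata grid layout algorithm and not the authors' scheduler: the cost function, the move set, the linear cooling schedule, and every name are assumptions, and it ignores the time dimension and routing constraints.

```python
import math
import random


def layout_cost(placement, edges):
    """Sum of edge weight times Manhattan distance between the endpoints."""
    total = 0.0
    for (a, b), w in edges.items():
        (ra, ca), (rb, cb) = placement[a], placement[b]
        total += w * (abs(ra - rb) + abs(ca - cb))
    return total


def anneal_embedding(nodes, edges, rows, cols, steps=2000, t0=2.0, seed=0):
    """Return (best placement, best cost) found by a simple annealing loop."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    assert len(nodes) <= len(cells), "grid too small"
    occupant = dict(zip(cells, nodes))  # cell -> node; unused cells are absent
    placement = {n: cell for cell, n in occupant.items()}
    cost = layout_cost(placement, edges)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumed)
        node, target = rng.choice(nodes), rng.choice(cells)
        source = placement[node]
        if target == source:
            continue
        other = occupant.get(target)
        # Move `node` to `target`, swapping with any node already there.
        placement[node], occupant[target] = target, node
        if other is None:
            del occupant[source]
        else:
            placement[other], occupant[source] = source, other
        new_cost = layout_cost(placement, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            # Rejected: undo the move.
            placement[node], occupant[source] = source, node
            if other is None:
                del occupant[target]
            else:
                placement[other], occupant[target] = target, other
    return best, best_cost


# Hypothetical affinity edges between four operations, placed on a 1x4 row.
example_edges = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.5,
                 (0, 3): 0.5, (1, 2): 0.5, (1, 3): 0.5}
initial = {0: (0, 0), 1: (0, 1), 2: (0, 2), 3: (0, 3)}
initial_cost = layout_cost(initial, example_edges)
best, best_cost = anneal_embedding([0, 1, 2, 3], example_edges, rows=1, cols=4)
print(initial_cost, best_cost, best)
```

Tracking the best placement seen, rather than returning the final state, keeps the result from being worse than the starting layout even when late uphill moves are accepted.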

  14. (2) Ensure Routability of Operands • Resources are repeatedly used every II cycles • Routing can fail due to previously scheduled operations • Backtracking: hard to make forward progress for CGRA • Take a preventative approach [Figure: a DFG with operations 0-7 scheduled on a 1x3 CGRA (PE 0-PE 2); routing fails for Op 7]
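The first bullet on slide 14, that resources are reused every II cycles, is commonly modelled with what the modulo scheduling literature calls a modulo reservation table. The sketch below is a minimal version under assumed parameters (a 1x3 array and II = 2, values picked for the example); the class and method names are hypothetical.

```python
class ModuloReservationTable:
    """Tracks which (PE, cycle mod II) slots are already taken."""

    def __init__(self, num_pes, ii):
        self.num_pes = num_pes
        self.ii = ii
        self.slots = {}  # (pe, cycle % ii) -> operation name

    def capacity(self):
        """Total scheduling slots: (number of PEs) x II."""
        return self.num_pes * self.ii

    def is_free(self, pe, cycle):
        return (pe, cycle % self.ii) not in self.slots

    def reserve(self, pe, cycle, op):
        """Claim the slot for `op`; return False if it is already in use."""
        key = (pe, cycle % self.ii)
        if key in self.slots:
            return False
        self.slots[key] = op
        return True


# Assumed example: 3 PEs, II = 2.
mrt = ModuloReservationTable(num_pes=3, ii=2)
first = mrt.reserve(pe=0, cycle=0, op="op0")
clash = mrt.reserve(pe=0, cycle=2, op="op7")  # cycle 2 wraps onto slot 0 of PE 0
later = mrt.reserve(pe=0, cycle=3, op="op7")  # cycle 3 maps to slot 1, still free
print(mrt.capacity(), first, clash, later)
```

Because a value being routed also occupies slots, an operation scheduled early can block the path an operand needs later, which is the failure the slide describes.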

  15. Skewed Scheduling Space • Should prevent routing failures in advance • Skew scheduling space • Staggering down to the right • Create a narrow but tall scheduling space • Operations can be routed to the right • Dynamically adjust scheduling space [Figure: operations 0-7 placed in a regular and in a skewed scheduling space]

  16. System Flow [Figure: system flow diagram]

  17. Experimental Setup • Twelve innermost loop kernels from various domains • Three designs with different RF configurations: dedicated RF, shared RF, central RF • Evaluate the impact of register file sharing

  18. Evaluation of Affinity Heuristic • Results of acyclic scheduling • Average of 59% reduction in routing cost

  19. Modulo Graph Embedding vs. Simulated Annealing • Utilization = (# slots used for computation) / (# total slots) • Compilation time: ~5 sec (modulo graph embedding) vs. 5 min to 3 hours (simulated annealing)
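The utilization metric on slide 19 is a simple ratio. The sketch below takes the total slot count as (number of PEs) x II, following slide 7. The numbers in the example are invented for illustration and are not results from the talk.

```python
def utilization(compute_slots, num_pes, ii):
    """Fraction of scheduling slots used for computation rather than routing or idle."""
    total_slots = num_pes * ii
    if total_slots <= 0:
        raise ValueError("num_pes and ii must be positive")
    if not 0 <= compute_slots <= total_slots:
        raise ValueError("compute_slots must fit in the scheduling space")
    return compute_slots / total_slots


# Hypothetical example: 40 operations on a 4x4 array (16 PEs) at II = 4.
example = utilization(compute_slots=40, num_pes=16, ii=4)
print(f"{example:.1%}")
```

Slots spent on routing count against utilization, so a lower routing cost can raise utilization at a given II or allow the same loop to fit at a smaller II.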

  20. Impact of Register File Configurations

  21. Conclusions • Modulo scheduler targeting low-cost CGRAs • Provide high computation throughput, scalability, power efficiency • Two heuristics to generate a good schedule • Affinity graph heuristic • Skewed scheduling space • Average utilizations of 56-68% for three designs • Systematic approach allows fast compilation time • All benchmarks finished within 5 s

  22. Questions?
