
Region-based Hierarchical Operation Partitioning for Multicluster Processors

Michael Chu, Kevin Fan, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan. June 5, 2014.


Presentation Transcript


  1. Region-based Hierarchical Operation Partitioning for Multicluster Processors. Michael Chu, Kevin Fan, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan. June 5, 2014

  2. Clustered Architectures
     [Figure: a conventional architecture (one register file feeding all FUs) beside a clustered architecture (Cluster 1 and Cluster 2, each with its own register file and FUs)]
     • Increasing width from 4 to 8 increases total delay 29% [Palacharla ‘98]
     • Clustered approach:
       • Decentralized architecture
       • Communication through interconnection network
       • Used in Alpha 21264, TI C6x, Analog Devices TigerSHARC and others

  3. Basics of Multicluster Compilation
     • Objectives:
       • Balance workload per cluster
       • Minimize critical intercluster communication
     [Figure: Cluster 1 and Cluster 2, each with a register file and units labeled I and MEM, joined by an interconnection network; a dataflow graph of +, >>, &, *, LW and + operations is split across the clusters with one intercluster move]
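The two objectives on this slide can be measured for any candidate assignment. The sketch below is only an illustration: the operation names, the edges, and the `partition_stats` helper are hypothetical, and it counts every intercluster move, whereas the slide cares specifically about moves on critical paths.

```python
from collections import Counter

def partition_stats(assignment, edges):
    """Return (operations per cluster, number of intercluster moves).

    assignment maps an operation name to its cluster id; edges is a list
    of (producer, consumer) pairs from the dataflow graph.
    """
    load = Counter(assignment.values())
    # A value must be moved whenever producer and consumer sit in
    # different clusters.
    moves = sum(1 for src, dst in edges if assignment[src] != assignment[dst])
    return dict(load), moves

# Hypothetical five-operation graph split across two clusters.
assignment = {"lw": 1, "shr": 1, "and": 1, "mul": 2, "add": 2}
edges = [("lw", "shr"), ("shr", "and"), ("lw", "mul"),
         ("mul", "add"), ("and", "add")]
load, moves = partition_stats(assignment, edges)
print(load, moves)
```

A partitioner trades these two numbers against each other: placing everything in one cluster gives zero moves but the worst possible balance.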

  4. Problem #1: Local vs Global Scope
     [Figure: the same 12-operation dataflow graph scheduled cycle by cycle under local scope clustering and under global scope clustering, with the required intercluster moves marked]

  5. Problem #2: Scheduler-centric
     • Cluster assignment during scheduling adds complexity
     • Detailed resource models/reservation tables are slow
     • Forces local decisions
     [Figure: per-cycle reservation tables for Cluster 1 and Cluster 2 beside the 12-operation dataflow graph]

  6. Our Approach
     • Opposite approach to conventional clustering
     • Global view
       • Graph partitioning strategy [Aletà ‘01, ‘02]
       • Identify tightly coupled operations - treat uniformly
     • Non scheduler-centric mindset
       • Prescheduling technique
       • Doesn’t complicate scheduler
       • Enables global view of code
       • Estimate-based approach [Lapinskii ‘01]

  7. Region-based Hierarchical Operation Partitioning (RHOP)
     • Code is considered one region at a time
     • Weight calculation creates guides for good partitions
     • Partitioning clusters operations based on the given weights
     [Figure: program → region → weight calculation → graph partitioning; the example dataflow graph is annotated with edge weights of 1, 8 and 10]

  8. Node Weights
     • Create a metric to determine resource usage
     • Dedicated resources: accounts for FUs
     • Shared resources: accounts for buses, ports
     [Figure: two register files with FUs labeled I, F, M and B, beside the 14-operation dataflow graph]

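  9. Edge Weights
     • Slack distribution allocates slack to certain edges
     • Edge slack: $\mathit{slack}_{\mathit{edge}} = \mathit{lstart}_{\mathit{dest}} - \mathit{latency}_{\mathit{edge}} - \mathit{estart}_{\mathit{src}}$
     • First come, first serve method used
     [Figure: the 14-operation dataflow graph annotated with edge slacks of 0 and 1, edge weights of 1, 8 and 10, and (estart, lstart) per operation, read in slide order: 1 (0,0), 2 (0,0), 3 (1,1), 4 (0,1), 5 (0,1), 6 (0,1), 7 (0,1), 8 (2,2), 9 (1,2), 10 (0,2), 11 (1,2), 12 (3,3), 13 (2,3), 14 (4,4)]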
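The edge slack formula on this slide can be evaluated once estart and lstart are known for every operation. The sketch below is a minimal illustration on a hypothetical five-operation graph; the `edge_slacks` helper and the node names are assumptions, sinks are assumed to be allowed to start as late as the critical-path length, and the first come, first serve distribution step is not modelled.

```python
from collections import defaultdict

def edge_slacks(edges):
    """edges: list of (src, dst, latency) triples forming a DAG.

    Returns (estart, lstart, slack), where slack is keyed by (src, dst).
    """
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst, lat in edges:
        succs[src].append((dst, lat))
        indeg[dst] += 1
        nodes.update((src, dst))

    # Topological order (Kahn's algorithm).
    order = []
    ready = [n for n in sorted(nodes) if indeg[n] == 0]
    while ready:
        n = ready.pop()
        order.append(n)
        for dst, _ in succs[n]:
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)

    # Forward pass: earliest start is the longest path from any source.
    estart = {n: 0 for n in nodes}
    for n in order:
        for dst, lat in succs[n]:
            estart[dst] = max(estart[dst], estart[n] + lat)

    # Backward pass: latest start that does not stretch the critical path.
    length = max(estart.values())
    lstart = {n: length for n in nodes}
    for n in reversed(order):
        for dst, lat in succs[n]:
            lstart[n] = min(lstart[n], lstart[dst] - lat)

    slack = {(s, d): lstart[d] - lat - estart[s] for s, d, lat in edges}
    return estart, lstart, slack

# Hypothetical graph: a and b feed c, c and d feed e, all latencies 1.
estart, lstart, slack = edge_slacks(
    [("a", "c", 1), ("b", "c", 1), ("c", "e", 1), ("d", "e", 1)])
print(slack)
```

Edges on the critical path (a to c to e) come out with zero slack, while the edge from d to e has one cycle of slack, so it could absorb an intercluster move without lengthening the schedule.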

  10. RHOP - Partitioning Phase
      • Modified Multilevel-KL algorithm [Kernighan ‘69]
      • Multilevel graph partitioning consists of two stages
        • Coarsening stage
        • Refinement stage
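The slide does not say how the coarsening stage chooses which operations to merge. A common choice in multilevel partitioners is heavy-edge matching, where the endpoints of the heaviest remaining edge are merged first; the sketch below shows one such pass under that assumption. The `coarsen_once` helper and the example graph are hypothetical and are not taken from RHOP itself.

```python
def coarsen_once(nodes, weighted_edges):
    """One coarsening pass using greedy heavy-edge matching.

    weighted_edges is a list of (u, v, weight). Each node is merged with
    at most one neighbour; unmatched nodes become singleton groups.
    """
    matched = set()
    groups = []
    # Visit edges from heaviest to lightest (sorted() is stable, so ties
    # keep their input order).
    for u, v, _ in sorted(weighted_edges, key=lambda e: -e[2]):
        if u != v and u not in matched and v not in matched:
            matched.update((u, v))
            groups.append((u, v))
    groups.extend((n,) for n in nodes if n not in matched)
    return groups

# Hypothetical chain of five operations with edge weights 10, 8, 10, 1.
groups = coarsen_once(
    [1, 2, 3, 4, 5],
    [(1, 2, 10), (2, 3, 8), (3, 4, 10), (4, 5, 1)])
print(groups)
```

Repeating the pass on the merged graph yields progressively coarser levels; the refinement stage then walks back down those levels, moving operations between clusters at each one.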

  11. Cluster Refinement
      • 3 questions to answer:
        • Which cluster should operations move from?
        • How good is the current partition?
        • How profitable is it to move X from cluster A to B?

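  12. Where Should Operations Move From?
      [Figure: per-cycle schedule estimate; Cluster 1 holds operations 1, 2, 3, 4, 5, 6, 8, 9, 12, 14 and Cluster 2 holds 7, 10, 11, 13]
      • Cluster 1 per-cycle weights: 2.5, 2.0, 0.5, 0.0, 0.0 (Cluster_wgt1 = 5.0)
      • Cluster 2 per-cycle weights: 0.0, 0.33, 0.33, 0.0, 0.0 (Cluster_wgt2 = 0.67)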

  13. How Good is this Partition?
      Cluster 1   Max    Cluster 2
      2.5         2.5    0.0
      2.0         2.0    0.33
      0.5         0.5    0.33
      0.0         0.0    0.0
      0.0         0.0    0.0
      Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67, SL = 5.0

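  14. How Good is This Proposed Move?
      [Figure: after the proposed move, Cluster 1 holds operations 1, 2, 3, 8, 12, 14 and Cluster 2 holds 4, 5, 6, 7, 9, 10, 11, 13]
      • Cluster 1 per-cycle weights after the move: 1.0, 0.0, 0.0, 0.0, 0.0
      • Cluster 2 per-cycle weights after the move: 1.33, 2.33, 0.83, 0.0, 0.0
      • SL(before) = 5.0, SL(after) = 4.5, Lgain = 0.5
      • Egain = -1.0, Mgain = 4.0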
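The SL and Lgain figures on slides 13 and 14 can be reproduced from the per-cycle cluster weights. The sketch below assumes that SL is the sum, over cycles, of the largest cluster weight in each cycle (which is what the Max column on slide 13 suggests) and that Lgain is SL(before) minus SL(after). The `sl_estimate` helper is hypothetical, and the slides do not give enough detail to reproduce Egain or Mgain.

```python
def sl_estimate(per_cluster_weights):
    """Sum over cycles of the maximum cluster weight in that cycle.

    per_cluster_weights holds one list of per-cycle weights per cluster;
    all lists must cover the same cycles.
    """
    return sum(max(cycle) for cycle in zip(*per_cluster_weights))

# Per-cycle weights as read from slides 12 to 14.
before = [
    [2.5, 2.0, 0.5, 0.0, 0.0],    # Cluster 1
    [0.0, 0.33, 0.33, 0.0, 0.0],  # Cluster 2
]
after = [
    [1.0, 0.0, 0.0, 0.0, 0.0],    # Cluster 1 after the move
    [1.33, 2.33, 0.83, 0.0, 0.0], # Cluster 2 after the move
]

sl_before = sl_estimate(before)
sl_after = sl_estimate(after)
lgain = sl_before - sl_after
print(round(sl_before, 2), round(sl_after, 2), round(lgain, 2))
```

With these inputs the estimate is 5.0 before the move and about 4.5 after it, matching the slide up to rounding, so the move is worth roughly half a cycle of estimated load.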

  15. Experimental Evaluation
      • Trimaran toolset: a retargetable VLIW compiler
      • Evaluated DSP kernels and SPECint2000
      • 64 registers per cluster
      • Latencies similar to Itanium
      • Perfect caches
      • For more detailed results, see paper

  16. 2 Cluster Results vs 1 Cluster

  17. 4 Cluster Results vs 1 Cluster

  18. Conclusions
      • A new, region-scoped method for clustering operations
      • Prescheduling technique
      • Estimates on schedule length used instead of scheduler
      • Combines slack distribution with multilevel-KL partitioning
      • Performs better as number of resources increases
      [Chart: Average Improvement]

  19. Questions? http://cccp.eecs.umich.edu

  20. Backup Slides

  21. Previous Work

  22. Bottom-Up Greedy (BUG)
      • Typical clustering strategy; runs into trouble because of its limited view
      • Places operations without knowing the rest of the graph
      • Uses the scheduler to determine where to best place each operation
      • First used in Multiflow trace [Ellis ‘85]

  23. Graph Partitioning Algorithms
      • Local improvement methods:
        • Kernighan-Lin
          • Swaps pairs of operations between partitions
        • Fiduccia and Mattheyses
          • KL-inspired, efficient O(|E|) algorithm
        • Simulated annealing
        • Genetic algorithms
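The pair-swapping step of Kernighan-Lin rests on a simple gain formula: swapping a and b reduces the cut by D(a) + D(b) - 2·c(a, b), where D(x) is x's external edge cost minus its internal edge cost and c(a, b) is the weight of the edge between them. The sketch below evaluates that textbook formula on a hypothetical four-node graph; the helper names are assumptions, and this is the classic algorithm rather than the modified version RHOP uses.

```python
def cut_weight(side, w):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(c for (u, v), c in w.items() if side[u] != side[v])

def swap_gain(a, b, side, w):
    """Cut reduction from swapping a and b (which sit on opposite sides)."""
    def d(x):
        # External cost minus internal cost over the edges touching x.
        ext = sum(c for (u, v), c in w.items()
                  if x in (u, v) and side[u] != side[v])
        internal = sum(c for (u, v), c in w.items()
                       if x in (u, v) and side[u] == side[v])
        return ext - internal
    c_ab = w.get((a, b), 0) + w.get((b, a), 0)
    return d(a) + d(b) - 2 * c_ab

# Hypothetical weighted graph, split as {1, 3} versus {2, 4}.
w = {(1, 2): 3, (1, 3): 1, (2, 4): 1, (3, 4): 2}
side = {1: "A", 3: "A", 2: "B", 4: "B"}

gain = swap_gain(3, 2, side, w)
swapped = dict(side)
swapped[3], swapped[2] = side[2], side[3]
print(cut_weight(side, w), cut_weight(swapped, w), gain)
```

The predicted gain equals the difference between the cut before and after the swap, which is what lets KL rank candidate swaps without recomputing the whole cut each time.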

  24. Graph Partitioning Algorithms
      • Global methods:
        • Geometric methods
          • Coordinate based, not suitable for clustering
        • Coordinate-free methods
          • Recursive spectral bisection (RSB)
          • Multilevel-RSB
          • Multilevel-KL

  25. RHOP - Example

  26. Improvement at Increasing CPLs

  27. Resource Manager Overhead
