Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
June 5, 2014

Clustered Architectures
[Figure: a conventional architecture (one register file shared by all FUs) next to a clustered architecture (Cluster 1 and Cluster 2, each with its own register file and FUs)]
• Increasing width from 4 to 8 increases total delay 29% [Palacharla ‘98]
• Clustered approach:
  • Decentralized architecture
  • Communication through interconnection network
  • Used in Alpha 21264, TI C6x, Analog Devices TigerSHARC and others

Basics of Multicluster Compilation
• Objectives:
  • Balance workload per cluster
  • Minimize critical intercluster communication
[Figure: Cluster 1 and Cluster 2, each with a register file and I/MEM units, joined by an interconnection network; operations (+, >>, &, *, LW) are assigned to clusters, with one intercluster move]

Problem #1: Local vs Global Scope
[Figure: the same 12-operation dataflow graph scheduled cycle by cycle under local scope clustering and under global scope clustering, with the required intercluster moves marked]

Problem #2: Scheduler-centric
• Cluster assignment during scheduling adds complexity
• Detailed resource models and reservation tables are slow
• Forces local decisions
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, filled in as operations 1 to 12 are scheduled]

Our Approach
• Opposite approach to conventional clustering
• Global view
• Graph partitioning strategy [Aletà ‘01, ‘02]
• Identify tightly coupled operations and treat them uniformly
• Non-scheduler-centric mindset
• Prescheduling technique
• Doesn’t complicate scheduler
• Enables global view of code
• Estimate-based approach [Lapinskii ‘01]

Region-based Hierarchical Operation Partitioning (RHOP)
• Code is considered one region at a time
• Weight calculation creates guides for good partitions
• Partitioning clusters operations based on the given weights
[Figure: flow from Program (int main { int x; printf(…); . . . }) to Region, then Weight Calculation, then Graph Partitioning, shown on an example graph with edge weights of 1, 8 and 10]

Node Weights
• Create a metric to determine resource usage
  • Dedicated resources: accounts for FUs
  • Shared resources: accounts for buses, ports
[Figure: two clusters, each with a register file and I, F, M, B units, above a 14-node dataflow graph]

Edge Weights
• Slack distribution allocates slack to certain edges
• Edge slack: $\mathit{slack}_{edge} = \mathit{lstart}_{dest} - \mathit{latency}_{edge} - \mathit{estart}_{src}$
• First-come, first-served method used
[Figure: 14-node dataflow graph with each node annotated (estart, lstart), from (0,0) at nodes 1 and 2 to (4,4) at node 14, with slack values on the edges and the resulting edge weights of 1, 8 and 10]

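The edge slack formula can be illustrated with a small Python sketch. This is not the RHOP implementation: the four-node graph, the unit latencies, and the helper names `compute_times` and `edge_slack` are assumptions made for the example. It takes estart from a forward (as-early-as-possible) pass and lstart from a backward (as-late-as-possible) pass bounded by the critical path length.

```python
from collections import defaultdict

def compute_times(nodes, edges):
    """Return (estart, lstart) dicts.

    nodes: node names in topological order (assumed by this sketch).
    edges: dict mapping (src, dest) -> latency.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for (src, dest), lat in edges.items():
        preds[dest].append((src, lat))
        succs[src].append((dest, lat))

    # Forward pass: earliest start of each node.
    estart = {}
    for n in nodes:
        estart[n] = max((estart[p] + lat for p, lat in preds[n]), default=0)

    # Backward pass: latest start that does not stretch the critical path.
    length = max(estart.values())
    lstart = {}
    for n in reversed(nodes):
        lstart[n] = min((lstart[s] - lat for s, lat in succs[n]), default=length)
    return estart, lstart

def edge_slack(edges, estart, lstart):
    """slack = lstart(dest) - latency(edge) - estart(src), per edge."""
    return {(s, d): lstart[d] - lat - estart[s] for (s, d), lat in edges.items()}

# Hypothetical example: a -> b -> d is the critical path, c -> d has slack.
nodes = ["a", "b", "c", "d"]
edges = {("a", "b"): 1, ("b", "d"): 1, ("c", "d"): 1}
estart, lstart = compute_times(nodes, edges)
print(edge_slack(edges, estart, lstart))
```

In this example the edges on the critical path get slack 0 and the edge from `c` to `d` gets slack 1, which mirrors the slide's idea that only some edges can absorb an intercluster move without lengthening the schedule.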
RHOP - Partitioning Phase
• Modified Multilevel-KL algorithm [Kernighan ‘69]
• Multilevel graph partitioning consists of two stages
  • Coarsening stage
  • Refinement stage

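As a rough illustration of a coarsening stage, the sketch below performs one pass of heavy-edge matching, a rule commonly used in multilevel partitioners. The slide does not say which matching rule RHOP uses, so the rule itself, the function name `coarsen_once`, and the example weights are all assumptions.

```python
def coarsen_once(nodes, edge_weights):
    """One coarsening pass using greedy heavy-edge matching (assumed rule).

    nodes: iterable of node ids.
    edge_weights: dict mapping (u, v) -> weight for an undirected graph.
    Returns (group_of, coarse_edges): the super-node each node joined,
    and the edge weights between super-nodes.
    """
    group_of = {}
    # Visit edges from heaviest to lightest; merge endpoints that are still free.
    for (u, v), _ in sorted(edge_weights.items(), key=lambda kv: -kv[1]):
        if u not in group_of and v not in group_of:
            group_of[u] = group_of[v] = (u, v)
    # Unmatched nodes become singleton super-nodes.
    for n in nodes:
        group_of.setdefault(n, (n,))

    # Sum the weights of edges that still cross between super-nodes.
    coarse_edges = {}
    for (u, v), w in edge_weights.items():
        a, b = group_of[u], group_of[v]
        if a != b:
            key = (a, b) if a <= b else (b, a)
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return group_of, coarse_edges

# Hypothetical graph: heavy edges (1,2) and (3,4) collapse first.
weights = {(1, 2): 10, (2, 3): 1, (3, 4): 8, (1, 3): 1}
groups, coarse = coarsen_once([1, 2, 3, 4], weights)
print(groups)
print(coarse)
```

Merging across heavy edges first keeps tightly coupled operations in the same super-node, so the refinement stage only has to decide where to cut among the light edges that remain.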
Cluster Refinement
• 3 questions to answer:
  • Which cluster should operations move from?
  • How good is the current partition?
  • How profitable is it to move X from cluster A to B?

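Where Should Operations Move From?
[Figure: operations and weights per cluster. Cluster 1 holds operations 1, 2, 3, 4, 5, 6, 8, 9, 12, 14 with weights 2.5, 2.0, 0.5, 0.0, 0.0; Cluster 2 holds operations 7, 10, 11, 13 with weights 0.0, 0.33, 0.33, 0.0, 0.0]
Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67
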
How Good is this Partition?
[Figure: weight tables for Cluster 1 and Cluster 2 beside a Max column with entries 2.5, 2.0, 0.5, 0.0, 0.0]
Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67
SL = 5.0

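The numbers on this slide are consistent with SL being the sum, entry by entry, of the larger of the two clusters' weights. That reading is inferred from the figure rather than stated in the text, so the sketch below should be taken as an assumption; the name `system_load` is hypothetical.

```python
def system_load(cluster_weights):
    """Sum over table entries of the maximum weight across clusters.

    cluster_weights: one list of weights per cluster, all the same length.
    This formula is an inference from the slide's Max column, not a
    definition taken from the source.
    """
    return sum(max(entry) for entry in zip(*cluster_weights))

# Weights as they appear on the slide.
cluster1 = [2.5, 2.0, 0.5, 0.0, 0.0]
cluster2 = [0.0, 0.33, 0.33, 0.0, 0.0]

print(sum(cluster1))                      # Cluster_wgt1 on the slide: 5.0
print(system_load([cluster1, cluster2]))  # SL on the slide: 5.0
```

Because Cluster 1 dominates every entry, SL equals Cluster_wgt1 here, which is what marks Cluster 1 as the overloaded cluster.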
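How Good is This Proposed Move?
[Figure: partition after the proposed move. Cluster 1 holds operations 1, 2, 3, 8, 12, 14 with weights 1.0, 0.0, 0.0, 0.0, 0.0; Cluster 2 holds operations 4, 5, 6, 7, 9, 10, 11, 13 with weights 1.33, 2.33, 0.83, 0.0, 0.0]
SL(before) = 5.0
SL(after) = 4.5
Lgain = 0.5
Egain = -1.0
Mgain = 4.0
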
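A short sketch of how Lgain relates to the two SL values, under the assumption (inferred from the figures, not stated on the slides) that SL is the entry-wise maximum across clusters, summed. The weight lists are read off the flattened figures, and `system_load` and `load_gain` are hypothetical names. Egain and Mgain are not defined on the slide and are not modeled here.

```python
def system_load(cluster_weights):
    """Assumed metric: sum of the per-entry maximum weight across clusters."""
    return sum(max(entry) for entry in zip(*cluster_weights))

def load_gain(before, after):
    """Lgain = SL(before) - SL(after); positive means the move helps."""
    return system_load(before) - system_load(after)

# Per-cluster weights before and after the proposed move, as read from the slides.
before = [[2.5, 2.0, 0.5, 0.0, 0.0], [0.0, 0.33, 0.33, 0.0, 0.0]]
after = [[1.0, 0.0, 0.0, 0.0, 0.0], [1.33, 2.33, 0.83, 0.0, 0.0]]

print(round(system_load(after), 2))        # slide reports SL(after) = 4.5
print(round(load_gain(before, after), 2))  # slide reports Lgain = 0.5
```

The computed values differ from the slide's in the second decimal place because the slide's weights are themselves rounded.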
Experimental Evaluation
• Trimaran toolset: a retargetable VLIW compiler
• Evaluated DSP kernels and SPECint2000
• 64 registers per cluster
• Latencies similar to Itanium
• Perfect caches
• For more detailed results, see the paper

Conclusions
• A new, region-scoped method for clustering operations
• Prescheduling technique
• Estimates of schedule length used instead of a scheduler
• Combines slack distribution with multilevel-KL partitioning
• Performs better as the number of resources increases
[Figure: Average Improvement]

Questions? http://cccp.eecs.umich.edu
Bottom-Up Greedy (BUG)
• Typical clustering strategy; runs into trouble because of its limited view
• Places operations without knowing the rest of the graph
• Uses the scheduler to determine where to best place each operation
• First used in Multiflow Trace [Ellis ‘85]

Graph Partitioning Algorithms
• Local improvement methods:
  • Kernighan-Lin
    • Swaps pairs of operations between partitions
  • Fiduccia and Mattheyses
    • KL-inspired, efficient O(|E|) algorithm
  • Simulated annealing
  • Genetic algorithms

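To make the Kernighan-Lin pair swap concrete, the sketch below computes the standard KL gain for swapping two nodes that sit in different partitions: gain = D(a) + D(b) - 2·c(a, b), where D is a node's external minus internal edge cost. The example graph, the partition, and the helper names are assumptions for illustration; the sketch shows a single gain evaluation, not the full algorithm.

```python
def cut_weight(part, weights):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in weights.items() if part[u] != part[v])

def d_value(node, part, weights):
    """External cost minus internal cost of `node` under partition `part`."""
    external = internal = 0
    for (u, v), w in weights.items():
        if node in (u, v):
            other = v if u == node else u
            if part[other] == part[node]:
                internal += w
            else:
                external += w
    return external - internal

def swap_gain(a, b, part, weights):
    """Reduction in cut weight if a and b (in different partitions) are swapped."""
    c_ab = weights.get((a, b), 0) + weights.get((b, a), 0)
    return d_value(a, part, weights) + d_value(b, part, weights) - 2 * c_ab

# Hypothetical undirected graph and starting bisection.
weights = {(1, 3): 5, (2, 4): 1, (1, 2): 1, (3, 4): 1, (2, 3): 4}
part = {1: 0, 2: 0, 3: 1, 4: 1}

gain = swap_gain(2, 3, part, weights)
swapped = dict(part)
swapped[2], swapped[3] = part[3], part[2]
print(gain, cut_weight(part, weights) - cut_weight(swapped, weights))
```

The two printed numbers agree, which is the property KL relies on: the gain of a swap can be evaluated from local D values without recomputing the whole cut.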
Graph Partitioning Algorithms
• Global methods:
  • Geometric methods
    • Coordinate based, not suitable for clustering
  • Coordinate-free methods
    • Recursive spectral bisection (RSB)
    • Multilevel-RSB
    • Multilevel-KL