Region-based Hierarchical Operation Partitioning for Multicluster Processors. Michael Chu, Kevin Fan, Scott Mahlke, University of Michigan. Presented by Cristian Petrescu-Prahova.
Clustered Register Files • Why? • Register file cost and access time grow with the square of the number of register ports • Bypass logic grows quadratically with the number of operations issued per cycle • The distance separating FUs from the register file increases with a large number of FUs • => Clustered register files • Decentralized architecture with several small register files • Each register file supplies operands to a subset of FUs • Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog Devices TigerSHARC (two clusters); reconfigurable meshes?
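The quadratic port-cost argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `relative_rf_cost` and the figure of 3 ports per FU (two reads, one write) are assumptions, not from the slides, and the extra ports needed for inter-cluster copies are ignored.

```python
def relative_rf_cost(num_fus, num_clusters, ports_per_fu=3):
    """Relative register-file cost, assuming cost ~ (ports per file)^2.

    ports_per_fu=3 (two read ports, one write port) is an assumed figure.
    Ports for inter-cluster communication are not modeled.
    """
    ports_per_file = ports_per_fu * num_fus / num_clusters
    # One register file per cluster, each paying the quadratic port cost.
    return num_clusters * ports_per_file ** 2


central = relative_rf_cost(8, 1)
for k in (1, 2, 4):
    # Cost of a k-cluster design relative to one centralized file.
    print(k, relative_rf_cost(8, k) / central)
```

Under these assumptions, splitting 8 FUs into k clusters divides the total register file cost by k, which is the motivation for the decentralized design.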
Goal • Partition operations across the resources available on each cluster to maximize ILP • Minimize inter-cluster communication • Rule of thumb: • A processor with 2 identical clusters loses ~20% performance • A processor with 4 identical clusters loses ~30% performance • Non-identical clusters lead to even greater performance loss
Well-Known Technique: Bottom-Up Greedy (BUG) • Recurse along the DFG, critical path first • Assign each operation to a cluster based on scheduler estimates of the earliest time the operation and its predecessors can complete • Problem 1: makes local decisions (see figure) • Problem 2: is slow, since it needs to query accurate cluster status information for each operation considered
Region-Based Hierarchical Operation Partitioning (RHOP) • Works on acyclic DFGs extracted from the complete program based on region decomposition (presenter's note: I assume a region is roughly a loop?) • Two phases: • Weight calculation: node and edge • Partitioning: coarsening and refinement
Node Weight Calculation • Reflects the quantity of resources per operation • Ignores dependencies • Individual weight (FUs) • Shared weight (ports, buses)
Edge Weight Calculation • Measure of criticality • Based on the notion of slack • First-come, first-served slack distribution
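To make the notion of slack concrete, here is a minimal sketch that computes per-edge slack on an acyclic DFG from earliest and latest start times. The function name `edge_slack`, the input layout, and the example latencies are assumptions for illustration; the slide's first-come, first-served distribution of slack and the mapping from slack to an edge weight are not modeled here.

```python
from collections import defaultdict


def edge_slack(nodes, edges, latency):
    """Slack of each DFG edge (hypothetical helper, not from the slides).

    nodes:   op names in topological order
    edges:   list of (producer, consumer) pairs
    latency: dict op -> latency in cycles
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Earliest start: an op starts once all of its producers have finished.
    early = {}
    for n in nodes:
        early[n] = max((early[p] + latency[p] for p in preds[n]), default=0)
    length = max(early[n] + latency[n] for n in nodes)  # critical path length

    # Latest start that does not lengthen the critical path.
    late = {}
    for n in reversed(nodes):
        late[n] = min((late[s] for s in succs[n]), default=length) - latency[n]

    # Edge slack: how long the value can be delayed before the consumer slips.
    return {(u, v): late[v] - (early[u] + latency[u]) for u, v in edges}


slack = edge_slack(
    nodes=["a", "b", "c", "d"],
    edges=[("a", "c"), ("b", "c"), ("c", "d")],
    latency={"a": 2, "b": 1, "c": 1, "d": 1},
)
print(slack)
```

In this example the edges a→c and c→d have zero slack (they lie on the critical path), while b→c has one cycle of slack, so an inter-cluster move on b→c would be cheaper than one on the other two edges.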
Coarsening Partitioning • Multilevel graph partitioning algorithm (as in Chaco, METIS) • Works by coarsening highly related nodes into partitions; takes into account only edge weights • Takes a snapshot at each step for the refinement phase
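The coarsening phase can be sketched as repeated heavy-edge matching, a common approach in multilevel partitioners. This is a simplified illustration under assumptions, not necessarily the exact matching rule used by RHOP; the names `coarsen_once` and `coarsen` and the data layout are hypothetical.

```python
def coarsen_once(groups, edge_weights, target):
    """One coarsening pass: merge the most heavily connected group pairs.

    groups:       list of frozensets of ops
    edge_weights: dict {(op_u, op_v): weight}
    target:       stop merging once this many groups remain
    """
    owner = {op: i for i, g in enumerate(groups) for op in g}
    # Total edge weight between each pair of distinct groups.
    between = {}
    for (u, v), w in edge_weights.items():
        a, b = owner[u], owner[v]
        if a != b:
            key = (min(a, b), max(a, b))
            between[key] = between.get(key, 0) + w

    matched, merged = set(), []
    budget = len(groups) - target  # number of merges still allowed
    # Heaviest pairs first; ties broken by group index for determinism.
    for (a, b), _w in sorted(between.items(), key=lambda kv: (-kv[1], kv[0])):
        if budget == 0:
            break
        if a not in matched and b not in matched:
            matched.update((a, b))
            merged.append(groups[a] | groups[b])
            budget -= 1
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged


def coarsen(ops, edge_weights, num_clusters):
    """Coarsen until num_clusters groups remain, keeping every level."""
    level = [frozenset([op]) for op in ops]
    snapshots = [level]  # one snapshot per level, reused during refinement
    while len(level) > num_clusters:
        nxt = coarsen_once(level, edge_weights, num_clusters)
        if len(nxt) == len(level):  # no connected pairs left to merge
            break
        level = nxt
        snapshots.append(level)
    return snapshots


snapshots = coarsen(
    ops=["a", "b", "c", "d"],
    edge_weights={("a", "c"): 10, ("b", "c"): 1, ("c", "d"): 10},
    num_clusters=2,
)
for level in snapshots:
    print([sorted(g) for g in level])
```

With these example weights the critical chain a, c, d ends up in one group and the low-weight operation b in the other, so only the cheap edge b→c crosses clusters.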
Refinement Partitioning • Traverse the coarsening stages in reverse, making improvements to the initial partition • At each stage the coarsened nodes available at that point are considered for movement to another cluster • Highly related operations are grouped together at each stage because we follow the coarsening process backwards • Metrics • Cluster weight • Estimate of the load per cluster • The cluster with the highest weight is denoted ‘the imbalanced cluster’ • System load • Estimates the load across all clusters • Gain • The gain of moving operations into other clusters
Cluster Weight • Individual resource constraint per cluster, per cycle (op groups) • Total node weight per cluster per cycle (shared constraints) • Cycle weight per cluster • Cluster weight
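The formulas for this slide did not survive in the transcript, so the following is only one plausible reading, not the authors' definition. It assumes each operation has an estimated cycle and a node weight, takes the cycle weight to be the sum of node weights in that cycle, and takes the cluster weight to be the peak cycle weight. All names (`cluster_weights`, `imbalanced_cluster`) and all numbers are hypothetical.

```python
from collections import defaultdict


def cluster_weights(assignment, cycle_of, node_weight):
    """Per-cluster load estimate (assumed definition, see lead-in).

    assignment:  dict op -> cluster id
    cycle_of:    dict op -> estimated issue cycle
    node_weight: dict op -> fraction of a cluster's resources the op uses
    """
    per_cycle = defaultdict(lambda: defaultdict(float))
    for op, cluster in assignment.items():
        # Cycle weight: total node weight demanded in that cycle.
        per_cycle[cluster][cycle_of[op]] += node_weight[op]
    # Assumption: cluster weight is the peak per-cycle demand.
    return {c: max(cycles.values()) for c, cycles in per_cycle.items()}


def imbalanced_cluster(weights):
    """The cluster with the highest weight."""
    return max(weights, key=weights.get)


weights = cluster_weights(
    assignment={"a": 0, "b": 0, "c": 0, "d": 1},
    cycle_of={"a": 0, "b": 0, "c": 2, "d": 3},
    node_weight={"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5},
)
print(weights, imbalanced_cluster(weights))
```

Here cluster 0 is fully loaded in cycle 0 (two operations of weight 0.5 each), so it is the one refinement would try to relieve first.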
System Load • Inter-cluster move overhead • Total load, based on cycle by cycle estimation
Gain • Load gain • Edge gain • Move gain
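As a rough illustration of how a candidate move might be scored during refinement, the sketch below combines a load gain and an edge gain. The definitions are assumptions, since the slide's formulas are not in the transcript: load is simplified to the total node weight per cluster, the two gains are added with equal weighting, and the move gain is not modeled. The names `max_load`, `cut_weight`, and `gain` are hypothetical.

```python
def max_load(assignment, node_weight, clusters):
    """Highest total node weight on any cluster (simplified load estimate)."""
    loads = {c: 0 for c in clusters}
    for op, c in assignment.items():
        loads[c] += node_weight[op]
    return max(loads.values())


def cut_weight(assignment, edge_weights):
    """Total weight of edges whose endpoints sit on different clusters."""
    return sum(w for (u, v), w in edge_weights.items()
               if assignment[u] != assignment[v])


def gain(assignment, group, target, node_weight, edge_weights, clusters):
    """Score for moving `group` to cluster `target` (assumed formula)."""
    moved = dict(assignment)
    moved.update({op: target for op in group})
    # Load gain: how much the most loaded cluster is relieved.
    load_gain = (max_load(assignment, node_weight, clusters)
                 - max_load(moved, node_weight, clusters))
    # Edge gain: how much inter-cluster edge weight is removed.
    edge_gain = cut_weight(assignment, edge_weights) - cut_weight(moved, edge_weights)
    return load_gain + edge_gain  # assumption: equal weighting


assignment = {"a": 0, "b": 0, "c": 0, "d": 1}
node_weight = {"a": 1, "b": 1, "c": 1, "d": 1}
edge_weights = {("a", "c"): 10, ("b", "c"): 1, ("c", "d"): 10}

print(gain(assignment, {"b"}, 1, node_weight, edge_weights, clusters=(0, 1)))
print(gain(assignment, {"d"}, 0, node_weight, edge_weights, clusters=(0, 1)))
```

Moving b balances the load but cuts a light edge, so the two effects cancel. Moving d removes a heavy cut edge at the cost of more imbalance, which shows why the relative weighting of the gain terms matters in practice.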
Evaluation • Implemented using Trimaran tool set • Compared with BUG algorithm • 5 DSP benchmarks (high ILP), SPECint2000 (low ILP) • 5 configurations, functional units: integer (I), float (F), memory (M), branch (B)
Figure: Comparison of BUG and RHOP clustering performance versus a 1-cluster machine • Panels: 2-1111 processor, 4-1111 processor
Histogram of RHOP versus BUG • Achieved schedule length versus critical path length • Numbers on top are dynamic execution percentages
Compiling performance: number of calls to the resource table