Region-based Hierarchical Operation Partitioning for Multicluster Processors

Region-based Hierarchical Operation Partitioning for Multicluster Processors • Michael Chu, Kevin Fan, Scott Mahlke • University of Michigan • Presented by Cristian Petrescu-Prahova

Clustered Register Files • Why? • Register file cost and access time grow with the square of the number of register ports • Bypass logic grows quadratically with the number of operations issued per cycle • Distance separating FUs from the register file increases with a large number of FUs • => Clustered register files • Decentralized architecture with several small register files • Each register file supplies operands to a subset of FUs • Multiflow Trace, Alpha 21264, TI C6x, Analog Devices TigerSHARC (two clusters); reconfigurable meshes?
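A quick back-of-the-envelope calculation makes the quadratic argument concrete. The sketch below assumes cost is proportional to the square of the port count per register file, and that each FU needs 3 ports (2 reads, 1 write); the function name and the 3-port figure are illustrative assumptions, not numbers from the slides. It also ignores the extra ports and buses needed for inter-cluster moves.

```python
def relative_rf_cost(num_fus, num_clusters, ports_per_fu=3):
    """Total register-file cost under a cost ~ ports**2 model.

    ports_per_fu=3 (2 reads + 1 write) is an illustrative assumption.
    Inter-cluster communication ports are deliberately ignored.
    """
    ports_per_file = ports_per_fu * num_fus / num_clusters
    return num_clusters * ports_per_file ** 2


monolithic = relative_rf_cost(8, 1)
for k in (1, 2, 4):
    # Cost of k small register files relative to one monolithic file.
    print(k, relative_rf_cost(8, k) / monolithic)
```

Under this simplified model, splitting 8 FUs across k clusters divides the total port-related cost by k, which is the incentive that the performance losses on the next slide are traded against.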

Goal • Partition operations across the resources available on each cluster to maximize ILP • Minimize inter-cluster communication • Rule of thumb: • A processor with 2 identical clusters loses ~20% performance • A processor with 4 identical clusters loses ~30% performance • Non-identical clusters lead to even greater performance loss

Well-Known Technique: Bottom-Up Greedy (BUG) • Recurse along the DFG, critical path first • Assign each operation to a cluster based on estimates of when the operation and its predecessors can complete earliest (from the scheduler) • Problem 1: makes local decisions (see figure) • Problem 2: is slow, because it needs to query accurate cluster status information for each operation considered
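The greedy idea above can be sketched in a few lines. This is a heavily simplified illustration, not the BUG algorithm as published: it visits operations in a given topological order rather than critical path first, models each cluster as issuing one operation per cycle, and charges a fixed `move_latency` for inter-cluster operands. All names and parameters are hypothetical.

```python
def greedy_assign(ops, preds, latency, num_clusters, move_latency=1):
    """Assign each op to the cluster where it would finish earliest.

    ops must be in topological order. Simplifying assumptions: one issue
    slot per cluster per cycle, fixed inter-cluster move latency.
    """
    finish, cluster = {}, {}
    busy = [set() for _ in range(num_clusters)]  # occupied issue cycles
    for op in ops:
        best = None
        for c in range(num_clusters):
            ready = 0
            for p in preds.get(op, []):
                penalty = move_latency if cluster[p] != c else 0
                ready = max(ready, finish[p] + penalty)
            start = ready
            while start in busy[c]:  # wait for a free issue slot
                start += 1
            done = start + latency[op]
            if best is None or done < best[0]:  # ties keep lowest cluster
                best = (done, c, start)
        done, c, start = best
        busy[c].add(start)
        cluster[op], finish[op] = c, done
    return cluster, finish


ops = ["a", "b", "c", "d"]
preds = {"c": ["a", "b"], "d": ["c"]}
latency = {op: 1 for op in ops}
cluster, finish = greedy_assign(ops, preds, latency, num_clusters=2)
print(cluster, finish["d"])
```

In this toy graph the sketch places `b` on the second cluster because that looks best locally, yet the final operation still finishes at cycle 4, the same as running everything on one cluster with no inter-cluster move. That is a small instance of Problem 1.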

Region-Based Hierarchical Operation Partitioning • Works on acyclic DFGs extracted from the complete program based on region decomposition (presenter's assumption: a region roughly corresponds to a loop) • Two phases: • Weight calculation: Node and Edge • Partitioning: Coarsening and Refining

Node Weight Calculation • Reflects the quantity of resources per operation • Ignores dependencies • Individual weight (FUs) • Shared weight (ports, buses)

Edge Weight Calculation • Measure of criticality • Based on the notion of slack • First-come, first-served slack distribution
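Slack on a DFG edge can be computed from earliest (ASAP) and latest (ALAP) start times. The sketch below uses the standard definition, latest start of the consumer minus earliest finish of the producer, and stops there: how slack is distributed first-come, first-served and mapped to an edge weight is not reproduced, since the slides do not give those formulas. Function and variable names are illustrative.

```python
def edge_slack(ops, edges, latency):
    """Per-edge slack for an acyclic DFG; ops must be in topological order."""
    preds = {op: [] for op in ops}
    succs = {op: [] for op in ops}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Earliest start: all producers must have finished.
    asap = {}
    for op in ops:
        asap[op] = max((asap[p] + latency[p] for p in preds[op]), default=0)
    length = max(asap[op] + latency[op] for op in ops)  # critical path length

    # Latest start that still meets the critical path length.
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[s] for s in succs[op]), default=length) - latency[op]

    return {(s, d): alap[d] - (asap[s] + latency[s]) for s, d in edges}


slack = edge_slack(["a", "b", "c"], [("a", "c"), ("b", "c")],
                   {"a": 2, "b": 1, "c": 1})
print(slack)  # {('a', 'c'): 0, ('b', 'c'): 1}
```

The zero-slack edge `a -> c` lies on the critical path, so a partitioner would be most reluctant to cut it; the edge `b -> c` can absorb one cycle of inter-cluster delay.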

Coarsening Partitioning • Multilevel graph partitioning algorithm (Chaco, METIS) • Works by coarsening highly related nodes into partitions, taking into account only edge weights • Takes a snapshot at each step for use in the refinement phase
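One coarsening pass can be illustrated with heavy-edge matching, a common choice in multilevel partitioners. Whether RHOP uses exactly this matching rule is an assumption here; the sketch only shows the general idea of merging the most strongly connected pairs first. The function name and data layout are hypothetical.

```python
def coarsen_once(nodes, edge_weight):
    """Merge node pairs joined by the heaviest edges (one pass).

    edge_weight maps (u, v) tuples to weights. Returns a dict mapping
    each node to the group (tuple of nodes) it belongs to after the pass.
    """
    group_of = {}
    # Visit edges from heaviest to lightest; break ties by edge name.
    for (u, v), _w in sorted(edge_weight.items(),
                             key=lambda kv: (-kv[1], kv[0])):
        if u not in group_of and v not in group_of:
            group_of[u] = group_of[v] = (u, v)
    for n in nodes:  # unmatched nodes stay as singleton groups
        group_of.setdefault(n, (n,))
    return group_of


nodes = ["a", "b", "c", "d"]
weights = {("a", "b"): 10, ("b", "c"): 5, ("c", "d"): 8}
level1 = coarsen_once(nodes, weights)
print(level1)
```

A multilevel scheme would repeat this on the merged graph and keep each level's grouping as the snapshot that the refinement phase later walks through in reverse.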

Refinement Partitioning • Traverses the coarsening stages in reverse, making improvements to the initial partition • At each stage the coarsened nodes available at that point are considered for movement to another cluster • Highly related operations are grouped together at each stage because we follow the coarsening process backwards • Metrics • Cluster weight • Estimate of the load per cluster • The cluster with the highest weight is denoted ‘the imbalanced cluster’ • System load • Estimates the load across all clusters • Gain • The gain of moving operations into other clusters

Cluster Weight • Individual resource constraint per cluster, per cycle (op groups) • Total node weight per cluster per cycle (shared constraints) • Cycle weight per cluster • Cluster weight

System Load • Inter-cluster move overhead • Total load, based on cycle by cycle estimation

Gain • Load gain • Edge gain • Move gain
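The interplay between load and edge terms when evaluating a candidate move can be shown with a deliberately simplified sketch. The definitions below are assumptions made for illustration and are not the formulas from the slides: cluster weight is taken as the plain sum of node weights (no per-cycle estimate), load gain as the drop in the heaviest cluster's weight, edge gain as the drop in total cut edge weight, and the two are simply added. The move-gain term is omitted.

```python
def cluster_weights(node_weight, assign, num_clusters):
    """Sum of node weights per cluster (simplified stand-in for cluster weight)."""
    totals = [0.0] * num_clusters
    for node, c in assign.items():
        totals[c] += node_weight[node]
    return totals


def cut_weight(edge_weight, assign):
    """Total weight of edges whose endpoints sit in different clusters."""
    return sum(w for (u, v), w in edge_weight.items() if assign[u] != assign[v])


def gain_of_move(node, target, node_weight, edge_weight, assign, num_clusters):
    """Assumed gain: load improvement plus cut-weight improvement."""
    trial = dict(assign)
    trial[node] = target
    load_gain = (max(cluster_weights(node_weight, assign, num_clusters))
                 - max(cluster_weights(node_weight, trial, num_clusters)))
    edge_gain = cut_weight(edge_weight, assign) - cut_weight(edge_weight, trial)
    return load_gain + edge_gain


node_weight = {"a": 1, "b": 1, "c": 1, "d": 1}
edge_weight = {("a", "b"): 3, ("b", "c"): 0.5, ("c", "d"): 3}
assign = {"a": 0, "b": 0, "c": 0, "d": 1}
print(gain_of_move("c", 1, node_weight, edge_weight, assign, 2))  # 3.5
```

Here cluster 0 is the imbalanced cluster (weight 3 versus 1). Moving `c` next to `d` balances the load and replaces a heavy cut edge with a light one, so the move scores positively under these assumed definitions.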

Example

Evaluation • Implemented using Trimaran tool set • Compared with BUG algorithm • 5 DSP benchmarks (high ILP), SPECint2000 (low ILP) • 5 configurations, functional units: integer (I), float (F), memory (M), branch (B)

Improvement in dynamic total cycles of RHOP over BUG

Comparison of BUG and RHOP clustering performance versus a 1-cluster machine (panels: 2-1111 processor, 4-1111 processor)

Histogram of RHOP versus BUG: achieved schedule length versus critical path length. Numbers on top are dynamic execution percentages

Compiling performance: number of calls to the resource table
