Region-based Hierarchical Operation Partitioning for Multicluster Processors

Michael Chu, Kevin Fan, Scott Mahlke, University of Michigan. Presented by Cristian Petrescu-Prahova.

Presentation Transcript


  1. Region-based Hierarchical Operation Partitioning for Multicluster Processors • Michael Chu, Kevin Fan, Scott Mahlke, University of Michigan • Presented by Cristian Petrescu-Prahova

  2. Clustered Register Files • Why? • Register file cost and access time grow with the square of the number of register ports • Bypass logic grows quadratically with the number of operations issued per cycle • The distance separating FUs from the register file increases with a large number of FUs • => Clustered register files • Decentralized architecture with several small register files • Each register file supplies operands to a subset of FUs • Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog Devices TigerSHARC (two clusters); reconfigurable meshes?
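The quadratic scaling on this slide can be made concrete with a little arithmetic. The sketch below assumes cost is exactly proportional to the square of the port count, that every FU needs 3 ports (2 read, 1 write), and that inter-cluster ports are ignored; all three are illustrative simplifications, not figures from the slides.

```python
def relative_rf_cost(num_fus, ports_per_fu=3):
    """Relative cost of one register file, assuming cost ~ ports squared."""
    ports = num_fus * ports_per_fu
    return ports ** 2


def clustered_cost(num_fus, num_clusters, ports_per_fu=3):
    """Total relative cost of num_clusters equal register files.

    Assumes the FUs divide evenly and ignores inter-cluster ports.
    """
    fus_per_cluster = num_fus // num_clusters
    return num_clusters * relative_rf_cost(fus_per_cluster, ports_per_fu)


monolithic = relative_rf_cost(8)      # one file with 24 ports
two_clusters = clustered_cost(8, 2)   # two files with 12 ports each
four_clusters = clustered_cost(8, 4)  # four files with 6 ports each
```

Under these assumptions, splitting 8 FUs into k equal clusters divides the total register file cost by k.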

  3. Goal • Partition operations across the resources available on each cluster to maximize ILP • Minimize inter-cluster communication • Rule of thumb: • A processor with 2 identical clusters loses ~20% performance • A processor with 4 identical clusters loses ~30% performance • Non-identical clusters lead to even greater performance loss

  4. Well-Known Technique: Bottom-Up Greedy (BUG) • Recurse along the DFG, critical path first • Assign each operation to a cluster based on scheduler estimates of the earliest time the operation and its predecessors can complete • Problem 1: makes local decisions (see figure) • Problem 2: is slow, since it needs to query accurate cluster status info for each operation considered
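A much-simplified sketch of the greedy idea, not the actual BUG algorithm: it visits operations in topological order rather than recursing critical path first, and it replaces the scheduler's resource tables with a hypothetical model of one issue slot per cluster per cycle. The function name `greedy_assign` and the `move_latency` parameter are assumptions made for illustration.

```python
def greedy_assign(ops, preds, latency, num_clusters, move_latency=1):
    """Assign each op to the cluster where it is estimated to finish earliest.

    ops must be in topological order. Each cluster issues one op per cycle
    (an assumption of this sketch, not something stated on the slides).
    """
    cluster_of, finish = {}, {}
    next_free = [0] * num_clusters
    for op in ops:
        best = None
        for c in range(num_clusters):
            ready = 0
            for p in preds.get(op, []):
                # Operands from another cluster pay an inter-cluster move.
                penalty = move_latency if cluster_of[p] != c else 0
                ready = max(ready, finish[p] + penalty)
            start = max(ready, next_free[c])
            done = start + latency[op]
            if best is None or done < best[0]:
                best = (done, c, start)
        done, c, start = best
        cluster_of[op], finish[op] = c, done
        next_free[c] = start + 1
    return cluster_of, finish


ops = ["a", "b", "c"]
preds = {"c": ["a", "b"]}
latency = {"a": 1, "b": 1, "c": 1}
cluster_of, finish = greedy_assign(ops, preds, latency, num_clusters=2)
```

Each decision looks only at the operation in hand: here `b` is placed on the second cluster because that is locally best, and `c` then has to wait for an inter-cluster move whichever cluster it picks.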

  5. Region-Based Hierarchical Operation Partitioning (RHOP) • Works on acyclic DFGs extracted from the complete program by region decomposition. I assume region ~ loop (?!?) • Two phases: • Weight calculation: node and edge • Partitioning: coarsening and refining

  6. Node Weight Calculation • Reflects the quantity of resources per operation • Ignores dependencies • Individual weight (FUs) • Shared weight (ports, buses)

  7. Edge Weight Calculation • Measure of criticality • Based on the notion of slack • First-come, first-served slack distribution
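Slack can be computed from earliest and latest start times on the DFG. The sketch below uses the standard definition slack(src, dst) = lstart(dst) - latency(src) - estart(src) and reports raw per-edge slack only; it does not implement the first-come, first-served distribution named on the slide, and the mapping from slack to an edge weight is left out because the slides do not give it.

```python
def edge_slack(ops, edges, latency):
    """Return {(src, dst): slack} for a DAG; ops must be in topological order."""
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for s, d in edges:
        succs[s].append(d)
        preds[d].append(s)

    # Earliest start: after every predecessor has finished.
    estart = {}
    for o in ops:
        estart[o] = max((estart[p] + latency[p] for p in preds[o]), default=0)
    length = max(estart[o] + latency[o] for o in ops)

    # Latest start: as late as possible without stretching the schedule.
    lstart = {}
    for o in reversed(ops):
        latest_finish = min((lstart[s] for s in succs[o]), default=length)
        lstart[o] = latest_finish - latency[o]

    return {(s, d): lstart[d] - latency[s] - estart[s] for s, d in edges}


ops = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
latency = {"a": 1, "b": 3, "c": 1, "d": 1}
slack = edge_slack(ops, edges, latency)
critical = [e for e in edges if slack[e] == 0]
```

Both edges on the path a, c, d report the same 2 cycles of slack, although the path only has 2 cycles to spare in total. Avoiding that double counting is presumably what the slack distribution step is for.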

  8. Coarsening Partitioning • Multilevel graph partitioning algorithm (Chaco, Metis) • Works by coarsening highly related nodes into partitions, taking into account only edge weights • Takes a snapshot at each step for use in the refinement step
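One common way to coarsen in multilevel partitioners is heavy-edge matching: repeatedly merge the pairs of groups joined by the heaviest edges. The sketch below uses that technique as a stand-in, since the slides do not spell out the matching rule; the function names and the `target_groups` parameter are hypothetical.

```python
def coarsen_once(groups, edge_weight, max_merges):
    """Merge up to max_merges disjoint pairs of groups, heaviest edges first."""
    owner = {op: i for i, g in enumerate(groups) for op in g}
    between = {}
    for (u, v), w in edge_weight.items():
        a, b = owner[u], owner[v]
        if a != b:
            key = (min(a, b), max(a, b))
            between[key] = between.get(key, 0) + w

    matched, merged = set(), []
    ranked = sorted(between.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), _ in ranked:
        if len(merged) == max_merges:
            break
        if a not in matched and b not in matched:
            matched.update((a, b))
            merged.append(groups[a] | groups[b])
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged


def coarsen(ops, edge_weight, target_groups):
    """Return a snapshot of the grouping at every coarsening level."""
    snapshots = [[frozenset([op]) for op in ops]]
    while len(snapshots[-1]) > target_groups:
        current = snapshots[-1]
        nxt = coarsen_once(current, edge_weight, len(current) - target_groups)
        if len(nxt) == len(current):  # no edges left between groups
            break
        snapshots.append(nxt)
    return snapshots


edge_weight = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 10}
snapshots = coarsen(["a", "b", "c", "d"], edge_weight, target_groups=2)
```

The returned list plays the role of the per-step snapshots: a refinement pass can walk it from last to first.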

  9. Refinement Partitioning • Traverse the coarsening stages backwards, making improvements to the initial partition • At each stage, the coarsened nodes available at that point are considered for movement to another cluster • Highly related operations are grouped together at each stage because we follow the coarsening process backwards • Metrics: • Cluster weight: estimate of the load per cluster; the cluster with the highest weight is denoted 'the imbalanced cluster' • System load: estimates the load across all clusters • Gain: the gain of moving operations into other clusters
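A single refinement step might be sketched as follows. The cost function used here (weight of the heaviest cluster plus the total weight of cut edges) is a hypothetical stand-in for the system load and gain metrics, which the slides name but do not define; cluster weight is likewise reduced to a plain sum of node weights.

```python
def cluster_weights(assignment, node_weight, num_clusters):
    """Sum of node weights per cluster (a simplified cluster weight)."""
    weights = [0] * num_clusters
    for op, cluster in assignment.items():
        weights[cluster] += node_weight[op]
    return weights


def cut_weight(assignment, edge_weight):
    """Total weight of edges whose endpoints sit on different clusters."""
    return sum(w for (u, v), w in edge_weight.items()
               if assignment[u] != assignment[v])


def cost(assignment, node_weight, edge_weight, num_clusters):
    # Stand-in for system load: heaviest cluster plus communication.
    heaviest = max(cluster_weights(assignment, node_weight, num_clusters))
    return heaviest + cut_weight(assignment, edge_weight)


def refine_step(groups, assignment, node_weight, edge_weight, num_clusters):
    """Apply the best positive-gain move of one group out of the imbalanced cluster."""
    weights = cluster_weights(assignment, node_weight, num_clusters)
    imbalanced = weights.index(max(weights))
    base = cost(assignment, node_weight, edge_weight, num_clusters)
    best_gain, best = 0, assignment
    for group in groups:
        if any(assignment[op] != imbalanced for op in group):
            continue  # only groups on the imbalanced cluster are candidates
        for target in range(num_clusters):
            if target == imbalanced:
                continue
            trial = dict(assignment)
            for op in group:
                trial[op] = target
            gain = base - cost(trial, node_weight, edge_weight, num_clusters)
            if gain > best_gain:
                best_gain, best = gain, trial
    return best


groups = [frozenset("ab"), frozenset("c"), frozenset("d")]
node_weight = {"a": 1, "b": 1, "c": 1, "d": 1}
edge_weight = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 10}
start = {"a": 0, "b": 0, "c": 0, "d": 1}
refined = refine_step(groups, start, node_weight, edge_weight, num_clusters=2)
```

In this example, moving `c` next to `d` both balances the clusters and removes the heavy cut edge, so it is the move with the largest gain.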

  10. Cluster Weight • Individual resource constraint per cluster, per cycle (op groups) • Total node weight per cluster per cycle (shared constraints) • Cycle weight per cluster • Cluster weight

  11. System Load • Inter-cluster move overhead • Total load, based on cycle-by-cycle estimation

  12. Gain • Load gain • Edge gain • Move gain

  13. Example

  14. Evaluation • Implemented using the Trimaran tool set • Compared with the BUG algorithm • Benchmarks: 5 DSP benchmarks (high ILP), SPECint2000 (low ILP) • 5 machine configurations; functional unit types: integer (I), float (F), memory (M), branch (B)

  15. Improvement in dynamic total cycles of RHOP over BUG

  16. Comparison of BUG and RHOP clustering performance versus a 1-cluster machine • Panels: 2-1111 processor, 4-1111 processor

  17. Histogram of RHOP versus BUG: achieved schedule length versus critical path length. Numbers on top are dynamic execution percentages

  18. Compiling performance: number of calls to the resource table
