
Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures

Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures. Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan. Workshop on Application-Specific Processors (WASP-2), December 2, 2003.


Presentation Transcript


  1. Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures. Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan. Workshop on Application-Specific Processors (WASP-2), December 2, 2003.

  2. Clustered Architectures [Figure: a homogeneous clustered architecture (Cluster 1 and Cluster 2, both 32-bit) beside a heterogeneous clustered architecture (Cluster 1 32-bit, Cluster 2 8-bit); each cluster has its own register file and functional units] • Decentralize architecture to reduce register file bottleneck • Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC, and others • Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture

  3. Our Approach [Figure: two clusters, each with a register file (RF) and two functional units (FU)] • Partition operations with both performance and required hardware cost in mind • Maintain performance and reduce cost (bitwidth, FU repertoire) • Previous work has focused on single basic block, single cluster [Note ‘91] [Paulin ‘89] [Marwedel ‘90] • Each partition dictates a cluster configuration which has an associated hardware cost

  4. Our Proposed System • Today’s focus: cost-sensitive operation partitioning • Input: application and high-level machine specification (number of clusters, number of generic FUs) • Output: multicluster architecture description

  5. Cost-Sensitive Operation Partitioning [Figure: flow from Region through Weight Calculation to Graph Partitioning, shown with a dependence graph whose edges carry weights of 1, 8, and 10] • Builds on Region-Based Hierarchical Operation Partitioning (RHOP) • RHOP is a purely performance-based partitioner with no notion of hardware cost • Weight calculation creates guides for good partitions • Partitioning phase clusters operations based on the given weights • Cost metric added to the Graph Partitioning phase, which accounts for gate cost

  6. Coarsening Phase [Figure: Coarsened States 1 through 4; legend: narrow bitwidth, wide bitwidth] • Progressively groups highly related operations together • Continually pairs operations together • Forces partitioner to consider several operations as a single unit • Traditional RHOP: coarsen using edge weights • Cost-centric coarsening can ignore dependence edge criticality
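
The pairing step described on this slide can be sketched as a heavy-edge matching pass, a common heuristic in multilevel graph partitioners. This is only an illustration under assumptions: the slides do not give the exact matching rule RHOP uses, and the function name `coarsen_once`, the edge dictionary format, and the sample weights are all hypothetical.

```python
def coarsen_once(nodes, edges):
    """Pair up nodes along the heaviest edges (one coarsening step).

    nodes: iterable of hashable operation ids (hypothetical format).
    edges: dict mapping (u, v) tuples to a numeric edge weight.
    Returns a list of groups; each group is a tuple of one or two nodes.
    """
    matched = set()
    groups = []
    # Visit edges from heaviest to lightest so strongly related
    # operations are paired before weakly related ones.
    for (u, v), _weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            matched.update((u, v))
            groups.append((u, v))
    # Any operation left unmatched becomes a group of its own.
    groups.extend((n,) for n in nodes if n not in matched)
    return groups


# Sample data (made up for illustration, not taken from the talk).
ops = ["a", "b", "c", "d", "e"]
weights = {("a", "b"): 10, ("b", "c"): 8, ("c", "d"): 1, ("d", "e"): 8}
print(coarsen_once(ops, weights))  # [('a', 'b'), ('d', 'e'), ('c',)]
```

Repeating this pass on the resulting groups would give successively coarser states. A cost-centric variant might order candidate pairs by bitwidth similarity instead of edge weight, but that ordering is not specified in the slides.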

  7. Partitioning Phase • Travel back through each coarsening step, trying to refine the partition at each stage • est_cycles: performance metric from traditional RHOP • Adds a new cost metric for the hardware cost of the cluster

  8. Cost-Sensitive Refinement [Figure: three example partitions: est. cycles = 7, cost 28K; est. cycles = 8, cost 15K; est. cycles = 7, cost 15K; legend: narrow bitwidth, wide bitwidth] • Moves are made when they have positive benefit • When no more moves can be made, the algorithm uncoarsens to the previous coarsened state and tries moving again
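
A minimal sketch of "move when the benefit is positive" follows. The slides do not state how est_cycles and cost are combined, so the benefit formula here (a weighted sum of relative improvements) is an assumption, as are the names `move_benefit` and `greedy_refine`. The three (est. cycles, cost) pairs from the slide are reused as sample data; treating them as successive moves is also an assumption, and uncoarsening is not modeled.

```python
def move_benefit(before, after, perf_weight=1.0, cost_weight=1.0):
    """Benefit of a move as a weighted sum of relative improvements.

    before, after: (est_cycles, cost) tuples. The formula is a
    hypothetical stand-in; the talk does not give the actual metric.
    """
    cycles0, cost0 = before
    cycles1, cost1 = after
    perf_gain = (cycles0 - cycles1) / cycles0
    cost_gain = (cost0 - cost1) / cost0
    return perf_weight * perf_gain + cost_weight * cost_gain


def greedy_refine(start, neighbors, metrics, max_steps=100):
    """Repeatedly take the best move until no move has positive benefit."""
    current = start
    for _ in range(max_steps):  # step cap guards against endless looping
        scored = [(move_benefit(metrics[current], metrics[s]), s)
                  for s in neighbors[current]]
        if not scored:
            break
        benefit, best = max(scored)
        if benefit <= 0:
            break
        current = best
    return current


# Partition labels A, B, C are hypothetical; the numbers come from the slide.
metrics = {"A": (7, 28000), "B": (8, 15000), "C": (7, 15000)}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(greedy_refine("A", neighbors, metrics))  # C
```

With equal weights, moving from A to B trades one extra cycle for a large cost reduction and scores positive, and the follow-up move to C recovers the cycle at no extra cost.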

  9. Multicluster Cost Model [Figure: multiply and add operations with bitwidths from 8 to 32 mapped onto Int Unit 1 and Int Unit 2 beside a 32-bit register file, with high-cost and low-cost labels; total cost of cluster: one 32-bit register file, one 16-bit multiplier/adder, one 32-bit adder] • Cost model determines an estimate of gate cost of clusters • Estimate minimum required cost to support partitioned operations • Factors that influence hardware cost: • Register file size/width • Functional Unit (FU) width • FU opcode repertoire • Greedy algorithm used • Ignores dependences between operations • Similar to Rec/Res MII calculations for software pipelined loops
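
The greedy, dependence-free estimate described on this slide might look like the sketch below. It is not the authors' cost model: the packing rule, the names `estimate_fus` and `cluster_cost`, and every gate-count constant are assumptions chosen for illustration. Only the 64-register file size comes from the talk.

```python
# Hypothetical per-bit gate costs; real values would come from synthesis.
GATES_PER_BIT = {"add": 10, "mul": 100}
RF_GATES_PER_BIT = 8


def estimate_fus(ops, cycles):
    """Greedily pack operations onto FUs, ignoring dependences.

    ops: list of (opcode, bitwidth) pairs assigned to one cluster.
    cycles: target schedule length; each FU issues one op per cycle,
    so len(ops) operations need ceil(len(ops) / cycles) FUs.
    Returns one (opcode_set, width) tuple per FU.
    """
    # Group by opcode, widest first, so similar ops share an FU.
    ordered = sorted(ops, key=lambda op: (op[0], -op[1]))
    fus = []
    for start in range(0, len(ordered), cycles):
        batch = ordered[start:start + cycles]
        opcodes = frozenset(opcode for opcode, _ in batch)
        width = max(bits for _, bits in batch)
        fus.append((opcodes, width))
    return fus


def cluster_cost(ops, cycles, num_regs=64):
    """Estimated gate cost: register file plus all functional units."""
    rf_width = max(bits for _, bits in ops)
    rf_cost = num_regs * rf_width * RF_GATES_PER_BIT
    fu_cost = sum(sum(GATES_PER_BIT[o] for o in opcodes) * width
                  for opcodes, width in estimate_fus(ops, cycles))
    return rf_cost + fu_cost


sample_ops = [("mul", 16), ("mul", 16), ("add", 32),
              ("add", 8), ("add", 16), ("add", 10)]
print(estimate_fus(sample_ops, 4))
print(cluster_cost(sample_ops, 4))  # 18304
```

For the sample operations and a four-cycle budget, the sketch yields one 32-bit adder and one 16-bit multiplier next to a 32-bit register file, which is close in spirit to the cluster totals listed on the slide.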

  10. Experimental Methodology • Trimaran toolset: a retargetable VLIW compiler • Evaluated main loop of DSP kernels and selected benchmarks from MediaBench, MiBench and NetBench • Bitwidth information gathered through automatic program analysis • Cost estimates computed using Synopsys design tools at 0.18 µm • 64 registers per cluster

  11. 2-Cluster Cost Savings and Performance [Chart: percentage performance loss and cost savings per benchmark: fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]

  12. Source of Cost Savings Breakdown [Chart: normalized cost per benchmark: fft, rls, url, LU, crc, dct, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]

  13. Pareto Charts of Examined Machines [Charts: fsed kernel, LU kernel] • A wide spectrum of machine configurations was examined • Multiple groups often appear with expensive units
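
For readers unfamiliar with Pareto charts, the sketch below filters a set of machine configurations down to the non-dominated ones. It assumes the two axes are hardware cost and cycle count, both to be minimized; the transcript does not name the axes, and the data points and the name `pareto_front` are invented for illustration.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    points: list of (cost, cycles) tuples, lower is better on both axes.
    A point is dominated if a different point is no worse on both axes.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Made-up (cost, cycles) pairs for five hypothetical machines.
machines = [(10, 9), (12, 7), (15, 7), (20, 5), (11, 10)]
print(pareto_front(machines))  # [(10, 9), (12, 7), (20, 5)]
```

The quadratic scan is adequate for a few hundred candidate machines; a sort-based sweep would be the usual choice for larger design spaces.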

  14. Work in Progress • Merging step • How can machine designs for several basic blocks be combined? • Inaccurate cost model • How can a more accurate estimate for the cost be developed? • Space Exploration (external/internal) • Number of clusters and generic FUs are externally spacewalked • Allowable performance increase is internally spacewalked • What areas of this space exploration should be external/internal? • Reprogrammability of designed machines

  15. Conclusions • Developed a cost-sensitive method for partitioning operations across clusters • Used this partitioning to define an application-specific low-cost multicluster datapath architecture • Average performance loss and cost savings for two and four cluster machines:

  16. Questions? http://cccp.eecs.umich.edu

  17. Backup Slides

  18. 4-Cluster Cost Savings and Performance [Chart: percentage performance loss and cost savings per benchmark: fft, rls, url, LU, dct, crc, fsed, channel, huffman, blowfish, rawcaudio, rawdaudio, gsmdecode, gsmencode, and the average]

  19. Previous Work • Datapath synthesis • Cathedral-III: complete synthesis system from IMEC • Paulin and Knight: force-directed scheduling • Sehwa: designed processing pipelines from behavioral specs • PICO: designed application-specific VLIW processors • Bitwidth-sensitive datapath synthesis • Valen-C: augmented C language to convey bitwidth information

  20. Weight Calculation Phase • Edge weights • Assigns higher weight to edges likely to increase schedule length when cut • Uses a slack distribution method to assign weights • Node weights • Assigns a weight to each operation based on how much it is likely to affect the load of the FUs in the cluster • Higher weights attributed to operations that can • Not changed from Traditional RHOP
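
The idea that low-slack edges deserve high weights can be illustrated with the simplified sketch below. It is a stand-in, not RHOP's slack distribution method: it computes per-edge slack from ASAP and ALAP times with unit latencies and maps zero slack to the maximum weight. The function name `edge_weights`, the `max_weight` parameter, and the sample graph are hypothetical.

```python
def edge_weights(nodes, edges, max_weight=10):
    """Weight each dependence edge by how little slack it has.

    nodes: operation ids listed in topological order.
    edges: list of (producer, consumer) pairs; unit latency assumed.
    Returns a dict mapping each edge to a weight in [1, max_weight].
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Earliest start time of each operation (ASAP schedule).
    asap = {}
    for n in nodes:
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=0)
    length = max(asap.values())

    # Latest start time that keeps the critical path length (ALAP).
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=length)

    weights = {}
    for u, v in edges:
        slack = alap[v] - asap[u] - 1  # idle cycles this edge can absorb
        weights[(u, v)] = max(1, max_weight - slack)
    return weights


# Chain a -> b -> c -> d is critical; e -> d has two cycles of slack.
order = ["a", "b", "c", "e", "d"]
deps = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "d")]
print(edge_weights(order, deps))
```

Edges on the critical chain receive the maximum weight, so a partitioner that avoids cutting heavy edges would keep that chain within one cluster.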
