
A Distributed Control Path Architecture for VLIW Processors


Presentation Transcript


  1. A Distributed Control Path Architecture for VLIW Processors
     Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
     Advanced Computer Architecture Laboratory, University of Michigan
     *HP Laboratories

  2. Motivation
     • VLIW scaling problem:
       • Centralized resources
       • Highly ported structures
       • Wire delays
     [Diagram: a conventional wide VLIW with a centralized register file feeding many FUs and a single centralized instruction fetch/decode unit]

  3. Multicluster VLIW
     • Distribute register files
     • Cluster function units
     • Distribute data caches
     • Clusters communicate through an interconnection network
     • Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
     [Diagram: two clusters, each with its own register file and FUs, joined by an interconnection network but still fed by a shared instruction fetch/decode unit]

  4. Control Path Scaling Problem
     • Larger I-cache → longer latency
     • Long wires to distribute control signals
     • Code compression: align/shift hardware cost and power grow quadratically with the number of FUs (see the sketch below)
     [Diagram: a centralized PC and I-cache feeding an align/shift network that expands a compressed instruction (ops A-G plus NOPs) into the instruction register]
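To make the quadratic claim concrete, here is a back-of-envelope sketch in Python. The N-to-N routing model (any fetched operation slot may need to reach any FU decoder) and the align_shift_crosspoints function are assumptions for illustration; the slide does not give an exact cost model.

```python
# Back-of-envelope sketch (assumed model, not from the slide): if the
# align/shift network must be able to route any of the N fetched
# operation slots to any of the N function-unit decoders, its
# crosspoint count grows as N^2.
def align_shift_crosspoints(num_fus: int) -> int:
    """Crosspoints in an assumed N-to-N routing network."""
    return num_fus * num_fus

for n in (2, 4, 8, 16):
    print(f"{n:2d} FUs -> {align_shift_crosspoints(n):3d} crosspoints")
# Doubling the FU count quadruples the control-distribution hardware,
# which is the scaling problem this slide points at.
```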

  5. Straightforward Approach
     • Distribute I-fetch, in a spirit similar to the distribution of the data path
     • Local communication of control signals
     • Reduced latency, hardware cost, and power
     • Used in the Multiflow Trace 14/300 processors
     [Diagram: two clusters, each with its own PC, I-cache, instruction register, register file, and FUs, connected by an interconnection network]

  6. DVLIW Approach
     • Simple distribution has problems:
       • Doesn't support code compression
       • The PC is still a centralized resource
     [Diagram: DVLIW gives each cluster its own PC (PC0, PC1) and its own align/shift logic in front of its instruction register and I-cache]

  7. DVLIW Execution Model
     • Clusters execute in lock-step: when one cluster stalls, all clusters stall (see the sketch below)
     • Clusters collectively execute one thread, but each cluster runs its own instruction stream
     • The compiler orchestrates the execution of the streams and manages communication
     • Lightweight synchronization
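A minimal sketch of this lock-step model, assuming a toy cycle loop in Python; the Cluster class, the instruction streams, and the stall_cycles counter (standing in for something like an I-cache miss) are hypothetical, not the authors' simulator.

```python
# Minimal lock-step sketch (assumed model): clusters collectively
# execute one thread, each from its own stream, and a stall in any
# cluster stalls all of them for that cycle.
class Cluster:
    def __init__(self, name, stream, stall_cycles=0):
        self.name = name
        self.stream = stream              # this cluster's own instruction stream
        self.pc = 0                       # per-cluster program counter
        self.stall_cycles = stall_cycles  # hypothetical stall, e.g. I-cache miss

    def stalled(self):
        return self.stall_cycles > 0

    def step(self):
        if self.pc < len(self.stream):
            print(f"  {self.name}: {self.stream[self.pc]}")
            self.pc += 1

def run_lockstep(clusters):
    cycle = 0
    while any(c.pc < len(c.stream) for c in clusters):
        if any(c.stalled() for c in clusters):
            print(f"cycle {cycle}: one cluster stalls -> all clusters stall")
            for c in clusters:
                if c.stalled():
                    c.stall_cycles -= 1
        else:
            print(f"cycle {cycle}:")
            for c in clusters:            # all clusters advance together
                c.step()
        cycle += 1

run_lockstep([
    Cluster("cluster0", ["add", "cmpp", "br"]),
    Cluster("cluster1", ["mul", "nop", "br"], stall_cycles=1),
])
```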

  8. DVLIW Benefits
     • Completely decentralized architecture:
       • Distributed data path
       • Distributed control path
     • Supports arbitrary code compression
     • Exploits ILP on a multi-core style system
     • Good for embedded applications:
       • Low cost
       • Compiler support

  9. DVLIW Architecture
     [Diagram: four VLIW clusters connected to a banked L2; each cluster contains its own PC and next-PC logic, L1 I-cache, align/shift network, instruction register, register files, FUs (including a branch/MFU), and L1 D-cache; branch targets (br_target) are exchanged between clusters]

  10. Code Organization
      • DVLIW vs. conventional VLIW:
        • Code for each cluster is consecutive in memory
        • Operations in the same MultiOp are stored at different memory locations
        • Each cluster computes its own next PC (see the sketch below)
      [Diagram: conventional VLIW code layout addressed by a single PC vs. DVLIW per-cluster layouts addressed by PC0 and PC1]
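The layout difference can be sketched as follows. The textual encoding, the stream contents, and the fetch_multiop helper are illustrative assumptions, not the paper's binary format.

```python
# Illustrative layout sketch (encoding assumed): each cluster's code is
# consecutive in its own region of memory, so the operations of one
# logical MultiOp sit at different addresses, and each cluster advances
# its own PC to find the next one.
code = {
    "cluster0": ["add r1, r2, r3", "pbr btr1, BB2",  "br btr1, pr0"],
    "cluster1": ["mul r4, r5, r6", "pbr btr1, BB2'", "br btr1, pr0"],
}
pc = {name: 0 for name in code}      # one PC per cluster

def fetch_multiop():
    """One logical MultiOp = the operation at each cluster's own PC."""
    ops = {name: code[name][pc[name]] for name in code}
    for name in pc:
        pc[name] += 1                # every cluster computes its own next PC
    return ops

print(fetch_multiop())
# {'cluster0': 'add r1, r2, r3', 'cluster1': 'mul r4, r5, r6'}
```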

  11. Branch Mechanism
      • Maintain correct execution order:
        • All clusters transfer control in the same cycle
        • All clusters branch to the same logical MultiOp
      • Unbundled branch, as in HPL-PD:
        • PBR btr1, TARGET: prepare to branch; each cluster specifies its own target
        • CMPP pr0, (x>100)?: compute the branch condition; broadcast to all clusters
        • BR btr1, pr0: the branch itself; replicated in each cluster

  12. Branch Handling Example (see the sketch below)
      Conventional VLIW:
        ...
        pbr btr1, BB2
        cmpp pr0, (x>100)?
        ...
        br btr1, pr0
      DVLIW, Cluster 0:
        ...
        pbr btr1, BB2
        cmpp pr0, (x>100)?
        bcast pr0
        br btr1, pr0
      DVLIW, Cluster 1:
        ...
        pbr btr1, BB2'
        ...
        br btr1, pr0
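The DVLIW column can be condensed into a few lines of Python. The dvliw_branch function, its arguments, and the way the predicate broadcast is modeled are hypothetical simplifications of the PBR / CMPP / BCAST / BR sequence above.

```python
# Sketch (assumed, simplified) of DVLIW's distributed branch: each
# cluster prepares its own target address (PBR), one cluster evaluates
# the condition and broadcasts the predicate (CMPP + BCAST), and every
# cluster then transfers control in the same cycle (BR), landing on the
# same logical MultiOp at per-cluster addresses.
def dvliw_branch(clusters, targets, predicate_owner, x):
    btr = {c: targets[c] for c in clusters}   # PBR: per-cluster targets
    pr0 = (x > 100)                           # CMPP on predicate_owner, then BCAST
    # BR, replicated in every cluster: all take or all fall through together
    return {c: (btr[c] if pr0 else "fall-through") for c in clusters}

print(dvliw_branch(["cl0", "cl1"], {"cl0": "BB2", "cl1": "BB2'"}, "cl0", x=150))
# -> {'cl0': 'BB2', 'cl1': "BB2'"}  (same logical block, different addresses)
```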

  13. Sleep Mode
      • Code distribution can leave some clusters with idle blocks
      • Put such a cluster into sleep mode (compiler managed)
      • Saves energy and reduces code size
      • Mode changes happen at block boundaries (see the sketch below)
      [Diagram: cluster 0 keeps branching through a region while cluster 1 executes SLEEP and later WAKE at block boundaries]
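A toy sketch of the idea, assuming a per-block stream representation; the SLEEP marker handling and the run_block helper are hypothetical, not the actual instruction encoding.

```python
# Sketch (assumed) of compiler-managed sleep mode: if a cluster has no
# useful work in a block, the compiler can emit a SLEEP marker instead
# of a stream of NOPs; the cluster idles until the compiler wakes it at
# a later block boundary, saving both energy and code size.
def run_block(name, stream):
    """Execute one block of a cluster's stream, honoring sleep mode."""
    if stream and stream[0] == "SLEEP":
        print(f"{name}: asleep for this block (no fetch, no decode)")
        return
    for op in stream:
        print(f"{name}: {op}")

# Cluster 1 is idle in this block, so the compiler emitted SLEEP for it;
# it wakes again only at the next block boundary.
run_block("cluster0", ["add r1, r2, r3", "br btr1, pr0"])
run_block("cluster1", ["SLEEP"])
```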

  14. Experimental Setup
      • Trimaran toolset
      • Processor configuration:
        • 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR unit per cluster
        • 16K L1 I-cache total
        • Perfect data cache assumed
      • Power model:
        • Verilog model for the instruction align/shift logic
        • Wire model
        • CACTI cache model
      • 21 benchmarks from MediaBench and SPECINT2000

  15. Change in Global Communication Bits
      [Chart: change in global communication bits for the MediaBench and SPECINT benchmarks]

  16. Normalized Energy Consumption on Control Path
      Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
      [Chart: normalized control path energy per benchmark, with annotated savings of 40%, 67%, 80%, and 21%]

  17. Normalized Code Size
      • Baseline: conventional VLIW with compressed encoding
      • Traditional method (single PC): 7x code size increase
      • DVLIW: 40% increase
      [Chart: normalized code size per benchmark]

  18. Result Summary
      • DVLIW benefits:
        • Order-of-magnitude reduction in global communication
        • 40% savings in control path energy
        • 5x code size reduction vs. simple distribution
      • Small overhead for ILP execution on a CMP:
        • 3% increase in execution cycles
        • 4% increase in I-cache stalls

  19. Conclusions
      • DVLIW removes the last centralized resource in a multicluster VLIW:
        • Fully distributed control path
        • Scalable architecture
        • More energy efficient
      • A stylized CMP architecture:
        • Exploits ILP
        • Multiple instruction streams
        • Compiler orchestrated

  20. Thank You
      • For more information: http://cccp.eecs.umich.edu
