
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures



Presentation Transcript


  1. Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen Muralimanohar Karthik Ramani Venkatanand Venkatachalapathy University of Utah

  2. Overview/Motivation • Wire delays are costly for performance and power • Latencies of 30 cycles to reach the ends of a chip • 50% of dynamic power is in interconnect switching (Magen et al., SLIP '04) • Abundant number of metal layers available

  3. Wire Characteristics • Wire resistance and capacitance per unit length are set by width and spacing • Increasing width and spacing lowers R and C, so delay falls (as delay ∝ RC) • The same area then holds fewer wires, so bandwidth falls
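
The delay/bandwidth trade-off on this slide can be sketched with a first-order distributed-RC model. The function and constants below are illustrative placeholders, not values from the talk:

```python
# Sketch of the first-order RC wire-delay trade-off: wider, more widely
# spaced wires have lower R and C (so lower delay) but fewer fit per area.

def wire_rc_delay(length_mm, width_rel, spacing_rel):
    """Relative Elmore-style delay of an unrepeated wire segment."""
    r0, c0 = 1.0, 1.0                      # baseline R and C per mm (arbitrary units)
    r = r0 / width_rel                     # wider wire -> lower resistance
    c = c0 / spacing_rel                   # wider spacing -> lower coupling capacitance
    return 0.5 * r * c * length_mm ** 2    # distributed-RC delay grows with length^2

base = wire_rc_delay(10, 1.0, 1.0)   # B-wire-like geometry
fast = wire_rc_delay(10, 2.0, 2.0)   # L-wire-like geometry: 2x width and spacing
assert fast < base                   # lower delay, but half as many wires fit
```

Doubling both width and spacing cuts this toy delay by 4x, at the cost of half the wire count in the same metal area, which is the L-wire trade-off the later slides exploit.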

  4. Design Space Exploration • Tuning wire width and spacing • [Figure: B wires at width/spacing d vs. L wires at 2d — doubling width and spacing lowers resistance and capacitance but reduces bandwidth]

  5. Transmission Lines • Allow extremely low delay • High implementation complexity and overhead! • Large width • Large spacing between wires • Design of sensing circuit • Shielding power and ground lines adjacent to each line • Implemented in test CMOS chips • Not employed in this study

  6. Design Space Exploration • Tuning repeater size and spacing • Traditional wires: large repeaters at delay-optimal spacing • Power-optimal wires: smaller repeaters with increased spacing trade delay for lower power • [Figure: power vs. delay curve for the two designs]
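
The repeater trade-off above can be sketched with a toy model. The cost terms below are illustrative stand-ins, not the talk's circuit equations:

```python
# Toy model of repeater insertion: n repeaters split the wire into segments.
# Segment RC delay grows quadratically with segment length; repeater drive
# delay falls with size, while repeater power grows with size and count.

def repeated_wire(total_len, n_repeaters, size):
    seg = total_len / n_repeaters
    delay = n_repeaters * (0.5 * seg * seg + 1.0 / size)  # segment RC + repeater delay
    power = n_repeaters * size                            # repeater power (arbitrary units)
    return delay, power

# Traditional wires: many large repeaters at optimum spacing (min delay).
delay_trad, power_trad = repeated_wire(10, 10, 4.0)
# Power-optimal wires: fewer, smaller repeaters with increased spacing.
delay_popt, power_popt = repeated_wire(10, 5, 1.0)

assert delay_popt > delay_trad and power_popt < power_trad
```

This reproduces the slide's point: backing off from the delay-optimal repeater design buys a large power saving for a modest delay penalty.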

  7. Design Space Exploration • Base case: B wires • Bandwidth optimized: W wires • Power optimized: P wires • Power and bandwidth optimized: PW wires • Fast, low bandwidth: L wires

  8. Outline • Overview • Wire Design Space Exploration • Employing L wires for Performance • PW wires: The Power Optimizers • Results • Conclusions

  9. Evaluation Platform • Centralized front-end • I-Cache & D-Cache • LSQ • Branch Predictor • Clustered back-end • [Figure: clusters connected to the L1 D Cache]

  10. Cache Pipeline • Base case: effective address transfer (10c), cache access (5c), memory dependence resolution (5c); data returns at cycle 20 • With L wires: an early 8-bit transfer (5c) enables partial memory dependence resolution (3c) and early cache indexing; data returns at cycle 14

  11. L wires: Accelerating cache access • Transmit the LSBs of the effective address through L wires • Faster memory disambiguation • Partial comparison of loads and stores in the LSQ • Introduces false dependences (< 9%) • Indexing data and tag RAM arrays • The LSBs can prefetch data out of the L1$ • Reduces access latency of loads
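
The partial-comparison idea above can be sketched as follows; the 8-bit width and names are illustrative, not the paper's exact parameters:

```python
# Sketch of partial memory disambiguation using only the address LSBs that
# arrive early on L wires. Differing LSBs guarantee different addresses, so
# the load can safely proceed; matching LSBs are conservatively treated as a
# dependence, which is occasionally false (the slide reports < 9%).

LSB_BITS = 8
MASK = (1 << LSB_BITS) - 1

def may_conflict(load_addr_lsb, store_addr_lsb):
    """Conservative early check on the low-order address bits only."""
    return load_addr_lsb == store_addr_lsb

# A store to 0x1234 and a load from 0x5678 differ in their low 8 bits:
assert not may_conflict(0x1234 & MASK, 0x5678 & MASK)
# A load from 0x9934 matches the store's LSBs -> conservatively treated as
# dependent, even though the full addresses differ (a false dependence).
assert may_conflict(0x1234 & MASK, 0x9934 & MASK)
```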

  12. L wires: Narrow Bit-Width Operands • PowerPC: data bit-width determines FU latency • Transfer of 10-bit integers on L wires • Can introduce scheduling difficulties • A predictor table of saturating counters • Accuracy of 98% • Reduction in branch mispredict penalty
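
A table of saturating counters like the one mentioned above might look as follows; the table size, counter width, and indexing are illustrative assumptions, not the paper's design:

```python
# Sketch of a table of 2-bit saturating counters that predicts whether an
# instruction's result is narrow enough (e.g. a 10-bit integer) to send on
# L wires. All parameters here are hypothetical.

class NarrowOperandPredictor:
    def __init__(self, entries=1024):
        self.table = [3] * entries          # 2-bit counters, start at "strongly narrow"

    def _index(self, pc):
        return pc % len(self.table)         # simple PC-indexed table

    def predict_narrow(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, was_narrow):
        i = self._index(pc)
        if was_narrow:
            self.table[i] = min(3, self.table[i] + 1)   # saturate high
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate low
```

A mispredicted "narrow" simply falls back to a full-width transfer on B wires, which is why high (but imperfect) accuracy such as the quoted 98% is usable.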

  13. Power Efficient Wires • Idea: steer non-critical data through the energy-efficient PW interconnect • [Figure: base case B wires vs. power- and bandwidth-optimized PW wires]

  14. PW wires: Power/Bandwidth Efficient Regfile • Ready register operands • Transfer of data at instruction dispatch • Transfer of input operands to the remote register file • Covered by the long dispatch-to-issue latency • Store data • Could stall the commit process • Delay dependent loads • [Figure: operand ready at cycle 90, consumer instruction dispatched at cycle 100 — the PW-wire latency is hidden by the dispatch-to-issue gap]
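
The steering decision above can be sketched as a slack check; the latencies match the crossbar numbers quoted later in the talk, but the heuristic itself is an illustrative simplification:

```python
# Sketch of steering non-critical operand transfers onto slow, low-power
# PW wires when their extra latency is hidden anyway.

B_LATENCY, PW_LATENCY = 2, 3   # crossbar latencies quoted on the 4-cluster slide

def choose_wire(ready_cycle, consumer_dispatch_cycle):
    """Use PW wires when the operand arrives before the consumer can issue."""
    slack = consumer_dispatch_cycle - ready_cycle
    return "PW" if slack >= PW_LATENCY else "B"

# Operand ready at cycle 90, consumer dispatched at cycle 100: the PW-wire
# latency is fully covered by the dispatch-to-issue gap.
assert choose_wire(90, 100) == "PW"
# A critical operand with little slack stays on the faster B wires.
assert choose_wire(98, 100) == "B"
```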

  15. Outline • Overview • Wire Design Space Exploration • Employing L wires for Performance • PW wires: The Power Optimizers • Results • Conclusions

  16. Evaluation Methodology • Simplescalar-3.0 augmented to simulate a dynamically scheduled 4-cluster model • Crossbar interconnects: B wires (2 cycles), L wires (1 cycle), PW wires (3 cycles)

  17. Heterogeneous Interconnects • Intercluster global Interconnect • 72 B wires (64 data bits and 8 control bits) • Repeaters sized and spaced for optimum delay • 18 L wires • Wide wires and large spacing • Occupies more area • Low latencies • 144 PW wires • Poor delay • High bandwidth • Low power

  18. Analytical Model • Wire capacitance per unit length: C = Ca + Ws·Cb + Cc/S, where (1) Ca is the fringing capacitance, (2) Ws·Cb is the capacitance to adjacent metal layers, scaling with wire width Ws, and (3) Cc/S is the coupling capacitance between adjacent wires, falling with spacing S • RC model of the wire • Total Power = Short-Circuit Power + Switching Power + Leakage Power
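
The capacitance and power decompositions on this slide can be written directly as code. The coefficient values below are illustrative placeholders, not the paper's extracted parameters:

```python
# Sketch of the slide's analytical model: per-unit-length wire capacitance as
# fringing + vertical (width-dependent) + coupling (spacing-dependent) terms,
# and total power as the sum of its three components.

def wire_capacitance(Ws, S, Ca=0.1, Cb=0.05, Cc=0.2):
    """C = Ca + Ws*Cb + Cc/S (per unit length, arbitrary units)."""
    return Ca + Ws * Cb + Cc / S

def total_power(short_circuit, switching, leakage):
    # Total Power = Short-Circuit Power + Switching Power + Leakage Power
    return short_circuit + switching + leakage

# Doubling the spacing shrinks only the coupling term, lowering total C:
assert wire_capacitance(1.0, 2.0) < wire_capacitance(1.0, 1.0)
```

This is the mechanism behind PW wires: wider spacing cuts the dominant coupling capacitance, reducing switching power at some cost in bandwidth.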

  19. Evaluation Methodology • Simplescalar-3.0 augmented to simulate a dynamically scheduled 16-cluster model • Ring latencies: B wires (4 cycles), PW wires (6 cycles), L wires (2 cycles) • [Figure: clusters with I-Cache, D-Cache, LSQ, crossbars, and ring interconnect]

  20. IPC Improvements: L wires • L wires improve performance by 4.2% on the four-cluster system and 7.1% on the sixteen-cluster system

  21. Four Cluster System: ED2 Improvements

  Link           Rel. metal area   IPC    Rel. processor energy (10%)   Rel. ED2 (10%)   Rel. ED2 (20%)
  144 B          1.0               0.95   100                           100              100
  288 PW         1.0               0.92   97                            103.4            100.2
  144 PW, 36 L   1.5               0.96   97                            95.0             92.1
  288 B          2.0               0.98   103                           96.6             99.2
  288 PW, 36 L   2.0               0.97   99                            94.4             93.2
  144 B, 36 L    2.0               0.99   101                           93.3             94.5

  22. Sixteen Cluster System: ED2 Gains

  Link           IPC    Rel. processor energy (20%)   Rel. ED2 (20%)
  144 B          1.11   100                           100
  144 PW, 36 L   1.05   94                            105.3
  288 B          1.18   105                           93.1
  144 B, 36 L    1.19   102                           88.7
  288 B, 36 L    1.22   107                           88.7

  23. Conclusions • Exposing the wire design space to the architecture • A case for micro-architectural wire management! • A low latency low bandwidth network alone helps improve performance by up to 7% • ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect • Entails hardware complexity

  24. Future work • 3-D wire model for the interconnects • Design of heterogeneous clusters • Interconnects for cache coherence and L2$

  25. Questions and Comments? Thank you!
