Prediction of High-Performance On-Chip Global Interconnection


Prediction of High-Performance On-Chip Global Interconnection. Yulei Zhang 1 , Xiang Hu 1 , Alina Deutsch 2 , A. Ege Engin 3 James F. Buckwalter 1 , and Chung-Kuan Cheng 1 1 Dept. of ECE, UC San Diego, La Jolla, CA 2 IBM T. J. Watson Research Center, Yorktown Heights, NY


Prediction of High-Performance On-Chip Global Interconnection

Yulei Zhang1, Xiang Hu1, Alina Deutsch2, A. Ege Engin3

James F. Buckwalter1, and Chung-Kuan Cheng1

1Dept. of ECE, UC San Diego, La Jolla, CA

2IBM T. J. Watson Research Center, Yorktown Heights, NY

3Dept. of ECE, San Diego State Univ., San Diego, CA

Outline
  • Introduction
    • Technology trend
    • Current approaches
  • On-Chip Global Interconnection
    • Overview: structures, tradeoffs
    • Interconnect schemes
    • Global wire modeling
    • Performance analysis
  • Design Methodologies for T-line schemes
  • Prediction of Performance Metrics
    • Experimental settings
    • Performance metrics comparison and scaling trend
      • Latency
      • Energy per bit
      • Throughput
  • Signal Integrity
  • Conclusion
Introduction – Performance Impact
  • Interconnect delay determines the system performance [ITRS08]
    • 542ps for 1mm minimum pitch Cu global wire w/o repeater @ 45nm
    • ~150ps for 10 level FO4 delay @ 45nm

[Ho2001] “Future of Wire”
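To put the two 45nm figures above on a common scale, the short sketch below expresses the unrepeated wire delay as a number of FO4 gate delays. The function and constant names are illustrative and do not come from the slides.

```python
# Figures quoted on the slide for the 45 nm node [ITRS08].
WIRE_DELAY_PS = 542.0      # 1 mm minimum-pitch Cu global wire, no repeaters
TEN_FO4_DELAY_PS = 150.0   # approximate delay of 10 FO4 stages


def fo4_equivalents(wire_delay_ps: float, ten_fo4_ps: float) -> float:
    """Express a wire delay as a number of single FO4 gate delays."""
    single_fo4_ps = ten_fo4_ps / 10.0
    return wire_delay_ps / single_fo4_ps


if __name__ == "__main__":
    n = fo4_equivalents(WIRE_DELAY_PS, TEN_FO4_DELAY_PS)
    print(f"1 mm of unrepeated global wire costs about {n:.1f} FO4 delays")
```

With these inputs, one millimetre of unrepeated wire costs roughly three and a half times the 10-level FO4 figure quoted on the slide.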

Introduction – Power Dissipation
  • Interconnects consume a significant portion of power
    • 1-2 orders of magnitude larger than that of gates
      • Half of the dynamic power is dissipated in the repeaters inserted to minimize latency [Zhang07]
    • Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
      • About 1/3 burned on the global wires.
Introduction – Different Approaches and Our Contributions
  • Different Approaches
    • Repeater Insertion Approach
      • Pros: High throughput density.
      • Cons: Overhead in terms of power consumption and wiring complexity.
    • T-line Approach [Zhang09]
      • Pros: Low latency.
      • Cons: Low throughput density due to low bandwidth and large wire dimensions.
    • Equalized T-line Approach [Zhang08]
      • Pros: Low power, Low noise, Higher throughput than single-ended.
      • Cons: The area overhead brought by passive components.
  • We explore different global interconnection structures and compare their performance metrics across multiple technology nodes.
  • Contributions:
    • A simple linear model
    • A general design framework
    • A complete prediction and comparison
Multi-Dimensional Design Consideration
  • Preliminary analysis results assuming 65nm CMOS process.
  • Application-oriented choice
    • Low Latency: T-TL or UT-TL (single-ended T-lines)
    • High Throughput: R-RC
    • Low Power: PE-TL or UE-TL (differential T-lines)
    • Low Noise: PE-TL or UE-TL (differential T-lines)
    • Low Area/Cost: R-RC

For each architecture, the more area the pentagon covers, the better overall performance is achieved.

On-Chip Global Interconnect Schemes (1)
  • R-RC structure
    • Repeater size/Length of segments
    • Adopt previous design methodology [Zhang07]
  • UT-TL structure
    • Full swing at wire-end
    • Tapered inverter chain as TX
  • T-TL structure
    • Optimize eye-height at wire-end
    • Non-Tapered inverter chain as TX

Repeated RC wires (R-RC)

Un-Terminated and Terminated T-Line (UT-TL and T-TL)

On-Chip Global Interconnect Schemes (2)

Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)

  • Driver side: Tapered differential driver
  • Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
  • Passive equalizer: parallel RC network
  • Design Constraint: enough eye-opening (50mV) needed at the wire-end
Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
  • Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
  • Top-layer thick wires used -> wire dimensions stay the same as technology scales.
  • LC-mode behavior dominant

Determine the bit rate

  • Smallest wire dimensions that satisfy eye constraint
  • Notice PE-TL needs narrower wire -> Equalization helps to increase density.
Global Wire Modeling – RC wires and T-lines
  • Distributed Π model composed of wire resistance and capacitance
  • Closed-form equations [Sim03] to calculate 2D wire capacitance
  • RC wire modeling
  • T-line 2D-R(f)L(f)C parameter extraction
  • T-line Modeling
    • R(f)L(f)C Tabular model -> Transient simulation to estimate eye-height.
    • Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issue.
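As an illustration of the distributed Π model for RC wires, the sketch below computes the Elmore delay at the far end of an unloaded wire split into n Π sections. The Elmore metric is a standard first-order estimate and is not necessarily the delay measure used by the authors; the function name and the example values are assumptions.

```python
def pi_ladder_elmore_delay(r_total: float, c_total: float, n_sections: int) -> float:
    """Elmore delay at the far end of an unloaded wire built from n Pi sections.

    Each section has series resistance r_total/n and places half of its
    capacitance, c_total/(2n), at each of its two ends.
    """
    r_seg = r_total / n_sections
    c_half = c_total / (2 * n_sections)
    # Node capacitances for nodes 0..n: the two end nodes carry one half-cap,
    # every internal node carries two (one from each neighbouring section).
    node_caps = [c_half] + [2 * c_half] * (n_sections - 1) + [c_half]
    delay = 0.0
    for i in range(1, n_sections + 1):
        downstream = sum(node_caps[i:])  # capacitance charged through resistor i
        delay += r_seg * downstream
    return delay


if __name__ == "__main__":
    # Hypothetical totals for illustration only (not values from the slides).
    r_total, c_total = 500.0, 1e-12  # ohms, farads
    for n in (1, 4, 32):
        print(n, pi_ladder_elmore_delay(r_total, c_total, n))
```

For this particular metric the Π ladder gives R·C/2 regardless of the number of sections, and doubling the wire length (which doubles both R and C) quadruples the delay, which is the usual motivation for repeater insertion.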

2D-C Extraction Template

2D-R(f)L(f) Extraction Template

Performance Analysis – Definitions
  • Normalized delay (unit: ps/mm)
    • Propagation delay includes wire delay and gate delay.
  • Normalized energy per bit (unit: pJ/m)
    • Bit rate is assumed to be the inverse of propagation delay for RC wires
  • Normalized throughput (unit: Gbps/um)
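The three metrics can be sketched as simple unit conversions. The slide gives only the units and the RC bit-rate assumption, so two points below are assumptions of this sketch: energy per bit is taken as power divided by bit rate, and throughput is normalized by wire pitch. All function names are illustrative.

```python
def normalized_delay_ps_per_mm(delay_ps: float, length_mm: float) -> float:
    """Propagation delay (wire + gate) per unit length, in ps/mm."""
    return delay_ps / length_mm


def rc_bit_rate_gbps(delay_ps: float) -> float:
    """Bit rate for RC wires, taken as the inverse of the propagation delay."""
    return 1e3 / delay_ps  # 1/ps = 1000 Gbps


def normalized_energy_pj_per_m(power_mw: float, bit_rate_gbps: float,
                               length_mm: float) -> float:
    """Energy per bit per unit length, in pJ/m (assumes energy = power / bit rate)."""
    energy_pj = power_mw / bit_rate_gbps  # mW / Gbps = pJ per bit
    return energy_pj / (length_mm * 1e-3)


def normalized_throughput_gbps_per_um(bit_rate_gbps: float,
                                      wire_pitch_um: float) -> float:
    """Bit rate per unit of wire pitch, in Gbps/um (pitch normalization assumed)."""
    return bit_rate_gbps / wire_pitch_um


if __name__ == "__main__":
    # Hypothetical link: 275 ps over 5 mm, 2 mW at 4 Gbps, 0.5 um pitch.
    print(normalized_delay_ps_per_mm(275.0, 5.0))
    print(normalized_energy_pj_per_m(2.0, 4.0, 5.0))
    print(normalized_throughput_gbps_per_um(4.0, 0.5))
```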
Performance Analysis – Latency
  • Variables: technology-defined parameters
    • Supply voltage: Vdd (unit: V)
    • Dielectric constant:
    • Min-sized inverter FO4 delay: (unit: ps)
  • R-RC structure (min-d)
    • is roughly constant
    • FO4 delay scales w/ scaling factor S
  • T-line structures
    • Sum of wire delay and TX delay
    • Wire delay
    • TX delay improved w/ FO4 delay

Decreasing w/ technology scaling!

Increasing w/ technology scaling!
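The slide's own expressions are not reproduced in this transcript, so the following is only one plausible reading of "sum of wire delay and TX delay" for T-lines: a time-of-flight term set by the dielectric constant plus a transceiver term proportional to the FO4 delay. The names `eps_r` and `k_tx` (number of FO4 delays spent in the transceiver) are assumptions, and losses in the line are ignored.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, mm/ps


def time_of_flight_ps_per_mm(eps_r: float) -> float:
    """Ideal LC-mode wire delay per mm for relative dielectric constant eps_r."""
    return math.sqrt(eps_r) / C_MM_PER_PS


def tline_normalized_delay(eps_r: float, fo4_ps: float,
                           length_mm: float, k_tx: float) -> float:
    """Wire delay plus TX delay, per mm. k_tx is a hypothetical fitting constant."""
    return time_of_flight_ps_per_mm(eps_r) + k_tx * fo4_ps / length_mm


if __name__ == "__main__":
    # eps_r = 2.25 is an assumed value chosen for illustration.
    print(f"time of flight: {time_of_flight_ps_per_mm(2.25):.2f} ps/mm")
    for fo4 in (20.0, 15.0, 10.0):  # hypothetical FO4 delays in ps
        print(fo4, round(tline_normalized_delay(2.25, fo4, 5.0, 3.0), 2))
```

Under these assumptions the wire term stays fixed while the TX term shrinks with the FO4 delay, so the total falls as technology scales.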

Performance Analysis – Energy per Bit
  • Same variables defined before

Constant !

  • R-RC structure (min-d)
    • Vdd reduces as technology scales
    • reduces as technology scales
  • T-line structures
    • Sum of power consumed on wire and TX.
    • Power of T-line
    • Power of TX circuit
    • FO4 delay reduces exponentially

Energy decreases w/ technology scaling!

Energy decreases w/ larger slope!!

Performance Analysis – Throughput
  • Same variables defined before
  • R-RC structure (min-d)
    • Assuming wire pitch
    • FO4 delay reduces exponentially
  • T-line structures
    • TX bandwidth
    • Neglect the minor change of wire pitch
    • K1 = 0, for UT-TL
    • FO4 delay reduces exponentially

Throughput increases by 20% per generation!

Throughput increases by 43% per generation!!
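The two per-generation rates compound quickly. The sketch below applies them over the four generations between the 90nm and 22nm nodes studied in the talk; the function name is illustrative. A 43% step is what results if the bit rate tracks 1/FO4 and the FO4 delay shrinks by about 0.7x per generation, though that 0.7x factor is an assumption of this note rather than a figure from the slides.

```python
NODES_NM = [90, 65, 45, 32, 22]  # technology nodes covered in the study


def relative_throughput(growth_per_generation: float, generations: int) -> float:
    """Throughput relative to the starting node after compounding growth."""
    return (1.0 + growth_per_generation) ** generations


if __name__ == "__main__":
    gens = len(NODES_NM) - 1
    print(f"20% per generation over {gens} generations: "
          f"{relative_throughput(0.20, gens):.2f}x")
    print(f"43% per generation over {gens} generations: "
          f"{relative_throughput(0.43, gens):.2f}x")
```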

Design Framework for On-Chip T-line Schemes
  • The proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing the wire configuration and circuit structure.
  • Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.
Experimental Settings
  • Design objective: min-d
  • Technology nodes: 90nm-22nm
  • Five different global interconnection structures
  • Wire length: 5mm
  • Parameter extraction
    • 2D field solver CZ2D from EIP tool suite of IBM
    • Tabular model or synthesized model
  • Transistor models
    • Predictive transistor model from [Uemura06]
    • Synopsys level 3 MOSFET model tuned according to ITRS roadmap
  • Simulation
    • HSPICE 2005
  • Modeling and Optimization
    • Linear or non-linear regression/SQP routine
    • MATLAB 2007
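The authors ran their regression in MATLAB. As a rough Python equivalent, the sketch below fits a straight line by ordinary least squares and reports the average percent error, the figure of merit quoted for the linear models later in the talk. The data points are made-up placeholders, not results from the study, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def average_percent_error(xs, ys, slope, intercept):
    """Mean of |predicted - actual| / |actual|, expressed in percent."""
    errors = [abs((slope * x + intercept) - y) / abs(y) for x, y in zip(xs, ys)]
    return 100.0 * sum(errors) / len(errors)


if __name__ == "__main__":
    # Placeholder samples: x is some technology-defined parameter,
    # y is some measured metric. Neither comes from the slides.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 7.8]
    slope, intercept = fit_line(xs, ys)
    print(slope, intercept, average_percent_error(xs, ys, slope, intercept))
```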
Performance Metric: Normalized Delay – Results and Comparison
  • Technology trends
    • R-RC ↑
    • T-line schemes ↓
  • T-line structures
    • Outperform R-RC beyond 90nm
    • Single-ended: lowest delay
  • At 22nm node
    • R-RC: 55ps/mm
    • T-lines: 8ps/mm (85% reduction)
    • Speed of light: 5ps/mm
  • Linear model
    • < 6% average percent error
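The 22nm delay figures above can be cross-checked with a few lines of arithmetic. The helper name is illustrative; the three numbers are the ones quoted on the slide.

```python
RRC_PS_PER_MM = 55.0     # R-RC at the 22 nm node
TLINE_PS_PER_MM = 8.0    # T-lines at the 22 nm node
LIGHT_PS_PER_MM = 5.0    # speed-of-light figure quoted on the slide


def percent_reduction(baseline: float, improved: float) -> float:
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    print(f"reduction vs R-RC: {percent_reduction(RRC_PS_PER_MM, TLINE_PS_PER_MM):.1f}%")
    print(f"T-line delay / speed-of-light delay: {TLINE_PS_PER_MM / LIGHT_PS_PER_MM:.1f}")
```

The reduction works out to about 85%, consistent with the slide, and the T-line figure is 1.6 times the quoted speed-of-light delay.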
Performance Metric: Normalized Energy per Bit – Results and Comparison
  • Technology trends
    • R-RC and T-lines ↓
    • T-lines reduce more quickly
  • T-line structures
    • Outperform R-RC beyond 45nm
    • Differential: lowest energy.
    • Single-ended similar to R-RC.
      • T-TL > UT-TL
  • At 22nm node
    • R-RC: 100pJ/m
    • Single-ended: 60% reduction
    • Differential: 96% reduction
  • Linear model
    • < 12% average percent error
    • Error for T-TL and PE-TL
      • RL and passive equalizers.
Performance Metric: Normalized Throughput – Results and Comparison
  • Technology trends
    • R-RC and T-lines ↑
    • T-lines increase more quickly
  • T-line structures
    • Outperform R-RC beyond 32nm
    • Differential better than single-ended
  • At 22nm node
    • R-RC: 12Gbps/um
    • T-TL: 30% improvement
    • UE-TL: 75% improvement
    • PE-TL: ~ 2X of R-RC
  • Linear model
    • < 7% average percent error
Signal Integrity – single-ended T-lines

Worst-case switching pattern for peak noise simulation

Using w.c. pattern

Using single or multiple PRBS patterns

  • UT-TL structure
    • 380mV peak noise at 1V supply voltage w/ 7ps rise time
    • SI could be a big issue as supply voltage drops
  • T-TL less sensitive to noise
    • At the same rise time, ~ 50% reduction of peak noise
    • Peak noise ↓ as technology scales
Signal Integrity – differential T-lines

Worst-case switching pattern for peak noise simulation

  • More reliable
    • Termination resistance
    • Common-mode noise reduction
  • Peak noise
    • Within ~10mV range
  • Eye-Heights
    • UE-TL
      • Eye reduces as bit rate ↑
      • Harder to meet constraint.
    • PE-TL
      • > 70mV eye even at 22nm node
      • Equalization does help!
Conclusion
  • Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
  • A simple linear model provided to link
    • Architecture-level performance metrics
    • Technology-defined parameters
  • Some observations from experimental results
    • T-line structures have potential to replace R-RC at future node
    • Differential T-lines are better than single-ended
      • Low-power/High-throughput/Low-noise
    • Equalization could be utilized for on-chip global interconnection
      • Higher throughput density, improve signal integrity
      • Even w/ lower energy dissipation (passive equalizations)
Introduction – Technology Trend

Scaling trend of PUL wire resistance and capacitance

Copper resistivity versus wire width

  • On-Chip Interconnect Scaling
    • Dimension shrinks
      • Wire resistance increases -> RC delay
      • Increasing capacitive coupling -> delay, power, noise, etc.
    • Performance of global wires decreases w/ technology scaling.
Design methodology: single-ended T-lines

2D frequency-dependent tabular model

Inverter size, number of stages, Rload (if any)

Single-ended; inverter chains

SPICE simulation

SPICE simulation to evaluate.

Optimization Routine:

1. Optimal cycle time

2. Sweep for optimal inverter chain

SPICE simulation to check in-plane crosstalk, etc.

Design methodology: differential T-lines

2D frequency-dependent tabular model

Wire width; driver impedance; RC equalizer (if any); termination resistance.

Differential lines; SA-based TX

Closed-form equation-based model

Evaluation based on models.

Optimization Routine:

1. Binary search for wire width

2. SQP for optimization of the other variables

SPICE simulation to check in-plane crosstalk, etc.
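The binary search over wire width can be sketched as follows. The sketch assumes the eye height grows monotonically with wire width and uses the 50mV eye-opening constraint stated earlier in the talk. The eye model here is a toy stand-in for the authors' closed-form models, and all names are hypothetical.

```python
def min_wire_width(eye_height_mv, lo_um, hi_um, eye_min_mv=50.0, tol_um=0.01):
    """Smallest width in [lo_um, hi_um] whose eye meets the constraint.

    Assumes eye_height_mv(width) is non-decreasing in width.
    Returns None if even the widest wire fails the constraint.
    """
    if eye_height_mv(hi_um) < eye_min_mv:
        return None
    if eye_height_mv(lo_um) >= eye_min_mv:
        return lo_um
    # Invariant: lo_um fails the constraint, hi_um satisfies it.
    while hi_um - lo_um > tol_um:
        mid = 0.5 * (lo_um + hi_um)
        if eye_height_mv(mid) >= eye_min_mv:
            hi_um = mid
        else:
            lo_um = mid
    return hi_um


def toy_eye_model(width_um):
    """Hypothetical eye height in mV; stands in for a real model evaluation."""
    return 20.0 * width_um


if __name__ == "__main__":
    print(min_wire_width(toy_eye_model, 0.5, 6.0))
```

In the flow described on the slide, the remaining variables (driver impedance, equalizer values, termination resistance) would be re-optimized by SQP inside each evaluation; the toy model skips that step.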

Effects of driver impedance and termination resistance
  • Lowering driver impedance improves eye
  • Eye reduces as frequency goes up
  • Optimal termination resistance.
Effects of driver impedance and termination resistance on step response

Optimal Rload

  • Larger driver impedance leads to slower rise edge and lower saturation voltage
  • Larger termination resistance causes sharper rise edge but with larger reflection
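Two textbook relations are consistent with these observations: the final receiver voltage is set by a resistive divider, and the reflection at the far end is set by the mismatch between the termination and the line impedance. The sketch below uses hypothetical resistor values and an assumed characteristic impedance `z0`. It does not model the rise-edge behaviour.

```python
def saturation_voltage(vdd: float, r_driver: float, r_wire: float,
                       r_term: float) -> float:
    """DC level at the receiver: divider formed by driver, wire and termination."""
    return vdd * r_term / (r_driver + r_wire + r_term)


def reflection_coefficient(r_term: float, z0: float) -> float:
    """Voltage reflection coefficient at a resistive termination."""
    return (r_term - z0) / (r_term + z0)


if __name__ == "__main__":
    # Hypothetical values in ohms, chosen only to show the trends.
    for r_driver in (10.0, 30.0):
        print("driver", r_driver, "->", saturation_voltage(1.0, r_driver, 40.0, 50.0))
    for r_term in (50.0, 150.0):
        print("termination", r_term, "->", reflection_coefficient(r_term, 50.0))
```

Raising the driver impedance lowers the settled voltage, and raising the termination above the line impedance increases the reflected fraction, which is why an intermediate termination value ends up optimal.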
Crosstalk effects
  • Three different PRBS input patterns, min-ddp solutions
  • T-line Scheme A: Delay increased by 9.6%, Power increased by 37%
  • T-line Scheme B: Delay increased by 2%, Power increased by 25.7%
Transceiver Design
  • Sense amplifier (SA)
    • Double-tail latch-type [Schinkel 07]
    • Optimize sizing to minimize SA delay
  • Inverter chain
    • Number of stages
      • Fixed to 6
    • Sizing of each inverter
      • RS: output resistance of inverter chain
      • Sweep the 1st inverter size to minimize the total transceiver delay for given [Veye, RS]

Double-tail latch-type voltage sense amp.

@45nm tech node:

M1/M3: 45nm/45nm

M2/M4: 250nm/45nm

M5/M6: 180nm/45nm

M7/M8: 280nm/45nm

M9: 495nm/45nm

M10/M11: 200nm/45nm

M12: 1.58um/45nm

Transceiver Modeling
  • Driver side
    • Voltage source Vs with output resistance Rs
    • Vs: full-swing pulse signal with rise time Tr=0.1Tc
    • Rs: output resistance of the last inverter in the chain.
  • Receiver side
    • Extract look-up table for TX delay and power
    • Fit the table using non-linear closed form formula
    • The relative error is within 2% for fitting models
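The driver-side source can be sketched as a piecewise-linear pulse whose rise time is one tenth of the cycle time, as stated above. The fall time is assumed equal to the rise time, which the slide does not specify, and the function names are illustrative.

```python
def cycle_time_ps(bit_rate_gbps: float) -> float:
    """Cycle time Tc in ps for a given bit rate in Gbps."""
    return 1e3 / bit_rate_gbps


def rise_time_ps(bit_rate_gbps: float, fraction: float = 0.1) -> float:
    """Rise time Tr = 0.1 * Tc by default."""
    return fraction * cycle_time_ps(bit_rate_gbps)


def pulse_voltage(t_ps: float, vdd: float, bit_rate_gbps: float) -> float:
    """Full-swing pulse for one isolated '1' bit starting at t = 0.

    Linear rise over Tr, flat at vdd until Tc, then a linear fall over Tr
    (symmetric fall is an assumption of this sketch).
    """
    tc = cycle_time_ps(bit_rate_gbps)
    tr = rise_time_ps(bit_rate_gbps)
    if t_ps <= 0.0 or t_ps >= tc + tr:
        return 0.0
    if t_ps < tr:
        return vdd * t_ps / tr
    if t_ps <= tc:
        return vdd
    return vdd * (tc + tr - t_ps) / tr


if __name__ == "__main__":
    # 50 Gbps is the bit rate quoted elsewhere in the talk.
    print(cycle_time_ps(50.0), rise_time_ps(50.0))
    print([round(pulse_voltage(t, 1.0, 50.0), 2) for t in (0, 1, 2, 10, 20, 21, 22)])
```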

Histogram of fitting errors at 45nm node

Transceiver delay map at 45nm node

Transceiver power map at 45nm node


Bit-rate: 50Gbps

Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm
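To get a feel for these component values, the sketch below treats Rd and Cd as a parallel RC equalizer placed in series with the signal path and RL as the termination, and ignores both the line itself and the driver resistance Rs. That topology and the reading of the symbols are assumptions; the talk only states that the passive equalizer is a parallel RC network.

```python
import math

RD_OHM = 350.0      # assumed to be the equalizer resistance
CD_F = 0.38e-12     # assumed to be the equalizer capacitance
RL_OHM = 107.69     # assumed to be the termination resistance


def equalizer_dc_gain(rd: float, rl: float) -> float:
    """Low-frequency gain of series (Rd || Cd) driving RL."""
    return rl / (rd + rl)


def equalizer_zero_hz(rd: float, cd: float) -> float:
    """Frequency of the zero, 1 / (2*pi*Rd*Cd)."""
    return 1.0 / (2.0 * math.pi * rd * cd)


def equalizer_pole_hz(rd: float, cd: float, rl: float) -> float:
    """Frequency of the pole, 1 / (2*pi*(Rd || RL)*Cd)."""
    r_parallel = rd * rl / (rd + rl)
    return 1.0 / (2.0 * math.pi * r_parallel * cd)


def equalizer_gain(freq_hz: float, rd: float, cd: float, rl: float) -> float:
    """Magnitude of H(jw) = RL / (RL + Rd / (1 + jw*Rd*Cd))."""
    w = 2.0 * math.pi * freq_hz
    z_eq = rd / complex(1.0, w * rd * cd)
    return abs(rl / (rl + z_eq))


if __name__ == "__main__":
    print("DC gain:", equalizer_dc_gain(RD_OHM, RL_OHM))
    print("zero (Hz):", equalizer_zero_hz(RD_OHM, CD_F))
    print("pole (Hz):", equalizer_pole_hz(RD_OHM, CD_F, RL_OHM))
    print("gain at 25 GHz:", equalizer_gain(25e9, RD_OHM, CD_F, RL_OHM))
```

Under these assumptions the network attenuates low frequencies and passes high frequencies, which is the high-pass shaping that offsets the low-pass loss of the wire.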

Conclusion (cont'd)

Four score tables, each listing the schemes against tech node: Low-Latency Application (ps/mm), Low-Energy Application (pJ/m), High-Throughput Application (Gbps/um), and Low-Noise Application.

Item in the table: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. The best structure in each column is marked in red.

Future Works
  • Explore novel global signaling schemes for high throughput and low energy dissipation.
    • Design, optimize > 50Gbps on-chip interconnection schemes
    • Architecture-level study to identify trade-offs
      • Wire configuration
        • Dimension optimization, ground plane, etc.
      • Un-interrupted architectures
        • Equalization implementation, TX/RX choice
      • Distributed architectures
        • Active or Passive compensation (RC equalizers, other networks, etc)
    • Novel high-speed transceiver circuitry design
    • Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
    • Fabrication to verify analysis and demonstrate feasibility
Related Publications

[Repeated RC Wire]

  • L. Zhang, H. Chen, B. Yao, K. Hamilton, and C. K. Cheng, “Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals,” IEEE International Symposium on Quality Electronic Design, 2007, pp. 251-256.

[Un-Terminated/Terminated T-Line]

  • Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line,” IEEE International Symposium on Quality Electronic Design, 2009, pp. 451-458.

[Passive-Equalized T-Line]

  • Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C. K. Cheng, “On-chip high performance signaling using passive compensation,” IEEE International Conference on Computer Design, 2008, pp. 182-187.
  • Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “On-chip bus signaling using passive compensation,” IEEE Electrical Performance of Electronic Packaging, 2008, pp. 33-36.
  • L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C. K. Cheng, “High performance on-chip differential signaling using passive compensation for global communication,” Asia and South Pacific Design Automation Conference, 2009, pp. 385-390.

[Overview and Comparison]

  • Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C. K. Cheng, “Prediction of High-Performance On-Chip Global Interconnection,” ACM Workshop on System Level Interconnection Prediction, 2009.
