Prediction of High-Performance On-Chip Global Interconnection


Yulei Zhang¹, Xiang Hu¹, Alina Deutsch², A. Ege Engin³, James F. Buckwalter¹, and Chung-Kuan Cheng¹

¹ Dept. of ECE, UC San Diego, La Jolla, CA

² IBM T. J. Watson Research Center, Yorktown Heights, NY

³ Dept. of ECE, San Diego State Univ., San Diego, CA

- Introduction
- Technology trend
- Current approaches

- On-Chip Global Interconnection
- Overview: structures, tradeoffs
- Interconnect schemes
- Global wire modeling
- Performance analysis

- Design Methodologies for T-line schemes
- Prediction of Performance Metrics
- Experimental settings
- Performance metrics comparison and scaling trend
- Latency
- Energy per bit
- Throughput

- Signal Integrity
- Conclusion

- Interconnect delay determines the system performance [ITRS08]
- 542ps for 1mm minimum pitch Cu global wire w/o repeater @ 45nm
- ~150ps for 10 levels of FO4 delay @ 45nm

[Ho2001] “Future of Wire”

- Interconnects consume a significant portion of power
- 1-2 orders of magnitude larger than gates
- Half of the dynamic power dissipated on repeaters to minimize latency [Zhang07]

- Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
- About 1/3 burned on the global wires.


- Different Approaches
- Repeater Insertion Approach
- Pros: High throughput density.
- Cons: Overhead in terms of power consumption and wiring complexity.

- T-line Approach [Zhang09]
- Pros: Low latency.
- Cons: Low throughput density due to low bandwidth and large wire dimensions.

- Equalized T-line Approach [Zhang08]
- Pros: Low power, Low noise, Higher throughput than single-ended.
- Cons: The area overhead brought by passive components.

- We explore different global interconnection structures and compare their performance metrics across multiple technology nodes.
- Contributions:
- A simple linear model
- A general design framework
- A complete prediction and comparison

- Preliminary analysis results assuming 65nm CMOS process.
- Application-oriented choice
- Low Latency: T-TL or UT-TL (single-ended T-lines)
- High Throughput: R-RC
- Low Power: PE-TL or UE-TL (differential T-lines)
- Low Noise: PE-TL or UE-TL (differential T-lines)
- Low Area/Cost: R-RC

For each architecture, the more area the pentagon covers, the better overall performance is achieved.

- R-RC structure
- Repeater size/Length of segments
- Adopt previous design methodology [Zhang07]

- UT-TL structure
- Full swing at wire-end
- Tapered inverter chain as TX

- T-TL structure
- Optimize eye-height at wire-end
- Non-Tapered inverter chain as TX

Repeated RC wires (R-RC)

Un-Terminated and Terminated T-Line (UT-TL and T-TL)

Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)

- Driver side: Tapered differential driver
- Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
- Passive equalizer: parallel RC network
- Design Constraint: enough eye-opening (50mV) needed at the wire-end

- Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
- Top-layer thick wires used -> dimensions remain fixed as technology scales.
- LC-mode behavior dominant

Determine the bit rate

- Smallest wire dimensions that satisfy eye constraint
- Notice PE-TL needs narrower wire -> Equalization helps to increase density.

- RC wire modeling
- Distributed Π model composed of wire resistance and capacitance
- Closed-form equations [Sim03] to calculate 2D wire capacitance

- T-line 2D-R(f)L(f)C parameter extraction
- T-line modeling
- R(f)L(f)C tabular model -> Transient simulation to estimate eye-height.
- Synthesized compact circuit model [Kopcsay02] -> Study signal integrity issues.
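The distributed Π model for RC wires can be illustrated with a short Elmore-delay calculation. This is only a sketch: the function name, the total resistance and capacitance values, and the optional driver and load terms are illustrative assumptions, not values extracted in the slides.

```python
def elmore_delay_pi_ladder(r_total, c_total, n_segments, r_driver=0.0, c_load=0.0):
    """Elmore delay (seconds) at the far end of a wire built from Π segments."""
    r_seg = r_total / n_segments
    c_seg = c_total / n_segments
    # Each Π segment places c_seg/2 at both ends, so interior nodes carry c_seg.
    node_caps = [c_seg / 2] + [c_seg] * (n_segments - 1) + [c_seg / 2 + c_load]
    delay = 0.0
    upstream_r = r_driver  # resistance between the source and the current node
    for index, cap in enumerate(node_caps):
        if index > 0:
            upstream_r += r_seg
        delay += upstream_r * cap
    return delay


# Placeholder totals for one wire (illustrative, not extracted values).
R_TOTAL_OHM = 500.0
C_TOTAL_F = 1.0e-12

for n in (1, 4, 16):
    d = elmore_delay_pi_ladder(R_TOTAL_OHM, C_TOTAL_F, n)
    print(f"{n:2d} segments: {d * 1e12:.1f} ps")
```

With no driver resistance or load, the Elmore delay of this ladder equals R·C/2 for any segment count. A transient simulation, by contrast, needs enough segments to reproduce the waveform shape.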

2D-C Extraction Template

2D-R(f)L(f) Extraction Template

- Normalized delay (unit: ps/mm)
- Propagation delay includes wire delay and gate delay.

- Normalized energy per bit (unit: pJ/m)
- Bit rate is assumed to be the inverse of propagation delay for RC wires

- Normalized throughput (unit: Gbps/um)
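As a rough illustration of how the three normalized metrics relate, the sketch below computes them from raw quantities. The slides give only the units, so the exact definitions used here (energy per bit = power / bit rate, throughput = bit rate / wire pitch) are assumptions, and the example inputs are hypothetical.

```python
def normalized_metrics(delay_ps, power_mw, length_mm, pitch_um, bit_rate_gbps=None):
    """Return (delay in ps/mm, energy per bit in pJ/m, throughput in Gbps/um)."""
    if bit_rate_gbps is None:
        # RC-wire assumption from the slide: bit rate = 1 / propagation delay.
        bit_rate_gbps = 1000.0 / delay_ps  # 1/ps equals 1000 Gbps
    delay_ps_per_mm = delay_ps / length_mm
    energy_pj_per_bit = power_mw / bit_rate_gbps  # mW / Gbps = pJ per bit
    energy_pj_per_m = energy_pj_per_bit / (length_mm * 1e-3)
    throughput_gbps_per_um = bit_rate_gbps / pitch_um
    return delay_ps_per_mm, energy_pj_per_m, throughput_gbps_per_um


# Hypothetical inputs for a 5mm repeated RC wire.
delay, energy, throughput = normalized_metrics(
    delay_ps=250.0, power_mw=2.0, length_mm=5.0, pitch_um=0.4
)
print(f"{delay:.1f} ps/mm, {energy:.1f} pJ/m, {throughput:.1f} Gbps/um")
```

For T-line schemes the bit rate would be passed in explicitly, since it is set by the eye constraint rather than by the propagation delay.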

- Variables: technology-defined parameters
- Supply voltage: Vdd (unit: V)
- Dielectric constant:
- Min-sized inverter FO4 delay: (unit: ps)

- R-RC structure (min-d)
- is roughly constant
- FO4 delay scales w/ scaling factor S

- T-line structures
- Sum of wire delay and TX delay
- Wire delay
- TX delay improved w/ FO4 delay

Decreasing w/ technology scaling!

Increasing w/ technology scaling!

- Same variables defined before

Constant !

- R-RC structure (min-d)
- Vdd reduces as technology scales
- reduces as technology scales

- T-line structures
- Sum of power consumed on wire and TX.
- Power of T-line
- Power of TX circuit
- FO4 delay reduces exponentially

Energy decreases w/ technology scaling!

Energy decreases w/ larger slope!!

- Same variables defined before

- R-RC structure (min-d)
- Assuming wire pitch
- FO4 delay reduces exponentially

- T-line structures
- TX bandwidth
- Neglect the minor change of wire pitch
- K1 = 0, for UT-TL
- FO4 delay reduces exponentially

Throughput increases by 20% per generation!

Throughput increases by 43% per generation!!
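To see what these per-generation rates imply over the 90nm-22nm range studied here, the sketch below compounds them over four generations. The node list is taken from the experimental settings. The comparison with a scaling factor S = 0.7 is an assumption for illustration, since the slides do not state the value of S.

```python
import math

NODES_NM = [90, 65, 45, 32, 22]
GENERATIONS = len(NODES_NM) - 1  # four node-to-node steps


def cumulative_gain(rate_per_generation, generations):
    """Compound a fractional per-generation increase over several generations."""
    return (1.0 + rate_per_generation) ** generations


slow_gain = cumulative_gain(0.20, GENERATIONS)  # about 2.07x
fast_gain = cumulative_gain(0.43, GENERATIONS)  # about 4.18x
print(f"20% per generation over {GENERATIONS} generations: {slow_gain:.2f}x")
print(f"43% per generation over {GENERATIONS} generations: {fast_gain:.2f}x")

# Assumed scaling factor (not given in the slides): S = 0.7 per generation.
S = 0.7
print(f"1/sqrt(S) = {1 / math.sqrt(S):.3f}, 1/S = {1 / S:.3f}")
```

Under that assumed S, 1/sqrt(S) and 1/S come out close to the 20% and 43% rates, which would be consistent with throughput tracking the FO4 delay differently in the two cases.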

- Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure.
- Different optimization routines (LP, ILP, SQP, etc.) can be adopted according to the problem formulation.

- Design objective: min-d
- Technology nodes: 90nm-22nm
- Five different global interconnection structures
- Wire length: 5mm
- Parameter extraction
- 2D field solver CZ2D from EIP tool suite of IBM
- Tabular model or synthesized model

- Transistor models
- Predictive transistor model from [Uemura06]
- Synopsys level 3 MOSFET model tuned according to ITRS roadmap

- Simulation
- HSPICE 2005

- Modeling and Optimization
- Linear or non-linear regression/SQP routine
- MATLAB 2007
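The linear-regression step and the "average percent error" figure quoted for the linear models can be sketched as follows. The authors used MATLAB; this Python version is a stand-in, the function names are hypothetical, and the data points are made-up placeholders rather than results from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def average_percent_error(xs, ys, slope, intercept):
    """Mean of |predicted - actual| / |actual|, in percent."""
    errors = [abs((slope * x + intercept - y) / y) for x, y in zip(xs, ys)]
    return 100.0 * sum(errors) / len(errors)


# Placeholder data: a technology parameter (x) against a simulated metric (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
error = average_percent_error(xs, ys, slope, intercept)
print(f"y = {slope:.3f} * x + {intercept:.3f}, average error {error:.2f}%")
```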

- Technology trends
- R-RC ↑
- T-line schemes ↓

- T-line structures
- Outperform R-RC beyond 90nm
- Single-ended: lowest delay

- At 22nm node
- R-RC: 55ps/mm
- T-lines: 8ps/mm (85% reduction)
- Speed of light: 5ps/mm

- Linear model
- < 6% average percent error
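The 22nm latency figures above can be cross-checked with a few lines of arithmetic. The sketch also derives the relative permittivity implied by the quoted 5ps/mm speed-of-light limit. That last step assumes TEM propagation in a uniform dielectric, which is an interpretation rather than something stated in the slides.

```python
C0_M_PER_S = 299_792_458.0  # speed of light in vacuum

R_RC_PS_PER_MM = 55.0
T_LINE_PS_PER_MM = 8.0
LIGHT_PS_PER_MM = 5.0
WIRE_LENGTH_MM = 5.0  # wire length from the experimental settings

reduction = 1.0 - T_LINE_PS_PER_MM / R_RC_PS_PER_MM
print(f"T-line latency reduction vs R-RC: {reduction * 100:.0f}%")

for name, per_mm in [("R-RC", R_RC_PS_PER_MM), ("T-line", T_LINE_PS_PER_MM),
                     ("speed-of-light limit", LIGHT_PS_PER_MM)]:
    print(f"{name}: {per_mm * WIRE_LENGTH_MM:.0f} ps over {WIRE_LENGTH_MM:.0f} mm")

# Vacuum delay per mm, then the permittivity that stretches it to 5 ps/mm.
vacuum_ps_per_mm = 1e12 / (C0_M_PER_S * 1e3)
implied_eps_r = (LIGHT_PS_PER_MM / vacuum_ps_per_mm) ** 2
print(f"Implied relative permittivity: {implied_eps_r:.2f}")
```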

- Technology trends
- R-RC and T-lines ↓
- T-lines reduce more quickly

- T-line structures
- Outperform R-RC beyond 45nm
- Differential: lowest energy.
- Single-ended similar to R-RC.
- T-TL > UT-TL

- At 22nm node
- R-RC: 100pJ/m
- Single-ended: 60% reduction
- Differential: 96% reduction

- Linear model
- < 12% average percent error
- Error for T-TL and PE-TL
- RL and passive equalizers.
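For a sense of scale, the sketch below converts the 22nm energy figures above from pJ/m into energy per bit over the 5mm wire used in the experiments. The dictionary labels are illustrative, and the percentages are applied exactly as quoted.

```python
WIRE_LENGTH_M = 5e-3    # 5mm wire from the experimental settings
R_RC_PJ_PER_M = 100.0   # R-RC at the 22nm node

# Fractional energy reduction relative to R-RC, as quoted on the slide.
REDUCTION = {
    "R-RC": 0.0,
    "Single-ended T-line": 0.60,
    "Differential T-line": 0.96,
}

energy_fj_per_bit = {}
for name, fraction in REDUCTION.items():
    pj_per_m = R_RC_PJ_PER_M * (1.0 - fraction)
    energy_fj_per_bit[name] = pj_per_m * WIRE_LENGTH_M * 1e3  # pJ -> fJ
    print(f"{name}: {pj_per_m:.0f} pJ/m, {energy_fj_per_bit[name]:.0f} fJ/bit over 5mm")
```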

- Technology trends
- R-RC and T-lines ↑
- T-lines increase more quickly

- T-line structures
- Outperform R-RC beyond 32nm
- Differential better than single-ended

- At 22nm node
- R-RC: 12Gbps/um
- T-TL: 30% improvement
- UE-TL: 75% improvement
- PE-TL: ~ 2X of R-RC

- Linear model
- < 7% average percent error
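The relative throughput improvements at 22nm translate into absolute densities as sketched below. Treating "~2X" for PE-TL as exactly +100% is an approximation made here for the arithmetic.

```python
R_RC_GBPS_PER_UM = 12.0  # R-RC at the 22nm node

# Fractional improvement over R-RC; PE-TL "~2X" is taken as +100% (approximation).
IMPROVEMENT = {"T-TL": 0.30, "UE-TL": 0.75, "PE-TL": 1.00}

throughput = {"R-RC": R_RC_GBPS_PER_UM}
for name, gain in IMPROVEMENT.items():
    throughput[name] = R_RC_GBPS_PER_UM * (1.0 + gain)

for name, value in throughput.items():
    print(f"{name}: {value:.1f} Gbps/um")
```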

Worst-case switching pattern for peak noise simulation

Using worst-case pattern

Using single or multiple PRBS patterns

- UT-TL structure
- 380mV peak noise at 1V supply voltage w/ 7ps rise time
- SI could be a big issue as supply voltage drops

- T-TL less sensitive to noise
- At the same rise time, ~ 50% reduction of peak noise
- Peak noise ↓ as technology scales

Worst-case switching pattern for peak noise simulation

- More reliable
- Termination resistance
- Common-mode noise reduction

- Peak noise
- Within ~10mV range

- Eye-Heights
- UE-TL
- Eye reduces as bit rate ↑
- Harder to meet constraint.

- PE-TL
- > 70mV eye even at 22nm node
- Equalization does help!


- Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
- A simple linear model provided to link
- Architecture-level performance metrics
- Technology-defined parameters

- Some observations from experimental results
- T-line structures have potential to replace R-RC at future nodes
- Differential T-lines are better than single-ended
- Low-power/High-throughput/Low-noise

- Equalization could be utilized for on-chip global interconnection
- Higher throughput density, improved signal integrity
- Even w/ lower energy dissipation (passive equalization)

Thank you!

Q & A

Back Up Slides

Scaling trend of PUL wire resistance and capacitance

Copper resistivity versus wire width

- On-Chip Interconnect Scaling
- Dimension shrinks
- Wire resistance increases -> RC delay
- Increasing capacitive coupling -> delay, power, noise, etc.

- Performance of global wires decreases w/ technology scaling.


Design flow for single-ended T-lines (figure):

- Wire model: 2D frequency-dependent tabular model
- Circuit: single-ended, inverter chains
- Design variables: inverter size, number of stages, Rload (if any)
- Evaluation: SPICE simulation
- Optimization routine: 1. Optimal cycle time; 2. Sweep for optimal inverter chain
- Final check: SPICE simulation to check in-plane crosstalk, etc.

Design flow for differential T-lines (figure):

- Wire model: 2D frequency-dependent tabular model
- Circuit: differential lines, SA-based TX, closed-form equation-based model
- Design variables: wire width, driver impedance, RC equalizer (if any), termination resistance
- Evaluation: based on models
- Optimization routine: 1. Binary search for wire width; 2. SQP for other variable optimization
- Final check: SPICE simulation to check in-plane crosstalk, etc.
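The binary search for wire width in the optimization routine above can be sketched as follows. It looks for the narrowest wire whose eye height meets the 50mV constraint, assuming eye height grows monotonically with width. The `toy_eye_height_mv` function is a made-up placeholder standing in for the model-based evaluation, and all names and bounds are hypothetical.

```python
import math


def min_width_for_eye(eye_height_mv, lo_um, hi_um, target_mv=50.0, tol_um=0.01):
    """Smallest width in [lo_um, hi_um] meeting the eye target, or None."""
    if eye_height_mv(hi_um) < target_mv:
        return None  # even the widest wire fails the constraint
    if eye_height_mv(lo_um) >= target_mv:
        return lo_um  # the narrowest wire already passes
    while hi_um - lo_um > tol_um:
        mid_um = 0.5 * (lo_um + hi_um)
        if eye_height_mv(mid_um) >= target_mv:
            hi_um = mid_um  # constraint met, try narrower
        else:
            lo_um = mid_um  # constraint violated, go wider
    return hi_um


def toy_eye_height_mv(width_um):
    """Placeholder eye-height model (assumption): saturating growth with width."""
    return 120.0 * (1.0 - math.exp(-width_um / 1.5))


width = min_width_for_eye(toy_eye_height_mv, lo_um=0.2, hi_um=4.0)
print(f"Minimum width meeting 50mV eye: {width:.2f} um")
```

In the flow described here, each evaluation inside the loop would also re-optimize the remaining variables (driver impedance, equalizer, termination) for the candidate width.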

- Lowering driver impedance improves eye
- Eye reduces as frequency goes up
- Optimal termination resistance.

Optimal Rload

- Larger driver impedance leads to slower rise edge and lower saturation voltage
- Larger termination resistance causes sharper rise edge but with larger reflection

- Three different PRBS input patterns, min-ddp solutions
- T-line Scheme A: Delay increased by 9.6%, Power increased by 37%
- T-line Scheme B: Delay increased by 2%, Power increased by 25.7%
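The slides do not say which PRBS polynomials were used for these patterns. As one common example, the sketch below generates a PRBS-7 sequence (polynomial x^7 + x^6 + 1) with a linear-feedback shift register; the choice of PRBS-7 and the function name are assumptions.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS-7 sequence (x^7 + x^6 + 1) as a list of 0/1."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, or the LFSR stays stuck at zero")
    bits = []
    for _ in range(n_bits):
        # Feedback is the XOR of the two highest register bits (taps 7 and 6).
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits


pattern = prbs7(254)
print("First 16 bits:", "".join(str(b) for b in pattern[:16]))
print("Repeats after 127 bits:", pattern[:127] == pattern[127:])
```

A maximal-length 7-bit sequence repeats every 127 bits and contains 64 ones per period, which gives a quick sanity check on the generator.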

- Sense amplifier (SA)
- Double-tail latch-type [Schinkel 07]
- Optimize sizing to minimize SA delay

- Inverter chain
- Number of stages
- Fixed to 6

- Sizing of each inverter
- RS: output resistance of inverter chain
- Sweep the 1st inverter size to minimize the total transceiver delay for given [Veye, RS]


Double-tail latch-type voltage sense amp.

@45nm tech node:

M1/M3: 45nm/45nm

M2/M4: 250nm/45nm

M5/M6: 180nm/45nm

M7/M8: 280nm/45nm

M9: 495nm/45nm

M10/M11: 200nm/45nm

M12: 1.58um/45nm

- Driver side
- Voltage source Vs with output resistance Rs
- Vs: full-swing pulse signal with rise time Tr=0.1Tc
- Rs: output resistance of the last inverter in the chain.

- Receiver side
- Extract look-up table for TX delay and power
- Fit the table using non-linear closed form formula
- The relative error is within 2% for fitting models

Histogram of fitting errors at 45nm node

Transceiver delay map at 45nm node

Transceiver power map at 45nm node

Bit-rate: 50Gbps

Rs=11.06ohm, Rd=350ohm, Cd=0.38pF,

RL=107.69ohm
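To get a feel for these component values, the sketch below evaluates a lumped voltage divider: driver resistance Rs in series with the parallel Rd-Cd equalizer, feeding the termination RL. The series placement of the equalizer and the omission of the transmission line itself are simplifying assumptions, so this shows only the equalizer's high-pass shaping, not the channel response.

```python
import math

RS_OHM = 11.06
RD_OHM = 350.0
CD_F = 0.38e-12
RL_OHM = 107.69
BIT_RATE_BPS = 50e9


def equalizer_gain(freq_hz):
    """|Vout/Vs| of the assumed lumped divider (no wire) at freq_hz."""
    z_eq = RD_OHM / (1 + 2j * math.pi * freq_hz * RD_OHM * CD_F)  # Rd parallel Cd
    return abs(RL_OHM / (RS_OHM + z_eq + RL_OHM))


dc_gain = equalizer_gain(0.0)
nyquist_gain = equalizer_gain(BIT_RATE_BPS / 2)
zero_hz = 1.0 / (2 * math.pi * RD_OHM * CD_F)

print(f"DC gain: {dc_gain:.3f}")
print(f"Gain at Nyquist ({BIT_RATE_BPS / 2e9:.0f} GHz): {nyquist_gain:.3f}")
print(f"Equalizer zero: {zero_hz / 1e9:.2f} GHz")
```

The low DC gain and near-unity high-frequency gain are what let a passive network counteract the frequency-dependent loss of the wire, at the cost of reduced signal swing.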

Tables (contents not preserved): scores by tech node and scheme for four applications: Low-Latency (ps/mm), Low-Energy (pJ/m), High-Throughput (Gbps/um), and Low-Noise.

Each table item is score/value. A higher score is better for the given metric; the maximum score is 5. The best structure in each column is marked in red.

- Explore novel global signaling schemes for high throughput and low energy dissipation.
- Design, optimize > 50Gbps on-chip interconnection schemes
- Architecture-level study to identify trade-offs
- Wire configuration
- Dimension optimization, ground plane, etc.

- Un-interrupted architectures
- Equalization implementation, TX/RX choice

- Distributed architectures
- Active or Passive compensation (RC equalizers, other networks, etc)

- Novel high-speed transceiver circuitry design
- Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
- Fabrication to verify analysis and demonstrate feasibility

[Repeated RC Wire]

- L. Zhang, H. Chen, B. Yao, K. Hamilton, and C.K. Cheng, “Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals,” IEEE International Symposium on Quality Electronic Design, 2007, pp. 251-256.

[Un-Terminated/Terminated T-Line]

- Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C.K. Cheng, “Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line,” IEEE International Symposium on Quality Electronic Design, 2009, pp. 451-458.

[Passive-Equalized T-Line]

- Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C.K. Cheng, “On-chip high performance signaling using passive compensation,” IEEE International Conference on Computer Design, 2008, pp. 182-187.
- Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C.K. Cheng, “On-chip bus signaling using passive compensation,” IEEE Electrical Performance of Electronic Packaging, 2008, pp. 33-36.
- L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C.K. Cheng, “High performance on-chip differential signaling using passive compensation for global communication,” Asia and South Pacific Design Automation Conference, 2009, pp. 385-390.

[Overview and Comparison]

- Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C.K. Cheng, “Prediction of High-Performance On-Chip Global Interconnection,” ACM Workshop on System Level Interconnection Prediction, 2009.