
Clocking links in multi-chip packages: a case study






Presentation Transcript


  1. Clocking links in multi-chip packages: a case study. Tamer Ali¹, D. Patil, F. Liu, E. Alon², J. Lexau, C.-K. K. Yang¹, R. Ho. Sun Labs, Oracle. ¹Also with UCLA, ²Also with UC-Berkeley

  2. A disclaimer • This is: • Not a talk about upcoming Oracle products • About research done at Sun Labs and academia • ~100 people in Oracle working on advanced research • Looking at hardware, software, OS, languages, DB… • Working in conjunction with UCLA, UC-Berkeley • Providing a technology toolbox for product groups • This work was supported in part by DARPA under Agreement HR0011-08-09-0001. The contents of this presentation should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

  3. It’s all about the wires: from chips to data centers • Markets calling for ever-increasing performance • Databases: transactions to data warehousing analytics • “Customers who bought this also purchased…” • High-performance and/or ubiquitous computing • System designers responding with integration • Lots of cores on an individual chip • Lots of socketed chips on a single board • Lots of boards in a single chassis • Lots of chassis in a data center • Of course, they all have to communicate!

  4. Multicore CPUs need on-chip networks: to pass messages for cache sharing and coherence, DMAs • Current examples range from crossbars to meshes • On-chip wires • Bandwidth is relatively free: no need for overclocking • Parallel wires with global synchrony or simple asynchrony (Images: www.oracle.com, www.amd.com, www.intel.com, www.nvidia.com)

  5. Boards and chassis need traces and backplanes: here the challenge is distance, bit rate, and power • Impairments: skin and dielectric loss, reflections, crosstalk… • Traces on boards and backplanes • Minimize chip connectors (solder balls) – overclock them • Use parallel clocking (source-synchronous), phase recovery (Images: Wikimedia Commons, www.oracle.com, Stojanovic, Ph.D. thesis, 2004)

  6. Data centers need lots of cables: optical or electrical • Reach, cable management, reliability, replaceability • Full disclosure: “the ugly” was actually a Sun data center • Long cables between boards and chassis • Minimize cabling (weight, cost) – overclock them • TX/RX have completely different clocks, so need full CDR (Photos: the good, the bad, the ugly; taken from Al Davis, Hot Chips tutorial, 2010)

  7. These interconnects are fairly well understood: or, at least, their constraints are • Long cables between boards and chassis • Minimize cabling (weight, cost) – overclock them • TX/RX have completely different clocks, so need full CDR • Traces on boards and backplanes • Minimize chip connectors (solder balls) – overclock them • Use parallel clocking (source-synchronous), phase recovery • On-chip wires • Bandwidth is relatively free: no need for overclocking • Parallel wires with global synchrony or simple asynchrony • (Recent work in a new layer here)

  8. A package-level “macrochip” integrated system: overcome pin/ball limits for much higher chip-to-chip bandwidth • Integrate lots of silicon: multi-core CPUs, DRAM • Connect the chips using fine-pitch optical interconnect • Chips sit face up in a silicon carrier lattice • Bridge chips interface CPU/DRAM to the carrier lattice (Figure labels: silicon lattice with waveguides, CPUs or DRAMs, optical bridges; Ho et al., IEEE Design and Test, July 2010)

  9. An optical macrochip link • Transmitter side: WDM allows waveguide sharing • Receiver side: light is sensed by a photodiode and converted to a current, which is amplified by the TIA • A clock in the electrical domain at TX and RX synchronizes communication

  10. This is not a talk (or a paper) on packaging or optics: see Ashok Krishnamoorthy this afternoon for those topics • Optical links have interesting circuit implications • Small overhead (20 µm pitch optical connector, 16x WDM) • No need to overclock these links; run them at the clock rate • Simplify timing recovery by having a wider bit time • Target comparable energy/performance to electrical interconnects (≈1 mW/Gbps): • Need significant reductions in optical component loss • Need efficient TX and RX circuits • Clocking is a big part of RX circuits (CDR, phase adjustment, etc.) • So, a question: how should these links be clocked?

  11. Clocking “ground rules” for this case study • Chips in a macrochip package share a clock source • Typically a slow (MHz) crystal oscillator • Here, chips run at 5 GHz and links at 10 Gbps • Two basic choices in clock distribution and usage • (1) Multiply the shared clock at the “root” of the system • Distribute a fast 5 GHz clock optically over the chip • More like on-chip clocking • (2) Distribute the native slow clock globally • Each chip uses its own multiplying PLL (mPLL) at each IO channel • More like board-level clocking • In both cases, do local phase adjustment at each IO channel • Which one would you choose?

  12. Slow root clock vs. fast root clock • With slow-clock distribution, the multiplying PLL accumulates jitter over its long loop time constant • With fast-clock distribution, the clock buffers dominate the jitter • Maybe we can do something about that
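A rough back-of-the-envelope sketch of why the slow-clock path accumulates more jitter (the κ coefficient and the loop-bandwidth rule of thumb below are illustrative assumptions, not numbers from the talk): a free-running VCO accumulates rms jitter roughly as κ·√τ, and a multiplying PLL only corrects it on the time scale of its loop bandwidth, which a slow reference caps.

```python
# Illustrative only: kappa and the f_ref/20 loop-bandwidth rule are assumptions,
# chosen so the result lands near the ~1.8 ps rms figure quoted on slide 16.
import math

def pll_accumulated_jitter(kappa_sqrt_s: float, loop_bw_hz: float) -> float:
    """RMS jitter (seconds) a free-running VCO accumulates before the loop corrects it."""
    tau = 1.0 / (2.0 * math.pi * loop_bw_hz)   # effective accumulation window
    return kappa_sqrt_s * math.sqrt(tau)

KAPPA = 2.3e-8             # VCO jitter coefficient, sqrt(seconds) (assumed)
SLOW_LOOP_BW = 500e6 / 20  # 500 MHz reference, loop bandwidth ~ f_ref/20 (assumed)

print(f"slow-clock mPLL jitter ~ "
      f"{pll_accumulated_jitter(KAPPA, SLOW_LOOP_BW) * 1e12:.1f} ps rms")
# The fast-clock path has no multiplying PLL at all: the 5 GHz clock is
# distributed directly and only a local DLL adjusts phase, so there is no
# long accumulation window to model.
```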

  13. LC-tuned clock buffer for fast clock distribution • Inductor resonates with the capacitive on-chip wire • Maximum gain at the clock frequency • Filters noise (jitter) elsewhere • Jitter is reduced by √Q
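A minimal numeric sketch of the tuning, assuming example component values (the wire capacitance and tank resistance below are not from the talk): the inductor is sized so the tank resonates at the 5 GHz clock, giving peak gain at f_clk, a noise bandwidth of roughly f_clk/Q around it, and the √Q jitter reduction quoted on the slide.

```python
# Sketch with assumed component values; only f_clk comes from the talk.
import math

f_clk = 5e9       # distributed clock frequency (from the talk)
C_wire = 2e-12    # capacitance of one clock-wire segment (assumed)
R_par = 100.0     # effective parallel tank resistance (assumed)

# Size L so the tank resonates at the clock frequency: f0 = 1 / (2*pi*sqrt(L*C))
L = 1.0 / ((2 * math.pi * f_clk) ** 2 * C_wire)

Q = R_par * math.sqrt(C_wire / L)   # parallel-RLC quality factor
bandwidth = f_clk / Q               # band around f_clk where noise still passes
jitter_reduction = math.sqrt(Q)     # rule of thumb quoted on the slide

print(f"L = {L * 1e9:.2f} nH, Q = {Q:.1f}, "
      f"BW = {bandwidth / 1e9:.2f} GHz, jitter reduced by ~{jitter_reduction:.1f}x")
```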

  14. Clock buffer insertion criteria • Each buffer filters noise and reduces jitter • Can we insert multiple buffers to improve jitter further? • Constraint: keep the total power constant

  15. Clock buffer insertion criteria (cont’d) • Cascading more clock buffers helps up to a point • Scale down the size and power of each buffer to keep the total power constant • After three stages, devices drop out of saturation and the gain drops • Note: lower-frequency operation has more jitter! • Q actually drops at lower frequency • Slope decreases, so edges are more susceptible to noise • (Plot: rms jitter vs. number of stages; minimum of about 0.15 ps at three stages, where devices drop out of saturation)
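A toy model of the diminishing returns (purely illustrative: the Q, input-jitter, and per-stage noise values, and the √N scaling of per-stage noise under a fixed power budget, are all assumptions, and the saturation limit the slide identifies is not modeled): each tuned stage attenuates incoming jitter by about 1/√Q but adds its own device jitter, which grows as the per-stage power shrinks.

```python
# Toy model only; none of these numbers come from the talk.
import math

Q = 6.0            # tank quality factor (assumed)
SIGMA_IN = 1.0     # jitter entering the buffer chain, ps rms (assumed)
SIGMA_DEV1 = 0.05  # added jitter of one full-power stage, ps rms (assumed)

for n in range(1, 6):
    # With n stages sharing a fixed power budget, each stage gets P/n, so its
    # own added jitter is assumed to grow as sqrt(n).
    sigma_dev = SIGMA_DEV1 * math.sqrt(n)
    sigma = SIGMA_IN
    for _ in range(n):                                  # propagate through n stages
        sigma = math.sqrt(sigma ** 2 / Q + sigma_dev ** 2)
    print(f"{n} stage(s): {sigma:.3f} ps rms")
```

In this toy model the curve flattens after about three stages; in the real circuit the hard limit is the devices dropping out of saturation, as the slide notes.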

  16. Let’s talk numbers… • Fast on-chip clock distribution gives 0.25 ps rms jitter • 0.15 ps for the three clock-distribution buffers • 0.1 ps for the phase-adjustment DLL • Equivalent to 3.8 ps p-p jitter for a BER of 10⁻¹² • Slow on-chip clock distribution gives 1.8 ps rms jitter • Distribute a 500 MHz clock, multiply by 10 inside the receiver • Assume no jitter from clock distribution • Equivalent to 25 ps p-p jitter for a BER of 10⁻¹² • That is 25% of the UI! • Fast clock distribution is a clear winner
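As a cross-check, the standard conversion from Gaussian rms jitter to peak-to-peak jitter at a given BER is p-p ≈ 2·Q(BER)·σ, with Q(10⁻¹²) ≈ 7.0, i.e. roughly 14·σ; applied to the 0.25 ps and 1.8 ps rms figures it lands close to the 3.8 ps and 25 ps quoted above (the exact multiplier depends on the jitter model assumed).

```python
# Standard Gaussian rms -> p-p jitter conversion; not a new result from the talk.
import math

def q_from_ber(ber: float) -> float:
    """Gaussian Q such that 0.5 * erfc(Q / sqrt(2)) = BER, found by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2)) > ber:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = q_from_ber(1e-12)      # ~7.03 for a BER of 1e-12
ui = 1.0 / 10e9            # 10 Gbps -> 100 ps unit interval
for name, rms in [("fast distribution", 0.25e-12), ("slow distribution", 1.8e-12)]:
    pp = 2 * q * rms
    print(f"{name}: {rms * 1e12:.2f} ps rms -> {pp * 1e12:.1f} ps p-p "
          f"({100 * pp / ui:.0f}% of UI)")
```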

  17. The caveat… • Tuning the clock buffers across the chip is an issue • A varactor or capacitor bank is needed • It reduces Q and degrades performance • Or use a low-Q inductor to tolerate the resonance variation • Trades performance for ease of tuning • Dynamic voltage/frequency scaling is tricky • Clock buffers have to be retuned to match the processor clock • Power supply noise • Dominant unless good regulators are used in the DLLs • Upside: improved system tolerance to supply noise • Due to the absence of on-chip PLLs • This argues in favor of fast clock distribution

  18. Summary • Improve performance/power with a many-core macrochip • Point-to-point optical links pose interesting clocking challenges • A board-style clocking paradigm results in high jitter • Multiplying PLLs accumulate jitter • Instead, distribute a global fast clock optically to all chips • Tuned on-chip clock buffers are the key enablers • Their narrow bandwidth filters noise and hence jitter • Almost 10x better than using PLLs • Caveats • Need to tune the buffers • Need to tame power supply noise in the DLLs

  19. Questions?

