
IBM Research GmbH Zurich Research Laboratory Rüschlikon, Switzerland


Presentation Transcript


    1. IBM Research GmbH, Zurich Research Laboratory, Rüschlikon, Switzerland. Thomas Toifl, Christian Menolfi, Marcel Kossel, Matthias Brändli, Thomas Morf, Peter Buchmann, Martin Schmatz. Feb 19th, 2009

    2. Comparison of High-speed-serial to Wired Ethernet

    3. Low-power Design Areas
       So, what are the design areas where power can be minimized? First, I will briefly look at the transmitter, where I will show how a source-series terminated driver style can be used to save power. The main focus of my talk will, however, be on the receive side:
       - Since the sampling latch is the core of the receiver, I will start with a discussion and characterization of the sampling latch.
       - Then I will show what design techniques can be applied in the input data path, before I discuss
       - the implementation of low-power decision-feedback equalizer structures.
       Last but not least, I will turn to the question of how to generate and distribute clocks for the CDR circuit.

    4. Outline - Low power techniques
       - Transmitter Architecture
         - CML vs. SST
       - Receiver data path
         - Sampling Latch
         - Sub-rate Processing
         - Sampling in Data Path
       - Receiver clock path
         - RX Clock Generation with P-PLL
       - DFE Architecture
         - Integrating DFE
         - Switched-cap DFE
       - Conclusions
       This brings me to the outline of my talk, which is shown here.

    5. Transmitter Architecture
       - CML vs. Source-Series Terminated (SST)
       - SST driver example
       So, now let's start with the transmitter architecture. Here, I will first compare a CML-based transmitter with a source-series terminated (SST) driver and then show you one example of a half-rate SST driver implementation.

    6. CML vs. SST driver
       - 1 Vppd swing
       - Load current: +/- 5 mA
       - Total current: 20 mA
       The picture on the left side shows the DC current flowing in the CML driver, while the picture on the right side shows the SST case. In both cases we assume a differential termination at the receiver side and a signal swing of 1 V differential peak-to-peak. Since the termination resistor in the CML driver is 50 ohms, a current of 10 mA flows in each branch, resulting in a total DC current of 20 mA. The situation is different for the series-terminated driver shown on the right side. Since the total resistance in the path from VDD to ground is 200 ohms, only 5 mA of current is actually used.
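
The currents quoted on this slide can be reproduced with a first-order DC model. The sketch below is only an illustration: the circuit topology (CML tail current steered fully into one branch with 50-ohm pull-ups, SST as a plain series path) and the 1 V SST supply are assumptions inferred from the 200-ohm and 5 mA figures, and the function names are hypothetical.

```python
# First-order DC model of the two driver styles.
# Assumptions (not stated on the slide): the CML tail current is steered
# completely into one branch; the SST driver runs from a 1 V supply.

R_DRV = 50.0    # driver-side termination per leg, ohms
R_RX = 100.0    # differential termination at the receiver, ohms


def cml_tail_current(v_ppd, r_drv=R_DRV, r_rx=R_RX):
    """Tail current a CML driver needs for a differential peak-to-peak swing."""
    v_diff = v_ppd / 2.0  # differential amplitude seen across the receiver
    # Node analysis with the tail current pulled from one output gives
    # v_diff = I_tail * r_drv * r_rx / (2 * r_drv + r_rx).
    r_eff = r_drv * r_rx / (2.0 * r_drv + r_rx)
    return v_diff / r_eff


def sst_supply_current(vdd, r_drv=R_DRV, r_rx=R_RX):
    """DC current in the series path VDD -> r_drv -> r_rx -> r_drv -> ground."""
    return vdd / (2.0 * r_drv + r_rx)


i_cml = cml_tail_current(1.0)          # 1 Vppd swing
i_sst = sst_supply_current(1.0)        # assumed 1 V supply
v_ppd_sst = 2.0 * i_sst * R_RX         # swing the SST driver delivers

print(f"CML total current: {i_cml * 1e3:.1f} mA")
print(f"SST total current: {i_sst * 1e3:.1f} mA for {v_ppd_sst:.1f} Vppd")
```

Under these assumptions the SST driver delivers the same 1 Vppd swing with a quarter of the DC current, because all of its supply current flows through the load.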

    7. Half-rate SST Driver

    8. Power Consumption
       This plot shows the measured power consumption of an SST driver operating at up to 16 Gb/s. The red curve corresponds to the power consumed when a regular zero-one pattern is transmitted, while the blue curve shows the case for random data. A power consumption of 3.6 mW per Gb/s is achieved at 1 V output swing. The power consumption is proportional to the speed of the transmitter, which is due to the high static CMOS content.

    9. Low-power Receiver Data Path
       - Sampling Latch
       - Sub-rate Processing
       - Sampling in Data Path
       So, let me now turn to the receiver data path, where I will take a closer look at the sampling latch. First, I will describe a method for modeling and characterizing sampling latches. This will be followed by a comparison between two frequently used latch topologies: the CML latch and the DCVS (or sense-amp) latch.

    10. Sampling Latch
        A sampling latch can be viewed as a regenerative amplifier. Wu and Wooley compared the performance of different amplification methods with respect to their power-delay product. They compared a regenerative amplifier to a single-pole amplifier and an optimized multipole amplifier. As can be seen from the graph, regenerative amplification is the most power-efficient way to amplify data.

    11. CML vs. DCVS latch
        Now, let's compare a CML latch to a DCVS latch. Although it is in general hard to compare different topologies because of the many degrees of freedom, the comparison still gives some interesting insight. First, note that the regenerative amplification transistors, shown by the red regions, are at a similar operating point in both cases, where Vgs and Vds are approximately 0.5 V for a 1 V supply. The CML latch achieves its smallest regeneration time constant τ_i when the load resistance is chosen as 2/gm. The regeneration time constant τ_i is given by the formulas on the slide. Please note that there is a derivation in the appendix, which I skip here due to time constraints. In the DCVS latch, both NFET and PFET contribute to the amplification. Interestingly, the intrinsic time constant, which corresponds to the unloaded latch, is smaller for the DCVS latch than for the CML latch. Also note that the P/N ratio, which historically was greater than 2, is now approaching 1 due to advances in the CMOS fabrication process.

    12. Low-power Receiver Data Path
        - Sampling Latch
        - Sub-rate Processing
        - Sampling in Data Path
        Now, after having discussed the latches, let me turn to the data path that comes before the latches. The question is what general methods can be used to lower the power consumption. One of these methods is sub-rate processing, which I will discuss now. (Other techniques include sampling in the data path. Once we have a sampled, time-discrete data path, we can optimize its power consumption by employing incomplete settling or integrating buffers.)

    13. Sub-rate processing: Energy/bit in DCVS latch
        This graph shows the energy per bit for a DCVS latch as a function of τ/τ_i, where τ_i is the intrinsic regeneration time constant and τ is the required time constant. The closer the circuit is operated to the intrinsic time constant, the less power efficient it is. In this example, sub-rate processing by a factor of 4 reduces the power consumption by approximately a factor of 5.

    14. Low-power Receiver Data Path Optimization
        - Sampling Latch
        - Sub-rate Processing
        - Sampling in Data Path
        Now, let me discuss the introduction of sampling in the data path.

    15. Sampling in Data Path
        - Sampling is done by T/H at the input
        - Total input capacitance is N·Cs/2
        - Advantages: signal processing now in discrete time domain
          -> Reduced bandwidth requirements due to sub-rate
          -> Can use reset to erase history
          -> Buffers can use incomplete settling or integration
        Instead of doing the processing in continuous time, it is advantageous to use a track-and-hold as the very first processing element. The data processing is then done in the discrete time domain, which has several advantages. First, the bandwidth requirements are reduced when sampling is used in combination with sub-rate processing. Second, since the processing is now in discrete time, we can use a reset switch in the buffers to erase their history, thereby eliminating residual ISI. Hence, the buffers can now use incomplete settling or, in the limit, current integration, as we will see in the following slides.

    16. Incomplete Settling -> Integrating buffer
        - For given CL, settling time ts = Tcycle/2, and gm/I = 2/Vdsat, τ = RL·CL is varied
        - Power can be reduced significantly for ts/τ -> 0 (RL -> ∞, integrator)
        - But: noise rises significantly when ts/τ < 1.5
        - Can drive higher load at same noise with lower power
        Hence, having a sampled data path with a resettable buffer allows us to use incomplete settling in the buffer. This means that the load resistance can be made higher, which allows the current to be lowered at the same gain. Taking the concept to the limit results in infinite load resistance and a current-integrating buffer, which consumes the smallest amount of power at a given gain and load capacitance CL. Note, however, that with higher resistance and smaller settling time the noise rises significantly. The bottom line is that with an integrating buffer a higher load capacitance can be driven with lower power at the same noise level. We will see soon how this concept can be used in a decision-feedback equalizer.
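
The power argument on this slide can be made concrete with a first-order model. The sketch below assumes a single-pole buffer whose gain after the settling time is G = gm·RL·(1 - exp(-ts/τ)) with τ = RL·CL; solving for gm at fixed G, CL and ts gives gm proportional to x/(1 - exp(-x)) with x = ts/τ, and the bias current follows from gm/I = 2/Vdsat. The model and the function name are assumptions for illustration, and noise is not modeled here.

```python
import math


def relative_current(ts_over_tau):
    """Bias current for a fixed gain, normalised to the pure integrator.

    Assumes a single-pole buffer: gain = gm * RL * (1 - exp(-ts/tau)),
    tau = RL * CL, with CL, ts and gm/I held constant.
    """
    x = ts_over_tau
    if x == 0.0:
        return 1.0  # limit RL -> infinity: current-integrating buffer
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x
    return x / -math.expm1(-x)


for x in (5.0, 3.0, 1.5, 0.5, 0.0):
    print(f"ts/tau = {x:3.1f}  ->  relative current {relative_current(x):.2f}")
```

In this model a nearly fully settled buffer (ts/τ = 5) needs about five times the current of the integrator for the same gain, which is the power saving the slide refers to; the noise penalty below ts/τ of about 1.5 has to be weighed against it separately.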

    17. Low-power RX clock generation
        - Clock generation
        - Phase-programmable PLL (P-PLL)
        - Design example: 40 Gbit/s RX
        In the next chapter I would like to discuss the options for low-power clock generation and distribution in the receiver. I will first compare two styles of clock circuits, CML and full-swing CMOS. Then I will discuss clock generation using a phase-programmable PLL, followed by a design example of a 40 Gbps CDR circuit.

    18. Quarter-rate Dual-loop architecture
        The circuit on this slide shows a previous solution for a dual-loop architecture for a quarter-rate CDR system. A reference clock phi_ref enters a phase-locked loop or delay-locked loop, which generates k clock phases. These clock phases are fed into a number of phase rotators, which allow the phase to be set by a digital value. The clocks coming out of the phase rotators are fed to the eight sampling latches, generating four data and four edge samples. These sampled bits then enter a digital loop filter, which finally controls the phase rotators. This forms a digital delay-locked loop. This DLL tracks the phase and small frequency deviations of the input data.
        Phase rotators, although often used in previous designs, have several drawbacks. Of course, they consume power and chip area. Also, random device mismatch makes it hard to achieve accurate timing at very high data rates.

    19. Quarter-rate Dual-loop architecture
        One solution to avoid phase rotators is to derive the clock phases directly from the VCO. This can be done by using the phase-programmable PLL shown here. The P-PLL locks to the reference frequency phi_ref with a programmable phase offset. The P-PLL is shown in more detail on the next slide.

    20. CDR Architecture with Phase-Programmable PLL
        This slide shows the architecture of the CDR circuit. The circuit inside the red boundary is the P-PLL. The P-PLL feeds 8 clock phases to the sampling stage, where eight samples are taken. A 4-to-8 demux generates eight data bits and eight edge bits at a rate of 5 GHz. The data bits are fed to the integrated PRBS-15 checking circuit, which generates an error signal whenever a bit error is detected. The data samples together with the edge samples also enter an early/late signal generator. After a majority voting stage, a single early or late signal goes into a digital loop filter, which decides whether the phase should be advanced or retarded. The settings of the current phase position are converted to the appropriate control signals for the alpha-DAC, which controls the weights of the XOR phase detector outputs.
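
The early/late generation and majority voting described above can be sketched behaviorally. The talk does not give the exact logic, so the code below assumes the standard bang-bang (Alexander-type) rule: an edge sample taken between two differing data bits that still equals the earlier bit means the clock is early. The function names, the sign convention and the number of phase steps are hypothetical.

```python
def early_late_votes(data, edges):
    """Per bit boundary: +1 = clock early, -1 = clock late, 0 = no transition.

    edges[i] is the edge sample taken between data[i] and data[i + 1].
    """
    votes = []
    for i in range(len(data) - 1):
        if data[i] == data[i + 1]:
            votes.append(0)       # no transition, no timing information
        elif edges[i] == data[i]:
            votes.append(+1)      # edge sampled before the transition
        else:
            votes.append(-1)      # edge sampled after the transition
    return votes


def majority(votes):
    """Collapse the votes of one word into a single early/late/hold decision."""
    total = sum(votes)
    return (total > 0) - (total < 0)


def update_phase(phase, decision, n_steps=64):
    """Minimal loop filter: step the phase position by the decision.

    n_steps is a hypothetical number of programmable phase positions.
    """
    return (phase + decision) % n_steps


data = [0, 1, 1, 0, 1, 0, 0, 1]
edges_early = [0, 1, 1, 0, 1, 0, 0]   # each edge sample equals the earlier bit
edges_late = data[1:]                 # each edge sample equals the later bit

decision_early = majority(early_late_votes(data, edges_early))
decision_late = majority(early_late_votes(data, edges_late))
print(decision_early, decision_late)
print(update_phase(0, decision_late))
```

A real loop filter would typically add more filtering than a single step per word; the one-step update is only the simplest form that closes the loop.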

    21. P-PLL: Key points
        So let's summarize the key points of the programmable PLL:
        - First, the clocks from the VCO go directly into the latches, making phase rotators unnecessary.
        - In consequence, the clock path is very short, minimizing effects caused by device mismatch and power supply noise.
        - Unlike phase rotators, where different clock waveforms are blended, phase rotation with a P-PLL is inherently linear.
        - Also, high-frequency noise on the input clock signal is filtered out. This is, for example, beneficial in clock-synchronous links with timing skew between the data and the clock line.
        Of course, there are also disadvantages:
        - Since the clock is generated with a PLL, phase noise is accumulated. This effect is, however, mitigated to a large degree since the achievable bandwidth in the PLL is extremely high.
        - Since the phase rotation now happens in the feedback path, the loop dynamics of the PLL influence the data tracking loop. This effect is, however, negligible, since due to the high loop bandwidth the loop reacts more or less instantaneously.

    22. The CDR circuit has been fabricated in a 65nm partially depleted digital CMOS SOI technology. The loop filter logic, the PRBS checker and the output multiplexer logic were synthesized into one block of static CMOS with digital design tools. The active area is 0.03 mm^2. The power consumption of the CDR circuit at 40 Gb/s was measured to be 72 mW from a 1.2 V supply. This results in a power efficiency for the proposed CDR + 1:8 demux of 1.8 mW/Gbps.
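
The efficiency figure follows directly from the measured numbers. The short calculation below restates it and derives a few related quantities; only the 72 mW, 40 Gb/s, 1.2 V, quarter-rate and 1:8 demux values come from the talk, and the variable names are illustrative.

```python
power_w = 72e-3        # measured CDR power
data_rate_bps = 40e9   # data rate
vdd_v = 1.2            # supply voltage

mw_per_gbps = (power_w * 1e3) / (data_rate_bps / 1e9)
pj_per_bit = power_w / data_rate_bps * 1e12      # same number, as energy per bit
supply_current_ma = power_w / vdd_v * 1e3

quarter_rate_clock_ghz = data_rate_bps / 4 / 1e9  # quarter-rate sampling clocks
demux_output_rate_ghz = data_rate_bps / 8 / 1e9   # after the 1:8 demux

print(f"{mw_per_gbps:.1f} mW/Gbps = {pj_per_bit:.1f} pJ/bit")
print(f"supply current: {supply_current_ma:.0f} mA")
print(f"clock: {quarter_rate_clock_ghz:.0f} GHz, demux output: {demux_output_rate_ghz:.0f} GHz")
```

The 5 GHz demux output rate agrees with the rate quoted for the data and edge bits on the architecture slide.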

    23. This slide shows a summary of previously implemented receivers or transceivers based on the P-PLL architecture. As we can see, very low values for power consumption can be achieved.

    24. DFE Architecture
        - Low-power Options
          - Integrating DFE
          - Switched-cap DFE
        - Design Example
          - 8 Channel bus with X-Cancellation
        Now, let me start a different chapter and turn to the implementation of low-power DFEs. I will first give a brief general introduction to DFE data paths, which are usually implemented as either a direct DFE or a speculative DFE. I will then describe some of the low-power DFE options. First, I will show an example of how the feedback path can be optimized to gain timing margin. I will then present one example of a switched-cap approach, followed by a low-power DFE implementation using the current-integrating buffer concept. I will finish the chapter by taking a glimpse at how future DFE receivers running at very high data rates, e.g. 25 Gbps, can be implemented.

    25. Current-Integrating DFE
        -> Integrating buffer: see slide 16
        - Power advantage due to integrating buffer/adder
        - 1.4 mW/Gbps for 2 taps, 90nm CMOS
        As we have previously shown when we discussed the design of the data path, power can be reduced by using incomplete settling or a current-integrating buffer. When the CLK signal is low, the circuit is reset. When CLK goes high, the input voltage is integrated on the load capacitance. The contributions of the different DFE taps are added on the same output node. Also, the offset voltage of the circuit and the following latch can be corrected by adding an offset current. This circuit, proposed by Matt Park, achieved a power consumption of 1.4 mW per Gbps for 2 DFE taps in 90nm CMOS.
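
What the summing node computes can be illustrated with a discrete-time behavioral model of a direct 2-tap DFE. This is not a model of the integrating circuit itself: the per-cycle reset and integration are abstracted into one sample per bit, and the channel coefficients, bit pattern and function names are made up for the example.

```python
def channel(bits, post_cursors):
    """Toy channel: main cursor of 1 plus post-cursor ISI from earlier bits."""
    out = []
    for n, b in enumerate(bits):
        v = float(b)
        for k, h in enumerate(post_cursors, start=1):
            if n - k >= 0:
                v += h * bits[n - k]
        out.append(v)
    return out


def slicer(samples):
    """Plain sampling latch without equalization."""
    return [1 if x >= 0 else -1 for x in samples]


def dfe_receive(samples, taps):
    """Direct DFE: subtract tap-weighted past decisions, then slice."""
    history = [0] * len(taps)  # most recent decision first, 0 = none yet
    decisions = []
    for x in samples:
        y = x - sum(w * d for w, d in zip(taps, history))  # summing node
        d = 1 if y >= 0 else -1                            # sampling latch
        decisions.append(d)
        history = [d] + history[:-1]
    return decisions


bits = [1, -1, -1, 1, 1, 1, -1, 1, -1, -1, 1, -1]
isi = [0.7, 0.4]                      # hypothetical post-cursor ISI
received = channel(bits, isi)

without_dfe = slicer(received)
with_dfe = dfe_receive(received, isi)
print("errors without DFE:", sum(a != b for a, b in zip(without_dfe, bits)))
print("errors with DFE:   ", sum(a != b for a, b in zip(with_dfe, bits)))
```

In the current-integrating implementation the subtraction inside `dfe_receive` corresponds to the tap currents being added on the same output node as the integrated input.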

    26. Switched-cap DFE
        - Integrating buffer (as previous slide)
        - DFE feedback signal added by charge injection

    27. Switched-cap approach for filter implementation

    28. DFE Architecture
        - Low-power Options
          - Integrating DFE
          - Switched-cap approach
        - Design Example
          - 8 Channel bus with X-Talk Cancellation

    29.

    30.

    31.

    32. FEXT cancellation with X-FFE and X-DFE

    33. 8 channel Receiver with DFE and X-talk cancellation

    34. Conclusions
        Low power/small area serial links achieved with
        - SST driver
        - DCVS latch, full swing CMOS clocking
        - Sub-rate receivers
        - Sampled data path with buffer reset
        - Integrating/Incomplete settling buffer
        - P-PLL multi-phase clock generation
        - Integrating and switched-cap DFE
          -> Complex equalization (>50 taps) at low power possible
        So, let me conclude: low-power/small-area transceivers can be achieved by
        - using an SST driver approach
        - preferring DCVS latches and full-swing CMOS clocking over CML
        - using sub-rate processing
        - using sampling in the data path, which allows a reset switch in the buffers and integrating or incompletely settled buffers.
        I showed you variants for low-power DFE implementations:
        - a fast DFE feedback path
        - integrating and switched-cap DFEs
        and a phase-programmable PLL for multi-phase clock generation.

    35. Acknowledgements: Christian Menolfi, Peter Buchmann, Christoph Hagleitner, Marcel Kossel, Thomas Morf, Jonas Weiss, John Bulzacchelli, Mounir Meghelli, Matthias Brändli, Martin Schmatz

    36. Thank you!

    37. Appendix

    38.

    39. Sampling Latch
        Hence it is so important to start every receiver design with the properties of the sampling latch. The properties of the latch also define the requirements of the data path and the clocking. Strictly speaking, we have to separate two different functions of the sampling latch. First, its sampling function, which is defined by the shape of its sampling window. This ultimately defines the achievable time resolution of the sampler. We can accurately describe the sampling window with a sensitivity function, hs(t), which will be explained in the following slides. The second function of the latch is regenerative amplification. This is defined by its regeneration time constant τ. Of course, smaller values for τ correspond to higher gain.
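
The link between τ and gain can be made explicit with the usual exponential regeneration model, in which a sampled voltage grows as V(t) = V0·exp(t/τ). The sketch below uses that model; the evaluation time, time constants and output level are hypothetical values chosen only for illustration.

```python
import math


def regeneration_gain(t_eval, tau):
    """Gain of an ideal regenerative amplifier after t_eval: exp(t_eval / tau)."""
    return math.exp(t_eval / tau)


def min_input_for_level(v_out, t_eval, tau):
    """Smallest sampled voltage that regenerates to v_out within t_eval."""
    return v_out / regeneration_gain(t_eval, tau)


T_EVAL = 50e-12   # hypothetical evaluation time
V_LOGIC = 0.5     # hypothetical output level counted as a valid decision

for tau in (10e-12, 5e-12):
    gain = regeneration_gain(T_EVAL, tau)
    v_min = min_input_for_level(V_LOGIC, T_EVAL, tau)
    print(f"tau = {tau * 1e12:.0f} ps: gain = {gain:.3g}, min input = {v_min * 1e6:.3g} uV")
```

Because the gain is exponential in t_eval/τ, halving τ squares the gain for the same evaluation time.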

    40. Latch Model
        As I have already noted, a latch is a time-variant regenerative amplifier. Let me now ask the following question: given a specific latch circuit, how can we model and characterize it? To do this, the following latch model can be used, which consists of these components:
        - a linear front-end, where the input voltage is convolved with hs(t), which we will call the latch sensitivity function. The shape of this function determines the time resolution of the latch. The front-end is followed by
        - an ideal sampler,
        - a regenerative amplifier with exponentially growing gain,
        - a threshold detector, which outputs whether a clean plus 1 or minus 1 was achieved,
        - and a feedback path, which captures the decision history of the latch.

    41. 41 Sensitivity Function hs(t) Evaluation Procedure In order to characterize a given latch we use the following procedure, which we have implemented as an automatic function in our simulation environment. First, we specify the allowed latch evaluation time t_eval, which is defined by system requirements. After this time the latch has to deliver a valid decision at its output. We then choose a time delay Δt relative to the edge of the sampling clock. At this Δt we apply a narrow impulse to the latch, for example 2ps wide. We then run a series of simulations to search for the impulse amplitude A(Δt) that is just sufficient to flip the latch. Of course, a small value of the required amplitude A(Δt) corresponds to a high sensitivity at this point in time. The sensitivity hs(Δt) is just the inverse of A(Δt). This procedure is then repeated for all possible values of Δt, which results in a curve A(Δt). Hence, by finding the minimum pulse amplitude A for all time positions we can define a latch sensitivity function hs(Δt) = 1/A(Δt).
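The search described above can be sketched in a few lines. This is a minimal sketch, not the simulation environment used in the talk: `toy_latch_flips` is a hypothetical stand-in for one circuit simulation run, and its Gaussian window, time constant and threshold are assumed values chosen only to make the example runnable.

```python
import math

def toy_latch_flips(amplitude, delay, t_eval=100e-12, tau=20e-12,
                    window=5e-12, v_valid=1.0):
    """Hypothetical stand-in for one circuit simulation run.

    Returns True if an impulse of the given amplitude, applied `delay`
    seconds from the clock edge, produces a valid output within t_eval.
    The Gaussian window and all default values are assumptions.
    """
    front_end = amplitude * math.exp(-(delay / window) ** 2)
    return front_end * math.exp(t_eval / tau) >= v_valid

def min_flip_amplitude(flips, delay, a_max=10.0, iterations=50):
    """Bisect for the smallest amplitude A(delay) that just flips the latch."""
    if not flips(a_max, delay):
        return math.inf  # no flip even at the largest amplitude tried
    lo, hi = 0.0, a_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if flips(mid, delay):
            hi = mid
        else:
            lo = mid
    return hi

def sensitivity_function(flips, delays):
    """hs(delay) = 1 / A(delay) for every delay in the sweep."""
    return [1.0 / min_flip_amplitude(flips, d) for d in delays]

delays = [k * 1e-12 for k in range(-10, 11)]  # -10 ps ... +10 ps
hs = sensitivity_function(toy_latch_flips, delays)
```

In a real flow, `toy_latch_flips` would be replaced by a call into the circuit simulator; the bisection and the inversion hs = 1/A stay the same.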

    42. 42 Latch Sensitivity Function This slide shows the latch sensitivity function for different evaluation times. As already said, the evaluation time is the time the latch is allowed to use for regenerative amplification in a system. Here it is swept from 60ps to 120ps in steps of 20ps. The right plot shows the same curves on a logarithmic scale. Note that the various curves are shifted on the vertical scale by approximately the same amount, which is due to the fact that the amplification rises exponentially with time. Also, the shape of the curves is approximately constant, indicating that the description of the latch as a linear front-end followed by exponential amplification is valid. The sensitivity function hs(t) defines the aperture window of the sampler relative to the clock edge. It is a more accurate description of the latch than the frequently used "setup and hold time" description. Taking the Fourier transform of hs(t) results in an equivalent transfer function for the latch, which can be integrated into the frequency characteristic of the whole receiver data path.
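As an illustration of the last step, the following sketch evaluates the Fourier transform of a sampled hs(t) directly and searches for the -3dB bandwidth. The Gaussian-shaped window and its 4ps sigma are assumptions for demonstration only, not data from the talk, and the bisection assumes the magnitude falls monotonically over the search range.

```python
import math

def magnitude(ts, hs, f):
    """|H(f)| of a uniformly sampled hs(t), evaluated as a Riemann sum."""
    dt = ts[1] - ts[0]
    re = sum(h * math.cos(2 * math.pi * f * t) for t, h in zip(ts, hs)) * dt
    im = -sum(h * math.sin(2 * math.pi * f * t) for t, h in zip(ts, hs)) * dt
    return math.hypot(re, im)

def bandwidth_3db(ts, hs, f_max=500e9, iterations=60):
    """Frequency where |H(f)| drops to 1/sqrt(2) of |H(0)|, by bisection.

    Assumes |H(f)| decreases monotonically on [0, f_max].
    """
    target = magnitude(ts, hs, 0.0) / math.sqrt(2)
    lo, hi = 0.0, f_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if magnitude(ts, hs, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical Gaussian-shaped sensitivity window (sigma is an assumed value).
sigma = 4e-12
ts = [i * 0.1e-12 for i in range(-400, 401)]
hs = [math.exp(-t * t / (2 * sigma * sigma)) for t in ts]
bw = bandwidth_3db(ts, hs)
```

For a Gaussian window the result can be checked against the closed form sqrt(ln 2) / (2 pi sigma); a simulated hs(t) would simply replace the `hs` list.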

    43. 43 So, now that we have discussed how to characterize the properties of a latch, let me turn to its power consumption. What is the lower limit of power consumption in a pre-charged latch, given a load capacitance CL? During reset, nodes vo and vob are precharged to Vdd. In the course of the latching operation one node is discharged to ground. Hence, the minimum energy per received data bit is Vdd² times the load capacitance. We will see how closely we can approach this limit.

    44. 44 Energy consumption in DCVS (=Sense Amplifier) Latch. Capacitive load (intrinsic plus external load): Ecap = 1.5 Vdd² (Ci + CL) + Vdd² Ca. Short-circuit current Isc: the duration depends on the differential input signal vx, tsc = τ ln(voff / vx). Average over a uniform distribution: tsc ≈ τ, so Esc = Vdd Isc τ. Now, let's look at the power consumption in a DCVS latch. The upper curves show the output nodes of the latch. The nodes are initially charged to Vdd. During regeneration both nodes are pulled to Vdd/2. One node then goes up to Vdd again, the other is discharged to ground. We have two components of the power consumption. The first is the current from the capacitive discharge of the internal nodes at the latch output and of the load capacitance. Note the factor 1.5, which takes into account the discharge of both nodes to Vdd/2. The additional internal capacitance Ca is also discharged to ground. The second component is a short-circuit current, which flows while the output nodes are both at Vdd/2. It is interesting to note that for a DCVS latch this component depends on the actual input signal. Small input signals lead to longer regeneration time and more power consumption. Assuming that the input signal has a uniform distribution, an average energy can be calculated.
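The two energy components can be evaluated numerically as a quick check. This is a sketch under stated assumptions: all component values below are hypothetical, and the input signal vx is assumed to be uniformly distributed between 0 and voff, which is one reading of "uniform distribution" that reproduces the stated average tsc ≈ τ.

```python
import math

def e_min(vdd, c_load):
    """Lower bound per bit for a precharged latch: Vdd^2 * CL."""
    return vdd ** 2 * c_load

def e_cap(vdd, c_int, c_load, c_add):
    """Capacitive component: 1.5 * Vdd^2 * (Ci + CL) + Vdd^2 * Ca."""
    return 1.5 * vdd ** 2 * (c_int + c_load) + vdd ** 2 * c_add

def t_sc(tau, v_off, v_x):
    """Short-circuit duration for input amplitude v_x: tau * ln(voff / vx)."""
    return tau * math.log(v_off / v_x)

def mean_t_sc(tau, v_off, n=100_000):
    """Midpoint-rule average of t_sc for vx uniform on (0, voff)."""
    return sum(t_sc(tau, v_off, v_off * (k + 0.5) / n) for k in range(n)) / n

def e_sc(vdd, i_sc, tau):
    """Average short-circuit energy, using the average duration tau."""
    return vdd * i_sc * tau

# Hypothetical values, chosen only for illustration.
vdd, c_int, c_load, c_add = 1.0, 2e-15, 5e-15, 1e-15
tau, v_off, i_sc = 20e-12, 0.5, 100e-6

ratio = e_cap(vdd, c_int, c_load, c_add) / e_min(vdd, c_load)
avg_duration = mean_t_sc(tau, v_off)
total = e_cap(vdd, c_int, c_load, c_add) + e_sc(vdd, i_sc, tau)
```

With these assumed numbers the capacitive part alone is 2.3 times the Vdd² CL bound, and the numerically averaged short-circuit duration comes out very close to τ.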

    45. 45 CML vs. DCVS Latch Energy Consumption This chart shows a comparison of the energy per bit as a function of the cycle time, normalized to the minimum energy for a given load capacitance. The red curve corresponds to a DCVS latch, whereas the blue curve is for a CML latch optimized for maximum speed and the green curve for a CML latch optimized for minimum power. A derivation for this case can be found in the appendix. For the comparison we require an amplification of exp(5), which is approximately 150. A cycle time of 100ps corresponds to a receiver with 10 GHz clocking. At this frequency the DCVS latch is about 2 times as power efficient as the CML latch.

    46. 46 CML vs. DCVS Latch Including Clock Power Now, when the power for clocking the latch is also included, the DCVS latch is about 1.8 times as power efficient as the CML latch. Of course, one can argue that the situation gets much better for the CML latch if inductive peaking is used. This basically scales the curves for the CML latch by 1/1.7. The drawback is, however, that inductors, even when implemented as multi-level coils, take up a lot of chip area. Also, the comparison shown here covers only the very first stage of the latch, and one should not forget that the CML latch requires additional circuitry, namely a CML slave latch and a DCVS latch to convert the data to CMOS levels. Hence, in practice the advantage of the DCVS latch will be more pronounced.

    47. 47 CML vs. DCVS Latch with DCVS initial delay Now, when the power for clocking the latch is also included, the DCVS latch is about 1.8 times as power efficient as the CML latch. Of course, one can argue that the situation gets much better for the CML latch if inductive peaking is used. This basically scales the curves for the CML latch by 1/1.7. The drawback is, however, that inductors, even when implemented as multi-level coils, take up a lot of chip area. Also, the comparison shown here covers only the very first stage of the latch, and one should not forget that the CML latch requires additional circuitry, namely a CML slave latch and a DCVS latch to convert the data to CMOS levels. Hence, in practice the advantage of the DCVS latch will be more pronounced.

    48. 48 CML vs. DCVS latch: Power Consumption Calculating the energy per bit reveals that in the case of the CML latch the energy per bit is proportional to the ratio of the cycle time to the regeneration time constant. In the case of the DCVS latch the energy is independent of this ratio. Since this ratio is the logarithm of the latch gain, one always has to specify the required gain A to do a comparison. This gain is usually dictated by system requirements.
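The link between gain, regeneration time and time constant follows from the exponential amplification of the latch model, A = exp(T/τ). The sketch below makes that relation concrete using the exp(5) gain and 100ps cycle time quoted in the comparison; treating the full cycle time as available for regeneration is a simplifying assumption of this example.

```python
import math

def gain_after(t_regen, tau):
    """Regenerative gain reached after t_regen: A = exp(t_regen / tau)."""
    return math.exp(t_regen / tau)

def required_tau(t_regen, gain):
    """Largest time constant that still reaches `gain` within t_regen."""
    return t_regen / math.log(gain)

t_cycle = 100e-12          # cycle time from the comparison (10 GHz clocking)
target_gain = math.exp(5)  # required amplification, roughly 150

tau_needed = required_tau(t_cycle, target_gain)
log_gain = math.log(gain_after(t_cycle, tau_needed))
```

Under these assumptions the latch needs a time constant of about 20ps, and the ratio of cycle time to time constant is exactly the natural logarithm of the required gain.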

    49. 49 Sub-rate Processing Frequency is lowered by a factor N: - Lowers speed requirements in data path and latches - Allows trade-off between area and power. With sub-rate processing, the frequency of the circuits is reduced by a factor N. Of course, since the data rate does not change, we need to replicate the circuitry in the data path and operate the copies in parallel. Hence, this allows a trade-off between area and power.

    50. 50 Sub-rate processing Inherent de-multiplexing to lower speed -> no additional circuitry needed. Disadvantages: - Higher input load - More area for latches/buffers. An additional advantage of sub-rate processing is that no additional de-multiplexer is needed, since the data is already sampled at a lower rate. The disadvantages of sub-rate processing are a potentially higher input load and more area for the data path and the latches. Also, more circuits have to be adjusted and offset-calibrated.

    51. 51

    52. 52 Clocking style: CML vs. Full-swing CMOS? So, let's look at the case where a buffer is driving an identical buffer plus a load capacitance CL. This situation is found in oscillators and clock distribution. CML buffers are differential and hence have the advantage of high power supply rejection. Using a full-swing CMOS clocking approach, however, requires a voltage regulator. The question is which of the two options will result in lower power consumption.

    53. 53 This slide shows a comparison of the power consumption. The circuits were compared under the specification that the rise/fall time has to be smaller than 16 percent of the cycle time. For the CML case, the width of the transistors and the common-mode voltage drop ΔV were optimized, with a minimum value for ΔV of 250mV, which results in 400mV single-ended swing in the worst-case corner. For the full-swing CMOS approach, the transistor width W and the regulated supply Vreg were optimized to reach the specs in the worst-case corner, and power was calculated considering 0.3V regulator overhead. As the graph shows, a full-swing CMOS clocking style results in 50% less power consumption at a cycle time of 100ps.

    54. 54 Delay, Rise Time and Energy Consumption Let me summarize the properties of the two buffer options. The CML buffer time constant tau is the product of the effective resistance and the total capacitance (including the input and output device capacitance of the buffers). The delay time is approximately 0.9 times tau, and the rise/fall time about 1.6 times tau. In the case of an inverter-based full-swing CMOS clock distribution, the delay time is given by the well-known alpha-power law, and the rise/fall time is approximately equal to the delay time.
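Combining the CML rules of thumb above with the rise/fall-time specification used in the power comparison (rise/fall time below 16 percent of the cycle time) gives a simple bound on the buffer time constant. The following sketch works this out; the resistance and capacitance values are hypothetical and serve only as an example.

```python
def cml_timing(r_eff, c_total):
    """CML buffer figures from the rules of thumb: tau = R*C,
    delay ~ 0.9 * tau, rise/fall time ~ 1.6 * tau."""
    tau = r_eff * c_total
    return {"tau": tau, "delay": 0.9 * tau, "rise_fall": 1.6 * tau}

def max_tau_for_spec(t_cycle, rise_fraction=0.16):
    """Largest tau whose rise/fall time stays within rise_fraction * t_cycle."""
    return rise_fraction * t_cycle / 1.6

def meets_spec(r_eff, c_total, t_cycle, rise_fraction=0.16):
    return cml_timing(r_eff, c_total)["rise_fall"] <= rise_fraction * t_cycle

t_cycle = 100e-12                       # 10 GHz clocking
tau_limit = max_tau_for_spec(t_cycle)   # about 10 ps

# Hypothetical buffer: 400 ohm effective resistance, 20 fF total capacitance.
example = cml_timing(400.0, 20e-15)
ok = meets_spec(400.0, 20e-15, t_cycle)
```

At a 100ps cycle time the specification therefore allows a time constant of roughly 10ps, so the assumed 8ps buffer passes with some margin.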

    55. 55

    56. 56 Supply Voltages for CML vs. FS-CMOS

    57. 57

    58. 58 In principle, the proposed programmable PLL is very similar to an ordinary PLL. The control voltage at the loop filter controls the frequency of the VCO. The special element, however, is the phase detector. The phase detector consists of eight slave phase detectors, one for each VCO phase. All slave phase detectors are of the XOR type. A coarse phase adjustment can readily be achieved by switching on only one of the eight phase detectors, thereby locking to one of the eight phases. So the 360-degree circle is divided into eight coarse phase positions. A fine adjustment of the phase can be achieved by multiplying the output voltages of the slave phase detectors by weighting factors alpha sub n, and by summing the resulting voltages. At any time, exactly two adjacent phase detectors are active. Hence, it is possible to interpolate between two coarse phase positions just by adapting the weighting factors. The voltage after the summation is converted to a current by a voltage-to-current converter, which is similar to a charge pump in an ordinary charge-pump phase-locked loop.

    59. 59 Here we see one single XOR cell, which is equivalent to a Gilbert multiplier. The reference clock signals phi ref and phi refb are multiplied with the VCO phases phi 0 and phi 4. A digital coarse value, in this case coarse sub 0, switches the XOR branch either on or off. The lower part of the circuit is a current DAC, which is used to generate the weighting factor alpha. In our implementation, the fine value contains eight thermometer-coded bits plus one binary-coded bit of half the value. This results in 17 fine steps per octant, and 136 total steps covering a phase range of four unit intervals at 10 Gigabits per second.
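The step count can be reproduced with a small model. This is a sketch under assumptions, not the circuit's actual decoding: it treats the fine value as the number of active thermometer bits plus 0.5 for the half-weight bit, and it idealizes the weighted sum of two adjacent XOR phase detectors as a linear interpolation between their lock points, with the weight on the next phase equal to the fine value divided by its maximum.

```python
N_PHASES = 8          # coarse phase positions (octants)
THERMO_BITS = 8       # thermometer-coded fine bits
HALF_WEIGHT = 0.5     # the extra binary bit carries half a thermometer step

def fine_codes():
    """All distinct fine values: 0, 0.5, ..., 8.5."""
    return sorted({t + HALF_WEIGHT * h
                   for t in range(THERMO_BITS + 1) for h in (0, 1)})

FINE_MAX = max(fine_codes())             # 8.5
STEPS_PER_OCTANT = len(fine_codes()) - 1  # intervals between fine values
TOTAL_STEPS = N_PHASES * STEPS_PER_OCTANT

def lock_phase_deg(octant, fine):
    """Idealized lock point in degrees, assuming linear interpolation:
    weight (1 - alpha) on phase `octant`, alpha on the next phase."""
    alpha = fine / FINE_MAX
    return ((octant + alpha) * 360.0 / N_PHASES) % 360.0

step_deg = 360.0 / TOTAL_STEPS
```

Under this reading the largest fine value of one octant coincides with fine value zero of the next, which leaves 17 distinct steps per octant and 136 steps around the full circle.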

    60. 60 Interestingly, XOR cells can always be combined in pairs. This is a consequence of the fact that two opposing phases are never used at the same time. Hence, the number of required XOR blocks is effectively halved. A phase detector for eight coarse phase steps therefore requires only four combined XOR blocks, which is shown in the next slide.

    61. 61 Note that the current through the multi-phase phase detector is the same as in a simple XOR phase detector, hence the phase detector does not consume more power than a single XOR phase detector. A slight increase in power consumption stems from the fact that the reference clock now has to drive four times the capacitance of a simple phase detector, but this is not a dramatic increase.

    62. 62 P-PLL based RX architecture: Clock generation for DFE

    63. 63

    64. 64 Low-power RX clock generation Clock generation for quarter-rate system Phase-programmable PLL (P-PLL) Design example: 40 Gbit/s RX Let me now describe a design example of a low-power 40 Gbit/s receiver circuit.

    65. 65 This slide shows the principle of operation of a quarter-rate clock and data recovery circuit. The green curve shows the data. This data is sampled by the clock phases phi0 to phi7. Since this is a quarter-rate architecture, four data bits are sampled in a single clock cycle. Apart from the data samples, the edge samples are also needed to extract the timing information. So, a total of eight samples is required in one clock cycle, which necessitates eight clock phases. In order to track the phase of the input data we have to provide the means to shift these clock phases simultaneously by a programmable phase shift.
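The timing of such a scheme can be written down directly. The sketch below computes the unit interval, the clock period and the nominal sampling instants for a quarter-rate receiver; which phases take data samples and which take edge samples is an assumption of this example (even phases as edge, odd phases as data), as is the choice of time origin.

```python
def quarter_rate_sampling(bit_rate, n_phases=8, bits_per_cycle=4):
    """Return (unit interval, clock period, list of sampling instants).

    Each entry of the list is (phase name, sample kind, time offset in s).
    The edge/data assignment of the phases is an assumption.
    """
    ui = 1.0 / bit_rate
    t_clk = bits_per_cycle * ui
    spacing = t_clk / n_phases
    samples = []
    for k in range(n_phases):
        kind = "edge" if k % 2 == 0 else "data"
        samples.append((f"phi{k}", kind, k * spacing))
    return ui, t_clk, samples

ui, t_clk, samples = quarter_rate_sampling(40e9)
data_phases = [name for name, kind, _ in samples if kind == "data"]
```

At 40 Gbit/s this gives a 25ps unit interval, a 100ps (10 GHz) clock period and 12.5ps between adjacent phases, with four data and four edge samples per clock cycle.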

    66. 66 Low-power RX design example: 40 Gbit/s CDR circuit - Full-swing CMOS clocking - Low-power DCVS latches - Quarter rate - NFET-switch based fast T/H. In this circuit we make use of many of the techniques described in this talk. First, the circuit is built nearly entirely with full-swing CMOS logic and employs low-power DCVS latches. Second, we benefit from sub-rate processing by using a quarter-rate design. Third, the timing resolution is defined by a track-and-hold implemented as an NFET switch. Here we benefit from the large single-ended swing of the full-swing CMOS clocking style.

    67. 67 As already noted, the sampling stage is tightly coupled to the VCO, thereby minimizing the clock path. To achieve high-speed operation, the sampler consists of a T/H stage, implemented as NMOS pass transistors T1, T1b, followed by a sense-amplifier latch. FETs operated as switches achieve very high time resolution, provided that the input clock fall times are sufficiently short and the clock swing is high enough for the expected signal swing. The VCO clock phi 2n+1 is therefore fed to a high slew-rate inverter, running from the regulated VCO supply Vcreg, which sharpens the edge at its output. A worst-case clock swing of 0.9V with <8ps fall transition time is achieved. A full-swing converter, denoted FSC in the slide, converts the clock from the regulated supply domain of the VCO to the unregulated, full-swing VDD domain of the following DCVS latch. Since the actual sampling time is defined by the preceding hold stage, delay variations in the FSC due to power supply noise do not influence jitter tolerance.

    68. 68 Data Path Sensitivity Function The latch characterization procedure I presented at the beginning of my talk was applied to the whole data path, which is the track-and-hold switch followed by the DCVS latch and an R/S flip-flop. The resulting sensitivity function is shown on the left side. The width of the curve is roughly 5ps. Taking the Fourier transform of the sensitivity function results in a -3dB bandwidth of 40GHz.

    69. 69 VCO Layout Optimization for Max Speed In order to maximize the speed of the VCO, great care was taken to minimize the capacitive loading of the wires. One question was how to arrange the VCO delay cells in order to reduce the length of the wires connecting the delay stages. Since the VCO uses a feed-forward technique to speed up its operation, every delay stage n has to feed two delay stages, namely stages n+1 and n+2. Of course, for symmetry reasons, all wires have to be of the same length. In order to find the best arrangement, an exhaustive search was performed over all possible permutations. The solution shown here (0,1,2,7,6,3,4,5) was found to be optimum, with a wire length of 4 times the width of the delay stage.
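An exhaustive search of this kind is small enough to reproduce. The sketch below assumes a simplified layout model that is not confirmed by the talk: the eight delay cells sit in a single row, wire length is measured in cell widths as the distance between slots, and because all wires must be equally long, the cost of an arrangement is its longest connection.

```python
from itertools import permutations

N_STAGES = 8

# Every stage n feeds stage n+1 and, via the feed-forward path, stage n+2.
CONNECTIONS = [(n, (n + k) % N_STAGES)
               for n in range(N_STAGES) for k in (1, 2)]

def longest_wire(order):
    """Longest connection, in cell widths, for a single-row arrangement.

    `order[slot]` is the delay stage placed in that slot (assumed model).
    """
    slot_of = {stage: slot for slot, stage in enumerate(order)}
    return max(abs(slot_of[a] - slot_of[b]) for a, b in CONNECTIONS)

# Exhaustive search over all 8! = 40320 arrangements.
best_cost = min(longest_wire(p) for p in permutations(range(N_STAGES)))

talk_order = (0, 1, 2, 7, 6, 3, 4, 5)
talk_cost = longest_wire(talk_order)
```

Under this model the arrangement quoted in the talk reaches the minimum found by the search, a longest wire of 4 cell widths, although other arrangements may tie with it.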

    70. 70 40 Gbit/s Input Eye - Eye opening = 19ps horizontal, 350mVdiff-pp vertical - PRBS 15, Δf = 400 ppm, BER = 10^-12 - Measured eye width inside receiver = 11ps (error-free range with phase fixed). A differential input signal from a BiCMOS 4:1 MUX and an Anritsu 1775A parallel signal generator was provided to the receiver; the single-ended input eye is shown here. The generated eye has 19ps horizontal and 350 mV differential vertical opening. Error-free operation at a bit error rate of <10^-12 was measured at 40Gb/s with a PRBS15 data sequence and a maximum frequency offset of +/- 400ppm. Sweeping the delay of the input data externally allowed us to measure the error-free timing window of the receiver. With the given data eye, an inner eye opening of 11ps was measured.

    71. 71 Implementation Summary So, here is a summary of the basic circuit parameters. We used a 65nm digital CMOS SOI technology. The circuit can work at a bitrate from 12 to 40 Gbps. The active area of the CDR is quite small. Power consumption was measured to be 1.8mW/Gbps from a 1.2 V supply. The PLL bandwidth is larger than 1 GHz, resulting in low jitter. A bit error rate of 10^-12 was achieved at a maximum frequency offset of 400 ppm.

    72. 72 Comparison with Prior Art This chart shows a comparison with the best figures of prior art, CMOS and BiCMOS, at similar data rates. The value on the X-axis corresponds to the throughput in terabits per second for 1 watt. The value on the Y-axis corresponds to the throughput achieved for 1 mm^2 of chip area.
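The X-axis metric follows from the measured efficiency by unit conversion only. The short sketch below converts the 1.8mW/Gbps figure from the implementation summary into energy per bit, total power at 40 Gbit/s and throughput per watt; no values beyond those quoted in the talk are assumed.

```python
def energy_per_bit_pj(mw_per_gbps):
    """1 mW per Gbit/s is the same quantity as 1 pJ per bit."""
    return mw_per_gbps

def total_power_mw(mw_per_gbps, gbps):
    """Power drawn at a given data rate."""
    return mw_per_gbps * gbps

def tbps_per_watt(mw_per_gbps):
    """Throughput per watt in Tbit/s: 1 W / (x mW per Gbit/s) = 1/x Tbit/s."""
    return 1.0 / mw_per_gbps

efficiency = 1.8  # mW per Gbit/s, from the implementation summary
power_at_40g = total_power_mw(efficiency, 40.0)
throughput_per_watt = tbps_per_watt(efficiency)
```

This corresponds to 1.8 pJ per bit, about 72mW at 40 Gbit/s, and roughly 0.56 Tbit/s per watt on the chart's X-axis.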

    73. 73 CML Latch: Speed-optimization

    74. 74 CML Latch: Time Constants tlin and ti

    75. 75 CML Latch: Speed-optimization

    76. 76 CML Latch: Power-optimization procedure

    77. 77

    78. 78 Energy (W, Vsat)

    79. 79 Amplification time Ta (W, Vsat)

    80. 80 Equal time constants in Linear and Regenerative Phase

    81. 81

    82. 82 X-talk cancellation receiver basic architecture

    83. 83 References

    84. 84 References

    85. 85 References

    86. 86 References

    87. 87 References

    88. 88 Acronyms
