Designing for 100+ MHz

1999 Designs Demand... • Higher system speed • Higher integration • smaller size, less power, better reliability • Lower cost • Shorter development time • Better product differentiation

Traditional Multi-Chip Boards • Discrete design components • CPU, memory • bus transceivers, PCI controller, FIFOs • Ethernet controller, Graphics accelerator, MPEG, DSP, etc. • programmable logic as glue and custom function • Advantages: • well-documented sophisticated functions • readily available as IP in silicon

Multi-Chip Board Problems • Physical size • Power consumption and reliability • PC board signal integrity • Limited flexibility • prevents design modifications and upgrades • prevents product diversification • prevents product customization • Poor product differentiation • standard parts = standard architecture

FPGA Advantages • Smaller size • Lower power consumption • Better signal integrity • fewer PC-board issues • Enhanced flexibility • easy modifications, upgrades, etc. • Enhanced product differentiation • proprietary architectures

FPGA Users Want... • System clock rate of 100+ MHz • >100,000 gates • Efficient design methodologies • Availability of well-documented Cores • Reasonable cost

The FPGA Solution • 4th Generation FPGA: Logic + Memory + Routing • Delay-Locked Loop for Fast Clock and I/O • 3.3 ns Synchronous Dual-Port SRAM • Multi-Standard SelectI/O • 500 Mbps SelectMAP Configuration • Temperature Sensing

Now the Challenge... Design a 100+ MHz system • Together, we can do it... • we’ll supply the ingredients... • you use them intelligently • But don’t forget... • the clock period is less than 10 ns !

Designing for 100+ MHz. • Volts, Amps, and Watts • PCB signal distribution • chip inputs and outputs • power and thermal considerations • Ones and zeros • logic emulation • Bits and bytes • memory hierarchy

Moore Meets Einstein • Speed doubles every 5 years… • ...but the speed of light never changes [Figure: clock frequency (MHz) and trace length (inches per 1/4 clock period) versus year, ’65 to ’10]

Volts, Amps, and Watts • PCB design issues • capacitive loading • transmission lines and termination • Chip inputs and outputs • clock distribution and DLLs • I/O standards • Power and thermal considerations • temperature sensing diode • power supply decoupling • Configuration • new SelectMAP mode

Capacitive Loading • Capacitance slows outputs and increases power • output delay increase: • ~ 25 ps per pF of additional loading • output power dissipation increase: • 11 µW per MHz per pF with 3.3-V swing • Sources of capacitance • 10 pF max for each device pin • 2 pF per inch for narrow traces ( 0.8 pF/cm ) • 130 pF per inch² for copper areas ( 20 pF/cm² ) • IBIS files provide output impedance details
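
The rules of thumb on this slide can be combined into a quick load estimate. The sketch below is only an illustration: the function names are hypothetical, the power figure assumes the usual C·V²·f relation (which gives about 10.9 µW per MHz per pF at a 3.3-V swing, matching the slide's 11 µW), and the example net (two pins, four inches of narrow trace, 50 MHz toggle rate) is made up.

```python
def load_capacitance_pf(pins, trace_inches, pin_pf=10.0, trace_pf_per_inch=2.0):
    """Worst-case load: 10 pF max per device pin plus 2 pF per inch of narrow trace."""
    return pins * pin_pf + trace_inches * trace_pf_per_inch


def added_delay_ps(c_pf, ps_per_pf=25.0):
    """Extra output delay at roughly 25 ps per pF of additional loading."""
    return c_pf * ps_per_pf


def dynamic_power_uw(c_pf, f_mhz, swing_v=3.3):
    """P = C * V^2 * f. With C in pF and f in MHz the result is in microwatts."""
    return c_pf * swing_v ** 2 * f_mhz


# Hypothetical net: two device pins, four inches of narrow trace, 50 MHz toggle rate.
c = load_capacitance_pf(pins=2, trace_inches=4)
print(c, "pF")
print(added_delay_ps(c), "ps of added output delay")
print(round(dynamic_power_uw(c, 50) / 1000, 1), "mW")
```

Under these assumptions the net presents 28 pF, which adds about 700 ps of output delay and dissipates roughly 15 mW.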

Transmission Lines • Some traces must be treated as transmission lines to minimize ringing • transmission line if round trip > transition time • lumped-capacitance if round trip < transition time • Signal delay on a PCB: • 140 to 180 ps per inch ( 50 to 70 ps/cm) • Lumped-capacitance trace length: • 3 inches max for a 1-ns transition time (7.5 cm) • 6 inches max for a 2-ns transition time (15 cm)
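
The round-trip rule above can be checked numerically. This is a minimal sketch with hypothetical function names; it uses only the 140 to 180 ps per inch propagation delay quoted on the slide and treats the comparison against the transition time as a hard threshold, which is a simplification.

```python
def round_trip_ns(length_in, ps_per_inch=180.0):
    """Out-and-back propagation time of a trace, in ns."""
    return 2 * length_in * ps_per_inch / 1000.0


def is_transmission_line(length_in, transition_ns, ps_per_inch=180.0):
    """True when the round trip exceeds the signal transition time."""
    return round_trip_ns(length_in, ps_per_inch) > transition_ns


def max_lumped_length_in(transition_ns, ps_per_inch):
    """Longest trace that can still be treated as a lumped capacitance."""
    return transition_ns * 1000.0 / (2 * ps_per_inch)


for delay in (140.0, 180.0):
    print(delay, "ps/inch ->", round(max_lumped_length_in(1.0, delay), 2), "inches")
print(is_transmission_line(6, 1.0))  # six-inch trace, 1-ns edge
```

For a 1-ns transition the limit comes out between about 2.8 and 3.6 inches across the quoted delay range, which is consistent with the 3-inch guideline.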

Terminated Transmission Lines • Reflections and ringing • Traditional Thevenin termination at the end [Figure labels: VCC, 100 Ω, 100 Ω, 50 Ω line] • Dynamic termination at the end is better and saves power [Figure labels: 50 Ω line, 50 Ω, 100 pF] • Series termination at the source is best • single source and destination only! [Figure labels: 22 Ω, 27 Ω, 50 Ω line (50 Ω total)]

On-Chip Clock Distribution • Clock distribution introduces delay • larger chips suffer more clock delay [Figure labels: Clock, CLB, Data, IOB]

Clock Delay Problems • Clock delay increases clock-to-output times • Clock delay leads to unacceptable input hold time • set-up time is negative • Additional data delay can eliminate the hold time • set-up time becomes positive • but tolerance build-up widens the data-valid window [Figure: IOB flip-flop (D, Q) with a delay element in the data path and the clock distribution delay in the clock path; timing diagram shows Clock, Required Data Valid (without delay), and Required Data Valid (with delay)]

DLLs Maximize I/O Speed • Clock-to-output time plus set-up time determines the I/O speed and data bandwidth • min clock period = max clock-to-out + max set-up • Traditional solution: • use highly buffered, balanced clock trees • needed to reduce internal clock skew • cannot totally eliminate the delay • The Virtex solution: • use a Delay-Locked-Loop ( DLL ) • aligns the internal and external clocks • effectively eliminates the clock-distribution delay

Virtex Has 4 Independent DLLs • DLLs adjust clock delay to align internal and external clocks • digital closed-loop control • 25 to 200-MHz range, 35-picosecond resolution [Figure: DLL block diagram; labels: Clock, Delay, Error Comparator, CLB, IOB, Data]

Fast Clock-to-Out With DLL • 160 MHz inter-chip data rate • 16-mA LVTTL • IOB register to IOB register [Figure: two Virtex FPGAs, each with a DLL, on a common clock; register-to-register path (D, Q) labelled 0.5 ns, 3.8 ns, 1.9 ns]
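
The 160 MHz figure is consistent with adding up the three delays labelled in the diagram (0.5 ns, 3.8 ns, and 1.9 ns). The sketch below assumes those labels are the clock-to-out time, the board delay, and the set-up time of the register-to-register path; which label is which is a guess, but the sum does not depend on it. The function name is hypothetical.

```python
def max_data_rate_mhz(clock_to_out_ns, board_delay_ns, setup_ns):
    """Data rate limit: min clock period = clock-to-out + board delay + set-up."""
    min_period_ns = clock_to_out_ns + board_delay_ns + setup_ns
    return 1000.0 / min_period_ns


# Delay values taken from the slide's diagram; their roles are assumed.
print(round(max_data_rate_mhz(3.8, 0.5, 1.9), 1), "MHz")
```

With a 6.2 ns minimum period the limit is just over 160 MHz.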

LVTTL Data Rate with DLL • 1.4 ns measured clock-to-output delay • Output standard = LVTTL Fast 16 mA (OBUF_F_16) • Temp = 100 °C, Vdd = 2.375 V, Vcco = 3.3 V • Waveforms: 1: CLKIN, 2: DATA OUT (no DLL), 3: DATA OUT (DLL deskewed) • Timing without DLL: r->r 3.9 ns, r->f 3.9 ns • Timing with DLL: r->r 1.4 ns, r->f 1.4 ns

Other DLL Functions • Double the incoming clock frequency • fast internal operation – slow external clock • Clock mirroring to the PCB • Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 • Adjust clock duty cycle to 50-50 • Create four quadrature clock phases • input four sequential bits per clock period

Duty Cycle Correction • ~25% duty cycle in – 50% duty cycle out [Figure: Virtex FPGA with a 1X DLL; input 25 MHz at 25% duty cycle, output 25 MHz at 50% duty cycle]

Clock Doubling and Mirroring • Clock mirror with less than 100 ps skew • simplifies PCB clock distribution [Figure: actual HDTV customer example with a Virtex FPGA and an SDRAM; labels: 37 MHz system clock, 1 input load, DLL 1, 74 MHz #1, DLL 2, 74 MHz #2, 74 MHz internal, 37 MHz internal, zero-delay internal clock buffer, exactly aligned]

Precise Clock Mirroring • 2x system clock for board use [Figure: Virtex FPGA with a 2X DLL; 66 MHz clock in, 132 MHz clock out]

Clock Division • Divide clock by 1.5, 2, 2.5, 3, 4, 5, 8, or 16 • maintain synchronous edges [Figure: waveforms for CLKIn 200 MHz, CLKout 200 MHz, CLKDV 12.5 MHz]
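
As a quick check of the example on this slide, the listed divisors can be applied to a 200 MHz input. This is a small sketch with a hypothetical helper name; it only does the arithmetic and says nothing about how the DLL is configured.

```python
DIVISORS = (1.5, 2, 2.5, 3, 4, 5, 8, 16)  # division ratios listed on the slide


def divided_clocks_mhz(clkin_mhz):
    """Map each supported divisor to the resulting divided-clock frequency."""
    return {d: clkin_mhz / d for d in DIVISORS}


for divisor, freq in divided_clocks_mhz(200.0).items():
    print(f"divide by {divisor}: {freq:.2f} MHz")
```

Dividing 200 MHz by 16 gives the 12.5 MHz CLKDV shown in the figure.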

Multi-Standard SelectI/O [Figure: devices and interfaces around the FPGA; labels: MicroProcessor, SRAM, SDRAM, SDRAM, FLASH, Mixed Signal, DSP, Busses/Backplanes (3/5V PCI, ISA, GTL…), GTL+, 2.5V, SSTL, 1.8V, 5V Tolerant, 5V, 3.3V LVTTL]

Mix & Match Output Standards • User-supplied voltages determine output swing • 3.3 V, 2.5 V, 1.5 V • one voltage per bank • a bank is half of a chip edge • Output characteristics are programmable on a per-pin basis • push-pull or open-drain • LVTTL drive strength • 2-mA to 24-mA sink and source current • LVTTL slew rate

Mix & Match Input Standards • Internal or user-supplied threshold voltage • selectable on a per-pin basis • one user-supplied threshold voltage per bank • Programmable over-voltage protection • 5-V tolerant or diode clamp to VCCO • selectable on a per-pin basis [Figure: bank of inputs with an internal reference and VREF pins]

SSTL Clock-to-Out With DLL • 200 MHz inter-chip data rate • SSTL 3, Class II (Stub Series Terminated Logic) • IOB register to IOB register [Figure: two Virtex FPGAs, each with a DLL, on a common clock; register-to-register path (D, Q) labelled 0.3 ns, 2.8 ns, 1.9 ns]

SSTL Data Rate with DLL • 1.3 ns measured clock-to-output delay • much lower noise than LVTTL • Output standard = SSTL 3 Class 2 (OBUF_SSTL3_II) • Temp = 100 °C, Vdd = 2.375 V, Vcco = 3.3 V, Vtt = 1.5 V • Waveforms: 1: CLKIN, 2: DATA OUT (no DLL), 3: DATA OUT (DLL deskewed) • Timing without DLL: r->r 3.5 ns, r->f 3.8 ns • Timing with DLL: r->r 1.1 ns, r->f 1.3 ns

From FPGA to System Component: ‘Redefining the FPGA’ • "Virtex moves FPGAs from glue to system component" - Ron Neale, EE [Figure labels: Cache SRAM (Mbytes), Chip 1, Chip 1, SDRAM (133MHz), Low Voltage CPU, High Speed System Backplane, x2 CLK, x1 CLK, LVCMOS, SSTL3, LVTTL, GTL+]

Power and Thermal Issues • Power and heat are serious concerns • All CMOS power consumption is dynamic • proportional to VCC² • proportional to capacitance • proportional to frequency • Virtex conserves power • 2.5-V supply voltage • small geometries and short interconnects reduce capacitance
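
Because dynamic power scales with the square of the supply voltage, the benefit of the 2.5-V supply can be estimated directly. The sketch below is illustrative only: the function name is hypothetical, and it holds capacitance and frequency constant, which a real process change would not.

```python
def relative_dynamic_power(v_new, v_old):
    """Ratio of dynamic power at v_new to that at v_old, same C and f (P ~ V^2)."""
    return (v_new / v_old) ** 2


print(round(relative_dynamic_power(2.5, 3.3), 3))  # versus a 3.3-V supply
print(relative_dynamic_power(2.5, 5.0))            # versus a 5-V supply
```

At equal capacitance and frequency, a 2.5-V core dissipates a little under 60% of the power of a 3.3-V core and a quarter of that of a 5-V core.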

Virtex Power Consumption • Virtex is designed to conserve power • 100 MHz 16-bit counters • 12.5 MHz average transition rate • 6.5 mW per counter including clock distribution • 100 MHz 8-bit counters • 25 MHz average transition rate • 5 mW per counter including clock distribution • XCV300 • 384 16-bit counters: 2.5 W total • 768 8-bit counters: 3.7 W total • XCV1000 • 1536 16-bit counters: 9.8 W total • 3072 8-bit counters: 14.7 W total
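
The average transition rates quoted here follow from how a binary counter toggles: bit 0 changes every clock, bit 1 every second clock, and so on. The sketch below reproduces that arithmetic and multiplies out the per-counter power; the function names are hypothetical, and the per-counter figures on the slide are rounded, so the products only approximate the quoted totals.

```python
def average_transition_rate_mhz(bits, clock_mhz):
    """Mean toggle rate per bit of a binary counter; bit i toggles every 2**i clocks."""
    toggles_per_clock = sum(0.5 ** i for i in range(bits))
    return clock_mhz * toggles_per_clock / bits


def total_power_w(counters, mw_per_counter):
    """Total power for a number of identical counters, in watts."""
    return counters * mw_per_counter / 1000.0


print(round(average_transition_rate_mhz(16, 100), 2), "MHz for 16-bit counters")
print(round(average_transition_rate_mhz(8, 100), 2), "MHz for 8-bit counters")
print(round(total_power_w(384, 6.5), 2), "W for 384 16-bit counters")
print(round(total_power_w(768, 5.0), 2), "W for 768 8-bit counters")
```

The rates come out at about 12.5 MHz and 24.9 MHz. The 16-bit total matches the 2.5 W on the slide, while 768 × 5 mW gives 3.84 W against the quoted 3.7 W, which suggests the 5 mW figure is rounded up.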

Thermal Management • Temperature-sensing diode • matched to the Maxim MAX1617 A/D • programmable alarms • similar to the Pentium II solution [Figure: Virtex FPGA connected to a Maxim MAX1617; labels: DXP, DXN, SMBCLK, SMBDATA, ALERT]

Power Supply Decoupling • CMOS power-supply current is dynamic • current pulse every active clock edge • Peak current can be 5x the average current • instantaneous current peaks can only be supplied by decoupling capacitors • Use one 0.1 µF ceramic chip capacitor for each power-supply pin • low L and R are more important than high C • double up for lower L and R if necessary • use direct vias to the supply planes, close to the power-supply pins

Virtex Configuration • New byte-wide SelectMAP mode • up to 528 Mbps at 66 MHz • simple handshake protocol • up to 400 Mbps at 50 MHz • no handshake required • Configuration bit-stream length • 0.5 Mbits to 6.1 Mbits [Figure: control logic (EPLD) driving Address to a configuration EPROM and WE, CS to the Virtex FPGA; EPROM Data to the FPGA; Busy and CS signals]
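
The bandwidth figures above are simply the byte-wide bus times the configuration clock, so best-case configuration time is easy to estimate. This sketch uses hypothetical function names and assumes one byte is accepted on every clock with no stalls and no start-up overhead, so real times would be somewhat longer.

```python
def selectmap_rate_mbps(clock_mhz, bus_bits=8):
    """Peak SelectMAP throughput: one byte-wide transfer per clock."""
    return clock_mhz * bus_bits


def config_time_ms(bitstream_mbits, rate_mbps):
    """Best-case time to load a bit-stream at the given rate, in ms."""
    return bitstream_mbits / rate_mbps * 1000.0


for clock in (66, 50):
    rate = selectmap_rate_mbps(clock)
    print(clock, "MHz:", rate, "Mbps,",
          round(config_time_ms(0.5, rate), 2), "to",
          round(config_time_ms(6.1, rate), 2), "ms")
```

Under these assumptions even the largest 6.1-Mbit bit-stream loads in roughly 12 ms at 66 MHz and about 15 ms at 50 MHz.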

Volts, Amps, and Watts: Recap • PCB design issues • minimize capacitance for higher speed • terminate transmission lines to reduce ringing • Chip inputs and outputs • use DLLs to maximize I/O bandwidth • use SelectI/O to interface with different standards • Power and thermal considerations • use the sensing diode to manage chip temperature • decouple the power supply well • Configuration • configure faster with the SelectMAP mode

Designing for 100+ MHz. • Volts, Amps, and Watts • PCB signal distribution • chip inputs and outputs • power and thermal considerations • Ones and zeros • logic emulation • Bits and bytes • memory hierarchy

Spending the 10 ns Budget • Fast logic requires fast function generators • signals often pass through several function generators • Routing delays must also be kept short • there are routing delays between every function generator • Arithmetic delays are important • carry chains often create critical paths

You Don’t Have To Be An Expert • You don’t have to be an FPGA architecture expert to implement high-performance designs • the benefits of a good architecture are automatic • all the logic goes faster • software provides easy access to the features • You can achieve high performance only with a good FPGA architecture • a good FPGA empowers its users • You’ll design better if you know the architecture • matching your design style to the available features increases performance and/or lowers cost

Virtex CLB • Logic and arithmetic delay reduction demands improvements in the CLB • Virtex CLB is divided into two slices, each with: • 2 function generators • 2 flip-flops • 2 bits of carry logic [Figure: CLB with two slices, each showing two function generators and two carry blocks]

Fast Function Generators • Each function generator emulates 2 to 3 levels of logic • a 10-level logic path typically requires 3 to 5 Function Generators in series • at 100 MHz, they must be less than 2 ns each including the routing • Virtex has 0.6-ns function generators • leaves 1.4 ns for each route
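
The budget on this slide can be written out as arithmetic. The sketch below is a simplification in the same spirit as the slide: it divides the clock period evenly across the function generators in series and ignores flip-flop clock-to-out and set-up times. The function names are hypothetical.

```python
def per_stage_budget_ns(clock_mhz, stages):
    """Clock period divided evenly across function generators in series."""
    return 1000.0 / clock_mhz / stages


def routing_budget_ns(clock_mhz, stages, function_generator_ns=0.6):
    """Time left for each route after the function-generator delay."""
    return per_stage_budget_ns(clock_mhz, stages) - function_generator_ns


for stages in (3, 4, 5):
    print(stages, "stages:",
          round(per_stage_budget_ns(100, stages), 2), "ns per stage,",
          round(routing_budget_ns(100, stages), 2), "ns per route")
```

Five function generators in series at 100 MHz is the worst case quoted: 2 ns per stage, leaving 1.4 ns for each route.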

Connecting Function Generators • Some functions need several function generators • F5 MUXs connect pairs of function generators • functions with 5 to 9 inputs • F6 MUXs connect all 4 function generators • functions with 6 to 17 inputs [Figure: four function generators combined in pairs by two F5 MUXs, which feed one F6 MUX]
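
The F5 idea can be modelled in software: two 4-input look-up tables hold the two halves of a truth table, and a multiplexer driven by the fifth input picks between them. The sketch below is a behavioural illustration only, with hypothetical names, and it assumes 4-input function generators; it is not a description of the silicon.

```python
from itertools import product


def lut4(table, a, b, c, d):
    """4-input look-up table: 'table' is a 16-bit truth table indexed by the inputs."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> index) & 1


def make_table(func, e):
    """Build the 16-bit truth table of func(a, b, c, d, e) with e held fixed."""
    table = 0
    for d, c, b, a in product((0, 1), repeat=4):
        if func(a, b, c, d, e):
            table |= 1 << (a | (b << 1) | (c << 2) | (d << 3))
    return table


def f5_function(func):
    """Implement any 5-input function as two LUT4s plus a 2:1 mux on input e."""
    low, high = make_table(func, 0), make_table(func, 1)

    def evaluate(a, b, c, d, e):
        return lut4(high if e else low, a, b, c, d)

    return evaluate


# Example: 5-input parity built from two 4-input tables.
parity5 = f5_function(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
print(parity5(1, 0, 1, 1, 0))
```

The same construction applied once more, with a second select input choosing between two such pairs, corresponds to the F6 case.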

Fast Local Routing • Local routing provides fast interconnects • in a CLB, Function Generators connect with minimal routing delays • fast paths between adjacent CLBs increase flexibility [Figure: two adjacent CLBs, each with four function generators and carry logic]

Use Pipelining for Speed • Shorter clock periods mean doing less each period • create a pipeline structure • pipeline stages operate concurrently • more functions are done at the same time • throughput increases • All function generators have output flip-flops • most pipeline support is “free”

16-Bit Pipeline in One LUT • In directly cascaded pipelines the flip-flops are not free • One SRLUT can implement up to 16 bits of delay • shift data in and select the appropriate tap [Figure: 16-bit shift register with Input, Delay Select, and Output]
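
The "shift data in and select the appropriate tap" behaviour can be sketched as a small software model. The class below is hypothetical and purely behavioural: it assumes tap 0 holds the most recently shifted bit, so tap t returns the input from t + 1 clocks ago.

```python
class ShiftRegisterLUT:
    """Behavioural model of a 16-bit shift register with a selectable output tap."""

    def __init__(self, depth=16):
        self.depth = depth
        self.bits = [0] * depth  # bits[0] is the newest bit

    def clock(self, data_in):
        """Shift one bit in; the oldest bit falls off the end."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, tap):
        """Return the bit at the selected tap (0 = newest, depth - 1 = oldest)."""
        if not 0 <= tap < self.depth:
            raise ValueError("tap out of range")
        return self.bits[tap]


srl = ShiftRegisterLUT()
for bit in (1, 0, 1, 1):
    srl.clock(bit)
print([srl.read(tap) for tap in range(4)])
```

Changing the tap changes the delay without moving any data, which is why a pipeline delay of up to 16 clocks fits in a single look-up table.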

Fast Logic Needs Fast Routing • Our typical design with 3 to 5 CLBs needed an average routing delay of 1.4 ns or less • the Virtex routing architecture delivers this performance • Delay is independent of direction • dependably short delays

Go Farther, Faster • Virtex achieves its speed through a hierarchy of highly buffered routing resources • wires span 1, 2, or 6 CLBs • The Virtex routing architecture is designed for large arrays • today’s FPGAs are big… but tomorrow’s will be even bigger • Virtex is designed to maintain its performance even in very large arrays

No Routing Congestion • For high-speed applications, routing must be dependably fast • not just capable of being fast • In the past, high device utilization has caused routing congestion • critical nets might be forced to meander • Virtex minimizes these problems • abundant resources prevent congestion • If it needs to be fast, it will be fast – automatically!

Built-in Tri-State Busses • Bi-directional busses are supported directly by tri-state buffers built into each CLB • two drivers per CLB • segmentable every four CLB columns [Figure: row of CLBs sharing a tri-state bus]
