
Structured Hardware Design




  1. Structured Hardware Design Ian Pratt University of Cambridge Computer Laboratory Ian.pratt@cl.cam.ac.uk

  2. Designing Hardware Systems • A good design should work first time • Simulation • Verification • Testing • Top-down methodology • Decompose into modules • Modules • Well-defined functions and interfaces • Often different technologies • Using pre-existing modules desirable

  3. Broadside components • Bus: • parallel signals carrying a binary number • Represented with thick lines • Broadside components: • Building block is instantiated once for each wire in the bus • Building block inputs and outputs connected to the corresponding members of the buses • Control connections are wired in parallel • Registers, buffers, multiplexors

  4. Read-Only Memories • Non-volatile, but typically slow • Mask programmable • Cheapest in mass production by far • One-time programmable (PROM) • UV Erasable (EPROM) • Electrically re-programmable (e.g. FLASH) • Expensive, but many rewrite cycles possible • ‘Field upgrades’ possible • Choose technology based on #units required and #rewrite cycles expected

  5. DRAM • Each bit stored in a small capacitor (1T) • Needs refreshing periodically • ‘Recovery time’ required after reads • Bits arranged in a square array • Accessed by row, column (multiplexed address bus) • Typically 1,4,8 bits wide • E.g.: 8Mbx8 (64Mbit) 50ns access time • New parts have synchronous interface • SDRAM / DDR / RAMBUS (still same core) • Modules E.g.: 16Mbx64 100MHz SDRAM • Made from eight 8Mbx8 parts on a PCB (DIMM)

  6. SRAM • Transparent latch per bit (6T) • Not as dense as DRAM, more expensive • Fast (7-50ns) access times • Used in caches • Easy to use – no refresh to worry about • Non-multiplexed address bus • Modern parts have synchronous interfaces • Pipelined design • E.g.: 256Kbx32 (8Mb) 10ns

  7. Clock generation • RC oscillators rather inaccurate, but cheap • Quartz crystal oscillators commonplace • Require a little care to make work • Accurate to ~50ppm • Clock multiplication • Phase Locked Loop (PLL) • E.g.: 133MHz x 7.5 = 997.5MHz (Pentium III) • Clock distribution trees • Buffers, or PLLs to get zero propagation delay

  8. Miscellaneous • Power-on reset • Release reset after power stable • Get all flip-flops into known state • (manual reset by shorting capacitor) • Relays can be used to switch large loads • (alternative is to use power transistors) • Must protect transistor with a diode • Mechanical switches ‘bounce’ when switching • Use a 2-pole switch and RS latch

  9. ALUs • Combinatorial logic implementation • Takes two N-bit inputs and function selector • Propagation delay typically determined by carry chain • Typically twos-complement representation • ADD, ADC, SUB, NOT, AND, OR, BIC,… • Flags: Carry-out, Negative, Overflow, Zero • Output will typically be latched, along with flag status results
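The ALU behaviour listed on this slide can be sketched in software. The following is a minimal illustrative model, not the lecture's own design: the function name `alu`, the default width of 8 bits, and the convention that SUB sets carry to "no borrow" are all assumptions made for the example.

```python
def alu(op, a, b, carry_in=0, n=8):
    """Model of an n-bit twos-complement ALU (illustrative sketch).

    Returns (result, flags) where flags holds Carry, Negative,
    oVerflow and Zero, matching the flag list on the slide.
    """
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    a &= mask
    b &= mask
    carry = 0
    overflow = 0

    if op in ("ADD", "ADC", "SUB"):
        if op == "SUB":
            # a - b is computed as a + (~b) + 1, reusing the adder.
            # Assumption: carry-out then means "no borrow occurred".
            operand, cin = (~b) & mask, 1
        elif op == "ADC":
            operand, cin = b, carry_in & 1
        else:
            operand, cin = b, 0
        total = a + operand + cin
        result = total & mask
        carry = (total >> n) & 1
        # Signed overflow: adder inputs share a sign that the result lacks.
        overflow = 1 if (~(a ^ operand) & (a ^ result) & sign) else 0
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "BIC":
        # Bit clear: clear the bits of a that are set in b.
        result = a & ~b & mask
    elif op == "NOT":
        result = ~a & mask
    else:
        raise ValueError("unknown ALU operation: " + op)

    flags = {
        "C": carry,
        "N": 1 if (result & sign) else 0,
        "V": overflow,
        "Z": 1 if result == 0 else 0,
    }
    return result, flags
```

For example, `alu("ADD", 0x7F, 0x01)` returns `0x80` with N and V set, since adding one to the largest positive 8-bit value overflows into the sign bit.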

  10. Microprocessors • Simple microprocessor control signals: • Inputs: Clock, Reset • Output: Request, Read/nWrite, Addr<0..N> • InOut: Data<0..M> • Read cycles to fetch instructions and load data • Write cycles when updating memory • Begins execution by fetching from reset location • PC incremented unless branch/jump instruction

  11. Address decoding • Devising a memory map for a design • Address that memory/peripherals are available at • Non-volatile memory typically mapped at the reset location • Use combinatorial function of high-order address bits to generate enable signals • Devise memory map for decoding convenience
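As a rough illustration of decoding high-order address bits into enable signals, the sketch below assumes a hypothetical 16-bit address bus whose top two bits select a 16KB region: ROM at the reset location (assumed here to be address 0), then RAM, then I/O. The map, the signal names and the function `decode` are invented for the example.

```python
def decode(addr):
    """Generate chip enables from the top two bits of a 16-bit address.

    Hypothetical memory map (chosen so decoding needs only A15 and A14):
      0x0000-0x3FFF  ROM (contains the assumed reset location 0x0000)
      0x4000-0x7FFF  RAM
      0x8000-0xBFFF  I/O peripherals
      0xC000-0xFFFF  unused
    """
    top = (addr >> 14) & 0b11
    return {
        "rom_en": top == 0b00,
        "ram_en": top == 0b01,
        "io_en": top == 0b10,
    }
```

Because each region starts on a power-of-two boundary, every enable is a simple combinatorial function of two address lines, which is the kind of convenience the last bullet refers to.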

  12. The PC as a component • Motherboard cost ~£30-100 • 4+ wiring layers in PCB • CPU, DRAM, keyboard, USB, VGA, IDE, floppy, serial, parallel, audio, IRDA • Cheap general purpose platform for supporting other hardware • System-on-a-chip (SOC) implementations available soon

  13. Interconnecting Modules • How much data in bps needs to flow? • Will the connection be synchronous or async? • Is flow-control needed to limit the flow? • How long do the wires need to reach? • Is the topology fixed at design time? • Is hot-plugging needed? • Can we use an existing design?

  14. PC Parallel Port • 8 data wires, 3 control wires • Unidirectional in its most basic form • Flow-control mechanism • Master drives data then asserts strobe_bar • Slave asserts acknowledge • Slave optionally asserts busy • When both busy and acknowledge are deasserted master can send another byte

  15. RS232 Serial Ports • Asynchronous bit stream • One wire for each direction plus ground • Start, data, parity, stop • Start bits assist clock recovery • Baud rate (e.g. 300, 1200, 9600, 115200) • Various flow-control schemes • s/w: XOn/XOff characters • h/w: CTS/RTS signals • Excellent for simple debugging support
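The start/data/parity/stop framing can be sketched as a function that turns one byte into the sequence of logical bit values sent on the wire. This is an illustrative model only: the name `frame_byte`, the choice of 8 data bits sent least-significant bit first, and the use of logical levels (idle and stop as 1, start as 0) rather than line voltages are assumptions.

```python
def frame_byte(byte, parity="even", stop_bits=1):
    """Return the logical bit sequence for one asynchronous serial frame.

    parity may be "even", "odd" or None (no parity bit).
    """
    data = [(byte >> i) & 1 for i in range(8)]  # least significant bit first
    bits = [0] + data                           # start bit, then data bits
    if parity in ("even", "odd"):
        p = sum(data) % 2                       # makes the count of 1s even
        if parity == "odd":
            p ^= 1
        bits.append(p)
    elif parity is not None:
        raise ValueError("parity must be 'even', 'odd' or None")
    bits += [1] * stop_bits                     # stop bit(s) return line to idle
    return bits
```

With even parity and one stop bit, each byte occupies 11 bit times, so the usable byte rate is the baud rate divided by 11.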

  16. Finite State Machines • Building everything from FSMs • Avoid generated clocks / async resets • Avoid loops in combinatorial logic • Current CAD tools only work with FSMs • Timing specifications: • Tck_to_out, Tsetup, Thold, Tprop • Beware of long Thold’s • Use Moore outputs between modules • Easier to characterize delay into next module • Critical path is longest logic path ending in an FF • Determines maximum clock speed

  17. Johnson Counters • Traditional binary counters require long logic paths for high-order bits • Limit clock frequency • Johnson counters are based on shift registers with feedback • E.g. using a NOR gate for a /5 with 3 FFs • Clock prescalers – easy clock output • PRBS counter (XOR) 2^n-1 with n FFs
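Both shift-register counters mentioned here can be simulated in a few lines. The wiring below is one possible arrangement and is an assumption, not taken from the slide: the divide-by-5 feeds the NOR of the last two flip-flops back into the first, and the PRBS counter is a 4-bit XOR-feedback shift register with taps chosen to give the maximal period.

```python
def nor_divide_by_5(steps, start=(0, 0, 0)):
    """Simulate a 3-FF shift register whose input is NOR(Q1, Q2)."""
    q = start
    seq = []
    for _ in range(steps):
        seq.append(q)
        q0, q1, q2 = q
        q = (int(not (q1 or q2)), q0, q1)  # shift right, NOR feedback into Q0
    return seq


def lfsr_period(n_bits=4, taps=(3, 2), seed=1):
    """Count clocks until an XOR-feedback shift register returns to seed."""
    mask = (1 << n_bits) - 1
    state = seed
    count = 0
    while True:
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1          # XOR of the tapped bits
        state = ((state << 1) | fb) & mask  # shift left, feedback into bit 0
        count += 1
        if state == seed:
            return count
```

In this model the NOR counter cycles through five states (000, 100, 110, 011, 001), and the three unused states fall back into that cycle within two clocks. The 4-bit XOR counter visits 15 states, i.e. 2^4 - 1, because the all-zero state is a lock-up state and is excluded.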

  18. One Hot Coding • FSM encoding using 1FF per state • Single FF set, others all clear • Uses more FFs than necessary, but: • Only very simple decode logic required • High clock speeds • Particularly useful in FPGAs
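To show why one-hot decode logic is so simple, here is a small sketch of a three-state machine with one flip-flop per state. The machine itself (IDLE, RUN, DONE with `start` and `finish` inputs) is hypothetical and exists only to illustrate the encoding.

```python
def next_state(onehot, start, finish):
    """Next-state logic for a hypothetical IDLE -> RUN -> DONE -> IDLE FSM.

    onehot is a tuple (idle, run, done) with exactly one bit set.
    Each next-state bit is a short sum of products of state bits and inputs.
    """
    idle, run, done = onehot
    n_idle = (idle and not start) or done
    n_run = (idle and start) or (run and not finish)
    n_done = run and finish
    return (int(bool(n_idle)), int(bool(n_run)), int(bool(n_done)))
```

Testing "are we in RUN?" is just reading one flip-flop output, with no decoding of a binary state number, which is what keeps the logic paths short.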

  19. Pipelining • Split combinatorial logic into stages separated by FFs • Enables increased clock speed • Improved throughput • but, increases delay: • Tsetup + Tclock_to_out of each FF • Unbalanced pipeline stages • Feedback paths can make life tricky… • CAD tools can help distribute FFs
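The throughput and delay trade-off can be made concrete with a small calculation. This sketch assumes the usual register-to-register constraint (clock period at least Tclock_to_out plus logic delay plus Tsetup); the function names and all timing figures are invented for illustration.

```python
def min_period(t_ck_to_out, t_logic, t_setup):
    """Shortest clock period for one register-to-register path (ns)."""
    return t_ck_to_out + t_logic + t_setup


def pipeline_summary(stage_logic_ns, t_ck_to_out=1.0, t_setup=0.5):
    """Clock period is set by the slowest stage; latency is period x stages."""
    period = max(min_period(t_ck_to_out, t, t_setup) for t in stage_logic_ns)
    return {
        "period_ns": period,
        "f_max_mhz": 1000.0 / period,
        "latency_ns": period * len(stage_logic_ns),
    }
```

With these hypothetical numbers, 12ns of logic in one stage gives a 13.5ns period and 13.5ns latency. Splitting it into three balanced 4ns stages cuts the period to 5.5ns but raises latency to 16.5ns. An unbalanced split of 2, 4 and 6ns is limited by the 6ns stage: 7.5ns period and 22.5ns latency.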

  20. Gated & Guarded Clocks • Clock Enable ‘safer’ than derived clocks • Internal multiplexor selects between Din and Q • But, power is proportional to clock freq, so in some designs it is necessary to: • Gate lower frequency clocks • Turn off clocks to currently idle units • When necessary, create clock by OR’ing clock with synchronised enable_bar

  21. Clock and Data Skew • Skew: when the same signal arrives at different places at slightly different times • The enemy of synchronous design… • Clock signals are especially vulnerable • Early clock can cause setup time violation on critical paths • Late clock can allow output of previous stage to race into this one (hold time violation) • Take special care routing clocks!

  22. Crossing Clock Domains • Setup/hold time violations unavoidable • Metastability can occur, but typically only briefly • Allow extra time for setup into next FF • Or, use 2FFs for safety • Synchronize each signal at a single point • Can use guard signal for buses • Guard indicates when bus is safe to sample • Or, FIFOs with separate read/write clocks

  23. FSM clocks derived from another FSM • When it’s necessary to use derived clocks: • Use a Moore output to clock slave • Function should be hazard free • Be careful to avoid races with other outputs connected to slave • Mustn’t change at same time as clock • Outputs from slave back to master may restrict max clock rate

  24. Integrated Circuits • Si or GaAs substrate with implants • 200/300mm wafers, 0.3mm thick • Only the top few microns ‘active’ • Ion implant and etching steps, controlled via stencils created by exposing a photo-resistive coating to UV / X-rays via a mask generated by CAD tools • 7-30+ different masks used • Masks stepped over wafer for each die • 4-500mm² die size

  25. CMOS Technology • nMOS, CMOS, ECL (Bipolar) • CMOS most popular (and best supported) • Feature size – reduces at 10-20% p.a. • Smaller → faster, lower power, higher density • 0.5, 0.35, 0.25, 0.18, 0.15, 0.13μm • Max die size increasing at 10-25% p.a. • Number of available T’s increasing at 60-80% p.a. • 2-7 metal wiring layers. Al (or now Cu) • Separate processes for DRAM, logic, analog

  26. Pads and IO • Pad ring around edge of die • Pads are typically 50 micron square • Contain high-power drive outputs and ESD protection circuitry • Power / ground ring around pads • Gold bond wires connect to package pins • Up to 1000+ pins (with expensive packaging) • Packaging eases handling and dissipates heat • Core bound vs. Pad bound designs

  27. Chip costs • Non Recurring Expenditure (NRE) • Design costs (labour, tools, overheads...) • Mask making costs • Per device costs • Raw wafer, Processing, Testing, Packaging • Influenced by yield • P(die defect free) = K^(die area) • K is probability that any given mm² is defect free
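A quick calculation shows how strongly die area affects yield. This sketch reads the yield relation on this slide as K raised to the power of the die area in mm², which assumes defects in each mm² are independent. The function names and all figures are illustrative assumptions.

```python
def die_yield(k_per_mm2, area_mm2):
    """Probability a die is defect free, assuming independent mm^2 regions."""
    return k_per_mm2 ** area_mm2


def cost_per_good_die(wafer_cost, dies_per_wafer, k_per_mm2, area_mm2):
    """Processed-wafer cost spread over the dies expected to be defect free."""
    good_dies = dies_per_wafer * die_yield(k_per_mm2, area_mm2)
    return wafer_cost / good_dies
```

With a hypothetical K of 0.99, a 100mm² die yields about 37%, while doubling the area to 200mm² squares that figure to roughly 13%, so the cost per good die rises much faster than the area.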

  28. Taxonomy of ICs • Standard parts (off-the-shelf, datasheet available) • Full-custom ASICs • For best performance, but greatest NRE • CPUs, memory, DSPs • Semi-custom standard cell ASICs • Designed from a library of standard gates/cores • Semi-custom gate array ASICs • Only a few masks required, but inefficient • Field programmable parts • FPGAs, PALs

  29. Field Programmable Gate Arrays • Volatile, re-programmable & OTP types • All programmable in situ • Array of Configurable Logic Blocks (CLBs) and switch matrices (configurable wiring with buffers) • IO Blocks (IOBs) around edge of die • CLB typically consists of LookUp Tables (LUTs), 1-2 FFs and programmable MUXs • 16x1 LUT (SRAM) implements any fn of 4 variables • Allowing writes to LUT enables use as RAM • Switch matrices provide hierarchical routing
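The claim that a 16x1 LUT implements any function of 4 variables can be illustrated directly: the four inputs form an address into a 16-entry, 1-bit-wide memory holding the truth table. The helper names `make_lut` and `lut_read` and the input ordering are assumptions made for this sketch.

```python
def make_lut(fn):
    """Build the 16-entry truth table for a 4-input boolean function."""
    return [
        fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
        for i in range(16)
    ]


def lut_read(lut, a, b, c, d):
    """Evaluate the function: the inputs simply address the memory."""
    return lut[(a << 3) | (b << 2) | (c << 1) | d]
```

For instance, `make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)` yields a 4-input parity function. Since the table is ordinary storage, overwriting an entry at run time changes the function, which is the same mechanism that lets the LUT double as a small RAM.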

  30. Field Programmable Gate Arrays • Different families use different CLB sizes • Xilinx 4K series : 2x 4 input LUTs and 2x FFs • Others more or less fine grained • Very low NRE, rapid turnaround • Only requires a ‘place and route’ tool run • Great for prototypes, but parts typically cost 10x more than equivalent gate array • SRAM/Flash parts enable field upgrades • Switch to gate arrays in mature designs

  31. Programmable Array Logic Devices (PALs) • Programmable sum of products array feeding macrocells • Good for simple FSMs and glue logic • Macrocell enables combinatorial or registered output, usually tristateable • More complex devices also contain buried macrocells, and may organise macrocells into clusters with separate clock sources, sometimes called CPLDs (Complex Programmable Logic Devices) • New parts in-circuit-programmable, while others require a special programmer • JEDEC description file

  32. Delay and Power • Si/CMOS • nMOS/pMOS unipolar transistors, generally small • Power proportional to frequency • Si/BiCMOS • CMOS augmented with bipolar for driving large loads • Si/ECL • Bipolar transistors, kept unsaturated • x3 performance, but large static current • GaAs/MESFET/Bipolar • x10 performance, but yield generally poor • Up-coming technologies: SOI, SiGe

  33. Fanout and delay • Output stage speed decreases with load • Dominant aspect of load is Capacitance • Proportional to area of output conductor • Sum of input capacitances of devices driven • delay = intrinsic delay + (output load x derating factor) + propagation delay • Gate specification includes intrinsic delay, input loads and output derating figures
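The delay formula on this slide translates directly into a small calculation. In this sketch the output load is taken as the sum of the input loads of the driven gates, expressed in arbitrary "unit loads"; the function name, parameter names and numbers are illustrative assumptions rather than figures from any real gate library.

```python
def gate_delay(intrinsic_ns, input_loads, derating_ns_per_load, wire_prop_ns=0.0):
    """delay = intrinsic delay + (output load x derating factor) + propagation delay.

    input_loads lists the input load of every gate input being driven.
    """
    output_load = sum(input_loads)
    return intrinsic_ns + output_load * derating_ns_per_load + wire_prop_ns
```

For example, a gate with 0.5ns intrinsic delay driving three inputs of 1, 1 and 2 unit loads, with a derating of 0.25ns per unit load and 0.25ns of wire propagation, comes out at 1.75ns; adding further fanout increases the delay linearly.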

  34. Design Partitioning: h/w vs s/w • Hardware • Use where high throughput required, but • Harder to design and debug • Harder to modify • Software • Running on CPU(s) or microcontroller(s) • A whole PC; on a PCB; embedded on an ASIC • Better support for complexity • Field upgrades • Can help debug hardware

  35. Hardware partitioning • Partitioning logic over chips motivated by: • Availability of standard parts • Use existing parts wherever possible, especially for prototypes or low volume designs • Speed required by different function units • Use exotic technologies as sparingly as possible • Interconnection speed and width required • External interconnects much slower than on-chip and have limited pin count • ASIC size, pin count, power

  36. Logic Synthesis & Layout • Complex functions expressed algorithmically, then synthesized to gates • Good at ‘mechanical’ tasks on relatively small sections of a design • Critical sections of a design still done by hand • Place tool attempts to layout gates to minimize wiring paths • Route tool attempts to wire gates • Tools are continually improving • More feedback and integration between tools

  37. The Cambridge Fast Ring • 100MHz ECL chip implements: • Transceivers and serial de/modulator • ECL has good high-power line driving characteristics • Serial to parallel and parallel to serial • Byte alignment • CMOS chip, 50x more logic than ECL chip: • Media access control protocol / CRC generation • Small buffer memory / Host processor interface • Ring monitoring and maintenance • DRAM, VCO, PALs for glue logic to host iface

  38. External Modem • Analogue frontend to telephone line • Isolation, surge suppression, off-hook relay • Digital Signal Processor as Codec • Dedicated to a single task • Microcontroller for control • Talking to host, processing commands etc. • External NVRAM e.g. Flash to store state • RS232 Line drivers (+/- 12V) • Requires special fabrication process

  39. Scan multiplexing • Scan multiplexing saves wires (and thus pins) • Used for LEDs and switches (keyboards) • LED matrix • Drive column high, write pattern on row • Scan at >50Hz to avoid flicker • Drive LEDs hard to make bright • Pseudo dual porting enables pixel RAM to be updated • Keyboard matrix of push-to-make switches • Drive column high, read row • Pull down resistors keep row wires normally low
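The keyboard scanning procedure (drive one column high, read the rows) can be sketched as a loop. Here the hardware is stood in for by a hypothetical `read_rows(col)` callback that returns a bitmask of row wires reading high while the given column is driven; `make_matrix` is a software stand-in for the switch matrix, not a real driver.

```python
def make_matrix(closed_switches):
    """Simulate a matrix of push-to-make switches.

    closed_switches is a set of (row, col) pairs that are pressed.
    Rows not connected to the driven column read low (pull-down resistors).
    """
    def read_rows(col):
        mask = 0
        for row, c in closed_switches:
            if c == col:
                mask |= 1 << row
        return mask
    return read_rows


def scan_keyboard(read_rows, n_cols):
    """Drive each column in turn and collect the (row, col) keys pressed."""
    pressed = []
    for col in range(n_cols):
        rows = read_rows(col)
        row = 0
        while rows:
            if rows & 1:
                pressed.append((row, col))
            rows >>= 1
            row += 1
    return pressed
```

A matrix of R rows and C columns needs only R + C wires for R x C switches, which is the pin saving the first bullet describes.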

  40. Audio delay unit • Sample clock of 44.1kHz sufficient for audio • Single counter provides fixed delay • Read cycle followed by write to same location • Two counters (one loadable) and a mux enable variable delays • The lead the write counter has over the read counter sets the delay • Could use LFSR counters, but no need here • Could use DRAM, but SRAM easier and dense enough • Accesses unlikely to be to same page, hence slow • Could use small staging FIFOs to enable burst reads & writes • Audio so slow, we could use a microcontroller
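The single-counter fixed delay can be modelled as a circular buffer in which each sample period performs a read followed by a write to the same address. The class name `DelayLine` and the use of a Python list in place of the SRAM are assumptions for this sketch.

```python
class DelayLine:
    """Fixed delay of `length` samples using one address counter."""

    def __init__(self, length):
        self.ram = [0] * length   # stands in for the SRAM
        self.addr = 0             # the single address counter

    def step(self, sample):
        out = self.ram[self.addr]      # read cycle: oldest stored sample
        self.ram[self.addr] = sample   # write cycle: same location
        self.addr = (self.addr + 1) % len(self.ram)
        return out
```

The delay equals the memory length in samples, so at a 44.1kHz sample clock a buffer of 44100 locations would give a one-second delay.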

  41. Network Camera Device :1 • Standard parts for: • Video frontend and resizer, Audio digitizer • JPEG compression engine • 100Mb/s Network SERDES (de/serializer) • Three 8KBx8 SRAMs for scanline to tile conversion, controlled by PAL • Three 256KBx8 DRAM FIFOs for framebuffer • PAL for colour conversion / muxing (non compressed)

  42. Network Camera Device :2 • FPGA for assembling audio/video/CPU cells for TX • 2KBx8 dual ported SRAM acting as small 3 channel FIFO • FPGA for network interface control • MAC and CRC generation • Determines stream priority and reads cell out of SRAM and feeds it to SERDES (CoDec) • EPROM microcontroller • Communicates over network with management software • Co-ordinates frame capture and compression
