
Circuit Design for SRCMOS Asynchronous Wave Pipelines. Oliver Hauck, Integrated Circuits and Systems Lab, Departments of Computer Science and Electrical Engineering, Darmstadt University of Technology.



  1. Circuit Design for SRCMOS Asynchronous Wave Pipelines. Oliver Hauck, Integrated Circuits and Systems Lab, Departments of Computer Science and Electrical Engineering, Darmstadt University of Technology

  2. Outline • Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP) • Comparison: AWPs vs. sync, async, and sync wave pipes • AWP Circuit Design • Conclusion

  3. Pipelining • Pipelining is the premier technique for better exploiting hardware and boosting the performance of VLSI chips • Clocking overhead presents a serious threat to deeply pipelined systems built on sub-micron CMOS processes running at GHz frequencies

  4. General Framework for Pipelines [Diagram: Latch/Reg, Logic, Latch/Reg; signals: Data, Clk]

  5. Some Notations...

  6. General Relations

  7. Synchronous Pipeline [Diagram: Latch/Reg, Logic, Latch/Reg; signals: Data, Clk] • Throughput is determined by the longest logic path plus clock/register overhead • Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead • Negative side effects of gate-level pipelining: increased latency, clock load/skew, power, area, and design time; more area is spent on clocking and registers than on logic • Implementation options: register- vs. latch-based; explicit latches vs. latchless; TSPC vs. local clocks derived from the global clock; static vs. dynamic; single-ended vs. dual-rail

  8. Asynchronous Pipeline [Diagram: Handshake, Logic, Handshake; signals: Data, req_in, req_out, ack_in, ack_out] • Micropipeline (Sutherland 1989) • Synchronous clock replaced by asynchronous handshaking • Elastic operation: input and output rates may differ momentarily, and the pipeline will buffer • Implementation options: 4-phase (level) vs. 2-phase (event) protocol; bundled data (matched delay) vs. completion detection • Operation is data dependent, which saves power during idle periods • As with fine-grain sync pipelines, throughput can be high, but the handshake causes high latency and backward stalls • Plug & play composability • Load on req and ack lines is distributed • Used by Furber's group at Manchester U for AMULET1/2/3

  9. Synchronous Wave Pipeline [Diagram: Latch/Reg, Wave Logic, Latch/Reg; signals: Data, Clk] • Several data waves are simultaneously active in the logic • Logic has to minimize delay variations over P, T, V corners • Global clock used with constructive skew to adjust phases • Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area, and power • However, tuning the logic and the delay elements is difficult

  10. Wave Pipelining: A Short Outline • Wave pipelining occurs when combinational logic is clocked faster than its latency would allow • Several data waves are then active in the logic without being separated by storage elements • Latency remains constant and throughput is determined by delay differences rather than absolute delay • The requirement for delay-balanced logic and the complicated timing are the main hurdles
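
The claim that throughput depends on delay differences rather than absolute delay can be sketched numerically. This is a minimal sketch, not the relations from the slides (slides 5 and 6 are not preserved in this transcript): it assumes the commonly quoted first-order bound in which the minimum period is the spread between the longest and shortest path delay plus a lumped overhead term. The function names and the `overhead` parameter (standing in for setup, hold, and skew margins) are hypothetical.

```python
import math


def min_period_conventional(d_max, overhead):
    """Conventional pipeline stage: one wave must fully traverse the logic per cycle."""
    return d_max + overhead


def min_period_wave(d_max, d_min, overhead):
    """First-order wave-pipelining bound: only the delay spread limits the period."""
    if d_min > d_max:
        raise ValueError("d_min must not exceed d_max")
    return (d_max - d_min) + overhead


def waves_in_flight(d_max, period):
    """Number of data waves simultaneously active in the logic at a given period."""
    return math.ceil(d_max / period)


# Illustrative numbers in picoseconds (assumed, not taken from the slides).
d_max, d_min, overhead = 2000, 1600, 200
print(min_period_conventional(d_max, overhead))   # 2200
print(min_period_wave(d_max, d_min, overhead))    # 600
print(waves_in_flight(d_max, 600))                # 4
```

Tightening the spread between `d_min` and `d_max` is what shortens the period in this model, which is why the slides treat delay balancing as the central design problem.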

  11. Wave Pipelining: A Little History • The technique stems from the 1960s and has had a reputation for being exotic ever since • Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU • Some working academic chips exist, mainly datapaths • Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

  12. Asynchronous Wave Pipeline (AWP) [Diagram: Wave Latch, Wave Logic, Wave Latch; signals: Data, req_in, req_out; matched delay on the request path] • Data words are associated with events on the request line • Several data waves and protocol events are simultaneously active in the logic and the matched delay element, respectively • The AWP is a special case of the sync wave pipeline with the constructive skew set to the worst-case logic delay • It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
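
The role of the matched delay can be illustrated with a small timing check. This is a sketch under stated assumptions rather than the author's model: each data word is launched together with a request event, the request reaches the output wave latch after `d_match`, and the word is captured correctly if it has arrived by then and the following wave has not yet overrun it. The function name, the `hold` margin, and all numbers are hypothetical; setup time is ignored.

```python
def awp_capture_ok(launch_times, logic_delays, d_match, hold=0):
    """Check that every request event captures its own data word at the output latch.

    launch_times: time at which each word (and its request event) enters the logic
    logic_delays: actual logic delay experienced by each word
    d_match:      delay of the matched delay element on the request path
    hold:         margin the captured word must stay stable after the request event
    """
    n = len(launch_times)
    for i in range(n):
        capture = launch_times[i] + d_match          # request event i reaches the latch
        arrival = launch_times[i] + logic_delays[i]  # data wave i reaches the latch
        if arrival > capture:
            return False  # delay element is shorter than the logic: word i is late
        if i + 1 < n:
            next_arrival = launch_times[i + 1] + logic_delays[i + 1]
            if next_arrival < capture + hold:
                return False  # wave i+1 overruns wave i before it is captured
    return True


# Times in picoseconds (assumed). d_match equals the worst-case logic delay.
print(awp_capture_ok([0, 500, 1000], [1900, 2000, 1800], d_match=2000))  # True
print(awp_capture_ok([0, 100], [2000, 1800], d_match=2000))              # False
print(awp_capture_ok([0], [2000], d_match=1900))                         # False
```

The second case fails because the launch spacing is smaller than the delay spread; the third fails because the delay element does not track the worst-case logic delay.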

  13. AWPs vs. Synchronous Pipelines • No global clock, instead a local clock (request) that is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with event on request • Many pipeline registers removed, thus requirements on the clock (request) relaxed • Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power and latency

  14. AWPs vs. Asynchronous Pipelines • AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead • AWPs are not elastic: data at the output has to be consumed • AWPs eliminate hazards as a side effect of delay balancing • AWPs have in common with other async methodologies: data-dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

  15. AWPs vs. Synchronous Wave Pipelines AWPs tackle two main difficulties in sync wave pipes: • Replacing the constructive skew by the worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate • Using dynamic self-resetting logic controls delay variation and doesn't impact latency much

  16. Wave Pipelining Combinational Logic • Overall goal: keep the data wave coherent under all possible conditions (data, PTV) • Desirable architecture features: most logic paths have the same depth; fanin/fanout is the same everywhere • First step: pad all short paths to maximum length
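
The first step, padding short paths to maximum length, can be sketched at the gate level. The following is an illustrative sketch only, assuming unit delay per gate: it levelizes a netlist and reports how many buffers each connection needs so that every path has the same depth. The netlist encoding (a dict from node name to its list of drivers), the function names, and the `"OUT"` marker are all hypothetical.

```python
def levelize(fanins):
    """Return the logic depth of every node; primary inputs (no drivers) are level 0."""
    level = {}

    def depth(node):
        if node not in level:
            drivers = fanins[node]
            level[node] = 0 if not drivers else 1 + max(depth(d) for d in drivers)
        return level[node]

    for node in fanins:
        depth(node)
    return level


def padding_buffers(fanins, outputs):
    """Return {(driver, sink): buffer_count} for every connection that must be padded."""
    level = levelize(fanins)
    pads = {}
    for gate, drivers in fanins.items():
        for d in drivers:
            gap = level[gate] - level[d] - 1  # levels skipped by this connection
            if gap > 0:
                pads[(d, gate)] = gap
    max_depth = max(level[o] for o in outputs)
    for o in outputs:
        gap = max_depth - level[o]  # shallow outputs are padded up to the deepest one
        if gap > 0:
            pads[(o, "OUT")] = gap
    return pads


# Tiny example: g1 = f(a, b), g2 = f(g1, c); both g1 and g2 are outputs.
netlist = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"]}
print(padding_buffers(netlist, outputs=["g1", "g2"]))
# {('c', 'g2'): 1, ('g1', 'OUT'): 1}
```

Equal depth is only the starting point; the later slides add the requirement that gates in the same column also have matched delay and load.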

  17. Example: 64-b Brent-Kung Parallel Adder [Diagram: columns 0 to 4, labeled pg, PG, PG, G, xor] • Buffers provide the same depth on every logic path • All gates in the same column must have the same delay
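
For readers unfamiliar with the Brent-Kung structure, its logic function can be sketched in software. This is a behavioural sketch of a textbook Brent-Kung prefix tree, not the circuit from the slides: it models only the generate/propagate computation and the final XOR, and says nothing about the padding buffers or delay matching. The function name and the power-of-two width restriction are assumptions of this sketch.

```python
def brent_kung_add(a, b, width=64):
    """Add two unsigned integers using a Brent-Kung parallel-prefix carry tree.

    Returns (sum modulo 2**width, carry_out).
    """
    if width < 1 or width & (width - 1):
        raise ValueError("width must be a power of two")
    # Bitwise generate (g) and propagate (p) signals: the 'pg' column.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    levels = width.bit_length() - 1

    # Up-sweep: build group generate/propagate over blocks of size 2, 4, 8, ...
    for d in range(levels):
        step = 1 << d
        for i in range(2 * step - 1, width, 2 * step):
            G[i] |= P[i] & G[i - step]
            P[i] &= P[i - step]

    # Down-sweep: fill in the remaining prefixes from the coarser ones.
    for d in range(levels - 2, -1, -1):
        step = 1 << d
        for i in range(3 * step - 1, width, 2 * step):
            G[i] |= P[i] & G[i - step]
            P[i] &= P[i - step]

    # Final XOR column: sum bit i = p[i] XOR carry into bit i (carry-in is 0).
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[width - 1]


print(brent_kung_add(5, 3, width=8))   # (8, 0)
```

The up-sweep and down-sweep together use roughly 2·log2(width) prefix levels, which is the regular column structure that makes this adder a natural candidate for delay balancing.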

  18. Circuits • The logic style used has to minimize delay variation • Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream • Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed • Pass transistor logic gives sloppy edges, thereby introducing delay variation • Dynamic logic is attractive as only the output high transition is data dependent; the output pulldown is done by precharge

  19. Circuits (cont.) • Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge • What is needed is a dynamic logic family without precharge overhead: SRCMOS • Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999

  20. SRCMOS • Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced [Diagram: NMOS tree N with inputs and output]

  21. Operation of a 2-AND

  22. Delay Balancing at Transistor Level • The NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices • Short paths are padded with dummy devices • Delay variation is minimal when exactly one path is on, i.e. wide fan-in ORs are hard to use • Every output has to see the same load • Lightly loaded outputs are given dummy capacitance

  23. Example: Carry tree in a 64-bit adder

  24. Gim Layout

  25. Simulation of Gim cell • Pulses for the 4 possible input situations giving '1' at the output are tightly matched • Note: Pxy = Gxy = 1 never occurs in this case

  26. First Pulse Problem

  27. Miller Effect

  28. 64-bit Adder Output Waveforms [Plot annotation: latching window]

  29. Transistor Sizing [Diagram labels: Wprecharge, Wkeeper, Cfeedback, Cload, N, Cdrive, output, inputs, Wpd] • Wpd / Cdrive = const • Cdrive / (Cload + Cfeedback + Wkeeper) = const • Cfeedback / Wprecharge = const • Wprecharge / Cdrive = const • Linear sizing
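
The four constant-ratio relations on this slide can be solved in closed form for a given load, which shows why the result is a linear sizing rule. The sketch below is an interpretation, not the author's sizing flow: it treats the keeper width as a given input (the slide lists no separate relation for it), uses normalized, dimensionless quantities so widths and capacitances can be added, and takes the four ratio constants as hypothetical parameters `k_pd`, `k_drive`, `k_fb`, and `k_pre`.

```python
def size_gate(c_load, w_keeper, k_pd, k_drive, k_fb, k_pre):
    """Solve the slide's ratio constraints for one gate.

    Assumed constraints (normalized units):
        w_pd / c_drive                         = k_pd
        c_drive / (c_load + c_fb + w_keeper)   = k_drive
        c_fb / w_pre                           = k_fb
        w_pre / c_drive                        = k_pre
    """
    loop = k_drive * k_fb * k_pre  # feedback cap grows with c_drive, forming a loop
    if loop >= 1:
        raise ValueError("ratios are inconsistent: feedback loading never converges")
    # Substituting c_fb = k_fb * k_pre * c_drive into the second constraint:
    c_drive = k_drive * (c_load + w_keeper) / (1 - loop)
    w_pre = k_pre * c_drive
    c_fb = k_fb * w_pre
    w_pd = k_pd * c_drive
    return {"c_drive": c_drive, "w_pre": w_pre, "c_fb": c_fb, "w_pd": w_pd}


# Illustrative, made-up ratios: doubling the load and keeper doubles every size.
small = size_gate(10.0, 1.0, k_pd=0.5, k_drive=0.25, k_fb=0.2, k_pre=0.4)
large = size_gate(20.0, 2.0, k_pd=0.5, k_drive=0.25, k_fb=0.2, k_pre=0.4)
print(small)
print({name: large[name] / small[name] for name in small})
```

Because every quantity comes out proportional to `c_load + w_keeper`, scaling the load scales the whole gate by the same factor in this model.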

  30. Interconnect: Resistive Effects • 0.9 µm x 900 µm MET2 parasitics: C = 116 fF, R = 70 Ohms [Plot: wire models compared: C only; R/3, R/3, R/3; R/2, R/2; RC only]
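
To get a feel for why the wire model matters, the Elmore delay of the quoted wire can be estimated for different segmentations. This is a rough sketch, not a reproduction of the simulations behind the slide: the exact ladder topologies used for the "R/3, R/3, R/3" and "R/2, R/2" cases are not stated, so the code assumes a ladder of identical sections, each a series resistor followed by a capacitor to ground, and ignores driver resistance and load capacitance. The function name is hypothetical.

```python
def elmore_ladder(r_total, c_total, sections):
    """Elmore delay at the far end of a wire split into identical RC sections.

    Each section is a series resistor r_total/sections followed by a
    capacitor c_total/sections to ground (assumed topology).
    """
    if sections < 1:
        raise ValueError("need at least one section")
    r = r_total / sections
    c = c_total / sections
    # The capacitor at node k is charged through k series resistors.
    return sum(k * r * c for k in range(1, sections + 1))


R, C = 70.0, 116e-15  # values quoted on the slide: 70 Ohms, 116 fF
for n in (1, 2, 3, 100):
    print(n, elmore_ladder(R, C, n))
```

With a single lumped section the estimate equals R·C, about 8.1 ps for these values, and it approaches R·C/2 as the number of sections grows, so a coarse lumped model overstates the wire's own delay.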

  31. Interconnect: Coupling Effects • Two adjacent MET2 lines coupled by C = 54 fF

  32. PTV Variations • SRCMOS provides some robustness by generating fresh pulses at every gate output • Pulsed operation reduces data dependency and coupling • PTV noise is not critical when drift is in the same direction across the die • Critical are: temperature gradient, supply drop, and local variations • What is needed: a rule of thumb like "For process X, to be on the safe side, keep the area between two latches < Y sq mm"

  33. Conclusion • AWPs presented as an alternative approach to high-speed design; they show potential for GHz throughput without clocks • AWPs avoid some problems of conventional wave pipes and (a)synchronous systems • 64b adder + test circuit and EC crypto layout in the making • Not covered here: feedback + controllers • To do: support transistor sizing
