Asynchronous Logic: Results and Prospects

Asynchronous Logic: Results and Prospects. Alain J. Martin, California Institute of Technology. NTU, March 2007. What Is Asynchronous Logic? “An algorithm is a sequence of computational steps.” CL&R. How do we implement sequencing in a continuous physical medium?

Presentation Transcript


  1. Asynchronous Logic: Results and Prospects. Alain J. Martin, California Institute of Technology. NTU, March 2007

  2. What Is Asynchronous Logic?

  3. Sequencing and Computation • “An algorithm is a sequence of computational steps.” CL&R • How do we implement sequencing in a continuous physical medium? • Traditional answer: use of a global time reference (“the clock”) [Figure: stages A, B, C, D, E driven by a common clock CLK]

  4. Can we compute without a clock? • Yes!: “asynchronous” or “clockless” logic • Also “self-timed” or “speed-independent” • David Muller, “Theory of Asynchronous Circuits” (1959) • ILLIAC (1959) and ILLIAC II (1962): partially asynchronous • PDP6 (1960): asynchronous

  5. Can we compute without a clock and without delay assumptions? • Delay-insensitivity (Molnar, 198x…) • Almost: “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990) • Quasi-delay-insensitive (QDI) logic: delay-insensitive, except for isochronic forks (the only delay assumption) • QDI is Turing-complete (Martin & Manohar, 1996)

  6. What is an Asynchronous Circuit? • Asynchronous system: collection of modules communicating by handshake protocols • Distributed system on a chip (communicating by message exchange) [Figure: modules A, B, C, D, E connected in a chain, each link with an ack wire]

  7. Caltech QDI Approach • Quasi delay-insensitive (QDI) design • Minimal delay assumptions (only isochronic forks) • Stricter logic synthesis (DI codes for datapath, completion trees), but… • Robust and efficient (no evidence that delay assumptions improve efficiency)

  8. Why Asynchronous and QDI Logic?

  9. Scientific Reasons • Understanding the role of time in computation • Limit of delay insensitivity • Implementing a digital computation directly in a continuous physical medium • Design by program transformation (real correctness-by-construction approach) • “VLSI design as programming” paradigm

  10. Engineering Reasons • Better match for high-level synthesis • Can separate correctness from performance issues • Modularity and better use of concurrency • Large system design (SoC): Only local communication • Efficiency • Average-case instead of worst-case behavior • Less pressure for global optimization (“timing closure”) • Robustness and reliability • Robust to variations in fabrication technology, temperature, voltage, noise, SEU-tolerance • Energy efficiency

  11. Energy Advantages of Async • No clock • Up to 50% of clock power recovered • Automatic shut-off of idle parts • Perfect clock gating • No glitches (spurious transitions) • Up to 50% of power in combinational circuits • Automatic adaptation to parameter variations • Voltage scaling: Perfect exchange of delay against energy through voltage scaling • Flexibility of asynchronous interfaces: • Better use of concurrency

  12. Reactive Use in Embedded Systems • Archetype of a reactive system • Average execution time may be much shorter than maximal execution time • Sleep sequence without race condition • Modeled after wait/signal with condition variables • Instant wake-up from deep sleep
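
The wait/signal pattern named on this slide can be illustrated in software. The sketch below is only the familiar condition-variable idiom, not the chip's sleep circuit; the class name `SleepController` and the `_pending` counter are hypothetical. The point it shows is that the sleeper re-checks for pending events while holding the lock, so an event that arrives just before it goes to sleep is not lost.

```python
import threading


class SleepController:
    """Software analogy of a race-free sleep/wake-up sequence (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0  # events that have arrived but are not yet handled

    def signal(self):
        """Record an external event and wake the sleeper if it is waiting."""
        with self._cond:
            self._pending += 1
            self._cond.notify()

    def wait(self):
        """Sleep until at least one event is pending, then consume it."""
        with self._cond:
            # The check happens under the lock, so a signal cannot slip in
            # between "decide to sleep" and "actually sleep".
            while self._pending == 0:
                self._cond.wait()
            self._pending -= 1
```

A signal issued before `wait()` is called is remembered in the counter, so the later `wait()` returns immediately instead of sleeping forever.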

  13. Robustness to PVT Variations • Increase in physical parameter variations (PVT) is becoming a huge problem… • Even worse in future technologies (nano CMOS or others) • Variations of physical parameters all affect timing • Increased timing variations reduce robustness and/or performance • Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)

  14. Robustness to Voltage and Temperature Variations

  15. Single-event Upset and Soft-error Tolerance of QDI circuits • Soft-errors caused by alpha particles, cosmic rays and other radiation sources are becoming increasingly problematic, even at ground-level • QDI circuits can absorb most “dose-effects” • Single-event upsets that cause a soft-error (bit flip) can be corrected efficiently in QDI circuits • Error-correction scheme specific to QDI • Entire async microcontroller SEU-tolerant

  16. Detection and Correction of SE in QDI circuits • Single-error detection: duplicate and compare • Correction: • prevent propagation of detected SE • stability of guards corrects automatically • “Detection is correction” • Simplest, most expensive coding, but simplest detection mechanism • Entire microcontroller SEU tolerant

  17. Disadvantages of Async • Size overhead (more transistors) • Poorly understood and rarely taught • No industrial CAD tools (yet) • No well-developed testing procedure (yet) • No easy transition path for large established companies…

  18. Experimental Evidence

  19. Asynchronous Chips @ Caltech • World's first Asynchronous Microprocessor (1988) • Lattice-Structure Filter (1994) • Lutonium 8051 Microcontroller (2005) • MiniMIPS (1998)

  20. First Asynchronous Microprocessor (Caltech, 1988) • 16-bit RISC, 2-micron CMOS • Performance: 5 MIPS, 5mA @ 2V; 18 MIPS, 45mA @ 5V; 26 MIPS, 100mA @ 10V • Formal synthesis: Initial sequential description was a single page of CHP code • 5 months from start of project to tape-out (small group) • Fully functional on first silicon • Potato-chip experiment • Runs on a potato as power supply! • 50kHz @ 0.75V, 300kHz @ 0.9V

  21. Asynchronous MIPS R3000 Microprocessor • Standard 32-bit RISC ISA • Single instruction issue, one branch delay slot • Precise exceptions • 2 on-chip caches: 4kB Icache and 4kB Dcache • First prototype (1998): No TLB • 2M transistors • First asynchronous processor competitive with large synchronous designs

  22. MiniMIPS Low-Voltage Operation • Functional from 0.5V Vdd up • Functional at 0.4V with some transistor resizing

  23. Asynchronous MIPS: Practical Results • HP’s 0.6-micron CMOS • Expected: 275 MIPS @ 7W @ 3.3V @ 25°C • First prototype: 190 MIPS @ 4W @ 3.3V @ 25°C • Voltage range: 1V (9.66MHz @ 0.021 W) to 8V • Functional on first silicon despite • Inconsistencies in HP’s process parameters (e.g. higher Vt’s) • Long polysilicon wire overlooked in the critical fetch loop • (Testament to the robustness of asynchronous design style!) • Roughly 4x faster than commercial synchronous MIPS ported to same technology • Note: no particular effort made towards designing for low power.

  24. Lutonium-18: QDI 8051 Microcontroller • TSMC SCN018 through MOSIS • 0.18µm CMOS • 1.8V nominal • |Vt| = 0.4V to 0.5V • Expected area: 5mm² (including 8kB SRAM) • Performance from low-level simulation (conservative!)

  25. Energy Efficiency Metric: Et² • E = C*V², t = k / V • E*t² independent of V • Estimate of energy efficiency • Comparison of designs • “Algorithmic of energy” • See Chapter 15 in “Power Aware Computing” book by Graybill & Melhem eds. Kluwer
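
The voltage independence stated on this slide follows in one line from its two first-order models, assuming C and k are constants of the design and technology:

```latex
E\,t^{2} \;=\; \left(C V^{2}\right)\left(\frac{k}{V}\right)^{2} \;=\; C\,k^{2}
```

The supply voltage V cancels, which is why the product can be used to compare designs running at different voltages.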

  26. Voltage Scaling Advantage: Comparison to Intel Xscale

  27. Energy Breakdown and Comparisons • Microprocessor, results: MIPS energy: 33nJ (async-0.6µm) vs 70nJ (sync-0.6µm); MIPS cycle time: 6ns (async-0.6µm) vs 21ns (sync-0.6µm) • Microcontroller, estimation: 8051 energy per instruction: 10.00nJ (1X) sync-0.5µm; 1.67nJ (6X) async-0.5µm; 0.56nJ (18X) async-0.18µm @ 1.8V; 0.14nJ (72X) async-0.18µm @ 0.9V • 8051 cycle time: 20ns (1X) sync-0.5µm; 10ns (2X) async-0.5µm; 5ns (4X) async-0.18µm @ 1.8V; 10ns (2X) async-0.18µm @ 0.9V • More than 100X Et² improvement over any other 8051 [Figure: energy breakdown by unit: icache, fetch, decode, exec units (adder, shifter, fblock, mem, mult/div), write back, regfile (bypass)]
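
As a rough check, the Et² metric can be computed directly from the 8051 figures on this slide. The sketch below is an illustration, not the author's evaluation method: the dictionary keys and the helper `et2` are made-up names, and the ratios are relative to the synchronous 0.5µm entry listed here, not to every other 8051.

```python
# Energy per instruction (nJ) and cycle time (ns), copied from the slide.
designs = {
    "sync-0.5um 8051": (10.00, 20.0),
    "async-0.5um": (1.67, 10.0),
    "async-0.18um @ 1.8V": (0.56, 5.0),
    "async-0.18um @ 0.9V": (0.14, 10.0),
}


def et2(energy_nj, cycle_ns):
    """Return E * t^2 in nJ * ns^2 (lower is better)."""
    return energy_nj * cycle_ns ** 2


metrics = {name: et2(e, t) for name, (e, t) in designs.items()}
baseline = metrics["sync-0.5um 8051"]

for name, value in metrics.items():
    ratio = baseline / value
    print(f"{name:22s} Et^2 = {value:8.1f}   {ratio:6.1f}x vs sync-0.5um")
```

The two 0.18µm operating points give essentially the same Et² (0.56 × 5² and 0.14 × 10²), which is consistent with the metric being independent of supply voltage.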

  28. Design Methodology

  29. Handshakes & Dual-Rail Encoding • BUFFER: *[ L?x; R!x ] • Four-phase handshake • Dual-rail encoding: • 3 wires (2 data, 1 ack) for one bit of information • Other DI codes are used: 1-of-N [Figure: input channel L? with data rails L0, L1 and acknowledge La; output channel R! with data rails R0, R1 and acknowledge Ra]
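
The dual-rail code and the four phases of the handshake can be written out as a small model. This is a minimal sketch under the usual conventions (both rails low means neutral, exactly one rail high means valid data); the function names are illustrative and the model ignores all timing.

```python
NEUTRAL = (0, 0)  # both rails low: no data on the channel


def encode(bit):
    """Dual-rail encode one bit as (rail0, rail1); exactly one rail goes high."""
    return (1, 0) if bit == 0 else (0, 1)


def is_valid(rails):
    """A code word is valid when exactly one of the two rails is high."""
    return rails in ((1, 0), (0, 1))


def decode(rails):
    """Recover the bit from a valid dual-rail code word."""
    if not is_valid(rails):
        raise ValueError(f"not a valid code word: {rails}")
    return rails[1]


def four_phase_transfer(bit):
    """List the (data rails, ack) states seen during one four-phase transfer."""
    word = encode(bit)
    return [
        (word, 0),     # 1. sender raises one data rail
        (word, 1),     # 2. receiver sees valid data and raises ack
        (NEUTRAL, 1),  # 3. sender returns the data rails to neutral
        (NEUTRAL, 0),  # 4. receiver lowers ack; channel is ready again
    ]
```

Because validity is visible in the data wires themselves, the receiver needs no separate timing signal to know when the bit has arrived.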

  30. A QDI pipeline stage *[ L?x; R!f(x)]
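
The behavior of a stage `*[ L?x; R!f(x) ]` can be mimicked in software with threads and bounded queues. This is a behavioral sketch only: a `queue.Queue(maxsize=1)` is a one-place buffer rather than a true rendezvous channel, and the `None` end-of-stream marker and the name `run_pipeline` are conveniences invented for the example, not part of CHP.

```python
import queue
import threading


def stage(f, left, right):
    """Behavioral model of *[ L?x; R!f(x) ]: receive, compute, send, repeat."""
    while True:
        x = left.get()        # L?x
        if x is None:         # hypothetical end-of-stream marker (not CHP)
            right.put(None)
            return
        right.put(f(x))       # R!f(x)


def run_pipeline(functions, inputs):
    """Chain one stage per function and push the inputs through the chain."""
    chans = [queue.Queue(maxsize=1) for _ in range(len(functions) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, chans[i], chans[i + 1]))
        for i, f in enumerate(functions)
    ]

    def feed():
        for x in inputs:
            chans[0].put(x)
        chans[0].put(None)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    outputs = []
    while True:
        y = chans[-1].get()
        if y is None:
            break
        outputs.append(y)
    for t in threads:
        t.join()
    return outputs
```

For example, `run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])` passes each value through both stages in order. No stage knows anything about the speed of its neighbors; it simply blocks until its input is available and its output has room.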

  31. QDI PIPELINE vs Bundled Data • Dual-rail or 1-of-n data encoding • Completion tree • Critics: high overhead (2*N +1 wires and completion tree) • Alternative: Bundled data • N + 1 wires, no completion tree • Delay line for indicating completion, spurious transitions • Big controversy!

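  32. Fine-grain Pipeline (PCHB) [Figure: PCHB stage computing f from input channel L? to output channel R!, with validity detection on input and output (Lv, Rv), a completion element, enable signal en, and acknowledges La and Ra]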

  33. FINE-GRAIN PIPELINE • No need for separate register • Very high throughput and low forward latency • Excellent Et^2 performance • Entirely QDI • Used in MiniMIPS and Lutonium • Area overhead significant

  34. Lower-Level Synthesis: HSE • CHP Program: *[ L?x; R!x ] • Handshake on L: [ Ld ]; La↑; [ ¬Ld ]; La↓ • Handshake on R: [ ¬Ra ]; Rd↑; [ Ra ]; Rd↓ • Handshaking Expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]

  35. Lower-Level Synthesis: PRS • CHP Program: *[ L?x; R!x ] • Handshaking Expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ] • Production Rule Set: L0 ∨ L1 → Lv↑; ¬La ∧ ¬Ra ∧ L0 → R0↑; ¬La ∧ ¬Ra ∧ L1 → R1↑; R0 ∨ R1 → Rv↑; Lv ∧ Rv → La↑; ¬L0 ∧ ¬L1 → Lv↓; Ra ∧ La → R0↓; Ra ∧ La → R1↓; ¬R0 ∧ ¬R1 → Rv↓; ¬Lv ∧ ¬Rv → La↓ • To PRS for CMOS …

  36. Lower-Level Synthesis: PRS • Each production rule has the form: guard expr → node↑ or guard expr → node↓ • These can be evaluated as: if (guard expr is true) node = Vdd, or if (guard expr is true) node = GND • A set of production rules must be stable and non-interfering (for hazard-free circuits) • Production Rule Set: L0 ∨ L1 → Lv↑; ¬La ∧ ¬Ra ∧ L0 → R0↑; ¬La ∧ ¬Ra ∧ L1 → R1↑; R0 ∨ R1 → Rv↑; Lv ∧ Rv → La↑; ¬L0 ∧ ¬L1 → Lv↓; Ra ∧ La → R0↓; Ra ∧ La → R1↓; ¬R0 ∧ ¬R1 → Rv↓; ¬Lv ∧ ¬Rv → La↓ • To PRS for CMOS …
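
The "if guard then node = Vdd / GND" reading of a production rule can be made concrete with a toy evaluator. The sketch below is an assumption-laden illustration, not one of the simulators named in this talk: the rule representation as `(guard, node, value)` tuples and the functions `step` and `non_interfering` are invented here, all enabled rules fire together in one step, and only non-interference (not stability) is checked. The rule set used is the standard two-input C-element.

```python
def step(rules, state):
    """Fire every enabled production rule once and return the new state."""
    new_state = dict(state)
    for guard, node, value in rules:
        if guard(state):
            new_state[node] = value  # value 1 models Vdd, value 0 models GND
    return new_state


def non_interfering(rules, state):
    """True if no node is pulled up and pulled down at once in this state."""
    driven = {}
    for guard, node, value in rules:
        if guard(state):
            if driven.setdefault(node, value) != value:
                return False
    return True


# Two-input C-element:  a AND b -> c up ;  (NOT a) AND (NOT b) -> c down
c_element = [
    (lambda s: s["a"] and s["b"], "c", 1),
    (lambda s: not s["a"] and not s["b"], "c", 0),
]
```

When the inputs disagree, neither guard is true and `c` keeps its previous value, which is what makes the C-element a state-holding cell.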

  37. Asynchronous Architectures • New asynchronous solutions for pipelined microprocessors • Execution units are in parallel, allowing concurrent and out-of-order execution of instructions

  38. CAD Tools • Complete suite of tools: synthesis, simulation, verification, optimization, layout • Designer-assisted compilation • Tools are modular and customizable • Main representations: CHP, PRS, Cast

  39. Design Flow [Figure: design-flow diagram. Legend: synthesis, simulators, database. Labels: sequential program, concurrent system, logical PRS, physical PRS; DDD, SDD, PL2; chpsim, cosim, prsim/esim, spice; Sizer, Placer, Router; sized PRS, collection of cells, placed cells, routed cells, physical layout; resize using wire information]

  40. Robustness and Reliability

  41. Robustness to Power-Supply Noise • HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages. • Technology: TSMC 0.18-micron CMOS, Vdd: 1.8V, Vt: 0.5V, complete layout. • Vdd is oscillating between 3.5V and 0V (maximal amplitude), and at various frequencies. • The circuit keeps working correctly! (It will malfunction at some very high-frequency noise in phase with circuit frequency.)

  42. Robustness to Power-Supply Noise

  43. SE-Tolerant QDI Circuits [Figure: duplicated circuit with C-elements (C) in intermediate and final stages; signals xa, xb, ya, yb, za, zb, z’a, z’b]

  44. Soft-error Tolerant Asynchronous Microprocessor (STAM) • The STAM architecture defines a simplified 32-bit RISC instruction set, which has eight general registers and four types of instructions: arithmetic, branch, memory and shift operations. • A partially-wired layout of the STAM was completed in TSMC SCN 0.18µm CMOS. In SPICE simulation, it runs at about 120 MHz. • The soft-error tolerance of the STAM has been tested by injecting errors randomly while the STAM runs the RC4 program (a simple stream cipher) in the digital-level simulator. • About five soft errors, whose locations are chosen randomly from a list of all nodes of the STAM, are injected in each execution of an instruction. • About 25% of the 203,000 nets in the STAM experience a bit flip in each test. • The figure shows locations of errors by dots; a box in the figure represents a CHP process.

  45. Soft-error Tolerant Asynchronous Microprocessor (STAM)

  46. Async Molecular Nanoelectronics Molecular nano was our motivation for XQDI: Extreme case of variability!

  47. “Extreme” QDI (XQDI) • Can we improve QDI to eliminate (or reduce further) the remaining variability dependencies? • Isochronic forks • Keepers on state-holding nodes • Slew rates and oscillating rings

  48. Isochronic Forks • Only timing assumption in QDI design • New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect • d(single transition) << d(multi-transition path) • One-sided inequality can always be satisfied

  49. Cell Design without Keeper • Keepers needed for state-holding cells • Keeper requires transistor sizing and balancing current strengths. Difficult with variability… • Example of the C-element [Figure: C-element with keeper and without keeper]

  50. Ring Oscillators • An async system is a collection of rings of operators. Oscillating rings are the engine of an asynchronous circuit. • Right choices of slew rates and number of stages guarantee that each ring oscillates. • What are the limits? How many restoring stages per ring?
