
Channel-Based Asynchronous Decoder with QDI Templates

This paper presents a low-power, high-performance standard cell-based sequential decoder implemented using QDI templates. The goal is to achieve performance close to full-custom designs with shorter design times.


Presentation Transcript


  1. A Channel-Based Asynchronous Low-Power High-Performance Standard Cell-Based Sequential Decoder Implemented with QDI Templates. Recep Ö. Özdağ & Peter A. Beerel, University of Southern California

  2. Motivation and Approach. Background • Fine-grain asynchronous pipelines have demonstrated high performance in largely full-custom back-end flows • Caltech’s MIPS R3000 Microprocessor [Martin97] • Fulcrum’s PivotPoint High Performance Switch [HotChips03] Problem • However, full-custom flows are tedious, error-prone, and time-consuming, and often require significant in-house tool automation Our Solution • Create an asynchronous cell library • Integrate the cell library into a commercial P&R flow using Verilog modelling • Evaluate on a real design • Target a digital communication chip implementing the Fano algorithm Our Goal: Close to Full-Custom Performance with ASIC Design Times USC Asynchronous Group

  3. Channel-Based Asynchronous Design • Synchronization and communication between blocks are implemented with handshaking over asynchronous channels, by sending and receiving “data tokens” Dual-Rail Channel • Two wires per data bit • One acknowledgment wire • Generalizes to 1-of-N coding • Advantage: delay-insensitive communication [Figure labels: Sender, Receiver, Data, Ack; Synchronous System (clock); Asynchronous System (asynchronous channel)]
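The dual-rail channel on slide 3 can be illustrated with a small behavioral model. This is only a sketch: the four-phase (return-to-neutral) handshake and the names `encode_dual_rail`, `decode_dual_rail`, `encode_one_of_n`, and `send_token` are assumptions made for illustration, since the slide itself states only that each bit uses two wires and the channel has one acknowledgment wire.

```python
from typing import List, Optional, Tuple

Rail = Tuple[int, int]      # (false_rail, true_rail) for one data bit
NEUTRAL: Rail = (0, 0)      # spacer: no data present on the wires


def encode_dual_rail(bit: int) -> Rail:
    """Encode one bit on two wires: exactly one rail goes high."""
    return (0, 1) if bit else (1, 0)


def decode_dual_rail(rail: Rail) -> Optional[int]:
    """Return the bit value, or None while the wires are neutral."""
    if rail == NEUTRAL:
        return None
    if rail == (1, 1):
        raise ValueError("illegal dual-rail code: both rails high")
    return rail[1]


def encode_one_of_n(value: int, n: int) -> List[int]:
    """1-of-N generalization: one of n wires is high (dual-rail is n = 2)."""
    return [1 if i == value else 0 for i in range(n)]


def send_token(bits: List[int]) -> Tuple[List[int], List[str]]:
    """Trace one assumed four-phase handshake for a multi-bit data token."""
    trace: List[str] = []
    ack = 0
    # Phase 1: the sender drives valid data on every bit's rail pair.
    wires = [encode_dual_rail(b) for b in bits]
    trace.append(f"data valid   wires={wires} ack={ack}")
    # Phase 2: the receiver sees every bit valid, latches, and raises Ack.
    received = [decode_dual_rail(w) for w in wires]
    ack = 1
    trace.append(f"ack raised   wires={wires} ack={ack}")
    # Phase 3: the sender returns all rails to neutral.
    wires = [NEUTRAL] * len(bits)
    trace.append(f"data neutral wires={wires} ack={ack}")
    # Phase 4: the receiver lowers Ack; the channel is ready for a new token.
    ack = 0
    trace.append(f"ack lowered  wires={wires} ack={ack}")
    return received, trace
```

Because validity is carried by the data wires themselves, the receiver in this model never needs to assume anything about wire delay; it simply waits until every rail pair has left the neutral state.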

  4. Channel-Based Design Characteristics • Architecture is typically a multi-level hierarchy of communicating blocks • Netlist consists of leaf cells communicating along channels [Figure labels: ASIC; Main FSM; Register Bank; Reg A, Reg B, Reg C; Memory; Adder/Mult.; Subtract/Divider; Adder; Multiplier; leaf cells B(N-3), B(N-2), B(N-1), FA(N-3), FA(N-2), FA(N-1), FA(0); channels]

  5. Asynchronous Leaf Cells. Definition • Smallest block that communicates via asynchronous channels Functionality • Reads a subset of input channels • Computes F and writes to a subset of output channels Linear Pipelines • Only one input and one output channel Non-Linear Pipelines • Joins and Forks • Conditional Joins: read only some of the input channels • Conditional Splits: write only to some of the output channels [Figure labels: leaf cell L with Input Channels and Output Channels; Linear Pipeline; Conditional Join; Conditional Split]

  6. Template-Based Leaf-Cell Design • Each pipeline style (QDI, timed…) has a different blueprint • Create a library using a blueprint to implement the lowest-level communicating blocks • Generation of instances from templates is straightforward [Figure panels: Blueprint for a QDI N-input M-output pipeline stage; 2-input 1-output pipeline stage; 1-input 2-output pipeline stage. Block labels: L, F, C, LCD, RCD]

  7. Background: Caltech’s QDI Templates. Precharged Half Buffer (PCHB) [Lines96] • 1-of-N rail channels • Delay-insensitive communication • Quasi-delay-insensitive design • Negligible timing assumptions • Dynamic logic function block • Left and right completion detection [Figure panels: Completion Detector (labels: bit0, bit1, bitn, OR, C, Done); Function Block (labels: precharge control, nmos network, evaluation control); channels L and R]
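The completion detector labelled on slide 7 (per-bit OR gates feeding a C-element that produces Done) can be modelled behaviorally as follows. This is a sketch under assumptions: the class names `CElement` and `CompletionDetector` are hypothetical, and the model abstracts away the tree of C-elements a real detector would use for wide channels.

```python
from typing import Sequence


class CElement:
    """Muller C-element: the output follows the inputs when they all agree
    and holds its previous value otherwise."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, *inputs: int) -> int:
        if all(inputs):
            self.out = 1          # every input high: output rises
        elif not any(inputs):
            self.out = 0          # every input low: output falls
        return self.out           # mixed inputs: hold state


def bit_valid(rails: Sequence[int]) -> int:
    """OR over the rails of one 1-of-N coded bit: 1 once any rail is high."""
    return int(any(rails))


class CompletionDetector:
    """Done rises when every bit is valid and falls when every bit is neutral."""

    def __init__(self) -> None:
        self._c = CElement()

    def update(self, channel: Sequence[Sequence[int]]) -> int:
        return self._c.update(*(bit_valid(rails) for rails in channel))
```

The hold behavior is what makes the detector insensitive to the order in which individual bits arrive or reset: Done changes only after the last bit has changed.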

  8. PCHB Performance Analysis. 2-D Pipelining: the key to high throughput [MiniMips97] • Small forward latency per stage (as little as 2 gate delays) • Smaller completion detection units, which reduces control overhead • Only local communication between blocks • $\text{Cycle time} = 3\,t_{Eval} + 2\,t_{CD} + 2\,t_{c} + t_{prech}$ [Figure: two three-stage pipelines of PCHB cells (L11 … L32), each stage with function block F1, F2, F3, C-element C, and completion detectors LCD and RCD]
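The cycle-time expression on slide 8 can be evaluated numerically as a quick sanity check. The delay values below are hypothetical placeholders chosen only to show the arithmetic; they are not measurements from the design, and the function names are likewise illustrative.

```python
def pchb_cycle_time_ps(t_eval: float, t_cd: float, t_c: float, t_prech: float) -> float:
    """Cycle time = 3*t_Eval + 2*t_CD + 2*t_c + t_prech (all in picoseconds)."""
    return 3 * t_eval + 2 * t_cd + 2 * t_c + t_prech


def throughput_mhz(cycle_time_ps: float) -> float:
    """Convert a cycle time in picoseconds to a token rate in MHz."""
    return 1e6 / cycle_time_ps


# Hypothetical delays (ps): evaluation, completion detection, C-element, precharge.
cycle = pchb_cycle_time_ps(t_eval=100, t_cd=150, t_c=100, t_prech=100)
rate = throughput_mhz(cycle)
print(f"cycle time = {cycle:.0f} ps, throughput = {rate:.1f} MHz")
```

With these placeholder numbers the terms are 300 + 300 + 200 + 100 = 900 ps. The completion-detection term scales with channel width, which is one way to read the slide's point that smaller completion detection units reduce control overhead.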

  9. Outline • Background • Illustration of the Fano Algorithm • The base-line synchronous Fano design • The Asynchronous Fano Design • The Back-End Asynchronous Design Flow • Summary of Contributions

  10. Background on Fano Algorithm • Fano algorithm is a depth-first tree-search algorithm [Fano64] • Achieves good performance with a low average complexity [Figure: animated code-tree search starting from the root. Labels: Received Branch Bits, Decoded bit, Decoded Bit Index (1, 2, 3), Total Path Metric, branch metrics (+3), (-5), (-10), “0 errors”, “1 error”, “Estimate that transmitted a 0”, “Estimate that transmitted a 1”]
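The figure on slide 10 annotates branches with metrics of +3, -5, and -10 and accumulates them into a total path metric. A minimal sketch of that bookkeeping follows. The mapping from bit errors to metric (0 errors gives +3, 1 error gives -5, 2 errors give -10) is an assumption inferred from the figure labels, and the function names are hypothetical. The sketch covers only metric accumulation, not the threshold-driven forward and backward moves of the full Fano search.

```python
from typing import Iterable, Tuple

# Assumed mapping from number of bit errors in a two-bit branch to its metric.
BRANCH_METRIC = {0: 3, 1: -5, 2: -10}


def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("branch words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def branch_metric(received: str, expected: str) -> int:
    """Metric of one branch: reward agreement, penalize each bit error."""
    return BRANCH_METRIC[hamming_distance(received, expected)]


def path_metric(branches: Iterable[Tuple[str, str]]) -> int:
    """Total path metric: sum of branch metrics along (received, expected) pairs."""
    return sum(branch_metric(r, e) for r, e in branches)


# Three branches: error-free, one bit error, error-free.
print(path_metric([("01", "01"), ("10", "11"), ("00", "00")]))
```

A depth-first search such as the Fano algorithm compares this running total against a threshold to decide whether to keep moving forward or to back up and try another branch.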

  11. The Synchronous Architecture [Asilomar99] • Critical path consists of 2 ALUs and 2 MUXes

  12. Outline • Introduction and Background • The Asynchronous Fano Design • The Back-End Asynchronous Design Flow • Summary of Contributions

  13. The Asynchronous Fano. At typical SNR, most of the branches will be error free • Key idea: optimize the architecture for forward moves Circuit can be partitioned into two units • Skip-Ahead Unit: operates at high speed for error-free sequences • Error Logic: operates when errors are encountered Circuit operation switches back and forth • Between Skip-Ahead and Error Logic until it reaches the end of the tree Asynchronous design advantage • Allows seamless switching between blocks

  14. The Asynchronous Architecture: The Skip-Ahead Unit • The critical path of the Skip-Ahead Unit runs at 450 MHz (post layout) [Figure labels: To BMU, From BMU, noError, XOR_SPLIT, ERROR-DETECT, FILTER, Comparison Result, Decision_bit, SkipAhead Decision, Received Data compared with estimated branch bits, MERGE, BMU Decision, XOR, FAST DECISION REGISTER, FAST SHIFT REGISTER]
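The skip-ahead idea, comparing received data with estimated branch bits and moving forward while they match, can be sketched in software as below. This is a behavioral illustration only: the function names and the list-based interface are hypothetical, the estimated branch bits are taken as an input because the slides do not describe how they are produced, and the real unit is a pipelined circuit rather than a loop.

```python
from typing import Sequence, Tuple


def branch_has_error(received: int, estimated: int) -> bool:
    """XOR the two-bit branch words; any set bit marks a mismatch."""
    return (received ^ estimated) != 0


def skip_ahead(received: Sequence[int],
               estimated: Sequence[int],
               start: int = 0) -> Tuple[int, bool]:
    """Advance over consecutive error-free branches.

    Returns (index, no_error). If a mismatch is found, index is the position
    of the first mismatching branch and no_error is False, which is where
    control would pass to the error logic. Otherwise index is the end of the
    sequence and no_error is True.
    """
    i = start
    while i < len(received):
        if branch_has_error(received[i], estimated[i]):
            return i, False
        i += 1
    return i, True


# The second branch differs in one bit, so the fast path stops at index 1.
print(skip_ahead([0b01, 0b00, 0b11], [0b01, 0b10, 0b11]))
```

The fast path does no metric arithmetic at all, only an XOR and a compare, which is consistent with the slides' claim that the architecture is optimized for forward moves over error-free branches.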

  15. The Memory Design • Supports a packet length of only 128 words; each word is a pair of branch bits • Used standard place and route tools for the physical design of the memories • Faster design time at the expense of more area and power consumption • Unacknowledged tri-state buffers on the data bus • Efficiently allows multiple drivers of the bus • Introduces minor timing assumptions • This is typical in synchronous design, but not typical of PCHB-based designs [Figure label: 8 sets of branch bits]

  16. Fano: Error-Free Operation • Total of 8x16 = 128 bits decoded [Waveform time markers: 17971 ns, 18449 ns]

  17. Fano: Error Operation • Error encountered • Move back [Waveform time markers: 17537 ns, 25361 ns]

  18. The Layout. Asynchronous Fano Properties • TSMC 0.25 µm • Skip-Ahead Unit runs at 450 MHz • 2600 µm x 2600 µm = 6.76 mm² • Power dissipation: 32 mW (@ 450 MHz, 2.5 V) • 360,000 transistors • 10 man-months to design + 6 man-months library and flow development Compared to the Synchronous Fano • 2.15x speed • 1/3 the power • 10 man-months to design • 5x the area [Layout block labels: Received Memory, Decision Memory, Branch Metric Calculator, Skip Ahead Unit, Threshold Adjust Unit, Lookup Table, Counter]

  19. Outline • Introduction and Background • The Asynchronous Fano Design • The Back-End Asynchronous Design Flow • Summary of Contributions

  20. Physical Design Flow: Standard Flow Works. [Flow diagram stages, as labelled: Specification; Schematic (Virtuoso, Synopsys); Simulation and Analysis (Hspice/Nanosim/Verilog); Place & Route (Silicon Ensemble); Chip Assembly (Virtuoso); LVS & DRC (Virtuoso, Dracula); Chip Fabrication. Files exchanged: Netlist (.v), Netlist (.sp), Netlist (.cir), Abstract (.lpe), Layout (.gds)] Asynchronous Leaf Cell/Gate Library cell views: • Symbol • Schematic • Functional • Layout • Abstract

  21. Cell Library Flow: Alternatives. Leaf cell level or gate level place and route. One alternative • Used for the Fano Algorithm • More suitable for designs with relaxed timing assumptions at the leaf cell level The other alternative • Used for the STFB-based adder • More suitable for designs with strict timing assumptions at the leaf cell level [Flow labels: Template; Leaf Level Design; Leaf Cell Design; Leaf Cell Library; Gate Library; Gate Level Netlist; Technology Mapping; Layout; Physical P&R]

  22. Cell Library Flow • Developed asynchronous gate library [Flow labels: Cell Design (Virtuoso); Simulation and Analysis (Hspice/Nanosim/Verilog); DRC & LVS (Virtuoso, Dracula); Cell Abstract (Abstract generator); Netlist (.sp); Layout (.gds); Abstract (.lpe); Asynchronous Gate Library views: Symbol, Schematic, Functional, Layout]

  23. Transistor Sizing. Create a number of subtypes for different strengths. Initial cell sizes • 2X for pull-down network • 8X for inverter drivers • Staticizer inverter is ~10x weaker than pull-down network • Additional subtypes added as necessary

  24. Charge-Sharing Considerations • Output inverters and staticizers are encapsulated with the dynamic logic into a single gate • They are internal to all dynamic cells and form part of a known minimum load on the dynamic node (allowing a 10% dip in voltage) • Extensive simulation confirms that the minimum load on each dynamic gate is sufficient to avoid charge-sharing problems

  25. Netlist extraction • Verilog netlist (.v) for placement and routing • Verilog netlist of library gates is auto-generated
      // LAST TIME SAVED: Jun 4 17:49:17 2003
      // NETLIST TIME: Jun 4 17:51:34 2003
      `timescale 1ns / 1ns
      module Counter2 ( Backward_e, BmuErr_e[5], Forward_e, From_FSM_T, Go_Fast,
          Go_Slow_FSM, Go_Start_Pointer_F, Go_Start_Pointer_T, Go_e, LB, LFB, LFBTE,
          LFB_LFBTE, LFNB, NewStat_e0, NewStat_e1, Slow_ShiftB_e, Start, ZeroCheck,
          Zero_e, infi_e1, infi_e2, nReset);
      output Backward_e, Forward_e, From_FSM_T, Go_Fast, Go_Slow_FSM,
          Go_Start_Pointer_F, Go_Start_Pointer_T, Go_e, LB, LFB, LFBTE, Send_T_Re,
          ShiftB_e, ZeroCheck_e, Zero_False, Zero_True, infi_e;
      input BmuErr_e5a, BmuErr_e5b, BmuErr_e5c, BmuErr_e5d, ConnectGnd, Dec, Go,
          Go_Fast_Re, Go_Slow_FSM_e, Go_Start_Pointer_e, Inc, LFB_e1, LFNB_e1,
          NoZeroCheck, Re_LB, Re_LFB, Re_LFBTE, Re_LFNB, Re_S19, Zero_e, infi_e1,
          infi_e2, nReset;
      output [5:5] BmuErr_e;
      // Buses in the design
      wire [0:7] Forw_e;
      PCHB_SingleRail_SlowDataPath I54 ( .Ae(net01493), .A1(net0507),
          .BUFe(Send_Delta_to_Encode_e), .BUF1(Send_Delta_to_Encode),
          .nReset(nReset));
      PCHB_BUFFER1_for_Counter_1 I204 ( .Ae(net0489), .A1(net0486),
          .A0(ConnectGnd), .BUFe(LFB_e1), .BUF1(LFB_LFBTE), .BUF0(nc[30]),
          .Start(Start), .nReset(nReset))
      …

  26. Placement, Routing and Extraction
      *
      * CADENCE/LPE SPICE FILE : SPICE
      * DATE : 5-JUN-2003
      *
      ******
      ****** MOS XTOR PARAMETERS FROM : 7MOSXREF
      ******
      *
      *
      *.GLOBAL VDD! GND!
      *
      *
      .SUBCKT INC2 DATA REQ ACK NRST4 L0 L1
      *
      *
      ****** CORNER ADJUSTMENT FACTOR = 0.0000000
      ******
      MM2-XI60-XI36 XI36-A NET0432 VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
      + PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
      MM3-XI60-XI36 XI36-A NR<6> VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
      + PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
      MM7-XI60-XI36 XI36-XI60-NET029 NET0432 XI36-A GND! NCH L=0.24U W=1.20U
      + AD=0.24P PD=1.60U AS=0.44P PS=1.94U NRS=0.183 NRD=0.167
      MM7-XI60-XI36-1 685 NET0432 GND! GND! NCH L=0.24U W=1.20U AD=0.24P
      + PD=1.60U AS=0.80P PS=3.74U NRS=0.183 NRD=0.167
      ...
      ...
      MM1-XI59-3 NET72 XI59-NET35 VDD! VDD! PCH L=0.24U W=2.50U AD=0.93P
      + PD=3.24U AS=1.65P PS=6.32U NRS=0.088 NRD=0.088
      *
      *----- TOTAL # OF MOS TRANSISTORS FOUND : 2018
      *----- COMMENTED : 0
      *
      ******
      ****** RESISTORS PARAMETERS FROM : 7RESXREF
      ******
      ******
      ****** DIODE PARAMETERS FROM : 7DIOXREF
      ******
      ******
      ****** CAPACITORS PARAMETERS FROM : 7CAPXREF
      ******
      ******
      ****** CAPACITORS PARAMETERS FROM : 7CAPXMER
      ******
      *
      *
      C1 NET77 GND! 8.00421E-15
      C2 NET209 GND! 1.06917E-14
      C3 NET188 GND! 1.16892E-14
      C4 NET121 GND! 1.34065E-14
      C5 NET215 GND! 1.02445E-14
      ...

  27. Chip Assembly • Stream-in blocks layout (from SE to Virtuoso) • Block placement and routing • DRC, LVS and netlist extraction (.sp) • Post-layout simulation Future Work: • Static timing • Automatic block placement and routing • Synthesis

  28. Summary. Design Flow: Standard ASIC flow for channel-based asynchronous circuits • Asynchronous high-performance designs with ASIC design times are possible • Verilog modelling and structural simulation are feasible • Commercial P&R tool (Silicon Ensemble) works quite well • Design flow is applicable to many templates (QDI or STFB) Architectural: Design and implementation of the Fano Algorithm • A complex design implemented in both synchronous and asynchronous styles • Over 2x performance with 1/3 the power at the expense of 3-5x area First freely available asynchronous library • Working on characterization and Lib file generation

  29. Thank You

  30. Skip-Ahead Unit with RSPCHB • A 14% throughput improvement in the Skip-Ahead Unit using RSPCHB instead of PCHB [Figure labels: To BMU, From BMU, noError, XOR_SPLIT, ERROR-DETECT, FILTER, Comparison Result, Decision_bit, SkipAhead Decision, Received Data compared with estimated branch bits, MERGE, BMU Decision, XOR, FAST DECISION REGISTER, FAST SHIFT REGISTER]

  31. Overview of New Pipeline Templates • Foundation of design space exploration trading robustness for performance
      2-D Style   Timing Assumptions   Throughput
      PCHB        DI/QDI               772 MHz
      RSPCHB      QDI                  920 MHz
      LP2/2+      Moderate             1.0 GHz
      HC          Aggressive           1.2 GHz
