A PLA based Asynchronous Micropipelining Approach for Sub-threshold Circuit Design

Authors: Nikhil Jayakumar*, Rajesh Garg*, Bruce Gamache$, Sunil P. Khatri*
*Department of Electrical Engineering, Texas A&M University. $Conexant Systems, Inc.

Outline • Motivation • Introduction • Approach • Results • Conclusions

Sub-threshold Leakage
• As supply voltage scales down, the VT of the devices is scaled down as well.
• Leakage increases exponentially with decreasing VT.
• Leakage power is becoming comparable with dynamic power.
• A larger VT would reduce leakage but increase delay.
• We can turn this dilemma into an opportunity!
  • Use sub-threshold leakage current to implement circuits.
  • Set VDD less than VT.

Advantages of Sub-threshold Circuit Design
• We performed simulations on a 21-stage ring oscillator (BPTM 65nm).
• Power is significantly lower (100-500X).
• Power-delay product (PDP) improves by 10-20X.
• Transconductance is an exponential function of Vgs.
• Circuit noise margins are high.
• Ion/Ioff = 100 – 200.
• Circuits get faster at higher temperature.

Disadvantages of Sub-threshold Circuit Design
• Ids is highly dependent on PVT variations.
  • Need dynamic compensating circuitry such as the one in “A Variation-tolerant Sub-threshold Design Approach”, N. Jayakumar, S. Khatri [DAC’05], which used adaptive body biasing.
• Ids is small, which results in large delay.
  • Delay gets worse by 10-25X.
  • Therefore, the application space is very low power applications such as sensor networks.
• Design methodologies for sub-threshold digital circuit design are ad hoc.

Contribution of this paper
• Provide a systematic EDA framework for the design of complex digital systems using sub-threshold Network of PLA (NPLA) based circuits.
• Use asynchronous micropipelining to provide greater throughput.
• Ideally suited for data-flow type circuits.

Why NPLAs?
• NPLAs are fast and area-efficient when compared to standard-cell based designs.
  - “Cross-talk immune VLSI design using a Network of PLAs Embedded in a Regular Layout Fabric”, S. Khatri, R. Brayton, A. Sangiovanni-Vincentelli [ICCAD’00]
• Predictable delay of dynamic PLAs.
  • Good circuit implementation choice for sub-threshold/near-threshold logic.
• Regular layout structure.
  • Compatible with Restrictive Design Rules (RDRs) required to handle current and future lithographic issues.
• Technology-independent optimizations (literal reduction) are utilized better.
  • No intervening technology mapping step.
• Implementing Structured ASICs.
  • An array of fixed-size PLAs is ideally suited for implementing Structured ASIC type designs.
  - “A METAL and VIA Mask Customizable VLSI Design Scheme using an Array of Dynamic PLAs”, N. Jayakumar, S. Khatri [ICCAD’04]

PLA structure – Precharged NOR-NOR
[Figure: PLA schematic with the AND plane and the OR plane labeled]

PLA structure – Precharged NOR-NOR
• Inputs run vertically.
• Wordlines run horizontally.
• Outputs run vertically.
• A dummy wordline and a dummy output line are provided for self-timing.

PLA structure – Precharged NOR-NOR
• completion is the last signal to switch.
• Input latches latch data from the previous level.

Asynchronous Micropipeline Structure
• Each PLA has:
  • Data inputs – D (input)
  • Data outputs – O (output)
  • Handshaking control signals – P1, P2 (input)
    • Control the asynchronous handshake.
  • PLA evaluation/precharge done signal – completion (output)
    • Switches high when evaluation completes, switches low when precharge completes.
  • Internal clock signal – INTCLK (output)
    • Generated from completion, P1 and P2 to control operation of the PLA.
    • INTCLK = low → PLA precharges
    • INTCLK = high → PLA evaluates
[Figure: PLAs arranged in levels 1, 2, ..., n]

Handshaking Logic
• PLA p (at level k) precharges (INTCLK goes low) if its P1 rises.
  • PLA q at the next higher level has latched the output data of p.
• PLA p evaluates (INTCLK goes high) if its P2 rises and its completion signal is low.
  • PLA p is currently in the precharged state (its completion signal is low).
  • PLA r at the next lower level has completed evaluation and has new data ready (P2 for PLA p has risen).
• Handshaking logic is therefore as shown below:
[Figure: handshaking logic circuit]
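
The two rules on this slide can be sketched as a small behavioral model in Python. This is not the authors' circuit. It assumes INTCLK behaves like a set/reset state, that signal levels stand in for the rising edges described above, and that precharge wins if both conditions hold at once. The function name `next_intclk` is hypothetical.

```python
def next_intclk(intclk: bool, p1: bool, p2: bool, completion: bool) -> bool:
    """Return the next INTCLK value for one PLA (assumed set/reset behavior).

    True  -> PLA evaluates
    False -> PLA precharges
    """
    if p1:
        # The next-level PLA has latched our outputs, so we may precharge.
        return False
    if p2 and not completion:
        # The previous level has new data and we are precharged, so evaluate.
        return True
    # Neither condition holds: stay in the current phase.
    return intclk


# A precharged PLA (completion low) starts evaluating once P2 is asserted.
print(next_intclk(intclk=False, p1=False, p2=True, completion=False))
# An evaluated PLA returns to precharge once P1 is asserted.
print(next_intclk(intclk=True, p1=True, p2=False, completion=True))
```

Modeling the priority of P1 over P2 is a design choice made here for determinism; the slide does not state how the real logic resolves simultaneous requests.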

Micro-Pipeline Operation
• Initially all PLAs are precharged.
• Drive primary inputs (D of level 1 PLAs).
• P2 signals of level 1 PLAs are asserted.
• After evaluation is done, completion signals of level 1 PLAs go high.
  • Therefore level 2 PLAs start evaluating.
  • Data gets latched at the inputs of level 2 PLAs, and INTCLK of level 2 PLAs goes high.
  • This causes level 1 PLAs to start precharging.
• When evaluation of level 2 PLAs is done, their completion signals go high.
  • This causes level 3 PLAs to start evaluating.
[Figure: PLAs arranged in levels 1, 2, ..., n]

Micro-Pipeline Operation
• This goes on until the PLAs at level n finish evaluation (indicated by their completion signals going high).
• The consumer circuit latches the output and asserts P1 of the level n PLAs.
  • This causes level n PLAs to precharge.
• When completion of level n-1 PLAs goes high and level n PLAs have precharged, level n PLAs can evaluate again.
[Figure: PLAs arranged in levels 1, 2, ..., n]

Non-micropipelined vs Micropipelined
• Delay for non-micropipelined NPLA = Tpchg + n × Teval
• Delay of micropipelined PLA = Teval + Tpchg + handshaking time
[Figure: PLAs arranged in levels 1, 2, ..., n]
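
Dividing the two delays gives a rough steady-state throughput gain. This assumes the pipeline is kept full and that every level has the same timing; the symbol T_handshake is shorthand introduced here for the "handshaking time" term.

```latex
\[
\text{speedup} \approx
\frac{T_{\text{pchg}} + n \, T_{\text{eval}}}
     {T_{\text{eval}} + T_{\text{pchg}} + T_{\text{handshake}}}
\]
```

Only the numerator depends on n, so under these assumptions the gain grows with the number of PLA levels.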

Verilog Simulation of Micropipelining
• We simulated the handshaking protocol in Verilog.
• Verified correct operation.
• If the consumer circuit holds off asserting P1 for level n PLAs, the entire pipeline stalls.
• Note that when level i is in precharge, level i+1 is in evaluation and vice versa.

Synthesis Algorithm
• First levelize the given multi-level network N.
• Generate a DFS ordering of network nodes and sort in increasing order of levels.
• Greedily include new nodes from the multi-level network into a current PLA.
• Assume current PLA p has nodes {n} in it.
• Candidate nodes {m} for inclusion in PLA p are:
  • Nodes in the fanout of nodes in {n}.
  • Nodes at the same level as nodes in {n}.
• We evaluate the favorability of nodes in {m} as: favorability(m) = 2 × (#common fanins(m, {n})) + (#common fanouts(m, {n})).
• The first term favors sharing of inputs with existing nodes {n}, while the second term favors sharing of outputs.
• Sharing of inputs was empirically determined to be more useful in yielding smaller PLA counts.
• We include the node with the highest favorability value.
[Figure: example network annotated with the values 5, 5, 4, 3, 2, 2, 1, 1, 1]
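
The favorability score can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes "common fanins" means the fanins of m that are also fanins of some node already in the PLA, and likewise for fanouts. The dictionaries `fanins` and `fanouts` and the node names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set

Node = Hashable


def favorability(m: Node, pla_nodes: Iterable[Node],
                 fanins: Dict[Node, Set[Node]],
                 fanouts: Dict[Node, Set[Node]]) -> int:
    """Score candidate m against the nodes already in the current PLA."""
    pla_nodes = list(pla_nodes)
    # Union of all fanins / fanouts of nodes already placed in the PLA.
    pla_fanins = set().union(*(fanins.get(n, set()) for n in pla_nodes))
    pla_fanouts = set().union(*(fanouts.get(n, set()) for n in pla_nodes))
    common_in = len(fanins.get(m, set()) & pla_fanins)
    common_out = len(fanouts.get(m, set()) & pla_fanouts)
    # Shared inputs are weighted twice as heavily as shared outputs.
    return 2 * common_in + common_out


# Hypothetical example: PLA holds nodes a and b; c and d are candidates.
fanins = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"x", "y", "z"}, "d": {"w"}}
fanouts = {"a": {"o1"}, "b": {"o1", "o2"}, "c": {"o2"}, "d": {"o3"}}
pla = ["a", "b"]
best = max(["c", "d"], key=lambda m: favorability(m, pla, fanins, fanouts))
print(best, favorability(best, pla, fanins, fanouts))
```

Here c shares three inputs and one output with the PLA, so it scores 2 × 3 + 1 = 7 and is picked over d, which shares nothing.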

Synthesis Algorithm
• Current PLA p is grown until it violates size constraints.
  • Nodes {n} in the current PLA are converted into a two-level network N.
  • We run espresso on N.
  • If the number of inputs, the number of outputs and the height of this two-level network are within bounds, then PLA p is grown.
  • If not, then we start growing a new PLA.
• Build a PLA dependency graph.
  • Each vertex corresponds to a unique PLA.
  • Each edge connects the output of a PLA to the input of another PLA.
• A node being included in the current PLA p is constrained by the following:
  • The node being included should not violate the size constraints of a PLA.
  • The inclusion of this node should not result in a cyclic PLA dependency graph.
• If such a node is not available, pick the next most favorable node.
[Figure: example network annotated with the values 5, 5, 4, 3, 2, 2, 1, 1, 1]
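
The acyclicity constraint can be checked by tentatively adding the node and testing the PLA dependency graph for cycles. The sketch below uses Kahn's topological sort, which is one standard way to do this and is not confirmed as the authors' method. It only considers nodes that already have a PLA assignment, and the names `node_to_pla`, `fanins` and `can_include` are hypothetical.

```python
from typing import Dict, Hashable, Set

Node = Hashable


def pla_graph_is_acyclic(node_to_pla: Dict[Node, int],
                         fanins: Dict[Node, Set[Node]]) -> bool:
    """Build the PLA dependency graph and test it for cycles."""
    edges: Dict[int, Set[int]] = {}
    for node, pla in node_to_pla.items():
        edges.setdefault(pla, set())
        for src in fanins.get(node, ()):
            src_pla = node_to_pla.get(src)
            # Edge from the driving PLA to the driven PLA.
            if src_pla is not None and src_pla != pla:
                edges.setdefault(src_pla, set()).add(pla)
    indegree = {p: 0 for p in edges}
    for outs in edges.values():
        for q in outs:
            indegree[q] += 1
    ready = [p for p, d in indegree.items() if d == 0]
    visited = 0
    while ready:
        p = ready.pop()
        visited += 1
        for q in edges[p]:
            indegree[q] -= 1
            if indegree[q] == 0:
                ready.append(q)
    # Every PLA is visited only if no cycle exists.
    return visited == len(edges)


def can_include(node: Node, pla: int, node_to_pla: Dict[Node, int],
                fanins: Dict[Node, Set[Node]]) -> bool:
    """True if placing node in pla keeps the dependency graph acyclic."""
    trial = dict(node_to_pla)
    trial[node] = pla
    return pla_graph_is_acyclic(trial, fanins)


# Hypothetical chain a -> b -> c, with a in PLA 0 and b in PLA 1.
fanins = {"b": {"a"}, "c": {"b"}}
placed = {"a": 0, "b": 1}
print(can_include("c", 0, placed, fanins))  # PLA 0 -> PLA 1 -> PLA 0
print(can_include("c", 1, placed, fanins))
```

Placing c in PLA 0 would make PLA 0 and PLA 1 depend on each other, so the first call rejects it, while placing c in PLA 1 is accepted.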

Synthesis Algorithm
• After synthesis, the output of a PLA at level i may drive PLAs at level > i+1.
  • Such a case will cause micropipelining to fail.
• Insert stutter blocks for signals that skip one or more levels of PLAs.
• Stutter blocks are banks of latches that delay signals spanning more than one level of PLAs.
• Multiple stutter blocks are inserted for signals traversing multiple levels.
[Figure: example with PLA1 to PLA5 and a stutter block]
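
A small Python sketch of how the required stutter blocks could be counted. It assumes one stutter block per skipped level, which is an interpretation of "multiple stutter blocks for signals traversing multiple levels" and is not stated explicitly on the slide. The PLA names and levels are hypothetical and do not reproduce the figure.

```python
from typing import Dict, Iterable, Tuple


def stutter_blocks_needed(
    pla_level: Dict[str, int],
    edges: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], int]:
    """Map each level-skipping edge (src, dst) to its stutter block count."""
    needed = {}
    for src, dst in edges:
        skipped = pla_level[dst] - pla_level[src] - 1
        if skipped > 0:
            # Assumed: one bank of latches for every level the signal skips.
            needed[(src, dst)] = skipped
    return needed


# Hypothetical network: A at level 1, B at level 2, C at level 3, D at level 4.
levels = {"A": 1, "B": 2, "C": 3, "D": 4}
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("C", "D")]
print(stutter_blocks_needed(levels, edges))
```

In this example the A to C connection skips one level and the A to D connection skips two, so they need one and two stutter blocks respectively, while adjacent-level connections need none.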

Experiments
• 65nm technology.
• VDD = 0.2V
• PLA size: 16 inputs, 14 outputs, 24 rows
• Delay and energy results from SPICE using 65nm BPTM model cards.
• Comparison made with non-micropipelined PLA.
• Throughput of PLA = 1/(Teval + Tpchg + 2·Heval + Hpchg)
  • Teval = Evaluation time for a PLA (~210ns)
  • Tpchg = Precharge time for a PLA (~155ns)
  • Heval = Handshake time before start of evaluation (~60ns)
  • Hpchg = Handshake time before start of precharge (~25ns)
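
Plugging the approximate timings from this slide into the throughput expression gives a feel for the numbers. This is a worked-arithmetic sketch in Python; the constant and function names are chosen here for illustration, and the inputs are the rounded values quoted above, so the result is approximate as well.

```python
# Approximate timings from the slide, in nanoseconds.
T_EVAL_NS = 210.0   # evaluation time for a PLA
T_PCHG_NS = 155.0   # precharge time for a PLA
H_EVAL_NS = 60.0    # handshake time before start of evaluation
H_PCHG_NS = 25.0    # handshake time before start of precharge


def micropipelined_cycle_ns(t_eval: float, t_pchg: float,
                            h_eval: float, h_pchg: float) -> float:
    """Cycle time = Teval + Tpchg + 2*Heval + Hpchg (the throughput denominator)."""
    return t_eval + t_pchg + 2.0 * h_eval + h_pchg


cycle_ns = micropipelined_cycle_ns(T_EVAL_NS, T_PCHG_NS, H_EVAL_NS, H_PCHG_NS)
throughput_mhz = 1e3 / cycle_ns  # 1 / ns = GHz, so 1e3 / ns = MHz
print(f"cycle time ~ {cycle_ns:.0f} ns, throughput ~ {throughput_mhz:.2f} MHz")
```

With these values the cycle time is about 510 ns, which corresponds to roughly 1.96 million results per second.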

Results – Delay
• Delay = 1/throughput for micropipelined.
• Delay is constant since PLA size is fixed.

Results – Area • Area estimates based on layout of PLAs along with stutter blocks.

What about Energy consumption?
• Non-micropipelined NPLAs precharge together and then evaluate in a domino fashion.
• Energy is wasted due to leakage in the “Precharged” and the “Evaluated” states.
• Micropipelined PLAs spend little time in the “Precharged” or “Evaluated” states.
[Figure: timing diagram for a non-micropipelined NPLA]

Results – Energy • Results show energy consumption for one computation through the NPLA circuit. • Significant reduction in energy consumption is observed.

Conclusions
• We have proposed an asynchronous micropipelined design approach that reclaims some of the speed penalty associated with sub-threshold circuit design.
  • Ideally suited for data-flow type applications.
• We implemented:
  • Handshaking protocol for micropipelining.
  • Circuit design aspects of the approach.
  • Logic synthesis for micropipelined NPLAs.
• We validated the approach with Verilog and SPICE simulations.
• Results show that:
  • Design can be sped up by ~7X.
  • Area overhead is ~47%.
  • Energy consumption is lower by ~4X.
• Techniques described can be used for regular operating conditions (VDD > VT) as well.

Thank you. Questions?
