Efficient asynchronous protocol converters for two phase delay insensitive global communication
Download
1 / 51

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication - PowerPoint PPT Presentation


  • 546 Views
  • Uploaded on

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication. Amitava Mitra Intel Corp., Bangalore, India William F. McLaughlin Columbia University, Electrical Engineering Steven M. Nowick Columbia University, Computer Science. Outline.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication' - ryanadan


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Efficient asynchronous protocol converters for two phase delay insensitive global communication l.jpg

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication

Amitava Mitra

Intel Corp., Bangalore, India

William F. McLaughlin

Columbia University, Electrical Engineering

Steven M. Nowick

Columbia University, Computer Science


Outline l.jpg
Outline Delay-Insensitive Global Communication

  • Motivation and Contribution

    • System-on-Chip: Concepts and Trends

    • Asynchronous Signaling Styles

    • Target Asynchronous SOC Architecture

    • Contribution

  • Proposed System Architecture

  • Experimental Results

  • Extensions: Other Signaling Styles

  • Conclusions and Future Work


System on chip soc concept and trends l.jpg
System-on-Chip (SOC): Concept and Trends Delay-Insensitive Global Communication

  • Microelectronic trends enabling SOC design

    • Increasing integration density + chip size

      • Formerly discrete functions (memory, I/O) now integrated

      • Popularity of “multi-core” designs

  • Heterogeneous SOC:

    • Large complex chip with broad functionality

    • Many independent computation nodes

      • Multiple cores, memories, accelerators, multimedia processing, etc.

      • Often includes multiple timing domains

    • Complex network-style interconnect fabric

  • Challenges in Heterogeneous SOC design:

    • Wire costs not scaling down with device size

      • Increasing proportion of power and delay in interconnect

    • Robust and high-performance interconnect design:

      • High latencies between remote nodes

      • Mixed timing, timing variability/uncertainty

      • Need to support varied components: modular/scalable design


Soc communication fabric l.jpg
SOC Communication Fabric Delay-Insensitive Global Communication

  • Growing factor in overall system performance

  • Ideal Requirements:

    • Speed: high throughput, low latency

    • Low power

    • Robust to timing variations

    • Flexibility: integrate modular IPs and upgrades

  • Asynchronous design well-suited to these goals

    • Timing robust flexible designs

    • Lower power than synchronous

    • Work by Quinton, Greenstreet, and Wilton [ICCD 2005]

      • GALS-style:

        • global LEDR interconnect + local synchronous blocks

        • does not provide details of protocol converters


Asynchronous for soc communication l.jpg
Asynchronous for SOC Communication Delay-Insensitive Global Communication

  • Advantages of asynchronous global communication

    • Delay-insensitive (DI) encoding

      • Removes timing constraints on global routing

    • No clock signals to route across chip

      • Significant power advantage

    • Can support both async + sync computation

      • Delay-insensitive async logic combats growing variability concerns

      • GALS style: Globally-Asynchronous Locally-Synchronous

  • Several popular async signaling protocols

    • Dual rail four-phase, LEDR, 1-of-4, bundled data, others

    • No single protocol ideal for both logic and communication


Background ledr signaling l.jpg
Background: LEDR Signaling Delay-Insensitive Global Communication

  • Dual-rail encoding: two wires per bit – delay-insensitive

  • “Level-encoding”:

    • Data rail: holds actual data value

    • Parity rail: holds parity value

  • Alternating-phase protocol:

    • Encoding parity alternates between odd and even

Bit value

LEDR Encoding

data rail parity rail

Phase


Ledr signaling l.jpg
LEDR Signaling Delay-Insensitive Global Communication

  • Exactly one wire transition for each new data item

Data rail: carries bit value in both phases

0

1

0

0

1

1

1

data

parity

even

odd

even

odd

even

odd

even

Parity rail: phase alternates with each data item


Four phase dual rail signaling l.jpg
Four-Phase Dual-Rail Signaling Delay-Insensitive Global Communication

  • Alternative DI Code

  • Key Differences:

    • Four-phase (Return-to-Zero) protocol

      • Spacer (reset) state required between each data item

    • One-hot encoding:

      • True rail (encodes 1) & false rail (encodes 0)

1

0

1

1

Data values

True rail

False rail

Evaluation (one rail high)

Reset (both rails low)


Four phase dual rail vs ledr l.jpg
Four-Phase Dual-Rail vs. LEDR Delay-Insensitive Global Communication

  • Advantages of four-phase dual-rail:

    • Delay-insensitive logic using standard gates

      • Implementations are simple and fast: widely used

      • LEDR: complex & impractical

  • Disadvantages of four-phase dual-rail:

    • System-level communication throughput:

      • Spacer state doubles round-trip communication latency

      • LEDR: no spacer required

    • Power dissipation:

      • Two transitions/bit (up and down) for each data item

      • LEDR: only one transition/bit

  • Conclusion:

    • Four-phase dual-rail better for implementing function blocks

    • LEDR is better for global communication


Target asynchronous soc architecture l.jpg
Target Asynchronous SOC Architecture Delay-Insensitive Global Communication

  • Three major components:

    • Global communication network (LEDR)

    • Local computation nodes (varied styles)

    • New requirement: protocol converters at interfaces

      • Allow full separation of computation and communication

Our goal –

Protocol converters to enable this global LEDR SOC


Contribution l.jpg
Contribution Delay-Insensitive Global Communication

  • High-speed protocol converters to enable heterogeneous SOC architectures

    • Supports high-throughput, robust global communication

      • LEDR encoding

    • Supports efficient design of local function blocks

      • (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data

  • Features:

    • Family of low-latency protocol converters:

      • support above 3 local encoding styles

    • High throughput:

      • facilitates concurrent interaction of nodes

    • Timing-robust:

      • converters almost entirely QDI

    • Low design effort:

      • standard cell design flow

    • Fully implemented in 0.18 μm CMOS

      • Layout and simulation

      • FIFO throughputs up to 250 MHz


Two target soc topologies l.jpg
Two Target SOC Topologies Delay-Insensitive Global Communication

1. “Pipeline-style” topology

  • Feed-forward data path:

    • uni-directional token flow

  • Receiving node returns a single ACK (control signal)

    • Supports concurrency between nodes

Data feeds forward

Acknowledge sent back


Two soc topologies cont l.jpg
Two SOC Topologies (cont.) Delay-Insensitive Global Communication

2. “Server-style” topology

  • Client passes data token to server

  • Server computes/returns data token to client (result)

    • Explicit ACK unnecessary

  • Proposed SOC architecture supports both topologies

  • Four-phase server

    Four-phase data client

    Bi-directional data flow: data passed back to client on completion


    Outline14 l.jpg
    Outline Delay-Insensitive Global Communication

    • Motivation and Contribution

    • Proposed System Architecture

      • Architecture Overview

      • System Simulation

      • Detailed Hardware Implementation

      • Timing Analysis

    • Experimental Results

    • Extensions: Other Signaling Styles

    • Conclusions and Future Work


    Architecture overview l.jpg
    Architecture Overview Delay-Insensitive Global Communication

    Four-phase core

    • External LEDR interface, internal four-phase core

      • Four-phase signals are shown in red

      • Two-phase or transition signals are shown in yellow

    LEDR input

    LEDR output


    Control signals l.jpg
    Control Signals Delay-Insensitive Global Communication

    • Two-phase control signals

    Phase of LEDR input (request from left)

    Phase of LEDR output (forward complete)

    Acknowledge to left neighbor

    Acknowledge from right neighbor


    Control signals17 l.jpg
    Control Signals Delay-Insensitive Global Communication

    • Four-phase control signals

    Completion detect four-phase evaluate and RZ

    Enable four-phase evaluate and RZ


    System simulation l.jpg
    System Simulation Delay-Insensitive Global Communication

    • LEDR inputs begin arriving at quiescent system

    LEDR inputs arrive

    Completion detection


    System simulation19 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Input completion detection sent to control

    All input phases matching

    Transition to new phase


    System simulation20 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Control enables four-phase evaluate phase

    Enable rises


    System simulation21 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • LEDR input converted to four-phase

    Enable now high

    One wire of each four-phase pair rises


    System simulation22 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Four-phase function evaluation


    System simulation23 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Four-phase bits decoded to LEDR

      • Each bit converted as soon as it computes

    LEDR outputs to next node generated

    Four-phase complete not used in evaluate phase


    System simulation24 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • LEDR output completion detection

    Output pairs

    ACK from right may come any time after all pairs are sent


    System simulation25 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Control enables four-phase reset phase

    Enable falls


    System simulation26 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Function block inputs return-to-zero

      • ACK is sent concurrently to left

    Enable now low

    Pipeline concurrency:

    request new data during reset phase


    System simulation27 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Four-phase reset propagates through logic block

    New data may arrive now that ACK has been sent

    Reset Completion detection

    Enable remains low


    System simulation28 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • Four-phase reset completes

      • Complete internal cycle has now been performed

    Complete falls


    System simulation29 l.jpg
    System Simulation Delay-Insensitive Global Communication

    • New evaluate phase begins when Enable rises again

      • Pre-conditions: reset finished, new data REQ, and old data ACK

    Three-way synchronization

    Input phase transitions when new data ready

    ACK transitions when outputs safe to change

    Complete low (means reset finished)


    Detailed hardware implementation l.jpg
    Detailed Hardware Implementation Delay-Insensitive Global Communication

    • Each block implemented in CMOS standard cells

    • Design has few non-QDI timing constraints

    Four-phase core

    LEDR input

    LEDR output


    Four phase encode input converter l.jpg
    Four-phase Encode (Input Converter) Delay-Insensitive Global Communication

    • Converts LEDR input to four-phase dual-rail

      • Enable=‘1’: outputs evaluate based on LEDR data

      • Enable=‘0’: outputs reset (LEDR data blocked)


    Four phase decode output converter l.jpg
    Four-phase Decode (Output Converter) Delay-Insensitive Global Communication

    • Converts four-phase bits to LEDR output

      • LEDR data rail encoding

        • Assert either S (1 value) or R (0 value), then return-to-hold

        • More robust alternative: C-element


    Four phase decode output converter33 l.jpg
    Four-phase Decode (Output Converter) Delay-Insensitive Global Communication

    • Converts four-phase bits to LEDR output

      • LEDR parity rail encoding

        • Parity output: based on 4-phase data and LEDR input phase (parity)

        • Alternating phases: green vs. red gates

        • D-latch: blocks new input parity arrival until 4-phase reset complete

    even phase

    odd phase


    1 bit completion detectors l.jpg
    1-Bit Completion Detectors Delay-Insensitive Global Communication

    • LEDR CD at input and output

    • Four-phase CD in function block

    • Both protocols have one gate CD

      • XOR (parity) for LEDR

      • OR for four-phase dual-rail

    1-bit LEDR completion detector

    1-bit four-phase completion detector


    N bit completion detectors l.jpg
    N-Bit Completion Detectors Delay-Insensitive Global Communication

    • C-element trees

      • Used for both LEDR and four-phase

        • C-element: standard cell implementation (AOI222 w/feedback)


    Control block l.jpg
    Control Block Delay-Insensitive Global Communication

    • Main Purpose: controls 4-phase function block

      • 4-phase eval requires 3-way synchronization

        • Function block: previous RZ complete

        • Primary inputs: new data arrival

        • Right interface (in pipeline): ACK received

    • In pipeline topology: also sends left ACK

    For pipeline topology only


    Control block37 l.jpg
    Control Block Delay-Insensitive Global Communication

    • Converts two-phase inputs to four-phase outputs

    Two-phase to four-phase conversion


    Control block signaling conversion l.jpg
    Control Block: Signaling Conversion Delay-Insensitive Global Communication

    Pulse-mode

    (timed)

    Transition-signal

    (falling or rising )

    Four-phase

    (level-sensitive)

    SR latch captures the pulse

    Inverter and XNOR form simple pulse gen


    Timing requirements l.jpg
    Timing Requirements Delay-Insensitive Global Communication

    • Circuits almost entirely QDI

    • Exceptions:

      • Control block:

        • Two-sided timing constraint on length of pulse

        • Sensitive to both gate and wire delays

        • Careful layout required

      • Latches: simple hold time constraints

        • SR latches can be replaced by C-elements

          • C-elements also have implementation-specific timing constraints

          • SR latch much faster than our standard cell C-element

        • D latch can be removed at cost of concurrency


    Outline40 l.jpg
    Outline Delay-Insensitive Global Communication

    • Motivation and Contribution

    • Proposed System Architecture

    • Experimental Results

      • Design Methodology

      • Datapath Setup

      • Simulation Results

      • Latency and Throughput Analysis

    • Extensions: Other Signaling Styles

    • Conclusions and Future Work


    Design methodology l.jpg
    Design Methodology Delay-Insensitive Global Communication

    • Standard cell design flow with complete layout

      • 0.18 μm TSMC CMOS process

      • 4 metal layers of 7 available used in routing

    • Custom place-and-route used

      • Only major layout concern: pulse generator circuit

      • Design could be automated with constraints on pulse

    • Analog simulations: based on layout-extracted design

      • Test vectors including limiting fast and slow cases


    Datapath implementation l.jpg
    Datapath Implementation Delay-Insensitive Global Communication

    • Two function blocks implemented

      • An 8x8 carry-save multiplier

      • An empty FIFO stage

        • FIFO contains four-phase completion detector only

        • Demonstrates minimum possible node latency

    • Blocks are QDI in evaluate, but “eager” in reset

      • Implemented in combinational CMOS

      • “DIMS”-style logic (with C-elements) could be used instead

        • QDI in both directions

        • Increases both forward and reverse latencies


    Multiplier layout l.jpg
    Multiplier Layout Delay-Insensitive Global Communication

    • Includes dual rail multiplier and all conversion circuits

      • Total area of 0.051 mm2

    • FIFO stage has area of 0.018 mm2


    Measured block latencies l.jpg
    Measured Block Latencies Delay-Insensitive Global Communication


    Performance results l.jpg
    Performance Results Delay-Insensitive Global Communication

    3 Metrics:

    • Forward Latency: input arrival  output data available

      • Average Values: Multiplier:6.8 ns; FIFO:2.9 ns.

    • Stabilization Time: input arrival  reset complete (circuit quiescent)

      • Multiplier:10.5 ns; FIFO:6.3 ns.

    • Pipelined Cycle Time: min processing time/data item (steady-state)

      • Multiplier:8.3 ns; FIFO4.0 ns.


    Performance analysis l.jpg
    Performance Analysis Delay-Insensitive Global Communication

    • Forward latency: overhead

      • 2.2 ns for both nodes

        • Overhead independent of function block size

      • Includes:

        • LEDR CD, control unit, input/output converters

    • Throughput: increased by concurrency

      • Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK)

      • Savings achieved even in environment without channel latency

    • “Core converter” overhead (no CD) extremely low

      • Only 1.1 ns average latency for converters + control

      • Completion detectors:

        • Account for half of forward latency overhead

        • Account for 55% of FIFO cycle time

      • Faster CDs would provide big improvement


    Outline47 l.jpg
    Outline Delay-Insensitive Global Communication

    • Motivation and Contribution

    • Proposed System Architecture

    • Experimental Results

    • Extensions: Other Signaling Styles

      • Converters for 1-of-4 function blocks

      • Converters for bundled data function block

    • Conclusions and Future Work


    Extensions to other local protocols l.jpg
    Extensions to Other Local Protocols Delay-Insensitive Global Communication

    • Only small changes to handle 1-of-4 or bundled data

      • No change to control block

    • 1-of-4 encoding:

      • Input/output converters:

        • Small changes to logic

      • Needs standard 1-of-4 completion detector

    • Single-rail bundled data:

      • Input converter: not needed – use LEDR data rail

      • Output converter:

        • New basic circuit required (see paper for details)

      • Function block completion detection:

        • Use bundled ‘done’ signal

        • Asymmetric delay chain (fast reset)


    Outline49 l.jpg
    Outline Delay-Insensitive Global Communication

    • Background and Motivation

    • Contribution

    • Proposed System Architecture

    • Experimental Results

    • Extensions: Other Signaling Styles

    • Conclusions and Future Work

      • Summary and Conclusion

      • Future Work


    Summary and conclusions l.jpg
    Summary and Conclusions Delay-Insensitive Global Communication

    • Support heterogeneous SOCs using hybrid protocols

      • LEDR: low-power, delay-insensitive communication fabric

      • Dual rail four-phase: Simple, fast logic blocks

    • Designed Converters for LEDR/four-phase SOC:

      • Low latency, high throughput, timing robust design

    • Robust concurrency system developed

      • Exploits four-phase reset to mask communication time

    • Simulations with realistic mid-sized function nodes

      • Demonstrated low latency overhead

      • Demonstrated low area overhead

      • Achieved throughputs up to 250 MHz for FIFO stage


    Future work l.jpg
    Future Work Delay-Insensitive Global Communication

    • Evaluating system-level benefits

      • Determine design spaces where converters most useful

        • Quantify benefits over using either protocol exclusively

    • Optimal partitioning of converter nodes

      • Explore dependence on system topology

    • Potential applications: use in async SOCs

      • Beigne/Vivet – GALS NoC Architectures (Async-06)

      • Scott et al. (Intel/Silistix) – PXA27x System (Async-07)

      • Dobkin/Ginosar/Kolodny – fast LEDR serial links (Async-06/07)

        • Convert 4-phase dual-rail to LEDR (for parallel load)


    ad