Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication


Amitava Mitra

Intel Corp., Bangalore, India

William F. McLaughlin

Columbia University, Electrical Engineering

Steven M. Nowick

Columbia University, Computer Science

Outline
  • Motivation and Contribution
    • System-on-Chip: Concepts and Trends
    • Asynchronous Signaling Styles
    • Target Asynchronous SOC Architecture
    • Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work
System-on-Chip (SOC): Concept and Trends
  • Microelectronic trends enabling SOC design
    • Increasing integration density + chip size
      • Formerly discrete functions (memory, I/O) now integrated
      • Popularity of “multi-core” designs
  • Heterogeneous SOC:
    • Large complex chip with broad functionality
    • Many independent computation nodes
      • Multiple cores, memories, accelerators, multimedia processing, etc.
      • Often includes multiple timing domains
    • Complex network-style interconnect fabric
  • Challenges in Heterogeneous SOC design:
    • Wire costs not scaling down with device size
      • Increasing proportion of power and delay in interconnect
    • Robust and high-performance interconnect design:
      • High latencies between remote nodes
      • Mixed timing, timing variability/uncertainty
      • Need to support varied components: modular/scalable design
SOC Communication Fabric
  • Growing factor in overall system performance
  • Ideal Requirements:
    • Speed: high throughput, low latency
    • Low power
    • Robust to timing variations
    • Flexibility: integrate modular IPs and upgrades
  • Asynchronous design well-suited to these goals
    • Timing-robust, flexible designs
    • Lower power than synchronous
    • Work by Quinton, Greenstreet, and Wilton [ICCD 2005]
      • GALS-style:
        • global LEDR interconnect + local synchronous blocks
        • does not provide details of protocol converters
Asynchronous for SOC Communication
  • Advantages of asynchronous global communication
    • Delay-insensitive (DI) encoding
      • Removes timing constraints on global routing
    • No clock signals to route across chip
      • Significant power advantage
    • Can support both async + sync computation
      • Delay-insensitive async logic combats growing variability concerns
      • GALS style: Globally-Asynchronous Locally-Synchronous
  • Several popular async signaling protocols
    • Dual rail four-phase, LEDR, 1-of-4, bundled data, others
    • No single protocol ideal for both logic and communication
Background: LEDR Signaling
  • Dual-rail encoding: two wires per bit – delay-insensitive
  • “Level-encoding”:
    • Data rail: holds actual data value
    • Parity rail: holds parity value
  • Alternating-phase protocol:
    • Encoding parity alternates between odd and even

LEDR encoding table (figure): columns for bit value, phase, data rail, and parity rail

LEDR Signaling
  • Exactly one wire transition for each new data item

Data rail: carries bit value in both phases

Waveform (figure): data values 0, 1, 0, 0, 1, 1, 1 sent in phases even, odd, even, odd, even, odd, even

Parity rail: phase alternates with each data item
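The single-transition property can be checked with a small behavioral sketch. It assumes the common LEDR convention that the phase of a codeword is the XOR of its two rails (0 = even, 1 = odd); the function names are illustrative and do not come from the slides.

```python
def ledr_encode(bits, first_phase=0):
    """Encode bits as (data, parity) rail pairs, alternating phase per item.

    Assumed convention: phase = data XOR parity, with 0 = even and 1 = odd.
    """
    phase = first_phase
    codewords = []
    for bit in bits:
        parity = bit ^ phase          # chosen so that data XOR parity == phase
        codewords.append((bit, parity))
        phase ^= 1                    # phase alternates with every data item
    return codewords


def transitions(prev, curr):
    """Count how many of the two rails changed between consecutive codewords."""
    return sum(a != b for a, b in zip(prev, curr))


bits = [0, 1, 0, 0, 1, 1, 1]          # data values from the slide's waveform
words = ledr_encode(bits)
changes = [transitions(p, c) for p, c in zip(words, words[1:])]
print(words)
print(changes)                        # one rail changes per new data item
```

Because the phase must flip with every item, an odd number of the two rails has to change, which means exactly one rail toggles regardless of the data value.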

Four-Phase Dual-Rail Signaling
  • Alternative DI Code
  • Key Differences:
    • Four-phase (Return-to-Zero) protocol
      • Spacer (reset) state required between each data item
    • One-hot encoding:
      • True rail (encodes 1) & false rail (encodes 0)

Waveform (figure): data values 1, 0, 1, 1 on the true rail and false rail; each item has an evaluation phase (one rail high) followed by a reset phase (both rails low)

Four-Phase Dual-Rail vs. LEDR
  • Advantages of four-phase dual-rail:
    • Delay-insensitive logic using standard gates
      • Implementations are simple and fast: widely used
      • LEDR: complex & impractical
  • Disadvantages of four-phase dual-rail:
    • System-level communication throughput:
      • Spacer state doubles round-trip communication latency
      • LEDR: no spacer required
    • Power dissipation:
      • Two transitions/bit (up and down) for each data item
      • LEDR: only one transition/bit
  • Conclusion:
    • Four-phase dual-rail better for implementing function blocks
    • LEDR is better for global communication
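The power argument (two transitions per bit versus one) can be illustrated by counting wire transitions for the same data stream under both codes. This is a sketch under assumptions: the LEDR phase convention (phase = data XOR parity) and the idle starting state of both wire pairs are chosen for illustration.

```python
def ledr_wire_states(bits):
    """LEDR: one codeword per item, no spacer (assumed: phase = data XOR parity)."""
    states, phase = [(0, 0)], 1        # idle codeword is even, so the first item is odd
    for bit in bits:
        states.append((bit, bit ^ phase))
        phase ^= 1
    return states


def four_phase_wire_states(bits):
    """Four-phase dual-rail: (true, false) rails, spacer (0, 0) after every item."""
    states = [(0, 0)]
    for bit in bits:
        states.append((bit, 1 - bit))  # evaluation: exactly one rail high
        states.append((0, 0))          # reset: both rails low
    return states


def count_transitions(states):
    """Total number of single-wire changes across a sequence of wire states."""
    return sum(a != b
               for prev, curr in zip(states, states[1:])
               for a, b in zip(prev, curr))


data = [1, 0, 1, 1]
print(count_transitions(ledr_wire_states(data)))        # 4: one per item
print(count_transitions(four_phase_wire_states(data)))  # 8: two per item
```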
Target Asynchronous SOC Architecture
  • Three major components:
    • Global communication network (LEDR)
    • Local computation nodes (varied styles)
    • New requirement: protocol converters at interfaces
      • Allow full separation of computation and communication

Our goal: protocol converters to enable this global LEDR SOC

Contribution
  • High-speed protocol converters to enable heterogeneous SOC architectures
    • Supports high-throughput, robust global communication
      • LEDR encoding
    • Supports efficient design of local function blocks
      • (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data
  • Features:
    • Family of low-latency protocol converters:
      • support the three local encoding styles above
    • High throughput:
      • facilitates concurrent interaction of nodes
    • Timing-robust:
      • converters almost entirely QDI
    • Low design effort:
      • standard cell design flow
    • Fully implemented in 0.18 μm CMOS
      • Layout and simulation
      • FIFO throughputs up to 250 MHz
Two Target SOC Topologies

1. “Pipeline-style” topology

  • Feed-forward data path:
    • uni-directional token flow
  • Receiving node returns a single ACK (control signal)
    • Supports concurrency between nodes

Data feeds forward

Acknowledge sent back

Two SOC Topologies (cont.)

2. “Server-style” topology

    • Client passes data token to server
    • Server computes/returns data token to client (result)
      • Explicit ACK unnecessary
  • Proposed SOC architecture supports both topologies

Four-phase server

Four-phase data client

Bi-directional data flow: data passed back to client on completion

Outline
  • Motivation and Contribution
  • Proposed System Architecture
    • Architecture Overview
    • System Simulation
    • Detailed Hardware Implementation
    • Timing Analysis
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work
Architecture Overview

Four-phase core

  • External LEDR interface, internal four-phase core
    • Four-phase signals are shown in red
    • Two-phase or transition signals are shown in yellow

LEDR input

LEDR output

Control Signals
  • Two-phase control signals

Phase of LEDR input (request from left)

Phase of LEDR output (forward complete)

Acknowledge to left neighbor

Acknowledge from right neighbor

Control Signals
  • Four-phase control signals

Completion detect four-phase evaluate and RZ

Enable four-phase evaluate and RZ

System Simulation
  • LEDR inputs begin arriving at quiescent system

LEDR inputs arrive

Completion detection

System Simulation
  • Input completion detection sent to control

All input phases matching

Transition to new phase

System Simulation
  • Control enables four-phase evaluate phase

Enable rises

System Simulation
  • LEDR input converted to four-phase

Enable now high

One wire of each four-phase pair rises

System Simulation
  • Four-phase function evaluation
System Simulation
  • Four-phase bits decoded to LEDR
    • Each bit converted as soon as it computes

LEDR outputs to next node generated

Four-phase complete not used in evaluate phase

System Simulation
  • LEDR output completion detection

Output pairs

ACK from right may come any time after all pairs are sent

System Simulation
  • Control enables four-phase reset phase

Enable falls

System Simulation
  • Function block inputs return-to-zero
    • ACK is sent concurrently to left

Enable now low

Pipeline concurrency:

request new data during reset phase

System Simulation
  • Four-phase reset propagates through logic block

New data may arrive now that ACK has been sent

Reset Completion detection

Enable remains low

System Simulation
  • Four-phase reset completes
    • Complete internal cycle has now been performed

Complete falls

System Simulation
  • New evaluate phase begins when Enable rises again
    • Pre-conditions: reset finished, new data REQ, and old data ACK

Three-way synchronization

Input phase transitions when new data ready

ACK transitions when outputs safe to change

Complete low (means reset finished)
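The three pre-conditions for raising Enable can be written as a single predicate. This is only a behavioral sketch: the signal names and the way the two-phase signals are compared (a request is pending when the input phase differs from the last consumed phase; the old data is acknowledged when the right-hand ACK matches the output phase) are assumptions made for illustration, not the gate-level control circuit from the slides.

```python
def enable_may_rise(complete, input_phase, consumed_phase,
                    output_phase, ack_right):
    """Return True when a new four-phase evaluate phase may begin.

    All arguments are 0/1 values; the names are hypothetical.
    """
    reset_finished = complete == 0                 # previous return-to-zero done
    new_data_req = input_phase != consumed_phase   # two-phase REQ: phase toggled
    old_data_acked = ack_right == output_phase     # two-phase ACK: caught up
    return reset_finished and new_data_req and old_data_acked


# New data has arrived and reset is done, but the right neighbor has not
# acknowledged the previous output yet, so Enable must stay low.
print(enable_may_rise(complete=0, input_phase=1, consumed_phase=0,
                      output_phase=1, ack_right=0))
# Once the ACK transitions, all three conditions hold.
print(enable_may_rise(complete=0, input_phase=1, consumed_phase=0,
                      output_phase=1, ack_right=1))
```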

Detailed Hardware Implementation
  • Each block implemented in CMOS standard cells
  • Design has few non-QDI timing constraints

Four-phase core

LEDR input

LEDR output

Four-phase Encode (Input Converter)
  • Converts LEDR input to four-phase dual-rail
    • Enable=‘1’: outputs evaluate based on LEDR data
    • Enable=‘0’: outputs reset (LEDR data blocked)
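A minimal behavioral sketch of the input converter for one bit follows. It assumes the four-phase rails are obtained by gating the LEDR data rail (and its complement) with Enable; the slides do not give the gate-level circuit, so this gating is an illustrative assumption.

```python
def four_phase_encode(data_rail, enable):
    """Return (true_rail, false_rail) for one bit.

    enable = 1: evaluate, exactly one rail goes high according to the data.
    enable = 0: reset, both rails low and the LEDR data is blocked.
    """
    true_rail = enable & data_rail
    false_rail = enable & (1 - data_rail)
    return true_rail, false_rail


for enable in (0, 1):
    for data in (0, 1):
        print(enable, data, four_phase_encode(data, enable))
```

Only the data rail is needed here because, in LEDR, the data rail carries the bit value in both phases.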
Four-phase Decode (Output Converter)
  • Converts four-phase bits to LEDR output
    • LEDR data rail encoding
      • Assert either S (1 value) or R (0 value), then return-to-hold
      • More robust alternative: C-element
Four-phase Decode (Output Converter)
  • Converts four-phase bits to LEDR output
    • LEDR parity rail encoding
      • Parity output: based on 4-phase data and LEDR input phase (parity)
      • Alternating phases: green vs. red gates
      • D-latch: blocks new input parity arrival until 4-phase reset complete

even phase

odd phase
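The output converter's behavior for one bit can be sketched as a small state-holding model. Assumptions, all for illustration: the data rail behaves like a set/reset latch driven by the true and false rails, the parity rail follows the convention phase = data XOR parity, and the D-latch is abstracted away by ignoring the phase input while both four-phase rails are low. The class and method names are hypothetical.

```python
class OutputBitConverter:
    """Behavioral model of a four-phase to LEDR converter for a single bit."""

    def __init__(self):
        self.data = 0     # LEDR data rail (latch state)
        self.parity = 0   # LEDR parity rail

    def step(self, true_rail, false_rail, phase):
        """Apply one set of inputs; phase is 0 (even) or 1 (odd)."""
        if true_rail and not false_rail:       # S asserted: value 1
            self.data, self.parity = 1, 1 ^ phase
        elif false_rail and not true_rail:     # R asserted: value 0
            self.data, self.parity = 0, 0 ^ phase
        # Both rails low (spacer): hold the previous LEDR codeword.
        return self.data, self.parity


conv = OutputBitConverter()
print(conv.step(1, 0, phase=1))   # evaluate a 1 in odd phase
print(conv.step(0, 0, phase=0))   # reset: LEDR output holds
print(conv.step(0, 1, phase=0))   # evaluate a 0 in even phase
```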

1-Bit Completion Detectors
  • LEDR CD at input and output
  • Four-phase CD in function block
  • Both protocols have one gate CD
    • XOR (parity) for LEDR
    • OR for four-phase dual-rail

1-bit LEDR completion detector

1-bit four-phase completion detector
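Both one-gate detectors are easy to state as Boolean functions. The sketch below assumes 0/1 integer signals; the function names are illustrative.

```python
def ledr_phase(data_rail, parity_rail):
    """LEDR 1-bit completion detector: XOR gives the phase of the codeword."""
    return data_rail ^ parity_rail


def four_phase_valid(true_rail, false_rail):
    """Four-phase 1-bit completion detector: OR is 1 for data, 0 for spacer."""
    return true_rail | false_rail


print(ledr_phase(0, 0), ledr_phase(1, 0))              # 0 1
print(four_phase_valid(0, 0), four_phase_valid(0, 1))  # 0 1
```

A new LEDR item is detected as a change in the XOR output, while a four-phase item is detected as a level: high during evaluation, low after reset.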

N-Bit Completion Detectors
  • C-element trees
    • Used for both LEDR and four-phase
      • C-element: standard cell implementation (AOI222 w/feedback)
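A C-element tree can be modeled behaviorally as follows. The sketch uses 2-input C-elements (output follows the inputs when they agree, otherwise holds) and passes an odd leftover signal up to the next level; the tree shape and class names are assumptions for illustration, not the standard-cell implementation described on the slide.

```python
class CElement:
    """2-input Muller C-element: output changes only when both inputs agree."""

    def __init__(self, state=0):
        self.state = state

    def update(self, a, b):
        if a == b:
            self.state = a
        return self.state


class CompletionTree:
    """Tree of C-elements combining n per-bit completion signals into one."""

    def __init__(self, n_bits):
        self.levels = []
        width = n_bits
        while width > 1:
            self.levels.append([CElement() for _ in range(width // 2)])
            width = width // 2 + width % 2

    def update(self, signals):
        signals = list(signals)
        for level in self.levels:
            merged = [c.update(signals[2 * i], signals[2 * i + 1])
                      for i, c in enumerate(level)]
            if len(signals) % 2:
                merged.append(signals[-1])   # odd signal passes to next level
            signals = merged
        return signals[0]


tree = CompletionTree(4)
print(tree.update([1, 1, 0, 1]))   # 0: one bit has not arrived yet
print(tree.update([1, 1, 1, 1]))   # 1: all bits complete
print(tree.update([0, 1, 1, 0]))   # 1: holds until all bits return
print(tree.update([0, 0, 0, 0]))   # 0
```

The inputs would be the per-bit XOR outputs for LEDR or the per-bit OR outputs for four-phase dual-rail.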
Control Block
  • Main Purpose: controls 4-phase function block
    • 4-phase eval requires 3-way synchronization
      • Function block: previous RZ complete
      • Primary inputs: new data arrival
      • Right interface (in pipeline): ACK received
  • In pipeline topology: also sends left ACK

For pipeline topology only

Control Block
  • Converts two-phase inputs to four-phase outputs

Two-phase to four-phase conversion

Control Block: Signaling Conversion

Pulse-mode (timed)

Transition-signal (falling or rising)

Four-phase (level-sensitive)

SR latch captures the pulse

Inverter and XNOR form simple pulse gen
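The conversion chain can be sketched in discrete time. Assumptions for illustration: the inverter contributes a delay of one time step, the XNOR compares the signal with that delayed, inverted copy, and the latch's reset input comes from some later event in the handshake (the slides do not say which signal resets it).

```python
def pulse_generator(samples, initial=0):
    """XNOR of a signal with a one-step-delayed, inverted copy of itself.

    Produces a one-step pulse on every transition, rising or falling.
    """
    pulses, delayed = [], initial
    for x in samples:
        inverted_delayed = 1 - delayed
        pulses.append(1 if x == inverted_delayed else 0)   # XNOR
        delayed = x
    return pulses


def sr_latch(set_inputs, reset_inputs, state=0):
    """Level output: set by a pulse, cleared by a (hypothetical) reset input."""
    outputs = []
    for s, r in zip(set_inputs, reset_inputs):
        if s and not r:
            state = 1
        elif r and not s:
            state = 0
        outputs.append(state)
    return outputs


transition_signal = [0, 0, 1, 1, 1, 0, 0]
pulses = pulse_generator(transition_signal)
level = sr_latch(pulses, [0, 0, 0, 0, 1, 0, 0])
print(pulses)   # [0, 0, 1, 0, 0, 1, 0]
print(level)    # [0, 0, 1, 1, 0, 1, 1]
```

The pulse width equals the inverter delay, which is why the slides list a two-sided timing constraint on pulse length as one of the few non-QDI requirements.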

Timing Requirements
  • Circuits almost entirely QDI
  • Exceptions:
    • Control block:
      • Two-sided timing constraint on length of pulse
      • Sensitive to both gate and wire delays
      • Careful layout required
    • Latches: simple hold time constraints
      • SR latches can be replaced by C-elements
        • C-elements also have implementation-specific timing constraints
        • SR latch much faster than our standard cell C-element
      • D latch can be removed at cost of concurrency
Outline
  • Motivation and Contribution
  • Proposed System Architecture
  • Experimental Results
    • Design Methodology
    • Datapath Setup
    • Simulation Results
    • Latency and Throughput Analysis
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work
Design Methodology
  • Standard cell design flow with complete layout
    • 0.18 μm TSMC CMOS process
    • 4 of the 7 available metal layers used in routing
  • Custom place-and-route used
    • Only major layout concern: pulse generator circuit
    • Design could be automated with constraints on pulse
  • Analog simulations: based on layout-extracted design
    • Test vectors including limiting fast and slow cases
Datapath Implementation
  • Two function blocks implemented
    • An 8x8 carry-save multiplier
    • An empty FIFO stage
      • FIFO contains four-phase completion detector only
      • Demonstrates minimum possible node latency
  • Blocks are QDI in evaluate, but “eager” in reset
    • Implemented in combinational CMOS
    • “DIMS”-style logic (with C-elements) could be used instead
      • QDI in both directions
      • Increases both forward and reverse latencies
Multiplier Layout
  • Includes dual rail multiplier and all conversion circuits
    • Total area of 0.051 mm²
  • FIFO stage has area of 0.018 mm²
Performance Results

3 Metrics:

  • Forward Latency: input arrival → output data available
    • Average values: Multiplier: 6.8 ns; FIFO: 2.9 ns
  • Stabilization Time: input arrival → reset complete (circuit quiescent)
    • Multiplier: 10.5 ns; FIFO: 6.3 ns
  • Pipelined Cycle Time: minimum processing time per data item (steady state)
    • Multiplier: 8.3 ns; FIFO: 4.0 ns
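The pipelined cycle times translate directly into throughput as the reciprocal of the cycle time. A short sketch of that arithmetic, using only the figures on this slide:

```python
def throughput_mhz(cycle_time_ns):
    """Steady-state throughput in MHz for a given cycle time in ns."""
    return 1000.0 / cycle_time_ns


cycle_time_ns = {"multiplier": 8.3, "fifo": 4.0}
for node, t in cycle_time_ns.items():
    print(f"{node}: {throughput_mhz(t):.1f} MHz")
```

The FIFO's 4.0 ns cycle corresponds to the 250 MHz figure quoted elsewhere in the talk; the multiplier's 8.3 ns cycle works out to roughly 120 MHz.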
Performance Analysis
  • Forward latency: overhead
    • 2.2 ns for both nodes
      • Overhead independent of function block size
    • Includes:
      • LEDR CD, control unit, input/output converters
  • Throughput: increased by concurrency
    • Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK)
    • Savings achieved even in environment without channel latency
  • “Core converter” overhead (no CD) extremely low
    • Only 1.1 ns average latency for converters + control
    • Completion detectors:
      • Account for half of forward latency overhead
      • Account for 55% of FIFO cycle time
    • Faster CDs would provide big improvement
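The breakdown above can be reproduced from the reported numbers. The sketch takes the slide figures at face value and assumes the overhead and concurrency saving simply add to or subtract from the measured latencies; the variable names are illustrative.

```python
FORWARD_LATENCY_NS = {"multiplier": 6.8, "fifo": 2.9}
CYCLE_TIME_NS = {"multiplier": 8.3, "fifo": 4.0}
OVERHEAD_NS = 2.2             # forward-latency overhead, same for both nodes
CONCURRENCY_SAVING_NS = 2.2   # cycle-time reduction vs. post-reset ACK

# Completion detectors account for half of the forward-latency overhead.
cd_share_ns = round(OVERHEAD_NS / 2, 2)
core_converter_ns = round(OVERHEAD_NS - cd_share_ns, 2)

# Completion detectors account for 55% of the FIFO cycle time.
fifo_cd_cycle_ns = round(0.55 * CYCLE_TIME_NS["fifo"], 2)

# Forward latency left over once the fixed overhead is removed.
remaining_ns = {k: round(v - OVERHEAD_NS, 2)
                for k, v in FORWARD_LATENCY_NS.items()}

# Estimated cycle time if the ACK were sent only after reset completes.
post_reset_ack_ns = {k: round(v + CONCURRENCY_SAVING_NS, 2)
                     for k, v in CYCLE_TIME_NS.items()}

print(cd_share_ns, core_converter_ns)   # 1.1 1.1
print(fifo_cd_cycle_ns)                 # 2.2
print(remaining_ns)
print(post_reset_ack_ns)
```

The estimated post-reset-ACK cycle times (10.5 ns and 6.2 ns) come out close to the stabilization times reported on the results slide (10.5 ns and 6.3 ns).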
Outline
  • Motivation and Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
    • Converters for 1-of-4 function blocks
    • Converters for bundled data function block
  • Conclusions and Future Work
Extensions to Other Local Protocols
  • Only small changes to handle 1-of-4 or bundled data
    • No change to control block
  • 1-of-4 encoding:
    • Input/output converters:
      • Small changes to logic
    • Needs standard 1-of-4 completion detector
  • Single-rail bundled data:
    • Input converter: not needed – use LEDR data rail
    • Output converter:
      • New basic circuit required (see paper for details)
    • Function block completion detection:
      • Use bundled ‘done’ signal
      • Asymmetric delay chain (fast reset)
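For readers unfamiliar with 1-of-4 encoding, the sketch below shows the general idea: two bits are carried on four wires with exactly one wire high, and all wires low as the spacer. The wire ordering (wire i high for value i) is an assumption for illustration, and the OR-based detector is the usual form for one-hot codes rather than the specific circuit used in the paper.

```python
def one_of_four_encode(value):
    """Encode a two-bit value (0..3) as four wires with exactly one high."""
    if value not in range(4):
        raise ValueError("value must be in 0..3")
    return tuple(1 if i == value else 0 for i in range(4))


def one_of_four_valid(wires):
    """Completion detector: 1 when any wire is high, 0 for the spacer."""
    return int(any(wires))


SPACER = (0, 0, 0, 0)
print(one_of_four_encode(2))                      # (0, 0, 1, 0)
print(one_of_four_valid(one_of_four_encode(2)))   # 1
print(one_of_four_valid(SPACER))                  # 0
```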
Outline
  • Background and Motivation
  • Contribution
  • Proposed System Architecture
  • Experimental Results
  • Extensions: Other Signaling Styles
  • Conclusions and Future Work
    • Summary and Conclusion
    • Future Work
Summary and Conclusions
  • Support heterogeneous SOCs using hybrid protocols
    • LEDR: low-power, delay-insensitive communication fabric
    • Dual rail four-phase: Simple, fast logic blocks
  • Designed Converters for LEDR/four-phase SOC:
    • Low latency, high throughput, timing robust design
  • Robust concurrency system developed
    • Exploits four-phase reset to mask communication time
  • Simulations with realistic mid-sized function nodes
    • Demonstrated low latency overhead
    • Demonstrated low area overhead
    • Achieved throughputs up to 250 MHz for FIFO stage
Future Work
  • Evaluating system-level benefits
    • Determine design spaces where converters most useful
      • Quantify benefits over using either protocol exclusively
  • Optimal partitioning of converter nodes
    • Explore dependence on system topology
  • Potential applications: use in async SOCs
    • Beigne/Vivet – GALS NoC Architectures (Async-06)
    • Scott et al. (Intel/Silistix) – PXA27x System (Async-07)
    • Dobkin/Ginosar/Kolodny – fast LEDR serial links (Async-06/07)
      • Convert 4-phase dual-rail to LEDR (for parallel load)