Asynchronous Links, for NanoNets?

Alex Yakovlev, University of Newcastle, UK

Motivation-1

[Chart: relative delay (0.1–100, log scale) vs feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), and global interconnect with and without repeaters. Source: ITRS, 2003]
  • At very deep submicron, gate delay is much less than interconnect delay: total interconnect length can reach several meters, and interconnect delay can be as much as 90% of total path delay in VDSM circuits
  • Timing is a problem, particularly for global wires
  • Multiple clock domains are a reality, creating the problem of interfacing between them
  • ITRS’05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020)
Motivation-2
  • Variability and uncertainty
    • Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for wires and repeaters
      • e.g., per-µm resistance of M4 and M5 differs massively, leading to mistracking (C. Visweswariah, SLIP’06)
      • e.g., at 250 nm, clock skew shows 25% variability due to interconnect variations (Y. Liu et al., DAC’00)
    • Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay – R. Ho, M. Horowitz)
A Network on Chip

[Diagram: chip with multiple clock domains connected by async links; synchronization required where links meet clocked cells, arbitration required where links merge]
Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)

Built for an onboard airborne computer-control system that tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level.

[Diagram: individually clocked subsystems connected by self-timed adapters forming a ring]

Communication Channel Adapter

  • Much higher reliability than a bus and other forms of redundancy
  • The MCC was implemented in TTL-Schottky gate arrays, approx. 2K gates
  • Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
  • AR, AS – acknowledgements
  • RR, RS – spare (self-repair) lines

Outline
  • Token-based view of communication
  • Basics of asynchronous signalling
  • Self-timed data encoding
  • Pipelining
  • How to hide acknowledgements
  • Serial vs Parallel links
  • Arbiters and routers
  • Async2sync interface
  • CAD issues
Data exchange: token-based view

[Diagram: source → Tx → data channel → Rx → dest]

  • Question 1: when can Rx look at the incoming data?
    • Data validity issue – forming a well-defined token
  • Question 2: when can Tx send new data?
    • Acknowledgement issue – separation between tokens

These are fundamental issues of flow control at the physical and link levels. The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process, etc.
Tokens and spaces with global clocking

  • In globally clocked systems, both Q1 and Q2 are resolved with the aid of clock pulses

[Diagram: source → Tx → data channel → Rx → dest, with a common clk driving both Tx and Rx]
Tokens and spaces

  • Without global clocking, Q1 can be resolved differently from Q2
  • E.g.: Q1 – source-synchronous (mesochronous) clocking, bundled data, or self-synchronising codes; Q2 – ack or stop signal, or local timing

[Diagram: Tx sends data bundled with a D_valid strobe; Clk_tx and Clk_rx clock the two ends independently]
[Diagram variant: as above, but with ack signals flowing back from dest to Rx and from Rx to Tx to resolve Q2]
Petri net model

[Petri net 1: source → Tx → (Data Valid) → Rx → dest, with Tx-delay and Rx-delay timing arcs]
One-way delay, but may be unsafe!

[Petri net 2: as above, plus an ack path closing the loop: Tx delay or ack, Rx delay or ack]
Always safe, but with a round-trip delay!
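The safety trade-off can be seen in a toy simulation. Below is a minimal sketch (ours, not from the slides) that models the channel as a one-place buffer with randomised Tx delays and a fixed Rx delay; run() and its parameter names are illustrative.

```python
# Minimal sketch (not from the slides): the channel is a 1-place buffer.
# Without an ack, a token can be overwritten before Rx consumes it (unsafe);
# with an ack, Tx waits for the space to be freed (safe, but round-trip).
import random

def run(tx_delay, rx_delay, use_ack, n_tokens=1000):
    """Return how many tokens were lost to channel overruns."""
    rx_free_at = 0.0        # time when Rx finishes consuming the current token
    t = 0.0                 # Tx local time
    lost = 0
    for _ in range(n_tokens):
        t += random.expovariate(1.0 / tx_delay)   # next token ready at Tx
        if use_ack:
            t = max(t, rx_free_at)                # wait for the ack (space)
        elif t < rx_free_at:
            lost += 1                             # overwrote an unconsumed token
            continue
        rx_free_at = max(rx_free_at, t) + rx_delay
    return lost

print("one-way (no ack):", run(1.0, 1.5, use_ack=False), "of 1000 tokens lost")
print("with ack        :", run(1.0, 1.5, use_ack=True), "of 1000 tokens lost")
```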

Asynchronous handshake signalling

Valid data tokens and safe spaces between them can be created by different means of signalling and encoding

  • Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
  • Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
  • Pulse-based, e.g. GasP
  • Phase-difference-based
  • Data encoding: bundled data (BD), Delay-insensitive (DI)
Handshake Signalling Protocols

  • Level Signalling (RTZ or 4-phase): one cycle is req+, ack+, req-, ack-
  • Transition Signalling (NRZ or 2-phase): one cycle is a single transition on req followed by a single transition on ack
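As a rough illustration of the two protocols (a sketch, not from the slides), the following counts wire transitions per transferred token: RTZ spends four transitions per cycle, NRZ two.

```python
# Sketch (not from the slides): transitions per token for RTZ vs NRZ.

def rtz_cycle(events):
    """4-phase: req and ack each rise and fall once per token."""
    events += ["req+", "ack+", "req-", "ack-"]

def nrz_cycle(events, up):
    """2-phase: one transition on req and one on ack per token."""
    d = "+" if up else "-"
    events += [f"req{d}", f"ack{d}"]

rtz, nrz = [], []
for i in range(3):                     # transfer three tokens
    rtz_cycle(rtz)
    nrz_cycle(nrz, up=(i % 2 == 0))

print("RTZ:", len(rtz), "transitions ->", rtz)   # 12 transitions
print("NRZ:", len(nrz), "transitions ->", nrz)   # 6 transitions
```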

Handshake Signalling Protocols (cont.)

  • Pulse Signalling: a req pulse answered by an ack pulse forms one cycle
  • Single-track Signalling (GasP): req and ack share a single wire (req + ack) – the sender drives it one way (req) and the receiver drives it back (ack)
GasP signalling

[Circuit: GasP stage with a pull-up from the predecessor (req), a pull-down from the successor (ack), local “pull up from here” / “pull down here” drivers, and pulse-length control loops]

Source: R. Ho et al., ASYNC’04
Data encoding
  • Bundled data
    • Code is positional binary; the token is delimited by the Req+ signal, which arrives a safe set-up delay after the data
  • Delay-insensitive (DI) codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)
    • 1-of-2 (dual-rail per bit) – systematic code; encoding and decoding are straightforward
    • m-of-n (n>2) – not systematic, i.e. incurs encoding and decoding costs; optimal when m = n/2
    • One-hot, 1-of-n (n>2) – completion detection is easy, but not practical beyond n = 4
    • Systematic codes, such as Berger, incur complex completion detection
Bundled Data

[Waveforms: RTZ – data becomes valid, then req+, ack+, req-, ack- form one cycle; NRZ – data becomes valid, then one transition on req and one on ack form one cycle]
DI encoded data (Dual-Rail)

[Waveforms: RTZ – the Data.0/Data.1 rails return to the NULL (spacer) state between code words; the sequence Data.0 (logical 0), NULL, Data.1 (logical 1), NULL forms one cycle per symbol, each acknowledged by ack. NRZ – each transition on Data.0 or Data.1 encodes a new value with no spacer; cycles are marked by ack transitions]

NRZ dual-rail leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below.
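A minimal sketch of RTZ dual-rail encoding (our illustration; the rail assignment Data.0 = (1,0), Data.1 = (0,1) is an assumption):

```python
# Minimal sketch (assumption, not from the slides): RTZ dual-rail encoding.
# Each bit is sent as a codeword on rails (d0, d1) with a NULL spacer between.

NULL = (0, 0)

def encode_rtz(bits):
    """Yield the rail states for an RTZ dual-rail stream."""
    for b in bits:
        yield (0, 1) if b else (1, 0)   # Data.1 or Data.0 token
        yield NULL                      # return-to-zero spacer

def decode_rtz(rails):
    """Recover bits, skipping spacers; (1, 1) is an illegal codeword."""
    for d0, d1 in rails:
        if (d0, d1) == NULL:
            continue
        assert not (d0 and d1), "illegal dual-rail codeword"
        yield d1

stream = list(encode_rtz([1, 0, 1, 1]))
print(stream)                    # [(0,1),(0,0),(1,0),(0,0),(0,1),(0,0),(0,1),(0,0)]
print(list(decode_rtz(stream)))  # [1, 0, 1, 1]
```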

DI codes (1-of-n and m-of-n)
  • 1-of-4:
    • 0001=> 00, 0010=>01, 0100=>10, 1000=>11
  • 2-of-4:
    • 1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.)
  • 3-of-6:
    • 111000, 110100, …, 000111 – total 20 combinations (can encode 4 bits + 4 control tokens)
  • 2-of-7:
    • 1100000, 1010000, …, 0000011 – total 21 combinations (4 bits + 5 control tokens)
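The codeword counts quoted above can be checked directly; a small script (added for illustration):

```python
# Added check of the codeword counts and payload quoted above.
from itertools import combinations
from math import comb, floor, log2

for m, n in [(1, 4), (2, 4), (3, 6), (2, 7)]:
    words = list(combinations(range(n), m))    # positions of the m ones
    assert len(words) == comb(n, m)
    print(f"{m}-of-{n}: {comb(n, m):2d} codewords,"
          f" {floor(log2(comb(n, m)))} data bits per symbol")

# 1-of-4:  4 codewords, 2 data bits per symbol
# 2-of-4:  6 codewords, 2 data bits per symbol
# 3-of-6: 20 codewords, 4 data bits per symbol
# 2-of-7: 21 codewords, 4 data bits per symbol
```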
DI codes completion detection and decoding
  • 1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)
  • Decode 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)
  • For m-of-n codes CD and decoding is non-trivial

From J.Bainbridge et al, ASYNC’03
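The completion-detection and decoding equations above transcribe directly into code; a sketch (reading the 1-of-4 table "0001 => 00" with d0 as the rightmost bit is our assumption):

```python
# Direct transcription of the gate equations above for 1-of-4 codes.

def cd_1of4(d0, d1, d2, d3):
    """Completion detection: CD = d0 + d1 + d2 + d3 (a 4-input OR)."""
    return d0 | d1 | d2 | d3

def decode_1of4(d0, d1, d2, d3):
    """Four 2-input ORs to dual-rail: (q0.0, q0.1, q1.0, q1.1)."""
    return (d0 | d2, d1 | d3, d0 | d1, d2 | d3)

# 0100 (d2 high) encodes '10': expect q0.0=1 (q0=0) and q1.1=1 (q1=1)
print(cd_1of4(0, 0, 1, 0), decode_1of4(0, 0, 1, 0))   # 1 (1, 0, 0, 1)
```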

Incomplete DI codes

  • Incomplete 2-of-7: composed of a 1-of-3 code and a 1-of-4 code (3 × 4 = 12 codewords)

From J.Bainbridge et al., ASYNC’03
Phase-difference-based encoding (C. D’Alessandro et al., ASYNC’06–’07)

[Waveforms: two signals t_0 and t_1 generated from a reference ref; “t_1 before t_0” encodes one bit value and “t_0 before t_1” the other, with spacers sp0/sp1 between symbols; the example transmits data 0, 0, 1, 0]

  • A bit of data is encoded in the phase relationship between two signals generated using a reference
  • Any transient fault appearing on one of the signals is ignored unless it is mirrored by a corresponding transition on the other line
  • Similarity with multi-wire communication
Phase encoding: multiple rail
  • No group of wires has the same delay
  • All wires toggle when an item of data is sent
  • Increased number of states available (n wires = n! states), hence more bits per symbol
  • The table compares phase encoding with its respective m-of-n counterparts

[Table: phase encoding vs m-of-n codes – number of states and bits/symbol for various wire counts]
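For illustration (ours, not the slide’s table), the capacity comparison can be computed from n! orderings versus C(n, n/2) codewords:

```python
# Added comparison: n! arrival orderings (phase encoding) vs C(n, n/2)
# codewords (the best m-of-n code on the same n wires).
from math import comb, factorial, floor, log2

print(" n   phase: states/bits   n/2-of-n: states/bits")
for n in (2, 4, 6):
    ph, mn = factorial(n), comb(n, n // 2)
    print(f"{n:2d}   {ph:5d} / {floor(log2(ph)):2d}"
          f"          {mn:5d} / {floor(log2(mn)):2d}")
```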
Phase encoding Repeater

[Circuit: repeater built from phase detectors (mutexes) that resolve the arrival order of each wire pair: 1<2, 2<1, 1<3, 3<1, 2<3, 3<2]
Pipelines

Dual-rail pipeline

From J.Bainbridge & S. Furber IEEE Micro, 2002

The problem of Acking
  • Question 2, “when can Tx send new data?”, has two aspects:
    • Safety (not overflowing the channel, especially when Tx and Rx delays vary widely)
    • Performance (maximizing throughput and reducing latency)
  • Can we hide the ack (round-trip) delay?
To maintain throughput, more pipeline stages are required, but that costs too much latency and power.

First minimize latency along the long wire (not specific to asynchronous design), then maximize throughput (using the “wagging tail buffer” approach).

From R. Ho et al., ASYNC’04

Use of the wagging buffer approach: alternate between top and bottom control.

From R. Ho et al., ASYNC’04

“Wagging tail buffer” approach

[Diagram: the data channel is split between two control channels, reqtop/acktop and reqbot/ackbot; the top and bot control channels each work at ½ the frequency of the data channel]
Serial Link vs Parallel Link (from R. Dobkin)

Why a serial link?
  • Less interconnect area
  • Less routing congestion
  • Less coupling
  • Less power (depends on range)

The relative improvement grows with technology scaling. The example refers to:
  • Single-gate-delay serial link
  • Fully-shielded parallel link with an 8-gate-delay clock cycle
  • Equal bit-rate
  • Word width N = 8

[Chart: link length (mm) vs technology node (nm), showing the regions where the serial link dissipates less power / requires less area than the parallel link, and vice versa]

Serialization model

[Diagrams: Tx → Rx serial channel with three acking disciplines: (1) acking at the bit level; (2) acking at the word level; (3) acking at the word level, with more concurrency]
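A back-of-envelope model (ours, with illustrative numbers) of why word-level acking wins: with bit-level acking every bit pays the round trip, with word-level acking the round trip is paid once per word.

```python
# Back-of-envelope model (illustrative numbers): time to move an N-bit word
# over a serial link; t_bit = bit emission time, t_fwd = one-way wire delay.

def word_time_bit_ack(n, t_bit, t_fwd):
    """Every bit waits for its own round-trip acknowledgement."""
    return n * (t_bit + 2 * t_fwd)

def word_time_word_ack(n, t_bit, t_fwd):
    """Bits are wave-pipelined; one round-trip ack per word."""
    return n * t_bit + 2 * t_fwd

n, t_bit, t_fwd = 8, 1.0, 10.0           # e.g. a long wire: t_fwd >> t_bit
print("bit-level ack :", word_time_bit_ack(n, t_bit, t_fwd))    # 168.0
print("word-level ack:", word_time_word_ack(n, t_bit, t_fwd))   # 28.0
```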

Serial Link – Top Structure (R. Dobkin, ASYNC’07)
  • Transition signalling instead of sampling: two-phase NRZ Level-Encoded Dual-Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
  • Acknowledge per word instead of per bit
  • Synchronizers used at the level of the ack signals
  • Wave-pipelining over the channel
  • Differential encoding (DS-DE, IEEE 1355-95)
  • Reported throughput: 67 Gbit/s for a 65 nm process (i.e. one bit per 15 ps, the expected FO4 inverter delay), based on simulations
Encoding – Two-Phase NRZ LEDR

  • Two-Phase Non-Return-to-Zero Level-Encoded Dual-Rail
    • “delta” encoding (one transition per bit)

[Table: encoding of the uncoded bit B into a phase bit P and a state bit S, such that exactly one of the two wires toggles per bit]
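A minimal sketch of LEDR encoding (the convention that data rides on the state wire S, with the phase wire P toggling whenever the data does not change, is our assumption):

```python
# Sketch (convention assumed, not taken from the slide's table): LEDR sends
# the data on a state wire S and keeps a phase wire P such that exactly one
# of the two wires changes per bit ("delta" encoding).

def ledr_encode(bits):
    s, p = 0, 0
    out = []
    for b in bits:
        if b == s:
            p ^= 1       # data unchanged: the phase wire carries the transition
        s = b            # data changed: the state wire carries the transition
        out.append((p, s))
    return out

def ledr_decode(pairs):
    return [s for _, s in pairs]   # the data is simply the state wire

tx = ledr_encode([0, 1, 1, 0, 0])
print(tx)                  # exactly one wire toggles per symbol
print(ledr_decode(tx))     # [0, 1, 1, 0, 0]
```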
Self-Timed Networks
  • Router requires priority arbitration
    • Arbitration necessary at every router merge
    • Potential delay at every node on the path
    • BUT: asynchronous merge/arbitration time is average-case, not worst-case
  • Adapters to locally clocked cells require synchronization
    • Synchronization necessary when clocks are unknown
    • Occurs when receiving data (data valid) and when sending (acknowledge)
    • BUT: the time can be long (2 cycles?), and worst-case time must be assumed (maybe)
Router priority
  • Virtual channels implement the scheduling algorithm
  • Contention for the link is resolved by priority circuits

[Diagram: router with Split and Merge stages around the Link, governed by Flow Control]
Asynchronous Arbiters
  • Multiway arbiters (e.g. for Xbar switches):
    • Cascaded mesh (latency ~ N)
    • Cascaded Tree (latency ~ logN)
    • Token-Ring (busy ring and lazy ring) (latency ~ from 1 to N)
  • Priority arbiters (e.g. for routers with differentiated QoS):
    • Static priority (topological order)
    • Dynamic priority (request arrives with priority code)
    • Ordered (time-priority) - multiway arbiter, followed by a FIFO buffer
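For illustration (a sketch of the latency figures listed above; the function names are ours):

```python
# Sketch: latency scaling of multiway arbiters built from 2-way MUTEX
# elements, plus a static-priority grant function.
from math import ceil, log2

def static_priority_grant(requests):
    """Static (topological) priority: grant the lowest-indexed active request."""
    return next((i for i, r in enumerate(requests) if r), None)

for n in (2, 4, 8, 16, 64):   # stages a request passes through, by topology
    print(f"N={n:2d}: mesh ~{n:2d}, tree ~{ceil(log2(n))}, token-ring ~1..{n}")

print(static_priority_grant([0, 0, 1, 1]))   # grants channel 2
```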
Static Priority Arbiter

[Circuit: static priority arbiter for three request channels (r1–r3, request/grant pairs R1/G1 – R3/G3), built from MUTEX elements shared with a Lock signal, C-elements, a Priority Module, and a Lock Register (set s, reset r*) that locks out new requests while a grant is in progress]
Why Synchronizer?

[Waveforms: a DFF samples DATA on CLK; if DATA changes too close to the clock edge, the output Q may hover between 1 and 0 – metastability]

[Circuit: two-DFF synchronizer – DATA → DFF → DFF → Q, both clocked by CLK; one clock cycle is allowed for the metastability of the first flop to resolve]
Synthesis of asynchronous link interfaces

[Diagram: VME bus controller connecting the Bus (D, DSr, DSw, DTACK) to a Device through a Data Transceiver (LDS, D, LDTACK); the read cycle is shown]
[STG: Signal Transition Graph of the VME bus controller, covering the read (DSr) and write (DSw) cycles with transitions on LDS, LDTACK, D and DTACK; the two cycles share the LDS/LDTACK handshake with the device]
Synthesis

[State graph: the read-cycle STG violates Complete State Coding; an internal signal csc is inserted (csc+ after DSr+, csc- after DSr-) to distinguish the conflicting states, from which a speed-independent logic (asynchronous) circuit is then derived]

Complete State Coding (CSC)

Boolean equations:

LDS = D + csc

DTACK = D

D = LDTACK · csc

csc = DSr

Conclusions on Async Links
  • At the nm level, links will become more asynchronous; perhaps first mesochronous, to avoid global clock skew
  • Delay-insensitive codes can be used to tolerate inter-wire delay variability
  • Phase encoding can be used for higher power-per-bit efficiency and SEU tolerance
  • Acking will mainly be used for flow control (word level), and its overhead can be ‘hidden’ using the “wagging buffer” technique
  • Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
  • Synthesis tools can be used to build clock-free interfaces between different links
  • Asynchronous logic can be used for building higher-level circuits, e.g. arbiters for switches and routers
ASYNC’08 and NOCs’08 …plus SLIP’08
  • Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend)
  • async.org.uk/async2008
  • async.org.uk/nocs2008
  • Submission deadlines:
    • Async’08: Abstract – Oct. 8 , Full paper – Oct. 15
    • NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19
Extras
  • More slides if I have time!
Chain Network Components

From J.Bainbridge & S. Furber IEEE Micro, 2002

Reliability and latency
  • Asynchronous arbiters fail only if time is bounded
    • Latency depends on fixed gates plus MUTEX lock time
    • τ for 2 channels, τ + τ·ln(N−1) for more
    • This is likely to be small compared with flow-control latency
  • Synchronizers fail at (fairly) predictable rates, but these rates may get worse
    • Latency can be 35τ now for good reliability
The synchronizer

[Circuit: flip-flops #1 and #2 in series; DATA and VALID from clock domain CLK1 are sampled into domain CLK2]

  • Clock and valid can happen very close together
  • Flip-flop #1 gets caught in metastability
  • We wait until it is resolved (1–2 clock periods)
MTBF

MTBF = e^(t/τ) / (Tw · fc · fd)

where t is the resolution time allowed, τ the metastability time constant, Tw the metastability window, and fc, fd the clock and data frequencies.
MTBF
  • For a 0.18 µm process, τ is 20 – 50 ps
  • Tw is similar
  • Suppose the clock and data frequencies are 2 GHz
  • t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
    • 100 synchronizers: + 5τ
    • MTBF > 1 year: + 2τ
    • PVT variations: + 5 – 10τ . . .
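Plugging the formula into code (parameter values follow the slide; the exact day/year figures depend strongly on Tw, fc and fd, so treat the output as illustrative):

```python
# Numeric check of the MTBF model above. Values follow the slide
# (tau = Tw = 20 ps, 2 GHz clock and data rates); figures are illustrative.
from math import exp, log

tau, Tw = 20e-12, 20e-12
fc = fd = 2e9

def mtbf(t):
    """MTBF = e^(t/tau) / (Tw * fc * fd), in seconds."""
    return exp(t / tau) / (Tw * fc * fd)

def t_needed(target_mtbf):
    """Resolution time (in taus) for a target MTBF in seconds."""
    return log(target_mtbf * Tw * fc * fd)

print(f"28 days -> t = {t_needed(28 * 24 * 3600):.1f} tau")
print(f"1 year  -> t = {t_needed(365 * 24 * 3600):.1f} tau")
# Each extra tau multiplies MTBF by e, so 100 synchronizers (100x the
# failure rate) cost about ln(100) ~ 4.6 tau, and going from 28 days to a
# year costs about ln(13) ~ 2.6 tau, matching the rules of thumb above.
print(f"100 synchronizers cost an extra {log(100):.1f} tau")
```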
Event Histogram

[Plot: measured histogram of metastability events against resolution time; converted to log scale, the slope is τ]
Not always simple

[Plot: a measured histogram can show more than one slope, e.g. regions of 350 ps, 120 ps and 140 ps]
Synchronization Strategies
  • Avoid synchronization time (and arbitration time) by
    • predicting clocks, stoppable clocks
    • dedicating link paths for long periods of time
  • Minimize time by circuit methods
    • Higher power, better τ
    • Reducing apparent device variability – wide transistors
    • Many parallel synchronizers increase throughput
  • Reduce average latency by speculation
    • Reduce synchronization time, detect errors and roll back
Timing regions can have predictable relationships
  • Locked
    • Two clocks from same source
    • Linked by PLL
    • One produced by dividing the other
    • Some asynchronous systems
    • Some GALS
  • Not locked together but predictable
    • Two clocks same frequency, but different oscillators.
    • As above, same frequency ratio
Don’t synchronise when you don’t need to

[Diagram: an asynchronous FIFO between the two domains; the write side handshakes with WriteData Available / REQ IN / ACK OUT, the read side with Read done / REQ OUT / ACK IN]

  • If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
  • The FIFO must never overflow
  • The next read clock can be predicted and metastability avoided
Conflict Prediction

[Waveforms: Receiver Clock, Transmitter Clock, and Predicted Transmitter Clock]

  • The synchronization problem is known a cycle in advance of the receiver clock
  • We can do this thanks to the periodic nature of the clocks
Problems predicting the next cycle
  • Difficult to predict with
    • Multiple source clocks
    • Input/output interfaces
  • Dynamic jitter and noise
    • GALS start-up clocks take several cycles to stabilise
    • Crosstalk
    • Power supply variations introduce noise into both data and clock
    • Temperature changes alter relative delays
  • As a proportion of cycle time, this is likely to increase with smaller geometries
Synchronizer reliability trends
  • Clock rates increase: 10 GHz gives 100 ps for a cycle
    • Both data and clock rates up by n
    • τ down by n
  • Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n
  • Number of synchronizers goes up by N
    • Die reliability down by N
  • Die-to-die and on-die variability increases to as much as 40%
    • 40% more time needed for all synchronizers
An example
  • 10 GHz clock and data rate
  • τ = 10 ps
  • 100 synchronizers
  • MTBF required: 3.8 months (10^7 seconds)
  • Time required: 41τ, i.e. 4.1 cycles; + 40% = 5.8 cycles
  • Does this matter?
Power futures
  • Total synchronizer area/power is small, BUT
  • τ is very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain
  • This affects MUTEX circuits as well
Power/speed tradeoffs
  • Increase Vdd when synchronisation is required
  • Make synchronizer transistors wide to reduce variation and, to some extent, τ
  • Make many synchronizer circuits, and select the consistently fastest one
  • Avoid reducing synchronizer Vdd when running slow
Speculation
  • Mostly, the synchronizer does not need 35τ to settle
  • Only a fraction e^(−10) (≈ 0.005%) need more than 10τ
  • Why not go ahead anyway, and try again if more time was needed?
Low-latency synchronization
  • Data Available or Free to Write signals are produced early
    • After one cycle?
  • If they prove to be in error, synchronization has failed
    • This is only known after two or more cycles
  • A Read Fail or Write Fail flag is then raised, and the action can be repeated

[Diagram: FIFO between WRITE (Write clock) and READ (Read Clock) domains; speculative synchronizers derive Free to write from Full and Data Available from Not Empty, with Write Fail / Read Fail flags triggering the retry]
Comments
  • Synchronization time will be an issue for future GALS
  • Latency and throughput can be affected
    • Should the flit be large, to reduce the effective overhead in time and power?
  • Some power/speed trade-off is possible
    • Higher-power synchronization can buy some performance?
  • Speculation is complex
    • Is it worth it?