
Asynchronous Links, for NanoNets?

Alex Yakovlev, University of Newcastle, UK


Motivation-1

[Chart: relative delay vs. feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), and global interconnect with and without repeaters. Source: ITRS, 2003]

  • At very deep submicron, gate delay is much less than interconnect delay: total interconnect length can reach several meters; interconnect delay can be as much as 90% of total path delay in VDSM circuits

  • Timing is a problem, particularly for global wires

  • Multiple clock domains are now a reality, raising the problem of interfacing between them

  • ITRS’05 predicted: 4x (8x) increase in global asynchronous signalling by 2012 (2020)


Motivation-2

  • Variability and uncertainty

    • Geometry and process: for long channels, intra-die variations are weakly correlated across different parts of the channel, affecting both interconnects and repeaters

      • e.g., M4 and M5 resistance/µm differ massively, leading to mistracking (C. Visweswariah, SLIP'06)

      • e.g., at 250 nm, clock skew shows 25% variability due to interconnect variations (Y. Liu et al., DAC'00)

    • Behavioural: crosstalk – sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz)


A Network on Chip

[Diagram: network of modules with multiple clocks, connected by async links; synchronization required at clock-domain boundaries, arbitration required where traffic merges]


Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)

Built for an onboard airborne computer-control system that tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level

Individually clocked subsystems

Self-timed adapters forming a ring


Communication Channel Adapter (Varshavsky et al. 1986)

Much higher reliability than a bus and other forms of redundancy

The MCC was implemented in TTL-Schottky gate arrays, approx. 2K gates.

Data (DR,DS) is encoded using 3-of-6 Sperner code (16 data values for half-byte, plus 4 tokens for ring acquisition protocol)

AR, AS – acknowledgements

RR, RS – spare (for self-repair) lines


Outline

  • Token-based view of communication

  • Basics of asynchronous signalling

  • Self-timed data encoding

  • Pipelining

  • How to hide acknowledgements

  • Serial vs Parallel links

  • Arbiters and routers

  • Async2sync interface

  • CAD issues


Data exchange: token-based view

  • Question 1: when can Rx look at the incoming data?

    Data validity issue – Forming a well-defined token

[Diagram: data path source → Tx → Rx → dest]


Data exchange: token-based view

  • Question 1: when can Rx look at the data?

    Data validity issue – Forming a well-defined token

  • Question 2: when can Tx send new data?

    Acknowledgement issue – Separation b/w tokens



Data exchange: token-based view

  • Question 1: when can Rx look at the data?

    Data validity issue – Forming a well-defined token

  • Question 2: when can Tx send new data?

    Acknowledgement issue – Separation b/w tokens

    These are fundamental issues of flow control at the physical and link levels

    The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process etc.


Tokens and spaces with global clocking

  • In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses

[Diagram: source → Tx → Rx → dest with a common clk driving Tx and Rx]


Tokens and spaces

  • Without global clocking: Q1 can be resolved differently from Q2

  • E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing

[Diagram: source → Tx → Rx → dest; bundle of data plus D_valid, with separate Clk_tx and Clk_rx]


Tokens and spaces

  • Without global clocking: Q1 can be resolved differently from Q2

  • E.g.: Q1 – source-synchronous (mesochronous), bundled data or self-synchronising codes; Q2 – ack or stop signal, or by local timing

[Diagram: as above, with ack signals returned from Rx/dest to Tx/source]


Petri net model

[Petri net 1: source → Tx → Data Valid → Rx → dest, with Tx delay and Rx delay arcs. One-way delay, but may be unsafe!]

[Petri net 2: the same with an ack arc from Rx back to Tx ('Tx delay or ack', 'Rx delay or ack'). Always safe, but with a round-trip delay!]


Asynchronous handshake signalling

Valid data tokens and safe spaces between them can be created by different means of signalling and encoding

  • Level-based -> Return-To-Zero (RTZ) or 4-phase protocol

  • Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol

  • Pulse-based, e.g. GasP

  • Phase-difference-based

  • Data encoding: bundled data (BD), Delay-insensitive (DI)


Handshake Signalling Protocols

  • Level Signalling (RTZ or 4-phase)

  • Transition Signalling (NRZ or 2-phase)

[Waveforms: RTZ – one cycle is req↑, ack↑, req↓, ack↓; NRZ – one cycle is a single transition on req followed by a single transition on ack]
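A minimal Python sketch of the two protocols above, counting wire transitions per transferred token; the function names and list-based "trace" are illustrative assumptions, not a circuit model:

```python
# Toy transition counter contrasting 4-phase (RTZ) and 2-phase (NRZ)
# handshakes; wire names req/ack follow the slides.

def rtz_token():
    """One RTZ cycle: req+, ack+, req-, ack- => 4 transitions per token."""
    return [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]

def nrz_token(req, ack):
    """One NRZ cycle: a single transition on req signals a new token,
    a single transition on ack accepts it => 2 transitions per token."""
    req ^= 1
    ack ^= 1
    return req, ack, [("req", req), ("ack", ack)]

print(len(rtz_token()), "transitions per token (RTZ)")   # 4
req = ack = 0
req, ack, events = nrz_token(req, ack)
print(len(events), "transitions per token (NRZ)")        # 2
```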


Handshake Signalling Protocols

  • Pulse Signalling

  • Single-track Signalling (GasP)

[Waveforms: pulse signalling – a pulse on req answered by a pulse on ack per cycle; single-track (GasP) – req and ack share one wire, driven in opposite directions]


GasP signalling

[Circuit: GasP stage with 'pull up from pred (req)', 'pull up from here (req)', 'pull down here (ack)', 'pull down from succ (ack)' and pulse-length control loops. Source: R. Ho et al., ASYNC'04]


Data encoding

  • Bundled data

    • Code is positional binary; the token is delimited by the Req+ signal, which arrives a safe set-up delay after the data

  • Delay-insensitive codes (tokens determined by the codeword values; require a spacer, or NULL, state if RTZ)

    • 1-of-2 (dual-rail per bit) – systematic code; encoding and decoding are straightforward

    • m-of-n (n>2) – not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2

    • One-hot, 1-of-n (n>2) – completion detection is easy, but not practical beyond n=4

    • Systematic codes, such as Berger, incur complex completion detection
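As a sketch of the dual-rail option in the list above, the encoder below maps each bit to a 1-of-2 codeword and inserts the RTZ spacer; the function names are ours, not from the talk:

```python
def dual_rail_rtz(bits):
    """Dual-rail RTZ: bit 0 -> rails (1,0), bit 1 -> rails (0,1);
    a NULL spacer (0,0) separates successive codewords."""
    stream = []
    for b in bits:
        stream.append((1 - b, b))   # exactly one rail high = valid token
        stream.append((0, 0))       # both rails low = NULL spacer
    return stream

def complete(word):
    """Completion detection for one dual-rail bit: valid when either rail is high."""
    d0, d1 = word
    return d0 | d1

print(dual_rail_rtz([1, 0]))   # [(0, 1), (0, 0), (1, 0), (0, 0)]
```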


Bundled Data

[Waveforms: RTZ – data changes, then req↑, ack↑, req↓, ack↓ per cycle; NRZ – data changes, then one transition on req and one on ack per cycle]


DI encoded data (Dual-Rail)

[Waveforms: RTZ – codewords on Data.0/Data.1 alternate with NULL spacers, one cycle per bit (logical 0, logical 1 shown); NRZ – successive codewords with no spacer, one cycle per transition (logical 0, 1, 1, 1 shown)]


DI encoded data (Dual-Rail)

[Waveforms: as on the previous slide]

NRZ dual-rail coding leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below


DI codes (1-of-n and m-of-n)

  • 1-of-4:

    • 0001=> 00, 0010=>01, 0100=>10, 1000=>11

  • 2-of-4:

    • 1100, 1010, 1001, 0110, 0101, 0011 – total 6 combinations (cf. 2-bit dual-rail – 4 comb.)

  • 3-of-6:

    • 111000, 110100, …, 000111 – total 20 combinations (can encode 4 bits + 4 control tokens)

  • 2-of-7:

    • 1100000, 1010000, …, 0000011 – total 21 combinations (4 bits + 5 control tokens)
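The codeword counts above are just binomial coefficients; a quick check in Python (our illustration, not from the slides):

```python
from itertools import combinations
from math import comb, log2

# m-of-n DI codes: codewords are the n-bit words with exactly m ones.
for m, n in [(1, 4), (2, 4), (3, 6), (2, 7)]:
    words = list(combinations(range(n), m))   # positions of the m ones
    assert len(words) == comb(n, m)
    print(f"{m}-of-{n}: {len(words):2d} codewords, "
          f"{log2(len(words)):.2f} bits/symbol")
# 3-of-6: 20 codewords -> 16 data values (4 bits) + 4 control tokens
# 2-of-7: 21 codewords -> 16 data values (4 bits) + 5 control tokens
```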


DI codes: completion detection and decoding

  • 1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)

  • Decode 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)

  • For m-of-n codes CD and decoding is non-trivial

From J.Bainbridge et al, ASYNC’03
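The OR-gate structures above translate directly into a small sketch (rail ordering follows the slide's equations, with d0 the rightmost bit of the written codeword):

```python
def cd_1of4(d0, d1, d2, d3):
    """Completion detection: 4-input OR - high whenever a codeword is present."""
    return d0 | d1 | d2 | d3

def decode_1of4(d0, d1, d2, d3):
    """Decode 1-of-4 to two dual-rail bits, per the slide:
    q0.0 = d0+d2, q0.1 = d1+d3, q1.0 = d0+d1, q1.1 = d2+d3."""
    return (d0 | d2, d1 | d3), (d0 | d1, d2 | d3)

# codeword 0001 (d0 = 1) encodes bits 00: q0 = (1,0), q1 = (1,0)
print(cd_1of4(1, 0, 0, 0), decode_1of4(1, 0, 0, 0))
```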


Incomplete DI codes

Incomplete 2-of-7: composed of a 1-of-3 and a 1-of-4 code

From J.Bainbridge et al., ASYNC'03


Phase difference based encoding (C. D'Alessandro et al. ASYNC'06,'07)

[Waveforms: reference ref and two signals t_0, t_1; a bit is encoded in their relative order – 't_0 before t_1' = 0, 't_1 before t_0' = 1 – with spacers sp0/sp1 between symbols]

  • The scheme encodes a bit of data in the phase relationship between two signals generated from a common reference

  • This ensures that a transient fault appearing on one of the signal lines is ignored unless it is mirrored by a corresponding transition on the other line

  • Similarity with multi-wire communication


Phase encoding: multiple rail

  • No group of wires has the same delay

  • All wires toggle when an item of data is sent

  • Increased number of states available (n wires = n! states), hence more bits/symbol

[Table: examples of phase encoding compared to the respective m-of-n counterparts]
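A quick check of the "n wires = n! states" claim (assumed Python illustration):

```python
from math import factorial, log2

# Phase encoding: data is carried in the arrival order of transitions
# on n wires, giving n! distinguishable orderings per symbol.
for n in range(2, 7):
    states = factorial(n)
    print(f"{n} wires: {states:>3} orderings = {log2(states):5.2f} bits/symbol")
# 2 wires ~ 1 bit (the two-wire scheme above); 4 wires already give
# 24 states per symbol, vs. 6 codewords for 2-of-4.
```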


Phase encoding repeater

[Circuit: phase detectors (mutexes) compare arrival order for each wire pair – 1<2, 2<1, 1<3, 3<1, 2<3, 3<2 – and regenerate the transitions in the detected order]


Pipelines

Dual-rail pipeline

From J.Bainbridge & S. Furber IEEE Micro, 2002


The problem of Acking

  • Question 2 “when can Tx send new data?” has two aspects:

    • Safety (not overflowing the channel, e.g. when Tx and Rx delays vary widely)

    • Performance (to maximize throughput and reduce latency)

  • Can we hide ack (round trip) delay?


To maintain throughput, more pipeline stages are required, but that costs too much latency and power

First minimize latency along a long wire (not specific to asynchronous design) and then maximize throughput (using the "wagging tail buffer" approach)

From R.Ho et al. ASYNC’04


Use of the wagging buffer approach

Alternate between top and bottom control

From R.Ho et al. ASYNC’04


"Wagging tail buffer" approach

Top and bottom control channels work at ½ the frequency of the data channel

[Diagram: data channel with alternating control channels reqtop/acktop and reqbot/ackbot]


Serial Link vs Parallel Link (from R. Dobkin)

Why Serial Link?

  • Less interconnect area

  • Less routing congestion

  • Less coupling

  • Less power (depends on range)

The relative improvement grows with technology scaling. The example refers to:

  • Single gate delay serial link

  • Fully-shielded parallel link with an 8-gate-delay clock cycle

  • Equal bit-rate

  • Word width N=8

[Chart: link length (mm) vs. technology node (nm), divided into regions where the serial link vs. the parallel link dissipates less power / requires less area]


Serialization model

[Diagram: Tx → serial channel → Rx]

Acking at the bit level


Serialization model

Acking at the word level


Serialization model

Acking at the word level (with more concurrency)


Serial Link – Top Structure (R. Dobkin, ASYNC'07)

  • Transition signaling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)

  • Acknowledge per word instead of per bit

  • Synchronizers used at the level of the ack signals

  • Wave-pipelining over channel

  • Differential encoding (DS-DE, IEEE1355-95)

  • Reported throughput: 67 Gbit/s for a 65nm process (viz. one bit per 15ps – the expected FO4 inverter delay), based on simulations


Encoding – Two-Phase NRZ LEDR

[Table: uncoded bit B with its phase bit (P) and state bit (S) encoding; exactly one of S and P toggles per transmitted bit]

  • Two Phase Non-Return-to-Zero Level Encoded Dual Rail

    • “delta” encoding (one transition per bit)


Transmitter – Fast SR Approach (from R. Dobkin)


Receiver Splitter (from R. Dobkin)


Self Timed Networks

  • Router requires priority arbitration

    • Arbitration necessary at every router merge

    • Potential delay at every node on the path

      BUT

    • Asynchronous merge/arbitration time is average not worst case

  • Adapters to locally clocked cells require synchronization

  • Synchronization necessary when clocks are unknown

    • Occurs when receiving data (data valid), and when sending (acknowledge)

      BUT

    • Time can be long (2 cycles?)

    • Must assume worst case time (maybe)


Router priority

  • Virtual channels implement the scheduling algorithm

  • Contention for a link is resolved by priority circuits

[Diagram: link with merge and split stages under flow control]


Asynchronous Arbiters

  • Multiway arbiters (e.g. for Xbar switches):

    • Cascaded mesh (latency ~ N)

    • Cascaded Tree (latency ~ logN)

    • Token-Ring (busy ring and lazy ring) (latency ~ from 1 to N)

  • Priority arbiters (e.g. for routers with different QoS):

    • Static priority (topological order)

    • Dynamic priority (request arrives with priority code)

    • Ordered (time-priority) - multiway arbiter, followed by a FIFO buffer


Static Priority Arbiter

[Circuit: requests r1–r3 each pass through a MUTEX shared with a lock signal; a priority module combines R1–R3 and issues grants G1–G3 through C-elements; a lock register holds the lock while a grant is outstanding]


Why Synchronizer?

[Waveforms: a DFF samples asynchronous DATA on CLK; Q may resolve to 1 or to 0, or the flip-flop may go metastable]

[Circuit: two-DFF synchronizer – DATA → DFF → DFF → Q, both clocked by CLK; here one clock cycle is used for the metastability to resolve]


CAD support: Async design flow


Synthesis of Asynchronous Link Interfaces

[Diagram: VME bus example – a VME bus controller between the bus (DSr, DSw, DTACK) and a device with data transceiver (LDS, LDTACK, D); read cycle shown]


[STG: read and write cycles of the VME bus controller as a signal transition graph over DSr±, DSw±, DTACK±, LDS±, LDTACK±, D±]

Complete State Coding (CSC)

[STG: the read-cycle STG is refined with an internal signal csc to achieve complete state coding; synthesis then yields the asynchronous control logic]

Boolean equations:

LDS = D + csc

DTACK = D

D = LDTACK · csc

csc = DSr


Conclusions on Async Links

  • At the nm level, links will become more asynchronous – perhaps mesochronous at first, to avoid global clock skew

  • Delay-insensitive codes can be used to tolerate interwire-delay variability

  • Phase-encoding can be used for higher power-bit efficiency and SEU tolerance

  • Acking will be mainly used for flow control (word level) and its overhead can be ‘hidden’ by using the “wagging buffer” technique

  • Serial Links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches

  • Synthesis tools can be used to build clock-free interfaces between different links

  • Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers




ASYNC'08 and NOCS'08 …plus SLIP'08

  • Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April – weekend)

  • async.org.uk/async2008

  • async.org.uk/nocs2008

  • Submission deadlines:

    • Async’08: Abstract – Oct. 8 , Full paper – Oct. 15

    • NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19


Extras

  • More slides if I have time!


Chain Network Components

From J.Bainbridge & S. Furber IEEE Micro, 2002




Reliability and latency

  • Asynchronous arbiters fail only if time is bounded

    • Latency depends on fixed gates plus the MUTEX lock time

    • τ for 2 channels, τ + τ·ln(N−1) for more

    • This is likely to be small compared with flow-control latency

  • Synchronizers fail at (fairly) predictable rates, but these rates may get worse

    • Latency can be 35τ now for good reliability


The synchronizer

[Circuit: two flip-flops in series – VALID is sampled by FF #1 on CLK2 and re-sampled by FF #2; DATA passes alongside]

  • Clock and valid can happen very close together

  • Flip Flop #1 gets caught in metastability

  • We wait until it is resolved (1–2 clock periods)



MTBF

MTBF = e^(t/τ) / (T_w · f_c · f_d)

where t is the resolution time allowed, τ the metastability time constant, T_w the metastability window, and f_c, f_d the clock and data frequencies.

  • For a 0.18µm process, τ is 20–50 ps

  • T_w is similar

  • Suppose the clock and data frequencies are 2 GHz

  • t needs to be > 25τ (more than one clock period) to get MTBF > 28 days

    • 100 synchronizers: +5τ

    • MTBF > 1 year: +2τ

    • PVT variations: +5–10τ . . .
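The formula can be exercised directly. A small calculator under assumed mid-range values (τ = T_w = 25 ps here; the exact multiple of τ depends on the parameter choices behind the slide's figures):

```python
from math import exp, log

def mtbf(t, tau, Tw, fc, fd):
    """MTBF = e^(t/tau) / (Tw * fc * fd) for resolution time t."""
    return exp(t / tau) / (Tw * fc * fd)

def t_required(target, tau, Tw, fc, fd):
    """Inverted: t = tau * ln(MTBF * Tw * fc * fd)."""
    return tau * log(target * Tw * fc * fd)

tau = Tw = 25e-12     # assumed mid-range 0.18um values (slides: 20-50 ps)
fc = fd = 2e9         # 2 GHz clock and data frequencies
t = t_required(28 * 86400, tau, Tw, fc, fd)   # 28-day MTBF target
print(f"t = {t / tau:.1f} tau = {t * 1e12:.0f} ps")
# each additional tau of waiting multiplies the MTBF by e (~2.7x),
# which is why the slide quotes small increments (+5, +2 tau) for
# much larger reliability targets
```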


Event Histogram

Measurement: histogram of synchronizer resolution-time events. Convert to a log scale; the slope of the exponential tail gives τ
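A sketch of extracting τ from such a histogram: on a log scale the exponential tail is a straight line with slope −1/τ. The numpy-based fit and the synthetic data are our illustration:

```python
import numpy as np

def fit_tau(bin_centers_ps, counts, tail_from_ps):
    """Fit log(counts) vs. resolution time over the exponential tail;
    the slope is -1/tau, so tau = -1/slope."""
    t = np.asarray(bin_centers_ps, float)
    c = np.asarray(counts, float)
    m = (t >= tail_from_ps) & (c > 0)
    slope, _ = np.polyfit(t[m], np.log(c[m]), 1)
    return -1.0 / slope

# synthetic events with tau = 30 ps to exercise the fit
rng = np.random.default_rng(0)
events = rng.exponential(30.0, 100_000)       # resolution times in ps
counts, edges = np.histogram(events, bins=50)
centers = 0.5 * (edges[:-1] + edges[1:])
print(f"fitted tau ~ {fit_tau(centers, counts, 20.0):.1f} ps")
```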


Not always simple

[Histogram: more than one slope – regions with τ ≈ 350 ps, 120 ps and 140 ps]


Synchronization Strategies

  • Avoid synchronization time (and arbitration time) by

    • predicting clocks, stoppable clocks

    • dedicate link paths for long periods of time

  • Minimize time by circuit methods

    • Higher power, better τ

    • Reducing apparent device variability - wide transistors

    • many parallel synchronizers increase throughput

  • Reduce average latency by speculation

    • Reduce synchronization time, detect errors and roll back


Timing regions can have predictable relationships

  • Locked

    • Two clocks from same source

    • Linked by PLL

    • One produced by dividing the other

    • Some asynchronous systems

    • Some GALS

  • Not locked together but predictable

    • Two clocks same frequency, but different oscillators.

    • As above, same frequency ratio


Don't synchronise when you don't need to

[Diagram: asynchronous FIFO between clock domains – 'Write Data Available' drives REQ IN/ACK OUT on the write side, 'Read done' drives REQ OUT/ACK IN on the read side]

  • If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew

  • FIFO must never overflow

  • Next read clock can be predicted and metastability avoided


Conflict Prediction

[Waveforms: receiver clock, transmitter clock, and predicted transmitter clock]

The synchronization problem is known a cycle in advance of the receiver clock. We can do this thanks to the periodic nature of the clocks


Problems predicting next cycle

  • Difficult to predict

    • Multiple source clocks

    • Input output interfaces

  • Dynamic jitter and noise

    • GALS start up clocks take several cycles to stabilise

    • Crosstalk

    • power supply variations introducing noise into both data and clock

    • temperature changes alter relative delays

  • As a proportion of cycle time, this is likely to increase with smaller geometries


Synchronizer reliability trends

  • Clock rates increase. 10 GHz gives 100 ps for a cycle.

    • Both data and clock rates up by n

    • τ down by n

  • Assume τ scales with cycle time: reliability (MTBF) of one synchronizer down by n

  • Number of synchronizers goes up by N

    • Die reliability down by N

  • Die-to-die and on-die variability increases to as much as 40%

    • 40% more time needed for all synchronizers


An example

  • Example

    • 10 GHz clock and data rate

    • τ = 10 ps

    • 100 synchronizers

    • MTBF required: 3.8 months (10^7 seconds)

    • Time required: 41τ, or 4.1 cycles; +40% variability = 5.8 cycles

  • Does this matter?
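Those numbers are consistent with the MTBF formula above; a check under the stated assumptions (taking T_w = τ, our assumption, and requiring each of the 100 synchronizers to be 100x more reliable than the die target):

```python
from math import log

tau = Tw = 10e-12          # tau = 10 ps; Tw = tau is an assumption
fc = fd = 1e10             # 10 GHz clock and data rates
N, die_mtbf = 100, 1e7     # 100 synchronizers, 3.8 months ~ 1e7 s

per_sync_mtbf = N * die_mtbf                  # each must be N times better
t = tau * log(per_sync_mtbf * Tw * fc * fd)   # invert the MTBF formula
cycles = t * fc
print(f"t = {t / tau:.0f} tau = {cycles:.1f} cycles; "
      f"+40% variability = {1.4 * cycles:.1f} cycles")
# prints ~41 tau, 4.1 cycles -> 5.8 cycles, matching the slide
```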


Power futures

  • Total synchronizer area/power is small, BUT

  • τ is very sensitive to voltage/power – both n and p transistors can turn off at low voltages – no gain

  • This affects MUTEX circuits as well


Power/speed tradeoffs

  • Increase Vdd when synchronisation required

  • Make synchronizer transistors wide to reduce variation and, to some extent, τ

  • Make many synchronizer circuits, and select the consistently fastest one

  • Avoid reducing synchronizer Vdd when running slow


Speculation

  • Mostly, the synchronizer does not need 35τ to settle

  • Only e^−10 (≈0.005%) of events need more than 10τ

  • Why not go ahead anyway, and try again if more time was needed?
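The fraction of events still unresolved after time t falls as e^(−t/τ), which is where the figures above come from:

```python
from math import exp

# P(still metastable after t) ~ e^(-t/tau)
for k in (10, 20, 35):
    print(f"after {k:2d} tau: {exp(-k):.2e} of events unresolved")
# after 10 tau: 4.5e-05 (~0.005%), so speculating and occasionally
# rolling back is cheap compared with always waiting ~35 tau
```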


Low latency synchronization

  • Data Available, or Free to write are produced early

    • After one cycle?

  • If they prove to be in error, synchronization failed

    • We only know this after two or more cycles

  • Read Fail or Write Fail flag is then raised and the action can be repeated.

[Diagram: FIFO between write and read clock domains; speculative synchronizers on the Full and Not Empty flags produce early 'Free to write' and 'Data Available' signals, with Write Fail and Read Fail flags for retry]


Comments

  • Synchronization time will be an issue for future GALS

  • Latency and throughput can be affected

    • Should the flit be large to reduce the effective overhead of time and power?

  • Some power/speed trade-off is possible

    • Higher-power synchronization can buy some performance?

  • Speculation is complex

    • Is it worth it?