Clockless Computing Lecture 3
E N D
Presentation Transcript
Clockless ComputingLecture 3 Montek Singh Thu, Aug 30, 2007
Handshaking Example:Asynchronous Pipelines Pipelining basics Fine-grain pipelining Example Approach: MOUSETRAP pipelines
Background: Pipelining fetch decode execute A “coarse-grain” pipeline (e.g. simple processor) A “fine-grain” pipeline (e.g. pipelined adder) What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations Storage elements(latches/registers) Performance Impact: + Throughput: significantly increased (#data items processed/second) – Latency:somewhat degraded (#seconds from input to output)
Focus of Asynchronous Community A Key Focus: Extremely fine-grain pipelines • “gate-level” pipelining = use narrowest possible stages • each stage consists of only a single level of logic gates • some of the fastest existing digital pipelines to date Application areas: • general-purpose microprocessors • instruction pipelines: often 20-40 stages • multimedia hardware (graphics accelerators, video DSP’s, …) • naturally pipelined systems, throughput is critical; input “bursty” • optical networking • serializing/deserializing FIFO’s • string matching? • KMP style string matching: variable skip lengths
MOUSETRAP: Ultra-High-SpeedTransition-Signaling Asynchronous Pipelines Singh and Nowick, Intl. Conf. on Computer Design (ICCD), September 2001 & IEEE Trans. VLSI June 2007
MOUSETRAP Pipelines Simple asynchronous implementation style, uses… • standard logic implementation: Boolean gates, transparent latches • simple control:1 gate/pipeline stage MOUSETRAP uses a “capture protocol:” Latches … • are normally transparent: beforenew data arrives • become opaque: afterdata arrives (“capture” data) Control Signaling:transition-signaling = 2-phase • simple protocol: req/ack = only 2 events per handshake (not 4) • no “return-to-zero” • each transition (up/down) signals a distinct operation Our Goal: very fast cycle time • simple inter-stage communication
MOUSETRAP: A Basic FIFO Stages communicate usingtransition-signaling: Latch Controller 1 transition per data item! ackN-1 ackN En doneN reqN reqN+1 Data in Data out Data Latch Stage N-1 Stage N Stage N+1 2nd data item flowing through the pipeline 1st data item flowing through the pipeline 1st data item flowing through the pipeline
MOUSETRAP: A Basic FIFO (contd.) Latch is disabled when current stage is “done” Latch is re-enabled when next stage is “done” Latch controller (XNOR) acts as “protocol converter”: • 2 distinct transitions (up or down) pulsed latch enable Latch Controller 2 transitions per latch cycle ackN-1 ackN En reqN reqN+1 doneN Data in Data out Data Latch Stage N-1 Stage N Stage N+1
MOUSETRAP: FIFO Cycle Time 3 Latch Controller 2 ackN-1 ackN En reqN reqN+1 doneN 1 2 Data in Data out Data Latch Fast self-loop: N disables itself Stage N-1 Stage N Stage N+1 Cycle Time = N re-enabled to compute N+1 computes N computes
Detailed Controller Operation • One pulse per data item flowing through: • down transition:caused by“done” of N • up transition:caused by“done” of N+1 Stage N’s Latch Controller ackfrom N+1 donefrom N to Latch
MOUSETRAP: Pipeline With Logic logic logic logic Logic Blocks:can use standard single-rail (non-hazard-free) “Bundled Data” Requirement: • each“req”must arrive after data inputs valid and stable Simple Extension to FIFO: insert logic block + matching delay in each stage Latch Controller ackN-1 ackN reqN+1 reqN delay delay delay doneN Data Latch Stage N-1 Stage N Stage N+1
Complex Pipelining: Forks & Joins fork join Non-Linear Pipelining: has forks/joins Contribution: introduce efficient circuit structures • Forks: distributedata + controlto multiple destinations • Joins: mergedata + controlfrom multiple sources • Enabling technology for building complex async systems Problems with Linear Pipelining: • handles limited applications; real systems are more complex
Forks and Joins: Implementation ack1 C ack ack2 req1 C req req req2 Stage N Stage N Join:merge multiple requests Fork:merge multiple acknowledges
Performance, Timing and Optzn. Stage Latency = Cycle Time = MOUSETRAP with Logic:
Timing Analysis Latch Controller ackN-1 ackN reqN+1 reqN delay delay doneN logic logic Data Latch Stage N Stage N-1 Main Timing Constraint: avoid “data overrun” (hold time) Data must be safely “captured” by Stage N before new inputs arrive fromStage N-1 • simple 1-sided timing constraint: fast latch disable • Stage N’s “self-loop” faster than entire path thru prior stage
Experimental Results • Simulations of FIFO’s: • ~3 GHz (in 0.13u IBM process) • Recent fabricated chip: GCD • ~2 GHz simulated speed • Chips tested to be fully functional • Will show demo later
In-Class Exercise • Modify MOUSETRAP to remove the “data overrun” timing constraint • How is the performance affected?
Homework #3 (due Tue Sep 11, 2007) • Read MOUSETRAP paper [TVLSI Jun ’07] • Modify MOUSETRAP to reduce power consumption • Make the latches normally opaque • Latches become transparent only when new data arrives at their inputs • Prevents glitchy/garbage data from propagation • How is the performance (throughput, latency) affected?
MOUSETRAP Advanced Topics
Special Case: Using “Clocked Logic” pull-up network A B “keeper” “keeper” logic inputs logic inputs En En logic output logic output En En pull-down network A B A General C2MOS gate Clocked-CMOS = C2MOS: eliminate explicit latches • latch folded into logic itself C2MOS AND-gate
Gate-Level MOUSETRAP: with C2MOS En,En 2 2 2 (ack,ack’) (done,done’) (En,En’) Use C2MOS:eliminate explicit latches New Control Optimization =“Dual-Rail XNOR” • eliminate 2 inverters from critical path Latch Controller ackN-1 ackN 2 2 2 doneN 2 2 reqN reqN+1 pair of bit latches C2MOS logic Stage N-1 Stage N+1 Stage N
Timing Optzn: Reducing Cycle Time Analytical Cycle Time = Goal:shorten (in steady-state operation) Steady-state = no undue pipeline congestion Observation: • XNOR switches twice per data item: • only 2nd (up) transition criticalfor performance: Solution: reduce XNOR output swing • degrade “slew” for start of pulse • allows quick pulse completion: faster rise time Still safe when congested:pulse starts on time • pulse maintained until congestion clears
Timing Optzn (contd.) “optimized” XNOR output “unoptimized” XNOR output N’s latch disabled N’s latch re-enabled N “done” N+1 “done” latch only partly disabled; recovers quicker! (no pulse width requirement)
Comparison with Wave Pipelining Two Scenarios: • Steady State: • both MOUSETRAP and wave pipelines act like transparent “flow through” combinational pipelines • Congestion: • right environment stalls: each MOUSETRAP stage safely captures data • internal stage slow: MOUSETRAP stages to its left safely capture data congestion properly handled in MOUSETRAP Conclusion: MOUSETRAP has potential of… • speed of wave pipelining • greater robustness and flexibility
Timing Issues: Handling Wide Datapaths En En reqN doneN reqN+1 reqN doneN reqN+1 Stage N-1 Stage N Stage N-1 Stage N Buffers inserted to amplify latch signals (En): Reducing Impact of Buffers: • control uses unbuffered signals buffer delay off of critical path! • datapath skewed w.r.t. control Timing assumption: buffer delays roughly equal