Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining



  1. Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov, Alexander Taubin, Mark Karpovsky, Leonid Rozenblyum

  2. Presentation goals • Present an overview of the synthesis framework • Demonstrate a high-level pipeline model • Demonstrate synthesis correctness • Illustrate how correctness is guaranteed • Present experimental results • Conclusions • Future work

  3. Objective Industrial-quality EDA flow for automated synthesis of fine-grain pipelined, robust circuits from high-level specifications • Industrial quality • Easy to integrate into an RTL-oriented environment • Capable of handling very large designs – scalability • Automated fine-grain pipelining • To achieve high performance (throughput) • Automated to reduce design time

  4. Choice of paradigm • Synchronous RTL • 8 logic levels per stage is the limit • Due to register, clock skew, and jitter overhead • Timing closure • No pipelining automation available – stage balancing is difficult • Performance limitations • To guarantee correctness under process variation, etc. • Asynchronous GTL • Lower design time • Automated pipelining possible from an RTL specification • Higher performance • Gate-level (finest possible) pipelining achievable • Controllable power consumption • Slows down smoothly when supply voltage is reduced • Improved yield • Correct operation regardless of variations

  5. Easy integration & scalability: Weaver flow architecture • RTL tools reuse • Creates the impression that nothing has changed • Saves development effort • Substitution-based transformations • Linear complexity • Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries

  6. Easy integration & scalability: Weaver flow architecture • Synthesis flow • Interfacing with host synthesis engine • Transforming Synchronous RTL to Asynchronous GTL – Weaving • Dedicated library(ies) • Dual-rail encoded data logic • Cells comprising entire stages • Internal delay assumptions only

  7. Automated fine-grain pipelining: Gate Transfer Level (GTL) • Gate-level pipeline [diagram: combinational logic between REG stages]

  8.–10. Automated fine-grain pipelining: Gate Transfer Level (GTL) • Gate-level pipeline • Let gates communicate asynchronously and independently • Many pipeline styles can be used • Templates already exist

  11. Weaving • Critical transformations • Mapping combinational gates (basic weaving) • Mapping sequential gates • Initialization preserving liveness and safeness • Optimizations • Performance optimization • Fine-grain pipelining (natural) • Slack matching • Area optimization • Optimizing out identity function stages

  12. Basic Weaving • De Morgan transformation • Dual-rail expansion • Gate substitution • Generating req/ack signals • Merge insertion • Fork insertion • Reset routing

  13. Basic Weaving: example (C17 MCNC benchmark)

  14. Linear pipeline (RTL)

  15. Linear pipeline • pipeline PN model with global synchronization • pipeline PN (PPN) model with local handshake

  17. Linear pipeline • pipeline PN model with global synchronization • PPN models asynchronous full-buffer pipelines

  18. Linear pipeline • RTL implementation • GTL implementation

  19. Correctness • Safeness • Guarantees that the number of data portions (tokens) stays the same over time • Liveness • Guarantees that the system operates continuously • Flow equivalence • In both RTL and GTL implementations corresponding sequential elements hold the same data values • On the same iterations (order-wise) • For the same input stream
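Safeness can be illustrated on a toy full-buffer ring model (an illustrative encoding, not the paper's Petri-net formalism): a stage holds either a token or a bubble (`None`), and one step moves every token whose successor stage is empty, so the token count is invariant.

```python
def advance(stages):
    """One asynchronous step of a toy full-buffer ring.

    A stage passes its token forward iff the next stage is empty
    (None = NULL/bubble).  Tokens are moved, never created or lost,
    so the token count is preserved -- the safeness invariant."""
    n = len(stages)
    out = list(stages)
    for i in range(n):
        j = (i + 1) % n
        if stages[i] is not None and stages[j] is None:
            out[j], out[i] = stages[i], None
    return out
```

A completely full ring never moves (`advance` returns it unchanged), which previews the liveness discussion: tokens need bubbles to advance.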

  20. Non-linear pipelines • Deterministic token flow • Broadcasting tokens to all channels at Forks • Synchronizing at Merges • Data-dependent token flow • Ctrl is also a dual-rail channel • To guarantee liveness, MUXes need to match deMUXes – computationally hard

  21. Non-linear pipeline liveness • Currently guaranteed only for deterministic token flow, by construction (weaving) • A marking of a marked graph is live if each directed PN circuit has a marker • Linear closed pipelines can be considered instead
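The quoted marked-graph criterion has a convenient algorithmic form: a marking is live exactly when every directed circuit holds a token, i.e. when the subgraph of token-free arcs is acyclic, which a plain DFS can check. A sketch (the edge-list and marking-dictionary encodings are assumptions of this illustration):

```python
def marking_is_live(edges, marking):
    """Marked-graph liveness check.

    edges:   list of directed arcs (u, v).
    marking: dict mapping an arc to its token count (absent = 0).
    Live iff the subgraph of token-free arcs contains no directed
    cycle -- then every directed circuit holds at least one token."""
    unmarked = [(u, v) for (u, v) in edges if marking.get((u, v), 0) == 0]
    succ = {}
    for u, v in unmarked:
        succ.setdefault(u, []).append(v)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(u):
        # Standard DFS cycle detection: a GRAY node on the stack
        # reached again means a token-free directed circuit exists.
        color[u] = GRAY
        for v in succ.get(u, []):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color.get(u, WHITE) == WHITE and has_cycle(u)
                   for u in list(succ))
```

For a three-stage ring a→b→c→a, a single token on the arc (c, a) breaks the only circuit and the marking is live; the empty marking is not.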

  22. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition

  25. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition • Each implementation loop forms two directed circuits • Forward – has at least one token inferred for a DFF

  26. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition • Each implementation loop forms two directed circuits • Forward – has at least one token inferred for a DFF • Feedback – has at least one NULL inferred from CL or added explicitly

  27. Closed linear PPN pipeline is live iff (for full-buffer pipelines) • Every loop has at least 2 stages • Token capacity for any loop: 1 ≤ C ≤ N − 1 • Assumption we made – every loop in a synchronous circuit has a DFF • A loop with no CL is meaningless • Liveness conditions hold by construction (weaving)
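The iff condition can be cross-checked by brute force on a toy ring model (an illustration, not the paper's proof): a closed full-buffer loop is live exactly when every stage eventually fires, which needs at least one token to move and at least one bubble to move into, i.e. 1 ≤ C ≤ N − 1.

```python
def closed_full_buffer_live(n, c):
    """Brute-force liveness check for a toy closed full-buffer ring
    of n stages holding c tokens: simulate and verify that every
    stage fires at least once (2*n steps suffice at these sizes)."""
    stages = [i < c for i in range(n)]       # c tokens, n - c bubbles
    fired = [False] * n
    for _ in range(2 * n):
        nxt = list(stages)
        for i in range(n):
            j = (i + 1) % n
            if stages[i] and not stages[j]:  # full stage, empty successor
                nxt[i], nxt[j] = False, True
                fired[i] = True
        stages = nxt
    return all(fired)
```

The simulation agrees with the slide's bound: a tokenless ring has nothing to move, and a completely full ring has nowhere to move to, so both deadlock.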

  28. Initialization: example

  29. Initialization: FSM example [diagram: FSM loop implemented as a ring of half-buffer (HB) stages]

  30. Flow equivalence • GTL data flow structure is equivalent to the source RTL by weaving • No data dependencies are removed • No additional dependencies introduced • In deterministic flow architecture • There are no token races (tokens cannot pass each other) • All forks are broadcast and all joins are synchronizers • Flow equivalence preserved by construction

  31. Flow equivalence • GTL initialization is the same as in RTL [token diagram: RTL register values vs. GTL stages interleaved with NULL (N) spacers]

  32.–41. Flow equivalence • but token propagation is independent [animation: tokens "1"–"4" advance stage by stage through the GTL pipeline, separated by NULL spacers; in GTL "3" hits the first top register output, then the first bottom register output, and "2" hits the second register output]

  42. Flow equivalence • but token propagation is independent • In RTL "3" and "2" moved one stage ahead • timing is independent, the order is unchanged [token diagram]
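The animation can be condensed into a toy experiment (an illustrative model; the random firing order stands in for timing variation): however the stages happen to be scheduled, tokens leave a linear pipeline in exactly the order they entered, which is the order-preservation half of flow equivalence.

```python
import random

def run_pipeline(stream, n_stages, rng):
    """Push `stream` through an n-stage full-buffer pipeline, firing one
    ready stage at a time in a random (asynchronous) order; return the
    output sequence.  None = NULL/bubble."""
    stages = [None] * n_stages
    pending = list(stream)
    out = []
    while pending or any(s is not None for s in stages):
        moves = []
        if pending and stages[0] is None:
            moves.append(-1)                      # environment can inject
        moves += [i for i in range(n_stages)
                  if stages[i] is not None
                  and (i == n_stages - 1 or stages[i + 1] is None)]
        i = rng.choice(moves)                     # arbitrary timing
        if i == -1:
            stages[0] = pending.pop(0)
        elif i == n_stages - 1:
            out.append(stages[i])
            stages[i] = None
        else:
            stages[i + 1], stages[i] = stages[i], None
    return out
```

Any two seeds produce different interleavings but identical output order: timing is independent, the order is unchanged.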

  43. Optimizations • Area • Optimizing out identity function stages • Performance • Fine-grain pipelining (natural) • Slack matching

  44. Optimizing out identity function stages • Identity function stages (buffers) are inferred for clocked DFFs and D-latches • Implement no functionality • Can be removed as long as • The token capacity is not decreased below the RTL level • The resulting circuit can still be properly initialized
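The two removal conditions can be captured in a hypothetical legality check (the function, its parameters, and the full-buffer loop capacity formula N − 1 are illustrative assumptions of this sketch, not the Weaver implementation):

```python
def can_remove_identity_stage(n_stages, n_tokens, rtl_capacity):
    """Hypothetical legality check for optimizing one buffer (identity)
    stage out of a closed full-buffer loop.

    n_stages:     loop stages before removal
    n_tokens:     initial tokens on the loop
    rtl_capacity: token capacity required by the source RTL loop

    After removal the loop must still be properly initializable and
    live (1 <= C <= N - 1) and its capacity (N - 1 for a full-buffer
    loop) must not drop below the RTL level."""
    n = n_stages - 1          # stage count after removing one buffer
    capacity = n - 1          # max tokens a live full-buffer loop holds
    return 1 <= n_tokens <= capacity and capacity >= rtl_capacity
```

For example, a 4-stage loop with one token can spare a buffer, while a 3-stage loop that must hold two RTL tokens cannot.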

  45. Optimizing out identity function stages: example • Final implementation is the same as if the RTL had not been pipelined (except for initialization) • Saves pipelining effort [diagram: RTL DFF/CL stages and the corresponding half-buffer (HB) GTL pipeline before and after buffer removal]

  46. Slack matching implementation • Adjusting the pipeline slack to optimize its throughput • Implementation • Leveling gates according to their shortest paths from primary inputs (outputs) • Inserting buffer stages to break long dependencies • Buffer stages initialized to NULL • Currently performed for circuits with no loops only • Complexity O(|X||C|²) • |X| – the number of primary inputs • |C| – the number of connection points in the netlist
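A sketch of the leveling-plus-buffering idea (illustrative only; the slides level gates by shortest paths from the primary inputs, whereas this toy version uses the simpler longest-path leveling of the same scheme on an acyclic netlist):

```python
from collections import deque

def slack_buffers(edges, inputs):
    """Toy slack matching on an acyclic netlist.

    Level every gate by its longest path from the primary inputs
    (Kahn's topological sweep), then an edge spanning k > 1 levels
    gets k - 1 NULL-initialized buffer stages so that reconverging
    paths carry equal slack."""
    succ, indeg = {}, {}
    nodes = set(inputs)
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        indeg[v] = indeg.get(v, 0) + 1
        nodes |= {u, v}
    level = {x: 0 for x in nodes}
    q = deque(x for x in nodes if indeg.get(x, 0) == 0)
    while q:
        u = q.popleft()
        for v in succ.get(u, []):
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    # Buffers needed per long edge: span minus one.
    return {(u, v): level[v] - level[u] - 1
            for u, v in edges if level[v] - level[u] > 1}
```

In the reconvergent netlist a→b→c→d with a shortcut a→d, the shortcut spans three levels and receives two buffer stages, matching the slack of the long path.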

  47. Slack matching correctness • Increases the token capacity • Potentially increases performance • Does not affect the number of initial tokens • Liveness is not affected • Does not affect the system structure • The flow equivalence is not affected

  48. Experimental results: MCNC • RTL implementation – not pipelined • GTL implementation – naturally fine-grain pipelined, slack matching performed • Both implementations obtained automatically from the same behavioral VHDL specification • On average ~4x better performance

  49. Experimental results: AES • ~36x better performance • ~12x larger

  50. Base line • Demonstrated automatic synthesis of • QDI (robust to variations) • automatically gate-level pipelined • implementations from large behavioral specifications • Synthesis run time comparable with RTL synthesis (~2.5x slower) – design time could be reduced • Resulting circuits feature • increased performance (depth-dependent, ~4x for MCNC) • area overhead • Practical solution – first prerelease at http://async.bu.edu/weaver/ • Demonstrated correctness of transformations (weaving)
