Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining



  1. Gate Transfer Level Synthesis as an Automated Approach to Fine-Grain Pipelining Alexander Smirnov, Alexander Taubin, Mark Karpovsky, Leonid Rozenblyum

  2. Presentation goals • Present an overview of the synthesis framework • Demonstrate a high-level pipeline model • Demonstrate synthesis correctness • Illustrate how correctness is guaranteed • Present experimental results • Conclusions • Future work

  3. Objective Industrial-quality EDA flow for automated synthesis of fine-grain pipelined, robust circuits from high-level specifications • Industrial quality • Easy to integrate into an RTL-oriented environment • Capable of handling very large designs – scalability • Automated fine-grain pipelining • To achieve high performance (throughput) • Automated to reduce design time

  4. Choice of paradigm • Synchronous RTL • 8 logic levels per stage is the limit • Due to register, clock skew, and jitter overhead • Timing closure • No pipelining automation available – stage balancing is difficult • Performance limitations • To guarantee correctness under process variation, etc. • Asynchronous GTL • Lower design time • Automated pipelining possible from an RTL specification • Higher performance • Gate-level (finest possible) pipelining achievable • Controllable power consumption • Slows down smoothly when supply voltage is reduced • Improved yield • Correct operation regardless of variations

  5. Easy integration & scalability: Weaver flow architecture • RTL tools reuse • Creates the impression that nothing has changed • Saves development effort • Substitution-based transformations • Linear complexity • Enabled by using functionally equivalent DR (dual-rail: physical) and SR (single-rail: virtual) libraries

  6. Easy integration & scalability: Weaver flow architecture • Synthesis flow • Interfacing with host synthesis engine • Transforming Synchronous RTL to Asynchronous GTL – Weaving • Dedicated library(ies) • Dual-rail encoded data logic • Cells comprising entire stages • Internal delay assumptions only

  7. Automated fine-grain pipelining: Gate Transfer Level (GTL) • Gate-level pipeline [diagram: combinational logic between REG stages]

  8.–10. Automated fine-grain pipelining: Gate Transfer Level (GTL) • Gate-level pipeline • Let gates communicate asynchronously and independently • Many pipeline styles can be used • Templates already exist

  11. Weaving • Critical transformations • Mapping combinational gates (basic weaving) • Mapping sequential gates • Initialization preserving liveness and safeness • Optimizations • Performance optimization • Fine-grain pipelining (natural) • Slack matching • Area optimization • Optimizing out identity function stages

  12. Basic Weaving • De Morgan transformation • Dual-rail expansion • Gate substitution • Generating req/ack signals • Merge insertion • Fork insertion • Reset routing

  13. Basic Weaving: example (C17 MCNC benchmark)

  14. Linear pipeline (RTL)

  15. Linear pipeline • pipeline PN model with global synchronization • pipeline PN (PPN) model with local handshake

  17. Linear pipeline • pipeline PN model with global synchronization • PPN models asynchronous full-buffer pipelines

  18. Linear pipeline • RTL implementation • GTL implementation

  19. Correctness • Safeness • Guarantees that the number of data portions (tokens) stays the same over time • Liveness • Guarantees that the system operates continuously • Flow equivalence • In both RTL and GTL implementations corresponding sequential elements hold the same data values • On the same iterations (order-wise) • For the same input stream
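Safeness can be illustrated on a toy full-buffer ring model (an illustrative encoding, not the paper's Petri-net formalism): a stage holds either a token or a bubble (`None`), and one step moves every token whose successor stage is empty, so the token count is invariant.

```python
def advance(stages):
    """One asynchronous step of a toy full-buffer ring.

    A stage passes its token forward iff the next stage is empty
    (None = NULL/bubble).  Tokens are moved, never created or lost,
    so the token count is preserved -- the safeness invariant."""
    n = len(stages)
    out = list(stages)
    for i in range(n):
        j = (i + 1) % n
        if stages[i] is not None and stages[j] is None:
            out[j], out[i] = stages[i], None
    return out
```

A completely full ring never moves (`advance` returns it unchanged), which previews the liveness discussion: tokens need bubbles to advance.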

  20. Non-linear pipelines • Deterministic token flow • Broadcasting tokens to all channels at Forks • Synchronizing at Merges • Data-dependent token flow • Ctrl is also a dual-rail channel • To guarantee liveness, MUXes need to match deMUXes – computationally hard

  21. Non-linear pipeline liveness • Currently guaranteed only for deterministic token flow, by construction (weaving) • A marking of a marked graph is live if each directed PN circuit has a marker • Linear closed pipelines can be considered instead
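The quoted marked-graph criterion has a convenient algorithmic form: a marking is live exactly when every directed circuit holds a token, i.e. when the subgraph of token-free arcs is acyclic, which a plain DFS can check. A sketch (the edge-list and marking-dictionary encodings are assumptions of this illustration):

```python
def marking_is_live(edges, marking):
    """Marked-graph liveness check.

    edges:   list of directed arcs (u, v).
    marking: dict mapping an arc to its token count (absent = 0).
    Live iff the subgraph of token-free arcs contains no directed
    cycle -- then every directed circuit holds at least one token."""
    unmarked = [(u, v) for (u, v) in edges if marking.get((u, v), 0) == 0]
    succ = {}
    for u, v in unmarked:
        succ.setdefault(u, []).append(v)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(u):
        # Standard DFS cycle detection: a GRAY node on the stack
        # reached again means a token-free directed circuit exists.
        color[u] = GRAY
        for v in succ.get(u, []):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color.get(u, WHITE) == WHITE and has_cycle(u)
                   for u in list(succ))
```

For a three-stage ring a→b→c→a, a single token on the arc (c, a) breaks the only circuit and the marking is live; the empty marking is not.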

  22. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition

  25. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition • Each implementation loop forms two directed circuits • Forward – has at least one token inferred for a DFF

  26. Closed linear PPN • Every PPN “stage” is a circuit and has a marker by definition • Each implementation loop forms two directed circuits • Forward – has at least one token inferred for a DFF • Feedback – has at least one NULL inferred from CL or added explicitly

  27. Closed linear PPN pipeline is live iff (for full-buffer pipelines) • Every loop has at least 2 stages • Token capacity for any loop: 1 ≤ C ≤ N − 1 • Assumption we made – every loop in a synchronous circuit has a DFF • A loop with no CL is meaningless • Liveness conditions hold by construction (weaving)
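The iff condition can be cross-checked by brute force on a toy ring model (an illustration, not the paper's proof): a closed full-buffer loop is live exactly when every stage eventually fires, which needs at least one token to move and at least one bubble to move into, i.e. 1 ≤ C ≤ N − 1.

```python
def closed_full_buffer_live(n, c):
    """Brute-force liveness check for a toy closed full-buffer ring
    of n stages holding c tokens: simulate and verify that every
    stage fires at least once (2*n steps suffice at these sizes)."""
    stages = [i < c for i in range(n)]       # c tokens, n - c bubbles
    fired = [False] * n
    for _ in range(2 * n):
        nxt = list(stages)
        for i in range(n):
            j = (i + 1) % n
            if stages[i] and not stages[j]:  # full stage, empty successor
                nxt[i], nxt[j] = False, True
                fired[i] = True
        stages = nxt
    return all(fired)
```

The simulation agrees with the slide's bound: a tokenless ring has nothing to move, and a completely full ring has nowhere to move to, so both deadlock.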

  28. Initialization: example

  29. Initialization: FSM example [diagram: FSM loop implemented as a ring of half-buffer (HB) stages]

  30. Flow equivalence • GTL data flow structure is equivalent to the source RTL by weaving • No data dependencies are removed • No additional dependencies introduced • In deterministic flow architecture • There are no token races (tokens cannot pass each other) • All forks are broadcast and all joins are synchronizers • Flow equivalence preserved by construction

  31. Flow equivalence • GTL initialization is the same as in RTL [token diagram: RTL register values vs. GTL stages interleaved with NULL (N) spacers]

  32.–41. Flow equivalence • but token propagation is independent [animation: tokens "1"–"4" advance stage by stage through the GTL pipeline, separated by NULL spacers; in GTL "3" hits the first top register output, then the first bottom register output, and "2" hits the second register output]

  42. Flow equivalence • but token propagation is independent • In RTL "3" and "2" moved one stage ahead • timing is independent, the order is unchanged [token diagram]
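The animation can be condensed into a toy experiment (an illustrative model; the random firing order stands in for timing variation): however the stages happen to be scheduled, tokens leave a linear pipeline in exactly the order they entered, which is the order-preservation half of flow equivalence.

```python
import random

def run_pipeline(stream, n_stages, rng):
    """Push `stream` through an n-stage full-buffer pipeline, firing one
    ready stage at a time in a random (asynchronous) order; return the
    output sequence.  None = NULL/bubble."""
    stages = [None] * n_stages
    pending = list(stream)
    out = []
    while pending or any(s is not None for s in stages):
        moves = []
        if pending and stages[0] is None:
            moves.append(-1)                      # environment can inject
        moves += [i for i in range(n_stages)
                  if stages[i] is not None
                  and (i == n_stages - 1 or stages[i + 1] is None)]
        i = rng.choice(moves)                     # arbitrary timing
        if i == -1:
            stages[0] = pending.pop(0)
        elif i == n_stages - 1:
            out.append(stages[i])
            stages[i] = None
        else:
            stages[i + 1], stages[i] = stages[i], None
    return out
```

Any two seeds produce different interleavings but identical output order: timing is independent, the order is unchanged.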

  43. Optimizations • Area • Optimizing out identity function stages • Performance • Fine-grain pipelining (natural) • Slack matching

  44. Optimizing out identity function stages • Identity function stages (buffers) are inferred for clocked DFFs and D-latches • Implement no functionality • Can be removed as long as • The token capacity is not decreased below the RTL level • The resulting circuit can still be properly initialized
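The two removal conditions can be captured in a hypothetical legality check (the function, its parameters, and the full-buffer loop capacity formula N − 1 are illustrative assumptions of this sketch, not the Weaver implementation):

```python
def can_remove_identity_stage(n_stages, n_tokens, rtl_capacity):
    """Hypothetical legality check for optimizing one buffer (identity)
    stage out of a closed full-buffer loop.

    n_stages:     loop stages before removal
    n_tokens:     initial tokens on the loop
    rtl_capacity: token capacity required by the source RTL loop

    After removal the loop must still be properly initializable and
    live (1 <= C <= N - 1) and its capacity (N - 1 for a full-buffer
    loop) must not drop below the RTL level."""
    n = n_stages - 1          # stage count after removing one buffer
    capacity = n - 1          # max tokens a live full-buffer loop holds
    return 1 <= n_tokens <= capacity and capacity >= rtl_capacity
```

For example, a 4-stage loop with one token can spare a buffer, while a 3-stage loop that must hold two RTL tokens cannot.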

  45. Optimizing out identity function stages: example • Final implementation is the same as if the RTL had not been pipelined (except for initialization) • Saves pipelining effort [diagram: RTL DFF/CL stages and the corresponding half-buffer (HB) GTL pipeline before and after buffer removal]

  46. Slack matching implementation • Adjusting the pipeline slack to optimize its throughput • Implementation • Leveling gates according to their shortest paths from primary inputs (outputs) • Inserting buffer stages to break long dependencies • Buffer stages initialized to NULL • Currently performed for circuits with no loops only • Complexity O(|X||C|²) • |X| – the number of primary inputs • |C| – the number of connection points in the netlist
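A sketch of the leveling-plus-buffering idea (illustrative only; the slides level gates by shortest paths from the primary inputs, whereas this toy version uses the simpler longest-path leveling of the same scheme on an acyclic netlist):

```python
from collections import deque

def slack_buffers(edges, inputs):
    """Toy slack matching on an acyclic netlist.

    Level every gate by its longest path from the primary inputs
    (Kahn's topological sweep), then an edge spanning k > 1 levels
    gets k - 1 NULL-initialized buffer stages so that reconverging
    paths carry equal slack."""
    succ, indeg = {}, {}
    nodes = set(inputs)
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        indeg[v] = indeg.get(v, 0) + 1
        nodes |= {u, v}
    level = {x: 0 for x in nodes}
    q = deque(x for x in nodes if indeg.get(x, 0) == 0)
    while q:
        u = q.popleft()
        for v in succ.get(u, []):
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    # Buffers needed per long edge: span minus one.
    return {(u, v): level[v] - level[u] - 1
            for u, v in edges if level[v] - level[u] > 1}
```

In the reconvergent netlist a→b→c→d with a shortcut a→d, the shortcut spans three levels and receives two buffer stages, matching the slack of the long path.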

  47. Slack matching correctness • Increases the token capacity • Potentially increases performance • Does not affect the number of initial tokens • Liveness is not affected • Does not affect the system structure • The flow equivalence is not affected

  48. Experimental results: MCNC • RTL implementation – not pipelined • GTL implementation – naturally fine-grain pipelined, slack matching performed • Both implementations obtained automatically from the same behavioral VHDL specification • On average ~4x better performance

  49. Experimental results: AES • ~36x better performance • ~12x larger

  50. Base line • Demonstrated automatic synthesis of • QDI (robust to variations) • automatically gate-level pipelined • implementations from large behavioral specifications • Synthesis run time comparable with RTL synthesis (~2.5x slower) – design time could be reduced • Resulting circuits feature • increased performance (depth-dependent, ~4x for MCNC) • area overhead • Practical solution – first prerelease at http://async.bu.edu/weaver/ • Demonstrated correctness of transformations (weaving)
