Wagging Logic: Moore's Law will eventually fix it

  1. Wagging Logic: Moore's Law will eventually fix it • Charlie Brej, APT Group, University of Manchester • Group Talk

  2. Introduction • Quasi-Delay-Insensitive (QDI) approach • Prove the high-performance potential • What is performance? Latency and throughput • Why is async better? Average-case performance, variability and data-dependent delays, bit-level pipelining

  3. QDI: Forward Safe Guarding • Ensure all wire pairs are cycled up and down [Figure: forward-guarded logic built around a C-element]
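
The slide assumes each bit travels on a wire pair. A minimal Python sketch, assuming the usual dual-rail encoding (one wire for 1, one for 0, both low as the NULL spacer), of what "cycled up and down" means in a four-phase protocol; `encode`, `decode`, `four_phase_trace` and `fully_cycled` are hypothetical names:

```python
from typing import List, Optional, Sequence, Tuple

# Assumed dual-rail code for one bit on a wire pair (t, f):
# (0, 0) = NULL spacer, (1, 0) = logic 1, (0, 1) = logic 0, (1, 1) = illegal.
Pair = Tuple[int, int]
NULL: Pair = (0, 0)


def encode(bit: int) -> Pair:
    """Set phase: raise exactly one wire of the pair."""
    return (1, 0) if bit else (0, 1)


def decode(pair: Pair) -> Optional[int]:
    """Return the bit on the pair, or None while it holds the NULL spacer."""
    if pair == NULL:
        return None
    if pair == (1, 1):
        raise ValueError("illegal dual-rail state (1, 1)")
    return 1 if pair == (1, 0) else 0


def four_phase_trace(bits: Sequence[int]) -> List[Pair]:
    """Cycle the pair up to a valid codeword and back down to NULL for every bit sent."""
    trace: List[Pair] = []
    for bit in bits:
        trace.append(encode(bit))  # up: data valid
        trace.append(NULL)         # down: reset before the next token
    return trace


def fully_cycled(trace: Sequence[Pair]) -> bool:
    """True if valid codewords and NULL spacers strictly alternate, starting with data."""
    return all((decode(p) is None) == (i % 2 == 1) for i, p in enumerate(trace))
```

A trace that goes straight from one valid codeword to another, such as `[(1, 0), (0, 1)]`, fails `fully_cycled`: the receiver could not tell where one token ends and the next begins.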

  4. Behaviour • Viewpoint of a single output • Many inputs

  5. Behaviour • All or nothing • Synchronises inputs together
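
A behavioural Python model of the C-element's all-or-nothing behaviour, a sketch that ignores delay; the class name `CElement` and its `step` method are hypothetical:

```python
class CElement:
    """Behavioural Muller C-element with any number of inputs (no timing)."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial  # value held while the inputs disagree

    def step(self, *inputs: int) -> int:
        """Apply a new set of input values and return the output."""
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(v == 1 for v in inputs):
            self.out = 1  # every input high: output rises
        elif all(v == 0 for v in inputs):
            self.out = 0  # every input low: output falls
        # mixed inputs: hold the previous value (all or nothing)
        return self.out
```

For example, with `c = CElement()`, the calls `c.step(1, 0, 0)`, `c.step(1, 1, 1)`, `c.step(0, 1, 1)`, `c.step(0, 0, 0)` return 0, 1, 1, 0: the output only moves once every input has arrived, which is how it synchronises its inputs.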

  6. Why is it so slow? • Delays: gate = 1, C-element = 2 • Stage data propagation: X • Cycle time (times 2 for set and reset): forward guarding 2X (a C-element for each gate) plus acknowledge propagation 2X (a C-element for each fork; fork depth ~ gate depth), i.e. 2 × (2X + 2X) = 8X • About eight times slower than the synchronous worst case (X)!
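
The slide's cycle-time accounting as a small Python sketch; the function names are hypothetical, and the synchronous comparison assumes an idealised clock period equal to the worst-case path X with no timing margins:

```python
def four_phase_cycle_time(x: float) -> float:
    """Estimated cycle time of a forward-guarded QDI stage, per the slide's accounting.

    x is the stage's data propagation delay in gate delays (gate = 1, C-element = 2).
    """
    forward_guarding = 2 * x  # a C-element (delay 2) in place of each gate (delay 1)
    acknowledge = 2 * x       # a C-element per fork, fork depth ~ gate depth
    return 2 * (forward_guarding + acknowledge)  # paid once for set and once for reset


def slowdown_vs_sync(x: float) -> float:
    """Ratio to an idealised synchronous clock period of x (worst-case path, no margins)."""
    return four_phase_cycle_time(x) / x


def useful_fraction(x: float) -> float:
    """Share of the cycle a stage spends evaluating new data (slide 7's 1/8th)."""
    return x / four_phase_cycle_time(x)
```

The ratio is independent of X: a stage of depth 7 cycles in 56 gate delays, eight times the idealised synchronous period, and spends one eighth of that time doing useful work.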

  7. Why is four-phase so slow? • Low latency, but low throughput • Only 1/8th of the system is doing useful work • The rest is resetting or completing [Figure: a row of stages labelled "Workie" (working) and "Sleepy" (resetting/completing), one Workie for every seven Sleepy]

  8. Solutions • Ultra/hyper/super pipelining: needs 8 times finer pipelining, which is impossible, and each latch adds to the latency • Faster completion detection: balanced trees of C-elements, arranged to suit the arrival order • Backward guarding • None of these comes close to an 8x improvement

  9. Inspiration: Wagging Latches • Alternate latch read/write • Capacity of two latches • Depth of one latch
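
A behavioural Python sketch of wagged latches, assuming round-robin steering on both the write and the read side; `WaggingBuffer` is a hypothetical name, and `None` is used as the empty marker, so it cannot carry `None` as data. Each item passes through only one latch (depth one), yet two items can be held at once (capacity two):

```python
from typing import Any, List, Optional


class WaggingBuffer:
    """Two (or more) latches sharing one input and one output, used in strict rotation."""

    def __init__(self, ways: int = 2) -> None:
        self.slots: List[Optional[Any]] = [None] * ways
        self.write_sel = 0  # latch that receives the next write
        self.read_sel = 0   # latch that supplies the next read

    def can_write(self) -> bool:
        return self.slots[self.write_sel] is None

    def write(self, item: Any) -> None:
        if not self.can_write():
            raise RuntimeError("target latch is still full")
        self.slots[self.write_sel] = item
        self.write_sel = (self.write_sel + 1) % len(self.slots)  # wag to the other latch

    def read(self) -> Any:
        item = self.slots[self.read_sel]
        if item is None:
            raise RuntimeError("target latch is empty")
        self.slots[self.read_sel] = None
        self.read_sel = (self.read_sel + 1) % len(self.slots)  # wag to the other latch
        return item
```

Because reads follow the same rotation as writes, items leave in the order they arrived even though consecutive items sit in different latches.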

  10. Wagging Logic • Apply the same method to the logic • Alternate between two copies of the logic, allowing one to set while the other resets (precharges) [Figure: two logic copies alternating Set and Reset phases]

  11. Wagging Logic • Between wagging stages: no need to wag, no need to synchronise • Wag only when communicating with non-wagging logic
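
A dataflow-only Python sketch of this rule, assuming round-robin steering at the boundaries with non-wagging logic; it models token order, not timing, and `wag_pipeline` is a hypothetical name. Inside the wagged region each lane only talks to its own copy of the next stage, so no steering or synchronisation is needed there:

```python
from typing import Callable, List, Sequence


def wag_pipeline(tokens: Sequence[int],
                 stages: Sequence[Callable[[int], int]],
                 ways: int) -> List[int]:
    """Push tokens through `ways` parallel copies of a pipeline of stages."""
    # Entry boundary (non-wagging -> wagging): deal tokens out round-robin.
    lanes: List[List[int]] = [list(tokens[k::ways]) for k in range(ways)]
    # Between wagging stages: copy k of one stage feeds copy k of the next.
    for stage in stages:
        lanes = [[stage(t) for t in lane] for lane in lanes]
    # Exit boundary (wagging -> non-wagging): collect results in the same rotation.
    return [lanes[i % ways][i // ways] for i in range(len(tokens))]
```

For any number of ways the output matches the unwagged pipeline; only the two boundaries need to know how many copies exist.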

  12. Non-FIFO Example

  13. Duplicate the Logic

  14. Connect to the Complementary Copy

  15. A Harder Example

  16. Duplicate the Logic

  17. Connect to the Complementary Copy

  18. Triplicate the Logic

  19. Connect to the Next Copy on the List
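
A Python sketch of a wagged feedback loop, using a Fibonacci generator (one of the test circuits listed later) as the example; the function names and the 8-bit width are assumptions. Iteration i runs on copy i % ways and hands its state to the next copy on the list (for two copies, the complementary one), so each copy gets ways - 1 iterations to reset while the computed sequence is unchanged:

```python
from typing import List, Tuple

State = Tuple[int, int]


def fib_step(state: State, width: int) -> State:
    """One pass through the loop logic: (a, b) -> (b, a + b) on a width-bit datapath."""
    a, b = state
    return b, (a + b) & ((1 << width) - 1)


def wagged_fib(n_iters: int, ways: int, width: int = 8) -> Tuple[List[int], List[int]]:
    """Run the Fibonacci loop on `ways` copies of its logic.

    Copy k passes its result to copy (k + 1) % ways, the next on the list.
    Returns the values produced and which copy produced each one.
    """
    state: State = (0, 1)
    values: List[int] = []
    producer: List[int] = []
    for i in range(n_iters):
        copy = i % ways  # this copy sets; the other ways - 1 copies are resetting
        state = fib_step(state, width)
        values.append(state[0])
        producer.append(copy)
    return values, producer
```

With three copies, `wagged_fib(5, ways=3)` yields the values `[1, 1, 2, 3, 5]` produced by copies `[0, 1, 2, 0, 1]`; the values are identical to the single-copy loop.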

  20. Another Example

  21. Proof of the pudding • Simple gate-level simulation using my own simulator • Delays: C-element = 2, gate = 1 • Example circuits: Fibonacci sequence generators, a vertically pipelined 64-bit ripple-carry adder, a non-pipelined 8-bit ripple-carry adder, a 16-input XOR • Backward and forward guarded • Relative measurements of speed, power and area • Simulations run for 10,000 gate delays
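
The talk's own simulator is not shown. As a much simpler stand-in, this Python sketch computes static longest-path arrival times under the same delay model (gate = 1, C-element = 2); the netlist format and the helper names are hypothetical, and the 16-input XOR is modelled as a tree of single 2-input cells, so its numbers are not comparable to the synchronous XOR figures:

```python
from functools import lru_cache
from typing import Dict, List, Sequence, Tuple

# Delay model from the slide: ordinary gate = 1, C-element = 2.
DELAY = {"gate": 1, "celement": 2}

# Hypothetical netlist format: signal name -> (cell kind, list of input signal names).
Netlist = Dict[str, Tuple[str, List[str]]]


def arrival_times(netlist: Netlist, primary_inputs: Sequence[str]) -> Dict[str, int]:
    """Longest-path arrival time of every signal in an acyclic netlist (inputs arrive at 0)."""
    inputs = set(primary_inputs)

    @lru_cache(maxsize=None)
    def arrive(sig: str) -> int:
        if sig in inputs:
            return 0
        kind, fanin = netlist[sig]
        return DELAY[kind] + max(arrive(s) for s in fanin)

    return {sig: arrive(sig) for sig in netlist}


def xor_tree(n_inputs: int, kind: str = "gate") -> Tuple[Netlist, List[str], str]:
    """Balanced tree of 2-input cells; n_inputs is assumed to be a power of two."""
    inputs = [f"in{i}" for i in range(n_inputs)]
    netlist: Netlist = {}
    level = inputs
    while len(level) > 1:
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            name = f"n{len(netlist)}"
            netlist[name] = (kind, [a, b])
            nxt.append(name)
        level = nxt
    return netlist, inputs, level[0]
```

A 16-input tree has four levels, so its output arrives after 4 gate delays; building the same tree from C-elements doubles that to 8, the same factor of two used for forward guarding on slide 6.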

  22. 64-bit Fibonacci Performance • Synchronous worst case: 74

  23. 8-bit Fibonacci Performance • Synchronous worst case: 500

  24. XOR Performance • Synchronous worst/best case: 1250 (8 gate delays) • Including timing margins and flip-flop: 1000 (10 gate delays)
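
The XOR figures are consistent with reading each number as the count of results completed within the 10,000 gate-delay simulation window: 10,000 / 8 = 1,250 and 10,000 / 10 = 1,000. A Python sketch under that assumption (both the interpretation and the function names are not from the slides):

```python
WINDOW = 10_000  # length of each simulation run, in gate delays


def sync_ops(period: int, window: int = WINDOW) -> int:
    """Results a synchronous design finishes in the window at one result per clock period."""
    return window // period


def implied_period(ops: float, window: int = WINDOW) -> float:
    """Clock period (in gate delays) implied by an operation count over the window."""
    return window / ops
```

Under the same reading, the Fibonacci baselines of 74 and 500 would correspond to periods of roughly 135 and 20 gate delays, though the slides do not state this.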

  25. Power Consumption • Synchronous: 610

  26. Area

  27. Future work • Larger and more complex designs: a small CPU, layout, silicon? • Improve completion time: current optimal degree of wagging ~5, target ~3 • Fully automated flow: Verilog input and output, partitioning

  28. Conclusions • Matching and surpassing synchronous performance every time • DI logic for performance • Very expensive: 20 times more power, 5 times bigger (times the wagging factor) • Fastest logic on the planet! (discounting the increase in wire delays, and assuming other things will be able to keep up)
