Bounded Dataflow Networks and Latency Insensitive Circuits Cont…

Arvind, Computer Science and Artificial Intelligence Laboratory, MIT
Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]
http://csg.csail.mit.edu/korea

Modular transformation
[Figure: an SSM composed of SSM1, SSM2 and SSM3 is transformed into a BDN composed of BDN1, BDN2 and BDN3]
Is this transformation correct? Yes: provided each BDNi implements SSMi and is latency insensitive, the resulting BDN implements the SSM and is latency insensitive.

BDN implementing an SSM
A BDN is said to implement an SSM iff
• There is a bijective mapping between the inputs (outputs) of the SSM and those of the BDN
• The output histories of the SSM and the BDN match whenever the input histories match
• The BDN is deadlock-free

Latency-Insensitive BDN (LI-BDN)
• A BDN implementing an SSM is an LI-BDN iff it has
  • The no-extraneous-dependencies (NED) property
  • The self-cleaning (SC) property
Theorem: A BDN in which all the nodes are LI-BDNs will not deadlock

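No-Extraneous Dependency (NED) property
[Figure: an SSM with the inputs combinationally connected to output out highlighted, and the corresponding BDN with its input FIFOs and output FIFO outQ]
Production of outQ waits only for the input FIFOs that correspond to the SSM inputs combinationally connected to out.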

Self-Cleaning (SC) property
If the BDN has enqueued all its outputs, it will dequeue all its inputs

Modular refinement - revisited
[Figure: SSM1 is the module to be refined and SSM2 is the rest of the design. LI-BDN2 is automatically generated from SSM2; LI-BDN1, implementing SSM1, is refined manually.]

Writing an LI-BDN wrapper for an SSM
Given the SSM:
  oj(t) = fj(ij1(t), ..., ijIj(t), s(t))    // ij1, ij2, ..., ijIj are combinationally connected to oj
  s(t+1) = g(i1(t), i2(t), ..., s(t))
LI-BDN:
• Introduce a done flag and a rule for each output
  rule Oj when (!donej) ==>
    donej <= True;
    oj.enq(fj(ij1.first, ..., ijIj.first, s))
• Introduce the Finish rule
  rule Finish when (done1 && done2 && ...) ==>
    done1 <= False; done2 <= False; ...;
    s <= g(i1.first, i2.first, ..., s);
    i1.deq; i2.deq; ...
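
The wrapper rules on this slide can be tried out in software. The following is a minimal Python sketch, not the authors' implementation: the class name LIBDNWrapper, the example SSM, and the use of unbounded deques for the FIFOs are assumptions made for illustration (a real BDN has bounded FIFOs, so each output rule would also wait for not-full). Each output rule is taken to fire only while its done flag is still False.

```python
from collections import deque

class LIBDNWrapper:
    """Software model of the done-flag wrapper (hypothetical helper)."""

    def __init__(self, n_inputs, outputs, next_state, init_state):
        # outputs: list of (f_j, deps_j); deps_j lists the indices of the
        # inputs that are combinationally connected to output j.
        self.inq = [deque() for _ in range(n_inputs)]
        self.outq = [deque() for _ in outputs]   # assumption: unbounded
        self.outputs = outputs
        self.next_state = next_state
        self.state = init_state
        self.done = [False] * len(outputs)

    def step(self):
        """Fire at most one enabled rule; return True if a rule fired."""
        # Rule O_j: waits only for the inputs in deps_j.
        for j, (f, deps) in enumerate(self.outputs):
            if not self.done[j] and all(self.inq[i] for i in deps):
                args = [self.inq[i][0] for i in deps]
                self.outq[j].append(f(*args, self.state))
                self.done[j] = True
                return True
        # Rule Finish: every output produced and every input present.
        if all(self.done) and all(self.inq):
            firsts = [q[0] for q in self.inq]
            self.state = self.next_state(*firsts, self.state)
            for q in self.inq:
                q.popleft()
            self.done = [False] * len(self.outputs)
            return True
        return False

# Example SSM (an assumption): o0 = a + s, o1 = s, s' = s + b.
w = LIBDNWrapper(
    n_inputs=2,
    outputs=[(lambda a, s: a + s, [0]), (lambda s: s, [])],
    next_state=lambda a, b, s: s + b,
    init_state=0,
)
w.inq[0].append(10)              # only input a has arrived so far
while w.step():
    pass
before_b = (list(w.outq[0]), list(w.outq[1]))   # both outputs already produced
w.inq[1].append(5)               # b arrives later; Finish can now fire
while w.step():
    pass
print(before_b, list(w.outq[1]), w.state)
```

Both outputs for the first model cycle are produced before b arrives, because neither depends combinationally on b; the state update and the dequeues wait for Finish.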

Wrapper circuit
[Figure: wrapper circuit around a patient SSM. Labels: input FIFO Ii (first, deq, not-empty); output FIFO Oj (value, enq, not-full); enable; donej; All dones; All input deq; Depends-on(Oj); constants 1 and 0]

Patient SSM
[Figure: an SSM and its patient version. Labels: Inputs, Combinational Logic, Outputs, Enable]

Example: 3-port and 1-port register files
[Figure: 3-port register file rf with ports ra0, ra1, rd0, rd1, wen, wa, wd; 1-port register file rf with ports en, R/W, a, d, out]
interface RegisterFile3Ports
  method Value rd0(Addr a);
  method Value rd1(Addr a);
  method Action wr(Addr a, Value x);
endinterface
interface RegisterFile1port
  method ActionValue#(Value) access(Req r);
endinterface
// Response to write access is unconstrained
typedef union tagged {
  W struct{a:Addr, v:Value};
  R struct{a:Addr};
} Req;

LI-BDN for a 3-port register file
[Figure: register file rf with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done]
rule RD0 when (!rd0Done) ==>
  rd0.enq(rf.rd0(ra0.first));
  rd0Done <= True;
rule RD1 when (!rd1Done) ==>
  rd1.enq(rf.rd1(ra1.first));
  rd1Done <= True;
rule finish when (rd0Done && rd1Done) ==>
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.wr(wa.first, wd.first);
  rd0Done <= False; rd1Done <= False;

Refinement into a one-ported register file LI-BDN
[Figure: one-ported register file rf (en, R/W, a, d, out) behind the same input FIFOs ra0, ra1, wen, wa, wd and output FIFOs rd0, rd1, with flags rd0Done, rd1Done]
rule RD0 when (!rd0Done) ==>
  let x <- rf.access(R {a: ra0.first});
  rd0.enq(x);
  rd0Done <= True;
rule RD1 when (!rd1Done) ==>
  let x <- rf.access(R {a: ra1.first});
  rd1.enq(x);
  rd1Done <= True;
rule finish when (rd0Done && rd1Done) ==>
  ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq;
  if (wen.first) rf.access(W {a: wa.first, v: wd.first});
  rd0Done <= False; rd1Done <= False;
This uses 1 port
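
A small simulation can illustrate why this refinement is safe: the two LI-BDNs differ in how many host cycles they spend per model cycle, but not in the values they enqueue. The sketch below is an illustrative Python model under stated assumptions, not the authors' code: the function run_regfile_bdn is hypothetical, all five input FIFOs are assumed to be filled in lock step from a single trace, rule guards read the done flags as they were at the start of the host cycle, and the one-ported variant allows one register-file access per host cycle.

```python
from collections import deque

def run_regfile_bdn(trace, one_port, nregs=4):
    """trace: list of (ra0, ra1, wen, wa, wd), one tuple per model cycle.
    Returns (rd0 history, rd1 history, host cycles used)."""
    rf = [0] * nregs
    inq = deque(trace)            # the five input FIFOs, kept in lock step
    rd0, rd1 = [], []
    rd0_done = rd1_done = False
    cycles = 0
    while inq:
        ra0, ra1, wen, wa, wd = inq[0]
        cycles += 1
        d0, d1 = rd0_done, rd1_done        # guards see start-of-cycle flags
        ports_left = 1 if one_port else 3
        if not d0 and ports_left > 0:      # rule RD0
            rd0.append(rf[ra0])
            rd0_done = True
            ports_left -= 1
        if not d1 and ports_left > 0:      # rule RD1
            rd1.append(rf[ra1])
            rd1_done = True
            ports_left -= 1
        if d0 and d1:                      # rule finish
            if wen:
                rf[wa] = wd
            inq.popleft()
            rd0_done = rd1_done = False
    return rd0, rd1, cycles

trace = [(0, 1, True, 0, 7), (0, 1, True, 1, 9), (0, 1, False, 0, 0)]
three = run_regfile_bdn(trace, one_port=False)
one = run_regfile_bdn(trace, one_port=True)
print(three)   # ([0, 7, 7], [0, 0, 9], 6)
print(one)     # ([0, 7, 7], [0, 0, 9], 9)
```

The output histories are identical; only the host-cycle count changes, and the cycle counts themselves are an artifact of this simplified scheduling model.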

Pipelining combinational circuits
[Figure: a circuit of combinational blocks f1, f2, f3 over signals a, b, c, d, e, shown in two versions; additional labels S1, S2, S3 and R1, R2, R3]
Can potentially reduce the critical path of the entire circuit

Optimizing an LI-BDN
[Figure: a mux with ports a, b, c, d, shown in two versions]
• Does not wait for don't-care inputs
• Counters are used to keep track of how many inputs to drop
• Can potentially increase the throughput
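
The counter idea can be sketched in software as follows. This is an illustrative Python model under assumptions, not the circuit from the slide: the class OptimizedMux and the port names sel, a, b, out are hypothetical (the figure's port roles are not recoverable from the transcript), a true select is assumed to pick a, and the counters are unbounded here whereas hardware counters would have a fixed width.

```python
from collections import deque

class OptimizedMux:
    """Mux LI-BDN that does not wait for the unselected input."""

    def __init__(self):
        self.sel, self.a, self.b, self.out = deque(), deque(), deque(), deque()
        self.drop_a = 0   # don't-care tokens of a still to be discarded
        self.drop_b = 0   # don't-care tokens of b still to be discarded

    def step(self):
        """Do all work currently possible once; return True if anything fired."""
        fired = False
        # Discard overdue don't-care tokens as soon as they arrive.
        while self.drop_a and self.a:
            self.a.popleft()
            self.drop_a -= 1
            fired = True
        while self.drop_b and self.b:
            self.b.popleft()
            self.drop_b -= 1
            fired = True
        # Produce an output from the select token and the selected input only.
        if self.sel:
            if self.sel[0] and self.drop_a == 0 and self.a:
                self.out.append(self.a.popleft())
                self.sel.popleft()
                self.drop_b += 1          # this cycle's b token is a don't-care
                fired = True
            elif not self.sel[0] and self.drop_b == 0 and self.b:
                self.out.append(self.b.popleft())
                self.sel.popleft()
                self.drop_a += 1          # this cycle's a token is a don't-care
                fired = True
        return fired

m = OptimizedMux()
m.sel.extend([True, True, False])
m.a.extend([1, 2, 3])
while m.step():                  # b has not delivered anything yet
    pass
early = list(m.out)              # outputs produced without any b token
m.b.extend([10, 20, 30])
while m.step():
    pass
print(early, list(m.out))
```

Two outputs are available before b delivers a single token; once b catches up, its first two tokens are dropped and the third is forwarded, so the output history matches that of an unoptimized mux.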

Summary
Latency-insensitive BDNs allow true modular refinement of a system, where even the timing contract of a module can be changed without affecting the rest of the system

A Design Flow issue
[Figure: processor pipeline. Stage labels: Fetch1, Fetch2, Crack, Decode, Branch Pred, Reg File, Addr Calc/Branch Resolve, Mem1, Mem2/ALU/Exception Handler, Register Write. Feedback labels: Exception, Branch Resolution, Branch Prediction. Annotations: pipelined multiplier; multicycle divider; register file implemented as a BRAM]
• We can apply the technique discussed to refine this design
• But where does this design come from in the first place? Verilog? Verilog compiler output? Bluespec?

Design Flow Issues
• Generation of appropriate RTL is the major problem
• RTL / specifications should be written in such a way that they are amenable to refinements
=> Latency Insensitive Design Methodology

The PowerPC Project
Cycle-accurate modeling of PowerPC on FPGAs

PPC In-order Pipeline
[Figure: pipeline stages PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc, BrRes, ALU, Excep, Mem1, Mem2, RegWr; memory blocks I$/ITlb1, I$/ITlb2, D$/DTlb1, D$/DTlb2, Mem; signals stall, bypass, epochs]
• The designer specifies the FSM for each stage
• The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of the FIFOs or the number of stages

The steps in cycle-accurate implementation on FPGAs
[Slide annotation: Can be mechanized]
• The specs are turned into Bluespec code to give a target SSM
• Once the size of the FIFOs is fixed, the whole design has a precise timing specification
• If the FPGA implementation requires refining some stages, then cuts are made in the design to isolate the stages (SSMs) to be refined
• Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of the model FIFOs of the SSM
  • This converts the nth time cycle of the SSM into the nth enqueue into the input FIFOs and the nth dequeue from the output FIFOs
• Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
  • This also ensures deadlock-free operation

Initial results using XUPV5 FPGA

Detailed Preliminary Results
Asif Khan & Murali Vijayaraghavan (June 2009)
• Cycle-accurate refinements onto Xilinx XUPV5
  • Slice Logic Utilization:
    • Number of Slice Registers: 15448 out of 69120 (22%)
    • Number of Slice LUTs: 16702 out of 69120 (24%)
  • Specific Feature Utilization:
    • Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
    • Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)
  • Minimum period: 7.988 ns (maximum frequency: 125.188 MHz)
  • Partially verified by running a 50-instruction program
• Compared to Jessica's port onto Xilinx XUPV5
  • Takes up 92% of the area
  • 20 MHz => 40 MHz
No numbers yet for actual work done
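
The derived figures on this slide can be rechecked from the raw counts. The short Python sketch below assumes the percentages are truncated to whole numbers, which is an inference from the reported 18% for 12 out of 64; the variable names are illustrative.

```python
# Maximum clock frequency is the reciprocal of the minimum period.
min_period_ns = 7.988
max_freq_mhz = round(1e3 / min_period_ns, 3)   # 1/ns = GHz, so 1e3/ns = MHz

# (used, available) resource counts as reported on the slide.
resources = {
    "slice_registers": (15448, 69120),
    "slice_luts": (16702, 69120),
    "block_ram_fifo": (1, 148),
    "dsp48e": (12, 64),
}
# Truncated integer percentages (assumed reporting convention).
percent = {name: 100 * used // total for name, (used, total) in resources.items()}

print(max_freq_mhz)   # 125.188
print(percent)        # 22, 24, 0 and 18 percent respectively
```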

Conclusion
• Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators
• BDNs offer a way to refine RTL without losing cycle-accuracy
• Bluespec makes quick RTL generation feasible
• The generation of BDNs can be automated
• We plan to release our Bluespec designs under open-source licensing to strengthen the PowerPC ecosystem
