
Low-Latency FIFO’s Using Token Rings

Low-Latency FIFO’s Using Token Rings. Tiberiu Chelcea and Steven M. Nowick, Columbia University, New York, USA. Contributions: two novel FIFO designs, each a circular buffer of identical cells with distributed control and common buses. Token passing: 2 tokens control I/O behavior. No data movement.


Presentation Transcript


  1. Low-Latency FIFO’s Using Token Rings Tiberiu Chelcea Steven M. Nowick Columbia University New York, USA

  2. Contributions • Two novel FIFO designs: • Circular buffer of identical cells • Distributed control • Common buses • Token passing: 2 tokens control I/O behavior • No data movement • Very low latency in an empty FIFO • Still maintain high throughput Introduction (1)

  3. Introduction Two FIFO Protocols: • Basic: simple, non-overlapped write/read to a cell • Optimized: overlapped write/read to a cell • more concurrency per cell • various low-level optimizations: • “early drive” of receiver’s data bus • single-wire signaling, etc. 3 implementations of basic, 1 of optimized HSpice simulations

  4. Related Work • Most FIFO’s targeted to high-throughput: • poor latency • data movement • One solution: modify structure to obtain lower latency [Brunvand95] • types: folded, tree, square • drawbacks: • data still moved • latency proportional to # of stages • complex critical paths Related Work (1)

  5. Low-Latency FIFO’s Commonly implemented as circular buffers • no data movement 1. Centralized Control [Sutherland89, Yakovlev95] Limitations: • complex centralized counters for head/tail positions • overhead: delay/area (including arbiters!) 2. Distributed Control [Yakovlev89, Kishinevsky93] Limitations: • no overlapped put/get to same cell (unlike ours) • significant latencies (e.g. 3-stage delay) Two closer approaches presented later on… Related Work (2)

  6. Overview of the Talk • Basic FIFO: • Basic Protocol • Implementation • Optimized FIFO: • Optimized Protocol • Implementation • Related Work • Results • Conclusions Summary (1)

  7. FIFO FIFO Interface • Interfaces to two environments: • sender communicates on put port • receiver communicates on get port • FIFO allows concurrent puts (writes) and gets (reads) put get Basic Protocol (1)

  8. put Cell Cell Cell Starter Cell get FIFO Architecture • FIFO = replicated cells + starter (circular buffer) • put/get ports each consists of a common bus (data+control) • Two tokens in FIFO: put token and get token • “Starter” cell places tokens in circulation • no data movement • When full, every cell contains data (capacity N) Basic Protocol (2)
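The architecture on this slide can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `TokenRingFifo` and its methods are hypothetical, the Starter cell is reduced to placing both tokens at index 0, and a token is passed as soon as its operation completes (the slides show the real cells delaying the pass until the next cell is ready).

```python
class TokenRingFifo:
    """Behavioral sketch: N cells in a ring, one put token and one get token."""

    def __init__(self, capacity):
        self.cells = [None] * capacity    # data register of each cell
        self.valid = [False] * capacity   # True while a cell holds unread data
        self.put_tok = 0                  # index of the cell holding the put token
        self.get_tok = 0                  # index of the cell holding the get token

    def put(self, item):
        """Write into the cell holding the put token; False if that cell is full."""
        i = self.put_tok
        if self.valid[i]:
            return False                  # FIFO full: the put stays pending
        self.cells[i], self.valid[i] = item, True
        self.put_tok = (i + 1) % len(self.cells)   # only the token moves
        return True

    def get(self):
        """Read from the cell holding the get token; None if that cell is empty."""
        i = self.get_tok
        if not self.valid[i]:
            return None                   # FIFO empty: the get stays pending
        item, self.valid[i] = self.cells[i], False
        self.get_tok = (i + 1) % len(self.cells)   # data never moves between cells
        return item
```

Each item is written once and read once from the same cell, which is the "no data movement" property; only the two token indices advance.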

  9. FIFO Simulation 1: Start [Animation frame: ring of cells with Starter between the put and get buses; tokens P and G; labels: “Put token requested”, “Get token requested”, put_req, valid] Basic Protocol (3)

  10. FIFO Simulation 2: Steady-State Operation [Animation frames: tokens P and G; labels: put_req, get_req, valid] Basic Protocol (4)

  11. FIFO Simulation 2: Steady-State Operation [Animation frame: tokens P and G; labels: put_req, valid] Basic Protocol (4)

  12. FIFO Simulation 3: Full [Animation frame: tokens P and G; labels: “Put token not passed: next cell not ready”, “put_req: pending”, “Put request acknowledged”, “Put token requested”, get_req, valid] Basic Protocol (4)

  13. Basic Cell Protocol • Pseudo-code program: forever { ObtainPutToken; EnqueueData; PassPutToken; ObtainGetToken; DequeueData; PassGetToken } • CSP program: FIFO = *[[ right; [put put?x]; [left left]; right; [get get!x]; [left left] ]] (labels: do_put, do_get) [Figure: Cell with put, get, left and right ports] Basic Protocol (6)

  14. Cell’s Handshake Behavior • Port Activity: • put & get: passive • right: active • left: passive • Channel Implementation: • 4-phase handshaking • bundled data: put and get • validity scheme [Peeters96]: • get: “middle data validity” (ack+req-) • put: early data validity (req+ack+) Basic Protocol (7)
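The 4-phase (return-to-zero) handshaking named on this slide can be illustrated by listing the signal events of one cell iteration. This is a simplified sketch, not the authors' circuit: it assumes each channel completes all four phases before the next channel starts, ignores the data-validity schemes, and uses the hypothetical helper names `four_phase` and `basic_cell_cycle`. The channel order follows the cell program (right, put, left, right, get, left).

```python
def four_phase(channel):
    """Return the four events of one return-to-zero handshake on a channel."""
    return [f"{channel}_req+", f"{channel}_ack+",
            f"{channel}_req-", f"{channel}_ack-"]


def basic_cell_cycle():
    """Events of one loop iteration: right; put; left; right; get; left."""
    trace = []
    for channel in ["right", "put", "left", "right", "get", "left"]:
        trace.extend(four_phase(channel))
    return trace
```

With six handshakes of four events each, one iteration of this strictly sequential model produces 24 signal transitions.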

  15. Basic Cell Implementation #1: Tangram • Tangram program: proc cell (put?T & get!T & right & left) begin x: T forever do right; put?x; left; right; get!x; left; od end • Handshake circuit: [Figure: handshake components #, ;, MUX and REG connected to the right, left, put and get ports] • “Starter Cell”: can also be implemented using Tangram Basic Protocol Implementation (1)

  16. Basic Cell Implementation #2: Petrify [Figure: signal transition graph over the signals right_req, right_ack, put_req, put_ack, get_req, get_ack, left_req and left_ack] Basic Protocol Implementation (2)

  17. Basic Cell Implementation #3: Burst-Mode • Decomposed into several communicating BM machines: • Put/Get Controllers: handle put/get ports • Left Controller: passes tokens to left • Token Distributor: controls token flow to the three controllers [Figure: cell block diagram with Put Controller, Get Controller, REG, Left Controller and Token Distributor; channels put, get, ptok, gtok, pass, left, right] Basic Protocol Implementation (3)

  18. Put Controller • Synchronizes handshakes on put and ptok channels: • put: environmental request • ptok: put token is in cell • If cell has token (ptok_r+): cell does put operation • If no token, no put: put_req+/- = partial input burst => ignored • Burst-mode transitions: put_req+ ptok_r+ / put_ack+ ptok_a+; put_req- ptok_r- / put_ack- ptok_a- [Figure: cell block diagram with the Put Controller highlighted] Basic Protocol Implementation (4)
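The Put Controller behavior described here (fire only when both inputs of a burst have arrived, ignore a partial burst) can be sketched as a two-state machine. This is an illustrative model only: the class name `PutController` and its `step` method are hypothetical, and timing and hazards, which a real burst-mode synthesis must handle, are not modeled.

```python
class PutController:
    """Sketch of the two-state burst-mode behavior of the Put Controller."""

    def __init__(self):
        self.inputs = {"put_req": False, "ptok_r": False}
        self.acked = False  # True between the rising and falling output bursts

    def step(self, signal, level):
        """Apply one input transition; return the output burst it triggers."""
        self.inputs[signal] = level
        values = self.inputs.values()
        if not self.acked and all(values):
            # Complete input burst put_req+ ptok_r+ : acknowledge both channels.
            self.acked = True
            return ["put_ack+", "ptok_a+"]
        if self.acked and not any(values):
            # Complete input burst put_req- ptok_r- : return both acks to zero.
            self.acked = False
            return ["put_ack-", "ptok_a-"]
        return []  # partial input burst: no output change
```

A `put_req+` followed by `put_req-` without the token never completes a burst, so the model produces no output for it, matching the "partial input burst => ignored" rule on the slide.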

  19. Token Distributor • Receives tokens from right channel • Distributes tokens to Put and Get Controllers, respectively • Passes tokens to Left Controller [Figure: burst-mode specification with transitions (input burst / output burst) pass_a-/right_req+, right_ack+/right_req-, right_ack-/ptok_r+, ptok_a+/ptok_r-, ptok_a-/pass_r+, pass_a+/pass_r-, right_ack-/gtok_r+, gtok_a+/gtok_r-, gtok_a-/pass_r+; cell block diagram with the Token Distributor highlighted] Basic Protocol Implementation (7)

  20. Token Distributor: Burst-Mode Implementation [Figure: gate-level circuit; signal labels gtok_r, pass_r, nrr, ptok_r, pass_a, gtok_a, ptok_a, ra, y2, y1, y0] • Synthesized with the MINIMALIST CAD Package [Fuhrer, Nowick et al., 99] • Optimized for speed Basic Protocol Implementation (8)

  21. Overview of the Talk • Basic FIFO: • Basic Protocol • Implementation • Optimized FIFO: • Optimized Protocol • Implementation • Related Work • Results • Conclusions Summary (2)

  22. Problems with Basic Protocol No “Program-Level Parallelism”: • no overlapped write/read to same cell • large latency • poor throughput • two tokens “multiplexed” onto single channels Limited Low-Level Optimizations: • “late enable” of get data bus • handshake overheads • limited fine-grained concurrency Optimized Protocol (1)

  23. ObtainPutToken EnqueueData from the right cell PassPutToken to the left cell ObtainGetToken DequeueData PassGetToken Basic Protocol: Sequential Program Actions strictly sequential Latency (3 actions): • EnqueueData • PassPutToken • ObtainGetToken [DequeueData] Throughput (3 actions): • ObtainPutToken • EnqueueData • PassPutToken

  24. ObtainPutToken from the right cell EnqueueData PassPutToken to the left cell ObtainGetToken DequeueData PassGetToken Optimized Protocol: Concurrent Program Token passing: off critical paths Latency: 1 action Throughput: 2 actions Further low-level optimizations: • effectively improve throughput to 1 action Optimized Protocol (2)

  25. Architectural Modifications Put • Tokens passed on two separate channels • One cell can hold both tokens simultaneously: • allows overlapped writes and reads • get token may be briefly ahead of the put token ! • No explicit “Starter” cell Cell Cell Cell Cell Get Optimized Protocol (3)

  26. Optimized Cell Architecture • ObtainPutToken: receives put token • ObtainGetToken: receives get token • PutController: handles communication on put channel • GetController: handles communication on get channel • DataValid: indicates the validity of REG contents [Figure: cell block diagram with Obtain Put Token, Put Controller, Data Valid, REG, Get Controller and Obtain Get Token; signals put, we1, we, re, re1, get] Optimized Protocol Implementation (1)

  27. Optimized Cell Implementation • OPT/OGT: Burst-mode machines • DV: uses relative timing (synthesized using Petrify) • PC/GC: asymmetric C-elements • Optimizations: • “early data out” enabling • single-wire token passing [Figure: cell circuit with OPT, PC, DV, REG, GC, OGT and C-elements; signals put_req, put_data, put_ack, we1, we, re, re1, get_ack, get_req, get_data] Optimized Protocol Implementation (2)

  28. Enqueuing Data: ObtainPutToken • put token received on we1 = single wire • we+ (when request & token) triggers: • latching data • start passing put token • resetting OPT [Figure: cell circuit with OPT highlighted; transition labels we1+/, we-/, we1-/, ptok+, we+/ptok-; signals put_req, put_data, put_ack, we1, we, ptok, re, re1, get_ack, get_req, get_data] Optimized Protocol Implementation (3)

  29. Data Valid • Asymmetric protocol: • data valid: in active phase of put (we+) • data invalid: in RZ phase of get (re-) • avoids overwrite by next put [Figure: cell circuit with DV highlighted; transition labels we+, valid+, we-, re+, re-, valid-; signals put_req, put_data, put_ack, we1, we, re, re1, get_ack, get_req, get_data] Optimized Protocol Implementation (5)
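The asymmetric Data Valid rule (set in the active phase of the put, cleared in the return-to-zero phase of the get) can be traced with a few lines of code. This is a hedged sketch: the function name `data_valid_trace` and the string encoding of events are assumptions made for illustration, and the relative-timing aspects of the real DV circuit are not modeled.

```python
def data_valid_trace(events):
    """Return the value of the valid flag after each event in the sequence."""
    valid = False
    trace = []
    for event in events:
        if event == "we+":      # active phase of a put: register now holds data
            valid = True
        elif event == "re-":    # return-to-zero phase of a get: data consumed
            valid = False
        # we- and re+ leave the flag unchanged (the asymmetry on the slide)
        trace.append(valid)
    return trace
```

Because the flag stays set through `we-` and `re+`, a following put to the same cell sees valid data until the get has fully completed, which is how the overwrite is avoided.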

  30. Early Enable: Get Data Bus • Early Enable = get token in cell • Late Enable = get token + get request • Extra slack to meet bundling constraints [Figure: cell circuit with OPT, PC, DV, REG, GC, OGT and C-elements; signals put_req, put_data, put_ack, we1, we, re, re1, get_ack, get_req, get_data] Optimized Protocol Implementation (5)

  31. Timing Constraints 1. Pulse-Width Requirements • 2 pulse width constraints • re and we - race between: • state change • environment path • easily met • DV synthesized using Petrify (“slowenv” option) [Figure: cell circuit with OPT, PC, DV, REG, GC, OGT and C-elements] Optimized Protocol Implementation (6)

  32. Timing Constraints 2. Bundling Constraint • Get operation: get_ack must indicate valid data • Bundling constraint: get_data faster than get_ack+ • Moderate size FIFO’s: easy to meet • Very large FIFO’s: padded delays on control • “Early drive” of get_data alleviates the problem: extra slack [Figure: cell circuit with OPT, PC, DV, REG, GC, OGT and C-elements] Optimized Protocol Implementation (7)
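The bundling constraint (get_data must arrive before get_ack+) amounts to a simple inequality on two path delays. The helper below is only an illustration: the function name `bundling_padding`, the safety margin parameter and all numeric values are hypothetical and do not come from the presentation.

```python
def bundling_padding(data_delay_ns, ack_delay_ns, margin_ns=0.0):
    """Extra delay to add on the control path so data arrives before get_ack+.

    Returns 0.0 when the constraint (plus margin) is already met.
    """
    slack = ack_delay_ns - data_delay_ns - margin_ns
    return max(0.0, -slack)
```

For example, with a data path of 2.0 ns, a control path of 1.5 ns and a 0.25 ns margin, the sketch asks for 0.75 ns of padding; an "early drive" of get_data corresponds to reducing `data_delay_ns`, which shrinks or removes the padding.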

  33. Related Work: Close Approaches • Two Designs [Yi95, Chu86]: • use: circular arrays, common data buses, token passing • “Word-Slice FIFO” [Yi95]: • worse throughput for get than ours (10 gates vs. 6) • tighter bundling constraints: uses “late read enable” • FIFO for Packet Networks[Chu86]: • worse throughput for put than ours (6 block delays vs. 4) • tighter bundling constraints: uses “late read enable” Related Work (3)

  34. Overview of the Talk • Basic FIFO: • Basic protocol • Implementation • Optimized FIFO: • Optimized protocol • Implementation • Related Work • Results • Conclusions Summary (3)

  35. Results • HSpice simulations: 0.6µm HP CMOS, 3.3V, 27°C • Word size: 8 bits • Buses modeled carefully: • wire lengths, load • attached capacitance • Various experiments: • FIFO capacity (4- vs. 16-place) • environmental latency (slow vs. fast) Results (1)

  36. Results: Latency S= slow environment F= fast environment Results (1)

  37. Results: Throughput (MegaOps/s). The first three columns are implementations of the basic protocol; the last is the optimized protocol.

      FIFO   Op    Tangram   Petrify (centralized)   Burst-Mode (distributed)   Optimized
      4-S    put   185       200                     200                        404
      4-S    get   161       208                     172                        427
      4-F    put   185       200                     204                        423
      4-F    get   162       216                     175                        454
      16-S   put   175       190                     191                        335
      16-S   get   161       196                     164                        348
      16-F   put   179       195                     192                        359
      16-F   get   162       202                     167                        367

      Results (2)
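To read the throughput figures above as ratios, the short sketch below divides the optimized throughput by the best of the three basic implementations for each configuration. The numbers are copied from the slide; the names `THROUGHPUT` and `speedup_over_best_basic` are hypothetical and introduced only for this illustration.

```python
# Throughput in MegaOps/s from the slide:
# (Tangram, Petrify, Burst-Mode, Optimized)
THROUGHPUT = {
    ("4-S", "put"): (185, 200, 200, 404),
    ("4-S", "get"): (161, 208, 172, 427),
    ("4-F", "put"): (185, 200, 204, 423),
    ("4-F", "get"): (162, 216, 175, 454),
    ("16-S", "put"): (175, 190, 191, 335),
    ("16-S", "get"): (161, 196, 164, 348),
    ("16-F", "put"): (179, 195, 192, 359),
    ("16-F", "get"): (162, 202, 167, 367),
}


def speedup_over_best_basic(config, op):
    """Ratio of optimized throughput to the fastest basic implementation."""
    *basic, optimized = THROUGHPUT[(config, op)]
    return optimized / max(basic)
```

For the 4-place FIFO in a slow environment, the put ratio is 404 / 200 = 2.02.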

  38. Conclusions • Presented novel FIFO designs • Two protocols: basic, optimized • circular buffers • common buses • token passing • Very low latency achieved by protocol manipulation • Maintain high throughput • Potential for low power: no data movement Conclusions (1)

  39. FIFO Behavior: Empty [Figure: ring of cells with Starter between the Put and Get buses; tokens P and G] Basic Protocol (5)

  40. Get Controller • Triggered by (i) a get request and (ii) the get token • Synchronizes handshaking on Get and Gtok channels • If no token, only get_req+ can arrive = partial input burst • If token (gtok_r+), then get_req+ becomes an input burst • Burst-mode transitions: get_req+ gtok_r+ / get_ack+ gtok_a+; get_req- gtok_r- / get_ack- gtok_a- [Figure: cell block diagram with the Get Controller highlighted] Basic Protocol Implementation (5)

  41. Left Controller • Waits for a request for tokens and their availability • Completes handshaking on both Left and Pass channels • Burst-mode transitions: left_req+ pass_r+ / left_ack+ pass_a+; left_req- pass_r- / left_ack- pass_a- [Figure: cell block diagram with the Left Controller highlighted] Basic Protocol Implementation (6)

  42. Overview of Approach • The FIFO interfaces two environments • Circular structure of identical cells • Cells connected to common data and control buses • Two tokens dictate the I/O behavior • put token selects the input cell • get token selects the output cell • Once enqueued, data is not moved until dequeuing. Thus the potential for low latency Introduction (2)

  43. Introduction • Distributed control • Circular buffer of identical cells • Common buses: all cells communicate on them • Token passing determines the I/O behavior • FIFO allows concurrent reads/writes • When full, every cell contains data (capacity N)

  44. Put Controller • If token (ptok_r+), cell does the put operation • If no token, no put: put_req+/put_req- partial input bursts => ignored • Burst-mode implementation handles this behavior • Burst-mode transitions: put_req+ ptok_r+ / put_ack+ ptok_a+; put_req- ptok_r- / put_ack- ptok_a- [Figure: cell block diagram with the Put Controller highlighted] Basic Protocol Implementation (4)

  45. Dequeuing Data: ObtainGetToken • get token received on re1: single wire • no 4-phase handshaking [Figure: cell circuit with OGT highlighted; transition labels re1+/, re-/, gtok-, re1-/, gtok+, re+/; signals put_req, put_data, put_ack, we1, we, re, re1, get_ack, get_req, get_data] Optimized Protocol Implementation (4)

  46. Dequeuing Data: ObtainGetToken • When there is a get request, generate re to: • start passing the get token • ack the receiver • start resetting OGT [Figure: cell circuit with OGT highlighted; transition labels re1+/, re-/, gtok-, re1-/, gtok+, re+/] Optimized Protocol Implementation (4)

  47. Enqueuing Data: ObtainPutToken • When there is a put request, generate we to: • latch data • start passing put token • reset OPT [Figure: cell circuit with OPT highlighted; transition labels we1+/, we-/, we1-/, ptok+, we+/ptok-] Optimized Protocol Implementation (3)
