
Jungju Oh, Alenka Zajic, Milos Prvulovic

Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect. Jungju Oh, Alenka Zajic, Milos Prvulovic. Contents: Introduction, Hybrid Network, Low-Latency Transmission Line Ring, Traffic Steering, Evaluation, Result, Conclusion.


Presentation Transcript


  1. Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect • Jungju Oh, Alenka Zajic, Milos Prvulovic

  2. Contents • Introduction • Hybrid Network • Low-Latency Transmission Line Ring • Traffic Steering • Evaluation • Result • Conclusion

  3. Introduction • On-chip communication latency is increasing • Broadcast interconnect • Insufficient bandwidth and excessive delay for many-core chips • Growing core counts → contention • Growing core counts → longer wires → larger wire capacitance → longer delay • Unfavorable wire delay with technology scaling • Packet-switched on-chip network (OCN) • Short links → fast communication between adjacent nodes • Scalable aggregate bandwidth • Packets traverse many links and pipelined routers • Growing core counts → increasing hop counts and latency for far-apart cores [Figure source: ITRS 2012]

  4. Motivation • Switched on-chip network • Good latency for local traffic, but not for long-distance traffic • Much more local than long-distance traffic • Broadcast interconnect • Avoids routing latency even for long-distance traffic • Cannot handle much traffic

  5. Hybrid Network • Exploit the strengths • Broadcast on Transmission Line: low latency • Switched on-chip network: throughput • … and alleviate the weaknesses • Limited TL throughput – use the TL only for critical and/or long-distance traffic • High switching overhead for long-distance traffic – use the TL • Two critical components to this work • Transmission Line Broadcast Interconnect – the Why and the How • Traffic Steering – which messages use which interconnect

  6. Transmission Line • Why Transmission Line? • Extremely fast propagation • Uses an electromagnetic wave for signal propagation • 0.0075 ns/mm (unrepeated wire: 0.54 ns/mm) • Not affected by technology scaling • But expensive in terms of metal area (20 µm-wide vs. 0.135 µm global wire) • Limited throughput [Figure: cross-sections of a transmission line (with ground) vs. a traditional global wire, with dimension labels]
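To put the two per-millimeter figures above side by side, here is a minimal sketch that computes propagation delay over a hypothetical 20 mm path. The path length is an assumption chosen for illustration and does not come from the slides. Treating the unrepeated wire delay as linear in length is also a simplification that just reuses the quoted ns/mm number.

```python
# Per-millimeter propagation delays quoted on the slide.
TL_NS_PER_MM = 0.0075    # transmission line
WIRE_NS_PER_MM = 0.54    # unrepeated traditional wire


def propagation_delay_ns(length_mm: float, ns_per_mm: float) -> float:
    """Delay of a signal travelling length_mm at a fixed per-mm delay."""
    return length_mm * ns_per_mm


# Hypothetical path length (assumption, for illustration only).
PATH_MM = 20.0

tl_delay = propagation_delay_ns(PATH_MM, TL_NS_PER_MM)
wire_delay = propagation_delay_ns(PATH_MM, WIRE_NS_PER_MM)
ratio = WIRE_NS_PER_MM / TL_NS_PER_MM

print(f"TL: {tl_delay:.2f} ns, wire: {wire_delay:.2f} ns, ratio: {ratio:.0f}x")
```

With the quoted figures, the per-millimeter delay of the transmission line is 72 times lower than that of the unrepeated wire, regardless of the path length assumed.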

  7. Transmission Line Ring • Transmission Line • Extremely fast propagation • But expensive in terms of metal area • Why Ring? • Minimizes overall TL cost • Allows fast arbitration (token passing)

  8. Unidirectional Transmission Line Ring • Two major problems with TL, caused by the many connections in a many-core chip • Attenuation of signal (power split at connections) • Signal reflections/reverberations (discontinuity at connections) • Signal needs to stay stronger than the sum of noise and reverberations! • A Unidirectional Transmission Line (UTL) ring makes this easy to design • Chained directional couplers in a ring shape • Control of attenuation • Almost no reflected signal • Directional Coupler • Two TL lines running in parallel

  9. Unidirectional Transmission Line Ring • Directional Coupler • Two TL lines running in parallel • Signal goes into one end ① • Most comes out on the other end ② • But some is transferred (EM-coupled) in the same direction on the other line ③ • Directivity: (almost) no signal on ④ • Chain couplers using one line, use the other to connect transmitters/receivers [Figure: directional coupler with ports ① to ④ (④ marked ×), and couplers chained along the transmission line connecting Tx1/Rx1 of Core 1 and Tx2/Rx2 of Core 2]

  10. Using the UTL Ring • Simple receiver/transmitter • Simple modulation: on-off keying • 1 bit = one or more consecutive pulses • How fast can we transfer? • Depends on the available spectrum of the transmission medium • UTL coupler: 20–60 GHz • 40 GHz clock, 2 pulses/bit → 20 Gbps • Transmitter • PLL (pulses) • Pass-gate (on/off pulses) • Amplifier (impedance matching) • Receiver • Pulse detector • Shift register (collects high-rate bits) [Figure: transmitter (data, PLL, amplifier) and receiver (detector, shift register, data) block diagrams]
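The data-rate arithmetic and the on-off keying scheme above can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the decoding rule (a bit is 1 if any of its pulse slots carries a pulse) is an assumption, since the slides do not say how the detector decides.

```python
def bit_rate_gbps(clock_ghz: float, pulses_per_bit: int) -> float:
    """Data rate when each bit occupies pulses_per_bit clock periods."""
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 is sent as consecutive pulses, a 0 as silence."""
    slots = []
    for b in bits:
        slots.extend([b] * pulses_per_bit)
    return slots


def ook_decode(slots, pulses_per_bit=2):
    """Recover bits; assumed rule: 1 if any pulse is seen in the bit's slots."""
    return [int(any(slots[i:i + pulses_per_bit]))
            for i in range(0, len(slots), pulses_per_bit)]


rate = bit_rate_gbps(40.0, 2)      # 40 GHz clock, 2 pulses per bit
pulses = ook_encode([1, 0, 1])     # pulse train on the line
print(rate, pulses, ook_decode(pulses))
```

Using more pulses per bit trades data rate for easier detection, which is why the rate is the clock frequency divided by the pulse count.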

  11. Traffic Steering • Which packet should use which network? • Static steering • E.g., packets travelling >8 hops go to the TL ring, the rest go on the mesh • Lacks adaptivity • When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring • When traffic is high, the ring can become saturated
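A static steering rule of the kind described above could look like the sketch below. It assumes an 8×8 mesh with row-major node numbering and Manhattan (XY-routing) hop counts. Both are assumptions for illustration, and the function names are hypothetical.

```python
def mesh_hops(src: int, dst: int, width: int = 8) -> int:
    """Manhattan distance between two nodes of a row-major width x width mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def steer_static(src: int, dst: int, hop_threshold: int = 8) -> str:
    """Fixed rule: packets with more than hop_threshold hops use the ring."""
    return "ring" if mesh_hops(src, dst) > hop_threshold else "mesh"


print(steer_static(0, 63))   # corner to corner: 14 hops
print(steer_static(0, 1))    # neighbours: 1 hop
```

Because the threshold is a constant, the rule cannot react to load: it keeps sending every long-distance packet to the ring even when the ring is saturated.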

  12. Adaptive Steering • Ring-Affinity Score • More hops → more benefit from using the ring • Non-critical packet → no benefit • Ring-Affinity Score = latency difference plus criticality adjustment • Threshold • Score above threshold → use ring • Adjust threshold to prevent ring bandwidth saturation • Too much traffic on the ring → queuing delays → all benefit disappears
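As a rough illustration of the score-and-threshold decision above, the sketch below computes a score as the latency difference plus a criticality adjustment and compares it to a threshold. The penalty value and all names are hypothetical. The slides only state that the adjustment is a constant penalty for non-critical messages.

```python
def ring_affinity_score(mesh_latency: float, ring_latency: float,
                        is_critical: bool,
                        noncritical_penalty: float = 10.0) -> float:
    """Latency benefit of the ring plus a criticality adjustment.

    noncritical_penalty is a made-up constant for illustration.
    """
    benefit = mesh_latency - ring_latency
    adjustment = 0.0 if is_critical else -noncritical_penalty
    return benefit + adjustment


def steer(score: float, threshold: float) -> str:
    """Use the ring only when the score exceeds the current threshold."""
    return "ring" if score > threshold else "mesh"


critical = ring_affinity_score(30.0, 12.0, is_critical=True)
background = ring_affinity_score(30.0, 12.0, is_critical=False)
print(steer(critical, 10.0), steer(background, 10.0))
```

Two packets with the same latency estimates can therefore be steered differently: the critical one clears the threshold, while the non-critical one stays on the mesh.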

  13. Ring-Affinity Score • Score: $S = \Delta L + C$ • $C$: criticality adjustment • Constant penalty applied to non-critical coherence messages, for simplicity • $\Delta L = L_{\text{mesh}} - L_{\text{ring}}$ (latency benefit) • $L_{\text{mesh}}$: latency estimate for the mesh • $L_{\text{ring}}$: latency estimate for the UTL ring • How to get $L_{\text{mesh}}$? • Depends on the packet's hop count and on mesh network congestion • Tried using just hop count times router latency: not good enough! • Small cache in each node stores recent latencies for each hop count • E.g. 8×8 mesh → 15 hop counts → 15 sets in the latency cache • Each set keeps the most recently observed latencies • Predictor chooses between using just the most recent latency, the average of the latest latencies, or the average of all latencies in the set
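The per-hop-count latency cache could be modelled roughly as below. The slides do not say how deep each set is, how many entries count as "latest", or how the predictor picks among its three options, so the depth, the `recent` window, and the lowest-accumulated-error selection rule are all assumptions made for this sketch.

```python
from collections import deque


class LatencyCache:
    """Per-hop-count history of observed mesh latencies (illustrative sketch)."""

    def __init__(self, num_sets=15, depth=8, recent=2):
        # One bounded history per hop count; depth and recent are assumed values.
        self.sets = [deque(maxlen=depth) for _ in range(num_sets)]
        self.recent = recent
        # Accumulated absolute error of each candidate predictor, per set.
        self.errors = [[0.0, 0.0, 0.0] for _ in range(num_sets)]

    def _candidates(self, hist):
        latest = list(hist)[-self.recent:]
        return [hist[-1],                      # most recent latency
                sum(latest) / len(latest),     # average of the latest few
                sum(hist) / len(hist)]         # average of the whole set

    def predict(self, hops, default):
        hist = self.sets[hops]
        if not hist:
            return default
        errs = self.errors[hops]
        # Assumed rule: trust the candidate with the lowest error so far.
        return self._candidates(hist)[errs.index(min(errs))]

    def observe(self, hops, latency):
        hist = self.sets[hops]
        if hist:
            for i, guess in enumerate(self._candidates(hist)):
                self.errors[hops][i] += abs(guess - latency)
        hist.append(latency)


cache = LatencyCache()
for observed in (10, 30, 30):
    cache.observe(5, observed)
print(cache.predict(5, default=20))
```

Keying the history by hop count lets congestion show up in the estimate, which a fixed hop-count-times-router-latency formula cannot capture.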

  14. Ring-Affinity Scoring • Estimating the ring latency • How long to transmit? Easy. • How long to get the token? • We see everything on the ring! • Can remember who sent the last few packets, and when • We know how far away the token is (last sender) • We can estimate how "fast" it "moves" • Example: 7 nodes in 10 cycles (0.7 nodes/cycle) • If the token is 30 nodes away, the estimated wait is about 43 cycles (30 / 0.7) • Detailed equations and explanations are in the paper [Figure: ring on which core 3 sent a packet at cycle 10 and core 10 sent a packet at cycle 20]
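The token-wait estimate can be sketched as follows, using the two observations from the example (core 3 sent at cycle 10, core 10 sent at cycle 20). The 64-node ring size, the increasing-node-number token direction, and the function names are assumptions for illustration. The paper's actual equations may differ. Dividing the distance by the speed gives the wait in cycles, so a token 30 nodes away moving at 0.7 nodes/cycle is expected in roughly 43 cycles.

```python
def token_speed(prev_sender: int, prev_cycle: int,
                last_sender: int, last_cycle: int,
                ring_size: int = 64) -> float:
    """Observed token speed in nodes per cycle between two transmissions."""
    nodes = (last_sender - prev_sender) % ring_size
    cycles = last_cycle - prev_cycle
    if cycles <= 0:
        raise ValueError("observations must be in increasing cycle order")
    return nodes / cycles


def estimate_token_wait(my_node: int, last_sender: int, speed: float,
                        ring_size: int = 64) -> float:
    """Cycles until the token reaches my_node, assuming constant speed."""
    distance = (my_node - last_sender) % ring_size
    return distance / speed


speed = token_speed(3, 10, 10, 20)          # 7 nodes in 10 cycles
wait = estimate_token_wait(40, 10, speed)   # token is 30 nodes away
print(f"{speed:.1f} nodes/cycle, wait of about {wait:.0f} cycles")
```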

  15. Threshold and Re-steering • Threshold adjusted to manage UTL ring utilization • Low enough to avoid excessive queuing • But high enough not to waste the ring throughput • Target utilizations around 75% tend to work well • Threshold Management • Packet steered to ring when its score exceeds the threshold • Increase the threshold when ring utilization is higher than desired • Decrease the threshold when ring utilization is too low • Re-steering • Sudden burst of high-scoring packets… • Threshold adaptation takes a while • Meanwhile, ring packets have very long latencies • If a ring-steered packet sits in the queue too long, re-steer it to the mesh • How long is too long?
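A minimal sketch of the threshold management and re-steering logic above is shown below. The 75% target comes from the slide. The step size, the initial threshold, and the `max_queue_cycles` limit are hypothetical parameters, since the slide leaves "how long is too long" open.

```python
class ThresholdController:
    """Nudges the steering threshold toward a target ring utilization."""

    def __init__(self, threshold=0.0, target_util=0.75, step=1.0):
        self.threshold = threshold
        self.target_util = target_util   # around 75% per the slide
        self.step = step                 # assumed adjustment step

    def update(self, ring_utilization: float) -> float:
        if ring_utilization > self.target_util:
            self.threshold += self.step  # ring too busy: admit fewer packets
        elif ring_utilization < self.target_util:
            self.threshold -= self.step  # ring underused: admit more packets
        return self.threshold

    def use_ring(self, score: float) -> bool:
        return score > self.threshold


def should_resteer(queued_cycles: int, max_queue_cycles: int) -> bool:
    """Send a ring-steered packet back to the mesh after waiting too long."""
    return queued_cycles > max_queue_cycles


ctrl = ThresholdController(threshold=5.0)
print(ctrl.update(0.90), ctrl.update(0.50), should_resteer(12, 10))
```

The controller reacts only once per utilization sample, which is why a separate re-steering check is needed to handle bursts that arrive faster than the threshold can adapt.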

  16. Evaluation • Simulated using SESC • 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2 • 8×8 mesh (switched NoC) with 128-bit link width, 8 VC (24 buffers) • Applications from the PARSEC 3 and SPLASH-2 benchmark suites • Half of the applications show <20% improvement with an ideal interconnect • Focus analysis on applications sensitive to on-chip latency

  17. Speedup • 1.14×

  18. Speedup • 4-concentrated mesh + UTL Ring • 8.7% improvement: 1.13× → 1.23×

  19. Speedup • 4-concentrated mesh + UTL Ring • 8.7% improvement: 1.13× → 1.23× • Flattened Butterfly + UTL Ring • 5.7% improvement: 1.10× → 1.16×

  20. Summary • Increasing core counts worsens on-chip latency • Unidirectional Transmission Line Ring • Low-latency • But limited throughput • Use UTL Ring with switched interconnect synergistically • UTL Ring for low latency • Switched interconnect for throughput • Adaptive traffic steering enables judicious use of the ring • Proposed traffic steering provides 14% performance improvement

  21. Thank you!

  22. Result: Latency Reduction of UTL Ring • UTL ring latency is 55% lower than the mesh • Lower latency than advanced interconnects • >44% latency reduction over concentrated mesh and flattened butterfly • But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits) [Figure: latency reduction chart, with labels 44.3% and 43.9%]

  23. Result: Speedup vs. Mesh Alone • Performs slightly better than advanced on-chip networks • 1.14× (mesh + UTL ring) vs. 1.13× (concentrated mesh) and 1.10× (flattened butterfly)

  24. Adaptive vs. Non-Adaptive Steering • Non-adaptive random steering • Slowdown to 0.63× on an application with high on-chip traffic (ocean-nc) • 1.02× speedup if 30% of packets use the UTL ring at random (RND30) • Slowdown to 0.96× if 50% do (RND50) • Adaptive traffic steering • 1.14× speedup (up to 1.20× with the 64 Gbps configuration)
