Jae W. Lee (MIT) Man Cheuk Ng (MIT) Krste Asanovic (UC Berkeley)

GLOBALLY-SYNCHRONIZED FRAMES FOR GUARANTEED QUALITY-OF-SERVICE IN ON-CHIP NETWORKS
Jae W. Lee (MIT), Man Cheuk Ng (MIT), Krste Asanovic (UC Berkeley)
ISCA-35, Beijing, China, June 23rd, 2008

Resource sharing increases performance variation
• Resource sharing (+) reduces hardware cost, (-) increases performance variation.
• This performance variation grows as the number of sharers (cores) increases.
[Figure: 16 processors (P) connected through a multi-hop on-chip network to shared L2$ banks and memory controllers]

Desired quality-of-service from shared resources
• Performance isolation (fairness)
• Differentiated services (flexibility)
[Figure: 16 processors (P) sharing L2$ banks and memory controllers (one marked as hotspot) through a multi-hop on-chip network; two charts of accepted throughput [MB/s] versus processor ID (0-F), one marking the minimum guaranteed BW and one showing a differentiated allocation]

Resources w/ centralized arbitration are well investigated
• Resources with centralized arbitration
  • SDRAM controllers
  • L2 cache banks
• They have a single entry point for all requests. → QoS is relatively easier and well investigated.
[Figure: P+L1$ nodes, an L2$ bank, and a memory controller attached to on-chip routers (R). Prior work cited on the slide: [HPCA ’02] [ICS ’04] [ISCA ’07] … [MICRO ’06] [PACT ’07] [USENIX sec. ’07] [IBM ’07] [MICRO ’07] [ISCA ’08] ...]

QoS from on-chip networks is a challenge
• Resources with distributed arbitration
  • multi-hop on-chip networks
• They have distributed arbitration points. → QoS is more difficult.
• Off-chip solutions cannot be directly applied because of resource constraints.
[Figure: P+L1$ nodes, an L2$ bank, and a memory controller connected by on-chip routers (R)]

We guarantee QoS for flows
• Flow: a sequence of packets between a unique pair of end nodes (src and dest)
  • physical links shared by flows
  • multiple stages of arbitration for each packet
• We provide guaranteed QoS to each flow with:
  • minimum bandwidth guarantees
  • bounded maximum delay
[Figure: 16 routers (R) with a physical link shared by 3 flows and a hotspot resource]

Locally fair ⇏ globally fair
With locally fair round-robin (RR) arbitration:
• Throughput (Flow A) = (0.5) C
• Throughput (Flow B) = (0.5)^2 C
• Throughput (Flow C) = Throughput (Flow D) = (0.5)^3 C
→ Throughput of a flow decreases exponentially as its distance to the destination (hotspot) increases.
[Figure: sources A-D reach DEST through arbitration points 1-3; channel rate = C [Gb/s]]
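The per-flow throughputs listed on this slide can be reproduced with a small calculation. The sketch below is illustrative only: it assumes a saturated chain of two-input round-robin arbiters in which each stage splits its output bandwidth evenly between the local source and upstream traffic, and the function name `rr_chain_throughputs` is hypothetical.

```python
def rr_chain_throughputs(num_sources: int, channel_rate: float = 1.0) -> list[float]:
    """Throughput per source, ordered from nearest to farthest from the destination.

    Assumption: a chain of two-input round-robin arbiters under saturation.
    Each arbiter gives half of its output bandwidth to one source and half
    to the traffic arriving from upstream. The two farthest sources meet at
    the last arbiter, so they receive the same rate.
    """
    if num_sources < 2:
        return [channel_rate] * num_sources
    rates = []
    remaining = channel_rate
    for _ in range(num_sources - 1):
        remaining *= 0.5        # one more arbitration stage halves the share
        rates.append(remaining)
    rates.append(remaining)     # farthest source shares the last arbiter
    return rates


# Flows A-D from the slide, with the channel rate normalized to C = 1.
for name, rate in zip("ABCD", rr_chain_throughputs(4)):
    print(f"Flow {name}: {rate:.3f} C")
```

With four sources the shares are 0.5 C, 0.25 C, 0.125 C, and 0.125 C, which add up to the full channel rate even though the split is far from even.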

Motivational simulation
• 8x8 2D mesh network with RR arbitration (hotspot at (8, 8))
• locally-fair round-robin scheduling → globally unfair bandwidth usage
[Figure: two plots of accepted throughput [flits/cycle/node] versus node index (X, Y), one w/ minimal-adaptive routing and one w/ dimension-ordered routing]

Desired bandwidth allocation: an example
Taken from simulation results with GSF:
[Figure: two plots of accepted throughput [flits/cycle/node] versus node index (X, Y): fair allocation and differentiated allocation]

Globally Synchronized Frames (GSF) provide guaranteed QoS with minimum bandwidth guarantees and bounded maximum delay to each flow in multi-hop on-chip networks:
• with high network utilization comparable to a best-effort virtual-channel router
• with minimal area/energy overhead by avoiding per-flow queues/structures in on-chip routers → scalable to the number of concurrent flows

Outline of this talk
• Motivation
• Globally-Synchronized Frames: a step-by-step development of mechanism
• Implementation of GSF router
• Evaluation
• Related work
• Conclusion

GSF takes a frame-based approach
• A frame is a coarse quantization of time.
  • The network can transport a finite number of flits during this interval.
• We constrain each flow source to inject a certain number of flits per frame.
  • shorter frames → coarser BW control but lower maximum delay
  • typically 1-100s Kflits / frame (over all flows) in 8x8 mesh network
[Figure: 16 routers (R) with a shared physical link; timeline divided into frames 0-4]

Admission control of flows
• Admission control: reject a new flow if it would make the network unable to transport all the injected flits within a frame interval.
[Figure: 16 routers (R) with a shared physical link; timeline divided into frames 0-4]
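One way to picture the admission rule is as a per-link budget check. The sketch below is an assumption-laden illustration, not the mechanism from the talk: it assumes each link can carry a fixed number of flits per frame interval, that a flow reserves the same number of flits per frame on every link of its path, and the class name `AdmissionController` and its methods are hypothetical.

```python
from collections import defaultdict


class AdmissionController:
    """Hypothetical per-link admission check for frame-based reservations."""

    def __init__(self, frame_size_flits: int):
        # Assumption: each link can carry at most `frame_size_flits` flits
        # within one frame interval.
        self.frame_size = frame_size_flits
        self.reserved = defaultdict(int)  # link -> flits already reserved per frame

    def admit(self, path: list[tuple[str, str]], flits_per_frame: int) -> bool:
        """Admit the flow only if every link on its path still has room."""
        if any(self.reserved[link] + flits_per_frame > self.frame_size for link in path):
            return False  # some link could not drain all injected flits in one frame
        for link in path:
            self.reserved[link] += flits_per_frame
        return True


ac = AdmissionController(frame_size_flits=10)
print(ac.admit([("r0", "r1"), ("r1", "r2")], 6))  # both links are empty
print(ac.admit([("r1", "r2")], 5))                # would exceed the budget on r1->r2
print(ac.admit([("r1", "r2")], 4))                # exactly fills r1->r2
```

A rejected flow leaves the reservation table unchanged, so later, smaller requests can still be admitted.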

Single frame does not service bursty traffic well
• Both traffic sources have the same long-term rate: 2 flits / frame.
• Allocating 2 flits / frame penalizes the bursty source.
[Figure: injection timelines of a regulated source and a bursty source over frames 0-5]

Overlapping multiple frames to help bursty traffic
• Overlapping multiple frames to multiply injection slots
  • Sources can inject flits into future frames (w/ separate per-frame buffers).
  • Older frames have higher priorities for contended channels.
    • Drain time of head frame does not change.
  • Future frames can use BW unclaimed by older frames.
  • Maximum network delay < 3 * (frame interval)
• Best-effort traffic: always lowest priority (throughput ↑)
[Figure: timeline of overlapping frames 0-7; at each point in time the oldest active frame is the head frame and the others are future frames]
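The benefit for bursty sources can be seen by counting how many flits of a burst find an injection slot right away. This is a minimal sketch under stated assumptions: the source has the same number of unused slots in every active frame and fills the oldest frame first; the function name `assign_burst_to_frames` is hypothetical.

```python
def assign_burst_to_frames(burst_flits: int, slots_per_frame: int,
                           window_size: int, head_frame: int = 0):
    """Spread a burst over the active frame window, oldest frame first.

    Returns (assignment, leftover): `assignment` maps frame number to the
    number of flits injected into that frame, and `leftover` is the number
    of flits that must wait for a later frame window shift.
    Assumption: every active frame still has `slots_per_frame` free slots.
    """
    assignment = {}
    remaining = burst_flits
    for frame in range(head_frame, head_frame + window_size):
        take = min(slots_per_frame, remaining)
        if take:
            assignment[frame] = take
        remaining -= take
    return assignment, remaining


# A 6-flit burst from a source allocated 2 flits / frame.
print(assign_burst_to_frames(6, 2, window_size=1))  # single frame: 4 flits must wait
print(assign_burst_to_frames(6, 2, window_size=3))  # three overlapping frames: none wait
```

With a single frame only two flits of the burst are injected immediately; with three overlapping frames the whole burst is accepted, at the cost of the later flits travelling at lower priority.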

Reclamation of frame buffers
• Per-frame buffers (at each node) = virtual channels
• At every frame window shift, frame buffers (or VCs) associated with the earliest frame in the previous epoch are reclaimed for the new futuremost frame.
[Figure: frames 0-7 mapped onto VC0-VC2 over epoch0-epoch5, with a frame window shift at each epoch boundary]
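The reclamation rule amounts to a sliding window over frame numbers. The sketch below assumes that frame f uses VC (f mod W) for a window of W frames, which matches the three-VC figure (frames 0, 1, 2 on VC0, VC1, VC2) but is otherwise an assumption; the class name `FrameWindow` is hypothetical.

```python
class FrameWindow:
    """Sliding window of active frames with a fixed frame-to-VC mapping."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.head_frame = 0

    def active_frames(self) -> list[int]:
        return list(range(self.head_frame, self.head_frame + self.window_size))

    def vc_for_frame(self, frame: int) -> int:
        # Assumption: frame f is buffered in VC (f mod W).
        return frame % self.window_size

    def shift(self) -> tuple[int, int]:
        """Retire the head frame; return (reclaimed VC, new futuremost frame)."""
        reclaimed_vc = self.vc_for_frame(self.head_frame)
        self.head_frame += 1
        new_frame = self.head_frame + self.window_size - 1
        return reclaimed_vc, new_frame


window = FrameWindow(window_size=3)
print(window.active_frames())       # frames 0, 1, 2 are active
vc, frame = window.shift()
print(vc, frame)                    # VC 0 is reclaimed for frame 3
print(window.active_frames())       # frames 1, 2, 3 are active
```

Because the new futuremost frame is always W frames after the retired one, it maps onto exactly the VC that was just freed.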

Early reclamation improves network throughput
• Observation: the head frame usually drains much earlier than the frame interval → low buffer utilization
• Terminate the head frame early if empty.
  • Use a global barrier network to confirm that no pending packet in a router or source queue belongs to the head frame.
• Empty buffers are reclaimed much faster and overall throughput increases (by >30% for the hotspot traffic pattern).
[Figure: frames 0-7 over time, comparing fixed-length epochs (epoch0-epoch5) with shorter, early-terminated epochs (e0-e7); frame window shifts marked]

GSF in action: two-router network example (3 VCs)
[Figure sequence, six steps: Flows A-D send flits through two routers, each with VC 0, VC 1, and VC 2 assigned to Frames 0, 1, and 2 (the active frame window). The flits in VC 0 (Frame 0) drain step by step. Once VC 0 is empty at both routers, the frame window shifts to Frames 1-3 and VC 0 is reassigned to Frame 3.]

Carpool lane sharing
• Buffers are expensive in on-chip environment.
  • With a fixed frame-VC mapping, a flit cannot be transported even if there is an empty slot in other frame buffers.
• Carpool lane sharing: relaxing frame-VC mapping to improve buffer utilization
  • Reserve one frame buffer (VC0) for head frame only → does not increase the drain time of head frame
  • The other buffers are now colorless and can be used by any frame.
  • Head-of-line (HoL) blocking is prevented by not allowing two packets to occupy a VC simultaneously (OK for shallow buffers).
[Figure: VC 0 (Fk) is the carpool lane reserved for the head frame only; with the fixed mapping, VC 1 and VC 2 hold F(k+1) and F(k+2); with carpool lane sharing, VC 1 and VC 2 accept flits from Fk, F(k+1), and F(k+2)]
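The carpool-lane rule can be expressed as a filter over candidate VCs. The following is a sketch of that rule only, not the router's allocator: the function name `eligible_vcs` and the list-of-booleans occupancy model are assumptions made for illustration.

```python
def eligible_vcs(packet_frame: int, head_frame: int, vc_occupied: list[bool]) -> list[int]:
    """Return the VC indices a new packet may be allocated.

    VC 0 is the carpool lane and accepts head-frame packets only. Every other
    VC is "colorless" and accepts any frame, but only while it holds no other
    packet (one packet per VC avoids head-of-line blocking).
    """
    free = [vc for vc, busy in enumerate(vc_occupied) if not busy]
    if packet_frame != head_frame:
        free = [vc for vc in free if vc != 0]  # carpool lane is off limits
    return free


occupied = [False, True, False]                 # VC1 currently holds a packet
print(eligible_vcs(5, 5, occupied))             # head-frame packet: VC0 or VC2
print(eligible_vcs(6, 5, occupied))             # future-frame packet: VC2 only
print(eligible_vcs(6, 5, [False, True, True]))  # future-frame packet must wait
```

Even when every shared VC is busy, a head-frame packet can still use VC 0, which is why the reservation keeps the head frame's drain time from growing.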

Baseline virtual channel (VC) router
• Best-effort router
• Three-stage pipeline with look-ahead routing: VA/NRC-SA-ST
• Credit-based flow control
• VC, SW allocators: iSlip
  • uses round-robin arbiters (locally fair)
  • updates the priority of each arbiter only when that arbiter generates a winning grant
[Figure: baseline router for 2D mesh networks, with next route computation, VC allocation, SW allocation, flit buffers (VC0-VC3 per input port), and a 5x5 crossbar to ports P, N, E, S, and W]

GSF router
• VC0: carpool lane reserved for head frame only
• New registers
  • head_frame (HF) (per node)
  • frame_num (per VC)
• NRC: priority precalculation
  • (frame_num - HF) (mod W) (0 is the highest priority.)
• VC and SW allocation: priority enforcement
• Global barrier network for frame window shifting
[Figure: the baseline router extended with a head_frame register (head_frame=2 in the example), a frame_num tag on each VC, and a global barrier network carrying the head_frame_empty_at_node_X and increment_head_frame signals]
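The priority precalculation is a single modular subtraction. The sketch below shows that formula and one possible way to use it during allocation; it assumes frame numbers are stored modulo the window size W, and the tie-breaking rule in `pick_winner` (first listed wins) is an assumption rather than the router's actual policy. Both function names are hypothetical.

```python
def frame_priority(frame_num: int, head_frame: int, window_size: int) -> int:
    """Priority of a flit: (frame_num - HF) mod W, where 0 is the highest."""
    return (frame_num - head_frame) % window_size


def pick_winner(requests: list[tuple[str, int]], head_frame: int, window_size: int) -> str:
    """Grant the requester whose flit belongs to the oldest frame.

    `requests` holds (requester, frame_num) pairs. Assumption: ties go to the
    first listed requester; a real allocator may break ties differently.
    """
    return min(requests, key=lambda r: frame_priority(r[1], head_frame, window_size))[0]


# head_frame = 2 with a window of W = 4 frames (frame numbers stored mod 4).
for frame_num in (2, 3, 0, 1):
    print(frame_num, "->", frame_priority(frame_num, head_frame=2, window_size=4))

print(pick_winner([("west", 1), ("north", 3), ("east", 0)], head_frame=2, window_size=4))
```

The modulo keeps the comparison correct after frame numbers wrap around: with head_frame = 2, frame 0 is two frames in the future, not two frames in the past.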

Simulation setup
• Network simulator: Booksim
  • 0.5 M cycles with 50K-cycle warm-up
• Network configuration
  • 8x8 2D mesh, dimension-ordered routing, 1 flit/cycle link capacity
• Four traffic patterns
  • one QoS traffic pattern: hotspot
  • three best-effort traffic patterns: uniform random, transpose, nearest neighbor
  • packet size is either 1 or 9 flits (with 50-50 chance)
• Baseline VC router
  • 3-stage pipeline (VA/NRC-SA-ST), 2-cycle credit pipeline delay
  • 6 VCs/physical link, buffer depth is 5 flits/VC
• GSF parameters
  • frame window size = 6 [frames], frame size = 1,000 [flits]
  • global barrier latency = 16 [cycles] (conservative)

Flexible guaranteed QoS provided
• All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
  • Ri: # of flit injection slots for Flow i
  • eMAX: maximum epoch interval
• Example: 8x8 mesh network
[Figure: accepted throughput [flits/cycle/node] (axis from 0 to 0.06) versus node index (X, Y): (a) fair allocation, (b) differentiated allocation]
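The bound Ri/eMAX is easy to evaluate for a given allocation. The sketch below uses made-up numbers purely for illustration: the slot counts, the value of eMAX, and the measured throughputs are hypothetical and are not taken from the talk's results; the function names are also hypothetical.

```python
def min_guaranteed_bandwidth(injection_slots: int, max_epoch_interval: int) -> float:
    """Ri / eMAX in flits per cycle: Ri slots per frame, eMAX cycles per epoch at most."""
    return injection_slots / max_epoch_interval


def meets_guarantee(accepted: float, injection_slots: int, max_epoch_interval: int) -> bool:
    """True if the measured accepted throughput is at least the guaranteed minimum."""
    return accepted >= min_guaranteed_bandwidth(injection_slots, max_epoch_interval)


# Hypothetical differentiated allocation: slots per frame for two flows,
# with an assumed maximum epoch interval of 1,000 cycles.
E_MAX = 1000
slots = {"flow0": 10, "flow1": 30}
for flow, r_i in slots.items():
    print(flow, min_guaranteed_bandwidth(r_i, E_MAX), "flits/cycle")

print(meets_guarantee(0.035, slots["flow1"], E_MAX))
```

Tripling a flow's injection slots triples its guaranteed minimum, which is how a differentiated allocation is expressed in this scheme.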

Flexible guaranteed QoS provided (continued)
• All flows receive more than their minimum guaranteed bandwidth (Ri/eMAX) in accessing hotspot.
  • Ri: # of flit injection slots for Flow i
  • eMAX: maximum epoch interval
• Example: 16x16 torus network with 4 hotspot nodes
[Figure: accepted throughput [flits/cycle/node] versus node index (X, Y): (c) differentiated allocation]

Small throughput degradation for best-effort traffic
• Network behavior with non-QoS traffic
  • no latency increase in uncongested region
  • at most 12 % degradation of network saturation throughput → can be reduced with larger frame (at the cost of delay bound increase)
[Figure: average delay [cycles] versus offered load [flits/cycle/node] for uniform random, nearest neighbor, and transpose traffic, with saturation-throughput annotations of 12 % and 7 %]

Related work
• QoS support in IP or multiprocessor networks
  • Fair Queueing [SIGCOMM ’89], Virtual Clock [SIGCOMM ’90]
  • Multi-rate channel switching [IEEE Comm ’86]
  • Source throttling [HPCA ’01]
  • Age-based arbitration [IEEE TPDS ’92, SC ’07]
  • Rotating Combined Queueing (RCQ) [ISCA ’96]
  → expensive, inflexible, and/or without guaranteed QoS
• QoS on-chip networks
  • AEthereal (strict TDM; explicit channel setup) [IEEE Design & Test ’05]
  • SonicsMX (per-thread queues at each node) [DATE ’05]
  • MANGO clockless NoC (partitioning GS and BE VCs) [DATE ’05]
  • Nostrum (routes fixed at design time) [DATE ’04]

Conclusion
The GSF network is
• guaranteed QoS-capable
  • with minimum bandwidth guarantees and maximum delay
• flexible
  • fair and differentiated bandwidth allocation
  • no explicit channel setup required along the path
• robust
  • <5 % throughput degradation on average (12 % in the worst case) for four traffic patterns in 8x8 mesh network
  • fairness vs overall throughput tradeoff with frame size
• simple
  • no per-flow queues/structures in on-chip routers → scalable
  • relatively small modifications to a conventional VC router
