Run-time Adaptive on-chip Communication Scheme

Run-time Adaptive on-chip Communication Scheme 林孟諭 Dept. of Electrical Engineering National Cheng Kung University Tainan, Taiwan, R.O.C

Outline • Abstract • Introduction • Definitions • Algorithm • Motivation Case Study • Hardware Implementation • Conclusion

Abstract • Workloads and/or constraints that vary during run-time require embedded systems to be run-time adaptive, so that they stay highly efficient in every operation mode/scenario. • We present the first approach to an adaptive on-chip communication scheme. • It provides an adaptive routing/path allocation algorithm to meet a required level of Quality of Service (QoS), namely guaranteed bandwidth.

Introduction(1/2) • We propose a run-time adaptive network on chip that adapts the underlying interconnection infrastructure on-demand in response to changing communication requirements imposed by an application. • To provide on-demand interconnections, we present a novel adaptive routing/path allocation algorithm that meets QoS requirements (bandwidth).

Introduction(2/2) • The scheme makes decisions locally at each router depending on the available bandwidth in each direction to the neighboring router. • Dynamic connections are realized by re-assigning a certain number of buffer blocks to different output ports of a router on-demand. • It also increases the resource utilization, especially buffer utilization, through on-demand buffer block configuration.

Definitions(1/6) • Definition 1: An application task graph (TG) is a directed graph G_k = (T, F). • T is the set of all tasks t_i used by an application. • f_{i,j} ∈ F represents the connection from task t_i to t_j.

Definitions(1/6) • Definition 2: The Physical Network (PN) is a directed graph P = (N, V, B_t, r). • N is a set of tiles n_i. • v_{i,j} ∈ V represents an edge, the physical channel between n_i and n_j. • Each tile has a current buffer configuration at time t; b_{i,t} ∈ B_t represents the state of the buffer assignment to individual output ports. • r is a routing function which determines the paths taken.

Definitions(2/6) • Definition 3: The Logical Network (LN) at time t is a directed graph L_t = (M, W). • M is a set of task groups m_i. • w_{i,j} ∈ W represents the set of connections between two task groups m_i and m_j. • A task group m_i is a set of tasks scheduled to be executed on a particular PE. • LN covers the subset of the task graph set G that is running at a specific time t. • Definition 4: The Task Mapping Function is a function l_t: T' ⊆ T → L_t which maps a subset T' of the tasks T of each task graph to the logical network LN.

Definitions(3/6) • Definition 5: The Network Mapping Function is a function p_t: L_t → S ⊆ P which maps a logical network onto a subset of the physical network. • Definition 6: A Routing Function r : N × N → V, r : (n_i, n_j) → v_{i,j} returns a path v_{i,j} away from the current PE (n_i), given the input port for each transaction and the destination n_j. An example is the path v along which Gauss2 forwards to Filter2.

Definitions(4/6) • Definition 7: • The Buffer Configuration b_{i,t} is the current buffer configuration of tile n_i ∈ N. • A Virtual Channel (VC) is a unidirectional logical or virtual connection between the tiles n_i and n_j. • Each VC is realized by an independently managed pair of message buffers referred to as the Virtual Channel Buffer (VCB).

Definitions(5/6) [Figure: Task Graph → Task Mapping Function (Definition 4) → Logical Network → Network Mapping Function (Definition 5) → Physical Network (PEs with their buffers), with the Routing Function (Definition 6)]

Definitions(6/6) • Definition 8: The System Monitor M is an infrastructure which is used to collect, aggregate, and process system statistics. • Definition 9: Our Adaptive Network on Chip AdNoC is defined as the tuple AdNoC = (P, M, L_t, G_i, p_t, l_t, r) with the parameters as given above. • P = Physical Network (Definition 2) • M = System Monitor (Definition 8) • L_t = Logical Network (Definition 3) • G_i = Task Graph (Definition 1) • p_t = Network Mapping Function (Definition 5) • l_t = Task Mapping Function (Definition 4) • r = Routing Function (Definition 6)
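The two mapping steps of Definitions 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`TaskGraph`, `LogicalNetwork`, `map_tasks`, `map_network`) are hypothetical, the tile coordinates are made up, and the choice to drop connections inside one task group is an assumption, not something the definitions state. The four connections are the IPL connections E to H used in the case study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskGraph:
    """Definition 1: G = (T, F)."""
    tasks: frozenset
    connections: frozenset   # directed pairs (t_i, t_j)

@dataclass(frozen=True)
class LogicalNetwork:
    """Definition 3: L_t = (M, W)."""
    task_groups: frozenset   # each group is a frozenset of tasks sharing one PE
    connections: frozenset   # directed pairs (m_i, m_j)

def map_tasks(graph, group_of):
    """Definition 4 (l_t): map tasks to task groups and lift their connections."""
    groups = frozenset(group_of[t] for t in graph.tasks)
    links = frozenset(
        (group_of[a], group_of[b])
        for a, b in graph.connections
        if group_of[a] != group_of[b]   # assumption: traffic inside one PE needs no link
    )
    return LogicalNetwork(groups, links)

def map_network(logical, tile_of):
    """Definition 5 (p_t): place every task group on a tile of the physical network."""
    return {group: tile_of[group] for group in logical.task_groups}

# IPL fragment with the connections E, F, G, H from the case study.
ipl = TaskGraph(
    tasks=frozenset({"Gauss1", "Gauss2", "Filter1", "Filter2"}),
    connections=frozenset({("Gauss1", "Filter1"), ("Gauss1", "Filter2"),
                           ("Gauss2", "Filter1"), ("Gauss2", "Filter2")}),
)
group_of = {t: frozenset({t}) for t in ipl.tasks}   # assumption: one task per PE
logical = map_tasks(ipl, group_of)
tile_of = {                                          # hypothetical tile coordinates
    group_of["Gauss1"]: (1, 0), group_of["Gauss2"]: (0, 0),
    group_of["Filter1"]: (2, 0), group_of["Filter2"]: (2, 1),
}
placement = map_network(logical, tile_of)
```

Keeping l_t and p_t as separate functions mirrors the AdNoC tuple, where a re-mapping changes the placement without touching the task graph itself.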

Algorithm (1/12) • To provide a bandwidth guarantee in an adaptive NoC, the underlying communication infrastructure needs to provide an adaptive path allocation strategy. • Finding a path/routing for a given logical network and physical mapping of the application is therefore a major challenge. The run-time path allocation algorithm is given in Alg. 1.

Algorithm(2/12) Algorithm 1: Runtime/On-demand Path Selection Algorithm
upon receiving data at runtime do
    if destination = processor port then
        route ⇐ processor port
    else
        if flit type ≠ head or connection in look-up table from same source port then   {non-header flits are always in the look-up table}
            route ⇐ look-up table
        else   {get route}
            route ⇐ do weighted XY route allocation   // Alg. 2
            if route found then   {assign buffer to route}
                do runtime buffer assignment for found route   // Alg. 3
            end if
            if no route found or buffer assignment unsuccessful then
                collect router status information
                send information to higher level
            end if
        end if
        if flit type = tail and keep-alive not requested then   {free buffer}
            remove buffer from buffer table
            remove connection from look-up table
        end if
    end if
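A minimal Python sketch of the control flow of Alg. 1 may make the branch structure easier to follow. It is not the authors' implementation: `Flit`, `Router`, `handle_flit` and all field names are hypothetical, and the route allocation (Alg. 2) and buffer assignment (Alg. 3) are passed in as plain callables. As a simplification, the look-up table entry is written here only after a buffer has been assigned, whereas Alg. 2 stores it itself.

```python
from dataclasses import dataclass, field

@dataclass
class Flit:
    conn: str                 # connection identifier (hypothetical field)
    kind: str                 # "head", "body" or "tail"
    dest: tuple               # destination tile (x, y)
    src_port: str             # input port the flit arrived on
    keep_alive: bool = False

@dataclass
class Router:
    pos: tuple
    lookup: dict = field(default_factory=dict)    # conn -> (source port, route)
    buffers: dict = field(default_factory=dict)   # conn -> buffer id
    reports: list = field(default_factory=list)   # status sent to the higher level

def handle_flit(router, flit, allocate_route, assign_buffer):
    """Return the output port for a flit, or None if it cannot be routed."""
    if flit.dest == router.pos:
        return "P"                                 # deliver to the processor port
    entry = router.lookup.get(flit.conn)
    if flit.kind != "head" or (entry is not None and entry[0] == flit.src_port):
        route = entry[1] if entry is not None else None
    else:
        route = allocate_route(router, flit)       # stand-in for Alg. 2
        buf = assign_buffer(router, route) if route is not None else None  # Alg. 3
        if route is None or buf is None:
            # Collect status and hand it to the higher level.
            router.reports.append(("no route or buffer", flit.conn))
            route = None
        else:
            router.lookup[flit.conn] = (flit.src_port, route)
            router.buffers[flit.conn] = buf
    if flit.kind == "tail" and not flit.keep_alive:
        router.buffers.pop(flit.conn, None)        # free the buffer
        router.lookup.pop(flit.conn, None)         # drop the connection
    return route
```

Passing the two sub-algorithms as arguments keeps the sketch self-contained and makes the failure path (no route, or no buffer) easy to exercise with stub functions.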

Algorithm(3/12) Algorithm 2: Weighted XY Route Allocation
1: upon receiving a connection and destination do
2:   if connection in look-up table from different source port then   {look for potential loops}
3:     loopRoute ⇐ output port of other connection
4:   end if
5:   for all output ports p_i do   {initialize all weights to zero}
6:     w_i ⇐ 0
7:   end for
8:   dx = |destination x − current x|   // dx and dy serve as coefficients in the weight calculation below
9:   dy = |destination y − current y|

Algorithm(4/12)
10:   for all p_i with available bandwidth > required bandwidth and loopRoute ≠ p_i do
11:     if p_i points toward destination x then   {East and West output ports}
12:       w_i ⇐ available bandwidth of p_i × dx + total link bandwidth
13:     else if p_i points away from destination x then
14:       w_i ⇐ available bandwidth
15:     end if
16:     if p_i points toward destination y then   {North and South output ports}
17:       w_i ⇐ available bandwidth of p_i × dy + total link bandwidth
18:     else if p_i points away from destination y then
19:       w_i ⇐ available bandwidth
20:     end if
21:   end for   {route toward the port having highest weight}
22:   route = p_i with max w_i   {save the route in the look-up table}
23:   look-up table ⇐ connection = route
24:   return route
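The weight computation of Alg. 2 can be sketched as a small Python function. The function name `weighted_xy`, its parameters, and the coordinate orientation (x grows toward East, y toward North) are assumptions made for illustration. A port on an axis where the router is already aligned with the destination is treated as pointing away, which is one possible reading of lines 13 and 18. Storing the route in the look-up table (line 23) is left to the caller.

```python
def weighted_xy(cur, dest, avail, required, link_bw, loop_route=None):
    """Pick an output port by weight.

    cur, dest : (x, y) coordinates of the current and the destination tile
    avail     : dict mapping a port ("N", "E", "S", "W") to its available bandwidth
    required  : bandwidth required by the connection
    link_bw   : total link bandwidth
    Returns (route, weights); route is None if no port qualifies.
    """
    dx = abs(dest[0] - cur[0])
    dy = abs(dest[1] - cur[1])

    # Ports that point toward the destination (assumed orientation, see lead-in).
    toward = set()
    if dest[0] > cur[0]:
        toward.add("E")
    if dest[0] < cur[0]:
        toward.add("W")
    if dest[1] > cur[1]:
        toward.add("N")
    if dest[1] < cur[1]:
        toward.add("S")
    distance = {"E": dx, "W": dx, "N": dy, "S": dy}

    weights = {port: 0 for port in avail}            # lines 5-7
    for port, bw in avail.items():                   # line 10
        if bw <= required or port == loop_route:
            continue
        if port in toward:                           # lines 11-12 and 16-17
            weights[port] = bw * distance[port] + link_bw
        else:                                        # lines 13-14 and 18-19
            weights[port] = bw

    best = max(weights, key=weights.get, default=None)   # line 22
    if best is None or weights[best] == 0:
        return None, weights
    return best, weights
```

Adding the total link bandwidth to every port that points toward the destination means such a port always outweighs a port that points away, since the latter can reach at most the link bandwidth.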

Algorithm(5/12) • For a requesting transaction, the path is checked in every possible direction and the VCB is assigned accordingly on-demand. • The weighted XY algorithm wXY presented in Alg. 2 assigns each output port a weight based on the available bandwidth and on dx or dy between the current and the destination nodes. • This ideally gives the packet a maximum number of sensible routing choices along its path. The weight is also proportional to the available bandwidth.

Algorithm(6/12) • The wXY route allocation strategy is described as follows: given is the tuple ρ = {N, E, S, W, P}. • Each i ∈ ρ has a weight w_i and an available bandwidth b_i with b_i ≤ b_max, b_max being the maximum line bandwidth.

Algorithm(7/12) • The current router coordinates are x, y. Each packet p has destination coordinates x_d, y_d and a required bandwidth b_p. The weights are assigned as follows:
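One way to write the weight assignment with the symbols defined above is given below. It is a reconstruction read off the pseudocode of Alg. 2 and may differ in form from the formula on the slide. The zero case follows lines 5 to 7 and 10 of Alg. 2 (ports with too little bandwidth or on a loop route keep their initial weight), and the processor port P is handled separately by Alg. 1.

```latex
w_i =
\begin{cases}
  b_i \cdot |x_d - x| + b_{max} & i \in \{E, W\},\ i \text{ points toward } x_d,\ b_i > b_p \\
  b_i \cdot |y_d - y| + b_{max} & i \in \{N, S\},\ i \text{ points toward } y_d,\ b_i > b_p \\
  b_i                           & i \text{ points away from the destination},\ b_i > b_p \\
  0                             & \text{otherwise}
\end{cases}
```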

Algorithm(8/12) • The route r chosen is then the output port with the maximum weight w_i (Alg. 2, line 22). • The router distributes the VCBs to any route as needed by assigning them to the corresponding output port.

Algorithm(9/12) Algorithm 3: On-demand Buffer Assignment
upon receiving a connection and direction do   // a connection and its direction are received
    search for next free buffer b_free ∈ buffer pool B and not in buffer table   // look for an available buffer
    if b_free found then   {assign available buffer to current direction}
        current buffer b_curr ⇐ b_free   // assign the available buffer where it is needed
        buffer table ⇐ b_curr → output port   // the table also records which port the buffer points to
        return b_curr
    else
        return no buffer available
    end if
• Our scheme to assign buffers on-demand is given in Alg. 3. • The benefit of such on-demand assignment is evident: buffers are only allocated when needed, meaning that virtual channels can be reused by different ports.
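Alg. 3 amounts to a first-free search over a shared buffer pool. The following Python sketch shows that idea under simple assumptions: buffers are identified by integers, the buffer table is a dict from buffer id to output port, and `release_buffer` is a hypothetical helper corresponding to the "remove buffer from buffer table" step of Alg. 1.

```python
def assign_buffer(pool, table, port):
    """Assign the next free buffer from the pool to the given output port.

    pool  : iterable of buffer ids (the buffer pool B)
    table : dict mapping an assigned buffer id to its output port
    Returns the assigned buffer id, or None if no buffer is available.
    """
    for buf in pool:
        if buf not in table:        # free means: not yet in the buffer table
            table[buf] = port       # record which port the buffer now serves
            return buf
    return None

def release_buffer(table, buf):
    """Return a buffer to the pool by removing it from the buffer table."""
    table.pop(buf, None)

# A pool of two buffers shared by all output ports.
pool = [0, 1]
table = {}
first = assign_buffer(pool, table, "N")
second = assign_buffer(pool, table, "E")
third = assign_buffer(pool, table, "S")    # pool exhausted
release_buffer(table, first)
reused = assign_buffer(pool, table, "S")   # the freed buffer now serves another port
```

The last two calls illustrate the reuse mentioned above: a buffer released by one port can immediately back a connection on a different port.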

Algorithm(10/12) • Fig. 3 shows an exemplary scenario to showcase the run-time behavior using different transactions in one router.

Algorithm(11/12) • t0: All four directions are occupied with four different transactions; buffers are also assigned. • t1: Transaction T5 requests a path, and the weights are calculated until t_δ, taking 4 hardware cycles. A buffer is also assigned to the calculated direction before t_δ. • t2: Transactions T1, T2, and T4 free their corresponding channels and assigned buffers.

Algorithm(12/12) • t3: Four new transactions T1, T2, T4, and T6 request processing and they are granted resources. • t4: Transaction T7 requests a path and a buffer, but because no buffer resources are available, the request cannot be granted. The requesting transaction therefore has to wait or inform the upper layer through the system monitor.

Motivation Case Study (1/9) • We motivate the need for an adaptive NoC by means of a very simple scenario. We study an MPEG decoder [1] and an Image Processing Line (IPL) [18] application. • The task graphs are shown in Figures 1a and 1b. • Assume that at time t0 the NoC is running the MPEG video decoder (Fig. 1c). • At time t1, the IPL needs to be executed, so it is also mapped onto the processing elements alongside the MPEG decoder. Once a mapping is performed, the routers attempt to set up meaningful routes (Fig. 1d).

Motivation Case Study (2/9) Fig. 1. Motivation to use an adaptive communication architecture

Motivation Case Study (3/9) [Figure only]

Motivation Case Study (4/9) (Fig. 1d) Find Conn. F (Gauss1 to Filter2), following Alg. 2: • First, compare the candidate output ports: ↑ (Gauss1 to MC): 100% × 1 + 100% (lines 16–17); ← (Gauss1 to Gauss2): 100% (lines 13–14). The path Gauss1 to MC has the higher weight, so it is chosen over the path Gauss1 to Gauss2. • Then → (MC to Filter2): 100% × 1 + 100% (lines 11–12). • The figure also marks the original route for F, which failed.

Motivation Case Study (5/9) (Fig. 1e) Find Conn. G, following Alg. 2: • There are two choices, G1 and G2. • With G1: → (Gauss2 to Gauss1): 100% × 2 + 100% (lines 11–12), then → (Gauss1 to Filter1): X. It failed. • With G2: ↑ (Gauss2 to VLD): 100% × 1 + 100% (lines 16–17), then → (VLD to MC): 100% × 2 + 100% (lines 11–12), then → (MC to Filter2): X. It failed, too. • Because neither choice succeeds, a re-mapping is needed to find an available route.

Motivation Case Study (6/9) Result after re-mapping: • Find Conn. E: → (Ga.1 to Fi.1): 100% × 1 + 100%; ↑ (Ga.1 to VLD): 100%. • Find Conn. F: ↑ (Ga.1 to VLD): 100% × 1 + 100%; → (VLD to MC): 80% × 2 + 100%; → (MC to Fi.2): 80% × 1 + 100%. • Find Conn. G: ← (Ga.2 to Fi.1): 100% × 1 + 100%. • Find Conn. H: ↑ (Ga.2 to Fi.2): 100% × 1 + 100%. • Finally, the required buffers are allocated via Alg. 3.
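The weight expressions in the case study can be evaluated with a few lines of Python. The helper `port_weight` is hypothetical and simply applies the two weight rules of Alg. 2, with all bandwidths expressed in percent of the link bandwidth.

```python
def port_weight(avail_pct, distance, link_pct=100, toward=True):
    """Weight of one output port, in percent of the link bandwidth.

    toward=True  : port points toward the destination (Alg. 2, lines 12 and 17)
    toward=False : port points away from it (Alg. 2, lines 14 and 19)
    """
    return avail_pct * distance + link_pct if toward else avail_pct

# Conn. F at Gauss1 (Fig. 1d): up to MC versus left to Gauss2.
up_to_mc = port_weight(100, 1)                       # 100% x 1 + 100%
left_to_gauss2 = port_weight(100, 1, toward=False)   # 100%
chosen = "up" if up_to_mc > left_to_gauss2 else "left"

# Conn. F after re-mapping: links with 80% available bandwidth.
vld_to_mc = port_weight(80, 2)                       # 80% x 2 + 100%
mc_to_filter2 = port_weight(80, 1)                   # 80% x 1 + 100%

print(up_to_mc, left_to_gauss2, chosen)   # 200 100 up
print(vld_to_mc, mc_to_filter2)           # 260 180
```

The first comparison reproduces the decision on slide 4/9: the upward port wins with 200% against 100% for the port pointing away from the destination.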

Motivation Case Study (7/9) In this example: (Fig. 1d) • Conn. E: The task Gauss1 first establishes a route to its neighboring filter task Filter1. • Conn. F: Then, it uses a deterministic XY routing algorithm for Filter2. However, that fails due to the limited bandwidth availability.

Motivation Case Study (8/9) (Fig. 1e) • Conn. F: The failure forces the router at Gauss1 to try another route using Alg. 1. Following Alg. 3, the routers supply a corresponding buffer block, allocating the buffer to output ports on-demand. • Conn. G & H: The second Gauss task, Gauss2, attempts the same action, but it fails. (Fig. 1f) • Conn. G & H: It therefore becomes necessary to invoke a re-mapping, after which a path with enough bandwidth is found.

Motivation Case Study (9/9) • If a path and buffer blocks are not available, the mapping function sends appropriate feedback to the upper layer. • Therefore, a dynamic run-time application scenario needs an adaptive on-chip communication infrastructure which can build connections on-demand to provide QoS.

Hardware Implementation • Our hardware platform for the AdNoC is illustrated in Fig. 4. • It consists mainly of two parts: • the run-time path allocation part • the on-demand VCB assignment part. • Depending on the type of the flit, the path allocation part decides either from the look-up table or by calculating the weights (Alg. 1 and Alg. 2).

Conclusion • We have introduced the first approach of an adaptive on-chip communication architecture. It provides an adaptive path allocation algorithm to meet varying bandwidth guarantees. • Run-time connections are realized by re-assigning a number of buffer blocks on-demand. • Our buffer allocation scheme increases the buffer utilization and decreases the overall buffer use.
