
A Case for Globally Shared-Medium On-Chip Interconnect

Enhancing Effective Throughput for Transmission Line-Based Bus

Aaron Carpenter, Jianyun Hu, Jie Xu, Michael Huang, Hui Wu

University of Rochester

Motivation: e.g. 5x5 mesh

Worst case: 4 + 4 = 8 hops

Per hop = pipeline delay + queue delay

Example: 5 + 10 = 15 clock cycles/hop

Worst case: 15 × 8 = 120 clock cycles

At a 1 GHz clock = 120 ns

Much slower than DRAM access
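
The arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative only, and the sketch assumes a minimal corner-to-corner route through a k x k mesh; the per-hop numbers are the example values from the slide.

```python
# Worst-case latency across a k x k mesh, using the slide's example numbers.
# Function names are hypothetical; they are not taken from the presentation.

def worst_case_hops(k: int) -> int:
    """Hop count of a minimal corner-to-corner route in a k x k mesh."""
    return 2 * (k - 1)

def worst_case_latency_ns(k: int, pipeline_cycles: int, queue_cycles: int,
                          clock_ghz: float) -> float:
    """Worst-case latency: hops * (pipeline + queue) cycles, converted to ns."""
    cycles = worst_case_hops(k) * (pipeline_cycles + queue_cycles)
    return cycles / clock_ghz  # cycles divided by cycles-per-nanosecond

hops = worst_case_hops(5)
latency = worst_case_latency_ns(5, pipeline_cycles=5, queue_cycles=10, clock_ghz=1.0)
print(hops, latency)
```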

  • Non-uniform cache access (NUCA) delays create problems.
  • Significant existing research has aimed to reduce unnecessary remote accesses by trying to map data closer to the threads that frequently access it.
  • Transmission-line circuit technology allows data rates of ≥ 26 Gb/s, i.e. about 0.04 ns per bit.
  • Latency across the chip is ~2 ns.
  • The authors claim a significant power reduction because there are no power costs at intermediate routers (and queues).
Their Proposed Architecture
  • Use transmission lines (TLs) to create a shared bus:
    • Two-level network: first-level connects 2-4 nodes per hub.
    • Shared bus connects all hubs.
    • Within a hub, can connect nodes via e.g. crossbar.
    • Centralized arbitration to control bus access.

Serpentine routing through every hub

  • When a message is to be transferred from node i to node j:

1. A setup step is performed to “wake up” transmitter i.

2. In the background, the arbiter passes the grant on to node j.

3. Time is needed to drain the signal (waiting until the last bit has been transmitted).

4. The arbiter can then process the next request.
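
The four steps can be sketched as a simple timeline for a single shared bus. This is a minimal model under stated assumptions, not the authors' arbiter: the `Request` type, the `schedule` function, the 1 ns setup time and the 64-bit packet size are all hypothetical, while the 0.04 ns bit time and ~2 ns drain time reuse the figures quoted earlier in the deck. Step 2 is assumed to overlap with the transfer and so costs no bus time here.

```python
from dataclasses import dataclass

@dataclass
class Request:
    src: int   # transmitting node i
    dst: int   # receiving node j
    bits: int  # packet size in bits (hypothetical)

def schedule(requests, bit_time_ns, setup_ns, drain_ns):
    """Serve requests in order on one shared bus.

    Returns a list of (src, dst, start_ns, end_ns) tuples.
    """
    now = 0.0
    timeline = []
    for r in requests:
        start = now
        now += setup_ns              # step 1: wake up the transmitter
        now += r.bits * bit_time_ns  # serialize the packet onto the line
        now += drain_ns              # step 3: wait for the last bit to drain
        timeline.append((r.src, r.dst, start, now))  # step 4: next request
    return timeline

timeline = schedule([Request(0, 5, 64), Request(3, 1, 64)],
                    bit_time_ns=0.04, setup_ns=1.0, drain_ns=2.0)
for entry in timeline:
    print(entry)
```

In this model the second transfer cannot start until the first has fully drained, which is the source of the under-utilization that the later slides try to reduce.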

Implementation problems
  • Where to put the arbiter?
  • How to account for the communication delay of getting requests from the nodes to the arbiter and grants back?
  • What is the overhead of routing request/grant lines between the arbiter and the nodes?

Put arbiter in the middle?
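
A rough way to see why a central position helps is to compare the worst-case request-plus-grant wire delay for different arbiter locations. The sketch below is a one-dimensional simplification with made-up hub positions and a made-up wire delay per millimetre; none of these values come from the presentation.

```python
def worst_round_trip_ns(hub_positions_mm, arbiter_mm, ns_per_mm):
    """Worst request + grant wire delay over all hubs (1-D layout assumed)."""
    return max(2 * abs(p - arbiter_mm) * ns_per_mm for p in hub_positions_mm)

hubs = [0, 5, 10, 15, 20]  # hypothetical hub positions along one axis, in mm
at_end = worst_round_trip_ns(hubs, arbiter_mm=0, ns_per_mm=0.1)
at_middle = worst_round_trip_ns(hubs, arbiter_mm=10, ns_per_mm=0.1)
print(at_end, at_middle)
```

Under these assumptions, placing the arbiter in the middle halves the worst-case round trip compared with placing it at one end.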

Outline of Remaining Talk
  • Transmission Line
    • Transmission line medium
    • Transceiver circuitry
  • Node structure
  • Bus Architecture
  • Arbitration
  • Interface Circuit Design
Transmission Line
  • Transmission line medium
    • Microstrips: simple; in isolation, each line can support a high data rate (> 20 Gb/s), but crosstalk from neighboring lines requires very large spacing
    • Coplanar waveguides: use a grounded strip in between the signal lines, which means significant spacing between signal lines
    • Coplanar strips: use the more noise-tolerant differential signaling on a pair of lines
Transceiver Circuitry
  • Digital systems
  • Analog receivers: allow more attenuation and thus higher rates than digital systems
  • Analog transmitters: can be used together with more sophisticated encoding schemes
In their design:
  • Coplanar strips, as they utilize the space of the top metal layer more efficiently
  • Basic differential transmitters and receivers
  • A data rate of 26.4 Gb/s can be achieved for a pair of transmission lines with a total pitch (including spacing) of 45 μm
  • Within 2.5 mm of space, this pitch allows 55 pairs to be laid out, for about 1.45 Tb/s of total bandwidth
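
The pair count and aggregate bandwidth follow directly from the pitch and the per-pair data rate. The sketch below just redoes that arithmetic; the function names are illustrative, and the inputs are the figures from this slide.

```python
def pairs_that_fit(width_um: int, pitch_um: int) -> int:
    """Number of whole line pairs that fit in the given width."""
    return width_um // pitch_um

def aggregate_tbps(pairs: int, rate_gbps: float) -> float:
    """Total bandwidth in Tb/s for `pairs` pairs at `rate_gbps` each."""
    return pairs * rate_gbps / 1000.0

pairs = pairs_that_fit(2500, 45)      # 2.5 mm of space, 45 um pitch
total = aggregate_tbps(pairs, 26.4)   # 26.4 Gb/s per pair
print(pairs, round(total, 3))
```

With these inputs the result is 55 pairs and roughly 1.45 Tb/s.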
Node structure
  • The assumption is that a chip consists of tiles, each with a core, an L1 cache, and a slice of a globally shared L2 (last-level) cache.
  • If an L1 miss occurs and the address maps to a remote node, the access results in a packet injected into the interconnect.
  • Otherwise, the L1 miss is served by the local L2 bank.
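
The local-versus-remote decision on an L1 miss can be sketched as follows. The address-to-node mapping used here (interleaving 64-byte cache lines across nodes) is purely an assumption for illustration; the slides do not say how addresses map to L2 slices, and the function names are hypothetical.

```python
def home_node(address: int, num_nodes: int, line_bytes: int = 64) -> int:
    """Hypothetical mapping: interleave cache lines across the L2 slices."""
    return (address // line_bytes) % num_nodes

def handle_l1_miss(address: int, local_node: int, num_nodes: int):
    """Decide whether an L1 miss stays local or becomes a network packet."""
    home = home_node(address, num_nodes)
    if home == local_node:
        return ("local_l2", home)       # served by the local L2 bank
    return ("inject_packet", home)      # packet injected into the interconnect

print(handle_l1_miss(0x00, local_node=0, num_nodes=16))
print(handle_l1_miss(0x40, local_node=0, num_nodes=16))
```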
Node structure
  • Clustering groups a small number of cores and L2 slices into a node
  • The backbone network only makes a stop at every node
  • An intra-node fabric connects the multiple L1 caches and L2 cache banks in the node
  • Clustering adds extra latency for accesses from an L1 cache to the nearest L2 bank (Figure 4-b, Core0 to L20)
  • but makes accessing neighboring cache banks within the node (Figure 4-b, Core1 to L20) faster
  • It also reduces the number of hubs a long-distance packet needs to traverse
  • The extra cost of a larger intra-node fabric offsets the savings due to a lower number of hubs for the inter-node fabric
Bus Architecture
  • Each node uses a high-speed communication circuit to deliver packets
  • Our bus is merely a shared medium that allows point-to-point communication
Partitioning the bus
  • To increase throughput, use a wide bus
  • Have multiple buses for different types of packets
  • Bundling: for better utilization of the bus bandwidth, send multiple packets for each bus arbitration
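
The effect of bundling on bandwidth can be illustrated with a simple amortization model. It assumes that each bus arbitration carries a fixed overhead during which no data is sent; the function name and the 2 ns values are hypothetical and are not measurements from the presentation.

```python
def bus_utilization(packets_per_grant: int, packet_ns: float,
                    overhead_ns: float) -> float:
    """Fraction of bus time spent sending data when each grant carries
    `packets_per_grant` packets and costs a fixed per-grant overhead."""
    busy = packets_per_grant * packet_ns
    return busy / (busy + overhead_ns)

for n in (1, 2, 4, 8):
    print(n, round(bus_utilization(n, packet_ns=2.0, overhead_ns=2.0), 3))
```

This model only captures the bandwidth side. It ignores the extra time other senders wait while a long bundle holds the bus.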
Interface Circuit Design
  • Components: a transmitter, a receiver, a serializer (SER), a deserializer (DES), and a phase and data recovery circuit (PDR).
  • The transmitter (Tx) and receiver (Rx) are both implemented in standard CMOS technology without any special RF devices such as inductors.
  • At 26.4 Gb/s, synchronization between the received data and the local clock is needed.
Increasing Effective Bus Throughput
  • There are many ways to increase the throughput of a bus at the circuit or architecture level. The proposed techniques can be categorized into three groups:

1. Increasing raw link throughput.

2. Increasing the utilization efficiency.

3. Optimization on the use of buses.

Increasing raw link throughput
  • The potential link throughput is high because the inherent channel bandwidth of the transmission line is quite high.
  • There are many coding methods to increase the raw throughput.
Increasing raw link throughput

First, we turn to 4-PAM, which doubles the data rate compared to OOK. The additional circuitry consists of a DAC for the transmitter and an ADC for the receiver. Because these elements increase energy and latency, we use 4-PAM only for the data packet bus to minimize the latency impact.
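
The doubling comes from carrying two bits per symbol instead of one. The sketch below encodes the same bit string both ways. The Gray-coded level assignment in `GRAY_4PAM` is a common convention chosen here for illustration, not something the slides specify, and the function names are hypothetical.

```python
def ook_encode(bits):
    """On-off keying: one symbol per bit (0 = off, 1 = on)."""
    return list(bits)

# Assumed Gray-coded mapping of bit pairs to four amplitude levels.
GRAY_4PAM = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def pam4_encode(bits):
    """4-PAM: one of four amplitude levels per pair of bits."""
    if len(bits) % 2:
        raise ValueError("4-PAM needs an even number of bits")
    return [GRAY_4PAM[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

bits = [0, 0, 0, 1, 1, 1, 1, 0]
print(len(ook_encode(bits)), ook_encode(bits))
print(len(pam4_encode(bits)), pam4_encode(bits))
```

At the same symbol rate, the 4-PAM stream needs half as many symbol periods as the OOK stream for the same payload.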

Increasing raw link throughput

Then we use Frequency Division Multiplexing (FDM), which allows us to use higher frequency bands. The attenuation in these bands increases with frequency and can be high, so on a global bus the higher bands become lossy. The higher-frequency channels are therefore intended for shorter-distance communication rather than for long transmission lines.

Increasing raw link throughput

FDM also needs circuit support: a mixer on both the transmitter and receiver sides and a filter at the receiver end.

It is challenging to estimate the power cost of this support circuitry. We use a simplified analysis to estimate the minimum power cost of supporting frequency-division, multi-band transmission.

Increasing the Utilization Efficiency
  • The underlying global transmission lines support a high data rate, but using them to shuttle short packets can cause under-utilization:

1. Long lines mean that a signal takes a long time to drain from the transmission line.

2. Packets destined for near neighbors are a poor match for the global line structure.

  • A number of techniques can address these issues, including:

Partitioning, wave-based arbitration, segmentation

  • It is straightforward to partition the same number of underlying links into more, narrower buses. Longer serialization reduces the waste due to draining.
  • With partitioning, the finer granularity also allows better load balancing between the two types of buses.
  • For example, we can partition the five 1-flit-wide buses into any combination of meta buses and data buses. In this paper, we use a fixed configuration that achieves the best average performance.
  • We can also improve the bus's spatial utilization in order to increase efficiency.
  • This is achieved by dividing the transmission line into a few segments. If a node is communicating with another node within the same segment, it only needs to arbitrate for this segment.
  • When communication crosses multiple segments, the transmitter needs to obtain permission for all of them. The connected segments then act as a single transmission line.
  • The segments can be connected in two ways:

1. A pass gate is a passive, bi-directional connection. It adds a little attenuation and signal distortion, but at an acceptable level.

2. Two separate uni-directional amplifiers. The cost of this approach is the power consumed by the amplifiers. But with these amplifiers, the source transmitter power can be lower, since a signal travels at most the length of one segment before being amplified.
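
The segment arbitration rule can be sketched as a small helper that lists which segments a transfer must acquire, and checks whether two transfers can proceed at the same time. It assumes nodes are numbered in order along the serpentine line and that segments hold equal numbers of nodes; both assumptions and all function names are hypothetical.

```python
def segment_of(node: int, nodes_per_segment: int) -> int:
    """Segment index of a node, assuming nodes are numbered along the line."""
    return node // nodes_per_segment

def segments_needed(src: int, dst: int, nodes_per_segment: int):
    """All segments a transfer from src to dst must obtain permission for."""
    a, b = sorted((segment_of(src, nodes_per_segment),
                   segment_of(dst, nodes_per_segment)))
    return list(range(a, b + 1))

def can_overlap(t1, t2, nodes_per_segment: int) -> bool:
    """Two transfers can run concurrently if they share no segment."""
    s1 = set(segments_needed(t1[0], t1[1], nodes_per_segment))
    s2 = set(segments_needed(t2[0], t2[1], nodes_per_segment))
    return s1.isdisjoint(s2)

# 16 nodes split into 2 segments of 8 nodes each (illustrative sizes).
print(segments_needed(0, 3, 8))            # stays within one segment
print(segments_needed(2, 12, 8))           # crosses both segments
print(can_overlap((0, 3), (9, 14), 8))     # different segments
print(can_overlap((0, 3), (2, 12), 8))     # both need segment 0
```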

Optimization on the use of buses
  • Invalidation acknowledgement omission:

With a packet-switched network, protocols rely on explicit invalidation acknowledgements to signal completion.

The explicit acknowledgement can be avoided if the interconnect offers the capability to infer delivery.

  • Limited multicasting:

A transmission line can allow multicast operation. It is easy to support a small number of receivers operating at once while keeping attenuation acceptable. Even though multicasting may not reduce traffic dramatically, it cuts latency and queuing delay.

Interaction between techniques
  • These three groups of techniques focus on different sources of performance gain. But within each group, there is a varying degree of overlap.
  • In general, implementing one technique reduces the potential of another. So when multiple techniques are applied, we reach diminishing returns.
  • Example: when we send a pulse train on the bus, we wait until it propagates beyond the ends before allowing another pulse train. Since the propagation delay is significant compared with the pulse train, the duty cycle is low. The techniques for increasing utilization efficiency all try to improve this duty cycle, just in different ways.
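
The duty-cycle argument can be made concrete with a small calculation. The sketch assumes the bus is busy for the length of the pulse train and then idle for one propagation delay; the 26.4 Gb/s rate and ~2 ns cross-chip delay are the figures quoted elsewhere in the deck, while the packet sizes and the function name are hypothetical.

```python
def duty_cycle(packet_bits: int, bit_rate_gbps: float,
               propagation_ns: float) -> float:
    """Fraction of time the line carries new data, assuming one pulse train
    is sent and then the line is left to drain for one propagation delay."""
    pulse_train_ns = packet_bits / bit_rate_gbps  # bits / (bits per ns)
    return pulse_train_ns / (pulse_train_ns + propagation_ns)

short = duty_cycle(16, bit_rate_gbps=26.4, propagation_ns=2.0)
longer = duty_cycle(64, bit_rate_gbps=26.4, propagation_ns=2.0)
print(round(short, 3), round(longer, 3))
```

A longer pulse train per grant raises the duty cycle, which is why techniques such as narrower buses (longer serialization) and bundling compete for the same gain.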
Experimental Setup
  • Transmission Line Links

A total pitch of 45 μm and a line width of 10 μm.

The transmission lines are of a serpentine shape and measure about 7.5 cm in total length.

  • Traffic and Performance Analysis

The L1 miss rate of these applications ranges up to 61 misses per thousand instructions (MPKI).

[Figure: (a) Percentage of L2 accesses that are remote. (b) Speedup due to clustering: the left bar is for 1 core per node, the right bar is for 2 cores per node. The baseline in this case is a 16-core mesh.]

Performance comparison with mesh
  • On average, the TLL bus achieves a speedup of 1.15x over the mesh in the 16-node configuration and 1.17x in the 8-node configuration.
  • The TLL bus reduces network energy by about 26x compared with the mesh.
The Impact of Bundling
  • The turn-around time also wastes bus bandwidth and can be mitigated with bundling.
  • Too much bundling can be detrimental to performance as well.
Scaling Up: Performance Compared with Mesh
  • We conduct a limited scalability test with a 64-core system organized into 2- or 4-core nodes (32 nodes, 2 cores each; and 16 nodes, 4 cores each).
  • On average, the TLL bus performs 16% and 25% better than the mesh for the 32- and 16-node systems, respectively.
Scaling Up: Performance Compared with Idealized Circuit
  • The bus system achieves 67% and 72% of the idealized performance (using digital wires) for the 32- and 16-node systems, respectively.
  • In a 16-core, 8-node system, the bus can achieve 91% of the ideal's performance.
  • Mainstream chip multiprocessors are unlikely to require an extreme amount of bandwidth for on-chip backbone communication.
  • Only a small number of nodes will be connected by the packet-based backbone interconnect, and the traffic on this fabric can be rather limited.
  • Experiments show that in a medium-scale 16-core system, this design achieves 91% of the performance of an idealized wire-based interconnect.
  • An important benefit of avoiding packet switching and relaying is the inherent energy efficiency of the communication system.