Interconnection Networks
Reading
  • Appendix E
Goals
  • Coverage of basic concepts for high-performance multiprocessor interconnection networks
    • Primarily link- and data-layer communication protocols
  • Understand established and emerging microarchitecture concepts and implementations
Technology Trends

[Figure: technology trend data; example system: BlackWidow]

Source: W. J. Dally, “Enabling Technology for On-Chip Networks,” NOCS-1, May 2007

Where is the Demand?

[Figure: demand drivers at each level; labels: cables, connectors, transceivers; throughput; performance; cost; area, power; latency, power]
Blurring the Boundary
  • Use heterogeneous multi-core chips for embedded devices
    • IBM Cell → gaming
    • Intel IXP → network processors
    • Graphics processors → NVIDIA
  • Use large numbers of multicore processors to build supercomputers
    • IBM Cell
    • Intel Tera-ops processor
  • Interconnection networks are central all across the spectrum!
Cray XT3
  • 3D Torus interconnect
  • HyperTransport + Proprietary link/switch technology
  • 45.6 GB/s switching capacity per switch
Blue Gene

[Photo: Blue Gene/L system]
From http://i.n.com.com/i/ne/p/photo/BlueGeneL_03_550x366.jpg
Intel TeraOp Die
  • 2D Mesh
  • Really a test chip
  • Aggressive speed – multi-GHz links?
IBM Cell

[Figure: Cell die photo highlighting the on-chip network]
On-Chip Networks
  • Why are they different?
    • Abundant bandwidth
    • Power
    • Wire length distribution
  • Different functions
    • Operand networks
    • Cache memory
    • Message passing
Some Industry Standards
  • On-Chip
    • Open Core Protocol
      • Really an interface standard
    • AXI
  • PCI Express
  • AMD HyperTransport (HT)
  • Intel QuickPath Interconnect (QPI)
  • InfiniBand
  • Ethernet

The on-chip standards above are not traditional switched networks; PCI Express, HT, QPI, InfiniBand, and Ethernet are traditional switched networks.
Basic Concepts
  • Link Level
  • Switch level
  • Topology
  • Routing
  • End-to-End latency
Messaging Hierarchy

  • Routing Layer
    • Where?: destination decisions, i.e., which output port
    • Largely responsible for deadlock and livelock properties
  • Switching Layer
    • When?: when is data forwarded
  • Physical Layer
    • How?: synchronization of data transfer

  • Lossless transmission (unlike Ethernet)
  • Performance rather than interoperability is key
System View

[Figure: node architecture: Processor, Northbridge, PCIe, Southbridge, NI; the path through the bridges to the NI is a high-latency region]

From http://www.psc.edu/publications/tech_reports/PDIO/CrayXT3-ScalableArchitecture.jpg

Messaging Units

  • Data is transmitted based on a hierarchical data structuring mechanism (segmentation sketched below)
    • Messages → packets → flits → phits
    • While flits and phits are fixed size, packets and data may be variable sized

[Figure: a message is divided into packets; each packet carries header fields (type, destination info, sequence #, misc.) and is divided into flits (flow control digits) bracketed by a head flit and a tail flit; flits are divided into phits (physical flow control digits)]

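A minimal Python sketch of this segmentation. The sizes (MAX_PAYLOAD, FLIT_BYTES) and field names are illustrative assumptions, not values fixed by the slides:

```python
# Minimal sketch of the message -> packet -> flit hierarchy.
# Sizes below are assumptions for illustration.
MAX_PAYLOAD = 256   # bytes of data per packet (assumed)
FLIT_BYTES  = 16    # fixed flit size in bytes (assumed)

def packetize(message: bytes, dest: int):
    """Split a variable-sized message into packets, each with a small
    header (type, destination info, sequence number)."""
    packets = []
    for seq, off in enumerate(range(0, len(message), MAX_PAYLOAD)):
        header = {"type": "data", "dest": dest, "seq": seq}
        packets.append((header, message[off:off + MAX_PAYLOAD]))
    return packets

def to_flits(header, payload):
    """Divide a packet into fixed-size flits: a head flit carrying the
    routing info, body flits, and a tail flit closing the packet."""
    body = [payload[i:i + FLIT_BYTES]
            for i in range(0, len(payload), FLIT_BYTES)]
    return [("head", header)] + [("body", b) for b in body] + [("tail", None)]

pkts = packetize(bytes(600), dest=3)   # 3 packets: 256 + 256 + 88 bytes
flits = to_flits(*pkts[0])             # head flit + 16 body flits + tail flit
```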
Link Level Flow Control
  • Unit of synchronized communication
    • Smallest unit whose transfer is requested by the sender and acknowledged by the receiver
    • No restriction on the relative timing of control vs. data transfers
  • Flow control occurs at two levels
    • Level of buffer management (flits/packets)
    • Level of physical transfers (phits)
    • Relationship between flits and phits is machine & technology specific
  • Briefly: Bufferless flow control
Physical Channel Flow Control

  • Synchronous flow control
    • How is buffer space availability indicated?
  • Asynchronous flow control
    • What is the limiting factor on link throughput?
Flow Control Mechanisms
  • Credit Based flow control
  • On/off flow control
  • Optimistic Flow control (also Ack/Nack)
  • Virtual Channel Flow Control
Message Flow Control
  • Basic Network Structure and Functions
    • Credit-based flow control

[Figure: sender and receiver connected by a pipelined link; the sender sends packets whenever its credit counter is not zero. Here the receiver queue is not serviced (X), so the counter runs down and injection stalls]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich
Message Flow Control (cont.)
  • Basic Network Structure and Functions
    • Credit-based flow control (simulated in the sketch below)

[Figure: the receiver services its queue and sends credits back as buffers become available (+5); the sender's credit counter is replenished and it resumes injection]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

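A minimal simulation of the scheme in the two figures, assuming the 10-buffer receiver of the example; the class and method names are hypothetical:

```python
from collections import deque

class CreditLink:
    """Sketch of credit-based link-level flow control: the sender may
    inject only while its credit counter is nonzero; the receiver
    returns one credit per buffer it frees."""
    def __init__(self, buffers=10):
        self.credits = buffers          # sender-side credit counter
        self.rx_queue = deque()         # receiver-side flit buffers

    def send(self, flit) -> bool:
        if self.credits == 0:           # no downstream buffer: stall
            return False
        self.credits -= 1
        self.rx_queue.append(flit)      # flit occupies a receiver buffer
        return True

    def drain(self, n=1):
        """Receiver services n flits and returns n credits upstream."""
        freed = 0
        while self.rx_queue and freed < n:
            self.rx_queue.popleft()
            freed += 1
        self.credits += freed           # credits flow back to the sender

link = CreditLink()
sent = sum(link.send(i) for i in range(12))   # only 10 get through
link.drain(5)                                 # +5 credits: injection resumes
```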
Timeline*

[Figure: flit and credit exchange between Node 1 and Node 2; each returned credit must travel upstream and be processed before the corresponding buffer can be reused. The round-trip credit time, expressed in flow control buffer units, is t_rt]

*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004
Performance of Credit Based Schemes
  • The control bandwidth can be reduced by submitting block credits
  • Buffers must be sized to maximize link utilization
    • Large enough to host packets in transit
    • To keep a link of bandwidth b busy, the number of flit buffers F must cover the credit round trip: F ≥ t_rt · b / L_f, where L_f is the flit size (worked example below)*

*From W. J. Dally & B. Towles, “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004

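A worked instance of the bound; the link parameters below are assumptions chosen for illustration:

```python
# Worked instance of the credit buffer-sizing bound F >= t_rt * b / L_f
# (Dally & Towles). All numbers below are assumed, not from the slides.
import math

b    = 16e9    # link bandwidth: 16 GB/s (assumed)
t_rt = 25e-9   # credit round-trip time: 25 ns (assumed)
L_f  = 16      # flit size in bytes (assumed)

min_flit_buffers = math.ceil(t_rt * b / L_f)
print(min_flit_buffers)   # 25 buffers keep the link fully utilized
```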
Optimistic Flow Control
  • Transmit flits when available
    • De-allocate when reception is acknowledged
    • Re-transmit if flit is dropped (and negative ack received)
  • Issues
    • Inefficient buffer usage
      • Messages held at source
      • Re-ordering may be required due to out of order reception
  • Also known as Ack/Nack flow control
Virtual Channel Flow Control
  • Physical channels are idle when messages block
  • Virtual channels decouple physical channel from the message
  • Flow control (buffer allocation) is between corresponding virtual channels
Virtual Channels
  • Each virtual channel is a pair of unidirectional channels
    • Independently managed buffers multiplexed over the physical channel
  • Improves performance through reduction of blocking delay
  • Important in realizing deadlock freedom (later)
Virtual Channel Flow Control (cont.)

  • As the number of virtual channels increases, the increased channel multiplexing has multiple effects (more later)
    • Overall performance
    • Router complexity and critical path
  • Flits must now record VC information (see the encoding sketch below)
    • Or send VC information out of band

[Figure: packet and flit formats; each flit carries a type field and a VC identifier]

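A minimal sketch of in-band VC tagging, assuming 2-bit type and VC fields (the slides do not fix the field widths):

```python
# Sketch of in-band VC tagging: each flit carries a type field and a
# virtual channel id so the receiver can steer it into the right
# per-VC buffer. Field widths are assumptions.
HEAD, BODY, TAIL = 0, 1, 2

def encode_flit(ftype: int, vc: int, payload: int) -> int:
    assert 0 <= ftype < 4 and 0 <= vc < 4        # 2-bit type, 2-bit VC
    return (ftype << 30) | (vc << 28) | (payload & (1 << 28) - 1)

def decode_flit(flit: int):
    return flit >> 30, (flit >> 28) & 0x3, flit & (1 << 28) - 1

# Receiver side: demultiplex flits into independently managed per-VC buffers.
vc_buffers = {vc: [] for vc in range(4)}
ftype, vc, payload = decode_flit(encode_flit(HEAD, vc=2, payload=0xABC))
vc_buffers[vc].append((ftype, payload))
```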
Flow Control: Global View
  • Flow control parameters are tuned based on link length, link width, and processing overhead at the end-points
  • Effective FC and buffer management is necessary for high link utilization → network throughput
    • In-band vs. out-of-band flow control
  • Links may be non-uniform, e.g., lengths/widths on chips
    • Buffer sizing for long links
  • Latency: overlapping FC, buffer management, and switching → impacts end-to-end latency
  • Some research issues
    • Reliability
    • QoS
    • Congestion management
Switching
  • Determines “when” message units are moved
  • Relationship with flow control has a major impact on performance
    • For example, overlapping switching and flow control
    • Overlapping routing and flow control
  • Tightly coupled with buffer management
Circuit Switching

  • Hardware path setup by a routing header or probe
  • End-to-end acknowledgment initiates transfer at full hardware bandwidth
  • System is limited by signaling rate along the circuits
  • Routing, arbitration and switching overhead experienced once per message

[Timing diagram: header probe, acknowledgment, then data; per-hop routing delay t_r and switch delay t_s accumulate over t_setup, followed by t_data, with links held busy for the duration]
Circuit Switching (cont.)
  • Switching
    • Circuit switching

[Figure: source and destination end nodes; switches hold buffers for “request” tokens]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching
  • Switching
    • Circuit switching

[Figure: the request for circuit establishment propagates from source to destination; routing and arbitration are performed during this step]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching (cont.)
  • Switching
    • Circuit switching

[Figure: acknowledgment and circuit establishment; as the “ack” token travels back to the source, the connections are established]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching (cont.)
  • Switching
    • Circuit switching

[Figure: packet transport over the established circuit; neither routing nor arbitration is required]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching (cont.)
  • Switching
    • Circuit switching

[Figure: a second request (X) is blocked while the circuit is held: high contention, low utilization → low throughput]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Packet Switching (Store & Forward)

  • Finer grained sharing of the link bandwidth
  • Routing, arbitration, switching overheads experienced for each packet
  • Increased storage requirements at the nodes
  • Packetization and in-order delivery requirements
  • Alternative buffering schemes
    • Use of local processor memory
    • Central (to the switch) queues

[Timing diagram: message header and message data; each hop stores the full packet, paying t_r plus t_packet per hop, with links busy accordingly]
Routing, Arbitration, and Switching
  • Switching
    • Store-and-forward switching

[Figure: source and destination end nodes with buffers for data packets; packets are completely stored before any portion is forwarded]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching (cont.)
  • Switching
    • Store-and-forward switching

[Figure: store, then forward; packets are completely stored before any portion is forwarded. Requirement: buffers must be sized to hold an entire packet (MTU)]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Virtual Cut-Through

  • Messages cut through to the next router when feasible
  • In the absence of blocking, messages are pipelined
    • Pipeline cycle time is the larger of intra-router and inter-router flow control delays
  • When the header is blocked, the complete message is buffered at a switch
  • High load behavior approaches that of packet switching

[Timing diagram: the packet header cuts through the router; per-hop wire delay t_w, routing delay t_r, and switch delay t_s precede the cut-through, with t_blocking added when the header must wait; links are held busy as flits stream]

Routing, Arbitration, and Switching
  • Switching
    • Cut-through switching

[Figure: routing at each switch; portions of a packet may be forwarded (“cut through”) to the next switch before the entire packet is stored at the current switch]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Wormhole Switching
  • Messages are pipelined, but buffer space is on the order of a few flits
  • Small buffers + message pipelining → small, compact switches/routers
  • Supports variable sized messages
  • Messages cannot be interleaved over a channel: routing information is only associated with the header
  • Base latency is equivalent to that of virtual cut-through

[Timing diagram: header flit followed by single flits; per-hop t_r and t_s sum to t_wormhole, with links busy flit by flit]

Routing, Arbitration, and Switching
  • Switching
    • Virtual cut-through: buffers must be sized to hold an entire packet (MTU)
    • Wormhole: buffers hold flits, so packets can be larger than the buffers

[Figure: source and destination end nodes under each scheme]

“Virtual Cut-Through: A New Computer Communication Switching Technique,” P. Kermani and L. Kleinrock, Computer Networks, 3, pp. 267–286, January 1979.

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Routing, Arbitration, and Switching (cont.)
  • Switching
    • Virtual cut-through: on blocking, the packet is completely stored at a switch, and only that link is held busy
    • Wormhole: on blocking, the packet is stored along the path, holding several links busy

[Figure: busy links under each scheme when a packet blocks]

“Virtual Cut-Through: A New Computer Communication Switching Technique,” P. Kermani and L. Kleinrock, Computer Networks, 3, pp. 267–286, January 1979.

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Comparison of Switching Techniques
  • Packet switching and virtual cut-through
    • Consume network bandwidth proportional to network load
    • Predictable demands
    • VCT behaves like wormhole at low loads and like packet switching at high loads
    • Link level error control for packet switching
  • Wormhole switching
    • Provides low (unloaded) latency
    • Lower saturation point
    • Higher variance of message latency than packet or VCT switching (base-latency models sketched below)
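To make the comparison concrete, here are the standard no-contention base latency models in simplified form (after Dally & Towles and Duato et al.); the parameter values are assumptions for illustration:

```python
# No-contention base latency sketches for the switching techniques.
#   H : hops, t_r : per-hop routing/switching delay (s),
#   L : packet length (bits), b : link bandwidth (bits/s)

def t_store_and_forward(H, t_r, L, b):
    # The whole packet is received before it is forwarded at each hop.
    return H * (t_r + L / b)

def t_cut_through(H, t_r, L, b):
    # VCT and wormhole pipeline the packet: only the header pays the
    # per-hop delay; the body streams behind it.
    return H * t_r + L / b

def t_circuit(H, t_r, t_s, L, b):
    # Probe out plus acknowledgment back set up the circuit (round
    # trip), then data streams at full bandwidth.
    return 2 * H * (t_r + t_s) + L / b

H, t_r, t_s, L, b = 8, 20e-9, 5e-9, 4096, 16e9   # assumed values
print(t_store_and_forward(H, t_r, L, b))  # ~2.21 us
print(t_cut_through(H, t_r, L, b))        # ~0.42 us
print(t_circuit(H, t_r, t_s, L, b))       # ~0.66 us
```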
The Network Model

[Figure: end-to-end message path from registers, ALU, and memory through the NI into the network]

Metrics (for now): latency and bandwidth
Routing, switching, flow control, error control
Pipelined Switch Microarchitecture

[Figure: per physical channel, link control and a DEMUX feed per-VC input buffers; routing computation, VC allocation, and switch allocation logic drive a crossbar, whose outputs MUX into output buffers and output link control]

Five pipeline stages (modeled in the sketch below):
  • Stage 1: IB (input buffering)
  • Stage 2: RC (routing computation)
  • Stage 3: VCA (VC allocation)
  • Stage 4: SA (switch allocation)
  • Stage 5: ST (switch traversal) & output buffering

L. S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” Proc. of the 7th Int’l Symposium on High Performance Computer Architecture, Monterrey, January, 2001

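A minimal model of the ideal (stall-free) pipeline timing, assuming one stage per cycle:

```python
# Canonical 5-stage router pipeline from the figure. Head flits visit
# every stage; body and tail flits inherit the route and VC chosen by
# the head, so they skip RC and VCA. Arbitration stalls are omitted.

STAGES = {"head": ["IB", "RC", "VCA", "SA", "ST"],
          "body": ["IB", "SA", "ST"],
          "tail": ["IB", "SA", "ST"]}

def router_latency(n_flits: int) -> int:
    """Cycles for one packet to clear a single router, assuming one
    stage per cycle and no stalls: the head flit pays the full
    pipeline depth, then the remaining flits drain one per cycle."""
    return len(STAGES["head"]) + (n_flits - 1)

print(router_latency(5))   # 9 cycles for a 5-flit packet, ideally
```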
Classification
  • Shared medium networks
    • Example: backplane buses
  • Direct networks
    • Example: k-ary n-cubes, meshes, and trees
  • Indirect networks
    • Example: multistage interconnection networks
Direct Networks
  • Buses do not scale, electrically or in bandwidth
  • Full connectivity is too expensive (not the same as crossbars)
  • Network built on point-to-point transfers

[Figure: a node: processor and memory attached to a router through injection and ejection channels]
System View

[Figure: Cray XT3 node: processor and memory attach directly to the router through injection and ejection channels (performance critical), in contrast to the commodity path Processor, NB, PCIe, SB, NI, which is a high-latency region]

From http://www.psc.edu/publications/tech_reports/PDIO/CrayXT3-ScalableArchitecture.jpg
Common Topologies

[Figure: binary n-cube, n-dimensional mesh, and torus (3-ary 2-cube)]
Blocking vs. Non-blocking Networks
  • Consider the permutation behavior
    • Model the input-output requests as permutations of the source addresses

[Figure: an 8-port non-blocking topology next to a blocking topology in which two requests conflict (X) on a shared internal link]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Crossbar Network

[Figure: a 16 × 16 crossbar connecting inputs 0–15 to outputs 0–15]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Non-Blocking Clos Network

[Figure: a 16-port, 3-stage Clos network connecting inputs 0–15 to outputs 0–15]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Clos Network Properties
  • General 3-stage non-blocking network (see the condition check below)
    • Originally conceived for telephone networks
  • Recursive decomposition
    • Produces the Benes network with 2x2 switches
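The classic sizing rule behind "non-blocking" here is Clos's 1953 condition; the slide does not state it, so this checker is supplementary background:

```python
# Clos(m, n, r): r ingress switches with n input ports each, m middle
# switches, r egress switches. Clos (1953): strict-sense non-blocking
# iff m >= 2n - 1; rearrangeably non-blocking iff m >= n.

def clos_class(m: int, n: int) -> str:
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"

print(clos_class(m=7, n=4))   # strict-sense non-blocking
print(clos_class(m=4, n=4))   # rearrangeable, the property the
                              # recursive Benes construction preserves
```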
Clos Network: Recursive Decomposition

[Figure: 16-port, 5-stage Clos network obtained by decomposing the middle-stage switches, inputs 0–15 to outputs 0–15]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Clos Network: Recursive Decomposition (cont.)

[Figure: 16-port, 7-stage Clos network built entirely from 2x2 switches = Benes topology]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Path Diversity

[Figure: alternative paths from 0 to 1 in the 16-port, 7-stage Clos (Benes) network]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Path Diversity (cont.)

[Figure: alternative paths from 4 to 0 in the 16-port, 7-stage Clos (Benes) network]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Path Diversity (cont.)

[Figure: contention-free paths from 0 to 1 and from 4 to 1 in the 16-port, 7-stage Clos (Benes) network]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Moving to Fat Trees

[Figure: 16-node folded network with the bisection marked; folded Clos = folded Benes = fat tree network]

  • Nodes at tree leaves
  • Switches at tree vertices
  • Total link bandwidth is constant across all tree levels, with full bisection bandwidth
  • Equivalent to folded Benes topology
  • Preferred topology in many system area networks (a routing sketch follows)

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

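A common fat-tree routing scheme sends a packet up to the nearest common ancestor of source and destination, then down. A minimal sketch assuming a binary fat tree (the slides do not fix the arity):

```python
# Up/down routing in a binary fat tree (arity 2 assumed): a packet
# climbs to the nearest common ancestor of the source and destination
# leaves, then descends. Leaves are numbered 0..N-1.

def nca_level(src: int, dst: int) -> int:
    """Tree level of the nearest common ancestor of two leaves."""
    level = 0
    while src != dst:
        src //= 2       # move both addresses up one level
        dst //= 2
        level += 1
    return level

def route(src: int, dst: int):
    up = nca_level(src, dst)
    return ["up"] * up + ["down"] * up   # hops up, then hops down

print(route(0, 1))   # ['up', 'down']: siblings meet one level up
print(route(4, 0))   # 3 hops up, 3 hops down, echoing the path
                     # diversity examples above
```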
Fat Trees: Another View

[Figure: fat tree drawn with forward and backward links at each level]

  • Equivalent to the preceding multistage implementation
  • Common topology in many supercomputer installations
Relevance of Fat Trees

[Figure: 16-leaf fat tree, nodes 0–15]

  • Popular interconnect for commodity supercomputing
  • Active research problems
    • Efficient, low latency routing
    • Fault tolerant routing
    • Scalable protocols (coherence)

Routing in MINs
Routing Algorithms
  • Deterministic Routing
  • Oblivious Routing
  • Source Routing
  • Table-based routing
  • Adaptive Routing
  • Unicast vs. multicast
Deterministic Routing Algorithms

[Figure: deterministic routing examples on a binary hypercube and a mesh]

  • Strictly increasing or decreasing order of dimension (see the e-cube sketch below)
  • Acyclic channel dependencies

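A sketch of dimension-order (e-cube) routing on a binary n-cube, the standard instance of the "strictly increasing order of dimension" rule:

```python
# E-cube (dimension-order) routing on a binary n-cube: correct the
# address bits in strictly increasing dimension order. Because a
# packet never returns to a lower dimension, the channel dependency
# graph is acyclic, so the routing is deadlock-free.

def ecube_route(src: int, dst: int, n: int):
    """Return the sequence of nodes visited from src to dst."""
    path, cur = [src], src
    for dim in range(n):                 # dimensions 0..n-1, in order
        if (cur ^ dst) & (1 << dim):     # address differs in this dim
            cur ^= 1 << dim              # traverse the dim-th link
            path.append(cur)
    return path

print(ecube_route(0b000, 0b101, n=3))   # [0, 1, 5]: dim 0, then dim 2
```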
Overview: The Problem
  • Guaranteeing message delivery
    • Fundamentally an issue of finite resources (buffers) and the manner in which they are allocated
      • How are distributed structural hazards handled?
  • Waiting for resources can lead to
    • Deadlock: a configuration of messages that cannot make progress
    • Livelock: messages that never reach their destination
    • Starvation: messages that never receive the resources they request
      • Even though the requested resource does become available
Deadlocked Configuration of Messages
  • Routing in a 2D mesh
  • What can we infer from this figure?
    • Routing
    • Wait-for relationships
Example of Deadlock Freedom

[Figure: deterministic routing on a binary hypercube and a mesh, as before]

  • Strictly increasing or decreasing order of dimension
  • Acyclic channel dependencies
Channel Dependencies

[Figure: header and data flits crossing switches; direct dependencies under SAF & VCT vs. the dependency under wormhole]

  • When a packet holds a channel and requests another channel, there is a direct dependency between them
  • Channel dependency graph D = G(C,E)
  • For deterministic routing: a single dependency at each node
  • For adaptive routing: all requested channels produce dependencies, and the dependency graph may contain cycles (cycle check sketched below)

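For deterministic routing, deadlock freedom reduces to checking that D = G(C,E) is acyclic (Dally & Seitz). A small cycle-detection sketch over an assumed adjacency-dict representation of the graph:

```python
# Deadlock-freedom check for a deterministic routing function via its
# channel dependency graph: the routing is deadlock-free iff the
# graph is acyclic. deps maps each channel to the channels it can
# directly depend on (representation assumed for illustration).

def has_cycle(deps: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            if color.get(nxt) == GRAY:                 # back edge: cycle
                return True
            if color.get(nxt) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
print(has_cycle(ring))   # True: a unidirectional ring can deadlock
```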
Breaking Cycles in Rings/Tori

[Figure: a 4-node unidirectional ring n0–n3 with channels c0–c3, and the same ring with each channel split into virtual channels c0i and c1i]

  • The configuration to the left can deadlock
  • Add (virtual) channels
    • We can make the channel dependency graph acyclic via routing restrictions (via the routing function)
  • Routing function: at node i, a packet destined for node j uses c0i when j < i and c1i when j > i
Breaking Cycles in Rings/Tori (cont.)

[Figure: the resulting acyclic channel dependency graph over the c0i and c1i channels]

  • Channels c00 and c13 are unused
  • The routing function breaks the cycles in the channel dependency graph (see the sketch below)

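The restriction in code, for the 4-node unidirectional ring above; channel names follow the figure, and the helper names are hypothetical:

```python
# The slide's routing restriction with two virtual channels per link:
# at node i, a packet for node j takes c1_i when j > i and c0_i when
# j < i. A packet crossing the wraparound switches from the high to
# the low class, so the dependency chain c01 -> c02 -> c03 -> c10 ->
# c11 -> c12 has no cycle, and c00 and c13 are never used.

N = 4

def next_channel(i: int, j: int) -> str:
    assert i != j
    return f"c1{i}" if j > i else f"c0{i}"

def path(src: int, dst: int):
    hops, i = [], src
    while i != dst:
        hops.append(next_channel(i, dst))
        i = (i + 1) % N              # unidirectional ring hop
    return hops

print(path(2, 1))   # ['c02', 'c03', 'c10']: low class, then high
                    # class after passing the wraparound
```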
Summary
  • Multiprocessor and multicore interconnection networks are different from traditional inter-processor communication networks
    • Unique microarchitectures
    • Distinct concepts
  • Network-on-chip critical to multicore architectures
Glossary
  • Adaptive routing
  • Arbitration
  • Bisection bandwidth
  • Blocking networks
  • Channel dependency
  • Circuit switching
  • Clos networks
  • Credit-based flow control
  • Crossbar network
  • Deadlock and deadlock freedom
  • Deterministic routing
  • Direct networks
  • Fat trees
  • Flit
  • Flow control
  • Hypercubes
  • Indirect networks
  • K-ary n-cubes
  • Multistage networks
  • Non-blocking networks
  • Oblivious routing
  • Packet
  • Packet switching
  • Phit
  • Router computation
  • Routing
  • Routing header
  • Routing payload
  • Switch allocation
  • Virtual channel allocation (in a switch)
  • Virtual channel flow control
  • Virtual channels
  • Virtual cut-through switching
  • Wormhole switching