
Multiprocessor Interconnection Networks
Todd C. Mowry
CS 495
October 30, 2002

  • Topics

    • Network design issues

    • Network Topology

    • Performance


Networks

  • How do we move data between processors?

  • Design Options:

    • Topology

    • Routing

    • Physical implementation


Evaluation Criteria

• Latency

• Bisection Bandwidth

• Contention and hot-spot behavior

• Partitionability

• Cost and scalability

• Fault tolerance


Communication Perf: Latency

  • Time(n)_{s-d} = overhead + routing delay + channel occupancy + contention delay

  • channel occupancy = (n + n_e) / b, where n is the message size, n_e the envelope (header) size, and b the link bandwidth (see the sketch after this list)

  • Routing delay?

  • Contention?
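
A minimal sketch (in Python, with illustrative parameter names and units) of this latency model; the cut_through flag anticipates the store-and-forward vs. cut-through comparison later in the deck:

```python
# Unloaded latency model from the slide: overhead + routing delay +
# channel occupancy, with occupancy = (n + n_e) / b.  Names and units
# (bytes, cycles) are illustrative assumptions.
def comm_latency(n, n_e, b, hops, overhead, delta, cut_through=True):
    """Source-to-destination time for an n-byte message, ignoring contention."""
    occupancy = (n + n_e) / b              # payload + envelope over link bandwidth
    if cut_through:
        # pipelined through the switches: occupancy is paid once,
        # each of the `hops` switches adds only its routing delay
        return overhead + hops * delta + occupancy
    # store-and-forward: the whole packet is buffered at every switch
    return overhead + hops * (delta + occupancy)

print(comm_latency(n=128, n_e=16, b=2, hops=5, overhead=10, delta=2))   # -> 92.0 cycles
```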


Link Design/Engineering Space

  • Cable of one or more wires/fibers with connectors at the ends attached to switches or interfaces

  • Short: single logical value on the wire at a time

  • Long: stream of logical values on the wire at a time

  • Narrow: control, data, and timing multiplexed on the same wires

  • Wide: control, data, and timing on separate wires

  • Synchronous: source and destination on the same clock

  • Asynchronous: source encodes its clock in the signal


Buses

  • Simple and cost-effective for small-scale multiprocessors

  • Not scalable (limited bandwidth; electrical complications)

[Figure: processors attached to a single shared bus]


Crossbars

  • Each port has link to every other port

  + Low latency and high throughput

  - Cost grows as O(N^2) so not very scalable.

  - Difficult to arbitrate and to get all data lines into and out of a centralized crossbar.

  • Used in small-scale MPs (e.g., C.mmp) and as building block for other networks (e.g., Omega).


Rings

  • Cheap: Cost is O(N).

  • Point-to-point wires and pipelining can be used to make them very fast.

  + High overall bandwidth

  - High latency O(N)

  • Examples: KSR machine, Hector


(Multidimensional) Meshes and Tori

  • O(N) switches (but switches are more complicated)

  • Latency : O(k*n) (where N=kn)

  • High bisection bandwidth

  • Good fault tolerance

  • Physical layout hard for multiple dimensions

[Figures: 2D grid and 3D cube]


Real World 2D Mesh

  • 1824 node Paragon: 16 x 114 array


Hypercubes

  • Also called binary n-cubes. # of nodes = N = 2^n.

  • Latency is O(log N); out-degree of each PE is O(log N)

  • Minimizes hops; good bisection BW; but tough to lay out in 3-space

  • Popular in early message-passing computers (e.g., Intel iPSC, NCUBE)

  • Used as direct network ==> emphasizes locality


k-ary d-cubes

  • Generalization of hypercubes (k nodes in every dimension)

  • Total # of nodes = N = k^d

  • k > 2 reduces # of channels at bisection, thus allowing for wider channels but more hops (see the addressing sketch below)
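
As a concrete illustration (a sketch, not taken from the slides), a node number in a k-ary d-cube can be read as d base-k digits, one coordinate per dimension, and the minimal hop count is the per-dimension distance summed over dimensions:

```python
def coords(node, k, d):
    """Interpret a node id as d base-k digits (one coordinate per dimension)."""
    c = []
    for _ in range(d):
        c.append(node % k)
        node //= k
    return c

def hop_distance(src, dst, k, d, torus=True):
    """Minimal hop count between two nodes of a k-ary d-cube."""
    total = 0
    for a, b in zip(coords(src, k, d), coords(dst, k, d)):
        delta = abs(a - b)
        if torus:                      # wraparound links shorten the path
            delta = min(delta, k - delta)
        total += delta
    return total

# Example: 3-ary 2-cube (3x3 network), node 0 = (0,0) to node 8 = (2,2)
print(hop_distance(0, 8, k=3, d=2))               # -> 2 with wraparound
print(hop_distance(0, 8, k=3, d=2, torus=False))  # -> 4 without
```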


Embeddings in Two Dimensions

  • Embed multiple logical dimension in one physical dimension using long wires

[Figure: 6 x 3 x 2 and 3 x 3 x 3 logical topologies embedded in two physical dimensions]


Trees

  • Cheap: Cost is O(N).

  • Latency is O(log N).

  • Easy to lay out as planar graphs (e.g., H-Trees).

  • For random permutations, the root can become a bottleneck.

  • To avoid the root becoming a bottleneck, Fat-Trees (used in the CM-5) widen links near the root.


Multistage Logarithmic Networks

  • Key Idea: have multiple layers of switches between source and destination.

  • Cost is O(N log N); latency is O(log N); throughput is O(N).

  • Generally indirect networks.

  • Many variations exist (Omega, Butterfly, Benes, ...).

  • Used in many machines: BBN Butterfly, IBM RP3, ...


Omega Network

  • All stages are the same, so a recirculating network can be used.

  • Single path from source to destination.

  • Can add extra stages and pathways to minimize collisions and increase fault tolerance.

  • Can support combining. Used in IBM RP3.
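
A sketch of destination-tag routing through an Omega network built from 2x2 switches (an assumption consistent with the single-path property above): at stage i the switch examines bit i of the destination address, most significant bit first, and takes the upper output on a 0 or the lower output on a 1. Note that the route depends only on the destination, not the source:

```python
from math import log2

def omega_route(dst, n_inputs):
    """Destination-tag route through an Omega network of 2x2 switches.

    Returns the output port ('upper'/'lower') taken at each of the
    log2(n_inputs) stages; the path is unique for a given destination.
    """
    stages = int(log2(n_inputs))
    ports = []
    for i in range(stages):
        bit = (dst >> (stages - 1 - i)) & 1   # examine destination bits, MSB first
        ports.append("lower" if bit else "upper")
    return ports

print(omega_route(dst=5, n_inputs=8))   # 5 = 101 -> ['lower', 'upper', 'lower']
```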


Butterfly Network

  • Equivalent to Omega network. Easy to see routing of messages.

  • Also very similar to hypercubes (direct vs. indirect though).

  • Clearly see that bisection of network is (N / 2) channels.

  • Can use higher-degree switches to reduce depth.


Properties of Some Topologies

Topology     | Degree   | Diameter      | Ave Dist    | Bisection | D (D ave) @ P=1024
1D Array     | 2        | N-1           | N/3         | 1         | huge
1D Ring      | 2        | N/2           | N/4         | 2         | huge
2D Mesh      | 4        | 2(N^1/2 - 1)  | 2/3 N^1/2   | N^1/2     | 63 (21)
2D Torus     | 4        | N^1/2         | 1/2 N^1/2   | 2 N^1/2   | 32 (16)
k-ary d-cube | 2d       | dk/2          | dk/4        | dk/4      | 15 (7.5) @ d=3
Hypercube    | n = logN | n             | n/2         | N/2       | 10 (5)


Real Machines

  • In general, wide links => smaller routing delay

  • Tremendous variation


How many dimensions?

  • d = 2 or d = 3

    • Short wires, easy to build

    • Many hops, low bisection bandwidth

    • Requires traffic locality

  • d >= 4

    • Harder to build, more wires, longer average length

    • Fewer hops, better bisection bandwidth

    • Can handle non-local traffic

  • k-ary d-cubes provide a consistent framework for comparison

    • N = k^d

    • scale dimension (d) or nodes per dimension (k)

    • assume cut-through


Scaling k-ary d-cubes

  • What is equal cost?

  • Equal number of nodes?

  • Equal number of pins/wires?

  • Equal bisection bandwidth?

  • Equal area?

  • Equal wire length?

  • Each assumption leads to a different optimum

  • Recall:

    • switch degree: d

    • diameter = d(k-1)

    • total links = N*d

    • pins per node = 2wd

    • bisection = k^(d-1) = N/k links in each direction

    • 2Nw/k wires cross the middle


Scaling: Latency

  • Assumes equal channel width


Average Distance with Equal Width

  • Assumes equal channel width

  • but, equal channel width is not equal cost!

  • Higher dimension => more channels

Avg. distance = d (k-1)/2
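
As a quick numeric illustration (a sketch; N = 1024 is an assumed machine size), sweeping the dimension in this average-distance formula shows how higher dimension cuts hop count before channel width is taken into account:

```python
# Average distance d*(k-1)/2 for a fixed N = k**d, sweeping the dimension d.
N = 1024
for d in (1, 2, 3, 5, 10):
    k = N ** (1.0 / d)                 # nodes per dimension (may be fractional)
    print(f"d={d:2d}  k={k:8.2f}  avg distance={d * (k - 1) / 2:8.2f}")
```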


Latency with Equal Width

  • total links(N) = Nd


Latency with Equal Pin Count

  • Baseline: d=2 has w = 32 (128 wires per node)

  • fix 2dw pins => w(d) = 64/d

  • distance up with d, but channel time down


Latency with Equal Bisection Width

  • N-node hypercube has N bisection links

  • 2D torus has 2N^(1/2)

  • Fixed bisection => w(d) = N^(1/d) / 2 = k/2

  • 1M nodes, d=2 has w = 512!
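
A sketch (in Python, with assumed parameters) comparing the channel width w a k-ary d-cube can afford at a fixed machine size under the two cost constraints from the last two slides: equal pin count (2dw held at the 128-pin baseline, so w(d) = 64/d) and equal bisection width (w = k/2):

```python
def widths(n_nodes, dims, pins_per_node=128):
    """Channel width of a k-ary d-cube under two equal-cost assumptions."""
    k = round(n_nodes ** (1.0 / dims))          # nodes per dimension (rounded)
    w_equal_pins = pins_per_node // (2 * dims)  # fix 2*d*w pins per node
    w_equal_bisection = k // 2                  # fix bisection width: w = k/2
    return k, w_equal_pins, w_equal_bisection

for d in (2, 3, 4, 5, 10):
    k, w_pins, w_bis = widths(1024, d)
    print(f"d={d:2d}  k={k:4d}  w(equal pins)={w_pins:3d}  w(equal bisection)={w_bis:3d}")
```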


Larger Routing Delay (w/ equal pin)

  • Conclusions strongly influenced by assumptions of routing delay


Latency under Contention

  • Optimal packet size? Channel utilization?


Saturation

  • Fatter links shorten queuing delays


Data Transferred per Cycle

  • Higher degree network has larger available bandwidth

    • cost?


Advantages of Low-Dimensional Nets

  • What can be built in VLSI is often wire-limited

  • LDNs are easier to lay out:

    • more uniform wiring density (easier to embed in 2-D or 3-D space)

    • mostly local connections (e.g., grids)

  • Compared with HDNs (e.g., hypercubes), LDNs have:

    • shorter wires (reduces hop latency)

    • fewer wires (increases bandwidth given constant bisection width)

      • increased channel width is the major reason why LDNs win!

  • LDNs have better hot-spot throughput

    • more pins per node than HDNs


Routing

  • Recall: routing algorithm determines

    • which of the possible paths are used as routes

    • how the route is determined

    • R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route

  • Issues:

    • Routing mechanism

      • arithmetic

      • source-based port select

      • table driven

      • general computation

    • Properties of the routes

    • Deadlock free


Store&Forward vs Cut-Through Routing

  • Store-and-forward: h(n/b + D)  vs.  cut-through: n/b + hD (see the sketch below)

  • h: hops, n: packet length, b: bandwidth, D: routing delay at each switch
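
Plugging illustrative numbers into the two expressions (the values below are assumptions, not from the slides) makes the difference concrete: store-and-forward pays the channel time at every hop, while cut-through pays it once:

```python
# Store-and-forward vs. cut-through with the slide's variables
# (illustrative values: hops, packet bytes, bytes/cycle, per-switch delay).
h, n, b, D = 5, 128, 2, 4
store_and_forward = h * (n / b + D)
cut_through       = n / b + h * D
print(store_and_forward, cut_through)   # -> 340.0 vs 84.0 cycles
```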


Routing Mechanism

  • need to select output port for each input packet

    • in a few cycles

  • Reduce relative address of each dimension in order

    • Dimension-order routing in k-ary d-cubes

    • e-cube routing in n-cube
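
A sketch of dimension-order routing in a k-ary d-cube (assuming a mesh without wraparound, with illustrative names): reduce the relative address one dimension at a time, in a fixed order, emitting one hop per step:

```python
def dimension_order_route(src, dst, k, d):
    """List of (dimension, direction) hops from src to dst, lowest dimension first.

    Assumes a k-ary d-cube without wraparound (mesh) for simplicity;
    direction is +1 or -1 along the given dimension.
    """
    hops = []
    for dim in range(d):
        s = (src // k**dim) % k        # coordinate of src in this dimension
        t = (dst // k**dim) % k        # coordinate of dst in this dimension
        step = 1 if t > s else -1
        hops.extend((dim, step) for _ in range(abs(t - s)))
    return hops

# 4-ary 2-cube: node 1 = (1,0), node 14 = (2,3); route fixes dim 0, then dim 1
print(dimension_order_route(1, 14, k=4, d=2))
# -> [(0, 1), (1, 1), (1, 1), (1, 1)]
```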


Routing Mechanism (cont)

[Figure: source-routed packet header carrying port selects P0-P3]

  • Source-based

    • message header carries series of port selects

    • used and stripped en route

    • CRC? Packet Format?

    • CS-2, Myrinet, MIT Artic

  • Table-driven

    • message header carries index for next port at next switch

      • o = R[i]

    • table also gives index for following hop

      • o, i' = R[i] (see the sketch below)

    • ATM, HPPI
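
A sketch of the table-driven scheme above: the header carries an index i, and each switch looks up (o, i') = R[i] to obtain its output port o and the index to present at the next switch. The switch names and table entries below are made up for illustration:

```python
# Per-switch routing tables: index -> (output port, index for the next switch).
# The entries are purely illustrative.
tables = {
    "sw0": {3: ("port2", 7)},
    "sw1": {7: ("port0", 1)},
    "sw2": {1: ("deliver", None)},
}

def forward(path_of_switches, first_index):
    """Follow (o, i') = R[i] hop by hop, returning the ports taken."""
    i, ports = first_index, []
    for sw in path_of_switches:
        o, i = tables[sw][i]          # the table gives the port and the next index
        ports.append(o)
    return ports

print(forward(["sw0", "sw1", "sw2"], first_index=3))
# -> ['port2', 'port0', 'deliver']
```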


Properties of Routing Algorithms

  • Deterministic

    • route determined by (source, dest), not intermediate state (i.e. traffic)

  • Adaptive

    • route influenced by traffic along the way

  • Minimal

    • only selects shortest paths

  • Deadlock free

    • no traffic pattern can lead to a situation where no packets move forward


Deadlock Freedom

  • How can it arise?

    • necessary conditions:

      • shared resource

      • incrementally allocated

      • non-preemptible

    • think of a channel as a shared resource that is acquired incrementally

      • source buffer then dest. buffer

      • channels along a route

  • How do you avoid it?

    • constrain how channel resources are allocated

    • ex: dimension order

  • How do you prove that a routing algorithm is deadlock free?


Proof Technique

  • Resources are logically associated with channels

  • Messages introduce dependences between resources as they move forward

  • Need to articulate possible dependences between channels

  • Show that there are no cycles in Channel Dependence Graph

    • find a numbering of channel resources such that every legal route follows a monotonic sequence

  • => no traffic pattern can lead to deadlock

  • Network need not be acyclic, only channel dependence graph
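
The proof obligation can also be checked mechanically on small networks: enumerate the channel-to-channel dependences the routing algorithm can create, then verify the channel dependence graph has no cycle. A sketch with a simple DFS (the example dependences are illustrative and correspond to the ring question on the next slide):

```python
def has_cycle(dep_graph):
    """DFS cycle check on a channel dependence graph {channel: [channels it can wait on]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in dep_graph}

    def visit(c):
        color[c] = GRAY
        for nxt in dep_graph.get(c, []):
            if color.get(nxt, WHITE) == GRAY:
                return True                      # back edge -> cycle -> potential deadlock
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in list(dep_graph))

# Illustrative example: channels of a 4-node unidirectional ring, where a packet
# holding channel c(i) may wait for channel c((i+1) mod 4).
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
print(has_cycle(ring))   # True: unrestricted wormhole routing on a ring can deadlock
```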


Examples

  • Why is the obvious routing on X deadlock free?

    • butterfly?

    • tree?

    • fat tree?

  • Any assumptions about routing mechanism? amount of buffering?

  • What about wormhole routing on a ring?

[Figure: 8-node ring with nodes numbered 0-7]


Flow Control

  • What do you do when push comes to shove?

    • Ethernet: collision detection and retry after delay

    • FDDI, token ring: arbitration token

    • TCP/WAN: buffer, drop, adjust rate

    • any solution must adjust to the output rate

  • Link-level flow control (see the sketch below)

Examples

  • Short Links

  • Long links

    • several messages on the wire


Link vs. Global Flow Control

  • Hot Spots

  • Global communication operations

  • Natural parallel program dependences


Case Study: Cray T3E

  • 3-dimensional torus, with 1024 switches each connected to 2 processors

  • Short, wide, synchronous links

  • Dimension order, cut-through, packet-switched routing

  • Variable sized packets, in multiples of 16 bits


Case Study: SGI Origin

  • Hypercube-like topologies with up to 256 switches

  • Each switch supports 4 processors and connects to 4 other switches

  • Long, wide links

  • Table-driven routing: programmable, allowing for flexible topologies and fault-avoidance

