Topologies - II

Overview

  • Express Cubes

  • Flattened Butterfly

  • Fat Trees

  • DragonFly


Express Cubes

  • The Problem

    • Node delay dominates wire delay

    • Pin densities are not as high as wiring densities

    • In a k-ary n-cube, n, k, and W are all related

  • Consequently, networks are node-limited rather than wire-limited

    • Unused capacity

  • Goals:

    • Design a network that can make use of the unused wiring capacity

    • Approach the wire delay lower bound


Express Links

[Figure: a linear array with an interchange box every i nodes (i = 4 shown); express channels bypass i nodes with wire delay i·Tw. The figure contrasts non-express latency with express latency.]


Balancing Node and Wire Delay

  • Reduce the node delay component of latency

  • Express channel length chosen to balance wire delay and node delay

  • For large D, latency is within a factor of 2 of the latency of a dedicated Manhattan wire

  • Pick i based on average distance and relationship between wire delay and node delay
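A minimal sketch of this balance, assuming the usual first-order model (node delay Tn per hop, wire delay Tw per unit of physical distance; the function and parameter names are mine, not from the slides):

```python
# Latency models for a 1-D slice of an express cube (illustrative sketch).

def latency_plain(D, Tn, Tw):
    """k-ary n-cube without express channels: one node per unit distance."""
    return D * (Tn + Tw)

def latency_express(D, i, Tn, Tw):
    """Interchanges every i nodes: ~D/i express hops plus leftover local hops;
    wire delay still accumulates over the full physical distance D."""
    hops = D // i + D % i
    return hops * Tn + D * Tw

# Node delay dominates when Tn >> Tw; choosing i ~ Tn/Tw balances the node
# and wire components, so latency approaches the wire bound D*Tw for large D.
for D in (16, 64, 256):
    print(D, latency_plain(D, Tn=20, Tw=1), latency_express(D, i=20, Tn=20, Tw=1))
```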



Balancing Throughput and Wiring Density

  • Exploit available wiring density to move beyond pin-out limitations

    • Consider wiring density of the substrate

  • More complex interchanges

  • Good utilization of express channels

  • Routing on express links


Balancing Throughput and Wiring Density

  • Simpler interchange box design

  • Routing in a dimension

  • Uneven traffic distribution across links


Hierarchical Express Channels

  • Messages on the express segments are dominated by node latency

    • Reduce it by making latency grow logarithmically with distance

    • Latency is then logarithmic for short distances and linear for long distances

  • Port assignment for messages at each level

  • Combine advantages of direct and indirect networks

  • Routing in three phases: ascent, cruise, descent
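A hedged sketch of the resulting hop count, under an illustrative model of my own: level-l express channels skip i**l nodes, and a message ascends to the longest useful level, cruises, then descends.

```python
def hierarchical_hops(D, i, levels):
    """Greedy hop count to cover distance D, using express levels high to low."""
    hops = 0
    for l in range(levels, -1, -1):
        span = i ** l
        hops += D // span      # cruise hops at level l
        D %= span              # remainder handled by lower levels (descent)
    return hops

# Roughly logarithmic growth for short distances, linear for long ones:
print([(D, hierarchical_hops(D, i=4, levels=3)) for D in (3, 10, 100, 1000)])
```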


Latency Behavior

  • The sawtooth pattern reflects latency jumps at interchange boundaries

  • Hierarchical channels smooth out the latency variations


Reducing Pin Out

  • Implementing interchanges with constant pin-out

    • Small latency penalty

  • Pin counts can be reduced further at the expense of a few more hops


Implementing Multiple Dimensions

[Figure: a two-dimensional grid of nodes (N) with interchange boxes along each dimension.]

  • Pinout of interchange boxes is kept constant

  • Messages have to descend to local channels to change dimensions


Summary

  • Push to create networks limited by wire delay and wire density

    • For long distances, latency approaches that of a wire

    • Increase the bisection width of the network

  • Baseline of k-ary n-cubes – efficient use of bisection width

  • Hierarchical express cubes combine logarithmic delay properties with wire efficiency and locality exploitation

  • What happens when high radix routers become feasible?


Flattened Butterfly

  • Pin bandwidth and pin-out have improved over the years

    • Low dimensional networks do not make good use of this greater I/O bandwidth

  • Implications of improved off-chip bandwidth?


Reading Assignment

  • J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

  • J. Kim, W. J. Dally, and D. Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” Proceedings of ISCA 2007


Topology Tradeoff

[Figure: combining the desirable properties of MINs and the Clos network.]

  • Retain the desirable properties of MINs

    • Logarithmic diameter

    • Cost

  • Exploit the properties of folded Clos Networks

    • Path diversity

    • Ability to exploit locality


Analysis

  • The logarithmic path lengths of MINs are offset by their lack of path diversity and inability to exploit locality

  • Need to balance cost (switches, links/cables/connectors), latency and throughput

    • Reduce the channel count by reducing diameter

    • Reduce the channel count by concentration

  • Better properties are achieved by paying for the above with increased switch radix


Another Look at Meshes

  • Trade-off wire length and hop count

    • Latency and energy impact

  • Take advantage of high radix routers

[Figure: the mesh design space trades short wires (many hops) against a short hop count (longer wires).]


Key Insight

  • Goal is to balance serialization latency and header latency

    • Meshes only reduce serialization latency

  • FB trades higher serialization latency for lower header latency via higher-radix switches
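As a rough model (standard in these papers; the decomposition and notation are my gloss): per-packet latency T ≈ H·t_r + L/b, where H is hop count, t_r the per-router (header) delay, L the packet length, and b the channel bandwidth. A mesh shortens wires but leaves H large; the flattened butterfly uses high-radix switches to shrink H while, for a fixed bisection, its narrower channels increase the serialization term L/b.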


Using Concentrators

Can be integrated as a single switch

  • Better balance of injection bandwidth and network bandwidth

    • How often do all processors want to communicate concurrently?

  • Significant reduction in wiring complexity

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007


Construction

  • Note that the inter-router connections are determined by permutations of the address digits

    • For example, in figure (a) of the cited MICRO 2007 paper


On-Chip Structure

  • Structurally similar to generalized hypercube

  • Attempts to approach the wire bound for latency

  • Latency tolerant long wires (pipelined, repeated)

  • Deeper buffers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007


Router Optimizations

[Figures: non-minimal routing from source S to destination D, and routing using bypass channels.]

Figures from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007


Connectivity

  • Each switch i in a stage is connected to the switches j, for m = 0 to k−1, obtained by replacing one radix-k digit of i's address with m


Properties

  • For a k-ary n-flat with N = k^n nodes we have N/k routers in n − 1 dimensions, with router radix k' = k + (n − 1)(k − 1) = n(k − 1) + 1

  • Built for size

    • Need high radix routers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007
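A small helper (my sketch) computing the k-ary n-flat parameters as reconstructed above from the cited ISCA 2007 paper:

```python
def flattened_butterfly(k, n):
    nodes = k ** n                   # N = k^n terminals
    routers = nodes // k             # one router per k terminals
    dimensions = n - 1               # the n stages flatten into n-1 dimensions
    radix = k + (n - 1) * (k - 1)    # k terminal ports + (n-1)(k-1) inter-router ports
    return nodes, routers, dimensions, radix

# Example: a 32-ary 2-flat has 1024 nodes, 32 routers, 1 dimension, radix 63.
print(flattened_butterfly(32, 2))
```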


Routing

  • Basic dimension order traversal

    • However, in MINs all paths have the same length: every one of the n stages is traversed

  • In FB, only necessary dimensions need be traversed (remember binary hypercubes!)

  • Number of shortest paths is factorial in the number of differing dimensions

    • In the MIN, dimension order is fixed

  • Non-minimal routing for better network-wide load balancing

    • Enables performance comparable to a folded Clos
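A hedged sketch of minimal routing in a k-ary n-flat: only the address digits in which source and destination differ must be corrected, in any order (as in a binary hypercube). Helper names are mine.

```python
from itertools import permutations
from math import factorial

def digits(x, k, n):
    """Radix-k digits of x, least significant first."""
    return [(x // k**d) % k for d in range(n)]

def minimal_routes(src, dst, k, n):
    """All shortest paths, as orderings of the dimensions to correct."""
    s, t = digits(src, k, n), digits(dst, k, n)
    diff = [d for d in range(n) if s[d] != t[d]]
    return list(permutations(diff))

routes = minimal_routes(src=5, dst=30, k=4, n=3)
assert len(routes) == factorial(3)   # factorial in the number of differing digits
```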


Comparison to Generalized Hypercubes

  • Use of router bandwidth

    • Concentration in a k-ary n-flat produces better link utilization and lower cost

    • Example: Use 1K nodes and radix 32 switches

      • GHC – (8,8,16)

      • FB – one dimension

  • Load balanced non-minimal routing in FB

[Figure: 1K-node example comparing link usage in the GHC and the flattened butterfly.]

If bandwidth is matched, serialization latency will dominate.


Comparison

  • For 1024 nodes

  • Flattened butterfly

    • Radix 32 switches and two stages

  • Folded Clos

    • Radix 64 switches and two stages

  • Binary hypercube

    • 10 dimensional hypercube

  • The trade-off analysis keeps the bisection width constant


Fat Trees

  • The seminal paper by C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985.

  • Simple engineering premise: a network topology that has the advantages of the binary tree without the problems


Reading Assignment

  • Xin Yuan et al., “Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands,” SIGMETRICS 2007, Section 2.2 (until Property 1).

  • Recommended

    C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985.


Fat Trees: Basic Idea

[Figure: a binary tree with forward (toward the root) and backward (toward the leaves) directions marked.]

  • Alleviate the bandwidth bottleneck closer to the root with additional links

  • Common topology in many supercomputer installations


Alternative Construction

[Figure: 16 end nodes (0-15) at the leaves, switches at the tree vertices.]

  • Nodes at tree leaves

  • Switches at tree vertices

    • Building crossbars with simpler switches

  • Total link bandwidth is constant across all tree levels, with full bisection bandwidth

  • This construction is also known as having constant bisection bandwidth

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


FT(4,4): Sub-trees

[Figure: FT(4,4) with end nodes 0-15, showing its recursive sub-trees.]

  • Built with constant radix (m) switches and L levels

  • Can be viewed as a less costly way of building crossbar switches

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004


FT(4,4)

[Figure: FT(4,4) with end nodes 0-15, showing the spine and one sub-tree.]

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004


Properties

  • The properties follow from multistage interconnection networks

  • Number of inputs as a function of m and L

  • Number of switches as a function of m and L
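The formulas were left on the figure; as my reconstruction for the k-ary n-tree construction above (radix-m switches, so k = m/2 up- and down-links, L levels):

```latex
N_{\mathrm{inputs}} = (m/2)^{L}, \qquad N_{\mathrm{switches}} = L\,(m/2)^{L-1}
```

For FT(4,4) this gives (4/2)^4 = 16 inputs and 4 · 2^3 = 32 switches, consistent with the 16-node figures above.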


Generalized Fat Trees

  • Networks with variable bisection bandwidth

  • Asymmetric use of switch radix between successive levels

  • Can construct expanders and concentrators

  • GFT (L,m,w)

    • L levels

    • Bandwidth ratio of m:w between levels

Reference: S. R. Ohring, M. Ibel, S. K. Das, M. J. Kumar, “On Generalized Fat Trees,” IEEE IPPS 1995



Revisiting the Clos Network

[Figure: a 16 x 16 Clos network with inputs 0-15 on the left and outputs 0-15 on the right.]


Folded Clos Network

  • Network is effectively folded into itself to produce a fat tree

  • Consequently often referred to as a folded Clos network

    • Note this is also equivalent to a bidirectional Benes network

    • Rearrangeably non-blocking when the middle stage is large enough (for a Clos(m, n, r), when m ≥ n)

[Figure: the 16-node Clos network above folded back on itself; end nodes 0-15.]


Implications

  • The connectivity of this class of fat trees is equivalent to that of a crossbar

  • Realizing the performance of a crossbar is another matter

    • Recall it is not strictly non-blocking (crossbar) but rather rearrangeable

    • Packet scheduling and congestion management are key to performance

  • Achievable performance is a function of the communication pattern


Basic Routing

  • Compare addresses of source and destination

    • The digit position of the first difference identifies the stage/level at which to turn around

    • Any path up the tree

    • Deterministic routing down the tree

  • Load balanced routing randomizes the up path
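A hedged sketch of the up/down routing just described, for a k-ary n-tree (helper names are mine): the first differing address digit, counting from the most significant side, fixes the turnaround level, up-link choices can be randomized, and the descent is deterministic.

```python
import random

def digits_msb(x, k, n):
    """Radix-k digits of x, most significant first."""
    return [(x // k**(n - 1 - d)) % k for d in range(n)]

def route(src, dst, k, n):
    s, t = digits_msb(src, k, n), digits_msb(dst, k, n)
    turn = next(d for d in range(n) if s[d] != t[d])      # turnaround stage
    up = [random.randrange(k) for _ in range(n - turn)]   # any up path works
    down = t[turn:]           # deterministic descent toward the destination
    return up, down

print(route(3, 12, k=2, n=4))  # climb to the common ancestor level, then descend
```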


Dragonfly Topology

  • How do we build networks that scale to >10^6 nodes?

    • Technology to date has prescribed low-dimensional networks

    • Target – Exascale computing systems

  • Topology challenge

    • Reconciling trade-offs between diameter, switch radix, and cost

    • What is a feasible radix to use vs. need for scalability?

    • Indirect topology


Reading Assignment

J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven highly scalable Dragonfly topology,” Proceedings of the International Symposium on Computer Architecture, 2008


Dragonfly

  • Increasing pin bandwidth has moved the design point

    • More channels and smaller diameter, rather than fewer channels with higher bandwidth per channel

    • Increases the number and length of cables

  • Network cost proportional to number of channels, especially global channels

    • Reduce the number of global channels

    • Use new signaling technology for long channels → optical signaling

  • Engineering optimizations to increase the effective switch radix


Topology Basics

  • One level hierarchy

  • Traffic limited to one global hop

  • Flexible definition of intra-group interconnection

[Figure: Dragonfly structure. Groups G0, G1, …, Gg form the inter-group network; within a group, routers 0 to a−1 form the intra-group network, each router having (a−1) local channels, h global channels, and terminals 0 to p−1.]


Properties

  • Switch radix: k = p + (a − 1) + h

  • Effective radix (group radix): k' = a(p + h)

  • Number of groups: g = ah + 1

  • Number of nodes: N = ap(ah + 1)
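A hedged helper computing Dragonfly scale from the parameters above (p terminals per router, a routers per group, h global channels per router), using the formulas as reconstructed from the cited paper:

```python
def dragonfly(p, a, h):
    radix = p + (a - 1) + h       # physical router radix k
    group_radix = a * (p + h)     # effective radix k' of a group (virtual router)
    groups = a * h + 1            # maximum group count at one global hop
    nodes = a * p * groups        # N = ap(ah + 1)
    return radix, group_radix, groups, nodes

# Balanced example a = 2p = 2h with p = h = 4, a = 8:
print(dragonfly(p=4, a=8, h=4))   # -> (15, 64, 33, 1056)
```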


Topology Variations

  • The inter-group network

    • Downsizing: redistribute the global channels amongst the groups

    • Upsizing: move to multiple global hops

    • Rightsizing: increase virtual radix to maintain single hop property, e.g., radix 64, 2D k-ary n-flat intra-group network and 256K nodes

  • The intra-group network

    • Trade-off local hops vs. physical radix


More on Properties

  • Bisection bandwidth follows inter-group network

  • Pin-out (radix) is a design parameter

    • Defines network scale and diameter

    • Defines local (group) hop count

From J. Kim, et al., “Technology-driven highly scalable Dragonfly topology,” ISCA 2008


Balanced Design

  • How do we think about values of parameters a, p, and h?

    • Balancing traffic on global and local channels

  • Balancing cost

    • Trade-off switch radix with number of global channels (cables)
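For reference, my gloss of the cited paper's balance condition: the dragonfly is balanced when a = 2p = 2h, so that the traffic injected by each router's p terminals matches the capacity of its local and global channels.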


Routing

  • Minimal routing takes at most three steps

    • Route to the router within the source group that connects to the correct destination group (multi-hop?)

    • Route across the global channel to the correct destination group

    • Route to the correct destination router within the destination group that connects to the destination (multi-hop?)

  • Does not work well under adversarial patterns

    • Use Valiant’s 2-phase routing algorithm

      • Route to a random intermediate group

      • Route to the destination
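A hedged sketch of the two routing modes above; groups are abstracted to ids and intra-group details are elided (helper names are mine).

```python
import random

def minimal_route(src_group, dst_group):
    """The (at most) three steps of minimal Dragonfly routing."""
    if src_group == dst_group:
        return ["route within the group"]
    return ["local hop(s) to the gateway router in group %d" % src_group,
            "one global hop to group %d" % dst_group,
            "local hop(s) to the destination router"]

def valiant_route(src_group, dst_group, num_groups):
    """Valiant's 2-phase routing: minimal route to a random intermediate
    group, then minimal route to the destination; balances adversarial traffic."""
    mid = random.randrange(num_groups)
    return minimal_route(src_group, mid) + minimal_route(mid, dst_group)

print(valiant_route(0, 3, num_groups=33))
```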


Summary

  • Topology designed to scale to large numbers of nodes (>10^6)

    • Sensitive to a hierarchy of interconnect delays and costs

  • Exploits increased pin-bandwidth in emerging technologies

  • We will return to more sophisticated routing algorithms later


Hybrid Networks

[Figures: a cluster-based 2D mesh and a 2D hypermesh.]


Comparison of Direct and Indirect Networks

  • Concentration can be used to reduce direct network switch & link costs

    • “C” end nodes connect to each switch, where C is the concentration factor

    • Allows larger systems to be built from fewer switches and links

    • Requires larger switch degree

    • For N = 32 and k = 8, fewer switches and links than fat tree

64-node system with 8-port switches, b = 4

32-node system with 8-port switches

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


Comparison of Direct and Indirect Networks

Distance scaling problems may be exacerbated in on-chip MINs

[Figure: end nodes and switches in direct vs. indirect organizations.]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


Comparison of Direct and Indirect Networks

  • Blocking reduced by maximizing dimensions (switch degree)

    • Can increase bisection bandwidth, but

      • Additional dimensions may increase wire length (must observe 3D packaging constraints)

      • Flow control issues (buffer size increases with link length)

      • Pin-out constraints (limit the number of dimensions achievable)

| Evaluation category | Bus | Ring | 2D mesh | 2D torus | Hypercube | Fat tree | Fully connected |
|---|---|---|---|---|---|---|---|
| Perf.: bisection BW in # links | 1 | 2 | 8 | 16 | 32 | 32 | 1024 |
| Perf.: max (ave.) hop count | 1 (1) | 32 (16) | 14 (7) | 8 (4) | 6 (3) | 11 (9) | 1 (1) |
| Cost: I/O ports per switch | NA | 3 | 5 | 5 | 7 | 4 | 64 |
| Cost: number of switches | NA | 64 | 64 | 64 | 64 | 192 | 64 |
| Cost: number of net. links | 1 | 64 | 112 | 128 | 192 | 320 | 2016 |
| Cost: total number of links | 1 | 128 | 176 | 192 | 256 | 384 | 2080 |

Performance and cost of several network topologies for 64 nodes. Values are given in terms of bidirectional links & ports.

Hop count includes a switch and its output link (in the above, end node links are not counted for the bus topology).

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


Commercial Machines

| Company | System [Network] Name | Max. # of nodes [x # CPUs] | Basic network topology | Injection [Reception] node BW in MBytes/s | # of data bits per link per direction | Raw network link BW per direction in MBytes/s | Raw network bisection BW (bidir) in GBytes/s |
|---|---|---|---|---|---|---|---|
| Intel | ASCI Red Paragon | 4,510 [x 2] | 2-D mesh 64 x 64 | 400 [400] | 16 bits | 400 | 51.2 |
| IBM | ASCI White SP Power3 [Colony] | 512 [x 16] | BMIN w/ 8-port bidirectional switches (fat tree or Omega) | 500 [500] | 8 bits (+1 bit of control) | 500 | 256 |
| Intel | Thunder Itanium2 Tiger4 [QsNetII] | 1,024 [x 4] | fat tree w/ 8-port bidirectional switches | 928 [928] | 8 bits (+2 control for 4b/5b enc) | 1,333 | 1,365 |
| Cray | XT3 [SeaStar] | 30,508 [x 1] | 3-D torus 40 x 32 x 24 | 3,200 [3,200] | 12 bits | 3,800 | 5,836.8 |
| Cray | X1E | 1,024 [x 1] | 4-way bristled 2-D torus (~23 x 11) with express links | 1,600 [1,600] | 16 bits | 1,600 | 51.2 |
| IBM | ASC Purple pSeries 575 [Federation] | >1,280 [x 8] | BMIN w/ 8-port bidirectional switches (fat tree or Omega) | 2,000 [2,000] | 8 bits (+2 bits of control) | 2,000 | 2,560 |
| IBM | Blue Gene/L eServer Sol. [Torus Net] | 65,536 [x 2] | 3-D torus 32 x 32 x 64 | 612.5 [1,050] | 1 bit (bit serial) | 175 | 358.4 |

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


A Unified View of Direct and Indirect Networks

  • Switch designs in both cases are converging

    • Generic network may have 0, 1, or more compute nodes/switch

  • Switches implement programmable routing functions

  • Differences are primarily an issue of topology

    • Imagine the use of source routed messages

  • Deadlock avoidance


Summary and Research Directions

  • Use of hybrid interconnection networks

    • Best way to utilize existing pin-out?

  • Engineering considerations rapidly prune the space of candidate topologies

  • Routing + switching + topology = network

  • On to routing…

