
Topologies - II



Overview

  • Express Cubes

  • Flattened Butterfly

  • Fat Trees

  • DragonFly



Express Cubes

  • The Problem

    • Node delay dominates wire delay

    • Pin densities are not as high as wiring densities

    • In a k-ary n-cube, n, k, and W are all related

  • Consequently, networks are node-limited rather than wiring-limited

    • Unused capacity

  • Goals:

    • Design a network that can make use of the unused wiring capacity

    • Approach the wire delay lower bound


Express Links

[Figure: an express channel with interchange boxes inserted every i = 4 nodes; the wire delay across one express hop is i·T_w; compares non-express and express latency]



Balancing Node and Wire Delay

  • Reduce the node delay component of latency

  • Express channel length chosen to balance wire delay and node delay

  • For large D, the latency is within a factor of 2 of dedicated Manhattan wire latency

  • Pick i based on average distance and relationship between wire delay and node delay
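
The balance described above can be made concrete with a rough first-order latency model. This is only a sketch under simplifying assumptions (uniform per-node delay T_node, per-node wire delay T_w, one-dimensional traffic); the parameter values are illustrative, not from the original slides.

```python
# First-order latency model for a one-dimensional express cube (a sketch;
# T_node and T_w are assumed per-hop node and per-node wire delays).

def plain_latency(D, T_node, T_w):
    """Latency over D nodes when every node is traversed."""
    return D * (T_node + T_w)

def express_latency(D, i, T_node, T_w):
    """Latency over D nodes with an interchange box every i nodes."""
    express_hops = D // i          # each express hop spans i nodes of wire
    local_hops = D % i             # leftover hops on local channels
    node_delay = (express_hops + local_hops) * T_node
    wire_delay = D * T_w           # total wire length is unchanged
    return node_delay + wire_delay

# With T_node = 20 and T_w = 5 (arbitrary units), choosing i ~ T_node / T_w
# keeps long-distance latency within a factor of ~2 of the D * T_w wire bound.
D, T_node, T_w = 64, 20.0, 5.0
i = round(T_node / T_w)                       # i = 4
print(plain_latency(D, T_node, T_w))          # 1600.0
print(express_latency(D, i, T_node, T_w))     # 640.0, vs. wire bound 320.0
```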




Balancing Throughput and Wiring Density

  • Exploit available wiring density to move beyond pin-out limitations

    • Consider wiring density of the substrate

  • More complex interchanges

  • Good utilization of express channels

  • Routing on express links



Balancing Throughput and Wiring Density

  • Simpler interchange box design

  • Routing in a dimension

  • Uneven traffic distribution across links



Hierarchical Express Channels

  • Messages in the express segment are dominated by node latency

    • Reduce this by making latency grow logarithmically with distance

    • Logarithmic for short distances and linear for long distances

  • Port assignment for messages at each level

  • Combine advantages of direct and indirect networks

  • Routing in three phases: ascent, cruise, descent



Latency Behavior

  • Sawtooth pattern reflects latency jumps

  • Hierarchical channels smooth out the latency variations



Reducing Pin Out

  • Implementing interchanges with constant pin-out

    • Small latency penalty

  • Pin-out can be further reduced at the expense of a few more hops


Implementing Multiple Dimensions

[Figure: a two-dimensional hierarchical express cube; N denotes a processing node]
  • Pinout of interchange boxes is kept constant

  • Messages have to descend to local channels to change dimensions



Summary

  • Push to create networks limited by wire delay and wire density

    • For long distances, latency approaches that of a wire

    • Increase the bisection width of the network

  • Baseline of k-ary n-cubes – efficient use of bisection width

  • Hierarchical express cubes combine logarithmic delay properties with wire efficiency and the exploitation of locality

  • What happens when high radix routers become feasible?



Flattened Butterfly

  • Pin bandwidth and pin-out have improved over the years

    • Low dimensional networks do not make good use of this greater I/O bandwidth

  • Implications of improved off-chip bandwidth?



Reading Assignment

  • J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

  • J. Kim, W. J. Dally, and D. Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” Proceedings of ISCA 2007


Topology Tradeoff

[Figure: seeking a combination of the desirable properties of MINs and the Clos network]

  • Retain the desirable properties of MINs

    • Logarithmic diameter

    • Cost

  • Exploit the properties of folded Clos Networks

    • Path diversity

    • Ability to exploit locality



Analysis

  • The logarithmic path lengths of MINs are offset by a lack of path diversity and an inability to exploit locality

  • Need to balance cost (switches, links/cables/connectors), latency and throughput

    • Reduce the channel count by reducing diameter

    • Reduce the channel count by concentration

  • Better properties are achieved by paying for the above with increased switch radix



Another Look at Meshes

  • Trade-off wire length and hop count

    • Latency and energy impact

  • Take advantage of high radix routers

[Figure: design points trading a short hop count against short wires]



Key Insight

  • Goal is to balance serialization latency and header latency

    • Meshes only reduce serialization latency

  • FB trades serialization latency for header latency via higher radix switches
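
A small numerical sketch of this trade-off (the parameter values below are assumptions for illustration, not figures from the paper): a mesh spends its wiring budget on wide channels and pays in hop count, while a flattened butterfly spends it on more ports and pays in serialization latency.

```python
# Sketch: header latency vs. serialization latency for a 64-node on-chip
# network (assumed, illustrative parameters).

def packet_latency(avg_hops, t_router, packet_bits, channel_bits):
    header_latency = avg_hops * t_router           # per-hop router delay
    serialization = packet_bits / channel_bits     # cycles to stream the packet
    return header_latency + serialization

packet_bits, t_router = 512, 3

# 8x8 mesh: ~5.25 average hops, but wide (256-bit) channels.
print(packet_latency(5.25, t_router, packet_bits, 256))   # 17.75 cycles

# 2-D flattened butterfly with concentration 4: ~2 hops, narrower (128-bit) channels.
print(packet_latency(2.0, t_router, packet_bits, 128))    # 10.0 cycles
```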



Using Concentrators

Can be integrated as a single switch

  • Better balance of injection bandwidth and network bandwidth

    • How often do all processors want to communicate concurrently?

  • Significant reduction in wiring complexity

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007



Construction

  • Note that the inter-router connections are determined by permutations of the address digits

    • For example in the above figure (a)



On-Chip Structure

  • Structurally similar to generalized hypercube

  • Attempts to approach the wire bound for latency

  • Latency tolerant long wires (pipelined, repeated)

  • Deeper buffers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007


Router Optimizations

[Figures: non-minimal routing from source S to destination D; using bypass channels]

Figures from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007



Connectivity

  • In each dimension, switch i is connected to every switch j whose address differs from i only in that dimension's digit, i.e., that digit takes each value m = 0 to k−1
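
A small sketch of this connection rule, under the digit-replacement reading given above (this interpretation, and the helper below, are mine rather than code from the papers):

```python
# Enumerate the inter-router links of a k-ary flattened butterfly whose
# routers have `dims` address digits (a sketch of the digit-replacement rule).

from itertools import product

def fbfly_links(k, dims):
    links = set()
    for addr in product(range(k), repeat=dims):
        for d in range(dims):            # one "dimension" per address digit
            for m in range(k):           # replace digit d with m = 0 .. k-1
                if m != addr[d]:
                    peer = list(addr)
                    peer[d] = m
                    links.add(frozenset((addr, tuple(peer))))
    return links

# 4-ary, 2 router digits: 16 routers, each row and column fully connected.
print(len(fbfly_links(4, 2)))   # 48 bidirectional links
```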



Properties

  • For a k-ary n-flat with N = k^n nodes we have N/k routers in n − 1 dimensions with router radix k' = n(k − 1) + 1

  • Built for size

    • Need high radix routers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007
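
The relationships above can be wrapped in a small sizing helper; this is a sketch based on the reconstructed formulas, so treat the numbers as illustrative.

```python
# k-ary n-flat sizing (sketch).

def k_ary_n_flat(k, n):
    nodes = k ** n                  # N terminal nodes
    routers = nodes // k            # one router per k concentrated terminals
    dims = n - 1                    # router-to-router dimensions
    radix = n * (k - 1) + 1         # = k terminals + (n-1)(k-1) inter-router ports
    return {"nodes": nodes, "routers": routers, "dims": dims, "radix": radix}

print(k_ary_n_flat(32, 2))   # {'nodes': 1024, 'routers': 32, 'dims': 1, 'radix': 63}
print(k_ary_n_flat(8, 3))    # {'nodes': 512, 'routers': 64, 'dims': 2, 'radix': 22}
```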



Routing

  • Basic dimension order traversal

    • However, in a MIN every path has the same length, traversing all n stages

  • In FB, only necessary dimensions need be traversed (remember binary hypercubes!)

  • Number of shortest paths is factorial in the number of differing dimensions

    • In the MIN, dimension order is fixed

  • Non-minimal routing for better network-wide load balancing

    • Enables performance equivalent to a folded Clos
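
The factorial count of shortest paths is easy to see with a short enumeration; the helper below is a sketch (encoding router addresses as digit tuples is my own convention).

```python
# Enumerate minimal paths in a flattened butterfly: only the address digits
# that differ between source and destination routers must be corrected, in
# any order (sketch).

from itertools import permutations

def minimal_paths(src, dst):
    diff = [d for d, (s, t) in enumerate(zip(src, dst)) if s != t]
    paths = []
    for order in permutations(diff):
        hop, path = list(src), [tuple(src)]
        for d in order:
            hop[d] = dst[d]              # correct one dimension per hop
            path.append(tuple(hop))
        paths.append(path)
    return paths

# Two differing digits -> 2! shortest paths; three differing digits -> 3! = 6.
print(len(minimal_paths((0, 3, 1), (2, 3, 5))))   # 2
print(len(minimal_paths((0, 0, 0), (1, 2, 3))))   # 6
```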



Comparison to Generalized Hypercubes

  • Use of router bandwidth

    • Concentration in a k-ary n-flat produces better link utilization and lower cost

    • Example: Use 1K nodes and radix 32 switches

      • GHC – (8,8,16)

      • FB – one dimension

  • Load balanced non-minimal routing in FB

[Figure: 1K-node comparison of a generalized hypercube (8,8,16) and a one-dimensional flattened butterfly built from radix-32 switches]

If BW is matched, serialization latency will dominate



Comparison

  • For 1024 nodes

  • Flattened butterfly

    • Radix 32 switches and two stages

  • Folded Clos

    • Radix 64 switches and two stages

  • Binary hypercube

    • 10 dimensional hypercube

  • Trade-off analysis will keep the bisection width constant



Fat Trees

  • The seminal paper by C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985.

  • Simple engineering premise: a network topology that has the advantages of the binary tree without the problems



Reading Assignment

  • Xin Yuan et al., “Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands,” SIGMETRICS 2007, Section 2.2 (until Property 1).

  • Recommended

    C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985.


Fat Trees: Basic Idea

[Figure: a fat tree drawn with its forward and backward channel directions]

  • Alleviate the bandwidth bottleneck closer to the root with additional links

  • Common topology in many supercomputer installations


Alternative Construction

[Figure: a fat tree with end nodes 0–15 at the leaves and switches at the tree vertices]

  • Nodes at tree leaves

  • Switches at tree vertices

    • Building crossbars with simpler switches

  • Total link bandwidth is constant across all tree levels, with full bisection bandwidth

  • This construction is also known as having constant bisection bandwidth

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


FT(4,4): Sub-trees

[Figure: FT(4,4) with end nodes 0–15, showing its constituent sub-trees]

  • Built with constant radix (m) switches and L levels

  • Can be viewed as a less costly way of building crossbar switches

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004


FT(4,4)

[Figure: FT(4,4) with end nodes 0–15, labeling the spine and one sub-tree]

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004



Properties

  • The properties follow from multistage interconnection networks

  • Number of inputs as a function of m and L

  • Number of switches as a function of m and L
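
A sketch of those two counts, assuming FT(m, L) denotes a fat tree built from m-port switches with m/2 ports facing down and m/2 facing up (a k-ary L-tree with k = m/2). This reading matches the 16-node FT(4,4) figure, but it is an assumption on my part rather than the paper's definition.

```python
# Fat-tree sizing sketch: FT(m, L) assumed to be a k-ary L-tree with k = m/2.

def fat_tree_size(m, L):
    k = m // 2
    inputs = k ** L                 # end nodes at the leaves
    switches = L * k ** (L - 1)     # k^(L-1) switches at each of the L levels
    return inputs, switches

print(fat_tree_size(4, 4))   # (16, 32): 16 end nodes, as in the FT(4,4) figure
```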



Generalized Fat Trees

  • Networks with variable bisection bandwidth

  • Asymmetric use of switch radix between successive levels

  • Can construct expanders and concentrators

  • GFT (L,m,w)

    • L levels

    • Bandwidth ratio of m:w between levels

Reference: S. R. Ohring, M. Ibel, S. K. Das, M. J. Kumar, “On Generalized Fat Trees,” IEEE IPPS 1995



GFT(2,4,2)


Revisiting the Clos Network

[Figure: a 16-port Clos network with inputs 0–15 on one side and outputs 0–15 on the other]


Folded Clos Network

  • Network is effectively folded into itself to produce a fat tree

  • Consequently often referred to as a folded Clos Network

    • Note this is also equivalent to a bidirectional Benes network

    • Rearrangeably non-blocking when the number of middle-stage switches is at least the number of input ports per first-stage switch (m ≥ n)

[Figure: the folded network connecting end nodes 0–15]



Implications

  • The connectivity of this class of fat trees is equivalent to that of a crossbar

  • Realizing the performance of a crossbar is another matter

    • Recall it is not strictly non-blocking (like a crossbar) but rather rearrangeably non-blocking

    • Packet scheduling and congestion management are key to performance

  • Achievable performance a function of the communication pattern



Basic Routing

  • Compare addresses of source and destination

    • The digit position of the first difference identifies the stage/level at which to turn around

    • Any path up the tree

    • Deterministic routing down the tree

  • Load balanced routing randomizes the up path
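
A sketch of this rule for a binary fat tree (addresses as bit tuples, most-significant digit first). The random choice of up-port stands in for any load-balancing policy; it is an assumption for illustration, not the algorithm from a specific paper.

```python
import random

def turnaround_level(src_digits, dst_digits):
    """Level at which the packet turns around; level 1 is closest to the
    end nodes, level L is the spine.  Digits are most-significant first."""
    L = len(src_digits)
    for pos, (s, d) in enumerate(zip(src_digits, dst_digits)):
        if s != d:
            return L - pos
    return 0                                  # source == destination

def fat_tree_route(src_digits, dst_digits):
    up = turnaround_level(src_digits, dst_digits)
    up_ports = [random.randrange(2) for _ in range(up)]     # any path up the tree
    down_ports = list(dst_digits[-up:]) if up else []       # deterministic descent
    return up_ports, down_ports

# Nodes 3 = (0,0,1,1) and 5 = (0,1,0,1) in a 16-leaf tree: turn around at level 3.
print(turnaround_level((0, 0, 1, 1), (0, 1, 0, 1)))   # 3
```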



Dragonfly Topology

  • How do we build networks that scale to >10^6 nodes?

    • Technology to date has prescribed low-dimensional networks

    • Target – Exascale computing systems

  • Topology challenge

    • Reconciling trade-offs between diameter, switch radix, and cost

    • What is a feasible radix to use vs. need for scalability?

    • Indirect topology



Reading Assignment

J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven highly scalable Dragonfly topology,” Proceedings of the International Symposium on Computer Architecture, 2008



Dragonfly

  • Increasing pin bandwidth has moved the design point

    • More channels and a smaller diameter, rather than fewer channels with higher bandwidth per channel

    • Increases the number and length of cables

  • Network cost proportional to number of channels, especially global channels

    • Reduce the number of global channels

    • Use new signaling technology for the long (global) channels: optical signaling

  • Engineering optimizations to increase the effective switch radix



Topology Basics

  • One level hierarchy

  • Traffic limited to one global hop

  • Flexible definition of intra-group interconnection

[Figure: dragonfly structure; groups G0 … Gg are connected by the inter-group network; within each group, routers 0 … a−1 form the intra-group network, and each router has p terminal nodes (0 … p−1), a−1 local channels, and h global channels]



Properties

  • Switch radix: k = p + (a − 1) + h

  • Effective radix (group radix): k' = a(p + h)

  • Number of groups (maximum, with one global channel between every pair of groups): g = ah + 1

  • Number of nodes: N = ap(ah + 1)
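
A sizing sketch built directly from the parameters above (p terminals, a routers per group, h global channels per router), assuming the maximum-size case of exactly one global channel between every pair of groups:

```python
# Dragonfly sizing sketch.

def dragonfly_size(p, a, h):
    radix = p + (a - 1) + h          # ports on one physical router
    group_radix = a * (p + h)        # the group viewed as one virtual high-radix router
    groups = a * h + 1               # maximum number of groups
    nodes = a * p * groups
    return radix, group_radix, groups, nodes

# A balanced choice (a = 2p = 2h), e.g. p = h = 4, a = 8:
print(dragonfly_size(4, 8, 4))   # (15, 64, 33, 1056)
```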



Topology Variations

  • The inter-group network

    • Downsizing: redistribute the global channels amongst the groups

    • Upsizing: move to multiple global hops

    • Rightsizing: increase virtual radix to maintain single hop property, e.g., radix 64, 2D k-ary n-flat intra-group network and 256K nodes

  • The intra-group network

    • Trade-off local hops vs. physical radix



More on Properties

  • Bisection bandwidth follows inter-group network

  • Pin-out (radix) is a design parameter

    • Defines network scale and diameter

    • Defines local (group) hop count

From J. Kim et al., “Technology-driven highly scalable Dragonfly topology,” ISCA 2008



Balanced Design

  • How do we think about values of parameters a, p, and h?

    • Balancing traffic on global and local channels

  • Balancing cost

    • Trade off switch radix against the number of global channels (cables)



Routing

  • Minimal routing takes at most three steps

    • Route to the router within the source group that connects to the correct destination group (multi-hop?)

    • Route across the global channel to the correct destination group

    • Route within the destination group to the router that connects to the destination node (multi-hop?)

  • Does not work well under adversarial patterns

    • Use Valiant’s 2-phase routing algorithm

      • Route to a random intermediate group

      • Route to the destination
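
A group-level sketch of minimal and Valiant routing. The helper gateway(g_src, g_dst), which names the router in g_src that owns the global channel to g_dst, is hypothetical and introduced here only for illustration.

```python
import random

def minimal_route(src, dst, gateway):
    """src and dst are (group, router) pairs; returns the list of hops."""
    (gs, rs), (gd, rd) = src, dst
    hops = []
    if gs == gd:
        if rs != rd:
            hops.append(("local", gs, rs, rd))
        return hops
    exit_router = gateway(gs, gd)
    if rs != exit_router:
        hops.append(("local", gs, rs, exit_router))      # step 1: reach the gateway router
    hops.append(("global", gs, gd))                      # step 2: one global hop
    entry_router = gateway(gd, gs)
    if entry_router != rd:
        hops.append(("local", gd, entry_router, rd))     # step 3: reach the destination router
    return hops

def valiant_route(src, dst, gateway, num_groups):
    """Two-phase routing: minimal to a random intermediate group, then minimal."""
    mid = (random.randrange(num_groups), 0)              # any router in a random group
    return minimal_route(src, mid, gateway) + minimal_route(mid, dst, gateway)
```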



Summary

  • Topology designed to scale to large numbers of nodes (>10^6)

    • Sensitive to a hierarchy of interconnect delays and costs

  • Exploits increased pin-bandwidth in emerging technologies

  • We will return to more sophisticated routing algorithms later



Hybrid Networks

Cluster based 2D Mesh

2D Hypermesh



Comparison of Direct and Indirect Networks

  • Concentration can be used to reduce direct network switch & link costs

    • “C” end nodes connect to each switch, where C is concentration factor

    • Allows larger systems to be built from fewer switches and links

    • Requires larger switch degree

    • For N = 32 and k = 8, fewer switches and links than fat tree

64-node system with 8-port switches, b = 4

32-node system with 8-port switches

© T.M. Pinkston, J. Duato, with major contributions by J. Flich



Comparison of Direct and Indirect Networks

Distance scaling problems may be exacerbated in on-chip MINs

[Figure: layouts of direct and indirect networks, with end nodes and switches labeled]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich



Comparison of Direct and Indirect Networks

  • Blocking reduced by maximizing dimensions (switch degree)

    • Can increase bisection bandwidth, but

      • Additional dimensions may increase wire length (must observe 3D packaging constraints)

      • Flow control issues (buffer size increases with link length)

      • Pin-out constraints (limit the number of dimensions achievable)

Evaluation category       Bus     Ring      2D mesh   2D torus   Hypercube   Fat tree   Fully connected

Performance
  Bisection BW (# links)  1       2         8         16         32          32         1024
  Max (ave.) hop count    1 (1)   32 (16)   14 (7)    8 (4)      6 (3)       11 (9)     1 (1)

Cost
  I/O ports per switch    NA      3         5         5          7           4          64
  Number of switches      NA      64        64        64         64          192        64
  Number of network links 1       64        112       128        192         320        2016
  Total number of links   1       128       176       192        256         384        2080

Performance and cost of several network topologies for 64 nodes. Values are given in terms of bidirectional links & ports.

Hop count includes a switch and its output link (in the above, end node links are not counted for the bus topology).

© T.M. Pinkston, J. Duato, with major contributions by J. Flich


Commercial Machines

For each system: maximum number of nodes [x CPUs per node], basic network topology, injection [reception] node bandwidth in MB/s, data bits per link per direction, raw network link bandwidth per direction in MB/s, and raw network bisection bandwidth (bidirectional) in GB/s.

  • Intel ASCI Red Paragon: 4,510 [x 2]; 2-D mesh 64 x 64; 400 [400] MB/s; 16 bits; 400 MB/s; 51.2 GB/s

  • IBM ASCI White SP Power3 [Colony]: 512 [x 16]; BMIN with 8-port bidirectional switches (fat tree or Omega); 500 [500] MB/s; 8 bits (+1 bit of control); 500 MB/s; 256 GB/s

  • Intel Thunder Itanium2 Tiger4 [QsNetII]: 1,024 [x 4]; fat tree with 8-port bidirectional switches; 928 [928] MB/s; 8 bits (+2 control for 4b/5b enc); 1,333 MB/s; 1,365 GB/s

  • Cray XT3 [SeaStar]: 30,508 [x 1]; 3-D torus 40 x 32 x 24; 3,200 [3,200] MB/s; 12 bits; 3,800 MB/s; 5,836.8 GB/s

  • Cray X1E: 1,024 [x 1]; 4-way bristled 2-D torus (~23 x 11) with express links; 1,600 [1,600] MB/s; 16 bits; 1,600 MB/s; 51.2 GB/s

  • IBM ASC Purple pSeries 575 [Federation]: >1,280 [x 8]; BMIN with 8-port bidirectional switches (fat tree or Omega); 2,000 [2,000] MB/s; 8 bits (+2 bits of control); 2,000 MB/s; 2,560 GB/s

  • IBM Blue Gene/L eServer Sol. [Torus Net]: 65,536 [x 2]; 3-D torus 32 x 32 x 64; 612.5 [1,050] MB/s; 1 bit (bit serial); 175 MB/s; 358.4 GB/s

© T.M. Pinkston, J. Duato, with major contributions by J. Flich



A Unified View of Direct and Indirect Networks

  • Switch designs in both cases are coalescing

    • Generic network may have 0, 1, or more compute nodes/switch

  • Switches implement programmable routing functions

  • Differences are primarily an issue of topology

    • Imagine the use of source routed messages

  • Deadlock avoidance



Summary and Research Directions

  • Use of hybrid interconnection networks

    • Best way to utilize existing pin-out?

  • Engineering considerations rapidly prune the space of candidate topologies

  • Routing + switching + topology = network

  • On to routing…

