Topologies - II
Overview
  • Express Cubes
  • Flattened Butterfly
  • Fat Trees
  • DragonFly
Express Cubes
  • The Problem
    • Node delay dominates wire delay
    • Pin densities are not as high as wiring densities
    • In a k-ary n-cube, n, k, and W are all related
  • Consequently networks are node limited rather than wiring limited
    • Unused capacity
  • Goals:
    • Design a network that can make use of the unused wiring capacity
    • Approach the wire-delay lower bound
Express Links

[Figure: an express cube in one dimension. Interchange boxes are inserted every i nodes (here i = 4); an express channel connects adjacent interchanges, so its wire delay is i·Tw. The non-express path pays node delay at every hop; the express path pays it only at interchanges.]

Balancing Node and Wire Delay
  • Reduce the node delay component of latency
  • Express channel length chosen to balance wire delay and node delay
  • For large D, the latency is within a factor of 2 of dedicated Manhattan wire latency
  • Pick i based on the average distance and the relationship between wire delay and node delay (see the sketch below)
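To make the balance concrete, here is a minimal latency sketch in Python (an illustrative model, not taken from the slides: per-hop node delay Tn, per-unit wire delay Tw, interchanges every i nodes; the function names are mine):

```python
# Illustrative latency model for express channels (assumed, not from the slides).
# A message travels distance D along one dimension. Without express channels,
# every hop pays both node delay Tn and wire delay Tw. With interchange boxes
# every i nodes, only the local hops at the ends and the interchanges pay Tn,
# while the total wire delay D * Tw is paid either way.

def plain_latency(D, Tn, Tw):
    """k-ary n-cube latency without express channels."""
    return D * (Tn + Tw)

def express_latency(D, i, Tn, Tw):
    """Worst case: up to i - 1 local hops at each end, plus express hops."""
    local_hops = min(D, 2 * (i - 1))
    express_hops = max(D - local_hops, 0) // i
    return (local_hops + express_hops) * Tn + D * Tw

if __name__ == "__main__":
    D, Tn, Tw = 64, 5.0, 1.0                  # node delay dominates wire delay
    print(plain_latency(D, Tn, Tw))           # 384.0
    print(express_latency(D, 8, Tn, Tw))      # 164.0: same wires, far fewer node delays
```

For large D the node-delay term grows roughly as D/i rather than D, which is how express channels approach the wire-delay lower bound.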

Balancing Throughput and Wiring Density
  • Exploit available wiring density to move beyond pin-out limitations
    • Consider wiring density of the substrate
  • More complex interchanges
  • Good utilization of express channels
  • Routing on express links
Balancing Throughput and Wiring Density
  • Simpler interchange box design
  • Routing in a dimension
  • Uneven traffic distribution across links
Hierarchical Express Channels
  • Messages in the express segment are dominated by node latency
    • Hierarchies of express channels make this latency grow logarithmically with distance
    • Logarithmic for short distances, linear for long distances
  • Port assignment for messages at each level
  • Combine advantages of direct and indirect networks
  • Routing in three phases: ascent, cruise, descent
Latency Behavior
  • Sawtooth pattern reflects latency jumps
  • Hierarchical channels smooth out latency variations
Reducing Pin Out
  • Implementing interchanges with constant pin-out
    • Small latency penalty
  • Pin-out can be further reduced at the expense of a few more hops
Implementing Multiple Dimensions

[Figure: a two-dimensional express cube; a 4 × 4 array of nodes (N) with interchange boxes along each row and column.]
  • Pinout of interchange boxes is kept constant
  • Messages have to descend to local channels to change dimensions
Summary
  • Push to create networks limited by wire delay and wire density
    • For long distances, latency approaches that of a wire
    • Increase the bisection width of the network
  • Baseline of k-ary n-cubes – efficient use of bisection width
  • Hierarchical express cubes combine logarithmic delay properties with wire efficiency and locality exploitation
  • What happens when high radix routers become feasible?
Flattened Butterfly
  • Pin bandwidth and pin-out have improved over the years
    • Low dimensional networks do not make good use of this greater I/O bandwidth
  • Implications of improved off-chip bandwidth?
Reading Assignment
  • J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007
  • J. Kim, W. J. Dally, and D. Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” Proceedings of ISCA 2007
Topology Tradeoff

[Figure: MINs and the Clos network each offer a subset of desirable properties; the goal is a topology combining them.]

  • Retain the desirable properties of MINs
    • Logarithmic diameter
    • Cost
  • Exploit the properties of folded Clos Networks
    • Path diversity
    • Ability to exploit locality
Analysis
  • The logarithmic path lengths of MINs are offset by a lack of path diversity and an inability to exploit locality
  • Need to balance cost (switches, links/cables/connectors), latency and throughput
    • Reduce the channel count by reducing diameter
    • Reduce the channel count by concentration
  • Better properties are achieved by paying for the above with increased switch radix
Another Look at Meshes
  • Trade-off wire length and hop count
    • Latency and energy impact
  • Take advantage of high radix routers

[Figure: design spectrum between short hop count (fewer, longer wires; higher radix) and short wires (more hops; lower radix).]

Key Insight
  • Goal is to balance serialization latency and header latency
    • Meshes only reduce serialization latency
  • The flattened butterfly (FB) trades serialization latency for header latency via higher-radix switches (see the model below)
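The balance can be made visible with the usual two-term latency model (illustrative numbers and an assumed equal wiring budget; none of this is from the paper):

```python
# Header latency vs. serialization latency: T = H * t_r + L / b.
# A mesh takes many hops over wide channels; a flattened butterfly takes few
# hops, but the same wiring budget is split over more channels, so b shrinks.

def latency(hops, t_router, pkt_bits, chan_bw):
    head = hops * t_router          # header latency: router traversals
    serial = pkt_bits / chan_bw     # serialization latency: squeezing the packet out
    return head + serial, head, serial

mesh = latency(hops=7, t_router=5.0, pkt_bits=512, chan_bw=64.0)
fb   = latency(hops=2, t_router=5.0, pkt_bits=512, chan_bw=16.0)
print(mesh)   # (43.0, 35.0, 8.0)  -> dominated by header latency
print(fb)     # (42.0, 10.0, 32.0) -> dominated by serialization latency
```

Neither extreme is ideal; the design goal is a switch radix at which the two terms are comparable.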
Using Concentrators

[Figure: a concentrator feeding a router; the two can be integrated as a single switch.]

  • Better balance of injection bandwidth and network bandwidth
    • How often do all processors want to communicate concurrently?
  • Significant reduction in wiring complexity

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

Construction
  • Note that the inter-router connections are determined by permutations of the address digits
    • See, for example, figure (a) of the Kim et al. MICRO 2007 paper
On-Chip Structure
  • Structurally similar to generalized hypercube
  • Attempts to approach the wire bound for latency
  • Latency tolerant long wires (pipelined, repeated)
  • Deeper buffers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

Router Optimizations

[Figures: non-minimal routing from a source S to a destination D, and the use of bypass channels.]
Figures from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

Connectivity

  • In each dimension d, switch i is connected to the switches j = i + (m − i_d)·k^d for m = 0 … k − 1 with m ≠ i_d, where i_d = ⌊i/k^d⌋ mod k is the d-th address digit of i (enumerated in the sketch below)
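A small sketch of this rule (my reconstruction of the digit-permutation connectivity; `fb_neighbors` is an illustrative name):

```python
# Enumerate the inter-router links of a k-ary n-flat (flattened butterfly).
# In each dimension d, router i is connected to every router j that differs
# from i only in address digit d.

def fb_neighbors(i, k, n_prime):
    """Neighbors of router i in a k-ary n-flat with n_prime dimensions."""
    neighbors = []
    for d in range(n_prime):
        i_d = (i // k**d) % k                 # digit of i in dimension d
        for m in range(k):
            if m != i_d:
                neighbors.append(i + (m - i_d) * k**d)
    return neighbors

# Example: k = 4, n' = 2 (16 routers).
print(fb_neighbors(5, k=4, n_prime=2))        # [4, 6, 7, 1, 9, 13]
```

Each router thus reaches k − 1 peers per dimension, which is where the n′(k − 1) inter-router ports in the radix formula below come from.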
Properties
  • For a k-ary n-flat with N = k^n nodes we have N/k routers in n′ = n − 1 dimensions, with router radix k′ = n′(k − 1) + k

  • Built for size
    • Need high radix routers

Figure from J. Kim, J. Balfour, and W. J. Dally, “Flattened Butterfly Topology for On-Chip Networks,” Proceedings of MICRO 2007

Routing
  • Basic dimension-order traversal
    • However, in MINs all paths are of length n (every packet traverses all stages)
  • In FB, only the necessary dimensions need be traversed (remember binary hypercubes!)
  • Number of shortest paths is factorial in the number of differing dimensions (illustrated below)
    • In the MIN, dimension order is fixed
  • Non-minimal routing for better network-wide load balancing
    • Enables performance equivalent to a folded Clos
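A short illustration of the path-diversity claim (assumed helper code; the k-ary n-flat address digits are compared dimension by dimension):

```python
# Count minimal paths in a k-ary n-flat: only dimensions whose address digits
# differ must be traversed, in any order, so the count is factorial in the
# number of differing digits.

from math import factorial

def digits(x, k, n_prime):
    """Address digits of router x, least significant dimension first."""
    return [(x // k**d) % k for d in range(n_prime)]

def minimal_paths(src, dst, k, n_prime):
    differing = sum(a != b for a, b in zip(digits(src, k, n_prime),
                                           digits(dst, k, n_prime)))
    return differing, factorial(differing)

hops, paths = minimal_paths(5, 10, k=4, n_prime=2)
print(hops, paths)   # 2 differing dimensions -> 2 hops, 2! = 2 minimal paths
```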
Comparison to Generalized Hypercubes
  • Use of router bandwidth
    • Concentration in a k-ary n-flat produces better link utilization and lower cost
    • Example: Use 1K nodes and radix 32 switches
      • GHC – (8,8,16)
      • FB – one dimension
  • Load balanced non-minimal routing in FB

[Figure: routers 0 … 31 of a 1K-node generalized hypercube versus a one-dimensional flattened butterfly with concentration; if bandwidth is matched, serialization latency will dominate.]

Comparison
  • For 1024 nodes
  • Flattened butterfly
    • Radix 32 switches and two stages
  • Folded Clos
    • Radix 64 switches and two stages
  • Binary hypercube
    • 10 dimensional hypercube
  • Trade-off analysis will keep the bisection width constant
Fat Trees
  • The seminal paper by C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985
  • Simple engineering premise: a network topology that has the advantages of the binary tree without the problems
Reading Assignment
  • Xin Yuan et al., “Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands,” SIGMETRICS 2007, Section 2.2 (up to Property 1)
  • Recommended

C. Leiserson, “Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing,” IEEE Transactions on Computers, October 1985

Fat Trees: Basic Idea

[Figure: a binary tree with forward (toward the root) and backward (toward the leaves) channels; link capacity grows toward the root.]
  • Alleviate the bandwidth bottleneck closer to the root with additional links
  • Common topology in many supercomputer installations
Alternative Construction

[Figure: a 16-node fat tree (leaves 0 … 15) built from constant-radix switches.]
  • Nodes at tree leaves
  • Switches at tree vertices
    • Building crossbars with simpler switches
  • Total link bandwidth is constant across all tree levels, with full bisection bandwidth
  • This construction is also known as having constant bisection bandwidth

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

FT(4,4): Sub-trees

[Figure: the FT(4,4) fat tree (leaves 0 … 15), highlighting its recursive sub-trees.]
  • Built with constant radix (m) switches and L levels
  • Can be viewed as a less costly way of building crossbar switches

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004

ft 4 4
FT (4,4)

spine

sub-tree

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

X. Lin, Y. Chung and T. Huang, “A Multiple LID Routing Scheme for Fat-tree Based InfiniBand Networks,” IEEE IPDPS 2004

Properties
  • The properties follow from multistage interconnection networks
  • Number of inputs as a function of m and L (see the sketch below)
  • Number of switches as a function of m and L
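As a hedged sketch of those two functions, assuming the common construction in which each radix-m switch dedicates m/2 ports downward and m/2 upward (some variants halve the top level; this sketch ignores that):

```python
# Counts for a fat tree FT(m, L): radix-m switches, L levels, m/2 ports down
# and m/2 ports up per switch (assumed construction).

def ft_inputs(m, L):
    """End nodes at the leaves."""
    return (m // 2) ** L

def ft_switches(m, L):
    """(m/2)^(L-1) switches at each of the L levels."""
    return L * (m // 2) ** (L - 1)

print(ft_inputs(4, 4), ft_switches(4, 4))   # 16 end nodes, 32 switches, as in FT(4,4)
```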
Generalized Fat Trees
  • Networks with variable bisection bandwidth
  • Asymmetric use of switch radix between successive levels
  • Can construct expanders and concentrators
  • GFT (L,m,w)
    • L levels
    • Bandwidth ratio of m:w between levels

Reference: S. R. Öhring, M. Ibel, S. K. Das, M. J. Kumar, “On Generalized Fat Trees,” IEEE IPPS 1995

Revisiting the Clos Network

[Figure: a 16-port Clos network with inputs 0 … 15 on the left and outputs 0 … 15 on the right.]

Folded Clos Network

[Figure: the Clos network folded about its middle stage; ports 0 … 15 appear on one side.]

  • Network is effectively folded onto itself to produce a fat tree
  • Consequently often referred to as a folded Clos network
    • Note this is also equivalent to a bidirectional Benes network
    • Rearrangeably non-blocking when the number of middle-stage switches m ≥ n

Implications
  • The connectivity of this class of fat trees is equivalent to that of a crossbar
  • Realizing the performance of a crossbar is another matter
    • Recall it is not strictly non-blocking (crossbar) but rather rearrangeable
    • Packet scheduling and congestion management are key to performance
  • Achievable performance a function of the communication pattern
Basic Routing
  • Compare the addresses of source and destination
    • The digit position of the first difference identifies the stage/level at which to turn around
    • Any path up the tree works
    • Deterministic routing down the tree
  • Load-balanced routing randomizes the up path (see the sketch below)
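A minimal routing sketch, assuming leaf addresses are written in base b, where b is the number of down-ports per switch (b = m/2 for the FT(m, L) above; the names are illustrative):

```python
# Fat-tree routing: the most significant differing address digit fixes the
# turnaround level; any up-port may be used on the way up (randomized here),
# and the destination digits then determine the path down deterministically.

import random

def turnaround_level(src, dst, b, levels):
    for lvl in range(levels - 1, -1, -1):        # most significant digit first
        if (src // b**lvl) % b != (dst // b**lvl) % b:
            return lvl + 1                        # climb above the differing digit
    return 0                                      # src and dst share a leaf switch

def up_path(src, dst, b, levels):
    """Randomly chosen up-ports, one per level climbed (load-balanced routing)."""
    return [random.randrange(b) for _ in range(turnaround_level(src, dst, b, levels))]

print(turnaround_level(3, 12, b=2, levels=4))     # 4: addresses differ in the top digit
print(up_path(3, 12, b=2, levels=4))              # e.g. [1, 0, 1, 1]
```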
Dragonfly Topology
  • How do we build networks that scale to >10⁶ nodes?
    • Technology to date has prescribed low-dimensional networks
    • Target – Exascale computing systems
  • Topology challenge
    • Reconciling trade-offs between diameter, switch radix, and cost
    • What is a feasible radix to use vs. need for scalability?
    • Indirect topology
Reading Assignment

J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-Driven, Highly-Scalable Dragonfly Topology,” Proceedings of the International Symposium on Computer Architecture, 2008

Dragonfly
  • Increasing pin bandwidth has moved the design point
    • More channels and smaller diameter, rather than fewer channels with higher bandwidth per channel
    • Increases the number and length of cables
  • Network cost is proportional to the number of channels, especially global channels
    • Reduce the number of global channels
    • Use new signaling technology for long channels → optical signaling
  • Engineering optimizations to increase the effective switch radix
Topology Basics
  • One-level hierarchy
  • Traffic limited to one global hop
  • Flexible definition of the intra-group interconnection

[Figure: dragonfly structure. Groups G0, G1, …, Gg are joined by the inter-group network. Within a group, routers 0 … a−1 form the intra-group network; each router has p terminal ports (nodes 0 … p−1), a−1 local channels, and h global channels.]

Properties
  • Switch radix: k = p + (a − 1) + h
  • Effective radix (group radix): k′ = a(p + h)
  • Number of groups: g = ah + 1
  • Number of nodes: N = ap(ah + 1) (computed in the sketch below)
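A quick calculator for these relations (a sketch; the parameter names follow the Kim et al. paper, and the example applies the balanced-design rule a = 2p = 2h discussed later):

```python
# Dragonfly sizing from (p, a, h): p terminals per router, a routers per group,
# h global channels per router, assuming one channel between each pair of groups.

def dragonfly(p, a, h):
    radix  = p + (a - 1) + h     # terminal + local + global ports per router
    k_eff  = a * (p + h)         # effective (group) radix
    groups = a * h + 1           # each group reaches every other group once
    nodes  = a * p * groups
    return radix, k_eff, groups, nodes

# Balanced design: a = 2p = 2h.
print(dragonfly(p=4, a=8, h=4))   # (15, 64, 33, 1056)
```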
Topology Variations
  • The inter-group network
    • Downsizing: redistribute the global channels amongst the groups
    • Upsizing: move to multiple global hops
    • Rightsizing: increase virtual radix to maintain single hop property, e.g., radix 64, 2D k-ary n-flat intra-group network and 256K nodes
  • The intra-group network
    • Trade-off local hops vs. physical radix
More on Properties
  • Bisection bandwidth follows inter-group network
  • Pin-out (radix) is a design parameter
    • Defines network scale and diameter
    • Defines local (group) hop count

From J. Kim et al., “Technology-Driven, Highly-Scalable Dragonfly Topology,” ISCA 2008

Balanced Design
  • How do we choose values for the parameters a, p, and h?
    • Balance traffic on global and local channels (a balanced design uses a = 2p = 2h)
  • Balancing cost
    • Trade off switch radix against the number of global channels (cables)
Routing
  • Minimal routing takes at most three steps
    • Route to the router within the source group that connects to the correct destination group (multi-hop?)
    • Route across the global channel to the correct destination group
    • Route to the correct destination router within the destination group that connects to the destination (multi-hop?)
  • Minimal routing does not work well under adversarial patterns
    • Use Valiant’s two-phase routing algorithm (sketched below)
      • Route to a random intermediate group
      • Route to the destination
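A group-level sketch of both schemes (illustrative only; local hops within groups are elided):

```python
# Group-level dragonfly routing: minimal routing takes at most one global hop
# (local, global, local). Under adversarial traffic, Valiant's scheme first
# routes to a random intermediate group, doubling global hops but balancing load.

import random

def minimal_route(src_group, dst_group):
    if src_group == dst_group:
        return [src_group]
    return [src_group, dst_group]          # one global hop (plus local hops)

def valiant_route(src_group, dst_group, num_groups):
    mid = random.randrange(num_groups)     # random intermediate group
    return minimal_route(src_group, mid)[:-1] + minimal_route(mid, dst_group)

print(minimal_route(0, 7))                 # [0, 7]
print(valiant_route(0, 7, num_groups=33))  # e.g. [0, 19, 7]
```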
Summary
  • Topology designed to scale to large numbers of nodes (>10⁶)
    • Sensitive to a hierarchy of interconnect delays and costs
  • Exploits increased pin-bandwidth in emerging technologies
  • We will return to more sophisticated routing algorithms later
Hybrid Networks

[Figure: two hybrid examples, a cluster-based 2D mesh and a 2D hypermesh.]
Comparison of Direct and Indirect Networks
  • Concentration can be used to reduce direct network switch & link costs
    • “C” end nodes connect to each switch, where C is concentration factor
    • Allows larger systems to be built from fewer switches and links
    • Requires larger switch degree
    • For N = 32 and k = 8, fewer switches and links than fat tree

[Figures: a 64-node system with 8-port switches (b = 4), and a 32-node system with 8-port switches.]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Comparison of Direct and Indirect Networks

Distance scaling problems may be exacerbated in on-chip MINs

[Figure: an on-chip MIN layout showing end nodes and switches; wire lengths grow with the number of stages.]

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Comparison of Direct and Indirect Networks
  • Blocking reduced by maximizing dimensions (switch degree)
    • Can increase bisection bandwidth, but
      • Additional dimensions may increase wire length (must observe 3D packaging constraints)
      • Flow control issues (buffer size increases with link length)
      • Pin-out constraints (limit the number of dimensions achievable)

| Evaluation category | Bus | Ring | 2D mesh | 2D torus | Hypercube | Fat tree | Fully connected |
|---|---|---|---|---|---|---|---|
| Perf.: Bisection BW in # links | 1 | 2 | 8 | 16 | 32 | 32 | 1024 |
| Perf.: Max (ave.) hop count | 1 (1) | 32 (16) | 14 (7) | 8 (4) | 6 (3) | 11 (9) | 1 (1) |
| Cost: I/O ports per switch | NA | 3 | 5 | 5 | 7 | 4 | 64 |
| Cost: Number of switches | NA | 64 | 64 | 64 | 64 | 192 | 64 |
| Cost: Number of network links | 1 | 64 | 112 | 128 | 192 | 320 | 2016 |
| Cost: Total number of links | 1 | 128 | 176 | 192 | 256 | 384 | 2080 |

Performance and cost of several network topologies for 64 nodes. Values are given in terms of bidirectional links and ports. Hop count includes a switch and its output link (end node links are not counted for the bus topology).

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

Commercial Machines

| Company | System [Network] Name | Max. # nodes [x # CPUs] | Basic network topology | Injection [Reception] node BW (MB/s) | # data bits per link per direction | Raw link BW per direction (MB/s) | Raw bisection BW, bidir (GB/s) |
|---|---|---|---|---|---|---|---|
| Intel | ASCI Red Paragon | 4,510 [x 2] | 2-D mesh 64 x 64 | 400 [400] | 16 bits | 400 | 51.2 |
| IBM | ASCI White SP Power3 [Colony] | 512 [x 16] | BMIN w/ 8-port bidirectional switches (fat tree or Omega) | 500 [500] | 8 bits (+1 bit of control) | 500 | 256 |
| Intel | Thunder Itanium2 Tiger4 [QsNetII] | 1,024 [x 4] | fat tree w/ 8-port bidirectional switches | 928 [928] | 8 bits (+2 control for 4b/5b enc.) | 1,333 | 1,365 |
| Cray | XT3 [SeaStar] | 30,508 [x 1] | 3-D torus 40 x 32 x 24 | 3,200 [3,200] | 12 bits | 3,800 | 5,836.8 |
| Cray | X1E | 1,024 [x 1] | 4-way bristled 2-D torus (~ 23 x 11) with express links | 1,600 [1,600] | 16 bits | 1,600 | 51.2 |
| IBM | ASC Purple pSeries 575 [Federation] | >1,280 [x 8] | BMIN w/ 8-port bidirectional switches (fat tree or Omega) | 2,000 [2,000] | 8 bits (+2 bits of control) | 2,000 | 2,560 |
| IBM | Blue Gene/L eServer Sol. [Torus Net] | 65,536 [x 2] | 3-D torus 32 x 32 x 64 | 612.5 [1,050] | 1 bit (bit serial) | 175 | 358.4 |

© T.M. Pinkston, J. Duato, with major contributions by J. Flich

A Unified View of Direct and Indirect Networks
  • Switch designs in both cases are converging
    • Generic network may have 0, 1, or more compute nodes/switch
  • Switches implement programmable routing functions
  • Differences are primarily an issue of topology
    • Imagine the use of source routed messages
  • Deadlock avoidance
Summary and Research Directions
  • Use of hybrid interconnection networks
    • Best way to utilize existing pin-out?
  • Engineering considerations rapidly prune the space of candidate topologies
  • Routing + switching + topology = network
  • On to routing…