
Ethernet Data Center Routing Challenges and 802.1aq/SPB new work

PETER ASHWOOD-SMITH



[Figure: callout A, "Tweak bridge priorities here"; callout B; labels S1 … S16.]

802.1aq’s 16 ECT algorithms can give a perfect spread over 2 hops and 16 uplinks. However:

A) The 2nd-layer switch priorities need to be tweaked to guarantee all 16 are used.

B) At least 16 subnets (C/S-VLANs) are needed, to assign one per 802.1aq B-VID.

- David Allan et al. have a presentation on this so I won’t spend much time on it.
- In general, a network with N equal-cost paths from ‘some source’ to ‘some destination’ requires #ECT about 25-40% greater than N (to statistically capture them all).
- Therefore, when #ECT == N, some ‘tweaking’ is usually required (for a DC it is trivial to do, however).
- Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity) …

*Tweaking = adjusting Bridge Priorities up/down from defaults.

“Example” 802.1aq switching cluster – assume 100GE NNI links/groups

[Figure: two-layer fabric with upper layer A1 … A16, lower layer B1 … B32, and servers S1,1 … S32,160. Annotations: 32 x 100GE; 16 x 100GE; 160 x 10GE; 5120 x 10GE; 16 x 32 x 100GE = 51.2T using 48 x 2T switches. Good numbers: “16” and “2” levels.]

- 48-switch, non-blocking, 2-layer L2 fabric
- 16 at “upper” layer A1..A16
- 32 at “lower” layer B1..B32
- 16 uplinks per Bn, and 160 UNI links per Bn
- 32 downlinks per An

- (16 x 100GE per Bn) x 32 = 512 x 100GE = 51.2T
- 160 x 10GE server links (UNI) per Bn
- (32 x 160)/2 = 2560 servers @ 2 x 10GE per server
- uFIB = 16 x 48 B-MAC = 768 entries
- mFIB = 16 subnets x 48 sources = 768 entries

Total: 1536 FIB entries per node
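The sizing figures above can be re-derived with a few lines of arithmetic. The sketch below only restates the numbers from the slide; the constant and variable names are illustrative and not part of 802.1aq.

```python
# Fabric sizing sketch; all names are illustrative, the numbers come from the slide.
UPPER = 16            # A1..A16
LOWER = 32            # B1..B32
UPLINKS_PER_B = 16    # 100GE NNI uplinks per B node
UNI_PER_B = 160       # 10GE server links per B node
NNI_GBPS = 100
UNI_GBPS = 10
LINKS_PER_SERVER = 2  # each server is dual-homed at 2 x 10GE
ECT = 16              # one B-VID per ECT algorithm
SUBNETS = 16

switches = UPPER + LOWER                      # 48
nni_links = UPLINKS_PER_B * LOWER             # 512 x 100GE
fabric_tbps = nni_links * NNI_GBPS / 1000     # 51.2T
uni_links = UNI_PER_B * LOWER                 # 5120 x 10GE
servers = uni_links // LINKS_PER_SERVER       # 2560

# Non-blocking check: server-facing capacity per B node equals its uplink capacity.
non_blocking = UNI_PER_B * UNI_GBPS == UPLINKS_PER_B * NNI_GBPS

ufib = ECT * switches        # unicast entries: one per (B-VID, B-MAC)
mfib = SUBNETS * switches    # multicast entries: one per (subnet, source)
fib_total = ufib + mfib

print(switches, nni_links, fabric_tbps, servers, ufib, mfib, fib_total, non_blocking)
```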

[Figure: ECT-ALG#12, Source Node (1); labels S1 … S16.]

For a given ECT-ALGk, Aj is a member of every SPF-TREE(B*, ECT-ALGk).

When properly tuned, no two ECT-ALGorithms will use the same Aj as a fork point.

Subnet Ni maps to I-SIDj and then to a unique A(j mod 16).
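As a minimal sketch of the mapping just described, the function below sends an I-SID number j to an upper-layer node by j mod 16. The slide labels the nodes A1 … A16 but does not say how a remainder of 0 is numbered, so the "+ 1" offset and the function name are assumptions made for illustration.

```python
NUM_A = 16  # upper-layer nodes A1..A16

def transit_node(isid: int) -> str:
    """Map I-SID j to its transit node A(j mod 16).

    Assumption: remainders 0..15 are numbered A1..A16.
    """
    return f"A{isid % NUM_A + 1}"

# Sixteen consecutive I-SIDs (hypothetical values) land on sixteen distinct A nodes.
nodes = {transit_node(j) for j in range(100, 116)}
print(sorted(nodes, key=lambda n: int(n[1:])))
```

With fewer than 16 active subnets, or with I-SID values that collide modulo 16, some A nodes carry no transit traffic, which is why the slide asks for at least 16 subnets.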

[Figure: fabric with upper layer A1 … A16 and lower layer B1 … B32; I-SIDi and I-SIDj marked.]

So load spreading allows each Ai to transit a complete subnet.

Problem #1 - Unable to spread further such that Ai and Aj (i != j) each handle a subset of the flows in I-SIDj.

This is an issue under failure of Aj.

[Figure: fabric with upper layer A1 … A16 and lower layer B1 … B32; I-SIDi and I-SIDj marked, failure case.]

Recovery will move the entire subnet's traffic to another Ai node.

A preferable solution is to spread the affected load over the remaining A*.
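The difference between the two recovery behaviours can be shown with a toy load model. This is a sketch under simplifying assumptions: every A node starts with one unit of load, and the node names and helper functions are hypothetical.

```python
from fractions import Fraction

def move_whole_subnet(loads, failed, target):
    """Recovery as described: the failed node's whole subnet lands on one other A node."""
    out = {a: load for a, load in loads.items() if a != failed}
    out[target] += loads[failed]
    return out

def spread_over_survivors(loads, failed):
    """Preferred behaviour: the failed node's load is shared by all remaining A nodes."""
    survivors = [a for a in loads if a != failed]
    share = loads[failed] / len(survivors)
    return {a: loads[a] + share for a in survivors}

# Assumption: one unit of subnet load per A node before the failure.
loads = {f"A{i}": Fraction(1) for i in range(1, 17)}

moved = move_whole_subnet(loads, failed="A3", target="A4")
spread = spread_over_survivors(loads, failed="A3")

print(max(moved.values()))   # worst-case node load after moving the subnet
print(max(spread.values()))  # worst-case node load after spreading
```

In this model the worst-hit node doubles its load when the subnet moves as a block, while spreading raises every survivor by only 1/15.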

Possible solution – head end hashing (unicast only)

[Figure: fabric with upper layer A1 … A16 and lower layer B1 … B32; I-SIDi and I-SIDj marked, head end hashing case.]

Allow unicast I-SIDi and I-SIDj traffic to be hashed, at the granularity of smaller flows, to different B-VIDs (ECT-ALGorithms).

This breaks the symmetry and congruence rules but allows edge balancing at a smaller granularity. No changes to multicast. Requires learning <C-DA, B-DA>, independent of B-VID.
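A rough sketch of head end hashing follows. It is not taken from 802.1aq: the CRC32 hash, the B-VID values 4001 to 4016, the seed, and the function name are all assumptions chosen for illustration. The point is only that a flow's customer header fields deterministically select one of the 16 B-VIDs.

```python
import zlib

# Hypothetical B-VID values; the slides do not give real ones.
BVIDS = list(range(4001, 4017))  # 16 B-VIDs, one per ECT-ALGorithm

def head_end_bvid(seed, c_sa, c_da, c_vid, bvids=BVIDS):
    """Pick a B-VID for a known-unicast flow from a hash of its customer fields."""
    key = f"{seed}|{c_sa}|{c_da}|{c_vid}".encode()
    return bvids[zlib.crc32(key) % len(bvids)]

# 256 flows from one source inside a single I-SID (made-up MAC addresses).
flows = [("00:00:00:00:00:01", f"00:00:00:00:01:{i:02x}", 10) for i in range(256)]
used = {head_end_bvid(7, sa, da, vid) for sa, da, vid in flows}

print(len(used), "B-VIDs carry traffic for this one I-SID")
```

Because the forward and reverse directions of a conversation hash different (C.SA, C.DA) orderings, they can land on different B-VIDs, which is the loss of symmetry and congruence noted above.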

[Figure: fabric with upper layer A1 … A16 and lower layer B1 … B32, with unicast and multicast (Mcast) paths marked.]

Interconnection of fabrics creates more than 16 paths (exponential)

[Figure: fabrics interconnected through C1 and C2, above A1 … A16 and B1 … B32. Path counts marked by level: O(16), O(16x2), O(16x2x16).]

The number of paths can grow exponentially with increasing levels.

Any constant number of paths is << the number of paths in many networks.

Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs.
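The path counts on this slide multiply level by level, which a short calculation makes concrete. The fan-out list below simply restates the 16, 16x2, and 16x2x16 figures; the variable names are illustrative.

```python
from math import prod

ECT = 16               # equal-cost trees available today
fanouts = [16, 2, 16]  # choices available at each successive level

# Cumulative path count after each level: 16, 16x2, 16x2x16.
counts = [prod(fanouts[:k]) for k in range(1, len(fanouts) + 1)]

for levels, paths in enumerate(counts, start=1):
    covered = min(ECT, paths) / paths
    print(f"{levels} level(s): {paths} paths, at most {covered:.1%} usable with {ECT} ECT")
```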

Horizontal Growth – not too bad but need more ECT-ALGORITHMS.

[Figure: fabric grown horizontally, with A17 added to A1 … A16 and B33, B34 added to B1 … B32.]

Horizontal growth by 1 just increases the number of ECT by 1.

Not too big a problem, but we would need to define new ECT (via Opaque).

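[Figure: source S to destination D. The head end chooses a path from N x B-VID. The network is O(degree) wide at each node and O(diameter) hops long.]

#paths ~= O(degree^diameter)

So head end ECT in the worst case requires an exponential number of B-VIDs.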
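To put numbers on the worst case, the sketch below assumes the path count grows as degree raised to the diameter and that head end selection needs one B-VID per path. Both are modelling assumptions, as are the function names; the 4094 limit is the number of usable 12-bit VLAN IDs.

```python
MAX_VIDS = 4094  # usable 12-bit VLAN IDs (0 and 4095 are reserved)

def worst_case_paths(degree: int, diameter: int) -> int:
    """Assumed worst case: 'degree' choices at each of 'diameter' hops."""
    return degree ** diameter

def head_end_bvids_needed(degree: int, diameter: int) -> int:
    """Assumption: head end path selection needs one B-VID per distinct path."""
    return worst_case_paths(degree, diameter)

for diameter in (1, 2, 3):
    need = head_end_bvids_needed(16, diameter)
    verdict = "fits" if need <= MAX_VIDS else "exceeds the VID space"
    print(f"degree 16, diameter {diameter}: {need} B-VIDs ({verdict})")
```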

[Figure: source S to destination D over a single B-VID; each hop chooses a path from N x next hop.]

Re-assign traffic to a path at each hop.

Tandem “ECMP”, just like IP.

Need to keep O(degree) next hops.

Only need one B-VID, which removes O(diameter) from the state cost.

The flip side is that you have no control; you just hope for a fine-scale statistical distribution.
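Per-hop (tandem) selection can be sketched as below. The topology, the CRC32 hash, the use of the node name as a per-node seed, and the B-VID value are assumptions for illustration only; the slides do not define them.

```python
import zlib

def pick_next_hop(node_seed, bvid, flow, next_hops):
    """Hash the flow at this node and pick one of its equal-cost next hops."""
    key = f"{node_seed}|{bvid}|{flow}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

def forward(src, dst, flow, table, bvid=4001):
    """Walk hop by hop; every node makes its own independent choice."""
    path = [src]
    while path[-1] != dst:
        node = path[-1]
        path.append(pick_next_hop(node, bvid, flow, table[node][dst]))
    return path

# Toy two-layer fabric: B1 reaches B2 through any of four A nodes.
table = {
    "B1": {"B2": ["A1", "A2", "A3", "A4"]},
    "A1": {"B2": ["B2"]},
    "A2": {"B2": ["B2"]},
    "A3": {"B2": ["B2"]},
    "A4": {"B2": ["B2"]},
}

print(forward("B1", "B2", flow="C.SA=aa,C.DA=bb,C-VID=10", table=table))
```

Each node stores only its own list of next hops, so state stays at O(degree), but no node knows or controls the end-to-end path a flow takes.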

802.1aq Ingress Check is very strong in the case of a single next hop, and hence a single possible ingress for an SA.

802.1aq Ingress Check is weakened in the case of multiple next hops, and hence multiple possible ingresses for an SA.

However 802.1aq Agreement Protocol functions correctly in the context of multiple possible Next Hops for the same B-VID (refer to Mick’s proof).

But …

Is it too complex? It is clearly non-trivial; we need implementation/emulation experience.

Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’?

Is it marketable? – this is unfortunately a legitimate concern!!!

802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we require a symmetric AP and/or an on-data-path loop detection/drop mechanism.

Believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP.

Recommend we standardize a TTL TAG either stand-alone or as a new form of I-TAG.
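How a TTL would bound a forwarding loop can be sketched as follows. The slides only recommend a TTL tag; the decrement-per-hop and drop-at-zero behaviour here is an assumption modelled on IP, and the node names and function are hypothetical.

```python
def forward_with_ttl(start, next_hop, dst, ttl):
    """Forward hop by hop, decrementing TTL; drop the frame when TTL runs out."""
    node, hops = start, 0
    while node != dst:
        if ttl == 0:
            return ("dropped", hops)
        ttl -= 1
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)

# A transient loop between X and Y: without a TTL the frame would circulate forever.
looping = {"X": "Y", "Y": "X"}
print(forward_with_ttl("X", looping, dst="Z", ttl=8))

# A converged path: the frame arrives with TTL to spare.
converged = {"X": "Y", "Y": "Z"}
print(forward_with_ttl("X", converged, dst="Z", ttl=8))
```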

R1) New ECT-ALGorithms with improved spreading properties.

R2) Allow optional head end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next hop interfaces/B-VIDs. Very similar to Link Aggregation. Minimally: HASH(seed, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])

R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at the same time). Minimally: HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO])

R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence is broken.

R5) More experience with AP, emulations, simulations etc., plus the addition of a TTL to a new I-TAG or a TTL-TAG.