zUpdate: Updating Data Center Networks with Zero Loss

Hongqiang Harry Liu (Yale University), Xin Wu (Duke University), Ming Zhang, Lihua Yuan, Roger Wattenhofer, Dave Maltz (Microsoft)



DCN is constantly in flux
  • Upgrade / reboot
  • New switch
  • Traffic flows
  • Virtual machines

Network updates are painful for operators

Bob: an operator carrying out a switch upgrade

  • Complex planning. Two weeks before the update, Bob has to:
    • Coordinate with application owners
    • Prepare a detailed update plan
    • Review and revise the plan with colleagues
  • Unexpected performance faults. On the night of the update, Bob executes the plan by hand, but:
    • Application alerts are triggered unexpectedly
    • Switch failures force him to backpedal several times
  • Laborious process. Eight hours later, Bob is still stuck with the update:
    • No sleep overnight
    • Numerous application complaints
    • No quick fix in sight

Congestion-free DCN update is the key
  • Applications want network updates to be seamless
    • Reachability
    • Low network latency (propagation, queuing)
    • No packet drops
  • Congestion-free updates are hard
    • Many switches are involved
    • Multi-step plan
    • Different scenarios have distinct requirements
    • Interactions between network and traffic demand changes

A Clos network with ECMP

All switches: Equal-Cost Multi-Path (ECMP)

Link capacity: 1000

[Figure: Clos topology annotated with flow loads of 150, 300, and 600 on individual links; the most heavily loaded link carries 620 + 150 + 150 = 920]

Switch upgrade: a naïve solution triggers congestion

Link capacity: 1000

[Figure: draining AGG1 under plain ECMP raises the load on the bottleneck link from 620 + 150 + 150 = 920 to 620 + 300 + 150 = 1070, above the link capacity]

Switch upgrade: a smarter solution seems to be working

Link capacity: 1000

[Figure: with AGG1 drained, weighted ECMP replaces one 300/300 split with a 500/100 split, so the bottleneck link carries 620 + 300 + 50 = 970 instead of 1070]

Traffic distribution transition

[Figure: initial traffic distribution (congestion-free; splits 300/300 and 300/300) → transition (?) → final traffic distribution (congestion-free; splits 0/600 and 500/100)]

Simple? NO! Switch updates are asynchronous.

Asynchronous changes can cause transient congestion

When ToR1 has been changed but ToR5 has not yet:

Link capacity: 1000

[Figure: AGG1 drained; ToR1 already sends 0/600 while ToR5 still splits 300/300, so the bottleneck link carries 620 + 300 + 150 = 1070]
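
The arithmetic on this slide can be checked by enumerating every combination of "already changed" and "not yet changed" for the two ToR switches. The sketch below is only an illustration: the function name and dictionary layout are hypothetical, and the per-switch contributions to the bottleneck link (150 or 300 for ToR1, 150 or 50 for ToR5, on top of 620 of other traffic) are read off the sums shown on the slides.

```python
from itertools import product

LINK_CAPACITY = 1000
BACKGROUND_LOAD = 620  # load on the bottleneck link that the update does not touch

# Contribution of each ToR to the bottleneck link before ("old") and after
# ("new") its own change, taken from the sums on the slides:
# 620 + 150 + 150 = 920, 620 + 300 + 150 = 1070, 620 + 300 + 50 = 970.
CONTRIBUTION = {
    "ToR1": {"old": 150, "new": 300},
    "ToR5": {"old": 150, "new": 50},
}


def transient_loads():
    """Return {(ToR1 state, ToR5 state): bottleneck load} for all four combinations."""
    loads = {}
    for s1, s5 in product(("old", "new"), repeat=2):
        loads[(s1, s5)] = (BACKGROUND_LOAD
                           + CONTRIBUTION["ToR1"][s1]
                           + CONTRIBUTION["ToR5"][s5])
    return loads


if __name__ == "__main__":
    for state, load in transient_loads().items():
        verdict = "CONGESTED" if load > LINK_CAPACITY else "ok"
        print(state, load, verdict)
```

Only the combination in which ToR1 is new and ToR5 is still old exceeds the capacity, which is exactly the case the slide highlights. Because the switches apply changes asynchronously, the operator cannot rule that combination out.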

Solution: introducing an intermediate step

[Figure: initial (splits 300/300 and 300/300) → intermediate (splits 200/400 and 450/150) → final (splits 0/600 and 500/100)]

Each of the two transitions (initial → intermediate, intermediate → final) is congestion-free regardless of asynchronization.
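
A quick way to see why the intermediate step helps is to compute the worst-case load on the bottleneck link for each transition. The sketch below rests on two assumptions that are not stated on the slides: that the bottleneck link receives half of the second number in each split (consistent with 300 → 150, 600 → 300 and 100 → 50 in the earlier sums), and that 620 units of other traffic stay fixed. It checks this one link only, and all names are hypothetical.

```python
LINK_CAPACITY = 1000
BACKGROUND_LOAD = 620  # traffic on the bottleneck link that the update does not move

# Second number of each ToR's split at every step of the plan (from the slide).
PLAN = [
    {"ToR1": 300, "ToR5": 300},  # initial
    {"ToR1": 400, "ToR5": 150},  # intermediate
    {"ToR1": 600, "ToR5": 100},  # final
]


def worst_case_load(before, after):
    """Worst bottleneck load while switches move from `before` to `after` in any order.

    Each switch is independently in its old or new state, so the worst case
    takes the larger of the two contributions for every switch.
    ASSUMPTION: half of each listed amount reaches the bottleneck link.
    """
    return BACKGROUND_LOAD + sum(max(before[t], after[t]) / 2 for t in before)


def plan_worst_cases(plan):
    """Worst-case bottleneck load for each consecutive pair of steps in `plan`."""
    return [worst_case_load(a, b) for a, b in zip(plan, plan[1:])]


if __name__ == "__main__":
    print("with intermediate step:", plan_worst_cases(PLAN))
    print("one step:", plan_worst_cases([PLAN[0], PLAN[-1]]))
```

Under these assumptions both transitions of the two-step plan stay at or below 1000 on this link, while the direct initial → final transition can reach 1070.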

How zUpdate performs congestion-free update

[Figure: workflow. The operator gives zUpdate an update scenario and its update requirements. zUpdate takes the current traffic distribution of the data center network, computes a sequence of intermediate traffic distributions ending at the target traffic distribution, and applies each one to the network as routing weight reconfigurations.]

Key technical issues
  • Describing traffic distribution
  • Representing update requirements
  • Defining conditions for congestion-free transition
  • Computing an update plan
  • Implementing an update plan
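Describing traffic distribution

$l^f_{v,u}$: flow $f$'s load on the link from switch $v$ to switch $u$

[Figure: example topology annotated with per-link flow loads of 150, 300, and 600]

Traffic distribution: the set of loads $l^f_{v,u}$ over all flows $f$ and all links $e_{v,u}$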
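
One simple way to hold such a traffic distribution in code is a mapping from (flow, v, u) to a load. The sketch below is an illustrative data layout only; the function name, flow names, and example numbers are hypothetical and not taken from zUpdate.

```python
from collections import defaultdict


def link_loads(distribution):
    """Sum per-flow loads into a total load per directed link (v, u).

    `distribution` maps (flow, v, u) -> that flow's load on the link v -> u.
    """
    totals = defaultdict(float)
    for (_flow, v, u), load in distribution.items():
        totals[(v, u)] += load
    return dict(totals)


# Hypothetical example: flow f1 is split evenly over two aggregation
# switches, and a second flow f2 shares one of those links.
example = {
    ("f1", "ToR1", "AGG1"): 300,
    ("f1", "ToR1", "AGG2"): 300,
    ("f2", "ToR1", "AGG2"): 150,
}

if __name__ == "__main__":
    print(link_loads(example))
```

Keeping the flow in the key preserves per-flow information, which matters later because the congestion-free condition is stated per flow and per link rather than on link totals alone.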

Representing update requirements

  • To upgrade switch $s_2$ (drain $s_2$):
    • Constraint: $\forall f, v:\ l^f_{v,s_2} = 0$
  • To restore ECMP when $s_2$ recovers:
    • Constraint: $\forall f, v:\ l^f_{v,u_1} = l^f_{v,u_2}$ for any two next hops $u_1, u_2$ of $v$
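
The two requirements on this slide (no load into a drained switch, equal splits when ECMP is restored) can be expressed as predicates over a traffic distribution. The sketch below assumes a distribution stored as a dict from (flow, v, u) to load; that layout, the function names, and the tolerance parameter are illustrative assumptions, not zUpdate's interface.

```python
def is_drained(distribution, switch, tol=1e-9):
    """True if no flow places load on any link entering `switch`."""
    return all(load <= tol
               for (_flow, _v, u), load in distribution.items()
               if u == switch)


def is_equal_split(distribution, switch, tol=1e-9):
    """True if every flow leaving `switch` is split equally over its listed next hops.

    Only next hops that appear as keys are considered, so a next hop carrying
    zero load must be listed explicitly with load 0.
    """
    per_flow = {}
    for (flow, v, _u), load in distribution.items():
        if v == switch:
            per_flow.setdefault(flow, []).append(load)
    return all(max(loads) - min(loads) <= tol for loads in per_flow.values())


# Hypothetical distributions for a single flow leaving ToR1.
ecmp = {("f", "ToR1", "AGG1"): 300, ("f", "ToR1", "AGG2"): 300}
drained = {("f", "ToR1", "AGG1"): 0, ("f", "ToR1", "AGG2"): 600}

if __name__ == "__main__":
    print(is_drained(drained, "AGG1"), is_equal_split(ecmp, "ToR1"))
```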

Switch asynchronization exponentially inflates the possible load values

Transition from old traffic distribution to new traffic distribution

[Figure: flow f enters at ingress switch 1, crosses switches 2 to 7, and leaves at egress switch 8]

Asynchronous switch updates can produce exponentially many possible load values on a single link during a transition.

In large networks, it is infeasible to check every possible load value against the link capacity.
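
The blow-up can be reproduced on a small example by letting every splitting switch independently use either its old or its new rule and collecting the loads that result on one link. The topology below (1 → {2,3} → {4,5} → {6,7} → 8) is a hypothetical reading of the eight-switch figure, and the split fractions are arbitrary illustrative values, not numbers from the slides.

```python
from itertools import product

# ASSUMED layered topology: 1 -> {2,3} -> {4,5} -> {6,7} -> 8.
NEXT_HOPS = {1: (2, 3), 2: (4, 5), 3: (4, 5), 4: (6, 7), 5: (6, 7), 6: (8,), 7: (8,)}

# Fraction of traffic each splitting switch sends to its FIRST next hop,
# as (old, new). Arbitrary values chosen for illustration.
SPLIT = {1: (0.5, 0.2), 2: (0.5, 0.9), 3: (0.5, 0.3), 4: (0.5, 0.6), 5: (0.5, 0.1)}


def load_on_link(versions, link, demand=1.0):
    """Load on `link` when switch s applies rule versions[s] (0 = old, 1 = new)."""
    arriving = {s: 0.0 for s in range(1, 9)}
    arriving[1] = demand
    link_load = {}
    for s in sorted(NEXT_HOPS):  # switch ids are already in topological order
        hops = NEXT_HOPS[s]
        if len(hops) == 1:
            shares = (1.0,)
        else:
            first = SPLIT[s][versions[s]]
            shares = (first, 1.0 - first)
        for hop, share in zip(hops, shares):
            link_load[(s, hop)] = arriving[s] * share
            arriving[hop] += arriving[s] * share
    return link_load[link]


def possible_loads(link):
    """Distinct loads on `link` over all 2**len(SPLIT) old/new combinations."""
    values = set()
    for bits in product((0, 1), repeat=len(SPLIT)):
        versions = dict(zip(sorted(SPLIT), bits))
        values.add(round(load_on_link(versions, link), 9))
    return values


if __name__ == "__main__":
    print(len(possible_loads((1, 2))), "possible loads on link 1->2")
    print(len(possible_loads((7, 8))), "possible loads on link 7->8")
```

A link right after the ingress depends on one switch and sees only two values, while a link near the egress depends on every upstream splitting switch, so the number of cases to check grows exponentially with the number of such switches.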

Two-phase commit reduces the possible load values to two

Transition from old traffic distribution to new traffic distribution

  • With two-phase commit, f's load on a link has only two possible values throughout a transition: its load under the old traffic distribution or its load under the new traffic distribution

[Figure: the same topology (ingress switch 1, switches 2 to 7, egress switch 8); a version flip at the ingress switches flow f from the old rules to the new rules]

Flow asynchronization exponentially inflates the possible load values

[Figure: flows f1 and f2 cross the same topology (switches 1 to 8) and share a link, whose load is the sum of f1's and f2's loads on it]

Asynchronous updates to N independent flows can result in up to $2^N$ possible load values on a link

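Handling flow asynchronization

[Figure: flows f1 and f2 on the same topology (switches 1 to 8), sharing a link]

Basic idea: with two-phase commit, flow $f$'s load on link $e_{v,u}$ during a transition never exceeds $\max\{l^f_{v,u},\ l'^f_{v,u}\}$, where $l^f_{v,u}$ is its load under the old traffic distribution and $l'^f_{v,u}$ its load under the new one.

[Congestion-free transition constraint] There is no congestion throughout a transition if and only if:

$$\forall e_{v,u}:\quad \sum_{f} \max\{\, l^f_{v,u},\ l'^f_{v,u} \,\} \le c_{v,u}$$

$c_{v,u}$: the capacity of link $e_{v,u}$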
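
The condition on this slide compares, for every link, the sum over flows of the larger of each flow's old and new load against the link capacity. The sketch below implements that check under an assumed data layout (dicts keyed by (flow, v, u) and by (v, u)); the link name "A" → "B" and the flow names are hypothetical, while the load values come from the drain-AGG1 example earlier in the deck.

```python
def congestion_free(old, new, capacity):
    """Check that, on every link, sum over flows of max(old, new load) <= capacity.

    old, new: dicts mapping (flow, v, u) -> load on link v -> u.
    capacity: dict mapping (v, u) -> link capacity.
    """
    worst = {}
    for key in set(old) | set(new):
        _flow, v, u = key
        worst_flow_load = max(old.get(key, 0.0), new.get(key, 0.0))
        worst[(v, u)] = worst.get((v, u), 0.0) + worst_flow_load
    return all(load <= capacity[link] for link, load in worst.items())


# Bottleneck link from the drain-AGG1 example; "A" -> "B" is a placeholder name.
capacity = {("A", "B"): 1000}
initial = {("bg", "A", "B"): 620, ("f1", "A", "B"): 150, ("f5", "A", "B"): 150}
final = {("bg", "A", "B"): 620, ("f1", "A", "B"): 300, ("f5", "A", "B"): 50}
only_f5_moved = {("bg", "A", "B"): 620, ("f1", "A", "B"): 150, ("f5", "A", "B"): 50}

if __name__ == "__main__":
    print(congestion_free(initial, final, capacity))          # worst case 1070
    print(congestion_free(initial, only_f5_moved, capacity))  # worst case 920
```

The check is linear in the number of (flow, link) pairs, in contrast to enumerating every combination of updated and not-yet-updated flows.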

Computing congestion-free transition plan

Linear programming:
  • Constant: current traffic distribution
  • Variables: intermediate traffic distributions (one per intermediate step)
  • Variable: target traffic distribution
  • Constraint: congestion-free
  • Constraint: update requirements
  • Constraint:
    • Deliver all traffic
    • Flow conservation
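
The congestion-free constraint contains a maximum, which is not linear as written. One standard way to place such a constraint in a linear program is to bound each maximum with an auxiliary variable. The block below sketches that encoding for one step of the plan; the auxiliary variable $m^f_{v,u}$ is introduced here for illustration, and the slides do not say which encoding zUpdate itself uses.

```latex
% l^f_{v,u}:  flow f's load on link e_{v,u} before the step
% l'^f_{v,u}: flow f's load on link e_{v,u} after the step
% c_{v,u}:    capacity of link e_{v,u}
% m^f_{v,u}:  auxiliary variable (an assumption of this sketch)
\begin{align*}
  m^f_{v,u} &\ge l^f_{v,u}        && \forall f,\ \forall e_{v,u} \\
  m^f_{v,u} &\ge l'^f_{v,u}       && \forall f,\ \forall e_{v,u} \\
  \sum_{f} m^f_{v,u} &\le c_{v,u} && \forall e_{v,u}
\end{align*}
```

Any feasible point satisfies $\sum_f \max\{l^f_{v,u}, l'^f_{v,u}\} \le \sum_f m^f_{v,u} \le c_{v,u}$, and conversely choosing $m^f_{v,u} = \max\{l^f_{v,u}, l'^f_{v,u}\}$ is feasible whenever the original constraint holds, so both formulations admit the same traffic distributions.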
Implementing an update plan

  • Computation time
  • Switch table size limit
  • Update overhead
  • Failure during transition
  • Traffic demand variation

[Figure: critical flows (flows traversing bottleneck links) are routed with weighted ECMP; other flows are routed with ECMP]

Evaluations
  • Testbed experiments
  • Large-scale trace-driven simulations
Testbed setup

[Figure: testbed topology with AGG1 to be drained; traffic sources labeled ToR6,7: 6.2 Gbps, ToR5: 6 Gbps, ToR8: 6 Gbps]

zUpdate achieves congestion-free switch upgrade

[Figure: testbed measurements for the two-step plan, with traffic splits
  initial: 3 Gbps / 3 Gbps and 3 Gbps / 3 Gbps
  intermediate: 2 Gbps / 4 Gbps and 4.5 Gbps / 1.5 Gbps
  final: 0 / 6 Gbps and 5 Gbps / 1 Gbps]

One-step update causes transient congestion

[Figure: testbed measurements for the one-step plan, with traffic splits
  initial: 3 Gbps / 3 Gbps and 3 Gbps / 3 Gbps
  final: 0 / 6 Gbps and 5 Gbps / 1 Gbps]

Large-scale trace-driven simulations

[Figure: a production DCN topology carrying flows; test flows (1%)]

zUpdate beats alternative solutions

[Figure: bar chart of transition loss rate and post-transition loss rate (Loss Rate (%), axis from 0 to 15) for zUpdate, zUpdate-OneStep, ECMP-OneStep, and ECMP-Planned, annotated with the number of steps each plan takes (#step: 1, 2, or 300+)]

Conclusion
  • Switch and flow asynchronization can cause severe congestion during DCN updates
  • We present zUpdate for congestion-free DCN updates
    • Novel algorithms to compute update plan
    • Practical implementation on commodity switches
    • Evaluations in real DCN topology and update scenarios

The End

Updating DCN is a painful process

[Figure: interactive applications question Bob, the operator, about a switch upgrade:]
  • Any performance disruption?
  • How bad will the latency be?
  • How long will the disruption last?
  • What servers will be affected?

Bob: "Uh?…"

Network update: a tussle between applications and operators
  • Applications want network update to be fast and seamless
    • Update can happen on demand
    • No performance disruption during update
  • Network update is time consuming
    • Today, an update is planned and executed by hand
    • Rolling back in unplanned cases
  • Network update is risky
    • Human errors
    • Accidents
Challenges in congestion-free DCN update
  • Many switches are involved
  • Multi-step plan
  • Different scenarios have distinct requirements
    • Switch upgrade/failure recovery
    • New switch on-boarding
    • Load balancer reconfiguration
    • VM migration
  • Coordination between changes in routing (network) and traffic demand (application)

Related work
  • SWAN [SIGCOMM’13]
    • Maximizing network utilization
    • Tunnel-based traffic engineering
  • Reitblatt et al. [SIGCOMM’12]
    • Control plane consistency during network updates
    • Per-packet and per-flow consistency cannot guarantee the absence of congestion
  • Raza et al. [ToN’2011], Ghorbani et al. [HotSDN’12]
    • Each targets one specific scenario (IGP update, VM migration)
    • One link weight change or one VM migration at a time