NetPilot: Automating Datacenter Network Failure Mitigation

Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

Failures are Common and Harmful
  • Network failures are common
    • Very large datacenters: 10,000+ switches
  • Failures cause long down times
    • 25% of failures take 13+ hours to repair
  • Long failure duration → large revenue loss

[Figure: time from detection to repair (minutes), from six-month failure logs of production datacenters]
Previous Work
  • Conventional failure recovery takes 3 steps
  • Failure localization/diagnosis
    • [M. K. Aguilera, SOSP’03]
    • [M. Y. Chen, NSDI’04]
    • [R. R. Kompella, NSDI’05]
    • [P. Bahl, SIGCOMM’07]
    • [S. Kandula, SIGCOMM’09]…

[Figure: recovery pipeline Detection → Diagnosis → Repair, with the labels "passive", "ping" and "active"]

Automating Failure Diagnosis is Challenging
  • Root causes are deep in the network stack
  • Diagnosis involves multiple parties

Failure Diagnosis Requires Human Intervention!
  • 1. Root causes are deep in the network stack
  • 2. Diagnosis involves multiple parties

[Figure: six-month failure logs from several production DCNs]
NetPilot: Mitigating rather than Diagnosing Failures
  • Mitigate failure symptoms ASAP, at the cost of reduced capacity

[Figure: recovery pipeline with stages Detection, Automated Mitigation, Diagnosis, Repair]

NetPilot Benefits
  • Short recovery time
  • Small network disruption
  • Low operation cost

[Figure: recovery pipeline with stages Detection, Automated Mitigation, Diagnosis, Repair]

Failure Mitigation is Effective
  • Most failures can be mitigated by simple actions
  • Mitigation is feasible due to redundancy
Mitigation Made Possible by Redundancy
  • Redundancy → deactivation unlikely to partition / overload the network

[Figure: datacenter topology with layers Internet, CORE, AGG, ToR]

Outline
  • Automating failure diagnosis is challenging
  • Failure mitigation is effective
  • How to automate mitigation?
  • NetPilot evaluations
  • Conclusion
A Strawman NetPilot: Trial-and-error

[Flowchart: Network failure → Localization → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]
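The strawman loop in the flowchart can be sketched in a few lines of Python. This is only an illustration: the `Action` type, its `execute` and `roll_back` callables, and the `is_mitigated` check are hypothetical names chosen for the sketch, not part of NetPilot.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Action:
    """Hypothetical mitigation action, e.g. deactivating a link."""
    name: str
    execute: Callable[[], None]
    roll_back: Callable[[], None]


def trial_and_error(candidates: Iterable[Action],
                    is_mitigated: Callable[[], bool]) -> Optional[Action]:
    """Try candidate actions one by one, in the order given."""
    for action in candidates:
        action.execute()
        if is_mitigated():
            return action       # failure mitigated: stop here
        action.roll_back()      # no effect: undo and try the next one
    return None                 # no candidate helped
```

Because the candidates are tried blindly in list order, the time to mitigate grows with the number of wrong guesses, which is the first challenge the following slides address.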

NetPilot: Challenges & Solutions
  • Challenge 1: Blind trial-and-error takes a long time
  • Solution: Failure specific localization

[Flowchart: Network failure → Localization → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]

NetPilot: Challenges & Solutions
  • Challenge 2: Partition/overload network
  • Solution: Impact estimation

[Flowchart: Network failure → Localization → Estimate impact → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]

NetPilot: Challenges & Solutions
  • Challenge 3: Different actions have different side-effects
  • Solution: Rank actions based on impact

[Flowchart: Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]

Failure Specific Localization
  • Limited # of failure types
  • Domain knowledge improves accuracy
Example: Frame Check Sequence (FCS) Errors
  • 13% of all the failures
  • Cut-through switching
    • Forward frames before checksums are verified
  • Increase application latency
Localizing FCS Errors
  • xL: link corruption rate
  • For each link L: error frames seen on L = frames corrupted by L + frames corrupted by other links & traverse L
  • # of variables = # of equations = # of links
  • Corrupted links: xL > 0
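One way to read the slide is as a square linear system with one unknown per link. The sketch below assumes the per-link equation has the form e_L = x_L + Σ_{K≠L} share[K][L] · x_K, where e_L is the number of error frames observed on L and `share[K][L]` is the fraction of frames on link K that also traverse L. That form, the `share` matrix, the function names and the use of frame counts for x are assumptions made for illustration; the slides do not spell out the exact formula.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def localize_corrupting_links(error_frames, share, threshold=1e-6):
    """Return (x, suspects): per-link corruption and links with x > threshold.

    error_frames[l] -- error frames observed on link l
    share[k][l]     -- assumed fraction of link k's frames that traverse l
    """
    n = len(error_frames)
    # Row l encodes: e_l = x_l + sum over k != l of share[k][l] * x_k
    a = [[1.0 if k == l else share[k][l] for k in range(n)] for l in range(n)]
    x = solve_linear(a, error_frames)
    return x, [l for l in range(n) if x[l] > threshold]


# Three links; only link 1 corrupts frames, half of which reach each neighbour.
share = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
x, suspects = localize_corrupting_links([4.0, 8.0, 4.0], share)
```

All three links report error frames in this example, yet solving the system attributes the corruption to link 1 alone, which is why counting errors per link is not enough under cut-through switching.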

NetPilot Overview

[Flowchart: Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]

Impact Metrics
  • Derived from Service Level Agreement (SLA)
    • Availability: online_server_ratio
    • Packet loss: total_lost_pkt
    • Latency: max_link_utilization
      • Small link utilization → small (queuing) delay
  • Total_lost_pkt & max_link_utilization derived from utilization of individual links
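As a rough illustration of how the three metrics could be computed from per-link numbers, the sketch below uses the metric names from the slide as dictionary keys. It assumes that traffic offered to a link beyond its capacity is lost and that load and capacity share one unit; that loss model and the function signature are assumptions, not something the slides state.

```python
def impact_metrics(link_load, link_capacity, servers_online, servers_total):
    """Compute the three SLA-derived metrics for one network state.

    link_load / link_capacity map a link id to a rate in the same unit.
    Assumption: load above capacity on a link counts as lost traffic.
    """
    utilization = {
        link: link_load.get(link, 0.0) / capacity
        for link, capacity in link_capacity.items()
    }
    lost = sum(
        max(0.0, link_load.get(link, 0.0) - capacity)
        for link, capacity in link_capacity.items()
    )
    return {
        "online_server_ratio": servers_online / servers_total,
        "total_lost_pkt": lost,
        "max_link_utilization": max(utilization.values(), default=0.0),
    }


metrics = impact_metrics(
    link_load={"a": 3.0, "b": 1.0},
    link_capacity={"a": 2.5, "b": 2.0},
    servers_online=3,
    servers_total=4,
)
```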
Estimating Link Utilization
  • # of flows >> redundant paths
    • Traffic evenly distributed under ECMP
  • Estimate the load contributed by each flow on each link
  • Sum up the loads to compute utilization

[Figure: Impact Estimator with inputs Action, Traffic and Topology, and output Link utilization]
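The estimation steps above can be sketched as follows. This is a simplified model, assuming hop-count shortest paths and an even split of each flow across all equal-cost next hops at every node; the `adj`, `flows` and `disabled` inputs and the function names are hypothetical.

```python
from collections import defaultdict, deque


def hop_distances(adj, dst):
    """Breadth-first hop count from every reachable node to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def estimate_link_load(adj, flows, disabled=frozenset()):
    """Estimate the load on each directed link (u, v).

    adj      -- symmetric adjacency: node -> list of neighbours
    flows    -- iterable of (src, dst, rate)
    disabled -- links removed by a candidate action, as (u, v) pairs
    """
    live = {
        n: [m for m in nbrs if (n, m) not in disabled and (m, n) not in disabled]
        for n, nbrs in adj.items()
    }
    load = defaultdict(float)
    for src, dst, rate in flows:
        dist = hop_distances(live, dst)
        if src not in dist:
            continue  # the action partitioned src from dst
        pending = defaultdict(float, {src: rate})
        # Visit nodes from farthest to nearest so upstream load arrives first.
        for node in sorted(dist, key=dist.get, reverse=True):
            amount = pending.pop(node, 0.0)
            if amount == 0.0 or node == dst:
                continue
            next_hops = [m for m in live[node] if dist.get(m) == dist[node] - 1]
            for m in next_hops:  # even ECMP split over equal-cost next hops
                load[(node, m)] += amount / len(next_hops)
                pending[m] += amount / len(next_hops)
    return dict(load)


adj = {"tor": ["agg1", "agg2"], "agg1": ["tor", "core"],
       "agg2": ["tor", "core"], "core": ["agg1", "agg2"]}
flows = [("tor", "core", 4.0)]
before = estimate_link_load(adj, flows)
after = estimate_link_load(adj, flows, disabled={("tor", "agg1")})
```

Dividing each estimated load by the link capacity gives the utilization; comparing `before` and `after` shows how deactivating one link shifts the whole flow onto the remaining path.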

Link Utilization Estimation is Highly Accurate
  • 1-month traffic from an 8,000-server network
    • Log socket events on each server
  • Ground truth: SNMP counters
NetPilot Overview
  • Choose the action with the least impact

[Flowchart: Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? Yes: End. No: Roll back if necessary]
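Putting the steps of the flowchart together, a ranked version of the loop might look like the sketch below. The lexicographic ordering (availability first, then packet loss, then utilization) is an assumption chosen for illustration, as are all function names; the slides only say that the action with the least impact is chosen.

```python
def rank_actions(actions, estimate_impact):
    """Order actions from least to most estimated impact.

    estimate_impact(action) is assumed to return a dict with the keys
    online_server_ratio, total_lost_pkt and max_link_utilization.
    """
    def key(action):
        m = estimate_impact(action)
        # Higher availability is better; lower loss and utilization are better.
        return (-m["online_server_ratio"],
                m["total_lost_pkt"],
                m["max_link_utilization"])
    return sorted(actions, key=key)


def mitigate(actions, estimate_impact, execute, roll_back, is_mitigated):
    """Try actions in ranked order; roll back any that do not help."""
    for action in rank_actions(actions, estimate_impact):
        execute(action)
        if is_mitigated():
            return action
        roll_back(action)
    return None


# Hypothetical estimates for three candidate actions.
estimates = {
    "deactivate link 1": {"online_server_ratio": 1.0, "total_lost_pkt": 0.5,
                          "max_link_utilization": 1.0},
    "deactivate link 2": {"online_server_ratio": 1.0, "total_lost_pkt": 0.0,
                          "max_link_utilization": 0.8},
    "reboot switch":     {"online_server_ratio": 0.9, "total_lost_pkt": 0.0,
                          "max_link_utilization": 0.5},
}
ranked = rank_actions(list(estimates), estimates.get)
```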

Outline
  • Automating failure diagnosis is challenging
  • Failure mitigation is effective
  • How to automate mitigation?
    • Localization → impact estimation → ranking
  • NetPilot evaluations
    • Mitigating load imbalance
    • Mitigating FCS errors
    • Mitigating overload
  • Conclusion
Load Imbalance
  • Agg_a stops receiving traffic
  • Localize to 4 suspects

[Figure: topology with core_a, core_b, Agg_a and Agg_b]

Mitigating Load Imbalance

[Figure: topology with core_a, core_b, Agg_a and Agg_b, and traffic on the links core_a → Agg_b, core_a → Agg_a, core_b → Agg_b and core_b → Agg_a. Annotations: "Agg_a stops receiving traffic", "Reboot core_a", "Reboot Agg_a", "Detected & reboot core_b", "Mitigation confirmed", "Load evenly split"]

Fast FCS Error Mitigation
  • 3.5 hours → 15 minutes
  • Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated
  • NetPilot: deactivates 2 links in 1 trial within 15 minutes

Mitigating Link Overload
  • Mitigate overload by deactivating healthy links

[Figure: example topology (core1, core2, agg) with link loads 1.5, 1.5 and 3]

Mitigating Link Overload
  • Mitigate overload by deactivating healthy links
    • Many candidate links in production networks
    • Choose the link(s) with the least impact

[Figure: three variants of the example topology (core1, core2, agg) with link loads 0, 1, 1.5 and 3; one variant is marked "lost 0.5"]

Action Ranking Lowers Link Utilization
  • Replay 97 overload incidents due to link failures
Conclusion
  • Mitigation shortens failure recovery time
  • Simple actions are effective
  • Made possible by redundancy
  • NetPilot: automating failure mitigation
  • Recovery time: hours → minutes
  • Several mitigation scenarios deployed in Bing
Thank You!

[Figure: NetPilot recovery pipeline with stages Detection, Automated Mitigation, Diagnosis, Repair]

netpilot@microsoft.com