
Understanding Network Failures in Data Centers : Measurement, Analysis, and Implications



  1. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. Group Research 1: AKHTAR, Kamran; SU, Hao; SUN, Qiang; TANG, Yue; YANG, Xiaofan

  2. Summary This is the first large-scale analysis of failures in a data center network. Using multiple data sources commonly collected by network operators, the authors characterize failure events within data centers and estimate the failures, their impact, and the effectiveness of network redundancy. Key findings: commodity switches exhibit high reliability, which supports current proposals to design flat networks from commodity components; middleboxes such as load balancers exhibit high failure rates, highlighting the need for better ways to manage them; and, at both the network and application layers, more investigation is needed to analyze and improve the effectiveness of redundancy.

  3. OUTLINE Part 1: Introduction Part 2: Background Part 3: Methodology and Data Sets Part 4: Failure Analysis Part 5: Estimating Failure Impact Part 6: Discussion Part 7: Related Work Part 8: Conclusions and Future Work

  4. FCAPS (Fault, Configuration, Accounting, Performance, Security management): this work concerns Fault management

  5. Part 1: Introduction

  6. Data Center Networks scalable, efficient, fault tolerant, and easy-to-manage

  7. Three dimensions 1. Characterizing the most failure-prone network elements 2. Estimating the impact of failures 3. Analyzing the effectiveness of network redundancy

  8. Part 2: Background

  9. 2.1 Data center network architecture

  10. 2.2 Data center workload characteristics

  11. Part 3: Methodology and Data Sets

  12. 3.1 Existing data sets 1. Network event logs (SNMP traps / syslog) 2. NOC tickets: information about when and how events were discovered, and when they were resolved (used by operators to coordinate repair) 3. Network traffic data 4. Network topology data

  13. 3.2–3.4 Defining and identifying failures with impact Link failures Device failures “Provisioning”: no data before, some data transferred during the failure

  14. For link failures: eliminate spurious notifications; focus on measurable events. For device failures: at least one link failure within a time window of five minutes. Only failure events that impacted network traffic are considered.
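The impact criterion above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes we have per-link traffic samples (e.g. bytes/s) for windows before and during a failure, and that a failure "has impact" when the median traffic during the failure drops below the median before it.

```python
from statistics import median

def has_impact(traffic_before, traffic_during):
    """A link failure is considered impactful when median traffic on the
    link during the failure is lower than median traffic before it
    (and the link carried some traffic before the event)."""
    if not traffic_before or median(traffic_before) == 0:
        return False  # a "provisioning" link: no data before the event
    if not traffic_during:
        return True   # no data transferred at all during the failure
    return median(traffic_during) < median(traffic_before)

# Hypothetical traffic samples, for illustration only:
print(has_impact([100, 120, 110], [10, 5, 0]))  # True: traffic collapsed
print(has_impact([0, 0, 0], [0, 0, 0]))         # False: no prior traffic
```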

  15. Part 4: Failure Analysis

  16. Outline Failure event panorama Daily volume of failures Probability of failure Aggregate impact of failures Properties of failures Grouping link failures Root causes of failures

  17. Failure event panorama All failures vs. failures with impact Widespread failures Long-lived failures

  18. Daily volume of failures Link failures are variable and bursty Device failures are usually caused by maintenance (Table 4: failures per time unit)

  19. Probability of failure Load balancers have the highest failure probability ToRs have low failure rates Load balancer links have the highest rate of logged failures Management and inter-data center links have lowest failure
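The per-type failure probabilities on this slide come from a simple ratio: the fraction of the deployed population of a device type that saw at least one failure during the measurement period. A minimal sketch, with device and failure counts invented for illustration (they are not the paper's numbers):

```python
def failure_probability(num_failed, population):
    """Fraction of the deployed population of a device type that saw
    at least one failure during the measurement period."""
    return num_failed / population

# Hypothetical counts, for illustration only:
probs = {
    "LB":  failure_probability(39, 200),   # load balancers: high probability
    "ToR": failure_probability(26, 5000),  # top-of-rack switches: low
}
print(probs["LB"] > probs["ToR"])  # True
```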

  20. Aggregate impact of failures Load balancers have the most failures but ToRs have the most downtime Load balancer links experience many failure events but relatively small downtime Load balancer failures dominated by few failure prone devices

  21. Properties of failures • Time to repair • Load balancers experience short-lived failures • ToRs experience correlated failures • Inter-data center links take the longest to repair

  22. Properties of failures • Time between failures • Load balancer failures are bursty • Link flapping is absent from the actionable network logs • MGMT, CORE, and ISC links are the most reliable in time between failures

  23. Properties of failures • Reliability of network elements • Data center networks experience high availability • Links have high availability (most link types have better than four 9's of reliability)

  24. Grouping link failures • To group correlated failures: require that link failures occur in the same data center, and that failures occur within a predefined time threshold • Link failures tend to be isolated
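The grouping rule above (same data center, within a time threshold of the previous event) can be sketched as a single pass over time-sorted events. The one-hour default window here is an assumption for illustration, not the paper's threshold:

```python
def group_link_failures(events, window_s=3600):
    """Group link-failure events into correlated groups: events in the
    same data center, each within `window_s` seconds of the previous
    event in its group. `events` are (timestamp_s, datacenter, link_id)
    tuples."""
    groups = {}  # datacenter -> list of groups (each group a list of events)
    for ts, dc, link in sorted(events):
        dc_groups = groups.setdefault(dc, [])
        if dc_groups and ts - dc_groups[-1][-1][0] <= window_s:
            dc_groups[-1].append((ts, dc, link))
        else:
            dc_groups.append([(ts, dc, link)])
    return [g for dc_groups in groups.values() for g in dc_groups]

events = [
    (0,    "dc1", "linkA"),
    (60,   "dc1", "linkB"),   # within the window of linkA: same group
    (9000, "dc1", "linkC"),   # outside the window: new group
    (10,   "dc2", "linkD"),   # other data center: never grouped with dc1
]
print([len(g) for g in group_link_failures(events)])  # [2, 1, 1]
```

Most groups having size one matches the slide's observation that link failures tend to be isolated.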

  25. Root causes of failures Choose to leverage the “Problem type” field of the NOC tickets Hardware problems take longer to mitigate Load balancers affected by software problems Link failures are dominated by connection and hardware problems

  26. Part 5: Estimating Failure Impact

  27. 5.1 Is redundancy effective in reducing impact? Several reasons why redundancy may not be 100% effective: 1. bugs in fail-over mechanisms can arise if there is uncertainty as to which link or component is the backup. 2. if the redundant components are not configured correctly, they will not be able to reroute traffic away from the failed component. 3. protocol issues such as TCP backoff, timeouts, and spanning tree reconfigurations may result in loss of traffic. Network redundancy helps, but it is not entirely effective.
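Redundancy effectiveness can be illustrated by comparing the traffic ratio (median traffic during a failure over median traffic before it) at the individual-link level versus across the link's whole redundancy group. A sketch with made-up byte/s samples, for illustration only:

```python
from statistics import median

def traffic_ratio(before, during):
    """Ratio of median traffic during a failure to median traffic
    before it: 1.0 means the failure was fully masked, 0.0 means
    all traffic was lost."""
    return median(during) / median(before)

# Hypothetical samples: the failed link loses most of its traffic...
link_before, link_during = [100, 100, 100], [10, 10, 10]
# ...while its redundancy group (failed link plus backups) keeps carrying most of it.
group_before, group_during = [200, 200, 200], [180, 170, 190]

print(traffic_ratio(link_before, link_during))    # 0.1 at the link level
print(traffic_ratio(group_before, group_during))  # 0.9 at the group level
```

A group-level ratio close to 1.0 means the backups rerouted nearly all traffic; ratios well below 1.0 at the group level are what the slides mean by redundancy not being entirely effective.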

  28. 5.2 Redundancy at different layers of the network topology Links highest in the topology benefit most from redundancy. Links from ToRs to aggregation switches benefit the least from redundancy, but have low failure impact.

  29. Part 6: Discussion

  30. Low-end switches exhibit high reliability Low cost Lowest failure rates and low failure probability However, as populations of these devices grow, the absolute number of observed failures will inevitably increase.

  31. Improve reliability of middleboxes Middleboxes need to be taken into account Develop better management and debugging tools Consider software load balancers running on commodity servers Load balancer links have the highest rate of logged failures; management and inter-data center links have the lowest

  32. Improve the effectiveness of network redundancy Network redundancies in the measured system are 40% effective at masking the impact of network failures One cause: configuration issues that leave redundancy ineffective at masking failures, e.g. the backup link was subject to the same flaw as the primary

  33. Separate the control plane from the data plane The separation between control plane and data plane becomes even more crucial to avoid impact on hosted applications, as several NOC ticket cases show

  34. Part 7: Related Work

  35. Application failures Another study [1] found: The majority of failures occur during the TCP handshake as a result of end-to-end connectivity issues. Web access failures are dominated by server-side issues. These findings highlight the importance of studying failures in data centers hosting Web services.

  36. Network failures Some studies observe significant instability and flapping as a result of external routing protocols [2]. Unlike these studies, we do not observe link flapping, owing to our data sources being geared toward actionable events. Some studies find that 70% of failures involve only a single link [3]; we similarly observe that the majority of failures in data centers are isolated. Some studies also observe longer times to repair on wide-area links [4], similar to our observations for wide-area links connecting data centers.

  37. Failures in cloud computing Some studies consider the availability of distributed storage and observe that the majority of failures involving more than ten storage nodes are localized within a single rack [5]. We also observe spatial correlations, but they occur higher in the network topology: multiple ToRs associated with the same aggregation switch have correlated failures.

  38. Part 8: Conclusions and Future Work

  39. Give your own opinion about what you think is good or bad about the paper, e.g. how could it be improved?

  40. References [1] V. Padmanabhan, S. Ramabhadran, S. Agarwal, and J. Padhye. A study of end-to-end web access failures. In CoNEXT, 2006. [2] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. In SIGMETRICS, 2009. [3] A. Shaikh, C. Isett, A. Greenberg, M. Roughan, and J. Gottlieb. A case study of OSPF behavior in a large enterprise network. In ACM IMW, 2002. [4] D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage. California fault lines: Understanding the causes and impact of network failures. In SIGCOMM, 2010. [5] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In Symposium on Cloud Computing (SOCC), 2010. The figures come from the project paper, Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications.

  41. The End. THX
