
Troubleshooting Chronic Conditions in Large IP

Network reliability: applications demand high reliability and performance, and network outages require accurate and timely troubleshooting. Traditionally, troubleshooting has focused on hard failures, yet chronic conditions are also a big source of trouble.



  1. Troubleshooting Chronic Conditions in Large IP

  2. Network Reliability • Applications demand high reliability and performance • Network outages require accurate and timely troubleshooting • Traditionally, troubleshooting has focused on hard failures

  3. Chronic Conditions: A Big Problem • Individual events disappear before you can react to them • They keep recurring • They cause performance degradation for customers • They can even turn into serious hard failures • Examples: chronic link flaps, chronic router CPU utilization anomalies

  4. Key Points of Troubleshooting Chronic Conditions • Mine measurement data (the heart of the troubleshooting process) • Find chronic patterns • Reproduce the patterns in lab settings (if needed) • Perform software and hardware analysis (if needed) • Traditionally, chronic conditions have been troubleshot manually, which is cumbersome, time-consuming, and error-prone

  5. Troubleshooting Challenges • Massive scale: potential root causes are hidden in thousands of event-series; e.g., root causes for packet loss include link congestion (SNMP), protocol down (route data), and software errors (syslogs) • Complex spatial and topology models: cross-layer dependency; causal impact scope, local versus global (propagation through protocols) • Imperfect timing information: propagation (events take time to show impact because of timers); measurement granularity (point versus range events)

  6. NICE (Network-wide Information Correlation and Exploration) • A novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources • [Diagram labels: Chronic Symptom, Network Data, Spatial Proximity Model, Unified Data Model, Statistical Correlation, Statistically Correlated Events]

  7. Spatial Proximity Model • Hierarchical structure: captures the location of an event • Proximity distance: captures the impact scope of an event
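
The two ideas on this slide can be illustrated with a small sketch. It is not NICE's actual model: the location tuples `(router, line_card, interface)`, the adjacency map, and the function names `contains`, `hop_distance`, and `within_scope` are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical location encoding: (router, line_card, interface), coarse to fine.
Location = Tuple[str, ...]


def contains(outer: Location, inner: Location) -> bool:
    """Hierarchical structure: True if `outer` is the same as or an ancestor of `inner`."""
    return inner[:len(outer)] == outer


def hop_distance(adj: Dict[str, List[str]], src: str, dst: str) -> Optional[int]:
    """Breadth-first search for the router hop count; None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def within_scope(adj: Dict[str, List[str]], a: Location, b: Location, max_hops: int) -> bool:
    """Proximity distance: two events are candidates if their routers are close enough."""
    dist = hop_distance(adj, a[0], b[0])
    return dist is not None and dist <= max_hops


# Toy topology (assumed): r1 - r2 - r3 in a line.
topology = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
```

With `max_hops=1`, an event on `("r1", "lc0", "ge-0/0/1")` would be compared against events on `r2` but not on `r3`, which keeps the number of candidate event pairs manageable.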

  8. Unified Data Model • Facilitates easy cross-event correlations • Padding time margins to handle diverse data: convert any event-series to a range series • Common time bin to simplify correlations: convert each range series to a binary time series
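
A minimal sketch of the two conversions named on this slide, assuming events are given as `(start, end)` timestamps in seconds (a point event has `start == end`). The function names, the margin, and the bin size are illustrative assumptions, not values taken from NICE.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start, end) in seconds; a point event has start == end


def pad_events(events: List[Range], margin: float) -> List[Range]:
    """Convert point or range events into padded ranges to absorb timing error."""
    return [(start - margin, end + margin) for start, end in events]


def to_binary_series(ranges: List[Range], t0: float, t1: float, bin_size: float) -> List[int]:
    """Mark each time bin in [t0, t1) with 1 if any range overlaps it, else 0."""
    n_bins = int((t1 - t0) // bin_size)
    series = [0] * n_bins
    for start, end in ranges:
        first = max(0, int((start - t0) // bin_size))
        last = min(n_bins - 1, int((end - t0) // bin_size))
        for i in range(first, last + 1):
            series[i] = 1
    return series


# Usage: one point event (a syslog line at t=125) and one range event (400..430),
# padded by 30 s and binned into ten 60 s bins.
padded = pad_events([(125, 125), (400, 430)], margin=30)
series = to_binary_series(padded, t0=0, t1=600, bin_size=60)
```

Once every event-series is expressed on the same bins, any two of them can be compared element by element, regardless of whether the raw data were point events or ranges.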

  9. Statistical Correlation Testing • Measure statistical co-occurrence in time using the pairwise Pearson correlation coefficient • Test the significance of the correlation score using a novel circular-permutation-based significance test
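
The following sketch shows one way such a test could look on two binary time series. It is an assumption-laden illustration: the slide does not give NICE's exact scoring rule, so the z-score against all circular shifts used here and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def circular_permutation_score(x: Sequence[float], y: Sequence[float]) -> float:
    """Compare the observed correlation with correlations at every circular shift of y.

    Circular shifts keep the internal structure of y (for example, burstiness)
    while breaking its alignment with x. The result is a z-score of the
    observed correlation against the shifted ones (an assumed scoring rule).
    """
    y = list(y)
    observed = pearson(x, y)
    shifted = [pearson(x, y[k:] + y[:k]) for k in range(1, len(y))]
    mean = sum(shifted) / len(shifted)
    std = math.sqrt(sum((s - mean) ** 2 for s in shifted) / len(shifted))
    if std == 0:
        return 0.0
    return (observed - mean) / std


# Usage: a symptom series and a candidate cause that co-occur in the same bins.
symptom = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
candidate = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
score = circular_permutation_score(symptom, candidate)
```

A pair would then be reported as correlated only when the score exceeds a chosen threshold; the threshold is a tuning choice and is not specified on the slide.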

  10. Conclusions • This part focused on troubleshooting chronic conditions • Briefly introduced NICE • Any comments or questions?

  11. Automating Cross-Layer Diagnosis of Enterprise 802.11 Wireless Networks

  12. Different Standards • 802.11 -- applies to wireless LANs and provides 1 or 2 Mbps transmission in the 2.4 GHz band. • 802.11a -- an extension to 802.11 that applies to wireless LANs and provides up to 54 Mbps in the 5 GHz band. • 802.11b (also referred to as 802.11 High Rate or Wi-Fi) -- an extension to 802.11 that applies to wireless LANs and provides 11 Mbps transmission (with a fallback to 5.5, 2 and 1 Mbps) in the 2.4 GHz band. • 802.11g -- applies to wireless LANs and provides up to 54 Mbps in the 2.4 GHz band.

  13. Enterprise Wireless Networks • May comprise hundreds of distinct APs • Carefully sited and configured in accordance with an RF (radio frequency) site survey • Designed to minimize contention, maximize throughput, and provide the illusion of seamless coverage • In practice, there are numerous opportunities for disrupting or degrading a user's connectivity in an 802.11 network.

  14. Problems in the Life of a Packet • There are numerous opportunities for disrupting or degrading a user's connectivity in an 802.11 network. • Physical layer: a wide range of non-802.11 devices, from cordless phones to microwave ovens, share the 2.4 GHz spectrum. • Link layer: transmission delays and management delays. • Infrastructure support. • Transport layer.

  15. Problems Can Be Anywhere • Across layers – protocols • Even in the same layer – 802.11 {a,b,f,g,h,i,n,s} • Software incompatibilities – vendor variations • Transient or persistent – time • Radio propagates in free space – locations • Radio spreads across channels – frequencies • Shared spectrum makes it worse • APs bridge the wireless and wired worlds – infrastructure

  16. Shaman • Goal: develop a system to automatically diagnose problems in wireless networks • Pervasive data collection (Jigsaw): an extensive passive monitoring system; observes all transmissions across locations, channels, and time; provides a unified, synchronized trace of every packet transmission • Explicitly models the protocols on the critical path (DHCP, 802.11 MAC, TCP, etc.): provides a complete delay and loss breakdown for every packet transmission, across all protocol stages • Framework for diagnostic tools: uses model outputs to determine the root cause of problems; users can query on demand, and admins can be alerted

  17. Another Good Idea • The goal is to determine the various delays that an actual monitored frame encountered as it traversed the stages of the wireless network path.
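
As a rough illustration of a per-frame delay breakdown, the sketch below subtracts consecutive stage timestamps. It is not Shaman's model: the stage names, the timestamps, and the helpers `delay_breakdown` and `dominant_stage` are hypothetical.

```python
from typing import Dict, List, Tuple


def delay_breakdown(stage_times: List[Tuple[str, float]]) -> Dict[str, float]:
    """Attribute delay to each stage from (stage_name, completion_time) pairs.

    The first entry only marks the starting time; every later entry is
    charged the time elapsed since the previous entry.
    """
    breakdown: Dict[str, float] = {}
    for (_, prev_t), (name, t) in zip(stage_times, stage_times[1:]):
        breakdown[name] = t - prev_t
    return breakdown


def dominant_stage(breakdown: Dict[str, float]) -> str:
    """Return the stage that contributed the largest share of the delay."""
    return max(breakdown, key=breakdown.get)


# Hypothetical trace for one frame (times in seconds; stage names are assumptions).
trace = [
    ("arrived_at_ap", 0.000),
    ("ap_queue", 0.004),
    ("medium_access", 0.019),
    ("transmission", 0.021),
]
per_stage = delay_breakdown(trace)
worst = dominant_stage(per_stage)
```

In this made-up trace, most of the time is spent waiting for medium access, which would point toward contention rather than, say, a slow wired backhaul.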

  18. Conclusion • Modern enterprise networks are complex enough that even simple faults can be difficult to diagnose. • Good solutions exist, such as Shaman • Any comments or questions?
