
Enterprise Network Troubleshooting

Nick Feamster, Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar).


Presentation Transcript


  1. Enterprise Network Troubleshooting • Nick Feamster, Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)

  2. Three Disjoint Views of the Network • Policy: The operator’s “wish list” • Static: What the configurations say • Dynamic: The behavior that users witness • Independent analyses! [Figure: the Policy, Static, and Dynamic views, with the labels “Generation” and “Error Checking and Deployment”; tool labels: ping, traceroute, …; rancid/rcc, FIREMAN/Lumeta]

  3. A Closer Look • Proactive analysis • Fault avoidance • Policy conformance • Reactive diagnosis • Correcting network faults • Detection • Localization • Active and passive measurements • Need user’s perspective • Two studies • Routing • Firewalls • Idea: These analyses should inform each other

  4. Catastrophic Configuration Faults • “…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997 • “Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001 • “WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to ‘a route table issue.’” -- cnn.com, October 3, 2002 • “A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004

  5. Case 1: Network-Wide Routing Analysis • Proactive routing configuration analysis • Idea: Analyze configuration before deployment • Many faults can be detected with static analysis. [Figure labels: Configure, rcc, Detect Faults, Deploy]
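
As a rough illustration of the "analyze configuration before deployment" idea, the sketch below checks one kind of network-wide inconsistency: a routing session configured on one router with no matching statement on the peer. This is not rcc's implementation. The function name `find_dangling_sessions`, the dictionary layout, and the router names are hypothetical, and a real tool would first parse vendor configuration files.

```python
from typing import Dict, List, Set, Tuple


def find_dangling_sessions(configs: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (router, peer) pairs where `router` configures a session to
    `peer` but `peer` has no session configured back to `router`.

    `configs` maps each router name to the set of peers it configures
    (a hypothetical, already-parsed view of the configurations).
    """
    faults = []
    for router, peers in sorted(configs.items()):
        for peer in sorted(peers):
            # Only peers inside the analyzed network can be cross-checked.
            if peer in configs and router not in configs[peer]:
                faults.append((router, peer))
    return faults


# Hypothetical three-router network: r3 lists r2 as a peer, r2 does not list r3.
configs = {
    "r1": {"r2", "r3"},
    "r2": {"r1"},
    "r3": {"r1", "r2"},
}

faults = find_dangling_sessions(configs)
for router, peer in faults:
    print(f"{router} configures a session to {peer}, but {peer} has none back")
```

Because the check reads only the configurations, it can run before anything is deployed.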

  6. Operators Find Static Analysis Useful “That’s wicked!” -- Nicolas Strina, ip-man.net “Thanks again for a great tool.” -- Paul Piecuch, IT Manager “...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors.” -- Joe Provo, rcn.com “I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers…a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now.” -- Arnaud Le Tallanter, clara.net

  7. Yes, but Surprises Happen! • Link failures • Node failures • Traffic volumes shift • Network devices “wedged” • … • Two problems • Detection • Localization

  8. Detection: Analyze Routing Dynamics • Idea: Routers exhibit correlated behavior • Blips across signals may be more operationally interesting than any spike in one.

  9. Detection: Three Types of Events • Single-router bursts • Correlated bursts • Multi-router bursts • Common • Commonly missed using thresholds
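
The contrast between a per-router threshold and a cross-router view can be sketched with a toy detector. This is a simplified stand-in, not the analysis used in the talk. It assumes hypothetical per-router counts of routing messages in fixed time bins, and the parameters `high`, `delta`, and `min_routers` are illustrative only.

```python
from statistics import median
from typing import Dict, List


def threshold_alarms(counts: Dict[str, List[int]], high: int) -> List[int]:
    """Time bins in which at least one router's count reaches `high`."""
    n_bins = len(next(iter(counts.values())))
    return [t for t in range(n_bins)
            if any(series[t] >= high for series in counts.values())]


def correlated_alarms(counts: Dict[str, List[int]], delta: int,
                      min_routers: int) -> List[int]:
    """Time bins in which at least `min_routers` routers exceed their own
    median count by `delta` or more at the same time."""
    baselines = {router: median(series) for router, series in counts.items()}
    n_bins = len(next(iter(counts.values())))
    alarms = []
    for t in range(n_bins):
        blipping = [router for router, series in counts.items()
                    if series[t] - baselines[router] >= delta]
        if len(blipping) >= min_routers:
            alarms.append(t)
    return alarms


# Hypothetical data: r4 has one large spike (bin 2); r1, r2, and r3 show
# small simultaneous blips (bin 5) that stay far below the spike threshold.
counts = {
    "r1": [2, 2, 2, 2, 2, 6, 2, 2],
    "r2": [3, 3, 3, 3, 3, 7, 3, 3],
    "r3": [1, 1, 1, 1, 1, 5, 1, 1],
    "r4": [2, 2, 40, 2, 2, 2, 2, 2],
}

single = threshold_alarms(counts, high=20)
correlated = correlated_alarms(counts, delta=3, min_routers=3)
print("threshold alarms:", single)
print("correlated alarms:", correlated)
```

In this toy data the threshold flags only the single-router spike, while the correlated check flags the bin where three routers blip together.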

  10. Localization: Joint Dynamic/Static • Which routers are “border routers” for that burst • Topological properties of routers in the burst [Figure labels: Proactive Analysis, Deployment, Static, Dynamic, Diagnosis/Correction, Reactive Detection]
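
One possible reading of "border routers for that burst" is the set of routers inside the burst that have at least one neighbor outside it. Under that assumption, the sketch below combines dynamic information (the set of routers in a burst) with static information (the topology). The adjacency map and router names are hypothetical.

```python
from typing import Dict, Set


def border_routers(adjacency: Dict[str, Set[str]], burst: Set[str]) -> Set[str]:
    """Routers in `burst` with at least one neighbor that is not in `burst`.

    `adjacency` is a static view of the topology (router -> neighbors);
    `burst` is the set of routers observed in one dynamic event.
    """
    return {router for router in burst
            if any(neighbor not in burst
                   for neighbor in adjacency.get(router, set()))}


# Hypothetical chain topology: a - b - c - d - e
adjacency = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

burst = {"b", "c", "d"}
edges_of_burst = border_routers(adjacency, burst)
print(sorted(edges_of_burst))
```

Here `c` is interior to the burst, so only the routers on its edge are reported as candidates for where the event entered or left the affected region.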

  11. Case 2: Firewalls • Georgia Tech Campus Network • Research and Administrative Network • 180 buildings • 130+ firewalls • 1700+ switches • 55000+ ports • Problem: Availability/Reachability • Flux in firewall, router, switch configurations • No common authority over changes made

  12. Specific Focus: Firewall Configuration • Difficult to understand and audit configs • Subject to continual modifications • Roughly 1-2 touches per day • Federated policy, distributed dependencies • Each department has independent policies • Local changes may affect global behavior
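
To make "local changes may affect global behavior" concrete, the sketch below models each firewall as an ordered, first-match rule list and treats a flow as reachable only if every firewall on its path allows it. This is an illustrative model, not the campus tooling. The `Rule` type, the default-deny choice, and all addresses are assumptions.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    action: str          # "allow" or "deny"
    src: str             # source prefix in CIDR notation
    dst: str             # destination prefix in CIDR notation
    port: Optional[int]  # destination port, or None for any port


def decide(rules: List[Rule], src: str, dst: str, port: int,
           default: str = "deny") -> str:
    """First-match evaluation of one firewall's rule list."""
    for rule in rules:
        if (ip_address(src) in ip_network(rule.src)
                and ip_address(dst) in ip_network(rule.dst)
                and rule.port in (None, port)):
            return rule.action
    return default  # assumed default-deny when no rule matches


def reachable(path: List[List[Rule]], src: str, dst: str, port: int) -> bool:
    """A flow gets through only if every firewall on the path allows it."""
    return all(decide(rules, src, dst, port) == "allow" for rules in path)


# Hypothetical path: a campus-level firewall, then a department firewall.
campus = [Rule("allow", "0.0.0.0/0", "10.1.0.0/16", 443)]
dept_before = [Rule("allow", "0.0.0.0/0", "10.1.5.0/24", 443)]
# A local change: the department blocks one source prefix ahead of the allow.
dept_after = [Rule("deny", "10.2.0.0/16", "10.1.5.0/24", None)] + dept_before

before = reachable([campus, dept_before], "10.2.3.4", "10.1.5.10", 443)
after = reachable([campus, dept_after], "10.2.3.4", "10.1.5.10", 443)
print("reachable before change:", before)
print("reachable after change:", after)
```

The campus rules are identical in both cases, yet end-to-end reachability changes, which is why auditing a single firewall in isolation can miss the fault.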

  13. (Immediate) Open Issues • Reachability and reliability of controller • Service-level probes • Diagnostic tools != Service-level “Happiness” • Policy conformance
