1 / 7

Jennifer Rexford Fall 2010 (TTh 1:30-2:50 in COS 302) COS 561: Advanced Computer Networks http://www.cs.princeton.edu/co

Network Diagnosis. Jennifer Rexford Fall 2010 (TTh 1:30-2:50 in COS 302) COS 561: Advanced Computer Networks http://www.cs.princeton.edu/courses/archive/fall10/cos561/. Networks Break (In Weird Ways). Bad things happen Reliability: link, router, firewall, DNS server, Web server

indiya
Download Presentation

Jennifer Rexford Fall 2010 (TTh 1:30-2:50 in COS 302) COS 561: Advanced Computer Networks http://www.cs.princeton.edu/co

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Network Diagnosis Jennifer Rexford Fall 2010 (TTh 1:30-2:50 in COS 302) COS 561: Advanced Computer Networks http://www.cs.princeton.edu/courses/archive/fall10/cos561/

  2. Networks Break (In Weird Ways) • Bad things happen • Reliability: link, router, firewall, DNS server, Web server • Performance : congestion, long paths, overloaded server • Not straight-forward • Selective failure (e.g., MTU mismatch, server replica) • Application problems (e.g., receive window) • Short-lived problems (e.g., convergence, incast) • Problems in other domains (e.g., downstream loss) • Unexpected causes (e.g., hot weather, software bugs) • Yet, we can approach diagnosis in a rigorous way

  3. Detecting and Diagnosing Problems • Do nothing • Rely on the network to adapt to failures • E.g., dynamic routing protocols, TCP congestion control • Doesn’t help in detecting and fixing persistent problems • Direct observation • Detailed measurement to observe problem directly • E.g., route monitoring, fault logs, … • High overhead and works only for problems you know • Inference • Infer the root causes from indirect observations • Common attributes of the observed failures, and uncommon attributes of the things that don’t fail

  4. Fault Localization in a Single Domain • Failures are often correlated • Links connected to same router or traversing same fiber • Routers using same power supply or software version • Inputs • Shared risk link groups • Group of failed components • Output • Most likely root cause • Practical challenge: dirty data • Lost failure-reporting messages • Inaccurate model of risk groups

  5. Fault Localization in Path-Vector Routing • Routing changes are correlated • A single link failure causes multiple routing changes • … for all paths that traverse the failed edge • Inputs • No knowledge of the underlying topology • Path changes viewed from several vantage points • Output • Link(s) responsible for the changes • Practical challenges • Incomplete data, multiple failures • Complex routing policies “1 3 5” 1 2 3 “1 4 5” 4 5

  6. Link-Level Parameter Estimation • Path performance is correlated • Path performance is affected by each link in the path • Many paths have (some) common links • Inputs • Network topology and routes • Path-level observations of packet loss, delay, … • Outputs • Estimate of link parameters • Practical challenges: noise • Time-varying link properties 5% loss 1% loss

  7. Path-Level Traffic Intensity Estimation • Link loads are correlated • Each ingress-egress pair imparts load on all the links along a path • Inputs • Network topology and routes • Total traffic load on each link • Outputs • Offered load for each ingress-egress pair • Practical challenge • Under-constrained inference problem 0 5 10 0 15 0 0

More Related