Toward Optimal Network Fault Correction via End-to-End Inference

Patrick P. C. Lee, Vishal Misra, Dan Rubenstein. Distributed Network Analysis (DNA) Lab, Columbia University. May 9, 2007

Outline • Motivation • Framework for end-to-end inference • Inference algorithm • Performance evaluation • Conclusions

Motivation • Goal: Correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable. • Example: overlays across externally managed nodes [Figure: overlay delivering a data stream from a server; receiver labels: "OK!", "No data?", "No data?"]

Problem • What should an administrator do if some paths fail to deliver data? • What the administrator knows: • some nodes on the faulty paths must have failed • What the administrator doesn’t know: • which nodes on the paths failed • how many nodes on the paths failed • reasons the nodes failed • Solution: Checking, via a series of sanity tests, the nodes that potentially failed, and repairing those that did.

Constraints • Checking and repairing a node incurs a cost • e.g., wages and man-hours of support staff, or cost of test equipment • Such a cost can vary widely from node to node • e.g., service providers may charge different fees for checking nodes

Objective • Assume each node i has a priori known • failure probability pi: the likelihood that node i has failed • checking cost ci: the cost needed to perform sanity tests on node i • Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes, i.e., minimize ∑i ci · Pr(node i is actually checked) over all sequences of nodes to be checked
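
The objective can be evaluated for one fixed checking sequence by enumerating failure scenarios. The sketch below is only an illustration under stated assumptions: nodes fail independently, paths are re-monitored after every check, a node lying on a good path is skipped as known good, and repair cost is folded into ci. The names expected_cost and cost_in_scenario are hypothetical, and the enumeration is exponential in the number of nodes.

```python
from itertools import product

def cost_in_scenario(order, paths, bad, c):
    """Checking cost of `order` when exactly the nodes in `bad` have failed."""
    bad = set(bad)
    cost = 0.0
    for n in order:
        bad_paths = [path for path in paths if bad & set(path)]
        if not bad_paths:
            break  # every path delivers data again
        # Assumption: any node on a currently good path is known to be good.
        good_nodes = {m for path in paths if not (bad & set(path)) for m in path}
        if n in good_nodes or not any(n in path for path in bad_paths):
            continue  # no need to check this node
        cost += c[n]    # pay the checking cost
        bad.discard(n)  # repair the node if it was faulty
    return cost

def expected_cost(order, paths, p, c):
    """Sum of c_i * Pr(node i is checked), conditioned on all paths being bad.

    `order` is expected to list every node that appears on some path.
    """
    nodes = sorted({n for path in paths for n in path})
    total_weight = 0.0
    total_cost = 0.0
    for states in product([False, True], repeat=len(nodes)):
        bad = {n for n, is_bad in zip(nodes, states) if is_bad}
        # Keep only scenarios consistent with the observation (all paths bad).
        if not all(bad & set(path) for path in paths):
            continue
        weight = 1.0
        for n in nodes:
            weight *= p[n] if n in bad else 1 - p[n]
        total_weight += weight
        total_cost += weight * cost_in_scenario(order, paths, bad, c)
    return total_cost / total_weight
```

For small inputs, the sequence with the lowest expected cost can be found by calling expected_cost on every permutation of the nodes.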

End-to-End Inference • End-to-end inference approach for correcting data-path failures: • Input: network topology • Monitor paths • Bad paths exist? If no, done. • If yes, select the nodes to check, check them, repair the identified bad nodes, and monitor the paths again. • Open question: how to select nodes to check?

How to Select Nodes to Check? • Suppose that we check one node at a time. • Most-Likely Fault (MLF) approach • First check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data. Does the MLF approach necessarily minimize the expected total checking cost?
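
The conditional failure probability used by the MLF approach can be computed by brute-force enumeration on small inputs. This is a minimal sketch assuming independent node failures and the observation that every listed path is bad; the function names are hypothetical and not taken from the talk.

```python
from itertools import product

def conditional_failure_probs(paths, p):
    """Pr(node i is bad | every path in `paths` is bad), for each node i."""
    nodes = sorted({n for path in paths for n in path})
    evidence = 0.0                   # Pr(all paths bad)
    joint = {n: 0.0 for n in nodes}  # Pr(node bad and all paths bad)
    for states in product([False, True], repeat=len(nodes)):
        bad = {n for n, is_bad in zip(nodes, states) if is_bad}
        if not all(bad & set(path) for path in paths):
            continue  # scenario contradicts the observation
        weight = 1.0
        for n in nodes:
            weight *= p[n] if n in bad else 1 - p[n]
        evidence += weight
        for n in bad:
            joint[n] += weight
    return {n: joint[n] / evidence for n in nodes}

def most_likely_fault(paths, p):
    """Node that the MLF approach would check first (ties: first in sorted order)."""
    cond = conditional_failure_probs(paths, p)
    return max(cond, key=cond.get)
```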

Example – Why the MLF Scheme is not Optimal? • No, the MLF scheme is not optimal in general. • Two data paths are given. Both failed to deliver data. • Nodes have: • different failure probabilities • same checking cost. • The conditional failure probabilities can be determined accordingly. [Figure: two bad paths over nodes 1–4, with failure-probability labels 0.45, 0.3, 0.5, 0.6.]

Example – Why the MLF Scheme is not Optimal? • Findings: • Node 3 has the highest conditional failure probability. • However, a brute-force search shows that checking node 1 first is optimal (even though all nodes have the same checking cost). • Intuition: • Node 3 affects only one path, but node 1 affects both paths. • We may repair both paths by only checking node 1. [Figure: two bad paths over nodes 1–4, with failure-probability labels 0.45, 0.3, 0.5, 0.6.]

Our Contributions • Propose an end-to-end inference approach for correcting all data-path failures. • Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost. • Show via simulation that our inference approach has a smaller expected cost than the prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].

Topologies • Topologies that we consider: a single tree and multiple trees. • We prove optimality results for a tree, and propose heuristics for multiple trees.

Finding Good/Bad Paths • For each data path, • Good – if the data path has no faulty node and can deliver data • Bad – if the data path has at least one faulty node and cannot deliver data • Assumption: • Each node has the same data-forwarding behavior across all paths upon which it lies. • This implies if a node lies on at least one good path, it is a non-faulty (good) node.

Forming a Bad Tree • Monitor data streams from the root node 1 to each of the leaf nodes 6, 7, 8, 9. • Keep only bad paths, and remove any nodes that are known to be good. • Bad tree: a tree in which every path is a bad path. [Figure: tree rooted at node 1 with nodes 1–9; the paths to the leaves are labeled bad or good (three bad, one good); the resulting bad tree contains nodes 3, 5, 6, 8, 9.]
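
The two steps (keep only bad paths, remove nodes known to be good) can be sketched with plain set operations. The function name and the topology in the usage lines are hypothetical; the topology is chosen for illustration only, since the slide's figure is not recoverable from the text.

```python
def form_bad_tree(paths, delivered):
    """Return (good_nodes, bad_tree).

    paths:     dict leaf -> list of nodes from the root to that leaf
    delivered: dict leaf -> True if the data stream reached that leaf
    """
    # A node on at least one good path is good (same behavior on all paths).
    good_nodes = {n for leaf, path in paths.items() if delivered[leaf] for n in path}
    bad_tree = {}
    for leaf, path in paths.items():
        if not delivered[leaf]:
            # Keep the bad path, minus nodes already known to be good.
            bad_tree[leaf] = [n for n in path if n not in good_nodes]
    return good_nodes, bad_tree

# Hypothetical topology: root 1, leaves 6, 7, 8, 9; only leaf 7 receives data.
paths = {6: [1, 3, 6], 7: [1, 2, 4, 7], 8: [1, 3, 5, 8], 9: [1, 3, 5, 9]}
delivered = {6: False, 7: True, 8: False, 9: False}
good_nodes, bad_tree = form_bad_tree(paths, delivered)
# good_nodes == {1, 2, 4, 7}
# bad_tree == {6: [3, 6], 8: [3, 5, 8], 9: [3, 5, 9]}
```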

Inference Algorithm • Our inference algorithm selects which nodes to check: • Each node i is associated with a potential function: Φ(i) = [Pr(T | Xi, Ai) · pi] / [ci (1 – pi)] • pi = failure probability of node i • ci = checking cost of node i • Pr(T | Xi, Ai) = conditional probability of having a bad tree • T = the event that the tree is a bad tree • Xi = the event that node i is bad • Ai = the event that ancestors of node i are good • Intuitively, we should first check a node with high pi and small ci, i.e., a node with high potential.
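
A brute-force sketch of the potential, reading the slide's formula as Φ(i) = Pr(T | Xi, Ai) · pi / (ci (1 – pi)). It assumes independent node failures and a bad tree given as a list of root-to-leaf paths; the function names are hypothetical, and the enumeration is only practical for small trees.

```python
from itertools import product

def prob_bad_tree_given(i, paths, p):
    """Pr(T | X_i, A_i): every path is bad, given node i bad and its ancestors good."""
    nodes = sorted({n for path in paths for n in path})
    ancestors = set()
    for path in paths:
        if i in path:
            ancestors.update(path[:path.index(i)])
    # Nodes whose state is not fixed by the conditioning events.
    free = [n for n in nodes if n != i and n not in ancestors]
    prob = 0.0
    for states in product([False, True], repeat=len(free)):
        bad = {i} | {n for n, is_bad in zip(free, states) if is_bad}
        if all(bad & set(path) for path in paths):
            weight = 1.0
            for n, is_bad in zip(free, states):
                weight *= p[n] if is_bad else 1 - p[n]
            prob += weight
    return prob

def potential(i, paths, p, c):
    """Potential of node i; requires p[i] < 1 and c[i] > 0."""
    return prob_bad_tree_given(i, paths, p) * p[i] / (c[i] * (1 - p[i]))
```

On a single path, Pr(T | Xi, Ai) equals 1 for every node, so the potential reduces to pi / (ci (1 – pi)).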

Inference Algorithm • Candidate node • On each bad path, one node has the highest potential. We call this node a candidate node. • Example of identifying candidate nodes: [Figure: bad tree with nodes 3, 5, 6, 8, 9.] • Main theorem • To minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first.
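
Given a potential for every node, identifying candidate nodes is a per-path maximum. In this sketch the bad paths and the potential values are made-up numbers for illustration, not values from the talk, and candidate_nodes is a hypothetical name.

```python
def candidate_nodes(bad_paths, potential):
    """Return the set holding, for each bad path, its highest-potential node."""
    # Ties on a path are broken in favor of the node closest to the root.
    return {max(path, key=potential.get) for path in bad_paths}

# Hypothetical bad tree (as root-to-leaf paths) and hypothetical potentials.
bad_paths = [(3, 6), (3, 5, 8), (3, 5, 9)]
potential = {3: 0.4, 5: 0.9, 6: 0.2, 8: 0.1, 9: 1.2}
candidates = candidate_nodes(bad_paths, potential)
# candidates == {3, 5, 9}
```

By construction, every bad path contains at least one candidate node.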

Inference Algorithm • For some special cases, we know which candidate node should be checked first to minimize the expected cost. • Examples of the special cases: • A path • Check the node with the highest pi / (ci (1 – pi)) first • A tree in which nodes have a fixed failure probability and a fixed checking cost • Check the root node first
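
The path case can be checked numerically on small inputs. The sketch below assumes independent failures and that checking stops as soon as the path delivers data again, so the k-th node in the sequence is checked only if some node from position k onward is bad. The function names and the sample numbers are hypothetical.

```python
from itertools import permutations
from math import prod

def ratio(i, p, c):
    """p_i / (c_i (1 - p_i)), the quantity used to order nodes on a path."""
    return p[i] / (c[i] * (1 - p[i]))

def path_order(nodes, p, c):
    """Checking sequence for a single bad path: highest ratio first."""
    return sorted(nodes, key=lambda i: ratio(i, p, c), reverse=True)

def path_expected_cost(order, p, c):
    """Expected checking cost of `order`, conditioned on the path being bad."""
    all_good = prod(1 - p[i] for i in order)
    cost = 0.0
    for k, i in enumerate(order):
        # Node i is checked iff some node in order[k:] is bad.
        rest_good = prod(1 - p[j] for j in order[k:])
        cost += c[i] * (1 - rest_good) / (1 - all_good)
    return cost

# Hypothetical path with three nodes.
p = {1: 0.1, 2: 0.3, 3: 0.2}
c = {1: 1.0, 2: 2.0, 3: 0.5}
order = path_order([1, 2, 3], p, c)
best = min(path_expected_cost(list(o), p, c) for o in permutations([1, 2, 3]))
```

Here order is [3, 2, 1], and its expected cost matches the minimum found over all six permutations.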

Inference Algorithm • For general cases, we don’t know which candidate node should be checked first to minimize the expected cost. • e.g., not necessarily the candidate node with the highest potential • Heuristics: • Sequential strategy: Checks the candidate node with the highest potential • Parallel strategy: Checks simultaneously multiple candidate nodes that cover all bad paths

Highlights of Experiments • Setup • Use BRITE to create 200 random experimental networks, each of which has 200 routers • Assign each node a failure probability and a checking cost • Focus on multi-tree topologies, each of which is a shortest-path tree rooted at a randomly selected router • Metric • Expected total checking cost to diagnose and repair all faulty nodes • Heuristics to be compared: • Candidate-based heuristics – check the candidate nodes first • MLF-based heuristics – check the most-likely faulty nodes first

Highlights of Experiments • Random failure prob., fixed checking cost • pi ~ U(0, 0.2) • ci = 1 • Result: • Both heuristics have almost the same expected total checking cost.

Highlights of Experiments • Random failure prob., random checking cost • pi ~ U(0, 0.2) • ci ~ U(0, 1) • Result: • Checking first the candidate nodes decreases the expected total checking cost by ~10%.

Highlights of Experiments • Fixed failure prob., random checking cost • pi = 0.1 • ci ~ U(0, 1) • Result: • Checking first the candidate nodes decreases the expected total checking cost by ~20%.

Conclusions • Presented optimality results for diagnosing and repairing all data-path failures, with an objective to minimize the expected total checking cost. • Constructed a potential function to identify candidate nodes, one of which must be checked first to minimize the expected total checking cost. • Showed via evaluation that checking candidate nodes first can reduce the checking cost by up to 20% compared to checking the most likely faulty nodes first.
