Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, Ming Zhang Microsoft Research Presented by Zhenyu Pan

Introduction Enterprise Network Service • Network service is an (IPaddr, port) pair. • Enterprise network • network of a single enterprise. • traffic does not cross the open Internet. • user-perceptible service degradations are rampant.

IntroductionSherlock System • Conventional approach • treat each service as up or down. • box-centric, blind to the complex set of dependencies • meaningless alerts (15,000 alerts a day, almost universally ignored). • The new approach • models service availability as a 3-state value: • Up: response time is normal; • Down: requests result in either an error status or no response at all; • Troubled: response times fall significantly outside of normal response times. • user-centric, does not report problems that do not directly affect users.

IntroductionSherlock System • System components • Detects faults and performance problems by monitoring the response time of services. Software agents: run on each host, analyze the packets, determine the set of services the host depends. • Determines the set of components that responsible, a service, a router, or a link, etc. Sherlock server: assembles an multi-level, 3-state inference graph that captures the dependencies between all components. • Localizes the problem to the most likely component. Ferret Algorithm: localizes faults using the inference graph. • Main contributions: • Inference Graph • Ferret Algorithm

IntroductionInference Graph • An example

Inference GraphNode Types • Root-cause node: physical components whose failure can cause an end-user to experience failures. • computer, service, router, IP link, etc; • two special root-cause nodes: always troubled (AT) and always down (AD) to model external factors • Observation node: accesses to network services whose performances can be measured by Sherlock. • Meta-node: model the dependencies between root causes and observations. Three types of meta-nodes: • noisy-max • selector (load-balancers) • failover (failover redundancy)

Inference GraphNode States • The state of each node • three-tuple: (Pup, Ptrouble, Pdown) • P stands for probability • Pup + Ptrouble + Pdown = 1 • The state of the root-cause node is independent of any other node. • The state of observation nodes can be uniquely determined from the state of its ancestors.

Inference GraphEdges • Edge from node A to B: • the dependency that node A has to be in the up state for node B to be up. • For example, • a client cannot retrieve a file from a file server if the path to that file server is down. • A client might still be able to retrieve the file even when the DNS server is down, if the file server’s name to IP address mapping is found in the client’s local DNS cache. • dependency probability indicates how strong the dependency is.

Inference GraphPropagation of State • Noisy-Max Meta-Nodes • Max: The node gets the worst condition of its parents. • Noisy: if the weight of a parent’s edge is d, then with probability (1-d) the child is not affected

Inference GraphPropagation of State • Selector Meta-Nodes • Used to model load balancing • NLB: Network Load Balancer. • ECMP: routers send packets to a destination along several paths.

Inference GraphPropagation of State • Failover Meta-Nodes • Failover: Clients access primary servers and failover to backup servers when the primary server is inaccessible.

Inference GraphTime to Propagate • the majority of the nodes with more than one parent are noisy-max meta-nodes. • For these nodes, computation time is O(n) • for selector and failover meta-nodes: • Still needs O(3n) time. • HOWEVER, those two types of meta-nodes have no more than 6 parents.

Inference GraphFault Localization • Assignment-vector • An assignment of state to every root-cause node which has probability of 1 of being either up, troubled or down. • Our target • Find the assignment-vector that best explain the observation • Ferret • sets the root causes to the states specified in the assignment-vector and then propagate state probabilities downwards until they reach the observation nodes. • for each observation node, computes a score based on how well the probabilities in the state of the observation node agree with the statistical evidence.

Inference GraphFault Localization • Impossible to traverse all possible assignment vectors to determine the vector with the highest score • OBSERVATION 1. It is very likely that at any point in time only a few root-cause nodes are troubled or down. • OBSERVATION 2. Since a root-cause is assigned to be up in most assignment vectors, the evaluation of an assignment vector only requires re-evaluation of states at the descendants of root cause nodes that are not up.

Inference Graphferret algorithm

Sherlock SystemThree-Step Process • Step1: service-level dependency graph • We define the dependency probability of a host on service A when accessing service B as the probability the host needs to communicate with service A before it can successfully communicate with service B. • Step2: Inference Graph • Step3: Fault Localization using Ferret • Score for a given assignment vector: • Track the history of response time and fits two Gaussian distribution to the data, namely Gaussianup and Gaussiantroubled. • If the observation time is t and the predicted observation node is (pup, ptroubled, pdown), then the score of this vector is calculated as: pup*Prob(t|Gaussianup) + ptroubled*Prob(t|Gaussiantroubled)

Sherlock SystemDiscovering Service-Level Dependencies • OBSERVATION: If accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur. • Dependency probability: conditional probability of accessing service A within the dependency interval, prior to accessing service B. • Dependency interval: 10ms • Chance of co-occurrence: first calculate the average interval, I, between accesses to the same service and estimate the likelihood of “chance co-occurrence” as (10ms)/I. They then retain only the dependencies where the dependency probability is much greater than the likelihood of chance co-occurrence.

Sherlock SystemConstructing the Inference Graph • creates a noisy max meta-node to represent the service. • creates an observation node and makes the service meta-node a parent of the observation node. • examines the service dependency information recursively. • creates a root-cause node to represent the host on which the service runs and makes this root-cause a parent of the meta-node. • adds network topology information by using trace route results. For each path between hosts, it adds a noisy-max meta node to represent the path and root-cause nodes to represent every router and link on the path. • adds each of these root-causes as parents of the path meta-node. • put AT and AD. Give each the edges connecting AT/AD to the observation point a weight 0.001. And give the edges between a router and a path meta-node a probability 0.999.

Implementation Agent

Implementation Service Dependency Graphs

Implementation Dependency Probabilities

Implementation Test Bed

EvaluationRoot Cause

EvaluationPerformance Comparison

EvaluationError Influence

Summary • Main contribution • Sherlock system • Assist IT Admin for troubleshooting. • A Multi-level probabilistic inference model • Automatic Construction of the Inference Graph • An algorithm to localize root cause.

Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies