
Detailed Diagnosis in Enterprise Networks.



Presentation Transcript


  1. Detailed Diagnosis in Enterprise Networks.

  2. Fault diagnosis • Small enterprise network operators need detailed fault diagnosis. • The system should be able to diagnose generic faults and application-specific faults. • Detailed diagnosis is cast as an inference problem that captures the behaviors and interactions of fine-grained network components and processes.

  3. Introduction • Diagnosing problems in computer networks is frustrating. • Configuration changes in seemingly unrelated files, resource hogs elsewhere in the network, and even software upgrades can ruin what worked perfectly yesterday. • Existing diagnostic systems, designed with large, complex networks in mind, fall short at helping the operators of small networks.

  4. Small Enterprise Networks • Most problems in this environment concern application-specific issues. • Generic problems related to performance or reachability are in the minority. • Culprits underlying these faults range from bad application or firewall configuration to software driver bugs. • Diagnosis at the granularity of machines is not very useful: operators often already know which machine is faulty.

  5. Existing Diagnostic Systems • Sherlock targets only performance and reachability issues and diagnoses at the granularity of machines. • SCORE, designed for ISP networks, uses extensive knowledge of the structure of its domain. • Extending these systems to perform detailed diagnosis in enterprise networks would require embedding detailed knowledge of each application's dependencies.

  6. NetMedic • Its formulation models the network as a dependency graph of fine-grained components such as processes and firewall configuration. • The goal of diagnosis in this model is to link affected components to likely culprit components through a chain of dependency edges. • It handles many variables without programmed knowledge of their semantics. • It uses history to estimate the likelihood of components impacting one another in the present.

  7. Problem formulation • The system should be able to diagnose both application-specific and generic problems. • It should identify likely causes with as much specificity as possible. • The system should rely on minimal application-specific knowledge.

  8. NetMedic Inference Problem • Model the network as a dependency graph between components: application processes, host machines, and configuration elements. • There is a directed edge between two components if the source directly impacts the destination. • The dependency graph may contain cycles. • NetMedic automatically constructs the dependency graph. • The state of a component at a given time consists of visible and invisible parts.
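The graph model above can be sketched as a small data structure. Component names here are illustrative, not from the paper:

```python
from collections import defaultdict

# Minimal sketch of the dependency-graph model: a directed edge from
# `source` to `dest` means the source directly impacts the destination.
class DependencyGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # source -> set of destinations

    def add_edge(self, source, dest):
        self.edges[source].add(dest)

    def dependents(self, component):
        # Components directly impacted by `component`.
        return self.edges[component]

g = DependencyGraph()
g.add_edge("firewall-config", "web-server-process")
g.add_edge("web-server-process", "host-machine")
g.add_edge("host-machine", "web-server-process")   # cycles are allowed
```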

  9. Dependency graph

  10. Given a component whose visible state has changed relative to some period in the past, the goal is to identify the components likely responsible for the change. • Thus the system can be used to explain any change, including improvements.

  11. Using History to Gauge impact • The main problem in estimating when a component may be impacting another is that we don’t know a priori how components interact. • We can use time to rule out the possibility of impact along certain dependency edges. • It is not uncommon for at least some variables to be in an abnormal state at any time.

  12. History based primitive • This primitive extracts information from the joint historical behavior of components to estimate the likelihood that a component is currently impacting a neighbor. • They use this estimated likelihood to set the edge weights in the dependency graph. • The weights are then used to identify the likely causes as those that have a path of high impact edges in the dependency graph leading to the affected component.

  13. Edge weight

  14. Whether the edge weight is correctly determined for an edge depends on the contents of the history. • Estimating the correct weight for every edge is not critical, however. • For accurate diagnosis, it suffices to correctly assign a low weight to enough edges that the path from the real cause to its effects shines through.

  15. NetMedic Workflow

  16. Capturing component state • The partitioning into components depends on the kinds of faults that appear in the logs. • Components: application processes, machines, network paths, and the configurations of applications, machines, and firewalls. • Virtual component: NbrSet. • The granularity of diagnosis is determined by the granularity of the modeled components. • The state of each component is stored in one-minute bins.
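One-minute binning of component state might be organized as in this sketch; the map layout, component names, and variables are invented for illustration:

```python
from collections import defaultdict

# State keyed by (component, minute-bin); each bin holds that minute's
# variable values for the component.
state = defaultdict(dict)

def record(component, minute_bin, variable, value):
    state[(component, minute_bin)][variable] = value

record("proc:httpd", 42000, "cpu_util", 0.12)
record("proc:httpd", 42000, "resp_time_ms", 35.0)
```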

  17. Example of NetMedic State Variables

  18. Generating dependency graph • Model the network as a dependency graph among components. • There is an edge from a component to each of its directly dependent components. • The graph is automatically generated using a set of templates, one per component type. • IP communication is captured using NbrSet.

  19. Templates used by NetMedic

  20. Diagnosis • Inputs: • (One-minute) time bin to analyze • Time range as historical reference • 3 Steps: • Determine the extent to which various components and variables are statistically abnormal. • Compute weights for edges in the dependency graph. • Use edge weights to compute path weights and produce a ranked list of likely culprits.

  21. Step 1: Computing abnormality • Assume that the values of each variable approximate a normal distribution. • Given the abnormality of each variable, the abnormality of a component is the maximum abnormality across its variables. • Abnormality values are used in two ways: • Directly, for instance as multiplicative factors. • To make binary decisions, such as whether a variable or component is abnormal.
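Step 1 can be sketched as follows, assuming abnormality is the probability mass a normal fit to the history places closer to the mean than the current value; the paper's exact mapping may differ:

```python
import math
import statistics

def variable_abnormality(history, current):
    """Abnormality of one variable in [0, 1], assuming its historical
    values approximate a normal distribution."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history) or 1e-9   # guard zero variance
    z = abs(current - mu) / sigma
    # erf(z / sqrt(2)) = P(|X - mu| <= |current - mu|) under the fit
    return math.erf(z / math.sqrt(2))

def component_abnormality(var_histories, current_values):
    # A component is as abnormal as its most abnormal variable.
    return max(variable_abnormality(var_histories[v], current_values[v])
               for v in current_values)
```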

  22. Step 2: Computing edge weights • Let S and D be the source and destination of a dependency edge. • If either S or D is behaving normally, it is unlikely that S is impacting D, so assign a low weight (0.1) to the edge. • If both S and D are abnormal, use their joint historical behavior to determine the edge weight.

  23. When no usable historical information exists, e.g., because the history is insufficient or because similar source states do not exist, assign a high weight of 0.8 to the edge. • To get a robust measure of how differently a component is behaving at different points in time, variables are normalized.
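The rules on slides 22 and 23 can be combined into one sketch. The history-based estimate here, one minus the mean divergence of the destination's state in source-similar bins, is a simplification of the paper's procedure:

```python
def edge_weight(src_abnormal, dst_abnormal, similar_bins, dst_divergence):
    """Weight for an edge S -> D.

    similar_bins: historical bins where S's state resembled its current
    state; dst_divergence(bin) in [0, 1] measures how different D's state
    in that bin is from D's current state.
    """
    if not src_abnormal or not dst_abnormal:
        return 0.1   # a normal endpoint is unlikely to be involved
    if not similar_bins:
        return 0.8   # no usable history: stay conservative
    # If D looked the same whenever S looked the same, S plausibly
    # explains D's current state, so the edge gets a high weight.
    return 1.0 - sum(dst_divergence(b) for b in similar_bins) / len(similar_bins)
```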

  24. Extensions • Extensions to the basic procedure create a similar effect without requiring knowledge of variable semantics: • A) Weight variables by abnormality. • B) Ignore redundant variables. • C) Focus on variables relevant to interaction with the neighbor. • D) Account for aggregate relationships.

  25. Step 3: Ranking likely causes • The edge weights are used to order likely causes. • They help connect likely causes to their observed effects through a sequence of high-weight edges. • The goal is to rank causes such that more likely culprits receive lower (better) ranks. • Components with larger path-weight products are ranked lower, i.e., as likelier culprits.

  26. The impact I(c→e) from one component to another is the maximum path weight (product of edge weights) across all acyclic paths between them. • The score S(c) of a component is the weighted sum of its impact on each other component in the network. • Real culprits have low ranks most of the time.
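A sketch of the ranking computation, assuming path weight is the product of edge weights and the score weights impact by each affected component's abnormality; the exhaustive DFS is for illustration only:

```python
def max_path_weight(edges, weights, cause, effect):
    """Impact I(cause -> effect): maximum product of edge weights over
    all acyclic paths (brute-force DFS sketch)."""
    best = 0.0
    def dfs(node, product, visited):
        nonlocal best
        if node == effect:
            best = max(best, product)
            return
        for nxt in edges.get(node, ()):
            if nxt not in visited:
                dfs(nxt, product * weights[(node, nxt)], visited | {nxt})
    dfs(cause, 1.0, {cause})
    return best

def score(component, abnormal_components, abnormality, edges, weights):
    # Sum the component's impact on each abnormal component, weighted by
    # that component's abnormality; higher scores rank as likelier culprits.
    return sum(abnormality[e] * max_path_weight(edges, weights, component, e)
               for e in abnormal_components if e != component)
```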

  27. Implementation • Two parts: data collection and analysis. • The first part captures and stores the state of various components. • The second part uses the stored data to generate the dependency graph and conduct diagnosis. • It has been implemented only on Windows. • Implementing data collection on non-Windows machines, e.g., using syslog and the proc file system on Linux, is future work.

  28. Evaluation • How good is NetMedic at linking effects to their likely causes? • It identifies the correct component as the most likely culprit in 80% of the cases. • A coarse diagnosis method performs poorly, doing so for only 15% of the faults. • Evaluation platforms: two environments. • Methodology: faults are injected, with an hour-long history as reference. • Baseline: the coarse diagnosis method for comparison. • Metric: the rank assigned to the real cause for each anticipated effect of a fault.

  29. Effectiveness of diagnosis

  30. Why NetMedic outperforms Coarse?

  31. Benefit of extension

  32. Multiple simultaneous faults, Impact of history and In situ Behaviour.

  33. Scaling • NetMedic can also help large enterprises. • There are two challenges in scaling: • Carrying out diagnosis-related computation over large dependency graphs; the bottleneck is calculating component abnormality and edge weights (although these computations are parallelizable). • Data collection, storage, and retrieval in large deployments.

  34. Conclusions • NetMedic enables detailed diagnosis in enterprise networks with minimal application knowledge. • Their experiments show that it is highly effective in diagnosing a diverse set of faults that were injected in a live environment. • The tool enables diagnosis of a broad range of faults that are visible in available data without embedding the evolving semantics of the data.
