
Problem Diagnosis



Presentation Transcript


  1. Problem Diagnosis • Distributed Problem Diagnosis • Sherlock • X-trace

  2. Troubleshooting Networked Systems • Hard to develop, debug, deploy, troubleshoot • No standard way to integrate debugging, monitoring, diagnostics

  3. Status quo: device centric • Web 1, Web 2, Load Balancer, Firewall, Database [figure: each device keeps its own separate log — Apache access logs on the web servers, dispatch notices and a "Server s3 down" alert on the load balancer, firewall entries, and SQL statement logs on the database]

  4. Status quo: device centric • Determining paths: join logs on time and ad-hoc identifiers • Relies on well-synchronized clocks and extensive application knowledge • Requires every operation to be logged to guarantee complete paths
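The fragility of the status-quo join can be sketched as follows (a toy illustration with made-up log records, not any real tool): two events on different devices are treated as related only if their timestamps fall within a small window, which breaks as soon as clocks drift by more than that window.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=1)

def join_on_time(web_log, db_log, window=WINDOW):
    """Pair each web request with DB statements inside the time window."""
    joined = []
    for w_ts, w_msg in web_log:
        matches = [d_msg for d_ts, d_msg in db_log
                   if abs(d_ts - w_ts) <= window]
        joined.append((w_msg, matches))
    return joined

t0 = datetime(2006, 8, 20, 9, 12, 58)
web = [(t0, 'GET /gallery')]
db = [(t0 + timedelta(milliseconds=200), 'SELECT g2_...'),
      (t0 + timedelta(seconds=30), 'select oid...')]
print(join_on_time(web, db))
# The unrelated statement 30 s later is excluded, but clock skew larger
# than the window would silently break the correlation.
```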

  5.–8. Examples [animated diagram built across four slides: interactions among a User, a Proxy, a Web Server, and a DNS Server]

  9. Approaches to Diagnosis • Passively learn the relationships • Infer problems as deviations from the norm • Actively instrument the stack to learn relationships • Infer problems as deviations from the norm

  10. Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula

  11. Well-Managed Enterprises Still Unreliable [chart: fraction of requests vs. response time of a Web server (ms) — 85% normal, 10% troubled, 0.7% down] • 10% of responses take up to 10x longer than normal • How do we manage evolving enterprise networks?

  12. Sherlock Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems

  13. Challenges for the End-to-End Approach • Don’t know what user’s performance depends on

  14. Challenges for the End-to-End Approach [diagram of a Web connection: the Client depends on DNS, an Auth. Server, the Web Server, and a SQL Backend] • Don't know what the user's performance depends on • Dependencies are distributed • Dependencies are non-deterministic • Don't know which dependency is causing the problem: the server CPU is at 70% and a link dropped 10 packets, but which affected the user?

  15. Sherlock’s Contributions • Passively infers dependencies from logs • Builds a unified dependency graph incorporating network, server, and application dependencies • Diagnoses user problems in the enterprise • Deployed in a part of the Microsoft enterprise

  16. Sherlock’s Architecture

  17. Sherlock’s Architecture [diagram: a dependency graph over the network, servers, and clients, plus user observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout), feed an inference engine that outputs a list of troubled components] Sherlock works for various client-server applications

  18. [diagram: a client depends on a Video Server, a Data Store, and DNS] How do you automatically learn such distributed dependencies?

  19. Strawman: instrument all applications and libraries — not practical. Sherlock exploits timing info [timeline: "my client talks to B", then "my client talks to C" a short time t later] • If the client talks to B whenever it talks to C → dependent connections

  20. Strawman: instrument all applications and libraries — not practical. Sherlock exploits timing info [timeline: many accesses to B and a single access to C — a false dependence if B is simply accessed all the time] • If the client talks to B whenever it talks to C → dependent connections

  21. Strawman: instrument all applications and libraries — not practical. Sherlock exploits timing info [timeline: accesses to B and C separated by a gap t, compared against the inter-access time of C] • Dependent iff t << inter-access time • If the client talks to B whenever it talks to C → dependent connections, as long as this occurs with probability higher than chance
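The timing heuristic above can be sketched as follows (an illustrative toy, not Sherlock's actual code): an access to C is counted as dependent on B only when a B-access precedes it by much less than C's typical inter-access time.

```python
def dependent(accesses_b, accesses_c, factor=10):
    """Fraction of C-accesses closely preceded by a B-access."""
    # Typical spacing between consecutive accesses to C.
    gaps = [b - a for a, b in zip(accesses_c, accesses_c[1:])]
    mean_inter = sum(gaps) / len(gaps) if gaps else float('inf')
    threshold = mean_inter / factor        # t << inter-access time
    close = sum(
        1 for c in accesses_c
        if any(0 <= c - b <= threshold for b in accesses_b)
    )
    return close / len(accesses_c)

# B is accessed ~10 ms before every C access; C's accesses are ~10 s apart.
b_times = [0.0, 10.0, 20.0]
c_times = [0.01, 10.01, 20.01]
print(dependent(b_times, c_times))   # 1.0 -> strong evidence of dependence
```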

  22. Sherlock’s Algorithm to Infer Dependencies [dependency graph: Video, Store, DNS] • Infer dependent connections from timing

  23. Sherlock’s Algorithm to Infer Dependencies [dependency graph for Bill’s client: "Bill watches Video" depends on the Video server, the Store, and DNS] • Infer dependent connections from timing • Infer topology from traceroutes & configurations • Works with legacy applications • Adapts to changing conditions

  24. But hard dependencies are not enough…

  25. But hard dependencies are not enough… [dependency graph for Bill’s client with edge probabilities: DNS → Bill p1 = 10%, Video → Bill p2 = 100%] • If Bill caches the server’s IP, DNS can be down yet Bill still gets the video → need probabilities • Sherlock uses the frequency with which a dependence occurs in the logs as its edge probability
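The frequency-as-probability idea can be sketched directly (service names and numbers are illustrative, matching the slide's 10%/100% example):

```python
from collections import Counter

def edge_probabilities(observed_accesses):
    """observed_accesses: one set of contacted services per client request."""
    total = len(observed_accesses)
    counts = Counter()
    for services in observed_accesses:
        counts.update(services)
    # Edge probability = fraction of requests in which the dependence appears.
    return {svc: n / total for svc, n in counts.items()}

# Bill contacts the video server on every request, but DNS only when his
# cached IP has expired (1 request in 10 here).
logs = [{'Video'}] * 9 + [{'Video', 'DNS'}]
probs = edge_probabilities(logs)
print(probs['Video'], probs['DNS'])   # 1.0 0.1
```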

  26. How do we use the dependency graph to diagnose user problems?

  27. Diagnosing User Problems [dependency graph: "Bill watches Video" depends on the Video server, the Store, and DNS] • Which components caused the problem? • Need to disambiguate!

  28. Bill Sales Video Video2 Store Sales DNS Bill Sees Sales Diagnosing User Problems Bill’s Client Video2 Store Video Store Bill DNS Bill Video Paul Video2 Bill Watches Video Paul Watches Video2 • Which components caused the problem? • Disambiguate by correlating • Across logs from same client • Across clients • Prefer simpler explanations • Use correlation to disambiguate!!

  29. Will Correlation Scale?

  30. Will Correlation Scale? Microsoft internal network: O(100,000) client desktops, O(10,000) servers, O(10,000) apps/services, O(10,000) network devices [diagram: building network → campus core → corporate core → data center] • The dependency graph is huge

  31. Will Correlation Scale? Can we evaluate all combinations of component failures? The number of fault combinations is exponential! Impossible to compute!

  32. Scalable Algorithm to Correlate • Only a few faults happen concurrently • But how many is few? Evaluate enough to cover 99.9% of faults • For the MS network, at most 2 concurrent faults → 99.9% accurate • Exponential → polynomial

  33. Scalable Algorithm to Correlate • Only a few faults happen concurrently • Only a few nodes change state • But how many is few? Evaluate enough to cover 99.9% of faults • For the MS network, at most 2 concurrent faults → 99.9% accurate • Exponential → polynomial

  34. Scalable Algorithm to Correlate • Only a few faults happen concurrently • Only a few nodes change state • But how many is few? Evaluate enough to cover 99.9% of faults • For the MS network, at most 2 concurrent faults → 99.9% accurate • Re-evaluate only if an ancestor changes state — reduces the cost of evaluating a case by 30x–70x • Exponential → polynomial
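The exponential-to-polynomial reduction can be made concrete with a small sketch (a toy illustration, not Sherlock's implementation): instead of all 2^n fault assignments, enumerate only fault sets of size at most k.

```python
from itertools import combinations
from math import comb

def candidate_fault_sets(components, k=2):
    """Yield all fault sets with at most k concurrent faults."""
    for size in range(1, k + 1):
        yield from combinations(components, size)

n = 358                            # failable components in Sherlock's graph
exhaustive = 2 ** n                # exponential: infeasible to evaluate
capped = comb(n, 1) + comb(n, 2)   # polynomial in n for fixed k = 2
print(capped)                      # 64261 candidate sets instead of 2**358
```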

  35. Results

  36. Experimental Setup • Evaluated on the Microsoft enterprise network • Monitored 23 clients, 40 production servers for 3 weeks • Clients are at MSR Redmond • Extra host on server’s Ethernet logs packets • Busy, operational network • Main Intranet Web site and software distribution file server • Load-balancing front-ends • Many paths to the data-center

  37. What Do Web Dependencies in the MS Enterprise Look Like?

  38.–40. What Do Web Dependencies in the MS Enterprise Look Like? [diagrams built across three slides: a client accessing the Portal depends on the Auth. Server; a client accessing Sales shows similar dependencies] Sherlock discovers the complex dependencies of real apps.

  41. What Do File-Server Dependencies Look Like? [dependency graph: a client accessing the software distribution server depends 100% on the File Server, with weaker dependencies on the Auth. Server, WINS, DNS, the Proxy, and four backend servers, at edge probabilities ranging from 0.3% to 10%] Sherlock works for many client-server applications

  42. Sherlock Identifies Causes of Poor Performance [plot: component index vs. time (days)] • Dependency graph: 2565 nodes; 358 components that can fail • 87% of problems localized to 16 components

  43. Sherlock Identifies Causes of Poor Performance [plot: component index vs. time (days)] • Inference graph: 2565 nodes; 358 components that can fail • Corroborated the three significant faults

  44. Sherlock Goes Beyond Traditional Tools • SNMP-reported utilization on a link flagged by Sherlock • Problems coincide with spikes Sherlock identifies the troubled link but SNMP cannot!

  45. X-Trace • X-Trace records events in a distributed execution and their causal relationships • Events are grouped into tasks: a well-defined starting event plus everything causally related to it • Each event generates a report, binding it to one or more preceding events • Captures the full happens-before relation

  46. X-Trace Output [task graph: an HTTP client request traverses an HTTP proxy and HTTP server, spanning two TCP connections and several IP hops through routers] • Task graph capturing task execution • Nodes: events across layers and devices • Edges: causal relations between events

  47. Basic Mechanism [diagram: events a–n across the HTTP client, proxy, and server, all tagged with TaskID T; an example report reads "TaskID: T, EventID: g, Edge: from a, f"] • Each event is uniquely identified within a task: [TaskId, EventId] • [TaskId, EventId] is propagated along the execution path • For each event, create and log an X-Trace report • Reports contain enough info to reconstruct the task graph
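The report-and-reconstruct mechanism can be sketched as follows (field names follow the slide's example report; the parsing and dict shape are illustrative, not X-Trace's wire format):

```python
# Reports collected for task T; each binds an event to its predecessors,
# mirroring the slide's "TaskID: T, EventID: g, Edge: from a, f".
reports = [
    {'TaskID': 'T', 'EventID': 'a', 'Edges': []},
    {'TaskID': 'T', 'EventID': 'f', 'Edges': ['a']},
    {'TaskID': 'T', 'EventID': 'g', 'Edges': ['a', 'f']},
]

def build_task_graph(reports, task_id):
    """Rebuild one task's causal graph: event id -> preceding event ids."""
    graph = {}
    for r in reports:
        if r['TaskID'] == task_id:
            graph[r['EventID']] = r['Edges']
    return graph

print(build_task_graph(reports, 'T'))
# {'a': [], 'f': ['a'], 'g': ['a', 'f']}
```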

  48. X-Trace Library API • Handles propagation within the app • Supports threaded and event-based code (e.g., libasync) • Akin to a logging API: the main call is logEvent(message) • The library takes care of event-id creation, binding, reporting, etc. • Implementations in C++, Java, Ruby, and JavaScript
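What a logEvent-style call does under the hood can be sketched like this (a minimal toy, not the real X-Trace library; the real library propagates [TaskId, EventId] in message metadata across hosts, whereas this version only tracks the current event in a local context object):

```python
import uuid

class XTraceContext:
    def __init__(self, task_id):
        self.task_id = task_id
        self.last_event = None

    def log_event(self, message):
        event_id = uuid.uuid4().hex[:8]        # library creates the event id
        edges = [self.last_event] if self.last_event else []
        report = {'TaskID': self.task_id, 'EventID': event_id,
                  'Edges': edges, 'Message': message}
        self.last_event = event_id             # bind the next event to this one
        return report                          # would be sent to a collector

ctx = XTraceContext('T')
r1 = ctx.log_event('request received')
r2 = ctx.log_event('forwarded to server')
print(r2['Edges'] == [r1['EventID']])          # True: causal binding
```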

  49. Task Tree • X-Trace tags all network operations resulting from a particular task with the same task identifier • The task tree is the set of network operations connected with an initial task • The task tree can be reconstructed after collecting the trace data from the reports
