Problem Diagnosis
  • Distributed Problem Diagnosis
  • Sherlock
  • X-trace
Troubleshooting Networked Systems
  • Hard to develop, debug, deploy, troubleshoot
  • No standard way to integrate debugging, monitoring, diagnostics
Status quo: device centric

[Figure: each device keeps its own logs, with no common identifier: two web servers' Apache access logs ("GET ..." entries), a load balancer's dispatch log ("[notice] Dispatch s1...", "[crit] Server s3 down..."), a firewall's event log, and a database's statement log ("LOG: statement: SELECT ...").]
Status quo: device centric
  • Determining paths:
    • Join logs on time and ad-hoc identifiers
  • Relies on:
    • well-synchronized clocks
    • extensive application knowledge
  • Requires that every operation be logged to guarantee complete paths (see the sketch below)
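To see why this is fragile, here is a hypothetical sketch of the status-quo join (all names and formats are illustrative): stitch paths together by matching per-device log entries that fall within a small time window.

```python
# Join web-server access logs and database logs purely on timestamps.
# Breaks exactly where the slides say: skewed clocks and overlapping
# requests silently misattribute operations.
from datetime import datetime, timedelta

def parse_access_log_time(line):
    # e.g. '72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga...'
    ip = line.split()[0]
    stamp = line.split("[")[1].split()[0]         # drop the timezone offset
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")

def join_logs(web_lines, db_events, window=timedelta(seconds=2)):
    """db_events: list of (datetime, statement) from the database log."""
    paths = []
    for line in web_lines:
        ip, t_web = parse_access_log_time(line)
        stmts = [s for t_db, s in db_events if abs(t_db - t_web) <= window]
        paths.append((ip, t_web, stmts))          # ad-hoc "path" for one request
    return paths
```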
Examples

[Figure, built up over four slides: a user's request path through a Proxy, a DNS Server, and a Web Server.]
Approaches to Diagnosis
  • Passively learn the relationships
    • Infer problems as deviations from the norm
  • Actively instrument the stack to learn the relationships
    • Infer problems as deviations from the norm
Well-Managed Enterprises Still Unreliable

[Figure: histogram of Web-server response times (ms, log scale from 10 to 10000) against fraction of requests: 85% normal, 10% troubled, 0.7% down.]

10% of responses take up to 10x longer than normal.

How do we manage evolving enterprise networks?

Sherlock

Instead of examining the nitty-gritty of individual components, Sherlock takes an end-to-end approach that focuses on user problems.

Challenges for the End-to-End Approach

[Figure: e.g., a Web connection; the client depends on a DNS server, an Auth. server, the Web server, and a SQL backend.]

  • Don't know what a user's performance depends on
    • Dependencies are distributed
    • Dependencies are non-deterministic
  • Don't know which dependency is causing the problem
    • Server CPU at 70%, link dropped 10 packets: which one affected the user?

Sherlock's Contributions
  • Passively infers dependencies from logs
  • Builds a unified dependency graph incorporating network, server, and application dependencies
  • Diagnoses user problems in the enterprise
  • Deployed in a part of the Microsoft enterprise
Sherlock's Architecture

[Figure: a network dependency graph (servers, clients) combined with user observations (Web1: 1000 ms, Web2: 30 ms, File1: timeout) feeds an inference engine, which outputs a list of troubled components.]

Sherlock works for various client-server applications

[Figure: a client whose requests depend on a Video Server, a Data Store, and DNS.]

How do you automatically learn such distributed dependencies?

Strawman: Instrument all applications and libraries → not practical

Sherlock instead exploits timing information.

[Figure: a timeline of one client's connections: "my client talks to B", then, a short interval t later, "my client talks to C".]

If the client talks to B whenever it talks to C → dependent connections.

[Figure: the same timeline with many accesses to B and a single access to C; chance proximity alone would suggest a false dependence.]

[Figure: the timeline comparing the B-to-C gap t with the typical inter-access time.]

Dependent iff t << inter-access time, and the B-then-C pattern occurs with probability higher than chance.
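A minimal sketch of that timing test, assuming one client's log is available as time-sorted (timestamp, service) pairs; the dependency window and the above-chance margin are illustrative parameters, not Sherlock's actual values:

```python
# Flag (B, C) as dependent when accesses to B fall shortly before
# accesses to C ("t << inter-access time") more often than chance.
from collections import defaultdict

def infer_dependencies(accesses, dep_window, chance_margin=2.0):
    """accesses: time-sorted list of (timestamp, service) for one client."""
    counts = defaultdict(int)            # (B, C) -> co-occurrences
    totals = defaultdict(int)            # service -> number of accesses
    for i, (t_c, c) in enumerate(accesses):
        totals[c] += 1
        for t_b, b in reversed(accesses[:i]):     # look back within the window
            if t_c - t_b > dep_window:
                break
            if b != c:
                counts[(b, c)] += 1
    duration = accesses[-1][0] - accesses[0][0]
    deps = []
    for (b, c), n in counts.items():
        cooccurrence = n / totals[c]     # fraction of C's accesses preceded by B
        # Probability that a random window of length dep_window holds a B access.
        chance = min(1.0, totals[b] * dep_window / duration)
        if cooccurrence > chance_margin * chance:
            deps.append((b, c, cooccurrence))
    return deps
```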

Sherlock's Algorithm to Infer Dependencies

[Figure: connection-level dependencies among the Video, Store, and DNS servers are combined with Bill's client and the network topology into service-level nodes ("Bill accesses Video", "Bill accesses DNS", "Video accesses Store") under the parent "Bill watches video".]

  • Infer dependent connections from timing
  • Infer topology from traceroutes and configurations

  • Works with legacy applications
  • Adapts to changing conditions
But hard dependencies are not enough…

[Figure: the "Bill watches video" graph with probabilities p1 = 10%, p2 = 100% attached to its edges.]

If Bill caches the server's IP, DNS can be down and yet Bill still gets his video → the edges need probabilities.

Sherlock uses the frequency with which a dependence occurs in logs as its edge probability
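As a sketch (names hypothetical), that frequency count translates directly into edge probabilities:

```python
# Convert observed frequencies into edge probabilities: the probability
# on edge parent->child is the fraction of the child's accesses in the
# logs that exhibited the dependence.
def edge_probabilities(dep_counts, access_counts):
    """dep_counts: (parent, child) -> co-occurrences seen in the logs.
    access_counts: child -> total accesses of that child.
    E.g. the DNS edge comes out at 10% if Bill usually hits a cached IP."""
    return {(p, c): n / access_counts[c]
            for (p, c), n in dep_counts.items()}
```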

Diagnosing User Problems

[Figure, built up over two slides: the inference graph for "Bill watches Video" grows to include a second user ("Paul watches Video2") and more servers (Video2, Sales); the users share components such as DNS and the Store.]

  • Which components caused the problem? Need to disambiguate!
  • Disambiguate by correlating:
    • across logs from the same client
    • across clients
  • Prefer simpler explanations (see the sketch below)
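A hedged sketch of that correlation step, folding in the "prefer simpler explanations" rule; the scoring is a stand-in for Sherlock's actual probabilistic inference, and all names are illustrative:

```python
# Score small candidate fault sets by how many client observations they
# explain; iterating over set sizes in increasing order makes ties go to
# the simpler explanation.
from itertools import combinations

def explains(fault_set, observation, depends_on):
    """observation: (client, status) with status "troubled" or "ok".
    A fault set explains a troubled client iff the client depends on at
    least one faulted component, and an ok client iff it depends on none."""
    client, status = observation
    hit = any(f in depends_on[client] for f in fault_set)
    return hit if status == "troubled" else not hit

def best_explanation(components, observations, depends_on, max_faults=2):
    best, best_score = (), -1
    for k in range(1, max_faults + 1):            # simpler hypotheses first
        for fault_set in combinations(components, k):
            score = sum(explains(fault_set, obs, depends_on)
                        for obs in observations)
            if score > best_score:
                best, best_score = fault_set, score
    return best, best_score
```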
Will Correlation Scale?

The Microsoft internal network has O(100,000) client desktops, O(10,000) servers, O(10,000) apps/services, and O(10,000) network devices.

[Figure: the network spans building networks, a campus core, a corporate core, and a data center; the dependency graph is huge.]

Can we evaluate all combinations of component failures? The number of fault combinations is exponential: far too many to compute exhaustively.

slide32

Scalable Algorithm to Correlate

Only a few faults happen concurrently

  • But how many is few?
  • Evaluate enough to cover 99.9% of faults
  • For MS network, at most 2 concurrent faults  99.9% accurate

Exponential  Polynomial

slide33

Scalable Algorithm to Correlate

Only a few faults happen concurrently

Only few nodes change state

  • But how many is few?
  • Evaluate enough to cover 99.9% of faults
  • For MS network, at most 2 concurrent faults  99.9% accurate

Exponential  Polynomial

slide34

Scalable Algorithm to Correlate

Only a few faults happen concurrently

Only few nodes change state

  • But how many is few?
  • Evaluate enough to cover 99.9% of faults
  • For MS network, at most 2 concurrent faults  99.9% accurate
  • Re-evaluate only if an ancestor changes state

Reduces the cost of evaluating a case by 30x-70x

Exponential  Polynomial
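To put numbers on the exponential-to-polynomial claim: with the deployment's 358 failable components (see the results slides), evaluating every fault combination would mean 2^358 hypotheses, while allowing at most 2 concurrent faults leaves only C(358,1) + C(358,2) = 358 + 63,903 = 64,261 hypotheses, i.e. O(n²) in the number of components.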

Experimental Setup
  • Evaluated on the Microsoft enterprise network
  • Monitored 23 clients and 40 production servers for 3 weeks
    • Clients are at MSR Redmond
    • An extra host on each server's Ethernet segment logs its packets
  • Busy, operational network
    • Main intranet Web site and software-distribution file server
    • Load-balancing front-ends
    • Many paths to the data center
What Do Web Dependencies in the MS Enterprise Look Like?

[Figure: inferred dependency graphs for clients accessing the Portal and Sales sites, both involving an Auth. server.]

Sherlock discovers complex dependencies of real apps.

What Do File-Server Dependencies Look Like?

[Figure: a client accessing the software-distribution server depends on the file server itself (100%) and, with probabilities between 0.3% and 10%, on DNS, WINS, a proxy, an Auth. server, and four backend servers.]

Sherlock works for many client-server applications

Sherlock Identifies Causes of Poor Performance

[Figure: troubled components plotted as component index versus time in days. Inference graph: 2565 nodes; 358 components that can fail.]

87% of problems were localized to 16 components, and the three significant faults were corroborated.

Sherlock Goes Beyond Traditional Tools

  • SNMP-reported utilization on a link flagged by Sherlock
  • Problems coincide with utilization spikes

Sherlock identifies the troubled link but SNMP cannot!

X-Trace
  • X-Trace records the events in a distributed execution and their causal relationships
  • Events are grouped into tasks
    • A task has a well-defined starting event plus everything causally related to it
  • Each event generates a report binding it to one or more preceding events
  • Captures the full happens-before relation
X-Trace Output

  • Task graph capturing task execution
    • Nodes: events across layers, devices
    • Edges: causal relations between events

[Figure: the task graph for an HTTP request through a proxy: HTTP Client → HTTP Proxy → HTTP Server at the top, TCP 1 and TCP 2 start/end events beneath, and IP-layer events at routers below, with edges showing the causal relations.]

Basic Mechanism

[Figure: the same task graph annotated with event IDs a through n. The metadata [T, a] propagates from the HTTP client toward the proxy, whose event g emits an X-Trace report: TaskID: T; EventID: g; Edge: from a, f.]
  • Each event uniquely identified within a task: [TaskId, EventId]
  • [TaskId, EventId] propagated along execution path
  • For each event create and log an X-Trace report
    • Enough info to reconstruct the task graph
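A minimal sketch of this mechanism, under invented names: the [TaskId, EventId] metadata rides along with each message, and every event logs a report binding it to the event(s) it causally follows.

```python
# Each event gets a fresh id, reports its causal predecessors, and
# passes new metadata onward along the execution path.
import uuid

class XTraceMetadata:
    def __init__(self, task_id, event_id):
        self.task_id, self.event_id = task_id, event_id

def record_event(incoming, reports):
    """incoming: XTraceMetadata from the one or more preceding events.

    Appends an X-Trace report and returns fresh metadata to propagate
    onward."""
    event_id = uuid.uuid4().hex[:8]           # unique within the task
    task_id = incoming[0].task_id             # the task id never changes
    reports.append({
        "TaskID": task_id,
        "EventID": event_id,
        "Edge": [m.event_id for m in incoming],   # e.g. "from a, f"
    })
    return XTraceMetadata(task_id, event_id)
```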
X-Trace Library API
  • Handles propagation within the app
  • Works with threads and event-based code (e.g., libasync)
  • Akin to a logging API:
    • Main call is logEvent(message)
  • The library takes care of event-id creation, binding, reporting, etc.
  • Implementations in C++, Java, Ruby, JavaScript
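In use, the API might look like the following Python-flavored sketch (hypothetical; the real implementations are in C++, Java, Ruby, and JavaScript). The library threads the current metadata through a context, so application code only ever calls logEvent():

```python
# The library tracks the current event internally; each logEvent() call
# creates an id, binds it to the prior event, and emits a report.
import uuid

class XTrace:
    def __init__(self, task_id=None):
        self.task_id = task_id or uuid.uuid4().hex[:8]
        self.current_event = None
        self.reports = []

    def logEvent(self, message):
        event_id = uuid.uuid4().hex[:8]
        self.reports.append({
            "TaskID": self.task_id, "EventID": event_id,
            "Edge": [self.current_event] if self.current_event else [],
            "Message": message,
        })
        self.current_event = event_id     # the next event binds to this one

xt = XTrace()
xt.logEvent("proxy: received request")
xt.logEvent("proxy: forwarded to server")  # automatically bound to the prior event
```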
Task Tree
  • X-Trace tags all network operations resulting from a particular task with the same task identifier
  • The task tree is the set of network operations connected with an initial task
  • The task tree can be reconstructed after collecting the reports in the trace data
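Reconstruction is then a matter of grouping reports by TaskID and following each report's Edge back-pointers; a sketch, using the report fields shown earlier:

```python
# Rebuild one task's graph from collected reports: group on TaskID and
# keep each event's list of parent events. The root is the event with
# no parents (the task's well-defined starting event).
from collections import defaultdict

def reconstruct(reports, task_id):
    """Returns {event_id: [parent_event_ids]} for one task."""
    graph = defaultdict(list)
    for r in reports:
        if r["TaskID"] == task_id:
            graph[r["EventID"]] = list(r["Edge"])
    return dict(graph)
```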
An example of the task tree
  • A simple HTTP request through a proxy
X-Trace Components
  • Data: X-Trace metadata carried with the messages
  • Network path: the task tree
  • Reports: used to reconstruct the task tree
Propagation of X-Trace Metadata
  • The propagation of X-Trace metadata through the task tree
X-Trace-like Systems in Google/Bing/Yahoo
  • Why?
    • They own a large portion of the ecosystem
    • They use RPC for communication
    • They need to understand
      • the time taken by a user request
      • the resources used by a request
Sherlock vs. X-Trace
  • Overhead vs. accuracy
  • Deployment issues
    • Invasiveness
    • Code modification
Conclusions

  • Sherlock passively infers network-wide dependencies from logs and traceroutes
  • It diagnoses faults by correlating user observations
  • X-Trace actively discovers network-wide dependencies