Root Cause Analysis of Failures in Large-Scale Computing Environments

Presentation Transcript


  1. Root Cause Analysis of Failures in Large-Scale Computing Environments Alex Mirgorodskiy, University of Wisconsin mirg@cs.wisc.edu Naoya Maruyama, Tokyo Institute of Technology naoya.maruyama@is.titech.ac.jp Barton P. Miller, University of Wisconsin bart@cs.wisc.edu http://www.paradyn.org/

  2. Motivation • Systems are complex and non-transparent • Many components, different vendors • Anomalies are common • Intermittent • Environment-specific • Users have little debugging expertise. Finding the causes of bugs and performance problems in production systems is hard.

  3. Vision • Autonomous, detailed, low-overhead analysis: • User specifies a perceived problem cause • The agent finds the actual cause [Figure: an agent moving through Process P on Host A and Processes Q and R on Host B across the network]

  4. Applications • Diagnostics of E-commerce systems • Trace the path each request takes through a system • Identify unusual paths • Find out why they are different from the norm • Diagnostics of Cluster and Grid systems • Monitor behavior of different nodes in the system • Identify nodes with unusual behavior • Find out why they are different from the norm • Example: found problems in SCore middleware • Diagnostics of Real-time and Interactive systems • Trace words through the phone network • Find out why some words were dropped

  5. Key Components • Data collection: self-propelled instrumentation • Works for a single process • Can cross the user-kernel boundary • Can be deployed on multiple nodes at the same time • Ongoing work: crossing process and host boundaries • Data analysis: use repetitiveness to find anomalies • Repetitive execution of the same high-level action OR • Repetitiveness among identical processes (e.g., Cluster management tools, Parallel codes, Web server farms)

  6. Focus on Control Flow Anomalies • Unusual statements executed • Corner cases are more likely to have bugs • Statements executed in unusual order • Race conditions • Function taking unusually long to complete • Sporadic performance problems • Deadlocks, livelocks

  7. Current Framework • Traces control flow of all processes • Begins at process startup • Stops upon a failure or performance degradation • Identifies anomalies: unusual traces • Problems on a small number of nodes • Both fail-stop and not • Identifies the causes of the anomalies • Function responsible for the problem [Figure: traces collected from processes P1–P4, with the trace of P1 shown in detail]

  8. Inject instrumenter.so • Activate • Analyze: build call graph/CFG with Dyninst • Propagate [Figure: code patches inserted into a running a.out. A direct call site (call foo) is redirected to Patch1, which calls instrument(foo), calls foo, and jumps back to the next instruction (0x8405). An indirect call site (call *%eax) inside foo is redirected to Patch2, which calls instrument(%eax), performs call *%eax, and jumps back (0x8446). Through /dev/instrumenter, Patch3 applies the same rewrite to the indirect call in the kernel's sys_call handler, so tracing crosses the user-kernel boundary.]
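
The real agent rewrites call sites of the running binary at the instruction level, with no recompilation. As a rough source-level analogy only, GCC's -finstrument-functions hooks give the same "record every function entry and exit" behavior; the file name and output format below are illustrative, not the tracer's.

// trace_hooks.cpp: rough source-level analogy for the injected tracer.
// Build: g++ -finstrument-functions trace_hooks.cpp -o trace_hooks
#include <cstdio>

extern "C" {
// GCC calls these hooks on entry/exit of every instrumented function.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    std::fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void * /*call_site*/) {
    std::fprintf(stderr, "exit  %p\n", fn);
}
}  // extern "C"

static int foo(int x) { return x * 2; }
static int bar(int x) { return foo(x) + 1; }

int main() { return bar(20) == 41 ? 0 : 1; }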

  9. Data Collection: Trace Management • The trace is kept in a fixed-size circular buffer • New entries overwrite the oldest entries • Retains the most recent events leading to the problem • The buffer is located in a shared memory segment • Does not disappear if the process crashes [Figure: the tracer in process P appending call/return records (call foo, ret foo, …) to the buffer]
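
A minimal sketch of such a circular trace buffer, assuming POSIX shared memory; the record layout, capacity, and segment name are invented for illustration, and error handling is reduced to the essentials.

// ring_trace.cpp: sketch of a fixed-size circular trace buffer kept in
// a POSIX shared-memory segment.
// Build (Linux): g++ ring_trace.cpp -o ring_trace -lrt
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

constexpr size_t kCapacity = 1 << 16;      // keep the 65536 newest events

struct TraceRecord { uint64_t timestamp; void *fn; uint8_t is_entry; };

struct RingBuffer {
    uint64_t next;                         // total records written so far
    TraceRecord rec[kCapacity];
};

static RingBuffer *ring = nullptr;

// Map the buffer into a named segment: it outlives a crash of the
// traced process, so the analysis tool can still read the tail.
bool trace_init(const char *name) {        // e.g. "/sp-trace-<pid>"
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(RingBuffer)) != 0) return false;
    void *mem = mmap(nullptr, sizeof(RingBuffer), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    if (mem == MAP_FAILED) return false;
    ring = static_cast<RingBuffer *>(mem);
    return true;
}

// New entries overwrite the oldest once the buffer wraps, so it always
// holds the most recent events leading up to the failure.
void trace_event(void *fn, bool is_entry, uint64_t now) {
    TraceRecord &r = ring->rec[ring->next % kCapacity];
    r.timestamp = now;
    r.fn = fn;
    r.is_entry = is_entry ? 1 : 0;
    ring->next++;
}

int main() {
    if (!trace_init("/sp-trace-demo")) return 1;
    trace_event(nullptr, true, 1);         // real use: a function address
    trace_event(nullptr, false, 2);
    return 0;
}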

  10. Data Analysis: Find Anomalous Host • Check if the anomaly was fail-stop or not: • One of the traces ends substantially earlier than the others -> Fail-stop • The corresponding host is an anomaly • Traces end at similar times -> Non-fail-stop • Look at differences in behavior across traces [Figure: traces of P1–P4 plotted against trace end time]
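
One way to make "ends substantially earlier" concrete; the median comparison and the threshold parameter are assumptions, not details given in the talk.

// fail_stop.cpp: flag hosts whose traces end substantially earlier than
// the rest; the median comparison and threshold are assumed heuristics.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int> fail_stop_hosts(const std::vector<uint64_t> &end_time,
                                 uint64_t threshold) {
    std::vector<uint64_t> sorted(end_time);
    std::sort(sorted.begin(), sorted.end());
    uint64_t median = sorted[sorted.size() / 2];
    std::vector<int> suspects;
    for (size_t h = 0; h < end_time.size(); ++h)
        if (end_time[h] + threshold < median)  // ends well before the others
            suspects.push_back(static_cast<int>(h));
    return suspects;
}

int main() {
    // Hosts 0-3 stop near t=1000; host 4 stopped at t=200 (fail-stop).
    std::vector<uint64_t> ends = {1000, 990, 1005, 998, 200};
    for (int h : fail_stop_hosts(ends, 500))
        std::printf("fail-stop candidate: host %d\n", h);
    return 0;
}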

  11. Data Analysis: Non-fail-stop Host Find outliers (traces different from the rest): • Define a distance metric between two traces • d(g,h) = measure of dissimilarity between traces g and h • Define a trace suspect score • σ(h) = dissimilarity of h from the common behavior • Report traces with high suspect scores • Most distant from the common behavior

  12. Defining the Distance Metric • Compute the time profile for each host h: • p(h) = (t1, …, tF) • ti = normalized time spent in function fi on host h • Profiles are less sensitive to noise than raw traces • Delta vector of two profiles: δ(g,h) = p(g) – p(h) • Distance metric: d(g,h) = Manhattan norm of δ(g,h) [Figure: profiles p(g) and p(h) plotted in (t(foo), t(bar)) space, separated by the delta vector δ(g,h)]
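
The profile and distance definitions transcribed directly; the container types and the handling of functions seen on only one host are choices made here for illustration.

// profile_distance.cpp: time profiles and the Manhattan distance d(g,h).
#include <cmath>
#include <cstdio>
#include <map>
#include <string>

using Profile = std::map<std::string, double>;   // function -> normalized time

// p(h): fraction of traced time spent in each function on host h.
Profile make_profile(const std::map<std::string, double> &time_in_fn) {
    double total = 0;
    for (const auto &kv : time_in_fn) total += kv.second;
    Profile p;
    for (const auto &kv : time_in_fn) p[kv.first] = kv.second / total;
    return p;
}

// d(g,h): Manhattan norm of the delta vector delta(g,h) = p(g) - p(h).
double distance(const Profile &g, const Profile &h) {
    double d = 0;
    for (const auto &kv : g) {
        auto it = h.find(kv.first);
        d += std::fabs(kv.second - (it == h.end() ? 0.0 : it->second));
    }
    for (const auto &kv : h)                     // functions seen only on h
        if (!g.count(kv.first)) d += std::fabs(kv.second);
    return d;
}

int main() {
    Profile g = make_profile({{"foo", 9.0}, {"bar", 1.0}});
    Profile h = make_profile({{"foo", 5.0}, {"bar", 5.0}});
    std::printf("d(g,h) = %.2f\n", distance(g, h));   // prints 0.80
    return 0;
}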

  13. Defining the Suspect Score • Common behavior = normal • Suspect score: σ(h) = distance to the nearest neighbor • Report host with the highest σ to the analyst • h is in the big mass, σ(h) is low, h is normal • g is a single outlier, σ(g) is high, g is an anomaly • What if there is more than one anomaly? [Figure: g lies far from the cluster of normal hosts containing h, so σ(g) is much larger than σ(h)]

  14. Defining the Suspect Score • Suspect score: σk(h) = distance to the kth neighbor • Exclude (k-1) closest neighbors • Sensitivity study: k = NumHosts/4 works well • Represents distance to the “big mass”: • h is in the big mass, kth neighbor is close, σk(h) is low • g is an outlier, kth neighbor is far, σk(g) is high [Figure: computing the score with k = 2; the 2nd-nearest neighbor of outlier g lies in the big mass around h]
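
A sketch of the kth-neighbor score over a precomputed matrix of pairwise trace distances d(g,h); the matrix representation is an assumption.

// suspect_score.cpp: sigma_k(h) = distance from trace h to its k-th
// nearest neighbor, over a precomputed pairwise distance matrix.
#include <algorithm>
#include <cstdio>
#include <vector>

double suspect_score(const std::vector<std::vector<double>> &dist,
                     int h, int k) {
    std::vector<double> to_others;
    for (int g = 0; g < static_cast<int>(dist.size()); ++g)
        if (g != h) to_others.push_back(dist[h][g]);
    // Excluding the k-1 closest neighbors leaves the k-th nearest one
    // at position k-1 of the partially sorted distances.
    std::nth_element(to_others.begin(), to_others.begin() + (k - 1),
                     to_others.end());
    return to_others[k - 1];
}

int main() {
    // Hosts 0-2 behave alike; host 3 is far from everyone (the outlier).
    std::vector<std::vector<double>> d = {{0.0, 0.1, 0.2, 5.0},
                                          {0.1, 0.0, 0.1, 5.1},
                                          {0.2, 0.1, 0.0, 4.9},
                                          {5.0, 5.1, 4.9, 0.0}};
    for (int h = 0; h < 4; ++h)
        std::printf("sigma_2(host %d) = %.1f\n", h, suspect_score(d, h, 2));
    return 0;
}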

  15. Defining the Suspect Score • Anomalous means unusual, but unusual does not always mean anomalous! • E.g., MPI master is different from all workers • Would be reported as an anomaly (false positive) • Distinguish false positives from true anomalies: • With knowledge of system internals – manual effort • With previous execution history – can be automated

  16. Defining the Suspect Score • Add traces from known-normal previous run • One-class classification • Suspect score σk(h) = distance to the kth trial neighbor or the 1st known-normal neighbor • Distance to the big mass or known-normal behavior • h is in the big mass, kth neighbor is close, σk(h) is low • g is an outlier, normal node n is close, σk(g) is low
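
The same score extended with known-normal traces, as described above: the result is the smaller of the distance to the kth trial neighbor and the distance to the nearest known-normal trace. The matrix layout and the small example data are assumptions.

// suspect_score_history.cpp: sigma_k(h) with traces from a known-normal
// previous run added; score = min(distance to k-th trial neighbor,
// distance to nearest known-normal trace).
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

double suspect_score(const std::vector<std::vector<double>> &dist_trial,
                     const std::vector<std::vector<double>> &dist_normal,
                     int h, int k) {
    std::vector<double> trial;
    for (int g = 0; g < static_cast<int>(dist_trial.size()); ++g)
        if (g != h) trial.push_back(dist_trial[h][g]);
    std::nth_element(trial.begin(), trial.begin() + (k - 1), trial.end());
    double kth_trial = trial[k - 1];

    // A host matching some known-normal behavior keeps a low score even
    // if it differs from every host in the current run (e.g., an MPI
    // master), removing that class of false positives.
    double nearest_normal = std::numeric_limits<double>::infinity();
    for (double d : dist_normal[h])
        nearest_normal = std::min(nearest_normal, d);

    return std::min(kth_trial, nearest_normal);
}

int main() {
    std::vector<std::vector<double>> trial = {{0.0, 0.1, 4.0},
                                              {0.1, 0.0, 4.1},
                                              {4.0, 4.1, 0.0}};
    std::vector<std::vector<double>> normal = {{0.2, 3.9},
                                               {0.1, 4.0},
                                               {4.2, 0.3}};
    // Host 2 differs from hosts 0 and 1 but is close to a known-normal
    // trace (distance 0.3), so its score stays low.
    for (int h = 0; h < 3; ++h)
        std::printf("score(host %d) = %.1f\n",
                    h, suspect_score(trial, normal, h, 1));
    return 0;
}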

  17. Finding Anomalous Function • Fail-stop problems • Failure is in the last function invoked • Non-fail-stop problems • Find why host h was marked as an anomaly • Function with the highest contribution to σ(h): • σ(h) = |δ(h,g)|, where g is the chosen neighbor • anomFn = arg maxi |δi|
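
A direct reading of anomFn = arg maxi |δi|; the example delta values are invented, though __libc_write is the function actually implicated later in the talk (slide 21).

// anomalous_function.cpp: pick the function whose profile-time
// difference between the anomalous host h and its chosen neighbor g
// contributes most to sigma(h).
#include <cmath>
#include <cstdio>
#include <map>
#include <string>

std::string anomalous_function(const std::map<std::string, double> &delta) {
    std::string best;
    double best_abs = -1.0;
    for (const auto &kv : delta) {
        if (std::fabs(kv.second) > best_abs) {
            best_abs = std::fabs(kv.second);
            best = kv.first;
        }
    }
    return best;
}

int main() {
    // delta(h,g): per-function differences in normalized time (invented).
    std::map<std::string, double> delta = {{"score_write", 0.02},
                                           {"__libc_write", 0.41},
                                           {"main_loop", -0.05}};
    std::printf("anomalous function: %s\n", anomalous_function(delta).c_str());
    return 0;
}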

  18. Experimental Study: SCore • SCore: cluster-management framework • Job scheduling, checkpointing, migration • Supports MPI, PVM, Cluster-enabled OpenMP • Implemented as a ring of daemons, scored • One daemon per host for monitoring jobs • Daemons exchange keep-alive patrol messages • If no patrol message traverses the ring in 10 minutes, sc_watch kills and restarts all daemons [Figure: sc_watch watching a ring of scored daemons that exchange patrol messages]

  19. Debugging SCore • Inject tracing agents into all scoreds • Instrument sc_watch to find when the daemons are being killed • Identify the anomalous trace • Identify the anomalous function/call path [Figure: sc_watch and the ring of scored daemons, each with a tracing agent injected]

  20. Finding the Host • Host n129 is unusual – different from the others • Host n129 is anomalous – not present in previous known-normal runs • Host n129 is a new anomaly – not present in previous known-faulty runs

  21. Finding the Cause • Call chain with the highest contribution to the suspect score: (output_job_status -> score_write_short -> score_write -> __libc_write) • Tries to output a log message to the scbcast process • Writes to the scbcast process kept blocking for 10 minutes • Scbcast stopped reading data from its socket – bug! • Scored did not handle it well (spun in an infinite loop) – bug!

  22. Ongoing work • Cross process and host boundaries • Propagate upon communication • Reconstruct system-wide flows • Compare flows to identify anomalies [Figure: the agent following communication from Process P on Host A across the network to Processes Q and R on Host B]

  23. Ongoing work • Propagate upon communication • Notice the act of communication • Identify the peer • Inject the agent into the peer • Trace the peer after it receives the data • Reconstruct system-wide flows • Separate concurrent interleaved flows • Compare flows • Identify common flows and anomalies

  24. Conclusion • Data collection: acquire call traces from all nodes • Self-propelled instrumentation: autonomous, dynamic and low-overhead • Data analysis: identify unusual traces and find what made them unusual • Fine-grained: identifies individual suspect functions • Highly accurate: reduces rate of false positives using past history • Come see the demo!

  25. Relevant Publications • A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Root Cause Analysis of Failures in Large-Scale Computing Environments", Submitted for publication, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy05Root.pdf • A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf
