
Triage: Diagnosing Production Run Failures at the User’s Site


Presentation Transcript


  1. Triage: Diagnosing Production Run Failures at the User’s Site Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou University of Illinois at Urbana-Champaign

  2. Motivation • Software failures are a major contributor to system downtime and security holes. • Software has grown in size, complexity, and cost. • Software testing has become more difficult. • Software packages inevitably contain bugs (even production releases).

  3. Motivation • Result: software failures during production runs at users’ sites. • One solution, offsite software diagnosis, has problems: • Difficult to reproduce failure-triggering conditions. • Cannot provide timely online recovery (e.g., from fast Internet worms). • Programmers cannot be provided to every site. • Privacy concerns.

  4. Goal: automatically diagnose software failures that occur during production runs at end-user sites. • Understand a failure that has happened. • Find the root causes. • Minimize manual debugging.

  5. Current state of the art • Offsite diagnosis: interactive debuggers, program slicing, core dump analysis (partial execution path construction). • Primitive onsite diagnosis: unprocessed failure information collections, deterministic replay tools. • Large overhead makes such techniques impractical for production sites. • All require manual analysis. • Privacy concerns.

  6. Onsite Diagnosis • Efficiently reproduce the failure that occurred (i.e., fast and automatic). • Impose little overhead during normal execution. • Require no human involvement. • Require no prior knowledge.

  7. Triage • Captures the failure point and conducts just-in-time failure diagnosis with checkpoint/re-execution. • Delta generation and delta analysis. • An automated, top-down, human-like software failure diagnosis protocol. • Reports: • Failure nature and type. • Failure-triggering conditions. • Failure-related code/variables and the fault propagation chain.

  8. Triage Architecture 3 groups of components: • Runtime Group. • Control Group. • Analysis Group.

  9. Checkpoint & Re-execution • Uses Rx (previous work by the authors). • Rx checkpointing: • Uses fork()-like operations. • Keeps a copy of accessed files and file pointers. • Records messages using a network proxy. • Re-execution may be deliberately modified (e.g., with different inputs or environment).
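
A minimal C sketch of the fork()-based checkpointing idea (illustrative only, not the authors' implementation; here a checkpoint is a stopped copy-on-write child, and rollback wakes it while abandoning the failed execution):

    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Take a checkpoint: returns the snapshot's pid in the running
       process, and 0 in the snapshot once a rollback resumes it. */
    static pid_t take_checkpoint(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            raise(SIGSTOP);      /* snapshot: freeze until rollback */
            return 0;            /* execution resumes here afterwards */
        }
        return pid;              /* original process keeps running */
    }

    /* Roll back: wake the snapshot, abandon the failed execution. */
    static void rollback(pid_t snapshot)
    {
        kill(snapshot, SIGCONT);
        _exit(1);
    }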

  10. Lightweight Monitoring for detecting failures • Must not impose high overhead. • Cheapest way: catch fault traps: • Assertions • Access violations • Divide by zero • More… • Extensions: Branch histories, system call trace… • Triage only uses exceptions and assertions.
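
For illustration, the cheap fault traps above map naturally onto Unix signal handlers; a minimal sketch (the handler body is an assumption, the paper shows no such code):

    #include <signal.h>
    #include <unistd.h>

    static void on_fault(int sig)
    {
        /* A real handler would trigger rollback and diagnostic
           re-execution; write() is used as it is async-signal-safe. */
        static const char msg[] = "fault trap caught; starting diagnosis\n";
        (void)sig;
        (void)write(STDERR_FILENO, msg, sizeof msg - 1);
        _exit(1);
    }

    int main(void)
    {
        signal(SIGSEGV, on_fault);   /* access violations */
        signal(SIGFPE,  on_fault);   /* divide by zero */
        signal(SIGABRT, on_fault);   /* failed assert() calls abort() */
        /* ... run the monitored application ... */
        return 0;
    }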

  11. Control layer • Implements the Triage diagnosis protocol. • Controls re-executions with different inputs based on past results. • Chooses which analysis techniques to apply. • Collects results and sends them to off-site programmers.

  12. Analysis Layer • Techniques (table in the original slides).

  13. TDP: Triage Diagnosis Protocol • Simple replay → deterministic bug. • Core dump analysis → stack/heap OK; segmentation fault in strlen(). • Dynamic bug detection → null-pointer dereference. • Delta generation → collection of good and bad inputs. • Delta analysis → code paths leading to the fault. • Report.

  14. TDP: Triage Diagnosis Protocol • Example report (figure in the original slides).

  15. Protocol extensions and variations • Add different debugging techniques. • Reorder diagnosis steps. • Omit steps (e.g., memory checks for Java programs). • The protocol may be custom-designed for specific applications. • Try to fix bugs: • Filter failure-triggering inputs. • Dynamically delete code – risky. • Change variable values. • Automatic patch generation – future work?

  16. Delta Generation • Two goals: • Generate many similar replays: some that fail and some that don’t. • Identify a signature of failure-triggering inputs. • Signatures may be used for: • Failure analysis and reproduction. • Input filtering (e.g., Vigilante, Autograph, etc.).

  17. Delta Generation • Changing the input: • Replay previously stored client requests via the proxy; try different subsets and combinations. • Isolate the bug-triggering part via data “fuzzing” (see the sketch below). • Find non-failing inputs with minimum distance from failing ones. • Make protocol-aware changes. • Use a “normal form” of the input if the specific triggering portion is known. • Changing the environment: • Pad or zero-fill new allocations. • Change message order. • Drop messages. • Manipulate thread scheduling. • Modify the system environment. • Use information from prior steps (e.g., target specific buffers).
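
The data-“fuzzing” item above can be made concrete with a toy sketch: derive variants of a failing input by flipping one byte at a time and replay each, looking for a minimally different non-failing input (the replay hook and the single-byte strategy are illustrative assumptions; Triage’s real mutations are protocol-aware):

    #include <stddef.h>
    #include <string.h>

    /* Caller-supplied replay hook: nonzero if the variant still fails. */
    typedef int (*replay_fn)(const unsigned char *input, size_t len);

    /* Find a one-byte change turning a failing input into a passing one;
       returns the flipped offset, or -1 if no single flip passes. */
    ptrdiff_t find_passing_flip(const unsigned char *failing, size_t len,
                                replay_fn replay, unsigned char *variant)
    {
        for (size_t i = 0; i < len; i++) {
            memcpy(variant, failing, len);
            variant[i] ^= 0xFF;              /* mutate a single byte */
            if (!replay(variant, len))
                return (ptrdiff_t)i;         /* minimal-distance pass */
        }
        return -1;
    }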

  18. Delta Generation • Results passed to the next stage: • Break the code into basic blocks. • For each replay, extract a vector of execution counts for each block, plus the block trace. • The granularity can be changed.

  19. Example revisited • Good run: trace AHIKBDEFEF…EG; block vector {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}. • Bad run: trace AHIJBCDE; block vector {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}.
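
A small C sketch of the block-vector extraction from a recorded trace, with single-letter block names as in the toy example (illustrative, not Triage’s actual instrumentation):

    #include <string.h>

    /* e.g., block_vector("AHIJBCDE", v) sets A,B,C,D,E,H,I,J to 1. */
    void block_vector(const char *trace, int counts[26])
    {
        memset(counts, 0, 26 * sizeof counts[0]);
        for (; *trace; trace++)
            if (*trace >= 'A' && *trace <= 'Z')
                counts[*trace - 'A']++;
    }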

  20. Delta Analysis Follows three steps: • Basic Block Vector (BBV) Comparison: Find a pair of most similar failing and non-failing replays F and S. • Path comparison: Compare the execution path of F and S. • Intersection with backward slice: Find the difference that contributes to the failure.

  21. Delta Analysis: BBV Comparison • The number of times each block executes is recorded via instrumentation. • Calculate the Manhattan distance between every pair of failing and non-failing replays (the minimum-distance requirement can be relaxed to merely similar pairs). • In the example: the difference vector is {C:-1, E:10, F:10, G:1, J:-1, K:1}, giving a Manhattan distance of 24.
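
A minimal sketch of the distance computation itself; for the good and bad vectors of the example it returns 24:

    /* Manhattan distance between two basic-block vectors. */
    int manhattan_distance(const int a[26], const int b[26])
    {
        int d = 0;
        for (int i = 0; i < 26; i++)
            d += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        return d;
    }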

  22. Delta Analysis: Path Comparison • Considers execution order. • Finds where the failing and non-failing runs diverge. • Computes the minimum edit distance, i.e., the minimum number of insertion, deletion, and substitution operations needed to transform one trace into the other. • Example (figure in the original slides).
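
A sketch of the edit-distance computation using the textbook O(NM) dynamic program; the paper’s implementation uses a faster O(ND) algorithm, but the resulting distance is the same:

    #include <string.h>

    /* Minimum number of insertions, deletions, and substitutions
       transforming block trace a into block trace b. */
    int edit_distance(const char *a, const char *b)
    {
        size_t n = strlen(a), m = strlen(b);
        int dp[n + 1][m + 1];                /* VLA; fine for short traces */
        for (size_t i = 0; i <= n; i++) dp[i][0] = (int)i;
        for (size_t j = 0; j <= m; j++) dp[0][j] = (int)j;
        for (size_t i = 1; i <= n; i++)
            for (size_t j = 1; j <= m; j++) {
                int sub = dp[i-1][j-1] + (a[i-1] != b[j-1]);
                int del = dp[i-1][j] + 1;
                int ins = dp[i][j-1] + 1;
                int min = sub < del ? sub : del;
                dp[i][j] = min < ins ? min : ins;
            }
        return dp[n][m];
    }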

  23. Delta Analysis: Backward Slicing • Want to eliminate differences that have no effect on the failure. • Dynamic backward slicing extracts a program slice consisting of all and only those instructions that affect a given instruction’s execution. • The starting point may be supplied by earlier steps of the protocol. • Optimization: dependencies are built dynamically during the replays. • Experiments show the overhead is acceptably low for post-hoc analysis.
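
A toy C sketch of dynamic backward slicing over a recorded trace, tracking only data dependences (the trace representation and all names are illustrative assumptions; real slicers also handle control dependences):

    #include <stdbool.h>

    #define MAX_VARS 64
    #define MAX_USES 2

    struct trace_entry {
        int def;                /* variable defined, or -1 if none */
        int uses[MAX_USES];     /* variables read */
        int n_uses;
    };

    /* Mark every trace entry the faulting entry transitively depends on. */
    void backward_slice(const struct trace_entry *trace, int len,
                        int fault, bool in_slice[])
    {
        bool needed[MAX_VARS] = { false };
        for (int i = 0; i < len; i++)
            in_slice[i] = false;

        in_slice[fault] = true;
        for (int u = 0; u < trace[fault].n_uses; u++)
            needed[trace[fault].uses[u]] = true;

        for (int i = fault - 1; i >= 0; i--)
            if (trace[i].def >= 0 && needed[trace[i].def]) {
                in_slice[i] = true;
                needed[trace[i].def] = false;  /* nearest definition found */
                for (int u = 0; u < trace[i].n_uses; u++)
                    needed[trace[i].uses[u]] = true;
            }
    }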

  24. Backward slicing and result intersection (figure in the original slides).

  25. Limitations and Extensions • Need to define a privacy policy for the results sent to programmers. • Very limited success with patch generation. • Does not handle memory leaks well. • A failure must actually occur; silently incorrect operation is not handled. • Bugs that take a long time to manifest are difficult to reproduce. • No support for deterministic replay on multi-processor architectures. • False positives are possible.

  26. Evaluation Methodology • Experimented with 10 real software failures in 9 applications. • Triage is implemented on Linux (kernel 2.4.22). • Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory, 100 Mbps Ethernet. • Triage checkpoints every 200 ms and keeps 20 checkpoints. • User study: 15 programmers were given 5 bugs, with Triage’s report provided for some of them; time to locate each bug was compared with and without the report.

  27. Bugs used for evaluation (table in the original slides).

  28. Experimental Results (no input testing; table in the original slides).

  29. Experimental Results • For application bugs, delta generation only worked for BC and TAR. • In all cases Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic). • In all 6 applicable cases Triage correctly pinpoints the bug type, buggy instruction, and memory location. • Where delta analysis applies, it reduces the amount of data to be examined by 63% on average (best: 98%, worst: 12%). • For MySQL, it finds an example interleaving pair as the trigger.

  30. Case Study 1: Apache • Failure at ap_pregsub. • The bug detector catches a stack smash in lmatcher. • How can lmatcher affect try_alias_list? • The stack smash overwrites the stack frame above it, invalidating r. • The trace shows how lmatcher is called by try_alias_list. • The failure is independent of the request headers. • The failure is triggered by requests for a specific resource.

  31. Case Study 2: Squid • Core dump analysis suggests a heap overflow. • It happens at a strcat of two buffers. • Fault propagation shows how the buffers were allocated: one is sized by strlen(user), while the other holds up to strlen(user)*3 bytes. • Input testing gives the failure-triggering input. • It also gives minimally different non-failing inputs.

  32. Efficiency and Overhead • Normal execution overhead: • Negligible effect from checkpointing. • In no case over 5%. • With 400 ms checkpointing intervals, overhead is 0.1%.

  33. Efficiency and Overhead • Diagnosis efficiency: • Except for delta analysis, all steps are efficient; every other diagnostic step finishes within 5 minutes. • Delta analysis time is governed by the edit distance D in the O(ND) computation (N is the number of blocks). • The comparison step of delta analysis may run in the background.

  34. User Study • Real bugs: • On average, programmers took 44.6% less time debugging using Triage reports. • Toy bugs: • On average, programmers took 18.4% less time debugging using Triage reports.

  35. Questions?
