Characterization of Pathological Behavior
http://www.ices.cmu.edu/ballista

Philip Koopman

[email protected] - (412) 268-5225

Dan Siewiorek

[email protected] - (412) 268-2570

(and more than a dozen other contributors)


Goals

  • Detect pathological patterns for fault prognosis

  • Develop fault propagation models

  • Develop statistical identification and stochastic characterization of pathological phenomena


Outline

  • Definitions

  • Digital Hardware Prediction

  • Digital Software Characterization

  • Research Challenges

Definitions: Cause-Effect Sequence and Duration

  • FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design

  • ERROR - manifestation of a fault within a program or data structure

  • FAILURE - delivered service deviates from specified service due to an error


    • Permanent - continuous and stable, due to hardware failure; repaired by replacement

    • Intermittent - occasionally present, due to unstable hardware or varying hardware/software state; repaired by replacement

    • Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement

CMU Andrew File Server Study

  • Configuration

    • 13 SUN II Workstations with 68010 processor

    • 4 Fujitsu Eagle Disk Drives

  • Observations

    • 21 Workstation Years

  • Frequency of events and mean time to event:

    Event                  Count    Mean Time To
    Permanent Failures     29       6552 hours
    Intermittent Faults    610      58 hours
    Transient Faults       446      354 hours
    System Crashes         298      689 hours

Some Interesting Numbers

  • Permanent Outages/Total Crashes = 0.1

  • Intermittent Faults/Permanent Failures = 21

    • Thus the first symptom appears over 1200 hours prior to repair (see the worked check after this list)

  • (Crashes - Permanent)/Total Faults = 0.255

  • 14/29 failures had three or fewer error log entries

    • 8/29 had no error log entries
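
These ratios can be re-derived from the Andrew study counts on the previous slide; a quick worked check (taking "total faults" as intermittent plus transient faults):

    \frac{29~\text{permanent}}{298~\text{crashes}} \approx 0.10,
    \qquad
    \frac{610~\text{intermittent}}{29~\text{permanent}} \approx 21,
    \qquad
    \frac{298-29}{610+446} = \frac{269}{1056} \approx 0.255

Combined with the 58-hour intermittent-fault MTTF, $21 \times 58~\text{h} \approx 1218~\text{h}$, which is the source of the "over 1200 hours from first symptom to repair" figure.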

Measurement & Prediction Module

  • History Collection -- calculation and reporting of system availability

  • Future Prediction -- failure prediction of system devices

[Diagram: a Measurement & Prediction layer instruments the Application Program and Operating System, feeding the History Collection and Future Prediction modules.]

History Collection

  • => Availability

  • This module consists of:

    • Crash Monitor - monitors system state (a sketch follows below)

    • Calculator - computes average uptime and average availability fraction

[Diagram: the Crash Monitor, sitting between the Application Program and the Operating System, periodically writes files of system state info for the Calculator.]
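
As a rough illustration only (not the project's actual monitor), the Crash Monitor can be pictured as a loop that appends a timestamped heartbeat to a state file each sampling period; the file name and the 5-minute period below are assumptions taken from the figure:

    /* Minimal crash-monitor sketch: gaps between heartbeat records
     * in the state file mark crash intervals.  STATE_FILE and the
     * 5-minute period are placeholder assumptions. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define STATE_FILE "state.log"
    #define PERIOD_SEC (5 * 60)

    int main(void)
    {
      for (;;) {
        FILE *f = fopen(STATE_FILE, "a");
        if (f != NULL) {
          fprintf(f, "%ld\n", (long)time(NULL));  /* one heartbeat */
          fclose(f);
        }
        sleep(PERIOD_SEC);
      }
      return 0;
    }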



Average Uptime

[Timeline figure: the Crash Monitor periodically samples the changing system state; intervals shown: 5 min (the sampling period), t2 - t1 = 600 min, t3 - t1 = 13 min.]
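
Continuing the sketch above (again an illustration, not the real Calculator), average uptime can be recovered from the heartbeat file by treating any gap longer than a couple of sampling periods as a crash:

    /* Average-uptime calculator sketch: reads heartbeat timestamps
     * written by the monitor above; a gap > GAP_SEC ends an uptime
     * run.  Threshold and file name are placeholder assumptions. */
    #include <stdio.h>

    #define GAP_SEC (2 * 5 * 60)

    int main(void)
    {
      FILE *f = fopen("state.log", "r");
      long t, prev = -1, start = -1, up_total = 0;
      int runs = 0;

      if (f == NULL) return 1;
      while (fscanf(f, "%ld", &t) == 1) {
        if (prev >= 0 && t - prev > GAP_SEC) {   /* crash detected */
          up_total += prev - start;
          runs++;
          start = t;                             /* reboot: new run */
        }
        if (start < 0) start = t;                /* first record */
        prev = t;
      }
      fclose(f);
      if (prev > start) { up_total += prev - start; runs++; }
      if (runs > 0)
        printf("average uptime: %ld min over %d runs\n",
               up_total / runs / 60, runs);
      return 0;
    }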

Preliminary Experiment Data (cont.)

[Figure: daily cumulative availability report for an NT system over a 5-month period.]

Future Prediction

  • This module generates device failure warning information

    • Sys-log Monitor - monitors new entries by checking the system event log periodically

    • DFT Engine - applies the DFT Heuristic and issues the corresponding device failure warning if the rules are satisfied

Dispersion Frame Technique

[Diagram: the Future Prediction module, sitting between the Application Program and the Operating System, reads files of system event log entries and applies the DFT.]

Principle from Observation

  • Devices exhibit periods of increasingly unreliable behavior prior to catastrophic failure.

  • Error entry example (the leading fields are the event type and time):

    DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0

  • Log entries are filtered by event type (e.g., Mem Board, DISK) before analysis.

  • Based on this observation, the DFT Heuristic was derived to detect non-monotonic decreases in error inter-arrival time.

How DFT Works, via an Example

  • Consider the last 5 errors of the same type (disk).

  • Rule: if a sliding window of 1/2 of the current error interval successively twice covers 3 errors in the future, issue a warning (one possible reading of this rule is sketched below).

Where We Started: Component Wrapping

  • Improve Commercial Off-The-Shelf (COTS) software robustness

Exception Handling: The Basis for Error Detection

  • Exception handling is an important part of dependable systems

    • Responding to unexpected operating conditions

    • Tolerating activation of latent design defects

  • Robustness testing can help evaluate software dependability

    • Reaction to exceptional situations (current results)

    • Reaction to overloads and software “aging” (future results)

    • First big objective: measure exception handling robustness

      • Apply to operating systems

      • Apply to other applications

  • It’s difficult to improve something you can’t measure … so let’s figure out how to measure robustness!

Measurement Part 1: Software Testing

    SW Testing requires:            Ballista uses:

    Test case                       "Bad" value combinations
    Module under test               Module under Test
    Oracle (a "specification")      Watchdog timer / core dumps

Ballista: Scalable Test Generation

  • Ballista combines test values to generate test cases
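
For illustration (the value pools and names below are made up, not Ballista's real type dictionaries), combining per-parameter test values into test cases amounts to enumerating a cross product:

    /* Sketch of type-based test-case generation: each parameter draws
     * from a pool of exceptional values, and every combination is one
     * single-call test case.  Pools here are illustrative placeholders. */
    #include <stdio.h>

    static const char *fname_vals[] = { "NULL", "NEG", "EMPTY", "HUGE" };
    static const char *mode_vals[]  = { "NULL", "EMPTY", "INVALID" };

    int main(void)
    {
      int n = 0;
      for (size_t i = 0; i < sizeof fname_vals / sizeof *fname_vals; i++)
        for (size_t j = 0; j < sizeof mode_vals / sizeof *mode_vals; j++)
          printf("test %2d: fopen(FNAME_%s, STR_%s)\n",
                 ++n, fname_vals[i], mode_vals[j]);
      return 0;
    }

With 4 values for the first parameter and 3 for the second, 12 tests result; because the pools belong to parameter types rather than to individual functions, the same values are reused across an entire API.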

Ballista: “High Level” + “Repeatable”

  • High-level testing is done using the API to perform fault injection

    • Send exceptional values into a system through the API

      • Requires no modification to code -- only linkable object files needed

      • Can be used with any function that takes a parameter list

    • Direct testing instead of middleware injection simplifies usage

  • Each test is a specific function call with a specific set of parameters

    • System state initialized & cleaned up for each single-call test

    • Combinations of valid and invalid parameters tried in turn

    • A “simplistic” model, but it does in fact work...

  • Early results were encouraging:

    • Found a significant percentage of functions with robustness failures

    • Crashed systems from user mode

  • The object-based testing approach scales!

CRASH Robustness Testing Result Categories

  • Catastrophic

    • Computer crashes/panics, requiring a reboot

    • e.g., Irix 6.2: munmap(malloc((1<<30)+1), ((1<<31)-1));

    • e.g., DUNIX 4.0D: mprotect(malloc((1 << 29)+1), 65537, 0);

  • Restart

    • Benchmark process hangs, requiring restart

  • Abort

    • Benchmark process aborts (e.g., “core dump”)

  • Silent

    • No error code generated when one should have been (e.g., de-referencing a null pointer produces no error)

  • Hindering

    • Incorrect error code generated
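
A minimal sketch of how such outcomes can be distinguished in a single-call harness (assumptions: a 10-second watchdog and fork-based isolation; this is not Ballista's actual harness):

    /* Classify one single-call test using fork + a watchdog timer.
     * The timeout, test call, and category mapping are simplifying
     * assumptions for this sketch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
      pid_t pid = fork();
      if (pid == 0) {
        alarm(10);              /* watchdog: SIGALRM if the call hangs */
        fopen((char *) -1, ""); /* the exceptional single-call test */
        exit(0);                /* survived: Silent or Hindering,
                                   depending on the error code set */
      }
      int status;
      waitpid(pid, &status, 0);
      if (WIFSIGNALED(status))
        printf("%s\n", WTERMSIG(status) == SIGALRM
                         ? "Restart (hung, killed by watchdog)"
                         : "Abort (e.g., core dump)");
      else
        printf("returned; inspect error code for Silent vs. Hindering\n");
      /* Catastrophic (whole-machine crash) cannot be caught in-process. */
      return 0;
    }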

Technology Transfer

  • Original project sponsor: DARPA

    • Sponsored technology transfer projects for:

      • Trident Submarine navigation system (U.S. Navy)

      • Defense Modeling & Simulation Office HLA system

  • Industrial sponsors are continuing the work

    • Cisco – Network switching infrastructure

    • ABB – Industrial automation framework

    • Emerson – Windows CE testing

    • AT&T – CORBA testing

    • ADtranz – (defining project)

    • Microsoft – Windows 2000 testing

  • Other users include

    • Rockwell, Motorola, and, potentially, some POSIX OS developers

Specifying A Test (web/demo interface)

  • Simple demo interface; real interface has a few more steps...

Viewing Results

  • Each robustness failure is one test case (one set of parameters)

“Bug Report” Program Creation

  • Reproduces failure in isolation (>99% effective in practice)

    /* Ballista single test case Sun Jun 13 14:11:06 1999
     * fopen(FNAME_NEG, STR_EMPTY) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
      const char *str_empty = "";
      char *param0, *param1, *str_ptr;

      param0 = (char *) -1;                              /* FNAME_NEG */
      str_ptr = (char *) malloc (strlen (str_empty) + 1);
      strcpy (str_ptr, str_empty);
      param1 = str_ptr;                                  /* STR_EMPTY */

      fopen (param0, param1);
      return 0;
    }

Research Challenges

  • Ballista provides a small, discrete state-space for software components

  • The challenge is to create models of inter-module relations and workload statistics from which to make predictions

  • Create discrete simulations using model and probabilities as input parameters

  • Validate the model at a high level of abstraction through experimentation on a testbed

  • Optimize cost/performance


  • What does it take to do this sort of research?

    • A legacy of 15 years of previous Carnegie Mellon work to build upon

      • But, sometimes it takes that long just to understand the real problems!

    • Ballista: 3.5 years and about $1.6 Million spent to date


Students:

  • Meredith Beveridge

  • John Devale

  • Kim Fernsler

  • David Guttendorf

  • Geoff Hendrey

  • Nathan Kropp

  • Jiantao Pan

  • Charles Shelton

  • Ying Shi

  • Asad Zaidi

Faculty & Staff:

  • Kobey DeVale

  • Phil Koopman

  • Roy Maxion

  • Dan Siewiorek