Characterization of Pathological Behavior
http://www.ices.cmu.edu/ballista

Philip Koopman - (412) 268-5225

Dan Siewiorek - (412) 268-2570

(and more than a dozen other contributors)



Goals

  • Detect pathological patterns for fault prognosis

  • Develop fault propagation models

  • Develop statistical identification and stochastic characterization of pathological phenomena



  • Definitions

  • Digital Hardware Prediction

  • Digital Software Characterization

  • Research Challenges

Definitions: Cause-Effect Sequence and Duration

  • FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design

  • ERROR - manifestation of a fault within a program or data structure

  • FAILURE - delivered service deviates from the specified service because of an error

  • Duration

    • Permanent - continuous and stable, due to hardware failure; repair by replacement

    • Intermittent - occasionally present, due to unstable hardware or varying hardware/software state; repair by replacement

    • Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement

CMU Andrew File Server Study

  • Configuration

    • 13 SUN II workstations with 68010 processors

    • 4 Fujitsu Eagle disk drives

  • Observations

    • 21 workstation-years

  • Frequency of events

    • Permanent failures: 29

    • Intermittent faults: 610

    • Transient faults: 446

    • System crashes: 298

  • Mean time to

    • Permanent failure: 6552 hours

    • Intermittent fault: 58 hours

    • Transient fault: 354 hours

    • System crash: 689 hours

Some Interesting Numbers

  • Permanent Outages/Total Crashes = 0.1

  • Intermittent Faults/Permanent Failures = 21

    • Thus first symptom appears over 1200 hours prior to repair

  • (Crashes - Permanent)/Total Faults = 0.255

  • 14/29 failures had three or fewer error log entries

    • 8/29 had no error log entries
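
The ratios on this slide can be reproduced from the event counts in the Andrew file server study. The sketch below is only a worked check of that arithmetic. It assumes that "Total Faults" means intermittent plus transient faults (this reading reproduces 0.255) and that the "over 1200 hours" figure comes from multiplying roughly 21 intermittent faults per permanent failure by the 58-hour mean time between intermittent faults. The function names are illustrative, not from the original work.

```c
/* Event counts and mean time between intermittent faults, copied from the
 * CMU Andrew File Server Study slide. */
enum {
    PERMANENT_FAILURES  = 29,
    INTERMITTENT_FAULTS = 610,
    TRANSIENT_FAULTS    = 446,
    SYSTEM_CRASHES      = 298,
    MEAN_HOURS_BETWEEN_INTERMITTENT = 58
};

/* Permanent outages per system crash (about 0.1). */
double permanent_per_crash(void)
{
    return (double)PERMANENT_FAILURES / SYSTEM_CRASHES;
}

/* Intermittent faults logged per permanent failure (about 21). */
double intermittent_per_permanent(void)
{
    return (double)INTERMITTENT_FAULTS / PERMANENT_FAILURES;
}

/* Assumed derivation of the "over 1200 hours" bullet: whole intermittent
 * faults per permanent failure times the mean time between them. */
int warning_window_hours(void)
{
    return (INTERMITTENT_FAULTS / PERMANENT_FAILURES)
           * MEAN_HOURS_BETWEEN_INTERMITTENT;
}

/* Crashes not explained by permanent failures, per intermittent or
 * transient fault. "Total faults" = intermittent + transient is an
 * assumption made here. */
double nonpermanent_crashes_per_fault(void)
{
    return (double)(SYSTEM_CRASHES - PERMANENT_FAILURES)
           / (INTERMITTENT_FAULTS + TRANSIENT_FAULTS);
}
```

Under these assumptions the warning window works out to 21 × 58 = 1218 hours, consistent with the "over 1200 hours" bullet.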

Harbinger Detection of Anomalies

Digital Hardware Prediction

Measurement and Prediction Module

  • History Collection -- calculation and reporting of system availability

  • Future Prediction -- failure prediction for system devices

[Figure: block diagram; surviving labels: Measurement & Prediction, Future Predict, Application Prog, Operating System]

History Collection

  • => Availability

  • This module consists of:

    • Crash Monitor - monitors system state by sampling it periodically

    • Calculator - average uptime and average of fraction

[Figure: History Collection block diagram and uptime timeline; surviving labels: Files of system state info, Application Prog, Operating System, Crash Monitor, Average uptime, "periodically samples system state", "System state's changing", "= 5min", "= t2 - t1 = 600min", "= t3 - t1=13min"]
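
A minimal sketch of how a calculator might turn periodic crash-monitor samples into availability figures. This is not the module's actual code. The sample encoding (1 = up, 0 = down), the function names, and the definition of average uptime as the mean length of uninterrupted "up" runs are all assumptions made for illustration.

```c
#include <stddef.h>

/* One periodic observation by a hypothetical crash monitor: 1 = up, 0 = down. */
typedef int sample_t;

/* Availability estimated as the fraction of samples in which the system was up. */
double availability(const sample_t *samples, size_t n)
{
    size_t up = 0;
    for (size_t i = 0; i < n; i++)
        if (samples[i])
            up++;
    return n ? (double)up / (double)n : 0.0;
}

/* Average uptime in minutes: mean length of the uninterrupted "up" runs,
 * where each sample stands for period_min minutes of observation. */
double average_uptime_min(const sample_t *samples, size_t n, double period_min)
{
    size_t runs = 0, up = 0;
    for (size_t i = 0; i < n; i++) {
        if (!samples[i])
            continue;
        up++;
        if (i == 0 || !samples[i - 1])
            runs++;                     /* a new "up" run starts here */
    }
    return runs ? (double)up * period_min / (double)runs : 0.0;
}
```

For example, the samples {1, 1, 1, 0, 1, 1} taken every 5 minutes give an availability of 5/6 and an average uptime of 12.5 minutes (two runs covering five samples). Sampling bounds the resolution: an outage shorter than one period can be missed entirely.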

Preliminary Experiment Data (cont.)

An NT system cumulative availability daily report over a 5-month period

Future Prediction

  • This module generates device failure warning information

    • Sys-log Monitor: monitors new entries by checking the system event log periodically.

    • DFT Engine: applies the DFT (Dispersion Frame Technique) heuristic and issues the corresponding device failure warning if the rules are satisfied.

[Figure: block diagram; surviving labels: Future Prediction, Files of, Application Prog, Operating System]

Principle from observation

  • Observation: periods of increasingly unreliable behavior precede catastrophic failure.

  • Error entry example: DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0

[Figure: error log entries filtered by event type; surviving labels: type, time, Mem Board, Filter by event type]

  • Based on this observation, the DFT Heuristic was derived to detect the non-monotonically decreasing inter-arrival time.

How DFT Works via an Example

  • Rule: if a sliding window of 1/2 of the current error interval successively twice covers 3 errors in the future, issue a warning

[Figure: timeline of the last 5 errors of the same type (disk)]
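
The observation behind DFT is that errors of one type arrive closer and closer together before a failure. The sketch below illustrates only that observation and is not the DFT rule quoted above: it checks whether every inter-arrival time among the most recent errors is strictly shorter than the one before, which is a deliberately stricter simplification. The function name and the use of plain numeric timestamps are assumptions.

```c
#include <stddef.h>

/* Returns 1 when every inter-arrival time between consecutive errors is
 * strictly shorter than the one before it, 0 otherwise.
 * t[] holds n timestamps of errors of the same type, oldest first. */
int interarrival_shrinking(const double *t, size_t n)
{
    if (n < 3)
        return 0;                       /* need two intervals to compare */
    for (size_t i = 2; i < n; i++) {
        double prev = t[i - 1] - t[i - 2];
        double cur  = t[i] - t[i - 1];
        if (cur >= prev)
            return 0;                   /* interval did not shrink */
    }
    return 1;
}
```

With five disk errors at times 0, 100, 160, 190, and 200 the intervals are 100, 60, 30, and 10, so the check returns 1. Evenly spaced errors return 0. A production heuristic would need to tolerate noise in the trend, which this strict check does not.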


Digital Software Characterization

Where We Started: Component Wrapping

  • Improve Commercial Off-The-Shelf (COTS) software robustness

Exception Handling: The Basis for Error Detection

  • Exception handling is an important part of dependable systems

    • Responding to unexpected operating conditions

    • Tolerating activation of latent design defects

  • Robustness testing can help evaluate software dependability

    • Reaction to exceptional situations (current results)

    • Reaction to overloads and software “aging” (future results)

    • First big objective: measure exception handling robustness

      • Apply to operating systems

      • Apply to other applications

  • It’s difficult to improve something you can’t measure … so let’s figure out how to measure robustness!

Measurement Part 1: Software Testing

  • SW testing requires -> Ballista uses:

    • Test case -> “Bad” value combinations

    • Module under test -> Module under test

    • Oracle (a “specification”) -> Watchdog timer/core dumps

Ballista: Scalable Test Generation

  • Ballista combines test values to generate test cases
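
One simple way to combine per-parameter test values into test cases is to enumerate the Cartesian product of the value pools. The sketch below does this by treating a test case number as a mixed-radix index. It is an illustration under that assumption, not Ballista's actual generator, and the function names are hypothetical.

```c
#include <stddef.h>

/* Total number of test cases: the product of the per-parameter pool sizes. */
size_t combo_count(const size_t *pool_sizes, size_t nparams)
{
    size_t total = 1;
    for (size_t p = 0; p < nparams; p++)
        total *= pool_sizes[p];
    return total;
}

/* Decode test case number `index` (0 <= index < combo_count) into one
 * value index per parameter, treating `index` as a mixed-radix number.
 * choice[p] selects which test value to pass as parameter p. */
void decode_combo(size_t index, const size_t *pool_sizes, size_t nparams,
                  size_t *choice)
{
    for (size_t p = 0; p < nparams; p++) {
        choice[p] = index % pool_sizes[p];
        index /= pool_sizes[p];
    }
}
```

For a two-parameter function with 3 candidate values for the first parameter and 2 for the second, there are 6 test cases, and case number 4 selects value 1 for each parameter. Because any case can be regenerated from its number alone, a failing test is easy to repeat.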

Ballista: “High Level” + “Repeatable”

  • High level testing is done using API to perform fault injection

    • Send exceptional values into a system through the API

      • Requires no modification to code -- only linkable object files needed

      • Can be used with any function that takes a parameter list

    • Direct testing instead of middleware injection simplifies usage

  • Each test is a specific function call with a specific set of parameters

    • System state initialized & cleaned up for each single-call test

    • Combinations of valid and invalid parameters tried in turn

    • A “simplistic” model, but it does in fact work...

  • Early results were encouraging:

    • Found a significant percentage of functions with robustness failures

    • Crashed systems from user mode

  • The testing object-based approach scales!

CRASH Robustness Testing Result Categories

  • Catastrophic

    • Computer crashes/panics, requiring a reboot

    • e.g., Irix 6.2: munmap(malloc((1<<30)+1), ((1<<31)-1));

    • e.g., DUNIX 4.0D: mprotect(malloc((1 << 29)+1), 65537, 0);

  • Restart

    • Benchmark process hangs, requiring restart

  • Abort

    • Benchmark process aborts (e.g., “core dump”)

  • Silent

    • No error code generated when one should have been (e.g., de-referencing null pointer produces no error)

  • Hindering

    • Incorrect error code generated
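
Two of these categories, Restart and Abort, can be observed from outside the process under test. The sketch below is a hypothetical POSIX harness, not Ballista's implementation: it runs one call in a child process, uses an alarm as the watchdog timer, and classifies the outcome from the child's wait status. Catastrophic failures take down the whole machine and cannot be seen this way, and Silent and Hindering failures need knowledge of the expected error code, so neither is covered. All names here are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    RESULT_PASS,           /* call returned normally */
    RESULT_RESTART,        /* call hung until the watchdog fired */
    RESULT_ABORT,          /* call was killed by a signal (e.g., core dump) */
    RESULT_HARNESS_ERROR   /* the harness itself could not run the test */
} result_t;

/* Run one test call in a child process so that a failure cannot take down
 * the harness. A watchdog alarm ends a hung child after timeout_sec seconds. */
result_t run_isolated(void (*test_call)(void), unsigned timeout_sec)
{
    pid_t pid = fork();
    if (pid < 0)
        return RESULT_HARNESS_ERROR;
    if (pid == 0) {
        signal(SIGALRM, SIG_DFL);   /* make sure the alarm kills the child */
        alarm(timeout_sec);
        test_call();
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return RESULT_HARNESS_ERROR;
    if (WIFSIGNALED(status))
        return WTERMSIG(status) == SIGALRM ? RESULT_RESTART : RESULT_ABORT;
    return RESULT_PASS;
}

/* Hypothetical sample calls used only to exercise the harness. */
void sample_ok(void)    { }
void sample_abort(void) { abort(); }
void sample_hang(void)  { for (;;) pause(); }
```

Calling run_isolated(sample_abort, 2) reports an Abort, and run_isolated(sample_hang, 1) reports a Restart once the one-second watchdog expires. Forking a fresh child for each call also gives every test a clean process state.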

Digital Unix 4.0 Results

Comparing Fifteen POSIX Operating Systems

Failure Rates By POSIX Fn/Call Category

C Library Is A Potential Robustness Bottleneck

Failure Rates by Function Group

Technology Transfer

  • Original project sponsor: DARPA

    • Sponsored technology transfer projects for:

      • Trident Submarine navigation system (U.S. Navy)

      • Defense Modeling & Simulation Office HLA system

  • Industrial sponsors are continuing the work

    • Cisco – Network switching infrastructure

    • ABB – Industrial automation framework

    • Emerson – Windows CE testing

    • AT&T – CORBA testing

    • ADtranz – (defining project)

    • Microsoft – Windows 2000 testing

  • Other users include

    • Rockwell, Motorola, and, potentially, some POSIX OS developers

Specifying A Test (web/demo interface)

  • Simple demo interface; real interface has a few more steps...

Viewing Results

  • Each robustness failure is one test case (one set of parameters)

“Bug Report” program creation

  • Reproduces failure in isolation (>99% effective in practice)

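    /* Ballista single test case Sun Jun 13 14:11:06 1999
     * fopen(FNAME_NEG, STR_EMPTY) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *str_empty = "";
        char *param0, *param1, *str_ptr;

        param0 = (char *) -1;           /* FNAME_NEG: invalid filename pointer */
        str_ptr = (char *) malloc (strlen (str_empty) + 1);
        strcpy (str_ptr, str_empty);
        param1 = str_ptr;               /* STR_EMPTY: empty mode string */

        fopen (param0, param1);         /* the call under test */
        return 0;
    }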

Research Challenges

  • Ballista provides a small, discrete state-space for software components

  • Challenge is to create models of inter-module relations and workload statistics to create predictions

  • Create discrete simulations using model and probabilities as input parameters

  • Validation of model at a high level of abstraction through experimentation on testbed

  • Optimize cost/performance



  • What does it take to do this sort of research?

    • A legacy of 15 years of previous Carnegie Mellon work to build upon

      • But, sometimes it takes that long just to understand the real problems!

    • Ballista: 3.5 years and about $1.6 Million spent to date


  • Meredith Beveridge

  • John Devale

  • Kim Fernsler

  • David Guttendorf

  • Geoff Hendrey

  • Nathan Kropp

  • Jiantao Pan

  • Charles Shelton

  • Ying Shi

  • Asad Zaidi

Faculty & Staff:

  • Kobey DeVale

  • Phil Koopman

  • Roy Maxion

  • Dan Siewiorek
