Characterization of Pathological Behavior

http://www.ices.cmu.edu/ballista

Philip Koopman

[email protected] - (412) 268-5225

Dan Siewiorek

[email protected] - (412) 268-2570

(and more than a dozen other contributors)

Goals
  • Detect pathological patterns for fault prognosis
  • Develop fault propagation models
  • Develop statistical identification and stochastic characterization of pathological phenomena

Outline
  • Definitions
  • Digital Hardware Prediction
  • Digital Software Characterization
  • Research Challenges
Definitions: Cause-Effect Sequence and Duration
  • FAULT - incorrect state of hardware/software caused by component failure, environment, operator errors, or incorrect design
  • ERROR - manifestation of a fault within a program or data structure
  • FAILURE - delivered service deviates from the specified service due to an error
    • Permanent - continuous and stable; due to hardware failure; repaired by replacement
    • Intermittent - occasionally present due to unstable hardware or varying hardware/software state; repaired by replacement
    • Transient - resulting from design errors or temporary environmental conditions; not repairable by replacement
CMU Andrew File Server Study
  • Configuration
    • 13 SUN II Workstations with 68010 processor
    • 4 Fujitsu Eagle Disk Drives
  • Observations
    • 21 Workstation Years
  • Frequency of events
    • Permanent Failures: 29
    • Intermittent Faults: 610
    • Transient Faults: 446
    • System Crashes: 298
  • Mean Time To
    • Permanent Failure: 6552 hours
    • Intermittent Fault: 58 hours
    • Transient Fault: 354 hours
    • System Crash: 689 hours
Some Interesting Numbers
  • Permanent Outages/Total Crashes = 0.1
  • Intermittent Faults/Permanent Failures = 21
    • Thus the first symptom appears over 1200 hours (21 × 58 hours) before repair
  • (Crashes - Permanent)/Total Faults = 0.255
  • 14/29 failures had three or fewer error log entries
    • 8/29 had no error log entries
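These ratios follow directly from the event counts on the previous slide; a quick sketch reproduces them, assuming "Total Faults" here means intermittent plus transient faults:

```python
# Event counts from the CMU Andrew File Server study (previous slide)
permanent = 29
intermittent = 610
transient = 446
crashes = 298

print(permanent / crashes)                  # ~0.1: permanent outages / total crashes
print(intermittent / permanent)             # ~21 intermittent faults per permanent failure
print((crashes - permanent) / (intermittent + transient))  # ~0.255
print(intermittent / permanent * 58)        # ~1220 hours of warning before repair
```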
Measurement and Prediction Module

[Figure: the Measurement & Prediction module sits between application programs and the operating system]

  • History Collection -- calculation and reporting of system availability
  • Future Prediction -- failure prediction of system devices
History Collection

  • Output: system availability
  • This module consists of:
    • Crash Monitor - monitors system state
    • Calculator - computes average uptime and the average availability fraction

[Figure: the Crash Monitor, between application programs and the operating system, writes files of system state info that History Collection reads]
Average Uptime

[Figure: timing diagram - the Crash Monitor periodically samples the changing system state; sample period = 5 min; intervals t2 - t1 = 600 min and t3 - t1 = 13 min]
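As a sketch of what the Calculator might compute from the Crash Monitor's samples (the function name and timestamps below are illustrative, not the actual module's code):

```python
SAMPLE_PERIOD = 5  # minutes between Crash Monitor samples of system state

def average_uptime(boot_times, crash_times):
    """Mean length of up-intervals (minutes), given paired boot and
    crash timestamps; a crash is detected within one sample period."""
    intervals = [crash - boot for boot, crash in zip(boot_times, crash_times)]
    return sum(intervals) / len(intervals)

# e.g. one up-interval of 600 min (t2 - t1) and one of 13 min, as in the diagram
print(average_uptime([0, 613], [600, 626]))  # -> 306.5
```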

Preliminary Experiment Data (cont.)

Daily report of an NT system's cumulative availability over a 5-month period

Future Prediction
  • This module generates device-failure warning information
    • Sys-log Monitor: periodically checks the system event log for new entries
    • DFT Engine: applies the DFT heuristic and issues a device-failure warning when its rules are satisfied

Dispersion Frame Technique

[Figure: the Sys-log Monitor, between application programs and the operating system, writes files of event-log data that feed the Future Prediction module]
Principle from Observation

  • Devices exhibit periods of increasingly unreliable behavior prior to catastrophic failure

Error entry example: DISK:9/180445/563692570/829000:errmsg:xylg:syc:cmd6:reset failed (drive not ready) blk 0

[Figure: Mem Board error-log entries (type, time), filtered by event type]

  • Based on this observation, the DFT heuristic was derived to detect non-monotonically decreasing inter-arrival times
How DFT Works, via an Example

Rule: if a sliding window of half the current error interval covers 3 future errors twice in succession, issue a warning

Applied to the last 5 errors of the same type (disk)
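A minimal sketch of the rule as stated above, assuming "current error interval" means the time between the two most recent errors and "twice in succession" means two consecutive errors whose half-interval windows each cover at least 3 later errors (the actual DFT rules are more elaborate):

```python
def dft_warning(times):
    """Warn when the half-width window derived from an error's
    inter-arrival interval covers >= 3 later errors, for two errors in
    succession.  `times` are ascending error timestamps (e.g., hours)
    for errors of one device type."""
    consecutive = 0
    for j in range(1, len(times)):
        window = (times[j] - times[j - 1]) / 2.0   # half the current interval
        later = [t for t in times[j + 1:] if t - times[j] <= window]
        consecutive = consecutive + 1 if len(later) >= 3 else 0
        if consecutive >= 2:
            return True
    return False

# Errors arriving faster and faster trigger a warning...
print(dft_warning([0, 256, 320, 336, 340, 341]))  # -> True
# ...while evenly spaced errors do not
print(dft_warning([0, 100, 200, 300, 400]))       # -> False
```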


Where We Started: Component Wrapping
  • Improve Commercial Off-The-Shelf (COTS) software robustness
Exception Handling: The Basis for Error Detection
  • Exception handling is an important part of dependable systems
    • Responding to unexpected operating conditions
    • Tolerating activation of latent design defects
  • Robustness testing can help evaluate software dependability
    • Reaction to exceptional situations (current results)
    • Reaction to overloads and software “aging” (future results)
    • First big objective: measure exception handling robustness
      • Apply to operating systems
      • Apply to other applications
  • It’s difficult to improve something you can’t measure … so let’s figure out how to measure robustness!
Measurement Part 1: Software Testing
  • SW Testing requires:              Ballista uses:
    • Test case                       "Bad" value combinations
    • Module under test               Module under test
    • Oracle (a "specification")      Watchdog timer / core dumps
Ballista: Scalable Test Generation
  • Ballista combines test values to generate test cases
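The idea can be sketched as a cross-product of per-parameter test values; the value pools below are illustrative stand-ins for Ballista's real type-based test dictionaries:

```python
import itertools

# Hypothetical per-parameter pools of exceptional values (illustrative only)
FILENAME_VALUES = [None, "", "/nonexistent/path", "x" * 10000]
MODE_VALUES = [None, "", "r", "not-a-mode"]

def generate_test_cases(*value_pools):
    """Combine test values across parameters into complete test cases,
    so adding one value to a pool scales into many new cases."""
    return list(itertools.product(*value_pools))

cases = generate_test_cases(FILENAME_VALUES, MODE_VALUES)
print(len(cases))  # -> 16 test cases for a 2-parameter, fopen-like call
```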
Ballista: “High Level” + “Repeatable”
  • High-level testing is done using the API to perform fault injection
    • Send exceptional values into a system through the API
      • Requires no modification to code -- only linkable object files needed
      • Can be used with any function that takes a parameter list
    • Direct testing instead of middleware injection simplifies usage
  • Each test is a specific function call with a specific set of parameters
    • System state initialized & cleaned up for each single-call test
    • Combinations of valid and invalid parameters tried in turn
    • A “simplistic” model, but it does in fact work...
  • Early results were encouraging:
    • Found a significant percentage of functions with robustness failures
    • Crashed systems from user mode
  • The object-based testing approach scales!
CRASH Robustness Testing Result Categories
  • Catastrophic
    • Computer crashes/panics, requiring a reboot
    • e.g., Irix 6.2: munmap(malloc((1<<30)+1), (1<<31)-1);
    • e.g., DUNIX 4.0D: mprotect(malloc((1 << 29)+1), 65537, 0);
  • Restart
    • Benchmark process hangs, requiring restart
  • Abort
    • Benchmark process aborts (e.g., “core dump”)
  • Silent
    • No error code generated when one should have been (e.g., de-referencing a null pointer produces no error)
  • Hindering
    • Incorrect error code generated
Technology Transfer
  • Original project sponsor DARPA
    • Sponsored technology transfer projects for:
      • Trident Submarine navigation system (U.S. Navy)
      • Defense Modeling & Simulation Office HLA system
  • Industrial sponsors are continuing the work
    • Cisco – Network switching infrastructure
    • ABB – Industrial automation framework
    • Emerson – Windows CE testing
    • AT&T – CORBA testing
    • ADtranz – (defining project)
    • Microsoft – Windows 2000 testing
  • Other users include
    • Rockwell, Motorola, and, potentially, some POSIX OS developers
Specifying A Test (web/demo interface)
  • Simple demo interface; real interface has a few more steps...
Viewing Results
  • Each robustness failure is one test case (one set of parameters)
“Bug Report” Program Creation
  • Reproduces failure in isolation (>99% effective in practice)

/* Ballista single test case  Sun Jun 13 14:11:06 1999
 * fopen(FNAME_NEG, STR_EMPTY) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *str_empty = "";
    char *param0, *param1, *str_ptr;

    param0 = (char *) -1;                     /* FNAME_NEG: invalid pointer */

    str_ptr = (char *) malloc(strlen(str_empty) + 1);
    strcpy(str_ptr, str_empty);
    param1 = str_ptr;                         /* STR_EMPTY: empty string */

    fopen(param0, param1);
    return 0;
}

Research Challenges
  • Ballista provides a small, discrete state-space for software components
  • Challenge is to create models of inter-module relations and workload statistics to create predictions
  • Create discrete simulations using model and probabilities as input parameters
  • Validation of model at a high level of abstraction through experimentation on testbed
  • Optimize cost/performance
  • What does it take to do this sort of research?
    • A legacy of 15 years of previous Carnegie Mellon work to build upon
      • But, sometimes it takes that long just to understand the real problems!
    • Ballista: 3.5 years and about $1.6 Million spent to date


Students:

  • Meredith Beveridge
  • John Devale
  • Kim Fernsler
  • David Guttendorf
  • Geoff Hendrey
  • Nathan Kropp
  • Jiantao Pan
  • Charles Shelton
  • Ying Shi
  • Asad Zaidi

Faculty & Staff:

  • Kobey DeVale
  • Phil Koopman
  • Roy Maxion
  • Dan Siewiorek