Software Rejuvenation

Vittorio Castelli

Rick Harper

Phil Heidelberger

Steve Hunter

Tom Pahel

Kalyan Vaidyanathan

Objectives

  • Improve system availability

  • Software-induced outages dominate hardware-induced outages

    • A top concern of most customers

    • Proactive fault management is greatly preferred, replacing unplanned outages with planned outages

  • There are many problems...we chose to attack software aging

    • Predict and avoid unplanned outages due to software aging

    • Monitor consumption of resources such as free memory, swap space, handle count, thread count, inode count, ...

    • Extrapolate resource consumption within a user-specified horizon

    • When exhaustion is predicted, raise an event in IBM Director, which can trigger an alert, selective rejuvenation, cluster failover, or reboot

    • In some cases can identify which process/subsystem is the culprit

Software Aging

  • Software state (OS, middleware, applications) decays with time...

    • memory leaks

    • handle leaks

    • nonterminated threads

    • unreleased file-locks

    • data corruption

    • ...resulting in Bad Things (outages, hangs, performance degradation)

  • We feel this behavior is an inevitable by-product of software industry dynamics and practice

  • Software failure prediction and state rejuvenation is a proactive technology designed to mitigate the effects of software aging

    • Predict when resource exhaustion is about to occur

    • Reset the state of the system to an initial low-resource-consumption condition

Project History

  • Supported in 2000 and 2001 by PSI funding

  • xSeries, Research, University collaboration

    • xSeries architecture (Steve Hunter), RAS, Development (Tom Pahel), Marketing

    • Research: Rick Harper, Vittorio Castelli and Phil Heidelberger

    • Duke: Kalyan Vaidyanathan, Kishor Trivedi

  • Incorporated into IBM Director

    • Timed rejuvenation on NT GA Q4'99

    • Predictive rejuvenation on NT/W2K GA Q4'00 (NT includes per-process diagnosis)

    • Predictive rejuvenation and per-process diagnosis on Linux GA Q4'01

  • The market liked it


Software Rejuvenation Agent

Prediction Algorithm

Prediction Algorithm

  • Sampled Parameters

    • Windows agent can predict exhaustion of committed bytes, pool nonpaged bytes, pool paged bytes, logical disk bytes

    • Linux agent: swap space, disk space, inodes, file descriptors, processes

  • Sampling Technique

    • User selects exhaustion notification horizon

      • Typically should be at least several days

    • Agent sets up sliding sampling window that is 1/10 the size of the horizon

    • Agent sets up sampling rate such that 300 points lie within sampling window

      • Linear prediction can run on 200 points; more complex predictions require 300 points

    • Historical data is saved, subject to user-selected file size limitation

  • Predictive algorithm

    • Fits 6 candidate curves to the smoothed sliding-window data

      • Linear, Log, Linear/Log with 2 or 3 breakpoints

    • Selects best-fitting curve

    • Extrapolates selected curve out to exhaustion horizon

    • Generates an event if the extrapolated curve reaches a resource limit within the horizon, and reports the time remaining until it does
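
The sampling and prediction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual code: the function names are hypothetical, only the plain linear and log candidates are fitted (the breakpoint variants and the smoothing step are omitted), and the monitored quantity is assumed to be a usage figure that rises toward a fixed limit.

```python
import math

def sampling_plan(horizon_s, points=300):
    """Return (window_s, interval_s) for a notification horizon in seconds."""
    window_s = horizon_s / 10        # sliding window is 1/10 of the horizon
    interval_s = window_s / points   # spacing that puts `points` samples in the window
    return window_s, interval_s

def _least_squares(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sum of squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def predict_exhaustion(ts, ys, limit, horizon_s, step_s):
    """Fit candidate curves, pick the best, and extrapolate to the horizon.

    ts: sample times in seconds (must be > 0 and distinct, because of the log fit)
    ys: resource usage at each sample time
    Returns (model_name, seconds_until_limit), or (model_name, None) if the
    selected curve stays below `limit` for the whole horizon.
    """
    candidates = {}
    a, b, sse = _least_squares(ts, ys)
    candidates["linear"] = (sse, lambda t, a=a, b=b: a + b * t)
    a, b, sse = _least_squares([math.log(t) for t in ts], ys)
    candidates["log"] = (sse, lambda t, a=a, b=b: a + b * math.log(t))

    # Select the best-fitting curve by smallest squared error.
    name, (_, curve) = min(candidates.items(), key=lambda kv: kv[1][0])

    # Walk the selected curve forward from the last sample to the horizon.
    t = ts[-1]
    while t <= ts[-1] + horizon_s:
        if curve(t) >= limit:
            return name, t - ts[-1]
        t += step_s
    return name, None
```

With a 10-day horizon (864000 s), `sampling_plan` yields a 1-day window sampled every 288 s, which is the 300-points-per-window rule from the slide.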

Diagnosis

  • Process Consumption of Nonpaged Pool Bytes:

    • SERVICES 447936 2.51%

    • WINLOGON 64992 0.36%

    • WinMgmt 57068 0.32%

    • svchost 47448 0.27%

    • explorer 45896 0.26%

    • svchost 44704 0.25%

    • CSRSS 42416 0.24%

    • LSASS 40708 0.23%

    • msdtc 35608 0.20%

    • rtvscan 34448 0.19%

  • System Module Consumption of NonPaged Pool Bytes:

    • Tag LSwi 2293760 0.124

    • Tag File 2027424 0.110

    • Tag Wdm 1705888 0.092

    • Tag MmCa 1371744 0.074

    • Tag Ntfr 1350112 0.073

    • Tag Nmdd 1048576 0.057

    • Tag NtFs 753888 0.041

    • Tag Ntfn 750080 0.041

    • Tag NDam 612608 0.033

    • Tag FSfm 541888 0.029

    • Tag Dmio 532448 0.029

  • Of a total of 35667968 Pool Nonpaged Bytes, 1318904 (3.70%) can be diagnosed to processes and 34349064 (96.30%) are consumed by system modules.
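
The split in the summary line is simple arithmetic: whatever cannot be attributed to a process is charged to system modules. A small sketch, using a hypothetical helper name and the totals from the report above, reproduces the line:

```python
def diagnosis_summary(total_bytes, process_bytes):
    """Split a pool total into process-attributed and system-module shares."""
    system_bytes = total_bytes - process_bytes   # remainder goes to system modules

    def pct(part):
        return 100.0 * part / total_bytes

    return (
        f"Of a total of {total_bytes} Pool Nonpaged Bytes, "
        f"{process_bytes} ({pct(process_bytes):.2f}%) can be diagnosed to processes "
        f"and {system_bytes} ({pct(system_bytes):.2f}%) are consumed by system modules."
    )

# Totals taken from the report on the slide.
print(diagnosis_summary(35667968, 1318904))
```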


Software Rejuvenation Agent

Director Integration

High Level Design

[Diagram labels, grouped by tier]

  • IBM Director Console: Director Tasks (Inventory, Events, …); Software Rejuvenation Task

  • Director Management Server: Topology Engine; Other Add-on Topology; Director Server Tasks (Inventory, Event, Monitors, FileTransfer, Scheduler, CIM...); Software Rejuvenation Server Task (Persistent Data)

  • eServer Box: Director IPC Agent; Sub-Agents (Events, Inventory, Monitors,...); Plug-ins (Inventory, Monitors); Operating System, Device Drivers; Service Processor, ServeRAID

Roles of the Software Rejuvenation components:

  • Console Task is used to configure SW Rejuv Options & Criteria

  • Server Task saves persistent configuration data and communicates with agent machine

  • The agent monitors OS usage of resources, projects future exhaustion, & notifies server if exhaustion is imminent.

Console Task

Verify Console Installation:

Task Interface

  • Trend Viewer for systems w/prediction

  • Schedule Filter to prevent rejuvenation on specified days

  • Drag-n-Drop services for time based rejuvenation

  • Rejuvenation Options apply to clusters only
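
The Schedule Filter idea, blocking rejuvenation on specified days, can be illustrated with a short sketch. The slides do not describe how the filter is implemented, so the function name and the weekday-based rule below are assumptions made only for illustration.

```python
from datetime import date, timedelta

def next_allowed_day(start, blocked_weekdays):
    """Return the first date on or after `start` whose weekday is not blocked.

    blocked_weekdays uses Python's convention: Monday is 0, Sunday is 6.
    """
    blocked = set(blocked_weekdays)
    if len(blocked) >= 7:
        raise ValueError("every weekday is blocked; rejuvenation can never run")
    day = start
    while day.weekday() in blocked:
        day += timedelta(days=1)   # push the rejuvenation to the next day
    return day

# A rejuvenation requested on a Saturday, with weekends blocked,
# is deferred to the following Monday.
print(next_allowed_day(date(2024, 1, 6), {5, 6}))
```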

Conclusions

  • The xSeries Software Rejuvenation project only attacked a small fraction of system outage causes, yet was well received

  • Much remains...

    • Current technology based on lab testing and a priori understanding of exhaustible resources

    • A very limited class of outage causes

    • Adaptive identification of pre-outage signatures

    • Improved diagnostic resolution

      • Selective rejuvenation of offending subsystem

  • Expand to more general classes of software failures and syndromes

    • Multiparameter signatures

    • Non-extremal conditions

    • Misconfigurations

    • Event log analysis

  • Applications

    • Workload balancing

    • HW/SW fault discrimination

    • SW testing and hardening