Software Rejuvenation


Vittorio Castelli

Rick Harper

Phil Heidelberger

Steve Hunter

Tom Pahel

Kalyan Vaidyanathan

Objectives
  • Improve system availability
  • Software-induced outages dominate hardware-induced outages
    • A top concern of most customers
    • Proactive fault management is greatly preferred, replacing unplanned outages with planned outages
  • There are many problems...we chose to attack software aging
    • Predict and avoid unplanned outages due to software aging
    • Monitor consumption of resources such as free memory, swap space, handle count, thread count, inode count, ...
    • Extrapolate resource consumption within a user-specified horizon
    • When exhaustion is predicted, send an event to IBM Director, which can trigger an alert, selective rejuvenation, cluster failover, or reboot
    • In some cases can identify which process/subsystem is the culprit
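
The monitoring step can be illustrated with a minimal sketch. The slides do not say how the agent collects its counters; the Python below assumes a Linux-style `/proc/meminfo` text layout, and `parse_meminfo`, `sample_resources`, and the `SAMPLE` values are hypothetical names and made-up numbers used only for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def sample_resources(meminfo_text):
    """Return the two resources this sketch tracks, in kB."""
    info = parse_meminfo(meminfo_text)
    return {"free_memory_kb": info["MemFree"], "free_swap_kb": info["SwapFree"]}


# Made-up sample in the /proc/meminfo layout (not measured data).
SAMPLE = """MemTotal:       16384000 kB
MemFree:         2048000 kB
SwapTotal:       4096000 kB
SwapFree:        4000000 kB
"""

if __name__ == "__main__":
    print(sample_resources(SAMPLE))
```

On a real Linux host the same function could be fed the contents of `/proc/meminfo`, read once per sampling interval; handle, thread, and inode counts would need their own sources.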
Software Aging
  • Software state (OS, middleware, applications) decays with time...
    • memory leaks
    • handle leaks
    • nonterminated threads
    • unreleased file-locks
    • data corruption
    • ...resulting in Bad Things (outages, hangs, performance degradation)
  • We feel this behavior is an inevitable by-product of software industry dynamics and practice
  • Software failure prediction and state rejuvenation is a proactive technology designed to mitigate the effects of software aging
    • Predict when resource exhaustion is about to occur
    • Reset the state of the system to an initial low-resource-consumption condition
Project History
  • Supported in 2000 and 2001 by PSI funding
  • xSeries, Research, University collaboration
    • xSeries architecture (Steve Hunter), RAS, Development (Tom Pahel), Marketing
    • Research: Rick Harper, Vittorio Castelli and Phil Heidelberger
    • Duke: Kalyan Vaidyanathan, Kishor Trivedi
  • Incorporated into IBM Director
    • Timed rejuvenation on NT GA Q4'99
    • Predictive rejuvenation on NT/W2K GA Q4'00 (NT includes per-process diagnosis)
    • Predictive rejuvenation and per-process diagnosis on Linux GA Q4'01
  • The market liked it

Software Rejuvenation Agent

Prediction Algorithm

Prediction Algorithm
  • Sampled Parameters
    • Windows agent can predict exhaustion of committed bytes, pool nonpaged bytes, pool paged bytes, logical disk bytes
    • Linux agent: swap space, disk space, inodes, file descriptors, processes
  • Sampling Technique
    • User selects exhaustion notification horizon
      • Typically should be at least several days
    • Agent sets up sliding sampling window that is 1/10 the size of the horizon
    • Agent sets up sampling rate such that 300 points lie within sampling window
      • Linear prediction needs only 200 points; the more complex predictions require 300 points
    • Historical data is saved, subject to user-selected file size limitation
  • Predictive algorithm
    • Constructs 6 candidate fitted curves to smoothed sliding window data
      • Linear, Log, Linear/Log with 2 or 3 breakpoints
    • Selects best-fitting curve
    • Extrapolates selected curve out to exhaustion horizon
    • Generates an event if the extrapolated curve reaches a resource limit within the horizon, and reports the time remaining until impact
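
The sampling arithmetic and the fit-select-extrapolate loop can be sketched as follows. This is a simplified illustration, not the agent's implementation: it fits only two of the six candidate curves (plain linear and plain log, no breakpoint variants), skips the smoothing step, and selects by sum of squared errors, which is an assumed criterion. All function names are hypothetical.

```python
import math


def sampling_plan(horizon_s, points=300):
    """Window is 1/10 of the horizon; the interval puts `points` samples in it."""
    window_s = horizon_s / 10
    return window_s, window_s / points


def least_squares(us, ys):
    """Fit y = a + b*u by ordinary least squares; return (a, b, sse)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    b = suy / suu
    a = mean_y - b * mean_u
    sse = sum((y - (a + b * u)) ** 2 for u, y in zip(us, ys))
    return a, b, sse


# Each candidate maps time t to a regressor u, plus the inverse mapping.
CANDIDATES = {
    "linear": (lambda t: t, lambda u: u),
    "log": (math.log, math.exp),
}


def predict_exhaustion(ts, ys, limit, horizon_s):
    """Return (model_name, seconds_until_limit), or None if no event is due.

    ts must be positive and increasing; ys is usage growing toward `limit`.
    """
    fits = []
    for name, (to_u, from_u) in CANDIDATES.items():
        a, b, sse = least_squares([to_u(t) for t in ts], ys)
        fits.append((sse, name, a, b, from_u))
    sse, name, a, b, from_u = min(fits, key=lambda f: f[0])
    if b <= 0:
        return None  # usage is flat or shrinking: nothing to predict
    try:
        t_hit = from_u((limit - a) / b)  # time at which the curve meets the limit
    except OverflowError:
        return None  # log curve reaches the limit impossibly far in the future
    remaining = t_hit - ts[-1]
    if remaining <= horizon_s:
        return name, max(remaining, 0.0)
    return None


if __name__ == "__main__":
    horizon = 10 * 24 * 3600                  # ten-day horizon, in seconds
    window, interval = sampling_plan(horizon)
    ts = [interval * (i + 1) for i in range(300)]
    ys = [1000 + 0.01 * t for t in ts]        # synthetic steady leak
    print(predict_exhaustion(ts, ys, limit=5000, horizon_s=horizon))
```

For a ten-day horizon the plan gives a one-day window sampled every 288 seconds. The synthetic leak is linear by construction, so the linear candidate wins the selection and the sketch reports roughly 313,600 seconds until the limit is reached.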
Diagnosis
  • Process Consumption of Nonpaged Pool Bytes:
    • SERVICES 447936 2.51%
    • WINLOGON 64992 0.36%
    • WinMgmt 57068 0.32%
    • svchost 47448 0.27%
    • explorer 45896 0.26%
    • svchost 44704 0.25%
    • CSRSS 42416 0.24%
    • LSASS 40708 0.23%
    • msdtc 35608 0.20%
    • rtvscan 34448 0.19%
  • System Module Consumption of Nonpaged Pool Bytes:
    • Tag LSwi 2293760 0.124
    • Tag File 2027424 0.110
    • Tag Wdm 1705888 0.092
    • Tag MmCa 1371744 0.074
    • Tag Ntfr 1350112 0.073
    • Tag Nmdd 1048576 0.057
    • Tag NtFs 753888 0.041
    • Tag Ntfn 750080 0.041
    • Tag NDam 612608 0.033
    • Tag FSfm 541888 0.029
    • Tag Dmio 532448 0.029
  • Of a total of 35667968 Pool Nonpaged Bytes, 1318904 (3.70%) can be diagnosed to processes and 34349064 (96.30%) are consumed by system modules.
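
The split in the summary line can be checked with a few lines of arithmetic. The two input figures come from the report above; the function name `split_pool` is hypothetical.

```python
TOTAL_NONPAGED = 35_667_968   # total Pool Nonpaged Bytes from the report
PROCESS_BYTES = 1_318_904     # bytes attributed to individual processes


def split_pool(total, by_processes):
    """Return (process share in %, system bytes, system share in %)."""
    system = total - by_processes
    return 100 * by_processes / total, system, 100 * system / total


proc_pct, system_bytes, system_pct = split_pool(TOTAL_NONPAGED, PROCESS_BYTES)
print(f"{proc_pct:.2f}% processes, {system_bytes} bytes ({system_pct:.2f}%) system modules")
```

The remainder, 34349064 bytes, matches the figure the report assigns to system modules.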

Software Rejuvenation Agent

Director Integration

High Level Design

  • Console Task is used to configure SW Rejuv Options & Criteria
  • Server Task saves persistent configuration data and communicates with agent machine
  • The agent monitors OS usage of resources, projects future exhaustion, & notifies server if exhaustion is imminent.

[Figure: three-tier architecture diagram; the tiers are connected by IPC]
  • IBM Director Console
    • Director Tasks: Inventory, Events, …
    • Software Rejuvenation Task
  • Director Management Server
    • Topology Engine: SNMP Device, Director Clients, Cluster Servers, Rack Device, Other Add-on Topology Extensions
    • Director Server Tasks: Inventory, Event, Monitors, FileTransfer, Scheduler, CIM...
    • Software Rejuvenation Server Task (Persistent Data)
    • DataBase
  • eServer Box
    • Director IPC Agent
    • Software Rejuvenation Sub-Agent
    • Sub-Agents: Events, Inventory, Monitors,...
    • Configuration (Input/Output) Files
    • Plug-ins: Inventory, Monitors
    • Operating System, Device Drivers
    • Service Processor, ServeRAID

Console Task

Verify Console Installation:

Task Interface
  • Trend Viewer for systems w/prediction
  • Schedule Filter to prevent rejuvenation on specified days
  • Drag-n-Drop services for time based rejuvenation
  • Rejuvenation Options apply to clusters only
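
One way a schedule filter of this kind could behave is sketched below. The slides do not describe the filter's semantics; this sketch assumes the "specified days" are weekdays and that a blocked rejuvenation is deferred to the next permitted day. The function name and the weekday numbering (Python's `date.weekday()`, Monday=0) are illustrative choices, not part of IBM Director.

```python
from datetime import date, timedelta


def next_allowed_day(requested, blocked_weekdays):
    """Return the first date on or after `requested` whose weekday is not blocked.

    blocked_weekdays uses date.weekday() numbering: Monday=0 ... Sunday=6.
    """
    if set(range(7)) <= set(blocked_weekdays):
        raise ValueError("every weekday is blocked; rejuvenation could never run")
    day = requested
    while day.weekday() in blocked_weekdays:
        day += timedelta(days=1)
    return day


if __name__ == "__main__":
    business_days = {0, 1, 2, 3, 4}          # block Monday through Friday
    # 2001-12-14 was a Friday, so the request slips to Saturday.
    print(next_allowed_day(date(2001, 12, 14), business_days))
```

Deferring rather than cancelling keeps the planned outage planned; a real filter would also need to weigh the deferral against the predicted time to exhaustion.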
Conclusions
  • The xSeries Software Rejuvenation project only attacked a small fraction of system outage causes, yet was well received
  • Much remains...
    • Current technology based on lab testing and a priori understanding of exhaustible resources
    • A very limited class of outage causes
    • Adaptive identification of pre-outage signatures
    • Improved diagnostic resolution
      • Selective rejuvenation of offending subsystem
  • Expand to more general classes of software failures and syndromes
    • Multiparameter signatures
    • Non-extremal conditions
    • Misconfigurations
    • Event log analysis
  • Applications
    • Workload balancing
    • HW/SW fault discrimination
    • SW testing and hardening