Enhancing Self-Managing Systems with NIMO: Performance Optimization and Recovery Strategies

CompSci 296.2 Self-Managing Systems Shivnath Babu

Today • Wrap up sample projects • ROC discussion

Sample Projects • NIMO • Fa • Combining structured & unstructured data • Projects using Nagios • Projects using IBM autonomic computing toolkit

NIMO: NonInvasive Modeling for Optimization • Build performance models for scientific apps • Automatic, online, and noninvasive • Projects • Study many scientific apps (e.g., 140 bio apps in BioPortal) → characterize behavior, good models • “Steal app”, build and refine model • Incorporate NIMO in a “grid” scheduler (Condor, Globus) • Optimization problems in scheduling workflows

Fa • Testbed to study: • Whether we can automate problem prediction, diagnosis • Relationship among problems, causes, data, & models • Projects • Models for predicting performance problems (online) • Models and mechanisms for root-cause queries • Others

Structured and Unstructured Data • Combined querying/mining of structured and unstructured system data • Structured data: time series of CPU utilization • Unstructured data (free text): System error log • Ex: Characterize system state when a specific error occurs
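As a minimal sketch of the example above (characterizing system state when a specific error occurs), the following joins a CPU-utilization time series with free-text log lines. The sample data, the log line layout (a leading timestamp in seconds), and the helper names `cpu_at` and `state_when` are all assumptions made for illustration; they are not from the course material.

```python
from bisect import bisect_right

# Hypothetical structured data: (timestamp in seconds, CPU utilization in %).
cpu_series = [(0, 20.0), (60, 35.0), (120, 90.0), (180, 95.0), (240, 40.0)]

# Hypothetical unstructured data: free-text log lines with a leading timestamp.
error_log = [
    "30 INFO service started",
    "130 ERROR disk write timeout on /dev/sda",
    "185 ERROR disk write timeout on /dev/sda",
    "250 WARN retrying connection",
]


def cpu_at(series, t):
    """Return the most recent CPU sample at or before time t, or None."""
    times = [ts for ts, _ in series]
    i = bisect_right(times, t)
    return series[i - 1][1] if i > 0 else None


def state_when(log_lines, series, needle):
    """Collect the CPU reading in effect at each log line containing `needle`."""
    readings = []
    for line in log_lines:
        ts_text, _, message = line.partition(" ")
        if needle in message:
            readings.append(cpu_at(series, int(ts_text)))
    return readings


readings = state_when(error_log, cpu_series, "disk write timeout")
mean_cpu = sum(readings) / len(readings)
print(readings, mean_cpu)
```

In this toy data the error only appears while CPU utilization is high, which is the kind of relationship a combined query is meant to surface. A real system would also need to handle clock skew between the two sources and log lines without a parseable timestamp.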

Add New Features to Current Systems • Add problem-prediction capability to Nagios • Add root-cause querying to Nagios • Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework • Remember the “mechanism projects” • Undo, virtualization, active probing

ROC: Recovery-Oriented Computing • Complaints about current systems • Focus only on performance → Availability & maintainability are neglected • Focus on MTTF of individual components → MTTR neglected • MTTF of system << MTTF of individual components
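The last two complaints can be made concrete with a small calculation. This sketch assumes a system that fails whenever any one component fails, with independent components and constant failure rates (so rates add), and uses the standard steady-state definition availability = MTTF / (MTTF + MTTR). The component counts and hour values are hypothetical.

```python
def series_mttf(component_mttfs):
    """MTTF of a system that fails when any component fails.

    Assumes independent components with constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    return 1.0 / sum(1.0 / m for m in component_mttfs)


def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up."""
    return mttf / (mttf + mttr)


# Hypothetical: 100 components, each with an MTTF of 100,000 hours.
components = [100_000.0] * 100
system_mttf = series_mttf(components)  # roughly 1,000 hours

baseline = availability(system_mttf, mttr=10.0)
faster_repair = availability(system_mttf, mttr=1.0)         # MTTR cut 10x
sturdier_parts = availability(10 * system_mttf, mttr=10.0)  # MTTF raised 10x
print(system_mttf, baseline, faster_repair, sturdier_parts)
```

Under these assumptions the system MTTF is 100 times smaller than that of any single component, and because availability depends only on the ratio of MTTF to MTTR, cutting repair time by 10x buys the same availability as making the system fail 10x less often.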

ROC Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”) • People/HW/SW failures are facts, not problems • Recovery/repair is how we cope with these facts • ROC focuses on fast repair, vs. the old focus on longer time between failures

ROC Principles • Recovery experiments: benchmarking recovery • Pinpoint: Automatic problem diagnosis • Recursive restart: Innovative use of reboot • App and system undo • Defense in depth: ROC at hardware level
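One way to picture the recursive-restart idea is as a restart tree: restart the smallest component suspected of failing, and escalate to its parent only if the failure persists. The sketch below illustrates that escalation loop only; the `Component` class, the `restart` and `is_healthy` callbacks, and the component names are hypothetical, and the simulated fault stands in for a real health check.

```python
class Component:
    """A node in a restart tree; restarting a node is assumed to restart its subtree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def recursive_restart(failed, restart, is_healthy):
    """Restart `failed`, then successive ancestors, until the system is healthy.

    Returns the names of the components restarted, in order.
    """
    restarted = []
    node = failed
    while node is not None:
        restart(node)
        restarted.append(node.name)
        if is_healthy():
            return restarted
        node = node.parent  # cheaper restart did not help; escalate
    raise RuntimeError("failure persists after restarting the whole system")


# Hypothetical three-level system.
system = Component("system")
app_server = Component("app_server", parent=system)
servlet = Component("servlet", parent=app_server)

# Simulated fault that only clears when the app server is restarted.
fault = {"present": True}


def restart(node):
    if node.name == "app_server":
        fault["present"] = False


def is_healthy():
    return not fault["present"]


order = recursive_restart(servlet, restart, is_healthy)
print(order)
```

The design choice being illustrated is that cheap, narrow restarts are tried first, so the full-system reboot (the most expensive recovery) is paid only when nothing smaller works.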

Discussion • Strong point: Comprehensive, relates to other fields • Margin of safety for systems • Current examples? • How to incorporate? • Negative point: Evolution vs. revolution? • What approach is the project taking? • At what level should we support Undo? • Transaction, application, system • Pros and cons • Benchmarking availability/recovery (TOC?) • How can you claim that a system is 99.999% available? • Dealing with the automation irony • Fire drills
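For the 99.999% question, it helps to turn the claim into a downtime budget and compare it with what was actually observed. The sketch below assumes a 365-day year and a hypothetical outage record; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def observed_availability(outage_minutes, window_minutes):
    """Availability measured over a window, given the outage durations in it."""
    return 1.0 - sum(outage_minutes) / window_minutes


budget = downtime_budget_minutes(0.99999)  # about 5.26 minutes per year

# Hypothetical record: two outages in one year, 3 and 4 minutes long.
measured = observed_availability([3.0, 4.0], MINUTES_PER_YEAR)
print(budget, measured, measured >= 0.99999)
```

With a budget of roughly five minutes a year, a single slow repair is enough to miss the target, so any five-nines claim rests on how outages are counted and how long the observation window is. This is part of why benchmarking recovery matters.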
