This lecture from CompSci 296.2 (Self-Managing Systems) wraps up the sample course projects and opens the discussion of Recovery-Oriented Computing (ROC). The sample projects are: NIMO (NonInvasive Modeling for Optimization), which builds performance models for scientific applications; Fa, a testbed for studying automated problem prediction and diagnosis; combined querying and mining of structured and unstructured system data; and adding problem prediction and root-cause querying to Nagios, with similar projects using the IBM Autonomic Computing Toolkit. The ROC part covers availability, maintainability, and fast recovery from failures.
CompSci 296.2 Self-Managing Systems Shivnath Babu
Today • Wrap up sample projects • ROC discussion
Sample Projects • NIMO • Fa • Combining structured & unstructured data • Projects using Nagios • Projects using IBM autonomic computing toolkit
NIMO: NonInvasive Modeling for Optimization • Build performance models for scientific apps • Automatic, online, and noninvasive • Projects • Study many scientific apps (e.g., 140 bio apps in BioPortal) to characterize their behavior and find good models • “Steal app”, build and refine model • Incorporate NIMO in a “grid” scheduler (Condor, Globus) • Optimization problems in scheduling workflows
Fa • Testbed to study: • Whether we can automate problem prediction, diagnosis • Relationship among problems, causes, data, & models • Projects • Models for predicting performance problems (online) • Models and mechanisms for root-cause queries • Others
Structured and Unstructured Data • Combined querying/mining of structured and unstructured system data • Structured data: time series of CPU utilization • Unstructured data (free text): System error log • Ex: Characterize system state when a specific error occurs
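A minimal sketch of the example on this slide (characterizing system state when a specific error occurs), assuming the CPU time series and the error log are already loaded as timestamped lists. The function name `cpu_state_at_error`, the 60-second window, and all sample data are hypothetical and only illustrate the idea of joining the two kinds of data on time.

```python
import re
from statistics import mean


def cpu_state_at_error(cpu_samples, log_lines, pattern, window_s=60):
    """Mean CPU utilization in the window_s seconds before each matching log line.

    cpu_samples: list of (timestamp_seconds, utilization_percent), the structured data.
    log_lines:   list of (timestamp_seconds, free_text_message), the unstructured data.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    # Step 1: search the free text for the error of interest.
    error_times = [t for t, msg in log_lines if regex.search(msg)]
    results = []
    for t_err in error_times:
        # Step 2: select the structured samples that fall just before the error.
        window = [u for t, u in cpu_samples if t_err - window_s <= t <= t_err]
        if window:
            results.append((t_err, mean(window)))
    return results


# Hypothetical data: CPU samples every 30 s and three log lines.
cpu = [(0, 20.0), (30, 25.0), (60, 90.0), (90, 95.0), (120, 30.0)]
log = [(10, "service started"),
       (95, "ERROR: disk write timeout"),
       (125, "heartbeat ok")]

summary = cpu_state_at_error(cpu, log, r"disk write timeout", window_s=60)
print(summary)
```

A real project would likely replace the regular expression with text mining over the log and the mean with a richer description of system state; the time-based join is the part this sketch is meant to show.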
Add New Features to Current Systems • Add problem-prediction capability to Nagios • Add root-cause querying to Nagios • Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework • Remember the “mechanism projects” • Undo, virtualization, active probing
ROC: Recovery-Oriented Computing • Complaints about current systems • Focus only on performance; availability & maintainability are neglected • Focus on MTTF of individual components; MTTR is neglected • MTTF of system << MTTF of individual components
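The two complaints about MTTF and MTTR can be made concrete with a little arithmetic. The sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). For the system-level MTTF it assumes a series system of independent components with constant failure rates, where any single failure brings the system down; that model and all numbers are illustrative assumptions, not taken from the slides.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_system_mttf(component_mttfs_hours):
    """System MTTF when any component failure fails the whole system.

    Assumes independent components with constant failure rates,
    so the component failure rates (1 / MTTF) simply add up.
    """
    return 1.0 / sum(1.0 / m for m in component_mttfs_hours)


# Hypothetical example: 1000 components, each with an MTTF of 1,000,000 hours.
system_mttf = series_system_mttf([1_000_000.0] * 1000)
print(f"system MTTF: about {system_mttf:.0f} hours")

# Same MTTF, two different repair times: cutting MTTR raises availability
# just as effectively as stretching MTTF would.
slow_repair = availability(system_mttf, mttr_hours=10.0)
fast_repair = availability(system_mttf, mttr_hours=1.0)
print(f"availability with 10 h repair: {slow_repair:.4f}")
print(f"availability with 1 h repair:  {fast_repair:.4f}")
```

Under these assumptions the system MTTF is roughly a thousand times smaller than the component MTTF, which is the sense of "MTTF of system << MTTF of individual components".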
ROC Philosophy “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”) • People/HW/SW failures are facts, not problems • Recovery/repair is how we cope with the above facts • ROC focus is on fast repair vs. the old focus on longer time between failures
ROC Principles • Recovery experiments: benchmarking recovery • Pinpoint: Automatic problem diagnosis • Recursive restart: Innovative use of reboot • App and system undo • Defense in depth: ROC at hardware level
Discussion • Strong point: Comprehensive, relates to other fields • Margin of safety for systems • Current examples? • How to incorporate? • Negative point: Evolution vs. revolution? • What approach is the project taking? • At what level should we support Undo? • Transaction, application, system • Pros and cons • Benchmarking availability/recovery (TOC?) • How can you claim that a system is 99.999% available? • Dealing with the automation irony • Fire drills
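As a starting point for the 99.999% question, the sketch below works out what such a claim implies as a yearly downtime budget and how a measured outage record compares with it. The function names and the outage durations are hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed in the period if the availability target is to hold."""
    return (1.0 - availability) * period_minutes


def observed_availability(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability implied by a list of measured outage durations."""
    return 1.0 - sum(outage_minutes) / period_minutes


budget = downtime_budget_minutes(0.99999)
print(f"five nines allows about {budget:.2f} minutes of downtime per year")

# Hypothetical outage record for one year: two outages of 3 and 4 minutes.
measured = observed_availability([3.0, 4.0])
print(f"measured availability: {measured:.6f}")
print("meets five nines" if measured >= 0.99999 else "misses five nines")
```

The budget comes to a few minutes per year, so a single slow manual repair can use it up. This ties the availability claim back to the MTTR focus of ROC and to the need for recovery benchmarks.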