
CompSci 296.2 Self-Managing Systems

Explore self-managing systems in the context of autonomic computing, Recovery-Oriented Computing, and new research goals like ACME (Availability, Changeability, Maintainability, Evolutionary Growth). Dive into projects, mechanisms, and challenges. Discover examples and case studies.



Presentation Transcript


  1. CompSci 296.2 Self-Managing Systems Shivnath Babu

  2. Today • Some current work in self-managing systems • Ideas & resources for projects • IBM • ROC (Discussion deferred to next class) • Our projects at Duke • HP

  3. Project • Group size <= 2 • Identify “general topic” by end of January, meet Shivnath • Feb 7: Scope problem and give 15-minute talk • Feb 21: 3-minute talk • March 7: 15-minute talk • March 28: 3-minute talk • April 4/6: 15-minute talk • April 20/24: 15-minute final in-class presentation (+ “demo”)

  4. Work on Self-Managing Systems • IBM • IBM Journal, Volume 42, Number 1, 2003 • Autonomic computing home page • IBM autonomic home – library, demos • Autonomic computing toolkit • IBM Tivoli

  5. Work on Self-Managing Systems • Berkeley-Stanford ROC project • Reading for this class • Interesting source of project ideas and source code • Sample project reports/presentations (follow the CS444A/294-4 link)

  6. The past: research goals and assumptions of last 15 years • Goal #1: Improve performance • Goal #2: Improve performance • Goal #3: Improve cost-performance

  7. New research goals for a New Century: ACME • Availability • Changeability • support rapid deployment of new software, apps, UI • Maintainability • reduce burden on system administrators • provide helpful, forgiving SysAdmin environments • Evolutionary Growth • allow easy system expansion over time • Also Security/Privacy

  8. Recovery-Oriented Computing (ROC) Philosophy • “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”) • People/HW/SW failures are facts, not problems • Recovery/repair is how we cope with the above facts • Since a major SysAdmin job is recovery after failure, ROC also helps with maintenance/TCO • ROC focus is on fast repair vs. old focus on longer time between failures

  9. An Example Project in ROC • Undo functionality for system administrators (useful for self-managing components as well) • To recover from human errors • To recover from failed operations like software upgrades, installs, and configuration updates • An interesting mechanism project for self-healing
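A minimal sketch of the undo idea on slide 9, assuming the "system" is just an in-memory dictionary of configuration settings. The class name `UndoableConfig` and its methods are hypothetical illustrations, not part of the ROC project's code. Each change records the previous value so that a mistaken or failed configuration update can be rolled back.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: the key did not exist before the change


@dataclass
class UndoableConfig:
    """Hypothetical configuration store that logs the prior value of every change."""
    settings: Dict[str, Any] = field(default_factory=dict)
    _undo_log: List[Tuple[str, Any]] = field(default_factory=list)

    def set(self, key: str, value: Any) -> None:
        # Save the old value (or the sentinel) before overwriting it.
        self._undo_log.append((key, self.settings.get(key, _MISSING)))
        self.settings[key] = value

    def undo(self) -> bool:
        """Revert the most recent change; return False if there is nothing to undo."""
        if not self._undo_log:
            return False
        key, old = self._undo_log.pop()
        if old is _MISSING:
            self.settings.pop(key, None)  # the key was newly added, so remove it
        else:
            self.settings[key] = old
        return True
```

An administrator (or a self-managing component) would call `set` for each change and `undo` after a bad update. A real undo facility would also have to cover persistent state and external side effects, which this sketch ignores.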

  10. Mechanism Projects • Required/useful mechanisms for self-managing systems • Take a goal related to self-managing (e.g., self-optimization, predicting problems), take a system (e.g., a database): What mechanisms are needed? Will current mechanisms suffice? • Ex: Data collection • nonintrusive, distributed, “active probing”

  11. Our Projects at Duke • QueS: Querying Systems (as data) • Better tools for system administrators and self-managing system components • CoD: Cluster on Demand • Allocate virtual clusters to applications on demand

  12. Querying Systems as Data [Figure labels: clients, WAN, web server, application servers, database servers]

  13. Querying Systems as Data [Figure labels: clients, web server, application servers, database servers, multiple WAN links]

  14. Querying Systems as Data • What are the probable causes of the Service-Level-Agreement (SLA) violations rising to 12%? • Root-cause query
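One simple way to picture a root-cause query is to rank candidate conditions by how often requests that exhibited them violated the SLA. The sketch below assumes each observation is a pair (set of condition names, violated flag). The function `rank_causes` and the condition names are hypothetical. It measures correlation only and is a stand-in for illustration, not the Bayesian-network approach mentioned later in the deck.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_causes(observations: Iterable[Tuple[Set[str], bool]]) -> List[Tuple[str, float]]:
    """Rank conditions by the observed P(violation | condition), highest first."""
    seen = Counter()      # requests on which each condition was observed
    violated = Counter()  # of those, how many violated the SLA
    for conditions, is_violation in observations:
        for cond in conditions:
            seen[cond] += 1
            if is_violation:
                violated[cond] += 1
    scores = [(cond, violated[cond] / seen[cond]) for cond in seen]
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores, key=lambda s: (-s[1], s[0]))
```

Calling `rank_causes` on a trace in which every request with a hypothetical `"db_slow"` condition violated the SLA would place `"db_slow"` first with a score of 1.0.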

  15. Queries: What if … • Given today’s workload, how will average response time change if my database fails? • If I double the memory on my application servers, how will SLA violation rate change?
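A what-if query can be answered from an analytic performance model without touching the running system. As a minimal sketch, assume a single server behaves like an M/M/1 queue, whose mean response time is 1 / (service rate - arrival rate). The queueing model, the function names, and the idea that a hardware change maps to a new service rate are all assumptions made here for illustration; the slides do not specify a model.

```python
from typing import Tuple


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue is unstable and grows without bound
    return 1.0 / (service_rate - arrival_rate)


def what_if(arrival_rate: float, service_rate: float,
            new_service_rate: float) -> Tuple[float, float]:
    """Return (current, hypothetical) mean response times for the same workload."""
    before = mm1_response_time(arrival_rate, service_rate)
    after = mm1_response_time(arrival_rate, new_service_rate)
    return before, after
```

For example, `what_if(80.0, 100.0, 200.0)` compares today's workload of 80 requests/s on a server that handles 100 requests/s against a hypothetical server that handles 200 requests/s.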

  16. Queries: Let me know … • Let me know if, with 75% probability, average response time will exceed 5 seconds in next 30 minutes • Prediction • Continuous query
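The "let me know" query combines prediction with continuous evaluation. A rough sketch, assuming the system reports one average response time per interval and that the next interval's average is approximately normally distributed with the mean and standard deviation of a sliding window of recent values: the class `ResponseTimeAlert` and its parameters are hypothetical, and the normal approximation is a simplifying assumption rather than the model used in QueS.

```python
import math
from collections import deque
from statistics import mean, stdev


class ResponseTimeAlert:
    """Continuous query: alert when P(avg response time > threshold) is high enough."""

    def __init__(self, threshold_s: float = 5.0,
                 min_probability: float = 0.75, window: int = 30) -> None:
        self.threshold_s = threshold_s
        self.min_probability = min_probability
        self.samples = deque(maxlen=window)  # sliding window of recent averages

    def exceed_probability(self) -> float:
        """Estimate P(next average > threshold) under a normal approximation."""
        if len(self.samples) < 2:
            return 0.0  # not enough data to estimate a spread
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        if sigma == 0.0:
            return 1.0 if mu > self.threshold_s else 0.0
        z = (self.threshold_s - mu) / sigma
        # Upper tail of the standard normal: 1 - Phi(z).
        return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

    def observe(self, avg_response_time_s: float) -> bool:
        """Feed one new measurement; return True if the alert should fire."""
        self.samples.append(avg_response_time_s)
        return self.exceed_probability() >= self.min_probability
```

Each call to `observe` re-evaluates the query, which is what makes it continuous rather than a one-time snapshot. The defaults mirror the 75% probability and 5-second threshold in the slide's example.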

  17. Queries: What should I do? • What should I do to reduce SLA violations of requests A to <1%, without increasing violations of other requests? • Root-cause + What-if

  18. Querying Systems as Data • Instrumented traces, logs • System activity data • Data from active probing • Workload • System configuration data (e.g., buffer size, indexes) • Source code • Models • Analytic performance models • Machine learning models • Rules from system experts • Simulators

  19. Querying Systems with QueS (30,000 ft) [Figure labels: system mgmt. services, queries, answers, model-driven DB engine, query processor, data maintenance, data acquisition, data]

  20. Challenges: Query Complexity • Support for complex queries • Rank probable causes of SLA violation rising to 12%? • “What should I do” queries • Queries are ad-hoc • Queries may be acquisitional

  21. Challenges: Query Specification • Declarative query language • Expressibility of language • Composition • Snapshot queries and continuous queries

  22. Challenges: Query Processing • Model-based query processing • Many types of data sources • Structured, semi-structured, and unstructured • Uncertainty in input data • E.g., legacy systems may have partial/no instrumentation • Imprecise answers • Answers may include quantification of accuracy • Ranking
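One way an answer can carry a quantification of its accuracy is to return an interval alongside the point estimate. The sketch below reports an SLA violation rate together with a Wilson score interval, which stays within [0, 1] even when few requests were instrumented. The function name `violation_rate_interval` and the choice of the Wilson interval are assumptions for illustration; the slides do not say how accuracy would be quantified.

```python
import math
from typing import Tuple


def violation_rate_interval(violations: int, requests: int,
                            z: float = 1.96) -> Tuple[float, float, float]:
    """Return (estimate, low, high) for the violation rate (Wilson score interval).

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    p = violations / requests
    denom = 1.0 + z * z / requests
    center = (p + z * z / (2 * requests)) / denom
    half = z * math.sqrt(p * (1 - p) / requests
                         + z * z / (4 * requests * requests)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)
```

With the same observed rate, the interval narrows as more requests are observed, so a sparsely instrumented legacy system yields a visibly less certain answer than a fully instrumented one.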

  23. Challenges: Run-time Overhead • Real-time service for 24x7 systems • Tunable data acquisition • Active probing

  24. Work in Progress • With Piyush Shivam • Models for answering queries about expected performance given a resource assignment, feasible resource assignments to meet SLA, what-if queries for scientific applications • With Songyun Duan • Use of Bayesian Networks for performance prediction and root-cause queries • With Wanhong Xu • What-if queries on configuration-parameter settings

  25. Projects at HP Research • Project 1: Predicting performance problems, finding root causes of problems • Project 2: Debugging complex systems • Project 3: Designing adaptive systems
