
  1. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
  Ira Cohen, Jeffrey S. Chase et al.

  2. Introduction
  • Networked systems continue to grow in scale
  • Complex behavior stems from the interaction of:
    • Workload
    • Software structure
    • Hardware
    • Traffic conditions
    • System goals
  • A pervasive management system is needed to manage such systems
  • Examples:
    • HP's OpenView
    • IBM's Tivoli
    • (These aggregate data and display it graphically)

  3. Introduction
  • Two approaches to building self-managing systems:
    • A priori models
    • Event-condition-action rules
  • Neither is derived from the behavior of the real system
  • Disadvantages:
    • Difficult and costly to construct
    • Unreliable; does not account for all conditions

  4. Introduction
  • Statistical learning techniques
    • Assume little to no domain knowledge
    • Hence "general"
  • Problem: we still have to identify techniques powerful enough to induce effective models that are:
    • Efficient
    • Accurate
    • Robust

  5. Goals
  • Automatically analyze instrumentation data from network services in order to:
    • Forecast
    • Diagnose
    • Repair failure conditions
  • Use Tree-Augmented Naive Bayesian networks (TANs) as the basis for diagnosis and forecasting from system-level instrumentation in a 3-tier network service
  • TANs are widely used in other fields, but have not been applied in the context of computer systems

  6. Goals
  • Analyzed data from 124 metrics gathered from a 3-tier e-commerce site under synthetic load
    • httperf as the load generator
    • Java PetStore as the application platform
  • The TAN model selects a combination of metrics and threshold values that correlates with compliance with Service Level Objectives (SLOs) for average response time
  • Results presented later

  7. What is a TAN?
  • A Bayesian network is an annotated directed acyclic graph encoding a joint probability distribution
  • Naive Bayesian network:
    • The state variable S is the only parent of all other vertices
    • Assumes all metrics are fully independent given S
  • TANs also capture relationships among the metrics themselves, with the constraint that each metric has at most one parent other than S (see the sketch below)
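
To make the factorization concrete, here is a minimal sketch of TAN inference over two discretized metrics. The metric names, tables, and probabilities are hypothetical illustrations; in the paper, both the tree structure and the conditional probabilities are learned from instrumentation data.

```python
# Hypothetical TAN over two discretized metrics.
# S is the SLO state: 0 = compliance, 1 = violation.
# Each metric has S as a parent; "cpu" is additionally the parent of "mem",
# so the metric-to-metric edges form a tree (here, a single edge).

p_s = {0: 0.8, 1: 0.2}  # prior over the state variable

# P(cpu | S), with cpu discretized to {"lo", "hi"}
p_cpu = {(0, "lo"): 0.9, (0, "hi"): 0.1,
         (1, "lo"): 0.3, (1, "hi"): 0.7}

# P(mem | cpu, S), with mem discretized to {"lo", "hi"}
p_mem = {(0, "lo", "lo"): 0.8, (0, "lo", "hi"): 0.2,
         (0, "hi", "lo"): 0.6, (0, "hi", "hi"): 0.4,
         (1, "lo", "lo"): 0.4, (1, "lo", "hi"): 0.6,
         (1, "hi", "lo"): 0.2, (1, "hi", "hi"): 0.8}

def posterior(sample):
    """P(S | cpu, mem) via the TAN factorization
    P(S) * P(cpu | S) * P(mem | cpu, S), normalized over S."""
    cpu, mem = sample["cpu"], sample["mem"]
    joint = {s: p_s[s] * p_cpu[(s, cpu)] * p_mem[(s, cpu, mem)] for s in (0, 1)}
    z = sum(joint.values())
    return {s: joint[s] / z for s in (0, 1)}

# Classify as a violation when P(S=1 | metrics) exceeds 0.5:
print(posterior({"cpu": "hi", "mem": "hi"}))
```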

  8. Why use a TAN?
  • Based on the premise that a relatively small subset of metrics and threshold values is sufficient to approximate the distribution accurately
  • Outperforms generalized Bayesian networks and other alternatives in both:
    • Cost
    • Accuracy

  9. Why use a TAN?
  • Useful for forecasting failures and violations
    • Possible to induce models that predict SLO violations in the near future, even while the system is still stable
  • An automated controller can invoke the model directly to:
    • Identify an impending violation
    • Respond, e.g., by shedding load or adding resources
  • The model is cheap to induce
    • Possible to maintain multiple models
    • Periodic refresh (a control-loop sketch follows)
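
As an illustration of how a controller might use such a model, here is a minimal control-loop sketch. It assumes the posterior function from the TAN sketch above; the hooks (collect_metrics, respond, retrain), thresholds, and intervals are hypothetical, not something the paper specifies.

```python
import time

VIOLATION_THRESHOLD = 0.5  # act when P(violation) exceeds this (hypothetical)
REFRESH_INTERVAL = 3600    # re-induce the model every hour (hypothetical)
SAMPLE_PERIOD = 30         # metric sampling period in seconds (hypothetical)

def control_loop(model, collect_metrics, respond, retrain):
    """Poll metrics, consult the TAN model, and react to forecast violations.
    `model` maps a metric sample to a posterior over the SLO state S."""
    last_refresh = time.time()
    while True:
        sample = collect_metrics()        # e.g. {"cpu": "hi", "mem": "lo"}
        p_violation = model(sample)[1]    # posterior probability of violation
        if p_violation > VIOLATION_THRESHOLD:
            respond(sample, p_violation)  # e.g. shed load or add resources
        if time.time() - last_refresh > REFRESH_INTERVAL:
            model = retrain()             # cheap induction enables periodic refresh
            last_refresh = time.time()
        time.sleep(SAMPLE_PERIOD)
```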

  10. Setup
  • The system is a 3-tier web service:
    • Apache web server
    • Middleware (BEA WebLogic)
    • Oracle database
  • 3 servers, with HP OpenView collecting statistics
  • The load generator is httperf
  • An SLO indicator processes the logs to determine compliance (sketched below)
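
A minimal sketch of what such an SLO indicator could look like, assuming the logs reduce to (timestamp, response_time_ms) records sorted by time; the window size and threshold are illustrative stand-ins, not the paper's actual tooling.

```python
WINDOW_SECONDS = 300      # averaging window (hypothetical)
SLO_THRESHOLD_MS = 100.0  # average response-time objective (hypothetical)

def slo_compliance(records):
    """Label each time window as compliant (True) or in violation (False).
    `records` is an iterable of (timestamp_seconds, response_time_ms) tuples
    sorted by timestamp."""
    labels = []
    window_start, times = None, []
    for ts, rt in records:
        if window_start is None:
            window_start = ts
        elif ts - window_start >= WINDOW_SECONDS:
            labels.append((window_start, sum(times) / len(times) <= SLO_THRESHOLD_MS))
            window_start, times = ts, []
        times.append(rt)
    if times:  # close the final, possibly partial, window
        labels.append((window_start, sum(times) / len(times) <= SLO_THRESHOLD_MS))
    return labels
```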

  11. Interpretability and Modifiability
  • TANs offer other advantages:
    • Interpretability
    • Modifiability
  • The influence of each metric can be quantified in the probabilistic model
  • Analysis catalogs each type of violation according to the metrics and values that correlate with observed instances
  • A metric's strength is given by the probability of its value occurring in the different states (see the scoring sketch below)
  • This gives insight into the causes of violations and how to repair them
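
One way to quantify a metric's influence, sketched below, is a log-likelihood ratio comparing how probable its observed value is under violation versus compliance. This scoring scheme reuses the hypothetical p_cpu and p_mem tables from the TAN sketch above and is an illustration, not the paper's exact formula.

```python
import math

def metric_influence(sample):
    """Score each metric by log(P(value | violation) / P(value | compliance));
    positive scores implicate the metric in the violation state."""
    cpu, mem = sample["cpu"], sample["mem"]
    return {
        "cpu": math.log(p_cpu[(1, cpu)] / p_cpu[(0, cpu)]),
        "mem": math.log(p_mem[(1, cpu, mem)] / p_mem[(0, cpu, mem)]),
    }

# e.g. metric_influence({"cpu": "hi", "mem": "hi"}) gives positive scores for
# both metrics, suggesting both correlate with the observed violation.
```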

  12. Workloads
  • The workload generator varies several characteristics:
    • Aggregate request rate
    • Number of concurrent connections
    • Fraction of data-intensive vs. application-intensive requests
  • This exercises the model-induction methodology by providing it with a wide range of (M, P) pairs (assembled as sketched below), where:
    • M = a sample of values for the system metrics
    • P = a vector of application-level performance measurements
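
A minimal sketch of assembling such (M, P) pairs into training data, assuming metric samples and performance measurements can be aligned by a shared window timestamp; the alignment scheme and names are assumptions for illustration.

```python
def build_dataset(metric_samples, performance, slo_threshold_ms):
    """Pair each metric sample M with its performance measurement P and derive
    the SLO label the classifier is trained on.
    `metric_samples`: {timestamp: {metric_name: value}}   (the M's)
    `performance`:    {timestamp: avg_response_time_ms}   (a 1-element P here)"""
    dataset = []
    for ts, m in metric_samples.items():
        if ts in performance:
            violation = performance[ts] > slo_threshold_ms
            dataset.append((m, violation))
    return dataset
```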

  13. Workloads
  • RAMP: increasing concurrency
  • STEP: background traffic plus a step function
    • Constant background traffic
    • Bursty, hour-long bursts
  • BUGGY: increasing aggregate request rate

  14. Results
  • Varied the SLO thresholds to explore their effect on the induced models and to evaluate model accuracy under varying conditions
  • Trained and evaluated a TAN classifier for each of 31 different SLO definitions
  • Baselines: the 60th-percentile SLO classifier (MOD) and a classifier using CPU utilization as the only metric

  15. Results
  • Overall balanced accuracy (BA) of the TAN is 87-94% (BA is computed as sketched below)
  • Detection rate of 90+% for all experiments
  • False alarm rate of 6% for two experiments, 17% for BUGGY
  • A single metric (e.g., CPU) is not sufficient to capture the pattern of SLO violations
  • A small number of metrics (3-8) is sufficient to capture the pattern
  • The selected metrics are sensitive to the workload and the SLO definition (MOD always has a high detection rate, but generates false alarms at an increasing rate as the SLO threshold rises)
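
Balanced accuracy averages the detection rate and the true-negative rate, so a classifier cannot score well simply by always predicting the majority class. A minimal sketch (the definition is standard; the example data is made up):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the detection rate P(predict violation | violation) and the
    true-negative rate P(predict compliance | compliance)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# Illustrative only: 3 of 4 violations detected, 1 false alarm in 4 compliant windows
print(balanced_accuracy([1, 1, 1, 1, 0, 0, 0, 0],
                        [1, 1, 1, 0, 0, 0, 0, 1]))  # 0.75
```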

  16. Conclusion
  • TANs are attractive for self-managing systems:
    • Build system models automatically
    • Require no a priori knowledge
    • Generalize to a wide range of conditions
    • Zero in on the most relevant metrics
    • Practical

  17. Conclusion
  • Possible future work: adapt this approach to changing conditions
  • Close the loop for automated diagnosis and control
  • Ultimately, the most successful model is likely a hybrid of:
    • Automatically induced models
    • A priori models

  18. Questions?
