
  1. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
  Ira Cohen, Jeffrey S. Chase et al.

  2. Introduction
  • Networked systems continue to grow in scale
  • Complex behavior stems from the interaction of:
    • Workload
    • Software structure
    • Hardware
    • Traffic conditions
    • System goals
  • A pervasive management system is needed to manage such systems
  • Examples:
    • HP's OpenView
    • IBM's Tivoli
    • (These aggregate data and display it graphically)

  3. Introduction
  • Two approaches to building self-managing systems:
    • A priori models
    • Event-condition-action rules
  • Neither is derived from the behavior of the real system
  • Disadvantages:
    • Difficult and costly to construct
    • Unreliable; does not account for all conditions

  4. Introduction
  • Statistical learning techniques
    • Assume little to no domain knowledge
    • Hence "general"
  • Problem: we still have to identify techniques powerful enough to induce effective models that are:
    • Efficient
    • Accurate
    • Robust

  5. Goals
  • Automatically analyze instrumentation data from network services in order to:
    • Forecast
    • Diagnose
    • Repair failure conditions
  • Use Tree-Augmented Naive Bayesian networks (TANs) as the basis for diagnosis and forecasting from system-level instrumentation in a 3-tier network service
  • TANs are widely used in other fields, but have not been applied in the context of computer systems

  6. Goals
  • Analyzed data from 124 metrics gathered from a 3-tier e-commerce site under synthetic load
    • httperf as the load generator
    • Java PetStore as the application platform
  • The TAN model selects a combination of metrics and threshold values that correlates with compliance with Service Level Objectives (SLOs) for average response time
  • Results presented later

  7. What is a TAN?
  • A Bayesian network is an annotated directed acyclic graph encoding a joint probability distribution
  • Naive Bayesian network:
    • The state variable S is the only parent of all other vertices
    • Assumes all metrics are fully independent given S
  • TANs also capture relationships among the metrics themselves, with the constraint that each metric has at most one parent other than S (see the sketch below)
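
To make the factorization concrete, here is a minimal sketch of TAN inference over two discretized metrics. The metric names, tables, and probabilities are hypothetical illustrations; in the paper, both the tree structure and the conditional probabilities are learned from instrumentation data.

```python
# Hypothetical TAN over two discretized metrics.
# S is the SLO state: 0 = compliance, 1 = violation.
# Each metric has S as a parent; "cpu" is additionally the parent of "mem",
# so the metric-to-metric edges form a tree (here, a single edge).

p_s = {0: 0.8, 1: 0.2}  # prior over the state variable

# P(cpu | S), with cpu discretized to {"lo", "hi"}
p_cpu = {(0, "lo"): 0.9, (0, "hi"): 0.1,
         (1, "lo"): 0.3, (1, "hi"): 0.7}

# P(mem | cpu, S), with mem discretized to {"lo", "hi"}
p_mem = {(0, "lo", "lo"): 0.8, (0, "lo", "hi"): 0.2,
         (0, "hi", "lo"): 0.6, (0, "hi", "hi"): 0.4,
         (1, "lo", "lo"): 0.4, (1, "lo", "hi"): 0.6,
         (1, "hi", "lo"): 0.2, (1, "hi", "hi"): 0.8}

def posterior(sample):
    """P(S | cpu, mem) via the TAN factorization
    P(S) * P(cpu | S) * P(mem | cpu, S), normalized over S."""
    cpu, mem = sample["cpu"], sample["mem"]
    joint = {s: p_s[s] * p_cpu[(s, cpu)] * p_mem[(s, cpu, mem)] for s in (0, 1)}
    z = sum(joint.values())
    return {s: joint[s] / z for s in (0, 1)}

# Classify as a violation when P(S=1 | metrics) exceeds 0.5:
print(posterior({"cpu": "hi", "mem": "hi"}))
```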

  8. Why use a TAN?
  • Based on the premise that a relatively small subset of metrics and threshold values is sufficient to approximate the distribution accurately
  • Outperforms generalized Bayesian networks and other alternatives in both:
    • Cost
    • Accuracy

  9. Why use a TAN?
  • Useful for forecasting failures and violations
    • Possible to induce models that predict SLO violations in the near future, even while the system is still stable
  • An automated controller can invoke the model directly to:
    • Identify an impending violation
    • Respond, e.g., by shedding load or adding resources
  • The model is cheap to induce
    • Possible to maintain multiple models
    • Periodic refresh (a control-loop sketch follows)
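
As an illustration of how a controller might use such a model, here is a minimal control-loop sketch. It assumes the posterior function from the TAN sketch above; the hooks (collect_metrics, respond, retrain), thresholds, and intervals are hypothetical, not something the paper specifies.

```python
import time

VIOLATION_THRESHOLD = 0.5  # act when P(violation) exceeds this (hypothetical)
REFRESH_INTERVAL = 3600    # re-induce the model every hour (hypothetical)
SAMPLE_PERIOD = 30         # metric sampling period in seconds (hypothetical)

def control_loop(model, collect_metrics, respond, retrain):
    """Poll metrics, consult the TAN model, and react to forecast violations.
    `model` maps a metric sample to a posterior over the SLO state S."""
    last_refresh = time.time()
    while True:
        sample = collect_metrics()        # e.g. {"cpu": "hi", "mem": "lo"}
        p_violation = model(sample)[1]    # posterior probability of violation
        if p_violation > VIOLATION_THRESHOLD:
            respond(sample, p_violation)  # e.g. shed load or add resources
        if time.time() - last_refresh > REFRESH_INTERVAL:
            model = retrain()             # cheap induction enables periodic refresh
            last_refresh = time.time()
        time.sleep(SAMPLE_PERIOD)
```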

  10. Setup
  • The system is a 3-tier web service:
    • Apache web server
    • Middleware (BEA WebLogic)
    • Oracle database
  • 3 servers, with HP OpenView collecting statistics
  • The load generator is httperf
  • An SLO indicator processes the logs to determine compliance (sketched below)
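
A minimal sketch of what such an SLO indicator could look like, assuming the logs reduce to (timestamp, response_time_ms) records sorted by time; the window size and threshold are illustrative stand-ins, not the paper's actual tooling.

```python
WINDOW_SECONDS = 300      # averaging window (hypothetical)
SLO_THRESHOLD_MS = 100.0  # average response-time objective (hypothetical)

def slo_compliance(records):
    """Label each time window as compliant (True) or in violation (False).
    `records` is an iterable of (timestamp_seconds, response_time_ms) tuples
    sorted by timestamp."""
    labels = []
    window_start, times = None, []
    for ts, rt in records:
        if window_start is None:
            window_start = ts
        elif ts - window_start >= WINDOW_SECONDS:
            labels.append((window_start, sum(times) / len(times) <= SLO_THRESHOLD_MS))
            window_start, times = ts, []
        times.append(rt)
    if times:  # close the final, possibly partial, window
        labels.append((window_start, sum(times) / len(times) <= SLO_THRESHOLD_MS))
    return labels
```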

  11. Interpretability and Modifiability
  • TANs offer other advantages:
    • Interpretability
    • Modifiability
  • The influence of each metric can be quantified in the probabilistic model
  • Analysis catalogs each type of violation according to the metrics and values that correlate with observed instances
  • A metric's strength is given by the probability of its value occurring in the different states (see the scoring sketch below)
  • This gives insight into the causes of violations and how to repair them
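
One way to quantify a metric's influence, sketched below, is a log-likelihood ratio comparing how probable its observed value is under violation versus compliance. This scoring scheme reuses the hypothetical p_cpu and p_mem tables from the TAN sketch above and is an illustration, not the paper's exact formula.

```python
import math

def metric_influence(sample):
    """Score each metric by log(P(value | violation) / P(value | compliance));
    positive scores implicate the metric in the violation state."""
    cpu, mem = sample["cpu"], sample["mem"]
    return {
        "cpu": math.log(p_cpu[(1, cpu)] / p_cpu[(0, cpu)]),
        "mem": math.log(p_mem[(1, cpu, mem)] / p_mem[(0, cpu, mem)]),
    }

# e.g. metric_influence({"cpu": "hi", "mem": "hi"}) gives positive scores for
# both metrics, suggesting both correlate with the observed violation.
```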

  12. Workloads
  • The workload generator varies several characteristics:
    • Aggregate request rate
    • Number of concurrent connections
    • Fraction of data-intensive vs. application-intensive requests
  • This exercises the model-induction methodology by providing it with a wide range of (M, P) pairs (assembled as sketched below), where:
    • M = a sample of values for the system metrics
    • P = a vector of application-level performance measurements
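
A minimal sketch of assembling such (M, P) pairs into training data, assuming metric samples and performance measurements can be aligned by a shared window timestamp; the alignment scheme and names are assumptions for illustration.

```python
def build_dataset(metric_samples, performance, slo_threshold_ms):
    """Pair each metric sample M with its performance measurement P and derive
    the SLO label the classifier is trained on.
    `metric_samples`: {timestamp: {metric_name: value}}   (the M's)
    `performance`:    {timestamp: avg_response_time_ms}   (a 1-element P here)"""
    dataset = []
    for ts, m in metric_samples.items():
        if ts in performance:
            violation = performance[ts] > slo_threshold_ms
            dataset.append((m, violation))
    return dataset
```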

  13. Workloads
  • RAMP: increasing concurrency
  • STEP: background traffic plus a step function
    • Constant background traffic
    • Bursty, hour-long bursts
  • BUGGY: increasing aggregate request rate

  14. Results
  • Varied the SLO thresholds to explore their effect on the induced models and to evaluate model accuracy under varying conditions
  • Trained and evaluated a TAN classifier for each of 31 different SLO definitions
  • Baselines: the 60th-percentile SLO classifier (MOD) and a classifier using CPU utilization as the only metric

  15. Results
  • Overall balanced accuracy (BA) of the TAN is 87-94% (BA is computed as sketched below)
  • Detection rate of 90+% for all experiments
  • False alarm rate of 6% for two experiments, 17% for BUGGY
  • A single metric (e.g., CPU) is not sufficient to capture the pattern of SLO violations
  • A small number of metrics (3-8) is sufficient to capture the pattern
  • The selected metrics are sensitive to the workload and the SLO definition (MOD always has a high detection rate, but generates false alarms at an increasing rate as the SLO threshold rises)
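
Balanced accuracy averages the detection rate and the true-negative rate, so a classifier cannot score well simply by always predicting the majority class. A minimal sketch (the definition is standard; the example data is made up):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the detection rate P(predict violation | violation) and the
    true-negative rate P(predict compliance | compliance)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# Illustrative only: 3 of 4 violations detected, 1 false alarm in 4 compliant windows
print(balanced_accuracy([1, 1, 1, 1, 0, 0, 0, 0],
                        [1, 1, 1, 0, 0, 0, 0, 1]))  # 0.75
```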

  16. Conclusion
  • TANs are attractive for self-managing systems:
    • Build system models automatically
    • Require no a priori knowledge
    • Generalize to a wide range of conditions
    • Zero in on the most relevant metrics
    • Practical

  17. Conclusion
  • Possible future work: adapt this approach to changing conditions
  • Close the loop for automated diagnosis and control
  • Ultimately, the most successful model is likely a hybrid of:
    • Automatically induced models
    • A priori models

  18. Questions?
