# What Changed?

### What Changed?

Frank Bereznay

Kaiser Permanente

What Changed?
• Two Questions
• Can we use statistical techniques to help us differentiate between variation that is part of a normal operation and variation due to assignable causes?
• How should data be organized to properly use these techniques?
Agenda
• Frank’s One Hour Stat Class
• What is Statistics all about?
• Commonly Used Statistical Techniques
• Hypothesis Testing
• Statistical Process Control
• Analysis of Variance
• Time Series Data
• An Example
What is Statistics all about?
• Populations have Parameters.
• Samples have Statistics.
• Statistics is all about estimating Population parameters by taking samples and calculating statistics.
• A Key Question
• What is the data population you are trying to estimate and what are its properties?
Hypothesis Testing
• A very brief review
• Conduct an experiment to make a decision about a population parameter.
• Relies on the Central Limit Theorem to describe the properties of a sample.
• For sufficiently large samples, the sample mean is approximately normally distributed irrespective of the underlying population’s distribution.
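The Central Limit Theorem claim above can be checked with a small simulation. This is a minimal sketch, not part of the original talk: the exponential population, the sample size of 50, and the helper name `sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(draw, sample_size, num_samples, seed=0):
    """Return the means of num_samples independent samples of size sample_size."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical, strongly skewed population: exponential with mean 1.0.
def draw_exponential(rng):
    return rng.expovariate(1.0)

means = sample_means(draw_exponential, sample_size=50, num_samples=2000)
center = statistics.fmean(means)
spread = statistics.stdev(means)

# A normal distribution puts roughly 68% of its mass within one
# standard deviation of the mean; the sample means should come close.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(f"center={center:.3f} spread={spread:.3f} within 1 sd={within_one_sd:.2%}")
```

Although individual observations are far from normal, the sample means cluster symmetrically around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.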
Hypothesis Testing
• Classic form of test

H0: µ = 0

Ha: µ ≠ 0

• A level of confidence, usually 95%, is specified.
• You collect a sample set of values and compare the derived statistic against a normal distribution to make the determination.
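The test described above might look like the following sketch. It assumes a two-sided test against a known target value using the normal approximation; the function name `two_sided_z_test` and the response-time data are hypothetical and not from the presentation.

```python
import math
from statistics import NormalDist, fmean, stdev

def two_sided_z_test(sample, mu0, alpha=0.05):
    """Test H0: mu = mu0 against Ha: mu != mu0 with a normal approximation."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    z = (fmean(sample) - mu0) / standard_error
    # Two-sided p-value: probability of a statistic at least this extreme.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical data: 36 response times (seconds) scattered around 2.0.
response_times = [2.0 + 0.1 * ((i * 7) % 11 - 5) for i in range(36)]

z, p_value, reject = two_sided_z_test(response_times, mu0=2.0)
print(f"z={z:.2f} p={p_value:.3f} reject H0: {reject}")
```

An alpha of 0.05 corresponds to the 95% confidence level mentioned above. For small samples a t distribution would be the more usual reference distribution; the normal distribution is used here because that is what the slide describes.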
Hypothesis Testing
• So, what can this do for us?
• If we know what a metric is supposed to be, we can sample and test.
• This has some value for SLAs and other metrics that are mandated.
• Most of the time we don’t know the population parameters.
Statistical Process Control
• A bit of History
• Walter Shewhart
• W. Edwards Deming
• Post WWII Japan and the Deming Cycle
Statistical Process Control
• Key Concepts and Terms
• There is no up front hypothesis test.
• No information is required about the parameters of the process being evaluated.
• Control Chart
• Primary method to track a process.
• Many forms of the metric can be analyzed.
• Rational Subgroups
• Recurring sets of data that are summarized for analysis.
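As an illustration of how rational subgroups feed a control chart, the sketch below computes a center line and three-sigma limits for subgroup means. It assumes equal-size subgroups and a pooled within-subgroup variance, which is only one of several conventions (range-based constants are also common); the function names and CPU readings are hypothetical.

```python
import math
from statistics import fmean, variance

def xbar_limits(subgroups, sigmas=3.0):
    """Return (lower limit, center line, upper limit) for subgroup means."""
    n = len(subgroups[0])
    if any(len(g) != n for g in subgroups):
        raise ValueError("all subgroups must have the same size")
    center = fmean(fmean(g) for g in subgroups)
    # Pooled within-subgroup variance: the average of the subgroup variances.
    pooled_var = fmean(variance(g) for g in subgroups)
    half_width = sigmas * math.sqrt(pooled_var / n)
    return center - half_width, center, center + half_width

def is_out_of_control(subgroup, lcl, ucl):
    """Flag a subgroup whose mean falls outside the control limits."""
    return not lcl <= fmean(subgroup) <= ucl

# Hypothetical baseline: four subgroups of four CPU readings each.
baseline = [[50, 52, 49, 51], [51, 50, 52, 49], [49, 51, 50, 52], [52, 49, 51, 50]]
lcl, center, ucl = xbar_limits(baseline)

new_subgroup = [55, 56, 54, 57]
print(f"LCL={lcl:.2f} center={center:.2f} UCL={ucl:.2f}")
print("new subgroup out of control:", is_out_of_control(new_subgroup, lcl, ucl))
```

Note that no hypothesis about the process parameters is supplied: the limits come entirely from the baseline data, which is the point made in the bullets above.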
Statistical Process Control
• Sample Control Chart
Statistical Process Control
• A number of CMG papers have been published in this area:
• Brey “Managing at the Knee of the Curve (The use of SPC in Managing a Data Center)”, CMG90
• Lipner “Zero-Defect Capacity and Performance Management”, CMG92
• Chu “A Three Sigma Quality Target for 100 Percent SLA”, CMG92
• Buzen & Shum “MASF – Multivariate Adaptive Statistical Filtering”, CMG95
Statistical Process Control
• Following the Buzen & Shum paper, Trubin has published a set of papers on applications of MASF.
• However, the overall interest in this area seemed to wane.
• I believe it is primarily due to the complexity of the data we work with.
Analysis of Variance (ANOVA)
• Developed by Sir Ronald A. Fisher in the early 20th Century.
• Initial use was focused on helping the agriculture industry.
• A method was needed to evaluate the effectiveness of multiple simultaneous attempts to improve crop yield.
Analysis of Variance (ANOVA)
• Take an area of interest and sub-divide it into multiple populations.
• Subject these separate populations to various treatments.
• Make a determination if there are differences in the population means that can be attributed to the treatments.
Analysis of Variance (ANOVA)
• Hypothesis test

H0: µ1 = µ2 = µ3 = … = µn

Ha: Not all µi (i = 1, 2, 3, …, n) are equal

• This can be a very handy tool to determine if there are differences in the sub-groups within a body of data.
• Are business volumes the same Monday thru Friday?
• Is there a difference in Tuesday’s volume week over week?
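The F statistic behind this test is the ratio of between-group to within-group mean squares. The sketch below computes it from scratch; the function name `one_way_anova` and the daily volumes are hypothetical, and the p-value is left out because the Python standard library has no F distribution.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical business volumes for three weekdays, three weeks each.
volumes = {
    "Mon": [10, 12, 11],
    "Tue": [11, 13, 12],
    "Wed": [20, 22, 21],
}
f_value, df_between, df_within = one_way_anova(list(volumes.values()))
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```

A large F value relative to the critical value for the given degrees of freedom argues against H0, that is, against all group means being equal. It does not say which groups differ; that is what a follow-up comparison such as Tukey's test is for.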
Quick Summary
• We have described three techniques:
• Hypothesis Testing
• Statistical Process Control
• Analysis of Variance (ANOVA)
• Time to see an example
Bottling Process
• We have a process that puts a beverage in a bottle
• The intended fill volume is 2 liters, or 2,000 cubic centimeters (CM).
• We collect 36 random samples over a nine-day period.
• Sample Mean 1,999.51
• Sample Variance 1.89
• Standard Deviation 1.37
Bottling Process

Process Range = 3.9

Upper Control Limit = 2001.46

Lower Control Limit = 1997.56
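The reported limits can be cross-checked with simple arithmetic. The first part below uses only figures from the slides. The second part is an inference, not something the slides state: it assumes the limits are three-sigma limits on the means of subgroups of four, with sigma taken from the Error Mean Square (1.70058796) in the ANOVA output for the same data.

```python
import math

# Figures quoted on the slides.
sample_mean = 1999.51
process_range = 3.9

# The limits are symmetric about the sample mean and span the process range.
ucl = sample_mean + process_range / 2
lcl = sample_mean - process_range / 2
print(f"UCL={ucl:.2f} LCL={lcl:.2f}")

# Assumption: three-sigma limits for subgroup means, subgroup size 4,
# with the within-subgroup variance taken from the ANOVA error mean square.
error_mean_square = 1.70058796
subgroup_size = 4
half_width = 3 * math.sqrt(error_mean_square / subgroup_size)
print(f"3-sigma half width={half_width:.3f} vs reported {process_range / 2:.3f}")
```

The two half widths agree to within rounding, which is consistent with the assumed formula but does not prove it was the one used.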

The ANOVA Procedure

Class Level Information

| Class | Levels | Values |
|---|---|---|
| TimeStamp | 9 | 13MAR06 14MAR06 15MAR06 16MAR06 17MAR06 18MAR06 19MAR06 20MAR06 21MAR06 |

Number of Observations Used: 36

Dependent Variable: CM

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 8 | 20.23060000 | 2.52882500 | 1.49 | 0.2081 |
| Error | 27 | 45.91587500 | 1.70058796 | | |
| Corrected Total | 35 | 66.14647500 | | | |
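The entries in the ANOVA output are related by straightforward arithmetic, which the short check below reproduces. All numbers are copied from the output; only the variable names are invented here.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA output.
ss_model, df_model = 20.2306, 8
ss_error, df_error = 45.915875, 27
ss_total, df_total = 66.146475, 35

# Mean square = sum of squares / degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F value = model mean square / error mean square.
f_value = ms_model / ms_error
print(f"MS model={ms_model:.6f} MS error={ms_error:.8f} F={f_value:.2f}")

# The corrected total reproduces the sample variance and standard deviation
# quoted for the 36 observations.
sample_variance = ss_total / df_total
print(f"variance={sample_variance:.2f} std dev={math.sqrt(sample_variance):.2f}")
```

With Pr > F of 0.2081, well above 0.05, the output gives no evidence that the nine daily means differ, so the days can be treated as one process.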

Tukey's Studentized Range (HSD) Test for CM

Means with the same letter are not significantly different.

| Tukey Grouping | Mean | N | TimeStamp |
|---|---|---|---|
| A | 2000.3525 | 4 | 13MAR06 |
| A | 2000.3200 | 4 | 20MAR06 |
| A | 2000.2050 | 4 | 19MAR06 |
| A | 1999.8775 | 4 | 18MAR06 |
| A | 1999.8275 | 4 | 14MAR06 |
| A | 1999.1275 | 4 | 16MAR06 |
| A | 1999.0150 | 4 | 21MAR06 |
| A | 1998.8075 | 4 | 17MAR06 |
| A | 1998.0650 | 4 | 15MAR06 |

Second Quick Summary
• That was reassuring!
• When we purchase a bottle of wine we can be sure we are getting our money’s worth.
• Why did Frank choose this example?
• What is the relevance to our commercial computing environments?
Time Series Data
• The instrumentation data we analyze is a very complex data aggregate that contains the influences of multiple factors.
• Many of the factors are related to time or duration.
• Hour of the day, Day of the week.
• Day of the month, Month of the Year.
• Growth rate for the enterprise.
Time Series Data
• Example of overall MIPS usage.
Time Series Data
• Time Series data generally contains four components.
• Trend
• Long term constant movement.
• Cycle
• Movement pattern greater than a year.
• Seasonal Variations
• Movement patterns within a year.
• Irregular Fluctuations
• Events not triggered by a duration.
Time Series Data
• To properly work with this type of data you need to decompose it into its components before you begin the testing.
• You have four separate questions to ask, one for each component.
• Did the trend component change?
• Did the cycle component change?
• Did the seasonal component change?
• Did the irregular component change?
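To make the idea of decomposition concrete, here is a minimal sketch of a classical additive decomposition. It is not the presenter's method: it assumes an additive model, an odd period (7 for day of week), and it folds any cycle component into the trend because the synthetic series is far shorter than a year. The function name `decompose_additive` and the data are hypothetical.

```python
from statistics import fmean

def decompose_additive(series, period):
    """Split a series into trend, seasonal, and irregular parts."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles an odd period")
    half = period // 2
    n = len(series)
    # Trend: centered moving average over one full period.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = fmean(series[i - half:i + half + 1])
    # Seasonal: average detrended value at each position within the period.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    raw = [fmean(b) for b in buckets]
    offset = fmean(raw)
    seasonal = [r - offset for r in raw]  # seasonal effects sum to zero
    # Irregular: whatever is left after removing trend and seasonal parts.
    irregular = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i % period]
        for i in range(n)
    ]
    return trend, seasonal, irregular

# Synthetic series: linear growth plus a fixed day-of-week pattern, four weeks.
pattern = [5, 3, 0, -2, -4, -3, 1]
series = [100 + 2 * i + pattern[i % 7] for i in range(28)]
trend, seasonal, irregular = decompose_additive(series, period=7)
print([round(s, 2) for s in seasonal])
```

Once the components are separated, each of the four questions above can be asked of its own series instead of the raw aggregate.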
Time Series Data
• While it is possible to perform the decomposition of Time Series data into its components, the best strategy is to avoid the need to do so.
• Choose granular data intervals.
• Keep the number of intervals to a minimum.
• Start with a 24x7 type of matrix and use ANOVA to determine the hours that belong together.
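The 24x7 matrix can be built by bucketing each measurement by its day of week and hour of day. The sketch below assumes the input is a list of (timestamp, value) pairs; the function name `group_by_day_hour` and the synthetic CPU data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

def group_by_day_hour(samples):
    """Map (weekday, hour) to the list of values seen in that cell.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    """
    cells = defaultdict(list)
    for timestamp, value in samples:
        cells[(timestamp.weekday(), timestamp.hour)].append(value)
    return dict(cells)

# Synthetic data: four weeks of hourly CPU readings with a daily ramp.
start = datetime(2006, 3, 6)
samples = [
    (start + timedelta(hours=h), 20.0 + (h % 24))
    for h in range(24 * 7 * 4)
]

cells = group_by_day_hour(samples)
thursday_10am = cells[(3, 10)]
print(len(cells), "cells;", "Thursday 10:00 mean =", fmean(thursday_10am))
```

Each cell then holds one value (or one subgroup) per week, which is the shape needed to run ANOVA across weeks and to decide which neighboring hours behave alike.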
Third Quick Summary
• OK, now we have a strategy.
• Treat each hour of each day as a separate process.
• Use ANOVA to see how similar the day/hour combination is week over week. Are we selecting the right combination?
• Use SPC to develop a process mean and control limits for this day/hour.
• Plot the results on a day by day basis.
• Let’s see how this looks.
Example – Midrange CPU

Tukey's Studentized Range (HSD) Test for CPUAVE

| Parameter | Value |
|---|---|
| Alpha | 0.05 |
| Error Degrees of Freedom | 15 |
| Error Mean Square | 5.953 |
| Critical Value of Studentized Range | 4.36699 |
| Minimum Significant Difference | 5.3275 |

Means with the same letter are not significantly different.

| Tukey Grouping | Mean | N | DATE |
|---|---|---|---|
| A | 34.800 | 4 | 30MAR06 |
| B A | 29.850 | 4 | 09MAR06 |
| B | 28.650 | 4 | 23MAR06 |
| B | 28.075 | 4 | 02MAR06 |
| B | 28.025 | 4 | 16MAR06 |
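The letter groupings follow from comparing each pair of means against the minimum significant difference, which for equal group sizes is the critical value times the square root of the error mean square over the group size. The check below uses only the numbers in the output; the variable names are invented here.

```python
import math
from itertools import combinations

# Values copied from the Tukey output.
q_critical = 4.36699
error_mean_square = 5.953
n_per_group = 4

# Minimum significant difference for equal group sizes.
msd = q_critical * math.sqrt(error_mean_square / n_per_group)
print(f"minimum significant difference = {msd:.4f}")

means = {
    "30MAR06": 34.800,
    "09MAR06": 29.850,
    "23MAR06": 28.650,
    "02MAR06": 28.075,
    "16MAR06": 28.025,
}

# A pair of dates differs significantly when their means are further
# apart than the minimum significant difference.
significant_pairs = [
    (a, b) for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) > msd
]
print(significant_pairs)
```

Only 30MAR06 separates from the three lowest dates, while 09MAR06 is compatible with both ends, which is why it carries both letters. In terms of the talk's question, 30MAR06 is the day where something changed.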

Summary
• These statistical tools can really help.
• But it is not a slam dunk to implement.
• You need to get to know your data.
• Producing the information is only the beginning.
• Recall the problems Shewhart and Deming had.
• This really needs to be the basis for managing the environment.