# Characterizing Variability and Comparing Patterns from Data

## Characterizing Variability and Comparing Patterns from Data

“Statistics”

Module 3

### Outline

• random samples

• notion of a statistic

• estimating the mean - sample average

• assessing the impact of variation on estimates - sampling distribution

• estimating variance - sample variance and standard deviation

• making decisions - comparisons of means, variances using confidence intervals, hypothesis tests

J. McLellan

### Random Samples

Scenario -

• we have an underlying pattern of variability for a process which we would like to characterize -- the population

• we perform a series of experiments on the process in such a way that the results are independent - outcome of one experiment has no influence on any other experiment

• the underlying distribution in place during each experimental run is identical to that of the population

• when we run each experiment, we are collecting a value from the random variable Xi - which has uncertainty

• Xi represents the “i-th” act of sampling - referred to as a sample random variable

### Definition - Random Sample

A random sample of size $n$ of a population random variable $X$ is a collection of random variables $X_1, \ldots, X_n$ such that

• the $X_i$’s are independent

• the $X_i$’s have distributions identical to that of $X$, i.e.,

$$F_{X_i}(x) = F_X(x)$$

Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.

What do we do with these sample values?...

### Sample Average

• used to estimate the mean

• given $n$ samples, $X_1, \ldots, X_n$, compute

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

• interpretation - a rule for computing the sample average, involving sampling

• $\bar{X}$ is a random variable

• observed value

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Lower case is used to denote observed values of the sample random variables and average.
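
As a minimal sketch of the observed sample average, the rule "add the observed values and divide by n" can be written in Python. The function name `sample_average` and the five data values are hypothetical, chosen only for illustration; they do not come from the slides.

```python
def sample_average(observations):
    """Return x-bar = (1/n) * sum of x_i for a non-empty list of observed values."""
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(observations) / n


# Hypothetical observed values x_1, ..., x_5 (illustration only).
x = [75.2, 76.8, 74.9, 77.3, 76.3]
x_bar = sample_average(x)
print(f"observed sample average: {x_bar:.2f}")
```

A different set of five observations would give a different value of `x_bar`, which is the sense in which the sample average is a random variable.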

### Statistics

• Sample average is an example of a “statistic”

Definition

A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.

• e.g., sample average estimates the mean $\mu$ and doesn’t depend on unknown parameters

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

### Sampling Distribution

A statistic is a random variable, with its own probability distribution

• distribution arises from probability distribution of underlying population, via the sample random variables

• distribution of the statistic is called the sampling distribution

• characteristics of the sampling distribution depend on:

• the form of the statistic - e.g., linear function of the sample random variables

• the distribution of the underlying population

### Sampling Distribution for the Sample Average

• determine the mean and variance of the sample average

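Mean

$$E\{\bar{X}\} = E\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i \right\} = \frac{1}{n} E\left\{ \sum_{i=1}^{n} X_i \right\}$$

$$= \frac{1}{n} \sum_{i=1}^{n} E\{X_i\} = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{n \mu}{n} = \mu$$

Each $E\{X_i\} = \mu$ because the sample random variables have the same distribution as the population.

Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.

Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)$$

$$= \frac{1}{n^2} \mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i)$$

$$= \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}$$

The variance of the sum splits into the sum of the variances because of independence of sample random variables.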

### Aside - Variance

If we have a sum of independent random variables, $X$ and $Y$, with $a$ and $b$ constants, then

$$\mathrm{Var}(aX + bY) = a^2 \, \mathrm{Var}(X) + b^2 \, \mathrm{Var}(Y)$$

### Variance of Sample Average

Interpretation

• variance of sample average is $\sigma^2 / n$

• as n becomes larger, variance of sample average becomes smaller

• as more data is used, estimate becomes more precise

• sample average represents a concentration of information
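
The mean and variance results for the sample average can be checked numerically. The sketch below assumes a Normal population with hypothetical values `mu = 70.0` and `sigma = 2.0`, draws many random samples of size `n = 10`, and compares the mean and variance of the resulting sample averages with $\mu$ and $\sigma^2/n$. The function name and all numbers are illustrative assumptions, not part of the original slides.

```python
import random
from statistics import fmean, variance


def simulate_sample_averages(mu, sigma, n, repeats, seed=0):
    """Draw `repeats` random samples of size n and return their sample averages."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(repeats)
    ]


mu, sigma, n = 70.0, 2.0, 10  # hypothetical population values and sample size
averages = simulate_sample_averages(mu, sigma, n, repeats=20000)

mean_of_averages = fmean(averages)    # should be close to mu
var_of_averages = variance(averages)  # should be close to sigma**2 / n
theoretical_var = sigma**2 / n

print(f"mean of sample averages:     {mean_of_averages:.3f} (mu = {mu})")
print(f"variance of sample averages: {var_of_averages:.3f} (sigma^2/n = {theoretical_var:.3f})")
```

Re-running with a larger `n` should shrink the variance of the averages in proportion to `1/n`.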

### Distribution of the Sample Average

• in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)

• Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large

• even if underlying population is non-Normal

• important consequences for comparing values - hypothesis tests and confidence limits
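
A rough numerical illustration of the Central Limit Theorem, under stated assumptions: the population is taken to be exponential with rate 1 (mean 1, standard deviation 1), which is clearly non-Normal, and the sample size is 50. If the standardized sample average behaves approximately like a standard Normal random variable, about 95% of its values should fall between -1.96 and 1.96. The function name, the choice of population, and the sample size are hypothetical choices for this sketch.

```python
import math
import random
from statistics import fmean


def fraction_within_fences(n, repeats, fence=1.96, seed=1):
    """Fraction of standardized sample averages falling inside (-fence, fence)."""
    rng = random.Random(seed)
    mu = 1.0     # mean of an exponential population with rate 1
    sigma = 1.0  # standard deviation of the same population
    inside = 0
    for _ in range(repeats):
        x_bar = fmean([rng.expovariate(1.0) for _ in range(n)])
        z = (x_bar - mu) / (sigma / math.sqrt(n))
        if -fence < z < fence:
            inside += 1
    return inside / repeats


frac = fraction_within_fences(n=50, repeats=20000)
print(f"fraction within +/-1.96: {frac:.3f}")
```

For very small `n` the agreement with 0.95 is expected to be poorer, since the approximation improves as the number of samples grows.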

### Sample Variance

The population variance $\sigma^2$ is estimated using the following statistic, the sample variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Observed value:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Mean of the sample variance:

$$E\{s^2\} = \sigma^2$$

Sample variance is an UNBIASED estimator of variance.
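
A minimal sketch of the observed sample variance, with the divisor n-1 written out explicitly. The function name `sample_variance` and the data values are hypothetical; the result can be cross-checked against `statistics.variance` in the Python standard library, which also divides by n-1.

```python
from statistics import variance


def sample_variance(observations):
    """Return s^2 = sum((x_i - x_bar)^2) / (n - 1) for observed values."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(observations) / n
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)


# Hypothetical observed values (illustration only).
x = [75.2, 76.8, 74.9, 77.3, 76.3]
s2 = sample_variance(x)
print(f"sample variance:     {s2:.4f}")
print(f"statistics.variance: {variance(x):.4f}")
```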

### Sample Standard Deviation

… is simply the square root of the sample variance

BUT

• sample standard deviation is a biased estimator of population standard deviation

• its expected value is not equal to the population value

$$E\{s\} \neq \sigma$$
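
The bias of the sample standard deviation can be seen in a small simulation. The sketch below assumes a Normal population with a hypothetical `sigma = 2.0` and a small sample size `n = 5`, where the effect is easiest to see: the average of $s^2$ over many samples should sit near $\sigma^2 = 4$, while the average of $s$ should fall noticeably below $\sigma = 2$. Names and values are illustrative assumptions.

```python
import math
import random
from statistics import fmean, variance


def average_s_and_s2(sigma, n, repeats, seed=2):
    """Return (average of s, average of s^2) over many random samples of size n."""
    rng = random.Random(seed)
    s_values, s2_values = [], []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = variance(sample)  # divides by n - 1
        s2_values.append(s2)
        s_values.append(math.sqrt(s2))
    return fmean(s_values), fmean(s2_values)


mean_s, mean_s2 = average_s_and_s2(sigma=2.0, n=5, repeats=20000)
print(f"average of s^2: {mean_s2:.3f} (sigma^2 = 4.0)")
print(f"average of s:   {mean_s:.3f} (sigma = 2.0)")
```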

### Confidence Intervals

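Consider the sample average

$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$

Here “$\sim$” reads “is distributed as”, and $N(\mu_X, \sigma_X^2 / n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2 / n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$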

### Confidence Intervals

Start with the distribution for the standard normal -

$$P(-1.96 < Z < 1.96) = 0.95$$

and consider the definition of Z -

$$P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95$$

$$\Leftrightarrow \; P\left( \mu_X - 1.96 \, \sigma_X / \sqrt{n} < \bar{X} < \mu_X + 1.96 \, \sigma_X / \sqrt{n} \right) = 0.95$$

### Confidence Intervals

Rearrange this last statement to obtain:

$$P\left( \bar{X} - 1.96 \, \sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96 \, \sigma_X / \sqrt{n} \right) = 0.95$$

The two endpoints, which depend on $\bar{X}$, are RANDOM; the mean $\mu_X$ between them is NOT random.

Interpretation -

• limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean

### Confidence Intervals

• this interval DOES NOT imply that the mean $\mu_X$ is uncertain

Picture - sequence of intervals associated with repeated experimentation, drawn against the true value of the mean
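
The repeated-experimentation picture can be imitated with a short simulation. This is a sketch under assumptions: the population is Normal with hypothetical true mean 70.0 and known standard deviation 2.1, each experiment collects 10 points, and the interval is the average plus or minus 1.96 standard deviations of the average. The true mean stays fixed; only the intervals move from one repetition to the next.

```python
import math
import random
from statistics import fmean


def coverage(mu, sigma, n, repeats, z=1.96, seed=3):
    """Fraction of repeated intervals x_bar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # fixed, because sigma is treated as known
    hits = 0
    for _ in range(repeats):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1
    return hits / repeats


cov = coverage(mu=70.0, sigma=2.1, n=10, repeats=20000)
print(f"fraction of intervals containing the true mean: {cov:.3f}")
```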

### Confidence Intervals

General result for mean -

100(1-α)% confidence interval given by:

$$\bar{X} - z_{\alpha/2} \, \sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2} \, \sigma_X / \sqrt{n}$$

where -

• $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$

• value obtained from tables

• 95% - value is 1.96 - approximately 2

• 99% - value is 2.576
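
Instead of reading the fence from tables, it can be computed. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_fence` is a hypothetical choice for this illustration.

```python
from statistics import NormalDist


def z_fence(alpha):
    """Return z_{alpha/2}: the value with upper tail area alpha/2 under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


z95 = z_fence(0.05)  # fence for a 95% interval
z99 = z_fence(0.01)  # fence for a 99% interval
print(f"95% fence: {z95:.3f}")
print(f"99% fence: {z99:.3f}")
```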

### Confidence Intervals

General Approach

• form a quantity with a known distribution that depends on the parameter of interest

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$

• form a probability statement - choose fences (limits) with a known probability

$$P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95$$

• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest

$$P\left( \bar{X} - 1.96 \, \sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96 \, \sigma_X / \sqrt{n} \right) = 0.95$$

### Confidence Intervals for Mean

When population variance is “known”, 100(1-α)% confidence interval is -

$$\bar{X} - z_{\alpha/2} \, \sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2} \, \sigma_X / \sqrt{n}$$

Known variance -

• knowledge of variance when process has been operating steadily for long period of time

• on basis of extensive operating experience

• “large number of data points”

### Confidence Intervals for Mean

What if variance is unknown?

• Estimate using sample variance $s^2$

Follow previous approach by forming standardized quantity:

$$\frac{\bar{X} - \mu_X}{s / \sqrt{n}}$$

• issue - $s^2$ is a statistic itself, and is a random variable

• this quantity no longer has a standard Normal distribution

Solution -

• what is the probability distribution of this quantity, when data are Normally distributed?

### Student’s t Distribution

When the data are Normally distributed,

$$\frac{\bar{X} - \mu_X}{s / \sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom -

• number of statistically independent pieces of information used to compute sample variance

• recall that in $s^2$, we divide by n-1 where n is the number of data points

### Student’s t Distribution

… has a shape similar to that of Normal distribution

• symmetric

• values are available in tables

• extra parameter in tables - degrees of freedom

Picture - Student’s t distribution with 3 degrees of freedom

### Confidence Intervals for Mean

Variance Unknown

• estimated using sample variance

• 100(1-α)% case

$$\bar{X} - t_{\nu, \alpha/2} \, s / \sqrt{n} < \mu_X < \bar{X} + t_{\nu, \alpha/2} \, s / \sqrt{n}$$

• $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)

• obtained following identical argument used in the known variance case

### Example #1

Conversion in a chemical reactor using new catalyst preparation

• data collected, average conversion computed using 10 data points is 76.1%

• prior operating history indicates that variance of conversion is 4.41 %²

• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

### Example #1

• Confidence interval - 95%

• upper tail area is 2.5%, so $z_{0.025} = 1.96$

• standard devn = sqrt(4.41) = 2.1

• confidence interval

$$76.1 - (1.96)(2.1) / \sqrt{10} < \mu < 76.1 + (1.96)(2.1) / \sqrt{10}$$

$$\Rightarrow \; 74.8 < \mu < 77.4$$

• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
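
The arithmetic of this example can be reproduced with a short Python sketch. The numbers (average 76.1, variance 4.41, 10 data points, current conversion 70) are those of the example; the function name `mean_ci_known_variance` is a hypothetical helper, and the fence is computed with `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist


def mean_ci_known_variance(x_bar, variance, n, alpha=0.05):
    """Return (lower, upper) limits of the 100(1-alpha)% interval for the mean."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    half_width = z * math.sqrt(variance) / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


lower, upper = mean_ci_known_variance(x_bar=76.1, variance=4.41, n=10)
current_conversion = 70.0
contains_current = lower < current_conversion < upper

print(f"95% interval: {lower:.1f} < mu < {upper:.1f}")
print(f"interval contains current conversion: {contains_current}")
```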

### Example #2

Conversion in a chemical reactor using new catalyst preparation

• data collected, average conversion computed using 10 data points is 76.1%

• current data set of 10 points used to estimate sample variance, which is 5.3 %²

• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

### Example #2

• Confidence interval - 95%

• variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9

• upper tail area is 2.5%, so $t_{9, 0.025} = 2.262$

• standard devn = sqrt(5.3) = 2.3

• confidence interval

$$76.1 - (2.262)(2.3) / \sqrt{10} < \mu < 76.1 + (2.262)(2.3) / \sqrt{10}$$

$$\Rightarrow \; 74.5 < \mu < 77.7$$

• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
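
A sketch of the same calculation in Python. The Python standard library has no Student's t quantile function, so the tabulated value 2.262 (9 degrees of freedom, upper tail area 2.5%) used in the example is passed in as a parameter; the function name `mean_ci_unknown_variance` is a hypothetical helper.

```python
import math


def mean_ci_unknown_variance(x_bar, s2, n, t_value):
    """Return (lower, upper) limits of x_bar +/- t * s / sqrt(n).

    t_value is the tabulated Student's t fence for n - 1 degrees of freedom.
    """
    s = math.sqrt(s2)  # sample standard deviation
    half_width = t_value * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


T_9_0025 = 2.262  # from tables, as used in the example
lower, upper = mean_ci_unknown_variance(x_bar=76.1, s2=5.3, n=10, t_value=T_9_0025)
contains_current = lower < 70.0 < upper

print(f"95% interval: {lower:.1f} < mu < {upper:.1f}")
print(f"interval contains current conversion: {contains_current}")
```

The interval is wider than in the known-variance case, partly because the t fence (2.262) is larger than the Normal fence (1.96).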

### Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

• when data are Normally distributed, sample variance is the sum of squared Normal random variables

• squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry

### Chi-squared distribution

• is the distribution of a squared standard Normal random variable

$$Z^2 \sim \chi^2_1$$

• Chi-squared random variable with 1 degree of freedom

• degrees of freedom = number of independent standard Normal random variables being squared

• e.g.,

$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$

• 3 degrees of freedom

Picture - Chi-squared distribution with 3 degrees of freedom
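
The construction of a Chi-squared random variable can be imitated directly: square independent standard Normal draws and add them. The sketch below does this for 3 degrees of freedom. It relies on two standard facts not stated on the slide, namely that the mean of a Chi-squared random variable equals its degrees of freedom, and that for a right-skewed distribution such as this one the mean lies above the median. The function name is hypothetical.

```python
import random
from statistics import fmean, median


def simulate_chi_squared(dof, repeats, seed=4):
    """Return `repeats` draws, each the sum of `dof` squared standard Normal values."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
        for _ in range(repeats)
    ]


draws = simulate_chi_squared(dof=3, repeats=20000)
mean_draws = fmean(draws)
median_draws = median(draws)

print(f"mean of draws:   {mean_draws:.3f}")
print(f"median of draws: {median_draws:.3f}")
```

All draws are non-negative, and the gap between mean and median reflects the asymmetry produced by squaring.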

### Sampling distribution - sample variance

Sample variance

• is the sum of n squared Normal random variables BUT the terms we add are squared deviations from the sample average

• given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)

• sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1

$$s^2 \sim \frac{\sigma^2}{n-1} \, \chi^2_{n-1}$$

### Confidence Intervals - Sample Variance

• Form probability statement

$$P\left( \chi^2_{n-1, \, 1-\alpha/2} < \frac{(n-1) \, s^2}{\sigma^2} < \chi^2_{n-1, \, \alpha/2} \right) = 1 - \alpha$$

• Re-arrange statement

$$P\left( \frac{(n-1) \, s^2}{\chi^2_{n-1, \, \alpha/2}} < \sigma^2 < \frac{(n-1) \, s^2}{\chi^2_{n-1, \, 1-\alpha/2}} \right) = 1 - \alpha$$

• 100(1-α)% interval is

$$\frac{(n-1) \, s^2}{\chi^2_{n-1, \, \alpha/2}} < \sigma^2 < \frac{(n-1) \, s^2}{\chi^2_{n-1, \, 1-\alpha/2}}$$

### Confidence Limits for Variance

Notes

1) the tail areas are equal

• symmetric tail areas

however the interval can be asymmetric

• consequence of asymmetry of Chi-squared distribution

2) $\chi^2_{n-1, \, 1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom

Picture - Chi-squared distribution with equal tail areas

### Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymer reactor -

• variance under previous operation was 4.7 °C²

• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²

• is the variance under the new control operation significantly better?

• i.e., is variance under new operation significantly lower?

### Variance Confidence Intervals - Example

Use confidence interval for variance

• n-1 = 10-1 = 9 degrees of freedom

• form 95% confidence interval (α = 0.05)

• from tables:

$$\chi^2_{9, \, 1-0.025} = 2.7 \qquad \chi^2_{9, \, 0.025} = 19.0$$

• interval for variance:

$$1.52 < \sigma^2 < 10.67$$

• conclusion - the previous variance of 4.7 lies inside the interval, so variance reduction isn’t significant after background variation in sample variance computation is taken into account

• note that interval isn’t symmetric
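
The interval in this example can be reproduced with a few lines of Python. The sketch takes the two tabulated Chi-squared values from the example (19.0 and 2.7 for 9 degrees of freedom) as inputs rather than computing them, since the Python standard library has no Chi-squared quantile function; the function name `variance_ci` is a hypothetical helper.

```python
def variance_ci(s2, n, chi2_upper_tail, chi2_lower_tail):
    """Return (lower, upper) limits of the confidence interval for the variance.

    chi2_upper_tail: Chi-squared value with upper tail area alpha/2
    chi2_lower_tail: Chi-squared value with upper tail area 1 - alpha/2
    """
    dof = n - 1
    return dof * s2 / chi2_upper_tail, dof * s2 / chi2_lower_tail


CHI2_9_0025 = 19.0  # from tables, upper tail area 0.025
CHI2_9_0975 = 2.7   # from tables, upper tail area 0.975

lower, upper = variance_ci(s2=3.2, n=10,
                           chi2_upper_tail=CHI2_9_0025,
                           chi2_lower_tail=CHI2_9_0975)
previous_variance = 4.7
contains_previous = lower < previous_variance < upper

print(f"95% interval: {lower:.2f} < sigma^2 < {upper:.2f}")
print(f"interval contains previous variance: {contains_previous}")
```

The larger tabulated value produces the lower limit and the smaller one the upper limit, which is why the interval is not centred on the sample variance.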

### Variance Confidence Intervals - Example

Comment

• variance is sensitive to degrees of freedom

• need larger number of data points to obtain precise estimate

• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

$$2.04 < \sigma^2 < 5.71$$

• cf. previous interval with 10 data points

$$1.52 < \sigma^2 < 10.67$$

Conclusion still doesn’t change, however.