Characterizing Variability and Comparing Patterns from Data


# Characterizing Variability and Comparing Patterns from Data - PowerPoint PPT Presentation


## PowerPoint Slideshow about 'Characterizing Variability and Comparing Patterns from Data' - gyda

Presentation Transcript

### Characterizing Variability and Comparing Patterns from Data

“Statistics”

Module 3

Outline
• random samples
• notion of a statistic
• estimating the mean - sample average
• assessing the impact of variation on estimates - sampling distribution
• estimating variance - sample variance and standard deviation
• making decisions - comparisons of means, variances using confidence intervals, hypothesis tests

J. McLellan

Random Samples

Scenario -

• we have an underlying pattern of variability for a process which we would like to characterize -- the population
• we perform a series of experiments on the process in such a way that the results are independent - outcome of one experiment has no influence on any other experiment
• the underlying distribution in place during each experimental run is identical to that of the population
• when we run each experiment, we are collecting a value from the random variable Xi - which has uncertainty
• Xi represents the “i-th” act of sampling - referred to as a sample random variable


Definition - Random Sample

A random sample of size “n” of a population random variable X is a collection of random variables X1, …, Xn such that

• the Xi’s are independent
• the Xi’s have distributions identical to that of X, i.e.,

$$F_{X_i}(x) = F_X(x)$$

Each Xi represents a snapshot of the process. The Xi’s are referred to as sample random variables.

What do we do with these sample values?...

Sample Average
• used to estimate the mean
• given “n” samples, X1, …, Xn, compute

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

• interpretation - a rule for computing the sample average, involving sampling
• is a random variable
• observed value

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Lower case is used to denote observed values of the sample random variables and average.

Statistics
• Sample average is an example of a “statistic”

Definition

A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.

• e.g., sample average estimates the mean $\mu$ and doesn’t depend on unknown parameters

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Sampling Distribution

A statistic is a random variable, with its own probability distribution

• distribution arises from probability distribution of underlying population, via the sample random variables
• distribution of the statistic is called the sampling distribution
• characteristics of the sampling distribution depend on:
• the form of the statistic - e.g., linear function of the sample random variables
• the distribution of the underlying population


Sampling Distribution for the Sample Average
• determine the mean and variance of the sample average

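Mean

$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}E\left\{\sum_{i=1}^{n} X_i\right\}$$

$$= \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n}\mu = \frac{n\mu}{n} = \mu$$

Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.

The expectation passes through the sum by linearity, and each $E\{X_i\} = \mu$ because the sample random variables have the same distribution as the population. Independence of the sample random variables is not needed for this step; it is needed for the variance of the sample average.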

Sampling Distribution for the Sample Average

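Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)$$

$$= \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)$$

$$= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

Splitting the variance of the sum into a sum of variances relies on the independence of the sample random variables.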

Aside - Variance

If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$

Variance of Sample Average

Interpretation

• variance of sample average is $\sigma^2/n$
• as n becomes larger, variance of sample average becomes smaller
• as more data is used, estimate becomes more precise
• sample average represents a concentration of information
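
The mean and variance results for the sample average can be checked numerically. The sketch below is not part of the original slides: it assumes a Normal population with mean 10 and standard deviation 2 (arbitrary illustrative values), and the helper name `simulate_sample_averages` is hypothetical.

```python
import random
import statistics


def simulate_sample_averages(n, repeats, mu=10.0, sigma=2.0, seed=0):
    """Draw `repeats` random samples of size n; return the sample average of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]


# Assumed population: mean 10, variance 4 (illustrative values only).
averages_n5 = simulate_sample_averages(n=5, repeats=10000)
averages_n50 = simulate_sample_averages(n=50, repeats=10000)

# The average of the sample averages should sit near mu = 10 in both cases,
# while their variance should be near sigma^2 / n (4/5 and 4/50).
print(statistics.fmean(averages_n5), statistics.pvariance(averages_n5))
print(statistics.fmean(averages_n50), statistics.pvariance(averages_n50))
```

Increasing n from 5 to 50 should shrink the spread of the sample averages by roughly a factor of ten, which is the "more data, more precise estimate" statement in numerical form.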


Distribution of the Sample Average
• in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
• Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
• even if underlying population is non-Normal
• important consequences for comparing values - hypothesis tests and confidence limits
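
A rough numerical illustration of the Central Limit Theorem, offered as a sketch rather than a proof: sample averages are drawn from an exponential population (an assumed, strongly skewed example), and the skewness of the averages is compared for a small and a larger sample size. The helper names `skewness` and `averages_from_exponential` are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def averages_from_exponential(n, repeats, seed=1):
    """Sample averages of n draws from an exponential population with mean 1."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]


skew_small = skewness(averages_from_exponential(n=2, repeats=10000))
skew_large = skewness(averages_from_exponential(n=50, repeats=10000))

# The averages of 50 points should be much closer to symmetric (Normal-like)
# than the averages of 2 points, even though the population itself is skewed.
print(skew_small, skew_large)
```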


Sample Variance

… is estimated using the following statistic:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

Observed value:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Mean of the sample variance:

$$E\{s^2\} = \sigma^2$$

Sample variance is an UNBIASED estimator of variance.

Sample Standard Deviation

… is simply the square root of the sample variance

BUT

• sample standard deviation is a biased estimator of population standard deviation
• value on average does not tend to population value

$$E\{s\} \neq \sigma$$
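
The contrast between the unbiased sample variance and the biased sample standard deviation can be seen in a small simulation. This is a sketch under assumed values (Normal population with σ = 2, samples of size 5), not a result from the original slides.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable
sigma, n, repeats = 2.0, 5, 20000  # assumed illustrative values

variances = []
std_devs = []
for _ in range(repeats):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    variances.append(statistics.variance(sample))  # divides by n - 1
    std_devs.append(statistics.stdev(sample))      # square root of the sample variance

mean_s2 = statistics.fmean(variances)
mean_s = statistics.fmean(std_devs)

# The average of s^2 should sit near sigma^2 = 4, while the average of s
# should fall noticeably below sigma = 2 for such a small sample size.
print(mean_s2, sigma ** 2)
print(mean_s, sigma)
```

The shortfall in the average of s comes from taking a square root (a concave function) of an unbiased quantity; it shrinks as n grows.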


Confidence Intervals

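Consider the sample average, which is (at least approximately) Normally distributed:

$$\bar{X} \sim N(\mu_X,\ \sigma_X^2/n)$$

Here “~” reads “is distributed as”, and $N(\mu_X, \sigma_X^2/n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2/n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$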

Confidence Intervals

Recall the distribution for the standard Normal, and consider Z -

$$P(-1.96 < Z < 1.96) = 0.95$$

$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} < 1.96\right) = 0.95$$

$$\Leftrightarrow\ P\left(\mu_X - 1.96\,\sigma_X/\sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$

Confidence Intervals

Rearrange this last statement to obtain:

$$P\left(\bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$

The endpoints $\bar{X} \pm 1.96\,\sigma_X/\sqrt{n}$ are RANDOM; the mean $\mu_X$ is NOT random.

Interpretation -

• limits of the interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat BUT 95% of the time, the interval would contain the true value of the mean
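
The "95% of the time" interpretation can be illustrated by repeating the experiment many times in simulation. The sketch below assumes a Normal process with mean 50 and known standard deviation 3 (made-up values) and counts how often the interval with random endpoints contains the fixed true mean.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable
mu, sigma, n, repeats = 50.0, 3.0, 10, 10000  # assumed illustrative values
half_width = 1.96 * sigma / math.sqrt(n)      # known-variance half width

covered = 0
for _ in range(repeats):
    # Each repetition gives a new sample average, hence new (random) endpoints.
    x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    if x_bar - half_width < mu < x_bar + half_width:
        covered += 1

coverage = covered / repeats
print(coverage)  # should be close to 0.95
```

The true mean never moves in this experiment; only the intervals do, which is the sense in which the endpoints are random and the mean is not.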

Confidence Intervals
• this interval DOES NOT imply that the mean $\mu_X$ is uncertain

Picture - sequence of intervals associated with repeated experimentation, with the true value of the mean marked

Confidence Intervals

General result for mean -

100(1-α)% confidence interval given by:

$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$

where -

• $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• value obtained from tables
• 95% - value is 1.96 - approximately 2
• 99% - value is 2.576 - approximately 2.58

Confidence Intervals

General Approach

• form a quantity with a known distribution that depends on the parameter of interest

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$$

• form a probability statement - choose fences (limits) with a known probability

$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} < 1.96\right) = 0.95$$

• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest

$$P\left(\bar{X} - 1.96\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X/\sqrt{n}\right) = 0.95$$

Confidence Intervals for Mean

When population variance is “known”, 100(1-α)% confidence interval is -

$$\bar{X} - z_{\alpha/2}\,\sigma_X/\sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X/\sqrt{n}$$

Known variance -

• knowledge of variance when process has been operating steadily for long period of time
• on basis of extensive operating experience
• “large number of data points”

Confidence Intervals for Mean

What if variance is unknown?

• Estimate using sample variance $s^2$

Follow previous approach by forming standardized quantity:

$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$

• issue - $s^2$ is a statistic itself, and is a random variable
• this quantity no longer has a standard Normal distribution

Solution -

• what is the probability distribution of this quantity, when data are Normally distributed?

Student’s t Distribution

When the data are Normally distributed,

$$\frac{\bar{X} - \mu_X}{s_X/\sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom -

• number of statistically independent pieces of information used to compute sample variance
• recall that in $s^2$, we divide by n-1 where n is the number of data points

Student’s t Distribution

… has a shape similar to that of Normal distribution

• symmetric
• values are available in tables
• extra parameter in tables - degrees of freedom

[Figure: Student’s t distribution with 3 degrees of freedom]
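
Why the Normal fences no longer apply when the variance is estimated can be seen by simulation. This sketch assumes a standard Normal population and samples of size 4 (so n-1 = 3 degrees of freedom); it compares how often the known-variance and estimated-variance standardized quantities fall outside ±1.96.

```python
import math
import random
import statistics

rng = random.Random(4)  # fixed seed so the run is repeatable
mu, sigma, n, repeats = 0.0, 1.0, 4, 20000  # assumed illustrative values

outside_z = 0
outside_t = 0
for _ in range(repeats):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    z = (x_bar - mu) / (sigma / math.sqrt(n))  # variance known
    t = (x_bar - mu) / (s / math.sqrt(n))      # variance estimated
    if abs(z) > 1.96:
        outside_z += 1
    if abs(t) > 1.96:
        outside_t += 1

frac_z = outside_z / repeats
frac_t = outside_t / repeats

# frac_z should be near 0.05; frac_t should be clearly larger, reflecting the
# heavier tails of the Student's t distribution at few degrees of freedom.
print(frac_z, frac_t)
```

The similar shape but heavier tails mean the t fences must sit further out than 1.96 to capture the same 95% probability.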

Confidence Intervals for Mean

Variance Unknown

• estimated using sample variance
• 100(1-α)% case

$$\bar{X} - t_{\nu,\alpha/2}\,s_X/\sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\,s_X/\sqrt{n}$$

• ν is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
• obtained following identical argument used in the known variance case

Example #1

Conversion in a chemical reactor using new catalyst preparation

• data collected, average conversion computed using 10 data points is 76.1%
• prior operating history indicates that variance of conversion is 4.41 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%

Example #1
• Confidence interval - 95%
• upper tail area is 2.5%, so $z_{0.025} = 1.96$
• standard devn = sqrt(4.41) = 2.1
• confidence interval

$$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10}$$

$$\Rightarrow\ 74.8 < \mu < 77.4$$

• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
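
The arithmetic in Example #1 can be reproduced in a few lines. This is a minimal sketch: the function name `z_interval` is hypothetical, and the fence is taken from Python's `statistics.NormalDist` rather than from printed tables.

```python
import math
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when the variance is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper-tail fence, about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Values from Example #1: average 76.1 %, known variance 4.41, 10 data points.
lower, upper = z_interval(x_bar=76.1, sigma=math.sqrt(4.41), n=10)
print(round(lower, 1), round(upper, 1))

# The current conversion of 70 % lies outside the interval.
print(lower < 70.0 < upper)
```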

Example #2

Conversion in a chemical reactor using new catalyst preparation

• data collected, average conversion computed using 10 data points is 76.1%
• current data set of 10 points used to estimate sample variance, which is 5.3 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different from current conversion, which is known to be 70%

Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9
• upper tail area is 2.5%, so $t_{9,0.025} = 2.262$
• standard devn = sqrt(5.3) = 2.3
• confidence interval

$$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10}$$

$$\Rightarrow\ 74.5 < \mu < 77.7$$

• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
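
Example #2 follows the same pattern with a t fence in place of the Normal one. In this sketch the function name `t_interval` is hypothetical, and the critical value 2.262 (9 degrees of freedom, 2.5% upper tail) is typed in from the example rather than computed, since the Python standard library has no Student's t quantile function.

```python
import math


def t_interval(x_bar, s, n, t_critical):
    """Two-sided confidence interval for the mean when the variance is estimated.

    t_critical must be looked up for n - 1 degrees of freedom.
    """
    half_width = t_critical * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Values from Example #2: average 76.1 %, sample variance 5.3, 10 data points.
lower, upper = t_interval(x_bar=76.1, s=math.sqrt(5.3), n=10, t_critical=2.262)
print(lower, upper)  # roughly 74.45 and 77.75

# Wider than the known-variance interval of Example #1, but 70 % is still outside.
print(lower < 70.0 < upper)
```

The interval is wider than in Example #1 for two reasons: the estimated variance is larger, and the t fence (2.262) sits further out than the Normal fence (1.96).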

Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

• when data are Normally distributed, sample variance is the sum of squared Normal random variables

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

• squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry

Chi-squared distribution
• is the distribution of a squared standard Normal random variable

$$Z^2 \sim \chi^2_1$$

• Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared
• e.g.,

$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$

• 3 degrees of freedom

[Figure: Chi-squared distribution with 3 degrees of freedom]
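
The construction of a Chi-squared random variable can be mimicked directly by squaring and summing standard Normal draws. The sketch below does this for 3 degrees of freedom and compares the results with the known mean (equal to the degrees of freedom) and variance (twice the degrees of freedom) of the Chi-squared distribution.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the run is repeatable
dof, repeats = 3, 20000

# Each draw is Z1^2 + Z2^2 + Z3^2 with independent standard Normal Z's.
draws = [
    sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
    for _ in range(repeats)
]

mean_chi2 = statistics.fmean(draws)
var_chi2 = statistics.pvariance(draws)
median_chi2 = statistics.median(draws)

print(mean_chi2, var_chi2)     # should be near 3 and 6
print(median_chi2, mean_chi2)  # median below mean: the distribution is skewed right
```

The gap between median and mean is the asymmetry produced by "folding over" the negative values.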

Sampling distribution - sample variance

Sample variance

• is the sum of n squared Normal random variables BUT the terms are squared deviations from the sample average
• given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
• sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1

$$s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}$$

Confidence Intervals - Sample Variance
• Form probability statement

$$P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1 - \alpha$$

• Re-arrange statement

$$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1 - \alpha$$

• 100(1-α)% interval is

$$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$

Confidence Limits for Variance

Notes

1) the tail areas are equal

• symmetric tail areas

however the interval can be asymmetric

• consequence of asymmetry of Chi-squared distribution

[Figure: Chi-squared distribution with equal tail areas marked]

2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom

Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymer reactor -

• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly better?
• i.e., is variance under new operation significantly lower?

Variance Confidence Intervals - Example

Use confidence interval for variance

• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval (α = 0.05)
• from tables:

$$\chi^2_{9,\,1-0.025} = 2.7, \qquad \chi^2_{9,\,0.025} = 19.0$$

• interval for variance:

$$1.52 < \sigma^2 < 10.67$$

• conclusion - the interval contains the previous variance of 4.7, so the variance reduction isn’t significant after background variation in sample variance computation is taken into account
• note that interval isn’t symmetric
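
The variance interval in this example is a direct application of the re-arranged probability statement. In the sketch below the function name `variance_interval` is hypothetical, and the two Chi-squared values (19.0 and 2.7 for 9 degrees of freedom) are typed in from the tables quoted in the example rather than computed.

```python
def variance_interval(s2, n, chi2_upper, chi2_lower):
    """Two-sided confidence interval for the variance.

    chi2_upper: Chi-squared value with upper tail area alpha/2 (the larger fence).
    chi2_lower: Chi-squared value with upper tail area 1 - alpha/2 (the smaller fence).
    Both must be looked up for n - 1 degrees of freedom.
    """
    dof = n - 1
    # Dividing by the larger fence gives the lower limit, and vice versa.
    return dof * s2 / chi2_upper, dof * s2 / chi2_lower


# Values from the example: sample variance 3.2 from 10 data points.
lower, upper = variance_interval(s2=3.2, n=10, chi2_upper=19.0, chi2_lower=2.7)
print(lower, upper)  # roughly 1.52 and 10.67

# The previous variance of 4.7 lies inside the interval.
print(lower < 4.7 < upper)
```

The sample variance 3.2 sits much closer to the lower limit than to the upper one, which is the asymmetry noted in the example.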

Variance Confidence Intervals - Example

Comment

• the variance interval is sensitive to degrees of freedom
• need larger number of data points to obtain precise estimate
• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

$$2.04 < \sigma^2 < 5.71$$

• cf. previous interval with 10 data points

$$1.52 < \sigma^2 < 10.67$$

Conclusion still doesn’t change, however.