Characterizing variability and comparing patterns from data

“Statistics”

Module 3


Outline

  • random samples

  • notion of a statistic

  • estimating the mean - sample average

  • assessing the impact of variation on estimates - sampling distribution

  • estimating variance - sample variance and standard deviation

  • making decisions - comparisons of means, variances using confidence intervals, hypothesis tests

J. McLellan


Random Samples

Scenario -

  • we have an underlying pattern of variability for a process which we would like to characterize -- the population

  • we perform a series of experiments on the process in such a way that the results are independent - outcome of one experiment has no influence on any other experiment

  • the underlying distribution in place during each experimental run is identical to that of the population

  • when we run each experiment, we are collecting a value from the random variable $X_i$ - which has uncertainty

  • $X_i$ represents the “i-th” act of sampling - referred to as a sample random variable


Definition - Random Sample

A random sample of size “n” of a population random variable X is a collection of random variables $X_1, \ldots, X_n$ such that

  • the $X_i$’s are independent

  • the $X_i$’s have distributions identical to that of X, i.e.,

    $$F_{X_i}(x) = F_X(x)$$

    Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.

    What do we do with these sample values?...


Sample Average

  • used to estimate the mean

  • given “n” samples, $X_1, \ldots, X_n$, compute

    $$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

  • interpretation - a rule for computing the sample average, involving sampling

  • $\bar{X}$ is a random variable

  • observed value

    $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Lower case is used to denote observed values of the sample random variables and average.


Statistics

  • Sample average is an example of a “statistic”

    Definition

    A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.

    • e.g., the sample average $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ estimates the mean $\mu$ and doesn’t depend on unknown parameters


Sampling Distribution

A statistic is a random variable, with its own probability distribution

  • distribution arises from probability distribution of underlying population, via the sample random variables

  • distribution of the statistic is called the sampling distribution

  • characteristics of the sampling distribution depend on:

    • the form of the statistic - e.g., linear function of the sample random variables

    • the distribution of the underlying population



Sampling Distribution for the Sample Average

  • determine the mean and variance of the sample average

    Mean

    $$E\{\bar{X}\} = E\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i \right\} = \frac{1}{n} E\left\{ \sum_{i=1}^{n} X_i \right\}$$

    $$= \frac{1}{n} \sum_{i=1}^{n} E\{X_i\} = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$

The expectation moves inside the sum by linearity, and each $E\{X_i\} = \mu$ because the sample random variables have the same distribution as the population.

Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.


Sampling Distribution for the Sample Average

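Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)$$

$$= \frac{1}{n^2} \mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i)$$

$$= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

The variance of the sum equals the sum of the variances because the sample random variables are independent.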


Aside - Variance

If X and Y are independent random variables, and “a” and “b” are constants, then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$


Variance of Sample Average

Interpretation

  • variance of sample average is $\sigma^2 / n$

    • as n becomes larger, variance of sample average becomes smaller

    • as more data is used, estimate becomes more precise

    • sample average represents a concentration of information
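The shrinking variance can be checked numerically. The sketch below is not part of the original slides: it assumes a Normal population with sigma = 2 (an arbitrary choice), and the helper name `simulate_sample_averages` is hypothetical. It repeats the act of sampling many times and compares the observed variance of the sample averages with sigma^2 / n.

```python
import random
import statistics


def simulate_sample_averages(n, n_repeats=5000, mu=0.0, sigma=2.0, seed=0):
    """Return n_repeats observed sample averages, each computed from n Normal draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(n_repeats)
    ]


sigma = 2.0  # assumed population standard deviation
for n in (1, 4, 16, 64):
    averages = simulate_sample_averages(n, sigma=sigma)
    observed = statistics.pvariance(averages)
    print(f"n={n:3d}  observed Var={observed:.4f}  sigma^2/n={sigma ** 2 / n:.4f}")
```

The two printed columns should agree closely, and both fall by a factor of four each time n is quadrupled.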



Distribution of the Sample Average

  • in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)

  • Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large

    • even if underlying population is non-Normal

    • important consequences for comparing values - hypothesis tests and confidence limits
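A small simulation can illustrate the Central Limit Theorem statement above. This is a sketch under stated assumptions, not material from the slides: the population is taken to be exponential with rate 1 (strongly skewed), and skewness is used as a rough indicator of how far the distribution of the sample average is from the symmetric Normal shape. The helper names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment; it is 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def sample_averages(n, n_repeats=5000, seed=0):
    """Observed sample averages of n draws from an exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_repeats)
    ]


for n in (1, 5, 50):
    print(f"n={n:2d}  skewness of sample average = {skewness(sample_averages(n)):.2f}")
```

The skewness should move toward zero as n grows, even though the underlying population is far from Normal.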


Sample Variance

The variance is estimated using the following statistic:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Observed value:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Mean of the sample variance:

$$E\{s^2\} = \sigma^2$$

Sample variance is an UNBIASED estimator of variance.


Sample Standard Deviation

… is simply the square root of the sample variance

BUT

  • sample standard deviation is a biased estimator of population standard deviation

    • its value on average does not equal the population value:

      $$E\{s\} \neq \sigma$$
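The difference between the unbiased sample variance and the biased sample standard deviation can be seen in a quick simulation. This is an illustrative sketch only: it assumes a Normal population with sigma = 1 and a small sample size of n = 5 (both arbitrary choices), and the function name `average_s2_and_s` is hypothetical.

```python
import random
import statistics


def average_s2_and_s(n=5, sigma=1.0, n_repeats=20000, seed=0):
    """Average the observed s^2 and s over many repeated samples of size n."""
    rng = random.Random(seed)
    s2_values = []
    for _ in range(n_repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2_values.append(statistics.variance(sample))  # divides by n - 1
    mean_s2 = statistics.fmean(s2_values)
    mean_s = statistics.fmean([v ** 0.5 for v in s2_values])
    return mean_s2, mean_s


mean_s2, mean_s = average_s2_and_s()
print(f"average s^2 = {mean_s2:.3f}  (population variance is 1)")
print(f"average s   = {mean_s:.3f}  (population standard deviation is 1)")
```

The average of s^2 should sit close to 1, while the average of s should come out noticeably below 1 for a sample this small.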



Confidence Intervals

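Consider the sample average:

$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$

Here “~” reads “is distributed as”, and $N(\mu_X, \sigma_X^2 / n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2 / n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$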


Confidence Intervals

[Figure: distribution for the standard Normal]

Start with -

$$P(-1.96 < Z < 1.96) = 0.95$$

and consider Z -

$$P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95$$

$$\Leftrightarrow \quad P\left( \mu_X - 1.96\,\sigma_X / \sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X / \sqrt{n} \right) = 0.95$$


Confidence Intervals

Rearrange this last statement to obtain:

$$P\left( \bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n} \right) = 0.95$$

The two limits of the interval are RANDOM; the mean $\mu_X$ between them is NOT random.

Interpretation -

  • limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean


Confidence Intervals

  • this interval DOES NOT imply that the mean $\mu_X$ is uncertain

    [Figure: sequence of intervals associated with repeated experimentation, with the true value of the mean marked]
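The picture of repeated intervals can be mimicked with a simulation. The sketch below is an illustration, not part of the slides: it assumes a Normal population with known standard deviation, uses arbitrary values for the true mean and sigma, and counts how often the interval computed from each repeated experiment contains the fixed true mean. The function name `coverage` is hypothetical.

```python
import math
import random
import statistics


def coverage(n=10, mu=70.0, sigma=2.1, n_repeats=10000, z=1.96, seed=0):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # fixed, because sigma is assumed known
    hits = 0
    for _ in range(n_repeats):
        xbar = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        # The endpoints move from run to run; mu stays where it is.
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / n_repeats


print(f"fraction of intervals containing the true mean: {coverage():.3f}")
```

The printed fraction should be close to 0.95: the mean never changes, only the interval endpoints do.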


Confidence Intervals

General result for mean -

$100(1-\alpha)\%$ confidence interval given by:

$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$

where -

  • $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$

  • value obtained from tables

    • 95% - value is 1.96 - approximately 2

    • 99% - value is 2.576
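As an alternative to tables, the fence values can be computed. This sketch uses `statistics.NormalDist` from the Python standard library; the function name `z_fence` is hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist


def z_fence(confidence):
    """Value z with upper tail area alpha/2 under the standard Normal."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_fence(level):.3f}")
```

For 95% and 99% this gives approximately 1.960 and 2.576.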


Confidence Intervals

General Approach

  • form a quantity with a known distribution that depends on the parameter of interest

    $$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$

  • form a probability statement - choose fences (limits) with a known probability

    $$P\left( -1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96 \right) = 0.95$$

  • re-arrange statement to obtain an interval specifying a range of values for the parameter of interest

    $$P\left( \bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n} \right) = 0.95$$


Confidence Intervals for Mean

When population variance is “known”, the $100(1-\alpha)\%$ confidence interval is -

$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$

Known variance -

  • knowledge of variance when process has been operating steadily for long period of time

  • on basis of extensive operating experience

  • “large number of data points”


Confidence Intervals for Mean

What if variance is unknown?

  • Estimate using sample variance $s^2$

    Follow previous approach by forming standardized quantity:

    $$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

  • issue - $s^2$ is a statistic itself, and is a random variable

  • this quantity no longer has a standard Normal distribution

    Solution -

  • what is the probability distribution of this quantity, when data are Normally distributed?


Student’s t Distribution

When the data are Normally distributed,

$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom -

  • number of statistically independent pieces of information used to compute sample variance

  • recall that in $s^2$, we divide by n-1 where n is the number of data points


Student’s t Distribution

… has a shape similar to that of Normal distribution

  • symmetric

  • values are available in tables

  • extra parameter in tables - degrees of freedom

[Figure: Student’s t distribution with 3 degrees of freedom]


Confidence Intervals for Mean

Variance Unknown

  • estimated using sample variance

  • $100(1-\alpha)\%$ case

    $$\bar{X} - t_{\nu,\alpha/2}\, s_X / \sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\, s_X / \sqrt{n}$$

  • $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)

  • obtained following identical argument used in the known variance case


Example #1

Conversion in a chemical reactor using new catalyst preparation

  • data collected, average conversion computed using 10 data points is 76.1%

  • prior operating history indicates that variance of conversion is 4.41 %²

  • determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%



Example #1

  • Confidence interval - 95%

    • upper tail area is 2.5%, so $z_{\alpha/2} = 1.96$

    • standard devn = sqrt(4.41) = 2.1

    • confidence interval

      $$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10}$$

      $$\Rightarrow \quad 74.8 < \mu < 77.4$$

    • conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
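The arithmetic in Example #1 can be reproduced in a few lines. This is a sketch of the known-variance interval using the numbers from the example; the function name `mean_ci_known_variance` is hypothetical, and the fence value is computed with `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist


def mean_ci_known_variance(xbar, variance, n, confidence=0.95):
    """Confidence interval for the mean when the population variance is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(variance) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = mean_ci_known_variance(xbar=76.1, variance=4.41, n=10)
current_conversion = 70.0
print(f"{lower:.1f} < mu < {upper:.1f}")
print("contains current conversion:", lower < current_conversion < upper)
```

With these inputs the limits round to 74.8 and 77.4, and 70 lies outside the interval, matching the conclusion in the example.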


Example #2

Conversion in a chemical reactor using new catalyst preparation

  • data collected, average conversion computed using 10 data points is 76.1%

  • current data set of 10 points used to estimate sample variance, which is 5.3 %²

  • determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%



Example #2

  • Confidence interval - 95%

    • variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9

    • upper tail area is 2.5%, so $t_{9,\,0.025} = 2.262$

    • standard devn = sqrt(5.3) = 2.3

    • confidence interval

      $$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10}$$

      $$\Rightarrow \quad 74.5 < \mu < 77.7$$

    • conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
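Example #2 follows the same pattern with the t fence in place of the z fence. The sketch below takes the critical value 2.262 (9 degrees of freedom, upper tail area 2.5%) directly from the example, since the Python standard library has no Student's t quantile function; the function name `mean_ci_unknown_variance` is hypothetical.

```python
import math


def mean_ci_unknown_variance(xbar, sample_variance, n, t_crit):
    """Confidence interval for the mean using the sample variance and a t fence.

    t_crit is the table value for n - 1 degrees of freedom and upper tail area alpha/2.
    """
    half_width = t_crit * math.sqrt(sample_variance) / math.sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = mean_ci_unknown_variance(xbar=76.1, sample_variance=5.3, n=10, t_crit=2.262)
print(f"{lower:.1f} < mu < {upper:.1f}")
```

The limits round to 74.5 and 77.7. The interval is wider than in Example #1 because the t fence (2.262) is larger than the Normal fence (1.96), reflecting the extra uncertainty from estimating the variance.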


Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

  • when data are Normally distributed, sample variance is the sum of squared Normal random variables

    • squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry


Chi-squared distribution

  • is the distribution of a squared standard Normal random variable

    $$Z^2 \sim \chi^2_1$$

    • Chi-squared random variable with 1 degree of freedom

    • degrees of freedom = number of independent standard Normal random variables being squared and summed

    • e.g.,

      $$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$

      • 3 degrees of freedom

[Figure: Chi-squared distribution with 3 degrees of freedom]


Sampling Distribution - Sample Variance

Sample variance

  • is the sum of n squared Normal random variables BUT the terms being summed are squared deviations from the sample average

  • given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)

  • sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1

    $$s^2 \sim \frac{\sigma^2}{n-1}\, \chi^2_{n-1}$$


Confidence Intervals - Sample Variance

  • Form probability statement

    $$P\left( \chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2} \right) = 1 - \alpha$$

  • Re-arrange statement

    $$P\left( \frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}} \right) = 1 - \alpha$$

  • $100(1-\alpha)\%$ interval is

    $$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$


Confidence Limits for Variance

Notes

1) the tail areas are equal

  • symmetric tail areas

    however the interval can be asymmetric

  • consequence of asymmetry of Chi-squared distribution

2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of $1-\alpha/2$ and n-1 degrees of freedom

[Figure: Chi-squared distribution with equal tail areas marked]


Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymer reactor -

  • variance under previous operation was 4.7 °C²

  • under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²

  • is the variance under the new control operation significantly better?

    • i.e., is variance under new operation significantly lower?



Variance Confidence Intervals - Example

Use confidence interval for variance

  • n-1 = 10-1 = 9 degrees of freedom

  • form 95% confidence interval ($\alpha = 0.05$)

  • from tables:

    $$\chi^2_{9,\,1-0.025} = 2.7, \qquad \chi^2_{9,\,0.025} = 19.0$$

  • interval for variance:

    $$1.52 < \sigma^2 < 10.67$$

  • conclusion - interval contains the previous variance of 4.7, so the variance reduction isn’t significant after background variation in sample variance computation is taken into account

  • note that interval isn’t symmetric
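The interval in this example can be reproduced directly from the formula. The sketch below takes the Chi-squared table values 2.7 and 19.0 from the example as inputs, since the Python standard library has no Chi-squared quantile function; the function name `variance_ci` and its parameter names are hypothetical.

```python
def variance_ci(sample_variance, n, chi2_lower, chi2_upper):
    """Confidence interval for the variance.

    chi2_lower is the table value with upper tail area 1 - alpha/2,
    chi2_upper is the table value with upper tail area alpha/2,
    both for n - 1 degrees of freedom.
    """
    dof = n - 1
    return dof * sample_variance / chi2_upper, dof * sample_variance / chi2_lower


lower, upper = variance_ci(sample_variance=3.2, n=10, chi2_lower=2.7, chi2_upper=19.0)
previous_variance = 4.7
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
print("contains previous variance:", lower < previous_variance < upper)
```

The limits come out as 1.52 and 10.67. The larger table value produces the lower limit because it appears in the denominator.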


Variance Confidence Intervals - Example

Comment

  • variance estimate is sensitive to degrees of freedom

    • need larger number of data points to obtain precise estimate

    • e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

      $$2.04 < \sigma^2 < 5.71$$

    • cf. previous interval with 10 data points

      $$1.52 < \sigma^2 < 10.67$$

Conclusion still doesn’t change, however.

