Characterizing Variability and Comparing Patterns from Data

“Statistics”

Module 3


Outline

  • random samples

  • notion of a statistic

  • estimating the mean - sample average

  • assessing the impact of variation on estimates - sampling distribution

  • estimating variance - sample variance and standard deviation

  • making decisions - comparisons of means, variances using confidence intervals, hypothesis tests

J. McLellan


Random Samples

Scenario -

  • we have an underlying pattern of variability for a process which we would like to characterize -- the population

  • we perform a series of experiments on the process in such a way that the results are independent - outcome of one experiment has no influence on any other experiment

  • the underlying distribution in place during each experimental run is identical to that of the population

  • when we run each experiment, we are collecting a value from the random variable Xi - which has uncertainty

  • Xi represents the “i-th” act of sampling - referred to as a sample random variable


Definition - Random Sample

A random sample of size “n” of a population random variable is a collection of random variables X1, … Xn such that

  • the Xi’s are independent

  • the Xi’s have distributions identical to that of X, i.e.,

    $$F_{X_i}(x) = F_X(x)$$

    Each Xi represents a snapshot of the process. The Xi’s are referred to as sample random variables.

    What do we do with these sample values?...


Sample Average

  • used to estimate the mean

  • given “n” samples, X1, …, Xn, compute

    $$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

  • interpretation - a rule for computing the sample average, involving sampling

  • $\bar{X}$ is a random variable

  • observed value

    $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

    Lower case is used to denote observed values of the sample random variables and average.
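The distinction between the rule (upper case) and an observed value (lower case) can be sketched in a few lines of Python. This is only an illustration: the population (Normal, mean 70, standard deviation 2), the seed and the helper name `sample_average` are assumptions and do not come from the slides.

```python
import random

def sample_average(values):
    """Observed sample average: x_bar = (1/n) * sum of the x_i."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    return sum(values) / n

# Hypothetical population, chosen only for illustration.
rng = random.Random(0)
observations = [rng.gauss(70.0, 2.0) for _ in range(10)]  # one random sample, n = 10

x_bar = sample_average(observations)
print(f"observed sample average: {x_bar:.2f}")
```

Running the sampling step again with a different seed gives a different observed value, which is what it means for the sample average to be a random variable.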


Statistics

  • Sample average is an example of a “statistic”

    Definition

    A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.

    • e.g., the sample average $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ estimates the mean $\mu$ and doesn’t depend on unknown parameters


Sampling Distribution

A statistic is a random variable, with its own probability distribution

  • distribution arises from probability distribution of underlying population, via the sample random variables

  • distribution of the statistic is called the sampling distribution

  • characteristics of the sampling distribution depend on:

    • the form of the statistic - e.g., linear function of the sample random variables

    • the distribution of the underlying population


Sampling Distribution for the Sample Average

  • determine the mean and variance of the sample average

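    Mean

    $$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n} E\left\{\sum_{i=1}^{n} X_i\right\}$$

    $$= \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$

    The value expected on average of the sample average is the true mean of the process - the sample average is an UNBIASED estimator for the mean.

    The expectation moves inside the sum by linearity, and $E\{X_i\} = \mu$ for every i because the sample random variables have the same distribution as the population (independence is not needed for this step).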


Sampling Distribution for the Sample Average

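Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)$$

$$= \frac{1}{n^2}\,\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)$$

$$= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$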


Aside - Variance

If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$


Variance of Sample Average

Interpretation

  • variance of sample average is $\sigma^2 / n$

    • as n becomes larger, variance of sample average becomes smaller

    • as more data is used, estimate becomes more precise

    • sample average represents a concentration of information
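A small simulation can make the $\sigma^2 / n$ result concrete. The sketch below is illustrative only: the Normal population with $\sigma = 2$, the sample sizes, the number of repeats and the function name are all assumptions, not values from the slides.

```python
import random
import statistics

def variance_of_sample_average(n, sigma, repeats, rng):
    """Draw `repeats` samples of size n and return the variance of their averages."""
    averages = []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        averages.append(sum(sample) / n)
    return statistics.pvariance(averages)

rng = random.Random(1)
sigma = 2.0  # assumed population standard deviation
results = {n: variance_of_sample_average(n, sigma, 5000, rng) for n in (5, 20, 80)}

for n, simulated in results.items():
    print(f"n={n:3d}  simulated={simulated:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```

Each fourfold increase in n should cut the simulated variance by roughly a factor of four, matching the formula up to simulation noise.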


Distribution of the Sample Average

  • in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)

  • Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large

    • even if underlying population is non-Normal

    • important consequences for comparing values - hypothesis tests and confidence limits
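The Central Limit Theorem statement can be explored numerically. The sketch below assumes an Exponential population with mean 1 and standard deviation 1 (a strongly skewed, non-Normal choice made here for illustration) and standardizes the sample average for n = 1 and n = 50. For a Normal quantity, half of the standardized values fall below zero and about 95% fall within ±1.96.

```python
import math
import random

def standardized_averages(n, repeats, rng):
    """Standardized sample averages from an Exponential(1) population."""
    z_values = []
    for _ in range(repeats):
        x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        z_values.append((x_bar - 1.0) / (1.0 / math.sqrt(n)))
    return z_values

rng = random.Random(2)
summary = {}
for n in (1, 50):
    z = standardized_averages(n, 10000, rng)
    below = sum(v < 0 for v in z) / len(z)           # 0.5 for a Normal
    inside = sum(abs(v) < 1.96 for v in z) / len(z)  # 0.95 for a Normal
    summary[n] = (below, inside)
    print(f"n={n:2d}  fraction below mean={below:.3f}  fraction within 1.96={inside:.3f}")
```

With n = 1 the skewness of the population is plain (well over half of the values fall below the mean); with n = 50 the fraction below the mean is much closer to one half.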


Sample Variance

… is estimated using the following statistic:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

Observed value:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Mean of the sample variance:

$$E\{s^2\} = \sigma^2$$

Sample variance is an UNBIASED estimator of variance.
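As a minimal sketch, the observed sample variance can be computed directly from the formula and compared with Python's `statistics.variance`, which also uses the n-1 divisor. The data values and the function name are hypothetical and serve only as an illustration.

```python
import statistics

def sample_variance(values):
    """Sample variance with the n-1 divisor: sum((x_i - x_bar)^2) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [74.2, 77.9, 75.1, 76.8, 73.5]  # hypothetical readings
s2 = sample_variance(data)

print(f"by formula:          {s2:.4f}")
print(f"statistics.variance: {statistics.variance(data):.4f}")
```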


Sample Standard Deviation

… is simply the square root of the sample variance

BUT

  • sample standard deviation is a biased estimator of population standard deviation

    • value on average does not tend to population value

$$E\{s\} \neq \sigma$$
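The bias can be seen in a simulation. The sketch below assumes Normal data with $\sigma = 2$ and samples of size 5 (illustrative choices, not from the slides) and averages both s and s² over many repeated samples.

```python
import math
import random
import statistics

def average_s_and_s2(n, sigma, repeats, rng):
    """Average the sample standard deviation and the sample variance over many samples."""
    s_values, s2_values = [], []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # n-1 divisor
        s2_values.append(s2)
        s_values.append(math.sqrt(s2))
    return statistics.fmean(s_values), statistics.fmean(s2_values)

rng = random.Random(3)
mean_s, mean_s2 = average_s_and_s2(n=5, sigma=2.0, repeats=10000, rng=rng)

print(f"average of s^2: {mean_s2:.3f}  (sigma^2 = 4)")
print(f"average of s:   {mean_s:.3f}  (sigma   = 2)")
```

The average of s² should sit close to 4, while the average of s should fall noticeably below 2.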


Confidence Intervals

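Consider the sample average

$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$

where “$\sim$” reads “is distributed as” and $N(\mu_X, \sigma_X^2 / n)$ reads “Normally distributed with mean $\mu_X$ and variance $\sigma_X^2 / n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$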


Confidence Intervals

Distribution for standard normal:

Start with -

$$P(-1.96 < Z < 1.96) = 0.95$$

and consider Z -

$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$

$$\Leftrightarrow\quad P\left(\mu_X - 1.96\,\sigma_X / \sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$


Confidence Intervals

Rearrange this last statement to obtain:

$$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$

The two endpoints are RANDOM; the mean $\mu_X$ between them is NOT random.

Interpretation -

  • limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat BUT 95% of the time, the interval would contain the true value of the mean
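The "95% of the time" interpretation can be checked by repeating the experiment many times in simulation. In this sketch the true mean (70), the standard deviation (2.1), the sample size (10) and the function name are assumed values chosen for illustration.

```python
import math
import random

def interval_covers_mean(mu, sigma, n, rng):
    """Draw one sample of size n, build the 95% interval, report whether it contains mu."""
    x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    half_width = 1.96 * sigma / math.sqrt(n)
    return x_bar - half_width < mu < x_bar + half_width

rng = random.Random(4)
trials = 10000
hits = sum(interval_covers_mean(70.0, 2.1, 10, rng) for _ in range(trials))
coverage = hits / trials

print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The endpoints move from trial to trial while the true mean stays fixed, and the fraction of intervals that contain it should come out near 0.95.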


Confidence Intervals

  • this interval DOES NOT imply that the mean $\mu_X$ is uncertain

    Picture - sequence of intervals associated with repeated experimentation, shown against the true value of the mean


Confidence Intervals

General result for mean -

$100(1-\alpha)\%$ confidence interval given by:

$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$

where -

  • $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$

  • value obtained from tables

    • 95% - value is 1.96 - approximately 2

    • 99% - value is 2.576
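Instead of tables, the fence can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8+). The sketch below is one possible way to wrap the general result; the function names and the numbers in the usage line are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_fence(alpha):
    """Value z with upper tail area alpha/2 under the standard Normal."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def mean_interval_known_sigma(x_bar, sigma, n, alpha=0.05):
    """100(1-alpha)% confidence interval for the mean when sigma is known."""
    half_width = z_fence(alpha) * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

print(f"95% fence: {z_fence(0.05):.3f}")
print(f"99% fence: {z_fence(0.01):.3f}")

# Hypothetical usage: average 50.0 from 25 points, known sigma of 5.0.
low, high = mean_interval_known_sigma(50.0, 5.0, 25)
print(f"{low:.2f} < mu < {high:.2f}")
```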


Confidence Intervals

General Approach

  • form a quantity with a known distribution that depends on the parameter of interest

    $$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$

  • form a probability statement - choose fences (limits) with a known probability

    $$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$

  • re-arrange statement to obtain an interval specifying a range of values for the parameter of interest

    $$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$


Confidence Intervals for Mean

When population variance is “known”, the $100(1-\alpha)\%$ confidence interval is -

$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$

Known variance -

  • knowledge of variance when process has been operating steadily for long period of time

  • on basis of extensive operating experience

  • “large number of data points”


Confidence Intervals for Mean

What if variance is unknown?

  • Estimate using sample variance $s^2$

    Follow previous approach by forming standardized quantity:

    $$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

  • issue - $s^2$ is a statistic itself, and is a random variable

  • this quantity no longer has a standard Normal distribution

    Solution -

  • what is the probability distribution of this quantity, when data are Normally distributed?


Student’s t Distribution

When the data are Normally distributed,

$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom -

  • number of statistically independent pieces of information used to compute sample variance

  • recall that in $s^2$, we divide by n-1 where n is the number of data points


Student’s t Distribution

… has a shape similar to that of Normal distribution

  • symmetric

  • values are available in tables

  • extra parameter in tables - degrees of freedom

Picture - Student’s t density with 3 degrees of freedom
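Although the shape is similar to the Normal, the tails are heavier when the degrees of freedom are few. The sketch below simulates the standardized quantity with the sample standard deviation in the denominator for samples of size 4 (3 degrees of freedom). The population, seed and function name are assumptions made for illustration; 3.182 is the standard table value for the 2.5% upper tail with 3 degrees of freedom.

```python
import math
import random
import statistics

def t_statistic(mu, sigma, n, rng):
    """(x_bar - mu) / (s / sqrt(n)) for one Normal sample of size n."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # square root of the n-1 sample variance
    return (x_bar - mu) / (s / math.sqrt(n))

rng = random.Random(5)
t_values = [t_statistic(0.0, 1.0, 4, rng) for _ in range(10000)]

beyond_normal_fence = sum(abs(t) > 1.96 for t in t_values) / len(t_values)
beyond_t_fence = sum(abs(t) > 3.182 for t in t_values) / len(t_values)

print(f"fraction outside +/-1.96:  {beyond_normal_fence:.3f}")
print(f"fraction outside +/-3.182: {beyond_t_fence:.3f}")
```

Far more than 5% of the simulated values fall outside ±1.96, which is why the Normal fence is too narrow here; the t fence of 3.182 restores roughly 5%.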


Confidence Intervals for Mean

Variance Unknown

  • estimated using sample variance

  • $100(1-\alpha)\%$ case

    $$\bar{X} - t_{\nu,\alpha/2}\, s_X / \sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\, s_X / \sqrt{n}$$

  • $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)

  • obtained following identical argument used in the known variance case


Example #1

Conversion in a chemical reactor using new catalyst preparation

  • data collected, average conversion computed using 10 data points is 76.1%

  • prior operating history indicates that variance of conversion is 4.41 %²

  • determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%


Example #1

  • Confidence interval - 95%

    • upper tail area is 2.5%, so $z_{0.025} = 1.96$

    • standard devn = sqrt(4.41) = 2.1

    • confidence interval

      $$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10}$$

      $$\Rightarrow\quad 74.8 < \mu < 77.4$$

    • conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
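The arithmetic of Example #1 can be reproduced in a few lines. All numbers come from the example; the variable names are chosen here for illustration.

```python
import math

x_bar = 76.1      # average conversion from 10 data points, %
variance = 4.41   # known variance from operating history, %^2
n = 10
z = 1.96          # fence for 95% confidence
current = 70.0    # current conversion, %

sigma = math.sqrt(variance)
half_width = z * sigma / math.sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
significant = not (lower < current < upper)

print(f"{lower:.1f} < mu < {upper:.1f}")      # 74.8 < mu < 77.4
print(f"significant change: {significant}")   # significant change: True
```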


Example #2

Conversion in a chemical reactor using new catalyst preparation

  • data collected, average conversion computed using 10 data points is 76.1%

  • current data set of 10 points used to estimate sample variance, which is 5.3 %²

  • determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%


Example #2

  • Confidence interval - 95%

    • variance UNKNOWN - need to use Student’s t distribution -- degrees of freedom = 10-1 = 9

    • upper tail area is 2.5%, so $t_{9,0.025} = 2.262$

    • standard devn = sqrt(5.3) = 2.3

    • confidence interval

      $$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10}$$

      $$\Rightarrow\quad 74.5 < \mu < 77.7$$

    • conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
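Example #2 follows the same arithmetic with the t fence in place of the Normal one. The Python standard library has no Student's t quantile function, so the sketch below simply takes the table value 2.262 used in the example; variable names are illustrative.

```python
import math

x_bar = 76.1           # average conversion from 10 data points, %
sample_variance = 5.3  # estimated from the same 10 points, %^2
n = 10
t_fence = 2.262        # t value for 9 degrees of freedom, 2.5% upper tail (from tables)
current = 70.0         # current conversion, %

s = math.sqrt(sample_variance)
half_width = t_fence * s / math.sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
significant = not (lower < current < upper)

print(f"{lower:.1f} < mu < {upper:.1f}")      # 74.5 < mu < 77.7
print(f"significant change: {significant}")   # significant change: True
```

The interval is wider than in Example #1 because the fence is larger (2.262 versus 1.96) and the estimated standard deviation is larger.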


Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

  • when data are Normally distributed, sample variance is the sum of squared Normal random variables

    • squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$


Chi-squared distribution

  • is the distribution of a squared standard Normal random variable

    $$Z^2 \sim \chi^2_1$$

    • Chi-squared random variable with 1 degree of freedom

    • degrees of freedom = number of independent standard Normal random variables being squared

    • e.g.,

      $$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$

      • 3 degrees of freedom

Picture - Chi-squared density with 3 degrees of freedom
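The definition translates directly into a simulation: square independent standard Normal draws and add them up. The sketch below uses 3 degrees of freedom; the seed, the number of draws and the function name are assumptions. It relies on the standard fact that a Chi-squared random variable with k degrees of freedom has mean k.

```python
import random
import statistics

def chi_squared_draw(dof, rng):
    """One Chi-squared value: the sum of `dof` squared standard Normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))

rng = random.Random(6)
draws = [chi_squared_draw(3, rng) for _ in range(20000)]

mean_value = statistics.fmean(draws)
median_value = statistics.median(draws)

print(f"mean:   {mean_value:.3f}  (theory: 3)")
print(f"median: {median_value:.3f}")
print(f"min:    {min(draws):.4f}  (never negative)")
```

The median falling below the mean reflects the long right tail, i.e. the asymmetry that comes from squaring.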


Sampling distribution -sample variance

Sample variance

  • is formed from a sum of n squared Normal random variables BUT the terms being squared are deviations from the sample average

  • given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)

  • sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1

    $$\frac{s^2}{\sigma^2} \sim \frac{\chi^2_{n-1}}{n-1}$$


Confidence Intervals - Sample Variance

  • Form probability statement

    $$P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1-\alpha$$

  • Re-arrange statement

    $$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1-\alpha$$

  • $100(1-\alpha)\%$ interval is

    $$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$


Confidence Limits for Variance

Notes

1) the tail areas are equal

  • symmetric tail areas

    however the interval can be asymmetric

  • consequence of asymmetry of Chi-squared distribution

2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of $1-\alpha/2$ and n-1 degrees of freedom

Picture - Chi-squared density with equal tail areas marked


Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymer reactor -

  • variance under previous operation was 4.7 °C²

  • under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²

  • is the variance under the new control operation significantly better?

    • i.e., is variance under new operation significantly lower?


Variance Confidence Intervals - Example

Use confidence interval for variance

  • n-1 = 10-1 = 9 degrees of freedom

  • form 95% confidence interval ($\alpha = 0.05$)

  • from tables:

    $$\chi^2_{9,\,1-0.025} = 2.7, \qquad \chi^2_{9,\,0.025} = 19.0$$

  • interval for variance:

    $$1.52 < \sigma^2 < 10.67$$

  • conclusion - the interval contains the previous variance of 4.7, so the variance reduction isn’t significant after background variation in sample variance computation is taken into account

  • note that interval isn’t symmetric
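The interval in this example can be checked with a short calculation. The Chi-squared fences are the table values quoted in the example (the Python standard library has no Chi-squared quantile function); the function and variable names are illustrative.

```python
def variance_interval(s2, n, chi2_low, chi2_high):
    """Interval (n-1)s^2/chi2_high < sigma^2 < (n-1)s^2/chi2_low.

    chi2_low is the fence with upper tail area 1-alpha/2,
    chi2_high the fence with upper tail area alpha/2.
    """
    dof = n - 1
    return dof * s2 / chi2_high, dof * s2 / chi2_low

s2 = 3.2         # sample variance under new operation
previous = 4.7   # variance under previous operation
lower, upper = variance_interval(s2, n=10, chi2_low=2.7, chi2_high=19.0)
significant = not (lower < previous < upper)

print(f"{lower:.2f} < sigma^2 < {upper:.2f}")     # 1.52 < sigma^2 < 10.67
print(f"significant reduction: {significant}")    # significant reduction: False
```

The sample variance of 3.2 sits much closer to the lower limit than to the upper one, which is the asymmetry noted above.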


Variance Confidence Intervals - Example

Comment

  • variance is sensitive to degrees of freedom

    • need larger number of data points to obtain precise estimate

    • e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

      $$2.04 < \sigma^2 < 5.71$$

    • cf. previous interval with 10 data points

      $$1.52 < \sigma^2 < 10.67$$

Conclusion still doesn’t change, however.
