
# Statistics: Analyzing and Comparing Data



### Statistics: Analyzing and Comparing Data

Module 3

Outline
• Estimating the true mean using sample average
• Estimating the true variance using sample variance
• Confidence intervals and hypothesis tests for means and variances

K. McAuley

Population Mean and Sample Mean

If the complete population contains N values, the average is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Often N is very large or infinite, so we collect a random sample $X_1, X_2, \ldots, X_n$

and estimate $\mu$ using the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Why is $\bar{X}$ a random variable?

What happens to the quality of the estimate as n changes?


Sample Average
• What is the difference between:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$


Definition - Random Sample

Independent random variables $X_1, X_2, \ldots, X_n$ with the same underlying distribution are called a random sample

A STATISTIC is any function of the random variables in a random sample.

• The statistics we calculate most often are the sample mean and sample variance. Parameter estimates obtained by fitting models to data are also statistics.


Statistics
• Is $\bar{X}$ the same as $\mu$?
• Is $s^2$ the same as $\sigma^2$?
• Is $\bar{X}$ a random variable? What about $\mu$? What about $s^2$?
• The probability distribution of a statistic arises from the probability distribution of the underlying population.
• The variability of $X_i$ influences the variability of $\bar{X}$
• Statisticians call the probability distribution for a statistic a sampling distribution


Sampling Distribution for the Sample Average
• Let’s determine the mean and variance of the sample average

The expected value of the sample average is the true mean of the population:

$$E[\bar{X}] = \mu$$

We say that the sample average is an UNBIASED estimator for the mean of the population.

The variance of the sample average is smaller than the variance of the underlying population:

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

Let’s do some proofs!


Variance

We use the following theorem to find the variance of $\bar{X}$.

If we have a sum of independent random variables, X and Y, with “a” and “b” constants, then

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$$


Sampling Distribution for the Sample Average

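Variance

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$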

• as n becomes larger, variance of sample average becomes smaller
• as more data are used, the estimate for the true mean becomes more precise
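
The shrinking variance of the sample average can be checked numerically. The sketch below is only an illustration: it assumes a Normal population with mean 0 and a standard deviation of 2 (values chosen arbitrarily, not taken from the slides), draws many samples of size n, and compares the empirical variance of the sample averages with the theoretical value, the population variance divided by n.

```python
import random
import statistics


def variance_of_sample_mean(n, trials=5000, sigma=2.0, seed=0):
    """Empirical variance of the average of n draws from N(0, sigma^2).

    The population (Normal, mean 0) and sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


sigma = 2.0
for n in (1, 4, 16):
    empirical = variance_of_sample_mean(n, sigma=sigma)
    theory = sigma ** 2 / n
    print(f"n={n:2d}  empirical={empirical:.3f}  sigma^2/n={theory:.3f}")
```

The empirical values should track the theoretical ones closely, with small differences due to the finite number of trials.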


Distribution of the Sample Average
• In the preceding slides, no assumption was made about the distribution of the population (e.g., normal, uniform, exponential)
• The Central Limit Theorem says that the distribution of the sample average approaches a Normal distribution as the sample size n becomes large
• Even if the underlying population is non-Normal
• When using hypothesis tests, confidence limits, and control charts, we will assume Normality for $\bar{X}$
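
One way to see the Central Limit Theorem at work is to start from a clearly non-Normal population and watch the sample average become more symmetric as n grows. The sketch below assumes an Exponential population with rate 1 (an arbitrary choice for illustration) and uses skewness as a rough measure of non-Normality; a Normal distribution has zero skewness.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the cubed std. dev."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3


def sample_means(n, trials=20000, seed=1):
    """Averages of `trials` samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


for n in (1, 5, 30):
    print(f"n={n:2d}  skewness of sample average = {skewness(sample_means(n)):.2f}")
```

The skewness should fall toward zero as n increases, which is consistent with the sample average looking more and more Normal.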


Sample Variance

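The variance of the population

$$\sigma^2 = E\left[(X - \mu)^2\right]$$

can be estimated using the sample variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

People are sloppy and use lower case $s^2$ for both the observed value and the random variable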


Sample Variance

Observed value:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

Expected value of the sample variance:

$$E[s^2] = \sigma^2$$

Sample variance is an UNBIASED estimator of population variance.


Sample Standard Deviation

Sample standard deviation is simply the square root of the sample variance
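
As a small sketch of these calculations, the code below computes the sample mean, the sample variance (dividing by n-1), and the sample standard deviation by hand, then compares them with Python's `statistics` module. The ten data values are hypothetical, invented only for illustration; they do not come from the slides.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only.
x = [75.2, 77.9, 74.8, 76.5, 78.1, 73.9, 76.0, 77.2, 75.5, 75.9]

n = len(x)
x_bar = sum(x) / n                                  # sample mean
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)   # sample variance, divisor n-1
s = math.sqrt(s2)                                   # sample standard deviation

print(f"mean = {x_bar:.3f}, variance = {s2:.3f}, std dev = {s:.3f}")

# statistics.variance and statistics.stdev also use the n-1 divisor.
print(f"library variance = {statistics.variance(x):.3f}")
print(f"library std dev  = {statistics.stdev(x):.3f}")
```

Note that `statistics.pvariance` divides by n instead, which is appropriate only when the data are the complete population.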


Outline
• Estimating the true mean using sample average
• Estimating the true variance using sample variance
• Confidence intervals and hypothesis tests for means and variances


Confidence Intervals

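Consider the sample average

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$

Here “$\sim$” reads “is distributed as” and $N(\mu, \sigma^2/n)$ reads “Normally distributed with mean $\mu$ and variance $\sigma^2/n$”.

We can standardize this to have zero mean and unit variance:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$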


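Distribution for standard normal:

$$P(-1.96 \le Z \le 1.96) = 0.95$$

If X is normally distributed:

$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

Let’s rearrange to get $\mu$ in the middle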


Confidence Intervals

Rearranging gives:

$$P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

The lower and upper limits are RANDOM; $\mu$ is NOT random.

Interpretation -

• limits of interval have uncertainty - if we get a new set of samples and re-estimate the average and re-compute the limits, the endpoints change somewhat, BUT 95% of the time the interval will contain the true value of the mean. 5% of the time the true mean will be outside the limits.
• What is the “true mean” anyway?


Confidence Intervals

Imagine repeating a set of experiments eight times and calculating confidence intervals on the mean from each set of experiments. The true mean wouldn’t move, but the confidence limits would.

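[Figure: confidence intervals from eight repeated sets of experiments, plotted against the fixed true value of the mean]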
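
The repeated-experiment picture can be imitated with a simulation. The sketch below assumes a Normal population with a true mean of 70 and a known standard deviation of 2 (illustrative values, not from the slides), repeats a 10-point experiment many times, and counts how often the 95% interval captures the true mean.

```python
import math
import random
import statistics


def coverage(mu=70.0, sigma=2.0, n=10, repeats=10000, z=1.96, seed=2):
    """Fraction of known-variance 95% intervals that contain the true mean.

    mu, sigma and n are illustrative assumptions.
    """
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)   # same width every time (sigma known)
    hits = 0
    for _ in range(repeats):
        x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval moves with x_bar; the true mean mu stays fixed.
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / repeats


print(f"fraction of intervals containing the true mean: {coverage():.3f}")
```

The fraction should land close to 0.95, which is what the 95% in "95% confidence interval" refers to.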


Confidence Intervals

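What if we want a 90% or 99% confidence interval?

The $100(1-\alpha)\%$ confidence interval is given by:

$$\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

where -

• $z_{\alpha/2}$ – “fence” value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• value obtained from tables
• For 95%, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$
• For 99%, $\alpha = 0.01$ and $z_{\alpha/2} = 2.576$
• Why do we find $z_{\alpha/2}$ instead of $z_{\alpha}$?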
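
Instead of a printed table, the fence values can be looked up with the inverse cumulative distribution function of the standard Normal. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_fence` is made up for this example.

```python
from statistics import NormalDist


def z_fence(alpha):
    """Return z_{alpha/2}: the value with upper-tail area alpha/2 under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    print(f"{confidence:.0%} interval: z = {z_fence(alpha):.3f}")
```

The tail area is split in two because the interval has a fence on each side, so each tail holds half of the total probability that falls outside the interval.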


Confidence Intervals for Mean

When population variance is “known”, the $100(1-\alpha)\%$ confidence interval is

$$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

Known variance -

• We might be comfortable assuming that we “know” the variance when the process has been operating steadily for a long period of time
• on the basis of extensive operating experience and a large number of data points

But we usually don’t know the variance!


Confidence Intervals for Mean

What if variance is unknown? This is the usual situation!

• Estimate $\sigma^2$ using the sample variance $s^2$
• Issue - $s^2$ is a random variable
• the approximate quantity $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ no longer has a standard Normal distribution

Solution -

• What is the probability distribution of $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ when data are Normally distributed?
• Student’s t distribution


Student’s t Distribution

When the data are from a Normally distributed population:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

follows a Student’s t distribution with n-1 degrees of freedom

Degrees of freedom (read p. 24 of text)

• Number of independent pieces of information used to compute sample variance.
• Recall that when we calculate $s^2$, we divide by n-1.
• One degree of freedom gets used up because we calculate $\bar{X}$ and use it to obtain $s^2$


Student’s t Distribution

… has a shape similar to that of Normal distribution but the tails are heavier

• symmetric
• Cumulative t distribution is in Table II on pg. A-4.

[Figure: Student’s t density with 3 degrees of freedom]


Confidence Intervals for Population Mean

True Variance Unknown

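• $100(1-\alpha)\%$ case

$$\bar{X} - t_{\nu,\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\nu,\alpha/2}\,\frac{s}{\sqrt{n}}$$

• $\nu$, the number of degrees of freedom, is (n-1) when n data points are used to compute sample variance.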
• Let’s do a proof.
• How do we get confidence intervals in general?


General Approach for Obtaining Confidence Intervals
• Determine a quantity with a known distribution that depends on the parameter of interest
• Write a probability statement using fences with a known probability
• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest


Example #1

Conversion in a chemical reactor using new catalyst

• Average conversion computed using 10 data points is 76.1%
• Prior operating history indicates that variance of conversion is 4.41 %²
• Determine 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different than mean conversion obtained with old catalyst, which is known to be 70%
• What assumptions will we need to make to find the answer? Do they bother you?


Example #1
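• Confidence interval - 95%
• upper tail area is 2.5%, so $z_{0.025} = 1.96$
• confidence interval

$$76.1 \pm 1.96\,\sqrt{\frac{4.41}{10}} = 76.1 \pm 1.30 \quad\Rightarrow\quad 74.80\% \le \mu \le 77.40\%$$

• conclusion - interval doesn’t contain conversion of 70% for the old catalyst, so we conclude that the new preparation is providing a significant change (increase) in conversion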
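
The arithmetic for Example #1 can be reproduced in a few lines. This is a minimal sketch that takes the numbers from the problem statement (average 76.1%, variance 4.41 %², 10 data points) and assumes, as the example does, that the variance is known and the sample average is Normally distributed.

```python
import math
from statistics import NormalDist

x_bar = 76.1     # sample average conversion, %
sigma2 = 4.41    # "known" variance from operating history, %^2
n = 10           # number of data points
alpha = 0.05     # 95% confidence

z = NormalDist().inv_cdf(1 - alpha / 2)   # fence value, about 1.96
half_width = z * math.sqrt(sigma2 / n)
lower, upper = x_bar - half_width, x_bar + half_width

print(f"95% CI for mean conversion: [{lower:.2f}, {upper:.2f}] %")
print("old-catalyst mean of 70% inside interval:", lower <= 70.0 <= upper)
```

Because 70% falls outside the interval, the code reaches the same conclusion as the slide: the change in conversion is significant.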


Example #2

Conversion in a chemical reactor using new catalyst

• Average conversion computed using 10 data points is 76.1%
• Data set of 10 points was used to calculate the sample variance, which is 5.3 %²
• determine the 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different than the conversion obtained using the old catalyst, which is known to be 70%
• How would we calculate the sample variance?


Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student’s t distribution - degrees of freedom = 10-1 = 9
• upper tail area is 2.5%, so $t_{9,\,0.025} = 2.262$
• confidence interval

$$76.1 \pm 2.262\,\sqrt{\frac{5.3}{10}} = 76.1 \pm 1.65 \quad\Rightarrow\quad 74.45\% \le \mu \le 77.75\%$$

• Conclusion - interval doesn’t contain conversion of 70%, so the new catalyst is providing a significant change (increase) in conversion
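
Example #2 follows the same pattern with the t fence in place of the z fence. The Python standard library has no Student's t quantile function, so this sketch hard-codes the table value 2.262 for 9 degrees of freedom and an upper-tail area of 0.025; everything else comes from the problem statement.

```python
import math

x_bar = 76.1      # sample average conversion, %
s2 = 5.3          # sample variance from the same 10 points, %^2
n = 10
t_fence = 2.262   # t table value: 9 degrees of freedom, upper-tail area 0.025

half_width = t_fence * math.sqrt(s2 / n)
lower, upper = x_bar - half_width, x_bar + half_width

print(f"95% CI for mean conversion: [{lower:.2f}, {upper:.2f}] %")
print("old-catalyst mean of 70% inside interval:", lower <= 70.0 <= upper)
```

The interval is wider than in Example #1 for two reasons: the variance estimate is larger, and the t fence (2.262) exceeds the z fence (1.96) to account for the uncertainty in the estimated variance.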


Confidence Intervals for Variance

First, we need to know the sampling distribution of the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

If the data are Normally distributed, $s^2$ is a scaled sum of squared Normal random variables


Chi-squared distribution
• $\chi^2$ is the name given to the distribution of a squared standard Normal random variable
• Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared and summed
• e.g., $Z_1^2 + Z_2^2 + Z_3^2$
• 3 degrees of freedom

[Figure: Chi-squared density with 3 degrees of freedom]
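
A quick simulation can make the definition concrete. The sketch below builds Chi-squared draws with 3 degrees of freedom by summing three squared standard Normal values, then checks two known properties: the draws are never negative, and their average is close to the degrees of freedom. The function name and the number of draws are arbitrary choices for this illustration.

```python
import random
import statistics


def chi_squared_draws(dof, draws=20000, seed=3):
    """Each draw is the sum of `dof` squared independent N(0, 1) values."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
        for _ in range(draws)
    ]


samples = chi_squared_draws(dof=3)
print(f"smallest draw = {min(samples):.4f}")
print(f"average draw  = {statistics.fmean(samples):.3f}  (theory: 3)")
print(f"median draw   = {statistics.median(samples):.3f}")
```

The median sits below the mean, which reflects the long right tail of the distribution.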


Chi-squared distribution
• Functional form of $\chi^2$ distribution is in Montgomery and Runger.
• Integrals are available in Table III in Appendix A.
• The $\chi^2$ distribution is asymmetric. It goes from 0 to $\infty$.
• Why can’t random samples from $\chi^2$ be negative?


Sampling Distribution - Sample Variance

Sample variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

• Looks like it might be the sum of n squared Normal random variables
• However, the calculated sample average introduces a constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the n-1 variables and the average)
• sample variance really contains n-1 independent Normal random variables, so the degrees of freedom for the Chi-squared distribution is n-1:

$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}$$


Confidence Intervals for True Variance
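• Form probability statement

$$P\left(\chi^2_{n-1,\,1-\alpha/2} \le \frac{(n-1)\,s^2}{\sigma^2} \le \chi^2_{n-1,\,\alpha/2}\right) = 1-\alpha$$

• Re-arrange statement

$$P\left(\frac{(n-1)\,s^2}{\chi^2_{n-1,\,\alpha/2}} \le \sigma^2 \le \frac{(n-1)\,s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1-\alpha$$

• $100(1-\alpha)\%$ interval is

$$\frac{(n-1)\,s^2}{\chi^2_{n-1,\,\alpha/2}} \le \sigma^2 \le \frac{(n-1)\,s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$

where $\chi^2_{n-1,\,p}$ is the fence value with upper-tail area p.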


Confidence Limits for Variance

[Figure: Chi-squared density with equal tail areas]

Notes

1) If the tail areas are equal, the confidence interval is asymmetric about the variance estimate $s^2$

• consequence of asymmetry of Chi-squared distribution


Variance Confidence Intervals - Example

Temperature controller has been implemented on a polymerization reactor -

• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly different?


Variance Confidence Intervals - Example

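Use confidence interval for variance

• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval ($\alpha = 0.05$)
• from tables: $\chi^2_{9,\,0.975} = 2.700$ and $\chi^2_{9,\,0.025} = 19.023$ (subscripts give upper-tail areas)
• interval for variance:

$$\frac{9 \times 3.2}{19.023} \le \sigma^2 \le \frac{9 \times 3.2}{2.700} \quad\Rightarrow\quad 1.51 \le \sigma^2 \le 10.67\ {}^\circ\mathrm{C}^2$$

• Conclusion: 4.7 is within this range, so the change in variance is not significant
• Notice that the interval isn’t symmetric about 3.2 °C²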


Variance Confidence Intervals - Example

Comment

• Confidence intervals for variance are sensitive to degrees of freedom
• We need a larger number of data points to obtain a precise estimate
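• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:

$$\frac{30 \times 3.2}{46.979} \le \sigma^2 \le \frac{30 \times 3.2}{16.791} \quad\Rightarrow\quad 2.04 \le \sigma^2 \le 5.72\ {}^\circ\mathrm{C}^2$$

• Compare with previous interval with 10 data points

Conclusion still doesn’t change, however.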
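
The two variance intervals for the temperature-controller example can be compared side by side. The Python standard library has no Chi-squared quantile function, so this sketch hard-codes standard table values for upper-tail areas 0.025 and 0.975; the helper name `variance_interval` is made up for this example.

```python
def variance_interval(s2, dof, chi2_upper_fence, chi2_lower_fence):
    """Confidence interval for the true variance.

    chi2_upper_fence: Chi-squared value with upper-tail area alpha/2
    chi2_lower_fence: Chi-squared value with upper-tail area 1 - alpha/2
    """
    return dof * s2 / chi2_upper_fence, dof * s2 / chi2_lower_fence


# Chi-squared table values for a 95% interval, keyed by degrees of freedom:
# (upper-tail area 0.025, upper-tail area 0.975)
TABLE = {9: (19.023, 2.700), 30: (46.979, 16.791)}

s2 = 3.2         # sample variance under new operation, degC^2
old_var = 4.7    # variance under previous operation, degC^2

for dof, (upper_fence, lower_fence) in TABLE.items():
    lower, upper = variance_interval(s2, dof, upper_fence, lower_fence)
    inside = lower <= old_var <= upper
    print(f"dof={dof:2d}: [{lower:.2f}, {upper:.2f}] degC^2, "
          f"width={upper - lower:.2f}, contains 4.7: {inside}")
```

The interval with 30 degrees of freedom is much narrower, yet 4.7 °C² still lies inside it, so the conclusion is unchanged.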
