Statistics: Analyzing and Comparing Data

Statistics: Analyzing and Comparing Data, Module 3

Outline
• Estimating the true mean using the sample average
• Estimating the true variance using the sample variance
• Confidence intervals and hypothesis tests for means and variances
K. McAuley

Population Mean and Sample Mean
If the complete population contains N values, the average is
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Often N is very large or infinite, so we collect a random sample of n values and estimate $\mu$ using the sample mean
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Why is $\bar{X}$ a random variable? What happens to the quality of the estimate as n changes?

Sample Average
• What is the difference between $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$?

Definition - Random Sample
Independent random variables $X_1, X_2, \ldots, X_n$ with the same underlying distribution are called a random sample.
A STATISTIC is any function of the random variables in a random sample.
• The statistics we calculate most often are the sample mean and sample variance. Parameter estimates in models are also statistics.

Statistics
• Is $\bar{X}$ the same as $\mu$?
• Is $s^2$ the same as $\sigma^2$?
• Is $\bar{X}$ a random variable? What about $\mu$? What about $s^2$?
• The probability distribution of a statistic arises from the probability distribution of the underlying population.
• The variability of $X_i$ influences the variability of $\bar{X}$.
• Statisticians call the probability distribution of a statistic a sampling distribution.

Sampling Distribution for the Sample Average
• Let's determine the mean and variance of the sample average $\bar{X}$.
The expected value of the sample average is the true mean of the population: $E[\bar{X}] = \mu$. We say that the sample average is an UNBIASED estimator for the mean of the population.
The variance of the sample average is smaller than the variance of the underlying population (for n > 1): $\mathrm{Var}(\bar{X}) = \sigma^2/n$.
Let's do some proofs!

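Sampling Distribution for the Sample Average
Mean:
$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu$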

Variance
We use the following theorem to find the variance of $\bar{X}$. If X and Y are independent random variables and a and b are constants, then
$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$

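Sampling Distribution for the Sample Average
Variance:
$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$
• As n becomes larger, the variance of the sample average becomes smaller.
• As more data are used, the estimate of the true mean becomes more precise.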
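The result that the variance of the sample average equals the population variance divided by n can be checked numerically. The sketch below is only an illustration: the Uniform(0, 1) population, the number of repeats, and the function name are assumptions chosen for convenience, not part of the original slides.

```python
import random
import statistics

# Variance of the assumed Uniform(0, 1) population.
POPULATION_VARIANCE = 1.0 / 12.0


def variance_of_sample_mean(n, n_repeats=20000, seed=0):
    """Estimate Var(X-bar) by drawing many samples of size n and
    computing the variance of the resulting sample averages."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.random() for _ in range(n))
        for _ in range(n_repeats)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        simulated = variance_of_sample_mean(n)
        theory = POPULATION_VARIANCE / n
        print(f"n={n:3d}  simulated={simulated:.5f}  sigma^2/n={theory:.5f}")
```

The simulated values should track sigma^2/n closely, shrinking by roughly a factor of four each time n is quadrupled.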

Distribution of the Sample Average
• In the preceding slides, no assumption was made about the distribution of the population (e.g., normal, uniform, exponential).
• The Central Limit Theorem says that the distribution of the sample average approaches a Normal distribution as the sample size n becomes large.
• This holds even if the underlying population is non-Normal.
• When using hypothesis tests, confidence limits, and control charts, we will assume Normality for $\bar{X}$.

Sample Variance
The variance of the population can be estimated using the sample variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
People are sloppy and use lower case $s^2$ for both the observed value and the random variable.

Sample Variance
Observed value: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Expected value of the sample variance: $E[S^2] = \sigma^2$
Sample variance is an UNBIASED estimator of population variance.

Sample Standard Deviation
Sample standard deviation is simply the square root of the sample variance: $s = \sqrt{s^2}$
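As a small illustration of the sample variance and sample standard deviation, the sketch below computes both by hand with the n-1 divisor and compares them with Python's standard library. The five data values are invented purely for illustration and do not come from the slides.

```python
import math
import statistics

# Hypothetical measurements, invented purely for illustration.
data = [74.2, 77.9, 75.1, 78.3, 76.0]

n = len(data)
x_bar = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1.
s2_manual = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s_manual = math.sqrt(s2_manual)

# statistics.variance and statistics.stdev also use the n - 1 divisor.
s2_library = statistics.variance(data)
s_library = statistics.stdev(data)

if __name__ == "__main__":
    print(f"x_bar = {x_bar:.3f}")
    print(f"s^2   = {s2_manual:.4f} (manual), {s2_library:.4f} (library)")
    print(f"s     = {s_manual:.4f} (manual), {s_library:.4f} (library)")
```

Note that `statistics.pvariance` divides by n instead and would give the population formula, which is not the unbiased estimator discussed here.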


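Confidence Intervals
Consider the sample average:
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
Here "$\sim$" is read "is distributed as", and $N(\mu, \sigma^2/n)$ means "Normally distributed with mean $\mu$ and variance $\sigma^2/n$".
We can standardize this to have zero mean and unit variance:
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$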

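Getting Confidence Intervals for the True Mean using Sampled Data
Distribution for standard normal: start with
$P(-1.96 \le Z \le 1.96) = 0.95$
If $\bar{X}$ is normally distributed:
$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$
Let's rearrange to get $\mu$ in the middle.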

Confidence Intervals
Rearranging gives:
$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$
The lower and upper limits are RANDOM; $\mu$ in the middle is NOT random.
Interpretation -
• The limits of the interval have uncertainty - if we get a new set of samples, re-estimate the average, and re-compute the limits, the endpoints change somewhat, BUT 95% of the time the interval will contain the true value of the mean. 5% of the time the true mean will be outside the limits.
• What is the "true mean" anyway?

Confidence Intervals
Imagine repeating a set of experiments eight times and calculating confidence intervals on the mean from each set of experiments. The true mean wouldn't move, but the confidence limits would.
(Figure: eight confidence intervals plotted against the fixed true value of the mean.)
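The "95% of the time" interpretation can be illustrated by repeating the experiment many times on a computer and counting how often the interval contains the true mean. This is a minimal sketch under assumed conditions: a Normal population with a known standard deviation, and arbitrary illustrative values for the true mean, standard deviation, and sample size.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage(mu=70.0, sigma=2.1, n=10, n_repeats=20000,
             confidence=0.95, seed=1):
    """Fraction of simulated known-variance confidence intervals
    that contain the true mean mu (all arguments are illustrative)."""
    rng = random.Random(seed)
    # Fence value z_{alpha/2} from the standard Normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_repeats):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The limits move from sample to sample; mu stays fixed.
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / n_repeats


if __name__ == "__main__":
    print(f"fraction of intervals containing mu: {coverage():.3f}")
```

The printed fraction should be close to 0.95, and changing `confidence` should move it accordingly.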

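Confidence Intervals
What if we want a 90% or 99% confidence interval? The 100(1-$\alpha$)% confidence interval is given by:
$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
where -
• $z_{\alpha/2}$ is the "fence" value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• the value is obtained from tables
• For 95%, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$
• For 99%, $\alpha = 0.01$ and $z_{\alpha/2} = 2.576$
• Why do we find $z_{\alpha/2}$ instead of $z_{\alpha}$?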
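Instead of reading the fence values from a table, they can be computed from the inverse cumulative distribution of the standard Normal. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_fence` is an assumption made for this example.

```python
from statistics import NormalDist


def z_fence(alpha):
    """Return z_{alpha/2}, the value with upper-tail area alpha/2
    under the standard Normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    for confidence in (0.90, 0.95, 0.99):
        alpha = 1.0 - confidence
        print(f"{confidence:.0%} interval: z = {z_fence(alpha):.3f}")
```

The tail area is split in two because the interval has both a lower and an upper limit, each cutting off alpha/2.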

Confidence Intervals for Mean
When the population variance is "known", the 100(1-$\alpha$)% confidence interval is
$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
Known variance -
• We might be comfortable assuming that we "know" the variance when the process has been operating steadily for a long period of time
• on the basis of extensive operating experience and a large number of data points
But we usually don't know the variance!

Confidence Intervals for Mean
What if the variance is unknown? This is the usual situation!
• Estimate $\sigma^2$ using the sample variance $s^2$
• Issue - $s^2$ is a random variable
• the approximate quantity $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ no longer has a standard Normal distribution
Solution -
• What is the probability distribution of $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ when the data are Normally distributed?
• Student's t distribution

Student's t Distribution
When the data are from a Normally distributed population,
$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
follows a Student's t distribution with n-1 degrees of freedom.
Degrees of freedom (read p. 24 of text)
• Number of independent pieces of information used to compute the sample variance.
• Recall that when we calculate $s^2$, we divide by n-1.
• One degree of freedom gets used up because we calculate $\bar{X}$ and use it to obtain $s^2$.

Student's t Distribution
• has a shape similar to that of the Normal distribution, but the tails are heavier
• symmetric
• Cumulative t distribution is in Table II on pg. A-4.
(Figure: t distribution with 3 degrees of freedom.)


Confidence Intervals for Population Mean, True Variance Unknown
• 100(1-$\alpha$)% case: $\bar{X} - t_{\nu,\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\nu,\alpha/2}\frac{s}{\sqrt{n}}$
• $\nu$, the number of degrees of freedom, is (n-1) when n data points are used to compute the sample variance.
• Let's do a proof.
• How do we get confidence intervals in general?

General Approach for Obtaining Confidence Intervals
• Determine a quantity with a known distribution that depends on the parameter of interest
• Write a probability statement using fences with a known probability
• Re-arrange the statement to obtain an interval specifying a range of values for the parameter of interest
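As a worked illustration of these three steps, the derivation below applies them to the mean of a Normally distributed population with unknown variance. It is a sketch using the notation assumed here ($t_{n-1,\alpha/2}$ denotes the value with upper-tail area $\alpha/2$ for n-1 degrees of freedom), not necessarily the exact proof presented in the lecture.

```latex
\begin{align*}
% Step 1: a quantity with a known distribution that depends on \mu
T &= \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \\
% Step 2: probability statement using fences with known probability
P\left(-t_{n-1,\alpha/2} \le \frac{\bar{X} - \mu}{s/\sqrt{n}}
       \le t_{n-1,\alpha/2}\right) &= 1 - \alpha \\
% Step 3: multiply by s/\sqrt{n}, subtract \bar{X}, and multiply by -1
% (which reverses the inequalities) to isolate \mu
P\left(\bar{X} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} \le \mu
       \le \bar{X} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right) &= 1 - \alpha
\end{align*}
```

The symmetry of the t distribution is what allows the same fence value to be used on both sides.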

Example #1
Conversion in a chemical reactor using a new catalyst
• Average conversion computed using 10 data points is 76.1%
• Prior operating history indicates that the variance of conversion is 4.41 %²
• Determine the 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different from the mean conversion obtained with the old catalyst, which is known to be 70%
• What assumptions will we need to make to find the answer? Do they bother you?

Example #1
• Confidence interval - 95%
• upper tail area is 2.5%, so $z_{0.025} = 1.96$
• confidence interval: $\bar{x} \pm z_{0.025}\,\sigma/\sqrt{n} = 76.1 \pm 1.96\sqrt{4.41/10} = 76.1 \pm 1.3$, i.e. about 74.8% to 77.4%
• conclusion - the interval doesn't contain the conversion of 70% for the old catalyst, so we conclude that the new preparation is providing a significant change (increase) in conversion
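The arithmetic of Example #1 can be reproduced with a few lines of code. This is a minimal sketch using the numbers stated in the example (average 76.1%, variance 4.41 %², n = 10); the function name is an assumption, and the calculation relies on the same assumptions as the example itself (Normally distributed sample average, known variance).

```python
import math
from statistics import NormalDist


def mean_ci_known_variance(x_bar, variance, n, confidence=0.95):
    """Confidence interval for the mean when the population
    variance is treated as known (z-based interval)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(variance / n)
    return x_bar - half_width, x_bar + half_width


# Values taken from Example #1.
lower, upper = mean_ci_known_variance(x_bar=76.1, variance=4.41, n=10)
old_catalyst_mean = 70.0
significant = not (lower <= old_catalyst_mean <= upper)

if __name__ == "__main__":
    print(f"95% interval: {lower:.2f}% to {upper:.2f}%")
    print(f"70% outside interval: {significant}")
```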

Example #2
Conversion in a chemical reactor using a new catalyst
• Average conversion computed using 10 data points is 76.1%
• The data set of 10 points was used to calculate the sample variance, which is 5.3 %²
• Determine the 95% confidence interval for mean conversion using the new catalyst, and use this to determine whether the new conversion is significantly different from the conversion obtained using the old catalyst, which is known to be 70%
• How would we calculate the sample variance?

Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student's t distribution - degrees of freedom = 10-1 = 9
• upper tail area is 2.5%, so $t_{9,0.025} = 2.262$
• confidence interval: $\bar{x} \pm t_{9,0.025}\,s/\sqrt{n} = 76.1 \pm 2.262\sqrt{5.3/10} = 76.1 \pm 1.65$, i.e. about 74.5% to 77.7%
• Conclusion - the interval doesn't contain the conversion of 70%, so the new catalyst is providing a significant change (increase) in conversion
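Example #2 follows the same pattern with a t fence in place of the z fence. Python's standard library has no t-distribution quantile function, so this sketch hard-codes the standard table value $t_{9,0.025} = 2.262$; the function name and this way of supplying the fence value are assumptions made for illustration.

```python
import math

# Standard t-table value for 9 degrees of freedom and upper-tail area 0.025.
T_9_0025 = 2.262


def mean_ci_unknown_variance(x_bar, sample_variance, n, t_fence):
    """Confidence interval for the mean when the variance is estimated
    from the same n data points (t-based interval).
    t_fence must be the table value for n - 1 degrees of freedom."""
    half_width = t_fence * math.sqrt(sample_variance / n)
    return x_bar - half_width, x_bar + half_width


# Values taken from Example #2.
lower, upper = mean_ci_unknown_variance(
    x_bar=76.1, sample_variance=5.3, n=10, t_fence=T_9_0025
)
significant = not (lower <= 70.0 <= upper)

if __name__ == "__main__":
    print(f"95% interval: {lower:.2f}% to {upper:.2f}%")
    print(f"70% outside interval: {significant}")
```

The interval is wider than in Example #1, partly because the sample variance is larger and partly because the t fence (2.262) exceeds the z fence (1.96).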

Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
If the data are Normally distributed, $s^2$ is proportional to a sum of squared Normal random variables.

Chi-squared distribution
• $\chi^2$ is the name given to the distribution of a squared standard Normal random variable
• $Z^2$ is a Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared and summed
• e.g., $Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$ has 3 degrees of freedom
(Figure: Chi-squared density with 3 degrees of freedom.)

Chi-squared distribution
• The functional form of the $\chi^2$ distribution is in Montgomery and Runger.
• Integrals are available in Table III in Appendix A.
• The $\chi^2$ distribution is asymmetric. It goes from 0 to $\infty$.
• Why can't random samples from $\chi^2$ be negative?

Sampling distribution - sample variance
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
• Looks like it might involve the sum of n squared Normal random variables
• However, the calculated sample average introduces a constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the other n-1 variables and the average)
• The sample variance really contains n-1 independent Normal random variables, so the degrees of freedom for the Chi-squared distribution is n-1: $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$

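Confidence Intervals for True Variance
• Form probability statement: $P\left(\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right) = 1-\alpha$
• Re-arrange statement: $P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}\right) = 1-\alpha$
• 100(1-$\alpha$)% interval is $\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}$
Here $\chi^2_{n-1,\alpha/2}$ is the value with upper-tail area $\alpha/2$.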

Confidence Limits for Variance
Notes
1) If the tail areas are equal, the confidence interval is asymmetric about $s^2$
• consequence of the asymmetry of the Chi-squared distribution
(Figure: Chi-squared density with equal tail areas marked.)

Variance Confidence Intervals - Example
A temperature controller has been implemented on a polymerization reactor -
• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly different?

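Variance Confidence Intervals - Example
Use confidence interval for variance
• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval ($\alpha$ = 0.05)
• from tables: $\chi^2_{9,0.025} = 19.02$ and $\chi^2_{9,0.975} = 2.70$
• interval for variance: $\frac{9(3.2)}{19.02} \le \sigma^2 \le \frac{9(3.2)}{2.70}$, i.e. 1.51 °C² ≤ $\sigma^2$ ≤ 10.67 °C²
• Conclusion: 4.7 is within this range, so the reduction in variance is not statistically significant.
• Notice that the interval isn't symmetric about 3.2 °C²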

Variance Confidence Intervals - Example
Comment
• Confidence intervals for variance are sensitive to the degrees of freedom
• We need a larger number of data points to obtain a precise estimate
• e.g., if the variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be: $\frac{30(3.2)}{46.98} \le \sigma^2 \le \frac{30(3.2)}{16.79}$, i.e. 2.04 °C² ≤ $\sigma^2$ ≤ 5.72 °C²
• Compare with the previous interval with 10 data points (1.51 °C² to 10.67 °C²)
The conclusion still doesn't change, however: 4.7 °C² remains inside the interval.
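The two variance intervals in this example, for 9 and for 30 degrees of freedom, can be computed side by side. Because the Python standard library has no Chi-squared quantile function, this sketch hard-codes the standard table values; the function name and the way the fence values are passed in are assumptions made for illustration.

```python
def variance_ci(sample_variance, dof, chi2_upper, chi2_lower):
    """Confidence interval for the true variance.

    chi2_upper: table value with upper-tail area alpha/2
    chi2_lower: table value with upper-tail area 1 - alpha/2
    """
    lower = dof * sample_variance / chi2_upper
    upper = dof * sample_variance / chi2_lower
    return lower, upper


# Standard Chi-squared table values for a 95% interval (alpha = 0.05).
ci_9 = variance_ci(3.2, dof=9, chi2_upper=19.02, chi2_lower=2.70)
ci_30 = variance_ci(3.2, dof=30, chi2_upper=46.98, chi2_lower=16.79)

previous_variance = 4.7  # variance under previous operation, in degC^2

if __name__ == "__main__":
    for label, (lo, hi) in (("9 dof", ci_9), ("30 dof", ci_30)):
        inside = lo <= previous_variance <= hi
        print(f"{label}: {lo:.2f} to {hi:.2f} degC^2, contains 4.7: {inside}")
```

Dividing by the larger fence gives the lower limit, which is why the interval is not centred on the sample variance.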
