Characterizing Variability and Comparing Patterns from Data

“Statistics” Module 3

Outline
• random samples
• notion of a statistic
• estimating the mean - sample average
• assessing the impact of variation on estimates - sampling distribution
• estimating variance - sample variance and standard deviation
• making decisions - comparisons of means, variances using confidence intervals, hypothesis tests
J. McLellan

Random Samples
Scenario -
• we have an underlying pattern of variability for a process which we would like to characterize - the population
• we perform a series of experiments on the process in such a way that the results are independent - the outcome of one experiment has no influence on any other experiment
• the underlying distribution in place during each experimental run is identical to that of the population
• when we run each experiment, we are collecting a value of the random variable $X_i$, which has uncertainty
• $X_i$ represents the “i-th” act of sampling - referred to as a sample random variable

Definition - Random Sample
A random sample of size “n” of a population random variable $X$ is a collection of random variables $X_1, \ldots, X_n$ such that
• the $X_i$’s are independent
• the $X_i$’s have distributions identical to that of $X$, i.e.,
$$F_{X_i}(x) = F_X(x)$$
Each $X_i$ represents a snapshot of the process. The $X_i$’s are referred to as sample random variables.
What do we do with these sample values?...

Sample Average
• used to estimate the mean
• given “n” samples, $X_1, \ldots, X_n$, compute
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
• interpretation - a rule for computing the sample average, involving sampling
• $\bar{X}$ is a random variable
• observed value
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Lower case is used to denote observed values of the sample random variables and average.

Statistics
• Sample average is an example of a “statistic”
Definition
A statistic is a function of sample random variables that is used to estimate a value of a parameter, and does not depend on any unknown parameters.
• e.g., the sample average estimates the mean $\mu$ and doesn’t depend on unknown parameters
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Sampling Distribution
A statistic is a random variable, with its own probability distribution
• distribution arises from probability distribution of underlying population, via the sample random variables
• distribution of the statistic is called the sampling distribution
• characteristics of the sampling distribution depend on:
  • the form of the statistic - e.g., linear function of the sample random variables
  • the distribution of the underlying population

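Sampling Distribution for the Sample Average
• determine the mean and variance of the sample average
Mean
$$E\{\bar{X}\} = E\left\{\frac{1}{n}\sum_{i=1}^{n} X_i\right\} = \frac{1}{n} E\left\{\sum_{i=1}^{n} X_i\right\} = \frac{1}{n}\sum_{i=1}^{n} E\{X_i\} = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{n\mu}{n} = \mu$$
Value expected on average of the sample average is the true mean of the process - sample average is an UNBIASED estimator for the mean.
Note - this result needs only the linearity of expectation and the identical distributions of the sample random variables; independence of the sample random variables is what the variance of the sample average relies on.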

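Sampling Distribution for the Sample Average
Variance
$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} Var\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
The variance of the sum splits into a sum of variances because the sample random variables are independent.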

Aside - Variance
If X and Y are independent random variables, and “a” and “b” are constants, then
$$Var(aX + bY) = a^2 Var(X) + b^2 Var(Y)$$
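The rule in this aside can be checked numerically. The sketch below is only an illustration: the constants, the standard deviations, the choice of Normal populations and the helper name `check_variance_rule` are all assumptions made for the example, not values from the slides.

```python
import random
import statistics

def check_variance_rule(a, b, sd_x, sd_y, n_draws=100_000, seed=0):
    """Compare the simulated variance of a*X + b*Y with a^2 Var(X) + b^2 Var(Y)."""
    rng = random.Random(seed)
    # X and Y are drawn independently, which is what the rule requires.
    combo = [a * rng.gauss(0.0, sd_x) + b * rng.gauss(0.0, sd_y)
             for _ in range(n_draws)]
    empirical = statistics.pvariance(combo)
    predicted = a**2 * sd_x**2 + b**2 * sd_y**2
    return empirical, predicted

# Illustrative values only (assumed for this sketch).
empirical, predicted = check_variance_rule(a=2.0, b=-3.0, sd_x=1.5, sd_y=0.5)
print(f"simulated: {empirical:.2f}   rule: {predicted:.2f}")
```

Note that the negative constant still contributes positively, because it enters the rule squared.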

Variance of Sample Average
Interpretation
• variance of sample average is $\sigma^2 / n$
• as n becomes larger, variance of sample average becomes smaller
• as more data is used, estimate becomes more precise
• sample average represents a concentration of information
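A small simulation can make the $\sigma^2 / n$ result concrete. This is a sketch under assumed settings: a Normal population with $\sigma = 2$, sample sizes of 5, 20 and 80, and a hypothetical helper named `variance_of_sample_average`.

```python
import random
import statistics

def variance_of_sample_average(n, sigma, n_repeats=5_000, seed=1):
    """Simulate many samples of size n and return the variance of their averages."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_repeats):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        averages.append(sum(sample) / n)
    return statistics.pvariance(averages)

sigma = 2.0  # assumed population standard deviation
results = {n: variance_of_sample_average(n, sigma) for n in (5, 20, 80)}
for n, simulated in results.items():
    print(f"n={n:3d}  simulated={simulated:.4f}  sigma^2/n={sigma**2 / n:.4f}")
```

Each fourfold increase in n should cut the variance of the average by roughly a factor of four.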

Distribution of the Sample Average
• in preceding slides, no assumption was made about distribution of population (e.g., normal, exponential)
• Central Limit Theorem implies that distribution of sample average approaches a Normal distribution when number of samples becomes large
  • even if underlying population is non-Normal
• important consequences for comparing values - hypothesis tests and confidence limits
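One way to see the Central Limit Theorem at work is to track the skewness of the sample average, which is zero for a Normal distribution. The sketch below assumes an exponential population with mean 1 (strongly skewed) and uses hypothetical helpers `skewness` and `sample_averages`; it illustrates the trend and proves nothing.

```python
import random
import statistics

def skewness(values):
    """Standardized third moment; 0 for a symmetric distribution such as the Normal."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def sample_averages(n, n_repeats=10_000, seed=2):
    """Averages of n draws from an exponential population with mean 1."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(n_repeats)]

skew_by_n = {n: skewness(sample_averages(n)) for n in (1, 10, 50)}
for n, skew in skew_by_n.items():
    print(f"n={n:2d}  skewness of sample average = {skew:.2f}")
```

The skewness shrinks toward zero as n grows, which is consistent with the sample average looking more Normal even though the population is not.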

Sample Variance
Variance is estimated using the following statistic:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
Observed value:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Mean of the sample variance:
$$E\{s^2\} = \sigma^2$$
Sample variance is an UNBIASED estimator of variance.

Sample Standard Deviation
… is simply the square root of the sample variance
BUT
• sample standard deviation is a biased estimator of population standard deviation
• its expected value does not equal the population value
$$E\{s\} \neq \sigma$$
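The contrast between the unbiased sample variance and the biased sample standard deviation shows up clearly in a simulation with small samples. The settings below (Normal population, $\sigma = 2$, n = 5) and the helper names are assumptions chosen for illustration.

```python
import math
import random

def sample_variance(values):
    """Sample variance with the n-1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def average_estimates(n, sigma, n_repeats=20_000, seed=3):
    """Average of s^2 and of s over many simulated samples of size n."""
    rng = random.Random(seed)
    total_s2 = 0.0
    total_s = 0.0
    for _ in range(n_repeats):
        s2 = sample_variance([rng.gauss(0.0, sigma) for _ in range(n)])
        total_s2 += s2
        total_s += math.sqrt(s2)
    return total_s2 / n_repeats, total_s / n_repeats

mean_s2, mean_s = average_estimates(n=5, sigma=2.0)  # assumed true sigma = 2
print(f"average s^2 = {mean_s2:.3f} (sigma^2 = 4)")
print(f"average s   = {mean_s:.3f} (sigma   = 2)")
```

The average of $s^2$ lands close to 4, while the average of $s$ falls noticeably below 2.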

Confidence Intervals
Consider the sample average
$$\bar{X} \sim N(\mu_X, \sigma_X^2 / n)$$
where “$\sim$” reads “is distributed as” and $N(\cdot, \cdot)$ reads “Normally distributed with mean and variance”.
We can standardize this to have zero mean and unit variance:
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$

Confidence Intervals
Start with the distribution for the standard Normal random variable Z, and consider
$$P(-1.96 < Z < 1.96) = 0.95$$
$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$
$$\Leftrightarrow \quad P\left(\mu_X - 1.96\,\sigma_X / \sqrt{n} < \bar{X} < \mu_X + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$

Confidence Intervals
Rearrange this last statement to obtain:
$$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$
In this statement the two endpoints are RANDOM and $\mu_X$ is NOT random.
Interpretation -
• limits of interval have uncertainty - if we repeated the sequence of estimating the average and computing the limits, the endpoints would change somewhat, BUT 95% of the time the interval would contain the true value of the mean

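Confidence Intervals
• this interval DOES NOT imply that the mean $\mu_X$ is uncertain - the mean is a fixed value, and it is the interval endpoints that vary from one set of experiments to the next
[Picture: sequence of intervals associated with repeated experimentation, shown against the true value of the mean]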
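The repeated-experimentation picture can be imitated in a simulation: fix a true mean, generate many samples, and count how often the interval captures it. The true mean of 70, $\sigma = 2.1$ and n = 10 below are assumed values for the sketch, and `coverage` is a hypothetical helper.

```python
import math
import random

def coverage(mu, sigma, n, z=1.96, n_repeats=10_000, seed=4):
    """Fraction of simulated known-variance intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # fixed, because sigma is treated as known
    hits = 0
    for _ in range(n_repeats):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # The endpoints move with xbar; mu stays put.
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / n_repeats

fraction = coverage(mu=70.0, sigma=2.1, n=10)
print(f"fraction of intervals containing the true mean: {fraction:.3f}")
```

The fraction should settle near 0.95, matching the probability statement for the interval.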

Confidence Intervals
General result for mean - the 100(1-α)% confidence interval is given by:
$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$
where -
• $z_{\alpha/2}$ - “fence” - value for which $P(Z > z_{\alpha/2}) = \alpha/2$
• value obtained from tables
• 95% - value is 1.96 - approximately 2
• 99% - value is 2.576
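As an alternative to tables, the fence values can be computed from the inverse of the standard Normal cumulative distribution. This minimal sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_fence` is an assumption.

```python
from statistics import NormalDist

def z_fence(confidence):
    """Value z such that P(Z > z) = alpha/2, where alpha = 1 - confidence."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

z95 = z_fence(0.95)
z99 = z_fence(0.99)
print(f"95% fence: {z95:.3f}")
print(f"99% fence: {z99:.3f}")
```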

Confidence Intervals
General Approach
• form a quantity with a known distribution that depends on the parameter of interest
$$Z = \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}}$$
• form a probability statement - choose fences (limits) with a known probability
$$P\left(-1.96 < \frac{\bar{X} - \mu_X}{\sigma_X / \sqrt{n}} < 1.96\right) = 0.95$$
• re-arrange statement to obtain an interval specifying a range of values for the parameter of interest
$$P\left(\bar{X} - 1.96\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + 1.96\,\sigma_X / \sqrt{n}\right) = 0.95$$

Confidence Intervals for Mean
When population variance is “known”, the 100(1-α)% confidence interval is -
$$\bar{X} - z_{\alpha/2}\,\sigma_X / \sqrt{n} < \mu_X < \bar{X} + z_{\alpha/2}\,\sigma_X / \sqrt{n}$$
Known variance -
• knowledge of variance when process has been operating steadily for long period of time
• on basis of extensive operating experience
• “large number of data points”

Confidence Intervals for Mean
What if variance is unknown?
• Estimate using sample variance $s^2$
Follow previous approach by forming standardized quantity:
$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$
• issue - $s^2$ is a statistic itself, and is a random variable
• this quantity no longer has a standard Normal distribution
Solution -
• what is the probability distribution of this quantity, when data are Normally distributed?

Student’s t Distribution
When the data are Normally distributed,
$$\frac{\bar{X} - \mu_X}{s_X / \sqrt{n}}$$
follows a Student’s t distribution with n-1 degrees of freedom
Degrees of freedom -
• number of statistically independent pieces of information used to compute sample variance
• recall that in $s^2$, we divide by n-1 where n is the number of data points

Student’s t Distribution
… has a shape similar to that of Normal distribution
• symmetric
• values are available in tables
• extra parameter in tables - degrees of freedom
[Figure: Student’s t density with 3 degrees of freedom]

Confidence Intervals for Mean
Variance Unknown
• estimated using sample variance
• 100(1-α)% case
$$\bar{X} - t_{\nu,\alpha/2}\,s_X / \sqrt{n} < \mu_X < \bar{X} + t_{\nu,\alpha/2}\,s_X / \sqrt{n}$$
• $\nu$ is the number of degrees of freedom (n-1), where n is number of data points used to compute sample variance (and average)
• obtained following identical argument used in the known variance case
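Why the t fence is needed can be seen by simulating intervals built from the sample standard deviation and comparing two fences: the Normal value 1.96 and the tabulated t value for 9 degrees of freedom and an upper tail area of 0.025, which is 2.262. The population settings (mean 70, $\sigma = 2$, n = 10) and the helper name are assumptions for this sketch.

```python
import math
import random

def coverage_unknown_variance(fence, n=10, mu=70.0, sigma=2.0,
                              n_repeats=20_000, seed=5):
    """Fraction of intervals xbar +/- fence*s/sqrt(n) that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # s is re-estimated from each sample, so the interval width is random too.
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        half_width = fence * s / math.sqrt(n)
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / n_repeats

coverage_z = coverage_unknown_variance(fence=1.96)   # Normal fence
coverage_t = coverage_unknown_variance(fence=2.262)  # t fence, 9 degrees of freedom
print(f"coverage with z fence: {coverage_z:.3f}")
print(f"coverage with t fence: {coverage_t:.3f}")
```

With only 10 data points the Normal fence gives intervals that capture the mean less often than 95% of the time, while the wider t fence restores the intended coverage.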

Example #1
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• prior operating history indicates that variance of conversion is 4.41 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

Example #1
• Confidence interval - 95%
• upper tail area is 2.5%, so $z_{\alpha/2} = 1.96$
• standard devn = sqrt(4.41) = 2.1
• confidence interval
$$76.1 - (1.96)(2.1)/\sqrt{10} < \mu < 76.1 + (1.96)(2.1)/\sqrt{10}$$
$$\Rightarrow \quad 74.8 < \mu < 77.4$$
• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion

Example #2
Conversion in a chemical reactor using new catalyst preparation
• data collected, average conversion computed using 10 data points is 76.1%
• current data set of 10 points used to estimate sample variance, which is 5.3 %²
• determine 95% confidence interval for mean conversion under new preparation, and use this to determine whether new conversion is significantly different than current conversion, which is known to be 70%

Example #2
• Confidence interval - 95%
• variance UNKNOWN - need to use Student’s t distribution - degrees of freedom = 10-1 = 9
• upper tail area is 2.5%, so $t_{9,0.025} = 2.262$
• standard devn = sqrt(5.3) = 2.3
• confidence interval
$$76.1 - (2.262)(2.3)/\sqrt{10} < \mu < 76.1 + (2.262)(2.3)/\sqrt{10}$$
$$\Rightarrow \quad 74.5 < \mu < 77.7$$
• conclusion - interval doesn’t contain current conversion of 70% --> new preparation is providing a significant change (increase) in conversion
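The arithmetic in Examples #1 and #2 can be reproduced with one small helper. The numbers are taken from the two examples, with the fences 1.96 and 2.262 entered as table values; the function name `mean_confidence_interval` is an assumption for this sketch.

```python
import math

def mean_confidence_interval(xbar, std_dev, n, fence):
    """Interval xbar +/- fence * std_dev / sqrt(n)."""
    half_width = fence * std_dev / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Example #1: variance known (4.41), Normal fence for 95%.
ex1 = mean_confidence_interval(76.1, math.sqrt(4.41), 10, fence=1.96)
# Example #2: variance estimated (5.3), t fence for 9 degrees of freedom.
ex2 = mean_confidence_interval(76.1, math.sqrt(5.3), 10, fence=2.262)

current_conversion = 70.0
for label, (lower, upper) in (("Example #1", ex1), ("Example #2", ex2)):
    contains = lower < current_conversion < upper
    print(f"{label}: {lower:.1f} < mu < {upper:.1f}   contains 70%: {contains}")
```

The second interval is wider because the variance is estimated from only 10 points, yet neither interval contains 70%.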

Confidence Intervals for Variance
First, we need to know the sampling distribution of the sample variance:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
• when data are Normally distributed, sample variance is the sum of squared Normal random variables
• squaring “folds over” the negative values of the Normal random variable and makes them positive - asymmetry

Chi-squared distribution
• is the distribution of a squared standard Normal random variable
$$Z^2 \sim \chi^2_1$$
• Chi-squared random variable with 1 degree of freedom
• degrees of freedom = number of independent standard Normal random variables being squared
• e.g., 3 degrees of freedom
$$Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2_3$$
[Figure: Chi-squared density with 3 degrees of freedom]
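The construction of a Chi-squared random variable can be imitated directly by summing squared standard Normal draws. This sketch uses 3 degrees of freedom as on the slide; the number of draws, the seed and the helper name `chi_squared_draws` are assumptions. It relies on the standard fact that a Chi-squared random variable has mean equal to its degrees of freedom.

```python
import random
import statistics

def chi_squared_draws(dof, n_draws=20_000, seed=6):
    """Each draw is a sum of `dof` independent squared standard Normal values."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
            for _ in range(n_draws)]

draws = chi_squared_draws(dof=3)
mean_draw = statistics.fmean(draws)
median_draw = statistics.median(draws)
print(f"mean   = {mean_draw:.2f} (degrees of freedom = 3)")
print(f"median = {median_draw:.2f}")
print(f"smallest draw = {min(draws):.4f}")
```

The draws are never negative and the median sits below the mean, which reflects the asymmetry of the distribution.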

Sampling distribution - sample variance
Sample variance
• is the sum of n squared Normal random variables BUT we add the sum of squared deviations from the sample average
• given value of sample average introduces constraint - given $\bar{X}$, we only have n-1 independent random variables (the n-th can be computed from the average)
• sample variance contains n-1 independent Normal random variables --> degrees of freedom for Chi-squared distribution is n-1
$$s^2 \sim \frac{\sigma^2}{n-1}\,\chi^2_{n-1}$$
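Both the constraint and the n-1 degrees of freedom can be observed numerically. The sketch below assumes a Normal population with $\sigma = 2$ and n = 10, and uses a hypothetical helper `scaled_sum_of_squares` that returns $(n-1)s^2/\sigma^2$ together with the sum of the deviations from the sample average.

```python
import random

def scaled_sum_of_squares(n, sigma, rng):
    """Return (n-1)*s^2/sigma^2 and the sum of deviations from the sample average."""
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    deviations = [x - xbar for x in sample]
    scaled = sum(d * d for d in deviations) / sigma**2
    return scaled, sum(deviations)

rng = random.Random(7)
n, sigma, n_repeats = 10, 2.0, 20_000  # assumed settings for the sketch
total_scaled = 0.0
largest_residual = 0.0
for _ in range(n_repeats):
    scaled, residual = scaled_sum_of_squares(n, sigma, rng)
    total_scaled += scaled
    largest_residual = max(largest_residual, abs(residual))

mean_scaled = total_scaled / n_repeats
print(f"average of (n-1)s^2/sigma^2 = {mean_scaled:.2f} (n-1 = {n - 1})")
print(f"largest |sum of deviations| = {largest_residual:.1e}")
```

The deviations always sum to zero up to rounding error, which is the constraint that removes one degree of freedom, and the scaled sample variance averages close to n-1 rather than n.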

Confidence Intervals - Sample Variance
• Form probability statement
$$P\left(\chi^2_{n-1,\,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\,\alpha/2}\right) = 1 - \alpha$$
• Re-arrange statement
$$P\left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right) = 1 - \alpha$$
• 100(1-α)% interval is
$$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$

Confidence Limits for Variance
Notes
1) the tail areas are equal
• symmetric tail areas, however the interval can be asymmetric
• consequence of asymmetry of Chi-squared distribution
2) $\chi^2_{n-1,\,1-\alpha/2}$ is the value of the Chi-squared random variable with upper tail area of 1-α/2 and n-1 degrees of freedom
[Figure: Chi-squared density with equal tail areas marked]

Variance Confidence Intervals - Example
Temperature controller has been implemented on a polymer reactor -
• variance under previous operation was 4.7 °C²
• under new operation, we have collected 10 data points and computed a sample variance of 3.2 °C²
• is the variance under the new control operation significantly better?
• i.e., is variance under new operation significantly lower?

Variance Confidence Intervals - Example
Use confidence interval for variance
• n-1 = 10-1 = 9 degrees of freedom
• form 95% confidence interval (α = 0.05)
• from tables:
$$\chi^2_{9,\,1-0.025} = 2.7 \qquad \chi^2_{9,\,0.025} = 19.0$$
• interval for variance:
$$1.52 < \sigma^2 < 10.67$$
• conclusion - the previous variance of 4.7 lies inside the interval, so the variance reduction isn’t significant after background variation in the sample variance computation is taken into account
• note that interval isn’t symmetric
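The interval in this example follows from dividing $(n-1)s^2$ by the two table values. A minimal sketch, with the Chi-squared values 2.7 and 19.0 entered by hand from the slide and a hypothetical helper named `variance_confidence_interval`:

```python
def variance_confidence_interval(sample_variance, n, chi2_low, chi2_high):
    """Interval (n-1)s^2/chi2_high < sigma^2 < (n-1)s^2/chi2_low.

    chi2_low is the Chi-squared value with upper tail area 1 - alpha/2,
    chi2_high is the value with upper tail area alpha/2 (both with n-1 dof).
    """
    scaled = (n - 1) * sample_variance
    return scaled / chi2_high, scaled / chi2_low

lower, upper = variance_confidence_interval(3.2, 10, chi2_low=2.7, chi2_high=19.0)
previous_variance = 4.7
significant = not (lower < previous_variance < upper)
print(f"{lower:.2f} < sigma^2 < {upper:.2f}")
print(f"reduction from {previous_variance} significant: {significant}")
```

The sample variance of 3.2 sits much closer to the lower limit than to the upper one, which is the asymmetry noted on the slide.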

Variance Confidence Intervals - Example
Comment
• the confidence interval for variance is sensitive to degrees of freedom
• need larger number of data points to obtain precise estimate
• e.g., if variance estimate was 3.2 °C² with 30 degrees of freedom (31 data points), the interval would be:
$$2.04 < \sigma^2 < 5.71$$
• cf. previous interval with 10 data points
$$1.52 < \sigma^2 < 10.67$$
Conclusion still doesn’t change, however - the previous variance of 4.7 still lies inside the narrower interval.
