Comparing Distributions III: Chi squared test, ANOVA
By Peter Woolf (pwoolf@umich.edu), University of Michigan
Michigan Chemical Process Dynamics and Controls Open Textbook, version 1.0. Creative commons.
Scenario: You have two parallel processes (Unit 1 and Unit 2) that carry out the same reaction using very similar equipment.
Question: Are these units actually behaving the same or not?
Approach:
(1) Gather data on yield from both units. A plot of the data does not clearly show any difference.
Approach:
(1) Gather data on yield from both units
(2) Perform statistical analysis
• Fisher’s exact test (requires binning)
• Chi squared test (requires binning)
• ANOVA (directly on data)
Binning Data: Data reduction by reassigning data into windows (here, HIGH and LOW).
Choosing a binning strategy:
• Assign to bins that appear naturally, such as groupings or important thresholds (e.g. yield>50 is profitable, so this is a natural window)
• If multiple windows appear, assign multiple bins
• If no natural bins appear, choose equally sized bins or above/below average
Bin in Excel with IF statements.
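The same binning can be sketched outside of Excel. The short Python example below is only an illustration: the yield values are made up, and the threshold of 50 is taken from the "yield>50 is profitable" example above.

```python
from collections import Counter

# Hypothetical yield measurements (not data from the lecture).
unit1_yields = [48.2, 53.1, 55.0, 47.5, 51.9, 58.3]
unit2_yields = [44.0, 49.8, 52.4, 46.1, 43.7, 50.5]

THRESHOLD = 50.0  # assumed natural window: yield > 50 is profitable


def bin_yield(value, threshold=THRESHOLD):
    """Return 'HIGH' if the yield exceeds the threshold, otherwise 'LOW'."""
    return "HIGH" if value > threshold else "LOW"


def count_bins(yields, threshold=THRESHOLD):
    """Count how many measurements fall into each bin."""
    counts = Counter(bin_yield(v, threshold) for v in yields)
    return {"LOW": counts["LOW"], "HIGH": counts["HIGH"]}


unit1_counts = count_bins(unit1_yields)
unit2_counts = count_bins(unit2_yields)
print("Unit 1:", unit1_counts)
print("Unit 2:", unit2_counts)
```

The two dictionaries are the rows of a Low/High count table for the two units.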
As mentioned in the last lecture, we can use Fisher’s exact test to calculate a p-value for finding this configuration at random.
For Fisher’s exact and Chi squared tests, create a contingency table.
Contingency table:
          Low   High   Total
Unit 1     53     97     150
Unit 2     82     68     150
Total     135    165     300
Observed:
          Low   High   Total
Unit 1     53     97     150
Unit 2     82     68     150
Total     135    165     300
“More extreme” configuration:
          Low   High   Total
Unit 1     52     98     150
Unit 2     83     67     150
Total     135    165     300
“Most extreme” configuration:
          Low   High   Total
Unit 1      0    150     150
Unit 2    135     15     150
Total     135    165     300
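The step from the observed table to the more extreme ones can be sketched in a few lines of Python. This is an illustration only: the function name is made up, and it assumes "more extreme" means moving one count at a time in the direction shown above (Unit 1 toward High) while keeping all row and column totals fixed.

```python
def more_extreme_tables(table):
    """List the tables that are more extreme than `table` in one direction.

    `table` is ((unit1_low, unit1_high), (unit2_low, unit2_high)).
    Each step moves one count from Unit 1 Low to Unit 1 High and one count
    from Unit 2 High to Unit 2 Low, so every margin stays unchanged.
    """
    (a, b), (c, d) = table
    tables = []
    while a > 0 and d > 0:
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
        tables.append(((a, b), (c, d)))
    return tables


observed = ((53, 97), (82, 68))
tables = more_extreme_tables(observed)
print("first more extreme table:", tables[0])
print("most extreme table:", tables[-1])
print("number of more extreme tables:", len(tables))
```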
[Figure: probability of configuration vs. # changes away from observed, marking the observed case and the most likely cases if this were a random sample]
More extreme = 0.0005
Less extreme = 0.9995
Total area = 1.0
Conclusion:
• The units are behaving differently
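The areas in this plot come from the probability of each possible table when the row and column totals are held fixed. A minimal Python sketch of that calculation follows. It assumes the standard hypergeometric formula for a 2 by 2 table and assumes the "more extreme" tail includes the observed table itself; the function names are made up for illustration.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def one_sided_p_value(a, b, c, d):
    """Sum the probability of the given table and of every table with a smaller `a`."""
    total = 0.0
    while a >= 0 and d >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


# Observed table: Unit 1 (Low 53, High 97), Unit 2 (Low 82, High 68)
p_more_extreme = one_sided_p_value(53, 97, 82, 68)

# Sanity check: the probabilities of all possible tables should add up to 1.
total_area = sum(
    table_probability(a, 150 - a, 135 - a, 15 + a) for a in range(0, 136)
)
print("one-sided p-value:", p_more_extreme)
print("total area:", total_area)
```

The one-sided p-value can be compared against the "more extreme" area quoted above.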
IDEA! The distance between the observed case and the most likely case if random is far, so can we just use that?
If this distance is “big”, then the observed case is unusual.
What is this point?
[Figure: probability of configuration vs. # changes away from observed]
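What is this point? It is the most likely case if random, computed from the row and column totals.
Observed case:
          Low   High   Total
Unit 1     53     97     150
Unit 2     82     68     150
Total     135    165     300
Most likely case if random:
          Low                     High                    Total
Unit 1    =150*(135/300) =67.5    =150*(165/300) =82.5      150
Unit 2    =150*(135/300) =67.5    =150*(165/300) =82.5      150
Total     135                     165                       300
Distance between these two cases? Start from the squared differences between the observed and expected counts. But this depends on the magnitude, so normalize it by the expected count.
Chi squared statistic:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed count and $E_i$ is the expected (most likely if random) count in cell $i$.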
Chi squared statistic for this case (observed case vs. most likely case if random):
$\chi^2 = \frac{(53-67.5)^2}{67.5} + \frac{(97-82.5)^2}{82.5} + \frac{(82-67.5)^2}{67.5} + \frac{(68-82.5)^2}{82.5} = 11.33$
Okay.. So what? What is the p-value?
The chi squared statistic has a known distribution that can be looked up or found in Excel using “chidist” with 1 degree of freedom.
For this case: =chidist(11.33,1)=0.00076
This can be done in a more automated way in Excel using “chitest”.
For this case chitest & Fisher’s exact agree.
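For readers working outside Excel, the calculation can be sketched in Python using only the standard library. The function names are made up for illustration. The p-value function relies on the fact that, for 1 degree of freedom, the upper tail of the chi squared distribution equals erfc(sqrt(x/2)); it does not apply to larger tables.

```python
import math


def chi_squared_statistic(observed):
    """Chi squared statistic of a table, with expected counts taken from the margins."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


def p_value_1_dof(stat):
    """Upper tail of the chi squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


observed = [[53, 97], [82, 68]]  # rows: Unit 1, Unit 2; columns: Low, High
chi2 = chi_squared_statistic(observed)
p = p_value_1_dof(chi2)
print("chi squared statistic:", chi2)
print("p-value:", p)
```

For the table in this lecture, this should give a statistic of about 11.33 and a p-value close to the chidist result above.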
Chi squared test vs. Fisher’s exact
• For a random null, Fisher’s exact will always yield a correct result
• Chi squared test is often easier to carry out (the math is easier)
• Chi squared will give incorrect results when
  • fewer than 20 samples are present
  • there are between 20 and 40 samples and one expected number is 5 or below
[Example table on slide] Chitest says the result is 2x more significant: an error due to the small sample effect.
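The two rules of thumb above can be written as a quick check to run before trusting a chi squared result. This is a sketch with made-up function names. It assumes the expected numbers are the "most likely if random" counts computed from the margins, and it treats "between 20 and 40" as inclusive.

```python
def expected_counts(observed):
    """Expected count for every cell, computed from the row and column totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_squared_is_reliable(observed):
    """Apply the sample-size rules of thumb listed above."""
    n = sum(sum(row) for row in observed)
    if n < 20:
        return False  # fewer than 20 samples
    smallest_expected = min(min(row) for row in expected_counts(observed))
    if n <= 40 and smallest_expected <= 5:
        return False  # 20 to 40 samples with an expected number of 5 or below
    return True


print(chi_squared_is_reliable([[53, 97], [82, 68]]))  # the table from this lecture
print(chi_squared_is_reliable([[3, 5], [4, 2]]))      # hypothetical small table
```

When the check fails, Fisher’s exact test is the safer choice.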
Chi squared test vs. Fisher’s exact (continued)
• Chi squared test is easy to do for larger contingency tables and when the expected distribution is not random.
• This can be done with a Fisher’s-like test, but the math gets much harder.
Example: 3 by 3 contingency table with a model for expectations. Observed is close to the expected, but far from random.
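A rough Python sketch of this idea follows. All numbers in it are hypothetical (the table from the slide is not reproduced here), and the function names are made up. It computes the chi squared statistic of a 3 by 3 table twice: once against expectations supplied by a model, and once against the random expectation computed from the margins.

```python
def chi_squared_against(observed, expected):
    """Chi squared statistic of `observed` against any table of expected counts."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def random_expectation(observed):
    """Expected counts if rows and columns were unrelated (from the margins)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Hypothetical 3 by 3 counts and hypothetical model expectations.
observed = [[30, 10, 5], [10, 30, 10], [5, 10, 30]]
model_expected = [[28, 11, 6], [11, 28, 11], [6, 11, 28]]

stat_model = chi_squared_against(observed, model_expected)
stat_random = chi_squared_against(observed, random_expectation(observed))
print("statistic vs. model expectation:", stat_model)
print("statistic vs. random expectation:", stat_random)
```

A small statistic against the model together with a large one against the random expectation corresponds to "close to the expected, but far from random". Turning either statistic into a p-value requires the chi squared distribution with the appropriate degrees of freedom, which this sketch leaves out.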
Approach:
(1) Gather data on yield from both units
(2) Perform statistical analysis
• Fisher’s exact test (requires binning)
• Chi squared test (requires binning)
• ANOVA (directly on data)
ANOVA: Analysis of Variance
Method to compare continuous measurements to determine if they are sampled from the same or different distributions.
For a single factor ANOVA, we assume that each observation in each class can be modeled as:
Observation = overall mean + class effect + random error
In the study we are following in this class, the class effect would be the effect of unit 1 or unit 2.
ANOVA can be done easily in Excel using Tools->Data Analysis->ANOVA
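The model above can be made concrete with a small Python sketch that splits each observation into the three terms. The yield values are hypothetical and the variable names are chosen only for illustration.

```python
from statistics import mean

# Hypothetical yields for the two units (not data from the lecture).
yields = {
    "unit 1": [50.0, 52.0, 54.0],
    "unit 2": [56.0, 58.0, 60.0],
}

all_values = [y for values in yields.values() for y in values]
overall_mean = mean(all_values)

# Class effect: how far each unit's mean sits from the overall mean.
class_effect = {name: mean(values) - overall_mean for name, values in yields.items()}

# Random error: what is left after removing the overall mean and the class effect.
random_error = {
    name: [y - overall_mean - class_effect[name] for y in values]
    for name, values in yields.items()
}

print("overall mean:", overall_mean)
print("class effects:", class_effect)
print("random errors:", random_error)
```

Adding the three terms back together recovers every original observation.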
1 way ANOVA
Key value: the p-value. It tells the probability of seeing a difference at least this large between the units (groups) if they were actually the same.
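Behind that p-value is an F statistic: the variation between the group means divided by the variation within the groups. A minimal Python sketch with hypothetical data follows. It stops at the F statistic, because the p-value needs the F distribution, which is not in the Python standard library; Excel reports it directly, and a statistics package such as SciPy (scipy.stats.f_oneway) could be used if available.

```python
from statistics import mean


def one_way_f_statistic(groups):
    """F statistic for a single factor ANOVA on a list of groups of measurements."""
    all_values = [y for group in groups for y in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical yields for the two units (not data from the lecture).
unit1 = [50.0, 52.0, 54.0]
unit2 = [56.0, 58.0, 60.0]
f_value = one_way_f_statistic([unit1, unit2])
print("F statistic:", f_value)
```

A large F value means the units differ by more than the scatter within each unit would explain.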
2 way ANOVA with replicates
Scenario: Testing three units in triplicate, each with three different control architectures: Feedback (FB), Model predictive control (MPC), and a cascade architecture. In each case we measure the yield.
Questions:
• Do the units significantly differ?
• Do the control architectures significantly differ?
In Excel: Tools->Data Analysis->ANOVA: Two factor with replication
2 way ANOVA with replicates
Results:
• Controllers (samples) have a significant effect
• Columns (units) don’t have a significant effect
• ?? Looks like an error, and may be why we get a negative F value and no p-value
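As a cross-check on spreadsheet output, the F values of a two factor ANOVA with replication can be computed by hand. The Python sketch below uses the textbook sums of squares for a balanced design (the same number of replicates in every cell). The function name and the data are hypothetical; the data are built so that the controllers differ and the units do not.

```python
from statistics import mean


def two_way_anova_f(data):
    """F values for a balanced two factor ANOVA with replication.

    data[i][j] is the list of replicates for row level i and column level j.
    """
    a, b, r = len(data), len(data[0]), len(data[0][0])
    grand = mean([y for row in data for cell in row for y in cell])
    row_means = [mean([y for cell in row for y in cell]) for row in data]
    col_means = [mean([y for row in data for y in row[j]]) for j in range(b)]
    cell_means = [[mean(cell) for cell in row] for row in data]

    ss_rows = b * r * sum((m - grand) ** 2 for m in row_means)
    ss_cols = a * r * sum((m - grand) ** 2 for m in col_means)
    ss_inter = r * sum(
        (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_within = sum(
        (y - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for y in data[i][j]
    )
    ms_within = ss_within / (a * b * (r - 1))
    return {
        "rows": ss_rows / (a - 1) / ms_within,
        "columns": ss_cols / (b - 1) / ms_within,
        "interaction": ss_inter / ((a - 1) * (b - 1)) / ms_within,
    }


# Hypothetical yields: rows are controllers, columns are units 1 to 3.
data = [
    [[49.0, 50.0, 51.0], [49.0, 50.0, 51.0], [49.0, 50.0, 51.0]],  # FB
    [[54.0, 55.0, 56.0], [54.0, 55.0, 56.0], [54.0, 55.0, 56.0]],  # MPC
    [[59.0, 60.0, 61.0], [59.0, 60.0, 61.0], [59.0, 60.0, 61.0]],  # cascade
]
f_values = two_way_anova_f(data)
print(f_values)
```

Every F value here is a ratio of sums of squares divided by positive degrees of freedom, so it cannot be negative. A negative F in a tool's output therefore points to a problem with the input or the tool, not to a property of the data.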
ANOVA
• ANOVA tells you if factors are significantly related to an outcome according to a linear model
• Nonlinear relationships can be strong, but may appear insignificant in an ANOVA
• ANOVA does not tell you the model parameters
• ANOVA, t-test, and z-test all provide similar kinds of information for different kinds of data
[Figure: physical process (Unit 1, Unit 2) -> experimental data -> statistical analysis]
Results:
• Unit 1 is different from unit 2
• This difference is clearer in the binned data (chi squared and Fisher’s p-values < ANOVA p-value)
Take Home Messages
• Chi squared tests are analogous to Fisher’s exact tests, but are generally easier to calculate
• Chi squared tests fail when sample sizes are small
• ANOVA determines if lists of continuous measurements are likely the same or different
• ANOVA can determine the significance of a set of factors on the measurements
The following pages have additional examples of ChemE applications of ANOVA analyses
Solution approach: two factor ANOVA.
Factor 1: Farm
Factor 2: Shipper
See if a factor has a significant p-value.
Looking at averages and ranges, it looks like shipper Rex has a somewhat worse record than Ned. The farms have some variation, but it is small. This said, both shippers will bring wheat with moths, but Rex will bring more.
1) Import data into Excel
2) Select Tools->Data Analysis->ANOVA: Two factor with Replication
Conclusion: the factor “shipper” has a significant influence on the moth probability, with a p-value of 0.03.
ANOVA- ChemE examples How does temperature affect yield?
ANOVA- ChemE examples Do both temperature and concentration affect yield?
ANOVA- ChemE examples How can controlling v4 and v2 differently affect process profitability? Example from 2006 controls wiki: http://controls.engin.umich.edu/wiki/index.php/Design_of_experiments_via_taguchi_methods:_one_and_two_way_layouts