A case study

A case study. The following data come from an experiment designed to measure the accuracy of eleven laboratories. Each laboratory was given three samples for each of two different types of chalk. The laboratories were then asked to take readings on the bulk density of precipitated chalk.



  1. A case study • The following data come from an experiment designed to measure the accuracy of eleven laboratories. • Each laboratory was given three samples for each of two different types of chalk. • The laboratories were then asked to take readings on the bulk density of precipitated chalk. • In this experiment, • The response is bulk density • The factors are CHALK and LAB • The factor CHALK has two levels A and B • The factor LAB has eleven levels corresponding to the different laboratories. Statistical Data Analysis - Lecture16 - 09/04/03

  2. Here’s the raw data. • How do we get it into a form that can be analysed?

  3. Data manipulation • Often, up to 90% of your time in any analysis is spent getting the data into a format that is convenient for the analysis you wish to do • This data set is no exception. There is no single way of doing this. • I used Microsoft Excel (good for data manipulation, not so good for statistical analysis) because it handles columnar data easily • Our ultimate goal is to get the data into R ready for a two-way ANOVA. • The format R expects (as do most stats packages) is to have the response in one column, and the corresponding factor levels in adjacent columns

  4. Data manipulation • Using Excel, I copied each block of chalk data and pasted the transpose of the data into my worksheet. • Transposing turns the rows into columns, so for each chalk type I went from a block of 11 rows and 3 columns to 3 rows and 11 columns. • This makes it easier to stack the results from the different laboratories on top of each other • After turning each block into a column, I stacked those columns on top of each other, leaving me with one column • Now, all the data are in one column.
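
The same transpose-and-stack step can also be done directly in R. The sketch below is only an illustration: `block_a` and `block_b` are hypothetical stand-ins for the two chalk blocks, filled with placeholder numbers rather than the experiment's readings.

```r
# Hypothetical stand-ins for the two chalk blocks: 11 labs (rows) x 3 samples (columns).
# The values are placeholders, not the real bulk density readings.
block_a <- matrix(1:33,  nrow = 11, ncol = 3, byrow = TRUE)
block_b <- matrix(34:66, nrow = 11, ncol = 3, byrow = TRUE)

# t() turns each 11 x 3 block into 3 x 11; as.vector() then reads it column by
# column, so lab 1's three readings come first, then lab 2's, and so on.
stacked_a <- as.vector(t(block_a))
stacked_b <- as.vector(t(block_b))

# Stack chalk A on top of chalk B to get a single response column.
density <- c(stacked_a, stacked_b)
length(density)
```

Because the placeholder blocks were filled row by row, the stacked column simply reproduces the fill order, which makes it easy to confirm that the readings stay grouped by lab.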

  5. Coding the factors • The 1st 33 observations are experiments done with Chalk A and the 2nd 33 observations are experiments done with Chalk B • Therefore in R we need to make a vector with 33 A’s and 33 B’s (to represent the factor levels for CHALK) • We can do this with chalk <- as.factor(rep(c("A", "B"), c(33, 33)))

  6. Coding the factors • We know that in each block of 33 experiments there are 3 observations from each lab • This means we need a sequence that represents the idea that the 1st 3 observations were done on chalk A by lab 1, the 2nd 3 observations were done on chalk A by lab 2 and so on • We’ve taken care of the CHALK coding. • To code the LAB factor we label each observation with the lab it came from. This means we need a vector of 3 “ones”, 3 “twos” and so on. And it needs to be repeated for Chalk B • Therefore, we use lab <- as.factor(rep(rep(1:11, rep(3, 11)), 2))
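
It is worth checking that the two factors line up as intended before fitting anything. The sketch below rebuilds both factors and cross-tabulates them. The `gl()` calls are an alternative way to generate the same patterns and are not part of the original slides.

```r
# The two factors as coded on the slides.
chalk <- as.factor(rep(c("A", "B"), c(33, 33)))
lab   <- as.factor(rep(rep(1:11, rep(3, 11)), 2))

# Alternative (not from the slides): gl() generates regular factor patterns.
chalk_gl <- gl(2, 33, labels = c("A", "B"))   # 33 A's then 33 B's
lab_gl   <- gl(11, 3, length = 66)            # 1,1,1,2,2,2,...,11,11,11, twice

# Every chalk-by-lab cell should contain exactly 3 observations.
cell_counts <- table(chalk, lab)
cell_counts
```

If any cell count differs from 3, the stacking order of the response column and the factor coding do not match.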

  7. Fitting the model • We fit a standard two-way ANOVA model to the data • In this case i = A, B, j = 1,…,11 and k = 1,…,3 • This is a balanced design because every CHALK × LAB cell has the same number of replicates, n = N/(IJ) = 66/(2 × 11) = 3 • What do we expect to see before we do any fitting? • We know the chalk types are different, so the factor CHALK should be significant • We expect the labs to perform about the same, so we hope that the factor LAB is not significant; if it is, this means that the quality of some of the labs is lower • We hope there is no difference in the quality of the results on the basis of chalk type, i.e. we hope there is no significant interaction between CHALK and LAB
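
For reference, the standard two-way ANOVA model with interaction is usually written as below. The symbol names are the common textbook choice and are an assumption here, with i indexing CHALK, j indexing LAB and k indexing the replicate.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2),
\qquad i = A, B; \quad j = 1, \dots, 11; \quad k = 1, \dots, 3
```

Here \alpha_i is the CHALK main effect, \beta_j is the LAB main effect, and (\alpha\beta)_{ij} is the CHALK by LAB interaction.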

  8. The fitted model • In the interests of numerical stability we multiply the responses by 1,000. • This multiplies the group means by 1,000 and the group variances and sums of squares by 1,000,000 • The results are still easy to understand, but we don’t need to worry as much about rounding error • We need to remember to undo this change if we wish to say anything in particular about the numerical value of the results
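
The effect of the rescaling can be checked numerically. The sketch below uses simulated data (the variable `density` and its values are invented for illustration, not the chalk readings) and fits the full model with `lm()` and `anova()`, which is one way to produce a table in the format shown on the next slide.

```r
set.seed(1)  # simulated data only; these are NOT the chalk readings

chalk <- gl(2, 33, labels = c("A", "B"))
lab   <- gl(11, 3, length = 66)
density   <- 0.5 + 0.2 * (chalk == "B") + rnorm(66, sd = 0.005)
density_k <- 1000 * density   # the rescaled response

raw    <- anova(lm(density   ~ chalk * lab))
scaled <- anova(lm(density_k ~ chalk * lab))

# Sums of squares grow by a factor of 1000^2 ...
ss_ratio <- scaled[["Sum Sq"]] / raw[["Sum Sq"]]
ss_ratio

# ... while the F statistics (ratios of mean squares) are unchanged.
all.equal(scaled[["F value"]], raw[["F value"]])
```

Since the F statistics and p-values do not change, the rescaling affects only how the table reads, not the conclusions drawn from it.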

  9. Analysis of Variance Table
     Response: Density
               Df Sum Sq Mean Sq    F value    Pr(>F)
     chalk      1 503215  503215 63503.1912 < 2.2e-16 ***
     lab       10   5223     522    65.9132 < 2.2e-16 ***
     chalk:lab 10    469      47     5.9247 1.313e-05 ***
     Residuals 44    349       8
     • We can see that we have some problems • CHALK is significant as we predicted BUT… • so are the LAB effects and • the CHALK*LAB interaction is significant as well • What does this mean? • Maybe a plot will help

  10. [Figure: interaction plot for CHALK and LAB]

  11. Interpreting the interaction plot • The interaction plot is interesting • It seems to offer contrary findings to our ANOVA table • Remember, if an interaction is significant, then the lines will generally cross or not be parallel • The lines here seem to be mostly parallel • In fact the plot is dominated by the difference between the chalks • This fact is key to our interpretation • Let’s go back to the ANOVA table
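
An interaction plot of this kind can be drawn with `interaction.plot()` from base R. The sketch below uses simulated data with a large chalk effect and small lab effects (all values are invented for illustration), so it shows the general shape of such a plot rather than the one from the lecture.

```r
set.seed(2)  # simulated data only; these are NOT the chalk readings

chalk <- gl(2, 33, labels = c("A", "B"))
lab   <- gl(11, 3, length = 66)
density <- 500 + 200 * (chalk == "B") + 2 * as.integer(lab) + rnorm(66, sd = 3)

# The plotted points are the cell means: one per lab-by-chalk combination.
cell_means <- tapply(density, list(lab, chalk), mean)

# One line per chalk type, labs along the x-axis.
interaction.plot(x.factor = lab, trace.factor = chalk, response = density,
                 xlab = "Lab", ylab = "Mean density (x 1000)",
                 trace.label = "Chalk")
```

When one factor dominates, as the chalk effect does here, the y-axis scale is set by that difference and any small departures from parallel lines are hard to see.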

  12. Percentage of variation explained • When we’re modelling data, our aim is to explain the data • In statistics, we measure how well we’ve explained the data by the percentage or proportion of variation in the data that the model accounts for. • If the model only explains a small amount of the variation, then the model does not explain the data well, i.e. a poor fit. • Conversely, if the model explains a large amount of the variation, then USUALLY the model does explain the data well, i.e. a good fit. • The reason we don’t automatically say the model is a good fit is that adding model parameters will always improve the fit

  13. Percentage of variation explained • When we work out the percentage of the total sum of squares (TSS) attributed to each of the model terms, we see that 98.8% comes from the difference between the chalks • Because the sums of squares are a measure of total variation, we can treat this as a measure of variation explained • It is fairly obvious that there is little increase in the relative quality of the fit with the addition of the lab and interaction terms • Furthermore, our interaction plot says that we’re unlikely to pick a different lab to do an analysis on the basis of the chalk type we’re looking at
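
The percentages can be worked out directly from the sums of squares in the ANOVA table for the full model. In the sketch below the numbers are copied from that table, and the vector name `ss` is just a label chosen for this example.

```r
# Sums of squares copied from the ANOVA table for the full model.
ss <- c(chalk = 503215, lab = 5223, "chalk:lab" = 469, Residuals = 349)

tss <- sum(ss)            # total sum of squares
pct <- 100 * ss / tss     # percentage of TSS attributed to each term

round(pct, 2)
```

The chalk term accounts for about 98.8% of the total, the lab term for roughly 1%, and the interaction and residual terms for less than 0.1% each.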

  14. Refitting the model • Having convinced ourselves that the additive model will explain the data well enough, we fit the reduced model
      Analysis of Variance Table
      Response: Density
                Df Sum Sq Mean Sq   F value    Pr(>F)
      chalk      1 503215  503215 33213.399 < 2.2e-16 ***
      lab       10   5223     522    34.474 < 2.2e-16 ***
      Residuals 54    818      15
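
Dropping the interaction term does not change the main-effect sums of squares in a balanced design. It simply pools the interaction sum of squares and degrees of freedom into the residual. The sketch below checks this arithmetic against the two tables. Because the inputs are the rounded values printed there, the recomputed F statistic agrees with the reported one only approximately.

```r
# Values copied (rounded) from the full-model ANOVA table.
ss_interaction <- 469; df_interaction <- 10
ss_residual    <- 349; df_residual    <- 44
ss_lab         <- 5223; df_lab        <- 10

# Pool the interaction into the residual for the additive model.
ss_resid_add <- ss_interaction + ss_residual   # 818, as in the reduced table
df_resid_add <- df_interaction + df_residual   # 54, as in the reduced table
ms_resid_add <- ss_resid_add / df_resid_add

# F statistic for LAB under the additive model (reported as 34.474).
f_lab <- (ss_lab / df_lab) / ms_resid_add
f_lab
```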

  15. Additive model • Examining the ANOVA table we can see that the main effects are still significant even though we haven’t accounted for the interaction • Remember that the aim of this experiment was not to prove that there is no difference between the chalks (we know there is), but to look at differences in accuracy between labs • A main effects plot shows us the effects due to each factor. • The group means are plotted on separate plots for each factor • In our example we have a plot for the chalk means and another plot for the lab means
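
A main effects plot needs only the group means for each factor. The sketch below computes them with `tapply()` and draws one panel per factor. The data are simulated for illustration (not the chalk readings), and the layout is one possible choice rather than the one used for the lecture figure.

```r
set.seed(3)  # simulated data only; these are NOT the chalk readings

chalk <- gl(2, 33, labels = c("A", "B"))
lab   <- gl(11, 3, length = 66)
density <- 500 + 200 * (chalk == "B") + 2 * as.integer(lab) + rnorm(66, sd = 3)

# Group means for each factor, averaging over the other factor.
chalk_means <- tapply(density, chalk, mean)
lab_means   <- tapply(density, lab, mean)

# One panel per factor, side by side.
op <- par(mfrow = c(1, 2))
plot(seq_along(chalk_means), as.numeric(chalk_means), type = "b", xaxt = "n",
     xlab = "Chalk", ylab = "Mean density (x 1000)", main = "CHALK")
axis(1, at = seq_along(chalk_means), labels = names(chalk_means))
plot(seq_along(lab_means), as.numeric(lab_means), type = "b", xaxt = "n",
     xlab = "Lab", ylab = "Mean density (x 1000)", main = "LAB")
axis(1, at = seq_along(lab_means), labels = names(lab_means))
par(op)
```

Giving each factor its own panel lets the lab means use their own y-axis scale, so differences between labs are not swamped by the much larger chalk difference.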

  16. Main effects plot

  17. Further considerations in two-way ANOVA (not examinable) • Linear contrasts and confidence intervals for interaction effects • Similar to those for one-way ANOVA • Two-way ANOVA with one replicate • We cannot fit a model with an interaction term • Since there is only one replicate, we can in fact drop the subscript k • Tukey’s test for non-additivity assumes our interaction is proportional to the product of the two main effects
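
The model behind Tukey’s test for non-additivity is usually written as below. The notation is the common textbook form and is an assumption here, with the subscript k dropped because there is one observation per cell.

```latex
y_{ij} = \mu + \alpha_i + \beta_j + \lambda\,\alpha_i \beta_j + \varepsilon_{ij},
\qquad H_0 : \lambda = 0
```

The interaction is restricted to the single term \lambda \alpha_i \beta_j, so it uses only one degree of freedom and can be tested even when there is no replication.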
