## PowerPoint Slideshow about 'MT2004' - kylia


Olivier GIMENEZ

Telephone: 01334 461827

E-mail: [email protected]

Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html

- So far, we’ve investigated the relationship between a response variable and one or several continuous explanatory variables
- The objective here is to study the relationship between a response variable Y and one or two discrete explanatory variables

13.1 One-way ANOVA

- Example: a standard measurement of the flammability of fabric is given by the length of the burnt portion of a piece of the fabric which has been held over a flame for a given time. An investigation to see whether or not there was a difference between the measurement obtained by 5 laboratories produced the following data.

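| Laboratory 1 | Laboratory 2 | Laboratory 3 | Laboratory 4 | Laboratory 5 |
|---|---|---|---|---|
| 2.9 | 2.7 | 3.3 | 3.3 | 4.1 |
| 3.1 | 3.4 | 3.3 | 3.2 | 4.1 |
| 3.1 | 3.6 | 3.5 | 3.4 | 3.7 |
| 3.7 | 3.2 | 3.5 | 2.7 | 4.2 |
| 3.1 | 4.0 | 2.8 | 2.7 | 3.1 |
| 4.2 | 4.1 | 2.8 | 3.3 | 3.5 |
| 3.7 | 3.8 | 3.2 | 2.9 | 2.8 |
| 3.9 | 3.8 | 2.8 | 3.2 | 3.5 |
| 3.1 | 4.3 | 3.8 | 2.9 | 3.7 |
| 3.0 | 3.4 | 3.5 | 2.6 | 3.5 |
| 2.9 | 3.3 | 3.8 | 2.8 | 3.9 |

Measurements of length obtained by 5 laboratories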

- The problem here is to compare several populations
- The technique we will use is the one-way analysis of variance
- This is a special case of the ANOVA introduced in the Regression Section
- Consider k distributions (or populations) with means $\mu_1, \dots, \mu_k$, and suppose we wish to test:
- $H_0$: $\mu_1 = \dots = \mu_k$
- against
- $H_1$: $\mu_1, \dots, \mu_k$ are not all equal
- WARNING: the alternative hypothesis does not require all the $\mu_i$ to be different, only that at least one pair differs. E.g. with k = 3, $\mu_1 = \mu_2 \neq \mu_3$ would satisfy $H_1$.

- In the example, we wish to test the null hypothesis that the means of lengths obtained by the k = 5 laboratories are the same.
- Suppose that we have random samples of sizes $n_1, \dots, n_k$ from the k distributions. Note that the random samples do not need to have the same sample size.
- $y_{ij}$ denotes the jth observation from the ith distribution, $i = 1, \dots, k$ and $j = 1, \dots, n_i$

- For example, in the table of laboratory measurements, $y_{23} = 3.6$ is the third observation from laboratory 2 and $y_{57} = 2.8$ is the seventh observation from laboratory 5.
- We will assume that $y_{ij}$ is an observation from a random variable $Y_{ij}$ where:
- $Y_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \dots, k$ and $j = 1, \dots, n_i$, with the $Y_{ij}$ independent
- We thus have that $E(Y_{ij}) = \mu_i$

- Actually, this model is a particular case of a multiple regression
- Define indicator variables $x_1, \dots, x_k$ by: $x_i = 1$ if the observation comes from the ith distribution, and $x_i = 0$ otherwise
- Then the equation $E(Y_{ij}) = \mu_i$ can be rewritten as:
- $E(Y_{ij}) = \mu_1 x_1 + \dots + \mu_k x_k$
- This equation defines a multiple regression without intercept
- Now, to test the null hypothesis, we can apply the results of the end of the Regression Section (we place equality restrictions on the full model)
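The regression view can be checked numerically. The sketch below is only an illustration, assuming the fabric data from the example; the object names (`len`, `fit_means`, `group_means`) are made up here and are not from the slides. Fitting `lm` without an intercept on the laboratory factor should return one coefficient per laboratory, equal to that laboratory's sample mean.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

# "0 +" drops the intercept: the design matrix is then made of the
# indicator variables x_1, ..., x_5, one per laboratory
fit_means <- lm(length ~ 0 + lab, data = fabric)

# Sample mean of each laboratory, for comparison
group_means <- tapply(fabric$length, fabric$lab, mean)

print(round(coef(fit_means), 4))
print(round(group_means, 4))
```

The two printed vectors should agree, which is the sense in which the one-way model is a regression on indicator variables.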

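- The full model has k parameters ($\mu_1, \dots, \mu_k$), thus $p_1 = k$.
- The submodel under $H_0$ is $E(Y_{ij}) = \mu$, thus $p_0 = 1$.
- We have $n = n_1 + \dots + n_k$ observations.
- So an appropriate statistic to test the null hypothesis is $F = \dfrac{(rss_0 - rss_1)/(k-1)}{rss_1/(n-k)}$, where $rss_0$ and $rss_1$ are the residual sums of squares of the submodel and of the full model.
- If $H_0$ is false (i.e. $\mu_1, \dots, \mu_k$ are not all equal), then this statistic will tend to take values larger than the upper quantile of an F distribution with k-1 and n-k degrees of freedom, so we reject $H_0$ for large values of F.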
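As a hedged illustration of this statistic, the sketch below fits the full model and the submodel to the fabric data of the example and builds F from the two residual sums of squares. The names `full`, `null`, `F_obs` and `F_crit` are illustrative assumptions, not notation from the slides.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

full <- lm(length ~ lab, data = fabric)  # one mean per laboratory (p1 = k)
null <- lm(length ~ 1,   data = fabric)  # common mean under H0   (p0 = 1)

rss1 <- sum(residuals(full)^2)
rss0 <- sum(residuals(null)^2)
n <- nrow(fabric)
k <- nlevels(fabric$lab)

# F statistic for the equality restrictions, and the 5% critical value
F_obs  <- ((rss0 - rss1) / (k - 1)) / (rss1 / (n - k))
F_crit <- qf(0.95, df1 = k - 1, df2 = n - k)

print(c(rss0 = rss0, rss1 = rss1, F = F_obs, F_crit = F_crit))
```

If `F_obs` exceeds `F_crit`, the data are not consistent with equal means at the 5% level.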

- We provide other expressions for $rss_0$ and $rss_1$, much easier to manipulate
- Let $\bar{y} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i} y_{ij}$ denote the overall sample mean
- Let $\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ denote the sample mean of the ith random sample (population)
- It can be shown that the total variability (SST) is the sum of the between (SSB) and within (SSW) variability:

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2 = \sum_{i=1}^{k} n_i(\bar{y}_i-\bar{y})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2$$

- It can also be shown that the maximum likelihood estimates are given:
- For the full model by: $\hat{\mu}_i = \bar{y}_i$
- For the submodel by: $\hat{\mu} = \bar{y}$

- And that: $rss_1 = SSW$ and $rss_0 = SST$, so that $rss_0 - rss_1 = SSB$

- If we define: $MSB = SSB/(k-1)$ and $MSW = SSW/(n-k)$
- Then $F = \dfrac{(rss_0 - rss_1)/(k-1)}{rss_1/(n-k)}$
- Becomes: $F = MSB / MSW$

- Most often, the sums of squares, mean squares, F values, p-values are displayed in an ANOVA table

| Source | df | Sum of squares | Mean square | F |
|---|---|---|---|---|
| Between groups | k - 1 | SSB | MSB | MSB / MSW |
| Within groups (residual) | n - k | SSW | MSW | |
| Total | n - 1 | SST | | |

- With $MSB = SSB/(k-1)$ and $MSW = SSW/(n-k)$
- Note that the within mean square MSW is an unbiased estimator of the variance $\sigma^2$; its square root is called the residual standard error
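The decomposition can be verified directly from the definitions of the overall mean and the group means. This is a minimal sketch on the fabric data of the example; the variable names (`SSB`, `SSW`, `SST`, `MSB`, `MSW`) are chosen here for illustration.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
lab <- factor(rep(1:5, each = 11))

n <- length(len)
k <- nlevels(lab)

ybar     <- mean(len)      # overall sample mean
ybar_obs <- ave(len, lab)  # group mean, repeated for every observation

SST <- sum((len - ybar)^2)       # total variability
SSB <- sum((ybar_obs - ybar)^2)  # between: sum over i of n_i * (ybar_i - ybar)^2
SSW <- sum((len - ybar_obs)^2)   # within

MSB <- SSB / (k - 1)
MSW <- SSW / (n - k)

print(c(SST = SST, SSB = SSB, SSW = SSW, MSB = MSB, MSW = MSW, F = MSB / MSW))
```

Summing `(ybar_obs - ybar)^2` over all observations is the same as weighting each squared group deviation by its group size, which is why no explicit `n_i` appears in the code.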

13.1.1 One-way ANOVA in R


- We wish to test the null hypothesis:
- $H_0$: $\mu_1 = \dots = \mu_5$
- Against the alternative hypothesis
- $H_1$: at least one pair of the $\mu_i$ are not equal
- Where $\mu_i$ is the mean length of burnt fabric in measurements from laboratory i (i = 1,…, 5)

> lengthlab1<-c(2.9,3.1,3.1,3.7,3.1,4.2,3.7,3.9,3.1,3.0,2.9)

> lengthlab2<-c(2.7,3.4,3.6,3.2,4.0,4.1,3.8,3.8,4.3,3.4,3.3)

> lengthlab3<-c(3.3,3.3,3.5,3.5,2.8,2.8,3.2,2.8,3.8,3.5,3.8)

> lengthlab4<-c(3.3,3.2,3.4,2.7,2.7,3.3,2.9,3.2,2.9,2.6,2.8)

> lengthlab5<-c(4.1,4.1,3.7,4.2,3.1,3.5,2.8,3.5,3.7,3.5,3.9)

> lab1 <- rep(1,11)

> lab2 <- rep(2,11)

> lab3 <- rep(3,11)

> lab4 <- rep(4,11)

> lab5 <- rep(5,11)

> fabric<-data.frame(lab=c(lab1,lab2,lab3,lab4,lab5),length=c(lengthlab1,lengthlab2,lengthlab3,lengthlab4,lengthlab5))

> plot(fabric$lab,fabric$length)

> reglab <- lm(length~as.factor(lab), data=fabric)

> reglab

Call:

lm(formula = length ~ as.factor(lab), data = fabric)

Coefficients:

(Intercept) as.factor(lab)2 as.factor(lab)3 as.factor(lab)4

3.33636 0.26364 -0.03636 -0.33636

as.factor(lab)5

0.30909

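The lm command fits the model $E(Y_{ij}) = \mu_1 x_1 + \dots + \mu_5 x_5$, but reports it in R's default treatment parameterization: the intercept estimates $\mu_1$ (the mean for laboratory 1) and each remaining coefficient estimates the difference $\mu_i - \mu_1$.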

> anova(reglab)

Analysis of Variance Table

Response: length

Df Sum Sq Mean Sq F value Pr(>F)

as.factor(lab) 4 2.9865 0.7466 4.5346 0.003337 **

Residuals 50 8.2327 0.1647

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

- The R command anova applied to the regression object reglab produces the ANOVA table
- The p-value (0.0033) is small, so we reject $H_0$: $\mu_1 = \dots = \mu_5$
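The p-value in the ANOVA table is the upper tail probability of an F distribution with 4 and 50 degrees of freedom, evaluated at the observed F value. A short sketch, using the figures printed in the table (the variable names are illustrative):

```r
# Values read off the ANOVA table for the fabric data
F_value    <- 4.5346
df_between <- 4   # k - 1
df_within  <- 50  # n - k

# Upper tail probability P(F >= F_value) under H0
p_value <- pf(F_value, df1 = df_between, df2 = df_within, lower.tail = FALSE)

# Critical value for a test at the 5% level
F_crit <- qf(0.95, df1 = df_between, df2 = df_within)

print(c(p_value = p_value, F_crit = F_crit))
```

Rejecting when `F_value > F_crit` is equivalent to rejecting when `p_value < 0.05`.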

Checking the assumptions

- It is crucial to check the assumptions of the ANOVA model, in particular:
- Normality: the observations in each group come from a normal distribution; use a QQ plot of the residuals
- Homogeneity of variance: the variances are equal (= $\sigma^2$); inspect the group variances

1. The observations in each group come from a normal distribution.

We check normality of the residuals (each measurement minus the mean of its laboratory):

> resfab1<-lengthlab1-mean(lengthlab1)

> resfab2<-lengthlab2-mean(lengthlab2)

> resfab3<-lengthlab3-mean(lengthlab3)

> resfab4<-lengthlab4-mean(lengthlab4)

> resfab5<-lengthlab5-mean(lengthlab5)

> resfab<-c(resfab1,resfab2,resfab3,resfab4,resfab5)

> qqnorm(resfab)

> qqline(resfab)

Normality is OK…
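The same residuals can be taken from the fitted model rather than computed group by group. The sketch below is an illustration under that assumption; `res_lm`, `res_hand` and `qq` are names introduced here. It uses `qqnorm(..., plot.it = FALSE)` to obtain the QQ plot coordinates without drawing, so that a numerical summary can be printed.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

fit <- lm(length ~ lab, data = fabric)

res_lm   <- residuals(fit)                               # residuals from the model
res_hand <- fabric$length - ave(fabric$length, fabric$lab)  # y_ij minus group mean

# Coordinates of the normal QQ plot, without drawing it
qq <- qqnorm(res_lm, plot.it = FALSE)

# Correlation between theoretical and sample quantiles:
# values close to 1 mean the points lie close to a straight line
print(cor(qq$x, qq$y))

# To draw the plot itself: qqnorm(res_lm); qqline(res_lm)
```

The correlation is only a rough numerical companion to the plot; the visual check remains the method used in the slides.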

2. The variances are equal ($\sigma^2$).

We inspect the sample variance of each group:

> var(lengthlab1)

[1] 0.2045455

> var(lengthlab2)

[1] 0.212

> var(lengthlab3)

[1] 0.138

> var(lengthlab4)

[1] 0.082

> var(lengthlab5)

[1] 0.1867273

Variances are roughly equal: the largest (0.212) is about 2.6 times the smallest (0.082)
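The five variances can be obtained in one call with `tapply`. The sketch below also runs Bartlett's test of equal variances (`bartlett.test` from the stats package); that test is an addition for illustration and is not part of the slides, which rely on visual inspection only.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

# Sample variance within each laboratory
group_var <- tapply(fabric$length, fabric$lab, var)
print(round(group_var, 4))

# Ratio of the largest to the smallest variance
print(max(group_var) / min(group_var))

# With equal group sizes, the average of the group variances is MSW
pooled_var <- mean(group_var)
print(pooled_var)

# Not in the slides: a formal test of homogeneity of variance
print(bartlett.test(length ~ lab, data = fabric))
```

The pooled value should match the residual mean square in the ANOVA table.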

13.1.2 Least Significant Differences

- When performing an ANOVA, we wish to test the null hypothesis:
- $H_0$: $\mu_1 = \dots = \mu_5$
- Against the alternative hypothesis
- $H_1$: at least one pair of the $\mu_i$ are not equal
- So if the null hypothesis is rejected, the question is which differences between groups are most important
- In other words, we wish to test $H_0$: $\mu_i = \mu_j$ for each pair $i \neq j$

- An appropriate test to compare the groups in pairs is the 2-sample t-test
- Under $H_0$: $\mu_i = \mu_j$, we have that $\bar{Y}_i - \bar{Y}_j \sim N\left(0, \sigma^2\left(\frac{1}{n_i} + \frac{1}{n_j}\right)\right)$
- But $\sigma^2$ is unknown
- We will replace $\sigma^2$ by $s^2 = rss_1 / (n-k) = MSW$ (given in the ANOVA table)
- On one hand, we have that $s^2$ is independent of $\bar{Y}_i - \bar{Y}_j$
- On the other hand: $(n-k)\,s^2/\sigma^2 \sim \chi^2_{n-k}$
- So, an appropriate test statistic is: $t = \dfrac{\bar{y}_i - \bar{y}_j}{s\sqrt{1/n_i + 1/n_j}}$, which follows a t distribution with n-k degrees of freedom under $H_0$
- To be compared with the quantile $t_{0.025;\,n-k}$ for a 2-sided test at the 5% level
- WARNING: This is not exactly the same formula as for the 2-sample t-test since:
- $s^2$ is calculated using all the data, and not just groups i and j
- the number of degrees of freedom is $n - k$ rather than $n_i + n_j - 2$
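To make the pairwise test concrete, the sketch below compares laboratories 5 and 4 in the fabric data, using the pooled $s^2$ from all five groups. It is an illustration only; the names `t_obs`, `t_crit` and `p_val` are not from the slides.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

fit   <- lm(length ~ lab, data = fabric)
df_w  <- df.residual(fit)               # n - k = 50
s2    <- sum(residuals(fit)^2) / df_w   # pooled variance estimate (MSW)

groups <- split(fabric$length, fabric$lab)
means  <- sapply(groups, mean)
sizes  <- sapply(groups, length)

# Compare laboratory i with laboratory j
i <- "5"
j <- "4"
t_obs  <- (means[[i]] - means[[j]]) / sqrt(s2 * (1 / sizes[[i]] + 1 / sizes[[j]]))
t_crit <- qt(0.975, df_w)               # 2-sided test at the 5% level
p_val  <- 2 * pt(-abs(t_obs), df_w)

print(c(t = t_obs, t_crit = t_crit, p = p_val))
```

Changing `i` and `j` gives the test for any other pair; the pooled `s2` and its degrees of freedom stay the same.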

- If the samples are of equal size, i.e. $n_1 = \dots = n_k$, it is easier to calculate the smallest difference in sample means leading to rejection of the null hypothesis that the 2 groups have equal means
- This is called the Least Significant Difference (LSD)
- With k groups, each with m observations (n = mk), the LSD for significance level $\alpha$ is: $LSD = t_{\alpha/2;\,n-k}\sqrt{2 s^2 / m}$
- Once the LSD is calculated, look for the pairs of groups with sample means differing by more than the LSD

Example (Fabric data):

The Least Significant Difference for significance level $\alpha = 0.05$, with n = 55, k = 5, m = 11 and $s^2 = MSW = 0.1647$, is:

> LSD <- qt(0.975,55-5)*sqrt(2*0.1647/11)

> LSD

[1] 0.3475762

We calculate the sample mean for each group, listed from largest to smallest:

> mean(lengthlab5)

[1] 3.645455

> mean(lengthlab2)

[1] 3.6

> mean(lengthlab1)

[1] 3.336364

> mean(lengthlab3)

[1] 3.3

> mean(lengthlab4)

[1] 3

And then look for the pairs of groups with sample means differing by more than the LSD = 0.3475762

Laboratories 4 and 2 differ by 0.600 and laboratories 4 and 5 by 0.645. The next largest differences, 0.345 (laboratories 5 and 3) and 0.336 (laboratories 1 and 4), fall just below the LSD.

Suggests that $\mu_4 < \mu_2$ and $\mu_4 < \mu_5$,

but does not suggest any other differences between the $\mu_i$
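The comparison of every pair against the LSD can be automated. This is a sketch under the assumption of equal group sizes (m = 11); the names `diffs` and `flagged` are illustrative.

```r
# Fabric data from the example: 11 burnt lengths for each of 5 laboratories
len <- c(2.9, 3.1, 3.1, 3.7, 3.1, 4.2, 3.7, 3.9, 3.1, 3.0, 2.9,  # lab 1
         2.7, 3.4, 3.6, 3.2, 4.0, 4.1, 3.8, 3.8, 4.3, 3.4, 3.3,  # lab 2
         3.3, 3.3, 3.5, 3.5, 2.8, 2.8, 3.2, 2.8, 3.8, 3.5, 3.8,  # lab 3
         3.3, 3.2, 3.4, 2.7, 2.7, 3.3, 2.9, 3.2, 2.9, 2.6, 2.8,  # lab 4
         4.1, 4.1, 3.7, 4.2, 3.1, 3.5, 2.8, 3.5, 3.7, 3.5, 3.9)  # lab 5
fabric <- data.frame(lab = factor(rep(1:5, each = 11)), length = len)

fit  <- lm(length ~ lab, data = fabric)
df_w <- df.residual(fit)
s2   <- sum(residuals(fit)^2) / df_w   # MSW
m    <- 11                             # observations per laboratory

LSD <- qt(0.975, df_w) * sqrt(2 * s2 / m)

# Named vector of laboratory means
means <- sapply(split(fabric$length, fabric$lab), mean)

# diffs[i, j] = mean of lab i minus mean of lab j
diffs   <- outer(means, means, "-")
flagged <- abs(diffs) > LSD

# Each pair once (upper triangle): rows and columns are laboratory numbers
print(round(LSD, 4))
print(which(flagged & upper.tri(flagged), arr.ind = TRUE))
```

Only the pairs printed by the last line have sample means further apart than the LSD.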

13.2 Two-way ANOVA

- So far, we've considered only one discrete explanatory variable (lab in the fabric data example)
- Let's assume now that each observation is classified by 2 discrete variables, i.e. belongs to 2 groups
- This is a 2-way ANOVA
- Example: consider a reading comprehension test given to pupils of age 9, 10 and 11 from 4 schools (A, B, C and D), giving the scores:

| School | Age 9 | Age 10 | Age 11 |
|---|---|---|---|
| A | 71 | 92 | 89 |
| B | 44 | 51 | 85 |
| C | 50 | 64 | 72 |
| D | 67 | 81 | 86 |

- In the R code, schools A to D are coded 1 to 4 and ages 9 to 11 are coded 1 to 3
- So observation/score $y_i$ belongs to school j (j = 1,..., J) and age k (k = 1,..., K)
- $y_i$ is an observation of an independent r.v. $Y_i \sim N(\mu_i, \sigma^2)$

- There are four models of potential interest. Write $\alpha_j$ for the effect of school j and $\beta_k$ for the effect of age k.
- Model 3: the expected comprehension score $E(Y_i) = \mu_i$ is the sum of a school effect and an age effect: $\mu_i = \mu + \alpha_j + \beta_k$
- Model 1: the expected comprehension score is the result of a school effect only ($\beta_k = 0$ for all k = 1,..., K): $\mu_i = \mu + \alpha_j$
- Model 2: the expected comprehension score is the result of an age effect only ($\alpha_j = 0$ for all j = 1,..., J): $\mu_i = \mu + \beta_k$
- Model 0: the expected comprehension score is the result of neither a school effect nor an age effect: $\mu_i = \mu$

- The ANOVA table for comparing models 0, 1, 2 to model 3 is:

| Source | df | Sum of squares | Mean square | F |
|---|---|---|---|---|
| School | J - 1 | SSS | MSS = SSS / (J - 1) | MSS / $s^2$ |
| Age | K - 1 | SSA | MSA = SSA / (K - 1) | MSA / $s^2$ |
| Residual | n - J - K + 1 | rss | $s^2$ = rss / (n - J - K + 1) | |
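The four models can be fitted and compared directly. The sketch below uses the reading comprehension scores of the example; labelling the schools A to D and the ages 9 to 11 is an assumption made here for readability, and the object names `m0` to `m3` are illustrative.

```r
# Reading comprehension scores: 4 schools x 3 ages, one score per cell
score  <- c(71, 92, 89, 44, 51, 85, 50, 64, 72, 67, 81, 86)
school <- factor(rep(c("A", "B", "C", "D"), each = 3))
age    <- factor(rep(c(9, 10, 11), times = 4))

m0 <- lm(score ~ 1)             # Model 0: no school effect, no age effect
m1 <- lm(score ~ school)        # Model 1: school effect only
m2 <- lm(score ~ age)           # Model 2: age effect only
m3 <- lm(score ~ school + age)  # Model 3: school effect + age effect

# Each comparison tests the restrictions that turn model 3 into the submodel
print(anova(m2, m3))  # school effect (alpha_j = 0 for all j)
print(anova(m1, m3))  # age effect    (beta_k = 0 for all k)
print(anova(m0, m3))  # both effects at once

# Residual sums of squares of the four models
print(sapply(list(m0 = m0, m1 = m1, m2 = m2, m3 = m3), deviance))
```

Because the design is balanced (one score for every school and age combination), the F values from `anova(m2, m3)` and `anova(m1, m3)` coincide with the school and age rows of the single two-way table.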

13.2.1 Two-way ANOVA in R


> score <-c(71,92,89,44,51,85,50,64,72,67,81,86)

> school=c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))

> age=rep(c(1,2,3),4)

> data <- data.frame(school,age,score)

> data

school age score

1 1 1 71

2 1 2 92

3 1 3 89

4 2 1 44

5 2 2 51

6 2 3 85

7 3 1 50

8 3 2 64

9 3 3 72

10 4 1 67

11 4 2 81

12 4 3 86

# produce the ANOVA table:

> anova(lm(score~as.factor(school)+as.factor(age)))

Analysis of Variance Table

Response: score

Df Sum Sq Mean Sq F value Pr(>F)

as.factor(school) 3 1260.00 420.00 6.2069 0.02861 *

as.factor(age) 2 1256.00 628.00 9.2808 0.01458 *

Residuals 6 406.00 67.67

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The school and age rows give the between variability (i.e. between sum of squares and between mean square) for each factor

The residual mean square, 67.67, is $s^2$, an unbiased estimate of $\sigma^2$

MSS = SSS / dfS = 1260 / 3 = 420

MSA = SSA / dfA = 1256 / 2 = 628

FA = MSA / $s^2$ = 628 / 67.67 = 9.28

to be compared with $F_{0.05;\,2,6}$, the upper 5% quantile of an F distribution with 2 and 6 degrees of freedom

both school and age effects are significant at the 5% significance level

The observed value of the test statistic of

$H_0$: $\alpha_j = 0$ and $\beta_k = 0$ for all j, k (no school effect and no age effect)

is ((1260+1256)/5) / 67.67 = 7.44

to be compared with $F_{0.05;\,5,6}$

The p-value is 0.015
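The F values and p-values can be recomputed from the sums of squares and degrees of freedom printed in the table. A minimal sketch, with illustrative variable names:

```r
# Sums of squares and degrees of freedom read off the ANOVA table
ss_school <- 1260; df_school <- 3
ss_age    <- 1256; df_age    <- 2
ss_res    <- 406;  df_res    <- 6

s2 <- ss_res / df_res  # residual mean square, estimate of sigma^2

# Separate tests for each factor
F_school <- (ss_school / df_school) / s2
F_age    <- (ss_age / df_age) / s2
p_school <- pf(F_school, df_school, df_res, lower.tail = FALSE)
p_age    <- pf(F_age,    df_age,    df_res, lower.tail = FALSE)

# Joint test of no school effect and no age effect
df_joint <- df_school + df_age
F_joint  <- ((ss_school + ss_age) / df_joint) / s2
p_joint  <- pf(F_joint, df_joint, df_res, lower.tail = FALSE)

print(round(c(F_school = F_school, p_school = p_school,
              F_age = F_age, p_age = p_age,
              F_joint = F_joint, p_joint = p_joint), 4))
```

All three p-values are upper tail probabilities, since large values of F are evidence against the null hypothesis.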

13.2.2 Least Significant Differences

- If one or both effects are significant, then we'd like to know which differences between groups (for each effect) are most important
- Similar reasoning to that employed in the one-way ANOVA gives the $100\alpha$% LSD for the 2 effects. Each level mean of one factor is an average over the levels of the other factor, and $s^2$ has n - J - K + 1 degrees of freedom:

$t_{\alpha/2;\,n-J-K+1}\sqrt{2 s^2 / K}$ for the factor with J categories

$t_{\alpha/2;\,n-J-K+1}\sqrt{2 s^2 / J}$ for the factor with K categories

- For example, as the school factor is significant in the reading comprehension test example, we wish to figure out which schools differ
- The 5% LSD for a difference in mean score between schools is $t_{0.025;\,6}\sqrt{2 \times 67.67 / 3} = 2.447 \times 6.716 = 16.43$

with J = 4 schools and K = 3 age groups (each school mean is an average of K = 3 scores)

> mean(score[school==1])

[1] 84

> mean(score[school==4])

[1] 78

> mean(score[school==3])

[1] 62

> mean(score[school==2])

[1] 60

We calculate the sample mean for each school

And then look for the pairs of groups with sample means differing by more than the LSD = 16.43

Suggests that A > B (difference 24), A > C (difference 22) and D > B (difference 18)

The difference between D and C (16) falls just below the LSD, and the differences between A and D (6) and between C and B (2) are well below it, so the LSD does not suggest any other differences between the school effects
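The LSD for each factor depends on how many scores are averaged in each level mean: every school mean averages K = 3 scores (one per age) and every age mean averages J = 4 scores (one per school). The sketch below computes both under that assumption; labelling schools A to D and ages 9 to 11, and the names `lsd_school` and `lsd_age`, are illustrative choices made here.

```r
# Reading comprehension scores: 4 schools x 3 ages, one score per cell
score  <- c(71, 92, 89, 44, 51, 85, 50, 64, 72, 67, 81, 86)
school <- factor(rep(c("A", "B", "C", "D"), each = 3))
age    <- factor(rep(c(9, 10, 11), times = 4))

fit    <- lm(score ~ school + age)
df_res <- df.residual(fit)                 # n - J - K + 1 = 6
s2     <- sum(residuals(fit)^2) / df_res   # residual mean square
t_q    <- qt(0.975, df_res)                # 5% level, 2-sided

J <- nlevels(school)
K <- nlevels(age)

# Standard error of a difference of two means, each averaging m scores,
# is sqrt(2 * s2 / m)
lsd_school <- t_q * sqrt(2 * s2 / K)  # school means average K scores
lsd_age    <- t_q * sqrt(2 * s2 / J)  # age means average J scores

school_means <- sapply(split(score, school), mean)
age_means    <- sapply(split(score, age), mean)

print(round(c(lsd_school = lsd_school, lsd_age = lsd_age), 3))
print(school_means)
print(age_means)

# Pairs of schools whose means differ by more than the school LSD
d <- outer(school_means, school_means, "-")
print(which(abs(d) > lsd_school & upper.tri(d), arr.ind = TRUE))
```

The same `outer` comparison with `age_means` and `lsd_age` gives the pairs of ages that differ.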
