
Lecture 2 Linear Models I



  1. Advanced Research Skills Lecture 2 Linear Models I Olivier MISSA, om502@york.ac.uk

  2. Outline • What are Linear models • Basic assumptions • "Refresher" on different types of model: • Single Linear Regression • 2 sample t-test • Anova (one-way & two-way) • Ancova • Multiple Linear Regression

  3. What are linear models ? • Put simply, they are models that attempt to "explain" one response variable as a linear combination of several predictor variables. In theory, the response variable and the predictor variables can each be continuous, discrete or categorical, which makes linear models particularly versatile. Indeed, many well-known statistical procedures are linear models, e.g. linear regression, two-sample t-test, one-way & two-way ANOVA, ANCOVA, ...
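
In symbols (a generic form using the notation that appears later in these slides, not tied to any particular dataset), a linear model with p predictor variables can be written as:

y_i = β0 + β1·x_i1 + β2·x_i2 + ... + βp·x_ip + ε_i , with the ε_i independent and identically distributed N(0, σ²).

"Linear" refers to the β parameters entering the model linearly; the x's themselves may be transformed (e.g. x², log x).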

  4. What are linear models ? • For the time being, we are going to assume that • (1) the response variable is continuous, • (2) the residuals (ε) are normally distributed, and • (3) they are independently and identically distributed. These "three" assumptions define classical linear models. We will see in later lectures ways to relax these assumptions and cope with an even wider range of situations (generalized linear models, mixed-effects models). But let's take things one step at a time.

  5. Starting with a simple example • Single Linear Regression • 1 response variable (continuous) vs. • 1 explanatory variable, either continuous or discrete (but ordinal !). • Attempts to fit a straight line, y = a + b·x > library(faraway) > data(seatpos) > attach(seatpos) > model <- lm(Weight ~ Ht) > model Coefficients: (Intercept) Ht -292.992 2.653

  6. But how "significant" is this trend ? • Can be approached in a number of ways • 1) R², the proportion of variance explained. The Total Sum of Squares (TotalSS) measures the spread of the observations around their average value; the Residual Sum of Squares (RSS, the deviance) measures their spread around the fitted values. Here TotalSS = 47371 and RSS = 14853, so R² = 1 − RSS/TotalSS ≈ 0.686.

  7. 1) R², the proportion of variance explained (continued). The Sum of Squares due to regression is SSreg = TotalSS − RSS = 47371 − 14853 = 32518, so R² can equivalently be computed as SSreg / TotalSS.
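
As a quick check, these sums of squares can be recomputed by hand (a minimal sketch, assuming the seatpos data and the model object from slide 5 are still attached; TSS, RSS and SSreg are just illustrative object names):

> TSS <- sum((Weight - mean(Weight))^2)   ## Total Sum of Squares, ~47371
> RSS <- sum(residuals(model)^2)          ## Residual Sum of Squares, ~14853
> SSreg <- TSS - RSS                      ## Sum of Squares due to regression, ~32518
> SSreg / TSS                             ## R-squared, ~0.686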

  8. 1) R2 the proportion of variance explained. > summary(model) Call: lm(formula = Weight ~ Ht) Residuals: Min 1Q Median 3Q Max -29.388 -12.213 -4.876 4.945 59.586 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 20.31 on 36 degrees of freedom Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777 F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10

  9. 2) model parameters and their standard errors > summary(model) ... some outputs left out Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 *** ... How many standard errors away from 0 are the estimates ?
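
The t values in this table are simply the estimates divided by their standard errors; a minimal sketch of reproducing the Ht row by hand, with the numbers taken from the output above:

> 2.6533 / 0.2989                               ## t value ≈ 8.88
> 2 * pt(8.878, df = 36, lower.tail = FALSE)    ## two-sided p-value ≈ 1.35e-10, as in the table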

  10. 3) F-test. The sums of squares are converted into mean squares, MSreg = SSreg / dfreg and MSE = RSS / rdf, and the test statistic is F = MSreg / MSE. > anova(model) Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Ht 1 32518 32518 78.816 1.351e-10 *** Residuals 36 14853 413 ... The F value so obtained must be compared to the theoretical F distribution with 1 (numerator) and 36 (denominator) degrees of freedom.
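
A minimal sketch of the same calculation done by hand, using the values from the table above:

> (32518 / 1) / (14853 / 36)                       ## F = MSreg / MSE ≈ 78.8
> pf(78.82, df1 = 1, df2 = 36, lower.tail = FALSE) ## p-value ≈ 1.35e-10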

  11. How strong is this trend ? The plot shows the fitted line with a 95% confidence interval around the slope and a 95% band for future observations (observations #13, #22 and #31 are labelled).
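
A minimal sketch of how such bands can be drawn in R (the object names new, conf and pred are purely illustrative; model is the fit from slide 5):

> new <- data.frame(Ht = seq(min(Ht), max(Ht), length = 50))
> conf <- predict(model, new, interval = "confidence")   ## 95% CI around the fitted line
> pred <- predict(model, new, interval = "prediction")   ## 95% band for future observations
> plot(Ht, Weight)
> abline(model)
> matlines(new$Ht, conf[, 2:3], lty = 2)
> matlines(new$Ht, pred[, 2:3], lty = 3)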

  12. Are the assumptions met ? • 1: Is the response variable continuous ? YES ! • 2: Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(model$residuals) ## Anderson-Darling Anderson-Darling normality test data: model$residuals A = 1.9511, p-value = 4.502e-05 > plot(model, which=2) ## qqplot of std. residuals Answer: NO ! due to a few outliers ?

  13. 3a : Are the residuals independent ? > plot(model, which=1) ## residuals vs fitted values. Answer: looks OK ! (For comparison, a bad example is also shown, in which a non-linear trend remains in the residuals.)

  14. 3b : Are the residuals identically distributed ? > plot(model, which=3) ## sqrt(abs(standardized residuals)) vs fitted values. Answer: Not perfect, but OK ! (The panels are labelled "OK" and "Bad"; the bad example from the previous slide is continued.)

  15. Another bad example of residuals: heteroscedasticity, i.e. a non-constant spread of the residuals (each of the three panels is labelled "Bad").

  16. Is our model OK despite having outliers ? • Possible approaches • 1) Repeat the analysis, removing the outliers, and check how the model parameters change. > summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -292.9915 50.6402 -5.786 1.34e-06 *** Ht 2.6533 0.2989 8.878 1.35e-10 *** > model2 <- lm(Weight ~ Ht, subset=seq(38)[-22]) ## drop observation #22 > summary(model2) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -213.8000 47.4169 -4.509 7.00e-05 *** Ht 2.1731 0.2813 7.727 4.52e-09 *** With all 38 observations R² = 0.686; minus obs. #22, R² = 0.630.

  17. Possible approaches • 2) Examine directly the impact each observation has on the model output. • This depends on (a) its leverage (hii) : "how extreme each x-value is compared to the others". A point with high leverage has the potential to influence the fit strongly. When there is only one predictor variable: Σ hii = 2. In general (for later): Σ hii = p, the number of parameters in the model.

  18. However: the leverage is only part of the story • The direct impact of an observation depends also on (b) its residual • A point with strong leverage but small residual is not a problem: the model adequately accounts for it. • A point with weak leverage but large residual is not a problem either: the model is weakly affected by it. • A point with strong leverage and large residual, however, strongly influences the model. • Removing the latter point will usually modify the model output to some extent.
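Both ingredients are easy to extract in R (a minimal sketch, again using the model fitted on slide 5; h and r are just illustrative names):

> h <- hatvalues(model)    ## leverages h_ii
> sum(h)                   ## equals 2 here (intercept + slope)
> r <- rstandard(model)    ## standardized residuals
> plot(h, r)               ## high leverage AND large residual = influential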

  19. Influential Observations > plot(model, which=5) ## standardized residuals vs leverage. Observations #13, #22 and #31 stand out on this plot.

  20. Cook's distance • A direct measure of influence, combining both leverages and residuals: Di = Σj ( ŷj − ŷj(i) )² / ( p · MSE ), where ŷj are the original fitted values, ŷj(i) are the fitted values when point i is omitted, p is the number of parameters, and MSE = RSS / (n − p) is the Mean Square Error.

  21. Cook's distance • Combining both leverages and residuals, Di can also be written as Di = ( ri² / p ) · hii / (1 − hii), where hii are the leverage values and ri = ei / ( s·√(1 − hii) ) are the standardised residuals obtained from the raw residuals ei. Any Di value above 0.5 deserves a closer look, and any Di value above 1 merits special attention.

  22. Cook's distance • Combining both leverages and residuals. Any Di value above 0.5 deserves a closer look, and any Di value above 1 merits special attention.
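
A minimal sketch of computing and screening Cook's distances for the current fit (the object name D is illustrative):

> D <- cooks.distance(model)   ## one value per observation
> which(D > 0.5)               ## observations deserving a closer look
> plot(model, which = 4)       ## Cook's distance plotted per observation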

  23. Remove point #22 ? • Although point #22 is influential, it does not invalidate the model. • We should probably keep it, but note that the regression slope may be shallower than the model suggests. • A "reason" for the presence of these outliers can be suggested: some obese people were included in the survey. • Depending on the purpose of the model, you may want to keep or remove these outliers. (Observations #13, #22 and #31 are again highlighted on the accompanying plot.)

  24. Next : the " 2-sample t-test ". First, let's have a look at the classical t-test: the t value is the difference between the two group means divided by the Standard Error of the Difference of Means (SEDM), which is itself built from the Standard Error of the Mean (SEM) of each group. > library(ISwR) > data(energy) ## in ISwR package > attach(energy) > plot(expend ~ stature) > stripchart(expend ~ stature, method="jitter")
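
A minimal sketch of these quantities computed by hand (Welch-style, allowing unequal variances; m, s, n, sem and sedm are illustrative names):

> m <- tapply(expend, stature, mean)     ## group means
> s <- tapply(expend, stature, sd)       ## group standard deviations
> n <- tapply(expend, stature, length)   ## group sizes
> sem <- s / sqrt(n)                     ## SEM of each group
> sedm <- sqrt(sum(sem^2))               ## SE of the difference of means
> (m["lean"] - m["obese"]) / sedm        ## t value ≈ -3.86, as in t.test() below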

  25. Classical 2-sample t-test, Assuming Unequal Variance (the SEDM is then built from the two separate group variances). > t.test(expend ~ stature) Welch Two Sample t-test data: expend by stature t = -3.8555, df = 15.919, p-value = 0.001411 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.459167 -1.004081 sample estimates: mean in group lean mean in group obese 8.066154 10.297778

  26. Classical 2-sample t-test, Assuming Equal Variance > t.test(expend~stature, var.equal=T) Two Sample t-test data: expend by stature t = -3.9456, df = 20, p-value = 0.000799 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.411451 -1.051796 sample estimates: mean in group lean mean in group obese 8.066154 10.297778 With equal variance assumed, the result is slightly more significant.

  27. " 2-sample t-test " as a linear model The t-test could "logically" translate into : βΔobese β0 when X= "obese" when X= "lean" βΔlean However: One of these β parameters is superfluous, and actually makes the model fitting through matrix algebra impossible. So, instead it is usually translated into: βΔobese βdefault for "all" X when X= "obese"

  28. " 2-sample t-test " as a linear model Remember : linear models assume Equal Variance > mod <- lm(expend~stature) > summary(mod) ... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 factor w/ only two levels: lean & obese doesn't look symmetric Average for lean category difference between lean and obese averages

  29. " 2-sample t-test " as a linear model Remember : linear models assume Equal Variance > mod <- lm(expend~stature) > summary(mod) ... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Standard Error of the Mean (SEM) for lean Standard Error of the Difference of the Means (SEDM)

  30. " 2-sample t-test " as a linear model Remember : linear models assume Equal Variance > mod <- lm(expend~stature) > summary(mod) ... Residuals: Min 1Q Median 3Q Max -1.9362 -0.6153 -0.4070 0.2613 2.8138 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.304 on 20 degrees of freedom Multiple R-squared: 0.4377, Adjusted R-squared: 0.4096 F-statistic: 15.57 on 1 and 20 DF, p-value: 0.000799 Same values as classical t-test (equal variance)

  31. " 2-sample t-test " as a linear model. As before, MSmodel = SSmodel / dfmodel, MSE = RSS / rdf, and F = MSmodel / MSE. Here SSmodel = 26.485, RSS = 34.026 and TSS = 60.511. > anova(mod) Analysis of Variance Table Response: expend Df Sum Sq Mean Sq F value Pr(>F) stature 1 26.485 26.485 15.568 0.000799 *** Residuals 20 34.026 1.701

  32. Are the assumptions met ? • 1 : Is the response variable continuous ? YES ! • 2 : Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(mod$residuals) ## Anderson-Darling Anderson-Darling normality test data: mod$residuals A = 0.9638, p-value = 0.01224 > plot(mod, which=2) ## qqplot of std.residuals Answer: NO !

  33. 3 : Are the residuals independent and identically distributed ? > plot(mod, which=1) ## residuals vs fitted values (the two fitted values correspond to the lean and obese groups). > plot(mod, which=3) ## sqrt(abs(standardized residuals)) vs fitted values. Answer: perhaps not identically distributed.

  34. Transforming the response variable to produce normal residuals • Can be optimised using the Box-Cox method. > library(MASS) > boxcox(mod, plotit=T) The method transforms the response y into gλ(y), where gλ(y) = (y^λ − 1) / λ for λ ≠ 0, and gλ(y) = log(y) for λ = 0. Here the optimal solution is λ ≈ −1/2, for which gλ(y) = 2 − 2/√y (the transformation applied on the next slide).
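
A minimal sketch of applying the transformation for an arbitrary λ (the helper bc() is purely illustrative, not part of MASS):

> bc <- function(y, lambda) if (lambda == 0) log(y) else (y^lambda - 1) / lambda   ## Box-Cox transform
> all.equal(bc(expend, -1/2), 2 - 2/sqrt(expend))   ## TRUE: identical to the transformation below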

  35. Transforming the response variable to produce normal residuals • Is this transformation good enough ? > new.y <- 2 - 2/sqrt(expend) > mod2 <- lm(new.y ~ stature) > ad.test(residuals(mod2)) Anderson-Darling normality test data: residuals(mod2) A = 0.5629, p-value = 0.1280 > plot(mod2, which=2) > plot(mod, which=2) ## qqplot of std. residuals. Perhaps the assumption of equal variance is not warranted ??

  36. Transforming the response variable to produce normal residuals • Is this transformation good enough ? > var.test(new.y ~ stature) ## only works if stature is a factor with only 2 levels F test to compare two variances data: new.y by stature F = 1.5897, num df = 12, denom df = 8, p-value = 0.5201 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.3785301 5.5826741 sample estimates: ratio of variances 1.589701

  37. Transforming the response variable to produce normal residuals • Is this transformation good enough ? > summary(mod) ## untransformed y Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.0662 0.3618 22.297 1.34e-15 *** statureobese 2.2316 0.5656 3.946 0.000799 *** > summary(mod2) ## transformed y Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.29049 0.01301 99.176 < 2e-16 *** statureobese 0.08264 0.02034 4.062 0.000608 ***

  38. Next : One-Way Anova • Parametric extension of the t-test to more than 2 groups • whose sample sizes can be unequal > data(coagulation) ## in faraway package: blood coagulation times among 24 animals fed one of 4 diets > attach(coagulation) > plot(coag ~ diet) > stripchart(coag ~ diet, method="jitter")
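
A quick look at the group structure before fitting anything (a minimal sketch; the four diets are labelled A to D in this dataset):

> table(diet)                ## sample size of each diet group (unequal)
> tapply(coag, diet, mean)   ## mean coagulation time per diet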

  39. One-Way Anova as an F-test > res.aov <- aov(coag ~ diet) ## classical ANOVA > summary(res.aov) Df Sum Sq Mean Sq F value Pr(>F) diet 3 228.0 76.0 13.571 4.658e-05 *** Residuals 20 112.0 5.6 --- Here SSmodel = 228, RSS = 112 and TSS = 340.

  40. One-Way Anova as a linear model > mod <- lm(coag ~ diet) ## as a linear model > summary(mod) ## some output left out Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.100e+01 1.183e+00 51.554 < 2e-16 *** dietB 5.000e+00 1.528e+00 3.273 0.003803 ** dietC 7.000e+00 1.528e+00 4.583 0.000181 *** dietD -1.071e-14 1.449e+00 -7.39e-15 1.000000 Residual standard error: 2.366 on 20 degrees of freedom Multiple R-squared: 0.6706, Adjusted R-squared: 0.6212 F-statistic: 13.57 on 3 and 20 DF, p-value: 4.658e-05 > anova(mod) ## or summary(aov(mod)) Analysis of Variance Table Response: coag Df Sum Sq Mean Sq F value Pr(>F) diet 3 228.0 76.0 13.571 4.658e-05 *** Residuals 20 112.0 5.6 not very useful
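
In this parameterisation the (Intercept) is the mean of the reference diet (A) and each dietB/C/D coefficient is a difference from it; a minimal sketch of switching the baseline (mod.B is an illustrative name):

> mod.B <- lm(coag ~ relevel(diet, ref = "B"))   ## use diet B as the reference level
> summary(mod.B)                                 ## coefficients now measured relative to diet B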

  41. Are the assumptions met ? • 1 : Is the response variable continuous ? No, discrete ! • 2 : Are the residuals normally distributed ? > library(nortest) ## Package of Normality tests > ad.test(mod$residuals) ## Anderson-Darling Anderson-Darling normality test data: mod$residuals A = 0.301, p-value = 0.5517 > plot(mod, which=2) ## qqplot of std.residuals Answer: Yes !

  42. 3 : Are the residuals independent and identically distributed ? > plot(mod, which=1) ## residuals vs fitted values (the fitted values correspond to the diet groups A, D, B and C). > plot(mod, which=3) ## sqrt(abs(standardized residuals)) vs fitted values. Answer: OK > library(car) > levene.test(mod) Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 3 0.6492 0.5926 20

  43. More classical graphs (gallery): histogram + theoretical curve, boxplot, stripchart, pie chart, barplot, 3D models.
