Linear Models in R

1 / 51

# Linear Models in R - PowerPoint PPT Presentation

Linear Models in R Fish 507 F: Lecture 7 Supplemental Readings Practical Regression and ANOVA using R (Faraway, 2002) Chapters 2,3,6,7,8,10,16 http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Linear Models in R' - Pat_Xavi

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Linear Models in R

Fish 507 F: Lecture 7

• Practical Regression and ANOVA using R (Faraway, 2002)
• Chapters 2,3,6,7,8,10,16
• http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
• (This is essentially an older version (but a free copy) of Julian Faraway’s excellent book Linear Models with R)
• QERM 514 Lecture Notes (Nesse, 2008)
• Chapters 3-9
• http://students.washington.edu/nesse/qerm514/notes.pdf
• Hans Nesse practically wrote a book for this course !
Linear models
• Today’s lecture will focus on a single equation and how to fit it.
• Classic linear model
• When the response, yi is normally distributed
• Categorical predictor (s) : ANOVA
• Continuous predictor (s) : Regression
• Mixed Categorical/Continuous predictor (s) : Regression/ANCOVA
Typical Goals
• Is this model for . . .
• Estimation : What parameters in a particular model best fit the data?
• Tricks (e.g. Formulaic relationships, Ricker, . . . )
• Inference: How certain are those estimates and what can be interpreted from them?
• Adequacy: Is the model probably the right choice?
• Prediction: When can predictions be made for new observations?
Outline
• ANOVA
• Regression (ANCOVA)
• * This lecture will not go into detail on statistical models or concepts, but rather present functions to fit those models
One-way ANOVA
• The classical (null) hypothesis in a one-way ANOVA is that the means of all the groups are the same
• i : 1, 2, . . . . , nj is the observation within group j: 1,2, . . . J

H0: μ1 = μ2 = . . . . = μJ

H1: At least two of the μj’s are different

Archaeological metals
• Archaeological investigations work to identify similarities and differences between sites. Traces of metals found in artifacts give some indication of manufacturing techniques
• The data metals gives the percentage of iron found in pottery from four Roman-era sites

Al Fe Mg Ca Na Site

1 14.4 7.00 4.30 0.15 0.51 L

2 13.8 7.08 3.43 0.12 0.17 L

3 14.6 7.09 3.88 0.13 0.20 L

Site will automatically get coded as a factor

aov() / summary()
• The simplest way to fit an ANOVA model is with the aov function

> Fe.anova <- aov(Fe ~ Site, data=metals)

> summary(Fe.anova)

Df Sum Sq Mean Sq F value Pr(>F)

Site 3 134.222 44.741 89.883 1.679e-12 ***

Residuals 22 10.951 0.498

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Mean squared error – sum of squares divided by degrees of freedom

Partial F-statistic comparing the full model to the reduced model

Probability of observing F or higher

Significance – a quick visual guide to which p-values are low

Sums of squares group – sum of squared difference between group means and overall means

Sums of squares error – sum of squared difference between observations within a group and their respective mean

Degrees of freedom for each group and the residuals

The model statement
• We fit the ANOVA model by specify a model:
• Fe ~ Site
• The functions in R that fit the more common statistical models take as a first argument a model statement in a compact symbolic form
• We’ve actually briefly seen this symbolic form in the first plotting lecture
• plot(y ~ x, data = )

Predictor, independent variable

Response, dependent variable

The model statement
• Look up help on ?formula for a full explanation
• Some common model statements
One-way ANOVA
• In the previous function we fit the model with the aov() function. We can also fit the ANOVA model in with the functions lm() and anova(). Depending on what analysis we are conducting we might chose either one of these approaches.

> Fe.lm <- lm(Fe ~ Site, data=metals)

> anova(Fe.lm)

Analysis of Variance Table

Response: Fe

Df Sum Sq Mean Sq F value Pr(>F)

Site 3 134.222 44.741 89.883 1.679e-12 ***

Residuals 22 10.951 0.498

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Coding categorical variables
• The lm() function is also used to fit regression models. (In fact, regression and ANOVA are really the same thing). It all has to do with how a categorical variable is coded in the model.
• The ANOVA model can be written in a the familiar looking form

by cleverly selecting the predictors to be

Treatment coding

treatment coding
• Coding schemes describe how each group is represented as the values of x1, x2, . . . . xJ
• In R the default coding scheme for unordered factors is the treatment coding. This is likely what you learned in your introductory statistics courses.
• Recall that in this scheme the estimate of the intercept β0 represents the mean of the baseline group and that the estimate of the of each βJ describes the difference between the mean of group i and the baseline group
Coding schemes
• Treatment coding
• This may seem trivial but its very important to know how categorical variables are being coded in a linear model when interpreting parameters.
• To find out the current coding scheme

> options()\$contrasts

unordered ordered

"contr.treatment" "contr.poly"

μ1 is the group chosen as the baseline

Other coding schemes
• There are several other coding schemes
• hermet : Awkward interpretation. Improves matrix computations.
• poly : Levels of a group are ordered. β0 = constant effect , β1 = linear effect, β2 = quadratic effect, . . .
• SAS : same as treatment, but the last level of a group is used as the baseline (the first level will always be used in treatment
• sum : When the group sample sizes are equal, the estimate of the intercept represents the grand mean and the βJ represent the differences of those levels from the grand mean
Changing coding schemes
• The C function is used to specify the contrast of a factor
• C(object, contr)

> ( metals\$Site <- C(metals\$Site, sum) )

[1] L L L L L L L L L L L L L L C C I I I I I A A A A A

attr(,"contrasts")

[1] contr.sum

Levels: A C I L

• The functions contr.() (e.g. contr.treatment) will create the matrix of contrasts used in lm() and other functions
One-way ANOVA

> summary(Fe.lm)

Call:

lm(formula = Fe ~ Site, data = metals)

Residuals:

Min 1Q Median 3Q Max

-2.11214 -0.33954 0.01143 0.49036 1.22800

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.5120 0.3155 4.792 8.73e-05 ***

SiteC 3.9030 0.5903 6.612 1.20e-06 ***

SiteI 0.2000 0.4462 0.448 0.658

SiteL 4.8601 0.3676 13.222 6.04e-12 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7055 on 22 degrees of freedom

Multiple R-squared: 0.9246, Adjusted R-squared: 0.9143

F-statistic: 89.88 on 3 and 22 DF, p-value: 1.679e-12

More output

Residual summary

Recall that this t-test tests whether the βJ is significantly different from zero. So this says that sites C and L were significantly different from the baseline site, A

Got to report R-squared, right?

Multiple comparisons
• When comparing the means for the levels of a factor in an analysis of variance, a simple comparison using t-tests will inflate the probability of declaring a significant difference when it is not in fact present
• There are several ways around this the most common being Tukey’s honest significant difference
• TukeyHSD(object)

This needs to be a fitted model object fromaov()

TukeyHSD()

> TukeyHSD(Fe.anova)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = Fe ~ Site, data = metals)

\$Site

C-A 3.9030000 2.2638764 5.542124 0.0000068

I-A 0.2000000 -1.0390609 1.439061 0.9692779

L-A 4.8601429 3.8394609 5.880825 0.0000000

I-C -3.7030000 -5.3421236 -2.063876 0.0000146

L-C 0.9571429 -0.5238182 2.438104 0.3023764

L-I 4.6601429 3.6394609 5.680825 0.0000000

These should not contain zero for a significant difference

Model assumptions
• Independence
• Within and between samples
• Normality
• Histograms, QQ-Plots
• Tests for normality
• Kolmogorov-Smirnov test : ks.test()
• Shapiro-Wilk: shapiro.test()
• Homogeneity of variance
• Boxplots
• Tests for equal variances
• Bartlett’s test: bartlett.test()
• Fligner-Killeen Test: fligner.test()

The null hypothesis is that the data follow a specific distribution

The null hypothesis is that the data came from a normal distribution

The null hypothesis is that all the variances are equal

In-class exercise 1
• Check the assumptions using plots and tests for the archeological ANOVA model. Recall that the normality test is conducted on the residuals of the model so you will need to figure out how to extract these from Fe.lm
• Are the assumptions met ?
ANCOVA / Regression
• With a basic understanding of the lm() function it’s then not hard to fit other linear models
• Regression models with mixed categorical and continuous variables can be fit the with lm() function.
• There are also a suite of functions associated with lm() objects which we use for common model evaluation and prediction routines.
Marmot data
• The length of yellow-bodied marmot whistles in response to simulated predators.

len rep dist type loc

1 0.12214826 1 17.2733271 Human A

2 0.07630072 3 0.2445166 Human A

3 0.11584495 1 13.0901767 Human A

4 0.11318707 1 14.9489510 Human A

5 0.09931512 2 13.0074619 Human A

6 0.10285429 2 10.6129169 Human A

Marmot data
• len: length of marmot whistle (response variable)
• rep: number of repetitions of whistle per bout - continuous
• dist: distance to challenge when whistle began– continuous
• type: type of challenge (Human, RC Plane, Dog) - categorical
• loc: test location (A, B, C) – categorical
Exploring potential models
• Basic exploratory data analysis should always be performed before starting to fit a model
• Always try and fit a meaningful model
• When there are at least two categorical predictors an interaction plot is useful for determining whether the effect of x1 on y depends on the level of x2 (an interaction)
• interaction.plot(x.factor, trace.factor, response)

interaction.plot(marmot\$loc, marmot\$type, marmot\$len, . . .

Slight evidence for an interaction

No RCPlanes were test at location C. This will prevent us from fitting an interaction between these two variables.

Exploring potential models
• We can also examine potential interactions between continuous and categorical variables with simple bivariate plots conditioned on factors

plot(dist, len,xlab="Distance", ylab="Length", type='n')

points(dist[type == "Dog"], len[type == "Dog"],pch=17, col="blue")

points(dist[type == "Human"], len[type == "Human"],pch=18, col="red")

points(dist[type == "RCPlane"], len[type == "RCPlane"], pch=19,col="green")

legend("bottomleft", bty='n', levels(type),

+ col=c("blue", "red", "green"), pch=17:19)

Setup a blank plot

Quick way to extract names of categorical variables

Add points for x and y by some factor

A model
• Suppose after some exploratory data analysis and model fitting we arrived at the model:
• Length ~ Location + Distance + Type + Distance*Type
• We can fit this model simply by

> ( lm1 <- lm(len ~ loc + type*dist, data=marmot) )

Call:

lm(formula = len ~ loc + type * dist, data = marmot)

Coefficients:

(Intercept) locB locC typeHuman

0.0941227 0.0031960 0.0026906 -0.0353553

typeRCPlane dist typeHuman:dist typeRCPlane:dist

0.0001025 0.0005970 0.0034316 -0.0011266

The summary() command works just as before

> summary(lm1)

Call:

lm(formula = len ~ loc + type * dist, data = marmot)

Residuals:

Min 1Q Median 3Q Max

-0.092966 -0.010812 0.001030 0.010029 0.059588

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.0941227 0.0106280 8.856 4.82e-15 ***

locB 0.0031960 0.0042574 0.751 0.45417

locC 0.0026906 0.0049046 0.549 0.58421

typeHuman -0.0353553 0.0136418 -2.592 0.01063 *

typeRCPlane 0.0001025 0.0153766 0.007 0.99469

dist 0.0005970 0.0008158 0.732 0.46555

typeHuman:dist 0.0034316 0.0010809 3.175 0.00187 **

typeRCPlane:dist -0.0011266 0.0011891 -0.947 0.34515

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02147 on 132 degrees of freedom

Multiple R-squared: 0.2906, Adjusted R-squared: 0.2529

F-statistic: 7.723 on 7 and 132 DF, p-value: 8.208e-08

Extracting model components
• All the components to the summary() output are also stored in a list

> names(lm1)

[1] "coefficients" "residuals" "effects"

[4] "rank" "fitted.values" "assign"

[7] "qr" "df.residual" "contrasts"

[10] "xlevels" "call" "terms"

[13] "model"

> lm1\$call

lm(formula = len ~ loc + type * dist, data = marmot)

Comparing models
• Suppose that the model without the interaction was also a potential model. We can look at the t-values to test whether a single βJ = 0, but need to perform a partial F-test to test whether several predictors = 0.
• This is what an ANOVA model tests.
• H0: Reduced model – H1:Full model

> anova(lm2,lm1)

Analysis of Variance Table

Model 1: len ~ loc + type + dist

Model 2: len ~ loc + type * dist

Res.Df RSS Df Sum of Sq F Pr(>F)

1 134 0.069588

2 132 0.060827 2 0.008761 9.5058 0.0001391 ***

Twolm()objects need to be given toanova()

Evidence for interaction

AIC
• AIC is a more sound way to select a model.
• In it’s most simple form, p = # of parameters
• The function AIC() will extract the AIC from a linear model.
• Note that there is also the function extractAIC() which evaluates the log-likelihood based on the model deviance (Generalized Linear Models) and uses a different penalty.
• Be careful !
AICc
• AIC corrected is a better choice, especially when there a lot of parameters in relationship to the size of the data
• AICc should always be used since AIC and AICc should yield equivalent results as n gets large.

Correction term. What will this converge to as n gets very large ?

Hands-on exercise 2
• Compute the AIC and AICc for the two marmot models that were fit
• For AIC use the AIC() function, but also do the computation by hand
• R does not have a built in function for AICc in its base package, so you will also have to do this computation by hand.
• Hint: Use the loglik() function
Checking assumptions
• Suppose that we decided on the marmot model that included the interaction term between Distance and type of challenge
• The same assumptions must be met and can be evaluated by plotting the model object
• plot(lm1)
• Clicking on the plot will allow to scrow through plots
• Specifying which= in the command will allow to select which plot
• plot(lm1, which=1)

Checking constant variance assumption

Unusual observations are flagged (High leverage)

Check for influential observations

Cook’s distance is a function of the residual and leverage, so the isoclines trace out the Cook’s distance for any point in a region

Box-cox transformations
• Failing to meet assumptions indicates model mis-specification
• The QQ-plot suggested that that we might want to transform the response variable, length of whistle.
• It is always a good idea to use a meaningful transformation if possible.
• Optimal transformation can be found with the boxcox() function
• Computes a likelihood profile over λ
Influential observations
• Unusual observations are also likely driving the lack of fit. Cook’s distance is commonly used to identify influential observations
• This statistic has an F distribution and thus points greater than 1 usually require further examination. Points greater than 0.5 is also commonly used
Influential observations
• cooks.distance() takes as input a lm() object and returns the Cook’s distance
• None of the observations were influential

> which(cooks.distance(lm1) > 0.5)

named integer(0)

and the values flagged by R likely had high leverage, meaning that they had the potential to be influential.

Making predictions
• After assessing model assumptions, influential observations, refitting the model, validating the model , . . . . .
• We want to make prediction for a future observation
• The point estimate for a new observation is:

but we wish express our uncertainty in this observation

• Two types of predictions with uncertainty
• Confidence interval: Express uncertainty in a parameter
• Prediction interval: Express uncertainty in a new observation
predict()

predict(object,

newdata,

interval = c("none", "confidence", "prediction"),

level = 0.95,

na.action = na.pass)

lm()object

Data frame of new observations. Note the column names of the data frame must match the data used in the model

Desired confidence interval

How to handle missing values. Recall all the na. functions. By default R will not make the prediction (pass)

predict()

> ( newobs <- data.frame(rep=c(3,1), dist=c(10, 13),

+ type=rep("Dog",2), loc=rep("A",2)) )

rep dist type loc

1 3 10 Dog A

2 1 13 Dog A

> predict(lm1, newobs, interval="confidence")

fit lwr upr

1 0.1000929 0.09170514 0.1084807

2 0.1018840 0.09411091 0.1096571

> predict(lm1, newobs, interval="prediction")

fit lwr upr

1 0.1000929 0.05680941 0.1433764

2 0.1018840 0.05871539 0.1450526

Note, names match data

Wider intervals

Parameter confidence intervals
• Confidence intervals for model parameters can easily be obtained with the confint() function

> confint(lm1)

2.5 % 97.5 %

(Intercept) 0.073099364 0.115145945

locB -0.005225605 0.011617694

locC -0.007011095 0.012392320

typeHuman -0.062340202 -0.008370466

typeRCPlane -0.030313894 0.030518809

dist -0.001016644 0.002210696

typeHuman:dist 0.001293523 0.005569594

typeRCPlane:dist -0.003478859 0.001225620

Other useful functions
• addterm: Forward selection using AIC
• dropterm: Backwards selection using AIC
• stepAIC: Step-wise selection using AIC
Hands-on Exercise 3
• If time permitting, explore one of the functions on the previous slide. Either use a model that we previously fit or a new one