1 / 34

# Workshop in R & GLMs: #2 - PowerPoint PPT Presentation

Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE) Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"):

## Related searches for Workshop in R & GLMs: #2

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

### Download Presentation

Workshop in R & GLMs: #2

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

### Workshop in R & GLMs: #2

Diane Srivastava

University of British Columbia

srivast@zoology.ubc.ca

Start by loading your Lakedata_06 dataset:

diane<-read.table(file.choose(),sep=";",header=TRUE)

### Dataframes

Two ways to identify a column (called "treatment") in your dataframe (called "diane"):

diane\$treatment

OR

attach(diane); treatment

At end of session, remember to: detach(diane)

### Summary statistics

length (x)

mean (x)

var (x)

cor (x,y)

sum (x)

summary (x)

minimum, maximum, mean, median, quartiles

What is the correlation between two variables in your dataset?

### Factors

• A factor has several discrete levels (e.g. control, herbicide)

• If a vector contains text, R automatically assumes it is a factor.

• To manually convert numeric vector to a factor:

• x <- as.factor(x)

• To check if your vector is a factor, and what the levels are:

• is.factor(x) ; levels(x)

### Exercise

Make lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

3. Make AreaFactor into a factor, then check that it is a factor

### Exercise

Make lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large"

3. Make AreaFactor into a factor, then check that it is a factor

AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)

### Linear regression

ALT+126

model <- lm (y ~ x, data = diane)

insert your

dataframe

name here

invent a name for your model

linear model

insert your

x vector name here

insert your

y vector name here

### Linear regression

model <- lm (Species ~ Elevation, data = diane)

summary (model)

Call:

lm(formula = Species ~ Elevation, data = diane)

Residuals:

Min 1Q Median 3Q Max

-7.29467 -2.75041 -0.04947 1.83054 15.00270

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 9.421568 2.426983 3.882 0.000551 ***

Elevation -0.0026090.003663 -0.712 0.482070

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.811 on 29 degrees of freedom

Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167

F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821

### Linear regression

model2 <- lm (Species ~ AreaFactor, data = diane)

summary (model2)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 11.833 1.441 8.210 6.17e-09 ***

AreaFactormedium -2.405 1.723 -1.396 0.174

AreaFactorsmall -8.288 1.792 -4.626 7.72e-05 ***

Large has mean species richness of 11.8

Medium has mean species richness of 11.8 - 2.4 = 9.4

Small has a mean species richness of 11.8 - 8.3 = 3.5

mean(Species[AreaFactor=="medio"])

### ANOVA

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

Analysis of Variance Table

Response: Species

Df Sum Sq Mean Sq F value Pr(>F)

AreaFactor 2 333.85 166.92 13.393 8.297e-05 ***

Residuals 28 348.99 12.46

### F tests in regression

model3 <- lm (Species ~ Elevation + pH, data = diane)

anova (model, model3)

Model 1: Species ~ Elevation

Model 2: Species ~ Elevation + pH

Res.Df RSS Df Sum of Sq F Pr(>F)

1 29 671.10

2 28 502.06 1 169.05 9.4279 0.004715 **

F 1, 28 = 9.43

Exercise

• Fit the model: Species~pH

• Fit the model: Species~pH+pH2

• ("pH2" is just pH2)

• Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH

### Distributions: not so normal!

• Review assumptions for parametric stats (ANOVA, regressions)

• Why don’t transformations always work?

• Introduce non-normal distributions

## Tests before ANOVA, t-tests

Normality

Constant variances

### Tests for normality: exercise

data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data

How many datapoints are there?

### Tests for normality: exercise

• Shapiro-Wilks (if sig, not normal)

shapiro.test (data)

If error message, make sure the stats package is loaded, then try again:

library(stats); shapiro.test (data)

### Tests for normality: exercise

• Kolmogorov-Smirnov (if sig, not normal)

ks.test(data,”pnorm”,mean(data),sd=sqrt(var(data)))

### Tests for normality: exercise

• Quantile-quantile plot (if wavers substantially off 1:1 line, not normal)

par(pty="s")

qqnorm(data); qqline(data)

opens up a single plot window

### Tests for normality: exercise

If the distribution isn´t normal, what is it?

freq.data<-table(data); freq.data

barplot(freq.data)

### Non-normal distributions

• Poisson (count data, left-skewed, variance = mean)

• Negative binomial (count data, left-skewed, variance >> mean)

• Binomial (binary or proportion data, left-skewed, variance constrained by 0,1)

• Gamma (variance increases as square of mean, often used for survival data)

### Exercise

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

1. Test for normality of residuals

resid2<- resid (model2)

...you do the rest!

2. Test for homogeneity of variances

summary (lm (abs (resid2) ~ AreaFactor))

### Regression diagnostics

• Residuals are normally distributed

• Absolute value of residuals do not change with predicted value (homoscedastcity)

• Residuals show no pattern with predicted values (i.e. the function “fits”)

• No datapoint has undue leverage on the model.

### Regression diagnostics

model3 <- lm (Species ~ Elevation + pH, data = diane)

par(mfrow=c(2,2)); plot(model3)

### 1. Residuals are normally distributed

• Straight “Normal Q-Q plot”

Std. deviance resid.

Theoretical Quantiles

### 2. Absolute residuals do not change with predicted values

• No fan shape in Residuals vs fitted plot

• No upward (or downward) trend in Scale-location plot

MALO

BUENO

Residuals

Sqrt (abs (SD resid.))

Fitted values

Fitted values

### 3. Residuals show no pattern

Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)

### 4. No unusual leverage

Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)

### Transformations

Try transforming your y-variable to improve the regression diagnostic plot

• replace Species with log(Species)

• replace Species with sqrt(Species)

Poisson distribution

• Frequency data

• Lots of zero (or minimum value) data

• Variance increases with the mean

### What do you do with Poisson data?

• Correct for correlation between mean and variance by log-transforming y (but log (0) is undefined!!)

• Use non-parametric statistics (but low power)

• Use a “generalized linear model” specifying a Poisson distribution

### The problem: Hard to transform data to satisfy all requirements!

Tarea: Janka example

Janka dataset: Asks if Janka hardness values are a good estimate of timber density? N=36