Loading in 5 sec....

Workshop in R & GLMs: #2PowerPoint Presentation

Workshop in R & GLMs: #2

- 422 Views
- Updated On :
- Presentation posted in: Sports / GamesEducation / CareerFashion / BeautyGraphics / DesignNews / Politics

Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca Start by loading your Lakedata_06 dataset: diane<-read.table(file.choose(),sep=";",header=TRUE) Dataframes Two ways to identify a column (called "treatment") in your dataframe (called "diane"):

Workshop in R & GLMs: #2

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Diane Srivastava

University of British Columbia

srivast@zoology.ubc.ca

Start by loading your Lakedata_06 dataset:

diane<-read.table(file.choose(),sep=";",header=TRUE)

Two ways to identify a column (called "treatment") in your dataframe (called "diane"):

diane$treatment

OR

attach(diane); treatment

At end of session, remember to: detach(diane)

length (x)

mean (x)

var (x)

cor (x,y)

sum (x)

summary (x)

minimum, maximum, mean, median, quartiles

What is the correlation between two variables in your dataset?

- A factor has several discrete levels (e.g. control, herbicide)
- If a vector contains text, R automatically assumes it is a factor.
- To manually convert numeric vector to a factor:
- x <- as.factor(x)
- To check if your vector is a factor, and what the levels are:
- is.factor(x) ; levels(x)

Make lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

3. Make AreaFactor into a factor, then check that it is a factor

Make lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large"

3. Make AreaFactor into a factor, then check that it is a factor

AreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)

ALT+126

model <- lm (y ~ x, data = diane)

insert your

dataframe

name here

invent a name for your model

linear model

insert your

x vector name here

insert your

y vector name here

model <- lm (Species ~ Elevation, data = diane)

summary (model)

Call:

lm(formula = Species ~ Elevation, data = diane)

Residuals:

Min 1Q Median 3Q Max

-7.29467 -2.75041 -0.04947 1.83054 15.00270

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 9.421568 2.426983 3.882 0.000551 ***

Elevation -0.0026090.003663 -0.712 0.482070

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.811 on 29 degrees of freedom

Multiple R-Squared: 0.01719, Adjusted R-squared: -0.0167

F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821

model2 <- lm (Species ~ AreaFactor, data = diane)

summary (model2)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 11.833 1.441 8.210 6.17e-09 ***

AreaFactormedium -2.405 1.723 -1.396 0.174

AreaFactorsmall -8.288 1.792 -4.626 7.72e-05 ***

Large has mean species richness of 11.8

Medium has mean species richness of 11.8 - 2.4 = 9.4

Small has a mean species richness of 11.8 - 8.3 = 3.5

mean(Species[AreaFactor=="medio"])

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

Analysis of Variance Table

Response: Species

Df Sum Sq Mean Sq F value Pr(>F)

AreaFactor 2 333.85 166.92 13.393 8.297e-05 ***

Residuals 28 348.99 12.46

model3 <- lm (Species ~ Elevation + pH, data = diane)

anova (model, model3)

Model 1: Species ~ Elevation

Model 2: Species ~ Elevation + pH

Res.Df RSS Df Sum of Sq F Pr(>F)

1 29 671.10

2 28 502.06 1 169.05 9.4279 0.004715 **

F 1, 28 = 9.43

Exercise

- Fit the model: Species~pH
- Fit the model: Species~pH+pH2
- ("pH2" is just pH2)
- Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH

- Review assumptions for parametric stats (ANOVA, regressions)
- Why don’t transformations always work?
- Introduce non-normal distributions

Tests before ANOVA, t-tests

Normality

Constant variances

data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data

How many datapoints are there?

- Shapiro-Wilks (if sig, not normal)
shapiro.test (data)

If error message, make sure the stats package is loaded, then try again:

library(stats); shapiro.test (data)

- Kolmogorov-Smirnov (if sig, not normal)
ks.test(data,”pnorm”,mean(data),sd=sqrt(var(data)))

- Quantile-quantile plot (if wavers substantially off 1:1 line, not normal)
par(pty="s")

qqnorm(data); qqline(data)

opens up a single plot window

If the distribution isn´t normal, what is it?

freq.data<-table(data); freq.data

barplot(freq.data)

- Poisson (count data, left-skewed, variance = mean)
- Negative binomial (count data, left-skewed, variance >> mean)
- Binomial (binary or proportion data, left-skewed, variance constrained by 0,1)
- Gamma (variance increases as square of mean, often used for survival data)

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

1. Test for normality of residuals

resid2<- resid (model2)

...you do the rest!

2. Test for homogeneity of variances

summary (lm (abs (resid2) ~ AreaFactor))

- Residuals are normally distributed
- Absolute value of residuals do not change with predicted value (homoscedastcity)
- Residuals show no pattern with predicted values (i.e. the function “fits”)
- No datapoint has undue leverage on the model.

model3 <- lm (Species ~ Elevation + pH, data = diane)

par(mfrow=c(2,2)); plot(model3)

- Straight “Normal Q-Q plot”

Std. deviance resid.

Theoretical Quantiles

- No fan shape in Residuals vs fitted plot
- No upward (or downward) trend in Scale-location plot

MALO

BUENO

Residuals

Sqrt (abs (SD resid.))

Fitted values

Fitted values

Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)

Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)

Try transforming your y-variable to improve the regression diagnostic plot

- replace Species with log(Species)
- replace Species with sqrt(Species)

Poisson distribution

- Frequency data
- Lots of zero (or minimum value) data
- Variance increases with the mean

- Correct for correlation between mean and variance by log-transforming y (but log (0) is undefined!!)
- Use non-parametric statistics (but low power)
- Use a “generalized linear model” specifying a Poisson distribution

Tarea: Janka example

Janka dataset: Asks if Janka hardness values are a good estimate of timber density? N=36