
Introduction to R


Presentation Transcript


  1. Introduction to R Linear Modelling An introduction using R 19 June 2014 Karim Malki

  2. AGENDA • R for statistical analysis • Understanding Linear Models • Data pre-processing • Building Linear Models in R • Graphing • Reporting Results • Further Reading

  3. R For Statistics R is a powerful statistical program, but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available on the R project website as "packages". The base installation contains a powerful set of tools for many statistical purposes, including linear modelling; more advanced routines require installing a package and loading it with library().
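
For example, the psych package used later in these slides is fetched once with install.packages() and then attached in each session with library(); a minimal sketch:
# install.packages("psych")   # download from CRAN; only needed once per machine
library(psych)                 # attach the package for this session
describe(faithful)             # describe() comes from psych, not from base R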

  4. Variance and Covariance • Variance: the average squared deviation of each data point from that variable's mean (the summed squared deviations divided by n - 1 in the sample formula) • Covariance: when a participant deviates from the mean on one variable, do they deviate on another variable in a similar, or opposite, way?
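
As a quick worked example in base R (made-up numbers, not taken from the slides), the sample formulas written out by hand agree with var() and cov(), both of which divide by n - 1:
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)                 # variance of x by hand
var(x)                                         # same value from base R
sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # covariance of x and y by hand
cov(x, y)                                      # same value from base R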

  5. Correlation
x <- runif(10, 5.0, 15)
y <- sample(5:15, 10, replace = TRUE)
x
is.atomic(x)
str(x)
y
is.atomic(y)
str(y)
var(x)
var(y)
cov(x, y)

  6. Correlation Standardising covariance measures Standardising a covariance value gives a measure of the strength of the relationship -> correlation coefficient. E.g. covariance divided by (sd of X * sd of Y) is the 'Pearson product moment correlation coefficient'. This gives coefficients between -1 (perfect negative relationship) and 1 (perfect positive relationship).
cov(x, y) / (sqrt(var(x)) * sqrt(var(y)))
myfunction <- function(x, y) { cov(x, y) / (sqrt(var(x)) * sqrt(var(y))) }
myfunction(x, y)
cor(x, y)

  7. Correlation
?faithful
data(faithful)
summary(faithful)
dim(faithful)
str(faithful)
names(faithful)
library(psych)
describe(faithful)

  8. Correlation

  9. Correlation
ls()
hist(faithful$eruptions, col = "grey")
hist(faithful$waiting, col = "grey")
attach(faithful)
plot(eruptions ~ waiting)
abline(lm(faithful$eruptions ~ faithful$waiting))
cor(eruptions, waiting)
cor(faithful, method = "pearson")
library(car)
scatterplot(eruptions ~ waiting, reg.line = lm, smooth = TRUE, spread = TRUE,
            id.method = 'mahal', id.n = 2, boxplots = 'xy', span = 0.5, data = faithful)
library(psych)
cor.test(waiting, eruptions)

  10. Correlation

  11. Correlation

  12. Correlation
corr.mat <- cor.matrix(variables = d(eruptions, waiting), data = faithful,
                       test = cor.test, method = 'pearson', alternative = "two.sided")
print(corr.mat)
Pearson's product-moment correlation
            eruptions   waiting
eruptions   cor  1      0.9008
            N    272    272
            CI*         (0.8757, 0.9211)
            p-value     0.0000
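
Note that cor.matrix() and d() are not base R functions (they come from an add-on package, e.g. Deducer, which would need to be loaded first). A base-R way to get the same correlation, confidence interval and p-value:
cor(faithful$eruptions, faithful$waiting)          # Pearson correlation, about 0.90
cor.test(~ eruptions + waiting, data = faithful)   # adds the 95% CI and p-value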

  13. Regression • Linear Regression is conceptually similar to correlation • However, correlation does not treat the two variables differently • In contrast, Linear Regression asks about the effect of one variable on the other. It distinguishes between IVs (the thing that may influence) and DVs (the thing being influenced) • So, sometimes problematically, you choose which you expect to have the causal effect • It fits a straight line that minimises the squared error in the DV (the vertical distances of points from the line): the "Method of Least Squares" • It then asks how much variance is explained by this straight-line model relative to the unexplained variance
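
A minimal sketch of the least-squares idea done by hand, using the faithful data from the earlier slides: the slope is cov(x, y) / var(x), the intercept makes the line pass through the two means, and the result matches what lm() returns:
data(faithful)
b <- cov(faithful$waiting, faithful$eruptions) / var(faithful$waiting)   # slope
a <- mean(faithful$eruptions) - b * mean(faithful$waiting)               # intercept
c(intercept = a, slope = b)
coef(lm(eruptions ~ waiting, data = faithful))   # same two numbers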

  14. Regression
x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)

# plot scatterplot and the regression line
mod1 <- lm(y ~ x)
plot(x, y, xlim=c(min(x)-5, max(x)+5), ylim=c(min(y)-10, max(y)+10))
abline(mod1, lwd=2)

# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# plot distances between points and the regression line
segments(x, y, x, pre, col="red")

# add labels (res values) to points
library(calibrate)
textxy(x, y, res, cx=0.7)

  15. Regression

  16. Method of Least Squares

  17. Parameters The regression model knows what the best-fitting line is, but it can tell you only two things: the slope (gradient or coefficient) and the intercept (or constant).
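
For example (assuming mod1, the height/weight model from slide 14, is still in the workspace), both parameters can be read off with coef() and used to predict:
co <- coef(mod1)       # named vector: "(Intercept)" and "x"
co["(Intercept)"]      # the constant
co["x"]                # the slope: change in y per one-unit change in x
predict(mod1, newdata = data.frame(x = 170))   # prediction at a hypothetical x of 170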

  18. Parameters

  19. Linear Modelling
# data(faithful); ls()
mod1 <- lm(eruptions ~ waiting, data = faithful)
mod1

Coefficients:
(Intercept)      waiting
   -1.87402      0.07563

summary(mod1)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29917 -0.37689  0.03508  0.34909  1.19329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
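
The pieces of that summary can also be pulled out programmatically, which helps when reporting results:
s <- summary(mod1)
coef(s)          # table of estimates, standard errors, t values and p-values
s$r.squared      # 0.8115: proportion of the variance in eruptions explained
confint(mod1)    # 95% confidence intervals for the intercept and slope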

  20. Linear Modelling
co <- coef(mod1)

# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# residuals should be normally distributed, and this is easy to check
hist(res)
library(MASS)
truehist(res)
qqnorm(res)
qqline(res)

# plot your regression
attach(faithful)
mod1 <- lm(eruptions ~ waiting)
plot(waiting, eruptions,
     xlim = c(min(faithful$waiting) - 10, max(faithful$waiting) + 5),
     ylim = c(min(faithful$eruptions) - 3, max(faithful$eruptions) + 1))
abline(mod1, lwd = 2)

# plot distances between points and the regression line
segments(faithful$waiting, faithful$eruptions, faithful$waiting, pre, col = 'red')

  21. Return p-value
# summary.lm does not report the overall model p-value directly, so compute it
# from the F statistic
lmp <- function(modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  print(modelobject)
  return(p)
}
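
An example call, assuming mod1 from slide 19 is in the workspace; the returned value agrees with the p-value on the F statistic in summary(mod1):
mod1 <- lm(eruptions ~ waiting, data = faithful)
lmp(mod1)   # prints the model, then returns the overall p-value (essentially zero here)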

  22. Model Fit If the model does not fit, it may be because of: outliers; unmodelled covariates; heteroscedasticity (residuals have unequal variance); clustering (residuals have lower variance within subgroups); or autocorrelation (correlation between residuals at successive time points).
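
One quick way to screen for several of these problems at once is the standard set of diagnostic plots for a fitted lm object; a minimal sketch using mod1 from slide 19:
mod1 <- lm(eruptions ~ waiting, data = faithful)
op <- par(mfrow = c(2, 2))
plot(mod1)    # residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(op)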

  23. Model Fit
library(MASS)      # for the hills data
library(psych)
library(lattice)
library(grid)
library(hexbin)
library(solaR)
data(hills)
splom(~hills)
data <- subset(hills, select = c('dist', 'time', 'climb'))
splom(hills, panel = panel.hexbinplot, colramp = BTC,
      diag.panel = function(x, ...) {
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm = TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      lower.panel = function(x, y, ...) {
        panel.hexbinplot(x, y, ...)
        panel.loess(x, y, ..., col = 'red')
      },
      pscale = 0, varname.cex = 0.7)

  24. Model Fit
mod2 <- lm(time ~ dist, data = hills)
summary(mod2)
attach(hills)
co2 <- coef(mod2)
plot(dist, time)
abline(co2)
fl <- fitted(mod2)
for (i in 1:35) lines(c(dist[i], dist[i]), c(time[i], fl[i]), col = 'red')
# Can you spot outliers?
sr <- stdres(mod2)
names(sr)
truehist(sr, xlim = c(-3, 5), h = .4)
names(sr)[sr > 3]
names(sr)[sr < -3]   # note the space: sr<-3 would assign 3 to sr

  25. Model Fit
attach(hills)
plot(dist, time, ylim = c(0, 250))
abline(coef(mod2))
identify(dist, time, labels = rownames(hills))

  26. What to do with outliers • Data-driven methods for the removal of outliers have some limitations • Fit a better model: robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations • Leverage: an observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean • Influence: an observation is influential if removing it substantially changes the estimate of the regression coefficients • Cook's distance (or Cook's D): a measure that combines the information of leverage and residual of the observation
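
A short sketch of how leverage, influence and Cook's distance can be extracted in base R for the hills time ~ dist model used on the neighbouring slides:
library(MASS)                       # for the hills data
ols <- lm(time ~ dist, data = hills)
hatvalues(ols)                      # leverage of each observation
cd <- cooks.distance(ols)           # combines leverage and residual size
head(sort(cd, decreasing = TRUE))   # the most influential hill races
plot(ols, which = 5)                # residuals vs leverage, with Cook's distance contours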

  27. What to do with outliers
attach(hills)
summary(ols <- lm(time ~ dist))
opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)
influence.measures(ols)
# Using an M estimator (robust regression from MASS)
rlm1 <- rlm(time ~ dist, data = hills, method = 'MM')
summary(rlm1)
plot(dist, time, ylim = c(0, 250))
abline(coef(ols))
abline(coef(rlm1), col = "red")
identify(dist, time)

  28. What to do with outliers
attach(hills)
summary(ols <- lm(time ~ dist))
opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)
influence.measures(ols)
# Using an M estimator (robust regression from MASS)
rlm1 <- rlm(time ~ dist, data = hills, method = 'MM')
summary(rlm1)
plot(dist, time, ylim = c(0, 250))
abline(coef(ols))
abline(coef(rlm1), col = "red")
identify(dist, time, labels = rownames(hills))

  29. Linear Modelling I will be around for questions or, for a slower response, email me: Karim.malki@kcl.ac.uk
