
Statistical models in R - Part II



1. Statistical models in R - Part II
• R has several statistical function packages
• We have already covered a few of the functions:
  • t-tests (one- and two-sample, paired)
  • Wilcoxon tests
  • hypothesis testing
  • chi-squared tests
• Here we will cover:
  • linear and multiple regression
  • analysis of variance
  • correlation coefficients
• Explore further on your own

2. Linear regression
• To really get at the regression model, you need to learn how to access the data returned by the lm command
• The lm function fits a linear model
• Here is a short list:
> summary(lm(y ~ x))  # view the results; y is modelled as a function of x
> resid(lm(y ~ x))    # access the residuals
> coef(lm(y ~ x))     # view the coefficients
> fitted(lm(y ~ x))   # get the fitted values
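
In practice, rather than refitting the model for each accessor call, it is cleaner to fit once and store the result. A minimal sketch with simulated data (the intercept 3 and slope 2 below are made-up illustration values):

> x = 1:20
> y = 3 + 2*x + rnorm(20, 0, 4)  # simulate y = 3 + 2x plus noise (sd = 4)
> fit = lm(y ~ x)                # store the fitted model once
> summary(fit)                   # full results: coefficient table, R^2, F-test
> coef(fit)                      # estimates should land near 3 and 2
> head(resid(fit))               # residuals: y - fitted(fit)
> head(fitted(fit))              # fitted values on the regression line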

3. Multiple linear regression
• Linear regression was used to model the effect of one variable, an explanatory variable, on another
• Multiple linear regression does the same, only there are multiple explanatory variables, or regressors
• In this case, the model formula syntax is easy to use
• In simple regression we used: z ~ x
• To add another explanatory variable you just "add" it to the right side of the formula
• That is, to add y we use z ~ x + y instead of simply z ~ x

4. Multiple linear regression
Let's investigate the model:
> x = 1:10
> y = sample(1:100, 10)
> z = x + y        # notice no error term -- sigma = 0
> lm(z ~ x + y)    # we use lm() as before
...
Coefficients:
(Intercept)        x        y
    4.2e-15  1.0e+00  1.0e+00
# the model finds b_0 = 0, b_1 = 1, b_2 = 1, as expected

5. Multiple linear regression
Continuation:
> z = x + y + rnorm(10, 0, 2)   # now sigma = 2
> lm(z ~ x + y)
...
Coefficients:
(Intercept)        x        y
     0.4694   0.9765   0.9891
# found b_0 = 0.4694, b_1 = 0.9765, b_2 = 0.9891
> z = x + y + rnorm(10, 0, 10)  # more noise -- sigma = 10
> lm(z ~ x + y)
...
Coefficients:
(Intercept)        x        y
    10.5365   1.2127   0.7909

6. Multiple linear regression
• By default, the lm command only returns the coefficients (and the formula call)
• The two methods summary and anova yield more information
• The output of summary is similar to that for simple regression
• In the example of multiple linear regression, the R command is: summary(lm(z ~ x + y))
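
Continuing with the simulated x, y and z from the previous slides, a short sketch of what the extra output offers (the object name fit is ours, for illustration):

> fit = lm(z ~ x + y)
> summary(fit)            # standard errors, t-values, p-values, R^2, F-statistic
> summary(fit)$r.squared  # individual pieces can be extracted by name
> anova(fit)              # sequential ANOVA table for the model terms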

7. Analysis of variance
• The t-test was used to test hypotheses about the means of two independent samples, for example, to test if there is a difference between control and treatment groups
• Analysis of variance (ANOVA) allows one to compare the means of more than two independent samples

8. One-way analysis of variance
Example: scholarship grading
• Suppose a school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, suppose 6 are used
• The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students are not being treated equally. One approach to checking whether the graders are using the same scale is to randomly assign each grader 50 exams and have them grade. Then compare the grades for the 6 graders, knowing that the differences should be due to chance errors if all the graders grade equally
• To illustrate, suppose we have just 24 tests and 3 graders (rather than 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is the range 1-5, with 5 being the best

9. One-way analysis of variance
Data for the scholarship grading example:
> x = c(4,3,4,5,2,3,4,5)      # enter this into our R session
> y = c(4,4,5,5,4,5,4,4)
> z = c(3,4,2,4,5,5,4,4)
> scores = data.frame(x,y,z)  # make a data frame
> boxplot(scores)             # compare the three distributions
• From the boxplots, it appears that grader 2 is different from graders 1 and 3

10. One-way analysis of variance
Scholarship grading example:
• Analysis of variance allows us to investigate whether all the graders have the same mean
• The R function for the analysis of variance hypothesis test (oneway.test) requires the data to be in a different format
• It wants the data as a single variable holding the scores, together with a factor describing the grader or category - the stack command will do this for us (previewed in the sketch below)
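
To see what the reshaping does before overwriting scores, one can preview the first rows of the stacked result (output abridged; the values are grader x's scores entered earlier):

> head(stack(scores))
  values ind
1      4   x
2      3   x
3      4   x
4      5   x
5      2   x
6      3   x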

11. One-way analysis of variance
Scholarship grading example:
> scores = stack(scores)  # look at scores if not clear
> names(scores)
[1] "values" "ind"
> oneway.test(values ~ ind, data=scores, var.equal=T)
Notice: we state explicitly that the variances are equal with var.equal=T
Result: we see a p-value of 0.34, so we fail to reject the null hypothesis of equal means

12. One-way analysis of variance
Scholarship grading example:
• The anova function gives more detail - you need to call it on the result of lm
> anova(lm(values ~ ind, data=scores))
• Notice that the F statistic and p-value agree with those given by oneway.test
• Alternatively, you could use the aov function to replace the combination anova(lm())
• However, to get a similar output you need to apply the summary command to the output of aov, as in the sketch below (for more on this, get help: enter ?aov)
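
A minimal sketch of the aov route, assuming the stacked scores data frame from the previous slide:

> a = aov(values ~ ind, data=scores)  # fit the one-way ANOVA directly
> summary(a)                          # ANOVA table, matching anova(lm(...)) above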

13. One-way analysis of variance
Example: T4data
> t4data = read.delim("t4data.txt")  # read tab-delimited data
> attach(t4data)
> names(t4data)
> plot(FT4 ~ Group)                  # draws boxplots if Group is a factor
> boxplot(FT4 ~ Group, notch=T)      # notched boxes for a rough comparison
> aov(FT4 ~ Group)                   # two equivalent ways to fit the model
> lm(FT4 ~ Group)
> a1 = aov(FT4 ~ Group)              # store the aov fit
> summary(a1)                        # ANOVA table
> names(a1)                          # components of the fitted object
> a1$coefficients
> l1 = lm(FT4 ~ Group)               # store the lm fit
> anova(l1)                          # same ANOVA table via lm
> names(l1)
> l1$coefficients
> l1$effects

14. Two-way analysis of variance
Example: ISdata
> ISdata = read.delim("ISdata.txt")
> attach(ISdata)
> names(ISdata)
> boxplot(IS ~ HT + BMI)
> table(HT, BMI)                          # cell counts for the two factors
> t2 = tapply(IS, list(HT, BMI), median)  # median of IS in each cell
> t2[2,3]
> lm(IS ~ BMI*HT)                         # '*' means main effects plus all interactions
> lm(IS ~ BMI + HT + BMI:HT)              # the same model written out in full
> anova(lm(IS ~ BMI + HT + BMI:HT))
> anova(lm(IS ~ BMI + HT))                # additive model, no interaction
> anova(lm(IS ~ HT + BMI))                # order matters for sequential sums of squares
> aov(IS ~ BMI + HT)
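
Since ISdata.txt comes with the course materials and may not be at hand, the same steps can be reproduced on R's built-in warpbreaks data (this dataset substitution is ours, not the slide's):

> data(warpbreaks)                                       # breaks counts by wool type and tension
> anova(lm(breaks ~ wool * tension, data=warpbreaks))    # model with interaction
> anova(lm(breaks ~ wool + tension, data=warpbreaks))    # additive model
> summary(aov(breaks ~ wool + tension, data=warpbreaks)) # same table via aov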

15. Two-way analysis of variance
Example: ISdata, continued
> TukeyHSD(aov(IS ~ BMI + HT))  # pairwise comparisons following the ANOVA
> par(mfrow=c(1,2))             # 1 x 2 plot layout
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="BMI"))
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="HT"))
> par(mfrow=c(2,2))             # 2 x 2 plot layout
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="BMI"))
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="HT"))
> plot(TukeyHSD(aov(IS ~ HT + BMI), which="BMI"))
> plot(TukeyHSD(aov(IS ~ HT + BMI), which="HT"))

16. Correlation coefficient
Example: t4data
> ls()
> attach(t4data)
> names(t4data)
> plot(FT4, FT3)
> plot(FT4, FT3, pch=c(2,4)[Gender])  # plotting symbol according to gender
> plot(FT4, FT3, pch=c(2,4)[Gender], col=c(2,4)[Gender])
> table(Gender)
> cor(FT4, FT3)                   # the correlation coefficient r
> cor(FT4, FT3)^2                 # r^2
> cor.test(FT4, FT3)              # test of zero correlation
> cor.test(FT4, FT3, method="s")  # Spearman rho
> cor.test(FT4, FT3, method="k")  # Kendall tau
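
As above, a reproducible variant on built-in data for readers without t4data.txt (our substitution, using mtcars):

> cor(mtcars$wt, mtcars$mpg)       # Pearson r; strongly negative here
> cor(mtcars$wt, mtcars$mpg)^2     # r^2
> cor.test(mtcars$wt, mtcars$mpg)  # test with confidence interval and p-value
> cor.test(mtcars$wt, mtcars$mpg, method="spearman")  # rank-based; a ties warning is harmless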

17. Miscellaneous exercises
• Due to time restrictions, we cannot cover all areas
• See the accompanying "R_miscellaneous_practical.doc"
• The document contains some useful R commands
• Work through the R commands in your own time

18. Sources of help
• The R project website: http://www.r-project.org
• The 'official' introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
• Manuals, tutorials, etc. provided by users of R: http://cran.R-project.org/other-docs.html
