R: Because the names of other stat programs don’t make sense so why should this one?

Alexander C. LoPilato

Outline
• The three Ws of R: What, Where, and Why
• Commonly used operators
• Formatting your data for R
• Working with data in R
• Exporting data from R

What is R?
• “R is a language and environment for statistical computing and graphics.” (http://www.r-project.org)
• It’s a programming language first and a statistical analysis tool second
• Entirely syntax based, similar to SAS and SPSS syntax
• Users can download “packages”, which are similar to SPSS Modules

Where is R?
• Available for download at: http://www.r-project.org/
• Works on PCs, Macs, and Linux
• Doesn’t require a ton of computer memory (I find that it runs smoother than both SAS and SPSS)

Why use R?
• It’s an open-source project, aka FREE!
• It’s gaining traction in both industry (Google, Facebook, & Kickstarter) and academia
• Its combination of programming flexibility and statistical analysis capabilities makes it one of the more powerful data analysis programs out there

Commonly used operators
• <- : Assignment operator
• # : Comment operator
• >, <, ==, | : Boolean operators (comparisons and logical OR)
• +, -, *, ^ : Mathematical operators
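A quick sketch of these operators in action. The object names (x, is_big, and so on) are made up for illustration and are not part of the original slides.

```r
# Assignment: store the value 5 in an object named x (hypothetical name)
x <- 5

# Comparison operators return TRUE or FALSE
is_big   <- x > 3      # greater than
is_three <- x == 3     # equal to (two equals signs, not one)

# | is logical OR: TRUE if either side is TRUE
either <- (x < 3) | (x == 5)

# Mathematical operators
squared <- x^2             # exponent
total   <- x + 2 * 3 - 1   # usual order of operations: 5 + 6 - 1

print(c(is_big, is_three, either))
print(c(squared, total))
```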

Formatting your data for R: A Brief Intro
• R can read SAS, SPSS, STATA, txt, and csv files
• I recommend that you store your data in a csv file
• R can easily read csv files
• Csv files can be imported to and exported from SAS and SPSS
• Other statistical programs can easily read csv files
• I write all of my code in Notepad (more habit than anything else), but R has many different GUIs

Formatting your data for R: Three easy steps
• 1) Turn your data file into a csv file
• 2) Use the read.csv() function
• Dataset <- read.csv('Dataset location.csv')
• 3) Dataset is now a user-defined object (in this particular case it’s a dataframe in R) that contains all of your data
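The three steps can be tried without a real data file. This is a minimal sketch that assumes a tiny made-up csv written to a temporary location; the column names and values are invented for the example.

```r
# Step 1 (stand-in): write a small, made-up csv file to a temporary path
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Var1,Var2,Var11",
    "3,4.5,Female",
    "5,2.0,Male",
    "4,NA,Female"),
  csv_path
)

# Step 2: read the csv file into R
Dataset <- read.csv(csv_path)

# Step 3: Dataset is now a dataframe holding the data
print(class(Dataset))
str(Dataset)
```

With your own data, replace csv_path with the quoted location of your file.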

Formatting your data for R: Common Mistakes (That I’ve made 100 times over)
• R treats \ (the backslash) as an escape character inside quotes, so when you write the location of your dataset you have to use either / or \\
• R is case sensitive, so 'C:\\Dataset.csv' and 'C:\\dataset.csv' are not the same in R speak
• Always make sure you include the file extension (.csv, .txt, .whatever)!
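A small sketch of these three pitfalls, using made-up paths that are never opened, so nothing needs to exist on disk. The folder name Data is an assumption for illustration.

```r
# Two ways to write the same (hypothetical) Windows location
path_forward <- "C:/Data/Dataset.csv"
path_escaped <- "C:\\Data\\Dataset.csv"

# Each \\ typed in the code is stored as a single backslash
print(nchar(path_escaped))

# Swapping the backslashes for forward slashes gives the same string
same_location <- gsub("\\", "/", path_escaped, fixed = TRUE) == path_forward

# Case matters when R compares strings
case_differs <- !identical("C:/Data/Dataset.csv", "C:/Data/dataset.csv")

# Quick check that the path ends in the extension you expect
has_extension <- tools::file_ext(path_forward) == "csv"

print(c(same_location, case_differs, has_extension))
```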

Working with data in R: Things to check
• I always check the dimensions of my dataset
• dim(Dataset) returns two numbers: rows x columns. Rows = number of cases and columns = number of variables
• Check the names of your dataset
• names(Dataset)
• Check the descriptive statistics for anything out of the ordinary:
• colMeans(Dataset[,1:10],na.rm=T)
• sapply(Dataset[,1:10],sd,na.rm=T)
• Notice the brackets?
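These checks can be sketched on a small made-up dataframe. The three columns and their values below are invented for the example, so the column range is 1:3 rather than 1:10.

```r
# A tiny, made-up dataset with one missing value
Dataset <- data.frame(
  Var1 = c(1, 2, 3, NA),
  Var2 = c(4, 4, 5, 7),
  Var3 = c(2, 6, 2, 6)
)

# Dimensions: number of rows (cases), then number of columns (variables)
print(dim(Dataset))

# Variable names
print(names(Dataset))

# Column means and standard deviations, ignoring missing values
col_means <- colMeans(Dataset[, 1:3], na.rm = TRUE)
col_sds   <- sapply(Dataset[, 1:3], sd, na.rm = TRUE)

print(col_means)
print(col_sds)
```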

Working with data in R: Subsetting your dataset
• First, begin thinking about your dataset as a matrix
• Rows = cases and columns = variables
• Dataset[5,1] means return the observation stored in row 5, column 1
• Dataset[,1] means return all of the rows in column 1
• Dataset[2,1:5] means return the observations in row 2, columns 1 through 5
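A minimal sketch of bracket indexing on a made-up dataframe (the values are invented so that each cell is easy to recognize).

```r
# Made-up data: 6 cases (rows) and 5 variables (columns)
Dataset <- data.frame(
  Var1 = 1:6,
  Var2 = 11:16,
  Var3 = 21:26,
  Var4 = 31:36,
  Var5 = 41:46
)

# Row 5, column 1: a single value
one_cell <- Dataset[5, 1]

# All rows of column 1: leave the row position empty
first_col <- Dataset[, 1]

# Row 2, columns 1 through 5: a one-row dataframe
row_two <- Dataset[2, 1:5]

print(one_cell)
print(first_col)
print(row_two)
```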

Working with data in R: Subsetting your dataset
• Alternatively, you can reference a column directly by using the $ operator:
• Dataset$Var1 will return the entire Var1 column from Dataset
• What if I want to filter by some variable?
• ds.Female <- Dataset[Dataset$Var11 == 'Female',]
• The above creates a dataframe called ds.Female that keeps only the cases where Var11 equals 'Female' (so any case where Var11 equals 'Male' is filtered out)
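A sketch of the $ operator and the filter on made-up data. One caveat worth knowing: if Var11 has missing values, the filter as written returns an extra row of NAs for each one. Wrapping the condition in which() is one way around that; the name ds.Female.clean is hypothetical.

```r
# Made-up data with one missing gender value
Dataset <- data.frame(
  Var1  = c(10, 20, 30, 40),
  Var11 = c("Female", "Male", "Female", NA)
)

# Pull out a whole column by name
var1 <- Dataset$Var1

# Filter as on the slide: the missing Var11 produces an all-NA row
ds.Female <- Dataset[Dataset$Var11 == "Female", ]

# which() keeps only the rows where the condition is TRUE
ds.Female.clean <- Dataset[which(Dataset$Var11 == "Female"), ]

print(ds.Female)
print(ds.Female.clean)
```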

Working with data in R: Reverse Coding
• What do you do if you have some variables that need to be reverse coded?
• (1 + highest scale value - Variable) is the general formula for a scale that starts at 1
• Dataset$Var12 <- 8 - Dataset$Var10
• This does two things: 1) creates another column in Dataset labeled Var12 and 2) sets Var12 equal to 8 - Var10
• Check with cor(Dataset$Var10, Dataset$Var12, use='complete.obs'); the correlation should be exactly -1
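A short sketch of reverse coding, assuming Var10 is a 1-to-7 item (which is what 8 - Var10 implies). The responses below are made up.

```r
# Made-up responses on a 1-to-7 scale, with one missing value
Dataset <- data.frame(Var10 = c(1, 4, 7, NA, 2))

# Reverse code: 1 + highest scale value (7) - Variable
Dataset$Var12 <- 8 - Dataset$Var10

# A correctly reverse-coded item correlates -1 with the original
r <- cor(Dataset$Var10, Dataset$Var12, use = "complete.obs")

print(Dataset)
print(r)
```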

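Working with data in R: Internal R Functions
• mean(Dataset$Var1,na.rm=T) gives the same result as mean(Dataset[,1],na.rm=T)
• sd(Dataset$Var4, na.rm=T)
• min(Dataset$Var5,na.rm=T) and max(Dataset$Var5, na.rm=T)
• cor(Dataset, use='complete.obs') returns the full correlation matrix; every column must be numeric, so subset out non-numeric columns such as Var11 first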
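These functions can be tried on a small made-up dataframe. All columns here are numeric, which is what cor() needs; the values are invented for the example.

```r
# Made-up numeric data with a couple of missing values
Dataset <- data.frame(
  Var1 = c(2, 4, 6, NA, 8),
  Var4 = c(1, 2, 3, 4, 5),
  Var5 = c(9, 3, NA, 5, 7)
)

# Referencing a column by name or by position gives the same mean
same_mean <- mean(Dataset$Var1, na.rm = TRUE) == mean(Dataset[, 1], na.rm = TRUE)

sd4    <- sd(Dataset$Var4, na.rm = TRUE)
range5 <- c(min(Dataset$Var5, na.rm = TRUE), max(Dataset$Var5, na.rm = TRUE))

# Correlation matrix using only the rows with no missing values
cors <- cor(Dataset, use = "complete.obs")

print(same_mean)
print(sd4)
print(range5)
print(cors)
```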

Working with data in R: Internal R Functions
• modlm <- lm(Var2 ~ Var3, data=Dataset)
• Ordinary Least Squares Regression, regressing variable 2 onto variable 3
• modanova <- lm(Var4 ~ as.factor(Var11), data=Dataset)
• OLS Regression, regressing variable 4 onto the categorical gender variable. This is an ANOVA!
• modanova1 <- aov(Var4 ~ as.factor(Var11), data=Dataset)
• aov is R’s built-in ANOVA function
• dif <- TukeyHSD(modanova1)
• Tukey’s Honestly Significant Difference Test
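A runnable sketch of these models on simulated data. The dataset below is randomly generated for illustration only (the effect sizes and sample size are arbitrary assumptions), but it shows that the lm and aov versions of the ANOVA give the same F statistic.

```r
# Simulated data; set.seed makes the random numbers reproducible
set.seed(1)
n <- 40
Dataset <- data.frame(
  Var3  = rnorm(n),
  Var11 = rep(c("Female", "Male"), each = n / 2)
)
Dataset$Var2 <- 0.5 * Dataset$Var3 + rnorm(n)
Dataset$Var4 <- ifelse(Dataset$Var11 == "Male", 1, 0) + rnorm(n)

# OLS regression of Var2 on Var3
modlm <- lm(Var2 ~ Var3, data = Dataset)
print(summary(modlm))

# The same one-way ANOVA fit two ways
modanova  <- lm(Var4 ~ as.factor(Var11), data = Dataset)
modanova1 <- aov(Var4 ~ as.factor(Var11), data = Dataset)

# Pull the F statistic from each fit to compare them
f_lm  <- anova(modanova)[1, "F value"]
f_aov <- summary(modanova1)[[1]][1, "F value"]
print(c(f_lm, f_aov))

# Tukey's HSD on the aov fit (one comparison with two groups)
dif <- TukeyHSD(modanova1)
print(dif)
```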

Exporting Data from R
• write.csv(Dataset, 'Location.csv')
• BOOM goes the dynamite
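A sketch of an export round trip using a temporary file and made-up data. By default write.csv also writes the row names as an extra first column; row.names = FALSE, used here, leaves them out, which is usually what you want when the file is headed to another program.

```r
# Made-up data to export
Dataset <- data.frame(
  Var1  = c(1, 2, 3),
  Var11 = c("Female", "Male", "Female")
)

# Write to a temporary location (swap in your own quoted file location)
out_path <- tempfile(fileext = ".csv")
write.csv(Dataset, out_path, row.names = FALSE)

# Read it back in to confirm the export worked
round_trip <- read.csv(out_path)
print(round_trip)
```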

Thank you!
