Mastering Data Manipulation in R: Practical Guide

Introduction to R: Lesson 2 - Manipulating Data Andrew Jaffe 9/13/10

Reminder • Here is the course website: http://www.biostat.jhsph.edu/~ajaffe/rseminar.html • There is a running collection of functions that we have covered in class

Dataset • For the remaining sessions, we’re going to learn R by using data from the Baltimore Dog Study • Data collection is ongoing, and the dataset will be updated weekly: http://metrodog.blogspot.com/

Overview • Importing Data • Examining Data • Recoding Variables • Exporting Data

Importing Data • Here is a link to the data: http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv • So how do we get it into R? Two options! • Both involve read.table()

Importing Data • read.table(filename, header = F, sep = "", as.is = !stringsAsFactors, …) • In functions, "…" means additional parameters can be passed/used • These are some of the options associated with this function – all can be seen by typing ?read.table in the console

Importing Data • filename: the path to your file, in quotes • You can give a full path (ie "C:\\Docs\\data.txt" on a PC or "/Users/Andrew/data.txt" on a Mac) • If no path is specified (ie "data.txt"), then R will look in your working directory for the file • For PCs, you need double backslashes to designate paths (ie "C:\\Docs\\data.txt"); forward slashes (ie "C:/Docs/data.txt") also work • Basically, a single backslash is the 'escape' character
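To make the backslash rule concrete, here is a minimal sketch. The paths are illustrative only (nothing is read from disk), and the variable names are made up for this example.

```r
# Illustrative paths only; these files do not need to exist.
win_path <- "C:\\Docs\\data.txt"   # each \\ is stored as ONE backslash
alt_path <- "C:/Docs/data.txt"     # forward slashes are also accepted by R

nchar(win_path)        # counts the characters actually stored
cat(win_path, "\n")    # cat() shows the string as R stores it

# file.path() joins pieces with "/" so you never type a separator by hand
built <- file.path("C:", "Docs", "data.txt")
identical(built, alt_path)
```

Printing with cat() rather than print() is a quick way to check what a path string really contains, since print() shows the escaped form.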

Importing Data • filename - you can: • Write out the full file path using quotes and the correct syntax • Manually set your working directory to where your script and files are located [setwd()] • Or, if your script and files are in the same place, use Notepad++. It sets the script's location to be the working directory
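A small sketch of the working-directory option follows. It uses tempdir() as a stand-in for "the folder where your script and files live", which is an assumption made only so the example runs anywhere.

```r
old_wd <- getwd()            # remember where we started
setwd(tempdir())             # tempdir() stands in for your own project folder

getwd()                      # confirm the new working directory
file.exists("data.txt")      # a bare file name is looked up in the working directory

setwd(old_wd)                # put things back the way they were
```

In your own script you would replace tempdir() with the quoted path to your folder.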

Importing Data • header – default is false • Does the first row of your file contain column names? If so, include 'header = T' in your read.table() call

Importing Data • sep = "" – what character separates columns? • The default, "", means any white space • Special characters are written with the escape character: • Tab: "\t" • Newline/Enter/Return: "\n" • Ordinary characters need no escape: Ampersand: "&", etc

Importing Data • CSV is an exception • A special case of read.table() exists: read.csv(), which takes all of the same parameters, but defaults to sep = "," and header = TRUE • Analogously, read.delim() defaults to sep = "\t" (and also header = TRUE)
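The relationship between read.table(), read.csv() and read.delim() can be sketched with a few lines of toy data. The values are made up, and the text= argument is used in place of a file name so that the example is self-contained.

```r
# Toy data typed inline (hypothetical values), so no file is needed.
tab_text <- "id\tsex\n1\tF\n2\tM"
csv_text <- "id,sex\n1,F\n2,M"

# Spell everything out with read.table() ...
from_table <- read.table(text = tab_text, header = TRUE, sep = "\t")

# ... or let the special cases supply sep and header for you
from_delim <- read.delim(text = tab_text)
from_csv   <- read.csv(text = csv_text)

names(from_table)
dim(from_delim)
from_csv$id
```

All three calls should give a data frame with two rows and the columns id and sex.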

Importing Data • as.is = F (same as stringsAsFactors = T): should character strings be treated as factors? • This was the default before R 4.0.0; since then stringsAsFactors defaults to FALSE • I prefer character strings as characters (ie as.is = T) and not factors • Easier to manipulate, search, and match • You can always change to factors later
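Here is a short sketch of the difference, again using made-up inline data rather than the course file.

```r
csv_text <- "id,sex\n1,F\n2,M"   # hypothetical two-row dataset

as_chr <- read.csv(text = csv_text, as.is = TRUE)              # keep strings as characters
as_fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)   # convert strings to factors

class(as_chr$sex)
class(as_fac$sex)

# "You can always change to factors later":
sex_factor <- as.factor(as_chr$sex)
levels(sex_factor)
```

Setting the argument explicitly, as above, makes a script behave the same way regardless of which R version runs it.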

Importing Data • Let's open up a new script: • Notepad++: File → New • Mac: File → New Document • Save it somewhere you can find later • Write a header (using #) • If Mac, use setwd() and include the folder where you put the script

Importing Data • Let's get our data into R • Option 1: remember ‘scan’ from last session?
file = "http://www.biostat.jhsph.edu/~ajaffe/files/lecture_2_data.csv"

Importing Data • Option 2: Right click on the link to the data on the webpage, and save it as a csv file in the same folder as your script
file = "lecture_2_data.csv"

Importing Data • Either way:
dat <- read.csv(file, header = T, as.is = T)


Examining Data • What are the dimensions of the dataset?
> dim(dat)
[1] 1000    7
    Rows Columns


Examining Data • What variables are included? What are their names?
> head(dat)
  id age sex height weight dog dog_type
1  1  40   F   63.5  134.5  no     <NA>
2  2  36   M   65.6  191.6  no     <NA>
3  3  69   M   68.2  170.0  no     <NA>
4  4  56   F   62.9  134.5  no     <NA>
5  5  66   F   63.7  133.4  no     <NA>
6  6  84   M   70.8  200.6  no     <NA>

Examining Data • What variables are included? What are their names?
> names(dat)
[1] "id"       "age"      "sex"      "height"
[5] "weight"   "dog"      "dog_type"


Examining Data • What class of data is 'id'? 'dog_type'?
> class(dat$id)
[1] "integer"
> class(dat$dog_type)
[1] "character"

Examining Data • What class of data is 'id'? 'dog_type'?
> str(dat)
'data.frame': 1000 obs. of 7 variables:
 $ id      : int 1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int 40 36 69 56 66 84 40 73 76 38 ...
 $ sex     : chr "F" "M" "M" "F" ...
 $ height  : num 63.5 65.6 68.2 62.9 63.7 70.8 67 67 62.6 62.2 ...
 $ weight  : num 134 192 170 134 133 ...
 $ dog     : chr "no" "no" "no" "no" ...
 $ dog_type: chr NA NA NA NA ...


Examining Data • How many total participants are there? • How many men and how many women?
> length(unique(dat$id))
[1] 1000
> unique(c(1,1,2,2,3))
[1] 1 2 3
> length(unique(c(1,1,2,2,3)))
[1] 3
> length(c(1,1,2,2,3))
[1] 5

Examining Data • How many total participants are there? • How many men and how many women?
> table(dat$sex)

  F   M
493 507


Examining Data • How many people have dogs?
> table(dat$dog)

 no yes
518 482


Examining Data • How many different types of dogs are there? How many of each?
> table(dat$dog_type)

    husky       lab    poodle retriever
      113       125       111       133
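Note that the four dog_type counts add up to 482, the number of dog owners: by default table() leaves out missing values. A minimal sketch of how to make them visible, using a made-up six-element stand-in for the dog_type column:

```r
# Hypothetical mini-version of the dog_type column
dog_type <- c("lab", NA, "husky", "lab", NA, NA)

table(dog_type)                    # NAs are silently dropped
table(dog_type, useNA = "ifany")   # adds a count for NA when any are present

sum(is.na(dog_type))               # direct count of the missing entries
```

The useNA argument also accepts "always", which reports an NA count even when it is zero.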

Recoding Data • Missingness: represented by 'NA' [default] • read.table(…, na.strings = "NA", …) – you can change based on your data • 'NA' is NOT a character string:
> x = rep(NA,3)
> x
[1] NA NA NA
> class(x)
[1] "logical"
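As a sketch of changing na.strings, suppose a file codes missing values as "." and "-99". Both codes and all data values below are invented for illustration.

```r
# Hypothetical file contents where missing values are coded as "." or "-99"
csv_text <- "id,age,dog_type\n1,40,.\n2,-99,lab\n3,69,husky"

raw   <- read.csv(text = csv_text, as.is = TRUE)
clean <- read.csv(text = csv_text, as.is = TRUE, na.strings = c(".", "-99"))

raw$age                  # the -99 is read as a real number
clean$age                # the -99 is now NA
is.na(clean$dog_type)    # the "." is now NA as well
```

Listing the codes at import time avoids accidentally averaging a placeholder such as -99 into your results.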

Recoding Data • NA values carry through element-wise calculations, and many summary functions return NA unless you tell them to remove missing values (na.rm = TRUE)
> x = c(NA, 1, NA, 3, 4)
> x*2
[1] NA  2 NA  6  8
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 2.666667

Recoding Data • is.na() tests for missing entries • Returns TRUE or FALSE at each entry
> x = c(NA, 1, NA, 3, 4)
> x
[1] NA  1 NA  3  4
> class(x)
[1] "numeric"
> is.na(x)
[1]  TRUE FALSE  TRUE FALSE FALSE

Recoding Data • which() returns the indices for entries that are TRUE
> which(is.na(x))
[1] 1 3

Recoding Data • '!' means 'not':
> which(!is.na(x))
[1] 2 4 5
> x
[1] NA  1 NA  3  4
> Index = which(!is.na(x))
> x[Index]
[1] 1 3 4

Recoding Data • ‘which’ is implicit when you subset using ‘is.na’ (or !is.na)
# in one step
> x[!is.na(x)]
[1] 1 3 4
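A brief sketch tying these pieces together: the two subsetting styles give the same answer, and the same logical index can be used to replace values. Replacing NA with 0 is purely illustrative here, not a recommendation for real data.

```r
x <- c(NA, 1, NA, 3, 4)

# Subsetting by index (which) and by logical vector give the same result
identical(x[which(!is.na(x))], x[!is.na(x)])

sum(is.na(x))        # TRUE counts as 1, so this is the number of missing entries

# A logical index can also be used on the left-hand side of an assignment
y <- x
y[is.na(y)] <- 0     # illustrative only: overwrite the missing entries
y
```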

Recoding Data • Recoding binary variables – ex: change sex from M/F to 0/1
> head(dat$sex)
[1] "F" "M" "M" "F" "F" "M"
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(bin.sex)
[1] 1 0 0 1 1 0

Recoding Data • ?ifelse: ifelse(test, yes, no) • test - an object which can be coerced to logical mode (ie TRUE or FALSE) • yes - return values for true elements of test • no - return values for false elements of test
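One detail worth seeing in action: when an element of test is NA, ifelse() returns NA for that element rather than choosing yes or no. A minimal sketch with made-up values:

```r
sex <- c("F", "M", NA, "F")   # hypothetical values, with one missing

sex == "F"                    # the test is TRUE, FALSE or NA for each element
bin.sex <- ifelse(sex == "F", 1, 0)
bin.sex                       # the missing entry stays missing
```

This means recoding with ifelse() will not silently turn a missing sex into a 0.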

Recoding Data • Logical (comparison) operators: ==, !=, <, >, <=, >= • Also: is.[type] – ie: is.na, is.character, is.data.frame, is.numeric, etc…
> x = c(1,3,7,9)
> x > 3
[1] FALSE FALSE  TRUE  TRUE
> x == 3
[1] FALSE  TRUE FALSE FALSE

Recoding Data
> bin.sex = ifelse(dat$sex=="F",1,0)
> head(dat$sex == "F")
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE
> head(bin.sex)
[1] 1 0 0 1 1 0

Recoding Data • Analogously, creating a cut-point in continuous data:
> head(dat$age)
[1] 40 36 69 56 66 84
> bin.age = ifelse(dat$age < 50, 0, 1)
> head(bin.age)
[1] 0 0 1 1 1 1
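When you create a cut-point, it is worth checking which side the boundary value lands on. The sketch below reuses the six ages shown by head(dat$age) and adds a made-up age of exactly 50.

```r
# First six ages from head(dat$age), plus a hypothetical boundary case of 50
age <- c(40, 36, 69, 56, 66, 84, 50)

bin.age <- ifelse(age < 50, 0, 1)
bin.age                        # age 50 is NOT < 50, so it falls in group 1

# Cross-tabulate the original condition against the new variable as a check
table(age < 50, bin.age)

# For a 0/1 recode, converting the logical directly gives the same result
identical(as.numeric(age >= 50), bin.age)
```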

Exporting Data • write.table(data, filename, quote = T, row.names = T, col.names = T, sep = " ") • 'data' is an R object – ie 'dat' in our case • 'filename' is similar to read.table – you should include a '.txt' in the filename • 'quote' puts character strings in quotes (I like setting that to be FALSE [or F])

Exporting Data • row.names: include the row names in the output? These are usually just a sequence from 1 to nrow(dat) – I prefer FALSE, as Excel automatically has row indices • col.names: include the header names in the output file? Depending on the data, I usually use TRUE
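Putting the export options together, here is a minimal round-trip sketch. The small data frame and the output file name are made up, and the file is written to tempdir() so the example does not touch your own folders.

```r
# Hypothetical three-row data frame standing in for 'dat'
dat_small <- data.frame(id = 1:3, sex = c("F", "M", "M"),
                        stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "dat_small.txt")   # made-up file name

write.table(dat_small, out_file, quote = FALSE,
            row.names = FALSE, col.names = TRUE, sep = "\t")

readLines(out_file)   # look at the raw text that was written

# Read it back with matching options to confirm nothing was lost
back <- read.table(out_file, header = TRUE, sep = "\t", as.is = TRUE)
identical(back$id, dat_small$id)
identical(back$sex, dat_small$sex)
```

Using the same sep for writing and reading is what makes the round trip work; with quote = FALSE, a tab separator is a safer choice than a space if any value could contain spaces.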

Practice • Make a 2 x 2 table of sex and dog • Create a 'BMI' variable using height and weight • Hint: BMI = weight[lbs]*703/(height[in])^2 • Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise
