Introduction to R - Lecture 5: More loops

Andrew Jaffe, 10/4/2010

Overview
• Review: For Loop
• Lists
• Aside: Patterns
• Application

Review: For Loop
• The syntax is: for(var in seq) {code}
• The seq determines what values var will take in the loop
• The loop is performed length(seq) times
• On the n’th iteration of the loop, var takes the value seq[n]
• var is a completely new variable and is not directly related to any other variable

Review: For Loop
• Setting up your loop requires determining the correct seq to loop over: usually easy
• The real challenge of looping is relating the values of seq to the dimensions/indices of your data

Review: For Loop
• From last lecture: we’re relating seq to the columns of the data
• var is indirectly related to the data, as it links/relates to the column indices, but it only takes the values 1-12
Index = 4:15
mean_wt <- rep(0, length(Index))
for(i in 1:length(Index)) {
  ind = Index[i] # column index
  mean_wt[i] = mean(dog_dat[,ind])
}
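The distinction between the loop variable and the column index can be checked on a small example. This is a minimal sketch on a made-up data frame (`toy` and its column names are assumptions, not the lecture data):

```r
# Hypothetical toy data: one ID column followed by 3 monthly weight columns
toy <- data.frame(id   = 1:3,
                  wt_1 = c(10, 20, 30),
                  wt_2 = c(12, 22, 32),
                  wt_3 = c(14, 24, 34))

Index   <- 2:4                      # columns that hold the weights
mean_wt <- rep(0, length(Index))    # one output slot per weight column

for (i in 1:length(Index)) {        # i takes the values 1, 2, 3
  ind <- Index[i]                   # ind takes the values 2, 3, 4 (column index)
  mean_wt[i] <- mean(toy[, ind])    # store result in slot i, not slot ind
}
mean_wt   # 20 22 24
```

Here `i` indexes the output vector while `Index[i]` indexes the data, which is why the two must be kept apart.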

Lists
• "An R list is an object consisting of an ordered collection of objects known as its components."
• "Components are always numbered and may always be referred to as such"
  – double brackets can subset lists
Source: CRAN, An Introduction to R

Lists
> L = list() # empty list
> L[[1]] = 1:4
> L[[2]] = 2:7
> L[[3]] = c("a","b","c")
> L[[4]] = matrix(rnorm(4), nrow = 2)
> L
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 4 5 6 7

[[3]]
[1] "a" "b" "c"

[[4]]
            [,1]       [,2]
[1,] -1.43944849 -0.4801696
[2,]  0.09923108  1.0783053

Lists
> names(L) = c("seq1","seq2","letters","mat")
> L
$seq1
[1] 1 2 3 4

$seq2
[1] 2 3 4 5 6 7

$letters
[1] "a" "b" "c"

$mat
          [,1]      [,2]
[1,]  1.824487 0.3431034
[2,] -0.533006 0.9406285

Lists
> L[[1]]
[1] 1 2 3 4
> str(L)
List of 4
 $ seq1   : int [1:4] 1 2 3 4
 $ seq2   : int [1:6] 2 3 4 5 6 7
 $ letters: chr [1:3] "a" "b" "c"
 $ mat    : num [1:2, 1:2] 1.824 -0.533 0.343 0.941

Lists
• Why know lists?
• Can store data of different lengths and types
• Some functions return lists
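Both points can be illustrated briefly. The sketch below uses a small made-up list and `strsplit()` as one example of a base R function that returns a list:

```r
# A list can hold components of different lengths and types
L <- list(seq1 = 1:4, letters = c("a", "b", "c"))

one <- L[1]      # single bracket: still a list (of length 1)
two <- L[[1]]    # double bracket: the component itself (an integer vector)
is.list(one)     # TRUE
is.list(two)     # FALSE

# strsplit() returns a list, one component per input string
parts <- strsplit("month_1", "_")
is.list(parts)   # TRUE
parts[[1]][2]    # "1"
```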

Lists
• Load back in the lecture 4 data
• We still have one problem to solve: the averages of weight, length, and food for each dog type at each visit

Lists
• First we can create a list containing each group we care about
Indexes = list()
Indexes[[1]] = 4:15 # weight
Indexes[[2]] = 16:27 # length
Indexes[[3]] = 28:39 # food
names(Indexes) = c("weight", "length", "food")

Lists
> Indexes
$weight
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

$length
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

$food
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

Lists
• Next, we can create an output list for our results, and recreate the unique dog list for our loop
out <- list()
dogs = unique(dog_dat$dog_type)

Lists
• We want to loop over the different covariates (wt, len, food) and within each, the different dog types
• For looping over the groups, either works:
> seq(along = Indexes)
[1] 1 2 3
> 1:length(Indexes)
[1] 1 2 3
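The two forms agree whenever the list has at least one component. They differ for an empty list, which is worth knowing when the list is built programmatically. A short sketch (`empty` is a made-up name; `seq_along()` is the base R shorthand for `seq(along = ...)`):

```r
empty <- list()

a <- 1:length(empty)      # c(1, 0): a loop over this would run twice
b <- seq(along = empty)   # integer(0): a loop over this would not run at all
d <- seq_along(empty)     # same result as seq(along = empty)

runs <- 0
for (i in seq_along(empty)) runs <- runs + 1
runs   # 0
```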

Lists
for(i in seq(along = Indexes)) { # 1:3
  # take the i'th index from the list
  Index = Indexes[[i]]
  # for that variable, create a new matrix
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  ...

Lists
• We can then fill in that temporary matrix with an inner 'for' loop
• Note that this is the exact same loop as last week (note the j's)
• Index comes from the outer loop
for(j in 1:length(dogs)) {
  hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
  tmp[j,] = colMeans(hold)
}
rownames(tmp) = dogs
colnames(tmp) = paste("month",1:12,sep="_")

Lists
• Lastly, we save that tmp matrix in our output list:
out[[i]] = tmp

for(i in seq(along = Indexes)) { # groups
  Index = Indexes[[i]]
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  for(j in 1:length(dogs)) { # dogs
    hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month",1:12,sep="_")
  out[[i]] = tmp
}
names(out) <- c("weight","length","food")

> out
$weight
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600
poodle    49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820
husky     49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195
retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248
           month_8  month_9 month_10 month_11 month_12
lab       46.54640 44.68640 45.15040 44.30640 45.88240
poodle    46.12613 44.26577 44.73243 43.89009 45.46306
husky     45.98761 44.12832 44.59469 43.75221 45.31858
retriever 46.91278 45.05263 45.51654 44.68496 46.24586

$length
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800
poodle    20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198
husky     20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071
retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880
           month_8  month_9 month_10 month_11 month_12
lab       21.10400 21.20880 21.40720 21.57440 21.87440
poodle    21.82072 21.92432 22.12342 22.29009 22.58919
husky     21.47699 21.58142 21.77876 21.94779 22.24779
retriever 21.64962 21.75414 21.95263 22.12406 22.42406

$food
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160
poodle    30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910
husky     30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584
retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564
           month_8  month_9 month_10 month_11 month_12
lab       29.20880 30.03200 29.89120 29.54240 30.89520
poodle    29.20631 30.02613 29.88739 29.53243 30.89550
husky     29.29646 30.11770 29.97345 29.62389 30.98053
retriever 29.06617 29.88722 29.74887 29.39248 30.75338
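To try the nested loop without the lecture 4 file, the sketch below runs the same logic on a made-up stand-in for `dog_dat` (2 dog types, 2 measures, 2 months; all values are invented). As a small variation, the column and list names are derived from `Index` and `Indexes` instead of being typed out:

```r
# Hypothetical stand-in for dog_dat (invented values)
dog_dat <- data.frame(
  dog_type = c("lab", "lab", "poodle", "poodle"),
  wt_1  = c(10, 20, 30, 40), wt_2  = c(11, 21, 31, 41),
  len_1 = c(1, 3, 5, 7),     len_2 = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)
Indexes <- list(weight = 2:3, length = 4:5)
dogs    <- unique(dog_dat$dog_type)

out <- list()
for (i in seq(along = Indexes)) {            # groups
  Index <- Indexes[[i]]
  tmp <- matrix(nrow = length(dogs), ncol = length(Index))
  for (j in 1:length(dogs)) {                # dogs
    hold <- dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j, ] <- colMeans(hold)
  }
  rownames(tmp) <- dogs
  colnames(tmp) <- paste("month", seq(along = Index), sep = "_")
  out[[i]] <- tmp
}
names(out) <- names(Indexes)                 # reuse the group names
out$weight
```

With these invented values, the lab row of `out$weight` is 15 and 16, and the poodle row is 35 and 36.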

Aside
• This step is potentially dangerous, because hard-coded column numbers point at the wrong columns if the layout of the data changes:
  – Indexes[[1]] = 4:15 # weight
  – Indexes[[2]] = 16:27 # length
  – Indexes[[3]] = 28:39 # food
• Is there a better way? YES! Each group shares a common term in the column names:
  – wt, len, food

Aside
• grep(pattern, x): matches "pattern" in vector x and returns the positions of the matching elements
> grep("wt", names(dog_dat))
 [1]  4  5  6  7  8  9 10 11 12 13 14 15
> grep("len", names(dog_dat))
 [1] 16 17 18 19 20 21 22 23 24 25 26 27
> grep("food", names(dog_dat))
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

Aside
> Indexes = list()
> Indexes[[1]] = grep("wt", names(dog_dat))
> Indexes[[2]] = grep("len", names(dog_dat))
> Indexes[[3]] = grep("food", names(dog_dat))
> Indexes
[[1]]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

[[3]]
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

Aside
• grep can be a lot more powerful when combined with 'regular expressions', but we're not going to get into that
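For the curious, one small taste of what a regular expression adds. This sketch uses made-up column names (`nms` is hypothetical, not the lecture data) in which the plain pattern "wt" matches more than intended:

```r
# Hypothetical column names; "wt" also appears inside "owner_wt_class"
nms <- c("id", "owner_wt_class", "wt_1", "wt_2", "len_1", "food_1")

plain    <- grep("wt", nms)                   # 2 3 4: includes the unintended column
anchored <- grep("^wt_", nms)                 # 3 4: "^" requires the name to start with "wt_"
matched  <- grep("^wt_", nms, value = TRUE)   # "wt_1" "wt_2": names instead of positions
```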

Aside
• Opposite of paste: strsplit(x, split)
  – splits term 'x' on 'split' character or pattern
• Returns a list:
> x = paste("month",1:12,sep="_")
> head(strsplit(x,"_"),3)
[[1]]
[1] "month" "1"

[[2]]
[1] "month" "2"

[[3]]
[1] "month" "3"

Aside
• If you want one element (in this case, the number), easiest to just use a 'for' loop
• If you split each element separately, the output list only has 1 element: [[1]]
• You then need to figure out which slot you want using the single bracket

Aside
x = paste("month",1:12,sep="_")
num = rep(0,length(x))
for(i in 1:length(x)) {
  # strsplit() returns character strings; convert so that num stays numeric
  num[i] = as.numeric(strsplit(x[i],"_")[[1]][2])
}
> i = 1
> strsplit(x[i],"_") # list
[[1]]
[1] "month" "1"

> strsplit(x[i],"_")[[1]] # vector
[1] "month" "1"
> strsplit(x[i],"_")[[1]][2] # element
[1] "1"
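An alternative to the explicit loop, offered as a sketch and not as the lecture's method: split the whole vector once and pull the second piece out of each list component with `sapply()`.

```r
x <- paste("month", 1:12, sep = "_")

pieces <- strsplit(x, "_")                            # list of 12 character vectors
second <- sapply(pieces, function(p) p[2])            # character: "1" "2" ... "12"
num    <- as.numeric(second)                          # numeric: 1 2 ... 12
```

The `for` loop version is easier to step through with `i = 1`, which is why it is convenient while learning.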

Applied Example
• Load in "lec5_data.rda" from the course website
• These are the people from "lec2_data.rda" that did not have a dog at baseline
• At each monthly follow-up, they reported whether they had borrowed a dog over the past month

Applied Example
• dog_0: baseline dog ownership (all of these people should have "no")
• dog_1 - dog_12: did you borrow a dog over the past month?

Applied Example
• Determine person-time at risk for dog borrowing
• Create a "survival" dataset from this data with columns: ID, start, end
• Note that there is missing data…

Applied Example
• We want to convert each person's wide data into two numbers: start and end
• Because of missing data, some people might have more than 1 row
  – people aren't at risk for dog borrowing in months where they did not report (i.e., are missing)

Applied Example
• Take person 1:
> dat[1,]
  id age sex height weight dog_0 dog_1 dog_2
1  1  40   F   63.5  134.5    no    no   yes
  dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9
1   yes    no    no   yes   yes    no   yes
  dog_10 dog_11 dog_12
1   <NA>     no     no

Applied Example
• Person 1 in the new dataset should be:
ID start end
 1     0   9
 1    11  12

Applied Example
• Basic premise: write a for-loop that passes over each person and determines their non-missing follow-up time
• Caveat: how many rows do we make our output matrix?
• Perfect opportunity for using rbind()…

Applied Example
• Create a matrix with 0 rows and 3 columns
• Within the body of the loop, use rbind() to append new rows (this is slow though)
> out = matrix(nrow = 0, ncol = 3)
> dim(out)
[1] 0 3
> p1 = c(1,0,9)
> out = rbind(out, p1)
> out
   [,1] [,2] [,3]
p1    1    0    9
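Growing an object with `rbind()` copies it on every iteration, which is where the slowness comes from. One common alternative, shown here as a sketch with invented rows, is to collect the pieces in a list and bind them once at the end with `do.call()`:

```r
# Hypothetical rows of the form c(ID, start, end)
rows <- list()
rows[[1]] <- c(1, 0, 9)
rows[[2]] <- c(1, 11, 12)
rows[[3]] <- c(2, 0, 12)

# do.call(rbind, rows) is equivalent to rbind(rows[[1]], rows[[2]], rows[[3]])
out <- do.call(rbind, rows)
colnames(out) <- c("ID", "start", "end")
dim(out)   # 3 3
```

For a dataset of this size the difference is negligible, so appending with `rbind()` inside the loop is a reasonable choice here.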

Applied Example
out = matrix(nrow = 0, ncol = 3)
cols = grep("dog", names(dat))
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  ...

Applied Example
• Here, the follow-up results are factors, which have numerical values:
> dat[i,cols]
  dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6
1    no    no   yes   yes    no    no   yes
  dog_7 dog_8 dog_9 dog_10 dog_11 dog_12
1   yes    no   yes   <NA>     no     no
> as.numeric(dat[i,cols])
 [1]  1  1  2  2  1  1  2  2  1  2 NA  1  1

Applied Example
• Now a cool little trick: rle(), run length encoding
• Computes the lengths and values of runs of equal values in a vector
• We're going to combine this with is.na()
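A small illustration of `rle()` on its own, using a made-up character vector, before it is combined with `is.na()`:

```r
v <- c("a", "a", "a", "b", "c", "c")

r <- rle(v)
r$lengths                      # 3 1 2: how long each run is
r$values                       # "a" "b" "c": the value repeated in each run

back <- inverse.rle(r)         # rebuilds the original vector from the encoding
identical(back, v)             # TRUE
```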

Applied Example
> x = rle(is.na(hold))
> x
Run Length Encoding
  lengths: int [1:3] 10 1 2
  values : logi [1:3] FALSE TRUE FALSE
• This says that there are 10 FALSE in a row, then 1 TRUE, then 2 FALSE
• We need to get this in a better format…

Applied Example
> x = data.frame(cbind(x$values, x$lengths))
> names(x) <- c("missing", "length")
> x
  missing length
1       0     10
2       1      1
3       0      2

Applied Example
• cumsum() returns the cumulative sum of a vector
> x$end <- cumsum(x$length)
> x$start <- x$end - x$length + 1
>
> x
  missing length end start
1       0     10  10     1
2       1      1  11    11
3       0      2  13    12

Applied Example
• Note that we actually want all of the values to be one less, since our time starts at 0
> x$end <- cumsum(x$length) - 1
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10   9     0
2       1      1  10    10
3       0      2  12    11

Applied Example
• Quick rearrangement:
> x <- x[,c(1,2,4,3)]
> x
  missing length start end
1       0     10     0   9
2       1      1    10  10
3       0      2    11  12

Applied Example
• We want the last two columns of the non-missing visits
> tmp = x[which(x$missing == 0),3:4]
> tmp
  start end
1     0   9
3    11  12

Applied Example
• We then want to add a column of the individual ID to the front
id = dat[i,1]
tmp = cbind(rep(id,nrow(tmp)), tmp)
names(tmp)[1] = "ID"
> tmp
  ID start end
1  1     0   9
3  1    11  12

Applied Example
• Lastly, bind tmp to the growing out object
• This finishes off our loop body
out = rbind(out,tmp)

for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  x = rle(is.na(hold))                       # runs of missing / non-missing
  x = data.frame(cbind(x$values, x$lengths))
  names(x) <- c("missing", "length")
  x$end <- cumsum(x$length) - 1              # time starts at 0
  x$start <- x$end - x$length + 1
  x <- x[,c(1,2,4,3)]
  tmp = x[which(x$missing == 0),3:4]         # keep non-missing runs only
  id = dat[i,1]
  tmp = cbind(rep(id,nrow(tmp)), tmp)
  names(tmp)[1] = "ID"
  out = rbind(out,tmp)
}
rownames(out) = 1:nrow(out) # cleaning
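The whole procedure can be tried without "lec5_data.rda" on a tiny invented dataset. The sketch below is a variation on the loop above, not a copy of it: `dat` is made up (2 people, visits 0 to 4), missingness is taken with `is.na(unlist(...))` so the result does not depend on whether the dog columns are factors or character strings, and the pieces are collected in a list and bound once at the end.

```r
# Hypothetical data: 2 people, baseline plus 4 monthly visits
dat <- data.frame(
  id    = c(1, 2),
  dog_0 = c("no", "no"),
  dog_1 = c("no", NA),
  dog_2 = c("yes", "no"),
  dog_3 = c(NA, "no"),
  dog_4 = c("no", "yes")
)

cols   <- grep("dog", names(dat))
pieces <- vector("list", nrow(dat))          # one slot per person

for (i in 1:nrow(dat)) {
  miss  <- is.na(unlist(dat[i, cols], use.names = FALSE))
  r     <- rle(miss)                         # runs of missing / non-missing
  end   <- cumsum(r$lengths) - 1             # time starts at 0
  start <- end - r$lengths + 1
  keep  <- !r$values                         # non-missing runs only
  pieces[[i]] <- data.frame(ID    = rep(dat$id[i], sum(keep)),
                            start = start[keep],
                            end   = end[keep])
}
out <- do.call(rbind, pieces)
rownames(out) <- 1:nrow(out)                 # cleaning
out
```

With this invented data, person 1 gets the intervals 0 to 2 and 4 to 4, and person 2 gets 0 to 0 and 2 to 4.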
