Introduction to R - Lecture 5: More loops

Andrew Jaffe

10/4/2010

Overview
• Review: For Loop
• Lists
• Aside: Patterns
• Application
Review: For Loop
• The syntax is: for(var in seq) {code}
• The seq determines what values var will take in the loop
• The loop is performed length(seq) times
• On the n’th iteration of the loop, var takes the value seq[n]
• var is a completely new variable and is not directly related to any other variable
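The points above can be checked with a very small loop. This is only an illustrative sketch: `seq_vals` and `seen` are made-up names, not objects from the lecture data.

```r
# Hypothetical sequence, chosen only for illustration
seq_vals <- c(10, 20, 30)
seen <- numeric(0)

for (var in seq_vals) {
  # On the n'th iteration, var holds seq_vals[n]
  seen <- c(seen, var)
}

n_iter <- length(seen)
print(seen)                          # the values var took, in order
print(n_iter == length(seq_vals))    # the loop ran length(seq) times
```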
Review: For Loop
• Setting up your loop requires determining the correct seq to loop over: usually easy
• The real challenge of looping is relating the values of seq to the dimensions/ indices of your data
Review: For Loop
• From last lecture: we’re relating seq to the columns of the data
• var is only indirectly related to the data: it indexes the columns, but it only takes the values 1-12

Index = 4:15

mean_wt <- rep(0, length(Index))

for(i in 1:length(Index)) {

ind = Index[i] # column index

mean_wt[i] = mean(dog_dat[,ind])

}
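To see the mapping between the loop variable and the column positions without loading dog_dat, the same loop can be run on a toy data frame. The data frame `toy` and its values are assumptions made for this sketch; only the loop structure comes from the slide.

```r
# Toy stand-in for dog_dat (hypothetical values)
toy <- data.frame(id   = 1:3,
                  type = c("lab", "lab", "husky"),
                  wt_1 = c(10, 12, 14),
                  wt_2 = c(11, 13, 15),
                  stringsAsFactors = FALSE)

Index <- 3:4                        # column positions of the weight columns
mean_wt <- rep(0, length(Index))

for (i in seq_along(Index)) {       # i takes the values 1, 2
  ind <- Index[i]                   # ind takes the values 3, 4
  mean_wt[i] <- mean(toy[, ind])    # i indexes the output, ind indexes the data
}

mean_wt
```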

Overview
• Review: For Loop
• Lists
• Aside: Patterns
• Application
Lists
• "An R list is an object consisting of an ordered collection of objects known as its components."
• "Components are always numbered and may always be referred to as such" – double brackets can subset lists

Source: CRAN, "An Introduction to R"
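A short sketch of the difference between single and double brackets on a list. The list `L` here is a made-up two-component example, and `as_list` and `as_vec` are illustrative names.

```r
L <- list(1:4, c("a", "b", "c"))

as_list <- L[1]     # single bracket: a list of length 1 holding the vector
as_vec  <- L[[1]]   # double bracket: the component itself (an integer vector)

class(as_list)      # "list"
class(as_vec)       # "integer"
total <- sum(L[[1]])  # works on the vector; sum(L[1]) would raise an error
total
```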

Lists

> L = list() # empty list

> L[[1]] = 1:4

> L[[2]] = 2:7

> L[[3]] = c("a","b","c")

> L[[4]] = matrix(rnorm(4), nrow = 2)

> L

[[1]]

[1] 1 2 3 4

[[2]]

[1] 2 3 4 5 6 7

[[3]]

[1] "a" "b" "c"

[[4]]

[,1] [,2]

[1,] -1.43944849 -0.4801696

[2,] 0.09923108 1.0783053

Lists

> names(L) = c("seq1","seq2","letters","mat")

> L

\$seq1

[1] 1 2 3 4

\$seq2

[1] 2 3 4 5 6 7

\$letters

[1] "a" "b" "c"

\$mat

[,1] [,2]

[1,] 1.824487 0.3431034

[2,] -0.533006 0.9406285

Lists

> L[[1]]

[1] 1 2 3 4

> str(L)

List of 4

\$ seq1 : int [1:4] 1 2 3 4

\$ seq2 : int [1:6] 2 3 4 5 6 7

\$ letters: chr [1:3] "a" "b" "c"

\$ mat : num [1:2, 1:2] 1.824 -0.533 0.343 0.941

Lists
• Why know lists?
• Can store data of different lengths and types
• Some functions return lists
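As an illustration of the last point, two base R functions that return lists are strsplit() and split(). The inputs below are invented for this sketch.

```r
# strsplit() returns one list component per input string
parts <- strsplit(c("a_b", "c_d_e"), "_")
is.list(parts)                      # TRUE
sapply(parts, length)               # components can have different lengths

# split() returns one list component per group
groups <- split(1:6, c("x", "y", "x", "y", "x", "y"))
groups$x
```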
Lists
• Load back in the lecture 4 data
• We still have one problem to solve - the averages of weight, length, and food for each dog type at each visit
Lists
• First we can create a list containing each group we care about

Indexes = list()

Indexes[[1]] = 4:15 # weight

Indexes[[2]] = 16:27 # length

Indexes[[3]] = 28:39 # food

names(Indexes) = c("weight",

"length", "food")

Lists

> Indexes

\$weight

[1] 4 5 6 7 8 9 10 11 12 13 14 15

\$length

[1] 16 17 18 19 20 21 22 23 24 25 26 27

\$food

[1] 28 29 30 31 32 33 34 35 36 37 38 39

Lists
• Next, we can create an output list for our results, and recreate the unique dog list for our loop

out <- list()

dogs = unique(dog_dat\$dog_type)

Lists
• We want to loop over the different covariates (wt, len, food) and within each, the different dog types
• For looping over the groups, either works:

> seq(along = Indexes)

[1] 1 2 3

> 1:length(Indexes)

[1] 1 2 3
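Both forms give the same sequence for the three-component list above. They differ when the object is empty, which the sketch below demonstrates with a hypothetical empty list; seq_along() is the dedicated base R function for the same job.

```r
empty <- list()

a <- 1:length(empty)      # 1:0 counts down, giving the two values 1 and 0
b <- seq(along = empty)   # zero-length sequence
d <- seq_along(empty)     # same result as seq(along = ...)

count <- 0
for (i in seq_along(empty)) count <- count + 1
count                     # the loop body never runs
```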

Lists

for(i in seq(along = Indexes)) { # 1:3

# take the i'th index from the list

Index = Indexes[[i]]

# for that variable, create a new matrix

tmp = matrix(nrow = length(dogs),

ncol = length(Index))

...

Lists
• We can then fill in that temporary matrix with an inner 'for' loop
• Note that this is the exact same loop as last week (note the j's):

# Index comes from the outer loop

for(j in 1:length(dogs)) {

hold = dog_dat[dog_dat\$dog_type == dogs[j],Index]

tmp[j,] = colMeans(hold)

}

rownames(tmp) = dogs

colnames(tmp) = paste("month",1:12,sep="_")

Lists
• Lastly, we save that tmp matrix in our output list:

out[[i]] = tmp

for(i in seq(along = Indexes)) { # groups

Index = Indexes[[i]]

tmp = matrix(nrow = length(dogs),

ncol = length(Index))

for(j in 1:length(dogs)) { # dogs

hold = dog_dat[dog_dat\$dog_type == dogs[j],Index]

tmp[j,] = colMeans(hold)

}

rownames(tmp) = dogs

colnames(tmp) = paste("month",1:12,sep="_")

out[[i]] = tmp

}

names(out) <- c("weight","length","food")

> out

\$weight

month_1 month_2 month_3 month_4 month_5 month_6 month_7

lab 49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600

poodle 49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820

husky 49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195

retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248

month_8 month_9 month_10 month_11 month_12

lab 46.54640 44.68640 45.15040 44.30640 45.88240

poodle 46.12613 44.26577 44.73243 43.89009 45.46306

husky 45.98761 44.12832 44.59469 43.75221 45.31858

retriever 46.91278 45.05263 45.51654 44.68496 46.24586

\$length

month_1 month_2 month_3 month_4 month_5 month_6 month_7

lab 19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800

poodle 20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198

husky 20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071

retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880

month_8 month_9 month_10 month_11 month_12

lab 21.10400 21.20880 21.40720 21.57440 21.87440

poodle 21.82072 21.92432 22.12342 22.29009 22.58919

husky 21.47699 21.58142 21.77876 21.94779 22.24779

retriever 21.64962 21.75414 21.95263 22.12406 22.42406

\$food

month_1 month_2 month_3 month_4 month_5 month_6 month_7

lab 30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160

poodle 30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910

husky 30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584

retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564

month_8 month_9 month_10 month_11 month_12

lab 29.20880 30.03200 29.89120 29.54240 30.89520

poodle 29.20631 30.02613 29.88739 29.53243 30.89550

husky 29.29646 30.11770 29.97345 29.62389 30.98053

retriever 29.06617 29.88722 29.74887 29.39248 30.75338
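The nested loop can be tried end to end on simulated data. Everything about `toy` below (column names, group sizes, values) is an assumption for this sketch; the loop structure follows the slides, with the column names taken from the data instead of paste().

```r
set.seed(1)   # simulated stand-in for dog_dat; values are hypothetical
toy <- data.frame(dog_type = rep(c("lab", "husky"), each = 3),
                  wt_1  = rnorm(6, 50), wt_2  = rnorm(6, 49),
                  len_1 = rnorm(6, 20), len_2 = rnorm(6, 21),
                  stringsAsFactors = FALSE)

Indexes <- list(weight = 2:3, length = 4:5)
dogs <- unique(toy$dog_type)
out <- list()

for (i in seq_along(Indexes)) {            # groups of columns
  Index <- Indexes[[i]]
  tmp <- matrix(nrow = length(dogs), ncol = length(Index))
  for (j in seq_along(dogs)) {             # dog types
    hold <- toy[toy$dog_type == dogs[j], Index]
    tmp[j, ] <- colMeans(hold)
  }
  rownames(tmp) <- dogs
  colnames(tmp) <- names(toy)[Index]
  out[[i]] <- tmp
}
names(out) <- names(Indexes)

# Cross-check one cell against a direct calculation
direct <- mean(toy$wt_1[toy$dog_type == "lab"])
all.equal(out$weight["lab", "wt_1"], direct)
```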

Overview
• Review: For Loop
• Lists
• Aside: Patterns
• Application
Aside
• This step is potentially dangerous, because the column positions are hard-coded:
• Indexes[[1]] = 4:15 # weight
• Indexes[[2]] = 16:27 # length
• Indexes[[3]] = 28:39 # food
• Is there a better way? YES! Each group shares a common term in the name:
• wt, len, food
Aside
• grep(pattern, x): returns the indices of the elements of vector x that match "pattern"

> grep("wt", names(dog_dat))

[1] 4 5 6 7 8 9 10 11 12 13 14 15

> grep("len", names(dog_dat))

[1] 16 17 18 19 20 21 22 23 24 25 26 27

> grep("food", names(dog_dat))

[1] 28 29 30 31 32 33 34 35 36 37 38 39
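A small sketch of grep() on a made-up vector of column names (`nms` is hypothetical, not the names of dog_dat). It also shows two closely related base R options: `value = TRUE` and grepl().

```r
nms <- c("id", "dog_type", "wt_1", "wt_2", "len_1", "food_1")  # hypothetical names

pos  <- grep("wt", nms)                 # positions of the matching elements
vals <- grep("wt", nms, value = TRUE)   # the matching elements themselves
flag <- grepl("wt", nms)                # TRUE/FALSE for every element

pos
vals
flag
```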

Aside

> Indexes = list()

> Indexes[[1]] = grep("wt", names(dog_dat))

> Indexes[[2]] = grep("len", names(dog_dat))

> Indexes[[3]] = grep("food", names(dog_dat))

> Indexes

[[1]]

[1] 4 5 6 7 8 9 10 11 12 13 14 15

[[2]]

[1] 16 17 18 19 20 21 22 23 24 25 26 27

[[3]]

[1] 28 29 30 31 32 33 34 35 36 37 38 39
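The three grep() calls can also be written as a loop over a named vector of patterns, which gives the list components names at the same time. This is one possible sketch; `nms` and `patterns` are assumptions made for the example.

```r
nms <- c("id", "wt_1", "wt_2", "len_1", "len_2", "food_1", "food_2")  # hypothetical
patterns <- c(weight = "wt", length = "len", food = "food")

Indexes <- list()
for (p in names(patterns)) {
  # Store the matching column positions under the group's name
  Indexes[[p]] <- grep(patterns[[p]], nms)
}

Indexes
```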

Aside
• grep can be a lot more powerful when combined with regular expressions, but we're not going to get into that
Aside
• Opposite of paste: strsplit(x, split) – splits term 'x' on 'split' character or pattern
• Returns a list:

> x = paste("month",1:12,sep="_")

> strsplit(x, "_") # first three components shown

[[1]]

[1] "month" "1"

[[2]]

[1] "month" "2"

[[3]]

[1] "month" "3"

Aside
• If you want one element (in this case, the number), easiest to just use a 'for' loop
• If you split each element separately, the output list only has 1 element: [[1]]
• You then need to figure out which slot you want using the single bracket
Aside

x = paste("month",1:12,sep="_")

num = rep(0,length(x))

for(i in 1:length(x)) {

num[i] = as.numeric(strsplit(x[i],"_")[[1]][2]) # convert "1" to 1 so num stays numeric

}

> i = 1

> strsplit(x[i],"_") # list

[[1]]

[1] "month" "1"

> strsplit(x[i],"_")[[1]] # vector

[1] "month" "1"

> strsplit(x[i],"_")[[1]][2] # element

[1] "1"
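The slides use a for loop here. As an alternative sketch (not the approach from the lecture), strsplit() can be applied to the whole vector at once and sapply() can pull out the second piece of each component.

```r
x <- paste("month", 1:12, sep = "_")

pieces <- strsplit(x, "_")                           # list of 12 character vectors
num <- as.numeric(sapply(pieces, function(p) p[2]))  # second piece, as a number

num
```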

Overview
• Review: For Loop
• Lists
• Aside: Patterns
• Application
Applied Example
• Load in "lec5_data.rda" from the course website
• These are the people from "lec2_data.rda" that did not have a dog at baseline
• At each monthly follow-up, these people reported whether they had borrowed a dog over the past month
Applied Example
• dog_0: baseline dog ownership – all of these people should have "no"
• dog_1 - dog_12: did you borrow a dog over the past month?
Applied Example
• Determine person-time at risk for dog borrowing
• Create a "survival" dataset from this data with columns: ID, start, end
• Note that there is missing data…
Applied Example
• We want to convert each person's wide data into two numbers: start and end
• Because of missing data, some people might have more than 1 row: people aren't at risk for dog borrowing in months where they did not report (their data are missing)
Applied Example
• Take person 1:

> dat[1,]

id age sex height weight dog_0 dog_1 dog_2

1 1 40 F 63.5 134.5 no no yes

dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9

1 yes no no yes yes no yes

dog_10 dog_11 dog_12

1 <NA> no no

Applied Example
• Person 1 in the new dataset should be:

ID start end

1 0 9

1 11 12

Applied Example
• Basic premise: write a for-loop that passes over each person and determines their non-missing follow-up time
• Caveat: we don't know in advance how many rows the output matrix needs
• Perfect opportunity for using rbind()…
Applied Example
• Create a matrix with 0 rows and 3 columns
• Within the body of the loop, use rbind() to append new rows (this is slow, though)

> out = matrix(nrow = 0, ncol = 3)

> dim(out)

[1] 0 3

> p1 = c(1,0,9)

> out = rbind(out, p1)

> out

[,1] [,2] [,3]

p1 1 0 9
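The sketch below compares growing a matrix with rbind() inside the loop against a common alternative: collecting the rows in a list and binding once at the end. The row values are invented for illustration, and the list-based version is an assumption of this example, not part of the lecture.

```r
# Approach from the slides: grow the matrix one row at a time
out <- matrix(nrow = 0, ncol = 3)
for (i in 1:3) {
  out <- rbind(out, c(i, 0, 9))
}

# Alternative sketch: collect rows in a list, bind once after the loop
rows <- list()
for (i in 1:3) {
  rows[[i]] <- c(i, 0, 9)
}
out2 <- do.call(rbind, rows)

dim(out)
dim(out2)
```

Both versions give the same 3 x 3 matrix; the second avoids copying the growing matrix on every iteration.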

Applied Example

out = matrix(nrow = 0, ncol = 3)

cols = grep("dog", names(dat))

for(i in 1:nrow(dat)) {

hold = as.numeric(dat[i,cols])

...

Applied Example
• Here, the follow-up results are factors, which are stored as integer codes (here no = 1, yes = 2):

> dat[i,cols]

dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6

1 no no yes yes no no yes

dog_7 dog_8 dog_9 dog_10 dog_11 dog_12

1 yes no yes <NA> no no

> as.numeric(dat[i,cols])

[1] 1 1 2 2 1 1 2 2 1 2 NA 1 1
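A minimal sketch of how factor codes behave, using a made-up factor `f` with the same two levels as the follow-up columns.

```r
f <- factor(c("no", "yes", NA, "no"))

levels(f)                 # levels are sorted alphabetically by default
codes <- as.numeric(f)    # integer codes of the levels; NA stays NA
labs  <- as.character(f)  # the original labels

codes
labs
is.na(f)
```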

Applied Example
• Now a cool little trick: rle() – run length encoding
• Compute the lengths and values of runs of equal values in a vector
• We're going to combine this with is.na()
Applied Example
• The rle() output says that there are 10 FALSE in a row, then 1 TRUE, then 2 FALSE: ten reported visits, one missing visit, then two reported visits
• We need to get this in a better format…

> x = rle(is.na(hold))

> x

Run Length Encoding

lengths: int [1:3] 10 1 2

values : logi [1:3] FALSE TRUE FALSE
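The same result can be reproduced without loading the data by typing in person 1's numeric values as shown on the earlier slide. The name `runs` is an illustrative choice.

```r
# Person 1's follow-up values, copied from the as.numeric() output above
hold <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2, NA, 1, 1)

runs <- rle(is.na(hold))
runs$lengths                          # length of each run
runs$values                           # whether each run is missing (TRUE) or not

# The run lengths always add up to the length of the input
sum(runs$lengths) == length(hold)
```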

Applied Example

> x = data.frame(cbind(x\$values, x\$lengths))

> names(x) <- c("missing", "length")

> x

missing length

1 0 10

2 1 1

3 0 2

Applied Example
• cumsum() returns the cumulative sum of a vector

> x\$end <- cumsum(x\$length)

> x\$start <- x\$end - x\$length + 1

> x

missing length end start

1 0 10 10 1

2 1 1 11 11

3 0 2 13 12

Applied Example
• Note that we actually want every value to be reduced by one, since our time starts at 0

> x\$end <- cumsum(x\$length) - 1

> x\$start <- x\$end - x\$length + 1

> x

missing length end start

1 0 10 9 0

2 1 1 10 10

3 0 2 12 11

Applied Example
• Quick rearrangement:

> x <- x[,c(1,2,4,3)]

> x

missing length start end

1 0 10 0 9

2 1 1 10 10

3 0 2 11 12
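The start and end arithmetic can be checked on its own with the three run lengths for person 1. The vector `len` below is typed in from the slides.

```r
len <- c(10, 1, 2)              # run lengths for person 1

end   <- cumsum(len) - 1        # last visit index of each run, with time starting at 0
start <- end - len + 1          # first visit index of each run

data.frame(length = len, start = start, end = end)
```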

Applied Example
• We want the last two columns (start and end) for the non-missing runs

> tmp = x[which(x\$missing == 0),3:4]

> tmp

start end

1 0 9

3 11 12

Applied Example
• We then want to add a column of the individual ID to the front

id = dat[i,1]

tmp = cbind(rep(id,nrow(tmp)), tmp)

names(tmp)[1] = "ID"

> tmp

ID start end

1 1 0 9

3 1 11 12
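The steps so far can be collected into a small helper that handles one person. This is a sketch under stated assumptions: the function name `person_intervals` is hypothetical, and it builds the data frame directly instead of going through cbind() and column reordering as the slides do.

```r
# Hypothetical helper: non-missing follow-up intervals for one person
person_intervals <- function(id, hold) {
  runs  <- rle(is.na(hold))
  end   <- cumsum(runs$lengths) - 1      # time starts at 0
  start <- end - runs$lengths + 1
  keep  <- !runs$values                  # keep only the non-missing runs
  data.frame(ID = rep(id, sum(keep)),
             start = start[keep],
             end = end[keep])
}

# Person 1's values, copied from the earlier slide
hold <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2, NA, 1, 1)
res <- person_intervals(1, hold)
res
```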

Applied Example
• Lastly, bind the tmp matrix to the growing out matrix
• This finishes off our loop body

out = rbind(out,tmp)

for(i in 1:nrow(dat)) {

hold = as.numeric(dat[i,cols])

x = rle(is.na(hold))

x = data.frame(cbind(x\$values, x\$lengths))

names(x) <- c("missing", "length")

x\$end <- cumsum(x\$length) - 1

x\$start <- x\$end - x\$length + 1

x <- x[,c(1,2,4,3)]

tmp = x[which(x\$missing == 0),3:4]

id = dat[i,1]

tmp = cbind(rep(id,nrow(tmp)), tmp)

names(tmp)[1] = "ID"

out = rbind(out,tmp)

}

rownames(out) = 1:nrow(out) # cleaning

ID start end

1 1 0 9

2 1 11 12

3 2 0 5

4 2 7 12

5 3 0 2

6 3 5 12

7 4 0 3

8 4 6 12

9 5 0 0

10 5 3 8

> dim(out)

[1] 1414 3

Applied Example
• The non-zero starts must be reduced by 1, since they are currently visit indices rather than times at risk

Before:

ID start end

1 1 0 9

2 1 11 12

After:

ID start end

1 1 0 9

2 1 10 12
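The slide does not show the code for this adjustment. One possible way to do it, shown on person 1's two rows only, is logical indexing on the start column. The small `out` below is a stand-in for the full output.

```r
# Stand-in for the full output: person 1's two rows from the slide
out <- data.frame(ID = c(1, 1), start = c(0, 11), end = c(9, 12))

nonzero <- out$start != 0                        # rows that begin after a gap
out$start[nonzero] <- out$start[nonzero] - 1     # shift those starts back by one

out
```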

Applied Example
• What is the total time at risk of this population?

> time = out\$end - out\$start

> sum(time)

[1] 4988 # person-months

Applied Example
• Save the 'out' matrix as an rda so it can be used next week
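A minimal sketch of saving and reloading with save() and load(). The file name and the temporary directory are assumptions made for this example, and the small `out` is a stand-in for the real output.

```r
# Stand-in for the real output object
out <- data.frame(ID = c(1, 1), start = c(0, 10), end = c(9, 12))

f <- file.path(tempdir(), "lec5_out.rda")   # hypothetical file name and location
save(out, file = f)

rm(out)        # remove the object to show that load() brings it back
load(f)        # restores the object under its original name, out
out
```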