Mastering Data Frames in R: Essentials, Manipulation, and Visualization

Basics2

Outline • Review of data structures • Data frames • Review of data frame structure • Importing/Exporting • Data frame example • Browsing • Selecting/Subsetting • Manipulating • Table and unique functions • Merging • Summarizing with apply, by, aggregate • Basic Programming • if/else • for/while loops • functions • Summarizing with functions • Graphics • Scatterplots • Histograms • Barplots • Pie Charts

Initialize R Environment ## Initialize R Environment # Set working directory.. setwd("C:/Rworkshop") getwd() # Create a new folder named 'Outfolder' in working directory if(!file.exists("Outfolder")){ dir.create("Outfolder") } # Set options options(scipen=6) # bias against scientific notation # Load libraries library(RODBC) # ODBC database interface

Get Sample Data ## Get Sample data # Load sample data files from website (SampleData_PartI.zip) http://www.fs.fed.us/rm/ogden/research/Rworkshop/SampleData.shtml # Extract zip file to working directory unzip("SampleData_PartI.zip") # List files in folder list.files("PlotData")

R Data Structures # vector group of single objects or elements (all elements must be the same mode) # factorsvector of values that are limited to a fixed set of values (categories) # list group of objects called components (can be different modes and lengths) # array group of vectors with more than 1 dimension (all elements must be the same mode) format: array(data=NA, dim=c(dim1, dim2, ..) # matrix 2-dimensional array (group of vectors with rows and columns) (all elements must be the same mode). format: matrix(data=NA, nrow=1, ncol=1, byrow=FALSE) # data frame 2-dimensional list (group of vectors of the same length) (can be different modes) format: data.frame(row, col) Images from: https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node64.html

Data Frames Review ## Data frames are the primary data structure in R. ## Data frames are used for storing data tables, such as forest inventory data tables. ## A data frame is similar structure to a matrix, except a data frame may contain categorical data, such as species names. ## Each column of a data frame is treated as a vector component of a list and therefore, can be different modes. The difference is that each vector component must have same length.

Data Frames Review # data frame 2-dimensional list (group of vectors of the same length; can be different modes) #data.frame(row, column) ## Start with a matrix mat <- matrix(data=c(seq(1,5), rep(2,5)), nrow=5, ncol=2) df <- data.frame(mat) # Convert matrix to data frame is.data.frame(df) # Check if object is a data frame dim(df) # Get dimensions of data frame str(df) # Get structure of data frame matv <- c(mat) # Create a vector from matrix is.vector(matv) dflst <- c(df) # Create a list from data frame is.list(dflst) ## Accessing elements of a data frame ## Data frames can be accessed exactly like matrices, but they can also be accessed by column names. df[1,2] # Identify value in row 1, column 2 df[3,] # Identify values in row 3 (vector) df[,2] # Identify values in column 2 (vector) df[2:4,1] # Identify values in rows 2 thru 4and column 1 (vector) df[3,"X2"] # Identify value in row 3, column 2 df[,"X2"] # Identify values in column 2('X2') df[2:4,"X1"] # Identify values in column 2 thru 4 and column 1('X1')

Data Frames Review • ## Building your own data frame.. • dat <- edit(data.frame()) ## Build data frame from vectors of rows and columns.. # Column vectors numbervect <- c(1, 2, 3) stringvect <- c("a", "b", "c") logicalvect <- c(TRUE, FALSE, TRUE) # Combine column vectors into a data frame and name columns df <- data.frame(cbind(numbervect, stringvect, logicalvect)) df <- data.frame(cbind(COLUMN1=numbervect, COLUMN2=stringvect, COLUMN3=logicalvect)) # Row vectors row1vect <- c(1, "a", TRUE) row2vect <- c(2, "b", FALSE) row3vect <- c(3, "c", TRUE) # Combine row vectors into a data frame df <- data.frame(row1vect, row2vect, row2vect) # Name data frame columns names(df) <- c("COLUMN1", "COLUMN2", "COLUMN3")

Importing Data Frames ## Import data to R as data frame.. read.table() # reads text file into a dataframe (file, header=TRUE, sep="", na.strings="NA", skip = 0) scan() # reads text file into a vector or a list read.csv() # reads comma-delimited text file (*.csv) into a dataframe read.fwf() # reads text file with fixed width format into a dataframe read.dbf() # reads DBF file into a dataframe (library foreign) # dat <- read.table("c:/data/mydata.txt", header=TRUE) # valid # dat <- read.table("c:\\data\\mydata.txt", header=TRUE) # valid # dat <- read.table ("c:\data\mydata.txt", header=TRUE) # invalid # Notes.. help(read.table) # read.table() will return an error if not all lines have the full set of values for variables. # This is common with data from spreadsheets when the last (or right) columns have missing values. # read.table() – the default delimiter is a space, tab, or newline. # For comma-delimited files sep=", " #For tab-delimited files sep="\t" .. # File can be a URL (ex. ftp:// urls) # dat<- read.csv("ftp://ftp2.fs.fed.us/incoming/rmrs/testdat.csv", header=TRUE)

Importing Data Frames ## Excel files.. # It is recommended to export the data from Excel to a tab-delimited or comma-separated file before importing to R. # Convert long numbers represented with scientific notation to a number or a text before exporting (using Format Cells..). # Check imported dataframe in R for entire rows or columns of NA values… from extraneous empty cells. ## Missing Values.. # By default, only blanks are read as missing values. If you have some other indication for a missing value, you need to let R know what that missing value character is by specifying na.strings (ex. na.strings="" or na.strings="." ).

Importing Data Frames ## Package: RODBC # channel <- odbcConnect(dsn) # Data Source Name (dsn) # channel <- odbcConnectAccess("filenm.mdb")# ACCESS database # channel <- odbcConnectAccess2007("filenm.accdb") # ACCESS database (2007, 2010) # channel <- odbcConnectExcel("filenm.xls") # Excel workbook (*.xlsx??) # channel <- odbcConnectExcel("filenm.xlsx")# Excel workbook (2007, 2010) # channel <- odbcConnect(dsn="idbfia.mci") # Oracle database (need Oracle driver) # channel <- odbcConnect("mydb1", case="postgresql") # PostgreSQLdatabase (need PostgreSQL driver) # Note: If channel = -1 , something is wrong # Get data.. # Set channel to path of Excel spreadsheet channel <- odbcConnectExcel("PlotData/testdat.xls") # Get list of tables (sheets) in spreadsheet sqlTables(channel) # Fetch data from Excel spreadsheet (using sqlFetch or sqlQuery) testdat <- sqlFetch(channel, "testdat") # Specify sheet name dat<- sqlQuery(channel, "select PLT_CN, TREE from [testdat$]") odbcClose(channel) # Close connection (frees resources) odbcDataSources() # Displays available data source names (dsn) http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

Exporting Data Frames ## Exporting data from R.. # write.table() # writes a dataframe to a text file (append = FALSE, sep=" ") # write.csv() # writes a dataframe to a comma-delimited text file (*.csv) # write.dbf() # writes a dataframe to a DBF file (library foreign) write.table(testdat, "Outfolder/testdat2.txt") write.table(testdat, "Outfolder/testdat2.txt", row.names=FALSE, sep=",") write.csv(testdat, "Outfolder/testdat2.csv", row.names=FALSE) # Notes.. help(write.table) # The default settings include row names and column names and has a space delimiter. # write.table() – the default delimiter is a space, tab, or newline. # For comma-delimited files: sep=", " # For tab-delimited files: sep="\t"

Data Frame Example • ## Read in sample data frame.. • setwd("C:/Rworkshop") # use path to your Rworkshop • dat <- read.csv("PlotData/testdat.csv", header=TRUE, stringsAsFactors=FALSE) • dat # Display dat. Check if data frame has extra rows of NA values. PLT_CN unique plot number(numeroparcelaunico) TREE uniquetree#(numero arbolunico) STATUSCD live/dead(codigode estado (vivir/muertos)) SPCD speciescode (estado arbol) DIA diameter - inches(dap) HT height - feet (altura) BA basal area sqft (area basal)

Data Frames - Browsing ## Browsing data frames dat # Print dataframe to screen View(dat) # Display dataframe in spreadsheet format dat <- edit(dat) # Edit dataframe in spreadsheet format head(dat) # Display first rows of dataframe (default = 6 rows) head(dat, n=10) # Display first 10 rows of dataframe tail(dat) # Display last rows of dataframe (default = 6 rows) names(dat) # Display column names of dat row.names(dat) # Display row names of dat summary(dat) # Display summary statistics for columns of dat str(dat) # Display internal structure of dat ## Dimensions of data frames dim(dat) nrow(dat) ncol(dat)

Data Frames – Selecting/Subsetting • ## Selecting and subsetting columns and rows in data frames - df[rows, cols) • # To define or select data frame columns: df$colnm or df[,colnm] • # To define or select data frame rows: df[rowid, ] • dat[, c("TREE", "DIA")] # Select TREE and DIA columns for all rows • dat[3:5,] # Select all columns for rows 3 thru 5 • dat[c(4,6), c(2:4)] # Select rows 4 and 6 and columns 2 thru 4 dat[dat$SPCD== 746,] # Select all columns where SPCD = 746 • dat[dat$DIA < 10,] # Subset rows • subset(dat, DIA < 10) # Subset rows • dfsub<- subset(dat, DIA<10) # Assign new data frame object to subset of dat • dfsub # Display dfsub • # Select the TREE and DIA columns for dead trees only • dat[dat$STATUSCD == 2, c("TREE", "DIA")] • # Select columns where SPCD=746 and STATUSCD=2 • dfsub[dfsub$SPCD == 746 & dfsub$STATUSCD == 2, ]

Data Frames - Manipulating • ## Manipulating data in data frames.. • # To remove column(s) from data frame • dfsub$HT <- NULL # Removes one column • dfsub[,!(names(dfsub) %in% c("TREE", "DIA"))] # Removes one or more columns • # To change name of a column in a data frame • names(dat)[names(dat)== "SPCD"] <- "SPECIES" • # To order a dataframe • dat <- dat[order(dat$SPECIES, dat$DIA),] # Order by 2 columns • dat <- dat[order(dat$DIA, decreasing=TRUE),] # Descending order • dat <- dat[order(dat$PLT_CN),] # Order by 1 column

Data Frames - Manipulating • ## Manipulating data in data frames cont.. • # To add a column of concatenated columns • dat$NEW <- paste(dat$TREE, ":", dat$SPECIES) # With default spaces • dat$NEW <- paste(dat$TREE, ":", dat$SPECIES, sep="") # With no spaces • # .. with leading 0 values • dat$NEW <- paste(dat$TREE, ":", formatC(dat$SPECIES,width=3,flag=0),sep="") • # To add a column of percent of total basal area for each tree in dat (rounded). • dat$BApertot <- round(dat$BA / sum(dat$BA) * 100, 2)

Data Frames - Exploring • ## Exploring data frames with the unique and table functions.. • # Display the unique values of species in dataset and sort • unique(dat$SPECIES) • sort(unique(dat$SPECIES)) • # Display the number of trees by species • help(table) • tab <- table(dat$SPECIES) • class(tab) • # Display the number of trees by species and plot • table(dat$PLT_CN, dat$SPECIES) • table(dat[,c("PLT_CN", "SPECIES")]) • data.frame(table(dat$PLT_CN, dat$SPECIES)) • # Create an object with the number of trees by status, species, and plot • # .. and add column names • tabdf <- data.frame(table(dat$PLT_CN, dat$SPECIES, dat$STATUSCD)) • names(tabdf) <- c("PLT_CN", "SPECIES", "STATUSCD", "FREQ") • # Get a count of dead tree species by plot • table(dat[dat$STATUSCD == 2, "SPECIES"], dat[dat$STATUSCD==2, "PLT_CN"]) • table(dat[dat$STATUSCD ==2, "PLT_CN"], dat[dat$STATUSCD == 2, "SPECIES"])

Data Frames - Exploring • ## More ways to explore data frames.. • # Get the maximum height (HT) for Douglas-fir (SPECIES=202) • max(dat[dat$SPECIES == 202, "HT"]) • # Get the average height (HT) for both Douglas-fir (SPECIES=202) and lodgepole pine (SPECIES=106) • mean(dat[dat$SPECIES %in% c(202, 108), "HT"]) • # Get the average diameter (DIA) for dead (STATUSCD=2) aspen trees (SPECIES=746) • mean(dat[(dat$SPECIES == 746 & dat$STATUSCD == 2), "DIA"]) • # Get the average diameter (DIA) for live (STATUSCD=1) aspen trees (SPECIES=746) • mean(dat[(dat$SPECIES== 746 & dat$STATUSCD == 1), "DIA"])

Data Frames - Merging • # Example: merge variable code descriptions for species to data frame • Get data frame of code descriptions • Merge to table

Data Frames - Merging • # Example: merge variable code descriptions for species to data frame • ## First, get data frame of code descriptions • # Build species look-up table (or import from file) • # Gets unique values of species, sorted: • SPECIES <- sort(unique(dat$SPECIES)) • SPECIESNM <- c("subalpine fir", "Engelmann spruce", "lodgepole pine", • "Douglas-fir", "aspen") • sptab <- data.frame(cbind(SPECIES, SPECIESNM)) • ## Then, merge to table • help(merge) • dat <- merge(dat, sptab, by="SPECIES") • dat • dat[order(dat$PLT_CN, dat$TREE), ]

Exercise 7 ## 7.1. Load testdat.csv again, assigning it to object ex1df. ## 7.2. Display the dimensions, the column names, and the structure of ex1df. ## 7.3. Display the first 6 rows of ex1df. ## 7.4. Display this table ordered by heights with the largest trees first and display the maximum height. ## 7.5. Change name of column STATUS to 'ESTADO', DIA to 'DAP', and column HT to 'ALTURA'. ## 7.6. Merge the sptab table we made on slide 18 to the ex1df table, using the SPCD column, and assign to ex1df2. Hint: use by.x and by.y for merging to columns with different names. ## 7.7. Display only rows with subalpine fir and DAP greater than 10.0. ## 7.8. Display the number of trees by ESTADO and SPECIESNM. ## 7.9. Display the total basal area (BA) for lodgepolepine.

Exercise 7 cont. ## 7.10. Create a new object named 'aspen' with just the rows in ex1df2 that are aspen. ## 7.11. The 2 columns, SPCD and SPECIESNM, are not important in the aspen table. Remove them and overwrite object 'aspen'. ## 7.12. Display the mean ALTURA of live aspen trees. ## 7.13. Create a look up table for ESTADO and merge this table to ex1df2. Assign the merged table to ex1df3. Then order ex1df3 by PLT_CN. ## Hint: ## 1. Get vector of unique values of ESTADO ## 2. Define new vector called ESTADONM where 1 = Live and 2 = Dead ## 3. Create dataframe of vectors with names ESTADO and ESTADONM ## 4. Merge new dataframe to ex1df2 ## 5. Order table by PLT_CN and TREE ## 7.14. Display the number of trees again, this time by ESTADONM and SPECIESNM.

Summarizing Data Frames with apply, by, aggregate

Summarizing Data Frames apply – applies a function to rows or columns of an array, matrix, or data frame (returns a vector or array). sapply – applies a function to each element of a list or vector (returns a vector or array). lapply – similar to sapply, applies a function to each element of a list or vector (returns a list). tapply – applies a function to a vector, separated into groups defined by a second vector or list of vectors (returns an array). by – similar to tapply, applies a function to vector or matrix, separated in to groups by a second vector or list vectors (returns an array or list). aggregate – applies a function to one or more vectors, separated into groups defined by a second vector or list of vectors(returns a data frame). Function Input Output apply array/matrix/data frame vector/array sapply vector/list vector/array lapply vector/list list tapply vector array/matrix by data frame array/list aggregate data frame data frame

apply, lapply, sapply # apply – applies a function to the rows or columns of an array, matrix, or data frame # (1-rows; 2-columns). # Apply a mean across 3 columns of dat meandat <- apply(dat[,c("DIA", "HT", "BA")], 2, mean) meandat is.vector(meandat) # Returns a vector # sapply/lapply– applies a function to a each element of a list or vector. # Check if columns of dat are numeric is.list(dat) sapply(dat, is.numeric) # Returns a vector lapply(dat, is.numeric) # Returns a list # Display the number of characters for each column name of dat. sapply(names(dat), nchar)

tapply tapply - applies a function to a vector, separated into groups defined bya second vector or list of vectors (returns an array/matrix). ## Get max DIA for each SPECIESNM. maxdia <- tapply(dat$DIA, dat$SPECIESNM, max) maxdia class(maxdia) is.vector(maxdia) # Get summed basal area by species and plot. spba.tapply<- tapply(dat$BA, dat[,c("PLT_CN", "SPECIESNM")], sum) spba.tapply dim(spba.tapply) class(spba.tapply) is.array(spba.tapply) # Change NA values to 0 (0 basal area) and round all values to 2 decimal places. spba.tapply[is.na(spba.tapply)] <- 0 spba.tapply <- round(spba.tapply, 2) spba.tapply ## Accessing output of tapply spba.tapply[1,] spba.tapply["4836418010690",] spba.tapply["4836418010690", "lodgepole pine"]

tapply cont.. ## Now, let's add one more variable.. and summarize by STATUSCD as well. stspba.tapply<- tapply(dat$BA, dat[,c("PLT_CN", "SPECIESNM", "STATUSCD")], sum) stspba.tapply[is.na(stspba.tapply)] <- 0 stspba.tapply class(stspba.tapply) ## Accessing output of tapply dim(stspba.tapply) # An array with 3 dimensions [1] 2 5 2 #(PLT_CN, SPECIESNM, STATUSCD) # Display all elements of dimensions 2 and 3, for the first element of dimension 1. stspba.tapply[1,,] stspba.tapply["4836418010690",,] # ..or the name of the element # Display all elements of dimensions 1 and 3, for the third element of dimension 2. stspba.tapply[,3,] stspba.tapply[,"Engelmann spruce",] # ..or the name of the element # Display all elements of dimensions 1 and 2, for the second element of dimension 3. stspba.tapply[,,2] stspba.tapply[,,"2"] # ..or the name of the element # Display one element of array. stspba.tapply[2,4,2] stspba.tapply["40374256010690","lodgepole pine","2"]

by by – applies a function to vector or matrix, separated in to groups by a second vector or list of vectors (returns a by class(array/list(if simplify=FALSE)). ## Get max DIA for each SPECIESNM ## The default by statement returns an object of class 'by' maxdia <- by(dat$DIA, dat$SPECIESNM, mean) maxdia class(maxdia) is.list(maxdia) ## Accessing output from by (default) maxdia[1] # Display element 1 maxdia[[1]] # Display value of element 1 maxdia[["aspen"]] # Display value of element named 'aspen' ## Adding simplify=FALSE, returns on object of class list or array maxdia2 <- by(dat$DIA, dat$SPECIESNM, mean, simplify=FALSE) maxdia2 class(maxdia2) is.list(maxdia2) ## Accessing is the same as above, except you can also use '$', because it is a list. maxdia2[1] # Display element 1 maxdia2[[1]] # Display value of element 1 maxdia2[["aspen"]] # Display value of element named 'aspen' maxdia2$aspen

by cont.. # Get summed basal area by species and plot spba.by <- by(dat$BA, dat[,c("PLT_CN", "SPECIESNM")], sum, simplify=FALSE) spba.by[is.na(spba.by)] <- 0 spba.by class(spba.by) is.list(spba.by) dim(spba.by) ## Accessing output from by statement (same as tapply, except can use list option) dim(stspba.by) # An array(list) with 2 dimensions [1] 2 5 #(PLT_CN, SPECIESNM) # Display all elements of dimensions 2, for the first element of dimension 1. spba.by[1,] spba.by["4836418010690",] # ..or the name of the element # Display all elements of dimension 1, for the forth element of dimension 2. spba.by[,4] spba.by[,"lodgepole pine"] # ..or the name of the element # Display one element and element value from array/list. spba.by[2,4] # Element spba.by["40374256010690","lodgepole pine"] # Element spba.by[,"lodgepole pine"]$"40374256010690" # Element value spba.by["40374256010690",]$"lodgepole pine" # Element value

by cont.. by – applies a function to vector or matrix. tapply– applies a function to a vector only. # Use by to apply a function to more than one variable. How correlated are DIA, BA, and HT for each species? by(dat[,c("DIA", "BA", "HT")], dat$SPECIESNM, cor)

aggregate aggregate – applies a function to one or more vectors, separated into groups defined by a second vector or list of vectors (returns a data frame). # Get sum of basal area by plot sumbaspp<- aggregate(dat$BA, list(dat$SPECIESNM), sum) names(sumbaspp) <- c("SPECIESNM", "SUMBA") # Get sum of basal area by plot and STATUSCD spba.ag <- aggregate(dat$BA, list(dat$PLT_CN, dat$SPECIESNM), sum) class(spba.ag) # Add names to data frame names(spba.ag)<- c("PLT_CN", "SPECIESNM", "SUMBA") # Get average diameter and average height by SPECIESNM avgdiahtspp<- aggregate(dat[,c("DIA", "HT")], list(dat$SPECIESNM), mean) names(avgdiahtspp) <- c("SPECIESNM", "AVGDIA", "AVGHT") # Merge summed basal area by species with average dia/height by species datsum<- merge(sumbaspp, avgdiahtspp, by="SPECIESNM")

Comparison of Summary Functions ## Differences in output # Example: Get the number of trees by plot # Using tapply treecnt <- tapply(dat$PLT_CN, dat$PLT_CN, length) treecnt class(treecnt) # Using by treecnt.by <- by(dat$PLT_CN, dat$PLT_CN, length) treecnt.by class(treecnt.by) is.array(treecnt.by) # Using aggregate treecnt.ag <- aggregate(dat$PLT_CN, list(dat$PLT_CN), length) names(treecnt.ag) <- c("PLT_CN", "NUM_TREES") class(treecnt.ag) treecnt.ag # Using table function treecnt.tab <- table(dat$PLT_CN) treecnt.tab class(treecnt.tab) is.array(treecnt.tab)

Exercise 8 ## 8.1 Use sapply to change the names of dat to all upper case letters. ## 8.2 Use apply to get the range of values for DIA and HT in dat. Set the results to an object named 'ranges'. Is this object an array? What is the class of 'ranges'? ## 8.3 Use tapply to get the minimum 'HT' by SPECIESNM and STATUSNM and set to an object named minht. Change any NA values to 0. ## 8.4 Use aggregate to get a sum of BA by SPECIESNM. Set this to an object named 'ba.sum'. Add names to ba.sum, SPECIESNM, SUMBA. What is the class of 'ba.sum'?

Basics Programming in R

if/else statements • ## Format • # if(logical statement1){ • # then statement1 • # }else if(logical statement2){ • # then statement2 • # }else{ • # then statement3 • # } • # Example: Print to screen if the value x is less than, equal to, or greater than 10. • x <- 5 • if(x < 10){ • print("less than 10") • }else if(x == 10){ • print("equals 10") • }else{ • print("greater than 10") • } • # Assign x to 10 and run the if statement again. • x <- 10

ifelse statement • ## Format • # ifelse(logical statement1, yes, no) • # Example: Print to screen if the value x is less than 10 or greater than or equal to 10. • x <- 5 • answer <- ifelse(x < 10, "less than 10", "greater than or equal to 10") • answer • x <- 10 • ifelse(x < 10, "less than 10", "greater than or equal to 10") • # Example: Append a new column named STATUSNM to datwith 2 values, "live", where STATUSCD = 1, and "dead", where STATUSCD = 2. • dat$STATUSNM <- ifelse(dat$STATUSCD == 1, "live", "dead")

for/while loops • ## Format • # for(i in 1:n){ • # execute statement • # } • # Example: for loop • for(i in 1:10){ • print(i) • } • # Example: while loop • i <- 1 • while(i < 10){ • print(i) • i <- i+ 1 • } • # Example: for loop • weekdays <- c("Mon", "Tues", "Wednes", "Thurs", "Fri", "Satur", "Sun") • for(day in weekdays){ • print(paste(day, "day", sep="")) • }

if/else and for loops • ## IFELSE statements and FOR loops • # Example: Create a loop using the previous example for if/else statements. • for(xin 1:20){ • if(x < 10){ • print("less than 10") • }else if(x == 10){ • print("equals 10") • }else{ • print("greater than 10") • } • } • # Example: Create a loop that compiles a vector with column names of dat that are less than 5 characters wide. • datnames <- vector() • for(name in names(dat)){ • name.length <- nchar(name) • if(name.length < 5){ • datnames <- c(datnames, name) • } • } • datnames

Functions # Format # f <- function(<arguments>) { # Do something interesting # } # Example: Create a function using the previous example, with x as a user-defined parameter. getx<- function(x){ if(x < 10){ print("less than 10") }else if(x == 10){ print("equals 10") }else{ print("greater than 10") } } getx(5) getx(10) # Example: Function to perform 2 calculations and return the result. getx2 <- function(x, y){ x2 <- x * y x2 <- x2 + 5 return(x2) } x2 <- getx2(5, 100) getx2(2, 20)

Functions cont. # Example: Function to print the mean diameter of a specified species. meanDIA<- function(SPECIES){ # Gets mean DIA for SPECIES in dat dia.mean <- mean(dat[dat$SPECIES== SPECIES, "DIA"]) # Gets name for SPECIES in dat spnm <- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"]) paste("Mean diameter for", spnm, "is:", round(dia.mean)) } meanDIA(108) meanDIA(746) # Example: Function to print the mean of a specified variable for a specified species. meanval<- function(SPECIES, var){ # Gets mean varfor SPECIES in dat var.mean <- mean(dat[dat$SPECIES== SPECIES, var]) # Gets name for SPECIES in dat spnm<- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"]) paste("Mean", var, "for", spnm, "is:", round(var.mean)) } meanval(746, "HT") meanval(108, "DIA")

Functions cont. # Example: Function to print the specified function of a specified variable for a specified species. The default function is mean. funval<- function(SPECIES, var, funnm=mean){ # Summarizes varby funnm for SPECIES in dat var.mean <- funnm(dat[dat$SPECIES== SPECIES, var]) # Gets name for SPECIES in dat spnm <- unique(dat[dat$SPECIES == SPECIES, "SPECIESNM"]) # Gets the string name for the function funnm <- deparse(substitute(funnm)) paste(funnm, var, "for", spnm, "is:", round(var.mean)) } funval(746, "HT", mean) funval(746, "HT") funval(108, "BA", sum)

Functions cont. # Example: Function to return a table of pfun applied to pvar by xvar and yvar. datPivot<- function(dat, pvar, xvar, yvar, pfun=sum){ ptab<- tapply(dat[,pvar], dat[,c(xvar, yvar)], pfun) ptab[is.na(ptab)] <- 0 return(ptab) } datPivot(dat, "BA", "PLT_CN", "SPECIESNM") datPivot(dat, "HT", "PLT_CN", "SPECIESNM") datPivot(dat, "HT", "PLT_CN", "SPECIESNM",mean) # Help on functions https://ems-team.usda.gov/sites/fs-fia-iw-tt/R%20Help%20docs/Rfunctions.pdf

Summarizing with Functions # Create a function to use with an apply statement to calculate a percent of the total basal area for multiple variables by species. # First, create a table summarizing multiple variables by species. sptab <- aggregate(dat[,c("DIA", "BA")], list(dat$SPECIESNM), sum) names(sptab) <- c("SPECIESNM", "SUMDIA", "SUMBA") sptab # Then, create a function to calculate percentages by species. perfun <- function(x){ round(x / sum(x) * 100, 2) } # Apply this function on all columns of sptab apply(sptab[,-1], 2, perfun)

Summarizing with Functions # Use the perfun function to calculate a percent of total basal area per plot. # First, create a table summarizing BA by species and plot sppltba<- aggregate(dat$BA, list(dat$SPECIESNM, dat$PLT_CN), sum) sppltba names(sppltba) <- c("SPECIESNM", "PLT_CN", "SUMBA") sppltba # Then use a by statement to calculate percent basal area per species sppltba.by<- by(sppltba$SUMBA, sppltba$PLT_CN, perfun) sppltba.by # Append the percent calculation to sppltba. sppltba$BApercplot<- unlist(sppltba.by) sppltba # Check the column totals by plot by(sppltba$BApercplot, sppltba$PLT_CN, sum)

Exercise 9 ## 9.1 Use the sapply function to determine if any columns of dat are factors. ## 9.2 Create a loop to check if any of the columns of dat is a factor. Print to screen the name of the columns that are factors. ## 9.3 Append a column named HTCLASS to dat with 2 categories: "SHORT "or "TALL", use the mean of HT as the break point. So: for trees with HT less than the mean HT, set HTCLASS = "SHORT", and for trees with HT greater than or equal to the mean HT, set HTCLASS = "TALL". Check if this works using a table on HT and HTCLASS. ## 9.4 Create a function named 'getrows', that will print out all the records (or trees) of dat for a specified SPECIES. Include one parameter named 'sp'. How many records have lodgepole trees (108)? ## 9.5. Using the function you created in 9.4, how many records have aspen trees? What is the average HT for these records (trees)? ## 9.6 Create a function to use with sapply to add a "_1" to all the column names of dat.

A Touch on Graphics in R

Scatterplots # Plot a scatterplot of basal area by height plot(dat$BA, dat$HT) help(plot) help(par) # Help on graphical parameters par() # Display current settings of par ## Add labels plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height") ## Change symbol (see below) plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19) ## Change color and size of points plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19, col="red", cex=0.5) pch colors() # List of color choices cex<-0.5 # Make the symbol half the size

Scatterplots # Add multiple plots to graphics device (display window). par(mfrow=c(1,3)) # 1*3 = 3 plots (1 row, 3 columns) ## Add plots plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height") plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19) plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19, col="red", cex=0.5) # Change width of space surrounding plots in graphics device (display window). # mar Number of lines of margin to be specified on all 4 sides of plot. # Default is c(5, 4, 4, 2) + 0.1 (bottom, left, top, right). par(mfrow=c(1,3), mar=c(5, 6, 4, 2)) ## Add plots plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height") plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19) plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19, col="red", cex=0.5)

Scatterplots cont. par(mfrow=c(1,1)) plot(dat$BA, dat$HT, main="Basal Area by Height", xlab="Basal Area", ylab="Height", pch=19, col="red", cex=0.5) # Add fit lines to scatterplot to look at trends abline(lm(dat$HT~ dat$BA)) # Regression line (y~x) lines(lowess(dat$BA, dat$HT), col="green") # lowess line (x,y) # Scatterplot matrix pairs(dat[,c("BA", "HT", "DIA")]) help(pairs) ## Other graphics packages # car # scatterplot3d

Mastering Data Frames in R: Essentials, Manipulation, and Visualization

Mastering Data Frames in R: Essentials, Manipulation, and Visualization

Presentation Transcript