Discover the fundamentals of R statistical computing, from creating projects and manipulating datasets to analyzing data structures and dealing with missing values. Learn to read various file formats, reference data, and extend R's capabilities with additional packages.
Introduction to R
"Statistics are no substitute for judgment." Henry Clay, U.S. congressman and senator
R • R is a free software environment for statistical computing and graphics • Object-oriented • It runs on a wide variety of platforms • Highly extensible • Command line and GUI • Conflict between extensible and GUI
RStudio • Datasets • Scripts • Results • Files, plots, packages, & help
Creating a project • Project > Create Project… Store all R scripts and data in the same folder or directory by creating a project
Script
  # CO2 parts per million for 2000-2009
  co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
  year <- (2000:2009) # a range of values
  # show values
  co2
  year
  # compute mean and standard deviation
  mean(co2)
  sd(co2)
  plot(year,co2)
• A script is a set of R commands
• A program
• c is short for combine in c(369.40, …)
Exercise • Plot kWh per square foot by year for the following University of Georgia data.
Smart editing • Copy each column to a word processor • Convert table to text • Search and replace commas with null • Search and replace returns with commas • Edit to put R text around numbers
Datasets • A dataset is a table • One row for each observation • Columns contain observation values • Same as the relational model • R supports multiple data structures and multiple data types
Data structures • co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78) • year <- (2000:2009) • m <- matrix(1:12, nrow=4,ncol=3) • Vector • A single row table where data are all of the same type • Matrix • A table where all data are of the same type
Exercise Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18
Data structures • a <- array(1:24, c(4,3,2)) • gender <- c("m","f","f") • age <- c(5,8,3) • df <- data.frame(gender,age) • Array • Extends a matrix beyond two dimensions • Data frame • Same as a relational table • Columns can have different data types • Typically, read a file to create a data frame
Data structures • l <- list(co2,m,df) • List • An ordered collection of objects • Can store a variety of objects under one name
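A list's elements can be retrieved by position or, if they are named, by name. The following is a minimal sketch; the shortened co2 vector and the element names (co2, m, df) are assumptions made for illustration.

```r
# Rebuild small versions of the objects from the earlier slides
co2 <- c(369.40, 371.07, 373.17)
m <- matrix(1:12, nrow = 4, ncol = 3)
df <- data.frame(gender = c("m", "f", "f"), age = c(5, 8, 3))

# Naming the elements is optional; the names here are illustrative
l <- list(co2 = co2, m = m, df = df)

l[[1]]      # first element by position (the co2 vector)
l$m[2, 3]   # row 2, column 3 of the matrix stored in the list
l$df$age    # the age column of the data frame stored in the list
length(l)   # number of objects in the list
```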
Objects • Anything that can be assigned to a variable • Constant • Data structure • Function • Graph • …
Types of data • Classification • Nominal • Sorting or ranking • Ordinal • Measurement • Interval • Ratio
Factors Nominal and ordinal data are factors Determine how data are analyzed and presented
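As a rough illustration of the difference between nominal and ordinal factors, the sketch below creates one of each. The rating values and their ordering are hypothetical and are not taken from any dataset in these slides.

```r
# Nominal: categories with no inherent order
gender <- factor(c("m", "f", "f"))
levels(gender)   # levels are sorted alphabetically by default
table(gender)    # counts per level

# Ordinal: categories with an explicit order (hypothetical ratings)
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
rating < "high"  # comparisons are meaningful for ordered factors
```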
Missing values • sum(c(1,NA,2)) • sum(c(1,NA,2),na.rm=T) Missing values are indicated by NA (not available) Arithmetic expressions and functions containing missing values generate missing values Use the na.rm=T option to exclude missing values from calculations
Missing values • gender <- c("m","f","f","f") • age <- c(5,8,3,NA) • df <- data.frame(gender,age) • df2 <- na.omit(df) You remove rows with missing values by using na.omit()
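Before removing rows, it can help to see where the missing values are. This sketch reuses the gender and age vectors above and relies only on base R functions (is.na, mean, na.omit).

```r
gender <- c("m", "f", "f", "f")
age <- c(5, 8, 3, NA)
df <- data.frame(gender, age)

is.na(df$age)               # TRUE where age is missing
sum(is.na(df$age))          # number of missing ages
mean(df$age)                # NA, because one value is missing
mean(df$age, na.rm = TRUE)  # mean of the non-missing ages
df2 <- na.omit(df)          # drop rows containing any NA
nrow(df2)                   # rows remaining after removal
```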
Reading a file • R can read a wide variety of input formats • Text • Statistical package formats (e.g., SAS) • DBMS
Reading a text file
  t <- read.table("~/Dropbox/Documents/R/Data/centralparktemps.txt", header=T, sep="\t")
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
• Delimited text file, such as CSV
• Creates a data frame
• Specify as required
• Presence of header
• Separator
• Row names
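To experiment with the header and separator options without a file, read.table can also parse a character string through its text argument. The sketch below assumes the three column names used later in these slides (year, month, temperature); the two data rows are hypothetical, not taken from the real file.

```r
# Hypothetical sample rows; the real file's contents are not shown here
sample_text <- "year,month,temperature
1999,1,30.5
1999,2,35.1"

# header = TRUE takes column names from the first line; sep sets the delimiter
t <- read.table(text = sample_text, header = TRUE, sep = ",")
str(t)   # confirm the column names and types
```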
Learning about an object
Click on the name of the file in the top-right window to see its content
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
  head(t)  # first few rows
  tail(t)  # last few rows
  dim(t)   # dimension
  str(t)   # structure of a dataset
  class(t) # type of object
Referencing data
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
  # qualify with the dataset name to reference columns
  mean(t$temperature)
  max(t$year)
  range(t$month)
  # create a new column with the temperature in Celsius
  t$Ctemp = (t$temperature-32)*5/9
Reference a column as datasetName$columnName
Packages • R’s base set of packages can be extended by installing additional packages • Over 4,000 packages • Search the R Project site to identify packages and functions • Install using RStudio • Packages must be installed prior to use and their use specified in a script • require(packagename)
Packages
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
  require(weathermetrics) # previously installed
  # compute Celsius
  t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)
Exercise Install the weathermetrics package and run the preceding code
Reshaping Melt Cast • Converting data from one format to another • Wide to narrow
Reshaping
  require(reshape)
  s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
  colnames(s) <- c('year', 1:12)
  m <- melt(s,id='year')
  colnames(m) <- c('year','month','co2')
  c <- cast(m,year~month, value='co2')
Writing files
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
  # compute Celsius and round to one decimal place
  t$Ctemp = round((t$temperature-32)*5/9,1)
  colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
  write.table(t,"centralparktempsCF.txt")
Subset
  trow <- t[t$year == 1999,] # selecting rows
  tcol <- t[,c(1:2,4)] # selecting columns
  trowcol <- t[(t$year > 1989 & t$year < 2000),c(1:2,4)] # selecting rows and columns
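The same selections can be tried on a small data frame. In this sketch the five rows are hypothetical values with the same column layout as t, and columns are selected by name rather than by position, which is an optional alternative to c(1:2,4).

```r
# Hypothetical data with the same layout as the slides' data frame t
t <- data.frame(year = c(1985, 1995, 1999, 1999, 2005),
                month = c(7, 1, 1, 7, 7),
                temperature = c(77.0, 35.2, 30.5, 80.1, 78.3))
t$Ctemp <- round((t$temperature - 32) * 5 / 9, 1)

# Rows: a logical condition before the comma
trow <- t[t$year == 1999, ]

# Columns: names (or positions) after the comma
tcol <- t[, c("year", "month", "Ctemp")]

# Rows and columns together
trowcol <- t[t$year > 1989 & t$year < 2000, c("year", "month", "Ctemp")]
```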
Sort • Sorting on column name • s <- t[order(-t$year, t$month),] • s <- t[order(-t[,1], t[,2]),] Sort on column 1 descending and column 2 ascending
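A small worked case makes the effect of order() easier to see. The four rows below are hypothetical; the minus sign reverses the sort for a numeric column.

```r
# Hypothetical rows in unsorted order
t <- data.frame(year = c(1999, 2000, 1999, 2000),
                month = c(2, 1, 1, 2),
                temperature = c(35.1, 31.3, 30.5, 37.3))

# order() returns row positions; year descending, then month ascending
s <- t[order(-t$year, t$month), ]
s
```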
Recoding m$Cut <- 'Other' m$Cut[m$Temperature >= 90] <- 'Hot' • Some analyses might be facilitated by the recoding of data • Split a continuous measure into two categories
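When more than two categories are wanted, base R's cut() can do the recoding in one step. This is a sketch under stated assumptions: the temperatures, the break points (60 and 90), and the labels are hypothetical choices for illustration.

```r
# Hypothetical temperatures
m <- data.frame(Temperature = c(55, 72, 90, 95))

# Two categories, equivalent to the assignment approach above
m$Cut <- ifelse(m$Temperature >= 90, "Hot", "Other")

# Three categories; right = FALSE makes each interval closed on the left,
# so 90 falls in [90, Inf) and is labelled Hot
m$Band <- cut(m$Temperature,
              breaks = c(-Inf, 60, 90, Inf),
              labels = c("Cool", "Mild", "Hot"),
              right = FALSE)
m
```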
Exercise • Download the spreadsheet of monthly mean CO2 measurements (PPM) taken at the Mauna Loa Observatory from 1958 onwards http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html • Export a CSV file that contains three columns: year, month, and average CO2 • Read the file into R • Recode missing values (-99.99) to NA • Plot year versus CO2
Aggregate data
  # average temperature for each year
  a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
  # name columns
  colnames(a) = c('year', 'mean')
• Summarize data using a specified function
• Compute the mean monthly temperature for each year
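aggregate() also accepts a formula, which some readers find easier to read than the by=list(...) form. The sketch below uses four hypothetical rows so that the yearly means can be checked by hand.

```r
# Hypothetical monthly temperatures for two years
t <- data.frame(year = c(1999, 1999, 2000, 2000),
                month = c(1, 7, 1, 7),
                temperature = c(30, 80, 34, 78))

# Formula interface: summarize temperature grouped by year
a <- aggregate(temperature ~ year, data = t, FUN = mean)
colnames(a) <- c("year", "mean")
a
```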
Merging files
  t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=',')
  # average monthly temp for each year
  a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
  # name columns
  colnames(a) = c('year', 'meanTemp')
  # read carbon data (http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
  carbon <- read.csv("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep=',',header=T)
  m <- merge(carbon,a,by='year')
There must be a common column in both files
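By default, merge() keeps only the rows whose key appears in both data frames. The sketch below shows this on two small tables; the CO2 values are the 2000 to 2002 figures from the earlier script slide, while the meanTemp values are hypothetical.

```r
# meanTemp values are hypothetical; CO2 values come from the script slide
a <- data.frame(year = c(1999, 2000, 2001), meanTemp = c(55, 56, 54.5))
carbon <- data.frame(year = c(2000, 2001, 2002),
                     CO2 = c(369.40, 371.07, 373.17))

# Default: only years present in both data frames are kept
m <- merge(carbon, a, by = "year")

# all = TRUE keeps unmatched years and fills the gaps with NA
m_all <- merge(carbon, a, by = "year", all = TRUE)
m_all
```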
Concatenating files • Taking a set of files with the same structure and creating a single file • Same type of data in corresponding columns • Files should be in the same directory
Concatenating files (local directory)
  # read the file names from a local directory
  filenames <- list.files("homeC-all/homeC-power", pattern="*.csv", full.names=TRUE)
  # append the files one after another
  for (i in 1:length(filenames)) {
    # Create the concatenated data frame using the first file
    if (i == 1) {
      cp <- read.table(filenames[i], header=F, sep=',')
    } else {
      temp <- read.table(filenames[i], header=F, sep=',')
      cp <- rbind(cp, temp) # append to existing file
      rm(temp) # remove the temporary file
    }
  }
  colnames(cp) <- c('time','watts')
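The loop can also be written more compactly with lapply() and do.call(rbind, ...). This is an alternative sketch, not the slides' method. To stay self-contained it first writes two small hypothetical CSV files to a temporary directory; the directory name, file names, and values are all made up for illustration. Note that the pattern argument of list.files() is a regular expression, so "\\.csv$" is used here to match names ending in .csv.

```r
# Create two small hypothetical CSV files in a temporary directory
dir <- file.path(tempdir(), "power-demo")
dir.create(dir, showWarnings = FALSE)
write.table(data.frame(c(1, 2), c(100, 110)), file.path(dir, "a.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(3, 4), c(120, 130)), file.path(dir, "b.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Read every file into a list of data frames, then stack them row-wise
filenames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
cp <- do.call(rbind, lapply(filenames, read.table, header = FALSE, sep = ","))
colnames(cp) <- c("time", "watts")
cp
```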
Concatenating files (remote directory with FTP)
  # read the file names from a remote directory (FTP)
  require(RCurl)
  url <- "ftp://watson_ftp:bulldawg1989@richardtwatson.com/power/"
  dir <- getURL(url, dirlistonly = T)
  filenames <- unlist(strsplit(dir,"\n")) # split into filenames
  # append the files one after another
  for (i in 1:length(filenames)) {
    # prefix each file name with the directory URL so that read.table can locate it
    file <- paste0(url, filenames[i])
    # Create the concatenated data frame using the first file
    if (i == 1) {
      cp <- read.table(file, header=F, sep=',')
    } else {
      temp <- read.table(file, header=F, sep=',')
      cp <- rbind(cp, temp) # append to existing file
      rm(temp) # remove the temporary file
    }
  }
  colnames(cp) <- c('time','kwh')
Correlation coefficient
  cor.test(m$meanTemp,m$CO2)

      Pearson's product-moment correlation
  data: m$meanTemp and m$CO2
  t = 3.1173, df = 51, p-value = 0.002997
  95 percent confidence interval:
   0.1454994 0.6049393
  sample estimates:
        cor
  0.4000598
Significant
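The individual parts of a cor.test() result can be extracted by name rather than read from the printed output. The sketch below applies the test to the year and co2 vectors from the script slide, since the merged data frame m is not available offline; it illustrates the mechanics only and is not a substitute for the temperature analysis above.

```r
# Vectors from the script slide
co2 <- c(369.40, 371.07, 373.17, 375.78, 377.52,
         379.76, 381.85, 383.71, 385.57, 384.78)
year <- 2000:2009

r <- cor.test(year, co2)  # Pearson correlation by default
r$estimate                # the correlation coefficient
r$p.value                 # significance of the test
r$conf.int                # 95 percent confidence interval
```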
Database access • MySQL access • You need the appropriate Java ARchive (JAR) file for MySQL access installed on your computer • http://dev.mysql.com/downloads/connector/j/
Database access • Decompress and move mysql-connector-java-5.1.xx*-bin.jar • OS X • Macintosh HD/Library/Java/Extensions • Windows • c:\jre\lib\ext
* xx is the release number
Database access (MySQL access)
  require(RJDBC)
  # Load the driver; change the path to point to your JAR file (change path for Windows)
  drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.26-bin.jar")
  # connect to the database
  # change user and pwd for your values
  conn <- dbConnect(drv, "jdbc:mysql://wallaby.terry.uga.edu/Weather", "student", "student")
  # Query the database and create file t for use with R
  t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
  head(t)
Exercise • Using the Atlanta weather database and the lubridate package • Compute the average temperature at 5 pm in August • Determine the maximum temperature for each day in August for each year
Resources R books Reference card Quick-R
Key points • R is a platform for a wide variety of data analytics • Statistical analysis • Data visualization • HDFS and MapReduce • Text mining • Energy Informatics • R is a programming language • Much to learn