
R Introduction, Data Structures


toby

Presentation Transcript


  1. R Introduction, Data Structures

  2. An Excellent R Book (among many others)
     R in Action: Data Analysis and Graphics with R
     Robert I. Kabacoff
     http://www.manning.com/affiliate/idevaffiliate.php?id=1102_173

  3. Steps in a typical data analysis (Kabacoff, 2011)

  4. R features (Kabacoff, 2011)
     • R is free! (SPSS, SAS, etc. cost thousands or tens of thousands of dollars)
     • R is a comprehensive statistical platform, offering all manner of data analytic techniques
     • R has state-of-the-art graphics capabilities
     • R is a powerful platform for interactive data analysis and exploration
     • R can easily import data from a wide variety of sources, including text files, database management systems, statistical packages, and specialized data repositories. It can write data out to these systems as well
     • R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. It’s easily extensible and provides a natural language for quickly programming recently published methods
     • R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis
     • A variety of graphical user interfaces (GUIs) are available, offering the power of R through menus and dialogs
     • R runs on a wide array of platforms, including Windows, Unix, and Mac OS X

  5. Data structures in R

  6. Vectors
     • Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data
     • The combine function c() is used to form the vector
         > x = c(1, 3, 5, 7, 25, -13, 47)
         > y = c("unu", "doi", "trei", "opt")
     • All the data in a vector must be of a single type (numeric, character, or logical)
     • Elements of a vector can be referred to using a numeric vector of positions within brackets: x[c(4, 6)] refers to the 4th and 6th elements of vector x.
         > x = c(1, 3, 5, 7, 25, -13, 47)
         > x[3]
         [1] 5
         > x[c(1, 2, 4)]
         [1] 1 3 7
         > x[2:6]
         [1]   3   5   7  25 -13
     • The colon operator in the last statement generates a sequence of numbers; x <- c(2:6) gives the same values as x <- c(2, 3, 4, 5, 6)
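
The single-type rule and position-based indexing from this slide can be checked with a short sketch. The vector `mixed` is an illustrative name that does not appear in the original slides; it is there only to show that c() converts mixed inputs to one common type.

```r
x <- c(1, 3, 5, 7, 25, -13, 47)

# Mixing types: c() coerces everything to a single type (here, character)
mixed <- c(1, "doi", TRUE)
class(mixed)

# Indexing with a numeric vector of positions
x[c(4, 6)]   # 4th and 6th elements: 7 and -13

# The colon operator builds the position vector 2, 3, 4, 5, 6
all(x[2:6] == x[c(2, 3, 4, 5, 6)])
```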

  7. Date type
     • Dates are more difficult to handle
     • Dates are represented as the number of days since 1970-01-01, with negative values for earlier dates.
     • as.Date() converts strings to dates
         > mydates <- as.Date(c('2013-10-01', '2013-10-03', '2013-11-10'))
     • number of days between 10 November 2013 and 3 October 2013
         > days <- mydates[3] - mydates[2]
         > days
         > # notice the way the result is displayed
     • print today's date
         > today <- Sys.Date()
         > format(today, format="%d %B %Y")
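
The day-count representation described on this slide can be made visible with as.numeric(). This is a minimal sketch; the two dates around 1970-01-01 are chosen here for illustration and are not from the slides.

```r
# Dates are stored as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))   # 0
as.numeric(as.Date("1969-12-31"))   # -1, an earlier date

mydates <- as.Date(c("2013-10-01", "2013-10-03", "2013-11-10"))

# Subtracting two dates gives a "difftime" object, not a plain number
days <- mydates[3] - mydates[2]
class(days)
as.numeric(days)   # 38 days between 3 October and 10 November 2013
```

Printing `days` directly shows the result with its unit, which is the display the slide asks the reader to notice.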

  8. Symbols used with format( )
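
The table from this slide did not survive in the transcript. As a partial sketch, the example below uses conversion symbols documented for base R date formatting; the fixed date `d` is an arbitrary choice made here so that the results are reproducible.

```r
d <- as.Date("2013-10-03")

format(d, "%d")         # day of the month as a number: "03"
format(d, "%m")         # month as a number: "10"
format(d, "%y")         # two-digit year: "13"
format(d, "%Y")         # four-digit year: "2013"
format(d, "%d/%m/%Y")   # symbols can be combined: "03/10/2013"

# %a and %A (weekday), %b and %B (month) give abbreviated and full names;
# their output depends on the locale, so no result is shown here
format(d, "%A %d %B %Y")
```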

  9. Date conversions
     • Character to Date: as.Date(x, "format")
         > # convert date info in format 'dd/mm/yyyy'
         > strDates = c("01/10/2013", "31/10/2013")
         > dates = as.Date(strDates, "%d/%m/%Y")
     • Date to Character: as.character()
         > # convert dates to character data
         > strDates2 = as.character(dates)

  10. Matrices
      • Two-dimensional arrays where each element has the same type (numeric, character, or logical)
      • Created with the matrix function. Format:
          > mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
          +     byrow=logical_value,
          +     dimnames=list(char_vector_rownames, char_vector_colnames))
      • vector contains the elements for the matrix
      • nrow and ncol specify the row and column dimensions
      • dimnames contains optional row and column labels stored in character vectors
      • byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column

  11. Creating matrices (1)
      • First example (a 5 x 4 matrix)
          > m1 <- matrix(1:20, nrow=5, ncol=4)
          > m1
               [,1] [,2] [,3] [,4]
          [1,]    1    6   11   16
          [2,]    2    7   12   17
          [3,]    3    8   13   18
          [4,]    4    9   14   19
          [5,]    5   10   15   20
      • Second example (a 2 x 2 matrix, filled by rows)
          > cells <- c(1,26,24,68)
          > rownames <- c("Row1", "Row2")
          > colnames <- c("Col1", "Col2")
          > m2 <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
          +     dimnames=list(rownames, colnames))
          > m2
               Col1 Col2
          Row1    1   26
          Row2   24   68

  12. Creating matrices (2)
      • Third example (a 2 x 2 matrix, filled by columns)
          > m3 <- matrix(cells, nrow=2, ncol=2,
          +     byrow=FALSE, dimnames=list(rownames,
          +     colnames))
          > m3
               Col1 Col2
          Row1    1   24
          Row2   26   68
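
The effect of byrow in the second and third examples can be compared directly. This is a small sketch; the names `by_row` and `by_col` are introduced here for illustration and are not in the slides.

```r
cells <- c(1, 26, 24, 68)

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2, ncol = 2)   # byrow = FALSE is the default

# For a square matrix, filling by row gives the transpose of filling by column
all(by_row == t(by_col))   # TRUE

# The element in row 1, column 2 differs between the two fills
by_row[1, 2]   # 26
by_col[1, 2]   # 24
```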

  13. Accessing matrix elements (1)
      • (re)create the matrix
          > m1 <- matrix(1:20, nrow=5)
          > m1
               [,1] [,2] [,3] [,4]
          [1,]    1    6   11   16
          [2,]    2    7   12   17
          [3,]    3    8   13   18
          [4,]    4    9   14   19
          [5,]    5   10   15   20
      • display the 3rd row
          > m1[3,]
          [1]  3  8 13 18
      • display the 3rd column
          > m1[,3]
          [1] 11 12 13 14 15

  14. Accessing matrix elements (2)
      • display the element in the 2nd row and 3rd column
          > m1[2,3]
          [1] 12
      • display two elements from the same row: m1[2,3] and m1[2,4]
          > m1[2, c(3,4)]
          [1] 12 17
      • display three elements from the same column: m1[1,2], m1[2,2] and m1[3,2]
          > m1[c(1,2,3), 2]
          [1] 6 7 8
      • display a "submatrix", from m1[2,2] to m1[4,4]
          > m1[c(2,3,4), c(2,3,4)]
               [,1] [,2] [,3]
          [1,]    7   12   17
          [2,]    8   13   18
          [3,]    9   14   19
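
As a sketch of how the indexing forms on these two slides relate, the submatrix can also be selected with the colon operator from the Vectors slide. The names `sub1`, `sub2`, and `row3` are illustrative only.

```r
m1 <- matrix(1:20, nrow = 5)

# A sequence can replace an explicit vector of positions
sub1 <- m1[c(2, 3, 4), c(2, 3, 4)]
sub2 <- m1[2:4, 2:4]
identical(sub1, sub2)   # TRUE

# Selecting a single row or column returns a plain vector, not a matrix
row3 <- m1[3, ]
is.matrix(row3)   # FALSE
dim(sub1)         # 3 3
```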

  15. Arrays
      • Similar to matrices but can have more than two dimensions
      • Elements must be of the same type
      • Created with the array function:
          > myarray <- array(vector,
          +     dimensions, dimnames)
      • vector contains the data for the array
      • dimensions is a numeric vector giving the maximal index for each dimension
      • dimnames is an optional list of dimension labels
      • Elements in arrays are accessed in a similar way to those in matrices

  16. Create and access arrays (1)
      • create the array
          > dim1 <- c("A1", "A2")
          > dim2 <- c("B1", "B2", "B3")
          > dim3 <- c("C1", "C2",
          +     "C3", "C4")
          > a1 <- array(1:24, c(2, 3, 4),
          +     dimnames=list(dim1, dim2,
          +     dim3))
          > a1
          , , C1
             B1 B2 B3
          A1  1  3  5
          A2  2  4  6
          , , C2
             B1 B2 B3
          A1  7  9 11
          A2  8 10 12
          , , C3
             B1 B2 B3
          A1 13 15 17
          A2 14 16 18
          , , C4
             B1 B2 B3
          A1 19 21 23
          A2 20 22 24
      • display element [2,2,3]
          > a1[2,2,3]
          [1] 16

  17. Create and access arrays (2)
      • display a subarray containing all elements from the first two positions of A, B and C
          > a1[c(1,2),c(1,2),c(1,2)]
          , , C1
             B1 B2
          A1  1  3
          A2  2  4
          , , C2
             B1 B2
          A1  7  9
          A2  8 10
      • display a matrix of the elements of A and B for the first position of C
          > a1[,,1]
             B1 B2 B3
          A1  1  3  5
          A2  2  4  6
      • display the elements of A for the 3rd position of B and the 2nd position of C
          > a1[,3,2]
          A1 A2
          11 12
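
The array from these slides can be rebuilt in one step to check its dimensions and the element looked up earlier. Selecting by dimension label, as in the third call, is standard R indexing but is not shown in the original slides.

```r
a1 <- array(1:24, c(2, 3, 4),
            dimnames = list(c("A1", "A2"),
                            c("B1", "B2", "B3"),
                            c("C1", "C2", "C3", "C4")))

dim(a1)                  # 2 3 4
a1[2, 2, 3]              # 16, selected by position
a1["A2", "B2", "C3"]     # 16, the same element selected by label

# Leaving an index empty keeps that whole dimension
dim(a1[, , 1])           # 2 3: a matrix for the first position of C
```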

  18. Data Frames
      • Most important data structure in R (at least for us)
      • A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata) and databases
      • The columns are variables and the rows are observations
      • Variables can have different types (for example, numeric, character) in the same data frame
      • Data frames are the main structures we’ll use to store datasets

  19. data.frame function
      • A data frame is created with the data.frame() function:
          > mydata <- data.frame(col1, col2, col3, …)
      • col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical)
      • names for each column can be provided with the names function
          > studentID <- c(1, 2, 3, 4, 5)
          > name <- c("Popescu I. Vasile", "Ianos W. Adriana",
          +     "Kovacz V. Iosef", "Babadag I. Maria", "Pop P. Ion")
          > age <- c(23, 19, 21, 22, 31)
          > scholarship <- c("Social","Studiu1","Studiu2","Merit","Studiu1")
          > lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
          > final_grade <- c(9, 9.45, 9.75, 9, 6)
          > student_gi <- data.frame(studentID, name, age, scholarship,
          +     lab_assessment, final_grade)
          > student_gi
            studentID              name age scholarship lab_assessment final_grade
          1         1 Popescu I. Vasile  23      Social           Bine        9.00
          2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
          3         3   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
          4         4  Babadag I. Maria  22       Merit           Bine        9.00
          5         5        Pop P. Ion  31     Studiu1           Slab        6.00

  20. Accessing elements of a data frame (1)
      • display the first two columns (studentID and name)
          > student_gi[1:2]
            studentID              name
          1         1 Popescu I. Vasile
          2         2  Ianos W. Adriana
          3         3   Kovacz V. Iosef
          4         4  Babadag I. Maria
          5         5        Pop P. Ion
      • the same operation could be done with
          > student_gi[c("studentID", "name")]
      • display the final_grade column as a vector
          > student_gi$final_grade
          [1] 9.00 9.45 9.75 9.00 6.00
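
The difference between the two access forms on this slide is the type of the result: brackets return a data frame, while $ returns a vector. The sketch below rebuilds a reduced version of student_gi (three of the six columns, an abbreviation made here to keep the example short).

```r
student_gi <- data.frame(
  studentID   = c(1, 2, 3, 4, 5),
  name        = c("Popescu I. Vasile", "Ianos W. Adriana", "Kovacz V. Iosef",
                  "Babadag I. Maria", "Pop P. Ion"),
  final_grade = c(9, 9.45, 9.75, 9, 6)
)

cols <- student_gi[c("studentID", "name")]
class(cols)      # "data.frame"

grades <- student_gi$final_grade
class(grades)    # "numeric"
mean(grades)     # 8.64
```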

  21. Accessing elements of a data frame (2)
      • cross-tabulate (a sort of pivot table) lab_assessment by final_grade
          > table(student_gi$lab_assessment,
          +     student_gi$final_grade)
                        6 9 9.45 9.75
            Bine        0 2    0    0
            Excelent    0 0    0    1
            Foarte bine 0 0    1    0
            Slab        1 0    0    0
      • summary statistics of final_grade
          > summary(student_gi$final_grade)
             Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
             6.00    9.00    9.00    8.64    9.45    9.75
      • two plots
          > plot(student_gi$lab_assessment, student_gi$final_grade)
          > plot(student_gi$age, student_gi$final_grade)

  22. attach()
      • The attach() function adds the data frame to the R search path
      • When a variable name is encountered, data frames in the search path are checked in order to locate the variable
      • But first we'll delete the vectors which formed the data frame (to avoid confusion)
          > rm(studentID, name, age, scholarship, lab_assessment,
          +     final_grade)
      • Now we'll run the previous commands again, this time with attach
          > attach(student_gi)
          > final_grade
          > table(lab_assessment, final_grade)
          > summary(final_grade)
          > plot(lab_assessment, final_grade)
          > plot(age, final_grade)
      • At the end, detach() removes the data frame from the R search path
          > detach(student_gi)
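
The search path mentioned on this slide can be inspected with search(). This is a minimal sketch with a two-column version of the data frame; the variables `on_path`, `off_path`, and `avg` are illustrative names added here.

```r
student_gi <- data.frame(age = c(23, 19, 21, 22, 31),
                         final_grade = c(9, 9.45, 9.75, 9, 6))

attach(student_gi)
on_path <- "student_gi" %in% search()   # TRUE: the data frame is on the search path
avg <- mean(final_grade)                # the column is visible by its bare name

detach(student_gi)
off_path <- "student_gi" %in% search()  # FALSE: it has been removed again
```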

  23. Case Identifiers
      • Can be specified with the row.names option in the data.frame function
      • New values for studentID (to avoid confusion with regular row numbers)
          > studentID <- c(1001, 1002, 1003, 1004, 1005)
      • Vectors name, age, scholarship, lab_assessment and final_grade are the same
      • (Slightly) new version of the data frame
          > student_gi <- data.frame(studentID, name, age,
          +     scholarship, lab_assessment, final_grade,
          +     row.names = studentID)
      • studentID is the variable to use in labeling cases
          > student_gi
               studentID              name age scholarship lab_assessment final_grade
          1001      1001 Popescu I. Vasile  23      Social           Bine        9.00
          1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
          1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
          1004      1004  Babadag I. Maria  22       Merit           Bine        9.00
          1005      1005        Pop P. Ion  31     Studiu1           Slab        6.00
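
Once case identifiers are set, a row can be selected by its identifier as well as by its position. The sketch below uses a two-column version of the data frame; selecting with a quoted identifier is standard R indexing rather than something shown on the slide.

```r
studentID <- c(1001, 1002, 1003, 1004, 1005)
final_grade <- c(9, 9.45, 9.75, 9, 6)
student_gi <- data.frame(studentID, final_grade, row.names = studentID)

rownames(student_gi)                 # identifiers are stored as character strings
student_gi["1003", "final_grade"]    # 9.75, selected by case identifier
student_gi[3, "final_grade"]         # 9.75, selected by position
```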

  24. Factors (1)
      • Variables can be described as nominal, ordinal, or continuous
      • Nominal variables are categorical, without an implied order. Examples: MaritalStatus, Sex, Job, MasterProgramme
      • Ordinal variables imply order but not amount. Examples: Status (poor, improved, excellent), LabAssessment (slab, bine, foarteBine, excelent)
      • Continuous variables can take on any value within some range, and both order and amount are implied. Examples: LitersPer100Km, Height, Weight, FinalGrade (with decimals)
      • Categorical (nominal) and ordered categorical (ordinal) variables are called factors
      • Factors determine how data will be analyzed and presented visually
      • The function factor() stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers
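
The integer storage described in the last bullet can be inspected with levels() and as.integer(). This is a small sketch using the scholarship values that appear elsewhere in the slides; the name `recovered` is introduced here for illustration.

```r
scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
scholarship_f <- factor(scholarship)

levels(scholarship_f)       # the unique values, sorted: the internal character vector
as.integer(scholarship_f)   # the integer codes: 2 3 4 1 3
nlevels(scholarship_f)      # k = 4

# Each code indexes into the levels, which recovers the original values
recovered <- levels(scholarship_f)[as.integer(scholarship_f)]
identical(recovered, scholarship)   # TRUE
```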

  25. factor function
      • A nominal variable
          > scholarship <- c("Social","Studiu1","Studiu2","Merit",
          +     "Studiu1")
      • factor function
          > scholarship_f <- factor(scholarship)
          > scholarship_f
          [1] Social  Studiu1 Studiu2 Merit   Studiu1
          Levels: Merit Social Studiu1 Studiu2
      • Ordinal variable
          > lab_assessment <- c("Bine", "Foarte bine", "Excelent",
          +     "Bine", "Slab")
          > lab_assessment
          [1] "Bine"        "Foarte bine" "Excelent"    "Bine"        "Slab"
          > lab_assessment <- factor(lab_assessment, ordered=TRUE,
          +     levels=c("Slab", "Bine", "Foarte bine", "Excelent"))
          > lab_assessment
          [1] Bine        Foarte bine Excelent    Bine        Slab
          Levels: Slab < Bine < Foarte bine < Excelent
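
One practical consequence of declaring the factor as ordered is that its values can be compared. The comparisons below are standard behaviour for ordered factors in R, though the slide itself does not show them; `above_bine` and `best` are illustrative names.

```r
lab_assessment <- factor(c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
                         ordered = TRUE,
                         levels = c("Slab", "Bine", "Foarte bine", "Excelent"))

is.ordered(lab_assessment)                # TRUE

# Comparisons follow the order given in levels, not alphabetical order
above_bine <- lab_assessment > "Bine"     # FALSE TRUE TRUE FALSE FALSE
best <- max(lab_assessment)               # Excelent
```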

  26. Data Frame with Factors (1)
      • Vectors studentID, name, age, final_grade are the same as before
      • scholarship and lab_assessment are factors
          > scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
          > scholarship <- factor(scholarship)
          > lab_assessment <- c("Bine", "Foarte bine", "Excelent", "Bine", "Slab")
          > lab_assessment <- factor(lab_assessment, ordered=TRUE, levels=c("Slab",
          +     "Bine", "Foarte bine", "Excelent"))
      • Another version of the data frame (column studentID is removed and becomes the row identifier)
          > student_gi <- data.frame(name, age, scholarship,
          +     lab_assessment, final_grade, row.names = studentID)
      • Structure of the data frame (in R 4.0 and later, name is shown as chr rather than Factor unless stringsAsFactors=TRUE is passed to data.frame)
          > str(student_gi)
          'data.frame': 5 obs. of 5 variables:
           $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
           $ age           : num 23 19 21 22 31
           $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
           $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
           $ final_grade   : num 9 9.45 9.75 9 6

  27. Data Frame with Factors (2)
      • Basic statistics about the variables in the data frame
          > summary(student_gi)

  28. Factors and Value Labels
      • The factor() function can be used to create value labels for categorical variables
          > patientID <- c(1, 2, 3, 4)
          > age <- c(25, 34, 28, 52)
          > diabetes <- c("Type1", "Type2", "Type1", "Type1")
          > status <- c("Poor", "Improved", "Excellent", "Poor")
          > diabetes <- factor(diabetes)
          > status <- factor(status, ordered=TRUE)
          > gender <- c(1, 2, 2, 1)
          > patientdata <- data.frame(patientID, age, diabetes,
          +     status, gender)
      • Variable gender is coded 1 for male and 2 for female. Create value labels:
          > patientdata$gender <- factor(patientdata$gender,
          +     levels = c(1,2), labels = c("male", "female"))
      • levels indicates the actual values of the variable
      • labels refers to a character vector containing the desired labels
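
The mapping from codes to labels can be checked on the gender vector alone. The second half of the sketch points out a detail the slide does not discuss: when an ordered factor is created without levels, R sorts the levels alphabetically, so the intended order has to be passed explicitly. The names `gender_f` and `status_f` are illustrative.

```r
gender <- c(1, 2, 2, 1)
gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))
as.character(gender_f)   # "male" "female" "female" "male"

status <- c("Poor", "Improved", "Excellent", "Poor")

# Without explicit levels, an ordered factor sorts its levels alphabetically
levels(factor(status, ordered = TRUE))   # "Excellent" "Improved" "Poor"

# Passing levels sets the intended order
status_f <- factor(status, ordered = TRUE,
                   levels = c("Poor", "Improved", "Excellent"))
status_f[1] < status_f[3]   # TRUE: Poor ranks below Excellent
```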

  29. Lists
      • Lists are the most complex of the R data types
      • A list is an ordered collection of objects (components)
      • A list allows gathering a large variety of (possibly unrelated) objects under one name
      • A list can contain a combination of vectors, matrices, data frames, and even other lists
      • Created using the list() function:
          mylist <- list(object1, object2, …)
        where the objects are any of the structures seen so far
      • Optionally, the objects in a list can be named:
          mylist <- list(name1=object1,
          +     name2=object2, …)
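
A named list holding several of the structures seen so far might look like the sketch below. The component names and values are made up for illustration, and the access forms ($, double brackets, single brackets) are standard R rather than something covered on the slide.

```r
ages <- c(23, 19, 21)
m <- matrix(1:4, nrow = 2)

# A string, a vector, a matrix, and another list under one name
mylist <- list(title = "Student data", ages = ages, m = m, inner = list(a = 1))

length(mylist)          # 4 components
names(mylist)           # "title" "ages" "m" "inner"
mylist$ages             # a component by name
mylist[["ages"]]        # the same component, double-bracket form
mylist[[2]]             # the same component, by position
class(mylist["ages"])   # "list": single brackets return a sub-list
```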

  30. Useful functions for Data Objects
