1 / 37

Data Manipulation with R: A Comprehensive Guide

Learn how to manipulate and analyze data using the R programming language. This session covers key packages, functions, and techniques for data manipulation in R.

slavin
Download Presentation

Data Manipulation with R: A Comprehensive Guide

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. R與資料(數據)操作Data Manipulation with R 講者:陳慶文 Ching-Wen Chen, Ph. D. 國立高雄第一科技大學資訊管理系教授 Dept. of I.M., NKFUST中華民國106年3月5日

  2. 為何用R • 公開碼源、自由使用且免費 • 進入門檻低(但不易專精) • 參予「生產」套件的使用者多(不必受制於「菁英」所寫的內容),且品質有「人」把關 • 論壇多,找尋答案容易。 • 「我」不喜歡使用套裝軟體,因為「貴」。$會阻礙知識的獲取與傳播,不利於「不富裕」的使用者學習。

  3. R是什麼 • R is a language and environment for statistical computing and graphics.  • The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools. • R is an integrated suite of software facilities for data manipulation, calculation and graphical display. •  R can be extended (easily) via packages.  • Read yourselves: https://www.r-project.org/about.html

  4. 資料操作的範疇 Data Scientist vs. Data Engineer

  5. Packages in Use • eight core packages (base, stats,…) • Documentation for package ‘base’ version 3.4.0 • dplyr(main focus) • ggplot2

  6. Data from package {nycflights13} • Description Airline on-time data for all flights departing NYC in 2013. Also includes useful 'metadata' on airlines, airports, weather, and planes • URL http://github.com/hadley/nycflights13 • airlines • airports • flights • planes • weather

  7. Relational data

  8. dplyr: two table verbs • Mutating joins • Filtering joins • Set operations

  9. Mutating joins

  10. dplyr: Single table verbs • filter() (and slice()) • arrange() • select() (and rename()) • distinct() • mutate() (and transmute()) • summarise() • sample_n() (and sample_frac())

  11. Filter rows with filter() library(nycflights13) library(dplyr) filter(flights, month == 1, day == 1) View(filter(flights, month == 1, day == 1)) flights[flights$month == 1 & flights$day == 1, ] filter(flights, month == 1 | month == 2) slice(flights, 1:10) slice(flights, 11101:11110)

  12. Arrange rows with arrange() arrange(flights, year, month, day) arrange(flights, desc(arr_delay)) flights[order(flights$year, flights$month, flights$day), ] flights[order(flights$arr_delay, decreasing = TRUE), ] flights[order(-flights$arr_delay), ]

  13. Select columns with select() select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) select(flights, tail_num = tailnum) rename(flights, tail_num = tailnum)

  14. Extract distinct (unique) rows distinct(flights, tailnum) distinct(flights, origin, dest)

  15. Add new columns with mutate() mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60) mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60)) transform(flights, gain = arr_delay - delay, gain_per_hour = gain / (air_time / 60)) transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )

  16. Summarise values with summarise() summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) sample_n(flights, 10) sample_frac(flights, 0.01)

  17. Grouped operations by_tailnum <- group_by(flights, tailnum) delay <- summarise(by_tailnum, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) delay <- filter(delay, count > 20, dist < 2000)

  18. Plot:delay vs. distance library(ggplot2) ggplot(delay, aes(dist, delay)) + geom_point(aes(size = count), alpha = 1/2) + geom_smooth() + scale_size_area()

  19. A handful of others destinations <- group_by(flights, dest) summarise(destinations, planes = n_distinct(tailnum), flights = n()) daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, flights = n()) (per_month <- summarise(per_day, flights = sum(flights))) (per_year <- summarise(per_month, flights = sum(flights)))

  20. Chaining a1 <- group_by(flights, year, month, day) a2 <- select(a1, arr_delay, dep_delay) a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE)) a4 <- filter(a3, arr > 30 | dep > 30) a4

  21. Pipes filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 ) flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30)

  22. dplyr and pipes: the basics • Verbs make up the majority of the data manipulation • Select certain columns of data. • Filter your data to select specific rows. • Arrange the rows of your data into an order. • Mutate your data frame to contain new columns. • Summarise chunks of you data in some way.

  23. The data pantheria <- "http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt" download.file(pantheria, destfile = "mammals.txt") mammals <- read.table("mammals.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE) str(mammals) View(head(mammals)) # cleaning data names(mammals) <- sub("X[0-9._]+", "", names(mammals)) names(mammals) <- sub("MSW05_", "", names(mammals)) mammals <- dplyr::select(mammals, Order, Binomial, AdultBodyMass_g, AdultHeadBodyLen_mm, HomeRange_km2, LitterSize) names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE) names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE) mammals[mammals == -999] <- NA names(mammals)[names(mammals) == "binomial"] <- "species" mammals <- dplyr::tbl_df(mammals) # for prettier printing

  24. Looking at the data library(dplyr) mammals glimpse(mammals)

  25. Selecting columns select(mammals, adult_head_body_len_mm) select(mammals, adult_head_body_len_mm, litter_size) select(mammals, adult_head_body_len_mm:litter_size) select(mammals, -adult_head_body_len_mm) select(mammals, contains("body")) View(select(mammals, starts_with("adult"))) select(mammals, ends_with("g")) select(mammals, 1:3)

  26. Filtering rows View(filter(mammals, adult_body_mass_g > 1e7)[ , 1:3]) filter(mammals, species == "Balaena mysticetus") filter(mammals, order == "Carnivora" & adult_body_mass_g < 200)

  27. Arranging rows arrange(mammals, adult_body_mass_g)[ , 1:3] arrange(mammals, desc(adult_body_mass_g))[ , 1:3] arrange(mammals, order, adult_body_mass_g)[ , 1:3]

  28. Mutating columns glimpse(mutate(mammals, adult_body_mass_kg = adult_body_mass_g / 1000)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm)) glimpse(mutate(mammals, g_per_mm = adult_body_mass_g / adult_head_body_len_mm, kg_per_mm = g_per_mm / 1000))

  29. Summarising columns summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE)) # summarise with group_by: head(summarise(group_by(mammals, order), mean_mass = mean(adult_body_mass_g, na.rm = TRUE)))

  30. Piping data # ?magrittr::`%>%` x <- rnorm(10) x %>% max # is the same thing as: max(x) mammals %>% arrange(adult_body_mass_g)

  31. Piping data (Cont.) ## Say we wanted to find the species with the highest body-mass-to-length ratio: mammals %>% mutate(mass_to_length = adult_body_mass_g / adult_head_body_len_mm) %>% arrange(desc(mass_to_length)) %>% select(species, mass_to_length)

  32. Piping data (Cont.) ## The above is equivalent to: select( arrange( mutate(mammals, mass_to_length = adult_body_mass_g / adult_head_body_len_mm), desc(mass_to_length)), species, mass_to_length)

  33. Piping data (Cont.) ## taxonomic orders have a median litter size greater than 3 mammals %>% group_by(order) %>% summarise(median_litter = median(litter_size, na.rm = TRUE)) %>% filter(median_litter > 3) %>% arrange(desc(median_litter)) %>% select(order, median_litter)

  34. Good reference of dplyr dplyr: A Grammar of DataManipulation

  35. 結語 Manipulating data is that process of re-sorting, rearranging and otherwise moving your research data, without fundamentally changing it. One of the key characteristics of a manipulation technique versus related techniques like transformation is that the underlying data remains unchanged. The main thing we’re doing is changing the relationship – logical or physical – that one piece of data has with another.

  36. 謝謝聆聽! Q&A

More Related