class: center, middle, inverse, title-slide # Data Transformation ### (
@RLadiesLancs
) ### 2019-06-05 --- layout: true <div class = "rladies-header"> <span class="social"><table><tr><td><img src="images/twitter.gif"/></td><td> @RLadiesLancs</td></tr></table></span> </div> --- # Tonight * 18:00 - 18:50 - `filter()`, `arrange()` and `select()` * 18:50 - 19:15 - Nibbles and networking 🧀 * 19:15 - 20:00 - `mutate()` and `summarise()` ### Slides: [github.com/rladies](https://github.com/rladies/) ### Book chapter: [r4ds.had.co.nz](https://r4ds.had.co.nz/transform.html) ### Slack: [bit.ly/R4DSslack](https://bit.ly/R4DSslack ) <img src="images/jr.png" width="250px" style="display: block; margin: auto;" /> --- # dplyr and Tidyverse To access the datasets, help pages, and functions that we will use tonight, load the tidyverse by running this code: ```r install.packages("tidyverse") ``` ```r library("tidyverse") ``` <img src="images/hex-dplyr.png" width="250px" style="display: block; margin: auto;" /> --- # The Data ✈️ * All flights that departed from New York City in 2013. ```r library("nycflights13") ``` Typing `flights` will print out the data in our console. ```r flights ``` Using a question mark opens the help page. ```r ?flights ``` Using the `View()` function opens the data in a new tab. ```r View(flights) ``` --- # The big 5 dplyr verbs * Pick observations by their values: `filter()`. * Reorder the rows: `arrange()`. * Pick variables by their names: `select()`. * Create new variables with functions of existing variables: `mutate()`. * Collapse many values down to a single summary: `summarise()`. ### `function(dataframe, options)` --- # `filter()` `filter()` allows you to subset observations based on their values. Let's say we want to select all the rows for for flights which departed on 1st January. -- ```r filter(flights, month == 1, day == 1) ``` ``` ## # A tibble: 842 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## # … with 832 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` -- ```r jan1 = filter(flights, month == 1, day == 1) ``` --- # Making comparisons | Symbol | Name | |--------|--------------------------| | > | greater than | | >= | greater than or equal to | | < | less than | | <= | less than or equal to | | == | is equal to | | != | is not equal to | Careful! ```r filter(flights, month = 1) ``` ``` ## `month` (`month = 1`) must not be named, do you need `==`? ``` --- # Logical operators | Symbol | Name | |--------|------| | & | and | | I | or | | ! | not | Let's find all the flights that occured in Autumn 🍁. ```r filter(flights, month == 9 | month == 10 | month == 11 ) ``` ```r autumn <- filter(flights, month %in% c(10, 11, 12)) ``` --- # Missing values If you want to determine if a value is missing, use `is.na()`: ```r x = NA is.na(x) ``` ``` ## [1] TRUE ``` `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly: ```r filter(flights, dep_delay > 400 | is.na(dep_delay)) ``` ``` ## # A tibble: 8,390 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 848 1835 853 1001 ## 2 2013 1 1 NA 1630 NA NA ## 3 2013 1 1 NA 1935 NA NA ## 4 2013 1 1 NA 1500 NA NA ## 5 2013 1 1 NA 600 NA NA ## 6 2013 1 2 NA 1540 NA NA ## 7 2013 1 2 NA 1620 NA NA ## 8 2013 1 2 NA 1355 NA NA ## 9 2013 1 2 NA 1420 NA NA ## 10 2013 1 2 NA 1321 NA NA ## # … with 8,380 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- # `filter()` excercises 1. Find all flights that: * Had an arrival delay of two or more hours * Flew to Houston (IAH or HOU) * Were operated by United, American, or Delta * Departed in summer (July, August, and September) * Arrived more than two hours late, but didn’t leave late * Were delayed by at least an hour, but made up over 30 minutes in flight * Departed between midnight and 6am (inclusive) 2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges? 3. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent? --- # `arrange()` `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. ```r arrange(flights, year, month, day) ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- # `desc()` Use `desc()` to re-order by a column in descending order: ```r arrange(flights, desc(dep_delay)) ``` ``` ## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 9 641 900 1301 1242 ## 2 2013 6 15 1432 1935 1137 1607 ## 3 2013 1 10 1121 1635 1126 1239 ## 4 2013 9 20 1139 1845 1014 1457 ## 5 2013 7 22 845 1600 1005 1044 ## 6 2013 4 10 1100 1900 960 1342 ## 7 2013 3 17 2321 810 911 135 ## 8 2013 6 27 959 1900 899 1236 ## 9 2013 7 22 2257 759 898 121 ## 10 2013 12 5 756 1700 896 1058 ## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- # `arrange()` excercises 1. Sort flights to find the most delayed flights. Find the flights that left earliest. 2. Sort flights to find the fastest flights. 3. Which flights travelled the longest? Which travelled the shortest? 4. How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`). --- # `select()` * Some datasets have many columns * `select()` allows you to rapidly zoom in on a useful a subset ```r select(flights, year, month, day) ``` ``` ## # A tibble: 336,776 x 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # … with 336,766 more rows ``` --- # `select()` ```r select(flights, year:day) ``` ``` ## # A tibble: 336,776 x 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # … with 336,766 more rows ``` --- # `select()` ```r select(flights, -(year:day)) ``` ``` ## # A tibble: 336,776 x 16 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay ## <int> <int> <dbl> <int> <int> <dbl> ## 1 517 515 2 830 819 11 ## 2 533 529 4 850 830 20 ## 3 542 540 2 923 850 33 ## 4 544 545 -1 1004 1022 -18 ## 5 554 600 -6 812 837 -25 ## 6 554 558 -4 740 728 12 ## 7 555 600 -5 913 854 19 ## 8 557 600 -3 709 723 -14 ## 9 557 600 -3 838 846 -8 ## 10 558 600 -2 753 745 8 ## # … with 336,766 more rows, and 10 more variables: carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- # Helper functions There are a number of helper functions you can use within `select()`: * `starts_with("abc")`: matches names that begin with “abc”. * `ends_with("xyz")`: matches names that end with “xyz”. * `contains("ijk")`: matches names that contain “ijk”. * `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters. * `num_range("trial", 1:3)`: matches trial1, trial2 and trial3. See `?select` for more details. --- # `select()` excercises 1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. 2. What happens if you include the name of a variable multiple times in a `select()` call? 3. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector? ```r vars <- c("year", "month", "day", "dep_delay", "arr_delay") ``` 4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default? ```contains select(flights, contains("TIME")) ``` --- # `mutate()` * `mutate()` creates new columns that are functions of existing columns. Useful! * `mutate()` always adds columns to the end of the dataset. So let's make a smaller dataset first. ```r flights_sml = select(flights, ends_with("delay"), distance, air_time ) flights_sml ``` ``` ## # A tibble: 336,776 x 4 ## dep_delay arr_delay distance air_time ## <dbl> <dbl> <dbl> <dbl> ## 1 2 11 1400 227 ## 2 4 20 1416 227 ## 3 2 33 1089 160 ## 4 -1 -18 1576 183 ## 5 -6 -25 762 116 ## 6 -4 12 719 150 ## 7 -5 19 1065 158 ## 8 -3 -14 229 53 ## 9 -3 -8 944 140 ## 10 -2 8 733 138 ## # … with 336,766 more rows ``` --- # `mutate()` ```r mutate(flights_sml, gain = dep_delay - arr_delay, speed = (distance / air_time) * 60 ) ``` ``` ## # A tibble: 336,776 x 6 ## dep_delay arr_delay distance air_time gain speed ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 11 1400 227 -9 370. ## 2 4 20 1416 227 -16 374. ## 3 2 33 1089 160 -31 408. ## 4 -1 -18 1576 183 17 517. ## 5 -6 -25 762 116 19 394. ## 6 -4 12 719 150 -16 288. ## 7 -5 19 1065 158 -24 404. ## 8 -3 -14 229 53 11 259. ## 9 -3 -8 944 140 5 405. ## 10 -2 8 733 138 -10 319. ## # … with 336,766 more rows ``` --- # `mutate()` Note that you can refer to columns that you’ve just created: ```r mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` ``` ## # A tibble: 336,776 x 7 ## dep_delay arr_delay distance air_time gain hours gain_per_hour ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 11 1400 227 -9 3.78 -2.38 ## 2 4 20 1416 227 -16 3.78 -4.23 ## 3 2 33 1089 160 -31 2.67 -11.6 ## 4 -1 -18 1576 183 17 3.05 5.57 ## 5 -6 -25 762 116 19 1.93 9.83 ## 6 -4 12 719 150 -16 2.5 -6.4 ## 7 -5 19 1065 158 -24 2.63 -9.11 ## 8 -3 -14 229 53 11 0.883 12.5 ## 9 -3 -8 944 140 5 2.33 2.14 ## 10 -2 8 733 138 -10 2.3 -4.35 ## # … with 336,766 more rows ``` --- # Useful functions with `mutate()` * Arithmetic operators `+`, `-`, `*`, `/`, `^`. These are all vectorised, using the so called “recycling rules”. E.g `air_time / 60` * Modular arithmetic: `%/%` (integer division) and `%%` (remainder) Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from `dep_time` with: ```r transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100 ) ``` ``` ## # A tibble: 336,776 x 3 ## dep_time hour minute ## <int> <dbl> <dbl> ## 1 517 5 17 ## 2 533 5 33 ## 3 542 5 42 ## 4 544 5 44 ## 5 554 5 54 ## 6 554 5 54 ## 7 555 5 55 ## 8 557 5 57 ## 9 557 5 57 ## 10 558 5 58 ## # … with 336,766 more rows ``` --- # `mutate()` excercises 1. Create some of your own new columns using the `mutate()` function. 2. Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. 3. Compare air_time with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it? 4. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related? 5. What does `1:3 + 1:10` return? Why? --- # `summarise()` * The last key verb is `summarise()`. It collapses a data frame to a single row: ```r summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 1 x 1 ## delay ## <dbl> ## 1 12.6 ``` --- # `summarise()` `summarise()` is not terribly useful unless we pair it with `group_by()`. ```r by_month = group_by(flights, month) summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 12 x 2 ## month delay ## <int> <dbl> ## 1 1 10.0 ## 2 2 10.8 ## 3 3 13.2 ## 4 4 13.9 ## 5 5 13.0 ## 6 6 20.8 ## 7 7 21.7 ## 8 8 12.6 ## 9 9 6.72 ## 10 10 6.24 ## 11 11 5.44 ## 12 12 16.6 ``` --- # Combining multiple operations Imagine that we want to explore the relationship between the distance and average delay for each destination. ```r by_dest = group_by(flights, dest) delay = summarise(by_dest, total = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay = filter(delay, total > 20, dest != "HNL") ``` There are three steps to prepare this data: * Group flights by destination. * Summarise to compute distance, average delay, and number of flights. * Filter to remove noisy points and Honolulu airport This code is a little frustrating to write! --- # Pipe! `%>%` There’s another way to tackle the same problem with the pipe, `%>%`: ```r delays = flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") ``` * Easier to read * No unnecessary data frames * Pronouce `%>%` as "then" --- # Missing values You may have wondered about the `na.rm` argument we used earlier. What happens if we don’t set it? ```r flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay)) ``` ``` ## # A tibble: 365 x 4 ## # Groups: year, month [12] ## year month day mean ## <int> <int> <int> <dbl> ## 1 2013 1 1 NA ## 2 2013 1 2 NA ## 3 2013 1 3 NA ## 4 2013 1 4 NA ## 5 2013 1 5 NA ## 6 2013 1 6 NA ## 7 2013 1 7 NA ## 8 2013 1 8 NA ## 9 2013 1 9 NA ## 10 2013 1 10 NA ## # … with 355 more rows ``` --- # Missing values Luckily, we can use the `na.rm` to tell `mean()` to ignore any `NA`s in the calculation. ```r flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay, na.rm = TRUE)) ``` ``` ## # A tibble: 365 x 4 ## # Groups: year, month [12] ## year month day mean ## <int> <int> <int> <dbl> ## 1 2013 1 1 11.5 ## 2 2013 1 2 13.9 ## 3 2013 1 3 11.0 ## 4 2013 1 4 8.95 ## 5 2013 1 5 5.73 ## 6 2013 1 6 7.15 ## 7 2013 1 7 5.42 ## 8 2013 1 8 2.55 ## 9 2013 1 9 2.28 ## 10 2013 1 10 2.84 ## # … with 355 more rows ``` --- # Missing values Or, we could filter the dataset to remove the `NA`s before doing the calculation. ```r not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled ``` ``` ## # A tibble: 327,346 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## # … with 327,336 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` --- # Counts To count we can use `n()`. ```r flights %>% group_by(dest) %>% summarise(total = n()) ``` Note: Counts are so useful that dplyr provides a simple shortcut for counting. ```r flights %>% count(dest) ``` You can also provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew. ```r not_cancelled %>% count(tailnum, wt = distance) ``` --- # Averages * `mean()`, `median()` What is the average flight delay on each day? ```r not_cancelled %>% group_by(year, month, day) %>% summarise(avg_delay = mean(arr_delay)) ``` ``` ## # A tibble: 365 x 4 ## # Groups: year, month [12] ## year month day avg_delay ## <int> <int> <int> <dbl> ## 1 2013 1 1 12.7 ## 2 2013 1 2 12.7 ## 3 2013 1 3 5.73 ## 4 2013 1 4 -1.93 ## 5 2013 1 5 -1.53 ## 6 2013 1 6 4.24 ## 7 2013 1 7 -4.95 ## 8 2013 1 8 -3.23 ## 9 2013 1 9 -0.264 ## 10 2013 1 10 -5.90 ## # … with 355 more rows ``` --- # Spread * `sd()`, `IQR()`, `mad()` Why is distance to some destinations more variable than to others? ```r not_cancelled %>% group_by(dest) %>% summarise(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) ``` ``` ## # A tibble: 104 x 2 ## dest distance_sd ## <chr> <dbl> ## 1 EGE 10.5 ## 2 SAN 10.4 ## 3 SFO 10.2 ## 4 HNL 10.0 ## 5 SEA 9.98 ## 6 LAS 9.91 ## 7 PDX 9.87 ## 8 PHX 9.86 ## 9 LAX 9.66 ## 10 IND 9.46 ## # … with 94 more rows ``` --- ## Measures of Rank * `min(x)`, `max(x)`, `quantile(x, 0.25)` When do the first and last flights leave each day? ```r not_cancelled %>% group_by(year, month, day) %>% summarise( first = min(dep_time), last = max(dep_time) ) ``` ``` ## # A tibble: 365 x 5 ## # Groups: year, month [12] ## year month day first last ## <int> <int> <int> <int> <int> ## 1 2013 1 1 517 2356 ## 2 2013 1 2 42 2354 ## 3 2013 1 3 32 2349 ## 4 2013 1 4 25 2358 ## 5 2013 1 5 14 2357 ## 6 2013 1 6 16 2355 ## 7 2013 1 7 49 2359 ## 8 2013 1 8 454 2351 ## 9 2013 1 9 2 2252 ## 10 2013 1 10 3 2320 ## # … with 355 more rows ``` --- # `summarise()` excercises 1. Calculate the total number of flights which occured for each month. 2. In which month was the variability of arrival delay the highest? <!-- July--> 3. Which carrier has the worst delays? 4. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` without using `count()`. (Hint - think about `group_by()`). 5. What does the sort argument to `count()` do? When might you use it?