Data Transformation

class: center, middle, inverse, title-slide

# Data Transformation
### (<a href="https://twitter.com/RLadiesLancs">@RLadiesLancs</a>)
### 2019-06-05

---

layout: true

<div class = "rladies-header">
<span class="social"><table><tr><td><img src="images/twitter.gif"/></td><td> @RLadiesLancs</td></tr></table></span>
</div>

---

# Tonight

* 18:00 - 18:50 - `filter()`, `arrange()` and `select()`
* 18:50 - 19:15 - Nibbles and networking 🧀
* 19:15 - 20:00 - `mutate()` and `summarise()`

### Slides: [github.com/rladies](https://github.com/rladies/)

### Book chapter: [r4ds.had.co.nz](https://r4ds.had.co.nz/transform.html)

### Slack: [bit.ly/R4DSslack](https://bit.ly/R4DSslack )

---
# dplyr and Tidyverse

To access the datasets, help pages, and functions that we will use tonight, load the tidyverse by running this code:

```r
install.packages("tidyverse")
```

```r
library("tidyverse")
```

---

# The Data ✈️

* All flights that departed from New York City in 2013.

```r
library("nycflights13")
```

Typing `flights` will print out the data in our console.

```r
flights
```

Using a question mark opens the help page.

```r
?flights
```

Using the `View()` function opens the data in a new tab.

```r
View(flights)
```

---

# The big 5 dplyr verbs

* Pick observations by their values: `filter()`.
* Reorder the rows: `arrange()`.
* Pick variables by their names: `select()`.
* Create new variables with functions of existing variables: `mutate()`.
* Collapse many values down to a single summary: `summarise()`.

### `function(dataframe, options)`

---

# `filter()`

`filter()` allows you to subset observations based on their values.

Let's say we want to select all the rows for for flights which departed on 1st January.

```r
filter(flights, month == 1, day == 1)
```

```
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

```r
jan1 = filter(flights, month == 1, day == 1)
```

---

# Making comparisons

| Symbol | Name |
|--------|--------------------------|
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| == | is equal to |
| != | is not equal to |

Careful!

```r
filter(flights, month = 1)
```

```
## `month` (`month = 1`) must not be named, do you need `==`?
```

---

# Logical operators

| Symbol | Name |
|--------|------|
| &      | and |
| I      | or |
| !      | not |

Let's find all the flights that occured in Autumn 🍁.

```r
filter(flights, month == 9 | month == 10 |  month == 11 )
```

```r
autumn <- filter(flights, month %in% c(10, 11, 12))
```

---

# Missing values

If you want to determine if a value is missing, use `is.na()`:

```r
x = NA
is.na(x)
```

```
## [1] TRUE
```

`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:

```r
filter(flights, dep_delay > 400 | is.na(dep_delay))
```

```
## # A tibble: 8,390 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      848           1835       853     1001
##  2  2013     1     1       NA           1630        NA       NA
##  3  2013     1     1       NA           1935        NA       NA
##  4  2013     1     1       NA           1500        NA       NA
##  5  2013     1     1       NA            600        NA       NA
##  6  2013     1     2       NA           1540        NA       NA
##  7  2013     1     2       NA           1620        NA       NA
##  8  2013     1     2       NA           1355        NA       NA
##  9  2013     1     2       NA           1420        NA       NA
## 10  2013     1     2       NA           1321        NA       NA
## # … with 8,380 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

---

# `filter()` excercises

1. Find all flights that:
  * Had an arrival delay of two or more hours
  * Flew to Houston (IAH or HOU)
  * Were operated by United, American, or Delta
  * Departed in summer (July, August, and September)
  * Arrived more than two hours late, but didn’t leave late
  * Were delayed by at least an hour, but made up over 30 minutes in flight
  * Departed between midnight and 6am (inclusive)

2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

3. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?

---

# `arrange()`

`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.

```r
arrange(flights, year, month, day)
```

```
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

---

# `desc()`
Use `desc()` to re-order by a column in descending order:

```r
arrange(flights, desc(dep_delay))
```

```
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```

---
# `arrange()` excercises

1. Sort flights to find the most delayed flights. Find the flights that left earliest.

2. Sort flights to find the fastest flights.

3. Which flights travelled the longest? Which travelled the shortest?

4. How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).

---

# `select()`

* Some datasets have many columns
* `select()` allows you to rapidly zoom in on a useful a subset

```r
select(flights, year, month, day)
```

```
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows
```

---
# `select()`

```r
select(flights, year:day)
```

```r
select(flights, -(year:day))
```

```
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # … with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

# Helper functions

There are a number of helper functions you can use within `select()`:

* `starts_with("abc")`: matches names that begin with “abc”.

* `ends_with("xyz")`: matches names that end with “xyz”.

* `contains("ijk")`: matches names that contain “ijk”.

* `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters.

* `num_range("trial", 1:3)`: matches trial1, trial2 and trial3.

See `?select` for more details.

---

# `select()` excercises

1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

2. What happens if you include the name of a variable multiple times in a `select()` call?

3. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
  
  ```r
  vars <- c("year", "month", "day", "dep_delay", "arr_delay")
  ```

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

```contains
  select(flights, contains("TIME"))
  ```

---

# `mutate()`

* `mutate()` creates new columns that are functions of existing columns. Useful!

* `mutate()` always adds columns to the end of the dataset. So let's make a smaller dataset first.

```r
flights_sml = select(flights, 
  ends_with("delay"), 
  distance, 
  air_time
)

flights_sml
```

```
## # A tibble: 336,776 x 4
##    dep_delay arr_delay distance air_time
##        <dbl>     <dbl>    <dbl>    <dbl>
##  1         2        11     1400      227
##  2         4        20     1416      227
##  3         2        33     1089      160
##  4        -1       -18     1576      183
##  5        -6       -25      762      116
##  6        -4        12      719      150
##  7        -5        19     1065      158
##  8        -3       -14      229       53
##  9        -3        -8      944      140
## 10        -2         8      733      138
## # … with 336,766 more rows
```

---
# `mutate()`

```r
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = (distance / air_time) * 60
)
```

```
## # A tibble: 336,776 x 6
##    dep_delay arr_delay distance air_time  gain speed
##        <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1         2        11     1400      227    -9  370.
##  2         4        20     1416      227   -16  374.
##  3         2        33     1089      160   -31  408.
##  4        -1       -18     1576      183    17  517.
##  5        -6       -25      762      116    19  394.
##  6        -4        12      719      150   -16  288.
##  7        -5        19     1065      158   -24  404.
##  8        -3       -14      229       53    11  259.
##  9        -3        -8      944      140     5  405.
## 10        -2         8      733      138   -10  319.
## # … with 336,766 more rows
```

---

# `mutate()`

Note that you can refer to columns that you’ve just created:

```r
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

```
## # A tibble: 336,776 x 7
##    dep_delay arr_delay distance air_time  gain hours gain_per_hour
##        <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>         <dbl>
##  1         2        11     1400      227    -9 3.78          -2.38
##  2         4        20     1416      227   -16 3.78          -4.23
##  3         2        33     1089      160   -31 2.67         -11.6 
##  4        -1       -18     1576      183    17 3.05           5.57
##  5        -6       -25      762      116    19 1.93           9.83
##  6        -4        12      719      150   -16 2.5           -6.4 
##  7        -5        19     1065      158   -24 2.63          -9.11
##  8        -3       -14      229       53    11 0.883         12.5 
##  9        -3        -8      944      140     5 2.33           2.14
## 10        -2         8      733      138   -10 2.3           -4.35
## # … with 336,766 more rows
```

---
# Useful functions with `mutate()`

* Arithmetic  operators `+`, `-`, `*`, `/`, `^`. These are all vectorised, using the so called “recycling rules”.  E.g  `air_time / 60`

* Modular arithmetic: `%/%` (integer division) and `%%` (remainder) Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from `dep_time` with:

```r
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
```

```
## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # … with 336,766 more rows
```

---

# `mutate()` excercises

1. Create some of your own new columns using the `mutate()` function.

2. Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

3. Compare air_time with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it?

4. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?

5. What does `1:3 + 1:10` return? Why?

---

# `summarise()`

* The last key verb is `summarise()`. It collapses a data frame to a single row:

```r
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6
```

---
# `summarise()`

`summarise()` is not terribly useful unless we pair it with `group_by()`.

```r
by_month = group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 12 x 2
##    month delay
##    <int> <dbl>
##  1     1 10.0 
##  2     2 10.8 
##  3     3 13.2 
##  4     4 13.9 
##  5     5 13.0 
##  6     6 20.8 
##  7     7 21.7 
##  8     8 12.6 
##  9     9  6.72
## 10    10  6.24
## 11    11  5.44
## 12    12 16.6
```

---

# Combining multiple operations

Imagine that we want to explore the relationship between the distance and average delay for each destination.

```r
by_dest = group_by(flights, dest)
delay = summarise(by_dest,
  total = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay = filter(delay, total > 20, dest != "HNL")
```
There are three steps to prepare this data:

* Group flights by destination.
* Summarise to compute distance, average delay, and number of flights.
* Filter to remove noisy points and Honolulu airport

This code is a little frustrating to write!

---

# Pipe! `%>%`
There’s another way to tackle the same problem with the pipe, `%>%`:

```r
delays = flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
```

* Easier to read
* No unnecessary data frames
* Pronouce `%>%` as "then"
---

# Missing values

You may have wondered about the `na.rm` argument we used earlier.
What happens if we don’t set it?

```r
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
```

```
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # … with 355 more rows
```

---

# Missing values

Luckily, we can use the `na.rm` to tell `mean()` to ignore any `NA`s in the calculation.

```r
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows
```

---

# Missing values

Or, we could filter the dataset to remove the `NA`s before doing the calculation.

```r
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled
```

```
## # A tibble: 327,346 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 327,336 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
```
---

# Counts
To count we can use `n()`.

```r
flights %>% 
  group_by(dest) %>% 
  summarise(total = n())
```

Note: Counts are so useful that dplyr provides a simple shortcut for counting.

```r
flights %>%
  count(dest)
```

You can also provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew.

```r
not_cancelled %>% 
  count(tailnum, wt = distance)
```

---
# Averages

* `mean()`, `median()`

What is the average flight delay on each day?

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(avg_delay = mean(arr_delay))
```

```
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day avg_delay
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    12.7  
##  2  2013     1     2    12.7  
##  3  2013     1     3     5.73 
##  4  2013     1     4    -1.93 
##  5  2013     1     5    -1.53 
##  6  2013     1     6     4.24 
##  7  2013     1     7    -4.95 
##  8  2013     1     8    -3.23 
##  9  2013     1     9    -0.264
## 10  2013     1    10    -5.90 
## # … with 355 more rows
```

---

# Spread

* `sd()`, `IQR()`, `mad()`

Why is distance to some destinations more variable than to others?

```r
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))
```

```
## # A tibble: 104 x 2
##    dest  distance_sd
##    <chr>       <dbl>
##  1 EGE         10.5 
##  2 SAN         10.4 
##  3 SFO         10.2 
##  4 HNL         10.0 
##  5 SEA          9.98
##  6 LAS          9.91
##  7 PDX          9.87
##  8 PHX          9.86
##  9 LAX          9.66
## 10 IND          9.46
## # … with 94 more rows
```
---

## Measures of Rank

* `min(x)`, `max(x)`, `quantile(x, 0.25)`

When do the first and last flights leave each day?

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
```

```
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # … with 355 more rows
```

---

# `summarise()` excercises

1. Calculate the total number of flights which occured for each month.

2. In which month was the variability of arrival delay the highest?

3. Which carrier has the worst delays?

4. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` without using `count()`. (Hint - think about `group_by()`).

5. What does the sort argument to `count()` do? When might you use it?