class: title-slide, center, middle

# Data Wrangling with {dplyr}

---
background-image: url(images/hex/tidyverse.png)
background-position: 90% 5%
background-size: 10%

# Tidyverse

There are a few key ideas about how the tidyverse works in general that are worth knowing before we dive into `{dplyr}`:

1. Packages are designed to be like **grammars** for their task. You can string these grammatical elements together to form more complex statements, just like with language.

1. The first argument of (almost) every function is the data (called `.data` in `{dplyr}`). This is very handy, especially when it comes to piping with `%>%`.

1. Variable names are usually not quoted (read more [here](https://tidyselect.r-lib.org/reference/language.html)).
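
To make ideas 2 and 3 concrete, here is a minimal sketch. It uses the built-in `mtcars` data purely for illustration, and the object names `nested` and `piped` are made up for this example. Because the data come first, a nested call and a piped call do the same thing, and the column name is written unquoted.

```r
library(dplyr)

# Data as the first argument, column name unquoted
nested <- filter(mtcars, cyl == 4)

# The pipe passes the left-hand side in as that first argument
piped <- mtcars %>% filter(cyl == 4)

# Both calls run the same operation on the same data
identical(nested, piped)
```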

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# Star Wars

```r
starwars
```

```
## # A tibble: 87 × 6
##    name               height  mass hair_color    eye_color species
##    <chr>               <int> <dbl> <chr>         <chr>     <chr>  
##  1 Luke Skywalker        172    77 blond         blue      Human  
##  2 C-3PO                 167    75 <NA>          yellow    Droid  
##  3 R2-D2                  96    32 <NA>          red       Droid  
##  4 Darth Vader           202   136 none          yellow    Human  
##  5 Leia Organa           150    49 brown         brown     Human  
##  6 Owen Lars             178   120 brown, grey   blue      Human  
##  7 Beru Whitesun lars    165    75 brown         blue      Human  
##  8 R5-D4                  97    32 <NA>          red       Droid  
##  9 Biggs Darklighter     183    84 black         brown     Human  
## 10 Obi-Wan Kenobi        182    77 auburn, white blue-gray Human  
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# {dplyr}

`{dplyr}` is a grammar of data manipulation, providing a consistent set of core verbs that help you solve the most common data manipulation challenges.

--
***

**Manipulating observations**

+ `filter()` picks cases based on their values.

+ `arrange()` changes the ordering of the rows.

--
***

**Manipulating variables**

+ `select()` picks variables based on their names.

+ `mutate()` adds new variables that are functions of existing variables.

--
***

**Summarizing data**

+ `summarize()` reduces multiple values down to single summary statistics.
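
As a rough preview of how these verbs chain together like a grammar, here is a sketch on `starwars`. The column `bmi` and the object names `humans` and `humans_summary` are hypothetical, introduced only for this illustration.

```r
library(dplyr)

humans <- starwars %>%
  filter(species == "Human") %>%             # keep some rows
  select(name, height, mass) %>%             # keep some columns
  mutate(bmi = mass / (height / 100)^2) %>%  # add a column (height is in cm)
  arrange(desc(bmi))                         # reorder the rows

# Collapse the result to a single summary row
humans_summary <- humans %>%
  summarize(mean_bmi = mean(bmi, na.rm = TRUE))
```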

---
class: inverse, center, middle

# Manipulating observations <br> (rows)

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `filter()`

### Subset observations (rows) with `filter()`

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `filter()`

.footnote[Artwork by [@allison_horst](https://twitter.com/allison_horst)]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `filter()`

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &lt; </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x &lt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &lt;= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x &lt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt; </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x &gt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt;= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x &gt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &amp; </td>
   <td style="text-align:left;"> and </td>
   <td style="text-align:left;"> x == a &amp; y == b </td>
  </tr>
  <tr>
   <td style="text-align:left;"> | </td>
   <td style="text-align:left;"> or </td>
   <td style="text-align:left;"> x == a | y == b </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;"> group membership </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
</tbody>
</table>
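
A small sketch of what these operators return when applied to a plain vector (the vector `x` is made up for illustration). Each one produces a logical vector with one element per input value, which is exactly what `filter()` works with.

```r
x <- c(1, 5, NA, 10)

x > 4              # FALSE  TRUE    NA  TRUE
x %in% c(1, 10)    #  TRUE FALSE FALSE  TRUE
is.na(x)           # FALSE FALSE  TRUE FALSE
x > 4 & !is.na(x)  # FALSE  TRUE FALSE  TRUE
```

Note that comparing against `NA` gives `NA`, not `FALSE`.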

.footnote[Source: [Alison Hill](https://share-blogdown.netlify.app/slides/02-slides.html#15)]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `filter()`

.panelset[

.panel[.panel-name[Arguments]

```r
filter(.data, ...)
```

`.data` is a data frame or tibble

`...` is one or more expressions that return a logical value and are defined in terms of the variables in `.data`. If multiple expressions are included, they are combined with the `&` operator. Only rows for which all conditions evaluate to `TRUE` are kept.
]

.panel[.panel-name[Example 1]

```r
starwars %>% 
  filter(species == "Human" & eye_color != "blue")
```

```
## # A tibble: 23 × 6
##    name              height  mass hair_color    eye_color species
##    <chr>              <int> <dbl> <chr>         <chr>     <chr>  
##  1 Darth Vader          202 136   none          yellow    Human  
##  2 Leia Organa          150  49   brown         brown     Human  
##  3 Biggs Darklighter    183  84   black         brown     Human  
##  4 Obi-Wan Kenobi       182  77   auburn, white blue-gray Human  
##  5 Han Solo             180  80   brown         brown     Human  
##  6 Wedge Antilles       170  77   brown         hazel     Human  
##  7 Palpatine            170  75   grey          yellow    Human  
##  8 Boba Fett            183  78.2 black         brown     Human  
##  9 Lando Calrissian     177  79   black         brown     Human  
## 10 Arvel Crynyd          NA  NA   brown         brown     Human  
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Example 2]

```r
starwars %>% 
  filter(species %in% c("Wookiee", "Ewok"))
```

```
## # A tibble: 3 × 6
##   name                  height  mass hair_color eye_color species
##   <chr>                  <int> <dbl> <chr>      <chr>     <chr>  
## 1 Chewbacca                228   112 brown      blue      Wookiee
## 2 Wicket Systri Warrick     88    20 brown      brown     Ewok   
## 3 Tarfful                  234   136 brown      blue      Wookiee
```
]

.panel[.panel-name[Example 3]

```r
starwars %>% 
  filter(!is.na(hair_color))
```

```
## # A tibble: 82 × 6
##    name               height  mass hair_color    eye_color species
##    <chr>               <int> <dbl> <chr>         <chr>     <chr>  
##  1 Luke Skywalker        172    77 blond         blue      Human  
##  2 Darth Vader           202   136 none          yellow    Human  
##  3 Leia Organa           150    49 brown         brown     Human  
##  4 Owen Lars             178   120 brown, grey   blue      Human  
##  5 Beru Whitesun lars    165    75 brown         blue      Human  
##  6 Biggs Darklighter     183    84 black         brown     Human  
##  7 Obi-Wan Kenobi        182    77 auburn, white blue-gray Human  
##  8 Anakin Skywalker      188    84 blond         blue      Human  
##  9 Wilhuff Tarkin        180    NA auburn, grey  blue      Human  
## 10 Chewbacca             228   112 brown         blue      Wookiee
## # … with 72 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

.panel[.panel-name[Example 4]

```r
starwars %>% 
  filter(height <= 100)
```

```
## # A tibble: 7 × 6
##   name                  height  mass hair_color eye_color species       
##   <chr>                  <int> <dbl> <chr>      <chr>     <chr>         
## 1 R2-D2                     96    32 <NA>       red       Droid         
## 2 R5-D4                     97    32 <NA>       red       Droid         
## 3 Yoda                      66    17 white      brown     Yoda's species
## 4 Wicket Systri Warrick     88    20 brown      brown     Ewok          
## 5 Dud Bolt                  94    45 none       yellow    Vulptereen    
## 6 Ratts Tyerell             79    15 none       unknown   Aleena        
## 7 R4-P17                    96    NA none       red, blue Droid
```
]

]
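
One consequence of "only rows that evaluate to `TRUE` are kept" is that rows where the condition is `NA` are dropped too. A minimal sketch, using a made-up tibble `df` (the names `kept` and `kept_with_na` are also hypothetical):

```r
library(dplyr)

df <- tibble(id = 1:3, score = c(10, NA, 30))

# The row with a missing score is dropped: NA > 5 is NA, not TRUE
kept <- df %>% filter(score > 5)

# Ask for missing values explicitly if you want to keep them
kept_with_na <- df %>% filter(score > 5 | is.na(score))
```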

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `arrange()`

### Arrange rows by column values with `arrange()`

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `arrange()`

.panelset[

.panel[.panel-name[Arguments]

```r
arrange(.data, ...)
```

`.data` is a data frame or tibble

`...` are the variables to sort by. Use `desc()` to sort a variable in descending order.
]

.panel[.panel-name[Example 1]

```r
starwars %>% 
  arrange(mass)
```

```
## # A tibble: 87 × 6
##    name                  height  mass hair_color eye_color species       
##    <chr>                  <int> <dbl> <chr>      <chr>     <chr>         
##  1 Ratts Tyerell             79    15 none       unknown   Aleena        
##  2 Yoda                      66    17 white      brown     Yoda's species
##  3 Wicket Systri Warrick     88    20 brown      brown     Ewok          
##  4 R2-D2                     96    32 <NA>       red       Droid         
##  5 R5-D4                     97    32 <NA>       red       Droid         
##  6 Sebulba                  112    40 none       orange    Dug           
##  7 Dud Bolt                  94    45 none       yellow    Vulptereen    
##  8 Padmé Amidala            165    45 brown      brown     Human         
##  9 Wat Tambor               193    48 none       unknown   Skakoan       
## 10 Sly Moore                178    48 none       white     <NA>          
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Example 2]

```r
starwars %>% 
  arrange(desc(mass))
```

```
## # A tibble: 87 × 6
##    name                  height  mass hair_color  eye_color     species   
##    <chr>                  <int> <dbl> <chr>       <chr>         <chr>     
##  1 Jabba Desilijic Tiure    175  1358 <NA>        orange        Hutt      
##  2 Grievous                 216   159 none        green, yellow Kaleesh   
##  3 IG-88                    200   140 none        red           Droid     
##  4 Darth Vader              202   136 none        yellow        Human     
##  5 Tarfful                  234   136 brown       blue          Wookiee   
##  6 Owen Lars                178   120 brown, grey blue          Human     
##  7 Bossk                    190   113 none        red           Trandoshan
##  8 Chewbacca                228   112 brown       blue          Wookiee   
##  9 Jek Tono Porkins         180   110 brown       blue          Human     
## 10 Dexter Jettster          198   102 none        yellow        Besalisk  
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]
]
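
`arrange()` also accepts several sorting variables at once. A minimal sketch on a made-up tibble `df` (the name `sorted` is hypothetical): later variables break ties in earlier ones, and missing values are placed last, even with `desc()`.

```r
library(dplyr)

df <- tibble(
  name  = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  score = c(2, NA, 5, 1)
)

# Sort by group, then by descending score within each group
sorted <- df %>% arrange(group, desc(score))

sorted$name  # "c" "a" "d" "b"
```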

---
class: yourturn

# Your turn 1

1. Convert the data frame `mtcars` to a tibble and assign the resulting object to `data`. *Note.* The data frame `mtcars` is automatically loaded in R; you don't have to install it separately.

1. Filter `data` for cars that have 4 `cyl`. Arrange the resulting observations by descending order of `mpg`.

1. Filter `data` for cars that have a `disp` greater than or equal to 350, an `hp` greater than or equal to 200, and a `qsec` less than or equal to 17.

1. Filter `data` for cars that have `carb` equal to 4 or `cyl` equal to 4. Assign the result to an object called `data_filtered`.

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
# as_tibble() converts an existing data frame to a tibble
data <- as_tibble(mtcars)
```

]

.panel[.panel-name[Q2]

```r
data %>% 
  filter(cyl == 4) %>% 
  arrange(desc(mpg))
```

```
## # A tibble: 11 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
##  2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
##  3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
##  4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
##  5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
##  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
##  7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
##  8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
##  9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
## 10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
## 11  21.4     4 121     109  4.11  2.78  18.6     1     1     4     2
```

]

.panel[.panel-name[Q3]

```r
data %>% 
  filter(disp >= 350, hp >= 200, qsec <= 17)
```

```
## # A tibble: 3 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
## 2  13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
## 3  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
```

Equivalently, the conditions can be combined explicitly with `&`:

```r
data %>% 
  filter(disp >= 350 & hp >= 200 & qsec <= 17)
```

]

.panel[.panel-name[Q4]

```r
data_filtered <- data %>% 
  filter(carb == 4 | cyl == 4)

data_filtered
```

```
## # A tibble: 21 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
##  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
##  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
##  9  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
## 10  10.4     8  460    215  3     5.42  17.8     0     0     3     4
## # … with 11 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]
]

---
class: inverse, center, middle

# Manipulating variables <br> (columns)

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `select()`

### Select columns with `select()`

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `select()`

.panelset[

.panel[.panel-name[Arguments]

```r
select(.data, ...)
```

`.data` is a data frame or tibble

`...` is one or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like `x:y` can be used to select a range of variables.
]

.panel[.panel-name[Example 1]

```r
starwars %>%
  select(name, mass)
```

```
## # A tibble: 87 × 2
##    name                mass
##    <chr>              <dbl>
##  1 Luke Skywalker        77
##  2 C-3PO                 75
##  3 R2-D2                 32
##  4 Darth Vader          136
##  5 Leia Organa           49
##  6 Owen Lars            120
##  7 Beru Whitesun lars    75
##  8 R5-D4                 32
##  9 Biggs Darklighter     84
## 10 Obi-Wan Kenobi        77
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Example 2]

```r
starwars %>%
  select(height:hair_color)
```

```
## # A tibble: 87 × 3
##    height  mass hair_color   
##     <int> <dbl> <chr>        
##  1    172    77 blond        
##  2    167    75 <NA>         
##  3     96    32 <NA>         
##  4    202   136 none         
##  5    150    49 brown        
##  6    178   120 brown, grey  
##  7    165    75 brown        
##  8     97    32 <NA>         
##  9    183    84 black        
## 10    182    77 auburn, white
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# Selection helpers

Selection helpers work in concert with `select()` to make it easier to select specific groups of variables.

***

Here are some commonly used ones:

`everything()`: Matches all variables.

`last_col()`: Selects the last variable.

`starts_with()`: Starts with a prefix.

`ends_with()`: Ends with a suffix.

`contains()`: Contains a literal string.

`matches()`: Matches a regular expression.
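
A quick sketch of a few of these helpers on a made-up one-row tibble `df` (the column names are chosen only for illustration). Piping into `names()` shows which columns each call keeps.

```r
library(dplyr)

df <- tibble(id = 1, hair_color = "brown", eye_color = "blue", height_cm = 170)

df %>% select(last_col()) %>% names()
# "height_cm"

df %>% select(matches("^h")) %>% names()
# "hair_color" "height_cm"

# everything() is handy for moving a column to the front
df %>% select(eye_color, everything()) %>% names()
# "eye_color" "id" "hair_color" "height_cm"

# Helpers can be negated with a minus sign
df %>% select(-ends_with("color")) %>% names()
# "id" "height_cm"
```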

.footnote[🔗 https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html#overview-of-selection-features]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# Selection helpers

Selection helpers work in concert with `select()` to make it easier to select specific groups of variables.

.panelset[

.panel[.panel-name[Example 1]

```r
starwars %>%
  select(starts_with("h"))
```

```
## # A tibble: 87 × 2
##    height hair_color   
##     <int> <chr>        
##  1    172 blond        
##  2    167 <NA>         
##  3     96 <NA>         
##  4    202 none         
##  5    150 brown        
##  6    178 brown, grey  
##  7    165 brown        
##  8     97 <NA>         
##  9    183 black        
## 10    182 auburn, white
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

.panel[.panel-name[Example 2]

```r
starwars %>%
  select(ends_with("color"))
```

```
## # A tibble: 87 × 2
##    hair_color    eye_color
##    <chr>         <chr>    
##  1 blond         blue     
##  2 <NA>          yellow   
##  3 <NA>          red      
##  4 none          yellow   
##  5 brown         brown    
##  6 brown, grey   blue     
##  7 brown         blue     
##  8 <NA>          red      
##  9 black         brown    
## 10 auburn, white blue-gray
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

.panel[.panel-name[Example 3]

```r
starwars %>%
  select(contains("_"))
```

]

]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `mutate()`

### Create (or overwrite) variables with `mutate()`

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `mutate()`

.center[
<img src="images/dplyr_mutate.png" width="75%" />
]

.footnote[Artwork by [@allison_horst](https://twitter.com/allison_horst)]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `mutate()`

.panelset[

.panel[.panel-name[Arguments]

```r
mutate(.data, ...)
```

`.data` is a data frame or tibble

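`...` are name-value pairs. The name gives the name of the column in the output; the value is an expression, usually computed from existing columns. Reusing an existing name overwrites that column.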

]

.panel[.panel-name[Example 1]

```r
starwars %>%
  mutate(height_in = height * .39)
```

```
## # A tibble: 87 × 7
##    name               height  mass hair_color    eye_color species height_in
##    <chr>               <int> <dbl> <chr>         <chr>     <chr>       <dbl>
##  1 Luke Skywalker        172    77 blond         blue      Human        67.1
##  2 C-3PO                 167    75 <NA>          yellow    Droid        65.1
##  3 R2-D2                  96    32 <NA>          red       Droid        37.4
##  4 Darth Vader           202   136 none          yellow    Human        78.8
##  5 Leia Organa           150    49 brown         brown     Human        58.5
##  6 Owen Lars             178   120 brown, grey   blue      Human        69.4
##  7 Beru Whitesun lars    165    75 brown         blue      Human        64.4
##  8 R5-D4                  97    32 <NA>          red       Droid        37.8
##  9 Biggs Darklighter     183    84 black         brown     Human        71.4
## 10 Obi-Wan Kenobi        182    77 auburn, white blue-gray Human        71.0
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Example 2]

```r
starwars %>%
  mutate(mass_lb = mass * 2.2)
```

```
## # A tibble: 87 × 7
##    name               height  mass hair_color    eye_color species mass_lb
##    <chr>               <int> <dbl> <chr>         <chr>     <chr>     <dbl>
##  1 Luke Skywalker        172    77 blond         blue      Human     169. 
##  2 C-3PO                 167    75 <NA>          yellow    Droid     165  
##  3 R2-D2                  96    32 <NA>          red       Droid      70.4
##  4 Darth Vader           202   136 none          yellow    Human     299. 
##  5 Leia Organa           150    49 brown         brown     Human     108. 
##  6 Owen Lars             178   120 brown, grey   blue      Human     264  
##  7 Beru Whitesun lars    165    75 brown         blue      Human     165  
##  8 R5-D4                  97    32 <NA>          red       Droid      70.4
##  9 Biggs Darklighter     183    84 black         brown     Human     185. 
## 10 Obi-Wan Kenobi        182    77 auburn, white blue-gray Human     169. 
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

.panel[.panel-name[Example 3]

```r
starwars %>%
  mutate(species = tolower(species))
```

```
## # A tibble: 87 × 6
##    name               height  mass hair_color    eye_color species
##    <chr>               <int> <dbl> <chr>         <chr>     <chr>  
##  1 Luke Skywalker        172    77 blond         blue      human  
##  2 C-3PO                 167    75 <NA>          yellow    droid  
##  3 R2-D2                  96    32 <NA>          red       droid  
##  4 Darth Vader           202   136 none          yellow    human  
##  5 Leia Organa           150    49 brown         brown     human  
##  6 Owen Lars             178   120 brown, grey   blue      human  
##  7 Beru Whitesun lars    165    75 brown         blue      human  
##  8 R5-D4                  97    32 <NA>          red       droid  
##  9 Biggs Darklighter     183    84 black         brown     human  
## 10 Obi-Wan Kenobi        182    77 auburn, white blue-gray human  
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```
]

]
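
A single `mutate()` call can create several columns, and later expressions can use columns created earlier in the same call. A minimal sketch on a made-up tibble (the names `df`, `result`, `height_m`, and `bmi` are hypothetical):

```r
library(dplyr)

df <- tibble(height = c(172, 150), mass = c(77, 49))

result <- df %>%
  mutate(
    height_m = height / 100,       # new column: height in metres
    bmi      = mass / height_m^2   # uses the column created just above
  )

result
```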

---
class: yourturn

# Your turn 2

1. In `data`, select only the variables `mpg` and `hp`.

1. As we did with indexing in base R, you can use the minus sign (`-`) to "de-select" columns. Keep everything in `data` except `vs`.

1. Use `mutate()` to convert `cyl` from type "double" to type "factor". *Hint:* You might want to look up the function `as.factor()`.

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
data %>%
  select(mpg, hp)
```

```
## # A tibble: 32 × 2
##      mpg    hp
##    <dbl> <dbl>
##  1  21     110
##  2  21     110
##  3  22.8    93
##  4  21.4   110
##  5  18.7   175
##  6  18.1   105
##  7  14.3   245
##  8  24.4    62
##  9  22.8    95
## 10  19.2   123
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Q2]

```r
data %>%
  select(-vs)
```

```
## # A tibble: 32 × 10
##      mpg   cyl  disp    hp  drat    wt  qsec    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     0     4     4
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

.panel[.panel-name[Q3]

```r
data %>% 
  mutate(cyl = as.factor(cyl))
```

```
## # A tibble: 32 × 11
##      mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21   6      160    110  3.9   2.62  16.5     0     1     4     4
##  2  21   6      160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8 4      108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4 6      258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7 8      360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1 6      225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3 8      360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4 4      147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8 4      141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2 6      168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]
]

---
class: inverse, center, middle

# Summarizing data

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `summarize()`

`summarize()` reduces your raw data frame into a smaller summary data frame that only contains the variables resulting from the **summary functions** that you specify within `summarize()`.

--
***

Summary functions take vectors as inputs and return single values as outputs.

Common examples are `mean()`, `sd()`, `max()`, `min()`, and `sum()`.
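
To see "vector in, single value out" in isolation, here is a sketch on a made-up vector `heights`. It also shows why the examples in this deck pass `na.rm = TRUE`: a single missing value otherwise makes the summary `NA`.

```r
heights <- c(172, 167, NA, 202)

length(heights)              # 4
mean(heights)                # NA
mean(heights, na.rm = TRUE)  # 180.3333
max(heights, na.rm = TRUE)   # 202
```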

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `summarize()`

.panelset[

.panel[.panel-name[Arguments]

```r
summarize(.data, ...)
```

`.data` is a data frame or tibble.

`...` are name-value pairs of summary functions. The names will be the names of the variables in the resulting object.

]

.panel[.panel-name[Example]

```r
starwars %>% 
  summarize(mean_height = mean(height, na.rm = TRUE),
            max_mass    = max(mass, na.rm = TRUE))
```

```
## # A tibble: 1 × 2
##   mean_height max_mass
##         <dbl>    <dbl>
## 1        174.     1358
```

]

]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `group_by()`

`group_by()` creates groups based on one or more variables in the data, and this affects any downstream operations.
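
A minimal sketch of what "affects any downstream operations" means, using a made-up tibble `df` (the names `overall`, `grouped`, and `mean_score` are hypothetical). The same `mutate()` call gives different results depending on whether the data are grouped.

```r
library(dplyr)

df <- tibble(
  team  = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# Ungrouped: one overall mean (8.5) repeated on every row
overall <- df %>% mutate(mean_score = mean(score))

# Grouped: the mean is computed within each team (2, 2, 15, 15)
grouped <- df %>%
  group_by(team) %>%
  mutate(mean_score = mean(score))

# ungroup() removes the grouping once it is no longer needed
grouped %>% ungroup()
```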

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `group_by()`

What happens if we combine `group_by()` and `summarize()`?
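
In short, you get one summary row per group. A small sketch on a made-up tibble `df` (the names `by_team` and `by_team_season` are hypothetical):

```r
library(dplyr)

df <- tibble(
  team   = c("a", "a", "b", "b"),
  season = c(1, 2, 1, 1),
  score  = c(1, 3, 10, 20)
)

# One grouping variable: one row per team
by_team <- df %>%
  group_by(team) %>%
  summarize(mean_score = mean(score))

# Two grouping variables: one row per team/season combination.
# By default the result stays grouped by `team`;
# `.groups = "drop"` returns an ungrouped tibble instead.
by_team_season <- df %>%
  group_by(team, season) %>%
  summarize(mean_score = mean(score), .groups = "drop")
```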

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# `summarize()`

Let's see a couple of examples of how we can combine `group_by()` and `summarize()`:

.panelset[

.panel[.panel-name[Example 1]

```r
starwars %>%
  filter(species  %in% c("Human", "Droid", "Gungan")) %>%
  group_by(species) %>% 
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            sd_mass   =   sd(mass, na.rm = TRUE))
```

```
## # A tibble: 3 × 3
##   species mean_mass sd_mass
##   <chr>       <dbl>   <dbl>
## 1 Droid        69.8    51.0
## 2 Gungan       74      11.3
## 3 Human        82.8    19.4
```

]

.panel[.panel-name[Example 2]

```r
starwars %>% 
  filter(species  %in% c("Human", "Droid", "Gungan")) %>%
  group_by(species, eye_color) %>% 
  summarize(mean_mass = mean(mass, na.rm = TRUE),
            sd_mass   =   sd(mass, na.rm = TRUE))
```

```
## # A tibble: 11 × 4
## # Groups:   species [3]
##    species eye_color mean_mass sd_mass
##    <chr>   <chr>         <dbl>   <dbl>
##  1 Droid   black         NaN      NA  
##  2 Droid   red            68      62.4
##  3 Droid   red, blue     NaN      NA  
##  4 Droid   yellow         75      NA  
##  5 Gungan  orange         74      11.3
##  6 Human   blue           90.6    17.6
##  7 Human   blue-gray      77      NA  
##  8 Human   brown          74.7    13.9
##  9 Human   dark          NaN      NA  
## 10 Human   hazel          77      NA  
## 11 Human   yellow        106.     43.1
```

]

]

---
class: yourturn

# Your turn 3

1. From `data`, get the mean `hp` for each of the different `cyl` values.

1. Now get the mean `hp` for each unique combination of `cyl` and `gear` and arrange the resulting rows by descending order of mean `hp`. Which combination of `cyl` and `gear` had the greatest average `hp`?

---
class: solution

# Solution

.panelset[
.panel[.panel-name[Q1]

```r
data %>% 
  group_by(cyl) %>% 
  summarize(mean_hp = mean(hp))
```

```
## # A tibble: 3 × 2
##     cyl mean_hp
##   <dbl>   <dbl>
## 1     4    82.6
## 2     6   122. 
## 3     8   209.
```

]

.panel[.panel-name[Q2]

```r
data %>% 
  group_by(cyl, gear) %>% 
  summarize(mean_hp = mean(hp)) %>% 
  arrange(desc(mean_hp))
```

```
## # A tibble: 8 × 3
## # Groups:   cyl [3]
##     cyl  gear mean_hp
##   <dbl> <dbl>   <dbl>
## 1     8     5    300.
## 2     8     3    194.
## 3     6     5    175 
## 4     6     4    116.
## 5     6     3    108.
## 6     4     5    102 
## 7     4     3     97 
## 8     4     4     76
```

]

]

---
background-image: url(images/hex/dplyr.png)
background-position: 90% 5%
background-size: 10%

# Deeper dive

What if we want to apply `{dplyr}` verbs across multiple columns and rows **simultaneously**?
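
One possible starting point, assuming the tools in question are `across()` (for columns) and `rowwise()` with `c_across()` (for rows). This is only a sketch: the tibble `df` and the names `col_means`, `row_means`, and `q_mean` are made up for illustration.

```r
library(dplyr)

# Columns: apply mean() to several columns at once with across()
col_means <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), ~ mean(.x)))

# Rows: compute one value per row with rowwise() and c_across()
df <- tibble(id = 1:2, q1 = c(1, 4), q2 = c(3, 6))

row_means <- df %>%
  rowwise() %>%
  mutate(q_mean = mean(c_across(q1:q2))) %>%
  ungroup()

row_means$q_mean  # 2 5
```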

.footnote[🔗 [dplyr-wisely.netlify.app](https://dplyr-wisely.netlify.app/)]

---
class: inverse, center, middle
# Q & A