class: center, inverse
background-image: url(img/title_slide.png)
background-position: 50% 40%
background-size: 100%

# Split-apply-combine with `dplyr`

### Learning objectives

* **Review:** What is split-apply-combine and why is it useful?

--

<br>

* **Learn:** How to do split-apply-combine using the `dplyr` functions `group_by()` and `summarise()`.

--

<br>

* **Apply:** Use `group_by()` and `summarise()` to effectively summarise data.

---

### Review

--

[
.center[**60 second task**]

What do you think of when you hear the phrase "split-apply-combine"?
]

<br>

--

.ba.bw2.br3.shadow-5.ph4[
.can-edit.key-likes[
* <br>
* <br><br><br>
]
]

---

### Review

What is split-apply-combine?

--

- **Split** the data into groups based on some criteria.

<br> <br> <br> <br> <br>

<img src="img/example.png" width="28.3%" />

---

### Review

What is split-apply-combine?

- **Split** the data into groups based on some criteria.
- **Apply** a function to each group independently.

<br> <br> <br>

<img src="img/example_half.png" width="64.4%" />

---

### Review

What is split-apply-combine?

- **Split** the data into groups based on some criteria.
- **Apply** a function to each group independently.
- **Combine** the results into a data structure.

<br>

<img src="img/example_full.png" width="1432" />

--

<br>[ .center[**So how do we do this with `dplyr`?**] ]

---

### Split-apply-combine with the `dplyr` package

--

* **Split**: `group_by()`

--

* **Apply & combine**: `summarise()`

--

* We can link these commands together using the "pipe" operator: `%>%`

<br>

--

**All together, this looks like:**

--

[
`data %>%`
`group_by() %>%`
`summarise()`
]

---

### Split-apply-combine with the `dplyr` package

* **Split**: `group_by()`
* **Apply & combine**: `summarise()`
* We can link these commands together using the "pipe" operator: `%>%`

<br>

**All together, this looks like:**

[
`data %>%`
`group_by(group1, group2, ...) %>%`
`summarise(summary_column1 = summary_function1(...), ...)`
]

--

<br>[ .center[**Let's try with some real data!**] ]

---

```r
library(palmerpenguins)
library(dplyr)

glimpse(penguins)
```

```
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
```

---

### Examples

--

[
Our data contain three species of penguins.

**Goal:** We want to gather some summary statistics about the different species.
]

--

.panelset[
.panel[.panel-name[Example #1]

Count the number of penguins in each species...

<code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>())</code>

What do you expect the output to look like?
]

.panel[.panel-name[Output #1]

```
## # A tibble: 3 x 2
##   species   count
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
```
]

.panel[.panel-name[Example #2]

Also add a column for the mean bill length...
<code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>(),<br> mean_bill_length = <span style="color:purple">mean</span>(bill_length_mm))</code>
]

.panel[.panel-name[Output #2]

```
## # A tibble: 3 x 3
##   species   count mean_bill_length
##   <fct>     <int>            <dbl>
## 1 Adelie      152             NA
## 2 Chinstrap    68             48.8
## 3 Gentoo      124             NA
```
]
]

---

class: center, inverse, middle

## Oh no! `NA`'s!

---

### Example with NA's

.center[**Live example**]
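
One way such a live example might go is sketched here. It uses a small made-up tibble called `toy` (hypothetical, not the real penguin data) to show that a single `NA` inside a group is enough to turn that group's mean into `NA`, and that `sum(is.na(...))` can count the missing values per group.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration, not the penguins data):
# two groups, with one missing bill length in group "A".
toy <- tibble(
  species = c("A", "A", "B", "B"),
  bill_length_mm = c(39.1, NA, 48.0, 49.0)
)

missing_by_species <- toy %>%
  group_by(species) %>%
  summarise(
    count = n(),                                  # rows per group, NA rows included
    n_missing = sum(is.na(bill_length_mm)),       # number of missing values per group
    mean_bill_length = mean(bill_length_mm)       # NA whenever the group has a missing value
  )

print(missing_by_species)
```

Only group "A" contains a missing value, so only its mean comes back as `NA`.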

---

### Example with NA's

Because our data contain `NA`'s, we have to tell R to ignore these values and calculate the mean from the values we _do_ have.
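
To see what `na.rm` does in isolation, here is a minimal sketch in base R. The vector `bill_lengths` holds made-up values chosen for illustration; it is not taken from the penguin data.

```r
# Made-up values for illustration; the third one is missing.
bill_lengths <- c(39.1, 39.5, NA)

# By default, any NA in the input makes the mean NA.
mean_default <- mean(bill_lengths)

# With na.rm = TRUE, missing values are dropped before averaging.
mean_ignoring_na <- mean(bill_lengths, na.rm = TRUE)

print(mean_default)
print(mean_ignoring_na)
```

The same `na.rm = TRUE` argument can be passed to `mean()` inside `summarise()`.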

<code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>(),<br> mean_bill_length = <span style="color:purple">mean</span>(bill_length_mm, <br> <span style="color:orange">na.rm = TRUE</span>))</code>

```
## # A tibble: 3 x 3
##   species   count mean_bill_length
##   <fct>     <int>            <dbl>
## 1 Adelie      152             38.8
## 2 Chinstrap    68             48.8
## 3 Gentoo      124             47.5
```

.center[**Much better!**]

[click here](

---

### Summary

Today, you have:

--

* **Reviewed** the split-apply-combine workflow for summarising data.

--

* **Learnt** how to use `group_by()` and `summarise()`.

--

* **Applied** your knowledge to some example problems concerning penguins.

--

<br><br>[ .center[**Good job!**] ]

---

### Concept map

<img src="img/grouby_summarise.png" width="935" />

.footnote[Source: [rstudio/concept-maps](]

---

class: inverse

### More resources

* An [introduction to data manipulation with `dplyr`]( from The Carpentries.
* Another [split-apply-combine tutorial]( with `dplyr`.
* The [R for Data Science]( book for more on this topic and many other related concepts.