class: center, inverse background-image: url(img/title_slide.png) background-position: 50% 40% background-size: 100% # Split-apply-combine with `dplyr` <br><br><br><br><br><br><br><br> <!-- better way to do this?! --> ### Harry Fisher ### [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg>](https://twitter.com/harryfishr) [harryfishr](https://twitter.com/harryfishr) | [<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg>](https://github.com/hfshr) [hfshr](https://github.com/hfshr) | [<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> [ comment ] <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg>](https://hfshr.xyz) [https://hfshr.xyz](https://hfshr.xyz) --- ### Learning objectives -- * **Review:** What is split-apply-combine and why it is useful? -- <br> * **Learn:** How to do split-apply-combine using `dplyr` functions `group_by()` and `summarise()`. -- <br> * **Apply:** Use `group_by()` and `summarise()` to effectively summarise data. --- ### Review -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**60 second task**] What do you think of when you hear the phrase "split-apply-combine"? ] <br> -- .ba.bw2.br3.shadow-5.ph4[ .can-edit.key-likes[ * <br> * <br><br><br> ] ] --- ### Review What is split-apply-combine? -- - **Split** the data into groups based on some criteria. <br> <br> <br> <br> <br> <img src="img/example.png" width="28.3%" /> --- ### Review What is split-apply-combine? - **Split** the data into groups based on some criteria. - **Apply** a function to each group independently. <br> <br> <br> <img src="img/example_half.png" width="64.4%" /> --- ### Review What is split-apply-combine? - **Split** the data into different groups. - **Apply** a function to each group independently. - **Combine** the results into a data structure. <br> <img src="img/example_full.png" width="1432" /> -- <br> .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**So how do we do this with `dplyr`?**] ] --- ### Split-apply-combine with the `dplyr` package -- * **Split**: `group_by()` -- * **Apply & combine**: `summarise()` -- * We can link these commands together using the "pipe" operator: `%>%` <br> -- **All together, this looks like:** -- .bg-light-gray.b--dark-gray.ba.bw2.br3.shadow-5.ph4[ `data %>%` `group_by() %>%` `summarise()` ] --- ### Split-apply-combine with the `dplyr` package * **Split**: `group_by()` * **Apply & combine**: `summarise()` * We can link these commands together using the "pipe" operator: `%>%` <br> **All together, this looks like:** .bg-light-gray.b--dark-gray.ba.bw2.br3.shadow-5.ph4[ `data %>%` `group_by(group1, group2, ...) %>%` `summarise(summary_column1 = summary_function1(...), ...)` ] -- <br> .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**Let's try with some real data!**] ] --- ```r library(palmerpenguins) library(dplyr) glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,… ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,… ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,… ## $ sex <fct> male, female, female, NA, female, male, female, mal… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200… ``` --- ### Examples -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ Our data contain three species of penguins. **Goal:** We want to gather some summary statistics about the different species. ] -- .panelset[ .panel[.panel-name[Example #1] Count the number of penguins in each species... <code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>())</code> What do you expect the output to look like? ] .panel[.panel-name[Output #1] ``` ## # A tibble: 3 x 2 ## species count ## <fct> <int> ## 1 Adelie 152 ## 2 Chinstrap 68 ## 3 Gentoo 124 ``` ] .panel[.panel-name[Example #2] Also add a column for the mean bill length... <code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>(),<br> mean_bill_length = <span style="color:purple">mean</span>(bill_length_mm))</code> ] .panel[.panel-name[Output #2] ``` ## # A tibble: 3 x 3 ## species count mean_bill_length ## <fct> <int> <dbl> ## 1 Adelie 152 NA ## 2 Chinstrap 68 48.8 ## 3 Gentoo 124 NA ``` ] ] --- class: center, inverse, middle ## Oh no! `NA`'s! --- ### Example with NA's -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**Live example**] ] --- ### Example with NA's Because our data contains `NA`'s, we have to let R know we want to ignore these values and still calculate the mean for the values we _do_ have. -- <code class ='r hljs remark-code'>penguins %>% <br> <span style="color:red">group_by</span>(species) %>% <br> <span style="color:red">summarise</span>(count = <span style="color:purple">n</span>(),<br> mean_bill_length = <span style="color:purple">mean</span>(bill_length_mm, <br> <span style="color:orange">na.rm = TRUE</span>))</code> ``` ## # A tibble: 3 x 3 ## species count mean_bill_length ## <fct> <int> <dbl> ## 1 Adelie 152 38.8 ## 2 Chinstrap 68 48.8 ## 3 Gentoo 124 47.5 ``` -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**Much better!**] ] --- class: center, inverse, middle # Your turn! [click here](https://harryfish.shinyapps.io/formative_assessment/) --- ### Summary Today, you have: -- * **Reviewed** the split-apply-combine workflow for summarising data. -- * **Learnt** how to use `group_by` and `summarise`. -- * **Applied** your knowledge to some example problems concerning penguins. -- <br><br> .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4[ .center[**Good job!**] ] --- ### Concept map <img src="img/grouby_summarise.png" width="935" /> .footnote[Source: [rstudio/concept-maps](https://github.com/rstudio/concept-maps#group_by-and-summarize)] --- class: inverse ### More resources * An [introduction to data manipulation with `dplyr`](https://datacarpentry.org/R-genomics/04-dplyr.html) from the carpentries. * Another [split-apply-combind tutorial](https://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_3.nb.html#piping) with `dplyr`. * [R for data science](https://r4ds.had.co.nz/) book for more on this topic and many other related concepts.