
Split-apply-combine with dplyr









Harry Fisher

harryfishr | hfshr | https://hfshr.xyz

Learning objectives

  • Review: What is split-apply-combine, and why is it useful?


  • Learn: How to do split-apply-combine using the dplyr functions group_by() and summarise().


  • Apply: Use group_by() and summarise() to effectively summarise data.

Review

60 second task

What do you think of when you hear the phrase "split-apply-combine"?


Review

What is split-apply-combine?

  • Split the data into groups based on some criteria.

  • Apply a function to each group independently.

  • Combine the results into a data structure.
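
The three steps above can be sketched in base R before turning to dplyr. This is only an illustration: the `weights` data frame and its column names are hypothetical, invented for this sketch, and are not part of the original slides.

```r
# Hypothetical toy data, invented for illustration only
weights <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Split: one vector of values per group
pieces <- split(weights$value, weights$group)

# Apply: compute the mean of each piece independently
means <- sapply(pieces, mean)

# Combine: gather the per-group results into a single data frame
result <- data.frame(group = names(means), mean_value = unname(means))
result
```

Each step is a separate line here, which makes the pattern easy to see but becomes verbose once there are several groups or several summaries.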



So how do we do this with dplyr?

Split-apply-combine with the dplyr package

  • Split: group_by()

  • Apply & combine: summarise()

  • We can link these commands together using the "pipe" operator: %>%


All together, this looks like:

data %>%
  group_by(group1, group2, ...) %>%
  summarise(summary_column1 = summary_function1(...), ...)
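
As a minimal sketch of how the template might be filled in, the example below groups by two columns and computes two summaries. The `scores` data and its columns (`team`, `round`, `points`) are hypothetical and exist only to make the placeholders concrete.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
scores <- tibble(
  team   = c("a", "a", "b", "b", "b"),
  round  = c(1, 2, 1, 2, 2),
  points = c(10, 20, 5, 15, 25)
)

# group1 = team, group2 = round; one output row per team/round combination
result <- scores %>%
  group_by(team, round) %>%
  summarise(n_rows = n(),
            mean_points = mean(points))
result
```

With two grouping columns, the result has one row for each combination that appears in the data (four here), and each summary becomes its own column.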


Let's try with some real data!

library(palmerpenguins)
library(dplyr)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

Examples

Our data contain three species of penguins.

Goal: We want to gather some summary statistics about the different species.

Count the number of penguins in each species...

penguins %>%
  group_by(species) %>%
  summarise(count = n())

What do you expect the output to look like?

## # A tibble: 3 x 2
##   species   count
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

Also add a column for the mean bill length...

penguins %>%
  group_by(species) %>%
  summarise(count = n(),
            mean_bill_length = mean(bill_length_mm))

## # A tibble: 3 x 3
##   species   count mean_bill_length
##   <fct>     <int>            <dbl>
## 1 Adelie      152             NA
## 2 Chinstrap    68             48.8
## 3 Gentoo      124             NA

Oh no! NAs!

Example with NAs

Live example

Example with NAs

Because our data contain NAs, mean() returns NA for any group that has a missing value. Setting na.rm = TRUE tells R to drop the missing values and calculate the mean from the values we do have.

penguins %>%
  group_by(species) %>%
  summarise(count = n(),
            mean_bill_length = mean(bill_length_mm,
                                    na.rm = TRUE))

## # A tibble: 3 x 3
##   species   count mean_bill_length
##   <fct>     <int>            <dbl>
## 1 Adelie      152             38.8
## 2 Chinstrap    68             48.8
## 3 Gentoo      124             47.5

Much better!
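
The effect of na.rm can also be seen outside a pipeline. The sketch below uses a plain vector holding the first five bill_length_mm values shown in the glimpse() output earlier; the variable name `bill_lengths` is an assumption made for this illustration.

```r
# First five bill_length_mm values from the glimpse() output, including one NA
bill_lengths <- c(39.1, 39.5, 40.3, NA, 36.7)

# A single missing value makes the whole mean missing
mean_with_na <- mean(bill_lengths)
mean_with_na
# NA

# na.rm = TRUE drops the NA and averages the four remaining values
mean_without_na <- mean(bill_lengths, na.rm = TRUE)
mean_without_na
# 38.9
```

Inside summarise(), the same thing happens once per group, which is why only the species with a missing bill length showed NA.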


Your turn!

click here

Summary

Today, you have:

  • Reviewed the split-apply-combine workflow for summarising data.

  • Learnt how to use group_by() and summarise().

  • Applied your knowledge to some example problems concerning penguins.



Good job!


Concept map

Source: rstudio/concept-maps


More resources
