An intro to: Tidyverse Operators

Intro to
R
Tools
Author: Vinícius Félix

Published: January 8, 2023

In this post you will learn that a walrus is not just an animal.

Context

The tidyverse is an ecosystem of R packages that revolutionized how data is handled in the language. It provides well-known libraries such as dplyr and ggplot2, which offer great functions; for example, we have already covered the across function from dplyr.

But sometimes we need to write our own functions that use tidyverse functions inside them. A problem may then arise: tidyverse verbs look up column names inside a dataframe, so how to pass those columns as arguments can be an issue.

So, to make these functions easier to write, some special operators were created. They let us pass an input as an argument to functions that work on a dataframe, even if we pass just the column name.

First of all, let’s run a small analysis with the tidyverse:

library(palmerpenguins)
library(dplyr)

penguins %>% 
  filter(!is.na(sex)) %>% 
  group_by(species,sex) %>%
  summarise(
    n = n(),
    mean_body_mass_g = mean(body_mass_g,na.rm = TRUE)
    ) %>% 
  group_by(species) %>% 
  mutate(p = n/sum(n,na.rm = TRUE))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5  
2 Adelie    male      73            4043. 0.5  
3 Chinstrap female    34            3527. 0.5  
4 Chinstrap male      34            3939. 0.5  
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513

In the example above we used the penguins dataframe and performed the following steps:

  1. Removed the observations with missing values for the variable sex;

  2. Computed the count of penguins, by species and sex;

  3. Computed the mean of the penguins’ body mass (in grams), by species and sex;

  4. Computed the proportion of each sex, by species.

Ok, that was simple and effective, but what if we want to turn this into a function called penguin_summary?
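Before reaching for any operator, it helps to see why the obvious approach fails. The sketch below is only an illustration: naive_summary is a hypothetical name, and it uses the built-in mtcars dataset (an assumption, chosen so that dplyr is the only package needed). The argument is handed to group_by as-is, so dplyr looks for a column literally called grp_var, finds none, and the call ends in an error.

```r
library(dplyr)

# Hypothetical naive version: grp_var is used without any special operator
naive_summary <- function(df, grp_var) {
  df %>%
    group_by(grp_var) %>%
    summarise(n = n(), .groups = "drop")
}

# The column name is passed unquoted, just as we would write it inside a dplyr verb;
# tryCatch() captures the error instead of stopping the script
result <- tryCatch(
  naive_summary(mtcars, cyl),
  error = function(e) e
)

# Check whether the call failed
inherits(result, "error")
```

The column cyl exists in mtcars, but the function never tells dplyr that grp_var stands for it. That is the gap the operators are meant to fill.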

Operators

{{}} Curly-curly

The first operator we will learn is the curly-curly, written {{}}. Its goal is to let an argument passed to our function refer to a column inside a dataframe.

So, we will create the function penguin_summary, where the variable used to group the penguins, species in the previous example, is generalized by the argument grp_var.

penguin_summary <- function(grp_var){
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{grp_var}},sex) %>%
    summarise(
      n = n(),
      mean_body_mass_g = mean(body_mass_g,na.rm = TRUE)
    ) %>%
    group_by({{grp_var}}) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}

Notice that wherever the argument grp_var appears inside a dplyr verb, here in both calls to group_by, we wrap it in the {{}} operator.

Let’s now apply the variable species to see if the result is the same as before.

penguin_summary(grp_var = species)
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5  
2 Adelie    male      73            4043. 0.5  
3 Chinstrap female    34            3527. 0.5  
4 Chinstrap male      34            3939. 0.5  
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513

Yes! We got the same result. There is another interesting detail: the variable species was passed without quotes, so there is no need for functions such as quo, enquo, etc.

And now we can pass another variable to our function; let’s give it a try.

penguin_summary(grp_var = island)
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   island [3]
  island    sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Biscoe    female    80            4319. 0.491
2 Biscoe    male      83            5105. 0.509
3 Dream     female    61            3446. 0.496
4 Dream     male      62            3987. 0.504
5 Torgersen female    24            3396. 0.511
6 Torgersen male      23            4035. 0.489
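For comparison with the quoting functions mentioned above, here is a minimal sketch of the older pattern next to the curly-curly one. The names count_by_old and count_by_new are hypothetical, and mtcars is used only to keep the example self-contained. In the old style we capture the argument with enquo() and inject it with !!; {{}} does both steps in one.

```r
library(dplyr)

# Hypothetical helper, older pattern: capture with enquo(), then inject with !!
count_by_old <- function(df, grp_var) {
  grp_var <- enquo(grp_var)
  df %>%
    group_by(!!grp_var) %>%
    summarise(n = n(), .groups = "drop")
}

# Hypothetical helper, curly-curly pattern: one operator does both steps
count_by_new <- function(df, grp_var) {
  df %>%
    group_by({{grp_var}}) %>%
    summarise(n = n(), .groups = "drop")
}

old <- count_by_old(mtcars, cyl)
new <- count_by_new(mtcars, cyl)

# Both versions should produce the same table of counts per cyl
all.equal(old, new)
```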

Ok, after generalizing the species variable, we will do the same for body_mass_g by creating another argument, num_var.

penguin_summary <- function(grp_var,num_var){
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{grp_var}},sex) %>%
    summarise(
      n = n(),
      mean = mean({{num_var}},na.rm = TRUE)
    ) %>%
    group_by({{grp_var}}) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}
penguin_summary(
  grp_var = species,
  num_var = body_mass_g
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n  mean     p
  <fct>     <fct>  <int> <dbl> <dbl>
1 Adelie    female    73 3369. 0.5  
2 Adelie    male      73 4043. 0.5  
3 Chinstrap female    34 3527. 0.5  
4 Chinstrap male      34 3939. 0.5  
5 Gentoo    female    58 4680. 0.487
6 Gentoo    male      61 5485. 0.513

Okay, we kind of succeeded, but we had to give the new variable for the mean a generic name; to make this dynamic, we’ll need the assistance of another operator.

:= Walrus

The second operator is the walrus, written :=. Its goal is to let us create new variables whose names are built dynamically from an argument.

penguin_summary <- function(grp_var,num_var){
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{grp_var}},sex) %>%
    summarise(
      n = n(),
      "mean_{{num_var}}" := mean({{num_var}},na.rm = TRUE)
    ) %>%
    group_by({{grp_var}}) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}
penguin_summary(
  grp_var = species,
  num_var = body_mass_g
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5  
2 Adelie    male      73            4043. 0.5  
3 Chinstrap female    34            3527. 0.5  
4 Chinstrap male      34            3939. 0.5  
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513

The walrus operator replaces the = operator, and the name on its left-hand side is written as a string. Inside that string we can wrap the argument num_var in {{}} to generalize the variable name, and we can also add other characters such as a prefix or suffix.
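As a small sketch of the suffix idea, the name string can also mix {{}} with a single pair of braces, which inserts the value of an ordinary character argument. The function mean_with_label and its suffix argument are hypothetical, and mtcars is used only to keep the example short.

```r
library(dplyr)

# Hypothetical helper: the new column name is built from the column passed in
# num_var ({{ }}) plus a plain string argument (single braces)
mean_with_label <- function(df, num_var, suffix = "avg") {
  df %>%
    summarise("{{num_var}}_{suffix}" := mean({{num_var}}, na.rm = TRUE))
}

res <- mean_with_label(mtcars, mpg)

# The single column is named after the variable and the suffix
names(res)
```

Calling mean_with_label(mtcars, mpg, suffix = "mean") would name the column with that suffix instead, without touching the body of the function.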

Now that we’ve finished our function, what if we want to make it even more general? For example, our dataframe and the variable sex are still hard-coded inside the function. That is easy to fix; we just need to create two more arguments:

penguin_summary <- function(df = penguins,main_var = sex,grp_var,num_var){
  df %>%
    filter(!is.na({{main_var}})) %>%
    group_by({{grp_var}},{{main_var}}) %>%
    summarise(
      n = n(),
      "mean_{{num_var}}" := mean({{num_var}},na.rm = TRUE)
    ) %>%
    group_by({{grp_var}}) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}
penguin_summary(
  grp_var = species,
  num_var = body_mass_g
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5  
2 Adelie    male      73            4043. 0.5  
3 Chinstrap female    34            3527. 0.5  
4 Chinstrap male      34            3939. 0.5  
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513

So we created an argument called df to hold our data.frame, without any operator since it is used “directly”, and left the penguins dataset as its default. We did the same for the sex variable with the argument main_var, this time wrapped in {{}} because it refers to a column.

And even though the function is called penguin_summary, we can now apply it to another dataframe:

penguin_summary(
  df = mtcars,
  main_var = vs,
  grp_var = cyl,
  num_var = drat
  )
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 5 x 5
# Groups:   cyl [3]
    cyl    vs     n mean_drat      p
  <dbl> <dbl> <int>     <dbl>  <dbl>
1     4     0     1      4.43 0.0909
2     4     1    10      4.04 0.909 
3     6     0     3      3.81 0.429 
4     6     1     4      3.42 0.571 
5     8     0    14      3.23 1     

Ok, now we have a fully generalized function, with nothing but arguments inside it. But there is still a way to make it even more powerful: let’s say we want to apply our function to two numerical variables.

penguin_summary(
  grp_var = species,
  num_var = c(body_mass_g,bill_depth_mm)
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n `mean_c(body_mass_g, bill_depth_mm)`     p
  <fct>     <fct>  <int>                                <dbl> <dbl>
1 Adelie    female    73                                1693. 0.5  
2 Adelie    male      73                                2031. 0.5  
3 Chinstrap female    34                                1772. 0.5  
4 Chinstrap male      34                                1979. 0.5  
5 Gentoo    female    58                                2347. 0.487
6 Gentoo    male      61                                2750. 0.513

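So it is not what we expected, right? Instead of one mean per variable, c() combined both columns into a single vector and we got one mean of all the values. To pass multiple variables into a single argument, we will need the help of an old friend.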
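To see where numbers like those come from, here is a tiny base R sketch. The two vectors hold made-up values (they are not taken from the penguins data) and stand in for a body mass column and a bill depth column.

```r
# Hypothetical values standing in for two columns on very different scales
body_mass <- c(3000, 4000)
bill_depth <- c(17, 19)

# What we wanted: one mean per variable
mean(body_mass)   # 3500
mean(bill_depth)  # 18

# What mean(c(...)) actually computes: a single mean over all four values
combined <- mean(c(body_mass, bill_depth))
combined          # 1759
```

The combined value sits roughly halfway between the two scales, which is the same pattern we saw in the output above.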

Across

So let’s turn to across, because together with the curly-curly operator it allows us to pass multiple variables into one argument.

penguin_summary <- function(df = penguins,main_var = sex,grp_var,num_var){
  df %>%
    filter(!is.na({{main_var}})) %>%
    group_by(across({{grp_var}}),{{main_var}}) %>%
    summarise(
      n = n(),
      "mean_{{num_var}}" := mean({{num_var}},na.rm = TRUE)
    ) %>%
    group_by(across({{grp_var}})) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}
penguin_summary(
  grp_var = c(species, island),
  num_var = body_mass_g
  )
`summarise()` has grouped output by 'species', 'island'. You can override using
the `.groups` argument.
# A tibble: 10 x 6
# Groups:   species, island [5]
   species   island    sex        n mean_body_mass_g     p
   <fct>     <fct>     <fct>  <int>            <dbl> <dbl>
 1 Adelie    Biscoe    female    22            3369. 0.5  
 2 Adelie    Biscoe    male      22            4050  0.5  
 3 Adelie    Dream     female    27            3344. 0.491
 4 Adelie    Dream     male      28            4046. 0.509
 5 Adelie    Torgersen female    24            3396. 0.511
 6 Adelie    Torgersen male      23            4035. 0.489
 7 Chinstrap Dream     female    34            3527. 0.5  
 8 Chinstrap Dream     male      34            3939. 0.5  
 9 Gentoo    Biscoe    female    58            4680. 0.487
10 Gentoo    Biscoe    male      61            5485. 0.513

Now we passed both the species and island variables to group_by. To do the same for the num_var argument, we can take advantage of the arguments of across, as we saw in our post An intro to dplyr::across.

penguin_summary <- function(df = penguins,main_var = sex,grp_var,num_var){
  df %>%
    filter(!is.na({{main_var}})) %>%
    group_by(across({{grp_var}}),{{main_var}}) %>%
    summarise(
      n = n(),
      across(.cols = {{num_var}},
             .fns = ~mean(.,na.rm = TRUE),
             .names = "mean_{.col}")
    ) %>%
    group_by(across({{grp_var}})) %>%
    mutate(p = n/sum(n,na.rm = TRUE))
}
penguin_summary(
  grp_var = species,
  num_var = c(body_mass_g,bill_depth_mm)
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 6
# Groups:   species [3]
  species   sex        n mean_body_mass_g mean_bill_depth_mm     p
  <fct>     <fct>  <int>            <dbl>              <dbl> <dbl>
1 Adelie    female    73            3369.               17.6 0.5  
2 Adelie    male      73            4043.               19.1 0.5  
3 Chinstrap female    34            3527.               17.6 0.5  
4 Chinstrap male      34            3939.               19.3 0.5  
5 Gentoo    female    58            4680.               14.2 0.487
6 Gentoo    male      61            5485.               15.7 0.513

We can now compute the mean for multiple numeric variables and also group by any number of variables we want:

penguin_summary(
  grp_var = c(species,island),
  num_var = c(body_mass_g,bill_depth_mm)
  )
`summarise()` has grouped output by 'species', 'island'. You can override using
the `.groups` argument.
# A tibble: 10 x 7
# Groups:   species, island [5]
   species   island    sex        n mean_body_mass_g mean_bill_depth_mm     p
   <fct>     <fct>     <fct>  <int>            <dbl>              <dbl> <dbl>
 1 Adelie    Biscoe    female    22            3369.               17.7 0.5  
 2 Adelie    Biscoe    male      22            4050                19.0 0.5  
 3 Adelie    Dream     female    27            3344.               17.6 0.491
 4 Adelie    Dream     male      28            4046.               18.8 0.509
 5 Adelie    Torgersen female    24            3396.               17.6 0.511
 6 Adelie    Torgersen male      23            4035.               19.4 0.489
 7 Chinstrap Dream     female    34            3527.               17.6 0.5  
 8 Chinstrap Dream     male      34            3939.               19.3 0.5  
 9 Gentoo    Biscoe    female    58            4680.               14.2 0.487
10 Gentoo    Biscoe    male      61            5485.               15.7 0.513
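The same across pattern can be pushed one step further. The sketch below is one possible extension, not part of the function built in this post: multi_summary is a hypothetical name, mtcars keeps it self-contained, and it assumes we want several summary functions at once. A named list in .fns and the {.fn} placeholder in .names give each output column a distinct name, and the .groups argument mentioned in the summarise message is set explicitly to silence it.

```r
library(dplyr)

# Hypothetical helper: several grouping variables, several numeric variables,
# and several summary functions in one call
multi_summary <- function(df, grp_var, num_var) {
  df %>%
    group_by(across({{grp_var}})) %>%
    summarise(
      across(
        .cols = {{num_var}},
        .fns = list(
          mean = ~mean(.x, na.rm = TRUE),
          sd = ~sd(.x, na.rm = TRUE)
        ),
        .names = "{.fn}_{.col}"  # e.g. mean_mpg, sd_mpg
      ),
      .groups = "drop"  # return an ungrouped tibble, no grouping message
    )
}

res <- multi_summary(mtcars, cyl, c(mpg, drat))
names(res)
```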