In this post you will learn that a walrus is not just an animal.
Context
The tidyverse is an ecosystem of R packages that changed how data is handled in the language. It includes widely used packages such as dplyr and ggplot2, which provide many useful functions; in a previous post, for example, we covered the across function from dplyr.
But we may need to create our own functions that use tidyverse functions inside them, and a problem can arise: tidyverse verbs work on a dataframe and refer to its columns by bare name, so passing a column as an argument is not straightforward.
To make these functions easier to write, some special operators were created. They let us pass an input as an argument to functions that work on a dataframe, even if we pass only the column name.
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5
2 Adelie    male      73            4043. 0.5
3 Chinstrap female    34            3527. 0.5
4 Chinstrap male      34            3939. 0.5
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513
In the example above we used the penguins dataframe and performed the following steps:
Removed the observations with missing values for the variable sex;
Computed the count of penguins, by species and sex;
Computed the mean of the penguins’ body mass (in grams), by species and sex;
Computed the proportion of each sex, by species.
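The steps listed above can be sketched as a single dplyr pipeline. This is a reconstruction from the printed output, not necessarily the exact code that produced it; it assumes that penguins comes from the palmerpenguins package and that the magrittr pipe is used.

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the penguins dataframe

result <- penguins %>%
  filter(!is.na(sex)) %>%                  # remove observations with missing sex
  group_by(species, sex) %>%
  summarise(
    n = n(),                               # count by species and sex
    mean_body_mass_g = mean(body_mass_g)   # mean body mass by species and sex
  ) %>%
  mutate(p = n / sum(n))                   # proportion of each sex within species

result
```

After summarise() the result is still grouped by species, which is why n / sum(n) gives a proportion within each species.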
Ok, that was very simple and effective, but what if we want to turn this into a function called penguin_summary?
Operators
{{}} Curly-curly
The first operator we will learn is the curly-curly, written {{}}. Its goal is to let an argument passed to our function refer to a column inside a dataframe.
So, we will create the function penguin_summary, where the variable used to count the penguins, species in the previous example, is generalized by the argument grp_var.
Inside the function, the argument grp_var is wrapped in the {{}} operator within the group_by verb.
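A minimal sketch of such a function, assuming the same pipeline as before and penguins from the palmerpenguins package (the body is our reconstruction; only the names penguin_summary and grp_var come from the text):

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the penguins dataframe

penguin_summary <- function(grp_var) {
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{ grp_var }}, sex) %>%   # {{ }} forwards the column the caller named
    summarise(
      n = n(),
      mean_body_mass_g = mean(body_mass_g)
    ) %>%
    mutate(p = n / sum(n))
}
```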
Let’s now apply the variable species to see if the result is the same as before.
penguin_summary(grp_var = species)
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5
2 Adelie    male      73            4043. 0.5
3 Chinstrap female    34            3527. 0.5
4 Chinstrap male      34            3939. 0.5
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513
Yes! We got the same result. There is also another interesting fact: the variable species was passed without quotes, so there is no need to use functions such as quo or enquo.
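For comparison, here is a sketch of the older pattern that {{}} replaces: the argument is captured with enquo() and then unquoted with !!. The function name penguin_summary_enquo is hypothetical, used only for this illustration, and penguins is again assumed to come from palmerpenguins.

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the penguins dataframe

# Hypothetical name, for illustration only
penguin_summary_enquo <- function(grp_var) {
  grp_var <- rlang::enquo(grp_var)     # capture the unquoted column name
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by(!!grp_var, sex) %>%       # unquote it inside the verb
    summarise(
      n = n(),
      mean_body_mass_g = mean(body_mass_g)
    ) %>%
    mutate(p = n / sum(n))
}

penguin_summary_enquo(grp_var = species)
```

Writing {{ grp_var }} does the capture and the unquoting in one step.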
And now we can pass another variable to our function, let’s give it a try.
penguin_summary(grp_var = island)
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   island [3]
  island    sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Biscoe    female    80            4319. 0.491
2 Biscoe    male      83            5105. 0.509
3 Dream     female    61            3446. 0.496
4 Dream     male      62            3987. 0.504
5 Torgersen female    24            3396. 0.511
6 Torgersen male      23            4035. 0.489
Ok, after generalizing the species variable, we will do the same for body_mass_g by creating another argument, num_var.
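A sketch of this step, under the same assumptions as before (penguins from palmerpenguins, function body reconstructed from the output). Note that the new column has to be given a fixed name here, mean:

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the penguins dataframe

penguin_summary <- function(grp_var, num_var) {
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{ grp_var }}, sex) %>%
    summarise(
      n = n(),
      mean = mean({{ num_var }})   # generic column name, whatever num_var is
    ) %>%
    mutate(p = n / sum(n))
}

penguin_summary(grp_var = species, num_var = body_mass_g)
```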
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n  mean     p
  <fct>     <fct>  <int> <dbl> <dbl>
1 Adelie    female    73 3369. 0.5
2 Adelie    male      73 4043. 0.5
3 Chinstrap female    34 3527. 0.5
4 Chinstrap male      34 3939. 0.5
5 Gentoo    female    58 4680. 0.487
6 Gentoo    male      61 5485. 0.513
Okay, we kind of succeeded, but we had to give the new variable for the mean a generic name; to make this dynamic, we’ll need the assistance of another operator.
:= Walrus
The second operator is the walrus, written :=. Its goal is to let us create new variables whose names are built dynamically from an argument.
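A sketch of how this might look in our function, assuming a recent version of rlang that supports {{}} inside a name string, and penguins from palmerpenguins:

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the penguins dataframe

penguin_summary <- function(grp_var, num_var) {
  penguins %>%
    filter(!is.na(sex)) %>%
    group_by({{ grp_var }}, sex) %>%
    summarise(
      n = n(),
      # := allows the name on the left-hand side to be a string template
      "mean_{{ num_var }}" := mean({{ num_var }})
    ) %>%
    mutate(p = n / sum(n))
}

penguin_summary(grp_var = species, num_var = body_mass_g)
```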
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5
2 Adelie    male      73            4043. 0.5
3 Chinstrap female    34            3527. 0.5
4 Chinstrap male      34            3939. 0.5
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513
The walrus operator takes the place of the = operator, and we can use the argument num_var inside the {{}} operator on the left-hand side to generalize our variable name. Not only that, but we can also add other characters such as a prefix or suffix.
Now that we’ve finished our function, what if we want to make it even more general? For example, our dataframe and the variable sex are still hard-coded inside the function. That is easy to fix, we just need to create two more arguments:
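A sketch with the two extra arguments, df and main_var. The argument order and the function body are our assumptions; the defaults follow the description in the text.

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the default penguins dataframe

penguin_summary <- function(df = penguins, main_var = sex, grp_var, num_var) {
  df %>%                                        # plain argument, no operator needed
    filter(!is.na({{ main_var }})) %>%
    group_by({{ grp_var }}, {{ main_var }}) %>%
    summarise(
      n = n(),
      "mean_{{ num_var }}" := mean({{ num_var }})
    ) %>%
    mutate(p = n / sum(n))
}

# df and main_var fall back to their defaults, penguins and sex
penguin_summary(grp_var = species, num_var = body_mass_g)
```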
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 x 5
# Groups:   species [3]
  species   sex        n mean_body_mass_g     p
  <fct>     <fct>  <int>            <dbl> <dbl>
1 Adelie    female    73            3369. 0.5
2 Adelie    male      73            4043. 0.5
3 Chinstrap female    34            3527. 0.5
4 Chinstrap male      34            3939. 0.5
5 Gentoo    female    58            4680. 0.487
6 Gentoo    male      61            5485. 0.513
So we created an argument called df to be our data.frame, without any operator since it is called “directly”, and set the penguins dataset as its default. We did the same for the sex variable, which is now the default of the argument main_var.
And even though we created a function called penguin_summary, we can now apply it to another dataframe:
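A sketch of such a call on the built-in mtcars dataframe. The function is defined again so the snippet runs on its own; its body and argument order are our assumptions, and the choice of vs, cyl and drat is read off the column names in the output.

```r
library(dplyr)

# Defaults are only evaluated when df / main_var are not supplied,
# so the penguins data is not needed for this particular call.
penguin_summary <- function(df = penguins, main_var = sex, grp_var, num_var) {
  df %>%
    filter(!is.na({{ main_var }})) %>%
    group_by({{ grp_var }}, {{ main_var }}) %>%
    summarise(
      n = n(),
      "mean_{{ num_var }}" := mean({{ num_var }})
    ) %>%
    mutate(p = n / sum(n))
}

cars_summary <- penguin_summary(
  df = mtcars, main_var = vs, grp_var = cyl, num_var = drat
)
cars_summary
```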
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 5 x 5
# Groups:   cyl [3]
    cyl    vs     n mean_drat      p
  <dbl> <dbl> <int>     <dbl>  <dbl>
1     4     0     1      4.43 0.0909
2     4     1    10      4.04 0.909
3     6     0     3      3.81 0.429
4     6     1     4      3.42 0.571
5     8     0    14      3.23 1
Ok, now we have a function that is completely generalized, with only arguments inside of it, but there is still a way to make it even more powerful. Let’s say we want to apply our function to two numerical variables.
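Passing more than one grouping variable is the simpler case, so we start there. One possible way to do it, and this is an assumption on our part since other approaches exist, is to wrap grp_var in across() inside group_by and pass the columns with c(). In newer dplyr versions pick() can play the same role.

```r
library(dplyr)
library(palmerpenguins) # assumption: source of the default penguins dataframe

penguin_summary <- function(df = penguins, main_var = sex, grp_var, num_var) {
  df %>%
    filter(!is.na({{ main_var }})) %>%
    # across() lets grp_var be a selection of several columns
    group_by(across({{ grp_var }}), {{ main_var }}) %>%
    summarise(
      n = n(),
      "mean_{{ num_var }}" := mean({{ num_var }})
    ) %>%
    mutate(p = n / sum(n))
}

two_groups <- penguin_summary(grp_var = c(species, island), num_var = body_mass_g)
two_groups
```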
`summarise()` has grouped output by 'species', 'island'. You can override using
the `.groups` argument.
# A tibble: 10 x 6
# Groups:   species, island [5]
   species   island    sex        n mean_body_mass_g     p
   <fct>     <fct>     <fct>  <int>            <dbl> <dbl>
 1 Adelie    Biscoe    female    22            3369. 0.5
 2 Adelie    Biscoe    male      22            4050  0.5
 3 Adelie    Dream     female    27            3344. 0.491
 4 Adelie    Dream     male      28            4046. 0.509
 5 Adelie    Torgersen female    24            3396. 0.511
 6 Adelie    Torgersen male      23            4035. 0.489
 7 Chinstrap Dream     female    34            3527. 0.5
 8 Chinstrap Dream     male      34            3939. 0.5
 9 Gentoo    Biscoe    female    58            4680. 0.487
10 Gentoo    Biscoe    male      61            5485. 0.513
Now we passed both the species and island variables to group_by. To do the same with the num_var argument, we can make use of across, as we saw in our post An intro to dplyr::across.