In this post you will learn how to avoid repeating the same function call inside a dplyr pipeline.
Context
As described on its website, the Tidyverse is an opinionated collection of R packages designed for data science. One of the ecosystem's main packages is dplyr, which offers a consistent set of verbs to help you solve data manipulation problems.
June 2020 saw the official release of dplyr 1.0.0, which introduced a new function that opened up new possibilities for data manipulation: across, one of the most powerful and versatile functions for working with data.
Before talking about it, let’s see how things were done previously.
Before across
As one of the greatest R packages, dplyr has a lot of functions, but two main verbs do most of the data manipulation:
summarise: applies a transformation that reduces the number of observations, e.g., mean;
mutate: transforms existing variables or creates new ones with the same number of observations, e.g., multiplying a variable by 2.
We will summarise every numerical variable of the penguins dataset from the palmerpenguins package by computing its mean. To do so, the mean function can be applied to each variable inside the verb summarise.
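As a minimal sketch of this manual approach, the code below writes one mean call per numeric column. The object name manual_means and the use of na.rm = TRUE (needed because penguins contains missing values) are assumptions made for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# One call to mean() per numeric column, typed by hand.
# na.rm = TRUE is an assumption: without it, columns with NAs return NA.
manual_means <- penguins %>%
  summarise(
    mean(bill_length_mm, na.rm = TRUE),
    mean(bill_depth_mm, na.rm = TRUE),
    mean(flipper_length_mm, na.rm = TRUE),
    mean(body_mass_g, na.rm = TRUE),
    mean(year, na.rm = TRUE)
  )

manual_means
```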
In the example above we see that it works, but it has some problems:
Because it is manual, it becomes tiresome when there are many columns, and writing or copy-pasting many lines of code increases the risk of human error;
If we do not name the new variables, each one is named after the function call that created it.
A smarter approach is to use a summarise variant called summarise_if.
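A sketch of what this could look like, assuming the same penguins data. The object name means_if is hypothetical, and na.rm = TRUE is an extra argument forwarded to mean so that missing values do not turn the results into NA.

```r
library(dplyr)
library(palmerpenguins)

# Apply mean() to every column that satisfies is.numeric().
# na.rm = TRUE is passed along to mean() through the dots of summarise_if().
means_if <- penguins %>%
  summarise_if(.predicate = is.numeric, .funs = mean, na.rm = TRUE)

means_if
```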
In the example above we see that inside summarise_if we define two arguments:
.predicate: the condition that determines which variables the functions are applied to;
.funs: a function or list of functions.
Unlike the first method, the function here kept the names of the original variables, even though the results are their means. Another advantage is that we can now apply a function to 5 columns with only 2 lines of code.
So it worked, but what is the issue? Inside summarise_if, every selected column receives the same functions, so there is no room for a separate computation on another column. What if we also wanted to know the mode of the variable species? How could we go about doing that?
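One possible answer is sketched below with across inside a regular summarise. Base R has no function for the statistical mode, so stat_mode is a hypothetical helper written for this illustration, and the names with_mode and species_mode are likewise assumptions.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: returns the most frequent value of x as a string.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

with_mode <- penguins %>%
  summarise(
    # Mean of every numeric column, keeping the original names
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
    # An unrelated summary on a different column, in the same call
    species_mode = stat_mode(species)
  )

with_mode
```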
In the example above we apply across. Note that it is used inside the conventional verb summarise, meaning we can still apply other functions alongside across.
So across is a function that complements mutate and summarise, allowing us to apply multiple functions across multiple variables.
As a curiosity, even though the old functions are superseded, they still exist; they are the variants with the suffixes _at(), _if() and _all().
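Since across also works inside mutate, here is a small sketch of that case. Converting the millimetre columns to centimetres is an example chosen for this illustration, and penguins_cm is a hypothetical name.

```r
library(dplyr)
library(palmerpenguins)

# Overwrite every column ending in "_mm" with its value divided by 10.
# The number of rows is unchanged, as expected from mutate().
penguins_cm <- penguins %>%
  mutate(across(ends_with("_mm"), ~ .x / 10))

head(penguins_cm)
```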
across
Now that we understand the overall goal of across, we will explore each argument of the function.
.cols
The first argument of across determines which columns of the data.frame our functions are applied to. This argument is:
Non-optional
The default is every variable of the data.frame, selected with the function everything. Since dplyr 1.1.0, relying on this default is deprecated, so in practice the argument should always be set.
Accepts as input:
Integers, referencing the variables' positions;
Strings, referencing the variables' names;
Selection helper functions, e.g., contains, as we will see below.
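A minimal sketch of a selection helper in .cols, assuming the penguins data; the object name mm_means is hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Mean of every column whose name ends with "_mm"
mm_means <- penguins %>%
  summarise(across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)))

mm_means
```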
In the example above we compute the mean for the variables that end with the pattern _mm.
All of the selection helpers can be used:
all_of: allows us to pass a character vector to select specific variables. This helps when we are looking for a group of variables that does not obey a simple condition, such as being of the same type or sharing a name pattern, e.g., all_of(vector_of_variables);
any_of: similar to all_of, but it does not raise an error when some of the names are missing from the data, which makes it handy for removing variables with the operator -, e.g., -any_of(vector_of_variables);
contains: selects variables that contain a specific string in their names, e.g., contains("length");
ends_with: variables that end with a specific string, e.g., ends_with("_mm");
everything: all variables; this is already the default of the argument .cols;
last_col: the last variable of the data.frame;
matches: variables with a name that matches a given regular expression;
num_range: variables that have a numeric sequence in their names, e.g., for var1, var2 and var3 we can use num_range("var", 1:3);
starts_with: variables that start with a specific string, e.g., starts_with("bill_").
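A few of the helpers listed above can be sketched as follows. The vector size_vars, the made-up column name "not_a_column" and the result names are assumptions for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# all_of(): select exactly the columns named in a character vector
size_vars <- c("bill_length_mm", "body_mass_g")
size_means <- penguins %>%
  summarise(across(all_of(size_vars), ~ mean(.x, na.rm = TRUE)))

# any_of(): tolerates names that do not exist, useful for dropping columns
kept_names <- penguins %>%
  select(-any_of(c("year", "not_a_column"))) %>%
  names()

# matches(): select with a regular expression
bill_means <- penguins %>%
  summarise(across(matches("^bill_.*_mm$"), ~ mean(.x, na.rm = TRUE)))
```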
By order
Another method of column selection is to use the names of the variables and the operator : to apply the function to a sequence of adjacent variables.
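A sketch of selecting a range of columns and applying an unnamed list of functions; the object name range_stats is hypothetical, and the lambdas are used here only to pass na.rm = TRUE.

```r
library(dplyr)
library(palmerpenguins)

# Columns from bill_length_mm to flipper_length_mm, in data order.
# The list of functions is deliberately left unnamed.
range_stats <- penguins %>%
  summarise(
    across(
      bill_length_mm:flipper_length_mm,
      list(~ mean(.x, na.rm = TRUE), ~ median(.x, na.rm = TRUE))
    )
  )

names(range_stats)
```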
In the example above, both the mean and the median are computed. Since more than one function is applied to each variable and the functions are not named, a numerical suffix is added based on the order in which the functions were defined inside the list, making mean 1 and median 2. This can be confusing and lead to errors later on.
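Naming the elements of the list is one way around this, sketched below; the object name named_stats is hypothetical.

```r
library(dplyr)
library(palmerpenguins)

# Same computation, but each function in the list now has a name,
# which across() appends to the column name as a suffix.
named_stats <- penguins %>%
  summarise(
    across(
      bill_length_mm:flipper_length_mm,
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE)
      )
    )
  )

names(named_stats)
```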
Since we named the functions in the example above, those names are automatically added as suffixes, and it is now clearer what each column contains.
.names
The argument .names determines the names of the resulting variables after the functions are applied, so it allows us to change them. This argument is:
Optional
The default is NULL
Accepts as input:
A string, where we can use {.col} and {.fn} as placeholders for the names of the columns and functions, respectively.
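As a sketch, the pattern below puts the function name before the column name. The pattern "{.fn}_{.col}" and the object name renamed_stats are choices made for this illustration.

```r
library(dplyr)
library(palmerpenguins)

# .names builds each output name from the function name and the column name
renamed_stats <- penguins %>%
  summarise(
    across(
      ends_with("_mm"),
      list(mean = ~ mean(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"
    )
  )

names(renamed_stats)
```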
# A tibble: 2 x 4
  name     Adelie Chinstrap Gentoo
  <chr>     <int>     <int>  <int>
1 heaviest   4775      4800   6300
2 lightest   2850      2700   3950
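One way a table like the one above could be built from penguins is sketched here. The use of tidyr for reshaping and the object name df_wide are assumptions; the original code is not shown in this section.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

df_wide <- penguins %>%
  group_by(species) %>%
  # Heaviest and lightest body mass per species
  summarise(
    heaviest = max(body_mass_g, na.rm = TRUE),
    lightest = min(body_mass_g, na.rm = TRUE)
  ) %>%
  # Long format: one row per species and statistic (columns name and value)
  pivot_longer(-species) %>%
  # Wide format: one column per species, one row per statistic
  pivot_wider(names_from = species, values_from = value)

df_wide
```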
In the preceding example, we create a data.frame in wide format, with a column for each species and two rows representing the heaviest and lightest penguins of each. Assume our goal is to calculate the total weight of the heaviest and lightest penguins, which entails adding the weights of the three species in a fourth column called total_weight.
# A tibble: 2 x 5
  name     Adelie Chinstrap Gentoo total_weight
  <chr>     <int>     <int>  <int>        <int>
1 heaviest   4775      4800   6300        15875
2 lightest   2850      2700   3950         9500
A simple solution is to type each variable name and add them together, which works but is not ideal, especially when we have many columns.
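The manual approach can be sketched as follows. The table is rebuilt by hand from the values printed earlier so that the snippet is self-contained; df_wide and with_total are hypothetical names.

```r
library(dplyr)

# Wide table with the values shown in the output above
df_wide <- tibble(
  name = c("heaviest", "lightest"),
  Adelie = c(4775L, 2850L),
  Chinstrap = c(4800L, 2700L),
  Gentoo = c(6300L, 3950L)
)

# Add the three species columns by naming each one explicitly
with_total <- df_wide %>%
  mutate(total_weight = Adelie + Chinstrap + Gentoo)

with_total
```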
# A tibble: 2 x 5
# Rowwise:
  name     Adelie Chinstrap Gentoo total_weight
  <chr>     <int>     <int>  <int>        <int>
1 heaviest   4775      4800   6300        15875
2 lightest   2850      2700   3950         9500
An alternative is to use c_across. It works together with the verb rowwise, which makes the commands that follow it operate by row instead of by column.
In addition, c_across differs from across in that it only has a column-selection argument (cols), so it must be placed inside another function. This gives it an advantage over the first approach, because we can now use that function's arguments.
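A minimal sketch of rowwise with c_across, again rebuilding the table by hand. Selecting the species columns with Adelie:Gentoo is a choice made for this illustration; df_wide and rowwise_total are hypothetical names.

```r
library(dplyr)

df_wide <- tibble(
  name = c("heaviest", "lightest"),
  Adelie = c(4775L, 2850L),
  Chinstrap = c(4800L, 2700L),
  Gentoo = c(6300L, 3950L)
)

rowwise_total <- df_wide %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into a vector,
  # which sum() then reduces to a single value per row
  mutate(total_weight = sum(c_across(Adelie:Gentoo)))

rowwise_total
```

The result stays row-wise, as in the printed output; calling ungroup() afterwards returns it to a regular tibble.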
# A tibble: 2 x 6
# Rowwise:
  name     Adelie Chinstrap Gentoo total_weight_plus total_weight_cacross
  <chr>     <int>     <int>  <int>             <int>                <int>
1 heaviest   4775        NA   6300                NA                11075
2 lightest   2850      2700   3950              9500                19000
Finally, the output above shows what happens when there is an NA in the data. Most R functions return NA by default when an NA is present in the data, so we benefit from using the function sum, which has an argument to ignore missing values (na.rm = TRUE).
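The comparison can be sketched as below, with the missing value placed in the same cell as in the output. The names df_na and na_totals, and the explicit selection Adelie:Gentoo, are assumptions for this illustration.

```r
library(dplyr)

# Same wide table, with the heaviest Chinstrap weight set to NA
df_na <- tibble(
  name = c("heaviest", "lightest"),
  Adelie = c(4775L, 2850L),
  Chinstrap = c(NA, 2700L),
  Gentoo = c(6300L, 3950L)
)

na_totals <- df_na %>%
  rowwise() %>%
  mutate(
    # The + operator propagates NA
    total_weight_plus = Adelie + Chinstrap + Gentoo,
    # sum() can drop missing values through na.rm = TRUE
    total_weight_cacross = sum(c_across(Adelie:Gentoo), na.rm = TRUE)
  )

na_totals
```

In this sketch the lightest row totals 9500 in both columns. The 19000 in the printed output equals 2850 + 2700 + 3950 + 9500, which suggests the original selection also picked up the total_weight_plus column created just before it; naming the species columns explicitly avoids that.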