5 min read

Dplyr’s select - tips and tricks

It’s not the most interesting function, or even that much fun. But dplyr’s select has a lot of functionality that is often overlooked. All select does is ‘select’ columns, yet understanding some of its functionality deeper than surface level saves me a bunch of time in my day-to-day work. Let’s take a tour of where dplyr::select can save you time too, by calling out situations where I find myself using the function over and over again.

library(dplyr)

Let’s imagine we are working with the following data frame.

x = tibble(id = 1:100,
           category1 = rep(c(rep('a',10), rep('b',10)),5),
           category2 = base::sample(x=c('red','green','blue'),replace = T, size = 100),
           value1 = round(runif(100),2),
           value2 = round(rnorm(100,mean = 1),2),
           value3 = round(rnorm(100,mean = 2),2),
           value4 = round(rnorm(100,mean = 3),2),
           value5 = round(rnorm(100,mean = 4),2),
           value6 = round(rnorm(100,mean = 5),2))
x %>% head() %>% knitr::kable()
id category1 category2 value1 value2 value3 value4 value5 value6
1 a blue 1.00 1.12 4.31 3.83 3.87 4.10
2 a red 0.42 0.61 2.44 3.00 4.03 3.07
3 a green 0.56 1.47 2.01 0.48 3.62 5.95
4 a blue 0.39 1.22 2.02 2.97 2.37 6.14
5 a blue 0.42 1.32 1.61 2.49 5.13 4.61
6 a red 0.71 0.55 3.17 1.86 3.25 5.29

We can imagine situations where we want to only grab a specific column or two in a tidy fashion - dplyr::select is literally made for that!

x %>%
        select(id, value2)
## # A tibble: 100 x 2
##       id value2
##    <int>  <dbl>
##  1     1   1.12
##  2     2   0.61
##  3     3   1.47
##  4     4   1.22
##  5     5   1.32
##  6     6   0.55
##  7     7   1.7 
##  8     8   0.97
##  9     9   1.23
## 10    10   3.06
## # ... with 90 more rows

Now, what if we want to remove a column?

x %>%
        select(-id)
## # A tibble: 100 x 8
##    category1 category2 value1 value2 value3 value4 value5 value6
##    <chr>     <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 a         blue        1      1.12   4.31   3.83   3.87   4.1 
##  2 a         red         0.42   0.61   2.44   3      4.03   3.07
##  3 a         green       0.56   1.47   2.01   0.48   3.62   5.95
##  4 a         blue        0.39   1.22   2.02   2.97   2.37   6.14
##  5 a         blue        0.42   1.32   1.61   2.49   5.13   4.61
##  6 a         red         0.71   0.55   3.17   1.86   3.25   5.29
##  7 a         blue        0.61   1.7    1.25   2.87   1.95   5.88
##  8 a         green       0.38   0.97   2.54  -0.08   3.44   3.9 
##  9 a         blue        0.11   1.23   1.95   2.45   2.08   3.65
## 10 a         green       0.72   3.06   2.09   3.93   4.08   3.91
## # ... with 90 more rows

That’s right, you can simply drop columns by adding a - to the column name.

What about renaming a column?

x %>%
        select(color = category2)
## # A tibble: 100 x 1
##    color
##    <chr>
##  1 blue 
##  2 red  
##  3 green
##  4 blue 
##  5 blue 
##  6 red  
##  7 blue 
##  8 green
##  9 blue 
## 10 green
## # ... with 90 more rows

This is a handy alternative to using dplyr::rename.

Ok, but what if we want to rename a column, and grab every other column except for one?

x %>%
        select(color = category2, everything(), -id)
## # A tibble: 100 x 8
##    color category1 value1 value2 value3 value4 value5 value6
##    <chr> <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 blue  a           1      1.12   4.31   3.83   3.87   4.1 
##  2 red   a           0.42   0.61   2.44   3      4.03   3.07
##  3 green a           0.56   1.47   2.01   0.48   3.62   5.95
##  4 blue  a           0.39   1.22   2.02   2.97   2.37   6.14
##  5 blue  a           0.42   1.32   1.61   2.49   5.13   4.61
##  6 red   a           0.71   0.55   3.17   1.86   3.25   5.29
##  7 blue  a           0.61   1.7    1.25   2.87   1.95   5.88
##  8 green a           0.38   0.97   2.54  -0.08   3.44   3.9 
##  9 blue  a           0.11   1.23   1.95   2.45   2.08   3.65
## 10 green a           0.72   3.06   2.09   3.93   4.08   3.91
## # ... with 90 more rows

I think of dplyr::everything (actually from the tidyselect package) as analogous to SELECT * in SQL.

Well, what if we want to grab all of our ‘value’ columns?

x %>%
        select(starts_with('value'))
## # A tibble: 100 x 6
##    value1 value2 value3 value4 value5 value6
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1   1      1.12   4.31   3.83   3.87   4.1 
##  2   0.42   0.61   2.44   3      4.03   3.07
##  3   0.56   1.47   2.01   0.48   3.62   5.95
##  4   0.39   1.22   2.02   2.97   2.37   6.14
##  5   0.42   1.32   1.61   2.49   5.13   4.61
##  6   0.71   0.55   3.17   1.86   3.25   5.29
##  7   0.61   1.7    1.25   2.87   1.95   5.88
##  8   0.38   0.97   2.54  -0.08   3.44   3.9 
##  9   0.11   1.23   1.95   2.45   2.08   3.65
## 10   0.72   3.06   2.09   3.93   4.08   3.91
## # ... with 90 more rows

This is particularly handy for selections after pivoting wider where we may have many columns that have a certain prefix. See also dplyr::one_of.

Well what if we only want a few value columns?

Say, 1-4?

x %>%
        select(num_range(prefix = 'value', range = 1:4))
## # A tibble: 100 x 4
##    value1 value2 value3 value4
##     <dbl>  <dbl>  <dbl>  <dbl>
##  1   1      1.12   4.31   3.83
##  2   0.42   0.61   2.44   3   
##  3   0.56   1.47   2.01   0.48
##  4   0.39   1.22   2.02   2.97
##  5   0.42   1.32   1.61   2.49
##  6   0.71   0.55   3.17   1.86
##  7   0.61   1.7    1.25   2.87
##  8   0.38   0.97   2.54  -0.08
##  9   0.11   1.23   1.95   2.45
## 10   0.72   3.06   2.09   3.93
## # ... with 90 more rows

Also see dplyr::contains for more ways to select based on matching a literal string.

For even more, checkout the documentation and in particular the ’useful functions portion.