It’s not the most interesting function, or even that much fun. But dplyr’s select
has a lot of functionality that is often overlooked. All select
does is ‘select’ columns, yet understanding some of its functionality deeper than surface level saves me a bunch of time in my day-to-day work. Let’s take a tour of where dplyr::select
can save you time too, by calling out situations where I find myself using the function over and over again.
library(dplyr)
Let’s imagine we are working with the following data frame.
x = tibble(id = 1:100,
category1 = rep(c(rep('a',10), rep('b',10)),5),
category2 = base::sample(x=c('red','green','blue'),replace = T, size = 100),
value1 = round(runif(100),2),
value2 = round(rnorm(100,mean = 1),2),
value3 = round(rnorm(100,mean = 2),2),
value4 = round(rnorm(100,mean = 3),2),
value5 = round(rnorm(100,mean = 4),2),
value6 = round(rnorm(100,mean = 5),2))
x %>% head() %>% knitr::kable()
id | category1 | category2 | value1 | value2 | value3 | value4 | value5 | value6 |
---|---|---|---|---|---|---|---|---|
1 | a | blue | 1.00 | 1.12 | 4.31 | 3.83 | 3.87 | 4.10 |
2 | a | red | 0.42 | 0.61 | 2.44 | 3.00 | 4.03 | 3.07 |
3 | a | green | 0.56 | 1.47 | 2.01 | 0.48 | 3.62 | 5.95 |
4 | a | blue | 0.39 | 1.22 | 2.02 | 2.97 | 2.37 | 6.14 |
5 | a | blue | 0.42 | 1.32 | 1.61 | 2.49 | 5.13 | 4.61 |
6 | a | red | 0.71 | 0.55 | 3.17 | 1.86 | 3.25 | 5.29 |
We can imagine situations where we want to only grab a specific column or two in a tidy fashion - dplyr::select
is literally made for that!
x %>%
select(id, value2)
## # A tibble: 100 x 2
## id value2
## <int> <dbl>
## 1 1 1.12
## 2 2 0.61
## 3 3 1.47
## 4 4 1.22
## 5 5 1.32
## 6 6 0.55
## 7 7 1.7
## 8 8 0.97
## 9 9 1.23
## 10 10 3.06
## # ... with 90 more rows
Now, what if we want to remove a column?
x %>%
select(-id)
## # A tibble: 100 x 8
## category1 category2 value1 value2 value3 value4 value5 value6
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a blue 1 1.12 4.31 3.83 3.87 4.1
## 2 a red 0.42 0.61 2.44 3 4.03 3.07
## 3 a green 0.56 1.47 2.01 0.48 3.62 5.95
## 4 a blue 0.39 1.22 2.02 2.97 2.37 6.14
## 5 a blue 0.42 1.32 1.61 2.49 5.13 4.61
## 6 a red 0.71 0.55 3.17 1.86 3.25 5.29
## 7 a blue 0.61 1.7 1.25 2.87 1.95 5.88
## 8 a green 0.38 0.97 2.54 -0.08 3.44 3.9
## 9 a blue 0.11 1.23 1.95 2.45 2.08 3.65
## 10 a green 0.72 3.06 2.09 3.93 4.08 3.91
## # ... with 90 more rows
That’s right, you can simply drop columns by adding a -
to the column name.
What about renaming a column?
x %>%
select(color = category2)
## # A tibble: 100 x 1
## color
## <chr>
## 1 blue
## 2 red
## 3 green
## 4 blue
## 5 blue
## 6 red
## 7 blue
## 8 green
## 9 blue
## 10 green
## # ... with 90 more rows
This is a handy alternative to using dplyr::rename
.
Ok, but what if we want to rename a column, and grab every other column except for one?
x %>%
select(color = category2, everything(), -id)
## # A tibble: 100 x 8
## color category1 value1 value2 value3 value4 value5 value6
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 blue a 1 1.12 4.31 3.83 3.87 4.1
## 2 red a 0.42 0.61 2.44 3 4.03 3.07
## 3 green a 0.56 1.47 2.01 0.48 3.62 5.95
## 4 blue a 0.39 1.22 2.02 2.97 2.37 6.14
## 5 blue a 0.42 1.32 1.61 2.49 5.13 4.61
## 6 red a 0.71 0.55 3.17 1.86 3.25 5.29
## 7 blue a 0.61 1.7 1.25 2.87 1.95 5.88
## 8 green a 0.38 0.97 2.54 -0.08 3.44 3.9
## 9 blue a 0.11 1.23 1.95 2.45 2.08 3.65
## 10 green a 0.72 3.06 2.09 3.93 4.08 3.91
## # ... with 90 more rows
I think of dplyr::everything
(actually from the tidyselect
package) as analogous to SELECT *
in SQL.
Well, what if we want to grab all of our ‘value’ columns?
x %>%
select(starts_with('value'))
## # A tibble: 100 x 6
## value1 value2 value3 value4 value5 value6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.12 4.31 3.83 3.87 4.1
## 2 0.42 0.61 2.44 3 4.03 3.07
## 3 0.56 1.47 2.01 0.48 3.62 5.95
## 4 0.39 1.22 2.02 2.97 2.37 6.14
## 5 0.42 1.32 1.61 2.49 5.13 4.61
## 6 0.71 0.55 3.17 1.86 3.25 5.29
## 7 0.61 1.7 1.25 2.87 1.95 5.88
## 8 0.38 0.97 2.54 -0.08 3.44 3.9
## 9 0.11 1.23 1.95 2.45 2.08 3.65
## 10 0.72 3.06 2.09 3.93 4.08 3.91
## # ... with 90 more rows
This is particularly handy for selections after pivoting wider where we may have many columns that have a certain prefix. See also dplyr::one_of
.
Well what if we only want a few value columns?
Say, 1-4?
x %>%
select(num_range(prefix = 'value', range = 1:4))
## # A tibble: 100 x 4
## value1 value2 value3 value4
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.12 4.31 3.83
## 2 0.42 0.61 2.44 3
## 3 0.56 1.47 2.01 0.48
## 4 0.39 1.22 2.02 2.97
## 5 0.42 1.32 1.61 2.49
## 6 0.71 0.55 3.17 1.86
## 7 0.61 1.7 1.25 2.87
## 8 0.38 0.97 2.54 -0.08
## 9 0.11 1.23 1.95 2.45
## 10 0.72 3.06 2.09 3.93
## # ... with 90 more rows
Also see dplyr::contains
for more ways to select based on matching a literal string.
For even more, checkout the documentation and in particular the ’useful functions portion.