Chapter 4 Data mangling with package dplyr
This package, which is also in the tidyverse, is quite versatile. You can use it for a wide range of activities. Some examples are
- summarizing data; e.g. counting, ranking,
- selecting, filtering and sampling cases
- manipulating data; creating new or changing existing variables
- combining tables
In this chapter, only a small selection of this package's functionality will be discussed.
There is an excellent cheat sheet for this package. You can find it here. For convenience, primarily because this gitbook is also used in offline mode during examinations, I included it here as well: dplyr-data-transformation.pdf
Before embarking on an overview of the most important functions, let's first look at the tibble and the %>% chaining operator.
The sections below are copied (and adapted) in part from the dplyr and tibble vignettes, which can be found here and here.
In this chapter I will often use the term case instead of row and variable instead of column since they more precisely describe the essence. Also, these terms are used more in the tidyverse packages.
4.1 Tibbles
Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (e.g. converting character vectors to factors). There is an entire package dedicated to tibbles, not surprisingly called tibble. You usually do not have to load that package because dplyr and tidyr already do so (they depend on it themselves).
Use the tibble() constructor to create them as literals. There are several advantages over the old data.frame() constructor:
- It never changes an input's type (i.e., no more stringsAsFactors = FALSE!).
- It never adjusts the names of variables: name with space does not become name.with.space.
- It evaluates its arguments lazily and sequentially:
tibble(x = 1:5, y = x ^ 2)
## # A tibble: 5 x 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     4
## 3     3     9
## 4     4    16
## 5     5    25
- It never uses row.names(). The whole point of tidy data is to store variables in a consistent way, so it never stores a variable as a special attribute.
- It only recycles vectors of length 1, because recycling vectors of greater lengths is a frequent source of bugs.
Coercion
To complement tibble(), the tibble package provides as_tibble() to coerce objects into tibbles.
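For instance, a minimal sketch (output omitted):
as_tibble(iris)                                 # coerce a data.frame to a tibble
as_tibble(list(x = 1:3, y = c("a", "b", "c")))  # a named list of equal-length vectors works as well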
4.1.1 By-row constructor
There is a third function, tribble(), that you can use to define a table in an alternative way: row-wise.
tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
)
## # A tibble: 3 x 2
## colA colB
## <chr> <dbl>
## 1 a 1
## 2 b 2
## 3 c 3
4.1.2 Tibbles vs data frames
There are three key differences between tibbles and data frames: printing, subsetting, and recycling rules.
Printing
- When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type, and uses font styles and color for highlighting.
- You can control the default appearance with options:
  - options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
  - options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.
Subsetting
Tibbles are quite strict about subsetting. [ always returns another tibble. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a vector.
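A small sketch of this difference, using a hypothetical two-column table defined both ways:
df <- data.frame(x = 1:3, y = 4:6)
tb <- tibble(x = 1:3, y = 4:6)
class(df[, "x"])  # "integer": the data frame drops to a plain vector
class(tb[, "x"])  # "tbl_df" "tbl" "data.frame": still a tibble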
Recycling
When constructing a tibble, only values of length 1 are recycled. The first column with a length different from one determines the number of rows in the tibble; conflicts lead to an error. This also extends to tibbles with zero rows, which is sometimes important for programming.
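A minimal sketch of these recycling rules (the second call is expected to fail, so it is commented out):
tibble(x = 1:4, y = 1)          # y (length 1) is recycled to length 4
# tibble(x = 1:4, y = c(1, 2))  # error: y must have length 1 or 4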
4.1.3 The str() equivalent: glimpse()
The glimpse() function is the dplyr equivalent of str():
glimpse(dose_response_long)
## Rows: 52
## Columns: 4
## $ patient <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…
## $ sex <fct> f, f, m, m, f, f, f, m, f, m, m, m, m, f, m, f, f, m, m, f, f…
## $ Dose <chr> "dose10mg", "dose10mg", "dose10mg", "dose10mg", "dose10mg", "…
## $ Response <int> 12, 11, 54, 71, 19, 22, 23, 68, 30, 83, 72, 48, 67, 13, 73, 2…
4.2 The chaining operator %>%
In any workflow, it happens all the time that you apply some function to a dataframe, store the result in a new variable (or overwrite the original), apply a second function to that result, and so on. This has two undesirable consequences. The first is cluttered code: many variables; how are you going to name them all? Just have a look at the previous chapter and you'll understand. The second, which is much worse if you are working with big dataframes, is a cluttered environment and a growing memory footprint.
This is where the chaining operator comes in. It helps you create clean workflows where intermediate results are only stored when opportune.
It comes down to this simple notion: x %>% f(y) is equivalent to f(x, y), where f is any function.
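A trivial sketch of that equivalence:
mean(c(1, 5, 9))       # "classic" call
c(1, 5, 9) %>% mean()  # the same call, written as a chain
## [1] 5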
Here is the good old dose-response example again, converted to a tibble.
dose_response_long
## # A tibble: 52 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 001 f dose10mg 12
## 2 002 f dose10mg 11
## 3 003 m dose10mg 54
## 4 004 m dose10mg 71
## 5 005 f dose10mg 19
## 6 006 f dose10mg 22
## 7 007 f dose10mg 23
## 8 008 m dose10mg 68
## 9 009 f dose10mg 30
## 10 010 m dose10mg 83
## # … with 42 more rows
Suppose I want to remove cases with missing values (there aren’t any - this is for the sake of argument), select the female subjects and then calculate the mean response for the two doses. In base R, you could do something like this.
dose_response_long_no_na <- na.omit(dose_response_long)
dose_response_long_no_na_only_female <- subset(x = dose_response_long_no_na, subset = sex == "f")
aggregate(Response ~ Dose, data = dose_response_long_no_na_only_female, FUN = mean)
## Dose Response
## 1 dose100mg 80.9
## 2 dose10mg 19.6
I know, I exaggerated a bit with the variable names.
Here is the same workflow, using dplyr, but still with the intermediate variables. It even has one extra explicit operation (group_by()).
dose_response_long_no_na <- drop_na(dose_response_long)
dose_response_long_no_na_only_female <- filter(dose_response_long_no_na, sex == "f")
dose_response_long_no_na_only_female_grouped <- group_by(dose_response_long_no_na_only_female, Dose)
summarize(dose_response_long_no_na_only_female_grouped, mean_response = mean(Response))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Dose mean_response
## <chr> <dbl>
## 1 dose100mg 80.9
## 2 dose10mg 19.6
And, finally, how dplyr is supposed to be used.
dose_response_long %>%
  drop_na() %>%
  filter(sex == "f") %>%
  group_by(Dose) %>%
  summarize(mean_response = mean(Response))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Dose mean_response
## <chr> <dbl>
## 1 dose100mg 80.9
## 2 dose10mg 19.6
Isn’t that a treat for your eyes? A highly readable, minimal piece of code, and what’s more - no environment clogged with data you forget to clean up.
Note that drop_na() is actually from the tidyr package. However, it works seamlessly in the chaining context of dplyr functions.
This is the power of dplyr and the chaining operator! If you do actually want the result stored, you only need to assign to a single variable at the beginning of the chain.
Does it work with ggplot2 as well?
dose_response_long %>%
  drop_na() %>%
  ggplot(mapping = aes(x = sex, y = Response)) +
  geom_boxplot() +
  facet_wrap(Dose ~ .)
I don’t know about you, but this kind of thing makes me happy!
The only thing that bothers me slightly is the + instead of %>% in the ggplot2 context. On the other hand, what ggplot2 does is layering, not chaining, so there is a clear distinction.
You have seen the essence of the tidyverse: clean chained workflows.
The sections below are copied (and adapted) for a large part from the dplyr vignette, which can be found here.
Dplyr aims to provide a function for each basic verb of data manipulation:
- filter() and distinct() to select cases based on (the uniqueness of) their values.
- arrange() to reorder the cases.
- select() and rename() to select variables based on their names.
- mutate() and transmute() to add new variables that are functions of existing variables.
- summarise() to condense multiple values to a single value.
- sample_n() and sample_frac() to take random samples.
4.3 Selecting
4.3.1 Selecting rows by index: slice()
If you simply want to select rows by index, use slice():
slice(dose_response_long, 2:4)
## # A tibble: 3 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 002 f dose10mg 11
## 2 003 m dose10mg 54
## 3 004 m dose10mg 71
The functions head() and tail() work as expected; they work with tibbles as well (by overloading) and in the context of chained actions.
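A small sketch (output omitted):
dose_response_long %>% head(3)  # first three cases
dose_response_long %>% tail(3)  # last three cases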
4.3.2 Get unique cases with distinct()
The distinct() function retains only unique/distinct cases from an input tbl or data.frame. You provide the variables to use when determining uniqueness; if omitted, all variables are used. If there are multiple cases for a given combination of inputs, only the first case will be preserved. The .keep_all argument specifies whether all variables in the tbl should be kept.
dose_response_long %>% distinct(sex, Dose, .keep_all = T)
## # A tibble: 4 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 001 f dose10mg 12
## 2 003 m dose10mg 54
## 3 001 f dose100mg 88
## 4 003 m dose100mg 14
dose_response_long %>% distinct(Dose)
## # A tibble: 2 x 1
## Dose
## <chr>
## 1 dose10mg
## 2 dose100mg
4.3.3 Selecting cases: filter()
This function is similar to the subset argument of the subset() function.
The filter() function allows you to select a subset of cases in a data frame. The first argument is the tibble or data frame. The second and subsequent arguments refer to variables within that data frame, selecting cases where the expression is TRUE.
dose_response_long %>% filter(Dose == "dose10mg" & Response > 60)
## # A tibble: 7 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 004 m dose10mg 71
## 2 008 m dose10mg 68
## 3 010 m dose10mg 83
## 4 011 m dose10mg 72
## 5 013 m dose10mg 67
## 6 015 m dose10mg 73
## 7 024 m dose10mg 61
When you want to filter rows based on a regular expression pattern matching a character value, you can do something like the chunk below, because the only thing filter() needs is a logical vector.
(t <- tibble(x = c("abb", "bbc", "dbbd", "aacc"), y = 1:4))
## # A tibble: 4 x 2
## x y
## <chr> <int>
## 1 abb 1
## 2 bbc 2
## 3 dbbd 3
## 4 aacc 4
t %>% filter(grepl("bb", x))
## # A tibble: 3 x 2
## x y
## <chr> <int>
## 1 abb 1
## 2 bbc 2
## 3 dbbd 3
Using str_detect() from the stringr tidyverse package, this is also possible:
t %>% filter(str_detect(x, "bb"))
## # A tibble: 3 x 2
## x y
## <chr> <int>
## 1 abb 1
## 2 bbc 2
## 3 dbbd 3
4.3.4 Selecting variables: select()
This function is similar to the select argument of the subset() function.
Choose variables from a table. It is closely related to rename(), discussed below; select() keeps only the listed variables, whereas rename() keeps all variables.
When you use the key = value format, this will result in a rename of the variable:
select(dose_response_long, patient, gender = sex)
## # A tibble: 52 x 2
## patient gender
## <chr> <fct>
## 1 001 f
## 2 002 f
## 3 003 m
## 4 004 m
## 5 005 f
## 6 006 f
## 7 007 f
## 8 008 m
## 9 009 f
## 10 010 m
## # … with 42 more rows
Use the minus sign when you want to select everything but a variable:
select(dose_response_long, -patient, -sex)
## # A tibble: 52 x 2
## Dose Response
## <chr> <int>
## 1 dose10mg 12
## 2 dose10mg 11
## 3 dose10mg 54
## 4 dose10mg 71
## 5 dose10mg 19
## 6 dose10mg 22
## 7 dose10mg 23
## 8 dose10mg 68
## 9 dose10mg 30
## 10 dose10mg 83
## # … with 42 more rows
## same as
#select(dose_response_long, -c(patient, sex))
You can use the colon operator to indicate a range of variables:
select(dose_response_long, patient:Dose)
## # A tibble: 52 x 3
## patient sex Dose
## <chr> <fct> <chr>
## 1 001 f dose10mg
## 2 002 f dose10mg
## 3 003 m dose10mg
## 4 004 m dose10mg
## 5 005 f dose10mg
## 6 006 f dose10mg
## 7 007 f dose10mg
## 8 008 m dose10mg
## 9 009 f dose10mg
## 10 010 m dose10mg
## # … with 42 more rows
4.3.4.1 Tidyselect helpers
Both with select() and rename(), and also with mutate_at(), you can use the special helper functions of the tidyselect package:
- starts_with(): Starts with a prefix.
- ends_with(): Ends with a suffix.
- contains(): Contains a literal string.
- matches(): Matches a regular expression.
- num_range(): Matches a numerical range like x01, x02, x03.
- one_of(): Matches variable names in a character vector.
- everything(): Matches all variables.
- last_col(): Select last variable, possibly with an offset.
select(dose_response_long, contains("o"))
## # A tibble: 52 x 2
## Dose Response
## <chr> <int>
## 1 dose10mg 12
## 2 dose10mg 11
## 3 dose10mg 54
## 4 dose10mg 71
## 5 dose10mg 19
## 6 dose10mg 22
## 7 dose10mg 23
## 8 dose10mg 68
## 9 dose10mg 30
## 10 dose10mg 83
## # … with 42 more rows
4.3.5 Renaming variables: rename()
Rename variables from a table.
%>% rename(Patient = patient, Gender = sex) dose_response_long
## # A tibble: 52 x 4
## Patient Gender Dose Response
## <chr> <fct> <chr> <int>
## 1 001 f dose10mg 12
## 2 002 f dose10mg 11
## 3 003 m dose10mg 54
## 4 004 m dose10mg 71
## 5 005 f dose10mg 19
## 6 006 f dose10mg 22
## 7 007 f dose10mg 23
## 8 008 m dose10mg 68
## 9 009 f dose10mg 30
## 10 010 m dose10mg 83
## # … with 42 more rows
4.3.6 Selecting from ranked data
The top_n() function makes it easy to select a few cases based on the ranking of a value:
dose_response_long %>% top_n(3, Response)
## # A tibble: 4 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 005 f dose100mg 89
## 2 006 f dose100mg 99
## 3 017 f dose100mg 96
## 4 021 f dose100mg 89
We see 4 cases returned because the third rank is the same for two cases.
This is especially interesting with grouped data:
dose_response_long %>% group_by(Dose) %>% top_n(3, Response)
## # A tibble: 7 x 4
## # Groups: Dose [2]
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 010 m dose10mg 83
## 2 011 m dose10mg 72
## 3 015 m dose10mg 73
## 4 005 f dose100mg 89
## 5 006 f dose100mg 99
## 6 017 f dose100mg 96
## 7 021 f dose100mg 89
4.3.7 Extract a column as vector
Using pull() you can obtain atomic vectors.
pull(dose_response_long, Response)
## [1] 12 11 54 71 19 22 23 68 30 83 72 48 67 13 73 20 22 40 57 26 17 29 54 61 57
## [26] 11 88 54 14 21 89 99 69 31 85 18 37 28 16 79 22 84 96 14 12 63 89 77 21 10
## [51] 36 80
This is of course the same as dose_response_long[[4]] or dose_response_long$Response, but the difference is that pull() can be applied in a %>% pipeline. With dose_response_long[, 4] it matters whether you are working with a tibble or a dataframe: a tibble returns a tibble and a dataframe returns a vector.
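A small sketch of pull() used mid-pipeline, here to get the mean response of the female subjects (output omitted):
dose_response_long %>%
  filter(sex == "f") %>%
  pull(Response) %>%
  mean()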
4.3.8 Sorting with arrange()
If you want to sort the rows of a dataframe/tibble by the values of one or more columns, use arrange():
dose_response_long %>% arrange(Response) %>% slice(1:3)
## # A tibble: 3 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 024 m dose100mg 10
## 2 002 f dose10mg 11
## 3 026 f dose10mg 11
Use the function desc() to reverse the ordering:
dose_response_long %>% arrange(desc(Response)) %>% head(1)
## # A tibble: 1 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 006 f dose100mg 99
Note that the previous chunk is equivalent to this:
dose_response_long %>% top_n(1, Response)
## # A tibble: 1 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 006 f dose100mg 99
So natural ordering is from low to high, but the top_n() function always ranks from high to low. You can reverse this as well using the desc() function.
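As a sketch, either of these should give the lowest response instead of the highest (output omitted):
dose_response_long %>% top_n(1, desc(Response))
dose_response_long %>% top_n(-1, Response)  # a negative n selects from the bottom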
4.3.9 Random sampling
There are two functions available for random sampling: sample_n() and sample_frac().
sample_frac(dose_response_long, 0.05, replace = TRUE)
## # A tibble: 3 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 014 f dose100mg 79
## 2 020 f dose10mg 26
## 3 022 f dose100mg 77
The workings of sample_n() are obvious, I guess.
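For completeness, a small sketch: it samples a fixed number of cases instead of a fraction (output omitted).
sample_n(dose_response_long, 5)                  # 5 random cases
sample_n(dose_response_long, 5, replace = TRUE)  # sampling with replacement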
4.4 Adding and changing variables
4.4.1 Window functions
A new variable is usually the result of some operation on one or more existing variables. The data in an original variable is processed such that for each old value a new value is generated. Functions that carry out this kind of operation are called window functions: functions that take a vector and return another vector of the same length.
For instance, the cumsum() function returns the cumulative sum of a numeric vector:
cumsum(1:5)
## [1] 1 3 6 10 15
Here are the main window functions. Note that some of them come from base R. Later you will see the use of several of them in concert with the mutate() function.
- dplyr::lead: Copy with values shifted by 1.
  lead(c(1, 4, 2, 8))
  ## [1] 4 2 8 NA
- dplyr::lag: Copy with values lagged by 1.
  lag(c(1, 4, 2, 8))
  ## [1] NA 1 4 2
- dplyr::min_rank: Ranks on values, from low to high. Use desc() to reverse.
  min_rank(c(5, 4, 2, 8))
  ## [1] 3 2 1 4
- dplyr::ntile: Bin vector into n buckets.
  ntile(c(5, 4, 2, 8, 1), 3)
  ## [1] 2 2 1 3 1
- dplyr::between: Are values between a and b?
  between(c(5, 4, 2, 8, 1), 3, 5)
  ## [1] TRUE TRUE FALSE FALSE FALSE
- dplyr::cummean: Cumulative mean.
  cummean(c(5, 4, 2, 8, 1))
  ## [1] 5.00 4.50 3.67 4.75 4.00
- cumsum: Cumulative sum.
  cumsum(c(5, 4, 2, 8, 1))
  ## [1] 5 9 11 19 20
- cummax: Cumulative maximum.
  cummax(c(5, 4, 2, 8, 1))
  ## [1] 5 5 5 8 8
- cummin: Cumulative minimum.
  cummin(c(5, 4, 2, 8, 1))
  ## [1] 5 4 2 2 1
- cumprod: Cumulative product.
  cumprod(c(5, 4, 2, 8, 1))
  ## [1] 5 20 40 320 320
- pmax: Element-wise maximum.
  pmax(c(5, 4, 2, 8, 1), c(2, 2, 3, 4, 3))
  ## [1] 5 4 3 8 3
- pmin: Element-wise minimum.
  pmin(c(5, 4, 2, 8, 1), c(2, 2, 3, 4, 3))
  ## [1] 2 2 2 4 1
4.4.2 Add one or more variables: mutate()
The function mutate() can be used to calculate and append one or more columns. The window functions from the previous section are often-used helpers.
For instance, given the ChickWeight dataset, which shows weight gain for 50 chicks:
chicks <- as_tibble(ChickWeight)
chicks %>% head(5)
## # A tibble: 5 x 4
## weight Time Chick Diet
## <dbl> <dbl> <ord> <fct>
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
Suppose we want to know the daily weight gain of these chicks (as a challenge, you could try to do this in base R).
Using lag() and mutate() this is a breeze (or so it seems):
(chicks <- chicks %>% mutate(weight_gain = weight - lag(weight)))
## # A tibble: 578 x 5
## weight Time Chick Diet weight_gain
## <dbl> <dbl> <ord> <fct> <dbl>
## 1 42 0 1 1 NA
## 2 51 2 1 1 9
## 3 59 4 1 1 8
## 4 64 6 1 1 5
## 5 76 8 1 1 12
## 6 93 10 1 1 17
## 7 106 12 1 1 13
## 8 125 14 1 1 19
## 9 149 16 1 1 24
## 10 171 18 1 1 22
## # … with 568 more rows
…but the devil is in the details:
chicks %>% slice(10:15)
## # A tibble: 6 x 5
## weight Time Chick Diet weight_gain
## <dbl> <dbl> <ord> <fct> <dbl>
## 1 171 18 1 1 22
## 2 199 20 1 1 28
## 3 205 21 1 1 6
## 4 40 0 2 1 -165
## 5 49 2 2 1 9
## 6 58 4 2 1 9
The transition from chick 1 to chick 2 is not taken into account! So to get the weight gain for each chick, we need to split the data first. This is dealt with in a later section but here is a preview:
chicks <- chicks %>%
  group_by(Chick) %>% # split on chicks
  mutate(weight_gain = weight - lag(weight)) %>%
  ungroup() # put together again
slice(chicks, 10:15)
## # A tibble: 6 x 5
## weight Time Chick Diet weight_gain
## <dbl> <dbl> <ord> <fct> <dbl>
## 1 171 18 1 1 22
## 2 199 20 1 1 28
## 3 205 21 1 1 6
## 4 40 0 2 1 NA
## 5 49 2 2 1 9
## 6 58 4 2 1 9
Can you use a custom function in a mutate context? Of course you can!
my_z <- function(x) {
  abs((abs(x - mean(x)) / sd(x)))
}
women %>%
  mutate(z_score = my_z(weight)) %>%
  head()
## height weight z_score
## 1 58 115 1.402
## 2 59 117 1.273
## 3 60 120 1.080
## 4 61 123 0.886
## 5 62 126 0.693
## 6 63 129 0.499
And what’s more, you can make multiple columns in one operation where the calculations for the subsequent columns are interdependent.
women %>%
  mutate(z_score = my_z(weight),
         z_bin = ntile(z_score, 3)) %>%
  head()
## height weight z_score z_bin
## 1 58 115 1.402 3
## 2 59 117 1.273 3
## 3 60 120 1.080 2
## 4 61 123 0.886 2
## 5 62 126 0.693 2
## 6 63 129 0.499 1
4.4.3 Multi-column mutate or summary
Since the previous version of this ebook, an entire group of functions has been replaced by a novel strategy to select columns for multiple-column mutate operations (and other similar actions). The functions mutate_all(), mutate_at() and mutate_if() have been superseded by the use of across(). The help document also directs the reader to vignette("colwise") - paste this in the console to read it for yourself. You will encounter the deprecated functions in many online examples, but I encourage you to use the across() syntax.
The across() function takes two arguments: a selection of columns - often the result of one of the tidyselect helpers - and a function or list of functions to apply.
Many ways to select columns
In this first example the tidyselect helper function everything() is used. This is similar to usage of the base R apply() function. Suppose you want to calculate the log2 of all numeric values in the iris dataset. In base R you would probably do it like this:
head(apply(iris[, -5], MARGIN = 2, FUN = log2))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 2.35 1.81 0.485 -2.32
## [2,] 2.29 1.58 0.485 -2.32
## [3,] 2.23 1.68 0.379 -2.32
## [4,] 2.20 1.63 0.585 -2.32
## [5,] 2.32 1.85 0.485 -2.32
## [6,] 2.43 1.96 0.766 -1.32
When you do it with mutate(across(...)), this is the most similar solution:
iris %>%
  select(-Species) %>%
  mutate(across(everything(), log2)) %>%
  head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2.35 1.81 0.485 -2.32
## 2 2.29 1.58 0.485 -2.32
## 3 2.23 1.68 0.379 -2.32
The tidyselect helper everything() does what you would expect given its name. Notice how the original columns have been dropped and the new columns have the same names as the original ones.
We could have used the same approach to column selection with summarise():
iris %>%
  select(-Species) %>%
  summarise(across(everything(), mean))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.84 3.06 3.76 1.2
The select(-Species) step is, however, a bit redundant if you use the across() function to its full potential in combination with tidyselect helpers. Here are some examples (only the first is evaluated because the output is similar for all).
Using indices:
iris %>% summarize(across(1:4, mean))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.84 3.06 3.76 1.2
Using column names (no quotes required):
iris %>% summarize(across(c(Sepal.Length, Sepal.Width), mean))
Using text matching on column names:
iris %>% summarize(across(starts_with("P"), mean))
iris %>% summarize(across(contains("."), mean))
Finally, using where(is.numeric) to apply a logical filter to the column type:
iris %>% summarize(across(where(is.numeric), mean))
Execute help(package = "tidyselect") to see a listing of more possibilities.
You see, the alternatives to selecting columns are almost endless.
Many ways to apply functions
Besides a single function name to apply, you can provide lists of functions, anonymous functions and lambdas. Here are some examples.
You can provide a named list to apply more than one function; in that case the list names are used as suffixes for the new column names. Surprisingly enough, the original columns are maintained in this use case (also when your list contains only one entry)!
iris %>%
  select(-Species) %>%
  mutate(across(everything(), list(log = log2, sqrt = sqrt))) %>%
  head(3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_log
## 1 5.1 3.5 1.4 0.2 2.35
## 2 4.9 3.0 1.4 0.2 2.29
## 3 4.7 3.2 1.3 0.2 2.23
## Sepal.Length_sqrt Sepal.Width_log Sepal.Width_sqrt Petal.Length_log
## 1 2.26 1.81 1.87 0.485
## 2 2.21 1.58 1.73 0.485
## 3 2.17 1.68 1.79 0.379
## Petal.Length_sqrt Petal.Width_log Petal.Width_sqrt
## 1 1.18 -2.32 0.447
## 2 1.18 -2.32 0.447
## 3 1.14 -2.32 0.447
When you don't give the entries names (e.g. list(log2, sqrt) instead of list(log = log2, sqrt = sqrt)), the resulting column names will receive simple serial numbers. However, when you use tibble::lst() instead, the function names do get used even though you don't provide names (results omitted):
iris %>%
  select(-Species) %>%
  mutate(across(everything(), lst(log2, sqrt)))
You can also pass lambdas to across() instead of a function name. Lambdas are anonymous function-like expressions. Although this form is harder to grasp, it does make it easier to pass arguments to the function you want executed (in this case center = FALSE to scale()).
iris %>%
  mutate(across(1:4, ~ scale(.x, center = FALSE))) %>%
  head(5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 0.861 1.130 0.336 0.14 setosa
## 2 0.828 0.968 0.336 0.14 setosa
## 3 0.794 1.033 0.312 0.14 setosa
## 4 0.777 1.001 0.360 0.14 setosa
## 5 0.844 1.162 0.336 0.14 setosa
Again, you can also pass a (named) list:
iris %>%
  mutate(across(1:4, list(scaled = ~ scale(., center = FALSE),
                          mean = ~ mean(., trim = 0.2)))) %>%
  head(5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_scaled
## 1 5.1 3.5 1.4 0.2 setosa 0.861
## 2 4.9 3.0 1.4 0.2 setosa 0.828
## 3 4.7 3.2 1.3 0.2 setosa 0.794
## 4 4.6 3.1 1.5 0.2 setosa 0.777
## 5 5.0 3.6 1.4 0.2 setosa 0.844
## Sepal.Length_mean Sepal.Width_scaled Sepal.Width_mean Petal.Length_scaled
## 1 5.8 1.130 3.04 0.336
## 2 5.8 0.968 3.04 0.336
## 3 5.8 1.033 3.04 0.312
## 4 5.8 1.001 3.04 0.360
## 5 5.8 1.162 3.04 0.336
## Petal.Length_mean Petal.Width_scaled Petal.Width_mean
## 1 3.84 0.14 1.2
## 2 3.84 0.14 1.2
## 3 3.84 0.14 1.2
## 4 3.84 0.14 1.2
## 5 3.84 0.14 1.2
The expression ~ mean(.x, trim = 0.2) is therefore equivalent to something like function(.x) mean(.x, trim = 0.2), so the tilde is shorthand for function(.x). Instead of .x, a single dot . is also allowed, and this syntax is used in more locations of the tidyverse - see group_map() and related functions.
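As a sketch, the following three calls should be equivalent ways of passing trim = 0.2 to mean() (output omitted):
iris %>% summarise(across(where(is.numeric), ~ mean(.x, trim = 0.2)))
iris %>% summarise(across(where(is.numeric), ~ mean(., trim = 0.2)))
iris %>% summarise(across(where(is.numeric), function(x) mean(x, trim = 0.2)))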
4.4.4 Change a variable: recode() and recode_factor()
These two functions help you to quickly change the values of a variable. Here, the Dose variable is overwritten with new (numeric) values. Note that the original dose_response_long tibble has not been modified!
head(dose_response_long, 2)
## # A tibble: 2 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 001 f dose10mg 12
## 2 002 f dose10mg 11
dose_response_long %>%
  mutate(Dose = recode(Dose, dose10mg = 10, dose100mg = 100)) %>%
  head(2)
## # A tibble: 2 x 4
## patient sex Dose Response
## <chr> <fct> <dbl> <int>
## 1 001 f 10 12
## 2 002 f 10 11
Similarly, the sex variable can be recoded with the recode_factor() function. In the example below,
- the Gender variable is created based on the sex variable,
- the sex variable is deleted and
- the variables are reordered to the original format
dose_response_long %>%
  mutate(Gender = recode_factor(sex, f = "female", m = "male"),
         sex = NULL) %>%
  select(patient, Gender, Dose, Response) %>%
  head(3)
## # A tibble: 3 x 4
## patient Gender Dose Response
## <chr> <fct> <chr> <int>
## 1 001 female dose10mg 12
## 2 002 female dose10mg 11
## 3 003 male dose10mg 54
4.5 Operations on groups
Often you want to know something about groups in your data. For instance, in the previous examples with the chickens, it is interesting to know the average weight gain of the chickens for each diet. This is where the group_by() function comes in, and its counterpart ungroup().
One important thing to realize is that group_by() does not split your data. Instead, it adds metadata to a tibble (or data.frame) that marks how rows should be grouped. As long as that metadata is there - i.e. you have not called ungroup() - you won't be able to change the factors of the columns involved in the grouping.
If you really want to split, like the base R split() function does, you should use the group_split() function.
4.5.1 group_by() and ungroup()
This is the most used set of grouping functions. It is usually followed by some summary function -discussed below- but sometimes, as in the case of the weight gain outlined in section 4.4.2, you want to progress with all data combined. In that case you need to ungroup again.
Here is the summary use case. No need for ungrouping.
dose_response_long %>%
  group_by(sex) %>%
  summarise(mean = mean(Response))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex mean
## <fct> <dbl>
## 1 f 50.3
## 2 m 41.7
Of course you can group by multiple variables.
dose_response_long %>%
  group_by(Dose, sex) %>%
  summarize(mean = mean(Response))
## `summarise()` regrouping output by 'Dose' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: Dose [2]
## Dose sex mean
## <chr> <fct> <dbl>
## 1 dose100mg f 80.9
## 2 dose100mg m 21.5
## 3 dose10mg f 19.6
## 4 dose10mg m 61.9
After grouping on a variable, you will sometimes be interested in the top or bottom n rows of each group. The head() function will not work then, surprisingly enough. Suppose you want the bottom 2 responses from both the males and females in the dose100mg group. This does not work:
dose_response_long %>%
  filter(Dose == "dose100mg") %>%
  group_by(sex) %>%
  arrange(Response) %>%
  head(2) %>%
  ungroup()
## # A tibble: 2 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 024 m dose100mg 10
## 2 019 m dose100mg 12
Instead, to select the top n values after grouping you need to use either slice() or filter(row_number() <some-logical-test>). Here is the correct solution.
dose_response_long %>%
  filter(Dose == "dose100mg") %>%
  group_by(sex) %>%
  arrange(Response) %>%
  slice(1:2) %>%
  # filter(row_number() %in% 1:2) %>% # also works
  ungroup() # ungrouping not required but added for clarity of code
## # A tibble: 4 x 4
## patient sex Dose Response
## <chr> <fct> <chr> <int>
## 1 002 f dose100mg 54
## 2 020 f dose100mg 63
## 3 024 m dose100mg 10
## 4 019 m dose100mg 12
Here is a rather lengthy use case for ungrouping: the z-score is calculated per sex group, and then there is some Dose-specific calculation before summarizing. This requires an ungroup() in between.
dose_response_long %>%
  group_by(sex) %>%
  mutate(Response_Z = my_z(Response)) %>%
  ungroup() %>% ## ungroup for other grouping
  group_by(Dose) %>%
  mutate(Response_scaled = scale(Response)) %>%
  summarize(Z = mean(Response_Z), Scaled_Mean_sd = sd(Response_scaled))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## Dose Z Scaled_Mean_sd
## <chr> <dbl> <dbl>
## 1 dose100mg 0.906 1
## 2 dose10mg 0.912 1
4.5.2 group_split() and group_keys()
The group_split() function is equivalent to the base R split() function, but it uses the grouping structure from group_by() and is therefore subject to the data mask. Also, it does not name the elements of the list based on the grouping. So in the example below (unevaluated), split() will return a list with two named elements ($m and $f), but the list returned by group_split() will only be accessible with [[1]] and [[2]].
dose_response_long %>%
  group_split(sex)
## same as
split(as.data.frame(dose_response_long), dose_response_long$sex)
The last grouping-related function, group_keys(), only returns a tibble explaining the grouping structure:
dose_response_long %>%
  group_keys(sex, Dose)
## Warning: The `...` argument of `group_keys()` is deprecated as of dplyr 1.0.0.
## Please `group_by()` first
## # A tibble: 4 x 2
## sex Dose
## * <fct> <chr>
## 1 f dose100mg
## 2 f dose10mg
## 3 m dose100mg
## 4 m dose10mg
4.5.3 Apply a function to each group
group_map(), group_modify() and group_walk() are functions that can be used to iterate over grouped tibbles.
- group_modify() returns a grouped tibble. Therefore the supplied function must return a data frame/tibble to be able to combine the results of the individual subsets.
- group_map() returns a list of results, each element being the result of calling the supplied function on each group.
- group_walk() calls the supplied function for its side effects and returns the input .tbl, invisibly. This makes it possible to, for instance, print the intermediate results or write them to file without breaking the workflow.
Here is an example of group_map(). As you can see, it takes a lambda expression as seen before:
dose_response_long %>%
  group_by(Dose) %>%
  group_map(~ quantile(.x$Response, probs = seq(0, 1, 0.2)))
## [[1]]
## 0% 20% 40% 60% 80% 100%
## 10 18 31 69 85 99
##
## [[2]]
## 0% 20% 40% 60% 80% 100%
## 11 19 26 54 67 83
Below is another example, where the resulting list of three dataframes is combined into a single dataframe:
iris %>%
  group_by(Species) %>%
  group_map(.f = ~ data.frame(Sep.rat = .x$Sepal.Length / .x$Sepal.Width,
                              Pet.rat = .x$Petal.Length / .x$Petal.Width)) %>%
  bind_rows()
## Sep.rat Pet.rat
## 1 1.46 7.00
## 2 1.63 7.00
## 3 1.47 6.50
## 4 1.48 7.50
## 5 1.39 7.00
## 6 1.38 4.25
## 7 1.35 4.67
## 8 1.47 7.50
## 9 1.52 7.00
## 10 1.58 15.00
## 11 1.46 7.50
## 12 1.41 8.00
## 13 1.60 14.00
## 14 1.43 11.00
## 15 1.45 6.00
## 16 1.30 3.75
## 17 1.38 3.25
## 18 1.46 4.67
## 19 1.50 5.67
## 20 1.34 5.00
## 21 1.59 8.50
## 22 1.38 3.75
## 23 1.28 5.00
## 24 1.55 3.40
## 25 1.41 9.50
## 26 1.67 8.00
## 27 1.47 4.00
## 28 1.49 7.50
## 29 1.53 7.00
## 30 1.47 8.00
## 31 1.55 8.00
## 32 1.59 3.75
## 33 1.27 15.00
## 34 1.31 7.00
## 35 1.58 7.50
## 36 1.56 6.00
## 37 1.57 6.50
## 38 1.36 14.00
## 39 1.47 6.50
## 40 1.50 7.50
## 41 1.43 4.33
## 42 1.96 4.33
## 43 1.38 6.50
## 44 1.43 2.67
## 45 1.34 4.75
## 46 1.60 4.67
## 47 1.34 8.00
## 48 1.44 7.00
## 49 1.43 7.50
## 50 1.52 7.00
## 51 2.19 3.36
## 52 2.00 3.00
## 53 2.23 3.27
## 54 2.39 3.08
## 55 2.32 3.07
## 56 2.04 3.46
## 57 1.91 2.94
## 58 2.04 3.30
## 59 2.28 3.54
## 60 1.93 2.79
## 61 2.50 3.50
## 62 1.97 2.80
## 63 2.73 4.00
## 64 2.10 3.36
## 65 1.93 2.77
## 66 2.16 3.14
## 67 1.87 3.00
## 68 2.15 4.10
## 69 2.82 3.00
## 70 2.24 3.55
## 71 1.84 2.67
## 72 2.18 3.08
## 73 2.52 3.27
## 74 2.18 3.92
## 75 2.21 3.31
## 76 2.20 3.14
## 77 2.43 3.43
## 78 2.23 2.94
## 79 2.07 3.00
## 80 2.19 3.50
## 81 2.29 3.45
## 82 2.29 3.70
## 83 2.15 3.25
## 84 2.22 3.19
## 85 1.80 3.00
## 86 1.76 2.81
## 87 2.16 3.13
## 88 2.74 3.38
## 89 1.87 3.15
## 90 2.20 3.08
## 91 2.12 3.67
## 92 2.03 3.29
## 93 2.23 3.33
## 94 2.17 3.30
## 95 2.07 3.23
## 96 1.90 3.50
## 97 1.97 3.23
## 98 2.14 3.31
## 99 2.04 2.73
## 100 2.04 3.15
## 101 1.91 2.40
## 102 2.15 2.68
## 103 2.37 2.81
## 104 2.17 3.11
## 105 2.17 2.64
## 106 2.53 3.14
## 107 1.96 2.65
## 108 2.52 3.50
## 109 2.68 3.22
## 110 2.00 2.44
## 111 2.03 2.55
## 112 2.37 2.79
## 113 2.27 2.62
## 114 2.28 2.50
## 115 2.07 2.12
## 116 2.00 2.30
## 117 2.17 3.06
## 118 2.03 3.05
## 119 2.96 3.00
## 120 2.73 3.33
## 121 2.16 2.48
## 122 2.00 2.45
## 123 2.75 3.35
## 124 2.33 2.72
## 125 2.03 2.71
## 126 2.25 3.33
## 127 2.21 2.67
## 128 2.03 2.72
## 129 2.29 2.67
## 130 2.40 3.62
## 131 2.64 3.21
## 132 2.08 3.20
## 133 2.29 2.55
## 134 2.25 3.40
## 135 2.35 4.00
## 136 2.57 2.65
## 137 1.85 2.33
## 138 2.06 3.06
## 139 2.00 2.67
## 140 2.23 2.57
## 141 2.16 2.33
## 142 2.23 2.22
## 143 2.15 2.68
## 144 2.12 2.57
## 145 2.03 2.28
## 146 2.23 2.26
## 147 2.52 2.63
## 148 2.17 2.60
## 149 1.82 2.35
## 150 1.97 2.83
Here is an example of group_modify():
iris %>%
  group_by(Species) %>%
  group_modify(~ tibble(quant = quantile(.x$Petal.Length, probs = seq(0, 1, 0.2))))
## # A tibble: 18 x 2
## # Groups: Species [3]
## Species quant
## <fct> <dbl>
## 1 setosa 1
## 2 setosa 1.3
## 3 setosa 1.4
## 4 setosa 1.5
## 5 setosa 1.6
## 6 setosa 1.9
## 7 versicolor 3
## 8 versicolor 3.9
## 9 versicolor 4.2
## 10 versicolor 4.5
## 11 versicolor 4.7
## 12 versicolor 5.1
## 13 virginica 4.5
## 14 virginica 5.1
## 15 virginica 5.36
## 16 virginica 5.6
## 17 virginica 6
## 18 virginica 6.9
Of course, it would have been nicer to put the quantiles in separate columns.
iris %>%
  group_by(Species) %>%
  group_modify(~ tibble(quant = quantile(.x$Petal.Length, probs = seq(0, 1, 0.2)))) %>%
  ungroup() %>%
  mutate(prob = rep(seq(0, 1, 0.2), times = 3)) %>%
  pivot_wider(names_from = prob, values_from = quant)
## # A tibble: 3 x 7
## Species `0` `0.2` `0.4` `0.6` `0.8` `1`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 1 1.3 1.4 1.5 1.6 1.9
## 2 versicolor 3 3.9 4.2 4.5 4.7 5.1
## 3 virginica 4.5 5.1 5.36 5.6 6 6.9
But this is much nicer of course:
iris %>%
  group_by(Species) %>%
  group_map(~ c(.y, quantile(.x$Petal.Length, probs = seq(0, 1, 0.2)))) %>%
  bind_rows()
## # A tibble: 3 x 7
## Species `0%` `20%` `40%` `60%` `80%` `100%`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 1 1.3 1.4 1.5 1.6 1.9
## 2 versicolor 3 3.9 4.2 4.5 4.7 5.1
## 3 virginica 4.5 5.1 5.36 5.6 6 6.9
Use group_walk() when you only want to have a peek at each group before processing further.
dose_response_long %>%
  group_by(Dose) %>%
  group_walk(~ print(head(.x, 2))) %>%
  summarise(avg = mean(Response))
## # A tibble: 2 x 3
## patient sex Response
## <chr> <fct> <int>
## 1 001 f 88
## 2 002 f 54
## # A tibble: 2 x 3
## patient sex Response
## <chr> <fct> <int>
## 1 001 f 12
## 2 002 f 11
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Dose avg
## <chr> <dbl>
## 1 dose100mg 51.2
## 2 dose10mg 40.8
4.6 Summarizing and counting
You have seen the summarize() function at work a few times. In essence, it creates one or more summary statistics, for each group if groups exist. For instance, to calculate the mean and SD of relative weight gain for the different diets, this is the workflow:
chicks %>%
  group_by(Chick) %>% # split on chicks
  mutate(rel_weight_gain = (weight - lag(weight)) / weight) %>%
  ungroup() %>%
  group_by(Diet) %>%
  summarize(mean_rel_gain = mean(rel_weight_gain, na.rm = T),
            sd = sd(rel_weight_gain, na.rm = T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 3
## Diet mean_rel_gain sd
## <fct> <dbl> <dbl>
## 1 1 0.111 0.0743
## 2 2 0.131 0.0742
## 3 3 0.153 0.0657
## 4 4 0.142 0.0748
There is an equivalent set that creates a summary for a group of variables: summarize_all(), summarize_at() and summarize_if(), which are analogous to the mutate_xxx() methods. Here is one example.
iris %>%
  summarize_if(is.numeric, list(mean = mean, sd = sd))
## Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1 5.84 3.06 3.76 1.2
## Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd
## 1 0.828 0.436 1.77 0.762
You should be able to figure out the rest of them by now.
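Note that these summarize_xxx() variants have been superseded as well; a sketch of the across() equivalent of the example above (output omitted):
iris %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))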
4.6.0.1 Simple counting of occurrences
An often used operation is simple counting of occurrences of course:
dose_response_long %>%
  group_by(sex, Dose) %>%
  summarize(count = n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex Dose count
## <fct> <chr> <int>
## 1 f dose100mg 13
## 2 f dose10mg 13
## 3 m dose100mg 13
## 4 m dose10mg 13
There is a shorthand for this type of count: tally():
dose_response_long %>%
  group_by(sex, Dose) %>%
  tally()
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex Dose n
## <fct> <chr> <int>
## 1 f dose100mg 13
## 2 f dose10mg 13
## 3 m dose100mg 13
## 4 m dose10mg 13
And even shorter is this:
dose_response_long %>%
  count(sex, Dose)
## # A tibble: 4 x 3
## sex Dose n
## <fct> <chr> <int>
## 1 f dose100mg 13
## 2 f dose10mg 13
## 3 m dose100mg 13
## 4 m dose10mg 13
When you want the counts sorted, use the sort = TRUE argument to count(), e.g.
mtcars %>% count(cyl, sort = TRUE)
If you want such a count as a column in your dataset, use add_tally() or add_count().
dose_response_long %>%
  group_by(sex) %>%
  add_tally()
## # A tibble: 52 x 5
## # Groups: sex [2]
## patient sex Dose Response n
## <chr> <fct> <chr> <int> <int>
## 1 001 f dose10mg 12 26
## 2 002 f dose10mg 11 26
## 3 003 m dose10mg 54 26
## 4 004 m dose10mg 71 26
## 5 005 f dose10mg 19 26
## 6 006 f dose10mg 22 26
## 7 007 f dose10mg 23 26
## 8 008 m dose10mg 68 26
## 9 009 f dose10mg 30 26
## 10 010 m dose10mg 83 26
## # … with 42 more rows
##same as below, but slightly more readable:
#add_count(sex)
add_count() is useful for groupwise filtering, for instance when you want to show details of the group that occurs the least.
mtcars %>%
  add_count(cyl) %>%
  filter(n == min(n)) %>%
  select(1:5, n)
## mpg cyl disp hp drat n
## 1 21.0 6 160 110 3.90 7
## 2 21.0 6 160 110 3.90 7
## 3 21.4 6 258 110 3.08 7
## 4 18.1 6 225 105 2.76 7
## 5 19.2 6 168 123 3.92 7
## 6 17.8 6 168 123 3.92 7
## 7 19.7 6 145 175 3.62 7
4.7 Combining data
4.7.1 Merging
There are a few join functions: left_join(), right_join(), inner_join() and full_join(). Only full_join() is discussed here; the others are variations that behave differently with missing data in one or the other of the data frames to combine. In base R, merge() is responsible for this kind of operation.
Suppose you have two tibbles:
(trA <- tribble(
  ~ID, ~var1,
  "a", 4,
  "b", 6,
  "c", 1,
))
## # A tibble: 3 x 2
## ID var1
## <chr> <dbl>
## 1 a 4
## 2 b 6
## 3 c 1
(trB <- tribble(
  ~ID, ~var2,
  "a", 7,
  "b", 3,
  "d", 5
))
## # A tibble: 3 x 2
## ID var2
## <chr> <dbl>
## 1 a 7
## 2 b 3
## 3 d 5
Since there is a common variable “ID” we can simply combine these two into a single tibble:
full_join(trA, trB, by = "ID")
## # A tibble: 4 x 3
## ID var1 var2
## <chr> <dbl> <dbl>
## 1 a 4 7
## 2 b 6 3
## 3 c 1 NA
## 4 d NA 5
When the column names differ, you can specify it as full_join(x, y, by = c("a" = "b")), which will match variable a in tibble x to variable b in tibble y.
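As a sketch, suppose trB had used the (hypothetical) column name code instead of ID:
trC <- tribble(
  ~code, ~var2,
  "a", 7,
  "b", 3,
  "d", 5
)
full_join(trA, trC, by = c("ID" = "code"))  # same result as full_join(trA, trB, by = "ID")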
You can of course also do this in a workflow setting:
trA %>% full_join(trB, by = "ID")
## # A tibble: 4 x 3
## ID var1 var2
## <chr> <dbl> <dbl>
## 1 a 4 7
## 2 b 6 3
## 3 c 1 NA
## 4 d NA 5
4.7.2 Adding rows and columns
To add rows, use bind_rows(); to add columns, use bind_cols(). Have a look at the docs for the details. They are pretty much self-explanatory, and analogous to the base R rbind() and cbind() functions.
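A minimal sketch, reusing the trA tibble from above with a hypothetical extra row and column (output omitted):
trA_extra_row <- tribble(
  ~ID, ~var1,
  "d", 9
)
bind_rows(trA, trA_extra_row)       # 4 rows, same columns
bind_cols(trA, tibble(var3 = 7:9))  # 3 rows, one extra column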