8 Tidying dataframes using Package tidyr
This chapter explores the package tidyr
which is part of the tidyverse
set of packages.
This package is an essential tool collection in the process of organizing you data in a tidy way.
A dataset is considered tidy when it is formed according to these rules:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
Why do we want tidy data?
- Because the entire tidyverse works with tidy data, especially the ggplot2 package.
- Pivoting data is also often used to make specific data representations for other purposes.
- Finally: we often receive “dirty” data that needs to be processed in order to be able to carry out our analyses.
Here is an example dataset, downloaded from the WHO website here Note: the original data was exported from Excel to csv, and a text editor was used to replace single quote occurrences (“’”) with an underscore. Otherwise, data will be corrupted.
The data concerns disease occurrences for two diseases: measles and rubella (“rode hond” in Dutch). Suppose I would like to analyse distribution and timeline of these two diseases, in relation to each other. There are two problems here: 1. The data needs to be clean 2. The data needs to be combined from two files into one dataframe.
Let’s start with the first: measles. Here is part of the measles data. It is not tidy. Why not?
Region | ISO3 | Country | Year | January | February | March |
---|---|---|---|---|---|---|
AFR | AGO | Angola | 2011 | 17 | 19 | 37 |
AFR | AGO | Angola | 2012 | 373 | 289 | 381 |
AFR | AGO | Angola | 2013 | 725 | 646 | 734 |
AFR | AGO | Angola | 2014 | 1161 | 1101 | 1319 |
AFR | AGO | Angola | 2015 | 4 | 15 | 0 |
AFR | AGO | Angola | 2016 | 3 | 2 | 0 |
AFR | AGO | Angola | 2017 | 1 | 7 | 2 |
AFR | AGO | Angola | 2018 | 3 | 5 | 8 |
AFR | AGO | Angola | 2019 | 120 | 94 | 281 |
AFR | BDI | Burundi | 2011 | 6 | 2 | 8 |
AFR | BDI | Burundi | 2012 | 4 | 3 | 10 |
AFR | BDI | Burundi | 2013 | 0 | 0 | 0 |
AFR | BDI | Burundi | 2014 | 0 | 0 | 0 |
AFR | BDI | Burundi | 2015 | 0 | 0 | 2 |
AFR | BDI | Burundi | 2016 | 0 | 0 | 0 |
The monthly counts are in separate columns. However, they are really all the same variable: measles cases. So this data needs to be tidied: There should be only one column called “cases” and another column called “month”. Or maybe even a single column “Date”? Because year and month are actually elements of a single unit of course. We’ll leave that for a later chapter.
8.1 The pivot_longer()
function
Tidying has never been simpler, using the pivot_longer
function:
measles_tidy <- pivot_longer(data = measles,
cols = 5:16,
names_to = "Month",
values_to = "Cases")
knitr::kable(head(measles_tidy, n=15))
Region | ISO3 | Country | Year | Month | Cases |
---|---|---|---|---|---|
AFR | AGO | Angola | 2011 | January | 17 |
AFR | AGO | Angola | 2011 | February | 19 |
AFR | AGO | Angola | 2011 | March | 37 |
AFR | AGO | Angola | 2011 | April | 41 |
AFR | AGO | Angola | 2011 | May | 11 |
AFR | AGO | Angola | 2011 | June | 8 |
AFR | AGO | Angola | 2011 | July | 5 |
AFR | AGO | Angola | 2011 | August | 4 |
AFR | AGO | Angola | 2011 | September | 32 |
AFR | AGO | Angola | 2011 | October | 10 |
AFR | AGO | Angola | 2011 | November | 8 |
AFR | AGO | Angola | 2011 | December | 0 |
AFR | AGO | Angola | 2012 | January | 373 |
AFR | AGO | Angola | 2012 | February | 289 |
AFR | AGO | Angola | 2012 | March | 381 |
In the pivot_longer
function, you provide three pieces of information:
You need to tell which columns to collect the data and headers from. Here, all the month columns are collected:
cols = 5:16
. you can also use the names of the columns you want to gather.The name of the column to hold the “old” column headers:
names_to
. This is the name of the newly created column that will hold the information that is now present in the column headers that you are going to collect - in this case, the months names January-December. Hence,names_to = Month
.The name for
values_to
. Thevalue_to
argument is the name of the column that will hold the actual measurements. In this case, the number of cases from each of the 12 month columns will be “gathered” in this column. Hence,values_to = Cases
.
The result is a “flattened but elongated” data structure. All data from the non-gathered columns (Region
, ISO3
, Country
, and Year
) will be expanded/duplicated for each of the 12 month rows that will be created.
Here you can see the numbers as verification of the process:
## The original dimensions
dim(measles)
## [1] 1746 16
## The tidied data
dim(measles_tidy)
## [1] 20952 6
## 12 times 1746 is 20952 rows
nrow(measles_tidy) / 12
## [1] 1746
Obviously, the data you gather into a single column must be a single measurement type!
8.1.1 Selecting columns
The cols=
argument can be specified using a whole bunch of different methods. A few are listed here.
cols = -(1:4),
cols = -(Region:Year),
cols = January:December,
cols = -c(Region, ISO3, Country, Year),
Later on, we’ll encounter some more means to select columns using the tidyselect helpers.
8.1.2 Another example
As an extra example, here is the dose_response data again.
dose_response <- read.table("data/dose_response2.txt",
header = T,
sep=";",
colClasses=c("character", "factor", "integer", "integer"))
names(dose_response)
## [1] "patient" "sex" "dose10mg" "dose100mg"
It is processed by pivot_longer as well. This time, a Regular Expression is used to parse the dose as numbers only out of the header names.
tmp <- pivot_longer(data = dose_response,
cols = -c("patient", "sex"),
names_to = "dose",
names_pattern = "dose(10|100)mg",
values_to = "response")
knitr::kable(tmp[1:10, ])
patient | sex | dose | response |
---|---|---|---|
001 | f | 10 | 12 |
001 | f | 100 | 88 |
002 | f | 10 | 11 |
002 | f | 100 | 54 |
003 | m | 10 | 54 |
003 | m | 100 | 14 |
004 | m | 10 | 71 |
004 | m | 100 | 21 |
005 | f | 10 | 19 |
005 | f | 100 | 89 |
The argument names_pattern = "dose(10|100)mg"
specifies, using parentheses and the “or” symbol |
that we want to extract only the numbers 10 or 100 from the column names. Regular expressions are discussed in another chapter (11)
Note however they are still character values so this requires a conversion:
class(tmp$dose)
tmp$dose <- as.integer(tmp$dose)
## [1] "character"
More sophisticated techniques dealt are with in the chapter on package dplyr
(9).
8.1.3 A more complex problem: two sets of columns
How about a case with two sets of columns? Here, to create an example, a tibble
is used. This is an extension of the well-known data.frame
type. It is also part of the tidyverse
and will be discussed in more detail as well.
set.seed(1234)
two_two_col <- tibble(subject = letters[1:5],
T0_Control = rnorm(5, 10, 1),
T0_Treated = rnorm(5, 9.5, 1.2),
T1_Control = rnorm(5, 11, 1.5),
T1_Treated = rnorm(5, 16, 2))
knitr::kable(two_two_col)
subject | T0_Control | T0_Treated | T1_Control | T1_Treated |
---|---|---|---|---|
a | 8.79 | 10.11 | 10.28 | 15.8 |
b | 10.28 | 8.81 | 9.50 | 15.0 |
c | 11.08 | 8.84 | 9.84 | 14.2 |
d | 7.65 | 8.82 | 11.10 | 14.3 |
e | 10.43 | 8.43 | 12.44 | 20.8 |
The measurements are spread over 4 columns and the column names actually represent the levels of two variables: Treatment and Time. Thus, to be tidy we need to mangle the data in this form:
subject time treatment response
a T0 Control 10.764
a T0 Treated 8.681
a T1 Control 13.057
a T1 Treated 15.188
...
To solve this it requires another bit of knowledge of regular expressions:
dose_response_long <- pivot_longer(data = two_two_col,
cols = -"subject",
names_pattern = "(T.)_(Control|Treated)",
names_to = c("Time", "Treatment"),
values_to = "Response")
knitr::kable(head(dose_response_long, 8))
subject | Time | Treatment | Response |
---|---|---|---|
a | T0 | Control | 8.79 |
a | T0 | Treated | 10.11 |
a | T1 | Control | 10.28 |
a | T1 | Treated | 15.78 |
b | T0 | Control | 10.28 |
b | T0 | Treated | 8.81 |
b | T1 | Control | 9.50 |
b | T1 | Treated | 14.98 |
The expression names_pattern = "(T.).(Control|Treated)"
tells the function there are two groups in the column names to be widened. These groups are defined by the two sets of parentheses ()
. The first group is defined by the letter “T” follows by a single character specified by the dot .
which says “any character”. The second group is (Control|Treated)
which tells the function that it is formed by the word Control or Treated. In between is a random single underscore “_“.
Alternatively, and actually much simpler, the names_sep
argument could have been used. For instance, names_sep = '_'
tells to split the column names at the underscore character character, yielding this result:
alt <- pivot_longer(data = two_two_col,
cols = -"subject",
names_sep = "_",
names_to = c("Time", "Treatment"),
values_to = "Response")
knitr::kable(head(alt, 8))
subject | Time | Treatment | Response |
---|---|---|---|
a | T0 | Control | 8.79 |
a | T0 | Treated | 10.11 |
a | T1 | Control | 10.28 |
a | T1 | Treated | 15.78 |
b | T0 | Control | 10.28 |
b | T0 | Treated | 8.81 |
b | T1 | Control | 9.50 |
b | T1 | Treated | 14.98 |
8.2 Widening with pivot_wider()
The opposite of pivot_longer()
is pivot_wider()
. First the simple case of a single column:
head(dose_response_long)
## # A tibble: 6 × 4
## subject Time Treatment Response
## <chr> <chr> <chr> <dbl>
## 1 a T0 Control 8.79
## 2 a T0 Treated 10.1
## 3 a T1 Control 10.3
## 4 a T1 Treated 15.8
## 5 b T0 Control 10.3
## 6 b T0 Treated 8.81
Here is the reverse operation with pivot_wider
:
tmp <- pivot_wider(data = dose_response_long,
names_from = c(Time, Treatment),
values_from = Response)
knitr::kable(head(tmp, 5))
subject | T0_Control | T0_Treated | T1_Control | T1_Treated |
---|---|---|---|---|
a | 8.79 | 10.11 | 10.28 | 15.8 |
b | 10.28 | 8.81 | 9.50 | 15.0 |
c | 11.08 | 8.84 | 9.84 | 14.2 |
d | 7.65 | 8.82 | 11.10 | 14.3 |
e | 10.43 | 8.43 | 12.44 | 20.8 |
8.3 Other utility functions
8.3.1 Split columns with separate()
The function separate()
will split on the non-word character if no separator is specified. Here a separator is prided telling to split on an X with two other characters on either side.
tmp <- data.frame(x = 10:15, a = paste0(1:6, "zXb", letters[1:6]))
tmp
## x a
## 1 10 1zXba
## 2 11 2zXbb
## 3 12 3zXbc
## 4 13 4zXbd
## 5 14 5zXbe
## 6 15 6zXbf
separate(data = tmp, col = a, sep = ".X.", into=c("b", "c"))
## x b c
## 1 10 1 a
## 2 11 2 b
## 3 12 3 c
## 4 13 4 d
## 5 14 5 e
## 6 15 6 f
8.3.2 Combine columns with unite()
When you have two columns that should be logically combined into one, use the unite()
function.
For instance, dates and times are often placed in separate columns but they are two aspects of a single measurement: time.
Here is a small example. In a later presentation, working with dates and times will be extensively treated.
activity <- data.frame(date = c("30/12/2019", "31/12/2019", "01/01/2020", "02/01/2020", "02/01/2020"),
time = c("14:55", "21:01", "08:22", "11:56", "16:38"),
activity = c(34, 48, 5, 19, 22))
unite(activity, "datetime", date, time, sep = "T")
## datetime activity
## 1 30/12/2019T14:55 34
## 2 31/12/2019T21:01 48
## 3 01/01/2020T08:22 5
## 4 02/01/2020T11:56 19
## 5 02/01/2020T16:38 22
8.3.3 Drop rows with missing values: drop_na()
This function is extremely simple and useful. It drops rows with missing from a dataframe and returns a cured copy:
head(drop_na(airquality))
## Ozone Solar.R Wind Temp Month Day bar
## 1 41 190 7.4 67 May 1 1
## 2 36 118 8.0 72 May 2 1
## 3 12 149 12.6 74 May 3 0
## 4 18 313 11.5 62 May 4 0
## 5 23 299 8.6 65 May 7 0
## 6 19 99 13.8 59 May 8 0
There is of course an equivalent in base R: na.omit()
.