Chapter 3 Package tidyr
3.1 Introduction
This chapter explores the package tidyr
. It is also part of the tidyverse
set of packages.
This package is an essential tool in the process of organizing you data in a tidy way. A dataset is considered tidy when it is formed according to these rules:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
Here is an example dataset, downloaded from the WHO website here Note: the original data was exported from Excel to csv, and a text editor was used to replace single quote occurrences (“’”) with an underscore. Otherwise, data will be corrupted.
The data concerns disease occurrences for two diseases: measles and rubella (“rode hond” in Dutch). Suppose I would like to analyse distribution and timeline of these two diseases, in relation to each other. There are two problems here: 1. The data needs to be clean 2. The data needs to be combined from two files into one dataframe.
Let’s start with the first: measles. Here is part of the measles data. It is not tidy. Why not?
Region | ISO3 | Country | Year | January | February | March |
---|---|---|---|---|---|---|
AFR | AGO | Angola | 2011 | 17 | 19 | 37 |
AFR | AGO | Angola | 2012 | 373 | 289 | 381 |
AFR | AGO | Angola | 2013 | 725 | 646 | 734 |
AFR | AGO | Angola | 2014 | 1161 | 1101 | 1319 |
AFR | AGO | Angola | 2015 | 4 | 15 | 0 |
AFR | AGO | Angola | 2016 | 3 | 2 | 0 |
AFR | AGO | Angola | 2017 | 1 | 7 | 2 |
AFR | AGO | Angola | 2018 | 3 | 5 | 8 |
AFR | AGO | Angola | 2019 | 120 | 94 | 281 |
AFR | BDI | Burundi | 2011 | 6 | 2 | 8 |
AFR | BDI | Burundi | 2012 | 4 | 3 | 10 |
AFR | BDI | Burundi | 2013 | 0 | 0 | 0 |
AFR | BDI | Burundi | 2014 | 0 | 0 | 0 |
AFR | BDI | Burundi | 2015 | 0 | 0 | 2 |
AFR | BDI | Burundi | 2016 | 0 | 0 | 0 |
The monthly counts are in separate columns. However, they are really all the same variable: measles cases. So this data needs to be tidied: There should be only one column called “cases” and another column called “month.” Or maybe even a single column “Date?” Because year and month are actually elements of a single unit of course. We’ll leave that for a later chapter.
3.2 The pivot_longer()
function
Tidying has never been simpler, using the pivot_longer
function:
<- pivot_longer(data = measles,
measles_tidy cols = 5:16, # or use -(1:4)
names_to = "Month",
values_to = "Cases")
::kable(head(measles_tidy, n=15)) knitr
Region | ISO3 | Country | Year | Month | Cases |
---|---|---|---|---|---|
AFR | AGO | Angola | 2011 | January | 17 |
AFR | AGO | Angola | 2011 | February | 19 |
AFR | AGO | Angola | 2011 | March | 37 |
AFR | AGO | Angola | 2011 | April | 41 |
AFR | AGO | Angola | 2011 | May | 11 |
AFR | AGO | Angola | 2011 | June | 8 |
AFR | AGO | Angola | 2011 | July | 5 |
AFR | AGO | Angola | 2011 | August | 4 |
AFR | AGO | Angola | 2011 | September | 32 |
AFR | AGO | Angola | 2011 | October | 10 |
AFR | AGO | Angola | 2011 | November | 8 |
AFR | AGO | Angola | 2011 | December | 0 |
AFR | AGO | Angola | 2012 | January | 373 |
AFR | AGO | Angola | 2012 | February | 289 |
AFR | AGO | Angola | 2012 | March | 381 |
In the pivot_longer
function, you provide three pieces of information:
You need to tell which columns to collect the data and headers from. Here, all the month columns are collected:
cols = 5:16
. you can also use the names of the columns you want to gather.The name of the column to hold the “old” column headers:
names_to
. This is the name of the newly created column that will hold the information that is now present in the column headers that you are going to collect - in this case, the months names January-December. Hence,names_to = Month
.The name for
values_to
. Thevalue_to
argument is the name of the column that will hold the actual measurements. In this case, the number of cases from each of the 12 month columns will be “gathered” in this column. Hence,values_to = Cases
.
The result is a “flattened but elongated” data structure. All data from the non-gathered columns (Region
, ISO3
, Country
, and Year
) will be expanded/duplicated for each of the 12 month rows that will be created.
Here you can see the numbers as verification of the process:
## The original dimensions
dim(measles)
## [1] 1746 16
## The tidied data
dim(measles_tidy)
## [1] 20952 6
## 12 times 1746 is 20952 rows
nrow(measles_tidy) / 12
## [1] 1746
Obviously, the data you gather into a single column must be a single measurement type!
As an extra example, here is the dose_response data again.
<- read.table("data/dose_response2.txt",
dose_response header = T,
sep=";",
colClasses=c("character", "factor", "integer", "integer"))
names(dose_response)
## [1] "patient" "sex" "dose10mg" "dose100mg"
It is processed by pivot_longer as well. This time, a Regular Expression is used to parse the dose as numbers only out of the header names.
<- pivot_longer(data = dose_response,
tmp cols = -c("patient", "sex"),
names_to = "dose",
names_pattern = "dose(10|100)mg",
values_to = "response")
::kable(tmp[1:10, ]) knitr
patient | sex | dose | response |
---|---|---|---|
001 | f | 10 | 12 |
001 | f | 100 | 88 |
002 | f | 10 | 11 |
002 | f | 100 | 54 |
003 | m | 10 | 54 |
003 | m | 100 | 14 |
004 | m | 10 | 71 |
004 | m | 100 | 21 |
005 | f | 10 | 19 |
005 | f | 100 | 89 |
Note however they are still character values so this requires a conversion, dealt with in chapter dplyr
:
class(tmp$dose)
## [1] "character"
3.3 A more complex problem: two sets of columns
How about a case with two sets of columns? Here, to create an example, a tibble
is used. This is an extension of the well-known data.frame
type. It is also part of the tidyverse
and will be discussed in more detail as well.
set.seed(1234)
<- tibble(subject = letters[1:5],
two_two_col T0_Control = rnorm(5, 10, 1),
T0_Treated = rnorm(5, 9.5, 1.2),
T1_Control = rnorm(5, 11, 1.5),
T1_Treated = rnorm(5, 16, 2))
::kable(two_two_col) knitr
subject | T0_Control | T0_Treated | T1_Control | T1_Treated |
---|---|---|---|---|
a | 8.79 | 10.11 | 10.28 | 15.8 |
b | 10.28 | 8.81 | 9.50 | 15.0 |
c | 11.08 | 8.84 | 9.84 | 14.2 |
d | 7.65 | 8.82 | 11.10 | 14.3 |
e | 10.43 | 8.43 | 12.44 | 20.8 |
The measurements are spread over 4 columns and the column names actually represent the levels of two variables: Treatment and Time. Thus, to be tidy we need to mangle the data in this form:
subject time treatment response
a T0 Control 10.764
a T0 Treated 8.681
a T1 Control 13.057
a T1 Treated 15.188
...
To solve this it requires a bit of knowledge of regular expressions (fortunately dealt with in a previous course):
<- pivot_longer(data = two_two_col,
dose_response_long cols = -"subject",
names_pattern = "(T.)_(Control|Treated)",
names_to = c("Time", "Treatment"),
values_to = "Response")
::kable(head(dose_response_long, 8)) knitr
subject | Time | Treatment | Response |
---|---|---|---|
a | T0 | Control | 8.79 |
a | T0 | Treated | 10.11 |
a | T1 | Control | 10.28 |
a | T1 | Treated | 15.78 |
b | T0 | Control | 10.28 |
b | T0 | Treated | 8.81 |
b | T1 | Control | 9.50 |
b | T1 | Treated | 14.98 |
The expression names_pattern = "(T.).(Control|Treated)"
tells the function there are two groups in the column names to be widened. These groups are defined by the two sets of parentheses ()
. The first group is defined by the letter “T” follows by a single character specified by the dot .
which says “any character.” The second group is (Control|Treated)
which tells the function that it is formed by the word Control or Treated. In between is a random single underscore "_".
Alternatively, the names_sep
argument could have been used. For instance, names_sep = 2
tells to split the column names after the second character, yielding this slightly less aesthetic result:
<- pivot_longer(data = two_two_col,
alt cols = -"subject",
names_sep = 2,
names_to = c("Time", "Treatment"),
values_to = "Response")
::kable(head(alt, 8)) knitr
subject | Time | Treatment | Response |
---|---|---|---|
a | T0 | _Control | 8.79 |
a | T0 | _Treated | 10.11 |
a | T1 | _Control | 10.28 |
a | T1 | _Treated | 15.78 |
b | T0 | _Control | 10.28 |
b | T0 | _Treated | 8.81 |
b | T1 | _Control | 9.50 |
b | T1 | _Treated | 14.98 |
3.4 Widening with pivot_wider()
The opposite of pivot_longer()
is pivot_wider()
. First the simple case of a single column:
head(dose_response_long)
## # A tibble: 6 x 4
## subject Time Treatment Response
## <chr> <chr> <chr> <dbl>
## 1 a T0 Control 8.79
## 2 a T0 Treated 10.1
## 3 a T1 Control 10.3
## 4 a T1 Treated 15.8
## 5 b T0 Control 10.3
## 6 b T0 Treated 8.81
Here is the reverse operation with pivot_wider
:
<- pivot_wider(data = dose_response_long,
tmp names_from = c(Time, Treatment),
values_from = Response)
::kable(head(tmp, 5)) knitr
subject | T0_Control | T0_Treated | T1_Control | T1_Treated |
---|---|---|---|---|
a | 8.79 | 10.11 | 10.28 | 15.8 |
b | 10.28 | 8.81 | 9.50 | 15.0 |
c | 11.08 | 8.84 | 9.84 | 14.2 |
d | 7.65 | 8.82 | 11.10 | 14.3 |
e | 10.43 | 8.43 | 12.44 | 20.8 |
3.5 Split columns with separate()
The function separate()
will split on the non-word character if no separator is specified. Here a separator is prided telling to split on an X with two other characters on either side.
<- data.frame(x = 10:15, a = paste0(1:6, "zXb", letters[1:6]))
tmp tmp
## x a
## 1 10 1zXba
## 2 11 2zXbb
## 3 12 3zXbc
## 4 13 4zXbd
## 5 14 5zXbe
## 6 15 6zXbf
separate(data = tmp, col = a, sep = ".X.", into=c("b", "c"))
## x b c
## 1 10 1 a
## 2 11 2 b
## 3 12 3 c
## 4 13 4 d
## 5 14 5 e
## 6 15 6 f
3.6 Combine columns
When you have two columns that should be logically combined into one, use the unite()
function.
For instance, dates and times are often placed in separate columns but they are two aspects of a single measurement: time.
Here is a small example. In a later presentation, working with dates and times will be extensively treated.
<- data.frame(date = c("30/12/2019", "31/12/2019", "01/01/2020", "02/01/2020", "02/01/2020"),
activity time = c("14:55", "21:01", "08:22", "11:56", "16:38"),
activity = c(34, 48, 5, 19, 22))
unite(activity, "datetime", date, time, sep = "T")
## datetime activity
## 1 30/12/2019T14:55 34
## 2 31/12/2019T21:01 48
## 3 01/01/2020T08:22 5
## 4 02/01/2020T11:56 19
## 5 02/01/2020T16:38 22
3.7 Drop rows with missing values: drop_na()
This function is extremely simple and useful. It drops rows with missing from a dataframe and returns a cured copy:
head(drop_na(airquality))
## Ozone Solar.R Wind Temp Month Day Month_f TempFac
## 1 41 190 7.4 67 5 1 May low
## 2 36 118 8.0 72 5 2 May low
## 3 12 149 12.6 74 5 3 May low
## 4 18 313 11.5 62 5 4 May low
## 5 23 299 8.6 65 5 7 May low
## 6 19 99 13.8 59 5 8 May low
There is of course an equivalent in base R: na.omit()
.