Chapter 3 Package tidyr

3.1 Introduction

This chapter explores the package tidyr. It is also part of the tidyverse set of packages.

This package is an essential tool in the process of organizing you data in a tidy way. A dataset is considered tidy when it is formed according to these rules:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Here is an example dataset, downloaded from the WHO website here Note: the original data was exported from Excel to csv, and a text editor was used to replace single quote occurrences (“’”) with an underscore. Otherwise, data will be corrupted.

The data concerns disease occurrences for two diseases: measles and rubella (“rode hond” in Dutch). Suppose I would like to analyse distribution and timeline of these two diseases, in relation to each other. There are two problems here: 1. The data needs to be clean 2. The data needs to be combined from two files into one dataframe.

Let’s start with the first: measles. Here is part of the measles data. It is not tidy. Why not?

Table 3.1: Why is this not a tidy dataset?
Region ISO3 Country Year January February March
AFR AGO Angola 2011 17 19 37
AFR AGO Angola 2012 373 289 381
AFR AGO Angola 2013 725 646 734
AFR AGO Angola 2014 1161 1101 1319
AFR AGO Angola 2015 4 15 0
AFR AGO Angola 2016 3 2 0
AFR AGO Angola 2017 1 7 2
AFR AGO Angola 2018 3 5 8
AFR AGO Angola 2019 120 94 281
AFR BDI Burundi 2011 6 2 8
AFR BDI Burundi 2012 4 3 10
AFR BDI Burundi 2013 0 0 0
AFR BDI Burundi 2014 0 0 0
AFR BDI Burundi 2015 0 0 2
AFR BDI Burundi 2016 0 0 0

The monthly counts are in separate columns. However, they are really all the same variable: measles cases. So this data needs to be tidied: There should be only one column called “cases” and another column called “month.” Or maybe even a single column “Date?” Because year and month are actually elements of a single unit of course. We’ll leave that for a later chapter.

3.2 The pivot_longer() function

Tidying has never been simpler, using the pivot_longer function:

measles_tidy <- pivot_longer(data = measles,
                             cols = 5:16, # or use -(1:4)
                             names_to = "Month",
                             values_to = "Cases")
knitr::kable(head(measles_tidy, n=15))
Region ISO3 Country Year Month Cases
AFR AGO Angola 2011 January 17
AFR AGO Angola 2011 February 19
AFR AGO Angola 2011 March 37
AFR AGO Angola 2011 April 41
AFR AGO Angola 2011 May 11
AFR AGO Angola 2011 June 8
AFR AGO Angola 2011 July 5
AFR AGO Angola 2011 August 4
AFR AGO Angola 2011 September 32
AFR AGO Angola 2011 October 10
AFR AGO Angola 2011 November 8
AFR AGO Angola 2011 December 0
AFR AGO Angola 2012 January 373
AFR AGO Angola 2012 February 289
AFR AGO Angola 2012 March 381

In the pivot_longer function, you provide three pieces of information:

  1. You need to tell which columns to collect the data and headers from. Here, all the month columns are collected: cols = 5:16. you can also use the names of the columns you want to gather.

  2. The name of the column to hold the “old” column headers: names_to. This is the name of the newly created column that will hold the information that is now present in the column headers that you are going to collect - in this case, the months names January-December. Hence, names_to = Month.

  3. The name for values_to. The value_to argument is the name of the column that will hold the actual measurements. In this case, the number of cases from each of the 12 month columns will be “gathered” in this column. Hence, values_to = Cases.

The result is a “flattened but elongated” data structure. All data from the non-gathered columns (Region, ISO3, Country, and Year) will be expanded/duplicated for each of the 12 month rows that will be created.

Here you can see the numbers as verification of the process:

## The original dimensions
dim(measles)
## [1] 1746   16
## The tidied data
dim(measles_tidy)
## [1] 20952     6
## 12 times 1746 is 20952 rows
nrow(measles_tidy) / 12
## [1] 1746

Obviously, the data you gather into a single column must be a single measurement type!

As an extra example, here is the dose_response data again.

dose_response <- read.table("data/dose_response2.txt",
                            header = T,
                            sep=";",
                            colClasses=c("character", "factor", "integer", "integer"))
names(dose_response)
## [1] "patient"   "sex"       "dose10mg"  "dose100mg"

It is processed by pivot_longer as well. This time, a Regular Expression is used to parse the dose as numbers only out of the header names.

tmp <- pivot_longer(data = dose_response,
             cols = -c("patient", "sex"),
             names_to = "dose",
             names_pattern = "dose(10|100)mg",
             values_to = "response")
knitr::kable(tmp[1:10, ])
patient sex dose response
001 f 10 12
001 f 100 88
002 f 10 11
002 f 100 54
003 m 10 54
003 m 100 14
004 m 10 71
004 m 100 21
005 f 10 19
005 f 100 89

Note however they are still character values so this requires a conversion, dealt with in chapter dplyr:

class(tmp$dose)
## [1] "character"

3.3 A more complex problem: two sets of columns

How about a case with two sets of columns? Here, to create an example, a tibble is used. This is an extension of the well-known data.frame type. It is also part of the tidyverse and will be discussed in more detail as well.

set.seed(1234)
two_two_col <- tibble(subject = letters[1:5],
              T0_Control = rnorm(5, 10, 1),
              T0_Treated = rnorm(5, 9.5, 1.2),
              T1_Control = rnorm(5, 11, 1.5),
              T1_Treated = rnorm(5, 16, 2))
knitr::kable(two_two_col)
subject T0_Control T0_Treated T1_Control T1_Treated
a 8.79 10.11 10.28 15.8
b 10.28 8.81 9.50 15.0
c 11.08 8.84 9.84 14.2
d 7.65 8.82 11.10 14.3
e 10.43 8.43 12.44 20.8

The measurements are spread over 4 columns and the column names actually represent the levels of two variables: Treatment and Time. Thus, to be tidy we need to mangle the data in this form:

subject time treatment response
a       T0   Control   10.764
a       T0   Treated   8.681
a       T1   Control   13.057
a       T1   Treated   15.188
...

To solve this it requires a bit of knowledge of regular expressions (fortunately dealt with in a previous course):

dose_response_long <- pivot_longer(data = two_two_col,
             cols = -"subject",
             names_pattern = "(T.)_(Control|Treated)",
             names_to = c("Time", "Treatment"),
             values_to = "Response")
knitr::kable(head(dose_response_long, 8))
subject Time Treatment Response
a T0 Control 8.79
a T0 Treated 10.11
a T1 Control 10.28
a T1 Treated 15.78
b T0 Control 10.28
b T0 Treated 8.81
b T1 Control 9.50
b T1 Treated 14.98

The expression names_pattern = "(T.).(Control|Treated)" tells the function there are two groups in the column names to be widened. These groups are defined by the two sets of parentheses (). The first group is defined by the letter “T” follows by a single character specified by the dot . which says “any character.” The second group is (Control|Treated) which tells the function that it is formed by the word Control or Treated. In between is a random single underscore "_". Alternatively, the names_sep argument could have been used. For instance, names_sep = 2 tells to split the column names after the second character, yielding this slightly less aesthetic result:

alt <- pivot_longer(data = two_two_col,
             cols = -"subject",
             names_sep = 2,
             names_to = c("Time", "Treatment"),
             values_to = "Response")
knitr::kable(head(alt, 8))
subject Time Treatment Response
a T0 _Control 8.79
a T0 _Treated 10.11
a T1 _Control 10.28
a T1 _Treated 15.78
b T0 _Control 10.28
b T0 _Treated 8.81
b T1 _Control 9.50
b T1 _Treated 14.98

3.4 Widening with pivot_wider()

The opposite of pivot_longer() is pivot_wider(). First the simple case of a single column:

head(dose_response_long)
## # A tibble: 6 x 4
##   subject Time  Treatment Response
##   <chr>   <chr> <chr>        <dbl>
## 1 a       T0    Control       8.79
## 2 a       T0    Treated      10.1 
## 3 a       T1    Control      10.3 
## 4 a       T1    Treated      15.8 
## 5 b       T0    Control      10.3 
## 6 b       T0    Treated       8.81

Here is the reverse operation with pivot_wider:

tmp <- pivot_wider(data = dose_response_long,
                   names_from = c(Time, Treatment),
                   values_from = Response)
knitr::kable(head(tmp, 5))
subject T0_Control T0_Treated T1_Control T1_Treated
a 8.79 10.11 10.28 15.8
b 10.28 8.81 9.50 15.0
c 11.08 8.84 9.84 14.2
d 7.65 8.82 11.10 14.3
e 10.43 8.43 12.44 20.8

3.5 Split columns with separate()

The function separate() will split on the non-word character if no separator is specified. Here a separator is prided telling to split on an X with two other characters on either side.

tmp <- data.frame(x = 10:15, a = paste0(1:6, "zXb", letters[1:6]))
tmp
##    x     a
## 1 10 1zXba
## 2 11 2zXbb
## 3 12 3zXbc
## 4 13 4zXbd
## 5 14 5zXbe
## 6 15 6zXbf
separate(data = tmp, col = a, sep = ".X.", into=c("b", "c"))
##    x b c
## 1 10 1 a
## 2 11 2 b
## 3 12 3 c
## 4 13 4 d
## 5 14 5 e
## 6 15 6 f

3.6 Combine columns

When you have two columns that should be logically combined into one, use the unite() function. For instance, dates and times are often placed in separate columns but they are two aspects of a single measurement: time. Here is a small example. In a later presentation, working with dates and times will be extensively treated.

activity <- data.frame(date = c("30/12/2019", "31/12/2019", "01/01/2020", "02/01/2020", "02/01/2020"),
                 time = c("14:55", "21:01", "08:22", "11:56", "16:38"),
                 activity = c(34, 48, 5, 19, 22))
unite(activity, "datetime", date, time, sep = "T")
##           datetime activity
## 1 30/12/2019T14:55       34
## 2 31/12/2019T21:01       48
## 3 01/01/2020T08:22        5
## 4 02/01/2020T11:56       19
## 5 02/01/2020T16:38       22

3.7 Drop rows with missing values: drop_na()

This function is extremely simple and useful. It drops rows with missing from a dataframe and returns a cured copy:

head(drop_na(airquality))
##   Ozone Solar.R Wind Temp Month Day Month_f TempFac
## 1    41     190  7.4   67     5   1     May     low
## 2    36     118  8.0   72     5   2     May     low
## 3    12     149 12.6   74     5   3     May     low
## 4    18     313 11.5   62     5   4     May     low
## 5    23     299  8.6   65     5   7     May     low
## 6    19      99 13.8   59     5   8     May     low

There is of course an equivalent in base R: na.omit().