8 Tidying dataframes using Package `tidyr`

This chapter explores the package tidyr which is part of the tidyverse set of packages.

This package is an essential tool collection in the process of organizing you data in a tidy way.

A dataset is considered tidy when it is formed according to these rules:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

Why do we want tidy data?

Because the entire tidyverse works with tidy data, especially the ggplot2 package.
Pivoting data is also often used to make specific data representations for other purposes.
Finally: we often receive “dirty” data that needs to be processed in order to be able to carry out our analyses.

Here is an example dataset, downloaded from the WHO website here Note: the original data was exported from Excel to csv, and a text editor was used to replace single quote occurrences (“’”) with an underscore. Otherwise, data will be corrupted.

The data concerns disease occurrences for two diseases: measles and rubella (“rode hond” in Dutch). Suppose I would like to analyse distribution and timeline of these two diseases, in relation to each other. There are two problems here: 1. The data needs to be clean 2. The data needs to be combined from two files into one dataframe.

Let’s start with the first: measles. Here is part of the measles data. It is not tidy. Why not?

Table 8.1: Why is this not a tidy dataset?
Region	ISO3	Country	Year	January	February	March
AFR	AGO	Angola	2011	17	19	37
AFR	AGO	Angola	2012	373	289	381
AFR	AGO	Angola	2013	725	646	734
AFR	AGO	Angola	2014	1161	1101	1319
AFR	AGO	Angola	2015	4	15	0
AFR	AGO	Angola	2016	3	2	0
AFR	AGO	Angola	2017	1	7	2
AFR	AGO	Angola	2018	3	5	8
AFR	AGO	Angola	2019	120	94	281
AFR	BDI	Burundi	2011	6	2	8
AFR	BDI	Burundi	2012	4	3	10
AFR	BDI	Burundi	2013	0	0	0
AFR	BDI	Burundi	2014	0	0	0
AFR	BDI	Burundi	2015	0	0	2
AFR	BDI	Burundi	2016	0	0	0

The monthly counts are in separate columns. However, they are really all the same variable: measles cases. So this data needs to be tidied: There should be only one column called “cases” and another column called “month”. Or maybe even a single column “Date”? Because year and month are actually elements of a single unit of course. We’ll leave that for a later chapter.

8.1 The `pivot_longer()` function

Tidying has never been simpler, using the pivot_longer function:

measles_tidy <- pivot_longer(data = measles,
                             cols = 5:16,
                             names_to = "Month",
                             values_to = "Cases")
knitr::kable(head(measles_tidy, n=15))

Region	ISO3	Country	Year	Month	Cases
AFR	AGO	Angola	2011	January	17
AFR	AGO	Angola	2011	February	19
AFR	AGO	Angola	2011	March	37
AFR	AGO	Angola	2011	April	41
AFR	AGO	Angola	2011	May	11
AFR	AGO	Angola	2011	June	8
AFR	AGO	Angola	2011	July	5
AFR	AGO	Angola	2011	August	4
AFR	AGO	Angola	2011	September	32
AFR	AGO	Angola	2011	October	10
AFR	AGO	Angola	2011	November	8
AFR	AGO	Angola	2011	December	0
AFR	AGO	Angola	2012	January	373
AFR	AGO	Angola	2012	February	289
AFR	AGO	Angola	2012	March	381

In the pivot_longer function, you provide three pieces of information:

You need to tell which columns to collect the data and headers from. Here, all the month columns are collected: cols = 5:16. you can also use the names of the columns you want to gather.
The name of the column to hold the “old” column headers: names_to. This is the name of the newly created column that will hold the information that is now present in the column headers that you are going to collect - in this case, the months names January-December. Hence, names_to = Month.
The name for values_to. The value_to argument is the name of the column that will hold the actual measurements. In this case, the number of cases from each of the 12 month columns will be “gathered” in this column. Hence, values_to = Cases.

The result is a “flattened but elongated” data structure. All data from the non-gathered columns (Region, ISO3, Country, and Year) will be expanded/duplicated for each of the 12 month rows that will be created.

Here you can see the numbers as verification of the process:

## The original dimensions
dim(measles)

## [1] 1746   16

## The tidied data
dim(measles_tidy)

## [1] 20952     6

## 12 times 1746 is 20952 rows
nrow(measles_tidy) / 12

## [1] 1746

Obviously, the data you gather into a single column must be a single measurement type!

8.1.1 Selecting columns

The cols= argument can be specified using a whole bunch of different methods. A few are listed here.

cols = -(1:4),
cols = -(Region:Year),
cols = January:December,
cols = -c(Region, ISO3, Country, Year),

Later on, we’ll encounter some more means to select columns using the tidyselect helpers.

8.1.2 Another example

As an extra example, here is the dose_response data again.

dose_response <- read.table("data/dose_response2.txt",
                            header = T,
                            sep=";",
                            colClasses=c("character", "factor", "integer", "integer"))
names(dose_response)

## [1] "patient"   "sex"       "dose10mg"  "dose100mg"

It is processed by pivot_longer as well. This time, a Regular Expression is used to parse the dose as numbers only out of the header names.

tmp <- pivot_longer(data = dose_response,
             cols = -c("patient", "sex"),
             names_to = "dose",
             names_pattern = "dose(10|100)mg",
             values_to = "response")
knitr::kable(tmp[1:10, ])

patient	sex	dose	response
001	f	10	12
001	f	100	88
002	f	10	11
002	f	100	54
003	m	10	54
003	m	100	14
004	m	10	71
004	m	100	21
005	f	10	19
005	f	100	89

The argument names_pattern = "dose(10|100)mg" specifies, using parentheses and the “or” symbol | that we want to extract only the numbers 10 or 100 from the column names. Regular expressions are discussed in another chapter (11)

Note however they are still character values so this requires a conversion:

class(tmp$dose)
tmp$dose <- as.integer(tmp$dose)

## [1] "character"

More sophisticated techniques dealt are with in the chapter on package dplyr (9).

8.1.3 A more complex problem: two sets of columns

How about a case with two sets of columns? Here, to create an example, a tibble is used. This is an extension of the well-known data.frame type. It is also part of the tidyverse and will be discussed in more detail as well.

set.seed(1234)
two_two_col <- tibble(subject = letters[1:5],
              T0_Control = rnorm(5, 10, 1),
              T0_Treated = rnorm(5, 9.5, 1.2),
              T1_Control = rnorm(5, 11, 1.5),
              T1_Treated = rnorm(5, 16, 2))
knitr::kable(two_two_col)

subject	T0_Control	T0_Treated	T1_Control	T1_Treated
a	8.79	10.11	10.28	15.8
b	10.28	8.81	9.50	15.0
c	11.08	8.84	9.84	14.2
d	7.65	8.82	11.10	14.3
e	10.43	8.43	12.44	20.8

The measurements are spread over 4 columns and the column names actually represent the levels of two variables: Treatment and Time. Thus, to be tidy we need to mangle the data in this form:

subject time treatment response
a       T0   Control   10.764
a       T0   Treated   8.681
a       T1   Control   13.057
a       T1   Treated   15.188
...

To solve this it requires another bit of knowledge of regular expressions:

dose_response_long <- pivot_longer(data = two_two_col,
             cols = -"subject",
             names_pattern = "(T.)_(Control|Treated)",
             names_to = c("Time", "Treatment"),
             values_to = "Response")
knitr::kable(head(dose_response_long, 8))

subject	Time	Treatment	Response
a	T0	Control	8.79
a	T0	Treated	10.11
a	T1	Control	10.28
a	T1	Treated	15.78
b	T0	Control	10.28
b	T0	Treated	8.81
b	T1	Control	9.50
b	T1	Treated	14.98

The expression names_pattern = "(T.).(Control|Treated)" tells the function there are two groups in the column names to be widened. These groups are defined by the two sets of parentheses (). The first group is defined by the letter “T” follows by a single character specified by the dot . which says “any character”. The second group is (Control|Treated) which tells the function that it is formed by the word Control or Treated. In between is a random single underscore “_“.
Alternatively, and actually much simpler, the names_sep argument could have been used. For instance, names_sep = '_' tells to split the column names at the underscore character character, yielding this result:

alt <- pivot_longer(data = two_two_col,
             cols = -"subject",
             names_sep = "_",
             names_to = c("Time", "Treatment"),
             values_to = "Response")
knitr::kable(head(alt, 8))

subject	Time	Treatment	Response
a	T0	Control	8.79
a	T0	Treated	10.11
a	T1	Control	10.28
a	T1	Treated	15.78
b	T0	Control	10.28
b	T0	Treated	8.81
b	T1	Control	9.50
b	T1	Treated	14.98

8.2 Widening with `pivot_wider()`

The opposite of pivot_longer() is pivot_wider(). First the simple case of a single column:

head(dose_response_long)

## # A tibble: 6 × 4
##   subject Time  Treatment Response
##   <chr>   <chr> <chr>        <dbl>
## 1 a       T0    Control       8.79
## 2 a       T0    Treated      10.1 
## 3 a       T1    Control      10.3 
## 4 a       T1    Treated      15.8 
## 5 b       T0    Control      10.3 
## 6 b       T0    Treated       8.81

Here is the reverse operation with pivot_wider:

tmp <- pivot_wider(data = dose_response_long,
                   names_from = c(Time, Treatment),
                   values_from = Response)
knitr::kable(head(tmp, 5))

subject	T0_Control	T0_Treated	T1_Control	T1_Treated
a	8.79	10.11	10.28	15.8
b	10.28	8.81	9.50	15.0
c	11.08	8.84	9.84	14.2
d	7.65	8.82	11.10	14.3
e	10.43	8.43	12.44	20.8

8.3 Other utility functions

8.3.1 Split columns with `separate()`

The function separate() will split on the non-word character if no separator is specified. Here a separator is prided telling to split on an X with two other characters on either side.

tmp <- data.frame(x = 10:15, a = paste0(1:6, "zXb", letters[1:6]))
tmp

##    x     a
## 1 10 1zXba
## 2 11 2zXbb
## 3 12 3zXbc
## 4 13 4zXbd
## 5 14 5zXbe
## 6 15 6zXbf

separate(data = tmp, col = a, sep = ".X.", into=c("b", "c"))

##    x b c
## 1 10 1 a
## 2 11 2 b
## 3 12 3 c
## 4 13 4 d
## 5 14 5 e
## 6 15 6 f

8.3.2 Combine columns with `unite()`

When you have two columns that should be logically combined into one, use the unite() function. For instance, dates and times are often placed in separate columns but they are two aspects of a single measurement: time. Here is a small example. In a later presentation, working with dates and times will be extensively treated.

activity <- data.frame(date = c("30/12/2019", "31/12/2019", "01/01/2020", "02/01/2020", "02/01/2020"),
                 time = c("14:55", "21:01", "08:22", "11:56", "16:38"),
                 activity = c(34, 48, 5, 19, 22))
unite(activity, "datetime", date, time, sep = "T")

##           datetime activity
## 1 30/12/2019T14:55       34
## 2 31/12/2019T21:01       48
## 3 01/01/2020T08:22        5
## 4 02/01/2020T11:56       19
## 5 02/01/2020T16:38       22

8.3.3 Drop rows with missing values: `drop_na()`

This function is extremely simple and useful. It drops rows with missing from a dataframe and returns a cured copy:

head(drop_na(airquality))

##   Ozone Solar.R Wind Temp Month Day bar
## 1    41     190  7.4   67   May   1   1
## 2    36     118  8.0   72   May   2   1
## 3    12     149 12.6   74   May   3   0
## 4    18     313 11.5   62   May   4   0
## 5    23     299  8.6   65   May   7   0
## 6    19      99 13.8   59   May   8   0

There is of course an equivalent in base R: na.omit().

7 Flow control and scripting

9 Old school data mangling

patient	sex	dose	response
001	f	10	12
001	f	100	88
002	f	10	11
002	f	100	54
003	m	10	54
003	m	100	14
004	m	10	71
004	m	100	21
005	f	10	19
005	f	100	89

patient	sex	dose	response
001	f	10	12
001	f	100	88
002	f	10	11
002	f	100	54
003	m	10	54
003	m	100	14
004	m	10	71
004	m	100	21
005	f	10	19
005	f	100	89

8 Tidying dataframes using Package tidyr

8.1 The pivot_longer() function