How to restructure a messy dataset with tidyr into tidy data.
The idea of tidy data is fundamental to data analysis with the tidyverse. Tidy data describes a clean and principled structure for your data which simplifies analysis tasks. The tidyr package provides tools to maintain or convert datasets into this tidy format, ensuring they integrate smoothly with other tidyverse tools for tasks like visualisation and modelling.
Tidy data
The tidy data format
Tidy data describes a format for structuring data such that:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.
This flat and consistent format makes it easier to perform data manipulation and analysis tasks, especially with other tools from the tidyverse.
Consider this untidy dataset from the World Health Organization’s global tuberculosis report:
library(tidyr)who
# A tibble: 7,240 × 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghani… AF AFG 1980 NA NA NA NA
2 Afghani… AF AFG 1981 NA NA NA NA
3 Afghani… AF AFG 1982 NA NA NA NA
4 Afghani… AF AFG 1983 NA NA NA NA
5 Afghani… AF AFG 1984 NA NA NA NA
6 Afghani… AF AFG 1985 NA NA NA NA
7 Afghani… AF AFG 1986 NA NA NA NA
8 Afghani… AF AFG 1987 NA NA NA NA
9 Afghani… AF AFG 1988 NA NA NA NA
10 Afghani… AF AFG 1989 NA NA NA NA
# ℹ 7,230 more rows
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
# new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
# new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
# new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
# new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
# new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …
This dataset is structurally untidy since the columns 5 through 60 (prefixed with "new_") mix together the diagnosis method, sex, and age group categories. As a result, the columns no longer describe each variable (since many columns are used for the same variable), and each row isn’t an observation (since rows contain many observations for different methods, genders, and ages).
Untidy data makes it difficult to perform simple analysis tasks such as comparing the rates of tuberculosis for different age groups, since the calculation would span many columns. A tidy data format as shown below simplifies this operation, since the age variable is now in a singular column that can be easily grouped for comparison summaries.
Code for cleaning the who dataset
library(tidyr)who_tidy <- who |>pivot_longer(cols =starts_with("new"),# Split the names into multiple columnsnames_to =c("diagnosis", "sex", "age"),# Use a regular expression where each matching group becomes a columnnames_pattern ="new_?(.*)_(.)(.*)",# Reformat the name columns into factors with better labelsnames_transform =list(diagnosis = \(x) factor(x, levels =c("rel", "sn", "sp", "ep"), labels =c("Relapse", "Negative pulmonary smear", "Positive pulmonary smear", "Extrapulmonary")),sex = \(x) factor(x, levels =c("f", "m"), labels =c("Female", "Male")),age = \(x) factor(x, levels =c("014", "1524", "2534", "3544", "4554", "5564", "65"), labels =c("0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"), ordered =TRUE) ),values_to ="count",# Drop structurally missing valuesvalues_drop_na =TRUE )
who_tidy
# A tibble: 76,046 × 8
country iso2 iso3 year diagnosis sex age count
<chr> <chr> <chr> <dbl> <fct> <fct> <ord> <dbl>
1 Afghanistan AF AFG 1997 Positive pulmonary smear Male 0-14 0
2 Afghanistan AF AFG 1997 Positive pulmonary smear Male 15-24 10
3 Afghanistan AF AFG 1997 Positive pulmonary smear Male 25-34 6
4 Afghanistan AF AFG 1997 Positive pulmonary smear Male 35-44 3
5 Afghanistan AF AFG 1997 Positive pulmonary smear Male 45-54 5
6 Afghanistan AF AFG 1997 Positive pulmonary smear Male 55-64 2
7 Afghanistan AF AFG 1997 Positive pulmonary smear Male 65+ 0
8 Afghanistan AF AFG 1997 Positive pulmonary smear Female 0-14 5
9 Afghanistan AF AFG 1997 Positive pulmonary smear Female 15-24 38
10 Afghanistan AF AFG 1997 Positive pulmonary smear Female 25-34 36
# ℹ 76,036 more rows
Using tidy data
In the next lesson we will learn how to compute summary statistics with dplyr, such as the total tuberculosis cases by age group. A tidy data format makes it much easier to calculate these summaries.
❌ Untidy data code
Untidy data makes it difficult to perform simple calculations.
Since the age variable is ‘wide’ (spread across many columns), the code needs to write out many column names where it is easy to make a mistake. The resulting table is also ‘wide’, and lacks any description of what the data shows (age and totals).
Tidy data makes the code simpler to read and write.
Since each variable is only specified once in the code, mistakes are less likely and easier to find and fix. The resulting table is also in a ‘long’ and tidy format, with descriptive column names.
Pivoting describes changing the structure of the data, where columns are gathered into rows (pivot_longer()) or rows are spread across columns (pivot_wider()).
pivot_longer()
This function is used to transform a dataset from wide to long format. Untidy data often has variables mixed among many columns (like the who tuberculosis dataset), and pivot_longer() is most useful for cleaning data into a long tidy format.
The pivot_longer() function requires two main inputs, the dataset and the columns to pivot.
data |>pivot_longer(cols =c(col1, col2, col3))
There are many ways to specify the columns, one way is to write them all inside c(). Usually wide untidy datasets have many columns to pivot (the who dataset has 56), and so writing each column can be tedious and error prone. Fortunately there is a better way, columns can be specified using tidy selection.
Tidy selection
Tidy selection describes a collection of helpful tools for selecting variables:
col1:col3: select all variables from col1 to col3,
starts_with("col")/ends_with("x"): select variables with a start or end string,
contains("x")/matches("\\d+$"): select variables that contains the string, or matches the regular expression
and much more!
A comprehensive summary can be found in the ?tidyr_tidy_select documentation.
To tidy the who dataset into a long format, we use pivot_longer() on the wide columns which conveniently all start with "new". We can also use names_to and values_to to give better names for the new columns which previously were the names and values of the selected columns.
who |>pivot_longer(cols =starts_with("new"),names_to ="diagnosis_sex_age",values_to ="count" )
# A tibble: 405,440 × 6
country iso2 iso3 year diagnosis_sex_age count
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1980 new_sp_m014 NA
2 Afghanistan AF AFG 1980 new_sp_m1524 NA
3 Afghanistan AF AFG 1980 new_sp_m2534 NA
4 Afghanistan AF AFG 1980 new_sp_m3544 NA
5 Afghanistan AF AFG 1980 new_sp_m4554 NA
6 Afghanistan AF AFG 1980 new_sp_m5564 NA
7 Afghanistan AF AFG 1980 new_sp_m65 NA
8 Afghanistan AF AFG 1980 new_sp_f014 NA
9 Afghanistan AF AFG 1980 new_sp_f1524 NA
10 Afghanistan AF AFG 1980 new_sp_f2534 NA
# ℹ 405,430 more rows
Since a wide untidy dataset requires values for every cell, the data has many missing values. After pivoting the data into a longer format, these missing values are no longer structurally required by the data (although they may still be useful or informative). We can remove these missing value rows with the values_drop_na option:
who |>pivot_longer(cols =starts_with("new"),names_to ="diagnosis_sex_age",values_to ="count",values_drop_na =TRUE )
# A tibble: 76,046 × 6
country iso2 iso3 year diagnosis_sex_age count
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
7 Afghanistan AF AFG 1997 new_sp_m65 0
8 Afghanistan AF AFG 1997 new_sp_f014 5
9 Afghanistan AF AFG 1997 new_sp_f1524 38
10 Afghanistan AF AFG 1997 new_sp_f2534 36
# ℹ 76,036 more rows
Separating multiple variables in names
It is also possible to split the diagnosis_sex_age column into three separate columns using names_sep (for simple delimiters) or names_pattern (for matching regular expressions).
In this case values like "new_sp_m554", new_sn_f65 and newrel_m5564 are not easily separable by a delimiter, and so a regular expression like "names_pattern = new_?(.*)_(.)(.*)" is required.
Alternatively, we can use separate() to separate a column into multiple columns.
The world_bank_pop dataset contains years across multiple columns, use pivot_longer() to gather all of the year columns into a single "year" column with the population values collected into a "population" column.
Hint
All of the year columns can be selected using either:
the range of columns 2000:2017,
the regular expression matches(\\d{4}) (4 digits), or
the logical test where(is.numeric) to select all columns containing numbers (which is only the year columns)
Conversely, pivot_wider() is used to transform data from long to wide format. This can be useful if your data is untidy because it is too long (multiple variables are contained in rows of a column), or if you’ve completed your analysis and want to nicely present some results in a table of your report.
The pivot_wider() function requires two main inputs, the column names for the wider table’s names and values.
data |>pivot_wider(names_from ="names",values_from ="values" )
The fish_encounters dataset contains sightings of 19 tagged fish as they travel past measurement stations.
fish_encounters
# A tibble: 114 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
7 4842 BCW 1
8 4842 BCE2 1
9 4842 BCW2 1
10 4842 MAE 1
# ℹ 104 more rows
It can be useful to treat each measurement station as a separate variable, considering each individual fish as the observational unit (rather than a sighting of a fish at a station).
For this we can use pivot_wider() to spread station into the columns with seen as each cell’s values.
# A tibble: 19 × 12
fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 4842 1 1 1 1 1 1 1 1 1 1 1
2 4843 1 1 1 1 1 1 1 1 1 1 1
3 4844 1 1 1 1 1 1 1 1 1 1 1
4 4845 1 1 1 1 1 NA NA NA NA NA NA
5 4847 1 1 1 NA NA NA NA NA NA NA NA
6 4848 1 1 1 1 NA NA NA NA NA NA NA
7 4849 1 1 NA NA NA NA NA NA NA NA NA
8 4850 1 1 NA 1 1 1 1 NA NA NA NA
9 4851 1 1 NA NA NA NA NA NA NA NA NA
10 4854 1 1 NA NA NA NA NA NA NA NA NA
11 4855 1 1 1 1 1 NA NA NA NA NA NA
12 4857 1 1 1 1 1 1 1 1 1 NA NA
13 4858 1 1 1 1 1 1 1 1 1 1 1
14 4859 1 1 1 1 1 NA NA NA NA NA NA
15 4861 1 1 1 1 1 1 1 1 1 1 1
16 4862 1 1 1 1 1 1 1 1 1 NA NA
17 4863 1 1 NA NA NA NA NA NA NA NA NA
18 4864 1 1 NA NA NA NA NA NA NA NA NA
19 4865 1 1 1 NA NA NA NA NA NA NA NA
Notice how the wider form introduces many NA values. These are combinations of fish identifiers and measurement stations which did not have a row in the original dataset. For this dataset, this means that the fish was not seen at the station and we can safely fill these missing values with 0 using the values_fill option:
The indicator variable of the world_bank_pop dataset contains both population totals and growth. These two values are in principle, two different variables which should be treated as separate columns. After pivoting the year columns into a longer structure, now pivot these population indicator variables into a wider structure to produce a tidy data format.
Hint
The indicator variable contains ‘total’ and ‘growth’ variables, which should be made into new columns (names_from). The population variable contains values for these new population totals and growth columns (values_from).
Occasionally the values of single column contains information about multiple variables or observations. Separating this column into multiple columns or rows into a tidy data format (where each column is a variable, and each row is an observation) makes later analysis easier.
separate_wider_*()
Separating combined values into a wider format is appropriate when the values contain information about multiple variables.
There are three ways to separate values into columns:
Function
Description
separate_wider_position()
Splits a column into multiple columns using specific position indices, ideal for fixed-width data.
separate_wider_delim()
Divides a column into new columns based on a specified delimiter, useful for structured patterns.
separate_wider_regex()
Uses regular expressions to separate a column into multiple columns, suitable for complex patterns.
The indicator variable from the world_bank_pop dataset encodes multiple variables separated by ..
The first part is always "SP" and can be ignored, but the second part is "POP" (total population) or "URB" (urban population), and the third part is "TOTL" (population totals) or "GROW" (population growth).
We can use separate_wider_delim() to separate indicator based on the "." delimiter.
Separating combined values into a longer format is appropriate when a column contains multiple values, for example a clothing product’s sizes or styles.
There are two ways to separate values into rows:
Function Name
Description
separate_longer_position()
Breaks a column into multiple rows based on specified fixed positions, useful for expanding fixed-width data.
separate_longer_delim()
Splits a column into multiple rows using a specified delimiter, ideal for lists or values separated by delimiters.
Consider this dataset of clothing items:
# Sample dataset of clothing productsclothing_data <-tibble(item =c("T-Shirt", "Jacket", "Dress"),size =c("S, M, L", "M, L, XL", "XS, S, M, L"),colour =c("Red, Blue, Green", "Black, Grey", "Purple, Yellow"))clothing_data
# A tibble: 3 × 3
item size colour
<chr> <chr> <chr>
1 T-Shirt S, M, L Red, Blue, Green
2 Jacket M, L, XL Black, Grey
3 Dress XS, S, M, L Purple, Yellow
Since each size and colour is separated by ", ", we can use separate_longer_delim() to separate the multiple product sizes and colours into rows.
# A tibble: 23 × 3
item size colour
<chr> <chr> <chr>
1 T-Shirt S Red
2 T-Shirt S Blue
3 T-Shirt S Green
4 T-Shirt M Red
5 T-Shirt M Blue
6 T-Shirt M Green
7 T-Shirt L Red
8 T-Shirt L Blue
9 T-Shirt L Green
10 Jacket M Black
# ℹ 13 more rows
unite()
The unite() function combines multiple columns into a single column, separating the values of each column by a delimiter (by default sep = "_"). This is the opposite operation of separate_wider_*().
It requires two inputs, the data and columns to combine. The col argument supports tidy selection.
data |>unite(col =c(col1, col2, col3))
This is rarely useful for producing tidy data, but it can be useful for preparing data for reports or creating unique identifiers.
library(palmerpenguins)penguins %>%unite(col ="category", c(species, island, sex), sep ="_")
Appropriately representing missing values in your dataset is necessary for an accurate analysis. The tidyr package in R offers several tools for managing missing data, both implicit and explicit. Implicit missing values occur when expected combinations of data are absent, while explicit missing values appear as NA in your dataset.
complete()
The complete() function transforms implicit missing values (where the row itself is missing) into explicit missing values. It does this by finding all combinations of specified variables, and adding any missing combinations using NA values.
It requires two inputs, the dataset and a structure of variables to combine.
data |>complete(brand, nesting(state, region))
The nesting() function can limit combinations within other variables, preventing redundant or impossible combinations. The code uses nesting(state, region) to ensure that only existing state-region pairings are considered when filling missing data, preventing improper combinations of regions into other states.
The fish_encounters dataset only contains rows where a fish was seen at a measurement station, and so there is an implicit missing value for each time a fish was not observed.
fish_encounters
# A tibble: 114 × 3
fish station seen
<fct> <fct> <int>
1 4842 Release 1
2 4842 I80_1 1
3 4842 Lisbon 1
4 4842 Rstr 1
5 4842 Base_TD 1
6 4842 BCE 1
7 4842 BCW 1
8 4842 BCE2 1
9 4842 BCW2 1
10 4842 MAE 1
# ℹ 104 more rows
We can use complete() to make these unobserved fish explicit in the data (notice the additional rows, scroll to see them).
The drop_na() function transforms explicit missing values (represented by NA in the dataset) into implicit non-missing values. This is accomplished by removing the entire rows that contain NA values, leaving behind a dataset with only complete observations. It is the natural opposite of complete().
It requires two inputs, the dataset and the variables to drop rows with missing values.
data |>drop_na(col1, col2, col3)
fill()
Used to fill missing values using the previous or next value. This is a type of imputation, and should be done with care as to not introduce bias or inaccuracies into your analysis. The fill() function is particularly useful for data where missing values can logically be assumed to continue from the nearest previous or next observation.
The fill() function accepts three inputs, the dataset, columns to fill, and the direction for filling missing values. There are four options for the fill direction:
.direction
Description
"down"
Fills NA with the last non-missing value
"up"
Fills NA with the next non-missing value
"downup"
Fills “down” if possible, otherwise uses “up”
"updown"
Fills “up” if possible, otherwise uses “down”
With fill(), you can specify the direction of the fill with .direction, choosing either "down" to propagate the last valid observation forward, "up" to carry the next valid observation backward, or both "downup" or "updown" to fill missing values in more complex patterns. This ensures that imputation respects the data’s natural progression or sequence and maintains its structural integrity.
replace_na()
The replace_na() function allows you to replace NA values with a specific value. It works very similarly to the fill option of the complete() function, where a dataset and list of replacement values for each column is provided.