2. Tidy data restructuring

How to restructure a messy dataset with tidyr into tidy data.

The idea of tidy data is fundamental to data analysis with the tidyverse. Tidy data describes a clean and principled structure for your data which simplifies analysis tasks. The tidyr package provides tools to maintain or convert datasets into this tidy format, ensuring they integrate smoothly with other tidyverse tools for tasks like visualisation and modelling.

Tidy data

The tidy data format

Tidy data describes a format for structuring data such that:

  • Each variable is a column; each column is a variable.
  • Each observation is a row; each row is an observation.
  • Each value is a cell; each cell is a single value.

This flat and consistent format makes it easier to perform data manipulation and analysis tasks, especially with other tools from the tidyverse.

Consider this untidy dataset from the World Health Organization’s global tuberculosis report:

library(tidyr)
who
# A tibble: 7,240 × 60
   country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
   <chr>    <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>
 1 Afghani… AF    AFG    1980          NA           NA           NA           NA
 2 Afghani… AF    AFG    1981          NA           NA           NA           NA
 3 Afghani… AF    AFG    1982          NA           NA           NA           NA
 4 Afghani… AF    AFG    1983          NA           NA           NA           NA
 5 Afghani… AF    AFG    1984          NA           NA           NA           NA
 6 Afghani… AF    AFG    1985          NA           NA           NA           NA
 7 Afghani… AF    AFG    1986          NA           NA           NA           NA
 8 Afghani… AF    AFG    1987          NA           NA           NA           NA
 9 Afghani… AF    AFG    1988          NA           NA           NA           NA
10 Afghani… AF    AFG    1989          NA           NA           NA           NA
# ℹ 7,230 more rows
# ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, …

This dataset is structurally untidy since the columns 5 through 60 (prefixed with "new_") mix together the diagnosis method, sex, and age group categories. As a result, the columns no longer describe each variable (since many columns are used for the same variable), and each row isn’t an observation (since rows contain many observations for different methods, genders, and ages).

Untidy data makes it difficult to perform simple analysis tasks such as comparing the rates of tuberculosis for different age groups, since the calculation would span many columns. A tidy data format as shown below simplifies this operation, since the age variable is now in a singular column that can be easily grouped for comparison summaries.

Code for cleaning the who dataset
library(tidyr)
who_tidy <- who |> 
  pivot_longer(
    cols = starts_with("new"),
    # Split the names into multiple columns
    names_to = c("diagnosis", "sex", "age"),
    # Use a regular expression where each matching group becomes a column
    names_pattern = "new_?(.*)_(.)(.*)",
    # Reformat the name columns into factors with better labels
    names_transform = list(
      diagnosis = \(x) factor(x, levels = c("rel", "sn", "sp", "ep"), labels = c("Relapse", "Negative pulmonary smear", "Positive pulmonary smear", "Extrapulmonary")),
      sex = \(x) factor(x, levels = c("f", "m"), labels = c("Female", "Male")),
      age = \(x) factor(x, levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"), labels = c("0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"), ordered = TRUE)
      ),
    values_to = "count",
    # Drop structurally missing values
    values_drop_na = TRUE
  )
who_tidy
# A tibble: 76,046 × 8
   country     iso2  iso3   year diagnosis                sex    age   count
   <chr>       <chr> <chr> <dbl> <fct>                    <fct>  <ord> <dbl>
 1 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   0-14      0
 2 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   15-24    10
 3 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   25-34     6
 4 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   35-44     3
 5 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   45-54     5
 6 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   55-64     2
 7 Afghanistan AF    AFG    1997 Positive pulmonary smear Male   65+       0
 8 Afghanistan AF    AFG    1997 Positive pulmonary smear Female 0-14      5
 9 Afghanistan AF    AFG    1997 Positive pulmonary smear Female 15-24    38
10 Afghanistan AF    AFG    1997 Positive pulmonary smear Female 25-34    36
# ℹ 76,036 more rows

Using tidy data

In the next lesson we will learn how to compute summary statistics with dplyr, such as the total tuberculosis cases by age group. A tidy data format makes it much easier to calculate these summaries.

❌ Untidy data code

Untidy data makes it difficult to perform simple calculations.

Since the age variable is ‘wide’ (spread across many columns), the code needs to write out many column names where it is easy to make a mistake. The resulting table is also ‘wide’, and lacks any description of what the data shows (age and totals).

library(dplyr)
who |> 
  summarise(
    `0-14` = sum(new_sp_m014, new_sp_f014, new_sn_m014, new_sn_f014, new_ep_m014, new_ep_f014, newrel_m014, newrel_f014, na.rm = TRUE),
    `15-24` = sum(new_sp_m1524, new_sp_f1524, new_sn_m1524, new_sn_f1524, new_ep_m1524, new_ep_f1524, newrel_m1524, newrel_f1524, na.rm = TRUE),
    `25-34` = sum(new_sp_m2534, new_sp_f2534, new_sn_m2534, new_sn_f2534, new_ep_m2534, new_ep_f2534, newrel_m2534, newrel_f2534, na.rm = TRUE),
    `35-44` = sum(new_sp_m3544, new_sp_f3544, new_sn_m3544, new_sn_f3544, new_ep_m3544, new_ep_f3544, newrel_m3544, newrel_f3544, na.rm = TRUE),
    `45-54` = sum(new_sp_m4554, new_sp_f4554, new_sn_m4554, new_sn_f4554, new_ep_m4554, new_ep_f4554, newrel_m4554, newrel_f4554, na.rm = TRUE),
    `55-64` = sum(new_sp_m5564, new_sp_f5564, new_sn_m5564, new_sn_f5564, new_ep_m5564, new_ep_f5564, newrel_m5564, newrel_f5564, na.rm = TRUE),
    `65+` = sum(new_sp_m65, new_sp_f65, new_sn_m65, new_sn_f65, new_ep_m65, new_ep_f65, newrel_m65, newrel_f65, na.rm = TRUE)
  )
# A tibble: 1 × 7
   `0-14` `15-24` `25-34` `35-44` `45-54` `55-64`   `65+`
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 1708251 7642727 9636127 8766716 6500726 4799668 4343303
✅ Tidy data code

Tidy data makes the code simpler to read and write.

Since each variable is only specified once in the code, mistakes are less likely and easier to find and fix. The resulting table is also in a ‘long’ and tidy format, with descriptive column names.

library(dplyr)
who_tidy |> 
  group_by(age) |> 
  summarise(total = sum(count))
# A tibble: 7 × 2
  age     total
  <ord>   <dbl>
1 0-14  1708251
2 15-24 7642727
3 25-34 9636127
4 35-44 8766716
5 45-54 6500726
6 55-64 4799668
7 65+   4343303

Pivoting with pivot_longer() and pivot_wider()

Pivoting describes changing the structure of the data, where columns are gathered into rows (pivot_longer()) or rows are spread across columns (pivot_wider()).

pivot_longer()

This function is used to transform a dataset from wide to long format. Untidy data often has variables mixed among many columns (like the who tuberculosis dataset), and pivot_longer() is most useful for cleaning data into a long tidy format.

The pivot_longer() function requires two main inputs, the dataset and the columns to pivot.

data |> 
  pivot_longer(cols = c(col1, col2, col3))

There are many ways to specify the columns, one way is to write them all inside c(). Usually wide untidy datasets have many columns to pivot (the who dataset has 56), and so writing each column can be tedious and error prone. Fortunately there is a better way, columns can be specified using tidy selection.

Tidy selection

Tidy selection describes a collection of helpful tools for selecting variables:

  • col1:col3: select all variables from col1 to col3,
  • starts_with("col")/ends_with("x"): select variables with a start or end string,
  • contains("x")/matches("\\d+$"): select variables that contains the string, or matches the regular expression
  • and much more!

A comprehensive summary can be found in the ?tidyr_tidy_select documentation.

To tidy the who dataset into a long format, we use pivot_longer() on the wide columns which conveniently all start with "new". We can also use names_to and values_to to give better names for the new columns which previously were the names and values of the selected columns.

who |> 
  pivot_longer(
    cols = starts_with("new"),
    names_to = "diagnosis_sex_age",
    values_to = "count"
  )
# A tibble: 405,440 × 6
   country     iso2  iso3   year diagnosis_sex_age count
   <chr>       <chr> <chr> <dbl> <chr>             <dbl>
 1 Afghanistan AF    AFG    1980 new_sp_m014          NA
 2 Afghanistan AF    AFG    1980 new_sp_m1524         NA
 3 Afghanistan AF    AFG    1980 new_sp_m2534         NA
 4 Afghanistan AF    AFG    1980 new_sp_m3544         NA
 5 Afghanistan AF    AFG    1980 new_sp_m4554         NA
 6 Afghanistan AF    AFG    1980 new_sp_m5564         NA
 7 Afghanistan AF    AFG    1980 new_sp_m65           NA
 8 Afghanistan AF    AFG    1980 new_sp_f014          NA
 9 Afghanistan AF    AFG    1980 new_sp_f1524         NA
10 Afghanistan AF    AFG    1980 new_sp_f2534         NA
# ℹ 405,430 more rows

Since a wide untidy dataset requires values for every cell, the data has many missing values. After pivoting the data into a longer format, these missing values are no longer structurally required by the data (although they may still be useful or informative). We can remove these missing value rows with the values_drop_na option:

who |> 
  pivot_longer(
    cols = starts_with("new"),
    names_to = "diagnosis_sex_age",
    values_to = "count",
    values_drop_na = TRUE
  )
# A tibble: 76,046 × 6
   country     iso2  iso3   year diagnosis_sex_age count
   <chr>       <chr> <chr> <dbl> <chr>             <dbl>
 1 Afghanistan AF    AFG    1997 new_sp_m014           0
 2 Afghanistan AF    AFG    1997 new_sp_m1524         10
 3 Afghanistan AF    AFG    1997 new_sp_m2534          6
 4 Afghanistan AF    AFG    1997 new_sp_m3544          3
 5 Afghanistan AF    AFG    1997 new_sp_m4554          5
 6 Afghanistan AF    AFG    1997 new_sp_m5564          2
 7 Afghanistan AF    AFG    1997 new_sp_m65            0
 8 Afghanistan AF    AFG    1997 new_sp_f014           5
 9 Afghanistan AF    AFG    1997 new_sp_f1524         38
10 Afghanistan AF    AFG    1997 new_sp_f2534         36
# ℹ 76,036 more rows
Separating multiple variables in names

It is also possible to split the diagnosis_sex_age column into three separate columns using names_sep (for simple delimiters) or names_pattern (for matching regular expressions).

In this case values like "new_sp_m554", new_sn_f65 and newrel_m5564 are not easily separable by a delimiter, and so a regular expression like "names_pattern = new_?(.*)_(.)(.*)" is required.

Alternatively, we can use separate() to separate a column into multiple columns.

The world_bank_pop dataset contains years across multiple columns, use pivot_longer() to gather all of the year columns into a single "year" column with the population values collected into a "population" column.

Hint

All of the year columns can be selected using either:

  • the range of columns 2000:2017,
  • the regular expression matches(\\d{4}) (4 digits), or
  • the logical test where(is.numeric) to select all columns containing numbers (which is only the year columns)
world_bank_pop |> 
  pivot_longer(cols = `2000`:`2017`, names_to = "year", values_to = "population")

pivot_wider()

Conversely, pivot_wider() is used to transform data from long to wide format. This can be useful if your data is untidy because it is too long (multiple variables are contained in rows of a column), or if you’ve completed your analysis and want to nicely present some results in a table of your report.

The pivot_wider() function requires two main inputs, the column names for the wider table’s names and values.

data |> 
  pivot_wider(
    names_from = "names",
    values_from = "values"
  )

The fish_encounters dataset contains sightings of 19 tagged fish as they travel past measurement stations.

fish_encounters
# A tibble: 114 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# ℹ 104 more rows

It can be useful to treat each measurement station as a separate variable, considering each individual fish as the observational unit (rather than a sighting of a fish at a station).

For this we can use pivot_wider() to spread station into the columns with seen as each cell’s values.

fish_encounters |> 
  pivot_wider(names_from = "station", values_from = "seen")
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
 5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
 6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
 7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
 8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
 9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
12 4857        1     1      1     1       1     1     1     1     1    NA    NA
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1    NA    NA
17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

Notice how the wider form introduces many NA values. These are combinations of fish identifiers and measurement stations which did not have a row in the original dataset. For this dataset, this means that the fish was not seen at the station and we can safely fill these missing values with 0 using the values_fill option:

fish_encounters |> 
  pivot_wider(names_from = "station", values_from = "seen", values_fill = 0)
# A tibble: 19 × 12
   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
 1 4842        1     1      1     1       1     1     1     1     1     1     1
 2 4843        1     1      1     1       1     1     1     1     1     1     1
 3 4844        1     1      1     1       1     1     1     1     1     1     1
 4 4845        1     1      1     1       1     0     0     0     0     0     0
 5 4847        1     1      1     0       0     0     0     0     0     0     0
 6 4848        1     1      1     1       0     0     0     0     0     0     0
 7 4849        1     1      0     0       0     0     0     0     0     0     0
 8 4850        1     1      0     1       1     1     1     0     0     0     0
 9 4851        1     1      0     0       0     0     0     0     0     0     0
10 4854        1     1      0     0       0     0     0     0     0     0     0
11 4855        1     1      1     1       1     0     0     0     0     0     0
12 4857        1     1      1     1       1     1     1     1     1     0     0
13 4858        1     1      1     1       1     1     1     1     1     1     1
14 4859        1     1      1     1       1     0     0     0     0     0     0
15 4861        1     1      1     1       1     1     1     1     1     1     1
16 4862        1     1      1     1       1     1     1     1     1     0     0
17 4863        1     1      0     0       0     0     0     0     0     0     0
18 4864        1     1      0     0       0     0     0     0     0     0     0
19 4865        1     1      1     0       0     0     0     0     0     0     0

The indicator variable of the world_bank_pop dataset contains both population totals and growth. These two values are in principle, two different variables which should be treated as separate columns. After pivoting the year columns into a longer structure, now pivot these population indicator variables into a wider structure to produce a tidy data format.

Hint

The indicator variable contains ‘total’ and ‘growth’ variables, which should be made into new columns (names_from). The population variable contains values for these new population totals and growth columns (values_from).

world_bank_pop |> 
  pivot_longer(cols = `2000`:`2017`, names_to = "year", values_to = "population") |> 
  pivot_wider(names_from = "indicator", values_from = "population")

Split or combine columns

Occasionally the values of single column contains information about multiple variables or observations. Separating this column into multiple columns or rows into a tidy data format (where each column is a variable, and each row is an observation) makes later analysis easier.

separate_wider_*()

Separating combined values into a wider format is appropriate when the values contain information about multiple variables.

There are three ways to separate values into columns:

Function Description
separate_wider_position() Splits a column into multiple columns using specific position indices, ideal for fixed-width data.
separate_wider_delim() Divides a column into new columns based on a specified delimiter, useful for structured patterns.
separate_wider_regex() Uses regular expressions to separate a column into multiple columns, suitable for complex patterns.

The indicator variable from the world_bank_pop dataset encodes multiple variables separated by ..

world_bank_pop
# A tibble: 1,064 × 20
   country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
   <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
 2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
 3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
 4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
 5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
 6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
 7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
 8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
 9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
# ℹ 1,054 more rows
# ℹ 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
#   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
#   `2016` <dbl>, `2017` <dbl>

The first part is always "SP" and can be ignored, but the second part is "POP" (total population) or "URB" (urban population), and the third part is "TOTL" (population totals) or "GROW" (population growth).

We can use separate_wider_delim() to separate indicator based on the "." delimiter.

world_bank_pop |> 
  separate_wider_delim(indicator, delim = ".", names = c("sp", "location", "indicator"))
# A tibble: 1,064 × 22
   country sp    location indicator       `2000`  `2001`  `2002`  `2003`  `2004`
   <chr>   <chr> <chr>    <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 ABW     SP    URB      TOTL          41625    4.20e+4 4.22e+4 4.23e+4 4.23e+4
 2 ABW     SP    URB      GROW              1.66 9.56e-1 4.01e-1 1.97e-1 9.46e-2
 3 ABW     SP    POP      TOTL          89101    9.07e+4 9.18e+4 9.27e+4 9.35e+4
 4 ABW     SP    POP      GROW              2.54 1.77e+0 1.19e+0 9.97e-1 9.01e-1
 5 AFE     SP    URB      TOTL      115551653    1.20e+8 1.24e+8 1.29e+8 1.34e+8
 6 AFE     SP    URB      GROW              3.60 3.66e+0 3.72e+0 3.71e+0 3.74e+0
 7 AFE     SP    POP      TOTL      401600588    4.12e+8 4.23e+8 4.34e+8 4.45e+8
 8 AFE     SP    POP      GROW              2.58 2.59e+0 2.61e+0 2.62e+0 2.64e+0
 9 AFG     SP    URB      TOTL        4314700    4.36e+6 4.67e+6 5.06e+6 5.30e+6
10 AFG     SP    URB      GROW              1.86 1.15e+0 6.86e+0 7.95e+0 4.59e+0
# ℹ 1,054 more rows
# ℹ 13 more variables: `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
#   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
#   `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>

separate_longer_*()

Separating combined values into a longer format is appropriate when a column contains multiple values, for example a clothing product’s sizes or styles.

There are two ways to separate values into rows:

Function Name Description
separate_longer_position() Breaks a column into multiple rows based on specified fixed positions, useful for expanding fixed-width data.
separate_longer_delim() Splits a column into multiple rows using a specified delimiter, ideal for lists or values separated by delimiters.

Consider this dataset of clothing items:

# Sample dataset of clothing products
clothing_data <- tibble(
  item = c("T-Shirt", "Jacket", "Dress"),
  size = c("S, M, L", "M, L, XL", "XS, S, M, L"),
  colour = c("Red, Blue, Green", "Black, Grey", "Purple, Yellow")
)
clothing_data
# A tibble: 3 × 3
  item    size        colour          
  <chr>   <chr>       <chr>           
1 T-Shirt S, M, L     Red, Blue, Green
2 Jacket  M, L, XL    Black, Grey     
3 Dress   XS, S, M, L Purple, Yellow  

Since each size and colour is separated by ", ", we can use separate_longer_delim() to separate the multiple product sizes and colours into rows.

clothing_data %>%
  separate_longer_delim(size, delim = ", ") |> 
  separate_longer_delim(colour, delim = ", ")
# A tibble: 23 × 3
   item    size  colour
   <chr>   <chr> <chr> 
 1 T-Shirt S     Red   
 2 T-Shirt S     Blue  
 3 T-Shirt S     Green 
 4 T-Shirt M     Red   
 5 T-Shirt M     Blue  
 6 T-Shirt M     Green 
 7 T-Shirt L     Red   
 8 T-Shirt L     Blue  
 9 T-Shirt L     Green 
10 Jacket  M     Black 
# ℹ 13 more rows

unite()

The unite() function combines multiple columns into a single column, separating the values of each column by a delimiter (by default sep = "_"). This is the opposite operation of separate_wider_*().

It requires two inputs, the data and columns to combine. The col argument supports tidy selection.

data |> 
  unite(col = c(col1, col2, col3))

This is rarely useful for producing tidy data, but it can be useful for preparing data for reports or creating unique identifiers.

library(palmerpenguins)
penguins %>%
  unite(col = "category", c(species, island, sex), sep = "_")
# A tibble: 344 × 6
   category     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
   <chr>                 <dbl>         <dbl>             <int>       <int> <int>
 1 Adelie_Torg…           39.1          18.7               181        3750  2007
 2 Adelie_Torg…           39.5          17.4               186        3800  2007
 3 Adelie_Torg…           40.3          18                 195        3250  2007
 4 Adelie_Torg…           NA            NA                  NA          NA  2007
 5 Adelie_Torg…           36.7          19.3               193        3450  2007
 6 Adelie_Torg…           39.3          20.6               190        3650  2007
 7 Adelie_Torg…           38.9          17.8               181        3625  2007
 8 Adelie_Torg…           39.2          19.6               195        4675  2007
 9 Adelie_Torg…           34.1          18.1               193        3475  2007
10 Adelie_Torg…           42            20.2               190        4250  2007
# ℹ 334 more rows

Handling Missing Values

Appropriately representing missing values in your dataset is necessary for an accurate analysis. The tidyr package in R offers several tools for managing missing data, both implicit and explicit. Implicit missing values occur when expected combinations of data are absent, while explicit missing values appear as NA in your dataset.

complete()

The complete() function transforms implicit missing values (where the row itself is missing) into explicit missing values. It does this by finding all combinations of specified variables, and adding any missing combinations using NA values.

It requires two inputs, the dataset and a structure of variables to combine.

data |> 
  complete(brand, nesting(state, region))

The nesting() function can limit combinations within other variables, preventing redundant or impossible combinations. The code uses nesting(state, region) to ensure that only existing state-region pairings are considered when filling missing data, preventing improper combinations of regions into other states.

The fish_encounters dataset only contains rows where a fish was seen at a measurement station, and so there is an implicit missing value for each time a fish was not observed.

fish_encounters
# A tibble: 114 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
# ℹ 104 more rows

We can use complete() to make these unobserved fish explicit in the data (notice the additional rows, scroll to see them).

fish_encounters |> 
  complete(fish, station)
# A tibble: 209 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
11 4842  MAW         1
12 4843  Release     1
13 4843  I80_1       1
14 4843  Lisbon      1
15 4843  Rstr        1
16 4843  Base_TD     1
17 4843  BCE         1
18 4843  BCW         1
19 4843  BCE2        1
20 4843  BCW2        1
21 4843  MAE         1
22 4843  MAW         1
23 4844  Release     1
24 4844  I80_1       1
25 4844  Lisbon      1
26 4844  Rstr        1
27 4844  Base_TD     1
28 4844  BCE         1
29 4844  BCW         1
30 4844  BCE2        1
31 4844  BCW2        1
32 4844  MAE         1
33 4844  MAW         1
34 4845  Release     1
35 4845  I80_1       1
36 4845  Lisbon      1
37 4845  Rstr        1
38 4845  Base_TD     1
39 4845  BCE        NA
40 4845  BCW        NA
41 4845  BCE2       NA
42 4845  BCW2       NA
43 4845  MAE        NA
44 4845  MAW        NA
45 4847  Release     1
46 4847  I80_1       1
47 4847  Lisbon      1
48 4847  Rstr       NA
49 4847  Base_TD    NA
50 4847  BCE        NA
# ℹ 159 more rows

Since we know all implicit missing values were cases where the fish wasn’t seen at the station, we can use fill to give a known value.

fish_encounters |> 
  complete(fish, station, fill = list(seen = 0))
# A tibble: 209 × 3
   fish  station  seen
   <fct> <fct>   <int>
 1 4842  Release     1
 2 4842  I80_1       1
 3 4842  Lisbon      1
 4 4842  Rstr        1
 5 4842  Base_TD     1
 6 4842  BCE         1
 7 4842  BCW         1
 8 4842  BCE2        1
 9 4842  BCW2        1
10 4842  MAE         1
11 4842  MAW         1
12 4843  Release     1
13 4843  I80_1       1
14 4843  Lisbon      1
15 4843  Rstr        1
16 4843  Base_TD     1
17 4843  BCE         1
18 4843  BCW         1
19 4843  BCE2        1
20 4843  BCW2        1
21 4843  MAE         1
22 4843  MAW         1
23 4844  Release     1
24 4844  I80_1       1
25 4844  Lisbon      1
26 4844  Rstr        1
27 4844  Base_TD     1
28 4844  BCE         1
29 4844  BCW         1
30 4844  BCE2        1
31 4844  BCW2        1
32 4844  MAE         1
33 4844  MAW         1
34 4845  Release     1
35 4845  I80_1       1
36 4845  Lisbon      1
37 4845  Rstr        1
38 4845  Base_TD     1
39 4845  BCE         0
40 4845  BCW         0
41 4845  BCE2        0
42 4845  BCW2        0
43 4845  MAE         0
44 4845  MAW         0
45 4847  Release     1
46 4847  I80_1       1
47 4847  Lisbon      1
48 4847  Rstr        0
49 4847  Base_TD     0
50 4847  BCE         0
# ℹ 159 more rows

drop_na()

The drop_na() function transforms explicit missing values (represented by NA in the dataset) into implicit non-missing values. This is accomplished by removing the entire rows that contain NA values, leaving behind a dataset with only complete observations. It is the natural opposite of complete().

It requires two inputs, the dataset and the variables to drop rows with missing values.

data |> 
  drop_na(col1, col2, col3)

fill()

Used to fill missing values using the previous or next value. This is a type of imputation, and should be done with care as to not introduce bias or inaccuracies into your analysis. The fill() function is particularly useful for data where missing values can logically be assumed to continue from the nearest previous or next observation.

The fill() function accepts three inputs, the dataset, columns to fill, and the direction for filling missing values. There are four options for the fill direction:

.direction Description
"down" Fills NA with the last non-missing value
"up" Fills NA with the next non-missing value
"downup" Fills “down” if possible, otherwise uses “up”
"updown" Fills “up” if possible, otherwise uses “down”

With fill(), you can specify the direction of the fill with .direction, choosing either "down" to propagate the last valid observation forward, "up" to carry the next valid observation backward, or both "downup" or "updown" to fill missing values in more complex patterns. This ensures that imputation respects the data’s natural progression or sequence and maintains its structural integrity.

replace_na()

The replace_na() function allows you to replace NA values with a specific value. It works very similarly to the fill option of the complete() function, where a dataset and list of replacement values for each column is provided.

data |> 
  replace_na(
    list(
      col1 = 0,
      col2 = FALSE
    )
  )