3. Data manipulation

How to modify a dataset with dplyr to compute new columns or summaries, change rows or columns, and join datasets together.

Data manipulation is used in analysis for improving data quality, exploring data, and calculating summary tables. It involves modifying observations with filters, sorting data, and selecting, renaming, and adding variables. Computing summaries helps derive insights by aggregating data, while joining tables allows for combining multiple datasets into a cohesive singular dataset. Tools from dplyr enable such data manipulation, preparing informative summary tables and making data ready for modelling and visualisation.

Manipulating Variables

Manipulating variables involves adding new columns, removing unnecessary ones, and renaming existing columns to improve data clarity and relevance. It also includes transforming data by creating derived variables, thereby enhancing the dataset’s usability for analysis.

Select variables (select())

The select() function allows you to choose specific columns from a dataset.

library(dplyr)
library(palmerpenguins)

# Select species, island, and bill_length_mm columns
penguins |>
  select(species, island, bill_length_mm)
# A tibble: 344 Ă— 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows

Try selecting the bill length and depth from the penguins dataset.

Hint

Use select() to extract the bill_length_mm, and bill_depth_mm columns.

penguins |> 
  select(bill_length_mm, bill_depth_mm)
Tidy selection

The select() function supports tidy selection, much like pivot_longer() and many other functions from tidyr.

This allowing you to use helpers like starts_with() and contains() to efficiently select columns.

Rename variables (rename())

The rename() function changes the names of existing columns.

# Rename bill_length_mm to bill_length
penguins |>
  rename(bill_length = bill_length_mm)
# A tibble: 344 Ă— 8
   species island  bill_length bill_depth_mm flipper_length_mm body_mass_g sex  
   <fct>   <fct>         <dbl>         <dbl>             <int>       <int> <fct>
 1 Adelie  Torger…        39.1          18.7               181        3750 male 
 2 Adelie  Torger…        39.5          17.4               186        3800 fema…
 3 Adelie  Torger…        40.3          18                 195        3250 fema…
 4 Adelie  Torger…        NA            NA                  NA          NA <NA> 
 5 Adelie  Torger…        36.7          19.3               193        3450 fema…
 6 Adelie  Torger…        39.3          20.6               190        3650 male 
 7 Adelie  Torger…        38.9          17.8               181        3625 fema…
 8 Adelie  Torger…        39.2          19.6               195        4675 male 
 9 Adelie  Torger…        34.1          18.1               193        3475 <NA> 
10 Adelie  Torger…        42            20.2               190        4250 <NA> 
# ℹ 334 more rows
# ℹ 1 more variable: year <int>

Try renaming the columns to remove the units from the variable names.

Hint

Use rename() to modify column names to remove units, such as changing bill_length_mm to bill_length, bill_depth_mm to bill_depth, flipper_length_mm to flipper_length, and body_mass_g to body_mass.

penguins |> 
  rename(
    bill_length = bill_length_mm, 
    bill_depth = bill_depth_mm, 
    flipper_length = flipper_length_mm,
    body_mass = body_mass_g
  )

Compute variables (mutate())

The mutate() function creates new columns or modifies existing ones.

# Add a new column for body mass in kilograms
penguins |>
  mutate(body_mass_kg = body_mass_g / 1000)
# A tibble: 344 Ă— 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_kg <dbl>

Try creating new columns which convert the bill length, bill depth, and flipper length into centimetres.

Hint

Use mutate() to convert the millimetre columns to centimetres, for instance, change bill_length_mm to bill_length_cm by dividing by 10.

penguins |> 
  mutate(
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10,
    flipper_length_cm = flipper_length_mm / 10
  )
Mutating across multiple columns

To apply the same calculation across multiple columns, use mutate() with across() and tidy selection functions. For instance, you can convert measurements from millimetres to centimetres for all columns ending with “mm” by dividing by 10.

penguins |>
  mutate(
    across(
      # Specify columns with tidy selection
      .cols = ends_with("mm"),
      # Calculations on each column
      .fns = ~ .x / 10, 
      # Name of the new colum (replace mm with cm)
      .names = "{sub('mm', 'cm', .col)}"
    )
  )
# A tibble: 344 Ă— 11
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 5 more variables: sex <fct>, year <int>, bill_length_cm <dbl>,
#   bill_depth_cm <dbl>, flipper_length_cm <dbl>

A named list of functions can be provided to compute multiple things across multiple columns.

Manipulating Observations

Extract rows by condition (filter())

The filter() function selects rows based on specific conditions.

# Filter for Adelie species only
penguins |>
  filter(species == "Adelie")
# A tibble: 152 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Try using filter() to keep only penguins with above average flipper lengths.

Hint

Filter the penguins dataset to include only the rows where flipper_length_mm is greater than the mean of flipper_length_mm.

penguins |> 
  filter(flipper_length_mm > mean(flipper_length_mm, na.rm = TRUE))

Extract rows by position (slice())

The slice() function selects rows by their position.

# Get the first 5 rows
penguins |>
  slice(1:5)
# A tibble: 5 Ă— 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
# ℹ 2 more variables: sex <fct>, year <int>
Slicing the start and end of the data

To quickly access the top or bottom rows of your data, you can use slice_head() and slice_tail() functions.

# Slice the top 5 rows
penguins %>%
  slice_head(n = 5)
# A tibble: 5 Ă— 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
# ℹ 2 more variables: sex <fct>, year <int>
# Slice the bottom 10% of rows
penguins %>%
  slice_tail(prop = 0.1)
# A tibble: 34 Ă— 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream            49.7          18.6               195        3600
 2 Chinstrap Dream            47.5          16.8               199        3900
 3 Chinstrap Dream            47.6          18.3               195        3850
 4 Chinstrap Dream            52            20.7               210        4800
 5 Chinstrap Dream            46.9          16.6               192        2700
 6 Chinstrap Dream            53.5          19.9               205        4500
 7 Chinstrap Dream            49            19.5               210        3950
 8 Chinstrap Dream            46.2          17.5               187        3650
 9 Chinstrap Dream            50.9          19.1               196        3550
10 Chinstrap Dream            45.5          17                 196        3500
# ℹ 24 more rows
# ℹ 2 more variables: sex <fct>, year <int>

A negative n or prop will remove values from the start or end of the data.

slice_min() and slice_max() in dplyr help you extract rows with the smallest or largest values of a specific column. This is useful for identifying extremes within your data.

# Find penguins with the smallest bill length
penguins %>%
  slice_min(bill_length_mm, n = 3)
# A tibble: 3 Ă— 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Dream               32.1          15.5               188        3050
2 Adelie  Dream               33.1          16.1               178        2900
3 Adelie  Torgersen           33.5          19                 190        3600
# ℹ 2 more variables: sex <fct>, year <int>

To take a random sample of rows from your dataset, use the slice_sample() function. This is helpful for creating a subset of data for exploratory analysis or testing.

# Randomly slice 10 penguins
penguins %>%
  slice_sample(n = 10)
# A tibble: 10 Ă— 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream            52            19                 197        4150
 2 Adelie    Biscoe           41.3          21.1               195        4400
 3 Gentoo    Biscoe           47.2          15.5               215        4975
 4 Chinstrap Dream            50.3          20                 197        3300
 5 Gentoo    Biscoe           42.9          13.1               215        5000
 6 Adelie    Dream            39.6          18.1               186        4450
 7 Gentoo    Biscoe           43.5          15.2               213        4650
 8 Chinstrap Dream            52            20.7               210        4800
 9 Chinstrap Dream            51.3          19.2               193        3650
10 Adelie    Dream            40.9          18.9               184        3900
# ℹ 2 more variables: sex <fct>, year <int>

Use an appropriate slice function to find the 5 heaviest penguins.

Hint

Use the slice_max() function to find the 5 penguins with the highest body_mass_g, sorting by this variable and specifying n = 5.

penguins |> 
  slice_max(order_by = body_mass_g, n = 5)

Unique combinations (distinct())

The distinct() function returns unique rows, removing duplicates.

# Get distinct species and islands combinations
penguins |>
  distinct(species, island)
# A tibble: 5 Ă— 2
  species   island   
  <fct>     <fct>    
1 Adelie    Torgersen
2 Adelie    Biscoe   
3 Adelie    Dream    
4 Gentoo    Biscoe   
5 Chinstrap Dream    

Use distinct() to find which years exist in the dataset.

Hint

Use distinct(year) to find and extract the unique years in the dataset.

penguins |> 
  distinct(year)

Sort rows (arrange())

The arrange() function orders the rows according to one or more columns. You can wrap a variable in desc() to sort that variable in descending order.

# Arrange penguins by bill length in descending order
penguins |>
  arrange(desc(bill_length_mm))
# A tibble: 344 Ă— 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Try sorting the penguins dataset by species and flipper_length_mm.

Hint

Use the arrange() function to sort the dataset first by species and then by flipper_length_mm.

penguins |> 
  arrange(species, flipper_length_mm)

Grouped Operations

The group_by() function is used to group data by one or more variables for subsequent operations.

# Group by species and island
penguins |>
  group_by(species, island)

While group_by() doesn’t change any rows or columns of the data itself, it alters how other dplyr functions operate, affecting calculations and transformations.

Grouped operations are most commonly used with summarise() to calculate aggregated statistics, like averages or totals, for each group. However, group_by() also works seamlessly with other dplyr functions like filter() and mutate(), allowing you to conduct calculations and transformations within each group.

Mutating within groups

Mutating within groups allows you to transform data based on group-specific calculations. For example, you can determine if a penguin’s body_mass_g is above average for its species.

penguins |> 
  group_by(species) |> 
  mutate(above_average_mass = body_mass_g > mean(body_mass_g, na.rm = TRUE)) |> 
  ungroup()
# A tibble: 344 Ă— 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, above_average_mass <lgl>

Summarising Observations

summarise()

The summarise() function in dplyr calculates summary statistics over a dataset, allowing you to condense information into key metrics.

# Calculate summary statistics for flipper length
penguins %>%
  summarise(
    min_flipper = min(flipper_length_mm, na.rm = TRUE),
    median_flipper = median(flipper_length_mm, na.rm = TRUE),
    max_flipper = max(flipper_length_mm, na.rm = TRUE)
  )
# A tibble: 1 Ă— 3
  min_flipper median_flipper max_flipper
        <int>          <dbl>       <int>
1         172            197         231

The summarise() function is often combined with group_by(), allowing you to compare summary statistics across different segments of your data.

# Summarise the average bill length for each species
penguins |>
  group_by(species) |>
  summarise(average_bill_length = mean(bill_length_mm, na.rm = TRUE))
# A tibble: 3 Ă— 2
  species   average_bill_length
  <fct>                   <dbl>
1 Adelie                   38.8
2 Chinstrap                48.8
3 Gentoo                   47.5

Try calculating the average body mass by sex for Adelie, Chinstrap, and Gentoo penguins.

Hint

Group the data by both sex and species before using summarise() to calculate the mean of body_mass_g.

penguins |> 
  group_by(sex, species) |> 
  summarise(average_body_mass = mean(body_mass_g, na.rm = TRUE))

count()

The count() function in dplyr is used to tally the number of occurrences of each unique combination of values in specified columns. This is similar to distinct(), which identifies unique cases, but count() provides the number of times each distinct case occurs.

# Count the penguins of each sex for each species
penguins %>%
  count(species, sex)
# A tibble: 8 Ă— 3
  species   sex        n
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      73
3 Adelie    <NA>       6
4 Chinstrap female    34
5 Chinstrap male      34
6 Gentoo    female    58
7 Gentoo    male      61
8 Gentoo    <NA>       5

Try counting the number of penguins recorded on each island.

Hint

Use count() to tally the number of penguins recorded for each island specified in the dataset.

penguins |> 
  count(island)

Joining Tables

There are several ways to join multiple tables together in dplyr. You can combine data by stacking rows using bind_rows() or by adding columns with bind_cols(). To add new variables based on common identifying columns, you can use mutating joins like left_join(), right_join(), inner_join(), and full_join(). To extract observations based on matching values in a secondary dataset, use the filtering join function anti_join().

Combining data (bind_rows(), bind_cols())

Multiple datasets with the same variables can be combined by stacking rows using bind_rows(), while datasets with the same observations can be merged by adding columns using bind_cols().

These functions combine data without using common identifying keys, and so special care should be taken to ensure that the datasets are compatible (aligned rows and columns). If your datasets do have matching identifying columns, it is safer to use mutating joins instead of bind_cols().

It is common for datasets to be provided in multiple files, such as a separate dataset for each year of collection. Since they all have the same set of variables, we can use bind_rows() to combine these datasets together.

Code for setting up separate penguins datasets
library(dplyr)
library(palmerpenguins)
penguins_year <- penguins |> 
  group_by(year) |> 
  group_split()
penguins_2007 <- penguins_year[[1]]
penguins_2008 <- penguins_year[[3]]
penguins_2009 <- penguins_year[[2]]
bind_rows(penguins_2007, penguins_2008, penguins_2009)
# A tibble: 344 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Mutating join (left_join(), right_join(), inner_join(), full_join())

Mutating joins add additional variables from one dataset to another based on matching key columns.

Join Type Description Rows Retained from Left Dataset Rows Retained from Right Dataset
left_join() Adds columns from the right dataset to the left dataset. Keeps all rows from the left dataset, with NA for unmatched right dataset rows. All Only those that match
right_join() Adds columns from the left dataset to the right dataset. Keeps all rows from the right dataset, with NA for unmatched left dataset rows. Only those that match All
inner_join() Returns rows that have matching keys in both datasets and adds columns of both datasets to the result. Only those that match Only those that match
full_join() Combines all rows from both datasets, treating unmatched keys with NA values in columns from the opposite dataset. All All

Suppose we have a penguin_info dataset which contains key details about different penguin species, including their scientific name, typical nest type, and conservation status.

penguin_info <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  scientific_name = c("Pygoscelis adeliae", "Pygoscelis antarcticus", "Pygoscelis papua"),
  nest_type = c("Nests made from stones", "Nests made from stones", "Nests lined with pebbles and vegetation"),
  conservation_status = c("Least Concern", "Least Concern", "Least Concern")
)
penguin_info
# A tibble: 3 Ă— 4
  species   scientific_name        nest_type                 conservation_status
  <chr>     <chr>                  <chr>                     <chr>              
1 Adelie    Pygoscelis adeliae     Nests made from stones    Least Concern      
2 Chinstrap Pygoscelis antarcticus Nests made from stones    Least Concern      
3 Gentoo    Pygoscelis papua       Nests lined with pebbles… Least Concern      

We can add this additional information about the penguin species to the data using a left_join().

penguins |>
  left_join(penguin_info, by = "species")
# A tibble: 344 Ă— 11
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 5 more variables: sex <fct>, year <int>, scientific_name <chr>,
#   nest_type <chr>, conservation_status <chr>

Try to add information about the islands to the penguins dataset. Information from the Dream island was not provided in the supplementary island_info dataset, so use an appropriate join operation that will produce a combined dataset without any missing values.

Hint

Consider using an inner_join() for combining the two datasets. This is appropriate because inner_join() will only keep rows with matching values in both datasets. Since the Dream island is not available in island_info, its entries from penguins will be excluded from the final merged dataset.

penguins |> 
  inner_join(island_info, by = ______)
Hint

The datasets need to be joined on a key column they both have in common. In this case, it is the island column, which links data about islands from both the penguins and island_info datasets.

penguins |> 
  inner_join(island_info, by = "island")

Filtering join (semi_join(), anti_join())

Filtering join functions filter the rows in the left dataset based on matching values in the right dataset. Unlike mutating joins, filtering joins do not add additional variables from the right dataset; the right dataset is only used to identify rows to keep or remove.

Join Type Description Rows Retained from Left Dataset Rows Retained from Right Dataset
semi_join() Returns rows from the left dataset that have matching keys in the right dataset. Only those that match Not included in the result
anti_join() Returns rows from the left dataset that do not have matching keys in the right dataset. Only those without a match Not included in the result

The anti_join() function filters observations from the first dataset that do not have matching keys in the second dataset. This is useful when you want to identify records in one dataset without corresponding entries in another.

Suppose we don’t want to add additional information about the islands, but we are only interested in the islands we have complete data for. We could use a filtering join to keep only the observations with known island information.

island_info <- tibble(
  island = c("Biscoe", "Torgersen"),
  coastline_km = c(10, 3),
  latitude = c(-64.8, -64.8),
  longitude = c(-63.8, -64.1)
)

penguins |>
  semi_join(island_info, by = "island")
# A tibble: 220 Ă— 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 210 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Finding incomplete data

Filtering joins can help you identify implicit missing values in your dataset if you have a second dataset with a complete set of identifying variables. Using anti_join(complete_data, analysis_data) will reveal which observations are missing from your analysis dataset.