How to modify a dataset with dplyr to compute new columns or summaries, change rows or columns, and join datasets together.
Data manipulation is used in analysis for improving data quality, exploring data, and calculating summary tables. It involves modifying observations with filters, sorting data, and selecting, renaming, and adding variables. Computing summaries helps derive insights by aggregating data, while joining tables allows for combining multiple datasets into a cohesive singular dataset. Tools from dplyr enable such data manipulation, preparing informative summary tables and making data ready for modelling and visualisation.
Manipulating Variables
Manipulating variables involves adding new columns, removing unnecessary ones, and renaming existing columns to improve data clarity and relevance. It also includes transforming data by creating derived variables, thereby enhancing the dataset’s usability for analysis.
Select variables (select())
The select() function allows you to choose specific columns from a dataset.
library(dplyr)library(palmerpenguins)# Select species, island, and bill_length_mm columnspenguins |>select(species, island, bill_length_mm)
Try selecting the bill length and depth from the penguins dataset.
Hint
Use select() to extract the bill_length_mm, and bill_depth_mm columns.
penguins |>select(bill_length_mm, bill_depth_mm)
Tidy selection
The select() function supports tidy selection, much like pivot_longer() and many other functions from tidyr.
This allowing you to use helpers like starts_with() and contains() to efficiently select columns.
Rename variables (rename())
The rename() function changes the names of existing columns.
# Rename bill_length_mm to bill_lengthpenguins |>rename(bill_length = bill_length_mm)
# A tibble: 344 Ă— 8
species island bill_length bill_depth_mm flipper_length_mm body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Adelie Torger… 39.1 18.7 181 3750 male
2 Adelie Torger… 39.5 17.4 186 3800 fema…
3 Adelie Torger… 40.3 18 195 3250 fema…
4 Adelie Torger… NA NA NA NA <NA>
5 Adelie Torger… 36.7 19.3 193 3450 fema…
6 Adelie Torger… 39.3 20.6 190 3650 male
7 Adelie Torger… 38.9 17.8 181 3625 fema…
8 Adelie Torger… 39.2 19.6 195 4675 male
9 Adelie Torger… 34.1 18.1 193 3475 <NA>
10 Adelie Torger… 42 20.2 190 4250 <NA>
# ℹ 334 more rows
# ℹ 1 more variable: year <int>
Try renaming the columns to remove the units from the variable names.
Hint
Use rename() to modify column names to remove units, such as changing bill_length_mm to bill_length, bill_depth_mm to bill_depth, flipper_length_mm to flipper_length, and body_mass_g to body_mass.
To apply the same calculation across multiple columns, use mutate() with across() and tidy selection functions. For instance, you can convert measurements from millimetres to centimetres for all columns ending with “mm” by dividing by 10.
penguins |>mutate(across(# Specify columns with tidy selection.cols =ends_with("mm"),# Calculations on each column.fns =~ .x /10, # Name of the new colum (replace mm with cm).names ="{sub('mm', 'cm', .col)}" ) )
# A tibble: 344 Ă— 11
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 5 more variables: sex <fct>, year <int>, bill_length_cm <dbl>,
# bill_depth_cm <dbl>, flipper_length_cm <dbl>
A named list of functions can be provided to compute multiple things across multiple columns.
Manipulating Observations
Extract rows by condition (filter())
The filter() function selects rows based on specific conditions.
# Filter for Adelie species onlypenguins |>filter(species =="Adelie")
# A tibble: 152 Ă— 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Try using filter() to keep only penguins with above average flipper lengths.
Hint
Filter the penguins dataset to include only the rows where flipper_length_mm is greater than the mean of flipper_length_mm.
A negative n or prop will remove values from the start or end of the data.
slice_min() and slice_max() in dplyr help you extract rows with the smallest or largest values of a specific column. This is useful for identifying extremes within your data.
# Find penguins with the smallest bill lengthpenguins %>%slice_min(bill_length_mm, n =3)
# A tibble: 3 Ă— 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
# ℹ 2 more variables: sex <fct>, year <int>
To take a random sample of rows from your dataset, use the slice_sample() function. This is helpful for creating a subset of data for exploratory analysis or testing.
Try sorting the penguins dataset by species and flipper_length_mm.
Hint
Use the arrange() function to sort the dataset first by species and then by flipper_length_mm.
penguins |>arrange(species, flipper_length_mm)
Grouped Operations
The group_by() function is used to group data by one or more variables for subsequent operations.
# Group by species and islandpenguins |>group_by(species, island)
While group_by() doesn’t change any rows or columns of the data itself, it alters how other dplyr functions operate, affecting calculations and transformations.
Grouped operations are most commonly used with summarise() to calculate aggregated statistics, like averages or totals, for each group. However, group_by() also works seamlessly with other dplyr functions like filter() and mutate(), allowing you to conduct calculations and transformations within each group.
Mutating within groups
Mutating within groups allows you to transform data based on group-specific calculations. For example, you can determine if a penguin’s body_mass_g is above average for its species.
The count() function in dplyr is used to tally the number of occurrences of each unique combination of values in specified columns. This is similar to distinct(), which identifies unique cases, but count() provides the number of times each distinct case occurs.
# Count the penguins of each sex for each speciespenguins %>%count(species, sex)
# A tibble: 8 Ă— 3
species sex n
<fct> <fct> <int>
1 Adelie female 73
2 Adelie male 73
3 Adelie <NA> 6
4 Chinstrap female 34
5 Chinstrap male 34
6 Gentoo female 58
7 Gentoo male 61
8 Gentoo <NA> 5
Try counting the number of penguins recorded on each island.
Hint
Use count() to tally the number of penguins recorded for each island specified in the dataset.
penguins |>count(island)
Joining Tables
There are several ways to join multiple tables together in dplyr. You can combine data by stacking rows using bind_rows() or by adding columns with bind_cols(). To add new variables based on common identifying columns, you can use mutating joins like left_join(), right_join(), inner_join(), and full_join(). To extract observations based on matching values in a secondary dataset, use the filtering join function anti_join().
Combining data (bind_rows(), bind_cols())
Multiple datasets with the same variables can be combined by stacking rows using bind_rows(), while datasets with the same observations can be merged by adding columns using bind_cols().
These functions combine data without using common identifying keys, and so special care should be taken to ensure that the datasets are compatible (aligned rows and columns). If your datasets do have matching identifying columns, it is safer to use mutating joins instead of bind_cols().
It is common for datasets to be provided in multiple files, such as a separate dataset for each year of collection. Since they all have the same set of variables, we can use bind_rows() to combine these datasets together.
Mutating joins add additional variables from one dataset to another based on matching key columns.
Join Type
Description
Rows Retained from Left Dataset
Rows Retained from Right Dataset
left_join()
Adds columns from the right dataset to the left dataset. Keeps all rows from the left dataset, with NA for unmatched right dataset rows.
All
Only those that match
right_join()
Adds columns from the left dataset to the right dataset. Keeps all rows from the right dataset, with NA for unmatched left dataset rows.
Only those that match
All
inner_join()
Returns rows that have matching keys in both datasets and adds columns of both datasets to the result.
Only those that match
Only those that match
full_join()
Combines all rows from both datasets, treating unmatched keys with NA values in columns from the opposite dataset.
All
All
Suppose we have a penguin_info dataset which contains key details about different penguin species, including their scientific name, typical nest type, and conservation status.
penguin_info <-tibble(species =c("Adelie", "Chinstrap", "Gentoo"),scientific_name =c("Pygoscelis adeliae", "Pygoscelis antarcticus", "Pygoscelis papua"),nest_type =c("Nests made from stones", "Nests made from stones", "Nests lined with pebbles and vegetation"),conservation_status =c("Least Concern", "Least Concern", "Least Concern"))penguin_info
# A tibble: 3 Ă— 4
species scientific_name nest_type conservation_status
<chr> <chr> <chr> <chr>
1 Adelie Pygoscelis adeliae Nests made from stones Least Concern
2 Chinstrap Pygoscelis antarcticus Nests made from stones Least Concern
3 Gentoo Pygoscelis papua Nests lined with pebbles… Least Concern
We can add this additional information about the penguin species to the data using a left_join().
penguins |>left_join(penguin_info, by ="species")
# A tibble: 344 Ă— 11
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 5 more variables: sex <fct>, year <int>, scientific_name <chr>,
# nest_type <chr>, conservation_status <chr>
Try to add information about the islands to the penguins dataset. Information from the Dream island was not provided in the supplementary island_info dataset, so use an appropriate join operation that will produce a combined dataset without any missing values.
Hint
Consider using an inner_join() for combining the two datasets. This is appropriate because inner_join() will only keep rows with matching values in both datasets. Since the Dream island is not available in island_info, its entries from penguins will be excluded from the final merged dataset.
penguins |>inner_join(island_info, by = ______)
Hint
The datasets need to be joined on a key column they both have in common. In this case, it is the island column, which links data about islands from both the penguins and island_info datasets.
penguins |>inner_join(island_info, by ="island")
Filtering join (semi_join(), anti_join())
Filtering join functions filter the rows in the left dataset based on matching values in the right dataset. Unlike mutating joins, filtering joins do not add additional variables from the right dataset; the right dataset is only used to identify rows to keep or remove.
Join Type
Description
Rows Retained from Left Dataset
Rows Retained from Right Dataset
semi_join()
Returns rows from the left dataset that have matching keys in the right dataset.
Only those that match
Not included in the result
anti_join()
Returns rows from the left dataset that do not have matching keys in the right dataset.
Only those without a match
Not included in the result
The anti_join() function filters observations from the first dataset that do not have matching keys in the second dataset. This is useful when you want to identify records in one dataset without corresponding entries in another.
Suppose we don’t want to add additional information about the islands, but we are only interested in the islands we have complete data for. We could use a filtering join to keep only the observations with known island information.
# A tibble: 220 Ă— 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 210 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Finding incomplete data
Filtering joins can help you identify implicit missing values in your dataset if you have a second dataset with a complete set of identifying variables. Using anti_join(complete_data, analysis_data) will reveal which observations are missing from your analysis dataset.