Reading and writing data from common local file formats
Published
21 February 2025
Data is the core of data analysis, and accurately reading data is the first step of any analysis. There are many data formats and platforms in use, and understanding how to read them efficiently is necessary for working with data in R. Later in your analysis, you often need to share your cleaned dataset or tabulated results with others, and knowing how to write data to appropriate file formats is necessary for effective communication and collaboration.
This lesson will teach the basics of file-based data formats, including:
Text-based data (e.g. CSV)
These datasets use structured plain text data, and are popular for their simplicity and compatibility.
Excel data
Spreadsheet formats used by Microsoft Excel, supporting complex data structures including formulas and graphs.
Serialised data formats (e.g. RDS and RData)
These formats are purpose-built for R and can only be read by functions designed for that format.
Parquet data
Parquet is a columnar storage file format which is ideal for large and complex data structures.
Text-Based Data
Text-based data formats are widely used due to their simplicity and compatibility across various platforms, storing data as plain text. The readr package in R is ideal for importing and exporting text-based datasets.
Comma-Separated Values (CSV)
CSV files are popular for representing tabular data, where each line is a row of data and columns are separated by commas.
You can open these files in any text editor, which will show you something like this:
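As a rough illustration, the sketch below writes a tiny CSV to a temporary file and prints its raw text, which is what a text editor would show. The column names and values are made up for this example and are not taken from the penguins data.

```r
# Hypothetical example data, written to a temporary file for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "species,island,body_mass_g",
    "Adelie,Torgersen,3750",
    "Gentoo,Biscoe,5000"
  ),
  csv_path
)

# Read the file as plain text (no parsing) and print it line by line.
raw_lines <- readLines(csv_path, n = 3)
cat(raw_lines, sep = "\n")
```

The first line holds the column names, and each following line is one row with its values separated by commas.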
The read_csv() function from readr is recommended to read in CSV files. Simply use the function with the file path to the dataset.
The raw CSV for the Palmer penguins dataset can be found at system.file("extdata/penguins_raw.csv", package = "palmerpenguins"), and it can be imported into R as penguins_csv with this code:
library(readr)
# Read the raw penguins CSV file from the palmerpenguins package.
penguins_csv <- read_csv(
  system.file("extdata/penguins_raw.csv", package = "palmerpenguins")
)
Rows: 344 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C...
dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng...
date (1): Date Egg
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice the message about "column specification" when this dataset is imported. Text-based data needs parsing into appropriate data types, such as integers or characters.
The readr package guesses data types for each column automatically. This guessing is usually accurate, but it is strongly recommended that you verify these column types and make any corrections if needed.
You can manually specify the column types with col_types, which is safer than relying on guessing. You can generate some starter code for column types based on the guessed types using spec().
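A minimal sketch of this workflow follows. It uses a small made-up CSV written to a temporary file rather than the penguins data, so the column names (id, species, mass_g, date_seen) are assumptions for illustration only.

```r
library(readr)

# Hypothetical example data, written to a temporary file for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,species,mass_g,date_seen",
    "1,Adelie,3750,2007-11-11",
    "2,Gentoo,unknown,2007-11-12"
  ),
  csv_path
)

# Step 1: read with guessed types, then print the guessed specification.
# The printed cols(...) call can be copied and edited as starter code.
guessed <- read_csv(csv_path, show_col_types = FALSE)
spec(guessed)

# Step 2: read again with explicit column types.
# "unknown" cannot be parsed as a number, so it becomes NA with a warning.
typed <- read_csv(
  csv_path,
  col_types = cols(
    id = col_integer(),
    species = col_character(),
    mass_g = col_double(),
    date_seen = col_date(format = "%Y-%m-%d")
  )
)

# List the values that could not be parsed.
problems(typed)
```

In this sketch the guessed type of mass_g is character because of the "unknown" entry, while the explicit specification forces it to be numeric.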
With col_types specified, each column is always read as the declared type, and any value that cannot be parsed as that type is converted to NA (with a warning).
Other useful importing options
The contents of data files can be just as varied as the file formats themselves. The {readr} package provides many useful arguments to handle these variations and read in the values accurately. Read the documentation with ?read_csv for all the details; some particularly useful arguments are:
skip: Number of lines to skip before reading data, useful for excluding metadata.
na: Specify how missing values are represented, e.g., na = c("", "NA", "missing").
locale: Specify region-specific settings for the data, such as date formats, the decimal mark, and the text encoding.
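The three arguments above can be combined in one call. The sketch below assumes a made-up file with two metadata lines, "missing" as the missing-value marker, and day/month/year dates. None of these details come from the penguins data.

```r
library(readr)

# Hypothetical example file: two metadata lines, then the data.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Survey export (hypothetical metadata line)",
    "Generated for illustration only",
    "id,mass_g,date_seen",
    "1,3750,11/11/2007",
    "2,missing,12/11/2007"
  ),
  csv_path
)

survey <- read_csv(
  csv_path,
  skip = 2,                                    # ignore the two metadata lines
  na = c("", "NA", "missing"),                 # treat "missing" as NA
  locale = locale(date_format = "%d/%m/%Y"),   # dates are day/month/year
  col_types = cols(date_seen = col_date())     # col_date() uses the locale's date format
)
```

Without na = "missing", the mass_g column would be read as text rather than numbers.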
Writing CSVs
The write_csv() function is used to export data frames to CSV files. Simply pass in the dataset to export, and the file path of where the data should be written.
# Exporting a data frame to a CSV file
write_csv(penguins_csv, "output/penguins_processed.csv")
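Note that write_csv() does not create missing folders, so the call above assumes that an output folder already exists. The sketch below is a self-contained variant that creates the folder first. The results data frame and the temporary location are assumptions for illustration.

```r
library(readr)

# Hypothetical results table for illustration.
results <- data.frame(
  species = c("Adelie", "Gentoo"),
  mean_mass_g = c(3700.5, NA)
)

# Create the destination folder before writing (here inside tempdir()).
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)
out_path <- file.path(out_dir, "results.csv")

# Write missing values as empty cells instead of the default "NA".
write_csv(results, out_path, na = "")

# Check what was written by printing the raw lines.
written_lines <- read_lines(out_path)
written_lines
```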
Other text-based formats
Different formats are suited to specific needs, such as varying delimiters or fixed column widths. Here's a summary of other file formats and the functions used to import and export them.
| Format | Read Function | Write Function | Description |
|---|---|---|---|
| Comma-Separated | read_csv() | write_csv() | Commas separate values |
| Tab-Separated | read_tsv() | write_tsv() | Tabs separate values |
| Delimited | read_delim() | write_delim() | Custom delimiter specified by user |
| Fixed Width Format | read_fwf() | N/A | Columns of fixed widths |
| White Space Delimited | read_table() | N/A | Separated by white space |
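As a short sketch of the delimited functions, the example below reads a made-up pipe-delimited file with read_delim() and writes it back out as tab-separated text with write_tsv(). The file contents are assumptions for illustration.

```r
library(readr)

# Hypothetical pipe-delimited file for illustration.
pipe_path <- tempfile(fileext = ".txt")
writeLines(
  c(
    "id|species|mass_g",
    "1|Adelie|3750",
    "2|Gentoo|5000"
  ),
  pipe_path
)

# Read using a custom delimiter.
piped <- read_delim(pipe_path, delim = "|", show_col_types = FALSE)

# Write the same data as a tab-separated file.
tsv_path <- tempfile(fileext = ".tsv")
write_tsv(piped, tsv_path)
tsv_lines <- read_lines(tsv_path)
```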
Reading multiple files
For datasets with similar structures, readr enables you to read them together into a single dataset (optionally with a file identifier).
Simply use the read_*() function with a vector of file paths, and you can add a column for the file name with id = <file_name_column>.
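A minimal sketch of this, assuming two small made-up files with the same columns and a hypothetical identifier column named source_file:

```r
library(readr)

# Hypothetical files with identical structure, created in a temporary folder.
dir_path <- tempfile("penguin_files_")
dir.create(dir_path)
file_a <- file.path(dir_path, "site_a.csv")
file_b <- file.path(dir_path, "site_b.csv")
writeLines(c("id,mass_g", "1,3750", "2,3800"), file_a)
writeLines(c("id,mass_g", "3,4200"), file_b)

# Read both files into one data frame.
# The source_file column records which file each row came from.
combined <- read_csv(
  c(file_a, file_b),
  id = "source_file",
  show_col_types = FALSE
)
combined
```

For a whole folder of files, the vector of paths could come from list.files(dir_path, pattern = "\\.csv$", full.names = TRUE).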
Excel files
The read_excel() function from the readxl package is designed to import data from Excel files into R. It supports both .xls and .xlsx formats. Only a single sheet can be read in at a time, and by default the first sheet is read.
library(readxl)
# Importing data from an Excel file
penguins_xlsx <- read_excel("penguins_data.xlsx")
The read_excel() function behaves very similarly to the {readr} functions, but has some additional options specific to Excel:
sheet: Specify which sheet to read either with the sheet name (as a string) or the sheet index (as a number).
range: This option lets you define a specific cell range to import from the selected sheet, for example "A1:D10".
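The sketch below shows both options using the sample workbook datasets.xlsx that ships with readxl. The path is looked up with readxl_example(), so no file of your own is needed. The choice of the "mtcars" sheet and the cell range is only for illustration.

```r
library(readxl)

# Path to a sample workbook bundled with the readxl package.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook.
excel_sheets(xlsx_path)

# Read a sheet by name, limited to a cell range
# (the header row plus five rows, across four columns).
mtcars_subset <- read_excel(xlsx_path, sheet = "mtcars", range = "A1:D6")

# Read a sheet by position instead of by name.
second_sheet <- read_excel(xlsx_path, sheet = 2)
```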
The {readxl} package does not support writing Excel files.
Serialised R Data Formats
In R, serialised data formats such as RDS and RDA (or RData) are used to efficiently save and load R objects, preserving their structure and metadata. These formats allow you to save and restore any R object, not just datasets, making them a useful format for storing intermediate results.
Serialised formats provide quick read and write operations because they do not require parsing and deparsing like text-based data formats. The disadvantage of this is that these formats are not designed for use outside the R ecosystem.
The readr package supports reading and writing RDS files, with usage similar to its functions for text-based file formats:
library(readr)
# Writing the penguins data to an RDS file
write_rds(penguins_csv, "penguins.rds")
# Reading the penguins data from an RDS file
penguins_rds <- read_rds("penguins.rds")
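Base R offers equivalent functions. saveRDS() and readRDS() handle a single object, while save() and load() store several named objects in one RData file. The sketch below uses made-up objects and temporary files for illustration.

```r
# Hypothetical objects for illustration.
scores <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
model_notes <- list(author = "analyst", version = 1L)

rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# RDS: one object per file, and you choose the name when reading it back.
saveRDS(scores, rds_path)
scores_restored <- readRDS(rds_path)

# RData: several objects per file, restored under their original names.
save(scores, model_notes, file = rdata_path)
rm(scores, model_notes)
loaded_names <- load(rdata_path)
loaded_names
```

Because load() restores objects under their saved names, it can silently overwrite objects with the same names in your session. RDS avoids this by returning the object for you to assign.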