Statistically summarise relationships between variables of a dataset with a linear regression model.
Published
21 February 2025
Regression modelling helps us understand relationships between variables and make predictions. Regression allows us to quantify these relationships between two or more variables to make informed decisions based on data. The lm() function in R allows us to estimate linear regression models using a model formula.
Simple linear regression
A simple linear regression models the relationship between one predictor (X) and one response (Y).
\(\mathbf{Y}\) is the dependent variable (what is being predicted),
\(\mathbf{1}\) is a vector containing 1 (for the intercept),
\(\mathbf{X}\) is the independent variable (how it is predicted),
\(\beta_0\) (intercept) and \(\beta_1\) (slope) are model parameters,
\(\boldsymbol{\varepsilon}\) represents random error (assumed \(\varepsilon_i \overset{iid}{\sim} N(0,1)\)).
Programmatically, the R model formula is:
lm(Y ~1+ X, data = data)
where:
Y is the name of the dependent variable from data,
1 indicates to include an intercept,
X is the name of the independent variable from data,
data is the name of the dataset being modelled.
Example: Predicting penguin weight
Letās predict the weight of adult foraging penguins nearby Antarcticaās Palmer Station. This dataset is included in the palmerpenguins package.
The weight of each penguin is recorded in the body_mass_g column (our dependent variable), and we will use the length of their bill (from the bill_length_mm column) to predict it. Our hypothesis could be that a larger bill allows a penguin to eat more fish!
library(palmerpenguins)# Fit simple linear regression modellm(body_mass_g ~ bill_length_mm, data = penguins)
The coefficient for bill_length_mm (\(\beta_1 \approx 87.42\)) indicates that for each additional millimetre in bill length, the penguinās body mass increases by roughly 87.42 grams on average.
To evaluate the significance of this relationship, we can look at the model in more detail using summary().
# Fit simple linear regression modelfit <-lm(body_mass_g ~ bill_length_mm, data = penguins)# View model summarysummary(fit)
Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1762.08 -446.98 32.59 462.31 1636.86
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 362.307 283.345 1.279 0.202
bill_length_mm 87.415 6.402 13.654 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 645.4 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.3542, Adjusted R-squared: 0.3523
F-statistic: 186.4 on 1 and 340 DF, p-value: < 2.2e-16
The summary() output shows that the bill_length_mm coefficient has a p-value of \(< 2e^{-16}\) (nearly 0), which is significant at a 5% level (since it is less than 0.05). This analysis and interpretation isnāt reliable, since there are many statistical considerations which have not yet been satisfied.
Interpret with caution
While the model provides useful insights, it should be interpreted with caution. Making accurate interpretations from a model requires thorough investigation of the model and data. A deeper understanding of the modelās capabilities and assumptions is beyond this lesson.
While the model suggests a positive relationship between bill length and body mass, there are several important caveats and assumptions to consider:
Assumed Linearity
The model assumes a linear relationship between bill length and body mass. If the true relationship is curved or more complex, a simple linear regression may not be appropriate.
Causation vs. Correlation
The model does not prove that bill length causes changes in body massāit only identifies an association. Other factors (e.g., diet, age, species differences) may influence both variables.
Omitted Variable Bias
The model only includes bill length as a predictor. Other important predictors like flipper length, species, or sex might influence body mass. Ignoring these could lead to misleading results.
Extrapolation Risk
Predictions should only be made within the range of observed bill lengths in the dataset. For example, predicting body mass for a penguin with an extremely small or large bill (outside the datasetās range) could lead to unreliable estimates.
Residual assumptions
The model assumes that residuals (errors) are \(\varepsilon_i \overset{iid}{\sim} N(0,1)\), and this should be assumption should be tested with residual diagnostics. The iid means:
Independent (and)
The weight of one penguin does not affect another
Identically
All residuals have a:
mean of 0 (needed for unbiased predictions)
constant variance (needed for accurate prediction intervals)
Distributed
The residuals are normally distributed (needed for valid hypothesis testing)
Try estimating a simple linear regression model with a different predictor from the Palmer penguins dataset.
Hint
Choose another variable from the penguins dataset, such as bill_depth_mm, flipper_length_mm, or species for building the model.
fit <-lm(body_mass_g ~ bill_depth_mm, data = penguins)summary(fit)
Multiple linear regression
A multiple linear regression extends simple linear regression by including multiple predictors (Xā, Xā, ..., Xā) to explain variations in a response variable (Y). This helps account for additional sources of variation and can reduce omitted variable bias, which occurs when an important predictor is left out, potentially leading to misleading estimates.
\(\mathbf{X_1}, \mathbf{X_2}, \dots, \mathbf{X_n}\) are independent variables (predictors),
\(\beta_1, \beta_2, \dots, \beta_n\) are their respective coefficients,
\(\beta_0\) is the intercept,
\(\boldsymbol{\varepsilon}\) represents random error.
Programmatically, the R model formula is:
lm(Y ~1+ X1 + X2 + X3 + ..., data = data)
where:
Y is the name of the dependent variable from data,
1 indicates to include an intercept,
X1, X2, X3, ā¦ are the names of the independent variables from data,
data is the name of the dataset being modelled.
Example: Predicting penguin weight with multiple predictors
Instead of just using bill length, we now incorporate flipper length as an additional predictor. Our hypothesis is that both bill size and flipper size contribute to the body mass of a penguin (perhaps longer flippers helps a penguin swim faster to catch more fish)!
# Fit multiple linear regression modelfit_multi <-lm(body_mass_g ~ bill_length_mm + flipper_length_mm, data = penguins)# View model summarysummary(fit_multi)
Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1090.5 -285.7 -32.1 244.2 1287.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5736.897 307.959 -18.629 <2e-16 ***
bill_length_mm 6.047 5.180 1.168 0.244
flipper_length_mm 48.145 2.011 23.939 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 394.1 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.76, Adjusted R-squared: 0.7585
F-statistic: 536.6 on 2 and 339 DF, p-value: < 2.2e-16
Each coefficient can be interpretted similarly to simple linear regression.
The coefficient for bill_length_mm (\(\beta_1 \approx 6.05\)) indicates that for each additional millimetre in bill length, the penguinās body mass increases by roughly 6.05 grams on average, holding flipper length constant. However, the p-value (0.244) suggests this effect is not statistically significant at a 5% level.
The coefficient for flipper_length_mm (\(\beta_2 \approx 48.15\)) indicates that for each additional millimetre in flipper length, the penguinās body mass increases by roughly 48.15 grams on average, holding bill length constant. This effect is highly significant (p < 0.001).
Notice that when including the flipper length in the model, the coefficient for bill length (\(\beta_1\)) becomes much smaller and is no longer statistically significant (p = 0.244). This suggests that flipper length may account for much of the variation in body mass that was previously attributed to bill length in a simple linear regression. In other words, bill lengthās effect on body mass may have been overstated in the simpler model due to omitted variable bias, and flipper length appears to be a stronger predictor of body mass.
Now add the bill depth to the model and see how it changes the model.
Hint
Include the specified numerical predictors: bill_length_mm, bill_depth_mm, and flipper_length_mm to build the model.
fit <-lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguins)summary(fit)
Categorical predictors (dummy variables)
A dummy variable is a numerical representation of a categorical variable. It uses 0s and 1s to encode different groups, which results in level shifts to predictions (just like the intercept). This allows categorical data to be included in regression models.
For example, in the Palmer Penguins dataset, the species variable has three categories: Adelie, Chinstrap, and Gentoo. Since regression models require numerical inputs, we create dummy variables for two of the species (the third is treated as the baseline).
Letās define the dummy variables:
\[
\text{species}_\text{Chinstrap} =
\begin{cases}
1, & \text{if the penguin is Chinstrap} \\
0, & \text{otherwise}
\end{cases}
\]
\[
\text{species}_\text{Gentoo} =
\begin{cases}
1, & \text{if the penguin is Gentoo} \\
0, & \text{otherwise}
\end{cases}
\]
Since there are three categories, we only need two dummy variables ā the third category (Adelie) is the reference category.
Mathematically, the multiple regression model with these dummy variables is:
\(\beta_0\) represents the average body mass of Adelie penguins (the reference category).
\(\beta_1\) represents the difference in average body mass between Chinstrap and Adelie penguins.
\(\beta_2\) represents the difference in average body mass between Gentoo and Adelie penguins.
Programmatically, the R model formula is:
lm(Y ~1+ X, data = data)
where:
X is a categorical variable (character or factor) which is automatically converted into dummy variable(s).
This approach allows categorical predictors to be incorporated into regression models while maintaining meaningful interpretation.
Example: Predicting penguin weight using species information
fit_species <-lm(body_mass_g ~ species, data = penguins)summary(fit_species)
Call:
lm(formula = body_mass_g ~ species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3700.66 37.62 98.37 <2e-16 ***
speciesChinstrap 32.43 67.51 0.48 0.631
speciesGentoo 1375.35 56.15 24.50 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
Interpretting the dummy variable coefficients for categorical predictors requires special consideration to the reference category.
Intercept (\(\beta_0 \approx 3700.66\)):
This represents the average body mass (in grams) of Adelie penguins, since Adelie is the reference category. On average, Adelie penguins weigh 3700.66 grams. This is highly statistically significant (p < 0.001), which simply means their weight is statistically not zero.
speciesChinstrap (\(\beta_1 \approx 32.43\)):
This represents the difference in average body mass between Chinstrap and Adelie penguins. Chinstrap penguins are, on average, 32.43 grams heavier than Adelie penguins, but this difference is not statistically significant (p = 0.631), meaning we cannot confidently conclude that Chinstrap penguins weigh more than Adelie penguins based on this model.
speciesGentoo (\(\beta_2 \approx 1375.35\)):
This represents the difference in average body mass between Gentoo and Adelie penguins. Gentoo penguins are, on average, 1375.35 grams heavier than Adelie penguins, and this difference is highly statistically significant (p < 0.001).
Try adding a second categorical variable for the penguinās island into the model, does it make a difference?
Hint
To model body_mass_g with categorical variables, use both island and species as predictors.
fit <-lm(body_mass_g ~ island + species, data = penguins)summary(fit)
Interaction effects
An interaction effect occurs when the effect of one predictor on the response variable depends on the value of another predictor. This means that the relationship between a predictor and the outcome is not constant across levels of another variable.
Mathematically, an interaction between two variables \(X_1\) and \(X_2\) in a regression model is represented as:
\(\beta_3\) represents the interaction effect, or how the relationship between \(X_1\) and \(Y\) changes depending on \(X_2\).
Programmatically we use * to represent interactions, so the R model formula is:
lm(Y ~1+ X1 * X2, data = data)
The main effects (X1) and (X2) are automatically included. To specify an interaction without including main effects, we instead use X1:X2.
Example: Interaction between species and flipper length
Letās test whether the relationship between flipper length and body mass differs by species. We include an interaction term between species and flipper_length_mm in the model:
fit_interaction <-lm(body_mass_g ~ species * flipper_length_mm, data = penguins)summary(fit_interaction)
Call:
lm(formula = body_mass_g ~ species * flipper_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-911.18 -251.93 -31.77 197.82 1144.81
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2535.837 879.468 -2.883 0.00419 **
speciesChinstrap -501.359 1523.459 -0.329 0.74229
speciesGentoo -4251.444 1427.332 -2.979 0.00311 **
flipper_length_mm 32.832 4.627 7.095 7.69e-12 ***
speciesChinstrap:flipper_length_mm 1.742 7.856 0.222 0.82467
speciesGentoo:flipper_length_mm 21.791 6.941 3.139 0.00184 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 370.6 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7864
F-statistic: 252.2 on 5 and 336 DF, p-value: < 2.2e-16
The interaction terms represent how the effect of flipper length on body mass differs by species.
speciesChinstrap:flipper_length_mm (\(\beta_4 \approx 1.742\)):
This coefficient represents the additional change in body mass for Chinstrap penguins for each additional millimetre of flipper length, compared to Adelie penguins (the reference species). The coefficient is not statistically significant (p = 0.82467), suggesting that the relationship between flipper length and body mass for Chinstrap penguins is not meaningfully different from that of Adelie penguins. The small value of 1.742 indicates that even if the effect were significant, it would be a minor difference in comparison to other species.
speciesGentoo:flipper_length_mm (\(\beta_5 \approx 21.791\)):
This coefficient represents the additional change in body mass for Gentoo penguins for each additional millimetre of flipper length, compared to Adelie penguins. The coefficient is statistically significant (p = 0.00184), indicating that the effect of flipper length on body mass for Gentoo penguins is significantly stronger than for Adelie penguins. Specifically, for each additional millimetre of flipper length, the body mass of Gentoo penguins increases by 21.791 grams more than Adelie penguins, reflecting a more pronounced relationship between flipper length and body mass for this species.
Now try to model a penguinās weight using the interaction between species and the length of a penguinās bill.
Hint
Use * symbol to include the interaction between species and bill_length_mm in your model.
fit <-lm(body_mass_g ~ species * bill_length_mm, data = penguins)summary(fit)