1. Linear regression

Statistically summarise relationships between variables of a dataset with a linear regression model.

Published

21 February 2025

Regression modelling helps us understand relationships between variables and make predictions. Regression allows us to quantify these relationships between two or more variables to make informed decisions based on data. The lm() function in R allows us to estimate linear regression models using a model formula.

Simple linear regression

A simple linear regression models the relationship between one predictor (X) and one response (Y).

Mathematically, general formula is:

\[\mathbf{Y} = \beta_0\mathbf{1} + \beta_1 \mathbf{X} + \boldsymbol{\varepsilon}\]

where:

  • \(\mathbf{Y}\) is the dependent variable (what is being predicted),
  • \(\mathbf{1}\) is a vector containing 1 (for the intercept),
  • \(\mathbf{X}\) is the independent variable (how it is predicted),
  • \(\beta_0\) (intercept) and \(\beta_1\) (slope) are model parameters,
  • \(\boldsymbol{\varepsilon}\) represents random error (assumed \(\varepsilon_i \overset{iid}{\sim} N(0,1)\)).

Programmatically, the R model formula is:

lm(Y ~ 1 + X, data = data)

where:

  • Y is the name of the dependent variable from data,
  • 1 indicates to include an intercept,
  • X is the name of the independent variable from data,
  • data is the name of the dataset being modelled.

Example: Predicting penguin weight

Letā€™s predict the weight of adult foraging penguins nearby Antarcticaā€™s Palmer Station. This dataset is included in the palmerpenguins package.

The weight of each penguin is recorded in the body_mass_g column (our dependent variable), and we will use the length of their bill (from the bill_length_mm column) to predict it. Our hypothesis could be that a larger bill allows a penguin to eat more fish!

library(palmerpenguins)
# Fit simple linear regression model
lm(body_mass_g ~ bill_length_mm, data = penguins)

Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)

Coefficients:
   (Intercept)  bill_length_mm  
        362.31           87.42  

The coefficient for bill_length_mm (\(\beta_1 \approx 87.42\)) indicates that for each additional millimetre in bill length, the penguinā€™s body mass increases by roughly 87.42 grams on average.

To evaluate the significance of this relationship, we can look at the model in more detail using summary().

# Fit simple linear regression model
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)

# View model summary
summary(fit)

Call:
lm(formula = body_mass_g ~ bill_length_mm, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1762.08  -446.98    32.59   462.31  1636.86 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     362.307    283.345   1.279    0.202    
bill_length_mm   87.415      6.402  13.654   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 645.4 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.3542,    Adjusted R-squared:  0.3523 
F-statistic: 186.4 on 1 and 340 DF,  p-value: < 2.2e-16

The summary() output shows that the bill_length_mm coefficient has a p-value of \(< 2e^{-16}\) (nearly 0), which is significant at a 5% level (since it is less than 0.05). This analysis and interpretation isnā€™t reliable, since there are many statistical considerations which have not yet been satisfied.

Interpret with caution

While the model provides useful insights, it should be interpreted with caution. Making accurate interpretations from a model requires thorough investigation of the model and data. A deeper understanding of the modelā€™s capabilities and assumptions is beyond this lesson.

While the model suggests a positive relationship between bill length and body mass, there are several important caveats and assumptions to consider:

  • Assumed Linearity

    The model assumes a linear relationship between bill length and body mass. If the true relationship is curved or more complex, a simple linear regression may not be appropriate.

  • Causation vs. Correlation

    The model does not prove that bill length causes changes in body massā€”it only identifies an association. Other factors (e.g., diet, age, species differences) may influence both variables.

  • Omitted Variable Bias

    The model only includes bill length as a predictor. Other important predictors like flipper length, species, or sex might influence body mass. Ignoring these could lead to misleading results.

  • Extrapolation Risk

    Predictions should only be made within the range of observed bill lengths in the dataset. For example, predicting body mass for a penguin with an extremely small or large bill (outside the datasetā€™s range) could lead to unreliable estimates.

  • Residual assumptions
    The model assumes that residuals (errors) are \(\varepsilon_i \overset{iid}{\sim} N(0,1)\), and this should be assumption should be tested with residual diagnostics. The iid means:

    • Independent (and)

      The weight of one penguin does not affect another

    • Identically

      All residuals have a:

      • mean of 0 (needed for unbiased predictions)
      • constant variance (needed for accurate prediction intervals)
    • Distributed

      The residuals are normally distributed (needed for valid hypothesis testing)

Try estimating a simple linear regression model with a different predictor from the Palmer penguins dataset.

Hint

Choose another variable from the penguins dataset, such as bill_depth_mm, flipper_length_mm, or species for building the model.

fit <- lm(body_mass_g ~ bill_depth_mm, data = penguins)
summary(fit)

Multiple linear regression

A multiple linear regression extends simple linear regression by including multiple predictors (Xā‚, Xā‚‚, ..., Xā‚™) to explain variations in a response variable (Y). This helps account for additional sources of variation and can reduce omitted variable bias, which occurs when an important predictor is left out, potentially leading to misleading estimates.

Mathematically, the general formula is:

\[\mathbf{Y} = \beta_0\mathbf{1} + \beta_1 \mathbf{X_1} + \beta_2 \mathbf{X_2} + \dots + \beta_n \mathbf{X_n} + \boldsymbol{\varepsilon}\]

where:

  • \(\mathbf{X_1}, \mathbf{X_2}, \dots, \mathbf{X_n}\) are independent variables (predictors),
  • \(\beta_1, \beta_2, \dots, \beta_n\) are their respective coefficients,
  • \(\beta_0\) is the intercept,
  • \(\boldsymbol{\varepsilon}\) represents random error.

Programmatically, the R model formula is:

lm(Y ~ 1 + X1 + X2 + X3 + ..., data = data)

where:

  • Y is the name of the dependent variable from data,
  • 1 indicates to include an intercept,
  • X1, X2, X3, ā€¦ are the names of the independent variables from data,
  • data is the name of the dataset being modelled.

Example: Predicting penguin weight with multiple predictors

Instead of just using bill length, we now incorporate flipper length as an additional predictor. Our hypothesis is that both bill size and flipper size contribute to the body mass of a penguin (perhaps longer flippers helps a penguin swim faster to catch more fish)!

# Fit multiple linear regression model
fit_multi <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm, data = penguins)

# View model summary
summary(fit_multi)

Call:
lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-1090.5  -285.7   -32.1   244.2  1287.5 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5736.897    307.959 -18.629   <2e-16 ***
bill_length_mm        6.047      5.180   1.168    0.244    
flipper_length_mm    48.145      2.011  23.939   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 394.1 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:   0.76, Adjusted R-squared:  0.7585 
F-statistic: 536.6 on 2 and 339 DF,  p-value: < 2.2e-16

Each coefficient can be interpretted similarly to simple linear regression.

The coefficient for bill_length_mm (\(\beta_1 \approx 6.05\)) indicates that for each additional millimetre in bill length, the penguinā€™s body mass increases by roughly 6.05 grams on average, holding flipper length constant. However, the p-value (0.244) suggests this effect is not statistically significant at a 5% level.

The coefficient for flipper_length_mm (\(\beta_2 \approx 48.15\)) indicates that for each additional millimetre in flipper length, the penguinā€™s body mass increases by roughly 48.15 grams on average, holding bill length constant. This effect is highly significant (p < 0.001).

Notice that when including the flipper length in the model, the coefficient for bill length (\(\beta_1\)) becomes much smaller and is no longer statistically significant (p = 0.244). This suggests that flipper length may account for much of the variation in body mass that was previously attributed to bill length in a simple linear regression. In other words, bill lengthā€™s effect on body mass may have been overstated in the simpler model due to omitted variable bias, and flipper length appears to be a stronger predictor of body mass.

Now add the bill depth to the model and see how it changes the model.

Hint

Include the specified numerical predictors: bill_length_mm, bill_depth_mm, and flipper_length_mm to build the model.

fit <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, data = penguins)
summary(fit)

Categorical predictors (dummy variables)

A dummy variable is a numerical representation of a categorical variable. It uses 0s and 1s to encode different groups, which results in level shifts to predictions (just like the intercept). This allows categorical data to be included in regression models.

For example, in the Palmer Penguins dataset, the species variable has three categories: Adelie, Chinstrap, and Gentoo. Since regression models require numerical inputs, we create dummy variables for two of the species (the third is treated as the baseline).

Letā€™s define the dummy variables:

\[ \text{species}_\text{Chinstrap} = \begin{cases} 1, & \text{if the penguin is Chinstrap} \\ 0, & \text{otherwise} \end{cases} \]

\[ \text{species}_\text{Gentoo} = \begin{cases} 1, & \text{if the penguin is Gentoo} \\ 0, & \text{otherwise} \end{cases} \]

Since there are three categories, we only need two dummy variables ā€” the third category (Adelie) is the reference category.

Mathematically, the multiple regression model with these dummy variables is:

\[ \text{body\_mass}_i = \beta_0 + \beta_1 \cdot \text{species}_\text{Chinstrap} + \beta_2 \cdot \text{species}_\text{Gentoo} + \varepsilon_i \]

where:

  • \(\beta_0\) represents the average body mass of Adelie penguins (the reference category).
  • \(\beta_1\) represents the difference in average body mass between Chinstrap and Adelie penguins.
  • \(\beta_2\) represents the difference in average body mass between Gentoo and Adelie penguins.

Programmatically, the R model formula is:

lm(Y ~ 1 + X, data = data)

where:

  • X is a categorical variable (character or factor) which is automatically converted into dummy variable(s).

This approach allows categorical predictors to be incorporated into regression models while maintaining meaningful interpretation.

Example: Predicting penguin weight using species information

fit_species <- lm(body_mass_g ~ species, data = penguins)
summary(fit_species)

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3700.66      37.62   98.37   <2e-16 ***
speciesChinstrap    32.43      67.51    0.48    0.631    
speciesGentoo     1375.35      56.15   24.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16

Interpretting the dummy variable coefficients for categorical predictors requires special consideration to the reference category.

  • Intercept (\(\beta_0 \approx 3700.66\)):

    This represents the average body mass (in grams) of Adelie penguins, since Adelie is the reference category. On average, Adelie penguins weigh 3700.66 grams. This is highly statistically significant (p < 0.001), which simply means their weight is statistically not zero.

  • speciesChinstrap (\(\beta_1 \approx 32.43\)):

    This represents the difference in average body mass between Chinstrap and Adelie penguins. Chinstrap penguins are, on average, 32.43 grams heavier than Adelie penguins, but this difference is not statistically significant (p = 0.631), meaning we cannot confidently conclude that Chinstrap penguins weigh more than Adelie penguins based on this model.

  • speciesGentoo (\(\beta_2 \approx 1375.35\)):

    This represents the difference in average body mass between Gentoo and Adelie penguins. Gentoo penguins are, on average, 1375.35 grams heavier than Adelie penguins, and this difference is highly statistically significant (p < 0.001).

Try adding a second categorical variable for the penguinā€™s island into the model, does it make a difference?

Hint

To model body_mass_g with categorical variables, use both island and species as predictors.

fit <- lm(body_mass_g ~ island + species, data = penguins)
summary(fit)

Interaction effects

An interaction effect occurs when the effect of one predictor on the response variable depends on the value of another predictor. This means that the relationship between a predictor and the outcome is not constant across levels of another variable.

Mathematically, an interaction between two variables \(X_1\) and \(X_2\) in a regression model is represented as:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + \varepsilon \]

where:

  • \(\beta_3\) represents the interaction effect, or how the relationship between \(X_1\) and \(Y\) changes depending on \(X_2\).

Programmatically we use * to represent interactions, so the R model formula is:

lm(Y ~ 1 + X1 * X2, data = data)

The main effects (X1) and (X2) are automatically included. To specify an interaction without including main effects, we instead use X1:X2.

Example: Interaction between species and flipper length

Letā€™s test whether the relationship between flipper length and body mass differs by species. We include an interaction term between species and flipper_length_mm in the model:

fit_interaction <- lm(body_mass_g ~ species * flipper_length_mm, data = penguins)
summary(fit_interaction)

Call:
lm(formula = body_mass_g ~ species * flipper_length_mm, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-911.18 -251.93  -31.77  197.82 1144.81 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        -2535.837    879.468  -2.883  0.00419 ** 
speciesChinstrap                    -501.359   1523.459  -0.329  0.74229    
speciesGentoo                      -4251.444   1427.332  -2.979  0.00311 ** 
flipper_length_mm                     32.832      4.627   7.095 7.69e-12 ***
speciesChinstrap:flipper_length_mm     1.742      7.856   0.222  0.82467    
speciesGentoo:flipper_length_mm       21.791      6.941   3.139  0.00184 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 370.6 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7896,    Adjusted R-squared:  0.7864 
F-statistic: 252.2 on 5 and 336 DF,  p-value: < 2.2e-16

The interaction terms represent how the effect of flipper length on body mass differs by species.

  • speciesChinstrap:flipper_length_mm (\(\beta_4 \approx 1.742\)):
    This coefficient represents the additional change in body mass for Chinstrap penguins for each additional millimetre of flipper length, compared to Adelie penguins (the reference species). The coefficient is not statistically significant (p = 0.82467), suggesting that the relationship between flipper length and body mass for Chinstrap penguins is not meaningfully different from that of Adelie penguins. The small value of 1.742 indicates that even if the effect were significant, it would be a minor difference in comparison to other species.

  • speciesGentoo:flipper_length_mm (\(\beta_5 \approx 21.791\)):
    This coefficient represents the additional change in body mass for Gentoo penguins for each additional millimetre of flipper length, compared to Adelie penguins. The coefficient is statistically significant (p = 0.00184), indicating that the effect of flipper length on body mass for Gentoo penguins is significantly stronger than for Adelie penguins. Specifically, for each additional millimetre of flipper length, the body mass of Gentoo penguins increases by 21.791 grams more than Adelie penguins, reflecting a more pronounced relationship between flipper length and body mass for this species.

Now try to model a penguinā€™s weight using the interaction between species and the length of a penguinā€™s bill.

Hint

Use * symbol to include the interaction between species and bill_length_mm in your model.

fit <- lm(body_mass_g ~ species * bill_length_mm, data = penguins)
summary(fit)