4  Pitfalls, Diagnostics, and Remedies

In Chapter 3 we saw that complex regression models can go wrong due to overfitting. But even simple regression analyses can be misleading if the linear model we are fitting does not adequately capture the data-generating process.

Example 4.1: Anscombe’s quartet: same regression, different data

Regression summaries can mask radically different data patterns. Consider the following four datasets, known as Anscombe’s quartet:

These datasets are clearly very different; however, each variable has the same summary statistics:

Furthermore, fitting a simple linear regression model to each dataset yields identical estimates:
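This is easy to verify in R, which ships with the quartet as the built-in anscombe data frame (columns x1–x4 and y1–y4 hold the four datasets):

```r
# Fit a simple linear regression to each of the four Anscombe datasets.
fits <- lapply(1:4, function(k) {
  lm(reformulate(paste0("x", k), response = paste0("y", k)), data = anscombe)
})

# All four regressions give essentially the same estimates:
# intercept approximately 3.00 and slope approximately 0.50.
round(t(sapply(fits, coef)), 2)
```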

The key lesson from Anscombe’s quartet is that regression summaries can hide important aspects of the data. Therefore, a necessary step in any regression analysis is to check whether the data meet the assumptions of the linear model being fitted.

In this chapter we focus on the ways in which linear regression can go wrong, how to diagnose problems using residuals, and what to do when issues arise.

Key-point

A good regression analysis is not just about fitting a model and reporting results; it also involves checking that the model assumptions are reasonably satisfied and addressing any issues that arise.

4.1 Revisiting linear model assumptions

In Chapter 1, we introduced the basic assumptions for constructing a simple linear model of the relationship between two variables. In Chapter 2 we extended these assumptions to the multiple-predictor setting. As a reminder, a linear regression analysis assumes that the data were generated according to something of the following form:

Assumption: The linear model

The response \(Y\) for observation \(i\) is related to predictor variables \(X_1, X_2, \ldots, X_p\) according to: \[ Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_p X_{i,p} + \varepsilon_i, \quad i = 1, 2, \ldots, n \] where \[ \varepsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma^2) \] for some unknown parameters \(\beta_0, \beta_1, \ldots, \beta_p\) and \(\sigma^2 > 0\).

By assuming this form we can estimate the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) and \(\sigma^2\) using the method of least squares (yielding estimates \(b_0, b_1, \ldots, b_p\) and \(s^2\)), and then use these estimates to perform inference and prediction.

However, if it is not the case that the data were generated according to the proposed linear model (e.g. as in sets 2, 3, or 4 of Anscombe’s quartet), then our estimates, inferences, and predictions may be misleading.

4.1.1 Common assumptions

It is often the case that some aspects of the proposed linear model are violated while others are still reasonably satisfied. For example, data may show a linear trend but have nonconstant variance (heteroskedasticity). For this reason, we often break the general assumption above into more specific parts when performing model diagnostics. A common, equivalent way to present the linear model is via the following four assumptions:

Assumption: Linearity

The relationship between the predictors and the response is linear (in the parameters) on average: \[ E[Y_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} \]

Assumption: Homoscedasticity

The variance of the errors is constant across the range of predictors: \[ \text{Var}(\varepsilon_i) = \sigma^2 \]

Assumption: Normality

The errors are approximately normally distributed: \[ \varepsilon_i \sim \text{Normal}(0, \sigma^2) \]

Assumption: Independence

The errors are independent: \(\varepsilon_i\) is independent of \(\varepsilon_j\) for \(i \neq j\).

The standard workflow for a regression analysis involves checking whether these assumptions are reasonably satisfied. If problems are detected, it means we need to revise our model in some way (e.g., by adding polynomial terms, transforming variables, or using a different modelling approach).

4.2 Residuals as diagnostic tools

How can we check whether the assumptions of a linear regression model are met? We might start by plotting the data and looking for obvious problems; this is how we identified issues with Anscombe’s quartet in Example 4.1. However, a more systematic approach is to examine the residuals from the fitted model. Recall that the residuals are the differences between the observed values and the fitted values from the model.

Key-term 4.1: Residual

The residual for observation \(i\) in a regression model is defined as \[e_i = y_i - \hat{y}_i\] where \(e_i\) is the residual for observation \(i\), \(y_i\) is the observed value, and \(\hat{y}_i\) is the predicted value from the model.

We used residuals in Section 1.3.2 to estimate the parameters of the linear model, but they can also be used to diagnose model problems. Since residuals encode what the model has not captured, plotting them can reveal systematic patterns where the model fails to fit the data well.
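This definition can be checked directly in R (using the built-in cars data purely for illustration):

```r
fit <- lm(dist ~ speed, data = cars)

# By definition, each residual is the observed value minus the fitted value.
e <- cars$dist - fitted(fit)

# This matches what resid() returns.
all.equal(unname(e), unname(resid(fit)))
```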

4.2.1 Standardised residuals, leverage, and Cook’s distance

Raw residuals \(e_i\) are measured on the scale of the response, so their magnitude depends on the units of \(Y\) (and on how variable \(Y\) is). In multiple regression, residuals also do not all have the same variance: observations with unusual predictor values tend to have smaller residual variance simply because the fitted model is “pulled” towards them. To compare residuals on a common scale, we use standardised residuals.

Key-term 4.2: Standardised residual

The standardised residual for observation \(i\) is \[ r_i = \frac{e_i}{s\,\sqrt{1-h_{ii}}} \] where \(s\) is the residual standard error from the fitted model and \(h_{ii}\) is the leverage for observation \(i\). Under the usual model assumptions, \(r_i\) is approximately \(\text{Normal}(0,1)\).

Standardised residuals are especially useful when comparing residual sizes across observations, and they are the scale used in several common diagnostic plots.
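In R, rstandard() returns standardised residuals; the definition above can be verified by hand (again using the built-in cars data for illustration):

```r
fit <- lm(dist ~ speed, data = cars)

s <- summary(fit)$sigma  # residual standard error
h <- hatvalues(fit)      # leverage values h_ii

# Standardise each raw residual by its estimated standard deviation.
r_manual <- resid(fit) / (s * sqrt(1 - h))

all.equal(r_manual, rstandard(fit))
```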

Key-term 4.3: Leverage (hat value)

Leverage measures how unusual an observation’s predictor values are relative to the rest of the data. The leverage values \(h_{ii}\) come from the hat matrix \(H = X(X^\top X)^{-1}X^\top\), satisfy \(0 \le h_{ii} \le 1\), and sum to \(p + 1\) (including the intercept), so the average leverage is \((p + 1)/n\).

A common rule of thumb is to flag points with \[ h_{ii} > \frac{2(p + 1)}{n} \quad \text{(or } \frac{3(p + 1)}{n}\text{)} \] as “high leverage”, where \(p\) is the number of predictors (excluding the intercept) and \(n\) is the sample size.

High leverage does not automatically mean a problem: it becomes concerning when it coincides with a large standardised residual.

Key-term 4.4: Cook’s distance

Cook’s distance summarises how much the fitted model would change if observation \(i\) were removed: \[ D_i = \frac{e_i^2}{(p + 1)s^2}\,\frac{h_{ii}}{(1-h_{ii})^2}. \] Large values indicate potentially influential observations.
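R computes this quantity with cooks.distance(); a direct check of the formula (cars data, with p = 1 predictor):

```r
fit <- lm(dist ~ speed, data = cars)

e  <- resid(fit)
h  <- hatvalues(fit)
s2 <- summary(fit)$sigma^2
p  <- 1  # number of predictors (excluding the intercept)

# Cook's distance from the formula above.
D_manual <- e^2 / ((p + 1) * s2) * h / (1 - h)^2

all.equal(D_manual, cooks.distance(fit))
```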

Note

Rules of thumb (use with caution): observations with \(|r_i| > 2\) (or \(> 3\)) may be unusual; Cook’s distance values that stand out relative to the rest merit investigation.

In practice, these diagnostics are easy to extract with broom::augment() (as .std.resid, .hat, and .cooksd).

Example 4.2: Residuals in the Olympic cholesterol data

The OLYMPIC dataset records fat intake (mg) and cholesterol (mg/L) for 20 athletes.

We can see that cholesterol tends to increase with fat intake, but the relationship does not appear to be perfectly linear. Let’s fit a simple linear regression model predicting cholesterol from fat intake, and then examine the residuals to see if there are any obvious problems.

We can see that the positive relationship between fat intake and cholesterol observed in the scatter plot has been captured by the model (the diagonal trend is gone in the residual plots, and the coefficient for FAT in the model summary is positive and significant). However, there is still a clear curved pattern in the residuals, indicating that not all the structure in the data has been captured.

A first-order linear model cannot capture this curved pattern. This suggests that we need to revise our model, perhaps by adding a polynomial term or transforming one of the variables.

4.2.2 Extracting diagnostics with broom::augment()

We saw in Section 1.2.4 how to extract residuals using resid(). A convenient alternative is broom::augment(), which returns a data frame containing the original data plus fitted values, residuals, and several diagnostic quantities.

Example 4.3: Olympic residuals with broom::augment()

The columns .std.resid, .hat, and .cooksd correspond to standardised residuals, leverage, and Cook’s distance. One simple way to spot potentially influential points is to sort by Cook’s distance:
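If broom is unavailable, the same quantities can be assembled with base R; a sketch using the built-in mtcars data (since the OLYMPIC data are specific to this book):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Assemble the same columns broom::augment() would return.
diag_df <- data.frame(
  mtcars[c("mpg", "wt", "hp")],
  .fitted    = fitted(fit),
  .resid     = resid(fit),
  .std.resid = rstandard(fit),
  .hat       = hatvalues(fit),
  .cooksd    = cooks.distance(fit)
)

# Sort by Cook's distance to surface potentially influential points.
head(diag_df[order(-diag_df$.cooksd), ])
```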

We will use the ‘augmented’ data frame format extensively in the remainder of this chapter.

4.3 Residual diagnostics by assumption

The goal of residual diagnostics is to check whether the key assumptions behind our linear regression model are reasonable.

We’ll illustrate using a multiple linear regression model predicting miles per gallon (mpg) from weight (wt), horsepower (hp), and transmission type (am) in the mtcars dataset.

We will use broom::augment() to extract the residuals and other diagnostics, and then check each assumption from Section 4.1.1 in turn.

4.3.1 Linearity

Again, we assume that the mean of the response is a linear function of the predictors.

Assumption

The conditional mean of the response is linear in the predictors: \[ E[Y_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}. \]

Because we make this assumption, our least squares fit finds the best straight line (in the least squares sense). If the data fit this assumption, then the residuals plotted against fitted values should look like random noise around 0. Systematic curvature suggests the mean structure is misspecified (e.g., a missing nonlinear term or interaction).
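A simulated sketch: when the true mean is quadratic, a straight-line fit leaves the curvature sitting in the residuals, and adding the missing term removes it.

```r
set.seed(1)

# Simulate data whose true mean is quadratic in x, then (mis)fit a line.
x <- runif(200, 0, 4)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(200)

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

# Residuals from the straight-line fit still carry the quadratic signal;
# after adding the x^2 term they are orthogonal to it.
q <- poly(x, 2)[, 2]      # quadratic trend, orthogonal to 1 and x
cor(resid(fit_lin), q)    # large: curvature left in the residuals
cor(resid(fit_quad), q)   # essentially zero: curvature captured
```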

4.3.2 Constant variance

The vertical spread of residuals should be roughly constant across the range of fitted values. A “funnel” pattern suggests heteroskedasticity. The scale-location plot makes changes in spread easier to see.
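A simulated sketch of a funnel pattern, comparing residual spread across the fitted-value range:

```r
set.seed(2)

# Simulate a funnel: the noise standard deviation grows with x.
x <- runif(300, 1, 10)
y <- 2 + 3 * x + rnorm(300, sd = 0.5 * x)

fit <- lm(y ~ x)

# Compare residual spread in the lower and upper halves of the fitted values;
# under homoscedasticity these should be similar.
lower <- resid(fit)[fitted(fit) <  median(fitted(fit))]
upper <- resid(fit)[fitted(fit) >= median(fitted(fit))]
c(sd_lower = sd(lower), sd_upper = sd(upper))
```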

4.3.3 Normality

A Normal Q-Q plot compares the standardised residuals to what we’d expect under normal errors. Strong departures from the line suggest non-normality (often due to outliers or skewness).
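A simulated sketch with deliberately skewed errors, plus a formal companion check:

```r
set.seed(3)

# Simulate skewed (centred exponential) errors instead of normal ones.
x <- runif(100)
y <- 1 + 2 * x + (rexp(100) - 1)
fit <- lm(y ~ x)

# Q-Q plot of the standardised residuals, with a reference line.
qqnorm(rstandard(fit))
qqline(rstandard(fit))

# The Shapiro-Wilk test rejects normality for these residuals.
shapiro.test(rstandard(fit))$p.value  # very small
```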

4.3.4 Unusual and influential observations

Influence combines (i) leverage (unusual predictor values) and (ii) residual size. Points with high leverage and large residuals can have an outsized effect on the fitted model.

Key-point

If residuals show a clear pattern, that pattern is telling you what your model is missing.

4.3.5 Shortcut: plot() for lm objects

Once you know what to look for, base R will produce the same set of standard diagnostic plots automatically (and you can request a specific panel with which = 1, 2, 3, or 5).
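For example, for the mtcars model used in this section:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)

# The four standard diagnostic panels:
#   1 = residuals vs fitted, 2 = normal Q-Q,
#   3 = scale-location,      5 = residuals vs leverage
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
```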

4.4 Diagnosing issues and choosing remedies

We can use the diagnostic plots above to identify common problems with linear regression models. Below we outline some typical issues, how to recognise them in residual plots, and common strategies for addressing them.

4.4.1 Nonlinearity (missing curvature)

Symptoms

  • A curved trend in residuals vs fitted (or residuals vs a predictor).

Common remedies

  • Add polynomial terms (Chapter 3), transformations, or interactions.
  • Consider omitted predictors that explain structure in the residuals.

Example 4.4: A residual curve suggests missing nonlinearity

4.4.2 Nonconstant variance (heteroskedasticity)

Symptoms

  • A “funnel” shape in residuals vs fitted (spread increases or decreases with the fitted value).
  • A trend in the scale–location plot.

Common remedies

  • Transform the response (e.g., log or square-root) to stabilise variance.
  • Use weighted least squares if you can model how variance changes.
  • Use robust standard errors when inference is the main goal.
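The weighted least squares remedy can be sketched on simulated data (assuming, for illustration, that the error standard deviation is known to grow proportionally with x, so the weights are 1/x²):

```r
set.seed(4)

# Variance grows with x: Var(eps_i) proportional to x_i^2.
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)

# Ordinary vs weighted least squares; weights are the reciprocal
# of the (assumed known) variance structure.
fit_ols <- lm(y ~ x)
fit_wls <- lm(y ~ x, weights = 1 / x^2)

# Both are unbiased; WLS uses the variance structure for efficiency
# and for honest standard errors.
rbind(ols = coef(fit_ols), wls = coef(fit_wls))
```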

Example 4.5: A funnel shape suggests heteroskedasticity

4.4.3 Normality (and when it matters)

Normality of errors matters most for small-sample \(t\)-tests and confidence intervals. In large samples, linear regression often remains useful even when residuals are not perfectly normal, but it is still worth checking for strong departures caused by outliers or skewness.

Symptoms

  • Strong curvature or heavy tails in the Q–Q plot.

Common remedies

  • Check for outliers/data errors first.
  • Consider a transformation of the response.
  • Consider robust or resampling-based inference.

4.4.4 Independence (autocorrelation)

The independence assumption is most commonly violated when observations are collected over time (or space). If residuals are correlated, standard errors can be too small, and \(p\)-values can look “better” than they should.

Symptoms

  • Residuals show runs/clustering when plotted against time/order.
  • Residuals have clear autocorrelation (e.g., via an ACF plot).

Common remedies

  • Add time/seasonality terms if they explain structure.
  • Use models designed for correlated errors (time series, mixed models, GLS).
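One way to see this, sketched with simulated AR(1) errors around a time trend:

```r
set.seed(5)

n <- 200
t <- 1:n

# AR(1) errors: each error is correlated with the previous one.
e <- as.numeric(arima.sim(list(ar = 0.8), n = n))
y <- 1 + 0.05 * t + e

fit <- lm(y ~ t)

# The lag-1 autocorrelation of the residuals is far from zero,
# violating the independence assumption.
acf(resid(fit), plot = FALSE)$acf[2]  # lag-1 autocorrelation
```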

Example 4.6: Residuals over time can reveal dependence

4.4.5 Extrapolation and high-leverage prediction

Extrapolation means predicting for predictor values outside the range you observed. It is risky because the relationship itself may change outside the observed region, and because uncertainty grows rapidly when you move away from the data cloud (often showing up as high leverage).

Key-point

Before you trust a prediction, check whether it is an extrapolation. A simple first step is to compare each new predictor value to the observed range.
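A minimal sketch of this range check (new_cars is a hypothetical set of prediction points, using the mtcars predictors for illustration):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations to predict for.
new_cars <- data.frame(wt = c(2.5, 7.0), hp = c(110, 400))

# Flag any predictor value outside the observed range (an extrapolation).
outside <- sapply(c("wt", "hp"), function(v) {
  new_cars[[v]] < min(mtcars[[v]]) | new_cars[[v]] > max(mtcars[[v]])
})
outside  # second row lies outside the data on both predictors
```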

4.4.6 Transformations including Box–Cox

Transformations can help when residual plots show curvature or nonconstant variance. A systematic way to explore power transformations of the response is the Box–Cox transformation.

Note

Box–Cox requires the response to be strictly positive.
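A sketch using MASS::boxcox() (MASS ships with R) on simulated data whose true relationship is linear on the log scale:

```r
library(MASS)  # provides boxcox()

set.seed(6)

# Simulate a strictly positive response where log(y) is linear in x.
x <- runif(100, 1, 5)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.2))

fit <- lm(y ~ x)

# Profile the log-likelihood over candidate powers; lambda = 0 means log.
bc <- boxcox(fit, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # close to 0, suggesting a log transform
```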

4.4.7 Multicollinearity

Multicollinearity occurs when predictors are strongly correlated. It does not necessarily make predictions bad, but it can make coefficient estimates unstable and inflate standard errors, so interpretation and “which predictor matters?” questions become fragile.

Symptoms

  • Large standard errors and coefficient sign changes when you add/remove a correlated predictor.
  • Large variance inflation factors (VIFs).

Common remedies

  • Drop or combine redundant predictors (guided by scientific context).
  • Collect more data (especially with more diverse predictor combinations).
  • Use dimension reduction or penalised regression when prediction is the goal.
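VIFs can be computed from first principles without extra packages; a sketch using correlated predictors from mtcars (vif_manual is a helper written here for illustration):

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    r2 <- summary(
      lm(reformulate(setdiff(predictors, p), response = p), data = data)
    )$r.squared
    1 / (1 - r2)
  })
}

# disp, wt, and hp in mtcars are strongly correlated.
vif_manual(mtcars, c("disp", "wt", "hp"))
```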

4.4.8 Blind model selection

Automated selection procedures (stepwise, “try everything”, repeatedly adding/removing predictors until something is significant) can be useful for exploration, but they are easy to misuse.

Pitfalls

  • \(p\)-values and confidence intervals become hard to interpret after extensive searching.
  • Selected models can overfit and can be unstable to small data changes.

Remedies

  • Use scientific context to limit the candidate set.

  • Use out-of-sample validation (Chapter 3) and always re-check residual diagnostics on the final model.

4.5 Exercises

The goal of these exercises is to practise diagnosing patterns and choosing a sensible next step.

4.5.1 Exercise 1: Identify heteroskedasticity

Run the simulation below and look at the residual plot. Then answer: what issue is the plot mainly suggesting?

Exercise 4.1

Write your answer as a short phrase (e.g., "heteroskedasticity" or "nonconstant variance").

issue_4_1 <- "heteroskedasticity"

4.5.2 Exercise 2: Use VIF to spot multicollinearity

Fit the model and compute VIFs. Which term has the largest VIF?

Exercise 4.2

Set worst_term equal to the term with the largest VIF.

worst_term <- vif_from_lm(ex2_mod)$term[1]

4.5.3 Exercise 3: Interpret a Box–Cox suggestion

In the Box–Cox plot earlier, we computed lambda_hat. Based on its value, which option is the best rough interpretation?

Exercise 4.3

Choose one: "no transform", "log transform", or "square-root transform".

boxcox_choice <- if (abs(lambda_hat) < 0.25) "log transform" else if (abs(lambda_hat - 0.5) < 0.25) "square-root transform" else "no transform"

4.6 Handling outliers and influential observations

When diagnostics flag influential observations, slow down and investigate rather than deleting points automatically.

  • Check data quality first (entry errors, unusual units, miscoding).
  • If influential points are real, fit with and without them to assess robustness; report how conclusions change.
  • Prefer model improvements (functional form, additional predictors, variance stabilisation) over deletion.
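The “fit with and without” sensitivity check can be sketched as follows (using mtcars for illustration):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Identify the most influential observation by Cook's distance...
worst <- which.max(cooks.distance(fit))

# ...then refit without it and compare the coefficients. Large changes
# mean the conclusions depend heavily on a single observation.
fit_drop <- lm(mpg ~ wt + hp, data = mtcars[-worst, ])
rbind(full = coef(fit), dropped = coef(fit_drop))
```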