4 Pitfalls, Diagnostics, and Remedies
In Chapter 3 we saw that complex regression models can go wrong due to overfitting. But even simple regression analyses can be misleading if the linear model we are fitting does not adequately capture the data-generating process.
In this chapter we focus on the ways in which linear regression can go wrong, how to diagnose problems using residuals, and what to do when issues arise.
Key-point
A good regression analysis is not just about fitting a model and reporting results; it also involves checking that the model assumptions are reasonably satisfied and addressing any issues that arise.
4.1 Revisiting linear model assumptions
When we introduced simple linear models, we set out the basic assumptions for constructing a model of the relationship between two variables. In Chapter 2 we extended these assumptions to the multiple-predictor setting. As a reminder, a linear regression analysis assumes that the data were generated according to something of the following form:
The response \(Y\) for observation \(i\) is related to predictor variables \(X_1, X_2, \ldots, X_p\) according to: \[ Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \cdots + \beta_p X_{i,p} + \varepsilon_i, \quad i = 1, 2, \ldots, n \] where \[ \varepsilon_i \stackrel{\text{iid}}{\sim} \text{Normal}(0, \sigma^2) \] for some unknown parameters \(\beta_0, \beta_1, \ldots, \beta_p\) and \(\sigma^2 > 0\).
By assuming this form we can estimate the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) and \(\sigma^2\) using the method of least squares, obtaining estimates \(b_0, b_1, \ldots, b_p\) and \(s^2\), and then use these estimates to perform inference and prediction.
However, if it is not the case that the data were generated according to the proposed linear model (e.g. as in sets 2, 3, or 4 of Anscombe’s quartet), then our estimates, inferences, and predictions may be misleading.
4.1.1 Common assumptions
It is often the case that some aspects of the proposed linear model are violated while others are still reasonably satisfied. For example, data may show a linear trend but have nonconstant variance (heteroskedasticity). For this reason, we often break the general assumption above into more specific parts when performing model diagnostics. A common, equivalent way to present the assumptions of a linear model is via the following four assumptions:
The relationship between the predictors and the response is linear (in the parameters) on average \[ E[Y_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} \]
The variance of the errors is constant across the range of predictors: \[ \text{Var}(\varepsilon_i) = \sigma^2 \]
The errors are approximately normally distributed: \[ \varepsilon_i \sim \text{Normal}(0, \sigma^2) \]
The errors are independent: \(\varepsilon_i\) is independent of \(\varepsilon_j\) for \(i \neq j\).
The standard workflow for a regression analysis involves checking whether these assumptions are reasonably satisfied. If problems are detected, it means we need to revise our model in some way (e.g., by adding polynomial terms, transforming variables, or using a different modelling approach).
4.2 Residuals as diagnostic tools
How can we check whether the assumptions of a linear regression model are met? We might start by plotting the data and looking for obvious problems; this is how we identified issues when examining Anscombe’s quartet. However, a more systematic approach is to examine the residuals from the fitted model. Remember that the residuals are defined as the difference between the observed values and the fitted values from the model.
The residual for observation \(i\) in a regression model is defined as \[e_i = y_i - \hat{y}_i\] where \(e_i\) is the residual for observation \(i\), \(y_i\) is the observed value, and \(\hat{y}_i\) is the predicted value from the model.
We used residuals in Section 1.3.2 to estimate the parameters of the linear model, but they can also be used to diagnose model problems. Since residuals encode the information about what the model has not captured, plotting residuals can reveal systematic patterns where the model is failing to fit the data well.
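As a minimal sketch (using an illustrative simple fit to the mtcars data), the definition above can be computed by hand and checked against R's built-in resid():

```r
# Sketch: residuals are observed values minus fitted values.
fit <- lm(mpg ~ wt, data = mtcars)        # illustrative simple fit
e   <- mtcars$mpg - fitted(fit)           # e_i = y_i - y_hat_i, by hand
all.equal(unname(e), unname(resid(fit)))  # agrees with R's resid()
```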
4.2.1 Standardised residuals, leverage, and Cook’s distance
Raw residuals \(e_i\) are measured on the scale of the response, so their magnitude depends on the units of \(Y\) (and on how variable \(Y\) is). In multiple regression, residuals also do not all have the same variance: observations with unusual predictor values tend to have smaller residual variance simply because the fitted model is “pulled” towards them. To compare residuals on a common scale, we use standardised residuals.
The standardised residual for observation \(i\) is \[ r_i = \frac{e_i}{s\,\sqrt{1-h_{ii}}} \] where \(s\) is the residual standard error from the fitted model and \(h_{ii}\) is the leverage for observation \(i\). Under the usual model assumptions, \(r_i\) is approximately \(\text{Normal}(0,1)\).
Standardised residuals are especially useful when comparing residual sizes across observations, and they are the scale used in several common diagnostic plots.
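A quick sketch of the formula above, checked against R's built-in rstandard() (the model here is an illustrative mtcars fit):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)$sigma                 # residual standard error
h <- hatvalues(fit)                     # leverages h_ii
r <- resid(fit) / (s * sqrt(1 - h))     # standardised residuals r_i
all.equal(r, rstandard(fit))            # matches the built-in computation
```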
Leverage measures how unusual an observation’s predictor values are relative to the rest of the data. The leverage values \(h_{ii}\) come from the hat matrix \(H = X(X^\top X)^{-1}X^\top\), satisfy \(0 \le h_{ii} \le 1\), and sum to \(p + 1\) (including the intercept), so the average leverage is \((p + 1)/n\).
A common rule of thumb is to flag points with \[ h_{ii} > \frac{2(p + 1)}{n} \quad \text{(or } \frac{3(p + 1)}{n}\text{)} \] as “high leverage”, where \(p\) is the number of predictors (excluding the intercept) and \(n\) is the sample size.
High leverage does not automatically mean a problem: it becomes concerning when it coincides with a large standardised residual.
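The leverage bookkeeping above can be verified directly; this rough sketch also applies the rule-of-thumb flag to an illustrative mtcars fit:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
h <- hatvalues(fit)             # leverages from the hat matrix
p <- length(coef(fit)) - 1      # predictors, excluding the intercept
n <- nrow(mtcars)
mean(h)                         # average leverage is (p + 1) / n
which(h > 2 * (p + 1) / n)      # rule-of-thumb "high leverage" flags
```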
Cook’s distance summarises how much the fitted model would change if observation \(i\) were removed: \[ D_i = \frac{e_i^2}{(p + 1)s^2}\,\frac{h_{ii}}{(1-h_{ii})^2}. \] Large values indicate potentially influential observations.
Rules of thumb (use with caution): observations with \(|r_i| > 2\) (or \(> 3\)) may be unusual; Cook’s distance values that stand out relative to the rest merit investigation.
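As a sanity check, the Cook's distance formula above can be computed term by term and compared with R's cooks.distance() (again on an illustrative mtcars fit):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
e  <- resid(fit)
h  <- hatvalues(fit)
s2 <- summary(fit)$sigma^2
p  <- length(coef(fit)) - 1
D  <- e^2 / ((p + 1) * s2) * h / (1 - h)^2  # formula from the text
all.equal(D, cooks.distance(fit))           # agrees with the built-in
```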
In practice, these diagnostics are easy to extract with broom::augment() (as .std.resid, .hat, and .cooksd).
4.2.2 Extracting diagnostics with broom::augment()
We saw in Section 1.2.4 how to extract residuals using resid(). A convenient alternative is broom::augment(), which returns a data frame containing the original data plus fitted values, residuals, and several diagnostic quantities.
We will use the ‘augmented’ data frame format extensively in the remainder of this chapter.
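A short sketch of what the augmented data frame looks like, assuming the broom package is installed:

```r
library(broom)  # assumed installed
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
aug <- augment(fit)
# The augmented frame adds .fitted, .resid, .std.resid, .hat, .cooksd, ...
head(aug[, c(".fitted", ".resid", ".std.resid", ".hat", ".cooksd")])
```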
4.3 Residual diagnostics by assumption
The goal of residual diagnostics is to check whether the key assumptions behind our linear regression model are reasonable.
We’ll illustrate using a multiple linear regression model predicting miles per gallon (mpg) from weight (wt), horsepower (hp), and transmission type (am) in the mtcars dataset.
We will use broom::augment() to extract the residuals and other diagnostics, and then check each assumption from Section 4.1.1 in turn.
4.3.1 Linearity
Again, we assume that the mean of the response is a linear function of the predictors.
The conditional mean of the response is linear in the predictors: \[ E[Y_i] = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip}. \]
Because we make this assumption, least squares finds the best-fitting linear function of the predictors (in the least squares sense). If the data satisfy this assumption, then the residuals plotted against fitted values should look like random noise around 0. Systematic curvature suggests the mean structure is misspecified (e.g., a missing nonlinear term or interaction).
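A minimal base-R sketch of the residuals-vs-fitted check, using the chapter's mtcars model:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)  # residuals should scatter randomly around this line
```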
4.3.2 Constant variance
The vertical spread of residuals should be roughly constant across the range of fitted values. A “funnel” pattern suggests heteroskedasticity. The scale-location plot makes changes in spread easier to see.
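A scale-location plot can be sketched by hand in base R; plotting the square root of absolute standardised residuals compresses the scale and makes trends in spread easier to see:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
r <- rstandard(fit)
plot(fitted(fit), sqrt(abs(r)),
     xlab = "Fitted values",
     ylab = "sqrt(|standardised residuals|)",
     main = "Scale-location")
# A roughly flat trend is consistent with constant variance;
# a clear slope suggests heteroskedasticity.
```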
4.3.3 Normality
A Normal Q-Q plot compares the standardised residuals to what we’d expect under normal errors. Strong departures from the line suggest non-normality (often due to outliers or skewness).
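In base R the Q-Q check is two lines, shown here on the chapter's mtcars model:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
qqnorm(rstandard(fit))  # sample quantiles vs theoretical normal quantiles
qqline(rstandard(fit))  # points should lie close to this reference line
```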
4.3.4 Unusual and influential observations
Influence combines (i) leverage (unusual predictor values) and (ii) residual size. Points with high leverage and large residuals can have an outsized effect on the fitted model.
Key-point
If residuals show a clear pattern, that pattern is telling you what your model is missing.
4.3.5 Shortcut: plot() for lm objects
Once you know what to look for, base R will produce the same set of standard diagnostic plots automatically (and you can request a specific panel with which = 1, 2, 3, or 5).
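For example, the full panel of standard diagnostics for the chapter's model is:

```r
fit <- lm(mpg ~ wt + hp + am, data = mtcars)
par(mfrow = c(2, 2))  # arrange the four standard panels in a grid
plot(fit)             # which = c(1, 2, 3, 5) by default
par(mfrow = c(1, 1))
```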
4.4 Diagnosing issues and choosing remedies
We can use the diagnostic plots above to identify common problems with linear regression models. Below we outline some typical issues, how to recognise them in residual plots, and common strategies for addressing them.
4.4.1 Nonlinearity (missing curvature)
Symptoms
- A curved trend in residuals vs fitted (or residuals vs a predictor).
Common remedies
- Add polynomial terms (Chapter 3), transformations, or interactions.
- Consider omitted predictors that explain structure in the residuals.
4.4.2 Nonconstant variance (heteroskedasticity)
Symptoms
- A “funnel” shape in residuals vs fitted (spread increases or decreases with the fitted value).
- A trend in the scale–location plot.
Common remedies
- Transform the response (e.g., log or square-root) to stabilise variance.
- Use weighted least squares if you can model how variance changes.
- Use robust standard errors when inference is the main goal.
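The first two remedies can be sketched in a couple of lines; the choice of weights below is a purely illustrative assumption, not a recommendation:

```r
# Log response (requires mpg > 0); often tames spread that grows with the mean.
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Weighted least squares, under the illustrative assumption that the
# error variance is proportional to wt.
fit_wls <- lm(mpg ~ wt + hp, data = mtcars, weights = 1 / wt)
```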
4.4.3 Normality (and when it matters)
Normality of errors matters most for small-sample \(t\)-tests and confidence intervals. In large samples, linear regression often remains useful even when residuals are not perfectly normal, but it is still worth checking for strong departures caused by outliers or skewness.
Symptoms
- Strong curvature or heavy tails in the Q–Q plot.
Common remedies
- Check for outliers/data errors first.
- Consider a transformation of the response.
- Consider robust or resampling-based inference.
4.4.4 Independence (autocorrelation)
The independence assumption is most commonly violated when observations are collected over time (or space). If residuals are correlated, standard errors can be too small, and \(p\)-values can look “better” than they should.
Symptoms
- Residuals show runs/clustering when plotted against time/order.
- Residuals have clear autocorrelation (e.g., via an ACF plot).
Common remedies
- Add time/seasonality terms if they explain structure.
- Use models designed for correlated errors (time series, mixed models, GLS).
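An ACF check takes one line once the model is fitted; here is a sketch using a built-in time-ordered series (annual Lake Huron water levels regressed on a time index, chosen purely for illustration):

```r
# Illustrative time-ordered fit: annual Lake Huron levels vs a time index.
fit_ts <- lm(LakeHuron ~ seq_along(LakeHuron))
acf(resid(fit_ts), main = "ACF of residuals")  # spikes outside the bands
                                               # suggest autocorrelation
```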
4.4.5 Extrapolation and high-leverage prediction
Extrapolation means predicting for predictor values outside the range you observed. It is risky because the relationship itself may change outside the observed region, and because uncertainty grows rapidly when you move away from the data cloud (often showing up as high leverage).
Key-point
Before you trust a prediction, check whether it is an extrapolation. A simple first step is to compare each new predictor value to the observed range.
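The range check can be sketched as follows; the new data point is hypothetical:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 7, hp = 400)  # hypothetical new point
range(mtcars$wt)                         # observed range of wt
new_car$wt > max(mtcars$wt)              # TRUE here: this is extrapolation
```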
4.4.6 Transformations including Box–Cox
Transformations can help when residual plots show curvature or nonconstant variance. A systematic way to explore power transformations of the response is the Box–Cox transformation.
Box–Cox requires the response to be strictly positive.
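A minimal sketch using MASS::boxcox() (MASS ships with R); the fitted model is illustrative:

```r
library(MASS)  # for boxcox()
fit <- lm(mpg ~ wt + hp, data = mtcars)  # mpg is strictly positive
bc <- boxcox(fit, plotit = FALSE)        # profile likelihood over lambda
lambda_hat <- bc$x[which.max(bc$y)]      # lambda maximising the likelihood
```

Values of lambda_hat near 0 suggest a log transform, near 0.5 a square root, and near 1 no transform.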
4.4.7 Multicollinearity
Multicollinearity occurs when predictors are strongly correlated. It does not necessarily make predictions bad, but it can make coefficient estimates unstable and inflate standard errors—so interpretation and “which predictor matters?” questions become fragile.
Symptoms
- Large standard errors and coefficient sign changes when you add/remove a correlated predictor.
- Large variance inflation factors (VIFs).
Common remedies
- Drop or combine redundant predictors (guided by scientific context).
- Collect more data (especially with more diverse predictor combinations).
- Use dimension reduction or penalised regression when prediction is the goal.
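A VIF can be computed by hand from its definition, which avoids any extra packages; here is a sketch for one predictor (disp is chosen because it is strongly correlated with wt and hp):

```r
# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# X_j on the other predictors; car::vif() automates this across predictors.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
r2_disp  <- summary(lm(disp ~ wt + hp, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2_disp)
vif_disp  # values much larger than 5-10 are often taken as concerning
```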
4.4.8 Blind model selection
Automated selection procedures (stepwise, “try everything”, repeatedly adding/removing predictors until something is significant) can be useful for exploration, but they are easy to misuse.
Pitfalls
- \(p\)-values and confidence intervals become hard to interpret after extensive searching.
- Selected models can overfit and can be unstable to small data changes.
Remedies
- Use scientific context to limit the candidate set.
- Use out-of-sample validation (Chapter 3) and always re-check residual diagnostics on the final model.
4.5 Exercises
The goal of these exercises is to practise diagnosing patterns and choosing a sensible next step.
4.5.1 Exercise 1: Identify heteroskedasticity
Run the simulation below and look at the residual plot. Then answer: what issue is the plot mainly suggesting?
Exercise 4.1
Write your answer as a short phrase (e.g., "heteroskedasticity" or "nonconstant variance").
issue_4_1 <- "heteroskedasticity"
4.5.2 Exercise 2: Use VIF to spot multicollinearity
Fit the model and compute VIFs. Which term has the largest VIF?
Exercise 4.2
Set worst_term equal to the term with the largest VIF.
worst_term <- vif_from_lm(ex2_mod)$term[1]
4.5.3 Exercise 3: Interpret a Box–Cox suggestion
In the Box–Cox plot earlier, we computed lambda_hat. Based on its value, which option is the best rough interpretation?
Exercise 4.3
Choose one: "no transform", "log transform", or "square-root transform".
boxcox_choice <- if (abs(lambda_hat) < 0.25) "log transform" else if (abs(lambda_hat - 0.5) < 0.25) "square-root transform" else "no transform"
4.6 Handling outliers and influential observations
When diagnostics flag influential observations, slow down and investigate rather than deleting points automatically.
- Check data quality first (entry errors, unusual units, miscoding).
- If influential points are real, fit with and without them to assess robustness; report how conclusions change.
- Prefer model improvements (functional form, additional predictors, variance stabilisation) over deletion.
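The fit-with-and-without check can be sketched as follows; dropping the single highest-Cook's-distance point is one simple way to probe robustness:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
D <- cooks.distance(fit)
worst <- which.max(D)                            # most influential observation
fit_wo <- update(fit, data = mtcars[-worst, ])   # refit without it
cbind(full = coef(fit), without = coef(fit_wo))  # how much do estimates move?
```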