1  Simple Linear Regression (SLR)

Linear regression is a method that uses a linear model to describe the relationship between a response variable \(Y\) and one or more predictor variables \(x\). In this chapter, we focus on the simplest case of linear regression - simple linear regression (SLR) - where there is a single predictor variable \(x\).

1.1 When to use SLR

SLR is appropriate when you want to model the relationship between a continuous response variable \(Y\) and a single continuous predictor variable \(x\). There are some additional conditions that should be met for SLR to be valid, but we will cover these in later chapters.

Key-point

SLR is used to model the relationship between a continuous response variable (\(Y\)) and a single continuous predictor variable (\(x\)).

1.2 Fitting models to data

In ?sec-introduction-to-linear-models we constructed a simple linear model based on known parameters: \[ Y=\alpha + \beta x + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2) \] where \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\sigma\) is the standard deviation of the normally distributed error term \(\varepsilon\). In that section, we knew the values of \(\alpha\), \(\beta\) and even \(\sigma\).

However, in most contexts we don’t know the true relationship between \(x\) and \(Y\) in advance. Instead, we are given data and try to infer the values of \(\alpha\), \(\beta\), and \(\sigma\) by analysing the data. This analysis is called simple linear regression.

Indeed, at the end of Section 4 we generated data from a linear model. In preparation for this chapter, we’ve generated some data from a similar linear model. Here it is:

            x           y
1   1.1484953 -3.65698241
2   0.4788178 -0.45242799
3   8.4269137  2.90913303
4   2.4807076  1.58147837
5  -4.6346739 -5.25054041
6   4.3392622  1.73427324
7   2.2131474 -6.08696492
8   3.1465968  1.80038987
9   1.1555571 -3.19105387
10  0.8370133 -2.31677924
11  3.9198902 -1.42710242
12 -7.5732148 -5.67057910
13 -9.9588450 -9.54491062
14 -1.2662523 -2.07733060
15 -2.4495283 -7.29716359
16 -2.6289377 -4.58458364
17 -3.9140791 -7.62168629
18 -3.2422497 -4.61594575
19 -1.3129867 -0.06028876
20 -1.5541255 -3.33258640
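For the code examples and exercises that follow, we will assume this data is stored in a data frame called linearData (the name used in the exercises later in this chapter). If you are following along, it can be reconstructed directly from the printed values:

```r
# Reconstruct the printed data as a data frame named linearData
linearData <- data.frame(
  x = c(1.1484953, 0.4788178, 8.4269137, 2.4807076, -4.6346739,
        4.3392622, 2.2131474, 3.1465968, 1.1555571, 0.8370133,
        3.9198902, -7.5732148, -9.9588450, -1.2662523, -2.4495283,
        -2.6289377, -3.9140791, -3.2422497, -1.3129867, -1.5541255),
  y = c(-3.65698241, -0.45242799, 2.90913303, 1.58147837, -5.25054041,
        1.73427324, -6.08696492, 1.80038987, -3.19105387, -2.31677924,
        -1.42710242, -5.67057910, -9.54491062, -2.07733060, -7.29716359,
        -4.58458364, -7.62168629, -4.61594575, -0.06028876, -3.33258640)
)
nrow(linearData)  # 20 observations
```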

Our task for the remainder of the chapter (and in regression generally) is to make an (educated) guess about which linear model gave rise to this data.

1.2.1 Parameter estimates

We call the process of trying to guess the parameters of the data generating model (i.e. \(\alpha\), \(\beta\) and \(\sigma\)), estimation, and our guesses are estimates.

To avoid confusion, we will give our estimates different names from the true parameters. In particular, for simple linear regression we will denote an estimate of the intercept, \(\alpha\), by \(a\), an estimate of the slope, \(\beta\), by \(b\), and an estimate of the error standard deviation, \(\sigma\), by \(s\). So, we have:

Key-term: Parameter Estimates

The estimates of the parameters of a simple linear regression model are denoted as: \[ a: \text{Estimated intercept} \] \[ b: \text{Estimated slope} \] \[ s: \text{Estimated standard deviation}. \]

Key-point: Estimated vs true parameters

It is important to distinguish between estimated (sometimes called ‘fitted’, or ‘sample’) parameters and true (sometimes called ‘population’) parameters. The true parameters are assumed to govern the data generating process (and are unknown), while the estimated parameters are our best guesses for the true parameters based on the observed data. In this course we use Greek letters (e.g. \(\alpha\), \(\beta\), \(\sigma\)) to denote true parameters, while Latin letters (e.g. \(a\), \(b\), \(s\)) denote estimates of those parameters.

1.2.2 Predicted values

Just as the linear model with true parameters gives us the expected value of \(Y\): \[ E[Y]=\alpha + \beta x, \] our estimated linear model gives us an estimate of the value of \(Y\). We denote this as \(\hat{Y}\) (pronounced y-hat), the predicted value of \(Y\) given our estimated model:

Key-term: Predicted Value

The predicted value of the outcome variable \(Y\) given an estimated simple linear predictor is defined as: \[ \hat{Y}=a+bx \]
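As a quick sketch of this definition (the values of a, b, and x below are arbitrary choices for illustration, not fitted estimates):

```r
# Hypothetical parameter estimates (illustration only)
a <- 1
b <- 0.8

# Predicted values Y-hat = a + b*x at some example predictor values
x <- c(-2, 0, 3)
y_hat <- a + b * x
y_hat  # -0.6, 1.0, 3.4
```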


1.2.3 Fitting a line to data

Looking at data (i.e. pairs of \((x, Y)\) values), we try to find the estimates of our parameters that best ‘fit’ them. Adjust the parameter estimates with the sliders below to find a line that you think fits the data:

Exercise 1.1

What is your best estimate for the values of \(\alpha\) and \(\beta\) that best fit the given data?

Ok great, so we have an estimate for our line-of-best-fit:

But how good is our estimate really? What do we mean by ‘best’ fit here? And how can we objectively evaluate it and maybe compare it to other estimates?

1.2.4 Residuals

Residuals are the differences between the observed values of \(Y\) and the values predicted by our estimated linear predictor, \(\hat{Y}\). We denote residuals with the letter \(e\):

Key-term: Residual

The difference between an observed value of \(Y\) and the value predicted by our estimated model \(\hat{Y}\): \[ e=Y-\hat{Y} = Y - (a + bx). \]

Residuals, \(e\), are similar to the errors, \(\varepsilon\), that we encountered in Section 3 - but they are distinct. Since we do not know the ‘true’ values of \(\alpha\) and \(\beta\), we cannot calculate the errors, \(\varepsilon= Y- E[Y]= Y-(\alpha + \beta x)\). However, we do have our estimates, \(a\) and \(b\), so we can calculate residuals.

Indeed - calculating residuals gives us a way to assess how well our estimated model fits the data. We can think of the residuals as being the ‘mismatch’ between our estimated linear predictor and each data point. By minimizing the size of residuals (minimising the mismatch of our model to the data), we can get a better fit of our line to data.

1.3 Estimating coefficients

1.3.1 Optimising fit by minimising (squared) residuals

Below, the plot now also displays the squared residual for each data point as a square. Underneath the model equation is the sum of squared residuals (SSR), which gives a measure of the total squared mismatch between our linear predictor and the observed outcomes.

Key-term: Sum of squared residuals (SSR)

The sum of squared residuals (SSR) is the sum of the squares of the residuals from our estimated linear predictor: \[ SSR=\sum_{i=1}^{n} e_i^2 \]

where \(e_i = Y_i - \hat{Y}_i\).

The SSR gives us a numerical measure of how well the line fits the data - see how small you can get the SSR by adjusting slope and intercept.
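If you prefer code to sliders, the same game can be sketched in R: a small helper that computes the SSR for any candidate intercept a and slope b (the candidate values below are arbitrary guesses; linearData is the data frame of observations above):

```r
# SSR for a candidate line y = a + b*x over a data frame with columns x and y
ssr <- function(a, b, data) {
  e <- data$y - (a + b * data$x)  # residuals under the candidate line
  sum(e^2)
}

# Try a couple of candidate lines; the smaller the SSR, the better the fit
ssr(0, 1, linearData)
ssr(-1, 0.7, linearData)
```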

Exercise 1.2

Based on this visualisation, what do you think is the minimum possible \(SSR\) achievable?

If you want you can update your \(a\) and \(b\) estimates too (otherwise just proceed):

1.3.2 The least-squares estimate

We call the estimates \(a\) and \(b\) which minimise the sum of squared residuals the least squares estimates. Some nice mathematics tells us that the least squares estimates for a simple linear model are unique - that is, there is exactly one pair of values for \(a\) and \(b\) which satisfies this property. Moreover, we don’t have to manually adjust our parameters to keep lowering the SSR - there is a convenient closed-form expression which allows us to compute \(a\) and \(b\) very efficiently, and, as a statistical programming language, R makes this very easy.

1.3.2.1 Fitting a model with the lm() function

In R the lm() function computes the least squares estimates \(a\) and \(b\) for our simple linear model (among other things) in a single command:
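A minimal sketch, assuming the data sits in the linearData data frame shown earlier, storing the fit as lm_1 (the object name used in the exercises below):

```r
# Fit the simple linear model Y = a + b*x by least squares;
# the formula y ~ x reads "y modelled as a linear function of x"
lm_1 <- lm(y ~ x, data = linearData)
lm_1  # printing shows the estimated intercept (a) and slope (b)
```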

We can extract the coefficients from the lm by indexing:

or by the coef() function:
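Both routes return the same named vector (assuming the fitted model object is called lm_1, as in the exercises below):

```r
# By indexing into the fitted lm object:
lm_1$coefficients

# Or, equivalently, with the coef() accessor function:
coef(lm_1)

# Individual estimates can be pulled out by name:
coef(lm_1)["(Intercept)"]  # a, the estimated intercept
coef(lm_1)["x"]            # b, the estimated slope
```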

Let’s compare the estimated linear fits graphically:
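A base-R sketch of such a comparison (a_guess and b_guess are placeholders for whatever estimates you settled on with the sliders):

```r
# Scatter plot of the data with two candidate lines overlaid
plot(y ~ x, data = linearData, pch = 16)
a_guess <- 1    # placeholder: your manual intercept estimate
b_guess <- 0.8  # placeholder: your manual slope estimate
abline(a = a_guess, b = b_guess, lty = 2)  # dashed: manual guess
abline(lm(y ~ x, data = linearData))       # solid: least squares fit
```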

1.4 Estimating variance

Once the line has been fitted (i.e. \(\alpha\) and \(\beta\) have been estimated as \(a\) and \(b\)), we also have to estimate the distribution of error terms (remember, as per ?sec-varAssumption, the error distribution has a constant variance defined by \(\sigma^2\)). As you might expect, we again utilise the residuals from our least squares linear predictor to estimate \(\sigma\). The spread of the error terms will be estimated by the spread of the residuals around our line of best fit.

We can extract the individual residual values from the fitted lm object by indexing (e.g. my_lm$residuals) or with the residuals() function.

The resulting R object is a vector containing the residual for each observation, e.g. residuals(lm_1)[[3]] gives \(e_3\).

Exercise 1.3: Extracting residuals from an lm object

Extract the residuals from your least squares fitted model lm_1 by indexing, and assign them to the variable e.

e <- lm_1$residuals

1.4.1 Residual Standard Error

The residuals from our least squares fit can be used to estimate the variance of the error terms in our linear model. The residual sum of squares (RSS), the same quantity we called the sum of squared residuals (SSR) during ‘least squares’ fitting in Section 1.3, measures the variation of the observations around the fitted line, and can therefore also be used to estimate the variance of the error terms.

Key-term: Residual Sum of Squares (RSS)

The residual sum of squares (RSS) is defined as \[ RSS = \sum_{i=1}^{n} e_i^2 \] where \(e_i = Y_i - \hat{Y}_i\) are the residuals from our estimated linear predictor.

Exercise 1.4: Calculate the Residual Sum of Squares

Using your e object defined in the previous exercise, calculate the residual sum of squares of the least squares model.

We simply square each residual using e^2 and then sum the squares together using the sum() function:

rss <- sum(e^2)

However, the RSS itself is not a very useful measure of the spread of the residuals, since it depends on the number of observations - more observations will tend to lead to a larger RSS, even if the spread of the residuals is the same. We need a measure of the average size of the residual (per observation). To standardise the RSS, we might consider the average squared residual, which we can obtain by dividing the RSS by the number of observations, \(n\). However, this would tend to underestimate the true variance of the error terms, \(\sigma^2\), because the residuals are calculated from the estimated line, which has already ‘used up’ some of the information (see degrees of freedom) in the data.

To adjust for the fact that we have already estimated two parameters (\(a\) and \(b\)) from the data, we instead divide the RSS by \(n-k\) (where \(n\) is the number of observations, and \(k\) is the number of parameters already estimated - in this case 2). This gives us the mean squared error (MSE), which is our estimator for \(\sigma^2\):

Key-term: Mean Squared Error (MSE)

We define the mean squared error (MSE) of a simple linear regression model as \[ s^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2 = \frac{RSS}{n-2} \]

Just as we often prefer to work with standard deviations rather than variances, we also define the residual standard error (RSE) as the square root of the MSE:

Key-term: Residual Standard Error (RSE)

The residual standard error (RSE) of a simple linear regression model is defined as \[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} e_i^2} = \sqrt{\frac{RSS}{n-2}}. \]

\(s^2\) and \(s\) are our estimators for \(\sigma^2\) and \(\sigma\), respectively.

Exercise 1.5: Calculate the Residual Standard Error

Using your rss object defined in the previous exercise, calculate the residual standard error of the least squares model.

The RSE is \(s = \sqrt{\frac{RSS}{n-2}}\). You can get \(n\) using nrow(linearData).

rse <- sqrt(sum(e^2)/(nrow(linearData)-2))

1.4.2 Correlation and \(R^2\)

The residual standard error \(s\) tells us the absolute spread of observations around the fitted line — it is measured in the same units as \(Y\). But it does not tell us how much of the original variation in \(Y\) our model has explained. For that, we need a relative measure.

The total sum of squares (TSS) captures the total variation in \(Y\) around its mean:

Key-term: Total Sum of Squares (TSS)

\[ \text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \]

Exercise 1.6: Calculate TSS

Using linearData, calculate the total sum of squares (TSS).

TSS is sum((linearData$y - mean(linearData$y))^2).

TSS <- sum((linearData$y - mean(linearData$y))^2)
TSS

By comparing the residual sum of squares (RSS) to the total sum of squares (TSS), we can quantify the proportion of variance explained by our model. This ratio is called R-squared:

Key-term: R-squared (\(R^2\))

The \(R^2\) statistic is defined as \[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}. \]

i.e. the proportion of the total variance in \(Y\) (the total sum of squares, TSS) explained by the model.

  • When \(R^2 = 1\), all variation is explained (RSS = 0).
  • When \(R^2 = 0\), the model explains none of the variation (RSS = TSS).

As a visual demonstration of how \(R^2\) changes with increasing error variance, consider the following interactive plot:

We can see that as \(\sigma\) increases, \(R^2\) decreases, indicating that the model explains less of the variance in \(Y\). Fitting a line to data with high noise results in a poor fit, as reflected by a low \(R^2\) value.

Exercise 1.7: Correlation and \(R^2\) in our example

Compute \(R^2\) using the residuals from lm_1.

r <- cor(linearData$x, linearData$y)
r_squared <- 1 - sum(e^2) / sum((linearData$y - mean(linearData$y))^2)
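A useful fact (specific to simple linear regression with an intercept) is that \(R^2\) is exactly the square of the correlation between \(x\) and \(Y\); assuming the lm_1 model and linearData data frame from earlier, you can verify this numerically:

```r
# Manual R-squared: 1 - RSS/TSS
rss <- sum(residuals(lm_1)^2)
tss <- sum((linearData$y - mean(linearData$y))^2)
r_squared <- 1 - rss / tss

# In SLR, R-squared equals the squared correlation coefficient,
# and matches the value reported by summary()
r <- cor(linearData$x, linearData$y)
all.equal(r^2, r_squared)
all.equal(summary(lm_1)$r.squared, r_squared)
```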

1.5 Inference for SLR

Up to this point, we have fitted a straight-line model to a sample of data, obtaining estimates \(a\) and \(b\) for the intercept and slope. These estimates describe the pattern visible in the sample, but they do not tell us how closely they reflect the true population parameters \(\alpha\) and \(\beta\). Because sampling introduces randomness, different samples would produce different fitted lines.

This raises the central question:

How much confidence can we place in the slope and intercept we estimated, and what do they tell us about the true relationship in the population?

To answer this, we rely on statistical inference, which quantifies the uncertainty in the estimates and evaluates whether the predictor genuinely influences the response.

Importantly, the inferential procedures we use rest on the assumptions of the linear regression model:

  • The errors \(\varepsilon\) have mean \(0\).
  • They have constant variance \(\sigma^2\) (homoscedasticity).
  • They are independent.
  • They are Normally distributed.

The Normality assumption is what allows us to derive the sampling distributions of \(a\) and \(b\), leading directly to the t-tests and confidence intervals used for inference. Without these assumptions—especially Normality—the exact forms of these inferential tools would not hold.

Note

For a broader introduction to statistical inference, see the Inferential Statistics with R short course. If concepts such as the Normal distribution or t-tests feel unfamiliar, please review that material before continuing.


1.5.1 Inference About the Slope, \(\beta\)

In simple linear regression, the population relationship is modelled as

\[ Y = \alpha + \beta x + \varepsilon. \]

To determine whether \(x\) is genuinely associated with \(Y\), we test:

\[ H_0: \beta = 0 \qquad \text{vs.} \qquad H_a: \beta \ne 0. \]

  • Under \(H_0\), changes in \(x\) do not affect the mean of \(Y\) (a change of \(\Delta x\) in \(x\) leads to a change of \(\beta \cdot \Delta x = 0 \cdot \Delta x = 0\) in the mean of \(Y\)).
  • Under \(H_a\), there is evidence of a real linear effect (i.e. a change in \(x\) will lead to a non-zero change in \(Y\)).

Because the Normality assumption implies that the estimator \(b\) has a Normal sampling distribution (and hence a \(t\) distribution once \(\sigma\) is estimated), we are able to quantify how unusual our observed slope would be if \(H_0\) were correct.

1.5.1.1 The t-Test for the Slope

The hypothesis test is carried out using the statistic

\[ t = \frac{b}{\text{SE}(b)}, \]

which follows a \(t\)-distribution with \(n - 2\) degrees of freedom.

Interpretation:

  • A large value of \(|t|\) (small p-value) indicates evidence that \(\beta \ne 0\).
  • A small value of \(|t|\) suggests the data are consistent with no linear effect.

The validity of this test relies on the Normality of the errors, which guarantees that this \(t\) statistic follows the appropriate reference distribution.
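In R, summary() of a fitted lm reports this t statistic and its p-value for every coefficient; a sketch assuming the lm_1 model fitted earlier:

```r
# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- summary(lm_1)$coefficients
ctab

# The slope's t statistic is its estimate divided by its standard error
ctab["x", "Estimate"] / ctab["x", "Std. Error"]  # equals ctab["x", "t value"]
```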

Note

While the assumption of Normality is what theoretically grounds the use of the t-test for parameter estimates, SLR is quite robust to violations of this assumption and inferences about our slope parameters may be of interest even with non-normal residual distributions.

1.5.1.2 Confidence Interval for the Slope

A \((1-\alpha)100\%\) confidence interval for \(\beta\) is (here \(\alpha\) denotes the significance level, not the intercept):

\[ b \pm t_{\alpha/2,\,n-2}\,\text{SE}(b). \]

Interpretation:

  • An interval excluding zero indicates a likely genuine relationship.
  • An interval including zero suggests weaker evidence.

Confidence intervals complement hypothesis tests by communicating both the direction and plausible magnitude of the effect.

In R, you can compute the confidence interval for the slope directly:
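A sketch using the confint() function, assuming the lm_1 model fitted earlier:

```r
# 95% confidence intervals for the intercept (alpha) and slope (beta)
confint(lm_1)

# Other confidence levels via the 'level' argument:
confint(lm_1, level = 0.99)
```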

1.5.2 Inference About the Response, \(Y\)

Once we have fitted a regression model, we often want to make statements about the value of the response at a given predictor value \(x_0\). There are two distinct quantities of interest:

  1. The mean (average) response at \(x_0\): \[ \mu_Y(x_0) = \alpha + \beta x_0. \]

  2. A new individual response at \(x_0\): \[ Y_{\text{new}} = \alpha + \beta x_0 + \varepsilon. \]

These involve different uncertainties, and therefore require different intervals.

Confidence intervals for the mean response reflect uncertainty in \(a\) and \(b\). Prediction intervals include that uncertainty plus the additional variability from the random error term \(\varepsilon\).

1.5.2.1 Confidence interval for the Mean Response

Let \(\hat{y}_0 = a + b x_0\) be the fitted value at \(x_0\). A confidence interval for the mean response is:

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s\sqrt{ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} }. \]

This interval quantifies uncertainty in the average value of \(Y\) for units with predictor value \(x_0\).

new_point <- data.frame(x = 2)  # hypothetical predictor value x0; 'fit' is the fitted lm
predict(fit, newdata = new_point, interval = "confidence")

1.5.2.2 Prediction interval for a New Observation

To predict an individual outcome at \(x_0\), we must include the additional uncertainty from the random error \(\varepsilon\):

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s\sqrt{ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} }. \]

Because of the extra “1” term, prediction intervals are always wider than confidence intervals.

predict(fit, newdata = new_point, interval = "prediction")

1.5.2.3 Summary

  • Confidence interval → uncertainty in the expected value at \(x_0\)
  • Prediction interval → uncertainty in a new outcome at \(x_0\)

We now have the full SLR toolkit: estimation, fit summaries, and inference for slopes and responses. Next, we extend these ideas to multiple predictors in Chapter 2, and later return to residual diagnostics and assumption checking in Chapter 4.