Chapter 0: Introduction to linear models

This chapter serves as a conceptual introduction to linear models, the foundation of linear regression. By constructing a simple linear model from basic principles, we aim to understand the assumptions which play a crucial role in linear regression analysis.

Statistical models

A central aim of statistical modelling is to understand how one variable changes in relation to others. In your own work, these variables will have concrete meaning - perhaps plant growth, reaction time, exam score, or income - but for now we will simply call them \(x\) and \(Y\).

In regression, we choose one variable \(Y\) to treat as the outcome we want to explain or predict, and \(x\) as one or more predictors. Our goal is to describe how changes in \(x\) are associated with changes in \(Y\).

A simple way to express this idea is

\[ Y = f(x) \]

meaning that the value of \(Y\) can be described by some function of \(x\). If we knew this function exactly, and if the world behaved perfectly, then knowing \(x\) would tell us everything about \(Y\). Many physical laws look like this (for example, \(E = mc^2\)) but real data rarely follow a perfectly deterministic relationship.

In practice, even when \(x\) is held constant, repeated observations of \(Y\) will vary. People respond differently, instruments fluctuate, biological systems are noisy, and experimental conditions change. To recognise this, statistical models include a random error term:

\[ Y = f(x) + \varepsilon. \]

Here, \(\varepsilon\) represents natural variability: the part of \(Y\) that our model does not or cannot explain.

Linear prediction

To make our model concrete, we need to choose a form for the function \(f(x)\). A natural starting point—because it is simple, interpretable, and surprisingly powerful—is a linear function:

\[ f(x) = \alpha + \beta x . \]

This allows us to describe the expected value of \(Y\) as

\[ E[Y] = \alpha + \beta x \]

This is the familiar ‘straight-line’ relationship:
- \(\alpha\) is the intercept, the point where the line meets the vertical axis, and
- \(\beta\) is the slope, describing how we expect \(Y\) to change when \(x\) increases by one unit.

This decision to model \(E[Y]\) as a linear function of \(x\) is a key part of the simple linear model. By choosing a linear function (rather than some other form), we are making an important assumption about the relationship between \(x\) and \(Y\):

Assumption 1: Linearity

\(Y\) and \(x\) have a linear relationship.

Example 1: Salary growth over time

Suppose you have received a job offer from Company A, and you want to predict your salary after working there for 10 years. You are told that the average starting salary at this company is $50,000, and that salaries increase by $5,000 per year of employment.

We can represent this relationship using a simple linear predictor. For an employee with \(x\) years at the company, the expected salary is

\[ E[Y] = 50{,}000 + 5{,}000 \cdot x. \]

Expected salary at Company A as a function of years employed.
ggplot() +
  geom_abline(intercept = 5e4, slope = 5e3, colour = '#2196F3') +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(x = "Years Employed", y = "Expected Salary ($)")

After 10 years of employment (\(x = 10\)), our linear predictor gives

\[ E[Y] = 50{,}000 + 5{,}000 \times 10 = 100{,}000. \]

Exercise 1: A competing offer

A second company also offers you a position. Their starting salary is higher—$70,000 on average—but their yearly pay increases are smaller. Employees who have been at the company for 6 years earn, on average, $18,000 more than when they started.

We model expected salary after \(x\) years as:

\[ E[Y] = \alpha + \beta x. \]

Choosing parameters

Assign the values of \(\alpha\) and \(\beta\) to the R variables alpha and beta:

  • The starting salary gives \(\alpha = 70{,}000\).
  • The 6-year increase gives \(6\beta = 18{,}000\), so \(\beta = 3{,}000\).
alpha <- 70000 beta <- 3000
alpha <- 70000
beta <- 3000

Linear prediction

Thus the linear predictor for Company B is

\[ E[Y] = 70{,}000 + 3{,}000 \cdot x. \]

Below is a plot comparing salary trends for both companies:

Linear salary trends for two companies.
ggplot() +
  geom_abline(aes(intercept = 7e4, slope = 3e3, colour = "Company B")) +
  geom_abline(aes(intercept = 5e4, slope = 5e3, colour = "Company A")) +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(
    x = "Years Employed",
    y = "Expected Salary ($)",
    colour = "Company"
  ) +
  scale_color_manual(values = c("Company B" = "#4CAF50", "Company A" = "#2196F3"))

Now compute the expected salary after 15 years, using the alpha and beta values you just assigned:

E_Y <- alpha + (beta * 15)
E_Y <- alpha + (beta * 15)

Using R functions

Evaluating the expression in R:

This matches the calculation:

\[ E[Y] = 70{,}000 + 3{,}000 \times 15 = 115{,}000. \]

Now we can turn this into a reusable function:

Predict salaries for these employment durations:

simple_linear_prediction(x)
simple_linear_prediction(x)

Good work!

Next we will extend our linear predictor to form a full linear statistical model.

Random Errors

In practice, data rarely fall perfectly on a straight line. Even if the underlying relationship between \(x\) and \(Y\) is approximately linear, individual observations tend to vary around that line. As mentioned in Section Statistical models, to account for this variation we add an error term, \(\varepsilon\), to our model:

\[ Y=\alpha + \beta x + \varepsilon \]

The error term is the difference between the observed value and the linear predictor: \[ \varepsilon = Y - E[Y] = Y - (\alpha + \beta x). \]

\(\varepsilon\) is a random variable, meaning that it can take different values each time we observe \(Y\) for a given \(x\). This randomness captures the idea that even when \(x\) is held constant, repeated observations of \(Y\) will vary.

While we don’t know the exact value of \(\varepsilon\) for any particular observation, we can describe its overall behaviour using probability. We typically assume three properties of \(\varepsilon\):

On average, we expect the error terms to balance out.

Assumption 2: Mean of errors

The mean of the error term is zero (mean-zero errors):

\[ E[\varepsilon]=\mu=0 \]

i.e. The linear predictor gives the correct value of \(Y\) on average. This is why the ‘expected value’ of \(Y\) is given by the linear function: \[ E[Y]=\alpha + \beta x + E[\varepsilon] = \alpha + \beta x + 0 = \alpha + \beta x \]

While correct on average, we expect there to be some spread of data around the line (this is why we have the error term). The spread of a random variable like \(\varepsilon\) is captured by its variance, which we denote \(Var(\varepsilon)\). We assume that \(Var(\varepsilon)\) does not depend on \(x\). That is, the spread of errors is the same for all values of \(x\).

Assumption 3

The variance of the error term is constant for all values of x (homoscedasticity): \[Var(\varepsilon)=\sigma^2\]

This is similar to the \(\mu=0\) assumption above, but where there we specified the specific constant value (zero), here we specify that the variance is constant but leave its value (\(\sigma^2\)) unspecified.

The term homoscedasticity comes from the Greek words for ‘same’ (homo) and ‘spread’ (scedasis), meaning ‘same spread’ and is therefore another way of describing the same property. ::: Terminology ### Homoscedasticity Homoscedasticity means constant error variance across all values of \(x\). :::

While the assumptions of \(\mu=0\) and \(Var(\varepsilon)=\sigma^2\) describe the center and spread of the errors, they don’t fully specify the shape of their distribution. If we want to draw precise inferences about probability events (e.g. to test hypotheses using p-values), we can make a stronger assumption about the distribution of the errors. In particular, we choose to assume that the errors follow a Normal distribution.

Assumption 4

The error is normally distributed (normal errors) with mean \(\mu=0\) and variance \(\sigma^2\). \[\varepsilon \sim \mathcal{N}(0, \sigma^2)\]

The normal distribution is familiar and convenient to work with, and is often a reasonable approximation for real-world data. As a reminder of its shape, you can use the interactive plot below to explore normal distributions with different means and standard deviations.

Figure 1: Normal distribution with adjustable mean and standard deviation.

This bell-shaped curve tells us that most values of \(\varepsilon\) are close to zero, with larger deviations becoming increasingly rare. The standard deviation, \(\sigma\), controls how tightly the errors cluster around zero: smaller values of \(\sigma\) lead to a narrower peak, while larger values spread the errors out more widely.

Example 2: Variation in salary

Let’s return to our simple linear model of salary at Company A,

\[ E[Salary] = 50,000 + 5,000\times Years \]

This expresses how salary tends to increase with experience (i.e. on average). But in practice, not every employee with the same number of years earns the same amount — some earn a bit more, some a bit less. Suppose we know that most employees — about two-thirds of them — earn within roughly $4,000 of the average salary for their experience level. In other words, if the average salary after five years is $75,000, then about two-thirds of employees earn between $71,000 and $79,000, and almost everyone (about 95%) earns between $67,000 and $83,000.

We can capture this variability with a random error term, \(\varepsilon\), assumed to follow a Normal distribution with mean 0 and standard deviation \(\sigma = 4,000\).

\[ Salary = 50,000 + 5,000\times Years + \varepsilon, \quad{\varepsilon \sim \mathcal{N}(0,4000^2)} \]

This means that for a given number of years \(x\): - The expected salary is \(50,000 + 5,000\cdot x\) - Actual salaries will vary around that average, typically within about ±$4,000

For example, after 5 years (x=5): \[ E[Y]=\$50,000 + \$5,000 \times 5 = \$75,000 \]

The distribution of salaries for employees with 5 years’ experience is

\[ Y\sim \mathcal{N}(75,000, 4,000^2) \]

Since we have a probability distribution over \(Y\), we can use R to evaluate the probability of any given salary after x years at the company.

For example, we want to know if employed at company A, what is the probability that after working there for 10 years I will have a salary of at least $110,000?

First let’s calculate the average salary after 10 years

50000+(5000*10)
[1] 1e+05

$100,000.

Now

pnorm(11e4,1e5, 4e3, lower.tail = FALSE)
[1] 0.006209665

The following diagram tells us roughly the probability of observing a \(Y\) value in the given range:

Figure 2: Probability of salary falling within a chosen interval.

Here \(\varepsilon\) represents random deviations from the expected (average) salary for a given number of years x. - Same model as above - given standard deviation - Example Y values - Illustrate error with normal overay - Excersises: - which variance is larger? - (hard) probability of finding E[Y]+2sd observation

Exercise 2: Normally distributed errors

Considering our model of salary with \(\varepsilon \sim \mathcal{N}(0, 4000^2)\), what is the probability that an employee with 5 years of experience earns more than $83,000?

pnorm(83000, 75000, 4000, lower.tail = FALSE)
pnorm(83000, 75000, 4000, lower.tail = FALSE)

The Simple Linear Model

Putting these pieces together we obtain the full specification of the simple linear model:

Assumption 5: The Simple Linear Model

The relationship between \(Y\) and \(x\) is described by a linear function plus normally distributed random error: \[ Y=\alpha+\beta x+\varepsilon, \quad{\varepsilon \sim N(0,\sigma^2)} \]

This model how the outcome variable \(Y\) depends on the predictor variable \(x\), with parameters \(\alpha\) (intercept), \(\beta\) (slope), and \(\sigma\) (standard deviation of the errors). These parameters control the location, direction, and spread of the data around the linear trend. We can visualise the probability distribution of our outcome variable \(Y\) for different values of \(x\) using the interactive plot below:

(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
Figure 3: Cross-sections of the simple linear model normal error density.

Here, each vertical ‘slice’ of the plot shows the Normal distribution of \(Y\) for a specific value of \(x\), with the black line indicating the expected value \(E[Y] = \alpha + \beta x\). The ‘peak’ of each slice corresponds to the most probable value of \(Y\) for that \(x\), while the spread reflects the variability introduced by the error term \(\varepsilon\).

When we assume a linear model, we make an assumption about the ‘process’ that generates our data. Specifically, we assume that for each value of \(x\), the corresponding \(Y\) values are drawn from a Normal distribution whose mean is given by the linear function \(\alpha + \beta x\) and whose standard deviation is \(\sigma\). While our model is simple, we can use it to generate realistic data that captures both the linear trend and the natural variability around it.

A simple Linear model in R

To see how the model ‘generates’ data, we will simulate observations using the salary example parameters from above. 1 The \(x\) values here are generated just for illustration; in regression we condition on \(x\) rather than treat it as random.

  1. To begin, we start with a collection of \(x\) values. We treat these as fixed values, representing years of experience at the company. We can generate some example \(x\) values in R like this:
n <- 100
x <- runif(n, min = 0, max = 20)
head(x)
[1]  5.874150 17.952255 13.588758  2.003901 13.875355 19.200284
  1. Define the model

Next, we construct our simple linear model

simple_linear_model <- function(x, alpha, beta, sigma) {
  E_Y <- alpha + (beta * x) # Expected value of Y 
  epsilon <- rnorm(length(x), mean = 0, sd = sigma) # random normal errors
  Y <- E_Y + epsilon # 'Observed' Y values
  return(Y) 
}

The simple_linear_model() function takes x values and returns the specified linear function with normally distributed random noise added.

Now we can simulate observations of \(Y\) given our collection of \(x\) values and the parameters we have chosen for our salary example:

Y <- simple_linear_model(x, alpha=5e4, beta=5e3, sigma=4e3)
head(Y)
[1]  72796.48 147073.20 119517.53  55525.84 119519.02 151119.24

Let’s look at the joint distribution of x and Y:

Data generated from the simple linear model \(Y=50{,}000+5{,}000\times x + \varepsilon\), with \(\varepsilon\sim N(0,4{,}000^2)\). Dashed line shows \(E[Y|x] = \alpha + \beta x\).
library(ggplot2)

df <- data.frame(x = x, Y = Y)

ggplot(df, aes(x, Y)) +
  geom_point(alpha = 0.7) +
  theme_minimal()

The scatter plot shows a single simulated ‘sample’ of data from our simple linear model. If we repeated the simulation, we would get a different set of \(Y\) values (because of the randomness introduced by the error term), but they would still cluster around the same dashed line representing the expected value \(E[Y|x] = \alpha + \beta x\).

Here’s the same code running in webR - try adjusting some of the parameters (e.g. the number of samples), or just running again to plot a different random sample from the same model!

Summary

In this section, we have introduced the simple linear model, which describes the relationship between an outcome variable \(Y\) and a predictor variable \(x\) using a linear function plus normally distributed random error: \[ Y=\alpha + \beta x + \varepsilon, \quad{\varepsilon \sim N(0,\sigma^2)} \] This model captures both the expected trend of \(Y\) as a function of \(x\) (given by the linear function \(\alpha + \beta x\)) and the natural variability around that trend (captured by the error term \(\varepsilon\)).

We have also seen how this model can ‘generate’ realistic data that reflects both the linear relationship and the random variation. In the next section, we will explore how to use observed data to estimate the parameters \(\alpha\), \(\beta\), and \(\sigma\) of the simple linear model.


  1. 1↩︎