Chapter 0: Introduction to linear models
This chapter serves as a conceptual introduction to linear models, the foundation of linear regression. By constructing a simple linear model from basic principles, we aim to understand the assumptions that play a crucial role in linear regression analysis.
Statistical models
A central aim of statistical modelling is to understand how one variable changes in relation to others. In your own work, these variables will have concrete meaning - perhaps plant growth, reaction time, exam score, or income - but for now we will simply call them \(x\) and \(Y\).
In regression, we choose one variable \(Y\) to treat as the outcome we want to explain or predict, and \(x\) as one or more predictors. Our goal is to describe how changes in \(x\) are associated with changes in \(Y\).
A simple way to express this idea is
\[ Y = f(x) \]
meaning that the value of \(Y\) can be described by some function of \(x\). If we knew this function exactly, and if the world behaved perfectly, then knowing \(x\) would tell us everything about \(Y\). Many physical laws look like this (for example, \(E = mc^2\)), but real data rarely follow a perfectly deterministic relationship.
In practice, even when \(x\) is held constant, repeated observations of \(Y\) will vary. People respond differently, instruments fluctuate, biological systems are noisy, and experimental conditions change. To recognise this, statistical models include a random error term:
\[ Y = f(x) + \varepsilon. \]
Here, \(\varepsilon\) represents natural variability: the part of \(Y\) that our model does not or cannot explain.
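We can see this variability directly by simulation. The sketch below holds \(x\) fixed and draws repeated observations of \(Y\); the particular choices \(f(x) = 2 + 3x\) and normal errors with standard deviation 1 are illustrative assumptions, not part of the text:

```r
# A sketch: simulate noisy observations around a hypothetical linear trend.
# f(x) = 2 + 3x and normal errors (sd = 1) are illustrative choices.
set.seed(42)
x <- rep(5, 10)                     # hold x constant at 5
f <- function(x) 2 + 3 * x
eps <- rnorm(10, mean = 0, sd = 1)  # the random error term
Y <- f(x) + eps
Y  # ten different values of Y, even though x never changed
```

Even with \(x\) held constant, every observed \(Y\) differs, because each carries its own realisation of \(\varepsilon\).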
Linear prediction
To make our model concrete, we need to choose a form for the function \(f(x)\). A natural starting point—because it is simple, interpretable, and surprisingly powerful—is a linear function:
\[ f(x) = \alpha + \beta x . \]
This allows us to describe the expected value of \(Y\) as
\[ E[Y] = \alpha + \beta x \]
This is the familiar ‘straight-line’ relationship:
- \(\alpha\) is the intercept, the point where the line meets the vertical axis, and
- \(\beta\) is the slope, describing how we expect \(Y\) to change when \(x\) increases by one unit.
This decision to model \(E[Y]\) as a linear function of \(x\) is a key part of the simple linear model. By choosing a linear function (rather than some other form), we are making an important assumption about the relationship between \(x\) and \(Y\):
Assumption 1: Linearity
\(Y\) and \(x\) have a linear relationship.
Example 1: Salary growth over time
Suppose you have received a job offer from Company A, and you want to predict your salary after working there for 10 years. You are told that the average starting salary at this company is $50,000, and that salaries increase by $5,000 per year of employment.
We can represent this relationship using a simple linear predictor. For an employee with \(x\) years at the company, the expected salary is
\[ E[Y] = 50{,}000 + 5{,}000 \cdot x. \]
library(ggplot2)

ggplot() +
  geom_abline(intercept = 5e4, slope = 5e3, colour = '#2196F3') +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(x = "Years Employed", y = "Expected Salary ($)")

After 10 years of employment (\(x = 10\)), our linear predictor gives
\[ E[Y] = 50{,}000 + 5{,}000 \times 10 = 100{,}000. \]
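This calculation can be checked directly in R, evaluating the predictor from Example 1 at \(x = 10\):

```r
# Expected salary at Company A after 10 years, per the predictor above
E_Y_A <- 50000 + 5000 * 10
E_Y_A  # 100000
```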
Exercise 1: A competing offer
A second company also offers you a position. Their starting salary is higher—$70,000 on average—but their yearly pay increases are smaller. Employees who have been at the company for 6 years earn, on average, $18,000 more than when they started.
We model expected salary after \(x\) years as:
\[ E[Y] = \alpha + \beta x. \]
Choosing parameters
Assign the values of \(\alpha\) and \(\beta\) to the R variables alpha and beta:
- The starting salary gives \(\alpha = 70{,}000\).
- The 6-year increase gives \(6\beta = 18{,}000\), so \(\beta = 3{,}000\).
alpha <- 70000
beta <- 3000
Linear prediction
Thus the linear predictor for Company B is
\[ E[Y] = 70{,}000 + 3{,}000 \cdot x. \]
Below is a plot comparing salary trends for both companies:
ggplot() +
  geom_abline(aes(intercept = 7e4, slope = 3e3, colour = "Company B")) +
  geom_abline(aes(intercept = 5e4, slope = 5e3, colour = "Company A")) +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(
    x = "Years Employed",
    y = "Expected Salary ($)",
    colour = "Company"
  ) +
  scale_color_manual(values = c("Company B" = "#4CAF50", "Company A" = "#2196F3"))

Now compute the expected salary after 15 years, using the alpha and beta values you just assigned:
E_Y <- alpha + (beta * 15)

Using R functions
Evaluating the expression in R gives E_Y = 115000, which matches the calculation:
\[ E[Y] = 70{,}000 + 3{,}000 \times 15 = 115{,}000. \]
Now we can turn this into a reusable function:
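One possible definition is sketched below. The text does not show the function body, so treat this as an assumption: it simply wraps \(E[Y] = \alpha + \beta x\) with the Company B parameters assigned earlier.

```r
# A sketch of the reusable predictor: E[Y] = alpha + beta * x.
# alpha and beta are the Company B values assigned earlier.
alpha <- 70000
beta <- 3000

simple_linear_prediction <- function(x) {
  alpha + beta * x
}

simple_linear_prediction(15)  # 115000, matching the calculation above
```

Because the function accepts a vector, it can predict salaries for several employment durations in a single call.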
Predict salaries for these employment durations:
simple_linear_prediction(x)

Good work!
Next we will extend our linear predictor to form a full linear statistical model.