Chapter 0: Introduction to linear models

This chapter serves as a conceptual introduction to linear models, the foundation of linear regression. By constructing a simple linear model from basic principles, we aim to understand the assumptions which play a crucial role in linear regression analysis.


Statistical models

A central aim of statistical modelling is to understand how one variable changes in relation to others. In your own work, these variables will have concrete meaning (perhaps plant growth, reaction time, exam score, or income), but for now we will simply call them $x$ and $Y$.

In regression, we treat one variable, $Y$, as the outcome we want to explain or predict, and $x$ as one or more predictors. Our goal is to describe how changes in $x$ are associated with changes in $Y$.

A simple way to express this idea is

\[ Y = f(x) \]

meaning that the value of $Y$ can be described by some function of $x$. If we knew this function exactly, and if the world behaved perfectly, then knowing $x$ would tell us everything about $Y$. Many physical laws look like this (for example, $E = mc^2$) but real data rarely follow a perfectly deterministic relationship.

In practice, even when $x$ is held constant, repeated observations of $Y$ will vary. People respond differently, instruments fluctuate, biological systems are noisy, and experimental conditions change. To recognise this, statistical models include a random error term:

\[ Y = f(x) + \varepsilon. \]

Here, $\varepsilon$ represents natural variability: the part of $Y$ that our model does not or cannot explain.
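To see what the error term does, here is a small simulation sketch in R. The particular function `f`, the normally distributed errors, and the standard deviation of 2 are illustrative assumptions made for this sketch; nothing above fixes them.

```r
# Illustrative simulation: repeated observations of Y at a fixed x.
# The function f and the normal errors (sd = 2) are assumptions for this sketch.
set.seed(1)

f <- function(x) 3 + 2 * x               # hypothetical deterministic part

x <- 5                                   # hold x constant
epsilon <- rnorm(10, mean = 0, sd = 2)   # ten random errors
y <- f(x) + epsilon                      # ten observations of Y at the same x

f(x)   # the deterministic part: 13
y      # the observed values scatter around f(x)
```

Although `x` never changes, the ten values in `y` all differ, and the only source of that difference is `epsilon`.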


Linear prediction

To make our model concrete, we need to choose a form for the function $f(x)$. A natural starting point—because it is simple, interpretable, and surprisingly powerful—is a linear function:

\[ f(x) = \alpha + \beta x . \]

If the error term averages to zero, this lets us describe the expected value of $Y$ as

\[ E[Y] = \alpha + \beta x. \]

This is the familiar ‘straight-line’ relationship:
- $\alpha$ is the intercept, the point where the line meets the vertical axis, and
- $\beta$ is the slope, describing how we expect $Y$ to change when $x$ increases by one unit.
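The interpretation of the intercept and slope can be checked numerically. In this sketch the values of `alpha` and `beta`, and the function name `linear_predictor`, are arbitrary choices made only for illustration.

```r
# Hypothetical parameter values, chosen only for illustration
alpha <- 2
beta  <- 0.5

# Linear predictor: the expected value of Y at a given x
linear_predictor <- function(x) alpha + beta * x

linear_predictor(0)                         # equals alpha, the intercept
linear_predictor(4) - linear_predictor(3)   # equals beta, the slope
```

The difference between predictions at any two values of $x$ one unit apart is the same, which is what makes the relationship a straight line.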

This decision to model $E[Y]$ as a linear function of $x$ is a key part of the simple linear model. By choosing a linear function (rather than some other form), we are making an important assumption about the relationship between $x$ and $Y$:

Assumption 1: Linearity

The expected value of $Y$ is a linear function of $x$.


Example 1: Salary growth over time

Suppose you have received a job offer from Company A, and you want to predict your salary after working there for 10 years. You are told that the average starting salary at this company is $50,000, and that salaries increase by $5,000 per year of employment.

We can represent this relationship using a simple linear predictor. For an employee with $x$ years at the company, the expected salary is

\[ E[Y] = 50{,}000 + 5{,}000 \cdot x. \]

Expected salary at Company A as a function of years employed.

library(ggplot2)

# Draw the line E[Y] = 50,000 + 5,000 x for Company A
ggplot() +
  geom_abline(intercept = 5e4, slope = 5e3, colour = '#2196F3') +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(x = "Years Employed", y = "Expected Salary ($)")

After 10 years of employment ($x = 10$), our linear predictor gives

\[ E[Y] = 50{,}000 + 5{,}000 \times 10 = 100{,}000. \]
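The same arithmetic can be done in R. The variable names below (`alpha_a`, `beta_a`, `years`) are hypothetical names chosen for this sketch; the numbers are those given in the example.

```r
# Company A: values taken from the example above
alpha_a <- 50000   # average starting salary
beta_a  <- 5000    # average yearly increase

years <- 10
expected_salary_a <- alpha_a + beta_a * years
expected_salary_a  # 100000 (R may display this as 1e+05)
```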


Exercise 1: A competing offer

A second company also offers you a position. Their starting salary is higher—$70,000 on average—but their yearly pay increases are smaller. Employees who have been at the company for 6 years earn, on average, $18,000 more than when they started.

We model expected salary after $x$ years as:

\[ E[Y] = \alpha + \beta x. \]

Choosing parameters

Assign the values of $\alpha$ and $\beta$ to the R variables alpha and beta:
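One possible solution, read off from the offer described above: the intercept is the average starting salary, and the slope is the 6-year increase divided by 6.

```r
# Company B: the intercept is the average starting salary
alpha <- 70000

# Slope: an average increase of 18,000 over 6 years
beta <- 18000 / 6

alpha
beta   # 3000
```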

Linear prediction

With $\alpha = 70{,}000$ and $\beta = 18{,}000 / 6 = 3{,}000$, the linear predictor for Company B is

\[ E[Y] = 70{,}000 + 3{,}000 \cdot x. \]

Below is a plot comparing salary trends for both companies:

ggplot() +
  geom_abline(aes(intercept = 7e4, slope = 3e3, colour = "Company B")) +
  geom_abline(aes(intercept = 5e4, slope = 5e3, colour = "Company A")) +
  lims(x = c(0, 15), y = c(4e4, 13e4)) +
  labs(
    x = "Years Employed",
    y = "Expected Salary ($)",
    colour = "Company"
  ) +
  scale_color_manual(values = c("Company B" = "#4CAF50", "Company A" = "#2196F3"))

Now compute the expected salary after 15 years, using the alpha and beta values you just assigned:

Using R functions

Evaluating the expression in R:
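A minimal sketch of that evaluation, with the Company B parameters repeated so the snippet runs on its own:

```r
# Company B parameters, as given in the exercise
alpha <- 70000
beta  <- 18000 / 6

# Expected salary after 15 years
alpha + beta * 15
```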

This matches the calculation:

\[ E[Y] = 70{,}000 + 3{,}000 \times 15 = 115{,}000. \]

Now we can turn this into a reusable function:
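A minimal sketch of such a function follows. The name `predict_salary` and its default arguments are hypothetical choices for this sketch; the inputs 0, 6 and 15 are the durations already mentioned in this exercise.

```r
# predict_salary is a hypothetical name; the defaults are Company B's parameters
predict_salary <- function(years, alpha = 70000, beta = 3000) {
  alpha + beta * years
}

predict_salary(15)            # 115000

# R arithmetic is vectorised, so a vector of durations also works
predict_salary(c(0, 6, 15))   # 70000 88000 115000
```

Keeping `alpha` and `beta` as arguments means the same function can describe Company A by calling `predict_salary(years, alpha = 50000, beta = 5000)`.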

Predict salaries for these employment durations:

Good work!

Next we will extend our linear predictor to form a full linear statistical model.
