Glossary

Statistical model

A probabilistic description of how a response relates to predictors, combining a systematic component with random variation.

Common form

\[Y = f(X) + \varepsilon.\]

In R

# Linear regression (Gaussian errors)
m_lm <- lm(dist ~ speed, data = cars)

# Generalised linear model (e.g., logistic regression)
m_glm <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())

# Inspect / use fitted models
summary(m_lm)
coef(m_lm)
predict(m_lm, newdata = data.frame(speed = c(10, 20)))

Random variable

A quantity whose value varies across observations, described by a probability distribution.

In R

# Simulate random variables (make results reproducible)
set.seed(1)
x_norm <- rnorm(100, mean = 0, sd = 1)
x_unif <- runif(100, min = 0, max = 1)
x_binom <- rbinom(100, size = 1, prob = 0.3)

Distribution

A description of how the values of a random variable are spread across possible outcomes.

In R

# Many distributions use the d/p/q/r pattern:
# d* = density, p* = CDF, q* = quantile, r* = simulation
dnorm(0, mean = 0, sd = 1)
pnorm(0, mean = 0, sd = 1)
qnorm(0.975, mean = 0, sd = 1)
rnorm(5, mean = 0, sd = 1)

Normal distribution

A symmetric, bell-shaped distribution often used to model random variation.

Formula

\[X \sim N(\mu, \sigma^2).\]

In R

# Normal N(mu, sigma^2): use mean = mu and sd = sigma
mu <- 10
sigma <- 2

dnorm(10, mean = mu, sd = sigma)
pnorm(12, mean = mu, sd = sigma)
qnorm(0.95, mean = mu, sd = sigma)

set.seed(1)
rnorm(5, mean = mu, sd = sigma)

Expected value

The long-run average value of a random variable; the mean of its distribution.

Formula

  • Discrete case: \[E[X] = \sum_x x\,P(X=x).\]
  • Continuous case: \[E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.\]

In this course

  • In regression, \(E[Y\mid X]\) is the mean response at given predictors.

In R

set.seed(1)
x <- rnorm(100000, mean = 10, sd = 2)
mean(x)  # approximates E[X]

Variance

A measure of spread; larger variance means values tend to be further from their mean.

Formula

\[\mathrm{Var}(X) = E[(X - E[X])^2].\]

In R

# Sample variance (uses n - 1 in the denominator)
x <- c(1, 2, 3, 4, 5)
var(x)

# sd(x)^2 gives the same quantity
sd(x)^2

Assumption

A condition we adopt to justify a model or an inference procedure.

In this course

  • For linear regression, key assumptions are about the error term (mean zero, constant variance, normality, independence).
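A minimal sketch of checking these assumptions with R's built-in diagnostic plots (using the `cars` data, as elsewhere in this glossary):

```r
# Fit a simple model, then draw the standard diagnostic plots
fit <- lm(dist ~ speed, data = cars)

# By default plot() on an lm object shows: residuals vs fitted (mean zero,
# constant variance), Q-Q plot (normality), scale-location (constant
# variance), and residuals vs leverage (influential points)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```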

Response variable

The variable you aim to explain or predict (sometimes called the outcome or dependent variable).

In R

# In a formula like y ~ x1 + x2, the response is on the left-hand side
fit <- lm(dist ~ speed, data = cars)  # response = dist

# The response used by a fitted model can be recovered like this
y <- model.response(model.frame(fit))
head(y)

Predictor variable

A variable used to explain, adjust for, or predict changes in the response (sometimes called a covariate).

In R

# In a formula like y ~ x1 + x2, predictors are on the right-hand side
fit <- lm(dist ~ speed, data = cars)  # predictor = speed

# The design matrix contains the (coded) predictors used by the model
X <- model.matrix(fit)
head(X)

Continuous variable

A numeric variable that can (in principle) take any value on an interval.

In R

# Continuous variables are usually stored as numeric vectors
is.numeric(cars$speed)

# Quick numeric summaries
summary(cars$speed)

Linear function

A straight-line relationship between a variable and an outcome.

Formula

\[f(x) = \alpha + \beta x.\]

In R

# A straight-line function
alpha <- 2
beta <- 0.5
f <- function(x) alpha + beta * x

f(x = c(0, 1, 2))

Linear predictor

The systematic (non-random) part of a regression model that combines predictors and coefficients.

Formula

\[E[Y\mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Fitted values for the training data
fitted(fit)[1:5]

# Predictions for new cases
predict(fit, newdata = data.frame(speed = c(10, 20)))

Intercept

The baseline level of the response when predictors are at their reference values (often zero).

In R

# Included by default
fit_with_intercept <- lm(dist ~ speed, data = cars)

# Remove the intercept (forces line through the origin)
fit_no_intercept <- lm(dist ~ speed - 1, data = cars)

coef(fit_with_intercept)
coef(fit_no_intercept)

Slope

The expected change in the response for a one-unit increase in a predictor, holding other predictors constant.

In R

fit <- lm(dist ~ speed, data = cars)

# The slope is the coefficient on the predictor
slope_speed <- coef(fit)["speed"]
slope_speed

Random error term

The part of the response not explained by the predictors (random variation around the systematic part).

In a model

  • Often written as \(\varepsilon\) in \(Y = f(X) + \varepsilon\).
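The error term is never observed directly, but a small simulation (the line \(2 + 0.5x\) is chosen arbitrarily, for illustration only) shows its role: the residuals of the fitted model approximate the unobserved errors.

```r
# Simulate Y = f(X) + eps with f(x) = 2 + 0.5 x
set.seed(1)
x <- runif(100, min = 0, max = 10)
eps <- rnorm(100, mean = 0, sd = 1)   # the (unobserved) random error term
y <- 2 + 0.5 * x + eps

# Residuals from the fitted model estimate the errors
fit <- lm(y ~ x)
head(cbind(error = eps, residual = resid(fit)))
```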

Mean-zero errors

The assumption that errors average to zero, so the model is unbiased on average.

Formula

\[E[\varepsilon \mid X] = 0.\]
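One caveat when checking this in R: for OLS with an intercept, the residuals sum to exactly zero by construction, so `mean(resid(fit))` confirms the arithmetic rather than the assumption. A systematic trend in residuals versus fitted values is the more useful symptom of a violation.

```r
fit <- lm(dist ~ speed, data = cars)

# Essentially zero by construction (OLS with an intercept)
mean(resid(fit))

# More informative: look for a systematic trend around zero
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)
```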

Homoscedasticity

The assumption that the variability of the errors is roughly constant across the predictor range.

Formula

\[\mathrm{Var}(\varepsilon \mid X) = \sigma^2.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Residuals vs fitted values (look for constant spread)
plot(fit, which = 1)

# Scale-location plot (another constant-variance check)
plot(fit, which = 3)

Remedies (when violated)

  • Transform the response (e.g., log / square-root), or model the mean-variance relationship.
  • Use heteroscedasticity-robust standard errors for inference (where appropriate).

Heteroscedasticity

Nonconstant error variance across predictor values (the opposite of homoscedasticity).

Idea

  • \(\mathrm{Var}(\varepsilon \mid X)\) changes with \(X\) (often increasing with the mean).

In R

fit <- lm(dist ~ speed, data = cars)

# Look for a funnel pattern in residuals vs fitted values
plot(fit, which = 1)

# Optional formal test (requires lmtest):
# lmtest::bptest(fit)

Remedies

  • Transform the response (log / square-root), especially for positive outcomes.
  • Consider weighted least squares if you can model the changing variance.
  • For inference, consider robust standard errors (e.g., sandwich + lmtest), but still inspect the fit.

Normal errors

The assumption that errors are approximately normally distributed (mainly important for small-sample inference).

Formula

\[\varepsilon \sim N(0, \sigma^2).\]

In R

fit <- lm(dist ~ speed, data = cars)

# Q-Q plot of residuals
plot(fit, which = 2)

# Alternative base R approach:
# qqnorm(resid(fit)); qqline(resid(fit))

Independence

The assumption that errors from different observations are not correlated.

In R

fit <- lm(dist ~ speed, data = cars)

# For time-ordered data: inspect autocorrelation
acf(resid(fit))

# Optional Durbin–Watson test (requires lmtest):
# lmtest::dwtest(fit)

Estimation

The process of using data to infer unknown model parameters (like regression coefficients and error variability).

In R

# Fit a linear regression model (estimates coefficients from data)
fit <- lm(dist ~ speed, data = cars)

# Extract the estimated coefficients
coef(fit)

Estimate

A value computed from data that approximates an unknown population quantity (a parameter).

Notation

  • Parameters are often written with Greek letters (e.g., \(\beta\)), and estimates with hats (e.g., \(\hat{\beta}\)).

In R

fit <- lm(dist ~ speed, data = cars)

# The estimated coefficients (beta-hats)
beta_hat <- coef(fit)
beta_hat

Ordinary least squares

Estimation method that chooses coefficients to minimise the sum of squared residuals.

Criterion

\[\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2.\]

In R

fit <- lm(dist ~ speed, data = cars)

# lm() uses OLS (it chooses coefficients that minimise RSS)
rss <- sum(resid(fit)^2)
rss

Residual sum of squares

The total squared discrepancy between observed values and fitted values.

Formula

\[\text{RSS} = \sum_{i=1}^n e_i^2.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Compute RSS directly
sum(resid(fit)^2)

# For lm objects, deviance() equals RSS
deviance(fit)

Notes

  • Notation varies: many texts use SSE for this quantity; SSR is sometimes reserved for the regression sum of squares.

Total sum of squares

The total variability in the response around its mean (a baseline benchmark for model fit).

Formula

\[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2.\]

In R

y <- cars$dist

# Total variability around the mean
sum((y - mean(y))^2)

Mean squared error

An estimate of error variance based on residual size (often RSS divided by residual degrees of freedom).

Formula

\[\text{MSE} = \frac{\text{RSS}}{n - p},\]

where \(p\) is the number of estimated parameters (including the intercept).

In R

fit <- lm(dist ~ speed, data = cars)

# MSE = RSS / residual_df
rss <- sum(resid(fit)^2)
mse <- rss / df.residual(fit)
mse

# For lm, sigma(fit)^2 is the same value
sigma(fit)^2

Residual standard error

An estimate of the typical size of residuals (the estimated error standard deviation).

Formula

\[\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p}} = \sqrt{\text{MSE}},\]

where \(p\) is the number of fitted parameters (including the intercept).

In R

fit <- lm(dist ~ speed, data = cars)

# Residual standard error (estimated sigma)
sigma(fit)

Degrees of freedom

The amount of independent information remaining after estimating model parameters.

Examples

  • In simple linear regression (intercept + slope): residual df is \(n - 2\).
  • More generally (linear regression): residual df is \(n - p\).

In R

fit <- lm(dist ~ speed, data = cars)

# Residual degrees of freedom
df.residual(fit)

Standard error

An estimate of how much a statistic (like a coefficient) would vary across repeated samples.

Formula (linear regression)

\[\mathrm{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2\,(X^\top X)^{-1}_{jj}}.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Coefficient standard errors
summary(fit)$coefficients[, "Std. Error"]

# Covariance matrix of coefficient estimates
vcov(fit)

t-test

Tests whether a coefficient differs from zero by comparing a t statistic with its reference t distribution.

Test statistic

\[t = \frac{\hat{\beta}_j - 0}{\mathrm{SE}(\hat{\beta}_j)}.\]

In R

fit <- lm(dist ~ speed, data = cars)

# t values and p-values for coefficients
summary(fit)$coefficients[, c("t value", "Pr(>|t|)")]

F-test

Tests whether predictors jointly improve model fit compared with a simpler model (often intercept-only).

Common form (overall regression test)

\[F = \frac{(\text{TSS} - \text{RSS})/(p - 1)}{\text{RSS}/(n - p)},\]

where \(p\) is the number of fitted parameters (including the intercept).

In R

fit_small <- lm(dist ~ 1, data = cars)     # intercept-only
fit_big <- lm(dist ~ speed, data = cars)   # add predictor

# Overall F-test appears in summary() for lm
summary(fit_big)

# Compare nested models with anova()
anova(fit_small, fit_big)

p-value

The probability, under the null hypothesis, of observing a result at least as extreme as the one observed.

In R

fit <- lm(dist ~ speed, data = cars)

# p-values are part of the coefficient table
summary(fit)$coefficients[, "Pr(>|t|)"]

Confidence interval

A range of plausible parameter values at a chosen confidence level (e.g., 95%).

Common form

\[\hat{\theta} \pm t^* \cdot \mathrm{SE}(\hat{\theta}).\]

In R

fit <- lm(dist ~ speed, data = cars)

# Confidence intervals for coefficients (default 95%)
confint(fit)

Prediction interval

An interval for a future observation at given predictors; wider than a confidence interval.

In R

fit <- lm(dist ~ speed, data = cars)

new <- data.frame(speed = c(10, 20))
predict(fit, newdata = new, interval = "prediction")

Fitted value

The model’s predicted mean response for an observation (given its predictor values).

Notation

  • Often written as \(\hat{y}_i\).

In R

fit <- lm(dist ~ speed, data = cars)

# Fitted values for the training data
head(fitted(fit))

# Fitted value (mean) for new cases
predict(fit, newdata = data.frame(speed = c(10, 20)))

Residual

The difference between an observed value and the value predicted by the fitted model.

Formula

\[e_i = y_i - \hat{y}_i.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Residuals for the training data
head(resid(fit))

Simple linear model

A model that relates a response to a single predictor using a straight-line mean function plus random error.

Formula

\[Y = \alpha + \beta x + \varepsilon.\]

In R

# Fit a simple linear model (one predictor)
fit <- lm(dist ~ speed, data = cars)
summary(fit)

Simple linear regression

Linear regression with exactly one predictor (a straight-line relationship on average).

Formula

\[E[Y\mid X] = \beta_0 + \beta_1 X.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Visualise the fitted line
plot(dist ~ speed, data = cars)
abline(fit, col = "blue")

Multiple linear regression

Linear regression with two or more predictors, allowing adjustment for multiple variables at once.

Formula

\[E[Y\mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k.\]

In R

# Two or more predictors on the right-hand side
fit_mlr <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(fit_mlr)

Partial regression coefficient

The effect of a predictor on the response after holding the other predictors constant.

Idea

  • The change in \(E[Y\mid X]\) for a one-unit increase in \(X_j\), with the other predictors held fixed.

In R

fit_mlr <- lm(mpg ~ wt + hp, data = mtcars)

# Each coefficient is a partial effect (holding the other predictors fixed)
coef(fit_mlr)

Pearson correlation

Measures linear association between two variables, ranging from -1 to 1.

Formula

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2\;\sum_{i=1}^n (y_i - \bar{y})^2}}.\]

In R

# Pearson correlation (default)
cor(mtcars$mpg, mtcars$wt)

# Test for correlation (with CI)
cor.test(mtcars$mpg, mtcars$wt)

R-squared

The proportion of response variability explained by the fitted model (on the training data).

Formula

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}.\]

In R

fit <- lm(dist ~ speed, data = cars)

# R-squared
summary(fit)$r.squared

Adjusted R-squared

R-squared penalised for the number of predictors to discourage unnecessary terms.

Formula

\[\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p},\]

where \(p\) is the number of fitted parameters (including the intercept).

In R

fit_mlr <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Adjusted R-squared
summary(fit_mlr)$adj.r.squared

Polynomial term

A transformed predictor (like x²) used to capture curvature while keeping the model linear in coefficients.

In R

# Add a squared term (still linear in coefficients)
fit_poly <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Or use orthogonal polynomials
fit_poly2 <- lm(mpg ~ poly(wt, degree = 2), data = mtcars)

Interaction

A model term that allows the effect of one predictor to depend on another predictor.

In R

# Interaction only:
fit_int_only <- lm(mpg ~ wt:hp, data = mtcars)

# Main effects + interaction:
fit_int <- lm(mpg ~ wt * hp, data = mtcars)

Overfitting

When a model captures noise in the training data and generalises poorly to new data.

In practice

  • Symptoms include overly optimistic fit on training data and poor performance on new data.
  • Prefer validation (train/test split or cross-validation) over “fit on everything and hope”.
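A train/test split makes overfitting visible. The sketch below (the degree-8 polynomial is deliberately over-flexible, chosen only for illustration) compares held-out prediction error for a simple and a flexible model:

```r
set.seed(1)

# Split cars into a training half and a test half
idx <- sample(nrow(cars), size = nrow(cars) / 2)
train <- cars[idx, ]
test <- cars[-idx, ]

# A simple model vs an over-flexible polynomial
fit_simple <- lm(dist ~ speed, data = train)
fit_flex <- lm(dist ~ poly(speed, 8), data = train)

# Mean squared prediction error on the held-out data
mspe <- function(fit, data) mean((data$dist - predict(fit, newdata = data))^2)
c(simple = mspe(fit_simple, test), flexible = mspe(fit_flex, test))
```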

Stepwise regression

An automated selection approach that adds/removes predictors based on a criterion (often AIC).

Pitfalls

  • Can be unstable: small data changes may lead to different selected models.
  • Post-selection \(p\)-values and confidence intervals can be misleading.

In R

fit_full <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)

# Stepwise selection by AIC (automated; use with caution)
fit_step <- step(fit_full, trace = 0)
formula(fit_step)

Akaike Information Criterion

Model comparison metric that balances fit and complexity; lower values indicate a preferred model among those compared.

Formula

\[\mathrm{AIC} = -2\log(L) + 2k,\]

where \(k\) is the number of fitted parameters and \(L\) is the maximised likelihood.

In R

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Lower AIC is preferred (among the models compared)
AIC(m1, m2)

Bayesian Information Criterion

Model comparison metric that penalises complexity more than AIC; lower values indicate a preferred model among those compared.

Formula

\[\mathrm{BIC} = -2\log(L) + k\log(n).\]

In R

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Lower BIC is preferred (stronger penalty for complexity)
BIC(m1, m2)

Parsimony

Choosing the simplest adequate model that answers the scientific question.

In practice

  • Prefer simpler models unless extra complexity meaningfully improves interpretation or prediction.

Outlier

An observation with an unusually large residual relative to the fitted model.

Diagnostics

  • Large standardised/studentised residuals suggest an outlying response value given the predictors.
  • Rules of thumb (context dependent): \(|r_i| > 2\) (flag), \(|r_i| > 3\) (strong flag).

In R

fit <- lm(dist ~ speed, data = cars)

# Standardised residuals and studentised residuals
head(rstandard(fit))
head(rstudent(fit))

Standardised residual

A residual scaled by its estimated standard deviation to highlight unusually large deviations.

Formula

\[r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Standardised residuals
rstandard(fit)[1:5]

# Studentised residuals (often used for outlier detection)
rstudent(fit)[1:5]

Influential point

An observation that substantially changes fitted coefficients or predictions when removed.

In R

fit <- lm(dist ~ speed, data = cars)

# Influence diagnostics
cooks.distance(fit)[1:5]
influence.measures(fit)

Leverage

A measure of how unusual a case’s predictor values are; high leverage can increase influence.

Formula

\[h_{ii} = x_i^\top (X^\top X)^{-1} x_i.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Hat-values are leverages (h_ii)
hatvalues(fit)[1:5]

Rule of thumb

  • High leverage often flagged by \(h_{ii} > 2p/n\) (or \(3p/n\)), where \(p\) is the number of parameters.
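The rule of thumb above can be applied directly:

```r
fit <- lm(dist ~ speed, data = cars)

# Flag cases with leverage above 2p/n
p <- length(coef(fit))   # parameters, including the intercept
n <- nrow(cars)
which(hatvalues(fit) > 2 * p / n)
```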

Cook’s distance

An influence measure combining residual size and leverage to flag points that strongly affect the fit.

Formula

\[D_i = \frac{e_i^2}{p\,\hat{\sigma}^2}\frac{h_{ii}}{(1 - h_{ii})^2}.\]

In R

fit <- lm(dist ~ speed, data = cars)

# Cook's distance (larger values indicate more influence)
cd <- cooks.distance(fit)
head(cd)

# Rule of thumb often used: 4/n
4 / nrow(cars)

Rule of thumb

  • Values above about \(4/n\) are often flagged for inspection (not an automatic deletion rule).

Extrapolation

Making predictions outside the observed predictor range, where the fitted relationship may not hold.

In R

# Compare new predictor values to the observed range
fit <- lm(dist ~ speed, data = cars)
range(cars$speed)

# speed = 50 lies far outside the data, so predicting there is extrapolation
new_speed <- data.frame(speed = c(5, 50))
predict(fit, newdata = new_speed, interval = "prediction")

Practical note

  • Uncertainty grows quickly as you move away from the data cloud (often reflected in higher leverage).

Multicollinearity

Strong correlation among predictors that inflates standard errors and destabilises coefficient estimates.

Diagnostics

  • Large standard errors and unstable coefficient signs/magnitudes.
  • High correlations among predictors; large VIFs.
  • Exact multicollinearity shows up as non-estimable coefficients.

In R

fit_mlr <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Correlations among predictors
cor(mtcars[, c("wt", "hp", "disp")])

# Exact aliasing (perfect collinearity)
alias(fit_mlr)

# Condition number (rough collinearity diagnostic)
kappa(model.matrix(fit_mlr))

Remedies

  • Remove or combine redundant predictors (guided by your scientific question).
  • Centering can help when you include interactions/polynomials (reduces induced collinearity), but does not “fix” collinearity in general.
  • If prediction is the goal, consider regularisation (ridge/lasso) or dimension reduction (e.g., principal components).

Variance inflation factor

A diagnostic that summarises how collinearity inflates the uncertainty of a coefficient estimate.

Formula

\[\text{VIF}_j = \frac{1}{1 - R_j^2},\]

where \(R_j^2\) comes from regressing predictor \(X_j\) on the other predictors.

In R

fit_mlr <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Optional (if you have the car package):
# car::vif(fit_mlr)

# Base R: regress each predictor on the others, then VIF = 1 / (1 - R^2)
predictors <- c("wt", "hp", "disp")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
vif

Box-Cox transformation

A family of power transformations used to stabilise variance or improve linearity for positive responses.

Formula

For \(\lambda \neq 0\):

\[g_\lambda(y) = \frac{y^\lambda - 1}{\lambda}.\]

For \(\lambda = 0\):

\[g_0(y) = \log(y).\]

In R

fit <- lm(dist ~ speed, data = cars)

# Box-Cox search (requires MASS; often installed with R)
# MASS::boxcox(fit)

# Common special case (lambda = 0): log-transform a positive response
fit_log <- lm(log(dist) ~ speed, data = cars)
summary(fit_log)

Factor

A categorical predictor with discrete levels; R stores these as factors.

In R

x <- c("low", "medium", "low", "high")
f <- factor(x)

levels(f)
table(f)

Dummy variable

An indicator (0/1) used to code categories in regression relative to a baseline.

In R

df <- data.frame(
  y = c(1, 2, 3, 4),
  group = factor(c("A", "A", "B", "B"))
)

# Factors expand to dummy/indicator columns in the design matrix
model.matrix(y ~ group, data = df)

Reference level

The baseline category used to interpret coefficients for a factor.

In R

g <- factor(c("control", "treatment", "control"))
levels(g)

# Set the reference (baseline) level
g2 <- relevel(g, ref = "control")
levels(g2)

Contrast

A coding scheme that maps factor levels to numeric columns (e.g., treatment coding).

In R

g <- factor(c("control", "treatment", "treatment"))

# View or set contrasts (coding)
contrasts(g) <- contr.treatment(nlevels(g))
contrasts(g)