1  Introduction to Logistic Regression

1.1 Motivation: modelling binary outcomes

In the Linear Regression Short Course (LRSC), we used a linear model to predict a variable Y from one or more predictors X. Something like this:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \epsilon \]

In the cases we considered, Y was a continuous variable (e.g., salary, NAPLAN score, etc.), and the model predicted values of Y that could take on any real number.

In logistic regression, the outcome is not continuous. Instead, it is typically a discrete categorical variable. In particular, logistic regression is used to model a binary variable, i.e. one which takes on only two possible values (e.g., success/failure, yes/no, 0/1).¹

Example 1.1: A binary outcome: pass or fail

Johno is a student studying for an important exam. He wants to understand how his study time affects his chances of passing the exam. To this end, he collects data on previous students’ study times and whether they passed or failed the exam. Here is a sample of the data Johno collected:

# A tibble: 395 × 2
   studytime pass_fail
       <dbl> <chr>    
 1         2 fail     
 2         2 fail     
 3         2 pass     
 4         3 pass     
 5         2 pass     
 6         2 pass     
 7         2 pass     
 8         2 fail     
 9         2 pass     
10         2 pass     
# ℹ 385 more rows

Johno wants to use this data to model the relationship between study time and the probability of passing the exam. He identifies his outcome variable, \(Y\), as the binary variable indicating whether a student passed or failed the exam, and his predictor variable, \(X\), as the amount of time spent studying. Therefore, Johno’s goal is to produce a statistical model that predicts Y (pass/fail) based on X (study time): \[ Y = f(X) + \epsilon \]

1.1.1 From categories to discrete numbers

As with dummy variables for categorical predictors in linear regression, we often recode a binary outcome into numbers so we can work with it mathematically. Conventionally we use \(Y=1\) for the event of interest (e.g., “pass”) and \(Y=0\) otherwise—this coding doesn’t make the outcome “really numeric”, it’s just a convenient label that lets us model probabilities.

1.1.2 From modelling categories to modelling probabilities

When the outcome is binary (pass/fail, yes/no, 0/1), it can be tempting to treat the problem as “predict the category”. But in most real settings the category is not deterministic given the predictors: even for the same study time, two students might differ in prior knowledge, sleep, assessment difficulty, and so on. So rather than trying to claim certainty (“this student will pass”), a statistical model aims to quantify how likely the event is.

For a binary outcome, it is useful to code the outcome as \[ Y = \begin{cases} 1 & \text{event occurs (e.g., pass)} \\ 0 & \text{event does not occur (e.g., fail)} \end{cases} \] and model the probability of the event given the predictors: \[ p(x) = P(Y = 1 \mid X = x). \] This is a shift from predicting a label (two discrete options) to modelling a probability (a continuum between 0 and 1). Once we have a probability model, “predicting a category” is just a decision rule on top of it (e.g., predict “pass” if \(p(x) \ge 0.5\)), and that cutoff can change depending on the costs of false positives/negatives.
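To make the decision-rule idea concrete, here is a small R sketch. The probabilities are made up for illustration; they do not come from any model in the text.

```r
# Made-up fitted probabilities for four hypothetical students
p_hat <- c(0.30, 0.48, 0.55, 0.90)

# Default decision rule: predict "pass" when p(x) >= 0.5
ifelse(p_hat >= 0.5, "pass", "fail")

# A stricter cutoff, e.g. if falsely predicting "pass" is costly
ifelse(p_hat >= 0.7, "pass", "fail")
```

The probability model stays the same; only the cutoff changes.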

One reason linear regression seems to “almost work” here is that when \(Y\) is coded as 0/1, \[ E(Y \mid X = x) = P(Y = 1 \mid X = x) = p(x), \] so modelling the mean looks like modelling a probability. The issue is that an ordinary linear model does not respect the special nature of probabilities (they must stay between 0 and 1), which motivates logistic regression.
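A quick numeric check of this identity (the 0/1 values below are made up): the sample mean of a 0/1 variable is simply the proportion of 1s, i.e. an estimate of the probability of the event.

```r
# Made-up 0/1 outcomes for ten students
y <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)

mean(y)                  # sample mean: 7/10 = 0.7
sum(y == 1) / length(y)  # proportion of 1s: the same number
```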

Example 1.2: When linear regression is not appropriate

Suppose we want to predict whether a student passes or fails a course based on their study hours and have obtained a dataset like this:

# A tibble: 395 × 2
   studytime pass_fail
       <dbl> <chr>    
 1         2 fail     
 2         2 fail     
 3         2 pass     
 4         3 pass     
 5         2 pass     
 6         2 pass     
 7         2 pass     
 8         2 fail     
 9         2 pass     
10         2 pass     
# ℹ 385 more rows

Let’s follow the steps we might take if we tried to use linear regression here. First, we convert the categorical outcome into a numeric 0/1 variable (remember: the mean of a 0/1 variable is a probability). As in previous courses, we can use 1 for “pass” and 0 for “fail”:
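The recoding step itself is not shown above; a minimal base-R sketch might look like the following (the data frame and column names here are assumptions, not from the text).

```r
# Hypothetical stand-in for the first few rows of the dataset
exam_data <- data.frame(
  studytime = c(2, 2, 2, 3),
  pass_fail = c("fail", "fail", "pass", "pass")
)

# Recode the outcome: 1 = "pass", 0 = "fail"
exam_data$pass_numeric <- as.integer(exam_data$pass_fail == "pass")
exam_data
```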

Let’s plot the data in a scatter plot with a fitted linear regression line:

(Note the use of geom_jitter() to add some random noise to the points so they don’t overlap exactly.)
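For reference, here is a self-contained ggplot2 sketch of this kind of plot. The data and object names are made up; the course’s actual plotting code is not shown in the text.

```r
library(ggplot2)

# Made-up stand-in data: 0 = fail, 1 = pass
plot_data <- data.frame(
  studytime    = c(1, 2, 2, 3, 3, 4),
  pass_numeric = c(0, 0, 1, 0, 1, 1)
)

ggplot(plot_data, aes(x = studytime, y = pass_numeric)) +
  geom_jitter(width = 0.1, height = 0.05) +  # jitter so overlapping points separate
  geom_smooth(method = "lm", se = FALSE) +   # fitted linear regression line
  labs(x = "Study time", y = "Outcome (0 = fail, 1 = pass)")
```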

How do we interpret the fitted line? Notice that for any of the study times in our data, the predicted value from the linear regression line is between 0 (fail) and 1 (pass).

        1         2         3         4 
0.6694035 0.7112321 0.6275749 0.7530607 

This makes sense since the ‘highest score’ of any student in our dataset is 1 (i.e. a pass), and the ‘lowest score’ is 0 (a fail). No student has a predicted score of exactly 0 or 1, but the predicted values are between these two extremes. We might reasonably interpret the predicted value as the probability of passing the course given the study time (i.e. the coefficient for study time indicates how much the probability of passing increases for each additional hour of study time).

For example, a student who studies for 2 hours has a predicted probability of passing of about 0.6694035, while a student who studies for 4 hours has a predicted probability of passing of about 0.7530607.

However, the linear regression line can produce predicted values outside this range (less than 0 or greater than 1), which does not make sense for probabilities.

For example, a student who studies for 10 hours has a predicted probability of passing of about 1.0040323, which is greater than 1 and therefore not a valid probability. (A student who studies for 0 hours has a predicted probability of about 0.5857463, which happens to be valid, but nothing in a linear model prevents predictions below 0 either.)
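We can read the fitted line’s coefficients off the predictions quoted above (intercept ≈ 0.5857, slope ≈ 0.0418 per unit of study time) and check the extrapolation problem directly:

```r
# Coefficients implied by the predictions quoted in the text
intercept <- 0.5857463
slope     <- 0.0418286

line_pred <- function(hours) intercept + slope * hours

line_pred(2)   # about 0.669: a sensible probability
line_pred(10)  # about 1.004: greater than 1, impossible for a probability
```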

1.1.3 Examples of binary outcomes

Binary outcomes show up in many applied settings:

  • Disease status (yes/no)
  • Admission (accepted/rejected)
  • Survival (alive/dead)
  • Purchase (bought/did not buy)

1.1.4 Why linear regression fails for binary outcomes

With a binary outcome, the relationship between predictors and probability is typically nonlinear: the “same” change in a predictor cannot produce the same change in probability at every starting probability (e.g., you can’t increase a 0.95 probability by +0.2).

Logistic regression fixes this by modeling a linear relationship on the log-odds (logit) scale, then converting back to the probability scale.
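Concretely, for a single predictor the model is linear on the log-odds scale,

\[ \log\!\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x, \]

which, solving for \(p(x)\), gives

\[ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}. \]

The second form shows why predictions cannot escape \((0, 1)\): since \(e^{-(\beta_0 + \beta_1 x)} > 0\), the denominator is always greater than 1, so the right-hand side lies strictly between 0 and 1.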

1.1.5 What logistic regression provides

Logistic regression models the probability of an event while ensuring predictions stay between 0 and 1. It also supports effect interpretation via odds ratios and provides likelihood-based tools for inference and model assessment.
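As a preview, a model like this can be fitted in R with glm() and family = binomial. The data below are simulated purely for illustration; none of these numbers come from the text.

```r
# Simulate data where passing becomes more likely with study time
set.seed(1)
studytime <- rep(1:8, each = 10)
pass <- rbinom(length(studytime), size = 1, prob = plogis(-2 + 0.5 * studytime))
sim_data <- data.frame(studytime, pass)

# Fit: log-odds of passing modelled as linear in study time
fit <- glm(pass ~ studytime, data = sim_data, family = binomial)

# Predicted probabilities stay within (0, 1),
# even for study times outside the observed range
predict(fit, newdata = data.frame(studytime = c(0, 10, 20)), type = "response")
```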

1.2 Generalised Linear Models


  1. Types of random variables are covered in the Introduction to Statistical Inference Short Course, so look there if you need a refresher. Moreover, categorical predictors (factors) were used in the Linear Regression Short Course.↩︎