5 Regression Case Study: NAPLAN Reading Scores
In Chapters 00–04 we built up the pieces of a regression analysis: what a linear model is (Chapter 00), how to fit and interpret regression models (Chapters 01–02), how to build and compare candidate models (Chapter 03), and how to diagnose problems (Chapter 04). In this final chapter, we put those pieces together in an end-to-end analysis.
Our workflow follows the model-building approach in Section 3.3:
- Define the research question and candidate variables.
- Explore the data (plots + summaries).
- Fit a reasonable starting model.
- Refine using model comparisons (without “blind selection”).
- Check diagnostics and interpret results.
- Write up the findings in plain language.
This chapter is designed as a hands-on walkthrough. The webR chunks are the analysis: run each one as you read. Only use “Skip exercise” if you’ve already completed that step (e.g. in your IDE) and just want to keep moving.
5.1 Defining the research question
This case study uses naplan_reading.csv (in this project: Data/naplan_reading.csv), which contains NAPLAN reading achievement data for 3,000 Australian students across Years 3, 5, 7, and 9, drawn from 60 schools (school_id). Each row is one student.
The outcome variable is naplan_reading_score. Because scores are on different ranges across grades (roughly 100–600 in Year 3 vs 400–900 in Year 9), we include grade as a covariate throughout.
Alongside the outcome, the dataset includes student and family predictors such as weekly reading time at home (in minutes), parent education, school type, gender, a socioeconomic status index, birth month, and number of siblings.
Research question: How are students’ socioeconomic circumstances, reading practice at home, and demographic characteristics associated with NAPLAN Reading scores?
In this dataset:
- The outcome is naplan_reading_score.
- Candidate predictors include:
  - reading practice: reading_time_home (minutes per week),
  - socioeconomic measures: ses_index and parent_education,
  - demographics: gender, birth_months, n_siblings,
  - schooling context: school_type,
  - schooling level: grade (Year 3/5/7/9, included as a covariate).
Note
This is an observational dataset. Regression estimates here describe associations, not “effects”. For reminders about modelling assumptions, see Section 4.1.1.
Note
Students are clustered within schools (school_id). If there are unmeasured school-level factors (e.g. teaching practices, resources, peer effects), then students from the same school can end up with correlated errors, which violates the “independent errors” assumption used for standard t/F tests in a plain lm().
In this chapter we stay within ordinary least squares (no random effects). We partly address school-to-school differences by including observed context variables such as school_type, along with strong student-level covariates like grade and socioeconomic proxies. These predictors can explain some between-school variation and reduce residual clustering, but they do not guarantee independence.
A more complete approach would use one of:
- lm(... + factor(school_id)) (school fixed effects): controls for all stable between-school differences, but uses many parameters and you generally can’t estimate purely school-level predictors (like school_type) at the same time,
- a multilevel model with a random effect for school_id, or
- cluster-robust standard errors.
These options are beyond the scope of this short course, but are worth knowing about if you analyse clustered education data.
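As a sketch of what two of these alternatives look like in code (assuming the naplan data frame from Section 5.2; the sandwich and lmtest packages are an assumption and are not used elsewhere in this course):

```r
# Sketch only: these tools are beyond the scope of this course.
# Assumes the naplan data frame prepared in Section 5.2.

# (a) School fixed effects: absorbs all stable between-school differences.
#     Note: school_type is constant within a school, so it cannot be
#     estimated alongside the school dummies.
mod_fe <- lm(naplan_reading_score ~ grade + ses_index + reading_time_home +
               gender + factor(school_id), data = naplan)

# (b) Cluster-robust standard errors: keep the OLS fit, but widen the SEs
#     to allow within-school correlation (needs sandwich + lmtest).
mod_ols <- lm(naplan_reading_score ~ grade + ses_index + reading_time_home +
                gender + school_type, data = naplan)
if (requireNamespace("sandwich", quietly = TRUE) &&
    requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::coeftest(
    mod_ols,
    vcov. = sandwich::vcovCL(mod_ols, cluster = naplan$school_id)
  ))
}
```

In (a), the coefficients of student-level predictors are estimated from within-school variation only; in (b), the coefficients are unchanged and only the uncertainty estimates are adjusted.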
5.2 Importing and preparing the data
We start by importing the dataset and setting factor levels so coefficient interpretations are clear (see Section 2.5 and Section 2.5.4).
Exercise 5.1
Run the chunk below to import and prepare the data. The glimpse() output shows all available variables, and the small dictionary table summarises what each variable represents.
When you’re done, click Check to continue. If you’ve already done this step (e.g. in your IDE), click Skip exercise instead.
naplan <- read_csv("Data/naplan_reading.csv", show_col_types = FALSE) |>
mutate(
grade = factor(grade, levels = c("Year 3", "Year 5", "Year 7", "Year 9")),
parent_education = factor(
parent_education,
levels = c(
"Year 10 or below",
"Year 12",
"Certificate/Diploma",
"Bachelor degree",
"Postgraduate"
)
),
school_type = factor(school_type, levels = c("Government", "Catholic", "Independent")),
gender = factor(gender, levels = c("Female", "Male")),
birth_months = factor(
birth_months,
levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
)
)
glimpse(naplan)
naplan_dictionary <- tribble(
~variable, ~description,
"student_id", "Student identifier",
"school_id", "School identifier (60 schools)",
"grade", "Year level (Year 3/5/7/9)",
"reading_time_home", "Minutes of reading at home per week",
"parent_education", "Parents' highest education level",
"school_type", "School sector (Government/Catholic/Independent)",
"gender", "Student gender",
"birth_months", "Birth month (Jan–Dec)",
"n_siblings", "Number of siblings",
"ses_index", "Socioeconomic status index (continuous)",
"naplan_reading_score", "NAPLAN reading score (outcome)"
)
naplan_dictionary
naplan |>
select(
student_id, school_id, grade, naplan_reading_score, ses_index, reading_time_home,
parent_education, school_type, gender, birth_months, n_siblings
) |>
slice_head(n = 8)
naplan |>
summarise(
n = n(),
min_score = min(naplan_reading_score),
max_score = max(naplan_reading_score),
mean_score = mean(naplan_reading_score),
sd_score = sd(naplan_reading_score),
n_schools = n_distinct(school_id)
)
naplan |>
count(grade) |>
arrange(grade)
Exercise 5.2
As a first quick EDA summary, create a table called score_by_grade with the mean NAPLAN reading score for each grade, sorted from highest to lowest.
When it looks right, click Check to continue (or Skip exercise if you did this in your IDE).
score_by_grade <- naplan |>
group_by(grade) |>
summarise(mean_score = mean(naplan_reading_score)) |>
arrange(desc(mean_score))
score_by_grade
5.3 Exploratory data analysis
Before we fit any model, we look for:
- obvious trends (linear vs curved),
- differences across groups (categorical predictors),
- potential multicollinearity (predictors that tell us almost the same thing),
- outliers or unusual values that might affect the model.
This is exactly the “EDA first” mindset in Section 3.3.1.
5.3.1 Clustering by school (students within schools)
Because students are sampled within schools, a quick EDA step is to check the cluster sizes (how many students per school) and whether school-average scores differ. Visible variation across schools is one sign that school context might matter, which in turn can lead to correlated errors within schools if that context is not fully captured by the predictors. Later, when we include school_type (and other predictors) in the regression, we are adjusting for observed school context, which can reduce—but not eliminate—between-school clustering in the residuals.
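A quick sketch of both checks (assuming the naplan data frame from Section 5.2; school means are computed within grade because the score scales differ across year levels):

```r
# Cluster sizes: how many students per school?
naplan |>
  count(school_id) |>
  summarise(n_schools = n(), min_size = min(n), max_size = max(n))

# Do school-average scores vary? Summarise the spread of school means
# within each grade.
naplan |>
  group_by(school_id, grade) |>
  summarise(school_mean = mean(naplan_reading_score), .groups = "drop") |>
  group_by(grade) |>
  summarise(sd_of_school_means = sd(school_mean))
```

A large standard deviation of school means (relative to the student-level spread) is the visible variation across schools mentioned above.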
5.3.2 Outcome distribution (by grade)
NAPLAN scores are on different ranges across grades (Year 3 scores are typically lower than Year 9 scores), so it helps to look grade-by-grade.
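A sketch of the grade-by-grade view (assuming the naplan data frame from Section 5.2; free x-axes because the score ranges differ across grades):

```r
# Score distributions, one panel per grade, with per-panel x scales
ggplot(naplan, aes(x = naplan_reading_score)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  facet_wrap(~ grade, scales = "free_x") +
  labs(x = "NAPLAN reading score", y = "Count")
```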
5.3.3 Key bivariate relationships
5.3.3.1 Reading score vs SES index
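A sketch of this plot (assuming the naplan data frame from Section 5.2; faceting by grade keeps the different score scales from dominating the picture):

```r
# Score vs SES index, per grade, with a linear smoother
ggplot(naplan, aes(x = ses_index, y = naplan_reading_score)) +
  geom_point(alpha = 0.25, size = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  facet_wrap(~ grade) +
  labs(x = "SES index", y = "NAPLAN reading score")
```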
5.3.3.2 Reading score vs time spent reading at home
Because reading_time_home is reported in minutes per week and has a lot of small values (including zeros), we should be open to nonlinearity here (Chapter 03: Section 3.1.2).
5.3.4 Categorical predictors
Boxplots make it easy to compare distributions across categories (see Section 2.5 and Section 2.5.2).
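A sketch for one categorical predictor (assuming the naplan data frame from Section 5.2; swap in school_type or gender to explore the others):

```r
# Score distributions by parent education, per grade
ggplot(naplan, aes(x = parent_education, y = naplan_reading_score)) +
  geom_boxplot() +
  facet_wrap(~ grade, scales = "free_y") +
  coord_flip() +  # long category labels read better horizontally
  labs(x = "Parents' highest education", y = "NAPLAN reading score")
```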
5.3.5 Correlations among continuous variables
We don’t have GGally::ggpairs() available in this project, but we can still compute a correlation matrix and look at a basic pairs plot (see Section 3.3.1.1).
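A sketch of both, using only base R and dplyr (assuming the naplan data frame from Section 5.2):

```r
# Correlation matrix for the continuous variables
naplan |>
  select(naplan_reading_score, ses_index, reading_time_home, n_siblings) |>
  cor() |>
  round(2)

# Basic base-R pairs plot of the same variables
naplan |>
  select(naplan_reading_score, ses_index, reading_time_home, n_siblings) |>
  pairs(cex = 0.4)
```

Look for off-diagonal correlations close to plus or minus 1 between predictors; those are the candidates for multicollinearity.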
Exercise 5.3
Recreate one of the key EDA plots: reading time at home vs NAPLAN reading score, with a quadratic smoother and separate panels for each grade.
After it runs, try experimenting (e.g., set se = FALSE, remove facet_wrap(~ grade), or change the smoother to linear).
p_readtime <- ggplot(naplan, aes(x = reading_time_home, y = naplan_reading_score)) +
geom_point(alpha = 0.25, size = 1) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), se = TRUE) +
facet_wrap(~ grade) +
labs(x = "Reading time at home (minutes/week)", y = "NAPLAN reading score")
p_readtime
5.4 Fitting the least-squares model
We now fit regression models using ordinary least squares (Chapter 01: Section 1.5). We start with a first-order additive model and then consider targeted increases in complexity where the EDA suggests they are useful (Chapter 03: Section 3.1).
5.4.1 A first-order screening model
This model includes all candidate predictors, including birth_months. Think of it as a starting point for refinement (not the final model).
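A sketch of the screening fit (the model name mod_full_first_order matches the comparison code in Exercise 5.5; glance() from broom is assumed to be loaded, as in Exercise 5.4):

```r
# First-order additive model with every candidate predictor (screening only)
mod_full_first_order <- lm(
  naplan_reading_score ~ grade + ses_index + reading_time_home + n_siblings +
    parent_education + school_type + gender + birth_months,
  data = naplan
)
glance(mod_full_first_order)  # one-row model summary: R^2, AIC, BIC, ...
```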
5.4.2 Model comparison: does birth month help?
We compare the model with and without birth_months using AIC/BIC (Chapter 03: Section 3.2.2.2) and an ANOVA comparison (Chapter 03: Section 3.2.2.4).
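A sketch of that comparison (model names match Exercise 5.5; mod_full_first_order is the screening model from Section 5.4.1):

```r
# Drop birth_months from the screening model
mod_no_birth <- update(mod_full_first_order, . ~ . - birth_months)

# Information criteria: lower is better
AIC(mod_no_birth, mod_full_first_order)
BIC(mod_no_birth, mod_full_first_order)

# Nested F-test: does birth_months improve fit?
anova(mod_no_birth, mod_full_first_order)
```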
The comparison suggests birth_months doesn’t meaningfully improve the model here, so we drop it.
Exercise 5.4
Fit the first-order “screening” model yourself, and store it as my_full_mod.
my_full_mod <- lm(
naplan_reading_score ~ grade + ses_index + reading_time_home + n_siblings +
parent_education + school_type + gender + birth_months,
data = naplan
)
glance(my_full_mod)
5.5 Model refinement and selection
Now we refine the model in a controlled way, guided by what we saw in the EDA and by the tools in Chapter 03.
Note
Automated methods like stepwise regression can be useful as screening tools, but can be misleading if used blindly (see Section 3.3.3 and Section 4.4.8). In this case study we prefer small, interpretable comparisons.
5.5.1 (Optional) Stepwise AIC screening
If you want to see what a purely automated AIC search does, we can run a stepwise procedure on a first-order model (no interactions or polynomial terms). Treat this as a starting point for thought, not a final answer.
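A minimal sketch of such a search, assuming the first-order screening model mod_full_first_order from Section 5.4.1:

```r
# Automated AIC search (both directions) on the first-order model.
# Treat the result as a screening suggestion, not a final answer.
mod_step <- step(mod_full_first_order, direction = "both", trace = FALSE)
formula(mod_step)  # which predictors survived the search?
```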
5.5.2 Add a quadratic term for reading time
The scatterplot suggested possible curvature in the relationship between reading_time_home and naplan_reading_score. A natural next step is a quadratic term (Chapter 03: Section 3.1.2).
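A sketch of that step (the exact covariate set here is an assumption; the nested-comparison logic is the point):

```r
# Linear-only baseline
mod_lin <- lm(
  naplan_reading_score ~ grade + reading_time_home +
    parent_education + school_type + gender,
  data = naplan
)

# Replace the linear term with a raw quadratic (which nests the linear model)
mod_quad <- update(mod_lin, . ~ . - reading_time_home +
                     poly(reading_time_home, 2, raw = TRUE))

anova(mod_lin, mod_quad)  # F-test for the added curvature
```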
5.5.3 Allow the reading-time curve to differ by grade (interaction)
Because grade is strongly related to score, it’s plausible the reading-time pattern differs across grades. This is a continuous × categorical interaction (Chapter 03: Section 3.1.4.1).
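A sketch of the interaction model and a nested comparison (again, the surrounding covariate set is an assumption):

```r
# Common quadratic curve for all grades
mod_quad <- lm(
  naplan_reading_score ~ grade + poly(reading_time_home, 2, raw = TRUE) +
    parent_education + school_type + gender,
  data = naplan
)

# Grade-specific quadratic curves (continuous x categorical interaction)
mod_inter <- lm(
  naplan_reading_score ~ grade * poly(reading_time_home, 2, raw = TRUE) +
    parent_education + school_type + gender,
  data = naplan
)

anova(mod_quad, mod_inter)  # do the reading-time curves differ by grade?
```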
5.5.4 Two competing “SES” choices: SES index vs parent education
The dataset contains two SES-related variables:
- ses_index (a numeric socioeconomic index),
- parent_education (a categorical proxy related to SES).
Including both can be redundant: they are different ways of measuring similar underlying circumstances. This is a good moment to practise the “partial effects” idea from Section 2.3, and to remember the warning about multicollinearity in Section 4.4.7.
We fit two versions:
- mod_ses_index: includes ses_index (but not parent_education),
- final_mod: includes parent_education (but not ses_index).
In this dataset, the parent-education version fits better by AIC/BIC, so we use it as our final model for reporting. But we still report what the SES index suggests later, as a sensitivity check.
Does the SES index add anything once parent education is already in the model?
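A sketch of that check. The exact final formula is an assumption here (it is consistent with the F-test degrees of freedom reported in Section 5.7.1); the model name final_mod matches Exercise 5.7:

```r
# Final (parent-education) model: grade-specific quadratic reading-time
# curves plus parent education, school type, and gender
final_mod <- lm(
  naplan_reading_score ~ grade * poly(reading_time_home, 2, raw = TRUE) +
    parent_education + school_type + gender,
  data = naplan
)

# Does ses_index add anything on top of parent education?
anova(final_mod, update(final_mod, . ~ . + ses_index))
```

A large p-value here would support keeping the simpler parent-education-only model.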
Exercise 5.5
Use an ANOVA comparison to test whether adding birth_months improves the first-order model. Store the p-value in p_birth.
birth_cmp <- anova(mod_no_birth, mod_full_first_order)
p_birth <- birth_cmp$`Pr(>F)`[2]
p_birth
5.6 Diagnostics (does the model look reasonable?)
As emphasized in Chapter 04, we always re-check residual diagnostics on the final model (see Section 4.3 and Section 4.4).
5.6.1 A small set of diagnostic plots
The residual-vs-fitted plot is the main place to look for nonlinearity and changing variance (Chapter 04: Section 4.4.1 and Section 4.4.2). In this case the pattern looks broadly centred around 0, with at most mild changes in spread.
5.6.2 Influence and unusual points
These points are worth checking, but the Cook’s distances are not extreme. This is the sort of “small number of influential points” situation discussed in Section 4.2.1.
5.6.3 Multicollinearity check (VIF) on a simpler model
Because VIFs can be inflated by raw polynomials and interactions, we compute VIFs on the simpler first-order model (no quadratic or interaction terms) to get a clearer sense of redundancy among predictors (Chapter 04: Section 4.4.7).
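If the car package were available, car::vif(mod_no_birth) would do this directly. Since it may not be in this webR environment, here is the hand computation for one predictor, using the definition VIF = 1/(1 − R²) where R² comes from regressing that predictor on all the others (assumes the mod_no_birth predictor set from Section 5.5):

```r
# VIF for ses_index within the first-order model (no birth_months):
# regress ses_index on the other predictors, then take 1 / (1 - R^2)
r2_ses <- summary(lm(
  ses_index ~ grade + reading_time_home + n_siblings +
    parent_education + school_type + gender,
  data = naplan
))$r.squared

1 / (1 - r2_ses)  # values well above ~5 suggest problematic redundancy
```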
Exercise 5.6
Recreate the core diagnostic plot from Chapter 04: residuals vs fitted values.
p_resid <- ggplot(final_aug, aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "gray40") +
geom_point(alpha = 0.25, size = 1) +
labs(x = "Fitted value", y = "Residual")
p_resid
5.7 Reporting the results
We now summarise the final model in a way that a reader can understand (Chapter 02: interpreting coefficients, Section 2.3; and Chapter 01: confidence/prediction intervals, Section 1.5.2).
5.7.1 Model summary
We fitted a multiple linear regression model to 3000 students. The overall model fit was:
- Adjusted \(R^2 \approx\) 0.638
- Residual standard error \(\approx\) 56.48
- Overall \(F\)-test: \(F(18, 2981) = 294.7\), \(p < 10^{-16}\)
5.7.2 Interpreting key coefficients
For categorical predictors, coefficients are interpreted relative to their reference levels (Chapter 02: Section 2.5). In this model the references are:
- grade: Year 3
- parent_education: Year 10 or below
- school_type: Government
- gender: Female
From the table:
- Higher parent education is associated with higher reading scores (e.g., Bachelor vs Year 10 or below: about 54.8 points higher, holding other variables constant).
- Independent schools score higher than Government schools by about 44.5 points on average (Catholic vs Government is much smaller).
- Males score about 7.9 points lower than females, on average.
5.7.3 Interpreting the grade × reading-time pattern
Because the model includes an interaction between grade and a quadratic reading-time term, it’s often easiest to interpret with predicted curves rather than trying to “read” raw polynomial coefficients (Chapter 03: Section 3.1.4.1).
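A sketch of those predicted curves, using predict() on a grid (assumes final_mod as fitted in Section 5.5.4; the held-constant profile values are illustrative choices, not the only sensible ones):

```r
# Prediction grid: reading time varies, other predictors held fixed
grid <- expand.grid(
  grade = levels(naplan$grade),
  reading_time_home = seq(0, max(naplan$reading_time_home), length.out = 100),
  parent_education = "Bachelor degree",
  school_type = "Government",
  gender = "Female"
)
grid$pred <- predict(final_mod, newdata = grid)

# One fitted curve per grade
ggplot(grid, aes(x = reading_time_home, y = pred, colour = grade)) +
  geom_line(linewidth = 1) +
  labs(x = "Reading time at home (minutes/week)",
       y = "Predicted NAPLAN reading score", colour = "Grade")
```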
5.7.4 A concrete prediction (with a prediction interval)
Consider a Year 9 female student in a Government school, with parent education = Bachelor degree, who reports 30 minutes of reading at home per week. The fitted mean is:
5.7.5 Sensitivity check: what if we use the SES index instead?
If we swap parent_education for the numeric SES index, the SES index is strongly associated with score.
This illustrates an important modelling lesson: how you operationalise a construct like “SES” can change both interpretation and model fit. In this dataset, parent education seems to capture SES-related variation particularly well.
Exercise 5.7
Create a prediction interval for a specific student profile.
Use a Year 9 student with:
- reading_time_home = 30
- parent_education = "Bachelor degree"
- school_type = "Government"
- gender = "Female"
Store the result of predict(..., interval = "prediction") in pred_pi.
my_profile <- tibble(
grade = factor("Year 9", levels = levels(naplan$grade)),
reading_time_home = 30,
parent_education = factor("Bachelor degree", levels = levels(naplan$parent_education)),
school_type = factor("Government", levels = levels(naplan$school_type)),
gender = factor("Female", levels = levels(naplan$gender))
)
pred_pi <- predict(final_mod, newdata = my_profile, interval = "prediction")
pred_pi
5.8 Conclusion
Using 3000 students across Years 3, 5, 7, and 9, we fitted a multiple linear regression model relating NAPLAN reading scores to reading time at home, grade, parent education, school type, and gender. The model explained about 63.8% of the variation in scores (adjusted \(R^2 \approx\) 0.638), with a residual standard error of about 56.5 points.
After adjusting for other variables, higher parent education was associated with higher reading scores (e.g., Bachelor vs Year 10 or below: roughly 54.8 points higher). Independent schools scored higher than Government schools by about 44.5 points on average, and males scored about 7.9 points lower than females.
The relationship between reading time at home and score was not purely linear and differed by grade (a grade × quadratic reading-time interaction). Predicted curves suggested that increases in reading time are associated with higher predicted scores, but the shape of that association varies across schooling levels.
Limitations: These are associations from observational data. Potential clustering within schools and omitted variables (e.g., prior achievement, classroom factors) may affect estimates. Always interpret results in context and re-check diagnostics, as in Chapter 4.