Generalized additive models

Welcome

This is the course website for the two-day generalized additive modelling course put on by BioSS for UKCEH, in part as a preamble to this workshop.

The instructors for the course will be Dave Miller and Thomas Cornulier.

Pre-requisites

Computing

We recommend updating to the latest version of R from r-project.org, then running:

install.packages("mgcv")

to ensure that you have the latest version of mgcv.

The exercises use a variety of R packages; the following command should install everything you need:

install.packages(c("tidyverse", "patchwork", "gratia"))

We built the course using Quarto. An introductory article on using Quarto with RStudio is available here. If you know RMarkdown, then you know Quarto!

Course materials and schedule

The data used in the course are adapted from the COSMOS-UK site, with more information on the data available on the catalogue page. Our processing can be seen at this GitLab repo.

The processed data can be downloaded here.

Slides

Slides and a summary of each session are available below. Slides are provided in HTML format; for PDF, print to PDF from your browser (this seems to work best in Chrome).

Sources for the slides are available on the GitLab site.

Day 1

  • 0930-1300 What are GAMs? HTML slides
    • Why do we need GAMs?
      • Revision of generalized linear models, why they aren’t enough
      • Using quadratic etc terms in a GLM, why that’s a bad idea
    • Smoothing
      • Basis functions (Shiny app)
      • Penalties (effective degrees of freedom, basis size)
      • gam() basic operation
      • summary()
      • simple use of plot()
  • [LUNCH]
  • 1400-1700 Predictions and variance: HTML slides
    • Making predictions
      • What is a prediction?
      • Using predict()
    • Calculating variance
      • Where does variance come from?
      • Using predict(..., se=TRUE)
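As a quick taster of the Day 1 material, here is a minimal sketch of fitting a GAM with mgcv, inspecting it, and getting predictions with standard errors. It uses simulated data, not the course dataset:

```r
library(mgcv)

# simulate some noisy nonlinear data (not the course data!)
set.seed(1)
dat <- data.frame(x = seq(0, 1, length.out = 200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)

# fit a GAM with a thin plate regression spline smooth of x,
# selecting the smoothing parameter by REML
m <- gam(y ~ s(x), data = dat, method = "REML")

summary(m)  # term significance and effective degrees of freedom
plot(m)     # the estimated smooth, with a confidence band

# predictions with standard errors at new covariate values
newd <- data.frame(x = c(0.25, 0.5, 0.75))
pr <- predict(m, newdata = newd, se.fit = TRUE)
upper <- pr$fit + 1.96 * pr$se.fit  # approximate 95% upper limit
```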

Day 2

  • 0930-1200 Model checking and validation: HTML slides
    • residuals from GAMs
    • using gam.check()
    • AIC()
    • term selection using bs="ts"
  • 1300-1700 Multidimensional smoothing in space and time: HTML slides
    • cyclic smooths for temporal effects
    • multi-dimensional tprs, what is isotropy?
    • building space-time interactions using tensors
    • spatial smoothers that respect boundaries, soap film smoothing
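The Day 2 smoother types above can be sketched in a few lines. This is a hedged example on simulated data (the variable names `doy`, `lon`, `lat` are illustrative, not from the course dataset), showing a cyclic smooth for a within-year effect and a tensor product space-time interaction:

```r
library(mgcv)

# simulated daily observations with a seasonal cycle plus a spatial trend
set.seed(2)
n <- 500
dat <- data.frame(doy = sample(1:365, n, replace = TRUE),
                  lon = runif(n), lat = runif(n))
dat$y <- cos(2 * pi * dat$doy / 365) + dat$lon * dat$lat +
  rnorm(n, sd = 0.2)

# cyclic smooth (bs = "cc") for day of year: the smooth matches at the
# ends of the cycle; knots pin the cycle boundary at days 0 and 365
m1 <- gam(y ~ s(doy, bs = "cc") + s(lon, lat),
          data = dat, knots = list(doy = c(0, 365)))

# space-time interaction via a tensor product: te() allows different
# smoothness (and different units) for the spatial and temporal margins
m2 <- gam(y ~ te(lon, lat, doy, d = c(2, 1), bs = c("tp", "cc")),
          data = dat, knots = list(doy = c(0, 365)))

AIC(m1, m2)   # compare the two structures
gam.check(m1) # basis size and residual checks
```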

Exercises

For exercises 01-03, we recommend downloading the qmd file and working through the questions and code (recent versions of RStudio know how to open these). You can also follow along with the HTML version and copy/paste material, if you prefer. For exercise 04, we recommend the HTML version, so you can take advantage of the option to hide and reveal solutions on demand.

  • Exercise 1: Working with gam() HTML qmd
  • Exercise 2: Predictions and variance HTML qmd
  • Exercise 3: Model checking and selection HTML qmd
  • Exercise 4: building a spatio-temporal model, with tensors HTML qmd

Things that came up during the course

  • Residual checking when using count distributions (or binomial data)
    • This can be tricky, since we tend to have only a few unique values (0, 1, 2, etc.), so our check plots can look a bit wonky and it’s hard to tell if, for example, the residuals have equal variance. Randomized quantile residuals (sometimes called probability integral transform, or PIT, residuals) are a way to get around this.
    • Dunn and Smyth (1996) explain how these work
    • Warton, Thibaut and Wang (2017) also cover this topic
    • Here is some basic plotting code to do this from the dsm R package (which Dave previously worked on).
  • Other residual checks by subsets of the data can be useful. For example, plotting the residuals as boxplots by year to confirm that there is no change in their distribution over time (similarly over space, land cover type etc).
  • Looking at the differences in using method="REML" vs. other options (this is a very technical topic!)
    • Reiss and Ogden (2009) talk about problems with using GCV-based fitting
    • Wood (2011) shows that REML does better. Section 1.1 and figure 1 give the important take-aways.
  • Prediction intervals give us a better grasp of uncertainty for predictions of particular values. The prediction uncertainty we get when we set se.fit=TRUE in predict() only tells us about uncertainty in the model coefficients (the \(\boldsymbol{\beta}\)) and maybe the smoothing parameters (the \(\boldsymbol{\lambda}\)). So our prediction uncertainty is for the mean response, given the covariates. We might want to incorporate measurement error into this too. We can do that via simulation.
    • This script shows how to do this for normal data. For other distributions, the basic recipe is the same but we need to be careful with the link function and the distribution we generate our replicate predictions from.
  • Building cyclic splines can be a bit tricky if you don’t have observations at the ends of the cycle (e.g., day 1 and day 365 of the year), so we need to be a bit careful there.
  • If you have some spatial smooth where you need to carefully take into account the boundary (e.g., a coastline or a lake/loch with islands), you might want to use the soap film smoother.
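The randomized quantile (PIT) residual idea above can be sketched by hand for a Poisson model using only base R (this is a sketch of the Dunn and Smyth recipe, not the dsm package code linked above):

```r
library(mgcv)

# simulate Poisson counts and fit a count GAM
set.seed(3)
x <- runif(300)
y <- rpois(300, exp(1 + sin(2 * pi * x)))
m <- gam(y ~ s(x), family = poisson, method = "REML")

# randomized quantile residuals (Dunn & Smyth, 1996):
# draw u uniformly between F(y - 1) and F(y) under the fitted model,
# then map to the normal scale
mu <- fitted(m)
u  <- runif(length(y), ppois(y - 1, mu), ppois(y, mu))
rq <- qnorm(u)

# if the model is adequate, these should look standard normal
qqnorm(rq); qqline(rq)
```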
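The simulation recipe for prediction intervals mentioned above can be sketched for normal data as follows. This is an illustrative sketch on simulated data (the linked script is the course's worked version): simulate coefficient vectors from their approximate posterior, push them through the prediction matrix, then add observation noise.

```r
library(mgcv)

# fit to simulated normal data
set.seed(4)
dat <- data.frame(x = runif(300))
dat$y <- sin(2 * pi * dat$x) + rnorm(300, sd = 0.4)
m <- gam(y ~ s(x), data = dat, method = "REML")

newd <- data.frame(x = seq(0, 1, length.out = 50))

# 1. coefficient uncertainty: draw betas from their approximate
#    posterior N(beta-hat, Vp), using mgcv's rmvn()
nsim  <- 1000
betas <- rmvn(nsim, coef(m), vcov(m))
Xp    <- predict(m, newd, type = "lpmatrix")
mu_sim <- Xp %*% t(betas)  # 50 x nsim simulated mean responses

# 2. observation-level noise, using the estimated residual variance
y_sim <- mu_sim + rnorm(length(mu_sim), sd = sqrt(m$sig2))

# 3. prediction interval = quantiles of the simulated observations
pi95 <- apply(y_sim, 1, quantile, probs = c(0.025, 0.975))
```

For non-normal responses the recipe is the same, but step 2 must generate from the response distribution on the response scale (i.e., apply the inverse link to the simulated linear predictor first).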

Further reading

Further watching

This work is licensed under Attribution-NonCommercial-ShareAlike 4.0 International