Generalized additive models
Welcome
This is the course website for the two-day generalized additive modelling course put on by BioSS for UKCEH, in part as a preamble to this workshop.
The instructors for the course will be Dave Miller and Thomas Cornulier.
Pre-requisites
Computing
We recommend updating to the latest version of R from r-project.org, then running:
update.packages(oldPkgs = "mgcv")
to ensure that you have the latest version of mgcv.
The exercises use a variety of R packages. The following should install the ones you need:
install.packages(c("tidyverse", "patchwork", "gratia"))
We built the course using Quarto. An introductory article on using Quarto with RStudio is available here. If you know RMarkdown, then you know Quarto!
Course materials and schedule
The data used in the course are adapted from the COSMOS-UK site, with more information on the data available on the catalogue page. Our processing can be seen at this GitLab repo.
The processed data can be downloaded here.
Slides
Slides and a summary of each session are available below. Slides are provided in HTML format. For PDF, simply print to PDF in your browser (this seems to work best in Chrome).
Sources for the slides are available on the GitLab site.
Day 1
- 0930-1300 What are GAMs? HTML slides
- Why do we need GAMs?
- Revision of generalized linear models, why they aren’t enough
- Using quadratic etc terms in a GLM, why that’s a bad idea
- Smoothing
- Basis functions (Shiny app)
- Penalties (effective degrees of freedom, basis size)
- gam() basic operation
- summary()
- simple use of plot()
- [LUNCH]
- 1400-1700 Predictions and variance: HTML slides
- Making predictions
- What is a prediction?
- Using predict()
- Calculating variance
- Where does variance come from?
- Using predict(..., se=TRUE)
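As a rough illustration of the Day 1 material, the sketch below fits a GAM with gam(), inspects it with summary() and plot(), and then uses predict() with se.fit=TRUE to build an approximate interval for the mean response. The simulated data and the object names (dat, m, newd) are assumptions made for this example. They are not the course data or the course code.

```r
library(mgcv)

# Simulated data (hypothetical, not the COSMOS-UK data used in the course)
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# One smooth term: k sets the basis size, REML estimates the smoothing parameter
m <- gam(y ~ s(x, k = 10), data = dat, method = "REML")

summary(m)  # effective degrees of freedom and approximate significance of s(x)
plot(m)     # estimated smooth with its confidence band

# Predictions on a grid, with standard errors (on the link scale by default)
newd <- data.frame(x = seq(0, 1, length.out = 50))
p <- predict(m, newdata = newd, se.fit = TRUE)

# Approximate 95% interval for the mean response (identity link here,
# so no back-transformation is needed)
newd$fit   <- as.numeric(p$fit)
newd$lower <- newd$fit - 1.96 * as.numeric(p$se.fit)
newd$upper <- newd$fit + 1.96 * as.numeric(p$se.fit)
head(newd)
```

With a non-identity link, the interval would be built on the link scale in the same way and then back-transformed with the inverse link.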
Day 2
- 0930-1200 Model checking and validation: HTML slides
- residuals from GAMs
- using gam.check()
- AIC()
- term selection using bs="ts"
- 1300-1700 Multidimensional smoothing in space and time: HTML slides
- cyclic smooths for temporal effects
- multi-dimensional tprs, what is isotropy?
- building space-time interactions using tensors
- spatial smoothers that respect boundaries, soap film smoothing
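Some of the Day 2 topics can be sketched in a few lines. The example below is a minimal sketch on simulated data (the variables doy and z and the model names are assumptions for illustration): it fits a cyclic smooth of day of year, adds a shrinkage smooth (bs="ts") of a covariate that has no real effect, then runs gam.check() and compares two models with AIC().

```r
library(mgcv)

# Simulated data (hypothetical): a seasonal signal plus a pure-noise covariate
set.seed(2)
n <- 300
dat <- data.frame(doy = sample(1:365, n, replace = TRUE),  # day of year
                  z   = runif(n))                          # unrelated to y
dat$y <- cos(2 * pi * dat$doy / 365) + rnorm(n, sd = 0.4)

# Cyclic smooth of day of year. Supplying two knots sets the end points of
# the cycle, so the smooth joins up between 31 December and 1 January.
# bs = "ts" is a shrinkage basis: a term with no effect can be penalised
# towards zero.
cycle_ends <- list(doy = c(0.5, 365.5))
m_full <- gam(y ~ s(doy, bs = "cc", k = 10) + s(z, bs = "ts"),
              data = dat, method = "REML", knots = cycle_ends)
m_seas <- gam(y ~ s(doy, bs = "cc", k = 10),
              data = dat, method = "REML", knots = cycle_ends)

gam.check(m_full)  # residual plots plus a check of the basis size k
summary(m_full)    # compare the EDF of s(z) with that of s(doy)

aic_tab <- AIC(m_full, m_seas)
aic_tab

# The cyclic smooth takes the same value at both ends of the cycle
ends <- predict(m_seas, newdata = data.frame(doy = c(0.5, 365.5)))
ends
```

Setting the cycle end points explicitly matters most when the data do not cover the whole cycle, since otherwise the end points default to the range of the observed covariate.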
Exercises
For exercises 01 to 03, we recommend downloading the qmd file and working through the questions and code (recent versions of RStudio know how to open these). You can also follow through with the HTML version and copy/paste material, if you prefer. For exercise 04, we recommend you use the HTML version, if you want to take advantage of the option to hide and reveal solutions on demand.
Things that came up during the course
- Residual checking when using count distributions (or binomial data)
- This can be tricky, since we tend to have only a few unique values (0, 1, 2, etc), so our check plots can look a bit wonky and it’s hard to tell whether the residuals have equal variance, for example. Randomized quantile residuals (sometimes called probability integral transform, or PIT, residuals) are a way to get around this.
- Dunn and Smyth (1996) explain how these work
- Warton, Thibaut and Wang (2017) also cover this topic
- Here is some basic plotting code to do this from the dsm R package (which Dave previously worked on).
- Other residual checks by subsets of the data can be useful. For example, plotting the residuals as boxplots by year to confirm that there is no change in their distribution over time (similarly over space, land cover type etc).
- Chandler (2005) has some examples of this.
- Marra, Miller and Zanin (2011) also have an example of using this for spatio-temporal data.
- Looking at the differences in using method="REML" vs. other options (this is a very technical topic!)
- Reiss and Ogden (2009) talk about problems with using GCV-based fitting
- Wood (2011) shows that REML does better. Section 1.1 and figure 1 give the important take-aways.
- Prediction intervals give us a better grasp of uncertainty for predictions of particular values. The prediction uncertainty we get when we set se.fit=TRUE in predict() only tells us about uncertainty in the model coefficients (the \(\boldsymbol{\beta}\)) and maybe the smoothing parameters (the \(\boldsymbol{\lambda}\)). So our prediction uncertainty is for the mean response, given the covariates. We might want to incorporate measurement error into this too. We can do that via simulation.
- This script shows how to do this for normal data. For other distributions, the basic recipe is the same but we need to be careful with the link function and the distribution we generate our replicate predictions from.
- Building cyclic splines can be a bit tricky: if you don’t have observations at the ends of the cycle (e.g., day 1 and day 365 of the year), we need to be a bit careful.
- Gavin Simpson has a nice post about this on his blog (which also contains a lot of useful stuff on GAMs!)
- If you have a spatial smooth where you need to carefully take into account the boundary (e.g., a coastline or a lake/loch with islands), you might want to use the soap film smoother.
- Wood, Bravington and Hedley (2008) is the original reference
- Again, Gavin Simpson’s blog has some useful information too.
- Dave and Benjamin Hlina have developed an R package to help set up these models. A vignette is available here to help.
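Two of the points above, randomized quantile residuals and prediction intervals by simulation, can be sketched for a count model. This is a minimal illustration under stated assumptions, not the script or the dsm code linked above: the data are simulated, the response is Poisson with a log link, and the object names are made up for the example. Coefficients are drawn from their approximate posterior using mgcv's rmvn(), with vcov(..., unconditional = TRUE) so that smoothing parameter uncertainty is included where available.

```r
library(mgcv)

# Simulated count data (hypothetical)
set.seed(3)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- rpois(n, lambda = exp(1 + sin(2 * pi * dat$x)))

m <- gam(y ~ s(x), family = poisson, data = dat, method = "REML")

## 1. Randomized quantile residuals (Dunn and Smyth 1996), Poisson case
mu <- fitted(m)                      # fitted means on the response scale
p_lo <- ppois(dat$y - 1, mu)         # P(Y < y) under the fitted model
p_hi <- ppois(dat$y, mu)             # P(Y <= y) under the fitted model
u <- runif(n, min = p_lo, max = p_hi)  # uniform draw within each jump of the CDF
rqr <- qnorm(u)                      # approximately N(0, 1) if the model is right
qqnorm(rqr); qqline(rqr)

## 2. Prediction intervals by simulation
newd <- data.frame(x = seq(0, 1, length.out = 50))
Xp <- predict(m, newdata = newd, type = "lpmatrix")  # maps coefficients to link scale

nsim <- 1000
beta_sim <- rmvn(nsim, coef(m), vcov(m, unconditional = TRUE))  # nsim x p draws
mu_sim <- exp(Xp %*% t(beta_sim))    # inverse log link: 50 x nsim simulated means

# Interval for the mean response (coefficient uncertainty only)
mean_int <- apply(mu_sim, 1, quantile, probs = c(0.025, 0.975))

# Add observation-level variation by drawing counts around each simulated mean
y_sim <- matrix(rpois(length(mu_sim), mu_sim), nrow = nrow(mu_sim))
pred_int <- apply(y_sim, 1, quantile, probs = c(0.025, 0.975))
```

For another family, the same steps apply with the matching inverse link in place of exp(), the matching random generator in place of rpois(), and the matching distribution function in place of ppois().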
Further reading
- Simon Wood’s book “Generalized Additive Models: An Introduction with R” has more information than you require on GAMs.
- Gavin Simpson’s paper “Modelling palaeoecological time series using generalised additive models”.
- Dave’s preprint on Bayesian views of GAMs.
- Phil Bouchet’s package dsmextra, which can be used to assess extrapolations with GAMs (focussed on count data, but the methods can be used in any setting).
- Dave and co’s paper on modelling hierarchical effects using GAMs also has some potentially useful background information on GAMs.
- Further resources on generalized additive models compiled by Noam Ross.
Further watching
This work is licensed under Attribution-NonCommercial-ShareAlike 4.0 International