The Full Game

library(ascot) # Our package for simulating space-time patterns
library(dplyr) # Handy package for general data manipulation
library(ggplot2) # Visualisation
library(lubridate) # Nice way to work with date data
library(mgcv) # The GAMs package we will use for modelling

Building Worlds

We are going to start by building a simulated version of our study system. While it is often tempting to jump straight into working with real data, the models we are using here are quite complex. It can be challenging, therefore, to intuitively grasp whether something has gone wrong in our modelling, or whether there is something missing in our understanding of the data. By building simulated datasets where we have full control over how the data were generated, we give the models nowhere to hide!

There is an art to constructing good simulations for model testing. We want to make our simulated world’s realistic enough to give a flavour of the real data, but not so complex that we can’t track down problems if they arise. When working with spatio-temporal questions, things become very complicated very quickly!

Using Our Package

We have designed the ascot package to give you a lot of flexibility for simulating and sampling from realistic spatio-temporal scenarios. In other vignettes, we explore this flexibility but here we will make some quick decisions that give a flavour of the available options.

The Temporal Pattern

We start by thinking about the pattern through time. In many realistic ecological and environmental datasets we expect there to be periodic patterns in the data corresponding to cycles within days, weeks, seasons or years. The define_pattern_t function allows you to generate these cycles at multiple scales (weeks, seasons and years) and to tune the amplitude and noise associated with these patterns. We discuss in more detail in the temporal_simulations vignette. Here, we generate one month’s worth of data which have oscillations of amplitude 1 at the weekly scale.

# Determine duration of time series
time_df <- set_timeline(1, "month")
# Set time series pattern
time_series <- define_pattern_t(time_df,
                                weekly_amplitude = 1,
                                weekly_noise = 0)
#> Warning: One quarter or less of data, ignoring arguments for seasonal and annual
#> patterns

# Visualise
ggplot(time_series, aes(x = date, y = pattern)) +
  geom_point() + 
  geom_line() +
  theme_bw()

Spatial Pattern

Similarly, in most real-world processes we expect there to be spatial autocorrelation, that is to say, places closer together in space are likely to be more similar to each other than places further apart. This can happen for numerous reasons, from simple diffusion to complex aggregation behaviours. Our next job, therefore, is to define how spatial locations will relate to each other. Here we create a 20 by 20 grid with moderate autocorrelation. More information on generating spatial fields can be found in the spatial_simulations vignette.

# Generate Field
spatial_field <- define_pattern_s(20, 20, autocorr = TRUE, range = 5)

# Visualise
ggplot(spatial_field, aes(x = x, y = y, fill = value)) +
  geom_raster() +
  theme_bw()

Combining Patterns

We now want to combine the two patterns we have made so that our locations in space change through time. Conceptually, we treat our spatial pattern as a starting point and then allow each location to move through time according to the temporal pattern. We can make the relationship between space and time as outlined in the GAMs workshop. An easy way of doing this is through the pattern_adherence argument in the function simulate_spacetime. Here, we set this to 0, the simplest form of combining the two. We refer to this as the truth as it will represent the true process we are interested in inferring when we start modelling.

# Simulate from spatial and temporal patterns
truth <- simulate_spacetime(temporal_pattern = time_series,
                                 field = spatial_field,
                                 pattern_adherence = 0)

Visualising Space and Time

When generating patterns across time OR across space, they are fairly easy to visualise, as demonstrated above. When we have space and time interacting, visualisations are more challenging as there is a time series at every spatial location and a different map for each point in time. There are several options for approaching this problem.

Plot a random sample of time points

# Generate vector of random dates
random_dates <- truth$date %>% unique() %>% sample(9)
# Sample those dates
truth_rand_times <- truth %>% filter(date %in% random_dates)
# Visualise
ggplot(truth_rand_times, aes(x = x, y = y, fill = value)) +
  geom_raster() +
  facet_wrap(vars(date))

Plot a window of time points

# Generate vector of first 16 dates
first_n_dates <- truth$date %>% unique() %>% sort() %>% head(16)
# Sample those dates
truth_first_n_times <- truth %>% filter(date %in% first_n_dates)
# Visualise
ggplot(truth_first_n_times, aes(x = x, y = y, fill = value)) +
  geom_raster() +
  facet_wrap(vars(date))

Plot Time Series But Not Space

# Combine each x and y coordinate to give each location a unique ID
truth$loc_id <- factor(paste(truth$x, truth$y, sep = "_"))

ggplot(truth, aes(x = date, y = value, colour = loc_id)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = "none")

Sampling

We have now simulated the underlying Truth - the process we are interested in reconstructing. However, for real data, we will also need to tackle the observation process. The observation process is how data are collected and represented. Observation processes can be as complex as the physical or biological processes we are interested in.

To maintain generality, we focus on some relatively simple but important observation processes:

  1. We have multiple observation methods which capture the same process;
  2. The observations are differently distributed in space;
  3. The observations have different degrees of error

In the modelling section below, we will illustrate how to tackle problems 1-3.

Change in Support

There is an additional process motivated by our case studies: the change in support problem. Change in support refers to tackling areal and point data in the same analysis. As we cover further in in the discussion, we believe that for sparse data it is hard to get good results with conventional change of support models as they can be quite data hungry.

Instead, we recommend treating the areal data as point data. This is an approximation that will hold well when the areal data are relatively fine but will become less appropriate for coarser areal data. From informal testing on realistic data simulations and discussions with experienced modellers, the approximation is more robust than attempting to fit the more elegant change of support models.

Two Sampling Stategies

We will compare two sampling strategies here, one where we evenly space our observations (sample_area_pointily(...regular=TRUE)) and one where we space our observations randomly (sample_area_pointily(...regular=FALSE)). We will give each sampling regime a different error term, mimicking having data with different levels of reliability. For more details on sampling strategies, consult the sampling_strategies vignette.

sample_reg <- sample_area_pointily(truth, n_samples = 25, sampling_error = 0.1,
                               regular = TRUE)
sample_ran <- sample_area_pointily(truth, n_samples = 25, sampling_error = 1,
                               regular = FALSE)

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
head(sample_reg)

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)

Remember, we might be interested in deconstructing the different components of the date to pick out weekly, quarterly and annual patterns.

Building A Basic Space Time Model

fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_reg, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_reg)

Does this look familiar?

Building A Basic Space Time Model

fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_reg, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_reg)

Building A Basic Space Time Model

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_ran)

Building A Basic Space Time Model

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_ran)

Building A Basic Space Time Model

gam.check(fit_sample_reg)
gam.check(fit_sample_ran)

A Joint Model

Rather than calling mgcv twice, can we specify these two models as a single joint model (like Model I)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
gam.check(fit_fam_I)

Hierarchical GAMs

Model GI - Same Shape, Different Wiggles

To use gfam(), we need to create this rather unsightly resp object.

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Do Not Be Afraid

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Basis Specification

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Now we can have observation processes with different response distribution s!

Model GI - Same Shape, Different Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model GS - Same Shape, Same Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model G - Same Model

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Model G* - Group Trend Models

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Hierarchical GAM

# Independent trend and smooths
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

How Do The Models Compare?

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Model I

gam.check(fit_fam_I)

Model GI

gam.check(fit_fam_GI)

Model GS

gam.check(fit_fam_GS)

Model G

gam.check(fit_fam_G)

How Do The Models Compare?

AIC(fit_fam_I)
AIC(fit_fam_GI)
AIC(fit_fam_GS)
AIC(fit_fam_G)

Thoughts To Finish With

References

On Hierarchical GAMs

  1. Pedersen EJ, Miller DL, Simpson GL, Ross N. 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7:e6876 https://doi.org/10.7717/peerj.6876

On Observation Processes

  1. Chadwick, FJ, Haydon, DT, Husmeier, D, Ovaskainen, O, & Matthiopoulos, J (2023). LIES of omission: complex observation processes in ecology. Trends in Ecology & Evolution https://doi.org/10.1016/j.tree.2023.10.009

On Methods Underpinning gfam()

  1. Wood, SN, Pya, N and Saefken, B (2016), Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548-1575 doi:10.1080/01621459.2016.1180986

On The by= Argument

  1. Watch out for forthcoming paper from Dave Miller, Thomas Cornulier and Ken Newman.