library(ascot) # Our package for simulating space-time patterns
library(dplyr) # Handy package for general data manipulation
library(ggplot2) # Visualisation
library(lubridate) # Nice way to work with date data
library(mgcv) # The GAMs package we will use for modellingWe are going to start by building a simulated version of our study system. While it is often tempting to jump straight into working with real data, the models we are using here are quite complex. It can be challenging, therefore, to intuitively grasp whether something has gone wrong in our modelling, or whether there is something missing in our understanding of the data. By building simulated datasets where we have full control over how the data were generated, we give the models nowhere to hide!
There is an art to constructing good simulations for model testing. We want to make our simulated world’s realistic enough to give a flavour of the real data, but not so complex that we can’t track down problems if they arise. When working with spatio-temporal questions, things become very complicated very quickly!
We have designed the ascot package to give you a lot of flexibility for simulating and sampling from realistic spatio-temporal scenarios. In other vignettes, we explore this flexibility but here we will make some quick decisions that give a flavour of the available options.
We start by thinking about the pattern through time. In many
realistic ecological and environmental datasets we expect there to be
periodic patterns in the data corresponding to cycles within days,
weeks, seasons or years. The define_pattern_t function
allows you to generate these cycles at multiple scales (weeks, seasons
and years) and to tune the amplitude and noise associated with these
patterns. We discuss in more detail in the
temporal_simulations vignette. Here, we generate one
month’s worth of data which have oscillations of amplitude 1 at the
weekly scale.
# Determine duration of time series
time_df <- set_timeline(1, "month")
# Set time series pattern
time_series <- define_pattern_t(time_df,
weekly_amplitude = 1,
weekly_noise = 0)
#> Warning: One quarter or less of data, ignoring arguments for seasonal and annual
#> patterns
# Visualise
ggplot(time_series, aes(x = date, y = pattern)) +
geom_point() +
geom_line() +
theme_bw()Similarly, in most real-world processes we expect there to be spatial
autocorrelation, that is to say, places closer together in space are
likely to be more similar to each other than places further apart. This
can happen for numerous reasons, from simple diffusion to complex
aggregation behaviours. Our next job, therefore, is to define how
spatial locations will relate to each other. Here we create a 20 by 20
grid with moderate autocorrelation. More information on generating
spatial fields can be found in the spatial_simulations
vignette.
# Generate Field
spatial_field <- define_pattern_s(20, 20, autocorr = TRUE, range = 5)
# Visualise
ggplot(spatial_field, aes(x = x, y = y, fill = value)) +
geom_raster() +
theme_bw()We now want to combine the two patterns we have made so that our
locations in space change through time. Conceptually, we treat our
spatial pattern as a starting point and then allow each location to move
through time according to the temporal pattern. We can make the
relationship between space and time as outlined in the GAMs workshop. An
easy way of doing this is through the pattern_adherence
argument in the function simulate_spacetime. Here, we set
this to 0, the simplest form of combining the two. We refer to this as
the truth as it will represent the true process we are
interested in inferring when we start modelling.
# Simulate from spatial and temporal patterns
truth <- simulate_spacetime(temporal_pattern = time_series,
field = spatial_field,
pattern_adherence = 0)When generating patterns across time OR across space, they are fairly easy to visualise, as demonstrated above. When we have space and time interacting, visualisations are more challenging as there is a time series at every spatial location and a different map for each point in time. There are several options for approaching this problem.
# Generate vector of random dates
random_dates <- truth$date %>% unique() %>% sample(9)
# Sample those dates
truth_rand_times <- truth %>% filter(date %in% random_dates)
# Visualise
ggplot(truth_rand_times, aes(x = x, y = y, fill = value)) +
geom_raster() +
facet_wrap(vars(date))# Generate vector of first 16 dates
first_n_dates <- truth$date %>% unique() %>% sort() %>% head(16)
# Sample those dates
truth_first_n_times <- truth %>% filter(date %in% first_n_dates)
# Visualise
ggplot(truth_first_n_times, aes(x = x, y = y, fill = value)) +
geom_raster() +
facet_wrap(vars(date))# Combine each x and y coordinate to give each location a unique ID
truth$loc_id <- factor(paste(truth$x, truth$y, sep = "_"))
ggplot(truth, aes(x = date, y = value, colour = loc_id)) +
geom_line() +
theme_bw() +
theme(legend.position = "none")We have now simulated the underlying Truth - the process we are interested in reconstructing. However, for real data, we will also need to tackle the observation process. The observation process is how data are collected and represented. Observation processes can be as complex as the physical or biological processes we are interested in.
To maintain generality, we focus on some relatively simple but important observation processes:
In the modelling section below, we will illustrate how to tackle problems 1-3.
There is an additional process motivated by our case studies: the change in support problem. Change in support refers to tackling areal and point data in the same analysis. As we cover further in in the discussion, we believe that for sparse data it is hard to get good results with conventional change of support models as they can be quite data hungry.
Instead, we recommend treating the areal data as point data. This is an approximation that will hold well when the areal data are relatively fine but will become less appropriate for coarser areal data. From informal testing on realistic data simulations and discussions with experienced modellers, the approximation is more robust than attempting to fit the more elegant change of support models.
We will compare two sampling strategies here, one where we evenly
space our observations
(sample_area_pointily(...regular=TRUE)) and one where we
space our observations randomly
(sample_area_pointily(...regular=FALSE)). We will give each
sampling regime a different error term, mimicking having data with
different levels of reliability. For more details on sampling
strategies, consult the sampling_strategies vignette.
sample_reg <- sample_area_pointily(truth, n_samples = 25, sampling_error = 0.1,
regular = TRUE)
sample_ran <- sample_area_pointily(truth, n_samples = 25, sampling_error = 1,
regular = FALSE)set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
head(sample_reg)value is the Truth from our simulation.point_sample is what we observed.loc_id is the combination of the x and
y coordinates to help with plotting.xset.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)Remember, we might be interested in deconstructing the different components of the date to pick out weekly, quarterly and annual patterns.
fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_reg,
family = "gaussian",
method = "REML")
gam.check(fit_sample_reg)Does this look familiar?
fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_reg,
family = "gaussian",
method = "REML")
gam.check(fit_sample_reg)fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_ran,
family = "gaussian",
method = "REML")
gam.check(fit_sample_ran)fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_ran,
family = "gaussian",
method = "REML")
gam.check(fit_sample_ran)gam.check(fit_sample_reg)gam.check(fit_sample_ran)Rather than calling mgcv twice, can we specify these two
models as a single joint model (like Model I)
set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples
# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples
# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples
# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25,
sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples
# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_ran,
family = "gaussian",
method = "REML")
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec,
by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec),
data = sample_ran,
family = "gaussian",
method = "REML")
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec,
by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec,
by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
gam.check(fit_fam_I)To use gfam(), we need to create this rather unsightly
resp object.
# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec,
by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")mgcv Tricks: Do Not Be Afraid# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")mgcv Tricks: Basis Specification# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")mgcv Tricks: Grouped Families# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")mgcv Tricks: Grouped Families# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")Now we can have observation processes with different response distribution s!
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec),
data = sample_com,
family = "gaussian",
method = "REML")# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec),
data = sample_com,
family = "gaussian",
method = "REML")# Independent trend and smooths
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec,
by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec),
data = sample_com,
family = "gaussian",
method = "REML")# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec),
data = sample_com,
family = "gaussian",
method = "REML")Model I
gam.check(fit_fam_I)Model GI
gam.check(fit_fam_GI)Model GS
gam.check(fit_fam_GS)Model G
gam.check(fit_fam_G)AIC(fit_fam_I)
AIC(fit_fam_GI)
AIC(fit_fam_GS)
AIC(fit_fam_G)mgcv makes fitting really rather complex models
relatively simple.On Hierarchical GAMs
On Observation Processes
On Methods Underpinning gfam()
On The by= Argument