library(ascot) # Our package for simulating space-time patterns
library(dplyr) # Handy package for general data manipulation
library(ggplot2) # Visualisation
library(lubridate) # Nice way to work with date data
library(mgcv) # The GAMs package we will use for modelling
We are going to start by building a simulated version of our study system. While it is often tempting to jump straight into working with real data, the models we are using here are quite complex. It can be challenging, therefore, to intuitively grasp whether something has gone wrong in our modelling, or whether there is something missing in our understanding of the data. By building simulated datasets where we have full control over how the data were generated, we give the models nowhere to hide!
There is an art to constructing good simulations for model testing. We want to make our simulated world’s realistic enough to give a flavour of the real data, but not so complex that we can’t track down problems if they arise. When working with spatio-temporal questions, things become very complicated very quickly!
We have designed the ascot package to give you a lot of flexibility for simulating and sampling from realistic spatio-temporal scenarios. In other vignettes, we explore this flexibility but here we will make some quick decisions that give a flavour of the available options.
We start by thinking about the pattern through time. In many
realistic ecological and environmental datasets we expect there to be
periodic patterns in the data corresponding to cycles within days,
weeks, seasons or years. The define_pattern_t
function
allows you to generate these cycles at multiple scales (weeks, seasons
and years) and to tune the amplitude and noise associated with these
patterns. We discuss in more detail in the
temporal_simulations
vignette. Here, we generate one
month’s worth of data which have oscillations of amplitude 1 at the
weekly scale.
# Determine duration of time series
<- set_timeline(1, "month")
time_df # Set time series pattern
<- define_pattern_t(time_df,
time_series weekly_amplitude = 1,
weekly_noise = 0)
#> Warning: One quarter or less of data, ignoring arguments for seasonal and annual
#> patterns
# Visualise
ggplot(time_series, aes(x = date, y = pattern)) +
geom_point() +
geom_line() +
theme_bw()
Similarly, in most real-world processes we expect there to be spatial
autocorrelation, that is to say, places closer together in space are
likely to be more similar to each other than places further apart. This
can happen for numerous reasons, from simple diffusion to complex
aggregation behaviours. Our next job, therefore, is to define how
spatial locations will relate to each other. Here we create a 20 by 20
grid with moderate autocorrelation. More information on generating
spatial fields can be found in the spatial_simulations
vignette.
# Generate Field
<- define_pattern_s(20, 20, autocorr = TRUE, range = 5)
spatial_field
# Visualise
ggplot(spatial_field, aes(x = x, y = y, fill = value)) +
geom_raster() +
theme_bw()
We now want to combine the two patterns we have made so that our
locations in space change through time. Conceptually, we treat our
spatial pattern as a starting point and then allow each location to move
through time according to the temporal pattern. We can make the
relationship between space and time as outlined in the GAMs workshop. An
easy way of doing this is through the pattern_adherence
argument in the function simulate_spacetime
. Here, we set
this to 0, the simplest form of combining the two. We refer to this as
the truth
as it will represent the true process we are
interested in inferring when we start modelling.
# Simulate from spatial and temporal patterns
<- simulate_spacetime(temporal_pattern = time_series,
truth field = spatial_field,
pattern_adherence = 0)
When generating patterns across time OR across space, they are fairly easy to visualise, as demonstrated above. When we have space and time interacting, visualisations are more challenging as there is a time series at every spatial location and a different map for each point in time. There are several options for approaching this problem.
# Generate vector of random dates
<- truth$date %>% unique() %>% sample(9)
random_dates # Sample those dates
<- truth %>% filter(date %in% random_dates)
truth_rand_times # Visualise
ggplot(truth_rand_times, aes(x = x, y = y, fill = value)) +
geom_raster() +
facet_wrap(vars(date))
# Generate vector of first 16 dates
<- truth$date %>% unique() %>% sort() %>% head(16)
first_n_dates # Sample those dates
<- truth %>% filter(date %in% first_n_dates)
truth_first_n_times # Visualise
ggplot(truth_first_n_times, aes(x = x, y = y, fill = value)) +
geom_raster() +
facet_wrap(vars(date))
# Combine each x and y coordinate to give each location a unique ID
$loc_id <- factor(paste(truth$x, truth$y, sep = "_"))
truth
ggplot(truth, aes(x = date, y = value, colour = loc_id)) +
geom_line() +
theme_bw() +
theme(legend.position = "none")
We have now simulated the underlying Truth - the process we are interested in reconstructing. However, for real data, we will also need to tackle the observation process. The observation process is how data are collected and represented. Observation processes can be as complex as the physical or biological processes we are interested in.
To maintain generality, we focus on some relatively simple but important observation processes:
In the modelling section below, we will illustrate how to tackle problems 1-3.
There is an additional process motivated by our case studies: the change in support problem. Change in support refers to tackling areal and point data in the same analysis. As we cover further in in the discussion, we believe that for sparse data it is hard to get good results with conventional change of support models as they can be quite data hungry.
Instead, we recommend treating the areal data as point data. This is an approximation that will hold well when the areal data are relatively fine but will become less appropriate for coarser areal data. From informal testing on realistic data simulations and discussions with experienced modellers, the approximation is more robust than attempting to fit the more elegant change of support models.
We will compare two sampling strategies here, one where we evenly
space our observations
(sample_area_pointily(...regular=TRUE)
) and one where we
space our observations randomly
(sample_area_pointily(...regular=FALSE)
). We will give each
sampling regime a different error term, mimicking having data with
different levels of reliability. For more details on sampling
strategies, consult the sampling_strategies
vignette.
<- sample_area_pointily(truth, n_samples = 25, sampling_error = 0.1,
sample_reg regular = TRUE)
<- sample_area_pointily(truth, n_samples = 25, sampling_error = 1,
sample_ran regular = FALSE)
set.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
head(sample_reg)
value
is the Truth from our simulation.point_sample
is what we observed.loc_id
is the combination of the x
and
y
coordinates to help with plotting.xset.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
# Make mgcv see date as a date
$date_dec <- lubridate::decimal_date(sample_reg$date) sample_reg
Remember, we might be interested in deconstructing the different components of the date to pick out weekly, quarterly and annual patterns.
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_reg data = sample_reg,
family = "gaussian",
method = "REML")
gam.check(fit_sample_reg)
Does this look familiar?
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_reg data = sample_reg,
family = "gaussian",
method = "REML")
gam.check(fit_sample_reg)
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_ran data = sample_ran,
family = "gaussian",
method = "REML")
gam.check(fit_sample_ran)
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_ran data = sample_ran,
family = "gaussian",
method = "REML")
gam.check(fit_sample_ran)
gam.check(fit_sample_reg)
gam.check(fit_sample_ran)
Rather than calling mgcv
twice, can we specify these two
models as a single joint model (like Model I)
set.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
set.seed(1871)
<- sample_area_pointily(truth, n_samples = 25,
sample_ran sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_reg$date_dec <- lubridate::decimal_date(sample_ran$date)
sample_ran# Add family column
$fam <- 1 # Observation source 1, the regular samples
sample_reg$fam <- 2 # Observation source 2, the random samples
sample_ran
# Combine into one dataframe
<- rbind(sample_reg, sample_ran)
sample_com # Note, we need *both* fam and fam_fac for the factor smoothing later
$fam_fac <- factor(sample_com$fam) sample_com
set.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
set.seed(1871)
<- sample_area_pointily(truth, n_samples = 25,
sample_ran sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_reg$date_dec <- lubridate::decimal_date(sample_ran$date)
sample_ran# Add family column
$fam <- 1 # Observation source 1, the regular samples
sample_reg$fam <- 2 # Observation source 2, the random samples
sample_ran
# Combine into one dataframe
<- rbind(sample_reg, sample_ran)
sample_com # Note, we need *both* fam and fam_fac for the factor smoothing later
$fam_fac <- factor(sample_com$fam) sample_com
set.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
set.seed(1871)
<- sample_area_pointily(truth, n_samples = 25,
sample_ran sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_reg$date_dec <- lubridate::decimal_date(sample_ran$date)
sample_ran# Add family column
$fam <- 1 # Observation source 1, the regular samples
sample_reg$fam <- 2 # Observation source 2, the random samples
sample_ran
# Combine into one dataframe
<- rbind(sample_reg, sample_ran)
sample_com # Note, we need *both* fam and fam_fac for the factor smoothing later
$fam_fac <- factor(sample_com$fam) sample_com
set.seed(1868)
<- sample_area_pointily(truth, n_samples = 25,
sample_reg sampling_error = 0.1, regular = TRUE)
set.seed(1871)
<- sample_area_pointily(truth, n_samples = 25,
sample_ran sampling_error = 1, regular = FALSE)
# Make mgcv see date as a date
$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_reg$date_dec <- lubridate::decimal_date(sample_ran$date)
sample_ran# Add family column
$fam <- 1 # Observation source 1, the regular samples
sample_reg$fam <- 2 # Observation source 2, the random samples
sample_ran
# Combine into one dataframe
<- rbind(sample_reg, sample_ran)
sample_com # Note, we need *both* fam and fam_fac for the factor smoothing later
$fam_fac <- factor(sample_com$fam) sample_com
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_ran data = sample_ran,
family = "gaussian",
method = "REML")
<- mgcv::gam(point_sample ~ t2(x, y, date_dec,
fit_fam_I by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
<- mgcv::gam(point_sample ~ t2(x, y, date_dec),
fit_sample_ran data = sample_ran,
family = "gaussian",
method = "REML")
<- mgcv::gam(point_sample ~ t2(x, y, date_dec,
fit_fam_I by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
<- mgcv::gam(point_sample ~ t2(x, y, date_dec,
fit_fam_I by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
gam.check(fit_fam_I)
To use gfam()
, we need to create this rather unsightly
resp
object.
# Need to create combined response variable
$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam)) sample_com
<- mgcv::gam(point_sample ~ t2(x, y, date_dec,
fit_fam_I by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
mgcv
Tricks: Do Not Be Afraid# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
mgcv
Tricks: Basis Specification# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
mgcv
Tricks: Grouped Families# Need to create combined response variable
$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
sample_com# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
mgcv
Tricks: Grouped Families# Need to create combined response variable
$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
sample_com# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
Now we can have observation processes with different response distribution s!
# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GS t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GS t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
<- mgcv::gam(point_sample ~ t2(x,y, date_dec),
fit_fam_G data = sample_com,
family = "gaussian",
method = "REML")
# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GS t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
<- mgcv::gam(point_sample ~ t2(x,y, date_dec),
fit_fam_G data = sample_com,
family = "gaussian",
method = "REML")
# Independent trend and smooths
<- mgcv::gam(point_sample ~ t2(x, y, date_dec,
fit_fam_I by = fam_fac),
data = sample_com,
family = "gaussian",
method = "REML")
# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GS t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
<- mgcv::gam(point_sample ~ t2(x,y, date_dec),
fit_fam_G data = sample_com,
family = "gaussian",
method = "REML")
# Need to create combined response variable
$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
sample_com# Shared trend and different smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GI t2(x, y, date_dec,
d=c(2, 1),
bs=c("ts", "ts"),
by = fam_fac),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Shared trend and shared smoothness
<- mgcv::gam(resp ~ t2(x, y, date_dec) +
fit_fam_GS t2(x, y, date_dec, fam_fac,
d=c(2, 1, 1),
bs=c("ts", "ts", "re")),
data = sample_com,
family = gfam(list(gaussian, gaussian)),
method = "REML")
# Identical trend and smoothness
<- mgcv::gam(point_sample ~ t2(x,y, date_dec),
fit_fam_G data = sample_com,
family = "gaussian",
method = "REML")
Model I
gam.check(fit_fam_I)
Model GI
gam.check(fit_fam_GI)
Model GS
gam.check(fit_fam_GS)
Model G
gam.check(fit_fam_G)
AIC(fit_fam_I)
AIC(fit_fam_GI)
AIC(fit_fam_GS)
AIC(fit_fam_G)
mgcv
makes fitting really rather complex models
relatively simple.On Hierarchical GAMs
On Observation Processes
On Methods Underpinning gfam()
On The by=
Argument