This vignette covers models where the response is a function of the interaction between two predictors, each predictor being measured at a collection (vector) of regular distances increments, which may represent distance in space (spatial lags), or in the past (time lags), for example.
The model assumes that the coefficient of the predictors (both the main effects and their interaction) vary smoothly with the distance at which they are measured.
This model may be relevant in scenarios where
the relative relevance of different lags/distances/indices for predicting the response is unknown and needs to be inferred from the data.
the effect of a predictor at a given distance depends on the value of the other predictor at the same distance.
Mathematical description
For each observation \(i\), the linear predictor includes the additive effects of two vector predictors \(X_{i}\) and \(Z_{i}\) consisting of values \(x_{ik}\) and \(z_{ik}\) measured at a range of regular distances increments \(d_{ik}\) forming the distance vector \(D_i\). In typical applications, \(D_i\) is invariant between observations, so index \(i\) can be omitted.
Building on the conventional interaction model with predictors \(A\) and \(B\): \[
\mathbb{E}(y_i) = g^{-1}\left( \beta_0 + \beta_1 A_i + \beta_2 B_i + \beta_3 A_i B_i\right),
\]
Functions \(f_{.}(d_k)\) acts as a coefficients for the \(x_{ik}\) and \(z_{ik}\) main effects and their interaction, that vary smoothly with distance \(d_{k}\). The shape of \(f_{.}(d_k)\) is typically unknown and needs to be estimated from the data, so the function should have sufficient flexibility to ensure a good fit to the data, while being sufficiently constrained to avoid overfitting. In this tutorial we will represent \(f_{.}\) with penalized regression splines, since they meet these requirements and are easily implemented using standard software.
R implementation
The R implementation of the model, using package mgcv, is of the form:
Where y is a dataframe column with \(N\) observations, and D, X, Z and XZ are all \(N \times K\) matrices, with the number of columns \(K\) corresponding to the number of distance classes over which the predictors were measured, and XZ = X * Z.
Matrix D encodes the actual distances values, with equal intervals and constant across rows (if this is not the case, appropriate integration weights should be applied). The units chosen for expressing \(D\) are arbitrary and do not affect the predictions of the model (only its interpretation).
The summation convention applied in mgcv means that when the data fed to the smooth terms are multi-column matrices, the sum of the function evaluation over all columns (after multiplication by the corresponding entry of any by= argument) is returned.
The input data should have the following structure (illustrative example with 8 distances and 100 observations):
y (response) is typically a single-column vector (may have two columns for binomial family or an arbitrary number for likelihoods that make use of this, such as multivariate normal).
D has one column per distance (index) value and is normally row-invariant.
X, Z and XZ have same dimension as D and are NOT row-invariant. If there is correlation in the \(x_{ik}\) or in the \(z_{ik}\) values between adjacent distance classes, rows of these matrices will tend to display serial correlation patterns, as in the illustrative example above.
Illustration: analysis of kittiwake breeding success
The motivation for the model, its construction and interpretation are best shown by example. Background information about the kittiwake case study can be found in the introductory vignette (“1_Introduction”).
Specific research question
“At a given distance lag, is kittiwake yearly breeding success predicted by lag-specific:
presence of sandeels?
SST value?
an interaction between the effects of sandeel presence and SST?”
A signal regression model with just temporal lags
A direct answer to the “Specific research question” can be given by a signal regression using SST and sandeel values at suitable spatial lags, as predictors.
Here, we will use SST values averaged over the weeks 16-25 of the year, taking pixel averages over 10 km distance bands and assume that the effect of SST does not extend beyond 210 km away from the colony.
Data preparation
# loading the data# depending on the configuration of# your system, one of these should worktry(load("kit_SST_sandeel.RData"), silent= T)try(load("../kit_SST_sandeel.RData"), silent= T)# we'll work with the smaller subset of 7 colonieskit1<- kit2[kit2$Site %in% kit_sub1, ]# corresponding sandeel data matrix (time-constant)sandeel1<- sandeel_mean_ring[kit1$Site, ]
Modelling parameters
# distance lags (in km from colony)lags_sp<-seq(0, 210, by=10)N<-nrow(kit2)D<-length(lags_sp[-1])
Preparing the predictor matrix (21 columns = mean SST values up to 210 km, by 10 km increments).
For each distance band, average SST is calculated firstly by differencing between nested buffers, and then by taking the ratio of the sum of SST values for all pixels within distance band, divided by number of non-NA pixels over the same band, for the relevant colony and year (see code below for more details).
# SST spatial lag matrix, assuming mean SST over weeks 16 to 25:# compute mean per periodSST_sum_buffer_16_25weeks<-apply(SST_sum_buffer[16:25, , kit1$SiteYear], c(2, 3), mean)SST_noNA_buffer_16_25weeks<-apply(SST_noNA_buffer[16:25, , kit1$SiteYear], c(2, 3), mean) # expecting little/no variation between weeks# compute rings from difference between successive buffersSST_sum_ring_16_25weeks<-apply( SST_sum_buffer_16_25weeks, c(2), FUN=function(x){diff(c(0, x))})SST_noNA_ring_16_25weeks<-apply( SST_noNA_buffer_16_25weeks, c(2), FUN=function(x){diff(c(0, x))})# compute rings mean from SSTsum/NpixelsSST_mean_ring_16_25weeks<-t(SST_sum_ring_16_25weeks / SST_noNA_ring_16_25weeks)# corresponding spatial lag index matrixsplag_mat<-t(matrix(lags_sp[-1], nrow=length(lags_sp[-1]), ncol=nrow(kit1)))head(round(SST_mean_ring_16_25weeks, 2), 3)
Visual inputs check (yellow is lowest value, blue highest):
Warning in cbind(kit1$Fledg, matrix(NA, N, D)): number of rows of result is not
a multiple of vector length (arg 1)
Lots of missing values in both sandeel and SST data, leading to loss of sample size.
Fitting the model
An offset offset(log(AON + 1)) is used to standardise the number of chicks by the number of monitored nests (see Introduction vignette)
a site random effect is included to account for unknown variation in the mean breeding success between colonies: s(Site, bs= "re")
Data are assumed to follow a Tweedie distribution with a log link, which helps modelling overdispersion in the data.
Average SST values for weeks 16-25 of the year, at each of 21 distance lags are contained in matrix SST_mean_ring_16_25weeks. Sandeel modelled probability of presence (logit scale) are contained in matrix sandeel1. Their interactive effect on kittywake breeding success is assumed to vary smoothly with the distance lag itself (encoded in matrix splag_mat) and captured by the terms s(splag_mat, by= sandeel1), s(splag_mat, by= SST_mean_ring_16_25weeks), s(splag_mat, by= sandeel_SST_16_25_prod1.
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
Should the effects be significant, the plots above would suggest sandeel presence and SST main effects going from near null close to the colony, to negative as distance increases, whereas the interaction becomes larger and positive as distance increases.
Model validation
We should also look at traditional GAM diagnostics, using deviance residuals.
par(mfrow=c(2, 2))gam.check(msp_xp1)
Method: REML Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-0.0001666128,2.821341e-07]
(score 473.9829 & scale 0.5857614).
eigenvalue range [-2.278913e-09,66.73415].
Model rank = 42 / 42
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(Site) 4.000 0.734 NA NA
s(splag_mat):sandeel1 12.000 2.006 NA NA
s(splag_mat):SST_mean_ring_16_25weeks 12.000 2.000 NA NA
s(splag_mat):sandeel_SST_16_25_prod1 12.000 2.000 NA NA
There is a long left tail to the residuals distribution, indicating that some observations are badly underpredicted. This model isn’t a great model for the data, despite its high \(R^2\).
When assuming linear effects of a predictor (here, SST), it’s often worth checking for unwanted trends in the residuals against predictor values. With SST broken down into so many distinct predictors (here, 30 lags) panel plots using a sliding window of lags can an effective way of visualizing the patterns.
The moving averages (red) suggest that there could be some non-linearity in the relationship with SST, which was not considered in this model. High sandeel1 values are only present at the shorter range of distances from the colonies, so great care required for interpreting the model, and particularly the interaction, in this case.
Simulations
Loading a function to generate random 1D smooth functions, which will allow us to simulate the “true”/known effect (coefficient) of predictor vector \(X\) measured over a series of \(K\) lags \(D\).
lambda: a vector of length 2 specifying smoothing parameters ‘lambda’ for the 1D thin plate spline
inter: intercept value
x_l: number of x values a which to evaluate 1D smooth
require(mgcv)require(mvtnorm)
Loading required package: mvtnorm
sim_spline_1D<-function(k=7, lambda=c(0.05, 0.005), inter=0, x_l=100) {# Define the spatial domain x<-seq(0, 1, length.out= x_l) resp<-rnorm(x_l)# Open a null file connection to discard the 'jagam()' text output# Check the operating systemif (.Platform$OS.type =="unix") { null_device<-"/dev/null" } else { # Windows null_device<-"NUL" } con<-file(null_device, open="wt")# generate a template gam model object for JAGS to extract the relevant components jd<-jagam(resp ~s(x, bs="tp", k= k), file= con, diagonalize= F)close(con) S1<- jd$jags.data$S1 zero<- jd$jags.data$zero# penalty matrix K1<- S1[1:(k-1),1:(k-1)] * lambda[1] + S1[1:(k-1),k:(2*k-2)] * lambda[2]# simulate multivariate normal basis coefficients beta_sim<-c(inter, mvtnorm::rmvnorm(n=1, mean= zero[2:k], sigma= K1, method="chol"))# evaluate simulated function at x values fct_sim<- jd$jags.data$X %*% beta_simreturn(fct_sim)}# example# sim_1D<- sim_spline_1D(k= 7, lambda= c(0.05, 0.005), inter= 0, x_l= 100)# plot(sim_1D, type= "l")
Now, a function to simulate data from the model, given the known parameter values.
Output values:
X: lagged predictor matrix
Z: lagged predictor matrix
XZ: product of X and Z
Lag: lag-encoding matrix
tf1, tf2, tf3: true lagged-effect function for main effect of X, main effect of Z and XZ interaction, respectively
Y: response vector
# function to simulate 1D-lagged process with cross-predictor interactionsim_Xpred_lag_of_1D<-function(N=200, nlags=20, residSD=1){# simulate time-varying predictor, where each column represents a time lag w.r.t. the observation X<-matrix(runif(N*nlags), N, nlags) Z<-matrix(runif(N*nlags), N, nlags) XZ<- X * Z# true coefficient functions: main X effect tf1<-sim_spline_1D(x_l= nlags)# true coefficient functions: main Z effect tf2<-sim_spline_1D(x_l= nlags)# true coefficient functions: XZ interaction tf3<-sim_spline_1D(x_l= nlags)# Linear predictor LP<- X %*% tf1 + Z %*% tf2 + XZ %*% tf3# observations Y<-rnorm(n= N, mean= LP, sd= residSD)# index matrix Lag<-matrix(0:(nlags-1), N, nlags, byrow= T)list(X= X, Z= Z, XZ= XZ, Y= Y, tf1= tf1, tf2= tf2, tf3= tf3, Lag= Lag)}
Let’s simulate 4 random data sets from 4 different sets of random model parameters, and plot the estimates of \(f_.(d)\) (labelled “f.(Lag)”) with their 95% confidence intervals (black), together with the true values (red). The x-axis shows \(d\) (labelled “Lag”).
par(mfcol=c(3, 4))set.seed(53) # for reproducibilityfor(i in1:4){ dat<-sim_Xpred_lag_of_1D(N=200, nlags=20, residSD=1) mod<-gam(Y ~s(Lag, by= X, k=ncol(dat$Lag)-1) +s(Lag, by= Z, k=ncol(dat$Lag)-1) +s(Lag, by= XZ, k=ncol(dat$Lag)-1), data= dat)plot(mod, se=1.96, select=1, main=paste("Simulation", i, "\nX main effect"), ylab="f1(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf1, col=2)abline(h=0, lty=3, col=grey(0.6))plot(mod, se=1.96, select=2, main=paste("Simulation", i, "\nZ main effect"), ylab="f2(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf2, col=2)abline(h=0, lty=3, col=grey(0.6))plot(mod, se=1.96, select=3, main=paste("Simulation", i, "\nXZ interaction"), ylab="f3(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf3, col=2)abline(h=0, lty=3, col=grey(0.6))}
In all cases, \(f(d)\), the effect of \(X\) across lags \(D\) is well recovered by the model, with the true value most of the time contained in the 95% CI of the estimates.
Proposed exercises
Level 1: repeat the analysis above (spatial lags of SST and sandeel).
Level 2: fit a model with the effects of two temporal layers of SST (of your choice) interacting, over a range of spatial lags.
In all cases, pay attention to the structure of the model inputs (vectors and matrices) and take some time to reflect on the interpretation of the model outputs.
Source Code
---title: "[2-4] Cross-predictor interaction with lags"author:- name: Thomas Cornulier thomas.cornulier@bioss.ac.uk affiliation: "Biomathematics & Statistics Scotland (https://www.bioss.ac.uk/)"- name: Dave Miller dave.miller@bioss.ac.uk affiliation: "BioSS & UKCEH"date: "Last run on `r format(Sys.time(), '%d/%m/%Y')`"output: html_document: number_sections: true self_contained: true---# The modelThis vignette covers models where the response is a function of the interaction between two predictors, each predictor being measured at a collection (vector) of regular distances increments, which may represent distance in space (spatial lags), *or* in the past (time lags), for example.The model assumes that the coefficient of the predictors (both the main effects and their interaction) vary smoothly with the distance at which they are measured.This model may be relevant in scenarios where - the relative relevance of different lags/distances/indices for predicting the response is unknown and needs to be inferred from the data.- the effect of a predictor at a given distance depends on the value of the other predictor at the same distance.## Mathematical descriptionFor each observation $i$, the linear predictor includes the additive effects of two vector predictors $X_{i}$ and $Z_{i}$ consisting of values $x_{ik}$ and $z_{ik}$ measured at a range of regular distances increments $d_{ik}$ forming the distance vector $D_i$. In typical applications, $D_i$ is invariant between observations, so index $i$ can be omitted.Building on the conventional interaction model with predictors $A$ and $B$:$$\mathbb{E}(y_i) = g^{-1}\left( \beta_0 + \beta_1 A_i + \beta_2 B_i + \beta_3 A_i B_i\right),$$The lagged interaction model is of the form:$$\mathbb{E}(y_i) = g^{-1}\left( \beta_0 + \sum_k f_1(d_{k}) x_{ik} + \sum_k f_2(d_{k}) z_{ik} + \sum_k f_3(d_{k}) x_{ik} z_{ik}\right)$$or in compact vector form,$$\mathbb{E}(y_i) = g^{-1}\left( \beta_0 + X_{i} f_1(D) + Z_{i} f_2(D) + X_{i} Z_{i} f_3(D)\right)$$Functions $f_{.}(d_k)$ acts as a coefficients for the $x_{ik}$ and $z_{ik}$ main effects and their interaction, that vary smoothly with distance $d_{k}$. The shape of $f_{.}(d_k)$ is typically unknown and needs to be estimated from the data, so the function should have sufficient flexibility to ensure a good fit to the data, while being sufficiently constrained to avoid overfitting. In this tutorial we will represent $f_{.}$ with penalized regression splines, since they meet these requirements and are easily implemented using standard software.## R implementationThe R implementation of the model, using package `mgcv`, is of the form:`thisModel<- gam(y ~ s(D, by= X) + s(D, by= Z) + s(D, by= XZ), data= exampleData)`Where `y` is a dataframe column with $N$ observations, and `D`, `X`, `Z` and `XZ` are all $N \times K$ matrices, with the number of columns $K$ corresponding to the number of distance classes over which the predictors were measured, and `XZ = X * Z`.Matrix `D` encodes the actual distances values, with equal intervals and constant across rows (if this is not the case, appropriate integration weights should be applied). The units chosen for expressing $D$ are arbitrary and do not affect the predictions of the model (only its interpretation).The summation convention applied in `mgcv` means that when the data fed to the smooth terms are multi-column matrices, the sum of the function evaluation over all columns (after multiplication by the corresponding entry of any `by=` argument) is returned.The input data should have the following structure (illustrative example with 8 distances and 100 observations):(here showing only the first 4 observations)```{r, echo= F}nr<-100nc<-8exampleData<-list(y=rnorm(nr, 2.5, 1),D=matrix(0:(nc-1), nr, nc, byrow= T) *5,X=matrix(arima.sim(n= nr*nc, model=list(ar=0.85)), nr, nc, byrow= T),Z=matrix(arima.sim(n= nr*nc, model=list(ar=0.5)), nr, nc, byrow= T))exampleData$XZ<- exampleData$X * exampleData$Zcolnames(exampleData$D)<-paste0("D", 1:nc)colnames(exampleData$X)<-paste0("X_D", 1:nc)colnames(exampleData$Z)<-paste0("Z_D", 1:nc)colnames(exampleData$XZ)<-paste0("XZ_D", 1:nc)exampleData<-lapply(exampleData, FUN=function(x) round(x, 2))``````{r}head(exampleData$y, n=4)head(exampleData$D, n=4)head(exampleData$X, n=4)head(exampleData$Z, n=4)head(exampleData$XZ, n=4)```Visually (yellow is lowest value, blue highest):```{r, echo= F, fig.width= 25, fig.height= 12}par(mfrow=c(1, 5), bty="n", xaxt ="n", yaxt ="n")image(t(cbind(exampleData$y, matrix(NA, nr, nc-1))), col=rev(hcl.colors(64)), main="exampleData$y")image(t(exampleData$D), col=rev(hcl.colors(64)), main="exampleData$D")image(t(exampleData$X), col=rev(hcl.colors(64)), main="exampleData$X")image(t(exampleData$Z), col=rev(hcl.colors(64)), main="exampleData$Z")image(t(exampleData$XZ), col=rev(hcl.colors(64)), main="exampleData$XZ")```Key features:* `y` (response) is typically a single-column vector (may have two columns for binomial family or an arbitrary number for likelihoods that make use of this, such as multivariate normal).* `D` has one column per distance (index) value and is normally row-invariant.* `X`, `Z` and `XZ` have same dimension as `D` and are *NOT* row-invariant. If there is correlation in the $x_{ik}$ or in the $z_{ik}$ values between adjacent distance classes, rows of these matrices will tend to display serial correlation patterns, as in the illustrative example above.# Illustration: analysis of kittiwake breeding successThe motivation for the model, its construction and interpretation are best shown by example. Background information about the kittiwake case study can be found in the introductory vignette ("1_Introduction").## Specific research question"At a given distance lag, is kittiwake yearly breeding success predicted by lag-specific:- presence of sandeels?- SST value?- an interaction between the effects of sandeel presence and SST?"## A signal regression model with just temporal lagsA direct answer to the "Specific research question" can be given by a signal regression using SST and sandeel values at suitable spatial lags, as predictors.Here, we will use SST values averaged over the weeks 16-25 of the year, taking pixel averages over 10 km distance bands and assume that the effect of SST does not extend beyond 210 km away from the colony.## Data preparation```{r, warning= F}# loading the data# depending on the configuration of# your system, one of these should worktry(load("kit_SST_sandeel.RData"), silent= T)try(load("../kit_SST_sandeel.RData"), silent= T)# we'll work with the smaller subset of 7 colonieskit1<- kit2[kit2$Site %in% kit_sub1, ]# corresponding sandeel data matrix (time-constant)sandeel1<- sandeel_mean_ring[kit1$Site, ]```Modelling parameters```{r}# distance lags (in km from colony)lags_sp<-seq(0, 210, by=10)N<-nrow(kit2)D<-length(lags_sp[-1])```Preparing the predictor matrix (21 columns = mean SST values up to 210 km, by 10 km increments).For each distance band, average SST is calculated firstly by differencing between nested buffers, and then by taking the ratio of the sum of SST values for all pixels within distance band, divided by number of non-NA pixels over the same band, for the relevant colony and year (see code below for more details).```{r}# SST spatial lag matrix, assuming mean SST over weeks 16 to 25:# compute mean per periodSST_sum_buffer_16_25weeks<-apply(SST_sum_buffer[16:25, , kit1$SiteYear], c(2, 3), mean)SST_noNA_buffer_16_25weeks<-apply(SST_noNA_buffer[16:25, , kit1$SiteYear], c(2, 3), mean) # expecting little/no variation between weeks# compute rings from difference between successive buffersSST_sum_ring_16_25weeks<-apply( SST_sum_buffer_16_25weeks, c(2), FUN=function(x){diff(c(0, x))})SST_noNA_ring_16_25weeks<-apply( SST_noNA_buffer_16_25weeks, c(2), FUN=function(x){diff(c(0, x))})# compute rings mean from SSTsum/NpixelsSST_mean_ring_16_25weeks<-t(SST_sum_ring_16_25weeks / SST_noNA_ring_16_25weeks)# corresponding spatial lag index matrixsplag_mat<-t(matrix(lags_sp[-1], nrow=length(lags_sp[-1]), ncol=nrow(kit1)))head(round(SST_mean_ring_16_25weeks, 2), 3)head(splag_mat, 3)```Now, pre-compute interaction term `XZ= X * Z````{r}sandeel_SST_16_25_prod1<- sandeel1 * SST_mean_ring_16_25weeks```Visual inputs check (yellow is lowest value, blue highest):```{r, echo= F, fig.width= 25, fig.height= 12}par(mfrow=c(1, 5), bty="n", xaxt ="n", yaxt ="n")image(t(cbind(kit1$Fledg, matrix(NA, N, D))), col=rev(hcl.colors(64)), main="y")image(t(splag_mat), col=rev(hcl.colors(64)), main="splag_mat")image(t(SST_mean_ring_16_25weeks), col=rev(hcl.colors(64)), main="SST_mean_ring_16_25weeks")image(t(sandeel1), col=rev(hcl.colors(64)), main="sandeel1")image(t(sandeel_SST_16_25_prod1), col=rev(hcl.colors(64)), main="sandeel_SST_16_25_prod1")```Lots of missing values in both sandeel and SST data, leading to loss of sample size.## Fitting the model* An offset `offset(log(AON + 1))` is used to standardise the number of chicks by the number of monitored nests (see Introduction vignette)* a site random effect is included to account for unknown variation in the mean breeding success between colonies: `s(Site, bs= "re")`* Data are assumed to follow a Tweedie distribution with a log link, which helps modelling overdispersion in the data.Average SST values for weeks 16-25 of the year, at each of 21 distance lags are contained in matrix `SST_mean_ring_16_25weeks`. Sandeel modelled probability of presence (logit scale) are contained in matrix `sandeel1`. Their interactive effect on kittywake breeding success is assumed to vary smoothly with the distance lag itself (encoded in matrix `splag_mat`) and captured by the terms `s(splag_mat, by= sandeel1)`, `s(splag_mat, by= SST_mean_ring_16_25weeks)`, `s(splag_mat, by= sandeel_SST_16_25_prod1`.```{r}library(mgcv)msp_xp1<-gam(Fledg ~offset(log(AON +1)) + Year +s(Site, bs="re") +s(splag_mat, by= sandeel1, k=12) +s(splag_mat, by= SST_mean_ring_16_25weeks, k=12) +s(splag_mat, by= sandeel_SST_16_25_prod1, k=12), data= kit1, family=tw(),method="REML")```## Model output & interpretation```{r}summary(msp_xp1)```> The model summary does not support an effect of the interaction between SST and sandeel presence.Plotting the estimated effects:```{r}par(mfrow=c(1, 3))plot(msp_xp1, scale= F, select=2)abline(h=0, lty=2, col=2)plot(msp_xp1, scale= F, select=3)abline(h=0, lty=2, col=2)plot(msp_xp1, scale= F, select=4)abline(h=0, lty=2, col=2)```> Should the effects be significant, the plots above would suggest sandeel presence and SST main effects going from near null close to the colony, to negative as distance increases, whereas the interaction becomes larger and positive as distance increases.## Model validationWe should also look at traditional GAM diagnostics, using deviance residuals.```{r}par(mfrow=c(2, 2))gam.check(msp_xp1)```> There is a long left tail to the residuals distribution, indicating that some observations are badly underpredicted. This model isn't a great model for the data, despite its high $R^2$.When assuming linear effects of a predictor (here, SST), it's often worth checking for unwanted trends in the residuals against predictor values. With SST broken down into so many distinct predictors (here, 30 lags) panel plots using a sliding window of lags can an effective way of visualizing the patterns.```{r}msp_xp1_res<-data.frame(resid=residuals(msp_xp1), SST=as.vector(SST_mean_ring_16_25weeks[-msp_xp1$na.action, ]),sandeel=as.vector(sandeel1[-msp_xp1$na.action, ]),lag=rep(lags_sp[-1], each=nrow(kit1[-msp_xp1$na.action, ])))coplot(resid ~ SST | lag, data= msp_xp1_res, panel= panel.smooth, number=12, lwd=3)coplot(resid ~ sandeel | lag, data= msp_xp1_res, panel= panel.smooth, number=12, lwd=3)```> The moving averages (red) suggest that there could be some non-linearity in the relationship with SST, which was not considered in this model. High `sandeel1` values are only present at the shorter range of distances from the colonies, so great care required for interpreting the model, and particularly the interaction, in this case.# SimulationsLoading a function to generate random 1D smooth functions, which will allow us to simulate the "true"/known effect (coefficient) of predictor vector $X$ measured over a series of $K$ lags $D$.Function name: `sim_spline_1D(k= 7, lambda= c(0.05, 0.005), inter= 0, x_l= 100)`Arguments:* `k`: spline basis size* `lambda`: a vector of length 2 specifying smoothing parameters 'lambda' for the 1D thin plate spline* `inter`: intercept value* `x_l`: number of x values a which to evaluate 1D smooth```{r, warning= F}require(mgcv)require(mvtnorm)sim_spline_1D<-function(k=7, lambda=c(0.05, 0.005), inter=0, x_l=100) {# Define the spatial domain x<-seq(0, 1, length.out= x_l) resp<-rnorm(x_l)# Open a null file connection to discard the 'jagam()' text output# Check the operating systemif (.Platform$OS.type =="unix") { null_device<-"/dev/null" } else { # Windows null_device<-"NUL" } con<-file(null_device, open="wt")# generate a template gam model object for JAGS to extract the relevant components jd<-jagam(resp ~s(x, bs="tp", k= k), file= con, diagonalize= F)close(con) S1<- jd$jags.data$S1 zero<- jd$jags.data$zero# penalty matrix K1<- S1[1:(k-1),1:(k-1)] * lambda[1] + S1[1:(k-1),k:(2*k-2)] * lambda[2]# simulate multivariate normal basis coefficients beta_sim<-c(inter, mvtnorm::rmvnorm(n=1, mean= zero[2:k], sigma= K1, method="chol"))# evaluate simulated function at x values fct_sim<- jd$jags.data$X %*% beta_simreturn(fct_sim)}# example# sim_1D<- sim_spline_1D(k= 7, lambda= c(0.05, 0.005), inter= 0, x_l= 100)# plot(sim_1D, type= "l")```Now, a function to simulate data from the model, given the known parameter values.Output values:* `X`: lagged predictor matrix* `Z`: lagged predictor matrix* `XZ`: product of `X` and `Z`* `Lag`: lag-encoding matrix* `tf1`, `tf2`, `tf3`: true lagged-effect function for main effect of `X`, main effect of `Z` and `XZ` interaction, respectively* `Y`: response vector```{r, fig.height= 20, fig.width= 18, fig.fullwidth= T}# function to simulate 1D-lagged process with cross-predictor interactionsim_Xpred_lag_of_1D<-function(N=200, nlags=20, residSD=1){# simulate time-varying predictor, where each column represents a time lag w.r.t. the observation X<-matrix(runif(N*nlags), N, nlags) Z<-matrix(runif(N*nlags), N, nlags) XZ<- X * Z# true coefficient functions: main X effect tf1<-sim_spline_1D(x_l= nlags)# true coefficient functions: main Z effect tf2<-sim_spline_1D(x_l= nlags)# true coefficient functions: XZ interaction tf3<-sim_spline_1D(x_l= nlags)# Linear predictor LP<- X %*% tf1 + Z %*% tf2 + XZ %*% tf3# observations Y<-rnorm(n= N, mean= LP, sd= residSD)# index matrix Lag<-matrix(0:(nlags-1), N, nlags, byrow= T)list(X= X, Z= Z, XZ= XZ, Y= Y, tf1= tf1, tf2= tf2, tf3= tf3, Lag= Lag)}```Let's simulate 4 random data sets from 4 different sets of random model parameters, and plot the estimates of $f_.(d)$ (labelled "f.(Lag)") with their 95% confidence intervals (black), together with the true values (red). The x-axis shows $d$ (labelled "Lag").```{r}par(mfcol=c(3, 4))set.seed(53) # for reproducibilityfor(i in1:4){ dat<-sim_Xpred_lag_of_1D(N=200, nlags=20, residSD=1) mod<-gam(Y ~s(Lag, by= X, k=ncol(dat$Lag)-1) +s(Lag, by= Z, k=ncol(dat$Lag)-1) +s(Lag, by= XZ, k=ncol(dat$Lag)-1), data= dat)plot(mod, se=1.96, select=1, main=paste("Simulation", i, "\nX main effect"), ylab="f1(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf1, col=2)abline(h=0, lty=3, col=grey(0.6))plot(mod, se=1.96, select=2, main=paste("Simulation", i, "\nZ main effect"), ylab="f2(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf2, col=2)abline(h=0, lty=3, col=grey(0.6))plot(mod, se=1.96, select=3, main=paste("Simulation", i, "\nXZ interaction"), ylab="f3(Lag)")lines(0:(ncol(dat$Lag)-1), dat$tf3, col=2)abline(h=0, lty=3, col=grey(0.6))}```> In all cases, $f(d)$, the effect of $X$ across lags $D$ is well recovered by the model, with the true value most of the time contained in the 95% CI of the estimates.# Proposed exercises* Level 1: repeat the analysis above (spatial lags of SST and sandeel). * Level 2: fit a model with the effects of two temporal layers of SST (of your choice) interacting, over a range of spatial lags.In all cases, pay attention to the structure of the model inputs (vectors and matrices) and take some time to reflect on the interpretation of the model outputs.