Exercise 4: Space-time smoothing

Author

Thomas Cornulier - thomas.cornulier@bioss.ac.uk

Introduction

This practical sheet will enable you to:

  1. gain more familiarity with the lecture contents, by repeating the analyses from the lecture yourself;
  2. test your understanding of the model specification and interpretation, by doing similar analyses with less guidance;
  3. explore further ideas not covered in the lecture.

There is (intentionally) more exercise material than you are likely to be able to go through during the online workshop.

All sections are independent and optional. You may pick and choose according to your interests and stop wherever you wish.

NOTE 1: as these exercises fit large models (many parameters) to a large dataset, model fitting with gam (as done in the lecture slides) would take too long given the time we have for the practical. I therefore recommend you use the function bam instead to fit your models. The syntax and the interpretation of the outputs are the same as with gam; it will just get you there faster, thanks to clever approximations that are efficient for large data sets.
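If bam itself still feels slow, it can often be sped up further with discrete= TRUE (covariate discretisation; see ?bam), which requires the "fREML" fitting method. A minimal sketch (the model formula and the object name b_fast are just for illustration, using the scot_sites data created in the Setup below):

```r
library(mgcv)

# same formula syntax as gam()/bam(); discrete= TRUE discretises the
# covariates for extra speed, and requires method= "fREML"
b_fast <- bam(VWC ~ SOIL_TYPE + s(ndate),
              data= scot_sites,
              method= "fREML", discrete= TRUE)
```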

NOTE 2: if a piece of code isn’t working for you, it is quite possibly because it uses an object that was created in a code chunk further up in the practical that you didn’t run.

Setup

We’ll first load the dataset, subset it, and load the packages mgcv and maps (the latter is convenient for plotting the UK outline, but not essential). We’ll then do a bit of house-keeping, to get the variables and variable names we want.

(If you don’t have the data, you can download them from here)

try(load("cosmos-processed.RData"), silent= T)
try(load("../data/cosmos-processed.RData"), silent= T)
try(load("data/cosmos-processed.RData"), silent= T)

# shorten response variable name (for convenience)
# (We'll just make a copy, in case you prefer to stick with the long names)
all_sites$VWC <- all_sites$COSMOS_VWC
all_sites$LAT <- all_sites$LATITUDE
all_sites$LON <- all_sites$LONGITUDE

# compute "DAY OF THE YEAR" variable
all_sites$doy<- as.numeric(format(all_sites$DATE_TIME, "%j"))

# make a subset of Scottish data
scot_sites<- subset(all_sites, SITE_NAME %in% 
  c("Crichton", "Easter Bush", "Hartwood Home",
  "Cochno", "Balruddery", "Glensaugh"))

library(mgcv)
Loading required package: nlme
This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
library(maps)

Where did we leave the analysis?

Let’s re-fit the model from the previous exercise:

scot_sites<- subset(all_sites, SITE_NAME %in% 
  c("Crichton", "Easter Bush", "Hartwood Home",
  "Cochno", "Balruddery", "Glensaugh"))

b_all_ts<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    s(SITE_ID, bs= "re") +
                    s(ALTITUDE, k= 5, bs= "ts"),
                data= scot_sites, method= "REML")

par(mfrow= c(1, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)
plot(b_all_ts)

summary(b_all_ts)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate) + s(SITE_ID, bs = "re") + s(ALTITUDE, 
    k = 5, bs = "ts")

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.843      3.572  11.155   <2e-16 ***
SOIL_TYPEOrganic soil   10.070      8.806   1.144    0.253    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df        F p-value    
s(ndate)    8.838698  8.992  165.562  <2e-16 ***
s(SITE_ID)  3.991128  4.000 3472.320  <2e-16 ***
s(ALTITUDE) 0.007435  4.000    8.254   0.639    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.579   Deviance explained = 57.9%
-REML =  54728  Scale est. = 49.915    n = 16208

Let’s pause and reflect:

  • Did we allow enough flexibility in our smooth terms?
  • How many unique ALTITUDE values are there in these data?
  • What would happen if we increased the k value for the s(ALTITUDE) term? Why?
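As a hint for the second question, you can count the distinct values directly (a quick check):

```r
# with only 6 Scottish sites, ALTITUDE can take at most 6 unique values,
# so k for s(ALTITUDE) cannot usefully exceed that
length(unique(scot_sites$ALTITUDE))
```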

1. Is there a seasonal pattern?

We’re looking for a repeatable pattern year-on-year, so:

  • We need a predictor that codes for the day of the year (or week, or month of the year)
  • The fitted values on 31st December and on 1st Jan should be similar (ends should match)
  • Enter the cyclic type of spline (bs= "cc").
b_scot_season<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    s(SITE_ID, bs= "re") +
                    s(doy, bs= "cc", k= 20),
                data= scot_sites, method= "REML")

summary(b_scot_season)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate) + s(SITE_ID, bs = "re") + s(doy, bs = "cc", 
    k = 20)

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.817      3.570  11.154   <2e-16 ***
SOIL_TYPEOrganic soil   10.180      8.744   1.164    0.244    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
s(ndate)    8.849  8.993  223.7  <2e-16 ***
s(SITE_ID)  3.999  4.000 5614.5  <2e-16 ***
s(doy)     16.007 18.000  433.4  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.706   Deviance explained = 70.7%
-REML =  51838  Scale est. = 34.786    n = 16208

What does the model summary tell us? What value of k did mgcv automatically choose for us, for s(ndate)?
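One way to answer the k question without consulting the help pages is to interrogate the fitted model object: each element of b_scot_season$smooth stores its basis dimension in $bs.dim (a sketch relying on mgcv's internal smooth representation):

```r
# basis dimension (k) actually used for each smooth term
sapply(b_scot_season$smooth,
       function(sm) setNames(sm$bs.dim, sm$label))
```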

Each of these terms in the “linear predictor” of the model contributes its own additive part to the overall model prediction.

To visualise the contribution for a given value of each predictor, we can use the default plot function:

par(mfrow= c(1, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)
plot(b_scot_season)

To visualise the part played by each term in predicting the data time series, we need to use the predict.gam function manually, asking for a term-specific prediction for each observation (type= "terms"). This produces a matrix of predictions with as many columns as there are terms in the model.

Note that all smooth terms in this model (and in the models we cover in this course) are centered around zero, so we need to add both the general and the site-specific intercepts to each of them, if we want meaningful scaling when plotting the predictions over the data.

The general intercept value (“constant”) is stored separately (rather inconveniently!), and can be recovered using attr(my_model_predictions, "constant") or coef(my_model)[1].

# term-specific predictions matrix (marginals)
pred_b_seas<- predict(b_scot_season, type= "terms")

# visualise the object
head(pred_b_seas, n= 4)
     SOIL_TYPE  s(ndate) s(SITE_ID)    s(doy)
2783         0 0.9933032  -8.489696 -4.599857
2784         0 0.9970423  -8.489696 -4.620554
2785         0 1.0007814  -8.489696 -4.649049
2786         0 1.0045204  -8.489696 -4.684778
# full model predictions (sum of all columns)
pred_full<- apply(pred_b_seas, 1, sum) +
        attr(pred_b_seas, "constant")
# model predictions excluding `ndate` effect (column 2)
pred_notrend<- apply(pred_b_seas[, -2], 1, sum) +
        attr(pred_b_seas, "constant")
# model predictions excluding `doy` effect (column 4)
pred_noseason<- apply(pred_b_seas[, -4], 1, sum) +
        attr(pred_b_seas, "constant")

To help understand this model better, let’s now take it apart and plot the data (using a single site, for clarity) together with the marginal predictions for that site, from different subsets of terms. Note that each set of predictions includes the random intercept for the corresponding site, again so that they scale with the data.

tip: jan1<- sort(unique(all_sites[all_sites$doy == 1, "ndate"])) gives us all the ndates that are a 1st of Jan. abline(v= jan1, col= "grey3", lty= 3) will add a vertical line at the start of each year, to facilitate the interpretation.

# Hartwood Home indicator
HARTW<- scot_sites$SITE_ID == "HARTW"

jan1<- sort(unique(all_sites[all_sites$doy == 1, "ndate"]))

par(mfrow= c(1, 1), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

plot(VWC ~ ndate, data= scot_sites[HARTW, ], type= "l", 
        col= grey(0.4), main= "Hartwood Home")
abline(v= jan1, col= "grey3", lty= 3)
lines(x= scot_sites$ndate[HARTW], 
        y= pred_notrend[HARTW], 
        col= 4, lwd= 4)
lines(x= scot_sites$ndate[HARTW], 
        y= pred_noseason[HARTW], 
        col= 2, lwd= 4)

legend(x= 16100, y= 85, 
    legend= c(
      expression(paste("marginal seasonal effect: s(doy, bs= \"cc\")")),
      expression(paste("marginal temporal trend: s(ndate, bs= \"tp\")")),
      "full model"),
    lty= 1, lwd= c(4, 4, 6), col= c(4, 2, 1), cex= 2, bty= "n")
              
# Now, add the full model predictions (black)

lines(x= scot_sites$ndate[HARTW], 
        y= pred_full[HARTW], 
        col= 1, lwd= 6)

2. Is there any spatial trend?

We’re now going to work with the full spatial extent of the data, 51 sites across the UK (data= all_sites).

Just for visual exploration, we can look at the mean soil moisture per site:

site_mean<- aggregate(cbind(LAT, LON, VWC) ~ SITE_ID,
                      data= all_sites, FUN= mean)

head(site_mean, 3)
  SITE_ID      LAT       LON      VWC
1   ALIC1 51.15355 -0.858232 33.89351
2   BALRD 56.48230 -3.111488 31.45031
3   BICKL 53.02635 -2.700530 30.40025
par(mfrow= c(1, 1), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)
symbols(x= site_mean[, c("LON", "LAT")],
        circles= site_mean$VWC,
        inches= 0.15, bg= "green3", fg= "green4", ylim= c(50, 60), asp= 1)
# add country borders
map('world', 'UK', add= T)

A 2D spline using s(LON, LAT)

If we want to estimate the average spatial trend while accounting for (“conditional on”) average temporal trends in the data (seasonal and non-seasonal), we just need to add the spatial smooth to the model. We can set a high k value for the spatial effect, in case the true trend is complex (but it has to be smaller than the number of unique sites).

Note that for the sake of demonstration, in this model I will drop the SITE_ID random effect (which we can view as a very flexible, implicitly spatial term, without an explicit spatial structure), to ensure that there is no possible confounding with the smooth, spatially-explicit (LAT,LON) trend.
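A quick sanity check on the choice of k (a sketch): the spatial smooth cannot be more complex than the number of distinct site locations, so k= 45 leaves a little head-room below the 51 sites.

```r
# number of distinct (LON, LAT) locations in the data
nrow(unique(all_sites[, c("LON", "LAT")]))
```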

b_all_s<- bam(VWC ~ SOIL_TYPE +
                    s(ndate, k= 20) +
                    s(doy, bs= "cc", k= 20) +
                    s(LON, LAT, k= 45), 
                    data= all_sites, method= "REML")

Plotting all (smooth) model terms:

par(mfrow= c(1, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

plot(b_all_s, select= 1, scheme= 2)
plot(b_all_s, select= 2, scheme= 2)
plot(b_all_s, select= 3, scheme= 2, asp= 1)
# add country borders
map('world', 'UK', add= T)

Checking model summary:

summary(b_all_s)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate, k = 20) + s(doy, bs = "cc", k = 20) + 
    s(LON, LAT, k = 45)

Parametric coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                             30.97737    0.08416  368.07   <2e-16
SOIL_TYPEMineral soil                    4.93474    0.10732   45.98   <2e-16
SOIL_TYPEOrganic soil                   37.28593    0.21405  174.19   <2e-16
SOIL_TYPEOrganic soil over mineral soil  3.32671    0.29072   11.44   <2e-16
                                           
(Intercept)                             ***
SOIL_TYPEMineral soil                   ***
SOIL_TYPEOrganic soil                   ***
SOIL_TYPEOrganic soil over mineral soil ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
             edf Ref.df      F p-value    
s(ndate)   18.93     19 1913.0  <2e-16 ***
s(doy)     17.71     18  544.5  <2e-16 ***
s(LON,LAT) 43.99     44 7638.3  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.848   Deviance explained = 84.8%
-REML = 4.1039e+05  Scale est. = 36.887    n = 127228

Was dropping the site random effect a good idea?

Think:

  • how many sites do we have in total?
  • how many edf does the smooth spatial effect account for?
  • how much spatial smoothing has effectively taken place, in this model?
  • what was the max df we allowed, for that term?
# Indeed, the complexity of the spatial term (its *edf*) 
# is bumping against the maximum we have set. If we increase this, 
# we'll soon reach the number of sites, and so do no better 
# than a spatially unstructured random intercept.
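To check these numbers yourself (a sketch; s.table is the smooth-term table returned by summary.gam):

```r
# total number of sites
length(unique(all_sites$SITE_ID))

# edf actually used by the spatial smooth
summary(b_all_s)$s.table["s(LON,LAT)", "edf"]
```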

It’s worth noting that models can (sometimes advantageously) include both spatially structured (“spatially explicit”) and standard (i.e. spatially unstructured) random effects. The latter can be called “spatially implicit”, when the levels represent different spatial locations.

3. Does the seasonal pattern change with latitude? (space-time smoothing)

An (incorrect) approach to space-time smoothing

It is tempting to simply extend the bivariate smooths we’ve just used, to model the interaction between the seasonal variation term and latitude: s(doy, LAT). k needs to be fairly high, because we want the 2D spline bases to give a good coverage of all combinations of doy and LAT. Note that k= 40 in 2D corresponds only roughly to k= sqrt(40) per dimension in 1D, which is about 6, so still not very high.

(In this model, I’ve removed the not very useful spatial trend and replaced it with the standard random SITE_ID effect)

b_seaslatno<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    s(doy, LAT, k= 40) +
                    s(SITE_ID, bs= "re"),
                    data= all_sites, method= "REML")

par(mfrow= c(1, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)
plot(b_seaslatno, scheme= 1)

The very odd pattern for the middle term, s(doy, LAT), is due to the fact that the s(X1, X2) function assumes isotropy, i.e. the same amount of smoothing in all directions.

Because the spatial and temporal distance units are arbitrary and not comparable in this case, this can create distortions. Indeed, we have

  • LAT: 50 - 56° North (6 units range)
  • doy: 1 - 366 days (365 units range),

Even if the ranges looked similar, there is no expectation in principle that the same amount of smoothing should be suitable for both dimensions.
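You can verify these ranges directly:

```r
# ranges of the two margins entering s(doy, LAT)
diff(range(all_sites$LAT))  # about 6 degrees of latitude
diff(range(all_sites$doy))  # about 365 days
```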

Also, multivariate s() terms force us to use the same type of smoother for all the dimensions, when what we would actually need is

  • a cyclic effect for doy
  • a non-cyclic effect for LAT.

A better approach: multidimensional smoothing with anisotropic tensor product smooths

A tensor-product smooth (a product of 1D splines, one per margin), like te(doy, LAT, bs= c("cc", "tp")), resolves the issues above by providing:

  • Margin-specific spline types
  • Margin-specific df, and penalties.
b_seaslat<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    te(doy, LAT, bs= c("cc", "tp"), k= c(20, 15)) +
                    s(SITE_ID, bs= "re"),
                    data= all_sites, method= "REML")

Model summary:

summary(b_seaslat)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate) + te(doy, LAT, bs = c("cc", "tp"), 
    k = c(20, 15)) + s(SITE_ID, bs = "re")

Parametric coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                               32.229      2.938  10.969  < 2e-16
SOIL_TYPEMineral soil                      5.322      3.312   1.607   0.1081
SOIL_TYPEOrganic soil                     27.315      4.523   6.039 1.55e-09
SOIL_TYPEOrganic soil over mineral soil   20.820      9.036   2.304   0.0212
                                           
(Intercept)                             ***
SOIL_TYPEMineral soil                      
SOIL_TYPEOrganic soil                   ***
SOIL_TYPEOrganic soil over mineral soil *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                edf Ref.df      F p-value    
s(ndate)      8.992      9 1783.1  <2e-16 ***
te(doy,LAT) 230.640    258  458.8  <2e-16 ***
s(SITE_ID)   43.566     46 3688.1  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.843   Deviance explained = 84.4%
-REML = 4.1274e+05  Scale est. = 38.107    n = 127228

It is always worth spending some time inspecting key features:

  • number of edf vs number of reference df
  • are all terms supported by the data (significant p-values)?
  • do edf and p-values look plausible?
  • anything potentially problematic?

Plotting all terms helps with interpretation a lot:

par(mfrow= c(1, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)
plot(b_seaslat, scheme= 1)

But in multiple dimensions, interpretation can remain tricky.

Interpreting the 3D space-time smooth

If the multidimensional object is a bit complex, an effective approach can be to slice it in meaningful ways. Here, for example, predicting the seasonal trend (the doy effect) for a range of set latitudes allows us to focus on how the seasonal trend changes up and down the country. The converse (predicting the latitudinal profile at a given time of the year) seems less useful here, but may be in some contexts.

  1. Create a regular sequence of 9 values over the range of observed latitudes
LAT_vec<- seq(min(all_sites$LAT), 
                max(all_sites$LAT), 
                l= 9)
  2. For each latitude (using a loop),
    1. predict the corresponding doy effect, fixing other variables c("SOIL_TYPE", "ndate", "SITE_ID") at an arbitrary value (I used values of the first observation in the data, purely for convenience)
    2. plot the predicted values
par(mfrow= c(3, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

for(i in 1:9){
  predicted<- predict(b_seaslat, 
        newdata= data.frame(
                all_sites[1, c("SOIL_TYPE", "ndate", "SITE_ID")],
                doy= 1:365,
                LAT= LAT_vec[i]))

  plot(1:365, predicted, type= "l", lwd= 5, 
        xlab= "Day of the year", 
        main= paste("LAT=", round(LAT_vec[i], 1)),
        ylim= c(20, 60), col= 3, cex.main= 2)
}

There is a clear change in seasonality with latitude, the South being more seasonal (having much larger amplitude) than the North. The difference is most obvious at the two extreme latitudes. Sites in-between are somewhat more similar to each other, yet with idiosyncratic local variation (including in the average wetness across the whole year), likely driven by other factors.
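To put a number on this change in amplitude, one option (a sketch reusing LAT_vec and the same prediction set-up as above; amplitude is a new object name used for illustration) is to compute the peak-to-trough range of the predicted seasonal curve at each latitude:

```r
# peak-to-trough amplitude of the predicted seasonal effect, by latitude
amplitude <- sapply(LAT_vec, function(lat){
  p <- predict(b_seaslat,
        newdata= data.frame(
                all_sites[1, c("SOIL_TYPE", "ndate", "SITE_ID")],
                doy= 1:365,
                LAT= lat))
  diff(range(p))
})
round(data.frame(LAT= LAT_vec, amplitude= amplitude), 1)
```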

[optional trick] Variance decomposition for nested effects

To facilitate interpreting more complex non-linear interaction terms between two or more predictors, like te(doy, LAT, bs= c("cc", "tp")), it is possible to decompose this smooth into orthogonal main effects and interaction terms, using a specially designed tensor product function, called ti.

  • The main effect for doy becomes ti(doy, bs= "cc")
  • The main effect for LAT becomes ti(LAT, bs= "tp")
  • The pure interaction part becomes ti(doy, LAT, bs= c("cc", "tp"))
b_decomp<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    ti(doy, bs= c("cc"), k= c(20)) +
                    ti(LAT, bs= c("tp"), k= c(15)) +
                    ti(doy, LAT, bs= c("cc", "tp"), k= c(10, 10)) +
                    s(SITE_ID, bs= "re"),
                    data= all_sites, method= "REML")

summary(b_decomp)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate) + ti(doy, bs = c("cc"), k = c(20)) + 
    ti(LAT, bs = c("tp"), k = c(15)) + ti(doy, LAT, bs = c("cc", 
    "tp"), k = c(10, 10)) + s(SITE_ID, bs = "re")

Parametric coefficients:
                                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                               32.207      2.914  11.054  < 2e-16
SOIL_TYPEMineral soil                      5.242      3.291   1.593   0.1112
SOIL_TYPEOrganic soil                     27.778      4.506   6.165 7.08e-10
SOIL_TYPEOrganic soil over mineral soil   20.763      8.929   2.325   0.0201
                                           
(Intercept)                             ***
SOIL_TYPEMineral soil                      
SOIL_TYPEOrganic soil                   ***
SOIL_TYPEOrganic soil over mineral soil *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df         F  p-value    
s(ndate)     8.992  9.000  1764.974  < 2e-16 ***
ti(doy)     17.706 18.000 12611.922  < 2e-16 ***
ti(LAT)      4.322  4.322     5.093 0.000342 ***
ti(doy,LAT) 65.709 72.000   205.026  < 2e-16 ***
s(SITE_ID)  42.667 46.000  3641.087  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.842   Deviance explained = 84.2%
-REML = 4.1324e+05  Scale est. = 38.542    n = 127228
par(mfrow= c(2, 2))
plot(b_decomp, select= 2, scheme= 1)
plot(b_decomp, select= 3, scheme= 1)
plot(b_decomp, select= 4, scheme= 1)

Plot 1 shows the marginal average effect of doy.

Plot 2 shows the marginal average effect of LAT.

Plot 3 shows the interaction between doy and LAT, to be interpreted as a deviation from the marginal main effects. This leads to a similar interpretation as the “9 slices” plots just earlier:

  • at the lower latitudes, the interaction has a similar shape to the main doy effect, which amplifies the seasonal trend there;
  • at the highest latitudes, it has a mirror-image shape, which tends to cancel the seasonal trend.

Beyond helping with the interpretation of the patterns in the data, the ti smooths can be particularly useful for identifying which of these interaction components are non-zero. In contrast, the standard tensor product te only tests the significance of the term as a whole (main effects and interaction combined), i.e., it tests whether at least one of these 3 components is non-zero.
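As a quick check that the decomposed model tells essentially the same story as the te() version (a sketch; both b_seaslat and b_decomp must have been fitted above):

```r
# the two parameterisations should fit about equally well
AIC(b_seaslat, b_decomp)

# and the pure interaction can now be examined on its own
summary(b_decomp)$s.table["ti(doy,LAT)", ]
```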

Exercises

Testing your familiarity

  • Try working through the above using a different set of predictors of your choice.
  • Remember to reflect on model outputs:
    • what does the model summary, including p-values, edf and ref.df tell us? Does everything look plausible?
    • what do the model plots mean and what inference can we draw from them?

Multivariate smooths

  • Which predictors might make sense to include as part of multivariate smooths?
  • For those, could/would you use s() or te()?

[Optional - extra experimenting]

  • You may want to play with different options for arguments bs and k, to experiment and see what works or doesn’t, and reflect on what may or may not make sense based on your own domain expertise.
    • for k, see examples from the lecture. More info in ?s and ?te
    • for bs, best to play with either “cc” or “tp”. If curious about alternative options, see ?smooth.terms. But beware: it’s a rabbit hole and we won’t have time in this course to explore or discuss much of the possible options!

[Optional - extra thinking] Spatial trend + random effects

We could fit a model with both the spatially explicit trend and the random SITE_ID intercept (spatially implicit, unstructured):

b_all_sre<- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    s(doy, bs= "cc", k= 20) +
                    s(SITE_ID, bs= "re") +
                    s(LON, LAT, k= 45), 
                    data= all_sites, method= "REML")

par(mfrow= c(1, 2), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

plot(b_all_s, select= 3, scheme= 2, asp= 1)
map('world', 'UK', add= T)
symbols(x= site_mean[, c("LON", "LAT")],
        circles= site_mean$VWC,
        bg= "green3", fg= "green4",
        inches= 0.1, add= T)

plot(b_all_sre, select= 4, scheme= 2, asp= 1)
map('world', 'UK', add= T)
symbols(x= site_mean[, c("LON", "LAT")],
        circles= site_mean$VWC,
        bg= "green3", fg= "green4",
        inches= 0.1, add= T)

What difference does this make? Does it make sense?

(Unfold code below, for some thoughts on the problem)

Code
# Average wetness at a location is a combination of spatially 
# structured variation (caused by predictors varying at 
# large scales) and spatially unstructured variation (caused 
# by site-specific variation or by predictors varying at 
# scales smaller than the distance between our sites). 

# The model with random effects is able to capture both spatially 
# structured and unstructured variation, and suggests both types 
# are supported by the data, making this model likely more 
# appropriate and useful.

Contrast the estimated spatial trends above with the average annual rainfall over the UK.

Cyclic smoothers

  • Try fitting a seasonal model with a different time variable, like the week or month of the year.
  • Does that make any difference compared to using doy?
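A possible starting point (a sketch; the new week and month columns and the model object b_month are just for illustration). With a cyclic smooth over months, it also helps to set knots= list(month= c(0.5, 12.5)), so that December and January are treated as one month apart:

```r
# derive coarser time-of-year variables from the date
scot_sites$week  <- as.numeric(format(scot_sites$DATE_TIME, "%U"))  # week of year
scot_sites$month <- as.numeric(format(scot_sites$DATE_TIME, "%m"))  # month (1-12)

# a monthly seasonal term; k must stay below the number of unique values (12)
b_month <- bam(VWC ~ SOIL_TYPE +
                    s(ndate) +
                    s(SITE_ID, bs= "re") +
                    s(month, bs= "cc", k= 10),
                data= scot_sites, method= "REML",
                knots= list(month= c(0.5, 12.5)))
summary(b_month)
```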

[Optional - extra thinking] Interplay between constrained vs. more flexible smooths #1

I left the k value for the s(ndate) term a bit low in the model b_scot_season. Almost certainly too low, as the estimated edf is up against the maximum possible (the default is k= 10, which allows about 9 edf; the Ref.df column in the summary also gives us an approximate idea of that maximum).

  • How about repeating the analysis and pushing k up, for example at 20?
    • how does this change the results, including the term-specific fits?
    • can you explain what happened?

tip: jan1<- sort(unique(all_sites[all_sites$doy == 1, "ndate"])) gives us all the ndates that are a 1st of Jan. abline(v= jan1, col= "grey3", lty= 3) will add a vertical line at the start of each year, to facilitate the interpretation.

b_scot_season_20<- bam(VWC ~ SOIL_TYPE +
                    s(ndate, k= 20) +
                    s(SITE_ID, bs= "re") +
                    s(doy, bs= "cc", k= 20),
                data= scot_sites, method= "REML")

summary(b_scot_season_20)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(ndate, k = 20) + s(SITE_ID, bs = "re") + 
    s(doy, bs = "cc", k = 20)

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.815      3.547  11.224   <2e-16 ***
SOIL_TYPEOrganic soil   10.192      8.689   1.173    0.241    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df       F p-value    
s(ndate)   18.784  18.96  320.99  <2e-16 ***
s(SITE_ID)  3.999   4.00 6801.93  <2e-16 ***
s(doy)     16.440  18.00   79.03  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.761   Deviance explained = 76.1%
-REML =  50245  Scale est. = 28.345    n = 16208

Making predictions term-by-term for the new model, b_scot_season_20.

Code
# term-specific predictions matrix (marginals)
pred_b_seas_20<- predict(b_scot_season_20, type= "terms")

# visualise the object
head(pred_b_seas_20, n= 4)
     SOIL_TYPE s(ndate) s(SITE_ID)      s(doy)
2783         0 6.120825  -8.574078  0.05417416
2784         0 6.159237  -8.574078 -0.10724161
2785         0 6.197649  -8.574078 -0.27812150
2786         0 6.236059  -8.574078 -0.45768379
Code
# full model predictions (sum of all columns)
pred_full_20<- apply(pred_b_seas_20, 1, sum) +
        attr(pred_b_seas_20, "constant")
# model predictions excluding `ndate` effect (column 2)
pred_notrend_20<- apply(pred_b_seas_20[, -2], 1, sum) +
        attr(pred_b_seas_20, "constant")
# model predictions excluding `doy` effect (column 4)
pred_noseason_20<- apply(pred_b_seas_20[, -4], 1, sum) +
        attr(pred_b_seas_20, "constant")
Code
# Hartwood Home indicator
HARTW<- scot_sites$SITE_ID == "HARTW"

jan1<- sort(unique(all_sites[all_sites$doy == 1, "ndate"]))

par(mfrow= c(2, 1), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

# plot the original model terms        
plot(VWC ~ ndate, data= scot_sites[HARTW, ], type= "l", 
        col= grey(0.4), main= "s(ndate, k= 10) + s(doy, bs= \"cc\", k= 20)")

abline(v= jan1, col= "grey3", lty= 3)

lines(x= scot_sites$ndate[HARTW], 
        y= pred_notrend[HARTW], 
        col= 4, lwd= 4)
lines(x= scot_sites$ndate[HARTW], 
        y= pred_noseason[HARTW], 
        col= 2, lwd= 4)

legend(x= 16100, y= 85, 
    legend= c(
      expression(paste("marginal seasonal effect: s(doy, bs= \"cc\")")),
      expression(paste("marginal temporal trend: s(ndate, bs= \"tp\")"))),
    lty= 1, lwd= c(4, 4), col= c(4, 2), cex= 2, bty= "n")

# plot the new model terms        
plot(VWC ~ ndate, data= scot_sites[HARTW, ], type= "l", 
        col= grey(0.4), main= "s(ndate, k= 20) + s(doy, bs= \"cc\", k= 20)")

abline(v= jan1, col= "grey3", lty= 3)
 
lines(x= scot_sites$ndate[HARTW], 
        y= pred_notrend_20[HARTW], 
        col= 4, lwd= 4)
lines(x= scot_sites$ndate[HARTW], 
        y= pred_noseason_20[HARTW], 
        col= 2, lwd= 4)

(Unfold code below, for some thoughts on the problem)

Code
# In the original model, the s(doy, bs= "cc") term picks up 
# what look like repeatable patterns year-on-year in the 
# data, while the s(ndate, bs= "tp", k= 10) term picks up any 
# remaining temporal trend. 
# The rest goes into the residual variation.

# In the second model, the term s(ndate, bs= "tp", k= 20) 
# has much more flexibility to pick up temporal trends that 
# are not repeatable year-on-year. Looking at the bottom 
# panel suggests that VWC in autumn (red peaks) is much 
# more variable year-on-year than late winter-spring 
# VWC (red troughs). The `ndate` term now has the ability 
# to pick up some of this variation, which as a result gets 
# taken away from the periodic `doy` term.

And a plot of the overall predictions of the two models.

par(mfrow= c(1, 1), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

# plot the original model terms        
plot(VWC ~ ndate, data= scot_sites[HARTW, ], type= "l", 
        col= grey(0.4), main= "Hartwood Home")

abline(v= jan1, col= "grey3", lty= 3)

lines(x= scot_sites$ndate[HARTW], 
        y= pred_full[HARTW], 
        col= 1, lwd= 3)

lines(x= scot_sites$ndate[HARTW], 
        y= pred_full_20[HARTW], 
        col= 3, lwd= 3)

legend(x= 16100, y= 85, 
    legend= c("Model \"pred_b_seas\"",
      "Model \"pred_b_seas_20\""),
    lty= 1, lwd= c(4, 4), col= c(1, 3), cex= 2, bty= "n")

(Unfold code below, for some thoughts on the problem)

Code
# Predictions are still relatively close, as some of the 
# variation has just been transferred from one term (and from 
# the residuals) to another. 
# But they are somewhat different, as we would expect from 
# the latter model having more flexibility overall.
  • What would happen if you push k very high, e.g. 100?
    • which of these models would you keep?
    • why?
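For a formal comparison of candidate models, AIC works on bam fits (a sketch; b_scot_season_100 is a new object name, refitting the seasonal model with k= 100 as suggested in the question). Lower AIC favours the extra flexibility, but weigh this against interpretability:

```r
# refit with a very high k for the s(ndate) trend term
b_scot_season_100 <- bam(VWC ~ SOIL_TYPE +
                    s(ndate, k= 100) +
                    s(SITE_ID, bs= "re") +
                    s(doy, bs= "cc", k= 20),
                data= scot_sites, method= "REML")

# compare the three candidate models
AIC(b_scot_season, b_scot_season_20, b_scot_season_100)
```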

[optional follow-on - overthinking] Interplay between constrained vs. more flexible smooths #2

Sticking with the scot_sites data and function bam,

  • Compare the following models:
  • s(ndate, k= 100, bs= "tp") + s(doy, k= 20, bs= "cc")
  • vs.
  • s(ndate, k= 100, bs= "tp")
  • compare:
    • the adjusted R^2 of these models
    • the fitted s(ndate) terms of the two models (clearer if using just one site)
    • the predictions from the whole models
  • can you make sense of what you see, here?

Fitting & plotting the two models:

# Model 1 (seasonal component)
b_hart_season_100<- bam(VWC ~ SOIL_TYPE + 
                    s(SITE_ID, bs= "re") + 
                    s(ndate, k= 100) +
                    s(doy, bs= "cc", k= 20),
                data= scot_sites,
                method= "REML")

# Model 2 (no seasonal component)
b_hart_100<- bam(VWC ~ SOIL_TYPE + 
                s(SITE_ID, bs= "re") + 
                s(ndate, k= 100),
                data= scot_sites,
                method= "REML")

# Model summaries (for the R^2 values)
summary(b_hart_season_100)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(SITE_ID, bs = "re") + s(ndate, k = 100) + 
    s(doy, bs = "cc", k = 20)

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.803      3.538  11.249   <2e-16 ***
SOIL_TYPEOrganic soil   10.238      8.667   1.181    0.238    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df        F p-value    
s(SITE_ID)  3.999   4.00 8451.993  <2e-16 ***
s(ndate)   87.454  90.73  128.152  <2e-16 ***
s(doy)     14.844  18.00    4.348  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.809   Deviance explained =   81%
-REML =  48582  Scale est. = 22.64     n = 16208
summary(b_hart_100)

Family: gaussian 
Link function: identity 

Formula:
VWC ~ SOIL_TYPE + s(SITE_ID, bs = "re") + s(ndate, k = 100)

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.802      3.527  11.286   <2e-16 ***
SOIL_TYPEOrganic soil   10.240      8.638   1.185    0.236    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
s(SITE_ID)  3.999   4.00 8422.9  <2e-16 ***
s(ndate)   95.764  98.72  229.5  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.808   Deviance explained = 80.9%
-REML =  48604  Scale est. = 22.719    n = 16208
# Plots of fitted terms (1 row per model)
par(mfrow= c(2, 3), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

plot(b_hart_season_100)
plot(b_hart_100)

(Unfold code below, for some thoughts on the problem)

Code
# The seasonal model still has a highly significant seasonal term.
# The `s(ndate)` fits in the seasonal and non-seasonal models 
# bear some resemblance, but the credible intervals are narrower 
# in the non-seasonal model.
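The difference in interval width can also be checked numerically, by extracting the term-wise standard errors of `s(ndate)` from each model with `predict(..., type= "terms", se.fit= TRUE)`. A minimal sketch, assuming the two models fitted above are still in the workspace:

```r
# per-observation standard errors of the s(ndate) term, in each model
se_seas<- predict(b_hart_season_100, type= "terms", se.fit= TRUE)$se.fit[, "s(ndate)"]
se_nose<- predict(b_hart_100, type= "terms", se.fit= TRUE)$se.fit[, "s(ndate)"]

# average SE in each model (we would expect the seasonal model's to be larger)
mean(se_seas)
mean(se_nose)
```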

Plotting the fitted s(ndate) functions (left) and the full model predictions (right), for the “Hartwood Home” site:

Code
par(mfrow= c(1, 2), mar= c(5.1, 6.1, 4.1, 2.1),
    cex.axis= 1.5, cex.lab= 2, cex.main= 2)

# plot the partial predictions        
matplot(x= scot_sites$ndate[HARTW],
        y= cbind(predict(b_hart_season_100, type= "terms")[HARTW, "s(ndate)"],
              predict(b_hart_100, type= "terms")[HARTW, "s(ndate)"]),
      lwd= 3, col= c(2, 3), lty= 1, type= "l",
      ylab= "s(ndate)", xlab= "ndate",
      main= "fitted `s(ndate)` functions")

legend(x= 16100, y= -10, 
    legend= c("Model b_hart_season_100",
      "Model b_hart_100"),
    lty= 1, lwd= 3, col= c(2, 3), cex= 1.5, bty= "n")


# plot the full predictions        
matplot(x= scot_sites$ndate[HARTW],
        y= cbind(predict(b_hart_season_100)[HARTW],
              predict(b_hart_100)[HARTW]),
      lwd= 3, col= c(2, 3), lty= 1, type= "l",
      ylab= "predicted VWC", xlab= "ndate",
      main= "full model predictions")
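The visual similarity of the two full-model prediction curves can be quantified directly from the prediction vectors; a quick sketch, again assuming the two fitted models from above:

```r
# full-model predictions, over all sites and dates
p1<- predict(b_hart_season_100)
p2<- predict(b_hart_100)

cor(p1, p2)          # correlation between the two sets of predictions
mean(abs(p1 - p2))   # mean absolute difference, on the VWC scale
```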

(Unfold code below, for some thoughts on the problem)

Code
# In the seasonal model, the s(doy, bs= "cc") term picks up 
# the patterns that repeat year-on-year in the data, while 
# the less constrained `s(ndate, bs= "tp", k= 100)` term 
# picks up any "left-over" temporal variation. 

# In the non-seasonal version, because the s(ndate, bs= "tp", k= 100) 
# has so much flexibility, it is able to capture the two 
# types of temporal variation at once.

# The overall model predictions are almost identical, and as 
# a result, so are the R^2 values. In effect, the seasonal 
# variation has just transferred to the s(ndate) term.

# Interestingly,

# 1. By manipulating the structure of the model and the 
# constraints on the terms, such as smoothness or cyclicity, 
# the analyst can effectively decompose the variation in 
# the data, to focus on structures of interest (for example, 
# the seasonal component of variation, here)

# 2. This example illustrates how measures of fit such as 
# R^2 (or even AIC) may be of very limited usefulness for 
# comparing models, in cases where the model is very 
# flexible and descriptive, and marginal changes in model 
# structure may simply bounce a roughly constant 
# amount of explained variation between terms.

# 3. When the less constrained `s(ndate)` term is given 
# sufficient flexibility, it becomes effectively 
# "interchangeable" with `s(doy)`. 
# This confounding / collinearity (called "concurvity" 
# in the GAM context) explains the inflation of the 
# credible intervals in the seasonal model.
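mgcv provides a diagnostic for exactly this: `concurvity()` returns, for each smooth, measures between 0 (no concurvity) and 1 (the term is completely confounded with the other terms). A sketch, assuming the seasonal model fitted above:

```r
# rows: "worst", "observed" and "estimate" measures;
# columns: one per model term. Values near 1 for s(ndate)
# and s(doy) would confirm they are largely interchangeable.
concurvity(b_hart_season_100, full= TRUE)
```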