Model checking and model validation

David L Miller

Biomathematics and Statistics Scotland

UK Centre for Ecology and Hydrology

We’ve fitted a model,
but is it good?

How good is the fit?

Can we just look at fitted vs. observed values?

This can be useful but there are lots of things to think about.

Fit
Out-of-sample performance
Distributional assumptions
… more!

plot(b$y, fitted(b), asp=1)
abline(a=0, b=1, lwd=2, col="red")

How good is the fit?

We have:
- \(R^2\)
- deviance explained
But they don’t tell us about out-of-sample performance
What to do?
- cross-validation!
- 2 papers: Wenger & Olden and Roberts et al
- R package blockCV

Cross-validation overview

Leave out some data when fitting
How good is the fit to the left-out data?

Figure from Mannocci et al, (2017).

What about assessing model assumptions?

Residuals: what are they?

Residuals

Lots of definitions
“Raw” residuals: observed value - fitted value
Doesn’t work well with non-normal data

Deviance residuals

(Skipping Pearson residuals)
Deviance in a GLM \(\approx\) sum of squares in LM
\(\Rightarrow\) Scale by the deviance (roughly: difference from the “perfect” likelihood)
Definition:

\[ \epsilon_i^D = \text{sign}(y_i - \mu_i)\sqrt{d_i} \]

where \(y_i\) is data, \(\mu_i\) is prediction and \(d_i\) is the deviance for observation \(i\)

Deviance residuals (II)

We expect \(\epsilon_i^D \sim N(0, \phi)\)
- where \(\phi\) is the scale parameter (\(\sigma\) for normal, 1 for Poisson etc)
We can check distribution of \(\epsilon_i^D\) to see if it is normal

What does that tell us?

Want \(\epsilon_i^D \sim N(0, \phi)\), why?
Want: more small residuals, fewer larger ones
\(\Rightarrow\) better fit
We can dig into this more looking at a quantile-quantile plot

Quantile-quantile plot of residuals

Histogram can look okay, but still be problems
Q-Q plot tells us more
Close to line == good
Band tells us what are reasonable values for the model

qq.gam(b, asp=1, rep=200)

What about a \(t\) distribution?

Some tail action?
Would a scaled \(t\) distribution do a better job?
(No)
family=scat()

b_t <- gam(COSMOS_VWC ~ s(ndate),
           data=wwoods, method="REML",
           family=scat())
qq.gam(b_t, asp=1, rep=200)

What about other distributional assumptions?

Variance relationship

Plotting the residuals vs. predictions
- (on link scale)
Expect a “blob”, no increase/decrease in spread

plot(fitted(b), residuals(b))

Bad variance relationship

Here variance increases with prediction
\(\Rightarrow\) incorrect mean-variance relationship

These are all useful plots, what if they were in the same place?

👌 `gam.check()` 👌

Selecting between/within models

Let’s add a term to our model

Let’s model all the sites in Scotland
Add a site-specific random effect
2 models:
- b0: COSMOS_VWC ~ s(ndate)
- b1: COSMOS_VWC ~ s(ndate) + s(SITE_ID, bs="re")

Let’s add a term to our model

Comparing models by AIC

Akaike’s information criterion
Can just use AIC() in R:

         df      AIC
b0 10.68555 122738.4
b1 15.89472 109391.2

b1 is the better model!

AIC problems

Comparing between discrete models
Do we think this is a good idea?
- (hint: probably not)
What about “shrinking” to the right model?

Term selection during model fitting

Adding some more variables

b_all <- gam(COSMOS_VWC ~ s(ndate) +
                          s(SITE_ID, bs="re") +
                          s(ALTITUDE, k=5) +
                          SOIL_TYPE,
             data=scot_sites, method="REML")

What does that look like?

summary(b_all)


Family: gaussian 
Link function: identity 

Formula:
COSMOS_VWC ~ s(ndate) + s(SITE_ID, bs = "re") + s(ALTITUDE, k = 5) + 
    SOIL_TYPE

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             40.991      4.860   8.435   <2e-16 ***
SOIL_TYPEOrganic soil    3.909     17.766   0.220    0.826    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df        F p-value    
s(ndate)    8.841  8.992  165.575  <2e-16 ***
s(SITE_ID)  2.999  3.000 4753.831  <2e-16 ***
s(ALTITUDE) 1.000  1.000    0.174   0.676    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.579   Deviance explained = 57.9%
-REML =  54725  Scale est. = 49.913    n = 16208

What does that look like?

plot(b_all, pages=1)

Term selection

Looks like s(ALTITUDE) is not needed?
But still a slope? (EDF=1)
Can we remove? (Yes)
Can we remove during fitting? (Also yes)

Why is that slope still there?

Going back to penalties…

\[ \lambda \int_\mathbb{R} \left( \frac{\partial^2 s(x)}{\partial x^2}\right)^2 \text{d}x\\ \]

bits of \(s(x)\) without 2nd derivatives have no penalty

Term selection via shrinkage basis

as we make \(\lambda\) bigger, the smooth gets flatter and flatter
EDF gets smaller
up to a point…

Term selection via shrinkage basis

bs="ts" as we make \(\lambda\) bigger, then penalty works for nullspace terms

Shrinkage basis

b_all_ts <- gam(COSMOS_VWC ~ s(ndate) +
                             s(SITE_ID, bs="re") +
                             s(ALTITUDE, k=5, bs="ts") +
                             SOIL_TYPE,
                data=scot_sites, method="REML")

What does that look like?

summary(b_all_ts)


Family: gaussian 
Link function: identity 

Formula:
COSMOS_VWC ~ s(ndate) + s(SITE_ID, bs = "re") + s(ALTITUDE, k = 5, 
    bs = "ts") + SOIL_TYPE

Parametric coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             39.821      3.550  11.216   <2e-16 ***
SOIL_TYPEOrganic soil   10.113      8.698   1.163    0.245    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                  edf Ref.df        F p-value    
s(ndate)    8.8413793  8.992  165.574  <2e-16 ***
s(SITE_ID)  3.9983209  4.000 3777.345  <2e-16 ***
s(ALTITUDE) 0.0002275  4.000    0.008   0.637    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.579   Deviance explained = 57.9%
-REML =  54728  Scale est. = 49.913    n = 16208

What does that look like?

plot(b_all_ts, pages=1)

Shrinkage recap

We don’t need s(ALTITUDE) anyway
- s(SITE_ID, bs="re") handles all the variation
We can apply to all terms
- Slightly slower fitting

Recap

\(R^2\)/deviance explained don’t tell you that much
Cross-validation is the gold standard
gam.check to see residual diagnostics
AIC() works like you expect
Can remove terms while fitting using bs="ts"

Model checking and model validation

We’ve fitted a model, but is it good?

How good is the fit?

How good is the fit?

Cross-validation overview

What about assessing model assumptions?

Residuals: what are they?

Residuals

Deviance residuals

Deviance residuals (II)

What does that tell us?

Quantile-quantile plot of residuals

What about a \(t\) distribution?

What about other distributional assumptions?

Variance relationship

Bad variance relationship

These are all useful plots, what if they were in the same place?

👌 gam.check() 👌

Selecting between/within models

Let’s add a term to our model

Let’s add a term to our model

Comparing models by AIC

AIC problems

Term selection during model fitting

Adding some more variables

What does that look like?

What does that look like?

Term selection

Why is that slope still there?

Term selection via shrinkage basis

Term selection via shrinkage basis

Shrinkage basis

What does that look like?

What does that look like?

Shrinkage recap

Recap

We’ve fitted a model,
but is it good?

👌 `gam.check()` 👌