lm(y ~ x1, data=dat)
Biomathematics and Statistics Scotland
UK Centre for Ecology and Hydrology
\[ y_i \sim \text{Normal}(\mu_i, \sigma_i) \]
\[ \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots \]
\[ y_i \sim \text{EF}(\mu_i, \sigma_i) \]
\[ g(\mu_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots \]
\(y_i \sim \text{EF}(\mu_i, \sigma_i); \qquad g(\mu_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots\)
What does that mean?
The the response, \(\mu_i\), is assumed to be a linear combination (weighted by the \(\beta\)s) of the covariates, \(x_{ji}\), with an intercept (\(\beta_0\)).
(for a few reasons)
Photos from https://www.core77.com/posts/55368/When-Splines-Were-Physical-Objects
\[ s(x) = \sum_k \beta_k b_k(x) \]
\[ \lambda \int_\mathbb{R} \left( \frac{\partial^2 s(x)}{\partial x^2}\right)^2 \text{d}x\\ \]
where \(\lambda\) is a smoothing parameter
\[ \mathbb{E}(Y_i) = g^{-1}\left[\beta_0 + \sum_j s_j(x_{ij})\right] \]
gam
, mgcv
, R-INLA
, brms
, gamlss
mgcv
glm()
-type syntaxRecommended
(ships with R)UKCEH data: COSMOS-UK https://cosmos.ceh.ac.uk/
Dataset from: https://catalogue.ceh.ac.uk/documents/5060cc27-0b5b-471b-86eb-71f96da0c80f
Using ✨cosmic rays✨ to measure moisture
# A tibble: 127,228 × 16
SITE_ID SITE_NAME LONGITUDE LATITUDE NORTHING EASTING ALTITUDE LAND_COVER
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
2 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
3 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
4 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
5 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
6 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
7 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
8 ALIC1 Alice Holt -0.858 51.2 139985 479950 80 Broadleaf woo…
# ℹ 127,220 more rows
# ℹ 8 more variables: SOIL_TYPE <fct>, DATE_TIME <date>, PRECIP <dbl>,
# SNOW_DEPTH <dbl>, COSMOS_VWC <dbl>, ndate <dbl>, month <dbl>, year <dbl>
Our model is then: \[ \mathbb{E}(\mu_t) = \beta_0 + s(t) \]
gam
glm
, we know how to use gam
formula
and some data
s()
s(x)
Family: gaussian
Link function: identity
Formula:
COSMOS_VWC ~ s(ndate)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.3525 0.1524 310.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(ndate) 8.167 8.812 95.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.656 Deviance explained = 66.3%
-REML = 1150.5 Scale est. = 10.218 n = 440
Method: REML Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.367285e-07,8.300716e-08]
(score 1150.533 & scale 10.2183).
Hessian positive definite, eigenvalue range [3.13693,219.0595].
Model rank = 10 / 10
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(ndate) 9.00 8.17 0.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
k
Family: gaussian
Link function: identity
Formula:
COSMOS_VWC ~ s(ndate, k = 20)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.3525 0.1289 367.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(ndate) 17.43 18.68 71.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.754 Deviance explained = 76.4%
-REML = 1099.9 Scale est. = 7.3089 n = 440
(you’ll learn how to do this later!)
gam()
fits a GAMsummary()
gives us information about the fitplot()
makes basic plots