Data Fusion

Fergus J Chadwick

BioSS

Ana S Couto

BioSS

Dave L Miller

BioSS

UKCEH

Jackie Potts

BioSS

Thomas Cornulier

BioSS

Before The Break

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

Before The Break

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232
flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#cc0000
    style C color:#674ea7
    style D color:#3c78d8
    style E color:#f1c232

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#6aa84f
# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style B color:#6aa84f
# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style C color:#6aa84f
# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style D color:#6aa84f
# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style E color:#6aa84f
# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

In The GAMs Workshop*

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

*From Dave and Thomas

In The GAMs Workshop*

flowchart LR
C{Data}:::FixFont --> B(Observation<br>Process):::FixFont
B --> A[Process<br>of Interest]:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

*From Dave and Thomas

In This Session

flowchart LR
C1{Dataset<br>1}:::FixFont --> B1(Observation<br>Process<br>1):::FixFont
B1 --> A[Process<br>of Interest]:::FixFont

C2{Dataset<br>2}:::FixFont --> B2(Observation<br>Process<br>2):::FixFont
B2 --> A;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B1 color:#3c78d8
    style B2 color:#3c78d8
    style C1 color:#f1c232
    style C2 color:#f1c232

Our Simulated Data

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

Spot The Difference

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

Our Simulated Data

Our Simulated Data

Our Simulated Data

Finding Agreement,
Allowing For Disagreement

  • Data complementarity is the heart of data fusion.

  • We want to allow data to be mutually informative…

  • But we also want to acknowledge their differences.

  • The answer: hierarchical GAMS

Hierarchical GAMs

  • We can share the global trend (left column of figs).
  • Or we can allow each model to have a different shape.

Hierarchical GAMs

  • We can make the smooths for each group exactly the same.


  • Or we can make them similar.


  • Or we can allow them to be completely different.

Hierarchical GAM Hypotheses

What our best model tells us if it has a shared trend…

  • The observation processes agree where the peaks and troughs are
  • If it also shares smoothness, the two datasets are capturing the process of interest in similar ways.
  • If it also doesn’t share smoothness, is there a reason?

Hierarchical GAM Hypotheses

What our best model tells us if it does not have a shared trend…

  • The observation processes may disagree where the peaks and troughs are.
    • If they share smoothness, maybe there is a lag? Come to tomorrow’s talk!
    • If they don’t share smoothness either, they may be capturing different processes altogether.

Hierarchical GAM Hypotheses

Let’s build things up slowly with a quick refresher on building a non-hierarchical GAM.

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
head(sample_reg)
# A tibble: 6 × 6
# Groups:   date [6]
      x     y date        value loc_id point_sample
  <dbl> <dbl> <date>      <dbl> <fct>         <dbl>
1     1     1 2001-01-01  0.812 1_1          0.575 
2     1     1 2001-01-02  0.271 1_1          0.518 
3     1     1 2001-01-03 -0.597 1_1         -0.396 
4     1     1 2001-01-04 -1.14  1_1         -1.27  
5     1     1 2001-01-05 -0.945 1_1         -0.925 
6     1     1 2001-01-06 -0.163 1_1         -0.0446
  • Remember value is the Truth from our simulation.
  • point_sample is what we observed.
  • loc_id is the combination of the x and y coordinates to help with plotting.

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)

Remember, we might be interested in deconstructing the different components of the date to pick out weekly, quarterly and annual patterns.

Building A Basic Space Time Model

fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_reg, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_reg)

Does this look familiar?

Building A Basic Space Time Model

Building A Basic Space Time Model

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_ran)

Building A Basic Space Time Model

Building A Basic Space Time Model

A Joint Model

Rather than calling mgcv twice, can we specify these two models as a single joint model (like Model I)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam) 

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
gam.check(fit_fam_I)

Hierarchical GAMs

  • Model I has inherited the same problems as the independent models (reassuring for our coding, bad for our inference!).
  • We should probably think about adding a group trend.
  • Model GI - Same Shape, Different Wiggles

Model GI - Same Shape, Different Wiggles

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Do Not Be Afraid

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Basis Specification

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New mgcv Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Now we can have observation processes with different response distribution s!

Model GI - Same Shape, Different Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model GS - Same Shape, Same Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model G - Same Model

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Model G* - Group Trend Models

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Hierarchical GAM

# Independent trend and smooths
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

How Do The Models Compare?

Model I

Model GI

Model GS

Model G

How Do The Models Compare?

AIC(fit_fam_I)
[1] 4571.037
AIC(fit_fam_GI)
[1] 4480.373
AIC(fit_fam_GS)
[1] 4480.303
AIC(fit_fam_G)
[1] 4688.921
  • The grouped models with and without similar smoothness perfom best!
  • Is this surprising given our simulation?

Thoughts To Finish With

  • mgcv makes fitting really rather complex models relatively simple.
  • Data observed in different ways can, and should, be modelled simultaneously if possible.
  • Observation processes should be treated with as much respect as the process of interest.
  • Thank you for listening!

References

On Hierarchical GAMs

  1. Pedersen EJ, Miller DL, Simpson GL, Ross N. 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7:e6876 https://doi.org/10.7717/peerj.6876

On Observation Processes

  1. Chadwick, FJ, Haydon, DT, Husmeier, D, Ovaskainen, O, & Matthiopoulos, J (2023). LIES of omission: complex observation processes in ecology. Trends in Ecology & Evolution https://doi.org/10.1016/j.tree.2023.10.009

On Methods Underpinning gfam()

  1. Wood, SN, Pya, N and Saefken, B (2016), Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548-1575 doi:10.1080/01621459.2016.1180986

On The by= Argument

  1. Watch out for forthcoming paper from Dave Miller, Thomas Cornulier and Ken Newman.