Data Fusion

Fergus J Chadwick

BioSS

Ana S Couto

BioSS

Dave L Miller

BioSS

UKCEH

Jackie Potts

BioSS

Thomas Cornulier

BioSS

Before The Break

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

Before The Break

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#cc0000
    style C color:#674ea7
    style D color:#3c78d8
    style E color:#f1c232

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#6aa84f

# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style B color:#6aa84f

# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style C color:#6aa84f

# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style D color:#6aa84f

# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

Before The Break

flowchart LR
  A[Time<br>Pattern]:::FixFont --> C(Simulate):::FixFont
  B[Spatial<br>Pattern]:::FixFont --> C
  C --> D(Sample):::FixFont
  D --> E{Realistic<br>Dataset}:::FixFont;
    classDef FixFont font-size:15px
    style E color:#6aa84f

# Temporal Pattern
timeline <-set_timeline(1, "month")
time_series <- define_pattern_t(timeline, weekly_amplitude = 1, ...)
# Spatial Pattern
spatial_field <- define_pattern_s(n_x = 10, n_y = 10, ...)
# Simulate
truth <- simulate_spacetime(temporal_pattern = time_series, field = spatial_field, ...)
# Sample
point_samples <- sample_area_pointily(field = truth, n_samples = 10,
                                      sampling_error = 0.1, ...)

In The GAMs Workshop*

flowchart LR
  A[Process<br>of Interest]:::FixFont --> B(Observation<br>Process):::FixFont
  B --> C{Data}:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

*From Dave and Thomas

In The GAMs Workshop*

flowchart LR
C{Data}:::FixFont --> B(Observation<br>Process):::FixFont
B --> A[Process<br>of Interest]:::FixFont;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B color:#3c78d8
    style C color:#f1c232

*From Dave and Thomas

In This Session

flowchart LR
C1{Dataset<br>1}:::FixFont --> B1(Observation<br>Process<br>1):::FixFont
B1 --> A[Process<br>of Interest]:::FixFont

C2{Dataset<br>2}:::FixFont --> B2(Observation<br>Process<br>2):::FixFont
B2 --> A;
    classDef FixFont font-size:15px
    style A color:#cc0000
    style B1 color:#3c78d8
    style B2 color:#3c78d8
    style C1 color:#f1c232
    style C2 color:#f1c232

Our Simulated Data

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

Spot The Difference

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

Our Simulated Data

Finding Agreement,
Allowing For Disagreement

Data complementarity is the heart of data fusion.
We want to allow data to be mutually informative…
But we also want to acknowledge their differences.
The answer: hierarchical GAMS

Hierarchical GAMs

We can share the global trend (left column of figs).
Or we can allow each model to have a different shape.

Hierarchical GAMs

We can make the smooths for each group exactly the same.

Or we can make them similar.

Or we can allow them to be completely different.

Hierarchical GAM Hypotheses

What our best model tells us if it has a shared trend…

The observation processes agree where the peaks and troughs are
If it also shares smoothness, the two datasets are capturing the process of interest in similar ways.
If it also doesn’t share smoothness, is there a reason?

Hierarchical GAM Hypotheses

What our best model tells us if it does not have a shared trend…

The observation processes may disagree where the peaks and troughs are.
- If they share smoothness, maybe there is a lag? Come to tomorrow’s talk!
- If they don’t share smoothness either, they may be capturing different processes altogether.

Hierarchical GAM Hypotheses

Let’s build things up slowly with a quick refresher on building a non-hierarchical GAM.

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
head(sample_reg)

# A tibble: 6 × 6
# Groups:   date [6]
      x     y date        value loc_id point_sample
  <dbl> <dbl> <date>      <dbl> <fct>         <dbl>
1     1     1 2001-01-01  0.812 1_1          0.575 
2     1     1 2001-01-02  0.271 1_1          0.518 
3     1     1 2001-01-03 -0.597 1_1         -0.396 
4     1     1 2001-01-04 -1.14  1_1         -1.27  
5     1     1 2001-01-05 -0.945 1_1         -0.925 
6     1     1 2001-01-06 -0.163 1_1         -0.0446

Remember value is the Truth from our simulation.
point_sample is what we observed.
loc_id is the combination of the x and y coordinates to help with plotting.

Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)

Remember, we might be interested in deconstructing the different components of the date to pick out weekly, quarterly and annual patterns.

Building A Basic Space Time Model

fit_sample_reg <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_reg, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_reg)

Does this look familiar?

Building A Basic Space Time Model

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

gam.check(fit_sample_ran)

Building A Basic Space Time Model

A Joint Model

Rather than calling mgcv twice, can we specify these two models as a single joint model (like Model I)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam)

More Data Prep

set.seed(1868)
sample_reg <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 0.1, regular = TRUE)
set.seed(1871)
sample_ran <- sample_area_pointily(truth, n_samples = 25, 
                                   sampling_error = 1, regular = FALSE)

# Make mgcv see date as a date
sample_reg$date_dec <- lubridate::decimal_date(sample_reg$date)
sample_ran$date_dec <- lubridate::decimal_date(sample_ran$date)
# Add family column
sample_reg$fam <- 1 # Observation source 1, the regular samples
sample_ran$fam <- 2 # Observation source 2, the random samples

# Combine into one dataframe
sample_com <- rbind(sample_reg, sample_ran)
# Note, we need *both* fam and fam_fac for the factor smoothing later
sample_com$fam_fac <- factor(sample_com$fam)

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_sample_ran <- mgcv::gam(point_sample ~ t2(x, y, date_dec), 
                        data = sample_ran, 
                        family = "gaussian",
                        method = "REML")

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

Model I - I For Independence

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
gam.check(fit_fam_I)

Hierarchical GAMs

Model I has inherited the same problems as the independent models (reassuring for our coding, bad for our inference!).
We should probably think about adding a group trend.
Model GI - Same Shape, Different Wiggles

Model GI - Same Shape, Different Wiggles

fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New `mgcv` Tricks: Do Not Be Afraid

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New `mgcv` Tricks: Basis Specification

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New `mgcv` Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Some New `mgcv` Tricks: Grouped Families

# Need to create combined response variable
sample_com$resp <- cbind(sample_com$point_sample, as.numeric(sample_com$fam))
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Now we can have observation processes with different response distribution s!

Model GI - Same Shape, Different Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model GS - Same Shape, Same Wiggles

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")
# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

Model G - Same Model

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Model G* - Group Trend Models

# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

Hierarchical GAM

# Independent trend and smooths
fit_fam_I <- mgcv::gam(point_sample ~ t2(x, y, date_dec, 
                                         by = fam_fac), 
                        data = sample_com, 
                        family = "gaussian",
                        method = "REML")
# Shared trend and different smoothness
fit_fam_GI <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec,
                             d=c(2, 1),
                             bs=c("ts", "ts"), 
                             by = fam_fac),
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Shared trend and shared smoothness
fit_fam_GS <- mgcv::gam(resp ~ t2(x, y, date_dec) +
                          t2(x, y, date_dec, fam_fac,
                             d=c(2, 1, 1),
                             bs=c("ts", "ts", "re")), 
                        data = sample_com,
                        family = gfam(list(gaussian, gaussian)),
                        method = "REML")

# Identical trend and smoothness
fit_fam_G <- mgcv::gam(point_sample ~ t2(x,y, date_dec), 
                       data = sample_com, 
                       family = "gaussian",
                       method = "REML")

How Do The Models Compare?

Model I

Model GI

Model GS

Model G

How Do The Models Compare?

AIC(fit_fam_I)

[1] 4571.037

AIC(fit_fam_GI)

[1] 4480.373

AIC(fit_fam_GS)

[1] 4480.303

AIC(fit_fam_G)

[1] 4688.921

The grouped models with and without similar smoothness perfom best!
Is this surprising given our simulation?

Thoughts To Finish With

mgcv makes fitting really rather complex models relatively simple.
Data observed in different ways can, and should, be modelled simultaneously if possible.
Observation processes should be treated with as much respect as the process of interest.
Thank you for listening!

References

On Hierarchical GAMs

Pedersen EJ, Miller DL, Simpson GL, Ross N. 2019. Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ 7:e6876 https://doi.org/10.7717/peerj.6876

On Observation Processes

Chadwick, FJ, Haydon, DT, Husmeier, D, Ovaskainen, O, & Matthiopoulos, J (2023). LIES of omission: complex observation processes in ecology. Trends in Ecology & Evolution https://doi.org/10.1016/j.tree.2023.10.009

On Methods Underpinning gfam()

Wood, SN, Pya, N and Saefken, B (2016), Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548-1575 doi:10.1080/01621459.2016.1180986

On The by= Argument

Watch out for forthcoming paper from Dave Miller, Thomas Cornulier and Ken Newman.

Data Fusion

Before The Break

Before The Break

Before The Break

Before The Break

Before The Break

Before The Break

Before The Break

In The GAMs Workshop*

In The GAMs Workshop*

In This Session

Our Simulated Data

Spot The Difference

Our Simulated Data

Our Simulated Data

Our Simulated Data

Finding Agreement,Allowing For Disagreement

Hierarchical GAMs

Hierarchical GAMs

Hierarchical GAM Hypotheses

Hierarchical GAM Hypotheses

Hierarchical GAM Hypotheses

Data Prep

Data Prep

Building A Basic Space Time Model

Building A Basic Space Time Model

Building A Basic Space Time Model

Building A Basic Space Time Model

Building A Basic Space Time Model

A Joint Model

More Data Prep

More Data Prep

More Data Prep

More Data Prep

Model I - I For Independence

Model I - I For Independence

Model I - I For Independence

Hierarchical GAMs

Model GI - Same Shape, Different Wiggles

Some New mgcv Tricks: Do Not Be Afraid

Some New mgcv Tricks: Basis Specification

Some New mgcv Tricks: Grouped Families

Some New mgcv Tricks: Grouped Families

Model GI - Same Shape, Different Wiggles

Model GS - Same Shape, Same Wiggles

Model G - Same Model

Model G* - Group Trend Models

Hierarchical GAM

How Do The Models Compare?

How Do The Models Compare?

Thoughts To Finish With

References

Finding Agreement,
Allowing For Disagreement

Some New `mgcv` Tricks: Do Not Be Afraid

Some New `mgcv` Tricks: Basis Specification

Some New `mgcv` Tricks: Grouped Families

Some New `mgcv` Tricks: Grouped Families