[1] Introduction

Authors

Affiliations

Thomas Cornulier thomas.cornulier@bioss.ac.uk

Biomathematics & Statistics Scotland (https://www.bioss.ac.uk/)

Dave Miller dave.miller@bioss.ac.uk

BioSS & UKCEH

Charlotte Regan chareg@ceh.ac.uk

Maria Bogdanova marib@ceh.ac.uk

Kate Searle katrle@ceh.ac.uk

UKCEH

Models

This series of vignettes covers a range of general models from the signal regression / distributed lag model family, including a few extensions that address more specialist types of research questions (yet with potentially broad relevance across disciplines).

The case study applications are only there to facilitate and reinforce the understanding of the models. None of the analyses presented are intended to be fully realistic, in-depth treatments of the case study data: they are reasonably motivated but simplified examples that illustrate applications of each model type.

Structure of the vignettes

Each vignette covers a specific form of the model, typically building up from simpler to more complex structures and from more general to more specialist research questions.

Each one contains:

Verbal description of the model
Mathematical description
R implementation
Illustration with case study
Simulation example

How to use the vignettes

Vignette 2-1 “Distributed linear Effects Over 1D Lags” sets out the foundation framework on which all other vignettes build, so this is the place to start. In theory, the following vignettes are independent and could be consulted individually. However, there is some logical progression through the series, so if you struggle to understand a particular model or its context, it may help to go back a few vignettes or start from the beginning.
Start with the vignette’s Intro section (high-level verbal description of the model).
Vignettes are designed so that the Maths section is helpful, but not essential, for understanding what the model does or for implementing and interpreting it sensibly. We suggest you read it anyway, to become familiar with the notation. A mathematical equation (as opposed to R code) is also the notation that should be used in a paper’s methods section to describe the model, as it isn’t tool-specific and thus not time-specific (R is sure to go out of use before maths does).
The model is best understood by studying its application to a case study. In the individual vignettes, we assume familiarity with the context of the case study, presented in the “Introduction to the case study” section, below.
We provide access to both the pre-processed data and the code, so you can try it yourself and learn by experimenting with the model and outputs.
The optional simulation section is there to provide a more general example of code, allowing you to:
- simulate data from the model with known “true” parameter values;
- estimate the model from the simulated data;
- check that the model does what it is meant to do, i.e. recover parameter estimates that are consistent with the known “true” parameter values (in an acceptable proportion of the simulation runs, for example close to 95% if using 95% confidence intervals for the estimates).
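
The simulate / estimate / check cycle described above can be sketched in a few lines of base R. This is only an illustrative sketch, not the code used in the vignettes: it assumes a plain Poisson GLM with a single covariate `x` as a stand-in for the lagged models, and the names `beta_true`, `aon`, `n_sim` and `n_obs` as well as all numeric values are hypothetical.

```r
set.seed(42)
n_sim <- 200                 # number of simulation runs (hypothetical)
n_obs <- 100                 # observations per run (hypothetical)
beta_true <- c(-0.5, 0.3)    # known "true" intercept and slope (hypothetical)

covered <- logical(n_sim)
for (s in seq_len(n_sim)) {
  # 1. simulate data from the model with known parameter values
  x   <- rnorm(n_obs)
  aon <- sample(20:200, n_obs, replace = TRUE)   # monitored nests
  mu  <- aon * exp(beta_true[1] + beta_true[2] * x)
  y   <- rpois(n_obs, mu)

  # 2. estimate the model from the simulated data
  fit <- glm(y ~ x + offset(log(aon)), family = poisson)

  # 3. check whether the 95% Wald interval contains the true slope
  est <- coef(summary(fit))["x", c("Estimate", "Std. Error")]
  ci  <- est[1] + c(-1, 1) * qnorm(0.975) * est[2]
  covered[s] <- ci[1] <= beta_true[2] && beta_true[2] <= ci[2]
}

# proportion of runs in which the interval covers the truth
coverage <- mean(covered)
coverage
```

If the model and its uncertainty estimates behave as intended, `coverage` should be close to the nominal 0.95, up to Monte Carlo error from the finite number of runs.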

Introduction to the case study

Here, we seek to explain variation in the breeding success (“productivity”) of a seabird species (black-legged kittiwake) at several breeding colonies along the east coast of the UK. Kittiwakes nest on land but forage at sea for small forage fish (e.g., sandeel). The availability and quality of these fish are expected to correlate with annual breeding success, as they constitute a key part of the diet of chicks. Given that data on sandeel abundance are typically limited in their spatial and temporal resolution, sea surface temperature (SST) is frequently used as a proxy for prey availability for kittiwakes and other seabirds, due to its hypothesised links with sandeel abundance and nutritional quality. However, the spatial and temporal scale over which SST may predict kittiwake breeding success is unclear. For example, although we expect the foraging ranges around colonies to be a determinant of the spatial scale of effects, foraging distances vary considerably both between years and between colonies, meaning that it is hard to determine the likely relevant spatial scale a priori. Similarly, the most relevant temporal scale is unclear, given that prey availability may be driven by SST through the previous winter and spring as well as during the breeding season itself. Consequently, in this work, we seek to understand the spatial and temporal scales over which SST predicts kittiwake breeding success. To do this, we use seabird breeding success data collected via the Seabird Monitoring Programme and SST data obtained from Marine Scotland’s Scottish Shelf Model. The number of nests per colony (and the proportion of nests that are monitored) varies considerably, so we model the number of chicks produced, standardized by the number of monitored nests (“AON”, for “Apparently Occupied Nests”). This is achieved by using log(AON) as an offset term in the model (i.e. a term whose coefficient is fixed at 1), with a log-link on the linear predictor.
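
To see why the offset amounts to modelling chicks per monitored nest, it may help to write the log-link out. The symbols $y_i$ (number of chicks for observation $i$), $\mathrm{AON}_i$ and $\eta_i$ (the rest of the linear predictor) are notation assumed here for illustration, not necessarily that of the later vignettes:

$$
\log \mathbb{E}[y_i] = \log(\mathrm{AON}_i) + \eta_i
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[y_i]}{\mathrm{AON}_i} = \exp(\eta_i)
$$

so the covariate effects in $\eta_i$ act multiplicatively on the expected number of chicks per nest.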

Specific questions:
- Is there a (linear) lagged effect of SST on kittiwake yearly breeding success and how does it change across time lags? (What is the temporal zone of influence of SST on breeding success?)
- Is there a (linear) lagged effect of SST on kittiwake yearly breeding success and how does it change across spatial lags (away from colony)? (What is the spatial zone of influence of SST on breeding success?)
- What is the spatial and temporal zone of influence of SST on kittiwake yearly breeding success?
- Does the zone of influence of SST on kittiwake yearly breeding success vary between sites (colonies)?
- Is there an interaction between the effect of 1D lagged SST and 1D lagged sandeel presence?
- Is there a (non-linear) lagged effect of SST on kittiwake yearly breeding success and how does it change across time lags?

Original data

Seabird Monitoring Programme overview https://www.bto.org/our-science/projects/seabird-monitoring-programme
Seabird Monitoring Programme data https://app.bto.org/seabirds/public/index.jsp
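Sandeel distribution estimates data (https://www.data.gov.uk/dataset/9af04ca7-82da-46d0-a9ef-e7ce0ecce905/species-distribution-a-verified-distribution-model-for-the-lesser-sandeel-ammodytes-marinus) from https://doi.org/10.3354/meps13693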
Scottish Shelf Waters Reanalysis Service https://sites.google.com/view/ssw-rs/home

Pre-processing of the data

The sandeel data come as a single-layer raster, whereas the SST data come as a multi-layer raster, with one layer per week over many years.

Sandeel data have been cut into distance rings extending from each site (= seabird colony), in increments of 10 km up to 210 km. For each ring, the average value is calculated. This forms an $N \times D$ matrix, where $N$ is the number of seabird breeding success observations (site × year combinations) and $D$ is the number of distance lags.

SST data are spatio-temporal, so form a 3D array (“brick”) rather than a 2D matrix. The spatial distance bands are the same as for sandeels, and for the temporal lags we have used the first 30 weeks of each year, leading roughly up to the chick rearing period, when the censuses are undertaken. Note the non-standard indexing we have adopted for the SST array, with weeks in rows, distances in columns and observations coming last, as the 3rd dimension. SST arrays therefore come in dimensions $L \times D \times N$, where $L$ is the number of temporal lags.
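
As a small illustration of these shapes and of the indexing convention, the sketch below builds toy objects with the stated dimensions. The random values and the choice of `N = 5` are placeholders, and the object names `sandeel`, `sst`, `sst_obs3` and `sst_nld` are hypothetical rather than those of the provided data files.

```r
N <- 5    # observations (site x year combinations); placeholder value
D <- 21   # distance lags: 10 km rings up to 210 km
L <- 30   # temporal lags: first 30 weeks of the year

# Toy stand-ins filled with random numbers
sandeel <- matrix(runif(N * D), nrow = N, ncol = D)   # N x D matrix
sst     <- array(rnorm(L * D * N), dim = c(L, D, N))  # L x D x N array

# All weeks and distances for observation 3: an L x D matrix
sst_obs3 <- sst[, , 3]
dim(sst_obs3)   # 30 21

# Rearranged so that observations come first (N x L x D)
sst_nld <- aperm(sst, c(3, 1, 2))
dim(sst_nld)    # 5 30 21

# Same values, different layout
identical(sst_nld[3, , ], sst_obs3)
```

Keeping observations as the last dimension means that `sst[, , i]` returns the full week-by-distance surface for observation `i` in one step; `aperm()` is one way to move to an observation-first layout if a function expects it.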

To derive quantities over the distance intervals we ultimately want, we first formed nested buffers of increasing radii and then, for each buffer in sequence, subtracted the value of the previous (inner) buffer from the focal buffer value.
To compute mean SST per interval of space or time, we computed two quantities over each buffer and took the mean as $A / B$:
- sum of all pixel values in the range of interest ($A$)
- count of non-NA values in the range of interest ($B$)

For each distance band, average SST is therefore calculated by first differencing between nested buffers, and then dividing the sum of SST values for all pixels within the distance band by the number of non-NA pixels in the same band, for the relevant colony and year (see code below for more details).
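
The buffer-differencing logic can be illustrated without any spatial packages. The sketch below is not the pre-processing code itself: it assumes the pixels have already been reduced to a vector of distances from one colony (`dist_km`) and a matching vector of SST values (`sst_pix`), both simulated here, and all object names are hypothetical.

```r
set.seed(1)
n_pix   <- 2000
dist_km <- runif(n_pix, 0, 210)                       # pixel distance to colony
sst_pix <- 8 + 0.01 * dist_km + rnorm(n_pix, sd = 0.5)
sst_pix[sample(n_pix, 100)] <- NA                     # e.g. land or missing pixels

radii <- seq(10, 210, by = 10)   # outer radii of the nested buffers (km)

# For each nested buffer: A = sum of values, B = count of non-NA values
buf_sum <- sapply(radii, function(r) sum(sst_pix[dist_km <= r], na.rm = TRUE))
buf_cnt <- sapply(radii, function(r) sum(!is.na(sst_pix[dist_km <= r])))

# Subtract the previous (inner) buffer from each buffer to get ring totals
ring_sum <- diff(c(0, buf_sum))
ring_cnt <- diff(c(0, buf_cnt))

# Mean SST per distance ring, A / B
ring_mean <- ring_sum / ring_cnt

# Cross-check against a direct per-ring mean
ring_id <- cut(dist_km, breaks = c(0, radii), include.lowest = TRUE)
direct  <- as.vector(tapply(sst_pix, ring_id, mean, na.rm = TRUE))
all.equal(ring_mean, direct)
```

Differencing sums and counts separately, and dividing only at the end, is what makes the ring means correct when some pixels are NA: differencing buffer means directly would not give the mean of the ring.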
