CFtime

Climate change models and calendars

Around the world, many climate change models (100+) are being developed under the umbrella of the World Climate Research Programme to assess the rate of climate change. Published data are generally publicly available to download for research and other non-commercial purposes through partner organizations in the Earth System Grid Federation.

The data are all formatted to comply with the CF Metadata Conventions, a set of standards that promotes uniformity among research groups and published data sets. These conventions greatly facilitate the use and analysis of the climate projections because standard processing workflows (should) work across the various data sets.

On the flip side, the CF Metadata Conventions have to cater to a wide range of modeling requirements, and that means that some of the areas covered by the standards are more complex than might be assumed. One of those areas is the temporal dimension of the data sets. The CF Metadata Conventions support no fewer than nine different calendar definitions that, upon analysis, fall into five distinct calendars (from the perspective of computation of climate projections):

- standard (and its deprecated synonym gregorian), with proleptic_gregorian extending it to dates before its introduction in 1582: the calendar in common use today;
- julian: a leap year every four years, including all centurial years;
- 360_day: every year has 360 days, divided over 12 months of 30 days each;
- 365_day or noleap: no year has a leap day;
- 366_day or all_leap: every year has a leap day.

The three latter calendars are specific to the CF Metadata Conventions, to reduce the computational complexities of working with dates. These three, as well as the julian calendar, are not compatible with the standard POSIXt date/time facilities in R, and using standard date/time procedures would quickly lead to problems. In the code snippet below, the date of 1949-12-01 is the datum from which other dates are calculated. When adding 43,289 days to this datum for a data set that uses the 360_day calendar, that should yield a date some 120 years after the datum:

library(CFtime)

# POSIXt calculations on a standard calendar - INCORRECT
as.Date("1949-12-01") + 43289
#> [1] "2068-06-08"

# CFtime calculation on a "360_day" calendar - CORRECT
# See below examples for details on the two functions
CFtimestamp(CFtime("days since 1949-12-01", "360_day", 43289))
#> [1] "2070-02-30"

Using standard POSIXt calculations gives a result that is about 21 months off from the correct date - obviously an undesirable situation. This example is far from artificial: 1949-12-01 is the datum for all CORDEX data, covering the period 1951 - 2005 for historical experiments and the period 2006 - 2100 for RCP experiments (with some deviation between data sets), and several models used in the CORDEX set use the 360_day calendar. The 365_day or noleap calendar deviates by about 1 day every 4 years (disregarding centurial years), or about 24 days in a century. The 366_day or all_leap calendar deviates by about 3 days every 4 years, or about 76 days in a century.
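The slower drift of the other calendars adds up just the same. A minimal sketch of a full century on the noleap calendar, using the same two functions as above (the offset of 36,500 days is illustrative; the deviation here is 25 days rather than 24 because 2000, being divisible by 400, is a leap year in the standard calendar):

# CFtime calculation: 36,500 days is exactly 100 years on the "noleap" calendar
CFtimestamp(CFtime("days since 2000-01-01", "noleap", 36500))
#> [1] "2100-01-01"

# POSIXt calculation: the same offset falls 25 days short of a century
as.Date("2000-01-01") + 36500
#> [1] "2099-12-07"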

The CFtime package deals with the complexity of the different calendars allowed by the CF Metadata Conventions. It properly formats dates and times (even oddball dates like 2070-02-30) and it can generate calendar-aware factors for further processing of the data.

Using CFtime to deal with calendars

Data sets that are compliant with the CF Metadata Conventions always include a datum, a specific point in time in reference to a specified calendar, from which other points in time are calculated by adding a specified offset of a certain unit. This approach is encapsulated in the CFtime package by the S4 class CFtime.

# Create a CF time object from a definition string, a calendar and some offsets
cf <- CFtime("days since 1949-12-01", "360_day", 19830:90029)
cf
#> CF datum of origin:
#>   Origin  : 1949-12-01 00:00:00
#>   Units   : days
#>   Calendar: 360_day
#> CF time series:
#>   Elements: [2005-01-01 .. 2199-12-30] (average of 1.000000 days between 70200 elements)

The CFtime() function takes a datum description (which is actually a unit - “days” - in reference to a datum - “1949-12-01”), a calendar description, and a vector of offsets from that datum. Once a CFtime instance is created, its datum and calendar can no longer be changed. Offsets may, however, be added.
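Adding offsets can be sketched as follows, using the + merge operator that also appears in the multi-file example near the end of this vignette (the offsets are the same as above):

# Create an "empty" CFtime instance: datum and calendar only
cf <- CFtime("days since 1949-12-01", "360_day")

# Add offsets after the fact; the datum and calendar remain fixed
cf <- cf + 19830:90029
CFrange(cf)
#> [1] "2005-01-01" "2199-12-30"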

In practice, these parameters will be taken from the data set of interest. CF Metadata Conventions require data sets to be in the netCDF format, with all metadata describing the data set included in a single file. Not surprisingly, all of the pieces are contained in the mandatory time dimension of the file. The process then becomes as follows, for a CMIP6 file of daily precipitation:

# Opening a data file that is included with the package and showing some attributes.
# Usually you would `list.files()` on a directory of your choice.
library(ncdf4)

nc <- nc_open(list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[1])
attrs <- ncatt_get(nc, "")
attrs$title
#> [1] "NOAA GFDL GFDL-ESM4 model output prepared for CMIP6 update of RCP4.5 based on SSP2"
experiment <- attrs$experiment_id
experiment
#> [1] "ssp245"

# Create the CFtime instance from the metadata in the file.
cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)
cf
#> CF datum of origin:
#>   Origin  : 1850-01-01 00:00:00
#>   Units   : days
#>   Calendar: noleap
#> CF time series:
#>   Elements: [2015-01-01T12:00:00 .. 2099-12-31T12:00:00] (average of 1.000000 days between 31025 elements)

nc$dim$time$units and nc$dim$time$calendar are required attributes of the time dimension in the netCDF file, and nc$dim$time$vals are the offset values, or dimnames() in R terms, for the time dimension of the data. The corresponding character representations of the time series can be easily generated:

dates <- CFtimestamp(cf, format = "date")
dates[1:10]
#>  [1] "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" "2015-01-05"
#>  [6] "2015-01-06" "2015-01-07" "2015-01-08" "2015-01-09" "2015-01-10"

…as well as the full range of the time series:

CFrange(cf)
#> [1] "2015-01-01T12:00:00" "2099-12-31T12:00:00"

Note that in this latter case, if any of the timestamps in the time series have a time component other than 00:00:00, the time is displayed for the extremes of the time series as well. This is a common occurrence for data sets at a monthly resolution with offsets expressed in days (the largest time unit that the CF Metadata Conventions allow). Typically the middle of the month is then recorded, which for months with 31 days would be something like 2005-01-15T12:00:00.
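A small sketch of what that looks like, with illustrative offsets (on the 360_day calendar, an offset of 14.5 days past the start of a month is the middle of that month, and CFtimestamp() displays the time when it is not midnight, as CFrange() does above):

cf_monthly <- CFtime("days since 2005-01-01", "360_day", c(14.5, 44.5, 74.5))
CFtimestamp(cf_monthly)
#> [1] "2005-01-15T12:00:00" "2005-02-15T12:00:00" "2005-03-15T12:00:00"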

Processing climate projections

Individual files containing climate projections contain global, regional or local data, typically on a rectangular latitude-longitude grid, for a single parameter such as “near-surface temperature”, and for a number of time steps. An analysis workflow then consists of a number of steps:

- Download the data files for the experiment, model and parameter of interest, covering the geographic and temporal extent of the analysis;
- Open the files and read the data;
- Aggregate the data to the temporal resolution of interest, using tapply() with a suitable factor f;
- Analyze and present the results.

Apart from the first step of obtaining the data, the steps lend themselves well to automation. The catch, however, is in the factor f to use with tapply(): the different models in your ensemble are likely to use different calendars, meaning that different factors are required. The CFtime package can help out.

The CFfactor() function produces a factor that respects the calendar of the data files. The function comes in two operating modes:

- Plain mode: given a period such as “dekad” or “month”, it produces a single factor over the entire time series, with a level for every period that occurs in the series (e.g. “2015D01”);
- Epoch mode: given a named list of epochs (each a vector of years), it produces one factor per epoch; the levels only identify the period within the year (e.g. “01” for January), so that data from all years in the epoch are aggregated together.

# Create a dekad factor for the whole `cf` time series that was created above
f_k <- CFfactor(cf, "dekad")
str(f_k)
#>  Factor w/ 3060 levels "2015D01","2015D02",..: 1 1 1 1 1 1 1 1 1 1 ...

# Create four monthly factors for a baseline epoch and early, mid and late 21st century epochs
f_ep <- CFfactor(cf, epoch = list(baseline = 1991:2020, early = 2021:2040,
                                  mid = 2041:2060, late = 2061:2080))
str(f_ep)
#> List of 4
#>  $ baseline: Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ early   : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ mid     : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ late    : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...

For the “epoch” version, there are two interesting things to note here:

- Each of the four factors has the same length as the time series, with NA for elements that do not fall within the epoch. That is why the early, mid and late factors start with NA values: the time series starts in 2015, which lies in the baseline epoch only.
- The factor levels identify only the period within the year (“01” .. “12”), not the year itself: the purpose of an epoch factor is to aggregate data over all years in the epoch, for instance to calculate a monthly average over a multi-decadal baseline.

There are five periods defined for CFfactor():

- “year”, one level per calendar year;
- “season”, the meteorological seasons (DJF, MAM, JJA, SON);
- “month”, one level per month;
- “dekad”, ten-day periods, three per month, with the third dekad absorbing any days beyond day 30;
- “day”, one level per day.
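As a quick check of the plain operating mode with two of these periods, using the cf instance created from the CMIP6 file above (the series covers 2015 - 2099, so 85 years and 85 * 12 = 1020 months):

nlevels(CFfactor(cf, "year"))
#> [1] 85
nlevels(CFfactor(cf, "month"))
#> [1] 1020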

Building on the examples above of opening a file, creating a CFtime instance and a suitable factor for one data suite, here daily rainfall, the actual processing of the data into precipitation anomalies for three future epochs relative to a baseline epoch could look like this:

# Read the data from the netCDF file.
# Keep degenerate dimensions so that we have a predictable data structure: 3-dimensional array.
# Convert the units of kg m-2 s-1 to mm/day.
pr_d <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
str(pr_d)
#>  num [1, 1, 1:31025] 11.699 7.146 2.25 0.418 0.442 ...
# Note that the data file has two degenerate dimensions for longitude and latitude, to keep
# the example data shipped with this package small.

# Assign dimnames(), optional.
dimnames(pr_d) <- list(nc$dim$lon$vals, nc$dim$lat$vals, CFtimestamp(cf))

nc_close(nc)

# Calculate the daily average precipitation per month for the baseline period
# and the three future epochs.
# `aperm()` rearranges dimensions after `tapply()` mixed them up.
pr_d_ave <- lapply(f_ep, function(f) aperm(apply(pr_d, 1:2, tapply, f, mean), c(2, 3, 1)))

# Calculate the precipitation anomalies for the future epochs against the baseline.
# Working with daily averages per month so we can simply subtract and then multiply by days 
# per month for the CF calendar.
baseline <- pr_d_ave$baseline
pr_d_ave$baseline <- NULL
ano <- lapply(pr_d_ave, function(x) (x - baseline) * CFmonth_days(cf))

# Plot the results
plot(1:12, ano$early[1,1,], type = "o", col = "blue", ylim = c(-50, 40), xlim = c(1, 12), 
     main = paste0("Hamilton, New Zealand\n", experiment), 
     xlab = "month", ylab = "Precipitation anomaly (mm)")
lines(1:12, ano$mid[1,1,], type = "o", col = "green")
lines(1:12, ano$late[1,1,], type = "o", col = "red")

Looks like Hadley will be needing rubber boots in spring and autumn back home!

The interesting feature, working from opening the netCDF file all the way down to plotting, is that the specifics of the CF calendar that the data suite uses do not have to be considered anywhere in the processing workflow: the CFtime package provides that functionality. Data suites using another CF calendar are processed in exactly the same way:

nc <- nc_open(list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[2])
cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)
# Note that `cf` has a different CF calendar

f_ep <- CFfactor(cf, epoch = list(baseline = 1991:2020, early = 2021:2040,
                                  mid = 2041:2060, late = 2061:2080))

pr_d <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
nc_close(nc)

pr_d_ave <- lapply(f_ep, function(f) aperm(apply(pr_d, 1:2, tapply, f, mean), c(2, 3, 1)))
baseline <- pr_d_ave$baseline
pr_d_ave$baseline <- NULL
ano <- lapply(pr_d_ave, function(x) (x - baseline) * CFmonth_days(cf))

Working with multiple files in a single data suite

Due to the large size of typical climate projection data files, it is common to have a data suite that is contained in multiple files. A case in point is the CORDEX data set which breaks up the experiment period of 2006 - 2100 into 19 files of 5 years each, with each file covering a single parameter (temperature, precipitation, etc) over an entire domain (such as Europe, South Asia, Central America and the Caribbean, etc). The CFtime package can streamline processing of such multi-file data suites as well.

Assuming that you have your CORDEX files in a directory on disk, organized by domain and other properties such as the variable, GCM/RCM combination, experiment, etc, the process of preparing the files for processing could be encoded in a function as below. The argument fn is a vector of file names to process, and var is the variable contained in the files. (There are no checks on argument sanity here, which should really be included. This function only makes sense for a single [domain, GCM/RCM, experiment, variable] combination. Also be aware of data size: CORDEX files are huge, and stitching together all data for a domain can easily exhaust available memory, leading to very large swap files and very poor performance.)

library(ncdf4)
library(abind)

prepare_CORDEX <- function(fn, var) {
  cf <- NULL
  offsets <- vector("list", length(fn))
  data <- vector("list", length(fn))
  for (i in seq_along(fn)) {
    nc <- nc_open(fn[i])
    if (is.null(cf))
      # Create an "empty" CFtime instance, without elements, from the first file
      cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar)

    # Collect the datum offsets and the data arrays, in the same file order
    offsets[[i]] <- as.vector(nc$dim$time$vals)
    data[[i]] <- ncvar_get(nc, var, collapse_degen = FALSE)

    nc_close(nc)
  }

  # Create a list for output with the CFtime instance assigned the offsets and
  # the data bound in a single 3-dimensional array
  list(CFtime = cf + unlist(offsets), data = abind(data, along = 3))
}

Calling this function like prepare_CORDEX(list.files(path = "~/CC/CORDEX/CAM", pattern = "\\.nc$", full.names = TRUE), "pr") will yield a list with the CFtime instance describing the full temporal extent covered by the data files, as well as the data bound on the temporal dimension, ready for further processing.
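Inspecting the result could look like the sketch below; the directory is hypothetical and "pr" assumes precipitation files, so no output is shown here:

cam <- prepare_CORDEX(list.files(path = "~/CC/CORDEX/CAM", pattern = "\\.nc$",
                                 full.names = TRUE), "pr")

# The temporal extent over all files...
CFrange(cam$CFtime)

# ...should match the temporal dimension of the stitched data array
dim(cam$data)[3] == length(CFtimestamp(cam$CFtime))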

When working like this, it is imperative that the offsets and the data arrays are added to their final structures in exactly the same order. The offsets (and the data) need not themselves be in chronological order, but the correspondence between offsets and data must be maintained. (list.files() returns file names in alphabetical order by default, which for most climate projection files equates to chronological order.)
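Should the file names not sort chronologically, both lists can be reordered in tandem before merging. A hypothetical addition to prepare_CORDEX(), placed just before the final list() statement:

# Reorder offsets and data together, using the earliest offset of each
# file as the chronological key (assumes files do not overlap in time)
ord     <- order(sapply(offsets, min))
offsets <- offsets[ord]
data    <- data[ord]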

Final observations

Acknowledgements

The results presented contain modified data from Copernicus Climate Change Service information, 2023. Neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus information or data it contains.

We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.

The two datasets used as examples in this vignette carry the following license statements: