Around the world, more than 100 climate change models are being developed under the umbrella of the World Climate Research Programme to assess the rate of climate change. Published data are generally publicly available to download for research and other non-commercial purposes through partner organizations in the Earth System Grid Federation.
The data are all formatted to comply with the CF Metadata Conventions, a set of standards to support standardization among research groups and published data sets. These conventions greatly facilitate the use and analysis of the climate projections because standard processing workflows (should) work across the various data sets.
On the flip side, the CF Metadata Conventions needs to cater to a wide range of modeling requirements, which means that some of the areas covered by the standards are more complex than might be assumed. One of those areas is the temporal dimension of the data sets. The CF Metadata Conventions supports no fewer than nine different calendar definitions which, upon analysis, fall into five distinct calendars (from the perspective of computation of climate projections):
- standard or gregorian: The international civil calendar that is in common use in many countries around the world, adopted by edict of Pope Gregory XIII in 1582 and in effect from 15 October of that year. The proleptic_gregorian calendar is the same as the gregorian calendar, but with validity extended to periods prior to 1582-10-15.
- julian: Adopted in the year 45 BCE, every fourth year is a leap year. Originally, the julian calendar did not have a monotonically increasing year assigned to it and there are indeed several julian calendars in use around the world today with different years assigned to them. Common interpretation is currently that the year is the same as that of the standard calendar. The julian calendar is currently 13 days behind the gregorian calendar.
- 365_day or noleap: No years have a leap day.
- 366_day or all_leap: All years have a leap day.
- 360_day: Every year has 12 months of 30 days each.

The three latter calendars are specific to the CF Metadata Conventions to reduce computational complexities of working with dates.
These three, and the julian calendar, are not compliant with the
standard POSIXt
date/time facilities in R
and
using standard date/time procedures would quickly lead to problems. In
the code snippet below, the date of 1949-12-01 is the
is the
datum from which other dates are calculated. When adding 43,289
days to this datum for a data set that uses the
360_day
calendar, that should yield a date some 120 years
after the datum:
library(CFtime)

# POSIXt calculations on a standard calendar - INCORRECT
as.Date("1949-12-01") + 43289
#> [1] "2068-06-08"
# CFtime calculation on a "360_day" calendar - CORRECT
# See below examples for details on the two functions
CFtimestamp(CFtime("days since 1949-12-01", "360_day", 43289))
#> [1] "2070-02-30"
Using standard POSIXt
calculations gives a result that
is about 21 months off from the correct date - obviously an undesirable
situation. This example is far from artificial: 1949-12-01
is the datum for all CORDEX data, covering the period 1951 - 2005 for
historical experiments and the period 2006 - 2100 for RCP experiments
(with some deviation between data sets), and several models used in the
CORDEX set use the 360_day
calendar. The
365_day
or noleap
calendar deviates by about 1
day every 4 years (disregarding centurial years), or about 24 days in a
century. The 366_day
or all_leap
calendar
deviates by about 3 days every 4 years, or about 76 days in a
century.
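To make these differences concrete, the same offset from the same datum lands on different dates depending on the calendar. A minimal sketch using the CFtime() and CFtimestamp() functions from the snippet above (the dates in the comments are expectations, not captured package output):
# Two years' worth of days from the same datum, under three calendars
CFtimestamp(CFtime("days since 2000-01-01", "standard", 730))  # 2001-12-31: 2000 is a leap year
CFtimestamp(CFtime("days since 2000-01-01", "noleap", 730))    # 2002-01-01: no leap day in 2000
CFtimestamp(CFtime("days since 2000-01-01", "360_day", 730))   # 2002-01-11: two 360-day years plus 10 days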
The CFtime
package deals with the complexity of the
different calendars allowed by the CF Metadata Conventions. It properly
formats dates and times (even oddball dates like
2070-02-30
) and it can generate calendar-aware factors for
further processing of the data.
Data sets that are compliant with the CF Metadata Conventions always
include a datum, a specific point in time in reference to a
specified calendar, from which other points in time are
calculated by adding a specified offset of a certain
unit. This approach is encapsulated in the CFtime
package by the S4 class CFtime
.
# Create a CF time object from a definition string, a calendar and some offsets
cf <- CFtime("days since 1949-12-01", "360_day", 19830:90029)
cf
#> CF datum of origin:
#> Origin : 1949-12-01 00:00:00
#> Units : days
#> Calendar: 360_day
#> CF time series:
#> Elements: [2005-01-01 .. 2199-12-30] (average of 1.000000 days between 70200 elements)
The CFtime()
function takes a datum description
(which is actually a unit - “days” - in reference to a datum -
“1949-12-01”), a calendar description, and a vector of offsets
from that datum. Once a CFtime
instance is created its
datum and calendar cannot be changed anymore. Offsets may be added.
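For instance, building on the cf instance created above, extra offsets can be appended with the + operator that is also used near the end of this vignette to merge offsets from multiple files (a minimal sketch; cf_ext is just an illustrative name):
# Append another 360-day year of offsets to the series created above
cf_ext <- cf + 90030:90389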
In practice, these parameters will be taken from the data set of
interest. CF Metadata Conventions require data sets to be in the netCDF
format, with all metadata describing the data set included in a single
file. Not surprisingly, all of the pieces are contained in the mandatory
time
dimension of the file. The process then becomes as
follows, for a CMIP6 file of daily precipitation:
library(ncdf4)

# Opening a data file that is included with the package and showing some attributes.
# Usually you would `list.files()` on a directory of your choice.
nc <- nc_open(list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[1])
attrs <- ncatt_get(nc, "")
attrs$title
#> [1] "NOAA GFDL GFDL-ESM4 model output prepared for CMIP6 update of RCP4.5 based on SSP2"
experiment <- attrs$experiment_id
experiment
#> [1] "ssp245"
# Create the CFtime instance from the metadata in the file.
cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)
cf
#> CF datum of origin:
#> Origin : 1850-01-01 00:00:00
#> Units : days
#> Calendar: noleap
#> CF time series:
#> Elements: [2015-01-01T12:00:00 .. 2099-12-31T12:00:00] (average of 1.000000 days between 31025 elements)
nc$dim$time$units
and nc$dim$time$calendar
are required attributes of the time
dimension in the netCDF
file, and nc$dim$time$vals
are the offset values, or
dimnames()
in R
terms, for the
time
dimension of the data. The corresponding character
representations of the time series can be easily generated:
dates <- CFtimestamp(cf, format = "date")
dates[1:10]
#> [1] "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" "2015-01-05"
#> [6] "2015-01-06" "2015-01-07" "2015-01-08" "2015-01-09" "2015-01-10"
…as well as the full range of the time series:
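# One way to obtain the full range (a sketch using base R on the timestamps
# generated by CFtimestamp(); for ISO-8601 timestamps, alphabetical order equals
# chronological order, so range() works on the character vector)
range(CFtimestamp(cf))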
Note that in this latter case, if any of the timestamps in the time
series have a time that is other than 00:00:00
then the
time of the extremes of the time series is also displayed. This is a
common occurrence for data sets at a monthly resolution with offsets
calculated in days (the largest time unit that the CF Metadata
Conventions allows). Typically the middle of the month is then recorded,
which for months with 31 days would be something like
2005-01-15T12:00:00
.
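As an illustration, here is a constructed monthly series (not read from a data file; the offsets and the cf_monthly name are made up for this example) with offsets in days pointing at the middle of each month:
# Mid-month offsets in days from the start of 2005
cf_monthly <- CFtime("days since 2005-01-01", "standard", c(14.5, 44.5, 73.5))
CFtimestamp(cf_monthly)  # expected: "2005-01-15T12:00:00" "2005-02-14T12:00:00" "2005-03-15T12:00:00"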
Individual files containing climate projections hold global, regional or local data, typically on a rectangular latitude-longitude grid, for a single parameter such as “near-surface temperature”, and for a number of time steps. An analysis workflow then consists of a number of steps: obtain the data files of interest; open each file and read its data; construct a factor f that matches the temporal properties and calendar of the data; and summarize the data over the temporal dimension with a call such as apply(data, 1:2, tapply, f, fun) (following the CF Metadata Conventions, dimensions 1 and 2 are “longitude” and “latitude”, respectively; the third dimension is “time”). Repeat for the data suite for each ensemble member.

Apart from the first step of obtaining the data, the steps lend
themselves well to automation. The catch, however, is in the factor
f
to use with tapply()
. The different models
(in your ensemble) use different calendars, meaning that different
factors are required. The CFtime package can help out.
The CFfactor() function produces a factor that respects the calendar of the data files. The function comes in two operating modes: called with just a period it produces a single factor over the entire time series, while with the epoch argument it produces a list of factors, one for each epoch (a set of years) of interest:
# Create a dekad factor for the whole `cf` time series that was created above
f_k <- CFfactor(cf, "dekad")
str(f_k)
#> Factor w/ 3060 levels "2015D01","2015D02",..: 1 1 1 1 1 1 1 1 1 1 ...
# Create four monthly factors for a baseline epoch and early, mid and late 21st century epochs
f_ep <- CFfactor(cf, epoch = list(baseline = 1991:2020, early = 2021:2040,
mid = 2041:2060, late = 2061:2080))
str(f_ep)
#> List of 4
#> $ baseline: Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ early : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> $ mid : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> $ late : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
For the “epoch” version, there are two interesting things to note here:

- CFfactor() returns a list of factors, one for every epoch, with the list items named after the epochs.
- NA values where the time series values do not fall in the epoch. This ensures that the factor is compatible with the data set which the time series describes, such that functions like tapply() will not throw an error.

There are five periods defined for CFfactor():

- year, to summarize data to yearly timescales
- season, the meteorological seasons. Note that the month of December will be added to the months of January and February of the following year, so the date “2020-12-01” yields the factor value “2021-DJF” (a short sketch follows this list).
- month, monthly summaries, the default period
- dekad, 10-day periods. Each month is subdivided into dekads as follows: (1) days 01 - 10; (2) days 11 - 20; (3) the remainder of the month
- day, to summarize sub-daily data to daily timescales
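A quick way to see the season period in action on the cf time series created above (a sketch; the resulting factor values are not shown here):
# Season factor: December values are assigned to the DJF season of the following year
f_s <- CFfactor(cf, "season")
levels(f_s)[1:4]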
Building on the examples above of opening a file, creating a
CFtime
instance and a suitable factor for one data suite,
here daily rainfall, the actual processing of the data into
precipitation anomalies for 3 periods relative to a baseline period
could look like this:
# Read the data from the netCDF file.
# Keep degenerate dimensions so that we have a predictable data structure: 3-dimensional array.
# Convert units of kg m-2 s-1 to mm/day.
pr_d <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
str(pr_d)
#> num [1, 1, 1:31025] 11.699 7.146 2.25 0.418 0.442 ...
# Note that the data file has two degenerate dimensions for longitude and latitude, to keep
# the example data shipped with this package small.
# Assign dimnames(), optional.
dimnames(pr_d) <- list(nc$dim$lon$vals, nc$dim$lat$vals, CFtimestamp(cf))
nc_close(nc)
# Calculate the daily average precipitation per month for the baseline period
# and the three future epochs.
# `aperm()` rearranges dimensions after `tapply()` mixed them up.
pr_d_ave <- lapply(f_ep, function(f) aperm(apply(pr_d, 1:2, tapply, f, mean), c(2, 3, 1)))
# Calculate the precipitation anomalies for the future epochs against the baseline.
# Working with daily averages per month so we can simply subtract and then multiply by days
# per month for the CF calendar.
baseline <- pr_d_ave$baseline
pr_d_ave$baseline <- NULL
ano <- lapply(pr_d_ave, function(x) (x - baseline) * CFmonth_days(cf))
# Plot the results
plot(1:12, ano$early[1,1,], type = "o", col = "blue", ylim = c(-50, 40), xlim = c(1, 12),
main = paste0("Hamilton, New Zealand\n", experiment),
xlab = "month", ylab = "Precipitation anomaly (mm)")
lines(1:12, ano$mid[1,1,], type = "o", col = "green")
lines(1:12, ano$late[1,1,], type = "o", col = "red")
Looks like Hadley will be needing rubber boots in spring and autumn back home!
The interesting feature, working from opening the netCDF file down to
plotting, is that the specifics of the CF calendar that the data suite
uses do not have to be considered anywhere in the processing workflow:
the CFtime
package provides the functionality. Data suites
using another CF calendar are processed exactly the same:
nc <- nc_open(list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[2])
cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)
# Note that `cf` has a different CF calendar
f_ep <- CFfactor(cf, epoch = list(baseline = 1991:2020, early = 2021:2040,
mid = 2041:2060, late = 2061:2080))
pr_d <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
nc_close(nc)
pr_d_ave <- lapply(f_ep, function(f) aperm(apply(pr_d, 1:2, tapply, f, mean), c(2, 3, 1)))
baseline <- pr_d_ave$baseline
pr_d_ave$baseline <- NULL
ano <- lapply(pr_d_ave, function(x) (x - baseline) * CFmonth_days(cf))
Due to the large size of typical climate projection data files, it is common to have a data suite that is contained in multiple files. A case in point is the CORDEX data set which breaks up the experiment period of 2006 - 2100 into 19 files of 5 years each, with each file covering a single parameter (temperature, precipitation, etc) over an entire domain (such as Europe, South Asia, Central America and the Caribbean, etc). The CFtime package can streamline processing of such multi-file data suites as well.
Assuming that you have your CORDEX files in a directory on disk,
organized by domain and other properties such as the variable, GCM/RCM
combination, experiment, etc, the process of preparing the files for
processing could be encoded in a function as below. The argument
fn
is a list of file names to process, and var
is the variable contained in the files. (There are no checks on argument
sanity here, which should really be included. This function only makes
sense for a single [domain, GCM/RCM, experiment, variable] combination.
Also be aware of data size: CORDEX files are huge, and stitching all domain data together can easily exhaust available memory, leading to very large swap files and very poor performance.)
library(ncdf4)
library(abind)
prepare_CORDEX <- function(fn, var) {
  offsets <- vector("list", length(fn))
  data <- vector("list", length(fn))
  for (i in seq_along(fn)) {
    nc <- nc_open(fn[i])
    if (i == 1)
      # Create an "empty" CFtime object, without elements, from the first file
      cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar)
    # Collect the datum offsets and the data array of every file, in the same order
    offsets[[i]] <- as.vector(nc$dim$time$vals)
    data[[i]] <- ncvar_get(nc, var, collapse_degen = FALSE)
    nc_close(nc)
  }
  # Create a list for output with the CFtime instance assigned the offsets and
  # the data bound in a single 3-dimensional array
  list(CFtime = cf + unlist(offsets), data = abind(data, along = 3))
}
Calling this function, here for the daily precipitation variable pr, like
prepare_CORDEX(list.files(path = "~/CC/CORDEX/CAM", pattern = "\\.nc$", full.names = TRUE), "pr")
will yield a list with the CFtime
instance describing the
full temporal extent covered by the data files, as well as the data
bound on the temporal dimension, ready for further processing.
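As a sketch of how the result might then be used, reusing the epoch approach from the single-file example above (the pr variable and the epoch years are illustrative choices):
suite <- prepare_CORDEX(list.files(path = "~/CC/CORDEX/CAM", pattern = "\\.nc$", full.names = TRUE), "pr")
f_ep <- CFfactor(suite$CFtime, epoch = list(early = 2021:2040, late = 2061:2080))
pr_ave <- lapply(f_ep, function(f) aperm(apply(suite$data, 1:2, tapply, f, mean), c(2, 3, 1)))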
When working like this it is imperative that the offsets and the data
arrays are added to their final structures in exactly the same
order. It is not necessary that the offsets (and the data)
themselves are in order, but the correspondence between offsets and data
needs to be maintained. (list.files() returns file names sorted
alphabetically by default, which for most climate projection files
produces offsets in chronological order.)
The results presented contain modified data from Copernicus Climate Change Service information, 2023. Neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus information or data it contains.
We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.
The two datasets used as examples in this vignette carry the following license statements: