REDCapDM



1 Introduction

The REDCapDM package allows users to read REDCap data into R, either from files exported from REDCap or through an API connection. It also allows users to preprocess the downloaded data, create reports of queries such as outliers or missing values, and track the identified queries.



2 Functions

Functions included in the package:



3 Built-in dataset

For the following examples, we will use a random sample of the COVICAN study which is included in the package. COVICAN is an international, multicentre cohort study of cancer patients with COVID-19 to describe the epidemiology, risk factors, and clinical outcomes of co-infections and superinfections in onco-hematological patients with COVID-19.

We can load the built-in dataset by typing:

data(covican)

The structure of this dataset is:

List of 2
 $ data      :'data.frame': 342 obs. of  56 variables:
 $ dictionary:'data.frame': 21 obs. of  8 variables:

And some of the variables in the dataset are:

| Name | Description | Categories |
|:-----|:------------|:-----------|
| record_id | Identifier of each record | |
| redcap_event_name | Auto-generated name of the events | |
| redcap_data_access_group | Auto-generated name of each center | |
| inc_1 | Patients older than 18 years | No ; Yes |
| inc_2 | Cancer patients | No ; Yes |
| inc_3 | Diagnosed of COVID-19 | No ; Yes |
| exc_1 | Solid tumour remission >1 year | No ; Yes |
| screening_fail_crit | Indicator of non-compliance with inclusion and exclusion criteria | Compliance ; Non-compliance |
| d_birth | Date of birth (y-m-d) | |
| d_admission | Date of first visit (y-m-d) | |
| age | Age | |
| dm | Indicator of diabetes | No ; Yes |
| type_dm | Type of diabetes | No complications ; End-organ diabetes-related disease |
| copd | Indicator of chronic pulmonary disease | No ; Yes |
| fio2 | Fraction of inspired oxygen (%) | |
| available_analytics | Indicator of blood test available | No ; Yes |
| potassium | Potassium (mmol/L) | |
| resp_rate | Respiratory rate (bpm) | |
| leuk_lymph | Indicator of leukemia or lymphoma | No ; Yes |
| acute_leuk | Indicator of acute leukemia | No ; Yes |


4 Examples

The package structure can be divided into three main components: reading raw data, preprocessing data and identifying queries. Typically, after collecting data in REDCap, we will have to follow these three steps in order to obtain a final validated dataset for analysis. We will provide a complete user guide on how to perform each of these steps using the package’s functions. For the preprocessing of the data and the query identification, we will use the built-in dataset as an example.

4.1 Read data

4.1.1 redcap_data

The redcap_data function allows users to easily import data from a REDCap project into R for analysis.

To read exported data from REDCap, use the arguments data_path and dic_path to, respectively, describe the path of the R file and the REDCap project’s dictionary:

dataset <- redcap_data(data_path="C:/Users/username/example.r",
                       dic_path="C:/Users/username/example_dictionary.csv")

Note: To avoid errors when using this function, the R and CSV files exported from REDCap must be located in the same directory.

Another way to read data exported from a REDCap project is using an API connection. To do this, we can use the arguments uri and token which respectively refer to the uniform resource identifier of the REDCap project and the user-specific string that serves as the password:

dataset_api <- redcap_data(uri ="https://redcap.idibell.cat/api/",
                           token = "55E5C3D1E83213ADA2182A4BFDEA")

This function returns a list with 2 elements (imported data and dictionary) which can then be used for further analysis or visualization.
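Since the token acts as a password, it may be preferable not to hard-code it in a script. The following is a minimal sketch, in base R only, of reading it from an environment variable. The helper get_redcap_token and the variable name REDCAP_API_TOKEN are hypothetical and are not part of REDCapDM:

```r
# Hypothetical helper: read the API token from an environment variable
# instead of writing it in the script. The name REDCAP_API_TOKEN is an
# assumption made for this sketch, not something REDCapDM requires.
get_redcap_token <- function(var = "REDCAP_API_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable '", var, "' is not set.")
  }
  token
}

# Usage (not run here, since it needs network access and a valid token):
# dataset_api <- redcap_data(uri = "https://redcap.idibell.cat/api/",
#                            token = get_redcap_token())
```

The variable could be defined, for instance, in the user's .Renviron file so that it is available in every R session.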

4.2 Preprocess

4.2.1 rd_transform

The main function involved in the preprocessing of the data is rd_transform. This function is used to preprocess the REDCap data read into R with the redcap_data function, as described above. Using the arguments of the function, we can perform different types of transformations of our data.

As previously stated, we will use the built-in dataset covican as an example.

4.2.1.1 Data raw transformation

By default this function will perform a raw transformation of the data. The only necessary arguments that must be provided are the dataset to be transformed and the corresponding dictionary. This function will return the transformed dataset, dictionary and the output of the results of the transformation:

covican_transformed <- rd_transform(data = covican$data, 
                                    dic = covican$dictionary)

#Print the results of the transformation
covican_transformed$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes and changing their names to the names of its options. For checkboxes that have a question door specified in the branching logic, converting some of their values to missing

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns

As we can see, there are 4 steps in the transformation:

  1. Recalculation of REDCap calculated fields: the function finds all the calculated fields and recalculates them using the REDCap logic specified in the calculation field, translated into R. The recalculated variable is saved under the original name plus the suffix ‘_recalc’. The logic may contain smart variables or other complex structures that the function is not able to transcribe. The summary found in results shows how many calculated fields were found, whether they were transcribed and, if so, whether the recalculated variable is equal to the original one.

     In the example, there are two REDCap calculated fields; both were transcribed successfully, and the recalculated age does not match the original calculated variable from REDCap.

  2. Checkbox transformation: by default, the function renames each checkbox variable after its corresponding option and sets its labels to ‘No/Yes’. Another pair of labels can be specified with the checkbox_labels argument, as we will see. Furthermore, if the checkbox has a branching logic and that logic is not satisfied, its values are set to missing.

     For example, consider the variables corresponding to the checkbox field for the type of underlying disease. They were originally named type_underlying_disease___0 and type_underlying_disease___1, although the options are named ‘Haematological cancer’ and ‘Solid tumour’. In the transformed dataset, the names are therefore converted to type_underlying_disease_haematological_cancer and type_underlying_disease_solid_tumour. Since this checkbox variable does not have a branching logic, the results advise the user to review it, as seen above. After reviewing it, we could use the additional function rd_insert_na to insert the necessary missing values into this variable, as we will explain later. If a branching logic had been found for this variable, rd_transform would have inserted the missing values automatically where the logic is not satisfied, and no further transformation would be needed.

  3. Replacement of the original variable by its factor version: REDCap creates two versions of the variables in the dataset for multiple-choice fields: a numerical one with the number that corresponds to each category and a factor one containing the labels of each category. In this step, the original variables are replaced with their factor versions, except for redcap_event_name and redcap_data_access_group, for which both versions are kept. We can specify other variables that we do not want to transform to factor using the argument exclude_to_factor, which we will see later.

  4. Elimination of variables containing some pattern: by default, the pattern that the function looks for is ‘_complete’, but we can specify any other pattern using the argument delete_vars, as explained later.

     In this case, we do not have any variable with the pattern ‘_complete’, since the built-in dataset only contains a sample of the variables of the project. All REDCap projects, when downloaded, contain one variable with the pattern ‘_complete’ for each form, indicating whether the form has been marked as incomplete, unverified or completed. In general, we do not need this information, so these variables are removed by default.
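To make the last two steps more concrete, here is a minimal base R sketch of what they amount to on a toy data frame. It is only an illustration under the assumption that factor versions carry the suffix ‘.factor’ (as in the column names shown elsewhere in this guide). It is not the package's internal code, and the toy values are made up:

```r
# Toy export with a numeric variable, its factor version and a form status
toy <- data.frame(
  record_id = c("1", "2", "3"),
  dm = c(0, 1, 0),
  dm.factor = factor(c("No", "Yes", "No")),
  demographics_complete = c(2, 2, 0),
  stringsAsFactors = FALSE
)

# Sketch of step 3: replace each original variable by its '.factor' version
factor_cols <- grep("\\.factor$", names(toy), value = TRUE)
for (fc in factor_cols) {
  original <- sub("\\.factor$", "", fc)
  toy[[original]] <- toy[[fc]]  # overwrite the numeric version
  toy[[fc]] <- NULL             # drop the now redundant '.factor' column
}

# Sketch of step 4: drop variables whose name contains the pattern
toy <- toy[, !grepl("_complete", names(toy), fixed = TRUE), drop = FALSE]

names(toy)
```

After these two steps the toy data frame keeps only record_id and dm, with dm stored as a factor with the levels ‘No’ and ‘Yes’.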

4.2.1.2 Data transformation and classification by event

Alternatively, we can transform the dataset and split it by event. To do so, we set the final_format argument to "by_event". This transformation requires the file with the mapping between events and forms, which has to be downloaded from REDCap, since this information is needed to split the data. The path to this file is specified with the event_path argument:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        event_path = "files/COVICAN_instruments.csv",
                        final_format = "by_event")

#To print the results
dataset$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes and changing their names to the names of its options. For checkboxes that have a question door specified in the branching logic, converting some of their values to missing

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by event

Now, a final step in the transformation has been added, which consists in splitting the data according to the events in the study. So, now the transformed dataset found in the output of the function is a tibble object with as many data frames as events there are in the REDCap project:

dataset$data
# A tibble: 2 x 3
  events                   vars       df             
  <chr>                    <list>     <list>         
1 baseline_visit_arm_1     <chr [25]> <df [190 x 25]>
2 follow_up_visit_da_arm_1 <chr [8]>  <df [152 x 8]> 

The column df of the nested dataframe is a list containing the data corresponding to each event. Also the variables of the forms that are found in each event are reported in the column vars.
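To work with the data of a single event, we can extract the corresponding element of the df column. The sketch below mimics the events/vars/df layout with a base R data frame holding list columns, so that it runs without the package or the tibble class. The toy values are made up, and the same indexing is assumed to apply to the real dataset$data object:

```r
# Toy stand-ins for the per-event data frames (values are made up)
baseline  <- data.frame(record_id = c("1", "2", "3"), age = c(70, 55, 64),
                        stringsAsFactors = FALSE)
follow_up <- data.frame(record_id = c("1", "2"), potassium = c(4.1, 3.8),
                        stringsAsFactors = FALSE)

# Base R data frame with list columns, mimicking the events/vars/df layout
nested <- data.frame(events = c("baseline_visit_arm_1",
                                "follow_up_visit_da_arm_1"),
                     stringsAsFactors = FALSE)
nested$vars <- list(names(baseline), names(follow_up))
nested$df   <- list(baseline, follow_up)

# Pull out the data frame of a single event
baseline_data <- nested$df[[which(nested$events == "baseline_visit_arm_1")]]
str(baseline_data)
```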

4.2.1.3 Data transformation and classification by form

Another option is to split the data by the forms found in the REDCap project. We will use the same final_format argument to specify that we want to split data by forms and the event-form mapping file has to be specified with the event_path argument:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        event_path = "files/COVICAN_instruments.csv",
                        final_format = "by_form")

#To print the results
dataset$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes and changing their names to the names of its options. For checkboxes that have a question door specified in the branching logic, converting some of their values to missing

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by form

As before, a final step in the transformation has been added, which is to split the data according to the forms in the study. Thus, the transformed dataset will now be a tibble object with as many data frames as forms there are in the REDCap project:

dataset$data
# A tibble: 6 x 4
  form                        events    vars       df             
  <chr>                       <list>    <list>     <list>         
1 inclusionexclusion_criteria <chr [1]> <chr [11]> <df [190 x 11]>
2 demographics                <chr [1]> <chr [9]>  <df [190 x 9]> 
3 comorbidities               <chr [1]> <chr [10]> <df [190 x 10]>
4 vital_signs                 <chr [2]> <chr [7]>  <df [177 x 7]> 
5 laboratory_findings         <chr [2]> <chr [7]>  <df [177 x 7]> 
6 microbiological_studies     <chr [1]> <chr [6]>  <df [190 x 6]> 

4.2.1.4 Additional arguments

There are other arguments which can be used to customize some of the transformation steps that the function performs by default:


checkbox_labels: specifies the name of the categories for the checkbox variables. Default is ‘No/Yes’, but we can change it to ‘N/Y’:

dataset <- rd_transform(data = covican$data,
                        dic = covican$dictionary,
                        checkbox_labels = c("N", "Y"))


exclude_to_factor: specifies the name of the variables that we do not want to transform into a factor. For example, if we want the variable dm to keep its original numeric version:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        exclude_to_factor = "dm")


keep_labels: logical argument indicating whether to retain the REDCap labels of the dataset columns. By default, the function removes the labels from the dataset; they can still be found in the dictionary:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        keep_labels = TRUE)

str(dataset$data[,1:2])
'data.frame':   342 obs. of  2 variables:
 $ record_id        : 'labelled' chr  "100-13" "100-13" "100-16" "100-16" ...
  ..- attr(*, "label")= Named chr "Record ID"
  .. ..- attr(*, "names")= chr "record_id"
 $ redcap_event_name: 'labelled' chr  "baseline_visit_arm_1" "follow_up_visit_da_arm_1" "baseline_visit_arm_1" "follow_up_visit_da_arm_1" ...
  ..- attr(*, "label")= Named chr "Event Name"
  .. ..- attr(*, "names")= chr "redcap_event_name"


delete_vars: every variable containing the strings specified in this argument will be removed from the dataset. By default, the value of delete_vars is ‘_complete’. For example, we can change the argument to remove the inclusion and exclusion criteria variables from the dataset (variables that contain ‘inc_’ and ‘exc_’ in their names):

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        delete_vars = c("inc_", "exc_"))


which_event: in the transformation by event explained earlier, we can choose to keep only one of the events in the dataset. For example, if we only want to keep the baseline visit:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        event_path = "files/COVICAN_instruments.csv",
                        final_format = "by_event",
                        which_event = "baseline_visit_arm_1")


which_form: in the transformation by form explained earlier, we can choose to keep only one of the forms. For example, if we only want to keep the demographics form:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        event_path = "files/COVICAN_instruments.csv",
                        final_format = "by_form",
                        which_form = "demographics")

data <- dataset$data

names(data)
[1] "record_id"                       "redcap_event_name"              
[3] "redcap_data_access_group"        "redcap_event_name.factor"       
[5] "redcap_data_access_group.factor" "d_admission"                    
[7] "d_birth"                         "age"                            
[9] "age_recalc"                     


wide: in the transformation by form, we can specify that we want each of the split datasets to be in a wide format. This is useful if the form appears in more than one event (or in a repeated event). Then, we will only have one row per patient and all the variables of the form will be in columns repeated by each event in the order that the events appear in REDCap. For example, if we want to keep only the laboratory findings in a wide format we can do:

dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary,
                        event_path = "files/COVICAN_instruments.csv",
                        final_format = "by_form",
                        which_form = "laboratory_findings",
                        wide = TRUE)

head(dataset$data)
# A tibble: 6 x 5
  record_id available_analytics_1 available_analytics_2 potassium_1 potassium_2
  <chr>     <fct>                 <fct>                       <dbl>       <dbl>
1 100-13    Yes                   Yes                          3.66        4.1 
2 100-16    Yes                   No                           4.04       NA   
3 100-31    Yes                   <NA>                         4.58       NA   
4 100-34    Yes                   No                           3.48       NA   
5 100-36    Yes                   No                           4.09       NA   
6 100-52    Yes                   Yes                          3.7         7.15


4.2.2 rd_rlogic

This function transforms REDCap logic into logic that can be evaluated in R. It is used in rd_transform to recalculate the calculated fields, but it may also be useful in other circumstances. Let’s see how it transforms the logic of one of the calculated fields in the built-in dataset:

#screening failure
rd_rlogic(logic = "if([exc_1]='1' or [inc_1]='0' or [inc_2]='0' or [inc_3]='0',1,0)",
          data = covican$data)
[1] "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"
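The returned value is a character string, so it still has to be parsed and evaluated to obtain the recalculated variable. A minimal sketch of that step in base R follows. The toy data frame is made up and is named data only because the translated string refers to data$...:

```r
# Toy data with the four criteria coded as in REDCap ("0" = No, "1" = Yes)
data <- data.frame(
  exc_1 = c("0", "1", "0"),
  inc_1 = c("1", "1", "0"),
  inc_2 = c("1", "1", "1"),
  inc_3 = c("1", "1", "1"),
  stringsAsFactors = FALSE
)

# Translated logic, copied from the output shown above
r_logic <- "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"

# Parse the string and evaluate it against the data
screening_fail <- eval(parse(text = r_logic))
screening_fail
```

In this toy example the first record fulfils all the criteria, whereas the second one meets the exclusion criterion and the third one fails the first inclusion criterion.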


4.2.3 rd_insert_na

This function sets some values of a variable to missing if a certain logic is fulfilled. It can be used as a complementary function for rd_transform, for example, to change the values of those checkboxes that do not have a branching logic, as commented earlier. For instance, we can perform a raw transformation of our data, as in the beginning of this section, and then use this function to set the values of the checkbox type_underlying_disease_haematological_cancer to missing when the age is less than 65 years old:

#Raw transformation of the data:
dataset <- rd_transform(data = covican$data, 
                        dic = covican$dictionary)

#Before inserting missings
table(dataset$data$type_underlying_disease_haematological_cancer)

 No Yes 
103  87 
#Run the function
dataset$data <- rd_insert_na(data = dataset$data,
                             filter = "age < 65",
                             vars = "type_underlying_disease_haematological_cancer")

#After inserting missings
table(dataset$data$type_underlying_disease_haematological_cancer)

 No Yes 
 65  50 
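Conceptually, the operation is equivalent to setting the chosen variable to NA in the rows where the filter holds. A base R sketch on made-up toy data is shown below. It is an illustration of the idea, not the implementation of rd_insert_na, and the column name cancer is hypothetical:

```r
# Toy data: age and a checkbox-like factor
toy <- data.frame(
  age = c(70, 50, 64, 80),
  cancer = factor(c("Yes", "No", "Yes", "No"))
)

# Rows where the filter "age < 65" holds (which() ignores missing ages)
rows <- which(toy$age < 65)

# Set the variable to missing in those rows only
toy$cancer[rows] <- NA

table(toy$cancer, useNA = "ifany")
```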


4.3 Queries

Queries are very important to ensure the accuracy and reliability of a REDCap dataset. The collected data may contain missing values, inconsistencies, or other potential errors that need to be identified in order to correct them later.

For all the following examples we will use the raw transformed data: covican_transformed.

4.3.1 rd_query

The rd_query function allows users to generate queries by using a specific expression. It can be used to identify missing values, values that fall outside the lower and upper limit of a variable and other types of inconsistencies.

4.3.1.1 Output

First, we will examine the output of this function. When the rd_query function is executed, it returns a list that includes a data frame with all the queries identified and a second element with a summary of the number of generated queries in each specified variable:

| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code |
|:----------:|:---:|:-----:|:----------:|:-----:|:----------:|:-----------:|:-----:|:----:|
| 100-58 | Hospital 11 | Baseline visit | Comorbidities | copd | | Chronic obstructive pulmonary disease | The value is NA and it should not be missing | 100-58-1 |
| 102-113 | Hospital 24 | Baseline visit | Demographics | age | | Age | The value is NA and it should not be missing | 102-113-1 |
| 105-11 | Hospital 5 | Baseline visit | Comorbidities | copd | | Chronic obstructive pulmonary disease | The value is NA and it should not be missing | 105-11-1 |
| 105-11 | Hospital 5 | Baseline visit | Demographics | age | | Age | The value is NA and it should not be missing | 105-11-2 |
| 105-56 | Hospital 5 | Baseline visit | Comorbidities | copd | | Chronic obstructive pulmonary disease | The value is NA and it should not be missing | 105-56-1 |
| 105-56 | Hospital 5 | Baseline visit | Demographics | age | | Age | The value is NA and it should not be missing | 105-56-2 |

Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 6
age Age 5

The data frame is designed to aid the user in locating each query in their REDCap project. It includes information such as the record identifier, the data access group, the event in which each query can be found, along with the name and description of the analyzed variable and a brief description of the query.

Let’s see some examples of the usability of the function in generating different types of queries.

4.3.1.2 Missings

For instance, to identify missing values in the variables copd and age of the raw transformed data, several arguments are required. We must use the variables argument to specify the variables from the database that will be examined and the expression argument to describe the expression that will be applied to those variables, in this case ‘%in%NA’ to detect missing values. Additionally, we must use the data and dic arguments to indicate the R objects containing the REDCap data and dictionary, respectively. If the REDCap project has a longitudinal design, we should also use the event argument to specify the event in which the described variables are collected:

example <- rd_query(variables = c("copd", "age"),
                    expression = c("%in%NA", "%in%NA"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)

# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 6
age Age 5

In this case, we can observe that there are 6 missing values in the copd variable and 5 missing values in age.

4.3.1.3 Missings of variables with a branching logic

Another example is when we try to identify missing values in variables with a branching logic. In this scenario, when the conditions of the branching logic are not satisfied, all of the values should be missing by definition, and thus queries for these specific missing values (conditions not met) should not be reported. To address this, if a variable has a branching logic, the function will issue a warning with a message to check the results element of our output:

example <- rd_query(variables = c("age", "copd", "potassium"),
                    expression = c("%in%NA", "%in%NA", "%in%NA"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data=covican_transformed$data)
Warning: Some of the variables that were checked for missings present a branching logic. 
Check the results tab of output for more details (...$results).
# Printing results
example$results
Report of queries
Variables Description Total Branching logic
age Age 5 -
copd Chronic obstructive pulmonary disease 6 -
potassium Potassium 31 [available_analytics]=‘1’

As we can see, in addition to the missing values of the age and copd variables already identified, there are 31 missing values in the potassium variable. We can also observe that the variable potassium has the following branching logic [available_analytics]=‘1’, which means that we should only identify the missing values when available_analytics has the value ‘1’. To accomplish this, we can use the filter argument to ensure that the condition in this branching logic is fulfilled. Recall that, in the transformed dataset, the value ‘1’ was changed to ‘Yes’.

example <- rd_query(variables = c("potassium"),
                    expression = c("%in%NA"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data,
                    filter = c("available_analytics=='Yes'"))
Warning: Some of the variables that were checked for missings present a branching logic. 
Check the results tab of output for more details (...$results).
# Printing results
example$results
Report of queries
Variables Description Total Branching logic
potassium Potassium 21 [available_analytics]=‘1’

The total number of missing values changes when we use the filter argument: the variable potassium now has 21 missing values instead of the 31 cases identified previously. This means that we were counting 10 missing values for which available_analytics did not have the value ‘1’ and which, therefore, should not be reported as queries.
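The effect of the filter can be reproduced on a small scale with base R. In the sketch below, which uses made-up toy values, a missing potassium value is only counted when the condition of the branching logic is met:

```r
# Toy data: potassium is only expected when a blood test is available
toy <- data.frame(
  available_analytics = factor(c("Yes", "Yes", "No", "No", "Yes")),
  potassium = c(4.1, NA, NA, NA, 3.9)
)

# All missing values, regardless of the branching logic
all_missing <- sum(is.na(toy$potassium))

# Missing values that are real queries: blood test available, value missing
real_missing <- sum(is.na(toy$potassium) & toy$available_analytics == "Yes")

c(all = all_missing, filtered = real_missing)
```

Here two of the three missing values belong to records without a blood test, so only one of them would need to be queried.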

4.3.1.4 Expressions

Up until this point, we have observed examples where the expression applied is for detecting missing values. But, as previously mentioned, the rd_query function is also able to identify outliers or observations that fulfill a specific condition. Hence, to identify, for example, all the observations where age is greater than 70, we should use the expression argument again but specifying ‘>70’ instead of ‘%in%NA’:

example <- rd_query(variables="age",
                    expression=">70",
                    event="baseline_visit_arm_1",
                    dic=covican_transformed$dictionary,
                    data=covican_transformed$data)

# Printing results
example$results
Report of queries
Variables Description Total
age Age 76


We can add other variables with other specific expressions in the same function because it is designed to treat the arguments variables and expression as vectors, so that the element at position ‘n’ of expression is applied to the element at position ‘n’ of variables.
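The positional pairing can be illustrated with a few lines of base R. This is a sketch of the idea on made-up toy data, assuming that each condition is formed by appending the expression to the variable name. It does not claim to reproduce how rd_query is implemented:

```r
# Toy data with one numeric and one character variable
toy <- data.frame(
  age = c(75, 60, NA, 82),
  copd = c("Yes", "No", "Yes", NA),
  stringsAsFactors = FALSE
)

variables  <- c("age", "copd")
expression <- c(">70", "%in%NA")

# The n-th expression is applied to the n-th variable
counts <- mapply(function(v, e) {
  condition <- eval(parse(text = paste0("toy$", v, e)))
  sum(condition, na.rm = TRUE)  # number of rows fulfilling the condition
}, variables, expression)

counts
```

In the toy data, two ages are greater than 70 and one copd value is missing.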

For example, if we wanted to identify all the observations where age is greater than 70 and all the observations where copd is ‘Yes’ we should use:

example <- rd_query(variables=c("age", "copd"),
                    expression=c(">70", "='Yes'"),
                    event="baseline_visit_arm_1",
                    dic=covican_transformed$dictionary,
                    data=covican_transformed$data)

# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 190
age Age 76


In a more complex scenario, for example, to identify all the observations where age is either between 70 and 80 (greater than 70 and less than 80) or missing, we should use the following expression:

example <- rd_query(variables="age",
                    expression="(>70 & <80) | %in%NA",
                    event="baseline_visit_arm_1",
                    dic=covican_transformed$dictionary,
                    data=covican_transformed$data)

# Printing results
example$results
Report of queries
Variables Description Total
age Age 54

4.3.1.5 Special cases

Same expression for all variables

In order to evaluate the same expression for all variables, the user should supply just a single element for expression:

example <- rd_query(variables = c("copd","age","dm"),
                    expression = c("%in%NA"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)
Warning: There are more variables than expressions, so the same expression was
applied to all variables
# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 6
age Age 5
dm Diabetes (treated with insulin or antidiabetic … 5


The function issues a warning every time the same expression is applied to all variables to ensure that the user did not make a mistake when providing the information for each argument.


Not defining an event in a dataset with multiple events

Another special case is when the data analysed corresponds to a REDCap longitudinal project, but the event argument of the function is not defined:

example <- rd_query(variables = c("copd"),
                    expression = c("%in%NA"),
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)
Warning: event = NA, but the dataset presents a variable that indicates the
presence of events, please specify the event.
# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 158

As we can see, the number of missing values for the variable copd goes from 6 to 158 because the function considers all the events of the study if no event is specified. Thus, it might result in an overestimation of the number of missing values.
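The overestimation is easy to reproduce with base R on made-up toy data: a variable collected only at baseline is necessarily empty in the rows of the other events, and those rows are counted as missing when the event is not restricted:

```r
# Toy longitudinal data: copd is only collected at the baseline visit
toy <- data.frame(
  redcap_event_name = rep(c("baseline_visit_arm_1",
                            "follow_up_visit_da_arm_1"), times = 3),
  copd = c("No", NA, NA, NA, "Yes", NA),
  stringsAsFactors = FALSE
)

# Missing values over all events (overestimated)
missing_all <- sum(is.na(toy$copd))

# Missing values in the event where the variable is actually collected
is_baseline <- toy$redcap_event_name == "baseline_visit_arm_1"
missing_baseline <- sum(is.na(toy$copd[is_baseline]))

c(all_events = missing_all, baseline_only = missing_baseline)
```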


The function will issue a warning if it detects the presence of the variable exported by default from a REDCap longitudinal project and the event argument is not specified.

4.3.1.6 Additional arguments

variables_names, query_name, instrument

These arguments allow us to customize the data frame returned by the function. We can change the names of the variables using the variables_names argument, alter the description of the query using the query_name argument or even change the name of the instrument using the instrument argument:

example<- rd_query(variables = c("copd"),
                   variables_names = c("Chronic obstructive pulmonary disease (Yes/No)"),
                   expression = c("%in%NA"),
                   query_name = c("COPD is a missing value."),
                   instrument = c("Admission"),
                   event = "baseline_visit_arm_1",
                   dic = covican_transformed$dictionary,
                   data = covican_transformed$data)

Output:

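| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code |
|:----------:|:---:|:-----:|:----------:|:-----:|:----------:|:-----------:|:-----:|:----:|
| 100-58 | Hospital 11 | Baseline visit | Admission | copd | | Chronic obstructive pulmonary disease (Yes/No) | COPD is a missing value. | 100-58-1 |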


negate

This argument negates the expression applied to the variables. For example, to identify all the non-missing values of the variable copd, we can use the expression ‘%in%NA’, which would normally report the missing values, and add negate = TRUE, so that the result is the number of non-missing values in copd:

example <- rd_query(variables = c("copd"),
                    expression = c("%in%NA"),
                    negate = TRUE,
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)

# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 184

There are 184 non-missing values in the variable copd.
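The effect of negation can be pictured with a small base R sketch on a hypothetical toy vector (this is an illustration of the idea, not the package's internal code). The negated condition selects exactly the rows that the original condition leaves out, so the two counts add up to the number of rows evaluated:

```r
# Hypothetical toy vector standing in for a variable such as copd
copd <- c("No", NA, "Yes", NA, "No")

flagged         <- is.na(copd)   # rows selected by the expression
flagged_negated <- !flagged      # rows selected once the expression is negated

n_missing     <- sum(flagged)
n_non_missing <- sum(flagged_negated)

# The two counts partition the rows that were evaluated
stopifnot(n_missing + n_non_missing == length(copd))
print(c(missing = n_missing, non_missing = n_non_missing))
```

By the same reasoning, the 184 non-missing values reported here and the 6 missing values reported for the baseline event elsewhere in this guide would together cover every baseline row.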


addTo

To keep all queries in the same R object, we can pass the output of a previous query to the addTo argument, so that the new queries are added to it:

example2 <- rd_query(variables = c("age"),
                     expression = c("%in%NA"),
                     event = "baseline_visit_arm_1",
                     dic = covican_transformed$dictionary,
                     data = covican_transformed$data,
                     addTo = example)

# Printing results
example2$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 184
age Age 5

We have combined our earlier output of 184 non-missing values in the variable copd with the new query dataset, which consists of the 5 missing values of the variable age.


report_title

To customize the title of the summary of queries, we can use the report_title argument:

example <- rd_query(variables = c("copd", "age"),
                    expression = c("%in%NA", "<20"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data,
                    report_title = "Missing COPD values in the baseline event")

# Printing results
example$results
Missing COPD values in the baseline event
Variables Description Total
copd Chronic obstructive pulmonary disease 6

The default title of the summary is “Report of queries” but we have changed it to “Missing COPD values in the baseline event”.


report_zeros

By default, the function will only report, in the summary of queries, variables with at least one query and will omit those with zero queries. To include these omitted variables in the summary, we can use the report_zeros argument:

example <- rd_query(variables = c("copd", "age"),
                    expression = c("%in%NA", "<20"),
                    event = "baseline_visit_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data,
                    report_zeros = TRUE)

# Printing results
example$results
Report of queries
Variables Description Total
copd Chronic obstructive pulmonary disease 6
age Age 0

The variable age now appears in the summary even though no queries were identified for it.

4.3.2 rd_event

When working with a longitudinal REDCap project (one that defines events), the exported data has one row per record and event. However, by default, REDCap does not export rows for events in which no data has been collected. Consequently, if we use the rd_query function to look for missing values in variables that belong to an event that is absent for some records, those missing values will not be identified, because the corresponding rows do not exist in the exported data. The rd_event function reports for how many records an event is absent:

example <- rd_event(event = "follow_up_visit_da_arm_1",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)

# Print results
example$results
Report of queries
Events Description Total
follow_up_visit_da_arm_1 Follow up visit day 14+/-5d 38

A total of 38 records have no row corresponding to the event Follow up visit day 14+/-5d. Thus, when searching for missing values of variables in the Follow up visit day 14+/-5d event, we need to keep in mind that there are 38 additional records with missing values that rd_query will not account for.
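The underlying idea can be sketched in base R on a hypothetical toy export (a simplified illustration only; it does not reproduce how rd_event is implemented). A record lacks an event when it appears in the data but never in a row of that event:

```r
# Hypothetical toy export: record "2" has no follow-up row at all
toy <- data.frame(
  record_id         = c("1", "1", "2", "3", "3"),
  redcap_event_name = c("baseline_visit_arm_1", "follow_up_visit_da_arm_1",
                        "baseline_visit_arm_1",
                        "baseline_visit_arm_1", "follow_up_visit_da_arm_1"),
  stringsAsFactors  = FALSE
)

event <- "follow_up_visit_da_arm_1"

# Records that have at least one row for the event
with_event <- unique(toy$record_id[toy$redcap_event_name == event])

# Records that appear in the export but never in that event
without_event <- setdiff(unique(toy$record_id), with_event)

print(without_event)
```

Because record "2" has no follow-up row, a search for NA values restricted to the follow-up event would never see it, which is the gap that rd_event is meant to expose.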


An event may not be mandatory for all records, in which case we only want to check whether it is missing in a subgroup of records. For example, in the COVICAN study only patients who satisfy the inclusion and exclusion criteria have to attend the follow-up visit. Therefore, to check whether the follow-up event is missing only in the records that comply with the inclusion and exclusion criteria, we can use the filter argument of the rd_event function:

example <- rd_event(event = "follow_up_visit_da_arm_1",
                    filter = "screening_fail_crit==0",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data)

# Print results
example$results
Report of queries
Events Description Total
follow_up_visit_da_arm_1 Follow up visit day 14+/-5d 34


Like the rd_query function, this function also treats the event argument as a vector, allowing us to check for multiple missing events at the same time:

example <- rd_event(event = c("baseline_visit_arm_1","follow_up_visit_da_arm_1"),
                    filter = "screening_fail_crit==0",
                    dic = covican_transformed$dictionary,
                    data = covican_transformed$data,
                    report_zeros = TRUE)

# Print results
example$results
Report of queries
Events Description Total
follow_up_visit_da_arm_1 Follow up visit day 14+/-5d 34
baseline_visit_arm_1 Baseline visit 0


Note: This function also has the arguments query_name, addTo, report_title and report_zeros that work in the same way as in the examples previously mentioned.

4.3.3 check_queries

Once the process of identifying queries is complete, the typical approach is to address them by modifying the original data in REDCap and then re-running the query identification process to generate a new query dataset.

The check_queries function compares the previous query dataset with the new one through the arguments old and new, respectively. The output is still a list with 2 items, but the data frame containing the information for each query now has an additional column (“Modification”) indicating which queries are new, which have been modified, and which remain unchanged. In addition, the summary shows the number of queries in each of these categories:

check <- check_queries(old = example$queries,
                       new = new_example$queries)

# Print results
check$results
Report of modifications
State Total
Unmodified 7
Modified 4
New 1

There are 7 unchanged queries, 4 modified queries, and 1 new query when the previous and new query datasets are compared.
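As a rough illustration of this kind of comparison, the base R sketch below labels the rows of a new query table against an old one. The toy tables, the matching key (identifier plus field) and the rule used to decide what counts as "Modified" are all assumptions made for this example; the actual criteria applied by check_queries may differ:

```r
# Hypothetical old and new query tables (toy data, columns reduced for brevity)
old <- data.frame(
  Identifier = c("100-58", "105-11"),
  Field      = c("copd", "age"),
  Query      = c("The value is NA and it should not be missing",
                 "The value is NA and it should not be missing"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  Identifier = c("100-58", "105-11", "100-79"),
  Field      = c("copd", "age", "copd"),
  Query      = c("The value is NA and it should not be missing",
                 "The value is 17 and it should not be <20",
                 "The value is NA and it should not be missing"),
  stringsAsFactors = FALSE
)

# Assumed matching key: record identifier plus field
key_old <- paste(old$Identifier, old$Field)
key_new <- paste(new$Identifier, new$Field)
pos     <- match(key_new, key_old)   # NA when the query did not exist before

# Assumed rule: no match -> New; same text -> Unmodified; otherwise Modified
new$Modification <- ifelse(is.na(pos), "New",
                    ifelse(new$Query == old$Query[pos], "Unmodified", "Modified"))

# Summary of the number of queries in each state
print(table(new$Modification))
```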


Query control output:

Identifier DAG Event Instrument Field Repetition Description Query Code Modification
100-58 Hospital 11 Baseline visit Comorbidities copd
Chronic obstructive pulmonary disease The value is NA and it should not be missing 100-58-1 Unmodified
100-79 Hospital 11 Initial visit Comorbidities copd
Chronic pulmonary disease The value is NA and it should not be missing 100-79-1 New
102-113 Hospital 24 Baseline visit Demographics age
Age The value is NA and it should not be missing 102-113-1 Unmodified
105-11 Hospital 5 Baseline visit Comorbidities copd
Chronic obstructive pulmonary disease The value is NA and it should not be missing 105-11-1 Unmodified
105-11 Hospital 5 Baseline visit Demographics age
Age The value is NA and it should not be missing 105-11-2 Unmodified
105-56 Hospital 5 Baseline visit Comorbidities copd
Chronic obstructive pulmonary disease The value is NA and it should not be missing 105-56-1 Unmodified