Transfer REDCap data to DuckDB with minimal memory overhead. Designed for large datasets that exceed available RAM.
From CRAN:

```r
install.packages("redquack")
```

Development version:

```r
# install.packages("pak")
pak::pak("dylanpieper/redquack")
```
Data from REDCap is transferred to DuckDB in configurable chunks of record IDs:
```r
library(redquack)

con <- redcap_to_duckdb(
  redcap_uri = "https://redcap.example.org/api/",
  token = "YOUR_API_TOKEN",
  record_id_name = "record_id",
  chunk_size = 1000
  # Increase chunk size on systems with ample memory (faster)
  # Decrease chunk size on memory-constrained systems (slower)
)
```
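After the transfer completes, a quick sanity check confirms the chunks all arrived. A minimal sketch using standard DBI calls against the connection returned above:

```r
# List the tables created in the DuckDB database
DBI::dbListTables(con)

# Count transferred records; should match the record count in REDCap
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n_records FROM data")
```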
Query the data with dplyr:

```r
library(dplyr)

demographics <- tbl(con, "data") |>
  filter(demographics_complete == 2) |>
  select(record_id, age, race, gender) |>
  collect()

age_summary <- tbl(con, "data") |>
  group_by(gender) |>
  summarize(
    n = n(),
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  ) |>
  collect()
```
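Note that `tbl()` builds a lazy query: the dplyr verbs are translated to SQL and executed inside DuckDB, and rows only enter R when you call `collect()`. If you're curious, `show_query()` (via dbplyr, which dplyr uses for database backends) prints the generated SQL:

```r
# Inspect the SQL that DuckDB will run; no data is pulled into R yet
tbl(con, "data") |>
  group_by(gender) |>
  summarize(n = n()) |>
  show_query()
```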
Create a Parquet file directly from DuckDB (efficient for sharing data):
```r
DBI::dbExecute(con, "COPY (SELECT * FROM data) TO 'redcap.parquet' (FORMAT PARQUET)")
```
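The Parquet file can then be read without DuckDB or a database connection at all, for example with the arrow package (one option among many Parquet readers):

```r
# Read the exported Parquet file back into an R data frame
redcap <- arrow::read_parquet("redcap.parquet")
```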
Remember to close the connection when finished:
```r
DBI::dbDisconnect(con, shutdown = TRUE)
```
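Because the data lives on disk, you can reconnect in a later session without re-running the transfer. A sketch, assuming the database file is named `redcap.duckdb` (substitute whatever path your database was written to):

```r
# Reopen the existing DuckDB database file in a new session
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "redcap.duckdb")
```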
The DuckDB database created by `redcap_to_duckdb()` contains two tables:

`data` contains all exported REDCap records with optimized column types:

```r
DBI::dbGetQuery(con, "SELECT * FROM data LIMIT 10")
```

`log` contains timestamped logs of the transfer process for troubleshooting:

```r
DBI::dbGetQuery(con, "SELECT timestamp, type, message FROM log ORDER BY timestamp")
```
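To spot problems without reading the whole log, you can aggregate entries by type; the specific `type` labels are defined by the package, so this is just an illustrative query:

```r
# Summarize log entries by type to surface errors or warnings quickly
DBI::dbGetQuery(con, "SELECT type, COUNT(*) AS n FROM log GROUP BY type ORDER BY n DESC")
```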