This vignette describes how to benchmark `CDMConnector::generateCohortSet()` (old, CIRCE-based) against `atlasCohortGenerator::generateCohortSet2()` (new, DAG-optimized batch) across multiple database platforms. The benchmarking script handles timing, old-versus-new comparison, and CSV output. Supported platforms include PostgreSQL, Redshift, Snowflake, Spark, and SQL Server; you provide a named list of CDM reference objects.
The new approach (`generateCohortSet2()`) uses a DAG-based batch optimizer that shares vocabulary and domain work across cohorts. As a result, wall-clock time typically decreases as the number of cohorts and the overlap in their concept sets increase. The ratio (new time / old time) is often below 1.0, with larger batches showing greater speedups; the exact improvement depends on your cohort definitions, data, and platform.
The benchmarking script records `time_old_sec`, `time_new_sec`, and `ratio_new_over_old` per database so you can measure the speedup on your own data and platforms.
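As an illustration of the metric (not part of the benchmarking script), the ratio is simply the new wall-clock time divided by the old; values below 1.0 indicate a speedup. The timings below are hypothetical:

```r
# Hypothetical timings (seconds), for illustration only.
time_old_sec <- 120.5
time_new_sec <- 45.2

# ratio_new_over_old < 1 means the new approach was faster.
ratio_new_over_old <- time_new_sec / time_old_sec
round(ratio_new_over_old, 2)      # 0.38

# Equivalent speedup factor (old time / new time).
round(time_old_sec / time_new_sec, 2)  # 2.67
```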
To get started, load the package (for a development checkout, use `devtools::load_all()`) and read your cohort definitions with `CDMConnector::readCohortSet("path/to/cohorts")`. For one CDM, use the single-database benchmark and optional equivalence check:
```r
source("extras/benchmark_cohort_generation.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")
result <- benchmark_cohort_generation(cdm, cohort_set, cohort_path = "path/to/cohorts")

# Compare old vs new cohort tables (identical rows, order ignored)
cmp <- compare_cohort_tables(result$cdm, name_old = "cohort_bench_old", name_new = "cohort_bench_new")
cmp$identical   # TRUE if same set of rows
cmp$per_cohort  # Per-cohort row counts and match status
```

For a multi-database run, pass a named list of CDM objects; the names are used as the database identifier in the output CSVs (e.g. `postgres`, `redshift`, `snowflake`, `spark`, `sql_server`):
```r
source("extras/benchmark_cohort_generation.R")
source("extras/benchmark_multi_database.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")

cdms <- list(
  postgres   = cdm_postgres,
  redshift   = cdm_redshift,
  snowflake  = cdm_snowflake,
  spark      = cdm_spark,
  sql_server = cdm_sqlserver
)

run_benchmark_multi_database(
  cdms = cdms,
  cohort_set = cohort_set,
  cohort_path = "path/to/cohorts",
  results_csv = "benchmark_results.csv",
  equivalence_csv = "benchmark_equivalence.csv"
)
```

The timing CSV (`benchmark_results.csv`) contains the columns `database`, `time_old_sec`, `time_new_sec`, `ratio_new_over_old`, `n_cohorts`, `files_included`, and `status`. The equivalence CSV (`benchmark_equivalence.csv`) reports `n_old`, `n_new`, `rows_identical`, and `status` per cohort. The timing CSV has one row per database. Example structure:
| database | time_old_sec | time_new_sec | ratio_new_over_old | n_cohorts | files_included | status |
|---|---|---|---|---|---|---|
| postgres | 120.5 | 45.2 | 0.38 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| redshift | 95.2 | 38.0 | 0.40 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| snowflake | 88.1 | 32.5 | 0.37 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
| sql_server | 110.3 | 42.1 | 0.38 | 4 | cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json | ok |
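A results file like this can be summarized with base R. The snippet below builds a data frame matching the example table (hypothetical numbers) in place of reading `benchmark_results.csv`:

```r
# Hypothetical timing results mirroring the example table above.
results <- data.frame(
  database     = c("postgres", "redshift", "snowflake", "sql_server"),
  time_old_sec = c(120.5, 95.2, 88.1, 110.3),
  time_new_sec = c(45.2, 38.0, 32.5, 42.1),
  stringsAsFactors = FALSE
)
# In practice: results <- read.csv("benchmark_results.csv")

# Recompute and summarize the speedup ratios.
results$ratio_new_over_old <- results$time_new_sec / results$time_old_sec
range(results$ratio_new_over_old)  # all well below 1.0 in this example
mean(results$ratio_new_over_old)
```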
The equivalence CSV confirms that the old and new cohort tables contain the same rows (order ignored). Each database has:

- An overall row (`cohort_definition_id` = NA) with total row counts and whether the full tables match.
- One row per cohort with counts from the old table (`n_old`) and the new table (`n_new`), and whether the set of rows for that cohort is identical (`rows_identical`).

Example:
| database | cohort_definition_id | n_old | n_new | rows_identical | status |
|---|---|---|---|---|---|
| postgres | NA | 15000 | 15000 | TRUE | ok |
| postgres | 1 | 5000 | 5000 | TRUE | ok |
| postgres | 2 | 6000 | 6000 | TRUE | ok |
| postgres | 3 | 4000 | 4000 | TRUE | ok |
| redshift | NA | 15000 | 15000 | TRUE | ok |
| redshift | 1 | 5000 | 5000 | TRUE | ok |
When `rows_identical` is TRUE for all cohorts (and the overall row), the new approach produces exactly the same cohort membership and dates as the old CIRCE-based method; only execution strategy and performance differ.
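A quick programmatic check that every equivalence row passed might look like the sketch below; it builds a small data frame in place of `read.csv("benchmark_equivalence.csv")` and assumes the column layout shown above:

```r
# Sketch: verify old/new equivalence from the CSV written by the benchmark.
equiv <- data.frame(
  database             = c("postgres", "postgres", "redshift"),
  cohort_definition_id = c(NA, 1, NA),
  n_old                = c(15000, 5000, 15000),
  n_new                = c(15000, 5000, 15000),
  rows_identical       = c(TRUE, TRUE, TRUE)
)
# In practice: equiv <- read.csv("benchmark_equivalence.csv")

# Equivalent only if every per-cohort and overall row matches.
all_equivalent <- all(equiv$rows_identical) && all(equiv$n_old == equiv$n_new)
stopifnot(all_equivalent)
```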
| Aspect | Description |
|---|---|
| Performance | The new batch optimizer typically reduces wall-clock time (ratio < 1) by sharing vocabulary and domain work across cohorts. |
| Correctness | The benchmarking pipeline compares old and new cohort tables row-by-row (order ignored) and writes equivalence results to CSV. |
| Platforms | Run the same cohort set on Postgres, Redshift, Snowflake, Spark, and SQL Server by passing a named list of CDMs to `run_benchmark_multi_database()`. |
Use the generated CSVs to document speedups and to confirm identical results across databases and between the two cohort generation methods.