
Multi-Database Benchmarking: Old vs New Cohort Generation

Overview

This vignette describes how to benchmark CDMConnector::generateCohortSet (old, CIRCE-based) against atlasCohortGenerator::generateCohortSet2 (new, DAG-optimized batch) across multiple database platforms. The benchmarking script:

  1. Runs both methods on each database with the same cohort set.
  2. Records overall time for each method and writes results to a CSV.
  3. Confirms that the two cohort tables have identical rows (order ignored) and writes per-database and per-cohort equivalence results to a second CSV.
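The three steps can be sketched in base R. This is a minimal illustration only: generate_old and generate_new below are hypothetical stand-ins for the real generator calls, which require a live database connection.

```r
# Minimal sketch of the timing step. The two stand-in functions replace the
# real CDMConnector::generateCohortSet / atlasCohortGenerator::generateCohortSet2
# calls, which need a connected CDM.
generate_old <- function() Sys.sleep(0.2)   # stand-in for the CIRCE-based method
generate_new <- function() Sys.sleep(0.1)   # stand-in for the DAG-optimized method

time_old_sec <- as.numeric(system.time(generate_old())["elapsed"])
time_new_sec <- as.numeric(system.time(generate_new())["elapsed"])

results <- data.frame(
  database           = "postgres",
  time_old_sec       = time_old_sec,
  time_new_sec       = time_new_sec,
  ratio_new_over_old = time_new_sec / time_old_sec
)

# The real script appends one such row per database.
write.csv(results, file.path(tempdir(), "benchmark_results.csv"), row.names = FALSE)
```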

Supported platforms include PostgreSQL, Redshift, Snowflake, Spark, and SQL Server. You provide a named list of CDM reference objects; the script handles timing, comparison, and CSV output.

Performance improvements with the new approach

The new approach (generateCohortSet2) uses a DAG-based batch optimizer that shares vocabulary and domain-table work across cohorts instead of generating each cohort independently.

As a result, wall-clock time typically decreases as the number of cohorts and the overlap in concept sets increase. The ratio (new time / old time) is often below 1.0, with larger batches showing greater speedups. The exact improvement depends on the number of cohorts in the batch, the degree of concept-set overlap between them, and the database platform.

The benchmarking script records time_old_sec, time_new_sec, and ratio_new_over_old per database so you can measure the speedup on your own data and platforms.

How to run the benchmark

Prerequisites

Before running either variant you need the benchmark scripts under extras/, a folder of cohort definitions readable by CDMConnector::readCohortSet(), and a connected CDM reference object for each database you want to test.

Single database

For one CDM, use the single-database benchmark and optional equivalence check:

source("extras/benchmark_cohort_generation.R")

# cdm is an existing CDM reference object (e.g. from CDMConnector::cdmFromCon())
cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")
result <- benchmark_cohort_generation(cdm, cohort_set, cohort_path = "path/to/cohorts")

# Compare old vs new cohort tables (identical rows, order ignored)
cmp <- compare_cohort_tables(result$cdm, name_old = "cohort_bench_old", name_new = "cohort_bench_new")
cmp$identical   # TRUE if same set of rows
cmp$per_cohort  # Per-cohort row counts and match status

Multiple databases

Pass a named list of CDM objects; names are used as the database identifier in the output CSVs (e.g. postgres, redshift, snowflake, spark, sql_server):

source("extras/benchmark_cohort_generation.R")
source("extras/benchmark_multi_database.R")

cohort_set <- CDMConnector::readCohortSet("path/to/cohorts")
cdms <- list(
  postgres   = cdm_postgres,
  redshift   = cdm_redshift,
  snowflake  = cdm_snowflake,
  spark      = cdm_spark,
  sql_server = cdm_sqlserver
)

run_benchmark_multi_database(
  cdms = cdms,
  cohort_set = cohort_set,
  cohort_path = "path/to/cohorts",
  results_csv = "benchmark_results.csv",
  equivalence_csv = "benchmark_equivalence.csv"
)

Benchmark results CSV (timing)

The timing CSV has one row per database. Example structure:

database    time_old_sec  time_new_sec  ratio_new_over_old  n_cohorts  files_included                                              status
postgres    120.5         45.2          0.38                4          cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json  ok
redshift    95.2          38.0          0.40                4          cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json  ok
snowflake   88.1          32.5          0.37                4          cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json  ok
sql_server  110.3         42.1          0.38                4          cohort_a.json; cohort_b.json; cohort_c.json; cohort_d.json  ok
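Given rows like those above, the ratio and speedup follow directly. A base-R sketch using the example numbers (the data frame here mirrors the CSV structure; a real run would use read.csv on the results file):

```r
# Example rows mirroring the timing CSV above.
results <- data.frame(
  database     = c("postgres", "redshift"),
  time_old_sec = c(120.5, 95.2),
  time_new_sec = c(45.2, 38.0)
)

# ratio < 1 means the new method was faster; speedup is the inverse view.
results$ratio_new_over_old <- results$time_new_sec / results$time_old_sec
results$speedup            <- results$time_old_sec / results$time_new_sec
# postgres: ratio ~ 0.38, i.e. roughly a 2.7x speedup
```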

Equivalence CSV (same results)

The equivalence CSV confirms that the old and new cohort tables contain the same rows (order ignored). Each database has:

  1. An overall row (with cohort_definition_id NA): total row counts and whether the full tables match.
  2. Per-cohort rows: row counts from the old table (n_old), from the new table (n_new), and whether the set of rows for that cohort is identical (rows_identical).

Example:

database  cohort_definition_id  n_old  n_new  rows_identical  status
postgres  NA                    15000  15000  TRUE            ok
postgres  1                     5000   5000   TRUE            ok
postgres  2                     6000   6000   TRUE            ok
postgres  3                     4000   4000   TRUE            ok
redshift  NA                    15000  15000  TRUE            ok
redshift  1                     5000   5000   TRUE            ok

When rows_identical is TRUE for all cohorts (and the overall row), the new approach produces exactly the same cohort membership and dates as the old CIRCE-based method; only execution strategy and performance differ.
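The order-ignored comparison can be illustrated in base R. The rows_identical function below is a hypothetical sketch for local data frames; the real script performs the equivalent comparison inside the database.

```r
# Order-ignored row-set comparison of two cohort tables (local sketch).
# Two tables match if they contain the same multiset of rows, regardless
# of row order or column order.
rows_identical <- function(old, new) {
  if (nrow(old) != nrow(new)) return(FALSE)
  # Build a sortable key per row: columns in a fixed order, pasted together.
  key <- function(df) sort(do.call(paste, c(df[order(names(df))], sep = "|")))
  identical(key(old), key(new))
}

old <- data.frame(subject_id = c(1, 2), cohort_definition_id = c(1, 1))
new <- data.frame(subject_id = c(2, 1), cohort_definition_id = c(1, 1))
rows_identical(old, new)  # TRUE: same rows, different order
```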

Summary

Aspect       Description
Performance  The new batch optimizer typically reduces wall-clock time (ratio < 1) by sharing vocabulary and domain work across cohorts.
Correctness  The benchmarking pipeline compares old and new cohort tables row-by-row (order ignored) and writes equivalence results to CSV.
Platforms    Run the same cohort set on Postgres, Redshift, Snowflake, Spark, and SQL Server by passing a named list of CDMs to run_benchmark_multi_database().

Use the generated CSVs to document speedups and to confirm identical results across databases and between the two cohort generation methods.
