The ops_* functions are a set of lightweight utilities
that sit outside the main analysis pipeline. They help you verify your
environment before starting, explore data quality, and track how your
cohort changes at each processing step.
| Function | Purpose |
|---|---|
| ops_setup() | Check dx CLI, RAP authentication, and R package dependencies |
| ops_toy() | Generate synthetic UKB-like data for development and testing |
| ops_na() | Summarise missing values (NA and "") across all columns |
| ops_snapshot() | Record pipeline checkpoints and track dataset changes |
ops_setup() may query dx CLI and RAP authentication
status as part of its health check. All other functions operate entirely
locally: ops_toy() and ops_na() are read-only;
ops_snapshot() and its companions track and optionally
clean up columns; ops_withdraw() removes withdrawn
participants in-place. None of them read from or write to RAP
storage.

ops_setup(): Environment Health Check

Run ops_setup() once after installing ukbflow to confirm
that all required components are in place before starting a real
analysis.
library(ukbflow)
ops_setup()
#> ── ukbflow environment check ──────────────────────────────────────────────
#> ℹ ukbflow 0.1.0 | R 4.4.1 | 2026-03-09
#> ── 1. dx-toolkit ──────────────────────────────────────────────────────────
#> ✔ dx: /usr/local/bin/dx (dx-toolkit v0.375.0)
#> ── 2. RAP authentication ───────────────────────────────────────────────────
#> ✔ user: evan.zhou
#> ✔ project: project-GXk9...
#> ── 3. R packages ───────────────────────────────────────────────────────────
#> ✔ cli 3.6.3 [core]
#> ✔ data.table 1.15.4 [core]
#> ✔ survival 3.7.0 [assoc_coxph]
#> ✔ forestploter 1.1.1 [plot_forest]
#> ...
#> ───────────────────────────────────────────────────────────────────────────
#> ✔ 15 passed
#> ! 2 optional / warning

For programmatic use (e.g. inside scripts or CI), set
verbose = FALSE and inspect the returned list:
result <- ops_setup(verbose = FALSE)
result$summary
#> $pass
#> [1] 15
#> $warn
#> [1] 2
#> $fail
#> [1] 0
# Gate the rest of your script on a clean environment
stopifnot(result$summary$fail == 0)

Individual checks can be disabled when only a subset is needed:
# Check R package dependencies only (skip dx and RAP auth)
ops_setup(check_dx = FALSE, check_auth = FALSE)

ops_toy(): Synthetic UKB Data

ops_toy() generates a realistic but entirely synthetic
dataset that mimics the structure of UKB phenotype data on the RAP. Use
it to develop and test derive_*, assoc_*, and
plot_* functions without needing real UKB data access.
The default "cohort" scenario produces a wide
participant-level table that covers all major UKB data domains:
dt <- ops_toy()
#> ✔ ops_toy: 1000 participants | 75 columns | scenario = "cohort" | seed = 42
dim(dt)
#> [1] 1000 75
names(dt)
#> [1] "eid" "p31" "p34" "p53_i0"
#> [5] "p21022" "p21001_i0" "p20116_i0" "p1558_i0"
#> ...

Column groups included:
| Group | Columns |
|---|---|
| Demographics | eid, p31, p34, p53_i0, p21022 |
| Covariates | p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0 |
| Genetic PCs | p22009_a1 – p22009_a10 |
| Self-report disease | p20002_i0_a0 – a4, p20008_i0_a0 – a4 |
| Self-report cancer | p20001_i0_a0 – a4, p20006_i0_a0 – a4 |
| HES | p41270 (JSON array), p41280_a0 – a8 |
| Cancer registry | p40006_i0 – i2, p40011_i0 – i2, p40012_i0 – i2, p40005_i0 – i2 |
| Death registry | p40001_i0, p40002_i0_a0 – a2, p40000_i0 |
| First occurrence | p131742 |
| GRS columns | grs_bmi, grs_raw, grs_finngen |
| Messy columns | messy_allna, messy_empty, messy_label |
The messy columns deliberately stress-test
derive_missing() and ops_na() against common
data quality issues (all-NA columns, empty strings, non-standard missing
labels).
The output can be fed directly into the derive pipeline, for example
with derive_missing(dt).
The "forest" scenario returns a results table matching
the output of assoc_coxph(), useful for developing and
testing plot_forest() without running a real Cox model:
dt_forest <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42
plot_forest(
  data  = dt_forest[model == "Fully adjusted"],
  est   = dt_forest[model == "Fully adjusted", HR],
  lower = dt_forest[model == "Fully adjusted", CI_lower],
  upper = dt_forest[model == "Fully adjusted", CI_upper]
)

ops_na(): Missing Value Diagnostics

ops_na() scans every column for NA
and empty strings (""), returning counts
and percentages sorted by missingness. Counting "" as
missing is intentional: UKB exports frequently use empty strings as
placeholders for absent text values, so ops_na() reports
effective missingness rather than a plain is.na()
count. It is designed to be called before derive_missing()
to understand the data quality profile of a freshly extracted UKB
dataset.
dt <- ops_toy()
ops_na(dt)
#> ── ops_na ──────────────────────────────────────────────────────────────────
#> ℹ 1000 rows | 65 columns | threshold = 0%
#> ✖ messy_allna 1000 / 1000 (100.00%)
#> ✖ p41280_a4 1000 / 1000 (100.00%)
#> ✖ p20002_i0_a4 976 / 1000 ( 97.60%)
#> ✖ p131742 916 / 1000 ( 91.60%)
#> ...
#> ────────────────────────────────────────────────────────────────────────────
#> ✖ 41 columns ≥ 10% missing
#> ✔ 24 columns complete (0% missing)

Columns with ≥ 10% missing are flagged in red (✖); those
above 0% but below 10% are flagged in yellow (!). The summary block
(totals) is always printed regardless of the threshold
setting.

threshold

Use threshold to hide low-missingness columns from
the per-column listing when the dataset has many columns. The summary
block and returned data.table are always complete.
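
The counting rule described above can be sketched in base R. This is not the package's implementation: `count_effective_na()` is a hypothetical helper, the toy data are made up, and the exact way `threshold` filters the printed listing is an assumption made for illustration. The sketch treats `NA` and `""` alike and returns the full table even when the printed listing is filtered.

```r
# Hypothetical helper (not part of ukbflow): count NA and "" per column.
count_effective_na <- function(df, threshold = 0) {
  n_na <- vapply(df, function(x) {
    # Empty strings can only occur in character columns
    empty <- if (is.character(x)) !is.na(x) & x == "" else rep(FALSE, length(x))
    sum(is.na(x) | empty)
  }, integer(1))

  out <- data.frame(
    column = names(df),
    n_na   = unname(n_na),
    pct_na = round(100 * unname(n_na) / nrow(df), 2),
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$pct_na), , drop = FALSE]
  rownames(out) <- NULL

  # Assumption: threshold only limits what is printed, never what is returned
  print(out[out$pct_na >= threshold, , drop = FALSE])
  invisible(out)
}

# Made-up toy data with one NA-heavy and one ""-containing column
toy <- data.frame(
  id    = 1:4,
  label = c("a", "", NA, "d"),
  score = c(1.5, NA, NA, NA),
  stringsAsFactors = FALSE
)
res <- count_effective_na(toy, threshold = 50)
```

Here `label` counts two missing values (one `NA`, one `""`), which a plain `is.na()` count would report as one.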
ops_na() returns a data.table invisibly,
regardless of threshold:
result <- ops_na(dt, verbose = FALSE)
result
#> column n_na pct_na
#> <char> <int> <num>
#> 1: messy_allna 1000 100.0
#> 2: p41280_a4 1000 100.0
#> ...
# Identify columns to drop before modelling
cols_to_drop <- result[pct_na > 90, column]
dt[, (cols_to_drop) := NULL]

ops_snapshot(): Pipeline Checkpoints

ops_snapshot() records a lightweight summary of your
dataset at each processing step and stores it in the session cache. Each
subsequent call automatically computes deltas (Δ) against the previous
snapshot, making it easy to track how rows, columns, and missingness
change through the pipeline.
dt <- ops_toy()
ops_snapshot(dt, label = "raw")
#> ── snapshot: raw ───────────────────────────────────────────────────────────
#> rows 1,000
#> cols 65
#> NA cols 41
#> size 0.61 MB
#> ────────────────────────────────────────────────────────────────────────────
dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
#> ── snapshot: after_derive_missing ──────────────────────────────────────────
#> rows 1,000 (= 0)
#> cols 65 (= 0)
#> NA cols 43 (+2)
#> size 0.61 MB (= 0)
#> ────────────────────────────────────────────────────────────────────────────
dt <- dt[p31 == "Female"]
ops_snapshot(dt, label = "female_only")
#> ── snapshot: female_only ───────────────────────────────────────────────────
#> rows 570 (-430)
#> cols 65 (= 0)
#> NA cols 43 (= 0)
#> size 0.36 MB (-0.25 MB)
#> ────────────────────────────────────────────────────────────────────────────

When label is omitted, snapshots are named
snapshot_1, snapshot_2, etc. automatically.
Labels should be unique within a session: if the same label is used
twice, the history row is appended again but the stored column list is
overwritten, which can cause ops_snapshot_cols() and
ops_snapshot_diff() to behave unexpectedly.
Call ops_snapshot() with no arguments to print and
return the complete history data.table:
ops_snapshot()
#> ── ops_snapshot history ────────────────────────────────────────────────────
#> idx label timestamp nrow ncol n_na_cols size_mb
#> 1: 1 raw 14:30:01 1000 65 41 0.61
#> 2: 2 after_derive_missing 14:30:05 1000 65 43 0.61
#> 3: 3 female_only 14:30:08 570 65 43 0.36
#> ────────────────────────────────────────────────────────────────────────────

Set verbose = FALSE to record a snapshot without
printing anything, which is useful inside functions or automated scripts.
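
To make the checkpoint mechanics concrete, here is a minimal base R sketch of a session-level snapshot log with automatic labels, deltas against the previous entry, and a quiet mode. It is an illustration under stated assumptions, not ukbflow's implementation: `record_snapshot()` and `.snap_env` are hypothetical names, and this version counts only `NA` (not `""`) when tallying columns with missing values.

```r
# Hypothetical sketch (not ukbflow's implementation): keep a snapshot log
# in an environment so that it persists across calls within a session.
.snap_env <- new.env()
.snap_env$log <- NULL

record_snapshot <- function(df, label = NULL, verbose = TRUE) {
  prev_log <- .snap_env$log
  idx <- if (is.null(prev_log)) 1L else nrow(prev_log) + 1L
  if (is.null(label)) label <- paste0("snapshot_", idx)  # automatic naming

  row <- data.frame(
    idx       = idx,
    label     = label,
    nrow      = nrow(df),
    ncol      = ncol(df),
    n_na_cols = sum(vapply(df, anyNA, logical(1))),  # NA only in this sketch
    stringsAsFactors = FALSE
  )

  if (verbose) {
    if (is.null(prev_log)) {
      cat(sprintf("%s: rows %d | cols %d\n", label, row$nrow, row$ncol))
    } else {
      prev <- prev_log[nrow(prev_log), ]
      # Deltas are computed against the most recent snapshot
      cat(sprintf("%s: rows %d (%+d) | cols %d (%+d)\n",
                  label, row$nrow, row$nrow - prev$nrow,
                  row$ncol, row$ncol - prev$ncol))
    }
  }

  .snap_env$log <- rbind(prev_log, row)
  invisible(.snap_env$log)
}

# Made-up toy data: record quietly, filter, record again without a label
toy <- data.frame(id = 1:5, x = c(1, NA, 3, 4, 5))
record_snapshot(toy, label = "raw", verbose = FALSE)
toy <- toy[!is.na(toy$x), ]
snap_log <- record_snapshot(toy, verbose = FALSE)
```

Returning the log invisibly lets a script inspect it (for example `snap_log$nrow`) without producing console output.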

ops_snapshot_cols(): column names at a checkpoint

Returns the column names recorded at a given snapshot label, minus
protected columns (eid, sex, age,
age_at_recruitment, and any registered via
ops_set_safe_cols()). The primary use is building a drop
vector after the raw columns are no longer needed.
Pass keep to protect additional columns beyond the
defaults.
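
The idea of a drop vector with protected columns can be sketched with plain set operations. The helper `snapshot_drop_vector()` and the `default_safe` vector are hypothetical names for illustration; the default names simply mirror the protected columns listed above, and the real function's signature may differ.

```r
# Hypothetical sketch: protected names mirror those listed in the text above.
default_safe <- c("eid", "sex", "age", "age_at_recruitment")

snapshot_drop_vector <- function(snapshot_cols, keep = NULL, safe = default_safe) {
  # Everything recorded at the snapshot, minus built-in and user-protected names
  setdiff(snapshot_cols, union(safe, keep))
}

raw_cols <- c("eid", "p31", "p21001_i0", "p20116_i0")
snapshot_drop_vector(raw_cols, keep = "p31")
#> [1] "p21001_i0" "p20116_i0"
```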

ops_snapshot_diff(): compare two checkpoints

Returns lists of columns added and removed between two snapshots,
which is useful for auditing what derive_* functions produced.
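
Conceptually, the comparison reduces to two set differences over the recorded column names. The sketch below uses a hypothetical helper `cols_diff()` and made-up column vectors (`bmi` and `sex` stand in for derived columns); it does not reproduce the package's actual return structure.

```r
# Hypothetical sketch: compare two recorded column-name vectors.
cols_diff <- function(before, after) {
  list(
    added   = setdiff(after, before),   # present now, absent earlier
    removed = setdiff(before, after)    # present earlier, absent now
  )
}

before <- c("eid", "p31", "p21001_i0")
after  <- c("eid", "p31", "bmi", "sex")
d <- cols_diff(before, after)
d$added
#> [1] "bmi" "sex"
d$removed
#> [1] "p21001_i0"
```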

ops_snapshot_remove(): drop raw columns after deriving

Removes the raw columns captured at a snapshot from
data, keeping any derived columns added since. Built-in
safe columns (eid, etc.) and columns supplied in
keep are always retained.
# After deriving, drop the original raw columns
dt <- ops_snapshot_remove(dt, from = "raw")
#> ✔ ops_snapshot_remove: dropped 60 raw columns, 15 remaining.

For data.table input the operation is by reference
(in-place); for data.frame input a new
data.table is returned and the original is not
modified.

ops_set_safe_cols(): register study-specific protected columns

Adds column names to the session safe list so they are never dropped
by ops_snapshot_cols() or
ops_snapshot_remove().
ops_set_safe_cols(c("date_baseline", "age_at_recruitment"))
# Clear registered safe cols
ops_set_safe_cols(reset = TRUE)

ops_withdraw(): Exclude Withdrawn Participants

UK Biobank periodically issues withdrawal files listing participants
who have revoked consent. ops_withdraw() reads the
headerless single-column CSV supplied by UKB and removes matching rows
from your dataset. Two snapshots (before_withdraw /
after_withdraw) are recorded automatically.
dt <- ops_withdraw(dt, file = "withdraw.csv")
#> ── snapshot: before_withdraw ───────────────────────────────────────────────
#> rows 502,492
#> ...
#> ── snapshot: after_withdraw ────────────────────────────────────────────────
#> rows 502,489 (-3)
#> ...
#> ℹ Withdrawal file: w854944_20260310.csv (312 IDs)
#> ✖ Excluded: 3 participants found in data
#> ✔ Remaining: 502,489 participants

Run this immediately after loading your extracted dataset, before any
derive_* steps, so withdrawn participants never enter the
analysis.
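
The exclusion step itself is a simple ID match against a headerless, single-column file. The following base R sketch shows that logic under stated assumptions: `exclude_withdrawn()` is a hypothetical helper, the IDs and the temporary file are made up, and the snapshot recording that ops_withdraw() performs is not reproduced.

```r
# Hypothetical sketch (not ukbflow's implementation).
exclude_withdrawn <- function(df, file, id_col = "eid") {
  # Withdrawal file: one participant ID per line, no header row
  ids <- read.csv(file, header = FALSE, colClasses = "character")[[1]]
  hit <- as.character(df[[id_col]]) %in% ids
  message(sprintf("Withdrawal file lists %d IDs; %d found in data",
                  length(ids), sum(hit)))
  df[!hit, , drop = FALSE]
}

# Toy demonstration with a temporary file and made-up IDs
tmp <- tempfile(fileext = ".csv")
writeLines(c("1000002", "1000004", "9999999"), tmp)
toy <- data.frame(eid = 1000001:1000005, x = 1:5)
kept <- exclude_withdrawn(toy, tmp)
nrow(kept)
#> [1] 3
```

Reading the IDs as character avoids surprises when comparing against an ID column that may be stored as integer or text.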
The four ops_* functions form a natural bookend around
the core pipeline:
library(ukbflow)
# 1. Verify environment before starting
ops_setup()
# 2. Generate test data (or extract real data from RAP)
dt <- ops_toy()
# 3. Inspect data quality before processing
ops_na(dt)
# 4. Run pipeline with checkpoints
ops_snapshot(dt, label = "raw")
dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)
ops_snapshot(dt, label = "after_derive_covariate")
# 5. Review full pipeline history
ops_snapshot()

?ops_setup, ?ops_toy, ?ops_na, ?ops_snapshot
?ops_snapshot_cols, ?ops_snapshot_diff, ?ops_snapshot_remove, ?ops_set_safe_cols
?ops_withdraw
vignette("get-started"): end-to-end pipeline overview
vignette("derive"): disease phenotype derivation