| Title: | Streamlined Workflow for UK Biobank Data Extraction, Analysis, and Visualization |
| Version: | 0.3.3 |
| Description: | Provides a streamlined workflow for UK Biobank cloud-based analysis on the Research Analysis Platform (RAP). Includes tools for phenotype extraction and decoding, variable derivation, survival and association analysis, genetic risk score computation, and publication-quality visualization. For details on the UK Biobank resource, see Bycroft et al. (2018) <doi:10.1038/s41586-018-0579-z>. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/evanbio/ukbflow, https://evanbio.github.io/ukbflow/ |
| BugReports: | https://github.com/evanbio/ukbflow/issues |
| Depends: | R (≥ 4.1) |
| Imports: | broom, cli, curl, data.table, dplyr, forestploter, gt, gtsummary, jsonlite, processx, rlang, survival, tidyselect, tools |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Suggests: | knitr, pagedown, pROC, rmarkdown, testthat (≥ 3.0.0), mockery, webshot2, withr |
| VignetteBuilder: | knitr, rmarkdown |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-04-01 10:01:17 UTC; 84241 |
| Author: | Yibin Zhou |
| Maintainer: | Yibin Zhou <evanzhou.bio@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-07 07:40:02 UTC |
ukbflow: Streamlined Workflow for UK Biobank Data Extraction, Analysis, and Visualization
Description
Provides a streamlined workflow for UK Biobank cloud-based analysis on the Research Analysis Platform (RAP). Includes tools for phenotype extraction and decoding, variable derivation, survival and association analysis, genetic risk score computation, and publication-quality visualization. For details on the UK Biobank resource, see Bycroft et al. (2018) doi:10.1038/s41586-018-0579-z.
Author(s)
Maintainer: Yibin Zhou <evanzhou.bio@gmail.com>
See Also
Useful links:
- https://github.com/evanbio/ukbflow
- https://evanbio.github.io/ukbflow/
- Report bugs at https://github.com/evanbio/ukbflow/issues
Fine-Gray competing risks association analysis
Description
Fits a Fine-Gray subdistribution hazard model (via
survival::finegray() + weighted coxph()) for each exposure
variable and returns a tidy result table with subdistribution hazard ratios
(SHR).
Usage
assoc_competing(
data,
outcome_col,
time_col,
exposure_col,
compete_col = NULL,
event_val = 1L,
compete_val = 2L,
covariates = NULL,
base = TRUE,
conf_level = 0.95
)
assoc_fg(
data,
outcome_col,
time_col,
exposure_col,
compete_col = NULL,
event_val = 1L,
compete_val = 2L,
covariates = NULL,
base = TRUE,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Primary event column. In Mode A: a multi-value column (any integer or character codes). In Mode B: a 0/1 binary column.
time_col: (character) Follow-up time column (numeric, e.g. years).
exposure_col: (character) One or more exposure variable names.
compete_col: (character or NULL) Mode B only: name of the 0/1 competing event column. When NULL (default), Mode A is used.
event_val: Scalar value in outcome_col that marks the primary event (Mode A). Default: 1L.
compete_val: Scalar value in outcome_col that marks the competing event (Mode A). Default: 2L.
covariates: (character or NULL) Covariate names for the Fully adjusted model. When NULL (default), that model is skipped.
base: (logical) Whether to auto-detect age and sex columns and include an Age and sex adjusted model. Default: TRUE.
conf_level: (numeric) Confidence level for SHR intervals. Default: 0.95.
Details
Two input modes are supported depending on how the outcome is coded in your dataset:
- Mode A - single multi-value column: compete_col = NULL (default). outcome_col contains all event codes in one column (e.g. 0/1/2/3). Use event_val and compete_val to identify the event of interest and the competing event; all other values are treated as censored. Example: UKB censoring_type where 1 = event, 2 = death (competing), 0/3 = censored.
- Mode B - dual binary columns: compete_col is the name of a separate 0/1 column for the competing event. outcome_col is a 0/1 column for the primary event. When both are 1 for the same participant, the primary event takes priority. Example: outcome_col = "copd_status", compete_col = "death_status".
Internally, both modes are converted to a three-level factor c("censor", "event", "compete") before being passed to finegray().
Three adjustment models are produced (where data allow):
- Unadjusted - always included.
- Age and sex adjusted - when base = TRUE and age/sex columns are detected.
- Fully adjusted - when covariates is non-NULL.
Value
A data.table with one row per exposure × term × model combination:
- exposure: Exposure variable name.
- term: Coefficient name as returned by coxph.
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- n: Participants in the model (after NA removal).
- n_events: Primary events in the analysis set.
- n_compete: Competing events in the analysis set.
- SHR: Subdistribution hazard ratio.
- CI_lower: Lower CI bound.
- CI_upper: Upper CI bound.
- p_value: Robust z-test p-value from the weighted Cox model.
- SHR_label: Formatted string, e.g. "1.23 (1.05-1.44)".
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L & htn_timing != 1L]
res <- assoc_competing(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
compete_col = "htn_status",
covariates = c("bmi_cat", "tdi_cat")
)
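The example above uses Mode B (two binary outcome columns). A Mode A call, with a single multi-value outcome column, would look like the sketch below; dm_censoring_type is a hypothetical column name coded as in the Details example, not a documented column of the toy data:

```r
# Mode A sketch: one multi-value column, hypothetical `dm_censoring_type`
# coded 0/3 = censored, 1 = event of interest, 2 = death (competing)
res_a <- assoc_competing(
  data = dt,
  outcome_col = "dm_censoring_type",
  time_col = "dm_followup_years",
  exposure_col = "p20116_i0",
  event_val = 1L,
  compete_val = 2L
)
```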
Cox proportional hazards association analysis
Description
Fits one or more Cox models for each exposure variable and returns a tidy result table suitable for downstream forest plots. By default, two standard adjustment models are always included alongside any user-specified model:
Usage
assoc_coxph(
data,
outcome_col,
time_col,
exposure_col,
covariates = NULL,
base = TRUE,
strata = NULL,
conf_level = 0.95
)
assoc_cox(
data,
outcome_col,
time_col,
exposure_col,
covariates = NULL,
base = TRUE,
strata = NULL,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset. Must contain all columns referenced by the other arguments.
outcome_col: (character) Name of the event indicator column. Accepts 0/1 integer/numeric or logical TRUE/FALSE values.
time_col: (character) Name of the follow-up time column (numeric, in consistent units, e.g. years).
exposure_col: (character) One or more exposure variable names. Each variable is analysed separately; results are stacked row-wise.
covariates: (character or NULL) Additional covariate column names for the Fully adjusted model (e.g. c("bmi_cat", "tdi_cat")). Default: NULL (that model is skipped).
base: (logical) If TRUE (default), include the Unadjusted and Age and sex adjusted models.
strata: (character or NULL) Optional stratification variable. Passed to survival::strata() in the model formula. Default: NULL.
conf_level: (numeric) Confidence level for hazard ratio intervals. Default: 0.95.
Details
- Unadjusted - no covariates (crude).
- Age and sex adjusted - age + sex auto-detected from the data via UKB field IDs (21022 and 31). Skipped with a warning if either column cannot be found.
- Fully adjusted - the covariates supplied via the covariates argument. Only run when covariates is non-NULL.
Outcome coding: outcome_col may be logical (TRUE/FALSE) or integer/numeric (0/1). Logical values are converted to integer internally.
Exposure types supported:
- Binary - 0/1 or TRUE/FALSE; produces one term row per model.
- Factor - produces one term row per non-reference level.
- Numeric (continuous) - produces one term row per model.
Value
A data.table with one row per exposure × term × model combination, and the following columns:
- exposure: Exposure variable name.
- term: Coefficient name as returned by coxph (e.g. "bmi_categoryObese" for a factor, or the variable name itself for numeric/binary exposures).
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- n: Number of participants included in the model (after NA removal).
- n_events: Number of events in the model's analysis set (after NA removal).
- person_years: Total person-years of follow-up in the model's analysis set (rounded, after NA removal).
- HR: Hazard ratio (point estimate).
- CI_lower: Lower bound of the confidence interval.
- CI_upper: Upper bound of the confidence interval.
- p_value: Wald test p-value.
- HR_label: Formatted string, e.g. "1.23 (1.05-1.44)".
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L]
res <- assoc_coxph(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
  covariates = c("bmi_cat", "tdi_cat")
)
# Fully adjusted only (skip Unadjusted + Age-sex)
res <- assoc_coxph(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
covariates = c("p21022", "p31", "bmi_cat", "tdi_cat"),
base = FALSE
)
Proportional hazards assumption test for Cox regression
Description
Tests the proportional hazards (PH) assumption using Schoenfeld residuals
via cox.zph. Re-fits the same models as
assoc_coxph (same interface) and returns a tidy result table
with term-level and global test statistics.
Usage
assoc_coxph_zph(
data,
outcome_col,
time_col,
exposure_col,
covariates = NULL,
base = TRUE,
strata = NULL
)
assoc_zph(
data,
outcome_col,
time_col,
exposure_col,
covariates = NULL,
base = TRUE,
strata = NULL
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Binary event indicator (0/1 or logical).
time_col: (character) Follow-up time column name.
exposure_col: (character) One or more exposure variable names.
covariates: (character or NULL) Covariates for the Fully adjusted model. Default: NULL.
base: (logical) Include Unadjusted and Age and sex adjusted models. Default: TRUE.
strata: (character or NULL) Optional stratification variable.
Details
A non-significant p-value (p > 0.05) indicates the PH assumption is
satisfied for that term. The global test (global_p) reflects the
overall PH assumption for the whole model.
Value
A data.table with one row per exposure × term × model combination, and columns:
- exposure: Exposure variable name.
- term: Coefficient name.
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- chisq: Schoenfeld residual chi-squared statistic.
- df: Degrees of freedom.
- p_value: P-value for the PH test (term-level).
- ph_satisfied: Logical; TRUE if p_value > 0.05.
- global_chisq: Global chi-squared for the whole model.
- global_df: Global degrees of freedom.
- global_p: Global p-value for the whole model.
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L]
zph <- assoc_coxph_zph(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
covariates = c("bmi_cat", "tdi_cat"),
base = FALSE
)
zph[ph_satisfied == FALSE]
Cox regression lag sensitivity analysis
Description
Runs Cox proportional hazards models at one or more lag periods to assess
whether associations are robust to the exclusion of early events. For each
lag, participants whose follow-up time is less than lag_years are
removed from the analysis dataset; follow-up time is kept on its original
scale (not shifted). This mirrors the approach used in UK Biobank sensitivity
analyses to address reverse causation and detection bias.
Usage
assoc_lag(
data,
outcome_col,
time_col,
exposure_col,
lag_years = c(1, 2),
covariates = NULL,
base = TRUE,
strata = NULL,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Event indicator column (0/1 or logical).
time_col: (character) Follow-up time column (numeric, e.g. years).
exposure_col: (character) One or more exposure variable names.
lag_years: (numeric) One or more lag periods in the same units as time_col. Default: c(1, 2).
covariates: (character or NULL) Covariates for the Fully adjusted model. When NULL (default), that model is skipped.
base: (logical) Auto-detect age and sex and include an Age and sex adjusted model. Default: TRUE.
strata: (character or NULL) Optional stratification variable passed to survival::strata() in the model formula. Default: NULL.
conf_level: (numeric) Confidence level for HR intervals. Default: 0.95.
Details
Setting lag_years = 0 (or including 0 in the vector) runs the
model on the full unfiltered cohort, providing a reference against which
lagged results can be compared.
The same three adjustment models produced by assoc_coxph are
available here (Unadjusted, Age and sex adjusted,
Fully adjusted).
Value
A data.table with one row per lag × exposure × term × model combination, containing all columns produced by assoc_coxph plus:
- lag_years: The lag period applied (numeric).
- n_excluded: Number of participants excluded because their follow-up time was less than lag_years.
lag_years and n_excluded are placed immediately after model in the column order.
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L]
res <- assoc_lag(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
lag_years = c(0, 1, 2),
covariates = c("bmi_cat", "tdi_cat"),
base = FALSE
)
res[, .(lag_years, n, n_excluded, HR, CI_lower, CI_upper, p_value)]
Linear regression association analysis
Description
Fits one or more linear regression models for each exposure variable and returns a tidy result table. By default, two standard adjustment models are always included:
Usage
assoc_linear(
data,
outcome_col,
exposure_col,
covariates = NULL,
base = TRUE,
conf_level = 0.95
)
assoc_lm(
data,
outcome_col,
exposure_col,
covariates = NULL,
base = TRUE,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Name of the continuous numeric outcome column.
exposure_col: (character) One or more exposure variable names.
covariates: (character or NULL) Covariate column names for the Fully adjusted model. Default: NULL.
base: (logical) Include Unadjusted and Age and sex adjusted models. Default: TRUE.
conf_level: (numeric) Confidence level. Default: 0.95.
Details
- Unadjusted - no covariates (crude).
- Age and sex adjusted - age + sex auto-detected from the data via UKB field IDs (21022 and 31). Skipped with a warning if either column cannot be found.
- Fully adjusted - the covariates supplied via the covariates argument. Only run when covariates is non-NULL.
Outcome: intended for continuous numeric variables. Passing a binary (0/1) or logical column is permitted (linear probability model) but will trigger a warning recommending assoc_logistic instead.
CI method: based on the t-distribution via confint.lm(), which is exact under the normal linear model assumptions. There is no ci_method argument (unlike assoc_logistic) because profile likelihood does not apply to lm.
SE column: the standard error of beta is included to support downstream meta-analysis and GWAS-style summary statistics.
Value
A data.table with one row per exposure × term × model combination, and columns:
- exposure: Exposure variable name.
- term: Coefficient name (e.g. "bmi_categoryObese").
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- n: Participants in model (after NA removal).
- beta: Regression coefficient (beta).
- se: Standard error of beta.
- CI_lower: Lower confidence bound.
- CI_upper: Upper confidence bound.
- p_value: t-test p-value.
- beta_label: Formatted string, e.g. "0.23 (0.05-0.41)".
Examples
dt <- ops_toy(scenario = "association", n = 500)
res <- assoc_linear(
data = dt,
outcome_col = "p21001_i0",
exposure_col = "p20116_i0",
covariates = c("bmi_cat", "tdi_cat"),
base = FALSE
)
Logistic regression association analysis
Description
Fits one or more logistic regression models for each exposure variable and returns a tidy result table suitable for downstream forest plots. By default, two standard adjustment models are always included:
Usage
assoc_logistic(
data,
outcome_col,
exposure_col,
covariates = NULL,
base = TRUE,
ci_method = c("wald", "profile"),
conf_level = 0.95
)
assoc_logit(
data,
outcome_col,
exposure_col,
covariates = NULL,
base = TRUE,
ci_method = c("wald", "profile"),
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Binary outcome column (0/1 or logical TRUE/FALSE).
exposure_col: (character) One or more exposure variable names.
covariates: (character or NULL) Covariate column names for the Fully adjusted model. Default: NULL.
base: (logical) Include Unadjusted and Age and sex adjusted models. Default: TRUE.
ci_method: (character) CI calculation method: "wald" (default) or "profile".
conf_level: (numeric) Confidence level. Default: 0.95.
Details
- Unadjusted - no covariates (crude).
- Age and sex adjusted - age + sex auto-detected from the data via UKB field IDs (21022 and 31). Skipped with a warning if either column cannot be found.
- Fully adjusted - the covariates supplied via the covariates argument. Only run when covariates is non-NULL.
Outcome coding: outcome_col may be logical (TRUE/FALSE) or integer/numeric (0/1). Logical values are converted to integer internally.
CI methods:
- "wald" (default) - fast, appropriate for large UKB samples.
- "profile" - profile likelihood CI via confint.glm(); slower but more accurate for small or sparse data.
Value
A data.table with one row per exposure × term × model combination, and columns:
- exposure: Exposure variable name.
- term: Coefficient name (e.g. "bmi_categoryObese").
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- n: Participants in model (after NA removal).
- n_cases: Number of cases (outcome = 1) in model.
- OR: Odds ratio (point estimate).
- CI_lower: Lower confidence bound.
- CI_upper: Upper confidence bound.
- p_value: Wald test p-value.
- OR_label: Formatted string, e.g. "1.23 (1.05-1.44)".
Examples
dt <- ops_toy(scenario = "association", n = 500)
res <- assoc_logistic(
data = dt,
outcome_col = "dm_status",
exposure_col = "p20116_i0",
covariates = c("bmi_cat", "tdi_cat"),
base = FALSE
)
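For small or sparse analysis sets, the profile likelihood interval described under CI methods can be requested explicitly; a sketch reusing the toy data above:

```r
# Profile likelihood CIs via confint.glm() - slower, but more accurate
# for small or sparse data
res_profile <- assoc_logistic(
  data = dt,
  outcome_col = "dm_status",
  exposure_col = "p20116_i0",
  ci_method = "profile"
)
```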
Subgroup association analysis with optional interaction test
Description
Stratifies the dataset by a single grouping variable (by) and runs
the specified association model (coxph, logistic, or
linear) within each subgroup. Unlike assoc_coxph and
its siblings, there is no automatic age-and-sex-adjusted model: at the
subgroup level the variable that defines the stratum would have zero
variance, making the auto-detected adjustment meaningless. Accordingly,
this function does not accept a base argument. Instead, two
models are available:
Usage
assoc_subgroup(
data,
outcome_col,
time_col = NULL,
exposure_col,
by,
method = c("coxph", "logistic", "linear"),
covariates = NULL,
interaction = TRUE,
conf_level = 0.95
)
assoc_sub(
data,
outcome_col,
time_col = NULL,
exposure_col,
by,
method = c("coxph", "logistic", "linear"),
covariates = NULL,
interaction = TRUE,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Outcome column name.
time_col: (character or NULL) Follow-up time column (required when method = "coxph").
exposure_col: (character) One or more exposure variable names.
by: (character) Single stratification variable name. Its unique non-NA values (or factor levels, in order) define the subgroups. Should be a categorical or binary variable with a small number of levels (e.g. sex, smoking status). Continuous variables are technically permitted and the interaction LRT will still run, but per-unique-value subgrouping is rarely meaningful in practice - use a pre-categorised version instead.
method: (character) Regression method: "coxph" (default), "logistic", or "linear".
covariates: (character or NULL) Covariate column names for the Fully adjusted model. When NULL (default), only the Unadjusted model is run.
interaction: (logical) Compute the LRT p-value for the exposure × by interaction. Default: TRUE.
conf_level: (numeric) Confidence level. Default: 0.95.
Details
- Unadjusted - always run (no covariates).
- Fully adjusted - run when covariates is non-NULL. Users are responsible for excluding the by variable from covariates (a warning is issued if it is included).
Interaction test: when interaction = TRUE (default), a likelihood ratio test (LRT) for the exposure × by interaction is computed on the full dataset for each exposure × model combination and appended as p_interaction. LRT is preferred over Wald because it handles factor, binary, and continuous by variables uniformly without requiring the user to recode the by variable.
Value
A data.table with one row per subgroup level × exposure × term × model, containing:
- subgroup: Name of the by variable.
- subgroup_level: Factor: level of the by variable (ordered by original level / sort order).
- exposure: Exposure variable name.
- term: Coefficient name from the fitted model.
- model: Ordered factor: Unadjusted < Fully adjusted.
- n: Participants in model (after NA removal).
- Effect estimate columns: HR/OR/beta, CI_lower, CI_upper, p_value, and a formatted label column. Cox models additionally include n_events and person_years; logistic models include n_cases; linear models include se.
- p_interaction: LRT p-value for the exposure × by interaction on the full dataset. Shared across all subgroup levels for the same exposure × model. NA when the interaction model fails. Only present when interaction = TRUE.
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L]
res <- assoc_subgroup(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "p20116_i0",
by = "p31",
method = "coxph",
covariates = c("bmi_cat", "tdi_cat")
)
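Since p_interaction is repeated across subgroup levels, a compact way to inspect it is to collapse the result to one row per exposure × model; a sketch using the result above:

```r
# One p_interaction per exposure x model (value is shared across levels)
unique(res[, .(exposure, model, p_interaction)])
```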
Dose-response trend analysis
Description
Fits categorical and trend models simultaneously for each ordered-factor
exposure, returning per-category effect estimates alongside a p-value for
linear trend. Two models are run internally per exposure × adjustment combination:
Usage
assoc_trend(
data,
outcome_col,
time_col = NULL,
exposure_col,
method = c("coxph", "logistic", "linear"),
covariates = NULL,
base = TRUE,
scores = NULL,
conf_level = 0.95
)
assoc_tr(
data,
outcome_col,
time_col = NULL,
exposure_col,
method = c("coxph", "logistic", "linear"),
covariates = NULL,
base = TRUE,
scores = NULL,
conf_level = 0.95
)
Arguments
data: (data.frame or data.table) Analysis dataset.
outcome_col: (character) Outcome column name.
time_col: (character or NULL) Follow-up time column (required when method = "coxph").
exposure_col: (character) One or more exposure column names. Each must be a factor (level order defines the trend).
method: (character) Regression method: "coxph" (default), "logistic", or "linear".
covariates: (character or NULL) Covariates for the Fully adjusted model. Default: NULL.
base: (logical) Include Unadjusted and Age-and-sex-adjusted models. Default: TRUE.
scores: (numeric or NULL) Numeric scores assigned to factor levels in level order. Length must equal the number of factor levels. Default: NULL (scores 0, 1, 2, ...).
conf_level: (numeric) Confidence level. Default: 0.95.
Details
- Categorical model - exposure treated as a factor; produces one row per non-reference level (HR / OR / beta vs reference).
- Trend model - exposure recoded as numeric scores (default: 0, 1, 2, ...); produces the per-score-unit effect estimate (*_per_score) and p_trend.
Both results are merged: the output contains a reference row (effect = 1 / 0, CI = NA) followed by non-reference rows, with *_per_score and p_trend appended as shared columns (same value within each exposure × model combination).
Scores: by default levels are scored 0, 1, 2, ..., so the reference group = 0 and each step = 1 unit. Supply scores to use meaningful units (e.g. median years per category) - only p_trend and the per-score estimate change; per-category HRs are unaffected.
Adjustment models: follows the same logic as assoc_coxph - Unadjusted and Age-and-sex-adjusted models are included by default (base = TRUE); a Fully adjusted model is added when covariates is non-NULL.
Value
A data.table with one row per exposure × level × model, containing:
- exposure: Exposure variable name.
- level: Factor level (ordered factor preserving original level order).
- term: Coefficient name as returned by the model (the reference row uses paste0(exposure, ref_level)).
- model: Ordered factor: Unadjusted < Age and sex adjusted < Fully adjusted.
- n: Participants in model (after NA removal).
- Effect estimate columns from the categorical model: HR/OR/beta, CI_lower, CI_upper, p_value, and a formatted label. The reference row has HR = 1 / beta = 0 and NA for CI and p. Cox models additionally include n_events and person_years; logistic includes n_cases; linear includes se.
- HR_per_score / OR_per_score / beta_per_score: Per-score-unit effect estimate from the trend model. Shared across all levels within the same exposure × model.
- HR_per_score_label / OR_per_score_label / beta_per_score_label: Formatted string for the per-score estimate.
- p_trend: P-value for linear trend from the trend model. Shared across all levels within the same exposure × model.
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L]
res <- assoc_trend(
data = dt,
outcome_col = "dm_status",
time_col = "dm_followup_years",
exposure_col = "bmi_cat",
method = "coxph",
covariates = c("tdi_cat", "p20116_i0"),
base = FALSE
)
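To obtain the trend per meaningful unit rather than per category step, supply scores in level order; the values below are illustrative (and assume bmi_cat has three levels), not derived from the toy data:

```r
# Hypothetical scores, e.g. median BMI within each category (level order);
# only p_trend and HR_per_score change - per-category HRs are unaffected
res_scored <- assoc_trend(
  data = dt,
  outcome_col = "dm_status",
  time_col = "dm_followup_years",
  exposure_col = "bmi_cat",
  method = "coxph",
  scores = c(22, 27, 32)
)
```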
List available DNAnexus projects
Description
Returns a list of all projects accessible to the current user.
Usage
auth_list_projects()
Value
A character vector of project names and IDs.
Examples
## Not run:
auth_list_projects()
## End(Not run)
Login to DNAnexus with a token
Description
Authenticates with the DNAnexus Research Analysis Platform using an API
token. Equivalent to running dx login --token on the command line.
Usage
auth_login(token = NULL)
Arguments
token: (character) DNAnexus API token. If NULL, reads from the environment variable DX_API_TOKEN.
Value
Invisible TRUE on success.
Examples
## Not run:
# Supply token directly
auth_login(token = "your_token_here")
# Or store token in ~/.Renviron (recommended):
# usethis::edit_r_environ()
# Add: DX_API_TOKEN=your_token_here
# Save and restart R, then call:
auth_login()
## End(Not run)
Logout from DNAnexus
Description
Invalidates the current DNAnexus session on the remote platform. The local
token file is not removed but becomes invalid. A new token must be generated
from the DNAnexus platform before calling auth_login() again.
Usage
auth_logout()
Value
Invisible TRUE on success.
Examples
## Not run:
auth_logout()
## End(Not run)
Select a DNAnexus project
Description
Switches the active project context on the DNAnexus platform. Only project
IDs (e.g. "project-XXXXXXXXXXXX") are accepted. Run
auth_list_projects() to find your project ID.
Usage
auth_select_project(project)
Arguments
project: (character) Project ID in the form "project-XXXXXXXXXXXX".
Value
Invisible TRUE on success.
Examples
## Not run:
auth_select_project("project-XXXXXXXXXXXX")
## End(Not run)
Check current DNAnexus authentication status
Description
Returns the current logged-in user and selected project.
Usage
auth_status()
Value
A named list with user and project.
Examples
## Not run:
auth_status()
## End(Not run)
Rename UKB field ID columns to human-readable snake_case names
Description
Renames columns from the raw UKB field ID format used by
extract_pheno (e.g. participant.p31) and
job_result (e.g. p53_i0) to human-readable
snake_case identifiers (e.g. sex,
date_of_attending_assessment_centre_i0).
Usage
decode_names(data, max_nchar = 60L)
Arguments
data: (data.frame or data.table) Data extracted from UKB-RAP via extract_pheno or job_result.
max_nchar: (integer) Column names longer than this value are flagged. Default: 60L.
Details
Column labels are taken from the UKB field title dictionary via
extract_ls. Both the dataset name and field list are cached
after the first call, so subsequent calls to decode_names() involve
no network requests.
When an auto-generated name exceeds max_nchar characters it is
flagged with a warning so you can decide whether to shorten it manually
with names(data)[...] <- .... The function never truncates names
automatically, because the right short name depends on scientific context
that only you know.
Value
The input data with column names replaced by snake_case
labels. Returns a data.table if the input is a data.table.
Examples
## Not run:
df <- extract_pheno(c(31, 53, 21022))
df <- decode_names(df)
# participant.eid → eid
# participant.p31 → sex
# participant.p21022 → age_at_recruitment
# participant.p53_i0 → date_of_attending_assessment_centre_i0 (warned if > 60)
# Shorten a long name afterwards
names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline"
## End(Not run)
Decode UKB categorical column values using Showcase metadata
Description
Converts raw integer codes produced by extract_pheno into
human-readable labels for all categorical fields
(value_type 21 and 22), using the UKB Showcase encoding tables.
Continuous, text, date, and already-decoded columns are left unchanged.
Usage
decode_values(data, metadata_dir = "data/metadata/")
Arguments
data: (data.frame or data.table) Data from extract_pheno.
metadata_dir: (character) Directory containing field.tsv and esimpint.tsv. Default: "data/metadata/".
Details
This function requires two metadata files downloaded from the UKB Research Analysis Platform:
- field.tsv - maps field IDs to encoding IDs and value types.
- esimpint.tsv - maps encoding ID + integer code to label.
Download them once with:
fetch_metadata(dest_dir = "data/metadata")
Both files are cached in the session after the first read.
Call order: use decode_values() before
decode_names, so that column names still contain the numeric
field ID needed to look up the encoding.
Value
The input data with categorical columns replaced by character
labels. Returns a data.table if the input is a data.table.
Examples
## Not run:
# Download metadata once
fetch_metadata(dest_dir = "data/metadata")
# Recommended call order
df <- extract_pheno(c(31, 54, 20116, 21000))
df <- decode_values(df) # 0/1 → "Female"/"Male", etc.
df <- decode_names(df) # participant.p31 → sex
## End(Not run)
Compute age at event for one or more UKB outcomes
Description
For each name in name, adds one column age_at_{name} (numeric,
years) computed as:
age_at_event = age_col + (event_date - baseline_date) / 365.25
Usage
derive_age(
data,
name,
baseline_col,
age_col,
date_cols = NULL,
status_cols = NULL
)
Arguments
data: (data.frame or data.table) UKB phenotype data.
name: (character) One or more output prefixes, e.g. "dm" or c("dm", "htn").
baseline_col: (character) Name of the baseline date column (e.g. "p53_i0").
age_col: (character) Name of the age-at-baseline column (e.g. "p21022").
date_cols: (character or NULL) Named character vector mapping each name to its event date column, e.g. c(dm = "dm_date"). Default: NULL (auto-detect).
status_cols: (character or NULL) Named character vector mapping each name to its status column. Default: NULL (auto-detect).
Details
The value is NA for participants who did not experience the event
(status is FALSE / 0) or who lack an event date.
Auto-detection per name (when date_cols / status_cols are NULL):
- date_col - looked up as {name}_date.
- status_col - looked up first as {name}_status, then as {name} (logical column); if neither exists, all rows with a non-NA date are treated as cases.
data.table pass-by-reference: new columns are added in-place.
Value
The input data with one new age_at_{name} column
per entry in name, added in-place.
Examples
dt <- ops_toy(scenario = "association", n = 100)
derive_age(dt, name = "dm", baseline_col = "p53_i0", age_col = "p21022")
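The call above relies on auto-detection (dm_date / dm_status, per the Details). With non-standard column names, the mapping can be supplied explicitly; the names used here are the auto-detected defaults, assumed present in the toy data:

```r
# Explicit mapping - equivalent to the auto-detected defaults
derive_age(
  dt,
  name = "dm",
  baseline_col = "p53_i0",
  age_col = "p21022",
  date_cols = c(dm = "dm_date"),
  status_cols = c(dm = "dm_status")
)
```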
Derive a binary disease flag from UKB cancer registry
Description
The UK Biobank cancer registry links each participant's cancer diagnoses
from national cancer registries. Each diagnosis is stored as a separate
instance with four parallel fields: ICD-10 code (p40006), histology
code (p40011), behaviour code (p40012), and diagnosis date
(p40005). Unlike HES or self-report data, each instance holds
exactly one record - there is no array (a*) dimension.
Usage
derive_cancer_registry(
data,
name,
icd10 = NULL,
histology = NULL,
behaviour = NULL,
code_cols = NULL,
hist_cols = NULL,
behv_cols = NULL,
date_cols = NULL
)
Arguments
data: (data.frame or data.table) UKB phenotype data containing cancer registry fields.
name: (character) Output column prefix, e.g. "skin".
icd10: (character or NULL) Regular expression matched against the ICD-10 code column (p40006). Default: NULL.
histology: (integer vector or NULL) Histology codes to retain (p40011). Default: NULL.
behaviour: (integer vector or NULL) Behaviour codes to retain (p40012). Default: NULL.
code_cols: (character or NULL) Names of ICD-10 code columns (p40006 instances). Default: NULL (auto-detect).
hist_cols: (character or NULL) Names of histology columns (p40011 instances). Default: NULL (auto-detect).
behv_cols: (character or NULL) Names of behaviour columns (p40012 instances). Default: NULL (auto-detect).
date_cols: (character or NULL) Names of diagnosis date columns (p40005 instances). Default: NULL (auto-detect).
Details
All three filter arguments (icd10, histology, behaviour) are applied with AND logic: a record must satisfy every non-NULL filter to be counted. For OR conditions (e.g. D04 or C44 with specific histology), call the function twice and combine the resulting columns downstream.
Output columns:
- {name}_cancer: Logical flag: TRUE if any cancer registry record satisfies all supplied filters.
- {name}_cancer_date: Earliest matching diagnosis date (IDate).
Value
The input data with two new columns added in-place:
{name}_cancer (logical) and {name}_cancer_date (IDate).
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_cancer_registry(dt, name = "skin", icd10 = "^C44")
derive_cancer_registry(dt, name = "invasive", icd10 = "^C", behaviour = 3L)
Combine self-report and ICD-10 sources into a unified case definition
Description
Takes the self-report flag and ICD-10 flag produced by
derive_selfreport and derive_icd10 (or any
pair of logical columns) and merges them into a single logical case status
and earliest date.
Usage
derive_case(
data,
name,
icd10_col = NULL,
selfreport_col = NULL,
icd10_date_col = NULL,
selfreport_date_col = NULL
)
Arguments
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Column prefix used both to locate the default
input columns and to name the output columns. Defaults:
|
icd10_col |
(character or NULL) Name of the ICD-10 status column.
|
selfreport_col |
(character or NULL) Name of the self-report status
column. |
icd10_date_col |
(character or NULL) Name of the ICD-10 date column.
|
selfreport_date_col |
(character or NULL) Name of the self-report date
column. |
Details
{name}_status - Logical: TRUE if positive in at least one source.
{name}_date - Earliest diagnosis/report date across both sources (IDate).
Value
The input data with two new columns added in-place:
{name}_status (logical) and {name}_date (IDate).
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_selfreport(dt, name = "htn", regex = "hypertension",
field = "noncancer",
disease_cols = paste0("p20002_i0_a", 0:4),
date_cols = paste0("p20008_i0_a", 0:4),
visit_cols = "p53_i0")
derive_icd10(dt, name = "htn", icd10 = "I10", source = c("hes", "death"))
derive_case(dt, name = "htn")
Prepare UKB covariates for analysis
Description
Converts decoded UKB columns to analysis-ready types: character-encoded
numeric fields to numeric, and categorical fields to factor.
Prints a concise summary for each converted column - mean / median / SD /
missing rate for numeric columns, and level counts for factor columns - so
you can verify distributions without leaving the pipeline.
Usage
derive_covariate(
data,
as_numeric = NULL,
as_factor = NULL,
factor_levels = NULL,
max_levels = 5L
)
Arguments
data |
(data.frame or data.table) UKB data, typically output of
|
as_numeric |
(character or NULL) Column names to convert to
|
as_factor |
(character or NULL) Column names to convert to
|
factor_levels |
(named list or NULL) Custom level ordering for specific
factor columns. Names must match entries in |
max_levels |
(integer) Factor columns with more levels than this
threshold trigger a warning suggesting the user consider collapsing
categories. Default: |
Details
data.table pass-by-reference: when the input is a
data.table, modifications are made in-place. Pass
data.table::copy(data) to preserve the original.
Value
The input data with converted columns. Always returns a
data.table.
Examples
dt <- ops_toy(n = 100)
dt <- derive_missing(dt)
# messy_label is a character column with numeric-looking values ("-1", "999")
# mixed with non-parseable strings — demonstrates coercion and NA warnings
derive_covariate(dt,
as_numeric = "messy_label",
as_factor = c("p31", "p20116_i0"),
factor_levels = list(
p20116_i0 = c("Never", "Previous", "Current")
)
)
Cut a continuous UKB variable into quantile-based or custom groups
Description
Creates a new factor column by binning a continuous variable into n
groups. When breaks is omitted, group boundaries are derived from
quantiles of the observed data (equal-frequency binning). When
breaks is supplied, those values are used as interior cut points.
Usage
derive_cut(data, col, n, breaks = NULL, labels = NULL, name = NULL)
Arguments
data |
(data.frame or data.table) UKB data. |
col |
(character) Name of the source numeric column. |
n |
(integer) Number of groups. Any integer |
breaks |
(numeric vector or NULL) Interior cut points; length must
equal |
labels |
(character vector or NULL) Group labels of length |
name |
(character or NULL) Name for the new column. Defaults to
|
Details
Before binning, a numeric summary (mean, median, SD, Q1, Q3, missing rate) is printed for the source column. After binning, the group distribution is printed via an internal summary helper.
Only one column can be processed per call; loop over columns explicitly when binning multiple variables.
data.table pass-by-reference: when the input is a
data.table, the new column is added in-place. Pass
data.table::copy(data) to preserve the original.
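Since only one column is processed per call, binning several variables can be sketched with an explicit loop (column names here come from ops_toy and are assumed to be numeric):

```r
dt <- ops_toy(n = 100)

# derive_cut() handles one column per call, so loop explicitly;
# columns are added in-place, no reassignment needed
for (col in c("p21001_i0", "p21022")) {
  derive_cut(dt, col = col, n = 4)
}
```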
Value
The input data with one new factor column appended. Always
returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_cut(dt, col = "p21001_i0", n = 3)
# Custom breaks and labels
derive_cut(dt, col = "p21001_i0", n = 3,
breaks = c(25, 30),
labels = c("<25", "25-30", "30+"),
name = "bmi_group")
Derive a binary disease flag from UKB death registry
Description
Death registry records store the underlying (primary) cause of death in
field p40001 and contributory (secondary) causes in field
p40002, both coded in ICD-10. The date of death is in field
p40000. All three fields have an instance dimension (i0,
i1) reflecting potential amendments; p40002 additionally
has an array dimension (a0, a1, ...).
Usage
derive_death_registry(
data,
name,
icd10,
match = c("prefix", "exact", "regex"),
primary_cols = NULL,
secondary_cols = NULL,
date_cols = NULL
)
Arguments
data |
(data.frame or data.table) UKB phenotype data containing death registry fields. |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
match |
(character) Matching strategy: |
primary_cols |
(character or NULL) Names of primary cause columns
( |
secondary_cols |
(character or NULL) Names of secondary cause
columns ( |
date_cols |
(character or NULL) Names of death date columns
( |
Details
{name}_death - Logical flag: TRUE if any death registry record (primary or secondary cause) contains a matching ICD-10 code.
{name}_death_date - Earliest death date across matching instances (IDate). Note: this is the date of death, not the onset date.
Value
The input data with two new columns added in-place:
{name}_death (logical) and {name}_death_date (IDate).
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_death_registry(dt, name = "mi", icd10 = "I21")
derive_death_registry(dt, name = "lung", icd10 = "C34")
Derive a binary disease flag from UKB First Occurrence fields
Description
UKB pre-computes the earliest recorded date for hundreds of ICD-10 chapters
and categories as First Occurrence fields (p131xxx). Each
field contains a single date per participant; no array or instance depth is
involved. This function reads that date, converts it to IDate,
and writes two analysis-ready columns:
Usage
derive_first_occurrence(data, name, field, col = NULL)
Arguments
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
field |
(integer or character) UKB field ID of the First Occurrence
field, e.g. |
col |
(character or NULL) Name of the source column in |
Details
{name}_fo_date - Earliest First Occurrence date (IDate). Values that cannot be coerced to a valid date (e.g. UKB error codes) are silently set to NA.
{name}_fo - Logical flag derived from {name}_fo_date: TRUE if and only if a valid date exists. This guarantees that every positive case has a usable date - essential for time-to-event and prevalent/incident classification.
Column detection: the function locates the source column
automatically from field, handling both the raw format used by
extract_pheno (participant.p131666) and the
snake_case format produced by decode_names
(date_e11_first_reported_type_2_diabetes). Supply col
to override auto-detection.
data.table pass-by-reference: when the input is a
data.table, new columns are added in-place via :=.
The returned object and the original variable point to the same memory.
Value
The input data (invisibly) with two new columns added
in-place: {name}_fo (logical) and {name}_fo_date (IDate).
Examples
dt <- ops_toy(n = 100)
derive_first_occurrence(dt, name = "outcome", field = 131742L, col = "p131742")
Compute follow-up end date and follow-up time for survival analysis
Description
Adds two columns to data:
- {name}_followup_end (IDate): the earliest of the outcome event date, death date, lost-to-follow-up date, and the administrative censoring date.
- {name}_followup_years (numeric): time in years from baseline_col to {name}_followup_end.
Usage
derive_followup(
data,
name,
event_col,
baseline_col,
censor_date,
death_col = NULL,
lost_col = NULL
)
Arguments
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
event_col |
(character) Name of the outcome event date column
(e.g. |
baseline_col |
(character) Name of the baseline date column
(e.g. |
censor_date |
(Date or character) Scalar administrative censoring date,
e.g. |
death_col |
(character or NULL) Name of the death date column
(UKB field 40000). |
lost_col |
(character or NULL) Name of the lost-to-follow-up date
column (UKB field 191). |
Details
data.table pass-by-reference: when the input is a
data.table, new columns are added in-place via :=.
Value
The input data with two new columns added in-place:
{name}_followup_end (IDate) and {name}_followup_years
(numeric).
Examples
dt <- ops_toy(n = 100)
derive_hes(dt, name = "htn", icd10 = "I10")
derive_followup(dt,
name = "htn_hes",
event_col = "htn_hes_date",
baseline_col = "p53_i0",
censor_date = as.Date("2022-06-01"),
death_col = "p40000_i0",
lost_col = NULL)
Derive a binary disease flag from UKB HES inpatient diagnoses
Description
Hospital Episode Statistics (HES) inpatient records store ICD-10 diagnosis
codes in field p41270 (single JSON-array column on UKB RAP) and
corresponding first-diagnosis dates in field p41280
(p41280_a0, p41280_a1, ...). The array indices in
p41270 and p41280 are aligned: the N-th code in the
JSON array corresponds to p41280_aN (the date of first inpatient
diagnosis for that code).
Usage
derive_hes(
data,
name,
icd10,
match = c("prefix", "exact", "regex"),
disease_cols = NULL,
date_cols = NULL
)
Arguments
data |
(data.frame or data.table) UKB phenotype data containing HES
fields ( |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
match |
(character) Matching strategy: |
disease_cols |
(character or NULL) Name of the |
date_cols |
(character or NULL) Names of |
Details
{name}_hes - Logical flag: TRUE if any HES record contains a matching ICD-10 code.
{name}_hes_date - Earliest first-diagnosis date across all matching codes (IDate). NA if no date is available.
Value
The input data with two new columns added in-place:
{name}_hes (logical) and {name}_hes_date (IDate).
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_hes(dt, name = "htn", icd10 = "I10")
derive_hes(dt, name = "diabetes", icd10 = c("E10", "E11"))
derive_hes(dt, name = "asthma", icd10 = "^J4", match = "regex")
Derive a unified ICD-10 disease flag across multiple UKB data sources
Description
A high-level wrapper that calls one or more of derive_hes,
derive_death_registry, derive_first_occurrence,
and derive_cancer_registry according to the source
argument, then combines their results into a single status flag and
earliest-date column.
Usage
derive_icd10(
data,
name,
icd10,
source = c("hes", "death", "first_occurrence", "cancer_registry"),
match = c("prefix", "exact", "regex"),
fo_field = NULL,
fo_col = NULL,
histology = NULL,
behaviour = NULL,
hes_code_col = NULL,
hes_date_cols = NULL,
primary_cols = NULL,
secondary_cols = NULL,
death_date_cols = NULL,
cr_code_cols = NULL,
cr_hist_cols = NULL,
cr_behv_cols = NULL,
cr_date_cols = NULL
)
Arguments
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
source |
(character) One or more of |
match |
(character) Matching strategy passed to |
fo_field |
(integer or character or NULL) UKB field ID for the
First Occurrence column (e.g. |
fo_col |
(character or NULL) Column name of the First Occurrence
field in |
histology |
(integer vector or NULL) Passed to
|
behaviour |
(integer vector or NULL) Passed to
|
hes_code_col |
(character or NULL) Passed as |
hes_date_cols |
(character or NULL) Passed as |
primary_cols |
(character or NULL) Passed to
|
secondary_cols |
(character or NULL) Passed to
|
death_date_cols |
(character or NULL) Passed as |
cr_code_cols |
(character or NULL) Passed as |
cr_hist_cols |
(character or NULL) Passed as |
cr_behv_cols |
(character or NULL) Passed as |
cr_date_cols |
(character or NULL) Passed as |
Details
All intermediate source columns ({name}_hes, {name}_death,
{name}_fo, {name}_cancer and their _date counterparts)
are retained in data so that per-source contributions remain
traceable.
{name}_icd10 - Logical flag: TRUE if any selected source contains a matching record.
{name}_icd10_date - Earliest matching date across all selected sources (IDate).
Value
The input data with {name}_icd10 (logical) and
{name}_icd10_date (IDate) added in-place, plus all intermediate
source columns. Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_icd10(dt, name = "htn",
icd10 = "I10",
source = c("hes", "death"))
derive_icd10(dt, name = "mi",
icd10 = "I21",
source = c("hes", "death", "first_occurrence"),
fo_col = "p131742")
Handle informative missing labels in UKB decoded data
Description
After decode_values converts categorical codes to character
labels, some values represent meaningful non-response rather than true data:
"Do not know", "Prefer not to answer", and
"Prefer not to say". This function either converts them to NA
(for complete-case analysis) or retains them as "Unknown" (to
preserve the informative missingness as a model category).
Usage
derive_missing(
data,
cols = tidyselect::everything(),
action = c("na", "unknown"),
extra_labels = NULL
)
Arguments
data |
(data.frame or data.table) Decoded UKB data, typically output
of |
cols |
(tidyselect) Columns to process. Default: |
action |
(character) One of
Empty strings are always converted to |
extra_labels |
(character or NULL) Additional labels to treat as
informative missing, appended to the built-in list. Default: |
Details
Empty strings ("") are always converted to NA regardless of
action, as they carry no informational content.
Only character columns are modified; numeric, integer, Date, and logical columns are silently skipped.
data.table pass-by-reference: when the input is a
data.table, modifications are made in-place via
set. The returned object and the original variable
point to the same memory. If you need to preserve the original, pass
data.table::copy(data) instead.
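The copy-to-preserve pattern described above can be sketched as (assuming the ops_toy helper from this package):

```r
dt <- ops_toy(n = 100)

# Work on a copy so the original dt keeps its raw decoded labels
dt_clean <- derive_missing(data.table::copy(dt))
```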
Value
The input data with missing labels replaced in-place.
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_missing(dt)
# Retain informative non-response as a model category
derive_missing(dt, action = "unknown")
# Add a custom label
derive_missing(dt, extra_labels = "Not applicable")
Define a self-reported phenotype from UKB touchscreen data
Description
Searches UKB self-reported illness fields across all instances and arrays,
matches records against a user-supplied regex, parses the associated
year-month date, and appends two columns to the data:
{name}_selfreport (logical) and {name}_selfreport_date
(IDate, earliest matching instance).
Usage
derive_selfreport(
data,
name,
regex,
field = c("noncancer", "cancer"),
ignore_case = TRUE,
disease_cols = NULL,
date_cols = NULL,
visit_cols = NULL
)
Arguments
data |
(data.frame or data.table) UKB data containing self-report fields. |
name |
(character) Output column name prefix, e.g. |
regex |
(character) Regular expression matched against disease text
values (after |
field |
(character) Self-report field type: |
ignore_case |
(logical) Should regex matching ignore case?
Default: |
disease_cols |
(character or NULL) Column name(s) containing disease
text (p20002 or p20001). |
date_cols |
(character or NULL) Column name(s) containing the
self-report year-month date (p20008 or p20006). |
visit_cols |
(character or NULL) Column name(s) containing the
baseline assessment date (p53). |
Details
The function auto-detects the relevant columns using the
extract_ls() field dictionary cache (populated by
extract_ls() or extract_pheno()). The three _cols
parameters let you override auto-detection when column names have been
customised (e.g. after decode_names()).
Field mapping by field:
- "noncancer": disease text = p20002, date = p20008.
- "cancer": disease text = p20001, date = p20006.
Baseline visit date (p53) is used as a fallback when no specific diagnosis date is recorded.
data.table pass-by-reference: new columns are added in-place.
Pass data.table::copy(data) to preserve the original.
Value
The input data with two new columns appended in-place:
{name}_selfreport (logical) and
{name}_selfreport_date (IDate). Always returns a
data.table.
Examples
dt <- ops_toy(n = 100)
# disease_cols / date_cols / visit_cols can be omitted — auto-detected from
# column names matching p20002_* / p20008_* / p53_* respectively
derive_selfreport(dt, name = "htn", regex = "hypertension",
field = "noncancer")
Classify disease timing relative to UKB baseline assessment
Description
Assigns each participant an integer timing category based on whether their disease date falls before or after the baseline visit date:
Usage
derive_timing(data, name, baseline_col, status_col = NULL, date_col = NULL)
Arguments
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix. The new column is named
|
baseline_col |
(character) Name of the baseline date column in
|
status_col |
(character or NULL) Name of the logical disease flag.
|
date_col |
(character or NULL) Name of the disease date column
( |
Details
0 - No disease (status_col is FALSE).
1 - Prevalent: disease date on or before baseline.
2 - Incident: disease date strictly after baseline.
NA - Case with missing date; timing cannot be determined.
Call once per timing variable needed (e.g. once for the combined case, once per individual source).
Value
The input data with one new integer column
{name}_timing (0/1/2/NA) added in-place.
Always returns a data.table.
Examples
dt <- ops_toy(n = 100)
derive_hes(dt, name = "htn", icd10 = "I10")
derive_timing(dt, name = "htn_hes",
status_col = "htn_hes",
date_col = "htn_hes_date",
baseline_col = "p53_i0")
Submit a large-scale phenotype extraction job via table-exporter
Description
Submits an asynchronous table-exporter job on the DNAnexus Research
Analysis Platform for large-scale phenotype extraction. Use this instead
of extract_pheno() when extracting many fields (e.g. 50+).
Usage
extract_batch(
field_id,
dataset = NULL,
file = NULL,
instance_type = NULL,
priority = c("low", "high")
)
Arguments
field_id |
(integer) Vector of UKB Field IDs to extract.
|
dataset |
(character) Dataset file name. Default: |
file |
(character) Output file name on the cloud (without extension),
e.g. |
instance_type |
(character) DNAnexus instance type, e.g.
|
priority |
(character) Job scheduling priority. |
Details
The job runs on the cloud and typically completes in 20-40 minutes.
Monitor progress and retrieve results using the job_ series.
Value
Invisibly returns the job ID string (e.g. "job-XXXX").
Examples
## Not run:
job_id <- extract_batch(core_field_ids)
job_id <- extract_batch(core_field_ids, file = "ad_cscc_pheno")
job_id <- extract_batch(core_field_ids, priority = "high")
# Monitor: job_status(job_id)
# Download: job_result(job_id, dest = "data/pheno.csv")
## End(Not run)
List all approved fields in the UKB dataset
Description
Returns a data.frame of all fields available for extraction in the current UKB project dataset. Fields reflect what has been approved for your project — not all UKB fields are present.
Usage
extract_ls(dataset = NULL, pattern = NULL, refresh = FALSE)
Arguments
dataset |
(character) Dataset file name, e.g.
|
pattern |
(character) Optional regex to filter results by
|
refresh |
(logical) Force re-fetch from cloud, ignoring cache.
Default: |
Details
Results are cached in the session after the first call. Subsequent calls
return instantly from cache. Use refresh = TRUE to force a new
network request (e.g. after switching projects).
Value
A data.frame with columns:
- field_name: Full field name as used in extraction, e.g. "participant.p31", "participant.p53_i0".
- title: Human-readable field description, e.g. "Sex", "Date of attending assessment centre | Instance 0".
Examples
## Not run:
# List all approved fields
extract_ls()
# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p22009")
# Force refresh after switching projects
extract_ls(refresh = TRUE)
## End(Not run)
Extract phenotype data from a UKB dataset
Description
Extracts phenotypic fields from the UKB Research Analysis Platform dataset
and returns a data.table. All instances and arrays are returned for
each requested field. Column names are kept as-is (e.g.
participant.p53_i0); use the clean_ series for renaming.
Usage
extract_pheno(field_id, dataset = NULL, timeout = 300)
Arguments
field_id |
(integer) Vector of UKB Field IDs to extract, e.g.
|
dataset |
(character) Dataset file name. Default: |
timeout |
(integer) Extraction timeout in seconds. Default: |
Value
A data.table with one row per participant. Column names
follow the participant.p<id>_i<n>_a<m> convention.
Fields not found are skipped with a warning.
Examples
## Not run:
df <- extract_pheno(c(31, 53, 21022))
df <- extract_pheno(c(31, 53, 20002), dataset = "app12345_20260101.dataset")
## End(Not run)
Download the UKB field dictionary file
Description
Downloads field.tsv from the Showcase metadata/ folder on the
DNAnexus Research Analysis Platform. This file contains the complete UKB
data dictionary: field IDs, titles, value types, and encoding references.
Usage
fetch_field(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
Arguments
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local file. Default:
|
resume |
(logical) Resume an interrupted download. Default:
|
verbose |
(logical) Show download progress. Default: |
Value
Invisibly returns the local file path as a character string.
Examples
## Not run:
fetch_field(dest_dir = "metadata")
fetch_field(dest_dir = "metadata", overwrite = TRUE)
## End(Not run)
Download a file from RAP project storage
Description
Downloads one file or all files within a folder from RAP project storage to the current directory or a specified destination within the RAP environment. This function must be called from within RAP.
Usage
fetch_file(path, dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
Arguments
path |
(character) Remote file or folder path. |
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local files. Default:
|
resume |
(logical) Resume an interrupted download. Useful for large
files (e.g. |
verbose |
(logical) Show download progress. Default: |
Value
Invisibly returns the local file path(s) as a character vector.
Examples
## Not run:
# Download a single metadata file
fetch_file("Showcase metadata/field.tsv", dest_dir = "data/")
# Download an entire folder
fetch_file("Showcase metadata/", dest_dir = "data/metadata/")
# Resume an interrupted download
fetch_file("results/summary_stats.csv", dest_dir = "data/", resume = TRUE)
## End(Not run)
List files and folders at a remote RAP path
Description
Returns a structured data.frame describing the contents of a remote
DNAnexus Research Analysis Platform (RAP) directory. Analogous to
file_info() but for remote project storage.
Usage
fetch_ls(path = ".", type = "all", pattern = NULL)
Arguments
path |
(character) Remote path to list. Default: |
type |
(character) Filter results by entry type: |
pattern |
(character) Optional regex to filter by name, e.g.
|
Value
A data.frame with columns:
- name: Entry name (no trailing slash for folders).
- type: "file" or "folder".
- size: File size string (e.g. "120.94 GB"); NA for folders or non-file objects.
- modified: Last modified time (POSIXct); NA for folders.
Examples
## Not run:
fetch_ls()
fetch_ls("Showcase metadata/", type = "file")
fetch_ls("results/", pattern = "\\.csv$")
## End(Not run)
Download the Showcase metadata folder
Description
Downloads the entire Showcase metadata/ folder from the DNAnexus
Research Analysis Platform to a local directory. This includes
field.tsv, encoding.tsv, and all associated encoding tables.
Usage
fetch_metadata(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
Arguments
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local files. Default:
|
resume |
(logical) Resume interrupted downloads. Default: |
verbose |
(logical) Show download progress. Default: |
Value
Invisibly returns the local file paths as a character vector.
Examples
## Not run:
fetch_metadata(dest_dir = "metadata")
fetch_metadata(dest_dir = "metadata", overwrite = TRUE)
## End(Not run)
Print a remote RAP directory tree
Description
Displays the remote directory structure in a tree-like format by
recursively listing sub-folders up to max_depth. Analogous to
file_tree() but for remote project storage.
Usage
fetch_tree(path = ".", max_depth = 2, verbose = TRUE)
Arguments
path |
(character) Remote root path. Default: |
max_depth |
(integer) Maximum recursion depth. Default: |
verbose |
(logical) Whether to print the tree to the console.
Default: |
Value
Invisibly returns a character vector of tree lines.
Warning
Each level of recursion triggers one HTTPS API call per folder. Deep trees
(e.g. max_depth > 3) on large UKB projects may issue 100+ network
requests, causing the console to hang for tens of seconds or time out.
Keep max_depth at 2-3 for interactive use.
Examples
## Not run:
fetch_tree()
fetch_tree("Bulk/", max_depth = 2)
fetch_tree(verbose = FALSE)
## End(Not run)
Get pre-authenticated download URL(s) for a remote RAP file or folder
Description
Generates temporary HTTPS URLs for files on the DNAnexus Research Analysis Platform. For a single file, returns one URL. For a folder, lists all files inside and returns a named character vector of URLs.
Usage
fetch_url(path, duration = "1d")
Arguments
path |
(character) Remote file path or folder path, e.g.
|
duration |
(character) How long the URLs remain valid. Accepts
suffixes: |
Value
A named character vector of pre-authenticated HTTPS URLs. Names are the file names.
Examples
## Not run:
# Single file
fetch_url("Showcase metadata/field.tsv")
# Entire folder
fetch_url("Showcase metadata/", duration = "7d")
## End(Not run)
Convert UKB imputed BGEN files to PGEN on RAP
Description
Submits one Swiss Army Knife job per chromosome to the DNAnexus Research Analysis Platform, each converting a UKB imputed BGEN file to PGEN format with a MAF > 0.01 filter applied via plink2. Jobs run in parallel across chromosomes.
Usage
grs_bgen2pgen(
chr = 1:22,
dest = NULL,
maf = 0.01,
instance = "standard",
priority = "low"
)
Arguments
chr |
Integer vector. Chromosomes to process. Default: |
dest |
Character scalar. RAP destination path for output PGEN files
(e.g. |
maf |
Numeric scalar. Minor allele frequency filter passed to plink2
|
instance |
Character scalar. Instance type preset: |
priority |
Character scalar. Job priority: |
Details
The function auto-generates the plink2 driver script, uploads it once to
the RAP project root (/), then loops over chr, submitting
one job per chromosome. A 500 ms pause between submissions prevents API
rate-limiting.
Output path is critical. The driver script writes plink2 output to
/home/dnanexus/out/out/ - the fixed path that Swiss Army Knife
auto-uploads to dest on completion. Output files per chromosome:
chr{N}_maf001.pgen/.pvar/.psam/.log.
Instance types:
"standard"mem2_ssd1_v2_x4: 4 cores, 12 GB RAM. Suitable for smaller chromosomes (roughly chr 17–22)."large"mem2_ssd2_v2_x8: 8 cores, 28 GB RAM, 640 GB SSD. Required for large chromosomes (roughly chr 1–16) where standard storage is insufficient.
Value
A character vector of job IDs (one per chromosome), returned
invisibly. Failed submissions are NA. Use job_ls,
job_status, or job_wait to monitor progress.
Examples
## Not run:
# Test with chr22 first (smallest chromosome)
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
# Small chromosomes - standard instance
ids_small <- grs_bgen2pgen(chr = 15:22, dest = "/pgen/")
# Large chromosomes - upgrade instance to handle storage
ids_large <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", instance = "large")
# Monitor
job_ls()
## End(Not run)
Check and export a GRS weights file
Description
Reads a SNP weights file, validates its content, and writes a plink2-compatible space-delimited output ready for upload to UKB RAP.
Usage
grs_check(file, dest = NULL)
Arguments
file |
Character scalar. Path to the input weights file.
Read via |
dest |
Character scalar or |
Details
The input file must contain at least the three columns below (additional columns are ignored):
snp - SNP identifier, expected in rs + digits format.
effect_allele - Effect allele; must be one of A / T / C / G.
beta - Effect size (log-OR or beta coefficient); must be numeric.
Checks performed:
- Required columns present.
- No NA values in the three required columns.
- No duplicate snp identifiers.
- snp matches the rs[0-9]+ pattern (warning if not).
- effect_allele contains only A / T / C / G (warning if not).
- beta is numeric (error if not).
Value
A data.table with columns snp, effect_allele,
and beta, returned invisibly.
Examples
tmp_in <- tempfile(fileext = ".csv")
weights <- data.frame(
snp = c("rs1234567", "rs2345678", "rs3456789"),
effect_allele = c("A", "T", "G"),
beta = c(0.12, -0.05, 0.23)
)
write.csv(weights, tmp_in, row.names = FALSE)
w <- grs_check(tmp_in)
w
# Save validated weights to a file
tmp_out <- tempfile(fileext = ".txt")
grs_check(tmp_in, dest = tmp_out)
Calculate genetic risk scores from PGEN files on RAP
Description
Uploads local SNP weight files to the RAP project root, then submits one
Swiss Army Knife job per GRS. Each job runs plink2 --score across
all 22 chromosomes and saves a single CSV to dest on completion.
Jobs run in parallel; use job_ls to monitor progress.
Usage
grs_score(
file,
pgen_dir = NULL,
dest = NULL,
maf = 0.01,
instance = "standard",
priority = "low"
)
Arguments
file |
Named character vector of local weight file paths. Names become
the GRS identifiers (output column = |
pgen_dir |
Character scalar. Path to PGEN files on RAP
(e.g. |
dest |
Character scalar. RAP destination path for output CSV files
(e.g. |
maf |
Numeric scalar. MAF filter threshold used when locating PGEN
files. Must match the value used in |
instance |
Character scalar. Instance type preset: |
priority |
Character scalar. Job priority: |
Details
Weight files should have three columns (any delimiter, header required):
- Column 1: Variant ID (e.g. rsIDs).
- Column 2: Effect allele (A1).
- Column 3: Effect weight (beta / log-OR).
This matches the output format of grs_check.
Output per job: dest/<score_name>_scores.csv with columns
IID and the GRS score (named GRS_<name>).
Instance types:
"standard"mem2_ssd1_v2_x4: 4 cores, 12 GB RAM."large"mem2_ssd2_v2_x8: 8 cores, 28 GB RAM.
Value
A named character vector of job IDs (one per GRS), returned
invisibly. Failed submissions are NA. Use job_ls to
monitor progress.
Examples
## Not run:
ids <- grs_score(
file = c(
grs_a = "weights/grs_a_weights.txt",
grs_b = "weights/grs_b_weights.txt"
),
pgen_dir = "/mnt/project/pgen",
dest = "/grs/",
priority = "high"
)
job_ls()
## End(Not run)
Standardise GRS columns by Z-score transformation
Description
Adds a _z column for every selected GRS column:
z = (x - mean(x)) / sd(x). The original columns are kept
unchanged. grs_zscore is an alias for this function.
Usage
grs_standardize(data, grs_cols = NULL)
grs_zscore(data, grs_cols = NULL)
Arguments
data |
A |
grs_cols |
Character vector of column names to standardise.
If |
Value
The input data as a data.table with one additional
_z column per GRS column appended after its source column.
Examples
dt <- data.frame(
IID = 1:5,
GRS_a = c(0.12, 0.34, 0.56, 0.23, 0.45),
GRS_b = c(1.1, 0.9, 1.3, 0.8, 1.0)
)
grs_standardize(dt)
grs_zscore(dt) # identical
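For reference, the documented formula is plain base R; the sketch below (independent of the package) shows it agrees with scale():

```r
# Manual z-score following the documented formula: z = (x - mean(x)) / sd(x)
x <- c(0.12, 0.34, 0.56, 0.23, 0.45)
z <- (x - mean(x)) / sd(x)
# Base R's scale() performs the same centring and scaling
all.equal(z, as.numeric(scale(x)))  # TRUE
```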
Validate GRS predictive performance
Description
For each GRS column, computes four sets of validation metrics:
- Per SD: OR (logistic) or HR (Cox) per 1-SD increase.
- High vs Low: OR / HR comparing the top 20% vs the bottom 20% (extreme tertile grouping: Low / Mid / High).
- Trend test: P-trend across quartiles (Q1–Q4).
- Discrimination: AUC (logistic) or C-index (Cox).
Usage
grs_validate(
data,
grs_cols = NULL,
outcome_col,
time_col = NULL,
covariates = NULL
)
Arguments
data |
A |
grs_cols |
Character vector of GRS column names to validate.
If |
outcome_col |
Character scalar. Name of the outcome column
( |
time_col |
Character scalar or |
covariates |
Character vector or |
Details
GRS grouping columns are created internally via derive_cut
and are not added to the user's data. When time_col is
NULL, logistic regression is used throughout; when supplied, Cox
proportional hazards models are used.
Models follow the same adjustment logic as assoc_logistic /
assoc_coxph: unadjusted and age-sex adjusted models are
always included; a fully adjusted model is added when covariates
is non-NULL.
Value
A named list with four data.table elements:
- per_sd: OR / HR per 1-SD increase in GRS.
- high_vs_low: OR / HR for High vs Low extreme tertile.
- trend: P-trend across Q1–Q4 quartiles.
- discrimination: AUC (logistic) or C-index (Cox) with 95% CI.
Examples
dt <- ops_toy(scenario = "association", n = 500)
dt <- grs_standardize(dt, grs_cols = "grs_bmi")
res <- grs_validate(
data = dt,
grs_cols = "grs_bmi_z",
outcome_col = "dm_status",
time_col = "dm_followup_years"
)
res$per_sd
if (requireNamespace("pROC", quietly = TRUE)) {
res_logit <- grs_validate(
data = dt,
grs_cols = "grs_bmi_z",
outcome_col = "dm_status"
)
res_logit$discrimination
}
List recent DNAnexus jobs in the current project
Description
Returns a summary of the most recent jobs, optionally filtered by state. Useful for quickly reviewing which jobs have completed, failed, or are still running.
Usage
job_ls(n = 20, state = NULL)
Arguments
n |
(integer) Maximum number of recent jobs to return. Must be a
single positive integer. Default: |
state |
(character) Filter by state(s). Must be |
Value
A data.frame with columns:
- job_id: Job ID string, e.g. "job-XXXX".
- name: Job name (typically "Table exporter").
- state: Current job state.
- created: Job creation time (POSIXct).
- runtime: Runtime string (e.g. "0:04:36"), NA if still running.
Examples
## Not run:
job_ls()
job_ls(n = 5)
job_ls(state = "failed")
job_ls(state = c("done", "failed"))
## End(Not run)
Get the RAP file path of a completed DNAnexus job output
Description
Returns the absolute /mnt/project/ path of the CSV produced by
extract_batch(). Use this to read the file directly on the RAP
without downloading.
Usage
job_path(job_id)
Arguments
job_id |
(character) Job ID returned by |
Value
A character string — the absolute path to the output CSV under
/mnt/project/.
Examples
## Not run:
path <- job_path(job_id)
df <- data.table::fread(path)
## End(Not run)
Load the result of a completed DNAnexus job into R
Description
Reads the output CSV produced by extract_batch() directly from RAP
project storage and returns a data.table. Must be run inside the
RAP environment.
Usage
job_result(job_id)
Arguments
job_id |
(character) Job ID returned by |
Value
A data.table with one row per participant.
Examples
## Not run:
job_id <- extract_batch(c(31, 53, 21022))
job_wait(job_id)
df <- job_result(job_id)
## End(Not run)
Check the current state of a DNAnexus job
Description
Returns the current state of a job submitted by extract_batch().
For failed jobs, the failure message is attached as an attribute.
Usage
job_status(job_id)
Arguments
job_id |
(character) Job ID returned by |
Value
A named character string — the job state. Possible values:
"idle"Queued, waiting to be scheduled.
"runnable"Resources being allocated.
"running"Actively executing.
"done"Completed successfully.
"failed"Failed; see
attr(result, "failure_message")."terminated"Manually terminated.
Examples
## Not run:
job_id <- extract_batch(c(31, 53, 21022))
job_status(job_id)
s <- job_status(job_id)
if (s == "failed") cli::cli_inform(attr(s, "failure_message"))
## End(Not run)
Wait for a DNAnexus job to finish
Description
Polls job_status() at regular intervals until the job reaches a
terminal state ("done", "failed", or "terminated").
Stops with an error if the job fails, is terminated, or times out.
Usage
job_wait(job_id, interval = 30, timeout = Inf, verbose = TRUE)
Arguments
job_id |
(character) Job ID returned by |
interval |
(integer) Polling interval in seconds. Default: |
timeout |
(numeric) Maximum wait time in seconds. Default: |
verbose |
(logical) Print state and elapsed time at each poll.
Default: |
Value
Invisibly returns the final state string ("done").
Examples
## Not run:
job_id <- extract_batch(c(31, 53, 21022))
job_wait(job_id)
# Read result immediately after completion (RAP only)
job_wait(job_id)
df <- job_result(job_id)
## End(Not run)
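Conceptually, the waiting behaviour can be sketched in plain R (a simplified illustration, not the package implementation; get_state stands in for job_status()):

```r
# Simplified sketch of a polling loop: query a state function at regular
# intervals until the job reaches a terminal state, or give up on timeout.
poll_until_terminal <- function(get_state, interval = 30, timeout = Inf) {
  start <- Sys.time()
  repeat {
    state <- get_state()
    if (state %in% c("done", "failed", "terminated")) return(invisible(state))
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed > timeout) stop("Timed out waiting for job")
    Sys.sleep(interval)
  }
}
```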
Summarise missing values by column
Description
Scans each column of a data.frame or data.table and returns the count and
percentage of missing values. Results are sorted by missingness in
descending order. Columns above 10% missingness are flagged in the CLI report, and the threshold argument controls which columns are listed.
Usage
ops_na(data, threshold = 0, verbose = TRUE)
Arguments
data |
A data.frame or data.table to scan. |
threshold |
(numeric) Columns with |
verbose |
(logical) Print the CLI report. Default |
Details
Missing value definition: a value is counted as missing if it is NA
or an empty string (""). Empty strings are treated as missing because
UKB exports frequently use "" as a placeholder for absent text values.
This means n_na and pct_na reflect effective missingness, not just
is.na(). Numeric and logical columns are not affected (they cannot hold
"").
Value
An invisible data.table with columns column, n_na, and pct_na,
sorted by pct_na descending. n_na counts both NA and "".
Always contains all columns regardless of threshold (which only affects
CLI output).
Examples
dt <- ops_toy(n = 100)
# Show all columns with any missing value
ops_na(dt)
# Only list columns with > 10% missing in the CLI output
ops_na(dt, threshold = 10)
# Programmatic use — retrieve result silently
result <- ops_na(dt, verbose = FALSE)
result[pct_na > 50]
Register additional safe columns protected from snapshot-based drops
Description
Adds column names to the session-level safe list. Columns in this list are
automatically excluded when ops_snapshot_cols is used to
build a drop vector, in addition to the built-in protected columns
("eid", "sex", "age", "age_at_recruitment").
Usage
ops_set_safe_cols(cols = NULL, reset = FALSE)
Arguments
cols |
(character) One or more column names to protect. |
reset |
(logical) If |
Value
Invisibly returns the updated safe cols vector.
Examples
ops_set_safe_cols(c("date_baseline", "townsend_index"))
ops_set_safe_cols(reset = TRUE) # clear user-registered safe cols
Check the ukbflow operating environment
Description
Runs a four-block health check covering the dx CLI, dxpy (Python), RAP authentication, and R package dependencies. Designed to be the first function a new user runs after installation.
Usage
ops_setup(
check_dx = TRUE,
check_auth = TRUE,
check_deps = TRUE,
verbose = TRUE
)
Arguments
check_dx |
(logical) Check dx CLI installation (dxpy is implied by dx). |
check_auth |
(logical) Check RAP login status. |
check_deps |
(logical) Check R package dependencies. |
verbose |
(logical) Print the formatted report. Set to |
Details
The function is read-only: it never modifies system state, installs packages, or authenticates. Auth failures are reported as warnings, not errors, because the check itself does not require a live RAP connection.
Value
An invisible named list with elements dx, dxpy, auth, deps,
and summary. Each element reflects the result of its respective check
block and can be inspected programmatically.
Examples
ops_setup(check_dx = FALSE, check_auth = FALSE)
result <- ops_setup(check_dx = FALSE, check_auth = FALSE, verbose = FALSE)
result$summary$fail == 0
Record and review dataset pipeline snapshots
Description
Captures a lightweight summary of a data.frame at a given pipeline stage
and stores it in the session cache. Subsequent calls automatically compute
deltas against the previous snapshot, making it easy to track how data
changes through derive_*, assoc_*, and other processing steps.
Usage
ops_snapshot(
data = NULL,
label = NULL,
reset = FALSE,
verbose = TRUE,
check_na = TRUE
)
Arguments
data |
A data.frame or data.table to snapshot. Pass |
label |
(character) A short name for this pipeline stage, e.g.
|
reset |
(logical) If |
verbose |
(logical) Print the CLI report. Default |
check_na |
(logical) Whether to count columns with any |
Value
When data is supplied, returns the new snapshot row invisibly
(a one-row data.table). When called with no data, returns the full
history data.table invisibly.
Examples
dt <- ops_toy(n = 100)
ops_snapshot(dt, label = "raw")
dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
# View full history
ops_snapshot()
# Reset history
ops_snapshot(reset = TRUE)
Retrieve column names recorded at a snapshot
Description
Returns the column names stored by a previous ops_snapshot
call, optionally excluding columns you wish to keep.
Usage
ops_snapshot_cols(label, keep = NULL)
Arguments
label |
(character) Snapshot label passed to |
keep |
(character or NULL) Column names to exclude from the returned
vector (i.e. columns to retain in the data even if they were present at
that snapshot). Default |
Value
A character vector of column names.
Examples
dt <- ops_toy(n = 100)
ops_snapshot(dt, label = "raw")
ops_snapshot_cols("raw")
ops_snapshot_cols("raw", keep = "eid")
Compare column names between two snapshots
Description
Returns lists of columns added and removed between two recorded snapshots.
Usage
ops_snapshot_diff(label1, label2)
Arguments
label1 |
(character) Label of the earlier snapshot. |
label2 |
(character) Label of the later snapshot. |
Value
A named list with two character vectors: added (columns
present in label2 but not label1) and removed
(columns present in label1 but not label2).
Examples
dt <- ops_toy(n = 100)
ops_snapshot(dt, label = "raw")
dt <- derive_missing(dt)
ops_snapshot(dt, label = "derived")
ops_snapshot_diff("raw", "derived")
# $added — newly derived columns
# $removed — columns dropped between snapshots
Remove raw source columns recorded at a snapshot
Description
Drops columns that were present at snapshot from from data,
while automatically protecting built-in safe columns
("eid", "sex", "age", "age_at_recruitment") and
any user-registered safe columns set via ops_set_safe_cols.
Columns that no longer exist in data are silently skipped.
Usage
ops_snapshot_remove(data, from, keep = NULL, verbose = TRUE)
Arguments
data |
A data.frame or data.table. |
from |
(character) Label of the snapshot whose columns should be
dropped (typically |
keep |
(character or NULL) Additional column names to protect beyond
the built-in and user-registered safe cols. Default |
verbose |
(logical) Print a summary of dropped columns. Default
|
Value
A data.table with the specified columns removed. For
data.table input the operation is performed by reference (in-place);
for data.frame input the data is first converted to a new
data.table — the original data.frame is not modified.
Examples
dt <- ops_toy(n = 100)
ops_snapshot(dt, label = "raw")
dt <- derive_missing(dt)
ops_snapshot(dt, label = "derived")
ops_snapshot_diff("raw", "derived")
dt <- ops_snapshot_remove(dt, from = "raw")
Generate toy UKB-like data for testing and development
Description
Creates a small, synthetic dataset that mimics the structure of UK Biobank
phenotype data on the RAP. Useful for developing and testing derive_*,
assoc_*, and plot_* functions without requiring real UKB data access.
Usage
ops_toy(scenario = "cohort", n = 1000L, seed = 42L)
Arguments
scenario |
(character) Data structure to generate:
|
n |
(integer) Number of participants (or exposures for |
seed |
(integer or NULL) Random seed for reproducibility. Pass |
Details
This dataset is entirely synthetic. Column names follow RAP conventions
(e.g. p41270, p20002_i0_a0).
scenario = "cohort"
Includes the following column groups:
- Demographics: eid, p31, p34, p53_i0, p21022
- Covariates: p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0
- Genetic PCs: p22009_a1–p22009_a10
- Self-report disease: p20002_i0_a0–a4, p20008_i0_a0–a4
- Self-report cancer: p20001_i0_a0–a4, p20006_i0_a0–a4
- HES: p41270 (JSON array), p41280_a0–a8
- Cancer registry: p40006_i0–i2, p40011_i0–i2, p40012_i0–i2, p40005_i0–i2
- Death registry: p40001_i0, p40002_i0_a0–a2, p40000_i0
- First occurrence: p131742
- GRS: grs_bmi, grs_raw, grs_finngen
- Messy columns: messy_allna, messy_empty, messy_label
scenario = "association"
Analysis-ready table. All derive inputs (raw arrays, HES JSON, registry fields) are omitted; derive outputs are pre-computed with internally consistent relationships:
- Demographics: eid, p31 (factor), p53_i0 (IDate), p21022
- Covariates: p21001_i0, bmi_cat (factor, derived from p21001_i0), p20116_i0 (factor), p1558_i0 (factor), p21000_i0 (factor), p22189, tdi_cat (factor, derived from p22189 quartiles), p54_i0 (factor)
- Genetic PCs: p22009_a1–p22009_a10
- GRS: grs_bmi (continuous exposure)
- DM outcome: dm_status, dm_date, dm_timing, dm_followup_end, dm_followup_years (type 2 diabetes, ~12% prevalence)
- HTN outcome: htn_status, htn_date, htn_timing, htn_followup_end, htn_followup_years (hypertension, ~28% prevalence)
Internal relationships guaranteed:
- bmi_cat is cut from p21001_i0 (breaks 18.5 / 25 / 30)
- tdi_cat is cut from p22189 quartiles
- dm_date is NA iff dm_status = FALSE
- dm_timing: 0 = no disease, 1 = prevalent, 2 = incident, NA = no date
- dm_followup_years is NA for prevalent cases (dm_timing == 1)
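Assuming ukbflow is attached, the documented guarantees can be spot-checked directly:

```r
dt <- ops_toy(scenario = "association", n = 200)
# dm_date is NA exactly when dm_status is FALSE
stopifnot(all(is.na(dt$dm_date) == !dt$dm_status))
# Follow-up time is NA for prevalent cases (dm_timing == 1)
stopifnot(all(is.na(dt$dm_followup_years[dt$dm_timing == 1L])))
```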
Value
A data.table with UKB-style column names. See Details for the
columns included in each scenario.
Examples
# cohort: raw UKB-style columns, feed into derive pipeline
dt <- ops_toy(n = 100)
dt <- derive_missing(dt)
# association: analysis-ready, feed directly into assoc_* functions
dt <- ops_toy(scenario = "association", n = 500)
dt <- dt[dm_timing != 1L] # exclude prevalent cases
# forest: results table for plot_forest()
dt <- ops_toy(scenario = "forest")
Exclude withdrawn participants from a dataset
Description
Reads a UK Biobank withdrawal list (a headerless single-column CSV of
anonymised participant IDs) and removes the corresponding rows from
data. A pair of ops_snapshot() calls is made automatically so the
before/after row counts are recorded in the session snapshot history.
Usage
ops_withdraw(data, file, eid_col = "eid", verbose = TRUE)
Arguments
data |
A data.frame or data.table containing a participant ID column. |
file |
(character) Path to the UKB withdrawal CSV file. The file must
be a single-column, header-free CSV as supplied by UK Biobank
(e.g. |
eid_col |
(character) Name of the participant ID column in |
verbose |
(logical) Print the CLI report. Default |
Value
A data.table with withdrawn participants removed.
Examples
dt <- ops_toy(n = 100)
withdraw_file <- tempfile(fileext = ".csv")
writeLines(as.character(dt$eid[1:5]), withdraw_file)
dt <- ops_withdraw(dt, file = withdraw_file)
Publication-ready forest plot
Description
Produces a publication-ready forest plot with UKB-standard styling.
The user supplies a data frame whose first column is the row label
(item), plus any additional display columns (e.g. Cases/N).
The gap column and the auto-formatted OR (95% CI) text column are
inserted automatically at ci_column. Numeric p-value columns
declared via p_cols are formatted in-place.
Usage
plot_forest(
data,
est,
lower,
upper,
ci_column = 2L,
ref_line = 1,
xlim = NULL,
ticks_at = NULL,
arrow_lab = c("Lower risk", "Higher risk"),
header = NULL,
indent = NULL,
bold_label = NULL,
ci_col = "black",
ci_sizes = 0.6,
ci_Theight = 0.2,
ci_digits = 2L,
ci_sep = ", ",
p_cols = NULL,
p_digits = 3L,
bold_p = TRUE,
p_threshold = 0.05,
align = NULL,
background = "zebra",
bg_col = "#F0F0F0",
border = "three_line",
border_width = 3,
row_height = NULL,
col_width = NULL,
save = FALSE,
dest = NULL,
save_width = 20,
save_height = NULL,
theme = "default"
)
Arguments
data |
data.frame. First column must be the label column ( |
est |
Numeric vector. Point estimates ( |
lower |
Numeric vector. Lower CI bounds (same length as |
upper |
Numeric vector. Upper CI bounds (same length as |
ci_column |
Integer. Column position in the final rendered table where
the gap/CI graphic is placed. Must be between |
ref_line |
Numeric. Reference line. Default: |
xlim |
Numeric vector of length 2. X-axis limits. |
ticks_at |
Numeric vector. Tick positions. |
arrow_lab |
Character vector of length 2. Directional labels.
Default: |
header |
Character vector of length |
indent |
Integer vector (length = |
bold_label |
Logical vector (length = |
ci_col |
Character scalar or vector (length = |
ci_sizes |
Numeric. Point size. Default: |
ci_Theight |
Numeric. CI cap height. Default: |
ci_digits |
Integer. Decimal places for the auto-generated
|
ci_sep |
Character. Separator between lower and upper CI in the label,
e.g. |
p_cols |
Character vector. Names of numeric p-value columns in
|
p_digits |
Integer. Decimal places for p-value formatting.
Default: |
bold_p |
|
p_threshold |
Numeric. P-value threshold for bolding when
|
align |
Integer vector of length |
background |
Character. Row background style: |
bg_col |
Character. Shading colour for backgrounds (scalar), or a
per-row vector of length |
border |
Character. Border style: |
border_width |
Numeric. Border line width(s). Scalar = all three lines
same width; length-3 vector = top-of-header, bottom-of-header,
bottom-of-body. Default: |
row_height |
|
col_width |
|
save |
Logical. Save output to files? Default: |
dest |
Character. Destination file path (extension ignored; all four
formats are saved). Required when |
save_width |
Numeric. Output width in cm. Default: |
save_height |
Numeric or |
theme |
Character preset ( |
Value
A forestploter plot object (gtable), returned invisibly.
Display with plot() or grid::grid.draw().
Examples
df <- data.frame(
item = c("Exposure vs. control", "Unadjusted", "Fully adjusted"),
`Cases/N` = c("", "89/4521", "89/4521"),
p_value = c(NA_real_, 0.001, 0.006),
check.names = FALSE
)
p <- plot_forest(
data = df,
est = c(NA, 1.52, 1.43),
lower = c(NA, 1.18, 1.11),
upper = c(NA, 1.96, 1.85),
ci_column = 2L,
indent = c(0L, 1L, 1L),
bold_label = c(TRUE, FALSE, FALSE),
p_cols = "p_value",
xlim = c(0.5, 3.0)
)
plot(p)
Publication-ready Table 1 (Baseline Characteristics)
Description
Generates a publication-quality baseline-characteristics table (Table 1) using gtsummary, with optional SMD column, Lancet-style theming, and automatic export to four formats.
Usage
plot_tableone(
data,
vars,
strata = NULL,
type = NULL,
label = NULL,
statistic = NULL,
digits = NULL,
percent = "column",
missing = "no",
add_p = TRUE,
add_smd = FALSE,
overall = FALSE,
exclude_labels = NULL,
theme = "lancet",
label_width = 200,
stat_width = 140,
pvalue_width = 100,
row_height = 8,
save = FALSE,
dest = NULL,
png_scale = 2,
pdf_width = NULL,
pdf_height = NULL
)
Arguments
data |
data.frame. Input data containing all variables. |
vars |
Character vector. Variable names to display in the table. |
strata |
Character scalar. Column name for the grouping/stratification
variable. |
type |
Named list or |
label |
Named list / formula list or |
statistic |
Named list or |
digits |
Named list or |
percent |
Character. Percentage base for categorical variables:
|
missing |
Character. Display of missing counts:
|
add_p |
Logical. Add a p-value column. Default |
add_smd |
Logical. Add a standardised mean difference (SMD) column.
Continuous variables use Cohen's d; categorical variables use
root-mean-square deviation (RMSD) of group proportions. SMD is shown
only on label rows. Default |
overall |
Logical. Add an Overall column when |
exclude_labels |
Character vector or |
theme |
Character. Visual theme preset. Only |
label_width |
Integer. Label column width in pixels. Default |
stat_width |
Integer. Statistics column(s) width in pixels.
Default |
pvalue_width |
Integer. P-value and SMD column width in pixels.
Default |
row_height |
Integer. Data row padding (top + bottom) in pixels.
Default |
save |
Logical. Export the table to files. Default |
dest |
Character or |
png_scale |
Numeric. Zoom factor for PNG export via webshot2.
Higher values produce larger, higher-resolution images. Default: |
pdf_width |
Numeric or |
pdf_height |
Numeric or |
Details
The following behaviours are fixed (not exposed as parameters):
- Variable labels are bold (bold_labels()).
- P-values are formatted as <0.001 or to 3 decimal places.
- Significant p-values (p < 0.05) are bold.
- The p-value column header is rendered as P-value.
- The table uses a three-line (booktabs) border.
- Saving always exports word, html, pdf, and png.
Value
A gt table object, returned invisibly.
Examples
library(gtsummary)
data(trial)
# Basic stratified table
plot_tableone(
data = trial,
vars = c("age", "marker", "grade"),
strata = "trt",
save = FALSE
)
# With SMD, custom labels, exclude Unknown level
plot_tableone(
data = trial,
vars = c("age", "marker", "grade", "stage"),
strata = "trt",
label = list(age ~ "Age (years)", marker ~ "Marker level"),
add_smd = TRUE,
exclude_labels = "Unknown",
save = FALSE
)