The derive_* functions convert raw UKB columns into
analysis-ready variables. This vignette covers the disease phenotype
derivation pipeline:
| Step | Function(s) | Purpose |
|---|---|---|
| 1 | derive_missing() | Handle “Do not know” / “Prefer not to answer” |
| 2 | derive_covariate() | Convert types; summarise covariates |
| 3 | derive_cut() | Bin continuous variables into groups |
| 4 | derive_selfreport() | Self-reported disease status + date |
| 5 | derive_hes() | HES inpatient ICD-10 status + date |
| 6 | derive_first_occurrence() | First Occurrence field status + date |
| 7 | derive_cancer_registry() | Cancer registry status + date |
| 8 | derive_death_registry() | Death registry ICD-10 status + date |
| 9 | derive_icd10() | Combine any subset of sources (wrapper) |
| 10 | derive_case() | Merge self-report + ICD-10 into final case definition |
All functions accept a data.frame or
data.table and return a data.table. For
data.table input, new columns are added by
reference (no copy); data.frame input is converted
to data.table internally before modification.
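The by-reference behaviour can be seen with a small stand-in function. The sketch below assumes only the data.table package; add_flag() is a hypothetical helper, not part of this package, that mimics the input handling described above.

```r
library(data.table)

# Hypothetical stand-in for a derive_* function (not part of the package):
# converts data.frame input, then adds a column by reference.
add_flag <- function(x) {
  if (!is.data.table(x)) x <- as.data.table(x)  # a data.frame is copied here
  x[, flag := TRUE]                             # := modifies x in place
  x[]
}

dt <- data.table(eid = 1:3)
df <- data.frame(eid = 1:3)

out_dt <- add_flag(dt)
out_df <- add_flag(df)

"flag" %in% names(dt)   # the caller's data.table gained the column
"flag" %in% names(df)   # the caller's data.frame is untouched
class(out_df)           # the result is a data.table either way
```

Assigning the result back (df <- derive_*(df, ...)), as the examples in this vignette do, therefore works for both input types.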
In production, replace ops_toy() with extract_batch() followed by decode_values() and decode_names(). See vignette("decode"). Column names below use the RAP raw format (p{field}_i{instance}_a{array}, for example p20116_i0), as returned by ops_toy() and extract_batch() before decoding.
UKB uses special labels such as "Do not know" and
"Prefer not to answer" to distinguish refusal from true
missing data. derive_missing() converts these to
NA (default) or retains them as "Unknown" for
modelling.
Performance:
derive_missing() uses data.table::set() for in-place replacement, so no column copies are made regardless of dataset size.
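As a rough illustration of the in-place technique (not the package's actual implementation), the sketch below uses data.table::set() to overwrite non-response labels with NA. The toy values and the label vector are assumptions made for the example.

```r
library(data.table)

dt <- data.table(
  eid       = 1:4,
  p20116_i0 = c("Never", "Prefer not to answer", "Current", "Previous"),
  p1558_i0  = c("Do not know", "Daily or almost daily", "Never", "Never")
)

# Assumed label list for this sketch; the package's built-in list may differ.
nonresponse <- c("Do not know", "Prefer not to answer")

for (col in c("p20116_i0", "p1558_i0")) {
  rows <- which(dt[[col]] %in% nonresponse)
  # set() overwrites only the selected cells; the column itself is not copied
  set(dt, i = rows, j = col, value = NA_character_)
}

dt
```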
Two variations are available: keeping non-response as a model category, and adding custom labels beyond the built-in list. See ?derive_missing for the relevant arguments.
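Both variations can be sketched in base R. The helper below is hypothetical: recode_nonresponse() and its arguments action and extra are names invented for this illustration and are not the derive_missing() interface.

```r
# Hypothetical helper (illustration only, not the derive_missing() interface).
recode_nonresponse <- function(x,
                               action = c("na", "unknown"),
                               extra = character()) {
  action <- match.arg(action)
  labels <- c("Do not know", "Prefer not to answer", extra)
  hit <- !is.na(x) & x %in% labels
  x[hit] <- if (action == "na") NA_character_ else "Unknown"
  x
}

smoking <- c("Never", "Prefer not to answer", "Current", "Not sure")

r_na      <- recode_nonresponse(smoking)                      # non-response becomes NA
r_unknown <- recode_nonresponse(smoking, action = "unknown")  # kept as a category
r_extra   <- recode_nonresponse(smoking, extra = "Not sure")  # custom label also recoded
```

Keeping "Unknown" as an explicit level lets a model estimate a separate effect for non-responders instead of dropping those rows as missing.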
derive_covariate() converts categorical columns to
factor and prints a distribution summary for each.
df <- derive_covariate(
df,
as_factor = c(
"p31", # sex
"p20116_i0", # smoking_status_i0
"p1558_i0" # alcohol_intake_frequency_i0
),
factor_levels = list(
p20116_i0 = c("Never", "Previous", "Current")
)
)

derive_cut() creates a new factor column by binning a
continuous variable into quantile-based or custom groups.
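The two binning modes correspond closely to what base R's cut() and quantile() do. The following is a sketch of that idea on made-up values, not the internals of derive_cut(); in particular, which side of each interval is closed is an assumption here.

```r
bmi <- c(17.2, 22.4, 27.9, 31.5, 24.9, 35.0)

# Custom groups: three interior break points give four bins.
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE            # intervals are [lower, upper) in this sketch
)

tdi <- c(-4.1, -2.3, -0.5, 1.2, 3.8, 6.0, -3.0, 0.4)

# Quantile groups: break at the quartiles of the observed values.
tdi_cat <- cut(
  tdi,
  breaks = quantile(tdi, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
  labels = c("Q1", "Q2", "Q3", "Q4"),
  include.lowest = TRUE     # keep the minimum value in Q1
)

table(bmi_cat)
table(tdi_cat)
```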
df <- derive_cut(
df,
col = "p21001_i0", # body_mass_index_bmi_i0
n = 4,
breaks = c(18.5, 25, 30),
labels = c("Underweight", "Normal", "Overweight", "Obese"),
name = "bmi_cat"
)
df <- derive_cut(
df,
col = "p22189", # townsend_deprivation_index_at_recruitment
n = 4,
labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
name = "tdi_cat"
)

derive_selfreport() searches UKB self-reported
non-cancer illness (field 20002) or cancer (field 20001) columns for a
disease label matching a regex, then returns binary status and the
earliest report date. Column detection is automatic from field IDs.
# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
name = "dm",
regex = "type 2 diabetes"
)

# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
name = "lung_cancer",
regex = "lung cancer",
field = "cancer"
)

This adds two columns per call:

| Column | Type | Description |
|---|---|---|
| dm_selfreport | logical | TRUE if any instance matched |
| dm_selfreport_date | IDate | Earliest report date |
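The matching logic can be sketched in base R on a toy wide table. The column names (illness_a0, date_a0, and so on) are simplified placeholders rather than real UKB column names, and the sketch makes no claim about how derive_selfreport() obtains report dates internally.

```r
toy <- data.frame(
  eid        = 1:3,
  illness_a0 = c("type 2 diabetes", "asthma", NA),
  illness_a1 = c("type 2 diabetes", "type 2 diabetes", NA),
  date_a0    = as.Date(c("2005-06-01", "1999-01-15", NA)),
  date_a1    = as.Date(c("2001-03-10", "2010-09-30", NA))
)

illness_cols <- c("illness_a0", "illness_a1")
date_cols    <- c("date_a0", "date_a1")

# TRUE where a label matches the regex; missing labels give FALSE
hits <- vapply(toy[illness_cols],
               function(x) grepl("type 2 diabetes", x),
               logical(nrow(toy)))

# Dates as numbers so they fit in a matrix; blank out non-matching entries
date_num <- vapply(toy[date_cols], as.numeric, numeric(nrow(toy)))
date_num[!hits] <- NA

earliest <- apply(date_num, 1, function(d) {
  if (all(is.na(d))) NA_real_ else min(d, na.rm = TRUE)
})

toy$dm_selfreport      <- rowSums(hits) > 0
toy$dm_selfreport_date <- as.Date(earliest, origin = "1970-01-01")
toy[c("eid", "dm_selfreport", "dm_selfreport_date")]
```

Only dates attached to a matching entry are considered, so participant 2's earlier asthma date does not leak into the diabetes date.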
derive_hes() scans UKB Hospital Episode Statistics
ICD-10 codes (field 41270, stored as a JSON array per participant) and
matches the earliest corresponding date from field 41280.
# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")
# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")
# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")

The match argument controls how codes are compared:

| match | Behaviour | Example |
|---|---|---|
| "prefix" (default) | Code starts with pattern | "E11" matches "E110", "E119" |
| "exact" | Full 3- or 4-character code match | "E11" matches only "E11" |
| "regex" | Full regular expression | "^E1[01]" |
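The three modes map onto standard string operations. The sketch below is a base R illustration with a hypothetical helper match_icd10(), not the package's code. It also assumes a toy layout in which one participant's codes are a JSON-style array string and the dates are a parallel vector in the same order.

```r
# Hypothetical helper: which codes match the pattern under each mode?
match_icd10 <- function(codes, pattern, match = c("prefix", "exact", "regex")) {
  match <- match.arg(match)
  switch(match,
    prefix = startsWith(codes, pattern),
    exact  = codes == pattern,
    regex  = grepl(pattern, codes)
  )
}

codes <- c("E110", "E119", "E11", "E109", "I10")
m_prefix <- match_icd10(codes, "E11")
m_exact  <- match_icd10(codes, "E11", match = "exact")
m_regex  <- match_icd10(codes, "^E1[01]", match = "regex")

# Toy participant: codes as a JSON-style array string, dates in the same order
raw_codes <- '["I10","E119","E110"]'
raw_dates <- as.Date(c("2012-04-02", "2015-08-20", "2009-11-05"))

# Strip brackets and quotes, then split on commas
parsed <- strsplit(gsub('\\[|\\]|"', "", raw_codes), ",", fixed = TRUE)[[1]]

hit      <- match_icd10(parsed, "E11")
dm_hes   <- any(hit)
dm_date  <- if (dm_hes) min(raw_dates[hit]) else as.Date(NA)
```

Note that a prefix match on "E11" does not pick up "E109", whereas the regex "^E1[01]" does, which is why the broader definition needs the regex mode.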
UKB First Occurrence fields (p131xxx) record the earliest date a condition was observed across all linked sources — self-report, HES inpatient, GP records, and death registry — pre-integrated by UKB. Look up your disease in the UKB Field Finder.
# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")

derive_cancer_registry() searches the cancer registry
ICD-10 field (40006) and optionally filters by histology (field 40011)
and behaviour (field 40012).
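Conceptually this is a row filter followed by a per-participant minimum. A base R sketch on a made-up long-format table follows; the long layout, its column names, and the presence of a registry date column are assumptions for illustration, not a description of how derive_cancer_registry() stores or reads the data.

```r
registry <- data.frame(
  eid       = c(1L, 1L, 2L, 3L),
  icd10     = c("C443", "C449", "C445", "C341"),
  histology = c(8070L, 8090L, 8071L, 8070L),
  behaviour = c(3L, 3L, 2L, 3L),
  date      = as.Date(c("2011-02-01", "2008-07-15", "2013-05-20", "2010-01-10"))
)

# A record qualifies only if all three conditions hold
keep <- grepl("^C44", registry$icd10) &
  registry$histology %in% c(8070L, 8071L, 8072L) &
  registry$behaviour == 3L

# Earliest qualifying record per participant, NA if none qualifies
first_date <- function(id) {
  d <- registry$date[keep & registry$eid == id]
  if (length(d) == 0) as.Date(NA) else min(d)
}

eids     <- 1:3
scc_date <- do.call(c, lapply(eids, first_date))
scc      <- !is.na(scc_date)
data.frame(eid = eids, scc, scc_date)
```

Participant 1 has an earlier C44 record, but its histology code falls outside the filter, so the later qualifying record supplies the date.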
# ICD-10 only
df <- derive_cancer_registry(df,
name = "skin_cancer",
icd10 = "^C44"
)
# With histology and behaviour filters
df <- derive_cancer_registry(df,
name = "scc",
icd10 = "^C44",
histology = c(8070L, 8071L, 8072L),
behaviour = 3L # 3 = malignant
)

derive_death_registry() searches primary (field 40001)
and secondary (field 40002) causes of death for ICD-10 codes.
df <- derive_death_registry(df, name = "mi", icd10 = "I21")
df <- derive_death_registry(df, name = "dm", icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")

derive_icd10() is a high-level wrapper that calls any
combination of the source-specific functions above and merges their
outputs into a single status column and earliest date. This is the
recommended approach for multi-source ascertainment.
# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
name = "dm",
icd10 = "E11",
source = c("hes", "death", "first_occurrence"),
fo_col = "p131742"
)
# Cancer outcome: cancer registry
df <- derive_icd10(df,
name = "lung",
icd10 = "^C3[34]",
match = "regex",
source = "cancer_registry",
behaviour = 3L
)

Intermediate source columns are retained alongside the combined result:

| Column | Type | Description |
|---|---|---|
| dm_icd10 | logical | TRUE if positive in any specified source |
| dm_icd10_date | IDate | Earliest date across all sources |
| dm_hes | logical | HES status |
| dm_hes_date | IDate | HES date |
| dm_fo | logical | First Occurrence status |
| dm_fo_date | IDate | First Occurrence date |
| dm_death | logical | Death registry status |
| dm_death_date | IDate | Death registry date |
derive_case() merges the self-report and ICD-10 flags
into a unified case status, with the earliest date across both sources
taken via pmin().
Output columns:

| Column | Type | Description |
|---|---|---|
| dm_status | logical | TRUE if positive in self-report OR ICD-10 |
| dm_date | IDate | Earliest date across all sources (pmin) |
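The merge rule can be written out in base R. This is a sketch of the logic on toy vectors, not the function's source; in particular, ignoring a missing date in one source via na.rm = TRUE is an assumption.

```r
dm_selfreport      <- c(TRUE, FALSE, TRUE, FALSE)
dm_selfreport_date <- as.Date(c("2004-05-01", NA, "2012-01-20", NA))
dm_icd10           <- c(TRUE, TRUE, FALSE, FALSE)
dm_icd10_date      <- as.Date(c("2001-09-13", "2016-03-02", NA, NA))

# Case if positive in either source
dm_status <- dm_selfreport | dm_icd10

# Element-wise earliest date; na.rm = TRUE keeps the one available date
# and returns NA only when both sources are missing
dm_date <- pmin(dm_selfreport_date, dm_icd10_date, na.rm = TRUE)

data.frame(dm_status, dm_date)
```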
Why the earliest date matters:
dm_date is the direct input to derive_timing(), derive_age(), and derive_followup(). It is the chronological anchor of every downstream survival analysis. See vignette("derive-survival").
- ?derive_missing, ?derive_covariate, ?derive_cut
- ?derive_selfreport, ?derive_hes, ?derive_first_occurrence
- ?derive_cancer_registry, ?derive_death_registry
- ?derive_icd10, ?derive_case
- vignette("derive-survival"): timing, age at event, follow-up
- vignette("decode"): decoding column names and values