The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Deriving Disease Phenotypes from UKB Data

Overview

The derive_* functions convert raw UKB columns into analysis-ready variables. This vignette covers the disease phenotype derivation pipeline:

Step Function(s) Purpose
1 derive_missing() Handle “Do not know” / “Prefer not to answer”
2 derive_covariate() Convert types; summarise covariates
3 derive_cut() Bin continuous variables into groups
4 derive_selfreport() Self-reported disease status + date
5 derive_hes() HES inpatient ICD-10 status + date
6 derive_first_occurrence() First Occurrence field status + date
7 derive_cancer_registry() Cancer registry status + date
8 derive_death_registry() Death registry ICD-10 status + date
9 derive_icd10() Combine any subset of sources (wrapper)
10 derive_case() Merge self-report + ICD-10 into final case definition

All functions accept a data.frame or data.table and return a data.table. For data.table input, new columns are added by reference (no copy); data.frame input is converted to data.table internally before modification.

In production, replace ops_toy() with extract_batch() followed by decode_values() and decode_names(). See vignette("decode"). Column names below use the RAP raw format (p{field}_{instance}_{array}) as returned by ops_toy() and extract_batch() before decoding.


Setup

library(ukbflow)

df <- ops_toy(n = 500)

Step 1: Handle Informative Missing Labels

UKB uses special labels such as "Do not know" and "Prefer not to answer" to distinguish refusal from true missing data. derive_missing() converts these to NA (default) or retains them as "Unknown" for modelling.

df <- derive_missing(df)

Performance: derive_missing() uses data.table::set() for in-place replacement — no column copies are made regardless of dataset size.

To keep non-response as a model category:

df <- derive_missing(df, action = "unknown")

To add custom labels beyond the built-in list:

df <- derive_missing(df, extra_labels = "Not applicable")

Step 2: Prepare Covariates

derive_covariate() converts categorical columns to factor and prints a distribution summary for each.

df <- derive_covariate(
  df,
  as_factor = c(
    "p31",        # sex
    "p20116_i0",  # smoking_status_i0
    "p1558_i0"    # alcohol_intake_frequency_i0
  ),
  factor_levels = list(
    p20116_i0 = c("Never", "Previous", "Current")
  )
)

Step 3: Bin Continuous Variables

derive_cut() creates a new factor column by binning a continuous variable into quantile-based or custom groups.

df <- derive_cut(
  df,
  col    = "p21001_i0",                              # body_mass_index_bmi_i0
  n      = 4,
  breaks = c(18.5, 25, 30),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  name   = "bmi_cat"
)

df <- derive_cut(
  df,
  col    = "p22189",                                 # townsend_deprivation_index_at_recruitment
  n      = 4,
  labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
  name   = "tdi_cat"
)

Step 4: Self-Reported Disease

derive_selfreport() searches UKB self-reported non-cancer illness (field 20002) or cancer (field 20001) columns for a disease label matching a regex, then returns binary status and the earliest report date. Column detection is automatic from field IDs.

# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
  name  = "dm",
  regex = "type 2 diabetes"
)
# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
  name  = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)

This adds two columns per call:

Column Type Description
dm_selfreport logical TRUE if any instance matched
dm_selfreport_date IDate Earliest report date

Step 5: HES Inpatient Records

derive_hes() scans UKB Hospital Episode Statistics ICD-10 codes (field 41270, stored as a JSON array per participant) and matches the earliest corresponding date from field 41280.

# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")

# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")

# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")

The match argument controls how codes are compared:

match Behaviour Example
"prefix" (default) Code starts with pattern "E11" matches "E110", "E119"
"exact" Full 3- or 4-digit match "E11" matches only "E11"
"regex" Full regular expression "^E1[01]"

Step 6: First Occurrence Fields

UKB First Occurrence fields (p131xxx) record the earliest date a condition was observed across all linked sources — self-report, HES inpatient, GP records, and death registry — pre-integrated by UKB. Look up your disease in the UKB Field Finder.

# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")

Step 7: Cancer Registry

derive_cancer_registry() searches the cancer registry ICD-10 field (40006) and optionally filters by histology (field 40011) and behaviour (field 40012).

# ICD-10 only
df <- derive_cancer_registry(df,
  name  = "skin_cancer",
  icd10 = "^C44"
)

# With histology and behaviour filters
df <- derive_cancer_registry(df,
  name      = "scc",
  icd10     = "^C44",
  histology = c(8070L, 8071L, 8072L),
  behaviour = 3L                        # 3 = malignant
)

Step 8: Death Registry

derive_death_registry() searches primary (field 40001) and secondary (field 40002) causes of death for ICD-10 codes.

df <- derive_death_registry(df, name = "mi",   icd10 = "I21")
df <- derive_death_registry(df, name = "dm",   icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")

Step 9: Combine Sources with derive_icd10()

derive_icd10() is a high-level wrapper that calls any combination of the source-specific functions above and merges their outputs into a single status column and earliest date. This is the recommended approach for multi-source ascertainment.

# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
  name   = "dm",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"
)

# Cancer outcome: cancer registry
df <- derive_icd10(df,
  name      = "lung",
  icd10     = "^C3[34]",
  match     = "regex",
  source    = "cancer_registry",
  behaviour = 3L
)

Intermediate source columns are retained alongside the combined result:

Column Type Description
dm_icd10 logical TRUE if positive in any specified source
dm_icd10_date IDate Earliest date across all sources
dm_hes logical HES status
dm_hes_date IDate HES date
dm_fo logical First Occurrence status
dm_fo_date IDate First Occurrence date
dm_death logical Death registry status
dm_death_date IDate Death registry date

Step 10: Final Case Definition

derive_case() merges the self-report and ICD-10 flags into a unified case status, with the earliest date across both sources taken via pmin().

df <- derive_case(df, name = "dm")

Output columns:

Column Type Description
dm_status logical TRUE if positive in self-report OR ICD-10
dm_date IDate Earliest date across all sources (pmin)

Why the earliest date matters: dm_date is the direct input to derive_timing(), derive_age(), and derive_followup() — it is the chronological anchor of every downstream survival analysis. See vignette("derive-survival").


Getting Help

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.