The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Survival Analysis Setup for UKB Outcomes

Overview

After disease case definitions have been derived (see vignette("derive")), three additional functions prepare the data for time-to-event analysis:

Function Output columns Purpose
derive_timing() {name}_timing Classify prevalent vs. incident disease
derive_age() age_at_{name} Age at event (years)
derive_followup() {name}_followup_end, {name}_followup_years Follow-up end date and duration

Prerequisite: {name}_status and {name}_date must already be present — produced by vignette("derive"). The examples below assume the full disease derivation pipeline has been run on an ops_toy() dataset, so the baseline date column is p53_i0 and age at recruitment is p21022.


Step 1: Classify Timing — Prevalent vs. Incident

derive_timing() compares the disease date to the UKB baseline assessment date and assigns each participant to one of four categories:

Value Meaning
0 No disease (status is FALSE)
1 Prevalent — disease date on or before baseline
2 Incident — disease date strictly after baseline
NA Case with no recorded date; timing cannot be determined
library(ukbflow)

# Build on the derive pipeline from vignette("derive")
df <- ops_toy(n = 500)
df <- derive_missing(df)
df <- derive_covariate(df, as_factor = c("p31", "p20116_i0"))
df <- derive_selfreport(df, name = "dm", regex = "type 2 diabetes")
df <- derive_icd10(df, name = "dm", icd10 = "E11", source = c("hes", "death"))
df <- derive_case(df, name = "dm")
# Uses {name}_status and {name}_date by default
df <- derive_timing(df, name = "dm", baseline_col = "p53_i0")

Supply explicit column names when the defaults do not apply:

df <- derive_timing(df,
  name         = "dm",
  status_col   = "dm_status",
  date_col     = "dm_date",
  baseline_col = "p53_i0"
)

Call once per variable needed — for example, once for the combined case and once per individual source (HES, self-report, etc.).


Step 2: Age at Event

derive_age() computes age at the time of the event for cases, and returns NA for non-cases and cases without a date.

\[\text{age\_at\_event} = \text{age\_at\_recruitment} + \frac{\text{event\_date} - \text{baseline\_date}}{365.25}\]

The divisor 365.25 accounts for leap years, ensuring sub-monthly precision in age calculation across the full UKB follow-up window.

# Auto-detects {name}_date and {name}_status; produces age_at_{name} column.
df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022"
)

Supply explicit column mappings when names do not follow the default {name}_date / {name}_status pattern:

df <- derive_age(df,
  name         = "dm",
  baseline_col = "p53_i0",
  age_col      = "p21022",
  date_cols    = c(dm = "dm_date"),
  status_cols  = c(dm = "dm_status")
)

Step 3: Follow-Up Time

derive_followup() computes the follow-up end date as the earliest of:

  1. The outcome event date (if the participant is a case)
  2. Date of death (field 40000; competing event)
  3. Date lost to follow-up (field 191)
  4. The administrative censoring date

Follow-up time in years is then derived from the baseline date.

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),   # set to your study's cut-off date
  death_col    = "p40000_i0",
  lost_col     = FALSE                    # not available in ops_toy
)

Output columns:

Column Type Description
dm_followup_end IDate Earliest competing date
dm_followup_years numeric Years from baseline to end

Prevalent cases receive NA follow-up time

Participants whose event date falls before or on the baseline date (prevalent cases, {name}_timing == 1) will have followup_years set to NA rather than a zero or negative value, which has no meaning in time-to-event analysis. Use derive_timing() to identify and exclude prevalent cases before fitting a Cox model (see the full pipeline example below).

Auto-detection of death and lost-to-follow-up columns

When death_col and lost_col are NULL (default), derive_followup() looks them up automatically from the field cache (UKB fields 40000 and 191). Pass FALSE to explicitly disable a competing event:

df <- derive_followup(df,
  name         = "dm",
  event_col    = "dm_date",
  baseline_col = "p53_i0",
  censor_date  = as.Date("2022-10-31"),
  death_col    = FALSE,
  lost_col     = FALSE
)

Full Survival-Ready Pipeline

After completing all three steps, the data contains everything needed to fit a Cox proportional hazards model:

library(survival)

# Incident analysis: exclude prevalent cases and those with undetermined timing
df_incident <- df[dm_timing != 1L]

fit <- coxph(
  Surv(dm_followup_years, dm_status) ~
    p20116_i0 + p21022 + p31 + p1558_i0,
  data = df_incident
)
summary(fit)

Column roles in the model:

Column Role
dm_status Event indicator (logical)
dm_followup_years Time variable
dm_timing Filter: exclude prevalent (== 1)
age_at_dm Age at diagnosis (descriptive / secondary analysis)
p20116_i0 Exposure of interest (smoking status)

Getting Help

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.