After disease case definitions have been derived (see
vignette("derive")), three additional functions prepare the
data for time-to-event analysis:
| Function | Output columns | Purpose |
|---|---|---|
| derive_timing() | {name}_timing | Classify prevalent vs. incident disease |
| derive_age() | age_at_{name} | Age at event (years) |
| derive_followup() | {name}_followup_end, {name}_followup_years | Follow-up end date and duration |
Prerequisite: {name}_status and {name}_date must already be present; both are produced by the pipeline in vignette("derive"). The examples below assume the full disease derivation pipeline has been run on an ops_toy() dataset, so the baseline date column is p53_i0 and age at recruitment is p21022.
derive_timing() compares the disease date to the UKB
baseline assessment date and assigns each participant to one of four
categories:
| Value | Meaning |
|---|---|
| 0 | No disease (status is FALSE) |
| 1 | Prevalent: disease date on or before baseline |
| 2 | Incident: disease date strictly after baseline |
| NA | Case with no recorded date; timing cannot be determined |
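The four-way rule in the table can be sketched in base R as follows. This is only an illustration of the logic, not ukbflow's implementation: classify_timing() and the toy data frame are hypothetical names introduced here.

```r
# Hypothetical helper (not part of ukbflow): applies the rule from the table.
classify_timing <- function(status, event_date, baseline_date) {
  timing <- rep(NA_integer_, length(status))      # default: undetermined
  has_date <- status & !is.na(event_date)         # cases with a usable date

  timing[which(!status)] <- 0L                                  # no disease
  timing[which(has_date & event_date <= baseline_date)] <- 1L   # prevalent
  timing[which(has_date & event_date >  baseline_date)] <- 2L   # incident
  timing
}

# Made-up participants covering every category
toy <- data.frame(
  status   = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  date     = as.Date(c(NA, "2005-03-01", "2010-06-15", "2012-01-20", NA)),
  baseline = as.Date(rep("2010-06-15", 5))
)
toy$timing <- classify_timing(toy$status, toy$date, toy$baseline)
toy$timing  # 0 1 1 2 NA
```

The third participant has a disease date equal to the baseline date and is therefore classed as prevalent, matching the "on or before" wording in the table.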
library(ukbflow)
# Build on the derive pipeline from vignette("derive")
df <- ops_toy(n = 500)
df <- derive_missing(df)
df <- derive_covariate(df, as_factor = c("p31", "p20116_i0"))
df <- derive_selfreport(df, name = "dm", regex = "type 2 diabetes")
df <- derive_icd10(df, name = "dm", icd10 = "E11", source = c("hes", "death"))
df <- derive_case(df, name = "dm")

# Uses {name}_status and {name}_date by default
df <- derive_timing(df, name = "dm", baseline_col = "p53_i0")

Supply explicit column names when the defaults do not apply:
df <- derive_timing(df,
name = "dm",
status_col = "dm_status",
date_col = "dm_date",
baseline_col = "p53_i0"
)

Call derive_timing() once for each timing variable you need, for example once for the combined case and once per individual source (HES, self-report, etc.).
derive_age() computes age at the time of the event for
cases, and returns NA for non-cases and cases without a
date.
\[\text{age\_at\_event} = \text{age\_at\_recruitment} + \frac{\text{event\_date} - \text{baseline\_date}}{365.25}\]
The divisor 365.25 is the mean length of a year in days once leap years are counted, so the elapsed time is expressed in years and the age calculation stays accurate to well within a month across the full UKB follow-up window.
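As a worked illustration of the formula, the base R snippet below computes the age at event for a single hypothetical participant. The age and dates are made-up values, and the snippet does not call derive_age(); it only reproduces the arithmetic.

```r
# Made-up inputs for a single hypothetical participant
age_at_recruitment <- 55                      # years, as stored in p21022
baseline_date      <- as.Date("2008-01-01")   # as stored in p53_i0
event_date         <- as.Date("2013-01-01")   # as stored in dm_date

# Subtracting two Date objects gives the elapsed time in days
days_elapsed <- as.numeric(event_date - baseline_date)

# Apply the formula: recruitment age plus elapsed days converted to years
age_at_event <- age_at_recruitment + days_elapsed / 365.25

days_elapsed            # 1827 (the interval contains two leap days)
round(age_at_event, 2)  # 60
```

Five calendar years correspond to 1827 days here, so the result is 60.002 rather than exactly 60; the difference is under one day.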
# Auto-detects {name}_date and {name}_status; produces age_at_{name} column.
df <- derive_age(df,
name = "dm",
baseline_col = "p53_i0",
age_col = "p21022"
)

Supply explicit column mappings when names do not follow the default
{name}_date / {name}_status pattern:
df <- derive_age(df,
name = "dm",
baseline_col = "p53_i0",
age_col = "p21022",
date_cols = c(dm = "dm_date"),
status_cols = c(dm = "dm_status")
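)

derive_followup() computes the follow-up end date as the earliest of the event date (event_col), the date of death (death_col), the date lost to follow-up (lost_col), and the administrative censoring date (censor_date).
Follow-up time in years is then the interval from the baseline date to that end date.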
df <- derive_followup(df,
name = "dm",
event_col = "dm_date",
baseline_col = "p53_i0",
censor_date = as.Date("2022-10-31"), # set to your study's cut-off date
death_col = "p40000_i0",
lost_col = FALSE # not available in ops_toy
)

Output columns:

| Column | Type | Description |
|---|---|---|
| dm_followup_end | IDate | Earliest competing date |
| dm_followup_years | numeric | Years from baseline to end |
NA follow-up time

Participants whose event date falls on or before the baseline
date (prevalent cases, {name}_timing == 1) have
{name}_followup_years set to NA rather than a zero or negative
value, which would have no meaning in time-to-event analysis.
Use derive_timing() to identify and exclude prevalent cases
before fitting a Cox model (see the full pipeline example below).
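The follow-up rule described above (earliest competing date, with NA duration for prevalent cases) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's code: followup_sketch() and the toy data are hypothetical, plain Date columns stand in for IDate, and the 365.25 divisor is borrowed from the age formula.

```r
# Hypothetical helper (not part of ukbflow): earliest competing date and duration.
followup_sketch <- function(baseline, event, death, lost, censor_date) {
  censor <- rep(censor_date, length(baseline))

  # Element-wise earliest date; missing dates are ignored
  end <- pmin(event, death, lost, censor, na.rm = TRUE)

  # Duration in years (divisor assumed to match the age formula)
  years <- as.numeric(end - baseline) / 365.25

  # Zero or negative durations (prevalent cases) become NA
  years[which(years <= 0)] <- NA_real_

  data.frame(followup_end = end, followup_years = years)
}

# Made-up participants: incident case, death, censored, prevalent case
toy <- data.frame(
  baseline = as.Date(rep("2010-01-01", 4)),
  event    = as.Date(c("2015-01-01", NA, NA, "2008-05-01")),
  death    = as.Date(c(NA, "2018-01-01", NA, NA)),
  lost     = rep(as.Date(NA), 4)
)
out <- followup_sketch(toy$baseline, toy$event, toy$death, toy$lost,
                       censor_date = as.Date("2022-10-31"))
out
```

In this toy data the fourth participant has an event date before baseline, so the duration is NA, while the third participant has no competing event and is followed until the censoring date.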
When death_col and lost_col are
NULL (default), derive_followup() looks them
up automatically from the field cache (UKB fields 40000 and 191). Pass
FALSE to explicitly disable a competing event:
df <- derive_followup(df,
name = "dm",
event_col = "dm_date",
baseline_col = "p53_i0",
censor_date = as.Date("2022-10-31"),
death_col = FALSE,
lost_col = FALSE
)

After completing all three steps, the data contains everything needed to fit a Cox proportional hazards model:
library(survival)
# Incident analysis: exclude prevalent cases and those with undetermined timing
df_incident <- df[dm_timing != 1L]
fit <- coxph(
Surv(dm_followup_years, dm_status) ~
p20116_i0 + p21022 + p31 + p1558_i0,
data = df_incident
)
summary(fit)

Column roles in the model:

| Column | Role |
|---|---|
| dm_status | Event indicator (logical) |
| dm_followup_years | Time variable |
| dm_timing | Filter: exclude prevalent (== 1) |
| age_at_dm | Age at diagnosis (descriptive / secondary analysis) |
| p20116_i0 | Exposure of interest (smoking status) |
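The incident filter in the pipeline, df[dm_timing != 1L], appears to rely on data.table subsetting, where rows whose condition evaluates to NA are dropped. On a plain data.frame the same comparison behaves differently. The sketch below uses a hypothetical toy data frame to show the difference and one way to write the filter explicitly; it assumes nothing about ukbflow beyond the dm_timing coding described earlier.

```r
# Made-up timing codes: 0 = no disease, 1 = prevalent, 2 = incident, NA = undetermined
toy <- data.frame(
  id        = 1:5,
  dm_timing = c(0L, 1L, 2L, NA, 0L)
)

# Naive data.frame subsetting: the NA comparison yields an all-NA row
naive <- toy[toy$dm_timing != 1L, ]
nrow(naive)       # 4, including one row filled with NA

# Explicit filter: keep only rows with known, non-prevalent timing
keep <- !is.na(toy$dm_timing) & toy$dm_timing != 1L
incident_set <- toy[keep, ]
incident_set$id   # 1 3 5
```

Writing the NA check explicitly keeps the exclusion of undetermined cases visible in the code, whichever data structure holds the data.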
?derive_timing, ?derive_age, ?derive_followup

vignette("derive"): disease phenotype derivation

vignette("decode"): decoding column names and values