Extracting Phenotype Data

Overview

UKB phenotype data is stored in a proprietary .dataset format on the RAP and cannot be read directly. The extract_* functions provide R interfaces for discovering approved fields and extracting phenotype data via the DNAnexus dx extract_dataset and table-exporter tools.

Two workflows are available:

| Function | Mode | Scale | Output |
| --- | --- | --- | --- |
| extract_batch() | Async job | Large / production (typically 50+ fields) | job ID → CSV on RAP cloud |
| extract_pheno() | Synchronous | Small (quick checks) | data.table in memory |

extract_batch() is the recommended approach for any serious analysis. extract_pheno() is provided for quick interactive inspection inside the RAP environment only.


Prerequisites

Ensure you are authenticated and have selected your project:

library(ukbflow)

auth_login()
auth_select_project("project-XXXXXXXXXXXX")

Step 1: Browse Available Fields

Before extracting, use extract_ls() to explore what fields are approved for your project:

# List all approved fields (cached after first call)
extract_ls()

# Search by keyword
extract_ls(pattern = "cancer")
extract_ls(pattern = "p31|p53|p21022")

# Force refresh after switching projects or datasets
extract_ls(refresh = TRUE)

The result is a data.frame with two columns:

| Column | Example |
| --- | --- |
| field_name | participant.p53_i0 |
| title | Date of attending assessment centre \| Instance 0 |

Fields reflect your project’s approved data only — not all UKB fields are present.
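
A pattern such as "p31|p53|p21022" is an ordinary regular expression, so if it is matched unanchored it can also pick up longer field IDs that begin with the same digits. The sketch below uses base R on a small mock data.frame that stands in for the result of extract_ls(); the rows are illustrative, and how extract_ls() applies pattern internally is not assumed here. It shows one way to post-filter the listing down to exact field IDs.

```r
# Mock stand-in for the data.frame returned by extract_ls() (illustrative rows only)
fields <- data.frame(
  field_name = c("participant.eid", "participant.p31", "participant.p53_i0",
                 "participant.p3140_i0", "participant.p21022"),
  title = c("Participant ID", "Sex",
            "Date of attending assessment centre | Instance 0",
            "Pregnant | Instance 0", "Age at recruitment"),
  stringsAsFactors = FALSE
)

ids <- c(31, 53, 21022)

# Unanchored alternation: "p31" also matches "p3140"
loose_pattern <- paste0("p", ids, collapse = "|")
loose <- fields[grepl(loose_pattern, fields$field_name), ]

# Anchored: a literal "." before the ID, then "_" or end of string after it
exact_pattern <- paste0("\\.p(", paste(ids, collapse = "|"), ")(_|$)")
exact <- fields[grepl(exact_pattern, fields$field_name), ]

print(loose$field_name)
print(exact$field_name)
```

The anchored version keeps participant.p31, participant.p53_i0 and participant.p21022 and drops participant.p3140_i0, which the loose pattern lets through.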


Step 2: Extract Data

Quick inspection: extract_pheno()

For small-scale interactive checks inside the RAP RStudio environment:

df <- extract_pheno(c(31, 53, 21022))

extract_pheno() is restricted to the RAP environment and returns data in memory only. For any analysis intended to be saved or reproduced, use extract_batch().

Note: extract_pheno() returns raw coded values (e.g. 1/0 for Sex, numeric codes for diseases). Use the decode_* series to convert codes to human-readable labels.
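
To make "raw coded values" concrete, here is a minimal base R sketch of what decoding a single field amounts to. It does not use the package's decode_* functions, whose signatures are not shown in this section; the data.frame is mock data, and the lookup assumes the standard UKB coding for Sex (0 = Female, 1 = Male).

```r
# Mock stand-in for a small extract_pheno() result (values are made up)
df <- data.frame(
  participant.eid = c(1000001, 1000002, 1000003),
  participant.p31 = c(0, 1, 1)
)

# Lookup table from raw code to label (assumed coding: 0 = Female, 1 = Male)
sex_labels <- c("0" = "Female", "1" = "Male")

# Index the lookup by the code as a character key; unname() drops the key names
df$sex <- unname(sex_labels[as.character(df$participant.p31)])

print(df)
```

For real analyses the decode_* functions remain the intended route, since they cover the full set of codings rather than one hand-written lookup.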


A Note on Column Names

Column naming differs between the two extraction methods:

extract_batch() (no prefix):

| Column | Meaning |
| --- | --- |
| eid | Participant ID |
| p31 | Field 31 (Sex) |
| p53_i0 | Field 53, Instance 0 |
| p20002_i0_a0 | Field 20002, Instance 0, Array 0 |

extract_pheno() (participant. prefix):

| Column | Meaning |
| --- | --- |
| participant.eid | Participant ID |
| participant.p31 | Field 31 (Sex) |
| participant.p53_i0 | Field 53, Instance 0 |
| participant.p20002_i0_a0 | Field 20002, Instance 0, Array 0 |
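
Because the two naming schemes differ only by the participant. prefix, code written against one can be reused on the other after a simple rename. The following base R sketch strips the prefix and then splits each name into field, instance and array index. parse_ukb_column() is a hypothetical helper written for this illustration, not part of ukbflow.

```r
# Column names as returned by extract_pheno()
pheno_names <- c("participant.eid", "participant.p31",
                 "participant.p53_i0", "participant.p20002_i0_a0")

# Drop the prefix to get extract_batch()-style names
batch_style <- sub("^participant\\.", "", pheno_names)

# Hypothetical helper: split "p<field>[_i<instance>][_a<array>]" into parts.
# Names that do not follow this shape (e.g. "eid") get NA in every part.
parse_ukb_column <- function(x) {
  is_field <- grepl("^p[0-9]+(_i[0-9]+)?(_a[0-9]+)?$", x)
  pick <- function(pattern) {
    out <- rep(NA_integer_, length(x))
    hit <- is_field & grepl(pattern, x)
    out[hit] <- as.integer(sub(paste0("^.*", pattern, ".*$"), "\\1", x[hit]))
    out
  }
  data.frame(
    column   = x,
    field    = pick("p([0-9]+)"),
    instance = pick("_i([0-9]+)"),
    array    = pick("_a([0-9]+)"),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_ukb_column(batch_style)
print(parsed)
```

On a real extract_pheno() result the same rename would be applied to the table's column names, for example with names(df) <- sub("^participant\\.", "", names(df)).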

Getting Help
