Tutorial: GENIE BPC Data Structure

AACR Project GENIE

AACR Project GENIE is a publicly accessible international cancer registry of real-world data assembled through data sharing between 19 of the leading cancer centers in the world.
GENIE aggregates, harmonizes, and links clinical-grade, next-generation cancer genomic sequencing data with clinical outcomes obtained during routine medical practice from cancer patients treated at these institutions. The clinical information available in the registry is limited, however (fewer than about 10 variables).

GENIE BPC

The goal of Project GENIE BPC (Biopharma Collaborative) is to augment the existing registry genomic data with enhanced clinical (phenomic) data to support clinical-genomics analyses.
- Phenomic data are curated using the PRISSMM curation model.
- Currently, the participating institutions are MSK, Dana-Farber, Vanderbilt, UCSF, and University Health Network. GENIE BPC will be expanded to more institutions in phase II.
GENIE BPC sponsors seven projects that involve augmenting the genomic data in Project GENIE to include PRISSMM phenomic data and cancer-related outcomes. Each project encompasses a single cancer site. While all projects are based on curating data under the PRISSMM phenomic data curation model, the variables available for each project may vary. The seven projects are:
1. Non-small cell lung cancer (NSCLC)
2. Colon/rectal cancer (CRC)
3. Breast cancer (BrCa)
4. Pancreas cancer (PANC)
5. Bladder cancer (BLADDER)
6. Prostate cancer (Prostate)
7. Renal cancer (RENAL)

GENIE BPC Elements

GENIE BPC comprises several phenomic (clinical) elements that include data from the AACR Project GENIE Tier 1 registry, institutions’ tumor registries, manual curation based on institutions’ electronic health records (PRISSMM phenomic data model), and derived variables based on these data elements. The elements are:

Patient Characteristics
Cancer Diagnosis
1. Cancer Diagnosis: BPC Project Cancer
2. Cancer Diagnosis: Non-BPC Project Cancer
Cancer-Directed Drugs
Cancer-Directed Radiation Therapy
PRISSMM Pathology
PRISSMM Imaging
PRISSMM Medical Oncologist Assessments
PRISSMM Tumor Marker
Cancer Panel Test

This vignette will describe the structure of each element (dataset).

Patient Characteristics

The Patient Characteristics dataset is structured as one record per patient. This dataset includes variables such as date of death, last known alive date, and the numbers of pathology reports, imaging reports, and medical oncologist assessments, along with many other demographic and vital status variables.

The Patient Characteristics dataset can be linked to all other datasets using cohort and record_id.
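
As a minimal sketch of this linkage, the base R example below joins a toy Patient Characteristics table to a toy PRISSMM Imaging table on cohort and record_id. All data values, the vital_status column, and the scan_number column are invented for illustration; only the two key names come from the description above.

```r
# Toy stand-ins for two GENIE BPC datasets; every value here is invented.
pt_char <- data.frame(
  cohort       = c("NSCLC", "NSCLC", "CRC"),
  record_id    = c("P-001", "P-002", "P-003"),
  vital_status = c("Alive", "Dead", "Alive"),  # hypothetical column
  stringsAsFactors = FALSE
)
imaging <- data.frame(
  cohort      = c("NSCLC", "NSCLC", "NSCLC", "CRC"),
  record_id   = c("P-001", "P-001", "P-002", "P-003"),
  scan_number = c(1, 2, 1, 1),                 # hypothetical column
  stringsAsFactors = FALSE
)

# Join on the two linking variables named in the text.
pt_imaging <- merge(pt_char, imaging, by = c("cohort", "record_id"))

# One patient record fans out to one row per imaging report.
nrow(pt_imaging)
```

Because the Patient Characteristics dataset has one record per patient, the merged table keeps the granularity of the other dataset (here, one row per imaging report).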

Cancer Diagnosis

Two Cancer Diagnosis datasets are provided: the BPC Project Cancer Diagnosis dataset and the Non-BPC Project Cancer Diagnosis dataset.

BPC Project Cancer	Non-BPC Project Cancer
The cancer that met the eligibility criteria for the project and was selected at random for PRISSMM phenomic data curation.	A diagnosis of the same or different cancer type as the project cancer that occurs prior to, simultaneous with, or after the BPC Project cancer.
Has genomic sequencing associated with the cancer diagnosis.	Does not have associated genomic sequencing.
Recorded in the BPC Project Cancer Diagnosis dataset.	Recorded in the Non-BPC Project Cancer Diagnosis dataset.

Cancer Diagnosis: BPC Project Cancer

The BPC Project Cancer Diagnosis dataset contains one record per BPC Project cancer diagnosis, per patient.

The BPC Project Cancer Diagnosis dataset can be linked to the following datasets:

- Cancer-Directed Regimen dataset using the variables cohort, record_id and ca_seq
- Cancer Panel Test dataset using the variables cohort, record_id and ca_seq
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using cohort and record_id
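
The role of ca_seq in diagnosis-level links can be illustrated with a small base R sketch. The tables and the regimen_number column below are invented; the example only assumes that both datasets carry cohort, record_id and ca_seq as described above.

```r
# Toy BPC Project Cancer Diagnosis table: one record per diagnosis.
ca_dx <- data.frame(
  cohort    = c("NSCLC", "NSCLC"),
  record_id = c("P-001", "P-002"),
  ca_seq    = c(0, 1),
  stringsAsFactors = FALSE
)
# Toy Cancer-Directed Regimen table: P-002 also has a regimen tied to a
# different cancer (ca_seq = 2) that is not in the BPC diagnosis table.
regimen <- data.frame(
  cohort         = c("NSCLC", "NSCLC", "NSCLC", "NSCLC"),
  record_id      = c("P-001", "P-001", "P-002", "P-002"),
  ca_seq         = c(0, 0, 1, 2),
  regimen_number = c(1, 2, 1, 1),  # hypothetical column
  stringsAsFactors = FALSE
)

# Linking on all three keys keeps only regimens for the matching diagnosis.
dx_regimen <- merge(ca_dx, regimen, by = c("cohort", "record_id", "ca_seq"))

# Linking on patient keys alone would also attach the ca_seq = 2 regimen.
patient_level <- merge(ca_dx, regimen, by = c("cohort", "record_id"))

c(diagnosis_level = nrow(dx_regimen), patient_level = nrow(patient_level))
```

In this toy data the three-key join returns three rows and the two-key join returns four, which is why ca_seq matters whenever a record should be tied to a specific diagnosis rather than to the patient.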

Cancer Diagnosis: Non-BPC Project Cancer

The Non-BPC Project Cancer Diagnosis dataset contains one record per non-BPC Project cancer diagnosis, per patient. This dataset includes two types of cancer diagnoses: 1) non-BPC Project invasive cancer and in situ/non-invasive cancer diagnoses, and 2) other tumors.

The Non-BPC Project Cancer Diagnosis dataset can be linked to the following datasets:

- Cancer-Directed Regimen dataset using the variables cohort, record_id and ca_seq
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using cohort and record_id
- Cannot be linked to the Cancer Panel Test dataset because non-BPC Project cancer diagnoses were not genomically sequenced

Cancer-Directed Regimen Dataset

The Cancer-Directed Regimen dataset is structured as one record per regimen-associated cancer diagnosis, per patient.

Cancer-directed regimens were curated for all cancer diagnoses, including both BPC Project and non-BPC Project cancers. A regimen can consist of one drug or up to five drugs given together. Cancer-directed drugs include anti-neoplastic drugs, immunotherapies, targeted therapies, and hormone therapies. A break in treatment of ≥8 weeks was used to indicate the end of a regimen; even if all drugs in the regimen were re-initiated 8+ weeks later, this was considered a new regimen.
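
The ≥8-week rule can be illustrated with a short base R sketch. This is not the curators' procedure (regimens were curated manually); it only shows how a gap threshold of 56 days partitions a sequence of treatment days. The admin_day values are invented.

```r
# Hypothetical days (from diagnosis) on which a drug was administered.
admin_day <- c(0, 21, 42, 120, 141)

# A gap of 8 weeks (56 days) or more since the previous administration
# marks the start of a new regimen.
gap_threshold <- 8 * 7
new_regimen   <- c(TRUE, diff(admin_day) >= gap_threshold)

# Running count of regimen starts gives a regimen index per administration.
regimen_id <- cumsum(new_regimen)
regimen_id
```

Here the 78-day gap between day 42 and day 120 starts a second regimen, even though the same drug is resumed.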

If the cancer-directed drug was part of an investigational drug trial, the drug name variables (drugs_drug_1 through drugs_drug_5) are set to “Investigational Drug” and the end date interval matches the start date interval. Identification of investigational drugs varies by institution depending on contractual obligations. For some institutions, all drugs that are part of an investigational trial are required to be masked, even if they are standard of care. For other institutions, only the investigational drug(s) are required to be masked.
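
A sketch of how masked regimens might be flagged in base R follows. The rows and drug names are invented, and the example assumes the masked value is stored exactly as the string "Investigational Drug" in the drugs_drug_1 through drugs_drug_5 columns; the stored spelling should be checked against the actual data.

```r
# Toy regimen table with only two of the five drug columns filled in.
regimen <- data.frame(
  record_id    = c("P-001", "P-002", "P-003"),
  drugs_drug_1 = c("Carboplatin", "Investigational Drug", "Pembrolizumab"),
  drugs_drug_2 = c("Pemetrexed", NA, "Investigational Drug"),
  stringsAsFactors = FALSE
)

# Pick up whichever of drugs_drug_1 ... drugs_drug_5 are present.
drug_cols <- grep("^drugs_drug_[1-5]$", names(regimen), value = TRUE)

# Flag a regimen if any of its drug slots carries the masked label.
# (Assumption: the label is stored exactly as "Investigational Drug".)
is_masked <- regimen[drug_cols] == "Investigational Drug"
regimen$any_investigational <- unname(rowSums(is_masked, na.rm = TRUE) > 0)

regimen[, c("record_id", "any_investigational")]
```

Flagging at the regimen level is a deliberate choice here: because some institutions mask every drug in a trial regimen and others mask only the investigational agent, a per-drug interpretation of the label would not be comparable across institutions.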

The Cancer-Directed Regimen dataset can be linked to the following datasets:

- BPC Project and non-BPC Project Cancer Diagnosis datasets using the variables cohort, record_id and ca_seq
- Cancer Panel Test dataset using the variables cohort, record_id and ca_seq
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using cohort and record_id

Cancer-Directed Radiation Therapy Dataset

The Cancer-Directed Radiation Therapy dataset is structured as one record per radiation therapy-associated cancer diagnosis, per patient. For example, if radiation therapy is associated with a single cancer diagnosis, there will be one corresponding record in this dataset. If radiation therapy is associated with two cancer diagnoses, then there will be two corresponding records in this dataset: one for the first associated cancer diagnosis and another for the second associated cancer diagnosis. If it is unknown which cancer diagnosis the radiation therapy is associated with, there will still be one record in this dataset.
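
The record structure described above means that rows do not equal radiation courses. The base R sketch below uses an invented table with a hypothetical rt_course identifier, and it assumes that an unknown associated diagnosis appears as a missing ca_seq; neither detail is confirmed by the data dictionary.

```r
# Toy radiation therapy table: P-002's single course is associated with
# two cancer diagnoses, so it appears twice. P-003's association is unknown.
rt <- data.frame(
  record_id = c("P-001", "P-002", "P-002", "P-003"),
  ca_seq    = c(0, 0, 1, NA),   # NA for unknown is an assumption
  rt_course = c(1, 1, 1, 1),    # hypothetical course identifier
  stringsAsFactors = FALSE
)

# Number of records: one per course-diagnosis association.
n_records <- nrow(rt)

# Number of distinct courses: de-duplicate on patient and course.
n_courses <- nrow(unique(rt[c("record_id", "rt_course")]))

c(records = n_records, courses = n_courses)
```

Counting treatments per patient therefore calls for de-duplication on a course-level identifier before summarizing.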

Radiation therapy was curated beginning within 30 days of the first BPC Project cancer diagnosis. All subsequent radiation therapies were recorded. Additional radiation therapy corresponding to non-BPC Project Cancers may be recorded.

Radiation therapy data is available for the following cohorts: PANC, Prostate, BLADDER, RENAL.

This dataset can be linked to the following datasets:

- BPC Project Cancer Diagnosis, Non-BPC Project Cancer Diagnosis, Cancer-Directed Regimen, and Cancer Panel Test datasets using the variables cohort, record_id and ca_seq
- Patient Characteristics, PRISSMM Pathology, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using cohort and record_id

PRISSMM Pathology Dataset

The PRISSMM Pathology dataset is structured as one record per pathology report, per patient.

All pathology reports beginning with the month and year of the first BPC Project cancer diagnosis and all subsequent pathology reports are recorded (including pathology reports corresponding to non-BPC Project cancer and subsequent BPC Project cancer diagnoses).

The PRISSMM Pathology dataset can be linked to the following datasets:

- Cancer Panel Test dataset using cohort, record_id, ca_seq, path_proc_number and path_report_number
- Patient Characteristics, BPC Project and Non-BPC Project Cancer Diagnosis, Cancer-Directed Regimen, PRISSMM Imaging, and PRISSMM Medical Oncologist Assessment datasets using cohort and record_id
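
The five-variable link to the Cancer Panel Test dataset can be sketched in base R as follows. Both tables are invented, cpt_number is a hypothetical column, and the example assumes that all five key variables are present in both datasets, as the linkage description implies.

```r
keys <- c("cohort", "record_id", "ca_seq",
          "path_proc_number", "path_report_number")

# Toy pathology table: two reports for each of two patients.
pathology <- data.frame(
  cohort             = "NSCLC",
  record_id          = c("P-001", "P-001", "P-002", "P-002"),
  ca_seq             = 0,
  path_proc_number   = c(1, 2, 1, 2),
  path_report_number = 1,
  stringsAsFactors = FALSE
)
# Toy panel test table: each test points at the report for its specimen.
cpt <- data.frame(
  cohort             = "NSCLC",
  record_id          = c("P-001", "P-002"),
  ca_seq             = 0,
  path_proc_number   = c(1, 2),
  path_report_number = 1,
  cpt_number         = 1,        # hypothetical column
  stringsAsFactors = FALSE
)

# Left join keeps every pathology report; reports without a sequenced
# specimen get NA in the panel test columns.
path_cpt <- merge(pathology, cpt, by = keys, all.x = TRUE)

sum(!is.na(path_cpt$cpt_number))  # reports linked to a panel test
```

A left join (all.x = TRUE) is used so that pathology reports without a sequenced specimen remain in the result rather than being silently dropped.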

PRISSMM Imaging Dataset

The PRISSMM Imaging dataset is structured as one record per imaging report, per patient.

All imaging reports beginning with the month and year of the first BPC Project cancer diagnosis and all subsequent imaging reports are recorded (including imaging reports corresponding to non-BPC Project cancer and subsequent BPC Project cancer diagnoses).

The PRISSMM Imaging dataset can be linked to all datasets using the variable record_id.

PRISSMM Medical Oncologist Assessment Dataset

The PRISSMM Medical Oncologist Assessment dataset is structured as one record per curated medical oncologist assessment, per patient.

Medical oncologist assessments were curated beginning with the month and year of the first BPC Project cancer diagnosis. One medical oncologist assessment per month was curated.

The PRISSMM Medical Oncologist Assessment dataset can be linked to all datasets using the variable record_id.

PRISSMM Tumor Marker Dataset

The PRISSMM Tumor Marker dataset is structured as one record per curated tumor marker result, per patient. All serum-based tumor markers that are related to the diagnosis/prognosis of cancer were curated.

Tumor marker data is available for the following cohorts: CRC, BrCa, PANC, Prostate.

Note: variables pertaining to PD-L1, MSI and MMR are recorded in the pathology dataset.

The PRISSMM Tumor Marker dataset can be linked to all datasets using the variable record_id.

Cancer Panel Test Dataset

The Cancer Panel Test dataset is structured as one record per cancer panel test and its associated cancer diagnosis, per patient.

The cancer panel test refers to the multi-gene panels that have been performed through next generation sequencing (NGS) assays. The terms “cancer panel test (CPT)” and “next generation sequencing (NGS)” are used interchangeably.

The Cancer Panel Test dataset can be linked to the following datasets:

- BPC Project Cancer Diagnosis dataset using the variables cohort, record_id and ca_seq
- Cancer-Directed Regimen dataset using the variables cohort, record_id and ca_seq
- PRISSMM Pathology dataset using cohort, record_id, ca_seq, path_proc_number and path_report_number

Note: this dataset cannot be linked to the non-BPC Project Cancer Diagnosis dataset because non-BPC Project cancer diagnoses were not genomically sequenced.

Differences Between Synapse and cBioPortal Genomic Data

Please note that pulling genomic GENIE data from Synapse using pull_data_synapse() and pulling GENIE data from cBioPortal may result in small differences in the data due to systematic differences in the processing pipelines employed by Synapse and cBioPortal. These differences may include:

- Data formatting - Some datasets (e.g. CNA files) may appear in wide format in Synapse data versus long format in cBioPortal data, or column attributes and names may appear slightly different (e.g. fusions files).
- Default filtering - By default, cBioPortal filters out mutations classified as Silent, Intron, IGR, 3’UTR, 5’UTR, 3’Flank and 5’Flank, except for promoter mutations of the TERT gene. See the cBioPortal documentation for more details. These mutations are retained in Synapse processing pipelines.
- Hugo Symbols - Some genes have more than one accepted Hugo Symbol and may be referred to differently between data sources (e.g. NSD3 is an alias for WHSC1L1). Some tools exist to help you resolve gene aliases across genomic data sources. See gnomeR::recode_alias(), cbioportalR::get_alias() and the vignettes from the {gnomeR} and {cbioportalR} packages for more information on how to use these functions and work with gene aliases.
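
To make the default-filtering difference concrete, the base R sketch below applies a cBioPortal-style filter to a toy mutation table. It is an approximation under stated assumptions, not cBioPortal's actual implementation: the table uses MAF-style column names (Hugo_Symbol, Variant_Classification), the rows are invented, and TERT promoter mutations are assumed to be the TERT rows classified as 5'Flank.

```r
# Toy mutation table in a MAF-like layout; all rows are invented.
mutations <- data.frame(
  Hugo_Symbol            = c("TP53", "KRAS", "TERT", "EGFR", "TERT"),
  Variant_Classification = c("Missense_Mutation", "Silent", "5'Flank",
                             "Intron", "Silent"),
  stringsAsFactors = FALSE
)

# Classifications listed in the text as filtered out by default.
excluded <- c("Silent", "Intron", "IGR", "3'UTR", "5'UTR",
              "3'Flank", "5'Flank")

# Assumption: TERT promoter mutations are the TERT rows labeled 5'Flank.
is_tert_promoter <- mutations$Hugo_Symbol == "TERT" &
  mutations$Variant_Classification == "5'Flank"

keep <- !(mutations$Variant_Classification %in% excluded) | is_tert_promoter
filtered <- mutations[keep, ]

filtered$Hugo_Symbol
```

Applying a filter like this to Synapse-derived mutation data is one way to bring the two sources closer together before comparing mutation counts.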