The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

TCGAretriever is an R library aimed at downloading clinical and molecular data from cBioPortal (https://www.cbioportal.org/), including The Cancer Genome Atlas (TCGA) data (free-tier data). TCGA is a program aimed at improving our understanding of Cancer Biology. Several TCGA Datasets are available online. TCGAretriever helps accessing and downloading TCGA data using R via the cBioPortal API. Features of TCGAretriever are:

This tutorial describes few use cases to help getting started with TCGAretriever.

Basic Operations

In this section, we provide an overview of the basic operations and commonly used functions.


Connection to the cBioPortal Server

Before starting a TCGAretriever pipeline, it’s recommended to check whether the remote data server can be reached. Thos is achieved via the check_TCGAretriever_connection() function. This should return TRUE.

library(TCGAretriever)
library(reshape2)
library(ggplot2)

# Check connection
con_alive <- check_TCGAretriever_connection()
con_alive
## [1] TRUE

Cancer Study IDs

Here we retrieve the list of available studies and select all tcga_pub entries. Each of the values included in the studyId column is a study identifier and can be passed as csid argument to other functions such as get_genetic_profiles() or get_case_lists().

# If connection is alive, obtain a list of cancer studies from cBio
if (con_alive) {
  
  all_studies <- get_cancer_studies()

  # Find published TCGA datasets
  keep <- grepl("tcga_pub$", all_studies[,'studyId'])
  tcga_studies <- all_studies[keep, ]

  # Show results
  utils::head(tcga_studies[, c(11, 1, 4)])
  
} else {
  
  print('The server cannot be reached. All code below won\'t be executed.')

}
##     resourceCounts                                             name     pmid
## 13            NULL      Kidney Chromophobe (TCGA, Cancer Cell 2014) 25155756
## 81            NULL     Adult Soft Tissue Sarcomas (TCGA, Cell 2017) 29100075
## 83            NULL Bladder Urothelial Carcinoma (TCGA, Nature 2014) 24476821
## 89            NULL    Papillary Thyroid Carcinoma (TCGA, Cell 2014) 25417114
## 109           NULL    Colorectal Adenocarcinoma (TCGA, Nature 2012) 22810696
## 137           NULL Lung Squamous Cell Carcinoma (TCGA, Nature 2012) 22960745

Genetic Profiles (Assays) and Case Lists

It is possible to retrieve genetic profiles and case lists associated to each study of interest. Genetic profiles indicate what kind of data are available for a certain study (e.g., RNA-seq, micro-array, DNA mutation…). Typically, a data type of interest may only be available for a fraction of patients in the cohort. Therefore, it is also important to obtain the identifier of the corresponding case list of interest.

# Define the cancer study id: brca_tcga_pub
my_csid <- "brca_tcga_pub"

# Obtain genetic profiles
brca_pro <- get_genetic_profiles(csid = my_csid)
utils::head(brca_pro[, c(7, 1)])
##                   molecularProfileId molecularAlterationType
## 1               brca_tcga_pub_gistic  COPY_NUMBER_ALTERATION
## 2           brca_tcga_pub_linear_CNA  COPY_NUMBER_ALTERATION
## 3     brca_tcga_pub_methylation_hm27             METHYLATION
## 4                brca_tcga_pub_mirna         MRNA_EXPRESSION
## 5 brca_tcga_pub_mirna_median_Zscores         MRNA_EXPRESSION
## 6                 brca_tcga_pub_mrna         MRNA_EXPRESSION

# Obtain cases 
brca_cas <- get_case_lists(csid = my_csid)
utils::head(brca_cas[, c(4, 1)])
##             sampleListId                             category
## 1      brca_tcga_pub_all                   all_cases_in_study
## 2    brca_tcga_pub_basal                                other
## 3  brca_tcga_pub_claudin                                other
## 4      brca_tcga_pub_cna              all_cases_with_cna_data
## 5   brca_tcga_pub_cnaseq all_cases_with_mutation_and_cna_data
## 6 brca_tcga_pub_complete                                other

Retrieve Genomic Data

Once we identified a genetic profile of interest and a case list of interest, we can obtain clinical data via the TCGAretriever::get_clinical_data(). Additionally, molecular data can be obtained via the TCGAretriever::get_molecular_data() function (non-mutation data) or the get_mutation_data() function (mutation data).

# Define a set of genes of interest
q_csid <- 'brca_tcga_pub'
q_genes <- c("TP53", "HMGA1", "E2F1", "EZH2")
q_cases <- "brca_tcga_pub_complete"
rna_prf <- "brca_tcga_pub_mrna"
mut_prf <- "brca_tcga_pub_mutations"

# Download Clinical Data
brca_cli <- get_clinical_data(csid = q_csid, case_list_id = q_cases)

# Download RNA
brca_RNA <- get_molecular_data(case_list_id = q_cases, 
                               gprofile_id = rna_prf, 
                               glist = q_genes)
  • NOTE 1: the resulting data.frame includes ENTREZ_GENE_IDs and OFFICIAL_SMBOLs as first and second column. The third column indicates the gene/assay type.

  • NOTE 2: up to n=500 genes can be queried via the get_molecular_data() function.


# Show results
brca_RNA[, 1:5]
##   entrezGeneId hugoGeneSymbol           type TCGA-A1-A0SD-01 TCGA-A1-A0SE-01
## 1         1869           E2F1 protein-coding         -1.4535         -1.3525
## 2         2146           EZH2 protein-coding         -1.7928         -1.5652
## 3         3159          HMGA1 protein-coding         -2.5195         -1.9663
## 4         7157           TP53 protein-coding         -1.0107         -0.3385
# Set SYMBOLs as rownames
# Note that you may prefer to use the tibble package for this
rownames(brca_RNA) <- brca_RNA$hugoGeneSymbol
brca_RNA <- brca_RNA[, -c(1, 2, 3)]

# Round numeric vals to 3 decimals
for (i in 1:ncol(brca_RNA)) {
  brca_RNA[, i] <- round(brca_RNA[, i], digits = 3)
}

# Download mutations
brca_MUT <- get_mutation_data(case_list_id = q_cases, 
                              gprofile_id = mut_prf, 
                              glist = q_genes)

# Identify Samples carrying a TP53 missense mutation
tp53_mis_keep <- brca_MUT$hugoGeneSymbol == 'TP53' &
  brca_MUT$mutationType == 'Missense_Mutation' &
  !is.na(brca_MUT$sampleId)
tp53_mut_samples <- unique(brca_MUT$sampleId[tp53_mis_keep])

# Show results
keep_cols <- c('sampleId', 'hugoGeneSymbol', 'mutationType',  'proteinChange')
utils:::head(brca_MUT[, keep_cols])
##          sampleId hugoGeneSymbol      mutationType proteinChange
## 1 TCGA-A1-A0SK-01           TP53 Missense_Mutation         M133K
## 2 TCGA-A1-A0SO-01           TP53 Missense_Mutation         Y220C
## 3 TCGA-A2-A04P-01           TP53   Frame_Shift_Del    G108Vfs*15
## 4 TCGA-A2-A04R-01           E2F1 Missense_Mutation         Y168C
## 5 TCGA-A2-A04W-01           TP53 Missense_Mutation         V272M
## 6 TCGA-A2-A0CL-01           TP53 Missense_Mutation         R273L

NOTE: Mutation data are returned in a different format (molten data format) compared to data corresponding to other genetic profiles. Results are formatted as molten data.frames where each row is a DNA variant. Each sample / case may have 0 or more variants for each of the query genes.


Retrieve Data for all available genes

It is possible to download data corresponding to all available genes using the fetch_all_tcgadata() function. This function takes the same arguments as the get_profile_data() function but the glist argument. Gene lists are obtained automatically via the get_gene_identifiers() function. Each job is automatically split into multiple batches. Results are automatically aggregated before being returned. If mutation data are being queried, results are returned as a molten data.frame. The mutations argument is used to guide the query type as well as the output format.

# Download all brca_pub mutation data (complete samples)
all_brca_MUT <- fetch_all_tcgadata(case_list_id = q_cases, 
                                   gprofile_id = mut_prf, 
                                   mutations = TRUE)
  
# Download all brca_pub RNA expression data (complete samples)
all_brca_RNA <- fetch_all_tcgadata(case_list_id = q_cases, 
                                   gprofile_id = rna_prf, 
                                   mutations = FALSE)

—–

Examples and visualizations

In this section, we discuss simple use case where TCGAretriever is used to study the relationship between gene expression and mutation status of some genes of interest in breast cancer tumors.

Relationship between E2F1 and EZH2 in BRCA

Here we use the data retrieved before to analyze the relationship between E2F1 and EZH2 expression in TCGA breast cancer samples.

# Visualize the correlation between EZH2 and E2F1
df <- data.frame(sample_id = colnames(brca_RNA), 
                 EZH2 = as.numeric(brca_RNA['EZH2', ]), 
                 E2F1 = as.numeric(brca_RNA['E2F1', ]), 
                 stringsAsFactors = FALSE)

ggplot(df, aes(x = EZH2, y = E2F1)) +
  geom_point(color = 'gray60', size = 0.75) +
  theme_bw() +
  geom_smooth(method = 'lm', color = 'red2', 
              size=0.3, fill = 'gray85') +
  ggtitle('E2F1-EZH2 correlation in BRCA') + 
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot - showing E2F1 RNA expression with respect to EZH2 RNA expression across TCGA breast cancer samples.


E2F1 Expression vs. Binned EZH2 Expression in BRCA

This is an alternative visualization that illustrates the same relationship between E2F1 and EZH2 expression. Here, we make use of the make_groups() function to discretize the numeric vector corresponding to EZH2 expression counts. This function takes a numeric vector as its num_vector argument (e.g., a gene expression vector), performs binning, and returns the results as a data.frame. Here we show the distribution of E2F1 expression counts with respect to binned EZH2 expression.

# Bin samples according to EZH2 expression
EZH2_bins <- make_groups(num_vector = df$EZH2, groups = 5) 
utils::head(EZH2_bins, 12)
##     value rank
## 1  -1.793    2
## 2  -1.565    2
## 3  -2.495    1
## 4  -2.136    1
## 5   0.529    5
## 6  -1.439    3
## 7  -0.426    5
## 8  -1.165    4
## 9  -0.894    4
## 10 -0.998    4
## 11 -1.132    4
## 12 -1.483    3
# attach bin to df
df$EZH2_bin <- EZH2_bins$rank

# build Boxplot
ggplot(df, aes(x = as.factor(EZH2_bin), y = E2F1)) +
  geom_boxplot(outlier.shape = NA, fill = '#fed976') +
  geom_jitter(width = 0.1, size = 1) +
  theme_bw() +
  xlab('EZH2 Bins') +
  ggtitle('E2F1 Expression vs. Binned EZH2 Expression') +
  theme(plot.title = element_text(face = 'bold', hjust = 0.5))

Boxplot - showing E2F1 RNA expression with respect to binned EZH2 RNA expression across TCGA breast cancer samples.


Relationship between HMGA1 and TP53 with respect to TP53 mutation status in BRCA

We use the data retrieved before to analyze the relationship between HMGA1 and TP53 expression with respect to the TP53 WT status in TCGA breast cancer samples.

# Coerce to data.frame with numeric features 
mol_df <- data.frame(sample_id = colnames(brca_RNA), 
                     HMGA1 = as.numeric(brca_RNA['HMGA1', ]),
                     TP53 = as.numeric(brca_RNA['TP53', ]),
                     stringsAsFactors = FALSE)

mol_df$TP53.status = factor(ifelse(mol_df$sample_id %in% tp53_mut_samples, 
                                   '01.wild_type', '02.mutated'))
  
# Visualize the correlation between EZH2 and E2F1
ggplot(mol_df, aes(x = TP53, y = HMGA1)) +
  geom_point(color = 'gray60', size = 0.75) +
  facet_grid(cols = vars(TP53.status)) +
  theme_bw() +
  geom_smooth(mapping = aes(color = TP53.status), 
              method = 'lm', size=0.3, fill = 'gray85') +
  ggtitle('HMGA1-TP53 correlation in BRCA') + 
  theme(plot.title = element_text(hjust = 0.5))

Scatterplot - showing HGMA1 RNA expression with respect to TP53 RNA expression across TCGA breast cancer samples. Samples with wild type (left) and mutant (right) TP53 status are shown in the two panels.

—–

SessionInfo

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_4.0.0        reshape2_1.4.4       TCGAretriever_1.10.3
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     dplyr_1.1.4       
##  [5] compiler_4.5.2     tidyselect_1.2.1   Rcpp_1.1.0         stringr_1.5.2     
##  [9] jquerylib_0.1.4    splines_4.5.2      scales_1.4.0       yaml_2.3.10       
## [13] fastmap_1.2.0      lattice_0.22-5     R6_2.6.1           plyr_1.8.9        
## [17] labeling_0.4.3     generics_0.1.4     curl_7.0.0         knitr_1.50        
## [21] tibble_3.3.0       bslib_0.9.0        pillar_1.11.1      RColorBrewer_1.1-3
## [25] rlang_1.1.6        cachem_1.1.0       stringi_1.8.7      xfun_0.54         
## [29] sass_0.4.10        S7_0.2.0           cli_3.6.5          mgcv_1.9-1        
## [33] withr_3.0.2        magrittr_2.0.4     digest_0.6.37      grid_4.5.2        
## [37] rstudioapi_0.17.1  nlme_3.1-168       lifecycle_1.0.4    vctrs_0.6.5       
## [41] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
## [45] httr_1.4.7         tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.8.1

Success! TCGAretriever vignette. Date: 2026-Feb-23, by D Fantini.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.