RiskyCNV: A Prostate Cancer Case Study Using TCGA-PRAD Data

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Introduction

Copy Number Variation (CNV) is a genetic phenomenon in which the number of copies of a specific DNA segment varies among individuals. CNVs can take the form of duplications, deletions, or other structural changes in DNA sequences and can extend to long genomic segments. CNVs play an essential role in driving cancer development, and their characteristics are critical for disease diagnosis, prognosis, and treatment (Freeman et al., 2006).

CNVs are broadly classified into germline and somatic types. Germline CNVs contribute to hereditary disorders and inherited predispositions to cancer. Somatic CNVs occur due to aberrant cell division and often relate directly to cancer progression. The study of somatic CNVs can uncover driver genes involved in tumorigenesis or tumour suppression and serve as biomarkers for predicting prognosis or treatment outcomes (Xie & Tammi, 2009).

Prostate cancer is one of the most common malignancies in men worldwide, with marked heterogeneity in progression and treatment outcomes. Indolent (low-Gleason) cancers show minor genomic alterations, while aggressive metastatic tumours demonstrate gross CNVs (Hieronymus et al., 2014). Risk stratification — categorising patients by their likelihood of harbouring aggressive disease — is central to clinical decision-making and personalised treatment planning.

The RiskyCNV package addresses a critical gap in existing CNV tools. While packages such as CNVizard, CINmetrics, GISTIC, and CopywriteR offer visualisation, significant CNV identification, and chromosomal instability quantification, none provide stratified class analysis, particularly risk-class analysis (Krause et al., 2024; Oza et al., 2023; Mermel et al., 2011; Kuilman et al., 2015). RiskyCNV was designed specifically for class-stratified exploration and analysis of CNV omics data, with the potential to deliver personalised insights regarding disease course and risk stratification.

This vignette demonstrates the complete RiskyCNV workflow applied to prostate adenocarcinoma (PRAD) data from The Cancer Genome Atlas (TCGA), illustrating how CNV analysis integrated with clinical grading and RNA expression data can identify biologically meaningful biomarkers.

library(RiskyCNV)

Dataset

The data used in this case study are derived from the TCGA PRAD dataset (https://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/PRAD/20160128/). The dataset comprises:

Clinical data: Sample IDs, Gleason scores, and histopathological pattern information for 929 prostate cancer samples
Somatic CNV data: Segment-level copy number profiles including chromosome, start/end coordinates, number of probes, and segment mean
Gene annotation: Reference genomic coordinates for gene symbol assignment
RNA expression data: Transcriptomic profiles for correlation analysis

The grade-wise distribution of TCGA-PRAD samples is as follows:

Grade	Gleason Score	Number of Samples
Grade 1	≤ 6	44
Grade 2	3+4=7	146
Grade 3	4+3=7	99
Grade 4	8	63
Grade 5	9-10	139
Control	—	438
Total		929

sample_file     <- system.file("extdata", "sample_data.csv",
                                package = "RiskyCNV")
cnv_file        <- system.file("extdata", "cnv_data.txt",
                                package = "RiskyCNV")
gene_file       <- system.file("extdata", "gene_annotation.csv",
                                package = "RiskyCNV")
annotated_file  <- system.file("extdata", "annotated_cnv.csv",
                                package = "RiskyCNV")
cnv_matrix_file <- system.file("extdata", "cnv_matrix.csv",
                                package = "RiskyCNV")
rna_file        <- system.file("extdata", "rna_data.csv",
                                package = "RiskyCNV")

# Preview the clinical sample data
head(read.csv(sample_file))
#>         sample gleason_score pattern1 pattern2  grade
#> 1 TCGA.HC.7748             6        3        3 grade1
#> 2 TCGA.CH.5743             7        3        4 grade2
#> 3 TCGA.CH.5762             7        4        3 grade3
#> 4 TCGA.EJ.A46D             8        3        5 grade4
#> 5 TCGA.CH.5741             9        4        5 grade5
#> 6 TCGA.V1.A9OF             6        3        3 grade1

Each sample contains its total Gleason score, primary histological pattern (pattern1), secondary histological pattern (pattern2), and assigned grade label (Balraj & Muthamilselvan, 2024).

Step 1: Classify Samples into Gleason Grade Groups

Background

Gleason grading is the cornerstone of prostate cancer pathological assessment. The 2014 ISUP Consensus Conference (Epstein et al., 2016) formalised a five-tier Grade Group system:

Grade Group 1: Gleason ≤ 6 — well differentiated, very low risk
Grade Group 2: Gleason 3+4=7 — favourable intermediate risk
Grade Group 3: Gleason 4+3=7 — unfavourable intermediate risk
Grade Group 4: Gleason 8 — poorly differentiated, high risk
Grade Group 5: Gleason 9-10 — anaplastic, very high risk

Using the Prostate Cancer Preset

The extract_metadata() function implements the ISUP Grade Group system directly when disease_type = "prostate" is specified:

grade_groups <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
#> Note: 'pattern_col' not supplied. All Gleason 7 samples are assigned to Grade Group 2. To distinguish Grade Group 2 (3+4=7) from Grade Group 3 (4+3=7), supply the primary pattern column name via pattern_col (e.g., pattern_col = 'pattern1').
#> Grade groups saved to: /tmp/RtmpX5HEhM/sample_data_with_grade_groups_20260514_140743.csv

print(names(grade_groups))
#> [1] "Grade Group 1" "Grade Group 2" "Grade Group 4" "Grade Group 5"
print(sapply(grade_groups, length))
#> Grade Group 1 Grade Group 2 Grade Group 4 Grade Group 5 
#>             3             4             2             2

Using the Auto Mode

For datasets where clinical thresholds are unknown, disease_type = "auto" automatically derives a normalised Risk Score from the data using min-max normalisation and divides samples into the requested number of groups. The function assesses distribution skewness and selects equal-width or quantile-based splitting accordingly:

grade_groups_auto <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 5,
  group_type   = "grade",
  output_dir   = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Grade groups saved to: /tmp/RtmpX5HEhM/sample_data_with_grade_groups_20260514_140743.csv

print(names(grade_groups_auto))
#> [1] "Grade 1" "Grade 2" "Grade 4" "Grade 5"

The Risk Score is computed as: Risk Score = (score − min) / (max − min)

This normalises any numeric scoring system to a 0–1 scale, making the package applicable to any disease with a numeric grading or staging score.

Step 2: Stratify Samples by Clinical Risk Level

Background

Risk stratification translates Gleason grades into actionable clinical risk categories. The D’Amico classification (D’Amico et al., 1998), the most widely used system in prostate cancer, defines:

Low risk: Gleason ≤ 6 — active surveillance appropriate
Intermediate risk: Gleason 7 — selective treatment indicated
High risk: Gleason ≥ 8 — aggressive treatment required

Using the Prostate Cancer Preset

risk_groups <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv

print(names(risk_groups))
#> [1] "low_risk"          "intermediate_risk" "high_risk"
print(sapply(risk_groups, length))
#>          low_risk intermediate_risk         high_risk 
#>                 3                 4                 4

Flexible Risk Grouping with Auto Mode

The auto mode supports any number of risk groups. This is useful for diseases that use only two risk levels or four levels:

# Two risk groups
risk_2 <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 2,
  output_dir   = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv
print(names(risk_2))
#> [1] "low_risk"  "high_risk"

# Four risk groups
risk_4 <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 4,
  output_dir   = tempdir()
)
#> Risk Score computed. Distribution skewness: 0.325. Using equal-width splitting.
#> Risk categories saved to: /tmp/RtmpX5HEhM/sample_data_with_risk_categories_20260514_140743.csv
print(names(risk_4))
#> [1] "very_low_risk"  "low_risk"       "high_risk"      "very_high_risk"

Step 3: Detect Copy Number Aberrations

Background

Aberration detection identifies genomic segments showing significant gains (oncogene amplifications) or losses (tumour suppressor deletions). Grade-wise analysis of TCGA-PRAD data showed progressive increases in aberration frequency:

Grade 1: Amplifications in chromosomes 8, 14, 20; losses in chromosomes 2-21
Grade 5: Amplifications in chromosomes 1, 3, 7, 9, 10, 12, 14, 16, 17, 19, 21; losses in chromosomes 1-14 and 16-22

This escalation in genomic instability with Gleason grade underscores the biological basis of clinical risk stratification.

aberrations <- aberration(
  cnv_data_file = cnv_file,
  effect_size   = 0.3
)

# Aberrant regions per chromosome
print(sapply(aberrations, nrow))
#> 2 
#> 1

Segments with segment mean > +0.3 are classified as Gains (Aberration_Code = 1); those with segment mean < −0.3 are classified as Losses (Aberration_Code = 0).

Step 4: Identify Recurrent Copy Number Variations

Background

Recurrent CNVs — those appearing consistently across multiple samples within a risk group — are more likely to represent functional driver events rather than stochastic noise. Recurrent gains in oncogenes and recurrent losses in tumour suppressor genes carry particular clinical significance.

recurrent_file <- recurrent(
  x             = risk_groups,
  risk_level    = "high_risk",
  cnv_data_file = cnv_file,
  threshold     = 2
)

recurrent_data <- read.csv(recurrent_file)
head(recurrent_data)

The output CSV contains recurrent CNV regions for the high-risk group:

Sample	Chromosome	Start	End	Num_Probes	Segment_Mean
TCGA.2A.A8VL	1	66506701	66515081	9	-1.1952
TCGA.2A.A8VL	1	74699099	74700802	5	-1.2440
TCGA.2A.A8VL	2	123900537	123920905	14	-0.7297

The threshold parameter controls stringency — higher values yield more robustly recurrent regions. Here we focus on high-risk samples where aggressive CNV patterns are most pronounced.

Step 5: Annotate CNV Regions with Gene Symbols

Background

Annotating CNV regions with gene symbols — by finding genomic overlaps with a reference gene annotation — translates chromosomal coordinates into functionally meaningful gene lists. Key genes identified in TCGA-PRAD CNV analyses include ARHGEF16, CHMP7, ELP3, ASH2L, and RNF157, whose roles span immune modulation, cell cycle regulation, and tumour suppression.

annotated <- annotate(
  genes_file = gene_file,
  risk_file  = recurrent_file,
  output_dir = tempdir()
)

head(annotated)

Each recurrent CNV region is linked to the overlapping gene(s), sample ID, and segment mean value.

Step 6: Build a CNV Expression Matrix

Background

A sample × gene CNV matrix organises annotated data into a format suitable for statistical integration with expression data. Positive values indicate copy number gains; negative values indicate losses. This matrix structure allows efficient examination of gene variation patterns across all samples simultaneously.

old_wd <- getwd()
setwd(tempdir())

cnv_matrix <- create_CNVMatrix(input_file = annotated_file)
#> CNV matrix saved to: /tmp/RtmpX5HEhM/cnv_output_matrix_20260514_140744.csv

setwd(old_wd)

print(dim(cnv_matrix))
#> [1] 8 9
print(cnv_matrix[, 1:min(5, ncol(cnv_matrix))])
#> # A tibble: 8 × 5
#>   Sample       ARHGEF16   VN1R3 CD163L1   PDE12
#>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>
#> 1 TCGA.EJ.5514   0.0007 NA      NA      NA     
#> 2 TCGA.G9.7509  NA      -0.0033 NA      NA     
#> 3 TCGA.J4.A67Q  NA      NA       0.0005 NA     
#> 4 TCGA.KK.A8IK  NA      NA      NA      -0.0004
#> 5 TCGA.Y6.A8TL  NA      NA      NA      NA     
#> 6 TCGA.ZG.A8QX  NA      NA      NA      NA     
#> 7 TCGA.ZG.A8QY  NA      NA      NA      NA     
#> 8 TCGA.ZG.A9MC  NA      NA      NA      NA

Step 7: Correlate CNV with RNA Expression

Background

The final step integrates CNV data with transcriptomic profiles to identify CNV-driven gene expression changes. When a gene is amplified, its expression typically increases; when deleted, expression typically decreases. Genes showing statistically significant positive Pearson correlations between CNV segment mean and RNA expression are strong biomarker candidates.

In the full TCGA-PRAD analysis, grade-specific biomarkers included:

Grade	Key CNV-correlated Gene	Role
Grade 1	PPP1R3B	Glycogen metabolism regulation
Grade 2	ASH2L	Histone methylation, chromatin remodeling
Grade 3	IWS1	RNA processing
Grade 4	UBE2Q1	Ubiquitin-mediated proteolysis
Grade 5	GSK3A	Cell cycle, apoptosis regulation

Cross-grade consensus genes CHMP7 and ELP3 were common across all five grades, suggesting fundamental roles in prostate cancer biology.

old_wd <- getwd()
setwd(tempdir())

corr_results <- correlate_with_expr(
  cnv_file = cnv_matrix_file,
  rna_file = rna_file
)
#> Common genes found: 5
#> Correlation results saved to: all_correlations.csv, significant_corr.csv, high_correlation.csv

setwd(old_wd)

cat("All correlations:\n")
#> All correlations:
print(corr_results$all_correlations)
#>       gene     cor_val    p.value
#> 1 ARHGEF16  0.57254735 0.42745265
#> 2    RQCD1  0.05635468 0.94364532
#> 3    PDE12 -0.80909159 0.19090841
#> 4   FGFBP2 -0.98551703 0.01448297
#> 5  CD163L1  0.40120322 0.59879678

cat("\nSignificant correlations (p < 0.05):\n")
#> 
#> Significant correlations (p < 0.05):
print(corr_results$significant)
#>     gene   cor_val    p.value
#> 4 FGFBP2 -0.985517 0.01448297

cat("\nHigh-confidence CNV-driven genes (p < 0.05, r > 0.8):\n")
#> 
#> High-confidence CNV-driven genes (p < 0.05, r > 0.8):
print(corr_results$high_correlation)
#> [1] gene    cor_val p.value
#> <0 rows> (or 0-length row.names)

Three output files are produced — all correlations, significant (p < 0.05), and high-confidence (p < 0.05 AND r > 0.8). Genes in the high-confidence output are the strongest candidates for further investigation as CNV-driven expression biomarkers in prostate cancer.

Biological Summary

The RiskyCNV workflow applied to TCGA-PRAD data produced the following key findings:

Risk-grade comparison: Comparing low-risk (Grade 1-2) and high-risk (Grade 4-5) analyses revealed common genes with distinct biological roles. Low-risk/Grade 1 shared genes included HMGA1, BRCA1, TP53, MYC, and EGFR — major oncogenes and tumour suppressors correlated with baseline cancer susceptibility. High-risk/Grade 4-5 shared genes included UBE2C and UBE2Z — ubiquitin pathway members associated with tumour aggressiveness and poor prognosis.

Consensus biomarkers: Integrated analysis of recurrent somatic CNVs, germline CNVs, and RNA-correlated genes identified grade-specific consensus biomarkers, including ME1, CLU, SERPINB5, NECAB1, CTHRC1, and DPYS. RNF157 emerged as a multi-consensus candidate biomarker across multiple analytical approaches, consistent with its known role in M2 macrophage polarisation and tumour immune evasion (Guan et al., 2022).

Clinical implications: By integrating risk stratification with grade-wise CNV analysis, RiskyCNV enables a more comprehensive understanding of prostate cancer. Recognising genes common across both analyses enhances the ability to distinguish between genes that primarily drive tumour progression and those that serve as early indicators of cancer susceptibility, ultimately supporting personalised treatment planning.

Generalisability

Although demonstrated here with prostate cancer Gleason scores, RiskyCNV is fully generalised:

# Breast cancer with Nottingham scores
breast_grades <- extract_metadata(
  file_path    = "breast_samples.csv",
  column_name  = "nottingham_score",
  disease_type = "auto",
  n_groups     = 3,
  group_type   = "grade",
  output_dir   = tempdir()
)

# Lymphoma with two risk groups (limited vs. advanced)
lymphoma_risk <- classify_risk(
  file_path    = "lymphoma_samples.csv",
  column_name  = "ann_arbor_stage",
  disease_type = "auto",
  n_groups     = 2,
  output_dir   = tempdir()
)

Seven built-in disease presets are available ("prostate", "breast", "colorectal", "lung", "cervical", "lymphoma", "melanoma"), and fully custom thresholds can be supplied via disease_type = "custom".

References

Balraj AS & Muthamilselvan S. (2024). PRADclass: Hybrid Gleason grade-informed computational strategy. Cancer Informatics.
D’Amico AV, et al. (1998). Biochemical outcome after radical prostatectomy. JAMA, 280(11):969-974.
Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on Gleason Grading. Am J Surg Pathol, 40(2):244-252.
Freeman JL, et al. (2006). Copy number variation: new insights in genome diversity. Genome Research, 16:949-961.
Guan H, et al. (2022). Exosomal RNF157 mRNA from prostate cancer cells contributes to M2 macrophage polarization. Front Oncol, 12:1021270.
Hieronymus H, et al. (2014). Copy number alteration burden predicts prostate cancer relapse. PNAS, 111:11139-11144.
Krause J, et al. (2024). CNVizard. BMC Bioinformatics, 25:376.
Kuilman T, et al. (2015). CopywriteR. Genome Biology, 16:49.
Mermel CH, et al. (2011). GISTIC2.0. Genome Biology, 12:R41.
Oza VH, et al. (2023). CINmetrics. PeerJ, 11:e15244.
Xie C & Tammi MT. (2009). CNV-seq. BMC Bioinformatics, 10:80.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.