The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Cancer cells accumulate DNA mutations as result of DNA damage and DNA repair processes. mutSignatures is a computational framework aimed at deciphering DNA mutational signatures operating in cancer.
The framework includes three modules that support 1) raw data import and pre-processing, 2) mutation counts deconvolution, and 3) data visualization. Therefore, mutSignatures is a comprehensive software suite that can guide the user throughout all steps of a mutational signature analysis.
Inputs are VCF files, MAF files, as well as other commonly used file formats storing DNA variant information. These files are imported and pre-processed to obtain a mutationCounts
-class object (which includes a numeric matrix of DNA mutation counts).
The framework de-novo extracts mutational signatures via Non-Negative Matrix Factorization (NMF). Bootstrapping is performed as part of the analysis.
The framework relies on parallelization (optional) and is optimized for use on multi-core systems.
The mutSignatures framework was developed in the Meeks Lab at Northwestern University (Chicago, IL), and was described in a Scientific Reports paper Fantini D et al, 2020
The software is based on a custom R-based implementation of the MATLAB WTSI framework by Alexandrov LB et al, 2013, and includes a wide range of tools for expanded functionality.
The mutSignatures framework has been used for analyses described in peer-reviewed publications, including Fantini D et al, 2018 and Fantini D et al, 2019.
More vignettes discussing how to extract and analyze mutational signatures from different sources using mutSignatures
can be found at the following URLs:
Getting started with mutSignatures (extended): link to PDF
Deploying mutSignatures on Computational Clusters: link to PDF
Deconvoluting Mutation Counts Against Known Mutational Signatures: link to PDF
More tutorials: mutsignatures official website
The package is available from CRAN. The official version of mutSignatures
(version 2.1.1) can be installed from any CRAN mirror.
install.packages("mutSignatures")
The latest (most recently updated, beta) version of the package is available on GitHub. It is possible to install mutSignatures
using devtools
.
devtools::install_github("dami82/mutSignatures", force = TRUE,
build_opts = NULL, build_vignettes = TRUE)
One of the pre-processing steps in the mutSignatures pipeline relies on a list of Bioconductor (https://www.bioconductor.org/) libraries. These libraries are optional, since they are only required for the attachContext() step (these libs are not automatically downloaded when installing the package). The bioconductor libraries are: i) IRanges; ii) GenomicRanges; iii) BSgenome; iv) GenomeInfoDb. You can install them as shown below.
# Get BiocManager
if (!"BiocManager" %in% rownames(utils::installed.packages()))
install.packages("BiocManager")
# Get Bioc libs
BiocManager::install(pkgs = c("IRanges", "GenomicRanges", "BSgenome", "GenomeInfoDb"))
To complete the attachContext() step, a BSgenome package including full genome sequences of the organism of interest has to be installed as well. For example, human hg19 genome sequences can be installed as shown below.
BiocManager::install(pkgs = "BSgenome.Hsapiens.UCSC.hg19")
The typical mutSignatures
analytic pipeline includes three steps:
Data Import and pre-processing: Mutation data are imported from a VCF, MAF, or similar file. Single Nucleotide Variants (SNVs) are filtered, nucleotide context around the mutation site is retrieved, and tri-nucleotide mutation types are computed and counted across samples. The result of this step is a mutationCounts
-class object.
De-novo extraction of mutational signatures: Non-negative Matrix factorization is performed. Typically, 500 to 1,000 bootstrapping iterations are performed (for preparative analytic runs). 100-500 iterations are typically enough for a preliminary analysis. The NMF step returns both: i) mutational signatures, as well as ii) mutSignature exposures.
Visualization, downstream analysis, data export: Results can be visualized (barplots). Signatures can be compared and matched to known signatures (i.e. previously extracted signatures or published signatures, such as the COSMIC signatures published by the Sanger Institute, London). You can use the msigPlot()
method to generate plots.
If the set of mutational signatures is already known, it is possible to use mutSignatures
to estimate the contribution of each mutational pattern in a collection of samples. In this case, the pipeline includes the following steps:
Data Import and pre-processing: see above.
Estimate exposures to known mutational signatures: Use a Fast Combinatorial Nonnegative Least-Square approach to compute exposures when the mutation signatures are pre-defined.
Visualization, downstream analysis, data export: see above.
For running the demos described in this vignette, the following R libraries will be used: dplyr
, reshape2
, ggplot2
, kableExtra
, BSgenome.Hsapiens.UCSC.hg19
, and mutSignatures
. Before proceeding, we assign the hg19 BSgenome object to a variable named hg19
, and we load the mutSigData
dataset, which is provided together with the mutSignatures
package.
# Required libs
library(dplyr)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(BSgenome.Hsapiens.UCSC.hg19)
# Load mutSignatures
library(mutSignatures)
# prep hg19
hg19 <- BSgenome.Hsapiens.UCSC.hg19
# load data
data("mutSigData")
The first demo shows how to extract mutational signatures from a dataset including DNA mutations from bladder cancer samples.
Here, the input has a VCF-like structure, and is decorated with an extra column (namely, SAMPLEID
) that includes a unique identifier for the biological samples.
# Import data (VCF-like format)
x <- mutSigData$inputC
# Filter non SNV
x <- filterSNV(dataSet = x, seq_colNames = c("REF", "ALT"))
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID |
---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 |
# Attach context
x <- attachContext(mutData = x,
chr_colName = "CHROM",
start_colName = "POS",
end_colName = "POS",
nucl_contextN = 3,
BSGenomeDb = hg19)
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID | context |
---|---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 | CGA |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGA |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGC |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGG |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGA |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 | GGA |
# Remove mismatches
x <- removeMismatchMut(mutData = x, # input data.frame
refMut_colName = "REF", # column name for ref base
context_colName = "context", # column name for context
refMut_format = "N")
# Compute mutType
x <- attachMutType(mutData = x, # as above
ref_colName = "REF", # column name for ref base
var_colName = "ALT", # column name for mut base
context_colName = "context")
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID | context | mutType |
---|---|---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 | CGA | T[C>A]G |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGA | T[C>T]A |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGC | G[C>T]A |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGG | C[C>T]T |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGA | T[C>G]T |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 | GGA | T[C>G]C |
# Count
blca.counts <- countMutTypes(mutTable = x,
mutType_colName = "mutType",
sample_colName = "SAMPLEID")
# Mutation Counts
print(blca.counts)
## Mutation Counts object - mutSignatures
##
## Total num of MutTypes: 96
## MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
##
## Total num of Samples: 50
## Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...
After multi-nucleotide mutation count data have been prepared, it is possible to proceed with the de-novo signature extraction. Settings guiding the analysis are defined as a list of parameters that can be tuned via the setMutClusterParams()
function. Next, the NMF analysis is executed by the decipherMutationalProcesses()
function. At the end of the analysis, a silhouette plot is returned. This plot can be very helpful to understand if the obtained signatures are relatively weak or robust.
# how many signatures should we extract?
num.sign <- 4
# Define parameters for the non-negative matrix factorization procedure.
# you should parallelize if possible
blca.params <-
mutSignatures::setMutClusterParams(
num_processesToExtract = num.sign, # num signatures to extract
num_totIterations = 20, # bootstrapping: usually 500-1000
num_parallelCores = 4) # total num of cores to use (parallelization)
# Extract new signatures - may take a while
blca.analysis <-
decipherMutationalProcesses(input = blca.counts,
params = blca.params)
Profiles of de-novo extracted mutational signatures can be visualized (by barplots). Also it is possible to plot the estimated exposures of each sample to the identified mutational patterns. It is also possible to compare de-novo extracted signatures with known signatures (such as signatures from previous analyses, or COSMIC signatures). To cast mutSignature
objects as data.frames
, you can use the mutSignatures::as.data.frame()
method.
Note: you can use the mutSignatures::getCosmicSignatures()
function to download a copy of the COSMIC signatures.
# Retrieve signatures (results)
blca.sig <- blca.analysis$Results$signatures
# Retrieve exposures (results)
blca.exp <- blca.analysis$Results$exposures
# Plot signature 1 (standard barplot, you can pass extra args such as ylim)
msigPlot(blca.sig, signature = 1, ylim = c(0, 0.10))
# Plot exposures (ggplot2 object, you can customize as any other ggplot2 object)
msigPlot(blca.exp, main = "BLCA samples") +
scale_fill_manual(values = c("#1f78b4", "#cab2d6", "#ff7f00", "#a6cee3"))
# Export Signatures as data.frame
xprt <- coerceObj(x = blca.sig, to = "data.frame")
head(xprt) %>% kable() %>% kable_styling(bootstrap_options = "striped")
Sign.01 | Sign.02 | Sign.03 | Sign.04 | |
---|---|---|---|---|
A[C>A]A | 0.0022120 | 0.0040737 | 0.0057090 | 0.0062098 |
A[C>A]C | 0.0007129 | 0.0088800 | 0.0016303 | 0.0145214 |
A[C>A]G | 0.0002019 | 0.0039400 | 0.0007855 | 0.0120000 |
A[C>A]T | 0.0005315 | 0.0022517 | 0.0110999 | 0.0075955 |
A[C>G]A | 0.0009370 | 0.0048720 | 0.0024274 | 0.0098213 |
A[C>G]C | 0.0013582 | 0.0001527 | 0.0054321 | 0.0046073 |
# Get signatures from data (imported as data.frame)
# and then convert it to mutSignatures object
cosmixSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("COSMIC")) %>%
as.mutation.signatures()
blcaKnwnSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("BLCA")) %>%
as.mutation.signatures()
# Compare de-novo signatures with selected COSMIC signatures
msig1 <- matchSignatures(mutSign = blca.sig, reference = cosmixSigs,
threshold = 0.45, plot = TRUE)
msig2 <- matchSignatures(mutSign = blca.sig, reference = blcaKnwnSigs,
threshold = 0.45, plot = TRUE)
# Visualize match
# signature 1 is similar to COSMIC ;
# signatures 2 and 3 are similar to COSMIC
# Here, we should probably extract only 2 mutational signatures
hm1 <- msig1$plot + ggtitle("Match to COSMIC signs.")
hm2 <- msig2$plot + ggtitle("Match to known BLCA signs.")
# Show
grid.arrange(hm1, hm2, ncol = 2)
The second demo illustrates a case where the signatures are known. We will use the COSMIC signatures 1, 2, 5, and 13 in the analysis, as well as a set of custom signatures previously identified in bladder cancer tumor samples. We will estimate their contributions to a collection of blca mutations that are included in the mutSigData
dataset.
Data import and pre-processing can be conducted as shown above. Note that if a data.frame
with counts of mutation types across samples is available, it is sufficient to use the as.mutation.counts()
method to convert it to the desired object.
# Retrieve a mutation.counts data.frame
x <- mutSigData$blcaMUTS
# Visualize header
x[1:10, 1:5] %>% kable() %>% kable_styling(bootstrap_options = "striped")
BLCA.001103810 | BLCA.001289938 | BLCA.001362782 | BLCA.001427035 | BLCA.001575414 | |
---|---|---|---|---|---|
A[C>A]A | 1 | 0 | 0 | 0 | 0 |
A[C>A]C | 0 | 0 | 0 | 0 | 0 |
A[C>A]G | 0 | 0 | 0 | 1 | 0 |
A[C>A]T | 1 | 0 | 0 | 1 | 0 |
A[C>G]A | 1 | 0 | 0 | 2 | 0 |
A[C>G]C | 0 | 0 | 0 | 0 | 0 |
A[C>G]G | 1 | 0 | 0 | 0 | 0 |
A[C>G]T | 0 | 0 | 2 | 0 | 0 |
A[C>T]A | 1 | 0 | 1 | 3 | 0 |
A[C>T]C | 2 | 1 | 1 | 1 | 0 |
# Convert it
xx <- as.mutation.counts(x)
# Print to screen
print(xx)
## Mutation Counts object - mutSignatures
##
## Total num of MutTypes: 96
## MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
##
## Total num of Samples: 50
## Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...
After preparing the mutational signatures as a mutationSignatures
-class object, it is sufficient to run the resolveMutSignatures()
function. It is not recommended to use a large number of signatures when running this analysis (for example, refrain from using all COSMIC signatures). Likewise, you don’t want to run this step using a very low number of signatures. The algorithm is way more accurate if ONLY AND ALL the relevant signatures are used (i.e., the signatures we are reasonably expecting to be operative in the tumor samples being analyzed). This analysis is typically very fast!
# Obtain 4 COSMIC signatures
cosmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("COSMIC"))
cosmx <- as.mutation.signatures(cosmx)
# Obtain 4 BLCA signatures
blcmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("BLCA"))
blcmx <- as.mutation.signatures(blcmx)
# Visualize cosmx
print(cosmx)
## Mutation Signatures object - mutSignatures
##
## Total num of Signatures: 4
## Total num of MutTypes: 96
##
## Sign.1 Sign.2 Sign.3 Sign.4
## ------ ------ ------ ------
## + 0.0111 0.0007 0.0149 0.0003 + A[C>A]A
## + 0.0091 0.0006 0.0090 0.0006 + A[C>A]C
## + 0.0015 0.0001 0.0022 0.0000 + A[C>A]G
## + 0.0062 0.0003 0.0092 0.0008 + A[C>A]T
## + 0.0018 0.0003 0.0117 0.0038 + A[C>G]A
## + 0.0026 0.0003 0.0073 0.0009 + A[C>G]C
## + 0.0006 0.0002 0.0023 0.0000 + A[C>G]G
## + 0.0030 0.0006 0.0117 0.0039 + A[C>G]T
## + 0.0295 0.0074 0.0218 0.0015 + A[C>T]A
## + 0.0143 0.0027 0.0128 0.0000 + A[C>T]C
## ...... ...... ...... ......
# Visualize cosmx
print(blcmx)
## Mutation Signatures object - mutSignatures
##
## Total num of Signatures: 4
## Total num of MutTypes: 96
##
## Sign.1 Sign.2 Sign.3 Sign.4
## ------ ------ ------ ------
## + 0.0019 0.0020 0.0072 0.0059 + A[C>A]A
## + 0.0003 0.0073 0.0111 0.0105 + A[C>A]C
## + 0.0001 0.0064 0.0049 0.0069 + A[C>A]G
## + 0.0006 0.0038 0.0071 0.0061 + A[C>A]T
## + 0.0007 0.0028 0.0055 0.0118 + A[C>G]A
## + 0.0017 0.0002 0.0021 0.0060 + A[C>G]C
## + 0.0017 0.0028 0.0062 0.0029 + A[C>G]G
## + 0.0024 0.0031 0.0015 0.0089 + A[C>G]T
## + 0.0019 0.0082 0.0135 0.0155 + A[C>T]A
## + 0.0028 0.0119 0.0171 0.0100 + A[C>T]C
## ...... ...... ...... ......
# Run analysis
blca.expo1 <- resolveMutSignatures(mutCountData = xx,
signFreqData = cosmx)
blca.expo2 <- resolveMutSignatures(mutCountData = xx,
signFreqData = blcmx)
As discussed above, exposures to mutational signatures can be plotted (barplots), and results can be exported using the mutSignatures::data.frame
method.
# Retrieve exposures (results)
blca.exp.1x <- blca.expo1$results$count.result
blca.exp.2x <- blca.expo2$results$count.result
# Plot exposures
bp1 <- msigPlot(blca.exp.1x, main = "BLCA | COSMIC sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
bp2 <- msigPlot(blca.exp.2x, main = "BLCA | pre-blca sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
# Visualize
grid.arrange(bp1, bp2, ncol = 2)
# Compare sigs
# Export Exposures as data.frame
xprt <- as.data.frame(blca.exp.1x, transpose = TRUE)
head(xprt) %>% round() %>% kable() %>%
kable_styling(bootstrap_options = "striped")
COSMIC.1 | COSMIC.2 | COSMIC.5 | COSMIC.13 | |
---|---|---|---|---|
BLCA.001103810 | 32 | 29 | 23 | 30 |
BLCA.001289938 | 42 | 24 | 0 | 32 |
BLCA.001362782 | 50 | 119 | 24 | 228 |
BLCA.001427035 | 79 | 35 | 14 | 74 |
BLCA.001575414 | 49 | 49 | 0 | 40 |
BLCA.001698493 | 34 | 6 | 53 | 0 |
More info and other examples can be found at the following URLs:
Our Scientific Reports paper describing the latest version of the software: MutSignatures: An R Package for Extraction and Analysis of Cancer Mutational Signatures, https://www.nature.com/articles/s41598-020-75062-0/. Please, cite this paper if you use mutSignatures in your work! Thanks.
Oncogene paper: Mutational Signatures operative in bladder cancer, https://www.nature.com/articles/s41388-017-0099-6/
More tutorials and vignettes about the mutSignatures R library: https://www.data-pulse.com/dev_site/mutsignatures/
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mutSignatures_2.1.1 foreach_1.5.0 gridExtra_2.3
## [4] ggplot2_3.3.2 kableExtra_1.1.0 reshape2_1.4.4
## [7] dplyr_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 pracma_2.2.9 highr_0.8 pillar_1.4.6
## [5] compiler_4.0.2 plyr_1.8.6 iterators_1.0.12 tools_4.0.2
## [9] digest_0.6.25 gtable_0.3.0 evaluate_0.14 lifecycle_0.2.0
## [13] tibble_3.0.3 viridisLite_0.3.0 pkgconfig_2.0.3 rlang_0.4.7
## [17] rstudioapi_0.11 parallel_4.0.2 yaml_2.2.1 xfun_0.16
## [21] cluster_2.1.0 withr_2.2.0 stringr_1.4.0 httr_1.4.2
## [25] knitr_1.29 xml2_1.3.2 generics_0.0.2 vctrs_0.3.2
## [29] hms_0.5.3 webshot_0.5.2 grid_4.0.2 tidyselect_1.1.0
## [33] glue_1.4.1 R6_2.4.1 rmarkdown_2.3 farver_2.0.3
## [37] purrr_0.3.4 readr_1.3.1 magrittr_1.5 codetools_0.2-16
## [41] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 rvest_0.3.6
## [45] colorspace_1.4-1 labeling_0.3 proxy_0.4-24 stringi_1.4.6
## [49] doParallel_1.0.15 munsell_0.5.0 crayon_1.3.4
Thanks for your interest in our software. If you run into any issue, bug, or you have a question about mutSignatures
, feel free to email damiano.fantini@gmail.com. At this time, mutSignatures is mainly supported by DF in his free time (so, please, be patient upon your inquires). C-2020 (November-01, 2020).
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.