Installation

You can install the development version of SeedMatchR from GitHub or the stable build from CRAN.

# Install the stable release from CRAN
install.packages("SeedMatchR")

# Or install the development version from GitHub
install.packages("devtools")
devtools::install_github("tacazares/SeedMatchR")

Quick start example with public siRNA data

Introduction to example dataset

This example uses the siRNA sequence, D1, targeting the Ttr gene in rat liver from the publication:

Schlegel MK, Janas MM, Jiang Y, Barry JD, Davis W, Agarwal S, Berman D, Brown CR, Castoreno A, LeBlanc S, Liebow A, Mayo T, Milstein S, Nguyen T, Shulga-Morskaya S, Hyde S, Schofield S, Szeto J, Woods LB, Yilmaz VO, Manoharan M, Egli M, Charissé K, Sepp-Lorenzino L, Haslett P, Fitzgerald K, Jadhav V, Maier MA. From bench to bedside: Improving the clinical safety of GalNAc-siRNA conjugates using seed-pairing destabilization. Nucleic Acids Res. 2022 Jul 8;50(12):6656-6670. doi: 10.1093/nar/gkac539. PMID: 35736224; PMCID: PMC9262600.

The guide sequence of interest is 23 bp long and oriented 5’ -> 3’.

# siRNA sequence of interest targeting a 23 bp region of the Ttr gene
guide.seq = "UUAUAGAGCAAGAACACUGUUUU"

Required Input Data

The required inputs to SeedMatchR are a DESeq2 results data frame and species-specific annotation data such as GTF, 2bit DNA, and 3’ UTR sequences.

SeedMatchR makes extensive use of AnnotationDB objects to help access genomic information in a reproducible manner. The required inputs are:

  • A character string representing the siRNA sequence in RNA form. This must be longer than 8 bp.
  • res: a data frame of DESeq2 results.
  • GTF: gene transfer format file containing species-specific genomic information for gene bodies. This is used to derive the list of 3’ UTRs and other features used in the analysis. It is also used to map transcript IDs to gene IDs.
  • Feature-specific DNAStringSet: A DNAStringSet object of sequences for each of the features of interest. The features must be named according to the transcript they were derived from. Examples include those generated by GenomicFeatures::extractTranscriptSeqs() paired with functions like GenomicFeatures::threeUTRsByTranscript().
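
As a rough illustration of the first requirement, the sketch below checks a guide sequence before it is passed on. The helper check_guide() is hypothetical and not part of SeedMatchR; it only assumes that the guide is given as RNA and must be longer than 8 bases, as stated above.

```r
# Hypothetical helper (not part of SeedMatchR): basic sanity checks on a guide
check_guide <- function(guide.seq) {
  stopifnot(is.character(guide.seq), length(guide.seq) == 1)
  # Only RNA bases are expected; a "T" suggests a DNA sequence was supplied
  if (!grepl("^[ACGU]+$", guide.seq)) {
    stop("guide.seq must contain only the RNA bases A, C, G and U")
  }
  # The input list above states the sequence must be longer than 8 bases
  if (nchar(guide.seq) <= 8) {
    stop("guide.seq must be longer than 8 bases")
  }
  invisible(TRUE)
}

guide.seq <- "UUAUAGAGCAAGAACACUGUUUU"
check_guide(guide.seq)   # passes silently
nchar(guide.seq)         # 23
```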

Prepare species-specific annotation data

The function load_species_anno_db() has built-in annotation data for human, rat, and mouse. We can load the species-specific annotations using the following approach:

# Load the species specific annotation database object
anno.db <- load_species_anno_db("rat")

Extract features and sequences of interest from annotations

We will use the annotations to derive the features and feature sequences that we want to scan for each gene.

features = get_feature_seqs(anno.db$tx.db, anno.db$dna, feature.type = "3UTR")

Prepare DESeq2 Results

SeedMatchR assumes that you will be performing your analysis on DESeq2 results. The first step is to load your DESeq2 results file as a data frame.

The test data provided with SeedMatchR was derived from the 2022 publication by Schlegel et al. The data set represents a DESeq2 analysis performed on rat liver that had been treated with Ttr-targeting siRNA. We will use this example to explore seed-mediated activity.

Download data (only need to perform once)

We start by downloading the example data set. The function get_example_data() will download three files from the GEO accession GSE184929. These files represent three samples with different siRNA treatments at two dosages.

get_example_data("sirna")

Load example data

We can load the example data into the environment.

sirna.data = load_example_data("sirna")
#> Example data directory being created at: /tmp/RtmpEJMKWe

The DESeq2 results are available through the names Schlegel_2022_Ttr_D1_30mkg, Schlegel_2022_Ttr_D4_30mkg and Schlegel_2022_Ttr_D1_10mkg. The data set name is long, so it will be renamed to res.

res <- sirna.data$Schlegel_2022_Ttr_D1_30mkg

The DESeq2 results are then filtered. The function filter_deseq() can be used to filter a results data frame by log2FoldChange, padj, and baseMean, and to remove NA entries.

# Dimensions before filtering
dim(res) # [1] 32883     8
#> [1] 32883     8

# Filter DESeq2 results for SeedMatchR
res = filter_deseq(res, fdr.cutoff=1, fc.cutoff=0, rm.na.log2fc = TRUE)

# Dimensions after filtering
dim(res) # [1] 13582     8
#> [1] 13582     8
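
To make the filtering step more concrete, here is a minimal base R sketch of this kind of filter on a made-up data frame. The function filter_sketch() is a hypothetical stand-in, and its rules (drop rows with NA in log2FoldChange or padj, keep padj at or below the cutoff and absolute log2FoldChange at or above the cutoff) are assumptions for illustration; the actual behaviour of filter_deseq() may differ.

```r
# Toy DESeq2-like results; all values are made up for illustration
toy <- data.frame(
  gene_id        = c("g1", "g2", "g3", "g4"),
  baseMean       = c(100, 5, 250, 40),
  log2FoldChange = c(-2.1, 0.3, NA, 1.2),
  padj           = c(0.001, 0.8, 0.2, NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for filter_deseq(); the rules below are assumptions
filter_sketch <- function(res, fdr.cutoff = 1, fc.cutoff = 0) {
  keep <- !is.na(res$log2FoldChange) & !is.na(res$padj) &
    res$padj <= fdr.cutoff & abs(res$log2FoldChange) >= fc.cutoff
  res[keep, , drop = FALSE]
}

# Permissive cutoffs only remove rows with missing values (g3 and g4 here)
nrow(filter_sketch(toy, fdr.cutoff = 1, fc.cutoff = 0))

# Stricter cutoffs keep only significant, strongly changed genes (g1 here)
filter_sketch(toy, fdr.cutoff = 0.05, fc.cutoff = 1)$gene_id
```

With fdr.cutoff = 1 and fc.cutoff = 0, as used above, a filter of this kind acts mainly as a way to drop genes without usable statistics.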

Plot possible seeds

Use the plot_seeds() function to visualize the available SeedMatchR seed options for your sequence of interest. The only input to plot_seeds() is the siRNA sequence. This function assumes that you are using an RNA input.

# Plot the seed sequence options for the siRNA of interest
avail.seed.plot = plot_seeds(guide.seq)
#> use default substitution matrix
#> Registered S3 methods overwritten by 'ggalt':
#>   method                  from   
#>   grid.draw.absoluteGrob  ggplot2
#>   grobHeight.absoluteGrob ggplot2
#>   grobWidth.absoluteGrob  ggplot2
#>   grobX.absoluteGrob      ggplot2
#>   grobY.absoluteGrob      ggplot2
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

avail.seed.plot

Get the seed sequence of interest

You can extract the seed sequence information from the siRNA input sequence using the get_seed() function. The inputs to the get_seed() function are the siRNA sequence of interest and the name of the seed.

# Get the seed sequence information for the seed of interest
seed = get_seed(guide.seq, "mer7m8")

seed
#> $Guide
#> 23-letter RNAString object
#> seq: UUAUAGAGCAAGAACACUGUUUU
#> 
#> $Seed.Name
#> [1] "mer7m8"
#> 
#> $Seed.Seq.RNA
#> 7-letter RNAString object
#> seq: UAUAGAG
#> 
#> $Seed.Seq.DNA
#> 7-letter DNAString object
#> seq: TATAGAG
#> 
#> $Target.Seq
#> 7-letter DNAString object
#> seq: CTCTATA
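
The relationship between the fields in this output can be reproduced with a few lines of base R. This is only a sketch: it assumes that the mer7m8 seed corresponds to guide positions 2 to 8, which is consistent with the output shown above, and rev_comp() is a helper written for this example rather than a SeedMatchR function.

```r
guide.seq <- "UUAUAGAGCAAGAACACUGUUUU"

# Guide positions 2-8 give the 7-nt seed reported in $Seed.Seq.RNA
seed.rna <- substr(guide.seq, 2, 8)

# Convert RNA to DNA by replacing U with T ($Seed.Seq.DNA)
seed.dna <- chartr("U", "T", seed.rna)

# Illustrative helper: reverse complement of a DNA string
rev_comp <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# The sequence searched for in the transcripts ($Target.Seq)
target.seq <- rev_comp(seed.dna)

c(seed.rna = seed.rna, seed.dna = seed.dna, target.seq = target.seq)
# seed.rna "UAUAGAG", seed.dna "TATAGAG", target.seq "CTCTATA"
```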

Counting seed matches in transcripts

You can perform a seed match for a single seed using the SeedMatchR() function. The call below does not set seed.name, and the resulting count column is named mer7m8.

res = SeedMatchR(res, 
                 anno.db$gtf, 
                 features$seqs, 
                 guide.seq)

head(res)
#>              gene_id  baseMean log2FoldChange     lfcSE      stat        pvalue
#> 1 ENSRNOG00000016275 2138.0945      -8.164615        NA -23.61818 2.507268e-123
#> 2 ENSRNOG00000000127  437.6342      -1.346927 0.1068629 -12.60425  2.000712e-36
#> 3 ENSRNOG00000047179 1590.1745      -1.262411 0.1031403 -12.23974  1.906387e-34
#> 4 ENSRNOG00000030187  131.9206       3.422725 0.3032352  11.28736  1.515189e-29
#> 5 ENSRNOG00000008050   38.9921      -3.442834 0.3192776 -10.78320  4.132589e-27
#> 6 ENSRNOG00000008816  400.9526       2.794453 0.2661369  10.50006  8.632549e-26
#>            padj symbol mer7m8
#> 1 3.405371e-119    Ttr      1
#> 2  1.358683e-32  Kpna6      0
#> 3  8.630849e-31  Aplp2      1
#> 4  5.144824e-26  Mmp12      0
#> 5  1.122577e-23  Stac3      0
#> 6  1.954121e-22  Gpnmb      0
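
The count column can be thought of as the number of times the target sequence occurs in a feature sequence. The base R sketch below counts exact, non-overlapping occurrences in a few made-up sequences; count_exact() and the toy sequences are assumptions for illustration, and SeedMatchR itself may count and aggregate matches differently.

```r
# Target sequence for the mer7m8 seed, taken from the get_seed() output
target.seq <- "CTCTATA"

# Made-up 3' UTR sequences named by a hypothetical transcript ID
utrs <- c(tx1 = "AAACTCTATAGGGCTCTATATT",
          tx2 = "GGGGCCCCAAAATTTT",
          tx3 = "CTCTATA")

# Count exact, non-overlapping occurrences of a pattern in one sequence
count_exact <- function(pattern, subject) {
  hits <- gregexpr(pattern, subject, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

counts <- vapply(utrs, count_exact, integer(1), pattern = target.seq)
counts
# tx1 = 2, tx2 = 0, tx3 = 1
```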

Match multiple seeds

You can perform seed matching for all available seeds using a for loop. Each result is appended as a new column to the results data frame.

for (seed in c("mer8", "mer6", "mer7A1")){
  res <- SeedMatchR(res, 
                    anno.db$gtf, 
                    features$seqs, 
                    guide.seq, 
                    seed.name = seed)
}

head(res)
#>              gene_id  baseMean log2FoldChange     lfcSE      stat        pvalue
#> 1 ENSRNOG00000016275 2138.0945      -8.164615        NA -23.61818 2.507268e-123
#> 2 ENSRNOG00000000127  437.6342      -1.346927 0.1068629 -12.60425  2.000712e-36
#> 3 ENSRNOG00000047179 1590.1745      -1.262411 0.1031403 -12.23974  1.906387e-34
#> 4 ENSRNOG00000030187  131.9206       3.422725 0.3032352  11.28736  1.515189e-29
#> 5 ENSRNOG00000008050   38.9921      -3.442834 0.3192776 -10.78320  4.132589e-27
#> 6 ENSRNOG00000008816  400.9526       2.794453 0.2661369  10.50006  8.632549e-26
#>            padj symbol mer7m8 mer8 mer6 mer7A1
#> 1 3.405371e-119    Ttr      1    1    1      1
#> 2  1.358683e-32  Kpna6      0    0    0      0
#> 3  8.630849e-31  Aplp2      1    0    1      0
#> 4  5.144824e-26  Mmp12      0    0    0      0
#> 5  1.122577e-23  Stac3      0    0    0      0
#> 6  1.954121e-22  Gpnmb      0    0    0      0

Match seeds with mismatches and indels allowed

You can also allow for inexact seed matches in your analysis with the mismatches and indels arguments. The output column names can be set with the col.name argument so that they reflect these settings.

for (indel.bool in c(TRUE, FALSE)){
  for (mm in c(0,1,2)){
    for (seed in c("mer7m8", "mer8", "mer6", "mer7A1")){
      res <- SeedMatchR(res, 
                        anno.db$gtf, 
                        features$seqs, 
                        guide.seq, 
                        seed.name = seed, 
                        col.name = paste0(seed, ".", "mm", mm, "_indel", indel.bool), 
                        mismatches = mm, 
                        indels = indel.bool)
    }
  }
}

head(res)
#>              gene_id  baseMean log2FoldChange     lfcSE      stat        pvalue
#> 1 ENSRNOG00000016275 2138.0945      -8.164615        NA -23.61818 2.507268e-123
#> 2 ENSRNOG00000000127  437.6342      -1.346927 0.1068629 -12.60425  2.000712e-36
#> 3 ENSRNOG00000047179 1590.1745      -1.262411 0.1031403 -12.23974  1.906387e-34
#> 4 ENSRNOG00000030187  131.9206       3.422725 0.3032352  11.28736  1.515189e-29
#> 5 ENSRNOG00000008050   38.9921      -3.442834 0.3192776 -10.78320  4.132589e-27
#> 6 ENSRNOG00000008816  400.9526       2.794453 0.2661369  10.50006  8.632549e-26
#>            padj symbol mer7m8 mer8 mer6 mer7A1 mer7m8.mm0_indelTRUE
#> 1 3.405371e-119    Ttr      1    1    1      1                    1
#> 2  1.358683e-32  Kpna6      0    0    0      0                    0
#> 3  8.630849e-31  Aplp2      1    0    1      0                    1
#> 4  5.144824e-26  Mmp12      0    0    0      0                    0
#> 5  1.122577e-23  Stac3      0    0    0      0                    0
#> 6  1.954121e-22  Gpnmb      0    0    0      0                    0
#>   mer8.mm0_indelTRUE mer6.mm0_indelTRUE mer7A1.mm0_indelTRUE
#> 1                  1                  1                    1
#> 2                  0                  0                    0
#> 3                  0                  1                    0
#> 4                  0                  0                    0
#> 5                  0                  0                    0
#> 6                  0                  0                    0
#>   mer7m8.mm1_indelTRUE mer8.mm1_indelTRUE mer6.mm1_indelTRUE
#> 1                    1                  1                  2
#> 2                   11                  3                 29
#> 3                    3                  2                 12
#> 4                    0                  0                  6
#> 5                    1                  1                  1
#> 6                    1                  0                  7
#>   mer7A1.mm1_indelTRUE mer7m8.mm2_indelTRUE mer8.mm2_indelTRUE
#> 1                    1                    4                  2
#> 2                    7                   94                 21
#> 3                    6                   38                 15
#> 4                    4                   18                  6
#> 5                    1                    4                  2
#> 6                    1                   18                  6
#>   mer6.mm2_indelTRUE mer7A1.mm2_indelTRUE mer7m8.mm0_indelFALSE
#> 1                 14                    6                     1
#> 2                204                   58                     0
#> 3                101                   38                     1
#> 4                 41                   16                     0
#> 5                 14                    8                     0
#> 6                 50                   20                     0
#>   mer8.mm0_indelFALSE mer6.mm0_indelFALSE mer7A1.mm0_indelFALSE
#> 1                   1                   1                     1
#> 2                   0                   0                     0
#> 3                   0                   1                     0
#> 4                   0                   0                     0
#> 5                   0                   0                     0
#> 6                   0                   0                     0
#>   mer7m8.mm1_indelFALSE mer8.mm1_indelFALSE mer6.mm1_indelFALSE
#> 1                     1                   1                   1
#> 2                     6                   3                  18
#> 3                     3                   2                  12
#> 4                     0                   0                   2
#> 5                     1                   1                   1
#> 6                     1                   0                   6
#>   mer7A1.mm1_indelFALSE mer7m8.mm2_indelFALSE mer8.mm2_indelFALSE
#> 1                     1                     3                   1
#> 2                     6                    39                  12
#> 3                     6                    22                   8
#> 4                     1                     7                   1
#> 5                     1                     2                   1
#> 6                     0                     9                   2
#>   mer6.mm2_indelFALSE mer7A1.mm2_indelFALSE
#> 1                   6                     1
#> 2                 111                    35
#> 3                  62                    22
#> 4                  25                    10
#> 5                   7                     3
#> 6                  28                    11
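
The counts grow quickly as mismatches are allowed because many more windows of a sequence qualify as a match. The sketch below illustrates this with a simple sliding-window Hamming distance in base R. It is an illustration only: count_mismatch() is a hypothetical helper, it does not handle indels, it counts overlapping windows separately, and it is not the matching routine used by SeedMatchR.

```r
# Count windows of `subject` that differ from `pattern` at no more than
# `max.mismatch` positions (substitutions only, no indels)
count_mismatch <- function(pattern, subject, max.mismatch = 0) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n.windows <- length(s) - length(p) + 1
  if (n.windows < 1) return(0L)
  hits <- vapply(seq_len(n.windows), function(i) {
    window <- s[i:(i + length(p) - 1)]
    sum(window != p) <= max.mismatch
  }, logical(1))
  sum(hits)
}

target.seq <- "CTCTATA"
# Made-up sequence with one exact site and one site with a single substitution
utr <- "AAACTCTATAGGGCTCAATATT"

count_mismatch(target.seq, utr, max.mismatch = 0)  # 1
count_mismatch(target.seq, utr, max.mismatch = 1)  # 2
```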

Comparing the expression profiles of seed targets to background

Many factors that perturb gene expression, such as miRNAs, cause cumulative shifts in the expression of their target genes. These shifts can be visualized with the empirical cumulative distribution function (ECDF) and tested with a statistical test such as the Kolmogorov-Smirnov test.

SeedMatchR provides functions for comparing the log2(Fold Change) of two gene sets. The function deseq_fc_ecdf() is designed to work directly with a DESeq2 results data frame.

Required Inputs:

  • res: DESeq2 results data frame
  • gene.lists: A named list of gene ID vectors, one per gene set

# Gene set 1: genes with a mer7m8 match but no mer8 match
mer7m8.list = res$gene_id[res$mer7m8.mm0_indelFALSE >= 1 & res$mer8.mm0_indelFALSE == 0]

# Gene set 2: genes with at least one mer8 match
mer8.list = res$gene_id[res$mer8.mm0_indelFALSE >= 1]

# Background: genes with neither a mer7m8 nor a mer8 match
background.list = res$gene_id[res$mer7m8.mm0_indelFALSE == 0 & res$mer8.mm0_indelFALSE == 0]

ecdf.results = deseq_fc_ecdf(res, 
                             list("Background" = background.list, "mer8" = mer8.list, "mer7m8" = mer7m8.list),
                             stats.test = "KS", 
                             factor.order = c("Background", "mer8", "mer7m8"), 
                             null.name = "Background",
                             target.name = "mer8")
#> Comparing: Background vs. mer8
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ecdf.results$plot
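
The idea behind this comparison can be sketched with base R alone. The example below uses simulated log2 fold changes, not the Schlegel et al. data, and assumes a target set that is shifted towards negative values relative to the background. It relies only on stats::ecdf() and stats::ks.test() and is not a description of how deseq_fc_ecdf() is implemented.

```r
set.seed(1)

# Simulated log2 fold changes; the shift of the target set is an assumption
background <- rnorm(500, mean = 0,    sd = 1)
targets    <- rnorm(100, mean = -0.8, sd = 1)

# Empirical cumulative distribution functions for each gene set
bg.ecdf <- ecdf(background)
tg.ecdf <- ecdf(targets)

# Fraction of genes in each set with log2FC <= 0; a down-regulated target
# set has a curve shifted to the left, so this fraction is larger
c(background = bg.ecdf(0), targets = tg.ecdf(0))

# Two-sample Kolmogorov-Smirnov test comparing the two distributions
ks <- ks.test(targets, background)
ks$statistic
ks$p.value
```

The KS statistic is the largest vertical distance between the two ECDF curves, so a clear left shift of the target curve gives a large statistic and a small p-value.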

Using SeedMatchR to explore potential small activating RNA effects

# Group transcripts by gene
sequences <- transcriptsBy(anno.db$tx.db, by="gene")

# Extract promoter sequences from tx.db object
prom.seq = getPromoterSeq(sequences,
                          anno.db$dna,
                          upstream = 2000,
                          downstream = 100)

# Perform a seed search of the promoter sequences. Set tx.id.col to FALSE to use gene annotations
res = SeedMatchR(res, anno.db$gtf, prom.seq@unlistData, guide.seq, tx.id.col = FALSE, col.name = "promoter.mer7m8")

# Find the genes with matches
promoterWseed = res$gene_id[res$promoter.mer7m8 >= 1]

# Generate the background list of genes
background.list = res$gene_id[!(res$gene_id %in% promoterWseed)]

# Plot ecdf results for promoter matches with stats testing
ecdf.results = deseq_fc_ecdf(res, 
                             title = "Ttr D1 30mkg",
                             list("Background" = background.list, 
                                  "Promoter w/ mer7m8" = promoterWseed),
                             stats.test = "KS", 
                             factor.order = c("Background", 
                                              "Promoter w/ mer7m8"), 
                             null.name = "Background",
                             target.name = "Promoter w/ mer7m8",
                             palette = c("black", "#d35400"))
#> Comparing: Background vs. Promoter w/ mer7m8
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ecdf.results$plot
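
For readers unfamiliar with promoter windows, the sketch below shows how a window of 2000 bases upstream and 100 bases downstream can be derived from a transcription start site (TSS) in a strand-aware way. It assumes the coordinate convention documented for GenomicRanges::promoters(); promoter_window() and the coordinates are made up for illustration.

```r
# Made-up transcription start sites on both strands
tss <- data.frame(gene_id = c("geneA", "geneB"),
                  tss     = c(5000, 8000),
                  strand  = c("+", "-"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: strand-aware promoter window around a TSS.
# Plus strand:  [tss - upstream, tss + downstream - 1]
# Minus strand: [tss - downstream + 1, tss + upstream]
promoter_window <- function(tss, strand, upstream = 2000, downstream = 100) {
  plus  <- strand == "+"
  start <- ifelse(plus, tss - upstream, tss - downstream + 1)
  end   <- ifelse(plus, tss + downstream - 1, tss + upstream)
  data.frame(start = start, end = end, width = end - start + 1)
}

win <- cbind(tss, promoter_window(tss$tss, tss$strand))
win
# geneA: 3000-5099, geneB: 7901-10000, both 2100 bases wide
```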

sessionInfo() 
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux
#> 
#> Matrix products: default
#> BLAS/LAPACK: /lrlhps/apps/intel/intel-2020/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] rtracklayer_1.58.0     GenomicFeatures_1.50.4 AnnotationDbi_1.60.0  
#>  [4] Biobase_2.58.0         GenomicRanges_1.50.2   msa_1.30.1            
#>  [7] Biostrings_2.66.0      GenomeInfoDb_1.34.9    XVector_0.38.0        
#> [10] IRanges_2.32.0         S4Vectors_0.36.1       BiocGenerics_0.44.0   
#> [13] SeedMatchR_1.0.1      
#> 
#> loaded via a namespace (and not attached):
#>   [1] AnnotationHub_3.6.0           BiocFileCache_2.6.0          
#>   [3] lazyeval_0.2.2                BiocParallel_1.32.5          
#>   [5] seqmagick_0.1.5               ggplot2_3.4.2                
#>   [7] digest_0.6.31                 yulab.utils_0.0.6            
#>   [9] htmltools_0.5.5               fansi_1.0.4                  
#>  [11] magrittr_2.0.3                memoise_2.0.1                
#>  [13] BSgenome_1.62.0               extrafont_0.19               
#>  [15] matrixStats_1.0.0             extrafontdb_1.0              
#>  [17] prettyunits_1.1.1             colorspace_2.1-0             
#>  [19] blob_1.2.4                    rappdirs_0.3.3               
#>  [21] xfun_0.39                     dplyr_1.1.2                  
#>  [23] crayon_1.5.2                  RCurl_1.98-1.12              
#>  [25] jsonlite_1.8.5                ape_5.7-1                    
#>  [27] glue_1.6.2                    polyclip_1.10-4              
#>  [29] gtable_0.3.3                  zlibbioc_1.44.0              
#>  [31] DelayedArray_0.24.0           proj4_1.0-12                 
#>  [33] R4RNA_1.26.0                  Rttf2pt1_1.3.12              
#>  [35] maps_3.4.1                    scales_1.2.1                 
#>  [37] DBI_1.1.3                     Rcpp_1.0.10                  
#>  [39] xtable_1.8-4                  progress_1.2.2               
#>  [41] gridGraphics_0.5-1            tidytree_0.4.2               
#>  [43] bit_4.0.5                     httr_1.4.6                   
#>  [45] RColorBrewer_1.1-3            ellipsis_0.3.2               
#>  [47] pkgconfig_2.0.3               XML_3.99-0.14                
#>  [49] farver_2.1.1                  sass_0.4.6                   
#>  [51] dbplyr_2.3.2                  utf8_1.2.3                   
#>  [53] ggmsa_1.4.0                   labeling_0.4.2               
#>  [55] ggplotify_0.1.0               tidyselect_1.2.0             
#>  [57] rlang_1.1.1                   later_1.3.1                  
#>  [59] munsell_0.5.0                 BiocVersion_3.14.0           
#>  [61] tools_4.1.2                   cachem_1.0.8                 
#>  [63] cli_3.6.1                     twosamples_2.0.0             
#>  [65] generics_0.1.3                RSQLite_2.3.1                
#>  [67] evaluate_0.21                 stringr_1.5.0                
#>  [69] fastmap_1.1.1                 yaml_2.3.7                   
#>  [71] ggtree_3.6.2                  knitr_1.43                   
#>  [73] bit64_4.0.5                   purrr_1.0.1                  
#>  [75] KEGGREST_1.38.0               nlme_3.1-153                 
#>  [77] mime_0.12                     ash_1.0-15                   
#>  [79] testit_0.13                   aplot_0.1.10                 
#>  [81] xml2_1.3.4                    biomaRt_2.54.0               
#>  [83] compiler_4.1.2                rstudioapi_0.13              
#>  [85] filelock_1.0.2                curl_5.0.1                   
#>  [87] png_0.1-8                     interactiveDisplayBase_1.36.0
#>  [89] treeio_1.22.0                 tibble_3.2.1                 
#>  [91] tweenr_2.0.2                  bslib_0.5.0                  
#>  [93] stringi_1.7.12                highr_0.10                   
#>  [95] ggalt_0.4.0                   lattice_0.20-45              
#>  [97] Matrix_1.5-3                  vctrs_0.6.2                  
#>  [99] pillar_1.9.0                  lifecycle_1.0.3              
#> [101] BiocManager_1.30.21           jquerylib_0.1.4              
#> [103] cowplot_1.1.1                 bitops_1.0-7                 
#> [105] httpuv_1.6.11                 patchwork_1.1.2              
#> [107] R6_2.5.1                      BiocIO_1.8.0                 
#> [109] promises_1.2.0.1              KernSmooth_2.23-20           
#> [111] codetools_0.2-18              MASS_7.3-54                  
#> [113] SummarizedExperiment_1.28.0   rjson_0.2.21                 
#> [115] withr_2.5.0                   GenomicAlignments_1.34.0     
#> [117] Rsamtools_2.14.0              GenomeInfoDbData_1.2.9       
#> [119] parallel_4.1.2                hms_1.1.3                    
#> [121] grid_4.1.2                    ggfun_0.0.9                  
#> [123] tidyr_1.3.0                   rmarkdown_2.22               
#> [125] MatrixGenerics_1.10.0         ggforce_0.4.1                
#> [127] shiny_1.7.4                   restfulr_0.0.15