Reproducibility with seeker

Jake Hughey

2024-08-26

Using the seeker package together with Docker, it’s easy to make fetching and processing of sequencing and microarray data completely reproducible. First pull the latest version of the socker image (ghcr.io/hugheylab/socker), which has seeker and its dependencies already installed.
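The pull step can be sketched as follows. This is a minimal sketch, assuming Docker is installed and its daemon is running; the image name is the one used in run_seeker.sh, and pull_socker is a hypothetical helper name introduced here for illustration.

```shell
# Image name as it appears in run_seeker.sh; with no tag given,
# Docker resolves it to the "latest" tag.
image='ghcr.io/hugheylab/socker'

# Hypothetical helper: pull the image, failing early with a clear
# message if the docker CLI is not on PATH.
pull_socker() {
  if ! command -v docker > /dev/null 2>&1; then
    echo 'docker not found on PATH' >&2
    return 1
  fi
  docker pull "$image"
}
```

Calling pull_socker is equivalent to running "docker pull ghcr.io/hugheylab/socker" directly; the wrapper only adds the availability check.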

RNA-seq data

The seeker package includes an example yaml file, R script, and shell script for fetching and processing a subset of an RNA-seq dataset. Here we’ll download the files from GitHub to avoid having to install the package locally:

urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('PRJNA600892.yml', 'run_seeker.R', 'run_seeker.sh')) {
  download.file(paste0(urlBase, filename), filename)}

PRJNA600892.yml:

study: 'PRJNA600892' # [string]
metadata:
  run: TRUE # [logical]
  bioproject: 'PRJNA600892' # [string]
  include:
    # [named list or NULL]
    colname: 'run_accession' # [string]
    values: ['SRR10876945', 'SRR10876946'] # [vector]
  # exclude # [named list or NULL]
    # colname # [string]
    # values # [vector]
fetch:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # overwrite # [logical or NULL]
  # keepSra # [logical or NULL]
  # prefetchCmd # [string or NULL]
  # prefetchArgs # [character vector or NULL]
  # fasterqdumpCmd # [string or NULL]
  # fasterqdumpArgs # [character vector or NULL]
  # pigzCmd # [string or NULL]
  # pigzArgs # [character vector or NULL]
trimgalore:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
  # pigzCmd # [string or NULL]
fastqc:
  run: TRUE # [logical]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
salmon:
  run: TRUE # [logical]
  indexDir: '~/refgenie_genomes/alias/mm10/salmon_partial_sa_index/default' # [string]
  # sampleColname # [string or NULL]
  # keep # [logical or NULL]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
multiqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
tximport:
  run: TRUE # [logical]
  tx2gene:
    # [named list or NULL]
    organism: 'mmusculus' # [string]
    # version # [number or NULL]
    # filename # [string or NULL]
  countsFromAbundance: 'lengthScaledTPM' # [string]
  # ignoreTxVersion # [logical or NULL]

run_seeker.R:

doParallel::registerDoParallel()

cArgs = commandArgs(TRUE)
yamlPath = cArgs[1L]
parentDir = cArgs[2L]

params = yaml::read_yaml(yamlPath)
seeker::seeker(params, parentDir)

run_seeker.sh:

#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c \
    "source /home/rstudio/miniconda3/etc/profile.d/conda.sh \
      && conda activate seeker \
      && refgenie pull mm10/salmon_partial_sa_index \
      && Rscript run_seeker.R PRJNA600892.yml ." \
  &> PRJNA600892_progress.log

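Now run the shell script. Use bash rather than sh, because the script relies on the bash-specific “&>” redirection, which shells such as dash do not handle as intended:

bash run_seeker.sh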
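Because run_seeker.sh redirects all output to PRJNA600892_progress.log, nothing is printed to the terminal while it runs. The following is a minimal sketch for checking on the run from a second terminal; show_progress is a hypothetical helper, not part of seeker or socker.

```shell
# Hypothetical helper: print the most recent lines of a progress log.
# Usage: show_progress LOGFILE [NLINES]   (NLINES defaults to 20)
show_progress() {
  log=$1
  n=${2:-20}
  if [ -s "$log" ]; then
    # The log exists and is non-empty: show its last n lines.
    tail -n "$n" "$log"
  else
    echo "no output in $log yet"
  fi
}
```

For example, "show_progress PRJNA600892_progress.log" prints the last 20 lines. To stream new lines continuously instead, "tail -f PRJNA600892_progress.log" does the same job.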

The output will appear in your working directory. You can follow seeker()’s progress in the log file, PRJNA600892_progress.log. To process a different dataset, modify the yaml file and shell script accordingly. Note that this example uses “salmon_partial_sa_index” from refgenie to minimize computational requirements; for actual use we recommend “salmon_sa_index”.
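Switching from the partial index to the full one touches two places: the refgenie pull line in run_seeker.sh and the indexDir entry in PRJNA600892.yml. The sketch below makes both substitutions. It assumes that refgenie stores the full index under the parallel path ending in mm10/salmon_sa_index/default, which is worth verifying on your system; use_full_index is a hypothetical helper name.

```shell
# Hypothetical helper: replace the partial salmon index name with the
# full one in each given file, keeping a .bak copy of the original.
# Usage: use_full_index FILE...
use_full_index() {
  for f in "$@"; do
    cp "$f" "$f.bak" || return 1
    sed 's/salmon_partial_sa_index/salmon_sa_index/g' "$f.bak" > "$f" || return 1
  done
}
```

For example, "use_full_index PRJNA600892.yml run_seeker.sh" updates both files in place. Expect the full index to need noticeably more download time, disk space, and memory than the partial one.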

Microarray data

The seeker package also includes an example yaml file, R script, and shell script for fetching and processing a microarray dataset. Download the files to your working directory:

urlBase = 'https://raw.githubusercontent.com/hugheylab/seeker/master/inst/extdata/'
for (filename in c('GSE25585.yml', 'run_seeker_array.R', 'run_seeker_array.sh')) {
  download.file(paste0(urlBase, filename), filename)}

GSE25585.yml:

study: 'GSE25585'
geneIdType: 'entrez'

run_seeker_array.R:

cArgs = commandArgs(TRUE)

params = yaml::read_yaml(cArgs[1L])
parentDir = cArgs[2L]

seeker::seekerArray(
  study = params$study, geneIdType = params$geneIdType,
  platform = params$platform, parentDir)

run_seeker_array.sh:

#!/bin/sh

docker run \
  --mount type=bind,src=`pwd`,dst=/home/rstudio/projects \
  -w /home/rstudio/projects \
  --rm \
  ghcr.io/hugheylab/socker \
  bash -c "Rscript run_seeker_array.R GSE25585.yml ." \
  &> GSE25585_progress.log

Now run the shell script with bash, which the “&>” redirection in the script requires:

bash run_seeker_array.sh

The output will appear in your working directory. You can follow seekerArray()’s progress in the log file, GSE25585_progress.log. To process a different dataset, modify the yaml file and shell script accordingly.
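For the microarray example, the accession appears in the yaml file's name and contents and in the shell script (yaml path and log name). The sketch below derives a new pair of files for another GEO series. It assumes the example files are in the current directory; retarget_array is a hypothetical helper, and GSE00000 in the usage note is a placeholder, not a real dataset.

```shell
# Hypothetical helper: create a yaml file and shell script for another
# GEO series from the GSE25585 example files, leaving the originals intact.
# Usage: retarget_array NEW_ACCESSION
retarget_array() {
  new=$1
  # New yaml file with the study field pointing at the new accession.
  sed "s/GSE25585/$new/g" GSE25585.yml > "$new.yml" || return 1
  # New shell script referencing the new yaml file and log file.
  sed "s/GSE25585/$new/g" run_seeker_array.sh > "run_seeker_array_$new.sh"
}
```

For example, "retarget_array GSE00000" writes GSE00000.yml and run_seeker_array_GSE00000.sh. The geneIdType entry is copied unchanged. Since run_seeker_array.R already reads params$platform, a platform entry can be added to the new yaml file by hand if the dataset calls for one.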
