scfetch - Access and Format Single-cell RNA-seq Datasets from Public Resources

scfetch is designed to accelerate downloading and preparing single-cell datasets from public resources. It can be used to:

- Download fastq files from GEO/SRA and format them to the standard style that can be identified by 10x software (e.g. CellRanger).
- Download bam files from GEO/SRA, supporting original 10x generated bam files (with custom tags) as well as normal bam files, and convert bam files to fastq files.
- Download count matrices and annotations from GEO, PanglaoDB and UCSC Cell Browser, and load the downloaded matrix to Seurat.
- Download processed objects from Zenodo and CELLxGENE.
- Perform object conversion between SeuratObject, AnnData, SingleCellExperiment, CellDataSet/cell_data_set and loom.

scfetch is an R package distributed via CRAN. To install the package, start R and enter:
# install via CRAN (v0.5.0); old version, it's better to install via GitHub
install.packages("scfetch")
# if you install from CRAN, you should also install the following packages
# install.packages("devtools") # in case you have not installed it
devtools::install_github("alexvpickering/GEOfastq") # download fastq
devtools::install_github("cellgeni/sceasy") # format conversion
devtools::install_github("mojaveazure/seurat-disk") # format conversion
devtools::install_github("satijalab/seurat-wrappers") # format conversion
# install via GitHub (v0.5.0)
devtools::install_github("showteeth/scfetch")
For data structure conversion, scfetch requires several Python packages, which you can install with:
# install python packages
conda install -c bioconda loompy anndata
# or
pip install anndata loompy
In general, it is recommended to install from the GitHub repository (updated more timely).
Once scfetch is installed, it can be loaded with the following command.
library("scfetch")
Since the downloading process is time-consuming, we provide the commands here to illustrate the usage.
For fastq files stored in SRA, scfetch can extract sample information and run numbers with a GEO accession number, or users can provide a dataframe containing the run numbers of the samples of interest.
Extract all samples under GSE130636 with platform GPL20301 (use platform = NULL for all platforms):
GSE130636.runs <- ExtractRun(acce = "GSE130636", platform = "GPL20301")
With the dataframe containing GSM and run numbers, scfetch will download sra files using prefetch. The returned result is a dataframe containing failed runs. If not NULL, users can re-run DownloadSRA by setting gsm.df to the returned result.
# a small test
GSE130636.runs <- GSE130636.runs[GSE130636.runs$run %in% c("SRR9004346", "SRR9004351"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE130636.down <- DownloadSRA(
gsm.df = GSE130636.runs,
out.folder = out.folder
)
# GSE130636.down is NULL or a dataframe containing failed runs
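The re-run pattern described above (passing the returned dataframe of failed runs back via gsm.df) can be wrapped in a small retry loop. The helper below is a hypothetical sketch, not part of scfetch; it works for any download function that returns NULL on success or a dataframe of still-failed runs:

```r
# Hypothetical helper: repeatedly call a scfetch-style download function,
# feeding the failed runs back in, until everything succeeds or the
# attempt limit is reached.
RetryDownload <- function(gsm.df, download.fun, max.attempts = 3) {
  failed <- gsm.df
  for (i in seq_len(max.attempts)) {
    failed <- download.fun(failed)
    if (is.null(failed)) break  # all runs downloaded
  }
  failed  # NULL on success, otherwise dataframe of still-failed runs
}

# usage sketch (out.folder and GSE130636.runs as above):
# RetryDownload(GSE130636.runs,
#               function(df) DownloadSRA(gsm.df = df, out.folder = out.folder))
```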
The out.folder structure will be: gsm_number/run_number.
After obtaining the sra files, scfetch provides the function SplitSRA to split sra files to fastq files using parallel-fastq-dump (parallel, fastest, gzip output), fasterq-dump (parallel, fast, but unzipped output) and fastq-dump (slowest, gzip output).
For fastqs generated with 10x Genomics, SplitSRA can identify read1, read2 and index files, and format read1 and read2 to the 10x required format (sample1_S1_L001_R1_001.fastq.gz and sample1_S1_L001_R2_001.fastq.gz). In detail, the file with read length 26 or 28 is considered read1, files with read length 8 or 10 are considered index files, and the remaining file is considered read2. The read length rules are from Sequencing Requirements for Single Cell 3’ and Sequencing Requirements for Single Cell V(D)J.
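The read-length rule above can be sketched as a tiny standalone function (illustrative only; the function name and input are hypothetical, SplitSRA applies this logic internally):

```r
# Classify a fastq file's role in a 10x run from its read length,
# following the rule described above: 26/28 -> read1 (barcode + UMI),
# 8/10 -> index, anything else -> read2 (cDNA).
ClassifyRead <- function(read.length) {
  if (read.length %in% c(26, 28)) {
    "read1"
  } else if (read.length %in% c(8, 10)) {
    "index"
  } else {
    "read2"
  }
}

sapply(c(28, 8, 91), ClassifyRead)
# c("read1", "index", "read2")
```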
The returned result is a vector of failed sra files. If not NULL, users can re-run SplitSRA by setting sra.path to the returned result.
# parallel-fastq-dump requires sratools.path
# you may need to set split.cmd.path and sratools.path
sra.folder <- tempdir()
GSE130636.split <- SplitSRA(
sra.folder = sra.folder,
fastq.type = "10x", split.cmd.threads = 4
)
scfetch can extract sample information and run numbers with a GEO accession number, or users can provide a dataframe containing the run numbers of the samples of interest.
GSE138266.runs <- ExtractRun(acce = "GSE138266", platform = "GPL18573")
With the dataframe containing GSM and run numbers, scfetch provides DownloadBam to download bam files using prefetch. It supports 10x generated bam files and normal bam files:

- For 10x generated bam files, when downloading with prefetch, scfetch adds --type TenX to make sure the downloaded bam files contain the custom tags.
- For normal bam files, DownloadBam will download sra files first and then convert the sra files to bam files with sam-dump. After testing the efficiency of prefetch + sam-dump versus sam-dump alone, the former is much faster than the latter (52G sra and 72G bam files):

# # use prefetch to download sra file
# prefetch -X 60G SRR1976036
# # real 117m26.334s
# # user 16m42.062s
# # sys 3m28.295s
# # use sam-dump to convert sra to bam
# time (sam-dump SRR1976036.sra | samtools view -bS - -o SRR1976036.bam)
# # real 536m2.721s
# # user 749m41.421s
# # sys 20m49.069s
# # use sam-dump to download bam directly
# time (sam-dump SRR1976036 | samtools view -bS - -o SRR1976036.bam)
# # more than 36hrs only get ~3G bam files, too slow
The returned result is a dataframe containing failed runs (for normal bam, runs that failed to download sra files or failed to convert to bam files; for 10x generated bam, runs that failed to download bam files). If not NULL, users can re-run DownloadBam by setting gsm.df to the returned result. The following is an example to download a 10x generated bam file:
# a small test
GSE138266.runs <- GSE138266.runs[GSE138266.runs$run %in% c("SRR10211566"), ]
# download, you may need to set prefetch.path
out.folder <- tempdir()
GSE138266.down <- DownloadBam(
gsm.df = GSE138266.runs,
out.folder = out.folder
)
# GSE138266.down is NULL or a dataframe containing failed runs
The out.folder structure will be: gsm_number/run_number.
With downloaded bam files, scfetch provides the function Bam2Fastq to convert bam files to fastq files. For bam files generated by 10x software, Bam2Fastq utilizes the bamtofastq tool developed by 10x Genomics.
The returned result is a vector of bam files that failed to convert to fastq files. If not NULL, users can re-run Bam2Fastq by setting bam.path to the returned result.
bam.folder <- tempdir()
# you may need to set bamtofastq.path and bamtofastq.paras
GSE138266.convert <- Bam2Fastq(
bam.folder = bam.folder
)
scfetch provides functions for users to download count matrices and annotations (e.g. cell type annotation and composition) from GEO and some single-cell databases (e.g. PanglaoDB and UCSC Cell Browser). scfetch also supports loading the downloaded data to Seurat.
To date, the supported public resources and their returned results are:
Resources | URL | Download Type | Returned results |
---|---|---|---|
GEO | https://www.ncbi.nlm.nih.gov/geo/ | count matrix | SeuratObject or count matrix for bulk RNA-seq |
PanglaoDB | https://panglaodb.se/index.html | count matrix | SeuratObject |
UCSC Cell Browser | https://cells.ucsc.edu/ | count matrix | SeuratObject |
GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. It provides a very convenient way for users to explore and select interested scRNA-seq datasets.
scfetch provides ExtractGEOMeta to extract sample metadata, including sample title, source name/tissue, description, cell type, treatment, paper title, paper abstract, organism, protocol, data processing methods, etc.
# extract metadata of specified platform
GSE200257.meta <- ExtractGEOMeta(acce = "GSE200257", platform = "GPL24676")
# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata of all platforms
GSE94820.meta <- ExtractGEOMeta(acce = "GSE94820", platform = NULL)
After manually checking the extracted metadata, users can download the count matrix and load it to Seurat with ParseGEO.
For the count matrix, ParseGEO supports downloading the matrix from supplementary files or extracting it from ExpressionSet; users can control the source by specifying down.supp or let it be detected automatically (ParseGEO will extract the count matrix from ExpressionSet first; if the count matrix is NULL or contains non-integer values, ParseGEO will download the supplementary files). Since the supplementary files have two main types, a single count matrix file containing all cells or CellRanger-style outputs (barcode, matrix, feature/gene), users are required to choose the type of supplementary files with supp.type.
With the count matrix, ParseGEO will load the matrix to Seurat automatically. If multiple samples are available, users can choose to merge the SeuratObjects with merge.
# for cellranger output
out.folder <- tempdir()
GSE200257.seu <- ParseGEO(
acce = "GSE200257", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "10x",
out.folder = out.folder
)
# for count matrix, no need to specify out.folder, download count matrix to tmp folder
GSE94820.seu <- ParseGEO(acce = "GSE94820", platform = NULL, supp.idx = 1, down.supp = TRUE, supp.type = "count")
For bulk RNA-seq, set data.type = "bulk" in ParseGEO; this will return the count matrix.
PanglaoDB is a database which contains scRNA-seq datasets from mouse and human. Up to now, it contains 5,586,348 cells from 1368 datasets (1063 from Mus musculus and 305 from Homo sapiens). It has well organized metadata for every dataset, including tissue, protocol, species, number of cells and cell type annotation (computationally identified). Daniel Osorio has developed rPanglaoDB to access PanglaoDB data; the functions of scfetch here are based on rPanglaoDB.
Since PanglaoDB is no longer maintained, scfetch has cached all metadata and cell type composition and uses these cached data by default to accelerate access; users can access the cached data with PanglaoDBMeta (all metadata) and PanglaoDBComposition (all cell type composition).
scfetch provides StatDBAttribute to summarize attributes of PanglaoDB:
# use cached metadata
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")
scfetch provides ExtractPanglaoDBMeta to select datasets of interest with specified species, protocol, tissue and cell number (the available values of these attributes can be obtained with StatDBAttribute). Users can also choose whether to add cell type annotation to every dataset with show.cell.type.
scfetch uses cached metadata and cell type composition by default; users can change this by setting local.data = FALSE.
hsa.meta <- ExtractPanglaoDBMeta(
species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"),
show.cell.type = TRUE, cell.num = c(1000, 2000)
)
scfetch provides ExtractPanglaoDBComposition to extract cell type annotation and composition (cached data are used by default to accelerate access; users can change this by setting local.data = FALSE).
hsa.composition <- ExtractPanglaoDBComposition(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"))
After manually checking the extracted metadata, scfetch provides ParsePanglaoDB to download the count matrix and load it to Seurat.
With available cell type annotation, users can filter out datasets without the specified cell type with cell.type. Users can also include/exclude cells expressing specified genes with include.gene/exclude.gene.
With the count matrix, ParsePanglaoDB will load the matrix to Seurat automatically. If multiple datasets are available, users can choose to merge the SeuratObjects with merge.
# small test
hsa.seu <- ParsePanglaoDB(hsa.meta[1:3, ], merge = TRUE)
The UCSC Cell Browser is a web-based tool that allows scientists to interactively visualize scRNA-seq datasets. It contains 1040 single-cell datasets from 17 different species, and it is organized in a hierarchical structure, which can help users quickly locate the datasets they are interested in.
scfetch provides ShowCBDatasets to show all available datasets. Due to the large number of datasets, ShowCBDatasets enables users to perform a lazy load of dataset json files instead of fetching the json files online every time (time-consuming!). This lazy load requires users to provide json.folder to save the json files and to set lazy = TRUE (on the first run, ShowCBDatasets will download the current json files to json.folder; on subsequent runs with lazy = TRUE, ShowCBDatasets will load the downloaded json files from json.folder). ShowCBDatasets also supports updating the local datasets with update = TRUE.
json.folder <- tempdir()
# first time run, the json files are stored under json.folder
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = TRUE)
# second time run, load the downloaded json files
ucsc.cb.samples <- ShowCBDatasets(lazy = TRUE, json.folder = json.folder, update = FALSE)
# always read online
# ucsc.cb.samples = ShowCBDatasets(lazy = FALSE)
The number of datasets and all available species:
# the number of datasets
nrow(ucsc.cb.samples)
# available species
unique(unlist(sapply(unique(gsub(pattern = "\\|parent", replacement = "", x = ucsc.cb.samples$organisms)), function(x) {
unlist(strsplit(x = x, split = ", "))
})))
scfetch provides StatDBAttribute to summarize attributes of the UCSC Cell Browser:
StatDBAttribute(df = ucsc.cb.samples, filter = c("organism", "organ"), database = "UCSC")
scfetch provides ExtractCBDatasets to filter metadata with collection, sub-collection, organ, disease status, organism, project and cell number (the available values of these attributes can be obtained with StatDBAttribute, except cell number). All attributes except cell number support fuzzy matching with fuzzy.match, which is useful when selecting datasets.
hbb.sample.df <- ExtractCBDatasets(all.samples.df = ucsc.cb.samples, organ = c("brain", "blood"), organism = "Human (H. sapiens)", cell.num = c(1000, 2000))
scfetch provides ExtractCBComposition to extract cell type annotation and composition.
hbb.sample.ct <- ExtractCBComposition(json.folder = json.folder, sample.df = hbb.sample.df)
After manually checking the extracted metadata, scfetch provides ParseCBDatasets to load the online count matrix to Seurat. All the attributes available in ExtractCBDatasets are also available here. Please note that ParseCBDatasets loads the online count matrix instead of downloading it locally. If multiple datasets are available, users can choose to merge the SeuratObjects with merge.
hbb.sample.seu <- ParseCBDatasets(sample.df = hbb.sample.df)
scfetch provides functions for users to download processed single-cell RNA-seq data from Zenodo, CELLxGENE and Human Cell Atlas, including RDS, RData, h5ad, h5 and loom objects.
To date, the supported public resources and their returned results are:
Resources | URL | Download Type | Returned results |
---|---|---|---|
Zenodo | https://zenodo.org/ | count matrix, rds, rdata, h5ad, et al. | NULL or failed datasets |
CELLxGENE | https://cellxgene.cziscience.com/ | rds, h5ad | NULL or failed datasets |
Human Cell Atlas | https://www.humancellatlas.org/ | rds, rdata, h5, h5ad, loom | NULL or failed projects |
Zenodo contains various types of processed objects, such as SeuratObject (clustered and annotated) and AnnData (processed results generated by scanpy).
scfetch provides ExtractZenodoMeta to extract dataset metadata, including dataset title, description, available files and corresponding md5 checksums. Please note that when the dataset has restricted access, the returned dataframe will be empty.
# single doi
zebrafish.df <- ExtractZenodoMeta(doi = "10.5281/zenodo.7243603")
# vector dois
multi.dois <- ExtractZenodoMeta(doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"))
After manually checking the extracted metadata, users can download the specified objects with ParseZenodo. The downloaded objects are controlled by file.ext, and the provided object formats should be in lower case (e.g. rds/rdata/h5ad).
The returned result is a dataframe containing failed objects. If not NULL, users can re-run ParseZenodo by setting doi.df to the returned result.
out.folder <- tempdir()
multi.dois.parse <- ParseZenodo(
doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
file.ext = c("rdata", "rds"), out.folder = out.folder
)
CELLxGENE is a web server containing 910 single-cell datasets; users can explore, download and upload their own datasets. The datasets provided by CELLxGENE come in two formats: h5ad (AnnData v0.8) and rds (Seurat v4).
scfetch provides ShowCELLxGENEDatasets to extract dataset metadata, including dataset title, description, contact, organism, ethnicity, sex, tissue, disease, assay, suspension type, cell type, etc.
# all available datasets
all.cellxgene.datasets <- ShowCELLxGENEDatasets()
scfetch provides StatDBAttribute to summarize attributes of CELLxGENE:
StatDBAttribute(df = all.cellxgene.datasets, filter = c("organism", "sex"), database = "CELLxGENE")
scfetch provides ExtractCELLxGENEMeta to filter dataset metadata; the available values of the attributes can be obtained with StatDBAttribute, except cell number:
# human 10x v2 and v3 datasets
human.10x.cellxgene.meta <- ExtractCELLxGENEMeta(
all.samples.df = all.cellxgene.datasets,
assay = c("10x 3' v2", "10x 3' v3"), organism = "Homo sapiens"
)
After manually checking the extracted metadata, users can download the specified objects with ParseCELLxGENE. The downloaded objects are controlled by file.ext (choose from "rds" and "h5ad").
The returned result is a dataframe containing failed datasets. If not NULL, users can re-run ParseCELLxGENE by setting meta to the returned result.
out.folder <- tempdir()
ParseCELLxGENE(
meta = human.10x.cellxgene.meta[1:5, ], file.ext = "rds",
out.folder = out.folder
)
Many tools have been developed to process scRNA-seq data, such as Scanpy, Seurat, scran and Monocle.
These tools have their own objects, such as AnnData of Scanpy, SeuratObject of Seurat, SingleCellExperiment of scran and CellDataSet/cell_data_set of Monocle2/Monocle3. There are also some file formats designed for large omics datasets, such as loom. To perform a comprehensive scRNA-seq
data analysis, we usually need to combine multiple tools, which means we
need to perform object conversion frequently. To facilitate user
analysis of scRNA-seq data, scfetch
provides multiple
functions to perform object conversion between widely used tools and
formats. Where a dedicated tool exists, the object conversion implemented in scfetch relies on it: SeuratDisk is used to convert SeuratObject to loom, and zellkonverter to perform the conversion between SingleCellExperiment and AnnData. When there is no such tool, sceasy is used to perform the conversion.
# library
library(Seurat) # pbmc_small
library(scRNAseq) # seger
SeuratObject:
# object
pbmc_small

SingleCellExperiment:
seger <- scRNAseq::SegerstolpePancreasData()
Here, we will convert SeuratObject to SingleCellExperiment, CellDataSet/cell_data_set, AnnData and loom.
To SingleCellExperiment (the conversion is performed with functions implemented in Seurat):
sce.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "SCE")
To CellDataSet
(The conversion is performed with
functions implemented in Seurat
):
# BiocManager::install("monocle") # requires monocle
cds.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", reduction = "tsne", to = "CellDataSet")
To cell_data_set
(The conversion is performed with
functions implemented in SeuratWrappers
):
# remotes::install_github('cole-trapnell-lab/monocle3') # requires monocle3
cds3.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "cell_data_set")
AnnData is a Python object; reticulate is used to communicate between Python and R. Users should create a Python environment containing the anndata package and specify the environment path with conda.path to ensure that exactly this environment is used.
The conversion is performed with functions implemented in
sceasy
:
# use a temporary h5ad file
anndata.file <- tempfile(pattern = "pbmc_small_", fileext = ".h5ad")
# you may need to set conda.path
ExportSeurat(
seu.obj = pbmc_small, assay = "RNA", to = "AnnData",
anndata.file = anndata.file
)
The conversion is performed with functions implemented in
SeuratDisk
:
loom.file <- tempfile(pattern = "pbmc_small_", fileext = ".loom")
ExportSeurat(
seu.obj = pbmc_small, assay = "RNA", to = "loom",
loom.file = loom.file
)
SingleCellExperiment to SeuratObject (the conversion is performed with functions implemented in Seurat):
seu.obj.sce <- ImportSeurat(obj = sce.obj, from = "SCE", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
CellDataSet to SeuratObject (the conversion is performed with functions implemented in Seurat):
seu.obj.cds <- ImportSeurat(obj = cds.obj, from = "CellDataSet", count.assay = "counts", assay = "RNA")
cell_data_set to SeuratObject (the conversion is performed with functions implemented in Seurat):
seu.obj.cds3 <- ImportSeurat(obj = cds3.obj, from = "cell_data_set", count.assay = "counts", data.assay = "logcounts", assay = "RNA")
AnnData is a Python object; reticulate is used to communicate between Python and R. Users should create a Python environment containing the anndata package and specify the environment path with conda.path to ensure that exactly this environment is used.
The conversion is performed with functions implemented in
sceasy
:
# you may need to set conda.path
seu.obj.h5ad <- ImportSeurat(
anndata.file = anndata.file, from = "AnnData", assay = "RNA"
)
The conversion is performed with functions implemented in
SeuratDisk
and Seurat
:
# loom will lose reduction
seu.obj.loom <- ImportSeurat(loom.file = loom.file, from = "loom")
The conversion is performed with functions implemented in
zellkonverter
.
# use a temporary h5ad file
seger.anndata.file <- tempfile(pattern = "seger_", fileext = ".h5ad")
SCEAnnData(
from = "SingleCellExperiment", to = "AnnData", sce = seger, X_name = "counts",
anndata.file = seger.anndata.file
)
seger.anndata <- SCEAnnData(
from = "AnnData", to = "SingleCellExperiment",
anndata.file = seger.anndata.file
)
The conversion is performed with functions implemented in
LoomExperiment
.
# use a temporary loom file
seger.loom.file <- tempfile(pattern = "seger_", fileext = ".loom")
SCELoom(
from = "SingleCellExperiment", to = "loom", sce = seger,
loom.file = seger.loom.file
)
seger.loom <- SCELoom(
from = "loom", to = "SingleCellExperiment",
loom.file = seger.loom.file
)