
bedrockbio

Open-Access Computational Biology Datasets

Description

Efficiently access a curated library of open-access computational biology datasets. Tables support predicate and projection pushdown to the cloud storage backend: row filters and column selections are applied at the source, so only the matching subset is read. This enables quick, iterative access to otherwise massive, unwieldy tables.

bedrockbio consists of five user-facing functions:

list_namespaces(): returns a character vector of available namespace (data source) identifiers
describe_namespace("<name>"): returns a namespace’s name, citation, license, context, and its tables
list_tables(namespace): returns a character vector of table identifiers, optionally filtered to one namespace
describe_table("<name>"): returns a table’s context, column definitions, and partition columns (with their allowed values)
load_table("<name>"): returns a lazily-evaluated data frame for a table
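
As a rough sketch, the discovery functions above might be chained like this. It assumes that the identifiers returned by list_tables() can be passed directly to describe_table(), and it uses str() to inspect the results because the exact shape of the returned objects is not specified here.

```r
library(bedrockbio)

# Character vector of available namespace (data source) identifiers.
namespaces <- list_namespaces()
print(namespaces)

# Pick the first namespace purely for illustration.
ns <- namespaces[1]

# Name, citation, license, context, and tables for that namespace.
# str() is used because the return structure is not assumed here.
str(describe_namespace(ns))

# Table identifiers restricted to the chosen namespace.
tables <- list_tables(ns)

# Assumption: these identifiers are accepted as-is by describe_table().
str(describe_table(tables[1]))
```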

dplyr verbs (filter, select) can be used on the data frame returned by load_table to push down row filters and column selections to the storage backend. Filtering on the partition columns returned by describe_table gives the fastest reads.
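
A minimal sketch of this lazy workflow follows. The table, columns, and partition values are taken from the example later in this README; the use of head() to preview a few rows before a full collect() is an assumption that the lazy data frame supports it, as dbplyr and arrow lazy tables do.

```r
library(bedrockbio)
library(dplyr)

# Build the query lazily; no table data is collected at this point.
query <- load_table("ukb_ppp.pqtls") |>
  filter(ancestry == "EUR", panel == "Inflammation") |>
  select(protein_id, chromosome, position, beta)

# Collect only a handful of rows to sanity-check the query
# before running a full collect().
# Assumption: head() is supported on the lazy data frame.
preview <- query |>
  head(5) |>
  collect()

print(preview)
```

Keeping the uncollected query in a variable makes it easy to tighten the filters iteratively and to call collect() only once the subset is small enough to hold in memory.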

Installation

Install from CRAN:

install.packages("bedrockbio")

Or install the current development version from GitHub:

# install.packages("pak")
pak::pak("bedrock-bio/bedrock-bio-client/r")

The R package supports macOS and Linux only: the DuckDB iceberg extension has no MinGW build, so it cannot load on R for Windows. Windows users can use the Python client instead, which works on all platforms.

Examples

Load the package (and dplyr for downstream data frame manipulation):

library(bedrockbio)
library(dplyr)

List available tables:

list_tables()

Describe a table to see its context, column definitions, and partition columns:

describe_table("ukb_ppp.pqtls")

Lazily load a table, filter on partition columns (for fastest reads), select columns, and collect the relevant subset into an in-memory data frame:

df <- load_table("ukb_ppp.pqtls") |>
  filter(
    ancestry == "EUR",
    protein_id == "A0FGR8",
    panel == "Inflammation"
  ) |>
  select(
    chromosome,
    position,
    effect_allele,
    other_allele,
    beta,
    neg_log_10_p_value
  ) |>
  collect()
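
Once collected, the result is an ordinary in-memory data frame, so any dplyr verb applies. The sketch below converts neg_log_10_p_value back to a p-value and keeps genome-wide significant rows. It assumes the column holds -log10(p), as its name suggests, and it uses a small stand-in data frame with made-up values so that it runs without network access; the 5e-8 threshold is the conventional genome-wide cutoff, not something prescribed by bedrockbio.

```r
library(dplyr)

# Stand-in for the collected `df`; these rows are made-up illustrative
# values, not real results from the ukb_ppp.pqtls table.
df <- tibble(
  chromosome = c("1", "1", "2"),
  position = c(1000L, 2000L, 3000L),
  beta = c(0.12, -0.03, 0.40),
  neg_log_10_p_value = c(9.5, 2.1, 7.4)
)

# Conventional genome-wide significance threshold (p < 5e-8),
# expressed on the -log10 scale.
threshold <- -log10(5e-8)

hits <- df |>
  # Assumption: neg_log_10_p_value is -log10(p), so p = 10^(-value).
  mutate(p_value = 10^(-neg_log_10_p_value)) |>
  filter(neg_log_10_p_value > threshold) |>
  arrange(desc(neg_log_10_p_value))

print(hits)
```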

Dataset Requests

To request the addition of a new table to the library, open an issue.
