Title:

Access and Search MedRxiv and BioRxiv Preprint Data

Version:

0.1.4

Depends:

R (≥ 4.1.0)

Description:

Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv, both of which are operated by the Cold Spring Harbor Laboratory. 'medrxivr' provides programmatic access to the 'Cold Spring Harbor Laboratory (CSHL)' API https://api.biorxiv.org/, allowing users to download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list) into R. 'medrxivr' also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that export search results to a .bib file for import into a reference manager and download the full-text PDFs of preprints matching the search criteria.

License:

GPL-2

Encoding:

UTF-8

Language:

en-US

URL:

https://docs.ropensci.org/medrxivr/, https://github.com/ropensci/medrxivr

BugReports:

https://github.com/ropensci/medrxivr/issues

Imports:

methods, dplyr, curl, jsonlite, httr, stringr, rlang, bib2df, tibble, progress, lubridate, purrr, data.table

Suggests:

testthat (≥ 2.1.0), knitr, rmarkdown, covr, kableExtra, spelling

VignetteBuilder:

knitr

RoxygenNote:

7.3.2

NeedsCompilation:

Packaged:

2026-05-11 16:16:49 UTC; Bach

Author:

Yaoxiang Li [aut, cre], Luke McGuinness [aut], Lena Schmidt [aut], Tuija Sonkkila [rev], Najko Jahn [rev]

Maintainer:

Yaoxiang Li <liyaoxiang@outlook.com>

Repository:

CRAN

Date/Publication:

2026-05-12 18:40:31 UTC

medrxivr: Accessing medRxiv and bioRxiv preprint data from R

Description

The medrxivr package enables users to access data on preprints in the medRxiv and bioRxiv preprint repositories, both of which are run by the Cold Spring Harbor Laboratory. It also provides functions to search the preprint data, export it to a .bib file, and download the PDFs associated with specified records.

Author(s)

Maintainer: Yaoxiang Li liyaoxiang@outlook.com (ORCID)

Authors:

Luke McGuinness
Lena Schmidt

Other contributors:

Tuija Sonkkila [reviewer]
Najko Jahn [reviewer]

Create link for API

Description

Create link for API

Usage

api_link(...)

Arguments

...

Arguments to specify the path to the API endpoint

Value

Formatted link to API endpoint
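
As a rough illustration of what "formatting a link from path arguments" can look like, the sketch below joins path segments onto the API base URL given in the package description. The function name build_api_link_sketch() and the segment order used in the example call are assumptions made for illustration; this is not the package's own implementation.

```r
# Hypothetical helper: join path segments onto the CSHL API base URL.
# The base URL comes from the package description; everything else is assumed.
build_api_link_sketch <- function(...) {
  base <- "https://api.biorxiv.org"
  # c() coerces numeric segments (e.g. a cursor of 0) to character
  paste(c(base, ...), collapse = "/")
}

# Example call; the segment layout shown here is an assumption.
link <- build_api_link_sketch(
  "details", "medrxiv", "2020-01-01", "2020-01-07", 0, "json"
)
print(link)
```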

Extract the page size from API metadata

Description

Extract the page size from API metadata

Usage

api_page_size(messages)

Arguments

messages

API metadata messages data frame

Value

Numeric page size

Extract the total number of records from API metadata

Description

Extract the total number of records from API metadata

Usage

api_record_count(messages)

Arguments

messages

API metadata messages data frame

Value

Numeric record count
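
Page size and total record count are typically combined to work out how many requests are needed to page through a result set. The sketch below shows that arithmetic on a toy metadata data frame. The column names count and total, and the helper name page_cursors_sketch(), are assumptions for illustration and may not match the structure the package actually receives.

```r
# Toy metadata standing in for the API "messages" data frame.
# ASSUMPTION: page size is stored in `count`, total records in `total`.
toy_messages <- data.frame(status = "ok", count = 100, total = 250)

# Hypothetical helper: starting offset (cursor) of each page to request.
page_cursors_sketch <- function(messages) {
  page_size <- as.numeric(messages$count[1])
  total <- as.numeric(messages$total[1])
  if (total == 0) {
    return(numeric(0))
  }
  seq(from = 0, to = total - 1, by = page_size)
}

cursors <- page_cursors_sketch(toy_messages)
print(cursors)  # three pages, starting at offsets 0, 100 and 200
```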

Convert API data to data frame

Description

Convert API data to data frame

Usage

api_to_df(url)

Arguments

url

API endpoint from which to extract and format data

Value

Raw API data in a dataframe

Helper script to clean data from API to make it compatible with mx_search()

Description

Helper script to clean data from API to make it compatible with mx_search()

Usage

clean_api_df(df)

Arguments

df

Raw dataframe from API

Value

Cleaned dataframe

Allow for capitalisation of search terms

Description

Allow for capitalisation of search terms

Usage

fix_caps(x)

Arguments

x

Search query to be formatted. Note, any search term already containing a square bracket will not be reformatted to preserve user-defined regexes.

Value

The same list or vector search terms, but with proper regular expression syntax to allow for capitalisation of the first letter of each term.
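
The behaviour described above can be sketched in base R as follows. The helper cap_first_sketch() is hypothetical and only approximates the idea: the first character of each term becomes a two-character class, and any term that already contains a square bracket is returned untouched.

```r
# Hypothetical sketch: allow the first letter of each term to be either case.
cap_first_sketch <- function(terms) {
  vapply(terms, function(term) {
    # Leave empty terms and user-defined regexes (containing "[") alone
    if (!nzchar(term) || grepl("[", term, fixed = TRUE)) {
      return(term)
    }
    first <- substr(term, 1, 1)
    rest <- substr(term, 2, nchar(term))
    paste0("[", toupper(first), tolower(first), "]", rest)
  }, character(1), USE.NAMES = FALSE)
}

patterns <- cap_first_sketch(c("dementia", "[Nn]cov"))
print(patterns)
```

The resulting patterns can be passed straight to grepl(), so "[Dd]ementia" matches both "dementia" and "Dementia".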

Replace user-friendly 'NEAR' operator with appropriate regex syntax

Description

Replace user-friendly 'NEAR' operator with appropriate regex syntax

Usage

fix_near(x)

Arguments

x

Search query to be reformatted

Replace user-friendly 'wildcard' operator with appropriate regex syntax

Description

Replace user-friendly 'wildcard' operator with appropriate regex syntax

Usage

fix_wildcard(x)

Arguments

x

Search query to be reformatted

Report the latest record date in a snapshot

Description

Report the latest record date in a snapshot

Usage

inform_snapshot_date(data)

Arguments

data

Snapshot data frame

Value

Invisibly returns the latest snapshot date
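
A minimal sketch of this pattern (report a value with message(), then return it invisibly) is shown below. It assumes the snapshot has a date column holding "YYYY-MM-DD" values; report_latest_sketch() is a hypothetical name, not the package function.

```r
# Hypothetical sketch: report the most recent record date in a snapshot.
# ASSUMPTION: the data frame has a `date` column in "YYYY-MM-DD" format.
report_latest_sketch <- function(data) {
  latest <- max(as.Date(data$date))
  message("Latest record in snapshot: ", format(latest))
  invisible(latest)
}

toy_dates <- data.frame(
  date = c("2020-01-01", "2020-03-01", "2020-02-15"),
  stringsAsFactors = FALSE
)
latest_date <- report_latest_sketch(toy_dates)
```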

Checks whether the user has internet, and returns a helpful message if not.

Description

Checks whether the user has internet, and returns a helpful message if not.

Usage

internet_check()

Value

Informative error if not connected to the internet

Access medRxiv/bioRxiv data via the Cold Spring Harbor Laboratory API

Description

Provides programmatic access to all preprints available through the Cold Spring Harbor Laboratory API, which serves both the medRxiv and bioRxiv preprint repositories.

Usage

mx_api_content(
  from_date = "2013-01-01",
  to_date = as.character(Sys.Date()),
  clean = TRUE,
  server = "medrxiv",
  include_info = FALSE
)

Arguments

from_date

Earliest date of interest, written as "YYYY-MM-DD". Defaults to 1st Jan 2013 ("2013-01-01"), ~6 months prior to earliest preprint registration date.

to_date

Latest date of interest, written as "YYYY-MM-DD". Defaults to current date.

clean

Logical, defaulting to TRUE, indicating whether to clean the data returned by the API. If TRUE, variables containing absolute paths to the preprint's web page ("link_page") and PDF ("link_pdf") are generated from the "server", "DOI", and "version" variables returned by the API. The "category", "authors" and "author_corresponding" variables are converted to title case. Finally, the "type" and "server" variables are dropped.

server

Specify the server you wish to use: "medrxiv" (default) or "biorxiv"

include_info

Logical, indicating whether to include variables containing information returned by the API (e.g. API status, cursor number, total count of papers, etc). Default is FALSE.

Value

Dataframe with 1 record per row

Examples

if (interactive()) {
  mx_data <- mx_api_content(
    from_date = "2020-01-01",
    to_date = "2020-01-07"
  )
}
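
To make the effect of clean = TRUE more concrete, the following base R sketch applies the same kind of steps to a single invented record: build the two link columns, convert a text column to title case, and drop "type" and "server". The URL pattern and the helper clean_sketch() are assumptions for illustration, not the package's code.

```r
# One invented raw record, standing in for uncleaned API output.
toy_raw <- data.frame(
  doi = "10.1101/2020.02.25.20021568",
  version = "1",
  server = "medrxiv",
  type = "new results",
  category = "infectious diseases",
  stringsAsFactors = FALSE
)

# Hypothetical sketch of the cleaning steps described for `clean`.
clean_sketch <- function(df) {
  # ASSUMPTION: pages live at https://www.<server>.org/content/<doi>v<version>
  df$link_page <- paste0(
    "https://www.", tolower(df$server), ".org/content/",
    df$doi, "v", df$version
  )
  # ASSUMPTION: the PDF link is the page link plus ".full.pdf"
  df$link_pdf <- paste0(df$link_page, ".full.pdf")
  # Capitalise the first letter of every word (simple title case)
  df$category <- gsub("\\b([a-z])", "\\U\\1", df$category, perl = TRUE)
  # Drop the columns that are no longer needed
  df[setdiff(names(df), c("type", "server"))]
}

toy_clean <- clean_sketch(toy_raw)
print(names(toy_clean))
```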

Access data on a single medRxiv/bioRxiv record via the Cold Spring Harbor Laboratory API

Description

Provides programmatic access to data on a single preprint identified by a unique Digital Object Identifier (DOI).

Usage

mx_api_doi(doi, server = "medrxiv", clean = TRUE)

Arguments

doi

Digital object identifier of the preprint you wish to retrieve data on.

server

Specify the server you wish to use: "medrxiv" (default) or "biorxiv"

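clean

Logical, defaulting to TRUE, indicating whether to clean the data returned by the API, as described for mx_api_content().
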
Value

Dataframe containing details on the preprint identified by the DOI.

Examples

if (interactive()) {
  mx_data <- mx_api_doi("10.1101/2020.02.25.20021568")
}

Search term wrapper that allows for different capitalization of term

Description

Inspired by the varying capitalization of "NCOV" during the coronavirus pandemic (e.g. ncov, nCoV, NCOV, nCOV), this function allows for all possible configurations of lower- and upper-case letters in your search term.

Usage

mx_caps(x)

Arguments

x

Search term to be formatted

Value

The input string is returned, but with each non-space character repeated in upper- and lower-case, and enclosed in square brackets. For example, mx_caps("ncov") returns "[Nn][Cc][Oo][Vv]"
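
The transformation can be reproduced in a few lines of base R. The helper caps_sketch() below is a hypothetical stand-in written for illustration; it mirrors the documented output for "ncov" but is not taken from the package source.

```r
# Hypothetical sketch: expand every non-space character into a [Xx] class.
caps_sketch <- function(term) {
  chars <- strsplit(term, "")[[1]]
  expanded <- ifelse(
    chars == " ",
    chars,
    paste0("[", toupper(chars), tolower(chars), "]")
  )
  paste(expanded, collapse = "")
}

pattern <- caps_sketch("ncov")
print(pattern)

# The pattern matches any capitalisation of the term
matches <- grepl(pattern, c("ncov", "nCoV", "NCOV", "influenza"))
print(matches)
```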

Check how up-to-date the maintained medRxiv snapshot is

Description

Provides information on how up-to-date the maintained medRxiv snapshot provided by 'mx_snapshot()' is by checking whether there have been any records added to, or updated in, the medRxiv repository since the last snapshot was taken.

Usage

mx_crosscheck()

Examples

if (interactive()) {
  mx_crosscheck()
}

Download PDFs of preprints returned by a search

Description

Download PDFs of all the papers in your search results

Usage

mx_download(
  mx_results,
  directory,
  create = TRUE,
  name = c("ID", "DOI"),
  print_update = 10
)

Arguments

mx_results

Vector containing the links to the medRxiv PDFs

directory

The location you want to download the PDFs to

create

TRUE or FALSE. If TRUE, creates the directory if it doesn't exist

name

How to name the downloaded PDF. By default, both the ID number of the record and the DOI are used.

print_update

How frequently to print an update

Examples

if (interactive()) {
  mx_results <- mx_search(mx_snapshot(), query = "10.1101/2020.02.25.20021568")
  mx_download(mx_results, directory = tempdir())
}

Export references for preprints returned by a search to a .bib file

Description

Export references for preprints returned by a search to a .bib file

Usage

mx_export(data, file = "medrxiv_export.bib")

Arguments

data

Dataframe returned by mx_search() or mx_api_*() functions

file

File location to save to. Must have the .bib file extension

Value

Exports a formatted .BIB file, for import into a reference manager

Examples

if (interactive()) {
  mx_results <- mx_search(mx_snapshot(), query = "brain")
  mx_export(mx_results, tempfile(fileext = ".bib"))
}

Provide information on the medRxiv snapshot used to perform the search

Description

Provide information on the medRxiv snapshot used to perform the search

Usage

mx_info(commit = "main", manifest_url = default_snapshot_manifest_url())

Arguments

commit

Deprecated. Only the default value "main" is supported. Use 'manifest_url' to read a specific snapshot manifest.

manifest_url

URL for a JSON snapshot manifest. Defaults to option 'medrxivr.snapshot_manifest', or the package's snapshot release manifest if that option is unset.

Value

Message with snapshot details

Search and print output for individual search items

Description

Search and print output for individual search items

Usage

mx_reporter(mx_data, num_results, query, fields, deduplicate, NOT)

Arguments

mx_data

The mx_dataset filtered for the date limits

num_results

The number of results returned by the overall search

query

Character string, vector or list

fields

Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI.

deduplicate

Logical. Only return the most recent version of a record. Default is TRUE.

NOT

Vector of regular expressions to exclude from the search. Default is "".

Search preprint data

Description

Search preprint data

Usage

mx_search(
  data = NULL,
  query = NULL,
  fields = c("title", "abstract", "authors", "category", "doi"),
  from_date = NULL,
  to_date = NULL,
  auto_caps = FALSE,
  NOT = "",
  deduplicate = TRUE,
  report = FALSE
)

Arguments

data

The preprint dataset that is to be searched, created either using mx_api_content() or mx_snapshot()

query

Character string, vector or list

fields

Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI.

from_date

Defines earliest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned.

to_date

Defines latest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned.

auto_caps

As the search is case sensitive, this logical specifies whether the search should automatically allow for differing capitalisation of search terms. For example, when TRUE, a search for "dementia" would find both "dementia" and "Dementia". Note that if your term is multi-word (e.g. "systematic review"), only the first word is automatically capitalised (e.g. your search will find both "systematic review" and "Systematic review" but won't find "Systematic Review"). Note also that this option will format terms in both the query and NOT arguments (if applicable).

NOT

Vector of regular expressions to exclude from the search. Default is "".

deduplicate

Logical. Only return the most recent version of a record. Default is TRUE.

report

Logical. Run mx_reporter. Default is FALSE.

Examples

if (interactive()) {
  # Using the static snapshot
  mx_results <- mx_search(data = mx_snapshot(), query = "dementia")
}
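
The arguments above interact in a way that is easier to see on toy data. The base R sketch below imitates the general idea: match regular expressions across the chosen fields, exclude records matching any NOT term, and keep only the latest version of each record. The function search_sketch(), the version column, and the choice to combine a vector of query terms with OR are assumptions for illustration, not a description of the package internals.

```r
# Toy records standing in for a preprint dataset (all values invented).
toy_records <- data.frame(
  doi = c("10.1101/a", "10.1101/a", "10.1101/b", "10.1101/c"),
  version = c(1, 2, 1, 1),
  title = c("Dementia risk", "Dementia risk revisited",
            "Influenza in mice", "dementia in mice"),
  abstract = c("cohort study", "cohort study", "mouse model", "mouse model"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch of a case-sensitive regex search with exclusions.
search_sketch <- function(data, query, fields = c("title", "abstract"),
                          NOT = "", deduplicate = TRUE) {
  # Collapse each record's searchable fields into one string
  haystack <- do.call(paste, c(data[fields], sep = " "))
  # ASSUMPTION: a record matches if ANY query term is found
  hit <- Reduce(`|`, lapply(query, grepl, x = haystack), FALSE)
  # Remove records matching any non-empty NOT term
  not_terms <- NOT[nzchar(NOT)]
  if (length(not_terms) > 0) {
    excluded <- Reduce(`|`, lapply(not_terms, grepl, x = haystack), FALSE)
    hit <- hit & !excluded
  }
  out <- data[hit, , drop = FALSE]
  if (deduplicate) {
    # Keep only the highest version of each DOI
    out <- out[order(out$doi, -out$version), , drop = FALSE]
    out <- out[!duplicated(out$doi), , drop = FALSE]
  }
  out
}

toy_hits <- search_sketch(toy_records, query = "[Dd]ementia", NOT = "mice")
print(toy_hits[, c("doi", "version", "title")])
```

Here the "mice" record is excluded by NOT, and of the two versions of 10.1101/a only version 2 is kept.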

Access a static snapshot of the medRxiv repository

Description

[Available for medRxiv only] This function allows users to import a maintained static snapshot of the medRxiv repository, instead of downloading a copy from the API, which can become unavailable during peak usage times. The function reads a manifest-driven snapshot artifact from the package's GitHub release assets by default.

Usage

mx_snapshot(
  commit = "main",
  from_date = NULL,
  to_date = NULL,
  manifest_url = default_snapshot_manifest_url(),
  cache = TRUE
)

Arguments

commit

Deprecated. Only the default value "main" is supported. Use 'manifest_url' to read a specific snapshot manifest.

from_date

Optional earliest date of interest ("YYYY-MM-DD" or Date). If supplied, records with 'date' earlier than this are excluded.

to_date

Optional latest date of interest ("YYYY-MM-DD" or Date). If supplied, records with 'date' later than this are excluded.

manifest_url

URL for a JSON snapshot manifest. Defaults to option 'medrxivr.snapshot_manifest', or the package's latest GitHub release manifest if that option is unset.

cache

Logical. If TRUE, downloaded manifest snapshot files are cached between sessions. Defaults to TRUE.

Value

A formatted dataframe containing the data from the snapshot artifact, with reconstructed 'link_page' and 'link_pdf' columns.
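
The from_date and to_date arguments describe an inclusive date window: only records dated before from_date or after to_date are dropped. A minimal base R sketch of that filter, using a hypothetical helper filter_dates_sketch() and an invented three-row data frame, might look like this:

```r
# Toy snapshot with a `date` column (values invented for illustration).
toy_snapshot <- data.frame(
  id = 1:3,
  date = c("2020-01-01", "2020-01-05", "2020-01-10"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: keep records inside an inclusive date window.
filter_dates_sketch <- function(data, from_date = NULL, to_date = NULL) {
  dates <- as.Date(data$date)
  keep <- rep(TRUE, nrow(data))
  if (!is.null(from_date)) {
    keep <- keep & dates >= as.Date(from_date)
  }
  if (!is.null(to_date)) {
    keep <- keep & dates <= as.Date(to_date)
  }
  data[keep, , drop = FALSE]
}

toy_window <- filter_dates_sketch(toy_snapshot, "2020-01-05", "2020-01-10")
print(toy_window)
```

Because both bounds accept either a "YYYY-MM-DD" string or a Date, as.Date() is applied to each before comparison.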

Search for terms in the dataset

Description

Search for terms in the dataset

Usage

print_full_results(num_results, deduplicate)

Arguments

num_results

number of searched terms returned

deduplicate

Logical. Only return the most recent version of a record. Default is TRUE.

Search for terms in the dataset

Description

Search for terms in the dataset

Usage

run_search(mx_data, query, fields, deduplicate, NOT = "")

Arguments

mx_data

The mx_dataset filtered for the date limits

query

Character string, vector or list

fields

Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI.

deduplicate

Logical. Only return the most recent version of a record. Default is TRUE.

NOT

Vector of regular expressions to exclude from the search. Default is "".

Skips API tests if API isn't working correctly

Description

Skips API tests if API isn't working correctly

Usage

skip_if_api_message()
