| Title: | Access and Search MedRxiv and BioRxiv Preprint Data |
| Version: | 0.1.3 |
| Depends: | R (≥ 4.1.0) |
| Description: | Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv, both of which are operated by the Cold Spring Harbor Laboratory. 'medrxivr' provides programmatic access to the 'Cold Spring Harbor Laboratory (CSHL)' API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. 'medrxivr' also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria. |
| License: | GPL-2 |
| Encoding: | UTF-8 |
| Language: | en-US |
| URL: | https://github.com/ropensci/medrxivr |
| BugReports: | https://github.com/ropensci/medrxivr/issues |
| Imports: | methods, dplyr, curl, jsonlite, httr, stringr, rlang, bib2df, tibble, progress, lubridate, purrr, data.table |
| Suggests: | testthat (≥ 2.1.0), knitr, rmarkdown, covr, kableExtra, spelling |
| VignetteBuilder: | knitr |
| RoxygenNote: | 7.3.2 |
| NeedsCompilation: | no |
| Packaged: | 2026-05-06 20:55:02 UTC; Bach |
| Author: | Yaoxiang Li |
| Maintainer: | Yaoxiang Li <liyaoxiang@outlook.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-07 12:01:16 UTC |
medrxivr: Accessing medRxiv and bioRxiv preprint data from R
Description
The medrxivr package enables users to access data on preprints in the medRxiv and bioRxiv preprint repositories, both of which are run by the Cold Spring Harbor Laboratory. It also provides functions to search the preprint data, export it to a .bib file, and download the PDFs associated with specified records.
Author(s)
Maintainer: Yaoxiang Li liyaoxiang@outlook.com (ORCID)
Authors:
Luke McGuinness
Lena Schmidt
Other contributors:
Tuija Sonkkila [reviewer]
Najko Jahn [reviewer]
See Also
Useful links:
https://github.com/ropensci/medrxivr
Report bugs at https://github.com/ropensci/medrxivr/issues
Create link for API
Description
Create link for API
Usage
api_link(...)
Arguments
... |
Arguments to specify the path to the API endpoint |
Value
Formatted link to API endpoint
Extract the page size from API metadata
Description
Extract the page size from API metadata
Usage
api_page_size(messages)
Arguments
messages |
API metadata messages data frame |
Value
Numeric page size
Extract the total number of records from API metadata
Description
Extract the total number of records from API metadata
Usage
api_record_count(messages)
Arguments
messages |
API metadata messages data frame |
Value
Numeric record count
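Examples
The page size and record count extracted by api_page_size() and api_record_count() are the two numbers needed to plan a paged download. The following is a minimal base R sketch of that arithmetic, not the package's own code: the helper name page_cursors() is hypothetical, and it assumes the API cursor is a zero-based record offset.

```r
# Hypothetical helper (not part of medrxivr): compute the cursor value at
# which each page of results would start.
# Assumption: the cursor is a zero-based offset into the full result set.
page_cursors <- function(total_records, page_size) {
  if (total_records <= 0) {
    return(numeric(0))
  }
  seq(from = 0, to = total_records - 1, by = page_size)
}

# 250 records served 100 at a time need three requests
cursors <- page_cursors(total_records = 250, page_size = 100)
cursors
length(cursors)
```

The number of requests is the record count divided by the page size, rounded up, which is why both values are read from the API metadata before any data pages are fetched.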
Convert API data to data frame
Description
Convert API data to data frame
Usage
api_to_df(url)
Arguments
url |
API endpoint from which to extract and format data |
Value
Raw API data in a dataframe
Helper script to clean data from API to make it compatible with mx_search()
Description
Helper script to clean data from API to make it compatible with mx_search()
Usage
clean_api_df(df)
Arguments
df |
Raw dataframe from API |
Value
Cleaned dataframe
Allow for capitalisation of search terms
Description
Allow for capitalisation of search terms
Usage
fix_caps(x)
Arguments
x |
Search query to be formatted. Note, any search term already containing a square bracket will not be reformatted to preserve user-defined regexes. |
Value
The same list or vector of search terms, but with proper regular expression syntax to allow for capitalisation of the first letter of each term.
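Examples
The transformation described above can be illustrated with a short base R sketch. This is an illustration under stated assumptions rather than the package's internal implementation: the helper name capitalise_first() is hypothetical, and it only handles terms that start with an ASCII letter.

```r
# Hypothetical helper (not part of medrxivr): wrap the first letter of each
# term in a character class so that both capitalisations match.
capitalise_first <- function(terms) {
  vapply(terms, function(term) {
    # Terms that already contain a square bracket are treated as
    # user-defined regexes and returned unchanged.
    if (grepl("[", term, fixed = TRUE) || !grepl("^[A-Za-z]", term)) {
      return(term)
    }
    first <- substr(term, 1, 1)
    rest <- substr(term, 2, nchar(term))
    paste0("[", toupper(first), tolower(first), "]", rest)
  }, character(1), USE.NAMES = FALSE)
}

patterns <- capitalise_first(c("dementia", "systematic review", "[Cc]ovid"))
patterns

# The first pattern matches "dementia" and "Dementia" but not "DEMENTIA"
grepl(patterns[1], c("dementia", "Dementia", "DEMENTIA"))
```

Only the first letter of a multi-word term is affected, so "systematic review" becomes a pattern that matches "Systematic review" but not "Systematic Review".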
Replace user-friendly 'NEAR' operator with appropriate regex syntax
Description
Replace user-friendly 'NEAR' operator with appropriate regex syntax
Usage
fix_near(x)
Arguments
x |
Search query to be reformatted |
Replace user-friendly 'wildcard' operator with appropriate regex syntax
Description
Replace user-friendly 'wildcard' operator with appropriate regex syntax
Usage
fix_wildcard(x)
Arguments
x |
Search query to be reformatted |
Report the latest record date in a snapshot
Description
Report the latest record date in a snapshot
Usage
inform_snapshot_date(data)
Arguments
data |
Snapshot data frame |
Value
Invisibly returns the latest snapshot date
Checks whether the user has internet, and returns a helpful message if not.
Description
Checks whether the user has internet, and returns a helpful message if not.
Usage
internet_check()
Value
Informative error if not connected to the internet
Access medRxiv/bioRxiv data via the Cold Spring Harbor Laboratory API
Description
Provides programmatic access to all preprints available through the Cold Spring Harbor Laboratory API, which serves both the medRxiv and bioRxiv preprint repositories.
Usage
mx_api_content(
from_date = "2013-01-01",
to_date = as.character(Sys.Date()),
clean = TRUE,
server = "medrxiv",
include_info = FALSE
)
Arguments
from_date |
Earliest date of interest, written as "YYYY-MM-DD". Defaults to 1st Jan 2013 ("2013-01-01"), ~6 months prior to earliest preprint registration date. |
to_date |
Latest date of interest, written as "YYYY-MM-DD". Defaults to current date. |
clean |
Logical, defaulting to TRUE, indicating whether to clean the data returned by the API. If TRUE, variables containing absolute paths to the preprint's web page ("link_page") and PDF ("link_pdf") are generated from the "server", "DOI", and "version" variables returned by the API. The "category", "authors" and "author_corresponding" variables are converted to title case. Finally, the "type" and "server" variables are dropped. |
server |
Specify the server you wish to use: "medrxiv" (default) or "biorxiv" |
include_info |
Logical, indicating whether to include variables containing information returned by the API (e.g. API status, cursor number, total count of papers, etc). Default is FALSE. |
Value
Dataframe with 1 record per row
See Also
Other data-source:
mx_api_doi(),
mx_snapshot()
Examples
if (interactive()) {
mx_data <- mx_api_content(
from_date = "2020-01-01",
to_date = "2020-01-07"
)
}
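The link construction performed when clean = TRUE can be pictured with a small base R sketch on a one-row data frame. The helper name build_links(), the lower-case column name doi, and the URL layout (content path, "v" version suffix, ".full.pdf" ending) are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper (not part of medrxivr): derive page and PDF links from
# the server, DOI and version columns, then drop the server column.
# Assumption: links follow https://www.<server>.org/content/<doi>v<version>
build_links <- function(df) {
  base <- paste0(
    "https://www.", df$server, ".org/content/", df$doi, "v", df$version
  )
  df$link_page <- base
  df$link_pdf <- paste0(base, ".full.pdf")
  df$server <- NULL
  df
}

raw <- data.frame(
  server = "medrxiv",
  doi = "10.1101/2020.02.25.20021568",
  version = "1",
  stringsAsFactors = FALSE
)

cleaned <- build_links(raw)
cleaned$link_page
cleaned$link_pdf
```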
Access data on a single medRxiv/bioRxiv record via the Cold Spring Harbor Laboratory API
Description
Provides programmatic access to data on a single preprint identified by a unique Digital Object Identifier (DOI).
Usage
mx_api_doi(doi, server = "medrxiv", clean = TRUE)
Arguments
doi |
Digital object identifier of the preprint you wish to retrieve data on. |
server |
Specify the server you wish to use: "medrxiv" (default) or "biorxiv" |
clean |
Logical, defaulting to TRUE, indicating whether to clean the data returned by the API. If TRUE, variables containing absolute paths to the preprint's web page ("link_page") and PDF ("link_pdf") are generated from the "server", "DOI", and "version" variables returned by the API. The "category", "authors" and "author_corresponding" variables are converted to title case. Finally, the "type" and "server" variables are dropped. |
Value
Dataframe containing details on the preprint identified by the DOI.
See Also
Other data-source:
mx_api_content(),
mx_snapshot()
Examples
if (interactive()) {
mx_data <- mx_api_doi("10.1101/2020.02.25.20021568")
}
Search term wrapper that allows for different capitalization of term
Description
Inspired by the varying capitalization of "NCOV" during the coronavirus pandemic (e.g. ncov, nCoV, NCOV, nCOV), this function allows for all possible configurations of lower- and upper-case letters in your search term.
Usage
mx_caps(x)
Arguments
x |
Search term to be formatted |
Value
The input string is returned, but with each non-space character repeated in upper- and lower-case and enclosed in square brackets. For example, mx_caps("ncov") returns "[Nn][Cc][Oo][Vv]".
See Also
Other helper:
mx_crosscheck(),
mx_download(),
mx_export()
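Examples
The behaviour described under Value can be reproduced with a few lines of base R, which may help when reasoning about what the resulting pattern will match. This is a sketch of the described behaviour, not the package's implementation, and the helper name all_caps_regex() is hypothetical.

```r
# Hypothetical helper (not part of medrxivr): replace every non-space
# character with a character class holding its upper- and lower-case forms.
all_caps_regex <- function(term) {
  chars <- strsplit(term, "")[[1]]
  pieces <- ifelse(
    chars == " ",
    " ",
    paste0("[", toupper(chars), tolower(chars), "]")
  )
  paste(pieces, collapse = "")
}

pattern <- all_caps_regex("ncov")
pattern

# Every capitalisation variant listed in the Description matches
grepl(pattern, c("ncov", "nCoV", "NCOV", "nCOV"))
```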
Check how up-to-date the maintained medRxiv snapshot is
Description
Provides information on how up-to-date the maintained medRxiv snapshot provided by 'mx_snapshot()' is by checking whether there have been any records added to, or updated in, the medRxiv repository since the last snapshot was taken.
Usage
mx_crosscheck()
See Also
Other helper:
mx_caps(),
mx_download(),
mx_export()
Examples
if (interactive()) {
mx_crosscheck()
}
Download PDFs of preprints returned by a search
Description
Download PDFs of all the papers in your search results
Usage
mx_download(
mx_results,
directory,
create = TRUE,
name = c("ID", "DOI"),
print_update = 10
)
Arguments
mx_results |
Vector containing the links to the medRxiv PDFs |
directory |
The location you want to download the PDFs to |
create |
TRUE or FALSE. If TRUE, creates the directory if it doesn't exist |
name |
How to name the downloaded PDF. By default, both the ID number of the record and the DOI are used. |
print_update |
How frequently to print an update |
See Also
Other helper:
mx_caps(),
mx_crosscheck(),
mx_export()
Examples
if (interactive()) {
mx_results <- mx_search(mx_snapshot(), query = "10.1101/2020.02.25.20021568")
mx_download(mx_results, directory = tempdir())
}
Export references for preprints returned by a search to a .bib file
Description
Export references for preprints returned by a search to a .bib file
Usage
mx_export(data, file = "medrxiv_export.bib")
Arguments
data |
Dataframe returned by mx_search() or mx_api_*() functions |
file |
File location to save to. Must have the .bib file extension |
Value
Exports a formatted .BIB file, for import into a reference manager
See Also
Other helper:
mx_caps(),
mx_crosscheck(),
mx_download()
Examples
if (interactive()) {
mx_results <- mx_search(mx_snapshot(), query = "brain")
mx_export(mx_results, tempfile(fileext = ".bib"))
}
Provide information on the medRxiv snapshot used to perform the search
Description
Provide information on the medRxiv snapshot used to perform the search
Usage
mx_info(commit = "main", manifest_url = default_snapshot_manifest_url())
Arguments
commit |
Deprecated. Only the default value "main" is supported. Use 'manifest_url' to read a specific snapshot manifest. |
manifest_url |
URL for a JSON snapshot manifest. Defaults to option 'medrxivr.snapshot_manifest', or the package's snapshot release manifest if that option is unset. |
Value
Message with snapshot details
Search and print output for individual search items
Description
Search and print output for individual search items
Usage
mx_reporter(mx_data, num_results, query, fields, deduplicate, NOT)
Arguments
mx_data |
The mx_dataset filtered for the date limits |
num_results |
The number of results returned by the overall search |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
See Also
Other main:
mx_search(),
print_full_results(),
run_search()
Search preprint data
Description
Search preprint data
Usage
mx_search(
data = NULL,
query = NULL,
fields = c("title", "abstract", "authors", "category", "doi"),
from_date = NULL,
to_date = NULL,
auto_caps = FALSE,
NOT = "",
deduplicate = TRUE,
report = FALSE
)
Arguments
data |
The preprint dataset that is to be searched, created either using mx_api_content() or mx_snapshot() |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
from_date |
Defines earliest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned. |
to_date |
Defines latest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned. |
auto_caps |
As the search is case sensitive, this logical specifies whether the search should automatically allow for differing capitalisation of search terms. For example, when TRUE, a search for "dementia" would find both "dementia" and "Dementia". Note that if your term is multi-word (e.g. "systematic review"), only the first word is automatically capitalised (e.g. your search will find both "systematic review" and "Systematic review" but won't find "Systematic Review"). Note also that this option formats terms in both the query and NOT arguments (if applicable). |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
report |
Logical. Run mx_reporter. Default is FALSE. |
See Also
Other main:
mx_reporter(),
print_full_results(),
run_search()
Examples
if (interactive()) {
# Using the static snapshot
mx_results <- mx_search(data = mx_snapshot(), query = "dementia")
}
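The interaction of the date limits, the NOT argument and deduplicate can be illustrated offline with a toy data frame. The following base R sketch is an approximation under stated assumptions, not the package's search code: the function sketch_search() and the toy records are hypothetical, only the title field is searched, and the order in which filtering and deduplication happen is assumed.

```r
# Toy records: two versions of one preprint plus two other preprints
records <- data.frame(
  doi = c("10.1101/a", "10.1101/a", "10.1101/b", "10.1101/c"),
  version = c(1, 2, 1, 1),
  title = c(
    "Dementia risk", "Dementia risk, revised",
    "Vascular dementia in mice", "Influenza trends"
  ),
  date = as.Date(c("2020-01-01", "2020-01-05", "2020-01-07", "2020-01-03")),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of medrxivr)
sketch_search <- function(data, pattern, from_date, to_date,
                          not = "", deduplicate = TRUE) {
  # Date limits are inclusive: records dated on either bound are kept
  keep <- data$date >= as.Date(from_date) & data$date <= as.Date(to_date)
  # Case-sensitive regular expression match
  keep <- keep & grepl(pattern, data$title)
  # Drop records matching the exclusion pattern, if one was supplied
  if (nzchar(not)) {
    keep <- keep & !grepl(not, data$title)
  }
  hits <- data[keep, , drop = FALSE]
  # Keep only the highest version of each DOI
  if (deduplicate && nrow(hits) > 0) {
    hits <- hits[order(hits$doi, -hits$version), , drop = FALSE]
    hits <- hits[!duplicated(hits$doi), , drop = FALSE]
  }
  rownames(hits) <- NULL
  hits
}

all_hits <- sketch_search(records, "[Dd]ementia", "2020-01-01", "2020-01-07")
no_mice <- sketch_search(records, "[Dd]ementia", "2020-01-01", "2020-01-07",
                         not = "mice")
all_hits[, c("doi", "version")]
no_mice[, c("doi", "version")]
```

In this toy data the record dated 2020-01-07 is returned because the upper date limit is inclusive, and only version 2 of the first preprint survives deduplication.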
Access a static snapshot of the medRxiv repository
Description
[Available for medRxiv only] This function allows users to import a maintained static snapshot of the medRxiv repository, instead of downloading a copy from the API, which can become unavailable during peak usage times. The function reads a manifest-driven snapshot artifact from the package's GitHub release assets by default.
Usage
mx_snapshot(
commit = "main",
from_date = NULL,
to_date = NULL,
manifest_url = default_snapshot_manifest_url(),
cache = TRUE
)
Arguments
commit |
Deprecated. Only the default value "main" is supported. Use 'manifest_url' to read a specific snapshot manifest. |
from_date |
Optional earliest date of interest ("YYYY-MM-DD" or Date). If supplied, records with 'date' earlier than this are excluded. |
to_date |
Optional latest date of interest ("YYYY-MM-DD" or Date). If supplied, records with 'date' later than this are excluded. |
manifest_url |
URL for a JSON snapshot manifest. Defaults to option 'medrxivr.snapshot_manifest', or the package's latest GitHub release manifest if that option is unset. |
cache |
Logical. If TRUE, downloaded manifest snapshot files are cached between sessions. Defaults to TRUE. |
Value
A formatted dataframe containing the data from the snapshot artifact, with reconstructed 'link_page' and 'link_pdf' columns.
See Also
Other data-source:
mx_api_content(),
mx_api_doi()
Search for terms in the dataset
Description
Search for terms in the dataset
Usage
print_full_results(num_results, deduplicate)
Arguments
num_results |
number of searched terms returned |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
See Also
Other main:
mx_reporter(),
mx_search(),
run_search()
Search for terms in the dataset
Description
Search for terms in the dataset
Usage
run_search(mx_data, query, fields, deduplicate, NOT = "")
Arguments
mx_data |
The mx_dataset filtered for the date limits |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
See Also
Other main:
mx_reporter(),
mx_search(),
print_full_results()
Skips API tests if API isn't working correctly
Description
Skips API tests if API isn't working correctly
Usage
skip_if_api_message()