ORscraper is an R package for the automated extraction of clinical data from reports generated by the Oncomine Reporter software. It transforms unstructured files into structured databases, which speeds up data processing and reduces human error.
Additionally, ORscraper enhances extracted data by integrating clinical significance insights through automated queries to the ClinVar API.
This package is open-source and fully customizable, allowing users to tailor it to their specific needs.
To install ORscraper from GitHub, use the following commands:
library(devtools)
devtools::install_github("SamuelGonzalez0204/ORscraper")
This will install the package along with its dependencies.
The input files must be in PDF format and follow the standard structure of clinical reports generated by Oncomine Reporter. ORscraper extracts information directly from the text without requiring OCR, ensuring fast and accurate data retrieval.
Each clinical report follows this structure:
- Analysis indication and scope.
- Sample details.
- Relevant biomarkers.
- Tables containing genetic variant details (if any were detected):
  - DNA sequence variants.
  - Copy number variations (CNV).
  - Gene fusions.
- Analyzed genes.
- Analytical method.
- Study limitations.
ORscraper supports the analysis of multiple reports simultaneously, streamlining workflows for large datasets.
For better organization, the following directory structure is recommended when specifying file paths in the code:
InputFolder <- "INPUT"
ReportFolder <- "Reports"
BasePath <- getwd()
inputPath <- file.path(BasePath, InputFolder, ReportFolder)
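As a minimal sketch of how this layout can be used, the snippet below builds the same path and lists the PDF files it contains using base R only. The helper name list_report_pdfs() is hypothetical and is not part of ORscraper; the demo files are empty placeholders created in a temporary directory.

```r
# Hypothetical helper (not part of ORscraper): list the PDF reports found
# under <base_path>/INPUT/Reports.
list_report_pdfs <- function(base_path = getwd(),
                             input_folder = "INPUT",
                             report_folder = "Reports") {
  input_path <- file.path(base_path, input_folder, report_folder)
  if (!dir.exists(input_path)) {
    stop("Report folder not found: ", input_path)
  }
  # Match the .pdf extension case-insensitively and return full paths.
  list.files(input_path, pattern = "\\.pdf$", ignore.case = TRUE,
             full.names = TRUE)
}

# Usage: recreate the recommended layout in a temporary directory.
demo_base <- file.path(tempdir(), "orscraper_demo")
dir.create(file.path(demo_base, "INPUT", "Reports"), recursive = TRUE,
           showWarnings = FALSE)
invisible(file.create(file.path(demo_base, "INPUT", "Reports",
                                "100.1-example.pdf")))
invisible(file.create(file.path(demo_base, "INPUT", "Reports", "notes.txt")))

report_files <- list_report_pdfs(demo_base)
print(basename(report_files))
```

Failing early when the folder is missing makes a mistyped path visible before any extraction starts.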
Alternatively, users can leverage the Shiny application available in this repository (https://github.com/SamuelGonzalez0204/PDF-Scrapping), which allows file selection via a graphical interface, eliminating the need to manually specify file paths.
To improve organization, it is recommended to name documents using the chip number corresponding to the sequencing run.
ORscraper is designed to process reports in Spanish. However, it can be adapted to other languages by modifying the search patterns within the code.
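One possible way to organize such an adaptation is to keep the search patterns in a lookup table keyed by language. This is only a sketch: the Spanish patterns are taken from the examples in this document, while the English patterns, the search_patterns list and the get_pattern() helper are hypothetical and do not exist in the package.

```r
# Search patterns keyed by report language. The Spanish entries appear in
# this document; the English entries are hypothetical placeholders.
search_patterns <- list(
  es = list(diagnostic = ".*Diagnóstico:\\s", gender = ".*Sexo:\\s*"),
  en = list(diagnostic = ".*Diagnosis:\\s", gender = ".*Sex:\\s*")
)

# Hypothetical helper: look up one pattern and fail loudly if it is missing.
get_pattern <- function(language, field) {
  pattern <- search_patterns[[language]][[field]]
  if (is.null(pattern)) {
    stop("No pattern defined for language '", language,
         "' and field '", field, "'")
  }
  pattern
}

# Synthetic line standing in for an English-language report.
english_line <- "Patient data   Sex: Female"
gender_value <- sub(get_pattern("en", "gender"), "", english_line)
print(gender_value)
```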
For testing and reference, the package includes a sample report in the repository folder:
inst/extdata/100.1-example.pdf
Use this file to check the expected input format.
The ORscraper package provides various functions for extracting data from clinical reports:
classify_biopsy(): This function analyzes biopsy identifiers and categorizes them into specific types based on a defined rule, coded as 1 (biopsy), 2 (aspiration) or 3 (cytology).
extract_chip_id(): This function retrieves chip values from file names matching a specific pattern.
extract_fusions(): This function identifies and extracts fusion variants from text lines based on specific patterns.
extract_intermediate_values(): This function searches for a specific text pattern in a set of lines and extracts values that follow the pattern.
extract_values_from_tables(): This function analyzes a subset of text lines, extracting information such as mutations, pathogenicity, frequencies, codifications and changes.
extract_values_start_end(): This function appends extracted variable values based on start or end markers to a list.
filter_pathogenic_only(): This function filters a list of pathogenicity classifications, retaining only those marked as “Pathogenic”.
read_pdf_content(): This function extracts the text content from a PDF file and splits it into individual lines.
read_pdf_files(): This function scans a specified directory and retrieves all files with a .pdf extension.
search_ncbi_clinvar(): This function queries the NCBI ClinVar database for germline classifications based on gene and codification data.
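To give an idea of the pattern-based extraction that several of these functions describe, here is a minimal base R sketch that appends the text following a start marker to a vector of values. It is not the package's implementation: append_value_after() is a hypothetical helper, and the input lines are synthetic.

```r
# Hypothetical helper (not the package's implementation): append to `values`
# the text that follows `start_pattern` on the first matching line.
append_value_after <- function(values, lines, start_pattern) {
  hits <- grep(start_pattern, lines, value = TRUE)
  if (length(hits) == 0) {
    # Keep one entry per report so vectors stay aligned across reports.
    return(c(values, NA_character_))
  }
  c(values, trimws(sub(start_pattern, "", hits[1])))
}

# Synthetic lines standing in for the text of one report.
report_lines <- c("Datos del paciente   Sexo: Mujer",
                  "NHC: 123456")

gender <- c()
gender <- append_value_after(gender, report_lines, ".*Sexo:\\s*")
nhc <- c()
nhc <- append_value_after(nhc, report_lines, ".*NHC:\\s*")
print(gender)
print(nhc)
```

Appending NA when a marker is absent is a design choice of this sketch: it keeps every variable at one entry per report, which makes it easier to combine the vectors into a table later.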
First, retrieve a list of all PDF files from the specified directory using the read_pdf_files() function.
if (!requireNamespace("readxl", quietly = TRUE)) {
  stop("The readxl package is required for this vignette; install it with install.packages('readxl').")
}
InputPath <- system.file("extdata", package = "ORscraper")
files <- ORscraper::read_pdf_files(InputPath)
genes_file <- system.file("extdata/Genes.xlsx", package = "ORscraper")
genes <- readxl::read_excel(genes_file)
mutations <- unique(genes$GEN)
Read the content of each PDF file and split it into lines using read_pdf_content().
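The line-splitting part of this step can be sketched in base R as follows. The input is assumed to be one string per page, which is the shape returned by pdftools::pdf_text(); whether ORscraper uses pdftools internally is an assumption, and split_pages_into_lines() is a hypothetical helper. The later examples refer to a variable named lines, which presumably holds the result of this step for one report.

```r
# Hypothetical helper: turn page-level text into a flat vector of lines.
# `pages` is assumed to hold one string per page, e.g. the output of
# pdftools::pdf_text(path) for a text-based PDF.
split_pages_into_lines <- function(pages) {
  unlist(strsplit(pages, "\r?\n"))
}

# Synthetic two-page text used in place of a real PDF.
pages <- c("Sexo: Mujer\nFecha: 01-01-2024",
           "NHC: 123456\nbiopsia: 24B01234")
page_lines <- split_pages_into_lines(pages)
print(length(page_lines))
print(page_lines[3])
```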
Use predefined patterns to extract diagnostic information, gender, tumor cell percentage, and sample quality.
diagnostic <- gender <- tumor_cell_percentage <- quality <- c()
diagnostic <- extract_values_start_end(diagnostic, lines, ".*Diagnóstico:\\s")
gender <- extract_values_start_end(gender, lines, ".*Sexo:\\s*")
tumor_cell_percentage <- extract_values_start_end(tumor_cell_percentage, lines, ".*% células tumorales:\\s")
quality <- extract_values_start_end(quality, lines, ".*CALIDAD DE LA MUESTRA /LIMITACIONES PARA SU ANÁLISIS:\\s")
Extract other relevant values such as patient ID, biopsy number, sample date, and diagnostic text.
NHC_Data <- NB_values <- dates <- textDiag <- c()
NHC_Data <- extract_intermediate_values(NHC_Data, lines, "NHC:")
NB_values <- extract_intermediate_values(NB_values, lines, "biopsia:")
dates <- extract_intermediate_values(dates, lines, "Fecha:")
textDiag <- extract_intermediate_values(textDiag, lines, "de la muestra:")
Identify genetic mutations and their characteristics from tables within the reports.
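The general idea of reading variant rows from table text can be sketched with regular expressions, as below. This does not reproduce extract_values_from_tables(): the row layout, the helper names first_match() and parse_variant_rows(), and the frequency values are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper: return the first regex match in `text`, or NA.
first_match <- function(text, pattern) {
  hit <- regmatches(text, regexpr(pattern, text))
  if (length(hit) == 0) NA_character_ else hit
}

# Hypothetical parser: keep rows whose first token is a gene of interest and
# pull the amino acid change, codification and frequency out of each row.
parse_variant_rows <- function(lines, genes) {
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  is_variant <- vapply(tokens,
                       function(x) length(x) > 0 && x[1] %in% genes,
                       logical(1))
  rows <- lines[is_variant]
  data.frame(
    gene = vapply(tokens[is_variant], function(x) x[1], character(1)),
    change = vapply(rows, first_match, character(1),
                    pattern = "p\\.\\([^)]*\\)", USE.NAMES = FALSE),
    codification = vapply(rows, first_match, character(1),
                          pattern = "c\\.[^[:space:]]+", USE.NAMES = FALSE),
    frequency = vapply(rows, first_match, character(1),
                       pattern = "[0-9]+(\\.[0-9]+)?%", USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Synthetic table text; the layout and frequencies are made up.
table_lines <- c("Gene  Amino acid change  Coding  Allele frequency",
                 "BRAF  p.(V600E)  c.1799T>A  45.20%",
                 "TP53  p.(R175H)  c.524G>A  12.00%",
                 "XYZ1  p.(A1B)  c.1A>T  3.00%")
variants <- parse_variant_rows(table_lines, genes = c("BRAF", "TP53"))
print(variants)
```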
Extract fusion variants based on mutation patterns.
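A rough sketch of pattern-based fusion detection is shown below. It assumes fusions are written as two gene symbols joined by a hyphen and keeps only those where one partner is a gene of interest; this notation, the helper extract_fusion_names() and the sample lines are assumptions, not the behavior of extract_fusions().

```r
# Hypothetical helper: find "GENE1-GENE2" tokens and keep those in which at
# least one partner belongs to `genes`.
extract_fusion_names <- function(lines, genes) {
  pattern <- "[A-Z][A-Z0-9]*-[A-Z][A-Z0-9]*"
  # as.character() guards against NULL when `lines` is empty.
  candidates <- as.character(unlist(regmatches(lines,
                                               gregexpr(pattern, lines))))
  partners <- strsplit(candidates, "-", fixed = TRUE)
  keep <- vapply(partners, function(p) any(p %in% genes), logical(1))
  unique(candidates[keep])
}

# Synthetic lines; the row format is made up for illustration.
fusion_lines <- c("EML4-ALK.E6A20  Fusion  1520 reads",
                  "No fusions detected in ROS1",
                  "Lot AB-12 processed")
fusions <- extract_fusion_names(fusion_lines, genes = c("ALK", "ROS1", "RET"))
print(fusions)
```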
Query the NCBI ClinVar database to retrieve pathogenicity classifications for detected mutations.
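For orientation, the sketch below builds a search URL for ClinVar through the NCBI E-utilities esearch endpoint. The way the search term combines gene and codification is an assumption and may differ from what search_ncbi_clinvar() sends; build_clinvar_search_url() is a hypothetical helper. The network call is left commented out so the example runs offline.

```r
# Hypothetical helper: build an E-utilities esearch URL for ClinVar.
# The term format ("GENE[gene] AND codification") is an assumption.
build_clinvar_search_url <- function(gene, codification) {
  term <- paste0(gene, "[gene] AND ", codification)
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
         "?db=clinvar&retmode=json&term=",
         utils::URLencode(term, reserved = TRUE))
}

clinvar_url <- build_clinvar_search_url("BRAF", "c.1799T>A")
print(clinvar_url)

# Fetching the result requires network access and a JSON parser; jsonlite is
# an assumed extra dependency here, so these lines are left commented out:
# result <- jsonlite::fromJSON(clinvar_url)
# ids <- result$esearchresult$idlist
```

NCBI asks clients to limit the rate of E-utilities requests, so a loop over many variants would normally pause between calls.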
Extract only pathogenic mutations and their associated details.
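The filtering idea can be sketched on a small data frame, as below. The column names, the placeholder genes and their classifications are made up, and keep_pathogenic() is a hypothetical helper rather than the package's filter_pathogenic_only().

```r
# Hypothetical helper: keep rows whose classification is exactly
# "Pathogenic"; missing classifications are dropped.
keep_pathogenic <- function(variants) {
  keep <- !is.na(variants$classification) &
    variants$classification == "Pathogenic"
  variants[keep, , drop = FALSE]
}

# Made-up data with placeholder gene names.
classified <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  classification = c("Pathogenic", "Uncertain significance",
                     "Likely pathogenic", NA),
  stringsAsFactors = FALSE
)
pathogenic_only <- keep_pathogenic(classified)
print(pathogenic_only)
```

Exact matching excludes labels such as "Likely pathogenic"; whether those should be kept is a decision for the analyst.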
Categorize the biopsy number based on the sample origin.
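The rule used by classify_biopsy() is not spelled out here, so the sketch below assumes one for illustration: the third character of the identifier is B (biopsy), P (aspiration) or C (cytology). Both this rule and the identifier format are assumptions, and classify_by_letter() is a hypothetical helper.

```r
# Hypothetical helper: map the third character of each identifier to a code.
# Assumed rule: B = 1 (biopsy), P = 2 (aspiration), C = 3 (cytology).
classify_by_letter <- function(ids) {
  letter <- toupper(substr(ids, 3, 3))
  codes <- c(B = 1L, P = 2L, C = 3L)
  # Letters outside the rule yield NA.
  unname(codes[letter])
}

# Made-up identifiers in an assumed format.
biopsy_codes <- classify_by_letter(c("24B01234", "24P00077",
                                     "24C00012", "24X00001"))
print(biopsy_codes)
```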
Retrieve the chip ID used for sequencing from the file names.
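As a sketch of this step, the snippet below reads a leading numeric identifier from each file name. It assumes the chip number is the run of digits (optionally with a decimal part) at the start of the name, as in the sample file 100.1-example.pdf; the exact pattern used by extract_chip_id() may differ, and extract_leading_chip_id() is a hypothetical helper.

```r
# Hypothetical helper: take the leading number of each file name as chip ID.
extract_leading_chip_id <- function(file_paths) {
  names_only <- basename(file_paths)
  pattern <- "^[0-9]+(\\.[0-9]+)?"
  ids <- rep(NA_character_, length(names_only))
  has_id <- grepl(pattern, names_only)
  # regmatches() returns matches only for the names that matched.
  ids[has_id] <- regmatches(names_only, regexpr(pattern, names_only))
  ids
}

chip_ids <- extract_leading_chip_id(c("reports/100.1-example.pdf",
                                      "reports/summary.pdf"))
print(chip_ids)
```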
You can customize the mutations and diagnoses to process by supplying Excel files that specify the genes of interest and the code associated with each diagnosis. Sample files are provided (inst/extdata/Diagnostico.xlsx and inst/extdata/Genes.xlsx) with the genes and diagnoses used when designing the R package and the Shiny application.
For users who prefer a graphical user interface, ORscraper includes a Shiny app. The app lets you interact with the data visually, upload clinical reports, and save data tables to local MongoDB databases, which can be browsed with MongoDB Compass.
Access the Shiny app repository: https://github.com/SamuelGonzalez0204/ORscrapper_ShinyApp
The ORscraper package provides a set of tools that facilitate scraping medical information from clinical report files in PDF format. It allows users to customize the data to extract, the mutations to search for, and the diagnoses to work with, enriching all information with API searches in ClinVar.