
ORscraper

Samuel González

2026-01-11

Introduction

ORscraper is an R package for the automated extraction of clinical data from reports generated by the Oncomine Reporter software. It transforms unstructured report files into structured data, which speeds up processing and reduces the human error that comes with manual transcription.

Additionally, ORscraper enhances extracted data by integrating clinical significance insights through automated queries to the ClinVar API.

This package is open-source and fully customizable, allowing users to tailor it to their specific needs.

Installation

To install ORscraper from GitHub, use the following commands:

library(devtools)
devtools::install_github("SamuelGonzalez0204/ORscraper")

This will install the package along with its dependencies.

Data Structure and Input Format

Input File Format

The input files must be in PDF format and follow the standard structure of clinical reports generated by Oncomine Reporter. ORscraper extracts information directly from the text without requiring OCR, ensuring fast and accurate data retrieval.

Each clinical report follows this structure:

- Analysis indication and scope.
- Sample details.
- Relevant biomarkers.
- Tables containing genetic variant details (if they were detected):
    - DNA sequence variants.
    - Copy number variations (CNV).
    - Gene fusions.
- Analyzed genes.
- Analytical method.
- Study limitations.

Processing Multiple Files

ORscraper supports the analysis of multiple reports simultaneously, streamlining workflows for large datasets.

Recommended File Path Structure

For better organization, the following directory structure is recommended when specifying file paths in the code:

InputFolder <- "INPUT"
ReportFolder <- "Reports"
BasePath <- getwd()
inputPath <- file.path(BasePath, InputFolder, ReportFolder)
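Before processing, it can help to confirm that the folder exists and to see which PDF files it holds. The sketch below uses only base R. `list_report_pdfs` is a hypothetical helper name, not part of ORscraper, and the package's own read_pdf_files() may behave differently.

```r
# Hypothetical helper (not an ORscraper function): list the PDF files in a folder.
list_report_pdfs <- function(path) {
  if (!dir.exists(path)) {
    stop("Folder not found: ", path)
  }
  # Match the .pdf extension regardless of case and return full paths.
  list.files(path, pattern = "\\.pdf$", ignore.case = TRUE, full.names = TRUE)
}

# Demonstration on a temporary folder that mimics INPUT/Reports.
demo_root <- tempfile("orscraper_demo_")
demo_path <- file.path(demo_root, "INPUT", "Reports")
dir.create(demo_path, recursive = TRUE)
invisible(file.create(file.path(demo_path, c("100.1-example.pdf", "notes.txt"))))

found <- list_report_pdfs(demo_path)
print(basename(found))
```

Only the file with the .pdf extension is returned; the text file in the same folder is ignored.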

Alternatively, users can leverage the Shiny application available in this repository (https://github.com/SamuelGonzalez0204/PDF-Scrapping), which allows file selection via a graphical interface, eliminating the need to manually specify file paths.

Recommended Naming Convention

To improve organization, it is recommended to name documents using the chip number corresponding to the sequencing run.
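As a rough illustration of why a consistent naming scheme helps, the sketch below pulls a leading number from a file name with base R. It assumes the name starts with the chip number, as the bundled 100.1-example.pdf appears to. The pattern and the helper name `leading_chip_number` are assumptions made for illustration and are not taken from extract_chip_id().

```r
# Hypothetical helper: return the number at the start of each file name,
# or NA when the name does not start with digits.
leading_chip_number <- function(paths) {
  names_only <- basename(paths)
  # Digits, optionally followed by a dot and more digits, at the start of the name.
  hits <- regexpr("^[0-9]+(\\.[0-9]+)?", names_only)
  out <- rep(NA_character_, length(names_only))
  out[hits > 0] <- regmatches(names_only, hits)
  out
}

chips_demo <- leading_chip_number(c("Reports/100.1-example.pdf", "Reports/summary.pdf"))
print(chips_demo)
```

Files that do not follow the convention yield NA, which makes misnamed reports easy to spot before a batch run.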

Language Compatibility

ORscraper is designed to process reports in Spanish. However, it can be adapted to other languages by modifying the search patterns within the code.
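One way to organize such an adaptation is to keep the search patterns in a per-language lookup table. In the sketch below, the Spanish pattern is the one used in the example workflow of this document, while the English label is a hypothetical example. The base R `sub()` call only illustrates the idea and makes no claim about how ORscraper applies its patterns internally.

```r
# Search patterns grouped by language code.
patterns <- list(
  es = c(gender = ".*Sexo:\\s*"),
  en = c(gender = ".*Sex:\\s*")  # hypothetical English label, for illustration only
)

# Hypothetical helper: remove the label for a field, leaving its value.
strip_label <- function(line, field, lang) {
  sub(patterns[[lang]][[field]], "", line)
}

es_value <- strip_label("Sexo: Mujer", "gender", "es")
en_value <- strip_label("Sex: Female", "gender", "en")
print(c(es_value, en_value))
```

With this layout, supporting another language means adding one entry to the table rather than editing each extraction call.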

Example Input File

For testing and reference, the package includes a sample report in the repository folder:

inst/extdata/100.1-example.pdf

Users can utilize this file to better understand the expected input data format.
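Once a package is installed, the contents of its inst/ folder sit at the top level of the installed package, so the sample report can be located with base R's system.file(). The helper name below is hypothetical; it returns NA when the package or the file is not available.

```r
# Hypothetical helper: find a bundled example file, or return NA if it is absent.
locate_example_report <- function(file = "100.1-example.pdf", package = "ORscraper") {
  path <- system.file("extdata", file, package = package)
  # system.file() returns an empty string when nothing is found.
  if (nzchar(path)) path else NA_character_
}

example_pdf <- locate_example_report()
print(example_pdf)
```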

Core Functions

The ORscraper package provides various functions for extracting data from clinical reports:

classify_biopsy(): This function analyzes biopsy identifiers and categorizes them into types according to a defined rule, coded as 1 (biopsy), 2 (aspiration), and 3 (cytology).

extract_chip_id(): This function retrieves chip values from file names matching a specific pattern.

extract_fusions(): This function identifies and extracts fusion variants from text lines based on specific patterns.

extract_intermediate_values(): This function searches for a specific text pattern in a set of lines and extracts values that follow the pattern.

extract_values_from_tables(): This function analyzes a subset of text lines, extracting information such as mutations, pathogenicity, frequencies, codifications and changes.

extract_values_start_end(): This function appends extracted variable values based on start or end markers to a list.

filter_pathogenic_only(): This function filters a list of pathogenicity classifications, retaining only those marked as “Pathogenic”.

read_pdf_content(): This function extracts the text content from a PDF file and splits it into individual lines.

read_pdf_files(): This function scans a specified directory and retrieves all files with a .pdf extension.

search_ncbi_clinvar(): This function queries the NCBI ClinVar database for germline classifications based on gene and codification data.

Example Workflow

Step 1: Load and Process PDF Files

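First, retrieve a list of all PDF files from the specified directory using the read_pdf_files() function. The example also loads the gene list from the bundled Genes.xlsx file; later steps use it as the set of genes to search for.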

# readxl is needed to read the bundled gene list.
if (!requireNamespace("readxl", quietly = TRUE)) {
  stop("The readxl package is required for this vignette, install it with install.packages('readxl').")
}
# Folder containing the example files shipped with the package.
InputPath <- system.file("extdata", package = "ORscraper")
files <- ORscraper::read_pdf_files(InputPath)
# Gene symbols to search for, taken from the GEN column.
genes_file <- system.file("extdata/Genes.xlsx", package = "ORscraper")
genes <- readxl::read_excel(genes_file)
mutations <- unique(genes$GEN)

Step 2: Extract Text from PDFs

Read the content of each PDF file and split it into lines using read_pdf_content().

lines <- ORscraper::read_pdf_content(files[1])  # Example with the first file
head(lines)
#> [1] "                                                                                                                       Servicio de Anatomía Patológica"    
#> [2] "                                                                                                                        Laboratorio de Patología Molecular"
#> [3] ""                                                                                                                                                          
#> [4] ""                                                                                                                                                          
#> [5] ""                                                                                                                                                          
#> [6] ""
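To process every report rather than only the first, the same reading step can be applied to each path. The sketch below is generic: `read_all` is a hypothetical helper, and its `reader` argument stands in for whichever function reads one file (for example ORscraper::read_pdf_content, assuming it takes a single path as in the call above). The demonstration uses plain text files and readLines() so that it runs without any PDF.

```r
# Hypothetical helper: apply a reader to each path and name results by file.
read_all <- function(paths, reader) {
  contents <- lapply(paths, reader)
  names(contents) <- basename(paths)
  contents
}

# Stand-in files with placeholder content (not real report data).
demo_dir <- tempfile("reports_")
dir.create(demo_dir)
demo_files <- file.path(demo_dir, c("a.txt", "b.txt"))
writeLines(c("first line", "second line"), demo_files[1])
writeLines("only line", demo_files[2])

all_lines <- read_all(demo_files, readLines)
print(lengths(all_lines))
```

With the package installed, the equivalent call would be read_all(files, ORscraper::read_pdf_content), giving one character vector of lines per report.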

Step 3: Extract Key Information from Text

Use predefined patterns to extract diagnostic information, gender, tumor cell percentage, and sample quality.

diagnostic <- gender <- tumor_cell_percentage <- quality <- c()
diagnostic <- extract_values_start_end(diagnostic, lines, ".*Diagnóstico:\\s")
gender <- extract_values_start_end(gender, lines, ".*Sexo:\\s*")
tumor_cell_percentage <- extract_values_start_end(tumor_cell_percentage, lines, ".*% células tumorales:\\s")
quality <- extract_values_start_end(quality, lines, ".*CALIDAD DE LA MUESTRA /LIMITACIONES PARA SU ANÁLISIS:\\s")
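To make the role of these patterns concrete, the base R sketch below finds the lines that match a start pattern and strips everything up to and including the label. It illustrates the idea only and makes no claim about how extract_values_start_end() is implemented. `value_after_label` and the sample lines are hypothetical.

```r
# Hypothetical helper: keep lines matching the pattern, then drop the label part.
value_after_label <- function(lines, pattern) {
  hits <- grep(pattern, lines, value = TRUE)
  trimws(sub(pattern, "", hits))
}

# Placeholder lines that mimic the layout of a report (not real data).
sample_lines <- c("      Sexo: Mujer", "unrelated line", "Fecha: 01/01/2000")
gender_demo <- value_after_label(sample_lines, ".*Sexo:\\s*")
print(gender_demo)
```

The leading `.*` in the pattern absorbs the indentation that PDF text extraction leaves in front of the label, which is why the patterns in this step start with it.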

Step 4: Extract Additional Information

Extract other relevant values such as the patient ID (NHC, the medical record number), biopsy number, sample date, and diagnostic text.

NHC_Data <- NB_values <- dates <- textDiag <- c()
NHC_Data <- extract_intermediate_values(NHC_Data, lines, "NHC:")
NB_values <- extract_intermediate_values(NB_values, lines, "biopsia:")
dates <- extract_intermediate_values(dates, lines, "Fecha:")
textDiag <- extract_intermediate_values(textDiag, lines, "de la muestra:")
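These labels can appear in the middle of a line, next to other fields. The sketch below shows one way to capture the first token that follows a label, using base R regular expressions. It is not the implementation of extract_intermediate_values(); `token_after` and the sample line are hypothetical.

```r
# Hypothetical helper: return the first whitespace-delimited token after a label.
# Assumes the label contains no regular-expression metacharacters.
token_after <- function(lines, label) {
  hits <- grep(label, lines, fixed = TRUE, value = TRUE)
  sub(paste0(".*", label, "\\s*(\\S+).*"), "\\1", hits)
}

# Placeholder line with two labeled fields (not real patient data).
sample_line <- "NHC: 000000     Fecha: 01/01/2000"
nhc_demo <- token_after(sample_line, "NHC:")
date_demo <- token_after(sample_line, "Fecha:")
print(c(nhc_demo, date_demo))
```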

Step 5: Extract Genetic Mutation Data

Identify genetic mutations and their characteristics from tables within the reports.

TableValues <- extract_values_from_tables(lines, mutations)
mutateGenes <- TableValues[[1]]
pathogenity <- TableValues[[2]]
frequencies <- TableValues[[3]]
codifications <- TableValues[[4]]
changes <- TableValues[[5]]
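Positional indexing is easy to get wrong, so it can help to name the list elements once and access them by name. The sketch assumes the five-element order shown above. The stand-in list holds placeholder values, not real report data, and the element names are chosen here for illustration.

```r
# Stand-in for the list returned in this step (placeholder values only).
table_values_demo <- list(
  c("GENE_A", "GENE_B"),
  c("Pathogenic", "Benign"),
  c("10", "20"),
  c("code_1", "code_2"),
  c("change_1", "change_2")
)

# Name the elements in the order used above, then view them as one table.
names(table_values_demo) <- c("genes", "pathogenicity", "frequencies",
                              "codifications", "changes")
variants <- as.data.frame(table_values_demo, stringsAsFactors = FALSE)
print(variants)
```

A single data frame keeps each variant's fields on one row, which makes later filtering less error-prone than handling five parallel vectors.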

Step 6: Identify Gene Fusions

Extract fusion variants from the report lines, using the gene list loaded in Step 1.

fusions <- extract_fusions(lines, mutations)

Step 7: Search for Pathogenicity Information

Query the NCBI ClinVar database to retrieve pathogenicity classifications for detected mutations.

search_pathogenity <- search_ncbi_clinvar(pathogenity, mutateGenes, codifications)
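For readers curious about what a ClinVar lookup involves, the sketch below builds a search term from a gene symbol and a coding change. `build_clinvar_term` is a hypothetical helper and the term layout is an assumption, not the query that search_ncbi_clinvar() actually sends. The input values are placeholders.

```r
# Hypothetical helper: combine a gene symbol and a coding change into one term.
build_clinvar_term <- function(gene, codification) {
  sprintf('%s[gene] AND "%s"', gene, codification)
}

term <- build_clinvar_term("GENE_A", "c.100A>T")
print(term)
```

A term like this could then be passed to an Entrez client, for example rentrez::entrez_search(db = "clinvar", term = term), which needs the rentrez package and network access.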

Step 8: Filter Pathogenic Mutations

Extract only pathogenic mutations and their associated details.

pathogenic_mutations <- filter_pathogenic_only(pathogenity, mutateGenes)
pathogenic_changes <- filter_pathogenic_only(pathogenity, changes)
pathogenic_frequencies <- filter_pathogenic_only(pathogenity, frequencies)
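Conceptually, this step is a filter over parallel vectors: keep the entries of one vector at the positions where the classification equals "Pathogenic". The base R sketch below uses placeholder values and assumes an exact match on that label. It is not the package's implementation.

```r
# Hypothetical helper: keep values whose classification is exactly "Pathogenic".
keep_pathogenic <- function(classification, values) {
  values[!is.na(classification) & classification == "Pathogenic"]
}

# Placeholder vectors of equal length (not real report data).
classes_demo <- c("Pathogenic", "Benign", NA, "Pathogenic")
genes_demo <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")

kept <- keep_pathogenic(classes_demo, genes_demo)
print(kept)
```

Because the same classification vector is used for every filtered vector, the filtered genes, changes, and frequencies stay aligned position by position.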

Step 9: Classify Biopsies

Categorize each biopsy number by sample type (biopsy, aspiration, or cytology).

biopsies_identifiers <- classify_biopsy(NB_values)

Step 10: Extract Chip Identifier

Retrieve the chip ID used for sequencing from the file names.

chips <- extract_chip_id(files)

Additional Features

You can customize the mutations and diagnoses to process by supplying Excel files that specify the desired genes and the code associated with each diagnosis. Sample files are provided (inst/extdata/Diagnostico.xlsx and inst/extdata/Genes.xlsx) with the genes and diagnoses used when designing the R package and the Shiny application.

Shiny App

For users who prefer a graphical user interface, ORscraper has a companion Shiny app. The app lets you explore the data visually, upload clinical reports, and save data tables to local MongoDB Compass databases.

Access the Shiny app repository: https://github.com/SamuelGonzalez0204/ORscrapper_ShinyApp

Conclusion

The ORscraper package provides a set of tools for scraping medical information from clinical reports in PDF format. Users can customize the data to extract, the mutations to search for, and the diagnoses to work with, and can enrich the extracted variants with clinical significance retrieved from the ClinVar API.
