Getting Started with autoFlagR

Introduction

autoFlagR is an R package for automated data quality auditing using unsupervised machine learning. It provides AI-driven anomaly detection for data quality assessment, primarily designed for Electronic Health Records (EHR) data, with benchmarking capabilities for validation and publication.

Installation

Install the package from CRAN:

install.packages("autoFlagR")

Basic Workflow

The core workflow has three stages, covered in Steps 2 to 4 below:

Preprocess your data
Score anomalies using AI algorithms
Flag top anomalies for review

Step 1: Load the Package

library(autoFlagR)
library(dplyr)

Step 2: Prepare Your Data

The prep_for_anomaly() function automatically handles:

Identifier columns (patient_id, encounter_id, etc.)
Missing value imputation
Numerical feature scaling (MAD or min-max)
Categorical variable encoding (one-hot)
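To make the scaling and encoding options named above concrete, here is a minimal base R sketch of MAD scaling, min-max scaling, and one-hot encoding. It is an illustration only: the vectors and variable names are invented for this example, and it makes no claim about how prep_for_anomaly() implements these steps internally.

```r
# Illustrative only: base R versions of the ideas, not autoFlagR internals.
x <- c(10, 12, 11, 13, 12, 95)  # hypothetical values with one extreme point

# Robust (MAD) scaling: centre on the median, divide by the
# median absolute deviation, so a single outlier barely shifts the scale
mad_scaled <- (x - median(x)) / mad(x)

# Min-max scaling: map the observed range onto [0, 1];
# the outlier here compresses all other values towards 0
minmax_scaled <- (x - min(x)) / (max(x) - min(x))

# One-hot encoding: one indicator column per factor level
df <- data.frame(gender = factor(c("M", "F", "F", "M")))
one_hot <- model.matrix(~ gender - 1, data = df)

print(round(mad_scaled, 2))
print(round(minmax_scaled, 2))
print(one_hot)
```

Comparing the two scaled vectors shows why a robust scaler can be preferable when the data already contain the extreme values you are trying to detect.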

# Example healthcare data
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)

# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20  # Unusually high costs
data$age[6:8] <- c(200, 180, 190)  # Impossible ages

# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")

Step 3: Score Anomalies

Use either Isolation Forest (default) or Local Outlier Factor (LOF):

# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data, 
  method = "iforest", 
  contamination = 0.05
)
#> Warning in (function (data, sample_size = min(nrow(data), 10000L), ntrees =
#> 500, : Attempting to use more than 1 thread, but package was compiled without
#> OpenMP support. See
#> https://github.com/david-cortes/installing-optimized-libraries#4-macos-install-and-enable-openmp

# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
#>    patient_id anomaly_score
#> 1           1    0.15034167
#> 2           2    0.21395292
#> 3           3    0.00000000
#> 4           4    0.02693202
#> 5           5    0.23670251
#> 6           6    0.04638215
#> 7           7    0.11533699
#> 8           8    0.15881136
#> 9           9    0.92531753
#> 10         10    0.71809012

Step 4: Flag Top Anomalies

Flag records as anomalous based on either a score threshold or a contamination rate:

# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data, 
  contamination = 0.05
)

# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
#>     patient_id anomaly_score is_anomaly
#> 39          39     0.9697503       TRUE
#> 56          56     0.9862881       TRUE
#> 63          63     0.9727825       TRUE
#> 73          73     0.9998179       TRUE
#> 135        135     0.9830231       TRUE
#> 157        157     1.0000000       TRUE
#> 175        175     0.9912094       TRUE
#> 184        184     0.9810962       TRUE
#> 191        191     0.9733082       TRUE
#> 192        192     0.9776592       TRUE
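As a rough picture of what flagging by contamination rate means, the base R sketch below turns a vector of scores into a logical flag in two common ways. The scores are simulated, and whether flag_top_anomalies() uses a quantile cutoff, a top-k rank, or something else is an assumption here, not something taken from the package source.

```r
# Hypothetical scores in [0, 1]; higher means more anomalous
set.seed(42)
scores <- runif(200)
contamination <- 0.05

# Option A: flag scores at or above the (1 - contamination) quantile
cutoff <- quantile(scores, probs = 1 - contamination, names = FALSE)
is_anomaly_q <- scores >= cutoff

# Option B: flag exactly the k highest-scoring records
k <- ceiling(contamination * length(scores))
is_anomaly_k <- rank(-scores, ties.method = "first") <= k

print(sum(is_anomaly_q))
print(sum(is_anomaly_k))
```

The two options agree when scores have no ties; with tied scores, the quantile cutoff can flag more than k records, while the rank-based version always flags exactly k.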

Step 5: Generate Audit Report

Generate comprehensive PDF, HTML, or DOCX reports:

# Generate PDF report (saves to tempdir() by default)
generate_audit_report(
  data,
  filename = "my_audit_report",
  output_dir = tempdir(),
  output_format = "pdf",
  method = "iforest",
  contamination = 0.05
)

Key Features

Automated Preprocessing: Handles identifiers, scales numerical features, and encodes categorical variables
Multiple AI Algorithms: Supports Isolation Forest and Local Outlier Factor (LOF) methods
Benchmarking Metrics: Calculates AUC-ROC, AUC-PR, and Top-K Recall when ground truth labels are available
Professional Reports: Generates PDF/HTML/DOCX reports with visualizations and prioritized audit listings
Tidy Interface: Designed to work seamlessly with the tidyverse
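For readers unfamiliar with the benchmarking metrics listed above, the following base R sketch computes AUC-ROC and Top-K Recall by hand on a small made-up example. The labels, scores, and variable names are hypothetical, and the package's own benchmarking functions are not used or reproduced here.

```r
# Hypothetical ground truth (1 = true anomaly) and anomaly scores
labels <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)
scores <- c(0.10, 0.35, 0.90, 0.20, 0.40, 0.05, 0.60, 0.80, 0.15, 0.30)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# AUC-ROC via the rank-sum (Mann-Whitney) identity: the share of
# (anomaly, normal) pairs in which the anomaly has the higher score
auc_roc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# Top-K Recall: share of true anomalies found among the k highest scores
k <- 3
top_k <- order(scores, decreasing = TRUE)[seq_len(k)]
top_k_recall <- sum(labels[top_k] == 1) / n_pos

print(auc_roc)       # 20/21, about 0.952
print(top_k_recall)  # 2/3
```

Here one normal record (score 0.60) outranks one true anomaly (score 0.40), which is why both metrics fall short of 1.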

Next Steps

See the Healthcare Example vignette for a detailed walkthrough
Learn about Benchmarking with ground truth labels
Explore the Function Reference for detailed documentation
