autoFlagR is an R package for automated data quality
auditing using unsupervised machine learning. It provides AI-driven
anomaly detection for data quality assessment, primarily designed for
Electronic Health Records (EHR) data, with benchmarking capabilities for
validation and publication.
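The typical workflow consists of three main steps:
- Prepare the data with prep_for_anomaly()
- Score each record with score_anomaly()
- Flag the highest-scoring records with flag_top_anomalies()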
The prep_for_anomaly() function automatically handles:
- Identifier columns (patient_id, encounter_id, etc.)
- Missing value imputation
- Numerical feature scaling (MAD or min-max)
- Categorical variable encoding (one-hot)
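To make the preprocessing steps listed above concrete, here is a minimal base-R sketch of the same kinds of transformations. It is not the implementation of prep_for_anomaly(): the helper name sketch_prep() is hypothetical, and the choice of median imputation, a "missing" category level, and median/MAD scaling are assumptions made for illustration.

```r
# Hypothetical helper, NOT part of autoFlagR: illustrates identifier removal,
# imputation, robust scaling, and one-hot encoding using base R only.
sketch_prep <- function(df, id_cols = character()) {
  # Drop identifier columns so they cannot influence the anomaly model
  features <- df[, setdiff(names(df), id_cols), drop = FALSE]
  is_num <- vapply(features, is.numeric, logical(1))

  # Numeric columns: median imputation (assumption), then median/MAD scaling
  scale_mad <- function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    spread <- mad(x)
    if (spread == 0) spread <- 1  # avoid dividing by zero for constant columns
    (x - median(x)) / spread
  }
  num_part <- lapply(features[is_num], scale_mad)

  # Categorical columns: one 0/1 indicator column per observed level
  one_hot <- function(x, name) {
    x <- as.character(x)
    x[is.na(x)] <- "missing"  # assumption: treat NA as its own level
    lv <- sort(unique(x))
    out <- outer(x, lv, "==") * 1
    colnames(out) <- paste(name, lv, sep = "_")
    out
  }
  cat_cols <- names(features)[!is_num]
  cat_part <- lapply(cat_cols, function(nm) one_hot(features[[nm]], nm))

  # Bind everything into a single numeric matrix
  do.call(cbind, c(num_part, cat_part))
}

# Small usage example with made-up values
toy <- data.frame(
  patient_id = 1:4,
  cost = c(100, 120, NA, 5000),
  gender = c("M", "F", "F", "M")
)
res <- sketch_prep(toy, id_cols = "patient_id")
colnames(res)
```

Median/MAD scaling is used here rather than mean/standard deviation because extreme values, which are exactly what the later steps look for, distort the mean and standard deviation far more than the median and MAD.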
library(autoFlagR)

# Example healthcare data
data <- data.frame(
  patient_id = 1:200,
  age = rnorm(200, 50, 15),
  cost = rnorm(200, 10000, 5000),
  length_of_stay = rpois(200, 5),
  gender = sample(c("M", "F"), 200, replace = TRUE),
  diagnosis = sample(c("A", "B", "C"), 200, replace = TRUE)
)
# Introduce some anomalies
data$cost[1:5] <- data$cost[1:5] * 20 # Unusually high costs
data$age[6:8] <- c(200, 180, 190) # Impossible ages
# Prepare data for anomaly detection
prepared <- prep_for_anomaly(data, id_cols = "patient_id")

Use either Isolation Forest (default) or Local Outlier Factor (LOF):
# Score anomalies using Isolation Forest
scored_data <- score_anomaly(
  data,
  method = "iforest",
  contamination = 0.05
)
#> Warning in (function (data, sample_size = min(nrow(data), 10000L), ntrees =
#> 500, : Attempting to use more than 1 thread, but package was compiled without
#> OpenMP support. See
#> https://github.com/david-cortes/installing-optimized-libraries#4-macos-install-and-enable-openmp
# View anomaly scores
head(scored_data[, c("patient_id", "anomaly_score")], 10)
#>    patient_id anomaly_score
#> 1           1    0.15034167
#> 2           2    0.21395292
#> 3           3    0.00000000
#> 4           4    0.02693202
#> 5           5    0.23670251
#> 6           6    0.04638215
#> 7           7    0.11533699
#> 8           8    0.15881136
#> 9           9    0.92531753
#> 10         10    0.71809012

Flag records as anomalous based on a score threshold or a contamination rate:
# Flag top anomalies
flagged_data <- flag_top_anomalies(
  scored_data,
  contamination = 0.05
)
# View flagged anomalies
anomalies <- flagged_data[flagged_data$is_anomaly, ]
head(anomalies[, c("patient_id", "anomaly_score", "is_anomaly")], 10)
#>     patient_id anomaly_score is_anomaly
#> 39          39     0.9697503       TRUE
#> 56          56     0.9862881       TRUE
#> 63          63     0.9727825       TRUE
#> 73          73     0.9998179       TRUE
#> 135        135     0.9830231       TRUE
#> 157        157     1.0000000       TRUE
#> 175        175     0.9912094       TRUE
#> 184        184     0.9810962       TRUE
#> 191        191     0.9733082       TRUE
#> 192        192     0.9776592       TRUE
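
The contamination rate used in the flagging step can be read as the expected share of anomalous records. One common way to turn such a rate into a cutoff is to take the matching upper quantile of the scores. The sketch below shows that idea in base R. The helper flag_by_contamination() is hypothetical, and whether flag_top_anomalies() derives its cutoff in exactly this way is an assumption.

```r
# Hypothetical helper, NOT part of autoFlagR: flags the top `contamination`
# share of records by taking the (1 - contamination) quantile as the cutoff.
flag_by_contamination <- function(scores, contamination = 0.05) {
  stopifnot(contamination > 0, contamination < 1)
  threshold <- quantile(scores, probs = 1 - contamination, names = FALSE)
  list(threshold = threshold, is_anomaly = scores >= threshold)
}

# Usage with simulated scores in [0, 1] (made-up data, not package output)
set.seed(42)
scores <- runif(200)
flags <- flag_by_contamination(scores, contamination = 0.05)

# With 200 distinct scores and a 5% rate, the 10 highest scores are flagged
sum(flags$is_anomaly)
```

A rate-based cutoff always flags roughly the same share of records, even in clean data, so a fixed score threshold may be the better choice when the number of true anomalies is expected to vary between datasets.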