Name-blind variable-role detection with rolescry

The problem: names lie, data does not

Real datasets arrive with column names that are missing, misleading, in the wrong language, or simply wrong: a column called category_code holding continuous lab values, a gender column that is actually a free numeric measurement, an outcome buried under an opaque v7. Any tool that decides a column’s statistical role from its name inherits every one of those lies.

rolescry decides roles from the data signature instead – the guiding principle is Data inspice, non nomen (“inspect the data, not the name”). Renaming every column to col_1, col_2, ... does not change a single role assignment. This is the turnusol (litmus) invariant, and it is the package’s keystone test.

A worked example

d <- data.frame(
  arm  = rep(c(0, 1), each = 50),  # a balanced 2-level grouping
  pre  = rnorm(100, 10, 2),        # measured before ...
  post = rnorm(100, 11, 2),        # ... and after (paired)
  resp = rbinom(100, 1, 0.4)       # a binary response
)

res <- detect_roles(d)
res
#> <role_detection> 100 observations x 4 variables
#>   paired_pairs       pre, post                    pct=64.5
#>   agreement_pairs    pre, post                    pct=59.1
#>   time_variable      pre                          pct=90.0
#>   event_variable     arm                          pct=90.0
#>   outcome_continuous pre                          pct=60.0
#>   outcome_binary     arm                          pct=60.0
#>   covariate          pre, post, arm, resp         pct=50.0
summary(res)
#>                  role found           columns  pct
#> 1           group_var FALSE                    0.0
#> 2        paired_pairs  TRUE          pre,post 64.5
#> 3     agreement_pairs  TRUE          pre,post 59.1
#> 4       time_variable  TRUE               pre 90.0
#> 5      event_variable  TRUE               arm 90.0
#> 6          subject_id FALSE                    0.0
#> 7  outcome_continuous  TRUE               pre 60.0
#> 8      outcome_binary  TRUE               arm 60.0
#> 9   repeated_measures FALSE                    0.0
#> 10        scale_items FALSE                    0.0
#> 11          covariate  TRUE pre,post,arm,resp 50.0

The same call on the name-stripped twin yields the same roles by position:

d_blind <- setNames(d, paste0("col_", seq_along(d)))
pos <- function(r, dat) match(r$roles$paired_pairs$columns, names(dat))
identical(pos(detect_roles(d), d), pos(detect_roles(d_blind), d_blind))
#> [1] TRUE

How a role is scored

Each column is first typed by value (continuous, binary, categorical, ID), never by name. Candidate roles are then scored by signatures that capture the statistical shape a role implies – correlation and distributional overlap for paired measurements, Bland-Altman bias and intraclass correlation for agreement, event rate and right skew for survival, inter-item correlation and a Cronbach's alpha proxy for scale items, and so on. Every score is a transparent sum of named components you can inspect:

res$roles$paired_pairs$components[[1]]
#> $name
#> [1] "Correlation"
#> 
#> $score
#> [1] 0
#> 
#> $max
#> [1] 20
#> 
#> $detail
#> [1] "r=-0.00"

Shannon entropy

For a categorical column with level proportions \(p_1, \dots, p_k\), the normalized Shannon entropy

\[ H_{\text{norm}} = \frac{-\sum_i p_i \log_2 p_i}{\log_2 k} \in [0, 1] \]

measures how balanced the levels are. A grouping variable (treatment vs control) has high entropy (near-balanced); a near-constant flag has entropy near zero. Entropy drives both the value classifier and the group-balance signal.
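
The formula is short enough to write out in base R. The helper below, normalized_entropy(), is a hypothetical name used for this sketch and is not a rolescry export; it also assumes that a column with a single observed level should score 0, since the denominator \(\log_2 k\) vanishes for \(k = 1\).

```r
# Normalized Shannon entropy of a categorical vector, following the formula
# above. `normalized_entropy` is a hypothetical helper, not a rolescry export.
normalized_entropy <- function(x) {
  counts <- as.numeric(table(x))   # table() drops NA values by default
  p <- counts[counts > 0] / sum(counts)
  k <- length(p)                   # number of observed levels
  if (k < 2) return(0)             # assumption: a single level scores 0
  -sum(p * log2(p)) / log2(k)
}

balanced <- normalized_entropy(rep(c("treatment", "control"), each = 50))
flag     <- normalized_entropy(c(rep("yes", 99), "no"))
balanced  # perfectly balanced two-level grouping: 1
flag      # near-constant flag: close to 0
```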

Normalized mutual information

To ask – name-blind – whether a candidate grouping actually carries information about an outcome, rolescry uses normalized mutual information:

\[ \text{NMI}(X, Y) = \frac{I(X; Y)}{\min\{H(X),\, H(Y)\}} \in [0, 1], \]

which is 0 for independent variables and 1 for a deterministic association, and is comparable across variables with different numbers of levels. It is exposed directly:

g <- sample(c("A", "B", "C"), 300, replace = TRUE)
y <- ifelse(g == "A", "event", sample(c("event", "none"), 300, replace = TRUE))
compute_nmi(g, y)          # > 0: g informs y
#> [1] 0.2903596
compute_nmi(g, sample(g))  # ~ 0: shuffled -> independent
#> [1] 0.007514902
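
For readers who want to see the definition at work, here is a minimal base-R sketch of the same quantity built from a contingency table, using the identity \(I(X;Y) = H(X) + H(Y) - H(X,Y)\). The names entropy_bits() and nmi_sketch() are hypothetical and this is not the package's compute_nmi() implementation; returning 0 when either variable is constant is an assumption made here to avoid dividing by zero.

```r
# Entropy in bits of a vector of probabilities (zero cells contribute nothing).
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# NMI with the min-entropy normalization given above.
# `nmi_sketch` is a hypothetical helper, not the package's compute_nmi().
nmi_sketch <- function(x, y) {
  joint <- table(x, y)
  joint <- joint / sum(joint)            # joint proportions p(x, y)
  hx  <- entropy_bits(rowSums(joint))    # H(X) from the row margins
  hy  <- entropy_bits(colSums(joint))    # H(Y) from the column margins
  hxy <- entropy_bits(as.numeric(joint)) # joint entropy H(X, Y)
  denom <- min(hx, hy)
  if (denom == 0) return(0)              # assumption: constant variable -> 0
  (hx + hy - hxy) / denom                # I(X;Y) = H(X) + H(Y) - H(X,Y)
}

x <- rep(c("A", "B", "C"), each = 100)
nmi_same <- nmi_sketch(x, x)             # y is a copy of x: 1

u <- rep(c("A", "B"), each = 2)
v <- rep(c("u", "v"), times = 2)
nmi_indep <- nmi_sketch(u, v)            # every cell equally likely: 0
```

The two toy inputs are constructed so that the extremes are exact: a variable paired with itself gives 1, and a table whose four cells are equally populated gives 0.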

The optional, capped name bonus

Names are not useless – they are just untrustworthy. When you do trust them, pass a keyword dictionary via name_bonus. Names then act only as a small, capped tie-breaker: at most a +10 point nudge, i.e. no more than 10% of the selection score. The mathematical signature still accounts for at least 90%, a relationship enforced by score_gap_ok().

clin <- data.frame(
  male  = rbinom(120, 1, 0.5),      # a demographic binary (first)
  death = rbinom(120, 1, 0.3)       # the intended outcome
)
detect_roles(clin)$roles$outcome_binary$columns                                  # positional default
#> [1] "male"
detect_roles(clin, name_bonus = rolescry_default_name_bonus())$roles$outcome_binary$columns  # "death"
#> [1] "death"
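
The effect of a capped bonus can be pictured with a toy selection rule. Everything in the sketch below is hypothetical: select_with_bonus(), the signature scores, and the raw bonus values are invented for illustration and do not reproduce rolescry's internals or score_gap_ok(). It only shows why a bonus clamped to 10 points can break a tie but cannot overturn a clear data-driven lead.

```r
# Hypothetical illustration of a capped name bonus; not rolescry's internals.
# `signature` is a named vector of data-driven scores and `raw_bonus` is an
# uncapped name-based bonus per column. Both are invented for this example.
select_with_bonus <- function(signature, raw_bonus, cap = 10) {
  bonus <- pmin(pmax(raw_bonus, 0), cap)   # clamp the name bonus to [0, cap]
  total <- signature + bonus
  names(total)[which.max(total)]           # ties go to the first column
}

# Equal signatures: the capped bonus breaks the tie in favour of "death".
tie_pick <- select_with_bonus(c(male = 60, death = 60), c(male = 0, death = 25))

# A 15-point data-driven lead survives, because the bonus is cut back to 10.
lead_pick <- select_with_bonus(c(male = 75, death = 60), c(male = 0, death = 25))

tie_pick
lead_pick
```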

Header-aware loading

read_data() reads a file using the header row located by detect_header(), which relies on the same information-theoretic scorer, so messy exports with title rows or merged cells still load with sensible column names. Delimited text works with base R; spreadsheet and statistical formats use optional packages and degrade gracefully if those packages are not installed.

Attribution

rolescry is derived from Boynukara, C. (2026). MDStatR (v2.1.0 Veritas). Zenodo. https://doi.org/10.5281/zenodo.20707791. Run citation("rolescry") to cite the package and its parent engine.
