Real datasets arrive with column names that are missing, misleading,
in the wrong language, or simply wrong: a column called
category_code holding continuous lab values, a
gender column that is actually a free numeric measurement,
an outcome buried under an opaque v7. Any tool that decides
a column’s statistical role from its name
inherits every one of those lies.
rolescry decides roles from the data
signature instead – the guiding principle is Data inspice,
non nomen (“inspect the data, not the name”). Renaming every column
to col_1, col_2, ... does not change a single role
assignment. This is the turnusol (litmus) invariant,
and it is the package’s keystone test.
d <- data.frame(
  arm  = rep(c(0, 1), each = 50),  # a balanced 2-level grouping
  pre  = rnorm(100, 10, 2),        # measured before ...
  post = rnorm(100, 11, 2),        # ... and after (paired)
  resp = rbinom(100, 1, 0.4)       # a binary response
)
res <- detect_roles(d)
res
#> <role_detection> 100 observations x 4 variables
#>   paired_pairs        pre, post             pct=64.5
#>   agreement_pairs     pre, post             pct=59.1
#>   time_variable       pre                   pct=90.0
#>   event_variable      arm                   pct=90.0
#>   outcome_continuous  pre                   pct=60.0
#>   outcome_binary      arm                   pct=60.0
#>   covariate           pre, post, arm, resp  pct=50.0
summary(res)
#>                  role found           columns  pct
#> 1           group_var FALSE                    0.0
#> 2        paired_pairs  TRUE          pre,post 64.5
#> 3     agreement_pairs  TRUE          pre,post 59.1
#> 4       time_variable  TRUE               pre 90.0
#> 5      event_variable  TRUE               arm 90.0
#> 6          subject_id FALSE                    0.0
#> 7  outcome_continuous  TRUE               pre 60.0
#> 8      outcome_binary  TRUE               arm 60.0
#> 9   repeated_measures FALSE                    0.0
#> 10        scale_items FALSE                    0.0
#> 11          covariate  TRUE pre,post,arm,resp 50.0

The same call on the name-stripped twin yields the same roles by position.
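The invariant itself can be illustrated without the package. The sketch below is a toy, not rolescry's classifier: `value_type()` and `type_columns()` are hypothetical helpers, and the thresholds in them are assumptions chosen only for illustration. The point is that a typer which never reads `names(d)` returns the same answer, position by position, on a renamed copy.

```r
# Toy value-based typer: it looks only at the values, never at column names.
# The rules and thresholds are illustrative assumptions, not rolescry's.
value_type <- function(x) {
  x <- x[!is.na(x)]
  n_unique <- length(unique(x))
  if (!is.numeric(x) && n_unique == length(x)) return("ID")
  if (n_unique == 2) return("binary")
  if (is.numeric(x) && n_unique > 10) return("continuous")
  "categorical"
}

# Type every column and drop the names, so results compare by position.
type_columns <- function(d) unname(vapply(d, value_type, character(1)))

set.seed(1)
d <- data.frame(
  arm  = rep(c(0, 1), each = 50),
  pre  = rnorm(100, 10, 2),
  resp = rbinom(100, 1, 0.4)
)

# The name-stripped twin: same values, columns renamed col_1, col_2, ...
twin <- setNames(d, paste0("col_", seq_along(d)))

litmus_ok <- identical(type_columns(d), type_columns(twin))
litmus_ok
#> [1] TRUE
```

Any step that consulted the names would be the only way for `litmus_ok` to come out `FALSE`, which is what makes the rename check a useful regression test.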
Each column is first typed by value
(continuous, binary, categorical,
ID), never by name. Candidate roles are then scored by
signatures that capture the statistical shape a role implies –
correlation and distributional overlap for paired measurements,
Bland-Altman bias and intraclass correlation for agreement, event-rate
and right-skew for survival, inter-item correlation and a Cronbach-alpha
proxy for scale items, and so on. Every score is a transparent sum of
named components you can inspect:
res$roles$paired_pairs$components[[1]]
#> $name
#> [1] "Correlation"
#>
#> $score
#> [1] 0
#>
#> $max
#> [1] 20
#>
#> $detail
#> [1] "r=-0.00"

For a categorical column with level proportions \(p_1, \dots, p_k\), the normalized Shannon entropy
\[ H_{\text{norm}} = \frac{-\sum_i p_i \log_2 p_i}{\log_2 k} \in [0, 1] \]
measures how balanced the levels are. A grouping variable (treatment vs control) has high entropy (near-balanced); a near-constant flag has entropy near zero. Entropy drives both the value classifier and the group-balance signal.
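The formula is short enough to check by hand in base R. The helper below, `norm_entropy()`, is a hypothetical name for this sketch and not a rolescry function; it assumes \(k\) is the number of levels actually observed and returns 0 for a constant column, where the ratio is otherwise undefined.

```r
# Normalized Shannon entropy of a categorical vector, base R only.
# Assumption: k counts observed levels; missing values are ignored.
norm_entropy <- function(x) {
  counts <- table(x)                  # table() drops NA by default
  p <- as.numeric(counts) / sum(counts)
  p <- p[p > 0]                       # unused factor levels contribute nothing
  k <- length(p)
  if (k < 2) return(0)                # constant column: define H_norm as 0
  -sum(p * log2(p)) / log2(k)
}

balanced <- norm_entropy(rep(c("treatment", "control"), each = 50))
flag     <- norm_entropy(c(rep(0, 99), 1))

balanced  # exactly balanced two-level grouping: 1
flag      # near-constant flag: roughly 0.08
```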
To ask – name-blind – whether a candidate grouping actually
carries information about an outcome, rolescry
uses normalized mutual information:
\[ \text{NMI}(X, Y) = \frac{I(X; Y)}{\min\{H(X),\, H(Y)\}} \in [0, 1], \]
which is 0 for independent variables and 1 for a deterministic association, and is comparable across variables with different numbers of levels. The package exposes this measure directly.
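Independently of the package's own function, the definition can be reproduced in a few lines of base R, using the identity \(I(X;Y) = H(X) + H(Y) - H(X,Y)\). The names `entropy_bits()` and `nmi()` are hypothetical helpers for this sketch, and returning 0 when either variable is constant is an assumption made here to avoid dividing by zero.

```r
# Shannon entropy (in bits) of a probability vector.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Normalized mutual information of two categorical vectors:
# I(X;Y) / min(H(X), H(Y)), with I(X;Y) = H(X) + H(Y) - H(X,Y).
nmi <- function(x, y) {
  joint <- table(x, y)
  joint <- joint / sum(joint)          # joint probabilities
  hx <- entropy_bits(rowSums(joint))   # marginal entropy of x
  hy <- entropy_bits(colSums(joint))   # marginal entropy of y
  mi <- hx + hy - entropy_bits(as.vector(joint))
  denom <- min(hx, hy)
  if (denom == 0) return(0)            # assumption: constant variable -> 0
  min(1, max(0, mi / denom))           # clamp floating-point drift into [0, 1]
}

g3    <- rep(c("a", "b", "c"), each = 20)
y_det <- ifelse(g3 == "a", "u", "v")   # fully determined by g3

g2      <- rep(c("a", "b"), each = 50)
y_indep <- rep(c("u", "v"), times = 50) # every (g2, y) cell equally likely

nmi(g3, y_det)    # deterministic association: 1
nmi(g2, y_indep)  # independent by construction: 0
```

The first pair has three levels against two, which shows why dividing by the smaller entropy keeps the value comparable across variables with different numbers of levels.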
Names are not useless – they are just untrustworthy. When
you do trust them, pass a keyword dictionary via
name_bonus. Names then act only as a small,
capped tie-breaker (at most a +10 point nudge,
i.e. <= 10% of the selection score); the mathematical signature still
dominates (>= 90%), the relationship enforced by
score_gap_ok().
clin <- data.frame(
  male  = rbinom(120, 1, 0.5),  # a demographic binary (first)
  death = rbinom(120, 1, 0.3)   # the intended outcome
)
detect_roles(clin)$roles$outcome_binary$columns # positional default
#> [1] "male"
detect_roles(clin, name_bonus = rolescry_default_name_bonus())$roles$outcome_binary$columns # "death"
#> [1] "death"

read_data() reads a file with the header row found by
the same information-theoretic scorer (detect_header()), so
messy exports with title rows or merged cells still load with sensible
column names. Delimited text works with base R; spreadsheet and
statistical formats use optional packages and degrade gracefully if they
are not installed.
rolescry is derived from Boynukara, C. (2026).
MDStatR (v2.1.0 Veritas). Zenodo. https://doi.org/10.5281/zenodo.20707791. Run
citation("rolescry") to cite the package and its parent
engine.