The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
A correctness and robustness release driven by a code review of 0.1.0 (see issue #12 for the full catalogue). Two changes alter previously-returned numeric output and are called out separately below.
nndr() now standardises (z-scores) each numerical
column by the real-data mean and standard deviation
before the nearest-neighbour distance is computed. Without this, a
single large-scale column (e.g. income in dollars)
dominated the Euclidean distance and the score moved with measurement
units rather than with row similarity. Pass
normalize = FALSE to recover the previous behaviour
exactly.correlation_similarity() and
contingency_similarity() now return
score = NA_real_ (rather than 1) when there
are fewer than two columns of the relevant type, and
diagnostic_report() returns NA_real_ per
column when the synthetic column is entirely NA. Aggregated
property scores in quality_report() /
diagnostic_report() skip these NAs
(na.rm = TRUE) so they no longer overstate fidelity with a
synthetic “1” where there is no signal to measure.equality_constraint() gains a tolerance
argument: with tolerance > 0 on numeric columns, the
check is abs(a - b) <= tolerance instead of exact
==. Default 0 preserves prior behaviour.custom_constraint() gains a vectorized
argument: when TRUE, the predicate is called
once with the whole data frame instead of once per row.
Substantially faster on large synthetic samples for vectorisable
predicates.ml_efficacy() gains a seed argument for
reproducible train/test splits. The caller’s global RNG state is
restored on exit, so callers using set.seed() elsewhere are
unaffected.nndr() gains a normalize argument (default
TRUE) — see the default-output note above.print() methods for equality_constraint,
inequality_constraint,
fixed_combinations_constraint, and
custom_constraint.metadata_to_json() / metadata_from_json()
now round-trip the structural constraint types (equality,
inequality, and fixed_combinations).
custom_constraint cannot be serialised — it holds an R
closure — and is dropped with a warning. Previously
metadata_to_json() crashed on any
constraint, so save_metadata() was effectively broken for
non-trivial metadata.check_constraint.equality_constraint and
check_constraint.inequality_constraint now return
FALSE (not NA) for rows containing
NA. This prevents NA from propagating into the
row selector used by sample()’s rejection loop, which
previously inserted phantom NA-only rows.sample_conditions() now honours metadata constraints
alongside the user-supplied conditions (previously it filtered only on
the conditions).tvd_similarity() now strips NAs from both
sides and divides by the non-NA count on each side; previously
NA-padding inflated TVD.ks_similarity() now suppresses the
ks.test() “p-value will be approximate in the presence
of ties” warning, which it leaked to users on any tied integer
column (very common in tables with integer ages, capital gains,
etc.).fixed_combinations_constraint now uses a collision-free
length-prefix key encoding ("<nchar>:<value>"),
removing a theoretical separator collision in the previous paste-based
comparison.fit.gaussian_copula_synthesizer() errors clearly when a
modeled column is entirely NA or when no row is complete
across all modeled columns. Previously the user saw a cryptic
'dim' must be an integer (>= 2) from inside
copula::normalCopula.ml_efficacy() validates target_col (must
be a column of real) and test_fraction (must
be strictly between 0 and 1) up front.attribute_disclosure_risk() validates that
known_cols are present and numeric (one-hot encode
categorical knowns first); previously triggered a cryptic
FNN::knnx.index error.gaussian_copula_synthesizer() cross-checks
numerical_distributions names against the metadata’s
numerical columns; silently-ignored typos like
list(capitl_gain = "gamma") now raise a clear error.sample_conditions() validates that .n
values are positive whole numbers (was silently truncating or accepting
negatives).privacy_report() errors when only one of
sensitive_col / known_cols is supplied
(previously silently dropped disclosure-risk computation).set_primary_key() emits an advisory warning when the
column’s metadata type is not "id", since the column would
otherwise be modeled as ordinary data and the diagnostic key-uniqueness
check would typically fail.set_column_type() docstring documents the
level-ordering rule for categorical columns — factor keeps
levels() order, character is sorted
lexicographically (c("2", "10") becomes
levels c("10", "2")).Initial CRAN release.
gaussian_copula_synthesizer()) that fits a single joint
copula over all modeled columns: numerical,
categorical, and boolean.norm, beta,
gamma, truncnorm, and uniform by
Kolmogorov-Smirnov distance. Per-column overrides via
numerical_distributions; global default via
default_distribution.sample() for unconditional generation and
sample_conditions() for conditional generation on
categorical or boolean values via rejection sampling.metadata(),
set_column_type(), set_primary_key()) with
auto-detection and JSON serialization (metadata_to_json(),
save_metadata()).add_constraint(), check_constraints()),
enforced via rejection sampling.quality_report() aggregates metrics into the
two-property hierarchy used by the Python SDMetrics
library:
correlation_similarity() for numerical pairs,
contingency_similarity() for categorical pairs). ML
efficacy (train-on-synthetic / test-on-real, TSTR/TRTR) is reported
separately, not folded into the overall score.diagnostic_report() checks structural validity:
boundary adherence (numerical ranges), category adherence (categorical
values), and key uniqueness for primary keys.privacy_report() reports the nearest-neighbour distance
ratio (NNDR) and, optionally, attribute disclosure risk.autoplot() methods for quality, diagnostic, and privacy
reports.adult_income — a 500-row sample of the
UCI Adult Income dataset used in examples and vignettes.These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.