The bestNormalize package contains a suite of transformation-estimating functions that can be used to normalize data. The function of the same name finds and executes the best of all of these potential normalizing transformations.
There are many instances where researchers may want to normalize a variable. What may come to mind first is the (often problematic) assumption of normality of the outcome (conditional on the covariates) in the classical linear regression problem. Over the years, many methods have been developed to relax this assumption: generalized linear models, quantile regression, survival models, etc. One technique that is still somewhat popular in this context is to “beat the data” into looking normal via some kind of normalizing transformation. This could be something as simple as a log transformation, or something as complex as a Yeo-Johnson transformation. While perhaps not the most elegant solution to the problem, this technique often works well as a quick and dirty fix.
Another increasingly popular application of normalization occurs in applied regression settings where the covariates are highly skewed. In these settings, high-leverage points (and highly influential points) tend to occur even when the covariates are centered and scaled. When examining interactions, these influential points become especially problematic, since the leverage of such a point is amplified in every interaction of which it is a parent. Normalizing the covariates mitigates their leverage and influence, which allows for easier model selection. As a result, popular model selection packages such as caret and recipes have built-in mechanisms to normalize the predictor variables (they call this “preprocessing”). This approach is unique in that it forgoes the assumption of linearity between the outcome and the covariate, opting instead for a linear relationship between Y and the transformed value of the covariate (which in many cases may be more plausible). This process has many benefits, as described in the accompanying paper (not yet published).
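As a point of reference, caret's preprocessing interface looks like the following (a minimal sketch on hypothetical data, assuming the caret package is installed):
# Sketch of caret's built-in normalizing preprocessing
library(caret)
df <- data.frame(x1 = rgamma(100, 1, 1))  # hypothetical skewed covariate
pp <- preProcess(df, method = c("YeoJohnson", "center", "scale"))
head(predict(pp, df))                     # the transformed covariate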
This package is designed to make such normalizing transformations as effortless and consistent as possible. It also introduces orderNorm, a normalization technique based on a rank mapping to the normal distribution, which guarantees normally distributed transformed data (provided no ties are present).
All of the transformations contained in this package are reversible (i.e., one-to-one), which allows for straightforward interpretation and consistency. In other words, any analysis performed on the normalized data can be interpreted in terms of the original unit (see the application below).
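As a quick illustration of this reversibility, the following minimal sketch (on hypothetical data, using the orderNorm transformation described below) transforms a vector and then recovers it via predict with inverse = TRUE:
# Round-trip sketch: transform, then invert back to the original unit
library(bestNormalize)
x <- rgamma(100, 1, 1)                                  # hypothetical skewed data
bn <- orderNorm(x)
x_back <- predict(bn, newdata = bn$x.t, inverse = TRUE) # reverse the transformation
all.equal(x, x_back)                                    # TRUE, up to numerical error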
There are several normalization transformation options, each with its own implementation and limitations, as outlined in the table below. While some of these methods are implemented well in other R packages, the bestNormalize package puts them all under the same umbrella syntax, which makes them easy to apply in a wide range of situations. The transformations contained in this package are summarized in this section and in Table 1.
Table 1: Normalizing methods comparison
Method | Implementation | Limitations
---|---|---
Lambert W x F | LambertW | Not always effective
Box-Cox | forecast / caret | Restricted to non-negatives
Yeo-Johnson | caret | Not always effective
orderNorm* | by hand | Cannot handle ties
* Note that the orderNorm method does lose some information, as does any non-parametric technique based on ranks.
The Lambert W x F transformation, proposed by Goerg and implemented in the LambertW package, is essentially a mechanism that de-skews a random variable \(X\) using moments. The method is motivated by system theory, and it is claimed to be able to transform any random variable into any other kind of random variable, making it applicable to a large number of cases.
One of that package’s main functions, Gaussianize, is similar in spirit to the purpose of this package. However, I have found in practice that this method often does not perform as well as the Box-Cox or Yeo-Johnson transformations.
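For reference, the direct call looks like the following (a minimal sketch, assuming the LambertW package is installed; type = "s" targets skewed input):
# Direct use of LambertW's Gaussianize (sketch)
library(LambertW)
x <- rgamma(100, 1, 1)                 # hypothetical skewed data
x_gauss <- Gaussianize(x, type = "s")  # returns a matrix of de-skewed values
MASS::truehist(x_gauss[, 1], nbins = 12)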
The Box-Cox transformation, proposed by Box and Cox in their famous 1964 paper and implemented with differing syntax and methods in many R packages (see caret, MASS, forecast), is a straightforward transformation that involves only one parameter, \(\lambda\):
\[ g(x; \lambda) = \boldsymbol 1 _{(\lambda \neq 0)} \frac{x^\lambda-1}{\lambda} + \boldsymbol 1_{(\lambda = 0)} \log x \]
where \(x\) refers to the datum in its original unit (pre-transformation). The \(\lambda\) parameter can be estimated via maximum likelihood.
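As a by-hand illustration of this equation (the packages above estimate \(\lambda\) and handle edge cases for you):
# By-hand Box-Cox transform for a fixed lambda (illustration only)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # Box-Cox requires positive data
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
box_cox(c(0.5, 1, 2, 10), lambda = 0.5)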
The Yeo-Johnson transformation, proposed by Yeo and Johnson in 2000, attempts to find the value of \(\lambda\) (in the following equation) that minimizes the Kullback-Leibler distance between the transformed distribution and the normal distribution.
\[ \begin{aligned} g(x;\lambda) &= \boldsymbol 1 _{(\lambda \neq 0,\ x \geq 0)} \frac{(x+1)^\lambda-1}{\lambda} \\ &+ \boldsymbol 1_{(\lambda = 0,\ x \geq 0)} \log (x+1) \\ &+ \boldsymbol 1_{(\lambda \neq 2,\ x < 0)} \frac{(1-x)^{2-\lambda}-1}{\lambda - 2} \\ &- \boldsymbol 1_{(\lambda = 2,\ x < 0)} \log (1-x) \\ \end{aligned} \]
This method has the advantage of working without having to worry about the domain of \(x\). As with the Box-Cox \(\lambda\), this \(\lambda\) parameter can be estimated via maximum likelihood.
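A by-hand sketch of \(g(x;\lambda)\), mirroring the four cases above (again, the packaged implementations estimate \(\lambda\) for you):
# By-hand Yeo-Johnson transform for a fixed lambda (illustration only)
yeo_johnson <- function(x, lambda) {
  pos <- x >= 0
  out <- numeric(length(x))
  out[pos] <- if (lambda != 0) ((x[pos] + 1)^lambda - 1) / lambda else log(x[pos] + 1)
  out[!pos] <- if (lambda != 2) ((1 - x[!pos])^(2 - lambda) - 1) / (lambda - 2) else -log(1 - x[!pos])
  out
}
yeo_johnson(c(-2, -1, 0, 1, 2), lambda = 0.5)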
The orderNorm technique uses the following transformation, where \(x\) refers to the original data:
\[ g(x) = \Phi ^{-1} \left(\frac{\text{rank} (x)}{\text{length}(x) + 1}\right) \]
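On the training data this amounts to a one-liner, sketched below (the package version additionally stores the fit so that predict can handle new data):
# One-line orderNorm transform of the training data (sketch)
order_norm <- function(x) qnorm(rank(x) / (length(x) + 1))
MASS::truehist(order_norm(rgamma(1000, 1, 1)), nbins = 12)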
On new data within the range of the original data, the transformation uses linear interpolation between the transformed values of the two nearest original data points. On new data outside the range of the original data, the transformation returns a warning and extrapolates using a shifted linear approximation of the fitted transformation. This is visualized below via the Petal.Width variable in the iris data set.
The reason for the shifted linear extrapolation is that it ensures the transformation remains one-to-one (which would not necessarily be the case if a smoother procedure were used). In practice this issue should be relatively minor, since we should not expect to see many observations outside the observed range if the sample size is large enough.
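To make the in-range behavior concrete, here is a minimal sketch using stats::approx on hypothetical data (the package's predict method additionally implements the shifted extrapolation beyond the range):
# In-range prediction via linear interpolation between transformed points (sketch)
x <- sort(rgamma(100, 1, 1))
x_t <- qnorm(rank(x) / (length(x) + 1))  # transformed training values
new_x <- c(0.4, 1.2, 2.5)                # hypothetical in-range new data
approx(x, x_t, xout = new_x)$y           # NA is returned for any out-of-range value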
The orderNorm technique will not guarantee a normal distribution in the presence of ties, but it still could yield the best normalizing transformation when compared to the Box-Cox, Yeo-Johnson, or Lambert W x F approaches.
There have been a range of other normalization techniques discussed since the original Box-Cox paper that are not included in this package (at the time of writing). Many of these transformations have their own strengths and weaknesses.
These include (but are not limited to): the modified Box-Cox (1964), Manly’s exponential transformation (1976), John and Draper’s modulus transformation (1980), and Bickel and Doksum’s modified Box-Cox (1981).
The framework of this package is to create a class for each transformation, so other normalization techniques would be easy extensions of this package (readers should feel free to submit a pull request to this package’s GitHub page with new transformation techniques if they are so inclined).
The bestNormalize package also includes a function to perform a binarizing transformation. This is provided as a potential “last resort” if a vector truly cannot be transformed to a normally distributed variable. When a user is automatically normalizing covariates, this is useful in case they accidentally try to normalize a vector with too few unique values.
The bestNormalize function selects the best transformation according to the Pearson P statistic. There are a variety of normality tests, but the benefit of the Pearson P is that it is a relatively interpretable goodness-of-fit test, and the ratio P / df can be compared across transformations as an absolute measure of departure from normality. bestNormalize selects the transformation whose transformed values fit normality most closely according to this statistic (or, equivalently, this ratio). The ratios are printed when the object is printed; see the examples in the next section.
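As a sketch of this selection logic, one can compute the Pearson P / df for each candidate with nortest::pearson.test (the package's internal computation may differ in details such as the number of classes):
# Compare Pearson P / df across candidate transformations (sketch)
library(bestNormalize)
library(nortest)
x <- rgamma(1000, 1, 1)
candidates <- list(boxcox = boxcox(x), yeojohnson = yeojohnson(x), orderNorm = orderNorm(x))
p_df <- sapply(candidates, function(obj) {
  pt <- pearson.test(obj$x.t)  # Pearson chi-squared goodness-of-fit test
  unname(pt$statistic / pt$df)
})
p_df                           # lower => closer to normality
names(which.min(p_df))         # the transformation bestNormalize would favor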
In this section, I provide some code that performs each of the transformations described in the prior section.
# Generate some data
set.seed(100)
x <- rgamma(1000, 1, 1)
MASS::truehist(x, nbins = 12)
This data is clearly not normal. Let’s use the bestNormalize functionality to perform a suite of potential transformations and see how each method performs.
# Lambert's W x F transformation
(lambert_obj <- lambert(x))
## Lambert WxF Transformation of type s with 1000 nonmissing obs.:
## Estimated statistics:
## - gamma = 0.4129
## - mean = 0.667563
## - sd = 0.7488649
# Box Cox's Transformation
(boxcox_obj <- boxcox(x))
## Box Cox Transformation with 1000 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.2739638
## - mean = -0.3870903
## - sd = 1.045498
# Yeo-Johnson's Transformation
(yeojohnson_obj <- yeojohnson(x))
## Yeo-Johnson Transformation with 1000 nonmissing obs.:
## Estimated statistics:
## - lambda = -0.7847174
## - mean = 0.4297504
## - sd = 0.2564503
# orderNorm Transformation
(orderNorm_obj <- orderNorm(x))
## orderNorm Transformation with 1000 nonmissing obs and no ties
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 0.000 0.253 0.693 1.437 7.431
# Pick the best one automatically
(BNobject <- bestNormalize(x))
## Best Normalizing transformation with 1000 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - Box-Cox: 0.8188
## - Lambert's W: 1.28
## - Yeo-Johnson: 5.8284
## - orderNorm: 0.0066
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 1000 nonmissing obs and no ties
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 0.000 0.253 0.693 1.437 7.431
# Last resort - binarize
(binarize_obj <- binarize(x))
## Binarize Transformation with 1000 nonmissing obs.
## Estimated Statistic:
## - median = 0.6928257
These objects can then be fed into the predict function to perform the transformation on new values. The reverse transformation is also possible with this function. Below, we plot the transformation for a range of new x values.
xx <- seq(min(x), max(x), length = 100)
plot(xx, predict(lambert_obj, newdata = xx), type = "l", col = 1, ylim = c(-4, 4),
xlab = 'x', ylab = "g(x)")
lines(xx, predict(boxcox_obj, newdata = xx), col = 2)
lines(xx, predict(yeojohnson_obj, newdata = xx), col = 3)
lines(xx, predict(orderNorm_obj, newdata = xx), col = 4)
legend("bottomright", legend = c("Lambert WxF", "Box Cox", "Yeo-Johnson", "OrderNorm"),
col = 1:4, lty = 1, bty = 'n')
To examine how each of them performed, we can visualize the transformed values in a histogram.
par(mfrow = c(2,2))
MASS::truehist(lambert_obj$x.t, main = "Lambert WxF transformation", nbins = 12)
MASS::truehist(boxcox_obj$x.t, main = "Box Cox transformation", nbins = 12)
MASS::truehist(yeojohnson_obj$x.t, main = "Yeo-Johnson transformation", nbins = 12)
MASS::truehist(orderNorm_obj$x.t, main = "orderNorm transformation", nbins = 12)
The best transformation in this case is plotted below.
par(mfrow = c(1,2))
MASS::truehist(BNobject$x.t, main = paste("Best Transformation:", BNobject$method), nbins = 12)
plot(xx, predict(BNobject, newdata = xx), type = "l", col = 1,
main = "Best Normalizing transformation", ylab = "g(x)", xlab = "x")
autotrader data
The autotrader data set was scraped from the autotrader website as part of this package (and because, at the time of writing, I needed to buy a car). I apply the bestNormalize functionality to de-skew mileage, age, and price in my pricing model. See ?autotrader for more information on this data set.
data("autotrader")
autotrader$yearsold <- 2017 - autotrader$Year
### Using bestNormalize
(priceBN <- bestNormalize(autotrader$price))
## Warning in orderNorm(x = c(2450, 1195, 2495, 2985, 7990, 11998, 992, 1588, : Ties in data, Normal distribution not guaranteed
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - Box-Cox: 29.0752
## - Lambert's W: 25.7374
## - Yeo-Johnson: 29.0752
## - orderNorm: 0.1712
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
## - 2465 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 722 11499 15998 21497 64998
(mileageBN <- bestNormalize(autotrader$mileage))
## Warning in orderNorm(x = c(113700, 215508, 158063, 232075, 176519, 20070, : Ties in data, Normal distribution not guaranteed
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - Box-Cox: 10.0671
## - Lambert's W: 9.8235
## - Yeo-Johnson: 10.0518
## - orderNorm: 0.0019
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
## - 6077 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 2 29099 44800 88950 325556
(yearsoldBN <- bestNormalize(autotrader$yearsold, allow_orderNorm = FALSE))
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - Box-Cox: 884.2059
## - Lambert's W: 882.1722
## - Yeo-Johnson: 884.2059
## - orderNorm: NA
##
## Based off these, bestNormalize chose:
## Lambert WxF Transformation of type s with 6283 nonmissing obs.:
## Estimated statistics:
## - gamma = 0.3218
## - mean = 4.149016
## - sd = 2.788658
par(mfrow = c(3, 2))
MASS::truehist(autotrader$price)
MASS::truehist(priceBN$x.t)
MASS::truehist(autotrader$mileage)
MASS::truehist(mileageBN$x.t)
MASS::truehist(autotrader$yearsold)
MASS::truehist(yearsoldBN$x.t)
par(mfrow = c(2, 2))
price.xx <- seq(min(autotrader$price), max(autotrader$price), length = 100)
mileage.xx <- seq(min(autotrader$mileage), max(autotrader$mileage), length = 100)
yearsold.xx <- seq(min(autotrader$yearsold), max(autotrader$yearsold), length = 100)
plot(price.xx, predict(priceBN, newdata = price.xx), type = "l",
main = "Price bestNormalizing transformation",
xlab = "Price ($)", ylab = "g(price)")
plot(mileage.xx, predict(mileageBN, newdata = mileage.xx), type = "l",
main = "Mileage bestNormalizing transformation",
xlab = "Mileage", ylab = "g(Mileage)")
plot(yearsold.xx, predict(yearsoldBN, newdata = yearsold.xx), type = "l",
main = "Years-old bestNormalizing transformation",
xlab = "Years-old", ylab = "g(Years-old)")
autotrader$price.t <- priceBN$x.t
autotrader$mileage.t <- mileageBN$x.t
autotrader$yearsold.t <- yearsoldBN$x.t
fit4 <- lm(price.t ~ mileage.t + yearsold.t,
data = autotrader)
summary(fit4)
##
## Call:
## lm(formula = price.t ~ mileage.t + yearsold.t, data = autotrader)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5958 -0.5867 -0.1206 0.4997 3.0799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.122e-05 9.901e-03 0.001 0.999
## mileage.t -2.456e-01 1.598e-02 -15.372 <2e-16 ***
## yearsold.t -4.066e-01 1.596e-02 -25.482 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7848 on 6280 degrees of freedom
## Multiple R-squared: 0.3828, Adjusted R-squared: 0.3826
## F-statistic: 1948 on 2 and 6280 DF, p-value: < 2.2e-16
miles.t <- predict(mileageBN, newdata = mileage.xx)
c1 <- coef(fit4)["mileage.t"]
par(mfrow = c(1, 1))
plot(
mileageBN$x.t,
priceBN$x.t,
pch = 16,
col = grey(.1, alpha = .2),
main = "Estimated linear effect (using transformed data)",
xlab = "g(Mileage)",
ylab = "g(Price)"
)
lines(miles.t,
coef(fit4)[1] + c1 * miles.t,
col = "slateblue",
lwd = 2)
## Mileage effect
plot(
autotrader$mileage,
autotrader$price,
pch = 16,
col = grey(.1, alpha = .2),
main = "Mileage effect (re-transformed to original unit)",
xlab = "Mileage",
ylab = "Price"
)
line_vals <- miles.t * c1 + coef(fit4)[1]
lines(
mileage.xx,
y = predict(priceBN, newdata = line_vals, inverse = TRUE),
lwd = 2,
col = "slateblue"
)
# Compare to GAM fit
fit_gam <- mgcv::gam(price ~ s(yearsold) + s(mileage), data = autotrader)
p_gam <- predict(fit_gam, newdata = data.frame(yearsold = mean(autotrader$yearsold),
mileage = mileage.xx))
lines(mileage.xx, p_gam, lwd = 2, col = 'green3')
legend(
'topright',
c("GAM fit", "Transformed linear fit"),
lwd = 2,
col = c("green3", "slateblue"),
bty = "n"
)
## Years Old effect
yo.t <- predict(yearsoldBN, newdata = yearsold.xx)
c2 <- coef(fit4)["yearsold.t"]
plot(
jitter(autotrader$yearsold, 1.5),
autotrader$price,
pch = 16,
col = grey(.1, alpha = .2),
main = "Years old effect (re-transformed to original unit)",
xlab = "Age (Jittered)",
ylab = "Price"
)
line_vals <- yo.t * c2 + coef(fit4)[1]
lines(
yearsold.xx,
y = predict(priceBN, newdata = line_vals, inverse = TRUE),
lwd = 2,
col = "slateblue"
)
# Compare to GAM fit
p_gam <- predict(fit_gam, newdata = data.frame(yearsold = yearsold.xx,
mileage = mean(autotrader$mileage)))
lines(yearsold.xx, p_gam, lwd = 2, col = 'green3')
legend(
'topright',
c("GAM fit", "Transformed linear fit"),
lwd = 2,
col = c("green3", "slateblue"),
bty = "n"
)