The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
DTwrappers2 builds a set of summarization and cleaning methods that can be applied to one or more variables within a data.frame object. This package builds upon earlier work in the DTwrappers package, which provides a back-end for simultaneous calculations on many variables. The package can provide efficient computations relying upon the data.table package, along with providing translations of the code into data.table’s syntax.
The methods of DTwrappers2 generally fall into a number of categories, including:
All of these methods are amenable to scaled calculations that a) can be simultaneously applied to many variables, b) incorporate filters of the data, and c) allow for grouped calculations.
We will utilize the iris data available within R to provide examples of calculations with DTwrappers2.
library(DTwrappers2)
library(DTwrappers)
library(data.table)
data(iris)
The methods of DTwrappers2 are developed in pairs. Typically a function will be introduced to work on a single variable. Then a corresponding calculation for multiple variables in a data.frame will be introduced by appending the prefix dt.
Numeric variables require that each entry is specified numerically. Typos that introduce text in even a single value will convert the entire vector into a character format. With this in mind, we introduce methods to identify and remove the culprits of accidental character coercion.
As an example, we will create a new variable in the iris data that is intended to be numeric but includes some accidental text.
RNGversion(vstr = 3.6)
set.seed(47)
iris$noise <- rnorm(n = nrow(iris))
iris$noise[c(1, 51)] <- c("0.13ABC", "N/A")
is.character(iris$noise)
#> [1] TRUE
We can examine the noise variable to determine if it is a character variable that might reasonably be reformatted as numeric. The threshold.for.numeric indicates the minimum proportion of entries that would properly convert to a numeric value if the variable x were shifted into a numeric format. If this proportion is exceeded, the method returns a vector of the “character coercion culprits”, which are the values for which accidental text may have been introduced.
character.coercion.culprits(x = iris$noise, threshold.for.numeric = 0.8)
#> [1] "0.13ABC" "N/A"
We can also confirm that numeric variables have no accidental text.
character.coercion.culprits(x = iris$Sepal.Length, threshold.for.numeric = 0.8)
#> numeric(0)
This method can be scaled to multiple variables or the entire data set using dt.character.coercion.culprits. Across all of the selected variables, a threshold of 0.8 was used as the minimum proportion for a variable that should be converted to numeric format. Using the.variables, one can specify which variables (names of the data.frame) to apply the calculation to. The default value of “.” specifies all of the variables in the data.frame
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.variables = ".")
#> variable character.coercion.culprits
#> <char> <list>
#> 1: Sepal.Length
#> 2: Sepal.Width
#> 3: Petal.Length
#> 4: Petal.Width
#> 5: Species
#> 6: noise 0.13ABC,N/A
Setting return.as = “code” can show a translation into the syntax of the data.table package.
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.variables = ".",
return.as = "code")
#> [1] "iris[, .(variable = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species', 'noise'), character.coercion.culprits = lapply(X = .SD, FUN = 'character.coercion.culprits', threshold.for.numeric = 0.8))]"
Then we can verify that this coding statement would generate the same coding output. Note that we would have to convert the iris data into data.table format to run this code directly.
library(data.table)
iris <- as.data.table(iris)
iris[, .(variable = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width",
"Species", "noise"), character.coercion.culprits = lapply(X = .SD, FUN = "character.coercion.culprits",
threshold.for.numeric = 0.8))]
#> variable character.coercion.culprits
#> <char> <list>
#> 1: Sepal.Length
#> 2: Sepal.Width
#> 3: Petal.Length
#> 4: Petal.Width
#> 5: Species
#> 6: noise 0.13ABC,N/A
Setting return.as = “all” will create a list object that includes the result as well as the code:
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.variables = ".",
return.as = "all")
#> $result
#> variable character.coercion.culprits
#> <char> <list>
#> 1: Sepal.Length
#> 2: Sepal.Width
#> 3: Petal.Length
#> 4: Petal.Width
#> 5: Species
#> 6: noise 0.13ABC,N/A
#>
#> $code
#> [1] "iris[, .(variable = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species', 'noise'), character.coercion.culprits = lapply(X = .SD, FUN = 'character.coercion.culprits', threshold.for.numeric = 0.8))]"
The character coercion culprits can also be calculated in groups:
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.variables = c("Petal.Width",
"noise"), grouping.variables = "Species")
#> Key: <Species>
#> Species variable character.coercion.culprits
#> <fctr> <char> <list>
#> 1: setosa Petal.Width
#> 2: setosa noise 0.13ABC
#> 3: versicolor Petal.Width
#> 4: versicolor noise N/A
#> 5: virginica Petal.Width
#> 6: virginica noise
This will allow us to better identify the specific groupings in which these cases of mistaken text arose.
Likewise, filtering statements may also be used to restrict attention to a subset of the rows of the data:
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.filter = "Species != 'virginica' & Sepal.Width > 3.3",
the.variables = c("Petal.Width", "noise"), grouping.variables = "Species")
#> Key: <Species>
#> Species variable character.coercion.culprits
#> <fctr> <char> <list>
#> 1: setosa Petal.Width
#> 2: setosa noise 0.13ABC
#> 3: versicolor Petal.Width
#> 4: versicolor noise
Note that ‘virginica’ is specified within single quotation marks. Alternatively, the.filter can be entered as an expression value.
dt.character.coercion.culprits(dt.name = "iris", threshold.for.numeric = 0.8, the.filter = expression(Species !=
"virginica" & Sepal.Width > 3.3), the.variables = c("Petal.Width", "noise"),
grouping.variables = "Species")
#> Key: <Species>
#> Species variable character.coercion.culprits
#> <fctr> <char> <list>
#> 1: setosa Petal.Width
#> 2: setosa noise 0.13ABC
#> 3: versicolor Petal.Width
#> 4: versicolor noise
Filters may also be specified by row index, either numerically or as a character object:
dt.character.coercion.culprits(dt.name = "iris", the.filter = 1:100)
#> variable character.coercion.culprits
#> <char> <list>
#> 1: Sepal.Length
#> 2: Sepal.Width
#> 3: Petal.Length
#> 4: Petal.Width
#> 5: Species
#> 6: noise 0.13ABC,N/A
dt.character.coercion.culprits(dt.name = "iris", the.filter = "1:100")
#> variable character.coercion.culprits
#> <char> <list>
#> 1: Sepal.Length
#> 2: Sepal.Width
#> 3: Petal.Length
#> 4: Petal.Width
#> 5: Species
#> 6: noise 0.13ABC,N/A
The features of filtering, grouping, and specifying the range of variables will apply broadly to the other methods in the DTwrappers2 package.
The character.coercion.culprits() function was used to identify cases of accidental text. The remove.erroneous.characters() function both identifies and removes these values. Non-numeric variables can be converted to numeric form if the proportion of amenable values exceeds the specified threshold.for.numeric. Cases of accidental text are converted to missing values.
As an example, we will convert the first five entries of the noise variable to numeric form while removing the case of accidental text.
iris$noise[1:5]
#> [1] "0.13ABC" "0.711142505065278" "0.185405284313495"
#> [4] "-0.281765014689636" "0.108775546627455"
remove.erroneous.characters(x = iris$noise[1:5], threshold.for.numeric = 0.8, variable.should.be = "numeric")
#> [1] NA 0.7111425 0.1854053 -0.2817650 0.1087755
These values have been converted to a numeric form, and the first entry was shifted to a missing value. The first value could well have been the number 0.13. If a conversion to a missing value is a concern, then using character.coercion.culprits() to identify the specific cases could aid in data cleaning.
The dt.remove.erroneous.characters() function then scales this process to multiple variables. Here we show an example applied to all of the columns of the iris data with a 0.8 threshold for numeric variables.
dt.remove.erroneous.characters(dt.name = "iris", threshold.for.numeric = 0.8)[1:5,
]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise
#> <num> <num> <num> <num> <num> <num>
#> 1: 5.1 3.5 1.4 0.2 1 NA
#> 2: 4.9 3.0 1.4 0.2 1 0.7111425
#> 3: 4.7 3.2 1.3 0.2 1 0.1854053
#> 4: 4.6 3.1 1.5 0.2 1 -0.2817650
#> 5: 5.0 3.6 1.4 0.2 1 0.1087755
Contingent calculations are performed only in specified cases. For instance, we might decide to calculate the mean value only if the variable is numeric. When applied to a single variable, this provides a small degree of additional case-checking, with fewer errors or warning messages generated. However, when scaled to multiple variables in a data.frame, these calculations prevent errors across potentially a large number of non-numeric variables.
First, we can demonstrate calculating the mean of numeric and non-numeric variables:
mean.numerics(x = iris$Sepal.Length)
#> [1] 5.843333
mean.numerics(x = iris$Species)
#> [1] NA
mean.numerics(x = iris$Species, non.numeric.value = "first")
#> [1] setosa
#> Levels: setosa versicolor virginica
The mean.numerics function is equivalent to the mean when the variable is numeric. Since the Species is a factor, a missing value is returned without generating warnings or errors. Likewise, when non.numeric.value is not set to the default of “missing”, then the first entry of the variable is returned.
The dt.mean.numerics function applies mean.numerics to multiple variables in a data.frame:
dt.mean.numerics(dt.name = "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise
#> <num> <num> <num> <num> <lgcl> <lgcl>
#> 1: 5.843333 3.057333 3.758 1.199333 NA NA
This can also be applied in groups and with filters applied:
dt.mean.numerics(dt.name = "iris", the.filter = "Sepal.Length > 4 & Sepal.Length < 6",
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width noise
#> <fctr> <num> <num> <num> <num> <lgcl>
#> 1: setosa 5.006000 3.428000 1.462000 0.246000 NA
#> 2: versicolor 5.530769 2.673077 3.969231 1.238462 NA
#> 3: virginica 5.642857 2.714286 4.971429 1.957143 NA
Similar to mean.numerics, the DTwrappers2 package includes contingent calculations for a variety of summary statistics:
median.numerics(x = iris$Sepal.Length)
#> [1] 5.8
dt.median.numerics(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 5.0 3.4
#> 2: versicolor 5.9 2.8
#> 3: virginica 6.5 3.0
sd.numerics(x = iris$Sepal.Length)
#> [1] 0.8280661
dt.sd.numerics(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 0.3524897 0.3790644
#> 2: versicolor 0.5161711 0.3137983
#> 3: virginica 0.6358796 0.3224966
var.numerics(x = iris$Sepal.Length)
#> [1] 0.6856935
dt.var.numerics(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 0.1242490 0.14368980
#> 2: versicolor 0.2664327 0.09846939
#> 3: virginica 0.4043429 0.10400408
min.numerics(x = iris$Sepal.Length)
#> [1] 4.3
dt.min.numerics(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 4.3 2.3
#> 2: versicolor 4.9 2.0
#> 3: virginica 4.9 2.2
max.numerics(x = iris$Sepal.Length)
#> [1] 7.9
dt.max.numerics(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 5.8 4.4
#> 2: versicolor 7.0 3.4
#> 3: virginica 7.9 3.8
Much like the calculation of summary statistics, rounding and formatting numbers can be useful in contingent and scaled settings.
Rounding is intended for numeric variables. The round.numerics function is a light wrapper of the round function. It rounds a variable to a specified number of digits if it is numeric; otherwise the variable is returned in its original form.
round.numerics(x = iris$Sepal.Length[1:5], digits = 0)
#> [1] 5 5 5 5 5
round.numerics(x = iris$Species[1:5])
#> [1] setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
The dt.round.numerics function is used to apply contingent rounding across multiple variables in a data.frame. Here we will calculate a grouped average using dt.mean.numerics. Then the numeric values in the resulting table will be rounded to 2 decimal places.
grouped.means <- dt.mean.numerics(dt.name = "iris", the.variables = c("Sepal.Length",
"Sepal.Width"), grouping.variables = "Species")
dt.round.numerics(dt.name = "grouped.means", digits = 2)
#> Key: <Species>
#> Species Sepal.Length Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 5.01 3.43
#> 2: versicolor 5.94 2.77
#> 3: virginica 6.59 2.97
This method is often useful for presenting a table of results that includes character descriptions and numeric measures.
Similarly, format.numerics is a light wrapper of the format function that is only applied to numeric inputs. We will demonstrate this calculation by changing the units of the sepal length and adding a small degree of variation in the measurements:
iris$SL.1000 <- iris$Sepal.Length * 1000 + rnorm(n = nrow(iris), mean = 0, sd = 25)
format.numerics(x = iris$SL.1000[1:5], digits = 2, big.mark = ",")
#> [1] "5,153.65" "4,923.38" "4,666.27" "4,597.23" "5,031.51"
Then we can update the grouped.means object to include this new variable before formatting:
grouped.means <- dt.mean.numerics(dt.name = "iris", grouping.variables = "Species",
na.rm = T)
dt.format.numerics(dt.name = "grouped.means", the.variables = c("Sepal.Length", "SL.1000"),
digits = 2, big.mark = ",", grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length SL.1000
#> <fctr> <char> <char>
#> 1: setosa 5.01 5,011.48
#> 2: versicolor 5.94 5,939.34
#> 3: virginica 6.59 6,591.39
Traditional rounding provides results up to a specified number of digits. Rounding 3.96 to 3 decimal places would only result in 2 digits displayed. The round_exactly function creates character outputs that add lagging zeros. This ensures that every entry of a variable has the same number of decimal places. The result is returned as a character vector.
round_exactly(x = iris$Sepal.Length[1:5], digits = 3)
#> [1] "5.100" "4.900" "4.700" "4.600" "5.000"
round_exactly(x = iris$Sepal.Length[1:5], digits = 5)
#> [1] "5.10000" "4.90000" "4.70000" "4.60000" "5.00000"
dt.round.exactly(dt.name = "grouped.means", the.variables = c("Sepal.Length", "Sepal.Width"),
digits = 3)
#> Sepal.Length Sepal.Width
#> <char> <char>
#> 1: 5.006 3.428
#> 2: 5.936 2.770
#> 3: 6.588 2.974
Assessing the degree of missingness aids in investigating the quality of the data and the need for imputation. The functions total.missing, total.measured, mean.missing, and mean.measured provide simple ways of summarizing the degree of missingness in a variable:
iris$noise[3 + 50 * (0:2)] <- NA
total.missing(x = iris$noise)
#> [1] 3
total.measured(x = iris$noise)
#> [1] 147
mean.missing(x = iris$noise)
#> [1] 0.02
mean.measured(x = iris$noise)
#> [1] 0.98
These calculations can then be scaled to multiple variables using dt.total.missing, dt.total.measured, dt.mean.missing, and dt.mean.measured:
dt.total.missing(dt.name = "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise SL.1000
#> <int> <int> <int> <int> <int> <int> <int>
#> 1: 0 0 0 0 0 3 0
dt.total.measured(dt.name = "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise SL.1000
#> <int> <int> <int> <int> <int> <int> <int>
#> 1: 150 150 150 150 150 147 150
dt.mean.missing(dt.name = "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise SL.1000
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0.02 0
dt.mean.measured(dt.name = "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise SL.1000
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1 1 1 1 1 0.98 1
Each of these calculations can then incorporating filtering and grouping steps.
dt.total.missing(dt.name = "iris", grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width noise SL.1000
#> <fctr> <int> <int> <int> <int> <int> <int>
#> 1: setosa 0 0 0 0 1 0
#> 2: versicolor 0 0 0 0 1 0
#> 3: virginica 0 0 0 0 1 0
dt.total.measured(dt.name = "iris", the.filter = "Sepal.Length > 4 & Sepal.Length < 6")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species noise SL.1000
#> <int> <int> <int> <int> <int> <int> <int>
#> 1: 83 83 83 83 83 82 83
dt.mean.missing(dt.name = "iris", the.filter = "Sepal.Length > 4 & Sepal.Length < 6",
grouping.variables = "Species")
#> Key: <Species>
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width noise SL.1000
#> <fctr> <num> <num> <num> <num> <num> <num>
#> 1: setosa 0 0 0 0 0.02 0
#> 2: versicolor 0 0 0 0 0.00 0
#> 3: virginica 0 0 0 0 0.00 0
dt.mean.measured(dt.name = "iris", the.variables = c("noise", "Sepal.Width"), the.filter = "Sepal.Length > 4 & Sepal.Length < 6",
grouping.variables = "Species", return.as = "all")
#> $result
#> Key: <Species>
#> Species noise Sepal.Width
#> <fctr> <num> <num>
#> 1: setosa 0.98 1
#> 2: versicolor 1.00 1
#> 3: virginica 1.00 1
#>
#> $code
#> [1] "iris[Sepal.Length > 4 & Sepal.Length < 6, lapply(X = .SD, FUN = 'mean.measured'), .SDcols = c('noise', 'Sepal.Width'), keyby = Species]"
The summary function from base R calculates 6 summary statistics of numeric variables. Applied to a single variable, this returns a numeric vector, with the statistics identified using the names of the variable. This calculation cannot easily be scaled to multiple variables or carried out in grouped calculations. There is a limited ability to select a subset of these calculations or include others.
The dt.summarize function is intended to resolve these issues by creating a more flexible means of summarizing multiple numeric variables. First we will calculate summaries of two variables:
dt.summarize(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"))
#> variable min lower.quartile median mean upper.quartile max
#> <char> <list> <list> <list> <list> <list> <list>
#> 1: Sepal.Length 4.3 5.1 5.8 5.843333 6.4 7.9
#> 2: Sepal.Width 2 2.8 3 3.057333 3.3 4.4
#> sd total.missing
#> <list> <list>
#> 1: 0.8280661 0
#> 2: 0.4358663 0
Alternative functions may be specified:
dt.summarize(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
the.functions = c("mean", "sd", "var"))
#> variable mean sd var
#> <char> <list> <list> <list>
#> 1: Sepal.Length 5.843333 0.8280661 0.6856935
#> 2: Sepal.Width 3.057333 0.4358663 0.1899794
These calculations may also be filtered and grouped:
dt.summarize(dt.name = "iris", the.variables = c("Sepal.Length", "Sepal.Width"),
the.filter = "Sepal.Length > 4 & Sepal.Length < 6", grouping.variables = "Species")
#> Key: <Species>
#> Species variable min lower.quartile median mean upper.quartile
#> <fctr> <char> <list> <list> <list> <list> <list>
#> 1: setosa Sepal.Length 4.3 4.8 5 5.006 5.2
#> 2: setosa Sepal.Width 2.3 3.2 3.4 3.428 3.675
#> 3: versicolor Sepal.Length 4.9 5.5 5.6 5.530769 5.7
#> 4: versicolor Sepal.Width 2 2.5 2.7 2.673077 2.9
#> 5: virginica Sepal.Length 4.9 5.65 5.8 5.642857 5.8
#> 6: virginica Sepal.Width 2.5 2.6 2.7 2.714286 2.8
#> max sd total.missing
#> <list> <list> <list>
#> 1: 5.8 0.3524897 0
#> 2: 4.4 0.3790644 0
#> 3: 5.9 0.2782362 0
#> 4: 3.2 0.2836303 0
#> 5: 5.9 0.3408672 0
#> 6: 3 0.1772811 0
The choice of functions can also incorporate customized functions or those from other methods. For instance, here we apply mean.numerics and sd.numerics to a selection of variables including the non-numeric Species:
dt.summarize(dt.name = "iris", the.functions = c("mean.numerics", "sd.numerics"),
the.variables = c("Sepal.Length", "Sepal.Width", "Species"))
#> variable mean.numerics sd.numerics
#> <char> <list> <list>
#> 1: Sepal.Length 5.843333 0.8280661
#> 2: Sepal.Width 3.057333 0.4358663
#> 3: Species NA NA
This offers many considerations for scaled computations on large data sets with many columns. Quickly identifying the numeric variables and calculating customized measures can reduce the labor of investigating novel data. The resulting table can easily be searched for specific subgroups or measured values.
DTwrappers2 presents a variety of methods for cleaning and summarizing data across multiple variables. Individually, these calculations resolve small inconveniences with analyzing data in a variety of settings. Collectively, they allow for a set of tools that can better facilitate exploration, analysis, and reporting of data. Incorporating built-in filters and groupings expands upon the settings in which these tools might be valuable.
The back-end calculations of DTwrappers2 are performed using a translation to data.table’s syntax. This enables efficient computation with expected performance improvements over other methods. Users who want to practice with data.table can also study the coding outputs as working examples. The methods presented here rely upon a call to the lapply function within data.table’s j step while specifying the variables to perform the calculation in the .SDcols parameter. Filtered and grouped calculations also create additional elements of the syntax to study. With this in mind, motivated users can create additional methods that can be applied in these frameworks, either using data.table directory or with the dt.calculate method from the DTwrappers package.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.