collapse is a C/C++ based package for data manipulation in R. Its aims are
to facilitate complex data transformation and exploration tasks and
to help make R code fast, flexible, parsimonious and programmer friendly.
This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions and some fast alternatives for dplyr verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel-data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr (piped) workflow.
Notes:
This vignette is targeted at dplyr / tidyverse users. collapse is a standalone package and can be programmed efficiently without pipes or dplyr verbs.
The ‘Introduction to collapse’ vignette provides a thorough introduction to the package, and built-in structured documentation is available under help("collapse-documentation") after installing the package. In addition, help("collapse-package") provides a compact set of examples for a quick start.
A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct), which are able to substantially speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The functions are S3 generic, with a default (vector), matrix and data.frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:
FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)
where w is a weight variable (available only to fsum, fprod, fmean, fmode, fvar and fsd), and TRA can be used to transform x using the computed statistics and one of 10 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%"). These transformations perform grouped replacing or sweeping out of the statistics computed by the function (discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is customary in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fsum, fmean, fvar and fsd this will compute the sum of the weights in each group, whereas fmode will return the maximum weight (corresponding to the mode) in each group, and fprod returns the product of the weights.
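As a quick first look at these arguments, the minimal sketch below uses the familiar iris data (rather than the data introduced in the next section) to show an unweighted grouped mean, a grouped demeaning via the TRA argument, and a weighted grouped mean where the aggregated weights are kept:
library(collapse)
library(dplyr)
iris %>% group_by(Species) %>% fmean                           # Grouped means, grouping column kept
iris %>% group_by(Species) %>% fmean(TRA = "-") %>% head(3)    # Grouped demeaning using the computed means
iris %>% group_by(Species) %>% fmean(Sepal.Width)              # Weighted means, sum of weights kept (keep.w = TRUE)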
With that in mind, let’s consider some straightforward applications.
Consider the Groningen Growth and Development Center 10-Sector Database included in collapse and introduced in the main vignette:
library(collapse)
head(GGDC10S)
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 1 BWA SSA Sub-saharan Africa VA 1960 NA NA NA NA
# 2 BWA SSA Sub-saharan Africa VA 1961 NA NA NA NA
# 3 BWA SSA Sub-saharan Africa VA 1962 NA NA NA NA
# 4 BWA SSA Sub-saharan Africa VA 1963 NA NA NA NA
# 5 BWA SSA Sub-saharan Africa VA 1964 16.30154 3.494075 0.7365696 0.1043936
# 6 BWA SSA Sub-saharan Africa VA 1965 15.72700 2.495768 1.0181992 0.1350976
# CON WRT TRA FIRE GOV OTH SUM
# 1 NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710
# Summarize the Data:
# descr(GGDC10S, cols = is.categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))
Simple column-wise computations using the fast functions and pipe operators are performed as follows:
library(dplyr)
GGDC10S %>% fNobs # Number of Observations
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 5027 5027 5027 5027 5027 4364 4355 4355 4354
# CON WRT TRA FIRE GOV OTH SUM
# 4355 4355 4355 4355 3482 4248 4364
GGDC10S %>% fNdistinct # Number of distinct values
# Country Regioncode Region Variable Year AGR MIN MAN PU
# 43 6 6 2 67 4353 4224 4353 4237
# CON WRT TRA FIRE GOV OTH SUM
# 4339 4344 4334 4349 3470 4238 4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 4394.5194 173.2234 3718.0981 167.9500 1473.4470 3773.6430 1174.8000 960.1251 3928.5127
# OTH SUM
# 1433.1722 23186.1936
GGDC10S %>% select_at(6:16) %>% fmean # Mean
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# 2526696.5 1867908.9 5538491.4 335679.5 1801597.6 3392909.5 1473269.7 1657114.8 1712300.3
# OTH SUM
# 1684527.3 21566436.8
GGDC10S %>% fmode # Mode
# Country Regioncode Region Variable Year
# "USA" "ASI" "Asia" "EMP" "2010"
# AGR MIN MAN PU CON
# "171.315882316326" "0" "4645.12507642586" "0" "1.34623115930777"
# WRT TRA FIRE GOV OTH
# "21.8380052682527" "8.97743416914571" "40.0701608636442" "0" "3626.84423577048"
# SUM
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE) # Keep data structure intact
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT TRA
# 1 USA ASI Asia EMP 2010 171.3159 0 4645.125 0 1.346231 21.83801 8.977434
# FIRE GOV OTH SUM
# 1 40.07016 0 3626.844 37.48229
Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:
GGDC10S %>%
group_by(Variable,Country) %>%
select_at(6:16) %>% fmean
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 2.94e1 2.96e2 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# 7 EMP COL 3091. 145. 1175. 3.39e1 5.24e2 2.07e3 4.70e2 649. NA 1.73e3 9.89e3
# 8 EMP CRI 231. 1.70 136. 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128. 6.51e1 8.87e2
# 9 EMP DEW 2490. 407. 8473. 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945. 9.99e2 2.62e4
# 10 EMP DNK 236. 8.03 507. 1.38e1 1.71e2 4.55e2 1.61e2 181. 549. 1.11e2 2.39e3
# # ... with 75 more rows
Similarly we can aggregate using any of the other functions above. It is important not to use dplyr’s summarize together with these functions, since that would eliminate their speed gain. The fast functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize applies the function to each group in the grouped tibble (this also works with the fast functions, but is then slower than using primitive base functions, since the fast functions are S3 generic).
To drive this point home, it is perhaps good to shed some light on what happens behind the scenes of dplyr and collapse. Fundamentally, the two packages follow different computing paradigms:
dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:
GGDC10S %>% group_by(Variable,Country) %>% attr("groups")
# # A tibble: 85 x 3
# Variable Country .rows
# <chr> <chr> <list>
# 1 EMP ARG <int [62]>
# 2 EMP BOL <int [61]>
# 3 EMP BRA <int [62]>
# 4 EMP BWA <int [52]>
# 5 EMP CHL <int [63]>
# 6 EMP CHN <int [62]>
# 7 EMP COL <int [61]>
# 8 EMP CRI <int [62]>
# 9 EMP DEW <int [61]>
# 10 EMP DNK <int [64]>
# # ... with 75 more rows
This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups, which are then passed sequentially to the function used and later recombined. These steps are also done in C++, which makes dplyr quite efficient.
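To make this concrete, the following base R sketch mimics the split-apply-combine steps for a single column (purely illustrative; dplyr performs these steps internally in C++):
g <- GGDC10S %>% group_by(Variable, Country) %>% attr("groups")
chunks <- lapply(g$.rows, function(i) GGDC10S$AGR[i])      # Split: extract the data of each group
stats <- sapply(chunks, median, na.rm = TRUE)              # Apply: compute the statistic for each group
head(cbind(g[c("Variable", "Country")], AGR = stats), 3)   # Combine: bind groups and statistics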
Now collapse is based around one-pass grouped computations at the C++ level using its own grouped statistical functions. In other words, the data is not split and recombined at all; instead, the entire computation is performed in a single C++ loop running through the data and completing the computations for all groups simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:
GGDC10S %>% group_by(Variable,Country) %>% GRP %>% str
# List of 8
# $ N.groups : int 85
# $ group.id : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
# $ group.sizes: int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
# $ groups :List of 2
# ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
# .. ..- attr(*, "label")= chr "Variable"
# .. ..- attr(*, "format.stata")= chr "%9s"
# ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
# .. ..- attr(*, "label")= chr "Country"
# .. ..- attr(*, "format.stata")= chr "%9s"
# $ group.vars : chr [1:2] "Variable" "Country"
# $ ordered : logi [1:2] TRUE TRUE
# $ order : NULL
# $ call : language GRP.grouped_df(X = .)
# - attr(*, "class")= chr "GRP"
This object is a list where the first three elements give the number of groups, the group id to which each row belongs, and a vector of group sizes. A function like fsum uses this information to (for each column) create a result vector of size 'N.groups' and then run through the column, using the 'group.id' vector to add the i'th data point to the group.id[i]'th element of the result vector. When the loop is finished, the grouped computation is also finished.
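The following plain R sketch illustrates this single-pass logic for one column (the actual loop in collapse is written in C++ and handles all columns and functions):
g <- GGDC10S %>% group_by(Variable, Country) %>% GRP()
x <- GGDC10S$AGR
res <- numeric(g$N.groups)                              # One result slot per group, as described above
for (i in seq_along(x)) {
  if (!is.na(x[i]))                                     # mimics na.rm = TRUE
    res[g$group.id[i]] <- res[g$group.id[i]] + x[i]     # add the i'th data point to its group's slot
}
head(res, 3)                                            # Grouped sums of AGR, in the order of the groups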
It is thus clear why collapse is faster than dplyr: its method of computing involves fewer steps.
collapse fast functions do not reach their maximal performance on a grouped tibble created with group_by because of the additional cost of converting the grouping object incurred by GRP.grouped_df. This cost is already minimized through the use of C++, but we can do even better by replacing group_by with collapse::fgroup_by. fgroup_by works like group_by but does the grouping with collapse::GRP (up to 10x faster than group_by) and simply attaches a collapse grouping object to the grouped_df. Thus the speed gain is twofold: faster grouping and no conversion cost when calling collapse functions.
Another improvement comes from replacing the dplyr verb select with collapse::fselect, and, for selection using column names, indices or functions, from using collapse::get_vars instead of select_at or select_if. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to efficiently select columns by data type.
GGDC10S %>% fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1325. 47.4 1988. 1.05e2 7.82e2 1.85e3 5.80e2 464. 1739. 866. 9.74e3
# 2 EMP BOL 943. 53.5 167. 4.46e0 6.60e1 1.32e2 9.70e1 15.3 NA 384. 1.84e3
# 3 EMP BRA 17481. 225. 7208. 3.76e2 4.05e3 6.45e3 1.58e3 4355. 4450. 4479. 5.19e4
# 4 EMP BWA 175. 12.2 13.1 3.71e0 1.90e1 2.11e1 6.75e0 10.4 53.8 31.2 3.61e2
# 5 EMP CHL 690. 93.9 607. 2.58e1 2.30e2 4.84e2 2.05e2 106. NA 900. 3.31e3
# 6 EMP CHN 293915 8150. 61761. 1.14e3 1.06e4 1.70e4 9.56e3 4328. 19468. 9954. 4.45e5
# 7 EMP COL 3006. 84.0 1033. 3.71e1 4.19e2 1.55e3 3.91e2 655. NA 1430. 8.63e3
# 8 EMP CRI 216. 1.49 114. 7.92e0 5.50e1 8.98e1 2.55e1 19.6 122. 60.6 7.19e2
# 9 EMP DEW 2178 320. 8459. 2.47e2 2.10e3 4.45e3 1.53e3 1656 3700 900 2.65e4
# 10 EMP DNK 187. 3.75 508. 1.36e1 1.65e2 4.61e2 1.61e2 169. 642. 104. 2.42e3
# # ... with 75 more rows
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fmedian,
hybrid = GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% fmedian,
dplyr = GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
# expr min lq mean median uq max neval
# collapse 971.482 1050.245 1192.611 1100.225 1159.129 8355.542 100
# hybrid 13576.640 14075.991 15175.286 14474.713 15185.363 22655.549 100
# dplyr 57322.300 59729.806 62748.435 60518.103 64810.782 99397.655 100
Benchmarks on the different components of this code and with larger data are provided under ‘Benchmarks’. I note that a grouped tibble created with fgroup_by
can no longer be used for grouped computations with dplyr verbs like mutate
or summarize
. To avoid errors with these functions and print.grouped_df
, [.grouped_df
etc., the classes assigned after fgroup_by
are reshuffled, so that the data.frame is treated by the dplyr ecosystem like a normal tibble:
class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df" "tbl" "data.frame"
class(fgroup_by(GGDC10S, Variable, Country))
# [1] "tbl_df" "tbl" "grouped_df" "data.frame"
I also note that fselect
and get_vars
are not full drop-in replacements for select
because they do not have a grouped_df method:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% head(3)
# # A tibble: 3 x 13
# # Groups: Variable, Country [1]
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% head(3)
# # A tibble: 3 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA NA
Since by default keep.group_vars = TRUE
in the Fast Statistical Functions, the end result is nevertheless the same:
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992. 10542.
# 2 EMP BOL 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396. 2221.
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710. 54273.
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1982. 649. 628. 2043. 992. 10542.
# 2 EMP BOL 964. 56.0 235. 5.35 123. 282. 115. 44.6 NA 396. 2221.
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8509. 2054. 4414. 5307. 5710. 54273.
Another useful verb introduced by collapse is fgroup_vars
, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble:
# fgroup_vars fully supports grouped tibbles created with group_by or fgroup_by:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 VA BWA
# 2 VA BWA
# 3 VA BWA
# The other possibilities:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3)
# # A tibble: 3 x 2
# Variable Country
# <chr> <chr>
# 1 EMP ARG
# 2 EMP BOL
# 3 EMP BRA
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names")
# [1] "Variable" "Country"
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices")
# [1] 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices")
# Variable Country
# 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical")
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical")
# Country Regioncode Region Variable Year AGR MIN MAN PU
# TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# CON WRT TRA FIRE GOV OTH SUM
# FALSE FALSE FALSE FALSE FALSE FALSE FALSE
A final collapse verb I want to mention here is fsubset, a faster alternative to dplyr::filter which also provides an option to flexibly select columns after the subset condition:
# Two equivalent calls; the first is substantially faster
GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3)
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2 BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3 BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193
GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3)
# Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2 BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3 BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193
One can also aggregate with multiple functions at the same time. For such operations it is often necessary to use curly braces {
to prevent first argument injection so that %>% cbind(FUN1(.), FUN2(.))
does not evaluate as %>% cbind(., FUN1(.), FUN2(.))
:
GGDC10S %>%
fgroup_by(Variable,Country) %>%
get_vars(6:16) %>% {
cbind(fmedian(.),
add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
} %>% head(3)
# Variable Country AGR MIN MAN PU CON WRT TRA
# 1 EMP ARG 1324.5255 47.35255 1987.5912 104.738825 782.40283 1854.612 579.93982
# 2 EMP BOL 943.1612 53.53538 167.1502 4.457895 65.97904 132.225 96.96828
# 3 EMP BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
# FIRE GOV OTH SUM mean_AGR mean_MIN mean_MAN mean_PU mean_CON
# 1 464.39920 1738.836 866.1119 9743.223 1419.8013 52.08903 1931.7602 101.720936 742.4044
# 2 15.34259 NA 384.0678 1842.055 964.2103 56.03295 235.0332 5.346433 122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
# mean_WRT mean_TRA mean_FIRE mean_GOV mean_OTH mean_SUM
# 1 1982.1775 648.5119 627.79291 2043.471 992.4475 10542.177
# 2 281.5164 115.4728 44.56442 NA 395.5650 2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985
The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names. The collapse function add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:
GGDC10S %>%
fgroup_by(Variable,Country) %>% {
add_vars(ffirst(get_vars(., "Reg", regex = TRUE)), # Regular expression matching column names
add_stub(fmean(num_vars(.), keep.group_vars = FALSE), "mean_"), # num_vars selects all numeric variables
add_stub(fmedian(fselect(., PU:TRA), keep.group_vars = FALSE), "median_"),
add_stub(fmin(fselect(., PU:CON), keep.group_vars = FALSE), "min_"))
}
# # A tibble: 85 x 22
# Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
# * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1420. 52.1 1932. 102. 742. 1982.
# 2 EMP BOL LAM Latin~ 1980 964. 56.0 235. 5.35 123. 282.
# 3 EMP BRA LAM Latin~ 1980. 17191. 206. 6991. 365. 3525. 8509.
# 4 EMP BWA SSA Sub-s~ 1986. 188. 10.5 18.1 3.09 25.3 36.3
# 5 EMP CHL LAM Latin~ 1981 702. 101. 625. 29.4 296. 695.
# 6 EMP CHN ASI Asia 1980. 287744. 7050. 67144. 1606. 20852. 28908.
# 7 EMP COL LAM Latin~ 1980 3091. 145. 1175. 33.9 524. 2071.
# 8 EMP CRI LAM Latin~ 1980. 231. 1.70 136. 14.3 57.6 157.
# 9 EMP DEW EUR Europe 1980 2490. 407. 8473. 226. 2093. 4442.
# 10 EMP DNK EUR Europe 1980. 236. 8.03 507. 13.8 171. 455.
# # ... with 75 more rows, and 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>,
# # mean_OTH <dbl>, mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>,
# # median_TRA <dbl>, min_PU <dbl>, min_CON <dbl>
Another nice feature of add_vars is that it can also very efficiently reorder columns, i.e. bind columns in a different order than they are passed in. This is done by simply specifying the positions the added columns should have in the final data.frame; add_vars then shifts the columns of the first argument to the right to fill in the gaps.
GGDC10S %>%
fsubset(Variable == "VA", Country, AGR, SUM) %>%
fgroup_by(Country) %>% {
add_vars(fgroup_vars(.,"unique"),
add_stub(fmean(., keep.group_vars = FALSE), "mean_"),
add_stub(fsd(., keep.group_vars = FALSE), "sd_"),
pos = c(2,4,3,5))
} %>% head(3)
# Country mean_AGR sd_AGR mean_SUM sd_SUM
# 1 ARG 14951.292 33061.413 152533.84 301316.25
# 2 BOL 3299.718 4456.331 22619.18 33172.98
# 3 BRA 76870.146 59441.696 1200562.67 976963.14
A much more compact solution to multi-function and multi-type aggregation with dplyr is offered by the function collapg:
# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% fgroup_by(Variable,Country) %>% collapg
# # A tibble: 85 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2 628.
# 2 EMP BOL LAM Latin~ 1980 9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2 44.6
# 3 EMP BRA LAM Latin~ 1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414.
# 4 EMP BWA SSA Sub-s~ 1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0 15.3
# 5 EMP CHL LAM Latin~ 1981 7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2 272.
# 6 EMP CHN ASI Asia 1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929.
# 7 EMP COL LAM Latin~ 1980 3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2 649.
# 8 EMP CRI LAM Latin~ 1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1 54.9
# 9 EMP DEW EUR Europe 1980 2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689.
# 10 EMP DNK EUR Europe 1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2 181.
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
By default it aggregates numeric columns using fmean and categorical columns using fmode, and preserves the order of all columns. Changing these defaults is very easy:
# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% fgroup_by(Variable,Country) %>% collapg(fmedian, flast)
# # A tibble: 85 x 16
# Variable Country Regioncode Region Year AGR MIN MAN PU CON WRT TRA FIRE
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG LAM Latin~ 1980. 1.32e3 4.74e1 1.99e3 1.05e2 7.82e2 1.85e3 5.80e2 464.
# 2 EMP BOL LAM Latin~ 1980 9.43e2 5.35e1 1.67e2 4.46e0 6.60e1 1.32e2 9.70e1 15.3
# 3 EMP BRA LAM Latin~ 1980. 1.75e4 2.25e2 7.21e3 3.76e2 4.05e3 6.45e3 1.58e3 4355.
# 4 EMP BWA SSA Sub-s~ 1986. 1.75e2 1.22e1 1.31e1 3.71e0 1.90e1 2.11e1 6.75e0 10.4
# 5 EMP CHL LAM Latin~ 1981 6.90e2 9.39e1 6.07e2 2.58e1 2.30e2 4.84e2 2.05e2 106.
# 6 EMP CHN ASI Asia 1980. 2.94e5 8.15e3 6.18e4 1.14e3 1.06e4 1.70e4 9.56e3 4328.
# 7 EMP COL LAM Latin~ 1980 3.01e3 8.40e1 1.03e3 3.71e1 4.19e2 1.55e3 3.91e2 655.
# 8 EMP CRI LAM Latin~ 1980. 2.16e2 1.49e0 1.14e2 7.92e0 5.50e1 8.98e1 2.55e1 19.6
# 9 EMP DEW EUR Europe 1980 2.18e3 3.20e2 8.46e3 2.47e2 2.10e3 4.45e3 1.53e3 1656
# 10 EMP DNK EUR Europe 1980. 1.87e2 3.75e0 5.08e2 1.36e1 1.65e2 4.61e2 1.61e2 169.
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>
One can apply multiple functions to both numeric and/or categorical data:
GGDC10S %>% fgroup_by(Variable,Country) %>%
collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 x 32
# Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 EMP ARG LAM LAM LAM Latin Ameri~ Latin Ameri~
# 2 EMP BOL LAM LAM LAM Latin Ameri~ Latin Ameri~
# 3 EMP BRA LAM LAM LAM Latin Ameri~ Latin Ameri~
# # ... with 25 more variables: flast.Region <chr>, fmean.Year <dbl>, fmedian.Year <dbl>,
# # fmean.AGR <dbl>, fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>, fmean.MAN <dbl>,
# # fmedian.MAN <dbl>, fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>, fmedian.CON <dbl>,
# # fmean.WRT <dbl>, fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>, fmean.FIRE <dbl>,
# # fmedian.FIRE <dbl>, fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>, fmedian.OTH <dbl>,
# # fmean.SUM <dbl>, fmedian.SUM <dbl>
Applying multiple functions to only numeric (or only categorical) data allows the result to be returned in a long format:
GGDC10S %>% fgroup_by(Variable,Country) %>%
collapg(list(fmean, fmedian), cols = is.numeric, return = "long")
# # A tibble: 170 x 15
# Function Variable Country Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 fmean EMP ARG 1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2 628. 2043.
# 2 fmean EMP BOL 1980 9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2 44.6 NA
# 3 fmean EMP BRA 1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414. 5307.
# 4 fmean EMP BWA 1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0 15.3 61.1
# 5 fmean EMP CHL 1981 7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2 272. NA
# 6 fmean EMP CHN 1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929. 22669.
# 7 fmean EMP COL 1980 3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2 649. NA
# 8 fmean EMP CRI 1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1 54.9 128.
# 9 fmean EMP DEW 1980 2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689. 3945.
# 10 fmean EMP DNK 1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2 181. 549.
# # ... with 160 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
Finally, collapg
also makes it very easy to apply aggregator functions to certain columns only:
GGDC10S %>% fgroup_by(Variable,Country) %>%
collapg(custom = list(fmean = 6:8, fmedian = 10:12))
# # A tibble: 85 x 8
# Variable Country fmean.AGR fmean.MIN fmean.MAN fmedian.CON fmedian.WRT fmedian.TRA
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 782. 1855. 580.
# 2 EMP BOL 964. 56.0 235. 66.0 132. 97.0
# 3 EMP BRA 17191. 206. 6991. 4055. 6455. 1581.
# 4 EMP BWA 188. 10.5 18.1 19.0 21.1 6.75
# 5 EMP CHL 702. 101. 625. 230. 484. 205.
# 6 EMP CHN 287744. 7050. 67144. 10578. 17034. 9564.
# 7 EMP COL 3091. 145. 1175. 419. 1553. 391.
# 8 EMP CRI 231. 1.70 136. 55.0 89.8 25.5
# 9 EMP DEW 2490. 407. 8473. 2095. 4454. 1525.
# 10 EMP DNK 236. 8.03 507. 165. 461. 161.
# # ... with 75 more rows
To understand more about collapg
, look it up in the documentation (?collapg
).
Weighted aggregations are currently possible with the functions fsum, fprod, fmean, fmode, fvar
and fsd
. The implementation is such that by default (option keep.w = TRUE
) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fsum, fmean
, fsd
and fvar
compute a grouped sum of the weight column and place it next to the group-identifiers, fmode
computes the maximum weight (corresponding to the mode), and fprod
computes the product of the weights.
# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(AGR:SUM) %>% fsd(SUM)
# # A tibble: 85 x 13
# Variable Country sum.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 6.54e5 225. 2.22e1 1.76e2 2.05e1 2.85e2 8.56e2 1.95e2 493. 1123. 5.06e2
# 2 EMP BOL 1.35e5 99.7 1.71e1 1.68e2 4.87e0 1.23e2 3.24e2 9.81e1 69.8 NA 2.58e2
# 3 EMP BRA 3.36e6 1587. 7.38e1 2.95e3 9.38e1 1.86e3 6.28e3 1.31e3 3003. 3621. 4.26e3
# 4 EMP BWA 1.85e4 32.2 3.72e0 1.48e1 1.59e0 1.80e1 3.87e1 6.02e0 13.5 39.8 8.94e0
# 5 EMP CHL 2.51e5 71.0 3.99e1 1.29e2 1.24e1 1.88e2 5.51e2 1.34e2 313. NA 4.26e2
# 6 EMP CHN 2.91e7 56281. 3.09e3 4.04e4 1.27e3 1.92e4 2.45e4 9.26e3 2853. 11541. 3.74e4
# 7 EMP COL 6.03e5 637. 1.48e2 5.94e2 1.52e1 3.97e2 1.89e3 3.62e2 435. NA 1.01e3
# 8 EMP CRI 5.50e4 40.4 1.04e0 7.93e1 1.37e1 3.44e1 1.68e2 4.53e1 79.8 80.7 4.34e1
# 9 EMP DEW 1.10e6 1175. 1.83e2 7.42e2 5.32e1 1.94e2 6.06e2 2.12e2 699. 1225. 3.55e2
# 10 EMP DNK 1.53e5 139. 7.45e0 7.73e1 1.92e0 2.56e1 5.33e1 1.57e1 91.6 248. 1.95e1
# # ... with 75 more rows
# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(AGR:SUM) %>% fmode(SUM)
# # A tibble: 85 x 13
# Variable Country max.SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 17929. 1.16e3 127. 2.16e3 1.52e2 1.41e3 3768. 1.06e3 1.75e3 4336. 2.00e3
# 2 EMP BOL 4508. 8.19e2 37.6 6.04e2 1.08e1 4.33e2 893. 3.33e2 3.21e2 NA 1.06e3
# 3 EMP BRA 102572. 1.65e4 313. 1.18e4 3.88e2 8.15e3 21860. 5.17e3 1.20e4 12149. 1.42e4
# 4 EMP BWA 668. 1.71e2 13.1 4.33e1 3.93e0 1.81e1 129. 2.10e1 4.67e1 113. 2.62e1
# 5 EMP CHL 7559. 6.30e2 249. 7.42e2 6.07e1 6.71e2 1989. 4.81e2 8.54e2 NA 1.88e3
# 6 EMP CHN 764200 2.66e5 9247. 1.43e5 3.53e3 6.99e4 84165. 3.12e4 1.08e4 43240. 1.03e5
# 7 EMP COL 21114. 3.93e3 513. 2.37e3 5.89e1 1.41e3 6069. 1.36e3 1.82e3 NA 3.57e3
# 8 EMP CRI 2058. 2.83e2 2.42 2.49e2 4.38e1 1.20e2 489. 1.44e2 2.25e2 328. 1.75e2
# 9 EMP DEW 31261 1.03e3 260 8.73e3 2.91e2 2.06e3 4398 1.63e3 3.26e3 6129 1.79e3
# 10 EMP DNK 2823. 7.85e1 3.12 3.99e2 1.14e1 1.95e2 579. 1.87e2 3.82e2 835. 1.50e2
# # ... with 75 more rows
The weighted variance / standard deviation is currently only implemented with frequency weights. Reliability weights may be implemented in a future update of collapse, if this is a strongly requested feature.
Weighted aggregations may also be performed with collapg
.
# This aggregates numeric columns using the weighted mean and categorical columns using the weighted mode
GGDC10S %>% group_by(Variable,Country) %>% collapg(w = SUM, wFUN = list(fsum, fmax))
# # A tibble: 85 x 17
# Variable Country fsum.SUM fmax.SUM Regioncode Region Year AGR MIN MAN PU CON
# <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 6.54e5 17929. LAM Latin~ 1985. 1.36e3 5.65e1 1.93e3 1.05e2 8.11e2
# 2 EMP BOL 1.35e5 4508. LAM Latin~ 1987. 9.77e2 5.79e1 2.96e2 7.07e0 1.67e2
# 3 EMP BRA 3.36e6 102572. LAM Latin~ 1989. 1.77e4 2.38e2 8.47e3 3.89e2 4.44e3
# 4 EMP BWA 1.85e4 668. SSA Sub-s~ 1993. 2.00e2 1.21e1 2.43e1 3.70e0 3.14e1
# 5 EMP CHL 2.51e5 7559. LAM Latin~ 1988. 6.93e2 1.07e2 6.68e2 3.35e1 3.67e2
# 6 EMP CHN 2.91e7 764200 ASI Asia 1988. 3.09e5 8.23e3 8.34e4 2.09e3 2.80e4
# 7 EMP COL 6.03e5 21114. LAM Latin~ 1989. 3.44e3 2.04e2 1.49e3 4.20e1 7.18e2
# 8 EMP CRI 5.50e4 2058. LAM Latin~ 1991. 2.54e2 2.10e0 1.87e2 2.19e1 7.84e1
# 9 EMP DEW 1.10e6 31261 EUR Europe 1971. 2.40e3 3.95e2 8.51e3 2.29e2 2.10e3
# 10 EMP DNK 1.53e5 2823. EUR Europe 1981. 2.23e2 7.41e0 5.03e2 1.39e1 1.72e2
# # ... with 75 more rows, and 5 more variables: WRT <dbl>, TRA <dbl>, FIRE <dbl>, GOV <dbl>,
# # OTH <dbl>
collapse also provides some fast transformations that significantly extend the scope of, and speed up, manipulations otherwise performed with dplyr::mutate. The function ftransform can be used to manipulate columns in the same way as mutate:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
ftransform(AGR_perc = AGR / SUM * 100, # Computing % of VA in Agriculture
AGR_mean = fmean(AGR), # Average Agricultural VA
AGR = NULL, SUM = NULL) %>% # Deleting columns AGR and SUM
head
# Country Year AGR_perc AGR_mean
# 1 BWA 1960 NA 5137561
# 2 BWA 1961 NA 5137561
# 3 BWA 1962 NA 5137561
# 4 BWA 1963 NA 5137561
# 5 BWA 1964 43.49132 5137561
# 6 BWA 1965 39.96990 5137561
If only the computed columns need to be returned, fcompute
provides an efficient alternative:
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
fcompute(AGR_perc = AGR / SUM * 100,
AGR_mean = fmean(AGR)) %>% head
# AGR_perc AGR_mean
# 1 NA 5137561
# 2 NA 5137561
# 3 NA 5137561
# 4 NA 5137561
# 5 43.49132 5137561
# 6 39.96990 5137561
ftransform
and fcompute
are an order of magnitude faster than mutate
, but they do not support grouped computations. For common grouped operations like replacing and sweeping out statistics, collapse however provides very efficient alternatives…
All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string. The 10 operations supported by TRA are:
1 - "replace_fill" : replace and overwrite missing values (same as mutate)
2 - "replace" : replace but preserve missing values
3 - "-" : subtract (center)
4 - "-+" : subtract group-statistics but add average of group statistics
5 - "/" : divide (scale)
6 - "%" : compute percentages (divide and multiply by 100)
7 - "+" : add
8 - "*" : multiply
9 - "%%" : modulus
10 - "-%%" : subtract modulus
Simple transformations are again straightforward to specify:
# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
# Year AGR MIN MAN PU CON WRT TRA FIRE GOV
# 1 -22 NA NA NA NA NA NA NA NA NA
# 2 -21 NA NA NA NA NA NA NA NA NA
# 3 -20 NA NA NA NA NA NA NA NA NA
# 4 -19 NA NA NA NA NA NA NA NA NA
# 5 -18 -4378.218 -169.7294 -3717.362 -167.8456 -1472.787 -3767.399 -1173.141 -959.0059 -3923.690
# 6 -17 -4378.792 -170.7277 -3717.080 -167.8149 -1472.101 -3766.578 -1172.861 -958.8783 -3922.817
# OTH SUM
# 1 NA NA
# 2 NA NA
# 3 NA NA
# 4 NA NA
# 5 -1430.831 -23148.71
# 6 -1430.494 -23146.85
# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
# Country Regioncode Region Variable
# 1 USA ASI Asia EMP
# 2 USA ASI Asia EMP
# 3 USA ASI Asia EMP
# 4 USA ASI Asia EMP
# 5 USA ASI Asia EMP
# 6 USA ASI Asia EMP
We can also easily demean or scale data by groups, or compute percentages within groups:
# Demeaning sectoral data by Variable and Country (within transformation)
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fmean(TRA = "-") %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# Scaling sectoral data by Variable and Country
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fsd(TRA = "/") %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# Normalizing data by expressing it as a percentage of the median value within each country and sector (i.e. the median is 100%)
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fmedian(TRA = "%") %>% head(3)
# # A tibble: 3 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
Weighted demeaning and scaling can be computed using:
# Weighted demeaning (within transformation), weighted by SUM
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fmean(SUM, "-") %>% head(3)
# # A tibble: 3 x 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# Weighted scaling, weighted by SUM
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fsd(SUM, "/") %>% head(3)
# # A tibble: 3 x 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
Alternatively we could also replace data points with their groupwise weighted mean or standard deviation:
# This conducts a weighted between transformation (replacing with weighted mean)
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fmean(SUM, "replace")
# # A tibble: 5,027 x 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA 37.5 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 6 VA BWA 39.3 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 7 VA BWA 43.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 8 VA BWA 41.4 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 9 VA BWA 41.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA BWA 51.2 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows
# This also replaces missing values in each group
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fmean(SUM, "replace_fill")
# # A tibble: 5,027 x 13
# Variable Country SUM AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 2 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 3 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 4 VA BWA NA 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 5 VA BWA 37.5 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 6 VA BWA 39.3 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 7 VA BWA 43.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 8 VA BWA 41.4 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 9 VA BWA 41.1 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA BWA 51.2 1317. 13321. 2965. 529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows
Sequential operations are also easily performed:
# This scales and then subtracts the median
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-")
# # A tibble: 5,027 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146
# 6 VA BWA -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146
# 7 VA BWA -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145
# 8 VA BWA -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146
# 9 VA BWA -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146
# 10 VA BWA -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145
# # ... with 5,017 more rows
Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data, as shown below:
# This adds a groupwise observation count next to each column
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
fgroup_by(Variable,Country) %>% fselect(AGR:SUM) %>%
fNobs("replace_fill") %>% add_stub("N_")
head(GGDC10S)
# Country Regioncode Region Variable Year AGR N_AGR MIN N_MIN MAN N_MAN
# 1 BWA SSA Sub-saharan Africa VA 1960 NA 47 NA 47 NA 47
# 2 BWA SSA Sub-saharan Africa VA 1961 NA 47 NA 47 NA 47
# 3 BWA SSA Sub-saharan Africa VA 1962 NA 47 NA 47 NA 47
# 4 BWA SSA Sub-saharan Africa VA 1963 NA 47 NA 47 NA 47
# 5 BWA SSA Sub-saharan Africa VA 1964 16.30154 47 3.494075 47 0.7365696 47
# 6 BWA SSA Sub-saharan Africa VA 1965 15.72700 47 2.495768 47 1.0181992 47
# PU N_PU CON N_CON WRT N_WRT TRA N_TRA FIRE N_FIRE GOV N_GOV
# 1 NA 47 NA 47 NA 47 NA 47 NA 47 NA 47
# 2 NA 47 NA 47 NA 47 NA 47 NA 47 NA 47
# 3 NA 47 NA 47 NA 47 NA 47 NA 47 NA 47
# 4 NA 47 NA 47 NA 47 NA 47 NA 47 NA 47
# 5 0.1043936 47 0.6600454 47 6.243732 47 1.658928 47 1.119194 47 4.822485 47
# 6 0.1350976 47 1.3462312 47 7.064825 47 1.939007 47 1.246789 47 5.695848 47
# OTH N_OTH SUM N_SUM
# 1 NA 47 NA 47
# 2 NA 47 NA 47
# 3 NA 47 NA 47
# 4 NA 47 NA 47
# 5 2.341328 47 37.48229 47
# 6 2.678338 47 39.34710 47
rm(GGDC10S)
Certainly there are lots of other examples one could construct using the 10 operations and 13 functions listed above; the examples provided just outline the suggested programming basics.
The TRA Function
Behind the scenes of the TRA = ... argument, the fast functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace or sweep them out of the data points in one of the 10 ways described above. This function can, however, also be called directly by the name TRA (shorthand for ‘transforming’ data by replacing or sweeping out statistics). Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations1. Direct calls to TRA enable more control over inputs and outputs.
The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:
# This divides by the product
GGDC10S %>%
fgroup_by(Variable,Country) %>%
get_vars(6:16) %>% fprod(TRA = "/")
# # A tibble: 5,027 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# 7 1.39e-105 1.58e-127 1.53e-101 8.62e-74 8.55e-102 5.26e-113 8.98e-92 1.23e-97 3.31e-117
# 8 1.51e-105 1.85e-127 1.78e-101 8.62e-74 5.70e-102 2.74e-113 7.18e-92 1.39e-97 3.66e-117
# 9 1.66e-105 1.48e-127 1.43e-101 8.62e-74 7.74e-102 3.29e-113 1.02e-91 9.33e-98 2.61e-117
# 10 1.72e-105 4.21e-127 4.07e-101 2.46e-73 2.21e-101 3.66e-113 1.13e-91 1.11e-97 2.91e-117
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
# Same thing
GGDC10S %>%
fgroup_by(Variable,Country) %>%
get_vars(6:16) %>% TRA(fprod(., keep.group_vars = FALSE), "/") # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")]
# # A tibble: 5,027 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 1.29e-105 2.81e-127 1.40e-101 4.44e-74 4.19e-102 3.97e-113 6.91e-92 1.01e-97 2.51e-117
# 6 1.24e-105 2.00e-127 1.94e-101 5.75e-74 8.55e-102 4.49e-113 8.08e-92 1.13e-97 2.96e-117
# 7 1.39e-105 1.58e-127 1.53e-101 8.62e-74 8.55e-102 5.26e-113 8.98e-92 1.23e-97 3.31e-117
# 8 1.51e-105 1.85e-127 1.78e-101 8.62e-74 5.70e-102 2.74e-113 7.18e-92 1.39e-97 3.66e-117
# 9 1.66e-105 1.48e-127 1.43e-101 8.62e-74 7.74e-102 3.29e-113 1.02e-91 9.33e-98 2.61e-117
# 10 1.72e-105 4.21e-127 4.07e-101 2.46e-73 2.21e-101 3.66e-113 1.13e-91 1.11e-97 2.91e-117
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>
TRA.grouped_df
was designed such that it matches the columns of the statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data.frame. Thus it is easily possible to only apply a transformation to the first two sectors:
# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
fgroup_by(Variable,Country) %>%
get_vars(6:16) %>% TRA(fmean(fselect(., AGR, MIN), keep.group_vars = FALSE), "-")
# # A tibble: 5,027 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA NA
# 5 -446. -4505. 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 6 -446. -4506. 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 7 -444. -4507. 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 8 -443. -4506. 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 9 -441. -4507. 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# 10 -440. -4503. 2.14 0.578 3.47 5.75 2.72 1.23 5.59 2.63 51.2
# # ... with 5,017 more rows
Another potential use of TRA is to do computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations in between the aggregating and sweeping parts:
# Get grouped tibble
gGGDC <- GGDC10S %>% fgroup_by(Variable,Country)
# Get aggregated data
gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum
head(gsumGGDC)
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 8.80e4 3230. 1.20e5 6307. 4.60e4 1.23e5 4.02e4 3.89e4 1.27e5 6.15e4 6.54e5
# 2 EMP BOL 5.88e4 3418. 1.43e4 326. 7.49e3 1.72e4 7.04e3 2.72e3 NA 2.41e4 1.35e5
# 3 EMP BRA 1.07e6 12773. 4.33e5 22604. 2.19e5 5.28e5 1.27e5 2.74e5 3.29e5 3.54e5 3.36e6
# 4 EMP BWA 8.84e3 493. 8.49e2 145. 1.19e3 1.71e3 3.93e2 7.21e2 2.87e3 1.30e3 1.85e4
# 5 EMP CHL 4.42e4 6389. 3.94e4 1850. 1.86e4 4.38e4 1.63e4 1.72e4 NA 6.32e4 2.51e5
# 6 EMP CHN 1.73e7 422972. 4.03e6 96364. 1.25e6 1.73e6 8.36e5 2.96e5 1.36e6 1.86e6 2.91e7
# Get transformed (scaled) data
head(TRA(gGGDC, gsumGGDC, "/"))
# # A tibble: 6 x 16
# Country Regioncode Region Variable Year AGR MIN MAN PU CON WRT
# <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA SSA Sub-s~ VA 1960 NA NA NA NA NA NA
# 2 BWA SSA Sub-s~ VA 1961 NA NA NA NA NA NA
# 3 BWA SSA Sub-s~ VA 1962 NA NA NA NA NA NA
# 4 BWA SSA Sub-s~ VA 1963 NA NA NA NA NA NA
# 5 BWA SSA Sub-s~ VA 1964 7.50e-4 1.65e-5 1.66e-5 1.03e-5 1.57e-5 6.82e-5
# 6 BWA SSA Sub-s~ VA 1965 7.24e-4 1.18e-5 2.30e-5 1.33e-5 3.20e-5 7.72e-5
# # ... with 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>
I have already noted above that, whether using the TRA argument to fast statistical functions or calling TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data. This process is already very efficient since all functions are written in C++, and programmatically separating the computation of statistics from the data transformation task allows for unlimited combinations and drastically simplifies the code base of this package.
Nonetheless there are of course more memory-efficient and faster ways to program such data transformations, which principally involve doing them column by column with a single C++ function. To ensure that this package lives up to the highest standards of performance for common uses, I have implemented such slightly more efficient algorithms for the very commonly applied tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and of scaling and centering data by groups (also known as ‘standardizing’ data).
The functions fbetween
and fwithin
are slightly more memory efficient implementations of fmean
invoked with different TRA
options:
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
# 2 462. 4509. 942. 216. 895. 1948. 635. 1359. 2373. 773. 14112.
GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 11
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 NA NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA NA
Apart from higher speed, fwithin has a mean argument to assign an arbitrary mean to centered data, the default being mean = 0. A very common choice for such an added mean is just the overall mean of the data, which can be added in by invoking mean = "overall.mean":
GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean")
# # A tibble: 5,027 x 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 6 BWA VA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 7 BWA VA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 8 BWA VA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 9 BWA VA 2.53e6 1.86e6 5.54e6 335463. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# 10 BWA VA 2.53e6 1.86e6 5.54e6 335464. 1.80e6 3.39e6 1.47e6 1.66e6 1.71e6 1.68e6
# # ... with 5,017 more rows, and 1 more variable: SUM <dbl>
This can also be done using weights. The code below uses the SUM column as weights, and for each variable and each group subtracts out the weighted mean and then adds the overall weighted column mean back to the centered columns. The SUM column itself is just kept as it is and added in front.
GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean")
# # A tibble: 5,027 x 13
# Country Variable SUM AGR MIN MAN PU CON WRT TRA FIRE GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA 37.5 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 6 BWA VA 39.3 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 7 BWA VA 43.1 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 8 BWA VA 41.4 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 9 BWA VA 41.1 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# 10 BWA VA 51.2 4.29e8 3.70e8 7.38e8 2.73e7 2.83e8 4.33e8 1.97e8 1.55e8 2.10e8
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>
Apart from fbetween
and fwithin
, the function fscale
exists to efficiently scale and center data, to avoid sequential calls such as ... %>% fsd(TRA = "/") %>% fmean(TRA = "-")
shown in an earlier example.
# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(Country, Variable, AGR:SUM) %>% fscale
# # A tibble: 5,027 x 13
# Country Variable AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA NA NA NA NA NA NA NA NA NA NA NA
# 5 BWA VA -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 6 BWA VA -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
# 7 BWA VA -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
# 8 BWA VA -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
# 9 BWA VA -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 BWA VA -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows
fscale
also has additional mean
and sd
arguments allowing the user to (group-) scale data to an arbitrary mean and standard deviation. Setting mean = FALSE
just scales the data but preserves the means, and is thus different from fsd(..., TRA = "/")
which just divides all values by the standard deviation:
# Saving grouped tibble
gGGDC <- GGDC10S %>%
fgroup_by(Variable,Country) %>%
fselect(Country, Variable, AGR:SUM)
# Original means
head(fmean(gGGDC))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
# Mean Preserving Scaling
head(fmean(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1420. 52.1 1932. 102. 742. 1.98e3 6.49e2 628. 2043. 9.92e2 1.05e4
# 2 EMP BOL 964. 56.0 235. 5.35 123. 2.82e2 1.15e2 44.6 NA 3.96e2 2.22e3
# 3 EMP BRA 17191. 206. 6991. 365. 3525. 8.51e3 2.05e3 4414. 5307. 5.71e3 5.43e4
# 4 EMP BWA 188. 10.5 18.1 3.09 25.3 3.63e1 8.36e0 15.3 61.1 2.76e1 3.94e2
# 5 EMP CHL 702. 101. 625. 29.4 296. 6.95e2 2.58e2 272. NA 1.00e3 3.98e3
# 6 EMP CHN 287744. 7050. 67144. 1606. 20852. 2.89e4 1.39e4 4929. 22669. 3.10e4 4.86e5
head(fsd(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
# Variable Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP ARG 1. 1. 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.
# 2 EMP BOL 1. 1.00 1. 1.00 1.00 1. 1. 1. NA 1. 1.
# 3 EMP BRA 1. 1. 1. 1.00 1. 1.00 1.00 1.00 1. 1.00 1.00
# 4 EMP BWA 1.00 1.00 1. 1. 1. 1.00 1. 1.00 1. 1.00 1.00
# 5 EMP CHL 1. 1. 1.00 1. 1. 1. 1.00 1. NA 1. 1.00
# 6 EMP CHN 1. 1. 1. 1.00 1.00 1. 1. 1. 1.00 1.00 1.
One can also set mean = "overall.mean", which group-centers columns on the overall mean, as illustrated with fwithin. Another interesting option is setting sd = "within.sd". This group-scales the data such that every group has a standard deviation equal to the within-standard deviation of the data:
# Just using VA data for this example
gGGDC <- GGDC10S %>%
fsubset(Variable == "VA", Country, AGR:SUM) %>%
fgroup_by(Country)
# This calculates the within- standard deviation for all columns
fsd(num_vars(ungroup(fwithin(gGGDC))))
# AGR MIN MAN PU CON WRT TRA FIRE GOV OTH
# 45046972 40122220 75608708 3062688 30811572 44125207 20676901 16030868 20358973 18780869
# SUM
# 306429102
# This scales all groups to take on the within- standard deviation while preserving group means
fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd"))
# # A tibble: 43 x 12
# Country AGR MIN MAN PU CON WRT TRA FIRE GOV OTH SUM
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ARG 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 2 BOL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 3 BRA 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 4 BWA 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 5 CHL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 6 CHN 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 7 COL 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 NA 1.88e7 3.06e8
# 8 CRI 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 9 DEW 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# 10 DNK 4.50e7 4.01e7 7.56e7 3.06e6 3.08e7 4.41e7 2.07e7 1.60e7 2.04e7 1.88e7 3.06e8
# # ... with 33 more rows
A grouped scaling operation with both mean = "overall.mean" and sd = "within.sd" thus efficiently harmonizes all groups in the first two moments while preserving the overall level and scale of the data.
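To see this, here is a minimal sketch (assuming the VA-only gGGDC grouped tibble created above):
# Combining both options: group means -> overall mean, group SD's -> within-SD
harmonized <- fscale(gGGDC, mean = "overall.mean", sd = "within.sd")
head(fmean(harmonized)) # group means now all equal the overall column means
head(fsd(harmonized)) # group SD's now all equal the within-SD's shown above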
This section introduces 3 further powerful collapse functions: flag, fdiff and fgrowth. The first, flag, efficiently computes sequences of fully identified lags and leads on time series and panel data. The following code computes one fully identified panel-lag and one fully identified panel-lead of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% flag(-1:1, Year)
# # A tibble: 5,027 x 36
# Country Variable Year F1.AGR AGR L1.AGR F1.MIN MIN L1.MIN F1.MAN MAN L1.MAN F1.PU PU
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 16.3 NA NA 3.49 NA NA 0.737 NA NA 0.104 NA
# 5 BWA VA 1964 15.7 16.3 NA 2.50 3.49 NA 1.02 0.737 NA 0.135 0.104
# 6 BWA VA 1965 17.7 15.7 16.3 1.97 2.50 3.49 0.804 1.02 0.737 0.203 0.135
# 7 BWA VA 1966 19.1 17.7 15.7 2.30 1.97 2.50 0.938 0.804 1.02 0.203 0.203
# 8 BWA VA 1967 21.1 19.1 17.7 1.84 2.30 1.97 0.750 0.938 0.804 0.203 0.203
# 9 BWA VA 1968 21.9 21.1 19.1 5.24 1.84 2.30 2.14 0.750 0.938 0.578 0.203
# 10 BWA VA 1969 23.1 21.9 21.1 10.2 5.24 1.84 4.15 2.14 0.750 1.12 0.578
# # ... with 5,017 more rows, and 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>,
# # L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>, L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>,
# # F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>, F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>,
# # OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>, SUM <dbl>, L1.SUM <dbl>
If the time variable passed does not exactly identify the data (e.g. because of gaps or repeated values within groups), all 3 functions will issue appropriate error messages. It is also possible to omit the time variable if one is certain that the data is sorted:
GGDC10S %>%
fselect(Variable,Country,AGR:SUM) %>%
fgroup_by(Variable,Country) %>% flag
# # A tibble: 5,027 x 13
# Variable Country L1.AGR L1.MIN L1.MAN L1.PU L1.CON L1.WRT L1.TRA L1.FIRE L1.GOV L1.OTH L1.SUM
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 2 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 3 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 4 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 5 VA BWA NA NA NA NA NA NA NA NA NA NA NA
# 6 VA BWA 16.3 3.49 0.737 0.104 0.660 6.24 1.66 1.12 4.82 2.34 37.5
# 7 VA BWA 15.7 2.50 1.02 0.135 1.35 7.06 1.94 1.25 5.70 2.68 39.3
# 8 VA BWA 17.7 1.97 0.804 0.203 1.35 8.27 2.15 1.36 6.37 2.99 43.1
# 9 VA BWA 19.1 2.30 0.938 0.203 0.897 4.31 1.72 1.54 7.04 3.31 41.4
# 10 VA BWA 21.1 1.84 0.750 0.203 1.22 5.17 2.44 1.03 5.03 2.36 41.1
# # ... with 5,017 more rows
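The error behaviour can be illustrated with a quick sketch (not part of the original vignette): grouping only by Variable leaves repeated Year values within groups, so flag should signal an error rather than silently computing a wrong lag (the exact message depends on the collapse version).
try(GGDC10S %>%
fselect(Variable, Country, Year, AGR:SUM) %>%
fgroup_by(Variable) %>% flag(1, Year))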
fdiff computes sequences of lagged / leaded and iterated differences, as well as quasi-differences and log-differences, on time series and panel data. The code below computes the 1 and 10-year first and second differences of each variable in the data:
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
# Country Variable Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -0.575 NA NA NA -0.998 NA NA NA 0.282
# 7 BWA VA 1966 1.95 2.53 NA NA -0.525 0.473 NA NA -0.214
# 8 BWA VA 1967 1.47 -0.488 NA NA 0.328 0.854 NA NA 0.134
# 9 BWA VA 1968 1.95 0.488 NA NA -0.460 -0.788 NA NA -0.188
# 10 BWA VA 1969 0.763 -1.19 NA NA 3.41 3.87 NA NA 1.39
# # ... with 5,017 more rows, and 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>,
# # D1.PU <dbl>, D2.PU <dbl>, L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>,
# # L10D1.CON <dbl>, L10D2.CON <dbl>, D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>,
# # D1.TRA <dbl>, D2.TRA <dbl>, L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>,
# # L10D1.FIRE <dbl>, L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>,
# # L10D2.GOV <dbl>, D1.OTH <dbl>, D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>,
# # D2.SUM <dbl>, L10D1.SUM <dbl>, L10D2.SUM <dbl>
Log-differences of the form \(\log(x_t) - \log(x_{t-s})\) are also easily computed, although one caveat of log-differencing in C++ is that log(NA) - log(NA) gives a NaN rather than an NA value.
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fdiff(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NaN NA NaN NA NaN NA
# 3 BWA VA 1962 NaN NA NaN NA NaN NA
# 4 BWA VA 1963 NaN NA NaN NA NaN NA
# 5 BWA VA 1964 NaN NA NaN NA NaN NA
# 6 BWA VA 1965 -0.0359 NA -0.336 NA 0.324 NA
# 7 BWA VA 1966 0.117 NA -0.236 NA -0.236 NA
# 8 BWA VA 1967 0.0796 NA 0.154 NA 0.154 NA
# 9 BWA VA 1968 0.0972 NA -0.223 NA -0.223 NA
# 10 BWA VA 1969 0.0355 NA 1.05 NA 1.05 NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# # Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# # L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# # Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
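If plain NA's are preferred over NaN's, the result can be recoded afterwards. A minimal sketch (not part of the vignette):
logd <- GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fdiff(1, 1, Year, logdiff = TRUE)
# Recode NaN to NA in all numeric columns
logd[] <- lapply(logd, function(x) if(is.double(x)) replace(x, is.nan(x), NA) else x)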
Finally, it is also possible to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(\log(x_t) - \rho \log(x_{t-s})\):
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fdiff(t = Year, rho = 0.95)
# # A tibble: 5,027 x 14
# Country Variable Year QD1.AGR QD1.MIN QD1.MAN QD1.PU QD1.CON QD1.WRT QD1.TRA QD1.FIRE QD1.GOV
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 0.241 -0.824 0.318 0.0359 0.719 1.13 0.363 0.184 1.11
# 7 BWA VA 1966 2.74 -0.401 -0.163 0.0743 0.0673 1.56 0.312 0.174 0.955
# 8 BWA VA 1967 2.35 0.427 0.174 0.0101 -0.381 -3.55 -0.323 0.246 0.988
# 9 BWA VA 1968 2.91 -0.345 -0.141 0.0101 0.365 1.08 0.804 -0.427 -1.66
# 10 BWA VA 1969 1.82 3.50 1.43 0.385 2.32 0.841 0.397 0.252 0.818
# # ... with 5,017 more rows, and 2 more variables: QD1.OTH <dbl>, QD1.SUM <dbl>
The quasi-differencing feature was added to fdiff to facilitate the preparation of time series and panel data for least-squares estimations suffering from serial correlation, following Cochrane & Orcutt (1949).
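As a minimal sketch of such a workflow (a hypothetical example: pooled regression of the sector total SUM on agriculture AGR in the VA panel; a proper Cochrane-Orcutt procedure would iterate the rho estimation until convergence):
# Estimate rho from the first-order autocorrelation of the pooled residuals
VAd <- fsubset(GGDC10S, Variable == "VA", Country, Year, AGR, SUM)
res <- resid(lm(SUM ~ AGR, data = VAd, na.action = na.exclude))
rho <- cor(res, flag(res, 1, VAd$Country, VAd$Year), use = "complete.obs")
# Quasi-difference the data with the estimated rho and re-estimate by OLS
VAd_qd <- VAd %>% fgroup_by(Country) %>% fdiff(t = Year, rho = rho)
lm(QD1.SUM ~ QD1.AGR, data = VAd_qd)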
Finally, fgrowth computes growth rates in the same way. By default, exact growth rates are computed in percentage terms using \((x_t - x_{t-s}) / x_{t-s} \times 100\) (the default argument is scale = 100). The user can also request growth rates obtained by log-differencing, computed as \(\log(x_t / x_{t-s}) \times 100\).
# Exact growth rates, computed as: (x - lag(x)) / lag(x) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year)
# # A tibble: 5,027 x 25
# Country Variable Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA -28.6 NA 38.2 NA 29.4 NA 104.
# 7 BWA VA 1966 12.4 NA -21.1 NA -21.1 NA 50.0 NA 0
# 8 BWA VA 1967 8.29 NA 16.7 NA 16.7 NA 0 NA -33.3
# 9 BWA VA 1968 10.2 NA -20 NA -20 NA 0 NA 35.7
# 10 BWA VA 1969 3.61 NA 185. NA 185. NA 185. NA 185.
# # ... with 5,017 more rows, and 13 more variables: L10G1.CON <dbl>, G1.WRT <dbl>, L10G1.WRT <dbl>,
# # G1.TRA <dbl>, L10G1.TRA <dbl>, G1.FIRE <dbl>, L10G1.FIRE <dbl>, G1.GOV <dbl>, L10G1.GOV <dbl>,
# # G1.OTH <dbl>, L10G1.OTH <dbl>, G1.SUM <dbl>, L10G1.SUM <dbl>
# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
# Country Variable Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA
# 2 BWA VA 1961 NaN NA NaN NA NaN NA
# 3 BWA VA 1962 NaN NA NaN NA NaN NA
# 4 BWA VA 1963 NaN NA NaN NA NaN NA
# 5 BWA VA 1964 NaN NA NaN NA NaN NA
# 6 BWA VA 1965 -3.59 NA -33.6 NA 32.4 NA
# 7 BWA VA 1966 11.7 NA -23.6 NA -23.6 NA
# 8 BWA VA 1967 7.96 NA 15.4 NA 15.4 NA
# 9 BWA VA 1968 9.72 NA -22.3 NA -22.3 NA
# 10 BWA VA 1969 3.55 NA 105. NA 105. NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# # Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# # L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# # Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>
fdiff and fgrowth can also compute leaded (forward) differences and growth rates (e.g. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute one and 10-year leaded growth rates of first and second order), although I have rarely found use for these in my own work. Again it is possible to perform sequential operations:
# This computes the 1 and 10-year growth rates, for the current period and lagged by one period
GGDC10S %>%
fselect(-Region, -Regioncode) %>%
fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
# Country Variable Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
# * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA VA 1960 NA NA NA NA NA NA NA
# 2 BWA VA 1961 NA NA NA NA NA NA NA
# 3 BWA VA 1962 NA NA NA NA NA NA NA
# 4 BWA VA 1963 NA NA NA NA NA NA NA
# 5 BWA VA 1964 NA NA NA NA NA NA NA
# 6 BWA VA 1965 -3.52 NA NA NA -28.6 NA NA
# 7 BWA VA 1966 12.4 -3.52 NA NA -21.1 -28.6 NA
# 8 BWA VA 1967 8.29 12.4 NA NA 16.7 -21.1 NA
# 9 BWA VA 1968 10.2 8.29 NA NA -20 16.7 NA
# 10 BWA VA 1969 3.61 10.2 NA NA 185. -20 NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# # L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# # L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# # L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# # G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# # L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# # L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# # L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>
This section seeks to demonstrate that the functionality introduced in the preceding 2 sections indeed produces code that evaluates substantially faster than native dplyr. To do this properly, the different components of a typical piped call (selecting / subsetting, grouping, and performing some computation) are benchmarked separately on 2 different data sizes. All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHz Intel i5 processor, 8GB DDR3 RAM and a Samsung 850 EVO SSD hard drive. Benchmarks are run on the original GGDC10S data used throughout this vignette and on a larger dataset with approx. 1 million observations, obtained by replicating and row-binding GGDC10S 200 times while maintaining unique groups.
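The timings below are obtained with the microbenchmark package, which is assumed to be attached (alongside dplyr and collapse) for all benchmark code in this section:
library(microbenchmark)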
# This shows the groups in GGDC10S
GRP(GGDC10S, ~ Variable + Country)
# collapse grouping object of length 5027 with 85 ordered groups
#
# Call: GRP.default(X = GGDC10S, by = ~Variable + Country), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN
# 62 61 62 52 63 62
# ---
# VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB
# 63 52 65 63 52 52
# This replicates the data 200 times
data <- replicate(200, GGDC10S, simplify = FALSE)
# This function adds a number i to the country and variable columns of each dataset
uniquify <- function(x, i) `get_vars<-`(x, c(1,4), value = lapply(unclass(x)[c(1,4)], paste0, i))
# Making datasets unique and row-binding them
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
dim(data)
# [1] 1005400 16
# This shows the groups in the replicated data
GRP(data, ~ Variable + Country)
# collapse grouping object of length 1005400 with 17000 ordered groups
#
# Call: GRP.default(X = data, by = ~Variable + Country), unordered
#
# Distribution of group sizes:
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 4.00 53.00 62.00 59.14 63.00 65.00
#
# Groups with sizes:
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1
# 62 61 62 52 63 62
# ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99
# 63 52 65 63 52 52
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1836700 98.1 3518041 187.9 3518041 187.9
# Vcells 19741078 150.7 28133716 214.7 22917727 174.9
## Selecting columns
# Small
microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM),
collapse = fselect(GGDC10S, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 3620.854 3823.227 4218.70979 4043.227 4355.3780 7289.010 100
# collapse 13.387 18.296 34.82984 35.700 44.4015 133.428 100
# Large
microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM),
collapse = fselect(data, Country, Variable, AGR:SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 3639.597 3718.806 3979.3292 3934.120 4131.361 7212.256 100
# collapse 13.388 18.966 32.8797 29.229 43.509 166.896 100
## Subsetting columns
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"),
collapse = fsubset(GGDC10S, Variable == "VA"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 836.268 925.2945 1081.0888 1014.544 1116.512 2371.361 100
# collapse 151.279 173.8140 227.1581 192.779 296.978 503.814 100
# Large
microbenchmark(dplyr = filter(data, Variable == "VA"),
collapse = fsubset(data, Variable == "VA"))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 14.510190 14.831934 17.950144 15.183132 15.913639 153.37622 100
# collapse 7.835217 7.976231 9.022352 8.200694 8.643372 26.03588 100
## Grouping
# Small
microbenchmark(dplyr = group_by(GGDC10S, Country, Variable),
collapse = fgroup_by(GGDC10S, Country, Variable))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1183.895 1212.6785 1341.5189 1249.047 1399.880 2630.184 100
# collapse 356.106 386.6735 411.3915 395.599 419.473 658.216 100
# Large
microbenchmark(dplyr = group_by(data, Country, Variable),
collapse = fgroup_by(data, Country, Variable), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 156.20365 159.78255 162.24441 161.17418 164.01075 175.43424 10
# collapse 69.79582 70.22555 71.26995 70.56872 71.20507 74.75943 10
## Computing a new column
# Small
microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1),
collapse = ftransform(GGDC10S, NEW = AGR+1))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 539.960 561.1565 685.51262 579.006 641.4815 3928.764 100
# collapse 22.759 29.0060 42.11706 39.939 44.1790 211.521 100
# Large
microbenchmark(dplyr = mutate(data, NEW = AGR+1),
collapse = ftransform(data, NEW = AGR+1))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 4.454891 4.661727 6.107364 4.825277 4.964506 21.87729 100
# collapse 3.728400 3.860711 5.109817 3.965134 4.090752 21.44042 100
## All combined with pipes
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>%
select(Country, AGR:SUM) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(GGDC10S, Variable == "VA", Country, AGR:SUM) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 5852.539 6062.721 6428.0375 6211.322 6601.7880 10666.213 100
# collapse 456.512 510.285 636.5101 597.749 689.8995 1726.533 100
# Large
microbenchmark(dplyr = filter(data, Variable == "VA") %>%
select(Country, AGR:SUM) %>%
mutate(NEW = AGR+1) %>%
group_by(Country),
collapse = fsubset(data, Variable == "VA", Country, AGR:SUM) %>%
ftransform(NEW = AGR+1) %>%
fgroup_by(Country), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 19.59741 20.009300 21.57978 20.545913 21.392668 31.265005 10
# collapse 8.48897 8.518869 8.69884 8.655866 8.768321 9.146292 10
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1837202 98.2 3518041 187.9 3518041 187.9
# Vcells 20831076 159.0 33840459 258.2 33840448 258.2
## Grouping the data
cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
rm(data, GGDC10S)
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1854333 99.1 3518041 187.9 3518041 187.9
# Vcells 19932755 152.1 33840459 258.2 33840448 258.2
## Conversion of Grouping object: This time would be required extra in all hybrid calls
## i.e. when calling collapse functions on data grouped with dplyr::group_by
# Small
microbenchmark(GRP(gGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# GRP(gGGDC10S) 30.345 30.791 33.08949 31.238 31.907 99.514 100
# Large
microbenchmark(GRP(gdata))
# Unit: milliseconds
# expr min lq mean median uq max neval
# GRP(gdata) 4.400003 4.580732 5.200687 4.683815 4.789576 23.11608 100
## Sum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1418.622 1463.023 1649.126 1539.1085 1660.041 3473.146 100
# collapse 235.619 246.329 286.045 276.0045 298.540 683.652 100
# Large
microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 96.59567 98.17270 99.39025 99.53153 100.90642 101.46334 10
# collapse 41.11057 41.56217 42.45810 41.81631 43.06915 45.86935 10
## Mean
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE),
collapse = fmean(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 6168.482 6439.577 7504.9223 6635.703 7002.296 30337.70 100
# collapse 252.576 284.483 331.7185 331.339 361.461 824.22 100
# Large
microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE),
collapse = fmean(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1171.4805 1174.90722 1182.17062 1178.05126 1192.27920 1202.67186 10
# collapse 44.8296 45.06432 46.56113 46.01706 48.20546 49.60668 10
## Median
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE),
collapse = fmedian(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 47217.467 48604.8515 52688.2321 50414.609 56091.3260 70558.898 100
# collapse 493.104 554.9095 629.6558 599.534 644.1585 1697.973 100
# Large
microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE),
collapse = fmedian(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 10049.46566 10049.46566 10197.61900 10197.61900 10345.77234 10345.77234 2
# collapse 90.38345 90.38345 90.89374 90.89374 91.40402 91.40402 2
## Standard Deviation
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE),
collapse = fsd(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 18745.526 19407.3115 20736.9381 19967.3530 20687.373 33379.77 100
# collapse 430.183 471.4605 518.9818 515.6395 549.777 867.06 100
# Large
microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE),
collapse = fsd(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 3757.45055 3757.45055 3865.71849 3865.71849 3973.98644 3973.98644 2
# collapse 80.84446 80.84446 81.04973 81.04973 81.25501 81.25501 2
## Maximum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE),
collapse = fmax(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 1257.972 1298.804 1426.3818 1326.4715 1445.396 3121.949 100
# collapse 178.946 187.201 221.6336 211.9675 230.710 567.627 100
# Large
microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE),
collapse = fmax(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 62.00612 63.18912 64.08291 63.76188 63.97541 67.11297 10
# collapse 24.67571 24.89080 25.90325 25.21500 26.57450 29.38005 10
## First Value
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, first),
collapse = ffirst(cgGGDC10S, na.rm = FALSE))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 670.711 699.7165 789.84525 706.8570 758.845 2776.554 100
# collapse 57.567 65.5990 86.95606 83.8945 93.935 242.313 100
# Large
microbenchmark(dplyr = summarise_all(gdata, first),
collapse = ffirst(cgdata, na.rm = FALSE), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 16.292057 16.520536 16.945676 16.989542 17.363943 17.440252 10
# collapse 4.518258 4.546817 4.901585 4.622234 4.722193 6.404547 10
## Number of Distinct Values
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE),
collapse = fNdistinct(cgGGDC10S))
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 14.458871 14.908019 15.706116 15.140961 15.807432 27.186747 100
# collapse 1.347222 1.426654 1.499281 1.477749 1.548033 1.936715 100
# Large
microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE),
collapse = fNdistinct(cgdata), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 2707.4958 2718.4918 2756.591 2724.9619 2740.3664 2891.6404 5
# collapse 330.2335 339.5097 342.779 345.1878 345.1909 353.7731 5
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1856384 99.2 3518041 187.9 3518041 187.9
# Vcells 19937429 152.2 33840459 258.2 33840448 258.2
Below I add some benchmarks for weighted aggregations and for aggregations using the statistical mode, neither of which can easily or efficiently be performed with dplyr.
## Weighted Mean
# Small
microbenchmark(fmean(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fmean(cgGGDC10S, SUM) 280.244 282.921 304.3278 287.161 312.151 442.232 100
# Large
microbenchmark(fmean(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmean(cgdata, SUM) 50.39118 50.86911 51.83627 51.30465 52.61885 54.16688 10
## Weighted Standard-Deviation
# Small
microbenchmark(fsd(cgGGDC10S, SUM))
# Unit: microseconds
# expr min lq mean median uq max neval
# fsd(cgGGDC10S, SUM) 431.075 434.422 464.6378 458.966 465.4365 610.913 100
# Large
microbenchmark(fsd(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fsd(cgdata, SUM) 81.84584 82.15465 84.08605 84.70272 85.12487 85.55416 10
## Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S) 1.605153 1.645092 1.736698 1.677669 1.795924 2.65205 100
# Large
microbenchmark(fmode(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata) 404.9376 410.2636 420.907 416.9591 432.5251 440.6981 10
## Weighted Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S, SUM))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgGGDC10S, SUM) 1.851482 1.917749 2.064699 2.0514 2.113875 3.416473 100
# Large
microbenchmark(fmode(cgdata, SUM), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fmode(cgdata, SUM) 509.6265 525.5481 534.3889 531.7452 547.8279 558.1113 10
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1855705 99.2 3518041 187.9 3518041 187.9
# Vcells 19933812 152.1 33840459 258.2 33840456 258.2
## Replacing with group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE),
collapse = fsum(cgGGDC10S, TRA = "replace_fill"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 2797.973 2890.346 3082.2062 2989.413 3112.354 8315.826 100
# collapse 295.862 329.554 358.7566 348.966 377.526 540.406 100
# Large
microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE),
collapse = fsum(cgdata, TRA = "replace_fill"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 270.42208 282.9902 316.2438 289.6234 295.8109 453.9820 10
# collapse 88.50386 100.9399 116.0507 101.4477 111.4463 232.5174 10
## Dividing by group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgGGDC10S, TRA = "/"))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 5945.804 6133.675 6723.2883 6348.3195 6697.0625 20229.747 100
# collapse 550.670 615.599 663.6286 641.7045 692.1305 1038.419 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)),
collapse = fsum(cgdata, TRA = "/"), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 988.6350 999.9839 1214.339 1258.7311 1287.4335 1470.5231 10
# collapse 137.4849 152.3914 180.398 159.8702 193.3738 329.8194 10
## Centering
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 9895.989 10369.457 13140.2452 10796.5170 13811.812 45702.011 100
# collapse 359.230 388.236 486.0711 429.7365 489.088 825.558 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)),
collapse = fwithin(cgdata), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 1756.2252 1954.7906 2192.6484 2264.893 2377.1837 2527.9280 10
# collapse 101.7043 116.6371 151.8933 129.668 145.6401 279.0034 10
## Centering and Scaling (Standardizing)
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr 27544.192 28146.626 30431.4057 28978.2090 29886.32 44103.551 100
# collapse 499.798 536.167 574.6245 569.6355 596.41 730.508 100
# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
collapse = fscale(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 5863.8168 5863.8168 5931.7402 5931.7402 5999.6635 5999.6635 2
# collapse 133.4147 133.4147 136.6256 136.6256 139.8366 139.8366 2
## Lag
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, dplyr::lag),
collapse_unordered = flag(cgGGDC10S),
dplyr_ordered = mutate_all(gGGDC10S, dplyr::lag, order_by = "Year"),
collapse_ordered = flag(cgGGDC10S, t = Year))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr_unordered 2100.935 2259.3525 2411.7816 2387.202 2504.5655 3725.276 100
# collapse_unordered 343.165 429.9600 477.2800 475.031 516.3090 989.331 100
# dplyr_ordered 53583.637 55465.2400 57983.7729 56759.805 60018.0825 70411.191 100
# collapse_ordered 323.976 374.1785 413.3908 398.499 430.8525 1266.005 100
# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, dplyr::lag),
collapse_unordered = flag(cgdata),
dplyr_ordered = mutate_all(gdata, dplyr::lag, order_by = "Year"),
collapse_ordered = flag(cgdata, t = Year), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 201.63434 201.63434 212.42327 212.42327 223.2122 223.2122 2
# collapse_unordered 55.62299 55.62299 139.83149 139.83149 224.0400 224.0400 2
# dplyr_ordered 11135.51293 11135.51293 11164.82024 11164.82024 11194.1276 11194.1276 2
# collapse_ordered 93.11449 93.11449 94.26759 94.26759 95.4207 95.4207 2
## First-Difference (unordered)
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgGGDC10S))
# Unit: microseconds
# expr min lq mean median uq max neval
# dplyr_unordered 34853.283 35875.413 37907.1343 36404.216 39714.036 46266.96 100
# collapse_unordered 377.526 446.248 506.4065 511.623 560.934 894.28 100
# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)),
collapse_unordered = fdiff(cgdata), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr_unordered 7207.30481 7207.30481 7242.75048 7242.75048 7278.19616 7278.19616 2
# collapse_unordered 60.02433 60.02433 60.24835 60.24835 60.47236 60.47236 2
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 1859004 99.3 3518041 187.9 3518041 187.9
# Vcells 20984685 160.2 58767528 448.4 48906274 373.2
Below I again add some benchmarks for transformations not easily or efficiently performed with dplyr, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, (iterated) panel-differences and growth rates.
# Centering on overall mean
system.time(fwithin(cgdata, mean = "overall.mean"))
# user system elapsed
# 0.06 0.03 0.09
# Weighted Centering
system.time(fwithin(cgdata, SUM))
# user system elapsed
# 0.06 0.03 0.09
system.time(fwithin(cgdata, SUM, mean = "overall.mean"))
# user system elapsed
# 0.08 0.00 0.08
# Weighted Scaling and Standardizing
system.time(fsd(cgdata, SUM, TRA = "/"))
# user system elapsed
# 0.11 0.04 0.15
system.time(fscale(cgdata, SUM))
# user system elapsed
# 0.11 0.02 0.13
# Sequence of lags and leads
system.time(flag(cgdata, -1:1))
# user system elapsed
# 0.04 0.07 0.10
# Iterated difference
system.time(fdiff(cgdata, 1, 2))
# user system elapsed
# 0.09 0.00 0.09
# Growth Rate
system.time(fgrowth(cgdata,1))
# user system elapsed
# 0.07 0.02 0.09
Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.
Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.
Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.
Row-wise operations are not supported by TRA.