collapse and dplyr

Fast (Weighted) Aggregations, Transformations and Panel Computations in a Piped Workflow

Sebastian Krantz

2020-05-19

collapse is a C/C++ based package for data manipulation in R. Its aims are

  1. to facilitate complex data transformation and exploration tasks and

  2. to help make R code fast, flexible, parsimonious and programmer friendly.

This vignette focuses on the integration of collapse with the popular dplyr package by Hadley Wickham. In particular, it demonstrates how collapse’s fast functions and fast alternatives to some dplyr verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel-data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr (piped) workflow.




1. Fast Aggregations

A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct), which can substantially speed up column-wise, grouped and weighted computations on vectors, matrices or data.frames. The functions are S3 generic, with a default (vector), matrix and data.frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:

FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
               use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)

where w is a weight variable (available only for fsum, fprod, fmean, fmode, fvar and fsd), and TRA can be used to transform x using the computed statistics and one of 10 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%"). These transformations perform grouped replacing or sweeping out of the statistics computed by the function (discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is custom in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable used is also aggregated and saved in a column. For fsum, fmean, fvar and fsd this computes the sum of the weights in each group, whereas fmode returns the maximum weight (corresponding to the mode) in each group and fprod returns the product of the weights.
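As a quick illustration of the TRA argument, the minimal sketch below (with a small hypothetical tibble) uses TRA = "-" to sweep the group means out of the data, i.e. to center x within each group:

```r
library(collapse)
library(dplyr)

# Sketch: TRA = "-" subtracts the computed statistic from the data,
# here centering x within each group defined by g
d <- tibble(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))
d %>% group_by(g) %>% fmean(TRA = "-")
# x becomes -1, 1, -5, 5: each value minus its group mean (2 and 15)
```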

With that in mind, let’s consider some straightforward applications.

1.1 Simple Aggregations

Consider the Groningen Growth and Development Center 10-Sector Database included in collapse and introduced in the main vignette:

library(collapse)
head(GGDC10S)
#   Country Regioncode             Region Variable Year      AGR      MIN       MAN        PU
# 1     BWA        SSA Sub-saharan Africa       VA 1960       NA       NA        NA        NA
# 2     BWA        SSA Sub-saharan Africa       VA 1961       NA       NA        NA        NA
# 3     BWA        SSA Sub-saharan Africa       VA 1962       NA       NA        NA        NA
# 4     BWA        SSA Sub-saharan Africa       VA 1963       NA       NA        NA        NA
# 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154 3.494075 0.7365696 0.1043936
# 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700 2.495768 1.0181992 0.1350976
#         CON      WRT      TRA     FIRE      GOV      OTH      SUM
# 1        NA       NA       NA       NA       NA       NA       NA
# 2        NA       NA       NA       NA       NA       NA       NA
# 3        NA       NA       NA       NA       NA       NA       NA
# 4        NA       NA       NA       NA       NA       NA       NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710

# Summarize the Data: 
# descr(GGDC10S, cols = is.categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))

Simple column-wise computations using the fast functions and pipe operators are performed as follows:

library(dplyr)

GGDC10S %>% fNobs                       # Number of Observations
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#       5027       5027       5027       5027       5027       4364       4355       4355       4354 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4355       4355       4355       4355       3482       4248       4364
GGDC10S %>% fNdistinct                  # Number of distinct values
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#         43          6          6          2         67       4353       4224       4353       4237 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4339       4344       4334       4349       3470       4238       4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
#        AGR        MIN        MAN         PU        CON        WRT        TRA       FIRE        GOV 
#  4394.5194   173.2234  3718.0981   167.9500  1473.4470  3773.6430  1174.8000   960.1251  3928.5127 
#        OTH        SUM 
#  1433.1722 23186.1936
GGDC10S %>% select_at(6:16) %>% fmean   # Mean
#        AGR        MIN        MAN         PU        CON        WRT        TRA       FIRE        GOV 
#  2526696.5  1867908.9  5538491.4   335679.5  1801597.6  3392909.5  1473269.7  1657114.8  1712300.3 
#        OTH        SUM 
#  1684527.3 21566436.8
GGDC10S %>% fmode                       # Mode
#            Country         Regioncode             Region           Variable               Year 
#              "USA"              "ASI"             "Asia"              "EMP"             "2010" 
#                AGR                MIN                MAN                 PU                CON 
# "171.315882316326"                "0" "4645.12507642586"                "0" "1.34623115930777" 
#                WRT                TRA               FIRE                GOV                OTH 
# "21.8380052682527" "8.97743416914571" "40.0701608636442"                "0" "3626.84423577048" 
#                SUM 
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE)         # Keep data structure intact
#   Country Regioncode Region Variable Year      AGR MIN      MAN PU      CON      WRT      TRA
# 1     USA        ASI   Asia      EMP 2010 171.3159   0 4645.125  0 1.346231 21.83801 8.977434
#       FIRE GOV      OTH      SUM
# 1 40.07016   0 3626.844 37.48229

Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:

GGDC10S %>% 
  group_by(Variable,Country) %>%
  select_at(6:16) %>% fmean
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       1420.   52.1   1932.  1.02e2 7.42e2 1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
#  2 EMP      BOL        964.   56.0    235.  5.35e0 1.23e2 2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
#  3 EMP      BRA      17191.  206.    6991.  3.65e2 3.52e3 8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
#  4 EMP      BWA        188.   10.5     18.1 3.09e0 2.53e1 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
#  5 EMP      CHL        702.  101.     625.  2.94e1 2.96e2 6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
#  6 EMP      CHN     287744. 7050.   67144.  1.61e3 2.09e4 2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5
#  7 EMP      COL       3091.  145.    1175.  3.39e1 5.24e2 2.07e3 4.70e2  649.     NA   1.73e3 9.89e3
#  8 EMP      CRI        231.    1.70   136.  1.43e1 5.76e1 1.57e2 4.24e1   54.9   128.  6.51e1 8.87e2
#  9 EMP      DEW       2490.  407.    8473.  2.26e2 2.09e3 4.44e3 1.48e3 1689.   3945.  9.99e2 2.62e4
# 10 EMP      DNK        236.    8.03   507.  1.38e1 1.71e2 4.55e2 1.61e2  181.    549.  1.11e2 2.39e3
# # ... with 75 more rows

Similarly, we can aggregate using any of the other functions above.

It is important not to use dplyr’s summarize together with these functions, since that would eliminate their speed gain. The functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize applies the function separately to each group in the grouped tibble. Summarize will still work with the fast functions, but it is then even slower than using primitive base functions, since the fast functions are S3 generic and dispatch is repeated for every group.
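To illustrate, both calls below yield the same aggregated tibble, but only the first performs the computation in a single grouped C++ pass; the second dispatches fmean once per group:

```r
library(collapse)
library(dplyr)

# Fast: the grouped_df method of fmean is invoked once and
# computes all group means in one pass at the C++ level
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean

# Slow: summarise_all applies fmean to every group separately,
# incurring repeated S3 dispatch and per-group function-call overhead
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>%
  summarise_all(fmean)
```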


Excursus: What is Happening Behind the Scenes?

To drive this point home it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:

dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:

GGDC10S %>% group_by(Variable,Country) %>% attr("groups")
# # A tibble: 85 x 3
#    Variable Country .rows     
#    <chr>    <chr>   <list>    
#  1 EMP      ARG     <int [62]>
#  2 EMP      BOL     <int [61]>
#  3 EMP      BRA     <int [62]>
#  4 EMP      BWA     <int [52]>
#  5 EMP      CHL     <int [63]>
#  6 EMP      CHN     <int [62]>
#  7 EMP      COL     <int [61]>
#  8 EMP      CRI     <int [62]>
#  9 EMP      DEW     <int [61]>
# 10 EMP      DNK     <int [64]>
# # ... with 75 more rows

This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups, which are then passed sequentially to the function used and later recombined. These steps are also done in C++, which makes dplyr quite efficient.

collapse, by contrast, is based on one-pass grouped computations at the C++ level using its own grouped statistical functions. In other words, the data is not split and recombined at all; the entire computation is performed in a single C++ loop running through the data and completing the computations for all groups simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:

GGDC10S %>% group_by(Variable,Country) %>% GRP %>% str
# List of 8
#  $ N.groups   : int 85
#  $ group.id   : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
#  $ group.sizes: int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
#  $ groups     :List of 2
#   ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
#   .. ..- attr(*, "label")= chr "Variable"
#   .. ..- attr(*, "format.stata")= chr "%9s"
#   ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
#   .. ..- attr(*, "label")= chr "Country"
#   .. ..- attr(*, "format.stata")= chr "%9s"
#  $ group.vars : chr [1:2] "Variable" "Country"
#  $ ordered    : logi [1:2] TRUE TRUE
#  $ order      : NULL
#  $ call       : language GRP.grouped_df(X = .)
#  - attr(*, "class")= chr "GRP"

This object is a list in which the first three elements give the number of groups, the group-id to which each row belongs, and a vector of group-sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i’th data point to the group.id[i]’th element of the result vector. When the loop is finished, the grouped computation is also finished.
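The mechanism described above can be sketched in base R (illustrative only; the actual implementation runs in C++ and also handles weights and multiple columns):

```r
# g: the 'group.id' vector, ng: 'N.groups'; one pass through x suffices
one_pass_group_sum <- function(x, g, ng) {
  out <- numeric(ng)                 # one result slot per group
  for (i in seq_along(x))            # single loop through the data
    if (!is.na(x[i])) out[g[i]] <- out[g[i]] + x[i]
  out                                # grouped sums, no split/recombine
}
one_pass_group_sum(c(1, 2, 3, 4), g = c(1, 2, 1, 2), ng = 2)  # 4 6
```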

collapse is thus faster than dplyr simply because its method of computing involves fewer steps.


1.2 More Speed using collapse Verbs

collapse’s fast functions do not attain their maximal performance on a grouped tibble created with group_by because of the additional cost of converting the grouping object incurred by GRP.grouped_df. This cost is already minimized through the use of C++, but we can do even better by replacing group_by with collapse::fgroup_by. fgroup_by works like group_by but performs the grouping with collapse::GRP (up to 10x faster than group_by) and simply attaches a collapse grouping object to the grouped_df. The speed gain is thus 2-fold: faster grouping and no conversion cost when calling collapse functions.

Another improvement comes from replacing the dplyr verb select with collapse::fselect, and, for selection using column names, indices or functions, using collapse::get_vars instead of select_at or select_if. Next to get_vars, collapse also introduces the predicates num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to efficiently select columns by data type.

GGDC10S %>% fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       1325.   47.4   1988.  1.05e2 7.82e2 1.85e3 5.80e2  464.   1739.   866.  9.74e3
#  2 EMP      BOL        943.   53.5    167.  4.46e0 6.60e1 1.32e2 9.70e1   15.3    NA    384.  1.84e3
#  3 EMP      BRA      17481.  225.    7208.  3.76e2 4.05e3 6.45e3 1.58e3 4355.   4450.  4479.  5.19e4
#  4 EMP      BWA        175.   12.2     13.1 3.71e0 1.90e1 2.11e1 6.75e0   10.4    53.8   31.2 3.61e2
#  5 EMP      CHL        690.   93.9    607.  2.58e1 2.30e2 4.84e2 2.05e2  106.     NA    900.  3.31e3
#  6 EMP      CHN     293915  8150.   61761.  1.14e3 1.06e4 1.70e4 9.56e3 4328.  19468.  9954.  4.45e5
#  7 EMP      COL       3006.   84.0   1033.  3.71e1 4.19e2 1.55e3 3.91e2  655.     NA   1430.  8.63e3
#  8 EMP      CRI        216.    1.49   114.  7.92e0 5.50e1 8.98e1 2.55e1   19.6   122.    60.6 7.19e2
#  9 EMP      DEW       2178   320.    8459.  2.47e2 2.10e3 4.45e3 1.53e3 1656    3700    900   2.65e4
# 10 EMP      DNK        187.    3.75   508.  1.36e1 1.65e2 4.61e2 1.61e2  169.    642.   104.  2.42e3
# # ... with 75 more rows

library(microbenchmark)
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fmedian,
               hybrid = GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% fmedian,
               dplyr = GGDC10S %>% group_by(Variable,Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
#      expr       min        lq      mean    median        uq       max neval
#  collapse   971.482  1050.245  1192.611  1100.225  1159.129  8355.542   100
#    hybrid 13576.640 14075.991 15175.286 14474.713 15185.363 22655.549   100
#     dplyr 57322.300 59729.806 62748.435 60518.103 64810.782 99397.655   100

Benchmarks on the different components of this code and with larger data are provided under ‘Benchmarks’. I note that a grouped tibble created with fgroup_by can no longer be used for grouped computations with dplyr verbs like mutate or summarize. To avoid errors with these functions and with methods like print.grouped_df, [.grouped_df etc., the classes assigned by fgroup_by are reshuffled, so that the data.frame is treated by the dplyr ecosystem like a normal tibble:

class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

class(fgroup_by(GGDC10S, Variable, Country))
# [1] "tbl_df"     "tbl"        "grouped_df" "data.frame"

I also note that fselect and get_vars are not full drop-in replacements for select because they do not have a grouped_df method:

GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% head(3)
# # A tibble: 3 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% head(3)
# # A tibble: 3 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Since by default keep.group_vars = TRUE in the Fast Statistical Functions, the end result is nevertheless the same:

GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
#   Variable Country    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV   OTH    SUM
#   <chr>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
# 1 EMP      ARG      1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.  992. 10542.
# 2 EMP      BOL       964.  56.0  235.   5.35  123.  282.  115.   44.6   NA   396.  2221.
# 3 EMP      BRA     17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307. 5710. 54273.
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
#   Variable Country    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV   OTH    SUM
#   <chr>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
# 1 EMP      ARG      1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.  992. 10542.
# 2 EMP      BOL       964.  56.0  235.   5.35  123.  282.  115.   44.6   NA   396.  2221.
# 3 EMP      BRA     17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307. 5710. 54273.

Another useful verb introduced by collapse is fgroup_vars, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble:

# fgroup_vars fully supports grouped tibbles created with group_by or fgroup_by: 
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 VA       BWA    
# 2 VA       BWA    
# 3 VA       BWA
GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 VA       BWA    
# 2 VA       BWA    
# 3 VA       BWA

# The other possibilities:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 EMP      ARG    
# 2 EMP      BOL    
# 3 EMP      BRA
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names")
# [1] "Variable" "Country"
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices")
# [1] 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices")
# Variable  Country 
#        4        1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical")
#  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical")
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#       TRUE      FALSE      FALSE       TRUE      FALSE      FALSE      FALSE      FALSE      FALSE 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE

A final collapse verb I want to mention here is fsubset, a faster alternative to dplyr::filter which also allows flexibly selecting columns after the subset expression:

# Two equivalent calls, the first is substantially faster
GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3)
#   Country Year      AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
# 1     BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2     BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3     BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193

GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3)
#   Country Year      AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
# 1     BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2     BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3     BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193

1.3 Multi-Function Aggregations

One can also aggregate with multiple functions at the same time. For such operations it is often necessary to use curly braces { } to prevent first-argument injection, so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):

GGDC10S %>%
  fgroup_by(Variable,Country) %>%
  get_vars(6:16) %>% {
    cbind(fmedian(.),
          add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
    } %>% head(3)
#   Variable Country        AGR       MIN       MAN         PU        CON      WRT        TRA
# 1      EMP     ARG  1324.5255  47.35255 1987.5912 104.738825  782.40283 1854.612  579.93982
# 2      EMP     BOL   943.1612  53.53538  167.1502   4.457895   65.97904  132.225   96.96828
# 3      EMP     BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
#         FIRE      GOV       OTH       SUM   mean_AGR  mean_MIN  mean_MAN    mean_PU  mean_CON
# 1  464.39920 1738.836  866.1119  9743.223  1419.8013  52.08903 1931.7602 101.720936  742.4044
# 2   15.34259       NA  384.0678  1842.055   964.2103  56.03295  235.0332   5.346433  122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
#    mean_WRT  mean_TRA  mean_FIRE mean_GOV  mean_OTH  mean_SUM
# 1 1982.1775  648.5119  627.79291 2043.471  992.4475 10542.177
# 2  281.5164  115.4728   44.56442       NA  395.5650  2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985

The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names. The collapse function add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:

GGDC10S %>%
  fgroup_by(Variable,Country) %>% {
   add_vars(ffirst(get_vars(., "Reg", regex = TRUE)),        # Regular expression matching column names
            add_stub(fmean(num_vars(.), keep.group_vars = FALSE), "mean_"), # num_vars selects all numeric variables
            add_stub(fmedian(fselect(., PU:TRA), keep.group_vars = FALSE), "median_"), 
            add_stub(fmin(fselect(., PU:CON), keep.group_vars = FALSE), "min_"))      
  }
# # A tibble: 85 x 22
#    Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
#  * <chr>    <chr>   <chr>      <chr>      <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
#  1 EMP      ARG     LAM        Latin~     1980.    1420.    52.1    1932.   102.      742.    1982. 
#  2 EMP      BOL     LAM        Latin~     1980      964.    56.0     235.     5.35    123.     282. 
#  3 EMP      BRA     LAM        Latin~     1980.   17191.   206.     6991.   365.     3525.    8509. 
#  4 EMP      BWA     SSA        Sub-s~     1986.     188.    10.5      18.1    3.09     25.3     36.3
#  5 EMP      CHL     LAM        Latin~     1981      702.   101.      625.    29.4     296.     695. 
#  6 EMP      CHN     ASI        Asia       1980.  287744.  7050.    67144.  1606.    20852.   28908. 
#  7 EMP      COL     LAM        Latin~     1980     3091.   145.     1175.    33.9     524.    2071. 
#  8 EMP      CRI     LAM        Latin~     1980.     231.     1.70    136.    14.3      57.6    157. 
#  9 EMP      DEW     EUR        Europe     1980     2490.   407.     8473.   226.     2093.    4442. 
# 10 EMP      DNK     EUR        Europe     1980.     236.     8.03    507.    13.8     171.     455. 
# # ... with 75 more rows, and 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>,
# #   mean_OTH <dbl>, mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>,
# #   median_TRA <dbl>, min_PU <dbl>, min_CON <dbl>

Another nice feature of add_vars is that it can also very efficiently reorder columns, i.e. bind columns in a different order than they are passed in. This is done by simply specifying the positions the added columns should have in the final data.frame; add_vars then shifts the columns of the first argument to the right to fill in the gaps.

GGDC10S %>%
  fsubset(Variable == "VA", Country, AGR, SUM) %>% 
  fgroup_by(Country) %>% {
   add_vars(fgroup_vars(.,"unique"),
            add_stub(fmean(., keep.group_vars = FALSE), "mean_"),
            add_stub(fsd(., keep.group_vars = FALSE), "sd_"), 
            pos = c(2,4,3,5))
  } %>% head(3)
#   Country  mean_AGR    sd_AGR   mean_SUM    sd_SUM
# 1     ARG 14951.292 33061.413  152533.84 301316.25
# 2     BOL  3299.718  4456.331   22619.18  33172.98
# 3     BRA 76870.146 59441.696 1200562.67 976963.14

A much more compact solution to multi-function and multi-type aggregation with dplyr is offered by the function collapg:

# This aggregates numeric columns using the mean (fmean) and categorical columns with the mode (fmode)
GGDC10S %>% fgroup_by(Variable,Country) %>% collapg
# # A tibble: 85 x 16
#    Variable Country Regioncode Region  Year    AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE
#    <chr>    <chr>   <chr>      <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 EMP      ARG     LAM        Latin~ 1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2  628. 
#  2 EMP      BOL     LAM        Latin~ 1980  9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2   44.6
#  3 EMP      BRA     LAM        Latin~ 1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414. 
#  4 EMP      BWA     SSA        Sub-s~ 1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0   15.3
#  5 EMP      CHL     LAM        Latin~ 1981  7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2  272. 
#  6 EMP      CHN     ASI        Asia   1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929. 
#  7 EMP      COL     LAM        Latin~ 1980  3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2  649. 
#  8 EMP      CRI     LAM        Latin~ 1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1   54.9
#  9 EMP      DEW     EUR        Europe 1980  2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689. 
# 10 EMP      DNK     EUR        Europe 1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2  181. 
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>

By default it aggregates numeric columns using the fmean and categorical columns using fmode, and preserves the order of all columns. Changing these defaults is very easy:

# This aggregates numeric columns using the median and categorical columns using the last value
GGDC10S %>% fgroup_by(Variable,Country) %>% collapg(fmedian, flast)
# # A tibble: 85 x 16
#    Variable Country Regioncode Region  Year    AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE
#    <chr>    <chr>   <chr>      <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 EMP      ARG     LAM        Latin~ 1980. 1.32e3 4.74e1 1.99e3 1.05e2 7.82e2 1.85e3 5.80e2  464. 
#  2 EMP      BOL     LAM        Latin~ 1980  9.43e2 5.35e1 1.67e2 4.46e0 6.60e1 1.32e2 9.70e1   15.3
#  3 EMP      BRA     LAM        Latin~ 1980. 1.75e4 2.25e2 7.21e3 3.76e2 4.05e3 6.45e3 1.58e3 4355. 
#  4 EMP      BWA     SSA        Sub-s~ 1986. 1.75e2 1.22e1 1.31e1 3.71e0 1.90e1 2.11e1 6.75e0   10.4
#  5 EMP      CHL     LAM        Latin~ 1981  6.90e2 9.39e1 6.07e2 2.58e1 2.30e2 4.84e2 2.05e2  106. 
#  6 EMP      CHN     ASI        Asia   1980. 2.94e5 8.15e3 6.18e4 1.14e3 1.06e4 1.70e4 9.56e3 4328. 
#  7 EMP      COL     LAM        Latin~ 1980  3.01e3 8.40e1 1.03e3 3.71e1 4.19e2 1.55e3 3.91e2  655. 
#  8 EMP      CRI     LAM        Latin~ 1980. 2.16e2 1.49e0 1.14e2 7.92e0 5.50e1 8.98e1 2.55e1   19.6
#  9 EMP      DEW     EUR        Europe 1980  2.18e3 3.20e2 8.46e3 2.47e2 2.10e3 4.45e3 1.53e3 1656  
# 10 EMP      DNK     EUR        Europe 1980. 1.87e2 3.75e0 5.08e2 1.36e1 1.65e2 4.61e2 1.61e2  169. 
# # ... with 75 more rows, and 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>

One can apply multiple functions to both numeric and/or categorical data:

GGDC10S %>% fgroup_by(Variable,Country) %>%
  collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 x 32
#   Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
#   <chr>    <chr>   <chr>            <chr>            <chr>            <chr>        <chr>       
# 1 EMP      ARG     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# 2 EMP      BOL     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# 3 EMP      BRA     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# # ... with 25 more variables: flast.Region <chr>, fmean.Year <dbl>, fmedian.Year <dbl>,
# #   fmean.AGR <dbl>, fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>, fmean.MAN <dbl>,
# #   fmedian.MAN <dbl>, fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>, fmedian.CON <dbl>,
# #   fmean.WRT <dbl>, fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>, fmean.FIRE <dbl>,
# #   fmedian.FIRE <dbl>, fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>, fmedian.OTH <dbl>,
# #   fmean.SUM <dbl>, fmedian.SUM <dbl>

Applying multiple functions to only numeric (or only categorical) data also allows the result to be returned in a long format:

GGDC10S %>% fgroup_by(Variable,Country) %>%
  collapg(list(fmean, fmedian), cols = is.numeric, return = "long")
# # A tibble: 170 x 15
#    Function Variable Country  Year    AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE     GOV
#    <chr>    <chr>    <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
#  1 fmean    EMP      ARG     1980. 1.42e3 5.21e1 1.93e3 1.02e2 7.42e2 1.98e3 6.49e2  628.   2043. 
#  2 fmean    EMP      BOL     1980  9.64e2 5.60e1 2.35e2 5.35e0 1.23e2 2.82e2 1.15e2   44.6    NA  
#  3 fmean    EMP      BRA     1980. 1.72e4 2.06e2 6.99e3 3.65e2 3.52e3 8.51e3 2.05e3 4414.   5307. 
#  4 fmean    EMP      BWA     1986. 1.88e2 1.05e1 1.81e1 3.09e0 2.53e1 3.63e1 8.36e0   15.3    61.1
#  5 fmean    EMP      CHL     1981  7.02e2 1.01e2 6.25e2 2.94e1 2.96e2 6.95e2 2.58e2  272.     NA  
#  6 fmean    EMP      CHN     1980. 2.88e5 7.05e3 6.71e4 1.61e3 2.09e4 2.89e4 1.39e4 4929.  22669. 
#  7 fmean    EMP      COL     1980  3.09e3 1.45e2 1.18e3 3.39e1 5.24e2 2.07e3 4.70e2  649.     NA  
#  8 fmean    EMP      CRI     1980. 2.31e2 1.70e0 1.36e2 1.43e1 5.76e1 1.57e2 4.24e1   54.9   128. 
#  9 fmean    EMP      DEW     1980  2.49e3 4.07e2 8.47e3 2.26e2 2.09e3 4.44e3 1.48e3 1689.   3945. 
# 10 fmean    EMP      DNK     1980. 2.36e2 8.03e0 5.07e2 1.38e1 1.71e2 4.55e2 1.61e2  181.    549. 
# # ... with 160 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>

Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:

GGDC10S %>% fgroup_by(Variable,Country) %>%
  collapg(custom = list(fmean = 6:8, fmedian = 10:12))
# # A tibble: 85 x 8
#    Variable Country fmean.AGR fmean.MIN fmean.MAN fmedian.CON fmedian.WRT fmedian.TRA
#    <chr>    <chr>       <dbl>     <dbl>     <dbl>       <dbl>       <dbl>       <dbl>
#  1 EMP      ARG         1420.     52.1     1932.        782.       1855.       580.  
#  2 EMP      BOL          964.     56.0      235.         66.0       132.        97.0 
#  3 EMP      BRA        17191.    206.      6991.       4055.       6455.      1581.  
#  4 EMP      BWA          188.     10.5       18.1        19.0        21.1        6.75
#  5 EMP      CHL          702.    101.       625.        230.        484.       205.  
#  6 EMP      CHN       287744.   7050.     67144.      10578.      17034.      9564.  
#  7 EMP      COL         3091.    145.      1175.        419.       1553.       391.  
#  8 EMP      CRI          231.      1.70     136.         55.0        89.8       25.5 
#  9 EMP      DEW         2490.    407.      8473.       2095.       4454.      1525.  
# 10 EMP      DNK          236.      8.03     507.        165.        461.       161.  
# # ... with 75 more rows

To understand more about collapg, look it up in the documentation (?collapg).

1.4 Weighted Aggregations

Weighted aggregations are currently possible with the functions fsum, fprod, fmean, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fsum, fmean, fsd and fvar compute a grouped sum of the weight column and place it next to the group-identifiers, fmode computes the maximum weight (corresponding to the mode), and fprod computes the product of the weights.
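Conceptually, the weighted mode is the data value whose weights sum to the largest total. A minimal base R sketch of this idea (illustrative only, not collapse's actual C++ implementation; `wmode` is a hypothetical helper):

```r
# Illustrative weighted mode: the value with the largest sum of weights.
wmode <- function(x, w) {
  ws <- tapply(w, x, sum)       # sum of weights for each distinct value
  names(ws)[which.max(ws)]      # value whose weights sum highest
}
wmode(c("a", "b", "b", "c"), c(5, 1, 2, 4))  # "a": weight sums are a = 5, b = 3, c = 4
```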

# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
  fselect(AGR:SUM) %>% fsd(SUM)
# # A tibble: 85 x 13
#    Variable Country  sum.SUM     AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH
#    <chr>    <chr>      <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
#  1 EMP      ARG       6.54e5   225.  2.22e1 1.76e2 2.05e1 2.85e2 8.56e2 1.95e2  493.   1123.  5.06e2
#  2 EMP      BOL       1.35e5    99.7 1.71e1 1.68e2 4.87e0 1.23e2 3.24e2 9.81e1   69.8    NA   2.58e2
#  3 EMP      BRA       3.36e6  1587.  7.38e1 2.95e3 9.38e1 1.86e3 6.28e3 1.31e3 3003.   3621.  4.26e3
#  4 EMP      BWA       1.85e4    32.2 3.72e0 1.48e1 1.59e0 1.80e1 3.87e1 6.02e0   13.5    39.8 8.94e0
#  5 EMP      CHL       2.51e5    71.0 3.99e1 1.29e2 1.24e1 1.88e2 5.51e2 1.34e2  313.     NA   4.26e2
#  6 EMP      CHN       2.91e7 56281.  3.09e3 4.04e4 1.27e3 1.92e4 2.45e4 9.26e3 2853.  11541.  3.74e4
#  7 EMP      COL       6.03e5   637.  1.48e2 5.94e2 1.52e1 3.97e2 1.89e3 3.62e2  435.     NA   1.01e3
#  8 EMP      CRI       5.50e4    40.4 1.04e0 7.93e1 1.37e1 3.44e1 1.68e2 4.53e1   79.8    80.7 4.34e1
#  9 EMP      DEW       1.10e6  1175.  1.83e2 7.42e2 5.32e1 1.94e2 6.06e2 2.12e2  699.   1225.  3.55e2
# 10 EMP      DNK       1.53e5   139.  7.45e0 7.73e1 1.92e0 2.56e1 5.33e1 1.57e1   91.6   248.  1.95e1
# # ... with 75 more rows

# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
  fselect(AGR:SUM) %>% fmode(SUM)
# # A tibble: 85 x 13
#    Variable Country max.SUM     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE    GOV    OTH
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 EMP      ARG      17929.  1.16e3  127.    2.16e3 1.52e2 1.41e3  3768. 1.06e3 1.75e3  4336. 2.00e3
#  2 EMP      BOL       4508.  8.19e2   37.6   6.04e2 1.08e1 4.33e2   893. 3.33e2 3.21e2    NA  1.06e3
#  3 EMP      BRA     102572.  1.65e4  313.    1.18e4 3.88e2 8.15e3 21860. 5.17e3 1.20e4 12149. 1.42e4
#  4 EMP      BWA        668.  1.71e2   13.1   4.33e1 3.93e0 1.81e1   129. 2.10e1 4.67e1   113. 2.62e1
#  5 EMP      CHL       7559.  6.30e2  249.    7.42e2 6.07e1 6.71e2  1989. 4.81e2 8.54e2    NA  1.88e3
#  6 EMP      CHN     764200   2.66e5 9247.    1.43e5 3.53e3 6.99e4 84165. 3.12e4 1.08e4 43240. 1.03e5
#  7 EMP      COL      21114.  3.93e3  513.    2.37e3 5.89e1 1.41e3  6069. 1.36e3 1.82e3    NA  3.57e3
#  8 EMP      CRI       2058.  2.83e2    2.42  2.49e2 4.38e1 1.20e2   489. 1.44e2 2.25e2   328. 1.75e2
#  9 EMP      DEW      31261   1.03e3  260     8.73e3 2.91e2 2.06e3  4398  1.63e3 3.26e3  6129  1.79e3
# 10 EMP      DNK       2823.  7.85e1    3.12  3.99e2 1.14e1 1.95e2   579. 1.87e2 3.82e2   835. 1.50e2
# # ... with 75 more rows

The weighted variance / standard deviation is currently only implemented with frequency weights. Reliability weights may be implemented in a future update of collapse, if this is a strongly requested feature.
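For intuition about frequency weights: they act like integer case counts, so the Bessel-corrected denominator becomes the sum of the weights minus one. An illustrative base R sketch (`fw_var` is a hypothetical helper, not part of collapse):

```r
# Frequency-weighted variance: weights are treated as observation counts,
# so the denominator is sum(w) - 1 rather than length(x) - 1.
fw_var <- function(x, w) {
  m <- weighted.mean(x, w)
  sum(w * (x - m)^2) / (sum(w) - 1)
}
x <- c(1, 2, 3)
# Weighting the first value by 2 is equivalent to duplicating it:
all.equal(fw_var(x, c(2, 1, 1)), var(c(1, 1, 2, 3)))  # TRUE
```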

Weighted aggregations may also be performed with collapg.

# This aggregates numeric columns using the weighted mean and categorical columns using the weighted mode
GGDC10S %>% group_by(Variable,Country) %>% collapg(w = SUM, wFUN = list(fsum, fmax))
# # A tibble: 85 x 17
#    Variable Country fsum.SUM fmax.SUM Regioncode Region  Year    AGR    MIN    MAN     PU    CON
#    <chr>    <chr>      <dbl>    <dbl> <chr>      <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       6.54e5   17929. LAM        Latin~ 1985. 1.36e3 5.65e1 1.93e3 1.05e2 8.11e2
#  2 EMP      BOL       1.35e5    4508. LAM        Latin~ 1987. 9.77e2 5.79e1 2.96e2 7.07e0 1.67e2
#  3 EMP      BRA       3.36e6  102572. LAM        Latin~ 1989. 1.77e4 2.38e2 8.47e3 3.89e2 4.44e3
#  4 EMP      BWA       1.85e4     668. SSA        Sub-s~ 1993. 2.00e2 1.21e1 2.43e1 3.70e0 3.14e1
#  5 EMP      CHL       2.51e5    7559. LAM        Latin~ 1988. 6.93e2 1.07e2 6.68e2 3.35e1 3.67e2
#  6 EMP      CHN       2.91e7  764200  ASI        Asia   1988. 3.09e5 8.23e3 8.34e4 2.09e3 2.80e4
#  7 EMP      COL       6.03e5   21114. LAM        Latin~ 1989. 3.44e3 2.04e2 1.49e3 4.20e1 7.18e2
#  8 EMP      CRI       5.50e4    2058. LAM        Latin~ 1991. 2.54e2 2.10e0 1.87e2 2.19e1 7.84e1
#  9 EMP      DEW       1.10e6   31261  EUR        Europe 1971. 2.40e3 3.95e2 8.51e3 2.29e2 2.10e3
# 10 EMP      DNK       1.53e5    2823. EUR        Europe 1981. 2.23e2 7.41e0 5.03e2 1.39e1 1.72e2
# # ... with 75 more rows, and 5 more variables: WRT <dbl>, TRA <dbl>, FIRE <dbl>, GOV <dbl>,
# #   OTH <dbl>

2. Fast Transformations

collapse also provides some fast transformations that significantly extend the scope of, and speed up, manipulations that can be performed with dplyr::mutate.

2.1 Fast Transform and Compute Variables

The function ftransform can be used to manipulate columns in the same ways as mutate:

GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  ftransform(AGR_perc = AGR / SUM * 100,  # Computing % of VA in Agriculture
             AGR_mean = fmean(AGR),       # Average Agricultural VA
             AGR = NULL, SUM = NULL) %>%  # Deleting columns AGR and SUM
             head
#   Country Year AGR_perc AGR_mean
# 1     BWA 1960       NA  5137561
# 2     BWA 1961       NA  5137561
# 3     BWA 1962       NA  5137561
# 4     BWA 1963       NA  5137561
# 5     BWA 1964 43.49132  5137561
# 6     BWA 1965 39.96990  5137561

If only the computed columns need to be returned, fcompute provides an efficient alternative:

GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  fcompute(AGR_perc = AGR / SUM * 100,
           AGR_mean = fmean(AGR)) %>% head
#   AGR_perc AGR_mean
# 1       NA  5137561
# 2       NA  5137561
# 3       NA  5137561
# 4       NA  5137561
# 5 43.49132  5137561
# 6 39.96990  5137561

ftransform and fcompute are an order of magnitude faster than mutate, but they do not support grouped computations. For common grouped operations like replacing and sweeping out statistics, however, collapse provides very efficient alternatives…
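Note, however, that the fast statistical functions themselves support grouped computations through their g argument, so a grouped statistic can still be added inside ftransform. A sketch, assuming the default methods as documented:

```r
# Adding a grouped mean as a new column: fmean's default method takes a
# grouping vector g, and TRA = "replace_fill" expands the group means
# back to the full length of the data.
GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR) %>%
  ftransform(AGR_mean = fmean(AGR, g = Country, TRA = "replace_fill")) %>% head
```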

2.2 Replacing and Sweeping out Statistics

All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or a quoted operator / string. The 10 operations supported by TRA are:

1. "replace_fill" - replace and overwrite missing values
2. "replace" - replace but preserve missing values
3. "-" - subtract
4. "-+" - subtract group-statistics but add the overall average of the statistics
5. "/" - divide
6. "%" - compute percentages
7. "+" - add
8. "*" - multiply
9. "%%" - modulus
10. "-%%" - subtract modulus

Simple transformations are again straightforward to specify:

# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
#   Year       AGR       MIN       MAN        PU       CON       WRT       TRA      FIRE       GOV
# 1  -22        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 2  -21        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 3  -20        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 4  -19        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 5  -18 -4378.218 -169.7294 -3717.362 -167.8456 -1472.787 -3767.399 -1173.141 -959.0059 -3923.690
# 6  -17 -4378.792 -170.7277 -3717.080 -167.8149 -1472.101 -3766.578 -1172.861 -958.8783 -3922.817
#         OTH       SUM
# 1        NA        NA
# 2        NA        NA
# 3        NA        NA
# 4        NA        NA
# 5 -1430.831 -23148.71
# 6 -1430.494 -23146.85

# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
#   Country Regioncode Region Variable
# 1     USA        ASI   Asia      EMP
# 2     USA        ASI   Asia      EMP
# 3     USA        ASI   Asia      EMP
# 4     USA        ASI   Asia      EMP
# 5     USA        ASI   Asia      EMP
# 6     USA        ASI   Asia      EMP

We can just as easily demean or scale the data, or compute percentages, by groups:

# Demeaning sectoral data by Variable and Country (within transformation)
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fmean(TRA = "-") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

# Scaling sectoral data by Variable and Country
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fsd(TRA = "/") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

# Normalizing Data by expressing them in percentages of the median value within each country and sector (i.e. the median is 100%)
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>%  
   fgroup_by(Variable,Country) %>% fmedian(TRA = "%") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Weighted demeaning and scaling can be computed as follows:

# Weighted demeaning (within transformation), weighted by SUM
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fmean(SUM, "-") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   SUM   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

# Weighted scaling, weighted by SUM
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fsd(SUM, "/") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   SUM   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Alternatively we could also replace data points with their groupwise weighted mean or standard deviation:

# This conducts a weighted between transformation (replacing with weighted mean)
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fmean(SUM, "replace")
# # A tibble: 5,027 x 13
#    Variable Country   SUM   AGR    MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#  * <chr>    <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  2 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  3 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  4 VA       BWA      NA     NA     NA    NA    NA    NA    NA    NA    NA    NA    NA 
#  5 VA       BWA      37.5 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  6 VA       BWA      39.3 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  7 VA       BWA      43.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  8 VA       BWA      41.4 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  9 VA       BWA      41.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA       BWA      51.2 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows

# This also replaces missing values in each group
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fmean(SUM, "replace_fill")
# # A tibble: 5,027 x 13
#    Variable Country   SUM   AGR    MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#  * <chr>    <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  2 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  3 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  4 VA       BWA      NA   1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  5 VA       BWA      37.5 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  6 VA       BWA      39.3 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  7 VA       BWA      43.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  8 VA       BWA      41.4 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
#  9 VA       BWA      41.1 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# 10 VA       BWA      51.2 1317. 13321. 2965.  529. 2747. 6547. 2158. 4432. 7556. 2615.
# # ... with 5,017 more rows

Sequential operations are also easily performed:

# This scales and then subtracts the median
GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
   fgroup_by(Variable,Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-")
# # A tibble: 5,027 x 13
#    Variable Country    AGR    MIN    MAN     PU    CON     WRT     TRA    FIRE    GOV     OTH    SUM
#  * <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#  1 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  2 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  3 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  4 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  5 VA       BWA     -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146
#  6 VA       BWA     -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146
#  7 VA       BWA     -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145
#  8 VA       BWA     -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146
#  9 VA       BWA     -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146
# 10 VA       BWA     -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145
# # ... with 5,017 more rows

Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data, as shown below:

# This adds a groupwise observation count next to each column
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
    fgroup_by(Variable,Country) %>% fselect(AGR:SUM) %>%
    fNobs("replace_fill") %>% add_stub("N_")

head(GGDC10S)
#   Country Regioncode             Region Variable Year      AGR N_AGR      MIN N_MIN       MAN N_MAN
# 1     BWA        SSA Sub-saharan Africa       VA 1960       NA    47       NA    47        NA    47
# 2     BWA        SSA Sub-saharan Africa       VA 1961       NA    47       NA    47        NA    47
# 3     BWA        SSA Sub-saharan Africa       VA 1962       NA    47       NA    47        NA    47
# 4     BWA        SSA Sub-saharan Africa       VA 1963       NA    47       NA    47        NA    47
# 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154    47 3.494075    47 0.7365696    47
# 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700    47 2.495768    47 1.0181992    47
#          PU N_PU       CON N_CON      WRT N_WRT      TRA N_TRA     FIRE N_FIRE      GOV N_GOV
# 1        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 2        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 3        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 4        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 5 0.1043936   47 0.6600454    47 6.243732    47 1.658928    47 1.119194     47 4.822485    47
# 6 0.1350976   47 1.3462312    47 7.064825    47 1.939007    47 1.246789     47 5.695848    47
#        OTH N_OTH      SUM N_SUM
# 1       NA    47       NA    47
# 2       NA    47       NA    47
# 3       NA    47       NA    47
# 4       NA    47       NA    47
# 5 2.341328    47 37.48229    47
# 6 2.678338    47 39.34710    47
rm(GGDC10S)

There are of course many other examples one could construct using the 10 operations and 13 functions listed above; the examples provided just outline the suggested programming basics.

2.3 More Control using the TRA Function

Behind the scenes of the TRA = ... argument, the fast functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace data points or sweep them out of the data in one of the 10 ways described above. This function can however also be called directly under the name TRA (shorthand for ‘transforming’ data by replacing or sweeping out statistics). Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations1. Direct calls to TRA enable more control over inputs and outputs.
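The analogy to base::sweep can be made concrete: in the ungrouped case the two are interchangeable. A sketch, assuming TRA's matrix method accepts a vector of column statistics as documented:

```r
# Ungrouped case: TRA(x, STATS, "-") behaves like sweep(x, 2, STATS, "-").
m <- as.matrix(mtcars[1:3])
# Subtract the column means from each column, both ways:
sweep(m, 2, colMeans(m), "-")   # base R
TRA(m, fmean(m), "-")           # collapse: same result, and TRA additionally
                                # takes a grouping vector g for grouped sweeps
```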

The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:

# This divides by the product
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
    get_vars(6:16) %>% fprod(TRA = "/") 
# # A tibble: 5,027 x 11
#           AGR        MIN        MAN        PU        CON        WRT       TRA      FIRE        GOV
#  *      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>     <dbl>      <dbl>
#  1 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  2 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  3 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  4 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  5  1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92  1.01e-97  2.51e-117
#  6  1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92  1.13e-97  2.96e-117
#  7  1.39e-105  1.58e-127  1.53e-101  8.62e-74  8.55e-102  5.26e-113  8.98e-92  1.23e-97  3.31e-117
#  8  1.51e-105  1.85e-127  1.78e-101  8.62e-74  5.70e-102  2.74e-113  7.18e-92  1.39e-97  3.66e-117
#  9  1.66e-105  1.48e-127  1.43e-101  8.62e-74  7.74e-102  3.29e-113  1.02e-91  9.33e-98  2.61e-117
# 10  1.72e-105  4.21e-127  4.07e-101  2.46e-73  2.21e-101  3.66e-113  1.13e-91  1.11e-97  2.91e-117
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>

# Same thing
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
    get_vars(6:16) %>% TRA(fprod(., keep.group_vars = FALSE), "/") # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")]
# # A tibble: 5,027 x 11
#           AGR        MIN        MAN        PU        CON        WRT       TRA      FIRE        GOV
#  *      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>     <dbl>      <dbl>
#  1 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  2 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  3 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  4 NA         NA         NA         NA        NA         NA         NA        NA        NA        
#  5  1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92  1.01e-97  2.51e-117
#  6  1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92  1.13e-97  2.96e-117
#  7  1.39e-105  1.58e-127  1.53e-101  8.62e-74  8.55e-102  5.26e-113  8.98e-92  1.23e-97  3.31e-117
#  8  1.51e-105  1.85e-127  1.78e-101  8.62e-74  5.70e-102  2.74e-113  7.18e-92  1.39e-97  3.66e-117
#  9  1.66e-105  1.48e-127  1.43e-101  8.62e-74  7.74e-102  3.29e-113  1.02e-91  9.33e-98  2.61e-117
# 10  1.72e-105  4.21e-127  4.07e-101  2.46e-73  2.21e-101  3.66e-113  1.13e-91  1.11e-97  2.91e-117
# # ... with 5,017 more rows, and 2 more variables: OTH <dbl>, SUM <dbl>

TRA.grouped_df is designed to match the columns of the statistics (the aggregated columns) to those of the original data, transforming only the matching columns while returning the whole data.frame. This makes it easy to apply a transformation to only the first two sectors:

# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
    get_vars(6:16) %>% TRA(fmean(fselect(., AGR, MIN), keep.group_vars = FALSE), "-")
# # A tibble: 5,027 x 11
#      AGR    MIN    MAN     PU    CON   WRT   TRA  FIRE   GOV   OTH   SUM
#  * <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1   NA     NA  NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  2   NA     NA  NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  3   NA     NA  NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  4   NA     NA  NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  5 -446. -4505.  0.737  0.104  0.660  6.24  1.66  1.12  4.82  2.34  37.5
#  6 -446. -4506.  1.02   0.135  1.35   7.06  1.94  1.25  5.70  2.68  39.3
#  7 -444. -4507.  0.804  0.203  1.35   8.27  2.15  1.36  6.37  2.99  43.1
#  8 -443. -4506.  0.938  0.203  0.897  4.31  1.72  1.54  7.04  3.31  41.4
#  9 -441. -4507.  0.750  0.203  1.22   5.17  2.44  1.03  5.03  2.36  41.1
# 10 -440. -4503.  2.14   0.578  3.47   5.75  2.72  1.23  5.59  2.63  51.2
# # ... with 5,017 more rows

Another potential use of TRA is to perform computations in two or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations between the aggregating and sweeping parts:

# Get grouped tibble
gGGDC <- GGDC10S %>% fgroup_by(Variable,Country)

# Get aggregated data
gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum
head(gsumGGDC)
# # A tibble: 6 x 13
#   Variable Country     AGR     MIN     MAN     PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG      8.80e4   3230.  1.20e5  6307.  4.60e4 1.23e5 4.02e4 3.89e4  1.27e5 6.15e4 6.54e5
# 2 EMP      BOL      5.88e4   3418.  1.43e4   326.  7.49e3 1.72e4 7.04e3 2.72e3 NA      2.41e4 1.35e5
# 3 EMP      BRA      1.07e6  12773.  4.33e5 22604.  2.19e5 5.28e5 1.27e5 2.74e5  3.29e5 3.54e5 3.36e6
# 4 EMP      BWA      8.84e3    493.  8.49e2   145.  1.19e3 1.71e3 3.93e2 7.21e2  2.87e3 1.30e3 1.85e4
# 5 EMP      CHL      4.42e4   6389.  3.94e4  1850.  1.86e4 4.38e4 1.63e4 1.72e4 NA      6.32e4 2.51e5
# 6 EMP      CHN      1.73e7 422972.  4.03e6 96364.  1.25e6 1.73e6 8.36e5 2.96e5  1.36e6 1.86e6 2.91e7

# Get transformed (scaled) data
head(TRA(gGGDC, gsumGGDC, "/"))
# # A tibble: 6 x 16
#   Country Regioncode Region Variable  Year      AGR      MIN      MAN       PU      CON      WRT
#   <chr>   <chr>      <chr>  <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
# 1 BWA     SSA        Sub-s~ VA        1960 NA       NA       NA       NA       NA       NA      
# 2 BWA     SSA        Sub-s~ VA        1961 NA       NA       NA       NA       NA       NA      
# 3 BWA     SSA        Sub-s~ VA        1962 NA       NA       NA       NA       NA       NA      
# 4 BWA     SSA        Sub-s~ VA        1963 NA       NA       NA       NA       NA       NA      
# 5 BWA     SSA        Sub-s~ VA        1964  7.50e-4  1.65e-5  1.66e-5  1.03e-5  1.57e-5  6.82e-5
# 6 BWA     SSA        Sub-s~ VA        1965  7.24e-4  1.18e-5  2.30e-5  1.33e-5  3.20e-5  7.72e-5
# # ... with 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>

I have already noted above that, whether using the argument to fast statistical functions or TRA directly, these data transformations are essentially a two-step process: statistics are first computed and then used to transform the original data. This process is already very efficient since all functions are written in C++, and programmatically separating the computation of statistics from the data transformation task allows for unlimited combinations and drastically simplifies the code base of this package.

Nonetheless there are of course more memory-efficient and faster ways to program such data transformations, which principally involve doing them column-by-column in a single C++ function. To ensure that this package lives up to the highest standards of performance for common uses, I have implemented such slightly more efficient algorithms for the very common tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).

2.4 Faster Centering, Averaging and Standardizing

The functions fbetween and fwithin are slightly more memory-efficient implementations of fmean invoked with different TRA options:

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
  fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
  fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH    SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1  462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.
# 2  462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.

GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
  fgroup_by(Variable,Country) %>% get_vars(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Apart from higher speed, fwithin has a mean argument to assign an arbitrary mean to the centered data, the default being mean = 0. A very common choice for such an added mean is the overall mean of the data, which can be added by invoking mean = "overall.mean":

GGDC10S %>% 
  fgroup_by(Variable,Country) %>% 
    fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean")
# # A tibble: 5,027 x 13
#    Country Variable     AGR     MIN     MAN      PU     CON     WRT     TRA    FIRE     GOV     OTH
#  * <chr>   <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 BWA     VA       NA      NA      NA          NA  NA      NA      NA      NA      NA      NA     
#  2 BWA     VA       NA      NA      NA          NA  NA      NA      NA      NA      NA      NA     
#  3 BWA     VA       NA      NA      NA          NA  NA      NA      NA      NA      NA      NA     
#  4 BWA     VA       NA      NA      NA          NA  NA      NA      NA      NA      NA      NA     
#  5 BWA     VA        2.53e6  1.86e6  5.54e6 335463.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
#  6 BWA     VA        2.53e6  1.86e6  5.54e6 335463.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
#  7 BWA     VA        2.53e6  1.86e6  5.54e6 335463.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
#  8 BWA     VA        2.53e6  1.86e6  5.54e6 335463.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
#  9 BWA     VA        2.53e6  1.86e6  5.54e6 335463.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
# 10 BWA     VA        2.53e6  1.86e6  5.54e6 335464.  1.80e6  3.39e6  1.47e6  1.66e6  1.71e6  1.68e6
# # ... with 5,017 more rows, and 1 more variable: SUM <dbl>

This can also be done using weights. The code below uses the SUM column as weights: within each group, the weighted mean of each variable is subtracted out, and the overall weighted column mean is added back to the centered columns. The SUM column itself is kept as is and added in front.

GGDC10S %>% 
  fgroup_by(Variable,Country) %>% 
    fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean")
# # A tibble: 5,027 x 13
#    Country Variable   SUM     AGR     MIN     MAN      PU     CON     WRT     TRA    FIRE     GOV
#  * <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 BWA     VA        NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  2 BWA     VA        NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  3 BWA     VA        NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  4 BWA     VA        NA   NA      NA      NA      NA      NA      NA      NA      NA      NA     
#  5 BWA     VA        37.5  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  6 BWA     VA        39.3  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  7 BWA     VA        43.1  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  8 BWA     VA        41.4  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
#  9 BWA     VA        41.1  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
# 10 BWA     VA        51.2  4.29e8  3.70e8  7.38e8  2.73e7  2.83e8  4.33e8  1.97e8  1.55e8  2.10e8
# # ... with 5,017 more rows, and 1 more variable: OTH <dbl>
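
For a single vector, the weighted centering with overall-mean add-back performed above can be sketched in base R. The vector x, weights w and grouping g below are hypothetical illustrations, not columns of GGDC10S:

```r
# Hypothetical data: values, weights and a grouping vector
x <- c(10, 20, 30, 40)
w <- c(1, 1, 2, 2)
g <- c(1, 1, 2, 2)

# Group-wise weighted means, expanded to the length of x
wmean_g <- ave(seq_along(x), g, FUN = function(i) weighted.mean(x[i], w[i]))

# Subtract the group-wise weighted mean, add back the overall weighted mean
x - wmean_g + weighted.mean(x, w)
# -> 23.33 33.33 23.33 33.33 (up to rounding)
```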

Apart from fbetween and fwithin, the function fscale efficiently scales and centers data, avoiding sequential calls such as ... %>% fsd(TRA = "/") %>% fmean(TRA = "-") shown in an earlier example.

# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
  fgroup_by(Variable,Country) %>%
    fselect(Country, Variable, AGR:SUM) %>% fscale
# # A tibble: 5,027 x 13
#    Country Variable    AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE    GOV    OTH    SUM
#  * <chr>   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  2 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  3 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  4 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  5 BWA     VA       -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
#  6 BWA     VA       -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
#  7 BWA     VA       -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
#  8 BWA     VA       -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
#  9 BWA     VA       -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 BWA     VA       -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows

fscale has additional mean and sd arguments allowing the user to (group-) scale data to an arbitrary mean and standard deviation. Setting mean = FALSE scales the data while preserving the means; this differs from fsd(..., TRA = "/"), which simply divides all values by the standard deviation:

# Saving grouped tibble
gGGDC <- GGDC10S %>%
  fgroup_by(Variable,Country) %>%
    fselect(Country, Variable, AGR:SUM)

# Original means
head(fmean(gGGDC)) 
# # A tibble: 6 x 13
#   Variable Country     AGR    MIN     MAN      PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG       1420.   52.1  1932.   102.     742.  1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
# 2 EMP      BOL        964.   56.0   235.     5.35   123.  2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
# 3 EMP      BRA      17191.  206.   6991.   365.    3525.  8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
# 4 EMP      BWA        188.   10.5    18.1    3.09    25.3 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
# 5 EMP      CHL        702.  101.    625.    29.4    296.  6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
# 6 EMP      CHN     287744. 7050.  67144.  1606.   20852.  2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5

# Mean Preserving Scaling
head(fmean(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
#   Variable Country     AGR    MIN     MAN      PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG       1420.   52.1  1932.   102.     742.  1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
# 2 EMP      BOL        964.   56.0   235.     5.35   123.  2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
# 3 EMP      BRA      17191.  206.   6991.   365.    3525.  8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
# 4 EMP      BWA        188.   10.5    18.1    3.09    25.3 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
# 5 EMP      CHL        702.  101.    625.    29.4    296.  6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
# 6 EMP      CHN     287744. 7050.  67144.  1606.   20852.  2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5
head(fsd(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP      ARG      1.    1.    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.  
# 2 EMP      BOL      1.    1.00  1.    1.00  1.00  1.    1.    1.   NA     1.    1.  
# 3 EMP      BRA      1.    1.    1.    1.00  1.    1.00  1.00  1.00  1.    1.00  1.00
# 4 EMP      BWA      1.00  1.00  1.    1.    1.    1.00  1.    1.00  1.    1.00  1.00
# 5 EMP      CHL      1.    1.    1.00  1.    1.    1.    1.00  1.   NA     1.    1.00
# 6 EMP      CHN      1.    1.    1.    1.00  1.00  1.    1.    1.    1.00  1.00  1.
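
For a single vector without missing values, the mean-preserving scaling verified above can be sketched in base R (an illustrative sketch, not collapse's actual implementation):

```r
# Scale a vector to unit standard deviation while preserving its mean
scale_keep_mean <- function(x) (x - mean(x)) / sd(x) + mean(x)

y <- scale_keep_mean(c(2, 4, 6, 8))
mean(y)  # unchanged: 5
sd(y)    # 1
```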

One can also set mean = "overall.mean", which group-centers columns on the overall mean as illustrated with fwithin. Another interesting option is setting sd = "within.sd". This group-scales data such that every group has a standard deviation equal to the within-standard deviation of the data:

# Just using VA data for this example
gGGDC <- GGDC10S %>%
  fsubset(Variable == "VA", Country, AGR:SUM) %>% 
      fgroup_by(Country)

# This calculates the within-standard deviation for all columns
fsd(num_vars(ungroup(fwithin(gGGDC))))
#       AGR       MIN       MAN        PU       CON       WRT       TRA      FIRE       GOV       OTH 
#  45046972  40122220  75608708   3062688  30811572  44125207  20676901  16030868  20358973  18780869 
#       SUM 
# 306429102

# This scales all groups to take on the within-standard deviation while preserving group means
fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd"))
# # A tibble: 43 x 12
#    Country      AGR      MIN      MAN     PU     CON     WRT     TRA    FIRE     GOV     OTH     SUM
#    <chr>      <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 ARG       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  2 BOL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  3 BRA       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  4 BWA       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  5 CHL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  6 CHN       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  7 COL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  8 CRI       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  9 DEW       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
# 10 DNK       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
# # ... with 33 more rows

A grouped scaling operation with both mean = "overall.mean" and sd = "within.sd" thus efficiently achieves a complete harmonization of all groups in the first two moments without changing the fundamental properties (in terms of level and scale) of the data.
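
A base-R sketch of this harmonization for a single vector x and grouping g (hypothetical names; assuming no missing values) could look as follows:

```r
# Standardize each group, then rescale to the within-standard deviation
# and re-center on the overall mean
harmonize <- function(x, g) {
  within.sd <- sd(x - ave(x, g))  # sd of the group-centered data
  std <- ave(x, g, FUN = function(z) (z - mean(z)) / sd(z))
  std * within.sd + mean(x)
}
```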

2.5 Lags / Leads, Differences and Growth Rates

This section introduces 3 further powerful collapse functions: flag, fdiff and fgrowth. The first, flag, efficiently computes sequences of fully identified lags and leads on time series and panel data. The following code computes 1 fully identified panel-lag and 1 fully identified panel-lead of each variable in the data:

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% flag(-1:1, Year)
# # A tibble: 5,027 x 36
#    Country Variable  Year F1.AGR   AGR L1.AGR F1.MIN   MIN L1.MIN F1.MAN    MAN L1.MAN  F1.PU     PU
#  * <chr>   <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 BWA     VA        1960   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  2 BWA     VA        1961   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  3 BWA     VA        1962   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  4 BWA     VA        1963   16.3  NA     NA     3.49 NA     NA     0.737 NA     NA      0.104 NA    
#  5 BWA     VA        1964   15.7  16.3   NA     2.50  3.49  NA     1.02   0.737 NA      0.135  0.104
#  6 BWA     VA        1965   17.7  15.7   16.3   1.97  2.50   3.49  0.804  1.02   0.737  0.203  0.135
#  7 BWA     VA        1966   19.1  17.7   15.7   2.30  1.97   2.50  0.938  0.804  1.02   0.203  0.203
#  8 BWA     VA        1967   21.1  19.1   17.7   1.84  2.30   1.97  0.750  0.938  0.804  0.203  0.203
#  9 BWA     VA        1968   21.9  21.1   19.1   5.24  1.84   2.30  2.14   0.750  0.938  0.578  0.203
# 10 BWA     VA        1969   23.1  21.9   21.1  10.2   5.24   1.84  4.15   2.14   0.750  1.12   0.578
# # ... with 5,017 more rows, and 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>,
# #   L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>, L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>,
# #   F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>, F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>,
# #   OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>, SUM <dbl>, L1.SUM <dbl>
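
Conceptually, a fully identified lag matches each observation's time period t to the value recorded at t - 1. A minimal base-R sketch for a single hypothetical series:

```r
# Lag a vector x identified by a time variable t (one group, no duplicate times)
lag1 <- function(x, t) x[match(t - 1, t)]
lag1(c(5, 7, 9), t = c(2000, 2001, 2002))
# -> NA 5 7
```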

If the time-variable passed does not exactly identify the data (e.g. because of gaps or repeated values in each group), all 3 functions will issue appropriate error messages. It is also possible to omit the time-variable if one is certain that the data is sorted:

GGDC10S %>%
  fselect(Variable,Country,AGR:SUM) %>% 
    fgroup_by(Variable,Country) %>% flag
# # A tibble: 5,027 x 13
#    Variable Country L1.AGR L1.MIN L1.MAN  L1.PU L1.CON L1.WRT L1.TRA L1.FIRE L1.GOV L1.OTH L1.SUM
#  * <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#  1 VA       BWA       NA    NA    NA     NA     NA      NA     NA      NA     NA     NA      NA  
#  2 VA       BWA       NA    NA    NA     NA     NA      NA     NA      NA     NA     NA      NA  
#  3 VA       BWA       NA    NA    NA     NA     NA      NA     NA      NA     NA     NA      NA  
#  4 VA       BWA       NA    NA    NA     NA     NA      NA     NA      NA     NA     NA      NA  
#  5 VA       BWA       NA    NA    NA     NA     NA      NA     NA      NA     NA     NA      NA  
#  6 VA       BWA       16.3   3.49  0.737  0.104  0.660   6.24   1.66    1.12   4.82   2.34   37.5
#  7 VA       BWA       15.7   2.50  1.02   0.135  1.35    7.06   1.94    1.25   5.70   2.68   39.3
#  8 VA       BWA       17.7   1.97  0.804  0.203  1.35    8.27   2.15    1.36   6.37   2.99   43.1
#  9 VA       BWA       19.1   2.30  0.938  0.203  0.897   4.31   1.72    1.54   7.04   3.31   41.4
# 10 VA       BWA       21.1   1.84  0.750  0.203  1.22    5.17   2.44    1.03   5.03   2.36   41.1
# # ... with 5,017 more rows

fdiff computes sequences of lagged / leaded and iterated differences, as well as quasi-differences and log-differences, on time series and panel data. The code below computes the 1- and 10-year first and second differences of each variable in the data:

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
#    Country Variable  Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
#  * <chr>   <chr>    <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>
#  1 BWA     VA        1960 NA     NA            NA        NA NA     NA            NA        NA NA    
#  2 BWA     VA        1961 NA     NA            NA        NA NA     NA            NA        NA NA    
#  3 BWA     VA        1962 NA     NA            NA        NA NA     NA            NA        NA NA    
#  4 BWA     VA        1963 NA     NA            NA        NA NA     NA            NA        NA NA    
#  5 BWA     VA        1964 NA     NA            NA        NA NA     NA            NA        NA NA    
#  6 BWA     VA        1965 -0.575 NA            NA        NA -0.998 NA            NA        NA  0.282
#  7 BWA     VA        1966  1.95   2.53         NA        NA -0.525  0.473        NA        NA -0.214
#  8 BWA     VA        1967  1.47  -0.488        NA        NA  0.328  0.854        NA        NA  0.134
#  9 BWA     VA        1968  1.95   0.488        NA        NA -0.460 -0.788        NA        NA -0.188
# 10 BWA     VA        1969  0.763 -1.19         NA        NA  3.41   3.87         NA        NA  1.39 
# # ... with 5,017 more rows, and 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>,
# #   D1.PU <dbl>, D2.PU <dbl>, L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>,
# #   L10D1.CON <dbl>, L10D2.CON <dbl>, D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>,
# #   D1.TRA <dbl>, D2.TRA <dbl>, L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>,
# #   L10D1.FIRE <dbl>, L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>,
# #   L10D2.GOV <dbl>, D1.OTH <dbl>, D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>,
# #   D2.SUM <dbl>, L10D1.SUM <dbl>, L10D2.SUM <dbl>
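
For a single complete series, the lagged and iterated differences computed by fdiff correspond to those of base::diff (except that base::diff shortens the vector instead of padding with missing values):

```r
x <- c(1, 4, 9, 16, 25)
diff(x, lag = 1, differences = 1)  # first differences: 3 5 7 9
diff(x, lag = 1, differences = 2)  # second differences: 2 2 2
```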

Log-differences of the form \(\log(x_t) - \log(x_{t-s})\) are also easily computed, although one caveat of the C++ implementation of log-differencing is that log(NA) - log(NA) gives a NaN value.

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fdiff(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
#    Country Variable  Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
#  * <chr>   <chr>    <dbl>     <dbl>        <dbl>     <dbl>        <dbl>     <dbl>        <dbl>
#  1 BWA     VA        1960   NA                NA    NA               NA    NA               NA
#  2 BWA     VA        1961  NaN                NA   NaN               NA   NaN               NA
#  3 BWA     VA        1962  NaN                NA   NaN               NA   NaN               NA
#  4 BWA     VA        1963  NaN                NA   NaN               NA   NaN               NA
#  5 BWA     VA        1964  NaN                NA   NaN               NA   NaN               NA
#  6 BWA     VA        1965   -0.0359           NA    -0.336           NA     0.324           NA
#  7 BWA     VA        1966    0.117            NA    -0.236           NA    -0.236           NA
#  8 BWA     VA        1967    0.0796           NA     0.154           NA     0.154           NA
#  9 BWA     VA        1968    0.0972           NA    -0.223           NA    -0.223           NA
# 10 BWA     VA        1969    0.0355           NA     1.05            NA     1.05            NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# #   Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# #   L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# #   Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>

Finally, it is also possible to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(\log(x_t) - \rho \log(x_{t-s})\):

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fdiff(t = Year, rho = 0.95)
# # A tibble: 5,027 x 14
#    Country Variable  Year QD1.AGR QD1.MIN QD1.MAN  QD1.PU QD1.CON QD1.WRT QD1.TRA QD1.FIRE QD1.GOV
#  * <chr>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
#  1 BWA     VA        1960  NA      NA      NA     NA      NA       NA      NA       NA      NA    
#  2 BWA     VA        1961  NA      NA      NA     NA      NA       NA      NA       NA      NA    
#  3 BWA     VA        1962  NA      NA      NA     NA      NA       NA      NA       NA      NA    
#  4 BWA     VA        1963  NA      NA      NA     NA      NA       NA      NA       NA      NA    
#  5 BWA     VA        1964  NA      NA      NA     NA      NA       NA      NA       NA      NA    
#  6 BWA     VA        1965   0.241  -0.824   0.318  0.0359  0.719    1.13    0.363    0.184   1.11 
#  7 BWA     VA        1966   2.74   -0.401  -0.163  0.0743  0.0673   1.56    0.312    0.174   0.955
#  8 BWA     VA        1967   2.35    0.427   0.174  0.0101 -0.381   -3.55   -0.323    0.246   0.988
#  9 BWA     VA        1968   2.91   -0.345  -0.141  0.0101  0.365    1.08    0.804   -0.427  -1.66 
# 10 BWA     VA        1969   1.82    3.50    1.43   0.385   2.32     0.841   0.397    0.252   0.818
# # ... with 5,017 more rows, and 2 more variables: QD1.OTH <dbl>, QD1.SUM <dbl>
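
For a sorted series belonging to a single group, the quasi-difference above amounts to the following base-R sketch (hypothetical data):

```r
# Quasi-difference x_t - rho * x_(t-1) of a sorted vector
qdiff <- function(x, rho = 0.95) x - rho * c(NA, x[-length(x)])
qdiff(c(10, 12, 11))
# -> NA 2.5 -0.4
```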

The quasi-differencing feature was added to fdiff to facilitate the preparation of time-series and panel data for least-squares estimations suffering from serial correlation following Cochrane & Orcutt (1949).

Finally, fgrowth computes growth rates in the same way. By default, exact growth rates are computed in percentage terms using \((x_t - x_{t-s}) / x_{t-s} \times 100\) (the default argument is scale = 100). The user can also request growth rates obtained by log-differencing, computed as \(\log(x_t / x_{t-s}) \times 100\).
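
The two measures agree closely for small changes but diverge for large ones, as this base-R illustration with hypothetical values shows:

```r
# ~1% growth: both measures nearly coincide
(101 - 100) / 100 * 100  # 1
log(101 / 100) * 100     # ~0.995

# 50% growth: the log-difference understates the exact rate
(150 - 100) / 100 * 100  # 50
log(150 / 100) * 100     # ~40.55
```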

# Exact growth rates, computed as: (x - lag(x)) / lag(x) * 100
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year)
# # A tibble: 5,027 x 25
#    Country Variable  Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON
#  * <chr>   <chr>    <dbl>  <dbl>     <dbl>  <dbl>     <dbl>  <dbl>     <dbl> <dbl>    <dbl>  <dbl>
#  1 BWA     VA        1960  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  2 BWA     VA        1961  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  3 BWA     VA        1962  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  4 BWA     VA        1963  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  5 BWA     VA        1964  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  6 BWA     VA        1965  -3.52        NA  -28.6        NA   38.2        NA  29.4       NA  104. 
#  7 BWA     VA        1966  12.4         NA  -21.1        NA  -21.1        NA  50.0       NA    0  
#  8 BWA     VA        1967   8.29        NA   16.7        NA   16.7        NA   0         NA  -33.3
#  9 BWA     VA        1968  10.2         NA  -20          NA  -20          NA   0         NA   35.7
# 10 BWA     VA        1969   3.61        NA  185.         NA  185.         NA 185.        NA  185. 
# # ... with 5,017 more rows, and 13 more variables: L10G1.CON <dbl>, G1.WRT <dbl>, L10G1.WRT <dbl>,
# #   G1.TRA <dbl>, L10G1.TRA <dbl>, G1.FIRE <dbl>, L10G1.FIRE <dbl>, G1.GOV <dbl>, L10G1.GOV <dbl>,
# #   G1.OTH <dbl>, L10G1.OTH <dbl>, G1.SUM <dbl>, L10G1.SUM <dbl>

# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
#    Country Variable  Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
#  * <chr>   <chr>    <dbl>     <dbl>        <dbl>     <dbl>        <dbl>     <dbl>        <dbl>
#  1 BWA     VA        1960     NA              NA      NA             NA      NA             NA
#  2 BWA     VA        1961    NaN              NA     NaN             NA     NaN             NA
#  3 BWA     VA        1962    NaN              NA     NaN             NA     NaN             NA
#  4 BWA     VA        1963    NaN              NA     NaN             NA     NaN             NA
#  5 BWA     VA        1964    NaN              NA     NaN             NA     NaN             NA
#  6 BWA     VA        1965     -3.59           NA     -33.6           NA      32.4           NA
#  7 BWA     VA        1966     11.7            NA     -23.6           NA     -23.6           NA
#  8 BWA     VA        1967      7.96           NA      15.4           NA      15.4           NA
#  9 BWA     VA        1968      9.72           NA     -22.3           NA     -22.3           NA
# 10 BWA     VA        1969      3.55           NA     105.            NA     105.            NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# #   Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# #   L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# #   Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>

fdiff and fgrowth can also compute leaded (forward) differences and growth rates, although I have rarely employed these in my own work (e.g. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute 1- and 10-year leaded first and second growth rates). Again it is possible to perform sequential operations:

# This computes the 1 and 10-year growth rates, for the current period and lagged by one period
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable,Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
#    Country Variable  Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
#  * <chr>   <chr>    <dbl>  <dbl>     <dbl>     <dbl>        <dbl>  <dbl>     <dbl>     <dbl>
#  1 BWA     VA        1960  NA        NA           NA           NA   NA        NA          NA
#  2 BWA     VA        1961  NA        NA           NA           NA   NA        NA          NA
#  3 BWA     VA        1962  NA        NA           NA           NA   NA        NA          NA
#  4 BWA     VA        1963  NA        NA           NA           NA   NA        NA          NA
#  5 BWA     VA        1964  NA        NA           NA           NA   NA        NA          NA
#  6 BWA     VA        1965  -3.52     NA           NA           NA  -28.6      NA          NA
#  7 BWA     VA        1966  12.4      -3.52        NA           NA  -21.1     -28.6        NA
#  8 BWA     VA        1967   8.29     12.4         NA           NA   16.7     -21.1        NA
#  9 BWA     VA        1968  10.2       8.29        NA           NA  -20        16.7        NA
# 10 BWA     VA        1969   3.61     10.2         NA           NA  185.      -20          NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# #   L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# #   L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# #   L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# #   G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# #   L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# #   L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# #   L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>

3. Benchmarks

This section seeks to demonstrate that the functionality introduced in the preceding 2 sections indeed produces code that evaluates substantially faster than native dplyr.

To do this properly, the different components of a typical piped call (selecting / subsetting, grouping, and performing some computation) are benchmarked separately on 2 different data sizes.

All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHz Intel i5 processor, 8GB DDR3 RAM and a Samsung 850 EVO SSD.

3.1 Data

Benchmarks are run on the original GGDC10S data used throughout this vignette, and on a larger dataset with approx. 1 million observations, obtained by replicating and row-binding GGDC10S 200 times while maintaining unique groups.

# This shows the groups in GGDC10S
GRP(GGDC10S, ~ Variable + Country)
# collapse grouping object of length 5027 with 85 ordered groups
# 
# Call: GRP.default(X = GGDC10S, by = ~Variable + Country), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4.00   53.00   62.00   59.14   63.00   65.00 
# 
# Groups with sizes: 
# EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN 
#      62      61      62      52      63      62 
#   ---
# VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB 
#     63     52     65     63     52     52

# This replicates the data 200 times 
data <- replicate(200, GGDC10S, simplify = FALSE) 
# This function adds a number i to the country and variable columns of each dataset
uniquify <- function(x, i) `get_vars<-`(x, c(1,4), value = lapply(unclass(x)[c(1,4)], paste0, i))
# Making datasets unique and row-binding them
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
dim(data)
# [1] 1005400      16

# This shows the groups in the replicated data
GRP(data, ~ Variable + Country)
# collapse grouping object of length 1005400 with 17000 ordered groups
# 
# Call: GRP.default(X = data, by = ~Variable + Country), unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4.00   53.00   62.00   59.14   63.00   65.00 
# 
# Groups with sizes: 
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1 
#        62        61        62        52        63        62 
#   ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99 
#         63         52         65         63         52         52

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1836700  98.1    3518041 187.9  3518041 187.9
# Vcells 19741078 150.7   28133716 214.7 22917727 174.9

3.2 Selecting, Subsetting and Grouping

## Selecting columns
# Small
microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM),
               collapse = fselect(GGDC10S, Country, Variable, AGR:SUM))
# Unit: microseconds
#      expr      min       lq       mean   median        uq      max neval
#     dplyr 3620.854 3823.227 4218.70979 4043.227 4355.3780 7289.010   100
#  collapse   13.387   18.296   34.82984   35.700   44.4015  133.428   100

# Large
microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM),
               collapse = fselect(data, Country, Variable, AGR:SUM))
# Unit: microseconds
#      expr      min       lq      mean   median       uq      max neval
#     dplyr 3639.597 3718.806 3979.3292 3934.120 4131.361 7212.256   100
#  collapse   13.388   18.966   32.8797   29.229   43.509  166.896   100

## Subsetting columns 
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"),
               collapse = fsubset(GGDC10S, Variable == "VA"))
# Unit: microseconds
#      expr     min       lq      mean   median       uq      max neval
#     dplyr 836.268 925.2945 1081.0888 1014.544 1116.512 2371.361   100
#  collapse 151.279 173.8140  227.1581  192.779  296.978  503.814   100

# Large
microbenchmark(dplyr = filter(data, Variable == "VA"),
               collapse = fsubset(data, Variable == "VA"))
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#     dplyr 14.510190 14.831934 17.950144 15.183132 15.913639 153.37622   100
#  collapse  7.835217  7.976231  9.022352  8.200694  8.643372  26.03588   100

## Grouping 
# Small
microbenchmark(dplyr = group_by(GGDC10S, Country, Variable),
               collapse = fgroup_by(GGDC10S, Country, Variable))
# Unit: microseconds
#      expr      min        lq      mean   median       uq      max neval
#     dplyr 1183.895 1212.6785 1341.5189 1249.047 1399.880 2630.184   100
#  collapse  356.106  386.6735  411.3915  395.599  419.473  658.216   100

# Large
microbenchmark(dplyr = group_by(data, Country, Variable),
               collapse = fgroup_by(data, Country, Variable), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#     dplyr 156.20365 159.78255 162.24441 161.17418 164.01075 175.43424    10
#  collapse  69.79582  70.22555  71.26995  70.56872  71.20507  74.75943    10

## Computing a new column 
# Small
microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1),
               collapse = ftransform(GGDC10S, NEW = AGR+1))
# Unit: microseconds
#      expr     min       lq      mean  median       uq      max neval
#     dplyr 539.960 561.1565 685.51262 579.006 641.4815 3928.764   100
#  collapse  22.759  29.0060  42.11706  39.939  44.1790  211.521   100

# Large
microbenchmark(dplyr = mutate(data, NEW = AGR+1),
               collapse = ftransform(data, NEW = AGR+1))
# Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval
#     dplyr 4.454891 4.661727 6.107364 4.825277 4.964506 21.87729   100
#  collapse 3.728400 3.860711 5.109817 3.965134 4.090752 21.44042   100

## All combined with pipes 
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>% 
                       select(Country, AGR:SUM) %>% 
                       mutate(NEW = AGR+1) %>%
                       group_by(Country),
               collapse = fsubset(GGDC10S, Variable == "VA", Country, AGR:SUM) %>% 
                       ftransform(NEW = AGR+1) %>%
                       fgroup_by(Country))
# Unit: microseconds
#      expr      min       lq      mean   median        uq       max neval
#     dplyr 5852.539 6062.721 6428.0375 6211.322 6601.7880 10666.213   100
#  collapse  456.512  510.285  636.5101  597.749  689.8995  1726.533   100

# Large
microbenchmark(dplyr = filter(data, Variable == "VA") %>% 
                       select(Country, AGR:SUM) %>% 
                       mutate(NEW = AGR+1) %>%
                       group_by(Country),
               collapse = fsubset(data, Variable == "VA", Country, AGR:SUM) %>% 
                       ftransform(NEW = AGR+1) %>%
                       fgroup_by(Country), times = 10)
# Unit: milliseconds
#      expr      min        lq     mean    median        uq       max neval
#     dplyr 19.59741 20.009300 21.57978 20.545913 21.392668 31.265005    10
#  collapse  8.48897  8.518869  8.69884  8.655866  8.768321  9.146292    10

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1837202  98.2    3518041 187.9  3518041 187.9
# Vcells 20831076 159.0   33840459 258.2 33840448 258.2

3.3 Aggregation

## Grouping the data
cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
rm(data, GGDC10S) 
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1854333  99.1    3518041 187.9  3518041 187.9
# Vcells 19932755 152.1   33840459 258.2 33840448 258.2

## Conversion of the grouping object: this extra time is required in all hybrid calls,
## i.e. when calling collapse functions on data grouped with dplyr::group_by
# Small
microbenchmark(GRP(gGGDC10S))
# Unit: microseconds
#           expr    min     lq     mean median     uq    max neval
#  GRP(gGGDC10S) 30.345 30.791 33.08949 31.238 31.907 99.514   100

# Large
microbenchmark(GRP(gdata))
# Unit: milliseconds
#        expr      min       lq     mean   median       uq      max neval
#  GRP(gdata) 4.400003 4.580732 5.200687 4.683815 4.789576 23.11608   100

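As a minimal illustration of such a hybrid call (a sketch using the built-in `iris` data rather than the benchmark data above): collapse's grouped_df methods convert dplyr's grouping attributes via `GRP()` on every call, so pre-computing the grouping object once avoids this overhead in repeated computations.

```r
library(dplyr)
library(collapse)

g <- group_by(iris, Species)   # dplyr grouping
fmean(g)                       # hybrid call: GRP(g) is computed internally

# pre-computing the 'GRP' object avoids the repeated conversion
grps <- GRP(iris, ~ Species)
fmean(iris[1:4], g = grps)
```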

## Sum 
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE),
               collapse = fsum(cgGGDC10S))
# Unit: microseconds
#      expr      min       lq     mean    median       uq      max neval
#     dplyr 1418.622 1463.023 1649.126 1539.1085 1660.041 3473.146   100
#  collapse  235.619  246.329  286.045  276.0045  298.540  683.652   100

# Large
microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE),
               collapse = fsum(cgdata), times = 10)
# Unit: milliseconds
#      expr      min       lq     mean   median        uq       max neval
#     dplyr 96.59567 98.17270 99.39025 99.53153 100.90642 101.46334    10
#  collapse 41.11057 41.56217 42.45810 41.81631  43.06915  45.86935    10

## Mean
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE),
               collapse = fmean(cgGGDC10S))
# Unit: microseconds
#      expr      min       lq      mean   median       uq      max neval
#     dplyr 6168.482 6439.577 7504.9223 6635.703 7002.296 30337.70   100
#  collapse  252.576  284.483  331.7185  331.339  361.461   824.22   100

# Large
microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE),
               collapse = fmean(cgdata), times = 10)
# Unit: milliseconds
#      expr       min         lq       mean     median         uq        max neval
#     dplyr 1171.4805 1174.90722 1182.17062 1178.05126 1192.27920 1202.67186    10
#  collapse   44.8296   45.06432   46.56113   46.01706   48.20546   49.60668    10

## Median
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE),
               collapse = fmedian(cgGGDC10S))
# Unit: microseconds
#      expr       min         lq       mean    median         uq       max neval
#     dplyr 47217.467 48604.8515 52688.2321 50414.609 56091.3260 70558.898   100
#  collapse   493.104   554.9095   629.6558   599.534   644.1585  1697.973   100

# Large
microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE),
               collapse = fmedian(cgdata), times = 2)
# Unit: milliseconds
#      expr         min          lq        mean      median          uq         max neval
#     dplyr 10049.46566 10049.46566 10197.61900 10197.61900 10345.77234 10345.77234     2
#  collapse    90.38345    90.38345    90.89374    90.89374    91.40402    91.40402     2

## Standard Deviation
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE),
               collapse = fsd(cgGGDC10S))
# Unit: microseconds
#      expr       min         lq       mean     median        uq      max neval
#     dplyr 18745.526 19407.3115 20736.9381 19967.3530 20687.373 33379.77   100
#  collapse   430.183   471.4605   518.9818   515.6395   549.777   867.06   100

# Large
microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE),
               collapse = fsd(cgdata), times = 2)
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#     dplyr 3757.45055 3757.45055 3865.71849 3865.71849 3973.98644 3973.98644     2
#  collapse   80.84446   80.84446   81.04973   81.04973   81.25501   81.25501     2

## Maximum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE),
               collapse = fmax(cgGGDC10S))
# Unit: microseconds
#      expr      min       lq      mean    median       uq      max neval
#     dplyr 1257.972 1298.804 1426.3818 1326.4715 1445.396 3121.949   100
#  collapse  178.946  187.201  221.6336  211.9675  230.710  567.627   100

# Large
microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE),
               collapse = fmax(cgdata), times = 10)
# Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval
#     dplyr 62.00612 63.18912 64.08291 63.76188 63.97541 67.11297    10
#  collapse 24.67571 24.89080 25.90325 25.21500 26.57450 29.38005    10

## First Value
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, first),
               collapse = ffirst(cgGGDC10S, na.rm = FALSE))
# Unit: microseconds
#      expr     min       lq      mean   median      uq      max neval
#     dplyr 670.711 699.7165 789.84525 706.8570 758.845 2776.554   100
#  collapse  57.567  65.5990  86.95606  83.8945  93.935  242.313   100

# Large
microbenchmark(dplyr = summarise_all(gdata, first),
               collapse = ffirst(cgdata, na.rm = FALSE), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#     dplyr 16.292057 16.520536 16.945676 16.989542 17.363943 17.440252    10
#  collapse  4.518258  4.546817  4.901585  4.622234  4.722193  6.404547    10

## Number of Distinct Values
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE),
               collapse = fNdistinct(cgGGDC10S))
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#     dplyr 14.458871 14.908019 15.706116 15.140961 15.807432 27.186747   100
#  collapse  1.347222  1.426654  1.499281  1.477749  1.548033  1.936715   100

# Large
microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE),
               collapse = fNdistinct(cgdata), times = 5)
# Unit: milliseconds
#      expr       min        lq     mean    median        uq       max neval
#     dplyr 2707.4958 2718.4918 2756.591 2724.9619 2740.3664 2891.6404     5
#  collapse  330.2335  339.5097  342.779  345.1878  345.1909  353.7731     5

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1856384  99.2    3518041 187.9  3518041 187.9
# Vcells 19937429 152.2   33840459 258.2 33840448 258.2

Below I add some benchmarks for weighted aggregations and aggregations using the statistical mode, which cannot easily or efficiently be performed with dplyr.

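To clarify the syntax benchmarked below: in the grouped_df methods the weight column is simply passed as the second (`w`) argument, unquoted, and with `keep.w = TRUE` (the default) the group sums of the weights are returned as well. A minimal sketch on the GGDC10S data shipped with collapse:

```r
library(dplyr)
library(collapse)

# mean of sectoral shares weighted by the sector total (SUM);
# the aggregated weights are kept as an extra column by default
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
  fselect(AGR:SUM) %>%
  fmean(SUM)
```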
## Weighted Mean
# Small
microbenchmark(fmean(cgGGDC10S, SUM)) 
# Unit: microseconds
#                   expr     min      lq     mean  median      uq     max neval
#  fmean(cgGGDC10S, SUM) 280.244 282.921 304.3278 287.161 312.151 442.232   100

# Large 
microbenchmark(fmean(cgdata, SUM), times = 10) 
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  fmean(cgdata, SUM) 50.39118 50.86911 51.83627 51.30465 52.61885 54.16688    10

## Weighted Standard-Deviation
# Small
microbenchmark(fsd(cgGGDC10S, SUM)) 
# Unit: microseconds
#                 expr     min      lq     mean  median       uq     max neval
#  fsd(cgGGDC10S, SUM) 431.075 434.422 464.6378 458.966 465.4365 610.913   100

# Large 
microbenchmark(fsd(cgdata, SUM), times = 10) 
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  fsd(cgdata, SUM) 81.84584 82.15465 84.08605 84.70272 85.12487 85.55416    10

## Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S)) 
# Unit: milliseconds
#              expr      min       lq     mean   median       uq     max neval
#  fmode(cgGGDC10S) 1.605153 1.645092 1.736698 1.677669 1.795924 2.65205   100

# Large 
microbenchmark(fmode(cgdata), times = 10) 
# Unit: milliseconds
#           expr      min       lq    mean   median       uq      max neval
#  fmode(cgdata) 404.9376 410.2636 420.907 416.9591 432.5251 440.6981    10

## Weighted Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S, SUM)) 
# Unit: milliseconds
#                   expr      min       lq     mean median       uq      max neval
#  fmode(cgGGDC10S, SUM) 1.851482 1.917749 2.064699 2.0514 2.113875 3.416473   100

# Large 
microbenchmark(fmode(cgdata, SUM), times = 10) 
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  fmode(cgdata, SUM) 509.6265 525.5481 534.3889 531.7452 547.8279 558.1113    10

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1855705  99.2    3518041 187.9  3518041 187.9
# Vcells 19933812 152.1   33840459 258.2 33840456 258.2

3.2 Transformation


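Before the benchmarks, a small sketch of what the `TRA` argument does at the vector level: the grouped statistic is swept back out over the original rows, either replacing the data or entering an arithmetic operation with it.

```r
library(collapse)

x <- c(1, 2, 3, 4)
g <- c(1, 1, 2, 2)

fsum(x, g, TRA = "replace_fill")  # replace by group sum: 3 3 7 7
fsum(x, g, TRA = "/")             # divide by group sum (grouped proportions)
```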
## Replacing with group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE),
               collapse = fsum(cgGGDC10S, TRA = "replace_fill"))
# Unit: microseconds
#      expr      min       lq      mean   median       uq      max neval
#     dplyr 2797.973 2890.346 3082.2062 2989.413 3112.354 8315.826   100
#  collapse  295.862  329.554  358.7566  348.966  377.526  540.406   100

# Large
microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE),
               collapse = fsum(cgdata, TRA = "replace_fill"), times = 10)
# Unit: milliseconds
#      expr       min       lq     mean   median       uq      max neval
#     dplyr 270.42208 282.9902 316.2438 289.6234 295.8109 453.9820    10
#  collapse  88.50386 100.9399 116.0507 101.4477 111.4463 232.5174    10

## Dividing by group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)),
               collapse = fsum(cgGGDC10S, TRA = "/"))
# Unit: microseconds
#      expr      min       lq      mean    median        uq       max neval
#     dplyr 5945.804 6133.675 6723.2883 6348.3195 6697.0625 20229.747   100
#  collapse  550.670  615.599  663.6286  641.7045  692.1305  1038.419   100

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)),
               collapse = fsum(cgdata, TRA = "/"), times = 10)
# Unit: milliseconds
#      expr      min       lq     mean    median        uq       max neval
#     dplyr 988.6350 999.9839 1214.339 1258.7311 1287.4335 1470.5231    10
#  collapse 137.4849 152.3914  180.398  159.8702  193.3738  329.8194    10

## Centering
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)),
               collapse = fwithin(cgGGDC10S))
# Unit: microseconds
#      expr      min        lq       mean     median        uq       max neval
#     dplyr 9895.989 10369.457 13140.2452 10796.5170 13811.812 45702.011   100
#  collapse  359.230   388.236   486.0711   429.7365   489.088   825.558   100

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)),
               collapse = fwithin(cgdata), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean   median        uq       max neval
#     dplyr 1756.2252 1954.7906 2192.6484 2264.893 2377.1837 2527.9280    10
#  collapse  101.7043  116.6371  151.8933  129.668  145.6401  279.0034    10

## Centering and Scaling (Standardizing)
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
               collapse = fscale(cgGGDC10S))
# Unit: microseconds
#      expr       min        lq       mean     median       uq       max neval
#     dplyr 27544.192 28146.626 30431.4057 28978.2090 29886.32 44103.551   100
#  collapse   499.798   536.167   574.6245   569.6355   596.41   730.508   100

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
               collapse = fscale(cgdata), times = 2)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
#     dplyr 5863.8168 5863.8168 5931.7402 5931.7402 5999.6635 5999.6635     2
#  collapse  133.4147  133.4147  136.6256  136.6256  139.8366  139.8366     2

## Lag
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, dplyr::lag),
               collapse_unordered = flag(cgGGDC10S),
               dplyr_ordered = mutate_all(gGGDC10S, dplyr::lag, order_by = "Year"),
               collapse_ordered = flag(cgGGDC10S, t = Year))
# Unit: microseconds
#                expr       min         lq       mean    median         uq       max neval
#     dplyr_unordered  2100.935  2259.3525  2411.7816  2387.202  2504.5655  3725.276   100
#  collapse_unordered   343.165   429.9600   477.2800   475.031   516.3090   989.331   100
#       dplyr_ordered 53583.637 55465.2400 57983.7729 56759.805 60018.0825 70411.191   100
#    collapse_ordered   323.976   374.1785   413.3908   398.499   430.8525  1266.005   100

# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, dplyr::lag),
               collapse_unordered = flag(cgdata),
               dplyr_ordered = mutate_all(gdata, dplyr::lag, order_by = "Year"),
               collapse_ordered = flag(cgdata, t = Year), times = 2)
# Unit: milliseconds
#                expr         min          lq        mean      median         uq        max neval
#     dplyr_unordered   201.63434   201.63434   212.42327   212.42327   223.2122   223.2122     2
#  collapse_unordered    55.62299    55.62299   139.83149   139.83149   224.0400   224.0400     2
#       dplyr_ordered 11135.51293 11135.51293 11164.82024 11164.82024 11194.1276 11194.1276     2
#    collapse_ordered    93.11449    93.11449    94.26759    94.26759    95.4207    95.4207     2

## First-Difference (unordered)
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)),
               collapse_unordered = fdiff(cgGGDC10S))
# Unit: microseconds
#                expr       min        lq       mean    median        uq      max neval
#     dplyr_unordered 34853.283 35875.413 37907.1343 36404.216 39714.036 46266.96   100
#  collapse_unordered   377.526   446.248   506.4065   511.623   560.934   894.28   100

# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)),
               collapse_unordered = fdiff(cgdata), times = 2)
# Unit: milliseconds
#                expr        min         lq       mean     median         uq        max neval
#     dplyr_unordered 7207.30481 7207.30481 7242.75048 7242.75048 7278.19616 7278.19616     2
#  collapse_unordered   60.02433   60.02433   60.24835   60.24835   60.47236   60.47236     2

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1859004  99.3    3518041 187.9  3518041 187.9
# Vcells 20984685 160.2   58767528 448.4 48906274 373.2

Below I again add some benchmarks for transformations not easily or efficiently performed with dplyr, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, (iterated) panel-differences and growth rates.

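A small vector-level sketch of the less common operations timed below, to clarify their semantics:

```r
library(collapse)

x <- c(1, 2, 4, 7, 11)

flag(x, -1:1)       # matrix of lead 1, x itself, and lag 1
fdiff(x, 1, 2)      # difference iterated twice: NA NA 1 1 1
fgrowth(x)          # growth rates in percent
```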
# Centering on overall mean
system.time(fwithin(cgdata, mean = "overall.mean"))
#    user  system elapsed 
#    0.06    0.03    0.09

# Weighted Centering
system.time(fwithin(cgdata, SUM))
#    user  system elapsed 
#    0.06    0.03    0.09
system.time(fwithin(cgdata, SUM, mean = "overall.mean"))
#    user  system elapsed 
#    0.08    0.00    0.08

# Weighted Scaling and Standardizing
system.time(fsd(cgdata, SUM, TRA = "/"))
#    user  system elapsed 
#    0.11    0.04    0.15
system.time(fscale(cgdata, SUM))
#    user  system elapsed 
#    0.11    0.02    0.13

# Sequence of lags and leads
system.time(flag(cgdata, -1:1))
#    user  system elapsed 
#    0.04    0.07    0.10

# Iterated difference
system.time(fdiff(cgdata, 1, 2))
#    user  system elapsed 
#    0.09    0.00    0.09

# Growth Rate
system.time(fgrowth(cgdata, 1))
#    user  system elapsed 
#    0.07    0.02    0.09

References

Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.

Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.

Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.


  1. Row-wise operations are not supported by TRA.↩︎