The purpose of these benchmarks is to be as fair as possible, to help understand the relative performance tradeoffs of the different approaches. If you think my implementation of the base or data.table equivalents is suboptimal, please let me know about better ways.
Also note that I consider any significant performance difference between dplyr_dt and dt_raw to be a bug in dplyr: for individual operations there should be very little overhead to calling data.table via dplyr. However, data.table may be significantly faster when performing the same sequence of operations as dplyr. This is because dplyr currently uses an eager evaluation approach, so the individual calls to [.data.table don't get as much information about the desired result as a single call to [.data.table would if you did it by hand.
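To make this concrete, here is a minimal sketch with toy data (the table, grouping, and column names are made up for illustration; this is not dplyr's actual translation):

library(data.table)

dt <- data.table(g = rep(1:3, each = 2), x = 1:6)

# Eager evaluation: each verb becomes its own [.data.table call
tmp <- dt[x > 1]                     # filter()
tmp[, list(x = sum(x)), by = g]      # summarise()

# By hand: one call, so data.table sees the whole query at once
dt[x > 1, list(x = sum(x)), by = g]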
The following benchmarks explore performance on a somewhat realistic example: the Batting dataset from the Lahman package. It contains 96,600 records on the batting careers of players from 1871 to 2012.
The first code block defines two alternative backends for the Batting dataset, along with grouped players datasets that represent operations to be performed by player:
batting_df <- tbl_df(Batting)                 # data frame backend
players_df <- group_by(batting_df, playerID)

batting_dt <- tbl_dt(Batting)                 # data.table backend
players_dt <- group_by(batting_dt, playerID)
Compute the average number of at bats for each player:
microbenchmark(
dplyr_df = summarise(players_df, ab = mean(AB)),
dplyr_dt = summarise(players_dt, ab = mean(AB)),
dt_raw = players_dt[, list(ab = mean(AB)), by = playerID],
base = tapply(batting_df$AB, batting_df$playerID, FUN = mean),
times = 5,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min     lq median     uq    max neval
#>  dplyr_df   2.7   2.73   3.01   3.26   4.88     5
#>  dplyr_dt  20.4  21.53  22.65  23.58  24.61     5
#>    dt_raw  17.6  18.10  19.18  22.95  23.17     5
#>      base 194.0 198.61 199.79 232.89 247.52     5
NB: the base implementation captures the computation but not the output format: it produces a named vector rather than a data frame, so it's doing considerably less work.
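For example, the tapply() call returns just a named vector:

ab <- tapply(batting_df$AB, batting_df$playerID, FUN = mean)
head(ab)  # a named numeric vector; no data frame is built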
However, this comparison is slightly unfair, because both data.table and summarise() use tricks to find a more efficient implementation of mean(). data.table calls a C implementation of the mean (using .External(Cfastmean, B, FALSE), thus avoiding the overhead of S3 method dispatch), while dplyr::summarise() uses a hybrid evaluation technique, where common functions are implemented purely in C++, avoiding R function call overhead.
To compare apples with apples, we can define a version of mean() that also skips S3 dispatch:

mean_ <- function(x) .Internal(mean(x))  # call the internal mean directly
microbenchmark(
dplyr_df = summarise(players_df, ab = mean_(AB)),
dplyr_dt = summarise(players_dt, ab = mean_(AB)),
dt_raw = players_dt[, list(ab = mean_(AB)), by = playerID],
base = tapply(batting_df$AB, batting_df$playerID, FUN = mean_),
times = 5,
unit = "ms"
)
#> Unit: milliseconds
#>      expr  min   lq median   uq  max neval
#>  dplyr_df 13.8 13.9   14.4 15.9 16.7     5
#>  dplyr_dt 18.6 19.7   22.0 22.9 24.9     5
#>    dt_raw 18.2 18.5   20.6 21.8 23.2     5
#>      base 85.4 85.8   87.1 90.2 90.9     5
Arrange by year within each player:
microbenchmark(
dplyr_df = arrange(players_df, yearID),
dplyr_dt = arrange(players_dt, yearID),
dt_raw = batting_dt[order(playerID, yearID), ],
base = batting_df[order(batting_df$playerID, batting_df$yearID), ],
times = 2,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min    lq median    uq   max neval
#>  dplyr_df  30.6  30.6   47.6  64.6  64.6     2
#>  dplyr_dt 127.2 127.2  129.3 131.3 131.3     2
#>    dt_raw  77.0  77.0   97.6 118.3 118.3     2
#>      base  86.7  86.7  106.0 125.4 125.4     2
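For reference, data.table can also sort by reference with setkey(). This is a sketch, not included in the benchmark above; the copy() avoids reordering batting_dt itself:

bt <- copy(batting_dt)
setkey(bt, playerID, yearID)  # physically reorders bt by player, then year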
Find the year for which each player played the most games:
microbenchmark(
dplyr_df = filter(players_df, G == max(G)),
dplyr_dt = filter(players_dt, G == max(G)),
base = batting_df[ave(batting_df$G, batting_df$playerID, FUN = max) ==
batting_df$G, ],
times = 2,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min    lq median    uq   max neval
#>  dplyr_df  32.7  32.7   34.8  36.9  36.9     2
#>  dplyr_dt  44.2  44.2   44.4  44.5  44.5     2
#>      base 116.0 116.0  117.2 118.4 118.4     2
I'm not aware of a single-line data.table equivalent (see SO 16573995); suggestions welcome. dplyr currently doesn't support hybrid evaluation for logical comparisons, but it is scheduled for 0.2 (see #113); this should give an additional 10-20x speedup.
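For reference, one commonly suggested data.table idiom uses .SD, though it is unlikely to be faster (a sketch, not benchmarked here):

players_dt[, .SD[G == max(G)], by = playerID]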
Rank years based on number of at bats:
microbenchmark(
dplyr_df = mutate(players_df, rank = rank(desc(AB))),
dplyr_dt = mutate(players_dt, rank = rank(desc(AB))),
dt_raw = players_dt[, list(rank = rank(desc(AB))), by = playerID],
times = 2,
unit = "ms"
)
#> Unit: milliseconds
#>      expr min  lq median  uq max neval
#>  dplyr_df 629 629    636 642 642     2
#>  dplyr_dt 635 635    652 669 669     2
#>    dt_raw 624 624    645 665 665     2
Compute year of career:
microbenchmark(
dplyr_df = mutate(players_df, cyear = yearID - min(yearID) + 1),
dplyr_dt = mutate(players_dt, cyear = yearID - min(yearID) + 1),
dt_raw = players_dt[, list(cyear = yearID - min(yearID) + 1), by = playerID],
times = 5,
unit = "ms"
)
#> Unit: milliseconds
#>      expr  min   lq median   uq  max neval
#>  dplyr_df 36.9 37.9   38.2 44.3 78.0     5
#>  dplyr_dt 42.8 42.9   46.0 47.8 82.9     5
#>    dt_raw 28.5 28.9   29.0 29.2 29.5     5
rank() is a relatively expensive operation while min() is relatively cheap, which highlights the relative performance overhead of the different techniques.
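To see the cost difference between the two functions in isolation, a quick illustrative check (vector size chosen arbitrarily):

x <- runif(1e4)
microbenchmark(rank(x), min(x), times = 10, unit = "ms")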
dplyr currently has some support for hybrid evaluation of window functions:
microbenchmark(
dplyr_df = mutate(players_df, rank = min_rank(AB)),
dplyr_dt = mutate(players_dt, rank = min_rank(AB)),
dt_raw = players_dt[, list(rank = min_rank(AB)), by = playerID],
times = 2,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min    lq median    uq   max neval
#>  dplyr_df  38.7  38.7   39.1  39.5  39.5     2
#>  dplyr_dt 670.1 670.1  698.9 727.8 727.8     2
#>    dt_raw 671.3 671.3  676.7 682.1 682.1     2
We conclude with some quick comparisons of joins. First we create two new datasets: master, which contains demographic information on each player, and hall_of_fame, which contains all players inducted into the hall of fame.
master_df <- tbl_df(Master) %.% select(playerID, hofID, birthYear)
hall_of_fame_df <- tbl_df(HallOfFame) %.% filter(inducted == "Y") %.%
select(hofID, votedBy, category)
master_dt <- tbl_dt(Master) %.% select(playerID, hofID, birthYear)
hall_of_fame_dt <- tbl_dt(HallOfFame) %.% filter(inducted == "Y") %.%
select(hofID, votedBy, category)
microbenchmark(
dplyr_df = left_join(master_df, hall_of_fame_df, by = "hofID"),
dplyr_dt = left_join(master_dt, hall_of_fame_dt, by = "hofID"),
base = merge(master_df, hall_of_fame_df, by = "hofID", all.x = TRUE),
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min    lq median    uq   max neval
#>  dplyr_df  1.12  1.20   1.27  1.50  1.59    10
#>  dplyr_dt  3.34  3.49   3.77  4.25 13.73    10
#>      base 33.27 34.50  36.17 40.27 41.21    10
microbenchmark(
dplyr_df = inner_join(master_df, hall_of_fame_df, by = "hofID"),
dplyr_dt = inner_join(master_dt, hall_of_fame_dt, by = "hofID"),
base = merge(master_df, hall_of_fame_df, by = "hofID"),
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#>      expr   min    lq median   uq  max neval
#>  dplyr_df 0.909 0.984   1.02 1.08 1.19    10
#>  dplyr_dt 2.327 2.428   2.66 3.13 3.56    10
#>      base 2.447 2.920   2.99 3.42 4.50    10
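Neither join benchmark includes a raw data.table entry; for reference, a keyed join might be sketched as follows (this includes the cost of converting and setting keys, so it isn't directly comparable):

master_key <- as.data.table(master_df)
hof_key <- as.data.table(hall_of_fame_df)
setkey(master_key, hofID)
setkey(hof_key, hofID)

hof_key[master_key]               # left join: keep every row of master
hof_key[master_key, nomatch = 0]  # inner join: keep only matching rows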
microbenchmark(
dplyr_df = semi_join(master_df, hall_of_fame_df, by = "hofID"),
dplyr_dt = semi_join(master_dt, hall_of_fame_dt, by = "hofID"),
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#>      expr  min    lq median    uq  max neval
#>  dplyr_df 0.90 0.901  0.934 0.951 1.06    10
#>  dplyr_dt 1.35 1.385  1.400 1.406 1.88    10
microbenchmark(
dplyr_df = anti_join(master_df, hall_of_fame_df, by = "hofID"),
dplyr_dt = anti_join(master_dt, hall_of_fame_dt, by = "hofID"),
times = 10,
unit = "ms"
)
#> Unit: milliseconds
#>      expr  min   lq median   uq  max neval
#>  dplyr_df 1.24 1.28   1.35 1.44 1.49    10
#>  dplyr_dt 2.41 2.72   2.81 2.88 3.11    10
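There are no base entries for the semi and anti joins above; a rough base equivalent using %in% (a sketch, not benchmarked) would be:

# Semi join: rows of master with at least one match in hall_of_fame
master_df[master_df$hofID %in% hall_of_fame_df$hofID, ]

# Anti join: rows of master with no match in hall_of_fame
master_df[!master_df$hofID %in% hall_of_fame_df$hofID, ]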