This vignette is adapted from the homepage of the SimEngine website.
library(SimEngine)
#> Loading required package: magrittr
#> Welcome to SimEngine! Full package documentation can be found at:
#> https://avi-kenny.github.io/SimEngine
SimEngine is an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments.
The goal of many statistical simulations is to test how a new statistical method performs against existing methods. Most statistical simulations include three basic phases: (1) generate some data, (2) run one or more methods using the generated data, and (3) compare the performance of the methods.
To briefly illustrate how these phases are implemented using SimEngine, we will use the example of estimating the average treatment effect of a drug in the context of a randomized controlled trial (RCT).
The simulation object (an R object of class sim_obj) will contain all data, functions, and results related to your simulation.
library(SimEngine)
sim <- new_sim()
Most simulations will involve one or more functions that create a dataset designed to mimic some real-world data structure. Here, we write a function that simulates data from an RCT in which we compare a continuous outcome (e.g. blood pressure) between a treatment group and a control group. We generate the data by looping through a set of patients, assigning them randomly to one of the two groups, and generating their outcome according to a simple model.
# Code up the dataset-generating function
create_rct_data <- function (num_patients) {
  # Start with an empty data frame with one column per variable
  df <- data.frame(
    "patient_id" = integer(),
    "group" = character(),
    "outcome" = double(),
    stringsAsFactors = FALSE
  )
  for (i in seq_len(num_patients)) {
    # Randomize each patient to treatment or control with probability 0.5
    group <- ifelse(sample(c(0,1), size=1)==1, "treatment", "control")
    treatment_effect <- ifelse(group=="treatment", -7, 0)
    outcome <- rnorm(n=1, mean=130, sd=2) + treatment_effect
    df[i,] <- list(i, group, outcome)
  }
  return (df)
}
# Test the function
create_rct_data(5)
#> patient_id group outcome
#> 1 1 control 129.8751
#> 2 2 treatment 119.3603
#> 3 3 treatment 121.1414
#> 4 4 control 128.4800
#> 5 5 control 128.5542
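The loop above mirrors the description of the data-generating process, but the same model can also be drawn in one vectorized step. The following is a minimal sketch using only base R; the name create_rct_data_vec is hypothetical and not part of SimEngine, and because random numbers are consumed in a different order, it will not reproduce the loop version's output for a given seed.

```r
# Vectorized sketch of the same data-generating model.
# create_rct_data_vec is a hypothetical name used only for illustration.
create_rct_data_vec <- function(num_patients) {
  # Assign every patient to a group with probability 0.5
  group <- ifelse(
    sample(c(0, 1), size = num_patients, replace = TRUE) == 1,
    "treatment", "control"
  )
  # Treatment shifts the mean outcome by -7
  treatment_effect <- ifelse(group == "treatment", -7, 0)
  outcome <- rnorm(n = num_patients, mean = 130, sd = 2) + treatment_effect
  data.frame(
    patient_id = seq_len(num_patients),
    group = group,
    outcome = outcome,
    stringsAsFactors = FALSE
  )
}

set.seed(1)
df_vec <- create_rct_data_vec(5)
str(df_vec)
```

Avoiding row-by-row growth of the data frame mainly matters for large values of num_patients, where the loop version becomes noticeably slower.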
With SimEngine, any functions that you declare (or load via source) are
automatically added to your simulation object when the simulation runs.
In this example, we test two different estimators of the average
treatment effect. For simplicity, we code this as a single function and
use the type argument to specify which estimator we want to use, but you
could also write two separate functions. The first estimator uses the
known probability of being assigned to the treatment group (0.5),
whereas the second estimator uses an estimate of this probability based
on the observed data. Don’t worry too much about the mathematical
details; the important thing is that both methods attempt to take in the
dataset generated by the create_rct_data function and return an estimate
of the treatment effect, which in this case is -7.
# Code up the estimators
est_tx_effect <- function(df, type) {
  n <- nrow(df)
  # Outcome totals within each group
  sum_t <- sum(df$outcome * (df$group=="treatment"))
  sum_c <- sum(df$outcome * (df$group=="control"))
  if (type=="est1") {
    # Weight by the known assignment probability
    true_prob <- 0.5
    return ( sum_t/(n*true_prob) - sum_c/(n*(1-true_prob)) )
  } else if (type=="est2") {
    # Weight by the observed proportion assigned to treatment
    est_prob <- sum(df$group=="treatment") / n
    return ( sum_t/(n*est_prob) - sum_c/(n*(1-est_prob)) )
  }
}
# Test out the estimators
df <- create_rct_data(1000)
est_tx_effect(df, "est1")
#> [1] -2.433254
est_tx_effect(df, "est2")
#> [1] -6.990429
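One way to read the second estimator: because n times the estimated probability is simply the number of treated patients, it reduces algebraically to the difference in group means. The sketch below checks this numerically in base R on a small hypothetical dataset drawn from the same model; the variable names are illustrative only.

```r
# Hypothetical dataset drawn from the same model as create_rct_data
set.seed(42)
n <- 200
group <- sample(c("treatment", "control"), size = n, replace = TRUE)
outcome <- rnorm(n, mean = 130, sd = 2) + ifelse(group == "treatment", -7, 0)

# Outcome totals within each group
sum_t <- sum(outcome[group == "treatment"])
sum_c <- sum(outcome[group == "control"])

# Second estimator: weight by the observed treatment proportion
est_prob <- mean(group == "treatment")
est2 <- sum_t / (n * est_prob) - sum_c / (n * (1 - est_prob))

# Plain difference in group means
diff_means <- mean(outcome[group == "treatment"]) -
  mean(outcome[group == "control"])

all.equal(est2, diff_means)
```

The first estimator divides both totals by n/2 regardless of how many patients actually landed in each group, so it only matches the difference in means when the groups happen to be exactly balanced.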
Often, we want to run the same simulation multiple times (with each
run referred to as a “simulation replicate”), but with certain things
changed. In this example, perhaps we want to vary the number of patients
and the method used to estimate the average treatment effect. We refer
to the things that vary as “simulation levels”. By default,
SimEngine will run our simulation 10 times for each
level combination. Below, since there are two methods and three values
of num_patients, we have six level combinations and so
SimEngine will run a total of 60 simulation replicates.
Note that we make extensive use of the pipe operators (%>% and %<>%)
from the magrittr package; if you have never used pipes, check out the
magrittr documentation.
sim %<>% set_levels(
  estimator = c("est1", "est2"),
  num_patients = c(50, 200, 1000)
)
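To make the counting of level combinations concrete, the grid of combinations can be enumerated with base R's expand.grid. This is only an illustration of the arithmetic, not a description of how SimEngine stores levels internally; the name levels_grid is hypothetical.

```r
# Enumerate every combination of the two simulation levels
levels_grid <- expand.grid(
  estimator = c("est1", "est2"),
  num_patients = c(50, 200, 1000),
  stringsAsFactors = FALSE
)
levels_grid

# Number of level combinations
n_combinations <- nrow(levels_grid)

# Total replicates at 10 replicates per combination
n_replicates <- n_combinations * 10
c(combinations = n_combinations, replicates = n_replicates)
```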
The simulation script is a function that runs a single simulation
replicate and returns the results. Within a script, you can reference
the current simulation level values using the variable L. For example,
when the first simulation replicate is running, L$estimator will equal
“est1” and L$num_patients will equal 50. In the last simulation
replicate, L$estimator will equal “est2” and L$num_patients will equal
1,000. Your script will automatically have access to any functions that
you created earlier.
sim %<>% set_script(function() {
  df <- create_rct_data(L$num_patients)
  est <- est_tx_effect(df, L$estimator)
  return (list(
    "est" = est,
    "mean_t" = mean(df$outcome[df$group=="treatment"]),
    "mean_c" = mean(df$outcome[df$group=="control"])
  ))
})
Your script should always return a list containing key-value pairs,
where the keys are character strings and the values are simple data
types (numbers, character strings, or boolean values). If you need to
return more complex data types (e.g. lists or dataframes), see the
Advanced usage documentation page. Note that in this example, you could
have alternatively coded your estimators as separate functions and
called them from within the script using the use_method function.
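When debugging, it can help to run the body of a script once by hand outside of SimEngine. The sketch below does this in base R under stated assumptions: demo_data and demo_estimate are compact hypothetical stand-ins for create_rct_data and est_tx_effect, and the level values are passed in as an ordinary list argument named L rather than being supplied by SimEngine.

```r
# Hypothetical stand-in for create_rct_data (same model, vectorized)
demo_data <- function(num_patients) {
  group <- ifelse(rbinom(num_patients, 1, 0.5) == 1, "treatment", "control")
  outcome <- rnorm(num_patients, mean = 130, sd = 2) +
    ifelse(group == "treatment", -7, 0)
  data.frame(patient_id = seq_len(num_patients), group = group,
             outcome = outcome, stringsAsFactors = FALSE)
}

# Hypothetical stand-in for est_tx_effect
demo_estimate <- function(df, type) {
  n <- nrow(df)
  p <- if (type == "est1") 0.5 else mean(df$group == "treatment")
  sum(df$outcome[df$group == "treatment"]) / (n * p) -
    sum(df$outcome[df$group == "control"]) / (n * (1 - p))
}

# One replicate, with the level values passed explicitly as a list
run_one_replicate <- function(L) {
  df <- demo_data(L$num_patients)
  est <- demo_estimate(df, L$estimator)
  list(
    "est" = est,
    "mean_t" = mean(df$outcome[df$group == "treatment"]),
    "mean_c" = mean(df$outcome[df$group == "control"])
  )
}

set.seed(123)
res <- run_one_replicate(L = list(estimator = "est2", num_patients = 50))
str(res)
```

The returned value is a flat list of named numbers, which is the shape the script is expected to return.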
The set_config function controls options related to your entire
simulation, such as the number of simulation replicates to run for each
level combination and how to parallelize your code. This is also where
you should specify any packages your simulation needs (instead of using
library or require). See the set_config docs for more info. We set
num_sim to 100, and so SimEngine will run a total of 600 simulation
replicates (100 for each of the six level combinations).
sim %<>% set_config(
  num_sim = 100,
  parallel = "outer",
  n_cores = 2,
  packages = c("ggplot2", "stringr")
)
#>
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:SimEngine':
#>
#> vars
All 600 replicates are run at once and results are stored in the simulation object.
sim %<>% run()
#> Done. No errors or warnings detected.
Once the simulations have finished, use the summarize
function to calculate common summary statistics, such as bias, variance,
MSE, and coverage.
sim %>% summarize(
  list(stat="bias", truth=-7, estimate="est"),
  list(stat="mse", truth=-7, estimate="est")
)
#>   level_id estimator num_patients n_reps    bias_est      MSE_est
#> 1        1      est1           50    100 -0.13225169 1.404444e+03
#> 2        2      est2           50    100 -0.04259171 3.477538e-01
#> 3        3      est1          200    100  0.78573210 2.899590e+02
#> 4        4      est2          200    100 -0.01786233 7.998046e-02
#> 5        5      est1         1000    100  0.67452824 8.032309e+01
#> 6        6      est2         1000    100 -0.01418204 1.840193e-02
In this example, we see that the MSE of estimator 1 is much higher than that of estimator 2 and that MSE decreases with increasing sample size for both estimators, as expected. You can also directly access the results for individual simulation replicates.
head(sim$results)
#>   sim_uid level_id rep_id estimator num_patients     runtime       est   mean_t
#> 1       1        1      1      est1           50 0.017238855 -7.024414 123.1545
#> 2       7        1      2      est1           50 0.008081913 33.498487 123.1136
#> 3       8        1      3      est1           50 0.017217875 53.223286 122.7834
#> 4       9        1      4      est1           50 0.008183002 13.243025 122.6842
#> 5      10        1      5      est1           50 0.008656025 -7.360314 122.9676
#> 6      11        1      6      est1           50 0.005247116 34.665100 123.4964
#>     mean_c
#> 1 130.1789
#> 2 130.1348
#> 3 130.3001
#> 4 129.6260
#> 5 130.3279
#> 6 129.2747
Above, the sim_uid uniquely identifies a single simulation replicate and
the level_id uniquely identifies a level combination. The rep_id is
unique within a given level combination and identifies the replicate.
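Bias and MSE can also be computed by hand from replicate-level estimates, which is a useful sanity check. The sketch below assumes the usual definitions (bias as the mean error and MSE as the mean squared error) and uses only the six estimates for level_id 1 shown in the output above, so its values will not match the summary table, which averages over all 100 replicates.

```r
# First six replicate estimates for level_id 1, copied from head(sim$results)
est <- c(-7.024414, 33.498487, 53.223286, 13.243025, -7.360314, 34.665100)
truth <- -7

# Usual definitions, assumed here rather than taken from SimEngine's source
bias <- mean(est - truth)
mse <- mean((est - truth)^2)

# Variance of the estimates around their own mean (dividing by the count)
variance <- mean((est - mean(est))^2)

c(bias = bias, mse = mse)

# MSE decomposes into squared bias plus variance
all.equal(mse, bias^2 + variance)
```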