The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
cancer
Dataset with vecmatch
In this chapter, we will describe an exemplary data analysis of the
cancer
dataset from the vecmatch
package,
matching multiple cohorts based on selected covariates. The
vecmatch
package follows a strict workflow based on the
vector matching algorithm defined by Lopez and
Gutman (2017), and it is
advisable to follow it. The whole process consists of five steps to
ensure the best possible matching quality using the vector matching
algorithm.
The vecmatch
package provides two graphical functions
for assessing initial dataset imbalance: raincloud()
and
mosaic()
. The raincloud()
function is designed
for continuous variables, while mosaic()
is for categorical
variables. Both functions allow grouping by up to two categorical
variables using the group
and facet
aesthetics
and compute various statistical summaries, including significance tests
and effect size coefficients.
In this example analysis, we focus on two predictor variables from
the cancer
dataset: the discrete sex
and the
continuous age
. To evaluate differences in age across
status
groups, we can use the raincloud()
function as follows:
library(vecmatch)
library(ggplot2)
raincloud(cancer,
age,
status,
significance = "t_test",
sig_label_color = TRUE,
sig_label_size = 3,
limits = c(10, 120)) +
scale_y_continuous(breaks = seq(10, 100, 10))
This plot presents the conditional distribution of age
across different status
groups. The standardized mean
differences (SMD) are represented by the brackets on the right side of
the jittered point plot. The results of the Student’s t-tests are
displayed alongside the boxplots on the right, providing insights into
the statistical significance of the differences in age
between the groups.
Similarly, we can evaluate the differences in the sex
variable using the mosaic()
function. The code snippet
below demonstrates this:
The plot shows the conditional distribution of sex
across different status
groups. The counts for all groups
are displayed inside the corresponding boxes, representing each subgroup
within the mosaic plot. Partial significance is visually coded through
color filling aesthetics. The labels provide the results of the
Chi-squared test of independence, summarizing the statistical assessment
of the relationship between the sex
and status
variables.
Based on the plots, we can observe some imbalance in the
age
and sex
variables. However, in both cases,
the differences between the status
groups appear to be
rather moderate. All Student’s t-tests, adjusted for multiple
comparisons, resulted in statistically insignificant differences in mean
age values between the groups. Additionally, 2 out of the 6 calculated
standardized mean differences (SMDs) fell below the 0.10 threshold,
indicating relatively small differences.
After a closer examination of the mean age values, we can notice that they form two separate clusters. The first cluster includes controls and patients with adenomas, who tend to be younger, while the second cluster consists of patients with benign and malignant colorectal cancer, who are generally older. The primary objective of the matching process will then be to align these two clusters, ensuring they have a common mean age value.
There were no significant differences in the sex
distributions across the status
groups, as the overall
Chi-square test of independence was insignificant. All partial
significance tests were statistically insignificant, indicating small
values of standardized Pearson’s residuals.
Given the relatively small differences between the
status
groups, the vector matching algorithm is expected to
produce satisfactory results.
The next step in the vector matching algorithm is the calculation of
generalized propensity scores (GPS) for the treatment variable. These
scores represent the treatment assignment probabilities and are based on
user-defined covariates. The relationship between the treatment variable
and predictors can be easily defined using R
-specific
formula notation, allowing for both additive and interaction
effects.
The vecmatch
package provides a straightforward method
for calculating the GPS-matrix with different methods using the
estimate_gps()
function. However, custom approaches can
also be applied within the vector-matching workflow. The GPS-matrix used
in subsequent analyses must have the exact same structure as the output
of estimate_gps
, containing the treatment variable and the
treatment assignment probabilities for each level of the treatment
variable. Importantly, the probabilities for each row in the GPS-matrix
must sum to 1. The resulting data.frame
has to be of class
gps
.
In the following example, we will use a simple formula with
age
and sex
as predictors, allowing for an
interaction effect. The method used to estimate the generalized
propensity scores (GPS) will be a multinomial logistic regression, which
can handle multiple levels of the treatment variable. The code for
estimating the GPS using the multinomial logistic regression model is as
follows:
formula_cancer <- status ~ age * sex
gps_matrix <- estimate_gps(formula_cancer,
cancer,
method = 'multinom',
reference = 'control')
head(gps_matrix, 7)
#> treatment control adenoma crc_beningn crc_malignant
#> 1 control 0.3050838 0.3254835 0.2074179 0.1620149
#> 2 control 0.2367189 0.3264350 0.2217444 0.2151016
#> 3 control 0.2242229 0.3238480 0.2290992 0.2228299
#> 4 control 0.2796174 0.2875377 0.2373547 0.1954901
#> 5 control 0.3200282 0.3335474 0.1772439 0.1691805
#> 6 control 0.3310855 0.3689677 0.1730250 0.1269219
#> 7 control 0.2647639 0.2668865 0.2534869 0.2148627
To remove observations that are not eligible for further matching,
the borders of the common support region (CSR) must be calculated. The
csregion()
function facilitates this process by taking the
gps_matrix
object as its sole argument. It computes the CSR
borders and automatically filters the input gps_matrix
to
include only the observations that fall within the calculated CSR
borders.
The function returns an object of class csr_matrix
,
which contains the filtered data along with several additional
attributes. These attributes enable users to manually filter the initial
gps_matrix
or access a summary of the CSR border
calculation as a data.frame
object. The
csregion()
function can be used as follows:
csr_matrix <- csregion(gps_matrix)
#>
#> Rectangular CSR Borders Evaluation
#> ==================================
#>
#> Treatment | Lower CSR limit | Upper CSR limit | Number excluded
#> --------------------------------------------------------------------
#> control | 0.1710155 | 0.3226629 | 14
#> adenoma | 0.2185790 | 0.3541989 | 12
#> crc_beningn | 0.1765998 | 0.2903614 | 15
#> crc_malignant | 0.1384517 | 0.2641617 | 13
#>
#> ===================================================
#> The total number of excluded observations is: 24
#> Note: You can view the summary of the CSR calculation using the `attr()` function.
We can then compare the dimensions of the original
gps_matrix
and the filtered csr_matrix
.
Clearly, 24 observations have been filtered out of the original matrix.
Next, as recommended by Lopez and Gutman (2017), the GPS can be recalculated using only the data within the CSR. However, in our experience, this additional step does not result in a significant improvement in matching quality, especially if the number of observations that fall outside of the CSR is relatively small in comparison to the whole dataset size. Therefore, we decided to bypass this step and proceed with matching directly on the filtered dataset.
The most crucial step of the vector matching algorithm is the
k-means clustering and subsequent matching within the resulting
clusters. This approach ensures that only observations with similar GPS
vectors are matched, thereby justifying the name vector
matching. The core components of this step are the
stats::kmeans()
function ( Core Team 2024), which performs
clustering, and the Matching::Matchby()
or
optmatch::fullmatch()
functions Hansen and Klopfer (2006), which match observations
within each k-means cluster. These functionalities are combined
into a single function named match_gps()
.
The match_gps()
function takes the
csr_matrix
as input and returns a matched dataset. The
matching process can be customized using various arguments, and it is
advisable to experiment with different parameter combinations to
optimize both matching quality and the number of matched samples (see
appendix \(\ref{app:practical}\)). For
the replicability of results, it is also crucial to set the random
number generator seed before running the clustering and matching
procedures. This ensures that the clustering assignments and matches are
consistent across different runs of the analysis, allowing others to
reproduce the results exactly.
The match_gps()
function can be used as follows:
set.seed(164373)
matched_cancer <- match_gps(csr_matrix,
caliper = 0.21,
kmeans_cluster = 2,
reference = 'control',
replace = TRUE,
method = 'fullopt',
order = 'desc')
#> Warning: Following specified arguments are not allowed for the method "fullopt"
#> and will be ignored: replace.
#> Warning: The `min_controls` argument for the method "fullopt" was not provided
#> and will default to `0`.
#> Warning: The `max_controls` argument for the method "fullopt" was not provided
#> and will default to `Inf`.
The vecmatch
package provides a dedicated function,
balqual()
, to assess the quality of post-matching results.
This function compares various metrics and statistical summaries between
the pre- and post-matching datasets, using a user-defined formula to
evaluate balance. Its usage is straightforward and requires only the
matched dataset (produced by match_gps()
) and the formula
used for the calculation of the GPS:
balqual(matched_cancer,
formula_cancer,
type = 'smd',
statistic = 'max',
round = 4)
#>
#> Matching Quality Evaluation
#> ================================================================================
#>
#> Count table for the treatment variable:
#> --------------------------------------------------
#> Treatment | Before | After
#> --------------------------------------------------
#> adenoma | 370 | 288
#> control | 312 | 264
#> crc_beningn | 271 | 215
#> crc_malignant | 247 | 203
#> --------------------------------------------------
#>
#>
#> Matching summary statistics:
#> ----------------------------------------
#> Total n before matching: 1200
#> Total n after matching: 970
#> % of matched observations: 80.83 %
#> Total maximal SMD value: 0.1065
#>
#>
#> Maximal values :
#> --------------------------------------------------------------------------------
#> Variable | Coef | Before | After | Quality
#> --------------------------------------------------------------------------------
#> age | SMD | 0.2230 | 0.0915 | Balanced
#> sexF | SMD | 0.1429 | 0.0941 | Balanced
#> sexM | SMD | 0.1429 | 0.0941 | Balanced
#> age:sexF | SMD | 0.1454 | 0.0815 | Balanced
#> age:sexM | SMD | 0.1544 | 0.1065 | Not Balanced
#> --------------------------------------------------------------------------------
After assessing the quality with balqual()
, we can
combine both matched and unmatched datasets into a single data frame to
visualize age and sex using raincloud()
and
mosaic()
:
matched_cancer$dataset <- 'matched'
unmatched_cancer <- cancer
unmatched_cancer$dataset <- 'unmatched'
data_full <- rbind(matched_cancer, unmatched_cancer)
raincloud(data_full,
age,
status,
dataset,
significance = "t_test",
sig_label_color = TRUE,
sig_label_size = 3,
limits = c(10, 120)) +
scale_y_continuous(breaks = seq(10, 100, 10))
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.